Observability Best Practices in 2023 - Ultimate Guide

If you have applications running in a live production environment, you probably have a suite of monitoring and observability tools, so that you can understand and fix issues when they occur.

However, implementing good observability is still much more of an art than a science, with scores of vendors offering different points of view on what you need.

How does one cut through the noise to figure out what one really needs, and how to get there in the simplest, cheapest way possible?

This is a practical guide to Observability in 2023 that talks about best practices at two levels -

  1. Planning the Observability stack. These are strategic stack questions that CTOs, devops, and platform teams are thinking about, such as -

    1. How much observability do you need, depending on your company stage?

    2. How much should you be spending on Observability?

    3. Do you select an integrated one-stop solution or go for a best-of-breed stack (e.g., different vendors for different parts of the stack)?

    4. How do you decide between open-source and commercial? What are the trade-offs?

  2. Implementation best practices. What are specific best practices around implementations of logs, metrics, and distributed tracing?

Let's dive in.

1. What & how much observability do you need for your stage?


Depending on your company stage, your monitoring and observability efforts look different. Below are some heuristics on when companies typically adopt different observability practices.

Logs first

Logs are the most basic form of observability. They are implemented right from day one, as soon as there's code, even before anything is running in production.

Add error & exception monitoring

Teams typically add an error & exception monitoring tool like Sentry soon after logs, usually before they have a production environment.

Once production systems are up, add basic metrics

When systems begin to handle real user traffic, companies introduce metrics. Most companies start with infrastructure monitoring and basic alerts.

Application monitoring/ APM comes in much later when application latency becomes a recurring issue in production.

Once there are >25-30 services, introduce distributed tracing

At about this scale (25-30 services), companies typically add distributed tracing so they can track requests as they flow through several endpoints. Historically, companies would put off tracing until later, but today modern companies prefer to implement tracing earlier in the journey. This is partly because OpenTelemetry (OTel) auto-instrumentation has made implementation a lot easier.

It's also smarter to do tracing sooner rather than later, as implementation effort increases linearly with the number of services.

On-call and SLO systems

Once companies reach a certain scale and engineers are repeatedly getting woken up at odd hours by production issues, on-call systems come into place. Eventually, at very large scale, this evolves into more sophisticated SLO systems.

Profiling/more sophisticated observability systems based on triggers

More advanced observability tools are typically adopted upon specific triggers and use cases (e.g., continuous profiling, end-user monitoring).

Note that the above is the typical adoption pattern, and what one can consider the minimum observability requirements at each stage. If your company has a low margin for error in production (e.g., you run transactional systems), you'll likely want more observability instrumented earlier in the journey.

2. How much should you be spending on Observability?

Observability has been getting expensive over time, and taming observability costs is a top-2 priority for nearly every devops/ platform team today.

There's a really broad spectrum in terms of how much people spend on Observability (as much as $65M annually!). However, here are some heuristics based on our conversations with 100+ companies -

  1. If you're a mid-sized company, on average observability should be 15-20% of cloud spend

  2. If you're a large enterprise, observability will likely be at 20-25% of cloud spend.

See this article on what the benchmarks are and why.

Observability costs estimate
Source: https://twitter.com/mipsytipsy/status/1667275388178042881?s=20

Note that these are the estimated averages, and there will be wide variations. That said, these ranges are good indicators to keep in mind while planning/ benchmarking your overall observability budget.
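To make the benchmark concrete, here is a minimal Python sketch that turns an annual cloud bill into a benchmark budget range. The percentages come from the heuristics above; the function name, the company-size labels, and the $2M example figure are illustrative assumptions, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetRange:
    low: float
    high: float


# Share of cloud spend per company size, taken from the heuristics above.
# The dictionary keys are hypothetical labels chosen for this sketch.
BENCHMARKS = {
    "mid_sized": (0.15, 0.20),
    "large_enterprise": (0.20, 0.25),
}


def observability_budget(annual_cloud_spend: float, company_size: str) -> BudgetRange:
    """Return the benchmark observability budget range for a given cloud spend."""
    low_share, high_share = BENCHMARKS[company_size]
    return BudgetRange(annual_cloud_spend * low_share, annual_cloud_spend * high_share)


if __name__ == "__main__":
    # Hypothetical company spending $2M a year on cloud.
    budget = observability_budget(2_000_000, "mid_sized")
    print(f"Benchmark range: ${budget.low:,.0f} to ${budget.high:,.0f}")
```

Treat the result as a starting point for the conversation rather than a target, since real spend varies widely around these averages.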

How much your observability spend varies from the average will depend on whether you -

  1. Have a cloud-native, distributed architecture (higher than avg.)

  2. Have a higher need for visibility into production for industry/ compliance reasons (higher than avg.)

  3. Have engineering teams in relatively lower-cost locations (e.g., India) and can implement open-source solutions (lower than avg.)

  4. Have a streamlined observability stack - e.g., you have a strong centralized platform team that has narrowed the stack down to fewer tools (lower than avg.)

How does the overall observability spend split across different segments like logs, metrics, etc.? Based on our conversations with >100 companies implementing observability, here's what the typical mix looks like today:

  • 40-60% on logs

  • 30-40% on distributed tracing

  • 10-30% on metrics.

In some sophisticated, digital-first companies, we see distributed tracing and logs swap places on spend share - tracing takes the higher share (60%), with logs at 30% and metrics at 10%.

This is because as companies deploy more sophisticated distributed tracing, they tend to put more context into their traces and reduce the volume of logs they capture (e.g., only capture logs at critical level and above).
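One way to use the typical mix is as a sanity check on your own bill. The sketch below flags any pillar whose share of total spend falls outside the ranges quoted above; the function name, the pillar keys, and the example numbers are assumptions made for illustration.

```python
from __future__ import annotations

# Typical share of observability spend per pillar (low, high), as quoted above.
TYPICAL_MIX = {
    "logs": (0.40, 0.60),
    "distributed_tracing": (0.30, 0.40),
    "metrics": (0.10, 0.30),
}


def spend_outliers(actual_spend: dict[str, float]) -> dict[str, str]:
    """Flag pillars whose share of total spend is outside the typical range."""
    total = sum(actual_spend.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    flags = {}
    for pillar, (low, high) in TYPICAL_MIX.items():
        share = actual_spend.get(pillar, 0.0) / total
        if share < low:
            flags[pillar] = f"below typical range ({share:.0%} < {low:.0%})"
        elif share > high:
            flags[pillar] = f"above typical range ({share:.0%} > {high:.0%})"
    return flags


if __name__ == "__main__":
    # Hypothetical annual spend per pillar, in dollars.
    example = {"logs": 700_000, "distributed_tracing": 200_000, "metrics": 100_000}
    for pillar, note in spend_outliers(example).items():
        print(f"{pillar}: {note}")
```

A flag is not necessarily a problem. As noted above, tracing-heavy companies deliberately invert the logs and tracing shares.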

3. Unified Platform vs. Best-of-breed

This is a prominent question in any observability strategy - do you use a single vendor across everything, or do you go for a best-of-breed strategy (the best vendor for each module)?

Most large commercial vendors will tell you why it is important to have a "single pane of glass" (metrics, logs, distributed traces, and everything else in the same platform). Keep in mind that only a select few vendors have that offering (e.g., New Relic and Datadog) - and all of them are very expensive.


Integrated Observability is expensive
Source: https://twitter.com/PierreDeWulf/status/1679543895686782977

The reality on the ground, though, is that the average company has 5-7 observability tools.

This is because committing to one integrated vendor leads to vendor lock-in, price lock-in, and limited flexibility, which most companies try to avoid, especially given how expensive observability has gotten in the last 5 years.

The other reason for a fragmented observability stack is structural, and follows from the natural adoption pattern - companies start with one module and add other observability modules at different times, and each is often introduced by a different team (the core engineering team at the start, the DevOps team later, etc.).

So, how should you choose? Let's examine the pros and cons of each model -

Unified Platform

Benefits:

- Easier troubleshooting experience for developers as all data is in one place

- Only one vendor to manage

- Don't have to train developers on different platforms

Considerations:

- High TCO - Most integrated solutions are very expensive; you'd have to keep a close watch on data volumes to ensure you don't get a surprise 3x bill because you had a spiky month. This is the biggest reason the unified model has become unviable.

- Vendor lock-in

- Oddly, the single-pane-of-glass doesn't wow as much as you expect it to. The integrated platforms are still built as distinct modules for easy selling/ pricing, and navigating across all the dashboards for troubleshooting continues to be complex. So companies don't always see MTTRs dropping dramatically after moving to an integrated solution.

- The overall product will likely have some gaps. Most vendors are really good at some parts of the stack but not as good at others, as is to be expected. For example, Datadog is great at monitoring, but they wouldn't be the first vendor you'd choose for logging.

Best-of-breed observability tooling

In this model, you select different vendors for different parts of the stack - e.g., Prometheus for metrics, Sentry for error monitoring, ELK for logs, and Tempo for tracing.

Benefits:

- No vendor lock-in

- More cost-effective as you'll likely end up using at least 1-2 open-source solutions

- Can probably collect more data as well because your costs are under control

- Best-in-class product for each module

Considerations:

- Fragmented troubleshooting experience - developers have to navigate across several tools, which is a pain

- Extra engineering overhead for managing multiple tools and vendors.

So how to decide?

Best of both worlds

Note that there's an emerging class of AI platforms that will likely eliminate this trade-off altogether.

They integrate with your existing observability stack, correlate data across sources to create a single-pane-of-glass across your entire system, analyze data, and use AI to automatically root-cause production issues. ZeroK.ai is one such solution.

The advantage of these solutions is that you can get to an integrated single-pane experience, plus AI-based insights, all without changing anything in your existing stack. This is an emerging, separate layer of "inferencing" solutions that sits on top of the observability stack.

Note that most of these solutions are quite early in their journey, but they are something to keep an eye on and try out.


4. Open-source vs. Commercial solutions

Another important question around observability is whether to implement an open-source solution or to buy a commercial solution from a vendor.

In general, most companies prefer to buy a commercial solution, as this is not core to their business.

However, given how expensive observability has gotten in recent years, more and more companies are choosing open-source so that they can collect and persist more data. In parallel, open-source offerings like Prometheus and Grafana have matured considerably and are a real option. In one example, Grammarly improved monitoring 10x by moving to VictoriaMetrics.

Let's look at the pros and cons of open-source and commercial options.

Commercial tools

Pros

- Better product experience for developers compared to open-source

- No operational overhead for the engineering team maintaining a non-core solution

Considerations

- Expensive - choice of vendor is critical

- Keeping an eye on volumes and spend - teams need to watch the volumes sent to observability vendors to avoid surprises

- Hard to estimate TCO - e.g., egress costs for sending data to the observability vendor show up in the cloud bill, but it's usually unclear how much of the bill they account for.

Open-source tools

Pros

- Less expensive

- More flexibility in terms of product implementation - can customize to needs

- Good-enough support - open-source monitoring and observability tools like Grafana and Prometheus now have wide adoption and a strong community behind them


Considerations

- Need to expend meaningful developer bandwidth - there's no getting around this. Especially as systems scale, open-source solutions take more effort to maintain.

- The fully loaded costs (incl. infrastructure to run, hiring engineers specifically for this, etc.) might come out to just 10-20% lower than buying a commercial solution

- Product experience is usually poorer - especially at scale. Things are often slow and break frequently.

This choice is entirely a function of the engineering team's operating model and mindset - where they lie on the build-vs-buy scale.

We looked at some common trade-offs around observability that teams make and examined the choices. While the right choice is individual to the company, our effort has been to offer some simple heuristics and clarity on when to choose what.

Implementation practices

Now let us look at some best practices in implementing each of the Observability pillars.

Unlike the strategic questions, implementation best practices are widely applicable and can be simple checklists that can be picked up and executed in most places.

We will look at best practices across the common Observability modules - Distributed Tracing, Logging, and Monitoring.

5. Distributed Tracing best practices

For most platform teams, tracing is typically one of the harder things to implement and get right, given the integration effort and the choices around sampling and data storage, all of which impact the value the organization gets from tracing.

Below is a list of practical guidelines in 2023 for a robust tracing implementation (for a more detailed guide on tracing best practices, see here)

  1. Pick OTel for instrumentation - OTel has matured enough for this to be an easy choice

  2. Leverage OTel's automatic instrumentation where possible - and avoid code changes

  3. Start with critical paths (most frequently accessed) and expand from there

  4. Be intentional about sampling - the most common approach is head-based probabilistic sampling, with a sampling rate anywhere between 0.1% and 3%

  5. Selectively implement custom tracing (reporting custom spans)

  6. Make sure to integrate with monitoring and logging (e.g., print TraceID in logs)

  7. Use a modern trace visualization front-end - most back-end developers find tracing hard to navigate otherwise

  8. Explore next-generation tools that combine AI and tracing (e.g., ZeroK) - they eliminate most of the implementation complexity (1-click install; no sampling), while generating AI-based inferences on root causes of issues

  9. Invest in developer onboarding - tracing is a complex product to use, and onboarding is important to drive adoption & usage. Limited developer use is a common risk with tracing.
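To illustrate the sampling guideline in the list above, here is a minimal, standard-library-only sketch of a head-based probabilistic sampling decision. It is not the OpenTelemetry API (OTel SDKs ship their own ratio-based samplers); the function names and the 1% default are assumptions chosen for this example.

```python
import secrets

# Assumed default: 1%, which sits inside the 0.1-3% range mentioned above.
DEFAULT_SAMPLE_RATE = 0.01
_MASK_64 = (1 << 64) - 1


def new_trace_id() -> int:
    """Generate a random 128-bit trace ID."""
    return secrets.randbits(128)


def should_sample(trace_id: int, sample_rate: float = DEFAULT_SAMPLE_RATE) -> bool:
    """Head-based decision, made once when the trace starts.

    The verdict depends only on the trace ID, so every service that sees the
    same ID reaches the same answer and a trace is kept or dropped as a whole.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    # Keep the trace if the low 64 bits of its ID fall below the threshold.
    threshold = round(sample_rate * (1 << 64))
    return (trace_id & _MASK_64) < threshold


if __name__ == "__main__":
    trace_ids = [new_trace_id() for _ in range(100_000)]
    kept = sum(should_sample(t) for t in trace_ids)
    print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is made up front, the cost saving is predictable, but rare errors can fall into the unsampled majority. That trade-off is why the sampling rate deserves a deliberate choice.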

6. Logging best practices

Logs have been around forever and most companies know how to do the basics well. The utility of logs extends beyond just production troubleshooting - they're used by security teams for detecting internal and external threats, for compliance audits, etc.

Today the considerations around logging are mostly about a) implementing structured logging, and b) logging less without losing important visibility.

Below are some best practices around logging -

  1. Decouple logging from stashing (shipping and storing the logs) - so your app can focus on core functionality, and you can centralize log ingestion and analysis.

  2. Ensure there's no sensitive information in logs. If there is sensitive information, make sure you tokenize/ redact/ mask the data. Most commercial logging providers have out-of-the-box tooling to enable this.

  3. Use structured logging (e.g., JSON) and create a standardized logging schema for app logs. You can use a logging framework that has structured logging capabilities (e.g., Monolog).

  4. Use decorators for exception logging - this allows developers to add more context to their logs while working with an existing log structure.

  5. Add TraceIDs to logs

  6. Only store logs related to critical events - you can discard most of the rest quickly

  7. Evaluate your log retention policy once a year. Your log retention policy is a function of your industry (e.g., financial services), the compliance requirements you need to adhere to (e.g., HIPAA), and business needs (Infosec, auditing, and engineering team requirements). Within these constraints, there's usually room to optimize the log levels you store or the type of storage (e.g., cold storage).

  8. Ensure you're using the RFC 5424 logging levels effectively (image below). Most logging frameworks support this out-of-the-box

Logging levels
RFC 5424 severity levels for logs
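Several of the practices above (structured JSON, a standard schema, TraceIDs, masking, and RFC 5424 severities) can be combined in one small formatter. The sketch below uses only Python's standard `logging` module. The schema keys, the list of sensitive field names, and the `trace_id` and `fields` attributes passed through `extra` are assumptions for illustration, not a standard.

```python
import json
import logging

# RFC 5424 severity codes (0 is most severe) for Python's standard levels.
# Python has no built-in Emergency (0), Alert (1), or Notice (5) level.
RFC5424_SEVERITY = {
    logging.CRITICAL: 2,
    logging.ERROR: 3,
    logging.WARNING: 4,
    logging.INFO: 6,
    logging.DEBUG: 7,
}

# Hypothetical field names to mask before a record leaves the process.
SENSITIVE_FIELDS = {"password", "card_number", "ssn"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        fields = getattr(record, "fields", {})
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # Custom levels without a mapping fall back to Debug (7).
            "severity": RFC5424_SEVERITY.get(record.levelno, 7),
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
            "fields": {
                key: "***" if key in SENSITIVE_FIELDS else value
                for key, value in fields.items()
            },
        }
        return json.dumps(payload, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("checkout")  # hypothetical service name
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.warning(
        "payment retry",
        extra={
            "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
            "fields": {"order_id": 42, "card_number": "4111111111111111"},
        },
    )
```

Key-based masking like this only catches fields you already know about. Sensitive values embedded in free-text messages still need the redaction tooling mentioned in the list.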

For a detailed read on logging best practices, see here.
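The decorator practice from the list above can look like the following sketch, again using only the standard library. The decorator name, the `context` attribute on the log record, and the `charge` function are hypothetical and exist only to show the shape of the pattern.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def log_exceptions(**context):
    """Decorator factory: log any exception with extra context, then re-raise."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception logs at ERROR level and attaches the traceback.
                logger.exception(
                    "Unhandled exception in %s",
                    func.__qualname__,
                    extra={"context": context},
                )
                raise

        return wrapper

    return decorator


@log_exceptions(component="billing", operation="charge")  # hypothetical labels
def charge(amount_cents: int) -> int:
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return amount_cents
```

Re-raising after logging keeps the caller's error handling unchanged, so the decorator adds context without altering behavior.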

7. Monitoring best practices

Monitoring tools are aplenty, and most of them come with built-in features and recommendations on how to set up good-quality monitoring. A few things to keep in mind while implementing monitoring tools -

  1. You don't need to monitor everything - this is a counterintuitive view, but only use custom monitoring data for critical components

  2. Only set up alerts for critical events - alert fatigue is an all-too-common problem. Often you may need to go back and reconfigure your alert thresholds

  3. Define SLOs and set up on-call around SLOs for a more streamlined response to incidents

  4. Set up custom dashboards where relevant - you may not need custom metrics, but often you need custom dashboards/ views on existing metrics for a better experience

  5. Make sure you have a scalable back-end - this is especially relevant if you are implementing open-source options like Prometheus, which have well-known scaling issues.
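For the SLO and alerting items in the list above, the underlying arithmetic is simple enough to sketch. The example below computes an error budget and a burn rate, and pages only when the burn rate is high. The function names and the 14.4 paging threshold are assumptions for illustration (at that rate, 2% of a 30-day budget is consumed in one hour), so tune them to your own SLOs.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo_target)


def should_page(
    observed_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed threshold, see the note above
) -> bool:
    """Page on-call only when the budget is burning fast enough to matter."""
    return burn_rate(observed_error_ratio, slo_target) >= threshold


if __name__ == "__main__":
    slo = 0.999  # hypothetical 99.9% availability target
    print(f"30-day error budget: {error_budget_minutes(slo):.1f} minutes")
    print(f"burn rate at 0.5% errors: {burn_rate(0.005, slo):.1f}x")
    print(f"page at 2% errors? {should_page(0.02, slo)}")
```

Alerting on burn rate rather than on raw error counts is one way to keep pages limited to critical events, since short blips that barely dent the budget never cross the threshold.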

Summary

In summary, we looked at common monitoring and observability practices. We looked at observability best practices at two levels - planning the observability stack, and more granular implementation practices. This is a journey of continuous improvement and even implementing some of it would help improve developer experience and keep costs under control.

We also looked at emerging AI plays in this space - e.g., inferencing solutions that use observability data to help root-cause production issues. We can expect a lot more AI solutions to come up in this space, given that the biggest problem is the volume of data to be handled. Things like automated on-call systems and automated remediation systems all look possible from where we stand today. Stay tuned for more!
