In this blog, we will explore the key pillars of observability - logs, metrics, and traces - and answer the following questions:
Why do we need each of them, and what are their limitations?
At what stage of a company's growth does each of them become relevant?
How does AI impact this construct now?
Understanding Logs, Metrics and Traces
The goal of observability is to give us visibility into our systems so we can understand their behaviour. In a modern observability stack, three types of telemetry data are widely considered fundamental - metrics, logs, and traces.
Let us take a quick look at each of them.
Logs are the ‘what’ - Logs give us records & information around events that occurred within a system.
Metrics are the ‘how much’ - Metrics give us quantitative measures of system behavior and events.
Traces are the ‘where’ - Traces track a request through a system, mapping out the path and interactions across multiple services.
These data types didn't come into existence by accident. The need for them has grown with the evolution of distributed software environments with tens of microservices.
Logs: The What?
Logs are the most common among the three. They've existed since the beginning of computing systems and continue to be a primary reference point whenever a problem arises.
Logs are event-based - they record events as they occur, and capture additional metadata about the event. Logs are super simple to generate - they are just text printed into a log file - and developers have flexibility on what level of detail they want to print about each event, so logs often end up being highly granular.
As a result, logs are incredibly versatile and are the most common telemetry data set used for any detailed examination into system occurrences - all kinds of investigations, troubleshooting and debugging, security audits etc.
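To make this concrete, here is a minimal sketch of event-based logging using Python's standard `logging` module. The service name and event fields are hypothetical, chosen only for illustration:

```python
import logging

# Each log line records an event plus whatever metadata the developer
# chooses to include - timestamp, severity, service name, free-form text.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger("checkout-service")  # hypothetical service name

logger.info("order placed order_id=%s amount=%.2f", "o-123", 49.99)
logger.error("payment failed order_id=%s reason=%s", "o-124", "card_declined")
```

Because the format is free text, the same mechanism can capture anything from a one-line status update to a full stack trace - which is exactly what makes logs both versatile and, at scale, hard to parse.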
However, logs have some limitations. Issues with logs typically arise under two conditions - a) scale, and b) distributed systems.
Limitations of Logs
Logs are typically unstructured text and become unwieldy to use at scale. Imagine eyeballing thousands of log lines to understand what happened.
In distributed systems, logs get broken up. Several different application components print their own log files independently, and stitching them together to understand what actually happened is very hard.
Logs also grow quickly in volume and often become highly expensive to process and store.
Metrics: The How Much?
While logs are detailed text files that give us insights into specific events, we still need a system to track overall system health and performance over time, and alert us when something is off. This is where metrics come in handy.
Metrics capture quantifiable characteristics of a system - such as throughput, CPU utilization, or memory consumption - over a period of time. They are the statistics of the different systems, stored as time-series data. Because running queries on time-series databases is much faster than querying log data, metrics are an effective data structure to set up alerts around, as we need alerts to let us know as soon as possible when something unexpected is happening.
Anatomy of a Metric
A metric generally consists of the following fields - the name of the metric, the value, the timestamp, and any labels attached to provide context (e.g., the name of the service the metric belongs to).
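These fields can be sketched as a simple data structure. The metric name, value, and labels below are hypothetical examples, not from any particular metrics system:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Metric:
    """One data point in a time series: name, value, timestamp, and labels."""
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)
    labels: dict = field(default_factory=dict)

# A hypothetical throughput sample, labelled with the service it belongs to.
sample = Metric(
    name="http_requests_per_second",
    value=42.0,
    labels={"service": "checkout-service", "region": "us-east-1"},
)
```

Time-series databases store millions of such points and index them by name and labels, which is what makes querying them so much faster than scanning raw log text.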
While metrics are an effective monitoring mechanism and help us identify statistical anomalies, they have their limitations too.
Limitations of metrics
The main limitation of metrics is that they do not help us understand why an issue or anomaly is occurring. For example, they just tell us throughput for our service is now zero. To understand why, we often need more information - from either logs or traces.
Traces: The Where?
In distributed systems, a single request passes through multiple components, and those components interact with each other in many different ways. In these modern software systems, metrics and logs fail to provide sufficient visibility. Neither metrics nor logs tell us anything about how the components interact with each other: metrics capture statistics about each individual component, and logs only tell us what is happening within each individual component.
Enter distributed traces.
A distributed trace provides a view of the request from its point of origin through all the services it interacts with, right to the final output. Distributed traces provide insight into how services interact with each other, which gives us a more holistic view of our systems.
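The core mechanism is simple to sketch: every span in a request carries the same trace ID, and each span records which span called it. This is a hand-rolled illustration, not the API of any real tracing library; the service and operation names are made up:

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One unit of work in a trace: which service did what, and for whom."""
    trace_id: str             # shared by every span in the same request
    span_id: str
    parent_id: Optional[str]  # links the span back to its caller
    service: str
    operation: str

def start_span(service: str, operation: str, parent: Optional[Span] = None) -> Span:
    # Propagating the parent's trace_id is what stitches one request's
    # spans together across service boundaries.
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    parent_id = parent.span_id if parent else None
    return Span(trace_id, uuid.uuid4().hex, parent_id, service, operation)

# A request entering the gateway and fanning out to two downstream services.
root = start_span("api-gateway", "POST /checkout")
payment = start_span("payment-service", "charge_card", parent=root)
inventory = start_span("inventory-service", "reserve_items", parent=root)
```

A tracing backend collects all spans that share a trace ID and reassembles them into the request's end-to-end path - the holistic view that logs and metrics alone cannot give.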
Limitations of Traces
Distributed tracing is harder to implement than logs and metrics. Tracing needs to be implemented across a chain of services together so the trace doesn't break, and this requires more organizational alignment.
The size of a distributed trace can be high, so companies often end up sampling a selection of traces for storage, which limits the utility of traces. Configuring the right sampling strategy is a complex undertaking.
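As an illustration of the simplest such strategy - head-based sampling at a fixed rate - here is a small sketch; real tracing backends offer far more sophisticated options (e.g., tail-based sampling that keeps every trace containing an error):

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.1) -> bool:
    """Keep roughly `rate` of all traces.

    Hashing the trace_id (instead of rolling a die per span) makes the
    decision deterministic, so every service in the request chain makes
    the same keep-or-drop choice for the same trace."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

# Roughly 10% of these synthetic traces are kept; the rest are dropped
# before they are ever stored.
kept = sum(head_sample(f"trace-{i}", rate=0.1) for i in range(10_000))
```

The trade-off is plain from the sketch: any trace dropped at the head is gone forever, even if it later turns out to contain the one failing request you needed to debug.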
These limitations have somewhat slowed the adoption of distributed tracing although it seems to be accelerating now. Read this for an assessment of the pros and cons of distributed tracing and what the future holds.
There has been a push in some companies to replace logs with traces instead. The thinking is that traces with more details appended are just structured logs, with better visibility into interactions. While it is early yet, we are likely to see log volumes reduce over time and trace volumes go up - not because we stop logging entirely, but because logs get attached to traces, and that information might move from logging systems to tracing platforms. Refer to this blog - Can Distributed Tracing Replace Logging? - for an exploration of the topic.
Given that metrics, logs, and traces each have their own utility and limitations, it has become common practice to use all three of them together to complement each other - hence they are referred to as the three pillars of observability.
Other Observability data types
In addition to these, there are several other types of telemetry data we collect that feed into a modern observability stack. These components enable more detailed insights into different aspects of system behaviour and supplement the data provided by pure-play logs, metrics, and traces.
Error Monitoring and Exception Tracking
This involves capturing, recording, and analyzing application errors (exceptions, HTTP errors with status codes of 400 and above) in real time. It helps developers quickly identify, diagnose, and fix issues.
Profiling
Profiling tools are used to monitor aspects of system performance, like CPU utilization, memory consumption, and network activity. Profiling data can help us identify bottlenecks or inefficiencies in our code.
Events
"Events" are quite like logs but are generally used to observe specific, individual occurrences within a system that are of significance, like when a user logs in or when a transaction is completed. Although these event details are frequently included in logs, when we talk about "events" as a distinct category, it usually means we're focusing on the occurrences that have direct relevance to business activities.
Security Monitoring
This involves tracking and analyzing system events and logs for potential security threats. The mechanism here is just like regular monitoring, but the specific metrics being tracked are more related to security. Security monitoring tools can identify patterns and signals that indicate a security breach or attack.
When to implement what observability method?
Deciding when to introduce each component of observability - logs, metrics, traces, and other elements like error monitoring and profiling - largely depends on the organization's stage, system complexity, and the specific issues being faced.
Nevertheless, the most common adoption journey is as follows.
Start with logs
The most basic form of observability, logs are implemented right from the start, from day one. Even in staging, log data provides visibility into code and helps debug issues.
Add error & exception monitoring
Teams usually adopt an error & exception monitoring tool right at the beginning too, even before they have systems running in production. These tools help software engineers surface and debug code errors faster.
Once production systems are up, add basic metrics
Once the production system is up and begins to handle real user traffic, we’ll need a way to track system performance in real time. This is the right time to introduce metrics. Metrics can be used to set up alerts, to perform capacity and resource planning, and to track other KPIs of the system.
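A typical alert over metric samples can be sketched in a few lines. The latency numbers and threshold below are made up for illustration:

```python
def should_alert(samples, threshold, min_breaches=3):
    """Fire only when the newest `min_breaches` samples all exceed the
    threshold, which avoids paging on a single noisy data point."""
    recent = samples[-min_breaches:]
    return len(recent) == min_breaches and all(v > threshold for v in recent)

# p95 latency samples in milliseconds for a hypothetical service, newest last.
latency_ms = [120, 130, 125, 510, 540, 560]
should_alert(latency_ms, threshold=500)  # True - the last three all breach
```

Production alerting systems (Prometheus, CloudWatch, and the like) layer durations, severities, and routing on top, but the underlying evaluation is this same threshold check over a time series.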
Once system scales beyond 25-30 services, introduce tracing
As the system grows in complexity - say, you've moved from a monolith to a distributed system with 30+ microservices - it becomes important to have tracing. It is also helpful to introduce tracing sooner rather than later, as it becomes harder to implement with a higher number of services.
Profiling/Others - add based on need
Other components like profiling, which provides insights into the runtime behaviour of our system, become relevant at larger scales or when specific problems need to be addressed; they are typically a more niche use case.
Challenges with Observability today
There are several types of telemetry data easily available today. As our software systems get more and more complex, we collect new types of data, and more of it. Today, a team running a modern software system faces several challenges around observability -
Observability data volumes (and associated costs) are continuously increasing as we collect more and more data.
The observability stack is fragmented in most companies - metrics, traces, and logs often live in different platforms and are hard to connect together and make sense of.
It is getting harder for developers to manually look through multiple dashboards showing tons of data and make sense of it all.
This article explores some of the common challenges around Observability and what could come next.
So what does the future look like?
The Future of Observability - all AI led
Just like most digital industries, Observability will also see significant shifts due to the tremendous developments in AI over the last few months. So far, observability and monitoring have been all about collecting and storing different types of data.
I'd argue the next step that is emerging (supported by AI) is inferencing - where an AI platform looks at the different systems and can reasonably explain why an error occurred, so we can fix it.
Imagine a solution that:
Automatically surfaces just the errors that need immediate developer attention.
Tells the developer exactly what is causing the issue and where the issue is: this pod, this server, this code path, this line of code, for this type of request.
Guides the developer on how to fix it.
Uses the developer's actual actions to improve its recommendations continuously.
There is some early activity in this space, including companies like ZeroK, but this is an open space as yet and we can expect several new companies to emerge here over the next couple of years.
For a comprehensive read on what the next generation of Inferencing solutions will look like, including their architecture, read this.
In this article, we took a detailed look at the three pillars of observability - logs, metrics, and traces - weighing their pros and cons and noting at what stage each of them is implemented. We also looked at some of the challenges of the observability stack (too many tools, too much data, too expensive, hard to process) and briefly explored how AI could change that.
For further readings on Observability and Distributed Tracing, check out https://www.zerok.ai/blog.
Further Reading and Resources
[O'Reilly, Chapter 4 - The Three Pillars of Observability](https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html)
Please share your experiences and thoughts on implementing observability in the comment section below.