Before we focus on distributed tracing, let’s pull back to look at microservices as a whole. For greenfield applications, microservices have become a default choice. They offer the kind of decoupling necessary to allow developers to fully transform systems and innovate at a far greater pace than previously. Microservices are simply regular distributed services at a larger scale; they therefore face the same problems as any distributed system, such as a lack of visibility into a business transaction across process boundaries.
Metrics, logs and traces make a system observable so that we can understand its state. However, in a distributed system, knowing the state of just one instance of a single service isn’t sufficient; we need to aggregate the metrics for all instances of a specific service, perhaps grouped by version, in a service like Prometheus. Likewise, logs cannot be analyzed practically on the individual instances of each service, so they need to be collected centrally with a tool such as Logstash, often coupled with a storage solution like Elasticsearch. Lastly, end-to-end traces are necessary to provide insight into the path a particular transaction has taken. This is where distributed tracing solutions are useful.
What is Distributed Tracing?
In a monolithic web application, logging frameworks offer sufficient capability to do a simple root-cause analysis when something fails. A developer simply needs to place log statements in the code. Information such as the “context” (for example, the “thread”) and the “timestamp” is added automatically to each log entry, making it possible to correlate the entries and understand how a given request was executed. This technique is decades old, but still remains at the heart of any modern tracing solution.
Distributed tracing, also known as request tracing, follows operations inside and across a range of systems to pinpoint where failures occur and what is behind poor performance. This gives engineers working in a microservices environment the full picture, so they can debug and monitor modern distributed software architectures and optimize their code. Questions can be asked, such as: Is this service the first in the call chain? What occurred at the inventory service, where we dispatched an event? As an application grows to over a dozen processes, begins to see increased concurrency, or develops non-trivial interactions between mobile and web clients and servers, distributed tracing can become essential.
Distributed tracing also differs from regular logging in that the data structure that holds tracing data is more specialized, meaning we can identify causality as well. Having a dedicated data structure lets distributed tracing record not just a message at a single point in time, but also the start and end time of a particular procedure.
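As a minimal sketch of this log-correlation technique, Python's standard logging module can stamp every entry with a timestamp and thread name automatically. The logger name, the format string and the handle_request function are illustrative assumptions, not part of any particular application.

```python
import io
import logging
import threading

# Capture log output in memory so the example is self-contained.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
# %(asctime)s and %(threadName)s are filled in by the logging framework,
# not by the developer who writes the log statement.
handler.setFormatter(logging.Formatter("%(asctime)s [%(threadName)s] %(message)s"))

logger = logging.getLogger("monolith_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(order_id: str) -> None:
    """Hypothetical request handler that logs two steps of its work."""
    logger.info("order %s: validating", order_id)
    logger.info("order %s: saved", order_id)


# Each request runs on its own named thread, as in a typical web server.
workers = [
    threading.Thread(target=handle_request, args=(f"o{i}",), name=f"worker-{i}")
    for i in range(2)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

log_lines = log_buffer.getvalue().splitlines()
# Correlate entries for one request by filtering on its thread name.
worker_0_lines = [line for line in log_lines if "[worker-0]" in line]
```

Filtering on the thread name recovers the entries of one request inside a single process. That context is lost as soon as the request crosses a process boundary, which is the gap distributed tracing fills.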
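To make the idea of a dedicated data structure concrete, here is a minimal sketch of a span record. The class and its field names are assumptions chosen for illustration; real tracers such as Zipkin and Jaeger define their own span models.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Illustrative span record; field names are assumptions, not a wire format."""
    trace_id: str                      # shared by every span in one request
    name: str                          # the operation being timed
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None    # links a span to its cause
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and points back at this span.
        return Span(trace_id=self.trace_id, name=name, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.time()

    @property
    def duration(self) -> Optional[float]:
        return None if self.end is None else self.end - self.start


root = Span(trace_id=uuid.uuid4().hex, name="checkout")
lookup = root.child("inventory.lookup")
lookup.finish()
root.finish()
```

The parent_id field is what encodes causality, and the start and end pair is what a plain log message lacks.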
Request tracing works with logs and metrics. A trace will tell you when a flow is broken or slow in addition to the latency of each step, but it won’t explain why. Logs can do this and metrics enable deeper analysis into system faults. Tracing, logs and metrics together form a complete telemetry solution. Monitoring is important not just for tracking binary “up” and “down” patterns, but also for offering a window into complex system behavior. By monitoring infrastructure over time, insights can be gleaned into performance, system health and behavioral patterns.
DevOps teams usually begin with logging and monitoring, and add tracing as needed. This is because a tracing solution requires custom work: engineering teams must instrument code, add tracing to infrastructure elements like load balancers, and deploy the actual tracing system. The right solution for each developer will also need to take into account language and library support, production operations, and the level of community support.
In this post, we will help you begin to make that decision by doing a deep dive into Jaeger and Zipkin, two of the most popular choices for distributed tracing.
Zipkin vs. Jaeger
Zipkin and Jaeger are both open source distributed tracing offerings.
Zipkin was one of the first systems of this kind. It was developed at Twitter based on a Google paper that described Dapper, Google’s own internally built distributed app debugger. The social media company describes the impetus behind Zipkin’s creation as a way to help it “gather timing data for all the disparate services involved in managing a request to the Twitter API”. Twitter open sourced Zipkin in June 2012 under the Apache License 2.0 (APLv2). At the time, the company described it as “a performance profiler, like Firebug, but tailored for a website backend instead of a browser”. It also listed some of the previously untapped performance optimizations Zipkin had enabled, including “removing memcache requests, rewriting slow MySQL SELECTs, and fixing incorrect service timeouts”, all of which helped “make Twitter faster”.
In 2016, Mike Gehard, senior software engineer at Pivotal Labs, told The New Stack, “The community is starting to standardize around Zipkin”. Uber and Airbnb also both use Zipkin.
Jaeger is a newer project, originally developed by Uber as an end-to-end distributed tracing solution, and like Zipkin it is now open source. Uber’s Observability team began building it in 2015 with “two engineers and two objectives: transform the existing prototype into a full-scale production system, and make distributed tracing available to and adopted by all Uber microservices”.
Today, Jaeger is supported by the Cloud Native Computing Foundation (CNCF) as an incubating project and it is maintained by a dedicated community. Jaeger implements the OpenTracing specification (more about that below). Jaeger includes elements to store, visualize and filter traces. Its overall architecture is similar to Zipkin’s. Instrumented systems send traces/events to the trace collector, which records the data and the relations between traces. The tracing system additionally provides a UI to inspect traces.
Language and Library Support
Both Zipkin and Jaeger support common languages, with a few exceptions (Python, Ruby and PHP most notably). There are unofficial clients for these, but they should be used with caution.
Jaeger documents its supported features across official clients, providing a feature matrix for its existing client libraries. Different clients support a variety of transports and protocols for sending data to the tracing backend.
Zipkin is similar: its documentation also lists support for its various features, along with instructions for instrumenting a library. The approach to support varies between the two, however. Zipkin supports popular frameworks among its official clients while letting the community figure out how to instrument smaller libraries such as database drivers. Jaeger, meanwhile, leverages OpenTracing instrumentation libraries, which allows the various opentracing-contrib projects to be used. These include instrumentation for various database libraries, gRPC, Thrift and the AWS SDK in some languages.
There are three options for trying out Zipkin locally: Java, Docker or running it from source. The Zipkin project recommends Docker for users who are familiar with it, as the existing Docker Zipkin project can build Docker images, provide scripts and launch pre-built images with a docker-compose.yml. The fastest start is to directly run the latest image.
As Jaeger is part of the CNCF, Kubernetes is the preferred deployment platform. Jaeger provides an official template for Kubernetes, which is useful given that Kubernetes is the de facto orchestrator for microservices infrastructure and deployment. This insightful Medium post from Masoor Hasan breaks down each Jaeger component and its Kubernetes deployment in some detail.
Zipkin offers no dedicated deployment documentation, whereas Jaeger documents deployment in some detail.
Both Jaeger and Zipkin are systems that have to be run and operated. Jaeger is a distributed system in itself, which necessitates monitoring its components and maintaining its data store. Both systems expose metrics for Prometheus, and maintenance of the data store can be offloaded through the use of a hosted Elasticsearch, a more accessible service than Cassandra. Teams can choose to run the data store themselves; however, in doing so, they must accept responsibility for maintaining a critical infrastructure component.
Zipkin was written in Java, and can use Cassandra or Elasticsearch as a scalable backend. The lowest supported version of Java is Java 6. It draws on the Thrift binary communication protocol, which is hosted as an Apache project and is popular in the Twitter stack.
Unlike Jaeger, Zipkin is a single process that includes reporters, collectors, storage, API and UI, which makes deployment easier. Data can be delivered to the collectors by three different means: HTTP, Kafka and Scribe.
Jaeger’s architecture is similar to Zipkin’s, including clients, collectors, a query service and a web UI, but it additionally has an agent on every host that locally aggregates the data. This agent receives data over a UDP connection, batches it and sends it to a collector. The data is stored in Elasticsearch or Cassandra. The query service is capable of directly accessing the data store and handing that information off to the web UI.
By default, Jaeger samples only 0.1% of all traces that pass through each client; however, this rate can be altered through configuration. Jaeger uses probabilistic sampling, which decides with a configured probability whether a trace should be sampled or not. The project is working on adaptive sampling, which will give the sampling algorithm additional context for making its decisions.
Jaeger was written in Go, meaning you don’t have to worry about having dependencies installed on the host or any interpreter or language virtual machine overhead. Like Zipkin, Jaeger supports Cassandra and Elasticsearch as scalable storage backends.
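As a rough sketch of the HTTP delivery path, the following builds a span in Zipkin's v2 JSON format and posts it to a collector. It assumes a Zipkin server listening on its default port 9411 on localhost; the service and span names are hypothetical, and a production setup would normally use an official client library rather than hand-built payloads.

```python
import json
import secrets
import time
import urllib.request

# Assumption: a local Zipkin server on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def build_span(service: str, name: str, duration_us: int) -> dict:
    """Build one span; timestamps and durations are in microseconds."""
    return {
        "traceId": secrets.token_hex(16),  # 32 lower-hex characters
        "id": secrets.token_hex(8),        # 16 lower-hex characters
        "name": name,
        "timestamp": int(time.time() * 1_000_000),
        "duration": duration_us,
        "localEndpoint": {"serviceName": service},
    }


def encode_spans(spans: list) -> bytes:
    """The collector accepts a JSON array of spans."""
    return json.dumps(spans).encode("utf-8")


def report(spans: list, url: str = ZIPKIN_URL) -> int:
    """POST the spans to the collector and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=encode_spans(spans),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


span = build_span("inventory", "get /items", duration_us=1500)
payload = encode_spans([span])
# report([span])  # uncomment when a Zipkin server is actually running
```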
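The batch-and-forward role of the agent can be illustrated with a toy sketch. This is not Jaeger's agent: the UDP socket is left out, JSON stands in for the real wire encoding, and the class name, batch size and forward callback are all hypothetical.

```python
import json
from typing import Callable, List


class BatchingAgent:
    """Toy agent: buffers incoming spans and forwards them in batches."""

    def __init__(self, forward: Callable[[List[dict]], None], batch_size: int = 3):
        self._forward = forward        # stands in for the call to a collector
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def receive(self, datagram: bytes) -> None:
        # In a real agent this payload would arrive on a UDP socket.
        self._buffer.append(json.loads(datagram))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Forward whatever is buffered, then start a fresh buffer.
        if self._buffer:
            self._forward(self._buffer)
            self._buffer = []


collector_batches: List[List[dict]] = []
agent = BatchingAgent(forward=collector_batches.append, batch_size=3)
for i in range(7):
    agent.receive(json.dumps({"span": i}).encode("utf-8"))
agent.flush()  # push the final partial batch

batch_sizes = [len(batch) for batch in collector_batches]
```

Batching on the host means each application only sends small local packets, while the collector sees fewer, larger requests.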
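One common way to implement probabilistic sampling is to compare the trace ID against a boundary derived from the sampling rate, so that every service reaches the same decision for the same trace. The sketch below illustrates that idea only; it is not Jaeger's code, and the class name and 63-bit ID space are assumptions.

```python
import secrets

# Assumption for this sketch: trace IDs are treated as 63-bit integers.
MAX_ID = (1 << 63) - 1


class ProbabilisticSampler:
    """Samples a trace when its ID falls below rate * MAX_ID."""

    def __init__(self, rate: float):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._boundary = int(rate * MAX_ID)

    def is_sampled(self, trace_id: int) -> bool:
        # The same trace ID always yields the same decision.
        return (trace_id & MAX_ID) < self._boundary


sampler = ProbabilisticSampler(rate=0.001)  # 0.1%, the default cited above
trace_id = secrets.randbits(63)             # random IDs make the outcome probabilistic
decision = sampler.is_sampled(trace_id)
```

Because the trace ID is generated randomly, roughly the configured fraction of traces falls below the boundary, and no coordination between services is needed.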
There is an active community around both Zipkin and Jaeger. Zipkin’s community has existed since the project was open sourced in 2012, whereas Jaeger’s first public release did not happen until 2017. For this reason, Zipkin has a larger community, as witnessed in the Gitter chatroom and GitHub star count. However, as Jaeger is a part of the CNCF, the project is framed as a piece within cloud-native architecture, meaning it works with containers and Kubernetes, and supports OpenTracing and the ecosystem around it. Zipkin, by contrast, springs from a pre-container world and is an isolated project, not part of a wider ecosystem in the same way as Jaeger. It all comes down to which type of community you prefer.
Both Zipkin and Jaeger are strong solutions as request tracers. The choice of which one to work with depends on what makes the best sense for you. What are the officially supported languages, libraries and frameworks? Zipkin initially seems to be the winner here, but Jaeger has more potential as it works with any OpenTracing instrumentation library. It comes down to what your own tech stack is, how much is instrumented already by the community, and how much, if any, you want to instrument yourself.
It is possible to combine the two and use elements of both as Jaeger is compatible with Zipkin’s API, meaning the Zipkin instrumentation libraries can be used with Jaeger’s collector.
Deployment is another area to consider. If you are using Kubernetes, Jaeger is the natural fit as a request tracer. If there is no existing container infrastructure, Zipkin is a better fit as there are fewer moving pieces involved. A hosted service could also be considered for the data layer. As there are no complete hosted Jaeger or Zipkin solutions, the exact best fit for your production system will need to be engineered.
There are a number of other distributed tracing systems out there as well, including New Relic and Appdash.
OpenTracing is useful to know about when implementing distributed tracing, as it makes it possible to instrument applications with only a minimal amount of effort. Jaeger, Zipkin and Appdash all support OpenTracing. It isn’t a download or a program; developers still need to add instrumentation to an application’s code or to the frameworks it uses. Nor is OpenTracing a standard; the OpenTracing API project “is working towards creating more standardized APIs and instrumentation for distributed tracing”. With OpenTracing, tracing can be set up in under 10 minutes. Because it standardizes instrumentation, the instrumentation can be written first and the choice of tracing implementation made later.
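The benefit of instrumenting first and choosing a tracer later can be sketched with a small vendor-neutral interface. The Tracer protocol below is a hypothetical stand-in written for this illustration; it is not the real OpenTracing API, and the tracer classes and lookup_inventory function are likewise assumptions.

```python
import time
from contextlib import contextmanager
from typing import ContextManager, Iterator, List, Protocol, Tuple


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface; NOT the real OpenTracing API."""

    def start_span(self, name: str) -> ContextManager[None]: ...


class NoopTracer:
    """Backend that records nothing, useful before a tracer is chosen."""

    @contextmanager
    def start_span(self, name: str) -> Iterator[None]:
        yield


class RecordingTracer:
    """Backend that keeps finished spans in memory as (name, seconds)."""

    def __init__(self) -> None:
        self.finished: List[Tuple[str, float]] = []

    @contextmanager
    def start_span(self, name: str) -> Iterator[None]:
        started = time.perf_counter()
        try:
            yield
        finally:
            self.finished.append((name, time.perf_counter() - started))


def lookup_inventory(tracer: Tracer, sku: str) -> str:
    # Application code depends only on the interface, never on a backend.
    with tracer.start_span("inventory.lookup"):
        return f"{sku}: in stock"


recording = RecordingTracer()
result = lookup_inventory(recording, "sku-1")
noop_result = lookup_inventory(NoopTracer(), "sku-2")
```

Swapping RecordingTracer for another backend requires no change to lookup_inventory, which is the property a shared instrumentation API aims to provide.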