Service Discovery – Consul vs ZooKeeper vs etcd

January 17, 2019

Comparison Overview

All three have a similar architecture: a set of server nodes that need a quorum to operate, typically a simple majority. They are strongly consistent and expose primitives that applications can use through client libraries to build complex distributed systems. All offer key/value storage with roughly the same semantics. The differences become more apparent in advanced use cases. ZooKeeper, for instance, only provides a primitive K/V store, so application developers have to build their own service discovery on top of it. Consul, by comparison, offers an opinionated framework for service discovery, which cuts out the guesswork and the extra development.

ZooKeeper has been around the longest. It originated in Hadoop, for use in Hadoop clusters. Developers commend its high performance and its integration with Kafka and Spring Boot.

etcd is a newer option and is the simplest and easiest to use. Developers who have tried it say it is “one of the best-designed tools precisely because of this simplicity”. It is bundled with CoreOS and serves as a fault-tolerant key-value store.

Consul offers more features than the other two, including a key-value store, a built-in framework for service discovery, health checking, gossip clustering and high availability, as well as Docker integration. Developers often cite its first-class support for service discovery, health checking, K/V storage and multiple data centers as reasons for using it.

Consul

Consul is distributed, highly scalable and highly available. It is a decentralized, fault-tolerant service developed by HashiCorp (the company behind Vagrant, Terraform, Atlas, Otto, and others). It is a tool built explicitly for service discovery and configuration. A Consul agent is installed on each host and is a first-class cluster participant. This means that services don’t have to know the discovery address within the network: all discovery requests can be sent to a local address.

Consul distributes information using algorithms based on an eventual consistency model. Its agents use a gossip protocol to spread information, while its servers use the Raft algorithm for leader election.

Consul can also run as a cluster: a network of related nodes running services that are registered in discovery. Consul ensures that information about the cluster is distributed to all cluster participants and is available when required. It supports not only peers within one cluster but also multi-zone clusters, so it is possible to work within one data center and to act on others as well. Agents in one data center can request information from another data center, which helps in building an effective solution for distributed systems.

A service can be registered in Consul in two ways: (i) through the HTTP API or an agent configuration file, if the service communicates with Consul itself; (ii) by registering it as a third-party component, if the service cannot communicate with Consul.
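The first route, registration through the HTTP API, might look roughly like the sketch below. It assumes a Consul agent listening on its default local address (127.0.0.1:8500); the service name, the ID scheme and the /health path are hypothetical values chosen for illustration, not anything prescribed by Consul.

```python
import json
import urllib.request

# Assumption: a Consul agent is listening on its default local HTTP address.
CONSUL_AGENT = "http://127.0.0.1:8500"


def build_registration(name, port, address="127.0.0.1", health_path="/health"):
    """Build the JSON body for the agent's service registration endpoint."""
    return {
        "ID": f"{name}-{port}",  # hypothetical ID scheme: service name plus port
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            # The agent polls this URL; a 2xx response marks the service healthy.
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",
        },
    }


def register(service, agent=CONSUL_AGENT):
    """Send the registration to the local agent (needs a running agent)."""
    request = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(service).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_registration("web", 8080)
print(json.dumps(payload, indent=2))
# register(payload)  # uncomment when a Consul agent is running locally
```

Because the request goes to the local agent, the service never needs to know where the Consul servers are.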

Reasons developers have cited for choosing Consul include:

Thorough health checking
Superior service discovery infrastructure
Distributed key-value store
Insightful monitoring
High-availability
Web-UI
Gossip clustering
Token-based ACLs
DNS server
Docker integration

ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these services are used by distributed applications in one way or another. Every time they are implemented, a large amount of work goes into fixing the bugs and race conditions that are inevitable. Because these services are so hard to implement, applications tend to skimp on them initially, which makes them brittle in the face of change and hard to manage. Even when done correctly, different implementations of these services lead to management complexity once the applications are deployed. ZooKeeper is an attempt to solve these challenges by enabling highly reliable distributed coordination.

To achieve high availability with ZooKeeper, multiple ZooKeeper servers must be used instead of just one. This group is called an ensemble. Every ZooKeeper server in the ensemble stores a copy of the data, which is replicated across the hosts to guarantee its high availability. Each server maintains an in-memory image of the state, along with a transaction log in a persistent store, and the servers in the ensemble must all know about each other. As long as a majority of the servers are available, the ZooKeeper service will be available.
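The majority rule can be made concrete with a little arithmetic. The sketch below is plain Python, not ZooKeeper code, and the function names are made up for illustration; it shows how many servers must stay up and how many may fail for a given ensemble size.

```python
def quorum_size(servers):
    """Smallest number of servers that forms a simple majority."""
    return servers // 2 + 1


def tolerated_failures(servers):
    """How many servers can fail while a majority is still available."""
    return servers - quorum_size(servers)


for n in (1, 3, 4, 5, 7):
    print(f"{n} servers: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

A four-server ensemble tolerates one failure, the same as a three-server ensemble, which is why ensembles are usually sized with an odd number of servers.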

A leader for the ensemble is chosen through leader election. The leader’s job is to maintain consensus. Leader election also happens when an existing leader fails. All update requests go through the leader, which guarantees that updates are applied in a consistent order on every server.

ZooKeeper maintains a hierarchical structure of nodes, known as znodes. Each znode has data associated with it and may have children as well. The structure resembles a standard file system. There are two types of znode: persistent znodes, which stay until they are explicitly deleted, and ephemeral znodes, which are removed when the session that created them ends.
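To see why the two znode types matter for service discovery, here is a toy in-memory model of a znode tree. It is only a sketch in plain Python, not the ZooKeeper API: the class, its methods and the paths are invented for illustration, and a session is represented by a simple string.

```python
class ZNodeTree:
    """Toy in-memory model of a znode hierarchy (not the ZooKeeper API)."""

    def __init__(self):
        # path -> (data, owning session id, or None for a persistent znode)
        self.nodes = {"/": (b"", None)}

    def create(self, path, data=b"", session=None):
        """Create a znode; passing a session makes it ephemeral."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        if self.nodes[parent][1] is not None:
            raise ValueError("ephemeral znodes cannot have children")
        self.nodes[path] = (data, session)

    def children(self, path):
        """Names of the direct children of a znode, sorted."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self.nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def close_session(self, session):
        """Drop every ephemeral znode owned by the closed session."""
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session}


tree = ZNodeTree()
tree.create("/services")  # persistent parent
tree.create("/services/web-1", b"10.0.0.5:8080", session="s1")  # ephemeral
tree.create("/services/web-2", b"10.0.0.6:8080", session="s2")  # ephemeral
before = tree.children("/services")
tree.close_session("s1")  # instance web-1 crashes or disconnects
after = tree.children("/services")
print(before, after)
```

Each service instance registers itself as an ephemeral child, so when an instance goes away its entry disappears without anyone having to clean it up.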

Reasons developers have cited for choosing ZooKeeper include:

It is high performance
Straightforward generation of node specific config
Kafka support
Java-enabled and embeddable in Java services
Spring Boot Support
Supports DC/OS
Enables extensive distributed IPC
Used in Hadoop

Apache ZooKeeper is a volunteer-led open source project managed by the Apache Software Foundation.

etcd

etcd is a distributed key-value store that provides a reliable way to store data across a cluster of machines, offering shared configuration and service discovery for Container Linux clusters. It is available on GitHub as an open source project.

etcd is written in Go and uses the Raft protocol, which helps multiple nodes maintain identical logs of state-changing commands. Any node in a Raft cluster can be elected leader. It then works with the other nodes to decide the order in which state changes happen.
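One part of how a Raft leader and its followers agree on the log is the commit rule: an entry counts as committed once it is stored on a majority of nodes. The sketch below illustrates just that rule in plain Python. It is a simplification, not etcd's implementation; in particular it leaves out Raft's extra condition that a leader only commits entries from its own term this way, and the function name is made up.

```python
def commit_index(match_index):
    """Highest log index that is stored on a majority of nodes.

    match_index holds, for every node including the leader, the highest
    log index known to be replicated on that node.
    """
    ranked = sorted(match_index, reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest value is present on at least `majority` nodes.
    return ranked[majority - 1]


# Five nodes: entries up to index 3 exist on three of them (5, 5 and 3).
print(commit_index([5, 5, 3, 2, 1]))
# Three nodes: the leader is ahead, both followers have reached index 4.
print(commit_index([7, 4, 4]))
```

Entries beyond the commit index exist only on a minority of nodes, so they are not yet considered durable.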

etcd handles leader elections during network partitions and can tolerate machine failure, including failure of the leader.

Application containers running on a cluster can read and write data in etcd. Use cases include storing database connection details, cache settings, or feature flags in the form of key-value pairs. The values can be watched, enabling your app to reconfigure itself when they change.
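As an illustration, writing a feature flag over HTTP could look like the sketch below. It assumes etcd's v3 JSON gateway on the default local client URL, where keys and values are sent base64-encoded; the /v3 path prefix applies to recent etcd releases (older ones used /v3beta or /v3alpha), and the key name is hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: etcd is serving its JSON gateway on the default client URL.
ETCD = "http://127.0.0.1:2379"


def b64(text):
    """The JSON gateway expects keys and values as base64 strings."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def put_body(key, value):
    """Build the JSON body for a put request."""
    return {"key": b64(key), "value": b64(value)}


def put(key, value, endpoint=ETCD):
    """Write one key-value pair (needs a running etcd)."""
    request = urllib.request.Request(
        f"{endpoint}/v3/kv/put",
        data=json.dumps(put_body(key, value)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


# Hypothetical key for a feature flag.
body = put_body("/config/feature/new-checkout", "enabled")
print(body)
# put("/config/feature/new-checkout", "enabled")  # needs a running etcd
```

An application would typically pair a write like this with a watch on the same key prefix, so that it picks up the new value without restarting.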

Advanced uses leverage the consistency guarantees to implement database leader elections or distributed locking across a cluster of workers.

Kubernetes is built on top of etcd. It leverages the etcd distributed K/V store, as does Cloud Foundry. etcd handles the storage and replication of the data Kubernetes uses across the entire cluster. Thanks to the Raft consensus algorithm, etcd is able to recover from hardware failures and network partitions. It was designed to be the backbone of any distributed system, which is why projects like Kubernetes, Cloud Foundry and Fleet depend on it.

Developers cite a range of reasons for choosing etcd, including:

Service discovery
Bundled with CoreOS
Runs on a range of operating systems, including Linux, OS X and BSD
Fault tolerant key value store
Simple interface, which reads and writes values with curl, in addition to other HTTP libraries
Easy to manage cluster coordination and state management
Optional SSL client cert authentication
Optional TTLs for key expiration
Properly distributed through the Raft protocol
Benchmarked at 1000s of writes/s per instance

Conclusion

If attention is not paid to the architecture when building an application, scaling problems can suddenly appear. Planning for scaling and efficient use of resources at the initial stage can save a lot of time later on. A distributed architecture is frequently used to stop such problems from emerging. However, microservices bring their own set of challenges, such as cohesion and the complexity of service configuration. This is where service discovery comes into play.

Discovery helps provide cohesion between the components of an architecture, beyond simply linking them. It is the equivalent of a meta-information registry for a distributed architecture, one that retains data about all the components. It is a form of decentralized storage that any node can access. This means components need only minimal manual intervention.

Selecting the right service discovery tool is a matter of exploring and experimenting until you find the solution that fits your needs.