Service discovery is the process of finding the network location of a service provider. Developer Jason Xu likens it to changing your phone number without telling your friend, then losing contact. The same applies to services in a microservices architecture: two services may talk to one another without any problems until one of them moves to a different IP address. This is when service discovery comes into play.
Service discovery is rarely needed when an application runs on a fixed set of physical servers; in a monolithic application, a configuration file will usually satisfy this requirement. However, service discovery becomes a must within a microservices environment. If you are writing code that invokes a service that has, for instance, a Thrift API or REST API, your code needs to know the network location (both IP address and port) of a service instance. In a microservices application this is challenging because service instances have dynamically assigned network locations. Manually maintaining a configuration file is not feasible when those locations change because of restarts, upgrades, failures and autoscaling.
Service discovery requires the involvement of three different parties:
(i) A service registry, which maintains the most up-to-date location of each provider;
(ii) Service providers, which register themselves with the service registry and deregister when they leave the system;
(iii) Service consumers, which obtain the provider’s location from the registry, then communicate with the provider.
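The three roles can be sketched with a toy in-memory registry. This is only an illustration of the idea, not how any of the tools discussed later are implemented; the class `ServiceRegistry`, its method names and the service name `orders` are hypothetical.

```python
class ServiceRegistry:
    """Toy in-memory registry (hypothetical; real registries are networked and replicated)."""

    def __init__(self):
        # service name -> set of (host, port) tuples
        self._instances = {}

    def register(self, name, host, port):
        """Called by a provider when it starts."""
        self._instances.setdefault(name, set()).add((host, port))

    def deregister(self, name, host, port):
        """Called by a provider when it leaves the system."""
        self._instances.get(name, set()).discard((host, port))

    def lookup(self, name):
        """Called by a consumer to obtain the current provider locations."""
        return sorted(self._instances.get(name, set()))


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("orders", "10.0.0.5", 8080)   # provider 1 starts
    registry.register("orders", "10.0.0.6", 8080)   # provider 2 starts
    registry.deregister("orders", "10.0.0.5", 8080)  # provider 1 shuts down
    print(registry.lookup("orders"))                 # consumer asks where "orders" lives
```

A consumer never hard-codes an address here; it asks the registry each time it needs one.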
There are two main service discovery patterns: client-side and server-side.
In client-side discovery, the client is responsible for determining the network locations of available service instances and for load balancing between them. The client queries the service registry to find out which service instances are available, then applies a load-balancing algorithm to choose one of them and make a request.
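A minimal sketch of client-side discovery, assuming round robin as the load-balancing algorithm. The names `RoundRobinClient` and `fake_lookup` are hypothetical, and `fake_lookup` stands in for a real registry query.

```python
import itertools


class RoundRobinClient:
    """Hypothetical client that load-balances over instances it looked up itself."""

    def __init__(self, lookup, service_name):
        self._lookup = lookup              # callable: service name -> list of (host, port)
        self._service_name = service_name
        self._counter = itertools.count()  # number of requests made so far

    def choose_instance(self):
        instances = self._lookup(self._service_name)  # query the registry
        if not instances:
            raise LookupError(f"no instances of {self._service_name!r} available")
        # Round robin: step through the available instances in order
        return instances[next(self._counter) % len(instances)]


def fake_lookup(name):
    """Stand-in for a registry query; the addresses are made-up example data."""
    return {"orders": [("10.0.0.5", 8080), ("10.0.0.6", 8080)]}.get(name, [])


if __name__ == "__main__":
    client = RoundRobinClient(fake_lookup, "orders")
    for _ in range(4):
        print(client.choose_instance())
```

Because the registry is queried on every call, instances that register or deregister are picked up on the next request, at the cost of coupling every client to the registry.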
Server-side discovery is the main alternative approach. The client issues a request to a service via a load balancer, which queries the service registry then routes each request to an available service instance.
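The server-side variant can be sketched in the same spirit. In this hypothetical example the `LoadBalancer` is the only component that talks to the registry, and it picks an instance at random; a real load balancer would forward the request over the network rather than return a description of the target.

```python
import random


class LoadBalancer:
    """Hypothetical router: the only component that queries the registry."""

    def __init__(self, lookup, rng=None):
        self._lookup = lookup               # callable: service name -> list of (host, port)
        self._rng = rng or random.Random()

    def route(self, service_name, request):
        instances = self._lookup(service_name)
        if not instances:
            raise LookupError(f"no instances of {service_name!r} available")
        host, port = self._rng.choice(instances)  # random choice as the balancing policy
        # A real balancer would forward the request; here we only report the target
        return {"forwarded_to": f"{host}:{port}", "request": request}


def client_call(balancer, request):
    """The client knows the balancer and a service name, never instance addresses."""
    return balancer.route("orders", request)


if __name__ == "__main__":
    def lookup(name):
        # Made-up registry data for the example
        return [("10.0.0.5", 8080), ("10.0.0.6", 8080)] if name == "orders" else []

    print(client_call(LoadBalancer(lookup), "GET /orders/42"))
```

The trade-off compared with client-side discovery is that clients stay simple, while the load balancer becomes one more component that must be kept highly available.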
Popular Service Discovery Solutions
In their own words…
Consul is a distributed service mesh to connect, secure, and configure services across any runtime platform and public or private cloud.
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
etcd is a distributed, reliable key-value store for the most critical data of a distributed system.
Comparison Overview
All three have a similar architecture: a set of server nodes that need a quorum to operate, typically a simple majority. They are strongly consistent and expose primitives that applications can use, via client libraries, to build complex distributed systems. All have roughly the same semantics for key/value storage. The differences between them become more apparent in advanced use cases. ZooKeeper, for instance, offers only a primitive K/V store, so application developers have to build their own systems on top of it to provide service discovery. Consul, by comparison, offers an opinionated framework for service discovery, which removes the guesswork and the need for custom development.
ZooKeeper has been around the longest. It began as a Hadoop subproject for use in Hadoop clusters. Developers commend its high performance and the support it offers for Kafka and Spring Boot.
etcd is a much newer option than ZooKeeper and is the simplest and easiest to use. Developers who have tried it say it is “one of the best-designed tools precisely because of this simplicity”. It is bundled with CoreOS and is a fault-tolerant key-value store.
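The simple-majority rule is easy to make concrete. The helper names below are hypothetical; the arithmetic is the standard majority calculation.

```python
def quorum_size(n_servers):
    """Smallest number of servers that forms a simple majority."""
    if n_servers < 1:
        raise ValueError("a cluster needs at least one server")
    return n_servers // 2 + 1


def tolerated_failures(n_servers):
    """How many servers can fail while a majority is still reachable."""
    return n_servers - quorum_size(n_servers)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

Three servers tolerate one failure and five tolerate two, while four servers still tolerate only one, which is why clusters of this kind are usually sized with an odd number of servers.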
Consul offers more features than the other two, including a key value store, built-in framework for service discovery, health checking, gossip clustering and high availability, in addition to Docker integration. Developers often cite its first-class support for service discovery, health checking, K/V storage and multiple data centers as reasons for its use.
Consul
Consul is distributed, highly scalable and highly available. It is a decentralized, fault-tolerant service developed by HashiCorp (the company behind Vagrant, Terraform, Atlas, Otto, and others). It is a tool built explicitly for service discovery and configuration. A Consul agent is installed on each host and is a first-class cluster participant. This means that services don’t have to know the discovery address within a network: all discovery requests can be sent to a local address.
Consul distributes information using algorithms based on an eventual consistency model. Its agents use a gossip protocol to distribute information, and its servers use the Raft algorithm for leader election.
Consul can also be run as a cluster, which is a network of related nodes whose running services are registered for discovery. Consul ensures that information about the cluster is distributed between all cluster participants and is available when required. It supports not only peers within one cluster but also multi-zone clusters. This means it is possible to work with separate data centers and to perform actions in any of them. Agents in one data center can request information from another data center, which helps in building an effective solution for distributed systems.
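To give a feel for how gossip leads to eventual consistency, here is a small push-gossip simulation. It is a generic textbook-style sketch, not Consul's actual protocol; the function name `gossip_rounds` and the `fanout` and `seed` parameters are assumptions made for the example.

```python
import random


def gossip_rounds(n_nodes, fanout=2, seed=0):
    """Return how many rounds a push-gossip simulation needs to inform every node."""
    if n_nodes < 1 or fanout < 1:
        raise ValueError("n_nodes and fanout must be at least 1")
    rng = random.Random(seed)   # seeded so that repeated runs give the same result
    informed = {0}              # node 0 learns about the update first
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        targets = set()
        for _node in informed:
            # Each informed node pushes the update to `fanout` random nodes
            # (it may pick itself or a node that already knows the update)
            for _ in range(fanout):
                targets.add(rng.randrange(n_nodes))
        informed |= targets
    return rounds


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "nodes informed after", gossip_rounds(n), "rounds")
```

No node ever contacts all the others, yet the update reaches everyone after a number of rounds that grows slowly with cluster size. In the meantime some nodes have the new information and some do not, which is what "eventual" refers to.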
A service can be registered with Consul in two different ways:
(i) Through the HTTP API or an agent configuration file, if the service communicates with Consul itself;
(ii) By registering the service as a third-party component, if it cannot communicate with Consul.
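As a sketch of the HTTP API route, the snippet below builds a registration request for the local agent's `/v1/agent/service/register` endpoint. It assumes an agent listening on the default HTTP port 8500; the service ID, name, address and health-check URL are made-up example values, and the request is only constructed, not sent.

```python
import json
import urllib.request

# Assumption: a Consul agent is running locally on the default HTTP port
CONSUL_AGENT = "http://127.0.0.1:8500"


def build_register_request(service_id, name, address, port, health_url):
    """Build (but do not send) a PUT request that registers a service with the local agent."""
    payload = {
        "ID": service_id,
        "Name": name,
        "Address": address,
        "Port": port,
        # Ask the agent to poll an HTTP health endpoint every 10 seconds
        "Check": {"HTTP": health_url, "Interval": "10s"},
    }
    return urllib.request.Request(
        url=f"{CONSUL_AGENT}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_register_request(
        "orders-1", "orders", "10.0.0.5", 8080, "http://10.0.0.5:8080/health"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending it requires a running agent:
    # urllib.request.urlopen(req)
```

The same fields could instead be placed in an agent configuration file, which suits services that are deployed together with their agent.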
Reasons developers have cited for choosing Consul include:
- Thorough health checking
- Superior service discovery infrastructure
- Distributed key-value store
- Insightful monitoring
- High-availability
- Web-UI
- Gossip clustering
- Token-based ACLs
- DNS server
- Docker integration
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because these services are difficult to implement, applications tend to skimp on them initially, which makes them brittle in the face of change and hard to manage. Even when implemented correctly, different implementations of these services lead to management complexity once the applications are deployed. ZooKeeper is an attempt to solve these challenges by enabling highly reliable distributed coordination.
To achieve high availability with ZooKeeper, multiple ZooKeeper servers must be used instead of just one. This group is called an ensemble. All ZooKeeper servers in the ensemble store copies of the data, which is replicated across the hosts to guarantee its high availability. Each server maintains an in-memory image of the state, along with a transaction log in a persistent store, and the servers must all know about each other. As long as a majority of the servers are available, the ZooKeeper service will be available.
A leader for the ensemble is chosen via a leader election recipe. The leader’s job is to maintain consensus. Leader election also takes place when an existing leader fails. All update requests go through the leader so that they are applied in the same order on every server.
ZooKeeper maintains a hierarchical structure of nodes, which are known as znodes. Each znode has data associated with it and may also have children. The structure is similar to a standard file system. There are two types of znode: persistent znodes, which remain until they are explicitly deleted, and ephemeral znodes, which are removed when the session that created them ends.
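The znode hierarchy lends itself to service discovery: a provider can create an ephemeral znode under a per-service path, and that znode disappears when the provider's session ends. The toy model below imitates this behaviour in memory. It is not a ZooKeeper client; the class `ZnodeTree`, its methods and the `/services/orders` layout are hypothetical.

```python
class ZnodeTree:
    """Toy in-memory model of a znode hierarchy (not a ZooKeeper client)."""

    def __init__(self):
        # path -> (data, owner_session); owner_session is None for persistent znodes
        self._nodes = {"/": (b"", None)}

    def create(self, path, data=b"", ephemeral_session=None):
        """Create a znode; pass a session id to make it ephemeral."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent {parent!r} does not exist")
        if path in self._nodes:
            raise ValueError(f"{path!r} already exists")
        self._nodes[path] = (data, ephemeral_session)

    def get_children(self, path):
        """Return the names of the direct children of a znode."""
        prefix = path.rstrip("/") + "/"
        return sorted(
            p[len(prefix):]
            for p in self._nodes
            if p != "/" and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def close_session(self, session):
        """Remove every ephemeral znode that the given session created."""
        owned = [p for p, (_, owner) in self._nodes.items()
                 if owner is not None and owner == session]
        for p in owned:
            del self._nodes[p]


if __name__ == "__main__":
    tree = ZnodeTree()
    tree.create("/services")                       # persistent
    tree.create("/services/orders")                # persistent
    tree.create("/services/orders/10.0.0.5:8080", ephemeral_session="s1")
    tree.create("/services/orders/10.0.0.6:8080", ephemeral_session="s2")
    print(tree.get_children("/services/orders"))
    tree.close_session("s1")                       # provider 1 crashes or disconnects
    print(tree.get_children("/services/orders"))
```

Consumers list the children of the service path to find live instances, which is the kind of system application developers build on top of ZooKeeper's primitives.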
Reasons developers have cited for choosing ZooKeeper include:
- High performance
- Straightforward generation of node specific config
- Offers support of Kafka
- Java enabled and embeddable in Java Service
- Spring Boot Support
- Supports DC/OS
- Enables extensive distributed IPC
- Used in Hadoop
Apache ZooKeeper is a volunteer-led open source project managed by the Apache Software Foundation.
etcd
etcd is a distributed key-value store that offers a reliable way to store data across a cluster of machines. It provides shared configuration and service discovery for Container Linux clusters and is available on GitHub as an open source project.
etcd is written in Go and uses the Raft protocol, which helps multiple nodes maintain identical logs of state-changing commands. Any node in a Raft cluster can be elected leader; it then works with the others to decide the order in which state changes happen.
etcd handles leader elections during network partitions and is able to tolerate machine failure, including failure of the leader.
Application containers running on a cluster can read and write data in etcd; use cases include storing database connection details, cache settings or feature flags as key-value pairs. The values can be watched, which enables your app to reconfigure itself when they change.
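The watch idea can be illustrated with a toy in-memory store. This is not an etcd client and makes no network calls; `WatchableStore`, its methods and the `feature/new-checkout` key are hypothetical names chosen for the example.

```python
class WatchableStore:
    """Toy in-memory key-value store with watches (not an etcd client)."""

    def __init__(self):
        self._data = {}
        self._watchers = {}   # key -> list of callbacks

    def watch(self, key, callback):
        """Register callback(key, old_value, new_value) for changes to a key."""
        self._watchers.setdefault(key, []).append(callback)

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        if old != value:      # notify watchers only on an actual change
            for callback in self._watchers.get(key, []):
                callback(key, old, value)

    def get(self, key, default=None):
        return self._data.get(key, default)


if __name__ == "__main__":
    store = WatchableStore()
    settings = {"new_checkout": False}   # the app's in-process view of the flag

    def on_flag_change(key, old, new):
        # Reconfigure the running app instead of restarting it
        settings["new_checkout"] = (new == "on")

    store.watch("feature/new-checkout", on_flag_change)
    store.put("feature/new-checkout", "on")
    print(settings)
```

The application reacts to the change as it happens rather than polling or waiting for a redeploy.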
Advanced uses leverage the consistency guarantees to implement database leader elections or distributed locking across a cluster of workers.
Kubernetes is built on top of etcd: it uses the etcd distributed K/V store, as does Cloud Foundry, and etcd handles the storage and replication of the data Kubernetes uses across the entire cluster. Thanks to the Raft consensus algorithm, etcd is able to recover from hardware failures and network partitions. It was designed to be the backbone of any distributed system, which is why projects like Kubernetes, Cloud Foundry and Fleet depend on it.
Reasons developers have cited for choosing etcd include:
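The core of such a lock is an atomic "create this key only if it does not exist" operation. The sketch below fakes that operation with a single-process store guarded by a mutex; in a real deployment the consistent store would provide it across machines. All names (`AtomicStore`, `try_acquire`, `release`, the `locks/db-leader` key) are hypothetical.

```python
import threading


class AtomicStore:
    """Toy single-process stand-in for a strongly consistent key-value store."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically create the key; return False if it already exists."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete_if_equals(self, key, value):
        """Atomically delete the key only if it still holds the expected value."""
        with self._mutex:
            if self._data.get(key) != value:
                return False
            del self._data[key]
            return True


def try_acquire(store, lock_key, worker_id):
    """The worker whose create succeeds holds the lock (or is the leader)."""
    return store.put_if_absent(lock_key, worker_id)


def release(store, lock_key, worker_id):
    """Only the current holder can release the lock."""
    return store.delete_if_equals(lock_key, worker_id)


if __name__ == "__main__":
    store = AtomicStore()
    print(try_acquire(store, "locks/db-leader", "worker-1"))
    print(try_acquire(store, "locks/db-leader", "worker-2"))
    print(release(store, "locks/db-leader", "worker-1"))
    print(try_acquire(store, "locks/db-leader", "worker-2"))
```

A production lock would also attach an expiry to the key so that a crashed holder does not block everyone else forever; this sketch leaves that out.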
- Service discovery
- Bundled with CoreOS
- Runs on a range of operating systems, including Linux, OS X and BSD
- Fault tolerant key value store
- Simple interface, which reads and writes values with curl, in addition to other HTTP libraries
- Easy to manage cluster coordination and state management
- Optional SSL client cert authentication
- Optional TTLs for key expiration
- Properly distributed through the Raft protocol
- Benchmarked at 1000s of writes/s per instance
Conclusion
If attention is not paid to the architecture when building an application, scaling problems can suddenly appear. Planning for application scaling and efficient use of resources at the initial stage can save lots of time later on. Distributed architecture is frequently used to stop such problems from emerging. However, microservices bring their own set of challenges, such as cohesion and the complexity of service configuration. This is when service discovery comes into play.
Discovery helps provide cohesion between a range of architecture components, aside from simply linking them. It is the equivalent of a meta-information registry for a distributed architecture, retaining data about all of its components. It is a form of decentralized storage that any node can access, which means that only minimal manual intervention with components is needed.
Selecting the right service discovery tool is something to explore and experiment with until you find the right solution for your needs.