Maglev Network Load Balancing Primer

Introduction to Maglev

Maglev is Google’s custom network load balancer: a distributed, packet-level system used within all its data centers. Google is a significant source of global Internet traffic. It provides hundreds of user-facing services, and many more are hosted on its rapidly growing Google Cloud Platform. To meet this colossal level of demand at the necessary low latency, Google services are hosted on servers located in multiple clusters worldwide. Within a cluster, it is important to distribute traffic evenly across the servers so that resources are utilized efficiently and no single server is overloaded. Network load balancers are therefore a critical element of the company’s production network infrastructure.

Traditional Hardware Load Balancers

A network load balancer is normally composed of several devices situated between routers and service endpoints (typically TCP or UDP servers). It is responsible for matching each packet to its corresponding service and forwarding it to one of that service’s endpoints. Network load balancers are usually implemented as dedicated hardware devices. However, this approach has limitations in several areas:

  • Scalability – this is limited by the maximum capacity of a single unit, making it impossible to match Google’s growth in traffic;
  • High availability – although hardware load balancers are deployed in pairs to avoid single points of failure, they only provide 1+1 redundancy;
  • Quick iteration – this is rarely possible with hardware load balancers, as they are difficult to modify;
  • Cost of upgrades – upgrading typically involves the purchase of new hardware, in addition to its physical deployment.

The Advantages of Maglev

Google developed alternative solutions to traditional hardware load balancers in order to avoid these limitations, and in 2008, launched Maglev as a distributed software system running on its commodity Linux servers. Maglev also provides network load balancing to the Google Cloud Platform.

There are multiple advantages to Google’s customized solution, including:

  • Flexibility – Maglev doesn’t require a specialized physical rack deployment;
  • Scalability – its capacity can be adjusted by simply adding or removing servers. Google refers to this as “the scale-out model”;
  • Availability and reliability – the system provides N+1 redundancy;
  • Ease of upgrade – as Google controls the whole system end to end, its engineers can easily add, test and deploy new features;
  • Simplicity – as the Maglev system only uses existing servers inside the clusters, deployment of the load balancers is simplified;
  • Costs – by running on commodity hardware within Google’s data centers, costs are reduced;
  • Performance isolation – services can be divided between several shards of load balancers in the same cluster.

Maglev is optimized for packet processing performance in order to accommodate high and continually increasing traffic. It must provide an even distribution of traffic, connection persistence, and high throughput with small packets. A single Maglev machine can saturate a 10 Gbps link with small packets.

The network routers in front of Maglev use Equal Cost Multipath (ECMP) routing to distribute packets evenly across the Maglev machines. Each Maglev machine then matches every packet to its corresponding service and spreads the packets evenly across that service’s endpoints. Maglev capacity can be increased simply by adding more servers to the pool. Because packets are spread evenly, Maglev redundancy can be modeled as N + 1, which improves availability and reliability over traditional load balancing systems.
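As a rough illustration of the ECMP step, a router can hash each packet’s 5-tuple and use the result to choose among equal-cost next hops, so that all packets of one flow reach the same Maglev machine while flows as a whole spread out. The sketch below is a toy model only: the function name ecmp_next_hop, the use of SHA-256, and the machine names are assumptions made for illustration, and real routers do this in hardware with their own hash functions.

```python
import hashlib
from collections import Counter

def ecmp_next_hop(five_tuple, next_hops):
    """Toy ECMP: hash the flow's 5-tuple and pick one of the equal-cost next hops."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical pool of Maglev machines announced as equal-cost next hops.
maglevs = ["maglev-1", "maglev-2", "maglev-3", "maglev-4"]

# Simulate 10,000 distinct client flows aimed at one VIP on port 443.
flows = [
    ("10.0.%d.%d" % (i // 256, i % 256), 40000 + i % 1000, "203.0.113.10", 443, "tcp")
    for i in range(10000)
]

flow_counts = Counter(ecmp_next_hop(flow, maglevs) for flow in flows)
for machine in maglevs:
    print(machine, flow_counts[machine])
```

Because the choice depends only on the 5-tuple, every packet of a given flow takes the same path, which is what lets the Maglev machine behind it keep per-connection state.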

To reduce the negative impact of unanticipated faults and failures on connection-oriented protocols, Maglev was designed with consistent hashing and connection tracking, which together map the packets of a connection to the same backend. These techniques coalesce TCP streams at Google’s HTTP reverse proxies (also known as Google Front Ends, or GFEs).
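The way connection tracking complements hashing can be sketched as follows: the first packet of a connection is assigned a backend by hashing, the choice is remembered, and later packets reuse it even if the backend pool has changed in the meantime. This is a minimal sketch under stated assumptions, not Maglev’s implementation: the ConnectionTracker class and its methods are hypothetical names, and a plain modulo hash stands in for the consistent hash.

```python
import hashlib

def flow_hash(five_tuple):
    """Stable integer hash of a connection's 5-tuple."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

class ConnectionTracker:
    """Hypothetical per-machine table that pins known connections to a backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.table = {}  # 5-tuple -> backend chosen when the connection was first seen

    def select(self, five_tuple):
        backend = self.table.get(five_tuple)
        if backend is not None and backend in self.backends:
            return backend  # tracked connection, backend still in the pool
        # New connection, or its backend was removed: choose by hash and remember it.
        backend = self.backends[flow_hash(five_tuple) % len(self.backends)]
        self.table[five_tuple] = backend
        return backend

    def add_backend(self, backend):
        self.backends.append(backend)

    def remove_backend(self, backend):
        self.backends.remove(backend)

tracker = ConnectionTracker(["backend-a", "backend-b", "backend-c"])
conn = ("198.51.100.7", 51234, "203.0.113.10", 443, "tcp")
first_choice = tracker.select(conn)
tracker.add_backend("backend-d")             # pool changes while the connection is open
print(first_choice == tracker.select(conn))  # the tracked connection stays where it was
```

A tracking table alone only helps on the machine that saw the connection first, which is why the text pairs it with consistent hashing: if packets arrive at a different machine, a consistent hash gives that machine a good chance of choosing the same backend.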

Designing an Alternative

Designing and implementing a software network load balancer is a highly sophisticated and challenging task. Each individual machine needs to provide high throughput, and the entire system must provide connection persistence in order to guarantee the quality of service. Clusters are highly dynamic and failures are fairly frequent.

Maglev has been running as “a critical component of Google’s frontend serving infrastructure since 2008”, and it serves almost the entirety of Google’s incoming user traffic. By exploiting recent advances in high-speed server networking techniques, every Maglev machine can achieve line-rate throughput even with small packets. Consistent hashing and connection tracking ensure reliable packet delivery despite unexpected failures and frequent changes.

Load Balancing at the Frontend: Balancing User Traffic Between Datacenters

What is the optimal load distribution? Many considerations factor into this: load balancing for large systems, in particular, is a dynamic and complicated problem. Google has approached the challenge through load balancing at various levels.

Traditionally, load balancing has relied on DNS alone. Google, however, uses DNS load balancing only as its first layer. Before a client can send an HTTP request, it has to look up the server’s IP address using DNS. This in itself can be challenging and has three knock-on consequences for traffic management: (i) recursive resolution of IP addresses; (ii) nondeterministic reply paths; (iii) further caching complications.

It is challenging to find the optimal IP address to return to a nameserver for a specific user’s request, because that nameserver may serve millions of users spread across many geographic regions. An ISP’s nameservers then receive the IP address best suited to the nameservers’ own datacenter, even though better network paths may exist for the users behind them. When load balancing via DNS, it is also difficult to estimate the impact of a given reply and to determine what the best location is, and caching introduces further problems.

Google has found that “To really solve the problem of frontend load balancing, this initial level of DNS load balancing should be followed by a level that takes advantage of virtual IP addresses.”

Virtual IP addresses (VIPs) are shared across multiple devices as opposed to being assigned to particular network interfaces. The user doesn’t know this; for them, the VIP remains a single, regular IP address. The most critical component of VIP implementation is the network load balancer, which receives packets and forwards them on to one of the machines sitting behind the VIP. The request is further processed by these backends.

The balancer can take various approaches when deciding which backend should serve the request. The first is to prefer the least loaded backend; however, for stateful protocols that need to use the same backend for an entire request’s duration, this logic breaks down quickly. An alternative is to derive a connection ID from parts of the packet, for example by applying a hash function to selected header fields, and to select a backend based on that connection ID.
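The connection ID approach can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the function names connection_id and pick_backend, the choice of SHA-256, and the backend names are all hypothetical. It hashes the 5-tuple into an ID, selects a backend with a modulo, and then measures what happens to existing connections when one backend is removed.

```python
import hashlib

def connection_id(src_ip, src_port, dst_ip, dst_port, protocol):
    """Derive a stable integer ID from fields of the packet's 5-tuple."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{protocol}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def pick_backend(conn_id, backends):
    """Select a backend from the connection ID using a simple modulo."""
    return backends[conn_id % len(backends)]

backends = [f"backend-{i}" for i in range(10)]
conn_ids = [
    connection_id(f"198.51.100.{i % 250}", 1024 + i, "203.0.113.10", 443, "tcp")
    for i in range(10000)
]

before = [pick_backend(c, backends) for c in conn_ids]
after = [pick_backend(c, backends[:-1]) for c in conn_ids]  # one backend removed
moved = sum(1 for b, a in zip(before, after) if b != a) / len(conn_ids)
print(f"{moved:.0%} of connections now map to a different backend")
```

No per-connection state is needed, and every packet of a connection maps to the same backend while the pool is stable. The weakness shows up when the pool changes: with ten backends reduced to nine, roughly nine in ten connections would be expected to land on a different backend, because the two modulo results rarely agree.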

A further alternative is consistent hashing, which neither forces connections to reset when a single machine fails nor requires keeping the state of every connection in memory. This approach reduces the disruption to existing connections when the backend pool changes.
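One way to build such a mapping is a fixed-size lookup table that backends fill in turn, each following its own pseudo-random preference order over the slots. The sketch below is modeled on the permutation-based table population described in the published Maglev paper, but it is an illustrative approximation rather than Google’s code: the function names build_table and lookup, the SHA-256 based hashes, and the table size of 251 are assumptions.

```python
import hashlib

def _hash(name, salt):
    """Salted integer hash used to derive each backend's offset and skip."""
    data = f"{salt}:{name}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def build_table(backends, size=251):
    """Fill a lookup table; size should be prime so each preference list covers every slot."""
    if not backends:
        raise ValueError("at least one backend is required")
    offsets = {b: _hash(b, "offset") % size for b in backends}
    skips = {b: _hash(b, "skip") % (size - 1) + 1 for b in backends}
    next_pref = {b: 0 for b in backends}  # position in each backend's preference list
    table = [None] * size
    filled = 0
    while True:
        for b in backends:  # backends take turns, so slot counts differ by at most one
            slot = (offsets[b] + next_pref[b] * skips[b]) % size
            while table[slot] is not None:  # skip preferences already taken
                next_pref[b] += 1
                slot = (offsets[b] + next_pref[b] * skips[b]) % size
            table[slot] = b
            next_pref[b] += 1
            filled += 1
            if filled == size:
                return table

def lookup(table, conn_hash):
    """Map a connection hash to a backend through the table."""
    return table[conn_hash % len(table)]

pool = [f"backend-{i}" for i in range(10)]
table_before = build_table(pool)
table_after = build_table(pool[:-1])  # the last backend has failed
changed = sum(1 for x, y in zip(table_before, table_after) if x != y)
print(f"{changed} of {len(table_before)} table entries changed")
```

Connections whose table entry is unchanged keep their backend without any stored state. The entries owned by the removed backend must change, and some others usually shift as well; the exact count depends on the hashes and the table size, so no figure is claimed here.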

Another aspect of Google’s load balancing solution involves the use of packet encapsulation. This offers significant flexibility in the way that Google’s networks can be designed and grow. However, it leads to inflated packet size, so the network must support large Protocol Data Units.
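The size inflation is easy to quantify once an encapsulation format is fixed. The arithmetic below assumes Generic Routing Encapsulation (GRE) over IPv4 with no optional header fields, which adds a 20-byte outer IPv4 header and a 4-byte GRE header; other encapsulations or options would give different numbers, and the function names are hypothetical.

```python
OUTER_IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumed)
GRE_HEADER_BYTES = 4          # base GRE header without key or sequence fields (assumed)

def encapsulated_size(inner_packet_bytes):
    """Size on the wire after wrapping the original packet in IPv4 + GRE."""
    return inner_packet_bytes + OUTER_IPV4_HEADER_BYTES + GRE_HEADER_BYTES

def max_inner_packet(link_mtu_bytes):
    """Largest original packet that still fits in one frame after encapsulation."""
    return link_mtu_bytes - OUTER_IPV4_HEADER_BYTES - GRE_HEADER_BYTES

print(encapsulated_size(1500))  # 1524: a full-size packet no longer fits a 1500-byte MTU
print(max_inner_packet(1500))   # 1476
```

Under these assumptions, a network that must carry unfragmented 1500-byte client packets needs to support at least 1524-byte packets between the load balancer and the backends.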

While load balancing sounds straightforward on the surface (load balancing early and often), the devil is in the details, for frontend load balancing and for handling packets after they have reached the datacenter.

Notably, GCLB (Google Cloud Load Balancer), Google’s publicly consumable global load balancing solution (which we’ll discuss in more depth below), does not use DNS load balancing at all.

Load Balancing in the Datacenter

Assume that a range of different queries arrives at the datacenter, coming from the datacenter itself, from remote datacenters, or from a mixture of the two. Usually, these requests arrive at a rate that doesn’t exceed the resources the datacenter has to handle them, or if it does, only for very short periods. Typically, there are services within the datacenter against which these queries operate. They are implemented as many homogeneous, interchangeable server processes, mostly running on different machines. The smallest services have at least three such processes, and the largest, depending on datacenter size, might have over 10,000. A typical service is composed of between 100 and 1,000 processes. These processes are called backend tasks. Client tasks, by contrast, hold connections to the backend tasks. For every incoming query, a client task has to decide which backend task should handle it. Clients communicate with backends using a protocol implemented on top of a mixture of TCP and UDP. Google’s datacenters house a wide range of different services that implement a variety of policies.

In the ideal case, the load for any given service is spread perfectly across all its backend tasks, so that at any point in time the least and most loaded backend tasks consume precisely the same amount of CPU. Traffic can be sent to a datacenter only until the most loaded task reaches its capacity limit. Once that limit has been reached, the cross-datacenter load balancing algorithm won’t send any further traffic to the datacenter, in order to avoid overloading specific tasks.
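A small calculation shows why uneven load matters: since traffic stops being admitted once the busiest task is full, any gap between the busiest task and the average is capacity that cannot be used. The sketch below is a simplified model with hypothetical function names, and it assumes each task’s load grows in proportion to total traffic.

```python
def headroom_factor(utilizations):
    """Factor by which traffic can grow before the busiest task reaches its limit (1.0)."""
    return 1.0 / max(utilizations)

def stranded_capacity(utilizations):
    """Fraction of total capacity left idle once the busiest task is at its limit."""
    peak = max(utilizations)
    mean = sum(utilizations) / len(utilizations)
    return 1.0 - mean / peak

balanced = [0.5, 0.5, 0.5, 0.5]    # CPU use as a fraction of each task's limit
skewed = [0.75, 0.25, 0.25, 0.25]  # one hot task, three lightly loaded ones

print(headroom_factor(balanced), stranded_capacity(balanced))  # 2.0 0.0
print(headroom_factor(skewed), stranded_capacity(skewed))
```

In the balanced case traffic can double and nothing is wasted. In the skewed case the hot task fills up when traffic has grown by only a third, at which point half of the datacenter’s capacity for that service is still idle.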

Google Cloud Load Balancing

Most companies don’t develop and maintain an internal global load balancer, preferring instead to rely on load balancing services from larger public cloud providers. The Google Cloud Load Balancer (GCLB) is one such service. Internally, Google relies on its Global Software Load Balancer (GSLB), which allows it to balance live user traffic between clusters in order to match user demand to available service capacity, and to handle service failures in a transparent manner. GSLB controls both the distribution of connections to GFEs and the distribution of requests to backend services. Users can be served from backends and from GFEs running in different clusters. As well as load balancing between frontends and backends, GSLB can automatically drain traffic away from failed clusters in order to protect the health of backend servers. GCLB was built atop a combined Maglev/GFE-augmented edge network. When customers create a load balancer, GCLB provisions an anycast VIP and programs Maglev to load-balance it over all the GFEs at the edge of Google’s network.

Pokémon GO Case Study

Niantic’s Pokémon GO provides a useful case study of a real-world implementation of GCLB, including both challenges and solutions. Niantic launched Pokémon GO in the summer of 2016. It was Niantic’s very first project in collaboration with a major entertainment company and the first official smartphone Pokémon game, and it went on to become a runaway hit, far more popular than Niantic had imagined. Prior to its release, the Niantic engineering team had load-tested its software stack to process up to 5x their most optimistic traffic estimate. The requests per second (RPS) rate at launch was in fact almost 50x that estimate, large enough to present a scaling challenge for almost any software stack. The world of Pokémon GO is also highly interactive and requires near real-time updates to a state shared by all participants, which added even greater complexity.

At launch, Pokémon GO used Google’s regional Network Load Balancer (NLB) to load-balance ingress traffic across a Kubernetes cluster. Each cluster contained pods of Nginx instances, which acted as Layer 7 reverse proxies: they terminated SSL, buffered HTTP requests, and performed routing and load balancing over pods of application server backends. In order to scale to the level required, Niantic had to migrate to GCLB. Within two days of the migration, the Pokémon GO app became the single largest GCLB service.

A traffic surge 200% higher than previously observed triggered a cascade of failures, which left Niantic’s backends overloaded and extremely slow. Google took three main steps to resolve the issue: (i) isolating the GFEs that would serve Pokémon GO from the main pool of load balancers; (ii) enlarging the isolated Pokémon GO pool until it was able to handle peak traffic despite the performance problems; (iii) having Traffic SRE, in consultation with Niantic, set up administrative overrides to limit the rate of traffic the load balancers would accept for Pokémon GO, in order to contain client demand until Niantic was able to reestablish normal operation and resume scaling upward.

Google and Niantic both then made several important changes to their systems in order to future-proof their work. Google’s main modifications were to treat GFE backends as a potentially significant source of load, and to put qualification and load testing practices in place to detect GFE performance degradation caused by slow backends. Both companies also realized that, going forward, they needed to measure load as close to the client as possible.

The kinds of strategies used to manage load (load balancing, load shedding, and autoscaling) are all intended to promote the same goal: to equalize and stabilize the system load.

References: