Facebook is open sourcing Katran, the innovative load balancer that powers its platform and prevents it from crashing and burning.
Katran combines a software-based approach to load balancing with a reengineered forwarding plane that leverages two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine. Katran runs on backend servers in Facebook’s points of presence (PoPs) worldwide. The company says the deployment of Katran has “helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets”. In describing its goals for open sourcing it, Facebook says, “By sharing it with the open source community, we hope others can improve the performance of their load balancers and also use Katran as a foundation for future work.”
Katran has helped the company manage traffic at massive scale, both by distributing the workload efficiently among its numerous backend servers and by helping confront “the challenge of making the large fleet of (backend) servers appear as a single unit to the outside world”. To suit Facebook’s needs, a network load balancer must be able to do several things:
- Run on commodity Linux servers
- Exist in parallel with other services on any given server
- Minimize disruption during maintenance and upgrades
- Provide straightforward instrumentation and debugging
Facebook’s engineering team has worked for years on successive iterations of a high-performance software network load balancer that can satisfy all these needs. The overall architecture of Katran is similar to that of the first-generation L4LB that Facebook developed. Its main differences are (i) simpler and more efficient packet handling; (ii) more stable and less expensive hashing; (iii) a more resilient local state, with a runtime “compute only” switch that ignores the LRU cache altogether in the event of catastrophic memory pressure on the host; and (iv) RSS-friendly encapsulation.
These new features dramatically increase performance, flexibility and scalability of the load balancer.
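To make the hashing and “compute only” ideas more concrete, here is a minimal Python sketch. It is not Katran’s code: Katran runs as an eBPF program in the kernel, and this sketch swaps in rendezvous (highest-random-weight) hashing as a simple stand-in for a stable hash. The names `pick_backend`, `FlowTable` and `compute_only`, and the example addresses, are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import OrderedDict


def pick_backend(flow, backends):
    """Rendezvous hashing: score every backend for this flow, keep the highest."""
    def score(backend):
        digest = hashlib.sha256(f"{flow}|{backend}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(backends, key=score)


class FlowTable:
    """Hypothetical per-host flow table: an LRU cache in front of the hash."""

    def __init__(self, backends, capacity=1024):
        self.backends = list(backends)
        self.capacity = capacity
        self.cache = OrderedDict()   # flow -> backend, oldest entry first
        self.compute_only = False    # runtime switch: bypass the cache entirely

    def lookup(self, flow):
        if self.compute_only:
            # Under memory pressure, recompute the hash and touch no state.
            return pick_backend(flow, self.backends)
        if flow in self.cache:
            self.cache.move_to_end(flow)      # mark as most recently used
            return self.cache[flow]
        backend = pick_backend(flow, self.backends)
        self.cache[flow] = backend
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return backend


table = FlowTable(["10.0.0.1", "10.0.0.2", "10.0.0.3"], capacity=2)
flow = ("198.51.100.7", 40312, "203.0.113.1", 443, "tcp")
cached = table.lookup(flow)
table.compute_only = True
# With an unchanged backend list, the hash alone gives the same answer.
assert table.lookup(flow) == cached
```

Because the hash is deterministic, dropping the cache costs extra computation per packet but does not move existing flows as long as the backend list stays the same. With rendezvous hashing, removing a backend only remaps the flows that were assigned to it.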
Along with the release of Katran, Facebook announced details of its Zero Touch Provisioning tool, which it is deploying to help engineers automate large parts of the work necessary to build its backbone networks. Earlier in May, the company also open sourced PyTorch, the software that powers its machine learning and AI projects.
Facebook has had no choice but to become a software company in addition to being the world’s largest social network and a huge player in advertising. Few companies operate at Facebook’s scale or face the same challenges in handling the traffic patterns of their own networks, so it made sense for the company to develop the software it needed in-house.
Fortunately, Facebook is generally open to sharing its software developments. Indeed, in 2011, the social network was one of the founding members of the Open Compute Project. The Open Compute Project describes itself as “a rapidly growing, global community whose mission is to design, use, and enable mainstream delivery of the most efficient designs for scalable computing”. The group’s members believe that by sharing IP, they will maximize innovation and reduce complexity in tech components.