Facebook Sonar Edge Architecture: Confining The Domain of BGP


Facebook's network is an integral part of its overall infrastructure, and the company innovates continually to maintain high performance amid continual growth. As of the third quarter of 2017, Facebook had 2.07 billion monthly active users. Over 80% of those users are outside the U.S. and Canada, and all of them connect through Facebook's edge networks.

Like other CDNs, Facebook serves static content, but it also has a distinctly dynamic component: the constantly changing News Feed, likes and other signals, which must stay up to date to accurately represent its users. It is also expanding into more interactive 360-degree and VR applications, while the volume of photos and videos it must deliver keeps growing.

Five years ago, when Facebook surpassed the 1 billion user mark, it relied on a global network of points of presence (PoPs), each connected to tens or hundreds of networks. However, wherever you were in the world, your TCP connection would terminate in a U.S. data center, which led to poor service and slow speeds worldwide. Facebook gradually built out its own in-house hardware and software systems to scale its global network, adding edge PoPs, edge routers and edge servers worldwide so that TCP connections could terminate anywhere. In doing so, it dramatically increased connection speeds globally.

Simultaneously, Facebook’s engineering team has had to reconsider how to build its networks from a protocol perspective.

Software Stack Changes

Facebook's software stack has changed, allowing it to evolve beyond BGP. Most networks run on BGP: it is how they distribute traffic and load balance. Facebook does the same, but its network engineers realized they needed a different mechanism to serve users worldwide. They set out to build a system that could measure actual network service "closeness": the latency to users from every PoP in the world. The result was Sonar, which continually gathers data on the latency from nearly every network in the world to every one of Facebook's edge clusters. This dataset is one of the inputs to its global controller system.
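
To make the idea concrete, the sketch below shows the kind of dataset a Sonar-like system might produce; the names, schema and numbers are assumptions for illustration, not Facebook's actual implementation. Latency samples keyed by (client prefix, edge PoP) are reduced to a median that a controller can consume.

```python
from collections import defaultdict
from statistics import median

# Hypothetical Sonar-style dataset: latency samples (ms) keyed by
# (client network prefix, edge PoP). Names are illustrative only.
samples = defaultdict(list)

def record_sample(client_prefix: str, pop: str, rtt_ms: float) -> None:
    """Store one latency measurement from a client network to an edge PoP."""
    samples[(client_prefix, pop)].append(rtt_ms)

def latency_map() -> dict:
    """Reduce raw samples to a median latency per (prefix, PoP) pair."""
    return {key: median(values) for key, values in samples.items()}

# Example measurements from two client networks to two PoPs.
record_sample("203.0.113.0/24", "pop-ams", 18.2)
record_sample("203.0.113.0/24", "pop-iad", 95.4)
record_sample("198.51.100.0/24", "pop-ams", 140.1)
record_sample("198.51.100.0/24", "pop-iad", 22.7)

print(latency_map())
```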

The global controller system manages the ingress traffic load for Facebook's networks. It has visibility into many metrics: the interface utilization of every peering router in the world, the BGP tables of every router, the Sonar data and the health of Facebook's servers. From these it builds a DNS map that serves users' static content from the PoPs closest to them, handing out different resolutions to send users to different PoPs, either via simple dynamic DNS resolution or via the generation of unique URLs per user.
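
As a rough illustration of the decision the controller makes, the sketch below picks, for each client prefix, the healthy PoP with the lowest Sonar latency and emits that as a DNS map. The inputs, names and health model are assumptions, not Facebook's implementation.

```python
# Sketch: build a DNS map from Sonar latencies and PoP health.
# All names and inputs are assumptions for illustration.
sonar_latency = {
    ("203.0.113.0/24", "pop-ams"): 18.2,
    ("203.0.113.0/24", "pop-iad"): 95.4,
    ("198.51.100.0/24", "pop-ams"): 140.1,
    ("198.51.100.0/24", "pop-iad"): 22.7,
}
pop_healthy = {"pop-ams": True, "pop-iad": True}

def build_dns_map(latency, healthy):
    """Map each client prefix to its lowest-latency healthy PoP."""
    best = {}
    for (prefix, pop), rtt in latency.items():
        if not healthy.get(pop, False):
            continue
        if prefix not in best or rtt < best[prefix][1]:
            best[prefix] = (pop, rtt)
    return {prefix: pop for prefix, (pop, _) in best.items()}

print(build_dns_map(sonar_latency, pop_healthy))
# {'203.0.113.0/24': 'pop-ams', '198.51.100.0/24': 'pop-iad'}
```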

In the old BGP model, anycast pulls traffic into a PoP as users wake up. If the PoP is above capacity, there is nothing to be done: service is simply poor. With a global controller that can shift its DNS maps, however, Facebook can pull users away from one area and serve them from somewhere else, at both a regional and a global level.
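
The practical difference from pure anycast can be sketched as follows, with assumed capacities, loads and PoP rankings: when a PoP lacks the headroom for a prefix's load, the controller shifts that prefix to its next-best PoP rather than letting the overloaded PoP degrade.

```python
# Sketch: shift load off an over-capacity PoP to each prefix's next-best
# choice. Capacities, loads, and rankings are assumed for illustration.
capacity = {"pop-ams": 100, "pop-iad": 100}          # arbitrary load units
prefix_load = {"203.0.113.0/24": 80, "198.51.100.0/24": 60}
ranked_pops = {                                      # best-first per prefix
    "203.0.113.0/24": ["pop-ams", "pop-iad"],
    "198.51.100.0/24": ["pop-ams", "pop-iad"],
}

def assign(capacity, prefix_load, ranked_pops):
    """Greedily assign each prefix to its best PoP that still has headroom."""
    remaining = dict(capacity)
    dns_map = {}
    for prefix, load in sorted(prefix_load.items(), key=lambda kv: -kv[1]):
        for pop in ranked_pops[prefix]:
            if remaining[pop] >= load:
                remaining[pop] -= load
                dns_map[prefix] = pop
                break
    return dns_map

print(assign(capacity, prefix_load, ranked_pops))
# {'203.0.113.0/24': 'pop-ams', '198.51.100.0/24': 'pop-iad'}
```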

How does this global targeting system interoperate with the old paradigm of a global BGP routing table? Originally, they were unaware of each other. Facebook could send traffic into its network in Europe and egress it on the other side of the world. BGP let Facebook enforce simple rules, e.g. prefer this peer for this set of prefixes. However, as the Sonar dataset grew, so did confidence in the global controller, and Facebook saw the opportunity to solve another problem: improving performance not only when ingressing into the network, but also when egressing out of it. The problem became particularly apparent during a serious outage, when multiple diverse backbone fibre paths all failed on the same day, causing a dramatic outage for a significant region. Facebook's global controller shifted traffic away from the region, but its simplistic BGP rules simply pulled the traffic around the world across the broken backbone, further worsening service. Facebook needed a new system.

Confine The Domain of BGP

They needed to confine the domain of BGP, breaking it apart into city-level peering domains. In this way, the network engineers empowered the global controller to decide not just how to ingress traffic into the network, but also how to egress traffic out of it. Arriving at this solution took a long time, however: it was an iterative process of trial and error, as Facebook network engineer James Quinn discussed in his talk, Being Open: How Facebook Got Its Edge, at NANOG 68.
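
One way to picture city-level peering domains is the conceptual sketch below; it is not Facebook's design, and all names are illustrative. Routes learned from peers are usable for egress only in the city where they were learned, and it is the controller, not global BGP best-path selection, that decides which city's peering actually carries the traffic.

```python
# Conceptual sketch of city-level peering domains. A peer route is only a
# candidate for egress in the city where it was learned; the controller
# decides which city actually carries the traffic. Names are illustrative.
peer_routes = [
    {"prefix": "192.0.2.0/24", "city": "amsterdam", "peer": "peerA"},
    {"prefix": "192.0.2.0/24", "city": "singapore", "peer": "peerB"},
]

def egress_candidates(prefix, city, routes):
    """Return only routes learned in this city: BGP's domain is the city."""
    return [r for r in routes if r["prefix"] == prefix and r["city"] == city]

def controller_pick(prefix, routes, performance):
    """The global controller picks the egress city by measured performance."""
    usable = [r for r in routes if r["prefix"] == prefix]
    return min(usable, key=lambda r: performance[(r["city"], prefix)])

performance = {("amsterdam", "192.0.2.0/24"): 12.0,
               ("singapore", "192.0.2.0/24"): 180.0}
print(egress_candidates("192.0.2.0/24", "amsterdam", peer_routes))
print(controller_pick("192.0.2.0/24", peer_routes, performance))
```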

BGP solves the basic problem of forwarding well. Software is flexible, but it can easily do the wrong thing. Quinn and his colleagues found that "it was an iterative process of learning, finding mistakes and gaining confidence in the software, slowly confining the domain of BGP and extending the power of the global controller and its ability to influence and improve the performance of its servers".

The iterative process started with host routing: after initially feeling they had complete control over everything, the engineers found it wasn't so effective in practice, running into problems such as how to signal from the traffic itself where it should egress. Then they tried MPLS, but hit kernel support problems and issues with router features. Next they tried PBR, but found its most fundamental problems were the limits of platform diversity and the router configuration state it required. DSCP is the simplest approach, and they spent the most time with it, but found it is not flexible enough. Then they tried the GRE key, but ran into repeated router bugs.
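
Of the mechanisms above, DSCP is the easiest to illustrate: the host marks the DSCP bits on its outgoing packets so that routers can match the marking in policy and steer the traffic toward the desired egress. The sketch below is a generic Linux example of setting a DSCP value on a TCP socket via the TOS byte, not Facebook's code, and the value chosen is arbitrary.

```python
import socket

# DSCP occupies the upper six bits of the IP TOS byte, so shift left by two.
# The value 32 (CS4) is an arbitrary example, not a value Facebook uses.
DSCP = 32
TOS = DSCP << 2

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)
# From here, routers along the path can match this marking in policy and
# steer the flow to a chosen egress. The appeal is simplicity; the drawback,
# as noted above, is the small number of distinct values and hence limited
# flexibility.
```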

The Facebook culture is a buoyant, dynamic one by all accounts. As Quinn put it, “We expect to make mistakes, we try to fail fast and learn from it; and iterate and change directions when we need to.” Most importantly from this iterative process, they learned “to keep it simple”.

What is the simplest irreducible element of an edge network? BGP. Facebook evolved beyond BGP with BGP. What does this mean? The most basic problem to solve is between the peering router and a peer: your BGP attributes say this is the best path, but there is still too much traffic and you are dropping bits. Facebook realized it needed to team BGP with its data. Via Sonar, it has all this data: BMP exports of its routing tables, NetFlow data showing utilization for every single prefix, link counters showing interface utilization and packet drops. All of this can be pulled into the controller software to decide where the traffic should actually go, unconstrained by BGP's own path selection, and the controller can then simply inject that decision as a BGP route into its routing tables. This is what Facebook does today.
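
A rough sketch of that loop is below; the inputs, thresholds and the ExaBGP-style announce command are assumptions used for illustration, not Facebook's implementation. The controller looks at per-peer utilization and drops, picks the egress next hop it wants for a prefix, and expresses the decision as an ordinary BGP route announcement.

```python
# Sketch: pick an egress next-hop per prefix from measured link health,
# then inject the decision as a BGP route. The "announce route ..." line
# follows ExaBGP's text API as one possible injection mechanism; all
# numbers and names are assumptions for illustration.
links = {
    "peerA": {"next_hop": "192.0.2.1", "utilization": 0.95, "drops": 120},
    "peerB": {"next_hop": "192.0.2.9", "utilization": 0.40, "drops": 0},
}

def choose_next_hop(links, max_util=0.85):
    """Prefer a link under the utilization threshold with the fewest drops."""
    usable = {n: l for n, l in links.items() if l["utilization"] < max_util}
    pool = usable or links  # fall back to the least-bad link if all are hot
    best = min(pool.values(), key=lambda l: (l["drops"], l["utilization"]))
    return best["next_hop"]

def announce(prefix, next_hop):
    """Emit an ExaBGP-style announce command for the chosen egress."""
    return f"announce route {prefix} next-hop {next_hop}"

print(announce("203.0.113.0/24", choose_next_hop(links)))
# announce route 203.0.113.0/24 next-hop 192.0.2.9
```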

