This entry in our video series on distributed computing will summarize how Facebook has scaled their memcache to support a massive, globally distributed read load of a billion reads per second. Rajesh Nishtala, a member of the team that developed Facebook’s memcache infrastructure, has explained how Facebook scaled from a system with a few databases and no memcache, to a few memcache servers, to clusters of memcache servers that are globally distributed.
Memcache is a memory caching systems used to reduce the number of reads to a database. Because Facebook is a read-heavy site with over 1.79 billion users active each month, memcache is used to significantly improves speed and helps to prevent database crashes. Facebook’s front-end clusters consist of several web servers and several memcache servers, which write and read updates to a storage cluster containing several databases. However, this setup can produce various problems such as incast congestion, all-to-all limits with horizontal scaling, consistent caching, and too many packets.
Below, we have summarized Rajesh’s primary talking points about the problems that occurred at each stage of scaling and the solutions that were developed to ensure a consistent user experience despite heavy reads across widely dispersed geographical areas.
- 0:48 – Lists the infrastructure requirements for Facebook, including real-time communication, aggregating content from multiple sources on the fly, accessing and updating popular shared content, and processing millions of user requests per second.
- 1:17 – Facebook’s design requirements included support for a heavy read load, global distribution, a constantly evolving product, and persistence handled outside the system that allows the cache to be evicted at different times.
- 2:25 – Describes three different steps to scaling: a single front-end cluster, which must support a read-heavy workload, wide fanout, and handling failures; multiple front-end clusters, which need to control data replication and ensure consistency; and multiple front-end clusters spread across various geographical regions, which need to ensure data consistency.
- 2:51 – Describes the set-up before Facebook implemented memcache, consisting of only databases and servers, with data sharded over various databases based on a user ID. As Facebook grew, memcache was required to handle high fanout and multiple rounds of data fetching to render a page for each request.
- 4:03 – As memcache servers were added to improve capacity, Facebook began to store data in a look-aside cache, where servers would attempt to get information from the memcache first. If the information was not in the cache, the request would return to the server and go to the database to fetch the information and refill the cache.
- 4:37 – Updates were handled by invalidating the memcache after each database write. It was up to the web application to specify which keys to invalidate after a database update.
- 5:12 – This means of updating caused a problem with stale sets, leading to inconsistencies between the memcache and database. To solve this problem, Facebook’s team extended the memcache protocol by attaching a 64-bit lease ID to each miss. The lease ID was invalidated inside the server after a delete and the set was disallowed if the lease ID was invalidated at the server.
- 6:25 – Describes another problem that occurs in this set-up when a popular story or video is updated in the database. As many users attempt to access the content, which is deleted from the memcache during the database update, they are forced to read from the database, which can cause it to crash. This problem was solved with another small extension to the leases which allowed the memcache server to arbitrate reads to the database.
- 7:43 – With the next step of scaling, more memcache servers were added to each cluster to increase read capacity. Items were then distributed across multiple memcache servers using consistent hashing on key. However, this resulted in a strain on the network as all-to-all communication took place.
- 9:05 – The incast congestion problem was solved by limiting the number of outstanding requests that could take place using a sliding window. To avoid congestion, the window would close, forcing more round-trips, whereas the window would open during less congested times for fewer trips and faster delivery.
- 9:11 – The next step in scaling involved deploying many memcache servers in multiple clusters to accommodate hundreds of millions of requests per second. Whereas horizontal scaling would have become impossible due to the all-to-all limits in the previous set-up, adding clusters solved this problem.
- 10:13 – Cache consistency created issues as more clusters were added, so Facebook implemented a system called McSqueal that looks at each write, extracts items that need to be invalidated from the memcache, and broadcasts the invalidation to all front-end clusters.
- 11:09 – To handle the high number of packets being transported, Facebook routed database instances with McSqueal running on top of them to a layer of memcache routers. These routers fanned out the deletes into the various front-end clusters which broadcast to their own memcache servers. This system resulted in a packet reduction of 18x its previous size.
- 12:18 – Currently, Facebook is scaled to support worldwide distribution and a billion requests per second. To do this, they have created replica databases in diverse geographical areas.
- 13:00 – One difficulty with adding replica databases occurs when handling writes outside the master database. If a user writes to the master database, the old value will be deleted from the memcache. This leads to a race between the database replication and the next database read, which could cause a permanent inconsistency if a user across the country sends the stale value back into the cache by reading to the replica database before it is updated.
- 14:20 – Remote markers were added to address this problem. These markers are made to the memcache when a database write is requested. The information is then written to the master database, deleted from the memcache, and updated in the replica database before finally deleting the remote marker. If a read is attempted while the marker is set, it will be directed to the master, rather than reading from the replica.