Spotify Upgrades Kafka Based Event Delivery System

Spotify, the freemium music streaming company, has over 200 million users, 96 million of whom are paying subscribers, and a catalogue of around 40 million tracks. Spotify has recently added podcasts to its streaming platform. These numbers make it the world’s biggest music streaming platform by subscriber count. Free Spotify requires an Internet connection and comes with ads and slightly lower sound quality. Spotify Premium enables uninterrupted listening to higher-quality recordings, and the ability to download songs to any device with the Spotify app.

Spotify’s Overall Move to Google Cloud

Spotify migrated from Apache Kafka to Google Cloud Pub/Sub for its Event Delivery solution in 2016 as part of its larger switch to the Google Cloud Platform. It is unusual to see such a large company move from a home-grown infrastructure to the public cloud. However, the music streaming giant decided “it didn’t want to be in the data center business” and wanted more time to “focus on innovative features and software”. It chose Cloud Platform over the public cloud competition “after careful review and testing”.

Spotify split its migration to Cloud Platform into two different streams: a services track and a data track.

The streaming platform was running its products on a range of tiny microservices, several of which were moved from on-premise data centers into Google’s cloud using its Cloud Storage and Compute Engine, among other products. By implementing a range of Google’s storage options, Spotify’s engineers were able to focus their attention on “complex back end logic, instead of focusing on how to store the data and maintain databases”. In addition, Spotify now deploys Google’s Cloud Networking Services, including Direct Peering, Cloud VPN and Cloud Router, in order to transfer petabytes of data and ensure a “fast, reliable and secure experience” for its users worldwide.

On the data side, the company has moved from its own home-grown dashboarding tools coupled with Hadoop, MapReduce and Hive to using Google’s latest data processing tools, including Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc and Google Cloud Pub/Sub for Event Delivery.

Event Delivery at Spotify

Whenever a user listens to a particular song or podcast, Spotify records it as an event and uses it as input to find out more about that user’s preferences. An “event” is an object that the system needs to register. There are hundreds of different kinds of events to log, such as a user opening the app, skipping a song, sharing a playlist and so on. Spotify also logs events that happen at the infrastructure level, such as when a logging server runs out of disk space. Over 300 different kinds of event are collected from Spotify clients.

Event delivery is the process of ensuring that all events are transported safely from clients worldwide to a central processing system. Spotify describes its Event Delivery system as “one of the core pillars of Spotify’s data infrastructure since almost all data processing depends, either directly or indirectly, on data that it delivers”. It is important that complete data is delivered with consistent latency and made available to Spotify’s developers through a well-defined interface.

The events that move through Spotify’s event delivery system have various uses. Most of the company’s product decisions rely on the results of A/B testing, which in turn rely on accurate usage data. Delays in the delivery of data can adversely impact users’ experience, delaying popular features such as the Discover Weekly playlist, which was built using Spotify playback data. Billboard also draws on Spotify usage data as one of its sources for its top lists.

Reliably delivering and processing these events requires significant investment. Because there are so many events, often arriving in uneven bursts, modern architectures rely on a scalable queuing system to buffer them. Different systems respond to events in different ways; each system subscribes to the kinds of events it wants to hear about. To add an event to the queue, the event producer “publishes” it; each “subscriber” then receives that event. This is why this style of queueing is known as pub/sub (publish/subscribe).
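The publish/subscribe pattern described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not Spotify’s implementation; all names and event types here are invented:

```python
from collections import defaultdict, deque

class PubSub:
    """A minimal in-memory publish/subscribe queue (illustration only)."""

    def __init__(self):
        # One buffered queue per subscriber, grouped by event type, so a
        # slow subscriber does not block the others.
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type):
        """Register interest in an event type; returns the subscriber's queue."""
        queue = deque()
        self._subscribers[event_type].append(queue)
        return queue

    def publish(self, event_type, payload):
        """Deliver the event to every subscriber of this event type."""
        for queue in self._subscribers[event_type]:
            queue.append(payload)

# Usage: two downstream systems subscribe to different event types.
bus = PubSub()
playback_log = bus.subscribe("song_played")
skip_log = bus.subscribe("song_skipped")

bus.publish("song_played", {"user": "u1", "track": "t42"})
bus.publish("song_skipped", {"user": "u1", "track": "t43"})
```

A real queuing system adds the hard parts this sketch omits: durable storage, acknowledgements, and delivery across data centers.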

It is critical that Spotify’s Event Delivery system is reliable and can easily scale. Spotify’s engineering team found that one way to achieve higher scale was to break the problem out into smaller sub-problems so that instead of delivering all events through one stream, each event type is isolated and delivered independently. This has allowed them to achieve higher availability, in addition to ensuring that no “misbehaving event can affect the delivery of other events”.

The Migration from Apache Kafka to Google Cloud Pub/Sub

At the time of the migration from Apache Kafka to Google Cloud Pub/Sub, Igor Maravić, Software Engineer at Spotify, published an extensive set of blog posts describing Spotify’s “road to the cloud” – posts which we draw on in the following summary.

The Previous Delivery System and its Challenges

The system that was previously in production was built on top of Kafka 0.7 and was designed around the abstraction of hourly files. It streamed log files, which contained events, from the service machines to HDFS, where the log files were transformed from tab-separated text format to Avro format. The challenge with having events reliably persisted only after reaching Hadoop is that this makes the Hadoop cluster a single point of failure in the event delivery system: if Hadoop experiences a failure, the whole system stalls.

One of Kafka 0.7’s missing features was the ability of the Kafka Broker cluster to act as reliable persistent storage. This influenced a design decision by the Event Delivery team not to keep persistent data between the producer of data, the Kafka Syslog Producer, and Hadoop: only when a file was written to HDFS was it considered to be reliably persisted.
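The text-to-Avro conversion at HDFS amounts to parsing each tab-separated log line into a structured record before serializing it. A toy version of that parsing step might look as follows; the field layout is invented, since the source does not describe the real schema, and an actual pipeline would serialize the resulting records with an Avro library:

```python
# Hypothetical field layout for one tab-separated event log line; the
# real schema used by Spotify is not described in the source.
FIELDS = ["timestamp", "user_id", "event_type", "payload"]

def parse_log_line(line):
    """Turn one tab-separated log line into a structured record,
    analogous to the text-to-Avro conversion done at HDFS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

record = parse_log_line("2016-02-25T12:00:00\tu123\tsong_played\t{}")
```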

The main challenge with this was that the local producer needed to ensure that data persisted in HDFS at a central location before it could be reliably delivered. A producer on a server in Asia, for instance, would need to know when data had been written to disk in California. If transfers were slow, this led to significant delivery delays. When the point of handover is in a local data center, the producer design is considerably simplified as networking between hosts within a data center is typically highly reliable.

Another problem with the previous design was the lack of flexibility in managing event streams with different quality of service, since all events were sent together via the same channel. This also limited real-time use cases, as real-time consumers had to tap into the firehose that transmitted all events and filter out only the messages they were interested in.

Overall, however, Spotify’s engineering team were “quite happy to have built a system that can reliably push more than 700,000 events per second halfway across the world”. When necessary, they redesigned parts of the system to improve their software development process. The Spotify team also had access to both the hardware and the source code, as well as plenty of experience of different failure modes, so in theory they were able to find the root cause of any problem, giving them a great deal of control.

Nonetheless, Spotify found it hard to handle the consistently increasing number of delivered events. The increased load was leading to more outages, and the team began to realise that “neither we nor the system would be able to keep up with the increased load for long”. They began to hunt for a solution.

The Switch to Cloud Pub/Sub

Initially, the Event Delivery team started experimenting with a variety of new approaches to keep up with the increased number of events, including upgrading to Kafka 0.8 and embedding a simple Kafka producer in the Event Delivery Service. To guarantee the system worked end-to-end, the team embedded numerous integration tests within its CI/CD process. However, “as soon as this system started handling production traffic, it started to fall apart”.

Various headaches later (with Mirror Maker and the Kafka Producer in particular), the team found themselves at a crossroads. They saw that they would need to redefine deployment strategies for Kafka Brokers and Mirror Makers, perform capacity modeling and execute planning for all system components, in addition to exposing performance metrics to Spotify’s monitoring system. They wondered whether to make the kind of significant investment this production workload would entail and continue to adapt Kafka to work for them, or to try something else.

Meanwhile, other Spotify teams were experimenting with an array of Google Cloud products as part of the larger migration. The Event Delivery team heard about Cloud Pub/Sub and began to investigate it as a potential alternative, which “might satisfy our basic need for a reliable persistent queue”. Three primary factors played into consideration:

  • Pub/Sub was available across all Google Cloud Zones and data would be transferred via the underlying Google network;
  • A REST API meant that Spotify could write their own client library if needed;
  • Google would handle all operational responsibility, leaving the Spotify team more time and resources to concentrate on other areas.
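The second point deserves a concrete illustration. Cloud Pub/Sub’s REST API publishes messages via a `topics.publish` call whose JSON body carries base64-encoded message data, which is what makes writing a custom client feasible. The sketch below only builds the request; the project and topic names are invented, and a real client would also need OAuth2 credentials and an HTTP layer:

```python
import base64
import json

def build_publish_request(project, topic, events):
    """Build the URL and JSON body for a Cloud Pub/Sub REST
    `topics.publish` call. Message data must be base64-encoded.
    (Sketch only; a real client also attaches OAuth2 credentials.)"""
    url = (f"https://pubsub.googleapis.com/v1/"
           f"projects/{project}/topics/{topic}:publish")
    body = {
        "messages": [
            {"data": base64.b64encode(json.dumps(e).encode()).decode()}
            for e in events
        ]
    }
    return url, json.dumps(body)

# Hypothetical project/topic names for illustration.
url, body = build_publish_request(
    "my-project", "events", [{"type": "song_played", "user": "u1"}])
```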

The two main potential downsides to the migration were that (i) Spotify would “have to trust [the] operations of another organisation”; and (ii) at that time, Cloud Pub/Sub was in beta and no organisation aside from Google was using it at the scale that Spotify required.

Testing

In order to see if Cloud Pub/Sub could cope with the anticipated load, Spotify’s Event Delivery team ran a producer load test. At that time, the Spotify production load was peaking at 700K events per second; the team decided that in order to successfully anticipate future growth and potential disaster recovery situations, a production load of 2M events per second was desired. They also decided to publish this volume of traffic from a single data center to see what would happen if all the requests were hitting the Pub/Sub machines in the same zone. If this were successful, then, in theory, the company should be able to push 2M messages across all zones.

One of the first stumbling blocks was Google’s Java client, which didn’t perform well enough; however, since Pub/Sub has a REST API, Spotify was able to write its own library. The client was designed with performance as the first priority. The service was put to the test again, and 2M messages per second were pushed through the Event Service on 29 machines. Pub/Sub “passed the test with flying colours.”
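Stripped to its essentials, a producer load test of this kind batches events and measures the sustained publish rate. The toy harness below shows the shape of such a measurement; it is not Spotify’s tooling, and the local list stands in for a real publish call to a Pub/Sub topic:

```python
import time

def batch_events(events, batch_size=1000):
    """Group events into fixed-size batches, as a high-throughput
    producer would before each publish call."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def run_load_test(events, publish_batch):
    """Push every event through publish_batch and return the
    observed rate in events per second."""
    start = time.perf_counter()
    for batch in batch_events(events):
        publish_batch(batch)
    elapsed = time.perf_counter() - start
    return len(events) / elapsed

# Stand-in sink: a real test would publish each batch to Pub/Sub.
sink = []
rate = run_load_test([{"id": i} for i in range(100_000)], sink.extend)
```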

The other major test that the team ran was the consumer stability test, which involved measuring the end-to-end latency of the system under heavy load over a period of five days. The median end-to-end latency observed was around 20 seconds, including backlog recovery. They also observed that no messages were lost during the five-day period.
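End-to-end latency here means the time between a message being published and the consumer receiving it; the median is then taken over all messages. A minimal sketch of that computation, with invented timestamps, might be:

```python
import statistics

def end_to_end_latencies(publish_times, receive_times):
    """Per-message latency in seconds: the gap between publishing an
    event and the consumer receiving it, matched by index."""
    return [recv - pub for pub, recv in zip(publish_times, receive_times)]

# Invented timestamps (seconds) for three messages.
publish = [0.0, 1.0, 2.0]
receive = [19.5, 21.4, 22.1]
median_latency = statistics.median(end_to_end_latencies(publish, receive))
```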

Based on the two tests, Spotify determined that Google Pub/Sub was the right choice for them. The only capacity constraints were those explicitly set by the available quota, and most importantly, latency was “low and consistent”.

Conclusion

The Spotify Event Delivery team also moved HDFS over to Cloud Storage, and Hive to BigQuery. This led to a move to Dataflow for writing the Extract, Transform and Load (ETL) job; a choice which was “influenced by us wanting to have as little operational responsibility as possible and having others solve hard problems for us”. Dataflow is both a framework for writing data pipelines and a fully managed service for running them, and it comes with out-of-the-box support for Cloud Pub/Sub, Cloud Storage and BigQuery.
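Whatever framework runs it, an ETL job has the same three-stage shape. The plain-Python sketch below illustrates that structure only; it is not Dataflow code, and the event fields and filtering rule are invented for the example:

```python
import json

def extract(raw_lines):
    """Extract: parse raw JSON event strings (stand-in for reading
    from Cloud Pub/Sub or Cloud Storage)."""
    return [json.loads(line) for line in raw_lines]

def transform(events):
    """Transform: keep only playback events and project the fields a
    downstream table (e.g. in BigQuery) would need. Fields invented."""
    return [{"user": e["user"], "track": e["track"]}
            for e in events if e.get("type") == "song_played"]

def load(rows, sink):
    """Load: append the cleaned rows to a sink (stand-in for a
    BigQuery table write)."""
    sink.extend(rows)

table = []
load(transform(extract([
    '{"type": "song_played", "user": "u1", "track": "t1"}',
    '{"type": "app_open", "user": "u2"}',
])), table)
```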

At the time of the migration, Spotify found that the worst end-to-end latency observed with running Event Delivery on Pub/Sub was four times lower than that of the old system. Furthermore, it has seen a much lower operational overhead, allowing them “much more time to make Spotify’s products better”.
