With a massive catalog of music, a large base of subscribers and myriad features for them to utilize, Spotify’s infrastructure needs are significant. Each day, they handle 100 billion user-generated events of over 300 different event types, which must not only be processed but used to train the machine learning algorithms that generate personalized recommendations, radio play, and Discover pages for each user.
In order to support this kind of workflow, Spotify has developed and made publicly available a number of tools for storage, event delivery, and security. This post will outline several of these tools, their features, and how Spotify uses them to deliver content to over 50 million paid subscribers and over 140 million monthly active users.
Luigi is a Python module with built-in Hadoop support for that is used to build complex pipelines of batch jobs. Similar to GNU Make, Luigi allows users to visualize, chain, and automate tasks with complex dependencies–essentially managing the workflow of tasks so that developers can focus on the tasks themselves. Luigi is used at Spotify to provide the infrastructure for recommendations, A/B tests, external reports, dashboards, and other features.
- Toolbox of templates for various common tasks
- Support for Python mapreduce jobs in Hadoop, Hive, and Pig
- File system abstractions for HDFS and local files that ensure file systems operations are atomic to prevent the pipeline from crashing in a state containing partial data
Sparkey is a simple persistent key/value store, similar to bitcask and optimized for read-heavy systems with infrequent bulk writes. It is implemented in C, but has Python bindings and a Java port, and is data agnostic. Spotify uses Sparkey for serving large amounts of static data to users with low latency and high throughput. Their goal in building Sparkey was to create a key-value store with a random read throughput that was similar to Tokyo Cabinet and CDN, high throughput on bulk writes, a low overhead, and capable of storing massive amounts of data. Sparkey replaced Spotify’s reliance on CDB after their data reached the 4GB limit and replaced their use of Tokyo Cabinet for writing datasets too large to keep in memory.
- Supports up to 2^63 – 1 bytes of data
- Allows for an unlimited number of concurrent independent readers
- Low overhead
- Support for block level compression
Annoy is a C++ library with Python bindings for performing nearest neighbor searches. Annoy is a fast library with a minimal memory footprint. It also has the ability to use static files as indexes so that users can share an index across different processes. Indexes can also be passed around as files and quickly mapped to memory, as Annoy decouples the process of creating files from the loading process. At Spotify, Annoy is used for music recommendations, where the optimal memory usage allows for searching similar users and items across millions of tracks in a high-dimensional space.
- Performs well up to 1,000 dimensions
- Minimal memory usage
- Shares memory between multiple processes
- Decouples index creation and lookup
- Native Python support
Spydra is a tool that enables ephemeral Hadoop clusters in Google Cloud Platform while simplifying troubleshooting and automating cluster lifecycle management. With Spydra, users create ephemeral clusters for executing a single job as well as long-living static clusters. Spydra was designed as a result of scaling and maintaining a Hadoop cluster of over 2500 nodes that runs 20,000 independent jobs per day. It supports the dual submission of data processing jobs to Dataproc and existing on-premise Hadoop infrastructure to simplify switching between the two. Spydra is currently in beta and used in production at Spotify to help with migrating its data infrastructure to GCP.
- Supports dual submission of jobs from Dataproc to on-premise Hadoop infrastructure and forwards GCP credentials from Dataproc to Hadoop by default
- Stores job execution data on GCS for after ephemeral clusters are shut down
- Experimental autoscaler monitors resources and scales according to user-defined parameters with preemptible VMs
- Experimental support for cluster pooling within a single GCP project
The Event Delivery System collects and delivers each generated event in Spotify’s system to a variety of storage implementations such as GCS, Big Query, Hadoop and Hive. For each delivery, events need to be deduplicated, delivered to hourly buckets and published to Cloud Pub/Sub, a GCS service for streaming data. Events are then consumed from Cloud Pub/Sub streams, grouped into hourly buckets, deduplicated and delivered to Cloud Storage. Spotify uses a dedicated Extract Transform Load (ETL) process for consuming data streams for each event type, built as a set of micro services and batch jobs that include:
- Consumer, an Apollo service deployed with Helios that consumes data from Cloud Pub/Sub streams
- Completionist, another Apollo service deployed with Helios, which keeps track of files written with Consumer and detects when it is safe to close hourly buckets
- Deduper, which deduplicates data in hourly buckets as determined by queries to Completionist
- Styx, which scheduled Deduper jobs for execution on Dataproc clusters and schedules batch jobs to export data from GCS to other storage systems
Similar to the Scout2 auditing tool for AWS, GCP Audit is a tool for auditing security in GCP projects. With GCP Audit, analysts can scan projects for common security issues as defined by an expandable internal repository. GCP Audit supports rules definitions in JSON or YAML and applies rules filters to Google’s API responses. Any issues are written to a report that details which objects the issues were found in. GCP Audit is currently in alpha status and undergoing testing in Spotify’s production environment.
- Rules are written in JSON or YAML
- Supports Python versions 2.7+
Scaling a firewall policy across different projects and ephemeral services proved a difficult task, so Spotify created the GCP Firewall Enforcer as a toolbox to simplify the enforcement of consistent firewall rules. GCP Firewall Enforcer is currently in alpha status and is used by Spotify to maintain control of their firewall policies while conducting reviews by network specialists.
- Easily retrieves the current rules set from GCP projects
- Quickly detects and automatically fixes accidental firewall alterations
- Allows users to easily monitor and investigate network problems