Spotify, A Giant In The Open Source Infrastructure Movement

Luigi

Luigi is a Python module with built-in Hadoop support for that is used to build complex pipelines of batch jobs. Similar to GNU Make, Luigi allows users to visualize, chain, and automate tasks with complex dependencies–essentially managing the workflow of tasks so that developers can focus on the tasks themselves. Luigi is used at Spotify to provide the infrastructure for recommendations, A/B tests, external reports, dashboards, and other features.

Features:

Toolbox of templates for various common tasks
Support for Python mapreduce jobs in Hadoop, Hive, and Pig
File system abstractions for HDFS and local files that ensure file systems operations are atomic to prevent the pipeline from crashing in a state containing partial data

Sparkey

Sparkey is a simple persistent key/value store, similar to bitcask and optimized for read-heavy systems with infrequent bulk writes. It is implemented in C, but has Python bindings and a Java port, and is data agnostic. Spotify uses Sparkey for serving large amounts of static data to users with low latency and high throughput. Their goal in building Sparkey was to create a key-value store with a random read throughput that was similar to Tokyo Cabinet and CDN, high throughput on bulk writes, a low overhead, and capable of storing massive amounts of data. Sparkey replaced Spotify’s reliance on CDB after their data reached the 4GB limit and replaced their use of Tokyo Cabinet for writing datasets too large to keep in memory.

Features:

Supports up to 2^63 – 1 bytes of data
Allows for an unlimited number of concurrent independent readers
Low overhead
Support for block level compression

Annoy

Annoy is a C++ library with Python bindings for performing nearest neighbor searches. Annoy is a fast library with a minimal memory footprint. It also has the ability to use static files as indexes so that users can share an index across different processes. Indexes can also be passed around as files and quickly mapped to memory, as Annoy decouples the process of creating files from the loading process. At Spotify, Annoy is used for music recommendations, where the optimal memory usage allows for searching similar users and items across millions of tracks in a high-dimensional space.

Features:

Performs well up to 1,000 dimensions
Minimal memory usage
Shares memory between multiple processes
Decouples index creation and lookup
Native Python support

Spydra

Spydra is a tool that enables ephemeral Hadoop clusters in Google Cloud Platform while simplifying troubleshooting and automating cluster lifecycle management. With Spydra, users create ephemeral clusters for executing a single job as well as long-living static clusters. Spydra was designed as a result of scaling and maintaining a Hadoop cluster of over 2500 nodes that runs 20,000 independent jobs per day. It supports the dual submission of data processing jobs to Dataproc and existing on-premise Hadoop infrastructure to simplify switching between the two. Spydra is currently in beta and used in production at Spotify to help with migrating its data infrastructure to GCP.

Features:

Supports dual submission of jobs from Dataproc to on-premise Hadoop infrastructure and forwards GCP credentials from Dataproc to Hadoop by default
Stores job execution data on GCS for after ephemeral clusters are shut down
Experimental autoscaler monitors resources and scales according to user-defined parameters with preemptible VMs
Experimental support for cluster pooling within a single GCP project

Event Delivery

The Event Delivery System collects and delivers each generated event in Spotify’s system to a variety of storage implementations such as GCS, Big Query, Hadoop and Hive. For each delivery, events need to be deduplicated, delivered to hourly buckets and published to Cloud Pub/Sub, a GCS service for streaming data. Events are then consumed from Cloud Pub/Sub streams, grouped into hourly buckets, deduplicated and delivered to Cloud Storage. Spotify uses a dedicated Extract Transform Load (ETL) process for consuming data streams for each event type, built as a set of micro services and batch jobs that include:

Consumer, an Apollo service deployed with Helios that consumes data from Cloud Pub/Sub streams
Completionist, another Apollo service deployed with Helios, which keeps track of files written with Consumer and detects when it is safe to close hourly buckets
Deduper, which deduplicates data in hourly buckets as determined by queries to Completionist
Styx, which scheduled Deduper jobs for execution on Dataproc clusters and schedules batch jobs to export data from GCS to other storage systems

GCP Audit

Similar to the Scout2 auditing tool for AWS, GCP Audit is a tool for auditing security in GCP projects. With GCP Audit, analysts can scan projects for common security issues as defined by an expandable internal repository. GCP Audit supports rules definitions in JSON or YAML and applies rules filters to Google’s API responses. Any issues are written to a report that details which objects the issues were found in. GCP Audit is currently in alpha status and undergoing testing in Spotify’s production environment.

Features:

Rules are written in JSON or YAML
Supports Python versions 2.7+

GCP Firewall Enforcer

Scaling a firewall policy across different projects and ephemeral services proved a difficult task, so Spotify created the GCP Firewall Enforcer as a toolbox to simplify the enforcement of consistent firewall rules. GCP Firewall Enforcer is currently in alpha status and is used by Spotify to maintain control of their firewall policies while conducting reviews by network specialists.

Features:

Easily retrieves the current rules set from GCP projects
Quickly detects and automatically fixes accidental firewall alterations
Allows users to easily monitor and investigate network problems

Spotify, A Giant In The Open Source Infrastructure Movement

Categories

Luigi

Sparkey

Annoy

Spydra

Event Delivery

GCP Audit

GCP Firewall Enforcer