Netflix and Mantis

The mantis shrimp is a study in contrasts. It’s a small yet immensely strong shrimp, able to support weight far greater than its body size might make you think, but above all else it has incredible vision. Humans have only 3 types of photoreceptors in their eyes; the mantis shrimp has 16. And it’s this shrimp that helped inspire Netflix’s stream-processing platform, Mantis.

Over the past 8 years, Netflix has grown to more than 75 million members in over 190 countries, who watch 125 million hours of content every day. To support this customer base and maintain their standing in the evolving world of video streaming, identifying issues in their system is critical. For the Netflix team, spotting issues at the application and service level has always been relatively easy; where they have struggled is with issues affecting individual devices.

Mantis gives Netflix teams access to real-time events and lets them build low-latency, high-throughput stream-processing apps on top of those events. This helps them detect and mitigate specific issues across various regions and devices.

The problem was that Netflix produces billions of events and metrics every day, yet that data wasn’t processed in a way that made it useful. Teams were constantly left without the right data to address the problems at hand. Mantis helps them build highly granular, real-time insight applications that give the Netflix team more visibility into the interactions between their devices and their AWS services. There’s a long path through the system where any number of failures could occur, and Mantis helps Netflix see it more clearly.

Mantis Architecture

Mantis was built 100% with the cloud in mind, helping to reduce operational overhead and developer hours. Mantis manages a cluster of servers that run stream-processing jobs. Apache Mesos is used to create a shared pool of computing resources, and an open-source library called Fenzo allocates those resources among the different jobs.

Inside the Mantis architecture there are two main clusters that manage the job processes.

Master Cluster: The master cluster consists of the managing components that coordinate the flow of all the work.
- Resource Manager: assigns resources to a job worker using Fenzo
- Job Manager: manages the operations of a job, dealing with metadata, SLAs, artifact locations, job topology and life cycle

Agent Cluster: When a user submits a stream-processing job, the job runs as one or more workers on the agent cluster.
- Instances: the agent cluster consists of multiple instances in pools running the jobs

Mantis Jobs

Mantis defines a stream-processing job in two forms:

Single-stage Job: useful for basic transformation/aggregation use cases
Multi-stage Job: useful for processing high-volume, high-cardinality event streams

Within a Mantis job there are three main parts that contribute to the job performance.

Source: fetches data for a stream
Stage: processes the event stream by applying RxJava operators such as map, scan, and reduce
Sink: collects and outputs the processed data
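
To make the three parts above more concrete, here is a minimal sketch of a single-stage job written with plain Python generators. It is not the Mantis API (Mantis jobs are built on RxJava); the function names, the event fields `device` and `status`, and the in-memory event list are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]  # hypothetical event shape: {"device": ..., "status": ...}


def source(raw_events: Iterable[Event]) -> Iterator[Event]:
    """Source: fetch events for the stream (here, from an in-memory iterable)."""
    for event in raw_events:
        yield event


def stage(events: Iterable[Event]) -> Iterator[dict]:
    """Stage: keep only error events and emit a running error count per device."""
    counts: Dict[str, int] = {}
    for event in events:
        if event["status"] != "error":
            continue  # drop events that are not errors
        device = event["device"]
        counts[device] = counts.get(device, 0) + 1  # scan-style running total
        yield {"device": device, "errors_so_far": counts[device]}


def sink(results: Iterable[dict]) -> List[dict]:
    """Sink: collect the processed output (here, into a list)."""
    return list(results)


def run_job(raw_events: Iterable[Event]) -> List[dict]:
    """Wire source -> stage -> sink together as one single-stage job."""
    return sink(stage(source(raw_events)))
```

Because each part only consumes an iterable and produces another, a multi-stage job could be sketched the same way by placing additional stage functions between the source and the sink.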

Features

With this architecture, a job is broken up into three parts, which makes it easier to delegate and manage the flow of each event and to make sure every event has been properly processed and handled. Another feature of Mantis is job chaining, which allows for efficient reuse of data and code. With this feature, Netflix teams can use the output of several different jobs and sources, combine that data and use it to perform more complex functions.

Mantis is also very efficient when it comes to scaling, with the ability to autoscale both the cluster size and individual jobs as needed during certain peak hours. In order to save on expenses, Fenzo will autoscale the Apache Mesos worker cluster by adding or removing instances relative to the demand.
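
As a rough illustration of scaling with demand, the sketch below computes how many workers a job would need for a given event rate. It is not how Fenzo or Mantis actually decides to scale; the function name, the per-worker capacity figure and the min/max bounds are hypothetical.

```python
import math


def desired_workers(
    events_per_second: float,
    capacity_per_worker: float,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Return the worker count needed for the current load, clamped to the bounds."""
    if capacity_per_worker <= 0:
        raise ValueError("capacity_per_worker must be positive")
    # Round up so a partially loaded worker still counts as a whole worker.
    needed = math.ceil(events_per_second / capacity_per_worker)
    # Never scale below the floor or above the ceiling.
    return max(min_workers, min(max_workers, needed))
```

Under these assumptions, a job receiving 2,500 events per second with workers that each handle 1,000 would be sized at 3 workers, and would shrink back toward the minimum as traffic falls off after peak hours.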

To help manage and configure jobs across all global regions, Mantis supports a dedicated UI and API. The UI feature allows the users to directly interact with jobs and platform functionality, while the API enables easy integration across networks.

Benefits

Mantis jobs have helped to round out Netflix’s system by processing events from about 20 different data sources, including API, personalization, device logging and so on. One of the biggest benefits they’ve seen from Mantis is the ability to apply their alerting to individual video titles across the globe.

What they are able to do now is detect anomalies by tracking windows of unique events based on the titles of specific shows/episodes. If a threshold is reached that shows a significant percentage of anomalies in a specific region or for certain content, then Netflix is notified. This helps them to recognize and address problems that they previously had little insight or visibility into, while locating the root causes.
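
The windowed threshold check described above can be sketched roughly as follows. This is only an illustration, not Netflix’s implementation: the event tuple layout, the fixed 60-second windows, the 20% threshold and the minimum event count are all assumed values.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical event shape: (timestamp_seconds, title, region, is_anomaly)
PlaybackEvent = Tuple[float, str, str, bool]
Alert = Tuple[int, str, str, float]  # (window_start, title, region, anomaly_ratio)


def titles_over_threshold(
    events: Iterable[PlaybackEvent],
    window_seconds: int = 60,
    threshold: float = 0.2,
    min_events: int = 5,
) -> List[Alert]:
    """Group events into fixed windows per title and region, and flag
    windows whose share of anomalous events reaches the threshold."""
    totals = defaultdict(int)
    anomalies = defaultdict(int)
    for ts, title, region, is_anomaly in events:
        # Align each timestamp to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        key = (window_start, title, region)
        totals[key] += 1
        if is_anomaly:
            anomalies[key] += 1

    alerts: List[Alert] = []
    for key in sorted(totals):
        total = totals[key]
        if total < min_events:
            continue  # too few events to judge this window
        ratio = anomalies[key] / total
        if ratio >= threshold:
            alerts.append((*key, ratio))
    return alerts
```

Keying the counts by window, title and region is what lets an alert point at a specific show in a specific region rather than at the service as a whole.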

Mantis also allows alerting to happen not only at the title level, but on their performance indicator SPS, which stands for “starts per second.” This metric measures the number of people that hit play on a given stream over a period of time, and Netflix found that this statistic was the best indicator of issues within their system. The advantage of this is a quicker reaction time for the Netflix team, reducing their time to detect (TTD) from 8 minutes down to less than 1 minute.
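
As a simple sketch of the SPS idea, the code below counts play starts per one-second bucket and checks whether the current rate has fallen well below a baseline. The function names, the timestamp input and the 50% drop limit are assumptions for illustration and do not reflect how Netflix actually computes or alerts on SPS.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def starts_per_second(start_timestamps: Iterable[float]) -> Dict[int, int]:
    """Count play starts in each one-second bucket, keyed by the whole second."""
    return dict(Counter(math.floor(ts) for ts in start_timestamps))


def sps_dropped(current_sps: float, baseline_sps: float, max_drop: float = 0.5) -> bool:
    """Return True if current SPS is more than max_drop below the baseline."""
    if baseline_sps <= 0:
        return False  # no meaningful baseline to compare against
    return current_sps < baseline_sps * (1 - max_drop)
```

With a baseline of 100 starts per second and the assumed 50% limit, a reading of 40 would trigger the check while a reading of 80 would not.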

In the future, Netflix is looking to keep investing in real-time applications and explore new innovations to better harness the power of stream processing. They already have work under way to improve their outlier detection through a Mantis integration, as well as improvements to the self-service tools provided in the UI. In the coming years, as Netflix continues to grow and tries to stave off the competition, it will be interesting to see what role Mantis plays in their longevity.