How Netflix Uses Spinnaker Continuous Delivery Platform To Support 95% Of Its AWS Infrastructure

Netflix recently released details of how it enables multi-cloud continuous delivery via the Spinnaker platform. The streaming giant uses it to deploy more than 95% of its infrastructure in AWS, which comprises hundreds of microservices and sees thousands of deployments every day.

Spinnaker Overview

Originally, Netflix’s engineering team built Spinnaker to help its internal teams manage their deployments. Its flexible pipeline management system is combined with integrations to the major cloud providers. The platform is open source, and the open-source community quickly aided in battle-testing and validating Spinnaker’s application-centric, cloud-first view of delivery by adding tools, stages and cloud provider integrations.

Spinnaker offers two core sets of features to help developers release software changes with “high velocity and confidence”: (i) cluster management and (ii) deployment management.

Spinnaker’s cluster management features let you view and manage your resources in the cloud, including server groups, clusters, applications, load balancers and firewalls. Its deployment management features enable the construction and management of continuous delivery workflows. The main deployment management construct is the pipeline, which is made up of a sequence of actions called stages; parameters can be passed from stage to stage. Pipelines can be initiated manually or triggered by automatic events, such as a cron schedule or a new Docker image appearing in your registry.

Spinnaker comes with best practices for high availability and integration with Netflix tools, including Chaos Monkey, the Chaos Automation Platform (ChAP), Archaius, Automated Canary Analysis and Titus.

These tools help Netflix’s developers to create and manage pipelines that automate the integration and delivery process to cloud VMs, containers, CDNs and hardware OpenConnect devices.

The open source platform can deploy across multiple cloud providers, including AWS EC2, Kubernetes, Google Compute Engine, Google Kubernetes Engine, Google App Engine, Microsoft Azure and OpenStack, with Oracle Bare Metal and DC/OS being added soon.

What is Continuous Delivery and Continuous Deployment?

Software delivery has traditionally mirrored that of physical products like cars and appliances. Software products are delivered on a certain timeline with a suite of new features and improvements developed over time, tested in depth, then released all at once. If bugs are found, users typically have to wait some time before the release of a patch to solve them.

Continuous delivery is a newer mode of software delivery that aims to cut down the amount of inventory (i.e. features and fixes that have been developed but not yet delivered to users) by radically shrinking the time between releases. In many ways, it is the next stage of the agile software development movement, which aims to develop software on a more iterative basis and maintain an ongoing back and forth with users in order to avoid redundancy, poor analysis, or features that are not fit for purpose.

In continuous delivery cycles, teams push features and fixes live as soon as they are ready rather than waiting to batch them into a formal release cycle. This can mean that multiple updates may go out in just one day.

Continuous deployment carries this idea to the next step by automatically pushing each change live as soon as it has passed a set of automated tests, load testing and canary analysis.

Both continuous delivery and continuous deployment rely on being able to define and automate a repeatable process for the release of updates. When as many as ten releases might be taking place over one week, it is not possible for each version to be manually deployed in an ad hoc style. Reliable, automated tools are necessary to help teams monitor and manage the release schedule.
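As a rough illustration of the automated gate described above, the following Python sketch promotes a change only when its tests, a load test and a simple canary comparison all pass. The names (`Change`, `canary_passes`, `should_promote`) and the 5% tolerance are assumptions made for this example; they are not taken from Netflix's tooling.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical record of the checks run against one candidate change."""
    tests_passed: bool
    load_test_passed: bool
    baseline_error_rate: float  # error rate of the current production build
    canary_error_rate: float    # error rate of the new build on a slice of traffic


def canary_passes(change: Change, tolerance: float = 0.05) -> bool:
    """Pass if the canary error rate is at most `tolerance` (relative) above baseline."""
    allowed = change.baseline_error_rate * (1 + tolerance)
    return change.canary_error_rate <= allowed


def should_promote(change: Change) -> bool:
    """Continuous deployment gate: every automated check must pass."""
    return change.tests_passed and change.load_test_passed and canary_passes(change)


good = Change(True, True, baseline_error_rate=0.010, canary_error_rate=0.010)
bad = Change(True, True, baseline_error_rate=0.010, canary_error_rate=0.020)
print(should_promote(good), should_promote(bad))
```

A real canary analysis compares many metrics statistically; the single error-rate comparison here only shows where such a check sits in the release decision.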

The Benefits of Continuous Delivery

  • Quicker time to market for new features, configuration modifications and experiments
  • Innovation can be applied quickly, allowing users to enjoy new updates
  • Faster feedback and troubleshooting loops
  • Bugs can be fixed more rapidly
  • Continuous delivery pipelines can be built to incrementally roll out changes at specific times for specific cloud targets
  • Automated delivery replaces manual error-prone processes
  • Incompatible upstream dependencies are reduced via a more frequent release cadence
  • Reduction of issues and outages caused by bad deployments
  • Developers can see the results of the services they deploy more quickly and have more ownership over the release timeline

Netflix’s eBook on Spinnaker

Netflix has just released an eBook on Spinnaker. ‘Continuous Delivery with Spinnaker: Fast, Safe, Repeatable, Multi-Cloud Deployments’ was written to provide a “high-level introduction” for engineering teams wanting to understand how Netflix delivers production changes, and how the features embedded in the Spinnaker platform work to simplify continuous delivery to the cloud. The goal is to help other teams that want to adopt a continuous delivery approach for software developed in the cloud by providing Spinnaker as a model for how to codify a release process.

The eBook’s contents include a discussion of the benefits of continuous delivery, advice on cloud deployment considerations and managing cloud infrastructure, in addition to in-depth sections on topics like structuring deployments as pipelines, and how to make deployments safer. Ways to take advantage of advanced features such as declarative delivery and automated canary analysis are also included.

Spinnaker: a Summary

The Invention of Spinnaker

Netflix developed Spinnaker to address the problem of releasing changes frequently without relying on manual, ad hoc deployments. Teams can codify deployments across multiple regions, and even across multiple cloud platforms, into “pipelines” that run when a new version is released. This allows teams to design and automate a delivery process that suits their release cadence.

Netflix released its first microservice to the cloud in 2009. By 2014, most services, aside from billing, were running on AWS. In January 2016, the final data center dependency was eliminated, allowing Netflix to run entirely on AWS. Spinnaker was honed and developed during this migration process, mirroring Netflix’s experiences of delivering software to the cloud frequently, quickly and reliably.

Pipelines as the Key Workflow Construct

Pipelines are the key workflow construct used for Spinnaker deployments. Each pipeline can be uniquely configured (e.g. by defining a sequence of stages), allowing teams to develop their own deployment strategy.

Each pipeline execution is represented as JSON (JavaScript Object Notation) that holds all the information about that execution. Variables such as the start time, stage status and server group names appear in the JSON, which is then used to render the UI.
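To make the idea concrete, here is a small Python sketch of an execution record and a function that turns it into the lines a UI might display. The field names (`status`, `startTime`, `stages`, `serverGroup`) and the values are assumptions chosen for illustration; Spinnaker's real execution JSON is considerably richer.

```python
import json

# Illustrative execution record; the field names are assumptions,
# not Spinnaker's exact schema.
execution_json = """
{
  "name": "deploy-to-prod",
  "status": "RUNNING",
  "startTime": 1546300800000,
  "stages": [
    {"name": "Bake", "status": "SUCCEEDED"},
    {"name": "Deploy", "status": "RUNNING", "serverGroup": "myapp-prod-v042"},
    {"name": "Wait", "status": "NOT_STARTED"}
  ]
}
"""


def summarize(execution: dict) -> list[str]:
    """Turn an execution record into the lines a UI might render."""
    lines = [f"{execution['name']}: {execution['status']}"]
    for stage in execution["stages"]:
        # Only some stages carry a server group, so treat it as optional.
        detail = f" ({stage['serverGroup']})" if "serverGroup" in stage else ""
        lines.append(f"  {stage['name']}: {stage['status']}{detail}")
    return lines


execution = json.loads(execution_json)
print("\n".join(summarize(execution)))
```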

Pipeline Stages

Development teams can divide the tasks executed by a pipeline into small, customizable blocks called stages. Stages are chained together to define all the work done by the pipeline as a whole. Each type of stage executes a specific operation or set of operations. Pipeline stages fall into four main categories:

  1. Infrastructure Stages – these operate on the underlying cloud infrastructure by creating, updating or deleting resources. Each of the following stages leads to the next. Examples of this stage include:
    1. Bake – create an AMI (Amazon Machine Image) or Docker image from an artifact
    2. Tag image – apply a tag to the just-baked images for categorization
    3. Find image/Container from a Cluster/Tag – find a previously deployed version of your immutable infrastructure in order to refer to that version in later stages
    4. Deploy
    5. Disable/Enable/Resize/Shrink/Clone/Rollback a Cluster/Server Group
    6. Run job – run a container in Kubernetes

The last three stages above represent the majority of the work in your deployment pipelines.

  2. External Systems Integrations – Spinnaker can be integrated with other custom systems to chain together logic performed on external systems. Examples of this stage include:
    1. Interaction with Continuous Integration systems, such as Jenkins
    2. Run Job
    3. Webhook (send an HTTP request into any other system that supports webhooks, and read the returning data)
  3. Testing – Spinnaker supports three kinds of testing stage, two of which are internal to Netflix. These are:
    1. Chaos Automation Platform (ChAP) (internal only) – see if fallbacks behave as intended and uncover systemic problems that may occur when latency increases
    2. Citrus Squeeze Testing (internal only) – squeeze testing, i.e. directing increasingly larger amounts of traffic toward an evaluation cluster to determine its load limit
    3. Canary (internal and open source) – send a small portion of production traffic to a new build to measure key metrics and find out whether the new build introduces any performance degradation

Further functional tests can be run via Jenkins.

  4. Controlling Flow – The flow of your pipeline (for authorization, timing or branching logic) can be controlled in this group of stages. The stages are:
    1. Check preconditions – perform conditional logic
    2. Manual judgement – pauses your pipeline until a human validates it and propagates their credentials
    3. Wait – wait for a certain period of time
    4. Run a pipeline – run another pipeline from within your current pipeline
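
The way stages are chained together can be sketched in Python as a small dependency graph. The `refId` and `requisiteStageRefIds` fields are modeled on the pipeline JSON that Spinnaker stores, but the exact schema should be treated as an assumption here, and `execution_order` is a hypothetical helper written for this example.

```python
from graphlib import TopologicalSorter

# A hypothetical pipeline definition. Each stage names the stages that
# must finish before it can start (schema assumed for illustration).
pipeline = {
    "name": "bake-and-deploy",
    "stages": [
        {"refId": "1", "type": "bake", "requisiteStageRefIds": []},
        {"refId": "2", "type": "manualJudgment", "requisiteStageRefIds": ["1"]},
        {"refId": "3", "type": "deploy", "requisiteStageRefIds": ["2"]},
        {"refId": "4", "type": "wait", "requisiteStageRefIds": ["3"]},
    ],
}


def execution_order(pipeline: dict) -> list[str]:
    """Return stage types in an order that respects every dependency."""
    by_ref = {stage["refId"]: stage for stage in pipeline["stages"]}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {ref: set(stage["requisiteStageRefIds"]) for ref, stage in by_ref.items()}
    return [by_ref[ref]["type"] for ref in TopologicalSorter(graph).static_order()]


print(execution_order(pipeline))
```

Expressing dependencies explicitly, rather than relying on list position, is what allows a pipeline to branch and have several stages run in parallel.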

Configuring Pipeline Triggers

The final essential part of constructing a pipeline is deciding how it is initiated. By configuring a pipeline trigger, you can react to events and chain steps together. Most Spinnaker pipelines are configured to be triggered by events. Triggers are either time-based (manual and cron) or event-based (Git, Continuous Integration, Docker, pipeline and Pub/Sub).

Manual triggers let you run a pipeline ad hoc. Cron triggers, by contrast, let you run pipelines according to a determined schedule.

Most pipelines, however, are run following an event. Git triggers let you run a pipeline following a git event, such as a commit. Continuous Integration (CI) triggers (as in Jenkins or Travis) let you run a pipeline following the successful completion of a CI job. Docker triggers let you run a pipeline after a new Docker image is uploaded or a new Docker image tag is published. Pipeline triggers let you run another pipeline after a previous one has successfully completed. Pub/Sub triggers let you run a pipeline following receipt of a specific message from a Pub/Sub system such as Google Pub/Sub or Amazon SNS.

With access to this wide range of triggers, you can create a customizable workflow that moves between custom scripted logic and the built-in Spinnaker stages.
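Event-based triggering can be pictured as matching incoming events against a list of trigger configurations. The Python sketch below does exactly that; the trigger fields (`repo`, `branch`, `repository`, `tag`, `pipeline`, `status`) and the helper names are hypothetical and are not Spinnaker's actual trigger schema.

```python
from fnmatch import fnmatch

# Hypothetical trigger configurations; field names are illustrative only.
triggers = [
    {"type": "git", "repo": "myorg/myapp", "branch": "main"},
    {"type": "docker", "repository": "myorg/myapp", "tag": "release-*"},
    {"type": "pipeline", "pipeline": "bake-and-deploy", "status": "successful"},
]


def matches(trigger: dict, event: dict) -> bool:
    """Return True if `event` satisfies every condition set on `trigger`.

    Conditions are shell-style patterns, so "release-*" matches "release-1.4".
    """
    if trigger["type"] != event.get("type"):
        return False
    return all(
        fnmatch(str(event.get(key, "")), str(pattern))
        for key, pattern in trigger.items()
        if key != "type"
    )


def fired(event: dict) -> list[dict]:
    """Return every configured trigger that the event sets off."""
    return [trigger for trigger in triggers if matches(trigger, event)]


docker_event = {"type": "docker", "repository": "myorg/myapp", "tag": "release-1.4"}
feature_commit = {"type": "git", "repo": "myorg/myapp", "branch": "feature-x"}
print(len(fired(docker_event)), len(fired(feature_commit)))
```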

As a pipeline runs, it will transition across each stage and do the work specified at each stage. Notifications or auditing events will be invoked depending on whether a stage is starting, completing or failing. When the pipeline execution completes, it can be configured to trigger a further set of pipelines to continue the deployment flow.
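The lifecycle described above can be sketched as a tiny pipeline runner in Python. `run_pipeline`, the `notify` callback and the stage names are all hypothetical; the sketch only shows the order in which start, completion and failure events would be emitted, and how one pipeline finishing can kick off the next.

```python
from typing import Callable

Stage = tuple[str, Callable[[], None]]  # (stage name, work to perform)


def run_pipeline(name: str, stages: list[Stage], notify: Callable[[str], None]) -> bool:
    """Run stages in order, emitting an event as each one starts, completes or fails."""
    for stage_name, work in stages:
        notify(f"{name}/{stage_name}: starting")
        try:
            work()
        except Exception as exc:
            notify(f"{name}/{stage_name}: failed ({exc})")
            return False  # later stages are skipped after a failure
        notify(f"{name}/{stage_name}: complete")
    return True


def failing_deploy() -> None:
    raise RuntimeError("health check timed out")


events: list[str] = []
ok = run_pipeline("build", [("bake", lambda: None)], events.append)
if ok:  # a completed pipeline can trigger a further pipeline
    run_pipeline("deploy", [("deploy", failing_deploy), ("wait", lambda: None)], events.append)
print("\n".join(events))
```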

Deployment to Virtual Machine-Based Systems

You can use Spinnaker to simplify the creation of images via a bake, tagging and deployment to Amazon EC2 or other similar VM-based systems such as Google Compute Engine and Microsoft Azure.

During the deploy and rollback cycles in the continuous deployment process, Spinnaker manages the creation of new server groups, autoscaling and health checks. This takes the management burden away from the user, and groups commonly required resources in a way that simplifies interactions with them.

Making Deployments Safer

A major part of making continuous delivery successful is helping development and engineering teams push new code without fear, so that software is deployed quickly and automatically and users do not have to wait for new updates or bug fixes. Part of lessening the fear around automation involves setting up appropriate safeguards to make sure that systems do not fail and customers are not negatively impacted at any stage.

There are various actions that can be taken in Spinnaker to make deployments safer. These include:

  • Customizing your deployment strategy or choosing not to have one
  • Easy rollbacks – to revert changes if an issue is encountered
  • Cluster locking – Spinnaker automatically creates a protective bubble around a new server group every time it is added to a cluster; it also doesn’t allow a conflicting server group to be generated simultaneously in the same region and cluster, which means that automated pipelines and/or manual tasks don’t accidentally act on the same resources
  • Traffic guards – these are used to ensure there is always at least one active server group in operation
  • Deployment windows – these let you choose the times of day and days of the week at which a deployment can take place, for instance to avoid peak traffic
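
Two of these safeguards, deployment windows and traffic guards, are simple enough to sketch in Python. The policy values (weekdays, 10:00 to 16:00) and the function names are assumptions made for this example and do not reflect Spinnaker's implementation.

```python
from datetime import datetime

# Hypothetical policy: deploy only on weekdays between 10:00 and 16:00.
ALLOWED_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4
ALLOWED_HOURS = range(10, 16)       # 10:00 up to, but not including, 16:00


def in_deployment_window(when: datetime) -> bool:
    """Deployment window: allow deployments only at approved times."""
    return when.weekday() in ALLOWED_WEEKDAYS and when.hour in ALLOWED_HOURS


def can_disable(cluster: dict[str, bool], server_group: str) -> bool:
    """Traffic guard: refuse to disable the last active server group.

    `cluster` maps each server group name to whether it is currently active.
    """
    return any(active for name, active in cluster.items() if name != server_group)


print(in_deployment_window(datetime(2024, 1, 3, 11, 0)))    # a Wednesday morning
print(can_disable({"v001": True, "v002": False}, "v001"))   # v001 is the only active group
```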

Customizing Spinnaker

Spinnaker can be extended and customized for your organization in four different ways:

  1. Hit the API Directly
  2. UI Integrations to fit the needs of your organization
  3. Custom Stages – Spinnaker lets you call out to custom business processes via the Webhook or the Run Job stage; for more complex integrations, you can build your own custom stage
  4. Internal Extensions – Spinnaker was built on Spring Boot and is written to be extended with company-specific logic that doesn’t have to be open sourced. You can fine-tune Spinnaker’s internal behavior to match your organization’s unique needs by packaging your own JARs inside specific Spinnaker services.

The extensibility of the platform is important in allowing Spinnaker to gain wider adoption beyond Netflix.
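
As an illustration of the first option, calling the API directly, the Python sketch below builds (but does not send) an HTTP request asking Spinnaker's API gateway to start a pipeline. The host name, the `/pipelines/{application}/{pipeline}` path and the request body are assumptions for this example; check them against the API documentation of your own installation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import Request

GATE_URL = "https://gate.example.com"  # hypothetical API gateway address


def build_trigger_request(application: str, pipeline: str, parameters: dict) -> Request:
    """Build (but do not send) a POST that asks the gateway to start a pipeline.

    The URL path and body shape are assumptions; confirm them against
    your own installation's API documentation.
    """
    url = f"{GATE_URL}/pipelines/{quote(application, safe='')}/{quote(pipeline, safe='')}"
    body = json.dumps({"type": "manual", "parameters": parameters}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


request = build_trigger_request("myapp", "deploy to prod", {"version": "1.4.2"})
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, along with whatever authentication your installation requires.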

In Conclusion

Spinnaker adoption has gone from zero to 95% within Netflix itself over the past four years. Netflix says that this success was “by no means a given”, and that the way it encouraged adoption among internal teams and individuals was to make Spinnaker “irresistible”. The Spinnaker engineers say they were able to do this via five main means: (i) making best practices easy to use; (ii) making the platform secure by default; (iii) standardizing the cloud to make it easy to build additional tools that support the cloud landscape; (iv) reusing existing deployments (the Spinnaker API and tooling make it straightforward to use existing deployment pipelines while still gaining the advantage of the safer, more secure Spinnaker deployment primitives); and (v) offering ongoing support to teams as they migrated their existing deployments into Spinnaker.

We have summarized sections of the Spinnaker eBook in this overview post. The entire 81-page document can be accessed here.