At Altitude last month, Eddie Rogers, Principal Engineer for the API Platform at Target, presented on the retailer’s use of Fastly’s log streaming to build dashboards and discussed the company’s history with the CDN.
It started in 2015, when Target added Fastly’s CDN to the front of its API platform. The initial configuration was very simple, without any caching. Most mornings, Rogers said, he would look at the Fastly dashboard over coffee, watching traffic volumes and making sure errors didn’t spike, a routine that served him well for a long time.
Fast forward to 2018. Target’s footprint with Fastly had grown: it now had 10-15 services running around the clock and talking to one another. However, Target’s dashboard, which now opened to a first tab of “All Services”, no longer offered a sufficient level of detail about each one. Information such as traffic volume, error count, and bytes in and out was available, but only at a high level.
Rogers realized he needed a more sophisticated dashboard to get the same breadth and depth of information he had in that initial dashboard across all of Target’s services. In its API platform alone, Target has over 150 different APIs, a number that grows daily. Rogers needed to know when errors started, which API was handling the most traffic, which POP was showing the most unauthorized calls to a specific API, and so on.
Target’s Quest for the “Best Dashboards Ever”
This growing need sent Rogers on a quest “to build the best dashboards ever”. As he points out, “you know your data, so build your logs around it”. Fastly’s dashboards can help you visualize the data, but only the company providing that data can add detailed knowledge about the requests themselves, based on its specific business; for instance, by breaking its API service down by API name or by data center. By leveraging Fastly’s real-time log streaming, Target’s engineers were able to gather information about what was going on at the CDN and then integrate that knowledge into other parts of the stack.
From Rogers’ perspective, the desire for the best dashboards ever was driven above all by the need for better logs and metrics. Logs “tell you what went wrong” and metrics “tell the story of what’s happening right now”.
Target’s internal Measurement team is the first port of call in building its internal logs and metrics. Its goal is “to provide an easy-to-use and available system for collecting observability data that teams can interface with easily, allowing development teams to focus on their applications”. To this end, the Measurement team runs a Measurement pipeline for internal use, comprising an ELK stack (Elasticsearch, Logstash, and Kibana) for logging, and InfluxDB and Grafana for metrics. Kafka is used as the transport layer.
Rogers describes the data gathered through logs as “the real gems… the pieces you extract in VCL, and send through your log stream” which give you a breakdown of what is happening at the granular level.
Rogers provided a detailed insight into how Target’s log stream breaks down. Logs enter Target’s ELK stack as structured JSON messages, formatted by Fastly in VCL. The engineers move through a series of steps to understand the logs.
First, the log stream only logs errors, i.e. requests where the status code is 400 or above. Second, it logs which API is experiencing the error, in order to differentiate among Target’s some 150 different APIs. Next, Target uses Fastly as a cache (as well as its router) to decide which data center should handle incoming requests, and this decision is noted in the log stream. There are then two different backend fields: ‘requested backend’, i.e. where the request should go, and ‘actual backend’, i.e. where it did go. Rogers points out that ‘actual backend’ is particularly useful for telling you where an error is actually coming from.
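The steps above can be sketched as a small Python helper that consumes one structured JSON log message and pulls out the error-routing fields. This is a minimal illustration only; the field names (`status`, `api_name`, `request_backend`, `actual_backend`) are assumptions for the sketch, not Target’s actual log schema:

```python
import json

def summarize_error(raw_log_line):
    """Parse one structured JSON log message and return the error-routing
    fields described above, or None for non-error requests.
    Field names are illustrative, not Target's real schema."""
    record = json.loads(raw_log_line)
    status = int(record["status"])
    if status < 400:
        return None  # step 1: only errors (4xx/5xx) are of interest
    return {
        "api": record.get("api_name"),                        # step 2: which of the ~150 APIs errored
        "requested_backend": record.get("request_backend"),   # where the request should have gone
        "actual_backend": record.get("actual_backend"),       # where it actually went
        "status": status,
    }
```

In a sketch like this, comparing `requested_backend` against `actual_backend` is what surfaces the routing detail Rogers highlights: the backend that actually produced the error.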
Target’s metrics arrive similarly, except the message is formatted at Fastly as Influx Line Protocol (ILP). This allows Target’s engineers to send it straight through to their InfluxDB servers, primed for visualization in Grafana.
One line of ILP is made up of four elements:
- Measurement – The grouping of the data you’re amassing, similar to a database table.
- Tag – An attribute of the measurement in key/value format. Tags are always formatted as strings.
- Field – The actual measurement value, expressed as an int, float, string or bool.
- Timestamp – Telling you when the measurement took place. Nanosecond precision is preferred, but it can take different forms.
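The four elements above can be sketched as a small Python formatter. This is a minimal illustration of the protocol’s shape, not Target’s actual code, and the escaping shown covers only the basic cases (commas, spaces, and equals signs):

```python
import time

def to_ilp(measurement, tags, fields, ts_ns=None):
    """Format one Influx Line Protocol line:
    measurement,tag1=v1,tag2=v2 field1=v1,field2=v2 timestamp"""
    def esc(s):
        # Escape the characters ILP treats as delimiters
        return str(s).replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")

    tag_str = "".join(f",{esc(k)}={esc(v)}" for k, v in sorted(tags.items()))
    parts = []
    for k, v in fields.items():
        if isinstance(v, bool):          # bool before int: bool is a subclass of int
            parts.append(f"{esc(k)}={'true' if v else 'false'}")
        elif isinstance(v, int):
            parts.append(f"{esc(k)}={v}i")   # integers carry an 'i' suffix in ILP
        elif isinstance(v, float):
            parts.append(f"{esc(k)}={v}")
        else:
            parts.append(f'{esc(k)}="{v}"')  # strings are double-quoted
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{esc(measurement)}{tag_str} {','.join(parts)} {ts}"
```

For example, a hypothetical per-API request counter might come out as `api_requests,api=cart,pop=MSP count=1i 1526000000000000000`, ready to be written straight to InfluxDB.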
By formatting metrics as ILP and sending them to InfluxDB, the engineering team can access a dashboard of every product and its volume, in addition to a global line graph of volume by response status. Instead of just seeing Fastly’s home page of volume by service, the Target engineers can “get an in-depth view of any service with contextually relevant data”. They can then break the error chart out into more detailed data that indicates which team needs to be engaged to solve specific errors. Errors can be seen by API and backend, through a POP chart, or by individual backend server. All that information, and the different ways of viewing it, is now possible and available “quickly and responsively”.
Grafana offers an alerting service built on the metrics. Some of Target’s Fastly dashboards have green solid hearts or red broken ones: these indicate the health of the metric based on customized parameters. A red broken heart marks an alerting metric. Almost as soon as Grafana has identified a problem, Target knows about it too. Alerting tells you “when you actually need to worry”.
As Rogers says, “you know your data”… “What makes it the best dashboard? It’s the one you build based on your data.”
For more information, Rogers’ talk can be viewed here and snapshots of the dashboards along with detailed code can be viewed here.