At Altitude last month, Eddie Rogers, Principal Engineer for the API Platform at Target, presented on the retailer’s integration of Fastly’s log streaming to build dashboards and discussed the company’s history with the CDN.
It started in 2015 when Target added Fastly’s CDN to the front of its API platform. The initial configuration was very simple, with no caching. Rogers said that most mornings he would look at the Fastly dashboard over a coffee, watching the volumes and checking that errors didn’t spike, which served him well for a long time.
Fast forward to 2018. Target’s footprint with Fastly had grown, and it now had 10-15 services running continuously and talking to one another. However, Target’s dashboard, which now opened to a first tab of “All Services”, was not offering a sufficient level of detail about each one. Rogers had information available such as traffic volume, error count and bytes in and out, but only at a high level.
Rogers realized he needed a more complex dashboard to get the same breadth and depth of information for all of Target’s services that he had in that initial dashboard. In its API platform alone, Target has over 150 different APIs, a number that grows daily. Rogers needed to know when errors started, which API was carrying the most traffic, which POP was showing the most unauthorized calls to a specific API, and so on.
Target’s Quest for the “Best Dashboards Ever”
This growing need set Rogers on a quest “to build the best dashboards ever”. As he points out, “you know your data, so build your logs around it”. Fastly’s dashboards can help you visualize the data, but only the company providing that data can add detailed knowledge about the requests themselves based on its specific business, for instance by breaking down its API service by API name or by data center. By leveraging Fastly’s real-time log streaming, Target’s engineers were able to gather information about what was going on at the CDN, and then integrate that knowledge with other parts of the stack.
From Rogers’ perspective, the desire for the best dashboards ever was driven by the need for better logs and metrics. Logs “tell you what went wrong” and metrics “tell the story of what’s happening right now”.
Target’s internal measurement team is the first port of call in building its internal logs and metrics. Its goal is “to provide an easy-to-use and available system for collecting observability data that teams can interface with easily, allowing development teams to focus on their applications”. To this end, the measurement team has created a measurement pipeline for internal use, made up of an ELK stack (Elasticsearch, Logstash and Kibana) for logging, and InfluxDB and Grafana for metrics. Kafka is used as the transport layer.
Logs
Rogers describes the data gathered through logs as “the real gems… the pieces you extract in VCL, and send through your log stream” which give you a breakdown of what is happening at the granular level.
Rogers gave a detailed look at how Target’s log stream breaks down. Logs enter Target’s ELK stack as structured JSON messages, formatted at Fastly in VCL. The engineers move through a series of steps to understand the logs.
First, the log stream only logs errors, i.e. responses with a status code of 400 or above. Second, it logs which API is experiencing the error, in order to differentiate between Target’s roughly 150 APIs. Next, because Target uses Fastly as a router as well as a cache, Fastly decides which data center should handle each incoming request, and this decision is noted in the log stream. There are two separate fields: ‘request backend’, i.e. where the request should go, and ‘actual backend’, i.e. where it did go. Rogers points out that ‘actual backend’ is particularly useful in telling you where an error is actually coming from.
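To make those steps concrete, here is a minimal Python sketch of the same logic. Target’s real implementation lives in VCL at Fastly and was not shown in this write-up, so the function name and the JSON field names (`status`, `api`, `requested_backend`, `actual_backend`) are assumptions chosen for illustration only.

```python
import json
from typing import Optional


def build_error_log(status: int, api_name: str,
                    requested_backend: str,
                    actual_backend: str) -> Optional[str]:
    """Return a structured JSON log line for an error, or None otherwise.

    All field names here are hypothetical; they mirror the steps described
    in the talk, not Target's actual log schema.
    """
    # Step 1: only errors are logged (status code 400 or above).
    if status < 400:
        return None

    record = {
        "status": status,
        # Step 2: which API saw the error.
        "api": api_name,
        # Step 3: where the request should have gone ...
        "requested_backend": requested_backend,
        # ... and where it actually went.
        "actual_backend": actual_backend,
    }
    return json.dumps(record, sort_keys=True)
```

Calling `build_error_log(200, "carts", "dc-east", "dc-east")` returns `None`, so successful requests never reach the log stream, while a 503 served from a different data center than the one requested produces a JSON line in which the two backend fields differ.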
Metrics
Target’s metrics come in similarly, except the message is formatted at Fastly as InfluxDB line protocol (ILP). This allows Target’s engineers to send it straight through to their InfluxDB servers, primed for visualization in Grafana.
One line of ILP is made up of four elements:
- Measurement – The grouping of the data you’re amassing, similar to a database table.
- Tag – A quality of the measurement in key/value format. Tags are always formatted as strings.
- Field – The actual measurement value, described as an int, float, string or bool.
- Timestamp – Telling you when the measurement took place. Nanosecond precision is preferred, but it can take different forms.
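The four elements above can be sketched as a small Python formatter. This is an illustrative sketch rather than Target’s code (which formats the line in VCL at Fastly): the function name `to_line_protocol`, the measurement name `requests` and the tag and field names are all assumptions. The escaping and type suffixes follow the published line protocol rules as I understand them.

```python
import time
from typing import Dict, Optional, Union

FieldValue = Union[int, float, str, bool]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given special characters."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"          # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    return '"' + _escape(value, '\\"') + '"'


def to_line_protocol(measurement: str,
                     tags: Dict[str, str],
                     fields: Dict[str, FieldValue],
                     timestamp_ns: Optional[int] = None) -> str:
    """Build one line: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")

    head = _escape(measurement, ", ")
    for key in sorted(tags):        # sorted for a stable, predictable line
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")

    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )

    if timestamp_ns is None:
        timestamp_ns = time.time_ns()   # nanosecond precision
    return f"{head} {body} {timestamp_ns}"
```

For example, `to_line_protocol("requests", {"api": "carts", "status": "503"}, {"count": 1}, 1556813561098000000)` yields `requests,api=carts,status=503 count=1i 1556813561098000000`. Note that the status code is sent as a tag, and therefore as a string, so it can be used to group series in Grafana.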
By formatting metrics as ILP and sending them to InfluxDB, the engineering team is able to access a dashboard of every product and its volume, in addition to a global line graph of volume by response status. Instead of just being able to see Fastly’s home page of volume by service, the Target engineers can “get an in-depth view of any service with contextually relevant data”. They can then break out the error chart into more detailed data that indicates which team needs to be engaged to solve specific errors. Errors can be seen by API and backend, through a POP chart, or by individual backend server. All of that information, and the different ways of viewing it, are now available “quickly and responsively”.
Alerting
Grafana offers an alerting service built on the metrics. Some of Target’s Fastly dashboards show solid green hearts or broken red ones: these indicate the health of the metric based on customized parameters. A broken red heart indicates an alerting metric. Almost as soon as Grafana has identified a problem, Target also knows about it. Alerting tells you “when you actually need to worry”.
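The idea behind the hearts can be sketched as a simple threshold rule. This is plain Python for illustration and not Grafana’s API; the function name, the use of an average over a recent window, and the threshold are all assumptions, since the talk recap does not give the parameters Target actually uses.

```python
from typing import Sequence


def heart_status(error_counts: Sequence[int], threshold: float) -> str:
    """Classify a metric window as 'ok', 'alerting' or 'no_data'.

    Hypothetical rule: alert when the average error count over the
    window is above the threshold.
    """
    if not error_counts:
        return "no_data"            # nothing to evaluate in this window
    average = sum(error_counts) / len(error_counts)
    return "alerting" if average > threshold else "ok"
```

With a threshold of 5, a window of `[1, 0, 2]` would show a green heart (`"ok"`), while `[9, 12, 7]` would show a broken red one (`"alerting"`).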
Takeaways
As Rogers says, “you know your data”… “What makes it the best dashboard? It’s the one you build based on your data.”
For more information, Rogers’ talk can be viewed here and snapshots of the dashboards along with detailed code can be viewed here.