Open Source Monitoring Stack: Prometheus and Grafana

January 25, 2019

Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit built around a time series database. It was originally developed at SoundCloud “to be the system you go to during an outage to allow you to quickly diagnose problems”. Since its inception in 2012, it has gained a wide user base and an active developer community. Prometheus is now maintained independently of any company. It joined the Cloud Native Computing Foundation in 2016 as its second hosted project, following Kubernetes.

Most Prometheus components are written in Go, which makes them easy to build and deploy as static binaries. You run Prometheus by downloading and running the server alongside whichever of its components you need. It is also Docker compatible, and several of the Prometheus components are available on Docker Hub.

How Prometheus Works

The three primary components of a Prometheus setup are the Prometheus server, alert management with the Prometheus Alertmanager, and the visualization layer with Grafana (which we will go into later).

The Prometheus Server

The main component of Prometheus is the Prometheus server, which monitors particular things; these can be anything from a complete Linux server to a single process or a database service. The things that Prometheus monitors are called targets.

Each measurement collected from a target is called a metric. Prometheus also exposes various metrics about itself. For instance, since the Prometheus 1.x storage engine keeps chunks and series in memory, a panel can be built on the prometheus_local_storage_memory_chunks and prometheus_local_storage_memory_series metrics (these names belong to the 1.x local storage; later releases expose different storage metrics). Such panels can be watched to ensure that the thresholds you set are not exceeded. Prometheus scrapes metrics from instrumented jobs at regular intervals, which you define. It does this directly, or through an intermediary push gateway for short-lived jobs. The metrics can be stored locally or remotely and displayed back in the Prometheus server. It is important to note that Prometheus is a pull-based system, meaning it has to be told where to scrape the metrics from.
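To make the pull model concrete, here is a minimal Python sketch of a single scrape pass. It assumes each target returns plain text in the Prometheus exposition format, with one sample per line. The parse_exposition and scrape_once helpers and the fake target are illustrative names, not part of Prometheus, and the parser deliberately ignores details such as timestamps and escaped quotes.

```python
import re
from typing import Callable, Dict, List, Tuple

# One parsed sample: (metric name, labels, value).
Sample = Tuple[str, Dict[str, str], float]

# Simplified patterns: no timestamps, no escaped quotes inside label values.
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse text-format metrics into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:  # skip lines this simplified pattern does not match
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape_once(targets: Dict[str, Callable[[], str]]) -> Dict[str, List[Sample]]:
    """Pull metrics from every configured target exactly once.

    Each target is a callable returning exposition text; in a real setup
    this would be an HTTP GET against the target's metrics endpoint.
    """
    return {job: parse_exposition(fetch()) for job, fetch in targets.items()}


def fake_target() -> str:
    # Hypothetical target output, used only for illustration.
    return (
        "# HELP http_requests_total Total HTTP requests.\n"
        "# TYPE http_requests_total counter\n"
        'http_requests_total{method="get",code="200"} 1027\n'
        "up 1\n"
    )


if __name__ == "__main__":
    for job, samples in scrape_once({"demo": fake_target}).items():
        for name, labels, value in samples:
            print(job, name, labels, value)
```

The key point is the direction of the call: the scraper holds the list of targets and fetches from each one, rather than targets sending data on their own schedule.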

The metrics a third-party system exposes usually differ in format from native Prometheus metrics. This matters because Prometheus relies on a standard data model in which each time series is identified by a metric name and key-value pairs, which the third-party system may not match. Exporters are used to convert such metrics into the Prometheus format.
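The conversion an exporter performs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the third-party statistics are assumed to arrive as a flat dictionary, and the to_exposition helper, the myapp prefix and the instance label are hypothetical names chosen for the example rather than anything a real exporter defines.

```python
import re
from typing import Dict, Mapping


def to_exposition(prefix: str, stats: Mapping[str, object], labels: Dict[str, str]) -> str:
    """Render a flat stats mapping as Prometheus-style text lines."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    lines = []
    for key in sorted(stats):
        value = stats[key]
        if isinstance(value, bool):  # represent booleans as 1 or 0
            value = int(value)
        if not isinstance(value, (int, float)):  # only numeric values are kept
            continue
        # Metric names may not contain characters such as "-" or ".".
        name = re.sub(r"[^a-zA-Z0-9_]", "_", f"{prefix}_{key}").lower()
        series = name + "{" + label_text + "}" if label_text else name
        lines.append(f"{series} {value}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical statistics from a third-party service.
    third_party = {"Active-Connections": 12, "uptime.sec": 3.5, "version": "1.4"}
    print(to_exposition("myapp", third_party, {"instance": "db1"}), end="")
```

Non-numeric fields such as the version string are dropped here because the data model stores numeric samples; a real exporter would typically decide case by case how to represent them.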

Alertmanager Component

Prometheus also has an Alertmanager component, which can send alerts via email, Slack or other notification clients. The alert rules are defined in a rules file (called alert.rules here), which the Prometheus server reads. The server evaluates the rules and, when their conditions are met, fires alerts through the Alertmanager component.
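How a rule decides when to fire can be illustrated with a small Python sketch. It models a simple threshold rule that must hold for a minimum duration before it fires, in the spirit of the pending and firing states Prometheus uses for alerts. The AlertRule class and its fields are hypothetical names for this example; real rules are written as PromQL expressions in the rules file, not in Python.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """A threshold rule that fires only after the condition has held a while."""

    name: str
    threshold: float
    for_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending' or 'firing' for the latest value."""
        if value <= self.threshold:
            self.active_since = None  # condition cleared, reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    rule = AlertRule(name="HighMemoryUsage", threshold=0.9, for_seconds=60)
    for now, value in [(0, 0.95), (30, 0.95), (60, 0.95), (90, 0.5)]:
        print(now, rule.evaluate(value, now))
```

Requiring the condition to hold for a period before firing is what keeps a brief spike from paging anyone.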

Pros and Cons

Prometheus works especially well for recording purely numeric time series. It is a good fit for both dynamic service-oriented frameworks and machine-centric monitoring. Its support for multi-dimensional data collection and querying is a strength for microservices management.

While Prometheus values reliability and was built to be dependable when other infrastructure components are not working or unavailable, it cannot offer 100% accuracy if you need it for per-request billing, for instance. The data it collects is unlikely to be sufficiently detailed or complete. Prometheus itself advises using a different system for billing, and Prometheus just for monitoring.

Also Useful to Know…

The amount of RAM Prometheus uses can also be fine-tuned with the storage.local.memory-chunks flag of the Prometheus 1.x server (Prometheus recommends having three times the amount of RAM the memory chunks alone consume).
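The three-times rule of thumb translates into simple arithmetic, sketched below in Python. The calculation assumes a chunk size of 1024 bytes, which is the size used by the Prometheus 1.x local storage; treat that constant, and the recommended_ram_bytes helper name, as assumptions of this example and check them against the version you run.

```python
CHUNK_SIZE_BYTES = 1024  # assumption: 1 KiB per chunk in the 1.x local storage
RAM_SAFETY_FACTOR = 3    # rule of thumb quoted above


def recommended_ram_bytes(memory_chunks: int) -> int:
    """Estimate the RAM to provision for a given memory-chunks setting."""
    return memory_chunks * CHUNK_SIZE_BYTES * RAM_SAFETY_FACTOR


if __name__ == "__main__":
    chunks = 1_048_576  # example setting
    gib = recommended_ram_bytes(chunks) / 1024 ** 3
    print(f"{chunks} chunks -> about {gib:.1f} GiB of RAM")
```

With roughly one million chunks, the chunks alone take about 1 GiB, so the rule suggests provisioning about 3 GiB.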

There are also special-purpose exporters for services like Graphite, HAProxy and StatsD. Client libraries are used for instrumenting code, and there are various support tools available.

You can check whether Prometheus is working as it should by measuring the ingestion rate for samples using the prometheus_local_storage_ingested_samples_total metric. If the rate displayed matches the number of samples you expect your targets to produce, you can be reasonably confident it is ingesting correctly. Similarly, to identify latency issues, you can monitor the actual amount of time between target scrapes, compared with the interval you have configured, using the prometheus_target_interval_length_seconds metric.
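Because the ingested-samples metric is a counter that only ever increases, the ingestion rate is derived from the difference between two readings. The Python sketch below shows that calculation for a single pair of samples. It is a simplification for illustration only: the per_second_rate helper is a hypothetical name, and the PromQL rate function works over a whole range of samples rather than just two.

```python
from typing import Tuple

# (unix timestamp in seconds, counter value)
CounterSample = Tuple[float, float]


def per_second_rate(older: CounterSample, newer: CounterSample) -> float:
    """Average per-second increase of a counter between two samples."""
    t0, v0 = older
    t1, v1 = newer
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("newer sample must have a later timestamp")
    # A lower value means the counter was reset (for example after a
    # restart), so the new value is counted from zero.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / elapsed


if __name__ == "__main__":
    # Two hypothetical readings of the ingested-samples counter, 60 s apart.
    print(per_second_rate((0.0, 1000.0), (60.0, 7000.0)))
```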

Grafana

Grafana is an open-source data visualization and monitoring tool with support for many different data sources, including Elasticsearch, Graphite, InfluxDB, Logz.io and Prometheus. Grafana can be used to visualize data and to track long-term patterns and trends. Data is presented through dynamic and reusable dashboards that you create, which can then be shared between business and technical teams.

As a visualization tool, Grafana has a wide range of different options from heatmaps to graphs to histograms. There is a range of different panel plugins available to offer a variety of ways to view metrics and logs. Different data sources can be mixed in the same graph. You have the option to specify a data source on a per-query basis, which works for standardized templates and custom dashboards.

There is no standard way of developing Grafana’s dashboards (or visualizations) as code. However, it does support various third-party libraries. Examples for how to configure these in Grafana can be found here.

Alert rules can be visually defined for the most important metrics. Grafana continuously evaluates them and sends notifications as needed. These can be sent via email, Slack, PagerDuty, VictorOps, OpsGenie, or via webhook.

The Grafana back end has various configurable options, which can be specified in a .ini configuration file or through environment variables. Separately, each system that Grafana pulls its metrics from, such as Prometheus, is known as a data source.
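Settings from the .ini file map onto environment variables through a naming convention: the prefix GF, then the section name, then the key, all upper-cased and joined with underscores, with dots and dashes also turned into underscores. The Python sketch below shows that mapping; the grafana_env_name helper is a hypothetical name used only for this illustration.

```python
import re


def grafana_env_name(section: str, key: str) -> str:
    """Build the environment variable that overrides an .ini setting."""
    raw = f"GF_{section}_{key}".upper()
    # Dots and dashes are not valid in variable names, so use underscores.
    return re.sub(r"[.\-]", "_", raw)


if __name__ == "__main__":
    print(grafana_env_name("server", "http_port"))
    print(grafana_env_name("auth.anonymous", "enabled"))
```

Environment variables are convenient in containerized deployments, where editing a file inside the image is less practical than passing a value at start-up.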

Combining Prometheus Monitoring with Grafana: the Visualization Layer

Combining Prometheus and Grafana is becoming an increasingly popular choice of monitoring stack for DevOps teams needing to store and visualize time series data. In this combination, Prometheus performs the role of storage back end while Grafana acts as the interface for visualization and analysis.

Prometheus exposes a wide range of metrics about itself that can be monitored easily, such as memory and storage usage, in addition to general ones that report on the status of the service. By adding Grafana as a visualization layer, a monitoring stack for the monitoring stack itself can easily be set up!

There are various ways to set up Prometheus and Grafana together, including via Dockerized deployments.

Some examples of metrics you can explore with Grafana visualizations include:

Uptime: The amount of time in total since your Prometheus server was started.
Local storage memory series: The current total number of series held in memory.
Internal storage queue length: This queue should be empty (0) or close to it.
Samples ingested: This displays the number of samples Prometheus has ingested.
Target scrapes: This displays how often the target (in this case Prometheus itself) is scraped.

The Prometheus Benchmark dashboard can provide examples of further metrics that should be monitored. How exactly you set it up can involve a good deal of experimentation, an exploration which Grafana makes easy.

Instead of writing PromQL queries directly in the Prometheus server's expression browser, you build Grafana dashboards that query metrics from the Prometheus server and visualize them together in a single dashboard (examples of which can be found here).
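Under the hood, a dashboard panel ends up sending its PromQL expression to the Prometheus HTTP API and reading back JSON. The Python sketch below shows the shape of that exchange for an instant query, without making a network call. The address localhost:9090 and the sample response body are assumptions for illustration, and the two helper functions are hypothetical names, not Grafana or Prometheus code.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_instant_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed: " + str(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


if __name__ == "__main__":
    print(build_query_url("http://localhost:9090", "up"))
    # Hypothetical response body, shaped like an instant-query result.
    sample = (
        '{"status":"success","data":{"resultType":"vector","result":'
        '[{"metric":{"__name__":"up","job":"prometheus"},'
        '"value":[1548410000.0,"1"]}]}}'
    )
    print(extract_instant_vector(sample))
```

Grafana adds templating, time ranges and rendering on top, but the data it plots comes from queries of this kind.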