Prometheus Blog Series (Part 1): Metrics and Labels

Among the monitoring tools that have emerged in recent years, Prometheus is definitely hard to miss. It has quickly risen to the top of the class, with overwhelming adoption from the community and integrations with all the major pieces of the Cloud Native puzzle.

Throughout this blog series, we will learn the basics of Prometheus and how it fits within a service-oriented architecture. This first post in the series covers the main concepts used in Prometheus: metrics and labels.

What are metrics and labels?

Metrics are a core concept of Prometheus. Instrumented systems expose them, and Prometheus stores them in its time-series database and makes them easy to query, so that we can understand how these systems behave over time.

In short, a metric is an identifier linking data points together over time. For example, the metric http_requests_total denotes all the data points collected by Prometheus for services exposing HTTP request counters. As there are likely to be multiple services exposing the same http_requests_total metric, labels can be added to each data point to specify which service this counter applies to:

# Request counter for the User Directory service
http_requests_total{service="users-directory"}

# Request counter for the Billing History Service
http_requests_total{service="billing-history"}

# Overall request counter regardless of service
sum(http_requests_total)
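
To make the idea of "metric name plus labels" concrete, here is a minimal Python sketch. It is a toy model, not Prometheus code: the series_key helper and the sample values are hypothetical and only illustrate that two data points with the same metric name but different labels belong to different time series.

```python
# Toy model: a time series is identified by its metric name plus its labels.
# series_key and the sample values below are hypothetical illustrations,
# not part of any Prometheus API.

def series_key(name, labels):
    """Build a hashable identity from a metric name and a dict of labels."""
    return (name, frozenset(labels.items()))

# Latest counter value for each series (made-up numbers).
samples = {
    series_key("http_requests_total", {"service": "users-directory"}): 120,
    series_key("http_requests_total", {"service": "billing-history"}): 45,
}

# Same metric name, different labels: two distinct series.
distinct_series = len(samples)

# Rough equivalent of sum(http_requests_total): add up every series
# that shares the metric name, ignoring the labels.
total = sum(
    value
    for (name, _labels), value in samples.items()
    if name == "http_requests_total"
)

print(distinct_series, total)
```

Because the labels are part of the key, adding a new label value creates a new series rather than updating an existing one.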

Chances are, we won’t have only one instance of each service, so another useful label would be an instance identifier. In a typical microservices architecture, the number of instances will vary and each instance will have a relatively short life span. As a result, queries will mostly aggregate across instances, but being able to distinguish between different instances is a powerful debugging tool:

# Instance specific
http_requests_total{service="users-directory", instance="1.2.3.4"}

# All instances of Users Directory service
sum(http_requests_total{service="users-directory"})

Using the power of labels

Labels in Prometheus are arbitrary and, as such, they can describe much more than just which service or instance exposed a metric. Continuing with the simple example of http_requests_total, services can be more descriptive about the requests being counted and expose things like the endpoint being used or the status code returned:

# Number of forbidden (403) GET requests to the /users/:id endpoint of the Users Directory service
sum(http_requests_total{service="users-directory", method="GET", endpoint="/users/:id", status="403"})

Augmenting metrics with good labels is key to getting the best out of Prometheus. Labels can be combined in a number of different ways, using matchers and aggregation operators, to answer a wide range of questions from all the data collected by Prometheus.

Filtering based on labels

As described in the above examples, it is possible to filter a metric based on the value of one of the labels:

# Only consider the Users Directory service
http_requests_total{service="users-directory"}

# Filter out successful requests
http_requests_total{status!="200"}

# Matching `region` label with regular expression
failed_logins_attempts_total{region=~"us-west-.*"}
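
The matchers used above (=, !=, =~, along with the negated regex matcher !~) can be sketched in a few lines of Python. This is only an illustration of the semantics: the matches helper and the sample label sets are hypothetical, and Python's re module is standing in for the regular expression syntax Prometheus uses, which differs in some details. The sketch assumes two behaviours of Prometheus matchers: regular expressions are anchored to the whole label value, and a missing label is treated like an empty value.

```python
import re

def matches(labels, matchers):
    """Return True if a label set satisfies every (label, op, value) matcher.

    Hypothetical helper for illustration; not a Prometheus API.
    """
    for label, op, value in matchers:
        actual = labels.get(label, "")  # a missing label behaves like ""
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            # fullmatch: the pattern must match the entire label value
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True

# Made-up label sets for three series.
series = [
    {"service": "users-directory", "status": "200", "region": "us-west-1"},
    {"service": "users-directory", "status": "500", "region": "eu-west-1"},
    {"service": "billing-history", "status": "403", "region": "us-west-2"},
]

# Like {status!="200"}
not_ok = [s for s in series if matches(s, [("status", "!=", "200")])]

# Like {region=~"us-west-.*"}
us_west = [s for s in series if matches(s, [("region", "=~", "us-west-.*")])]

print(len(not_ok), [s["region"] for s in us_west])
```

Several matchers in one selector are combined with a logical AND, which is why the helper returns False as soon as one of them fails.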

Aggregating labels

If a label is not specified in a query, the result contains one time series for every combination of label values that exists for that metric. In order to collapse these combinations (partially or fully), aggregation operators can be used:

# Total number of requests regardless of any labels
sum(http_requests_total)

# Average memory usage for each service
avg(memory_used_bytes) by (service)

# Startup time of the oldest instance of each service in each datacenter
min(startup_time_milliseconds) by (service, datacenter)
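
As a rough mental model of what "by" does, the following Python sketch groups samples by the listed labels and reduces each group with a function. The aggregate_by and mean helpers and the memory values are hypothetical; this only mirrors the grouping logic, not how Prometheus evaluates queries internally.

```python
from collections import defaultdict

def aggregate_by(samples, by, func):
    """Group (labels, value) samples by the `by` labels, then reduce each group.

    Hypothetical helper for illustration; not a Prometheus API.
    """
    groups = defaultdict(list)
    for labels, value in samples:
        # Only the labels listed in `by` survive the aggregation.
        key = tuple(labels.get(name, "") for name in by)
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}

def mean(values):
    return sum(values) / len(values)

# Made-up samples for memory_used_bytes.
memory_used_bytes = [
    ({"service": "users-directory", "instance": "1.2.3.4"}, 100.0),
    ({"service": "users-directory", "instance": "1.2.3.5"}, 300.0),
    ({"service": "billing-history", "instance": "1.2.3.6"}, 50.0),
]

# Like avg(memory_used_bytes) by (service): one result per service.
per_service = aggregate_by(memory_used_bytes, ["service"], mean)

# Like sum(memory_used_bytes): no labels kept, a single result.
overall = aggregate_by(memory_used_bytes, [], sum)

print(per_service)
print(overall)
```

With an empty "by" list every sample falls into the same group, which is the fully collapsed case; listing more labels keeps more of the original combinations apart.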

A word on label cardinality

Labels are really powerful, so it can be tempting to annotate each metric with very specific information. However, there are some important limitations on what should be used for labels.

Prometheus considers each unique combination of labels and label values as a different time series. As a result, if a label has an unbounded set of possible values, Prometheus will have a very hard time storing all these time series. In order to avoid performance issues, labels should not be used for high-cardinality data sets (e.g. unique customer IDs).
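
A quick back-of-the-envelope calculation shows why this matters. In the Python sketch below, the max_series helper and all the label counts are made-up assumptions; the result is an upper bound, since not every combination of values necessarily occurs in practice.

```python
from math import prod

def max_series(label_value_counts):
    """Upper bound on the number of series for one metric: the product of
    the number of distinct values of each label (hypothetical helper)."""
    return prod(label_value_counts.values())

# Assumed number of distinct values per label (illustrative only).
bounded = {"service": 10, "method": 5, "status": 8}

# The same metric with a per-customer label added.
with_customer = {**bounded, "customer_id": 100_000}

print(max_series(bounded))        # 400
print(max_series(with_customer))  # 40000000
```

Each extra label multiplies the number of potential series, so a single label with a large or unbounded set of values dominates everything else.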

What’s next?

In this first post, we went through the main building blocks of Prometheus: metrics, labels and basic operators to query these metrics. The few examples of metrics represented different types of data, such as counting the number of requests served or the amount of memory used at a given time. In the next post, we will dive into the 4 different types of Prometheus metrics (counters, gauges, histograms and summaries) and when to use them.

If there is any specific subject you would like me to cover in this series, feel free to reach out to me on Twitter at @PierreVincent