Instrumented applications provide a wealth of information on how they behave. In the previous parts of this blog series, the focus has been mostly on getting applications to expose their metrics and on how to query Prometheus to make sense of these metrics. This exploratory approach is extremely valuable to uncover unknown unknowns, either proactively (testing) or reactively (debugging).

Metrics can also help with things we already know and care about: by instrumenting those things and knowing what their normal state looks like, it becomes possible to alert on situations that are judged problematic. Prometheus makes this possible through the definition of alerting rules.

Important: examples in this post follow the new rules syntax from Prometheus 2.0. If you haven’t upgraded yet, you can refer to the Prometheus 1.x documentation.

Defining thresholds to alert on

Continuing the jobs queue example from Part 4, the following rule creates an alert if the total number of jobs in the queue is above 100. This threshold is obviously arbitrary (and the example simplistic); in practice it would need to be based on an understanding of what normal looks like for the metric used.

alert: Lots_Of_Jobs_In_Queue
expr: sum(jobs_in_queue) > 100
for: 5m
labels:
   severity: major

In the above example, the expr field specifies the metric query and the threshold above which this alert should fire, i.e. when the queue size is greater than 100.

The for field is used to delay the alert from triggering, in order to avoid spurious alerts when the threshold is only exceeded for a short period of time before returning to normal. In this case, 5m means that if the queue size goes over 100, the alert is set to Pending and remains that way for as long as the queue size stays over 100. If the queue recovers during that time, the alert is dropped without any notification. If it is still over 100 after 5 minutes, the alert is set to Firing and the relevant notifications are triggered.
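The Pending and Firing behaviour can be checked with a rule unit test. The sketch below assumes a Prometheus release that ships promtool test rules (added in a later 2.x release, not in 2.0 itself) and a hypothetical file alerts.yml containing the rule above wrapped in a rule group; the file name and the input values are illustrative.

```yaml
# queue_test.yml (hypothetical name), run with: promtool test rules queue_test.yml
rule_files:
  - alerts.yml              # assumed to contain the Lots_Of_Jobs_In_Queue rule in a group
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # One sample per minute, constant at 150 jobs, from 0m to 10m.
      - series: 'jobs_in_queue{service="billing-processing"}'
        values: '150+0x10'
    alert_rule_test:
      # After 3 minutes the threshold is exceeded but the alert is only Pending,
      # so no firing alert is expected.
      - eval_time: 3m
        alertname: Lots_Of_Jobs_In_Queue
        exp_alerts: []
      # After 6 minutes the condition has held for longer than "for: 5m",
      # so the alert is expected to be Firing.
      - eval_time: 6m
        alertname: Lots_Of_Jobs_In_Queue
        exp_alerts:
          - exp_labels:
              severity: major
```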

Prometheus itself does not send the actual alert messages to users; this is the responsibility of the Alertmanager (deployed independently). When an alert reaches the Firing state, Prometheus notifies the Alertmanager, which in turn routes the alert to the right channel (e.g. Slack, PagerDuty…) and to the right people. Alertmanager routing will be covered in the next post of this series.
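For reference, this is roughly what the Prometheus side of that hand-off looks like in prometheus.yml. It is a minimal sketch: the rules path and the Alertmanager host name are assumptions made for illustration, and only a static target is shown.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "rules/*.yml"                                # hypothetical location of the rule files
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Hypothetical host; 9093 is the Alertmanager default port.
            - "alertmanager.monitoring.intra:9093"
```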

A word on namespacing queries in alerting rules

As Prometheus stores all metrics together, the metric named jobs_in_queue could come from different services and represent something different in each context. Using sum(jobs_in_queue) could therefore create false positives if a new service suddenly started using the same metric name (without knowing it was already used by a different team).

In order to avoid this, it is best to filter the metric with a specific label for the service the alert relates to, e.g. sum(jobs_in_queue{service="billing-processing"}) > 100.

An alternative approach is to aggregate by the service label, e.g. sum(jobs_in_queue) by (service) > 100. Each service then gets its own alert, and the label is kept so it can be used later on to silence alerts.
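As a sketch, the per-service variant could be written as the rule below. It assumes every service exporting jobs_in_queue sets a service label, and that a single threshold of 100 makes sense for all of them (which may well not be the case).

```yaml
# Illustrative rule: one alert instance is created per value of the service label.
alert: Lots_Of_Jobs_In_Queue
# "by (service)" keeps the service label on the result, so it is attached to each alert.
expr: sum(jobs_in_queue) by (service) > 100
for: 5m
labels:
  severity: major
```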

Provide context to facilitate resolution

Nobody wants to be woken up in the middle of the night by a critical alert. Unfortunately, the complexity of distributed applications and the underlying infrastructure means that sometimes things go wrong. In order to facilitate the mitigation and resolution of an incident, it is crucial to provide the on-call person with as much context as possible, for example:

  • A short description of what was actually detected
  • The impact of the detected problem (Are customers affected? Are they about to be affected?)
  • Is there anything that can be done to quickly mitigate the issue? (while investigating for a more durable fix)
  • Dashboards with more in-depth information
  • A list of likely contributing causes
  • Steps to investigate

These things can be thought through quickly when the alert is defined and will make the on-call person’s life much easier. Even if you are both defining the alert and responding to it, will you remember the context at 4:00am, 6 months from now?

Prometheus alerting rules can be enriched with this kind of metadata using annotations:

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

As the above example shows, annotations can be used to provide important information up-front (summary and impact), while also linking to additional resources (dashboards and runbooks) that help the responder focus on the problem, as opposed to spending time searching wiki pages.
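Annotation values are templates, so they can also include the value that triggered the alert. The variant below is a sketch of the same rule with a templated summary; the wording of the summary is illustrative.

```yaml
alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
  severity: major
annotations:
  # {{ $value }} is replaced with the result of the expression when the rule is evaluated.
  summary: "Billing queue appears to be building up ({{ $value }} jobs waiting)"
  runbook: https://wiki.intra/runbooks/billing-queues-issues.html
```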

Grouping rules

Prometheus 2.0 introduces the concept of rule groups when configuring alerting rules (as well as recording rules, which are covered below):

groups:
- name: batch_jobs
  rules:
  - alert: Lots_Of_Jobs_In_Queue
    expr: ...
  - alert: Jobs_Taking_Longer_Than_Expected
    expr: ...
- name: frontend_health
  rules:
  - alert: Frontend_High_Latency
    expr: ...
  - alert: Frontend_High_ErrorRate
    expr: ...

All the rules in a group are processed sequentially, while all groups are processed in parallel. By default, rule processing follows the evaluation_interval duration from the Prometheus configuration, but each group can override this setting. This is useful if some rules are more expensive and should therefore be processed less often.
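A group overrides the evaluation frequency with its interval field. In this sketch, the 2m value is arbitrary and only there to show where the setting goes.

```yaml
groups:
- name: batch_jobs
  # Evaluate this group every 2 minutes instead of using the global
  # evaluation_interval from prometheus.yml.
  interval: 2m
  rules:
  - alert: Lots_Of_Jobs_In_Queue
    expr: sum(jobs_in_queue) > 100
    for: 5m
    labels:
      severity: major
```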

Difference between recording rules and alerting rules

In addition to alerting rules, Prometheus also allows the definition of recording rules. Recording rules are evaluated just like alerting rules but instead of creating notifications, they generate new metrics that can be queried.

These pre-calculated queries can be re-used in other rules (including alerting rules) and for dashboards, without the performance impact of repeating the query each time.

For example, the following group creates a new metric to pre-calculate the 99th percentile of the duration of each job_type, which can be reused to trigger alerts of different severity (without calculating the quantile twice).

groups:
- name: billing_jobs_health
  rules:
  - record: "job_type:billing_jobs_duration_seconds:99p5m"
    expr: histogram_quantile(0.99, sum(rate(jobs_duration_seconds_bucket{service="billing-processing"}[5m])) by (job_type))
  - alert: Billing_Processing_Very_Slow
    expr: "job_type:billing_jobs_duration_seconds:99p5m > 30"
    for: 5m
    labels:
       severity: critical
  - alert: Billing_Processing_Slow
    expr: "job_type:billing_jobs_duration_seconds:99p5m > 15"
    for: 5m
    labels:
       severity: major

The naming convention for recording rules differs from that of standard metrics: names should follow the pattern level:metric:operations. This is explained in more detail in the Prometheus best practices documentation.
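Applied to the queue example from earlier in this post, the pattern could look like the sketch below. The group name and the recorded metric name are illustrative choices, not something prescribed by Prometheus.

```yaml
groups:
- name: queue_aggregates        # hypothetical group name
  rules:
  # level = service (the label kept by the aggregation),
  # metric = jobs_in_queue, operations = sum
  - record: service:jobs_in_queue:sum
    expr: sum(jobs_in_queue) by (service)
  # Rules in a group run sequentially, so this alert can reuse the metric recorded above.
  - alert: Lots_Of_Jobs_In_Queue
    expr: service:jobs_in_queue:sum > 100
    for: 5m
    labels:
      severity: major
```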

What’s next?

In this post, we focused on how to define alerting rules using metrics and thresholds, as well as how to make these alerts understandable and helpful. However, alerting on the right things is complicated! So much so that it would probably deserve its own post(s). I might get to it at some point, but in the meantime, there are excellent resources on the subject.

As explained earlier in this post, Prometheus is only responsible for rule evaluation and does not actually send alerts. For more information about sending and routing alerts, you can refer to the Alertmanager Configuration documentation.