What is an Alert Rule?

A policy that turns a monitored signal into a decision to interrupt a human, defined by a condition, threshold, and window.

What is the purpose of an Alert Rule?

An alert rule fires when a monitored signal crosses a defined line, routing a notification to the person who can act on it. It connects monitoring to on-call response, firing only when human action is genuinely required.

How do you use an Alert Rule in product management?

Every alert rule must answer: what does this mean, what should I do, can this be automated? Avoid alert rules that fire but require no action, they erode trust. Design for low false positive rates. Route alerts by severity: P0/P1 → immediate page; P2 → Slack notification; P3 → ticket.

Where does the concept of an Alert Rule come from?

Alert rules formalise the conditions under which monitors trigger notifications. The evolution from simple threshold alerts to multi-condition, anomaly-based alerting reflects the maturation of observability tooling. Charity Majors' work on honeycomb.io popularised high-cardinality, exploratory alerting over threshold-based static rules.

What are common mistakes with an Alert Rule?

Thresholds set too sensitively flood the team with noise until genuine alerts are ignored, the textbook cause of alert fatigue. Alerting on raw symptoms with no link to user impact pages people for conditions that don't actually matter. Rules with no clear owner or runbook get acknowledged and forgotten, and an alert that fires with no actionable next step is just anxiety on a schedule.

🛎️

Alert Rule

Q: What is an example of an Alert Rule?

Alert: graph save failure rate > 2%: Condition: (failed_saves / total_saves) > 0.02 for 3 consecutive minutes. Severity: P1. Route: PagerDuty → graph-team on-call. Runbook: graph-api-high-error-rate. Suppression: mute during planned maintenance windows. Review: quarterly, adjust threshold if >3 false positives/month.

A rule triggering notifications on conditions

DevOps & PlatformOperations & Qualitytype: 'alert_rule'interface: BaseNode

View in Graph

▼On this page

Description Properties Lifecycle Relationships Graph Position Related Entities

Description

An alert rule is the policy that turns a monitored signal into a decision to interrupt a human. It names a condition, a threshold, and a window, and when reality crosses the line it fires.

See moreSee less

Origin & evolution

The canonical text is Rob Ewaschuk's "My Philosophy on Alerting", a Google doc that circulated internally around 2013 and later became the alerting chapter of the Site Reliability Engineering book. Its argument reframed the whole practice. Pages should be urgent, important, actionable, and real. Err on the side of removing noisy alerts, because over-monitoring is harder to recover from than under-monitoring. A page that a human cannot act on is a defect in the alert rule, not a feature.

From this came the symptom-versus-cause distinction. Ewaschuk argued for alerting on symptoms, the conditions users actually feel, and demoting causes to dashboards and context. Symptom-based alerts catch more real problems with fewer rules, since a single user-facing symptom can be produced by dozens of underlying causes you would otherwise have to enumerate one by one. As the SRE book puts it, in a layered system one person's symptom is another person's cause.

The newer refinement is SLO-based, burn-rate alerting, set out in the SRE Workbook. Here the rule fires on how fast a service is consuming its error budget, not on a raw threshold crossing. The recommended technique is multiwindow, multi-burn-rate: a fast short window and a slower long window must both breach before the rule fires. The long window resists transient spikes; the short window clears quickly once you fix the problem. Google's suggested starting points are paging when a service burns 2% of its monthly budget in an hour, or 5% in six hours.

How it works in practice

A team owns an API with a 99.9% availability SLO, giving roughly 43 minutes of allowed error per 30-day window. Their old rule paged whenever the error rate exceeded 1% for five minutes, and it fired most weeks on harmless blips. They replace it with a burn-rate rule: page only when the error budget burns at 14.4 times the sustainable rate across both a 5-minute and a 1-hour window.

The pager goes quiet on noise. A 90-second spike during a deploy no longer fires, because the 1-hour window stays calm. When a bad config change starts burning the budget for real, both windows trip within minutes and the on-call engineer is paged with the symptom already framed in budget terms. Every page now carries a linked runbook, so the responder opens a known procedure instead of starting from a blank terminal.

Alert rule vs. its neighbours

Monitor. A monitor observes continuously and reports a health signal. An alert rule is the thin layer of judgement on top that decides the signal warrants human attention. The monitor never sleeps; the alert rule chooses when to wake someone.
Incident. An alert rule firing is a hypothesis that something is wrong. An incident is the declared, coordinated response once that hypothesis is confirmed worth handling. Many alerts auto-resolve and never become incidents; conversely a major incident may be opened by a customer report before any rule fires.
Runbook. A runbook is the documented procedure a responder follows. The working rule, captured in the graph, is that every alert rule should point at one. An alert with no runbook leaves the responder with no documented procedure to follow.

In the graph

In the Unified Product Graph an alert rule sits between observation and response. It receives its signal through monitor_triggers_via_alert_rule and dispatches action through alert_rule_triggers_runbook. Modelling the rule as its own entity, separate from the monitor that feeds it and the runbook it invokes, makes the "every alert needs a runbook" discipline queryable: any alert rule with no outgoing alert_rule_triggers_runbook edge is a page waiting to land on someone with no instructions.

Preview

Presets

titlecondition

severity

Critical Page immediately; the signal is service-affecting.

notification_channelevaluation_windowescalation_policy

Alert Rule

Agent Action Quota Exceeded - Critical

SeverityCritical

Conditionagent_actions_consumed_pct > 95 for 5 minutes

Notification channel#ops-alerts Slack + PagerDuty on-call

Evaluation window5 minutes sustained above threshold before firing

Escalation policyPage on-call engineer immediately; escalate to Sam (staff eng) if unacknowledged after 10 minutes

Properties

Type-specific fields on BaseNode

conditionstring

Triggering query or expression. @example "avg(rate(http_errors_total[5m])) > 0.05"

severityenum

Routes notification and escalation. Uses the `LogLevel` scale (operational verbosity, distinct from user-impact `severity_5`). `critical` = page immediately. `warning` = notify, don't page. `info` = log only.

Log level scale →

critical

Critical

Page immediately; the signal is service-affecting.

warning

Warning

Needs attention soon; not yet service-affecting.

info

Info

Informational; no action required.

notification_channelstring

Notification destination. @example "pagerduty:sev1-rotation", "slack:#alerts-low"

evaluation_windowstring

Required duration the condition holds before firing. Prevents flapping on transient spikes. @example "5m", "15m", "1h"

escalation_policystring

Escalation behaviour on unacknowledged alerts. @example "Escalate to engineering lead after 10 minutes"

Inherited from BaseNode (6 fields)

idstringrequired

Unique identifier (UUID)

typeNodeTyperequired

Discriminator for the entity type

titlestringrequired

Display name

descriptionstring

Optional detailed description

statusstring

Lifecycle status

tagsstring[]

Freeform tags for filtering

Lifecycle

5 phases, initial: planning · template: OPERATIONAL

All lifecycles

Relationships

2 edge types connected to this entity.

Parents

Entities that can contain this type

Monitormonitor_triggers_via_alert_rule

Cross-References

Contextual links across the graph

Runbookalert_rule_triggers_runbook

Graph Position

1parent

🛎️Alert Rule

1cross-ref

Definition

An alert rule is a policy that converts a monitored signal into a human interrupt: a condition, threshold, and window that together decide when a page is warranted. It connects a monitor to an on-call rotation, routing the interrupt only when human action is actually needed.

Usage Guidance

Every alert rule must answer: what does this mean, what should I do, can this be automated?
Avoid alert rules that fire but require no action, they erode trust.
Design for low false positive rates.
Route alerts by severity: P0/P1 → immediate page; P2 → Slack notification; P3 → ticket.

Anti-Patterns

Thresholds set too sensitively flood the team with noise until genuine alerts are ignored, the textbook cause of alert fatigue.
Alerting on raw symptoms with no link to user impact pages people for conditions that don't actually matter.
Rules with no clear owner or runbook get acknowledged and forgotten, and an alert that fires with no actionable next step is just anxiety on a schedule.

Examples

Alert: graph save failure rate > 2%

Condition: (failed_saves / total_saves) > 0.02 for 3 consecutive minutes. Severity: P1. Route: PagerDuty → graph-team on-call. Runbook: graph-api-high-error-rate. Suppression: mute during planned maintenance windows. Review: quarterly, adjust threshold if >3 false positives/month.

Alert Rule

A rule triggering notifications on conditions

DevOps & PlatformOperations & Qualitytype: 'alert_rule'interface: BaseNode

View in Graph

▼On this page

Description Properties Lifecycle Relationships Graph Position Related Entities

Description

An alert rule is the policy that turns a monitored signal into a decision to interrupt a human. It names a condition, a threshold, and a window, and when reality crosses the line it fires.

See moreSee less

Origin & evolution

How it works in practice

Alert rule vs. its neighbours

Monitor. A monitor observes continuously and reports a health signal. An alert rule is the thin layer of judgement on top that decides the signal warrants human attention. The monitor never sleeps; the alert rule chooses when to wake someone.
Incident. An alert rule firing is a hypothesis that something is wrong. An incident is the declared, coordinated response once that hypothesis is confirmed worth handling. Many alerts auto-resolve and never become incidents; conversely a major incident may be opened by a customer report before any rule fires.
Runbook. A runbook is the documented procedure a responder follows. The working rule, captured in the graph, is that every alert rule should point at one. An alert with no runbook leaves the responder with no documented procedure to follow.

In the graph

Preview

Presets

titlecondition

severity

Critical Page immediately; the signal is service-affecting.

notification_channelevaluation_windowescalation_policy

Alert Rule

Agent Action Quota Exceeded - Critical

SeverityCritical

Conditionagent_actions_consumed_pct > 95 for 5 minutes

Notification channel#ops-alerts Slack + PagerDuty on-call

Evaluation window5 minutes sustained above threshold before firing

Escalation policyPage on-call engineer immediately; escalate to Sam (staff eng) if unacknowledged after 10 minutes

Properties

Type-specific fields on BaseNode

conditionstring

Triggering query or expression. @example "avg(rate(http_errors_total[5m])) > 0.05"

severityenum

Log level scale →

critical

Critical

Page immediately; the signal is service-affecting.

warning

Warning

Needs attention soon; not yet service-affecting.

info

Info

Informational; no action required.

notification_channelstring

Notification destination. @example "pagerduty:sev1-rotation", "slack:#alerts-low"

evaluation_windowstring

Required duration the condition holds before firing. Prevents flapping on transient spikes. @example "5m", "15m", "1h"

escalation_policystring

Escalation behaviour on unacknowledged alerts. @example "Escalate to engineering lead after 10 minutes"

Inherited from BaseNode (6 fields)

idstringrequired

Unique identifier (UUID)

typeNodeTyperequired

Discriminator for the entity type

titlestringrequired

Display name

descriptionstring

Optional detailed description

statusstring

Lifecycle status

tagsstring[]

Freeform tags for filtering

Lifecycle

5 phases, initial: planning · template: OPERATIONAL

All lifecycles

Relationships

2 edge types connected to this entity.

Parents

Entities that can contain this type

Monitormonitor_triggers_via_alert_rule

Cross-References

Contextual links across the graph

Runbookalert_rule_triggers_runbook

Graph Position

1parent

🛎️Alert Rule

1cross-ref

Definition

Usage Guidance

Every alert rule must answer: what does this mean, what should I do, can this be automated?
Avoid alert rules that fire but require no action, they erode trust.
Design for low false positive rates.
Route alerts by severity: P0/P1 → immediate page; P2 → Slack notification; P3 → ticket.

Anti-Patterns

Thresholds set too sensitively flood the team with noise until genuine alerts are ignored, the textbook cause of alert fatigue.
Alerting on raw symptoms with no link to user impact pages people for conditions that don't actually matter.
Rules with no clear owner or runbook get acknowledged and forgotten, and an alert that fires with no actionable next step is just anxiety on a schedule.

Examples

Alert: graph save failure rate > 2%