A rule triggering notifications on conditions
An alert rule is the policy that turns a monitored signal into a decision to interrupt a human. It names a condition, a threshold, and a window, and when reality crosses the line it fires. The hard part is rarely writing the condition. It is the discipline of firing only when someone genuinely needs to act, because an alert that wakes an engineer for nothing trains the whole team to ignore the next one.
The canonical text is Rob Ewaschuk's "My Philosophy on Alerting", a Google doc that circulated internally around 2013 and later became the basis for the "Monitoring Distributed Systems" chapter of the Site Reliability Engineering book. Its argument reframed the whole practice. Pages should be urgent, important, actionable, and real. Err on the side of removing noisy alerts, because over-monitoring is harder to recover from than under-monitoring. A page that a human cannot act on is a defect in the alert rule, not a feature.
From this came the symptom-versus-cause distinction. Ewaschuk argued for alerting on symptoms, the conditions users actually feel, and demoting causes to dashboards and context. Symptom-based alerts catch more real problems with fewer rules, since a single user-facing symptom can be produced by dozens of underlying causes you would otherwise have to enumerate one by one. As the SRE book puts it, in a layered system one person's symptom is another person's cause.
The newer refinement is SLO-based, burn-rate alerting, set out in the SRE Workbook. Here the rule fires on how fast a service is consuming its error budget, not on a raw threshold crossing. The recommended technique is multiwindow, multi-burn-rate: a fast short window and a slower long window must both breach before the rule fires. The long window resists transient spikes; the short window clears quickly once you fix the problem. Google's suggested starting points are paging when a service burns 2% of its monthly budget in an hour, or 5% in six hours.
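The 2% and 5% starting points translate into burn-rate multipliers with simple arithmetic. The sketch below assumes a 30-day budget period (720 hours); the function name `burn_rate_threshold` is illustrative and not taken from the SRE Workbook.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`.

    A burn rate of 1 spends exactly the whole budget over the full period.
    The 30-day default period is an assumption of this sketch.
    """
    return budget_fraction * period_hours / window_hours


# The two suggested paging thresholds: 2% in one hour, 5% in six hours.
fast_burn = burn_rate_threshold(0.02, window_hours=1)
slow_burn = burn_rate_threshold(0.05, window_hours=6)

print(f"2% in 1h -> burn rate {fast_burn:.1f}")
print(f"5% in 6h -> burn rate {slow_burn:.1f}")
```

Under the 30-day assumption, the two thresholds work out to burn rates of about 14.4 and 6.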
A team owns an API with a 99.9% availability SLO, giving roughly 43 minutes of allowed error per 30-day window. Their old rule paged whenever the error rate exceeded 1% for five minutes, and it fired most weeks on harmless blips. They replace it with a burn-rate rule: page only when the error budget burns at 14.4 times the sustainable rate across both a 5-minute and a 1-hour window.
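The figures in this example can be checked with a few lines of arithmetic. This is a sketch that assumes errors are measured as a fraction of requests and that traffic is steady; the helper names are hypothetical.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of total failure the SLO tolerates over the period."""
    return (1 - slo) * period_days * 24 * 60


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - slo)


SLO = 0.999
budget = error_budget_minutes(SLO)
old_rule_burn = burn_rate(0.01, SLO)   # the old rule's 1% error threshold
page_error_rate = 14.4 * (1 - SLO)     # error rate equivalent to a 14.4x burn

print(f"budget: {budget:.1f} minutes per 30 days")
print(f"1% errors burn at about {old_rule_burn:.0f}x the sustainable rate")
print(f"a 14.4x burn corresponds to about {page_error_rate:.2%} errors")
```

Seen this way, the new threshold (about 1.44% errors) is not much higher than the old 1% rule. Under these assumptions, most of the noise reduction comes from requiring the 1-hour window to breach as well.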
The pager goes quiet on noise. A 90-second spike during a deploy no longer fires, because the 1-hour window stays calm. When a bad config change starts burning the budget for real, both windows trip within minutes and the on-call engineer is paged with the symptom already framed in budget terms. Every page now carries a linked runbook, so the responder opens a known procedure instead of starting from a blank terminal.
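The behaviour described here can be illustrated with a small simulation. This is a minimal sketch, not a production evaluator: it assumes error ratios are sampled in 30-second buckets with equal traffic in each, and the spike and outage magnitudes (20% and 10% errors) are invented for illustration.

```python
SLO = 0.999
BURN_RATE_THRESHOLD = 14.4
BUCKET_SECONDS = 30  # assumed sampling interval


def window_burn_rate(error_ratios, window_seconds):
    """Mean error ratio over the trailing window, expressed as a burn rate."""
    buckets = window_seconds // BUCKET_SECONDS
    recent = error_ratios[-buckets:]
    mean_error = sum(recent) / len(recent)
    return mean_error / (1 - SLO)


def should_page(error_ratios):
    """Page only if both the 5-minute and the 1-hour window exceed the threshold."""
    short = window_burn_rate(error_ratios, 5 * 60)
    long = window_burn_rate(error_ratios, 60 * 60)
    return short > BURN_RATE_THRESHOLD and long > BURN_RATE_THRESHOLD


# One hour of history is 120 buckets of 30 seconds each.
deploy_spike = [0.0] * 117 + [0.2] * 3    # 90 seconds at 20% errors
bad_config = [0.0] * 100 + [0.1] * 20     # 10 minutes at 10% errors

print("deploy spike pages:", should_page(deploy_spike))
print("bad config pages:", should_page(bad_config))
```

In this toy data the spike trips the 5-minute window but leaves the 1-hour burn rate near 5, so no page is sent; the sustained failure pushes both windows past 14.4.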
A monitor observes continuously and reports a health signal. An alert rule is the thin layer of judgement on top that decides the signal warrants human attention. The monitor never sleeps; the alert rule chooses when to wake someone.
An alert is a hypothesis that something is wrong; an incident is the declared, coordinated response once that hypothesis is confirmed worth handling. Many alerts auto-resolve and never become incidents; conversely a major incident may be opened by a customer report before any rule fires.
In the Unified Product Graph an alert rule sits between observation and response. It receives its signal through the monitor_triggers_via_alert_rule edge (Monitor triggers via Alert Rule, hierarchy) and dispatches action through the alert_rule_triggers_runbook edge (Alert Rule triggers Runbook, cross-domain). Modelling the rule as its own entity, separate from the monitor that feeds it and the runbook it invokes, makes the "every alert needs a runbook" discipline queryable: any alert rule with no outgoing alert_rule_triggers_runbook edge is a page waiting to land on someone with no instructions.
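The "no runbook" query can be sketched over a toy in-memory graph. Everything below is hypothetical apart from the two edge type names: the node ids, the `type` values, and the tuple layout for edges are assumptions made for illustration, not the Unified Product Graph's actual storage or query API.

```python
# Hypothetical graph: nodes keyed by id, edges as (source, edge_type, target).
nodes = {
    "mon-1": {"type": "monitor", "title": "API availability"},
    "rule-1": {"type": "alert_rule", "title": "API budget burn"},
    "rule-2": {"type": "alert_rule", "title": "Queue depth high"},
    "rb-1": {"type": "runbook", "title": "API error budget burn"},
}
edges = [
    ("mon-1", "monitor_triggers_via_alert_rule", "rule-1"),
    ("mon-1", "monitor_triggers_via_alert_rule", "rule-2"),
    ("rule-1", "alert_rule_triggers_runbook", "rb-1"),
]


def rules_without_runbook(nodes, edges):
    """Return ids of alert rules with no outgoing alert_rule_triggers_runbook edge."""
    covered = {
        source
        for source, edge_type, _target in edges
        if edge_type == "alert_rule_triggers_runbook"
    }
    return sorted(
        node_id
        for node_id, node in nodes.items()
        if node["type"] == "alert_rule" and node_id not in covered
    )


print(rules_without_runbook(nodes, edges))  # ['rule-2']
```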
Type-specific fields on BaseNode
condition (string): Triggering query or expression. Example: "avg(rate(http_errors_total[5m])) > 0.05"
severity (string): Routes notification and escalation. `critical` = page immediately. `warning` = notify, don't page. `info` = log only.
notification_channel (string): Notification destination. Examples: "pagerduty:sev1-rotation", "slack:#alerts-low"
evaluation_window (string): Required duration the condition holds before firing. Prevents flapping on transient spikes. Examples: "5m", "15m", "1h"
escalation_policy (string): Escalation behaviour on unacknowledged alerts. Example: "Escalate to engineering lead after 10 minutes"
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
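The fields listed above can be sketched as a Python dataclass, together with the severity routing the `severity` field describes. This is an illustrative model, not the product's actual schema: the class name, the `"alert_rule"` discriminator value, the `delivery_for` helper, and the sample values for `id` and `title` are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlertRule:
    # Base fields; id, type and title are the required ones.
    id: str
    title: str
    type: str = "alert_rule"  # assumed discriminator value
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Type-specific fields.
    condition: Optional[str] = None
    severity: Optional[str] = None
    notification_channel: Optional[str] = None
    evaluation_window: Optional[str] = None
    escalation_policy: Optional[str] = None


def delivery_for(severity: str) -> str:
    """Map a severity to its delivery behaviour as described for the field."""
    routing = {"critical": "page", "warning": "notify", "info": "log"}
    return routing[severity]


rule = AlertRule(
    id="00000000-0000-0000-0000-000000000001",  # placeholder UUID
    title="API error rate",
    condition="avg(rate(http_errors_total[5m])) > 0.05",
    severity="critical",
    notification_channel="pagerduty:sev1-rotation",
    evaluation_window="5m",
    escalation_policy="Escalate to engineering lead after 10 minutes",
)
print(delivery_for(rule.severity))  # page
```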
5 phases · initial: planning · template: OPERATIONAL
2 edge types connected to this entity.
monitor_triggers_via_alert_rule
alert_rule_triggers_runbook