A rotation schedule for incident response
An on-call rotation is the schedule that names who is responsible for a system's health right now and who to escalate to next. It converts a vague collective duty to "keep things running" into a specific, time-boxed assignment with a pager attached. The tension at its heart is human: a rotation has to provide coverage around the clock while keeping the cost to any one engineer low enough that good people will keep doing it.
On-call long predates software, but Google's Site Reliability Engineering book (2016) turned it from a tax into an engineering discipline. The central move was treating operational load as a measurable quantity with a budget. The same body of work defined toil: manual, repetitive, automatable work that scales with system size and produces no lasting value. Google's SRE teams cap toil at 50% of each engineer's time, reserving the rest for engineering that reduces future toil.
That budget sets hard limits on rotation design. The SRE book gives the arithmetic: in a 6-person single-site rotation, two of every six weeks go to on-call and interrupt handling, putting the toil floor at roughly 33%. Push to 8 people and the floor drops to 25%. The principle is that a rotation too small to honour these bounds is structurally unsustainable, and the fix is more people or less to carry, not more stamina.
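The arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming each engineer spends two weeks per cycle on on-call and interrupt handling, as in the figures quoted; the function name `toil_floor` is hypothetical.

```python
from fractions import Fraction

TOIL_CAP = Fraction(1, 2)  # the 50% toil cap described above


def toil_floor(rotation_size: int, on_call_weeks_per_cycle: int = 2) -> Fraction:
    """Lower bound on the share of an engineer's time that goes to on-call.

    Assumes one cycle lasts `rotation_size` weeks and each engineer spends
    `on_call_weeks_per_cycle` of them on on-call and interrupt handling.
    """
    if rotation_size < 1 or rotation_size < on_call_weeks_per_cycle:
        raise ValueError("rotation is too small to cover the on-call weeks")
    return Fraction(on_call_weeks_per_cycle, rotation_size)


for size in (6, 8):
    floor = toil_floor(size)
    print(f"{size} people: floor {float(floor):.0%}, within cap: {floor <= TOIL_CAP}")
```

A 3-person rotation under the same assumption has a floor of two thirds, which already exceeds the cap before any other toil is counted.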
Multi-site teams gave the field follow-the-sun: two or three teams spread across time zones, each covering its own daylight hours, so nobody works night shifts. Inside a shift, the standard shape is primary and secondary, with the secondary acting as backstop and escalation path when the primary is saturated or unreachable. The wider industry layered structured escalation policies and pager-load caps on top, but the underlying stance held: on-call must be bounded, measured, and continuously made cheaper through automation.
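As a rough sketch of follow-the-sun routing, the function below picks the site whose shift covers a given instant. The three site names and their UTC start hours are hypothetical, chosen only so that each shift falls roughly in that region's working day.

```python
from datetime import datetime, timezone

# Hypothetical sites, sorted by shift start hour in UTC.
# Each shift runs until the next site's start hour.
SITES = [("APAC", 0), ("EU", 8), ("US", 16)]


def site_on_call(at: datetime) -> str:
    """Return the name of the site whose shift covers the instant `at`."""
    if at.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = at.astimezone(timezone.utc).hour
    # Default to the last site so coverage wraps past midnight
    # when the first shift of the day starts after 00:00 UTC.
    covering = SITES[-1][0]
    for name, start_hour in SITES:
        if start_hour <= hour:
            covering = name
    return covering


print(site_on_call(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)))
```

A real schedule would also need handoff overlap and holiday handling, which this sketch leaves out.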
A platform team of six engineers runs a single weekly primary rotation. Each engineer is on call one week in six. Over a quarter the team tracks pages and finds the primary is woken an average of four times per night-shift week, well past the threshold where sleep and judgement degrade.
They act on the signal rather than absorbing it. Two of the most frequent pages trace to a disk-fill alert and a flapping health check, both fixable by automation. After a sprint clearing those, weekly pages drop from a dozen to three, and follow-the-sun coverage with a sister team in another time zone removes night shifts entirely. The rotation becomes sustainable because the team treats pager load as a defect rate to drive down, not weather to endure.
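Treating pager load as a defect rate starts with counting pages by cause. The sketch below uses a hypothetical page log; the alert names and per-alert counts are invented for illustration, chosen only to match the dozen-to-three totals in the example above.

```python
from collections import Counter

# Hypothetical log for one on-call week: one alert name per page.
pages = (
    ["disk_fill"] * 5
    + ["health_check_flap"] * 4
    + ["cert_expiry", "queue_backlog", "latency_slo"]
)


def top_offenders(page_log, n=2):
    """Return the n most frequent alerts as (name, count) pairs."""
    return Counter(page_log).most_common(n)


def remaining_after_fix(page_log, fixed_alerts):
    """Count the pages that would still fire once `fixed_alerts` are automated away."""
    return sum(1 for alert in page_log if alert not in fixed_alerts)


offenders = top_offenders(pages)
fixed = {name for name, _ in offenders}
print(offenders)
print(len(pages), "->", remaining_after_fix(pages, fixed))
```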
An incident is a single declared disruption with a start, a response, and a resolution. The on-call rotation is the standing structure that decides who responds when an incident is declared. The rotation exists whether or not anything is on fire; the incident is the fire.
A team is the durable group that owns a system. The on-call rotation is one specific obligation that group fulfils, organised as a schedule. A team can run several rotations (primary, secondary, an escalation tier), which is why the assignment is modelled separately from the group itself.
In the Unified Product Graph an on-call rotation is an operations entity that attaches coverage to what it protects. A product connects through the hierarchy edge `product_covered_by_on_call_rotation` (Product covered by On Call Rotation), and an infrastructure component (a server, a CDN, and so on) through the hierarchy edge `infrastructure_component_covered_by_on_call_rotation` (Infrastructure Component covered by On Call Rotation). Those edges make a quiet but useful query possible: which products and which components have nobody on call. A system in production with no covering rotation is an outage waiting for an audience, and the graph surfaces that gap before the pager would.
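The coverage-gap query can be sketched over a plain in-memory graph. Only the two edge names come from the schema; the dict shapes, the `source` and `target` keys, and the node type strings `product` and `infrastructure_component` are assumptions made for illustration, not the graph's actual API.

```python
COVERAGE_EDGES = {
    "product_covered_by_on_call_rotation",
    "infrastructure_component_covered_by_on_call_rotation",
}
# Assumed node type strings for the entities that need coverage.
COVERABLE_TYPES = {"product", "infrastructure_component"}


def uncovered(nodes, edges):
    """Return ids of products and components with no covering rotation.

    nodes: iterable of {"id": ..., "type": ...}
    edges: iterable of {"type": ..., "source": ..., "target": ...},
           where `source` is the covered entity and `target` the rotation.
    """
    covered = {e["source"] for e in edges if e["type"] in COVERAGE_EDGES}
    return [
        n["id"]
        for n in nodes
        if n["type"] in COVERABLE_TYPES and n["id"] not in covered
    ]


# Hypothetical example data.
nodes = [
    {"id": "checkout", "type": "product"},
    {"id": "cdn-edge", "type": "infrastructure_component"},
    {"id": "platform-primary", "type": "on_call_rotation"},
]
edges = [
    {
        "type": "product_covered_by_on_call_rotation",
        "source": "checkout",
        "target": "platform-primary",
    },
]
print(uncovered(nodes, edges))
```

Here the component `cdn-edge` has no coverage edge, so it is the only id reported.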
Type-specific fields on BaseNode
schedule (string): Human-readable schedule of who is on call when. Examples: "Weekly rotation, Monday 09:00 UTC handoff", "Follow-the-sun (US, EU, APAC)"
escalation_policy (string): Escalation when the primary doesn't respond. Example: "5 min to respond, then escalate to secondary. 10 min to secondary, then page engineering lead."
rotation_cadence (string): Cycle cadence. `weekly` for standard team rotations. `daily` for high-incident-volume teams.
handoff_time (string): Shift handoff time. Affects team coordination and sleep. Examples: "09:00 UTC", "17:00 local"
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
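For readers who prefer code to a field list, the fields above can be mirrored as a Python dataclass. This is an illustrative sketch, not the graph's own type definition: the class name `OnCallRotation` and the discriminator value `"on_call_rotation"` are assumptions, and `NodeType` is simplified to a plain string.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OnCallRotation:
    # Required base fields.
    id: str
    type: str  # NodeType discriminator, simplified to a string here
    title: str
    # Optional base fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields.
    schedule: Optional[str] = None
    escalation_policy: Optional[str] = None
    rotation_cadence: Optional[str] = None  # "weekly" or "daily"
    handoff_time: Optional[str] = None


rotation = OnCallRotation(
    id=str(uuid.uuid4()),
    type="on_call_rotation",  # assumed discriminator value
    title="Platform primary",
    schedule="Weekly rotation, Monday 09:00 UTC handoff",
    escalation_policy=(
        "5 min to respond, then escalate to secondary. "
        "10 min to secondary, then page engineering lead."
    ),
    rotation_cadence="weekly",
    handoff_time="09:00 UTC",
)
print(rotation.title, rotation.rotation_cadence, rotation.handoff_time)
```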
Lifecycle: 5 phases · initial: planning · template: OPERATIONAL
2 edge types connected to this entity.
product_covered_by_on_call_rotation
infrastructure_component_covered_by_on_call_rotation