What is the purpose of an On Call Rotation?

An on-call rotation converts a collective duty to keep things running into a specific, time-boxed assignment with a pager attached. It ensures production is always monitored by someone responsible, providing coverage around the clock while bounding the load on any one engineer.

How do you use an On Call Rotation in product management?

Rotate weekly to distribute burden fairly. Include shadow shifts for engineers new to on-call. Define escalation paths for every service. Track on-call burden: if a rotation generates more than 2 pages per week, improve the alerting or the system's reliability. Compensate on-call appropriately.

Where does the concept of an On Call Rotation come from?

On-call rotations formalised the responsibility for production systems in the SRE model popularised by Google. The practice of rotating responsibility across engineers (rather than having a dedicated ops team) became central to DevOps culture, empowering teams to own their services end to end.

What are common mistakes with an On Call Rotation?

A rotation that leans on a single hero who fields most pages creates a bus-factor risk and burns that person out, while everyone else's skills atrophy. Scheduling on-call with no clear escalation path leaves the on-call engineer stranded when an incident exceeds their authority or knowledge. Treating on-call as free labour, with no compensation, recovery time, or load-shedding when pages spike, drives attrition. A rotation without a maintained runbook forces every responder to rediscover the same fixes under pressure at three in the morning.


On-Call Rotation

Q: What is an On Call Rotation?

The schedule that names who is responsible for a system's health right now and who to escalate to next.

Q: What is an example of an On Call Rotation?

Graph Platform on-call rotation. Rotation: weekly, 6 engineers. Primary: paged first; escalation to secondary after 5 min. Hours: 24/7 for P0/P1; business hours only for P2+. Tools: PagerDuty → Slack #incidents. Average pages/week this quarter: 1.4 (target: <2).

A rotation schedule for incident response

DevOps & Platform · Operations & Quality · type: 'on_call_rotation' · interface: BaseNode


Description

An on-call rotation is the schedule that names who is responsible for a system's health right now and who to escalate to next. It converts a collective duty to "keep things running" into a specific, time-boxed assignment with a pager attached. A rotation provides coverage around the clock while bounding the load on any one engineer.


Origin & evolution

On-call long predates software, but Google's Site Reliability Engineering book (2016) turned it from a tax into an engineering discipline. The central move was treating operational load as a measurable quantity with a budget. The same body of work defined toil: manual, repetitive, automatable work that scales with system size and produces no lasting value. Google's SRE teams cap toil at 50% of each engineer's time, reserving the rest for engineering that reduces future toil.

That budget sets hard limits on rotation design. The SRE book gives the arithmetic: in a 6-person single-site rotation, two of every six weeks go to on-call and interrupt handling, putting the toil floor at roughly 33%. Push to 8 people and the floor drops to 25%. The principle is that a rotation too small to honour these bounds is structurally unsustainable, and the fix is more people or less to carry, not more stamina.
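
The arithmetic above can be checked in a few lines. This is a minimal sketch, assuming each engineer spends two weeks of every cycle on on-call and interrupt handling and that a cycle lasts as many weeks as there are engineers; the function name `toil_floor` is illustrative and does not come from the SRE book.

```python
from fractions import Fraction


def toil_floor(team_size: int, on_call_weeks_per_cycle: int = 2) -> Fraction:
    """Lower bound on the share of each engineer's time spent on call.

    Assumes every engineer serves `on_call_weeks_per_cycle` weeks out of
    each `team_size`-week cycle.
    """
    if team_size <= 0 or team_size < on_call_weeks_per_cycle:
        raise ValueError("team is too small to staff the rotation")
    return Fraction(on_call_weeks_per_cycle, team_size)


for size in (6, 8):
    print(f"{size} engineers: {float(toil_floor(size)):.0%} toil floor")
```

Running the loop reproduces the 33% and 25% figures, and trying smaller team sizes shows how quickly the floor approaches the 50% cap.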

The PagerDuty Operations Guide codifies the practitioner consensus on that structure: a three-tier escalation ladder of primary, backup, and whole-team, with a recommended five-minute timeout before escalating from one tier to the next. It also treats shift handoff as a first-class obligation — the outgoing responder is expected to brief the incoming one on any outstanding issues — on the grounds that context loss at the boundary is itself a reliability risk. The same guide makes psychological safety explicit: engineers should escalate without hesitation or shame, because a culture that penalises asking for help inflates response times and hides systemic load.
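
The escalation ladder described above can be sketched as a lookup on how long a page has gone unacknowledged. This is an illustration only, assuming a uniform five-minute timeout at every tier; the names `TIERS` and `tier_to_page` are hypothetical and are not part of any PagerDuty API.

```python
from datetime import timedelta

# Tiers in paging order, following the three-tier ladder described above.
TIERS = ("primary", "backup", "whole-team")
ESCALATION_TIMEOUT = timedelta(minutes=5)  # assumed to be the same for every tier


def tier_to_page(unacknowledged_for: timedelta) -> str:
    """Return the tier that should hold the page after this much silence."""
    if unacknowledged_for < timedelta(0):
        raise ValueError("elapsed time cannot be negative")
    # Number of whole timeouts that have elapsed without an acknowledgement.
    steps = unacknowledged_for // ESCALATION_TIMEOUT
    # The last tier keeps the page once the ladder is exhausted.
    return TIERS[min(steps, len(TIERS) - 1)]
```

For example, `tier_to_page(timedelta(minutes=7))` returns `"backup"`, because one full timeout has passed without an acknowledgement.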

Multi-site teams gave the field follow-the-sun: two or three teams spread across time zones, each covering its own daylight hours, so nobody works night shifts. Inside a shift, the standard shape is primary and secondary, the secondary acting as backstop and escalation path when the primary is saturated or unreachable. The wider industry layered structured escalation policies and pager-load caps on top, but the underlying stance held: on-call must be bounded, paid attention to, and continuously made cheaper through automation.

How it works in practice

A platform team of six engineers runs a single weekly primary rotation. Each engineer is on call one week in six. Over a quarter the team tracks pages and finds that the primary is woken at night an average of four times per on-call week, well past the threshold where sleep and judgement degrade.

They act on the signal rather than absorbing it. Two of the most frequent pages trace to a disk-fill alert and a flapping health check, both fixable by automation. After a sprint spent clearing those, weekly pages drop from a dozen to three, and follow-the-sun coverage with a sister team in another time zone removes night shifts entirely. The rotation becomes sustainable because the team treats pager load as a defect rate to drive down, not weather to endure.
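
Finding the pages worth automating away is mostly a counting exercise. The sketch below ranks alerts by how often they paged; the alert names and the sample log are made up for illustration, and a real team would export this data from its paging tool.

```python
from collections import Counter


def noisiest_alerts(pages: list[str], top: int = 2) -> list[tuple[str, int]]:
    """Rank alert names by how often they paged, most frequent first."""
    return Counter(pages).most_common(top)


# Hypothetical page log for one on-call week (alert names are invented).
week_pages = [
    "disk_fill", "health_check_flap", "disk_fill", "disk_fill",
    "health_check_flap", "cert_expiry", "disk_fill", "health_check_flap",
]
print(noisiest_alerts(week_pages))
```

With this sample log the two alerts at the top of the ranking account for seven of the eight pages, which is the kind of concentration that makes a single clean-up sprint pay off.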

On-call rotation vs. its neighbours

Incident. An incident is a single declared disruption with a start, a response, and a resolution. The on-call rotation is the standing structure that decides who responds when an incident is declared. The rotation exists whether or not anything is on fire; the incident is the fire.
Team. A team is the durable group that owns a system. The on-call rotation is one specific obligation that group fulfils, organised as a schedule. A team can run several rotations (primary, secondary, an escalation tier), which is why the assignment is modelled separately from the group itself.
Runbook. A runbook is the procedure a responder follows once paged. The rotation determines who holds the pager; the runbook tells that person what to do. Strong rotations are paired with runbooks so the duty does not depend on which individual happens to be on call.

In the graph

In the Unified Product Graph an on-call rotation is an operations entity that attaches coverage to what it protects. A product connects through product_covered_by_on_call_rotation, and an infrastructure component through infrastructure_component_covered_by_on_call_rotation. Those edges make a quiet but useful query possible: which products and which components have nobody on call. A system in production with no covering rotation is a gap the graph surfaces before the pager would.
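
The coverage-gap query can be sketched over a plain in-memory graph. The two edge type names come from this page; everything else is an assumption made for illustration: the node type names `product` and `infrastructure_component`, the `(source_id, edge_type, target_id)` edge shape with the covered entity as the source, and the sample data. The graph's real query interface is not shown here.

```python
COVERAGE_EDGES = {
    "product_covered_by_on_call_rotation",
    "infrastructure_component_covered_by_on_call_rotation",
}
COVERABLE_TYPES = {"product", "infrastructure_component"}  # assumed type names


def uncovered(nodes: dict[str, str], edges: list[tuple[str, str, str]]) -> list[str]:
    """Return ids of products and components with no covering rotation.

    `nodes` maps node id to node type. `edges` holds
    (source_id, edge_type, target_id) tuples, with the covered entity
    as the source (an assumed direction).
    """
    covered = {src for src, edge_type, _ in edges if edge_type in COVERAGE_EDGES}
    return sorted(
        node_id
        for node_id, node_type in nodes.items()
        if node_type in COVERABLE_TYPES and node_id not in covered
    )


# Hypothetical sample graph.
nodes = {
    "checkout": "product",
    "search": "product",
    "postgres-main": "infrastructure_component",
    "core-rotation": "on_call_rotation",
}
edges = [("checkout", "product_covered_by_on_call_rotation", "core-rotation")]
print(uncovered(nodes, edges))
```

Here only `checkout` has a covering edge, so `search` and `postgres-main` are reported as gaps.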

Preview


On Call Rotation

Trellis Core Engineering On-Call Rotation

Rotation cadence: weekly

Schedule: Week 1: Sam (staff eng). Week 2: Ana (backend). Week 3: Priya (platform). Week 4: Sam (staff eng). Repeats on a 3-person cycle.

Escalation policy: Primary unresponsive after 10 minutes: page secondary on the roster. Secondary unresponsive after 10 minutes: page Sam directly. Sev1 incidents skip to Sam immediately.

Handoff time: 09:00 Monday UTC
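
A schedule like this one resolves to a name with simple date arithmetic. The sketch below assumes an anchor date for the start of a "Week 1" shift, which the example does not specify; `CYCLE_START` and `primary_on_call` are illustrative names.

```python
from datetime import datetime, timedelta, timezone

ROSTER = ["Sam", "Ana", "Priya"]  # order taken from the schedule above
# Assumed anchor: a Monday 09:00 UTC handoff that begins a "Week 1" shift.
CYCLE_START = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)


def primary_on_call(at: datetime) -> str:
    """Return the primary for the shift containing `at` (must be timezone-aware)."""
    # Whole weeks since the anchor handoff; floor division also handles
    # instants before the anchor correctly.
    weeks_elapsed = (at - CYCLE_START) // timedelta(weeks=1)
    return ROSTER[weeks_elapsed % len(ROSTER)]
```

Because the count is measured from the handoff instant, a page at 08:59 UTC on a Monday still belongs to the outgoing engineer.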

Properties

Type-specific fields on BaseNode

schedule (string)

Human-readable schedule of who is on call when. Examples: "Weekly rotation, Monday 09:00 UTC handoff", "Follow-the-sun (US, EU, APAC)"

escalation_policy (string)

Escalation when the primary doesn't respond. Example: "5 min to respond, then escalate to secondary. 10 min to secondary, then page engineering lead."

rotation_cadence (enum)

Cycle cadence. `weekly` for standard team rotations. `daily` for high-incident-volume teams.

Values: daily, weekly, biweekly, custom

handoff_time (string)

Shift handoff time. Affects team coordination and sleep. Examples: "09:00 UTC", "17:00 local"

Inherited from BaseNode (6 fields)

id (string, required)

Unique identifier (UUID)

type (NodeType, required)

Discriminator for the entity type

title (string, required)

Display name

description (string)

Optional detailed description

status (string)

Lifecycle status

tags (string[])

Freeform tags for filtering
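
For readers who want to handle these nodes in code, the fields above can be mirrored in a small typed model. This is a sketch, not the graph's actual schema definition: the class names are hypothetical, and the type-specific fields are treated as optional because this page marks only `id`, `type`, and `title` as required.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class RotationCadence(str, Enum):
    DAILY = "daily"
    WEEKLY = "weekly"
    BIWEEKLY = "biweekly"
    CUSTOM = "custom"


@dataclass
class OnCallRotation:
    # Inherited from BaseNode; only id, type, and title are marked required.
    id: str
    title: str
    type: str = "on_call_rotation"
    description: Optional[str] = None
    status: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    # Type-specific fields, treated as optional here (an assumption).
    schedule: Optional[str] = None
    escalation_policy: Optional[str] = None
    rotation_cadence: Optional[RotationCadence] = None
    handoff_time: Optional[str] = None


rotation = OnCallRotation(
    id="00000000-0000-0000-0000-000000000000",  # placeholder UUID
    title="Trellis Core Engineering On-Call Rotation",
    rotation_cadence=RotationCadence("weekly"),
    handoff_time="09:00 Monday UTC",
)
```

Using an enum for the cadence means a value outside the four listed options is rejected when the object is built, not discovered later.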

Lifecycle

5 phases, initial: planning · template: OPERATIONAL


Relationships

2 edge types connected to this entity.

Parents

Entities that can contain this type

Product: product_covered_by_on_call_rotation

Infrastructure Component: infrastructure_component_covered_by_on_call_rotation

Graph Position

2 parents

On Call Rotation

Definition

An on-call rotation is a schedule naming who responds to incidents for a service at any given time and who to escalate to next. It connects alert rules to the human responders they page.

Usage Guidance

Rotate weekly to distribute burden fairly.
Include shadow shifts for engineers new to on-call.
Define escalation paths for every service.
Track on-call burden: if a rotation generates more than 2 pages per week, improve the alerting or the system's reliability.
Compensate on-call appropriately.
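
The burden check in the guidance above can be automated against a page log. The sketch below buckets pages by ISO week, which starts on Monday at midnight; that is a simplification, since a real rotation would bucket by its own handoff time. The function name and sample dates are illustrative.

```python
from collections import Counter
from datetime import date

PAGES_PER_WEEK_LIMIT = 2  # threshold from the guidance above


def weeks_over_limit(
    page_dates: list[date], limit: int = PAGES_PER_WEEK_LIMIT
) -> dict[tuple[int, int], int]:
    """Return {(ISO year, ISO week): page count} for weeks exceeding the limit."""
    per_week: Counter = Counter()
    for d in page_dates:
        iso = d.isocalendar()
        per_week[(iso[0], iso[1])] += 1
    return {week: n for week, n in sorted(per_week.items()) if n > limit}


# Hypothetical page log: three pages in one week, one in the next.
pages = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 8)]
print(weeks_over_limit(pages))
```

Only the first week is reported, since the second stays within the limit.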

Anti-Patterns

A rotation that leans on a single hero who fields most pages creates a bus-factor risk and burns that person out, while everyone else's skills atrophy.
Scheduling on-call with no clear escalation path leaves the on-call engineer stranded when an incident exceeds their authority or knowledge.
Treating on-call as free labour, with no compensation, recovery time, or load-shedding when pages spike, drives attrition.
A rotation without a maintained runbook forces every responder to rediscover the same fixes under pressure at three in the morning.

Examples

Graph Platform on-call rotation

Rotation: weekly, 6 engineers. Primary: paged first; escalation to secondary after 5 min. Hours: 24/7 for P0/P1; business hours only for P2+. Tools: PagerDuty → Slack #incidents. Average pages/week this quarter: 1.4 (target: <2).