What is the purpose of an Incident?

An incident is the moment a system stops meeting its commitments to users. Declared when a service is interrupted or a customer-impacting event occurs, it triggers a coordinated response process and a record that can later be reviewed in a postmortem.

How do you use an Incident in product management?

Define clear incident severity levels (P0–P3) with response time expectations. Use an incident commander role. Run incidents in a dedicated Slack channel or incident management tool. Communicate status updates to users every 30 minutes during active incidents. Transition to postmortem within 48 hours of resolution.

Where does the concept of an Incident come from?

Incident management as a formal discipline comes from ITIL (IT Infrastructure Library), developed by the UK government in the 1980s. The modern software industry adapted it into lighter, blameless frameworks. PagerDuty's incident management guide and Atlassian's incident handbook became widely referenced resources.

What are common mistakes with an Incident?

Declaring an incident resolved when the symptom clears but the root cause remains guarantees the same failure returns. Running a blameful post-incident review drives the honest detail underground and stops the team learning. Skimping on severity classification means a minor blip and a full outage get the same (or the wrong) response. And closing an incident without tracked follow-up actions turns the painful lesson into a story nobody acts on.

🚨 Incident

Q: What is an Incident?

An unplanned disruption to a service, or a degradation serious enough to demand a coordinated response.

Q: What is an example of an Incident?

INC-2026-014: Graph API timeout storm, 47-min outage. Severity: P1. Start: 2026-01-15 09:14 UTC. Resolved: 10:01 UTC. Impact: 100% of graph saves failing. Root cause: cascading timeout from upstream Supabase query regression. Error budget consumed: 47 min (exceeding monthly budget). Postmortem: completed 2026-01-17.

The moment a service stops meeting its commitments, at a threshold serious enough to demand a coordinated response.

DevOps & Platform | Operations & Quality | type: 'incident' | interface: BaseNode

Description

An incident is an unplanned disruption to a service, or a degradation serious enough to demand a coordinated response. It is the moment the system stops meeting its commitments.

Origin & evolution

Two lineages converge on the modern incident. The operational definition came from IT service management: ITIL framed an incident as an unplanned interruption to a service or a reduction in its quality, and built a lifecycle of detection, logging, categorisation, and resolution around it. That gave the word a precise, ticketed meaning inside enterprise IT.

The response discipline came from somewhere harder. The Incident Command System emerged from 1970s California wildfire response, where uncoordinated agencies fighting the same fire kept failing for organisational reasons rather than tactical ones. It defined a clear command hierarchy and a shared vocabulary so that strangers could coordinate under pressure. Google adapted it for software as Incident Management at Google (IMAG), described in the Site Reliability Engineering book (O'Reilly, 2016) and its companion, The Site Reliability Workbook (O'Reilly, 2018). IMAG is centred on the three Cs: coordinate, communicate, control, with a named incident commander running the response and engineers freed to fix the problem.

The thinking has moved from minimising incidents to learning from them. The mature view treats incidents as inevitable in complex systems and routine in well-run ones, valued for the signal they carry, with severity levels triggering proportionate response rather than uniform panic.

Gene Kim, Kevin Behr, and George Spafford, in *The Phoenix Project*, offer a complementary lens: incidents are the visible surface of what they call unplanned work, the fourth and most corrosive type of IT work alongside business projects, internal IT projects, and changes. By that framing, an incident is not merely a service disruption but evidence that earlier work — a rushed deployment, a deferred fix, an untested dependency — has come due. Their argument is that unplanned work is self-compounding: it displaces planned work, which creates more defects, which produce more incidents. Reducing incident volume therefore requires surfacing and addressing the upstream conditions, not just responding faster at the moment of failure.

How it works in practice

At 02:14 a checkout error-rate monitor fires. An on-call engineer acknowledges, sees that errors crossed 5 percent, and declares a Sev2. They take the incident commander role, open a dedicated channel, and pull in a database specialist as operations lead. The commander does not debug; they coordinate, post updates every fifteen minutes, and decide.

Investigation points at a deployment from 01:50. The team rolls it back, error rate recovers by 02:31, and the incident is downgraded then closed. Seventeen minutes of elevated errors are recorded against the quarter's error budget. The next morning, because this crossed the severity bar, the incident triggers a postmortem. The fix took minutes; the value is in what the postmortem extracts from those minutes.

Incident vs. its neighbours

Bug is a defect in the code, latent until something triggers it. An incident is the live, customer-affecting event. A bug can sit unnoticed for a year; the incident is the day it takes down checkout. One bug can cause many incidents.
Root cause is the underlying condition the investigation lands on. The incident is the surface event; the root cause is the explanation beneath it. They are deliberately separate, because one incident can have several contributing causes.
Postmortem is the written analysis produced after the event. The incident is what happened; the postmortem is what the team learned. A short incident can warrant a long postmortem, and a noisy one can warrant none.

In the graph

In the Unified Product Graph, incident is the anchor of the Operations and Quality region, chosen because it forces every other domain to pay attention: it references a service, may breach a commitment, and produces learning. Its edges encode the lifecycle directly: incident_caused_by_root_cause reaches into the investigation, incident_analysed_in_postmortem and incident_triggers_postmortem capture the learning loop, incident_breaches_service_level_objective ties it to the reliability budget, and runbook_mitigates_incident records how it was handled. That web is why an incident is queryable as more than a ticket: it is the join point between what broke, what it cost, and what changed because of it.
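The "join point" idea can be sketched as a small traversal. This is a minimal illustration, not the graph's actual query API: the edge type names come from the paragraph above, while the `Edge` shape, the node ids, and the `incidentContext` helper are assumptions made for the example.

```typescript
// Edge type names are taken from the text; the Edge shape and node ids are hypothetical.
type IncidentEdgeType =
  | "incident_caused_by_root_cause"
  | "incident_analysed_in_postmortem"
  | "incident_triggers_postmortem"
  | "incident_breaches_service_level_objective"
  | "runbook_mitigates_incident";

interface Edge {
  type: IncidentEdgeType;
  source: string; // id of the node the edge starts from
  target: string; // id of the node the edge points at
}

interface IncidentContext {
  rootCauses: string[];
  postmortems: string[];
  breachedSlos: string[];
  runbooks: string[];
}

function pushUnique(list: string[], id: string): void {
  if (list.indexOf(id) === -1) list.push(id);
}

// Gather what broke, what it cost, and what changed for a single incident.
function incidentContext(incidentId: string, edges: Edge[]): IncidentContext {
  const ctx: IncidentContext = { rootCauses: [], postmortems: [], breachedSlos: [], runbooks: [] };
  for (const edge of edges) {
    if (edge.type === "runbook_mitigates_incident") {
      // Assumed direction: runbook -> incident, as the edge name reads.
      if (edge.target === incidentId) pushUnique(ctx.runbooks, edge.source);
    } else if (edge.source === incidentId) {
      if (edge.type === "incident_caused_by_root_cause") {
        pushUnique(ctx.rootCauses, edge.target);
      } else if (edge.type === "incident_breaches_service_level_objective") {
        pushUnique(ctx.breachedSlos, edge.target);
      } else {
        // incident_analysed_in_postmortem or incident_triggers_postmortem
        pushUnique(ctx.postmortems, edge.target);
      }
    }
  }
  return ctx;
}

const sampleEdges: Edge[] = [
  { type: "incident_caused_by_root_cause", source: "inc-1", target: "rc-1" },
  { type: "incident_breaches_service_level_objective", source: "inc-1", target: "slo-1" },
  { type: "incident_triggers_postmortem", source: "inc-1", target: "pm-1" },
  { type: "incident_analysed_in_postmortem", source: "inc-1", target: "pm-1" },
  { type: "runbook_mitigates_incident", source: "rb-1", target: "inc-1" },
  { type: "incident_caused_by_root_cause", source: "inc-2", target: "rc-2" },
];

const context = incidentContext("inc-1", sampleEdges);
```

Both postmortem edges land on the same node here, so the sketch deduplicates by id instead of counting the postmortem twice.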

Preview

Safe Change Preview Service Down - All Workspaces Affected

Incident type: operational
Severity level: SEV1 (Critical outage; full response, executive-visible.)
Urgency: Critical (Immediate action required; escalate now.)
Started at: 2025-03-12T02:14:00Z
Acknowledged at: 2025-03-12T02:19:00Z
Contained at: 2025-03-12T02:47:00Z
Resolved at: 2025-03-12T03:31:00Z
Impact summary: All director workspaces unable to preview or approve agent-proposed structural changes; Safe Change blocked for 100% of active users

Properties

Type-specific fields on BaseNode

incident_type (enum)

Discriminator. Absorbs the deprecated `security_incident` type. When `incident_type === 'security'`, this node replaces the former `security_incident`. Examples: "security" for a data breach, "operational" for a service outage, "performance" for degradation.

Values: operational, security, data_breach, performance, dependency, other
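One way to picture the discriminator absorbing `security_incident` is a small normalisation step. This is a sketch under assumptions: the `LegacyNode` shape, the `normaliseIncident` helper, and the fallback to "other" for a missing `incident_type` are hypothetical; only the type names and enum values come from the field description.

```typescript
type IncidentType =
  | "operational"
  | "security"
  | "data_breach"
  | "performance"
  | "dependency"
  | "other";

// Hypothetical input shape: the source says only that `security_incident` is deprecated.
interface LegacyNode {
  id: string;
  type: string;
  title: string;
  incident_type?: IncidentType;
}

interface IncidentRecord {
  id: string;
  type: "incident";
  title: string;
  incident_type: IncidentType;
}

function normaliseIncident(node: LegacyNode): IncidentRecord {
  if (node.type === "security_incident") {
    // The former standalone type becomes an incident discriminated as "security".
    return { id: node.id, type: "incident", title: node.title, incident_type: "security" };
  }
  if (node.type !== "incident") {
    throw new Error(`not an incident node: ${node.type}`);
  }
  // Assumption: an incident with no discriminator falls back to "other".
  return {
    id: node.id,
    type: "incident",
    title: node.title,
    incident_type: node.incident_type ?? "other",
  };
}

const migrated = normaliseIncident({ id: "n-1", type: "security_incident", title: "Leaked API key" });
const unchanged = normaliseIncident({ id: "n-2", type: "incident", title: "Slow checkout", incident_type: "performance" });
```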

severity_level (enum)

Incident severity tier (paging classification). `sev1` = critical/system down. `sev2` = major impact. `sev3` = minor impact. `sev4` = minimal. Uses the `IncidentSeverity` scale; distinct from user-impact `severity_5`. Example: "sev1" for complete service unavailability.

Incident severity scale

sev1 (SEV1): Critical outage; full response, executive-visible.
sev2 (SEV2): Major degradation; urgent response.
sev3 (SEV3): Minor or partial impact; handled within hours.
sev4 (SEV4): Negligible impact; routine handling.

urgency (enum)

Notification urgency. Independent of severity. Uses the shared `SignalUrgency` scale (`low` | `medium` | `high` | `critical`); higher tiers escalate the notification channel.

Signal urgency scale

low (Low): No immediate action required; monitor and review at regular cadence.
medium (Medium): Should be addressed within the current sprint or cycle.
high (High): Requires prompt attention; address before the next planning checkpoint.
critical (Critical): Immediate action required; escalate now.
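The independence of urgency from severity can be illustrated with a routing sketch. The channel names in `CHANNEL_BY_URGENCY` and the helper functions are invented for illustration; the source states only that higher urgency tiers escalate the notification channel.

```typescript
type SignalUrgency = "low" | "medium" | "high" | "critical";
type IncidentSeverity = "sev1" | "sev2" | "sev3" | "sev4";

// Ordered from least to most urgent, matching the scale above.
const URGENCY_ORDER: SignalUrgency[] = ["low", "medium", "high", "critical"];

// Hypothetical channels: the real escalation targets are not given in the source.
const CHANNEL_BY_URGENCY: Record<SignalUrgency, string> = {
  low: "weekly-digest",
  medium: "team-channel",
  high: "direct-message",
  critical: "page-on-call",
};

function urgencyAtLeast(actual: SignalUrgency, threshold: SignalUrgency): boolean {
  return URGENCY_ORDER.indexOf(actual) >= URGENCY_ORDER.indexOf(threshold);
}

interface IncidentSignal {
  severity_level: IncidentSeverity;
  urgency: SignalUrgency;
}

// Routing reads urgency only; severity is deliberately not consulted.
function notificationChannel(signal: IncidentSignal): string {
  return CHANNEL_BY_URGENCY[signal.urgency];
}

const minorButUrgent: IncidentSignal = { severity_level: "sev4", urgency: "critical" };
const majorButQuiet: IncidentSignal = { severity_level: "sev1", urgency: "low" };

const channelA = notificationChannel(minorButUrgent); // "page-on-call"
const channelB = notificationChannel(majorButQuiet); // "weekly-digest"
```

Keeping the two scales separate lets a low-severity incident still page someone when timing matters, without inflating its severity tier.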

started_at (string)

ISO 8601 timestamp at which the incident started or was first detected.

acknowledged_at (string)

ISO 8601 timestamp of the first acknowledgement by a responder. Used to compute time-to-acknowledge.

contained_at (string)

ISO 8601 timestamp at which the incident was contained: blast radius limited, bleeding stopped. Containment precedes full resolution, especially for security incidents.

resolved_at (string)

ISO 8601 timestamp at which the incident was fully resolved.

impact_summary (string)

Customer or service impact. Examples: "Users unable to log in", "Payment processing delayed by 30+ seconds for 15% of users".
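The timestamp fields lend themselves to derived response metrics. The sketch below computes time-to-acknowledge from the timestamps of the Preview record on this page, plus two analogous intervals. The names `timeToContain` and `timeToResolve`, and the choice to measure everything from `started_at`, are assumptions rather than definitions from the schema.

```typescript
interface IncidentTimestamps {
  started_at: string;
  acknowledged_at?: string;
  contained_at?: string;
  resolved_at?: string;
}

// Whole minutes between two ISO 8601 timestamps; undefined if the end is not recorded yet.
function minutesBetween(fromIso: string, toIso: string | undefined): number | undefined {
  if (toIso === undefined) return undefined;
  const from = Date.parse(fromIso);
  const to = Date.parse(toIso);
  if (isNaN(from) || isNaN(to)) throw new Error("invalid timestamp");
  return Math.round((to - from) / 60000);
}

interface ResponseMetrics {
  timeToAcknowledge?: number;
  timeToContain?: number;
  timeToResolve?: number;
}

function responseMetrics(t: IncidentTimestamps): ResponseMetrics {
  return {
    timeToAcknowledge: minutesBetween(t.started_at, t.acknowledged_at),
    timeToContain: minutesBetween(t.started_at, t.contained_at),
    timeToResolve: minutesBetween(t.started_at, t.resolved_at),
  };
}

// Timestamps from the Preview record on this page.
const previewMetrics = responseMetrics({
  started_at: "2025-03-12T02:14:00Z",
  acknowledged_at: "2025-03-12T02:19:00Z",
  contained_at: "2025-03-12T02:47:00Z",
  resolved_at: "2025-03-12T03:31:00Z",
});
// previewMetrics: { timeToAcknowledge: 5, timeToContain: 33, timeToResolve: 77 }
```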

Inherited from BaseNode (6 fields)

id (string, required)

Unique identifier (UUID)

type (NodeType, required)

Discriminator for the entity type

title (string, required)

Display name

description (string)

Optional detailed description

status (string)

Lifecycle status

tags (string[])

Freeform tags for filtering
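Taken together, the inherited and type-specific fields could be typed roughly as follows. This is a sketch, not the published interface. `NodeType` appears in the source only by name, so a plain string stands in for it. Treating every type-specific field as optional is an assumption, since none is marked required. The `missingRequiredFields` helper is hypothetical.

```typescript
// Stand-in: the full NodeType union is not shown in the source.
type NodeType = string;

interface BaseNode {
  id: string; // unique identifier (UUID)
  type: NodeType; // discriminator for the entity type
  title: string; // display name
  description?: string;
  status?: string; // lifecycle status
  tags?: string[];
}

interface IncidentNode extends BaseNode {
  type: "incident";
  incident_type?: "operational" | "security" | "data_breach" | "performance" | "dependency" | "other";
  severity_level?: "sev1" | "sev2" | "sev3" | "sev4";
  urgency?: "low" | "medium" | "high" | "critical";
  started_at?: string;
  acknowledged_at?: string;
  contained_at?: string;
  resolved_at?: string;
  impact_summary?: string;
}

// Narrow a generic node to an incident using the `type` discriminator.
function isIncidentNode(node: BaseNode): node is IncidentNode {
  return node.type === "incident";
}

// Report which of the three required BaseNode fields are absent or empty.
function missingRequiredFields(candidate: Partial<BaseNode>): string[] {
  const required: Array<keyof BaseNode> = ["id", "type", "title"];
  return required.filter((key) => candidate[key] === undefined || candidate[key] === "");
}

const draft: Partial<BaseNode> = {
  type: "incident",
  title: "Safe Change Preview Service Down - All Workspaces Affected",
};
const missing = missingRequiredFields(draft); // ["id"]
```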

Lifecycle

5 phases, initial: detected

Relationships

13 edge types connected to this entity.

Parents

Entities that can contain this type

Product: product_experiences_incident
Infrastructure Component: infrastructure_component_experiences_incident
Security Policy: security_policy_governs_incident

Children

Entities this type can contain

Postmortem: incident_analysed_in_postmortem

Cross-References

Contextual links across the graph

Deployment: deployment_triggers_incident
Postmortem: incident_triggers_postmortem
Service Level Objective: incident_breaches_service_level_objective
Root Cause: incident_caused_by_root_cause
Vulnerability: incident_exploits_vulnerability
Runbook: runbook_mitigates_incident
Symptom: symptom_triggers_incident
Feature: incident_affects_feature
Support Ticket: incident_generates_support_ticket
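The thirteen edge names follow a `subject_verb_object` pattern, which suggests a simple way to tell whether the incident is the source or the target of an edge. The sketch below relies on that naming convention. Reading direction off the name prefix is an assumption made for illustration, not a documented rule of the graph.

```typescript
// All thirteen edge type names listed above.
const INCIDENT_EDGE_TYPES = [
  "product_experiences_incident",
  "infrastructure_component_experiences_incident",
  "security_policy_governs_incident",
  "incident_analysed_in_postmortem",
  "deployment_triggers_incident",
  "incident_triggers_postmortem",
  "incident_breaches_service_level_objective",
  "incident_caused_by_root_cause",
  "incident_exploits_vulnerability",
  "runbook_mitigates_incident",
  "symptom_triggers_incident",
  "incident_affects_feature",
  "incident_generates_support_ticket",
] as const;

type IncidentEdgeName = typeof INCIDENT_EDGE_TYPES[number];

// Assumed convention: a name that begins with "incident_" starts at the incident.
function incidentIsSource(edge: string): boolean {
  return edge.indexOf("incident_") === 0;
}

const outgoing: IncidentEdgeName[] = INCIDENT_EDGE_TYPES.filter((e) => incidentIsSource(e));
const incoming: IncidentEdgeName[] = INCIDENT_EDGE_TYPES.filter((e) => !incidentIsSource(e));
// Under this convention, 7 edges start at the incident and 6 point at it.
```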

Graph Position

🚨 Incident: 3 parents, 1 child, 9 cross-references

Definition

An incident is an unplanned service disruption, declared when an SLO is breached or users are actively affected. It links the breached SLO to the postmortem that follows.

Usage Guidance

Define clear incident severity levels (P0–P3) with response time expectations.
Use an incident commander role.
Run incidents in a dedicated Slack channel or incident management tool.
Communicate status updates to users every 30 minutes during active incidents.
Transition to postmortem within 48 hours of resolution.
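The two timing rules above (status updates every 30 minutes, postmortem within 48 hours of resolution) reduce to simple date arithmetic. The sketch below applies them to the start and resolution times of the INC-2026-014 example on this page. The function names are hypothetical, and starting the update cadence at incident start is an assumption.

```typescript
const STATUS_UPDATE_INTERVAL_MINUTES = 30;
const POSTMORTEM_WINDOW_HOURS = 48;

// Times at which a user-facing status update falls due while the incident is active.
// Assumption: the cadence is counted from incident start and stops at resolution.
function statusUpdateTimes(startIso: string, resolvedIso: string): string[] {
  const stepMs = STATUS_UPDATE_INTERVAL_MINUTES * 60 * 1000;
  const end = Date.parse(resolvedIso);
  const due: string[] = [];
  for (let t = Date.parse(startIso) + stepMs; t < end; t += stepMs) {
    due.push(new Date(t).toISOString());
  }
  return due;
}

// Latest time by which the postmortem process should have started.
function postmortemDueBy(resolvedIso: string): string {
  const windowMs = POSTMORTEM_WINDOW_HOURS * 60 * 60 * 1000;
  return new Date(Date.parse(resolvedIso) + windowMs).toISOString();
}

const updates = statusUpdateTimes("2026-01-15T09:14:00Z", "2026-01-15T10:01:00Z");
// updates: ["2026-01-15T09:44:00.000Z"]
const dueBy = postmortemDueBy("2026-01-15T10:01:00Z");
// dueBy: "2026-01-17T10:01:00.000Z"
```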

Anti-Patterns

Declaring an incident resolved when the symptom clears but the root cause remains guarantees the same failure returns.
Running a blameful post-incident review drives the honest detail underground and stops the team learning.
Skimping on severity classification means a minor blip and a full outage get the same (or the wrong) response.
Closing an incident without tracked follow-up actions turns the painful lesson into a story nobody acts on.

Examples

INC-2026-014: Graph API timeout storm, 47-min outage

Severity: P1. Start: 2026-01-15 09:14 UTC. Resolved: 10:01 UTC. Impact: 100% of graph saves failing. Root cause: cascading timeout from upstream Supabase query regression. Error budget consumed: 47 min (exceeding monthly budget). Postmortem: completed 2026-01-17.
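The "exceeding monthly budget" note can be checked with error budget arithmetic. The example does not state the SLO target, so the sketch below assumes 99.9% availability over a 30-day window. With a different target or window the conclusion could change.

```typescript
// Allowed downtime in minutes for an availability target over a window of days.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Round to one decimal place to hide floating-point noise.
function round1(value: number): number {
  return Math.round(value * 10) / 10;
}

const ASSUMED_SLO_TARGET = 0.999; // assumption: not stated in the example
const ASSUMED_WINDOW_DAYS = 30; // assumption: "monthly" taken as 30 days

const outageMinutes = 47; // 09:14 to 10:01 UTC
const budget = round1(errorBudgetMinutes(ASSUMED_SLO_TARGET, ASSUMED_WINDOW_DAYS)); // 43.2
const remaining = round1(budget - outageMinutes); // -3.8
const budgetExceeded = remaining < 0; // true
```

Under these assumptions the single 47-minute outage overspends the month's budget by about 3.8 minutes, which matches the example's note.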
