An unplanned disruption of service
An incident is an unplanned disruption to a service, or a degradation serious enough to demand a coordinated response. It is the moment the system stops meeting its commitments, and the way an organisation handles that moment reveals more about its engineering culture than any architecture diagram.
Two lineages converge on the modern incident. The operational definition came from IT service management: ITIL framed an incident as an unplanned interruption to a service or a reduction in its quality, and built a lifecycle of detection, logging, categorisation, and resolution around it. That gave the word a precise, ticketed meaning inside enterprise IT.
The response discipline came from somewhere harder. The Incident Command System emerged from 1970s California wildfire response, where uncoordinated agencies fighting the same fire kept failing for organisational reasons rather than tactical ones. It defined a clear command hierarchy and a shared vocabulary so that strangers could coordinate under pressure. Google adapted it directly for software in Incident Management at Google (the Site Reliability Engineering book, O'Reilly, 2016), centred on the three Cs: coordinate, communicate, control, with a named incident commander running the response and engineers freed to fix the problem.
The thinking has moved from minimising incidents to learning from them. The mature view treats incidents as inevitable in complex systems and routine in well-run ones, valued for the signal they carry, with severity levels triggering proportionate response rather than uniform panic.
At 02:14 a checkout error-rate monitor fires. An on-call engineer acknowledges, sees that errors crossed 5 percent, and declares a Sev2. They take the incident commander role, open a dedicated channel, and pull in a database specialist as operations lead. The commander does not debug; they coordinate, post updates every fifteen minutes, and decide.
Investigation points at a deployment from 01:50. The team rolls it back, the error rate recovers by 02:31, and the incident is downgraded, then closed. Seventeen minutes of elevated errors are recorded against the quarter's error budget. The next morning, because this crossed the severity bar, the incident triggers a postmortem. The fix took minutes; the value is in what the postmortem extracts from those minutes.
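The timeline arithmetic in this example can be checked with a short sketch. The calendar date, the 99.9 percent availability target, and the 90-day quarter are assumptions added for illustration (the example gives only clock times), and counting every minute of elevated errors as fully unavailable is a deliberate simplification.

```typescript
// Sketch only: the date, SLO target, and quarter length are illustrative
// assumptions; the worked example supplies only the clock times.
const deployedAt = "2024-01-01T01:50:00Z";
const detectedAt = "2024-01-01T02:14:00Z";
const recoveredAt = "2024-01-01T02:31:00Z";

/** Minutes elapsed between two ISO 8601 timestamps. */
function minutesBetween(startIso: string, endIso: string): number {
  return (Date.parse(endIso) - Date.parse(startIso)) / 60000;
}

/**
 * Fraction of a time-based error budget consumed by an outage.
 * Assumes the whole window counts as unavailable (a simplification).
 */
function budgetFractionUsed(
  outageMinutes: number,
  sloTarget: number, // assumed, e.g. 0.999
  periodDays: number // assumed, e.g. 90
): number {
  const periodMinutes = periodDays * 24 * 60;
  const budgetMinutes = (1 - sloTarget) * periodMinutes;
  return outageMinutes / budgetMinutes;
}

const elevatedMinutes = minutesBetween(detectedAt, recoveredAt); // 17
const detectionLagMinutes = minutesBetween(deployedAt, detectedAt); // 24
const budgetUsed = budgetFractionUsed(elevatedMinutes, 0.999, 90);

console.log(elevatedMinutes, detectionLagMinutes, budgetUsed.toFixed(3));
```

Under these assumptions the 17 minutes consume roughly 13 percent of a 129.6-minute quarterly budget. A request-based budget would count only the failed requests and would likely yield a smaller figure.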
In the Unified Product Graph, Incident is the anchor of the Operations and Quality region, chosen because it forces every other domain to pay attention: it references a service, may breach a commitment, and produces learning. Its edges encode the lifecycle directly: incident_caused_by_root_cause reaches into the investigation, incident_analysed_in_postmortem and incident_triggers_postmortem capture the learning loop, incident_breaches_service_level_objective ties it to the reliability budget, and runbook_mitigates_incident records how it was handled. That web is why an incident is queryable as more than a ticket: it is the join point between what broke, what it cost, and what changed because of it.
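To make the join-point idea concrete, here is a minimal in-memory sketch of following an incident's edges. The `Edge` shape, the `neighboursOf` helper, and all node ids are hypothetical and are not the Unified Product Graph's API. Only the edge type names come from the text, and the source-then-target direction is read off those names.

```typescript
// Hypothetical edge record: `type` is an edge name from the text; `from` and
// `to` are illustrative node ids, not real graph data.
interface Edge {
  type: string;
  from: string;
  to: string;
}

const edges: Edge[] = [
  { type: "incident_caused_by_root_cause", from: "inc-1", to: "rc-1" },
  { type: "incident_breaches_service_level_objective", from: "inc-1", to: "slo-checkout" },
  { type: "incident_triggers_postmortem", from: "inc-1", to: "pm-1" },
  { type: "runbook_mitigates_incident", from: "rb-rollback", to: "inc-1" },
];

/** Groups the ids on the far side of every edge touching `nodeId`, keyed by edge type. */
function neighboursOf(nodeId: string, all: Edge[]): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const edge of all) {
    let other: string | undefined;
    if (edge.from === nodeId) other = edge.to;
    else if (edge.to === nodeId) other = edge.from;
    if (other === undefined) continue; // edge does not touch this node
    const bucket = result[edge.type] || [];
    bucket.push(other);
    result[edge.type] = bucket;
  }
  return result;
}

const incidentView = neighboursOf("inc-1", edges);
console.log(incidentView);
```

One lookup keyed on the incident returns the root cause, the breached objective, the postmortem, and the runbook together, which is the sense in which the node works as a join point.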
Type-specific fields on BaseNode
incident_type (string): Discriminator. Absorbs the deprecated `security_incident` type. When `incident_type === 'security'`, this node replaces the former `security_incident`. Examples: "security" for a data breach, "operational" for a service outage, "performance" for degradation.
severity_level (string): Severity. `sev1` = critical/system down. `sev2` = major impact. `sev3` = minor impact. `sev4` = minimal impact. Example: "sev1" for complete service unavailability.
urgency (string): Notification urgency. Independent of severity. `high` triggers phone calls. `low` sends a message.
started_at (string): ISO timestamp the incident started or was first detected.
acknowledged_at (string): ISO timestamp first acknowledged by a responder. Used to compute time-to-acknowledge.
contained_at (string): ISO timestamp contained. Blast radius limited, bleeding stopped. Containment precedes full resolution, especially for security incidents.
resolved_at (string): ISO timestamp fully resolved.
impact_summary (string): Customer or service impact. Examples: "Users unable to log in", "Payment processing delayed by 30+ seconds for 15% of users".
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
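The field list can be read as a TypeScript shape. This is a sketch, not the schema's actual source. Treating `NodeType` as a plain string, marking every type-specific field optional, and the helper names `timeToAcknowledgeMinutes` and `notificationChannel` are all assumptions. The literal values for `severity_level` and `urgency` come from the field descriptions.

```typescript
type NodeType = string; // assumption: the real NodeType is probably a narrower union

interface BaseNode {
  id: string; // UUID
  type: NodeType;
  title: string;
  description?: string;
  status?: string;
  tags?: string[];
}

interface IncidentNode extends BaseNode {
  incident_type?: string; // e.g. "security", "operational", "performance"
  severity_level?: "sev1" | "sev2" | "sev3" | "sev4";
  urgency?: "high" | "low";
  started_at?: string; // ISO timestamps
  acknowledged_at?: string;
  contained_at?: string;
  resolved_at?: string;
  impact_summary?: string;
}

/** Time-to-acknowledge in minutes, or undefined if either timestamp is missing. */
function timeToAcknowledgeMinutes(incident: IncidentNode): number | undefined {
  if (!incident.started_at || !incident.acknowledged_at) return undefined;
  return (Date.parse(incident.acknowledged_at) - Date.parse(incident.started_at)) / 60000;
}

/**
 * Urgency, not severity, picks the notification channel.
 * Falling back to "message" when urgency is unset is an assumption.
 */
function notificationChannel(incident: IncidentNode): "phone call" | "message" {
  return incident.urgency === "high" ? "phone call" : "message";
}

const sample: IncidentNode = {
  id: "00000000-0000-0000-0000-000000000001", // placeholder UUID
  type: "incident",
  title: "Checkout error rate elevated",
  severity_level: "sev2",
  urgency: "low",
  started_at: "2024-01-01T02:14:00Z", // illustrative timestamps
  acknowledged_at: "2024-01-01T02:17:00Z",
};

console.log(timeToAcknowledgeMinutes(sample), notificationChannel(sample));
```

The sample pairs a `sev2` severity with `low` urgency to show that the two fields vary independently: the incident is serious, yet it notifies by message rather than by phone call.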
5 phases — initial: detected
13 edge types connected to this entity.
product_experiences_incident
infrastructure_component_experiences_incident
security_policy_governs_incident
incident_analysed_in_postmortem
deployment_triggers_incident
incident_triggers_postmortem
incident_breaches_service_level_objective
incident_caused_by_root_cause
incident_exploits_vulnerability
runbook_mitigates_incident
product_experiences_incident_hierarchy
symptom_triggers_incident
incident_affects_feature