What is a Postmortem?

The written record of an incident, documenting what happened, what caused it, what was done, and what will change to prevent a recurrence.

What is the purpose of a Postmortem?

A postmortem converts an incident into learning by interrogating the system that allowed the failure rather than the person who tripped over it. It identifies root causes and produces concrete action items that prevent the same failure recurring, building the reliability of the system over time.

How do you use a Postmortem in product management?

Write postmortems within 48 hours, while memory is fresh. Include a timeline, root cause analysis (5 whys), contributing factors, and action items with owners and deadlines. Keep postmortems blameless: focus on system failures, not individual mistakes. Share postmortems widely; they build trust.

Where does the concept of a Postmortem come from?

Blameless postmortems were pioneered in SRE culture at Google and Etsy, inspired by the aviation and medical industries' use of incident reviews. John Allspaw's influential 2012 Etsy blog post "Blameless PostMortems and a Just Culture" established the principle that human error is a symptom, not the cause, of system failures.

What are common mistakes with a Postmortem?

The moment a postmortem assigns blame to an individual, honesty evaporates and the real systemic causes stay hidden. Stopping at the proximate cause ("a bad deploy") rather than tracing the conditions that let it happen produces shallow, unactionable lessons. Postmortems that generate action items nobody owns or tracks are theatre. Reserving the practice for only the largest outages forfeits the cheaper learning available from near-misses.

What is an example of a Postmortem?

Postmortem INC-2026-014, Graph API timeout storm. Root cause: a missing index on the nodes table caused a full table scan, pushing P99 latency to 8s. 5 whys: missing index → schema migration gap → no production load test → insufficient CI environment → action item: add production-equivalent load testing to CI pipeline.

Postmortem

A post-incident analysis documenting lessons

DevOps & Platform · Operations & Quality · type: 'postmortem' · interface: BaseNode

Description

A postmortem is the written record of an incident: what happened, what caused it, what was done, and what will change so it does not happen the same way again. Its defining quality is blamelessness. The document interrogates the system that allowed the failure, and it deliberately refuses to interrogate the engineer who tripped over it.

Origin & evolution

The blameless idea was imported from safety science, not invented in software. James Reason's work on human error and Sidney Dekker's writing on Just Culture framed a hard finding from aviation and medicine: punishing the individual at the sharp end leaves the latent system conditions untouched, so the next person inherits the same trap. Dekker named the punitive reflex the Bad Apple Theory and showed it does not make complex systems safer.

John Allspaw carried that thinking into engineering. In 2012, while leading technical operations at Etsy (he later became its CTO), he published "Blameless PostMortems and a Just Culture" on Etsy's engineering blog, arguing that engineers who feel safe to give a detailed account of what they saw and did are the organisation's best source of truth about how the system actually behaves. Etsy backed it with a tool, Morgue, for recording these reviews. Google's Site Reliability Engineering book then codified the practice and the template, and the industry followed.

The debate since has been about accountability. Blameless does not mean consequence-free; the refinement most teams reach is that you separate the account of the failure, which must be safe and honest, from any question of competence, which belongs to a different conversation. The phrase "blameless, not accountability-free" captures where the field landed.

How it works in practice

After a Sev2 checkout outage, the incident commander schedules a postmortem within two days while memory is fresh. The author builds a minute-by-minute timeline from chat logs and graphs, then writes the analysis in language that names systems and not people: the deploy pipeline had no automated canary, so a bad config reached 100 percent of traffic in ninety seconds.

The document lists contributing factors rather than a single culprit, and ends with dated, owned action items: add a canary stage, alert on config diff size, document the rollback in a runbook. It is shared widely, because the point of writing it down is that the team three doors over learns the lesson without living the outage. A year later the canary action item is the reason a similar bad config is caught at 1 percent.

Postmortem vs. its neighbours

Incident is the live event. The postmortem is the reflection afterwards. The incident is measured in minutes of disruption; the postmortem is measured in lessons extracted.
Root cause is a conclusion the postmortem reaches, the condition the analysis identifies. The postmortem is the whole document and process; the root cause is one of its outputs. Mature postmortems resist naming a single root cause, preferring a set of contributing factors.
Retrospective is a recurring team ritual reflecting on a sprint or a period of work. A postmortem is triggered by a specific failure. One is cadence-driven and broad; the other is event-driven and forensic.

In the graph

In the Unified Product Graph, postmortem is a leaf in the Operations and Quality region, reached from the anchor incident through incident_analysed_in_postmortem and incident_triggers_postmortem. Its own causal edges are what make it valuable: postmortem_identifies_root_cause connects the analysis to the condition it found, and postmortem_produces_runbook connects it to the operational change it generated. That second edge is the learning loop made structural. A postmortem that produces no runbook and identifies no root cause is visibly a document that changed nothing.
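The "changed nothing" test described above can be expressed as a query over edges. The following is a minimal TypeScript sketch, not the graph's actual API: the `Edge` shape, the `changedNothing` name, and the assumption that `postmortem_*` edges originate at the postmortem node are all hypothetical. Only the four edge type names come from this page.

```typescript
// Edge type names are taken from this page; everything else is illustrative.
type PostmortemEdgeType =
  | "incident_analysed_in_postmortem"
  | "incident_triggers_postmortem"
  | "postmortem_identifies_root_cause"
  | "postmortem_produces_runbook";

// Hypothetical edge shape: a typed link between two node ids.
interface Edge {
  type: PostmortemEdgeType;
  from: string; // id of the source node
  to: string; // id of the target node
}

// True when the postmortem has neither a root-cause edge nor a runbook edge,
// i.e. the graph shows no outcome attached to it.
function changedNothing(postmortemId: string, edges: Edge[]): boolean {
  const outgoing = edges.filter((e) => e.from === postmortemId);
  const hasRootCause = outgoing.some(
    (e) => e.type === "postmortem_identifies_root_cause"
  );
  const hasRunbook = outgoing.some(
    (e) => e.type === "postmortem_produces_runbook"
  );
  return !hasRootCause && !hasRunbook;
}

// Made-up ids: pm-1 produced a runbook, pm-2 produced nothing.
const sampleEdges: Edge[] = [
  { type: "incident_triggers_postmortem", from: "inc-1", to: "pm-1" },
  { type: "postmortem_produces_runbook", from: "pm-1", to: "rb-1" },
  { type: "incident_triggers_postmortem", from: "inc-2", to: "pm-2" },
];

console.log(changedNothing("pm-1", sampleEdges)); // false
console.log(changedNothing("pm-2", sampleEdges)); // true
```

A check like this could run as a periodic report, so that postmortems with no outgoing causal edges are surfaced for follow-up rather than silently archived.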

Preview

Postmortem: Safe Change Preview Service Outage (2025-03-12)

Detection method: monitoring

Timeline:
02:14 - Safe Change preview service begins returning 503 errors.
02:17 - First automated alert fires.
02:19 - On-call engineer acknowledges.
02:28 - Root cause identified: database connection pool exhausted by a migration job.
02:47 - Migration job killed; connection pool freed; service contained.
03:31 - Full recovery confirmed across all regions.

Action items:
1. Add connection pool saturation alert (owner: Sam, due: 2025-03-26).
2. Run migration jobs against a read replica, not primary (owner: infra team, due: 2025-04-05).
3. Add Safe Change synthetic canary to detect outages before users report them (owner: Eng, due: 2025-04-12).
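The intervals implied by this timeline can be derived mechanically, which is useful when comparing incidents. A minimal TypeScript sketch, assuming every timestamp falls on the same day; the `outage` object and the helper names are hypothetical and not part of the graph schema:

```typescript
// Convert "HH:MM" to minutes since midnight.
function minutesSinceMidnight(hhmm: string): number {
  const parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

// Minutes elapsed from start to end (same-day timestamps assumed).
function minutesBetween(start: string, end: string): number {
  return minutesSinceMidnight(end) - minutesSinceMidnight(start);
}

// Timestamps copied from the Safe Change preview example above.
const outage = {
  started: "02:14",
  alerted: "02:17",
  acknowledged: "02:19",
  diagnosed: "02:28",
  contained: "02:47",
  recovered: "03:31",
};

const timeToDetect = minutesBetween(outage.started, outage.alerted); // 3
const timeToAcknowledge = minutesBetween(outage.alerted, outage.acknowledged); // 2
const timeToContain = minutesBetween(outage.started, outage.contained); // 33
const timeToRecover = minutesBetween(outage.started, outage.recovered); // 77

console.log({ timeToDetect, timeToAcknowledge, timeToContain, timeToRecover });
```

In this example detection and acknowledgement took five minutes combined, while containment took 33, which suggests the action items are better spent on diagnosis and mitigation speed than on alerting latency alone.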

Properties

Type-specific fields on BaseNode

timeline (string)

Chronological timeline. Events with timestamps in order. @example "03:15 Alert fired. 03:20 On-call acknowledged. 03:45 Root cause identified. 06:30 Service restored."

action_items (string)

Follow-up actions with owners and due dates. @example "1. Add circuit breaker to auth service (owner: Platform, due: 2026-04-12). 2. Update runbook for DB failover."

detection_method (enum)

Detection source. Key learning for improving detection coverage. @example "alert" if monitoring caught it, "customer_report" if a user reported first

Allowed values: monitoring, alert, customer_report, internal_report, automated

Inherited from BaseNode (6 fields)

id (string, required)

Unique identifier (UUID)

type (NodeType, required)

Discriminator for the entity type

title (string, required)

Display name

description (string)

Optional detailed description

status (string)

Lifecycle status

tags (string[])

Freeform tags for filtering
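Taken together, the fields above suggest a type along the following lines. This is a sketch reconstructed from the field list on this page, not the published schema: treating every field not marked required as optional is an assumption, `NodeType` is narrowed to the single member known here, and `isDetectionMethod` and the placeholder UUID are hypothetical.

```typescript
// The real NodeType union presumably has many members; only this one is known here.
type NodeType = "postmortem";

interface BaseNode {
  id: string; // Unique identifier (UUID)
  type: NodeType; // Discriminator for the entity type
  title: string; // Display name
  description?: string;
  status?: string; // Lifecycle status
  tags?: string[];
}

// Allowed values as listed for detection_method.
const DETECTION_METHODS = [
  "monitoring",
  "alert",
  "customer_report",
  "internal_report",
  "automated",
] as const;
type DetectionMethod = (typeof DETECTION_METHODS)[number];

interface Postmortem extends BaseNode {
  type: "postmortem";
  timeline?: string;
  action_items?: string;
  detection_method?: DetectionMethod;
}

// Hypothetical runtime guard for values arriving as plain strings.
function isDetectionMethod(value: string): value is DetectionMethod {
  return (DETECTION_METHODS as readonly string[]).indexOf(value) !== -1;
}

const examplePostmortem: Postmortem = {
  id: "00000000-0000-4000-8000-000000000000", // placeholder UUID
  type: "postmortem",
  title: "Postmortem: Safe Change Preview Service Outage (2025-03-12)",
  status: "draft", // the lifecycle's stated initial phase
  detection_method: "monitoring",
  timeline: "02:14 Service begins returning 503 errors. 02:17 First automated alert fires.",
  action_items: "1. Add connection pool saturation alert (owner: Sam, due: 2025-03-26).",
};

console.log(examplePostmortem.title, isDetectionMethod("pager")); // second value is false
```

Because `timeline` and `action_items` are plain strings in the listed schema, any structure inside them (timestamps, owners, due dates) is a convention the author has to maintain by hand rather than something the type enforces.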

Lifecycle

4 phases, initial: draft · template: PUBLISHING

Relationships

4 edge types connected to this entity.

Parents

Entities that can contain this type

Incident: incident_analysed_in_postmortem

Cross-References

Contextual links across the graph

Incident: incident_triggers_postmortem
Root Cause: postmortem_identifies_root_cause
Runbook: postmortem_produces_runbook

Graph Position

1 parent · Postmortem · 3 cross-references

Definition

A postmortem is the written record of an incident: what happened, what caused it, what was done, and what will change to prevent a recurrence. It connects an incident to the systemic improvements it prompts.

Usage Guidance

Write postmortems within 48 hours while memory is fresh.
Include: timeline, root cause analysis (5 whys), contributing factors, and action items with owners and deadlines.
Keep postmortems blameless: focus on system failures, not individual mistakes.
Share postmortems widely; they build trust.
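Two of these guidelines are mechanical enough to check automatically: the 48-hour window and the requirement that every action item has an owner and a deadline. A minimal TypeScript sketch follows; the `Draft` and `ActionItem` shapes and `lintPostmortem` are hypothetical (the graph schema stores action items as a single string), and measuring the 48 hours from incident resolution is an assumption, since the guidance does not say where the clock starts.

```typescript
// Hypothetical structured form of an action item.
interface ActionItem {
  text: string;
  owner?: string;
  due?: string; // ISO date, e.g. "2026-04-12"
}

// Hypothetical draft record used only for this check.
interface Draft {
  incidentResolvedAt: Date;
  writtenAt: Date;
  actionItems: ActionItem[];
}

const MS_PER_HOUR = 60 * 60 * 1000;

// Returns a list of human-readable problems; an empty list means the draft passes.
function lintPostmortem(draft: Draft): string[] {
  const problems: string[] = [];
  const hours =
    (draft.writtenAt.getTime() - draft.incidentResolvedAt.getTime()) / MS_PER_HOUR;
  if (hours > 48) {
    problems.push(`Written ${Math.round(hours)} hours after resolution; aim for 48 or fewer`);
  }
  draft.actionItems.forEach((item) => {
    if (!item.owner) problems.push(`Action item "${item.text}" has no owner`);
    if (!item.due) problems.push(`Action item "${item.text}" has no due date`);
  });
  return problems;
}

// Action items mirror the action_items @example on this page; timestamps are made up.
const draft: Draft = {
  incidentResolvedAt: new Date("2026-03-29T06:30:00Z"),
  writtenAt: new Date("2026-03-30T12:30:00Z"), // 30 hours later
  actionItems: [
    { text: "Add circuit breaker to auth service", owner: "Platform", due: "2026-04-12" },
    { text: "Update runbook for DB failover" },
  ],
};

console.log(lintPostmortem(draft));
// Reports the missing owner and missing due date on the runbook item.
```

The blamelessness and sharing guidelines are cultural and are not something a check like this can verify.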

Anti-Patterns

The moment a postmortem assigns blame to an individual, honesty evaporates and the real systemic causes stay hidden.
Stopping at the proximate cause ("a bad deploy") rather than tracing the conditions that let it happen produces shallow, unactionable lessons.
Postmortems that generate action items nobody owns or tracks are theatre.
Reserving the practice for only the largest outages forfeits the cheaper learning available from near-misses.

Examples

Postmortem: INC-2026-014, Graph API timeout storm

Root cause: missing index on nodes table caused full table scan, P99 latency → 8s. 5 whys: missing index → schema migration gap → no production load test → insufficient CI environment → action item: add production-equivalent load testing to CI pipeline.
