What is an Investigation?

The structured diagnosis that sits between a symptom and a fix, turning "something is wrong" into "this is what is wrong, and here is the evidence".

What is the purpose of an Investigation?

An investigation is the work of diagnosis: the hypotheses tested, evidence gathered, and dead ends ruled out between noticing a symptom and naming its cause. It preserves the reasoning that links symptom to root cause, so the conclusion can be trusted, verified, and revisited later.

How do you use an Investigation in product management?

Start from the symptom and record your hypotheses, the evidence for and against each, and what you ruled out, not just the eventual answer. Time-box it and assign an owner so an open investigation doesn't drift. Conclude by linking to the root cause it identified, and keep the reasoning even when the cause seems obvious in hindsight.

Where does the concept of an Investigation come from?

Disciplined fault diagnosis grew up in two places at once: manufacturing gave it the gemba walk and the Five Whys, while software gave it the scientific debugging tradition codified in works such as Andreas Zeller's Why Programs Fail, which frames debugging as hypothesis and experiment. The cultural shift was blamelessness, popularised by John Allspaw and the team at Etsy through the blameless postmortem, which surfaces the detailed how-and-why rather than a self-protective account. Allspaw's later essay, The Infinite Hows, argues that an investigation constructs an explanation rather than uncovering a pre-existing root cause.

What are common mistakes with an Investigation?

Investigations that stop at the first plausible explanation tend to find a cause rather than the cause. The symptom returns because the real fault was one "why" deeper. Recording only the conclusion and discarding the ruled-out hypotheses strips future responders of the context that would let them reopen the case intelligently. And an investigation that slides into blame (hunting for who erred rather than what failed) suppresses the very information needed to fix the system.

Investigation

Q: What is an example of an Investigation?

Checkout-error investigation: Started from the 14:10 payment-failure spike. Ruled out a code deploy (none in the window) and a card-network outage (other gateways fine); traced it to an expired TLS certificate on the payment provider's webhook endpoint.

The enquiry tracing a symptom back to its root cause.

Engineering | Engineering & Platform | type: 'investigation' | interface: BaseNode

Description

An investigation is the structured diagnosis that sits between a symptom and a fix. It is the work of turning "something is wrong" into "this is what is wrong, and here is the evidence".

Origin & evolution

Disciplined fault diagnosis grew up in two places at once. Manufacturing gave it the gemba walk and the Five Whys, Toyota's habit of going to where the problem happened and asking why until something systemic appears. Software gave it the scientific debugging tradition, codified in works such as Andreas Zeller's *Why Programs Fail*, which frames debugging as hypothesis and experiment: observe the failure, form a theory of the cause, predict what you would see if the theory held, then test it.

The cultural shift that reshaped investigation was blamelessness. John Allspaw and the team at Etsy popularised the blameless postmortem, arguing that an engineer who fears punishment will give the account that protects them, not the account that explains the failure. Remove the threat and you get the second story, the detailed how-and-why that teaches the organisation something. Allspaw's later essay, The Infinite Hows, pushed further: an investigation does not uncover a pre-existing root cause so much as construct an explanation, so where it chooses to stop is a judgement, and a careful investigation keeps asking "how" after the first comfortable "why".

How it works in practice

A scheduled report stops arriving. The symptom is narrow: one daily email, missing. The investigation begins with a timeline. Logs show the report job started and exited cleanly, so the failure is downstream of generation. The engineer forms a hypothesis: the file uploaded but the notification did not send. A quick check of the storage bucket confirms the file is there, falsifying the "report never generated" theory and pointing at the delivery step. The next hypothesis, an expired API credential for the email provider, is confirmed by a 401 in the provider logs. The investigation has now produced two artefacts: a revealed bug (the unhandled 401) and a candidate root cause (credentials with no rotation alert). Both feed the fix.
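The walkthrough above can be captured as a small hypothesis log. This is a minimal sketch, not the product's data model: the `InvestigationLog`, `Hypothesis`, and `Status` names and the three status values are assumptions made for illustration. It shows one way to keep ruled-out theories next to the confirmed one.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical status values, chosen for this sketch.
    OPEN = "open"
    CONFIRMED = "confirmed"
    RULED_OUT = "ruled_out"


@dataclass
class Hypothesis:
    statement: str
    status: Status = Status.OPEN
    evidence: list[str] = field(default_factory=list)

    def resolve(self, status: Status, evidence: str) -> None:
        """Record the outcome together with the evidence that justified it."""
        self.status = status
        self.evidence.append(evidence)


@dataclass
class InvestigationLog:
    symptom: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def propose(self, statement: str) -> Hypothesis:
        """Add a hypothesis to the log and return it for later resolution."""
        hypothesis = Hypothesis(statement)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def with_status(self, status: Status) -> list[Hypothesis]:
        return [h for h in self.hypotheses if h.status is status]


# Replaying the missing-report example from the text.
log = InvestigationLog(symptom="Daily report email did not arrive")

never_generated = log.propose("The report was never generated")
never_generated.resolve(Status.RULED_OUT, "Report file is present in the storage bucket")

expired_credential = log.propose("The email provider API credential has expired")
expired_credential.resolve(Status.CONFIRMED, "Provider logs show a 401")
```

Because rejected hypotheses stay in the log with their evidence, a later responder can call `log.with_status(Status.RULED_OUT)` to see what was already eliminated and why.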

Investigation vs. its neighbours

Incident is the declared event that an investigation serves. The incident is the organisational container with an owner and a clock; the investigation is the diagnostic work inside it. Not every investigation needs an incident, and a long incident may hold several investigations.
Root cause is an output of the investigation, never a guaranteed one. The investigation produces a candidate; whether it is the true underlying condition is the open question Allspaw warns about.
Postmortem is the written record and the learning ritual after the fact. The postmortem documents what the investigation found and what changes follow; the investigation is the live diagnosis, the postmortem its durable account.

In the graph

In the Unified Product Graph, investigation sits in the engineering and reliability region as the diagnostic spine of an incident chain. It points to its findings through investigation_revealed_bug and investigation_revealed_root_cause, and the resulting change is tied to it by fix_derived_from_investigation. Its target is recorded by service_investigated_via_investigation, anchoring the diagnosis to the component under examination. Holding the investigation as its own node, between symptom and fix, preserves the chain of reasoning: a graph reader can see not just that a fault was fixed but how it was diagnosed and what evidence justified the conclusion.
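As a rough illustration of how those edges let a reader walk from a fix back to its diagnosis, the sketch below stores edges as plain tuples. The four edge type names are taken from the text. The node ids, the direction of each edge (inferred from its name), and the `diagnosis_behind` helper are assumptions, not the graph's actual API.

```python
from collections import defaultdict

# (source, edge_type, target). Node ids are hypothetical; edge direction is
# an assumption read off each edge name as "<source>_..._<target>".
EDGES = [
    ("investigation:report-missing", "investigation_revealed_bug", "bug:unhandled-401"),
    ("investigation:report-missing", "investigation_revealed_root_cause", "root_cause:no-rotation-alert"),
    ("fix:add-rotation-alert", "fix_derived_from_investigation", "investigation:report-missing"),
    ("service:report-mailer", "service_investigated_via_investigation", "investigation:report-missing"),
]


def outgoing(edges):
    """Index edges by source node so each lookup is a dictionary access."""
    index = defaultdict(list)
    for source, edge_type, target in edges:
        index[source].append((edge_type, target))
    return index


def diagnosis_behind(fix_id, edges):
    """Map each investigation a fix derives from to the findings it revealed."""
    index = outgoing(edges)
    findings = {}
    for edge_type, investigation in index[fix_id]:
        if edge_type != "fix_derived_from_investigation":
            continue
        findings[investigation] = [
            target
            for revealed_type, target in index[investigation]
            if revealed_type.startswith("investigation_revealed_")
        ]
    return findings
```

Calling `diagnosis_behind("fix:add-rotation-alert", EDGES)` returns the investigation behind that fix along with the bug and root cause it revealed, which is the "how it was diagnosed" view described above.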

Properties

title
severity (example preset: "Mild inconvenience: notices but works around easily")
hypothesis
findings
started_at
resolved_at
lead_investigator

Lifecycle

5 phases, initial: open

Relationships

5 edge types connected to this entity.

Children

Entities this type can contain

Symptom: investigation_surfaces_symptom

Root Cause: investigation_revealed_root_cause

Cross-References

Contextual links across the graph

Bug: investigation_revealed_bug

Fix: fix_derived_from_investigation

Service: service_investigated_via_investigation

Graph Position

Investigation: 2 children, 3 cross-references

Definition

An investigation is the structured enquiry that turns a symptom into a named, evidenced root cause. It sits in the middle of the diagnostic chain, starting from observed symptoms and concluding with a root cause finding.

Usage Guidance

Start from the symptom and record your hypotheses, the evidence for and against each, and what you ruled out, not just the eventual answer.
Time-box it and assign an owner so an open investigation doesn't drift.
Conclude by linking to the root cause it identified, and keep the reasoning even when the cause seems obvious in hindsight.
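
The time-boxing and ownership advice above could be checked mechanically. The sketch below is one possible approach under stated assumptions: the field names mirror the properties listed on this page (title, started_at, resolved_at, lead_investigator), while the `Investigation` class, the `drift_warnings` helper, and the two-day default time box are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Investigation:
    title: str
    started_at: datetime
    resolved_at: Optional[datetime] = None
    lead_investigator: Optional[str] = None


def drift_warnings(inv, now, time_box=timedelta(days=2)):
    """Return reasons an open investigation risks drifting (empty if none)."""
    if inv.resolved_at is not None:
        return []  # Resolved investigations cannot drift.
    warnings = []
    if inv.lead_investigator is None:
        warnings.append("no lead investigator assigned")
    if now - inv.started_at > time_box:
        warnings.append("open longer than its time box")
    return warnings
```

Passing `now` explicitly rather than reading the clock inside the function keeps the check deterministic, so it can be run against historical records.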

Anti-Patterns

Investigations that stop at the first plausible explanation tend to find a cause rather than the cause.
The symptom returns because the real fault was one "why" deeper.
Recording only the conclusion and discarding the ruled-out hypotheses strips future responders of the context that would let them reopen the case intelligently.
And an investigation that slides into blame (hunting for who erred rather than what failed) suppresses the very information needed to fix the system.

Examples

Checkout-error investigation

Started from the 14:10 payment-failure spike. Ruled out a code deploy (none in the window) and a card-network outage (other gateways fine); traced it to an expired TLS certificate on the payment provider's webhook endpoint.

Retention-dip enquiry

Investigated a 3-point drop in week-2 retention. Tested and rejected a pricing-change hypothesis before identifying a broken activation email as the cause.