The structured enquiry that traces a symptom back to its root cause.
An investigation is the structured diagnosis that sits between a symptom and a fix. It is the work of turning "something is wrong" into "this is what is wrong, and here is the evidence". The quality of an investigation is decided less by how fast it reaches an answer than by whether it can resist the first plausible one.
Disciplined fault diagnosis grew up in two places at once. Manufacturing gave it the gemba walk and the Five Whys, Toyota's habit of going to where the problem happened and asking why until something systemic appears. Software gave it the scientific debugging tradition, codified in works such as Andreas Zeller's *Why Programs Fail*, which frames debugging as hypothesis and experiment: observe the failure, form a theory of the cause, predict what you would see if the theory held, then test it.
The cultural shift that reshaped investigation was blamelessness. John Allspaw and the team at Etsy popularised the blameless postmortem, arguing that an engineer who fears punishment will give the account that protects them, not the account that explains the failure. Remove the threat and you get the second story, the detailed how-and-why that teaches the organisation something. Allspaw's later Infinite Hows pushed further: an investigation does not uncover a pre-existing root cause so much as construct an explanation, so where it chooses to stop is a judgement, and a careful investigation keeps asking "how" after the first comfortable "why".
A scheduled report stops arriving. The symptom is narrow: one daily email, missing. The investigation begins with a timeline. Logs show the report job started and exited cleanly, so the failure is downstream of generation. The engineer forms a hypothesis: the file uploaded but the notification did not send. A quick check of the storage bucket confirms the file is there, falsifying the "report never generated" theory and pointing at the delivery step. The next hypothesis, an expired API credential for the email provider, is confirmed by a 401 in the provider logs. The investigation has now produced two artefacts: a revealed bug (the unhandled 401) and a candidate root cause (credentials with no rotation alert). Both feed the fix.
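The report example can be written down as a small hypothesis log, where each entry records a claim, the prediction that follows from it, and what the check showed. This is a minimal sketch for illustration only: the `Hypothesis` type, the `evaluate` helper, and the wording of each entry are hypothetical and not part of any schema described on this page.

```typescript
// Hypothetical types for illustration; not part of the Unified Product Graph schema.
type Verdict = "supported" | "falsified";

interface Hypothesis {
  claim: string;      // theory of the cause
  prediction: string; // what should be observed if the claim holds
  observed: string;   // what the check actually showed
  holds: boolean;     // whether the observation matched the prediction
}

// A hypothesis survives only if the observation matched its prediction.
function evaluate(h: Hypothesis): Verdict {
  return h.holds ? "supported" : "falsified";
}

const hypothesisLog: Hypothesis[] = [
  {
    claim: "The report was never generated",
    prediction: "No file in the storage bucket",
    observed: "File is present in the bucket",
    holds: false,
  },
  {
    claim: "The email provider credential has expired",
    prediction: "Provider logs show an authentication failure",
    observed: "401 in the provider logs",
    holds: true,
  },
];

// Claims that survived their test; these become candidate findings.
const survivingClaims: string[] = hypothesisLog
  .filter((h) => evaluate(h) === "supported")
  .map((h) => h.claim);
```

Keeping the falsified entries in the log matters as much as the surviving one: they record which plausible explanations were ruled out, and on what evidence.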
An incident is the organisational container with an owner and a clock; the investigation is the diagnostic work inside it. Not every investigation needs an incident, and a long incident may hold several investigations.
An investigation produces a candidate root cause; whether it is the true underlying condition is the open question Allspaw warns about.
A postmortem documents what the investigation found and what changes follow; the investigation is the live diagnosis, the postmortem its durable account.
In the Unified Product Graph, Investigation sits in the engineering and reliability region as the diagnostic spine of an incident chain. It points to its findings through `investigation_revealed_bug` (Investigation revealed Bug) and `investigation_revealed_root_cause` (Investigation revealed Root Cause), and the resulting change is tied to it by `fix_derived_from_investigation` (Fix derived from Investigation). Its target is recorded by `service_investigated_via_investigation` (Service investigated via Investigation), anchoring the diagnosis to the component under examination. All four edges are causal. Holding the investigation as its own node, between symptom and fix, preserves the chain of reasoning: a graph reader can see not just that a fault was fixed but how it was diagnosed and what evidence justified the conclusion.
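To illustrate how these edges preserve the chain of reasoning, the sketch below walks from a fix back to its investigation, findings, and service. The four edge ids come from the text above; the `Edge` shape, the `explainFix` helper, and every node id (`fix-1`, `inv-1`, and so on) are assumptions made for the example, not the graph's actual API.

```typescript
// Edge ids are taken from this page; the Edge shape below is an assumed, simplified form.
type EdgeType =
  | "investigation_revealed_bug"
  | "investigation_revealed_root_cause"
  | "fix_derived_from_investigation"
  | "service_investigated_via_investigation";

interface Edge {
  type: EdgeType;
  source: string; // id of the node the edge starts from
  target: string; // id of the node the edge points to
}

interface FixExplanation {
  investigation: string;
  findings: string[];
  service: string | undefined;
}

// Walk from a fix to the investigation it derives from, then collect what that
// investigation revealed and which service it examined.
function explainFix(edges: Edge[], fixId: string): FixExplanation | undefined {
  const derived = edges.find(
    (e) => e.type === "fix_derived_from_investigation" && e.source === fixId
  );
  if (derived === undefined) return undefined;

  const investigation = derived.target;
  const findings = edges
    .filter(
      (e) =>
        e.source === investigation &&
        (e.type === "investigation_revealed_bug" ||
          e.type === "investigation_revealed_root_cause")
    )
    .map((e) => e.target);
  const serviceEdge = edges.find(
    (e) =>
      e.type === "service_investigated_via_investigation" &&
      e.target === investigation
  );
  return { investigation, findings, service: serviceEdge?.source };
}

// Hypothetical node ids for the missing-report example.
const exampleEdges: Edge[] = [
  { type: "investigation_revealed_bug", source: "inv-1", target: "bug-unhandled-401" },
  { type: "investigation_revealed_root_cause", source: "inv-1", target: "rc-no-rotation-alert" },
  { type: "fix_derived_from_investigation", source: "fix-1", target: "inv-1" },
  { type: "service_investigated_via_investigation", source: "svc-report-mailer", target: "inv-1" },
];

const explanation = explainFix(exampleEdges, "fix-1");
```

A fix with no `fix_derived_from_investigation` edge returns `undefined` here, which is one way to surface changes that were made without a recorded diagnosis.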
Type-specific fields on BaseNode
- `severity` (object): Believed severity of the underlying issue. Drives prioritisation across investigations. Canonicalised in v0.4.0: the ad-hoc `'low' | 'medium' | 'high' | 'critical'` shape was replaced by `UPGAssessment` so every "severity" property reports on the same axis. Migration: `low → 2`, `medium → 3`, `high → 4`, `critical → 5`.
- `hypothesis` (string): Working hypothesis about the root cause.
- `findings` (string): Findings discovered so far.
- `started_at` (string): ISO timestamp the investigation began.
- `resolved_at` (string): ISO timestamp the investigation was concluded. Pairs with `investigation_status === 'resolved' | 'abandoned'`.
- `lead_investigator` (string): Lead investigator (email or handle). Distinct from the team owning the affected service.
- `session_id` (string): Originating session.
- `category` (string): Kind of issue under investigation. Distinct from `RootCauseProperties.cause_category`, which captures *why something went wrong*.
- `id` (string, required): Unique identifier (UUID).
- `type` (NodeType, required): Discriminator for the entity type.
- `title` (string, required): Display name.
- `description` (string): Optional detailed description.
- `status` (string): Lifecycle status.
- `tags` (string[]): Freeform tags for filtering.
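
The severity migration noted for the `severity` field can be sketched as a lookup from the legacy labels to their numeric values. The mapping itself is taken from the field description; the names `LegacySeverity`, `LEGACY_SEVERITY_SCORE`, and `migrateSeverity` are hypothetical, and because the shape of `UPGAssessment` is not shown here, the sketch returns only the number rather than a full assessment object.

```typescript
// Legacy labels used before v0.4.0, as listed in the field description.
type LegacySeverity = "low" | "medium" | "high" | "critical";

// Numeric values from the documented migration: low → 2, medium → 3, high → 4, critical → 5.
const LEGACY_SEVERITY_SCORE: Record<LegacySeverity, number> = {
  low: 2,
  medium: 3,
  high: 4,
  critical: 5,
};

// Assumption: the real UPGAssessment shape is not documented on this page,
// so this helper returns only the migrated numeric value.
function migrateSeverity(legacy: LegacySeverity): number {
  return LEGACY_SEVERITY_SCORE[legacy];
}

const migratedHigh = migrateSeverity("high");
```

Typing the lookup as `Record<LegacySeverity, number>` means the compiler rejects the table if any legacy label is left unmapped.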
5 phases — initial: open
5 edge types connected to this entity.
- `investigation_surfaces_symptom`
- `investigation_revealed_root_cause`