The underlying fault that, once removed, stops a problem recurring: the true source beneath the symptoms.
A root cause is the underlying condition that, once removed, stops a fault from recurring. The appeal is obvious: find the one thing, fix the one thing, never see the failure again. The trouble is that complex systems rarely oblige with a single one thing, and the search for it tends to stop the moment a satisfying answer appears, which is seldom the moment the real condition does.
The most famous root-cause technique came out of Toyota. Sakichi Toyoda devised the Five Whys in the 1930s, and Taiichi Ohno made it a pillar of the Toyota Production System, describing it as "the basis of Toyota's scientific approach" by which "repeating why five times" exposes both the problem and the fix. Ask why a machine stopped, then why the fuse blew, then why the bearing seized, and you walk from the symptom back toward a maintenance policy that nobody had set.
A parallel tool arrived from quality engineering. Kaoru Ishikawa developed the cause-and-effect diagram while working with Kawasaki shipyards, formally presenting it in 1968. Its fishbone shape sorts candidate causes into families such as method, machine, material, and measurement, which guards against the Five Whys habit of marching down a single line and declaring victory early.
Both methods carry an assumption that later thinkers challenged. The Five Whys traces one chain; reality branches. John Allspaw's 2014 essay "The Infinite Hows", drawing on Dekker, Conklin, and Leveson, argues that in a retrospective nothing is sitting there waiting to be found. Causes are constructed by where we choose to start and stop asking, and we declare a "root cause" at the point we get tired of looking. An outage in a distributed system is typically the joint product of several conditions, each necessary and none sufficient alone. Allspaw's repair is to ask "how", which calls for description and resists the tidy single-villain story.
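The "each necessary and none sufficient alone" claim can be made concrete with a toy model. This is a sketch only: the three conditions and the `outage_occurs` helper are invented for illustration and do not come from Allspaw's essay or from any real incident.

```python
# Hypothetical contributing conditions for an illustrative outage.
CONDITIONS = ("retry storm", "stale cache", "undersized pool")


def outage_occurs(present):
    """In this toy model the outage happens only when every condition is present."""
    return all(condition in present for condition in CONDITIONS)


# Remove any one condition and the outage no longer occurs: each is necessary.
necessary = {c: not outage_occurs(set(CONDITIONS) - {c}) for c in CONDITIONS}

# Present any one condition on its own and nothing happens: none is sufficient.
sufficient_alone = {c: outage_occurs({c}) for c in CONDITIONS}

print(necessary)
print(sufficient_alone)
```

Under this model, naming any one of the three as "the" root cause is equally defensible and equally incomplete, which is the difficulty the essay points at.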
A payments API starts returning 500s at 09:14. The on-call engineer sees the symptom, a spike in error rate, and opens an investigation. The first answer is "the database connection pool was exhausted", which yields a workaround: raise the pool size. The Five Whys keeps going. Why was the pool exhausted? A slow query held connections open. Why was the query slow? A nightly index rebuild had not completed. Why not? The rebuild job had silently failed for three nights, because nobody monitored it. The named cause is now an absent alert, not a small pool, and the lasting fix targets that.
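The chain in this example can be written down as data, which makes the gap between the first answer and the last one explicit. A minimal sketch, assuming a hypothetical `WhyStep` record and `walk_whys` helper; neither belongs to any library or to the Unified Product Graph, and the opening question is implied by the text, not quoted from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhyStep:
    """One question-and-answer pair in a Five Whys chain (hypothetical type)."""
    question: str
    answer: str


# The payments API incident, one step per "why".
CHAIN = [
    WhyStep("Why is the API returning 500s?",
            "The database connection pool was exhausted."),
    WhyStep("Why was the pool exhausted?",
            "A slow query held connections open."),
    WhyStep("Why was the query slow?",
            "A nightly index rebuild had not completed."),
    WhyStep("Why not?",
            "The rebuild job had silently failed for three nights, "
            "because nobody monitored it."),
]


def walk_whys(chain, stop_after=None):
    """Return the answer at which the investigation stops.

    stop_after=1 models stopping at the first satisfying answer;
    None walks the whole chain.
    """
    steps = chain if stop_after is None else chain[:stop_after]
    if not steps:
        raise ValueError("chain must contain at least one step")
    return steps[-1].answer


first_answer = walk_whys(CHAIN, stop_after=1)  # suggests the workaround
named_cause = walk_whys(CHAIN)                 # suggests the lasting fix
print(first_answer)
print(named_cause)
```

Stopping after one step yields the pool-size workaround; walking the full list ends at the unmonitored job. The sketch still follows a single line of questioning, so it inherits the single-chain limitation of the technique itself.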
A symptom is the entry point of diagnosis; the root cause is its terminus. Treating the symptom buys time; removing the root cause buys recurrence-freedom.
An investigation produces a candidate root cause; it does not guarantee a true one, which is exactly Allspaw's caution.
In the Unified Product Graph, Root Cause sits in the engineering and reliability region as the explanatory anchor of an incident chain. It links downstream through root_cause_causes_symptom and root_cause_causes_bug, and an investigation_revealed_root_cause edge records how it was identified. Closure runs the other way: fix_resolved_root_cause marks the condition as addressed. Modelling cause, symptom, investigation, and fix as separate connected nodes keeps a record that a fix targeting only the symptom leaves a root cause queryably unresolved.
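What "queryably unresolved" could mean in practice is sketched below. The edge type names are the ones listed for this entity; everything else is an assumption made for illustration: the `Edge` tuple, the node ids, the second (certificate) root cause, and the reading of each edge name as source, verb, target. The text does not specify how the graph is actually stored or queried.

```python
from collections import namedtuple

# Hypothetical in-memory edge record (not the graph's real storage format).
Edge = namedtuple("Edge", ["type", "source", "target"])

# Illustrative data with made-up node ids.
EDGES = [
    Edge("investigation_revealed_root_cause", "inv-payments-500s", "rc-unmonitored-rebuild"),
    Edge("root_cause_causes_symptom", "rc-unmonitored-rebuild", "sym-error-spike"),
    Edge("investigation_revealed_root_cause", "inv-login-failures", "rc-expired-cert"),
    Edge("root_cause_causes_symptom", "rc-expired-cert", "sym-login-errors"),
    Edge("fix_resolved_root_cause", "fix-rotate-cert", "rc-expired-cert"),
]


def unresolved_root_causes(edges):
    """Return ids of root causes that no fix_resolved_root_cause edge points at."""
    known = set()
    resolved = set()
    for edge in edges:
        if edge.type == "investigation_revealed_root_cause":
            known.add(edge.target)
        elif edge.type.startswith("root_cause_causes_"):
            known.add(edge.source)
        elif edge.type == "fix_resolved_root_cause":
            resolved.add(edge.target)
    return sorted(known - resolved)


print(unresolved_root_causes(EDGES))
```

In this data a pool-size workaround would leave no `fix_resolved_root_cause` edge on `rc-unmonitored-rebuild`, so that node is the only one the query reports.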
Type-specific fields on BaseNode
severity (object): Severity (1 = minor, 5 = critical)
cause_category (string): Closed-enum cause category for RCA reporting and dashboards. Distinct from the legacy free-form `category`.
cause_confidence (string): Team certainty about this cause. `hypothesised` = educated guess. `likely` = evidence points here. `confirmed` = reproduced. Renamed from `confidence` in v0.4.0 to disambiguate from the entity-wide `UPGAssessment`-typed epistemic confidence used elsewhere. The 3-tier shape stays as a discrete RCA-lifecycle marker.
evidence_summary (string): One-paragraph evidence summary. Log lines, traces, repro steps. Detailed artefacts go on linked `evidence` nodes.
category (string): Legacy free-form category. Retained for v0.2 baseline; new graphs prefer `cause_category`.
affected_area (string): Affected area of the system
verified (boolean): Verified through investigation
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
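The field list maps naturally onto a typed record. The following is a sketch under stated assumptions, not the schema's official binding: the class name `RootCause`, the discriminator value `"root_cause"`, and the choice to validate only `cause_confidence` are all illustrative. The text does not list the allowed `cause_category` values or the shape of the `severity` object, so both are left unchecked.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

# The three tiers named in the cause_confidence description.
CAUSE_CONFIDENCE_LEVELS = ("hypothesised", "likely", "confirmed")


@dataclass
class RootCause:
    # Required BaseNode fields.
    id: str
    title: str
    type: str = "root_cause"  # assumed discriminator value
    # Optional BaseNode fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Type-specific fields.
    severity: Any = None  # documented as an object; its shape is not given
    cause_category: Optional[str] = None  # closed enum, values not listed
    cause_confidence: Optional[str] = None
    evidence_summary: Optional[str] = None
    category: Optional[str] = None  # legacy free-form field
    affected_area: Optional[str] = None
    verified: Optional[bool] = None

    def __post_init__(self):
        # Reject confidence values outside the three documented tiers.
        if (self.cause_confidence is not None
                and self.cause_confidence not in CAUSE_CONFIDENCE_LEVELS):
            raise ValueError(
                "cause_confidence must be one of %s" % (CAUSE_CONFIDENCE_LEVELS,)
            )


# Example instance using the payments incident; the id is a placeholder.
cause = RootCause(
    id="00000000-0000-0000-0000-000000000000",
    title="Nightly index rebuild job is not monitored",
    cause_confidence="likely",
    affected_area="payments API",
    verified=False,
)
print(cause.cause_confidence)
```

Keeping `cause_confidence` as a validated string mirrors the three-tier marker described above; an `Enum` would work equally well if the surrounding code prefers one.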
13 edge types connected to this entity.
investigation_revealed_root_cause
root_cause_resolved_by_fix
root_cause_causes_symptom
root_cause_causes_bug
fix_resolved_root_cause
root_cause_shares_cause_with_root_cause
root_cause_shares_cause_with_root_cause
root_cause_manifests_as_technical_debt_item
service_affected_by_root_cause
root_cause_affects_service
root_cause_affects_feature
incident_caused_by_root_cause
postmortem_identifies_root_cause