An observable sign that something is wrong. The surface evidence of a problem, not its cause.
A symptom is the observable manifestation of a fault: the thing a user, a dashboard, or an alert can actually see. It is evidence of trouble, never the trouble itself. The discipline of reliability turns on holding that line, because a team that fixes what it can see will spend forever fixing the same fault wearing different faces.
The symptom-versus-cause distinction was borrowed from medicine, where a symptom is what the patient reports and the cause is what the clinician must infer. Kaoru Ishikawa's cause-and-effect diagram, presented in 1968, carried the idea into manufacturing. It separates an observed effect from its candidate causes, grouped into families such as method and machine, so that teams stop treating the visible defect as the explanation.
Modern operations reframed the distinction around alerting. Google's Site Reliability Engineering practice argues for symptom-based alerting: page on what the user experiences, such as elevated latency or a failing checkout, rather than on every internal cause. The reasoning is that causes are many and shift as the system changes, while the symptoms that hurt users are few and stable. Rob Ewaschuk's "Philosophy on Alerting", which seeded the SRE book's monitoring chapter, frames monitoring as answering two questions, what is broken and why, and makes the case for paging on the first (the symptom) while leaving the second to investigation.
The live tension is calibration. Alert only on symptoms and you learn late, after users are already hurting. Alert on every cause and you drown in noise and start ignoring the page. The accepted balance pairs symptom alerts for paging with cause signals for diagnosis once you are already looking.
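The paging-versus-diagnosis split can be sketched as a routing rule. This is a minimal TypeScript illustration, not the behaviour of any particular monitoring product: the `Signal` shape, the `routeSignal` function, and the metric names and thresholds are all hypothetical. A breached symptom pages a human; a breached cause is only recorded for diagnosis.

```typescript
// Hypothetical signal kinds: a symptom is user-visible, a cause is internal.
type SignalKind = "symptom" | "cause";

interface Signal {
  name: string;
  kind: SignalKind;
  value: number;
  threshold: number; // a breach is any value above the threshold
}

type Route = "page" | "diagnose" | "ignore";

// Symptom breaches page; cause breaches are kept for whoever is already
// investigating and never page on their own.
function routeSignal(s: Signal): Route {
  if (s.value <= s.threshold) return "ignore";
  return s.kind === "symptom" ? "page" : "diagnose";
}

// Illustrative metric names and thresholds, not recommendations.
const signals: Signal[] = [
  { name: "checkout_p99_latency_ms", kind: "symptom", value: 2400, threshold: 1500 },
  { name: "fraud_check_latency_ms", kind: "cause", value: 900, threshold: 500 },
  { name: "db_replica_lag_s", kind: "cause", value: 2, threshold: 10 },
];

for (const s of signals) {
  console.log(`${s.name}: ${routeSignal(s)}`);
}
// checkout_p99_latency_ms: page
// fraud_check_latency_ms: diagnose
// db_replica_lag_s: ignore
```

Keeping the routing decision on the signal's kind, rather than on its value alone, is what stops a noisy cause metric from ever reaching the pager.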
Checkout conversion drops six percent at 14:00. That fall is the symptom: it is what the business dashboard shows and what a monitor flags. It says nothing about why. The investigation finds three contributing conditions: a third-party fraud check has slowed, a retry loop is amplifying the load, and a recent deploy shortened the client timeout. The single visible symptom sat on top of a small cluster of causes. Had the team alerted on raw fraud-check latency alone, they would have paged nightly on blips that never reached a customer. Alerting on the conversion symptom paged them exactly when it mattered, and the investigation supplied the rest.
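The worked example can be expressed as a symptom check on conversion itself. In this sketch the traffic counts, the 5% paging threshold, and the `TrafficWindow`, `relativeDrop`, and `shouldPage` names are invented for illustration, and "six percent" is read as a relative drop from baseline. A nightly fraud-check blip that leaves conversion flat stays quiet, while the 14:00 drop pages.

```typescript
// Hypothetical shape for one traffic window: checkout sessions and completed orders.
interface TrafficWindow {
  sessions: number;
  conversions: number;
}

function conversionRate(w: TrafficWindow): number {
  if (w.sessions <= 0) throw new Error("window has no sessions");
  return w.conversions / w.sessions;
}

// Relative drop of the current rate against the baseline (0.06 = six percent).
function relativeDrop(baseline: TrafficWindow, current: TrafficWindow): number {
  const base = conversionRate(baseline);
  if (base === 0) throw new Error("baseline conversion rate is zero");
  return (base - conversionRate(current)) / base;
}

// Page on the user-facing symptom only; cause metrics are left for diagnosis.
function shouldPage(baseline: TrafficWindow, current: TrafficWindow, maxDrop: number): boolean {
  return relativeDrop(baseline, current) >= maxDrop;
}

const baseline: TrafficWindow = { sessions: 10_000, conversions: 500 };
// Fraud check slow overnight, but customers still convert.
const nightlyBlip: TrafficWindow = { sessions: 10_000, conversions: 498 };
const at1400: TrafficWindow = { sessions: 10_000, conversions: 470 };

console.log(shouldPage(baseline, nightlyBlip, 0.05)); // false
console.log(shouldPage(baseline, at1400, 0.05)); // true
console.log((relativeDrop(baseline, at1400) * 100).toFixed(1)); // 6.0
```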
A root cause explains the symptom; the symptom is the evidence trail back to it. One root cause can throw several symptoms, and several causes can converge on one.
A symptom is a signal; an incident is an organisational response with an owner and a timeline. A symptom can pass without ever becoming an incident.
A monitor is the sensor; the symptom is what it senses. A symptom nobody monitors is one nobody learns about until a user reports it.
In the Unified Product Graph, Symptom sits in the engineering and reliability region as the observable layer of an incident chain. It is produced by root_cause_causes_symptom (Root Cause causes Symptom, causal) and located by investigation_surfaces_symptom (Investigation surfaces Symptom, hierarchy). The watch comes from monitor_detects_symptom (Monitor detects Symptom, causal), and escalation runs through symptom_triggers_incident (Symptom triggers Incident, causal). Keeping the symptom as its own node, distinct from cause and incident, preserves the honest sequence of a diagnosis: a team can see whether a closed incident actually removed the underlying root cause or merely silenced the symptom that exposed it.
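The edge chain supports the "removed or merely silenced" check as a short traversal. This sketch assumes edges are stored as `from`/`to` pairs in the direction the edge names read, and adds a hypothetical `open`/`fixed` status on root-cause nodes that this reference does not define. The node ids reuse the checkout example, treating its three contributing conditions as root causes for illustration.

```typescript
// Edge names come from the graph description; from/to follow the reading
// "Root Cause causes Symptom", "Symptom triggers Incident", and so on.
type EdgeType =
  | "root_cause_causes_symptom"
  | "investigation_surfaces_symptom"
  | "monitor_detects_symptom"
  | "symptom_triggers_incident";

interface Edge {
  type: EdgeType;
  from: string; // source node id
  to: string; // target node id
}

// Hypothetical resolution state for root-cause nodes (not a documented field).
type CauseStatus = "open" | "fixed";

// Walk Incident <- Symptom <- Root Cause and collect distinct root-cause ids.
function rootCausesOf(edges: Edge[], incidentId: string): string[] {
  const symptoms = edges
    .filter((e) => e.type === "symptom_triggers_incident" && e.to === incidentId)
    .map((e) => e.from);
  const causes: string[] = [];
  for (const e of edges) {
    const feedsSymptom = e.type === "root_cause_causes_symptom" && symptoms.indexOf(e.to) !== -1;
    if (feedsSymptom && causes.indexOf(e.from) === -1) causes.push(e.from);
  }
  return causes;
}

// Root causes still open behind an incident. A non-empty result means closing
// the incident silenced the symptom without removing what produced it.
function unresolvedCauses(
  edges: Edge[],
  incidentId: string,
  statusById: Record<string, CauseStatus>,
): string[] {
  return rootCausesOf(edges, incidentId).filter((c) => statusById[c] !== "fixed");
}

const edges: Edge[] = [
  { type: "monitor_detects_symptom", from: "conversion-monitor", to: "conversion-drop" },
  { type: "investigation_surfaces_symptom", from: "investigation-1", to: "conversion-drop" },
  { type: "symptom_triggers_incident", from: "conversion-drop", to: "INC-1" },
  { type: "root_cause_causes_symptom", from: "fraud-check-slowdown", to: "conversion-drop" },
  { type: "root_cause_causes_symptom", from: "retry-amplification", to: "conversion-drop" },
  { type: "root_cause_causes_symptom", from: "client-timeout-regression", to: "conversion-drop" },
];

const causeStatus: Record<string, CauseStatus> = {
  "fraud-check-slowdown": "open",
  "retry-amplification": "fixed",
  "client-timeout-regression": "fixed",
};

console.log(unresolvedCauses(edges, "INC-1", causeStatus)); // [ 'fraud-check-slowdown' ]
```

An empty `rootCausesOf` result is a different warning from an empty `unresolvedCauses` result: it means no cause has been linked to the incident's symptoms yet.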
Type-specific fields on BaseNode
symptom_description (string): Plain-language description of observed behaviour. Primary content of the entity.
first_observed_at (string): ISO timestamp of when the symptom was first observed in the wild. Pairs with `frequency_rating` and `reproducibility` for triage.
severity (object): Severity for affected users, independent of how widespread the symptom is. Canonicalised in v0.4.0: the ad-hoc `'low' | 'medium' | 'high' | 'critical'` shape was replaced by `UPGAssessment`.
frequency_count (number): Exact observation count in the period. Pairs with `frequency_period` for a precise rate.
frequency_period (string): Recurrence period as an ISO-8601 `Duration`. Examples: 'P7D' (per week), 'P1D' (per day), 'PT1H' (per hour).
frequency_rating (string): Qualitative frequency tier. Canonical replacement for the legacy `'once' | 'sporadic' | 'frequent' | 'constant' | string` shape. Use when an exact rate is unknown. Migration: `once → rare`, `sporadic → occasional`, `frequent → regular`, `constant → constant`.
affected_users_estimate (number): Approximate count of users affected. Snapshot estimate.
reproducibility (string): Reproduction reliability.
steps_to_reproduce (string): Steps to reproduce.
id (string, required): Unique identifier (UUID).
type (NodeType, required): Discriminator for the entity type.
title (string, required): Display name.
description (string): Optional detailed description.
status (string): Lifecycle status.
tags (string[]): Freeform tags for filtering.
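The field list can be read as a TypeScript shape. This sketch uses stand-in definitions for `UPGAssessment` and `NodeType`, whose real shapes are not shown here, and assumes `id` through `tags` are the fields shared via BaseNode. The helpers `migrateFrequencyRating` and `ratePerDay` are hypothetical: the first applies the documented legacy-tier migration, and the second turns `frequency_count` plus `frequency_period` into a per-day rate, supporting only day and hour durations.

```typescript
// Stand-ins so the sketch compiles on its own; the real definitions are assumed.
type UPGAssessment = Record<string, unknown>;
type NodeType = string;

interface SymptomNode {
  // Shared node fields (assumed to come from BaseNode).
  id: string; // UUID
  type: NodeType;
  title: string;
  description?: string;
  status?: string;
  tags?: string[];
  // Symptom-specific fields.
  symptom_description?: string;
  first_observed_at?: string; // ISO timestamp
  severity?: UPGAssessment;
  frequency_count?: number;
  frequency_period?: string; // ISO-8601 duration, e.g. "P7D"
  frequency_rating?: string; // canonical tiers: rare | occasional | regular | constant
  affected_users_estimate?: number;
  reproducibility?: string;
  steps_to_reproduce?: string;
}

type LegacyFrequency = "once" | "sporadic" | "frequent" | "constant";
type FrequencyRating = "rare" | "occasional" | "regular" | "constant";

// The documented migration table for legacy frequency tiers.
const LEGACY_FREQUENCY: Record<LegacyFrequency, FrequencyRating> = {
  once: "rare",
  sporadic: "occasional",
  frequent: "regular",
  constant: "constant",
};

// Legacy tiers map to canonical ones; any other string passes through unchanged
// (an assumption, since the legacy shape also allowed free-form strings).
function migrateFrequencyRating(value: string): string {
  return Object.prototype.hasOwnProperty.call(LEGACY_FREQUENCY, value)
    ? LEGACY_FREQUENCY[value as LegacyFrequency]
    : value;
}

// Per-day rate from a count and a duration such as "P7D", "P1D" or "PT1H".
// Only whole days and hours are handled; other ISO-8601 parts are rejected.
function ratePerDay(count: number, period: string): number {
  const m = /^P(?:(\d+)D)?(?:T(\d+)H)?$/.exec(period);
  if (!m) throw new Error(`unsupported period: ${period}`);
  const hours = Number(m[1] ?? 0) * 24 + Number(m[2] ?? 0);
  if (hours === 0) throw new Error(`empty period: ${period}`);
  return (count * 24) / hours;
}

const example: SymptomNode = {
  id: "00000000-0000-0000-0000-000000000000", // placeholder UUID
  type: "symptom", // assumed discriminator value
  title: "Checkout conversion drop",
  frequency_count: 14,
  frequency_period: "P7D",
  frequency_rating: migrateFrequencyRating("frequent"),
};

console.log(example.frequency_rating); // regular
console.log(ratePerDay(example.frequency_count ?? 0, example.frequency_period ?? "P1D")); // 2
```

Multiplying before dividing keeps whole-number rates such as 14 per week exact.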
4 edge types connected to this entity.
investigation_surfaces_symptom