Operations & Quality

Incidents that leave a lesson the next on-call can find

An incident resolves, a postmortem is written, and six months later the same class of failure recurs because the lesson lived in a doc nobody re-reads. UPG types the whole operational backbone: the pipeline that ships, the monitors and SLOs that watch, the incident that breaks, the postmortem and root cause that explain it, the fix that prevents the next one, the quality gates that guard the release, and the compliance and support that close the loop. The break-and-learn cycle becomes structure a team can query, rather than knowledge that leaves with the on-call engineer.

“Everything fails, all the time.”
Werner Vogels, CTO, Amazon
01 DevOps Backbone

The pipeline that ships, and the runbook that recovers

A CI pipeline produces the build artifact and deploys the service, and the runbook that mitigates an incident is a typed node alongside them.

The path out of an incident links to the path that shipped it. Recovery is not improvised under pressure, because the runbook that handles a given failure is one edge from the incident it was written for.

Reporting CI pipeline (anchor; ci_pipeline: a CI/CD pipeline)
- produces [ci_pipeline_produces_build_artifact] -> reporting:3.0.1 (build_artifact: a build output, such as a binary or container image)
- deploys [ci_pipeline_deploys_service] -> Reporting service (service: a deployable service or microservice)

And when a deploy goes wrong, the recovery is pre-wired:

Aggregation throttle runbook (runbook: a runbook for incident response)
- mitigates [runbook_mitigates_incident] -> Dashboard outage (incident: a production incident)

The backbone carries code from commit to production. A CI pipeline produces the build artifact and deploys the service. The runbook that mitigates an incident is a node in the graph, so the path out of a failed deploy is linked to the path in.
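
The "one edge from the incident" lookup can be sketched in a few lines. This is a minimal illustration, not UPG's API: the `Edge` tuple, the `EDGES` list, and the `runbooks_for` helper are assumptions made for the example, and only the node labels and edge type names come from the example above.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str  # label of the node the edge starts from
    kind: str    # typed edge name, as shown in the example
    target: str  # label of the node the edge points to


# Hypothetical in-memory stand-in for the graph; labels are from the example.
EDGES = [
    Edge("Reporting CI pipeline", "ci_pipeline_produces_build_artifact", "reporting:3.0.1"),
    Edge("Reporting CI pipeline", "ci_pipeline_deploys_service", "Reporting service"),
    Edge("Aggregation throttle runbook", "runbook_mitigates_incident", "Dashboard outage"),
]


def runbooks_for(incident: str, edges: list[Edge] = EDGES) -> list[str]:
    """Return every runbook that is one 'mitigates' edge away from the incident."""
    return [
        e.source
        for e in edges
        if e.kind == "runbook_mitigates_incident" and e.target == incident
    ]


print(runbooks_for("Dashboard outage"))  # ['Aggregation throttle runbook']
```

Because the runbook is a node rather than a link buried in a document, the lookup is a filter on one edge type instead of a search.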

02 Monitoring & SLOs

What the monitor measures, and what the SLO promises

A monitor watches the service, measures the service level indicator behind an SLO, triggers via an alert rule wired to a named owner, and detects the symptom before a user reports it.

The measurement and the promise sit in one graph. An alert always names the service and the indicator it fired on, so the page that opens already says where the problem is.

Reporting availability monitor (anchor; monitor: a monitoring check)
- watches [monitor_watches_service] -> Reporting service (service: a deployable service or microservice)
- measures [monitor_measures_service_level_indicator] -> Successful report loads / total (service_level_indicator: a service level indicator, SLI)
- triggers via [monitor_triggers_via_alert_rule] -> Error rate above 1% for 5m (alert_rule: an alerting rule)
- detects [monitor_detects_symptom] -> Dashboards loading slowly (symptom: a symptom of a problem)

Measurement and the promise made on it sit in the same graph. A monitor watches the service, measures the indicator behind an SLO, triggers via an alert rule routed to a person, and detects the symptom ahead of a user report. An alert names the service and the indicator it fired on.
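
As a rough sketch of how a page could "already say where the problem is", the snippet below gathers a monitor's outgoing edges into the fields an alert would carry. The tuple layout, the `page_context` helper, and the short field names are assumptions for illustration; the labels and edge type names are the ones in the example above.

```python
# Hypothetical stand-in for the graph: (source label, edge type, target label).
EDGES = [
    ("Reporting availability monitor", "monitor_watches_service", "Reporting service"),
    ("Reporting availability monitor", "monitor_measures_service_level_indicator",
     "Successful report loads / total"),
    ("Reporting availability monitor", "monitor_triggers_via_alert_rule",
     "Error rate above 1% for 5m"),
    ("Reporting availability monitor", "monitor_detects_symptom", "Dashboards loading slowly"),
]

# Assumed mapping from edge type to a short field name on the page.
FIELDS = {
    "monitor_watches_service": "service",
    "monitor_measures_service_level_indicator": "indicator",
    "monitor_triggers_via_alert_rule": "alert_rule",
    "monitor_detects_symptom": "symptom",
}


def page_context(monitor: str, edges=EDGES) -> dict[str, str]:
    """Collect what a page opened from this monitor can state up front."""
    context = {}
    for source, edge_type, target in edges:
        if source == monitor and edge_type in FIELDS:
            context[FIELDS[edge_type]] = target
    return context


ctx = page_context("Reporting availability monitor")
print(f"{ctx['alert_rule']} on {ctx['service']} (SLI: {ctx['indicator']})")
```

The sketch keeps one target per edge type for brevity; a monitor with several alert rules would need a list per field.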

03 The Incident Loop

What an incident connects to across the graph

An incident breaches a service level objective, affects a feature, triggers a postmortem, is caused by a root cause, and generates support tickets. Each is a typed edge, so the blast radius resolves to a query rather than a reconstruction.

The loop closes back onto the rest of the product. The feature affected is the same one the roadmap tracks, and the SLO breached is the one engineering set. When a similar failure starts, the graph already records what happened last time.

Dashboard outage, 2026-06 (anchor; incident: a production incident)
- breaches [incident_breaches_service_level_objective] -> 99.9% reporting uptime (service_level_objective: a service level objective, SLO)
- affects [incident_affects_feature] -> Verified metrics dashboard (feature: a product capability or feature)
- triggers [incident_triggers_postmortem] -> PM-141 (postmortem: a post-incident review)
- caused by [incident_caused_by_root_cause] -> Unbounded aggregation query (root_cause: an identified root cause of an issue)
- generates [incident_generates_support_ticket] -> 37 tickets (support_ticket: a customer support request or issue)

An incident is the operations anchor, and it does not vanish when the page resolves. It breaches an SLO, affects a feature, triggers a postmortem, and is caused by a root cause, which links to the change that prevents the next one. The loop from break to learning to prevention is structure the team can query, not tribal memory.
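
To make "the blast radius resolves to a query" concrete, here is a minimal sketch that groups an incident's outgoing edges by type. The tuple representation and the `blast_radius` helper are assumptions, not UPG's query interface, and the 37 tickets are collapsed into a single labelled node for brevity.

```python
from collections import defaultdict

INCIDENT = "Dashboard outage, 2026-06"

# Hypothetical stand-in for the graph: (source label, edge type, target label).
EDGES = [
    (INCIDENT, "incident_breaches_service_level_objective", "99.9% reporting uptime"),
    (INCIDENT, "incident_affects_feature", "Verified metrics dashboard"),
    (INCIDENT, "incident_triggers_postmortem", "PM-141"),
    (INCIDENT, "incident_caused_by_root_cause", "Unbounded aggregation query"),
    (INCIDENT, "incident_generates_support_ticket", "37 tickets"),
]


def blast_radius(incident: str, edges=EDGES) -> dict[str, list[str]]:
    """Group everything the incident touches by the typed edge that links it."""
    radius = defaultdict(list)
    for source, edge_type, target in edges:
        if source == incident:
            radius[edge_type].append(target)
    return dict(radius)


for edge_type, targets in blast_radius(INCIDENT).items():
    print(f"{edge_type}: {', '.join(targets)}")
```

The same function answers "which SLO did this breach?" and "which feature did it hit?" without anyone reconstructing the timeline from chat logs.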

04 Postmortem & Prevention

How the postmortem turns into prevention

A postmortem identifies the root cause and produces a runbook. The root cause, in turn, is resolved by a fix, affects the service, and causes the bug it explains.

Prevention links to the failure it answers. When a similar incident appears, the graph names the root cause it shares and the fix that worked, in place of an engineer half-remembering a thread from last year.

PM-141: Dashboard outage (anchor; postmortem: a post-incident review)
- identifies [postmortem_identifies_root_cause] -> Unbounded aggregation query (root_cause: an identified root cause of an issue)
- produces [postmortem_produces_runbook] -> Aggregation throttle runbook (runbook: a runbook for incident response)

And the root cause links to the fix that prevents the next one:
- resolved by [root_cause_resolved_by_fix] -> Bound the aggregation window (fix: a fix applied to resolve an issue)
- affects [root_cause_affects_service] -> Reporting service (service: a deployable service or microservice)
- causes [root_cause_causes_bug] -> Timeout on large orgs (bug: a defect or unexpected behaviour)

A postmortem retains the learning after the page resolves. It identifies the root cause and produces a runbook. The root cause is resolved by a fix, affects the service, and causes the bug it explains. A later incident that looks similar resolves against a graph that already records the earlier one.
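
One way to picture "the graph names the root cause it shares and the fix that worked" is a two-hop lookup starting from a root cause. The sketch below assumes a plain list of edge tuples and three hypothetical helpers (`targets`, `sources`, `prior_learning`); only the labels and edge type names are taken from the example.

```python
# Hypothetical stand-in for the graph: (source label, edge type, target label).
EDGES = [
    ("PM-141", "postmortem_identifies_root_cause", "Unbounded aggregation query"),
    ("PM-141", "postmortem_produces_runbook", "Aggregation throttle runbook"),
    ("Unbounded aggregation query", "root_cause_resolved_by_fix", "Bound the aggregation window"),
    ("Unbounded aggregation query", "root_cause_affects_service", "Reporting service"),
    ("Unbounded aggregation query", "root_cause_causes_bug", "Timeout on large orgs"),
]


def targets(source, edge_type, edges=EDGES):
    """Follow an edge type forward from a node."""
    return [t for s, e, t in edges if s == source and e == edge_type]


def sources(edge_type, target, edges=EDGES):
    """Follow an edge type backward to the nodes that point at a target."""
    return [s for s, e, t in edges if e == edge_type and t == target]


def prior_learning(root_cause, edges=EDGES):
    """Gather the fix, postmortem, and runbook already linked to a root cause."""
    postmortems = sources("postmortem_identifies_root_cause", root_cause, edges)
    runbooks = [
        runbook
        for pm in postmortems
        for runbook in targets(pm, "postmortem_produces_runbook", edges)
    ]
    return {
        "fixes": targets(root_cause, "root_cause_resolved_by_fix", edges),
        "postmortems": postmortems,
        "runbooks": runbooks,
    }


print(prior_learning("Unbounded aggregation query"))
```

A new incident attributed to the same root cause would reach this record in two hops, which is the structural version of remembering last year's thread.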

05 Quality Gates

The gate a release passes before it ships

A test suite contains the test case nodes, includes the regression test that guards against old bugs, is tested via a QA session, covers the feature it protects, and is measured by a test coverage report.

The question “is this safe to ship?” resolves to the gate the graph already holds. The regression that caught last quarter’s bug is linked to the feature it protects, so the reason a check exists stays attached to the work it guards.

Reporting test suite (anchor; test_suite: a suite of related tests)
- contains [test_suite_contains_test_case] -> Badge renders on verified metric (test_case: an individual test case)
- includes [test_suite_includes_regression_test] -> No double-count on rollup (regression_test: a regression test)
- tested via [test_suite_tested_via_qa_session] -> Pre-release QA pass (qa_session: an exploratory QA session)
- covers [test_suite_covers_feature] -> Verified metrics dashboard (feature: a product capability or feature)
- measured by [test_suite_measured_by_test_coverage_report] -> 87% of lines, 92% of branches (test_coverage_report: a test coverage report)

A test suite gates delivery. It contains the cases and includes the regressions that guard against old bugs, is tested via a QA session, covers the feature it protects, and is measured by a coverage report. Whether a change is safe to ship reads as the gate the graph already holds.
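
A hedged sketch of reading the gate from the graph: start at a feature, find the suites that cover it, and list what each suite carries. The `gate_for` helper and the tuple layout are assumptions for illustration, and the sketch deliberately applies no pass or fail threshold, since the example does not state one.

```python
# Hypothetical stand-in for the graph: (source label, edge type, target label).
EDGES = [
    ("Reporting test suite", "test_suite_contains_test_case", "Badge renders on verified metric"),
    ("Reporting test suite", "test_suite_includes_regression_test", "No double-count on rollup"),
    ("Reporting test suite", "test_suite_tested_via_qa_session", "Pre-release QA pass"),
    ("Reporting test suite", "test_suite_covers_feature", "Verified metrics dashboard"),
    ("Reporting test suite", "test_suite_measured_by_test_coverage_report",
     "87% of lines, 92% of branches"),
]

COVERS = "test_suite_covers_feature"


def gate_for(feature: str, edges=EDGES) -> dict[str, dict[str, list[str]]]:
    """For each suite covering the feature, list the checks attached to that suite."""
    suites = [s for s, e, t in edges if e == COVERS and t == feature]
    gate = {}
    for suite in suites:
        checks = {}
        for source, edge_type, target in edges:
            if source == suite and edge_type != COVERS:
                checks.setdefault(edge_type, []).append(target)
        gate[suite] = checks
    return gate


for suite, checks in gate_for("Verified metrics dashboard").items():
    print(suite)
    for edge_type, items in checks.items():
        print(f"  {edge_type}: {', '.join(items)}")
```

Starting from the feature rather than the suite keeps the reason for each check visible: the regression test shows up because the feature it guards was the one asked about.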

06 Compliance & Support

The audit and the support ticket, on one graph

A compliance framework mandates a compliance requirement, requires the security control that satisfies it, and is verified by a security audit. The customer’s voice is typed in the same way: a support ticket reveals a need and reports a bug.

An audit and a customer report draw on the same structure. A ticket becomes discovery and delivery work, linked to the need it raised and the bug it found, so the help desk feeds the rest of the product rather than closing the thread.

SOC 2 Type II (anchor; compliance_framework: a compliance framework, such as SOC 2 or GDPR)
- mandates [compliance_framework_mandates_compliance_requirement] -> Quarterly access reviews (compliance_requirement: a compliance requirement)
- requires [compliance_framework_requires_security_control] -> Signed, short-lived export tokens (security_control: a security control or mitigation)
- verified by [compliance_framework_verified_by_security_audit] -> 2026 Type II audit (security_audit: a security audit)

And the customer's voice is typed straight back into the graph:
- reveals [support_ticket_reveals_need] -> Need a faster export (need: a user need, pain, desire, or constraint)
- reports [support_ticket_reports_bug] -> Export link expires too soon (bug: a defect or unexpected behaviour)

A compliance framework mandates requirements, requires the security controls that satisfy them, and is verified by an audit, while a support ticket reveals a need and reports a bug. A support ticket lands in the graph as discovery and delivery work rather than staying in a help desk.
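
How a ticket might turn into discovery and delivery work can be sketched as a routing step over its two edge types. Everything structural here is an assumption: the ticket label is hypothetical (the example does not name the ticket), and the rule that a revealed need feeds discovery while a reported bug feeds delivery is one plausible reading of the text, not a documented UPG behaviour.

```python
TICKET = "Export ticket"  # hypothetical label; the example leaves the ticket unnamed

# Hypothetical stand-in for the graph: (source label, edge type, target label).
EDGES = [
    (TICKET, "support_ticket_reveals_need", "Need a faster export"),
    (TICKET, "support_ticket_reports_bug", "Export link expires too soon"),
]

# Assumed routing: needs go to discovery, bugs go to delivery.
ROUTES = {
    "support_ticket_reveals_need": "discovery",
    "support_ticket_reports_bug": "delivery",
}


def work_from_ticket(ticket: str, edges=EDGES) -> dict[str, list[str]]:
    """Split what a ticket surfaced into discovery items and delivery items."""
    work = {"discovery": [], "delivery": []}
    for source, edge_type, target in edges:
        if source == ticket and edge_type in ROUTES:
            work[ROUTES[edge_type]].append(target)
    return work


print(work_from_ticket(TICKET))
```

Both items stay linked to the ticket that raised them, so closing the ticket does not discard the need or the bug.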

07 Where To Go Next

Operations ties delivery and engineering to the reality of running the thing. Follow a thread back to the system or the measurement: