Allowed unreliability before action is required
An error budget is the amount of unreliability a service is allowed to spend. If the objective is 99.9% availability, the budget is the remaining 0.1%: the failures, the slow responses, the downtime a team may incur before it has broken its own promise. The reframing is quiet and large. One hundred percent reliability stops being the goal, and a measured quantity of failure becomes a resource to spend on shipping faster.
The error budget is a load-bearing idea in Google's Site Reliability Engineering practice, set out in the chapter "Embracing Risk" of the SRE book (2016). The arithmetic is deliberately plain: the budget is 100% minus the objective. A service with a 99.9% objective serving a million requests in four weeks has a budget of a thousand failed requests over that window.
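The arithmetic above can be written out as a short sketch. The function and parameter names are illustrative assumptions and do not come from any SRE tooling.

```typescript
// Sketch: request-based error budget = (100% - objective) x total requests.
// `allowedFailedRequests` and its parameter names are hypothetical.
function allowedFailedRequests(objectivePercent: number, totalRequests: number): number {
  if (objectivePercent < 0 || objectivePercent > 100) {
    throw new RangeError("objectivePercent must be between 0 and 100");
  }
  const budgetFraction = (100 - objectivePercent) / 100;
  // Round to absorb floating-point residue (e.g. 999.9999999999 instead of 1000).
  return Math.round(budgetFraction * totalRequests);
}

// The example from the text: 99.9% objective, one million requests in the window.
console.log(allowedFailedRequests(99.9, 1_000_000)); // 1000
```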
The mechanism it replaces is an old standoff. Developers want to ship; operations wants to stay stable; each accuses the other of bad faith. The error budget removes the politics by making "how reliable" a number both teams agreed to in advance. While budget remains, the developers may spend it on risk: new features, faster releases, bolder experiments. The budget aligns incentives by giving both sides the same scoreboard.
The SRE Workbook (2018) turned the number into a policy with teeth. A typical error budget policy says that while a service is within budget, releases proceed; once the budget for the window is exhausted, all changes except security and critical fixes halt until reliability recovers. The freeze is the part that gives the budget real authority. A budget you can overspend without consequence is a suggestion, and the policy is what converts the suggestion into a brake.
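The typical policy described above reduces to a small gate. This is a minimal sketch under stated assumptions: the `ChangeKind` type and the `mayShip` function are hypothetical names, and a real policy would usually carry more conditions than this.

```typescript
// Sketch of the policy described above: releases proceed within budget;
// once the budget is exhausted, only security and critical fixes ship.
// The type and function names here are hypothetical.
type ChangeKind = "feature" | "security" | "critical_fix";

function mayShip(budgetRemainingPercent: number, change: ChangeKind): boolean {
  if (budgetRemainingPercent > 0) {
    return true; // within budget: all releases proceed
  }
  // Budget exhausted: freeze everything except security and critical fixes.
  return change === "security" || change === "critical_fix";
}

console.log(mayShip(45.2, "feature")); // true
console.log(mayShip(0, "feature"));    // false
console.log(mayShip(0, "security"));   // true
```

Keeping the exemptions in the type makes the freeze explicit: anything not listed as an exception stops when the budget reaches zero.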
A team runs a checkout service to a 99.95% availability objective, measured over a 28-day window against an indicator of successful requests over valid requests. That objective grants a budget of 0.05%, roughly twenty minutes of total unavailability across the four weeks, or its equivalent in scattered failed requests.
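The "roughly twenty minutes" figure follows from the window length. A minimal sketch, assuming the budget is expressed as the equivalent minutes of total unavailability; `budgetMinutes` is an illustrative name.

```typescript
// Sketch: time-equivalent error budget for an availability objective.
// budget minutes = (100% - objective) x minutes in the window.
function budgetMinutes(objectivePercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60; // 28 days = 40,320 minutes
  return ((100 - objectivePercent) / 100) * windowMinutes;
}

// The checkout example: 99.95% over a 28-day window.
console.log(budgetMinutes(99.95, 28).toFixed(2)); // "20.16"
```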
For the first three weeks they spend the budget on velocity, shipping daily, accepting that an occasional bad deploy burns a few minutes of budget. In week four a botched migration causes a fifteen-minute partial outage, and the burn-rate alert fires: most of the remaining budget has gone in one incident. The policy triggers automatically. Feature releases freeze, the on-call rolls back, and engineering effort moves to the reliability work that had been deferred. No manager had to win an argument; the budget made the call, and it resets at the start of the next window.
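Burn rate can be read as a multiplier against the sustainable pace, where the sustainable pace spends the budget evenly across the window. The sketch below assumes that definition; the function names and the sample numbers are illustrative and are not taken from the checkout scenario.

```typescript
// Sketch: burn rate as a multiplier against the sustainable pace.
// burn rate = (fraction of budget consumed) / (fraction of window elapsed).
// Names and the sample numbers below are hypothetical.
function burnRate(budgetConsumedFraction: number, windowElapsedFraction: number): number {
  if (windowElapsedFraction <= 0) {
    throw new RangeError("windowElapsedFraction must be greater than 0");
  }
  return budgetConsumedFraction / windowElapsedFraction;
}

function budgetRemainingPercent(budgetConsumedFraction: number): number {
  // Clamp at 0 so an overspent budget reads as exhausted rather than negative.
  return Math.max(0, (1 - budgetConsumedFraction) * 100);
}

// Hypothetical: half the budget gone after a quarter of the window.
console.log(burnRate(0.5, 0.25));         // 2
console.log(budgetRemainingPercent(0.5)); // 50
```

A value of 1 means the budget would run out exactly at the end of the window; anything above 1 exhausts it early. Alerting setups often compute the same ratio over a shorter lookback instead of the whole window.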
In the Unified Product Graph, an error budget sits in the operations region as the derived consequence of a reliability target. Its defining edge is its source: an objective budgets as an error budget (Service Level Objective → budgets as → Error Budget; edge category: hierarchy; edge type: service_level_objective_budgets_as_error_budget). Modelling the budget as its own node, rather than a property of the objective, lets the graph attach what depends on it, the release policy it gates and the incidents that drained it, so the budget reads as a live account rather than a fixed figure. That structure mirrors how the budget works in practice: a single number, owned jointly, that decides whether the next change ships.
Type-specific fields on BaseNode
budget_remaining (number): Remaining budget percentage (0–100). Example: 45.2 (45.2% remaining, 54.8% used).
burn_rate (number): Consumption rate as a multiplier against sustainable burn. 1.0 = on track. 2.5 = consuming 2.5x faster than sustainable.
policy (string): Policy when the budget hits 0%. Examples: "Freeze all non-reliability deploys", "Page engineering lead immediately".
budget_window (string): Budget window. Defines reset cadence. Examples: "30 days", "rolling 28 days".
id (string, required): Unique identifier (UUID).
type (NodeType, required): Discriminator for the entity type.
title (string, required): Display name.
description (string): Optional detailed description.
status (string): Lifecycle status.
tags (string[]): Freeform tags for filtering.
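The field list above can be sketched as TypeScript types. The interface names, the "error_budget" discriminator value, the placeholder UUID, and the split between base and type-specific fields are assumptions inferred from the list, not the graph's published schema.

```typescript
// Sketch of the field list as TypeScript types. Interface names and the
// base/type-specific split are assumptions, not the published schema.
interface BaseNode {
  id: string;            // unique identifier (UUID)
  type: string;          // NodeType in the source; modelled as a plain string here
  title: string;         // display name
  description?: string;  // optional detailed description
  status?: string;       // lifecycle status
  tags?: string[];       // freeform tags for filtering
}

interface ErrorBudgetNode extends BaseNode {
  budget_remaining?: number; // remaining budget percentage, 0-100
  burn_rate?: number;        // multiplier against sustainable burn; 1.0 = on track
  policy?: string;           // what happens when the budget hits 0%
  budget_window?: string;    // defines the reset cadence
}

// Hypothetical instance using the example values from the field list.
const checkoutBudget: ErrorBudgetNode = {
  id: "00000000-0000-4000-8000-000000000000", // placeholder UUID
  type: "error_budget",                        // assumed discriminator value
  title: "Checkout availability budget",
  budget_remaining: 45.2,
  burn_rate: 2.5,
  policy: "Freeze all non-reliability deploys",
  budget_window: "rolling 28 days",
};

console.log(checkoutBudget.budget_remaining);
```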
1 edge type connected to this entity.
service_level_objective_budgets_as_error_budget