A quantitative measure of one aspect of service health, for example request latency or availability.
A service level indicator is the measured quantity that tells you how a service is actually doing: the proportion of requests served successfully, the share of responses under 100 milliseconds, the fraction of jobs that completed. It is the number, defined precisely enough to be computed from real telemetry. Everything above it (the target, the budget, the contract) depends on the indicator being chosen well, because a meaningless measurement makes every promise built on it meaningless too.
The term was given its working definition by Google's Site Reliability Engineering practice, set out in the SRE book (2016). The book defines a service level indicator as "a carefully defined quantitative measure of some aspect of the level of service that is provided." It sits in a trio: the indicator is what you measure, the objective is the target you hold it to, and the agreement is the contract with consequences.
The SRE Workbook (2018) sharpened the recommended form into a ratio: good events divided by valid events, multiplied by a hundred for a percentage. The phrasing repays attention. "Valid events" is not "all events." A well-built availability indicator excludes events that say nothing about service health, so client errors in the 4xx range and redirect responses are often kept out of the denominator. Folding them in punishes the service for a user mistyping a URL, which corrupts the signal.
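The ratio form can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `ResponseEvent` shape and the function names are hypothetical, and the choice to exclude both 3xx and 4xx responses from the valid set is one common option that a given service may reasonably change.

```typescript
// Hypothetical event shape: one record per HTTP response.
interface ResponseEvent {
  status: number; // HTTP status code
}

// Assumption: redirects (3xx) and client errors (4xx) say nothing about
// service health here, so they are kept out of the denominator.
function isValid(e: ResponseEvent): boolean {
  return !(e.status >= 300 && e.status < 500);
}

// Assumption: a "good" event is a 2xx response.
function isGood(e: ResponseEvent): boolean {
  return e.status >= 200 && e.status < 300;
}

// good events / valid events * 100, or null when there is nothing valid to measure.
function sliPercent(events: ResponseEvent[]): number | null {
  const valid = events.filter(isValid);
  if (valid.length === 0) return null;
  const good = valid.filter(isGood);
  return (good.length / valid.length) * 100;
}
```

With three 200s and one 500 the function returns 75. Adding a 404 to that set leaves the result at 75, because the 404 never enters the denominator.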
The standing debate is which indicator to pick. The SRE guidance is to choose few, and to choose those that track the user's actual experience. A service can look healthy on a CPU graph while every user request times out. The indicator has to measure something a user would notice, and ideally something they would complain about when it slips.
An API team picks availability as its first indicator. The naive version, "successful responses over all responses," counts a flood of 401s from an expired token integration as service failures, dragging the number down for a problem the service did not cause. They redefine it: good events are responses with a 2xx or 3xx status, valid events exclude 4xx client errors, and the indicator is the ratio of the two over a rolling 28-day window.
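The difference between the naive and the redefined indicator can be sketched as follows. The `DailyCounts` rollup and both function names are hypothetical, and the sketch assumes one entry per day, sorted oldest first. It is an illustration under those assumptions, not the team's actual pipeline.

```typescript
// Hypothetical daily rollup of response counts by status class.
interface DailyCounts {
  day: string; // ISO date, e.g. "2024-05-01"
  s2xx: number;
  s3xx: number;
  s4xx: number;
  s5xx: number;
}

const WINDOW_DAYS = 28;

// Naive version: successful responses over all responses,
// so every 4xx (including a flood of 401s) counts against the service.
function naiveAvailability(days: DailyCounts[]): number | null {
  const recent = days.slice(-WINDOW_DAYS);
  let good = 0;
  let all = 0;
  for (const d of recent) {
    good += d.s2xx + d.s3xx;
    all += d.s2xx + d.s3xx + d.s4xx + d.s5xx;
  }
  return all === 0 ? null : good / all;
}

// Redefined version: good = 2xx or 3xx, valid = everything except 4xx,
// evaluated over the most recent 28 daily buckets.
function availability(days: DailyCounts[]): number | null {
  const recent = days.slice(-WINDOW_DAYS);
  let good = 0;
  let valid = 0;
  for (const d of recent) {
    good += d.s2xx + d.s3xx;
    valid += d.s2xx + d.s3xx + d.s5xx;
  }
  return valid === 0 ? null : good / valid;
}
```

For a day with 900 2xx, 50 3xx, 40 4xx and 10 5xx responses, the naive ratio is 950/1000 while the redefined ratio is 950/960. The gap between the two is exactly the client-error traffic the service did not cause.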
Now the number means something. When a deploy starts returning 500s, the indicator drops within minutes, and because the denominator is clean, the drop reflects a real regression the team owns. They add a second indicator for latency, the share of requests served under 300 milliseconds, because availability alone would call a slow-but-answering service healthy. Two well-chosen indicators describe the experience; ten poorly chosen ones describe noise.
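The latency indicator follows the same good-over-valid shape. In this minimal sketch the function name is hypothetical, every request is treated as valid, and a request exactly at the threshold is counted as slow; each of those is an assumption a real definition would need to state.

```typescript
// Share of requests served under the threshold (default 300 ms, as in the example above).
// Returns a fraction between 0 and 1, or null when there are no requests.
function latencySli(durationsMs: number[], thresholdMs: number = 300): number | null {
  if (durationsMs.length === 0) return null;
  // Strictly "under" the threshold: a 300 ms request is not good.
  const fast = durationsMs.filter((d) => d < thresholdMs).length;
  return fast / durationsMs.length;
}
```

A service that answers every request successfully but takes 900 ms each time would score perfectly on availability and 0 here, which is the gap the second indicator exists to close.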
In the Unified Product Graph, a service level indicator sits in the operations region as the empirical floor of a reliability target. Its defining relationship is upward: an objective is measured by an indicator (Service Level Objective → measured by → Service Level Indicator, hierarchy). That single edge carries weight, because it forces every objective to name the concrete quantity it rests on. An objective with no indicator is an aspiration; an indicator with no objective is an ungoverned number on a chart. The graph insists the two travel together, which is the discipline that makes a reliability programme more than a wall of dials.
Edge type: service_level_objective_measured_by_service_level_indicator
Type-specific fields on BaseNode
metric_name (string): Indicator metric name. Example: "Request latency p99", "Error rate", "Availability"
threshold (number): Threshold that defines a "good" event. Example: 200 (ms latency), 0.01 (1% error rate), 99.9 (% availability)
current_value (number): Current observed value. Compared against `threshold` for SLO compliance. Example: 150 (ms), 0.003 (0.3% error rate)
unit (string): Unit of measurement. Required to interpret `threshold` and `current_value`. Example: "ms", "%", "req/s", "errors/min"
aggregation (string): Aggregation over the evaluation window. p99 and avg tell different stories. Example: "p99" for tail latency, "avg" for mean throughput, "count" for total events
measurement_query (string): Query expression that produces `current_value`. Free-form to fit PromQL, Datadog query strings, SQL, or vendor-specific DSLs. Example: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
baseline_value (number): Historical baseline. Pairs with `current_value` to indicate drift. Example: 99.85
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
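The fields listed above could be modelled roughly as follows. This is a sketch inferred from the field list only: the `NodeType` alias, the `ServiceLevelIndicatorNode` name, the optionality of the type-specific fields, and the `driftFromBaseline` helper are all assumptions, not the graph's published type definitions.

```typescript
// Assumption: the real NodeType is probably a union of literals; a string alias stands in here.
type NodeType = string;

// Fields marked "required" in the list are non-optional; the rest are optional.
interface BaseNode {
  id: string;
  type: NodeType;
  title: string;
  description?: string;
  status?: string;
  tags?: string[];
}

// Hypothetical name for the indicator node; type-specific fields are assumed optional.
interface ServiceLevelIndicatorNode extends BaseNode {
  metric_name?: string;
  threshold?: number;
  current_value?: number;
  unit?: string;
  aggregation?: string;
  measurement_query?: string;
  baseline_value?: number;
}

// Hypothetical helper: drift as current minus baseline, in the node's own unit.
// Returns null when either value is missing.
function driftFromBaseline(node: ServiceLevelIndicatorNode): number | null {
  if (node.current_value === undefined || node.baseline_value === undefined) {
    return null;
  }
  return node.current_value - node.baseline_value;
}
```

For an availability indicator with `current_value` 99.7 and `baseline_value` 99.85, the helper returns roughly -0.15 (subject to floating-point rounding), read in whatever `unit` the node declares.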
2 edge types connected to this entity.
service_level_objective_measured_by_service_level_indicator
monitor_measures_service_level_indicator