What is a Service Level Indicator?

A quantitative measure of one aspect of service health, such as request latency, success rate, or availability.

What is the purpose of a Service Level Indicator?

A service level indicator is the actual measurement of how a service is performing on one dimension, defined precisely enough to be computed from real telemetry. Examples include the proportion of requests served under a latency threshold or the fraction of successful responses. It is the raw signal that an objective sets a target on and an agreement may promise, anchoring reliability work in measured reality rather than opinion.

How do you use a Service Level Indicator in product management?

Choose SLIs that reflect what users actually experience (request latency, availability, error rate), not internal proxies like CPU load. Express each as a ratio of good events to valid events, so it reads as a percentage. Measure from as close to the user as practical; an SLI sampled server-side can look healthy while users suffer.

What are common mistakes with a Service Level Indicator?

Picking SLIs because they are easy to collect rather than because they reflect user experience (CPU utilisation, queue depth) produces dashboards that look reassuring while customers are timing out. Measuring far from the user (deep server-side rather than at the edge) hides exactly the latency and failures users feel. And tracking dozens of SLIs dilutes attention; a service is best characterised by a small handful that genuinely capture its health.

🌡️

Service Level Indicator

Q: What is an example of a Service Level Indicator?

Request-latency SLI: The proportion of homepage requests served in under 200ms, measured at the load balancer over a rolling 28-day window, currently 99.2%.

A measured indicator of service health, e.g. latency.

DevOps & Platform
Operations & Quality
type: 'service_level_indicator'
interface: BaseNode


Description

A service level indicator is the measured quantity that tells you how a service is actually doing: the proportion of requests served successfully, the share of responses under 100 milliseconds, the fraction of jobs that completed. It is the number, defined precisely enough to be computed from real telemetry. The target, the budget, and the contract are all built on top of the indicator.


Origin & evolution

The term was given its working definition by Google's Site Reliability Engineering practice, set out in the SRE book (2016). The book defines a service level indicator as "a carefully defined quantitative measure of some aspect of the level of service that is provided." It sits in a trio: the indicator is what you measure, the objective is the target you hold it to, and the agreement is the contract with consequences.

The SRE Workbook (2018) sharpened the recommended form into a ratio: good events divided by valid events, multiplied by a hundred for a percentage. The phrasing repays attention. "Valid events" is not "all events." A well-built availability indicator excludes events that say nothing about service health, so client errors in the 4xx range and redirect responses are often kept out of the denominator. Folding them in punishes the service for a user mistyping a URL, which corrupts the signal.

The standing debate is which indicator to pick. The SRE guidance is to choose few, and to choose those that track the user's actual experience. A service can show a perfectly healthy CPU graph while every user request times out. The indicator has to measure something a user would notice, and ideally something they would complain about when it slips.

How it works in practice

An API team picks availability as its first indicator. The naive version, "successful responses over all responses," counts a flood of 401s from an integration with an expired token as service failures, dragging the number down for a problem the service did not cause. They redefine it: good events are responses with a 2xx or 3xx status, valid events are all responses except 4xx client errors, and the indicator is the ratio of good to valid over a rolling 28-day window.
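The redefinition above can be sketched in a few lines of Python. This is a minimal illustration rather than the team's actual pipeline: the function name `availability_sli` and the sample status codes are assumptions made for the example, and the rolling 28-day windowing is left out.

```python
from typing import Iterable, Optional


def availability_sli(status_codes: Iterable[int]) -> Optional[float]:
    """Return good/valid as a percentage, or None when there are no valid events."""
    good = 0
    valid = 0
    for code in status_codes:
        if 400 <= code < 500:
            # Client errors say nothing about service health: not a valid event.
            continue
        valid += 1
        if 200 <= code < 400:
            # 2xx and 3xx responses count as good events.
            good += 1
    if valid == 0:
        return None
    return 100.0 * good / valid


# Hypothetical sample: 6 successes, 1 redirect, 2 expired-token 401s, 1 server error.
sample = [200] * 6 + [302] + [401, 401] + [500]

# Naive version: 2xx/3xx over all responses, so the 401s count against the service.
naive = 100.0 * sum(1 for c in sample if 200 <= c < 400) / len(sample)

print(f"naive: {naive:.1f}%  redefined: {availability_sli(sample):.1f}%")
# naive: 70.0%  redefined: 87.5%
```

With the 401s removed from the denominator, the only failure left in the number is the 500, which is the one the service owns.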

Now the number means something. When a deploy starts returning 500s, the indicator drops within minutes, and because the denominator is clean, the drop reflects a real regression the team owns. They add a second indicator for latency, the share of requests served under 300 milliseconds, because availability alone would call a slow-but-answering service healthy. Two well-chosen indicators describe the experience; ten poorly chosen ones describe noise.
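The latency indicator takes the same good-over-valid shape. The sketch below assumes every request in the list is a valid event and that "under 300 milliseconds" means strictly less than the threshold; the function name `latency_sli` and the sample durations are hypothetical.

```python
from typing import Optional, Sequence


def latency_sli(durations_ms: Sequence[float], threshold_ms: float = 300.0) -> Optional[float]:
    """Percentage of requests served strictly under threshold_ms."""
    if not durations_ms:
        return None
    good = sum(1 for d in durations_ms if d < threshold_ms)
    return 100.0 * good / len(durations_ms)


# Hypothetical request durations in milliseconds.
durations = [120.0, 180.0, 250.0, 299.0, 300.0, 450.0, 90.0, 1200.0]

print(latency_sli(durations))  # 5 of the 8 requests are under 300 ms: 62.5
```

Every request here was answered, so an availability indicator alone would report a healthy service even though three of the eight requests were slow.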

Service level indicator vs. its neighbours

Service level objective: the objective is the target the indicator must meet, such as 99.9% over a quarter. The indicator is the live measurement; the objective is the line drawn across it. You cannot set an objective until you have an indicator to set it on.
Error budget: the budget is the allowed quantity of failure, the gap between the objective and 100%. A 99.9% objective leaves a budget of 0.1%. It is computed from the indicator and the objective together. The indicator measures reality; the budget measures how much reality may diverge from the target before action is forced.
Metric: a metric is any raw measurement a system emits, such as CPU load, queue depth, or request count. An indicator is a metric (or ratio of metrics) deliberately chosen because it reflects the user's experience. Most metrics are not indicators, and treating them as such is how dashboards turn green while users suffer.
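The relationship between indicator, objective, and error budget can be sketched as simple arithmetic. This is an illustrative sketch under stated assumptions: the budget is expressed as a count of allowed bad events within the window, and the function name `error_budget_remaining` and the sample counts are hypothetical.

```python
def error_budget_remaining(good: int, valid: int, objective_pct: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if valid <= 0:
        raise ValueError("no valid events in the window")
    if not 0.0 < objective_pct < 100.0:
        raise ValueError("objective must be strictly between 0 and 100")
    # The budget: how many valid events are allowed to be bad under the objective.
    allowed_bad = valid * (100.0 - objective_pct) / 100.0
    # What the indicator actually observed.
    actual_bad = valid - good
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 valid events, 400 of them bad, 99.9% objective.
remaining = error_budget_remaining(good=999_600, valid=1_000_000, objective_pct=99.9)
print(f"{remaining:.0%} of the budget remains")
```

Here the objective allows about 1,000 bad events and 400 have occurred, so roughly 60% of the budget is left.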

In the graph

In the Unified Product Graph, a service level indicator sits in the operations region as the empirical floor of a reliability target. Its defining relationship is upward: an objective is measured by an indicator (service_level_objective_measured_by_service_level_indicator). That single edge carries weight, because it forces every objective to name the concrete quantity it rests on. An objective with no indicator is an aspiration; an indicator with no objective is an ungoverned number on a chart. The graph insists the two travel together, which is the discipline that makes a reliability programme more than a wall of dials.

Preview

Trellis API p99 Latency (Service Level Indicator)

Aggregation: p99

Threshold: 800

Current value: 612

Unit: milliseconds

Measurement query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service='trellis-api'}[5m]))*1000

Baseline value: 540

Properties

Type-specific fields on BaseNode

threshold (number)

Threshold that defines a "good" event. Examples: 200 (ms latency), 0.01 (1% error rate), 99.9 (% availability)

current_value (number)

Current observed value. Compared against `threshold` for SLO compliance. Examples: 150 (ms), 0.003 (0.3% error rate)

unit (string)

Unit of measurement. Required to interpret `threshold` and `current_value`. Examples: "ms", "%", "req/s", "errors/min"

aggregation (enum)

Aggregation over the evaluation window. p99 and avg tell different stories. Examples: "p99" for tail latency, "avg" for mean throughput, "count" for total events

Allowed values: avg, sum, max, min, p50, p95, p99, count

measurement_query (string)

Query expression that produces `current_value`. Free-form to fit PromQL, Datadog query strings, SQL, or vendor-specific DSLs. Example: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

baseline_value (number)

Historical baseline. Pairs with `current_value` to indicate drift. Example: 99.85
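The type-specific fields can be pictured as a small record. The Python dataclass below is only a sketch of that shape, not the graph's actual schema code: the class name, the `drift` helper, and the choice to make every type-specific field optional are assumptions, since the source marks only some BaseNode fields as required. The values are taken from the Trellis API preset shown earlier.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# The aggregation enum values listed in the properties.
Aggregation = Literal["avg", "sum", "max", "min", "p50", "p95", "p99", "count"]


@dataclass
class ServiceLevelIndicator:
    title: str
    threshold: Optional[float] = None
    current_value: Optional[float] = None
    unit: Optional[str] = None
    aggregation: Optional[Aggregation] = None
    measurement_query: Optional[str] = None
    baseline_value: Optional[float] = None

    def drift(self) -> Optional[float]:
        """current_value minus baseline_value, or None if either is missing."""
        if self.current_value is None or self.baseline_value is None:
            return None
        return self.current_value - self.baseline_value


sli = ServiceLevelIndicator(
    title="Trellis API p99 Latency",
    threshold=800,
    current_value=612,
    unit="milliseconds",
    aggregation="p99",
    measurement_query=(
        "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket"
        "{service='trellis-api'}[5m]))*1000"
    ),
    baseline_value=540,
)

print(sli.drift())  # 612 - 540 = 72 (milliseconds above baseline)
```

Whether a value above or below `threshold` counts as compliant depends on the indicator (lower is better for latency, higher for availability), so the sketch deliberately leaves that comparison out.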

Inherited from BaseNode (6 fields)

id (string, required)

Unique identifier (UUID)

type (NodeType, required)

Discriminator for the entity type

title (string, required)

Display name

description (string)

Optional detailed description

status (string)

Lifecycle status

tags (string[])

Freeform tags for filtering

Relationships

3 edge types connected to this entity.

Parents

Entities that can contain this type

Service Level Objective: service_level_objective_measured_by_service_level_indicator

Cross-References

Contextual links across the graph

Monitor: monitor_measures_service_level_indicator

Metric: service_level_indicator_measures_metric

Graph Position

🌡️ Service Level Indicator: 1 parent, 2 cross-references

Definition

A service level indicator is a quantitative measure of one aspect of service health: request latency, success rate, or availability. It is the raw signal that an SLO targets and an SLA may promise.

Usage Guidance

Choose SLIs that reflect what users actually experience (request latency, availability, error rate), not internal proxies like CPU load.
Express each as a ratio of good events to valid events, so it reads as a percentage.
Measure from as close to the user as practical; an SLI sampled server-side can look healthy while users suffer.

Anti-Patterns

Picking SLIs because they are easy to collect rather than because they reflect user experience (CPU utilisation, queue depth) produces dashboards that look reassuring while customers are timing out.
Measuring far from the user (deep server-side rather than at the edge) hides exactly the latency and failures users feel.
Tracking dozens of SLIs dilutes attention; a service is best characterised by a small handful that genuinely capture its health.

Examples

Request-latency SLI

The proportion of homepage requests served in under 200ms, measured at the load balancer over a rolling 28-day window, currently 99.2%.

Availability SLI

The ratio of successful HTTP responses (non-5xx) to all valid requests for the checkout API, measured per minute and aggregated monthly.
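One way to read "measured per minute and aggregated monthly" is to keep per-minute counts of good and valid events and sum them before dividing. The sketch below assumes that reading; the function name `aggregate_sli` and the sample minutes are hypothetical. It also shows why averaging the per-minute percentages instead gives a different answer.

```python
from typing import Iterable, Optional, Tuple


def aggregate_sli(per_minute: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Combine per-minute (good, valid) counts into one percentage."""
    total_good = 0
    total_valid = 0
    for good, valid in per_minute:
        total_good += good
        total_valid += valid
    if total_valid == 0:
        return None
    return 100.0 * total_good / total_valid


# Hypothetical minutes: a busy healthy minute and a quiet bad one.
minutes = [(990, 1000), (5, 10)]

weighted = aggregate_sli(minutes)
mean_of_ratios = sum(100.0 * g / v for g, v in minutes) / len(minutes)

print(f"summed counts: {weighted:.2f}%  mean of ratios: {mean_of_ratios:.2f}%")
```

Summing counts weights each minute by its traffic, so ten requests in a quiet minute cannot outweigh a thousand in a busy one; the mean of ratios treats both minutes as equal.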
