The standing apparatus that watches a system on a loop and asks one question: is this still healthy?

How do you use a Monitor in product management?

Design monitors at three levels: infrastructure (CPU, memory, disk), service (error rate, latency, throughput), and business (active users, graph saves/hour). Avoid alert fatigue by making sure every monitor that fires requires action. Review and tune monitors quarterly.

Where does the concept of a Monitor come from?

Application monitoring evolved from basic server ping checks to comprehensive observability stacks. Peter Bourgon articulated the three pillars of observability (metrics, logs, and traces) in 2017, and platforms such as Datadog, Honeycomb, and Grafana built products around that telemetry. The adage "You can't manage what you can't measure" applies doubly to distributed systems.

What are common mistakes with a Monitor?

Monitors that alert on raw metrics with no actionable threshold either fire constantly until the team mutes them or stay silent through real incidents. Alert fatigue from too many low-signal monitors trains responders to ignore pages, so the one that matters gets dismissed too. Monitoring symptoms the user never feels while missing the ones they do means dashboards stay green during an outage. A monitor with no owner or runbook produces an alert that wakes someone with no idea what to do about it.

Monitor

Q: What is the purpose of a Monitor?

A monitor exists to detect when a system deviates from its expected state. It continuously collects signals, compares them to expectations, and hands the verdict to whatever acts on it, triggering an alert when a measurement moves outside its expected range.

Q: What is an example of a Monitor?

Graph API p95 latency monitor
Metric: graph-api.request.duration.p95
Threshold: > 500ms for 5 consecutive minutes
Method: Datadog timeseries
Data source: CloudWatch → Datadog integration
Alert: PagerDuty → on-call engineer
Runbook: linked
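The "5 consecutive minutes" condition in the example can be sketched as a small evaluation function. This is an illustrative sketch only, not how Datadog evaluates a monitor: the `MinuteSample` shape and the `breachesThreshold` name are hypothetical, and it assumes one ordered sample per minute with no gaps.

```typescript
// Hypothetical shape: one p95 latency reading per minute, oldest first.
type MinuteSample = { minute: number; p95Ms: number };

// True only when each of the last `requiredMinutes` samples is above the threshold.
function breachesThreshold(
  samples: MinuteSample[],
  thresholdMs: number,
  requiredMinutes: number,
): boolean {
  if (samples.length < requiredMinutes) return false; // not enough data yet
  const recent = samples.slice(-requiredMinutes);
  return recent.every((s) => s.p95Ms > thresholdMs);
}

// Made-up readings: one healthy minute followed by five slow ones.
const slowSamples: MinuteSample[] = [
  { minute: 1, p95Ms: 320 },
  { minute: 2, p95Ms: 540 },
  { minute: 3, p95Ms: 610 },
  { minute: 4, p95Ms: 580 },
  { minute: 5, p95Ms: 720 },
  { minute: 6, p95Ms: 650 },
];

// A single recovery inside the window resets the condition.
const flappingSamples: MinuteSample[] = slowSamples.map((s) =>
  s.minute === 4 ? { ...s, p95Ms: 410 } : s,
);

console.log(breachesThreshold(slowSamples, 500, 5)); // true
console.log(breachesThreshold(flappingSamples, 500, 5)); // false
```

Requiring every sample in the window to breach is what keeps a single slow minute from paging anyone.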

A monitoring check observing system health

DevOps & Platform · Operations & Quality · type: 'monitor' · interface: BaseNode

Description

A monitor is the standing apparatus that watches a system and asks one question on a loop: is this still healthy? It collects signals, compares them to expectations, and hands the verdict to whatever acts on it.

Origin & evolution

The deeper root is control theory. Rudolf Kálmán defined observability in 1960 as the degree to which a system's internal state can be reconstructed from its external outputs. Decades later that word migrated into software, and the distinction between monitoring and observability hardened into a real argument. Monitoring asks known questions of known signals; observability is the property that lets you ask questions you did not anticipate when you instrumented the thing.

Google's Site Reliability Engineering book (2016) gave the field its working vocabulary. It separated black-box monitoring, which probes a system from outside as a user would and reports active symptoms, from white-box monitoring, which inspects internal state through logs and instrumented endpoints. The same chapter named the four golden signals: latency, traffic, errors, and saturation. The guidance was blunt. If you can measure only four things about a user-facing system, measure those.

In parallel, Peter Bourgon's 2017 post crystallised the three pillars of observability, metrics, logs, and traces, as the telemetry types a monitor consumes. The framing stuck, partly because it described daily reality and partly because vendors already sold a product for each pillar. The current debate pushes past three discrete silos towards unified, high-cardinality event data, but the monitor's job under either model is the same: turn raw telemetry into a health judgement.

How it works in practice

A payments team runs a checkout service handling 4,000 requests per second. They wire a white-box monitor to the golden signals: p99 latency, request rate, the ratio of HTTP 5xx responses, and database connection-pool saturation. The latency monitor evaluates a rolling one-minute window and compares it to a threshold derived from the service level objective.

At 02:14, p99 latency climbs from 180ms to 1.9s while the error rate holds flat. No instance has crashed; nothing is technically down. The monitor catches the degradation precisely because it watches a symptom users feel, not just process liveness. A separate black-box monitor, probing the public endpoint every 30 seconds, confirms real requests are slow rather than a metrics artefact. Together they distinguish a genuine incident from a noisy gauge.
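The interplay of the two monitors in this scenario can be sketched as follows. This is a minimal illustration under stated assumptions: the 500ms threshold is a made-up stand-in for the SLO-derived value (the text does not give one), the nearest-rank percentile is one of several common definitions, and `percentile`, `judge`, and `Verdict` are hypothetical names.

```typescript
// Nearest-rank percentile over one window of latency observations (ms).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("empty window");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

type Verdict = "healthy" | "suspect" | "incident";

// White-box: p99 of the rolling window. Black-box: latency of the latest external probe.
function judge(windowMs: number[], thresholdMs: number, probeMs: number): Verdict {
  const whiteBoxSlow = percentile(windowMs, 99) > thresholdMs;
  const blackBoxSlow = probeMs > thresholdMs;
  if (whiteBoxSlow && blackBoxSlow) return "incident"; // both views agree
  if (whiteBoxSlow || blackBoxSlow) return "suspect"; // possibly a noisy gauge
  return "healthy";
}

// Made-up one-minute window: most requests at 180ms, a slow tail at 1900ms.
const degradedWindow: number[] = [
  ...new Array<number>(97).fill(180),
  ...new Array<number>(3).fill(1900),
];

console.log(judge(degradedWindow, 500, 1800)); // "incident"
console.log(judge(degradedWindow, 500, 150)); // "suspect"
```

Neither check looks at process liveness, which is why the degradation is caught even though no instance has crashed.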

Monitor vs. its neighbours

Alert rule. A monitor produces a continuous signal and a health verdict. An alert_rule is the policy layered on top that decides when that verdict warrants waking a human. One monitor can feed several alert rules at different thresholds, which is why the graph models them as separate entities.
Service level indicator. An SLI is a precise, defined measurement of one aspect of service health, such as the fraction of requests served under 300ms. A monitor is the running mechanism that observes that measurement over time. The SLI defines what good means; the monitor reports where you currently stand against it.
Observability tooling. Observability is a system property: how answerable your system is. A monitor is a concrete instrument that exercises that property to answer one recurring question. A highly observable system with no monitors tells you nothing until you go looking.
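The monitor and alert rule split can be sketched as one measurement fanned out to several policies. The type and function names here (`HealthVerdict`, `AlertRule`, `channelsToNotify`) are hypothetical and are not part of the graph schema; the thresholds are illustrative.

```typescript
// What the monitor reports: a continuous measurement, with no opinion about paging.
type HealthVerdict = { monitor: string; value: number };

// What an alert rule adds: a bound and a destination.
type AlertRule = { name: string; above: number; notify: string };

// Returns the destinations of every rule whose bound the current value exceeds.
function channelsToNotify(verdict: HealthVerdict, rules: AlertRule[]): string[] {
  return rules.filter((r) => verdict.value > r.above).map((r) => r.notify);
}

// Two rules on the same monitor: a soft warning and a page.
const errorRateRules: AlertRule[] = [
  { name: "warn", above: 0.5, notify: "slack:#ops-alerts" },
  { name: "page", above: 1.0, notify: "pagerduty:on-call-graph" },
];

console.log(channelsToNotify({ monitor: "5xx rate (%)", value: 0.7 }, errorRateRules));
// ["slack:#ops-alerts"]
console.log(channelsToNotify({ monitor: "5xx rate (%)", value: 1.4 }, errorRateRules));
// ["slack:#ops-alerts", "pagerduty:on-call-graph"]
```

Because the rules live apart from the measurement, a threshold can be retuned without touching how the signal is collected.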

In the graph

In the Unified Product Graph a monitor lives in the operations region, bound to the things it observes. A product connects through product_monitored_by_monitor, an infrastructure component through infrastructure_component_monitored_by_monitor, and a running service through monitor_watches_service. Detection flows outward through monitor_triggers_via_alert_rule, which keeps observation and notification as distinct, separately tunable steps. That separation matters: it lets you query which parts of a product have no monitor at all, the blind spots that surface only during an incident.
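The blind-spot query can be sketched over a plain edge list. This is an assumption-heavy illustration: the `Edge` shape, the product-to-monitor edge direction, and the identifiers are hypothetical, and a real graph store would expose its own query API.

```typescript
// Hypothetical edge record; the direction (product -> monitor) is an assumption.
type Edge = { type: string; from: string; to: string };

// Products with no outgoing product_monitored_by_monitor edge are blind spots.
function unmonitoredProducts(productIds: string[], edges: Edge[]): string[] {
  const monitored = new Set(
    edges
      .filter((e) => e.type === "product_monitored_by_monitor")
      .map((e) => e.from),
  );
  return productIds.filter((id) => !monitored.has(id));
}

// Made-up identifiers for illustration.
const graphEdges: Edge[] = [
  { type: "product_monitored_by_monitor", from: "product-a", to: "monitor-1" },
  { type: "monitor_triggers_via_alert_rule", from: "monitor-1", to: "rule-1" },
];

console.log(unmonitoredProducts(["product-a", "product-b"], graphEdges));
// ["product-b"]
```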

Preview

Monitor: Trellis API 5xx Error Rate Monitor
Monitor type: error_rate
Monitor status: ok
Muted: false
Target: trellis-api service, all endpoints
Threshold: Alert when 5xx rate exceeds 1% over a 5-minute window
Alert channel: #incidents Slack channel and PagerDuty on-call rotation

Properties

Type-specific fields on BaseNode

monitor_type (enum)
Measurement kind. `uptime` = availability. `latency` = response time. `error_rate` = failure ratio. `throughput` = req/sec. `log` = log-based. `event` = event-driven. `synthetic` = scripted user-journey tests. `slo_burn` = tracks SLO error budget. Example: "synthetic" for a scripted checkout flow test.
Values: uptime, latency, error_rate, throughput, log, event, synthetic, slo_burn, custom

target (string)
Service, endpoint, or resource monitored. Examples: "graph-api /health", "PostgreSQL connection pool", "CDN edge latency".

threshold (string)
Alert condition, expressed as a condition rather than a bare number. Examples: "> 500ms p99", "< 99.9% uptime over 5 minutes", "> 1% error rate".

alert_channel (string)
Alert destination on threshold breach. Examples: "slack:#ops-alerts", "pagerduty:on-call-graph".

monitor_status (enum)
Operational state. `ok` = all clear. `warn` = approaching threshold. `alert` = threshold breached. `no_data` = nothing received (may indicate monitor or service failure). `muted` = silenced.
Values: ok, warn, alert, no_data, muted

muted (boolean)
Currently silenced. Typical during planned maintenance windows.
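How a measurement might map onto the `monitor_status` values can be sketched as below. The schema does not define this logic, so everything here is an assumption: the `deriveStatus` name, the "greater than" comparison, the 80% warn ratio, and the precedence of `muted` over the other states.

```typescript
type MonitorStatus = "ok" | "warn" | "alert" | "no_data" | "muted";

// Assumed precedence: muted first, then missing data, then the threshold checks.
function deriveStatus(
  value: number | null, // null stands for "nothing received"
  alertAbove: number,
  muted: boolean,
  warnRatio = 0.8, // hypothetical: warn once the value reaches 80% of the threshold
): MonitorStatus {
  if (muted) return "muted";
  if (value === null) return "no_data";
  if (value > alertAbove) return "alert";
  if (value >= alertAbove * warnRatio) return "warn";
  return "ok";
}

// 5xx rate in percent against a 1% threshold.
console.log(deriveStatus(0.2, 1, false)); // "ok"
console.log(deriveStatus(0.9, 1, false)); // "warn"
console.log(deriveStatus(1.5, 1, false)); // "alert"
console.log(deriveStatus(null, 1, false)); // "no_data"
console.log(deriveStatus(1.5, 1, true)); // "muted"
```

Treating `no_data` as its own state rather than as `ok` matters, since silence may mean the monitor or the service has failed.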

Inherited from BaseNode (6 fields)

id (string, required)
Unique identifier (UUID)

type (NodeType, required)
Discriminator for the entity type

title (string, required)
Display name

description (string)
Optional detailed description

status (string)
Lifecycle status

tags (string[])
Freeform tags for filtering
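Taken together, the fields listed above can be sketched as TypeScript types. This is a reconstruction from the listing, not the actual source: the `MonitorNode` name is hypothetical, `NodeType` is reduced to the single member shown on this page, fields not marked required are assumed optional, and the `id` is a placeholder UUID.

```typescript
// Assumption: the real NodeType union has many more members than "monitor".
type NodeType = "monitor";

interface BaseNode {
  id: string; // UUID
  type: NodeType; // discriminator for the entity type
  title: string;
  description?: string;
  status?: string; // lifecycle status
  tags?: string[];
}

type MonitorType =
  | "uptime" | "latency" | "error_rate" | "throughput" | "log"
  | "event" | "synthetic" | "slo_burn" | "custom";

type MonitorStatus = "ok" | "warn" | "alert" | "no_data" | "muted";

// Hypothetical name for the monitor entity; fields mirror the listing above.
interface MonitorNode extends BaseNode {
  type: "monitor";
  monitor_type?: MonitorType;
  target?: string;
  threshold?: string;
  alert_channel?: string;
  monitor_status?: MonitorStatus;
  muted?: boolean;
}

// The preview example expressed as a node; the id is a placeholder.
const trellisErrorRate: MonitorNode = {
  id: "00000000-0000-4000-8000-000000000000",
  type: "monitor",
  title: "Trellis API 5xx Error Rate Monitor",
  monitor_type: "error_rate",
  target: "trellis-api service, all endpoints",
  threshold: "Alert when 5xx rate exceeds 1% over a 5-minute window",
  alert_channel: "#incidents Slack channel and PagerDuty on-call rotation",
  monitor_status: "ok",
  muted: false,
};

console.log(trellisErrorRate.title);
```

Modelling the enums as string-literal unions lets the compiler reject a misspelled `monitor_type` before it reaches the graph.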

Lifecycle

5 phases, initial: planning · template: OPERATIONAL

Relationships

6 edge types connected to this entity.

Parents

Entities that can contain this type

Product: product_monitored_by_monitor
Infrastructure Component: infrastructure_component_monitored_by_monitor

Children

Entities this type can contain

Alert Rule: monitor_triggers_via_alert_rule

Cross-References

Contextual links across the graph

Service: monitor_watches_service
Service Level Indicator: monitor_measures_service_level_indicator
Symptom: monitor_detects_symptom

Graph Position

2 parents · Monitor · 1 child · 3 cross-references

Definition

A monitor is a standing check that continuously observes system health by measuring a service level indicator against an expected range. It connects an SLI to the alert rule it triggers when the measurement goes out of bounds.

