A monitoring check observing system health
A monitor is the standing apparatus that watches a system and asks one question on a loop: is this still healthy? It collects signals, compares them to expectations, and hands the verdict to whatever acts on it. The interesting tension is that a monitor can be technically correct and operationally useless at the same time, reporting that every component is green while users sit on a broken checkout page.
The deeper root is control theory. Rudolf Kálmán defined observability in 1960 as the degree to which a system's internal state can be reconstructed from its external outputs. Decades later that word migrated into software, and the distinction between monitoring and observability hardened into a real argument. Monitoring asks known questions of known signals; observability is the property that lets you ask questions you did not anticipate when you instrumented the thing.
Google's Site Reliability Engineering book (2016) gave the field its working vocabulary. It separated black-box monitoring, which probes a system from outside as a user would and reports active symptoms, from white-box monitoring, which inspects internal state through logs and instrumented endpoints. The same chapter named the four golden signals: latency, traffic, errors, and saturation. The guidance was blunt. If you can measure only four things about a user-facing system, measure those.
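As a rough illustration of the four golden signals, the sketch below summarises one window of request records. The `Request` record, the function name, and the use of a connection pool as the saturation measure are assumptions made for the example, not something the SRE book prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarise one window of requests as the four golden signals."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99, using integer arithmetic for ceil(0.99 * n).
    rank = (99 * len(durations) + 99) // 100
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        # Errors are counted here as HTTP 5xx responses only (an assumption).
        "error_ratio": sum(r.status >= 500 for r in requests) / len(requests),
        # Saturation is approximated by connection-pool usage (an assumption).
        "saturation": pool_in_use / pool_size,
    }
```

A real system would usually read these from a metrics store instead of raw request lists, but the arithmetic is the same.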
In parallel, Peter Bourgon's 2017 post crystallised the three pillars of observability (metrics, logs, and traces) as the telemetry types a monitor consumes. The framing stuck, partly because it described daily reality and partly because vendors already sold a product for each pillar. The current debate pushes past three discrete silos towards unified, high-cardinality event data, but the monitor's job under either model is the same: turn raw telemetry into a health judgement.
A payments team runs a checkout service handling 4,000 requests per second. They wire a white-box monitor to the golden signals: p99 latency, request rate, the ratio of HTTP 5xx responses, and database connection-pool saturation. The latency monitor evaluates a rolling one-minute window and compares it to a threshold derived from the service level objective.
At 02:14, p99 latency climbs from 180ms to 1.9s while the error rate holds flat. No instance has crashed; nothing is technically down. The monitor catches the degradation precisely because it watches a symptom users feel, not just process liveness. A separate black-box monitor, probing the public endpoint every 30 seconds, confirms real requests are slow rather than a metrics artefact. Together they distinguish a genuine incident from a noisy gauge.
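The scenario above can be sketched as a rolling-window latency check paired with a black-box confirmation step. This is a minimal illustration, not the team's implementation: the class and function names are hypothetical, the 500ms threshold is an assumed SLO-derived value, and samples are assumed to arrive in timestamp order.

```python
from collections import deque


class LatencyMonitor:
    """Rolling-window p99 latency check (hypothetical names throughout)."""

    def __init__(self, threshold_ms, window_s=60.0):
        self.threshold_ms = threshold_ms
        self.window_s = window_s
        self._samples = deque()  # (timestamp_s, duration_ms), oldest first

    def record(self, timestamp_s, duration_ms):
        # Assumes samples are recorded in non-decreasing timestamp order.
        self._samples.append((timestamp_s, duration_ms))

    def p99(self, now_s):
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] <= now_s - self.window_s:
            self._samples.popleft()
        if not self._samples:
            return None  # nothing received in the window
        durations = sorted(d for _, d in self._samples)
        rank = (99 * len(durations) + 99) // 100  # nearest-rank, integer ceil
        return durations[rank - 1]

    def breached(self, now_s):
        value = self.p99(now_s)
        return value is not None and value > self.threshold_ms


def is_incident(white_box_breached, probe_durations_ms, threshold_ms):
    """Treat a breach as genuine only if the outside probes are slow too."""
    probes_slow = bool(probe_durations_ms) and all(
        d > threshold_ms for d in probe_durations_ms
    )
    return white_box_breached and probes_slow
```

Requiring every recent probe to be slow is a deliberately strict choice for the example; a production rule might accept a majority, or weight the two sources differently.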
An alert_rule is the policy layered on top of a monitor: it decides when the monitor's verdict warrants waking a human. One monitor can feed several alert rules at different thresholds, which is why the graph models them as separate entities.

In the Unified Product Graph a monitor lives in the operations region, bound to the things it observes. A product connects through product_monitored_by_monitor, an infrastructure component through infrastructure_component_monitored_by_monitor, and a running service through monitor_watches_service. Detection flows outward through monitor_triggers_via_alert_rule, which keeps observation and notification as distinct, separately tunable steps. That separation matters: it lets you query which parts of a product have no monitor at all, the blind spots that surface only during an incident.
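A blind-spot query of the kind described here might look like the sketch below. The in-memory representation (a dict of node ids to type names and a list of `(edge_type, source, target)` tuples) is hypothetical, as are the node type names; the edge direction is assumed to follow the edge name, so `product_monitored_by_monitor` runs from product to monitor and `monitor_watches_service` from monitor to service.

```python
# Edge types whose source is the thing being monitored (assumed direction).
MONITORED_SOURCE_EDGES = {
    "product_monitored_by_monitor",
    "infrastructure_component_monitored_by_monitor",
}
# Node types that are expected to have a monitor (assumed type names).
WATCHABLE_TYPES = {"product", "infrastructure_component", "service"}


def unmonitored(nodes, edges):
    """Return sorted ids of watchable nodes with no monitor attached."""
    monitored = set()
    for edge_type, source, target in edges:
        if edge_type in MONITORED_SOURCE_EDGES:
            monitored.add(source)
        elif edge_type == "monitor_watches_service":
            monitored.add(target)  # here the monitor is the source
    return sorted(
        node_id
        for node_id, node_type in nodes.items()
        if node_type in WATCHABLE_TYPES and node_id not in monitored
    )
```

Because monitor_triggers_via_alert_rule edges are ignored, a node counts as covered once something observes it, whether or not anyone is notified.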
Type-specific fields on BaseNode
monitor_type (string): Measurement kind. `uptime` = availability. `latency` = response time. `error_rate` = failure ratio. `throughput` = req/sec. `log` = log-based. `event` = event-driven. `synthetic` = scripted user-journey tests. `slo_burn` = tracks SLO error budget. Example: "synthetic" for a scripted checkout flow test
target (string): Service, endpoint, or resource monitored. Example: "graph-api /health", "PostgreSQL connection pool", "CDN edge latency"
threshold (string): Alert condition, expressed as a condition rather than a bare number. Example: "> 500ms p99", "< 99.9% uptime over 5 minutes", "> 1% error rate"
alert_channel (string): Alert destination on threshold breach. Example: "slack:#ops-alerts", "pagerduty:on-call-graph"
monitor_status (string): Operational state. `ok` = all clear. `warn` = approaching threshold. `alert` = threshold breached. `no_data` = nothing received (may indicate monitor or service failure). `muted` = silenced.
muted (boolean): Currently silenced. Typical during planned maintenance windows.
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
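The fields above could be modelled roughly as follows. This is a sketch: the `Monitor` dataclass mirrors the listed field names, but the defaults are assumptions, and `derive_status` is a hypothetical helper. It takes separate numeric `warn_at` and `alert_at` boundaries for a "higher is worse" signal, whereas the schema stores a single threshold string and does not say where `warn` begins or which state takes precedence.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    # Required base fields
    id: str
    type: str
    title: str
    # Optional base fields
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Monitor-specific fields
    monitor_type: Optional[str] = None
    target: Optional[str] = None
    threshold: Optional[str] = None
    alert_channel: Optional[str] = None
    monitor_status: Optional[str] = None
    muted: bool = False


def derive_status(value, warn_at, alert_at, muted=False):
    """Map a reading to a monitor_status for a 'higher is worse' signal."""
    if muted:
        return "muted"    # assumed to take precedence over every other state
    if value is None:
        return "no_data"  # nothing received from the target
    if value > alert_at:
        return "alert"
    if value > warn_at:
        return "warn"
    return "ok"
```

For example, a latency monitor with `warn_at=400` and `alert_at=500` would report `warn` at 450ms and `alert` at 1,900ms.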
5 phases — initial: planning · template: OPERATIONAL
6 edge types connected to this entity.
product_monitored_by_monitor
infrastructure_component_monitored_by_monitor
monitor_triggers_via_alert_rule
monitor_watches_service
monitor_measures_service_level_indicator
monitor_detects_symptom