An automated pipeline for data transformation
A data pipeline is the automated path that moves data from where it is produced to where it is used, applying the steps that make it fit for that use along the way. It is the plumbing behind every metric, dashboard, and model: invisible when it works, the prime suspect when a number looks wrong. The interesting decisions are about where the cleaning happens and how fresh the data needs to be.
The classical shape is ETL: extract, transform, load. It dates to the data-warehousing practice of the 1980s and 1990s, when warehouse compute was scarce and expensive, so data was cleaned and reshaped on a separate staging server before being loaded into the warehouse in a nightly batch. Transformation came first because the warehouse could not afford to do it.
Cloud warehouses inverted that constraint. When Redshift, BigQuery, and Snowflake made warehouse compute cheap and elastic, it became practical to load raw data first and transform it inside the warehouse using its own SQL engine. ETL became ELT. As Fivetran describes the shift, the reordering is small on paper and large in consequence: raw data lands intact, and transformation becomes versioned, testable SQL in place of brittle pre-load scripts. This reordering anchored what the industry came to call the modern data stack, a set of specialised tools where managed connectors such as Fivetran or Airbyte handle extract-and-load and dbt handles transformation as code. (The category keeps consolidating; Fivetran and dbt Labs announced a merger in 2025.)
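The load-first ordering can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the cloud warehouse, and the table names, columns, and sample rows are hypothetical; a real ELT setup would use a managed connector and a tool such as dbt in place of these two functions.

```python
import sqlite3

# Hypothetical raw rows as a connector might land them, duplicates included.
RAW_PAYMENTS = [
    ("p1", "acct-1", "1200", "2025-01-01"),
    ("p1", "acct-1", "1200", "2025-01-01"),  # same payment delivered twice
    ("p2", "acct-2", "800", "2025-01-01"),
]


def load_raw(conn, rows):
    """Load step: land the rows untouched, as text, in a raw table."""
    conn.execute(
        "CREATE TABLE raw_payments "
        "(payment_id TEXT, account_id TEXT, amount_cents TEXT, paid_on TEXT)"
    )
    conn.executemany("INSERT INTO raw_payments VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """Transform step: runs after loading, using the database's own SQL engine."""
    conn.execute(
        """
        CREATE TABLE payments AS
        SELECT DISTINCT payment_id,
               account_id,
               CAST(amount_cents AS INTEGER) AS amount_cents,
               paid_on
        FROM raw_payments
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse (assumption)
load_raw(conn, RAW_PAYMENTS)
transform(conn)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
clean_total = conn.execute("SELECT SUM(amount_cents) FROM payments").fetchone()[0]
```

The raw table still holds all three rows after the transform, which is the property the ELT ordering buys: the cleaned table can be rebuilt or audited against the untouched load at any time.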
A second axis runs alongside the ETL/ELT one: batch versus streaming. Batch pipelines move data on a schedule, which suits most analytics. Streaming pipelines built on systems such as Apache Kafka process events continuously as they arrive, transforming in flight, which suits fraud detection and live operational metrics where minutes-old data is already stale. Most organisations run both, choosing per use case and treating neither as a default.
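The difference between the two modes can be sketched in a few lines. This is an illustration only: it uses plain Python iterables in place of a scheduler or a system such as Apache Kafka, and the event shape and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (account_id, amount_cents), a hypothetical event shape


def batch_totals(events: Iterable[Event]) -> Dict[str, int]:
    """Batch: consume the whole window, then publish one result."""
    totals: Dict[str, int] = defaultdict(int)
    for account_id, amount in events:
        totals[account_id] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Streaming: update state per event and emit the fresh total at once."""
    totals: Dict[str, int] = defaultdict(int)
    for account_id, amount in events:
        totals[account_id] += amount
        yield account_id, totals[account_id]


events = [("a", 100), ("b", 50), ("a", 25)]
batch_result = batch_totals(events)
stream_result = list(streaming_totals(events))
```

Both functions arrive at the same final totals. The batch version exposes nothing until the window closes, while the streaming version exposes an updated total after every event, which is what a fraud check or live operational metric needs.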
A subscription business needs a daily revenue dashboard. The pipeline is ELT and batch. At 02:00 a managed connector extracts rows from the billing database, the payments processor's API, and the product-event log, and loads them raw into a cloud warehouse. A dbt job then runs: it deduplicates events, reconciles currencies, joins payments to accounts, and builds a clean daily_revenue table, with tests asserting that no row has a negative amount and that yesterday's total reconciles to the processor's settlement report. The dashboard reads only that final table.
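The transformation and its two tests might look roughly like the following. In practice this logic would be dbt SQL models and dbt tests; the sketch uses plain Python so it runs on its own, omits the join to accounts for brevity, and every name, row, and exchange rate in it is a made-up assumption.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical raw payment rows: (payment_id, account_id, amount, currency, day)
RAW = [
    ("p1", "acct-1", Decimal("100.00"), "USD", "2025-01-01"),
    ("p1", "acct-1", Decimal("100.00"), "USD", "2025-01-01"),  # duplicate event
    ("p2", "acct-2", Decimal("50.00"), "EUR", "2025-01-01"),
]
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}  # made-up rates


def build_daily_revenue(raw, rates):
    """Deduplicate by payment_id, convert to USD, and total per day."""
    seen = set()
    totals = defaultdict(Decimal)  # Decimal() starts each day at 0
    for payment_id, _account_id, amount, currency, day in raw:
        if payment_id in seen:
            continue  # skip repeated deliveries of the same payment
        seen.add(payment_id)
        totals[day] += amount * rates[currency]
    return dict(totals)


def run_tests(raw, daily, day, settlement_total):
    """The two assertions described above, expressed as plain checks."""
    assert all(row[2] >= 0 for row in raw), "found a negative amount"
    assert daily[day] == settlement_total, "total does not reconcile"


daily_revenue = build_daily_revenue(RAW, RATES_TO_USD)
run_tests(RAW, daily_revenue, "2025-01-01", Decimal("155.00"))
```

A currency missing from the rates table raises a `KeyError` here and does not silently produce a wrong total, which is one simple way a conversion step can fail loudly when a new market appears.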
The discipline pays off the morning revenue appears to drop 12%. Because the raw load is preserved and the transformations are tested code, an engineer traces the dip in twenty minutes to a currency-conversion step that mishandled a new market, with no need to re-derive the number by hand. The pipeline is debuggable because each stage is explicit.
In the Unified Product Graph, a data pipeline sits in the data and analytics domain and is defined by its flow. It connects to its inputs through data_pipeline_reads_from_data_source (Data Pipeline reads from Data Source, cross-domain) and to its outputs through data_pipeline_writes_to_data_source (Data Pipeline writes to Data Source, cross-domain), while the inverse data_source_processed_via_data_pipeline (Data Source processed via Data Pipeline, hierarchy) lets any source name the pipelines acting on it. Downstream, data_pipeline_feeds_data_product (Data Pipeline feeds Data Product, cross-domain) links the machinery to the curated dataset it produces. Modelling direction explicitly (reads-from versus writes-to) is what makes the graph traceable: when a metric looks wrong, the edges let you walk back from the dashboard to the data product to the exact pipeline and source, which is the question every data team is actually asking.
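The walk back from a data product to its sources can be sketched as a small traversal. The edge type names are the ones listed for this entity; the tuple representation, the node identifiers, and the `upstream_of` helper are hypothetical and not part of any published API.

```python
# Each edge is (from_node, edge_type, to_node). Node ids are invented examples.
EDGES = [
    ("pipeline:revenue_elt", "data_pipeline_reads_from_data_source", "source:billing_db"),
    ("pipeline:revenue_elt", "data_pipeline_reads_from_data_source", "source:payments_api"),
    ("pipeline:revenue_elt", "data_pipeline_writes_to_data_source", "source:warehouse"),
    ("pipeline:revenue_elt", "data_pipeline_feeds_data_product", "product:daily_revenue"),
]


def upstream_of(product, edges):
    """Return {pipeline: [sources it reads from]} for pipelines feeding `product`."""
    pipelines = [
        src
        for src, edge_type, dst in edges
        if edge_type == "data_pipeline_feeds_data_product" and dst == product
    ]
    return {
        pipeline: sorted(
            dst
            for src, edge_type, dst in edges
            if src == pipeline and edge_type == "data_pipeline_reads_from_data_source"
        )
        for pipeline in pipelines
    }


lineage = upstream_of("product:daily_revenue", EDGES)
```

Because reads-from and writes-to are separate edge types, the traversal returns only the inputs (the billing database and the payments API) and leaves out the warehouse the pipeline writes to.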
Type-specific fields on BaseNode
pipeline_status (string): Operational status of the pipeline
schedule (string): Cron or scheduling expression
avg_runtime (string): Average wall-clock runtime per execution
orchestrator (string): Orchestration tool (e.g. "Airflow", "Dagster", "dbt Cloud")
retry_count (number): Number of automatic retries on failure
retry_delay_seconds (number): Delay between retries in seconds
timeout_seconds (number): Maximum allowed runtime in seconds before timeout
trigger_rule (string): Rule that determines when this pipeline triggers
pool (string): Resource pool this pipeline runs in
id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
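As an illustration, the fields listed above could be modelled as a single record. This is a sketch under assumptions: the class name `DataPipelineNode`, the `"data_pipeline"` discriminator value, the use of `int` for the number fields, and all sample values are invented here, and only the field names and their required/optional status come from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataPipelineNode:
    # Required fields
    id: str      # unique identifier (UUID)
    type: str    # NodeType discriminator; the concrete value is an assumption
    title: str   # display name
    # Optional general fields
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Optional pipeline-specific fields
    pipeline_status: Optional[str] = None
    schedule: Optional[str] = None
    avg_runtime: Optional[str] = None
    orchestrator: Optional[str] = None
    retry_count: Optional[int] = None
    retry_delay_seconds: Optional[int] = None
    timeout_seconds: Optional[int] = None
    trigger_rule: Optional[str] = None
    pool: Optional[str] = None


# Hypothetical node for a nightly revenue pipeline scheduled at 02:00.
node = DataPipelineNode(
    id="00000000-0000-0000-0000-000000000001",
    type="data_pipeline",
    title="Daily revenue ELT",
    schedule="0 2 * * *",
    orchestrator="dbt Cloud",
    retry_count=2,
    retry_delay_seconds=300,
    timeout_seconds=3600,
)
```

Keeping the three required fields positional and everything else defaulted mirrors the list: a node is valid with only `id`, `type`, and `title`, and the operational fields are filled in as they become known.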
5 phases — initial: building
4 edge types connected to this entity.
data_source_processed_via_data_pipeline
data_pipeline_feeds_data_product
data_pipeline_reads_from_data_source
data_pipeline_writes_to_data_source