What is the purpose of a Data Pipeline?

A data pipeline sits behind every metric, dashboard, and model, reliably extracting data, transforming it, and loading it where analysts and systems can consume it. Its main design choices are where the cleaning happens and how fresh the data needs to be.

How do you use a Data Pipeline in product management?

Design pipelines for idempotency (safe to re-run), observability (logs and alerts on failure), and incremental loading (process only new data where possible). Test pipelines in isolation from downstream consumers. Document the expected volume and latency of each pipeline. Monitor for drift as sources change, since an upstream change can alter a pipeline's output without causing a visible failure.
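The idempotency and incremental-loading guidance can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not a reference implementation; the events table, its columns, and the load_incremental function are hypothetical names chosen for the example.

```python
import sqlite3


def load_incremental(conn, rows):
    """Load rows at or after the stored watermark; safe to re-run on the same batch.

    rows are (event_id, occurred_at, payload) tuples with ISO 8601 timestamps,
    so plain string comparison orders them correctly. All names are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_id TEXT PRIMARY KEY, occurred_at TEXT NOT NULL, payload TEXT)"
    )
    # Watermark: the newest timestamp already loaded (None on the first run).
    (watermark,) = conn.execute("SELECT MAX(occurred_at) FROM events").fetchone()
    # Incremental: skip rows strictly older than the watermark.
    new_rows = [r for r in rows if watermark is None or r[1] >= watermark]
    # Idempotent: the primary key plus OR REPLACE means a replayed row
    # overwrites itself instead of creating a duplicate.
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, occurred_at, payload) "
        "VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)
```

Running the function twice on the same batch leaves the table unchanged after the first run, which is the property a retry depends on. Rows that share the watermark timestamp are deliberately reprocessed so that late arrivals at that instant are not missed.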

Where does the concept of a Data Pipeline come from?

Data pipelines were formalised as an architectural concept with the rise of ETL (Extract, Transform, Load) tools in the 1990s. The modern ELT (Extract, Load, Transform) pattern, enabled by cheap cloud storage and compute, has since displaced ETL as the dominant approach for analytics workloads. Tools such as Fivetran and Airbyte (extract-and-load connectors) and Apache Airflow (orchestration) became standard infrastructure.

What are common mistakes with a Data Pipeline?

Pipelines built without idempotency turn a single retry into duplicated or corrupted data. Bundling extraction, transformation, and loading into one opaque job makes failures impossible to isolate and reruns all-or-nothing. Silent failure is the classic trap: a pipeline that drops rows or skips a partition without alerting erodes trust in every downstream number. Hardcoding schema assumptions means an upstream column rename breaks the pipeline in ways that surface only days later in a wrong report.
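Two of these mistakes, silent row loss and hardcoded schema assumptions, can be guarded against with an explicit check between extraction and load. The sketch below is one possible shape for such a check; EXPECTED_COLUMNS, PipelineCheckError, and check_batch are hypothetical names, and the assumption that the source reports its own row count may not hold for every source.

```python
# Hypothetical column contract for one source; not taken from any real system.
EXPECTED_COLUMNS = {"event_id", "occurred_at", "payload"}


class PipelineCheckError(Exception):
    """Raised so a bad batch fails loudly instead of loading silently."""


def check_batch(rows, source_row_count, expected_columns=EXPECTED_COLUMNS):
    """Validate a batch of dict rows before it is loaded.

    Raises PipelineCheckError if rows were dropped in transit or if any row's
    columns differ from the expected set (for example after an upstream rename).
    """
    if len(rows) != source_row_count:
        raise PipelineCheckError(
            f"row count mismatch: source reported {source_row_count}, got {len(rows)}"
        )
    for index, row in enumerate(rows):
        missing = expected_columns - set(row)
        unexpected = set(row) - expected_columns
        if missing or unexpected:
            raise PipelineCheckError(
                f"schema drift in row {index}: "
                f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
            )
```

Raising an exception is a deliberate choice here: it turns a quiet data problem into a failed run, which the pipeline's alerting can then report.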

🪣

Data Pipeline

Q: What is a Data Pipeline?

An automated path that moves data from where it is produced to where it is used, applying the steps that make it fit for that use.

Q: What is an example of a Data Pipeline?

PostHog → Snowflake pipeline:
Tool: Fivetran (managed).
Extraction: PostHog REST API, incremental by timestamp.
Schedule: every 4 hours.
Target: Snowflake (raw schema, posthog_events table).
Alert: Slack #data-alerts on failure.
Last failure: 23 days ago (PostHog API timeout).
Downstream: dbt staging models.

The automated path carrying data from where it's produced to where it's used, behind every metric, dashboard, and model.

Domain: Data & Analytics | type: 'data_pipeline' | interface: BaseNode

Description

A data pipeline is the automated path that moves data from where it is produced to where it is used, applying the steps that make it fit for that use along the way. It sits behind every metric, dashboard, and model, and its main design choices are where the cleaning happens and how fresh the data needs to be.

Origin & evolution

The classical shape is ETL: extract, transform, load. It dates to the data-warehousing practice of the 1980s and 1990s, when warehouse compute was scarce and expensive, so data was cleaned and reshaped on a separate staging server before being loaded into the warehouse in a nightly batch. Transformation came first because the warehouse could not afford to do it.

Cloud warehouses inverted that constraint. When Redshift, BigQuery, and Snowflake made warehouse compute cheap and elastic, it became practical to load raw data first and transform it inside the warehouse using its own SQL engine. ETL became ELT. As Fivetran describes the shift, the reordering is small on paper and large in consequence: raw data lands intact, and transformation becomes versioned, testable SQL in place of brittle pre-load scripts. This reordering anchored what the industry came to call the modern data stack, a set of specialised tools where managed connectors such as Fivetran or Airbyte handle extract-and-load and dbt handles transformation as code. (The category keeps consolidating; Fivetran and dbt Labs announced a merger in 2025.)

A second axis runs alongside the ETL/ELT one: batch versus streaming. Batch pipelines move data on a schedule, which suits most analytics. Streaming pipelines built on systems such as Apache Kafka process events continuously as they arrive, transforming in flight, which suits fraud detection and live operational metrics where minutes-old data is already stale. Most organisations run both, choosing per use case and treating neither as a default.

Reis and Housley offer a technology-agnostic map of this territory in their data engineering lifecycle, which stages the work as generation, storage, ingestion, transformation, and serving. Alongside those stages they identify a set of undercurrents — security, data management, DataOps, data architecture, orchestration, and software engineering — that apply at every stage rather than belonging to any one of them. By that framing, the ETL/ELT distinction and the batch/streaming choice are both decisions within the ingestion and transformation stages, not the whole of the discipline; what sits beneath them is the same sequence of concerns regardless of tooling.

How it works in practice

A subscription business needs a daily revenue dashboard. The pipeline is ELT and batch. At 02:00 a managed connector extracts rows from the billing database, the payments processor's API, and the product-event log, and loads them raw into a cloud warehouse. A dbt job then runs: it deduplicates events, reconciles currencies, joins payments to accounts, and builds a clean daily_revenue table, with tests asserting that no row has a negative amount and that yesterday's total reconciles to the processor's settlement report. The dashboard reads only that final table.

The discipline pays off the morning revenue appears to drop 12%. Because the raw load is preserved and the transformations are tested code, an engineer traces the dip in twenty minutes to a currency-conversion step that mishandled a new market, with no need to re-derive the number by hand. The pipeline is debuggable because each stage is explicit.
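The two tests described for the daily_revenue table can be illustrated as follows. In the scenario above they would be written as dbt tests; this Python version is only a sketch of the same logic, and the row shape (an "amount" field holding a Decimal), the function names, and the tolerance value are assumptions made for the example.

```python
from decimal import Decimal


def check_no_negative_amounts(rows):
    """Return the rows whose amount is negative; an empty list means the test passes."""
    return [row for row in rows if row["amount"] < 0]


def check_reconciles(rows, settlement_total, tolerance=Decimal("0.01")):
    """Return True when summed amounts match the processor's settlement total.

    The tolerance is a hypothetical allowance for rounding differences.
    """
    total = sum((row["amount"] for row in rows), Decimal("0"))
    return abs(total - settlement_total) <= tolerance
```

Decimal is used instead of float so that currency sums are not distorted by binary floating-point rounding.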

Data pipeline vs. its neighbours

Data source is an origin or a destination of data: a database, an API, an event stream. A data pipeline is the moving machinery between sources; the source holds data at rest, the pipeline carries it.
Data product is a curated, reusable dataset packaged for consumers with an owner and a contract, such as a trusted "customer 360" table. A pipeline is what builds and maintains a data product; the product is the output, the pipeline is the means.
Dashboard is a presentation surface that reads processed data. A pipeline is upstream of it. When a dashboard is wrong, the cause usually lives in the pipeline that fed it.

In the graph

In the Unified Product Graph, a data pipeline sits in the data and analytics domain and is defined by its flow. It connects to its inputs through data_pipeline_reads_from_data_source and to its outputs through data_pipeline_writes_to_data_source, while the inverse data_source_processed_via_data_pipeline lets any source name the pipelines acting on it. Downstream, data_pipeline_feeds_data_product links the machinery to the curated dataset it produces. Modelling direction explicitly (reads-from versus writes-to) is what makes the graph traceable: when a metric looks wrong, the edges let you walk back from the dashboard to the data product to the exact pipeline and source, which is the question every data team is actually asking.
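The walk back from a data product to its pipelines and sources can be sketched over a plain list of edges. The edge type names below are the ones named in this section; the node identifiers, the tuple representation, and the upstream_of function are hypothetical and say nothing about how the graph is actually stored.

```python
# Each edge is (from_node, edge_type, to_node). Node ids are made up for the example.
EDGES = [
    ("pipeline:revenue_elt", "data_pipeline_reads_from_data_source", "source:billing_db"),
    ("pipeline:revenue_elt", "data_pipeline_reads_from_data_source", "source:payments_api"),
    ("pipeline:revenue_elt", "data_pipeline_writes_to_data_source", "source:warehouse"),
    ("pipeline:revenue_elt", "data_pipeline_feeds_data_product", "product:daily_revenue"),
]


def upstream_of(product, edges=EDGES):
    """Map each pipeline feeding `product` to the sorted list of sources it reads from."""
    pipelines = [
        src for src, edge_type, dst in edges
        if edge_type == "data_pipeline_feeds_data_product" and dst == product
    ]
    return {
        pipeline: sorted(
            dst for src, edge_type, dst in edges
            if src == pipeline and edge_type == "data_pipeline_reads_from_data_source"
        )
        for pipeline in pipelines
    }
```

Because reads-from and writes-to are separate edge types, the warehouse that the pipeline writes to does not appear among its upstream sources.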

Worked example: Trellis

At Trellis, the pipeline is what connects the event schema to the dashboards: it moves raw agent-change events into the activation and trust metrics that Nora and her team actually watch. It transforms proposed, approved, and reverted events into the workspace-level signals that show whether a cohort of workspaces is trending toward the 40 percent week-1 activation goal.

Preview

Preset: Workspace Record Sync Pipeline

Schedule: Every 15 minutes during business hours (*/15 9-18 * * 1-5)
Avg runtime: 42 seconds
Orchestrator: Dagster
Retry count: 3
Retry delay seconds: 30
Timeout seconds: 120
Trigger rule: Schedule-based, skips if previous run is still active
Pool: operations-default
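To show what the retry settings in this preset mean at run time, here is a minimal sketch of a retry loop driven by retry_count and retry_delay_seconds. It is not Dagster's API or any orchestrator's implementation; RetryPolicy and run_with_retries are hypothetical names, and timeout_seconds is left out because enforcing a timeout depends on how the task is executed.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Defaults mirror the preset values shown above."""
    retry_count: int = 3
    retry_delay_seconds: float = 30


def run_with_retries(task, policy, sleep=time.sleep):
    """Call task(); on failure, wait and retry up to policy.retry_count times.

    The task runs at most retry_count + 1 times. The last failure is re-raised
    so the run is reported as failed rather than swallowed.
    """
    max_attempts = policy.retry_count + 1
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(policy.retry_delay_seconds)
```

The sleep function is a parameter so the loop can be exercised in tests without actually waiting. Automatic retries are only safe when the task itself is idempotent.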

Properties

Type-specific fields on BaseNode

schedule (string): Cron or scheduling expression
avg_runtime (string): Average wall-clock runtime per execution
orchestrator (string): Orchestration tool (e.g. "Airflow", "Dagster", "dbt Cloud")
retry_count (number): Number of automatic retries on failure
retry_delay_seconds (number): Delay between retries in seconds
timeout_seconds (number): Maximum allowed runtime in seconds before timeout
trigger_rule (string): Rule that determines when this pipeline triggers
pool (string): Resource pool this pipeline runs in

Inherited from BaseNode (6 fields)

id (string, required): Unique identifier (UUID)
type (NodeType, required): Discriminator for the entity type
title (string, required): Display name
description (string): Optional detailed description
status (string): Lifecycle status
tags (string[]): Freeform tags for filtering
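The fields listed above can be pictured as a pair of record types. This is an illustrative Python rendering, not the schema's actual definition: the mapping of "number" to int, the use of Optional for non-required fields, and the id value in the example are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BaseNode:
    # Required fields.
    id: str      # UUID
    type: str    # NodeType discriminator, e.g. "data_pipeline"
    title: str
    # Optional fields.
    description: Optional[str] = None
    status: Optional[str] = None
    tags: List[str] = field(default_factory=list)


@dataclass
class DataPipeline(BaseNode):
    # Type-specific fields; none are marked required in the listing above.
    schedule: Optional[str] = None
    avg_runtime: Optional[str] = None
    orchestrator: Optional[str] = None
    retry_count: Optional[int] = None
    retry_delay_seconds: Optional[int] = None
    timeout_seconds: Optional[int] = None
    trigger_rule: Optional[str] = None
    pool: Optional[str] = None


# Example built from the preset on this page; the id is a placeholder UUID.
example = DataPipeline(
    id="00000000-0000-0000-0000-000000000000",
    type="data_pipeline",
    title="Workspace Record Sync Pipeline",
    schedule="*/15 9-18 * * 1-5",
    avg_runtime="42 seconds",
    orchestrator="Dagster",
    retry_count=3,
    retry_delay_seconds=30,
    timeout_seconds=120,
    trigger_rule="Schedule-based, skips if previous run is still active",
    pool="operations-default",
)
```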

Lifecycle

5 phases, initial: building

Relationships

4 edge types connected to this entity.

Parents

Entities that can contain this type

Data Source: data_source_processed_via_data_pipeline

Cross-References

Contextual links across the graph

Data Product: data_pipeline_feeds_data_product
Data Source: data_pipeline_reads_from_data_source
Data Source: data_pipeline_writes_to_data_source

Graph Position

Data Pipeline: 1 parent, 3 cross-references

Definition

A data pipeline is an automated process that extracts, loads, and transforms data from source to consumption layer. It connects data sources to the data models that make data analytics-ready.

Usage Guidance

Design pipelines with: idempotency (safe to re-run), observability (logs and alerts on failure), and incremental loading (process only new data where possible).
Test pipelines in isolation from downstream consumers.
Document expected volume and latency.
Monitor for pipeline drift as sources change.

Anti-Patterns

Pipelines built without idempotency turn a single retry into duplicated or corrupted data.
Bundling extraction, transformation and loading into one opaque job makes failures impossible to isolate and reruns all-or-nothing.
Silent failure is the classic trap: a pipeline that drops rows or skips a partition without alerting erodes trust in every downstream number.
Hardcoding schema assumptions means an upstream column rename breaks the pipeline in ways that surface only days later in a wrong report.

Examples

PostHog → Snowflake pipeline

Tool: Fivetran (managed).
Extraction: PostHog REST API, incremental by timestamp.
Schedule: every 4 hours.
Target: Snowflake (raw schema, posthog_events table).
Alert: Slack #data-alerts on failure.
Last failure: 23 days ago (PostHog API timeout).
Downstream: dbt staging models.
