For the better part of a decade, Apache Airflow has been the default answer to "How do we schedule this?" It is reliable, ubiquitous, and has a provider package for nearly every tool in existence. If you need to trigger a shell script on a server in Virginia every Tuesday at 4 AM, Airflow is unparalleled.
But as our data platform matured from simple ETL scripts into a complex Financial Lakehouse, we started noticing a recurring pattern of failure that Airflow simply wasn't designed to catch.
The problem wasn't that our DAGs were failing. The problem was that our DAGs were succeeding, but the data was wrong.
The "Green Dashboard" Lie
In the Airflow paradigm, "Success" is defined by the process, not the product. If a Python function executes without raising an exception, the light turns green. It does not verify if the table was updated, if the schema drifted, or if the row count is sensible. It just knows the code finished.
We hit a breaking point with a specific incident.
We had a critical dbt run task scheduled for 8:00 AM. Airflow triggered it on time. The container spun up, dbt compiled the SQL, and the job finished with exit code 0. Airflow marked the task green.
However, due to a silent API change upstream, the ingestion layer had pulled zero rows for that day. dbt happily processed those zero rows, updated our Gold Reporting Layer (which effectively wiped the current day's data), and finished successfully.
For 12 hours, the CEO’s dashboard showed a flat line. The Engineering team didn't react because our monitoring board was entirely green. Airflow said we were fine. The business said we were blind.
The Paradigm Shift: Tasks vs. Assets
We migrated to Dagster because it flips the mental model of orchestration.
- Airflow is Task-Based: "Do this, then do that."
- Dagster is Asset-Based: "Ensure this table exists, is fresh, and matches these checks."
This sounds like a semantic nuance, but it fundamentally changes how you architect your platform.
The Code Comparison
In Airflow, you define imperative steps. You are essentially writing a distributed shell script in Python.
# The Airflow Way: Imperative & Blind
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

with DAG("update_financials", schedule="@daily") as dag:
    # We hope this creates the table we need...
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --select gold_revenue",
    )

    # We hope the previous task actually updated the data...
    notify_slack = PythonOperator(
        task_id="notify_team",
        python_callable=send_alert,
    )

    run_dbt >> notify_slack

Notice the disconnect? The notify_slack task relies on run_dbt finishing, but it has zero awareness of what run_dbt actually produced. It assumes success based on an exit code.
In Dagster, you define the declarative end state using Software-Defined Assets (SDAs).
# The Dagster Way: Declarative & Aware
import pandas as pd
from dagster import asset, Output, MetadataValue, AssetIn

@asset(
    # The asset explicitly declares its dependency on upstream data
    ins={"silver_transactions": AssetIn(key="silver_transactions")}
)
def gold_revenue(context, silver_transactions: pd.DataFrame):
    # Logic to transform the data
    df = compute_revenue_logic(silver_transactions)

    # We don't just return data; we return context.
    return Output(
        value=df,
        metadata={
            "row_count": len(df),
            "preview": MetadataValue.md(df.head().to_markdown()),
            "last_updated": MetadataValue.float(pd.Timestamp.now().timestamp()),
        },
    )

In this world, the graph builds itself based on data dependencies, not task strings. If silver_transactions is missing, gold_revenue won't even attempt to run.
Crucially, the state of the data is now visible to the orchestrator as metadata, which means the orchestrator can actually inspect what was produced. Unlike Airflow, which flies blind, Dagster lets us add a simple guardrail: if the row count is 0, raise an exception and alert the team.
As a result, the pipeline fails explicitly, catching the issue before the empty data ever reaches the dashboard and alerting stakeholders while the root cause is fixed.
This is what lets data engineers like us attach Asset Checks to our data: simple rules that enforce the guardrails we need.
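As a rough sketch (the check name and the zero-row threshold are ours, not anything Dagster prescribes), an Asset Check is just a decorated function that receives the materialized asset and reports pass or fail:

import pandas as pd
from dagster import AssetCheckResult, asset_check

@asset_check(asset="gold_revenue")
def gold_revenue_is_not_empty(gold_revenue: pd.DataFrame) -> AssetCheckResult:
    # The guardrail from the incident above: an empty table is a failure,
    # even if the code that produced it exited cleanly.
    row_count = len(gold_revenue)
    return AssetCheckResult(
        passed=row_count > 0,
        metadata={"row_count": row_count},
    )

A failing check shows up directly on the asset in Dagster's UI, and checks can be marked as blocking so that downstream assets don't materialize on bad data.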
Eliminating "Zombie Data"
The biggest hidden cost in legacy data stacks is Zombie Data: tables that are updated daily, consuming compute credits, yet are read rarely or, in some cases, by absolutely no one.
In Airflow, determining if a DAG is useful is an archaeological dig. You see a DAG named update_marketing_v2, and you're too scared to turn it off because you don't know who consumes the output.
Dagster solves this with its Global Lineage Graph.
Because every asset declares its inputs and outputs, we can trace the lineage from a raw JSON file in S3 all the way to a specific Streamlit dashboard. If we see an asset in the middle of the graph that has no downstream children, we know instantly: Delete it.
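To make that concrete, here is a minimal sketch (the raw_transactions asset and the toy transformation logic are illustrative, not our production code) of how the lineage graph falls out of the asset definitions themselves. Each function parameter names an upstream asset, so loading these definitions gives Dagster the full graph with no extra wiring:

import pandas as pd
from dagster import Definitions, asset

@asset
def raw_transactions() -> pd.DataFrame:
    # e.g. a raw JSON export landed from S3, loaded into a DataFrame
    return pd.DataFrame({"amount": [100.0, 250.0], "currency": ["USD", "USD"]})

@asset
def silver_transactions(raw_transactions: pd.DataFrame) -> pd.DataFrame:
    # Cleaned layer; the parameter name declares the upstream dependency
    return raw_transactions.dropna()

@asset
def gold_revenue(silver_transactions: pd.DataFrame) -> pd.DataFrame:
    # Reporting layer consumed by the dashboard
    return silver_transactions.groupby("currency", as_index=False)["amount"].sum()

# Lineage is derived automatically: raw_transactions -> silver_transactions -> gold_revenue
defs = Definitions(assets=[raw_transactions, silver_transactions, gold_revenue])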
The Verdict
Airflow is essentially a sophisticated cron scheduler. It is excellent for operational tasks—sending emails, triggering backups, or spinning up servers.
But for Data Engineering? You shouldn't be managing tasks. You should be managing data.
By switching to Dagster, we stopped asking "Did the job run?" and started asking "Is the data fresh?" That difference is the only one that matters.



