2025-11-28 · 4 min read
Data Engineering · Advanced

Managing Concurrency in ELT Pipelines

A Deep Dive into Postgres Locking Strategies

Ashish Ahuja, Data Engineering Lead

In the early stages of platform development ("Day 1"), simplicity is the primary architectural driver. When designing an ELT (Extract, Load, Transform) pipeline, the simplest pattern for loading data is often the Full Refresh: wipe the destination table and rewrite the dataset from scratch.

This approach—typically implemented via DROP TABLE or TRUNCATE—is seductive. It guarantees consistency, eliminates duplicate detection logic, and is highly performant on small datasets.

However, as a platform matures into "Day 2" operations and beyond, this pattern frequently becomes the root cause of production outages. This article analyzes a lock-contention scenario (often misdiagnosed as a deadlock) encountered in a financial lakehouse pipeline and explores the mechanical differences between destructive and incremental updates in PostgreSQL.

The Architectural Failure Mode

Our architecture consisted of a standard Lakehouse pattern:

  1. Writer: A Dagster orchestration job that updates the silver_rates table every 15 minutes.
  2. Reader: A Streamlit dashboard that polls this table to visualize real-time trends.

In the development environment, this worked flawlessly. In production, we observed intermittent pipeline failures where the "Loading" step would hang indefinitely, eventually timing out after 30 minutes.

The Mechanism of the Lock Contention

The issue was not in the application logic but in the Postgres Lock Manager.

To execute a TRUNCATE command, Postgres must acquire an ACCESS EXCLUSIVE lock on the target table. This is the most restrictive lock mode available: while it is held, no other transaction can read from or write to the table at all.

Simultaneously, our Streamlit dashboard, which polls the database, holds an ACCESS SHARE lock while running its SELECT queries.

This created a resource contention queue:

  1. Streamlit (Reader): Acquires ACCESS SHARE.
  2. Dagster (Writer): Requests ACCESS EXCLUSIVE to truncate. Blocked.
  3. Postgres: Queues the Writer until the Reader releases the lock.

Because modern dashboards keep connections alive and sometimes hold transactions open between polls, the Writer can wait behind a long-lived Reader indefinitely. Worse, once the Writer's ACCESS EXCLUSIVE request is queued, Postgres makes new ACCESS SHARE requests wait behind it, so fresh dashboard queries stall as well and the pipeline halts silently.
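The queueing behavior can be sketched with a toy model (purely illustrative, not Postgres internals): a request is granted only if it conflicts with neither the current lock holders nor any earlier waiter, which is why a queued writer also ends up blocking later readers.

```python
from collections import deque

# Toy single-table lock queue. A request is granted only if it
# conflicts with no current holder AND no earlier waiter -- Postgres
# checks waiters too, so queued writers are not starved by new readers.
CONFLICTS = {
    ("ACCESS SHARE", "ACCESS EXCLUSIVE"),
    ("ACCESS EXCLUSIVE", "ACCESS SHARE"),
    ("ACCESS EXCLUSIVE", "ACCESS EXCLUSIVE"),
}

def conflicts(a, b):
    return (a, b) in CONFLICTS

class TableLock:
    def __init__(self):
        self.holders = []       # granted (name, mode)
        self.waiters = deque()  # queued (name, mode)

    def request(self, name, mode):
        blocked = any(conflicts(mode, m) for _, m in self.holders)
        blocked = blocked or any(conflicts(mode, m) for _, m in self.waiters)
        if blocked:
            self.waiters.append((name, mode))
            return False        # caller must wait
        self.holders.append((name, mode))
        return True             # granted immediately

lock = TableLock()
assert lock.request("streamlit_1", "ACCESS SHARE")       # reader granted
assert not lock.request("dagster", "ACCESS EXCLUSIVE")   # writer queued behind reader
assert not lock.request("streamlit_2", "ACCESS SHARE")   # new reader queues behind writer
```

Note the last line: the second dashboard query is blocked not by the first reader but by the *waiting* writer, which is how a single TRUNCATE request can freeze both the pipeline and the dashboard.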

Solution: The Incremental Replacement Pattern

To resolve this, we must move from a "Destructive" update strategy to a "Concurrent" update strategy.

Instead of requesting a table-level reset, we refactored the pipeline to use targeted DELETE operations.

```sql
-- The Blocking Pattern (avoid under high concurrency)
TRUNCATE TABLE silver_rates;

-- The Non-Blocking Pattern (incremental replacement)
DELETE FROM silver_rates
WHERE timestamp >= '2025-12-01 09:30:00';
```
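The incremental pattern works best when the DELETE and the reload of the same window share one transaction, so a concurrent reader sees either the old rows or the new rows, never an empty window. A minimal sketch of that transaction shape, using Python's stdlib sqlite3 purely as a stand-in (SQLite's locking differs from Postgres; only the transaction structure carries over, and the table and column values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silver_rates (ts TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO silver_rates VALUES (?, ?)",
    [("2025-12-01 09:15:00", 1.10), ("2025-12-01 09:30:00", 1.12)],
)
conn.commit()

# Incremental replacement: delete the refresh window and reload it
# inside a single transaction, so the table never commits half-empty.
window_start = "2025-12-01 09:30:00"
new_rows = [("2025-12-01 09:30:00", 1.13), ("2025-12-01 09:45:00", 1.14)]

with conn:  # commits on success, rolls back on error
    conn.execute("DELETE FROM silver_rates WHERE ts >= ?", (window_start,))
    conn.executemany("INSERT INTO silver_rates VALUES (?, ?)", new_rows)

rows = conn.execute("SELECT ts, rate FROM silver_rates ORDER BY ts").fetchall()
```

If the reload fails midway, the rollback restores the deleted window, which is the property the destructive TRUNCATE-then-reload pattern lacks.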

Why This Works: Lock Compatibility

The architectural advantage of DELETE over TRUNCATE lies in the lock level required.

TRUNCATE requires ACCESS EXCLUSIVE (Incompatible with Readers).

DELETE requires ROW EXCLUSIVE.

A ROW EXCLUSIVE lock is compatible with ACCESS SHARE. This means the database allows the Dashboard to continue reading old rows at the exact same moment the Pipeline is deleting them. Postgres uses MVCC (Multi-Version Concurrency Control) to ensure the Reader sees a consistent snapshot of the data until the Writer commits.
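The relevant corner of the Postgres table-level lock conflict matrix (as documented in the "Explicit Locking" chapter; simplified here to the three modes this article discusses) can be encoded directly, which makes the asymmetry between DELETE and TRUNCATE explicit:

```python
# Subset of the Postgres table-level lock conflict matrix.
# True = the two modes conflict and the later request must wait.
MODES = ["ACCESS SHARE", "ROW EXCLUSIVE", "ACCESS EXCLUSIVE"]

CONFLICT = {
    # ACCESS SHARE (SELECT) conflicts only with ACCESS EXCLUSIVE.
    ("ACCESS SHARE", "ACCESS SHARE"): False,
    ("ACCESS SHARE", "ROW EXCLUSIVE"): False,
    ("ACCESS SHARE", "ACCESS EXCLUSIVE"): True,
    # ROW EXCLUSIVE (DELETE/UPDATE/INSERT) coexists with readers.
    ("ROW EXCLUSIVE", "ACCESS SHARE"): False,
    ("ROW EXCLUSIVE", "ROW EXCLUSIVE"): False,
    ("ROW EXCLUSIVE", "ACCESS EXCLUSIVE"): True,
    # ACCESS EXCLUSIVE (TRUNCATE/DROP/most ALTERs) blocks everything.
    ("ACCESS EXCLUSIVE", "ACCESS SHARE"): True,
    ("ACCESS EXCLUSIVE", "ROW EXCLUSIVE"): True,
    ("ACCESS EXCLUSIVE", "ACCESS EXCLUSIVE"): True,
}

# TRUNCATE blocks dashboard reads; DELETE does not.
assert CONFLICT[("ACCESS EXCLUSIVE", "ACCESS SHARE")]
assert not CONFLICT[("ROW EXCLUSIVE", "ACCESS SHARE")]
```

The two assertions are the whole story of this incident: swapping TRUNCATE for DELETE moves the writer from the row that conflicts with readers to the row that does not.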

Trade-offs and "Day 2" Considerations

As with all engineering decisions, this shift introduces trade-offs: we accepted extra maintenance overhead in exchange for availability.

  1. Vacuuming Pressure: Unlike TRUNCATE, which physically removes files, DELETE marks tuples as "dead." This increases bloat in the Postgres heap, requiring more aggressive autovacuum settings.

  2. Performance: DELETE is slower than TRUNCATE because it must scan the table, mark each matching tuple dead, and write every removed row to the Write-Ahead Log (WAL), whereas TRUNCATE simply deallocates the table's underlying files.

For our use case, the latency penalty (seconds) was an acceptable cost for ensuring zero downtime for dashboard users.

Conclusion

When designing data pipelines, we must look beyond code correctness and consider the operational environment.

Commands that imply "exclusive access" (DROP, TRUNCATE, ALTER) are inherently dangerous in systems with active readers. By adopting incremental patterns and understanding the underlying lock compatibility matrix, we build platforms that are resilient to the chaotic reality of production usage.
