
Extract, Transform, Load (ETL) vs Extract, Load, Transform (ELT): Timing and Trade-offs

ETL and ELT describe different sequencing strategies for data pipelines. In ETL, you extract data from sources, transform it in flight with validation and business logic, then load the clean, curated result into the destination. In ELT, you extract and immediately load raw data into the destination, then run transformations downstream using the destination's compute power.

ETL improves immediate queryability and enforces quality gates at ingress, making curated data available the moment it lands. However, it adds latency to the ingest path and tightly couples transformation logic to the pipeline itself: if you discover a bug or need a new field, you must reprocess from source or wait for new data.

ELT minimizes ingest latency, preserves raw fidelity for reprocessing, and lets consumer teams iterate on transformations without blocking ingestion. The downside is governance drift and the risk that every consumer duplicates transformation logic or queries raw, expensive data formats.

In practice, both patterns coexist. Streaming pipelines feeding operational dashboards often use micro-ETL to apply lightweight cleansing and enrichment, landing JSON or Avro within seconds to minutes. Meanwhile, batch jobs use ELT to land raw JSON cheaply in an object store, then curate it into columnar Parquet with 5 to 10 times compression, reducing scan costs by 80 to 90 percent. For example, scanning 10 terabytes of raw JSON per day at 5 dollars per terabyte costs 50 dollars per day, while the curated 1 to 2 terabytes of Parquet cost 5 to 10 dollars per day for the same analytics.

Choose ETL when schema stability is high, quality enforcement is critical, and you can tolerate modest latency. Choose ELT when raw fidelity is paramount, schemas evolve rapidly, or you need to experiment with transformations without re-ingesting data.
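To make the sequencing difference concrete, here is a minimal Python sketch. All names in it (extract, clean, run_etl, run_elt, the in-memory "warehouse" dicts) are hypothetical stand-ins for illustration, not any specific tool's API.

```python
# Minimal sketch of the two sequencings; every name here is illustrative.
from datetime import datetime

RAW_EVENTS = [
    {"user": "u1", "amount": "19.99", "ts": "2024-05-01T10:00:00+00:00"},
    {"user": "u2", "amount": "not-a-number", "ts": "2024-05-01T10:01:00+00:00"},  # dirty row
]

def extract():
    # In a real pipeline this would read from an API, queue, or source database.
    return list(RAW_EVENTS)

def clean(record):
    # Quality gate / business logic: parse types, drop rows that fail.
    try:
        return {
            "user": record["user"],
            "amount": float(record["amount"]),
            "ts": datetime.fromisoformat(record["ts"]),
        }
    except (KeyError, ValueError):
        return None

def run_etl():
    # ETL: transform in flight; only curated rows ever reach the destination.
    # The dirty row is rejected at ingress and never lands.
    warehouse = {"curated": []}
    for rec in extract():
        cleaned = clean(rec)
        if cleaned is not None:
            warehouse["curated"].append(cleaned)
    return warehouse

def run_elt():
    # ELT: land raw data immediately, then curate with the destination's
    # compute. Raw rows are retained, so curation can be re-run whenever
    # the transformation logic changes, without re-ingesting from sources.
    warehouse = {"raw": list(extract()), "curated": []}
    warehouse["curated"] = [c for c in map(clean, warehouse["raw"]) if c is not None]
    return warehouse

print(run_etl())  # curated only; the dirty row is gone for good
print(run_elt())  # raw preserved alongside the curated view
```

The only difference is where clean runs: before the load in ETL, after it in ELT, which is exactly what determines ingest latency, quality enforcement, and reprocessability.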
💡 Key Takeaways
ETL enforces quality and cleansing before load, reducing downstream burden but adding ingest latency and coupling to transformation logic.
ELT preserves raw fidelity and minimizes latency, with typical JSON to Parquet compression yielding 5 to 10 times size reduction and 80 to 90 percent lower query scan costs.
Streaming pipelines often use micro ETL for sub-minute freshness, while batch systems use ELT to land raw data cheaply and curate offline.
ETL is best for stable schemas and strict quality SLAs. ELT suits fast-changing schemas, experimentation, and reprocessing without re-ingesting from sources.
Large organizations run both patterns in parallel: streams feed operational metrics with ETL, batch jobs populate curated lakes with ELT.
📌 Examples
Amazon pattern: land raw clickstream JSON to object store via ELT (preserving raw fidelity), then curate to Parquet with 8x compression. For 10 TB raw per day, scan cost drops from 50 dollars (10 TB × 5 dollars per TB) to 5 to 10 dollars (1 to 2 TB Parquet).
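The scan-cost arithmetic behind this example works out as below; the 5 dollars per terabyte price and the 8x compression ratio are the illustrative figures quoted above, not any particular engine's pricing.

```python
# Back-of-the-envelope scan-cost math using the figures quoted above
# ($5/TB scanned, 8x JSON-to-Parquet compression); both are illustrative.
raw_tb_per_day = 10
scan_price_per_tb = 5.0
compression_ratio = 8  # within the 5-10x range cited above

raw_scan_cost = raw_tb_per_day * scan_price_per_tb        # $50/day scanning raw JSON
curated_tb = raw_tb_per_day / compression_ratio           # 1.25 TB of Parquet
curated_scan_cost = curated_tb * scan_price_per_tb        # $6.25/day
savings = 100 * (1 - curated_scan_cost / raw_scan_cost)   # 87.5% lower

print(f"raw: ${raw_scan_cost:.2f}/day  curated: ${curated_scan_cost:.2f}/day  savings: {savings:.0f}%")
```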