How Reverse ETL Works: The Three Phase Pipeline
The Technical Challenge:
Reverse ETL is fundamentally harder than traditional ETL because you are writing to operational systems, not just reading from them. When you extract from an API, failures are usually safe: you simply retry. When you write to a CRM, failures have consequences: you might trigger duplicate emails, overwrite newer data with stale values, or create conflicting records. The warehouse also lacks features that would make this easy, such as native change streams.
Phase One: Extract from Warehouse
Most warehouses are not built for operational workloads. They excel at scanning billions of rows for analytics but lack change data capture mechanisms. To sync incrementally, Reverse ETL systems typically maintain a high watermark: the maximum updated_at timestamp seen in the last sync.
For example, your warehouse has an account_scores table with 100,000 rows. The extractor queries: "SELECT * FROM account_scores WHERE updated_at > last_watermark". If 2,500 accounts changed since the last sync 5 minutes ago, only those get extracted. The system updates its watermark to the max timestamp from this batch and uses it for the next run.
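A minimal sketch of that watermark loop, assuming a DB-API style warehouse connection and the account_scores table from the example; the column names are assumed, and watermark persistence is reduced to a returned value where a real pipeline would store it in durable state.

```python
from datetime import datetime

def extract_changed_rows(conn, last_watermark: datetime):
    """Pull only rows whose updated_at advanced past the previous high watermark."""
    cur = conn.cursor()
    cur.execute(
        "SELECT account_id, score, updated_at FROM account_scores "
        "WHERE updated_at > %s ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest timestamp in this batch;
    # if nothing changed, keep the previous watermark.
    new_watermark = max((row[2] for row in rows), default=last_watermark)
    return rows, new_watermark

# Each scheduled run (e.g. every 5 minutes) reuses the returned watermark:
#   rows, watermark = extract_changed_rows(conn, watermark)
```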
Some setups compare full snapshots instead. They materialize the current state, diff against the previous snapshot, and identify inserts, updates, and deletes. This costs more compute but handles cases where updated_at columns are unreliable or missing.
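The diff itself is simple; here is a sketch that assumes both snapshots fit in memory as dictionaries keyed by primary key (production systems usually run the equivalent comparison as SQL inside the warehouse).

```python
def diff_snapshots(previous: dict, current: dict):
    """Classify changes between two snapshots keyed by primary key."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = [k for k in previous if k not in current]
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

prev = {"acct_1": {"score": 72}, "acct_2": {"score": 55}}
curr = {"acct_1": {"score": 80}, "acct_3": {"score": 64}}
print(diff_snapshots(prev, curr))
# ({'acct_3': {'score': 64}}, {'acct_1': {'score': 80}}, ['acct_2'])
```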
Phase Two: Transform for Destinations
Your warehouse model uses snake_case column names and stores phone numbers as strings with country codes. Salesforce expects PascalCase field names and validates phone formats strictly. The transformation layer bridges this gap.
This phase flattens nested JSON, maps warehouse IDs to destination IDs using an internal mapping table, enforces required fields, and applies validation rules. For example, emails must match a regex pattern, phone numbers need specific formatting, and certain enum fields only accept predefined values. Destinations like HubSpot or Marketo have hundreds of such constraints.
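A sketch of that per-record mapping and validation. The snake_case-to-PascalCase field map, email regex, and phone normalization are illustrative stand-ins; a real Salesforce or HubSpot connector enforces many more constraints.

```python
import re

# Illustrative warehouse-column -> destination-field mapping (snake_case -> PascalCase)
FIELD_MAP = {"first_name": "FirstName", "last_name": "LastName",
             "email": "Email", "phone_number": "Phone"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def transform_record(row: dict) -> dict:
    """Rename fields, validate required values, and normalize formats before loading."""
    record = {dest: row[src] for src, dest in FIELD_MAP.items() if src in row}
    if not record.get("Email") or not EMAIL_RE.match(record["Email"]):
        raise ValueError(f"invalid email: {record.get('Email')!r}")
    if "Phone" in record:
        # Strip everything but digits and re-add the leading '+',
        # a stand-in for stricter CRM phone validation.
        record["Phone"] = "+" + re.sub(r"\D", "", record["Phone"])
    return record

print(transform_record({"first_name": "Ada", "email": "ada@example.com",
                        "phone_number": "+1 (415) 555-0100"}))
# {'FirstName': 'Ada', 'Email': 'ada@example.com', 'Phone': '+14155550100'}
```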
Phase Three: Load with Idempotency
The loader batches records for efficiency while respecting rate limits. A CRM might allow 10,000 records per API call but limit you to 500 requests per minute. The loader parallelizes work across multiple workers and implements backoff when hitting throttles.
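A simplified single-worker loop using the limits quoted above (10,000 records per call, 500 requests per minute), with client-side pacing and exponential backoff; send_batch and ThrottledError are placeholders for a real connector client and its throttle responses, and a parallel version would share the rate limiter across workers.

```python
import time

BATCH_SIZE = 10_000        # records per API call
MIN_INTERVAL = 60 / 500    # seconds between calls to stay under 500 requests/minute

class ThrottledError(Exception):
    """Raised by the placeholder send_batch when the destination throttles us."""

def load(records: list, send_batch) -> None:
    """Send records in batches, pacing requests and backing off on throttles."""
    for start in range(0, len(records), BATCH_SIZE):
        batch = records[start:start + BATCH_SIZE]
        delay = 1.0
        while True:
            try:
                send_batch(batch)
                break
            except ThrottledError:
                time.sleep(delay)            # exponential backoff on throttling
                delay = min(delay * 2, 60)
        time.sleep(MIN_INTERVAL)             # simple client-side rate limiting
```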
Idempotency is critical. If a request fails midway through a 10,000 record batch, the system must retry without creating duplicates. Most Reverse ETL tools maintain a mapping between warehouse primary keys and destination object IDs. They query the destination first to check if a record exists, then decide whether to insert or update. Some APIs provide native upsert operations using external IDs, which simplifies this.
Versioning prevents overwriting newer data. The system tracks timestamps or version numbers for each record. If the destination already has a more recent version, the update is skipped.
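A sketch that combines the mapping-table lookup with the version guard described above. Here id_map stands in for the internal mapping table and crm for a hypothetical destination client with create, get, and update calls.

```python
def upsert(record: dict, id_map: dict, crm) -> str:
    """Write one record without creating duplicates or overwriting newer data."""
    dest_id = id_map.get(record["warehouse_id"])   # warehouse PK -> destination object ID
    if dest_id is None:
        # No mapping yet: create the record and remember its destination ID.
        id_map[record["warehouse_id"]] = crm.create(record)
        return "inserted"
    existing = crm.get(dest_id)
    if existing["updated_at"] >= record["updated_at"]:
        return "skipped"                           # destination already has a newer version
    crm.update(dest_id, record)
    return "updated"
```

With a destination that supports native upserts on an external ID, the existence check collapses into a single call.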
Typical Sync Volume: 2.5K records per sync, 5 minute sync interval
💡 Key Takeaways
✓ Extracting from warehouses requires custom change detection using watermarks or snapshot diffs, since warehouses lack native Change Data Capture (CDC)
✓ The transformation layer must map warehouse schemas to destination constraints, handling field naming conventions, data types, and validation rules for each target system
✓ The loader implements batching and rate limiting to maximize throughput while respecting API limits, often achieving up to 5 million record updates per minute (500 requests per minute at 10,000 records per request) for well-designed integrations
✓ Idempotency is achieved through mapping tables between warehouse primary keys and destination IDs, plus version tracking to avoid overwriting newer data
✓ Typical production syncs process 2,000 to 5,000 changed records every 5 minutes, with p50 latency of 2 to 5 minutes end to end
📌 Examples
1. A sync extracts 2,500 accounts where updated_at > last_watermark from a 100,000 row table, transforms snake_case columns to PascalCase for Salesforce, batches them into 10,000 record API calls, and completes the full cycle in 3 minutes
2. Census maintains an internal mapping table that stores warehouse row IDs alongside Salesforce contact IDs, enabling upserts by querying this table before each write operation
3. Hightouch implements per-destination rate limiting: a CRM connector respects 500 requests per minute while a marketing platform connector throttles to 100 writes per second, based on vendor limits