Data Integration Patterns • API-based Data Ingestion Patterns
Medium · ⏱️ ~3 min
Four Core API Ingestion Patterns
API ingestion is not one-size-fits-all. The pattern you choose depends on who initiates the data transfer, how much control you have over timing, and what latency you need. Four patterns dominate production systems.
Pull Based (Polling)
Your system polls API on schedule
↕
Push Based (Webhooks)
Source calls your endpoint
↕
Async Job Based
Submit batch, poll for status
↕
Streaming Event
Each event sent immediately

Pull Based Polling:
Your pipeline wakes up every 15 minutes or hourly and fetches data. You use updated_at timestamps or cursors to get only changed records since the last sync. Fivetran uses this pattern for most SaaS connectors. Typical incremental syncs achieve p95 latency under 5 to 10 minutes, while full backfills might take hours due to pagination and rate limits.
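A minimal sketch of the cursor logic: an in-memory list stands in for the SaaS API (in production, `fetch_since` would make paginated HTTP calls to the real endpoint, and the cursor would be persisted between runs). All names here are illustrative, not any vendor's actual API.

```python
# In-memory stand-in for a SaaS API: records with updated_at timestamps.
RECORDS = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
    {"id": 3, "updated_at": "2024-01-03T00:00:00Z"},
]

def fetch_since(cursor: str, page_size: int = 2):
    """Simulate the API's incremental endpoint: pages of records changed after cursor."""
    changed = sorted((r for r in RECORDS if r["updated_at"] > cursor),
                     key=lambda r: r["updated_at"])
    for i in range(0, len(changed), page_size):  # mimic API pagination
        yield changed[i:i + page_size]

def incremental_sync(cursor: str):
    """Pull all pages since cursor; return synced records and the advanced cursor."""
    synced = []
    for page in fetch_since(cursor):
        for record in page:
            synced.append(record)                       # load into staging
            cursor = max(cursor, record["updated_at"])  # advance high-water mark
    return synced, cursor  # persist the cursor for the next scheduled run
```

Because ISO 8601 timestamps sort lexicographically, `max` on the strings is enough to track the high-water mark; a second run with the saved cursor fetches nothing new.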
Push Based Webhooks:
The source system calls your HTTP endpoint when data changes. You validate the signature, enqueue the payload into a message queue or log, and return 200 OK immediately. This can deliver sub-second freshness, but it requires you to maintain a highly available endpoint with proper authentication. You still need occasional full syncs to catch missed events.
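The validate-enqueue-ack flow can be sketched with a plain handler function; the HMAC-SHA256 signature scheme and the in-memory queue are illustrative assumptions (real sources each define their own signing scheme, and the queue would be Kafka, SQS, or similar).

```python
import hmac
import hashlib
import json
from collections import deque

SHARED_SECRET = b"webhook-signing-secret"  # agreed with the source system
EVENT_QUEUE = deque()  # stand-in for a durable queue; processing happens elsewhere

def handle_webhook(body: bytes, signature_header: str) -> int:
    """Validate the HMAC signature, enqueue the payload, return an HTTP status fast."""
    expected = hmac.new(SHARED_SECRET, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the signature through timing differences
    if not hmac.compare_digest(expected, signature_header):
        return 401  # reject forged or corrupted payloads
    EVENT_QUEUE.append(json.loads(body))  # enqueue only; no heavy work inline
    return 200  # respond immediately so the source does not retry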
Async Job Based:
You submit a batch of updates through one API endpoint and receive a job identifier. You then poll a separate status endpoint every 10 seconds until the job completes. Bloomreach uses this pattern for product catalog ingestion, with typical job latencies of 30 to 300 seconds for tens of thousands of products. This decouples submission from processing, allowing the backend to throttle work.
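The submit-then-poll loop looks like this in outline; the backend here is an in-memory stub that "completes" after a few status checks, and none of the endpoint or function names come from Bloomreach's actual API.

```python
import itertools

# In-memory stand-in for the ingestion backend: job id -> remaining work ticks.
_job_ids = itertools.count(1)
_jobs = {}

def submit_batch(records: list) -> int:
    """POST /ingest equivalent: accept the batch, return a job id immediately."""
    job_id = next(_job_ids)
    _jobs[job_id] = 3  # pretend processing finishes after three status checks
    return job_id

def get_status(job_id: int) -> str:
    """GET /jobs/{id} equivalent: report progress until the job completes."""
    if _jobs[job_id] > 0:
        _jobs[job_id] -= 1
        return "processing"
    return "done"

def ingest(records: list) -> str:
    """Client side: submit once, then poll the status endpoint until done."""
    job_id = submit_batch(records)
    while (status := get_status(job_id)) != "done":
        pass  # in production: time.sleep(10) between polls
    return status
```

The decoupling is visible in the two endpoints: submission returns instantly with an identifier, and the backend is free to schedule or throttle the actual processing behind the status endpoint.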
Streaming Event Collection:
Each user action or system event is sent individually through an API to a collector service. Segment uses this for behavioral tracking, accepting tens of thousands of events per second. The collector writes to a durable log immediately, then fans out to warehouses and destinations asynchronously.
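The collector's write-to-log-first discipline can be sketched as two functions, with a plain list standing in for the durable log (Kafka, in Segment's case) and dictionaries for downstream destinations; all names are illustrative.

```python
DURABLE_LOG = []  # stand-in for Kafka: append-only, acked before any fan-out
DESTINATIONS = {"warehouse": [], "analytics": []}

def collect(event: dict) -> bool:
    """Collector hot path: append to the durable log and ack; nothing else inline."""
    DURABLE_LOG.append(event)
    return True  # client gets an ack as soon as the event is durable

def fan_out(offset: int) -> int:
    """Async consumer: replay the log from a checkpoint into each destination."""
    for event in DURABLE_LOG[offset:]:
        for sink in DESTINATIONS.values():
            sink.append(event)
    return len(DURABLE_LOG)  # new offset to checkpoint
```

Separating the hot path from fan-out is what lets the collector absorb tens of thousands of events per second: a slow warehouse load never blocks an incoming event, it just grows the consumer's lag.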
✓ In Practice: Most companies use multiple patterns. Operational data from internal services might use streaming events. SaaS integrations use pull based polling. Critical low latency updates use webhooks.
💡 Key Takeaways
✓ Pull based polling is simplest and works when you control the schedule, achieving p95 latency of 5 to 10 minutes for incremental syncs
✓ Push based webhooks deliver sub-second freshness but require maintaining a highly available endpoint and still need periodic reconciliation syncs
✓ Async job based ingestion decouples submission from processing, with job latencies of 30 to 300 seconds, useful when the backend needs to throttle heavy work
✓ Streaming event collection sends individual events immediately, accepting tens of thousands per second by writing to durable logs before downstream processing
📌 Examples
1. Fivetran polls Salesforce API every 15 minutes using <code>updated_at</code> cursors, respecting 200 requests per minute rate limits
2. Bloomreach product catalog ingestion: submit batch via API, receive job ID, poll status every 10 seconds, typical completion in 30 to 300 seconds
3. Segment HTTP tracking API accepts behavioral events at tens of thousands per second, immediately writing to Kafka before fanning out to destinations