Choosing Your Ingestion Pattern: Trade-offs and Decision Criteria

The question is not which pattern is best. The question is which trade-offs match your constraints.

Pull based polling (simple, predictable, p95 lag of 5 to 10 minutes) vs push based webhooks (sub second latency, complex operations)

API Ingestion vs File Based Transfer:

File based feeds, like nightly CSV drops into S3, can deliver hundreds of gigabytes in one transfer with minimal API overhead. If your upstream system can generate full snapshots daily, this might be simpler than API pagination.

But files are coarse grained. You get everything or nothing, typically once per day. APIs let you fetch only changed objects using updated_at filters or cursors. For systems with millions of records but only thousands changing daily, incremental API syncs save bandwidth and processing time.
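As a rough illustration of the incremental approach, the sketch below upserts only records whose updated_at is newer than a stored cursor. The `fake_fetch_changed` function and the `SOURCE` list are hypothetical stand-ins for a real paginated API, and the record shape (`id`, `updated_at` as an ISO 8601 UTC string) is an assumption.

```python
from typing import Callable, Iterable


def incremental_sync(
    fetch_changed: Callable[[str], Iterable[dict]],
    store: dict,
    cursor: str,
) -> str:
    """Upsert records changed since `cursor` and return the advanced cursor."""
    new_cursor = cursor
    for record in fetch_changed(cursor):
        store[record["id"]] = record  # upsert keyed by record id
        # Track the newest timestamp seen; it becomes the next cursor.
        if record["updated_at"] > new_cursor:
            new_cursor = record["updated_at"]
    return new_cursor


# Hypothetical in-memory "source" standing in for a real API.
SOURCE = [
    {"id": 1, "name": "a", "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "name": "b", "updated_at": "2024-01-02T00:00:00Z"},
    {"id": 3, "name": "c", "updated_at": "2024-01-03T00:00:00Z"},
]


def fake_fetch_changed(since: str) -> list:
    # Same-format ISO 8601 UTC strings compare correctly as plain strings.
    return [r for r in SOURCE if r["updated_at"] > since]


if __name__ == "__main__":
    store = {}
    cursor = incremental_sync(fake_fetch_changed, store, "2024-01-01T12:00:00Z")
    print(cursor, sorted(store))
```

The strict `>` comparison can skip records that share the cursor's exact timestamp. A real sync would more likely use `>=` (or an opaque cursor from the API) and rely on the upsert being idempotent.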

Choose file based when: Data volume is huge (terabytes), updates are naturally batched (daily reports), and freshness requirements are relaxed (24 hour latency is acceptable).

Choose API based when: You need sub hour freshness, only a small fraction of data changes frequently, or the source does not support bulk exports.

API Ingestion vs Change Data Capture:

Change Data Capture (CDC) from database transaction logs offers near real time replication with p99 lag under a few seconds. Systems like Debezium tap into MySQL binlog or Postgres Write Ahead Log (WAL) and stream every change.

But CDC requires database access privileges. SaaS vendors like Salesforce or Shopify will never grant you direct database access, so for those sources API ingestion is your only option.

Even for internal systems, CDC has operational complexity. You must manage replication slots, handle schema evolution in the binlog, and deal with database failovers. APIs provide a stable contract that isolates you from backend changes.
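For a sense of what consuming a change stream involves, here is a minimal sketch that applies change events to an in-memory replica. The event shape (`op`, `before`, `after`) loosely follows the Debezium change event envelope. The `id` primary key, the sample rows, and the `apply_change` helper are assumptions for illustration, and the sketch ignores schema changes, ordering across partitions, and offset tracking.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to a replica keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":  # delete: only the "before" image is populated
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


# Hypothetical events in log order.
EVENTS = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "x@example.com"}},
    {
        "op": "u",
        "before": {"id": 1, "email": "a@example.com"},
        "after": {"id": 1, "email": "b@example.com"},
    },
    {"op": "d", "before": {"id": 2, "email": "x@example.com"}, "after": None},
]


def replay(events: list) -> dict:
    """Build a replica by applying events in order."""
    replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


if __name__ == "__main__":
    print(replay(EVENTS))
```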

Choose CDC when: You control the database, need replication lag of a few seconds or less, and can handle the operational complexity of managing replication state.

Choose API ingestion when: The source is a third party SaaS, you lack database privileges, or you prefer operational simplicity over absolute minimum latency.

Pull vs Push Decision Matrix:

Polling wastes API quota. If nothing changed, you still made a request. At scale with hundreds of sources, this adds up. A 10x increase in connected systems means 10x more wasted calls.
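A quick back-of-the-envelope calculation makes the scaling concrete. The 5 minute interval and the assumption that only 10% of polls return changes are illustrative numbers, not measurements.

```python
MINUTES_PER_DAY = 24 * 60


def polls_per_day(sources: int, interval_minutes: int) -> int:
    """Total polling requests per day across all sources."""
    return sources * (MINUTES_PER_DAY // interval_minutes)


def wasted_polls_per_day(sources: int, interval_minutes: int, useful_percent: int) -> int:
    """Polls per day that return no changes, given the share that do."""
    total = polls_per_day(sources, interval_minutes)
    return total - (total * useful_percent) // 100


if __name__ == "__main__":
    for sources in (100, 1000):
        total = polls_per_day(sources, 5)
        wasted = wasted_polls_per_day(sources, 5, 10)
        print(f"{sources} sources: {total} polls/day, {wasted} returned nothing")
    # 100 sources: 28800 polls/day, 25920 returned nothing
    # 1000 sources: 288000 polls/day, 259200 returned nothing
```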

Webhooks eliminate waste. The source only calls you when something changes. Latency can be sub second instead of minutes.

But webhooks require you to run a highly available endpoint. You must validate signatures to prevent spoofing, reject replayed requests, and tolerate duplicate and out of order delivery. You also need reconciliation logic because delivery is not guaranteed: many providers retry only for a limited window and some do not retry at all, so if your endpoint is down for too long, events are lost.
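The validation work can be sketched as follows. The signing scheme (HMAC-SHA256 over `timestamp.body`), the 300 second tolerance, and the `WebhookReceiver` class are assumptions for illustration; each provider documents its own scheme and header names. Out of order handling and reconciliation are not shown.

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed window for rejecting replayed requests


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature over 'timestamp.body'."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class WebhookReceiver:
    def __init__(self, secret: bytes):
        self.secret = secret
        # In production this would be a shared store with expiry, not a set.
        self.seen_event_ids = set()

    def handle(self, event_id, timestamp, body, signature, now=None) -> str:
        now = time.time() if now is None else now
        expected = sign(self.secret, timestamp, body)
        # Constant-time comparison avoids leaking signature bytes via timing.
        if not hmac.compare_digest(expected, signature):
            return "rejected: bad signature"
        # The timestamp is covered by the signature, so an old captured
        # request cannot be replayed with a fresh timestamp.
        if abs(now - timestamp) > TOLERANCE_SECONDS:
            return "rejected: stale timestamp"
        if event_id in self.seen_event_ids:
            return "ignored: duplicate"
        self.seen_event_ids.add(event_id)
        return "accepted"


if __name__ == "__main__":
    receiver = WebhookReceiver(b"shared-secret")
    ts = 1_700_000_000
    body = b'{"order_id": 42}'
    sig = sign(b"shared-secret", ts, body)
    print(receiver.handle("evt_1", ts, body, sig, now=ts + 10))
    print(receiver.handle("evt_1", ts, body, sig, now=ts + 20))
```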

"The decision is not pull versus push. It is: can I afford the operational complexity of webhooks for the latency improvement I need?"

Choose polling when: Sources number in the dozens or low hundreds, freshness requirements are 5 to 10 minutes, and you want simple operations.

Choose webhooks when: Latency requirements are under one minute, event volume justifies the infrastructure cost, and you can build robust validation and reconciliation.

When Async Job Based Fits:

If the upstream API involves heavy server side processing, like indexing or batch transformations, async job submission is the right pattern. You decouple submission from execution, preventing client timeouts. The server can throttle background work to protect its SLAs.

Bloomreach limits ingestion requests to one per minute per catalog and indexing to once per hour. This prevents runaway background jobs from impacting customer facing APIs. Your ingestion client polls job status instead of blocking on a long running request.
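The submit-then-poll flow can be sketched like this. `FakeJobServer`, the status strings, and the default intervals are hypothetical and are not the Bloomreach API; a real client would call the vendor's submit and status endpoints and respect its documented rate limits.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    poll_interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a job until it reaches a terminal status or the timeout passes."""
    deadline = clock() + timeout
    while True:
        status = get_status(job_id)
        if status in ("succeeded", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status} after {timeout}s")
        sleep(poll_interval)


class FakeJobServer:
    """Hypothetical backend: a job succeeds after a fixed number of polls."""

    def __init__(self, polls_until_done: int):
        self.polls_until_done = polls_until_done
        self.polls = {}

    def submit(self, payload: dict) -> str:
        # Returns immediately with a job id; the work happens in the background.
        job_id = f"job-{len(self.polls) + 1}"
        self.polls[job_id] = 0
        return job_id

    def status(self, job_id: str) -> str:
        self.polls[job_id] += 1
        done = self.polls[job_id] > self.polls_until_done
        return "succeeded" if done else "running"


if __name__ == "__main__":
    server = FakeJobServer(polls_until_done=2)
    job_id = server.submit({"catalog": "products"})
    print(wait_for_job(server.status, job_id, poll_interval=0.01))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.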

💡 Key Takeaways

✓ File based transfer handles terabytes efficiently but provides only daily freshness; API ingestion fetches only changed data with sub hour freshness, at the cost of rate limit complexity

✓ CDC offers near real time replication with p99 lag under a few seconds but requires database access that SaaS vendors never grant; API ingestion is more portable, with lag measured in minutes

✓ Polling is operationally simple with p95 lag of 5 to 10 minutes; webhooks achieve sub second latency but require highly available endpoints, signature validation, and reconciliation logic

✓ Async job based patterns decouple submission from execution when server side processing is heavy, preventing timeouts while allowing backends to throttle work

📌 Interview Tips

1. Choose file based: Daily financial reports with terabytes of data where 24 hour latency is acceptable

2. Choose API polling: Salesforce account syncs with thousands of updates daily, a 10 minute freshness requirement, and hundreds of integrated customers

3. Choose webhooks: Payment processor sending transaction events where fraud detection needs sub second alerting and the system handles tens of thousands of events per second

4. Choose async jobs: Bloomreach product catalog with a limit of one ingestion request per minute and one index rebuild per hour to protect the search cluster
