Failure Modes and Edge Cases in API Ingestion
Production API ingestion fails in subtle ways. Understanding these failure modes separates junior engineers from senior ones in interviews.
Rate Limit Death Spirals:
Suppose your upstream API allows 1000 requests per hour per token. You scale horizontally by launching more workers to handle 10x customer growth. Each worker polls independently. Suddenly, all workers hit the rate limit and receive 429 Too Many Requests errors.
Naive retry logic makes this worse. Workers retry immediately, consume the quota even faster, and oscillate between overload and idle. The system never stabilizes.
The fix is centralized rate limit tracking. A shared service or database tracks remaining quota per token. Workers request capacity before making API calls. When quota is low, the scheduler backs off all workers for that token.
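As a minimal sketch of centralized quota tracking, assume a shared Redis instance holding one counter per token per hour window; the acquire_slot helper and the flat 30-second backoff are illustrative simplifications, not a production scheduler:

```python
import time

import redis  # assumed dependency; any store with atomic counters works

QUOTA_PER_HOUR = 1000  # upstream limit per token

r = redis.Redis()

def acquire_slot(token: str) -> bool:
    """Atomically claim one request slot in this token's current hour window."""
    window = int(time.time()) // 3600
    key = f"quota:{token}:{window}"
    r.set(key, 0, ex=3600, nx=True)  # create the windowed counter once, with a TTL
    # INCR is atomic across all workers; over-counting rejected attempts is
    # harmless because a rejected worker makes no API call.
    return r.incr(key) <= QUOTA_PER_HOUR

def call_api(token: str, fetch):
    # Ask for capacity *before* calling, and back off instead of retrying blindly.
    while not acquire_slot(token):
        time.sleep(30)  # a real scheduler would reassign this worker elsewhere
    return fetch()
```

Because every worker blocks on the same counter once it is exhausted, the whole fleet backs off together instead of oscillating.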
At Fivetran scale with thousands of customers, this means a global scheduler that dynamically allocates capacity. When one customer nears their limit, capacity shifts to others. This is why 10x growth does not require 10x infrastructure.
Schema Drift and Partial Failures:
An upstream team adds a non-nullable field to their API response, or changes an enum value from active to enabled. Your ingestion job, which expects the old schema, starts failing for some objects, maybe 5 percent of records in a batch.
If you validate strictly and fail the entire batch on any error, you lose 95 percent of good data. If you skip validation, you load corrupt data into your warehouse and poison downstream analytics.
❗ Remember: Land raw JSON first, transform later. Segment and Fivetran write raw API responses to storage even with unknown fields, then run separate transformation jobs with schema validation.
This pattern isolates ingestion from transformation. When schemas evolve, ingestion continues successfully. Transformation jobs catch schema mismatches, log them with sample records, and alert data engineers. You get visibility without data loss.
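A rough sketch of that separation, using local JSON Lines files as the raw landing zone and a hand-rolled field check standing in for a real schema validator; EXPECTED_FIELDS and the file layout are assumptions for illustration:

```python
import json

EXPECTED_FIELDS = {"id", "status", "updated_at"}  # assumed schema for illustration

def land_raw(records, path):
    """Ingestion step: write API responses verbatim, unknown fields included."""
    with open(path, "a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def transform(path):
    """Separate job: validate against the expected schema, never block ingestion."""
    good, bad = [], []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            missing = EXPECTED_FIELDS - rec.keys()
            (bad if missing else good).append(rec)
    if bad:
        # Alert with a sample record instead of failing the whole batch.
        print(f"{len(bad)} schema mismatches; sample: {bad[0]}")
    return good
```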
Idempotency and Duplicate Writes:
Your ingestion job fetches 1000 records, processes 500, then times out. The orchestrator retries the job. Without idempotency, you write 500 records twice, creating duplicates.
For analytical warehouses, duplicates corrupt counts and sums. For key-value stores or search indexes, you might overwrite newer data with stale data if retries are delayed.
Async job-based patterns like Bloomreach's mitigate this. The server assigns a job identifier. If you submit the same batch twice with the same identifier, the server treats it as idempotent and returns the existing job. The client can safely retry without duplicating work.
For pull-based ingestion, store high watermarks or cursors per stream in a transaction. Fetch new data, write to the destination, and update the cursor, all in one atomic operation. If the job fails mid-flight, the next run starts from the old cursor and refetches, but deduplicates based on primary keys.
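Sketched with SQLite for concreteness, one run of a pull-based sync might look like this; the events and cursors tables and the fetch_since client function are hypothetical stand-ins for your destination and API client:

```python
import sqlite3

def sync_stream(conn: sqlite3.Connection, stream: str, fetch_since):
    """One incremental run: upsert rows and advance the cursor in one transaction."""
    row = conn.execute(
        "SELECT position FROM cursors WHERE stream = ?", (stream,)
    ).fetchone()
    position = row[0] if row else None

    records, new_position = fetch_since(position)  # hypothetical API client call

    with conn:  # records and cursor commit together, or roll back together
        for rec in records:
            # Upsert on the primary key so a refetch after a crash deduplicates.
            conn.execute(
                "INSERT INTO events (id, payload) VALUES (?, ?) "
                "ON CONFLICT(id) DO UPDATE SET payload = excluded.payload",
                (rec["id"], rec["payload"]),
            )
        conn.execute(
            "INSERT INTO cursors (stream, position) VALUES (?, ?) "
            "ON CONFLICT(stream) DO UPDATE SET position = excluded.position",
            (stream, new_position),
        )
```

Because the cursor only advances when the writes commit, a crash mid-flight replays from the old cursor and the upserts absorb the duplicates.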
Large Objects and Payload Limits:
Most APIs have payload size limits. Bloomreach limits inline API payloads to a few megabytes. If your product catalog includes high resolution images or large descriptions, you exceed this limit.
The workaround is hybrid ingestion. Upload large files via SFTP or presigned S3 URLs. Use the API only to trigger processing of those files. The API call is lightweight, just a reference to the file location, while the heavy data transfer happens out of band.
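A sketch of that hybrid flow using the requests library; the /uploads and /catalog/import endpoints and response fields are invented for illustration and do not correspond to any particular vendor's API:

```python
import requests  # assumed HTTP client

def ingest_large_catalog(file_path: str, api_base: str, api_key: str) -> str:
    """Hybrid pattern: heavy bytes go to object storage, the API gets a reference."""
    headers = {"Authorization": f"Bearer {api_key}"}

    # 1. Ask the service for a presigned upload URL (endpoint name is illustrative).
    resp = requests.post(f"{api_base}/uploads", headers=headers)
    resp.raise_for_status()
    upload_url = resp.json()["upload_url"]

    # 2. Stream the large file directly to storage, out of band of the API.
    with open(file_path, "rb") as f:
        requests.put(upload_url, data=f).raise_for_status()

    # 3. Lightweight API call: just a pointer to the uploaded file.
    resp = requests.post(
        f"{api_base}/catalog/import",
        headers=headers,
        json={"source": upload_url.split("?")[0]},  # strip the signing query string
    )
    resp.raise_for_status()
    return resp.json()["job_id"]
```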
Late-Arriving Updates and Cursor Pagination:
Time-based pagination using updated_at seems simple: fetch records with updated_at greater than your last sync time.
But if the source system allows backdated updates, like editing an order timestamp to correct a billing error, those records have old updated_at values. Your incremental sync misses them.
Cursor- or token-based pagination is safer. The source API maintains state about what you have already fetched and returns an opaque token. The next request passes that token and gets only truly new records, regardless of timestamps.
Alternatively, run periodic full reconciliation syncs. Incremental syncs every 15 minutes for freshness, full syncs weekly to catch anything missed.
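A minimal cursor-pagination loop, with invented endpoint parameters (cursor, next_cursor); the caller should persist the returned cursor transactionally, as in the idempotency sketch above:

```python
import requests  # assumed HTTP client; endpoint and field names are illustrative

def fetch_new_records(api_url, saved_cursor=None):
    """Follow the server's opaque cursor until it reports nothing newer."""
    records, cursor = [], saved_cursor
    while True:
        params = {"cursor": cursor} if cursor else {}
        page = requests.get(api_url, params=params).json()
        records.extend(page["records"])
        next_cursor = page.get("next_cursor")
        if not next_cursor:
            # Refetching the final page on the next run is absorbed by upserts.
            return records, cursor
        cursor = next_cursor
```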
Webhook Out-of-Order Delivery:
Webhooks are fire-and-forget. The source sends event A, then event B. But network routing or retries can deliver B before A. If you process naively, you might apply updates in the wrong order and end up with incorrect final state.
You need sequence numbers or timestamps in payloads. Buffer events and reorder before applying, or design your system to be commutative, where order does not matter. For analytics, appending events to a log is naturally commutative. For key-value updates, use last-write-wins with wall-clock timestamps, accepting that clock skew can still cause issues.
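One way to buffer and reorder before applying, assuming each webhook payload carries an entity_id and a per-entity sequence number seq that starts at 1 and never repeats (uniqueness also keeps the heap ordering on integers):

```python
import heapq

class EventReorderer:
    """Hold out-of-order webhook events and release them in sequence order."""

    def __init__(self):
        self.next_seq = {}  # entity_id -> next sequence number we expect
        self.buffers = {}   # entity_id -> min-heap of (seq, event)

    def receive(self, event):
        entity, seq = event["entity_id"], event["seq"]
        heap = self.buffers.setdefault(entity, [])
        heapq.heappush(heap, (seq, event))
        expected = self.next_seq.get(entity, 1)
        # Release every event whose turn has come; keep buffering the rest.
        while heap and heap[0][0] == expected:
            _, ready = heapq.heappop(heap)
            self.apply(ready)
            expected += 1
        self.next_seq[entity] = expected

    def apply(self, event):
        print("applying in order:", event)  # replace with the real write
```

A production version would also need to evict buffers whose missing event never arrives, typically after a timeout.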
💡 Key Takeaways
✓ Rate limit death spirals occur when horizontal scaling increases error rates; fix with centralized quota tracking and dynamic capacity allocation across workers
✓ Schema drift causes partial batch failures; land raw JSON first, transform later with separate validation jobs to isolate ingestion from schema evolution
✓ Idempotency requires server-side job identifiers for async patterns or atomic cursor updates for pull-based patterns to prevent duplicate writes on retry
✓ Large objects exceeding API payload limits (typically a few megabytes) require hybrid ingestion: upload via SFTP or presigned URLs, trigger processing via a lightweight API call
📌 Examples
1. Fivetran uses a global scheduler tracking rate limit state per customer token, shifting capacity when one customer nears their 1000 requests per hour limit
2. Segment lands raw JSON events even with unknown fields, runs separate transformation jobs that log schema mismatches without blocking ingestion
3. Bloomreach job submissions are idempotent: submitting the same batch twice with the same job identifier returns the existing job instead of creating duplicates
4. Time-based pagination with updated_at misses backdated updates; cursor-based pagination or periodic full reconciliation syncs catch late arrivals