Object Storage & Blob Storage • Multipart Uploads & Resumable Transfers
Production Implementation Patterns and SLOs
Building production-grade multipart and resumable upload systems requires careful orchestration of control-plane logic, data-plane streaming, adaptive algorithms, and comprehensive observability. The control plane must provide session lifecycle management: an initiate call returns an opaque session identifier or dedicated upload URL with a time-to-live (TTL) for automatic cleanup. A status probe endpoint answers "which parts do I have?" or "what is the last committed byte?" to enable safe resume after ambiguous failures. Explicit finalize and abort operations complete or discard the upload, with idempotent semantics so that duplicate calls do not double-publish or leak temporary data.
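A minimal sketch of that control-plane surface, assuming an in-memory session store; the names (`ControlPlane`, `UploadSession`, the 24-hour TTL) are illustrative, not any particular vendor's API:

```python
import time
import uuid
from dataclasses import dataclass, field

SESSION_TTL_SECONDS = 24 * 3600  # assumption: 24 h TTL before orphan cleanup

@dataclass
class UploadSession:
    session_id: str
    total_size: int
    created_at: float
    state: str = "active"  # active | finalized | aborted
    parts: dict[int, str] = field(default_factory=dict)  # part number -> checksum

class ControlPlane:
    def __init__(self) -> None:
        self._sessions: dict[str, UploadSession] = {}

    def initiate(self, total_size: int) -> str:
        """Initiate: return an opaque session ID; the TTL drives cleanup."""
        sid = uuid.uuid4().hex
        self._sessions[sid] = UploadSession(sid, total_size, time.time())
        return sid

    def status(self, sid: str) -> dict:
        """Status probe: answers 'which parts do I have?' for safe resume."""
        s = self._sessions[sid]
        return {"state": s.state, "committed_parts": sorted(s.parts)}

    def finalize(self, sid: str) -> None:
        """Idempotent finalize: a duplicate call is a no-op, never a double publish."""
        s = self._sessions[sid]
        if s.state == "finalized":
            return
        if s.state == "aborted":
            raise RuntimeError("cannot finalize an aborted session")
        s.state = "finalized"

    def abort(self, sid: str) -> None:
        """Idempotent abort: duplicate calls do not leak temporary data."""
        s = self._sessions.get(sid)
        if s is None or s.state == "aborted":
            return
        s.state = "aborted"
        s.parts.clear()  # discard staged parts

    def sweep_expired(self) -> None:
        """TTL cleanup: abort active sessions older than the TTL."""
        now = time.time()
        for s in list(self._sessions.values()):
            if s.state == "active" and now - s.created_at > SESSION_TTL_SECONDS:
                self.abort(s.session_id)
```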
The client scheduler decides part or chunk size based on object size, server limits, and expected link quality. Production defaults range from 4 to 16 MB on mobile networks to 64 to 256 MB on stable data center links. A bounded worker pool (8 to 32 concurrent part uploads per file) with adaptive concurrency control increases parallelism when success rates are high and latency is low, and decreases it on errors or rising latency. A durable manifest persists session metadata (ID, total size, part size, completed parts with checksums, aggregate progress) after each part completion, enabling cross-restart resume.
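A sketch of those three scheduler pieces, using the defaults quoted above; the 10,000-part ceiling matches S3's documented limit, and the AIMD-style control rule is one reasonable choice, not the only one:

```python
import json
import os

MOBILE_PART_SIZE = 8 * 2**20        # within the 4-16 MB mobile range
DATACENTER_PART_SIZE = 128 * 2**20  # within the 64-256 MB data center range

def choose_part_size(object_size: int, on_mobile: bool,
                     max_parts: int = 10_000) -> int:
    """Pick a part size from the link profile, then grow it until the
    object fits within the server's part-count limit."""
    part = MOBILE_PART_SIZE if on_mobile else DATACENTER_PART_SIZE
    while object_size / part > max_parts:
        part *= 2
    return part

class AdaptiveConcurrency:
    """AIMD-style control: add a worker on clean successes, halve the pool
    on errors or rising latency, bounded to the 8-32 range above."""
    def __init__(self, low: int = 8, high: int = 32) -> None:
        self.low, self.high = low, high
        self.workers = low

    def on_part_done(self, ok: bool, latency_rising: bool) -> None:
        if ok and not latency_rising:
            self.workers = min(self.high, self.workers + 1)  # additive increase
        else:
            self.workers = max(self.low, self.workers // 2)  # multiplicative decrease

def persist_manifest(path: str, manifest: dict) -> None:
    """Write the manifest atomically after each part completion so a crash
    mid-write never corrupts the resume state."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on POSIX
```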
Observability and service level objectives (SLOs) are critical for operational health. Track per-upload metrics: start-to-commit latency, effective throughput in megabytes per second, retry rate, 95th and 99th percentile part duration, bytes wasted due to retries, orphaned session count and age, and finalize failure rate. Alert on growing orphan backlogs or rising 5xx error rates. For cost-aware systems, log request counts per upload and compute cost per terabyte transferred to inform part sizing decisions.
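One way to gather the per-upload metrics listed above before emitting them to a metrics backend; the field names and the nearest-rank percentile are illustrative, not a specific vendor's API:

```python
from dataclasses import dataclass, field

@dataclass
class UploadMetrics:
    started_at: float = 0.0
    committed_at: float = 0.0
    bytes_total: int = 0
    bytes_retried: int = 0  # "bytes wasted due to retries"
    retries: int = 0
    part_durations: list[float] = field(default_factory=list)
    finalize_failed: bool = False

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over part durations (for p95/p99)."""
        xs = sorted(self.part_durations)
        idx = min(len(xs) - 1, int(p / 100 * len(xs)))
        return xs[idx] if xs else 0.0

    def summary(self) -> dict:
        elapsed = self.committed_at - self.started_at
        return {
            "start_to_commit_s": elapsed,
            "throughput_mb_s": (self.bytes_total / 2**20) / elapsed if elapsed > 0 else 0.0,
            "retry_rate": self.retries / max(1, len(self.part_durations)),
            "p95_part_s": self.percentile(95),
            "p99_part_s": self.percentile(99),
            "wasted_bytes": self.bytes_retried,
            "finalize_failed": self.finalize_failed,
        }
```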
💡 Key Takeaways
•Control plane: Initiate session with TTL, provide status probe for parts or byte offset, expose idempotent finalize and abort operations
•Client scheduler: Choose part size based on object size and limits (4-16 MB mobile, 64-256 MB data center), use 8-32 concurrent workers with adaptive scaling
•Durable manifest: Persist session ID, total size, part size, completed parts with checksums after each completion for cross-restart resume
•Observability SLOs: Track start-to-commit latency, throughput (MB/s), retry rate, p95/p99 part duration, bytes wasted, orphan count, finalize failure rate
•Cost awareness: Log request counts per upload; 10 TB with 128 MB parts = 81,920 PUTs ($0.41) vs 16 MB parts = 655,360 PUTs ($3.28); tune sizing for cost and resilience trade-offs (worked out in the sketch after this list)
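The cost arithmetic from the last takeaway, reproduced directly; the $0.005 per 1,000 PUTs price is an assumption matching the commonly cited S3 Standard rate, so check your tier:

```python
def put_request_cost(total_bytes: int, part_size: int,
                     price_per_1k_puts: float = 0.005) -> tuple[int, float]:
    """Number of PUT requests and their dollar cost for a given part size."""
    parts = total_bytes // part_size
    return parts, parts / 1000 * price_per_1k_puts

TEN_TB = 10 * 2**40
print(put_request_cost(TEN_TB, 128 * 2**20))  # (81920, 0.4096)  -> ~$0.41
print(put_request_cost(TEN_TB, 16 * 2**20))   # (655360, 3.2768) -> ~$3.28
```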
📌 Examples
Amazon S3 SDK: Automatically manages multipart upload lifecycle, retries with exponential backoff, tracks parts in memory, exposes progress callbacks, and finalizes on completion
Google Cloud Storage Python client: Uses resumable uploads by default for files >8 MB, persists session URL, queries committed offset on retry, supports configurable chunk size (see the usage sketch after these examples)
Netflix video pipeline: Monitors upload SLOs (p99 latency <5 min for 10 GB assets, <1% finalize failures), alerts on orphan growth >1000 sessions, auto-aborts incomplete uploads after 14 days
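A minimal usage sketch for the GCS example above, using the google-cloud-storage Python client; bucket and object names are placeholders, and credentials are assumed to come from the environment:

```python
from google.cloud import storage

# Assumes application-default credentials are configured in the environment.
client = storage.Client()
bucket = client.bucket("example-bucket")        # placeholder bucket name
blob = bucket.blob("videos/asset-001.mp4")      # placeholder object key

# chunk_size must be a multiple of 256 KB; setting it makes the client use
# the resumable protocol and resume from the last committed offset on retry.
blob.chunk_size = 16 * 1024 * 1024  # 16 MB chunks
blob.upload_from_filename("asset-001.mp4")
```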