
Implementation Patterns: Per-Core Sharding and Thread Pool Sizing

Per-Core Sharding

Per-core sharding assigns each CPU core its own dedicated resources: one thread, its own data partition, and its own network connections. Instead of multiple threads competing for shared data structures, each core operates independently on its slice of the work.

This pattern eliminates lock contention entirely. With shared data, threads spend time acquiring locks, waiting for other threads to release them, and bouncing cache lines between cores. Per-core sharding trades memory efficiency (duplicated data structures) for throughput: a system might use 8x more memory but achieve 5-10x higher throughput on an 8-core machine.

Implementing per-core sharding requires partitioning incoming requests: hash the request key (user ID, connection ID, or message ID) and route the request to the corresponding core. This ensures all operations for a given key hit the same core, enabling sequential processing without locks.
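A minimal sketch of this routing in Python; the `shard_for` helper and shard count are illustrative, not a library API. A stable hash is used deliberately, since Python's built-in `hash()` is salted per process for strings and would scatter the same key across shards after a restart:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a request key to a core/shard index."""
    # Stable across processes and restarts, unlike the salted built-in hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every operation on the same key routes to the same shard, so that
# shard can process the key's operations sequentially, without locks.
assert shard_for("user:42", 8) == shard_for("user:42", 8)
```

In a real system the shard index would select a per-core queue or thread; the routing step itself is all that needs to be shared.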

Thread Pool Sizing

For CPU-bound work, set the pool size equal to the core count. More threads waste time context switching; fewer threads leave cores idle. On an 8-core machine, 8 threads achieve maximum throughput for pure computation.

For I/O-bound work, use the formula threads = cores × (1 + wait_time / service_time). If requests spend 100ms waiting for database responses and 10ms doing CPU work, the wait/service ratio is 10. On 8 cores: 8 × (1 + 10) = 88 threads keep all cores busy while waiting for I/O.
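The formula can be captured directly; `pool_size` here is an illustrative helper, not a standard-library function:

```python
def pool_size(cores: int, wait_ms: float, service_ms: float) -> int:
    """threads = cores * (1 + wait_time / service_time)"""
    return round(cores * (1 + wait_ms / service_ms))

# The example from the text: 100 ms waiting on the database, 10 ms of CPU work.
assert pool_size(8, wait_ms=100, service_ms=10) == 88
# With no waiting (pure CPU-bound work) the formula collapses to one thread per core.
assert pool_size(8, wait_ms=0, service_ms=10) == 8
```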

Mixed workloads need separate pools. A single pool handling both CPU-intensive image processing and I/O-bound database queries leads to head-of-line blocking: fast I/O tasks wait behind slow CPU tasks. Split into a small compute pool (sized to the core count) and a larger I/O pool (sized by the wait ratio).
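A sketch of split pools using Python's `concurrent.futures`; the pool sizes (assuming a wait/service ratio of about 10) and the stand-in task functions are illustrative:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 4

# Small pool for CPU-intensive work, larger pool for blocking I/O.
compute_pool = ThreadPoolExecutor(max_workers=cores)
io_pool = ThreadPoolExecutor(max_workers=cores * 11)

def process_image(pixels):        # stand-in for CPU-intensive work
    return sum(pixels)

def fetch_record(key):            # stand-in for a blocking database query
    time.sleep(0.01)
    return {"key": key}

# Routing by task type means fast I/O tasks never queue behind slow CPU tasks.
cpu_result = compute_pool.submit(process_image, [1, 2, 3]).result()
io_result = io_pool.submit(fetch_record, "user:42").result()
```

Note that CPython's GIL limits true CPU parallelism in threads; for genuinely CPU-bound work the compute pool would typically be a `ProcessPoolExecutor` instead.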

Work Stealing

Work stealing balances load across thread pool workers. Each worker maintains its own queue of tasks. When a worker drains its queue, it steals tasks from other workers' queues. This provides automatic load balancing without central coordination.

The key is the stealing pattern: workers steal from the opposite end of another worker's queue. If the owner processes tasks from the head, thieves steal from the tail. This minimizes contention between the owner and thieves.
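The access pattern can be sketched with a double-ended queue. Note this is a sketch of the owner/thief discipline only; a production work-stealing deque (e.g. the Chase-Lev design) uses lock-free atomics rather than a plain `collections.deque`:

```python
from collections import deque

class WorkerQueue:
    """Owner pushes and pops at the head; thieves steal from the tail."""
    def __init__(self):
        self._dq = deque()

    def push(self, task):    # owner side
        self._dq.appendleft(task)

    def pop(self):           # owner side: LIFO from the head (cache-warm tasks)
        return self._dq.popleft() if self._dq else None

    def steal(self):         # thief side: FIFO from the far end
        return self._dq.pop() if self._dq else None

q = WorkerQueue()
for task in ("a", "b", "c"):
    q.push(task)
assert q.pop() == "c"     # owner gets its most recently pushed task
assert q.steal() == "a"   # thief takes the oldest task from the opposite end
```

Because the owner and thieves operate on opposite ends, they only conflict when the queue is nearly empty.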

Process Pool Patterns

Pre-fork pools spawn workers at startup and reuse them for multiple requests. This amortizes the 1-10ms process creation cost across many requests. Workers that exceed memory limits or request counts are terminated and replaced.

Worker recycling prevents memory leaks from accumulating. Configure workers to restart after handling 1000 requests or exceeding 500MB of memory. The brief unavailability during a restart is preferable to eventual memory exhaustion.
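A sketch of the loop such a worker might run; the thresholds and the `rss_mb` callback (which in a real worker would read resident memory, e.g. via `resource.getrusage`) are illustrative:

```python
import itertools

def worker_loop(requests, handle, max_requests=1000, max_rss_mb=500,
                rss_mb=lambda: 0):
    """Handle requests until a recycling threshold trips, then return so
    the pool manager can replace this worker with a fresh process."""
    handled = 0
    for req in requests:
        handle(req)
        handled += 1
        # Recycle on whichever threshold trips first: request count or memory.
        if handled >= max_requests or rss_mb() > max_rss_mb:
            break
    return handled

# The worker stops after 3 requests even though work remains,
# bounding whatever memory a per-request leak can accumulate.
assert worker_loop(itertools.count(), handle=lambda r: None, max_requests=3) == 3
```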

✅ Best Practice: Start with simple thread pools sized to the core count. Measure actual wait-time and service-time ratios before tuning. Add per-core sharding only when lock contention shows up in profiling.
💡 Key Takeaways
Per-core sharding eliminates lock contention by giving each core its own data partition; trades ~8x memory for 5-10x throughput
CPU-bound pools: threads = core count; I/O-bound pools: threads = cores × (1 + wait_time / service_time)
Mixed workloads need separate pools; single pools cause head-of-line blocking between fast I/O and slow CPU tasks
Work stealing balances load automatically; workers steal from queue tails to minimize contention with queue owners
Pre-fork pools amortize the 1-10ms creation cost; configure worker recycling at request-count or memory thresholds
📌 Interview Tips
1. When profiling shows lock contention above 10% of CPU time, consider per-core sharding as a solution
2. To size I/O pools, measure actual database response times: a 50ms average wait with 5ms of compute suggests a pool of cores × 11
3. For long-running services, recommend worker recycling at 1000 requests or 500MB, whichever comes first, to contain leaks