Implementation Patterns: Per-Core Sharding and Thread Pool Sizing
Per-Core Sharding
Per-core sharding assigns each CPU core its own dedicated resources: one thread, its own data partition, and its own network connections. Instead of multiple threads competing for shared data structures, each core operates independently on its slice of work.
This pattern eliminates lock contention entirely. With shared data, threads spend time acquiring locks, waiting for other threads to release locks, and synchronizing cache lines between cores. Per-core sharding trades memory efficiency (duplicated data structures) for throughput: a system might use 8x more memory but achieve 5-10x higher throughput on an 8-core machine.
Implementing per-core sharding requires partitioning incoming requests. Hash the request key (user ID, connection ID, or message ID) and route to the corresponding core. This ensures all operations for a given key hit the same core, enabling sequential processing without locks.
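The routing step can be sketched in Python. This is a minimal illustration, not a production router: the core count and key format are assumptions, and a stable hash is used deliberately, since Python's built-in hash() is randomized per process and would break cross-process routing.

```python
import hashlib

NUM_CORES = 8  # illustrative core count


def core_for_key(key: str, num_cores: int = NUM_CORES) -> int:
    # Stable hash: the same key always maps to the same core,
    # across restarts and across processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cores


# All operations for "user:42" route to one core, so its state
# can be mutated sequentially without locks on that core.
core = core_for_key("user:42")
```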
Thread Pool Sizing
For CPU-bound work, set pool size equal to core count. More threads waste time context switching; fewer threads leave cores idle. On an 8-core machine, 8 threads achieve maximum throughput for pure computation.
For I/O-bound work, use the formula: threads = cores × (1 + wait_time / service_time). If requests spend 100ms waiting for database responses and 10ms doing CPU work, the ratio is 10. On 8 cores: 8 × 11 = 88 threads keep all cores busy while waiting for I/O.
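Both sizing rules can be folded into one small helper, since the CPU-bound case is just the formula with zero wait time. A sketch (the function name and defaults are illustrative):

```python
import math
import os
from typing import Optional


def pool_size(wait_ms: float = 0.0, service_ms: float = 1.0,
              cores: Optional[int] = None) -> int:
    # threads = cores * (1 + wait_time / service_time)
    # wait_ms = 0 reduces to the CPU-bound case: one thread per core.
    cores = cores or os.cpu_count() or 1
    return math.ceil(cores * (1 + wait_ms / service_ms))


pool_size(wait_ms=100, service_ms=10, cores=8)  # → 88
pool_size(cores=8)                              # → 8
```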
Mixed workloads need separate pools. A single pool handling both CPU-intensive image processing and I/O-bound database queries leads to head-of-line blocking: fast I/O tasks wait behind slow CPU tasks. Split into a small compute pool (core count) and a larger I/O pool (based on wait ratios).
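One way to sketch the split with Python's standard concurrent.futures: two executors, routed by task type. The pool sizes, request shape, and the stand-in task functions are all illustrative assumptions (the I/O pool here assumes a 1:10 service-to-wait ratio).

```python
import os
from concurrent.futures import ThreadPoolExecutor

cores = os.cpu_count() or 1

# Small pool for CPU-heavy tasks, larger pool for blocking I/O,
# so slow compute never queues in front of fast I/O tasks.
compute_pool = ThreadPoolExecutor(max_workers=cores, thread_name_prefix="compute")
io_pool = ThreadPoolExecutor(max_workers=cores * 11, thread_name_prefix="io")


def process_image(req):
    # stand-in for CPU-heavy image processing
    return sum(x * x for x in range(10_000))


def query_db(req):
    # stand-in for a blocking database call
    return {"rows": [], "query": req["query"]}


def handle(req):
    # Route by task type into the appropriate pool.
    if req["kind"] == "image":
        return compute_pool.submit(process_image, req)
    return io_pool.submit(query_db, req)
```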
Work Stealing
Work stealing balances load across thread pool workers. Each worker maintains its own queue of tasks. When a worker drains its queue, it steals tasks from other workers' queues. This provides automatic load balancing without central coordination.
The key is the stealing pattern: workers steal from the opposite end of another worker's queue. If the owner processes from the head, thieves steal from the tail. This minimizes contention between the owner and thieves.
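The two-ended discipline can be sketched with a Python deque. This is a simplified model: real work-stealing deques (e.g. in Fork/Join-style runtimes) use lock-free operations rather than a mutex, but the head/tail split is the same.

```python
import threading
from collections import deque


class WorkerQueue:
    # Owner pushes and pops at the head (LIFO, cache-friendly);
    # thieves take from the tail (FIFO), so the two sides rarely
    # contend for the same task.
    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()  # real deques use lock-free CAS here

    def push(self, task):              # owner only
        with self._lock:
            self._tasks.appendleft(task)

    def pop(self):                     # owner only: newest task first
        with self._lock:
            return self._tasks.popleft() if self._tasks else None

    def steal(self):                   # thieves: oldest task first
        with self._lock:
            return self._tasks.pop() if self._tasks else None
```

The owner gets LIFO order (recently pushed tasks are hot in cache), while thieves get FIFO order (the oldest tasks, which tend to be the largest units of remaining work).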
Process Pool Patterns
Pre-fork pools spawn workers at startup and reuse them for multiple requests. This amortizes the 1-10ms process creation cost across many requests. Workers that exceed memory limits or request counts are terminated and replaced.
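The pre-fork shape can be sketched with os.fork on Unix (this is a bare-bones illustration, assuming a hypothetical serve() callback and no supervision or restart logic):

```python
import os

NUM_WORKERS = 4  # illustrative worker count


def prefork(serve):
    # Fork all workers once at startup; each child runs serve(),
    # handling many requests and amortizing the fork cost.
    pids = []
    for _ in range(NUM_WORKERS):
        pid = os.fork()
        if pid == 0:          # child process: serve, then exit
            serve()
            os._exit(0)
        pids.append(pid)      # parent keeps pids to supervise
    for pid in pids:          # a real pool would restart dead workers
        os.waitpid(pid, 0)
```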
Worker recycling prevents memory leaks from accumulating. Configure workers to restart after handling 1000 requests or exceeding 500 MB of memory. The brief unavailability during a restart is preferable to eventual memory exhaustion.
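A worker's recycling loop might look like the following sketch. The limits mirror the numbers above; get_request and handle are hypothetical callbacks, and the memory check uses getrusage, whose ru_maxrss units differ by platform.

```python
import resource
import sys

MAX_REQUESTS = 1000   # recycle after this many requests
MAX_RSS_MB = 500      # ...or once resident memory exceeds this


def _rss_mb() -> float:
    # ru_maxrss is reported in kilobytes on Linux, bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak / (1024 * 1024) if sys.platform == "darwin" else peak / 1024


def worker_loop(get_request, handle) -> int:
    # Serve requests until a recycling limit is hit, then return;
    # the pool supervisor is expected to start a replacement worker.
    handled = 0
    while handled < MAX_REQUESTS:
        handle(get_request())
        handled += 1
        if _rss_mb() > MAX_RSS_MB:
            break  # over the memory budget: stop and let the pool recycle us
    return handled
```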