How YARN Schedules and Allocates Resources
The Two-Level Scheduling Model: YARN uses an elegant separation of concerns. The global ResourceManager handles cluster-wide fairness and capacity allocation, while each job's ApplicationMaster makes fine-grained scheduling decisions within its allocated resources. This division lets frameworks like Spark implement their own task placement logic while YARN enforces inter-application policies.
When you submit a job, the ResourceManager first finds space to launch an ApplicationMaster container, typically 1 to 2 GB of memory and 1 CPU core. Once it is running, the ApplicationMaster requests containers for the actual work; for a Spark job, these are executor containers. Each request specifies resources such as 4 vcores and 16 GB of memory, plus optional preferences such as node-locality hints to run near the Hadoop Distributed File System (HDFS) blocks it will read.
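To make the request shape concrete, here is a minimal sketch of the ApplicationMaster side of that negotiation using Hadoop's AMRMClient API. The executor sizing mirrors the 4-vcore, 16 GB example above; the request count is illustrative, and registration details, heartbeat handling, and launching allocated containers through NMClient are omitted.

```java
// Minimal sketch of an ApplicationMaster asking YARN for executor-sized
// containers. Sizes mirror the 4-vcore / 16 GB example in the text.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExecutorRequester {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> amRmClient = AMRMClient.createAMRMClient();
        amRmClient.init(new YarnConfiguration());
        amRmClient.start();
        amRmClient.registerApplicationMaster("", 0, "");  // host, port, tracking URL

        Resource executorSize = Resource.newInstance(16 * 1024, 4); // 16 GB, 4 vcores
        Priority priority = Priority.newInstance(1);

        // Ask for 10 identical executor containers anywhere in the cluster.
        for (int i = 0; i < 10; i++) {
            amRmClient.addContainerRequest(
                new ContainerRequest(executorSize, null, null, priority));
        }
        // Allocated containers arrive asynchronously via amRmClient.allocate(...).
    }
}
```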
Scheduler Types and Queue Management: YARN offers multiple scheduler implementations. The Capacity Scheduler divides cluster resources into queues with guaranteed minimum capacities and configurable maximum limits. For example, you might configure a production ETL queue at 40 percent minimum, data science at 30 percent, and ad hoc at 20 percent, with remaining capacity shared elastically.
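For reference, a queue layout like the one above corresponds to Capacity Scheduler properties that normally live in capacity-scheduler.xml. The sketch below lists plausible keys programmatically; the queue names, the 70 percent maximums, and the extra default queue (sibling capacities typically must sum to 100) are illustrative assumptions rather than taken from the text.

```java
// Sketch of the Capacity Scheduler properties behind the queue split above.
// In a real cluster these live in capacity-scheduler.xml; names and maximums
// here are illustrative.
import org.apache.hadoop.conf.Configuration;

public class QueueLayout {
    public static Configuration capacityQueues() {
        Configuration conf = new Configuration(false);
        conf.set("yarn.scheduler.capacity.root.queues", "etl,datascience,adhoc,default");

        conf.set("yarn.scheduler.capacity.root.etl.capacity", "40");          // guaranteed minimum
        conf.set("yarn.scheduler.capacity.root.etl.maximum-capacity", "70");  // elastic ceiling

        conf.set("yarn.scheduler.capacity.root.datascience.capacity", "30");
        conf.set("yarn.scheduler.capacity.root.datascience.maximum-capacity", "70");

        conf.set("yarn.scheduler.capacity.root.adhoc.capacity", "20");
        conf.set("yarn.scheduler.capacity.root.adhoc.maximum-capacity", "100");

        // Percent capacities of sibling queues generally must sum to 100,
        // so a small default queue absorbs the remainder in this sketch.
        conf.set("yarn.scheduler.capacity.root.default.capacity", "10");
        return conf;
    }
}
```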
The Fair Scheduler aims to equalize resource distribution among active users or applications over time. If one user submitted jobs early and consumed most resources, newly arriving jobs from other users gradually receive a larger share as the scheduler rebalances. Both schedulers support preemption: when a high-priority queue needs resources but they're held by lower-priority jobs, the scheduler can kill those containers to free space.
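As a rough sketch of how that mode is enabled, the yarn-site.xml properties below switch the ResourceManager to the Fair Scheduler and turn on preemption; per-queue weights and preemption timeouts would live in the fair-scheduler allocation file, which is not shown here.

```java
// Sketch: pointing the ResourceManager at the Fair Scheduler and enabling
// preemption (yarn-site.xml properties, shown programmatically).
import org.apache.hadoop.conf.Configuration;

public class FairSchedulerSetup {
    public static Configuration fairScheduling() {
        Configuration conf = new Configuration(false);
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        conf.setBoolean("yarn.scheduler.fair.preemption", true); // reclaim from over-served apps
        return conf;
    }
}
```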
Data Locality Optimization: YARN's killer feature for Hadoop workloads is data-locality awareness. When requesting containers, the ApplicationMaster can specify node-local preferences: in effect, "I want this container on node47 because the HDFS block I need to process is stored there." The scheduler tries to honor these hints, falling back to rack-local, then to any node, if local placement isn't possible within a timeout.
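A hedged sketch of what such a hint looks like through the AMRMClient API: the node name comes from the example above, while the rack string and the relaxLocality flag are illustrative. With relaxation enabled, the scheduler can downgrade the request to rack-local and then to any node after the configured delay.

```java
// Sketch: an ApplicationMaster asking for a container near a specific HDFS
// block replica. The rack name is hypothetical.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityHint {
    public static ContainerRequest nearBlockReplica() {
        Resource size = Resource.newInstance(16 * 1024, 4);  // 16 GB, 4 vcores
        String[] nodes = {"node47"};                          // hosts holding the block
        String[] racks = {"/rack12"};                         // their racks (illustrative)
        return new ContainerRequest(size, nodes, racks, Priority.newInstance(1),
                                    true /* relaxLocality: allow rack-local / any-node fallback */);
    }
}
```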
This matters tremendously for performance. Reading an HDFS block from the same machine's local disk delivers around 100 MB per second. Reading it over the network from another rack adds latency and consumes network bandwidth, dropping throughput to perhaps 50 MB per second while creating congestion for other jobs. For a job processing terabytes, the difference between 80 percent and 40 percent node locality can mean doubling runtime from 15 to 30 minutes.
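As a back-of-the-envelope check, the sketch below blends those two per-stream rates at 80 percent and 40 percent locality for a hypothetical 10 GB-per-task input. The simple blend alone accounts for roughly a 1.3x gap; the rest of the observed slowdown comes from the cross-rack congestion mentioned above, which this model deliberately ignores.

```java
// Back-of-the-envelope: blended read throughput at two locality fractions,
// using the 100 MB/s local and 50 MB/s remote figures from the text.
public class LocalityMath {
    static double blendedMBps(double localFraction) {
        return localFraction * 100.0 + (1.0 - localFraction) * 50.0;
    }

    public static void main(String[] args) {
        double dataPerTaskGB = 10.0;  // hypothetical input per executor
        for (double locality : new double[] {0.8, 0.4}) {
            double mbps = blendedMBps(locality);
            double minutes = dataPerTaskGB * 1024.0 / mbps / 60.0;
            System.out.printf("locality %.0f%% -> %.0f MB/s, ~%.1f min per 10 GB task%n",
                              locality * 100, mbps, minutes);
        }
    }
}
```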
1. Client submits: the job is sent to the ResourceManager with resource requirements and a queue name (see the client-side sketch after this list)
2. ApplicationMaster launches: the ResourceManager finds a node and starts the AM container
3. Container requests: the AM negotiates with the ResourceManager for executor containers, factoring in data locality
4. NodeManagers execute: containers launch on the selected nodes, each with cgroup resource limits
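The client-side sketch below covers step 1 using Hadoop's YarnClient API. The application name, queue, and ApplicationMaster sizing are illustrative, and the AM's ContainerLaunchContext (launch command, jars, environment) is left empty for brevity.

```java
// Sketch of step 1: a client creating and submitting an application to the
// ResourceManager. Steps 2-4 then happen on the cluster side.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("example-etl-job");
        ctx.setQueue("etl");                                 // capacity/fair queue name
        ctx.setResource(Resource.newInstance(2 * 1024, 1));  // AM container: 2 GB, 1 vcore
        ctx.setAMContainerSpec(Records.newRecord(ContainerLaunchContext.class));

        yarnClient.submitApplication(ctx);                   // RM now finds space for the AM
    }
}
```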
Scheduling decision latency: around 50 ms at the typical p99, rising to 2+ seconds when the scheduler is overloaded.
✓ In Practice: Production YARN clusters often configure scheduler parameters like yarn.scheduler.capacity.node-locality-delay, which delays the fallback to rack-local placement for a number of missed scheduling opportunities (in practice a few seconds) so node-local placement gets a chance first, balancing latency against locality quality.
Resource Isolation with Cgroups: YARN relies on Linux control groups (cgroups) to enforce resource limits. Each container runs in its own cgroup with hard limits on memory and CPU shares. If a container tries to exceed its memory allocation, the kernel's Out Of Memory (OOM) killer terminates it immediately. CPU is typically enforced as shares rather than hard caps, allowing containers to burst when idle capacity exists while still guaranteeing their minimum allocation under contention.
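For completeness, here is a plausible yarn-site.xml sketch of the NodeManager settings that place containers into cgroups through the LinuxContainerExecutor. Exact property names and defaults vary across Hadoop versions, so treat these keys as illustrative.

```java
// Sketch of NodeManager cgroup settings (normally in yarn-site.xml).
// With strict resource usage off, CPU stays share-based so containers can
// burst into idle cores while keeping their guaranteed minimum.
import org.apache.hadoop.conf.Configuration;

public class CgroupIsolation {
    public static Configuration nodeManagerCgroups() {
        Configuration conf = new Configuration(false);
        conf.set("yarn.nodemanager.container-executor.class",
                 "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor");
        conf.set("yarn.nodemanager.linux-container-executor.resources-handler.class",
                 "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler");
        conf.set("yarn.nodemanager.linux-container-executor.cgroups.hierarchy", "/hadoop-yarn");
        conf.setBoolean("yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage",
                        false);
        return conf;
    }
}
```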
💡 Key Takeaways
✓ Two-level scheduling separates cluster fairness, managed by the ResourceManager, from application-specific task placement handled by each ApplicationMaster, providing both global coordination and framework flexibility
✓ Data-locality optimization reduces job runtime significantly: achieving 80 percent node locality versus 40 percent can cut a terabyte-scale ETL job from 30 minutes to 15 minutes by avoiding network transfers
✓ Schedulers support preemption to reclaim resources from low-priority jobs when high-priority queues need capacity, with typical p99 scheduling latencies under 50 ms rising to multiple seconds under overload
✓ Queue-based capacity management allows organizations to guarantee minimum resources per team while sharing unused capacity elastically, for example 40 percent for ETL and 30 percent for data science
✓ Cgroups provide hard isolation: containers exceeding memory limits are killed immediately by the OOM killer, while CPU shares ensure minimum guarantees with burst capability when the cluster has idle cores
📌 Examples
1. Submitting a Spark job requests 100 executor containers, each with 4 vcores and 16 GB of memory, plus locality preferences to run near the HDFS blocks containing the input data
2. A Capacity Scheduler configuration might allocate a 40 percent minimum to the production queue and 30 percent to the data science queue, with both allowed to grow to a 70 percent maximum when other queues are idle
3. Node-local HDFS reads achieve around 100 MB per second, while rack-local reads drop to around 50 MB per second, making locality the difference between 15-minute and 30-minute completion for terabyte-scale jobs