HDFS vs Object Storage for Cloud and Hybrid Workloads
Traditional HDFS co-locates compute and storage on the same nodes, achieving data locality: map tasks read blocks from local disks at 100 to 300 MB/s without network overhead. This maximizes shuffle performance and reduces cross-rack bandwidth, critical for on-premises clusters with oversubscribed aggregation links (often 4:1 or 10:1 oversubscription). The trade-off is operational complexity: managing DataNode failures, balancing storage, and scaling compute and storage together even when workloads need different ratios.
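A back-of-envelope sketch makes the locality argument concrete. The NIC speed and disk throughput below are illustrative assumptions (10 GbE, mid-range of the 100 to 300 MB/s figure above), not measurements:

```python
# Sketch: why data locality matters under oversubscribed aggregation links.
# Worst case: every node in a rack reads across racks simultaneously, so each
# node gets NIC bandwidth divided by the oversubscription ratio.

def cross_rack_bandwidth_mb_s(nic_mb_s: float, oversubscription: float) -> float:
    """Worst-case per-node cross-rack bandwidth under full contention."""
    return nic_mb_s / oversubscription

LOCAL_DISK = 200.0  # MB/s, mid-range of the 100-300 MB/s quoted above (assumed)
NIC = 1250.0        # MB/s, i.e. 10 GbE (assumed)

for ratio in (1, 4, 10):
    bw = cross_rack_bandwidth_mb_s(NIC, ratio)
    verdict = "slower" if bw < LOCAL_DISK else "faster"
    print(f"{ratio}:1 oversubscription -> {bw:.0f} MB/s per node ({verdict} than local disk)")
```

Under these assumptions a 10:1 oversubscribed fabric drops per-node cross-rack bandwidth to 125 MB/s, below a single local disk; reading locally sidesteps the contention entirely.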
Cloud-native architectures decouple compute and storage by using object stores like Amazon Simple Storage Service (S3) instead of HDFS. Compute clusters (Elastic MapReduce, Databricks, Snowflake) read and write data directly from S3 over the network. This simplifies operations: no DataNode management, storage scales independently of compute, and durability is managed by the cloud provider (S3 offers 99.999999999% durability). The cost is latency and bandwidth: S3 single-request p50 latency is typically 10 to 30 ms versus sub-millisecond for local HDFS disk reads. Large sequential reads mitigate this (multipart parallel downloads achieve 1 to 10 GB/s depending on instance network), but job startup and shuffle phases often see 20 to 50% latency increases compared to data-local HDFS.
Hybrid patterns combine both: store raw immutable data in object storage for durability and elasticity, cache hot datasets in ephemeral HDFS or memory (Alluxio, formerly Tachyon) for performance, and write intermediate shuffle data to local disks. This balances operational simplicity with performance. For example, a Spark job reading training data from S3 might cache it in memory across executors for iterative machine learning, avoiding repeated S3 fetches that would add minutes to hours of latency per iteration.
💡 Key Takeaways
• Data locality in HDFS: tasks read from local disks at 100 to 300 MB/s with zero network overhead. Object storage requires network reads at 10 to 100 MB/s depending on instance type, reducing throughput by 50 to 90% for small instances.
• S3 durability is 99.999999999% (11 nines) across multiple Availability Zones, eliminating DataNode failure management. HDFS durability with replication factor 3 (RF=3) depends on cluster size and maintenance, typically 99.999% to 99.9999% (5 to 6 nines) with vigilant operations.
• Cost comparison: S3 storage is approximately $0.023 per GB per month (standard tier). On-premises HDFS including disks, servers, power, and ops is approximately $0.02 to $0.05 per GB per month depending on scale and utilization. S3 is cheaper for bursty or low-utilization workloads; HDFS is cheaper at sustained high utilization.
• Job latency impact: Presto queries on S3 show 20 to 40% higher p95 latency than HDFS in benchmarks due to network round trips. Mitigation includes columnar formats (Parquet, ORC) that minimize I/O, predicate pushdown, and caching layers.
• Elastic scaling: with object storage, spin up 1000 compute nodes in 5 minutes for a batch job, process terabytes, then terminate. HDFS requires pre-provisioned DataNodes; adding capacity takes hours to days including hardware, network setup, and data rebalancing.
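The cost takeaway hinges on one asymmetry: S3 bills only stored bytes, while an on-premises cluster's cost is paid on provisioned capacity whether it is full or not. A back-of-envelope sketch using the rates quoted above (the $0.02/GB-month all-in HDFS rate is an assumption at the low end of the stated range):

```python
# Sketch: storage cost breakeven between S3 and on-prem HDFS.
# S3 bills stored bytes; HDFS cost is incurred on provisioned capacity.

def s3_monthly_cost(stored_gb: float, rate: float = 0.023) -> float:
    """S3 standard tier: pay only for what is stored."""
    return stored_gb * rate

def hdfs_monthly_cost(provisioned_gb: float, rate: float = 0.02) -> float:
    """All-in on-prem rate (assumed low end of $0.02-0.05/GB-month),
    paid regardless of how full the cluster is."""
    return provisioned_gb * rate

PROVISIONED = 1000.0  # GB of usable HDFS capacity (illustrative)
for utilization in (0.3, 0.6, 0.9):
    stored = PROVISIONED * utilization
    s3, hdfs = s3_monthly_cost(stored), hdfs_monthly_cost(PROVISIONED)
    winner = "S3" if s3 < hdfs else "HDFS"
    print(f"{utilization:.0%} full: S3 ${s3:.2f} vs HDFS ${hdfs:.2f} -> {winner}")
```

Under these assumed rates the crossover sits near 87% utilization (0.02 / 0.023), matching the takeaway that HDFS only wins at sustained high utilization.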
📌 Examples
Amazon EMR users commonly run transient clusters reading input from S3, processing with Spark or Hive, and writing results back to S3. This eliminates DataNode management and enables on-demand scaling. Typical job overhead is 30 to 60 seconds for S3 metadata listing at startup versus near zero with HDFS locality.
Uber's data lake uses a hybrid approach: Parquet files in HDFS for hot analytics data accessed hourly, with older partitions archived to S3 after 7 days. This balances performance for recent data with cost efficiency for cold storage, reducing total storage costs by approximately 40%.
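The tiering policy in this example is a simple age rule. A minimal sketch of that rule, with illustrative names (the 7-day window comes from the example above):

```python
# Sketch: age-based tiering rule - daily partitions newer than the hot window
# live in HDFS; older partitions are archived to S3.

from datetime import date, timedelta

HOT_WINDOW_DAYS = 7

def storage_tier(partition_date: date, today: date) -> str:
    """Return which tier a daily partition should live in under the age rule."""
    age_days = (today - partition_date).days
    return "hdfs" if age_days < HOT_WINDOW_DAYS else "s3"

today = date(2024, 3, 15)
for days_old in (0, 6, 7, 30):
    d = today - timedelta(days=days_old)
    print(f"{d} ({days_old}d old): {storage_tier(d, today)}")
```

In practice this kind of rule runs as a scheduled job that moves whole partitions, so recent hourly queries stay on data-local HDFS while cold reads pay the S3 latency only occasionally.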