Document Databases and Wide Column Stores for High Write Throughput

Document vs Wide-Column Overview:

Document databases (MongoDB) and wide column stores (Cassandra) both handle massive scale, but their architectures optimize for different workloads. Choosing incorrectly leads to performance issues that cannot be fixed without migration.

MongoDB for Flexible Schemas:

MongoDB stores data as JSON like documents, making it ideal for flexible schemas that evolve rapidly. Uber uses MongoDB for driver profiles because each driver has different attributes: some have vehicle inspection data, others have background check documents, and new fields get added weekly during feature development. MongoDB allows querying nested fields efficiently and supports secondary indexes for flexible access patterns. Write throughput reaches 50,000 operations per second, with latency typically 10 to 50ms. The limitation: MongoDB sharding requires careful shard key selection, and poor choices create hot spots where one shard receives 80% of traffic while others sit idle.

Cassandra for Write Throughput:

Cassandra optimizes for extreme write throughput with eventual consistency. Netflix writes 2.5 trillion records per day to Cassandra for viewing history, achieving over 1 million writes per second across clusters. Cassandra's log structured merge tree architecture makes writes incredibly fast (1 to 10ms) because data appends to memory and disk sequentially. The trade-off: reads are slower (10 to 100ms) because data spreads across multiple files requiring compaction, and you must design your data model around query patterns upfront. Cassandra has no joins, no secondary indexes without performance penalties, and limited ad hoc querying.

Decision Criteria:

The decision point: if you need flexible querying with reasonable write speed and your schema changes frequently, choose MongoDB. If writes vastly outnumber reads (logging, time series, messaging) and you can design your data model around known access patterns, Cassandra delivers unmatched write throughput. Instagram uses both: MongoDB for user profiles (flexible schema, moderate scale), Cassandra for user feeds (billions of writes daily, simple access by user_id).

💡 Key Takeaways

✓Schema flexibility versus query predictability: MongoDB allows adding fields without migration and supports ad hoc queries, Cassandra requires defining complete data model upfront and only efficiently queries by primary key

✓Write throughput differs by 20x: Cassandra handles 1 million writes per second via log structured merge trees and no transaction overhead, MongoDB reaches 50,000 writes per second with ACID on single documents

✓Read performance inverts: MongoDB reads typically 5 to 50ms with indexes, Cassandra reads 10 to 100ms because data scatters across SSTables (sorted string tables) requiring compaction and multiple disk seeks

✓Sharding complexity: MongoDB auto sharding can create hot partitions if shard key poorly chosen (monotonically increasing IDs), Cassandra distributes via consistent hashing but requires manual partition key design

✓Consistency models: MongoDB offers tunable consistency with default strong consistency for single document writes, Cassandra defaults to eventual consistency requiring quorum reads for strong consistency at 2x latency cost

📌 Interview Tips

1Discord migrated from MongoDB to ScyllaDB (Cassandra compatible) for message storage: 100 million users generating append only messages, MongoDB sharding operational burden too high, ScyllaDB handles petabyte scale message history with predictable performance

2Uber architecture uses both: MongoDB for trip metadata and driver profiles (schema evolves weekly, complex queries for analytics), separate Cassandra cluster for location tracking (millions of GPS coordinates per second, query only by trip_id)

3Adobe uses MongoDB for customer analytics: each customer event has different properties based on product, need to query across multiple fields for reporting, write volume 10,000 per second within MongoDB capabilities

← Back to Choosing Databases by Use Case Overview