This is a three-part series in which I discuss the OpenStack Swift object store and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.
PART 1: Amazon Dynamo review
Dynamo was described in a 2007 paper from Amazon. It is a distributed data store that Amazon developed to serve both its database and its storage use cases at the time. Yes, both, because Dynamo can be adapted to store data in ways that look more like databases or more like distributed file systems. It achieves this by taking a pre-existing “persistence engine”, such as Berkeley DB, MySQL, or a filesystem, and adding distributed-system sauce so that these services scale horizontally across hundreds of cheap servers while meeting SLAs on latencies at the 99.9th or higher percentile. This is absolutely remarkable because Dynamo transforms mature but non-distributed data stores (like Berkeley DB) into systems that scale horizontally, almost limitlessly.
Dynamo provides redundancy by storing multiple copies (N) of each data object on different physical servers, placed in independent fault domains. In practice this fault-domain requirement keeps the N copies in different Amazon data centers, making the probabilities of losing all N copies effectively independent of each other and practically minuscule. Read and write load is distributed across multiple servers by randomizing which servers are responsible for which data. Given the reasonable assumption that the number of frequently accessed data objects far exceeds the number of physical servers, load is then nicely balanced across the physical servers.
Dynamo achieves these properties by making two tradeoffs compared to more traditional data stores such as RDBMSs. The first tradeoff is that a read from Dynamo may not reflect the most recent write to the data. This is the consistency corner of the famous (C)onsistency, (A)vailability and (P)artition tolerance triangle of the CAP theorem in distributed computing. Applications using a Dynamo data store need to be intelligent enough to detect and deal with inconsistent copies that may yield a stale version of a data object.
The second tradeoff Dynamo makes is limiting the richness of the data schema to a simple key-value store. Unlike traditional RDBMSs, applications interact with data purely through keys: an insert is a “put” of a key-value pair, and a read is a “get” of the value corresponding to a key. Each put carries an opaque context, obtained from a preceding get, containing version metadata (vector clocks in the paper) that the system uses to order versions and to detect and resolve conflicts between different versions of the same data object. Deletes are handled by inserting a tombstone corresponding to the key being deleted.
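To make that interface concrete, here is a minimal single-node sketch in Python of the get/put/delete shape described above. The class name, the integer counter standing in for the paper's vector clocks, and the TOMBSTONE sentinel are all illustrative assumptions, not the paper's actual API:

```python
TOMBSTONE = object()  # sentinel marking a logically deleted key

class DynamoClient:
    """Single-node sketch of Dynamo's key-value interface (names are mine)."""

    def __init__(self):
        self._store = {}  # key -> (value, version)

    def get(self, key):
        # Returns the value plus an opaque context (here just a counter,
        # a vector clock in the real system) to echo back on the next put.
        return self._store.get(key, (None, 0))

    def put(self, key, context, value):
        # The context orders this write causally after the version the
        # client previously read, enabling conflict detection downstream.
        self._store[key] = (value, context + 1)

    def delete(self, key, context):
        # Deletes insert a tombstone under the key rather than removing it.
        self.put(key, context, TOMBSTONE)

client = DynamoClient()
_, ctx = client.get("cart:42")
client.put("cart:42", ctx, "one book")
value, ctx = client.get("cart:42")
client.delete("cart:42", ctx)
```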
Consistency is eventual: if an object is left unmodified for a (finite) amount of time, then all N copies of the object will become identical. Moreover, the latest update of the object is guaranteed to reach all physical locations hosting the N copies within a finite amount of time, after which a read from any of those locations will yield the latest version of the data. The exact time required for the system to reach this state depends on the number of failures, network conditions, and user-defined parameters that control how aggressively the background replication of objects happens.
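The paper's mechanism for ordering versions during this convergence is the vector clock carried in the put context. Here is a rough sketch of how two versions might be compared; the node names and the reconcile helper are my own illustrative choices, not the paper's code:

```python
def descends(a, b):
    """True if vector clock a causally descends from b (a >= b component-wise)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(v1, v2):
    """Keep the superseding version; otherwise surface both concurrent
    siblings to the application, which must merge them on its next put."""
    if descends(v1, v2):
        return [v1]
    if descends(v2, v1):
        return [v2]
    return [v1, v2]

# A clock that dominates the other wins; independent updates are concurrent:
print(reconcile({"sx": 3, "sy": 1}, {"sx": 2, "sy": 1}))  # one version survives
print(reconcile({"sx": 3, "sy": 1}, {"sx": 2, "sy": 2}))  # both versions kept
```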
Dynamo uses Merkle trees to reconcile data quickly and scalably in the background. The key benefit of Merkle trees is that the data transferred for reconciliation is proportional to the number of differences, not to the total number of data objects being checked for consistency.
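A toy illustration of why this holds: replicas compare digests top-down and descend only into subtrees whose digests disagree, so identical subtrees cost a single hash comparison. The tree layout and function names below are my own simplifications, not Dynamo's actual anti-entropy code:

```python
import hashlib

def md5(b: bytes) -> str:
    return hashlib.md5(b).hexdigest()

class Node:
    def __init__(self, lo, hi, left=None, right=None, digest=None):
        self.lo, self.hi = lo, hi            # index range of keys covered
        self.left, self.right = left, right
        self.digest = digest

def build(keys, store, lo, hi):
    """Merkle tree over store values for the sorted key slice keys[lo:hi]."""
    if hi - lo == 1:
        return Node(lo, hi, digest=md5(store[keys[lo]].encode()))
    mid = (lo + hi) // 2
    l, r = build(keys, store, lo, mid), build(keys, store, mid, hi)
    return Node(lo, hi, l, r, digest=md5((l.digest + r.digest).encode()))

def differing_keys(a, b, keys):
    """Descend only where digests disagree; equal subtrees cost one compare."""
    if a.digest == b.digest:
        return []
    if a.left is None:          # leaf: this single key is out of sync
        return [keys[a.lo]]
    return (differing_keys(a.left, b.left, keys)
            + differing_keys(a.right, b.right, keys))

keys = ["k1", "k2", "k3", "k4"]
replica_a = {"k1": "v1", "k2": "v2", "k3": "v3", "k4": "v4"}
replica_b = dict(replica_a, k3="stale")
ta, tb = build(keys, replica_a, 0, 4), build(keys, replica_b, 0, 4)
print(differing_keys(ta, tb, keys))  # ['k3'] -- only the diverged key transfers
```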
The paper describes the interplay between the number of copies (N), the number of copies read (R) before a client’s read request returns, and the number of copies successfully written (W) before the client’s write request is acknowledged. When R + W > N, strong consistency is guaranteed, since the set of servers handling the W writes and the set handling the R subsequent reads cannot be disjoint. This assumes the read happens after the write is acknowledged. If instead two independent clients write and read a data object respectively, with the read request hitting the system before the other client’s write is acknowledged (a classic asynchronous, independent data access pattern), then there is no such guarantee. The authors report that N=3, R=2 and W=2 are used for several Amazon services.
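The arithmetic behind this is simply that read and write quorums must overlap in at least R + W - N replicas. A quick sanity check (the function name is mine):

```python
def quorums_overlap(n, r, w):
    # R + W > N forces every read quorum to intersect every write quorum
    # in at least R + W - N replicas, so a read sees the latest acked write.
    return r + w > n

assert quorums_overlap(3, 2, 2)       # the N=3, R=2, W=2 production setting
assert not quorums_overlap(3, 1, 1)   # R=W=1 favors latency over consistency
```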
The most interesting part of Dynamo is the partitioning scheme that dictates how data is dispersed among backend resources (physical servers). Consistent hashing is used to divide up the key space: each key is mapped to a position on a ring via a 128-bit MD5 hash. The consistent hashing approach ensures that when a physical server is added to or removed from a Dynamo cluster of M nodes, the total data moved is roughly a 1/M fraction of the data stored in the cluster. Because the number of physical servers is limited, each is further divided into hundreds of virtual nodes which are randomly mapped onto the consistent hash ring. This enables faster recovery from failure conditions (disk and node malfunctions) and spreads the replication load during recovery across all servers in the Dynamo cluster.
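Here is a minimal sketch of such a ring with virtual nodes, assuming MD5 positions and a clockwise walk to build the preference list of N distinct physical servers; the class and method names are illustrative, not Dynamo's:

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Toy Dynamo-style ring with virtual nodes (names are my own)."""

    def __init__(self, virtual_nodes=100):
        self.virtual_nodes = virtual_nodes
        self._positions = []   # sorted vnode positions on the 128-bit ring
        self._owner = {}       # vnode position -> physical server

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server):
        # Each physical server gets many pseudo-random positions (vnodes),
        # smoothing load and spreading recovery work across the cluster.
        for i in range(self.virtual_nodes):
            pos = self._hash(f"{server}#vnode{i}")
            self._owner[pos] = server
            self._positions.append(pos)
        self._positions.sort()

    def preference_list(self, key, n):
        # Walk clockwise from the key's hash, collecting the first N
        # distinct physical servers: these hold the N replicas.
        start = bisect(self._positions, self._hash(key))
        servers = []
        for i in range(len(self._positions)):
            pos = self._positions[(start + i) % len(self._positions)]
            if self._owner[pos] not in servers:
                servers.append(self._owner[pos])
            if len(servers) == n:
                break
        return servers

ring = ConsistentHashRing()
for s in ["server-a", "server-b", "server-c", "server-d"]:
    ring.add_server(s)
print(ring.preference_list("cart:42", n=3))
```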
There are several other handy features, for example hinted handoff for maintaining the correct replica count even when the preferred physical server for which the data is destined is transiently down. Write speedups using ideas similar to commit logs, via the notion of an “in-memory writer” with at least one persistent copy, are also described. There are also some SLA-related performance graphs which show the remarkable availability and bounded-latency properties of Dynamo under production settings. I highly recommend reading the paper on your own if you have gotten this far in this write-up.
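As a rough sketch of the hinted-handoff idea (all names below are hypothetical, not from the paper): a stand-in replica accepts the write along with a hint naming the unreachable preferred server, then replays it once that server recovers:

```python
class HintedNode:
    """Toy node supporting hinted handoff; all names here are invented."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.hints = []   # (intended_node, key, value) held until recovery

    def write(self, key, value, intended_for=None):
        # A stand-in replica stores the write plus a hint naming the
        # preferred server that was unreachable at write time.
        self.data[key] = value
        if intended_for is not None:
            self.hints.append((intended_for, key, value))

    def replay_hints(self, recovered):
        # On detecting recovery, hand the writes back to the preferred
        # server, restoring the intended replica count.
        for hint in [h for h in self.hints if h[0] is recovered]:
            _, key, value = hint
            recovered.write(key, value)
            self.hints.remove(hint)

a, d = HintedNode("A"), HintedNode("D")
d.write("cart:42", "one book", intended_for=a)  # A was down; D covers for it
d.replay_hints(a)                               # A is back; hint is replayed
print("cart:42" in a.data)                      # True
```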
In the next part of this series I will dive into OpenStack Swift, a Dynamo-inspired file object store. I will analyze the design decisions particular to Swift, what is unique and different in it compared to Dynamo, and where (perhaps) Swift could still improve by standing on the shoulders of the grand-daddy of modern distributed systems: Dynamo.
And after that, we’ll repeat the above for another Dynamo-inspired data store - Cassandra. Stay tuned.