How Dynamo replicates and handles failures under the hood

About the paper

The Dynamo paper, published in 2007, describes the design and implementation of Dynamo, a highly available distributed key-value store built to scale incrementally. While researching data access patterns across Amazon's services, Amazon engineers noticed that many of the calls were simple primary-key lookups against the data store. The common pattern of using a relational database for these workloads led to inefficiencies and limited scale and availability. Dynamo instead provides a simple primary-key-only interface to meet the requirements of these applications. The focus was on reliability, so Dynamo sacrifices consistency under certain failure scenarios.

Let’s understand replication

Replication is a major topic in distributed systems. In the simplest terms, replication is a way to duplicate data across different servers and geographies. The core idea is to provide durability and remain available in the face of a major outage.

Why is replication needed in distributed systems?

  • Higher Availability: Replication is the most important tool for increasing data availability in distributed systems. Data is replicated across numerous locations so that users can access it even if some of the copies are unavailable due to failures.

  • Reduced Latency: By replicating data to geographies closer to the user, replication reduces data query latency.

  • Fault Tolerance: The system keeps operating even in the face of network problems. If one replica fails, another replica can serve the request.

Dynamo Replication Strategy

Now that we understand replication, let’s discuss how Dynamo replicates data to achieve durability and availability.

Dynamo replicates its data on multiple hosts. Each data item is replicated at N hosts, where N is a parameter configured per application. When a write request arrives, each key is assigned to a node (referred to as the “coordinator” in the paper). The coordinator is in charge of replicating the data items that fall within its hash range. In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 clockwise successor nodes in the ring. This results in a system where each node is responsible for the region of the ring between it and its Nth predecessor.

In the above figure, when a write request arrives for key K, moving clockwise, node B becomes the coordinator. B, in addition to storing the key locally, also sends it to nodes C and D. Dynamo maintains the list of all nodes responsible for the keys in a particular hash range and calls it a “preference list”. To account for possible failures of nodes in the preference list, the list can contain more than N nodes.
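To make the clockwise walk concrete, here is a minimal Python sketch of building a preference list, assuming nodes are placed on the ring by an MD5 hash of their ids (the `Ring` class, node names, and helper names are illustrative, not Dynamo’s actual code):

```python
import hashlib
from bisect import bisect_right

def ring_position(ident: str) -> int:
    # Keys and node ids are mapped onto the same circular hash space (MD5 here).
    return int(hashlib.md5(ident.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, node_ids, n: int):
        self.n = n  # replication factor N
        self.nodes = sorted(node_ids, key=ring_position)  # nodes ordered on the ring
        self.positions = [ring_position(x) for x in self.nodes]

    def preference_list(self, key: str):
        # Coordinator = first node clockwise from the key's position;
        # the key is then replicated at the N-1 clockwise successors.
        start = bisect_right(self.positions, ring_position(key)) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.n)]

ring = Ring(["A", "B", "C", "D", "E", "F", "G"], n=3)
print(ring.preference_list("K"))  # coordinator first, then its two successors
```

In the paper’s full design each physical node also owns multiple virtual positions on the ring, and the preference list skips positions so that it contains N distinct physical nodes.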

Replica synchronization

Node failures, outages, and transient failures are common occurrences in the distributed world. Data is replicated across multiple nodes, and replicas can fall out of sync when, for example, a node is down during writes. On top of that, dealing with such situations in live production is hard: how do you reliably copy huge amounts of data from one node to another while that data is likely still changing?

To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees. A Merkle tree is a hash tree where the leaves are hashes of the values of individual keys, and parent nodes higher in the tree are hashes of their respective children. The benefit of a Merkle tree is that data can be checked independently without actually transferring it. For instance, if the hash values of the roots of two trees are equal, then the values of the leaf nodes are also equal and the replicas require no synchronization. If not, the values of some replicas differ. In that case, the nodes exchange the hash values of the children, and the process continues until it reaches the leaves of the trees, at which point the hosts can identify the keys that are “out of sync”. Merkle trees thus minimize the amount of data that needs to be transferred for synchronization and reduce the number of disk reads performed.
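The following toy sketch shows the idea, assuming both replicas build a tree over the same key range with leaves in the same order (the SHA-256 choice and the helper names are illustrative; the paper doesn’t prescribe them):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(items):
    """Build a Merkle tree bottom-up from (key, value) pairs sorted by key.
    Returns a list of levels: tree[0] holds leaf hashes, tree[-1] the root."""
    level = [h(f"{k}={v}".encode()) for k, v in items]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def out_of_sync(tree_a, tree_b):
    """Walk two same-shaped trees top-down, descending only into subtrees
    whose hashes differ; return the indexes of the leaves that differ."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return []                        # equal roots: replicas already agree
    diff = {0}                           # differing node indexes, starting at the root
    for depth in range(len(tree_a) - 2, -1, -1):
        diff = {c for i in diff for c in (2 * i, 2 * i + 1)
                if c < len(tree_a[depth]) and tree_a[depth][c] != tree_b[depth][c]}
    return sorted(diff)

a = build_merkle([("k1", "v1"), ("k2", "v2"), ("k3", "v3")])
b = build_merkle([("k1", "v1"), ("k2", "changed"), ("k3", "v3")])
print(out_of_sync(a, b))                 # [1] -> only "k2" must be transferred
```

Note how only the hashes along the differing path are compared; the values themselves are never transferred until the out-of-sync keys are known.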

Design Considerations

Replication is a critical component of Dynamo; it is what makes Dynamo always available and eventually consistent. The replication strategy adopted by Dynamo sacrifices consistency by replicating the data asynchronously, which can lead to clients reading stale data. But that’s the trade-off Dynamo adopted to support high durability and availability.

Another design consideration is picking the right replication factor based on application SLAs. The replication factor determines the write latency of the application: a higher replication factor provides higher durability at the cost of increased write time.
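A back-of-the-envelope way to see this trade-off (a toy model, not Dynamo’s actual write path, assuming the coordinator waits for every one of the N replicas to acknowledge) is that the write latency is the maximum of the per-replica latencies, which grows with N:

```python
import random

def simulated_write_latency_ms(n_replicas: int, mean_ms: float = 10.0) -> float:
    # Model each replica's response time as an exponential random variable.
    latencies = [random.expovariate(1 / mean_ms) for _ in range(n_replicas)]
    # Waiting for every replica means the slowest one sets the write latency.
    return max(latencies)

for n in (1, 3, 5):
    samples = [simulated_write_latency_ms(n) for _ in range(100_000)]
    print(f"N={n}: mean write latency ~{sum(samples) / len(samples):.1f} ms")
```

Even in this simplified model, moving from N=1 to N=5 roughly doubles the average write latency, which is why the replication factor has to be weighed against the application’s latency SLA.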