Ever have this constant fear that you are going to end up alone in the future? or do you fear that the people who are here with you right now are probably going to leave you alone soon? This is…


独家优惠奖金 100% 高达 1 BTC + 180 免费旋转

Distributed Databases I

Written by Sajeed Syed Bakht & Govind Mohan

The world of databases can be difficult to navigate whether you’re a student, a fresh graduate, or a seasoned developer. Several paradigms have emerged over the last decade, all embodied by various large projects with eccentric names like Cockroach DB and Voldemort. We are also experiencing a new wave of innovation in distributed systems with Distributed Ledger Technology which was popularized by the rise of cryptocurrencies such as Bitcoin and Ethereum.

In this article we’ll explore the major paradigms, and provide some context as to why these paradigms are popular and what has led to their popularity.

A database that runs on one server can satisfy all these cases, but as traffic increases (a natural side-effect of growth), it loses its ability to be fault tolerant, that is to perform without errors. There can be issues with memory/disk/CPU limitations, and hardware upgrades (vertical scaling) only kick the can down the curb. Further, network outages can completely cut access to the database. Thus, it is imperative to maintain the database across different machines (horizontal scaling) to ensure smooth access even when the network is unreliable. Maintaining copies of the database, or replication, results in a distributed database. Replication comes with its own set of questions: how do we ensure that these copies are up-to-date with each other at all times?

Before we continue, it’s worth discussing how we can compare databases based on their desirable properties. From distributed systems theory, the CAP theorem defines the landscape of database performance. It states that any distributed data store cannot provide all of the following three properties: Consistency, Availability, and Partition Tolerance. Consistency (C) is the guarantee that queries to the distributed database always happen in a certain order. Thus, all records in all copies of the database must be composed by the same set of operations. For example, all copies of a MySQL distributed database must have the same tables at all times, and must have built these tables using the same INSERT, UPDATE, etc. statements. Availability (A) entails that any request to any node in the database always returns some kind of non-error response. This includes scenarios where nodes in the distributed system have failures, where the response will have an indefinite delay. Partition-tolerance (P) refers to the system’s ability to operate after losing/delaying an arbitrary number of messages between nodes. Specifically, when the network is segmented into groups of nodes, or partitioned, some partitions will not be able to communicate reliably to others. Thus, a distributed system cannot have CAP as there can be two partitions, G1 and G2, of nodes within the system that don’t communicate with each other, and a write query to G1 followed immediately by a read request to G2 will have inconsistent values as G2 will not show the write query to G1.

This problem is avoided in an AC system as requests can always be communicated between nodes. It is also avoided in AP systems as the inconsistency from the above situation is not required to be solved, and in CP systems the database queries will not succeed as availability is not required. In a real-world network, hardware failures are largely unpredictable, and thus partitioning is typically assumed. As a result, AP and CP systems are preferred over AC systems. It is important to note that CAP is a rather simplistic quantification of database performance, but serves as a good starting point to understand other metrics.

When dealing with a cluster of nodes, our goal is to ensure that the data is properly replicated on multiple nodes. Therefore if a node goes down, then the data does not “disappear”.The cluster can simply look into another node for the relevant data.

The most common and traditional approach to replication within distributed systems is a Leader-Follower approach. Each node within a cluster is separated into two groups: leaders, which accept write requests and send the data to the other group, and followers which can only accept read requests. The general idea is for the leaders to accept writes, and then send the data change contained with the write to the followers to copy from. The read requests are mainly handled by the followers to take the load off the leader, while its resources are focused on accepting the writes.

This type of system designates one instance in the cluster to be the leader, and the rest of the replicas to be followers. This paradigm is often useful when writes need to follow a sequential order. The single leader can sequentially process every write that occurs. For example, three writes may be requested to multiply, then divide then subtract from a value. Since these writes are not commutative, it is essential that they are dealt with sequentially. A write occurs in the following manner: the leader receives the write request, then writes the data to its own storage and then sends the same write to the followers to make the data change as well. A write can be accepted under two different conditions. Firstly, the replication from leader to follower could be done asynchronously; the write could be accepted after the leader writes the data to its own storage and replicates on at least one other node. Or the write can be accepted synchronously; after the leader is sure that each of the followers have also replicated the data change to their own store. This subtle differentiation has implications on consistency and availability.

Synchronous replication works as the following. A write is sent to the database. The leader handles the write by updating its own data store. Then, the leaders send the data to the followers for them to replicate. Each follower sends back a status stating that data was successfully replicated to their data store. The leader waits for each follower to send a confirmation response. After ensuring every follower has successfully replicated the data, the leader informs the client that the write has been successful. This ensures that the data is consistent across all nodes. Consequently, when a read is requested, the cluster sends the request to any data store. However, this level of consistency comes with a trade-off; a more latent and therefore less available system.

Take for instance, a cluster with twenty instances that observe a single leader, synchronous approach, i.e one instance is designated as the leader, and the other nineteen are designated as the followers. A write request is sent to the leader. The leader processes it and then waits for the status of the replication from each of its followers. Perhaps, multiple followers are currently running slow and take longer than usual to replicate the data and notify the leader of the data change. The leader is held into a predicament where it cannot handle new writes since it is waiting for the data to replicate amongst the rest of the followers. Thus the system can no longer guarantee availability.

Asynchronous Replication in contrast lets the leader deal with write requests without having to worry about every follower replicating the data change. The leader accepts writes, makes the change to its own data store and then sends a notification to the followers to replicate the data from the leader. The leader then begins accepting new write requests. However, since the data is not guaranteed to be consistent this can have unpleasant consequences. Imagine a person wants to update the profile picture on their social media account. This change can be considered a “write request”. I.e.

The follower would accept the write and handles the write asynchronously. The person then checks their profile to ensure the change was made. This request would send a “read” request to the cluster. The person is delighted to see that their photo change has occurred. This would be a case of the read request being routed to a follower that has successfully replicated the data change from the leader. Then the person notifies a friend to comment on the picture. The friend goes on the person’s profile but notices no profile picture change has occurred. This would be a case of the read request being routed to a follower that has not successfully replicated the data change from the leader.

An approach that combines the two is the minimum insync replica approach. The data system sets a number, n, denoting the minimum insync replicas. When the leader deals with a write, it only waits until n followers have successfully replicated the data. This way the leader is not bogged down by waiting for all followers to replicate the data and the system is thus less likely to become unavailable due to waiting for write requests to succeed.

In a distributed system, one pitfall that may occur is node churn, which is when nodes shut down arbitrarily. Therefore if a node goes down our system can still be running, accepting write and read requests. To deal with node outages, it is important to know what type of node went down. If a follower node goes down but comes back up, then how would it catch up to all the data changes that occurred? If a leader goes down, how does the system continue accepting writes?

Each follower keeps a log of the data changes it has received from the leader. Therefore, the follower knows the last transaction that occurred before it went down. When the follower comes back up, it can request from other nodes all the data changes that occurred while it was down. This effectively catches the follower up, and makes it consistent with every other node in the network.

More complexity is introduced when a leader goes down. To ensure the system continues accepting writes, one of the followers in the system needs to be elected a leader. Then every follower needs to be aware of this change so that it could replicate data from the newly elected leader. This process is called failover.

Failover can be done manually by the administrator or it could be done automatically.

The automatic steps are generally the following.

There are many common pitfalls with failover. For example, in asynchronous replication, an older leader may come back up. However, it may have data that was written to it that other followers have. How should the system deal with this data? Another problem is understanding when the leader has gone down. What is the correct timeout length? A longer timeout would make the recovery of the system longer if the leader has failed. A shorter timeout may make for unnecessary failover. For example, if the system is experiencing network delays because of a spike of activity, then making the system do a failover process will lead to more load on the system thus further slowing it down.

Single leader architectures are well suited for application workloads that consist of mostly reads and a small percentage of writes. Think of a simple blog website. Most of the traffic is users reading the blogs, while the writing occurs only by authors, or users that want to comment. One way to scale this website is to add more nodes, i.e. more followers to handle the read-only requests. However, as previously noted as more followers are added to the clusters, so does the time to replicate data amongst them. Therefore an asynchronous approach or minimum insync replicas becomes more realistic. As stated before, asynchronous approaches are not consistent. Instead, they are considered “eventually” consistent. That eventually all replicas will catch up and copy the data. However, there is no guarantee on how long this will take. It could be a few seconds to even a few hours. This is known as replication lag; the time it takes for the data to be replicated amongst all replicas. There are many unexpected problems that could happen with replication lag. We’ll explore one in detail and solution to it, namely monotonic reads.

Imagine the following situation. You read a new blog that was just posted. You go to comment on the page, but your wifi goes down. After reconnecting to the wifi after a few moments, you again navigate to the blog to make a comment. However, there is no blog to comment on. It seems that the website is going back in time. What has happened is that initially, you have read from a replica that is up to date. But then the subsequent read request is sent to a node that is not up to date.

Monotonic reads[link] is a guarantee that the above would not occur. Namely, a user will not see an older replica, after it has seen data from a more up to date replica. One way of achieving this is for the user to read from the same replica. This could be done by choosing a replica based on some function of the user’s identity. However, if the chosen replica goes down then the user’s request will have to be routed to another replica.

The largest downside of a single leader approach is that only one instance is a leader. For example, a network interruption between the leader and a client could effectively stop that client from writing to the database. An extension of the single leader approach is to have multiple leaders within the system. Imagine you have multiple data centers separated geographically. In the single leader approach, one replica has to be the leader of all these data centers. In the multi leader approach, you can have one leader for each data center. The other nodes are followers to the leader within their data center. The leaders in turn follow other leaders.

A multi leader approach makes more sense for the following reasons. Firstly, such a system performs better. In the single leader system, every follower will have to route to the single leader that may be outside it’s data center. In the multi leader approach, the followers only have to look at the leader in their local data center. Secondly, in the case of a data center going out; a single leader approach would have to elect a new leader, if that data center contained the leader. However, in the multi leader approach, writes can be accepted since there are multiple leaders. The most glaring problem with the multi leader approach is that some data may be concurrently changed within two different data centers.

Consider the scenario where one write request could set a variable to a certain value, while another write request simultaneously sets the same variable to a different value. In a single leader database, the second write will either block or wait for the first write to be done. It could also abort the second transaction and ask the user to resolve it. In the multi leader setup, both writes are successful and it is later asynchronously detected at some later point in time. There is no guarantee when this will be detected thus making it too late to ask the user to resolve the conflict. The way to solve this problem is through conflict resolution.

In a multi leader configuration there is no defined ordering of writes, so it’s not clear what the final value should be. Thus, every replication scheme should ensure that the data is eventually the same on all replicas. One popular way to achieve this is through Last Write Wins(LWW). Let’s briefly go over the general structure of it.

However, this approach is prone to data loss, since the other writes are discarded.Moreover there may be writes that are discarded that are not concurrent.

Resolving a write conflict in a multi leader approach can be solved and minimized using various strategies. There are many strategies to look over, that we won’t go into full detail. Some notable conflict resolution strategies include Last Write Wins (LWW), Operational Transformation, Mergeable persistent data structures, and Conflict Free Replicated Data Types.

Thank you for making it this far. In this article, we introduced the need for a distributed system as a means for safely storing and retrieving data. We covered the leader-follower paradigm as a way of nodes interacting and exchanging data amongst each other, under the assumptions that the network could be unreliable and node outages may occur. The Leader-follower paradigm helps the database reach a consistent state amongst nodes, however there is a cost since the system must always agree upon a leader, and that a leader must always be present. This leads our system to be less available in certain cases. New paradigms have emerged to address these issues. Namely, the leaderless paradigm where no node is a leader and thus no need for expensive operations such as Failover in the case of a leader outage. In the next article, we will introduce the leaderless paradigm and elaborate on how leaderless systems are more available and how they keep the system consistent.

Add a comment

Related posts:

Kiss and Tell with the Prince of Hell

Traditional greetings vary. Meeting someone famous, like Lucifer, is intimidating. Be yourself. Don't be nervous. Wear lip balm and give the Devil his kiss.


Poetry by Gale Davis. “Erased” is published by Gale Davis. A view of life and how fleeting it can be.


I could never keep them straight. A day didn’t go by when a guy was clung to her hip, brainwashed into thinking that the two of them were going to last while my sister had her eyes on other guys. By…