Why worth knowing?

When designing a new system resiliency is usually made the top priority. Most often the goal is to build a system that is as fault-tolerant and immune to outages as possible. Whether this should be of such great importance is another topic, but what is quite interesting is how reliable a system can become. In other words, what is the limit we are going to eventually reach when trying to max out the reliability of our system.

CAP theorem can be pretty helpful with establishing such limit. Or actually it can help us become more aware of the trade offs lurking behind the architectural choices we are making.

Another down-to-earth argument for learning about CAP theorem is that it is often asked about during job interviews (especially when applying for backend roles) so getting to know it better could potentially improve your performance on the interviews.

What is it?

CAP theorem states that at one time a distributed system can only provide 2 out of 3 available guarantees:

  • C onsistency - every read returns the most recent write (it looks like system contains a single up-to-date copy of the data)

  • A vailability - every request receives a non-error response (system returns data although it may not be the latest update)

  • P artition tolerance - system continues to operate despite broken communication between nodes

This sounds similar to hiring a contractor in order to renovate your apartment or house.

The contractor can be only two of:

  • cheap
  • good
  • fast

If a contractor is cheap then you can expect either poor result or job not being done in a timely manner. When the results are to be anywhere good then you should either equip yourself with patience or be ready to pay more. Finally, when you require all the work to complete in a short period of time then you have to make a decision what is more important: reducing the costs or getting better quality (which seems also relatable in the software world).

All in all, you cannot have all three and that is just the way it is.

Proof

Let’s try to use common sense to confirm that CAP theorem is indeed true.

Imagine we have a simple distributed system where data is written to a leader that replicates data to a replica from which data is being read:

If nothing bad ever happened then such system would work well but in reality there is always a chance that connectivity between leader and replica breaks down:

Unfortunately, this means that if replica contains stale data then it cannot be refreshed:

Availability or consistency?

We have to make a tough call between sacrificing consistency and returning old data:

…or losing availability by throwing back an error:

Practical application

It sounds really bad that we are forced to make such decisions. Luckily, though CAP theorem stands true and we have to accept it as a fact, the problem is not so black and white. Quite the opposite, there is a spectrum of available options for us to choose from.

First of all, there is a wide range of countermeasures to greatly reduce the probability of a network partition, which happens rarely anyways. Then we can usually pick different strategies depending on a scenario: we might not be willing to give up on consistency when dealing with money but being presented with somewhat older number of users currently signed in to our application could be perfectly acceptable.

What’s important, a choice how to react to a network partition can be made on a case-by-case basis within the same system. Most modern database and messaging systems usually provide a number of parameters we can pass to read and write operations that affect whether we opt for consistency or availability.

Want to learn more?

In order to learn more here you can read an article containing Eric Brewer’s considerations many years after he came up with the CAP theorem.