NoSQL has been a hot topic for a lot of people. There
has been lots of things that I have learnt about their architecture
and how distributed systems are built. Why was it that other
companies started looking at other solutions?
Why NoSQL?
The
story is a typical that I think we can all relate to. As data and
transaction volumes started to grow, companies realised that they
needed to scale their solutions.
So companies tried to address these challenges by trying
to make it fit the relational model:
- Add more hardware or upgraded to faster hardware.
- Simplifying database schema, denormalising the schema, relaxing durability and referential integrity.
- Introducing various query caching layers, separating read-only from write-dedicated replicas.
These
solutions drove the rise of what is now known as NoSQL. Scaling these
kind of solutions we need to think about them in a different way. To
understand large-scale distributed systems we need to understand the
CAP
theorem.
Brewer’s CAP Theorem
The
theorem states that within a large-scale distributed data system,
there are three requirements that have a relationship
of sliding dependency: Consistency, Availability, and Partition
Tolerance.
Consistency
– All database clients will read the same value for the same query,
even given con- current updates.
Availability
– All database clients will always be able to read and write
data.
Partition
Tolerance – The database can be split into multiple machines;
it can continue functioning in the face of network segmentation
breaks.
Brewer’s
theorem states that in any given system, you can strongly support
only two of the three. So the traditional relational databases focus
of Consistency and Availability, while the NoSQL movement tends to
focus more on the Availability and Partition-tolerance (this is
tunable in some of the systems). Due to this focus the NoSQL systems
are said to be eventually consistent.
Eventually Consistent
Basically the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
When
we look at all lot of our data needs one starts to wonder whether you
really need it to be available then and there. How many of us have
degraded a system because we were logging everything to the same
database? Logs are a great example of data that can be eventually
consistent (obviously depending what you are logging)
As
stated in this great blog Eventual
Consistency By Example
We
may sum up the eventual consistency model in the following statement:
Given
a total number on T nodes, we choose a subset of N nodes for holding
key/value replicas, arrange them in a preference list calling the top
node "coordinator", and pick the minimum number of writes
(W) and reads (R) that must be executed by the coordinator on nodes
belonging to the preference list (including itself) in order to
define the write and read as "successful".
Here
are the key concepts, extracted from the statement above:
- N is the number of nodes defining the number of replicas for a given key/value.
- Those nodes are arranged in a preference list.
- The node at the top of the preference list is called coordinator.
- W is the minimum number of nodes where the write must be successfully replicated.
- R is the minimum number of nodes where the read must be successfully executed.
More
specifically, values for N, W and R can be tuned in order to:
- Achieve high write availability by setting: W < N
- Achieve high read availability by setting: R < N
- Achieve full consistency by setting: W + R > N
Final Thoughts
Hopefully
you can se how powerful this information is and it really gets you
thinking about the way you design your systems. I look forward going
thought each of the NoSQL databases and see how these topics apply to
them.
Hello Alex,
ReplyDeleteNice post. I'm planning on migrate a solution from relational to NoSQL architecture.
Will follow you on your path and try to share my thoughs with you.
Thanks I'm just as excited about the journey. Good luck with yours :)
Delete