Introduction to NOSQL and cassandra, part 1

I recently gave a talk at outbrain, where I work, about an introduction to no-sql and Cassandra as we’re looking for alternatives of scaling out our database solution to match our incredible growth rate.

NOSQL is a general name for many non relational databases and Cassandra is one of them.

This was the first session of two in which I introduced the theoretical background and explained few of the important concepts of nosql. In the second session, due next week, I’ll talk more specifically about Cassandra.

The talk is on youtube, video below, but it’s in Hebrew so I’ll share it’s outline in English here. Slides are enclosed as well.

SQL and relational DBs in general offer is a very good general purpose solution for many applications such as blogs, banking, my cat’s site etc.
RDBMS provide ACID: Atomicity, Consistency, Isolation and Durability
RDBMS + SQL (the query language) + ACID provide a very nice and clean programming interface, suitable for banks, online merchants and many other applications, but not all applications really actually do require full ACID and one has to realize that ACID and SQL features are not without costs when systems need to scale out. Cost is not only in $$, it’s also in application performance and features.
The new generation of internet scale applications put very high demands on DB systems when it comes to scale and speed of operation but they don’t necessariry require all the good that’s in RDBMS, such as Full Consistency or Atomicity.
So, a new brand of DB systems has grown over the past 5 or so years – nosql, which either stands for No-SQL or Not-Only-SQL.
Leading actors in the nosql arena are Google with its BigTable, Amazon with Dynamo, Facebook with Cassandra and there’s more.
I presented intermediate solutions before going no-sql, namely RDBMS sharding which is very common and FriendFeed’s particularly interesting solution of application level indexing for using mysql with a schema-less data model.
CAP Theorem: At large scale systems you may only choose 2 out of the 3 desired attributes: Consistency, Availability and Partition-Tolerance. All three may not go hand in hand and application designers need to realize that.
A Consistent and Available system with no Partition-tolerance is a RDBMS system that comes to a halt if one of it’s hosts is down. That’s a very commonly used solution and perfect for small systems. This blog, for example, which uses WordPress, also uses a single mysql server which, if happens to be down, will also take the blog down. However, for internet scale systems where at almost any point in time there’s a good chance that one of the nodes is either down, or there are network disruptions, the No-Partition-Tolerance approach just isn’t going to cut it and they will have to choose a different approach for providing their SLAs.
Systems that are Available at all times and are capable of handling Partitions must sacrifice their consistency. As it turns out, though, this isn’t bad as it seems, as there are pretty good alternatives for lower levels of consistently, one such solution is Eventual Consistency, which actually works pretty nicely for “social applications” such as Google’s Facebook’s and Outbrain’s
I introduced the concept of NRW – N is the number of database replicas data is copied to one must replicate data in order to withstand partitions. W is the number of replicas a write operation would block on until it returns to it’s caller and is “successful” and R is the number of replicas a read operation would block on before returning to its caller.
N, R and W are crucial when dealing with Eventual Consistency as their values usually determine the level of consistency you’re going to have. For example, when N=R=W you have a full consistency (which isn’t tolerant to partitions or course). When W=0 you have async writes, which is the lowest level of consistency (you never know when the write operation actually finishes)
I introduced the concept of Quorum, which means R=W=ceil((N+1)/2)
Introduced a (very partial) list of currently available nosql solutions, such as Cassandra, BigTable, HBase, Dynamo, Voldemort, Riak, CouchDB, MongoDB and more.

Overall this was a very interesting talk, a lot of (fun and interesting) theory. The next part is going to be specific about Cassandra – how all this theory fits into Cassandra and how does one use Cassandra’s API, so stay tuned.

Published in Uncategorized, January 9th, 2010

7 Responses to “Introduction to NOSQL and cassandra, part 1”

Hi,

Thanks for this introduction, It’s nice to finally see the faces behind the reversim podcast .

Eyal

By Eyal on Jan 10, 2010
Hi Ran,
thanks for this introduction.
I am now just getting into Cassandra and this lecture made things really clear.

By Avi Yehuda on May 19, 2010
thanks for the intro. I’ve been sturggling to get the SQL ideas that have been imprinted, out of my head to grapple with all the new concepts. This lecture has helped out a lot thanks.

By Courtney on Aug 3, 2010