Understanding Cassandra Code Base

Lately I’ve been adding some random small features to cassandra so I took the time to have a closer look at the internal design of the system.
While with some features added, such as an embedded service, I could have certainly get away without good understanding of the codebase and design, others, such as the truncate feature require good understanding of the various algorithms used, such as how writes are performed, how reads are performed, how values are deleted (hint: they are not…) etc.

The codebase, although isn’t very large, about 91136 lines, is quite dense and packed with algorithmic sauce, so simply reading through it just didn’t cut it for me. (I used the following kong-fu to count: $ cassandra/trunk $ find * -name *.java -type f -exec cat {} \;|wc -l)

I’m writing this post in hope it’d help others get up to speed. I’m not going to cover the basics, such as what is cassandra, how to deploy, how to checkout code, how to build, how to download thrift etc. I’m also not going to cover the real algorithmic complicated parts, such as how merkle trees are used by the ae-service, how bloom filters are used in different parts of cassandra (and what are they), how gossip is used etc. I don’t think I’m the right person to explain all this, plus there are already bits of those in the cassandra developer wiki. What I am going to write about is what was the path that I took in order to learn cassandra and what I’ve learned along the way. I haven’t found all that stuff documented somewhere else (perhaps I’ll contribute it back to the wiki when I’m done) so I think I’d be very helpful to have it next time I dive into a new codebase.

Lastly, a disclaimer: The views expressed here are simply my personal understanding of how the system works, they are both incomplete and inaccurate, so be warned. Keep in mind that I’m only learning and still sort of new to cassandra. Please also keep in mind that cassandra is a moving target and keeps changing so rapidly that any given snapshot of the code will get irrelevant sooner or later. By the time of writing this the currently official version is 0.6.1 but I’m working on trunk towards 0.7.0.

Here’s a description of the steps I took and things I learned.

Download, configure, run…

First you need to download the code and run unit tests. If you use eclipse, idea, netbeans, vi, emacs and what not, you want to configure it. That was easy. There’s more here.

Reading

Next you want to read some of the background material, depending on what part exactly you want to work on. I wanted to understand the read path, write path and how values are deleted, so I read the following documents about 5 times each. Yes, 5 times. Each. They are packed with information and I found myself absorbing a few more details each time I read. I used to read the document, get back to the source code, make sure I understand how the algorithm maps to the methods and classes, reread the document, reread the source code, read the unit tests (and run them, with a debugger) etc. Here are the docs.

http://wiki.apache.org/cassandra/ArchitectureInternals

SEDA paper

http://wiki.apache.org/cassandra/HintedHandoff

http://wiki.apache.org/cassandra/ArchitectureAntiEntropy

http://wiki.apache.org/cassandra/ArchitectureSSTable

http://wiki.apache.org/cassandra/ArchitectureCommitLog

http://wiki.apache.org/cassandra/DistributedDeletes

I also read the google BigTable paper and the fascinating Amazon’s Dynamo paper, but that was a long time ago. They are good as background material, but not required to understand actual bits of code.

Well, after having read all this I was starting to get a clue what can be done and how but I still didn’t feel I’m at the level of really coding new features. After reading through the code a few times I realized I’m kind of stuck and still don’t understand things like “how do values really get deleted”, which class is responsible for which functionality, what stages are there and how is data flowing between stages, or “how can I mark and entire column family as deleted”, which is what I really wanted to do with the truncate operation.

Stages

Cassandra operates in a concurrency model described by the SEDA paper. This basically means that, unlike many other concurrent systems, an operation, say a write operation, does not start and end by the same thread. Instead, an operation starts at one thread, which then passes it to another thread (asynchronously), which then passes it to another thread etc, until it ends. As a matter of fact, the operation doesn’t exactly flow b/w threads, it actually flows b/w stages. It moves from one stage to another. Each stage is associated with a thread pool and this thread pool executes the operation when it’s convenient to it. Some operations are IO bound, some are disk or network bound, so “convenience” is determined by resource availability. The SEDA paper explains this process very well (good read, worth your time), but basically what you gain by that is higher level of concurrently and better resource management, resource being CPU, disk, network etc.

So, to understand data flow in cassandra you first need to understand SEDA. Then you need to know which stages exist in cassandra and exactly does the data flow b/w them.

Fortunately, to get you started, a partial list of stages is present at the StageManager class:

public final static String READ_STAGE = "ROW-READ-STAGE";
public final static String MUTATION_STAGE = "ROW-MUTATION-STAGE";
public final static String STREAM_STAGE = "STREAM-STAGE";
public final static String GOSSIP_STAGE = "GS";
public static final String RESPONSE_STAGE = "RESPONSE-STAGE";
public final static String AE_SERVICE_STAGE = "AE-SERVICE-STAGE";
private static final String LOADBALANCE_STAGE = "LOAD-BALANCER-STAGE";

I won’t go into detail about what each and every stage is responsible for (b/c I don’t know…) but I can say that, in short, we have the ROW-READ-STAGE which takes part in the read operation, the ROW-MUTATION-STAGE which takes part in the write and delete operations, the AE-SERVICE-STAGE which is responsible for anti-entropy. This is not a comprehensive list of stages, depending on the code path you’re interested in, you may find more along the way. For example, browsing the file ColumnFamilyStore you’ll find some more stages, such as FLUSH-SORTER-POOL, FLUSH-WRITER-POOL and MEMTABLE-POST-FLUSHER. In Cassandra stages are identified by instances of the ExecutorService, which is more or less a thread pool and they all have all-caps names, such as MEMTABLE-POST-FLUSHER.

To visualize that I created a diagram that mixes both classes and stages. This isn’t valid UML, but I think it’s a good way to look at how data flows in the system. This is not a comprehensive diagram of all classes and all stages, just the ones that were interesting to me.


yUML source

Debugging

Reading through the code using a debugger, while running a unit-test is an awesome way to get things into your head. I’m not a huge fan of debuggers, but one thing they are good at is learning a new codebase by singlestepping into unit tests. So what I did was to run the unit-tests while single stepping into the code. That was awesome. I also ran the unit tests for Hector, which uses the thrift interface and spawn an embedded cassandra server so they were right to the point, user friendly and eye opening.

Class Diagrams

Next thing I did is use a tool to extract class diagrams from the existing codebase. That was not a great use of my time.

Well, the tool I used wasn’t great, but that’s not the point. The point is that cassandra’s codebase is written in such way that class diagrams help very little in understanding it. UML class diagrams are great for object oriented design. The essence of them is the list of classes, class members and their relationships. For example if a class A has a list of Bs, so you can draw that in a UML class diagram such that A is an aggregation of Bs and just by looking at the diagram you learn a lot. For example, an Airplane has a list of Passengers.

Cassandra is a complex system with solid algorithmic background and excellent performance, but, to be honest, IMO from the sole perspective of good oo practice, it isn’t a good case study… Its classes contain many static methods and members and in many cases you’d see one class calling other static method of another class, C style, therefore I found that class diagrams, although they are somewhat helpful at getting a visual sense of what classes exist and learn roughly manner about their relationships, are not so helpful.

I ditched the class diagrams and continued to the next diagram – sequence diagrams.

Sequence Diagrams

Sequence diagrams are great at abstracting and visualizing interactions b/w entities. In my case an entity may either be a class, or a STAGE, or a thrift client. Luckily with sequence diagrams you don’t have to be too specific and formal about the kind of entities are used in it, you just represent them all as happy actors (at least, I allow myself to do that, I hope the gods of UML will forgive).

The following diagrams were produced by running Hector‘s unit tests and using an embedded cassandra server (single node). The diagrams aren’t generic, they describe only one possible code path while there could be many, but I preferred keeping them as simple as possible even in the cost of small inaccuracies.

I used a simple online sequence diagram editor at http://www.websequencediagrams.com to generate them.

Read Path

note left of CassandraServer: Read Path

CassandraServer -> StorageProxy: readProtocol
StorageProxy -> weakReadLocal: READ-STAGE.call

weakReadLocal -> SliceByNamesReadCommand: getRow
SliceByNamesReadCommand -> Table: getRow
Table -> ColumnFamilyStore: getColumnFamily
ColumnFamilyStore -> QueryFilter: collectCollatedColumns
QueryFilter -> ColumnFamilyStore:
ColumnFamilyStore -> ColumnFamilyStore: removeDeleted
ColumnFamilyStore -> Table:
Table -> SliceByNamesReadCommand:
SliceByNamesReadCommand -> weakReadLocal:

weakReadLocal -> StorageProxy:
StorageProxy -> CassandraServer:

Write Path

note left of CassandraServer: Write Path
CassandraServer -> StorageProxy: mutateBlocking

note over StorageProxy: async
StorageProxy --> StorageProxy: MUTATION-STAGE call
StorageProxy -> RowMutation: run

RowMutation -> Table: apply

note over Table, CommitLog: async
Table --> CommitLog: COMMIT-LOG-WRITER add
CommitLog -> CommitLogSegment: write
CommitLogSegment -> CommitLog: 

Table -> ColumnFamilyStore: apply
ColumnFamilyStore -> Memtable: put
Memtable -> Memtable: resolve
Memtable -> ColumnFamilyStore:
ColumnFamilyStore -> Table:
Table -> RowMutation:
RowMutation -> StorageProxy:
StorageProxy --> StorageProxy: signal
StorageProxy -> CassandraServer:

Table is a Keyspace

One final note: As user of cassandra I use the terms Keyspace, ColumnFamily, Column etc. However, the codebase is packed with the term Table. What are Tables?… As it turns out, a Table is actually a Keyspace… just keep this in mind, that’s all.

Learning the codebase was a large and satisfying task, I hope this writing helps you get up and running as well.

9 Responses to “Understanding Cassandra Code Base”

  1. Thanks for this post, investigation of cassandra architecture is exactly what I’m currently trying to do.

    Wiki pages links you listed – where did you find its? I’m asking because main wiki page shows nothing http://wiki.apache.org/cassandra/

    By Andrey Kopachevsky on May 2, 2010

  2. From the main page there’s a Contribute link, follow its path and you’d get here http://wiki.apache.org/cassandra/ArchitectureInternals

    By Ran Tavory on May 2, 2010

  3. Very helpful!!!
    Thank you soooooo much!!!

    By IL on May 5, 2010

  4. Thank you! I feel this blog is very helpful and have translated it to Chinese. Do you mind I post it on my personal blog, and I kept the original link pointing here.

    By Xu Wang on May 12, 2010

  5. no problem, go ahead

    By Ran Tavory on May 12, 2010

4 Trackback(s)

  1. May 3, 2010: How to read cassandra source code by Ran | People Mountain People Sea
  2. May 12, 2010: [译文]理解Cassandra源代码 » 我有分寸
  3. Aug 19, 2010: Keksrolle.de » Blog Archive » 11 links on scalable software architecture and big data
  4. Oct 19, 2010: Stack Overflow Co-Founder Joel Spolsky Questions Why Digg Has So Many Servers

Sorry, comments for this entry are closed at this time.