Most of my datastore experience is with RDBMS like mysql, oracle and postgresql (though I did work with some key value stores like berkleydb back in the day). So when a full day, free intro to Cassandra was offered, I jumped on it, even though it was in Centennial. You can view the schedule, speakers and talk synopses for the day. There were two tracks, beginner and advanced. Since I didn’t know anything about Cassandra, I followed the beginner track.
First, though, it was amazing how many people were there. The two main companies behind the talk, Pearson Education and DataStax, a vendor providing a commercial, supported version of Cassandra, ended up having to provide two overflow rooms, and it was still standing room only for some of the talks. Quite a nice turnout, and I think the sponsors were pleasantly shocked. I was also surprised by the number of folks from Boulder. I happened to sit next to two folks from Westminster and Superior, and ended up having a common friend or colleague with each. Small world.
I learned a ton about Cassandra, from its internals, to its topology (the ring’s the thing) to abstractions that let you query it (CQL, which is a subset of SQL) to data modeling to using the java driver, which makes accessing Cassandra almost as easy as using JDBC. While there are some SQL concepts that appear to map fairly well to Cassandra, I put quotes around them below to remind myself of the fact that a Cassandra ‘table’ isn’t the same as a RDBMS table, ditto for ‘row’, ‘primary key’ and other important concepts.
I think the biggest takeaway for me was that Cassandra is a “write many, read once” system. Because you can only query efficiently on one or two keys, if you have multiple queries, you want to write the data multiple times in a denormalized system, one ‘table’ for each query. Because of this, Cassandra shines in use cases where you are doing a lot of inserts, have known queries, and need speed and availability (sensor data was mentioned several times).
How does this actually work? Here’s an example (as best I understand it–here are some others from people who actually have experience using this technology):
If you have click stream data, in standard apache format, and you want to be able to have it stored in a database and highly available for a few specific queries, Cassandra might be a good choice. Here’s a line of my clickstream, from my blog:
126.96.36.199 - - [15/Oct/2014:08:03:30 -0600] "GET /wordpress/archives/date/2007/07 HTTP/1.1" 200 66258 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
This is time based data, and has some valuable information, and some not so valuable information. What things might I be interested in querying on? Well, I might care about the user agent, request time, ip address, status code, or URL path requested. I probably don’t care about the HTTP method, HTTP version or the bytes served. But, for the sake of this example, let’s say my application needs to show the location of the most recent 1000 users for a given country, for a fancy widget for my website. I will use maxmind or a similar service for mapping ip addresses to country. That’s all I care about. (Yes, I know this is contrived–I had to revise this example a couple of times to make it fit with Cassandra’s usage model.) I would set up in this ‘table’ in Cassandra.
CREATE TABLE IF NOT EXISTS location (
PRIMARY KEY (text, time)
In this table, country is the partition key, and time is the clustering key. That means that this query:
select location from location where country = 'USA' limit 1000; will be screaming fast. If I wanted to look at paths requested by time, I would not add an index to this table, but instead create another whole table, say
request_path. Then my insertion code would write to both
request_path. And then clients that wanted path information would use the specific table. Yup, denormalization is the name of the game.
This means that Cassandra has certain specific use cases, and that trying to use it as a general purpose data storage and query engine is foolish. Several presenters mentioned that Cassandra plus Apache Spark for general queries was a good solution.
Denormalization as standard operating procedure isn’t the only mind bending facet of Cassandra. Others:
- the presenters also talked about the high availability and replication of Cassandra–you can actually configure it to be data center aware so that it automatically replicates data across different data centers.
- For each keyspace (a set of tables, so similar to a schema), you specify how a replication factor–how many times each piece of data is stored.
- For every read or write, you specify how many nodes must either agree or accept the data, respectively.
- Adding nodes is the preferable way to deal with scale. You can add a node easily, and Cassandra will auto partition data and spread existing data across the new node.
- The biggest Cassandra setup they mentioned was 75K nodes.
- Each ‘row’ can have up to 2 billion records. If a row stretches across nodes, you’ll kill performance.
- There’s a process called ‘compaction’ which is similar to Java’s garbage collection, and just like GC, you have to pay attention to how compaction works, because it will affect performance.
All in all, very interesting day, and I appreciated the experience. One more (interesting) tool to add to the toolbox.