
Thursday, July 26, 2018

Apache Cassandra turns 10

Born during the post-Y2K backlash that gave rise to innovations that are now the cornerstones of big data implementations, Cassandra has firmly entrenched itself as one of the most popular databases. As Cassandra enters adolescence, DataStax -- the company closely associated with it -- is embarking on a classic open core strategy that uses the database as its starting point.



The past couple of years have seen a number of 10-year milestones pass, like the decade anniversaries of Amazon Web Services, MongoDB, Hadoop and many others. And so in 2018, it's Apache Cassandra's turn. Today, Apache Cassandra has morphed into a modest ecosystem where there is one principal commercial platform supplier -- DataStax -- supplemented by a small collection of companies delivering third-party support. The database combines the versatility of a table-oriented database with the speed and efficiency of a key-value store.

But make no mistake about it -- the fact that there aren't a dozen vendors of Cassandra distros doesn't hide the fact that Cassandra is a very popular database. It is one of a quartet of NoSQL databases that rank in the DB-Engines top ten. And in itself, Cassandra has carved out a niche for continuous online systems that can carry up to petabytes of data. Like other "wide column" databases that began life as key-value stores, Cassandra was first known for fast writes, but over the years, read performance has caught up.

For instance, when you get film recommendations served up on Netflix, they come from an application running on Cassandra. It has carved out a presence in maintaining online user profiles, shopping carts, fraud detection, and increasingly, real-time mobile and IoT applications. For that matter, so have most of Cassandra's prime NoSQL competitors like MongoDB, DynamoDB, and Cosmos DB.

As this is 10th birthday time, it makes sense to look at Cassandra's beginnings. The story is a familiar one. An Internet giant -- Facebook -- needed a more scalable, always-on database alternative for its inbox feature and created Cassandra back in 2008 based on the Dynamo paper published by Amazon. After it was open sourced, Jonathan Ellis, an engineer at Rackspace at the time, saw its potential as a distributed database for powering cloud applications, and a year later, drew venture backing to cofound what is now DataStax with then-colleague Matt Pfeil.

The biggest source of confusion early on was with Hadoop. Because of some ridiculous historical coincidences, Cassandra got lumped into the Hadoop project, where it still appears on the Apache project page. That implies that Cassandra is an in-kind replacement for HBase. Well, kinda and kinda not. Although both were initially designed to run as online production systems for big data, HBase requires HDFS, YARN, and Zookeeper to run, whereas Cassandra doesn't require Hadoop components and runs on its own cluster. Then there are other architectural differences, such as that HBase runs with Hadoop's hierarchical topology, whereas Cassandra works in more of a peer-to-peer mode.

Comparison to the usual suspects

Hadoop flirtations notwithstanding, how does Cassandra differentiate from the usual NoSQL suspects? We'll start with the biggest differentiator: query language. Cassandra has a query language, CQL, that is much more like SQL than those of most rivals, except Couchbase.

Compared to MongoDB, Cassandra was more write-friendly, but as both databases matured, differences in read and write performance are no longer as stark. Cassandra was initially designed as a tabular database for key-value data (compared to MongoDB's more object-like model), but in time evolved to accommodate JSON documents. There are still basic differences in database topology: Cassandra was designed for higher availability writes with its multi-master architecture, whereas MongoDB uses a single master, but suggests managing sharding for higher availability writes.

Among cloud-native counterparts, Cassandra shares lineage with Amazon DynamoDB. A detailed comparison can be found here. But at a high level, the obvious difference is where they run: DynamoDB only runs in AWS as a managed service (and likewise for Microsoft Azure Cosmos DB on Azure); Cassandra, on the other hand, can run anywhere, but as a managed service, DataStax Managed Cloud Service has only been introduced recently. Cassandra and DynamoDB both let you tune consistency levels -- Cassandra offers five options for consistency while DynamoDB narrows it down to two (eventual or strong).
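To make the idea of tunable consistency concrete, here is a minimal sketch of the replica arithmetic behind it. It assumes the general rule for replicated stores that a read is guaranteed to see the latest write when the read and write replica counts together exceed the replication factor (R + W > N). The function names are hypothetical, and the level names shown are only a subset of what Cassandra accepts; this is an illustration of the trade-off, not driver code.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Map a consistency level name to how many replicas must respond.

    Only a handful of level names are modeled here (an assumption made
    for brevity).
    """
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(read_level: str, write_level: str,
                           replication_factor: int) -> bool:
    """True when any read replica set must overlap any write replica set."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

With three replicas, reading and writing at QUORUM overlaps (2 + 2 > 3), while reading and writing at ONE does not (1 + 1 is not greater than 3), which is the eventual-versus-strong choice in miniature.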

Compared to Microsoft Azure Cosmos DB, the biggest difference is the multi-model support that is core to the Azure offering; by comparison, the commercial version of Cassandra -- DataStax Enterprise -- is just starting on this road, as it is still integrating its graph model.

Are we in a post-relational world?

Given that four NoSQL databases have now made it to the mainstream (based on developer interest charted by DB-Engines), one would think that the matter has been settled about the role that these platforms play. One would be wrong.

There's still healthy debate. On one side, there's the irrational exuberance of being in a post-relational world. Yes, NoSQL databases have become very popular among database developers. And yes, DataStax does have its share of Oracle run-ins, but its wins are going to come from outside of Oracle's core back-office base. Actually, DataStax and Oracle are frenemies, as DataStax Enterprise (DSE) is one of the first third-party databases to become officially supported in the Oracle Public Cloud's bare metal services, but we digress.

Fortuitously, having spoken with Patrick McFadin, the five-stages-of-grief author, we've found his insights to be far more nuanced than his blog post would suggest. But there are many others taking more extreme views based on the notion of big data becoming the mainstream. On the other side, there's the constituency that still believes that NoSQL is overhyped.

Reality is much grayer. The fact that NoSQL databases like Cassandra allow schema to vary does not mean that they lack schema, or that developers should not bother with optimizing the database for specific types of schema. In a NoSQL database, schema still matters and so does table layout. Even if you don't design the data model exactly for the queries that you're going to throw at it, you still need to consider which data the app will touch when laying out the tables.
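Why table layout matters can be shown with a toy in-memory model. The sketch below is not Cassandra's storage engine; the `WideTable` class and its methods are hypothetical names chosen for illustration. It assumes the wide-column idea that rows are grouped under a partition key and kept sorted by a clustering key, so a query that names the partition is a single lookup, while any other query has to walk every partition.

```python
from bisect import insort
from collections import defaultdict


class WideTable:
    """Toy table: rows grouped by partition key, sorted by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition ordered by (clustering_key, value).
        insort(self.partitions[partition_key], (clustering_key, value))

    def read_partition(self, partition_key):
        """Cheap path: one lookup, rows come back already sorted."""
        return list(self.partitions.get(partition_key, []))

    def scan_where(self, predicate):
        """Expensive path: every row in every partition is examined."""
        return [
            (pk, ck, value)
            for pk, rows in self.partitions.items()
            for ck, value in rows
            if predicate(pk, ck, value)
        ]


# A table laid out for the query "show me one user's messages in order".
messages_by_user = WideTable()
messages_by_user.insert("alice", 2, "hi")
messages_by_user.insert("alice", 1, "hello")
messages_by_user.insert("bob", 1, "yo")

alice_inbox = messages_by_user.read_partition("alice")
yo_rows = messages_by_user.scan_where(lambda pk, ck, value: value == "yo")
```

If the application mostly asks for messages by content rather than by user, this layout forces the slow path every time, which is the sense in which the queries should shape the tables.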

Don't count relational out either. If your application or use case requires strict ACID guarantees and data with referential integrity, relational is going to be your choice. If the use case involves complex analytical queries, you have a couple of options. You could go the NoSQL route if you denormalize the data to improve performance, design the application so you don't have to rely on complex table joins, and take advantage of the Spark connectors that are becoming checkbox items with commercial NoSQL databases like DataStax Enterprise. But if the purpose of the database is solely for analytics, NoSQL won't be the right route.


DataStax and Cassandra today

So what gives with Apache Cassandra and DataStax, the company that for most of its history was most closely associated with the database and open source project? It boils down to the nature of the open source project. Unlike MongoDB, which controls the underlying open source project and licenses the database under the AGPL 3 license (which requires developers to contribute back to the community), Cassandra is an official Apache Software Foundation project that is governed by the Apache license.

So DataStax does not own or control Cassandra, and a couple of years ago, stepped back from leadership of the project. DataStax still contributes and maintains a presence on the Cassandra project, but the bulk of its energies are in building the enterprise platform features around it. In essence, DataStax is becoming more of a classic "open core" software company, a strategy that is not all that different from Cloudera's on Hadoop.

With Cassandra at 10, DataStax still embraces the platform but views it as the starting point for additional features. It is reaching out to accommodate analytics and search with Spark connectivity and new search functions that have been added to its CQL query language. Then there is the addition of graph, which came from the 2015 acquisition of Aurelius that brought the leaders of the Apache TinkerPop project to DataStax. While DataStax is still working to fully integrate graph into its implementation of Cassandra, in the DSE 6.0 release, you can load graph and Cassandra tables at the same time onto your cluster. And the company is now meeting cloud frenemies like Amazon head-on by rolling out the DataStax Managed Cloud service on AWS and Azure.

There's a reason that we've been seeing all these tenth anniversaries in the big data space over the past few years. That's because, in the first decade of the 2000s, a backlash formed against the post-Y2K consensus that we were at the end of times, where n-tier was the de facto standard application architecture; .NET and Java were the predominant application development stacks; and relational databases were entrenched as the enterprise standard. Notably, it was Internet companies like Amazon and Google that subsequently overthrew the enterprise IT order: their experiences with the limitations of the post-2000 technology stack gave rise to the innovations that are now hitting middle age.

A decade in, Cassandra is no longer the new kid on the block. But the database has become one of the fixtures of modern operational systems, and the company most associated with it is using it as a jumping-off point to a broader platform.



