Photo by Almos Bechtold on Unsplash

Getting started with the Apache Cassandra database


Article from ADMIN 53/2019
The open source Apache Cassandra database claims to be fail-safe, economical, highly scalable, and easy to manage. A few exercises show whether it lives up to its advertising.

Once upon a time, Facebook set itself the goal of offering its users a search option in their inbox, which meant storing a reverse index of all messages. At that time, however, no database would have been able to do this safely and inexpensively. The MySQL database used until then had reached its limits at a size of 7TB and more than 100 million users.

Two Facebook programmers, Avinash Lakshman (one of the authors of Amazon's DynamoDB) and Prashant Malik, set out to develop a completely new storage system that would hold up under the huge amount of data and high growth rate while remaining fail-safe, fast, and economical [1]. The result was Apache Cassandra [2].

Facebook released Cassandra in 2008 under an open source license, and two former Rackspace employees founded their own company in 2010 to launch a commercial Cassandra offshoot in 2011: DataStax Enterprise (DSE) has several additional features compared with Apache Cassandra and is now used by some of the world's largest companies, including Netflix, Microsoft, Deloitte, The Home Depot, Walgreens, and Facebook and its subsidiary Instagram. (See the "Cassandra Hands-on" box.)

Cassandra Hands-on

In May 2019, the DataStax Accelerate conference, organized by DataStax, the manufacturer of the commercial Cassandra offshoot, provided a number of interesting practical application examples. Yahoo Japan, for example, reported on how its Cassandra cluster for apps, logs, and statistical data grew from 100 nodes in 2016 to 4,700 nodes in 170 clusters today.

However, operating an installation of this size also causes problems, especially with the repair processes required after errors and failures, which often broke down or got stuck in infinite loops. Without further ado, Yahoo Japan developed its own nodetool repair command, which reduced the repair time from 970 to 67 minutes.

The Cassandra operators at Instagram had different problems. In particular, the latencies in their cluster, geographically distributed over several continents, caused them headaches. They created a tool called Akkio that always tries to store data in the geographical vicinity of the requester, which proved possible for around 80 percent of the data. For data retrieved from all geographic regions (e.g., photos of celebrities), a separate partition is available everywhere, which reduced cross-partition traffic by 20 percent, saving a large amount of money given the volumes that Instagram generates.

The wholesaler Metro faced another challenge; it runs more than 750 stores in 35 countries and employs around 1,500,000 people. With Brexit looming, Metro migrated 500 Cassandra nodes in 91 clusters from England to Central Europe within just a few months, during ongoing operations and without downtime, using Cassandra's built-in replication feature.

Several presentations were also given on running Cassandra in containers on Kubernetes or in various public clouds; integration with tools such as analytics, business intelligence, or artificial intelligence; the design of data-driven applications; Cassandra management; and database security.

One of the biggest announcements at the conference was DataStax Constellation, a cloud data platform for easy and fast development and deployment of cloud applications. The platform launched soon after with two new cloud services: DataStax Apache Cassandra as a Service and DataStax Insights [3]. The former provides flexibly scalable Cassandra clusters with consumption-based billing, backed by the stability and performance enhancements of DataStax Enterprise. DataStax Insights is a hosted performance-monitoring service that watches the cluster and makes tuning recommendations.

What Is Cassandra?

Cassandra is a fault-tolerant, distributed, column-oriented NoSQL database implemented in Java that uses the wide column store layout. This type of database relies on column families, groups of related columns that resemble relational tables. Unlike the relational model, columns can be added to or deleted from individual rows at any time.

Column families in turn can contain groups of columns and are then known as super column families. In the wide column store universe, the keyspace comes closest to the concept of the schema in the relational world. The keyspace can contain several column families as well as functions or views. In general, each application has one keyspace. The cluster in turn houses the keyspaces.

Cassandra uses every node in the cluster for both reading and writing; together, the nodes form a logical ring and can be distributed across different data centers. Master and slave roles do not exist, and the node computers can be relatively inexpensive off-the-shelf hardware.

For each data item, a consistent hash algorithm determines its position in the ring; Cassandra traverses the ring clockwise until it finds the first node whose position is greater than the data item's position and makes that node responsible for the item. This node, known as the coordinator, also replicates the data item to a configurable number of other nodes, following adjustable policies that can, for example, ensure that replication always includes nodes in remote data centers.

Because each node is only responsible for the section of the ring up to its predecessor, the failure or insertion of a node never affects the entire ring, only the immediate neighbors of the affected node.
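The replication policy is defined per keyspace when it is created. A minimal sketch, assuming a hypothetical keyspace named reviews and two illustrative data center names (all identifiers here are placeholders, not from the article):

```sql
-- Hypothetical keyspace: names and replica counts are illustrative.
CREATE KEYSPACE reviews
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc-europe': 3,   -- three replicas in the European data center
    'dc-us': 2        -- two replicas in the US data center
  };
```

With NetworkTopologyStrategy, the replica count is set independently for each data center, which is how a setup like Metro's cross-continent replication is configured.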

Cassandra cannot achieve all desirable goals at the same time. As is well known, in a distributed system, only two of the three requirements of the CAP theorem – consistency (C), availability (A), and partition tolerance (P, failure tolerance) – can be guaranteed (Figure 1).

Figure 1: Only two of the three CAP properties can be guaranteed.

Cassandra vouches for A and P and provides only eventual consistency: Simultaneous identical queries on two different nodes can return different results, because data to be written has not been stored immediately at all locations. Instead, a write operation is confirmed as soon as a quorum of the replicas to which the request was forwarded confirms it. The results will eventually be consistent again but might not be at the time of the query. Users have to live with that.
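How strict this guarantee is can be tuned per request. In cqlsh, for example, the CONSISTENCY command sets the level for subsequent statements; the keyspace and table names below are placeholders:

```sql
-- Set the consistency level for the following requests in this cqlsh session.
CONSISTENCY QUORUM;   -- a majority of replicas must acknowledge
SELECT * FROM reviews.titles WHERE publisher = 'Hanser';

CONSISTENCY ONE;      -- fastest, but weakest guarantee
```

Higher levels such as QUORUM or ALL trade latency for stronger consistency; ONE favors speed and availability.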

The number of nodes can be increased at any time to prevent bottlenecks, and a number of nodes can always fail without affecting the database function.

Bookkeeping for Books

The first example is a simple table containing books reviewed in the German Linux-Magazin since the beginning of 2018. For each title, the table stores the ISBN, the year of publication, the title, the publisher, the names of up to three authors, the edition, and the issue in which the review appeared. If such a table existed in MySQL, it could be sorted very easily (e.g., by the issue number; Listing 1). A somewhat more complex question, such as the number of reviewed titles per publisher (Listing 2), could be solved almost as easily.

Listing 1

Sorted List of Titles

# SELECT Author1, Title, Publisher, Issue FROM books ORDER BY Issue;
| Author1                | Title                            | Publisher   | Issue |
| Tim Philipp Schäfers   | Hacking im Web                   | Franzis     | 18/01 |
| Richard Eisenmenger    | Nur noch dieses Level!           | Rheinwerk   | 18/01 |
| Joachim Zuckarelli     | Statistik mit R                  | O'Reilly    | 18/02 |
| Sebastian Erlhofer     | Website-Konzeption und Relaunch  | Rheinwerk   | 18/02 |

Listing 2

Reviews per Publisher

# SELECT Publisher, COUNT(*) AS 'Number' FROM books GROUP BY Publisher ORDER BY Number DESC;
| Publisher | Number |
| Rheinwerk |     10 |
| O'Reilly  |      6 |
| Wiley     |      5 |
| Hanser    |      5 |
| Mitp      |      4 |
| C.H. Beck |      2 |
| Tintal    |      1 |
| Franzis   |      1 |
| Dpunkt    |      1 |
| Apress    |      1 |

Not so in Cassandra. Although the database has a SQL-flavored query language, CQL, it is strongly tailored for use in a partitioned, distributed NoSQL database designed to handle petabytes of data. Language constructs that require, for example, that a filter criterion be compared with each individual row could lead to unpredictable performance problems. Restrictions therefore apply to their use, and subqueries do not exist at all.

Searches with wildcards are also missing. Worse still, joins, which connect tables in the relational world, do not exist in the Cassandra realm. In MySQL, for example, the database designer would have moved the publisher names to a separate table to avoid storing them redundantly in the main table; the main table would only hold an ID that points to the publisher table. You can't do that with Cassandra; instead, redundant data storage is consciously accepted: Anything a query wants to investigate must reside in a single column family.

Additionally, the aggregation functions known from SQL (e.g., AVG(), SUM(), COUNT(), MAX(), and MIN()) are not available per se, although they can be retrofitted as user-defined functions. It is more common, however, to create a separate table with a counter that is updated on each INSERT or DELETE, rather than relying on a COUNT() function.

Cassandra requires a completely different data modeling technique compared with relational databases because of these special features. Whereas database developers in the relational world rely on the table as the basic element and express relationships between tables through joins, Cassandra's starting point is the query.

Often a table is designed to support precisely one query. If it is not possible to merge related data in a table, the application logic must establish the connection.
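Sketched in CQL, the table-per-query approach simply duplicates the book data under a second primary key. The table below is hypothetical and only assumes the example schema; it serves the query "all titles reviewed in a given issue":

```sql
-- Hypothetical second copy of the book data, keyed for a different query.
CREATE TABLE titles_by_issue (
  issue     TEXT,
  isbn      TEXT,
  title     TEXT,
  publisher TEXT,
  authors   MAP<INT,TEXT>,
  PRIMARY KEY (issue, isbn)   -- partitioned by issue, one row per ISBN
);
```

The application then writes every review to both tables; keeping the copies in sync is its responsibility, not the database's.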

For the book reviews example, this could mean that the designer also creates a table with one count column per publisher to determine the number of reviewed titles from that publisher without having to rely on COUNT(). The application would then be responsible for updating the count table each time a title is inserted or deleted.
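One way to sketch such a count table is Cassandra's COUNTER type, which supports atomic increments; the table and column names here are hypothetical:

```sql
-- Hypothetical counter table; counter columns cannot be mixed with
-- regular data columns and are modified only via UPDATE.
CREATE TABLE reviews_per_publisher (
  publisher TEXT PRIMARY KEY,
  reviewed  COUNTER
);

-- The application updates the counter alongside each insert or delete:
UPDATE reviews_per_publisher SET reviewed = reviewed + 1
  WHERE publisher = 'Hanser';
```

A delete would correspondingly decrement the counter with `reviewed = reviewed - 1`.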

Filtering on an ordinary column, as in the example, does work as long as the table only contains a small amount of data. The first attempt, however, results in an error message that points to possible performance effects that are difficult to calculate. A user who is sure that this cannot happen in their case therefore needs to allow filtering explicitly with ALLOW FILTERING (Listing 3).

Listing 3


# SELECT Authors, Title FROM titles WHERE Publisher='Mitp' ALLOW FILTERING;
 Authors                              | Title
                    {1: 'Ralf Jesse'} | Embedded Linux mit Raspberry Pi und Co.
 {1: 'F. Chollet', 2: 'J.J. Allaire'} |           Deep Learning mit R und Keras
            {1: 'Aditya Y. Bhargava'} |                    Algorithmen kapieren
              {1: 'Winfried Seimert'} |                                Inkscape

Books can have a different number of authors. In the relational model – and in Cassandra, as well – a number of fields could be reserved for authors (e.g., three: Author1, Author2, and Author3). If the book has only one author, Author2 and Author3 contain null values. But if the book has four authors, one of them has to be dropped.

Cassandra has an elegant solution to this problem: maps, which are fields that can contain any number of key-value pairs of the same type. A map can be used to define a field for the authors, as in Listing 4, which also demonstrates inserts into such a table and a select.

Listing 4


# CREATE TABLE titles
(
  isbn        TEXT,
  year        INT,
  title       TEXT,
  publisher   TEXT,
  authors     MAP<INT,TEXT>,
  circulation INT,
  issue       TEXT,
  PRIMARY KEY(publisher, issue, isbn)
);
# INSERT INTO titles(isbn, year, title, publisher, authors, circulation, issue) VALUES('978-3836244091', 2017, 'Nur noch dieses Level!', 'Rheinwerk', {1 : 'Richard Eisenmenger'}, 1, '18/01');
# INSERT INTO titles(isbn, year, title, publisher, authors, circulation, issue) VALUES('978-3836262477', 2018, 'Objektorientierte Programmierung', 'Rheinwerk', {1 : 'Bernhard Lahres', 2 : 'Gregor Rayman', 3 : 'Stefan Strich'}, 4, '18/09');
# SELECT Title FROM titles WHERE authors CONTAINS 'Stefan Strich' ALLOW FILTERING;

 Title
----------------------------------
 Objektorientierte Programmierung

Everything Is Fine

Another difference that causes stomach pain for SQL users is sorting. Before you can sort, you first need to look at the different keys of a Cassandra table:

  • If a single column is defined as the primary key, this key is also the partition key, which determines the location of the data (i.e., it picks the node of the distributed database on which the data ends up). As already mentioned, an algorithm uses a hash value calculated from the partition key to determine a token that is permanently assigned to a specific node.
  • For a compound primary key, the partition key is the first component (i.e., the first expression in the key definition).
  • The partition key can also comprise several columns, which in this case have to be bracketed in the definition.
  • The second component of the compound primary key is the clustering key, which determines the sort order within the partition. Data can only be returned sorted by this criterion.

Therefore, where required, you have to construct the primary key when creating the table such that it comprises several parts, and its second part, the clustering key, reflects the sort sequence that you will need later. Moreover, ORDER BY can only be used if an EQ or IN condition restricts the partition key.
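For illustration, a hypothetical variant of the book table shows both features at once: double parentheses turn (publisher, year) into a composite partition key, and the clustering order is declared explicitly (table name and column choice are assumptions, not from the article):

```sql
-- Hypothetical table: (publisher, year) is the composite partition key;
-- issue and isbn are clustering columns that sort rows within a partition.
CREATE TABLE titles_by_publisher_year (
  publisher TEXT,
  year      INT,
  issue     TEXT,
  isbn      TEXT,
  title     TEXT,
  PRIMARY KEY ((publisher, year), issue, isbn)
) WITH CLUSTERING ORDER BY (issue DESC);
```

Rows within one (publisher, year) partition are then stored pre-sorted by issue in descending order, so the most recent reviews come back first without an explicit ORDER BY.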

To get something similar to the MySQL example (Listing 1), first define the primary key so that the issue column becomes the leading clustering column:

PRIMARY KEY (publisher, issue, isbn)

After that, an EQ condition is needed to restrict the partition key (the publisher column in this case); then, the result can be sorted by issue number (Listing 5).

Listing 5

Sorted Output

 # SELECT authors, title, publisher, issue FROM titles WHERE publisher='Hanser' ORDER BY issue ALLOW FILTERING;
authors                | title                                   | publisher | issue
{1: 'Dirk Louis',
 2: 'Peter Müller'}    |Java                                     | Hanser    | 18/08
{1: 'Herbert Dowalil'} |Grundlagen des modularen Softwareentwurfs| Hanser    | 18/08
{1: 'Jonas Freiknecht',
 2: 'Stefan Papp'}     |Big Data in der Praxis                   | Hanser    | 18/10
{1: 'Jörg Frochte'}    |Maschinelles Lernen                      | Hanser    | 18/11
{1: 'Heiko Kalista'}   |Python 3                                 | Hanser    | 19/05

In any case, Cassandra users, when designing the database, have to consider carefully which queries the database will need to answer later. The freedom to think about the evaluation afterward, unlike with the relational model, does not exist.
