[database] Difference between partition key, composite key and clustering key in Cassandra?

I have been reading articles around the net to understand the differences between the following key types. But it just seems hard for me to grasp. Examples will definitely help make understanding better.

primary key,
partition key, 
composite key 
clustering key

This question is related to database cassandra cql

The answer is


Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.

  • A primary key uniquely identifies a row.
  • A composite key is a key formed from multiple columns.
  • A partition key is the primary lookup to find a set of rows, i.e. a partition.
  • A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).

Examples:

  • PRIMARY KEY (a): The partition key is a.
  • PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
  • PRIMARY KEY ((a, b)): The composite partition key is (a, b).
  • PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
  • PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
  • PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).

In brief sense:

Partition Key is nothing but identification for a row, that identification most of the times is the single column (called Primary Key) sometimes a combination of multiple columns (called Composite Partition Key).

Cluster key is nothing but Indexing & Sorting. Cluster keys depend on few things:

  1. What columns you use in where clause except primary key columns.

  2. If you have very large records then on what concern I can divide the date for easy management. Example, I have data of 1million a county population records. So for easy management, I cluster data based on state and after pincode and so on.


Worth to note, you will probably use those lots more than in similar concepts in relational world (composite keys).

Example - suppose you have to find last N users who recently joined user group X. How would you do this efficiently given reads are predominant in this case? Like that (from offical Cassandra guide):

CREATE TABLE group_join_dates (
    groupname text,
    joined timeuuid,
    join_date text,
    username text,
    email text,
    age int,
    PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC)

Here, partitioning key is compound itself and the clustering key is a joined date. The reason why a clustering key is a join date is that results are already sorted (and stored, which makes lookups fast). But why do we use a compound key for partitioning key? Because we always want to read as few partitions as possible. How putting join_date in there helps? Now users from the same group and the same join date will reside in a single partition! This means we will always read as few partitions as possible (first start with the newest, then move to older and so on, rather than jumping between them).

In fact, in extreme cases you would also need to use the hash of a join_date rather than a join_date alone - so that if you query for last 3 days often those share the same hash and therefore are available from same partition!


In database design, a compound key is a set of superkeys that is not minimal.

A composite key is a set that contains a compound key and at least one attribute that is not a superkey

Given table: EMPLOYEES {employee_id, firstname, surname}

Possible superkeys are:

{employee_id}
{employee_id, firstname}
{employee_id, firstname, surname}

{employee_id} is the only minimal superkey, which also makes it the only candidate key--given that {firstname} and {surname} do not guarantee uniqueness. Since a primary key is defined as a chosen candidate key, and only one candidate key exists in this example, {employee_id} is the minimal superkey, the only candidate key, and the only possible primary key.

The exhaustive list of compound keys is:

{employee_id, firstname}
{employee_id, surname}
{employee_id, firstname, surname}

The only composite key is {employee_id, firstname, surname} since that key contains a compound key ({employee_id,firstname}) and an attribute that is not a superkey ({surname}).


In cassandra , the difference between primary key,partition key,composite key, clustering key always makes some confusion.. So I am going to explain below and co relate to each others. We use CQL (Cassandra Query Language) for Cassandra database access. Note:- Answer is as per updated version of Cassandra. Primary Key :-

In cassandra there are 2 different way to use primary Key .

CREATE TABLE Cass (
    id int PRIMARY KEY,
    name text 
);

Create Table Cass (
   id int,
   name text,
   PRIMARY KEY(id) 
);

In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key having property that all the rows sharing the same partition key (even across table in fact) are stored on the same physical node. Also, insertion/update/deletion on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns forms the partition key.

Partitioning and Clustering The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.

CREATE TABLE device_check (
  device_id   int,
  checked_at  timestamp,
  is_power    boolean,
  is_locked   boolean,
  PRIMARY KEY (device_id, checked_at)
);

Here device_id is partition key and checked_at is cluster_key.

We can have multiple cluster key as well as partition key too which depends on declaration.


The primary key in Cassandra usually consists of two parts - Partition key and Clustering columns.

primary_key((partition_key), clustering_col )

Partition key - The first part of the primary key. The main aim of a partition key is to identify the node which stores the particular row.

CREATE TABLE phone_book ( phone_num int, name text, age int, city text, PRIMARY KEY ((phone_num, name), age);

Here, (phone_num, name) is the partition key. While inserting the data, the hash value of the partition key is generated and this value decides which node the row should go into.

Consider a 4 node cluster, each node has a range of hash values it can store. (Write) INSERT INTO phone_book VALUES (7826573732, ‘Joey’, 25, ‘New York’);

Now, the hash value of the partition key is calculated by Cassandra partitioner. say, hash value(7826573732, ‘Joey’) ? 12 , now, this row will be inserted in Node C.

(Read) SELECT * FROM phone_book WHERE phone_num=7826573732 and name=’Joey’;

Now, again the hash value of the partition key (7826573732,’Joey’) is calculated, which is 12 in our case which resides in Node C, from which the read is done.

  1. Clustering columns - Second part of the primary key. The main purpose of having clustering columns is to store the data in a sorted order. By default, the order is ascending.

There can be more than one partition key and clustering columns in a primary key depending on the query you are solving.

primary_key((pk1, pk2), col 1,col2)


Primary Key: Is composed of partition key(s) [and optional clustering keys(or columns)]
Partition Key: The hash value of Partition key is used to determine the specific node in a cluster to store the data
Clustering Key: Is used to sort the data in each of the partitions(or responsible node and it's replicas)

Compound Primary Key: As said above, the clustering keys are optional in a Primary Key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a Compound primary key.

Composite Partition Key: Using just one column as a partition key, might result in wide row issues (depends on use case/data modeling). Hence the partition key is sometimes specified as a combination of more than one column.

Regarding confusion of which one is mandatory, which one can be skipped etc. in a query, trying to imagine Cassandra as a giant HashMap helps. So in a HashMap, you can't retrieve the values without the Key.
Here, the Partition keys play the role of that key. So each query needs to have them specified. Without which Cassandra won't know which node to search for.
The clustering keys (columns, which are optional) help in further narrowing your query search after Cassandra finds out the specific node(and it's replicas) responsible for that specific Partition key.