As @Marc Claesen mentioned that one of the ways to write cache friendly code is to exploit the structure in which our data is stored. In addition to that another way to write cache friendly code is: change the way our data is stored; then write new code to access the data stored in this new structure.
This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table i.e. row store and column store. In row store as the name suggests the tuples are stored row wise. Lets suppose a table named Product
being stored has 3 attributes i.e. int32_t key, char name[56]
and int32_t price
, so the total size of a tuple is 64
bytes.
We can simulate a very basic row store query execution in main memory by creating an array of Product
structs with size N, where N is the number of rows in table. Such memory layout is also called array of structs. So the struct for Product can be like:
struct Product
{
int32_t key;
char name[56];
int32_t price'
}
/* create an array of structs */
Product* table = new Product[N];
/* now load this array of structs, from a file etc. */
Similarly we can simulate a very basic column store query execution in main memory by creating an 3 arrays of size N, one array for each attribute of the Product
table. Such memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:
/* create separate arrays for each attribute */
int32_t* key = new int32_t[N];
char* name = new char[56*N];
int32_t* price = new int32_t[N];
/* now load these arrays, from a file etc. */
Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product
present in our memory.
Now we move on to the cache friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute. Such as
SELECT SUM(price)
FROM PRODUCT
For the row store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + table[i].price;
For the column store we can convert the above SQL query into
int sum = 0;
for (int i=0; i<N; i++)
sum = sum + price[i];
The code for the column store would be faster than the code for the row layout in this query as it requires only a subset of attributes and in column layout we are doing just that i.e. only accessing the price column.
Suppose that the cache line size is 64
bytes.
In the case of row layout when a cache line is read, the price value of only 1(cacheline_size/product_struct_size = 64/64 = 1
) tuple is read, because our struct size of 64 bytes and it fills our whole cache line, so for every tuple a cache miss occurs in case of a row layout.
In the case of column layout when a cache line is read, the price value of 16(cacheline_size/price_int_size = 64/4 = 16
) tuples is read, because 16 contiguous price values stored in memory are brought into the cache, so for every sixteenth tuple a cache miss ocurs in case of column layout.
So the column layout will be faster in the case of given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such experiment for yourself using the data from TPC-H benchmark, and compare the run times for both the layouts. The wikipedia article on column oriented database systems is also good.
So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in workload and access data from these layouts. In the case of above example we created a column layout and changed our code to compute sum so that it became cache friendly.