What is "cache-friendly" code?

Question

What is the difference between "cache-unfriendly" code and "cache-friendly" code? How can I make sure I write cache-efficient code?

User · Answer

Optimizing cache usage largely comes down to two factors.

Locality of Reference

The first factor (to which others have already alluded) is locality of reference. Locality of reference really has two dimensions though: space and time.

Spatial

The spatial dimension also comes down to two things: first, we want to pack our information densely, so more information will fit in that limited memory. This means (for example) that you need a major improvement in computational complexity to justify data structures based on small nodes joined by pointers.

Second, we want information that will be processed together also located together. A typical cache works in "lines", which means when you access some information, other information at nearby addresses will be loaded into the cache with the part we touched. For example, when I touch one byte, the cache might load 128 or 256 bytes near that one. To take advantage of that, you generally want the data arranged to maximize the likelihood that you'll also use that other data that was loaded at the same time.

For just a really trivial example, this can mean that a linear search can be much more competitive with a binary search than you'd expect. Once you've loaded one item from a cache line, using the rest of the data in that cache line is almost free. A binary search becomes noticeably faster only when the data is large enough that the binary search reduces the number of cache lines you access.

Time

The time dimension means that when you do some operations on some data, you want (as much as possible) to do all the operations on that data at once.

Since you've tagged this as C++, I'll point to a classic example of a relatively cache-unfriendly design: std::valarray. valarray overloads most arithmetic operators, so I can (for example) say a = b + c + d; (where a, b, c and d are all valarrays) to do element-wise addition of those arrays.

The problem with this is that it walks through one pair of inputs, puts results in a temporary, walks through another pair of inputs, and so on. With a lot of data, the result from one computation may disappear from the cache before it's used in the next computation, so we end up reading (and writing) the data repeatedly before we get our final result. If each element of the final result will be something like (a[n] + b[n]) * (c[n] + d[n]);, we'd generally prefer to read each a[n], b[n], c[n] and d[n] once, do the computation, write the result, increment n and repeat 'til we're done.²
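For illustration only, a single-pass version might look something like the sketch below. The function name `fused` and the choice of `std::vector<double>` are mine, assumed for the example; this is not how valarray is implemented, just the "read each input once, write each result once" idea from the paragraph above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: computes (a[n] + b[n]) * (c[n] + d[n]) in one pass,
// so each input element is read once and each result is written once,
// with no intermediate temporaries the size of the inputs.
std::vector<double> fused(const std::vector<double>& a,
                          const std::vector<double>& b,
                          const std::vector<double>& c,
                          const std::vector<double>& d)
{
    assert(a.size() == b.size() && b.size() == c.size() && c.size() == d.size());
    std::vector<double> result(a.size());
    for (std::size_t n = 0; n < a.size(); ++n)
        result[n] = (a[n] + b[n]) * (c[n] + d[n]);
    return result;
}
```

Calling `fused(a, b, c, d)` on four equally sized vectors touches each element of each input exactly once, which is the access pattern the cache prefers.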

Line Sharing

The second major factor is avoiding line sharing. To understand this, we probably need to back up and look a little at how caches are organized. The simplest form of cache is direct mapped. This means one address in main memory can only be stored in one specific spot in the cache. If we're using two data items that map to the same spot in the cache, it works badly -- each time we use one data item, the other has to be flushed from the cache to make room for it. The rest of the cache might be empty, but those items won't use other parts of the cache.

To prevent this, most caches are what are called "set associative". For example, in a 4-way set-associative cache, any item from main memory can be stored at any of 4 different places in the cache. So, when the cache is going to load an item, it looks for the least recently used³ item among those four, flushes it to main memory (if it has been modified), and loads the new item in its place.

The problem is probably fairly obvious: for a direct-mapped cache, two operands that happen to map to the same cache location can lead to bad behavior. An N-way set-associative cache increases the number from 2 to N+1. Organizing a cache into more "ways" takes extra circuitry and generally runs slower, so (for example) an 8192-way set associative cache is rarely a good solution either.
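To make the mapping a little more concrete, here's a minimal sketch of how an address might be assigned to a set. The geometry (64-byte lines, 64 sets) and the plain modulo mapping are assumptions picked for illustration; real processors differ, and some hash the address bits instead. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Index of the set an address falls into, under an assumed simple mapping:
// drop the offset within the line, then take the line number modulo the
// number of sets. The default geometry is illustrative, not any real CPU's.
std::uint64_t set_index(std::uint64_t address,
                        std::uint64_t line_size = 64,
                        std::uint64_t num_sets = 64)
{
    assert(line_size > 0 && num_sets > 0);
    return (address / line_size) % num_sets;
}

// Two addresses compete for the same set when they lie in different
// cache lines that nevertheless map to the same set index.
bool conflicts(std::uint64_t a, std::uint64_t b,
               std::uint64_t line_size = 64,
               std::uint64_t num_sets = 64)
{
    return a / line_size != b / line_size &&
           set_index(a, line_size, num_sets) == set_index(b, line_size, num_sets);
}
```

With these assumed numbers, addresses exactly 4096 bytes apart land in the same set, so a loop that strides through memory in steps of 4096 would keep competing for the same N slots no matter how large the rest of the cache is.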

Ultimately, this factor is more difficult to control in portable code though. Your control over where your data is placed is usually fairly limited. Worse, the exact mapping from address to cache varies between otherwise similar processors. In some cases, however, it can be worth doing things like allocating a large buffer, and then using only parts of what you allocated to ensure against data sharing the same cache lines (even though you'll probably need to detect the exact processor and act accordingly to do this).

False Sharing

There's another, related item called "false sharing". This arises in a multiprocessor or multicore system, where two (or more) processors/cores have data that's separate, but falls in the same cache line. This forces the two processors/cores to coordinate their access to the data, even though each has its own, separate data item. Especially if the two modify the data in alternation, this can lead to a massive slowdown as the data has to be constantly shuttled between the processors. This can't easily be cured by organizing the cache into more "ways" or anything like that either. The primary way to prevent it is to ensure that two threads rarely (preferably never) modify data that could possibly be in the same cache line (with the same caveats about difficulty of controlling the addresses at which data is allocated).
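One common way to keep per-thread data out of each other's cache lines is to pad and align each item to a full line. The sketch below assumes a 64-byte line (common, but not universal); the type and function names are made up for the example. Where the implementation provides it, C++17's `std::hardware_destructive_interference_size` from `<new>` can stand in for the hard-coded constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is typical but must be checked per target.
constexpr std::size_t kAssumedLineSize = 64;

// alignas forces each counter to start on a line boundary, and sizeof is
// rounded up to the alignment, so adjacent array elements never share a line.
struct alignas(kAssumedLineSize) PaddedCounter {
    long value = 0;
};

// Intended use: thread i only ever modifies counters[i].
struct PerThreadCounters {
    PaddedCounter counters[4];
};

// True when the two objects start in different (assumed-size) cache lines.
bool on_different_lines(const void* p, const void* q)
{
    assert(p != nullptr && q != nullptr);
    auto a = reinterpret_cast<std::uintptr_t>(p);
    auto b = reinterpret_cast<std::uintptr_t>(q);
    return a / kAssumedLineSize != b / kAssumedLineSize;
}
```

The cost is memory: each `long` now occupies a whole line. That is usually a good trade for a handful of hot per-thread counters and a poor one for large arrays.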

¹ Those who know C++ well might wonder if this is open to optimization via something like expression templates. I'm pretty sure the answer is that yes, it could be done and if it was, it would probably be a pretty substantial win. I'm not aware of anybody having done so, however, and given how little valarray gets used, I'd be at least a little surprised to see anybody do so either.
² In case anybody wonders how valarray (designed specifically for performance) could be this badly wrong, it comes down to one thing: it was really designed for machines like the older Crays, that used fast main memory and no cache. For them, this really was a nearly ideal design.
³ Yes, I'm simplifying: most caches don't really measure the least recently used item precisely, but they use some heuristic that's intended to be close to that without having to keep a full time-stamp for each access.

User · Answer

Welcome to the world of Data Oriented Design. The basic mantra is to Sort, Eliminate Branches, Batch, Eliminate virtual calls - all steps towards better locality.

Since you tagged the question with C++, here's the obligatory typical C++ Bullshit. Tony Albrecht's Pitfalls of Object Oriented Programming is also a great introduction into the subject.

User · Answer

It needs to be clarified that not only data should be cache-friendly; it is just as important for the code. This is in addition to branch prediction, instruction reordering, avoiding actual divisions and other techniques.

Typically, the denser the code, the fewer cache lines will be required to store it. This results in more cache lines being available for data.

The code should not call functions all over the place, as they typically will require one or more cache lines of their own, resulting in fewer cache lines for data.

A function should begin at a cache line-alignment-friendly address. Though there are (gcc) compiler switches for this, be aware that if the functions are very short it might be wasteful for each one to occupy an entire cache line. For example, if three of the most often used functions fit inside one 64-byte cache line, this is less wasteful than if each one has its own line, which would leave two fewer cache lines available for other usage. A typical alignment value could be 32 or 16.

So spend some extra time to make the code dense. Test different constructs, then compile and review the generated code size and profile.

User · Answer

Just piling on: the classic example of cache-unfriendly versus cache-friendly code is the "cache blocking" of matrix multiply.

Naive matrix multiply looks like:

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            dest[i][j] = 0;
            for (int k = 0; k < N; k++) {
                dest[i][j] += src1[i][k] * src2[k][j];
            }
        }
    }

If N is large, e.g. if N * sizeof(elemType) is greater than the cache size, then every single access to src2[k][j] will be a cache miss.

There are many different ways of optimizing this for a cache. Here's a very simple example: instead of reading one item per cache line in the inner loop, use all of the items:

    int itemsPerCacheLine = CacheLineSize / sizeof(elemType);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += itemsPerCacheLine) {
            // Clear one cache line's worth of outputs.
            for (int jj = 0; jj < itemsPerCacheLine; jj++) {
                dest[i][j+jj] = 0;
            }
            for (int k = 0; k < N; k++) {
                // Use every item of the src2 cache line before moving on.
                for (int jj = 0; jj < itemsPerCacheLine; jj++) {
                    dest[i][j+jj] += src1[i][k] * src2[k][j+jj];
                }
            }
        }
    }

If the cache line size is 64 bytes, and we are operating on 32-bit (4-byte) floats, then there are 16 items per cache line. And the number of cache misses via just this simple transformation is reduced approximately 16-fold.

Fancier transformations operate on 2D tiles, optimize for multiple caches (L1, L2, TLB), and so on.

Some results of googling "cache blocking":

http://stumptown.cc.gt.atl.ga.us/cse6230-hpcta-fa11/slides/11a-matmul-goto.pdf
http://software.intel.com/en-us/articles/cache-blocking-techniques

A nice video animation of an optimized cache blocking algorithm:

http://www.youtube.com/watch?v=IFWgwGMMrh0

Loop tiling is very closely related:

http://en.wikipedia.org/wiki/Loop_tiling
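For anyone who wants to try the two loop orders side by side, here is a self-contained sketch of both. It is an adaptation under stated assumptions rather than the exact code above: matrices are flat row-major `std::vector<float>` buffers, the cache line is assumed to be 64 bytes, N must be a multiple of the items per line, and the names `multiply_naive` and `multiply_by_line` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage, N * N elements

// Straightforward triple loop: the inner loop walks down a column of src2,
// touching a different cache line on every iteration when N is large.
void multiply_naive(const Matrix& src1, const Matrix& src2, Matrix& dest, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                sum += src1[i * N + k] * src2[k * N + j];
            dest[i * N + j] = sum;
        }
}

// Processes one cache line's worth of output columns at a time, so every
// element of each src2 line that gets loaded is actually used.
void multiply_by_line(const Matrix& src1, const Matrix& src2, Matrix& dest, std::size_t N)
{
    constexpr std::size_t items_per_line = 64 / sizeof(float);  // assumed 64-byte line
    assert(N % items_per_line == 0);
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; j += items_per_line) {
            for (std::size_t jj = 0; jj < items_per_line; ++jj)
                dest[i * N + j + jj] = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                for (std::size_t jj = 0; jj < items_per_line; ++jj)
                    dest[i * N + j + jj] += src1[i * N + k] * src2[k * N + j + jj];
        }
}
```

Both functions accumulate over k in the same order, so they should produce identical results; only the memory access pattern differs. Any speedup has to be measured on the target machine.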

User · Answer

Processors today work with many levels of cascading memory areas. So the CPU will have a bunch of memory that is on the CPU chip itself. It has very fast access to this memory. There are different levels of cache, each one slower to access (and larger) than the one before it, until you get to system memory, which is not on the CPU and is relatively much slower to access.

Logically, to the CPU's instruction set you just refer to memory addresses in a giant virtual address space. When you access a single memory address, the CPU will go fetch it. In the old days it would fetch just that single address. But today the CPU will fetch a bunch of memory around the bit you asked for, and copy it into the cache. It assumes that if you asked for a particular address, it is highly likely that you are going to ask for an address nearby very soon. For example, if you were copying a buffer you would read and write from consecutive addresses, one right after the other.

So today when you fetch an address, it checks the first level of cache to see if it already read that address into cache. If it doesn't find it, then this is a cache miss and it has to go out to the next level of cache to find it, until it eventually has to go out into main memory.

Cache-friendly code tries to keep accesses close together in memory so that you minimize cache misses.

So an example would be: imagine you wanted to copy a giant 2-dimensional table. It is organized with each row consecutive in memory, and one row following the next right after.

If you copied the elements one row at a time from left to right, that would be cache-friendly. If you decided to copy the table one column at a time, you would copy the exact same amount of memory, but it would be cache-unfriendly.

User · Answer

Preliminaries

On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have fewer than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. Bridging this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.

(The analogy is: cache memory is to system memory as system memory is to hard disk storage. Hard disk storage is super cheap but very slow.)

Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.

Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance, since every cache miss results in fetching data from RAM (or worse ...), which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.

In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant for increasing performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!

There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:

Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.

If you are into videos, I strongly recommend having a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).

Slides about memory optimization by Christer Ericson (director of technology @ Sony).

LWN.net's article "What every programmer should know about memory".

Main concepts for cache-friendly code

A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?

The following particular aspects are of high importance to optimize caching:

Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.

Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for, because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.

Use appropriate c++ containers

A simple example of cache-friendly versus cache-unfriendly is c++'s std::vector versus std::list. Elements of a std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a std::list, which stores its content all over the place. This is due to spatial locality.

A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).

Don't neglect the cache in data structure and algorithm design

Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).

Know and exploit the implicit structure of data

Another simple example, which many people in the field sometimes forget, is column-major (ex. fortran, matlab) vs. row-major ordering (ex. c, c++) for storing two-dimensional arrays. For example, consider the following matrix:

    1 2
    3 4

In row-major ordering, this is stored in memory as 1 2 3 4; in column-major ordering, this would be stored as 1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.

When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).

For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (let's call it M):

Exploiting the ordering (e.g. changing column index first in c++):

    M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
    = 1 + 2 + 3 + 4
    --> 2 cache hits, 2 memory accesses

Not exploiting the ordering (e.g. changing row index first in c++):

    M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
    = 1 + 3 + 2 + 4
    --> 0 cache hits, 4 memory accesses

In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice, the performance difference can be much larger.

Avoid unpredictable branches

Modern architectures feature pipelines, and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.

This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?

Avoid virtual functions

In the context of c++, virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?

Common problems

A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?

An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.
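The toy single-line cache from the 2x2 example can be checked with a few lines of code. This is only a sketch of that simplified model (one cache line holding 2 consecutive elements), not of any real cache, and every name in it is made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct AccessStats {
    int hits = 0;
    int misses = 0;
};

// Toy model: the cache is a single line holding `line_elems` consecutive
// elements. An access hits only if it falls in the currently loaded line.
AccessStats simulate(const std::vector<std::size_t>& offsets, std::size_t line_elems = 2)
{
    assert(line_elems > 0);
    AccessStats stats;
    bool have_line = false;
    std::size_t cached_line = 0;
    for (std::size_t offset : offsets) {
        const std::size_t line = offset / line_elems;
        if (have_line && line == cached_line) {
            ++stats.hits;
        } else {
            ++stats.misses;  // fetch from memory and replace the cached line
            cached_line = line;
            have_line = true;
        }
    }
    return stats;
}

// Memory offsets visited when walking a row-major matrix row by row.
std::vector<std::size_t> row_order(std::size_t rows, std::size_t cols)
{
    std::vector<std::size_t> offsets;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            offsets.push_back(r * cols + c);
    return offsets;
}

// Memory offsets visited when walking the same matrix column by column.
std::vector<std::size_t> column_order(std::size_t rows, std::size_t cols)
{
    std::vector<std::size_t> offsets;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            offsets.push_back(r * cols + c);
    return offsets;
}
```

For the 2x2 matrix, `simulate(row_order(2, 2))` gives 2 hits and 2 misses, while `simulate(column_order(2, 2))` gives 0 hits and 4 misses, matching the hand count in the example.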

User · Answer

As @Marc Claesen mentioned, one of the ways to write cache-friendly code is to exploit the structure in which our data is stored. In addition to that, another way to write cache-friendly code is: change the way our data is stored, then write new code to access the data stored in this new structure.

This makes sense in the case of how database systems linearize the tuples of a table and store them. There are two basic ways to store the tuples of a table, i.e. row store and column store. In row store, as the name suggests, the tuples are stored row-wise. Let's suppose a table named Product being stored has 3 attributes, i.e. int32_t key, char name[56] and int32_t price, so the total size of a tuple is 64 bytes.

We can simulate a very basic row store query execution in main memory by creating an array of Product structs with size N, where N is the number of rows in the table. Such a memory layout is also called array of structs. So the struct for Product can be like:

    #include <cstdint>

    struct Product
    {
        int32_t key;
        char name[56];
        int32_t price;
    };

    /* create an array of structs */
    Product* table = new Product[N];
    /* now load this array of structs, from a file etc. */

Similarly, we can simulate a very basic column store query execution in main memory by creating 3 arrays of size N, one array for each attribute of the Product table. Such a memory layout is also called struct of arrays. So the 3 arrays for each attribute of Product can be like:

    /* create separate arrays for each attribute */
    int32_t* key = new int32_t[N];
    char* name = new char[56*N];
    int32_t* price = new int32_t[N];
    /* now load these arrays, from a file etc. */

Now after loading both the array of structs (Row Layout) and the 3 separate arrays (Column Layout), we have row store and column store on our table Product present in our memory.

Now we move on to the cache-friendly code part. Suppose that the workload on our table is such that we have an aggregation query on the price attribute, such as:

    SELECT SUM(price)
    FROM PRODUCT

For the row store we can convert the above SQL query into:

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum = sum + table[i].price;

For the column store we can convert the above SQL query into:

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum = sum + price[i];

The code for the column store would be faster than the code for the row layout in this query, as it requires only a subset of attributes, and in the column layout we are doing just that, i.e. only accessing the price column.

Suppose that the cache line size is 64 bytes.

In the case of row layout, when a cache line is read, the price value of only 1 (cacheline_size/product_struct_size = 64/64 = 1) tuple is read, because our struct size is 64 bytes and it fills our whole cache line. So for every tuple a cache miss occurs in the case of a row layout.

In the case of column layout, when a cache line is read, the price values of 16 (cacheline_size/price_int_size = 64/4 = 16) tuples are read, because 16 contiguous price values stored in memory are brought into the cache. So a cache miss occurs only for every sixteenth tuple in the case of column layout.

So the column layout will be faster in the case of the given query, and is faster in such aggregation queries on a subset of columns of the table. You can try out such an experiment for yourself using the data from the TPC-H benchmark, and compare the run times for both layouts. The wikipedia article on column oriented database systems is also good.

So in database systems, if the query workload is known beforehand, we can store our data in layouts which will suit the queries in the workload and access data from these layouts. In the case of the above example we created a column layout and changed our code to compute the sum so that it became cache-friendly.
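Here is a compact, self-contained sketch of the same comparison using standard containers instead of raw `new`. The `ProductColumns` type and the helper names are my own, introduced only for this example; the `static_assert` checks the 64-byte tuple size assumed in the text, which holds on typical platforms where `int32_t` is 4-byte aligned.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Row layout: one struct per tuple (array of structs).
struct Product {
    std::int32_t key;
    char name[56];
    std::int32_t price;
};
static_assert(sizeof(Product) == 64, "a tuple is expected to fill one 64-byte line");

// Column layout: one array per attribute (struct of arrays).
struct ProductColumns {
    std::vector<std::int32_t> key;
    std::vector<char> name;  // 56 bytes per row, stored back to back
    std::vector<std::int32_t> price;
};

// Converts the row layout into the column layout.
ProductColumns to_columns(const std::vector<Product>& table)
{
    ProductColumns cols;
    for (const Product& p : table) {
        cols.key.push_back(p.key);
        cols.name.insert(cols.name.end(), p.name, p.name + sizeof p.name);
        cols.price.push_back(p.price);
    }
    return cols;
}

// SELECT SUM(price): touches one 64-byte tuple per price read.
std::int64_t sum_row_store(const std::vector<Product>& table)
{
    std::int64_t sum = 0;
    for (const Product& p : table)
        sum += p.price;
    return sum;
}

// SELECT SUM(price): reads only the densely packed price array.
std::int64_t sum_column_store(const ProductColumns& cols)
{
    assert(cols.key.size() == cols.price.size());
    std::int64_t sum = 0;
    for (std::int32_t price : cols.price)
        sum += price;
    return sum;
}
```

Both functions return the same total; the difference is that the column version reads 4 bytes per tuple instead of dragging a whole 64-byte struct through the cache. How much faster it runs should be measured rather than assumed.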

User · Answer

In addition to @Marc Claesen's answer, I think that an instructive classic example of cache-unfriendly code is code that scans a C bidimensional array (e.g. a bitmap image) column-wise instead of row-wise.

Elements that are adjacent in a row are also adjacent in memory, thus accessing them in sequence means accessing them in ascending memory order; this is cache-friendly, since the cache tends to prefetch contiguous blocks of memory.

Instead, accessing such elements column-wise is cache-unfriendly, since elements on the same column are distant in memory from each other (in particular, their distance is equal to the size of the row), so when you use this access pattern you are jumping around in memory, potentially wasting the effort of the cache of retrieving the elements nearby in memory.

And all that it takes to ruin the performance is to go from

    // Cache-friendly version - processes pixels which are adjacent in memory
    for (unsigned int y = 0; y < height; ++y)
    {
        for (unsigned int x = 0; x < width; ++x)
        {
            ... image[y][x] ...
        }
    }

to

    // Cache-unfriendly version - jumps around in memory for no good reason
    for (unsigned int x = 0; x < width; ++x)
    {
        for (unsigned int y = 0; y < height; ++y)
        {
            ... image[y][x] ...
        }
    }

This effect can be quite dramatic (several orders of magnitude in speed) in systems with small caches and/or working with big arrays (e.g. 10+ megapixel 24 bpp images on current machines); for this reason, if you have to do many vertical scans, often it's better to rotate the image by 90 degrees first and perform the various analyses later, limiting the cache-unfriendly code just to the rotation.

User · Answer

Be aware that caches do not just cache contiguous memory. They have multiple lines (at least 4), so discontiguous and overlapping memory can often be stored just as efficiently.

What is missing from all the above examples is measured benchmarks. There are many myths about performance. Unless you measure it, you do not know. Do not complicate your code unless you have a measured improvement.
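As a starting point for measuring, a minimal timing harness could look like the sketch below. It uses only `std::chrono::steady_clock`; the helper names (`time_it`, `sum_row_wise`, `sum_column_wise`) are hypothetical, and a serious benchmark would also need warm-up runs, repetitions and an optimized build.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Timed {
    long long sum;        // result of the measured computation
    double milliseconds;  // wall-clock time it took
};

// Runs f once and reports its result together with the elapsed time.
template <typename F>
Timed time_it(F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    const long long sum = f();
    const auto stop = std::chrono::steady_clock::now();
    return {sum, std::chrono::duration<double, std::milli>(stop - start).count()};
}

// Sums a row-major matrix in memory order.
long long sum_row_wise(const std::vector<int>& m, std::size_t rows, std::size_t cols)
{
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Sums the same matrix column by column, striding through memory.
long long sum_column_wise(const std::vector<int>& m, std::size_t rows, std::size_t cols)
{
    assert(m.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}
```

A call such as `time_it([&] { return sum_row_wise(m, rows, cols); })` returns both the sum and the elapsed milliseconds. Comparing the two traversals on a matrix much larger than the cache, and using the returned sums so the compiler cannot discard the loops, shows what the difference is on a given machine.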
