C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?

Question

C++11 introduced a standardized memory model, but what exactly does that mean? And how is it going to affect C++ programming?

This article (by Gavin Clarke, who quotes Herb Sutter) says that:

"The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory."

"'When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code,' Sutter said."

Well, I can memorize this and similar paragraphs available online (as I've had my own memory model since birth :P) and can even post it as an answer to questions asked by others, but to be honest, I don't exactly understand this.

C++ programmers used to develop multi-threaded applications even before, so how does it matter if it's POSIX threads, or Windows threads, or C++11 threads? What are the benefits? I want to understand the low-level details.

I also get this feeling that the C++11 memory model is somehow related to C++11 multi-threading support, as I often see these two together. If it is, how exactly? Why should they be related?

As I don't know how the internals of multi-threading work, and what memory model means in general, please help me understand these concepts. :-)

User · Answer

C and C++ used to be defined by an execution trace of a well-formed program.

Now they are half defined by an execution trace of a program, and half a posteriori by many orderings on synchronisation objects.

This means that these language definitions make no sense at all, as there is no logical method to mix these two approaches. In particular, destruction of a mutex or atomic variable is not well defined.

User · Answer

It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.

When you're talking about POSIX threads or Windows threads, then this is a bit of an illusion, as actually you're talking about x86 threads, as it's a hardware function to run concurrently. The C++0x memory model makes guarantees, whether you're on x86, or ARM, or MIPS, or anything else you can come up with.
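As a rough illustration of what "the standard now defines multi-threading" means in practice, the sketch below uses only the standard <thread> and <atomic> headers, with no POSIX or Windows calls. The helper name count_in_parallel and its parameters are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: start `num_threads` threads that each add 1 to a
// shared counter `increments` times, then return the final total.
long count_in_parallel(int num_threads, int increments) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i) {
                counter.fetch_add(1);  // atomic read-modify-write (seq_cst by default)
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // thread completion synchronizes with the return of join()
    }
    return counter.load();
}
```

Because every increment is an atomic operation, the total does not depend on how the platform schedules the threads; with a plain long the same program would contain a data race.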

User · Answer

If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.

Now, if you used atomics, or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and provides portable fences for hand-coded guarantees.

Previously, atomics would be done using compiler intrinsics, or some higher-level library. Fences would have been done using CPU-specific instructions (memory barriers).
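To make "portable fences" a little more concrete, here is a minimal sketch of publishing plain data through a relaxed flag plus std::atomic_thread_fence. The names payload, ready, producer, consumer and run_fence_demo are invented for this illustration; only the <atomic> and <thread> calls come from the standard library.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag, accessed with relaxed operations only

void producer() {
    payload = 42;
    // Release fence: the write to payload cannot be reordered after
    // the flag store that follows.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the read of payload cannot be reordered before
    // the flag load that saw `true`.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

// Hypothetical driver: runs one producer and one consumer and returns
// the value the consumer observed.
int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

The two fences pair up through the flag, so once the consumer sees ready == true it is guaranteed to read 42. Before C++11 the same pattern needed CPU-specific barrier instructions or compiler intrinsics.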

User · Answer

This is now a multiple-year-old question, but being very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model. I see no point in summing up his talk in order to make this yet another full answer, but given this is the guy who actually wrote the standard, I think it's well worth watching the talk.

Herb Sutter has a three-hour-long talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical, and covers the following topics:

- Optimizations, Races, and the Memory Model
- Ordering - What: Acquire and Release
- Ordering - How: Mutexes, Atomics, and/or Fences
- Other Restrictions on Compilers and Hardware
- Code Gen & Performance: x86/x64, IA64, POWER, ARM
- Relaxed Atomics

The talk doesn't elaborate on the API, but rather on the reasoning, background, under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized load efficiently?).

User · Answer

The above answers get at the most fundamental aspects of the C++ memory model. In practice, most uses of std::atomic<> "just work", at least until the programmer over-optimizes (e.g., by trying to relax too many things).

There is one place where mistakes are still common: sequence locks. There is an excellent and easy-to-read discussion of the challenges at https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf. Sequence locks are appealing because the reader avoids writing to the lock word. The following code is based on Figure 1 of the above technical report, and it highlights the challenges when implementing sequence locks in C++:

    atomic<uint64_t> seq; // seqlock representation
    int data1, data2;     // this data will be protected by seq

    T reader() {
        int r1, r2;
        uint64_t seq0, seq1;
        while (true) {
            seq0 = seq;
            r1 = data1; // INCORRECT! Data Race!
            r2 = data2; // INCORRECT!
            seq1 = seq;

            // if the lock didn't change while I was reading, and
            // the lock wasn't held while I was reading, then my
            // reads should be valid
            if (seq0 == seq1 && !(seq0 & 1))
                break;
        }
        use(r1, r2);
    }

    void writer(int new_data1, int new_data2) {
        uint64_t seq0 = seq; // same type as seq, as compare_exchange_weak requires
        while (true) {
            if (!(seq0 & 1) && seq.compare_exchange_weak(seq0, seq0 + 1))
                break; // atomically moving the lock from even to odd is an acquire
            seq0 = seq; // lock was held or the exchange failed: reload and retry
        }
        data1 = new_data1;
        data2 = new_data2;
        seq = seq0 + 2; // release the lock by increasing its value to even
    }

As unintuitive as it seems at first, data1 and data2 need to be atomic<>. If they are not atomic, then they could be read (in reader()) at the exact same time as they are written (in writer()). According to the C++ memory model, this is a race even if reader() never actually uses the data. In addition, if they are not atomic, then the compiler can cache the first read of each value in a register. Obviously you wouldn't want that: you want to re-read in each iteration of the while loop in reader().

It is also not sufficient to make them atomic<> and access them with memory_order_relaxed. The reason for this is that the reads of seq (in reader()) only have acquire semantics. In simple terms, if X and Y are memory accesses, X precedes Y, X is not an acquire or release, and Y is an acquire, then the compiler can reorder Y before X. If Y was the second read of seq, and X was a read of data, such a reordering would break the lock implementation.

The paper gives a few solutions. The one with the best performance today is probably the one that uses an atomic_thread_fence with memory_order_acquire before the second read of the seqlock, which can then itself be a memory_order_relaxed load. In the paper, it's Figure 6. I'm not reproducing the code here, because anyone who has read this far really ought to read the paper. It is more precise and complete than this post.

The last issue is that it might be unnatural to make the data variables atomic. If you can't in your code, then you need to be very careful, because casting from non-atomic to atomic is only legal for primitive types. C++20 adds atomic_ref<>, which makes this problem easier to resolve.

To summarize: even if you think you understand the C++ memory model, you should be very careful before rolling your own sequence locks.
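For readers who want to see the shape of the fence-based approach, the following is a rough sketch, not the paper's code: the reader follows the idea described above (relaxed data loads, then an acquire fence, then a relaxed second read of seq). The writer side, including its release fence, and the run_seqlock_demo stress helper are assumptions made for this illustration, so treat the paper as the authoritative version.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <utility>

std::atomic<std::uint64_t> seq{0};     // even: unlocked, odd: a writer is active
std::atomic<int> data1{0}, data2{0};   // protected data, now atomic

std::pair<int, int> reader() {
    int r1, r2;
    std::uint64_t seq0, seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        r1 = data1.load(std::memory_order_relaxed);
        r2 = data2.load(std::memory_order_relaxed);
        // Keeps the data loads above from being reordered after the
        // second read of seq below.
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1 || (seq0 & 1));  // retry if changed or locked
    return {r1, r2};
}

void writer(int new_data1, int new_data2) {
    std::uint64_t seq0 = seq.load(std::memory_order_relaxed);
    // Move the lock from even to odd; retry while another writer holds it.
    while ((seq0 & 1) || !seq.compare_exchange_weak(seq0, seq0 + 1)) {
        seq0 = seq.load(std::memory_order_relaxed);
    }
    // Assumption of this sketch: a release fence orders the odd seq value
    // before the relaxed data stores.
    std::atomic_thread_fence(std::memory_order_release);
    data1.store(new_data1, std::memory_order_relaxed);
    data2.store(new_data2, std::memory_order_relaxed);
    seq.store(seq0 + 2, std::memory_order_release);  // back to even: unlock
}

// Hypothetical stress check: one writer publishes matching pairs while the
// calling thread verifies that it never observes a mismatched pair.
bool run_seqlock_demo(int rounds) {
    std::thread w([rounds] {
        for (int i = 1; i <= rounds; ++i) writer(i, i);
    });
    bool consistent = true;
    for (int i = 0; i < rounds; ++i) {
        std::pair<int, int> p = reader();
        if (p.first != p.second) consistent = false;
    }
    w.join();
    return consistent;
}
```

A passing stress run does not prove the ordering is right (x86 in particular hides many reordering bugs), which is one more reason to check any hand-rolled version against the paper.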

User · Answer

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems, like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

               Global
               int x, y;

    Thread 1            Thread 2
    x = 17;             cout << y << " ";
    y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++11, you can write this:

               Global
               atomic<int> x, y;

    Thread 1                 Thread 2
    x.store(17);             cout << y.load() << " ";
    y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores, i.e., if it requires atomicity but not ordering, i.e., if it can tolerate 37 0 as output from this program, then you can write this:

               Global
               atomic<int> x, y;

    Thread 1                                Thread 2
    x.store(17, memory_order_relaxed);      cout << y.load(memory_order_relaxed) << " ";
    y.store(37, memory_order_relaxed);      cout << x.load(memory_order_relaxed) << endl;

The more modern the CPU, the more likely this is to be faster than the previous example.

Finally, if you just need to keep particular loads and stores in order, you can write:

               Global
               atomic<int> x, y;

    Thread 1                                Thread 2
    x.store(17, memory_order_release);      cout << y.load(memory_order_acquire) << " ";
    y.store(37, memory_order_release);      cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores, so 37 0 is no longer a possible output, but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended ;-).

So, bottom line: mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.
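The release/acquire variant of the two-thread example in this answer can be turned into a small compilable harness. This is only a sketch: the function names run_once, is_allowed and all_runs_allowed are invented here, and the harness records what Thread 2 would print instead of writing to cout.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical harness: runs both threads once and returns what Thread 2
// observed as (y, x), i.e. in the order the answer prints them.
std::pair<int, int> run_once() {
    std::atomic<int> x{0}, y{0};
    int seen_y = 0, seen_x = 0;
    std::thread t1([&] {
        x.store(17, std::memory_order_release);
        y.store(37, std::memory_order_release);
    });
    std::thread t2([&] {
        seen_y = y.load(std::memory_order_acquire);
        seen_x = x.load(std::memory_order_acquire);
    });
    t1.join();
    t2.join();
    return {seen_y, seen_x};
}

// True for the three outcomes the answer lists: "0 0", "37 17" and "0 17".
bool is_allowed(std::pair<int, int> r) {
    return (r.first == 0 && r.second == 0) ||
           (r.first == 37 && r.second == 17) ||
           (r.first == 0 && r.second == 17);
}

// Repeats the experiment; returns false if "37 0" (or anything else
// outside the allowed set) ever shows up.
bool all_runs_allowed(int runs) {
    for (int i = 0; i < runs; ++i) {
        if (!is_allowed(run_once())) return false;
    }
    return true;
}
```

If the acquire load of y reads 37, it synchronizes with the release store of 37, and the earlier store of 17 to x is then guaranteed to be visible. Replacing both orderings with memory_order_relaxed removes that guarantee, although a given machine may still never show 37 0.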

User · Answer

I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System". The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.

Let's view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.

Quoting from "A Primer on Memory Consistency and Cache Coherence":

"The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor."

That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That is, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.

In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other, because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved, and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in a different order by other threads.

[Picture from Wikipedia]

Readers familiar with Einstein's Special Theory of Relativity will notice what I am alluding to. Translating Minkowski's words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).

The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).

In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among "timelike" events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered. (Time in Physics, Craig Callender)

In the C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.

To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence":

"For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between 'one correct result' and 'many incorrect alternatives'. This is because the processor's architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.

"Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers 'require' to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where 'most recent' is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.

"Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order."

Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:

"Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location. An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations."

Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location, but there can be an unlimited number of observers of any location.
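The store-load reordering that the quoted passage attributes to write buffers is usually demonstrated with the "store buffering" litmus test. The sketch below, with invented names both_saw_zero_once and sc_forbids_both_zero, runs that test with C++11's default sequentially consistent atomics, under which the outcome "both loads saw 0" is forbidden.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus harness: each thread stores to one variable and then
// loads the other. Returns true if both loads observed the initial 0.
bool both_saw_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread a([&] {
        x.store(1);      // seq_cst by default
        r1 = y.load();
    });
    std::thread b([&] {
        y.store(1);
        r2 = x.load();
    });
    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the litmus test; returns true if the forbidden outcome never
// appeared in `runs` attempts.
bool sc_forbids_both_zero(int runs) {
    for (int i = 0; i < runs; ++i) {
        if (both_saw_zero_once()) return false;
    }
    return true;
}
```

Under sequential consistency all four operations fit into one total order, so whichever store comes first must be visible to the other thread's load. With memory_order_relaxed, or even release stores paired with acquire loads, each store may in effect sit in its core's write buffer while the following load executes, and both loads are then allowed to return 0.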

User · Answer

For languages not specifying a memory model, you are writing code for the language and the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race is when multiple cores / hyper-threads can access the same memory concurrently, with at least one of them writing and nothing synchronizing the accesses), then your program is not cross-platform because of its dependence on the processor memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.

Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross-platform way. So if you are using standard locks in a multithreaded program with no data races, then you don't have to worry about cross-platform memory models.

Interestingly, Microsoft compilers for C++ have acquire / release semantics for volatile, which is a C++ extension to deal with the lack of a memory model in C++: http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows ran on x86 / x64 only at the time, that's not saying much (Intel and AMD memory models make it easy and efficient to implement acquire / release semantics in a language).
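As a small illustration of the "standard locks, no data races" case, the sketch below protects two plain ints with a std::mutex. The names run_locked_once and only_whole_updates are made up for this example; the values 17 and 37 are arbitrary.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical harness: one thread updates x and y under a mutex while
// another reads both under the same mutex. Returns the reader's (x, y).
std::pair<int, int> run_locked_once() {
    std::mutex m;
    int x = 0, y = 0;               // plain ints: every access below holds m
    int seen_x = -1, seen_y = -1;
    std::thread writer([&] {
        std::lock_guard<std::mutex> lock(m);
        x = 17;
        y = 37;
    });
    std::thread reader([&] {
        std::lock_guard<std::mutex> lock(m);
        seen_x = x;
        seen_y = y;
    });
    writer.join();
    reader.join();
    return {seen_x, seen_y};
}

// Returns true if every run saw either no update (0, 0) or the whole
// update (17, 37), never a partial one.
bool only_whole_updates(int runs) {
    for (int i = 0; i < runs; ++i) {
        std::pair<int, int> r = run_locked_once();
        bool before = (r.first == 0 && r.second == 0);
        bool after = (r.first == 17 && r.second == 37);
        if (!before && !after) return false;
    }
    return true;
}
```

Because both threads take the same lock, the program has no data race, and its possible results are the same on any processor memory model: the reader sees the state either entirely before or entirely after the writer's critical section.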
