[c++] C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?

C++11 introduced a standardized memory model, but what exactly does that mean? And how is it going to affect C++ programming?

This article (by Gavin Clarke who quotes Herb Sutter) says that,

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

Well, I can memorize this and similar paragraphs available online (as I've had my own memory model since birth :P) and can even post as an answer to questions asked by others, but to be honest, I don't exactly understand this.

C++ programmers used to develop multi-threaded applications even before, so how does it matter if it's POSIX threads, or Windows threads, or C++11 threads? What are the benefits? I want to understand the low-level details.

I also get this feeling that the C++11 memory model is somehow related to C++11 multi-threading support, as I often see these two together. If it is, how exactly? Why should they be related?

As I don't know how the internals of multi-threading work, and what memory model means in general, please help me understand these concepts. :-)

This question is related to c++ multithreading c++11 language-lawyer memory-model

The answer is


I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System". The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.

Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.

Quoting from "A Primer on Memory Consistency and Cache Coherence"

The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.

That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.

In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.

[Picture from Wikipedia] Picture from Wikipedia

Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).

The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).

In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered. Time in Physics, Craig Callender.

In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.

To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.

Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.

Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.

Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:

Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location. An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.

Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.


For languages not specifying a memory model, you are writing code for the language and the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race is when it's possible for multiple cores / hyper-threads to access the same memory concurrently) then your program is not cross platform because of its dependence on the processor memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.

Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross platform way... So if you are using standard locks in a multithreaded program with no data races then you don't have to worry about cross platform memory models.

Interestingly, Microsoft compilers for C++ have acquire / release semantics for volatile which is a C++ extension to deal with the lack of a memory model in C++ http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows runs on x86 / x64 only, that's not saying much (Intel and AMD memory models make it easy and efficient to implement acquire / release semantics in a language).


The above answers get at the most fundamental aspects of the C++ memory model. In practice, most uses of std::atomic<> "just work", at least until the programmer over-optimizes (e.g., by trying to relax too many things).

There is one place where mistakes are still common: sequence locks. There is an excellent and easy-to-read discussion of the challenges at https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf. Sequence locks are appealing because the reader avoids writing to the lock word. The following code is based on Figure 1 of the above technical report, and it highlights the challenges when implementing sequence locks in C++:

atomic<uint64_t> seq; // seqlock representation
int data1, data2;     // this data will be protected by seq

T reader() {
    int r1, r2;
    unsigned seq0, seq1;
    while (true) {
        seq0 = seq;
        r1 = data1; // INCORRECT! Data Race!
        r2 = data2; // INCORRECT!
        seq1 = seq;

        // if the lock didn't change while I was reading, and
        // the lock wasn't held while I was reading, then my
        // reads should be valid
        if (seq0 == seq1 && !(seq0 & 1))
            break;
    }
    use(r1, r2);
}

void writer(int new_data1, int new_data2) {
    unsigned seq0 = seq;
    while (true) {
        if ((!(seq0 & 1)) && seq.compare_exchange_weak(seq0, seq0 + 1))
            break; // atomically moving the lock from even to odd is an acquire
    }
    data1 = new_data1;
    data2 = new_data2;
    seq = seq0 + 2; // release the lock by increasing its value to even
}

As unintuitive as it seams at first, data1 and data2 need to be atomic<>. If they are not atomic, then they could be read (in reader()) at the exact same time as they are written (in writer()). According to the C++ memory model, this is a race even if reader() never actually uses the data. In addition, if they are not atomic, then the compiler can cache the first read of each value in a register. Obviously you wouldn't want that... you want to re-read in each iteration of the while loop in reader().

It is also not sufficient to make them atomic<> and access them with memory_order_relaxed. The reason for this is that the reads of seq (in reader()) only have acquire semantics. In simple terms, if X and Y are memory accesses, X precedes Y, X is not an acquire or release, and Y is an acquire, then the compiler can reorder Y before X. If Y was the second read of seq, and X was a read of data, such a reordering would break the lock implementation.

The paper gives a few solutions. The one with the best performance today is probably the one that uses an atomic_thread_fence with memory_order_relaxed before the second read of the seqlock. In the paper, it's Figure 6. I'm not reproducing the code here, because anyone who has read this far really ought to read the paper. It is more precise and complete than this post.

The last issue is that it might be unnatural to make the data variables atomic. If you can't in your code, then you need to be very careful, because casting from non-atomic to atomic is only legal for primitive types. C++20 is supposed to add atomic_ref<>, which will make this problem easier to resolve.

To summarize: even if you think you understand the C++ memory model, you should be very careful before rolling your own sequence locks.


It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.

When you're talking about POSIX threads or Windows threads, then this is a bit of an illusion as actually you're talking about x86 threads, as it's a hardware function to run concurrently. The C++0x memory model makes guarantees, whether you're on x86, or ARM, or MIPS, or anything else you can come up with.


If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.

Now, if you used atomics, or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and provides portable fences for hand-coded guarantees.

Previously, atomics would be done using compiler intrinsics, or some higher level library. Fences would have been done using CPU-specific instructions (memory barriers).


This is now a multiple-year old question, but being very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model. I see no point in summing up his talk in order to make this yet another full answer, but given this is the guy who actually wrote the standard, I think it's well worth watching the talk.

Herb Sutter has a three hour long talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical, and covers the following topics:

  1. Optimizations, Races, and the Memory Model
  2. Ordering – What: Acquire and Release
  3. Ordering – How: Mutexes, Atomics, and/or Fences
  4. Other Restrictions on Compilers and Hardware
  5. Code Gen & Performance: x86/x64, IA64, POWER, ARM
  6. Relaxed Atomics

The talk doesn't elaborate on the API, but rather on the reasoning, background, under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized load efficiently?).


C and C++ used to be defined by an execution trace of a well formed program.

Now they are half defined by an execution trace of a program, and half a posteriori by many orderings on synchronisation objects.

Meaning that these language definitions make no sense at all as no logical method to mix these two approaches. In particular, destruction of a mutex or atomic variable is not well defined.


Examples related to c++

Method Call Chaining; returning a pointer vs a reference? How can I tell if an algorithm is efficient? Difference between opening a file in binary vs text How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Install Qt on Ubuntu #include errors detected in vscode Cannot open include file: 'stdio.h' - Visual Studio Community 2017 - C++ Error How to fix the error "Windows SDK version 8.1" was not found? Visual Studio 2017 errors on standard headers How do I check if a Key is pressed on C++

Examples related to multithreading

How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Waiting until the task finishes What is the difference between Task.Run() and Task.Factory.StartNew() Why is setState in reactjs Async instead of Sync? What exactly is std::atomic? Calling async method on button click WAITING at sun.misc.Unsafe.park(Native Method) How to use background thread in swift? What is the use of static synchronized method in java? Locking pattern for proper use of .NET MemoryCache

Examples related to c++11

Remove from the beginning of std::vector Converting std::__cxx11::string to std::string What exactly is std::atomic? C++ How do I convert a std::chrono::time_point to long and back Passing capturing lambda as function pointer undefined reference to 'std::cout' Is it possible to use std::string in a constexpr? How does #include <bits/stdc++.h> work in C++? error::make_unique is not a member of ‘std’ no match for ‘operator<<’ in ‘std::operator

Examples related to language-lawyer

In CSS Flexbox, why are there no "justify-items" and "justify-self" properties? C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming? Is null reference possible?

Examples related to memory-model

C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?