[c++] Is it better to use std::memcpy() or std::copy() in terms to performance?

Is it better to use memcpy as shown below or is it better to use std::copy() in terms to performance? Why?

char *bits = NULL;
...

bits = new (std::nothrow) char[((int *) copyMe->bits)[0]];
if (bits == NULL)
{
    cout << "ERROR Not enough memory.\n";
    exit(1);
}

memcpy (bits, copyMe->bits, ((int *) copyMe->bits)[0]);

This question is related to c++ performance optimization

The answer is


I'm going to go against the general wisdom here that std::copy will have a slight, almost imperceptible performance loss. I just did a test and found that to be untrue: I did notice a performance difference. However, the winner was std::copy.

I wrote a C++ SHA-2 implementation. In my test, I hash 5 strings using all four SHA-2 versions (224, 256, 384, 512), and I loop 300 times. I measure times using Boost.timer. That 300 loop counter is enough to completely stabilize my results. I ran the test 5 times each, alternating between the memcpy version and the std::copy version. My code takes advantage of grabbing data in as large of chunks as possible (many other implementations operate with char / char *, whereas I operate with T / T * (where T is the largest type in the user's implementation that has correct overflow behavior), so fast memory access on the largest types I can is central to the performance of my algorithm. These are my results:

Time (in seconds) to complete run of SHA-2 tests

std::copy   memcpy  % increase
6.11        6.29    2.86%
6.09        6.28    3.03%
6.10        6.29    3.02%
6.08        6.27    3.03%
6.08        6.27    3.03%

Total average increase in speed of std::copy over memcpy: 2.99%

My compiler is gcc 4.6.3 on Fedora 16 x86_64. My optimization flags are -Ofast -march=native -funsafe-loop-optimizations.

Code for my SHA-2 implementations.

I decided to run a test on my MD5 implementation as well. The results were much less stable, so I decided to do 10 runs. However, after my first few attempts, I got results that varied wildly from one run to the next, so I'm guessing there was some sort of OS activity going on. I decided to start over.

Same compiler settings and flags. There is only one version of MD5, and it's faster than SHA-2, so I did 3000 loops on a similar set of 5 test strings.

These are my final 10 results:

Time (in seconds) to complete run of MD5 tests

std::copy   memcpy      % difference
5.52        5.56        +0.72%
5.56        5.55        -0.18%
5.57        5.53        -0.72%
5.57        5.52        -0.91%
5.56        5.57        +0.18%
5.56        5.57        +0.18%
5.56        5.53        -0.54%
5.53        5.57        +0.72%
5.59        5.57        -0.36%
5.57        5.56        -0.18%

Total average decrease in speed of std::copy over memcpy: 0.11%

Code for my MD5 implementation

These results suggest that there is some optimization that std::copy used in my SHA-2 tests that std::copy could not use in my MD5 tests. In the SHA-2 tests, both arrays were created in the same function that called std::copy / memcpy. In my MD5 tests, one of the arrays was passed in to the function as a function parameter.

I did a little bit more testing to see what I could do to make std::copy faster again. The answer turned out to be simple: turn on link time optimization. These are my results with LTO turned on (option -flto in gcc):

Time (in seconds) to complete run of MD5 tests with -flto

std::copy   memcpy      % difference
5.54        5.57        +0.54%
5.50        5.53        +0.54%
5.54        5.58        +0.72%
5.50        5.57        +1.26%
5.54        5.58        +0.72%
5.54        5.57        +0.54%
5.54        5.56        +0.36%
5.54        5.58        +0.72%
5.51        5.58        +1.25%
5.54        5.57        +0.54%

Total average increase in speed of std::copy over memcpy: 0.72%

In summary, there does not appear to be a performance penalty for using std::copy. In fact, there appears to be a performance gain.

Explanation of results

So why might std::copy give a performance boost?

First, I would not expect it to be slower for any implementation, as long as the optimization of inlining is turned on. All compilers inline aggressively; it is possibly the most important optimization because it enables so many other optimizations. std::copy can (and I suspect all real world implementations do) detect that the arguments are trivially copyable and that memory is laid out sequentially. This means that in the worst case, when memcpy is legal, std::copy should perform no worse. The trivial implementation of std::copy that defers to memcpy should meet your compiler's criteria of "always inline this when optimizing for speed or size".

However, std::copy also keeps more of its information. When you call std::copy, the function keeps the types intact. memcpy operates on void *, which discards almost all useful information. For instance, if I pass in an array of std::uint64_t, the compiler or library implementer may be able to take advantage of 64-bit alignment with std::copy, but it may be more difficult to do so with memcpy. Many implementations of algorithms like this work by first working on the unaligned portion at the start of the range, then the aligned portion, then the unaligned portion at the end. If it is all guaranteed to be aligned, then the code becomes simpler and faster, and easier for the branch predictor in your processor to get correct.

Premature optimization?

std::copy is in an interesting position. I expect it to never be slower than memcpy and sometimes faster with any modern optimizing compiler. Moreover, anything that you can memcpy, you can std::copy. memcpy does not allow any overlap in the buffers, whereas std::copy supports overlap in one direction (with std::copy_backward for the other direction of overlap). memcpy only works on pointers, std::copy works on any iterators (std::map, std::vector, std::deque, or my own custom type). In other words, you should just use std::copy when you need to copy chunks of data around.


My rule is simple. If you are using C++ prefer C++ libraries and not C :)


If you really need maximum copying performance (which you might not), use neither of them.

There's a lot that can be done to optimize memory copying - even more if you're willing to use multiple threads/cores for it. See, for example:

What's missing/sub-optimal in this memcpy implementation?

both the question and some of the answers have suggested implementations or links to implementations.


Always use std::copy because memcpy is limited to only C-style POD structures, and the compiler will likely replace calls to std::copy with memcpy if the targets are in fact POD.

Plus, std::copy can be used with many iterator types, not just pointers. std::copy is more flexible for no performance loss and is the clear winner.


Just a minor addition: The speed difference between memcpy() and std::copy() can vary quite a bit depending on if optimizations are enabled or disabled. With g++ 6.2.0 and without optimizations memcpy() clearly wins:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy            17 ns         17 ns   40867738
bm_stdcopy           62 ns         62 ns   11176219
bm_stdcopy_n         72 ns         72 ns    9481749

When optimizations are enabled (-O3), everything looks pretty much the same again:

Benchmark             Time           CPU Iterations
---------------------------------------------------
bm_memcpy             3 ns          3 ns  274527617
bm_stdcopy            3 ns          3 ns  272663990
bm_stdcopy_n          3 ns          3 ns  274732792

The bigger the array the less noticeable the effect gets, but even at N=1000 memcpy() is about twice as fast when optimizations aren't enabled.

Source code (requires Google Benchmark):

#include <string.h>
#include <algorithm>
#include <vector>
#include <benchmark/benchmark.h>

constexpr int N = 10;

void bm_memcpy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    memcpy(r.data(), a.data(), N * sizeof(int));
  }
}

void bm_stdcopy(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy(a.begin(), a.end(), r.begin());
  }
}

void bm_stdcopy_n(benchmark::State& state)
{
  std::vector<int> a(N);
  std::vector<int> r(N);

  while (state.KeepRunning())
  {
    std::copy_n(a.begin(), N, r.begin());
  }
}

BENCHMARK(bm_memcpy);
BENCHMARK(bm_stdcopy);
BENCHMARK(bm_stdcopy_n);

BENCHMARK_MAIN()

/* EOF */

Profiling shows that statement: std::copy() is always as fast as memcpy() or faster is false.

My system:

HP-Compaq-dx7500-Microtower 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux.

gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2

The code (language: c++):

    const uint32_t arr_size = (1080 * 720 * 3); //HD image in rgb24
    const uint32_t iterations = 100000;
    uint8_t arr1[arr_size];
    uint8_t arr2[arr_size];
    std::vector<uint8_t> v;

    main(){
        {
            DPROFILE;
            memcpy(arr1, arr2, sizeof(arr1));
            printf("memcpy()\n");
        }

        v.reserve(sizeof(arr1));
        {
            DPROFILE;
            std::copy(arr1, arr1 + sizeof(arr1), v.begin());
            printf("std::copy()\n");
        }

        {
            time_t t = time(NULL);
            for(uint32_t i = 0; i < iterations; ++i)
                memcpy(arr1, arr2, sizeof(arr1));
            printf("memcpy()    elapsed %d s\n", time(NULL) - t);
        }

        {
            time_t t = time(NULL);
            for(uint32_t i = 0; i < iterations; ++i)
                std::copy(arr1, arr1 + sizeof(arr1), v.begin());
            printf("std::copy() elapsed %d s\n", time(NULL) - t);
        }
    }

g++ -O0 -o test_stdcopy test_stdcopy.cpp

memcpy() profile: main:21: now:1422969084:04859 elapsed:2650 us
std::copy() profile: main:27: now:1422969084:04862 elapsed:2745 us
memcpy() elapsed 44 s std::copy() elapsed 45 s

g++ -O3 -o test_stdcopy test_stdcopy.cpp

memcpy() profile: main:21: now:1422969601:04939 elapsed:2385 us
std::copy() profile: main:28: now:1422969601:04941 elapsed:2690 us
memcpy() elapsed 27 s std::copy() elapsed 43 s

Red Alert pointed out that the code uses memcpy from array to array and std::copy from array to vector. That coud be a reason for faster memcpy.

Since there is

v.reserve(sizeof(arr1));

there shall be no difference in copy to vector or array.

The code is fixed to use array for both cases. memcpy still faster:

{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        memcpy(arr1, arr2, sizeof(arr1));
    printf("memcpy()    elapsed %ld s\n", time(NULL) - t);
}

{
    time_t t = time(NULL);
    for(uint32_t i = 0; i < iterations; ++i)
        std::copy(arr1, arr1 + sizeof(arr1), arr2);
    printf("std::copy() elapsed %ld s\n", time(NULL) - t);
}

memcpy()    elapsed 44 s
std::copy() elapsed 48 s 

All compilers I know will replace a simple std::copy with a memcpy when it is appropriate, or even better, vectorize the copy so that it would be even faster than a memcpy.

In any case: profile and find out yourself. Different compilers will do different things, and it's quite possible it won't do exactly what you ask.

See this presentation on compiler optimisations (pdf).

Here's what GCC does for a simple std::copy of a POD type.

#include <algorithm>

struct foo
{
  int x, y;    
};

void bar(foo* a, foo* b, size_t n)
{
  std::copy(a, a + n, b);
}

Here's the disassembly (with only -O optimisation), showing the call to memmove:

bar(foo*, foo*, unsigned long):
    salq    $3, %rdx
    sarq    $3, %rdx
    testq   %rdx, %rdx
    je  .L5
    subq    $8, %rsp
    movq    %rsi, %rax
    salq    $3, %rdx
    movq    %rdi, %rsi
    movq    %rax, %rdi
    call    memmove
    addq    $8, %rsp
.L5:
    rep
    ret

If you change the function signature to

void bar(foo* __restrict a, foo* __restrict b, size_t n)

then the memmove becomes a memcpy for a slight performance improvement. Note that memcpy itself will be heavily vectorised.


In theory, memcpy might have a slight, imperceptible, infinitesimal, performance advantage, only because it doesn't have the same requirements as std::copy. From the man page of memcpy:

To avoid overflows, the size of the arrays pointed by both the destination and source parameters, shall be at least num bytes, and should not overlap (for overlapping memory blocks, memmove is a safer approach).

In other words, memcpy can ignore the possibility of overlapping data. (Passing overlapping arrays to memcpy is undefined behavior.) So memcpy doesn't need to explicitly check for this condition, whereas std::copy can be used as long as the OutputIterator parameter is not in the source range. Note this is not the same as saying that the source range and destination range can't overlap.

So since std::copy has somewhat different requirements, in theory it should be slightly (with an extreme emphasis on slightly) slower, since it probably will check for overlapping C-arrays, or else delegate the copying of C-arrays to memmove, which needs to perform the check. But in practice, you (and most profilers) probably won't even detect any difference.

Of course, if you're not working with PODs, you can't use memcpy anyway.


Examples related to c++

Method Call Chaining; returning a pointer vs a reference? How can I tell if an algorithm is efficient? Difference between opening a file in binary vs text How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Install Qt on Ubuntu #include errors detected in vscode Cannot open include file: 'stdio.h' - Visual Studio Community 2017 - C++ Error How to fix the error "Windows SDK version 8.1" was not found? Visual Studio 2017 errors on standard headers How do I check if a Key is pressed on C++

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to optimization

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Measuring execution time of a function in C++ GROUP BY having MAX date How to efficiently remove duplicates from an array without using Set Storing JSON in database vs. having a new column for each key Read file As String How to write a large buffer into a binary file in C++, fast? Is optimisation level -O3 dangerous in g++? Why is processing a sorted array faster than processing an unsorted array? MySQL my.cnf performance tuning recommendations