[c++] Why are elementwise additions much faster in separate loops than in a combined loop?

It's because the CPU doesn't have so many cache misses (where it has to wait for the array data to come from the RAM chips). It would be interesting for you to adjust the size of the arrays continually so that you exceed the sizes of the level 1 cache (L1), and then the level 2 cache (L2), of your CPU and plot the time taken for your code to execute against the sizes of the arrays. The graph shouldn't be a straight line like you'd expect.

Examples related to c++

Method Call Chaining; returning a pointer vs a reference? How can I tell if an algorithm is efficient? Difference between opening a file in binary vs text How can compare-and-swap be used for a wait-free mutual exclusion for any shared data structure? Install Qt on Ubuntu #include errors detected in vscode Cannot open include file: 'stdio.h' - Visual Studio Community 2017 - C++ Error How to fix the error "Windows SDK version 8.1" was not found? Visual Studio 2017 errors on standard headers How do I check if a Key is pressed on C++

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to x86

How to compile Tensorflow with SSE4.2 and AVX instructions? Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs How to install ia32-libs in Ubuntu 14.04 LTS (Trusty Tahr) How to run a program without an operating system? Carry Flag, Auxiliary Flag and Overflow Flag in Assembly How do AX, AH, AL map onto EAX? JNZ & CMP Assembly Instructions How does the ARM architecture differ from x86? Difference between JE/JNE and JZ/JNZ

Examples related to vectorization

numpy array TypeError: only integer scalar arrays can be converted to a scalar index Difference between map, applymap and apply methods in Pandas Why are elementwise additions much faster in separate loops than in a combined loop? Efficient evaluation of a function at every cell of a NumPy array Is there an R function for finding the index of an element in a vector? How can I apply a function to every row/column of a matrix in MATLAB?

Examples related to compiler-optimization

How to compile Tensorflow with SSE4.2 and AVX instructions? Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs Swift Beta performance: sorting arrays Why are elementwise additions much faster in separate loops than in a combined loop? Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? How to disable compiler optimizations in gcc? How to see which flags -march=native will activate? Why do we use volatile keyword? How to turn off gcc compiler optimization to enable buffer overflow