TL;DR: Use __builtin
intrinsics instead; they might happen to help.
I was able to make gcc
4.8.4 (and even 4.7.3 on gcc.godbolt.org) generate optimal code for this by using __builtin_popcountll
which uses the same assembly instruction, but gets lucky and happens to make code that doesn't have an unexpectedly long loop-carried dependency because of the false dependency bug.
I am not 100% sure of my benchmarking code, but objdump
output seems to share my views. I use some other tricks (++i
vs i++
) to make the compiler unroll loop for me without any movl
instruction (strange behaviour, I must say).
Results:
Count: 20318230000 Elapsed: 0.411156 seconds Speed: 25.503118 GB/s
Benchmarking code:
#include <stdint.h>
#include <stddef.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
uint64_t builtin_popcnt(const uint64_t* buf, size_t len){
uint64_t cnt = 0;
for(size_t i = 0; i < len; ++i){
cnt += __builtin_popcountll(buf[i]);
}
return cnt;
}
int main(int argc, char** argv){
if(argc != 2){
printf("Usage: %s <buffer size in MB>\n", argv[0]);
return -1;
}
uint64_t size = atol(argv[1]) << 20;
uint64_t* buffer = (uint64_t*)malloc((size/8)*sizeof(*buffer));
// Spoil copy-on-write memory allocation on *nix
for (size_t i = 0; i < (size / 8); i++) {
buffer[i] = random();
}
uint64_t count = 0;
clock_t tic = clock();
for(size_t i = 0; i < 10000; ++i){
count += builtin_popcnt(buffer, size/8);
}
clock_t toc = clock();
printf("Count: %lu\tElapsed: %f seconds\tSpeed: %f GB/s\n", count, (double)(toc - tic) / CLOCKS_PER_SEC, ((10000.0*size)/(((double)(toc - tic)*1e+9) / CLOCKS_PER_SEC)));
return 0;
}
Compile options:
gcc --std=gnu99 -mpopcnt -O3 -funroll-loops -march=native bench.c -o bench
GCC version:
gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4
Linux kernel version:
3.19.0-58-generic
CPU information:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 70
model name : Intel(R) Core(TM) i7-4870HQ CPU @ 2.50 GHz
stepping : 1
microcode : 0xf
cpu MHz : 2494.226
cache size : 6144 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm arat pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid xsaveopt
bugs :
bogomips : 4988.45
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: