Have you tried moving the reduction step outside the loop? Right now you have a data dependency that really isn't needed.
Try:
uint64_t subset_counts[4] = {};
for( unsigned k = 0; k < 10000; k++){
// Tight unrolled loop with unsigned
unsigned i=0;
while (i < size/8) {
subset_counts[0] += _mm_popcnt_u64(buffer[i]);
subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
i += 4;
}
}
count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];
You also have some weird aliasing going on, that I'm not sure is conformant to the strict aliasing rules.