I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.
With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.
#define BitCount(X,Y) \
Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
Y = ((Y + (Y >> 3)) & 030707070707); \
Y = (Y + (Y >> 6)); \
Y = (Y + (Y >> 12) + (Y >> 24)) & 077;
Here is a secret to the first and most complex step:
input output
AB CD Note
00 00 = AB
01 01 = AB
10 01 = AB - (A >> 1) & 0x1
11 10 = AB - (A >> 1) & 0x1
so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.