[algorithm] How to count the number of set bits in a 32-bit integer?

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?

The answer is


Also consider the built-in functions of your compilers.

On the GNU compiler for example you can just use:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.)

The GCC builtins even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction. Other architectures have had popcount for years.


On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt (also implied by -msse4.2). See GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.


Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.


What do you means with "Best algorithm"? The shorted code or the fasted code? Your code look very elegant and it has a constant execution time. The code is also very short.

But if the speed is the major factor and not the code size then I think the follow can be faster:

       static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
        static int bitCountOfByte( int value ){
            return BIT_COUNT[ value & 0xFF ];
        }

        static int bitCountOfInt( int value ){
            return bitCountOfByte( value ) 
                 + bitCountOfByte( value >> 8 ) 
                 + bitCountOfByte( value >> 16 ) 
                 + bitCountOfByte( value >> 24 );
        }

I think that this will not more faster for a 64 bit value but a 32 bit value can be faster.


I use the following function. Haven't checked benchmarks, but it works.

int msb(int num)
{
    int m = 0;
    for (int i = 16; i > 0; i = i>>1)
    {
        // debug(i, num, m);
        if(num>>i)
        {
            m += i;
            num>>=i;
        }
    }
    return m;
}

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.


This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

Few open questions:-

  1. If the number is negative then?
  2. If the number is 1024 , then the "iteratively divide by 2" method will iterate 10 times.

we can modify the algo to support the negative number as follows:-

count = 0
while n != 0
if ((n % 2) == 1 || (n % 2) == -1
    count += 1
  n /= 2  
return count

now to overcome the second problem we can write the algo like:-

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

for complete reference see :

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html


For a happy medium between a 232 lookup table and iterating through each bit individually:

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/CountBits


Here is a portable module ( ANSI-C ) which can benchmark each of your algorithms on any architecture.

Your CPU has 9 bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure a way to make the "Hacker's Delight" algorithm portable feel free to add it in.

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Wieght of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTalbe: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String represenations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

.

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo random numbers to benmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d psudoe-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif

I have not seen this approach anywhere:

int nbits(unsigned char v) {
    return ((((v - ((v >> 1) & 0x55)) * 0x1010101) & 0x30c00c03) * 0x10040041) >> 0x1c;
}

It works per byte, so it would have to be called 4 times for a 32-bit integer. It is derived from the sideways addition but uses two 32-bit multiplications to reduce the number of instructions to only 7.

Most current C compilers will optimize this function using SIMD (SSE2) instructions when it is clear that the number of requests is a multiple of 4, and it becomes quite competitive. It is portable, can be defined as a macro or inline function and does not need data tables.

This approach can be extended to work on 16 bits at a time, using 64-bit multiplications. However, it fails when all 16 bits are set, returning zero, so it can be used only when the 0xffff input value is not present. It is also slower due to the 64-bit operations and does not optimize well.


  private int get_bits_set(int v)
    {
      int c; // c accumulates the total bits set in v
        for (c = 0; v>0; c++)
        {
            v &= v - 1; // clear the least significant bit set
        }
        return c;
    }

In Java 8 or 9 just invoke Integer.bitCount .


Fast C# solution using pre-calculated table of Byte bit counts with branching on input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS = 
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

Personally I use this :

  public static int myBitCount(long L){
      int count = 0;
      while (L != 0) {
         count++;
         L ^= L & -L; 
      }
      return count;
  }

Java JDK1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

check also,

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

//Beginning with the value 1, rotate left 16 times
     n = 1;
         for (int i = 0; i < 16; i++) {
            n = Integer.rotateLeft(n, 1);
            System.out.println(n);
         }

32-bit or not ? I just came with this method in Java after reading "cracking the coding interview" 4th edition exercice 5.5 ( chap 5: Bit Manipulation). If the least significant bit is 1 increment count, then right-shift the integer.

public static int bitCount( int n){
    int count = 0;
    for (int i=n; i!=0; i = i >> 1){
        count += i & 1;
    }
    return count;
}

I think this one is more intuitive than the solutions with constant 0x33333333 no matter how fast they are. It depends on your definition of "best algorithm" .


If you happen to be using Java, the built-in method Integer.bitCount will do that.


I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

It works because you can count the total number of set bits by dividing in two halves, counting the number of set bits in both halves and then adding them up. Also know as Divide and Conquer paradigm. Let's get into detail..

v = v - ((v >> 1) & 0x55555555); 

The number of bits in two bits can be 0b00, 0b01 or 0b10. Lets try to work this out on 2 bits..

 ---------------------------------------------
 |   v    |   (v >> 1) & 0b0101   |  v - x   |
 ---------------------------------------------
   0b00           0b00               0b00   
   0b01           0b00               0b01     
   0b10           0b01               0b01
   0b11           0b01               0b10

This is what was required: the last column shows the count of set bits in every two bit pair. If the two bit number is >= 2 (0b10) then and produces 0b01, else it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know—because of our previous operations—that every nibble has the count of set bits in it. Let's look an example. Suppose we have the byte 0b01000010. It means the first nibble has its 4bits set and the second one has its 2bits set. Now we add those nibbles together.

0b01000010 + 0b01000000

It gives us the count of set bits in a byte, in the first nibble 0b01100010 and therefore we mask the last four bytes of all the bytes in the number (discarding them).

0b01100010 & 0xF0 = 0b01100000

Now every byte has the count of set bits in it. We need to add them up all together. The trick is to multiply the result by 0b10101010 which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D B+C+D C+D D. A 4 byte number can have maximum of 32 bits set, which can be represented as 0b00100000.

All we need now is the first byte which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.


This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That later number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc(); 

256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20% Now at 800ms for 32 billion bits.


I'm very disappointed to see that nobody has responded with the functional master race recursive solution, by far the purest one (and can be used with any bit length!):

template<typename T>
int popcnt(T n)
{
  if (n>0)
    return n&1 + popcnt(n>>1);
  return 0; 
}

From Python 3.10 onwards, you will be able to use the int.bit_count() function, but for the time being, you can define this function yourself.

def bit_count(integer):
    return bin(integer).count("1")

This is the implementation in golang

func CountBitSet(n int) int {


    count := 0
    for n > 0 {
      count += n & 1
      n >>= 1

    }
    return count
}

Another Hamming weight algorithm if you're on a BMI2 capable CPU

the_weight=__tzcnt_u64(~_pext_u64(data[i],data[i]));

Have fun!


unsigned int count_bit(unsigned int x)
{
  x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
  x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
  x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
  return x;
}

Let me explain this algorithm.

This algorithm is based on Divide and Conquer Algorithm. Suppose there is a 8bit integer 213(11010101 in binary), the algorithm works like this(each time merge two neighbor blocks):

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
|  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
|    0 0 1 1    |    0 0 1 0    |  <- second time merge
|        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
+-------------------------------+

Simple algorithm to count the number of set bits:

int countbits(n){
     int count = 0;
     while(n != 0){
        n = n & (n-1);
        count++;
   }
   return count;
}

Take the example of 11 (1011) and try manually running through the algorithm. Should help you a lot!


Here is a solution that has not been mentioned so far, using bitfields. The following program counts the set bits in an array of 100000000 16-bit integers using 4 different methods. Timing results are given in parentheses (on MacOSX, with gcc -O3):

#include <stdio.h>
#include <stdlib.h>

#define LENGTH 100000000

typedef struct {
    unsigned char bit0 : 1;
    unsigned char bit1 : 1;
    unsigned char bit2 : 1;
    unsigned char bit3 : 1;
    unsigned char bit4 : 1;
    unsigned char bit5 : 1;
    unsigned char bit6 : 1;
    unsigned char bit7 : 1;
} bits;

unsigned char sum_bits(const unsigned char x) {
    const bits *b = (const bits*) &x;
    return b->bit0 + b->bit1 + b->bit2 + b->bit3 \
         + b->bit4 + b->bit5 + b->bit6 + b->bit7;
}

int NumberOfSetBits(int i) {
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

#define out(s) \
    printf("bits set: %lu\nbits counted: %lu\n", 8*LENGTH*sizeof(short)*3/4, s);

int main(int argc, char **argv) {
    unsigned long i, s;
    unsigned short *x = malloc(LENGTH*sizeof(short));
    unsigned char lut[65536], *p;
    unsigned short *ps;
    int *pi;

    /* set 3/4 of the bits */
    for (i=0; i<LENGTH; ++i)
        x[i] = 0xFFF0;

    /* sum_bits (1.772s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=sum_bits(*p++));
    out(s);

    /* NumberOfSetBits (0.404s) */
    for (i=LENGTH*sizeof(short)/sizeof(int), pi=(int*)x, s=0; i--; s+=NumberOfSetBits(*pi++));
    out(s);

    /* populate lookup table */
    for (i=0, p=(unsigned char*) &i; i<sizeof(lut); ++i)
        lut[i] = sum_bits(p[0]) + sum_bits(p[1]);

    /* 256-bytes lookup table (0.317s) */
    for (i=LENGTH*sizeof(short), p=(unsigned char*) x, s=0; i--; s+=lut[*p++]);
    out(s);

    /* 65536-bytes lookup table (0.250s) */
    for (i=LENGTH, ps=x, s=0; i--; s+=lut[*ps++]);
    out(s);

    free(x);
    return 0;
}

While the bitfield version is very readable, the timing results show that it is over 4x slower than NumberOfSetBits(). The lookup-table based implementations are still quite a bit faster, in particular with a 65 kB table.


if you're using C++ another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
        // return the least significant bit plus the result of calling ourselves with
        // .. the shifted value
        return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
        return val & 0x1;
}

usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

you could of course further expand this template to use different types (even auto-detecting bit size) but I've kept it simple for clarity.

edit: forgot to mention this is good because it should work in any C++ compiler and it basically just unrolls your loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find)


int countBits(int x)
{
    int n = 0;
    if (x) do n++;
           while(x=x&(x-1));
    return n;
}   

Or also:

int countBits(int x) { return (x)? 1+countBits(x&(x-1)): 0; }

public class BinaryCounter {

private int N;

public BinaryCounter(int N) {
    this.N = N;
}

public static void main(String[] args) {

    BinaryCounter counter=new BinaryCounter(7);     
    System.out.println("Number of ones is "+ counter.count());

}

public int count(){
    if(N<=0) return 0;
    int counter=0;
    int K = 0;
    do{
        K = biggestPowerOfTwoSmallerThan(N);
        N = N-K;
        counter++;
    }while (N != 0);
    return counter;

}

private int biggestPowerOfTwoSmallerThan(int N) {
    if(N==1) return 1;
    for(int i=0;i<N;i++){
        if(Math.pow(2, i) > N){
            int power = i-1;
            return (int) Math.pow(2, power);
        }
    }
    return 0;
}
}

def hammingWeight(n):
    count = 0
    while n:
        if n&1:
            count += 1
        n >>= 1
    return count

I use the below code which is more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

Logic : n & (n-1) resets the last set bit of n.

P.S : I know this is not O(1) solution, albeit an interesting solution.


C++20 std::popcount

The following proposal has been merged http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0553r4.html and should add it to a the <bit> header.

I expect the usage to be like:

#include <bit>
#include <iostream>

int main() {
    std::cout << std::popcount(0x55) << std::endl;
}

I'll give it a try when support arrives to GCC, GCC 9.1.0 with g++-9 -std=c++2a still doesn't support it.

The proposal says:

Header: <bit>

namespace std {

  // 25.5.6, counting
  template<class T>
    constexpr int popcount(T x) noexcept;

and:

template<class T>
  constexpr int popcount(T x) noexcept;

Constraints: T is an unsigned integer type (3.9.1 [basic.fundamental]).

Returns: The number of 1 bits in the value of x.

std::rotl and std::rotr were also added to do circular bit rotations: Best practices for circular shift (rotate) operations in C++


I think the Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

int countSetBits(unsigned int n) { 
    unsigned int n; // count the number of bits set in n
    unsigned int c; // c accumulates the total bits set in n
    for (c=0;n>0;n=n&(n-1)) c++; 
    return c; 
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"


A simple way which should work nicely for a small amount of bits it something like this (For 4 bits in this example):

(i & 1) + (i & 2)/2 + (i & 4)/4 + (i & 8)/8

Would others recommend this for a small number of bits as a simple solution?


I am giving two algorithms to answer the question,

  package countSetBitsInAnInteger;

    import java.util.Scanner;

    public class UsingLoop {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        try{
        System.out.println("Enter a integer number to check for set bits in it");
        int n = in.nextInt();
        System.out.println("Using while loop, we get the number of set bits as: "+usingLoop(n));
        System.out.println("Using Brain Kernighan's Algorithm, we get the number of set bits as: "+usingBrainKernighan(n));
        System.out.println("Using ");
        }
        finally{
        in.close();
        }
    }
    private static int usingBrainKernighan(int n) {
        int count = 0;
        while(n>0){
            n&=(n-1);
            count++;
        }
        return count;
    }/*
    Analysis:
        Time complexity = O(lgn)
        Space complexity = O(1)
    */
    private static int usingLoop(int n) {
        int count = 0;
        for(int i=0;i<32;i++){
            if((n&(1<<i))!=0)
                count++;
        }
        return count;
    }
    /*
    Analysis:
        Time Complexity = O(32) // Maybe the complexity is O(lgn)
        Space Complexity = O(1)
    */
    }

You can use built in function named __builtin_popcount(). There is no__builtin_popcount in C++ but it is a built in function of GCC compiler. This function return the number of set bit in an integer.

int __builtin_popcount (unsigned int x);

Reference : Bit Twiddling Hacks


what you can do is

while(n){
    n=n&(n-1);
    count++;
}

the logic behind this is the bits of n-1 is inverted from rightmost set bit of n. if n=6 i.e 110 then 5 is 101 the bits are inverted from rightmost set bit of n. so if we & these two we will make the rightmost bit 0 in every iteration and always go to the next rightmost set bit.Hence, counting the set bit.The worst time complexity will be O(logn) when every bit is set.


It's not the fastest or best solution, but I found the same question in my way, and I started to think and think. finally I realized that it can be done like this if you get the problem from mathematical side, and draw a graph, then you find that it's a function which has some periodic part, and then you realize the difference between the periods... so here you go:

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}

Here's something that works in PHP (all PHP intergers are 32 bit signed, thus 31 bit):

function bits_population($nInteger)
{

    $nPop=0;
    while($nInteger)
    {
        $nInteger^=(1<<(floor(1+log($nInteger)/log(2))-1));
        $nPop++;
    }
    return $nPop;
}

I found an implementation of bit counting in an array with using of SIMD instruction (SSSE3 and AVX2). It has in 2-2.5 times better performance than if it will use __popcnt64 intrinsic function.

SSSE3 version:

#include <smmintrin.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum =  _mm128_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

This will also work fine :

int ans = 0;
while(num){
 ans += (num &1);
 num = num >>1;
}    
return ans;

You can do something like:

int countSetBits(int n)
{
    n=((n&0xAAAAAAAA)>>1) + (n&0x55555555);
    n=((n&0xCCCCCCCC)>>2) + (n&0x33333333);
    n=((n&0xF0F0F0F0)>>4) + (n&0x0F0F0F0F);
    n=((n&0xFF00FF00)>>8) + (n&0x00FF00FF);
    return n;
}

int main()
{
    int n=10;
    printf("Number of set bits: %d",countSetBits(n));
     return 0;
}

See heer: http://ideone.com/JhwcX

The working can be explained as follows:

First, all the even bits are shifted towards right & added with the odd bits to count the number of bits in group of two. Then we work in group of two, then four & so on..


How about converting the integer to a binary string and count the ones?

php solution:

substr_count( decbin($integer), '1' );

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2  

I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity


I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is a secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.

  • Don Gillies

I'm particularly fond of this example from the fortune file:

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777)
                             - (((x)>>2)&0x33333333)
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!


The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.

unsigned int bitCount(unsigned int x)
{
  x = ((x >> 1) & 0b01010101010101010101010101010101)
     + (x       & 0b01010101010101010101010101010101);
  x = ((x >> 2) & 0b00110011001100110011001100110011)
     + (x       & 0b00110011001100110011001100110011); 
  x = ((x >> 4) & 0b00001111000011110000111100001111)
     + (x       & 0b00001111000011110000111100001111); 
  x = ((x >> 8) & 0b00000000111111110000000011111111)
     + (x       & 0b00000000111111110000000011111111); 
  x = ((x >> 16)& 0b00000000000000001111111111111111)
     + (x       & 0b00000000000000001111111111111111); 
  return x;
}

The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.


I always use this in Competitive Programming and it's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}

Here is the sample code, which might be useful.

private static final int[] bitCountArr = new int[]{0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
private static final int firstByteFF = 255;
public static final int getCountOfSetBits(int value){
    int count = 0;
    for(int i=0;i<4;i++){
        if(value == 0) break;
        count += bitCountArr[value & firstByteFF];
        value >>>= 8;
    }
    return count;
}

python solution:

def hammingWeight(n: int) -> int:
    sums = 0
    while (n!=0):
        sums+=1
        n = n &(n-1)
        
    return sums

In the binary representation, the least significant 1-bit in n always corresponds to a 0-bit in n - 1. Therefore, anding the two numbers n and n - 1 always flips the least significant 1-bit in n to 0, and keeps all other bits the same.

enter image description here


In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

Although these rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.


For Java, there is a java.util.BitSet. https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html

cardinality(): Returns the number of bits set to true in this BitSet.

The BitSet is memory efficient since it's stored as a Long.


// How about the following:
public int CountBits(int value)
{
    int count = 0;
    while (value > 0)
    {
        if (value & 1)
            count++;
        value <<= 1;
    }
    return count;
}

int bitcount(unsigned int n)
{ 
      int count=0;
      while(n)
      {
           count += n & 0x1u;
           n >>= 1;
      }
      return  count;
 }

Iterated 'count' runs in time proportional to the total number of bits. It simply loops through all the bits, terminating slightly earlier because of the while condition. Useful, if 1'S or the set bits are sparse and among the least significant bits.


I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds:

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64Kb of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.


There are many algorithm to count the set bits; but i think the best one is the faster one! You can see the detailed on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
     % 0x1f;

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 
     0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.


From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in ~20-ish instructions (arch dependent), no branching.

Hacker's Delight is delightful! Highly recommended.


#!/user/local/bin/perl


    $c=0x11BBBBAB;
     $count=0;
     $m=0x00000001;
    for($i=0;$i<32;$i++)
    {
        $f=$c & $m;
        if($f == 1)
        {
            $count++;
        }
        $c=$c >> 1;
    }
    printf("%d",$count);

ive done it through a perl script. the number taken is $c=0x11BBBBAB   
B=3 1s   
A=2 1s   
so in total  
1+1+3+3+3+2+3+3=19

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to binary

Difference between opening a file in binary vs text Remove 'b' character do in front of a string literal in Python 3 Save and retrieve image (binary) from SQL Server using Entity Framework 6 bad operand types for binary operator "&" java C++ - Decimal to binary converting Converting binary to decimal integer output How to convert string to binary? How to convert 'binary string' to normal string in Python3? Read and write to binary files in C? Convert to binary and keep leading zeros in Python

Examples related to bit-manipulation

What is (x & 1) and (x >>= 1)? 'and' (boolean) vs '&' (bitwise) - Why difference in behavior with lists vs numpy arrays? What does AND 0xFF do? bitwise XOR of hex numbers in python What is Bit Masking? What does a bitwise shift (left or right) do and what is it used for? Implement division with bit-wise operator How can I multiply and divide using only bit shifting and adding? In C/C++ what's the simplest way to reverse the order of bits in a byte? How do I get bit-by-bit data from an integer value in C?

Examples related to hammingweight

How to count the number of set bits in a 32-bit integer?

Examples related to iec10967

How to count the number of set bits in a 32-bit integer?