[c] What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the furthest left bit that is a 1), what is the quickest/most efficient method of finding out?

I know that POSIX supports a ffs() method in strings.h to find the first set bit, but there doesn't seem to be a corresponding fls() method.

Is there some really obvious way of doing this that I'm missing?

What about in cases where you can't use POSIX functions for portability?

Edit: What about a solution that works on both 32 and 64 bit architectures (many of the code listings seem like they'd only work on 32 bit ints).

This question is related to c algorithm optimization bit-manipulation

The answer is


Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.

#include <limits.h>

unsigned int fls(const unsigned int value)
{
    return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

unsigned long flsl(const unsigned long value)
{
    return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

unsigned long long flsll(const unsigned long long value)
{
    return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}

And a little improved version for C++.

#include <climits>

constexpr unsigned int fls(const unsigned int value)
{
    return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

constexpr unsigned long fls(const unsigned long value)
{
    return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

constexpr unsigned long long fls(const unsigned long long value)
{
    return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}

The code assumes that value won't be 0. If you want to allow 0, you need to modify it.


Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:

readonly static byte[] msb_tab_15;

// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
    var p = new byte[0x8000];

    for (byte n = 0; n < 16; n++)
        for (int c = (1 << n) >> 1, i = 0; i < c; i++)
            p[c + i] = n;

    msb_tab_15 = p;
}

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

ulong (64-bit) Version

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63

    int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
    j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
    return j + msb_tab_15[v >> (j + 1)];
}

uint (32-bit) Version

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
    if ((int)v <= 0)
        return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31

    int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
    return j + msb_tab_15[v >> (j + 1)];
}

Various overloads for the above

public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];

This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.




That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.


The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

 1  candidates.HighestOne_Tab16A               622,496
 2  candidates.HighestOne_Tab16C               628,234
 3  candidates.HighestOne_Tab8A                649,146
 4  candidates.HighestOne_Tab8B                656,847
 5  candidates.HighestOne_Tab16B               657,147
 6  candidates.HighestOne_Tab16D               659,650
 7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
 8  de_Bruijn.IndexOfMSB                       709,672
 9  _old_2.HighestOne_Old2                     715,810
10  _test_A.HighestOne8                        757,188
11  _old_1.HighestOne_Old1                     757,925
12  _test_A.HighestOne5  (unsafe)              760,387
13  _test_B.HighestOne8  (unsafe)              763,904
14  _test_A.HighestOne3  (unsafe)              766,433
15  _test_A.HighestOne1  (unsafe)              767,321
16  _test_A.HighestOne4  (unsafe)              771,702
17  _test_B.HighestOne2  (unsafe)              772,136
18  _test_B.HighestOne1  (unsafe)              772,527
19  _test_B.HighestOne3  (unsafe)              774,140
20  _test_A.HighestOne7  (unsafe)              774,581
21  _test_B.HighestOne7  (unsafe)              775,463
22  _test_A.HighestOne2  (unsafe)              776,865
23  candidates.HighestOne_NoTab                777,698
24  _test_B.HighestOne6  (unsafe)              779,481
25  _test_A.HighestOne6  (unsafe)              781,553
26  _test_B.HighestOne4  (unsafe)              785,504
27  _test_B.HighestOne5  (unsafe)              789,797
28  _test_A.HighestOne0  (unsafe)              809,566
29  _test_B.HighestOne0  (unsafe)              814,990
30  _highest_one_bit.HighestOne                824,345
30  _bitarray_ext.RtlFindMostSignificantBit    894,069
31  candidates.HighestOne_Naive                898,865

Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);

It's really too bad, because here's the entire actual function:

    RtlFindMostSignificantBit:
        bsr rdx, rcx  
        mov eax,0FFFFFFFFh  
        movzx ecx, dl  
        cmovne      eax,ecx  
        ret

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

public static int HighestOne_Tab8A(ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 64) - 1;

    int j;
    j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
    j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
    j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
    return j + msb_tab_8[v >> j];
}

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
            N_bsr64 = 0x03F79D71B4CB0A89;

readonly public static sbyte[]
bsf64 =
{
    63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
    61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
    62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
    56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
},
bsr64 =
{
     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
};

public static int IndexOfLSB(ulong v) =>
    v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;

public static int IndexOfMSB(ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 64) - 1;

    v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
    v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
    return bsr64[(v * N_bsr64) >> 58];
}

There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.


My humble method is very simple:

MSB(x) = INT[Log(x) / Log(2)]

Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).

This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.


Think bitwise operators.

I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:

position = sizeof(int)*8
while(!(n & cmp)){ 
   n <<=1;
   position--;
}

A version in C using successive approximation:

unsigned int getMsb(unsigned int n)
{
  unsigned int msb  = sizeof(n) * 4;
  unsigned int step = msb;
  while (step > 1)
 {
    step /=2;
    if (n>>msb)
     msb += step;
   else
     msb -= step;
 }
  if (n>>msb)
    msb++;
  return (msb - 1);
}

Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same. ( 4 loops when using "unsigned int")


Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:

asm ("bsrl %1, %0" 
     : "=r" (position) 
     : "r" (number));

Note that what you are trying to do is calculate the integer log2 of an integer,

#include <stdio.h>
#include <stdlib.h>

unsigned int
Log2(unsigned long x)
{
    unsigned long n = x;
    int bits = sizeof(x)*8;
    int step = 1; int k=0;
    for( step = 1; step < bits; ) {
        n |= (n >> step);
        step *= 2; ++k;
    }
    //printf("%ld %ld\n",x, (x - (n >> 1)) );
    return(x - (n >> 1));
}

Observe that you can attempt to search more than 1 bit at a time.

unsigned int
Log2_a(unsigned long x)
{
    unsigned long n = x;
    int bits = sizeof(x)*8;
    int step = 1;
    int step2 = 0;
    //observe that you can move 8 bits at a time, and there is a pattern...
    //if( x>1<<step2+8 ) { step2+=8;
        //if( x>1<<step2+8 ) { step2+=8;
            //if( x>1<<step2+8 ) { step2+=8;
            //}
        //}
    //}
    for( step2=0; x>1L<<step2+8; ) {
        step2+=8;
    }
    //printf("step2 %d\n",step2);
    for( step = 0; x>1L<<(step+step2); ) {
        step+=1;
        //printf("step %d\n",step+step2);
    }
    printf("log2(%ld) %d\n",x,step+step2);
    return(step+step2);
}

This approach uses a binary search

unsigned int
Log2_b(unsigned long x)
{
    unsigned long n = x;
    unsigned int bits = sizeof(x)*8;
    unsigned int hbit = bits-1;
    unsigned int lbit = 0;
    unsigned long guess = bits/2;
    int found = 0;

    while ( hbit-lbit>1 ) {
        //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        //when value between guess..lbit
        if( (x<=(1L<<guess)) ) {
           //printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
            hbit=guess;
            guess=(hbit+lbit)/2;
            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        }
        //when value between hbit..guess
        //else
        if( (x>(1L<<guess)) ) {
            //printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
            lbit=guess;
            guess=(hbit+lbit)/2;
            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        }
    }
    if( (x>(1L<<guess)) ) ++guess;
    printf("log2(x%ld)=r%d\n",x,guess);
    return(guess);
}

Another binary search method, perhaps more readable,

unsigned int
Log2_c(unsigned long x)
{
    unsigned long v = x;
    unsigned int bits = sizeof(x)*8;
    unsigned int step = bits;
    unsigned int res = 0;
    for( step = bits/2; step>0; )
    {
        //printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
        while ( v>>step ) {
            v>>=step;
            res+=step;
            //printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
        }
        step /= 2;
    }
    if( (x>(1L<<res)) ) ++res;
    printf("log2(x%ld)=r%ld\n",x,res);
    return(res);
}

And because you will want to test these,

int main()
{
    unsigned long int x = 3;
    for( x=2; x<1000000000; x*=2 ) {
        //printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
        printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
        printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
        printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
    }
    return(0);
}

thats some kind of binary search, it works with all kinds of (unsigned!) integer types

#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int msb(UINT x)
{
    if(0 == x)
        return -1;

    int c = 0;

    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
    if(static_cast<UINT>(x >> i))
    {
        x >>= i;
        c |= i;
    }

    return c;
}

to make complete:

#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int lsb(UINT x)
{
    if(0 == x)
        return -1;

    int c = UINT_BIT-1;

    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
    if(static_cast<UINT>(x << i))
    {
        x <<= i;
        c ^= i;
    }

    return c;
}

Woaw, that was many answers. I am not sorry for answering on an old question.

int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
    if(0xFFFFFFFF00000000&value){  value>>=(1<<5); result|=(1<<5);  }//if it is 32bit then remove this line
    if(0x00000000FFFF0000&value){  value>>=(1<<4); result|=(1<<4);  }//and remove the 32msb
    if(0x000000000000FF00&value){  value>>=(1<<3); result|=(1<<3);  }
    if(0x00000000000000F0&value){  value>>=(1<<2); result|=(1<<2);  }
    if(0x000000000000000C&value){  value>>=(1<<1); result|=(1<<1);  }
    if(0x0000000000000002&value){  result|=(1<<0);  }
}else{
  result=-1;
}

This answer is pretty similar to another answer... oh well.


I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

int highest_bit(unsigned int a) {
  static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
  const unsigned int *mask = maskv;
  int l, h;

  if (a == 0) return -1;

  l = 0;
  h = 32;

  do {
    int m = l + (h - l) / 2;

    if ((a >> m) != 0) l = m;
    else if ((a & (*mask << l)) != 0) h = m;

    mask++;
  } while (l < h - 1);

  return l;
}

Use a combination of VPTEST(D, W, B) and PSRLDQ instructions to focus in on the byte containing the most significant bit as shown below using an emulation of these instructions in Perl found at:

https://github.com/philiprbrenan/SimdAvx512

if (1) {                                                                        #TpositionOfMostSignificantBitIn64
  my @m = (                                                                     # Test strings
#B0       1       2       3       4       5       6       7
#b0123456701234567012345670123456701234567012345670123456701234567
 '0000000000000000000000000000000000000000000000000000000000000000',
 '0000000000000000000000000000000000000000000000000000000000000001',
 '0000000000000000000000000000000000000000000000000000000000000010',
 '0000000000000000000000000000000000000000000000000000000000000111',
 '0000000000000000000000000000000000000000000000000000001010010000',
 '0000000000000000000000000000000000001000000001100100001010010000',
 '0000000000000000000001001000010000000000000001100100001010010000',
 '0000000000000000100000000000000100000000000001100100001010010000',
 '1000000000000000100000000000000100000000000001100100001010010000',
);
  my @n = (0, 1, 2, 3, 10, 28, 43, 48, 64);                                     # Expected positions of msb

  sub positionOfMostSignificantBitIn64($)                                       # Find the position of the most significant bit in a string of 64 bits starting from 1 for the least significant bit or return 0 if the input field is all zeros
   {my ($s64) = @_;                                                             # String of 64 bits

    my $N = 128;                                                                # 128 bit operations
    my $f = 0;                                                                  # Position of first bit set
    my $x = '0'x$N;                                                             # Double Quad Word set to 0
    my $s = substr $x.$s64, -$N;                                                # 128 bit area needed

    substr(VPTESTMD($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 4) : ($f += 32);  # Test 2 dwords
    substr(VPTESTMW($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 2) : ($f += 16);  # Test 2 words
    substr(VPTESTMB($s, $s), -2, 1) eq '1' ? ($s = PSRLDQ $s, 1) : ($f +=  8);  # Test 2 bytes

    $s = substr($s, -8);                                                        # Last byte remaining

    $s < $_ ? ++$f : last for                                                   # Search remaing byte
     (qw(10000000 01000000 00100000 00010000
         00001000 00000100 00000010 00000001));

    64 - $f                                                                     # Position of first bit set
   }

  ok $n[$_] eq positionOfMostSignificantBitIn64 $m[$_] for keys @m              # Test
 }


unsigned int
msb32(register unsigned int x)
{
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);
        return(x & ~(x >> 1));
}

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.

From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit


This should be lightning fast:

int msb(unsigned int v) {
  static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
    30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
    16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v = (v >> 1) + 1;
  return pos[(v * 0x077CB531UL) >> 27];
}

I assume your question is for an integer (called v below) and not an unsigned integer.

int v = 612635685; // whatever value you wish

unsigned int get_msb(int v)
{
    int r = 31;                         // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.

    while (!(v & 0x80000000) && r--) {   // mask of the highest bit
        v <<= 1;                        // multiply integer by 2.
    }
    return r;                           // will even return -1 if no bit was set, allowing error catch
}

If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly). Please let me know if I forgot anything. I haven't tested it but it should work just fine.


This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.

My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.

double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023;  // assumes x86 endianness

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.


To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.

// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?

double ff=(double)(v|1);

uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;

Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.


This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).

This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD


The code:

    // x>=1;
    unsigned func(unsigned x) {
    double d = x ;
    int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
    printf( "The left-most non zero bit of %d is bit %d\n", x, p);
    }

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1


I know this question is very old, but just having implemented an msb() function myself, I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01. See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.

It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:

template <typename T>
auto msb(T n) -> int
{
    static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
        "msb<T>(): T must be an unsigned integral type.");

    for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
    {
        if ((n & mask) != 0)
            return i;
    }

    return 0;
}

Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size. My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.

1The argument goes like this (rough draft): Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.

If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.


Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

u8  inline log2 (u32 val)  {
    u8  k = 0;
    if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }
    if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }
    if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }
    if (val > 0x00000003u) { val >>= 2;  k |= 2;  }
    k |= (val & 2) >> 1;
    return k;
}

Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.

Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.

u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};

u8 log2_table(u32 val)  {
    u8  k = 0;
    if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }
    if (val > 0x000000FFuL) { val >>=  8; k |=  8; }
    k |= kTableLog2[val]; // precompute the Log2 of the low byte

    return k;
}

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.


Here are some (simple) benchmarks, of algorithms currently given on this page...

The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)

On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...

//////// go.c ///////////////////////////////
// compile with:  gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/***************** math ********************/

#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */    \
  ((unsigned) log2(a))         /* thus: do not use if a <= 0 */  

#define NUM_OF_HIGHESTBITmath(a) ((a)               \
                  ? (1U << POS_OF_HIGHESTBITmath(a))    \
                  : 0)



/***************** clz ********************/

unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */

#define NUM_OF_HIGHESTBITclz(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITclz(a))  \
                 : 0)


/***************** i2f ********************/

double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)


#define NUM_OF_HIGHESTBITi2f(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITi2f(a))  \
                 : 0)




/***************** asm ********************/

unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)

#define NUM_OF_HIGHESTBITasm(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITasm(a))  \
                 : 0)




/***************** bitshift1 ********************/

#define NUM_OF_HIGHESTBITbitshift1(a) (({   \
  OUT = a;                  \
  OUT |= (OUT >> 1);                \
  OUT |= (OUT >> 2);                \
  OUT |= (OUT >> 4);                \
  OUT |= (OUT >> 8);                \
  OUT |= (OUT >> 16);               \
      }), (OUT & ~(OUT >> 1)))          \



/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
             30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
             16, 7, 26, 12, 18, 6, 11, 5, 10, 9};

#define POS_OF_HIGHESTBITbitshift2(a) (({   \
  OUT = a;                  \
  OUT |= OUT >> 1;              \
  OUT |= OUT >> 2;              \
  OUT |= OUT >> 4;              \
  OUT |= OUT >> 8;              \
  OUT |= OUT >> 16;             \
  OUT = (OUT >> 1) + 1;             \
      }), POS[(OUT * 0x077CB531UL) >> 27])

#define NUM_OF_HIGHESTBITbitshift2(a) ((a)              \
                       ? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
                       : 0)



#define LOOPS 100000000U

int main()
{
  time_t start, end;
  unsigned ui;
  unsigned n;

  /********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
  printf("math\n");
  for (ui = 0U; ui < 18; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));

  printf("\n\n");

  printf("clz\n");
  for (ui = 0U; ui < 18U; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));

  printf("\n\n");

  printf("i2f\n");
  for (ui = 0U; ui < 18U; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));

  printf("\n\n");

  printf("asm\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
  }

  printf("\n\n");

  printf("bitshift1\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
  }

  printf("\n\n");

  printf("bitshift2\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
  }

  printf("\n\nPlease wait...\n\n");


  /************************* Simple clock() benchmark ******************/
  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITmath(ui);
  end = clock();
  printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITclz(ui);
  end = clock();
  printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITi2f(ui);
  end = clock();
  printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITasm(ui);
  end = clock();
  printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITbitshift1(ui);
  end = clock();
  printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITbitshift2(ui);
  end = clock();
  printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");

  return EXIT_SUCCESS;
}

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.

(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)

Example code for gcc:

#include <iostream>

int main (int,char**)
{
  int n=1;
  for (;;++n) {
    int msb;
    asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
    std::cout << n << " : " << msb << std::endl;
  }
  return 0;
}

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.


Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

unsigned int v;
unsigned r = 0;

while (v >>= 1) {
    r++;
}

This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.

If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.


As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:

#if defined(__LP64__) || defined(_LP64)
# define BUILD_64   1
#endif

#include <stdio.h>
#include <stdint.h>  /* for uint32_t */

/* CHAR_BIT  (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT  8
#endif  /* CHAR_BIT */

/* 
 * Find the log base 2 of an integer with the MSB N set in O(N)
 * operations. (on 64bit & 32bit architectures)
 */
int
getmsb (uint32_t word)
{
    int r = 0;
    if (word < 1)
        return 0;
#ifdef BUILD_64
    union { uint32_t u[2]; double d; } t;  // temp
    t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
    t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
    t.d -= 4503599627370496.0;
    r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
    while (word >>= 1)
    {
        r++;
    }
#endif  /* BUILD_64 */
    return r;
}

Expanding on Josh's benchmark... one can improve the clz as follows

/***************** clz2 ********************/

#define NUM_OF_HIGHESTBITclz2(a) ((a)                              \
                  ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
                  : 0)

Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.


has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:

const auto n = 13UL;
const auto Index = (unsigned long)log2(n);

printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

An n of 0UL needs to be guarded against as well, because:

-8 is returned and FE_DIVBYZERO is raised

I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi


The corollary to ephemient's gcc only answer is:

const auto n = 13UL;
unsigned long Index;

_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

The documentation for _BitScanReverse states that Index is:

Loaded with the bit position of the first set bit (1) found

In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:

0 if no set bits were found

Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409


What about

int highest_bit(unsigned int a) {
    int count;
    std::frexp(a, &count);
    return count - 1;
}

?


GCC has:

 -- Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position.  If X is 0, the result is undefined.

 -- Built-in Function: int __builtin_clzl (unsigned long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long'.

 -- Built-in Function: int __builtin_clzll (unsigned long long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long long'.

I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.


A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.

To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)

There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).

(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)


Putting this in since it's 'yet another' approach, seems to be different from others already given.

returns -1 if x==0, otherwise floor( log2(x)) (max result 31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

This is what I use when I don't want to use __builtin_clz because of portability issues.

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

int log2floor( unsigned x ){
   static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
   int r = 0;
   unsigned xk = x >> 16;
   if( xk != 0 ){
       r = 16;
       x = xk;
   }
   // x is 0 .. 0xFFFF
   xk = x >> 8;
   if( xk != 0){
       r += 8;
       x = xk;
   }
   // x is 0 .. 0xFF
   xk = x >> 4;
   if( xk != 0){
       r += 4;
       x = xk;
   }
   // now x is 0..15; x=0 only if originally zero.
   return r + wtab[x];
}

Kaz Kylheku here

I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

(I happen to need this "find highest bit" for something, you see.)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.

The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).

int highest_bit_unrolled(long long n)
{
  if (n & 0x7FFFFFFF00000000) {
    if (n & 0x7FFF000000000000) {
      if (n & 0x7F00000000000000) {
        if (n & 0x7000000000000000) {
          if (n & 0x4000000000000000)
            return 63;
          else
            return (n & 0x2000000000000000) ? 62 : 61;
        } else {
          if (n & 0x0C00000000000000)
            return (n & 0x0800000000000000) ? 60 : 59;
          else
            return (n & 0x0200000000000000) ? 58 : 57;
        }
      } else {
        if (n & 0x00F0000000000000) {
          if (n & 0x00C0000000000000)
            return (n & 0x0080000000000000) ? 56 : 55;
          else
            return (n & 0x0020000000000000) ? 54 : 53;
        } else {
          if (n & 0x000C000000000000)
            return (n & 0x0008000000000000) ? 52 : 51;
          else
            return (n & 0x0002000000000000) ? 50 : 49;
        }
      }
    } else {
      if (n & 0x0000FF0000000000) {
        if (n & 0x0000F00000000000) {
          if (n & 0x0000C00000000000)
            return (n & 0x0000800000000000) ? 48 : 47;
          else
            return (n & 0x0000200000000000) ? 46 : 45;
        } else {
          if (n & 0x00000C0000000000)
            return (n & 0x0000080000000000) ? 44 : 43;
          else
            return (n & 0x0000020000000000) ? 42 : 41;
        }
      } else {
        if (n & 0x000000F000000000) {
          if (n & 0x000000C000000000)
            return (n & 0x0000008000000000) ? 40 : 39;
          else
            return (n & 0x0000002000000000) ? 38 : 37;
        } else {
          if (n & 0x0000000C00000000)
            return (n & 0x0000000800000000) ? 36 : 35;
          else
            return (n & 0x0000000200000000) ? 34 : 33;
        }
      }
    }
  } else {
    if (n & 0x00000000FFFF0000) {
      if (n & 0x00000000FF000000) {
        if (n & 0x00000000F0000000) {
          if (n & 0x00000000C0000000)
            return (n & 0x0000000080000000) ? 32 : 31;
          else
            return (n & 0x0000000020000000) ? 30 : 29;
        } else {
          if (n & 0x000000000C000000)
            return (n & 0x0000000008000000) ? 28 : 27;
          else
            return (n & 0x0000000002000000) ? 26 : 25;
        }
      } else {
        if (n & 0x0000000000F00000) {
          if (n & 0x0000000000C00000)
            return (n & 0x0000000000800000) ? 24 : 23;
          else
            return (n & 0x0000000000200000) ? 22 : 21;
        } else {
          if (n & 0x00000000000C0000)
            return (n & 0x0000000000080000) ? 20 : 19;
          else
            return (n & 0x0000000000020000) ? 18 : 17;
        }
      }
    } else {
      if (n & 0x000000000000FF00) {
        if (n & 0x000000000000F000) {
          if (n & 0x000000000000C000)
            return (n & 0x0000000000008000) ? 16 : 15;
          else
            return (n & 0x0000000000002000) ? 14 : 13;
        } else {
          if (n & 0x0000000000000C00)
            return (n & 0x0000000000000800) ? 12 : 11;
          else
            return (n & 0x0000000000000200) ? 10 : 9;
        }
      } else {
        if (n & 0x00000000000000F0) {
          if (n & 0x00000000000000C0)
            return (n & 0x0000000000000080) ? 8 : 7;
          else
            return (n & 0x0000000000000020) ? 6 : 5;
        } else {
          if (n & 0x000000000000000C)
            return (n & 0x0000000000000008) ? 4 : 3;
          else
            return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
        }
      }
    }
  }
}

int highest_bit(long long n)
{
  const long long mask[] = {
    0x000000007FFFFFFF,
    0x000000000000FFFF,
    0x00000000000000FF,
    0x000000000000000F,
    0x0000000000000003,
    0x0000000000000001
  };
  int hi = 64;
  int lo = 0;
  int i = 0;

  if (n == 0)
    return 0;

  for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
    int mi = lo + (hi - lo) / 2;

    if ((n >> mi) != 0)
      lo = mi;
    else if ((n & (mask[i] << lo)) != 0)
      hi = mi;
  }

  return lo + 1;
}

Quick and dirty test program:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int highest_bit_unrolled(long long n);
int highest_bit(long long n);

main(int argc, char **argv)
{
  long long n = strtoull(argv[1], NULL, 0);
  int b1, b2;
  long i;
  clock_t start = clock(), mid, end;

  for (i = 0; i < 1000000000; i++)
    b1 = highest_bit_unrolled(n);

  mid = clock();

  for (i = 0; i < 1000000000; i++)
    b2 = highest_bit(n);

  end = clock();

  printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);

  printf("time1 = %d\n", (int) (mid - start));
  printf("time2 = %d\n", (int) (end - mid));
  return 0;
}

Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

I also benchmarked against the naive bit shifting code:

int highest_bit_shift(long long n)
{
  int i = 0;
  for (; n; n >>= 1, i++)
    ; /* empty */
  return i;
}

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.


Examples related to c

conflicting types for 'outchar' Can't compile C program on a Mac after upgrade to Mojave Program to find largest and second largest number in array Prime numbers between 1 to 100 in C Programming Language In c, in bool, true == 1 and false == 0? How I can print to stderr in C? Visual Studio Code includePath "error: assignment to expression with array type error" when I assign a struct field (C) Compiling an application for use in highly radioactive environments How can you print multiple variables inside a string using printf?

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to optimization

Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Measuring execution time of a function in C++ GROUP BY having MAX date How to efficiently remove duplicates from an array without using Set Storing JSON in database vs. having a new column for each key Read file As String How to write a large buffer into a binary file in C++, fast? Is optimisation level -O3 dangerous in g++? Why is processing a sorted array faster than processing an unsorted array? MySQL my.cnf performance tuning recommendations

Examples related to bit-manipulation

What is (x & 1) and (x >>= 1)? 'and' (boolean) vs '&' (bitwise) - Why difference in behavior with lists vs numpy arrays? What does AND 0xFF do? bitwise XOR of hex numbers in python What is Bit Masking? What does a bitwise shift (left or right) do and what is it used for? Implement division with bit-wise operator How can I multiply and divide using only bit shifting and adding? In C/C++ what's the simplest way to reverse the order of bits in a byte? How do I get bit-by-bit data from an integer value in C?