Efficient Algorithm for Bit Reversal (from MSB->LSB to LSB->MSB) in C

Question

What is the most efficient algorithm to achieve the following:

    0010 0000 => 0000 0100

The conversion is from MSB->LSB to LSB->MSB. All bits must be reversed; that is, this is not endianness-swapping.

User · Accepted Answer

NOTE: All algorithms below are in C, but should be portable to your language of choice (just don't look at me when they're not as fast).

Options

Low Memory (32-bit int, 32-bit machine):

    unsigned int
    reverse(register unsigned int x)
    {
        x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
        x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
        x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
        x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
        return ((x >> 16) | (x << 16));
    }

From the famous Bit Twiddling Hacks page:

Fastest (lookup table):

    static const unsigned char BitReverseTable256[] =
    {
      0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
      0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
      0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
      0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
      0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
      0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
      0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
      0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
      0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
      0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
      0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
      0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
      0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
      0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
      0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
      0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
    };

    unsigned int v; // reverse 32-bit value, 8 bits at time
    unsigned int c; // c will get v reversed

    // Option 1:
    c = (BitReverseTable256[v & 0xff] << 24) |
        (BitReverseTable256[(v >> 8) & 0xff] << 16) |
        (BitReverseTable256[(v >> 16) & 0xff] << 8) |
        (BitReverseTable256[(v >> 24) & 0xff]);

    // Option 2:
    unsigned char * p = (unsigned char *) &v;
    unsigned char * q = (unsigned char *) &c;
    q[3] = BitReverseTable256[p[0]];
    q[2] = BitReverseTable256[p[1]];
    q[1] = BitReverseTable256[p[2]];
    q[0] = BitReverseTable256[p[3]];

You can extend this idea to 64-bit ints, or trade off memory for speed (assuming your L1 Data Cache is large enough), and reverse 16 bits at a time with a 64K-entry lookup table.

Others

Simple:

    unsigned int v;         // input bits to be reversed
    unsigned int r = v & 1; // r will be reversed bits of v; first get LSB of v
    int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end

    for (v >>= 1; v; v >>= 1)
    {
      r <<= 1;
      r |= v & 1;
      s--;
    }
    r <<= s; // shift when v's highest bits are zero

Faster (32-bit processor):

    unsigned char b = x;
    b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;

Faster (64-bit processor):

    unsigned char b; // reverse this (8-bit) byte
    b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;

If you want to do this on a 32-bit int, just reverse the bits in each byte, and reverse the order of the bytes. That is:

    unsigned int toReverse;
    unsigned int reversed;
    unsigned char inByte0 = (toReverse & 0xFF);
    unsigned char inByte1 = (toReverse & 0xFF00) >> 8;
    unsigned char inByte2 = (toReverse & 0xFF0000) >> 16;
    unsigned char inByte3 = (toReverse & 0xFF000000) >> 24;
    reversed = (reverseBits(inByte0) << 24) | (reverseBits(inByte1) << 16) | (reverseBits(inByte2) << 8) | (reverseBits(inByte3));

Results

I benchmarked the two most promising solutions, the lookup table and bitwise-AND (the first one). The test machine is a laptop with 4GB of DDR2-800 and a Core 2 Duo T7500 @ 2.4GHz, 4MB L2 Cache; YMMV. I used gcc 4.3.2 on 64-bit Linux. OpenMP (and the GCC bindings) were used for high-resolution timers.

reverse.c

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    unsigned int
    reverse(register unsigned int x)
    {
        x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
        x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
        x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
        x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
        return ((x >> 16) | (x << 16));
    }

    int main()
    {
        unsigned int *ints = malloc(100000000*sizeof(unsigned int));
        unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
        for(unsigned int i = 0; i < 100000000; i++)
          ints[i] = rand();

        unsigned int *inptr = ints;
        unsigned int *outptr = ints2;
        unsigned int *endptr = ints + 100000000;
        // Starting the time measurement
        double start = omp_get_wtime();
        // Computations to be measured
        while(inptr != endptr)
        {
          (*outptr) = reverse(*inptr);
          inptr++;
          outptr++;
        }
        // Measuring the elapsed time
        double end = omp_get_wtime();
        // Time calculation (in seconds)
        printf("Time: %f seconds\n", end-start);

        free(ints);
        free(ints2);

        return 0;
    }

reverse_lookup.c

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>

    static const unsigned char BitReverseTable256[] =
    {
      0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0, 0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0,
      0x08, 0x88, 0x48, 0xC8, 0x28, 0xA8, 0x68, 0xE8, 0x18, 0x98, 0x58, 0xD8, 0x38, 0xB8, 0x78, 0xF8,
      0x04, 0x84, 0x44, 0xC4, 0x24, 0xA4, 0x64, 0xE4, 0x14, 0x94, 0x54, 0xD4, 0x34, 0xB4, 0x74, 0xF4,
      0x0C, 0x8C, 0x4C, 0xCC, 0x2C, 0xAC, 0x6C, 0xEC, 0x1C, 0x9C, 0x5C, 0xDC, 0x3C, 0xBC, 0x7C, 0xFC,
      0x02, 0x82, 0x42, 0xC2, 0x22, 0xA2, 0x62, 0xE2, 0x12, 0x92, 0x52, 0xD2, 0x32, 0xB2, 0x72, 0xF2,
      0x0A, 0x8A, 0x4A, 0xCA, 0x2A, 0xAA, 0x6A, 0xEA, 0x1A, 0x9A, 0x5A, 0xDA, 0x3A, 0xBA, 0x7A, 0xFA,
      0x06, 0x86, 0x46, 0xC6, 0x26, 0xA6, 0x66, 0xE6, 0x16, 0x96, 0x56, 0xD6, 0x36, 0xB6, 0x76, 0xF6,
      0x0E, 0x8E, 0x4E, 0xCE, 0x2E, 0xAE, 0x6E, 0xEE, 0x1E, 0x9E, 0x5E, 0xDE, 0x3E, 0xBE, 0x7E, 0xFE,
      0x01, 0x81, 0x41, 0xC1, 0x21, 0xA1, 0x61, 0xE1, 0x11, 0x91, 0x51, 0xD1, 0x31, 0xB1, 0x71, 0xF1,
      0x09, 0x89, 0x49, 0xC9, 0x29, 0xA9, 0x69, 0xE9, 0x19, 0x99, 0x59, 0xD9, 0x39, 0xB9, 0x79, 0xF9,
      0x05, 0x85, 0x45, 0xC5, 0x25, 0xA5, 0x65, 0xE5, 0x15, 0x95, 0x55, 0xD5, 0x35, 0xB5, 0x75, 0xF5,
      0x0D, 0x8D, 0x4D, 0xCD, 0x2D, 0xAD, 0x6D, 0xED, 0x1D, 0x9D, 0x5D, 0xDD, 0x3D, 0xBD, 0x7D, 0xFD,
      0x03, 0x83, 0x43, 0xC3, 0x23, 0xA3, 0x63, 0xE3, 0x13, 0x93, 0x53, 0xD3, 0x33, 0xB3, 0x73, 0xF3,
      0x0B, 0x8B, 0x4B, 0xCB, 0x2B, 0xAB, 0x6B, 0xEB, 0x1B, 0x9B, 0x5B, 0xDB, 0x3B, 0xBB, 0x7B, 0xFB,
      0x07, 0x87, 0x47, 0xC7, 0x27, 0xA7, 0x67, 0xE7, 0x17, 0x97, 0x57, 0xD7, 0x37, 0xB7, 0x77, 0xF7,
      0x0F, 0x8F, 0x4F, 0xCF, 0x2F, 0xAF, 0x6F, 0xEF, 0x1F, 0x9F, 0x5F, 0xDF, 0x3F, 0xBF, 0x7F, 0xFF
    };

    int main()
    {
        unsigned int *ints = malloc(100000000*sizeof(unsigned int));
        unsigned int *ints2 = malloc(100000000*sizeof(unsigned int));
        for(unsigned int i = 0; i < 100000000; i++)
          ints[i] = rand();

        unsigned int *inptr = ints;
        unsigned int *outptr = ints2;
        unsigned int *endptr = ints + 100000000;
        // Starting the time measurement
        double start = omp_get_wtime();
        // Computations to be measured
        while(inptr != endptr)
        {
          unsigned int in = *inptr;

          // Option 1:
          //*outptr = (BitReverseTable256[in & 0xff] << 24) |
          //    (BitReverseTable256[(in >> 8) & 0xff] << 16) |
          //    (BitReverseTable256[(in >> 16) & 0xff] << 8) |
          //    (BitReverseTable256[(in >> 24) & 0xff]);

          // Option 2:
          unsigned char * p = (unsigned char *) &(*inptr);
          unsigned char * q = (unsigned char *) &(*outptr);
          q[3] = BitReverseTable256[p[0]];
          q[2] = BitReverseTable256[p[1]];
          q[1] = BitReverseTable256[p[2]];
          q[0] = BitReverseTable256[p[3]];

          inptr++;
          outptr++;
        }
        // Measuring the elapsed time
        double end = omp_get_wtime();
        // Time calculation (in seconds)
        printf("Time: %f seconds\n", end-start);

        free(ints);
        free(ints2);

        return 0;
    }

I tried both approaches at several different optimization levels, ran 3 trials at each level, and each trial reversed 100 million random unsigned ints. For the lookup table option, I tried both schemes (options 1 and 2) given on the bitwise hacks page. Results are shown below.

Bitwise AND

    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse reverse.c
    mrj10@mjlap:~/code$ ./reverse
    Time: 2.000593 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 1.938893 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 1.936365 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse reverse.c
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.942709 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.991104 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.947203 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse reverse.c
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.922639 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.892372 seconds
    mrj10@mjlap:~/code$ ./reverse
    Time: 0.891688 seconds

Lookup Table (option 1)

    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.201127 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.196129 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.235972 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.633042 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.655880 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.633390 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.652322 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.631739 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 0.652431 seconds

Lookup Table (option 2)

    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.671537 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.688173 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.664662 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O2 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.049851 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.048403 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.085086 seconds
    mrj10@mjlap:~/code$ gcc -fopenmp -std=c99 -O3 -o reverse_lookup reverse_lookup.c
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.082223 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.053431 seconds
    mrj10@mjlap:~/code$ ./reverse_lookup
    Time: 1.081224 seconds

Conclusion

Use the lookup table with option 1 (byte addressing is unsurprisingly slow) if you're concerned about performance. If you need to squeeze every last byte of memory out of your system (and you might, if you care about the performance of bit reversal), the optimized versions of the bitwise-AND approach aren't too shabby either.

Caveat

Yes, I know the benchmark code is a complete hack. Suggestions on how to improve it are more than welcome. Things I know about:

- I don't have access to ICC. This may be faster (please respond in a comment if you can test this out).
- A 64K lookup table may do well on some modern microarchitectures with large L1D.
- -mtune=native didn't work for -O2/-O3 (ld blew up with some crazy symbol redefinition error), so I don't believe the generated code is tuned for my microarchitecture.
- There may be a way to do this slightly faster with SSE. I have no idea how, but with fast replication, packed bitwise AND, and swizzling instructions, there's got to be something there.
- I know only enough x86 assembly to be dangerous; here's the code GCC generated on -O3 for option 1, so somebody more knowledgeable than myself can check it out:

32-bit

    .L3:
    movl    (%r12,%rsi), %ecx
    movzbl  %cl, %eax
    movzbl  BitReverseTable256(%rax), %edx
    movl    %ecx, %eax
    shrl    $24, %eax
    mov     %eax, %eax
    movzbl  BitReverseTable256(%rax), %eax
    sall    $24, %edx
    orl     %eax, %edx
    movzbl  %ch, %eax
    shrl    $16, %ecx
    movzbl  BitReverseTable256(%rax), %eax
    movzbl  %cl, %ecx
    sall    $16, %eax
    orl     %eax, %edx
    movzbl  BitReverseTable256(%rcx), %eax
    sall    $8, %eax
    orl     %eax, %edx
    movl    %edx, (%r13,%rsi)
    addq    $4, %rsi
    cmpq    $400000000, %rsi
    jne     .L3

EDIT: I also tried using uint64_t types on my machine to see if there was any performance boost. Performance was about 10% faster than 32-bit, and was nearly identical whether you were just using 64-bit types to reverse bits on two 32-bit int types at a time, or whether you were actually reversing bits in half as many 64-bit values. The assembly code is shown below (for the former case, reversing bits for two 32-bit int types at a time):

    .L3:
    movq    (%r12,%rsi), %rdx
    movq    %rdx, %rax
    shrq    $24, %rax
    andl    $255, %eax
    movzbl  BitReverseTable256(%rax), %ecx
    movzbq  %dl,%rax
    movzbl  BitReverseTable256(%rax), %eax
    salq    $24, %rax
    orq     %rax, %rcx
    movq    %rdx, %rax
    shrq    $56, %rax
    movzbl  BitReverseTable256(%rax), %eax
    salq    $32, %rax
    orq     %rax, %rcx
    movzbl  %dh, %eax
    shrq    $16, %rdx
    movzbl  BitReverseTable256(%rax), %eax
    salq    $16, %rax
    orq     %rax, %rcx
    movzbq  %dl,%rax
    shrq    $16, %rdx
    movzbl  BitReverseTable256(%rax), %eax
    salq    $8, %rax
    orq     %rax, %rcx
    movzbq  %dl,%rax
    shrq    $8, %rdx
    movzbl  BitReverseTable256(%rax), %eax
    salq    $56, %rax
    orq     %rax, %rcx
    movzbq  %dl,%rax
    shrq    $8, %rdx
    movzbl  BitReverseTable256(%rax), %eax
    andl    $255, %edx
    salq    $48, %rax
    orq     %rax, %rcx
    movzbl  BitReverseTable256(%rdx), %eax
    salq    $40, %rax
    orq     %rax, %rcx
    movq    %rcx, (%r13,%rsi)
    addq    $8, %rsi
    cmpq    $400000000, %rsi
    jne     .L3
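
The remark about extending the lookup idea to 64-bit ints can be sketched as follows. This is only an illustration under stated assumptions: the names `rev_table`, `init_rev_table` and `reverse64_lookup` are hypothetical, and the table is generated at startup instead of being typed out (it holds the same values as BitReverseTable256). It has not been benchmarked against the versions above.

```c
#include <stdint.h>

/* Hypothetical names; same contents as BitReverseTable256 once filled. */
static unsigned char rev_table[256];

/* Fill the table: entry i is i with its 8 bits reversed. */
static void init_rev_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (i & (1u << bit))
                r |= 0x80u >> bit;   /* bit k of i becomes bit 7-k of r */
        }
        rev_table[i] = (unsigned char)r;
    }
}

/* Reverse 64 bits: reverse each byte via the table and mirror the byte order. */
static uint64_t reverse64_lookup(uint64_t v)
{
    uint64_t c = 0;
    for (int byte = 0; byte < 8; byte++) {
        c = (c << 8) | rev_table[v & 0xff];  /* lowest byte of v ends up highest in c */
        v >>= 8;
    }
    return c;
}
```

`init_rev_table()` must be called once before the first call to `reverse64_lookup()`.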

User · Answer

I was curious how fast the obvious raw rotation would be. On my machine (i7 2600), the average for 1,500,150,000 iterations was 27.28 ns (over a random set of 131,071 64-bit integers).

Advantages: the amount of memory needed is small and the code is simple. I would say it is not that large, either. The time required is predictable and constant for any input (128 arithmetic SHIFT operations + 64 logical AND operations + 64 logical OR operations).

I compared this to the best time obtained by @Matt J, who has the accepted answer. If I read his answer correctly, the best he got was 0.631739 seconds for 100,000,000 32-bit values, which works out to an average of about 6.3 ns per reversal.

The code snippet I used is this one below:

    unsigned long long reverse_long(unsigned long long x)
    {
        return (((x >> 0) & 1) << 63) |
               (((x >> 1) & 1) << 62) |
               (((x >> 2) & 1) << 61) |
               (((x >> 3) & 1) << 60) |
               (((x >> 4) & 1) << 59) |
               (((x >> 5) & 1) << 58) |
               (((x >> 6) & 1) << 57) |
               (((x >> 7) & 1) << 56) |
               (((x >> 8) & 1) << 55) |
               (((x >> 9) & 1) << 54) |
               (((x >> 10) & 1) << 53) |
               (((x >> 11) & 1) << 52) |
               (((x >> 12) & 1) << 51) |
               (((x >> 13) & 1) << 50) |
               (((x >> 14) & 1) << 49) |
               (((x >> 15) & 1) << 48) |
               (((x >> 16) & 1) << 47) |
               (((x >> 17) & 1) << 46) |
               (((x >> 18) & 1) << 45) |
               (((x >> 19) & 1) << 44) |
               (((x >> 20) & 1) << 43) |
               (((x >> 21) & 1) << 42) |
               (((x >> 22) & 1) << 41) |
               (((x >> 23) & 1) << 40) |
               (((x >> 24) & 1) << 39) |
               (((x >> 25) & 1) << 38) |
               (((x >> 26) & 1) << 37) |
               (((x >> 27) & 1) << 36) |
               (((x >> 28) & 1) << 35) |
               (((x >> 29) & 1) << 34) |
               (((x >> 30) & 1) << 33) |
               (((x >> 31) & 1) << 32) |
               (((x >> 32) & 1) << 31) |
               (((x >> 33) & 1) << 30) |
               (((x >> 34) & 1) << 29) |
               (((x >> 35) & 1) << 28) |
               (((x >> 36) & 1) << 27) |
               (((x >> 37) & 1) << 26) |
               (((x >> 38) & 1) << 25) |
               (((x >> 39) & 1) << 24) |
               (((x >> 40) & 1) << 23) |
               (((x >> 41) & 1) << 22) |
               (((x >> 42) & 1) << 21) |
               (((x >> 43) & 1) << 20) |
               (((x >> 44) & 1) << 19) |
               (((x >> 45) & 1) << 18) |
               (((x >> 46) & 1) << 17) |
               (((x >> 47) & 1) << 16) |
               (((x >> 48) & 1) << 15) |
               (((x >> 49) & 1) << 14) |
               (((x >> 50) & 1) << 13) |
               (((x >> 51) & 1) << 12) |
               (((x >> 52) & 1) << 11) |
               (((x >> 53) & 1) << 10) |
               (((x >> 54) & 1) << 9) |
               (((x >> 55) & 1) << 8) |
               (((x >> 56) & 1) << 7) |
               (((x >> 57) & 1) << 6) |
               (((x >> 58) & 1) << 5) |
               (((x >> 59) & 1) << 4) |
               (((x >> 60) & 1) << 3) |
               (((x >> 61) & 1) << 2) |
               (((x >> 62) & 1) << 1) |
               (((x >> 63) & 1) << 0);
    }
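
For readers who want the same mapping without the 64 hand-written terms, a loop form might look like the sketch below. The name `reverse_long_loop` is hypothetical and not part of the answer; an optimizing compiler may or may not unroll it into something comparable to the expression above, so the timing quoted here does not carry over.

```c
#include <stdint.h>

/* Hypothetical helper: moves bit i of x to bit 63-i, one bit per iteration. */
static uint64_t reverse_long_loop(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        r |= ((x >> i) & 1u) << (63 - i);
    }
    return r;
}
```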

User · Answer

Bit reversal in pseudo code:

    source      -> byte to be reversed, e.g. b00101100
    destination -> reversed; must be of unsigned type so the sign bit is not propagated down

    copy source into a temporary so the original is unaffected; the copy must also
    be of unsigned type so that the sign bit is not shifted in automatically

    bytecopy = b00101100
    reversed = b00000000

    LOOP8:  (do this 8 times)
        shift reversed right 1 place:    reversed = reversed >> 1
        test whether the msb (bit 8) of bytecopy is set
            if set:  set bit 8 (msb) of reversed:    reversed = reversed | b10000000
            else:    do not set bit 8
        shift bytecopy left 1 place:     bytecopy = bytecopy << 1
    8 times done? no: go back up to LOOP8; yes: done

The right shift of reversed has to come before the bit is set; otherwise the first bit copied would be shifted out after the eighth pass.
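
One way to express this loop in C is sketched below, with `reversed` shifted right before its top bit is set so that the first bit copied ends up in the least significant position. The function name `reverse_byte_msb_first` is hypothetical.

```c
/* Hypothetical name; walks the source from its MSB and feeds bits into
   the MSB of the result, shifting the result right on each pass. */
static unsigned char reverse_byte_msb_first(unsigned char source)
{
    unsigned char bytecopy = source;  /* work on a copy, keep the original */
    unsigned char reversed = 0;
    for (int i = 0; i < 8; i++) {
        reversed >>= 1;               /* make room at the top */
        if (bytecopy & 0x80)          /* is the current MSB set? */
            reversed |= 0x80;
        bytecopy <<= 1;               /* bring the next bit to the MSB */
    }
    return reversed;
}
```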

User · Answer

This is another solution for folks who love recursion.

The idea is simple: divide the input in half and swap the two halves, then continue until you reach a single bit.

Illustrated in the example below.

    Ex: If Input is 00101010   ==> Expected output is 01010100

    1. Divide the input into 2 halves
        0010 --- 1010

    2. Swap the 2 halves
        1010     0010

    3. Repeat the same for each half.
        10 -- 10 ---  00 -- 10
        10    10      10    00

        1-0 -- 1-0 --- 1-0 -- 0-0
        0 1    0 1     0 1    0 0

    Done! Output is 01010100

Here is a recursive function to solve it. (Note I have used unsigned ints, so it can work for inputs up to sizeof(unsigned int)*8 bits.)

    #include <stdio.h>

    /* The recursive function takes 2 parameters - the value whose bits need
       to be reversed and the number of bits in the value */
    unsigned int reverse_bits_recursive(unsigned int num, unsigned int numBits)
    {
        unsigned int reversedNum;
        unsigned int mask = 0;

        mask = (0x1 << (numBits/2)) - 1;

        if (numBits == 1) return num;
        reversedNum = reverse_bits_recursive(num >> numBits/2, numBits/2) |
                      reverse_bits_recursive((num & mask), numBits/2) << numBits/2;
        return reversedNum;
    }

    int main()
    {
        unsigned int reversedNum;
        unsigned int num;

        num = 0x55;
        reversedNum = reverse_bits_recursive(num, 8);
        printf("Bit Reversal Input = 0x%x Output = 0x%x\n", num, reversedNum);

        num = 0xabcd;
        reversedNum = reverse_bits_recursive(num, 16);
        printf("Bit Reversal Input = 0x%x Output = 0x%x\n", num, reversedNum);

        num = 0x123456;
        reversedNum = reverse_bits_recursive(num, 24);
        printf("Bit Reversal Input = 0x%x Output = 0x%x\n", num, reversedNum);

        num = 0x11223344;
        reversedNum = reverse_bits_recursive(num, 32);
        printf("Bit Reversal Input = 0x%x Output = 0x%x\n", num, reversedNum);
    }

This is the output:

    Bit Reversal Input = 0x55 Output = 0xaa
    Bit Reversal Input = 0xabcd Output = 0xb3d5
    Bit Reversal Input = 0x123456 Output = 0x651690
    Bit Reversal Input = 0x11223344 Output = 0x22cc4488

Note that the halving only splits evenly when numBits is a power of two. The 8-, 16- and 32-bit results above are true reversals, but the 24-bit one is not: 24 eventually halves to 3, which numBits/2 cannot split evenly, and the true 24-bit reversal of 0x123456 is 0x6a2c48.
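
The halving only splits evenly when numBits is a power of two, so widths such as 24 need the two halves to be allowed different sizes. A minimal sketch of that variation follows; the name `reverse_bits_any_width` is hypothetical and this is an adaptation, not the answer's own function.

```c
/* Hypothetical variant of the recursive idea: split into a low part of
   numBits/2 bits and a high part holding the remaining bits, reverse each,
   and swap their positions. Assumes numBits <= 32 and a 32-bit unsigned int. */
static unsigned int reverse_bits_any_width(unsigned int num, unsigned int numBits)
{
    if (numBits == 0)
        return 0;
    if (numBits == 1)
        return num & 1u;                        /* ignore any stray upper bits */

    unsigned int lowBits  = numBits / 2;
    unsigned int highBits = numBits - lowBits;  /* larger half when numBits is odd */
    unsigned int low  = num & ((1u << lowBits) - 1u);
    unsigned int high = num >> lowBits;

    /* The reversed low part moves to the top, the reversed high part to the bottom. */
    return (reverse_bits_any_width(low, lowBits) << highBits)
         | reverse_bits_any_width(high, highBits);
}
```

With this split, `reverse_bits_any_width(0x123456, 24)` yields 0x6a2c48, the true 24-bit reversal.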

User · Answer

Of course the obvious source of bit-twiddling hacks is here: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious

User · Answer

Anders Cedronius's answer provides a great solution for people that have an x86 CPU with AVX2 support. For x86 platforms without AVX support or non-x86 platforms, either of the following implementations should work well.

The first code is a variant of the classic binary partitioning method, coded to maximize the use of the shift-plus-logic idiom useful on various ARM processors. In addition, it uses on-the-fly mask generation, which could be beneficial for RISC processors that otherwise require multiple instructions to load each 32-bit mask value. Compilers for x86 platforms should use constant propagation to compute all masks at compile time rather than run time.

    /* Classic binary partitioning algorithm */
    inline uint32_t brev_classic (uint32_t a)
    {
        uint32_t m;
        a = (a >> 16) | (a << 16);                            // swap halfwords
        m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
        m = m^(m << 4); a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
        m = m^(m << 2); a = ((a >> 2) & m) | ((a << 2) & ~m);
        m = m^(m << 1); a = ((a >> 1) & m) | ((a << 1) & ~m);
        return a;
    }

In volume 4A of "The Art of Computer Programming", D. Knuth shows clever ways of reversing bits that somewhat surprisingly require fewer operations than the classical binary partitioning algorithms. One such algorithm for 32-bit operands, which I cannot find in TAOCP, is shown in this document on the Hacker's Delight website.

    /* Knuth's algorithm from http://www.hackersdelight.org/revisions.pdf. Retrieved 8/19/2015 */
    inline uint32_t brev_knuth (uint32_t a)
    {
        uint32_t t;
        a = (a << 15) | (a >> 17);
        t = (a ^ (a >> 10)) & 0x003f801f;
        a = (t + (t << 10)) ^ a;
        t = (a ^ (a >>  4)) & 0x0e038421;
        a = (t + (t <<  4)) ^ a;
        t = (a ^ (a >>  2)) & 0x22488842;
        a = (t + (t <<  2)) ^ a;
        return a;
    }

Using the Intel C/C++ compiler 13.1.3.198, both of the above functions auto-vectorize nicely, targeting XMM registers. They could also be vectorized manually without a lot of effort.

On my IvyBridge Xeon E3 1270v2, using the auto-vectorized code, 100 million uint32_t words were bit-reversed in 0.070 seconds using brev_classic(), and 0.068 seconds using brev_knuth(). I took care to ensure that my benchmark was not limited by system memory bandwidth.
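
The on-the-fly mask generation carries over to wider operands. The sketch below is an assumed 64-bit extension of the classic partitioning variant, not something taken from the answer; the name `brev_classic64` is hypothetical and it has not been timed.

```c
#include <stdint.h>

/* Hypothetical 64-bit extension: each mask is derived from the previous
   one with a shift and an XOR, so only one constant is loaded. */
static inline uint64_t brev_classic64(uint64_t a)
{
    uint64_t m;
    a = (a >> 32) | (a << 32);                                      /* swap words     */
    m = 0x0000ffff0000ffffULL; a = ((a >> 16) & m) | ((a << 16) & ~m); /* swap halfwords */
    m = m ^ (m << 8);          a = ((a >> 8) & m) | ((a << 8) & ~m);   /* swap bytes     */
    m = m ^ (m << 4);          a = ((a >> 4) & m) | ((a << 4) & ~m);   /* swap nibbles   */
    m = m ^ (m << 2);          a = ((a >> 2) & m) | ((a << 2) & ~m);   /* swap bit pairs */
    m = m ^ (m << 1);          a = ((a >> 1) & m) | ((a << 1) & ~m);   /* swap bits      */
    return a;
}
```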

User · Answer

I think the simplest method I know follows. MSB is the input and LSB is the 'reversed' output:

    unsigned char rev(unsigned char MSB) {
        unsigned char LSB = 0;   // for output
        for (int i = 0; i < 8; i++) {
            LSB = LSB << 1;
            if (MSB & 1) LSB = LSB | 1;
            MSB = MSB >> 1;
        }
        return LSB;
    }

    //    It works by rotating bytes in opposite directions.
    //    Just repeat for each byte.

User · Answer

Presuming that you have an array of bits, how about this:

1. Starting from the MSB, push bits into a stack one by one.
2. Pop bits from this stack into another array (or the same array if you want to save space), placing the first popped bit into the MSB and going on to less significant bits from there.

    Stack stack = new Stack();
    Bit[] bits = new Bit[] { 0, 0, 1, 0, 0, 0, 0, 0 };

    for (int i = 0; i < bits.Length; i++)
    {
        stack.push(bits[i]);
    }

    for (int i = 0; i < bits.Length; i++)
    {
        bits[i] = stack.pop();
    }

User · Answer

This thread caught my attention since it deals with a simple problem that requires a lot of work (CPU cycles) even for a modern CPU. And one day I also stood there with the same problem. I had to flip millions of bytes. However, I know all my target systems are modern Intel-based, so let's start optimizing to the extreme!

So I used Matt J's lookup code as the base. The system I'm benchmarking on is an i7 Haswell 4700EQ.

Matt J's lookup bitflipping 400 000 000 bytes: around 0.272 seconds.

I then went ahead and tried to see if Intel's ISPC compiler could vectorise the arithmetic in reverse.c.

I'm not going to bore you with my findings here since I tried a lot to help the compiler find stuff; anyhow, I ended up with performance of around 0.15 seconds to bitflip 400 000 000 bytes. It's a great reduction, but for my application that's still way, way too slow.

So, people, let me present the fastest Intel-based bitflipper in the world. Clocked at:

Time to bitflip 400000000 bytes: 0.050082 seconds!

    //
    // Bitflip using AVX2 - The fastest Intel based bitflip in the world!!
    // Made by Anders Cedronius 2014 (anders.cedronius (you know what) gmail.com)
    //

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <omp.h>

    using namespace std;

    #define DISPLAY_HEIGHT  4
    #define DISPLAY_WIDTH   32
    #define NUM_DATA_BYTES  400000000

    // Constants (first we got the mask, then the high order nibble look up table and last we got the low order nibble lookup table)
    __attribute__ ((aligned(32))) static unsigned char k1[32*3]={
        0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,
        0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,
        0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0,0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0
    };

    // The data to be bitflipped (+32 to avoid the quantization out of memory problem)
    __attribute__ ((aligned(32))) static unsigned char data[NUM_DATA_BYTES+32]={};

    extern "C" {
    void bitflipbyte(unsigned char[],unsigned int,unsigned char[]);
    }

    int main()
    {
        for(unsigned int i = 0; i < NUM_DATA_BYTES; i++)
        {
            data[i] = rand();
        }

        printf ("\r\nData in(start):\r\n");
        for (unsigned int j = 0; j < 4; j++)
        {
            for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
            {
                printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
            }
            printf ("\r\n");
        }

        printf ("\r\nNumber of 32-byte chunks to convert: %d\r\n",(unsigned int)ceil(NUM_DATA_BYTES/32.0));

        double start_time = omp_get_wtime();
        bitflipbyte(data,(unsigned int)ceil(NUM_DATA_BYTES/32.0),k1);
        double end_time = omp_get_wtime();

        printf ("\r\nData out:\r\n");
        for (unsigned int j = 0; j < 4; j++)
        {
            for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
            {
                printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
            }
            printf ("\r\n");
        }
        printf("\r\n\r\nTime to bitflip %d bytes: %f seconds\r\n\r\n",NUM_DATA_BYTES, end_time-start_time);

        // return with no errors
        return 0;
    }

The printf's are for debugging.

Here is the workhorse:

    bits 64
    global bitflipbyte

    bitflipbyte:
            vmovdqa     ymm2, [rdx]
            add         rdx, 20h
            vmovdqa     ymm3, [rdx]
            add         rdx, 20h
            vmovdqa     ymm4, [rdx]
    bitflipp_loop:
            vmovdqa     ymm0, [rdi]
            vpand       ymm1, ymm2, ymm0
            vpandn      ymm0, ymm2, ymm0
            vpsrld      ymm0, ymm0, 4h
            vpshufb     ymm1, ymm4, ymm1
            vpshufb     ymm0, ymm3, ymm0
            vpor        ymm0, ymm0, ymm1
            vmovdqa     [rdi], ymm0
            add     rdi, 20h
            dec     rsi
            jnz     bitflipp_loop
            ret

The code takes 32 bytes, then masks out the nibbles. The high nibble gets shifted right by 4. Then I use vpshufb and ymm4 / ymm3 as lookup tables. I could use a single lookup table, but then I would have to shift left before ORing the nibbles together again.

There are even faster ways of flipping the bits. But I'm bound to a single thread and CPU, so this was the fastest I could achieve. Can you make a faster version?

Please make no comments about using the Intel C/C++ Compiler Intrinsic Equivalent commands.
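
The nibble-lookup scheme that the assembly applies to 32 bytes at a time can be modelled one byte at a time in plain C. The sketch below is only a scalar illustration of the same two tables, not a substitute for the AVX2 routine and nowhere near its speed; the names `low_nibble_lut`, `high_nibble_lut` and `bitflip_bytes_scalar` are hypothetical.

```c
#include <stddef.h>

/* Indexed by the LOW nibble: its reversal, already placed in the high nibble. */
static const unsigned char low_nibble_lut[16] = {
    0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,
    0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0
};

/* Indexed by the HIGH nibble (after shifting right by 4): its reversal,
   placed in the low nibble. */
static const unsigned char high_nibble_lut[16] = {
    0x00, 0x08, 0x04, 0x0c, 0x02, 0x0a, 0x06, 0x0e,
    0x01, 0x09, 0x05, 0x0d, 0x03, 0x0b, 0x07, 0x0f
};

/* Reverse the bits of every byte in data[0..n-1], in place. */
static void bitflip_bytes_scalar(unsigned char *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char lo = data[i] & 0x0f;  /* mask out the low nibble    */
        unsigned char hi = data[i] >> 4;    /* shift the high nibble down */
        data[i] = low_nibble_lut[lo] | high_nibble_lut[hi];
    }
}
```

Because each table already places its result in the right half of the byte, the two lookups can be ORed together without a further shift, which is the reason the answer gives for using two tables.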

User · Answer

This is for 32 bits; the size needs to change if we consider 8 bits.

    void bitReverse(int num)
    {
        unsigned int num_reverse = 0;   /* unsigned so that bit 31 can be set safely */
        int size = (sizeof(int) * 8) - 1;
        int i = 0, j = 0;
        for (i = 0, j = size; i <= size && j >= 0; i++, j--)
        {
            if ((num >> i) & 1)
            {
                num_reverse = (num_reverse | (1u << j));
            }
        }
        printf("\n rev num = %u\n", num_reverse);
    }

It reads the input integer "num" in LSB->MSB order and stores it in num_reverse in MSB->LSB order.

User · Answer

Generic C code. Using 1 byte of input data, num, as an example:

    unsigned char num = 0xaa;   // 1010 1010 (aa) -> 0101 0101 (55)

    int s = sizeof(num) * 8;    // get number of bits
    int i, x, y, p;
    int var = 0;                // make var data type to be equal or larger than num

    for (i = 0; i < (s / 2); i++) {
        // extract bit on the left, from MSB
        p = s - i - 1;
        x = num & (1 << p);
        x = x >> p;
        printf("x: %d\n", x);

        // extract bit on the right, from LSB
        y = num & (1 << i);
        y = y >> i;
        printf("y: %d\n", y);

        var = var | (x << i);       // apply x
        var = var | (y << p);       // apply y
    }

    printf("new: 0x%x\n", var);

User · Answer

The native ARM instruction "rbit" can do it in 1 CPU cycle with 1 extra CPU register. Impossible to beat.

User · Answer

Well, this certainly won't be an answer like Matt J's, but hopefully it will still be useful.

    size_t reverse(size_t n, unsigned int bytes)
    {
        __asm__("BSWAP %0" : "=r"(n) : "0"(n));
        n >>= ((sizeof(size_t) - bytes) * 8);
        n = ((n & 0xaaaaaaaaaaaaaaaa) >> 1) | ((n & 0x5555555555555555) << 1);
        n = ((n & 0xcccccccccccccccc) >> 2) | ((n & 0x3333333333333333) << 2);
        n = ((n & 0xf0f0f0f0f0f0f0f0) >> 4) | ((n & 0x0f0f0f0f0f0f0f0f) << 4);
        return n;
    }

This is exactly the same idea as Matt's best algorithm, except that there's this little instruction called BSWAP which swaps the bytes (not the bits) of a 64-bit number. So b7,b6,b5,b4,b3,b2,b1,b0 becomes b0,b1,b2,b3,b4,b5,b6,b7. Since we are working with a 32-bit number, we need to shift our byte-swapped number down 32 bits. This just leaves us with the task of swapping the 8 bits of each byte, which is done, and voila, we're done.

Timing: on my machine, Matt's algorithm ran in ~0.52 seconds per trial. Mine ran in about 0.42 seconds per trial. 20% faster is not bad, I think.

If you're worried about the availability of BSWAP, Wikipedia lists it as being added with the 80486, which came out in 1989. Wikipedia also states that this instruction only works on 32-bit registers, which is clearly not the case on my machine: it very much works on 64-bit registers.

This method works equally well for any integral datatype, which is why the function takes the number of bytes desired as its second parameter. It can be called like:

    n = reverse(n, sizeof(char));   // only reverse 8 bits
    n = reverse(n, sizeof(short));  // reverse 16 bits
    n = reverse(n, sizeof(int));    // reverse 32 bits
    n = reverse(n, sizeof(size_t)); // reverse 64 bits

The compiler should be able to optimize the extra parameter away (assuming the compiler inlines the function), and for the sizeof(size_t) case the right-shift would be removed completely. Note that GCC at least is not able to remove the BSWAP and right-shift if passed sizeof(char).
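For readers who cannot use inline assembly, the same byte-swap-then-fix-up idea can probably be expressed with the `__builtin_bswap64` builtin that GCC and Clang provide. The following is only a sketch under stated assumptions, not the answer's original code: the name `reverse_low_bytes` is made up for illustration, `uint64_t` stands in for `size_t`, and `bytes` is assumed to be between 1 and 8.

```c
#include <stdint.h>

/* Hypothetical helper: reverses the bits of the low `bytes` bytes of n.
 * Assumes GCC or Clang (__builtin_bswap64) and 1 <= bytes <= 8;
 * bytes == 0 would shift by 64, which is undefined behaviour. */
uint64_t reverse_low_bytes(uint64_t n, unsigned int bytes)
{
    /* Reverse the byte order of the whole 64-bit value. */
    n = __builtin_bswap64(n);
    /* Bring the bytes of interest back down to the low end. */
    n >>= (8u - bytes) * 8u;
    /* Reverse the bits inside every byte: swap neighbours, pairs, nibbles. */
    n = ((n & 0xaaaaaaaaaaaaaaaaULL) >> 1) | ((n & 0x5555555555555555ULL) << 1);
    n = ((n & 0xccccccccccccccccULL) >> 2) | ((n & 0x3333333333333333ULL) << 2);
    n = ((n & 0xf0f0f0f0f0f0f0f0ULL) >> 4) | ((n & 0x0f0f0f0f0f0f0f0fULL) << 4);
    return n;
}
```

With this sketch, `reverse_low_bytes(0x20, 1)` gives `0x04`, which matches the example in the question.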

User · Answer

You might want to use the standard template library. It might be slower than the code mentioned above. However, it seems to me clearer and easier to understand.

    #include <bitset>
    #include <iostream>

    template<size_t N>
    const std::bitset<N> reverse(const std::bitset<N>& ordered)
    {
        std::bitset<N> reversed;
        for (size_t i = 0, j = N - 1; i < N; ++i, --j)
            reversed[j] = ordered[i];
        return reversed;
    }

    // test the function
    int main()
    {
        unsigned long num;
        const size_t N = sizeof(num) * 8;

        std::cin >> num;
        std::cout << std::showbase << std::hex;
        std::cout << "ordered  = " << num << std::endl;
        std::cout << "reversed = " << reverse<N>(num).to_ulong() << std::endl;
        std::cout << "double reversed = " << reverse<N>(reverse<N>(num)).to_ulong() << std::endl;
    }

User · Answer

Well, this is basically the same as the first "reverse()", but it is 64-bit and only needs one immediate mask per step to be loaded from the instruction stream. GCC creates code without jumps, so this should be pretty fast.

    #include <stdio.h>

    static unsigned long long swap64(unsigned long long val)
    {
    #define ZZZZ(x,s,m) ((((x) >> (s)) & (m)) | (((x) & (m)) << (s)))
    /* val = (((val) >> 16) & 0xFFFF0000FFFF) | (((val) & 0xFFFF0000FFFF) << 16); */

        val = ZZZZ(val, 32, 0x00000000FFFFFFFFull);
        val = ZZZZ(val, 16, 0x0000FFFF0000FFFFull);
        val = ZZZZ(val, 8,  0x00FF00FF00FF00FFull);
        val = ZZZZ(val, 4,  0x0F0F0F0F0F0F0F0Full);
        val = ZZZZ(val, 2,  0x3333333333333333ull);
        val = ZZZZ(val, 1,  0x5555555555555555ull);

        return val;
    #undef ZZZZ
    }

    int main(void)
    {
        unsigned long long val, aaaa[16] = {
            0xfedcba9876543210, 0xedcba9876543210f, 0xdcba9876543210fe, 0xcba9876543210fed,
            0xba9876543210fedc, 0xa9876543210fedcb, 0x9876543210fedcba, 0x876543210fedcba9,
            0x76543210fedcba98, 0x6543210fedcba987, 0x543210fedcba9876, 0x43210fedcba98765,
            0x3210fedcba987654, 0x210fedcba9876543, 0x10fedcba98765432, 0x0fedcba987654321
        };
        unsigned iii;

        for (iii = 0; iii < 16; iii++) {
            val = swap64(aaaa[iii]);
            printf("A[]=%016llX Sw=%016llx\n", aaaa[iii], val);
        }
        return 0;
    }
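The one-mask-per-step trick described above carries over to 32 bits. The sketch below is an illustration rather than part of the answer: the names `SWAP_GROUPS` and `reverse32_one_mask` are hypothetical, and `uint32_t` from `<stdint.h>` is assumed to be available.

```c
#include <stdint.h>

/* Swap adjacent groups of s bits in x, using one mask m for both halves. */
#define SWAP_GROUPS(x, s, m) ((((x) >> (s)) & (m)) | (((x) & (m)) << (s)))

/* Hypothetical 32-bit counterpart of the 64-bit routine. */
uint32_t reverse32_one_mask(uint32_t val)
{
    val = (val >> 16) | (val << 16);          /* swapping halves needs no mask */
    val = SWAP_GROUPS(val, 8, 0x00FF00FFu);   /* swap bytes within each half */
    val = SWAP_GROUPS(val, 4, 0x0F0F0F0Fu);   /* swap nibbles within each byte */
    val = SWAP_GROUPS(val, 2, 0x33333333u);   /* swap bit pairs */
    val = SWAP_GROUPS(val, 1, 0x55555555u);   /* swap neighbouring bits */
    return val;
}
```

Because the mask is applied after the right shift and before the left shift, each step needs only the low-half mask instead of a mask and its complement.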

User · Answer

A fast implementation with low memory use, based on a 16-entry lookup table:

    private Byte BitReverse(Byte bData)
    {
        Byte[] lookup = { 0, 8,  4, 12,
                          2, 10, 6, 14,
                          1, 9,  5, 13,
                          3, 11, 7, 15 };
        Byte ret_val = (Byte)(((lookup[(bData & 0x0F)]) << 4) + lookup[((bData & 0xF0) >> 4)]);
        return ret_val;
    }
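Since the question asks for C, here is how the same nibble-table idea might look in C. This is a sketch, not the answer's own code: the names `nibble_reversed` and `reverse8_nibble_lut` are hypothetical, and the table simply stores the 4-bit reversal of each index.

```c
#include <stdint.h>

/* nibble_reversed[i] holds the 4-bit reversal of i, e.g. 0001 -> 1000. */
static const uint8_t nibble_reversed[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte: the reversed low nibble becomes the high nibble
 * and the reversed high nibble becomes the low nibble. */
uint8_t reverse8_nibble_lut(uint8_t b)
{
    return (uint8_t)((nibble_reversed[b & 0x0F] << 4) | nibble_reversed[b >> 4]);
}
```

The table costs 16 bytes instead of the 256 bytes of a full byte table, at the price of two lookups per byte.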

User · Answer

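I know it isn't C, but here it is in (16-bit x86) assembly:

    var1 dw 0f0f0h

         clc
         push ax
         push cx
         mov cx, 16
    loop1:
         shl var1, 1      ; the top bit of var1 moves into the carry flag
         rcr ax, 1        ; the carry flag rotates into the top bit of ax
         loop loop1
         mov var1, ax     ; store the reversed value back
         pop cx           ; restore registers in reverse push order
         pop ax

This works with the carry bit, so you may want to save the flags too.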

User · Answer

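My simple solution (pseudocode; it works from both ends at once, so it only needs half as many iterations as there are bits):

    BitReverse(IN)
        OUT = 0x00;
        R = 1;            // Right mask   ...0000.0001
        L = 0;            // Left mask    1000.0000...
        L = ~0;
        L = ~(L >> 1);    // L is unsigned: only the top bit stays set
        int size = sizeof(IN) * 4;  // half the bit size

        while (size--) {
            if (IN & L) OUT = OUT | R;  // start from MSB  1000.xxxx
            if (IN & R) OUT = OUT | L;  // start from LSB  xxxx.0001
            L = L >> 1;
            R = R << 1;
        }
        return OUT;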
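To make the two-mask idea concrete, here is a small runnable sketch for a single byte. It is an interpretation of the pseudocode rather than the author's code: the function name `reverse8_two_masks` is hypothetical, and the width is fixed at 8 bits so the loop runs 4 times.

```c
#include <stdint.h>

/* Walk one mask down from the MSB and one up from the LSB; each step
 * copies the bit under one mask to the position of the other mask. */
uint8_t reverse8_two_masks(uint8_t in)
{
    uint8_t out = 0;
    uint8_t left = 0x80;   /* starts at the most significant bit */
    uint8_t right = 0x01;  /* starts at the least significant bit */
    unsigned steps = 4;    /* half of 8 bits: two bits handled per step */

    while (steps--) {
        if (in & left)
            out |= right;
        if (in & right)
            out |= left;
        left >>= 1;
        right <<= 1;
    }
    return out;
}
```

For the input `0x20` the third step finds the bit under the left mask (`0x20`) and sets the bit under the right mask (`0x04`).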

User · Answer

    #include <bitset>
    #include <climits>
    #include <iostream>
    using namespace std;

    // Purpose: to reverse bits in an unsigned short integer
    // Input: an unsigned short integer whose bits are to be reversed
    // Output: an unsigned short integer with the reversed bits of the input one
    unsigned short ReverseBits(unsigned short a)
    {
        // declare and initialize number of bits in the unsigned short integer
        const char num_bits = sizeof(a) * CHAR_BIT;

        // declare and initialize bitset representation of integer a
        bitset<num_bits> bitset_a(a);

        // declare and initialize bitset representation of integer b (0000000000000000)
        bitset<num_bits> bitset_b(0);

        // declare and initialize bitset representation of mask (0000000000000001)
        bitset<num_bits> mask(1);

        for (char i = 0; i < num_bits; ++i)
        {
            bitset_b = (bitset_b << 1) | (bitset_a & mask);
            bitset_a >>= 1;
        }

        return (unsigned short) bitset_b.to_ulong();
    }

    void PrintBits(unsigned short a)
    {
        // declare and initialize bitset representation of a
        bitset<sizeof(a) * CHAR_BIT> bits(a);

        // print out bits
        cout << bits << endl;
    }

    // Testing the functionality of the code
    int main()
    {
        unsigned short a = 17, b;

        cout << "Original: ";
        PrintBits(a);

        b = ReverseBits(a);

        cout << "Reversed: ";
        PrintBits(b);
    }

Output:

    Original: 0000000000010001
    Reversed: 1000100000000000

User · Answer

This ain't no job for a human... but perfect for a machine.

This is 2015, 6 years from when this question was first asked. Compilers have since become our masters, and our job as humans is only to help them. So what's the best way to give our intentions to the machine?

Bit-reversal is so common that you have to wonder why the x86's ever growing ISA doesn't include an instruction to do it in one go.

The reason: if you give your true concise intent to the compiler, bit reversal should only take ~20 CPU cycles. Let me show you how to craft reverse() and use it:

    #include <inttypes.h>
    #include <stdio.h>

    uint64_t reverse(const uint64_t n,
                     const uint64_t k)
    {
            uint64_t r, i;
            for (r = 0, i = 0; i < k; ++i)
                    r |= ((n >> i) & 1) << (k - i - 1);
            return r;
    }

    int main()
    {
            const uint64_t size = 64;
            uint64_t sum = 0;
            uint64_t a;
            for (a = 0; a < (uint64_t)1 << 30; ++a)
                    sum += reverse(a, size);
            printf("%" PRIu64 "\n", sum);
            return 0;
    }

Compiling this sample program with Clang version >= 3.6, -O3, -march=native (tested with Haswell), gives artwork-quality code using the new AVX2 instructions, with a runtime of 11 seconds processing ~1 billion reverse()s. That's ~10 ns per reverse(); with a 0.5 ns CPU cycle, assuming 2 GHz, that puts us at the sweet 20 CPU cycles.

- You can fit 10 reverse()s in the time it takes to access RAM once for a single large array.
- You can fit 1 reverse() in the time it takes to access an L2 cache LUT twice.

Caveat: this sample code should hold as a decent benchmark for a few years, but it will eventually start to show its age once compilers are smart enough to optimize main() to just printf the final result instead of really computing anything. But for now it works in showcasing reverse().

User · Answer

It seems that many other posts are concerned about speed (i.e. best = fastest). What about simplicity? Consider:

    unsigned char ReverseBits(unsigned char character) {
        unsigned char reversed_character = 0;
        for (int i = 0; i < 8; i++) {
            unsigned char ith_bit = (character >> i) & 1;
            /* bit i moves to position 7 - i */
            reversed_character |= (ith_bit << (8 - 1 - i));
        }
        return reversed_character;
    }

and hope that a clever compiler will optimise it for you.

If you want to reverse a longer list of bits (n bytes, that is, 8 * n bits), you can use this function to get:

    void ReverseNumber(unsigned char* number, int bit_count_in_number) {
        int bytes_occupied = bit_count_in_number / 8;

        // first reverse bytes
        for (int i = 0; i < (bytes_occupied / 2); i++) {
            unsigned char tmp = number[i];
            number[i] = number[bytes_occupied - 1 - i];
            number[bytes_occupied - 1 - i] = tmp;
        }

        // then reverse bits of each individual byte
        for (int i = 0; i < bytes_occupied; i++) {
            number[i] = ReverseBits(number[i]);
        }
    }

This would reverse [10000000, 10101010] into [01010101, 00000001].

User · Answer

Efficient can mean throughput or latency.

For throughput, see the answer by Anders Cedronius; it's a good one.

For lower latency, I would recommend this code:

    uint32_t reverseBits( uint32_t x )
    {
    #if defined(__arm__) || defined(__aarch64__)
        __asm__( "rbit %0, %1" : "=r" ( x ) : "r" ( x ) );
        return x;
    #endif
        // Flip pairwise
        x = ( ( x & 0x55555555 ) << 1 ) | ( ( x & 0xAAAAAAAA ) >> 1 );
        // Flip pairs
        x = ( ( x & 0x33333333 ) << 2 ) | ( ( x & 0xCCCCCCCC ) >> 2 );
        // Flip nibbles
        x = ( ( x & 0x0F0F0F0F ) << 4 ) | ( ( x & 0xF0F0F0F0 ) >> 4 );

        // Flip bytes. CPUs have an instruction for that, pretty fast one.
    #ifdef _MSC_VER
        return _byteswap_ulong( x );
    #elif defined(__INTEL_COMPILER)
        return (uint32_t)_bswap( (int)x );
    #else
        // Assuming gcc or clang
        return __builtin_bswap32( x );
    #endif
    }

Compilers output: https://godbolt.org/z/5ehd89

User · Answer

I think this is one of the simplest ways to reverse the bits. Please let me know if there is any flaw in this logic. Basically, we check the value of the bit at each position and, if it is 1, set the bit at the mirrored position.

    /* ui32 is assumed to be a typedef for a 32-bit unsigned integer */
    void bit_reverse(ui32 *data)
    {
        ui32 temp = 0;
        ui32 i, bit_len;

        for (i = 0, bit_len = 31; i <= bit_len; i++)
        {
            /* unsigned constants avoid shifting a signed 1 into the sign bit */
            temp |= (*data & ((ui32)1 << i)) ? ((ui32)1 << (bit_len - i)) : 0;
        }
        *data = temp;
    }

User · Answer

How about the following:

    uint reverseMSBToLSB32ui(uint input)
    {
        uint output = 0x00000000;
        int places = 0;

        for (int i = 1; i <= 32; i++)
        {
            places = (32 - i);
            // take the bit at position `places` and put it at position i - 1
            output |= ((input >> places) & 1u) << (i - 1);
        }

        return output;
    }

Small and easy, though 32-bit only.

User · Answer

Another loop-based solution that exits quickly when the number is low (in C++, for multiple unsigned types):

    template<class T>
    T reverse_bits(T in) {
        T bit = static_cast<T>(1) << (sizeof(T) * 8 - 1);
        T out;

        for (out = 0; bit && in; bit >>= 1, in >>= 1) {
            if (in & 1) {
                out |= bit;
            }
        }
        return out;
    }

or in C for an unsigned int:

    unsigned int reverse_bits(unsigned int in) {
        unsigned int bit = 1u << (sizeof(in) * 8 - 1);
        unsigned int out;

        for (out = 0; bit && in; bit >>= 1, in >>= 1)
            if (in & 1)
                out |= bit;
        return out;
    }

User · Answer

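    unsigned char ReverseBits(unsigned char data)
    {
        unsigned char k = 0, rev = 0;
        unsigned char n = data;

        while (n)
        {
            k = n & (~(n - 1));   /* isolate the lowest set bit */
            n &= (n - 1);         /* clear the lowest set bit */
            rev |= (128 / k);     /* bit i (value k) maps to bit 7 - i */
        }
        return rev;
    }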
