[c] What integer hash function are good that accepts an integer hash key?

What integer hash function are good that accepts an integer hash key?

This question is related to c algorithm hash

The answer is


There's a nice overview over some hash algorithms at Eternally Confuzzled. I'd recommend Bob Jenkins' one-at-a-time hash which quickly reaches avalanche and therefore can be used for efficient hash table lookup.


This page lists some simple hash functions that tend to decently in general, but any simple hash has pathological cases where it doesn't work well.


Depends on how your data is distributed. For a simple counter, the simplest function

f(i) = i

will be good (I suspect optimal, but I can't prove it).


I have been using splitmix64 (pointed in Thomas Mueller's answer) ever since I found this thread. However, I recently stumbled upon Pelle Evensen's rrxmrrxmsx_0, which yielded tremendously better statistical distribution than the original MurmurHash3 finalizer and its successors (splitmix64 and other mixes). Here is the code snippet in C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
    return (v >> r) | (v << (64 - r));
}

uint64_t rrxmrrxmsx_0(uint64_t v) {
    v ^= ror64(v, 25) ^ ror64(v, 50);
    v *= 0xA24BAED4963EE407UL;
    v ^= ror64(v, 24) ^ ror64(v, 49);
    v *= 0x9FB21C651E98DF25UL;
    return v ^ v >> 28;
}

Pelle also provides an in-depth analysis of the 64-bit mixer used in the final step of MurmurHash3 and the more recent variants.


The answer depends on a lot of things like:

  • Where do you intend to employ it?
  • What are you trying to do with the hash?
  • Do you need a crytographically secure hash function?

I suggest that you take a look at the Merkle-Damgard family of hash functions like SHA-1 etc


For random hash values, some engineers said golden ratio prime number(2654435761) is a bad choice, with my testing results, I found that it's not true; instead, 2654435761 distributes the hash values pretty good.

#define MCR_HashTableSize 2^10

unsigned int
Hash_UInt_GRPrimeNumber(unsigned int key)
{
  key = key*2654435761 & (MCR_HashTableSize - 1)
  return key;
}

The hash table size must be a power of two.

I have written a test program to evaluate many hash functions for integers, the results show that GRPrimeNumber is a pretty good choice.

I have tried:

  1. total_data_entry_number / total_bucket_number = 2, 3, 4; where total_bucket_number = hash table size;
  2. map hash value domain into bucket index domain; that is, convert hash value into bucket index by Logical And Operation with (hash_table_size - 1), as shown in Hash_UInt_GRPrimeNumber();
  3. calculate the collision number of each bucket;
  4. record the bucket that has not been mapped, that is, an empty bucket;
  5. find out the max collision number of all buckets; that is, the longest chain length;

With my testing results, I found that Golden Ratio Prime Number always has the fewer empty buckets or zero empty bucket and the shortest collision chain length.

Some hash functions for integers are claimed to be good, but the testing results show that when the total_data_entry / total_bucket_number = 3, the longest chain length is bigger than 10(max collision number > 10), and many buckets are not mapped(empty buckets), which is very bad, compared with the result of zero empty bucket and longest chain length 3 by Golden Ratio Prime Number Hashing.

BTW, with my testing results, I found one version of shifting-xor hash functions is pretty good(It's shared by mikera).

unsigned int Hash_UInt_M3(unsigned int key)
{
  key ^= (key << 13);
  key ^= (key >> 17);    
  key ^= (key << 5); 
  return key;
}

I found the following algorithm provides a very good statistical distribution. Each input bit affects each output bit with about 50% probability. There are no collisions (each input results in a different output). The algorithm is fast except if the CPU doesn't have a built-in integer multiplication unit. C code, assuming int is 32 bit (for Java, replace >> with >>> and remove unsigned):

unsigned int hash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

The magic number was calculated using a special multi-threaded test program that ran for many hours, which calculates the avalanche effect (the number of output bits that change if a single input bit is changed; should be nearly 16 on average), independence of output bit changes (output bits should not depend on each other), and the probability of a change in each output bit if any input bit is changed. The calculated values are better than the 32-bit finalizer used by MurmurHash, and nearly as good (not quite) as when using AES. A slight advantage is that the same constant is used twice (it did make it slightly faster the last time I tested, not sure if it's still the case).

You can reverse the process (get the input value from the hash) if you replace the 0x45d9f3b with 0x119de1f3 (the multiplicative inverse):

unsigned int unhash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = (x >> 16) ^ x;
    return x;
}

For 64-bit numbers, I suggest to use the following, even thought it might not be the fastest. This one is based on splitmix64, which seems to be based on the blog article Better Bit Mixing (mix 13).

uint64_t hash(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}

For Java, use long, add L to the constant, replace >> with >>> and remove unsigned. In this case, reversing is more complicated:

uint64_t unhash(uint64_t x) {
    x = (x ^ (x >> 31) ^ (x >> 62)) * UINT64_C(0x319642b2d24d8ec3);
    x = (x ^ (x >> 27) ^ (x >> 54)) * UINT64_C(0x96de1b173f119089);
    x = x ^ (x >> 30) ^ (x >> 60);
    return x;
}

Update: You may also want to look at the Hash Function Prospector project, where other (possibly better) constants are listed.


  • 32-bits multiplicative method (very fast) see @rafal

    #define hash32(x) ((x)*2654435761)
    #define H_BITS 24 // Hashtable size
    #define H_SHIFT (32-H_BITS)
    unsigned hashtab[1<<H_BITS]  
    .... 
    unsigned slot = hash32(x) >> H_SHIFT
    
  • 32-bits and 64-bits (good distribution) at : MurmurHash

  • Integer Hash Function

Fast and good hash functions can be composed from fast permutations with lesser qualities, like

  • multiplication with an uneven integer
  • binary rotations
  • xorshift

To yield a hashing function with superior qualities, like demonstrated with PCG for random number generation.

This is in fact also the recipe rrxmrrxmsx_0 and murmur hash are using, knowingly or unknowingly.

I personally found

uint64_t xorshift(const uint64_t& n,int i){
  return n^(n>>i);
}
uint64_t hash(const uint64_t& n){
  uint64_t p = 0x5555555555555555ull; // pattern of alternating 0 and 1
  uint64_t c = 17316035218449499591ull;// random uneven integer constant; 
  return c*xorshift(p*xorshift(n,32),32);
}

to be good enough.

A good hash function should

  1. be bijective to not loose information, if possible and have the least collisions
  2. cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5.

Let's first look at the identity function. It satisfies 1. but not 2. :

identity function

Input bit n determines output bit n with a correlation of 100% (red) and no others, they are therefore blue, giving a perfect red line across.

A xorshift(n,32) is not much better, yielding one and half a line. Still satisfying 1., because it is invertible with a second application.

xorshift

A multiplication with an unsigned integer ("Knuth's multiplicative method") is much better, cascading more strongly and flipping more output bits with a probability of 0.5, which is what you want, in green. It satisfies 1. as for each uneven integer there is a multiplicative inverse.

knuth

Combining the two gives the following output, still satisfying 1. as the composition of two bijective functions yields another bijective function.

knuth•xorshift

A second application of multiplication and xorshift will yield the following:

proposed hash

Or you can use Galois field multiplications like GHash, they have become reasonably fast on modern CPUs and have superior qualities in one step.

   uint64_t const inline gfmul(const uint64_t& i,const uint64_t& j){           
     __m128i I{};I[0]^=i;                                                          
     __m128i J{};J[0]^=j;                                                          
     __m128i M{};M[0]^=0xb000000000000000ull;                                      
     __m128i X = _mm_clmulepi64_si128(I,J,0);                                      
     __m128i A = _mm_clmulepi64_si128(X,M,0);                                      
     __m128i B = _mm_clmulepi64_si128(A,M,0);                                      
     return A[0]^A[1]^B[1]^X[0]^X[1];                                              
   }

I don't think we can say that a hash function is "good" without knowing your data in advance ! and without knowing what you're going to do with it.

There are better data structures than hash tables for unknown data sizes (I'm assuming you're doing the hashing for a hash table here ). I would personally use a hash table when I Know I have a "finite" number of elements that are needing stored in a limited amount of memory. I would try and do a quick statistical analysis on my data, see how it is distributed etc before I start thinking about my hash function.


Examples related to c

conflicting types for 'outchar' Can't compile C program on a Mac after upgrade to Mojave Program to find largest and second largest number in array Prime numbers between 1 to 100 in C Programming Language In c, in bool, true == 1 and false == 0? How I can print to stderr in C? Visual Studio Code includePath "error: assignment to expression with array type error" when I assign a struct field (C) Compiling an application for use in highly radioactive environments How can you print multiple variables inside a string using printf?

Examples related to algorithm

How can I tell if an algorithm is efficient? Find the smallest positive integer that does not occur in a given sequence Efficiently getting all divisors of a given number Peak signal detection in realtime timeseries data What is the optimal algorithm for the game 2048? How can I sort a std::map first by value, then by key? Finding square root without using sqrt function? Fastest way to flatten / un-flatten nested JSON objects Mergesort with Python Find common substring between two strings

Examples related to hash

php mysqli_connect: authentication method unknown to the client [caching_sha2_password] What is Hash and Range Primary Key? How to create a laravel hashed password Hashing a file in Python PHP salt and hash SHA256 for login password Append key/value pair to hash with << in Ruby Are there any SHA-256 javascript implementations that are generally considered trustworthy? How do I generate a SALT in Java for Salted-Hash? What does hash do in python? Hashing with SHA1 Algorithm in C#