hash function for string

Question

I'm working on a hash table in C and I'm testing hash functions for strings. The first function I tried adds up the ASCII codes and takes the sum modulo 100, but I got poor results with the first test data: 40 collisions for 130 words. The final input data will contain 8,000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a .txt file. The first question is: which is the best algorithm for hashing strings? And how do I determine the size of the hash table? Thanks in advance!

User · Answer

djb2 has 317 collisions for this 466k-word English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes). My recommendation is to use MurmurHash if it is available; it is very fast because it takes in several bytes at a time. But if you need a simple, short hash function to copy and paste into your project, I'd recommend Murmur's one-byte-at-a-time version:

#include <stdint.h>

/* One-byte-at-a-time Murmur variant, 32-bit result. */
static inline uint32_t MurmurOAAT32(const char *key)
{
  uint32_t h = 3323198485u;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

/* One-byte-at-a-time Murmur variant, 64-bit result. */
static inline uint64_t MurmurOAAT64(const char *key)
{
  uint64_t h = 525201411107845655ull;
  for (; *key; ++key) {
    h ^= *key;
    h *= 0x5bd1e9955bd1e995ull;
    h ^= h >> 47;
  }
  return h;
}

The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we usually don't know (or don't want to look up) how much memory is available, and it might even change, a good rule of thumb is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but at rapidly diminishing returns; making it much smaller than that will slow it down sharply as the table fills up. This is because there is a non-linear trade-off between space and time for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58..., apparently.
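As a rough illustration of the 2x rule of thumb above, here is a minimal sketch. The helper names table_capacity and load_factor are made up for this example and are not from any library.

```c
#include <stddef.h>

/* Hypothetical helper: number of slots for a table expected to hold
 * `expected` keys, following the "roughly 2x" rule of thumb
 * (a target load factor of about 0.5). */
size_t table_capacity(size_t expected)
{
    return expected * 2;
}

/* Load factor = stored keys / slots (0.0 for an empty capacity). */
double load_factor(size_t stored, size_t capacity)
{
    return capacity ? (double)stored / (double)capacity : 0.0;
}
```

For the 8,000 words in the question this suggests about 16,000 slots, whereas int table[10000] would run at a load factor of 0.8.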

User · Answer

One thing I've used with good results is the following (I don't know if it's mentioned already, because I can't remember its name): You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
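A minimal sketch of the scheme described above, under a few assumptions: the table holds 32-bit values, it is filled from a small xorshift32 generator (my choice for the sketch, any decent random source would do), and the names table_hash_init and table_hash are invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t T[256];             /* one random word per byte value */

/* Fill T from a xorshift32 generator. The generator and the seed
 * handling are assumptions of this sketch, not part of the scheme. */
void table_hash_init(uint32_t seed)
{
    uint32_t x = seed ? seed : 1u;  /* xorshift state must be nonzero */
    for (size_t i = 0; i < 256; ++i) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        T[i] = x;
    }
}

/* XOR together the table entries of every byte in the key. */
uint32_t table_hash(const char *key)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)key; *p; ++p)
        h ^= T[*p];
    return h;
}
```

One caveat worth knowing: with a single table, XOR is commutative, so anagrams such as "ab" and "ba" hash to the same value, and a doubled byte cancels itself out ("aa" hashes to 0). Re-seeding the table does not change that.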

User · Answer

Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.

uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
    uint32_t hash, i;
    for (hash = i = 0; i < len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}

User · Answer

Though djb2, as presented on stackoverflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too:

1) Apparently a terrible hash algorithm, as presented in K&R 1st edition (source):

unsigned long hash(unsigned char *str)
{
    unsigned int hash = 0;
    int c;

    while (c = *str++)
        hash += c;

    return hash;
}

2) Probably a pretty decent hash algorithm, as presented in K&R version 2 (verified by me on pg. 144 of the book). NB: be sure to remove "% HASHSIZE" from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long instead of the simple unsigned (int).

unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into consideration, so hash("ab") would return the same value as hash("ba"). This is not so with the 2nd edition hash, however, which would (much better!) return two different values for those strings.

The GCC C++11 hashing functions used for unordered_map (a hash table template) and unordered_set (a hash set template) appear to be as follows.

This is a partial answer to the question of what the GCC C++11 hash functions are, stating that GCC uses an implementation of "MurmurHashUnaligned2" by Austin Appleby (http://murmurhash.googlepages.com/).

In the file "gcc/libstdc++-v3/libsupc++/hash_bytes.cc", here (https://github.com/gcc-mirror/gcc/blob/master/libstdc++-v3/libsupc++/hash_bytes.cc), I found the implementations. Here's the one for the "32-bit size_t" return value, for example (pulled 11 Aug 2017):

Code:

// Implementation of Murmur hash for 32-bit size_t.
size_t _Hash_bytes(const void* ptr, size_t len, size_t seed)
{
  const size_t m = 0x5bd1e995;
  size_t hash = seed ^ len;
  const char* buf = static_cast<const char*>(ptr);

  // Mix 4 bytes at a time into the hash.
  while (len >= 4)
  {
    size_t k = unaligned_load(buf);
    k *= m;
    k ^= k >> 24;
    k *= m;
    hash *= m;
    hash ^= k;
    buf += 4;
    len -= 4;
  }

  // Handle the last few bytes of the input array.
  switch (len)
  {
    case 3:
      hash ^= static_cast<unsigned char>(buf[2]) << 16;
      [[gnu::fallthrough]];
    case 2:
      hash ^= static_cast<unsigned char>(buf[1]) << 8;
      [[gnu::fallthrough]];
    case 1:
      hash ^= static_cast<unsigned char>(buf[0]);
      hash *= m;
  };

  // Do a few final mixes of the hash.
  hash ^= hash >> 13;
  hash *= m;
  hash ^= hash >> 15;
  return hash;
}
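To make the point about character order concrete, here is a small sketch with both K&R-style hashes side by side. The names sum_hash and mul31_hash are mine, both return unsigned long as recommended above, and the second one leaves out the final "% HASHSIZE" so the caller can reduce the value to its own table size.

```c
/* Additive hash in the style of K&R 1st edition: character order
 * does not affect the result. */
unsigned long sum_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h += (unsigned char)*s++;
    return h;
}

/* Multiplicative hash in the style of K&R 2nd edition, without the
 * final modulus; each step scales the running value by 31. */
unsigned long mul31_hash(const char *s)
{
    unsigned long h = 0;
    while (*s)
        h = (unsigned char)*s++ + 31 * h;
    return h;
}
```

With these, sum_hash gives 195 for both "ab" and "ba", while mul31_hash gives 3105 for "ab" and 3135 for "ba".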

User · Answer

First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.

A hash function with a good reputation is MurmurHash3.

Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
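To put a number on what a "random generator" would do, here is a small sketch. It assumes a collision is counted as a key landing in a bucket that is already occupied (the question does not say how its 40 was counted), and the function name expected_collisions is made up for the example.

```c
/* Expected number of keys that land in an already occupied bucket
 * when n keys are thrown uniformly at random into m buckets (m > 0):
 *     n - m * (1 - (1 - 1/m)^n)
 * The second term is the expected number of non-empty buckets. */
double expected_collisions(unsigned n, unsigned m)
{
    double empty = 1.0;   /* probability that a given bucket stays empty */
    for (unsigned i = 0; i < n; ++i)
        empty *= 1.0 - 1.0 / m;
    return n - m * (1.0 - empty);
}
```

For 130 keys and 100 buckets this comes out at roughly 57, so under that way of counting, 40 collisions would not be worse than random placement.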

User · Answer

I've had nice results with djb2 by Dan Bernstein.

unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

User · Answer

I have tried these hash functions and got the following results. I have about 960^3 entries, each 64 bytes long (64 chars in different order), hash value 32-bit. Code from here.

Hash function       collision rate   time to finish
--------------------------------------------------------------
MurmurHash3         6%               4m15s
Jenkins One         6.1%             6m54s
Bob, 1st in link    6.16%            5m34s
SuperFastHash       10%              4m58s
bernstein           20%              14s (only finished 1/20)
one_at_a_time       6.16%            7m5s
crc                 6.16%            7m56s

One strange thing is that almost all the hash functions have a 6% collision rate for my data.

User · Answer

First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.

Second, you want to ensure that every bit of the input can (and will) affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.

For example, assuming the common case of 8-bit bytes, you might rotate by 5 bits (rol here stands for a rotate-left helper; it is not a standard C function):

int hash(char const *input) {
    int result = 0x55555555;

    while (*input) {
        result ^= *input++;
        result = rol(result, 5);
    }
    return result;
}

Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: either a prime number as the size (required to ensure correctness with some types of hash resolution), or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
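Here is a self-contained sketch of the rotate-and-XOR idea together with the power-of-2 bit-mask reduction. It departs from the snippet above in two ways that are my own choices: it uses uint32_t so the rotation is well defined, and it spells out the rotate helper. The names rotl32, rot_hash and slot_index are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate x left by r bits; valid for 0 < r < 32. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* XOR in each byte, then rotate the running value by 5 bits. */
uint32_t rot_hash(const char *input)
{
    uint32_t result = 0x55555555u;
    for (const unsigned char *p = (const unsigned char *)input; *p; ++p) {
        result ^= *p;
        result = rotl32(result, 5);
    }
    return result;
}

/* Reduce a hash to a slot index; table_size must be a power of two. */
size_t slot_index(uint32_t hash, size_t table_size)
{
    return hash & (table_size - 1);
}
```

With a table of 16384 slots, slot_index(rot_hash(word), 16384) keeps the low 14 bits of the hash, which is why a power-of-2 size only works well when the low bits are well mixed.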

User · Answer

There are a number of existing hashtable implementations for C, from the POSIX hcreate/hdestroy/hsearch functions (declared in <search.h>) to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.

If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
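As a rough sketch of the hcreate/hsearch route, assuming a POSIX system and assuming the table maps each word to its file offset as in the question: the wrapper names dict_put and dict_get are invented for this example, while hcreate, hsearch, ENTRY, ENTER and FIND come from <search.h>.

```c
#include <search.h>   /* POSIX: hcreate, hsearch, hdestroy */
#include <stddef.h>

/* Insert a word -> offset pair into the process-wide POSIX table.
 * hsearch stores the key pointer, not a copy, so `word` must outlive
 * the table. Returns 0 on success, -1 if the table is full. */
int dict_put(char *word, long *offset)
{
    ENTRY item;
    item.key = word;
    item.data = offset;
    return hsearch(item, ENTER) ? 0 : -1;
}

/* Look a word up; returns the stored offset pointer or NULL. */
long *dict_get(const char *word)
{
    ENTRY item;
    item.key = (char *)word;
    item.data = NULL;
    ENTRY *found = hsearch(item, FIND);
    return found ? (long *)found->data : NULL;
}
```

Call hcreate once with the expected capacity (for example hcreate(16000)) before the first dict_put, and hdestroy when done. There is only one such table per process, and as far as I know platforms differ on whether hdestroy frees the keys, so check the local man page.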
