Why should hash functions use a prime number modulus?

Question

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math". What do you expect from a $1.25 book? Anyway, I've had years to think about the nature of math, and still can't figure it out. Is the distribution of numbers truly more even when there are a prime number of buckets? Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

User · Answer

Primes are used because you have good chances of obtaining a unique value for a typical hash function which uses polynomials modulo P. Say you use such a hash function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is where the nature of math shows itself, since this claim is only true for a polynomial over a field, which requires a prime modulus). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash table of strings which have length 5-10, and is small enough to use for calculations.
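A minimal sketch of the polynomial hash this answer describes. The multiplier 37 comes from the answer; the prime modulus 1,000,003, the function names, and the test alphabet are assumptions chosen only for illustration.

```python
from itertools import product

P = 1_000_003  # assumed prime modulus (illustrative choice, not from the answer)
BASE = 37      # the multiplier mentioned in the answer


def poly_hash(s: str, base: int = BASE, mod: int = P) -> int:
    """Evaluate the string as a polynomial in `base`, reduced modulo `mod` (Horner's rule)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def count_collisions(words, mod: int) -> int:
    """Number of words whose hash value was already produced by an earlier word."""
    seen = set()
    collisions = 0
    for w in words:
        h = poly_hash(w, mod=mod)
        if h in seen:
            collisions += 1
        seen.add(h)
    return collisions


# All 3-letter strings over a small alphabet: 6**3 = 216 distinct keys.
words = ["".join(t) for t in product("abcdef", repeat=3)]
print(count_collisions(words, P))
```

Here the string length is tiny compared to P, which is the regime the answer argues is unlikely to collide.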

User · Answer

The first thing you do when inserting into or retrieving from a hash table is to calculate the hashCode for the given key, and then find the correct bucket by trimming the hashCode to the size of the hash table by doing hashCode % table_length. Here are 2 "statements" that you have most probably read somewhere:

1. If you use a power of 2 for table_length, finding (hashCode(key) % 2^n) is as simple and quick as (hashCode(key) & (2^n - 1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
2. But if you use prime numbers for table_length, the hashCodes calculated could map into different hash buckets even if you have a slightly stupid hashCode function.

And here is the proof.

Suppose your hashCode function results in the following hashCodes, among others: {x, 2x, 3x, 4x, 5x, 6x, ...}. Then all of these are going to be clustered in just m buckets, where m = table_length / GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:

- Make sure that you don't generate too many hashCodes that are multiples of another hashCode, like in {x, 2x, 3x, 4x, 5x, 6x, ...}. But this may be kind of difficult if your hash table is supposed to have millions of entries.
- Or simply make m equal to table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e., by making table_length coprime with x. And if x can be just about any number, then make sure that table_length is a prime number.

From: http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html
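The claim m = table_length / GreatestCommonFactor(table_length, x) is easy to check empirically. The sketch below is only an illustration; the function name and the sample values of x are made up for the example.

```python
from math import gcd


def buckets_used(x: int, table_length: int, count: int = 1000) -> int:
    """Count distinct buckets hit by hash codes x, 2x, 3x, ..., count*x."""
    return len({(i * x) % table_length for i in range(1, count + 1)})


for table_length in (16, 17):
    for x in (4, 6, 8):
        used = buckets_used(x, table_length)
        predicted = table_length // gcd(table_length, x)
        print(f"length={table_length:2d} x={x}: {used} buckets used, formula predicts {predicted}")
```

With the power-of-two length 16, only a fraction of the buckets is reachable for even x, while the prime length 17 spreads every tested x over all buckets.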

User · Answer

I've read the popular WordPress website linked in some of the popular answers above. From what I've understood, I'd like to share a simple observation I made.

You can find all the details in the article here, but assume the following holds true:

- Using a prime number gives us the "best chance" of a unique value.

A general hashmap implementation wants 2 things to be unique:

- A unique hash code for the key
- A unique index to store the actual value

How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, primes are involved because they possess this trait of producing unique numbers, which we end up using to ID objects and to find indexes inside the internal container.

Example:

key = "key"
value = "value"
uniqueId = "k" * 31^2 + "e" * 31^1 + "y"

This maps to a unique id.

Now we want a unique location for our value, so we compute

uniqueId % internalContainerSize == uniqueLocationForValue

assuming internalContainerSize is also a prime.

I know this is simplified, but I'm hoping to get the general idea through.
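The example can be made concrete with a small sketch. It mirrors the 31-based formula above and wraps to a signed 32-bit value the way Java's String.hashCode() does for characters in the Basic Multilingual Plane; the container size 17 is an assumed prime picked for illustration, and "unique" should be read as "likely distinct", not guaranteed.

```python
def java_style_hash(s: str) -> int:
    """31-based polynomial hash, wrapped to a signed 32-bit int (assumes BMP characters)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


unique_id = java_style_hash("key")       # 'k'*31**2 + 'e'*31 + 'y'
internal_container_size = 17             # assumed prime table size, for illustration
location = unique_id % internal_container_size
print(unique_id, location)
```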

User · Answer

I would say the first answer at this link is the clearest answer I found regarding this question.

Consider the set of keys K = {0, 1, ..., 100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:

- Keys {0, 12, 24, 36, ...} will be hashed to bucket 0.
- Keys {3, 15, 27, 39, ...} will be hashed to bucket 3.
- Keys {6, 18, 30, 42, ...} will be hashed to bucket 6.
- Keys {9, 21, 33, 45, ...} will be hashed to bucket 9.

If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But what happens if K is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common than it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say, choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.

In general: every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.

(From the answer by Mario.)
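The word-aligned address scenario can be simulated in a few lines. This is only a sketch: the key range stands in for hypothetical 4-byte-aligned addresses, and the bucket counts 12 and 13 are chosen to contrast a multiple of 4 with a nearby prime.

```python
from collections import Counter


def bucket_counts(keys, m: int) -> Counter:
    """Map each key to bucket key % m and count how many keys each bucket receives."""
    return Counter(k % m for k in keys)


# Hypothetical word-aligned addresses: every key is a multiple of 4.
keys = range(0, 4000, 4)

for m in (12, 13):
    counts = bucket_counts(keys, m)
    print(f"m={m}: {len(counts)} of {m} buckets used, fullest bucket holds {max(counts.values())} keys")
```

With m = 12 only m/4 = 3 buckets ever receive a key, matching the 3m/4 empty buckets described above; with m = 13 every bucket is used.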

User · Answer

It depends on the choice of hash function.

Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).

You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table, such a common factor is highly unlikely.

On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).

User · Answer

Suppose your table size (or the number for modulo) is T = (B*C). Now if the hash for your input is like (N*A*B), where N can be any integer, then your output won't be well distributed, because every time N becomes C, 2C, 3C, etc., your output will start repeating, i.e., your output will be distributed over only C positions. Note that C here is (T / HCF(table-size, hash)). This problem can be eliminated by making the HCF 1. Prime numbers are very good for that.

Another interesting case is when T is 2^N. This gives output exactly the same as the lower N bits of the input hash. As every number can be represented as a sum of powers of 2, taking any number modulo T subtracts all powers of 2 in the number whose exponent is >= N, hence always giving a number with a specific pattern that depends on the input. This is also a bad choice.

Similarly, T as 10^N is bad as well for similar reasons (a pattern in the decimal notation of numbers instead of the binary one).

So prime numbers tend to give better-distributed results, and hence are a good choice for table size.

User · Answer

Just to provide an alternate viewpoint, there's this site:

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

It contends that you should use the largest number of buckets possible, as opposed to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument for this.

User · Answer

Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.

I believe that it just has to do with the fact that computers work in base 2. Just think about how the same thing works for base 10:

8 % 10 = 8
18 % 10 = 8
87865378 % 10 = 8

It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.

Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.
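The base-2 analogue of the "ends with 8" observation can be sketched as follows. The three hash codes are made-up values that agree only in their lowest four bits, and 17 is an arbitrary non-power-of-two size used for contrast.

```python
def bucket(h: int, size: int) -> int:
    """Reduce a non-negative hash code to a bucket index."""
    return h % size


# Three hypothetical hash codes that share only their lowest 4 bits.
hashes = [0x12345678, 0xABCDEF08, 0x00000018]

print([bucket(h, 16) for h in hashes])   # modulo 2**4 keeps only the low 4 bits
print([h & (16 - 1) for h in hashes])    # the same result via a bit mask
print([bucket(h, 17) for h in hashes])   # a non-power-of-two modulus
```

All three codes fall into the same bucket of a 16-slot table because only the low bits matter there, while the size-17 table lets the higher bits influence the index.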

User · Answer

Usually a simple hash function works by taking the "component parts" of the input (characters in the case of a string), multiplying them by the powers of some constant, and adding them together in some integer type. So for example a typical (although not especially good) hash of a string might be:

(first char) + k * (second char) + k^2 * (third char) + ...

Then if a bunch of strings all having the same first char are fed in, the results will all be the same modulo k, at least until the integer type overflows.

[As an example, Java's string hashCode is eerily similar to this: it does the characters in reverse order, with k=31. So you get striking relationships modulo 31 between strings that end the same way, and striking relationships modulo 2^32 between strings that are the same except near the end. This doesn't seriously mess up hashtable behaviour.]

A hashtable works by taking the modulus of the hash over the number of buckets.

It's important in a hashtable not to produce collisions for likely cases, since collisions reduce the efficiency of the hashtable.

Now, suppose someone puts a whole bunch of values into a hashtable that have some relationship between the items, like all having the same first character. This is a fairly predictable usage pattern, I'd say, so we don't want it to produce too many collisions.

It turns out that "because of the nature of maths", if the constant used in the hash and the number of buckets are coprime, then collisions are minimised in some common cases. If they are not coprime, then there are some fairly simple relationships between inputs for which collisions are not minimised. All the hashes come out equal modulo the common factor, which means they'll all fall into the 1/n th of the buckets which have that value modulo the common factor. You get n times as many collisions, where n is the common factor. Since n is at least 2, I'd say it's unacceptable for a fairly simple use case to generate at least twice as many collisions as normal. If some user is going to break our distribution into buckets, we want it to be a freak accident, not some simple predictable usage.

Now, hashtable implementations obviously have no control over the items put into them. They can't prevent them being related. So the thing to do is to ensure that the constant and the bucket counts are coprime. That way you aren't relying on the "last" component alone to determine the modulus of the bucket with respect to some small common factor. As far as I know they don't have to be prime to achieve this, just coprime.

But if the hash function and the hashtable are written independently, then the hashtable doesn't know how the hash function works. It might be using a constant with small factors. If you're lucky it might work completely differently and be nonlinear. If the hash is good enough, then any bucket count is just fine. But a paranoid hashtable can't assume a good hash function, so it should use a prime number of buckets. Similarly, a paranoid hash function should use a largeish prime constant, to reduce the chance that someone uses a number of buckets which happens to have a common factor with the constant.

In practice, I think it's fairly normal to use a power of 2 as the number of buckets. This is convenient and saves having to search around or pre-select a prime number of the right magnitude. So you rely on the hash function not to use even multipliers, which is generally a safe assumption. But you can still get occasional bad hashing behaviours based on hash functions like the one above, and a prime bucket count could help further.

Putting about the principle that "everything has to be prime" is, as far as I know, a sufficient but not a necessary condition for good distribution over hashtables. It allows everybody to interoperate without needing to assume that the others have followed the same rule.

[Edit: there's another, more specialized reason to use a prime number of buckets, which is if you handle collisions with linear probing. Then you calculate a stride from the hashcode, and if that stride comes out to be a factor of the bucket count then you can only do (bucket_count / stride) probes before you're back where you started. The case you most want to avoid is stride = 0, of course, which must be special-cased, but to avoid also special-casing bucket_count / stride equal to a small integer, you can just make the bucket count prime and not care what the stride is provided it isn't 0.]
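The effect of a common factor between the hash constant and the bucket count can be seen with the example hash from this answer. This is a sketch using Python's unbounded integers, so there is no overflow; the key list and the bucket counts are assumptions picked for illustration.

```python
def simple_hash(s: str, k: int = 31) -> int:
    """The answer's example hash: s[0] + k*s[1] + k**2*s[2] + ... (no overflow in Python)."""
    return sum(ord(ch) * k**i for i, ch in enumerate(s))


# Hypothetical keys that all share the same first character.
keys = ["apple", "avocado", "apricot", "almond", "anise", "acorn"]

for buckets in (31, 62, 32, 37):
    used = {simple_hash(key) % buckets for key in keys}
    print(f"{buckets} buckets: {len(used)} distinct bucket(s) for {len(keys)} keys")
```

With 31 buckets (equal to k) every key lands in the same bucket, and with 62 buckets (sharing the factor 31 with k) at most two buckets are reachable; bucket counts coprime to k, such as 32 or 37, are not constrained this way.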

User · Answer

Primes are unique numbers. They are unique in that the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself, of course), due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string "Samuel", you can generate a unique hash by multiplying each of the constituent digits or letters by a prime number and adding them up. This is why primes are used.

However, using primes is an old technique. The key thing to understand here is that as long as you can generate a sufficiently unique key, you can move to other hashing techniques too. Go here for more on this topic: http://www.azillionmonkeys.com/qed/hash.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

User · Answer

This question was merged with the more appropriate question of why hash tables should use prime-sized arrays and not powers of 2. For hash functions themselves there are plenty of good answers here, but for the related question, why some security-critical hash tables, like glibc, use prime-sized arrays, there's none yet.

Generally, power-of-2 tables are much faster. There the expensive h % n becomes h & bitmask, where the bitmask can be calculated via clz ("count leading zeros") of the size n. A modulo function needs to do integer division, which is about 50x slower than a logical and. There are some tricks to avoid a modulo, like using Lemire's https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/, but generally fast hash tables use powers of 2, and secure hash tables use primes.

Why so?

Security in this case is defined by attacks on the collision resolution strategy, which for most hash tables is just a linear search in a linked list of collisions, or, with the faster open-addressing tables, a linear search in the table directly. So with power-of-2 tables and some internal knowledge of the table, e.g. the size or the order of the list of keys provided by some JSON interface, you get the number of right bits used: the number of ones in the bitmask. This is typically lower than 10 bits. And for 5-10 bits it's trivial to brute-force collisions even with the strongest and slowest hash functions. You don't get the full security of your 32-bit or 64-bit hash functions anymore. And the point is to use fast, small hash functions, not monsters such as murmur or even siphash.

So if you provide an external interface to your hash table, like a DNS resolver, a programming language, ..., you need to care about abusive folks who like to DoS such services. It's normally easier for such folks to shut down your public service with much easier methods, but it did happen. So people did care.

So the best options to prevent such collision attacks are either:

1) To use prime tables, because then
- all 32 or 64 bits are relevant to find the bucket, not just a few;
- the hash table resize function is more natural than just doubling. The best growth function is the Fibonacci sequence, and primes come closer to that than doubling.

2) To use better measures against the actual attack, together with fast power-of-2 sizes:
- count the collisions and abort or sleep on detected attacks, i.e. collision numbers with a probability of <1%, like 100 with 32-bit hash tables. This is what e.g. djb's DNS resolver does.
- convert the linked list of collisions to a tree with O(log n) search instead of O(n) when a collision attack is detected. This is what e.g. Java does.

There's a widespread myth that more secure hash functions help to prevent such attacks, which is wrong, as I explained. There's no security with low bits only. This would only work with prime-sized tables, but then you would use a combination of the two slowest methods: slow hash plus slow prime modulo.

Hash functions for hash tables primarily need to be small (to be inlinable) and fast. Security can come only from preventing linear search in the collisions, and from not using trivially bad hash functions, like ones insensitive to some values (like \0 when using multiplication).

Using random seeds is also a good option, and people started with that first, but with enough information about the table even a random seed does not help much, and dynamic languages typically make it trivial to get the seed via other methods, as it's stored in known memory locations.
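The point that a strong hash does not protect a power-of-2 table can be illustrated with a small brute-force sketch. Everything here is an assumption made for the demonstration: the "strong" hash is the first 8 bytes of SHA-256, the table has 2^10 slots, and the candidate keys are generated as "key0", "key1", and so on.

```python
import hashlib


def strong_hash(key: str) -> int:
    """A deliberately strong (and slow) 64-bit hash: the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def colliding_keys(table_bits: int, wanted: int, target_bucket: int = 0):
    """Brute-force keys that land in one bucket of a table with 2**table_bits slots."""
    mask = (1 << table_bits) - 1
    found = []
    tried = 0
    while len(found) < wanted:
        key = f"key{tried}"
        if strong_hash(key) & mask == target_bucket:
            found.append(key)
        tried += 1
    return found, tried


keys, tried = colliding_keys(table_bits=10, wanted=20)
print(f"found {len(keys)} colliding keys after trying {tried} candidates")
```

Since only 10 bits select the bucket, roughly one candidate in 1024 collides, so a handful of colliding keys costs a few tens of thousands of hash evaluations regardless of how strong the full hash is.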

User · Answer

For a hash function it's not only important to minimize collisions generally, but also to make it impossible to stay with the same hash while changing a few bytes.

Say you have an equation: (x + y*z) % key = x with 0<x<key and 0<z<key. If key is a prime number, n*y=key is true for every n in N and false for every other number.

An example where key isn't a prime: x=1, z=2 and key=8. Because key/z=4 is still a natural number, 4 becomes a solution for our equation, and in this case (n/2)*y = key is true for every n in N. The number of solutions for the equation has practically doubled because 8 isn't a prime.

If our attacker already knows that 8 is a possible solution for the equation, he can change the file from producing 8 to 4 and still get the same hash.

User · Answer

I'd like to add something to Steve Jessop's answer (I can't comment on it since I don't have enough reputation), but I found some helpful material. His answer is very helpful, but he made a mistake: the bucket size should not be a power of 2. I'll just quote from the book "Introduction to Algorithms" by Thomas Cormen, Charles Leiserson, et al., on page 263:

"When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p - 1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value."

Hope it helps.
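The permutation effect mentioned in the quote can be checked directly. This is a sketch assuming p = 8 and single-byte characters, so the radix is 256 and m = 255; since 256 leaves remainder 1 when divided by 255, the hash collapses to the sum of the character codes modulo 255.

```python
from itertools import permutations


def radix_hash(s: str, p: int = 8) -> int:
    """Interpret s as a number in radix 2**p and reduce it modulo m = 2**p - 1."""
    radix = 1 << p
    m = radix - 1
    k = 0
    for ch in s:
        k = k * radix + ord(ch)   # assumes every character code is below 2**p
    return k % m


# Every rearrangement of the same letters ("stop", "pots", "tops", ...) hashes alike.
hashes = {radix_hash("".join(perm)) for perm in permutations("stop")}
print(hashes)
```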

User · Answer

tl;dr

index[hash(input) % 2] would result in a collision for half of all possible hashes and a range of values. index[hash(input) % prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.

User · Answer

"The nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, just taken to the modulus. This means every multiplication maps to a different integer modulo the prime, and so does every addition.

Prime moduli are advantageous because:

- They give the most freedom when choosing the secondary multiplier in secondary hashing: all multipliers except 0 will end up visiting all elements exactly once.
- If all hashes are less than the modulus, there will be no collisions at all.
- Random primes mix better than power-of-two moduli and compress the information of all the bits, not just a subset.

They however have a big downside: they require an integer division, which takes many (~15-40) cycles, even on a modern CPU. With around half the computation one can make sure the hash is mixed up very well. Two multiplications and xorshift operations will mix better than a prime modulus. Then we can use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power-of-2 table sizes and around 9 operations for arbitrary sizes.

I recently looked at many of the fastest hash table implementations, and most of them don't use prime moduli.

The distribution of the hash table indices is mainly dependent on the hash function in use. A prime modulus can't fix a bad hash function, and a good hash function does not benefit from a prime modulus. There are cases where it can be advantageous, however. It can mend a half-bad hash function, for example.
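The first advantage in the list, that every nonzero multiplier visits all slots when the table size is prime, can be sketched as a probe sequence. The function name and the table sizes 12 and 13 are assumptions for illustration.

```python
def probe_sequence(start: int, stride: int, size: int) -> list[int]:
    """Slots visited by start, start+stride, start+2*stride, ... (mod size) until a repeat."""
    seen = []
    slot = start % size
    while slot not in seen:
        seen.append(slot)
        slot = (slot + stride) % size
    return seen


for size in (12, 13):
    coverage = {stride: len(probe_sequence(0, stride, size)) for stride in range(1, size)}
    print(f"size={size}: slots reached per stride -> {coverage}")
```

For the prime size 13 every stride from 1 to 12 reaches all 13 slots, whereas for size 12 a stride that shares a factor with 12 cycles through only part of the table.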

User · Answer

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Pretty clear explanation, with pictures too.

Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example, given a string, multiplying each letter value by the prime number and then adding those all up will give you its hash value.

A better question would be: why exactly the number 31?
