[language-agnostic] Why should hash functions use a prime number modulus?

A long time ago, I bought a data structures book off the bargain table for $1.25. In it, the explanation for a hashing function said that it should ultimately mod by a prime number because of "the nature of math".

What do you expect from a $1.25 book?

Anyway, I've had years to think about the nature of math, and still can't figure it out.

Is the distribution of numbers truly more even when there are a prime number of buckets?

Or is this an old programmer's tale that everyone accepts because everybody else accepts it?

This question is related to language-agnostic data-structures hash

The answer is


The first thing you do when inserting/retreiving from hash table is to calculate the hashCode for the given key and then find the correct bucket by trimming the hashCode to the size of the hashTable by doing hashCode % table_length. Here are 2 'statements' that you most probably have read somewhere

  1. If you use a power of 2 for table_length, finding (hashCode(key) % 2^n ) is as simple and quick as (hashCode(key) & (2^n -1)). But if your function to calculate hashCode for a given key isn't good, you will definitely suffer from clustering of many keys in a few hash buckets.
  2. But if you use prime numbers for table_length, hashCodes calculated could map into the different hash buckets even if you have a slightly stupid hashCode function.

And here is the proof.

If suppose your hashCode function results in the following hashCodes among others {x , 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m number of buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this). Now you can do one of the following to avoid clustering

Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}.But this may be kind of difficult if your hashTable is supposed to have millions of entries. Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.

From - http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html


Copying from my other answer https://stackoverflow.com/a/43126969/917428. See it for more details and examples.

I believe that it just has to do with the fact that computers work with in base 2. Just think at how the same thing works for base 10:

  • 8 % 10 = 8
  • 18 % 10 = 8
  • 87865378 % 10 = 8

It doesn't matter what the number is: as long as it ends with 8, its modulo 10 will be 8.

Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.


It depends on the choice of hash function.

Many hash functions combine the various elements in the data by multiplying them with some factors modulo the power of two corresponding to the word size of the machine (that modulus is free by just letting the calculation overflow).

You don't want any common factor between a multiplier for a data element and the size of the hash table, because then it could happen that varying the data element doesn't spread the data over the whole table. If you choose a prime for the size of the table such a common factor is highly unlikely.

On the other hand, those factors are usually made up from odd primes, so you should also be safe using powers of two for your hash table (e.g. Eclipse uses 31 when it generates the Java hashCode() method).


I'd like to add something for Steve Jessop's answer(I can't comment on it since I don't have enough reputation). But I found some helpful material. His answer is very help but he made a mistake: the bucket size should not be a power of 2. I'll just quote from the book "Introduction to Algorithm" by Thomas Cormen, Charles Leisersen, et al on page263:

When using the division method, we usually avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all low-order p-bit patterns are equally likely, we are better off designing the hash function to depend on all the bits of the key. As Exercise 11.3-3 asks you to show, choosing m = 2^p-1 when k is a character string interpreted in radix 2^p may be a poor choice, because permuting the characters of k does not change its hash value.

Hope it helps.


Suppose your table-size (or the number for modulo) is T = (B*C). Now if hash for your input is like (N*A*B) where N can be any integer, then your output won't be well distributed. Because every time n becomes C, 2C, 3C etc., your output will start repeating. i.e. your output will be distributed only in C positions. Note that C here is (T / HCF(table-size, hash)).

This problem can be eliminated by making HCF 1. Prime numbers are very good for that.

Another interesting thing is when T is 2^N. These will give output exactly same as all the lower N bits of input-hash. As every number can be represented powers of 2, when we will take modulo of any number with T, we will subtract all powers of 2 form number, which are >= N, hence always giving off number of specific pattern, dependent on the input. This is also a bad choice.

Similarly, T as 10^N is bad as well because of similar reasons (pattern in decimal notation of numbers instead of binary).

So, prime numbers tend to give a better distributed results, hence are good choice for table size.


"The nature of math" regarding prime power moduli is that they are one building block of a finite field. The other two building blocks are an addition and a multiplication operation. The special property of prime moduli is that they form a finite field with the "regular" addition and multiplication operations, just taken to the modulus. This means every multiplication maps to a different integer modulo the prime, so does every addition.

Prime moduli are advantageous because:

  • They give the most freedom when choosing the secondary multiplier in secondary hashing, all multipliers except 0 will end up visiting all elements exactly once
  • If all hashes are less than the modulus there will be no collisions at all
  • Random primes mix better than power of two moduli and compress the information of all the bits not just a subset

They however have a big downside, they require an integer division, which takes many (~ 15-40) cycles, even on a modern CPU. With around half the computation one can make sure the hash is mixed up very well. Two multiplications and xorshift operations will mix better than a prime moudulus. Then we can use whatever hash table size and hash reduction is fastest, giving 7 operations in total for power of 2 table sizes and around 9 operations for arbitrary sizes.

I recently looked at many of the fastest hash table implementations and most of them don't use prime moduli.

The distribution of the hash table indices are mainly dependent on the hash function in use. A prime modulus can't fix a bad hash function and a good hash function does not benefit from a prime modulus. There are cases where they can be advantageous however. It can mend a half-bad hash function for example.


This question was merged with the more appropriate question, why hash tables should use prime sized arrays, and not power of 2. For hash functions itself there are plenty of good answers here, but for the related question, why some security-critical hash tables, like glibc, use prime-sized arrays, there's none yet.

Generally power of 2 tables are much faster. There the expensive h % n => h & bitmask, where the bitmask can be calculated via clz ("count leading zeros") of the size n. A modulo function needs to do integer division which is about 50x slower than a logical and. There are some tricks to avoid a modulo, like using Lemire's https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/, but generally fast hash tables use power of 2, and secure hash tables use primes.

Why so?

Security in this case is defined by attacks on the collision resolution strategy, which is with most hash tables just linear search in a linked list of collisions. Or with the faster open-addressing tables linear search in the table directly. So with power of 2 tables and some internal knowledge of the table, e.g. the size or the order of the list of keys provided by some JSON interface, you get the number of right bits used. The number of ones on the bitmask. This is typically lower than 10 bits. And for 5-10 bits it's trivial to brute force collisions even with the strongest and slowest hash functions. You don't get the full security of your 32bit or 64 bit hash functions anymore. And the point is to use fast small hash functions, not monsters such as murmur or even siphash.

So if you provide an external interface to your hash table, like a DNS resolver, a programming language, ... you want to care about abuse folks who like to DOS such services. It's normally easier for such folks to shut down your public service with much easier methods, but it did happen. So people did care.

So the best options to prevent from such collision attacks is either

1) to use prime tables, because then

  • all 32 or 64 bits are relevant to find the bucket, not just a few.
  • the hash table resize function is more natural than just double. The best growth function is the fibonacci sequence and primes come closer to that than doubling.

2) use better measures against the actual attack, together with fast power of 2 sizes.

  • count the collisions and abort or sleep on detected attacks, which is collision numbers with a probability of <1%. Like 100 with 32bit hash tables. This is what e.g. djb's dns resolver does.
  • convert the linked list of collisions to tree's with O(log n) search not O(n) when an collision attack is detected. This is what e.g. java does.

There's a wide-spread myth that more secure hash functions help to prevent such attacks, which is wrong as I explained. There's no security with low bits only. This would only work with prime-sized tables, but this would use a combination of the two slowest methods, slow hash plus slow prime modulo.

Hash functions for hash tables primarily need to be small (to be inlinable) and fast. Security can come only from preventing linear search in the collisions. And not to use trivially bad hash functions, like ones insensitive to some values (like \0 when using multiplication).

Using random seeds is also a good option, people started with that first, but with enough information of the table even a random seed does not help much, and dynamic languages typically make it trivial to get the seed via other methods, as it's stored in known memory locations.


http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Pretty clear explanation, with pictures too.

Edit: As a summary, primes are used because you have the best chance of obtaining a unique value when multiplying values by the prime number chosen and adding them all up. For example given a string, multiplying each letter value with the prime number and then adding those all up will give you its hash value.

A better question would be, why exactly the number 31?


For a hash function it's not only important to minimize colisions generally but to make it impossible to stay with the same hash while chaning a few bytes.

Say you have an equation: (x + y*z) % key = x with 0<x<key and 0<z<key. If key is a primenumber n*y=key is true for every n in N and false for every other number.

An example where key isn't a prime example: x=1, z=2 and key=8 Because key/z=4 is still a natural number, 4 becomes a solution for our equation and in this case (n/2)*y = key is true for every n in N. The amount of solutions for the equation have practially doubled because 8 isn't a prime.

If our attacker already knows that 8 is possible solution for the equation he can change the file from producing 8 to 4 and still gets the same hash.


I've read the popular wordpress website linked in some of the above popular answers at the top. From what I've understood, I'd like to share a simple observation I made.

You can find all the details in the article here, but assume the following holds true:

  • Using a prime number gives us the "best chance" of an unique value

A general hashmap implementation wants 2 things to be unique.

  • Unique hash code for the key
  • Unique index to store the actual value

How do we get the unique index? By making the initial size of the internal container a prime as well. So basically, prime is involved because it possesses this unique trait of producing unique numbers which we end up using to ID objects and finding indexes inside the internal container.

Example:

key = "key"

value = "value" uniqueId = "k" * 31 ^ 2 + "e" * 31 ^ 1` + "y"

maps to unique id

Now we want a unique location for our value - so we

uniqueId % internalContainerSize == uniqueLocationForValue , assuming internalContainerSize is also a prime.

I know this is simplified, but I'm hoping to get the general idea through.


Just to provide an alternate viewpoint there's this site:

http://www.codexon.com/posts/hash-functions-the-modulo-prime-myth

Which contends that you should use the largest number of buckets possible as opposed to to rounding down to a prime number of buckets. It seems like a reasonable possibility. Intuitively, I can certainly see how a larger number of buckets would be better, but I'm unable to make a mathematical argument of this.


tl;dr

index[hash(input)%2] would result in a collision for half of all possible hashes and a range of values. index[hash(input)%prime] results in a collision of <2 of all possible hashes. Fixing the divisor to the table size also ensures that the number cannot be greater than the table.


Primes are used because you have good chances of obtaining a unique value for a typical hash-function which uses polynomials modulo P. Say, you use such hash-function for strings of length <= N, and you have a collision. That means that 2 different polynomials produce the same value modulo P. The difference of those polynomials is again a polynomial of the same degree N (or less). It has no more than N roots (this is here the nature of math shows itself, since this claim is only true for a polynomial over a field => prime number). So if N is much less than P, you are likely not to have a collision. After that, experiment can probably show that 37 is big enough to avoid collisions for a hash-table of strings which have length 5-10, and is small enough to use for calculations.


I would say the first answer at this link is the clearest answer I found regarding this question.

Consider the set of keys K = {0,1,...,100} and a hash table where the number of buckets is m = 12. Since 3 is a factor of 12, the keys that are multiples of 3 will be hashed to buckets that are multiples of 3:

  • Keys {0,12,24,36,...} will be hashed to bucket 0.
  • Keys {3,15,27,39,...} will be hashed to bucket 3.
  • Keys {6,18,30,42,...} will be hashed to bucket 6.
  • Keys {9,21,33,45,...} will be hashed to bucket 9.

If K is uniformly distributed (i.e., every key in K is equally likely to occur), then the choice of m is not so critical. But, what happens if K is not uniformly distributed? Imagine that the keys that are most likely to occur are the multiples of 3. In this case, all of the buckets that are not multiples of 3 will be empty with high probability (which is really bad in terms of hash table performance).

This situation is more common that it may seem. Imagine, for instance, that you are keeping track of objects based on where they are stored in memory. If your computer's word size is four bytes, then you will be hashing keys that are multiples of 4. Needless to say that choosing m to be a multiple of 4 would be a terrible choice: you would have 3m/4 buckets completely empty, and all of your keys colliding in the remaining m/4 buckets.

In general:

Every key in K that shares a common factor with the number of buckets m will be hashed to a bucket that is a multiple of this factor.

Therefore, to minimize collisions, it is important to reduce the number of common factors between m and the elements of K. How can this be achieved? By choosing m to be a number that has very few factors: a prime number.

FROM THE ANSWER BY Mario.


Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

Given a string “Samuel”, you can generate a unique hash by multiply each of the constituent digits or letters with a prime number and adding them up. This is why primes are used.

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about http://www.azillionmonkeys.com/qed/hash.html

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/


Examples related to language-agnostic

IOException: The process cannot access the file 'file path' because it is being used by another process Peak signal detection in realtime timeseries data Match linebreaks - \n or \r\n? Simple way to understand Encapsulation and Abstraction How can I pair socks from a pile efficiently? How do I determine whether my calculation of pi is accurate? What is ADT? (Abstract Data Type) How to explain callbacks in plain english? How are they different from calling one function from another function? Ukkonen's suffix tree algorithm in plain English Private vs Protected - Visibility Good-Practice Concern

Examples related to data-structures

Program to find largest and second largest number in array golang why don't we have a set datastructure How to initialize a vector with fixed length in R C compiling - "undefined reference to"? List of all unique characters in a string? Binary Search Tree - Java Implementation How to clone object in C++ ? Or Is there another solution? How to check queue length in Python Difference between "Complete binary tree", "strict binary tree","full binary Tree"? Write code to convert given number into words (eg 1234 as input should output one thousand two hundred and thirty four)

Examples related to hash

php mysqli_connect: authentication method unknown to the client [caching_sha2_password] What is Hash and Range Primary Key? How to create a laravel hashed password Hashing a file in Python PHP salt and hash SHA256 for login password Append key/value pair to hash with << in Ruby Are there any SHA-256 javascript implementations that are generally considered trustworthy? How do I generate a SALT in Java for Salted-Hash? What does hash do in python? Hashing with SHA1 Algorithm in C#