Why does Java's hashCode() in String use 31 as a multiplier?

Question

Per the Java documentation, the hash code for a String object is computed as:

    s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier? I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
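For readers who want to see the documented formula in action, here is a minimal sketch that evaluates it with Horner's rule and compares the result with String.hashCode(). The class and method names (PolynomialHash, hash31) are made up for illustration; this is not the JDK source, only an equivalent computation under the formula quoted above.

```java
class PolynomialHash {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] using Horner's rule.
    // int arithmetic wraps around on overflow, as the documentation implies.
    static int hash31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "hello";
        // Both numbers should be identical.
        System.out.println(hash31(s) + " " + s.hashCode());
    }
}
```

Horner's rule avoids computing the powers of 31 explicitly: each loop step multiplies the running value by 31 and adds the next character.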

User · Answer

From JDK-4045622, where Joshua Bloch describes the reasons why that particular (new) String.hashCode() implementation was chosen:

    The table below summarizes the performance of the various hash
    functions described above, for three data sets:

    1) All of the words and phrases with entries in Merriam-Webster's
       2nd Int'l Unabridged Dictionary (311,141 strings, avg length 10 chars).

    2) All of the strings in /bin/*, /usr/bin/*, /usr/lib/*, /usr/ucb/*
       and /usr/openwin/bin/* (66,304 strings, avg length 21 characters).

    3) A list of URLs gathered by a web-crawler that ran for several
       hours last night (28,372 strings, avg length 49 characters).

    The performance metric shown in the table is the "average chain size"
    over all elements in the hash table (i.e., the expected value of the
    number of key compares to look up an element).

                              Webster's   Code Strings    URLs
                              ---------   ------------    ----
    Current Java Fn.          1.2509      1.2738          13.2560
    P(37)    [Java]           1.2508      1.2481          1.2454
    P(65599) [Aho et al]      1.2490      1.2510          1.2450
    P(31)    [K+R]            1.2500      1.2488          1.2425
    P(33)    [Torek]          1.2500      1.2500          1.2453
    Vo's Fn                   1.2487      1.2471          1.2462
    WAIS Fn                   1.2497      1.2519          1.2452
    Weinberger's Fn(MatPak)   6.5169      7.2142          30.6864
    Weinberger's Fn(24)       1.3222      1.2791          1.9732
    Weinberger's Fn(28)       1.2530      1.2506          1.2439

    Looking at this table, it's clear that all of the functions except for
    the current Java function and the two broken versions of Weinberger's
    function offer excellent, nearly indistinguishable performance. I
    strongly conjecture that this performance is essentially the
    "theoretical ideal", which is what you'd get if you used a true random
    number generator in place of a hash function.

    I'd rule out the WAIS function as its specification contains pages of
    random numbers, and its performance is no better than any of the
    far simpler functions. Any of the remaining six functions seem like
    excellent choices, but we have to pick one. I suppose I'd rule out
    Vo's variant and Weinberger's function because of their added
    complexity, albeit minor. Of the remaining four, I'd probably select
    P(31), as it's the cheapest to calculate on a RISC machine (because 31
    is the difference of two powers of two). P(33) is similarly cheap to
    calculate, but it's performance is marginally worse, and 33 is
    composite, which makes me a bit nervous.

    Josh
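To make the metric concrete, here is a rough sketch of how an "expected number of key compares" figure could be measured for a polynomial hash P(m). This is one possible reading of the metric, not Bloch's actual benchmark: the class ChainSize, the helper poly, the bucket count, and the assumption that an element at position k in its chain costs k compares are all illustrative choices.

```java
import java.util.Arrays;
import java.util.List;

class ChainSize {
    // Polynomial hash P(multiplier), evaluated with wrapping int arithmetic.
    static int poly(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Average number of key compares for a successful lookup, assuming
    // separate chaining and that the k-th element of a chain costs k compares.
    static double averageCompares(List<String> keys, int multiplier, int buckets) {
        int[] chainLength = new int[buckets];
        for (String key : keys) {
            chainLength[Math.floorMod(poly(key, multiplier), buckets)]++;
        }
        long compares = 0;
        for (int len : chainLength) {
            compares += (long) len * (len + 1) / 2; // 1 + 2 + ... + len
        }
        return (double) compares / keys.size();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        System.out.println(averageCompares(keys, 31, 1));   // everything in one chain
        System.out.println(averageCompares(keys, 31, 128)); // every key in its own chain
    }
}
```

A value close to 1 means almost every lookup finds its key on the first compare; large values, like the 13.2560 reported for the old Java function on URLs, indicate long chains.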

User · Answer

A big expectation from hash functions is that the uniform randomness of their result survives an operation such as hash(x) % N, where N is an arbitrary number (and in many cases, a power of two), one reason being that such operations are commonly used in hash tables for determining slots. Using prime number multipliers when computing the hash decreases the probability that your multiplier and N share divisors, which would make the result of the operation less uniformly random.

Others have pointed out the nice property that multiplication by 31 can be done by a shift and a subtraction. I just want to point out that there is a mathematical term for such primes: Mersenne prime.

All Mersenne primes are one less than a power of two, so we can write them as:

    p = 2^n - 1

Multiplying x by p:

    x * p = x * (2^n - 1) = x * 2^n - x = (x << n) - x

Shifts (SAL/SHL) and subtractions (SUB) are generally faster than multiplications (MUL) on many machines. See the instruction tables from Agner Fog. That's why GCC seems to optimize multiplications by Mersenne primes by replacing them with shifts and subs.

However, in my opinion, such a small prime is a bad choice for a hash function. With a relatively good hash function, you would expect to have randomness at the higher bits of the hash. However, with the Java hash function, there is almost no randomness at the higher bits with shorter strings (and still highly questionable randomness at the lower bits). This makes it more difficult to build efficient hash tables, and there are nice tricks you couldn't do with the Java hash function.

Some answers mention that they believe it is good that 31 fits into a byte. This is actually useless since:

(1) We execute shifts instead of multiplications, so the size of the multiplier does not matter.
(2) As far as I know, there is no specific x86 instruction to multiply an 8-byte value with a 1-byte value, so you would have needed to convert "31" to an 8-byte value anyway even if you were multiplying. You multiply entire 64-bit registers.

(And 127 is actually the largest Mersenne prime that could fit in a byte.)

Does a smaller value increase randomness in the middle-lower bits? Maybe, but it also seems to greatly increase the possible collisions :).

One could list many different issues, but they generally boil down to two core principles not being fulfilled well: confusion and diffusion.

But is it fast? Probably, since it doesn't do much. However, if performance is really the focus here, one character per loop is quite inefficient. Why not do 4 characters at a time (8 bytes) per loop iteration for longer strings? Well, that would be difficult to do with the current definition of hash, where you need to multiply every character individually (please tell me if there is a bit hack to solve this :D).
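The point about the higher bits can be checked with a small sketch. The helper usedBits (a made-up name) reports how many low-order bits of a hash are needed to represent it; for short lowercase strings the count stays far below 32, growing by roughly five bits per extra character.

```java
class HighBits {
    // Number of bits needed to represent the hash, i.e. the position of its
    // highest set bit. Returns 32 once the hash has wrapped to a negative value.
    static int usedBits(String s) {
        return 32 - Integer.numberOfLeadingZeros(s.hashCode());
    }

    public static void main(String[] args) {
        for (String s : new String[] {"zz", "zzz", "zzzz"}) {
            System.out.println(s + " -> " + usedBits(s) + " bits");
        }
    }
}
```

Since 'z' is the largest lowercase ASCII letter, these strings give an upper bound for all lowercase strings of the same length; every bit above that position is always zero for such keys.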

User · Answer

Neil Coffey explains why 31 is used under "Ironing out the bias". Basically, using 31 gives you a more even set-bit probability distribution for the hash function.

User · Answer

By multiplying, bits are shifted to the left. This uses more of the available space of hash codes, reducing collisions.

By not using a power of two, the lower-order, rightmost bits are populated as well, to be mixed with the next piece of data going into the hash.

The expression n * 31 is equivalent to (n << 5) - n.
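As a quick check of that equivalence, here is a small sketch (the class name ShiftMultiply and the method times31 are illustrative). Because Java int arithmetic wraps modulo 2^32 for both multiplication and shifting, the identity also holds when the result overflows.

```java
class ShiftMultiply {
    // Multiplies by 31 using one shift and one subtraction: 32*n - n.
    static int times31(int n) {
        return (n << 5) - n;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 7, -5, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int n : samples) {
            // Prints true for every sample, including the overflowing ones.
            System.out.println(n + ": " + (times31(n) == 31 * n));
        }
    }
}
```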

User · Answer

On (mostly) old processors, multiplying by 31 can be relatively cheap. On an ARM, for instance, it is only one instruction:

    RSB       r1, r0, r0, ASL #5    ; r1 := - r0 + (r0<<5)

Most other processors would require a separate shift and subtract instruction. However, if your multiplier is slow this is still a win. Modern processors tend to have fast multipliers, so it doesn't make much difference, so long as 32 goes on the correct side.

It's not a great hash algorithm, but it's good enough and better than the 1.0 code (and very much better than the 1.0 spec!).

User · Answer

I'm not sure, but I would guess they tested some sample of prime numbers and found that 31 gave the best distribution over some sample of possible Strings.

User · Answer

Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.

See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.

User · Answer

According to Joshua Bloch's Effective Java (a book that can't be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

    The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashCode when you override equals, page 48)
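The "information would be lost" remark can be illustrated with a small sketch. It uses a hypothetical helper poly that evaluates the same polynomial with an arbitrary multiplier, and compares the even multiplier 32 with 31. With 32, the first character of an 8-character string is multiplied by 32^7 = 2^35, which is 0 in 32-bit arithmetic, so it no longer affects the hash at all.

```java
class EvenMultiplier {
    // Polynomial hash with an arbitrary multiplier, wrapping int arithmetic.
    static int poly(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String a = "axxxxxxx";
        String b = "bxxxxxxx";
        // Multiplier 32: the leading character is shifted out entirely.
        System.out.println(poly(a, 32) == poly(b, 32)); // true
        // Multiplier 31: the leading character still contributes.
        System.out.println(poly(a, 31) == poly(b, 31)); // false
    }
}
```

The strings were chosen only to make the effect visible; any two strings of eight or more characters that differ solely in the first character would collide under multiplier 32.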

User · Answer

This is because 31 has a nice property: its multiplication can be replaced by a bitwise shift and a subtraction, which is faster than the standard multiplication:

    31 * i == (i << 5) - i

User · Answer

Bloch doesn't quite go into this, but the rationale I've always heard/believed is that this is basic algebra. Hashes boil down to multiplication and modulus operations, which means that you never want to use numbers with common factors if you can help it. In other words, relatively prime numbers provide an even distribution of answers.

The numbers that make up using a hash are typically:

- modulus of the data type you put it into (2^32 or 2^64)
- modulus of the bucket count in your hashtable (varies; in Java it used to be prime, now 2^n)
- multiply or shift by a magic number in your mixing function
- the input value

You really only get to control a couple of these values, so a little extra care is due.
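A small sketch of the common-factor problem, under assumed parameters: a table with 16 buckets (a power of two) and 26 two-character keys "ax" through "zx". The names BucketSpread, poly, sampleKeys and distinctBuckets are made up for illustration, and the bucket index is taken as a plain modulus, which is a simplification of what real hash tables do.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BucketSpread {
    // Polynomial hash with an arbitrary multiplier, wrapping int arithmetic.
    static int poly(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Keys "ax", "bx", ..., "zx": they differ only in the first character.
    static List<String> sampleKeys() {
        List<String> keys = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            keys.add(c + "x");
        }
        return keys;
    }

    // Counts how many different buckets the keys land in.
    static int distinctBuckets(List<String> keys, int multiplier, int buckets) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(Math.floorMod(poly(key, multiplier), buckets));
        }
        return used.size();
    }

    public static void main(String[] args) {
        List<String> keys = sampleKeys();
        System.out.println(distinctBuckets(keys, 32, 16)); // 1: multiplier shares factors with 16
        System.out.println(distinctBuckets(keys, 31, 16)); // 16: multiplier is coprime to 16
    }
}
```

With multiplier 32, the first character's contribution is a multiple of 16 and vanishes under the modulus, so all 26 keys share one bucket. With 31, which has no factor in common with 16, the keys spread over every bucket.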

User · Answer

Actually, 37 would work pretty well! z := 37 * x can be computed as y := x + 8 * x; z := x + 4 * y. Both steps correspond to one LEA x86 instruction each, so this is extremely fast.

In fact, multiplication with the even-larger prime 73 could be done at the same speed by setting y := x + 8 * x; z := x + 8 * y.

Using 73 or 37 (instead of 31) might be better, because it leads to denser code: the two LEA instructions only take 6 bytes vs. the 7 bytes for move+shift+subtract for the multiplication by 31. One possible caveat is that the 3-argument LEA instructions used here became slower on Intel's Sandy Bridge architecture, with an increased latency of 3 cycles.

Moreover, 73 is Sheldon Cooper's favorite number.
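The two-step decomposition can be verified with a short sketch. This only checks the arithmetic in Java; whether a given JIT or compiler actually emits LEA instructions for it is not something the code demonstrates. The class and method names are illustrative.

```java
class LeaMultiply {
    // 37 * x as two "x + k * y" steps: y = 9x, then z = x + 4 * 9x = 37x.
    static int times37(int x) {
        int y = x + 8 * x;
        return x + 4 * y;
    }

    // 73 * x in the same shape: y = 9x, then z = x + 8 * 9x = 73x.
    static int times73(int x) {
        int y = x + 8 * x;
        return x + 8 * y;
    }

    public static void main(String[] args) {
        int x = 12345;
        System.out.println(times37(x) == 37 * x); // true
        System.out.println(times73(x) == 73 * x); // true
    }
}
```

Each step has the form base + scale * index with a scale of 4 or 8, which is the shape of address computation that LEA can perform.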

User · Answer

In the latest version of the JDK, 31 is still used: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/lang/String.html#hashCode()

The purpose of a string hash is:

- uniqueness (see the ^ operator, which denotes exponentiation, in the documented hashCode calculation; it helps keep hashes distinct)
- cheap calculation

31 fits in an 8-bit (1-byte) register, is prime, and is odd.

Multiplying by 31 is a shift left by 5 followed by subtracting the original value, so it needs few resources.

User · Answer

You can read Bloch's original reasoning under "Comments" in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting "average chain size" in a hash table. P(31) was one of the common functions during that time, which he found in K&R's book (but even Kernighan and Ritchie couldn't remember where it came from). In the end he basically had to choose one, and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime:

    Of the remaining four, I'd probably select P(31), as it's the cheapest to calculate on a RISC machine (because 31 is the difference of two powers of two). P(33) is similarly cheap to calculate, but it's performance is marginally worse, and 33 is composite, which makes me a bit nervous.

So the reasoning was not as rational as many of the answers here seem to imply. But we're all good at coming up with rational reasons after gut decisions (and even Bloch might be prone to that).
