[python] Built in Python hash() function

Windows XP, Python 2.5:

hash('http://stackoverflow.com') Result: 1934711907

Google App Engine (http://shell.appspot.com/):

hash('http://stackoverflow.com') Result: -5768830964305142685

Why is that? How can I have a hash function that will give me same results across different platforms (Windows, Linux, Mac)?

This question is related to python google-app-engine hash

The answer is


This is the hash function that Google uses in production for python 2.5:

def c_mul(a, b):
  return eval(hex((long(a) * b) & (2**64 - 1))[:-1])

def py25hash(self):
  if not self:
    return 0 # empty
  value = ord(self[0]) << 7
  for char in self:
    value = c_mul(1000003, value) ^ ord(char)
  value = value ^ len(self)
  if value == -1:
    value = -2
  if value >= 2**63:
    value -= 2**64
  return value

As stated in the documentation, built-in hash() function is not designed for storing resulting hashes somewhere externally. It is used to provide object's hash value, to store them in dictionaries and so on. It's also implementation-specific (GAE uses a modified version of Python). Check out:

>>> class Foo:
...     pass
... 
>>> a = Foo()
>>> b = Foo()
>>> hash(a), hash(b)
(-1210747828, -1210747892)

As you can see, they are different, as hash() uses object's __hash__ method instead of 'normal' hashing algorithms, such as SHA.

Given the above, the rational choice is to use the hashlib module.


It probably just asks the operating system provided function, rather than its own algorithm.

As other comments says, use hashlib or write your own hash function.


Hash results varies between 32bit and 64bit platforms

If a calculated hash shall be the same on both platforms consider using

def hash32(value):
    return hash(value) & 0xffffffff

Polynomial hash for strings. 1000000009 and 239 are arbitrary prime numbers. Unlikely to have collisions by accident. Modular arithmetic is not very fast, but for preventing collisions this is more reliable than taking it modulo a power of 2. Of course, it is easy to find a collision on purpose.

mod=1000000009
def hash(s):
    result=0
    for c in s:
        result = (result * 239 + ord(c)) % mod
    return result % mod

The response is absolutely no surprise: in fact

In [1]: -5768830964305142685L & 0xffffffff
Out[1]: 1934711907L

so if you want to get reliable responses on ASCII strings, just get the lower 32 bits as uint. The hash function for strings is 32-bit-safe and almost portable.

On the other side, you can't rely at all on getting the hash() of any object over which you haven't explicitly defined the __hash__ method to be invariant.

Over ASCII strings it works just because the hash is calculated on the single characters forming the string, like the following:

class string:
    def __hash__(self):
        if not self:
            return 0 # empty
        value = ord(self[0]) << 7
        for char in self:
            value = c_mul(1000003, value) ^ ord(char)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

where the c_mul function is the "cyclic" multiplication (without overflow) as in C.


What about sign bit?

For example:

Hex value 0xADFE74A5 represents unsigned 2919134373 and signed -1375832923. Currect value must be signed (sign bit = 1) but python converts it as unsigned and we have an incorrect hash value after translation from 64 to 32 bit.

Be careful using:

def hash32(value):
    return hash(value) & 0xffffffff

Most answers suggest this is because of different platforms, but there is more to it. From the documentation of object.__hash__(self):

By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n²) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

Even running on the same machine will yield varying results across invocations:

$ python -c "print(hash('http://stackoverflow.com'))"
-3455286212422042986
$ python -c "print(hash('http://stackoverflow.com'))"
-6940441840934557333

While:

$ python -c "print(hash((1,2,3)))"
2528502973977326415
$ python -c "print(hash((1,2,3)))"
2528502973977326415

See also the environment variable PYTHONHASHSEED:

If this variable is not set or set to random, a random value is used to seed the hashes of str, bytes and datetime objects.

If PYTHONHASHSEED is set to an integer value, it is used as a fixed seed for generating the hash() of the types covered by the hash randomization.

Its purpose is to allow repeatable hashing, such as for selftests for the interpreter itself, or to allow a cluster of python processes to share hash values.

The integer must be a decimal number in the range [0, 4294967295]. Specifying the value 0 will disable hash randomization.

For example:

$ export PYTHONHASHSEED=0                            
$ python -c "print(hash('http://stackoverflow.com'))"
-5843046192888932305
$ python -c "print(hash('http://stackoverflow.com'))"
-5843046192888932305

At a guess, AppEngine is using a 64-bit implementation of Python (-5768830964305142685 won't fit in 32 bits) and your implementation of Python is 32 bits. You can't rely on object hashes being meaningfully comparable between different implementations.


The value of PYTHONHASHSEED might be used to initialize the hash values.

Try:

PYTHONHASHSEED python -c 'print(hash('http://stackoverflow.com'))'

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to google-app-engine

Problems with installation of Google App Engine SDK for php in OS X Get Public URL for File - Google Cloud Storage - App Engine (Python) Visual Studio Code pylint: Unable to import 'protorpc' Get root password for Google Cloud Engine VM Spring Boot - Cannot determine embedded database driver class for database type NONE What is the difference between Google App Engine and Google Compute Engine? Cross-Origin Request Blocked Class JavaLaunchHelper is implemented in both. One of the two will be used. Which one is undefined ImportError: No module named apiclient.discovery java.lang.ClassNotFoundException: com.sun.jersey.spi.container.servlet.ServletContainer

Examples related to hash

php mysqli_connect: authentication method unknown to the client [caching_sha2_password] What is Hash and Range Primary Key? How to create a laravel hashed password Hashing a file in Python PHP salt and hash SHA256 for login password Append key/value pair to hash with << in Ruby Are there any SHA-256 javascript implementations that are generally considered trustworthy? How do I generate a SALT in Java for Salted-Hash? What does hash do in python? Hashing with SHA1 Algorithm in C#