[python] What does hash do in python?

TL;DR:

Please refer to the glossary: hash() is used as a shortcut to comparing objects, an object is deemed hashable if it can be compared to other objects. that is why we use hash(). It's also used to access dict and set elements which are implemented as resizable hash tables in CPython.

Technical considerations

  • usually comparing objects (which may involve several levels of recursion) is expensive.
  • preferably, the hash() function is an order of magnitude (or several) less expensive.
  • comparing two hashes is easier than comparing two objects, this is where the shortcut is.

If you read about how dictionaries are implemented, they use hash tables, which means deriving a key from an object is a corner stone for retrieving objects in dictionaries in O(1). That's however very dependent on your hash function to be collision-resistant. The worst case for getting an item in a dictionary is actually O(n).

On that note, mutable objects are usually not hashable. The hashable property means you can use an object as a key. If the hash value is used as a key and the contents of that same object change, then what should the hash function return? Is it the same key or a different one? It depends on how you define your hash function.

Learning by example:

Imagine we have this class:

>>> class Person(object):
...     def __init__(self, name, ssn, address):
...         self.name = name
...         self.ssn = ssn
...         self.address = address
...     def __hash__(self):
...         return hash(self.ssn)
...     def __eq__(self, other):
...         return self.ssn == other.ssn
... 

Please note: this is all based on the assumption that the SSN never changes for an individual (don't even know where to actually verify that fact from authoritative source).

And we have Bob:

>>> bob = Person('bob', '1111-222-333', None)

Bob goes to see a judge to change his name:

>>> jim = Person('jim bo', '1111-222-333', 'sf bay area')

This is what we know:

>>> bob == jim
True

But these are two different objects with different memory allocated, just like two different records of the same person:

>>> bob is jim
False

Now comes the part where hash() is handy:

>>> dmv_appointments = {}
>>> dmv_appointments[bob] = 'tomorrow'

Guess what:

>>> dmv_appointments[jim] #?
'tomorrow'

From two different records you are able to access the same information. Now try this:

>>> dmv_appointments[hash(jim)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in __eq__
AttributeError: 'int' object has no attribute 'ssn'
>>> hash(jim) == hash(hash(jim))
True

What just happened? That's a collision. Because hash(jim) == hash(hash(jim)) which are both integers btw, we need to compare the input of __getitem__ with all items that collide. The builtin int does not have an ssn attribute so it trips.

>>> del Person.__eq__
>>> dmv_appointments[bob]
'tomorrow'
>>> dmv_appointments[jim]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.Person object at 0x7f611bd37110>

In this last example, I show that even with a collision, the comparison is performed, the objects are no longer equal, which means it successfully raises a KeyError.