[python] Python: List vs Dict for lookup table

I have about 10 million values that I need to put in some type of lookup table, so I was wondering which would be more efficient: a list or a dict?

I know you can do something like this for both:

if something in dict_of_stuff:
    pass

and

if something in list_of_stuff:
    pass

My thought is the dict will be faster and more efficient.

Thanks for your help.

EDIT 1
A little more info on what I'm trying to do: Euler Problem 92. I'm making a lookup table to see if a value has already been calculated.

EDIT 2
Efficiency for lookup.

EDIT 3
There are no values associated with the value... so would a set be better?

This question is related to: python, performance

The answer is


You want a dict.

For (unsorted) lists in Python, the "in" operation requires O(n) time, which is not good when you have a large amount of data. A dict, on the other hand, is a hash table, so you can expect O(1) lookup time.

As others have noted, you might choose a set (which is implemented much like a dict) instead, if you only have keys rather than key/value pairs.
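
A minimal illustration of that gap (the container size and variable names below are arbitrary, not from the question):

import timeit

# Build a list and a set holding the same one million integers.
data_list = list(range(1000000))
data_set = set(data_list)

missing = -1  # worst case for the list: every element gets scanned

# Membership in a list is O(n); in a set (or in dict keys) it is O(1) on average.
print(timeit.timeit(lambda: missing in data_list, number=100))  # scans the whole list 100 times
print(timeit.timeit(lambda: missing in data_set, number=100))   # 100 hash lookups, effectively instant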

Related:

  • Python wiki: information on the time complexity of Python container operations.
  • SO: Python container operation time and memory complexities

If the data are unique, set() will be the most efficient, but of the two, dict (which also requires uniqueness, oops :)
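
A quick illustration of that uniqueness point (the sample values here are made up): both a set and a dict's keys silently collapse duplicates, which is exactly what you want from a membership table.

values = [1, 2, 2, 3, 3, 3]

print(set(values))            # only the unique values remain
print(dict.fromkeys(values))  # dict keys deduplicate the same way; values default to None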


You don't actually need to store 10 million values in the table, so it's not a big deal either way.

Hint: think about how large your result can be after the first sum of squares operation. The largest possible result will be much smaller than 10 million...
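
If you want to check the hint for yourself, here is a tiny sketch (the helper name is mine, purely for illustration):

# Sum of the squares of the digits of n, the operation Problem 92 iterates.
def digit_square_sum(n):
    return sum(int(d) ** 2 for d in str(n))

# The worst case below 10 million is a 7-digit number made entirely of 9s,
# so the first step can never produce anything larger than this:
print(digit_square_sum(9999999))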


A dict is a hash table, so it is really fast to find the keys. So between dict and list, dict would be faster. But if you don't have a value to associate, it is even better to use a set. It is a hash table without the "table" part: it stores only the keys.


EDIT: for your new question, YES, a set would be better. Just create 2 sets, one for sequences that end in 1 and another for sequences that end in 89. I have successfully solved this problem using sets.
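
A minimal sketch of that two-set approach (the helper name and the small cache cap are my own choices, not necessarily how the answer's author implemented it):

def digit_square_sum(n):
    # Sum of the squares of the digits of n.
    total = 0
    while n:
        n, d = divmod(n, 10)
        total += d * d
    return total

# Every chain drops below this bound after one step (see the hint above),
# so the lookup sets never need to hold more than a few hundred small values.
CACHE_LIMIT = digit_square_sum(9999999)

ends_in_1 = {1}
ends_in_89 = {89}
count_89 = 0

for start in range(1, 10000000):
    n = start
    seen = []
    # Follow the chain until we hit a number whose fate is already known.
    while n not in ends_in_1 and n not in ends_in_89:
        if n <= CACHE_LIMIT:
            seen.append(n)
        n = digit_square_sum(n)
    winner = ends_in_1 if n in ends_in_1 else ends_in_89
    winner.update(seen)  # cached set lookups keep later chains very short
    if winner is ends_in_89:
        count_89 += 1

print(count_89)  # how many starting numbers below 10 million arrive at 89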


As a new set of tests to show @EriF89 is still right after all these years:

$ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.84 msec per loop
$ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 573 msec per loop
$ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 587 msec per loop
$ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.88 msec per loop

Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. In the case of a lookup table, the tuple fared no better.

Both the dict and set performed very well. This brings up an interesting point tying into @SilentGhost's answer about uniqueness: if the OP has 10M values in a data set, and it's unknown whether there are duplicates in them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points only have 10 unique values, which is a much smaller space to search!

SilentGhost's mistake about dicts is actually illuminating because one could use a dict to correlate duplicated data (in values) into a nonduplicated set (keys), and thus keep one data object to hold all data, yet still be fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> l=[1,2,3,1,2,1,4]
>>> for i, e in enumerate(l):
...     d[e].append(i)
...
>>> d
defaultdict(<class 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})

With this dict, one can know:

  1. If a value was in the original dataset (i.e. 2 in d returns True)
  2. Where the value was in the original dataset (i.e. d[2] returns the list of indices where the value was found in the original data list: [1, 4])

set() is exactly what you want. O(1) lookups, and smaller than a dict.


I did some benchmarking and it turns out that dict is faster than both list and set for large data sets, running Python 2.7.3 on an i7 CPU on Linux:

  • python -mtimeit -s 'd=range(10**7)' '5*10**6 in d'

    10 loops, best of 3: 64.2 msec per loop

  • python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'

    10000000 loops, best of 3: 0.0759 usec per loop

  • python -mtimeit -s 'from sets import Set; d=Set(range(10**7))' '5*10**6 in d'

    1000000 loops, best of 3: 0.262 usec per loop

As you can see, dict is considerably faster than list and about 3 times faster than set here (note that this test uses the legacy sets.Set class rather than the builtin set, which is typically much closer to dict). In some applications you might still want to choose set for the beauty of it, though. And if the data sets are really small (< 1000 elements), lists perform pretty well.
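
For anyone repeating this on Python 3, the equivalent commands would look roughly like the following (not re-timed here; the builtin set replaces the legacy sets.Set, and range must be wrapped in list() to reproduce the list case):

python -mtimeit -s 'd=list(range(10**7))' '5*10**6 in d'
python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'
python -mtimeit -s 'd=set(range(10**7))' '5*10**6 in d'

On Python 3 a bare range object answers the membership test in O(1) itself, which is why it has to be converted to a real list for a fair comparison.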

