[python] Python: List vs Dict for lookup table

I have about 10 million values that I need to put in some type of lookup table, so I was wondering which would be more efficient: a list or a dict?

I know you can do something like this for both:

if something in dict_of_stuff:
    pass

and

if something in list_of_stuff:
    pass

My thought is the dict will be faster and more efficient.

Thanks for your help.

EDIT 1
A little more info on what I'm trying to do: Euler Problem 92. I'm making a lookup table to see if a value has already been calculated.

EDIT 2
Efficiency for lookup.

EDIT 3
There are no values associated with the value... so would a set be better?

This question is related to: python, performance

The answer is


You want a dict.

For (unsorted) lists in Python, the "in" operation requires O(n) time, which is not good when you have a large amount of data. A dict, on the other hand, is a hash table, so you can expect O(1) lookup time.

As others have noted, you might choose a set (which is implemented much like a dict) instead, if you only have keys rather than key/value pairs.
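
A minimal illustration of that gap (the container size and variable names below are arbitrary, not from the question):

import timeit

# Build a list and a set holding the same one million integers.
data_list = list(range(1000000))
data_set = set(data_list)

missing = -1  # worst case for the list: every element gets scanned

# Membership in a list is O(n); in a set (or in dict keys) it is O(1) on average.
print(timeit.timeit(lambda: missing in data_list, number=100))  # scans the whole list 100 times
print(timeit.timeit(lambda: missing in data_set, number=100))   # 100 hash lookups, effectively instant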

Related:

  • Python wiki: information on the time complexity of Python container operations.
  • SO: Python container operation time and memory complexities

If the data are unique, set() will be the most efficient, but of the two, dict (which also requires uniqueness, oops :)
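
A quick illustration of that uniqueness point (the sample values here are made up): both a set and a dict's keys silently collapse duplicates, which is exactly what you want from a membership table.

values = [1, 2, 2, 3, 3, 3]

print(set(values))            # only the unique values remain
print(dict.fromkeys(values))  # dict keys deduplicate the same way; values default to None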


You don't actually need to store 10 million values in the table, so it's not a big deal either way.

Hint: think about how large your result can be after the first sum of squares operation. The largest possible result will be much smaller than 10 million...
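
If you want to check the hint for yourself, here is a tiny sketch (the helper name is mine, purely for illustration):

# Sum of the squares of the digits of n, the operation Problem 92 iterates.
def digit_square_sum(n):
    return sum(int(d) ** 2 for d in str(n))

# The worst case below 10 million is a 7-digit number made entirely of 9s,
# so the first step can never produce anything larger than this:
print(digit_square_sum(9999999))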


A dict is a hash table, so it is really fast to find the keys. So between dict and list, dict would be faster. But if you don't have a value to associate, it is even better to use a set. It is a hash table without the "table" part: it stores only the keys.


EDIT: for your new question, YES, a set would be better. Just create 2 sets, one for sequences that end in 1 and another for sequences that end in 89. I have successfully solved this problem using sets.
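
A minimal sketch of that two-set approach (the helper name and the small cache cap are my own choices, not necessarily how the answer's author implemented it):

def digit_square_sum(n):
    # Sum of the squares of the digits of n.
    total = 0
    while n:
        n, d = divmod(n, 10)
        total += d * d
    return total

# Every chain drops below this bound after one step (see the hint above),
# so the lookup sets never need to hold more than a few hundred small values.
CACHE_LIMIT = digit_square_sum(9999999)

ends_in_1 = {1}
ends_in_89 = {89}
count_89 = 0

for start in range(1, 10000000):
    n = start
    seen = []
    # Follow the chain until we hit a number whose fate is already known.
    while n not in ends_in_1 and n not in ends_in_89:
        if n <= CACHE_LIMIT:
            seen.append(n)
        n = digit_square_sum(n)
    winner = ends_in_1 if n in ends_in_1 else ends_in_89
    winner.update(seen)  # cached set lookups keep later chains very short
    if winner is ends_in_89:
        count_89 += 1

print(count_89)  # how many starting numbers below 10 million arrive at 89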


As a new set of tests to show @EriF89 is still right after all these years:

$ python -m timeit -s "l={k:k for k in xrange(5000)}"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.84 msec per loop
$ python -m timeit -s "l=[k for k in xrange(5000)]"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 573 msec per loop
$ python -m timeit -s "l=tuple([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
10 loops, best of 3: 587 msec per loop
$ python -m timeit -s "l=set([k for k in xrange(5000)])"    "[i for i in xrange(10000) if i in l]"
1000 loops, best of 3: 1.88 msec per loop

Here we also compare a tuple, which is known to be faster than a list (and to use less memory) in some use cases. In the case of a lookup table, the tuple fared no better.

Both the dict and set performed very well. This brings up an interesting point tying into @SilentGhost's answer about uniqueness: if the OP has 10M values in a data set, and it's unknown whether there are duplicates in them, then it would be worth keeping a set/dict of its elements in parallel with the actual data set, and testing for existence in that set/dict. It's possible the 10M data points only have 10 unique values, which is a much smaller space to search!

SilentGhost's mistake about dicts is actually illuminating because one could use a dict to correlate duplicated data (in values) into a nonduplicated set (keys), and thus keep one data object to hold all data, yet still be fast as a lookup table. For example, a dict key could be the value being looked up, and the value could be a list of indices in an imaginary list where that value occurred.

For example, if the source data list to be searched was l=[1,2,3,1,2,1,4], it could be optimized for both searching and memory by replacing it with this dict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> l=[1,2,3,1,2,1,4]
>>> for i, e in enumerate(l):
...     d[e].append(i)
...
>>> d
defaultdict(<class 'list'>, {1: [0, 3, 5], 2: [1, 4], 3: [2], 4: [6]})

With this dict, one can know:

  1. If a value was in the original dataset (i.e. 2 in d returns True)
  2. Where the value was in the original dataset (i.e. d[2] returns the list of indices where the value was found in the original data list: [1, 4])

set() is exactly what you want. O(1) lookups, and smaller than a dict.


I did some benchmarking and it turns out that dict is faster than both list and set for large data sets, running Python 2.7.3 on an i7 CPU on Linux:

  • python -mtimeit -s 'd=range(10**7)' '5*10**6 in d'

    10 loops, best of 3: 64.2 msec per loop

  • python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'

    10000000 loops, best of 3: 0.0759 usec per loop

  • python -mtimeit -s 'from sets import Set; d=Set(range(10**7))' '5*10**6 in d'

    1000000 loops, best of 3: 0.262 usec per loop

As you can see, dict is considerably faster than list and about 3 times faster than set here (note that this test uses the legacy sets.Set class rather than the builtin set, which is typically much closer to dict). In some applications you might still want to choose set for the beauty of it, though. And if the data sets are really small (< 1000 elements), lists perform pretty well.
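
For anyone repeating this on Python 3, the equivalent commands would look roughly like the following (not re-timed here; the builtin set replaces the legacy sets.Set, and range must be wrapped in list() to reproduce the list case):

python -mtimeit -s 'd=list(range(10**7))' '5*10**6 in d'
python -mtimeit -s 'd=dict.fromkeys(range(10**7))' '5*10**6 in d'
python -mtimeit -s 'd=set(range(10**7))' '5*10**6 in d'

On Python 3 a bare range object answers the membership test in O(1) itself, which is why it has to be converted to a real list for a fair comparison.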

