[python] Python Sets vs Lists

In Python, which data structure is more efficient/speedy? Assuming that order is not important to me and I would be checking for duplicates anyway, is a Python set slower than a Python list?

This question is related to python list performance data-structures set

The answer is


List performance:

>>> import timeit
>>> timeit.timeit(stmt='10**6 in a', setup='a = range(10**6)', number=100000)
0.008128150348026608

Set performance:

>>> timeit.timeit(stmt='10**6 in a', setup='a = set(range(10**6))', number=100000)
0.005674857488571661

You may want to consider Tuples as they're similar to lists but can’t be modified. They take up slightly less memory and are faster to access. They aren’t as flexible but are more efficient than lists. Their normal use is to serve as dictionary keys.

Sets are also sequence structures but with two differences from lists and tuples. Although sets do have an order, that order is arbitrary and not under the programmer’s control. The second difference is that the elements in a set must be unique.

set by definition. [python | wiki].

>>> x = set([1, 1, 2, 2, 3, 3])
>>> x
{1, 2, 3}

I would recommend a Set implementation where the use case is limit to referencing or search for existence and Tuple implementation where the use case requires you to perform iteration. A list is a low-level implementation and requires significant memory overhead.


tl;dr

Data structures (DS) are important because they are used to perform operations on data which basically implies: take some input, process it, and give back the output.

Some data structures are more useful than others in some particular cases. Therefore, it is quite unfair to ask which (DS) is more efficient/speedy. It is like asking which tool is more efficient between a knife and fork. I mean all depends on the situation.

Lists

A list is mutable sequence, typically used to store collections of homogeneous items.

Sets

A set object is an unordered collection of distinct hashable objects. It is commonly used to test membership, remove duplicates from a sequence, and compute mathematical operations such as intersection, union, difference, and symmetric difference.

Usage

From some of the answers, it is clear that a list is quite faster than a set when iterating over the values. On the other hand, a set is faster than a list when checking if an item is contained within it. Therefore, the only thing you can say is that a list is better than a set for some particular operations and vice-versa.


Lists are slightly faster than sets when you just want to iterate over the values.

Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.

It turns out tuples perform in almost exactly the same way as lists, except for their immutability.

Iterating

>>> def iter_test(iterable):
...     for i in iterable:
...         pass
...
>>> from timeit import timeit
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = set(range(10000))",
...     number=100000)
12.666952133178711
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = list(range(10000))",
...     number=100000)
9.917098999023438
>>> timeit(
...     "iter_test(iterable)",
...     setup="from __main__ import iter_test; iterable = tuple(range(10000))",
...     number=100000)
9.865639209747314

Determine if an object is present

>>> def in_test(iterable):
...     for i in range(1000):
...         if i in iterable:
...             pass
...
>>> from timeit import timeit
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = set(range(1000))",
...     number=10000)
0.5591847896575928
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = list(range(1000))",
...     number=10000)
50.18339991569519
>>> timeit(
...     "in_test(iterable)",
...     setup="from __main__ import in_test; iterable = tuple(range(1000))",
...     number=10000)
51.597304821014404

from datetime import datetime
listA = range(10000000)
setA = set(listA)
tupA = tuple(listA)
#Source Code

def calc(data, type):
start = datetime.now()
if data in type:
print ""
end = datetime.now()
print end-start

calc(9999, listA)
calc(9999, tupA)
calc(9999, setA)

Output after comparing 10 iterations for all 3 : Comparison


I was interested in the results when checking, with CPython, if a value is one of a small number of literals. set wins in Python 3 vs tuple, list and or:

from timeit import timeit

def in_test1():
  for i in range(1000):
    if i in (314, 628):
      pass

def in_test2():
  for i in range(1000):
    if i in [314, 628]:
      pass

def in_test3():
  for i in range(1000):
    if i in {314, 628}:
      pass

def in_test4():
  for i in range(1000):
    if i == 314 or i == 628:
      pass

print("tuple")
print(timeit("in_test1()", setup="from __main__ import in_test1", number=100000))
print("list")
print(timeit("in_test2()", setup="from __main__ import in_test2", number=100000))
print("set")
print(timeit("in_test3()", setup="from __main__ import in_test3", number=100000))
print("or")
print(timeit("in_test4()", setup="from __main__ import in_test4", number=100000))

Output:

tuple
4.735646052286029
list
4.7308746771886945
set
3.5755991376936436
or
4.687681658193469

For 3 to 5 literals, set still wins by a wide margin, and or becomes the slowest.

In Python 2, set is always the slowest. or is the fastest for 2 to 3 literals, and tuple and list are faster with 4 or more literals. I couldn't distinguish the speed of tuple vs list.

When the values to test were cached in a global variable out of the function, rather than creating the literal within the loop, set won every time, even in Python 2.

These results apply to 64-bit CPython on a Core i7.


Sets are faster, morover you get more functions with sets, such as lets say you have two sets :

set1 = {"Harry Potter", "James Bond", "Iron Man"}
set2 = {"Captain America", "Black Widow", "Hulk", "Harry Potter", "James Bond"}

We can easily join two sets:

set3 = set1.union(set2)

Find out what is common in both:

set3 = set1.intersection(set2)

Find out what is different in both:

set3 = set1.difference(set2)

And much more! Just try them out, they are fun! Moreover if you have to work on the different values within 2 list or common values within 2 lists, I prefer to convert your lists to sets, and many programmers do in that way. Hope it helps you :-)


Set wins due to near instant 'contains' checks: https://en.wikipedia.org/wiki/Hash_table

List implementation: usually an array, low level close to the metal good for iteration and random access by element index.

Set implementation: https://en.wikipedia.org/wiki/Hash_table, it does not iterate on a list, but finds the element by computing a hash from the key, so it depends on the nature of the key elements and the hash function. Similar to what is used for dict. I suspect list could be faster if you have very few elements (< 5), the larger element count the better the set will perform for a contains check. It is also fast for element addition and removal. Also always keep in mind that building a set has a cost !

NOTE: If the list is already sorted, searching the list could be quite fast on small lists, but with more data a set is faster for contains checks.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to list

Convert List to Pandas Dataframe Column Python find elements in one list that are not in the other Sorting a list with stream.sorted() in Java Python Loop: List Index Out of Range How to combine two lists in R How do I multiply each element in a list by a number? Save a list to a .txt file The most efficient way to remove first N elements in a list? TypeError: list indices must be integers or slices, not str Parse JSON String into List<string>

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder

Examples related to data-structures

Program to find largest and second largest number in array golang why don't we have a set datastructure How to initialize a vector with fixed length in R C compiling - "undefined reference to"? List of all unique characters in a string? Binary Search Tree - Java Implementation How to clone object in C++ ? Or Is there another solution? How to check queue length in Python Difference between "Complete binary tree", "strict binary tree","full binary Tree"? Write code to convert given number into words (eg 1234 as input should output one thousand two hundred and thirty four)

Examples related to set

java, get set methods golang why don't we have a set datastructure Simplest way to merge ES6 Maps/Sets? Swift Set to Array JavaScript Array to Set How to sort a HashSet? Python Set Comprehension How to get first item from a java.util.Set? Getting the difference between two sets Python convert set to string and vice versa