If I have a Python list that has many duplicates, and I want to iterate through each item but not through the duplicates, is it best to use a set (as in set(mylist)), or to find another way to create a list without duplicates? I was thinking of just looping through the list and checking for duplicates, but I figured that's essentially what set() does when it's initialized.
So if mylist = [3,1,5,2,4,4,1,4,2,5,1,3] and I really just want to loop through [1,2,3,4,5] (order doesn't matter), should I use set(mylist) or something else?
An alternative is possible in the last example: since the list contains every integer between its min and max value, I could loop through range(min(mylist), max(mylist)) instead of through set(mylist). Should I generally try to avoid using set in this case? Also, would finding the min and max be slower than just creating the set?
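For concreteness, here is a sketch of both approaches on the example list (note that range excludes its stop value, so a +1 is needed for the range version to include the maximum):

```python
mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# Option 1: deduplicate with a set (iteration order not guaranteed)
unique_via_set = set(mylist)

# Option 2: exploit the fact that the list contains every integer
# between its min and max (range excludes its stop value, hence +1)
unique_via_range = list(range(min(mylist), max(mylist) + 1))

print(sorted(unique_via_set))  # [1, 2, 3, 4, 5]
print(unique_via_range)        # [1, 2, 3, 4, 5]
```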
In the case of the last example, the set is faster:
from numpy.random import random_integers
ids = random_integers(1000, size=1000000)

def set_loop(mylist):
    idlist = []
    for id in set(mylist):
        idlist.append(id)
    return idlist

def list_loop(mylist):
    idlist = []
    for id in range(min(mylist), max(mylist)):
        idlist.append(id)
    return idlist

%timeit set_loop(ids)
#1 loops, best of 3: 232 ms per loop
%timeit list_loop(ids)
#1 loops, best of 3: 408 ms per loop
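Note that list_loop as written silently drops the largest id, because range excludes its stop value. A corrected sketch:

```python
def list_loop_fixed(mylist):
    # range excludes its stop value, so add 1 to include max(mylist)
    return list(range(min(mylist), max(mylist) + 1))

print(list_loop_fixed([3, 1, 5, 2, 4]))  # [1, 2, 3, 4, 5]
```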
If the list is very large, looping over it twice will take a lot of time, and worse, the second time you are looping over a set, not a list, and iterating over a set is slower than iterating over a list. I think you need the combined power of a generator and a set.
def first_test():
    def loop_one_time(my_list):
        # create a set to keep track of the items already seen
        iterated_items = set()
        # iterating over a list is faster than iterating over a set
        for value in my_list:
            # membership checks on a set are very fast, no matter
            # the size of the set
            if value not in iterated_items:
                iterated_items.add(value)  # mark this item as seen
                yield value
    mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
    for v in loop_one_time(mylist): pass

def second_test():
    mylist = [3,1,5,2,4,4,1,4,2,5,1,3]
    s = set(mylist)
    for v in s: pass

import timeit
print(timeit.timeit('first_test()', setup='from __main__ import first_test', number=10000))
print(timeit.timeit('second_test()', setup='from __main__ import second_test', number=10000))
Output:
0.024003583388435043
0.010424674188938422
Note: with this technique, the original order of the items is preserved.
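To see the order guarantee, the generator can be pulled out and checked directly (same logic as loop_one_time above, under a standalone name):

```python
def dedupe_in_order(items):
    # yield each item the first time it is seen, preserving order
    seen = set()
    for value in items:
        if value not in seen:
            seen.add(value)
            yield value

mylist = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]
print(list(dedupe_in_order(mylist)))  # [3, 1, 5, 2, 4]
```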
While a set may be what you want structure-wise, the question is what is faster. A list is faster. Your example code doesn't accurately compare set vs. list, because you're converting from a list to a set inside set_loop, and you're creating the list you'll be looping through inside list_loop. The set and the list you iterate through should be constructed and in memory ahead of time, and simply looped through, to see which data structure is faster to iterate over:
ids_list = list(range(1000000))
ids_set = set(ids_list)

def f(x):
    for i in x:
        pass
%timeit f(ids_set)
#1 loops, best of 3: 214 ms per loop
%timeit f(ids_list)
#1 loops, best of 3: 176 ms per loop
For simplicity's sake: newList = list(set(oldList))
But there are better options out there if you'd like to get speed/ordering/optimization instead: http://www.peterbe.com/plog/uniqifiers-benchmark
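One option from that family worth knowing: since Python 3.7, dicts preserve insertion order, so dict.fromkeys gives an order-preserving one-liner for deduplication:

```python
oldList = [3, 1, 5, 2, 4, 4, 1, 4, 2, 5, 1, 3]

# dict keys are unique and (since Python 3.7) keep insertion order
newList = list(dict.fromkeys(oldList))
print(newList)  # [3, 1, 5, 2, 4]
```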
set is what you want, so you should use set. Trying to be clever introduces subtle bugs, like forgetting to add one to max(mylist)! Code defensively. Worry about what's faster only once you've determined that it's too slow.
range(min(mylist), max(mylist) + 1) # <-- don't forget to add 1
Source: Stackoverflow.com