[python] Removing duplicates from a list of lists

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

And I want to remove duplicate elements from it. If it were a normal list, not a list of lists, I could use set. But unfortunately lists are not hashable, so I can't make a set of lists, only of tuples. I can turn all the lists into tuples, use set, and convert back to lists, but that isn't fast.

How can this be done in the most efficient way?

The result for the above list should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don't care about preserving order.

Note: this question is similar but not quite what I need. I searched SO but didn't find an exact duplicate.
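
For reference, the tuple round-trip I described looks like this:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
k = [list(t) for t in set(tuple(x) for x in k)]
print(k)  # e.g. [[5, 6, 2], [1, 2], [3], [4]] -- set order is arbitrary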


Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, exc_type, exc_value, traceback):
        if self.name:
            print('[%s]' % self.name, end=' ')
        print('Elapsed: %s' % (time.time() - self.tstart))


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print(len(k))

with Timer('set'):
    for _ in range(N):
        kt = [tuple(x) for x in k]
        skt = set(kt)
        kk = [list(t) for t in skt]


with Timer('sort'):
    for _ in range(N):
        ks = sorted(k)
        dedup = [ks[i] for i in range(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for _ in range(N):
        ks = sorted(k)  # don't rebind k, or the later benchmarks would see a sorted list
        dedup = [key for key, _ in itertools.groupby(ks)]

with Timer('loop in'):
    for _ in range(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

"loop in" (quadratic method) fastest of all for short lists. For long lists it's faster then everyone except groupby method. Does this make sense?

For the short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For the longer list (the one in the code repeated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599


All the set-related solutions to this problem thus far require creating an entire set before iteration.

It is possible to make this lazy, and at the same time preserve order, by iterating over the list of lists and adding each one to a "seen" set, yielding a list only if it is not already in this tracker set.
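
A minimal sketch of that idea (essentially the unique_everseen recipe, slightly simplified):

def unique_everseen(iterable, key=None):
    """Yield unique elements in order, remembering everything seen so far."""
    seen = set()
    for element in iterable:
        marker = element if key is None else key(element)
        if marker not in seen:
            seen.add(marker)
            yield element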

This unique_everseen recipe is available in the itertools docs. It's also available in the 3rd party toolz library:

from toolz import unique

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# lazy iterator
res = map(list, unique(map(tuple, k)))

print(list(res))

[[1, 2], [4], [5, 6, 2], [3]]

Note that tuple conversion is necessary because lists are not hashable.


Another, arguably more generic and simpler, solution is to create a dictionary keyed by the string version of the objects, taking the values() at the end:

>>> list({str(a): a for a in [["A", "A"], ["A", "A"], ["A", "B"]]}.values())
[['A', 'A'], ['A', 'B']]

The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).
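
For example (a hypothetical Point class, just to illustrate the caveat), objects that fall back to the default repr stringify with their memory address, so two equal objects get different keys and are not merged:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)
    # no __repr__, so str() gives e.g. '<__main__.Point object at 0x7f...>'

p, q = Point(1, 2), Point(1, 2)
print([p] == [q])                            # True -- equal by value
print(len({str(a): a for a in [[p], [q]]}))  # 2 -- but not deduplicated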


>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]

I don't know if it's necessarily faster, but you don't have to resort to tuples and sets.


Strangely, the answers above remove the 'duplicates', but what if I want to remove the duplicated values entirely? The following should be useful: it zeroes out duplicates in place and then filters them out:

a = [[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'],
     [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'],
     [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]

print(a)
temp = 0
position = -1
for pageNo, item in a:
    position += 1
    if pageNo != temp:
        temp = pageNo
        continue
    else:
        # mark this entry and the previous one (same pageNo) for removal
        a[position] = 0
        a[position - 1] = 0
a = [x for x in a if x != 0]
print(a)

and the output is:

[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]
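
A shorter way to get the same result is a sketch with collections.Counter: drop every pair whose first field occurs more than once. This matches the output above because the example list is grouped by its first field.

from collections import Counter

a = [[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'],
     [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'],
     [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]

counts = Counter(page for page, _ in a)           # occurrences of each first field
a = [pair for pair in a if counts[pair[0]] == 1]  # keep only unique first fields
print(a)  # [[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]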

a_list = [
    [1, 2],
    [1, 2],
    [2, 3],
    [3, 4],
]

print(list(map(list, set(map(tuple, a_list)))))

outputs: [[1, 2], [3, 4], [2, 3]]


A set comprehension ({...}) over tuples can be used to remove duplicates:

>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]

This should work, though note that it compares set(ele), so two lists count as duplicates when they contain the same elements, regardless of order or repetition.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)

# output: [[1, 2], [4], [5, 6, 2], [3]]

For a bit of background: I just started with Python and learned comprehensions.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
joined = {'.'.join(str(n) for n in _list) for _list in k}       # e.g. '5.6.2'
dedup = [[int(s) for s in elem.split('.')] for elem in joined]  # back to ints; split() returns strings

Create a dictionary with tuple as the key, and print the keys.

  • create dictionary with tuple as key and index as value
  • print list of keys of dictionary

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dict_tuple = {tuple(item): index for index, item in enumerate(k)}

print([list(itm) for itm in dict_tuple.keys()])

# prints [[1, 2], [4], [5, 6, 2], [3]] on Python 3.7+, where dicts keep insertion order
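
On Python 3.7+ the same idea reads more directly with dict.fromkeys, since the index values aren't actually needed (a sketch):

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dedup = [list(t) for t in dict.fromkeys(map(tuple, k))]
print(dedup)  # [[1, 2], [4], [5, 6, 2], [3]]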

Doing it manually, creating a new k list and adding entries not found so far:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print(k)
# prints [[1, 2], [4], [5, 6, 2], [3]]

Simple to comprehend, and you preserve the order of the first occurrence of each element should that be useful, but I guess it's quadratic in complexity as you're searching the whole of new_k for each element.
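
If the quadratic membership test becomes a problem, the same order-preserving loop runs in roughly linear time with a "seen" set of tuples, for example:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

seen = set()
new_k = []
for elem in k:
    t = tuple(elem)  # tuples are hashable, lists are not
    if t not in seen:
        seen.add(t)
        new_k.append(elem)
print(new_k)  # [[1, 2], [4], [5, 6, 2], [3]]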


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [3], [8], [9]]
kl = []
kl.extend(x for x in k if x not in kl)
k = list(kl)
print(k)

which prints,

[[1, 2], [4], [5, 6, 2], [3], [5, 2], [8], [9]]

Even your "long" list is pretty short. Also, did you choose these lists to match the actual data? Performance will vary with what the data actually look like. For example, you have a short list repeated over and over to make a longer list. This means that the quadratic solution is effectively linear in your benchmarks (the number of distinct elements stays constant, so each membership scan is bounded), but not in reality.

For actually-large lists, the set code is your best bet: it's linear (although space-hungry). The sort and groupby methods are O(n log n) and the 'loop in' method is obviously quadratic, so you know how these will scale as n gets really big. If this is the real size of the data you are analyzing, then who cares? It's tiny.

Incidentally, I'm seeing a noticeable speedup if I don't form an intermediate list to make the set, that is to say if I replace

kt = [tuple(x) for x in k]
skt = set(kt)

with

skt = set(tuple(x) for x in k)
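
In modern Python the same thing can be written as a set comprehension, which I'd expect to perform about the same (I haven't timed it here):

skt = {tuple(x) for x in k}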

The real solution may depend on more information: Are you sure that a list of lists is really the representation you need?