Removing duplicates from a list of lists

Question

I have a list of lists in Python   k     1  2    4    5  6  2    1  2    3    4     And I want to remove duplicate elements from it  Was if it a normal list not of lists I could used set  But unfortunate that list is not hashable and can t make set of lists  Only of tuples  So I can turn all lists to tuples then use set and back to lists  But this isn t fast   How can this done in the most efficient way   The result of above list should be   k     5  6  2    1  2    3    4     I don t care about preserve order   Note  this question is similar but not quite what I need  Searched SO but didn t find exact duplicate     Benchmarking   import itertools  time   class Timer object       def   init   self  name None           self name   name      def   enter   self           self tstart   time time        def   exit   self  type  value  traceback           if self name              print    s     self name          print  Elapsed   s     time time   - self tstart    k     1  2    4    5  6  2    1  2    3    5  2    6    8    9     5 N   100000  print len k   with Timer  set        for i in xrange N           kt    tuple i  for i in k          skt   set kt          kk    list i  for i in skt    with Timer  sort        for i in xrange N           ks   sorted k          dedup    ks i  for i in xrange len ks   if i    0 or ks i     ks i-1     with Timer  groupby        for i in xrange N           k   sorted k          dedup   list k for k    in itertools groupby k    with Timer  loop in        for i in xrange N           new k              for elem in k              if elem not in new k                  new k append elem     loop in   quadratic method  fastest of all for short lists  For long lists it s faster then everyone except groupby method  Does this make sense   For short list  the one in the code   100000 iterations    set  Elapsed  1 3900001049  sort  Elapsed  0 891000032425  groupby  Elapsed  0 780999898911  loop in  Elapsed  0 578000068665   For longer list  the one in the code duplicated 5 times     set  Elapsed  3 68700003624  sort  Elapsed  3 43799996376  groupby  Elapsed  1 03099989891  loop in  Elapsed  1 85900020599

User · Answer

All the set-related solutions to this problem thus far require creating an entire set before iteration   It is possible to make this lazy  and at the same time preserve order  by iterating the list of lists and adding to a  seen  set  Then only yield a list if it is not found in this tracker set   This unique everseen recipe is available in the itertools docs  It s also available in the 3rd party toolz library   from toolz import unique  k     1  2    4    5  6  2    1  2    3    4      lazy iterator res   map list  unique map tuple  k     print list res      1  2    4    5  6  2    3     Note that tuple conversion is necessary because lists are not hashable

User · Answer

Another probably more generic and simpler solution is to create a dictionary keyed by the string version of the objects and getting the values() at the end:

>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]

The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).

User · Answer

gt  gt  gt  k     1  2    4    5  6  2    1  2    3    4    gt  gt  gt  k   sorted k   gt  gt  gt  k   1  2    1  2    3    4    4    5  6  2    gt  gt  gt  dedup    k i  for i in range len k   if i    0 or k i     k i-1    gt  gt  gt  dedup   1  2    3    4    5  6  2     I don t know if it s necessarily faster  but you don t have to use to tuples and sets

User · Answer

Strangely  the answers above removes the  duplicates  but what if I want to remove the duplicated value also   The following should be useful and does not create a new object in memory   def dictRemoveDuplicates self       a   1  somevalue1    1  somevalue2    2  somevalue1    3  somevalue4    5  somevalue5    5  somevalue1    5  somevalue1    5  somevalue8    6  somevalue9    6  somevalue0    6  somevalue1    7  somevalue7      print a  temp   0 position   -1 for pageNo  item in a      position  1     if pageNo    temp          temp   pageNo         continue     else          a position    0         a position - 1    0 a    x for x in a if x    0           print a    and the o p is     1   somevalue1     1   somevalue2     2   somevalue1     3   somevalue4     5   somevalue5     5   somevalue1     5   somevalue1     5   somevalue8     6   somevalue9     6   somevalue0     6   somevalue1     7   somevalue7      2   somevalue1     3   somevalue4     7   somevalue7

User · Answer

a list                1 2              1 2              2 3              3 4     print  list map list set map tuple a list       outputs    1  2    3  4    2  3

User · Answer

List of tuple and    can be used to remove duplicates   gt  gt  gt   list tupl  for tupl in  tuple item  for item in k      1  2    5  6  2    3    4    gt  gt  gt

User · Answer

This should work   k     1  2    4    5  6  2    1  2    3    4    k cleaned      for ele in k      if set ele  not in  set x  for x in k cleaned           k cleaned append ele  print k cleaned     output    1  2    4    5  6  2    3

User · Answer

gt  gt  gt  k     1  2    4    5  6  2    1  2    3    4    gt  gt  gt  import itertools  gt  gt  gt  k sort    gt  gt  gt  list k for k   in itertools groupby k     1  2    3    4    5  6  2     itertools often offers the fastest and most powerful solutions to this kind of problems  and is well worth getting intimately familiar with -   Edit  as I mention in a comment  normal optimization efforts are focused on large inputs  the big-O approach  because it s so much easier that it offers good returns on efforts  But sometimes  essentially for  tragically crucial bottlenecks  in deep inner loops of code that s pushing the boundaries of performance limits  one may need to go into much more detail  providing probability distributions  deciding which performance measures to optimize  maybe the upper bound or the 90th centile is more important than an average or median  depending on one s apps   performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics  and so forth   Careful measurements of  point  performance  code A vs code B for a specific input  are a part of this extremely costly process  and standard library module timeit helps here  However  it s easier to use it at a shell prompt   For example  here s a short module to showcase the general approach for this problem  save it as nodup py   import itertools  k     1  2    4    5  6  2    1  2    3    4    def doset k  map map  list list  set set  tuple tuple     return map list  set map tuple  k     def dosort k  sorted sorted  xrange xrange  len len     ks   sorted k    return  ks i  for i in xrange len ks   if i    0 or ks i     ks i-1    def dogroupby k  sorted sorted  groupby itertools groupby  list list     ks   sorted k    return  i for i    in itertools groupby ks    def donewk k     newk        for i in k      if i not in newk        newk append i    return newk    sanity check that all functions compute the same result and don t alter k if   name         main       savek   list k    for f in doset  dosort  dogroupby  donewk      resk   f k      assert k    savek     print   10s  s     f   name    sorted resk     Note the sanity check  performed when you just do python nodup py  and the basic hoisting technique  make constant global names local to each function for speed  to put things on equal footing   Now we can run checks on the tiny example list     python -mtimeit -s import nodup   nodup doset nodup k   100000 loops  best of 3  11 7 usec per loop   python -mtimeit -s import nodup   nodup dosort nodup k   100000 loops  best of 3  9 68 usec per loop   python -mtimeit -s import nodup   nodup dogroupby nodup k   100000 loops  best of 3  8 74 usec per loop   python -mtimeit -s import nodup   nodup donewk nodup k   100000 loops  best of 3  4 44 usec per loop   confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values   With a short list without duplicates     python -mtimeit -s import nodup   nodup donewk   i  for i in range 12     10000 loops  best of 3  25 4 usec per loop   python -mtimeit -s import nodup   nodup dogroupby   i  for i in range 12     10000 loops  best of 3  23 7 usec per loop   python -mtimeit -s import nodup   nodup doset   i  for i in range 12     10000 loops  best of 3  31 3 usec per loop   python -mtimeit -s import nodup   nodup dosort   i  for i in range 12     10000 loops  best of 3  25 usec per loop   the quadratic approach isn t bad  but the sort and groupby ones are better   Etc  etc   If  as the obsession with performance suggests  this operation is at a core inner loop of your pushing-the-boundaries application  it s worth trying the same set of tests on other representative input samples  possibly detecting some simple measure that could heuristically let you pick one or the other approach  but the measure must be fast  of course    It s also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place   If the duplicate removal task is frequent  and profiling shows it to be the program s performance bottleneck  keeping a set of tuples all the time and getting a list of lists from it only if and where needed  might be faster overall  for example

User · Answer

A bit of a background  I just started with python and learnt comprehensions  k     1  2    4    5  6  2    1  2    3    4   dedup    elem split      for elem in set      join str int elem  for int elem in  list  for  list in k

User · Answer

Create a dictionary with tuple as the key  and print the keys    create dictionary with tuple as key and index as value print list of keys of dictionary     k     1  2    4    5  6  2    1  2    3    4    dict tuple    tuple item   index for index  item in enumerate k    print  list itm  for itm in dict tuple keys       prints   1  2    5  6  2    3    4

User · Answer

Doing it manually  creating a new k list and adding entries not found so far   k     1  2    4    5  6  2    1  2    3    4   new k      for elem in k      if elem not in new k          new k append elem  k   new k print k   prints   1  2    4    5  6  2    3     Simple to comprehend  and you preserve the order of the first occurrence of each element should that be useful  but I guess it s quadratic in complexity as you re searching the whole of new k for each element

User · Answer

k   1  2    4    5  6  2    1  2    3    5  2    3    8    9   kl    kl extend x for x in k if x not in kl  k list kl  print k   which prints    1  2    4    5  6  2    3    5  2    8    9

User · Answer

Even your  long  list is pretty short  Also  did you choose them to match the actual data  Performance will vary with what these data actually look like  For example  you have a short list repeated over and over to make a longer list  This means that the quadratic solution is linear in your benchmarks  but not in reality   For actually-large lists  the set code is your best bet   it s linear  although space-hungry   The sort and groupby methods are O n log n  and the loop in method is obviously quadratic  so you know how these will scale as n gets really big  If this is the real size of the data you are analyzing  then who cares  It s tiny   Incidentally  I m seeing a noticeable speedup if I don t form an intermediate list to make the set  that is to say if I replace  kt    tuple i  for i in k  skt   set kt    with  skt   set tuple i  for i in k    The real solution may depend on more information  Are you sure that a list of lists is really the representation you need

[python] Removing duplicates from a list of lists

The answer is

Examples related to python

Tags