[python] How do I check if there are duplicates in a flat list?

For example, given the list ['one', 'two', 'one'], the algorithm should return True, whereas given ['one', 'two', 'three'] it should return False.

Tags: python, string, list, duplicates

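For hashable items, the simplest check (it appears below in the benchmark as Denis Otkidach's solution) is to compare the list's length with the number of unique elements; a minimal sketch, with a helper name chosen here just for illustration:

def contains_duplicates(flat_list):
    # Building a set drops duplicates, so any shrinkage means a duplicate existed.
    # Requires every item to be hashable (strings, numbers, tuples, ...).
    return len(flat_list) != len(set(flat_list))

print(contains_duplicates(['one', 'two', 'one']))    # True
print(contains_duplicates(['one', 'two', 'three']))  # False

Note that this does not short-circuit: it always builds the full set. Several answers below stop at the first duplicate instead.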
Another way of doing this succinctly is with Counter.

To just determine if there are any duplicates in the original list:

from collections import Counter

def has_dupes(l):
    # second element of the tuple has number of repetitions
    return Counter(l).most_common()[0][1] > 1

Or to get a list of items that have duplicates:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]
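A quick usage sketch, assuming the two functions above (note that has_dupes as written raises an IndexError on an empty list, because most_common() returns an empty list there):

print(has_dupes(['one', 'two', 'one']))      # True
print(get_dupes(['one', 'two', 'one']))      # ['one']
print(has_dupes(['one', 'two', 'three']))    # False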

I don't really know what set does behind the scenes, so I just like to keep it simple.

def dupes(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    # True if any duplicates were collected (note: the `in unique` check is a
    # linear scan over a list, so this is O(N^2) overall).
    return len(dupes) != 0

def check_duplicates(my_list):
    seen = {}
    for item in my_list:
        if item in seen:
            return True
        seen[item] = True
    return False
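For example, with the lists from the question (a quick check using the function above):

print(check_duplicates(['one', 'two', 'one']))    # True
print(check_duplicates(['one', 'two', 'three']))  # False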

Recommended for short lists only:

any(thelist.count(x) > 1 for x in thelist)

Do not use on a long list -- it can take time proportional to the square of the number of items in the list!

For longer lists with hashable items (strings, numbers, &c):

def anydup(thelist):
  seen = set()
  for x in thelist:
    if x in seen: return True
    seen.add(x)
  return False

If your items are not hashable (sublists, dicts, etc.) it gets hairier, though it may still be possible to get O(N log N) if they're at least comparable. But you need to know or test the characteristics of the items (hashable or not, comparable or not) to get the best performance you can -- O(N) for hashables, O(N log N) for non-hashable comparables, otherwise it's down to O(N squared) and there's nothing one can do about it :-(.
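As a rough sketch of that O(N log N) route for comparable but unhashable items (the helper name anydup_comparable is made up for illustration): sort first, then scan adjacent pairs.

def anydup_comparable(thelist):
    # Sorting puts equal items next to each other (O(N log N)),
    # then a single linear pass compares each adjacent pair.
    s = sorted(thelist)
    return any(a == b for a, b in zip(s, s[1:]))

print(anydup_comparable([[1], [2], [1]]))  # True  (sublists are unhashable but comparable)
print(anydup_comparable([[1], [2], [3]]))  # False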


I thought it would be useful to compare the timings of the different solutions presented here. For this I used my own library simple_benchmark:

[benchmark plot: runtime vs. list size for each solution, no duplicates in the input]

So indeed, for this case the solution from Denis Otkidach is the fastest.

Some of the approaches also exhibit a much steeper curve; these are the approaches that scale quadratically with the number of elements (Alex Martelli's first solution, wjandrea's, and both of Xavier Decoret's solutions). Also important to mention is that the pandas solution from Keiku has a very large constant factor, but for larger lists it almost catches up with the other solutions.

The second run places the duplicate at the first position. This is useful to see which solutions short-circuit:

[benchmark plot: runtime vs. list size for each solution, duplicate at the first position]

Here several approaches don't short-circuit: Keiku, Frank, Xavier_Decoret (first solution), Turn, Alex Martelli (first solution), and the approach presented by Denis Otkidach (which was the fastest in the no-duplicate case).

I included a function from my own library here: iteration_utilities.all_distinct, which can compete with the fastest solution in the no-duplicates case and runs in constant time in the duplicate-at-the-beginning case (although it is not the fastest there).
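As a quick sketch of how all_distinct reads on the question's lists (this assumes the third-party iteration_utilities package is installed):

from iteration_utilities import all_distinct

print(all_distinct(['one', 'two', 'three']))    # True  -> no duplicates
print(not all_distinct(['one', 'two', 'one']))  # True  -> this list does contain a duplicate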

The code for the benchmark:

from collections import Counter
from functools import reduce

import pandas as pd
from simple_benchmark import BenchmarkBuilder
from iteration_utilities import all_distinct

b = BenchmarkBuilder()

@b.add_function()
def Keiku(l):
    return pd.Series(l).duplicated().sum() > 0

@b.add_function()
def Frank(num_list):
    unique = []
    dupes = []
    for i in num_list:
        if i not in unique:
            unique.append(i)
        else:
            dupes.append(i)
    return len(dupes) != 0

@b.add_function()
def wjandrea(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False

@b.add_function()
def user(iterable):
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

@b.add_function()
def Turn(l):
    return Counter(l).most_common()[0][1] > 1

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

@b.add_function()
def F1Rumors(l):
    try:
        next(getDupes(l))    # Found a dupe (even if the dupe itself is falsy, e.g. 0)
        return True
    except StopIteration:
        pass
    return False

def decompose(a_list):
    return reduce(
        lambda u, o : (u[0].union([o]), u[1].union(u[0].intersection([o]))),
        a_list,
        (set(), set()))

@b.add_function()
def Xavier_Decoret_1(l):
    return not decompose(l)[1]

@b.add_function()
def Xavier_Decoret_2(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

@b.add_function()
def pyrospade(xs):
    s = set()
    return any(x in s or s.add(x) for x in xs)

@b.add_function()
def Alex_Martelli_1(thelist):
    return any(thelist.count(x) > 1 for x in thelist)

@b.add_function()
def Alex_Martelli_2(thelist):
    seen = set()
    for x in thelist:
        if x in seen: return True
        seen.add(x)
    return False

@b.add_function()
def Denis_Otkidach(your_list):
    return len(your_list) != len(set(your_list))

@b.add_function()
def MSeifert04(l):
    return not all_distinct(l)

And for the arguments:


# No duplicate run
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, list(range(size))

# Duplicate at beginning run (only one of the two argument providers should be active per run)
@b.add_arguments('list size')
def arguments():
    for exp in range(2, 14):
        size = 2**exp
        yield size, [0, *range(size)]

# Running and plotting
r = b.run()
r.plot()

A simpler solution is as follows: just check True/False with the pandas .duplicated() method and then take the sum. Please also see pandas.Series.duplicated — pandas 0.24.1 documentation.

import pandas as pd

def has_duplicated(l):
    return pd.Series(l).duplicated().sum() > 0

print(has_duplicated(['one', 'two', 'one']))
# True
print(has_duplicated(['one', 'two', 'three']))
# False

I used pyrospade's approach for its simplicity, and modified it slightly for a short list made from the case-insensitive Windows registry PATH.

If the raw PATH value string is split into individual paths, all 'null' paths (empty or whitespace-only strings) can be removed by using:

PATH_nonulls = [s for s in PATH if s.strip()]

def HasDupes(aseq):
    s = set()
    return any(((x.lower() in s) or s.add(x.lower())) for x in aseq)

def GetDupes(aseq):
    s = set()
    return set(x for x in aseq if ((x.lower() in s) or s.add(x.lower())))

def DelDupes(aseq):
    seen = set()
    return [x for x in aseq if (x.lower() not in seen) and (not seen.add(x.lower()))]
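For illustration, a small usage sketch with a toy list whose first two entries differ only in case (not the real registry data shown below):

paths = ['C:\\Python37\\', 'c:\\python37\\', 'D:\\DATA\\CCMD']

print(HasDupes(paths))   # True  (the two Python37 entries match case-insensitively)
print(GetDupes(paths))   # {'c:\\python37\\'}
print(DelDupes(paths))   # ['C:\\Python37\\', 'D:\\DATA\\CCMD']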

The original PATH has both 'null' entries and duplicates for testing purposes:

[list]  Root paths in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2
  3
  4  C:\Python37\Scripts\
  5  c:\python37\
  6  C:\Program Files\ImageMagick-7.0.8-Q8
  7  C:\Program Files (x86)\poppler\bin
  8  D:\DATA\Sounds
  9  C:\Program Files (x86)\GnuWin32\bin
 10  C:\Program Files (x86)\Intel\iCLS Client\
 11  C:\Program Files\Intel\iCLS Client\
 12  D:\DATA\CCMD\FF
 13  D:\DATA\CCMD
 14  D:\DATA\UTIL
 15  C:\
 16  D:\DATA\UHELP
 17  %SystemRoot%\system32
 18
 19
 20  D:\DATA\CCMD\FF%SystemRoot%
 21  D:\DATA\Sounds
 22  %SystemRoot%\System32\Wbem
 23  D:\DATA\CCMD\FF
 24
 25
 26  c:\
 27  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\
 28

Null paths have been removed, but duplicates remain, e.g., (1, 3) and (13, 20):

[list]  Null paths removed from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  c:\python37\
  4  C:\Program Files\ImageMagick-7.0.8-Q8
  5  C:\Program Files (x86)\poppler\bin
  6  D:\DATA\Sounds
  7  C:\Program Files (x86)\GnuWin32\bin
  8  C:\Program Files (x86)\Intel\iCLS Client\
  9  C:\Program Files\Intel\iCLS Client\
 10  D:\DATA\CCMD\FF
 11  D:\DATA\CCMD
 12  D:\DATA\UTIL
 13  C:\
 14  D:\DATA\UHELP
 15  %SystemRoot%\system32
 16  D:\DATA\CCMD\FF%SystemRoot%
 17  D:\DATA\Sounds
 18  %SystemRoot%\System32\Wbem
 19  D:\DATA\CCMD\FF
 20  c:\
 21  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

And finally, the dupes have been removed:

[list]  Massaged path list from HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment:PATH
  1  C:\Python37\
  2  C:\Python37\Scripts\
  3  C:\Program Files\ImageMagick-7.0.8-Q8
  4  C:\Program Files (x86)\poppler\bin
  5  D:\DATA\Sounds
  6  C:\Program Files (x86)\GnuWin32\bin
  7  C:\Program Files (x86)\Intel\iCLS Client\
  8  C:\Program Files\Intel\iCLS Client\
  9  D:\DATA\CCMD\FF
 10  D:\DATA\CCMD
 11  D:\DATA\UTIL
 12  C:\
 13  D:\DATA\UHELP
 14  %SystemRoot%\system32
 15  D:\DATA\CCMD\FF%SystemRoot%
 16  %SystemRoot%\System32\Wbem
 17  %SYSTEMROOT%\System32\WindowsPowerShell\v1.0\

This is old, but the answers here led me to a slightly different solution. If you are up for abusing comprehensions, you can get short-circuiting this way.

xs = [1, 2, 1]
s = set()
# set.add() returns None, so the `or` only evaluates to True when x is already in s,
# and any() stops at the first such hit.
any(x in s or s.add(x) for x in xs)
# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))
# duplicates == {1}

I found this to give the best performance, because it short-circuits as soon as the first duplicate is found; the algorithm has O(n) time and space complexity, where n is the list's length:

def has_duplicated_elements(iterable):
    """ Given an `iterable`, return True if there are duplicated entries. """
    clean_elements_set = set()
    clean_elements_set_add = clean_elements_set.add

    for possible_duplicate_element in iterable:

        if possible_duplicate_element in clean_elements_set:
            return True

        else:
            clean_elements_set_add( possible_duplicate_element )

    return False

If the list contains unhashable items, you can use Alex Martelli's solution but with a list instead of a set, though it's slower for larger inputs: O(N^2).

def has_duplicates(iterable):
    seen = []
    for x in iterable:
        if x in seen:
            return True
        seen.append(x)
    return False
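For example, with unhashable dict items (a brief sketch using the function above):

print(has_duplicates([{'a': 1}, {'b': 2}, {'a': 1}]))  # True
print(has_duplicates([{'a': 1}, {'b': 2}]))            # False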

If you are fond of the functional programming style, here is a useful function, self-documented and tested using doctest.

from functools import reduce

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1,2,3,5,3,2,6])
    ({1, 2, 3, 5, 6}, {2, 3})
    """
    return reduce(
        lambda u_d, o: (u_d[0].union([o]), u_d[1].union(u_d[0].intersection([o]))),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

From there you can test unicity by checking whether the second element of the returned pair is empty:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

Note that this is not efficient, since you are explicitly constructing the decomposition. But along the line of using reduce, you can come up with something equivalent (but slightly less efficient) to answer 5:

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise Exception
            return s.union([o])
        reduce(func, l, set())
        return True
    except:
        return False

my_list = ['one', 'two', 'one']

duplicates = []

for value in my_list:
  if my_list.count(value) > 1:
    if value not in duplicates:
      duplicates.append(value)

print(duplicates)  # ['one']

I recently answered a related question about finding all the duplicates in a list, using a generator. It has the advantage that, if used just to establish 'whether there is a duplicate', you only need to get the first item and the rest can be ignored, which is the ultimate shortcut.

This is an interesting set-based approach I adapted straight from moooeeeep:

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

Accordingly, a full list of dupes would be list(getDupes(etc)). To simply test "if" there is a dupe, it should be wrapped as follows:

def hasDupes(l):
    try:
        next(getDupes(l))    # Found a dupe (works even when the dupe itself is falsy, e.g. 0)
        return True
    except StopIteration:
        pass
    return False

This scales well and provides consistent operating times wherever the dupe is in the list -- I tested with lists of up to 1m entries. If you know something about the data, specifically that dupes are likely to show up in the first half, or have other requirements, like needing to get the actual dupes, then there are a couple of alternative dupe locators that might outperform it. The two I recommend are...

A simple dict-based approach, very readable:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True

Leverage itertools (essentially tee/zip on the sorted list; ifilter/izip in Python 2). This is very efficient if you are getting all the dupes, though not as quick if you only need the first:

import itertools

def getDupes(c):
    # tee/zip pairs each element of the sorted list with its successor;
    # equal adjacent pairs are duplicates.
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in filter(lambda x: x[0] == x[1], zip(a, b)):
        if k != r:
            yield k
            r = k
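A brief usage sketch of this variant (it requires the items to be sortable):

print(list(getDupes([1, 2, 2, 3, 3, 3, 4])))  # [2, 3]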

These were the top performers from the approaches I tried for the full dupe list, with the first dupe occurring anywhere in a 1m element list from the start to the middle. It was surprising how little overhead the sort step added. Your mileage may vary, but here are my specific timed results:

Finding FIRST duplicate, single dupe placed "n" elements into a 1m element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157
