python pandas remove duplicate columns

Question

What is the easiest way to remove duplicate columns from a dataframe?

I am reading a text file that has duplicate columns via:

    import pandas as pd

    df = pd.read_table(fname)

The column names are:

    Time, Time Relative, N2, Time, Time Relative, H2, etc...

All the Time and Time Relative columns contain the same data. I want:

    Time, Time Relative, N2, H2

All my attempts at dropping, deleting, etc. such as:

    df = df.T.drop_duplicates().T

result in uniquely valued index errors:

    Reindexing only valid with uniquely valued index objects

Sorry for being a Pandas noob. Any suggestions would be appreciated.

Additional Details

Pandas version: 0.9.0
Python version: 2.7.3
Windows 7 (installed via Pythonxy 2.7.3.0)

Data file (note: in the real file, columns are separated by tabs; here they are separated by 4 spaces):

    Time    Time Relative [s]    N2[%]    Time    Time Relative [s]    H2[ppm]
    2/12/2013 9:20:55 AM    6.177    9.99268e+001    2/12/2013 9:20:55 AM    6.177    3.216293e-005
    2/12/2013 9:21:06 AM    17.689    9.99296e+001    2/12/2013 9:21:06 AM    17.689    3.841667e-005
    2/12/2013 9:21:18 AM    29.186    9.992954e+001    2/12/2013 9:21:18 AM    29.186    3.880365e-005
    ...etc...
    2/12/2013 2:12:44 PM    17515.269    9.991756+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
    2/12/2013 2:12:55 PM    17526.769    9.991754e+001    2/12/2013 2:12:55 PM    17526.769    2.880386e-005
    2/12/2013 2:13:07 PM    17538.273    9.991797e+001    2/12/2013 2:13:07 PM    17538.273    3.131447e-005

User · Answer

The approach below identifies the duplicate columns, so you can review what is going wrong when the dataframe is originally built:

    dupes = pd.DataFrame(df.columns)
    dupes[dupes.duplicated()]
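As a small sketch of the same check, `Index.duplicated()` can also be called on `df.columns` directly. The frame below is made up for illustration (its values are not from the asker's file); only the column names mimic the question.

```python
import pandas as pd

# Hypothetical one-row frame whose column names repeat, as in the question.
df = pd.DataFrame(
    [[1, 2, 3, 1, 2, 4]],
    columns=["Time", "Time Relative", "N2", "Time", "Time Relative", "H2"],
)

# True wherever a column name has already appeared further to the left.
mask = df.columns.duplicated()

# Names that occur more than once (each listed once) and where the repeats sit.
dupe_names = df.columns[mask].unique().tolist()
dupe_positions = [i for i, is_dupe in enumerate(mask) if is_dupe]

print(dupe_names)
print(dupe_positions)
```

Knowing the positions of the repeats makes it easier to check whether the repeated columns really hold the same data before dropping anything.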

User · Answer

It looks like you were on the right path. Here is the one-liner you were looking for:

    df.reset_index().T.drop_duplicates().T

But since there is no example data frame that produces the referenced error message "Reindexing only valid with uniquely valued index objects", it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:

    original_index = df.index.names
    df.reset_index().T.drop_duplicates().reset_index(original_index).T

User · Answer

I ran into this problem, and the one-liner provided by the first answer worked well. However, I had the extra complication that the second copy of the column had all of the data and the first copy did not.

The solution was to create two data frames by splitting the one data frame, toggling the negation operator on the duplicate mask. Once I had the two data frames, I ran a join statement using lsuffix. This way, I could then reference and delete the column without the data.

- E
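The answer above gives no code, so here is one possible reading of it as a minimal sketch. The frame, the `_first` suffix, and the variable names are assumptions made for illustration; the answer's author may have done it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the first "Time" column is empty, the second has the data.
df = pd.DataFrame(
    [[np.nan, 10, 1.5], [np.nan, 20, 2.5]],
    columns=["Time", "N2", "Time"],
)

is_repeat = df.columns.duplicated()

# Split into two frames by toggling the negation operator on the mask.
first = df.loc[:, ~is_repeat]   # first occurrence of each name
later = df.loc[:, is_repeat]    # repeated occurrences only

# Join them back; lsuffix renames the overlapping columns from the left frame.
joined = first.join(later, lsuffix="_first")

# The empty copy now has its own name and can be dropped.
result = joined.drop(columns=["Time_first"])
print(result)
```

After the join, every column has a unique name, so the empty copy can be referenced and removed without touching the copy that holds the data.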

User · Answer

First step: read only the first row, i.e. all the column names, then remove the duplicates.

Second step: read only those columns.

    cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
    df = pd.read_csv("file.csv", usecols=cols)
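A self-contained sketch of this two-step read is shown below. It uses an in-memory string instead of a file, and it passes column positions rather than names to `usecols`. Both are choices made for this illustration, not part of the answer above.

```python
import io

import pandas as pd

# Hypothetical tab-separated data with a repeated "Time" column.
raw = "Time\tN2\tTime\tH2\n1\t10\t1\t100\n2\t20\t2\t200\n"

# Step 1: read only the header row, treating it as data.
header = pd.read_csv(io.StringIO(raw), sep="\t", header=None, nrows=1).iloc[0]

# Positions of the first occurrence of each column name.
keep_positions = header[~header.duplicated()].index.tolist()

# Step 2: read the file again, loading only those positions.
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=keep_positions)
print(df)
```

Selecting by position avoids any ambiguity about which of several identically named columns is loaded.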

User · Answer

A fast and easy way to drop duplicated columns by their values:

    df = df.T.drop_duplicates().T

More info: Pandas DataFrame drop_duplicates manual.
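One thing worth checking with this approach is what the double transpose does to dtypes. The sketch below uses a made-up frame with integer and float columns; in that case the transpose converts everything to a common float dtype.

```python
import pandas as pd

# Hypothetical frame: "b" duplicates "a" by value; "c" is a float column.
df = pd.DataFrame({"a": [1, 2], "b": [1, 2], "c": [0.5, 1.5]})

# Transpose, drop duplicate rows (former columns), transpose back.
out = df.T.drop_duplicates().T

print(list(out.columns))
print(df.dtypes)
print(out.dtypes)
```

If the original dtypes matter, they may need to be restored afterwards, for example with `astype`.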

User · Answer

Here's a one-line solution to remove columns based on duplicate column names:

    df = df.loc[:, ~df.columns.duplicated()]

How it works:

Suppose the columns of the data frame are ['alpha', 'beta', 'alpha'].

df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, then the column name is unique up to that point; if it is True, then the column name is duplicated earlier. For example, using the given example, the returned value would be [False, False, True].

Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).

Finally, df.loc[:, [True, True, False]] selects only the non-duplicated columns using the aforementioned indexing capability.

Note: the above only checks column names, not column values.
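The steps above can be traced on a tiny frame. The values below are invented for illustration; only the column names come from the explanation.

```python
import pandas as pd

# Hypothetical frame with the column names used in the explanation.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["alpha", "beta", "alpha"])

# Step 1: flag names that already appeared earlier.
mask = df.columns.duplicated()

# Step 2: flip the mask so first occurrences are True.
keep = ~mask

# Step 3: select only the columns flagged True.
deduped = df.loc[:, keep]

print(mask)
print(keep)
print(deduped)
```

Because the check looks only at names, the second `alpha` column is dropped here even though its values differ from the first.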

User · Answer

Note that Gene Burinsky's answer (at the time of writing, the selected answer) keeps the first of each duplicated column. To keep the last:

    df = df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
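In current pandas, `Index.duplicated` also accepts a `keep` argument, which may read more clearly than reversing twice. The sketch below compares the two forms on a made-up frame; it is an alternative spelling, not the answer author's code.

```python
import pandas as pd

# Hypothetical frame: the two "alpha" columns hold different values.
df = pd.DataFrame([[1, 2, 3]], columns=["alpha", "beta", "alpha"])

# Reverse, flag duplicates, reverse back (the form given in the answer).
reversed_form = df.loc[:, ~df.columns[::-1].duplicated()[::-1]]

# Same selection using the keep argument of Index.duplicated.
keep_last = df.loc[:, ~df.columns.duplicated(keep="last")]

print(keep_last)
```

Both forms keep the right-most column for each repeated name.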

User · Answer

An update on @kalu's answer, which uses the latest pandas:

    def find_duplicated_columns(df):
        dupes = []

        columns = df.columns

        for i in range(len(columns)):
            col1 = df.iloc[:, i]
            for j in range(i + 1, len(columns)):
                col2 = df.iloc[:, j]
                # skip this pair if the dtypes aren't the same (helps deal
                # with categorical dtypes); later columns are still checked
                if col1.dtype != col2.dtype:
                    continue
                # otherwise compare values
                if col1.equals(col2):
                    dupes.append(columns[i])
                    break

        return dupes

User · Answer

Transposing is inefficient for large DataFrames. Here is an alternative:

    def duplicate_columns(frame):
        # group column names by dtype, so only same-dtype columns are compared
        groups = frame.columns.to_series().groupby(frame.dtypes).groups
        dups = []
        for t, v in groups.items():
            dcols = frame[v].to_dict(orient="list")

            # list() is needed on Python 3, where dict views are not indexable
            vs = list(dcols.values())
            ks = list(dcols.keys())
            lvs = len(vs)

            for i in range(lvs):
                for j in range(i + 1, lvs):
                    if vs[i] == vs[j]:
                        dups.append(ks[i])
                        break

        return dups

Use it like this:

    dups = duplicate_columns(frame)
    frame = frame.drop(dups, axis=1)

Edit: A memory-efficient version that treats NaNs like any other value. Note that array_equivalent is an internal pandas function, and its import location has changed between versions, so the import below may need adjusting.

    from pandas.core.common import array_equivalent

    def duplicate_columns(frame):
        groups = frame.columns.to_series().groupby(frame.dtypes).groups
        dups = []

        for t, v in groups.items():

            cs = frame[v].columns
            vs = frame[v]
            lcs = len(cs)

            for i in range(lcs):
                ia = vs.iloc[:, i].values
                for j in range(i + 1, lcs):
                    ja = vs.iloc[:, j].values
                    if array_equivalent(ia, ja):
                        dups.append(cs[i])
                        break

        return dups

User · Answer

It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.

If not, your solution should work:

    In [101]: vals = np.random.randint(0, 20, (4, 3))
              vals
    Out[101]:
    array([[ 3, 13,  0],
           [ 1, 15, 14],
           [14, 19, 14],
           [19,  5,  1]])

    In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'])
              df
    Out[106]:
       Time  H1  N2  Time Relative  N2  Time
    0     3  13   0              3  13     0
    1     1  15  14              1  15    14
    2    14  19  14             14  19    14
    3    19   5   1             19   5     1

    In [107]: df.T.drop_duplicates().T
    Out[107]:
       Time  H1  N2
    0     3  13   0
    1     1  15  14
    2    14  19  14
    3    19   5   1

You probably have something specific to your data that's messing it up. We could give more help if there are more details you could give us about the data.

Edit: Like Andy said, the problem is probably with the duplicate column titles.

For a sample table file 'dummy.csv' I made up:

    Time    H1  N2  Time    N2  Time Relative
    3   13  13  3   13  0
    1   15  15  1   15  14
    14  19  19  14  19  14
    19  5   5   19  5   1

using read_table gives unique columns and works properly:

    In [151]: df2 = pd.read_table('dummy.csv')
              df2
    Out[151]:
       Time  H1  N2  Time.1  N2.1  Time Relative
    0     3  13  13       3    13              0
    1     1  15  15       1    15             14
    2    14  19  19      14    19             14
    3    19   5   5      19     5              1

    In [152]: df2.T.drop_duplicates().T
    Out[152]:
       Time  H1  Time Relative
    0     3  13              0
    1     1  15             14
    2    14  19             14
    3    19   5              1

If your version doesn't do this for you, you can hack together a solution to make the names unique:

    In [169]: df2 = pd.read_table('dummy.csv', header=None)
              df2
    Out[169]:
          0   1   2     3   4              5
    0  Time  H1  N2  Time  N2  Time Relative
    1     3  13  13     3  13              0
    2     1  15  15     1  15             14
    3    14  19  19    14  19             14
    4    19   5   5    19   5              1

    In [171]: from collections import defaultdict
              col_counts = defaultdict(int)
              col_ix = df2.first_valid_index()

    In [172]: cols = []
              for col in df2.loc[col_ix]:  # .loc replaces the removed .ix indexer
                  cnt = col_counts[col]
                  col_counts[col] += 1
                  suf = '_' + str(cnt) if cnt else ''
                  cols.append(col + suf)
              cols
    Out[172]:
    ['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']

    In [174]: df2.columns = cols
              df2 = df2.drop([col_ix])

    In [177]: df2
    Out[177]:
      Time  H1  N2 Time_1 N2_1 Time Relative
    1    3  13  13      3   13             0
    2    1  15  15      1   15            14
    3   14  19  19     14   19            14
    4   19   5   5     19    5             1

    In [178]: df2.T.drop_duplicates().T
    Out[178]:
      Time  H1 Time Relative
    1    3  13             0
    2    1  15            14
    3   14  19            14
    4   19   5             1
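The read_table part of the session above can be reproduced without a file on disk. The sketch below feeds the same made-up dummy table to `read_csv` from an in-memory string; the use of `io.StringIO` and `read_csv` with `sep="\t"` is a substitution made for this illustration.

```python
import io

import pandas as pd

# The dummy table from the answer, tab-separated, held in memory.
raw = (
    "Time\tH1\tN2\tTime\tN2\tTime Relative\n"
    "3\t13\t13\t3\t13\t0\n"
    "1\t15\t15\t1\t15\t14\n"
    "14\t19\t19\t14\t19\t14\n"
    "19\t5\t5\t19\t5\t1\n"
)

# Repeated header names are made unique on read by appending .1, .2, ...
df2 = pd.read_csv(io.StringIO(raw), sep="\t")
print(list(df2.columns))

# With unique names, the transpose trick drops columns with equal values.
deduped = df2.T.drop_duplicates().T
print(deduped)
```

Note that `N2` is dropped here only because its values happen to equal those of `H1` in the dummy data.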

User · Answer

If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than @kalu's function, keeping the first of any similarly named columns.

    Cols = list(df.columns)
    for i, item in enumerate(df.columns):
        # rename every repeated name so it can be dropped in one go
        if item in df.columns[:i]:
            Cols[i] = "toDROP"
    df.columns = Cols
    df = df.drop("toDROP", axis=1)
