Here's a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()]
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated()
returns a boolean array: a True
or False
for each column. If it is False
then the column name is unique up to that point, if it is True
then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True]
.
Pandas
allows one to index using boolean values whereby it selects only the True
values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True]
)
Finally, df.loc[:,[True,True,False]]
selects only the non-duplicated columns using the aforementioned indexing capability.
Note: the above only checks columns names, not column values.