I've been looking around for ways to select columns through the python documentation and the forums but every example on indexing columns are too simplistic.
Suppose I have a 10 x 10 dataframe
df = DataFrame(randn(10, 10), index=range(0,10), columns=['A', 'B', 'C', 'D','E','F','G','H','I','J'])
So far, all the documentations gives is just a simple example of indexing like
subset = df.loc[:,'A':'C']
or
subset = df.loc[:,'C':]
But I get an error when I try index multiple, non-sequential columns, like this
subset = df.loc[:,('A':'C', 'E')]
How would I index in Pandas if I wanted to select column A to C, E, and G to I? It appears that this logic will not work
subset = df.loc[:,('A':'C', 'E', 'G':'I')]
I feel that the solution is pretty simple, but I can't get around this error. Thanks!
How do I select multiple columns by labels in pandas?
Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let's try that instead:
loc = df.columns.get_loc
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]
A B C E G H I
0 -1.666330 0.321260 -1.768185 -0.034774 0.023294 0.533451 -0.241990
1 0.911498 3.408758 0.419618 -0.462590 0.739092 1.103940 0.116119
2 1.243001 -0.867370 1.058194 0.314196 0.887469 0.471137 -1.361059
3 -0.525165 0.676371 0.325831 -1.152202 0.606079 1.002880 2.032663
4 0.706609 -0.424726 0.308808 1.994626 0.626522 -0.033057 1.725315
5 0.879802 -1.961398 0.131694 -0.931951 -0.242822 -1.056038 0.550346
6 0.199072 0.969283 0.347008 -2.611489 0.282920 -0.334618 0.243583
7 1.234059 1.000687 0.863572 0.412544 0.569687 -0.684413 -0.357968
8 -0.299185 0.566009 -0.859453 -0.564557 -0.562524 0.233489 -0.039145
9 0.937637 -2.171174 -1.940916 -1.553634 0.619965 -0.664284 -0.151388
Note that the +1
is added because when using iloc
the rightmost index is exclusive.
filter
is a nice and simple method for OP's headers, but this might not generalise well to arbitrary column names.
The "location-based" solution with loc
is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range -- something that we would ideally like to avoid.
Lastly, "pick your columns directly" is good advice as long as you have a manageably small number of columns to pick. It will, however not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.
Just pick the columns you want directly....
df[['A','E','I','C']]
Source: Stackoverflow.com