pandas read_csv and filter columns with usecols

Question

I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

import pandas as pd
csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"],
        index_col=["date", "loc"],
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print df1

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv',
        index_col=["date", "loc"],
        usecols=["date", "loc", "x"],  # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print df2

I expect that df1 and df2 should be the same except for the missing dummy column, but the columns come in mislabeled. Also the date is getting parsed as a date.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.

edit: fixed bad header usage.

User · Answer

This code achieves what you want, although the behaviour it relies on is weird and certainly buggy.

I observed that it works when:

a) you specify index_col relative to the columns you actually use, so there are three columns in this example, not four (you drop dummy and start counting from there)

b) you do the same for parse_dates

c) you do not do so for usecols, for obvious reasons: it has to select from the columns of the original file

d) you adapt names to mirror this behaviour, as I did here

import pandas as pd
from StringIO import StringIO

csv = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

df = pd.read_csv(StringIO(csv),
        index_col=[0,1],
        usecols=[1,2,3], 
        parse_dates=[0],
        header=0,
        names=["date", "loc", "", "x"])

print df

which prints

                x
date       loc   
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5

User · Answer

The solution lies in understanding these two keyword arguments:

names is only necessary when there is no header row in your file and you want to specify other arguments (such as usecols) using column names rather than integer indices.

usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

So because you have a header row, passing header=0 is sufficient and additionally passing names appears to be confusing pd.read_csv.

Removing names from the second call gives the desired output:

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        header=0,
        index_col=["date", "loc"],
        usecols=["date", "loc", "x"],
        parse_dates=["date"])

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
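For readers on Python 3 with a recent pandas release, the same idea can be sketched as below. This is an adaptation, not the code as originally posted: it assumes `io.StringIO` in place of the Python 2 `StringIO` module, renames the string to `csv_text` for illustration, and has not been checked against pandas 0.10.1.

```python
import pandas as pd
from io import StringIO  # Python 3 stand-in for the Python 2 StringIO module

# Same sample data as in the question; the variable name is illustrative.
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# The header row supplies the column names, so `names` is left out.
# usecols filters by those names, which skips "dummy" at read time.
df = pd.read_csv(
    StringIO(csv_text),
    header=0,
    index_col=["date", "loc"],
    usecols=["date", "loc", "x"],
    parse_dates=["date"],
)

print(df)
```

Because every argument refers to columns by the names found in the header, there is no positional bookkeeping to get out of step.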

User · Answer

You have to just add the index_col=False parameter:

df1 = pd.read_csv('foo.csv',
     header=0,
     index_col=False,
     names=["dummy", "date", "loc", "x"],
     usecols=["dummy", "date", "loc", "x"],
     parse_dates=["date"])

print df1
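As a rough illustration of what `index_col=False` does, here is a sketch for Python 3 and a recent pandas (an assumption, since the question used pandas 0.10.1). With this setting `read_csv` does not promote any column to the index, so `date` and `loc` arrive as ordinary columns. The `drop` and `set_index` steps at the end are my addition, not part of the answer above, to get back to the MultiIndex the question asks for.

```python
import pandas as pd
from io import StringIO

csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# index_col=False: keep the default integer index, no column becomes the index.
flat = pd.read_csv(
    StringIO(csv_text),
    header=0,
    index_col=False,
    names=["dummy", "date", "loc", "x"],
    usecols=["dummy", "date", "loc", "x"],
    parse_dates=["date"],
)
print(flat.columns.tolist())

# Illustrative follow-up (not in the original answer): drop the unwanted
# column and build the two-level index explicitly.
indexed = flat.drop(columns="dummy").set_index(["date", "loc"])
print(indexed)
```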

User · Answer

Import csv first and use csv.DictReader; it's easy to process the rows that way.
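A minimal sketch of that suggestion, using only the standard library on Python 3. The names `csv_text`, `wanted` and `rows` are made up for illustration, and converting `date` and `x` by hand is my assumption about what "process" would involve here.

```python
import csv
from datetime import datetime
from io import StringIO

csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

wanted = ["date", "loc", "x"]  # columns to keep; "dummy" is skipped

rows = []
# DictReader takes the field names from the header row and yields one dict per line.
for record in csv.DictReader(StringIO(csv_text)):
    row = {name: record[name] for name in wanted}
    # Every value arrives as a string, so types are converted explicitly.
    row["date"] = datetime.strptime(row["date"], "%Y%m%d").date()
    row["x"] = int(row["x"])
    rows.append(row)

print(rows[0])
```

This avoids pandas altogether, at the cost of doing the type conversion and any indexing yourself.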

User · Answer

If your csv file contains extra data, columns can be deleted from the DataFrame after import.

import pandas as pd
from StringIO import StringIO

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(StringIO(csv),
        index_col=["date", "loc"],
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
del df['dummy']

Which gives us:

                x
date       loc
2009-01-01 a    1
2009-01-02 a    3
2009-01-03 a    5
2009-01-01 b    1
2009-01-02 b    3
2009-01-03 b    5
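The read-then-delete approach might look like this on Python 3 with a recent pandas (assumed; not verified on pandas 0.10.1). The sketch relies on the header row for column names and uses `DataFrame.drop`, which returns a new frame, in place of `del`.

```python
import pandas as pd
from io import StringIO

csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Read every column, including the one we do not want.
full = pd.read_csv(
    StringIO(csv_text),
    header=0,
    index_col=["date", "loc"],
    parse_dates=["date"],
)

# Remove the extra column afterwards; `del full["dummy"]` would do the
# same thing in place.
df = full.drop(columns="dummy")

print(df)
```

The whole file is parsed before the column is discarded, so this costs more memory than filtering at read time with usecols.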
