Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:
df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])
For example it can be checked in this way:
for i, r in df.iterrows():
print type(r['col1']), type(r['col2']), type(r['col3'])
In particular integer, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4
. These dates were recognized as strings (not as python date-objects). Is there a way to "learn" pandas to recognized dates?
When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.
The following works:
def dateparse(d,t):
dt = d + " " + t
return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
You could use pandas.to_datetime()
as recommended in the documentation for pandas.read_csv()
:
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use
pd.to_datetime
afterpd.read_csv
.
Demo:
>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
date
0 2013-6-4
>>> df.dtypes
date object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
date
0 2013-06-04
>>> df.dtypes
date datetime64[ns]
dtype: object
In addition to what the other replies said, if you have to parse very large files with hundreds of thousands of timestamps, date_parser
can prove to be a huge performance bottleneck, as it's a Python function called once per row. You can get a sizeable performance improvements by instead keeping the dates as text while parsing the CSV file and then converting the entire column into dates in one go:
# For a data column
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']})
df['mydatetime'] = pd.to_datetime(df['mydatetime'], exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a DateTimeIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col='mydatetime')
df.index = pd.to_datetime(df.index, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
# For a MultiIndex
df = pd.read_csv(infile, parse_dates={'mydatetime': ['date', 'time']}, index_col=['mydatetime', 'num'])
idx_mydatetime = df.index.get_level_values(0)
idx_num = df.index.get_level_values(1)
idx_mydatetime = pd.to_datetime(idx_mydatetime, exact=True, cache=True, format='%Y-%m-%d %H:%M:%S')
df.index = pd.MultiIndex.from_arrays([idx_mydatetime, idx_num])
For my use case on a file with 200k rows (one timestamp per row), that cut down processing time from about a minute to less than a second.
Yes - according to the pandas.read_csv
documentation:
Note: A fast-path exists for iso8601-formatted dates.
So if your csv has a column named datetime
and the dates looks like 2013-01-01T01:01
for example, running this will make pandas (I'm on v0.19.2) pick up the date and time automatically:
df = pd.read_csv('test.csv', parse_dates=['datetime'])
Note that you need to explicitly pass parse_dates
, it doesn't work without.
Verify with:
df.dtypes
You should see the datatype of the column is datetime64[ns]
Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser
function receives a list of dates instead of a single value. In this case, his code should be updated like so:
dateparse = lambda dates: [pd.datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
you can even have the different date parts in different columns and pass the parameter:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
The default sensing of dates works great, but it seems to be biased towards north american Date formats. If you live elsewhere you might occasionally be caught by the results. As far as I can remember 1/6/2000 means 6 January in the USA as opposed to 1 Jun where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. Probably safer to stay with YYYYMMDD variations of date though. Apologies to pandas developers,here but i have not tested it with local dates recently.
you can use the date_parser parameter to pass a function to convert your format.
date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
If performance matters to you make sure you time:
import sys
import timeit
import pandas as pd
print('Python %s on %s' % (sys.version, sys.platform))
print('Pandas version %s' % pd.__version__)
repeat = 3
numbers = 100
def time(statement, _setup=None):
print (min(
timeit.Timer(statement, setup=_setup or setup).repeat(
repeat, numbers)))
print("Format %m/%d/%y")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,07/29/15
x2,07/29/15
x3,07/29/15
x4,07/30/15
x5,07/29/15
x6,07/29/15
x7,07/29/15
y7,08/05/15
x8,08/05/15
z3,08/05/15
''' * 100)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'date_parser=lambda x: pd.datetime.strptime(x, "%m/%d/%y")); data.seek(0)')
print("Format %Y-%m-%d %H:%M:%S")
setup = """import pandas as pd
import io
data = io.StringIO('''\
ProductCode,Date
''' + '''\
x1,2016-10-15 00:00:43
x2,2016-10-15 00:00:56
x3,2016-10-15 00:00:56
x4,2016-10-15 00:00:12
x5,2016-10-15 00:00:34
x6,2016-10-15 00:00:55
x7,2016-10-15 00:00:06
y7,2016-10-15 00:00:01
x8,2016-10-15 00:00:00
z3,2016-10-15 00:00:02
''' * 1000)"""
time('pd.read_csv(data); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"]); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'infer_datetime_format=True); data.seek(0)')
time('pd.read_csv(data, parse_dates=["Date"],'
'date_parser=lambda x: pd.datetime.strptime(x, "%Y-%m-%d %H:%M:%S")); data.seek(0)')
prints:
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 03:13:28)
[Clang 6.0 (clang-600.0.57)] on darwin
Pandas version 0.23.4
Format %m/%d/%y
0.19123052499999993
8.20691274
8.143124389
1.2384357139999977
Format %Y-%m-%d %H:%M:%S
0.5238807110000039
0.9202787830000005
0.9832778819999959
12.002349824999996
So with iso8601-formatted date (%Y-%m-%d %H:%M:%S
is apparently an iso8601-formatted date, I guess the T can be dropped and replaced by a space) you should not specify infer_datetime_format
(which does not make a difference with more common ones either apparently) and passing your own parser in just cripples performance. On the other hand, date_parser
does make a difference with not so standard day formats. Be sure to time before you optimize, as usual.
While loading csv file contain date column.We have two approach to to make pandas to recognize date column i.e
Pandas explicit recognize the format by arg date_parser=mydateparser
Pandas implicit recognize the format by agr infer_datetime_format=True
Some of the date column data
01/01/18
01/02/18
Here we don't know the first two things It may be month or day. So in this case we have to use Method 1:- Explicit pass the format
mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%y")
df = pd.read_csv(file_name, parse_dates=['date_col_name'],
date_parser=mydateparser)
Method 2:- Implicit or Automatically recognize the format
df = pd.read_csv(file_name, parse_dates=[date_col_name],infer_datetime_format=True)
Source: Stackoverflow.com