A Pandas DataFrame
contains column named "date"
that contains non-unique datetime
values.
I can group the lines in this frame using:
data.groupby(data['date'])
However, this splits the data by the datetime
values. I would like to group these data by the year stored in the "date" column. This page shows how to group by year in cases where the time stamp is used as an index, which is not true in my case.
How do I achieve this grouping?
ecatmur's solution will work fine. This will be better performance on large datasets, though:
data.groupby(data['date'].map(lambda x: x.year))
This should work:
data.groupby(lambda x: data['date'][x].year)
this will also work
data.groupby(data['date'].dt.year)
This might be easier to explain with a sample dataset.
Create Sample Data
Let's assume we have a single column of Timestamps, date
and another column we would like to perform an aggregation on, a
.
df = pd.DataFrame({'date':pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
'a':[9,5,1,2,3]}, columns=['date', 'a'])
df
date a
0 2012-01-01 9
1 2012-06-01 5
2 2015-01-01 1
3 2015-02-01 2
4 2015-03-01 3
There are several ways to group by year
year
propertydate
in index and use anonymous function to access yearresample
method.dt
accessor with year
propertyWhen you have a column (and not an index) of pandas Timestamps, you can access many more extra properties and methods with the dt
accessor. For instance:
df['date'].dt.year
0 2012
1 2012
2 2015
3 2015
4 2015
Name: date, dtype: int64
We can use this to form our groups and calculate some aggregations on a particular column:
df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])
sum mean max
date
2012 14 7 9
2015 6 2 3
If you set the date column as the index, it becomes a DateTimeIndex with the same properties and methods as the dt
accessor gives normal columns
df1 = df.set_index('date')
df1.index.year
Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')
Interestingly, when using the groupby method, you can pass it a function. This function will be implicitly passed the DataFrame's index. So, we can get the same result from above with the following:
df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])
sum mean max
2012 14 7 9
2015 6 2 3
resample
methodIf your date column is not in the index, you must specify the column with the on
parameter. You also need to specify the offset alias as a string.
df.resample('AS', on='date')['a'].agg(['sum', 'mean', 'max'])
sum mean max
date
2012-01-01 14.0 7.0 9.0
2013-01-01 NaN NaN NaN
2014-01-01 NaN NaN NaN
2015-01-01 6.0 2.0 3.0
You can also convert the date column to a pandas Period object. We must pass in the offset alias as a string to determine the length of the Period.
df['date'].dt.to_period('A')
0 2012
1 2012
2 2015
3 2015
4 2015
Name: date, dtype: object
We can then use this as a group
df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])
sum mean max
2012 14 7 9
2015 6 2 3
Source: Stackoverflow.com