[python] Select DataFrame rows between two dates

I am creating a DataFrame from a csv as follows:

stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?

This question is related to python pandas

The answer is


With my testing of pandas version 0.22.0 you can now answer this question easier with more readable code by simply using between.

# create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
df = pd.DataFrame({'dates':pd.date_range('2018-01-01','2019-01-01')})

Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

# use the between statement to get a boolean mask
df['dates'].between('2018-11-27','2019-01-15', inclusive=False)

0    False
1    False
2    False
3    False
4    False

# you can pass this boolean mask straight to loc
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=False)]

    dates
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01
335 2018-12-02

Notice the inclusive argument. very helpful when you want to be explicit about your range. notice when set to True we return Nov 27th of 2018 as well:

df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]

    dates
330 2018-11-27
331 2018-11-28
332 2018-11-29
333 2018-11-30
334 2018-12-01

This method is also faster than the previously mentioned isin method:

%%timeit -n 5
df.loc[df['dates'].between('2018-11-27','2019-01-15', inclusive=True)]
868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


%%timeit -n 5

df.loc[df['dates'].isin(pd.date_range('2018-01-01','2019-01-01'))]
1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

However, it is not faster than the currently accepted answer, provided by unutbu, only if the mask is already created. but if the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:

# already create the mask THEN time the function

start_date = dt.datetime(2018,11,27)
end_date = dt.datetime(2019,1,15)
mask = (df['dates'] > start_date) & (df['dates'] <= end_date)

%%timeit -n 5
df.loc[mask]
191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

I prefer not to alter the df.

An option is to retrieve the index of the start and end dates:

import numpy as np   
import pandas as pd

#Dummy DataFrame
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

#Get the index of the start and end dates respectively
start = df[df['date']=='2017-01-07'].index[0]
end = df[df['date']=='2017-01-14'].index[0]

#Show the sliced df (from 2017-01-07 to 2017-01-14)
df.loc[start:end]

which results in:

     0   1   2       date
6  0.5 0.8 0.8 2017-01-07
7  0.0 0.7 0.3 2017-01-08
8  0.8 0.9 0.0 2017-01-09
9  0.0 0.2 1.0 2017-01-10
10 0.6 0.1 0.9 2017-01-11
11 0.5 0.3 0.9 2017-01-12
12 0.5 0.4 0.3 2017-01-13
13 0.4 0.9 0.9 2017-01-14

Keeping the solution simple and pythonic, I would suggest you to try this.

In case if you are going to do this frequently the best solution would be to first set the date column as index which will convert the column in DateTimeIndex and use the following condition to slice any range of dates.

import pandas as pd

data_frame = data_frame.set_index('date')

df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]

Another option, how to achieve this, is by using pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.

>>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
>>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
>>> print(df)
      col_1       date
0  0.015198 2020-01-01
1  0.638600 2020-01-02
2  0.348485 2020-01-03
3  0.247583 2020-01-04
4  0.581835 2020-01-05

As an argument, use the condition for filtering like this:

>>> start_date, end_date = '2020-01-02', '2020-01-04'
>>> print(df.query('date >= @start_date and date <= @end_date'))
      col_1       date
1  0.244104 2020-01-02
2  0.374775 2020-01-03
3  0.510053 2020-01-04

If you do not want to include boundaries, just change the condition like following:

>>> print(df.query('date > @start_date and date < @end_date'))
      col_1       date
2  0.374775 2020-01-03

I feel the best option will be to use the direct checks rather than using loc function:

df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]

It works for me.

Major issue with loc function with a slice is that the limits should be present in the actual values, if not this will result in KeyError.


You can also use between:

df[df.some_date.between(start_date, end_date)]

you can do it with pd.date_range() and Timestamp. Let's say you have read a csv file with a date column using parse_dates option:

df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])

Then you can define a date range index :

rge = pd.date_range(end='15/6/2020', periods=2)

and then filter your values by date thanks to a map:

df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]

You can use the isin method on the date column like so df[df["date"].isin(pd.date_range(start_date, end_date))]

Note: This only works with dates (as the question asks) and not timestamps.

Example:

import numpy as np   
import pandas as pd

# Make a DataFrame with dates and random numbers
df = pd.DataFrame(np.random.random((30, 3)))
df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

# Select the rows between two dates
in_range_df = df[df["date"].isin(pd.date_range("2017-01-15", "2017-01-20"))]

print(in_range_df)  # print result

which gives

           0         1         2       date
14  0.960974  0.144271  0.839593 2017-01-15
15  0.814376  0.723757  0.047840 2017-01-16
16  0.911854  0.123130  0.120995 2017-01-17
17  0.505804  0.416935  0.928514 2017-01-18
18  0.204869  0.708258  0.170792 2017-01-19
19  0.014389  0.214510  0.045201 2017-01-20

Inspired by unutbu

print(df.dtypes)                                 #Make sure the format is 'object'. Rerunning this after index will not show values.
columnName = 'YourColumnName'
df[columnName+'index'] = df[columnName]          #Create a new column for index
df.set_index(columnName+'index', inplace=True)   #To build index on the timestamp/dates
df.loc['2020-09-03 01:00':'2020-09-06']          #Select range from the index. This is your new Dataframe.