Select DataFrame rows between two dates

Question

I am creating a DataFrame from a csv as follows:

    stock = pd.read_csv('data_in/' + filename + '.csv', skipinitialspace=True)

The DataFrame has a date column. Is there a way to create a new DataFrame (or just overwrite the existing one) which only contains rows with date values that fall within a specified date range or between two specified date values?

User · Answer

To keep the solution simple and Pythonic, I would suggest the following.

If you are going to do this frequently, the best solution is to first set the date column as the index. Provided the column already has a datetime64 dtype, this gives you a DatetimeIndex, and you can then use the following condition to select any range of dates.

import pandas as pd

data_frame = data_frame.set_index('date')

df = data_frame[(data_frame.index > '2017-08-10') & (data_frame.index <= '2017-08-15')]
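A minimal, self-contained sketch of the same idea, assuming the date column was read as plain strings. The column names and values here are made up for illustration; the explicit pd.to_datetime call is what makes set_index produce a DatetimeIndex.

```python
import pandas as pd

# Hypothetical data; the 'date' column starts out as plain strings.
data_frame = pd.DataFrame({
    "date": ["2017-08-09", "2017-08-10", "2017-08-12", "2017-08-15", "2017-08-16"],
    "close": [10.0, 10.5, 11.0, 10.8, 11.2],
})

# Convert first, so that set_index produces a DatetimeIndex.
data_frame["date"] = pd.to_datetime(data_frame["date"])
data_frame = data_frame.set_index("date")

# Rows after 2017-08-10 (exclusive) up to 2017-08-15 (inclusive).
df = data_frame[(data_frame.index > "2017-08-10") & (data_frame.index <= "2017-08-15")]
print(df)
```

Only the rows dated 2017-08-12 and 2017-08-15 remain, because the lower bound is strict and the upper bound is inclusive.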

User · Answer

I feel the best option is to use direct comparisons rather than the loc function:

    df = df[(df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')]

It works for me.

The major issue with using loc with a slice is that the limits should be present in the actual values; if they are not, a KeyError can result, for example when the index is not sorted.
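As a rough illustration of why the direct comparison is forgiving, here is a small sketch with made-up data in which the rows are unsorted and neither bound occurs in the column.

```python
import pandas as pd

# Hypothetical, deliberately unsorted dates; neither bound used below occurs in the data.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-06-12", "2000-06-03", "2000-06-08", "2000-05-30"]),
    "value": [1, 2, 3, 4],
})

# The comparison is evaluated row by row, so the bounds need not be present
# in the column and the rows need not be sorted.
selected = df[(df["date"] > "2000-6-1") & (df["date"] <= "2000-6-10")]
print(selected)
```

The result keeps the original row order, so sort by date afterwards if the order matters.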

User · Answer

In my testing with pandas version 0.22.0, you can now answer this question more easily, with more readable code, by simply using between.

    # create a single column DataFrame with dates going from Jan 1st 2018 to Jan 1st 2019
    df = pd.DataFrame({'dates': pd.date_range('2018-01-01', '2019-01-01')})

Let's say you want to grab the dates between Nov 27th 2018 and Jan 15th 2019:

    # use the between statement to get a boolean mask
    df['dates'].between('2018-11-27', '2019-01-15', inclusive=False)

    0    False
    1    False
    2    False
    3    False
    4    False

    # you can pass this boolean mask straight to loc
    df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive=False)]

             dates
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01
    335 2018-12-02

Notice the inclusive argument. It is very helpful when you want to be explicit about your range. Notice that when it is set to True we return Nov 27th of 2018 as well:

    df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive=True)]

             dates
    330 2018-11-27
    331 2018-11-28
    332 2018-11-29
    333 2018-11-30
    334 2018-12-01

Note that the boolean form of inclusive shown here matches pandas 0.22.0; from pandas 1.3 onward the argument takes the strings 'both', 'neither', 'left' or 'right' instead.

This method is also faster than the previously mentioned isin method:

    %timeit -n 5 df.loc[df['dates'].between('2018-11-27', '2019-01-15', inclusive=True)]
    868 µs ± 164 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

    %timeit -n 5 df.loc[df['dates'].isin(pd.date_range('2018-01-01', '2019-01-01'))]
    1.53 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

However, it is not faster than the currently accepted answer, provided by unutbu, if the mask has already been created. If the mask is dynamic and needs to be reassigned over and over, my method may be more efficient:

    # already create the mask THEN time the function
    start_date = dt.datetime(2018, 11, 27)
    end_date = dt.datetime(2019, 1, 15)
    mask = (df['dates'] > start_date) & (df['dates'] <= end_date)

    %timeit -n 5 df.loc[mask]
    191 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
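For readers on a recent pandas, a minimal sketch of the same selection using the string form of inclusive. This assumes pandas 1.3 or later; the shorter end date is chosen here only to keep the counts easy to check.

```python
import pandas as pd

# Same single-column frame as in the answer above.
df = pd.DataFrame({"dates": pd.date_range("2018-01-01", "2019-01-01")})

# 'both' keeps the two end-points, 'neither' drops them
# ('left' and 'right' keep only one side).
both = df.loc[df["dates"].between("2018-11-27", "2018-12-01", inclusive="both")]
neither = df.loc[df["dates"].between("2018-11-27", "2018-12-01", inclusive="neither")]

print(len(both), len(neither))
```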

User · Answer

Another option for achieving this is the pandas.DataFrame.query() method. Let me show you an example on the following data frame called df.

    >>> df = pd.DataFrame(np.random.random((5, 1)), columns=['col_1'])
    >>> df['date'] = pd.date_range('2020-1-1', periods=5, freq='D')
    >>> print(df)
          col_1       date
    0  0.015198 2020-01-01
    1  0.638600 2020-01-02
    2  0.348485 2020-01-03
    3  0.247583 2020-01-04
    4  0.581835 2020-01-05

As an argument, use the condition for filtering like this:

    >>> start_date, end_date = '2020-01-02', '2020-01-04'
    >>> print(df.query('date >= @start_date and date <= @end_date'))
          col_1       date
    1  0.244104 2020-01-02
    2  0.374775 2020-01-03
    3  0.510053 2020-01-04

If you do not want to include the boundaries, just change the condition as follows:

    >>> print(df.query('date > @start_date and date < @end_date'))
          col_1       date
    2  0.374775 2020-01-03
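A self-contained sketch of the query approach with fixed values, so the result is reproducible. The bounds are passed as pd.Timestamp objects here, which is an assumption made for this sketch rather than something the answer requires; the @ prefix is what lets query see local Python variables.

```python
import pandas as pd

# Fixed, made-up values instead of random numbers.
df = pd.DataFrame({
    "col_1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
})

start_date = pd.Timestamp("2020-01-02")
end_date = pd.Timestamp("2020-01-04")

# '@name' refers to a local variable; plain names refer to columns.
with_bounds = df.query("date >= @start_date and date <= @end_date")
without_bounds = df.query("date > @start_date and date < @end_date")

print(with_bounds)
print(without_bounds)
```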

User · Answer

You can use the isin method on the date column like so:

    df[df['date'].isin(pd.date_range(start_date, end_date))]

Note: This only works with dates (as the question asks) and not timestamps.

Example:

    import numpy as np
    import pandas as pd

    # Make a DataFrame with dates and random numbers
    df = pd.DataFrame(np.random.random((30, 3)))
    df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

    # Select the rows between two dates
    in_range_df = df[df['date'].isin(pd.date_range('2017-01-15', '2017-01-20'))]

    print(in_range_df)  # print result

which gives

               0         1         2       date
    14  0.960974  0.144271  0.839593 2017-01-15
    15  0.814376  0.723757  0.047840 2017-01-16
    16  0.911854  0.123130  0.120995 2017-01-17
    17  0.505804  0.416935  0.928514 2017-01-18
    18  0.204869  0.708258  0.170792 2017-01-19
    19  0.014389  0.214510  0.045201 2017-01-20
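To make the dates-versus-timestamps caveat concrete, here is a small sketch with a made-up ts column. Using Series.dt.normalize() to drop the time of day before matching is one possible workaround, not part of the answer above.

```python
import pandas as pd

# Hypothetical timestamps: two at midnight, one with a time of day.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2017-01-15 00:00", "2017-01-16 09:30", "2017-01-21 00:00"]),
})
days = pd.date_range("2017-01-15", "2017-01-20")  # midnight of each day

# isin needs an exact match, so the 09:30 row is missed.
exact = df[df["ts"].isin(days)]

# Resetting each timestamp to midnight first matches on the calendar day.
by_day = df[df["ts"].dt.normalize().isin(days)]

print(exact)
print(by_day)
```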

User · Answer

You can also use between:

    df[df.some_date.between(start_date, end_date)]

User · Answer

There are two possible solutions:

1. Use a boolean mask, then use df.loc[mask]
2. Set the date column as a DatetimeIndex, then use df[start_date : end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

    df['date'] = pd.to_datetime(df['date'])

Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:

    # greater than the start date and smaller than or equal to the end date
    mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

    df.loc[mask]

or re-assign to df

    df = df.loc[mask]

For example,

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.random((200, 3)))
    df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
    mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
    print(df.loc[mask])

yields

                0         1         2       date
    153  0.208875  0.727656  0.037787 2000-06-02
    154  0.750800  0.776498  0.237716 2000-06-03
    155  0.812008  0.127338  0.397240 2000-06-04
    156  0.639937  0.207359  0.533527 2000-06-05
    157  0.416998  0.845658  0.872826 2000-06-06
    158  0.440069  0.338690  0.847545 2000-06-07
    159  0.202354  0.624833  0.740254 2000-06-08
    160  0.465746  0.080888  0.155452 2000-06-09
    161  0.858232  0.190321  0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.random((200, 3)))
    df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
    df = df.set_index(['date'])
    print(df.loc['2000-6-1':'2000-6-10'])

yields

                       0         1         2
    date
    2000-06-01  0.040457  0.326594  0.492136    <- includes start_date
    2000-06-02  0.279323  0.877446  0.464523
    2000-06-03  0.328068  0.837669  0.608559
    2000-06-04  0.107959  0.678297  0.517435
    2000-06-05  0.131555  0.418380  0.025725
    2000-06-06  0.999961  0.619517  0.206108
    2000-06-07  0.129270  0.024533  0.154769
    2000-06-08  0.441010  0.741781  0.470402
    2000-06-09  0.682101  0.375660  0.009916
    2000-06-10  0.754488  0.352293  0.339337

While Python list indexing, e.g. seq[start:end], includes start but not end, Pandas df.loc[start_date : end_date] in contrast includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.

Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
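Tying this back to the question, here is a minimal sketch that reads a CSV with parse_dates and then applies both selections. The CSV text, the price column and the use of io.StringIO in place of a real file are assumptions made so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = """date, price
2000-05-31, 1.0
2000-06-01, 2.0
2000-06-05, 3.0
2000-06-10, 4.0
2000-06-11, 5.0
"""

# parse_dates turns the 'date' column into datetime64 while reading.
stock = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, parse_dates=["date"])

# Solution 1: boolean mask (start exclusive, end inclusive).
mask = (stock["date"] > "2000-6-1") & (stock["date"] <= "2000-6-10")
by_mask = stock.loc[mask]

# Solution 2: DatetimeIndex plus a label slice (both end-points inclusive).
# 2000-06-02 is not in the index, which is fine because the index is sorted.
indexed = stock.set_index("date")
by_slice = indexed.loc["2000-06-02":"2000-06-10"]

print(by_mask)
print(by_slice)
```

Both selections return the rows for 2000-06-05 and 2000-06-10.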

User · Answer

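Inspired by unutbu:

    print(df.dtypes)                                  # Make sure the format is 'object'. Rerunning this after setting the index will not show the values.
    columnName = 'YourColumnName'
    df[columnName + 'index'] = df[columnName]         # Create a new column for the index
    df.set_index(columnName + 'index', inplace=True)  # Build the index on the timestamps/dates
    df.loc['2020-09-03 01:00':'2020-09-06']           # Select a range from the index. This is your new DataFrame.

If the column is reported as 'object' (strings), the slice compares the labels as text; converting the column with pd.to_datetime before setting the index gives a DatetimeIndex and date-aware slicing.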

User · Answer

You can do it with pd.date_range() and Timestamp. Let's say you have read a csv file with a date column using the parse_dates option:

    df = pd.read_csv('my_file.csv', parse_dates=['my_date_col'])

Then you can define a date range index:

    rge = pd.date_range(end='15/6/2020', periods=2)

and then filter your values by date thanks to a map:

    df.loc[df['my_date_col'].map(lambda row: row.date() in rge)]

User · Answer

I prefer not to alter the df.

An option is to retrieve the index of the start and end dates:

    import numpy as np
    import pandas as pd

    # Dummy DataFrame
    df = pd.DataFrame(np.random.random((30, 3)))
    df['date'] = pd.date_range('2017-1-1', periods=30, freq='D')

    # Get the index of the start and end dates respectively
    start = df[df['date'] == '2017-01-07'].index[0]
    end = df[df['date'] == '2017-01-14'].index[0]

    # Show the sliced df (from 2017-01-07 to 2017-01-14)
    df.loc[start:end]

which results in:

          0    1    2       date
    6   0.5  0.8  0.8 2017-01-07
    7   0.0  0.7  0.3 2017-01-08
    8   0.8  0.9  0.0 2017-01-09
    9   0.0  0.2  1.0 2017-01-10
    10  0.6  0.1  0.9 2017-01-11
    11  0.5  0.3  0.9 2017-01-12
    12  0.5  0.4  0.3 2017-01-13
    13  0.4  0.9  0.9 2017-01-14
