[python] Pandas: rolling mean by time interval

I'm new to Pandas... I've got a bunch of polling data; I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand from this question, the rolling_* functions compute the window based on a specified number of values, not a specified datetime range.

Is there a different function that implements this functionality? Or am I stuck writing my own?

EDIT:

Sample input data:

polls_subset.tail(20)
Out[185]: 
            favorable  unfavorable  other
enddate                                  
2012-10-25       0.48         0.49   0.03
2012-10-25       0.51         0.48   0.02
2012-10-27       0.51         0.47   0.02
2012-10-26       0.56         0.40   0.04
2012-10-28       0.48         0.49   0.04
2012-10-28       0.46         0.46   0.09
2012-10-28       0.48         0.49   0.03
2012-10-28       0.49         0.48   0.03
2012-10-30       0.53         0.45   0.02
2012-11-01       0.49         0.49   0.03
2012-11-01       0.47         0.47   0.05
2012-11-01       0.51         0.45   0.04
2012-11-03       0.49         0.45   0.06
2012-11-04       0.53         0.39   0.00
2012-11-04       0.47         0.44   0.08
2012-11-04       0.49         0.48   0.03
2012-11-04       0.52         0.46   0.01
2012-11-04       0.50         0.47   0.03
2012-11-05       0.51         0.46   0.02
2012-11-07       0.51         0.41   0.00

Output would have only one row for each date.

EDIT x2: fixed typo

Tags: python, pandas, time-series

Answers:


To keep it basic, I used a loop and something like this to get you started (my index is made of datetimes):

import pandas as pd
import datetime as dt

#populate your dataframe: "df"
#...

df[df.index < (df.index[0] + dt.timedelta(hours=1))]  # gives you a slice; you can then take .sum(), .mean(), whatever

and then you can run functions on that slice. From there, adding an iterator that makes the start of the window something other than the first value in your DataFrame's index would roll the window forward (you could also use a > rule for the start, for example).

Note that this may be inefficient for very large data or very small increments, since the repeated slicing becomes expensive (it works well enough for me on hundreds of thousands of rows and several columns, with hourly windows across a few weeks).
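For example, a minimal sketch of that loop-based idea, assuming df has a DatetimeIndex and using the question's three-day window (the helper name window_mean is just illustrative):

import pandas as pd
import datetime as dt

def window_mean(df, window=dt.timedelta(days=3)):
    # one slice per timestamp: everything in (ts - window, ts] is averaged
    out = {}
    for ts in df.index:
        dslice = df[(df.index > ts - window) & (df.index <= ts)]
        out[ts] = dslice.mean()
    return pd.DataFrame(out).T

# smoothed = window_mean(polls_subset)  # one row per timestamp in the index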


What about something like this:

First resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1:

pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.527500     0.442500  0.032500
2012-10-27   0.521667     0.451667  0.028333
2012-10-28   0.515833     0.450000  0.035833
2012-10-29   0.488333     0.476667  0.038333
2012-10-30   0.495000     0.470000  0.038333
2012-10-31   0.512500     0.460000  0.029167
2012-11-01   0.516667     0.456667  0.026667
2012-11-02   0.503333     0.463333  0.033333
2012-11-03   0.490000     0.463333  0.046667
2012-11-04   0.494000     0.456000  0.043333
2012-11-05   0.500667     0.452667  0.036667
2012-11-06   0.507333     0.456000  0.023333
2012-11-07   0.510000     0.443333  0.013333

UPDATE: As Ben points out in the comments, with pandas 0.18.0 the syntax has changed. With the new syntax this would be:

df.resample("1d").sum().fillna(0).rolling(window=3, min_periods=1).mean()

This example seems to call for a weighted mean as suggested in @andyhayden's comment. For example, there are two polls on 10/25 and one each on 10/26 and 10/27. If you just resample and then take the mean, this effectively gives twice as much weighting to the polls on 10/26 and 10/27 compared to the ones on 10/25.

To give equal weight to each poll rather than equal weight to each day, you could do something like the following.

>>> wt = df.resample('D').count()

            favorable  unfavorable  other
enddate                                  
2012-10-25          2            2      2
2012-10-26          1            1      1
2012-10-27          1            1      1

>>> df2 = df.resample('D').mean()

            favorable  unfavorable  other
enddate                                  
2012-10-25      0.495        0.485  0.025
2012-10-26      0.560        0.400  0.040
2012-10-27      0.510        0.470  0.020

That gives you the raw ingredients for doing a poll-based mean instead of a day-based mean. As before, the polls are averaged on 10/25, but the weight for 10/25 is also stored and is double the weight on 10/26 or 10/27 to reflect that two polls were taken on 10/25.

>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()

>>> df3 = df3 / wt3  

            favorable  unfavorable     other
enddate                                     
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.516667     0.456667  0.030000
2012-10-27   0.515000     0.460000  0.027500
2012-10-28   0.496667     0.465000  0.041667
2012-10-29   0.484000     0.478000  0.042000
2012-10-30   0.488000     0.474000  0.042000
2012-10-31   0.530000     0.450000  0.020000
2012-11-01   0.500000     0.465000  0.035000
2012-11-02   0.490000     0.470000  0.040000
2012-11-03   0.490000     0.465000  0.045000
2012-11-04   0.500000     0.448333  0.035000
2012-11-05   0.501429     0.450000  0.032857
2012-11-06   0.503333     0.450000  0.028333
2012-11-07   0.510000     0.435000  0.010000

Note that the rolling mean for 10/27 is now 0.515000 (poll-weighted) rather than 0.521667 (day-weighted).
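Pulling those steps together, a compact version of the poll-weighted calculation (assuming df is the polling frame with enddate as its DatetimeIndex) looks like:

wt = df.resample('D').count()                       # number of polls per day
daily = df.resample('D').mean()                     # mean of the polls taken each day
num = (daily * wt).rolling(3, min_periods=1).sum()  # poll-weighted numerator
den = wt.rolling(3, min_periods=1).sum()            # total polls in the 3-day window
poll_weighted = num / den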

Also note that there have been changes to the APIs for resample and rolling as of version 0.18.0.

rolling (what's new in pandas 0.18.0)

resample (what's new in pandas 0.18.0)


user2689410's code was exactly what I needed. Here is my version (credit to user2689410), which is faster because it computes the mean for whole DataFrame rows at once.

Hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where you find multiple suffixes, the type can be either.

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410 at http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data_df_ser : DataFrame or Series
         If a DataFrame is passed, the time_offset_rolling_mean_df_ser is computed for all columns.
    window_i_s : int or string
         If int is passed, window_i_s is the number of observations used for calculating
         the statistic, as defined by the function pd.rolling_mean().
         If a string is passed, it must be a frequency string, e.g. '90S'. This is
         internally converted into a DateOffset object, representing the window_i_s size.
    min_periods_i : int
         Minimum number of observations in window_i_s required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """

    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta+timedelta(0,0,1):
                ts
            ]
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
        else:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta/2+timedelta(0,0,1):
                ts+pd.datetools.to_offset(window_i_s).delta/2
            ]
        if  (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
            (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            return dslice_df_ser.mean()*np.nan   # keeps number format and whether Series or DataFrame
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s, min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)

    return mean_df_ser
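As a quick usage sketch on the question's polling data (assuming polls_subset has enddate as its DatetimeIndex; '3D' is a fixed-frequency offset, so the .delta used internally is available):

smoothed_df = time_offset_rolling_mean_df_ser(polls_subset, window_i_s='3D')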

I just had the same question but with irregularly spaced data points. Resample is not really an option here, so I created my own function. Maybe it will be useful for others too:

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
             If int is passed, window is the number of observations used for calculating 
             the statistic, as defined by the function pd.rolling_mean()
             If a string is passed, it must be a frequency string, e.g. '90S'. This is
             internally converted into a DateOffset object, representing the window size.
    min_periods : int
                  Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column    
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            dslice = col[x-pd.datetools.to_offset(window).delta+timedelta(0,0,1):x]
                # adding a microsecond because when slicing with labels start and endpoint
                # are inclusive
        else:
            dslice = col[x-pd.datetools.to_offset(window).delta/2+timedelta(0,0,1):
                         x+pd.datetools.to_offset(window).delta/2]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iterkv():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:,0]
    return dfout


# Example
idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 7, 0, 1, 30),
       datetime(2011, 2, 7, 0, 2),
       datetime(2011, 2, 7, 0, 4),
       datetime(2011, 2, 7, 0, 5),
       datetime(2011, 2, 7, 0, 5, 10),
       datetime(2011, 2, 7, 0, 6),
       datetime(2011, 2, 7, 0, 8),
       datetime(2011, 2, 7, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
rm = rolling_mean(s, window='2min')
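For what it's worth, newer pandas releases (0.19+) accept a time-offset string directly in rolling when the index is a DatetimeIndex, which should reproduce this example without a custom function (the window is right-closed by default, matching the microsecond trick above):

rm_builtin = s.rolling('2min', min_periods=1).mean()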

Visualize the rolling averages to see if they make sense. I don't understand why sum was used when a rolling average was requested.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('poll.csv', parse_dates=['enddate'],
                 dtype={'favorable': np.float64, 'unfavorable': np.float64, 'other': np.float64})
df = df.set_index('enddate').fillna(0)

# plot the raw data first
fig, axs = plt.subplots(figsize=(5, 10))
df.plot(ax=axs)
plt.show()

df.rolling(window=3, min_periods=3).mean().plot()
plt.show()
print("The larger the window coefficient the smoother the line will appear")
print('The min_periods is the minimum number of observations in the window required to have a value')

df.rolling(window=6, min_periods=3).mean().plot()
plt.show()

I found that user2689410 code broke when I tried with window='1M' as the delta on business month threw this error:

AttributeError: 'MonthEnd' object has no attribute 'delta'

I added the option to pass a relative time delta directly, so you can do similar things for user-defined periods.

Thanks for the pointers, here's my attempt - hope it's of use.

def rolling_mean(data, window, min_periods=1, center=False):
    """ Function that computes a rolling mean
    Reference:
        http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int, string, Timedelta or Relativedelta
             int - number of observations used for calculating the statistic,
                   as defined by the function pd.rolling_mean()
             string - must be a frequency string, e.g. '90S'. This is
                      internally converted into a DateOffset object, and then a
                      Timedelta representing the window size.
             Timedelta / Relativedelta - Can directly pass a timedelta.
    min_periods : int
                  Minimum number of observations in window required to have a value.
    center : bool
             Whether to center the window around each timestamp.

    Returns
    -------
    Series or DataFrame, if more than one column
    """
    def f(x, time_increment):
        """Function to apply that actually computes the rolling mean
        :param x:
        :return:
        """
        if not center:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            start_date = x - time_increment + timedelta(0, 0, 1)
            end_date = x
        else:
            start_date = x - time_increment/2 + timedelta(0, 0, 1)
            end_date = x + time_increment/2
        # slice the current column by date
        dslice = col[start_date:end_date]

        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)

    elif isinstance(window, basestring):
        time_delta = pd.datetools.to_offset(window).delta
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')

    elif isinstance(window, (timedelta, relativedelta)):
        time_delta = window
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')

    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout

And an example with a 3-day time window to calculate the mean:

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 8, 0, 1, 30),
       datetime(2011, 2, 9, 0, 2),
       datetime(2011, 2, 10, 0, 4),
       datetime(2011, 2, 11, 0, 5),
       datetime(2011, 2, 12, 0, 5, 10),
       datetime(2011, 2, 12, 0, 6),
       datetime(2011, 2, 13, 0, 8),
       datetime(2011, 2, 14, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
# Now try by passing the 3 days as a relative time delta directly.
rm = rolling_mean(s, window=relativedelta(days=3))
>>> rm
Out[2]: 
2011-02-07 00:00:00    0.0
2011-02-07 00:01:00    0.5
2011-02-08 00:01:30    1.0
2011-02-09 00:02:00    1.5
2011-02-10 00:04:00    3.0
2011-02-11 00:05:00    4.0
2011-02-12 00:05:10    5.0
2011-02-12 00:06:00    5.5
2011-02-13 00:08:00    6.5
2011-02-14 00:09:00    7.5
Name: 0, dtype: float64

Check that your index is really a datetime, not a str. This can be helpful:

data.index = pd.to_datetime(data['Index']).values
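A minimal sketch of that check, assuming the frame is called data (if the dates are still in a column, use the line above instead):

print(data.index.dtype)                  # want datetime64[ns], not object
data.index = pd.to_datetime(data.index)  # convert if the index still holds strings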
