Add missing dates to pandas dataframe

Question

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

    idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
    s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013. However, s may only have 25 or 26 days because no events happened on a given date. I then get an AssertionError as the sizes don't match when I try to plot:

    fig, ax = plt.subplots()
    ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Do I want to remove dates with no values from idx or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 04 and 05:

    09-02-2013     2
    09-03-2013    10
    09-06-2013     5
    09-07-2013     1

User · Answer

A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().

    # "broken" (staggered) dates
    dates = pd.Index([pd.Timestamp('2012-05-01'),
                      pd.Timestamp('2012-05-04'),
                      pd.Timestamp('2012-05-06')])
    s = pd.Series([1, 2, 3], dates)

    print(s.asfreq('D'))

    2012-05-01    1.0
    2012-05-02    NaN
    2012-05-03    NaN
    2012-05-04    2.0
    2012-05-05    NaN
    2012-05-06    3.0
    Freq: D, dtype: float64
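For the count series in the question, the same call can fill the gaps with zeros rather than NaN. The following is a minimal sketch, assuming a pandas version whose `asfreq` accepts `fill_value` and reusing the question's sample counts. Note that `asfreq` only spans the first to the last date already present in the series, so it will not pad out to a wider range such as a full month.

```python
import pandas as pd

# Sample counts from the question (no entries for 09-04 and 09-05).
counts = pd.Series(
    [2, 10, 5, 1],
    index=pd.to_datetime(["2013-09-02", "2013-09-03", "2013-09-06", "2013-09-07"]),
)

# Insert the missing days and give them a count of 0 instead of NaN.
daily = counts.asfreq("D", fill_value=0)
print(daily)
```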

User · Answer

You could use Series.reindex:

    import pandas as pd

    idx = pd.date_range('09-01-2013', '09-30-2013')

    s = pd.Series({'09-02-2013': 2,
                   '09-03-2013': 10,
                   '09-06-2013': 5,
                   '09-07-2013': 1})
    s.index = pd.DatetimeIndex(s.index)

    s = s.reindex(idx, fill_value=0)
    print(s)

yields

    2013-09-01     0
    2013-09-02     2
    2013-09-03    10
    2013-09-04     0
    2013-09-05     0
    2013-09-06     5
    2013-09-07     1
    2013-09-08     0
    ...
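Applied to the question's workflow, the range can be derived from the data itself and the grouped counts reindexed against it. This is a sketch only: the event log below is made up for illustration, and only the `simpleDate` column name comes from the question.

```python
import pandas as pd

# Hypothetical event log: one row per event, several events can share a date.
df = pd.DataFrame(
    {"simpleDate": pd.to_datetime(
        ["2013-09-02", "2013-09-02", "2013-09-03", "2013-09-06"])}
)

# Count events per date, as in the question.
s = df.groupby("simpleDate").size()

# Build the full daily range and add the missing dates with a count of 0.
idx = pd.date_range(df["simpleDate"].min(), df["simpleDate"].max())
s = s.reindex(idx, fill_value=0)

# idx and s now have the same length, so they can be plotted against each other.
print(s)
```

After the reindex, `idx` and `s` line up one-to-one, which is what a call like `ax.bar(idx.to_pydatetime(), s)` needs.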

User · Answer

One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:

    df = pd.DataFrame({
        'timestamps': pd.to_datetime(
            ['2016-11-15 1:00', '2016-11-16 2:00', '2016-11-16 3:00', '2016-11-18 4:00']),
        'values': ['a', 'b', 'c', 'd']})
    df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
    df

yields

                 timestamps             values
    2016-11-15   2016-11-15 01:00:00    a
    2016-11-16   2016-11-16 02:00:00    b
    2016-11-16   2016-11-16 03:00:00    c
    2016-11-18   2016-11-18 04:00:00    d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

    all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
    df.reindex(all_days)

fails with:

    ValueError: cannot reindex from a duplicate axis

(by this it means the index has duplicates, not that it is itself a dup)

Instead, we can use .loc to look up entries for all dates in range:

    df.loc[all_days]

yields

                 timestamps             values
    2016-11-15   2016-11-15 01:00:00    a
    2016-11-16   2016-11-16 02:00:00    b
    2016-11-16   2016-11-16 03:00:00    c
    2016-11-17   NaN                    NaN
    2016-11-18   2016-11-18 04:00:00    d

fillna can be used on the column series to fill blanks if needed.
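Since pandas 1.0, passing `.loc` a list containing labels that are missing from the index raises a `KeyError`, so the lookup above only works on older versions. One possible alternative, sketched here as a swapped-in technique rather than the answer's own method, is to left-join the data onto an empty frame indexed by every day in the range. The index name `day` is an assumption added to keep the index distinct from the `timestamps` column.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 1:00", "2016-11-16 2:00", "2016-11-16 3:00", "2016-11-18 4:00"]),
    "values": ["a", "b", "c", "d"],
})
# Index by calendar day; "day" is an illustrative name for the index.
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D").rename("day")

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")

# Left-join the data onto an empty frame indexed by every day in the range.
# Duplicate dates are kept as separate rows; absent dates become NaN/NaT rows.
filled = pd.DataFrame(index=all_days).join(df)
print(filled)
```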

User · Answer

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

    df.resample('D').mean()

resample is a deferred operation like groupby, so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for '2013-09-03':

                 val
    date
    2013-09-02     2
    2013-09-03    10
    2013-09-03    20    <- duplicate date added to OP's data
    2013-09-06     5
    2013-09-07     1

And here are the results:

                 val
    date
    2013-09-02   2.0
    2013-09-03  15.0    <- mean of original values for 2013-09-03
    2013-09-04   NaN    <- NaN b/c date not present in orig
    2013-09-05   NaN    <- NaN b/c date not present in orig
    2013-09-06   5.0
    2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP, or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
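A runnable sketch of the above, rebuilding the answer's sample data by hand. It shows both the mean followed by `fillna(0)` and, as an assumption about what suits event counts, a daily sum, which yields 0 for days with no rows.

```python
import pandas as pd

# The answer's sample data, including the duplicated 2013-09-03 entry.
df = pd.DataFrame(
    {"val": [2, 10, 20, 5, 1]},
    index=pd.DatetimeIndex(
        ["2013-09-02", "2013-09-03", "2013-09-03", "2013-09-06", "2013-09-07"],
        name="date"),
)

# Mean per day, with the missing days filled with 0 afterwards.
daily_mean = df.resample("D").mean().fillna(0)

# Sum per day; days with no rows come out as 0 without a separate fillna step.
daily_sum = df.resample("D").sum()

print(daily_mean)
print(daily_sum)
```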

User · Answer

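Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:

    from datetime import datetime, timedelta

    import pandas as pd

    def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
        # Note: set_index(..., inplace=True) modifies the caller's dataframe.
        df.set_index(date_col_name, drop=True, inplace=True)
        df.index = pd.DatetimeIndex(df.index)
        d = datetime.now().date()
        d2 = d - timedelta(days=days_back)
        # One entry per day, from days_back days ago through today.
        idx = pd.date_range(d2, d, freq="D")
        df = df.reindex(idx, fill_value=fill_value)
        df[date_col_name] = pd.DatetimeIndex(df.index)
        # Apply the requested sort order ('asc' or anything else for descending).
        df = df.sort_index(ascending=(date_order == 'asc'))
        return df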
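Because the function above anchors its range to today's date, its output changes from day to day. The sketch below is a hypothetical variant, not the answer's function: the name `fill_missing_dates_between` and the `start` and `end` parameters are assumptions introduced for illustration, and it works on a copy instead of modifying the input.

```python
import pandas as pd

def fill_missing_dates_between(df, date_col_name="date", start=None, end=None, fill_value=0):
    """Return a copy of df with one row per day between start and end.

    Hypothetical variant for illustration: start and end default to the
    earliest and latest dates found in the date column.
    """
    out = df.set_index(date_col_name)          # returns a new frame; df is untouched
    out.index = pd.DatetimeIndex(out.index)
    start = out.index.min() if start is None else pd.Timestamp(start)
    end = out.index.max() if end is None else pd.Timestamp(end)
    idx = pd.date_range(start, end, freq="D")
    out = out.reindex(idx, fill_value=fill_value)
    out[date_col_name] = out.index             # restore the date column
    return out

# Example data, made up for illustration.
events = pd.DataFrame({"date": ["2013-09-02", "2013-09-03", "2013-09-06"],
                       "count": [2, 10, 5]})
filled = fill_missing_dates_between(events, start="2013-09-01", end="2013-09-07")
print(filled)
```

Like the original, this relies on `reindex`, so it assumes the date column has no duplicate dates.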
