[python] Add missing dates to pandas dataframe

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013. However, s may only have 25 or 26 days because no events happened on some dates. I then get an AssertionError when I try to plot, because the sizes don't match:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Should I remove the dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 04 and 05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1

The answer is
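
The zero-filling approach the question asks about can be sketched with Series.reindex and its fill_value argument. This is a minimal sketch, assuming the counts are indexed by real datetimes (not strings) and that each date appears only once in the index; the sample values are the ones from the question.

```python
import pandas as pd

# Counts per date as in the question's snippet (04 and 05 are missing).
# Assumption: the index holds datetimes, not strings like '09-02-2013'.
s = pd.Series(
    [2, 10, 5, 1],
    index=pd.to_datetime(["2013-09-02", "2013-09-03", "2013-09-06", "2013-09-07"]),
)

# Full daily range between the first and last observed date
idx = pd.date_range(s.index.min(), s.index.max())

# Dates that are absent from s get a count of 0
s_full = s.reindex(idx, fill_value=0)
print(s_full)
```

s_full now has one value per entry in idx, so a call like ax.bar(idx.to_pydatetime(), s_full, color='green') receives x and y of equal length.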


An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
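
To get per-day event counts like the ones in the question, resample can be followed by size() instead of mean(). A minimal sketch, assuming a hypothetical event log with one row per event and a simpleDate column that has already been parsed into datetimes:

```python
import pandas as pd

# Hypothetical event log: one row per event, several events can share a date
events = pd.DataFrame({
    "simpleDate": pd.to_datetime(
        ["2013-09-02", "2013-09-02", "2013-09-03", "2013-09-06"]),
    "event": ["a", "b", "c", "d"],
})

# resample needs a DatetimeIndex, so move the date column into the index;
# size() counts rows per day and reports 0 for days without events
counts = events.set_index("simpleDate").resample("D").size()
print(counts)
```

Because empty days come back as 0 rather than NaN, no fillna step is needed for counts.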

Here is the original data, but with an extra entry for '2013-09-03':

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace the NaNs with zeros, as the OP requested, or use something like interpolate() to fill them with non-zero values based on the neighboring rows.
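
The two fill options can be compared side by side. This is a small sketch that rebuilds the sample data shown above; the variable names are only for illustration.

```python
import pandas as pd

# Sample data from above, including the duplicate 2013-09-03 entry
df = pd.DataFrame(
    {"val": [2, 10, 20, 5, 1]},
    index=pd.to_datetime(
        ["2013-09-02", "2013-09-03", "2013-09-03", "2013-09-06", "2013-09-07"]),
)

daily = df.resample("D").mean()      # NaN for 2013-09-04 and 2013-09-05

zero_filled = daily.fillna(0)        # missing days become 0
interpolated = daily.interpolate()   # missing days filled linearly between 15 and 5

print(zero_filled)
print(interpolated)
```

For event counts, zeros are usually the honest choice; interpolation suits measurements that vary smoothly between observations.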


One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

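(This means that the index contains duplicate labels, not that the index itself is a duplicate. Newer pandas versions word the message as "cannot reindex on an axis with duplicate labels".)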
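
If only per-day counts are needed, one way around the error is to collapse the duplicates first and reindex afterwards. This is a sketch of that idea, rebuilding the example frame; the groupby step is a swapped-in technique and not part of the answer itself.

```python
import pandas as pd

timestamps = pd.to_datetime(
    ["2016-11-15 1:00", "2016-11-16 2:00", "2016-11-16 3:00", "2016-11-18 4:00"])
df = pd.DataFrame({"timestamps": timestamps, "values": ["a", "b", "c", "d"]})
df.index = timestamps.floor("D")

# Detect the problem before calling reindex
print(df.index.is_unique)                  # False: 2016-11-16 appears twice
dupes = df.index[df.index.duplicated()]    # the repeated labels

# Collapse to one row per date, then fill the gaps with 0
per_day = df.groupby(level=0).size()
all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")
full = per_day.reindex(all_days, fill_value=0)
print(full)
```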

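Instead, in pandas versions before 1.0, .loc could look up entries for all dates in the range (since 1.0, passing labels that are missing from the index to .loc raises a KeyError instead):

df.loc[all_days]

which, in those older versions, yields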

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.
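
An option that works in current pandas versions and tolerates the duplicate dates is a left join against an empty frame that carries the full date range. This is a sketch of a swapped-in technique (DataFrame.join), not what the answer above used:

```python
import pandas as pd

timestamps = pd.to_datetime(
    ["2016-11-15 1:00", "2016-11-16 2:00", "2016-11-16 3:00", "2016-11-18 4:00"])
df = pd.DataFrame({"timestamps": timestamps, "values": ["a", "b", "c", "d"]})
df.index = timestamps.floor("D")

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")

# Left join on the index: every day in all_days is kept, duplicate dates
# in df produce one output row each, and unmatched days are filled with NaN/NaT
result = pd.DataFrame(index=all_days).join(df)
print(result)
```

The result has both 2016-11-16 rows plus an empty row for 2016-11-17, matching the table above.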


Here's a function that fills in missing dates in a dataframe over the last days_back days (ending today), with your choice of fill_value and the sort order (date_order) of the result:

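from datetime import datetime, timedelta

import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    # Index a copy by date so the caller's dataframe is not modified
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.DatetimeIndex(df.index)
    # Window: the last `days_back` days, up to and including today
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    # Dates missing from df get fill_value; dates outside the window are dropped
    df = df.reindex(idx, fill_value=fill_value)
    # Apply the requested sort order ('asc' or anything else for descending)
    df = df.sort_index(ascending=(date_order == 'asc'))
    df[date_col_name] = pd.DatetimeIndex(df.index)

    return df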

A quicker workaround is to use .asfreq(), which doesn't require creating a new index to pass to .reindex().

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
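
To get zeros instead of NaN, as the question asks, asfreq also accepts a fill_value argument. A minimal sketch using the same staggered dates:

```python
import pandas as pd

dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

# fill_value is used for the rows that asfreq introduces;
# values that were already present are left unchanged
filled = s.asfreq('D', fill_value=0)
print(filled)
```

Because no NaN is introduced, the values stay integers instead of being upcast to float64.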
