I have 2 DataFrames df1 and df2 with the same column names ['a','b','c'] and indexed by dates. The date index can have similar values. I would like to create a DataFrame df3 with only the data from columns ['c'] renamed respectively 'df1' and 'df2' and with the correct date index. My problem is that I cannot get how to merge the index properly.
df1 = pd.DataFrame(np.random.randn(5,3), index=pd.date_range('01/02/2014',periods=5,freq='D'), columns=['a','b','c'] )
df2 = pd.DataFrame(np.random.randn(8,3), index=pd.date_range('01/01/2014',periods=8,freq='D'), columns=['a','b','c'] )
df1
a b c
2014-01-02 0.580550 0.480814 1.135899
2014-01-03 -1.961033 0.546013 1.093204
2014-01-04 2.063441 -0.627297 2.035373
2014-01-05 0.319570 0.058588 0.350060
2014-01-06 1.318068 -0.802209 -0.939962
df2
a b c
2014-01-01 0.772482 0.899337 0.808630
2014-01-02 0.518431 -1.582113 0.323425
2014-01-03 0.112109 1.056705 -1.355067
2014-01-04 0.767257 -2.311014 0.340701
2014-01-05 0.794281 -1.954858 0.200922
2014-01-06 0.156088 0.718658 -1.030077
2014-01-07 1.621059 0.106656 -0.472080
2014-01-08 -2.061138 -2.023157 0.257151
The df3 DataFrame should have the following form :
df3
df1 df2
2014-01-01 NaN 0.808630
2014-01-02 1.135899 0.323425
2014-01-03 1.093204 -1.355067
2014-01-04 2.035373 0.340701
2014-01-05 0.350060 0.200922
2014-01-06 -0.939962 -1.030077
2014-01-07 NaN -0.472080
2014-01-08 NaN 0.257151
But with NaN in the df1 column as the date index of df2 is wider. (In this example, I would get NaN for the ollowing dates : 2014-01-01, 2014-01-07 and 2014-01-08
)
Thanks for your help.
What you ask for is the join operation.
With the how
argument, you can define how unique indices are handled.
Here, some article, which looks helpful concerning this point.
In the example below, I left out cosmetics (like renaming columns) for simplicity.
Code
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(5,3), index=pd.date_range('01/02/2014',periods=5,freq='D'), columns=['a','b','c'] )
df2 = pd.DataFrame(np.random.randn(8,3), index=pd.date_range('01/01/2014',periods=8,freq='D'), columns=['a','b','c'] )
df3 = df1.join(df2, how='outer', lsuffix='_df1', rsuffix='_df2')
print(df3)
Output
a_df1 b_df1 c_df1 a_df2 b_df2 c_df2
2014-01-01 NaN NaN NaN 0.109898 1.107033 -1.045376
2014-01-02 0.573754 0.169476 -0.580504 -0.664921 -0.364891 -1.215334
2014-01-03 -0.766361 -0.739894 -1.096252 0.962381 -0.860382 -0.703269
2014-01-04 0.083959 -0.123795 -1.405974 1.825832 -0.580343 0.923202
2014-01-05 1.019080 -0.086650 0.126950 -0.021402 -1.686640 0.870779
2014-01-06 -1.036227 -1.103963 -0.821523 -0.943848 -0.905348 0.430739
2014-01-07 NaN NaN NaN 0.312005 0.586585 1.531492
2014-01-08 NaN NaN NaN -0.077951 -1.189960 0.995123
Well, I'm not sure that merge would be the way to go. Personally I would build a new data frame by creating an index of the dates and then constructing the columns using list comprehensions. Possibly not the most pythonic way, but it seems to work for me!
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(5,3), index=pd.date_range('01/02/2014',periods=5,freq='D'), columns=['a','b','c'] )
df2 = pd.DataFrame(np.random.randn(8,3), index=pd.date_range('01/01/2014',periods=8,freq='D'), columns=['a','b','c'] )
# Create an index list from the set of dates in both data frames
Index = list(set(list(df1.index) + list(df2.index)))
Index.sort()
df3 = pd.DataFrame({'df1': [df1.loc[Date, 'c'] if Date in df1.index else np.nan for Date in Index],\
'df2': [df2.loc[Date, 'c'] if Date in df2.index else np.nan for Date in Index],},\
index = Index)
df3
Source: Stackoverflow.com