[python] Python: pandas merge multiple dataframes

I have diferent dataframes and need to merge them together based on the date column. If I only had two dataframes, I could use df1.merge(df2, on='date'), to do it with three dataframes, I use df1.merge(df2.merge(df3, on='date'), on='date'), however it becomes really complex and unreadable to do it with multiple dataframes.

All dataframes have one column in common -date, but they don't have the same number of rows nor columns and I only need those rows in which each date is common to every dataframe.

So, I'm trying to write a recursion function that returns a dataframe with all data but it didn't work. How should I merge multiple dataframes then?

I tried diferent ways and got errors like out of range, keyerror 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

dfs = [df1, df2, df3] # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2): # it gets to the second to last and merges it with the last
        return

    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))

An example: df_1:

May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%

df_2:

May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%

df_3:

May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%

Expected merge result:

May 15, 2017;  1,901.00;0.1%;  2,902.00;1000000;0.2%;   3,903.00;2000000;0.3%   

This question is related to python pandas dataframe merge data-analysis

The answer is


@everestial007 's solution worked for me. This is how I improved it for my use case, which is to have the columns of each different df with a different suffix so I can more easily differentiate between the dfs in the final merged dataframe.

from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, df4]
suffixes = [f"_{i}" for i in range(len(dfs))]
# add suffixes to each df
dfs = [dfs[i].add_suffix(suffixes[i]) for i in range(len(dfs))]
# remove suffix from the merging column
dfs = [dfs[i].rename(columns={f"date{suffixes[i]}":"date"}) for i in range(len(dfs))]
# merge
dfs = reduce(lambda left,right: pd.merge(left,right,how='outer', on='date'), dfs)

Look at this pandas three-way joining multiple dataframes on columns

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])

Thank you for your help @jezrael, @zipa and @everestial007, both answers are what I need. If I wanted to make a recursive, this would also work as intended:

def mergefiles(dfs=[], on=''):
    """Merge a list of files based on one column"""
    if len(dfs) == 1:
         return "List only have one element."

    elif len(dfs) == 2:
        df1 = dfs[0]
        df2 = dfs[1]
        df = df1.merge(df2, on=on)
        return df

    # Merge the first and second datafranes into new dataframe
    df1 = dfs[0]
    df2 = dfs[1]
    df = dfs[0].merge(dfs[1], on=on)

    # Create new list with merged dataframe
    dfl = []
    dfl.append(df)

    # Join lists
    dfl = dfl + dfs[2:] 
    dfm = mergefiles(dfl, on)
    return dfm

If you are filtering by common date this will return it:

dfs = [df1, df2, df3]
checker = dfs[-1]
check = set(checker.loc[:, 0])

for df in dfs[:-1]:
    check = check.intersection(set(df.loc[:, 0]))

print(checker[checker.loc[:, 0].isin(check)])

There are 2 solutions for this, but it return all columns separately:

import functools

dfs = [df1, df2, df3]

df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
          date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
                0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
date                                                                       
May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

Looks like the data has the same columns, so you can:

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

merged_df = pd.concat([df1, df2])

functools.reduce and pd.concat are good solutions but in term of execution time pd.concat is the best.

from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, ...]
nan_value = 0

# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)

# solution 2
result_2 = reduce(lambda df_left,df_right: pd.merge(df_left, df_right, 
                                              left_index=True, right_index=True, 
                                              how='outer'), 
                  dfs).fillna(nan_value)

@dannyeuu's answer is correct. pd.concat naturally does a join on index columns, if you set the axis option to 1. The default is an outer join, but you can specify inner join too. Here is an example:

x = pd.DataFrame({'a': [2,4,3,4,5,2,3,4,2,5], 'b':[2,3,4,1,6,6,5,2,4,2], 'val': [1,4,4,3,6,4,3,6,5,7], 'val2': [2,4,1,6,4,2,8,6,3,9]})
x.set_index(['a','b'], inplace=True)
x.sort_index(inplace=True)

y = x.__deepcopy__()
y.loc[(14,14),:] = [3,1]
y['other']=range(0,11)

y.sort_values('val', inplace=True)

z = x.__deepcopy__()
z.loc[(15,15),:] = [3,4]
z['another']=range(0,22,2)
z.sort_values('val2',inplace=True)


pd.concat([x,y,z],axis=1)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to merge

Pandas Merging 101 Python: pandas merge multiple dataframes Git merge with force overwrite Merge two dataframes by index Visual Studio Code how to resolve merge conflicts with git? merge one local branch into another local branch Merging dataframes on index with pandas Git merge is not possible because I have unmerged files Git merge develop into feature branch outputs "Already up-to-date" while it's not How merge two objects array in angularjs?

Examples related to data-analysis

Python: pandas merge multiple dataframes How do I sum values in a column that match a given condition using pandas? Peak signal detection in realtime timeseries data How to sort a dataFrame in python pandas by two or more columns? Fitting polynomial model to data in R