[python] pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

This question is related to python pandas join merge

The answer is


Assumed imports:

import pandas as pd

John Galt's answer is basically a reduce operation. If I have more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, dfN]

Assuming they have some common column, like name in your example, I'd do the following:

df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you'll first need to import that module:

from functools import reduce

Simple Solution:

If the column names are similar:

 df1.merge(df2,on='col_name').merge(df3,on='col_name')

If the column names are different:

df1.merge(df2,left_on='col_name1', right_on='col_name2').merge(df3,left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1':'col_name'})

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. Also it fills in missing values if needed:

This is the function to merge a dict of data frames

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
  keys = dfDict.keys()
  for i in range(len(keys)):
    key = keys[i]
    df0 = dfDict[key]
    cols = list(df0.columns)
    valueCols = list(filter(lambda x: x not in (onCols), cols))
    df0 = df0[onCols + valueCols]
    df0.columns = onCols + [(s + '_' + key) for s in valueCols] 

    if (i == 0):
      outDf = df0
    else:
      outDf = pd.merge(outDf, df0, how=how, on=onCols)   

  if (naFill != None):
    outDf = outDf.fillna(naFill)

  return(outDf)

OK, lets generates data and test this:

def GenDf(size):
  df = pd.DataFrame({'categ1':np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                      'categ2':np.random.choice(a=['A', 'B'], size=size, replace=True), 
                      'col1':np.random.uniform(low=0.0, high=100.0, size=size), 
                      'col2':np.random.uniform(low=0.0, high=100.0, size=size)
                      })
  df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
  return(df)


size = 5
dfDict = {'US':GenDf(size), 'IN':GenDf(size), 'GER':GenDf(size)}   
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

One does not need a multiindex to perform join operations. One just need to set correctly the index column on which to perform the join operations (which command df.set_index('Name') for example)

The join operation is by default performed on index. In your case, you just have to specify that the Name column corresponds to your index. Below is an example

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 1) Select the index from column 'Name'
df1 = df1.set_index('Name')

# If indexes are different, one may have to play with parameter how
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')

You could try this if you have 3 dataframes

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')

alternatively, as mentioned by cwharland

df1.merge(df2,on='name').merge(df3,on='name')

The three dataframes are

enter image description here

enter image description here

Let's merge these frames using nested pd.merge

enter image description here

Here we go, we have our merged dataframe.

Happy Analysis!!!


I tweaked the accepted answer to perform the operation for multiple dataframes on different suffix parameters using reduce and i guess it can be extended to different on parameters as well.

from functools import reduce 

dfs_with_suffixes = [(df2,suffix2), (df3,suffix3), 
                     (df4,suffix4)]

merge_one = lambda x,y,sfx:pd.merge(x,y,on=['col1','col2'..], suffixes=sfx)

merged = reduce(lambda left,right:merge_one(left,*right), dfs_with_suffixes, df1)

This is an ideal situation for the join method

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames)]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name                                          
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

There is another solution from the pandas documentation (that I don't see here),

using the .append

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If there are different column names, Nan will be introduced.


In python 3.6.3 with pandas 0.22.0 you can also use concat as long as you set as index the columns you want to use for the joining

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer

import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to pandas

xlrd.biffh.XLRDError: Excel xlsx file; not supported Pandas Merging 101 How to increase image size of pandas.DataFrame.plot in jupyter notebook? Trying to merge 2 dataframes but get ValueError Python Pandas User Warning: Sorting because non-concatenation axis is not aligned How to show all of columns name on pandas dataframe? Pandas/Python: Set value of one column based on value in another column Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Python convert object to float

Examples related to join

Pandas Merging 101 pandas: merge (join) two data frames on multiple columns How to use the COLLATE in a JOIN in SQL Server? How to join multiple collections with $lookup in mongodb How to join on multiple columns in Pyspark? Pandas join issue: columns overlap but no suffix specified MySQL select rows where left join is null How to return rows from left table not found in right table? Why do multiple-table joins produce duplicate rows? pandas three-way joining multiple dataframes on columns

Examples related to merge

Pandas Merging 101 Python: pandas merge multiple dataframes Git merge with force overwrite Merge two dataframes by index Visual Studio Code how to resolve merge conflicts with git? merge one local branch into another local branch Merging dataframes on index with pandas Git merge is not possible because I have unmerged files Git merge develop into feature branch outputs "Already up-to-date" while it's not How merge two objects array in angularjs?