pandas three-way joining multiple dataframes on columns

Question

I have 3 CSV files. Each has the first column as the (string) names of people, while all the other columns in each dataframe are attributes of that person.

How can I "join" together all three CSV documents to create a single CSV with each row having all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

User · Answer

You do not need a multiindex to perform join operations. You just need to set the index column on which the join is performed correctly (with df.set_index('Name'), for example).

The join operation is performed on the index by default. In your case, you just have to specify that the Name column is your index. An example is given below.

A tutorial may be useful.

# Simple example where dataframes index are the name on which to perform
# the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'],         index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'],     index=name)
df = df1.join(df2)
df = df.join(df3)

# If you have a 'Name' column that is not the index of your dataframe,
# one can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name'] = df1.index
# 2) Set the 'Name' column as the index
df1 = df1.set_index('Name')

# If the indexes differ, choose the join type with the `how` parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))

gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
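
To connect the example above back to the CSV files in the question, here is a minimal sketch. The CSV contents, column names, and the in-memory buffers are made up for illustration; with real files you would pass the file paths to pd.read_csv and to_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the three files in the question.
csv_a = "Name,age\nSophia,31\nEmma,27\n"
csv_b = "Name,city\nSophia,Lyon\nEmma,Oslo\n"
csv_c = "Name,score\nEmma,0.8\nMia,0.5\n"

# index_col=0 makes the first column (the names) the index at read time,
# so no separate set_index call is needed.
frames = [pd.read_csv(io.StringIO(text), index_col=0) for text in (csv_a, csv_b, csv_c)]

# how='outer' keeps every name that appears in any of the files;
# attributes missing for a name become NaN.
combined = frames[0].join(frames[1], how="outer").join(frames[2], how="outer")

# Write the result out as a single CSV (here into a string buffer).
out = io.StringIO()
combined.to_csv(out)
```

Using how='inner' instead would keep only the names present in all three files.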

User · Answer

Simple Solution:

If the column names are similar:

df1.merge(df2, on='col_name').merge(df3, on='col_name')

If the column names are different:

df1.merge(df2, left_on='col_name1', right_on='col_name2').merge(df3, left_on='col_name1', right_on='col_name3').drop(columns=['col_name2', 'col_name3']).rename(columns={'col_name1': 'col_name'})
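
As a concrete illustration of the different-column-names case, here is a small sketch. The frames and the column names (name, person, who) are invented for the example.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
people = pd.DataFrame({"name": ["a", "b"], "age": [30, 40]})
cities = pd.DataFrame({"person": ["a", "b"], "city": ["Lyon", "Oslo"]})
scores = pd.DataFrame({"who": ["a", "b"], "score": [0.5, 0.8]})

merged = (
    people.merge(cities, left_on="name", right_on="person")
    .merge(scores, left_on="name", right_on="who")
    # Both key columns survive a left_on/right_on merge, so drop the duplicates.
    .drop(columns=["person", "who"])
)
```

After the drop, only the key column from the first frame remains alongside the attribute columns.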

User · Answer

I tweaked the accepted answer to perform the operation for multiple dataframes with different suffixes parameters using reduce, and I guess it can be extended to different on parameters as well.

from functools import reduce

dfs_with_suffixes = [(df2, suffix2), (df3, suffix3),
                     (df4, suffix4)]

merge_one = lambda x, y, sfx: pd.merge(x, y, on=['col1', 'col2'], suffixes=sfx)

# Each list item is a (dataframe, suffixes) pair, unpacked with *right
merged = reduce(lambda left, right: merge_one(left, *right), dfs_with_suffixes, df1)
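
A runnable sketch of this idea, with made-up data: three frames share both the key column and an attribute column called value, and each merge step gets its own suffixes pair. Leaving the left suffix empty is a choice made here so that the running result keeps a plain value column for the next step to collide with; pandas only applies suffixes to column names that overlap.

```python
from functools import reduce
import pandas as pd

# Hypothetical frames that all have a 'value' column besides the key.
df1 = pd.DataFrame({"key": ["a", "b"], "value": [1, 2]})
df2 = pd.DataFrame({"key": ["a", "b"], "value": [3, 4]})
df3 = pd.DataFrame({"key": ["a", "b"], "value": [5, 6]})

# One (dataframe, suffixes) pair per merge step.
dfs_with_suffixes = [(df2, ("", "_2")), (df3, ("", "_3"))]

def merge_one(left, right, sfx):
    """Merge two frames on 'key', disambiguating overlapping columns with sfx."""
    return pd.merge(left, right, on="key", suffixes=sfx)

# df1 is the initial value; each pair is unpacked into (right, sfx).
merged = reduce(lambda left, pair: merge_one(left, *pair), dfs_with_suffixes, df1)
```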

User · Answer

There is another solution from the pandas documentation (that I don't see here), using .append:

>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8

The ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If there are different column names, NaN will be introduced.
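
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current installation the same row-wise stacking can be written with pd.concat. A minimal sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Stack the rows of df2 under df and renumber the index from 0.
stacked = pd.concat([df, df2], ignore_index=True)
```

Note that this stacks rows on top of each other; it does not align rows by a shared name column the way join or merge does.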

User · Answer

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set as index the columns you want to use for the joining:

pd.concat(
    (iDF.set_index('name') for iDF in [df1, df2, df3]),
    axis=1, join='inner'
).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12']
)
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22']
)
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32']
)

User · Answer

Assumed imports:

import pandas as pd

John Galt's answer is basically a reduce operation. If I have more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):

dfs = [df0, df1, df2, dfN]

Assuming they have some common column, like name in your example, I'd do the following:

df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)

That way, your code should work with whatever number of dataframes you want to merge.

Edit August 1, 2016: For those using Python 3, reduce has been moved into functools. So to use this function, you'll first need to import that module:

from functools import reduce
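
The reduce approach can be wrapped in a small helper that also exposes the how parameter, which matters when not every name appears in every dataframe. This is only a sketch: merge_all and the sample frames are hypothetical names introduced here, not part of pandas.

```python
from functools import reduce
import pandas as pd

def merge_all(frames, on, how="inner"):
    """Fold a list of DataFrames into one by merging pairwise on `on`."""
    return reduce(lambda left, right: pd.merge(left, right, on=on, how=how), frames)

# Hypothetical frames where no single name appears in all three.
df0 = pd.DataFrame({"name": ["a", "b"], "x": [1, 2]})
df1 = pd.DataFrame({"name": ["b", "c"], "y": [3, 4]})
df2 = pd.DataFrame({"name": ["a", "c"], "z": [5, 6]})

# Default inner merge: keeps only names present in every frame (none here).
inner = merge_all([df0, df1, df2], on="name")

# Outer merge: keeps every name and fills missing attributes with NaN.
outer = merge_all([df0, df1, df2], on="name", how="outer")
```

With the default inner merge, a name missing from any one frame disappears from the result, which is easy to overlook when chaining many merges.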

User · Answer

The three dataframes are: [not preserved]

Let's merge these frames using nested pd.merge: [not preserved]

Here we go, we have our merged dataframe.

Happy Analysis!

User · Answer

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary. It also fills in missing values if needed.

This is the function to merge a dict of data frames:

import numpy as np
import pandas as pd

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # list() is needed in Python 3, where dict keys cannot be indexed directly
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in onCols, cols))
        df0 = df0[onCols + valueCols]
        # Tag every value column with the dictionary key it came from
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]

        if i == 0:
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)

    if naFill is not None:
        outDf = outDf.fillna(naFill)

    return outDf

OK, let's generate some data and test this:

def GenDf(size):
    df = pd.DataFrame({'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                       'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
                       'col1': np.random.uniform(low=0.0, high=100.0, size=size),
                       'col2': np.random.uniform(low=0.0, high=100.0, size=size)
                       })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return df

size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

User · Answer

This is an ideal situation for the join method.

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code would look something like this:

filenames = ['fn1', 'fn2', 'fn3', 'fn4']
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])

With @zero's data, you could do this:

df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])

     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9

User · Answer

You could try this if you have 3 dataframes:

# Merge multiple dataframes
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')

Alternatively, as mentioned by cwharland:

df1.merge(df2, on='name').merge(df3, on='name')

User · Answer

This can also be done as follows for a list of dataframes df_list:

df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')

or if the dataframes are in a generator object (e.g. to reduce memory consumption):

df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
