Joining Spark dataframes on the key

Question

I have constructed two dataframes. How can we join multiple Spark dataframes?

For example: PersonDf and ProfileDf, with a common column personId as the key.

Now, how can we have one dataframe combining PersonDf and ProfileDf?

User · Accepted Answer

Alias approach using Scala (this example was written for an older version of Spark; for Spark 2.x see my other answer):

You can use a case class to prepare a sample dataset. This is optional; for example, you can get a DataFrame from hiveContext.sql as well.

    import org.apache.spark.sql.functions.col

    case class Person(name: String, age: Int, personid: Int)
    case class Profile(name: String, personid: Int, profileDescription: String)

    val df1 = sqlContext.createDataFrame(
      Person("Bindu", 20, 2)
        :: Person("Raphel", 25, 5)
        :: Person("Ram", 40, 9) :: Nil)

    val df2 = sqlContext.createDataFrame(
      Profile("Spark", 2, "SparkSQLMaster")
        :: Profile("Spark", 5, "SparkGuru")
        :: Profile("Spark", 9, "DevHunter") :: Nil)

    // you can use aliases to refer to column names, which increases readability
    val df_asPerson = df1.as("dfperson")
    val df_asProfile = df2.as("dfprofile")

    val joined_df = df_asPerson.join(
      df_asProfile,
      col("dfperson.personid") === col("dfprofile.personid"),
      "inner")

    joined_df.select(
      col("dfperson.name"),
      col("dfperson.age"),
      col("dfprofile.name"),
      col("dfprofile.profileDescription"))
      .show

Sample temp table approach, which I don't like personally:

The reason to use the registerTempTable(tableName) method for a DataFrame is so that, in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql(sqlQuery) method that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.

    df_asPerson.registerTempTable("dfperson")
    df_asProfile.registerTempTable("dfprofile")

    sqlContext.sql("""SELECT dfperson.name, dfperson.age, dfprofile.profileDescription
                      FROM  dfperson JOIN  dfprofile
                      ON dfperson.personid == dfprofile.personid""")

If you want to know more about joins, please see this nice post: beyond-traditional-join-with-apache-spark

Note:

1) As mentioned by RaphaelRoth,

    val resultDf = PersonDf.join(ProfileDf, Seq("personId"))

is a good approach, since it doesn't produce duplicate columns from both sides if you are using an inner join with the same table.

2) A Spark 2.x example is given in another answer, with the full set of join operations supported by Spark 2.x, with examples and results.

TIP: Another important thing in joins: the broadcast function can help to give a hint; please see my answer.
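The tip about the broadcast function can be sketched as follows. This is a minimal illustration, assuming Spark 2.x or later on the classpath; the local SparkSession settings, the tuple-based sample rows, and the names people, profiles, and hinted are illustrative assumptions, not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Assumed local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-hint-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative sample data mirroring the Person/Profile rows above.
val people = Seq(("Bindu", 20, 2), ("Raphel", 25, 5), ("Ram", 40, 9))
  .toDF("name", "age", "personid")
val profiles = Seq(
  ("Spark", 2, "SparkSQLMaster"),
  ("Spark", 5, "SparkGuru"),
  ("Spark", 9, "DevHunter")
).toDF("profileName", "personid", "profileDescription")

// broadcast() marks the (small) right side as a candidate for a broadcast join,
// so the larger side does not have to be shuffled.
val hinted = people.join(broadcast(profiles), Seq("personid"), "inner")

hinted.explain() // inspect the physical plan chosen by Spark
hinted.show()
```

The hint is only a suggestion to the planner; it tends to pay off when one side is small enough to fit comfortably in memory on every executor.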

User · Answer

Posting a Java-based solution, in case your team only uses Java. The keyword inner ensures that only matching rows are present in the final dataframe.

    Dataset<Row> joined = PersonDf.join(ProfileDf,
            PersonDf.col("personId").equalTo(ProfileDf.col("personId")),
            "inner");
    joined.show();

User · Answer

One way:

    // join type can be inner, left, right, fullouter
    val mergedDf = df1.join(df2, Seq("keyCol"), "inner")
    // keyCol can be multiple column names separated by commas
    val mergedDf = df1.join(df2, Seq("keyCol1", "keyCol2"), "left")

Another way:

    import spark.implicits._
    val mergedDf = df1.as("d1").join(df2.as("d2"), ($"d1.colName" === $"d2.colName"))
    // to select specific columns as output
    val mergedDf = df1.as("d1").join(df2.as("d2"), ($"d1.colName" === $"d2.colName")).select($"d1.*", $"d2.anotherColName")
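For a concrete picture of the multi-column variant, here is a small self-contained sketch, assuming Spark 2.x or later. The sample rows and the names leftDf, rightDf, leftVal, and rightVal are made up for illustration; only keyCol1 and keyCol2 come from the answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multi-key-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: both sides share the two key columns.
val leftDf = Seq(("a", 1, "L1"), ("a", 2, "L2"), ("b", 1, "L3"))
  .toDF("keyCol1", "keyCol2", "leftVal")
val rightDf = Seq(("a", 1, "R1"), ("b", 1, "R2"), ("c", 3, "R3"))
  .toDF("keyCol1", "keyCol2", "rightVal")

// Left join on both keys: every left row is kept, and rows without a match
// on (keyCol1, keyCol2) get null in the right-hand columns.
val merged = leftDf.join(rightDf, Seq("keyCol1", "keyCol2"), "left")

merged.show()
```

With the Seq form the key columns appear once in the result, followed by the remaining columns of the left and then the right side.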

User · Answer

Inner join with Scala:

    val joinedDataFrame = PersonDf.join(ProfileDf, "personId")
    joinedDataFrame.show

User · Answer

You can use

    val resultDf = PersonDf.join(ProfileDf, PersonDf("personId") === ProfileDf("personId"))

or, shorter and more flexible (as you can easily specify more than one column for joining):

    val resultDf = PersonDf.join(ProfileDf, Seq("personId"))
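The practical difference between the two forms shows up in the result schema. The sketch below assumes Spark 2.x or later; the sample rows and the names person, profile, byExpr, byName, and dropped are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-key-columns-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative stand-ins for PersonDf and ProfileDf.
val person = Seq(("Bindu", 20, 2), ("Raphel", 25, 5))
  .toDF("name", "age", "personId")
val profile = Seq(("Spark", 2, "SparkSQLMaster"), ("Spark", 5, "SparkGuru"))
  .toDF("profileName", "personId", "profileDescription")

// Expression join: both personId columns survive in the result.
val byExpr = person.join(profile, person("personId") === profile("personId"))

// Column-name join: personId appears only once.
val byName = person.join(profile, Seq("personId"))

// If the expression form is needed, one side's key can be dropped afterwards.
val dropped = byExpr.drop(profile("personId"))

byExpr.printSchema()
byName.printSchema()
dropped.printSchema()
```

Keeping a single key column avoids ambiguous-reference errors when the joined result is later selected or filtered by personId.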

User · Answer

Let me explain with an example.

    // Create emp DataFrame
    import spark.sqlContext.implicits._
    val emp = Seq((1,"Smith",-1,"2018","10","M",3000),
      (2,"Rose",1,"2010","20","M",4000),
      (3,"Williams",1,"2010","10","M",1000),
      (4,"Jones",2,"2005","10","F",2000),
      (5,"Brown",2,"2010","40","",-1),
      (6,"Brown",2,"2010","50","",-1)
    )
    val empColumns = Seq("emp_id","name","superior_emp_id","year_joined",
      "emp_dept_id","gender","salary")
    val empDF = emp.toDF(empColumns:_*)

    // Create dept DataFrame
    val dept = Seq(("Finance",10),
      ("Marketing",20),
      ("Sales",30),
      ("IT",40)
    )
    val deptColumns = Seq("dept_name","dept_id")
    val deptDF = dept.toDF(deptColumns:_*)

Now let's join emp.emp_dept_id with dept.dept_id:

    empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
      .show(false)

This gives the result below:

    +------+--------+---------------+-----------+-----------+------+------+---------+-------+
    |emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
    +------+--------+---------------+-----------+-----------+------+------+---------+-------+
    |1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
    |2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
    |3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
    |4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
    |5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
    +------+--------+---------------+-----------+-----------+------+------+---------+-------+

If you are looking for Python, see PySpark Join with example; the complete Scala example is at Spark Join.

User · Answer

Apart from my answer above, I tried to demonstrate all the Spark joins with the same case classes using Spark 2.x. Here is my LinkedIn article with full examples and explanation.

All join types: the default is inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

    /**
      * @author : Ram Ghadiyaram
      */
    object SparkJoinTypesDemo extends App {
      private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
      spark.sparkContext.setLogLevel("ERROR")

      case class Person(name: String, age: Int, personid: Int)
      case class Profile(profileName: String, personid: Int, profileDescription: String)

      /**
        * @param joinType Type of join to perform. Default `inner`. Must be one of:
        *                 `inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,
        *                 `right`, `right_outer`, `left_semi`, `left_anti`.
        */
      val joinTypes = Seq(
        "inner"
        , "outer"
        , "full"
        , "full_outer"
        , "left"
        , "left_outer"
        , "right"
        , "right_outer"
        , "left_semi"
        , "left_anti"
        //, "cross"
      )

      val df1 = spark.sqlContext.createDataFrame(
        Person("Nataraj", 45, 2)
          :: Person("Srinivas", 45, 5)
          :: Person("Ashik", 22, 9)
          :: Person("Deekshita", 22, 8)
          :: Person("Siddhika", 22, 4)
          :: Person("Madhu", 22, 3)
          :: Person("Meghna", 22, 2)
          :: Person("Snigdha", 22, 2)
          :: Person("Harshita", 22, 6)
          :: Person("Ravi", 42, 0)
          :: Person("Ram", 42, 9)
          :: Person("Chidananda Raju", 35, 9)
          :: Person("Sreekanth Doddy", 29, 9)
          :: Nil)
      val df2 = spark.sqlContext.createDataFrame(
        Profile("Spark", 2, "SparkSQLMaster")
          :: Profile("Spark", 5, "SparkGuru")
          :: Profile("Spark", 9, "DevHunter")
          :: Profile("Spark", 3, "Evangelist")
          :: Profile("Spark", 0, "Committer")
          :: Profile("Spark", 1, "All Rounder")
          :: Nil
      )
      val df_asPerson = df1.as("dfperson")
      val df_asProfile = df2.as("dfprofile")
      val joined_df = df_asPerson.join(
        df_asProfile
        , col("dfperson.personid") === col("dfprofile.personid")
        , "inner")

      println("First example inner join  ")

      // you can use aliases to refer to column names, which increases readability
      joined_df.select(
        col("dfperson.name")
        , col("dfperson.age")
        , col("dfprofile.profileName")
        , col("dfprofile.profileDescription"))
        .show
      println("all joins in a loop")
      joinTypes foreach { joinType =>
        println(s"${joinType.toUpperCase()} JOIN")
        df_asPerson.join(right = df_asProfile, usingColumns = Seq("personid"), joinType = joinType)
          .orderBy("personid")
          .show()
      }
      println(
        """
          |Till 1.x  cross join is :  df_asPerson.join(df_asProfile)
          |
          | Explicit Cross Join in 2.x :
          | http://blog.madhukaraphatak.com/migrating-to-spark-two-part-4
          | Cartesian joins are very expensive without an extra filter that can be pushed down.
          |
          | cross join or cartesian product
          |
        """.stripMargin)

      val crossJoinDf = df_asPerson.crossJoin(right = df_asProfile)
      crossJoinDf.show(200, false)
      println(crossJoinDf.explain())
      println(crossJoinDf.count)

      println("createOrReplaceTempView example ")
      println(
        """
          |Creates a local temporary view using the given name. The lifetime of this
          |   temporary view is tied to the [[SparkSession]] that was used to create this Dataset.
        """.stripMargin)

      df_asPerson.createOrReplaceTempView("dfperson")
      df_asProfile.createOrReplaceTempView("dfprofile")
      val sql =
        s"""
           |SELECT dfperson.name
           |, dfperson.age
           |, dfprofile.profileDescription
           |  FROM  dfperson JOIN  dfprofile
           | ON dfperson.personid == dfprofile.personid
        """.stripMargin
      println(s"createOrReplaceTempView  sql $sql")
      val sqldf = spark.sql(sql)
      sqldf.show

      println(
        """
          |
          |**** EXCEPT DEMO ***
          |
        """.stripMargin)
      println(" df_asPerson.except(df_asProfile) Except demo")
      df_asPerson.except(df_asProfile).show

      println(" df_asProfile.except(df_asPerson) Except demo")
      df_asProfile.except(df_asPerson).show
    }

Result:

    First example inner join
    +---------------+---+-----------+------------------+
    |           name|age|profileName|profileDescription|
    +---------------+---+-----------+------------------+
    |        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       Srinivas| 45|      Spark|         SparkGuru|
    |          Ashik| 22|      Spark|         DevHunter|
    |          Madhu| 22|      Spark|        Evangelist|
    |         Meghna| 22|      Spark|    SparkSQLMaster|
    |        Snigdha| 22|      Spark|    SparkSQLMaster|
    |           Ravi| 42|      Spark|         Committer|
    |            Ram| 42|      Spark|         DevHunter|
    |Chidananda Raju| 35|      Spark|         DevHunter|
    |Sreekanth Doddy| 29|      Spark|         DevHunter|
    +---------------+---+-----------+------------------+

    all joins in a loop
    INNER JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       9|            Ram| 42|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+

    OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+

    FULL JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+

    FULL_OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       4|       Siddhika|  22|       null|              null|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       6|       Harshita|  22|       null|              null|
    |       8|      Deekshita|  22|       null|              null|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+

    LEFT JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       4|       Siddhika| 22|       null|              null|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       6|       Harshita| 22|       null|              null|
    |       8|      Deekshita| 22|       null|              null|
    |       9|            Ram| 42|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+

    LEFT_OUTER JOIN
    +--------+---------------+---+-----------+------------------+
    |personid|           name|age|profileName|profileDescription|
    +--------+---------------+---+-----------+------------------+
    |       0|           Ravi| 42|      Spark|         Committer|
    |       2|        Nataraj| 45|      Spark|    SparkSQLMaster|
    |       2|         Meghna| 22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha| 22|      Spark|    SparkSQLMaster|
    |       3|          Madhu| 22|      Spark|        Evangelist|
    |       4|       Siddhika| 22|       null|              null|
    |       5|       Srinivas| 45|      Spark|         SparkGuru|
    |       6|       Harshita| 22|       null|              null|
    |       8|      Deekshita| 22|       null|              null|
    |       9|Chidananda Raju| 35|      Spark|         DevHunter|
    |       9|Sreekanth Doddy| 29|      Spark|         DevHunter|
    |       9|          Ashik| 22|      Spark|         DevHunter|
    |       9|            Ram| 42|      Spark|         DevHunter|
    +--------+---------------+---+-----------+------------------+

    RIGHT JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+

    RIGHT_OUTER JOIN
    +--------+---------------+----+-----------+------------------+
    |personid|           name| age|profileName|profileDescription|
    +--------+---------------+----+-----------+------------------+
    |       0|           Ravi|  42|      Spark|         Committer|
    |       1|           null|null|      Spark|       All Rounder|
    |       2|         Meghna|  22|      Spark|    SparkSQLMaster|
    |       2|        Snigdha|  22|      Spark|    SparkSQLMaster|
    |       2|        Nataraj|  45|      Spark|    SparkSQLMaster|
    |       3|          Madhu|  22|      Spark|        Evangelist|
    |       5|       Srinivas|  45|      Spark|         SparkGuru|
    |       9|Sreekanth Doddy|  29|      Spark|         DevHunter|
    |       9|          Ashik|  22|      Spark|         DevHunter|
    |       9|Chidananda Raju|  35|      Spark|         DevHunter|
    |       9|            Ram|  42|      Spark|         DevHunter|
    +--------+---------------+----+-----------+------------------+

    LEFT_SEMI JOIN
    +--------+---------------+---+
    |personid|           name|age|
    +--------+---------------+---+
    |       0|           Ravi| 42|
    |       2|        Nataraj| 45|
    |       2|         Meghna| 22|
    |       2|        Snigdha| 22|
    |       3|          Madhu| 22|
    |       5|       Srinivas| 45|
    |       9|Chidananda Raju| 35|
    |       9|Sreekanth Doddy| 29|
    |       9|            Ram| 42|
    |       9|          Ashik| 22|
    +--------+---------------+---+

    LEFT_ANTI JOIN
    +--------+---------+---+
    |personid|     name|age|
    +--------+---------+---+
    |       4| Siddhika| 22|
    |       6| Harshita| 22|
    |       8|Deekshita| 22|
    +--------+---------+---+


    Till 1.x  Cross join is :  df_asPerson.join(df_asProfile)

     Explicit Cross Join in 2.x :
     http://blog.madhukaraphatak.com/migrating-to-spark-two-part-4
     Cartesian joins are very expensive without an extra filter that can be pushed down.

     Cross join or Cartesian product

    +---------------+---+--------+-----------+--------+------------------+
    |name           |age|personid|profileName|personid|profileDescription|
    +---------------+---+--------+-----------+--------+------------------+
    |Nataraj        |45 |2       |Spark      |2       |SparkSQLMaster    |
    |Nataraj        |45 |2       |Spark      |5       |SparkGuru         |
    |Nataraj        |45 |2       |Spark      |9       |DevHunter         |
    |Nataraj        |45 |2       |Spark      |3       |Evangelist        |
    |Nataraj        |45 |2       |Spark      |0       |Committer         |
    |Nataraj        |45 |2       |Spark      |1       |All Rounder       |
    |Srinivas       |45 |5       |Spark      |2       |SparkSQLMaster    |
    |Srinivas       |45 |5       |Spark      |5       |SparkGuru         |
    |Srinivas       |45 |5       |Spark      |9       |DevHunter         |
    |Srinivas       |45 |5       |Spark      |3       |Evangelist        |
    |Srinivas       |45 |5       |Spark      |0       |Committer         |
    |Srinivas       |45 |5       |Spark      |1       |All Rounder       |
    |Ashik          |22 |9       |Spark      |2       |SparkSQLMaster    |
    |Ashik          |22 |9       |Spark      |5       |SparkGuru         |
    |Ashik          |22 |9       |Spark      |9       |DevHunter         |
    |Ashik          |22 |9       |Spark      |3       |Evangelist        |
    |Ashik          |22 |9       |Spark      |0       |Committer         |
    |Ashik          |22 |9       |Spark      |1       |All Rounder       |
    |Deekshita      |22 |8       |Spark      |2       |SparkSQLMaster    |
    |Deekshita      |22 |8       |Spark      |5       |SparkGuru         |
    |Deekshita      |22 |8       |Spark      |9       |DevHunter         |
    |Deekshita      |22 |8       |Spark      |3       |Evangelist        |
    |Deekshita      |22 |8       |Spark      |0       |Committer         |
    |Deekshita      |22 |8       |Spark      |1       |All Rounder       |
    |Siddhika       |22 |4       |Spark      |2       |SparkSQLMaster    |
    |Siddhika       |22 |4       |Spark      |5       |SparkGuru         |
    |Siddhika       |22 |4       |Spark      |9       |DevHunter         |
    |Siddhika       |22 |4       |Spark      |3       |Evangelist        |
    |Siddhika       |22 |4       |Spark      |0       |Committer         |
    |Siddhika       |22 |4       |Spark      |1       |All Rounder       |
    |Madhu          |22 |3       |Spark      |2       |SparkSQLMaster    |
    |Madhu          |22 |3       |Spark      |5       |SparkGuru         |
    |Madhu          |22 |3       |Spark      |9       |DevHunter         |
    |Madhu          |22 |3       |Spark      |3       |Evangelist        |
    |Madhu          |22 |3       |Spark      |0       |Committer         |
    |Madhu          |22 |3       |Spark      |1       |All Rounder       |
    |Meghna         |22 |2       |Spark      |2       |SparkSQLMaster    |
    |Meghna         |22 |2       |Spark      |5       |SparkGuru         |
    |Meghna         |22 |2       |Spark      |9       |DevHunter         |
    |Meghna         |22 |2       |Spark      |3       |Evangelist        |
    |Meghna         |22 |2       |Spark      |0       |Committer         |
    |Meghna         |22 |2       |Spark      |1       |All Rounder       |
    |Snigdha        |22 |2       |Spark      |2       |SparkSQLMaster    |
    |Snigdha        |22 |2       |Spark      |5       |SparkGuru         |
    |Snigdha        |22 |2       |Spark      |9       |DevHunter         |
    |Snigdha        |22 |2       |Spark      |3       |Evangelist        |
    |Snigdha        |22 |2       |Spark      |0       |Committer         |
    |Snigdha        |22 |2       |Spark      |1       |All Rounder       |
    |Harshita       |22 |6       |Spark      |2       |SparkSQLMaster    |
    |Harshita       |22 |6       |Spark      |5       |SparkGuru         |
    |Harshita       |22 |6       |Spark      |9       |DevHunter         |
    |Harshita       |22 |6       |Spark      |3       |Evangelist        |
    |Harshita       |22 |6       |Spark      |0       |Committer         |
    |Harshita       |22 |6       |Spark      |1       |All Rounder       |
    |Ravi           |42 |0       |Spark      |2       |SparkSQLMaster    |
    |Ravi           |42 |0       |Spark      |5       |SparkGuru         |
    |Ravi           |42 |0       |Spark      |9       |DevHunter         |
    |Ravi           |42 |0       |Spark      |3       |Evangelist        |
    |Ravi           |42 |0       |Spark      |0       |Committer         |
    |Ravi           |42 |0       |Spark      |1       |All Rounder       |
    |Ram            |42 |9       |Spark      |2       |SparkSQLMaster    |
    |Ram            |42 |9       |Spark      |5       |SparkGuru         |
    |Ram            |42 |9       |Spark      |9       |DevHunter         |
    |Ram            |42 |9       |Spark      |3       |Evangelist        |
    |Ram            |42 |9       |Spark      |0       |Committer         |
    |Ram            |42 |9       |Spark      |1       |All Rounder       |
    |Chidananda Raju|35 |9       |Spark      |2       |SparkSQLMaster    |
    |Chidananda Raju|35 |9       |Spark      |5       |SparkGuru         |
    |Chidananda Raju|35 |9       |Spark      |9       |DevHunter         |
    |Chidananda Raju|35 |9       |Spark      |3       |Evangelist        |
    |Chidananda Raju|35 |9       |Spark      |0       |Committer         |
    |Chidananda Raju|35 |9       |Spark      |1       |All Rounder       |
    |Sreekanth Doddy|29 |9       |Spark      |2       |SparkSQLMaster    |
    |Sreekanth Doddy|29 |9       |Spark      |5       |SparkGuru         |
    |Sreekanth Doddy|29 |9       |Spark      |9       |DevHunter         |
    |Sreekanth Doddy|29 |9       |Spark      |3       |Evangelist        |
    |Sreekanth Doddy|29 |9       |Spark      |0       |Committer         |
    |Sreekanth Doddy|29 |9       |Spark      |1       |All Rounder       |
    +---------------+---+--------+-----------+--------+------------------+

    == Physical Plan ==
    BroadcastNestedLoopJoin BuildRight, Cross
    :- LocalTableScan [name#0, age#1, personid#2]
    +- BroadcastExchange IdentityBroadcastMode
       +- LocalTableScan [profileName#7, personid#8, profileDescription#9]
    ()
    78
    createOrReplaceTempView example

    Creates a local temporary view using the given name. The lifetime of this
       temporary view is tied to the [[SparkSession]] that was used to create this Dataset.

    createOrReplaceTempView  sql
    SELECT dfperson.name
    , dfperson.age
    , dfprofile.profileDescription
      FROM  dfperson JOIN  dfprofile
     ON dfperson.personid == dfprofile.personid

    +---------------+---+------------------+
    |           name|age|profileDescription|
    +---------------+---+------------------+
    |        Nataraj| 45|    SparkSQLMaster|
    |       Srinivas| 45|         SparkGuru|
    |          Ashik| 22|         DevHunter|
    |          Madhu| 22|        Evangelist|
    |         Meghna| 22|    SparkSQLMaster|
    |        Snigdha| 22|    SparkSQLMaster|
    |           Ravi| 42|         Committer|
    |            Ram| 42|         DevHunter|
    |Chidananda Raju| 35|         DevHunter|
    |Sreekanth Doddy| 29|         DevHunter|
    +---------------+---+------------------+



    **** EXCEPT DEMO ***


     df_asPerson.except(df_asProfile) Except demo
    +---------------+---+--------+
    |           name|age|personid|
    +---------------+---+--------+
    |          Ashik| 22|       9|
    |       Harshita| 22|       6|
    |          Madhu| 22|       3|
    |            Ram| 42|       9|
    |           Ravi| 42|       0|
    |Chidananda Raju| 35|       9|
    |       Siddhika| 22|       4|
    |       Srinivas| 45|       5|
    |Sreekanth Doddy| 29|       9|
    |      Deekshita| 22|       8|
    |         Meghna| 22|       2|
    |        Snigdha| 22|       2|
    |        Nataraj| 45|       2|
    +---------------+---+--------+

     df_asProfile.except(df_asPerson) Except demo
    +-----------+--------+------------------+
    |profileName|personid|profileDescription|
    +-----------+--------+------------------+
    |      Spark|       5|         SparkGuru|
    |      Spark|       9|         DevHunter|
    |      Spark|       2|    SparkSQLMaster|
    |      Spark|       3|        Evangelist|
    |      Spark|       0|         Committer|
    |      Spark|       1|       All Rounder|
    +-----------+--------+------------------+

As discussed above, these are the Venn diagrams of all the joins.

User · Answer

From https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html, use join:

Inner equi-join with another DataFrame using the given column.

    PersonDf.join(ProfileDf, "personId")

OR

    PersonDf.join(ProfileDf, PersonDf("personId") === ProfileDf("personId"))

Update:

You can also save the DFs as temp tables using df.registerTempTable("tableName"), and then write SQL queries using sqlContext.
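On Spark 2.x and later, registerTempTable is deprecated in favor of createOrReplaceTempView, so the temp-table route might look like the sketch below. The session setup, sample rows, and the view names person_view and profile_view are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("temp-view-join-sketch")
  .getOrCreate()
import spark.implicits._

// Illustrative stand-ins for PersonDf and ProfileDf.
val personDf = Seq(("Bindu", 20, 2), ("Raphel", 25, 5), ("Ram", 40, 9))
  .toDF("name", "age", "personId")
val profileDf = Seq(
  ("Spark", 2, "SparkSQLMaster"),
  ("Spark", 5, "SparkGuru"),
  ("Spark", 9, "DevHunter")
).toDF("profileName", "personId", "profileDescription")

// Register both DataFrames as session-scoped views (hypothetical view names).
personDf.createOrReplaceTempView("person_view")
profileDf.createOrReplaceTempView("profile_view")

// Join them with plain SQL.
val viaSql = spark.sql(
  """SELECT p.name, p.age, f.profileDescription
    |FROM person_view p
    |JOIN profile_view f ON p.personId = f.personId""".stripMargin)

viaSql.show()
```

The views live only as long as the SparkSession that created them, so nothing is written to a catalog or to disk.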
