[apache-spark] Spark specify multiple column conditions for dataframe join

How can I specify multiple column conditions when joining two DataFrames? For example, I want to run the following:

val Lead_all = Leads.join(Utm_Master,  
    Leaddetails.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign") ==
    Utm_Master.columns("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
"left")

I want to join only when these columns match. But the above syntax is not valid, as cols only takes one string. So how do I get what I want?

This question is related to apache-spark, apache-spark-sql and rdd.

The answers are:


The === option gives me duplicated columns, so I use Seq instead.

val Lead_all = Leads.join(Utm_Master,
    Seq("Utm_Source","Utm_Medium","Utm_Campaign"),"left")

Of course, this only works when the names of the joining columns are the same.
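
For a concrete picture of the difference, here is a minimal sketch (the frames df1 and df2 and their columns are made up for illustration; spark-shell style, assuming import spark.implicits._ for toDF):

    val df1 = Seq(("a", 1), ("b", 2)).toDF("key", "v1")
    val df2 = Seq(("a", 10), ("c", 30)).toDF("key", "v2")

    // Expression join: both input "key" columns survive, so the result has two "key" columns.
    df1.join(df2, df1("key") === df2("key"), "left").columns
    // Array(key, v1, key, v2)

    // Seq join: the join column is kept only once.
    df1.join(df2, Seq("key"), "left").columns
    // Array(key, v1, v2)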


Try this:

val rccJoin = dfRccDeuda.as("dfdeuda")
    .join(dfRccCliente.as("dfcliente"),
        col("dfdeuda.etarcid") === col("dfcliente.etarcid"),
        "inner")

In PySpark you can simply specify each condition separately:

Lead_all = Leads.join(Utm_Master,
    (Leaddetails.LeadSource == Utm_Master.LeadSource) &
    (Leaddetails.Utm_Source == Utm_Master.Utm_Source) &
    (Leaddetails.Utm_Medium == Utm_Master.Utm_Medium) &
    (Leaddetails.Utm_Campaign == Utm_Master.Utm_Campaign))

Just be sure to use operators and parentheses correctly.
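
The parentheses matter because in Python the & operator binds more tightly than ==, so without them the conditions are grouped incorrectly and the expression typically fails with an error about converting a Column to a boolean.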


One thing you can do is to use raw SQL:

case class Bar(x1: Int, y1: Int, z1: Int, v1: String)
case class Foo(x2: Int, y2: Int, z2: Int, v2: String)

val bar = sqlContext.createDataFrame(sc.parallelize(
    Bar(1, 1, 2, "bar") :: Bar(2, 3, 2, "bar") ::
    Bar(3, 1, 2, "bar") :: Nil))

val foo = sqlContext.createDataFrame(sc.parallelize(
    Foo(1, 1, 2, "foo") :: Foo(2, 1, 2, "foo") ::
    Foo(3, 1, 2, "foo") :: Foo(4, 4, 4, "foo") :: Nil))

foo.registerTempTable("foo")
bar.registerTempTable("bar")

sqlContext.sql(
    "SELECT * FROM foo LEFT JOIN bar ON x1 = x2 AND y1 = y2 AND z1 = z2")

In PySpark, using parentheses around each condition is the key to using multiple column names in the join condition.

joined_df = df1.join(df2, 
    (df1['name'] == df2['name']) &
    (df1['phone'] == df2['phone'])
)

Scala:

Leaddetails.join(
    Utm_Master, 
    Leaddetails("LeadSource") <=> Utm_Master("LeadSource")
        && Leaddetails("Utm_Source") <=> Utm_Master("Utm_Source")
        && Leaddetails("Utm_Medium") <=> Utm_Master("Utm_Medium")
        && Leaddetails("Utm_Campaign") <=> Utm_Master("Utm_Campaign"),
    "left"
)
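
Note that <=> is Spark's null-safe equality operator: rows where both sides are NULL are also treated as matching. If you want standard SQL NULL semantics, use === instead.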

To make the comparison case-insensitive,

import org.apache.spark.sql.functions.{lower, upper}

then just wrap the column in lower() (or upper()) in the condition of the join method.

E.g.: dataFrame.filter(lower(dataFrame.col("vendor")).equalTo("fortinet"))
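
The example above is a filter; the same idea applies inside a join condition. A rough sketch (the DataFrames and the vendor column here are made up for illustration; assumes import spark.implicits._ for toDF):

    import org.apache.spark.sql.functions.lower

    // Illustrative frames with mixed-case values in the join column.
    val vendors  = Seq(("Fortinet", 1), ("Cisco", 2)).toDF("vendor", "id")
    val contacts = Seq(("FORTINET", "alice"), ("cisco", "bob")).toDF("vendor", "contact")

    // Case-insensitive equi-join by lower-casing both sides of the condition.
    val joined = vendors.join(
      contacts,
      lower(vendors("vendor")) === lower(contacts("vendor")),
      "inner"
    )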


Spark SQL supports comparing tuples of columns when they are wrapped in parentheses, like

... WHERE (list_of_columns1) = (list_of_columns2)

which is much shorter than writing an equality expression (=) for each pair of columns combined with a set of "AND"s.

For example:

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
   )

instead of

SELECT a,b,c
FROM    tab1 t1
WHERE 
   NOT EXISTS
   (    SELECT 1
        FROM    t1_except_t2_df e
        WHERE t1.a=e.a AND t1.b=e.b AND t1.c=e.c
   )

which is also less readable, especially when the list of columns is big and you want to deal with NULLs easily.


As of Spark version 1.5.0 (which is currently unreleased), you can join on multiple DataFrame columns. Refer to SPARK-7990: Add methods to facilitate equi-join on multiple join keys.

Python

Leads.join(
    Utm_Master, 
    ["LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"],
    "left_outer"
)

Scala

The question asked for a Scala answer, but I don't use Scala. Here is my best guess....

Leads.join(
    Utm_Master,
    Seq("LeadSource","Utm_Source","Utm_Medium","Utm_Campaign"),
    "left_outer"
)
