How to join on multiple columns in Pyspark

Question

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL).

The following works. I first register them as temp tables:

    numeric.registerTempTable("numeric")
    Ref.registerTempTable("Ref")

    test = numeric.join(Ref, numeric.ID == Ref.ID, joinType='inner')

I would now like to join them based on multiple columns. I get SyntaxError: invalid syntax with this:

    test = numeric.join(Ref,
        numeric.ID == Ref.ID AND numeric.TYPE == Ref.TYPE AND
        numeric.STATUS == Ref.STATUS, joinType='inner')

User · Accepted Answer

You should use the & / | operators and be careful about operator precedence (== has lower precedence than bitwise AND and OR):

    df1 = sqlContext.createDataFrame(
        [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],
        ("x1", "x2", "x3"))

    df2 = sqlContext.createDataFrame(
        [(1, "f", -1.0), (2, "b", 0.0)], ("x1", "x2", "x3"))

    # Wrap each comparison in parentheses before combining with &
    df = df1.join(df2, (df1.x1 == df2.x1) & (df1.x2 == df2.x2))
    df.show()

    ## +---+---+---+---+---+---+
    ## | x1| x2| x3| x1| x2| x3|
    ## +---+---+---+---+---+---+
    ## |  2|  b|3.0|  2|  b|0.0|
    ## +---+---+---+---+---+---+
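To see why both the SQL-style AND and the missing parentheses matter, here is a minimal sketch that uses only the standard library's ast module, so it runs without Spark. The names a, b, c, d are placeholders and the helper parses is hypothetical, introduced only for illustration. It inspects how Python parses each form rather than evaluating anything.

```python
import ast


def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


# `AND` is not a Python operator, so the condition from the question
# is rejected by the parser before Spark ever sees it.
sql_style = "numeric.ID == Ref.ID AND numeric.TYPE == Ref.TYPE"
print(parses(sql_style))

# Without parentheses, & binds tighter than ==, so Python reads this
# as the chained comparison  a == (b & c) == d.
bare = ast.parse("a == b & c == d", mode="eval").body
print(type(bare).__name__, type(bare.comparators[0]).__name__)

# With parentheses, each comparison is grouped first and the two
# results are then combined with &.
grouped = ast.parse("(a == b) & (c == d)", mode="eval").body
print(type(grouped).__name__, type(grouped.op).__name__)
```

The unparenthesized form is a single chained comparison whose middle operand is b & c, which is not the intended "both comparisons hold" condition. Wrapping each comparison in parentheses gives the shape used in the join above.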

User · Answer

An alternative approach would be:

    df1 = sqlContext.createDataFrame(
        [(1, "a", 2.0), (2, "b", 3.0), (3, "c", 3.0)],
        ("x1", "x2", "x3"))

    df2 = sqlContext.createDataFrame(
        [(1, "f", -1.0), (2, "b", 0.0)], ("x1", "x2", "x4"))

    # Pass the shared column names as a list
    df = df1.join(df2, ['x1', 'x2'])
    df.show()

which outputs:

    +---+---+---+---+
    | x1| x2| x3| x4|
    +---+---+---+---+
    |  2|  b|3.0|0.0|
    +---+---+---+---+

The main advantage is that the columns on which the tables are joined are not duplicated in the output, which reduces the risk of errors such as org.apache.spark.sql.AnalysisException: Reference 'x1' is ambiguous, could be: x1#50L, x1#57L.

When the columns in the two tables have different names (say that, in the example above, df2 has the columns y1, y2 and y4), you could use the following syntax:

    df = df1.join(df2.withColumnRenamed('y1', 'x1').withColumnRenamed('y2', 'x2'), ['x1', 'x2'])
