Spark Dataframe distinguish columns with duplicated name

Question

As far as I know, a Spark DataFrame can have multiple columns with the same name, as shown in the DataFrame snapshot below:

    Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}))
    Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043}))
    Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132}))
    Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}))
    Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))

The result above was created by joining a DataFrame to itself. You can see there are 4 columns: two named a and two named f.

The problem is that when I try to do more calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'); both returned the error message below:

    AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

Is there any way in the Spark API to distinguish the columns with duplicated names again? Or maybe some way to let me change the column names?

User · Answer

Let's start with some data:

    from pyspark.mllib.linalg import SparseVector
    from pyspark.sql import Row

    df1 = sqlContext.createDataFrame([
        Row(a=107831, f=SparseVector(
            5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
        Row(a=125231, f=SparseVector(
            5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
    ])

    df2 = sqlContext.createDataFrame([
        Row(a=107831, f=SparseVector(
            5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
        Row(a=107831, f=SparseVector(
            5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    ])

There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using parent columns:

    df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

    ## +--------------------+
    ## |                   f|
    ## +--------------------+
    ## |(5,[0,1,2,3,4],[0...|
    ## |(5,[0,1,2,3,4],[0...|
    ## +--------------------+

You can also use table aliases:

    from pyspark.sql.functions import col

    df1_a = df1.alias("df1_a")
    df2_a = df2.alias("df2_a")

    df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

    ## +--------------------+
    ## |                   f|
    ## +--------------------+
    ## |(5,[0,1,2,3,4],[0...|
    ## |(5,[0,1,2,3,4],[0...|
    ## +--------------------+

Finally, you can programmatically rename columns:

    df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
    df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

    df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

    ## +--------------------+
    ## |               f_df1|
    ## +--------------------+
    ## |(5,[0,1,2,3,4],[0...|
    ## |(5,[0,1,2,3,4],[0...|
    ## +--------------------+

User · Answer

There is a simpler way than writing aliases for all of the columns you are joining on:

    df1.join(df2, ['a'])

This works if the key that you are joining on has the same name in both tables.

See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html

User · Answer

I would recommend that you change the column names for your join:

    df1.select(col("a") as "df1_a", col("f") as "df1_f")
       .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a") === col("df2_a"))

The resulting DataFrame will have the schema (df1_a, df1_f, df2_a, df2_f).

User · Answer

Suppose the DataFrames you want to join are df1 and df2, and you are joining them on column 'a'. Then you have 2 methods.

Method 1:

    df1.join(df2, 'a', 'left_outer')

This is an awesome method and it is highly recommended.

Method 2:

    df1.join(df2, df1.a == df2.a, 'left_outer').drop(df2.a)

User · Answer

You can use the def drop(col: Column) method to drop the duplicated column. For example:

    DataFrame df1:

    +-------+-----+
    | a     | f   |
    +-------+-----+
    |107831 | ... |
    |107831 | ... |
    +-------+-----+

    DataFrame df2:

    +-------+-----+
    | a     | f   |
    +-------+-----+
    |107831 | ... |
    |107831 | ... |
    +-------+-----+

When I join df1 with df2, the DataFrame looks like this:

    val newDf = df1.join(df2, df1("a") === df2("a"))

    DataFrame newDf:

    +-------+-----+-------+-----+
    | a     | f   | a     | f   |
    +-------+-----+-------+-----+
    |107831 | ... |107831 | ... |
    |107831 | ... |107831 | ... |
    +-------+-----+-------+-----+

Now we can use the def drop(col: Column) method to drop the duplicated column a or f, as follows:

    val newDfWithoutDuplicate = df1.join(df2, df1("a") === df2("a")).drop(df2("a")).drop(df2("f"))

User · Answer

What worked for me:

    import databricks.koalas as ks

    df1k = df1.to_koalas()
    df2k = df2.to_koalas()
    df3k = df1k.merge(df2k, on=['col1', 'col2'])
    df3 = df3k.to_spark()

All of the columns except for col1 and col2 had "_x" appended to their names if they had come from df1 and "_y" appended if they had come from df2, which is exactly what I needed.

User · Answer

After digging into the Spark API, I found that I can first use alias to create an alias for the original DataFrame, then use withColumnRenamed to manually rename every column on the alias. The join then completes without duplicating any column names.

More detail can be found in the Spark DataFrame API:

    pyspark.sql.DataFrame.alias
    pyspark.sql.DataFrame.withColumnRenamed

However, I think this is only a troublesome workaround, and I am wondering if there is a better way to solve my problem.
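The manual renaming described in this answer can be planned automatically. Here is a minimal sketch in plain Python, assuming you only want to rename columns that clash and are not join keys. The helper name plan_renames and the "_left" / "_right" suffixes are hypothetical, not part of the Spark API; the PySpark calls in the trailing comments (withColumnRenamed, join with a list of names) are not executed here.

```python
def plan_renames(left_cols, right_cols, keys, suffixes=("_left", "_right")):
    """Map each clashing non-key column to a suffixed name, one dict per side.

    Hypothetical helper: it only works on lists of column names, so it can be
    fed df1.columns and df2.columns before the join.
    """
    clashing = (set(left_cols) & set(right_cols)) - set(keys)
    left_map = {c: c + suffixes[0] for c in left_cols if c in clashing}
    right_map = {c: c + suffixes[1] for c in right_cols if c in clashing}
    return left_map, right_map


left_map, right_map = plan_renames(["a", "f", "g"], ["a", "f", "h"], keys=["a"])
print(left_map)   # {'f': 'f_left'}
print(right_map)  # {'f': 'f_right'}

# Applying the plan in PySpark (not run here) could look like:
#   for old, new in left_map.items():
#       df1 = df1.withColumnRenamed(old, new)
#   for old, new in right_map.items():
#       df2 = df2.withColumnRenamed(old, new)
#   joined = df1.join(df2, ["a"])
```

Only f is renamed in this example, because a is the join key and g and h each exist on one side only.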

User · Answer

If you have a more complicated use case than described in the answer of Glennie Helles Sindholt, e.g. you have a few other non-join columns whose names are also the same and you want to distinguish them while selecting, it's best to use aliases, e.g.:

    df3 = df1.select("a", "b").alias("left")\
       .join(df2.select("a", "b").alias("right"), ["a"])\
       .select("left.a", "left.b", "right.b")

    df3.columns
    ['a', 'b', 'b']

User · Answer

This is how we can join two DataFrames on the same column names in PySpark:

    df = df1.join(df2, ['col1', 'col2', 'col3'])

If you call printSchema() after this, you can see that the duplicate columns have been removed.

User · Answer

This might not be the best approach, but if you want to rename the duplicate columns after the join, you can do so using this tiny function:

    def rename_duplicate_columns(dataframe):
        columns = dataframe.columns
        # Index of the first occurrence of every name that appears exactly twice
        duplicate_column_indices = list(set([columns.index(col) for col in columns if columns.count(col) == 2]))
        for index in duplicate_column_indices:
            columns[index] = columns[index] + '2'
        dataframe = dataframe.toDF(*columns)
        return dataframe
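The function above only handles names that appear exactly twice. As a hedged variation, here is a sketch that works on the list of column names alone and numbers every repeat, however many there are. The helper name unique_column_names and the "_2", "_3" suffix scheme are assumptions for illustration, not something Spark provides.

```python
from collections import Counter


def unique_column_names(names):
    """Return a copy of `names` in which repeated names get a numeric suffix.

    The first occurrence keeps its name; later ones become name_2, name_3, ...
    (hypothetical naming scheme).
    """
    seen = Counter()
    taken = set(names)
    result = []
    for name in names:
        seen[name] += 1
        if seen[name] == 1:
            result.append(name)
            continue
        # Pick the next suffix that does not clash with an existing column.
        n = seen[name]
        candidate = f"{name}_{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}_{n}"
        taken.add(candidate)
        result.append(candidate)
    return result


print(unique_column_names(["a", "f", "a", "f"]))  # ['a', 'f', 'a_2', 'f_2']
```

On a joined DataFrame the result could then be applied with joined.toDF(*unique_column_names(joined.columns)), since toDF accepts the new names positionally.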

User · Answer

If only the key column is the same in both tables, then try approach 1:

    left.join(right, 'key', 'inner')

rather than approach 2:

    left.join(right, left.key == right.key, 'inner')

Pros of using approach 1:

    the key will show only once in the final DataFrame
    the syntax is easy to use

Cons of using approach 1:

    it only helps with the key column
    in a left join, if you plan to count the nulls in the right key, this will not work; in that case you have to rename one of the keys as mentioned above
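To make the left-join caveat concrete, here is a plain-Python model of the two join styles. It is not Spark code: the functions left_join_using and left_join_on, the sample rows, and the left_key / right_key output names are all hypothetical, chosen only to show why the merged key column can no longer reveal unmatched rows.

```python
def left_join_using(left, right, key):
    """Model of approach 1 with a left join: a single, shared key column."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]] or [None]
        for r in matches:
            row = dict(l)  # the shared key column always carries the left value
            row["y"] = r["y"] if r else None
            out.append(row)
    return out


def left_join_on(left, right, key):
    """Model of approach 2 with a left join: both key columns are kept."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]] or [None]
        for r in matches:
            out.append({
                "left_key": l[key],
                "right_key": r[key] if r else None,  # None marks "no match"
                "x": l["x"],
                "y": r["y"] if r else None,
            })
    return out


left = [{"key": 1, "x": "a"}, {"key": 2, "x": "b"}]
right = [{"key": 1, "y": "c"}]

using_rows = left_join_using(left, right, "key")
on_rows = left_join_on(left, right, "key")

null_keys_using = sum(row["key"] is None for row in using_rows)
null_right_keys = sum(row["right_key"] is None for row in on_rows)
print(null_keys_using, null_right_keys)  # 0 1
```

With the shared key column the unmatched row for key 2 is invisible in the key itself, while the separate right key exposes it as a null.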
