[python] Spark Dataframe distinguish columns with duplicated name

As I understand it, in a Spark DataFrame multiple columns can have the same name, as shown in the DataFrame snapshot below:

[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]

The result above is created by joining a DataFrame with itself; you can see there are 4 columns, two named a and two named f.

The problem is that when I try to do further calculations with the a column, I can't find a way to select it. I have tried df[0] and df.select('a'), and both returned the error message below:

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

Is there any way in the Spark API to distinguish the columns with duplicated names again? Or maybe some way to let me change the column names?

Tags: python, apache-spark, dataframe, pyspark, apache-spark-sql

The answers are below.


This might not be the best approach, but if you want to rename the duplicate columns (after the join), you can do so using this tiny function.

def rename_duplicate_columns(dataframe):
    columns = dataframe.columns
    # Indices of the first occurrence of every column name that appears twice
    duplicate_column_indices = list(set([columns.index(col) for col in columns if columns.count(col) == 2]))
    # Append '2' to the first occurrence so every name becomes unique
    for index in duplicate_column_indices:
        columns[index] = columns[index] + '2'
    dataframe = dataframe.toDF(*columns)
    return dataframe
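
A quick usage sketch (hypothetical, assuming df1 and df2 are DataFrames that both have columns a and f, as in the question):

joined = df1.join(df2, df1['a'] == df2['a'])
print(joined.columns)   # ['a', 'f', 'a', 'f']

renamed = rename_duplicate_columns(joined)
print(renamed.columns)  # ['a2', 'f2', 'a', 'f'] -- the first occurrence gets the suffix
renamed.select('a2', 'f2').show()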

Suppose the DataFrames you want to join are df1 and df2, and you are joining them on column 'a'; then you have two methods:

Method 1

df1.join(df2,'a','left_outer')

This is an awesome method and it is highly recommended, since the join key a appears only once in the result.

Method 2

df1.join(df2,df1.a == df2.a,'left_outer').drop(df2.a)


Let's start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])

There are a few ways you can approach this problem. First of all, you can unambiguously reference child table columns using the parent DataFrames:

df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

You can also use table aliases:

from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

Finally you can programmatically rename columns:

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+

You can use the def drop(col: Column) method to drop the duplicated column, for example:

DataFrame: df1

+-------+-----+
| a     | f   |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+

DataFrame: df2

+-------+-----+
| a     | f   |
+-------+-----+
|107831 | ... |
|107831 | ... |
+-------+-----+

When I join df1 with df2, the DataFrame will look like below:

val newDf = df1.join(df2,df1("a")===df2("a"))

DataFrame: newDf

+-------+-----+-------+-----+
| a     | f   | a     | f   |
+-------+-----+-------+-----+
|107831 | ... |107831 | ... |
|107831 | ... |107831 | ... |
+-------+-----+-------+-----+

Now, we can use the def drop(col: Column) method to drop the duplicated column 'a' or 'f', as follows:

val newDfWithoutDuplicate = df1.join(df2,df1("a")===df2("a")).drop(df2("a")).drop(df2("f"))

If you have a more complicated use case than the one described in Glennie Helles Sindholt's answer, e.g. you have other non-join columns whose names are also the same and you want to distinguish them while selecting, it's best to use aliases, e.g.:

df3 = df1.select("a", "b").alias("left")\
   .join(df2.select("a", "b").alias("right"), ["a"])\
   .select("left.a", "left.b", "right.b")

df3.columns
['a', 'b', 'b']

There is a simpler way than writing aliases for all of the columns you are joining on:

df1.join(df2,['a'])

This works if the key that you are joining on is the same in both tables.

See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html


This is how we can join two DataFrames on the same column names in PySpark.

df = df1.join(df2, ['col1','col2','col3'])

If you do printSchema() after this, you can see that the duplicate join columns have been removed.
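
A small sketch to illustrate (hypothetical DataFrames that share col1, col2 and col3; unrelated to the df1/df2 defined earlier on this page):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames that share only the three join columns
df1 = spark.createDataFrame([(1, 2, 3, 'x')], ['col1', 'col2', 'col3', 'left_val'])
df2 = spark.createDataFrame([(1, 2, 3, 'y')], ['col1', 'col2', 'col3', 'right_val'])

df = df1.join(df2, ['col1', 'col2', 'col3'])
df.printSchema()

# root
#  |-- col1: long (nullable = true)
#  |-- col2: long (nullable = true)
#  |-- col3: long (nullable = true)
#  |-- left_val: string (nullable = true)
#  |-- right_val: string (nullable = true)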


After digging into the Spark API, I found I can first use alias to create an alias for the original DataFrame, then use withColumnRenamed to manually rename every column on the alias; this lets the join complete without duplicated column names.

More details can be found in the Spark DataFrame API documentation below:

pyspark.sql.DataFrame.alias

pyspark.sql.DataFrame.withColumnRenamed

However, I think this is only a troublesome workaround, and I am wondering if there is a better way to do this.
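
For reference, a rough sketch of that workaround (assuming df is the original DataFrame with columns a and f, as in the question; the _2 suffix is an arbitrary choice):

# Rename every column on an aliased copy before joining, so the joined
# DataFrame has no duplicated names (a sketch of the workaround described above)
df_other = df.alias('other')
for c in df.columns:
    df_other = df_other.withColumnRenamed(c, c + '_2')

joined = df.join(df_other, df['a'] == df_other['a_2'])
joined.select('a', 'a_2', 'f_2').show()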


What worked for me

import databricks.koalas as ks

df1k = df1.to_koalas()
df2k = df2.to_koalas()
df3k = df1k.merge(df2k, on=['col1', 'col2'])
df3 = df3k.to_spark()

All of the columns except for col1 and col2 had "_x" appended to their names if they had come from df1 and "_y" appended if they had come from df2, which is exactly what I needed.


If only the key column is the same in both tables, then try using the following way (Approach 1):

left.join(right, 'key', 'inner')

rather than the one below (Approach 2):

left.join(right, left.key == right.key, 'inner')

Pros of using Approach 1:

  • the 'key' column will show up only once in the final dataframe
  • easy-to-use syntax

Cons of using Approach 1:

  • it only helps with the key column
  • in the case of a left join, if you plan to use the right key's null count, this will not work; in that case, you have to rename one of the keys as mentioned above (see the sketch after this list)
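
A rough sketch of that null-count scenario (hypothetical left/right DataFrames sharing a key column; the right key is renamed so its nulls can be counted after the left join):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: keys 2 and 3 have no match on the right
left = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['key', 'l_val'])
right = spark.createDataFrame([(1, 'x')], ['key', 'r_val'])

# Rename the right key so it survives the join under a distinct name
right_r = right.withColumnRenamed('key', 'r_key')

joined = left.join(right_r, left['key'] == right_r['r_key'], 'left')

# Count the left rows that found no match on the right
print(joined.filter(F.col('r_key').isNull()).count())  # 2 in this toy example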
