[sql] Concatenate columns in Apache Spark DataFrame

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

This question is related to sql apache-spark dataframe apache-spark-sql

The answer is


With raw SQL you can use CONCAT:

  • In Python

    df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
    df.registerTempTable("df")
    sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
    
  • In Scala

    import sqlContext.implicits._
    
    val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
    df.registerTempTable("df")
    sqlContext.sql("SELECT CONCAT(k, ' ',  v) FROM df")
    

Since Spark 1.5.0 you can use concat function with DataFrame API:

  • In Python :

    from pyspark.sql.functions import concat, col, lit
    
    df.select(concat(col("k"), lit(" "), col("v")))
    
  • In Scala :

    import org.apache.spark.sql.functions.{concat, lit}
    
    df.select(concat($"k", lit(" "), $"v"))
    

There is also concat_ws function which takes a string separator as the first argument.


From Spark 2.3(SPARK-22771) Spark SQL supports the concatenation operator ||.

For example;

val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")

concat(*cols)

v1.5 and higher

Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.

Eg: new_df = df.select(concat(df.a, df.b, df.c))


concat_ws(sep, *cols)

v1.5 and higher

Similar to concat but uses the specified separator.

Eg: new_df = df.select(concat_ws('-', df.col1, df.col2))


map_concat(*cols)

v2.4 and higher

Used to concat maps, returns the union of all the given maps.

Eg: new_df = df.select(map_concat("map1", "map2"))


Using string concat operator (||):

v2.3 and higher

Eg: df = spark.sql("select col_a || col_b || col_c as abc from table_x")

Reference: Spark sql doc


Indeed, there are some beautiful inbuilt abstractions for you to accomplish your concatenation without the need to implement a custom function. Since you mentioned Spark SQL, so I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can accomplish in a straight forward manner passing SQL command like: SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;

Also, from Spark 2.3.0, you can use commands in lines with: SELECT col1 || col2 AS concat_column_name FROM <table_name>;

Wherein, is your preferred delimiter (can be empty space as well) and is the temporary or permanent table you are trying to read from.


Do we have java syntax corresponding to below process

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))

If you want to do it using DF, you could use a udf to add a new column based on existing columns.

val sqlContext = new SQLContext(sc)
case class MyDf(col1: String, col2: String)

//here is our dataframe
val df = sqlContext.createDataFrame(sc.parallelize(
    Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
))

//Define a udf to concatenate two passed in string values
val getConcatenated = udf( (first: String, second: String) => { first + " " + second } )

//use withColumn method to add a new column called newColName
df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()

In Java you can do this to concatenate multiple columns. The sample code is to provide you a scenario and how to use it for better understanding.

SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
Dataset<Row> reducedInventory = spark.sql("select * from table_name")
                        .withColumn("concatenatedCol",
                                concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));


class JavaSparkSessionSingleton {
    private static transient SparkSession instance = null;

    public static SparkSession getInstance(SparkConf sparkConf) {
        if (instance == null) {
            instance = SparkSession.builder().config(sparkConf)
                    .getOrCreate();
        }
        return instance;
    }
}

The above code concatenated col1,col2,col3 seperated by "_" to create a column with name "concatenatedCol".


Another way to do it in pySpark using sqlContext...

#Suppose we have a dataframe:
df = sqlContext.createDataFrame([('row1_1','row1_2')], ['colname1', 'colname2'])

# Now we can concatenate columns and assign the new column a name 
df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Note: For this code to work you need to put the parentheses "()" in the "isNotNull" function. -> The correct one is "isNotNull()".

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

In my case, I wanted a Tab delimited row.

from pyspark.sql import functions as F
df.select(F.concat_ws('|','_c1','_c2','_c3','_c4')).show()

This worked well like a hot knife over butter.


In Spark 2.3.0, you may do:

spark.sql( """ select '1' || column_a from table_a """)

Here is another way of doing this for pyspark:

#import concat and lit functions from pyspark.sql.functions 
from pyspark.sql.functions import concat, lit

#Create your data frame
countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

#Use select, concat, and lit functions to do the concatenation
personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

#Show the new data frame
personDF.show()

----------RESULT-------------------------

84
+------------+
|East African|
+------------+
|   Ethiopian|
|      Kenyan|
|     Ugandan|
|     Rwandan|
+------------+

One option to concatenate string columns in Spark Scala is using concat.

It is necessary to check for null values. Because if one of the columns is null, the result will be null even if one of the other columns do have information.

Using concat and withColumn:

val newDf =
  df.withColumn(
    "NEW_COLUMN",
    concat(
      when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
      when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Using concat and select:

val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")

With both approaches you will have a NEW_COLUMN which value is a concatenation of the columns: COL1 and COL2 from your original df.


Here is a suggestion for when you don't know the number or name of the columns in the Dataframe.

val dfResults = dfSource.select(concat_ws(",",dfSource.columns.map(c => col(c)): _*))

Here's how you can do custom naming

import pyspark
from pyspark.sql import functions as sf
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
df = sqlc.createDataFrame([('row11','row12'), ('row21','row22')], ['colname1', 'colname2'])
df.show()

gives,

+--------+--------+
|colname1|colname2|
+--------+--------+
|   row11|   row12|
|   row21|   row22|
+--------+--------+

create new column by concatenating:

df = df.withColumn('joined_column', 
                    sf.concat(sf.col('colname1'),sf.lit('_'), sf.col('colname2')))
df.show()

+--------+--------+-------------+
|colname1|colname2|joined_column|
+--------+--------+-------------+
|   row11|   row12|  row11_row12|
|   row21|   row22|  row21_row22|
+--------+--------+-------------+

We can simple use SelectExpr as well.

df1.selectExpr("*","upper(_2||_3) as new")


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to apache-spark

Select Specific Columns from Spark DataFrame Select columns in PySpark dataframe What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Spark dataframe: collect () vs select () How does createOrReplaceTempView work in Spark? Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey Filter df when values matches part of a string in pyspark Filtering a pyspark dataframe using isin by exclusion Convert date from String to Date format in Dataframes

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to apache-spark-sql

Select Specific Columns from Spark DataFrame Pyspark: Filter dataframe based on multiple conditions Select columns in PySpark dataframe What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Spark dataframe: collect () vs select () How does createOrReplaceTempView work in Spark? Filter df when values matches part of a string in pyspark Convert date from String to Date format in Dataframes Take n rows from a spark dataframe and pass to toPandas()