Concatenate columns in Apache Spark DataFrame

Question

How do we concatenate two columns in an Apache Spark DataFrame? Is there any function in Spark SQL which we can use?

User · Answer

concat(*cols) (v1.5 and higher)
Concatenates multiple input columns together into a single column. The function works with string, binary, and compatible array columns.
E.g.:

    new_df = df.select(concat(df.a, df.b, df.c))

concat_ws(sep, *cols) (v1.5 and higher)
Similar to concat, but uses the specified separator.
E.g.:

    new_df = df.select(concat_ws('-', df.col1, df.col2))

map_concat(*cols) (v2.4 and higher)
Used to concatenate maps; returns the union of all the given maps.
E.g.:

    new_df = df.select(map_concat("map1", "map2"))

Using the string concat operator (||) (v2.3 and higher)
E.g.:

    df = spark.sql("select col_a || col_b || col_c as abc from table_x")

Reference: Spark SQL doc
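
One difference between concat and concat_ws that the summary above does not spell out is how they treat nulls: concat returns null as soon as any input is null, while concat_ws skips null inputs. The sketch below models that behaviour in plain Python so it can be run without a Spark session. The names model_concat and model_concat_ws are hypothetical stand-ins, not Spark functions, and the model covers only string inputs.

```python
from typing import Optional


def model_concat(*values: Optional[str]) -> Optional[str]:
    """Plain-Python model of concat: any None (null) input yields None."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def model_concat_ws(sep: str, *values: Optional[str]) -> str:
    """Plain-Python model of concat_ws: None inputs are skipped, the rest joined by sep."""
    return sep.join(v for v in values if v is not None)


print(model_concat("a", "b", "c"))          # abc
print(model_concat("a", None, "c"))         # None
print(model_concat_ws("-", "a", None, "c")) # a-c
```

If a null in one column should not wipe out the whole result, concat_ws (or an explicit null replacement before concat) is usually the safer choice.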

User · Answer

We can simply use selectExpr as well:

    df1.selectExpr("*", "upper(_2||_3) as new")

User · Answer

Here is another way of doing this for PySpark:

    # import concat and lit functions from pyspark.sql.functions
    from pyspark.sql.functions import concat, lit

    # Create your data frame
    countryDF = sqlContext.createDataFrame([('Ethiopia',), ('Kenya',), ('Uganda',), ('Rwanda',)], ['East Africa'])

    # Use select, concat, and lit functions to do the concatenation
    personDF = countryDF.select(concat(countryDF['East Africa'], lit('n')).alias('East African'))

    # Show the new data frame
    personDF.show()

Result:

    +------------+
    |East African|
    +------------+
    |   Ethiopian|
    |      Kenyan|
    |     Ugandan|
    |     Rwandan|
    +------------+

User · Answer

Another way to do it in PySpark using sqlContext:

    from pyspark.sql.functions import concat

    # Suppose we have a dataframe:
    df = sqlContext.createDataFrame([('row1_1', 'row1_2')], ['colname1', 'colname2'])

    # Now we can concatenate columns and assign the new column a name
    df = df.select(concat(df.colname1, df.colname2).alias('joined_colname'))

User · Answer

Here is a suggestion for when you don't know the number or names of the columns in the DataFrame:

    val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))
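
For PySpark users, the same idea (join every column, whatever the schema turns out to be) can be sketched as a SQL expression string for selectExpr. The helper below is only an illustration: concat_ws_expr and quote_identifier are hypothetical names, the default separator and alias are assumptions, and the separator is assumed to contain no quotes or backslashes.

```python
from typing import Sequence


def quote_identifier(name: str) -> str:
    """Backtick-quote a column name, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def concat_ws_expr(columns: Sequence[str], sep: str = ",", alias: str = "joined") -> str:
    """Build a concat_ws(...) expression string covering all given columns."""
    if not columns:
        raise ValueError("at least one column is required")
    if "'" in sep or "\\" in sep:
        # Kept simple on purpose: no attempt to escape awkward separators.
        raise ValueError("separator must not contain quotes or backslashes")
    quoted = ", ".join(quote_identifier(c) for c in columns)
    return f"concat_ws('{sep}', {quoted}) AS {quote_identifier(alias)}"


print(concat_ws_expr(["a", "b"]))
# concat_ws(',', `a`, `b`) AS `joined`
```

With a DataFrame df in scope, the result could be passed as df.selectExpr(concat_ws_expr(df.columns)). Building the string separately keeps the column handling testable without a cluster.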

User · Answer

Here's how you can do custom naming:

    import pyspark
    from pyspark.sql import functions as sf

    sc = pyspark.SparkContext()
    sqlc = pyspark.SQLContext(sc)
    df = sqlc.createDataFrame([('row11', 'row12'), ('row21', 'row22')], ['colname1', 'colname2'])
    df.show()

gives:

    +--------+--------+
    |colname1|colname2|
    +--------+--------+
    |   row11|   row12|
    |   row21|   row22|
    +--------+--------+

Create a new column by concatenating:

    df = df.withColumn('joined_column',
                       sf.concat(sf.col('colname1'), sf.lit('_'), sf.col('colname2')))
    df.show()

    +--------+--------+-------------+
    |colname1|colname2|joined_column|
    +--------+--------+-------------+
    |   row11|   row12|  row11_row12|
    |   row21|   row22|  row21_row22|
    +--------+--------+-------------+

User · Answer

One option to concatenate string columns in Spark Scala is using concat.

It is necessary to check for null values, because if one of the columns is null, the result will be null even if the other columns do have information.

Using concat and withColumn:

    val newDf =
      df.withColumn(
        "NEW_COLUMN",
        concat(
          when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
          when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Using concat and select:

    val newDf = df.selectExpr("concat(nvl(COL1, ''), nvl(COL2, '')) as NEW_COLUMN")

With both approaches you will have a NEW_COLUMN whose value is the concatenation of the columns COL1 and COL2 from your original df. Note that the first approach substitutes the literal string "null" for missing values, while the second substitutes an empty string.
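
The nvl-based variant generalizes to any number of columns. As a rough sketch, the hypothetical helper below builds such an expression string in Python, using coalesce (which behaves like nvl for two arguments) around each column. The function names are made up for illustration, the columns are assumed to be strings, and the placeholder is assumed to contain no quotes or backslashes.

```python
from typing import Sequence


def quote_identifier(name: str) -> str:
    """Backtick-quote a column name, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def null_safe_concat_expr(columns: Sequence[str], alias: str, placeholder: str = "") -> str:
    """Build concat(coalesce(col, placeholder), ...) so one null does not null the result."""
    if not columns:
        raise ValueError("at least one column is required")
    if "'" in placeholder or "\\" in placeholder:
        # Kept simple on purpose: no attempt to escape awkward placeholders.
        raise ValueError("placeholder must not contain quotes or backslashes")
    parts = ", ".join(
        f"coalesce({quote_identifier(c)}, '{placeholder}')" for c in columns
    )
    return f"concat({parts}) AS {quote_identifier(alias)}"


print(null_safe_concat_expr(["COL1", "COL2"], alias="NEW_COLUMN"))
# concat(coalesce(`COL1`, ''), coalesce(`COL2`, '')) AS `NEW_COLUMN`
```

The resulting string is meant to be handed to selectExpr, in the same way as the hand-written nvl expression.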

User · Answer

Do we have Java syntax corresponding to the process below?

    val dfResults = dfSource.select(concat_ws(",", dfSource.columns.map(c => col(c)): _*))

User · Answer

Indeed, there are some beautiful built-in abstractions that let you accomplish your concatenation without implementing a custom function. Since you mentioned Spark SQL, I am guessing you are trying to pass it as a declarative command through spark.sql(). If so, you can do it in a straightforward manner by passing a SQL command like:

    SELECT CONCAT(col1, '<delimiter>', col2, ...) AS concat_column_name FROM <table_name>;

Also, from Spark 2.3.0, you can use a command along the lines of:

    SELECT col1 || col2 AS concat_column_name FROM <table_name>;

Here, <delimiter> is your preferred delimiter (it can be an empty space as well) and <table_name> is the temporary or permanent table you are trying to read from.

User · Answer

In Spark 2.3.0, you may do:

    spark.sql(""" select '1' || column_a from table_a """)

User · Answer

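In Java you can do this to concatenate multiple columns. The sample code shows a scenario and how to use it, for better understanding.

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.concat;
    import static org.apache.spark.sql.functions.lit;

    // `rdd` is assumed to be an RDD already in scope
    SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd.context().getConf());
    Dataset<Row> reducedInventory = spark.sql("select * from table_name")
            .withColumn("concatenatedCol",
                    concat(col("col1"), lit("_"), col("col2"), lit("_"), col("col3")));

    class JavaSparkSessionSingleton {
        private static transient SparkSession instance = null;

        public static SparkSession getInstance(SparkConf sparkConf) {
            if (instance == null) {
                instance = SparkSession.builder().config(sparkConf)
                        .getOrCreate();
            }
            return instance;
        }
    }

The above code concatenates col1, col2, and col3, separated by "_", to create a column named "concatenatedCol".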

User · Answer

If you want to do it using a DataFrame, you could use a udf to add a new column based on existing columns.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // enables the $"col" syntax

    case class MyDf(col1: String, col2: String)

    // here is our dataframe
    val df = sqlContext.createDataFrame(sc.parallelize(
      Array(MyDf("A", "B"), MyDf("C", "D"), MyDf("E", "F"))
    ))

    // Define a udf to concatenate two passed-in string values
    val getConcatenated = udf((first: String, second: String) => { first + " " + second })

    // use the withColumn method to add a new column called newColName
    df.withColumn("newColName", getConcatenated($"col1", $"col2")).select("newColName", "col1", "col2").show()

User · Answer

In my case, I wanted a tab-delimited row:

    from pyspark.sql import functions as F
    df.select(F.concat_ws('\t', '_c1', '_c2', '_c3', '_c4')).show()

This worked like a hot knife through butter.

User · Answer

With raw SQL you can use CONCAT:

In Python:

    df = sqlContext.createDataFrame([("foo", 1), ("bar", 2)], ("k", "v"))
    df.registerTempTable("df")
    sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

In Scala:

    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("foo", 1), ("bar", 2))).toDF("k", "v")
    df.registerTempTable("df")
    sqlContext.sql("SELECT CONCAT(k, ' ', v) FROM df")

Since Spark 1.5.0 you can use the concat function with the DataFrame API:

In Python:

    from pyspark.sql.functions import concat, col, lit

    df.select(concat(col("k"), lit(" "), col("v")))

In Scala:

    import org.apache.spark.sql.functions.{concat, lit}

    df.select(concat($"k", lit(" "), $"v"))

There is also the concat_ws function, which takes a string separator as the first argument.

User · Answer

    val newDf =
      df.withColumn(
        "NEW_COLUMN",
        concat(
          when(col("COL1").isNotNull, col("COL1")).otherwise(lit("null")),
          when(col("COL2").isNotNull, col("COL2")).otherwise(lit("null"))))

Note: For this code to work you need to put the parentheses "()" in the "isNotNull" function. The correct one is "isNotNull()". (Java and PySpark need the parentheses; the Scala Column API declares isNotNull without a parameter list, so in Scala the form above applies.)

    val newDf =
      df.withColumn(
        "NEW_COLUMN",
        concat(
          when(col("COL1").isNotNull(), col("COL1")).otherwise(lit("null")),
          when(col("COL2").isNotNull(), col("COL2")).otherwise(lit("null"))))

User · Answer

From Spark 2.3 (SPARK-22771), Spark SQL supports the concatenation operator ||.

For example:

    val df = spark.sql("select _c1 || _c2 as concat_column from <table_name>")
