Extract column values of Dataframe as List in Apache Spark

Question

I want to convert a string column of a DataFrame to a list. The only thing I could find in the DataFrame API is RDD, so I tried converting the DataFrame to an RDD first and then applying the toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from the RDD has square brackets around every element, like this: [A00001]. I was wondering if there's an appropriate way to convert a column to a list, or a way to remove the square brackets.

Any suggestions would be appreciated. Thank you!

User · Answer

I know the question and the given answer assume Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I have to reference the column name a second time in the mapping function, and I do not need the select statement.

For example, take a DataFrame containing a column named "Raw".

To get the values of "Raw" combined into a list, where each entry is one row's value from "Raw", I simply use:

MyDataFrame.rdd.map(lambda x: x.Raw).collect()
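To see why the column has to be referenced inside the mapping function, here is a minimal sketch that runs without a Spark installation. It is not Spark code: a namedtuple called Row stands in for Spark's Row (PySpark's Row is also a tuple subclass), and a plain Python list stands in for the RDD. The names Row, rows, whole_rows and raw_values are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, with a single field named "Raw".
Row = namedtuple("Row", ["Raw"])

# Hypothetical stand-in for the rows an RDD would hold.
rows = [Row("A00001"), Row("A00002"), Row("A00003")]

# Mapping with the identity function keeps each element as a whole row.
whole_rows = list(map(lambda x: x, rows))

# Picking the field by name, as in `lambda x: x.Raw`, yields the bare values.
raw_values = list(map(lambda x: x.Raw, rows))

print(whole_rows[0])  # Row(Raw='A00001')
print(raw_values)     # ['A00001', 'A00002', 'A00003']
```

The wrapper around each value in the question's output plays the same role as the Row(...) wrapper here: the elements are still rows until a field is pulled out of each one.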

User · Answer

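In Scala and Spark 2, try this (assuming your column name is "s"):

df.select("s").as[String].collect

Note that .as[String] needs import spark.implicits._ in scope to supply the String encoder.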

User · Answer

Below is for Python:

df.select("col_name").rdd.flatMap(lambda x: x).collect()
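As a rough illustration of what flatMap with the identity function does to rows, the sketch below uses itertools.chain.from_iterable as a local analogue. This is an assumption made for illustration, not Spark's implementation, and plain tuples stand in for Row objects. It also shows why the select on a single column matters: with two columns, flattening interleaves the values of both.

```python
from itertools import chain

# Plain tuples standing in for single-column rows (illustrative data).
single_col = [("A00001",), ("A00002",)]

# Flattening one level unwraps each one-element row into its value.
flat = list(chain.from_iterable(single_col))
print(flat)  # ['A00001', 'A00002']

# With two columns per row, the same flattening mixes both columns together.
two_cols = [("A00001", 1), ("A00002", 2)]
mixed = list(chain.from_iterable(two_cols))
print(mixed)  # ['A00001', 1, 'A00002', 2]
```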

User · Answer

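import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

// Convert to a JavaRDD, map each Row to the string value of one column, then collect.
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
        return row.getAs("column_name").toString();
    }
}).collect();

logger.info(String.format("list is %s", whatever_list)); // verification

Since no one has given a solution in Java (a real programming language). You can thank me later.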

User · Answer

from pyspark.sql.functions import col

df.select(col("column_name")).collect()

Here collect() is the function that returns the result as a list (of Row objects). Beware of using it on a huge data set: everything is pulled into the driver, which will hurt performance. It is good to check the size of the data first.

User · Answer

This is a Java answer:

df.select("id").collectAsList();

User · Answer

With Spark 2.x and Scala 2.11, I'd think of 3 possible ways to convert the values of a specific column to a List.

Common code snippets for all the approaches

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._ // for .toDF() method

val df = Seq(
  ("first", 2.0),
  ("test", 1.5),
  ("choose", 8.0)
).toDF("id", "val")

Approach 1

df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(first, test, choose)

What happens now? We are collecting the data to the driver with collect() and picking element zero from each record. This is not an excellent way of doing it. Let's improve on it with the next approach.

Approach 2

df.select("id").rdd.map(r => r(0)).collect.toList
// res10: List[Any] = List(first, test, choose)

How is it better? We have distributed the map transformation load among the workers rather than running it on the single driver. I know rdd.map(r => r(0)) does not seem elegant, so let's address that in the next approach.

Approach 3

df.select("id").map(r => r.getString(0)).collect.toList
// res11: List[String] = List(first, test, choose)

Here we are not converting the DataFrame to an RDD. Look at map: it won't accept r => r(0) (or _(0)) as in the previous approach, due to encoder issues in DataFrame. So we end up using r => r.getString(0), and it would be addressed in the next versions of Spark.

Conclusion

All the options give the same output, but 2 and 3 are effective. In the end, the 3rd one is both effective and elegant, I'd think.

User · Answer

sqlContext.sql("select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) // remove brackets

It works perfectly.

User · Answer

This should return the collection containing a single list:

dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()

Without the mapping, you just get a Row object, which contains every column from the database.

Keep in mind that this will probably get you a list of Any type. If you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in the mapping: r => r(0).asInstanceOf[YOUR_TYPE].

P.S. Due to automatic conversion you can skip the .rdd part.

User · Answer

An updated solution that gets you a list:

dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
