How to print the contents of RDD

Question

I'm attempting to print the contents of a collection to the Spark console.

I have a type:

linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]

And I use the command:

scala> linesWithSessionId.map(line => println(line))

But this is printed:

res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at <console>:19

How can I write the RDD to the console or save it to disk so I can view its contents?

User · Accepted Answer

If you want to view the contents of an RDD, one way is to use collect():

myRDD.collect().foreach(println)

That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:

myRDD.take(n).foreach(println)

User · Answer

c.take(10)

Newer versions of Spark will show the result nicely as a table.

User · Answer

In Python:

linesWithSessionIdCollect = linesWithSessionId.collect()
linesWithSessionIdCollect

This will print out all the contents of the RDD.

User · Answer

If you're running this on a cluster, then println won't print back to your context. You need to bring the RDD data to your session. To do this, you can force it into a local array and then print it out:

linesWithSessionId.toArray().foreach(line => println(line))

Note that toArray() was deprecated in favor of collect(), so on newer Spark versions use collect() instead.

User · Answer

In Java syntax:

rdd.collect().forEach(line -> System.out.println(line));

User · Answer

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (this applies not only to collect(), but to other actions as well). One of the differences I saw is that with myRDD.foreach(println) the output comes out in a random order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order. But when I did myRDD.collect().foreach(println), the order stayed the same as in the text file.
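The ordering difference described above can be reproduced in local mode. This is a minimal sketch, assuming a local SparkSession; the object name `OrderDemo`, the app name, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object OrderDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so task output lands on the same console as the driver
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("OrderDemo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Ten numbers spread over four partitions
    val myRDD = sc.parallelize(1 to 10, 4)

    // Runs inside the tasks; partitions are processed in parallel,
    // so the interleaving between partitions is not guaranteed
    myRDD.foreach(println)

    // Brings the data back to the driver partition by partition,
    // so the elements are printed in their original order
    myRDD.collect().foreach(println)

    spark.stop()
  }
}
```

The second loop is an ordinary Scala `foreach` over a local array, which is why its order is stable.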

User · Answer

Instead of typing it each time, you can:

[1] Create a generic print method inside the Spark shell.

def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)

[2] Or even better, using implicits, you can add a function to the RDD class to print its contents.

implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
    def print = rdd.foreach(println)
}

Example usage:

val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)

p(rdd) // 1
rdd.print // 2

Output:

2
6
4
8

Important: This only makes sense if you are working in local mode and with a small data set. Otherwise, you will either not be able to see the results on the client or run out of memory because of the size of the result.

User · Answer

You can convert your RDD to a DataFrame and then show() it.

// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// convert to DF then show it
fruits.toDF().show()

This will show the top 20 lines of your data, so the size of your data should not be an issue.

+------+---+
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+

User · Answer

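You can also save it as a file:

rdd.saveAsTextFile("alicia.txt")

Note that Spark creates a directory with that name containing one part file per partition, not a single text file.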
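To view what was saved, the output can be read back with textFile. This is a rough sketch, assuming local mode; the object name `SaveDemo`, the temporary output directory, and the sample data are hypothetical and only there for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SaveDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SaveDemo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical output location; it must not exist yet,
    // because saveAsTextFile refuses to overwrite an existing path
    val outDir = Files.createTempDirectory("rdd-demo").resolve("out").toString

    val rdd = sc.parallelize(Seq("first", "second", "third"), 2)

    // Action: writes one part file per partition under outDir
    rdd.saveAsTextFile(outDir)

    // Read the part files back and print them on the driver
    sc.textFile(outDir).collect().foreach(println)

    spark.stop()
  }
}
```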

User · Answer

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.

To print it, you can use foreach (which is an action):

linesWithSessionId.foreach(println)

To write it to disk, you can use one of the saveAs... functions (still actions) from the RDD API.
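The lazy behavior of map can be observed with an accumulator. This is a minimal sketch, assuming a local SparkSession; the object name `LazyMapDemo`, the accumulator name `seen`, and the sample lines are invented for illustration and are not from the question.

```scala
import org.apache.spark.sql.SparkSession

object LazyMapDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LazyMapDemo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in data for the RDD in the question
    val linesWithSessionId = sc.parallelize(Seq("id=1 a", "id=2 b", "id=3 c"))

    // Counts how many times the function passed to map actually runs
    val seen = sc.longAccumulator("seen")

    // Transformation: only describes the computation, nothing is executed here
    val mapped = linesWithSessionId.map { line =>
      seen.add(1)
      line.toUpperCase
    }
    println(s"calls after map: ${seen.value}")

    // Action: triggers the job, so the function inside map runs now
    mapped.collect().foreach(println)
    println(s"calls after collect: ${seen.value}")

    spark.stop()
  }
}
```

The first count should still be zero, since no action has run at that point; the count only grows once collect() triggers the job.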
