[scala] How to save a spark DataFrame as csv on disk?

For example, the result of this:

df.filter("project = 'en'").select("title","count").groupBy("title").sum()

would return an Array.

How can I save a Spark DataFrame as a CSV file on disk?

This question is related to scala apache-spark apache-spark-sql

The answer is


I had a similar problem. I needed to write a CSV file on the driver while I was connected to the cluster in client mode.

I wanted to reuse the same CSV parsing code as Apache Spark to avoid potential errors.

I checked the spark-csv code and found the code responsible for converting a DataFrame into a raw CSV RDD[String] in com.databricks.spark.csv.CsvSchemaRDD.

Sadly, it is hardcoded with sc.textFile at the end of the relevant method.

I copy-pasted that code, removed the last lines that use sc.textFile, and returned the RDD directly instead.

My code:

/*
  This is copy-pasted from com.databricks.spark.csv.CsvSchemaRDD.
  spark-csv has a perfect method for converting a DataFrame -> raw csv RDD[String],
  but in the last lines of that method it is hardcoded to write a text file -
  for our case we need the RDD itself.
 */
import org.apache.commons.csv.QuoteMode
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object DataframeToRawCsvRDD {

  val defaultCsvFormat = com.databricks.spark.csv.defaultCsvFormat

  def apply(dataFrame: DataFrame, parameters: Map[String, String] = Map())
           (implicit sc: SparkContext): RDD[String] = {
    val delimiter = parameters.getOrElse("delimiter", ",")
    val delimiterChar = if (delimiter.length == 1) {
      delimiter.charAt(0)
    } else {
      throw new Exception("Delimiter cannot be more than one character.")
    }

    val escape = parameters.getOrElse("escape", null)
    val escapeChar: Character = if (escape == null) {
      null
    } else if (escape.length == 1) {
      escape.charAt(0)
    } else {
      throw new Exception("Escape character cannot be more than one character.")
    }

    val quote = parameters.getOrElse("quote", "\"")
    val quoteChar: Character = if (quote == null) {
      null
    } else if (quote.length == 1) {
      quote.charAt(0)
    } else {
      throw new Exception("Quotation cannot be more than one character.")
    }

    val quoteModeString = parameters.getOrElse("quoteMode", "MINIMAL")
    val quoteMode: QuoteMode = if (quoteModeString == null) {
      null
    } else {
      QuoteMode.valueOf(quoteModeString.toUpperCase)
    }

    val nullValue = parameters.getOrElse("nullValue", "null")

    val csvFormat = defaultCsvFormat
      .withDelimiter(delimiterChar)
      .withQuote(quoteChar)
      .withEscape(escapeChar)
      .withQuoteMode(quoteMode)
      .withSkipHeaderRecord(false)
      .withNullString(nullValue)

    val generateHeader = parameters.getOrElse("header", "false").toBoolean
    val headerRdd = if (generateHeader) {
      sc.parallelize(Seq(
        csvFormat.format(dataFrame.columns.map(_.asInstanceOf[AnyRef]): _*)
      ))
    } else {
      sc.emptyRDD[String]
    }

    val rowsRdd = dataFrame.rdd.map(row => {
      csvFormat.format(row.toSeq.map(_.asInstanceOf[AnyRef]): _*)
    })

    headerRdd union rowsRdd
  }

}
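
To show how this helper might be used, here is a minimal sketch of writing the CSV on the driver; writeCsvOnDriver, resultDf and the output path are hypothetical names, and it assumes the spark-csv artifact is on the classpath:

import java.io.PrintWriter

import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// Hypothetical usage of DataframeToRawCsvRDD: collect the CSV lines and
// write them to a local file on the driver. Only suitable for small results,
// since collect() pulls all rows to the driver.
def writeCsvOnDriver(resultDf: DataFrame, path: String)(implicit sc: SparkContext): Unit = {
  val csvLines = DataframeToRawCsvRDD(resultDf, Map("header" -> "true"))
  val writer = new PrintWriter(path)
  try {
    csvLines.collect().foreach(writer.println)
  } finally {
    writer.close()
  }
}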

Writing a DataFrame to disk as CSV is similar to reading from CSV. If you want your result as one file, you can use coalesce.

df.coalesce(1)
  .write
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .csv("output/path")

If your result is an array, you should use a language-specific solution rather than the Spark DataFrame API, because results like that are returned to the driver machine.
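
For example, if you have already collected rows to the driver, plain JVM file I/O is enough. A minimal sketch, assuming an Array[Row] named rows obtained via df.collect() (no quoting or escaping is handled):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.Row

// Write rows already collected on the driver (e.g. df.collect()) to a local file.
// Naive CSV rendering: embedded commas, quotes and newlines are NOT escaped.
def writeRowsLocally(rows: Array[Row], path: String): Unit = {
  val lines = rows.map(_.toSeq.mkString(","))
  Files.write(Paths.get(path), lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
}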


I had a similar issue where I had to save the contents of a DataFrame to a CSV file with a name that I defined. df.write.format("csv").save("<my-path>") was creating a directory rather than a file, so I had to come up with the following solution. Most of the code is taken from dataframe-to-csv, with small modifications to the logic.

import java.io.File

import org.apache.spark.sql.DataFrame

def saveDfToCsv(df: DataFrame, tsvOutput: String,
                sep: String = ",", header: Boolean = false): Unit = {
    // Temporary directory that Spark writes its part file into.
    val tmpParquetDir = "Posts.tmp.parquet"

    // Write a single part file via spark-csv.
    df.repartition(1).write.
        format("com.databricks.spark.csv").
        option("header", header.toString).
        option("delimiter", sep).
        save(tmpParquetDir)

    // Find the single part file and rename it to the requested output name.
    val dir = new File(tmpParquetDir)
    val newFileRegex = tmpParquetDir + File.separatorChar + "part-00000.*"
    val tmpTsvFile = dir.listFiles.filter(_.toPath.toString.matches(newFileRegex))(0).toString
    new File(tmpTsvFile).renameTo(new File(tsvOutput))

    // Clean up the temporary directory.
    dir.listFiles.foreach(f => f.delete)
    dir.delete
}
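
A hypothetical call, assuming a DataFrame named df is in scope and the output file name is just an example:

// Produce a single local file named report.csv with a header row.
saveDfToCsv(df, "report.csv", sep = ",", header = true)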

Apache Spark 1.x does not support native CSV output on disk out of the box.

You have four available solutions though:

  1. You can convert your Dataframe into an RDD :

    def convertToReadableString(r: Row) = ???
    df.rdd.map(convertToReadableString).saveAsTextFile(filepath)
    

    This will create a folder at filepath. Under that path you'll find the partition files (e.g. part-000*).

    What I usually do if I want to append all the partitions into a big CSV is

    cat filePath/part* > mycsvfile.csv
    

    Some will use coalesce(1,false) to create one partition from the RDD. It's usually a bad practice, since it may overwhelm the driver by pulling all the data you are collecting to it.

    Note that df.rdd will return an RDD[Row]. A possible convertToReadableString is sketched after this list.

  2. With Spark < 2.x, you can use the Databricks spark-csv library:

    • Spark 1.4+:

      df.write.format("com.databricks.spark.csv").save(filepath)
      
    • Spark 1.3:

      df.save(filepath,"com.databricks.spark.csv")
      
  3. With Spark 2.x the spark-csv package is not needed as it's included in Spark.

    df.write.format("csv").save(filepath)
    
  4. You can convert to a local Pandas DataFrame and use its to_csv method (PySpark only).

Note: Solutions 1, 2 and 3 will result in CSV format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.
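
As referenced in solution 1, here is a minimal sketch of what convertToReadableString could look like; it does naive comma-joining without quoting or escaping, so treat it as illustrative only:

import org.apache.spark.sql.Row

// Naive CSV rendering of a Row: nulls become empty strings,
// embedded commas and quotes are NOT escaped.
def convertToReadableString(r: Row): String =
  r.toSeq.map(v => if (v == null) "" else v.toString).mkString(",")

// Solution 1 then becomes:
// df.rdd.map(convertToReadableString).saveAsTextFile(filepath)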

