Write single CSV file using spark-csv

Question

I am using https://github.com/databricks/spark-csv. I am trying to write a single CSV file, but I am not able to: it creates a folder instead. I need a Scala function that takes parameters such as a path and a file name and writes that CSV file.

User · Answer

If you are using Databricks and can fit all the data into RAM on one worker (and thus can use .coalesce(1)), you can use dbfs to find and move the resulting CSV file:

val fileprefix= "/mnt/aws/path/file-prefix"

dataset
  .coalesce(1)       
  .write             
//.mode("overwrite") // I usually don't use this, but you may want to.
  .option("header", "true")
  .option("delimiter","\t")
  .csv(fileprefix+".tmp")

val partition_path = dbutils.fs.ls(fileprefix+".tmp/")
     .filter(file=>file.name.endsWith(".csv"))(0).path

dbutils.fs.cp(partition_path,fileprefix+".tab")

dbutils.fs.rm(fileprefix+".tmp",recurse=true)

If your file does not fit into RAM on the worker, you may want to consider chaotic3quilibrium's suggestion to use FileUtil.copyMerge(). I have not done this, and don't yet know whether it is possible, e.g., on S3.

This answer is built on previous answers to this question as well as my own tests of the provided code snippet. I originally posted it to Databricks and am republishing it here.

The best documentation for dbfs's rm's recursive option I have found is on a Databricks forum.

User · Answer

spark.sql("select * from df").coalesce(1).write.mode("append").option("header", "true").csv("/your/hdfs/path/")

spark.sql("select * from df") --> this is the dataframe

coalesce(1) or repartition(1) --> this will reduce your output to a single part file

write --> writing data

mode("append") --> appending data to the existing directory (the save mode is set with mode(), not with option())

option("header", "true") --> enabling the header

csv("...") --> write as a CSV file, and its output location in HDFS

User · Answer

If you are running Spark with HDFS, I've been solving the problem by writing CSV files normally and leveraging HDFS to do the merging. I'm doing that in Spark (1.6) directly:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // the "true" setting deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}

val newData = << create your dataframe >>

val outputfile = "/user/feeds/project/outputs/subject"
var filename = "myinsights"
var outputFileName = outputfile + "/temp_" + filename
var mergedFileName = outputfile + "/merged_" + filename
var mergeFindGlob = outputFileName

newData.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .mode("overwrite")
  .save(outputFileName)

merge(mergeFindGlob, mergedFileName)
newData.unpersist()

Can't remember where I learned this trick, but it might work for you.

User · Answer

I might be a little late to the game here, but using coalesce(1) or repartition(1) may work for small data sets, while large data sets would all be thrown into one partition on one node. This is likely to throw OOM errors or, at best, to process slowly.

I would highly suggest that you use the FileUtil.copyMerge() function from the Hadoop API. This will merge the outputs into a single file.

EDIT: This effectively brings the data to the driver rather than an executor node. coalesce() would be fine if a single executor has more RAM available than the driver.

EDIT 2: copyMerge() is being removed in Hadoop 3.0. See the following Stack Overflow article for more information on how to work with the newest version: How to do CopyMerge in Hadoop 3.0?
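Since copyMerge() is not available in Hadoop 3, one option is to write the merge by hand with FileSystem and IOUtils. The following is only a minimal sketch under stated assumptions: mergeParts is a hypothetical helper name (it is not part of Hadoop), it skips files whose names start with "_" or "." (such as _SUCCESS and checksum files), and it concatenates the remaining files in name order. If the parts were written with a header, each part's header ends up in the merged file.

```scala
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper, not a Hadoop API: concatenates the data files
// found directly under srcDir into the single file dstFile.
def mergeParts(srcFs: FileSystem, srcDir: Path,
               dstFs: FileSystem, dstFile: Path,
               deleteSource: Boolean, conf: Configuration): Boolean = {
  if (dstFs.exists(dstFile)) throw new IOException(s"Target $dstFile already exists")
  if (!srcFs.getFileStatus(srcDir).isDirectory) false
  else {
    // Keep only data files; skip markers such as _SUCCESS and hidden files.
    val parts = srcFs.listStatus(srcDir)
      .filter { s =>
        val name = s.getPath.getName
        s.isFile && !name.startsWith("_") && !name.startsWith(".")
      }
      .sortBy(_.getPath.getName)
    val out = dstFs.create(dstFile)
    try {
      parts.foreach { part =>
        val in = srcFs.open(part.getPath)
        // Stream the bytes; "false" leaves closing the streams to us.
        try IOUtils.copyBytes(in, out, conf, false)
        finally in.close()
      }
    } finally out.close()
    // Optionally remove the whole source directory after merging.
    if (deleteSource) srcFs.delete(srcDir, true) else true
  }
}
```

The bytes are streamed through the process that calls the helper (typically the driver), so the data does not need to fit in memory, but it all passes through one machine.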

User · Answer

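There is one more way, using Java I/O:

import java.io._

def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit): Unit = {
  val p = new java.io.PrintWriter(f)
  try { op(p) } finally { p.close() }
}

// collect() brings every row to the driver, so this only suits small data
printToFile(new File("C:/TEMP/df.csv")) { p => df.collect().foreach(p.println) }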

User · Answer

You can use rdd.coalesce(1, true).saveAsTextFile(path). It will store the data as a single file in path/part-00000.

User · Answer

A solution that works for S3, modified from Minkymorgan.

Simply pass the temporary partitioned directory path (with a different name than the final path) as srcPath and the single final csv/txt file as dstPath. Also specify deleteSource if you want to remove the original directory.

/**
 * Merges multiple partitions of Spark text file output into a single file.
 * Uses the SparkSession `spark` that is in scope.
 * @param srcPath source directory of partitioned files
 * @param dstPath output path of the single merged file
 * @param deleteSource whether or not to delete the source directory after merging
 */
def mergeTextFiles(srcPath: String, dstPath: String, deleteSource: Boolean): Unit = {
  import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
  import java.net.URI
  val config = spark.sparkContext.hadoopConfiguration
  val fs: FileSystem = FileSystem.get(new URI(srcPath), config)
  FileUtil.copyMerge(
    fs, new Path(srcPath), fs, new Path(dstPath), deleteSource, config, null
  )
}

User · Answer

I'm using this in Python to get a single file:

df.toPandas().to_csv("/tmp/my.csv", sep=',', header=True, index=False)

User · Answer

By using a ListBuffer we can save the data into a single file:

import java.io.FileWriter
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer

val text = spark.read.textFile("filepath")
var data = ListBuffer[String]()
for (line: String <- text.collect()) {
  data += line
}
val writer = new FileWriter("filepath")
data.foreach(line => writer.write(line.toString + "\n"))
writer.close()

User · Answer

Repartition/coalesce to 1 partition before you save (you'd still get a folder, but it would have one part file in it).

User · Answer

It is creating a folder with multiple files, because each partition is saved individually. If you need a single output file (still in a folder), you can repartition the data frame before saving (preferred if upstream data is large, but requires a shuffle):

df
  .repartition(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

or coalesce it:

df
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

All data will be written to mydata.csv/part-00000. Before you use this option, be sure you understand what is going on and what the cost of transferring all data to a single worker is. If you use a distributed file system with replication, data will be transferred multiple times: first fetched to a single worker and subsequently distributed over storage nodes.

Alternatively, you can leave your code as it is and use general-purpose tools like cat or HDFS getmerge to simply merge all the parts afterwards.
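For output that lives on the local file system, the "merge the parts afterwards" idea can also be sketched in plain Scala, roughly what cat part-* > file would do. This is a minimal sketch under assumptions: mergeLocalParts is a hypothetical helper name, the part files are assumed to start with "part-", and they are concatenated in name order. If the data was written with option("header", "true"), every part carries its own header line, so the merged file would repeat it.

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream}
import java.nio.file.Files

// Hypothetical helper: concatenates the part-* files of a local Spark
// output folder into one file, skipping _SUCCESS and .crc files.
def mergeLocalParts(srcDir: String, dstFile: String): Unit = {
  // listFiles() returns null when srcDir is not a directory.
  val parts = Option(new File(srcDir).listFiles())
    .getOrElse(Array.empty[File])
    .filter(f => f.isFile && f.getName.startsWith("part-"))
    .sortBy(_.getName)
  val out = new BufferedOutputStream(new FileOutputStream(dstFile))
  try parts.foreach(p => Files.copy(p.toPath, out)) // append each part in order
  finally out.close()
}
```

A call such as mergeLocalParts("mydata.csv", "mydata-merged.csv") would then read the folder produced above and write one file next to it; both names here are placeholders.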

User · Answer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions._

I solved it using the approach below (renaming the file in HDFS).

Step 1: Create the DataFrame and write it to HDFS

df.coalesce(1).write.format("csv").option("header", "false").mode(SaveMode.Overwrite).save("/hdfsfolder/blah/")

Step 2: Create the Hadoop config

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)

Step 3: Get the path of the HDFS folder

val pathFiles = new Path("/hdfsfolder/blah/")

Step 4: Get the Spark file names from the HDFS folder

val fileNames = hdfs.listFiles(pathFiles, false)
println(fileNames)

Step 5: Create a Scala mutable list and add all the file names to it

var fileNamesList = scala.collection.mutable.MutableList[String]()
while (fileNames.hasNext) {
  fileNamesList += fileNames.next().getPath.getName
}
println(fileNamesList)

Step 6: Filter the _SUCCESS file out of the list of file names

// get the file names which are not _SUCCESS
val partFileName = fileNamesList.filterNot(filenames => filenames == "_SUCCESS")

Step 7: Convert the Scala list to a string, append it to the HDFS folder string, and then apply the rename

// /yourhdfsfolder/ must be the same folder that was written in step 1
val partFileSourcePath = new Path("/yourhdfsfolder/" + partFileName.mkString(""))
val desiredCsvTargetPath = new Path("/yourhdfsfolder/" + "op_" + ".csv")
hdfs.rename(partFileSourcePath, desiredCsvTargetPath)
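The steps above can be wrapped into one function that takes a path and a file name, which is what the question asks for. This is a minimal sketch, not a definitive implementation: writeSingleCsv and the ".tmp" staging folder name are hypothetical choices, it assumes the data fits in a single partition, and it assumes a Spark 2.x or later DataFrame API. On object stores such as S3, rename is usually implemented as a copy followed by a delete.

```scala
import java.io.{FileNotFoundException, IOException}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: writes df as one CSV file named fileName inside dir.
def writeSingleCsv(df: DataFrame, dir: String, fileName: String): Unit = {
  // Stage the output in a temporary folder next to the target file.
  val tmpDir = new Path(dir, fileName + ".tmp")
  df.coalesce(1)
    .write
    .mode(SaveMode.Overwrite)
    .option("header", "true")
    .csv(tmpDir.toString)

  val fs = tmpDir.getFileSystem(df.sparkSession.sparkContext.hadoopConfiguration)
  // coalesce(1) leaves a single part-* file next to the _SUCCESS marker.
  val partFile = fs.listStatus(tmpDir)
    .map(_.getPath)
    .find(_.getName.startsWith("part-"))
    .getOrElse(throw new FileNotFoundException(s"No part file found in $tmpDir"))

  val target = new Path(dir, fileName)
  if (fs.exists(target)) fs.delete(target, false) // replace an existing file
  if (!fs.rename(partFile, target)) throw new IOException(s"Could not rename $partFile to $target")
  fs.delete(tmpDir, true) // remove the staging folder
}
```

A call like writeSingleCsv(df, "/hdfsfolder/blah", "op.csv") would leave a single op.csv in that folder; the arguments are placeholders.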

User · Answer

Spark's df.write() API will create multiple part files inside the given path. To force Spark to write only a single part file, use df.coalesce(1).write.csv(...) instead of df.repartition(1).write.csv(...), as coalesce is a narrow transformation whereas repartition is a wide transformation; see Spark - repartition() vs coalesce().

df.coalesce(1).write.csv(filepath, header=True)

will create a folder at the given filepath with one part-0001-...-c000.csv file. Use

cat filepath/part-0001-...-c000.csv > filename_you_want.csv

to get a user-friendly file name.

User · Answer

This answer expands on the accepted answer, gives more context, and provides code snippets you can run in the Spark Shell on your machine.

More context on accepted answer

The accepted answer might give you the impression the sample code outputs a single mydata.csv file, and that's not the case. Let's demonstrate:

val df = Seq("one", "two", "three").toDF("num")
df
  .repartition(1)
  .write.csv(sys.env("HOME") + "/Documents/tmp/mydata.csv")

Here's what's outputted:

Documents/
  tmp/
    mydata.csv/
      _SUCCESS
      part-00000-b3700504-e58b-4552-880b-e7b52c60157e-c000.csv

N.B. mydata.csv is a folder in the accepted answer; it's not a file!

How to output a single file with a specific name

We can use spark-daria to write out a single mydata.csv file.

import com.github.mrpowers.spark.daria.sql.DariaWriters
DariaWriters.writeSingleFile(
  df = df,
  format = "csv",
  sc = spark.sparkContext,
  tmpFolder = sys.env("HOME") + "/Documents/better/staging",
  filename = sys.env("HOME") + "/Documents/better/mydata.csv"
)

This'll output the file as follows:

Documents/
  better/
    mydata.csv

S3 paths

You'll need to pass s3a paths to DariaWriters.writeSingleFile to use this method in S3:

DariaWriters.writeSingleFile(
  df = df,
  format = "csv",
  sc = spark.sparkContext,
  tmpFolder = "s3a://bucket/data/src",
  filename = "s3a://bucket/data/dest/my_cool_file.csv"
)

See here for more info.

Avoiding copyMerge

copyMerge was removed from Hadoop 3. The DariaWriters.writeSingleFile implementation uses fs.rename, as described here. Spark 3 still used Hadoop 2, so copyMerge implementations will work in 2020. I'm not sure when Spark will upgrade to Hadoop 3, but it is better to avoid any copyMerge approach that'll cause your code to break when Spark upgrades Hadoop.

Source code

Look for the DariaWriters object in the spark-daria source code if you'd like to inspect the implementation.

PySpark implementation

It's easier to write out a single file with PySpark because you can convert the DataFrame to a Pandas DataFrame that gets written out as a single file by default.

from pathlib import Path
home = str(Path.home())
data = [
    ("jellyfish", "JALYF"),
    ("li", "L"),
    ("luisa", "LAS"),
    (None, None)
]
df = spark.createDataFrame(data, ["word", "expected"])
df.toPandas().to_csv(home + "/Documents/tmp/mydata-from-pyspark.csv", sep=',', header=True, index=False)

Limitations

The DariaWriters.writeSingleFile Scala approach and the df.toPandas() Python approach only work for small datasets. Huge datasets cannot be written out as single files. Writing out data as a single file isn't optimal from a performance perspective because the data can't be written in parallel.
