[python] How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?

Thanks!

The answer is


If you cannot use spark-csv, you can do the following:

df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")

If you need to handle strings with linebreaks or comma that will not work. Use this:

import csv
import cStringIO

def row2csv(row):
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")

For Apache Spark 2+, in order to save dataframe into single csv file. Use following command

query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicate that I need one partition of csv only. you can change it according to your requirements.


How about this (in you don't want an one liner) ?

for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)

f is a opened file descriptor. Also the separator is a TAB char, but it's easy to change to whatever you want.


You need to repartition the Dataframe in a single partition and then define the format, path and other parameter to the file in Unix file system format and here you go,

df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv",header = 'true')

Read more about the repartition function Read more about the save function

However, repartition is a costly function and toPandas() is worst. Try using .coalesce(1) instead of .repartition(1) in previous syntax for better performance.

Read more on repartition vs coalesce functions.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to apache-spark

Select Specific Columns from Spark DataFrame Select columns in PySpark dataframe What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Spark dataframe: collect () vs select () How does createOrReplaceTempView work in Spark? Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey Filter df when values matches part of a string in pyspark Filtering a pyspark dataframe using isin by exclusion Convert date from String to Date format in Dataframes

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to apache-spark-sql

Select Specific Columns from Spark DataFrame Pyspark: Filter dataframe based on multiple conditions Select columns in PySpark dataframe What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Spark dataframe: collect () vs select () How does createOrReplaceTempView work in Spark? Filter df when values matches part of a string in pyspark Convert date from String to Date format in Dataframes Take n rows from a spark dataframe and pass to toPandas()

Examples related to export-to-csv

Excel: macro to export worksheet as CSV file without leaving my current Excel sheet Export to csv/excel from kibana How to export data from Spark SQL to CSV How to export a table dataframe in PySpark to csv? Java - Writing strings to a CSV file How to export dataGridView data Instantly to Excel on button click? Export javascript data to CSV file without server interaction How to create and download a csv file from php script? Export to CSV using jQuery and html export html table to csv