I am trying to convert a Spark RDD to a DataFrame. I have seen the documentation and examples where the schema is passed to the
sqlContext.createDataFrame(rdd, schema)
function.
But I have 38 columns or fields, and this number will increase further. If I manually specify the schema with each field's information, it is going to be a very tedious job.
Is there any other way to specify the schema without knowing the column information beforehand?
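For reference, this is the kind of manual schema definition I am trying to avoid (the field names and types here are just placeholders):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# One StructField per column -- with 38+ columns this gets very tedious.
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", IntegerType(), True),
    # ... 36 more StructFields ...
])
df = sqlContext.createDataFrame(rdd, schema)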
This question is related to
python
apache-spark
pyspark
spark-dataframe
Try this and see if it works:
# spark is an existing SparkSession; RddName is the RDD you want to convert
sc = spark.sparkContext
# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(RddName)
schemaPeople.createOrReplaceTempView("RddName")
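Note that schema inference only works if the elements of the RDD are something Spark can introspect, e.g. Row objects (column names come from the Row fields) or plain tuples (columns get auto-generated names like _1, _2). A small sketch, assuming an RDD of Row objects:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
people = spark.createDataFrame(rdd)  # column names and types inferred from the Rows
people.printSchema()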
I liked Arun's answer better, but there is a tiny problem and I could not comment on or edit the answer: sparkContext does not have createDataFrame; sqlContext does (as Thiago mentioned). So:
from pyspark.sql import SQLContext
# assuming the Spark environment is set up and sc is spark.sparkContext
sqlContext = SQLContext(sc)
schemaPeople = sqlContext.createDataFrame(RDDName)
schemaPeople.createOrReplaceTempView("RDDName")
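A related shortcut: once an SQLContext (or SparkSession) has been created, PySpark also attaches a toDF() method to RDDs that performs the same schema inference, so the following sketch should work too (assuming RDDName holds Rows or tuples):
# toDF() infers the schema the same way createDataFrame does
df = RDDName.toDF()
df.show()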