Simply put, the RDD is Spark's core abstraction, while the DataFrame is an API introduced in Spark 1.3. An RDD (Resilient Distributed Dataset) is a collection of data split into partitions and distributed across the cluster. RDDs satisfy a few key properties: they are immutable, partitioned, and fault-tolerant. The data held in an RDD can be either structured or unstructured.
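As a minimal sketch of these properties (assuming a running `SparkContext` named `sc`, as in the spark-shell), an RDD can be created from a local collection and transformed lazily:

```scala
// Assumes a SparkContext `sc` is already available (e.g. in spark-shell)
// Create an RDD split into 4 partitions from a local collection
val numbers = sc.parallelize(1 to 10, 4)

// Transformations are lazy: map builds a new (immutable) RDD,
// nothing executes yet
val squares = numbers.map(n => n * n)

// An action triggers the computation across the partitions
println(squares.reduce(_ + _))   // 385, the sum of squares 1..10
```

Because `numbers` is immutable, `map` produces a new RDD rather than modifying the original, and the lineage from `numbers` to `squares` is what lets Spark recompute lost partitions for fault tolerance.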
The DataFrame is an API available in Scala, Java, Python, and R. It lets you process structured and semi-structured data. A DataFrame is a distributed collection of data organized into named columns, much like a table in a relational database. Because a DataFrame carries a schema, Spark can optimize its operations far more easily than operations on raw RDDs.
With the DataFrame API you can work with JSON data, Parquet data, and Hive data in the same application.
// sc.textFile reads the raw file as an RDD[String]
val sampleRDD = sc.textFile("hdfs://localhost:9000/jsondata.json")
// Infer a schema from the JSON records and build a DataFrame
// (read.json(rdd) is available from Spark 1.4; in Spark 1.3 use sqlContext.jsonRDD)
val sample_DF = sqlContext.read.json(sampleRDD)
Here sample_DF is the resulting DataFrame, while sampleRDD is the raw data, an RDD.
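To illustrate the named-column API, here is a short sketch of what you can do with such a DataFrame. The column names `name` and `age` are assumptions about the JSON file's contents, not taken from the original:

```scala
// Assumes sample_DF was built as above and that the JSON records
// contain `name` and `age` fields (hypothetical columns)
sample_DF.printSchema()   // show the schema Spark inferred from the JSON

// Column-based operations, which Spark's optimizer can rearrange
val adults = sample_DF.filter(sample_DF("age") >= 18).select("name")
adults.show()

// Or register the DataFrame as a temporary table and query it with SQL
// (registerTempTable is the Spark 1.x API)
sample_DF.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18").show()
```

Both the column-based version and the SQL version compile down to the same optimized plan, which is the practical payoff of using DataFrames over raw RDDs.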