I'm just wondering what is the difference between an
DataFrame (Spark 2.0.0 DataFrame is a mere type alias for
Dataset[Row]) in Apache Spark?
Can you convert one to the other?
This question is tagged with
~ Asked on 2015-07-20 02:31:21
DataFrame is defined well with a google search for "DataFrame definition":
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it, are not as constrained.
However, you can go from a DataFrame to an
RDD via its
rdd method, and you can go from an
RDD to a
DataFrame (if the RDD is in a tabular format) via the
In general it is recommended to use a
DataFrame where possible due to the built in query optimization.
~ Answered on 2015-07-20 03:09:05
First thing is
DataFramewas evolved from
Yes.. conversion between
RDD is absolutely possible.
Below are some sample code snippets.
Below are some of options to create dataframe.
yourrddOffrow.toDF converts to
createDataFrame of sql context
val df = spark.createDataFrame(rddOfRow, schema)
where schema can be from some of below options as described by nice SO post..
From scala case class and scala reflection api
import org.apache.spark.sql.catalyst.ScalaReflection val schema = ScalaReflection.schemaFor[YourScalacaseClass].dataType.asInstanceOf[StructType]
import org.apache.spark.sql.Encoders val mySchema = Encoders.product[MyCaseClass].schema
as described by Schema can also be created using
val schema = new StructType() .add(StructField("id", StringType, true)) .add(StructField("col1", DoubleType, true)) .add(StructField("col2", DoubleType, true)) etc...
RDD(Resilient Distributed Dataset) API has been in Spark since the 1.0 release.
RDDAPI provides many transformation methods, such as
reduce() for performing computations on the data. Each of these methods results in a new
RDDrepresenting the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are
rdd.filter(_.age > 21) // transformation .map(_.last)// transformation .saveAsObjectFile("under21.bin") // action
Example: Filter by attribute with RDD
rdd.filter(_.age > 21)
Spark 1.3 introduced a new
DataFrameAPI as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The
DataFrameAPI introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization.
DataFrameAPI is radically different from the
RDDAPI because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans
Example SQL style :
df.filter("age > 21");
Limitations : Because the code is referring to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect then the error will only detected at runtime, when the query plan is created.
Another downside with the
DataFrame API is that it is very scala-centric and while it does support Java, the support is limited.
For example, when creating a
DataFrame from an existing
RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the
scala.Product interface. Scala
case class works out the box because they implement this interface.
DatasetAPI, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the
RDDAPI but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the
When it comes to serializing data, the
DatasetAPI has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.
DatasetAPI is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant.
Dataset API SQL style :
dataset.filter(_.age < 21);
Catalist level flow..(Demystifying DataFrame and Dataset presentation from spark summit)
Further reading... databricks article - A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets
~ Answered on 2016-08-19 07:23:53