To answer the questions directly:
Will collect() behave the same way if called on a dataframe?
Yes, spark.DataFrame.collect is functionally the same as spark.RDD.collect. They serve the same purpose on these different objects: each brings the entire distributed dataset back to the driver as a local collection.
What about the select() method?
There is no such thing as spark.RDD.select, so it cannot be the same as spark.DataFrame.select.
Does it also work the same way as collect() if called on a dataframe?
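The only similarity between select and collect is that both are methods on a DataFrame. They have absolutely zero overlap in functionality: select returns a new DataFrame containing only the chosen columns, while collect returns the rows themselves to the driver.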
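To make the difference concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark: ToyFrame and its fields are hypothetical names made up for illustration. It only shows the shape of the two operations, where select hands back another frame-like object and collect hands back local data.

```python
# Toy model only: ToyFrame is a hypothetical class, not Spark's DataFrame.

class ToyFrame:
    def __init__(self, rows, columns=None):
        self._rows = rows        # list of dicts standing in for distributed rows
        self._columns = columns  # None means "all columns"

    def select(self, *columns):
        """Return a new ToyFrame describing a projection; no rows are copied yet."""
        return ToyFrame(self._rows, list(columns))

    def collect(self):
        """Materialize the rows as a local list, applying any projection."""
        if self._columns is None:
            return [dict(row) for row in self._rows]
        return [{c: row[c] for c in self._columns} for row in self._rows]


people = ToyFrame([
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": 29, "city": "Rome"},
])

names_only = people.select("name")  # still a ToyFrame
result = names_only.collect()       # a plain Python list of dicts
```

In this sketch names_only is still a ToyFrame, and only the call to collect produces a plain list you can loop over locally.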
Here's my own description: collect is the opposite of sc.parallelize. select is the same as the SELECT in any SQL statement.
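The "opposite of sc.parallelize" idea can be sketched in plain Python. This is only a toy model under the assumption that a distributed dataset is just a list of chunks; the parallelize and collect functions below are hypothetical stand-ins, not Spark's implementation.

```python
# Toy model only: plain Python lists stand in for a distributed dataset.
# These helpers are hypothetical and are not Spark's API.

def parallelize(data, num_partitions):
    """Split a local list into roughly equal chunks ("partitions")."""
    data = list(data)
    size, extra = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def collect(partitions):
    """Gather every partition back into one local list, in order."""
    return [item for part in partitions for item in part]


local = [1, 2, 3, 4, 5, 6, 7]
distributed = parallelize(local, 3)
round_trip = collect(distributed)
```

Going from local to distributed and back with collect gives you the original list again, which is the sense in which the two are opposites.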
If you are still having trouble understanding what collect actually does (for either RDD or DataFrame), then you should look up some articles about what Spark is doing behind the scenes, e.g.: