It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:
df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())
df1.collect()
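
To see why the original call fails, here is a minimal sketch that runs without Spark. It assumes plain tuples as stand-ins for the Row objects that .collect() returns, and the sample values are made up for illustration. It also shows one way to de-duplicate a list you have already collected, assuming the rows are hashable; for anything but small results, calling dropDuplicates on the DataFrame before collecting is likely the better choice.

```python
# Stand-in for the result of .collect(): a plain Python list of rows.
# Plain tuples replace pyspark.sql.Row objects here (an assumption made
# so the sketch runs without Spark); the values are hypothetical.
collected = [
    (1, 'a', 2.0, 'x'),
    (1, 'a', 2.0, 'x'),  # duplicate of the first row
    (2, 'b', 3.0, 'y'),
]

# A list has no dropDuplicates attribute, so calling it raises AttributeError.
has_method = hasattr(collected, 'dropDuplicates')

# Order-preserving de-duplication on the driver side (requires hashable rows).
deduplicated = list(dict.fromkeys(collected))
```

Here has_method is False, and deduplicated keeps the first occurrence of each row in its original order.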