Remove duplicates from a dataframe in PySpark

Question

I'm messing around with dataframes in PySpark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

Not quite sure why, as I seem to be following the syntax in the latest documentation.

# loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

# loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

# dropping duplicates from the dataframe
df1.dropDuplicates().show()

User · Answer

If you have a DataFrame and want to remove all duplicates with reference to a specific column (called 'colName'):

Count the rows before de-duplicating:

df.count()

Do the de-dupe (converting the column you are de-duping to string type first), then count again:

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupBy to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
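
As a hedged sketch, the before/after check above can be wrapped in one helper. The name dedupe_report is hypothetical, and df is assumed to be a PySpark DataFrame. The helper relies only on count() and dropDuplicates(), and it keeps the de-duplicated result in a variable, because dropDuplicates() returns a new DataFrame instead of changing df in place.

```python
def dedupe_report(df, col_name):
    """Return (deduped_df, rows_before, rows_after) for one key column.

    Hypothetical helper: `df` is assumed to be a pyspark.sql.DataFrame.
    """
    rows_before = df.count()
    # dropDuplicates returns a new DataFrame; `df` itself is left unchanged,
    # so the result has to be stored if it is needed later.
    deduped = df.dropDuplicates([col_name])
    rows_after = deduped.count()
    return deduped, rows_before, rows_after
```

Calling deduped, before, after = dedupe_report(df, 'colName') gives both counts and the DataFrame to keep working with. When several rows share the same 'colName' value, which one survives should not be relied on.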

User · Answer

It is not an import problem. You simply call .dropDuplicates() on the wrong object. While the class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

df1 = (sqlContext
    .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
    .dropDuplicates())

df1.collect()
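
Putting the fix together, a minimal sketch of the whole corrected pipeline might look like the following. The names parse_line, load_and_dedupe and COLUMNS are hypothetical, sc and sqlContext are assumed to be the SparkContext and SQLContext from the question, and the file is assumed to be comma-separated with at least four fields per line. Nothing is collected until the de-duplicated DataFrame exists.

```python
# Column names taken from the question.
COLUMNS = ["column1", "column2", "column3", "column4"]


def parse_line(line):
    """Split one CSV line once and keep the first four fields as a tuple."""
    return tuple(line.split(",")[:4])


def load_and_dedupe(sc, sqlContext, path):
    """Read a text file into a DataFrame and drop fully duplicated rows.

    Hypothetical helper: `sc` is assumed to be a SparkContext and
    `sqlContext` a SQLContext, as in the question.
    """
    # Keep the data as an RDD: calling .collect() here would turn it into a list.
    rdd1 = sc.textFile(path).map(parse_line)
    # createDataFrame returns a DataFrame, which does have dropDuplicates().
    df1 = sqlContext.createDataFrame(rdd1, COLUMNS)
    return df1.dropDuplicates()
```

With a live session, load_and_dedupe(sc, sqlContext, path).show() would print the de-duplicated rows. Calling collect() on the result returns a plain Python list, so it belongs at the very end of the chain, after every DataFrame method has been applied.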
