In addition to dropDuplicates, there is also a method with the name we know from pandas, drop_duplicates:
drop_duplicates() is an alias for dropDuplicates().
Example
s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()
+---+---+
|  k|  v|
+---+---+
|foo|  1|
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+
Drop by subset
s_df.drop_duplicates(subset = ['k']).show()
+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  1|
+---+---+
Drop fully duplicated rows (no subset)
s_df.drop_duplicates().show()
+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  3|
|foo|  1|
+---+---+