In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content that differs from the first one:

val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)

onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.

How can this be achieved with DataFrames in Spark version 1.3.0?
This question is related to: apache-spark, dataframe, rdd
For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another, because of duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.

From Spark 3.0:

data_cl = reg_data.exceptAll(data_fr)

In the PySpark docs it is subtract:

df1.subtract(df2)
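The difference between the two operations can be sketched in plain Python with multisets (this is a sketch of the semantics only, not Spark code; the helper names are made up for illustration): subtract/except deduplicates the result, while exceptAll performs a multiset difference that preserves duplicate counts.

```python
from collections import Counter

def except_distinct(df1_rows, df2_rows):
    # Mirrors DataFrame.subtract / except semantics:
    # the result is deduplicated and keeps only rows of df1
    # that never appear in df2.
    return sorted(set(df1_rows) - set(df2_rows))

def except_all(df1_rows, df2_rows):
    # Mirrors DataFrame.exceptAll semantics: multiset difference,
    # so duplicates in df1 survive when df2 holds fewer copies.
    diff = Counter(df1_rows) - Counter(df2_rows)
    return sorted(diff.elements())

df1 = ["a", "a", "b", "c"]
df2 = ["b"]
print(except_distinct(df1, df2))  # ['a', 'c']
print(except_all(df1, df2))       # ['a', 'a', 'c']
```

Note how the duplicate "a" is collapsed by the distinct-style difference but kept by the exceptAll-style one, which matches the inconsistency described above.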
I tried subtract, but the result was not consistent. If I run df1.subtract(df2), not all rows of df1 appear in the result DataFrame, probably because of the distinct mentioned in the docs.
This solved my problem:
df1.exceptAll(df2)