Spark: subtract two DataFrames

Question

In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the rows of the first one that differ from the second: val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD). Here onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD. How can this be achieved with DataFrames in Spark version 1.3.0?

User · Answer

From Spark 3.0: data_cl = reg_data.exceptAll(data_fr)

User · Answer

According to the PySpark docs, it would be subtract: df1.subtract(df2)

User · Answer

I tried subtract, but the result was not consistent. If I run df1.subtract(df2), not all rows of df1 are shown in the resulting DataFrame, probably because of the distinct mentioned in the docs (subtract is equivalent to EXCEPT DISTINCT in SQL). This solved my problem: df1.exceptAll(df2)

User · Answer

For me, df1.subtract(df2) was inconsistent: it worked correctly on one DataFrame but not on another. That was because of duplicates. df1.exceptAll(df2) returns a new DataFrame with the records from df1 that do not exist in df2, including any duplicates.
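To make the duplicate behaviour concrete, here is a minimal plain-Python sketch of the two semantics. It only models the row-level logic (EXCEPT DISTINCT for subtract, EXCEPT ALL for exceptAll) and does not call Spark. The function names subtract_distinct and except_all and the sample rows are made up for illustration.

```python
from collections import Counter

def subtract_distinct(left, right):
    """Model of df1.subtract(df2): distinct rows of left that are absent from right."""
    right_rows = set(right)
    seen = set()
    result = []
    for row in left:
        # Drop rows present in right, and drop repeated copies of a kept row.
        if row not in right_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result

def except_all(left, right):
    """Model of df1.exceptAll(df2): multiset difference, duplicates preserved."""
    remaining = Counter(right)
    result = []
    for row in left:
        if remaining[row] > 0:
            # One copy in right cancels exactly one copy in left.
            remaining[row] -= 1
        else:
            result.append(row)
    return result

# Hypothetical sample rows standing in for the contents of df1 and df2.
df1_rows = [("a", 1), ("a", 1), ("a", 1), ("b", 2), ("c", 3), ("c", 3)]
df2_rows = [("a", 1), ("b", 2)]

print(subtract_distinct(df1_rows, df2_rows))  # [('c', 3)]
print(except_all(df1_rows, df2_rows))         # [('a', 1), ('a', 1), ('c', 3), ('c', 3)]
```

In this model subtract collapses the two ("c", 3) rows into one, which is the "missing lines" effect described above. exceptAll keeps both, and it also keeps the two surplus ("a", 1) rows because df2 holds only one copy of that row. Spark itself does not guarantee the order of the returned rows.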

User · Answer

According to the API docs, doing dataFrame1.except(dataFrame2) will return a new DataFrame containing the rows in dataFrame1 but not in dataFrame2.
