[r] Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

I have the following 2 data.frames:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

I want to find the row a1 has that a2 doesn't.

Is there a built in function for this type of operation?

(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)

Here is my solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

This question is related to r merge compare rows dataframe

The answer is


Yet another solution based on match_df in plyr. Here's plyr's match_df:

match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[keys$x %in% keys$y, , drop = FALSE]
}

We can modify it to negate:

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[!(keys$x %in% keys$y), , drop = FALSE]
}

Then:

diff <- negate_match_df(a1,a2)

The following code uses both data.table and fastmatch for increased speed.

library("data.table")
library("fastmatch")

a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))

compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package

added_rows <- a1[which(compare_rows == FALSE)]

added_rows

#    a b
# 1: 4 d
# 2: 5 e

Using subset:

missing<-subset(a1, !(a %in% a2$a))

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

missing values in included_a1 will note which rows are missing in a1. similarly for a2.

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.


Using diffobj package:

library(diffobj)

diffPrint(a1, a2)
diffObj(a1, a2)

enter image description here

enter image description here


I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

A more complicated example:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking

df_compare$html_output enter image description here


In dplyr:

setdiff(a1,a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a

Left Excluding Join Venn Diagram

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the setdiff() code when using the OP's data:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf


I adapted the merge function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.

Another solution is to use the library prob.

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x
        x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}


XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}

Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}

library(data.table)
setDT(a1)
setDT(a2)

# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

It needs data.table 1.9.8+


You could use the daff package (which wraps the daff.js library using the V8 package):

library(daff)

diff_data(data_ref = a2,
          data = a1)

produces the following difference object:

Daff Comparison: ‘a2’ vs. ‘a1’ 
  First 6 and last 6 patch lines:
   @@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

The tabular diff format is described here and should be pretty self-explanatory. The lines with +++ in the first column @@ are the ones which are new in a1 and not present in a2.

The difference object can be used to patch_data(), to store the difference for documentation purposes using write_diff() or to visualize the difference using render_diff():

render_diff(
    diff_data(data_ref = a2,
              data = a1)
)

generates a neat HTML output:

enter image description here


Really fast comparison, to get count of differences. Using specific column name.

colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

For complete dataframe, do not provide column or index name

sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

sqldf provides a nice solution

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons

require(dplyr) 
anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2

semi_join(a1,a2)

Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to merge

Pandas Merging 101 Python: pandas merge multiple dataframes Git merge with force overwrite Merge two dataframes by index Visual Studio Code how to resolve merge conflicts with git? merge one local branch into another local branch Merging dataframes on index with pandas Git merge is not possible because I have unmerged files Git merge develop into feature branch outputs "Already up-to-date" while it's not How merge two objects array in angularjs?

Examples related to compare

Checking for duplicate strings in JavaScript array How to compare two files in Notepad++ v6.6.8 How to compare LocalDate instances Java 8 Comparing the contents of two files in Sublime Text comparing elements of the same array in java How to compare two dates along with time in java bash string compare to multiple correct values Query comparing dates in SQL How to compare two java objects Comparing two integer arrays in Java

Examples related to rows

SQL count rows in a table Converting rows into columns and columns into rows using R Delete rows containing specific strings in R How to append rows to an R data frame Excel Create Collapsible Indented Row Hierarchies Excel CSV. file with more than 1,048,576 rows of data Find which rows have different values for a given column in Teradata SQL More than 1 row in <Input type="textarea" /> Update multiple rows with different values in a single SQL query Repeat rows of a data.frame

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe