Compare two data frames to find the rows in data frame 1 that are not present in data frame 2

Question

I have the following 2 data frames:

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

I want to find the rows a1 has that a2 doesn't.

Is there a built-in function for this type of operation?

(P.S.: I did write a solution for it; I am simply curious whether someone has already written more polished code.)

Here is my solution:

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    rows.in.a1.that.are.not.in.a2 <- function(a1, a2)
    {
        a1.vec <- apply(a1, 1, paste, collapse = "")
        a2.vec <- apply(a2, 1, paste, collapse = "")
        a1.without.a2.rows <- a1[!a1.vec %in% a2.vec, ]
        return(a1.without.a2.rows)
    }
    rows.in.a1.that.are.not.in.a2(a1, a2)

User · Answer

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

    > df1 <- data.frame(a = 1:5, b = letters[1:5], row = 1:5)
    > df2 <- data.frame(a = 1:3, b = letters[1:3], row = 1:3)
    > df_compare = compare_df(df1, df2, "row")

    > df_compare$comparison_df
      row chng_type a b
    1   4         + 4 d
    2   5         + 5 e

A more complicated example:

    library(compareDF)
    df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                             "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                     id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                     hp = c(110, 110, 181, 110, 245, 62),
                     cyl = c(6, 6, 4, 6, 8, 4),
                     qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

    df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                             "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                     id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                     hp = c(110, 110, 93, 110, 175, 105),
                     cyl = c(6, 6, 4, 6, 8, 6),
                     qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

    > df_compare$comparison_df
        grp chng_type                id1 id2  hp cyl  qsec
      1   1         -  Hornet Sportabout Dus 175   8 17.02
      2   2         +         Datsun 710 Dat 181   4 33.00
      3   2         -         Datsun 710 Dat  93   4 18.61
      4   3         +         Duster 360 Dus 245   8 15.84
      5   7         +          Merc 240D Mer  62   4 20.00
      6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking:

    df_compare$html_output

User · Answer

Using subset:

    missing <- subset(a1, !(a %in% a2$a))

User · Answer

I adapted the merge function to get this functionality. On larger data frames it uses less memory than the full merge solution, and I can play with the names of the key columns.

Another solution is to use the library prob.

    # Derived from src/library/base/R/merge.R
    # Part of the R package, http://www.R-project.org
    #
    # This program is free software; you can redistribute it and/or modify
    # it under the terms of the GNU General Public License as published by
    # the Free Software Foundation; either version 2 of the License, or
    # (at your option) any later version.
    #
    # This program is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU General Public License for more details.
    #
    # A copy of the GNU General Public License is available at
    # http://www.r-project.org/Licenses/

    XinY <-
        function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
                 notin = FALSE, incomparables = NULL,
                 ...)
    {
        fix.by <- function(by, df)
        {
            ## fix up 'by' to be a valid set of cols by number: 0 is row.names
            if(is.null(by)) by <- numeric(0L)
            by <- as.vector(by)
            nc <- ncol(df)
            if(is.character(by))
                by <- match(by, c("row.names", names(df))) - 1L
            else if(is.numeric(by)) {
                if(any(by < 0L) || any(by > nc))
                    stop("'by' must match numbers of columns")
            } else if(is.logical(by)) {
                if(length(by) != nc) stop("'by' must match number of columns")
                by <- seq_along(by)[by]
            } else stop("'by' must specify column(s) as numbers, names or logical")
            if(any(is.na(by))) stop("'by' must specify valid column(s)")
            unique(by)
        }

        nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
        by.x <- fix.by(by.x, x)
        by.y <- fix.by(by.y, y)
        if((l.b <- length(by.x)) != length(by.y))
            stop("'by.x' and 'by.y' specify different numbers of columns")
        if(l.b == 0L) {
            ## was: stop("no columns to match on")
            ## returns x (assigned to res so the row-name reset below works)
            res <- x
        }
        else {
            if(any(by.x == 0L)) {
                x <- cbind(Row.names = I(row.names(x)), x)
                by.x <- by.x + 1L
            }
            if(any(by.y == 0L)) {
                y <- cbind(Row.names = I(row.names(y)), y)
                by.y <- by.y + 1L
            }
            ## create keys from 'by' columns:
            if(l.b == 1L) {                  # (be faster)
                bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
                by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
            } else {
                ## Do these together for consistency in as.character.
                ## Use same set of names.
                bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
                names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
                bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
                bx <- bz[seq_len(nx)]
                by <- bz[nx + seq_len(ny)]
            }
            comm <- match(bx, by, 0L)
            if (notin) {
                res <- x[comm == 0,]
            } else {
                res <- x[comm > 0,]
            }
        }
        ## avoid a copy
        ## row.names(res) <- NULL
        attr(res, "row.names") <- .set_row_names(nrow(res))
        res
    }

    XnotinY <-
        function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
                 notin = TRUE, incomparables = NULL,
                 ...)
    {
        XinY(x, y, by, by.x, by.y, notin, incomparables)
    }
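The key-building step of the function above can be illustrated on its own. The following is a minimal sketch, not the answer author's code: `rows_not_in` is a hypothetical helper name, and it assumes both data frames contain every column listed in `by`. It pastes the key columns into one string per row, using "\r" as the separator as the multi-column branch above does, and keeps the rows of x whose key has no match in y.

```r
# Hypothetical helper (name chosen for illustration only).
rows_not_in <- function(x, y, by = intersect(names(x), names(y))) {
  # One key string per row; unname() keeps column names from being
  # mistaken for paste() arguments such as `sep` or `collapse`.
  key_x <- do.call(paste, c(unname(as.list(x[by])), sep = "\r"))
  key_y <- do.call(paste, c(unname(as.list(y[by])), sep = "\r"))
  # match() returns 0 for keys of x that are absent from y.
  x[match(key_x, key_y, nomatch = 0L) == 0L, , drop = FALSE]
}

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
res <- rows_not_in(a1, a2)
print(res)
```

Passing a subset of column names as `by` restricts the comparison to those columns, which is the flexibility the answer refers to.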

User · Answer

Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])
    different.names <- (!a1$a %in% a2$a)
    not.in.a2 <- a1[different.names, ]
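A key-only comparison answers "is this key present?", not "is this whole row present?". The sketch below uses a hypothetical a2 in which key 4 exists but carries a different b value, and contrasts the key-only filter with a comparison on both columns. The variable names `by_key` and `by_row` are made up for the example.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
# Hypothetical a2: key 4 is present, but its b value differs from a1.
a2 <- data.frame(a = c(1:3, 4L), b = c("a", "b", "c", "z"))

# Comparing on the key column alone: row 4 counts as "present in a2".
by_key <- a1[!a1$a %in% a2$a, ]

# Comparing on both columns: row 4 is reported as missing from a2.
key1 <- paste(a1$a, a1$b, sep = "\r")
key2 <- paste(a2$a, a2$b, sep = "\r")
by_row <- a1[!key1 %in% key2, ]

print(by_key)
print(by_row)
```

Which result is wanted depends on whether a changed row should count as missing, so the key-only form fits best when the key really is a primary key.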

User · Answer

In dplyr:

    setdiff(a1, a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a left excluding join.

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question. Here are the results for the setdiff() code when using the OP's data:

    > a1
      a b
    1 1 a
    2 2 b
    3 3 c
    4 4 d
    5 5 e

    > a2
      a b
    1 1 a
    2 2 b
    3 3 c

    > setdiff(a1, a2)
      a b
    1 4 d
    2 5 e

Or even anti_join(a1, a2) will get you the same results.

For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

User · Answer

Yet another solution, based on match_df in plyr. Here's plyr's match_df:

    match_df <- function (x, y, on = NULL)
    {
        if (is.null(on)) {
            on <- intersect(names(x), names(y))
            message("Matching on: ", paste(on, collapse = ", "))
        }
        keys <- join.keys(x, y, on)
        x[keys$x %in% keys$y, , drop = FALSE]
    }

We can modify it to negate:

    library(plyr)
    negate_match_df <- function (x, y, on = NULL)
    {
        if (is.null(on)) {
            on <- intersect(names(x), names(y))
            message("Matching on: ", paste(on, collapse = ", "))
        }
        keys <- join.keys(x, y, on)
        x[!(keys$x %in% keys$y), , drop = FALSE]
    }

Then:

    diff <- negate_match_df(a1, a2)

User · Answer

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data frame and then merge:

    a1$included_a1 <- TRUE
    a2$included_a2 <- TRUE
    res <- merge(a1, a2, all = TRUE)

Missing values in included_a1 will note which rows are missing in a1; similarly for a2.

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get for free all the error checking that is necessary for a good solution.
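Both points can be reproduced on a small case. The sketch below uses made-up character data chosen so that two different rows paste to the same string, then applies the indicator-and-merge approach to the same data. The variable names (`pasted`, `m1`, `m2`, `only_in_a1`) are illustrative only.

```r
# Character columns chosen so that two different rows paste to the same string.
a1 <- data.frame(a = c("ab", "x"), b = c("c", "y"), stringsAsFactors = FALSE)
a2 <- data.frame(a = "a", b = "bc", stringsAsFactors = FALSE)

# Pasting without a separator: ("ab", "c") and ("a", "bc") both become "abc".
a1_vec <- apply(a1, 1, paste, collapse = "")
a2_vec <- apply(a2, 1, paste, collapse = "")
pasted <- a1[!a1_vec %in% a2_vec, ]

# Indicator columns plus a full outer merge on the shared columns a and b.
m1 <- transform(a1, included_a1 = TRUE)
m2 <- transform(a2, included_a2 = TRUE)
res <- merge(m1, m2, all = TRUE)

# Rows with NA in included_a2 exist only in a1.
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]

print(pasted)      # one row: the ("ab", "c") row was wrongly treated as present
print(only_in_a1)  # two rows: both rows of a1 are absent from a2
```

The merge matches on column names rather than positions, so the column order of the two data frames does not need to agree.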

User · Answer

sqldf provides a nice solution:

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])

    require(sqldf)

    a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

    a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons:

    require(dplyr)
    anti_join(a1, a2)

And semi_join to filter rows in a1 that are also in a2:

    semi_join(a1, a2)

User · Answer

Your example data does not have any duplicates, but your solution handles them automatically. This means that some of the other answers may not match the results of your function when there are duplicates. Here is my solution, which handles duplicates the same way as yours. It also scales great!

    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])
    rows.in.a1.that.are.not.in.a2 <- function(a1, a2)
    {
        a1.vec <- apply(a1, 1, paste, collapse = "")
        a2.vec <- apply(a2, 1, paste, collapse = "")
        a1.without.a2.rows <- a1[!a1.vec %in% a2.vec, ]
        return(a1.without.a2.rows)
    }

    library(data.table)
    setDT(a1)
    setDT(a2)

    # no duplicates - as in example code
    r <- fsetdiff(a1, a2)
    all.equal(r, rows.in.a1.that.are.not.in.a2(a1, a2))
    #[1] TRUE

    # handling duplicates - make some duplicates
    a1 <- rbind(a1, a1, a1)
    a2 <- rbind(a2, a2, a2)
    r <- fsetdiff(a1, a2, all = TRUE)
    all.equal(r, rows.in.a1.that.are.not.in.a2(a1, a2))
    #[1] TRUE

It needs data.table 1.9.8 or later.
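To see why duplicates matter, here is a small base R sketch under stated assumptions: the membership filter stands in for the OP's function, and `unique()` stands in for any set-style difference that reports each distinct row once. It does not claim how any particular package behaves.

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Make some duplicates, as in the answer above.
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)

key1 <- paste(a1$a, a1$b, sep = "\r")
key2 <- paste(a2$a, a2$b, sep = "\r")

# Membership filter: every copy of an unmatched row is kept.
keep_all <- a1[!key1 %in% key2, ]

# Set-style difference: each unmatched row is reported once.
distinct_only <- unique(keep_all)

print(nrow(keep_all))
print(nrow(distinct_only))
```

The two results differ in row count, so it is worth checking which behaviour a given function implements before comparing its output with another approach.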

User · Answer

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

    library(compare)
    a1 <- data.frame(a = 1:5, b = letters[1:5])
    a2 <- data.frame(a = 1:3, b = letters[1:3])
    comparison <- compare(a1, a2, allowAll = TRUE)
    comparison$tM
    #  a b
    #1 1 a
    #2 2 b
    #3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

    difference <-
        data.frame(lapply(1:ncol(a1), function(i) setdiff(a1[, i], comparison$tM[, i])))
    colnames(difference) <- colnames(a1)
    difference
    #  a b
    #1 4 d
    #2 5 e

User · Answer

You could use the daff package (which wraps the daff.js library using the V8 package):

    library(daff)

    diff_data(data_ref = a2,
              data = a1)

produces the following difference object:

    Daff Comparison: 'a2' vs. 'a1'
    First 6 and last 6 patch lines:
       @@   a   b
    1 ... ... ...
    2       3   c
    3 +++   4   d
    4 +++   5   e
    5 ... ... ...
    6 ... ... ...
    7       3   c
    8 +++   4   d
    9 +++   5   e

The tabular diff format is documented separately and should be pretty self-explanatory. The lines with +++ in the first column (@@) are the ones which are new in a1 and not present in a2.

The difference object can be used to patch_data(), to store the difference for documentation purposes using write_diff(), or to visualize the difference using render_diff():

    render_diff(
        diff_data(data_ref = a2,
                  data = a1)
    )

generates a neat HTML output.

User · Answer

The following code uses both data.table and fastmatch for increased speed.

    library(data.table)
    library(fastmatch)

    a1 <- setDT(data.frame(a = 1:5, b = letters[1:5]))
    a2 <- setDT(data.frame(a = 1:3, b = letters[1:3]))

    compare_rows <- a1$a %fin% a2$a
    # the %fin% function comes from the `fastmatch` package

    added_rows <- a1[which(compare_rows == FALSE)]

    added_rows

    #    a b
    # 1: 4 d
    # 2: 5 e

User · Answer

Really fast comparison, to get a count of differences. This compares values position by position, so both data frames must have the same number of rows in the same order.

Using a specific column name:

    colname <- "CreatedDate"                          # specify column name
    index <- match(colname, names(source_df))         # get column index for the column name
    sel <- source_df[, index] == target_df[, index]   # logical vector of TRUE and FALSE values
    table(sel)["FALSE"]                               # count of differences
    table(sel)["TRUE"]                                # count of matches

For the complete data frame, do not provide a column name or index:

    sel <- source_df[, ] == target_df[, ]             # logical matrix of TRUE and FALSE values
    table(sel)["FALSE"]                               # count of differences
    table(sel)["TRUE"]                                # count of matches
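A small worked case may help. In the sketch below, `source_df` and `target_df` are hypothetical data frames of the same shape and row order with one differing date. It counts with `sum()` rather than `table()`, because `table(sel)["FALSE"]` returns NA instead of 0 when there are no differences at all.

```r
# Hypothetical data: same shape, same row order, one differing date.
source_df <- data.frame(id = 1:4,
                        CreatedDate = c("2020-01-01", "2020-01-02",
                                        "2020-01-03", "2020-01-04"),
                        stringsAsFactors = FALSE)
target_df <- source_df
target_df$CreatedDate[3] <- "2021-12-31"

colname <- "CreatedDate"
sel <- source_df[[colname]] == target_df[[colname]]  # logical vector

n_diff  <- sum(!sel)  # count of differences
n_match <- sum(sel)   # count of matches

print(c(differences = n_diff, matches = n_match))
```

Note that this reports how many positions differ, not which rows of one data frame are absent from the other, so it suits aligned tables rather than the row-membership problem in the question.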

User · Answer

Using the diffobj package:

    library(diffobj)

    diffPrint(a1, a2)
    diffObj(a1, a2)
