How to join (merge) data frames (inner, outer, left, right)

Question

Given two data frames:

    df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
    df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

    df1
    #  CustomerId Product
    #           1 Toaster
    #           2 Toaster
    #           3 Toaster
    #           4   Radio
    #           5   Radio
    #           6   Radio

    df2
    #  CustomerId   State
    #           2 Alabama
    #           4 Alabama
    #           6    Ohio

How can I do database-style, i.e., SQL-style, joins? That is, how do I get:

- An inner join of df1 and df2: return only the rows in which the left table has matching keys in the right table.
- An outer join of df1 and df2: return all rows from both tables, joining records from the left which have matching keys in the right table.
- A left outer join (or simply left join) of df1 and df2: return all rows from the left table, and any rows with matching keys from the right table.
- A right outer join of df1 and df2: return all rows from the right table, and any rows with matching keys from the left table.

Extra credit: How can I do a SQL-style select statement?

User · Answer

You can do joins as well using Hadley Wickham's awesome dplyr package.

    library(dplyr)

    # make sure that CustomerId cols are both type numeric
    # they ARE not using the provided code in question and dplyr will complain
    df1$CustomerId <- as.numeric(df1$CustomerId)
    df2$CustomerId <- as.numeric(df2$CustomerId)

Mutating joins: add columns to df1 using matches in df2

    # inner
    inner_join(df1, df2)

    # left outer
    left_join(df1, df2)

    # right outer
    right_join(df1, df2)

    # alternate right outer
    left_join(df2, df1)

    # full join
    full_join(df1, df2)

Filtering joins: filter out rows in df1, don't modify columns

    semi_join(df1, df2) # keep only observations in df1 that match in df2
    anti_join(df1, df2) # drops all observations in df1 that match in df2
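For intuition about the two filtering joins, here is a minimal base R sketch of the rows they keep on the question's data, assuming a single key column. It only mirrors the row selection and is not how dplyr implements these verbs; semi_rows and anti_rows are names made up for this illustration.

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"),
                  stringsAsFactors = FALSE)

# Rows of df1 whose key appears in df2 (the rows a semi join keeps)
semi_rows <- df1[df1$CustomerId %in% df2$CustomerId, ]

# Rows of df1 whose key does not appear in df2 (the rows an anti join keeps)
anti_rows <- df1[!(df1$CustomerId %in% df2$CustomerId), ]

semi_rows$CustomerId  # 2 4 6
anti_rows$CustomerId  # 1 3 5
```

In both cases the result has only the columns of df1, which is what distinguishes a filtering join from a mutating one.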

User · Answer

Update join

One other important SQL-style join is an "update join", where columns in one table are updated (or created) using another table.

Modifying the OP's example tables:

    sales = data.frame(
      CustomerId = c(1, 1, 1, 3, 4, 6),
      Year = 2000:2005,
      Product = c(rep("Toaster", 3), rep("Radio", 3))
    )
    cust = data.frame(
      CustomerId = c(1, 1, 4, 6),
      Year = c(2001L, 2002L, 2002L, 2002L),
      State = state.name[1:4]
    )

    sales
    # CustomerId Year Product
    #          1 2000 Toaster
    #          1 2001 Toaster
    #          1 2002 Toaster
    #          3 2003   Radio
    #          4 2004   Radio
    #          6 2005   Radio

    cust
    # CustomerId Year    State
    #          1 2001  Alabama
    #          1 2002   Alaska
    #          4 2002  Arizona
    #          6 2002 Arkansas

Suppose we want to add the customer's state from cust to the purchases table, sales, ignoring the year column. With base R, we can identify matching rows and then copy values over:

    sales$State <- cust$State[ match(sales$CustomerId, cust$CustomerId) ]

    # CustomerId Year Product    State
    #          1 2000 Toaster  Alabama
    #          1 2001 Toaster  Alabama
    #          1 2002 Toaster  Alabama
    #          3 2003   Radio     <NA>
    #          4 2004   Radio  Arizona
    #          6 2005   Radio Arkansas

    # cleanup for the next example
    sales$State <- NULL

As can be seen here, match selects the first matching row from the customer table.

Update join with multiple columns

The approach above works well when we are joining on only a single column and are satisfied with the first match. Suppose we want the year of measurement in the customer table to match the year of sale.

As @bgoldst's answer mentions, match with interaction might be an option for this case. More straightforwardly, one could use data.table:

    library(data.table)
    setDT(sales); setDT(cust)

    sales[, State := cust[sales, on=.(CustomerId, Year), x.State]]

    #    CustomerId Year Product   State
    # 1:          1 2000 Toaster    <NA>
    # 2:          1 2001 Toaster Alabama
    # 3:          1 2002 Toaster  Alaska
    # 4:          3 2003   Radio    <NA>
    # 5:          4 2004   Radio    <NA>
    # 6:          6 2005   Radio    <NA>

    # cleanup for next example
    sales[, State := NULL]

Rolling update join

Alternately, we may want to take the last state the customer was found in:

    sales[, State := cust[sales, on=.(CustomerId, Year), roll=TRUE, x.State]]

    #    CustomerId Year Product    State
    # 1:          1 2000 Toaster     <NA>
    # 2:          1 2001 Toaster  Alabama
    # 3:          1 2002 Toaster   Alaska
    # 4:          3 2003   Radio     <NA>
    # 5:          4 2004   Radio  Arizona
    # 6:          6 2005   Radio Arkansas

The three examples above all focus on creating/adding a new column. See the related R FAQ for an example of updating/modifying an existing column.
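As a rough base R counterpart to the multi-column update join, one option is to build a composite key with paste() and pass it to match(). This is only a sketch: the separator "\r" is an arbitrary choice assumed not to occur in the data, and key_sales and key_cust are hypothetical helper names.

```r
sales <- data.frame(
  CustomerId = c(1, 1, 1, 3, 4, 6),
  Year = 2000:2005,
  Product = c(rep("Toaster", 3), rep("Radio", 3)),
  stringsAsFactors = FALSE
)
cust <- data.frame(
  CustomerId = c(1, 1, 4, 6),
  Year = c(2001L, 2002L, 2002L, 2002L),
  State = state.name[1:4],
  stringsAsFactors = FALSE
)

# One string per row that encodes both key columns
key_sales <- paste(sales$CustomerId, sales$Year, sep = "\r")
key_cust  <- paste(cust$CustomerId, cust$Year, sep = "\r")

# Copy State over for rows where both CustomerId and Year match
sales$State <- cust$State[match(key_sales, key_cust)]
sales$State
# NA "Alabama" "Alaska" NA NA NA
```

Like the single-column version, match() takes the first matching row only, so this fits the case where each key combination appears at most once in cust.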

User · Answer

In joining two data frames with ~1 million rows each, one with 2 columns and the other with ~20, I've surprisingly found merge(..., all.x = TRUE, all.y = TRUE) to be faster than dplyr::full_join(). This is with dplyr v0.4.

merge takes ~17 seconds, full_join takes ~65 seconds.

Some food for thought, since I generally default to dplyr for manipulation tasks.
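Timings like these depend heavily on package versions and hardware, so it is worth measuring on your own setup. The following is a small sketch of how one might do that; the table shapes and the size n are assumptions chosen to run quickly, not the data behind the numbers above, and the dplyr part is skipped if the package is not installed.

```r
set.seed(1)
n <- 1e4  # far smaller than the ~1 million rows mentioned above; raise as needed

# Two tables sharing an integer key: one narrow, one wider
a <- data.frame(id = sample(n), x = rnorm(n))
b <- data.frame(id = sample(n), matrix(rnorm(n * 5), ncol = 5))

# Full outer join with base R
t_merge <- system.time(m <- merge(a, b, by = "id", all.x = TRUE, all.y = TRUE))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # Same join with dplyr, timed the same way
  t_dplyr <- system.time(d <- dplyr::full_join(a, b, by = "id"))
  print(rbind(merge = t_merge[1:3], full_join = t_dplyr[1:3]))
} else {
  print(t_merge[1:3])
}
```

A single system.time() call is noisy; repeating the measurement several times gives a more reliable comparison.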

User · Answer

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
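To make the differences concrete, here is a short sketch that runs each of the merge() calls above on the question's data and compares row counts. The renamed key column CustId in df2b is made up here purely to illustrate by.x and by.y.

```r
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), "Ohio"),
                  stringsAsFactors = FALSE)

inner <- merge(df1, df2, by = "CustomerId")
full  <- merge(df1, df2, by = "CustomerId", all = TRUE)
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)
cross <- merge(df1, df2, by = NULL)  # every row of df1 paired with every row of df2

counts <- sapply(list(inner = inner, full = full, left = left,
                      right = right, cross = cross), nrow)
counts
# inner  full  left right cross
#     3     6     6     3    18

# Key columns with different names (CustId is a hypothetical name)
df2b <- setNames(df2, c("CustId", "State"))
inner_b <- merge(df1, df2b, by.x = "CustomerId", by.y = "CustId")
names(inner_b)  # "CustomerId" "Product" "State"
```

With by.x and by.y, the key column in the result takes its name from the first data frame.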

User · Answer

I would recommend checking out Gabor Grothendieck's sqldf package, which allows you to express these operations in SQL.

    library(sqldf)

    ## inner join
    df3 <- sqldf("SELECT CustomerId, Product, State
                  FROM df1
                  JOIN df2 USING(CustomerID)")

    ## left join (substitute 'right' for right join)
    df4 <- sqldf("SELECT CustomerId, Product, State
                  FROM df1
                  LEFT JOIN df2 USING(CustomerID)")

I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).

See Gabor's sqldf GitHub for more information on joins.

User · Answer

For the case of a left join with a 0..*:0..1 cardinality or a right join with a 0..1:0..* cardinality it is possible to assign in-place the unilateral columns from the joiner (the 0..1 table) directly onto the joinee (the 0..* table), and thereby avoid the creation of an entirely new table of data. This requires matching the key columns from the joinee into the joiner and indexing and ordering the joiner's rows accordingly for the assignment.

If the key is a single column, then we can use a single call to match() to do the matching. This is the case I'll cover in this answer.

Here's an example based on the OP, except I've added an extra row to df2 with an id of 7 to test the case of a non-matching key in the joiner. This is effectively df1 left join df2:

    df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L)));
    df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas'));
    df1[names(df2)[-1L]] <- df2[match(df1[,1L],df2[,1L]),-1L];
    df1;
    ##   CustomerId Product   State
    ## 1          1 Toaster    <NA>
    ## 2          2 Toaster Alabama
    ## 3          3 Toaster    <NA>
    ## 4          4   Radio Alabama
    ## 5          5   Radio    <NA>
    ## 6          6   Radio    Ohio

In the above I hard-coded an assumption that the key column is the first column of both input tables. I would argue that, in general, this is not an unreasonable assumption, since, if you have a data.frame with a key column, it would be strange if it had not been set up as the first column of the data.frame from the outset. And you can always reorder the columns to make it so. An advantageous consequence of this assumption is that the name of the key column does not have to be hard-coded, although I suppose it's just replacing one assumption with another. Concision is another advantage of integer indexing, as well as speed. In the benchmarks below I'll change the implementation to use string name indexing to match the competing implementations.

I think this is a particularly appropriate solution if you have several tables that you want to left join against a single large table. Repeatedly rebuilding the entire table for each merge would be unnecessary and inefficient.

On the other hand, if you need the joinee to remain unaltered through this operation for whatever reason, then this solution cannot be used, since it modifies the joinee directly. Although in that case you could simply make a copy and perform the in-place assignment(s) on the copy.

As a side note, I briefly looked into possible matching solutions for multicolumn keys. Unfortunately, the only matching solutions I found were:

- inefficient concatenations, e.g. match(interaction(df1$a,df1$b),interaction(df2$a,df2$b)), or the same idea with paste().
- inefficient cartesian conjunctions, e.g. outer(df1$a,df2$a,`==`) & outer(df1$b,df2$b,`==`).
- base R merge() and equivalent package-based merge functions, which always allocate a new table to return the merged result, and thus are not suitable for an in-place assignment-based solution.

For example, see Matching multiple columns on different data frames and getting other column as result, match two columns with two other columns, Matching on multiple columns, and the dupe of this question where I originally came up with the in-place solution, Combine two data frames with different number of rows in R.

Benchmarking

I decided to do my own benchmarking to see how the in-place assignment approach compares to the other solutions that have been offered in this question.

Testing code:

    library(microbenchmark);
    library(data.table);
    library(sqldf);
    library(plyr);
    library(dplyr);

    solSpecs <- list(
        merge=list(testFuncs=list(
            inner=function(df1,df2,key) merge(df1,df2,key),
            left =function(df1,df2,key) merge(df1,df2,key,all.x=T),
            right=function(df1,df2,key) merge(df1,df2,key,all.y=T),
            full =function(df1,df2,key) merge(df1,df2,key,all=T)
        )),
        data.table.unkeyed=list(argSpec='data.table.unkeyed',testFuncs=list(
            inner=function(dt1,dt2,key) dt1[dt2,on=key,nomatch=0L,allow.cartesian=T],
            left =function(dt1,dt2,key) dt2[dt1,on=key,allow.cartesian=T],
            right=function(dt1,dt2,key) dt1[dt2,on=key,allow.cartesian=T],
            full =function(dt1,dt2,key) merge(dt1,dt2,key,all=T,allow.cartesian=T) ## calls merge.data.table()
        )),
        data.table.keyed=list(argSpec='data.table.keyed',testFuncs=list(
            inner=function(dt1,dt2) dt1[dt2,nomatch=0L,allow.cartesian=T],
            left =function(dt1,dt2) dt2[dt1,allow.cartesian=T],
            right=function(dt1,dt2) dt1[dt2,allow.cartesian=T],
            full =function(dt1,dt2) merge(dt1,dt2,all=T,allow.cartesian=T) ## calls merge.data.table()
        )),
        sqldf.unindexed=list(testFuncs=list( ## note: must pass connection=NULL to avoid running against the live DB connection, which would result in collisions with the residual tables from the last query upload
            inner=function(df1,df2,key) sqldf(paste0('select * from df1 inner join df2 using(',paste(collapse=',',key),')'),connection=NULL),
            left =function(df1,df2,key) sqldf(paste0('select * from df1 left join df2 using(',paste(collapse=',',key),')'),connection=NULL),
            right=function(df1,df2,key) sqldf(paste0('select * from df2 left join df1 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do right join proper, not yet supported; inverted left join is equivalent
            ##full =function(df1,df2,key) sqldf(paste0('select * from df1 full join df2 using(',paste(collapse=',',key),')'),connection=NULL) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
        )),
        sqldf.indexed=list(testFuncs=list( ## important: requires an active DB connection with preindexed main.df1 and main.df2 ready to go; arguments are actually ignored
            inner=function(df1,df2,key) sqldf(paste0('select * from main.df1 inner join main.df2 using(',paste(collapse=',',key),')')),
            left =function(df1,df2,key) sqldf(paste0('select * from main.df1 left join main.df2 using(',paste(collapse=',',key),')')),
            right=function(df1,df2,key) sqldf(paste0('select * from main.df2 left join main.df1 using(',paste(collapse=',',key),')')) ## can't do right join proper, not yet supported; inverted left join is equivalent
            ##full =function(df1,df2,key) sqldf(paste0('select * from main.df1 full join main.df2 using(',paste(collapse=',',key),')')) ## can't do full join proper, not yet supported; possible to hack it with a union of left joins, but too unreasonable to include in testing
        )),
        plyr=list(testFuncs=list(
            inner=function(df1,df2,key) join(df1,df2,key,'inner'),
            left =function(df1,df2,key) join(df1,df2,key,'left'),
            right=function(df1,df2,key) join(df1,df2,key,'right'),
            full =function(df1,df2,key) join(df1,df2,key,'full')
        )),
        dplyr=list(testFuncs=list(
            inner=function(df1,df2,key) inner_join(df1,df2,key),
            left =function(df1,df2,key) left_join(df1,df2,key),
            right=function(df1,df2,key) right_join(df1,df2,key),
            full =function(df1,df2,key) full_join(df1,df2,key)
        )),
        in.place=list(testFuncs=list(
            left =function(df1,df2,key) { cns <- setdiff(names(df2),key); df1[cns] <- df2[match(df1[,key],df2[,key]),cns]; df1; },
            right=function(df1,df2,key) { cns <- setdiff(names(df1),key); df2[cns] <- df1[match(df2[,key],df1[,key]),cns]; df2; }
        ))
    );

    getSolTypes <- function() names(solSpecs);
    getJoinTypes <- function() unique(unlist(lapply(solSpecs,function(x) names(x$testFuncs))));
    getArgSpec <- function(argSpecs,key=NULL) if (is.null(key)) argSpecs$default else argSpecs[[key]];

    initSqldf <- function() {
        sqldf(); ## creates sqlite connection on first run, cleans up and closes existing connection otherwise
        if (exists('sqldfInitFlag',envir=globalenv(),inherits=F) && sqldfInitFlag) { ## false only on first run
            sqldf(); ## creates a new connection
        } else {
            assign('sqldfInitFlag',T,envir=globalenv()); ## set to true for the one and only time
        }; ## end if
        invisible();
    }; ## end initSqldf()

    setUpBenchmarkCall <- function(argSpecs,joinType,solTypes=getSolTypes(),env=parent.frame()) {
        ## builds and returns a list of expressions suitable for passing to the list argument of microbenchmark(), and assigns variables to resolve symbol references in those expressions
        callExpressions <- list();
        nms <- character();
        for (solType in solTypes) {
            testFunc <- solSpecs[[solType]]$testFuncs[[joinType]];
            if (is.null(testFunc)) next; ## this join type is not defined for this solution type
            testFuncName <- paste0('tf.',solType);
            assign(testFuncName,testFunc,envir=env);
            argSpecKey <- solSpecs[[solType]]$argSpec;
            argSpec <- getArgSpec(argSpecs,argSpecKey);
            argList <- setNames(nm=names(argSpec$args),vector('list',length(argSpec$args)));
            for (i in seq_along(argSpec$args)) {
                argName <- paste0('tfa.',argSpecKey,i);
                assign(argName,argSpec$args[[i]],envir=env);
                argList[[i]] <- if (i%in%argSpec$copySpec) call('copy',as.symbol(argName)) else as.symbol(argName);
            }; ## end for
            callExpressions[[length(callExpressions)+1L]] <- do.call(call,c(list(testFuncName),argList),quote=T);
            nms[length(nms)+1L] <- solType;
        }; ## end for
        names(callExpressions) <- nms;
        callExpressions;
    }; ## end setUpBenchmarkCall()

    harmonize <- function(res) {
        res <- as.data.frame(res); ## coerce to data.frame
        for (ci in which(sapply(res,is.factor))) res[[ci]] <- as.character(res[[ci]]); ## coerce factor columns to character
        for (ci in which(sapply(res,is.logical))) res[[ci]] <- as.integer(res[[ci]]); ## coerce logical columns to integer (works around sqldf quirk of munging logicals to integers)
        ##for (ci in which(sapply(res,inherits,'POSIXct'))) res[[ci]] <- as.double(res[[ci]]); ## coerce POSIXct columns to double (works around sqldf quirk of losing POSIXct class) ----- POSIXct doesn't work at all in sqldf.indexed
        res <- res[order(names(res))]; ## order columns
        res <- res[do.call(order,res),]; ## order rows
        res;
    }; ## end harmonize()

    checkIdentical <- function(argSpecs,solTypes=getSolTypes()) {
        for (joinType in getJoinTypes()) {
            callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
            if (length(callExpressions)<2L) next;
            ex <- harmonize(eval(callExpressions[[1L]]));
            for (i in seq(2L,len=length(callExpressions)-1L)) {
                y <- harmonize(eval(callExpressions[[i]]));
                if (!isTRUE(all.equal(ex,y,check.attributes=F))) {
                    ex <<- ex;
                    y <<- y;
                    solType <- names(callExpressions)[i];
                    stop(paste0('non-identical: ',solType,' ',joinType,'.'));
                }; ## end if
            }; ## end for
        }; ## end for
        invisible();
    }; ## end checkIdentical()

    testJoinType <- function(argSpecs,joinType,solTypes=getSolTypes(),metric=NULL,times=100L) {
        callExpressions <- setUpBenchmarkCall(argSpecs,joinType,solTypes);
        bm <- microbenchmark(list=callExpressions,times=times);
        if (is.null(metric)) return(bm);
        bm <- summary(bm);
        res <- setNames(nm=names(callExpressions),bm[[metric]]);
        attr(res,'unit') <- attr(bm,'unit');
        res;
    }; ## end testJoinType()

    testAllJoinTypes <- function(argSpecs,solTypes=getSolTypes(),metric=NULL,times=100L) {
        joinTypes <- getJoinTypes();
        resList <- setNames(nm=joinTypes,lapply(joinTypes,function(joinType) testJoinType(argSpecs,joinType,solTypes,metric,times)));
        if (is.null(metric)) return(resList);
        units <- unname(unlist(lapply(resList,attr,'unit')));
        res <- do.call(data.frame,c(list(join=joinTypes),setNames(nm=solTypes,rep(list(rep(NA_real_,length(joinTypes))),length(solTypes))),list(unit=units,stringsAsFactors=F)));
        for (i in seq_along(resList)) res[i,match(names(resList[[i]]),names(res))] <- resList[[i]];
        res;
    }; ## end testAllJoinTypes()

    testGrid <- function(makeArgSpecsFunc,sizes,overlaps,solTypes=getSolTypes(),joinTypes=getJoinTypes(),metric='median',times=100L) {

        res <- expand.grid(size=sizes,overlap=overlaps,joinType=joinTypes,stringsAsFactors=F);
        res[solTypes] <- NA_real_;
        res$unit <- NA_character_;
        for (ri in seq_len(nrow(res))) {

            size <- res$size[ri];
            overlap <- res$overlap[ri];
            joinType <- res$joinType[ri];

            argSpecs <- makeArgSpecsFunc(size,overlap);

            checkIdentical(argSpecs,solTypes);

            cur <- testJoinType(argSpecs,joinType,solTypes,metric,times);
            res[ri,match(names(cur),names(res))] <- cur;
            res$unit[ri] <- attr(cur,'unit');

        }; ## end for

        res;

    }; ## end testGrid()

Here's a benchmark of the example based on the OP that I demonstrated earlier:

    ## OP's example, supplemented with a non-matching row in df2
    argSpecs <- list(
        default=list(copySpec=1:2,args=list(
            df1 <- data.frame(CustomerId=1:6,Product=c(rep('Toaster',3L),rep('Radio',3L))),
            df2 <- data.frame(CustomerId=c(2L,4L,6L,7L),State=c(rep('Alabama',2L),'Ohio','Texas')),
            'CustomerId'
        )),
        data.table.unkeyed=list(copySpec=1:2,args=list(
            as.data.table(df1),
            as.data.table(df2),
            'CustomerId'
        )),
        data.table.keyed=list(copySpec=1:2,args=list(
            setkey(as.data.table(df1),CustomerId),
            setkey(as.data.table(df2),CustomerId)
        ))
    );
    ## prepare sqldf
    initSqldf();
    sqldf('create index df1_key on df1(CustomerId);'); ## upload and create an sqlite index on df1
    sqldf('create index df2_key on df2(CustomerId);'); ## upload and create an sqlite index on df2

    checkIdentical(argSpecs);

    testAllJoinTypes(argSpecs,metric='median');
    ##    join    merge data.table.unkeyed data.table.keyed sqldf.unindexed sqldf.indexed      plyr    dplyr in.place         unit
    ## 1 inner  644.259           861.9345          923.516        9157.752      1580.390  959.2250 270.9190       NA microseconds
    ## 2  left  713.539           888.0205          910.045        8820.334      1529.714  968.4195 270.9185 224.3045 microseconds
    ## 3 right 1221.804           909.1900          923.944        8930.668      1533.135 1063.7860 269.8495 218.1035 microseconds
    ## 4  full 1302.203          3107.5380         3184.729              NA            NA 1593.6475 270.7055       NA microseconds

Here I benchmark on random input data, trying different scales and different patterns of key overlap between the two input tables. This benchmark is still restricted to the case of a single-column integer key. As well, to ensure that the in-place solution would work for both left and right joins of the same tables, all random test data uses 0..1:0..1 cardinality. This is implemented by sampling without replacement the key column of the first data.frame when generating the key column of the second data.frame.

    makeArgSpecs.singleIntegerKey.optionalOneToOne <- function(size,overlap) {

        com <- as.integer(size*overlap);

        argSpecs <- list(
            default=list(copySpec=1:2,args=list(
                df1 <- data.frame(id=sample(size),y1=rnorm(size),y2=rnorm(size)),
                df2 <- data.frame(id=sample(c(if (com>0L) sample(df1$id,com) else integer(),seq(size+1L,len=size-com))),y3=rnorm(size),y4=rnorm(size)),
                'id'
            )),
            data.table.unkeyed=list(copySpec=1:2,args=list(
                as.data.table(df1),
                as.data.table(df2),
                'id'
            )),
            data.table.keyed=list(copySpec=1:2,args=list(
                setkey(as.data.table(df1),id),
                setkey(as.data.table(df2),id)
            ))
        );
        ## prepare sqldf
        initSqldf();
        sqldf('create index df1_key on df1(id);'); ## upload and create an sqlite index on df1
        sqldf('create index df2_key on df2(id);'); ## upload and create an sqlite index on df2

        argSpecs;

    }; ## end makeArgSpecs.singleIntegerKey.optionalOneToOne()

    ## cross of various input sizes and key overlaps
    sizes <- c(1e1L,1e3L,1e6L);
    overlaps <- c(0.99,0.5,0.01);
    system.time({ res <- testGrid(makeArgSpecs.singleIntegerKey.optionalOneToOne,sizes,overlaps); });
    ##     user   system  elapsed
    ## 22024.65 12308.63 34493.19

I wrote some code to create log-log plots of the above results. I generated a separate plot for each overlap percentage. It's a little bit cluttered, but I like having all the solution types and join types represented in the same plot.

I used spline interpolation to show a smooth curve for each solution/join type combination, drawn with individual pch symbols. The join type is captured by the pch symbol, using a dot for inner, left and right angle brackets for left and right, and a diamond for full. The solution type is captured by the color as shown in the legend.

    plotRes <- function(res,titleFunc,useFloor=F) {
        solTypes <- setdiff(names(res),c('size','overlap','joinType','unit')); ## derive from res
        normMult <- c(microseconds=1e-3,milliseconds=1); ## normalize to milliseconds
        joinTypes <- getJoinTypes();
        cols <- c(merge='purple',data.table.unkeyed='blue',data.table.keyed='#00DDDD',sqldf.unindexed='brown',sqldf.indexed='orange',plyr='red',dplyr='#00BB00',in.place='magenta');
        pchs <- list(inner=20L,left='<',right='>',full=23L);
        cexs <- c(inner=0.7,left=1,right=1,full=0.7);
        NP <- 60L;
        ord <- order(decreasing=T,colMeans(res[res$size==max(res$size),solTypes],na.rm=T));
        ymajors <- data.frame(y=c(1,1e3),label=c('1ms','1s'),stringsAsFactors=F);
        for (overlap in unique(res$overlap)) {
            x1 <- res[res$overlap==overlap,];
            x1[solTypes] <- x1[solTypes]*normMult[x1$unit]; x1$unit <- NULL;
            xlim <- c(1e1,max(x1$size));
            xticks <- 10^seq(log10(xlim[1L]),log10(xlim[2L]));
            ylim <- c(1e-1,10^((if (useFloor) floor else ceiling)(log10(max(x1[solTypes],na.rm=T))))); ## use floor() to zoom in a little more, only sqldf.unindexed will break above, but xpd=NA will keep it visible
            yticks <- 10^seq(log10(ylim[1L]),log10(ylim[2L]));
            yticks.minor <- rep(yticks[-length(yticks)],each=9L)*1:9;
            plot(NA,xlim=xlim,ylim=ylim,xaxs='i',yaxs='i',axes=F,xlab='size (rows)',ylab='time (ms)',log='xy');
            abline(v=xticks,col='lightgrey');
            abline(h=yticks.minor,col='lightgrey',lty=3L);
            abline(h=yticks,col='lightgrey');
            axis(1L,xticks,parse(text=sprintf('10^%d',as.integer(log10(xticks)))));
            axis(2L,yticks,parse(text=sprintf('10^%d',as.integer(log10(yticks)))),las=1L);
            axis(4L,ymajors$y,ymajors$label,las=1L,tick=F,cex.axis=0.7,hadj=0.5);
            for (joinType in rev(joinTypes)) { ## reverse to draw full first, since it's larger and would be more obtrusive if drawn last
                x2 <- x1[x1$joinType==joinType,];
                for (solType in solTypes) {
                    if (any(!is.na(x2[[solType]]))) {
                        xy <- spline(x2$size,x2[[solType]],xout=10^(seq(log10(x2$size[1L]),log10(x2$size[nrow(x2)]),len=NP)));
                        points(xy$x,xy$y,pch=pchs[[joinType]],col=cols[solType],cex=cexs[joinType],xpd=NA);
                    }; ## end if
                }; ## end for
            }; ## end for
            ## custom legend
            ## due to logarithmic skew, must do all distance calcs in inches, and convert to user coords afterward
            ## the bottom-left corner of the legend will be defined in normalized figure coords, although we can convert to inches immediately
            leg.cex <- 0.7;
            leg.x.in <- grconvertX(0.275,'nfc','in');
            leg.y.in <- grconvertY(0.6,'nfc','in');
            leg.x.user <- grconvertX(leg.x.in,'in');
            leg.y.user <- grconvertY(leg.y.in,'in');
            leg.outpad.w.in <- 0.1;
            leg.outpad.h.in <- 0.1;
            leg.midpad.w.in <- 0.1;
            leg.midpad.h.in <- 0.1;
            leg.sol.w.in <- max(strwidth(solTypes,'in',leg.cex));
            leg.sol.h.in <- max(strheight(solTypes,'in',leg.cex))*1.5; ## multiplication factor for greater line height
            leg.join.w.in <- max(strheight(joinTypes,'in',leg.cex))*1.5; ## ditto
            leg.join.h.in <- max(strwidth(joinTypes,'in',leg.cex));
            leg.main.w.in <- leg.join.w.in*length(joinTypes);
            leg.main.h.in <- leg.sol.h.in*length(solTypes);
            leg.x2.user <- grconvertX(leg.x.in+leg.outpad.w.in*2+leg.main.w.in+leg.midpad.w.in+leg.sol.w.in,'in');
            leg.y2.user <- grconvertY(leg.y.in+leg.outpad.h.in*2+leg.main.h.in+leg.midpad.h.in+leg.join.h.in,'in');
            leg.cols.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.join.w.in*(0.5+seq(0L,length(joinTypes)-1L)),'in');
            leg.lines.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in-leg.sol.h.in*(0.5+seq(0L,length(solTypes)-1L)),'in');
            leg.sol.x.user <- grconvertX(leg.x.in+leg.outpad.w.in+leg.main.w.in+leg.midpad.w.in,'in');
            leg.join.y.user <- grconvertY(leg.y.in+leg.outpad.h.in+leg.main.h.in+leg.midpad.h.in,'in');
            rect(leg.x.user,leg.y.user,leg.x2.user,leg.y2.user,col='white');
            text(leg.sol.x.user,leg.lines.y.user,solTypes[ord],cex=leg.cex,pos=4L,offset=0);
            text(leg.cols.x.user,leg.join.y.user,joinTypes,cex=leg.cex,pos=4L,offset=0,srt=90); ## srt rotation applies *after* pos/offset positioning
            for (i in seq_along(joinTypes)) {
                joinType <- joinTypes[i];
                points(rep(leg.cols.x.user[i],length(solTypes)),ifelse(colSums(!is.na(x1[x1$joinType==joinType,solTypes[ord]]))==0L,NA,leg.lines.y.user),pch=pchs[[joinType]],col=cols[solTypes[ord]]);
            }; ## end for
            title(titleFunc(overlap));
            readline(sprintf('overlap %.02f',overlap));
        }; ## end for
    }; ## end plotRes()

    titleFunc <- function(overlap) sprintf('R merge solutions: single-column integer key, 0..1:0..1 cardinality, %d%% overlap',as.integer(overlap*100));
    plotRes(res,titleFunc,T);

Here's a second large-scale benchmark that's more heavy-duty, with respect to the number and types of key columns, as well as cardinality. For this benchmark I use three key columns: one character, one integer, and one logical, with no restrictions on cardinality (that is, 0..*:0..*). (In general it's not advisable to define key columns with double or complex values due to floating-point comparison complications, and basically no one ever uses the raw type, much less for key columns, so I haven't included those types in the key columns. Also, for information's sake, I initially tried to use four key columns by including a POSIXct key column, but the POSIXct type didn't play well with the sqldf.indexed solution for some reason, possibly due to floating-point comparison anomalies, so I removed it.)

    makeArgSpecs.assortedKey.optionalManyToMany <- function(size,overlap,uniquePct=75) {

        ## number of unique keys in df1
        u1Size <- as.integer(size*uniquePct/100);

        ## (roughly) divide u1Size into bases, so we can use expand.grid() to produce the required number of unique key values with repetitions within individual key columns
        ## use ceiling() to ensure we cover u1Size; will truncate afterward
        u1SizePerKeyColumn <- as.integer(ceiling(u1Size^(1/3)));

        ## generate the unique key values for df1
        keys1 <- expand.grid(stringsAsFactors=F,
            idCharacter=replicate(u1SizePerKeyColumn,paste(collapse='',sample(letters,sample(4:12,1L),T))),
            idInteger=sample(u1SizePerKeyColumn),
            idLogical=sample(c(F,T),u1SizePerKeyColumn,T)
            ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+sample(u1SizePerKeyColumn)
        )[seq_len(u1Size),];

        ## rbind some repetitions of the unique keys; this will prepare one side of the many-to-many relationship
        ## also scramble the order afterward
        keys1 <- rbind(keys1,keys1[sample(nrow(keys1),size-u1Size,T),])[sample(size),];

        ## common and unilateral key counts
        com <- as.integer(size*overlap);
        uni <- size-com;

        ## generate some unilateral keys for df2 by synthesizing outside of the idInteger range of df1
        keys2 <- data.frame(stringsAsFactors=F,
            idCharacter=replicate(uni,paste(collapse='',sample(letters,sample(4:12,1L),T))),
            idInteger=u1SizePerKeyColumn+sample(uni),
            idLogical=sample(c(F,T),uni,T)
            ##idPOSIXct=as.POSIXct('2016-01-01 00:00:00','UTC')+u1SizePerKeyColumn+sample(uni)
        );

        ## rbind random keys from df1; this will complete the many-to-many relationship
        ## also scramble the order afterward
        keys2 <- rbind(keys2,keys1[sample(nrow(keys1),com,T),])[sample(size),];

        ##keyNames <- c('idCharacter','idInteger','idLogical','idPOSIXct');
        keyNames <- c('idCharacter','idInteger','idLogical');
        ## note: was going to use raw and complex type for two of the non-key columns, but data.table doesn't seem to fully support them
        argSpecs <- list(
            default=list(copySpec=1:2,args=list(
                df1 <- cbind(stringsAsFactors=F,keys1,y1=sample(c(F,T),size,T),y2=sample(size),y3=rnorm(size),y4=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
                df2 <- cbind(stringsAsFactors=F,keys2,y5=sample(c(F,T),size,T),y6=sample(size),y7=rnorm(size),y8=replicate(size,paste(collapse='',sample(letters,sample(4:12,1L),T)))),
                keyNames
            )),
            data.table.unkeyed=list(copySpec=1:2,args=list(
                as.data.table(df1),
                as.data.table(df2),
                keyNames
            )),
            data.table.keyed=list(copySpec=1:2,args=list(
                setkeyv(as.data.table(df1),keyNames),
                setkeyv(as.data.table(df2),keyNames)
            ))
        );
        ## prepare sqldf
        initSqldf();
        sqldf(paste0('create index df1_key on df1(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df1
        sqldf(paste0('create index df2_key on df2(',paste(collapse=',',keyNames),');')); ## upload and create an sqlite index on df2

        argSpecs;

    }; ## end makeArgSpecs.assortedKey.optionalManyToMany()

    sizes <- c(1e1L,1e3L,1e5L); ## 1e5L instead of 1e6L to respect more heavy-duty inputs
    overlaps <- c(0.99,0.5,0.01);
    solTypes <- setdiff(getSolTypes(),'in.place');
    system.time({ res <- testGrid(makeArgSpecs.assortedKey.optionalManyToMany,sizes,overlaps,solTypes); });
    ##     user   system  elapsed
    ## 38895.50   784.19 39745.53

The resulting plots, using the same plotting code given above:

    titleFunc <- function(overlap) sprintf('R merge solutions: character/integer/logical key, 0..*:0..* cardinality, %d%% overlap',as.integer(overlap*100));
    plotRes(res,titleFunc,F);

User · Answer

For an inner join on all columns, you could also use fintersect from the data.table package or intersect from the dplyr package as an alternative to merge without specifying the by-columns. This returns the rows that are equal between the two data frames:

    merge(df1, df2)
    #   V1 V2
    # 1  B  2
    # 2  C  3

    dplyr::intersect(df1, df2)
    #   V1 V2
    # 1  B  2
    # 2  C  3

    data.table::fintersect(setDT(df1), setDT(df2))
    #    V1 V2
    # 1:  B  2
    # 2:  C  3

Example data:

    df1 <- data.frame(V1 = LETTERS[1:4], V2 = 1:4)
    df2 <- data.frame(V1 = LETTERS[2:3], V2 = 2:3)

User · Answer

dplyr has implemented all of these joins, including outer_join, since 0.4. It is worth noting, however, that the first few releases prior to 0.4 did not offer outer_join, and as a result a lot of really bad, hacky workaround code floated around for quite a while afterwards (you can still find such code in SO and Kaggle answers and on GitHub from that period). Hence this answer still serves a useful purpose.

Join-related release highlights:

v0.5 (6/2016)
 - Handling for POSIXct type, timezones, duplicates, different factor levels. Better errors and warnings.
 - New suffix argument to control what suffix duplicated variable names receive (#1296)

v0.4.0 (1/2015)
 - Implement right join and outer join (#96)
 - Mutating joins, which add new variables to one table from matching rows in another. Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table.

v0.3 (10/2014)
 - Can now left_join by different variables in each table: df1 %>% left_join(df2, c("var1" = "var2"))

v0.2 (5/2014)
 - *_join() no longer reorders column names (#324)

v0.1.3 (4/2014)
 - has inner_join, left_join, semi_join, anti_join
 - outer_join not implemented yet, fallback is to use base::merge() (or plyr::join())
 - didn't yet implement right_join and outer_join
 - Hadley mentioning other advantages here
 - one minor feature merge currently has that dplyr doesn't is the ability to have separate by.x, by.y columns as e.g. Python pandas does

Workarounds per hadley's comments in that issue:
 - right_join(x, y) is the same as left_join(y, x) in terms of the rows, just the columns will be in a different order. Easily worked around with select(new_column_order)
 - outer_join is basically union(left_join(x, y), right_join(x, y)), i.e. preserve all rows in both data frames.
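The two workarounds quoted above can be checked without dplyr. The following is a minimal sketch that swaps in base R's merge() for the dplyr verbs (all.x = TRUE standing in for left_join, all.y = TRUE for right_join). The data frames and variable names are illustrative assumptions and are not taken from the dplyr issue.

```r
# Illustrative data (an assumption for this sketch): key 7 exists only in df2.
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2L, 4L, 7L),
                  State = c("Alabama", "Alabama", "Ohio"),
                  stringsAsFactors = FALSE)

# Workaround 1: a right join is a left join with the arguments swapped,
# followed by restoring the column order.
right   <- merge(df1, df2, by = "CustomerId", all.y = TRUE)
swapped <- merge(df2, df1, by = "CustomerId", all.x = TRUE)[, names(right)]

# Workaround 2: a full outer join is the union of the left and right joins.
left       <- merge(df1, df2, by = "CustomerId", all.x = TRUE)
union_full <- unique(rbind(left, right))
union_full <- union_full[order(union_full$CustomerId), ]

# Reference result from merge() itself, sorted the same way.
full <- merge(df1, df2, by = "CustomerId", all = TRUE)
full <- full[order(full$CustomerId), ]
```

Comparing swapped with right, and union_full with full, column by column is a quick way to convince yourself that the workarounds return the same rows.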

User · Answer

There are some good examples of doing this over at the R Wiki. I'll steal a couple here:

Merge Method

Since your keys are named the same, the short way to do an inner join is merge():

    merge(df1, df2)

A full outer join (all records from both tables) can be created with the "all" keyword:

    merge(df1, df2, all = TRUE)

A left outer join of df1 and df2:

    merge(df1, df2, all.x = TRUE)

A right outer join of df1 and df2:

    merge(df1, df2, all.y = TRUE)

You can flip 'em, slap 'em and rub 'em down to get the other two outer joins you asked about.

Subscript Method

A left outer join with df1 on the left using a subscript method would be:

    df1[, "State"] <- df2[match(df1[, "CustomerId"], df2[, "CustomerId"]), "State"]

The other combinations of outer joins can be created by mungling the left outer join subscript example. (Yeah, I know that's the equivalent of saying "I'll leave it as an exercise for the reader".)
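As a quick sanity check of the four merge() calls, here is a small self-contained sketch using the data frames from the question. The row counts in the comments follow from every key in df2 (2, 4, 6) also being present in df1; the result variable names are just illustrative.

```r
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)),
                  stringsAsFactors = FALSE)

inner_j <- merge(df1, df2)                 # 3 rows: only keys 2, 4, 6
full_j  <- merge(df1, df2, all = TRUE)     # 6 rows: every key from either table
left_j  <- merge(df1, df2, all.x = TRUE)   # 6 rows: every row of df1
right_j <- merge(df1, df2, all.y = TRUE)   # 3 rows: every row of df2

# Rows of df1 without a match get NA in the columns that come from df2.
unmatched <- sum(is.na(left_j$State))      # 3 (keys 1, 3, 5)
```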

User · Answer

Using the merge function, we can select the variables of the left table or the right table, in the same way as with the select statement we are all familiar with in SQL (e.g. Select a.* ... or Select b.* from ...). We have to add extra code that subsets the newly joined table.

SQL: select a.* from df1 a inner join df2 b on a.CustomerId = b.CustomerId
R:   merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[, names(df1)]

In the same way:

SQL: select b.* from df1 a inner join df2 b on a.CustomerId = b.CustomerId
R:   merge(df1, df2, by.x = "CustomerId", by.y = "CustomerId")[, names(df2)]
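One caveat with subsetting by names(df1): when the two tables share a non-key column name, merge() renames both copies with the suffixes ".x" and ".y", so the original names no longer exist in the result. The sketch below illustrates this with two made-up data frames (a, b and the Note column are assumptions for the example) and selects the left table's columns by their suffixed names instead.

```r
# Hypothetical tables that share a non-key column called "Note".
a <- data.frame(CustomerId = 1:3, Note = c("x", "y", "z"),
                stringsAsFactors = FALSE)
b <- data.frame(CustomerId = 2:4, Note = c("p", "q", "r"),
                stringsAsFactors = FALSE)

joined <- merge(a, b, by = "CustomerId")
joined_names <- names(joined)   # "CustomerId" "Note.x" "Note.y"

# Equivalent of "select a.*": pick the key plus the ".x" copies,
# then restore the original column names.
a_cols <- joined[, c("CustomerId", "Note.x")]
names(a_cols) <- names(a)
```

The suffixes can be changed through merge()'s suffixes argument if ".x" and ".y" are not descriptive enough.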

User · Answer

There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data frames):

    library(data.table)

    dt1 <- data.table(df1, key = "CustomerId")
    dt2 <- data.table(df2, key = "CustomerId")

    joined.dt1.dt.2 <- dt1[dt2]

Note that dt1[dt2] keeps every row of dt2, so it matches the inner join here only because every CustomerId in df2 also appears in df1; pass the nomatch argument (dt1[dt2, nomatch = NULL]) to drop unmatched rows in general.

merge also works on data.tables (as it is generic and calls merge.data.table):

    merge(dt1, dt2)

data.table documented on stackoverflow:
 - How to do a data.table merge operation
 - Translating SQL joins on foreign keys to R data.table syntax
 - Efficient alternatives to merge for larger data.frames R
 - How to do a basic left outer join with data.table in R?

Yet another option is the join function found in the plyr package:

    library(plyr)

    join(df1, df2, type = "inner")

    #   CustomerId Product   State
    # 1          2 Toaster Alabama
    # 2          4   Radio Alabama
    # 3          6   Radio    Ohio

Options for type: inner, left, right, full.

From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.
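To make the ordering remark concrete without installing plyr, here is a small base R sketch. It shows merge() re-sorting the result on the key and, as a stand-in for an order-preserving left join, a match() lookup that leaves the rows of x where they were. The data frames x and y are illustrative assumptions.

```r
# x is deliberately not sorted by CustomerId.
x <- data.frame(CustomerId = c(6, 2, 4),
                Product = c("Radio", "Toaster", "Radio"),
                stringsAsFactors = FALSE)
y <- data.frame(CustomerId = c(2, 4, 6),
                State = c("Alabama", "Alabama", "Ohio"),
                stringsAsFactors = FALSE)

# merge() sorts on the key by default (sort = TRUE): keys come back as 2, 4, 6.
merged <- merge(x, y, by = "CustomerId")

# Order-preserving alternative: look up each key of x in y and copy the value.
kept <- x
kept$State <- y$State[match(x$CustomerId, y$CustomerId)]   # keys stay 6, 2, 4
```

The match() lookup only takes the first matching row of y per key, so it is a substitute for a left join only when the keys in y are unique.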

User · Answer

Update on data.table methods for joining datasets. See the examples below for each type of join. There are two methods: one is [.data.table, when passing the second data.table as the first argument to subset; the other is the merge function, which dispatches to the fast data.table method.

    df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
    df2 = data.frame(CustomerId = c(2L, 4L, 7L), State = c(rep("Alabama", 2), rep("Ohio", 1))) # one value changed to show full outer join

    library(data.table)

    dt1 = as.data.table(df1)
    dt2 = as.data.table(df2)
    setkey(dt1, CustomerId)
    setkey(dt2, CustomerId)
    # right outer join keyed data.tables
    dt1[dt2]

    setkey(dt1, NULL)
    setkey(dt2, NULL)
    # right outer join unkeyed data.tables - use `on` argument
    dt1[dt2, on = "CustomerId"]

    # left outer join - swap dt1 with dt2
    dt2[dt1, on = "CustomerId"]

    # inner join - use `nomatch` argument
    dt1[dt2, nomatch = NULL, on = "CustomerId"]

    # anti join - use `!` operator
    dt1[!dt2, on = "CustomerId"]

    # inner join - using merge method
    merge(dt1, dt2, by = "CustomerId")

    # full outer join
    merge(dt1, dt2, by = "CustomerId", all = TRUE)

    # see ?merge.data.table arguments for other cases

The benchmark below tests base R, sqldf, dplyr and data.table. It uses unkeyed/unindexed datasets of 50M-1 rows, with 50M-2 common values on the join column, so each scenario (inner, left, right, full) can be tested and the join is still not trivial to perform. It is the type of join that will stress join algorithms. Timings are as of sqldf 0.4.11, dplyr 0.7.8, data.table 1.12.0.

    # inner
    Unit: seconds
      expr       min        lq      mean    median        uq       max neval
      base 111.66266 111.66266 111.66266 111.66266 111.66266 111.66266     1
     sqldf 624.88388 624.88388 624.88388 624.88388 624.88388 624.88388     1
     dplyr  51.91233  51.91233  51.91233  51.91233  51.91233  51.91233     1
        DT  10.40552  10.40552  10.40552  10.40552  10.40552  10.40552     1
    # left
    Unit: seconds
      expr        min         lq       mean     median         uq        max
      base 142.782030 142.782030 142.782030 142.782030 142.782030 142.782030
     sqldf 613.917109 613.917109 613.917109 613.917109 613.917109 613.917109
     dplyr  49.711912  49.711912  49.711912  49.711912  49.711912  49.711912
        DT   9.674348   9.674348   9.674348   9.674348   9.674348   9.674348
    # right
    Unit: seconds
      expr        min         lq       mean     median         uq        max
      base 122.366301 122.366301 122.366301 122.366301 122.366301 122.366301
     sqldf 611.119157 611.119157 611.119157 611.119157 611.119157 611.119157
     dplyr  50.384841  50.384841  50.384841  50.384841  50.384841  50.384841
        DT   9.899145   9.899145   9.899145   9.899145   9.899145   9.899145
    # full
    Unit: seconds
      expr       min        lq      mean    median        uq       max neval
      base 141.79464 141.79464 141.79464 141.79464 141.79464 141.79464     1
     dplyr  94.66436  94.66436  94.66436  94.66436  94.66436  94.66436     1
        DT  21.62573  21.62573  21.62573  21.62573  21.62573  21.62573     1

Be aware there are other types of joins you can perform using data.table:
 - update on join - if you want to look up values from another table into your main table
 - aggregate on join - if you want to aggregate on the key you are joining on, you do not have to materialize all join results
 - overlapping join - if you want to merge by ranges
 - rolling join - if you want the merge to be able to match values from preceding/following rows by rolling them forward or backward
 - non-equi join - if your join condition is non-equal

Code to reproduce:

    library(microbenchmark)
    library(sqldf)
    library(dplyr)
    library(data.table)
    sapply(c("sqldf", "dplyr", "data.table"), packageVersion, simplify = FALSE)

    n = 5e7
    set.seed(108)
    df1 = data.frame(x = sample(n, n-1L), y1 = rnorm(n-1L))
    df2 = data.frame(x = sample(n, n-1L), y2 = rnorm(n-1L))
    dt1 = as.data.table(df1)
    dt2 = as.data.table(df2)

    mb = list()
    # inner join
    microbenchmark(times = 1L,
                   base = merge(df1, df2, by = "x"),
                   sqldf = sqldf("SELECT * FROM df1 INNER JOIN df2 ON df1.x = df2.x"),
                   dplyr = inner_join(df1, df2, by = "x"),
                   DT = dt1[dt2, nomatch = NULL, on = "x"]) -> mb$inner

    # left outer join
    microbenchmark(times = 1L,
                   base = merge(df1, df2, by = "x", all.x = TRUE),
                   sqldf = sqldf("SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.x = df2.x"),
                   dplyr = left_join(df1, df2, by = c("x" = "x")),
                   DT = dt2[dt1, on = "x"]) -> mb$left

    # right outer join
    microbenchmark(times = 1L,
                   base = merge(df1, df2, by = "x", all.y = TRUE),
                   sqldf = sqldf("SELECT * FROM df2 LEFT OUTER JOIN df1 ON df2.x = df1.x"),
                   dplyr = right_join(df1, df2, by = "x"),
                   DT = dt1[dt2, on = "x"]) -> mb$right

    # full outer join
    microbenchmark(times = 1L,
                   base = merge(df1, df2, by = "x", all = TRUE),
                   dplyr = full_join(df1, df2, by = "x"),
                   DT = merge(dt1, dt2, by = "x", all = TRUE)) -> mb$full

    lapply(mb, print) -> nul
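For readers without data.table installed, the anti join has a simple base R counterpart. This is only a sketch of the same idea using %in% on the key column, with the small example tables from this answer; it is not how data.table implements the join. It also shows that, when the keys in df2 are unique, the inner join and the anti join together account for every row of df1.

```r
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(CustomerId = c(2L, 4L, 7L),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)),
                  stringsAsFactors = FALSE)

# TRUE for the rows of df1 whose key also occurs in df2.
in_df2 <- df1$CustomerId %in% df2$CustomerId

inner <- merge(df1, df2, by = "CustomerId")   # keys 2 and 4
anti  <- df1[!in_df2, ]                       # keys 1, 3, 5, 6: no partner in df2
```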

User · Answer

New in 2014:

Especially if you're also interested in data manipulation in general (including sorting, filtering, subsetting, summarizing etc.), you should definitely take a look at dplyr, which comes with a variety of functions all designed to facilitate your work specifically with data frames and certain other database types. It even offers quite an elaborate SQL interface, and even a function to convert (most) SQL code directly into R.

The four joining-related functions in the dplyr package are (to quote):

 - inner_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, and all columns from x and y
 - left_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x, and all columns from x and y
 - semi_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are matching values in y, keeping just columns from x
 - anti_join(x, y, by = NULL, copy = FALSE, ...): return all rows from x where there are not matching values in y, keeping just columns from x

It's all here in great detail.

Selecting columns can be done by select(df, "column"). If that's not SQL-ish enough for you, then there's the sql() function, into which you can enter SQL code as-is, and it will do the operation you specified just like you were writing in R all along (for more information, please refer to the dplyr/databases vignette). For example, if applied correctly, sql("SELECT * FROM hflights") will select all the columns from the "hflights" dplyr table (a "tbl").
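The difference between inner_join and semi_join in the quoted descriptions is easiest to see when y contains a key more than once. The sketch below imitates both in base R (merge() for the inner join, %in% for the semi join) rather than calling dplyr, and the tables x and y are made up for the illustration.

```r
x <- data.frame(id = c(1L, 2L), vx = c("a", "b"),
                stringsAsFactors = FALSE)
y <- data.frame(id = c(2L, 2L, 3L), vy = c("p", "q", "r"),
                stringsAsFactors = FALSE)

# Inner join: id 2 appears twice, once per matching row in y,
# and the columns of both tables are returned.
inner <- merge(x, y, by = "id")

# Semi join: each row of x is kept at most once, with only x's columns.
semi <- x[x$id %in% y$id, , drop = FALSE]
```

So a semi join acts as a filter on x, while an inner join can multiply rows of x when the key is not unique in y.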
