How to do vlookup and fill down like in Excel in R

Question

I have a dataset about 105000 rows and 30 columns  I have a categorical variable that I would like to assign it to a number  In Excel  I would probably do something with VLOOKUP and fill   How would I go about doing the same thing in R   Essentially  what I have is a HouseType variable  and I need to calculate the HouseTypeNo  Here are some sample data   HouseType HouseTypeNo Semi            1 Single          2 Row             3 Single          2 Apartment       4 Apartment       4 Row             3

User · Answer

Solution  2 of  Ben s answer is not reproducible in other more generic examples  It happens to give the correct lookup in the example because the unique HouseType in houses appear in increasing order  Try this   hous  lt - read table header   TRUE    stringsAsFactors   FALSE    text  HouseType HouseTypeNo   Semi            1   ECIIsHome       17   Single          2   Row             3   Single          2   Apartment       4   Apartment       4   Row             3    largetable  lt - data frame HouseType   as character sample unique hous HouseType   1000  replace   TRUE    stringsAsFactors   FALSE  lookup  lt - unique hous    Bens solution 2 gives  housenames  lt - as numeric 1 length unique hous HouseType    names housenames   lt - unique hous HouseType  base2  lt - data frame HouseType   largetable HouseType                      HouseTypeNo    housenames largetable HouseType      which when   unique base2 HouseTypeNo  base2 HouseType   ECIIsHome      1  2   when the correct answer is 17 from the lookup table  The correct way to do it is   hous  lt - read table header   TRUE    stringsAsFactors   FALSE    text  HouseType HouseTypeNo       Semi            1       ECIIsHome       17       Single          2       Row             3       Single          2       Apartment       4       Apartment       4       Row             3    largetable  lt - data frame HouseType   as character sample unique hous HouseType   1000  replace   TRUE    stringsAsFactors   FALSE   housenames  lt - tapply hous HouseTypeNo  hous HouseType  unique  base2  lt - data frame HouseType   largetable HouseType    HouseTypeNo    housenames largetable HouseType      Now the lookups are performed correctly  unique base2 HouseTypeNo  base2 HouseType   ECIIsHome     ECIIsHome         17   I tried to edit Bens answer but it gets rejected for reasons I cannot understand

User · Answer

You could use mapvalues   from the plyr package   Initial data   dat  lt - data frame HouseType   c  Semi    Single    Row    Single    Apartment    Apartment    Row      gt  dat   HouseType 1      Semi 2    Single 3       Row 4    Single 5 Apartment 6 Apartment 7       Row   Lookup   crosswalk table   lookup  lt - data frame type text   c  Semi    Single    Row    Apartment    type num   c 1  2  3  4    gt  lookup   type text type num 1      Semi        1 2    Single        2 3       Row        3 4 Apartment        4   Create the new variable   dat house type num  lt - plyr  mapvalues dat HouseType  from   lookup type text  to   lookup type num    Or for simple replacements you can skip creating a long lookup table and do this directly in one step   dat house type num  lt - plyr  mapvalues dat HouseType                                        from   c  Semi    Single    Row    Apartment                                          to   c 1  2  3  4     Result    gt  dat   HouseType house type num 1      Semi              1 2    Single              2 3       Row              3 4    Single              2 5 Apartment              4 6 Apartment              4 7       Row              3

User · Answer

The poster didn t ask about looking up values if exact FALSE  but I m adding this as an answer for my own reference and possibly others   If you re looking up categorical values  use the other answers   Excel s vlookup also allows you to match match approximately for numeric values with the 4th argument 1  match TRUE  I think of match TRUE like looking up values on a thermometer   The default value is FALSE  which is perfect for categorical values   If you want to match approximately  perform a lookup   R has a function called findInterval  which  as the name implies  will find the interval   bin that contains your continuous numeric value   However  let s say that you want to findInterval for several values   You could write a loop or use an apply function   However  I ve found it more efficient to take a DIY vectorized approach   Let s say that you have a grid of values indexed by x and y   grid  lt - list x   c -87 727  -87 723  -87 719  -87 715  -87 711                 y   c 41 836  41 839  41 843  41 847  41 851                 z    matrix data   c -3 428  -3 722  -3 061  -2 554  -2 362                                     -3 034  -3 925  -3 639  -3 357  -3 283                                     -0 152  -1 688  -2 765  -3 084  -2 742                                      1 973   1 193  -0 354  -1 682  -1 803                                      0 998   2 863   3 224   1 541  -0 044                             nrow   5  ncol   5      and you have some values you want to look up by x and y   df  lt - data frame x   c -87 723  -87 712  -87 726  -87 719  -87 722  -87 722                     y   c 41 84  41 842  41 844  41 849  41 838  41 842                     id   c  a    b    c    d    e    f     Here is the example visualized   contour grid  points df x  df y  pch df id  col  blue   cex 1 2      You can find the x intervals and y intervals with this type of formula   xrng  lt - range grid x  xbins  lt - length grid x  -1 yrng  lt - range grid y  ybins  lt - length grid y  -1 df ix  lt - trunc   df x - min xrng     diff xrng     xbins     1 df iy  lt - trunc   df y - min yrng     diff yrng     ybins     1   You could take it one step further and perform a  simplistic  interpolation on the z values in grid like this   df z  lt - with df   grid z cbind ix  iy                            grid z cbind ix   1  iy                           grid z cbind ix  iy   1                            grid z cbind ix   1  iy   1      4    Which gives you these values   contour grid  xlim   range c grid x  df x    ylim   range c grid y  df y    points df x  df y  pch df id  col  blue   cex 1 2  text df x    001  df y  lab round df z  2   col  blue   cex 1      df           x      y id ix iy        z   1 -87 723 41 840  a  2  2 -3 00425   2 -87 712 41 842  b  4  2 -3 11650   3 -87 726 41 844  c  1  3  0 33150   4 -87 719 41 849  d  3  4  0 68225   6 -87 722 41 838  e  2  1 -3 58675   7 -87 722 41 842  f  2  2 -3 00425   Note that ix  and iy could have also been found with a loop using findInterval  e g  here s one example for the second row  findInterval df x 2   grid x    4 findInterval df y 2   grid y    2   Which matches ix and iy in df 2   Footnote   1  The fourth argument of vlookup was previously called  match   but after they introduced the ribbon it was renamed to   range lookup

User · Answer

Starting with   houses  lt - read table text  Semi            1 Single          2 Row             3 Single          2 Apartment       4 Apartment       4 Row             3  col names c  HouseType   HouseTypeNo          you can use  as numeric factor houses HouseType         to give a unique number for each house type  You can see the result here    gt  houses2  lt - data frame houses as numeric factor houses HouseType     gt  houses2   HouseType HouseTypeNo as numeric factor houses HouseType   1      Semi           1                                    3 2    Single           2                                    4 3       Row           3                                    2 4    Single           2                                    4 5 Apartment           4                                    1 6 Apartment           4                                    1 7       Row           3                                    2       so you end up with different numbers on the rows  because the factors are ordered alphabetically  but the same pattern    EDIT  the remaining text in this answer is actually redundant  It occurred to me to check and it turned out that read table   had already made houses HouseType into a factor when it was read into the dataframe in the first place    However  you may well be better just to convert HouseType to a factor  which would give you all the same benefits as HouseTypeNo  but would be easier to interpret because the house types are named rather than numbered  e g     gt  houses3  lt - houses  gt  houses3 HouseType  lt - factor houses3 HouseType   gt  houses3   HouseType HouseTypeNo 1      Semi           1 2    Single           2 3       Row           3 4    Single           2 5 Apartment           4 6 Apartment           4 7       Row           3  gt  levels houses3 HouseType   1   Apartment   Row         Semi        Single

User · Answer

Using merge is different from lookup in Excel as it has potential to duplicate  multiply  your data if primary key constraint is not enforced in lookup table or reduce the number of records if you are not using all x   T   To make sure you don t get into trouble with that and lookup safely  I suggest two strategies   First one is to make a check on a number of duplicated rows in lookup key   safeLookup  lt - function data  lookup  by  select   setdiff colnames lookup   by         Merges data to lookup making sure that the number of rows does not change    stopifnot sum duplicated lookup   by       0    res  lt - merge data  lookup   c by  select    by   by  all x   T    return  res      This will force you to de-dupe lookup dataset before using it   baseSafe  lt - safeLookup largetable  house ids  by    HouseType     Error  sum duplicated lookup   by       0 is not TRUE   baseSafe lt - safeLookup largetable  unique house ids   by    HouseType   head baseSafe    HouseType HouseTypeNo   1 Apartment           4   2 Apartment           4         Second option is to reproduce Excel behaviour by taking the first matching value from the lookup dataset   firstLookup  lt - function data  lookup  by  select   setdiff colnames lookup   by         Merges data to lookup using first row per unique combination in by    unique lookup  lt - lookup  duplicated lookup   by        res  lt - merge data  unique lookup   c by  select    by   by  all x   T    return  res     baseFirst  lt - firstLookup largetable  house ids  by    HouseType     These functions are slightly different from lookup as they add multiple columns

User · Answer

I think you can also use match     largetable HouseTypeNo  lt - with lookup                       HouseTypeNo match largetable HouseType                                         HouseType      This still works if I scramble the order of lookup

User · Answer

If I understand your question correctly  here are four methods to do the equivalent of Excel s VLOOKUP and fill down using R     load sample data from Q hous  lt - read table header   TRUE                      stringsAsFactors   FALSE   text  HouseType HouseTypeNo Semi            1 Single          2 Row             3 Single          2 Apartment       4 Apartment       4 Row             3      create a toy large table with a  HouseType  column    but no  HouseTypeNo  column  yet  largetable  lt - data frame HouseType   as character sample unique hous HouseType   1000  replace   TRUE    stringsAsFactors   FALSE     create a lookup table to get the numbers to fill   the large table lookup  lt - unique hous    HouseType HouseTypeNo 1      Semi           1 2    Single           2 3       Row           3 5 Apartment           4   Here are four methods to fill the HouseTypeNo in the largetable using the values in the lookup table   First with merge in base     1  using base  base1  lt -  merge lookup  largetable  by    HouseType      A second method with named vectors in base     2  using base and a named vector housenames  lt - as numeric 1 length unique hous HouseType    names housenames   lt - unique hous HouseType   base2  lt - data frame HouseType   largetable HouseType                      HouseTypeNo    housenames largetable HouseType      Third  using the plyr package     3  using the plyr package library plyr  plyr1  lt - join largetable  lookup  by    HouseType     Fourth  using the sqldf package    4  using the sqldf package library sqldf  sqldf1  lt - sqldf  SELECT largetable HouseType  lookup HouseTypeNo FROM largetable INNER JOIN lookup ON largetable HouseType   lookup HouseType     If it s possible that some house types in largetable do not exist in lookup then a left join would be used   sqldf  select   from largetable left join lookup using  HouseType      Corresponding changes to the other solutions would be needed too   Is that what you wanted to do  Let me know which method you like and I ll add commentary

User · Answer

I also like using qdapTools  lookup or shorthand binary operator  l    It works identically to an Excel vlookup  but it accepts name arguments opposed to column numbers     Replicate Ben s data  hous  lt - structure list HouseType   c  Semi    Single    Row    Single         Apartment    Apartment    Row    HouseTypeNo   c 1L  2L  3L       2L  4L  4L  3L     Names   c  HouseType    HouseTypeNo         class    data frame   row names   c NA  -7L     largetable  lt - data frame HouseType   as character sample unique hous HouseType        1000  replace   TRUE    stringsAsFactors   FALSE       It s this simple  library qdapTools  largetable   1   l  hous

[r] How to do vlookup and fill down (like in Excel) in R?

Examples related to r

Examples related to lookup