Read an Excel file directly from an R script

Question

How can I read an Excel file directly into R? Or should I first export the data to a text or CSV file and import that file into R?

User · Answer

And now there is readxl:

The readxl package makes it easy to get data out of Excel and into R. Compared to the existing packages (e.g. gdata, xlsx, xlsReadWrite etc.) readxl has no external dependencies, so it's easy to install and use on all operating systems. It is designed to work with tabular data stored in a single sheet.

readxl is built on top of the libxls C library, which abstracts away many of the complexities of the underlying binary format.

It supports both the legacy .xls format and .xlsx.

readxl is available from CRAN, or you can install it from GitHub with:

    # install.packages("devtools")
    devtools::install_github("hadley/readxl")

Usage

    library(readxl)

    # read_excel reads both xls and xlsx files
    read_excel("my-old-spreadsheet.xls")
    read_excel("my-new-spreadsheet.xlsx")

    # Specify sheet with a number or name
    read_excel("my-spreadsheet.xls", sheet = "data")
    read_excel("my-spreadsheet.xls", sheet = 2)

    # If NAs are represented by something other than blank cells,
    # set the na argument
    read_excel("my-spreadsheet.xls", na = "NA")

Note that while the description says "no external dependencies", it does require the Rcpp package, which in turn requires Rtools (for Windows) or Xcode (for OSX), which are dependencies external to R. Many people have them installed for other reasons, though.
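Since the usage above reads one sheet per call, a small loop can collect every sheet of a workbook. This is only a sketch: `read_all_sheets` and the file name are hypothetical, and it assumes readxl is installed (`excel_sheets()` and `read_excel()` are the readxl functions used).

```r
# Hypothetical helper (not part of readxl): read every sheet of a
# workbook into a named list of data frames.
read_all_sheets <- function(path) {
  sheets <- readxl::excel_sheets(path)          # sheet names, in order
  out <- lapply(sheets, function(s) readxl::read_excel(path, sheet = s))
  names(out) <- sheets
  out
}

path <- "my-spreadsheet.xlsx"  # placeholder file name
# Only runs when readxl is installed and the file actually exists.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  all_sheets <- read_all_sheets(path)
  str(all_sheets, max.level = 1)
}
```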

User · Answer

Another solution is the xlsReadWrite package, which doesn't require additional installs but does require you to download the additional shlib before you use it the first time:

    require(xlsReadWrite)
    xls.getshlib()

Forgetting this can cause utter frustration. Been there and all that...

On a side note: you might want to consider converting to a text-based format (e.g. csv) and reading in from there. This is for a number of reasons:

- Whatever your solution (RODBC, gdata, xlsReadWrite), some strange things can happen when your data gets converted. Dates especially can be rather cumbersome. The HFWutils package has some tools to deal with Excel dates (per @Ben Bolker's comment).
- If you have large sheets, reading in text files is faster than reading in from Excel.
- For .xls and .xlsx files, different solutions might be necessary. E.g. the xlsReadWrite package currently does not support .xlsx AFAIK. gdata requires you to install additional Perl libraries for .xlsx support. The xlsx package can handle extensions of the same name.
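To illustrate why dates are cumbersome: Excel stores a date as a serial day count, and a reader that does not convert it hands you a bare number. A minimal base-R sketch, assuming the workbook uses Excel's default 1900 date system (origin 1899-12-30 in R terms); `excel_to_date` is a hypothetical helper name.

```r
# Hypothetical helper: convert Excel serial day numbers to R Dates.
# Assumes the 1900 date system (the default for Excel on Windows).
excel_to_date <- function(serial) {
  as.Date(serial, origin = "1899-12-30")
}

serials <- c(42005, 42156)   # raw numbers as they might arrive from a sheet
dates <- excel_to_date(serials)
print(dates)
```

Workbooks saved with the 1904 date system need a different origin, so it is worth checking a known date after conversion.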

User · Answer

Given the proliferation of different ways to read an Excel file in R and the plethora of answers here, I thought I'd try to shed some light on which of the options mentioned here perform the best (in a few simple situations).

I myself have been using xlsx since I started using R, out of inertia if nothing else, and I recently noticed there doesn't seem to be any objective information about which package works better.

Any benchmarking exercise is fraught with difficulties, as some packages are sure to handle certain situations better than others, and a waterfall of other caveats.

That said, I'm using a (reproducible) data set that I think is in a pretty common format (8 string fields, 3 numeric, 1 integer, 3 dates):

    # NN is the number of rows (1000L or 25000L in the benchmarks below)
    set.seed(51423)
    data.frame(
      str1 = sample(sprintf("%010d", 1:NN)), # ID field 1
      str2 = sample(sprintf("%09d", 1:NN)),  # ID field 2
      # varying length string field--think names/addresses, etc.
      str3 =
        replicate(NN, paste0(sample(LETTERS, sample(10:30, 1L), TRUE),
                             collapse = "")),
      # factor-like string field with 50 "levels"
      str4 = sprintf("%05d", sample(sample(1e5, 50L), NN, TRUE)),
      # factor-like string field with 17 levels, varying length
      str5 =
        sample(replicate(17L, paste0(sample(LETTERS, sample(15:25, 1L), TRUE),
                                     collapse = "")), NN, TRUE),
      # lognormally distributed numeric
      num1 = round(exp(rnorm(NN, mean = 6.5, sd = 1.5)), 2L),
      # 3 binary strings
      str6 = sample(c("Y", "N"), NN, TRUE),
      str7 = sample(c("M", "F"), NN, TRUE),
      str8 = sample(c("B", "W"), NN, TRUE),
      # right-skewed integer
      int1 = ceiling(rexp(NN)),
      # dates by month
      dat1 =
        sample(seq(from = as.Date("2005-12-31"),
                   to = as.Date("2015-12-31"), by = "month"),
               NN, TRUE),
      dat2 =
        sample(seq(from = as.Date("2005-12-31"),
                   to = as.Date("2015-12-31"), by = "month"),
               NN, TRUE),
      num2 = round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L),
      # date by day
      dat3 =
        sample(seq(from = as.Date("2015-06-01"),
                   to = as.Date("2015-07-15"), by = "day"),
               NN, TRUE),
      # lognormal numeric that can be positive or negative
      num3 =
        (-1) ^ sample(2, NN, TRUE) * round(exp(rnorm(NN, mean = 6, sd = 1.5)), 2L)
    )

I then wrote this to csv, opened it in LibreOffice and saved it as an .xlsx file, then benchmarked 4 of the packages mentioned in this thread: xlsx, openxlsx, readxl, and gdata, using the default options (I also tried versions with and without specified column types, but this didn't change the rankings).

I'm excluding RODBC because I'm on Linux; XLConnect because it seems its primary purpose is not reading in single Excel sheets but importing entire Excel workbooks, so to put its horse in the race on only its reading capabilities seems unfair; and xlsReadWrite because it is no longer compatible with my version of R (it seems to have been phased out).

I then ran benchmarks with NN=1000L and NN=25000L (resetting the seed before each declaration of the data.frame above) to allow for differences with respect to Excel file size. gc is primarily for xlsx, which I've found at times can create memory clogs. Without further ado, here are the results I found:

1,000-Row Excel File

    library(microbenchmark)
    # fl is the path to the .xlsx file
    benchmark1k <-
      microbenchmark(times = 100L,
                     xlsx = {xlsx::read.xlsx2(fl, sheetIndex = 1); invisible(gc())},
                     openxlsx = {openxlsx::read.xlsx(fl); invisible(gc())},
                     readxl = {readxl::read_excel(fl); invisible(gc())},
                     gdata = {gdata::read.xls(fl); invisible(gc())})

    # Unit: milliseconds
    #      expr       min        lq      mean    median        uq       max neval
    #      xlsx  194.1958  199.2662  214.1512  201.9063  212.7563  354.0327   100
    #  openxlsx  142.2074  142.9028  151.9127  143.7239  148.0940  255.0124   100
    #    readxl  122.0238  122.8448  132.4021  123.6964  130.2881  214.5138   100
    #     gdata 2004.4745 2042.0732 2087.8724 2062.5259 2116.7795 2425.6345   100

So readxl is the winner, with openxlsx competitive and gdata a clear loser. Taking each measure relative to the column minimum:

    #       expr   min    lq  mean median    uq   max
    # 1     xlsx  1.59  1.62  1.62   1.63  1.63  1.65
    # 2 openxlsx  1.17  1.16  1.15   1.16  1.14  1.19
    # 3   readxl  1.00  1.00  1.00   1.00  1.00  1.00
    # 4    gdata 16.43 16.62 15.77  16.67 16.25 11.31

We see my own favorite, xlsx, is 60% slower than readxl.

25,000-Row Excel File

Due to the amount of time it takes, I only did 20 repetitions on the larger file; otherwise the commands were identical. Here's the raw data:

    # Unit: milliseconds
    #      expr        min         lq       mean     median         uq        max neval
    #      xlsx  4451.9553  4539.4599  4738.6366  4762.1768  4941.2331  5091.0057    20
    #  openxlsx   962.1579   981.0613   988.5006   986.1091   992.6017  1040.4158    20
    #    readxl   341.0006   344.8904   347.0779   346.4518   348.9273   360.1808    20
    #     gdata 43860.4013 44375.6340 44848.7797 44991.2208 45251.4441 45652.0826    20

Here's the relative data:

    #       expr    min     lq   mean median     uq    max
    # 1     xlsx  13.06  13.16  13.65  13.75  14.16  14.13
    # 2 openxlsx   2.82   2.84   2.85   2.85   2.84   2.89
    # 3   readxl   1.00   1.00   1.00   1.00   1.00   1.00
    # 4    gdata 128.62 128.67 129.22 129.86 129.69 126.75

So readxl is the clear winner when it comes to speed. gdata better have something else going for it, as it's painfully slow in reading Excel files, and this problem is only exacerbated for larger tables.

Two draws of openxlsx are 1) its extensive other methods (readxl is designed to do only one thing, which is probably part of why it's so fast), especially its write.xlsx function, and 2) (more of a drawback for readxl) the col_types argument in readxl only (as of this writing) accepts some nonstandard R: "text" instead of "character" and "date" instead of "Date".
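The "relative to the column minimum" step is plain division. As a sketch, here it is in base R for the median column of the 1,000-row run; the numbers are copied from the table above, and the `timings` data frame is only an illustrative container.

```r
# Median timings (milliseconds) from the 1,000-row benchmark above.
timings <- data.frame(
  expr   = c("xlsx", "openxlsx", "readxl", "gdata"),
  median = c(201.9063, 143.7239, 123.6964, 2062.5259),
  stringsAsFactors = FALSE
)

# Divide each timing by the fastest one, so the winner is exactly 1.00.
timings$relative <- round(timings$median / min(timings$median), 2)
print(timings)
```

Applying the same division to every column reproduces the full relative table.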

User · Answer

Just gave the package openxlsx a try today. It worked really well (and fast).

http://cran.r-project.org/web/packages/openxlsx/index.html

User · Answer

Expanding on the answer provided by @Mikko, you can use a neat trick to speed things up without having to "know" your column classes ahead of time. Simply use read.xlsx to grab a limited number of records to determine the classes, and then follow it up with read.xlsx2.

Example

    library(xlsx)

    # just the first 50 rows should do...
    df.temp <- read.xlsx("filename.xlsx", 1, startRow = 1, endRow = 50)
    # reuse the modes detected in the sample as column classes for the fast reader
    df.real <- read.xlsx2("filename.xlsx", 1,
                          colClasses = as.vector(sapply(df.temp, mode)))
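One caveat with this trick is worth checking on your own data: `mode()` reports the storage mode, so a Date column comes back as "numeric". The base-R sketch below uses a made-up sample data frame (no Excel file involved) to compare `mode()` with the first element of `class()`.

```r
# Made-up sample standing in for the first rows read from a sheet.
df_temp <- data.frame(
  id     = c("a", "b"),
  amount = c(1.5, 2.0),
  day    = as.Date(c("2015-06-01", "2015-06-02")),
  stringsAsFactors = FALSE
)

modes   <- sapply(df_temp, mode)                          # storage modes
classes <- sapply(df_temp, function(col) class(col)[1])   # first class name

print(modes)
print(classes)
```

If the sheet contains dates, passing the class names instead of the modes may be the safer choice.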

User · Answer

I've had good luck with XLConnect: http://cran.r-project.org/web/packages/XLConnect/index.html

User · Answer

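If the file is really tab-delimited text saved with an .xls extension, it can be read directly into R with read.table (this does not work on a genuine binary .xls or an .xlsx workbook):

    my_data <- read.table(file = "xxxxxx.xls", sep = "\t", header = TRUE)

Reading xls and xlsx files using the readxl package:

    library("readxl")
    my_data <- read_excel("xxxxx.xls")
    my_data <- read_excel("xxxxx.xlsx")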
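What `read.table` with `sep = "\t"` actually parses is tab-delimited text. A small base-R sketch with inline text instead of a file (the column names and values are invented for illustration):

```r
# Tab-delimited text, as Excel would produce with "Save as: Text (Tab delimited)".
tab_text <- "id\tname\tvalue\n1\talpha\t1.5\n2\tbeta\t2.5\n"

# The text= argument lets read.table parse a string instead of a file.
my_data <- read.table(text = tab_text, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)
str(my_data)
```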

User · Answer

As noted above in many of the other answers, there are many good packages that connect to the XLS/X file and get the data in a reasonable way. However, you should be warned that under no circumstances should you use the clipboard (or a .csv file) to retrieve data from Excel.

To see why, enter =1/3 into a cell in Excel. Now reduce the number of decimal places visible to you to two. Then copy and paste the data into R. Now save the CSV. You'll notice that in both cases Excel has helpfully kept only the data that was visible to you through the interface, and you've lost all of the precision in your actual source data.
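The effect can be imitated in base R without Excel. This sketch assumes the export writes the cell exactly as displayed (two decimals) and then reads that text back; the variable names are illustrative.

```r
true_value <- 1 / 3

# What a two-decimal display would write to the clipboard or CSV.
exported <- formatC(true_value, format = "f", digits = 2)

# Read the "CSV" back in and compare with the original value.
read_back <- read.csv(text = paste0("x\n", exported, "\n"))$x
lost <- abs(true_value - read_back)

print(exported)
print(lost)
```

The difference is small here, but it is silent, and it compounds in sums or ratios computed from the imported column.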

User · Answer

Let me reiterate what @Chase recommended: use XLConnect.

The reasons for using XLConnect are, in my opinion:

1. Cross platform. XLConnect is written in Java and, thus, will run on Win, Linux and Mac with no change to your R code (except possibly path strings).
2. Nothing else to load. Just install XLConnect and get on with life.
3. You only mentioned reading Excel files, but XLConnect will also write Excel files, including changing cell formatting. And it will do this from Linux or Mac, not just Win.

XLConnect is somewhat new compared to other solutions, so it is less frequently mentioned in blog posts and reference docs. For me it's been very useful.
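On the "path strings" caveat: building paths with base R's `file.path()` keeps the same script usable on every platform. The sketch below also shows a guarded XLConnect read; the directory and file names are placeholders, and the read only runs if XLConnect is installed and the file exists.

```r
# file.path() joins components with "/", which R accepts on Windows too.
path <- file.path("data", "my-spreadsheet.xlsx")   # placeholder location
print(path)

if (requireNamespace("XLConnect", quietly = TRUE) && file.exists(path)) {
  wb  <- XLConnect::loadWorkbook(path)             # open the workbook
  dat <- XLConnect::readWorksheet(wb, sheet = 1)   # first sheet as a data.frame
  str(dat)
}
```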

User · Answer

    library(RODBC)
    file.name <- "file.xls"
    sheet.name <- "Sheet Name"

    ## Connect to Excel File, Pull and Format Data
    ## (odbcConnectExcel is only available on Windows)
    excel.connect <- odbcConnectExcel(file.name)
    dat <- sqlFetch(excel.connect, sheet.name, na.strings = c("", "-"))
    odbcClose(excel.connect)

Personally, I like RODBC and can recommend it.

User · Answer

EDIT 2015-October: As others have commented here, the openxlsx and readxl packages are by far faster than the xlsx package and actually manage to open larger Excel files (1500 rows & 120 columns). @MichaelChirico demonstrates that readxl is better when speed is preferred, and openxlsx replaces the functionality provided by the xlsx package. If you are looking for a package to read, write, and modify Excel files in 2015, pick openxlsx instead of xlsx.

Pre-2015: I have used the xlsx package. It changed my workflow with Excel and R. No more annoying pop-ups asking if I am sure that I want to save my Excel sheet in .txt format. The package also writes Excel files.

However, I find the read.xlsx function slow when opening large Excel files. The read.xlsx2 function is considerably faster, but does not guess the vector class of data.frame columns. You have to use the colClasses argument to specify the desired column classes if you use read.xlsx2. Here is a practical example:

read.xlsx("filename.xlsx", 1) reads your file and makes the data.frame column classes nearly useful, but is very slow for large data sets. It also works for .xls files.

read.xlsx2("filename.xlsx", 1) is faster, but you will have to define column classes manually. A shortcut is to run the command twice (see the example below). The character specification converts your columns to factors. Use the Date and POSIXct options for time.

    library(xlsx)

    # A function to see column numbers
    coln <- function(x) {
      y <- rbind(seq(1, ncol(x)))
      colnames(y) <- colnames(x)
      rownames(y) <- "col.number"
      return(y)
    }

    data <- read.xlsx2("filename.xlsx", 1) # Open the file

    coln(data) # Check the column numbers you want to have as factors

    x <- 3 # Say you want columns 1-3 as factors, the rest numeric

    # x "character" entries plus (ncol(data) - x) "numeric" entries: one per column
    data <- read.xlsx2("filename.xlsx", 1,
                       colClasses = c(rep("character", x),
                                      rep("numeric", ncol(data) - x)))
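The colClasses vector should have one entry per column. A base-R sketch that builds and checks such a vector without touching any Excel file; `build_col_classes` is a hypothetical helper and the column counts are made up.

```r
# Hypothetical helper: the first n_char columns as "character",
# the remaining columns as "numeric".
build_col_classes <- function(n_cols, n_char) {
  stopifnot(n_char >= 0, n_char <= n_cols)
  c(rep("character", n_char), rep("numeric", n_cols - n_char))
}

col_classes <- build_col_classes(n_cols = 5, n_char = 3)
print(col_classes)
length(col_classes) == 5   # one class per column
```

The result can then be passed as the `colClasses` argument of `read.xlsx2`.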

User · Answer

Yes. See the relevant page on the R wiki. Short answer: read.xls from the gdata package works most of the time (although you need to have Perl installed on your system; this is usually already true on MacOS and Linux, but takes an extra step on Windows, i.e. see http://strawberryperl.com/). There are various caveats, and alternatives, listed on the R wiki page.

The only reason I see not to do this directly is that you may want to examine the spreadsheet to see if it has glitches (weird headers, multiple worksheets [you can only read one at a time, although you can obviously loop over them all], included plots, etc.). But for a well-formed, rectangular spreadsheet with plain numbers and character data (i.e., not comma-formatted numbers, dates, formulas with divide-by-zero errors, missing values, etc.) I generally have no problem with this process.
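One of the glitches mentioned, comma-formatted numbers, typically arrives in R as character strings. A minimal base-R sketch for cleaning such a column, assuming "," is a thousands separator and "." the decimal mark; `parse_comma_number` is a hypothetical helper.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric.
# Assumes "," separates thousands and "." is the decimal mark.
parse_comma_number <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

raw <- c("1,234.50", "12", "1,000,000")
clean <- parse_comma_number(raw)
print(clean)
```

Spreadsheets created under locales that swap the two symbols would need the opposite substitution.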
