Quickly reading very large tables as dataframes

Question

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

    datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

    df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

User · Answer

I think it is often good practice to keep larger datasets inside a database (e.g. Postgres). I don't use anything much larger than (nrow * ncol) ncell = 10M, which is pretty small, but I often find I want R to create and hold memory-intensive graphs only while I query from multiple databases. In a future of 32 GB laptops, some of these memory problems will disappear. But the approach of using a database to hold the data, and then using R's memory for the query results and graphs, may still be useful. Some advantages are:

(1) The data stays loaded in your database. You simply reconnect in pgAdmin to the databases you want when you turn your laptop back on.

(2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.

# Looking at Voter/Registrant Age by Decade

library(RPostgreSQL);library(lattice)

con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                 port="2345", host="localhost", dbname="WC2014_08_01_2014")

Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)

with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

User · Answer

A minor additional point worth mentioning. If you have a very large file, you can calculate the number of rows on the fly (if there is no header) using the following, where bedGraph is the name of your file in your working directory:

    > numRow = as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern = T))

You can then use that in either read.csv or read.table:

    > system.time((BG = read.table(bedGraph, nrows = numRow, col.names = c('chr', 'start', 'end', 'score'), colClasses = c('character', rep('integer', 3)))))
       user  system elapsed
     25.877   0.887  26.752
    > object.size(BG)
    203949432 bytes
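Where wc and sed are not available (for example on Windows), the row count can also be obtained in base R. The following is only a sketch of that alternative, not the method used above: count_lines and demo_file are hypothetical names, and the function assumes "\n" line endings with a trailing newline on the last line.

```r
# Count newline bytes in fixed-size binary chunks, so the file is never
# parsed or held in memory as a whole.
count_lines <- function(path, chunk_bytes = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  n <- 0
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break
    n <- n + sum(chunk == as.raw(10L))  # 10 is the byte value of "\n"
  }
  n
}

# Small demonstration on a temporary tab-separated file without a header.
demo_file <- tempfile(fileext = ".bedGraph")
write.table(data.frame(chr = "chr1", start = 1:5, end = 2:6, score = 10:14),
            demo_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Use the count as nrows, exactly as with the wc-based count.
num_rows <- count_lines(demo_file)
BG <- read.table(demo_file, nrows = num_rows,
                 col.names = c("chr", "start", "end", "score"),
                 colClasses = c("character", rep("integer", 3)))
```

Reading raw bytes avoids any character decoding, so the counting pass should be limited mostly by disk speed.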

User · Answer

This was previously asked on R-Help, so that's worth reviewing.

One suggestion there was to use readChar() and then do string manipulation on the result with strsplit() and substr(). You can see the logic involved in readChar is much less than read.table.

I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, which is a MapReduce framework designed for dealing with large data sets. For this, you would use the hsTableReader function. This is an example (but it has a learning curve to learn Hadoop):

    str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2"
    cat(str)
    cols = list(key='', val=0)
    con <- textConnection(str, open = "r")
    hsTableReader(con, cols, chunkSize=6, FUN=print, ignoreKey=TRUE)
    close(con)

The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the data import in parallel by segmenting the file, but most likely for large data sets that won't help since you will run into memory constraints, which is why map-reduce is a better approach.
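The readChar() and strsplit() suggestion can be sketched in base R as follows. This is a minimal illustration under assumptions, not the code from the R-Help thread: the file layout mirrors the question (url, popularity, mintime, maxtime, tab-separated, no header), and demo_file, column, and parsed are hypothetical names.

```r
# Write a tiny tab-separated file with the same layout as the question.
demo_file <- tempfile(fileext = ".tsv")
writeLines(c("a.com\t10\t1\t5", "b.org\t20\t2\t6", "c.net\t30\t3\t7"), demo_file)

# Read the whole file as one string in a single call.
raw_text <- readChar(demo_file, nchars = file.info(demo_file)$size, useBytes = TRUE)

# Split into lines (tolerating Windows line endings), then into fields.
lines  <- strsplit(raw_text, "\r?\n")[[1]]
fields <- strsplit(lines, "\t", fixed = TRUE)

# Pull out the i-th field of every record as a character vector.
column <- function(i) vapply(fields, `[`, character(1), i)

# Convert each column to its known type.
parsed <- data.frame(url        = column(1),
                     popularity = as.numeric(column(2)),
                     mintime    = as.numeric(column(3)),
                     maxtime    = as.numeric(column(4)),
                     stringsAsFactors = FALSE)
```

This holds the entire file in memory as a single string, so it only suits files that fit comfortably in RAM; for larger inputs the chunked approach described above is the safer choice.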

User · Answer

Here is an example that utilizes fread from data.table 1.8.7.

The examples come from the help page to fread, with the timings on my Windows XP Core 2 Duo E8400.

    library(data.table)
    # Demo speedup
    n = 1e6
    DT = data.table(a = sample(1:1000, n, replace = TRUE),
                    b = sample(1:1000, n, replace = TRUE),
                    c = rnorm(n),
                    d = sample(c("foo", "bar", "baz", "qux", "quux"), n, replace = TRUE),
                    e = rnorm(n),
                    f = sample(1:1000, n, replace = TRUE))
    DT[2, b := NA_integer_]
    DT[4, c := NA_real_]
    DT[3, d := NA_character_]
    DT[5, d := ""]
    DT[2, e := +Inf]
    DT[3, e := -Inf]

standard read.table

    write.table(DT, "test.csv", sep = ",", row.names = FALSE, quote = FALSE)
    cat("File size (MB):", round(file.info("test.csv")$size / 1024^2), "\n")
    ## File size (MB): 51

    system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
    ##    user  system elapsed
    ##   24.71    0.15   25.42
    # second run will be faster
    system.time(DF1 <- read.csv("test.csv", stringsAsFactors = FALSE))
    ##    user  system elapsed
    ##   17.85    0.07   17.98

optimized read.table

    system.time(DF2 <- read.table("test.csv", header = TRUE, sep = ",", quote = "",
                                  stringsAsFactors = FALSE, comment.char = "", nrows = n,
                                  colClasses = c("integer", "integer", "numeric",
                                                 "character", "numeric", "integer")))
    ##    user  system elapsed
    ##   10.20    0.03   10.32

fread

    require(data.table)
    system.time(DT <- fread("test.csv"))
    ##    user  system elapsed
    ##    3.12    0.01    3.22

sqldf

    require(sqldf)
    system.time(SQLDF <- read.csv.sql("test.csv", dbname = NULL))
    ##    user  system elapsed
    ##   12.49    0.09   12.69

    # sqldf as on SO
    f <- file("test.csv")
    system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
    ##    user  system elapsed
    ##   10.21    0.47   10.73

ff / ffdf

    require(ff)
    system.time(FFDF <- read.csv.ffdf(file = "test.csv", nrows = n))
    ##    user  system elapsed
    ##   10.85    0.10   10.99

In summary:

    ##    user  system elapsed  Method
    ##   24.71    0.15   25.42  read.csv (first time)
    ##   17.85    0.07   17.98  read.csv (second time)
    ##   10.20    0.03   10.32  Optimized read.table
    ##    3.12    0.01    3.22  fread
    ##   12.49    0.09   12.69  sqldf
    ##   10.21    0.47   10.73  sqldf on SO
    ##   10.85    0.10   10.99  ffdf

User · Answer

I am reading data very quickly using the new arrow package. It appears to be in a fairly early stage.

Specifically, I am using the parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you do not. This format is convenient as it can be used from Python as well.

My main use case for this is on a fairly restrained RShiny server. For these reasons, I prefer to keep data attached to the Apps (i.e., out of SQL), and therefore require small file size as well as speed.

This linked article provides benchmarking and a good overview. I have quoted some interesting points below.

https://ursalabs.org/blog/2019-10-columnar-perf/

File Size

"That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called 'dictionary compression'). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read."

Read Speed

"When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV."

Independent Test

I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically I shuffled a bunch of things around to attempt to challenge the compression. I also added a short text field of random words and two simulated factors.

Data

    library(dplyr)
    library(tibble)
    library(OpenRepGrid)

    n <- 1000000

    set.seed(1234)
    some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
    some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))

    test_data <- mtcars %>%
      rownames_to_column() %>%
      sample_n(n, replace = TRUE) %>%
      mutate_all(~ sample(., length(.))) %>%
      mutate(factor1 = sample(some_levels1, n, replace = TRUE),
             factor2 = sample(some_levels2, n, replace = TRUE),
             text = randomSentences(n, sample(3:8, n, replace = TRUE))
             )

Read and Write

Writing the data is easy.

    library(arrow)

    write_parquet(test_data, "test_data.parquet")

    # you can also mess with the compression
    write_parquet(test_data, "test_data2.parquet", compression = "gzip", compression_level = 9)

Reading the data is also easy.

    read_parquet("test_data.parquet")

    # this option will result in lightning fast reads, but in a different format.
    read_parquet("test_data2.parquet", as_data_frame = FALSE)

I tested reading this data against a few of the competing options, and did get slightly different results than with the article above, which is expected.

This file is nowhere near as large as the one in the benchmark article, so maybe that is the difference.

Tests

    rds:             test_data.rds (20.3 MB)
    parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
    parquet2:        test_data2.parquet (14.9 MB with higher compression)
    parquet:         test_data.parquet (40.7 MB)
    fst2:            test_data2.fst (27.9 MB with higher compression)
    fst:             test_data.fst (76.8 MB)
    fread2:          test_data.csv.gz (23.6 MB)
    fread:           test_data.csv (98.7 MB)
    feather_arrow:   test_data.feather (157.2 MB read with arrow)
    feather:         test_data.feather (157.2 MB read with feather)

Observations

For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed up.

Here fst is also a great choice. I would use either the highly compressed fst format or the highly compressed parquet, depending on whether I needed speed or small file size.

User · Answer

Strangely, no one answered the bottom part of the question for years, even though this is an important one: data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:

    attr(df, "row.names") <- .set_row_names(length(df[[1]]))
    class(df) <- "data.frame"

This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.

As for loading large data into R: personally, I dump them by column into binary files and use readBin(). That is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.
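The per-column binary dump can be sketched with writeBin() and readBin(), combined with the in-place conversion described above. This is only an illustration under assumptions: the one-file-per-column naming scheme, col_dir, and the simulated numeric columns are hypothetical, and the row count and column type must be known when reading back.

```r
n <- 1000L
cols <- list(popularity = runif(n), mintime = rnorm(n), maxtime = rnorm(n))

# Dump each numeric column to its own binary file (hypothetical naming scheme).
col_dir <- tempfile("cols_")
dir.create(col_dir)
for (nm in names(cols)) {
  writeBin(cols[[nm]], file.path(col_dir, paste0(nm, ".bin")))
}

# Read the columns back; n and the storage type must be supplied by the caller.
df <- lapply(names(cols), function(nm) {
  readBin(file.path(col_dir, paste0(nm, ".bin")), what = "double", n = n)
})
names(df) <- names(cols)

# Turn the list into a data frame without copying the columns.
attr(df, "row.names") <- .set_row_names(length(df[[1]]))
class(df) <- "data.frame"
```

Doubles survive the round trip bit for bit, so no precision is lost compared with writing and re-parsing text. Character columns need a different encoding (for example fixed-width or length-prefixed strings), which this sketch does not cover.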

User · Answer

An alternative is to use the vroom package, which is now on CRAN. vroom doesn't load the entire file; it indexes where each record is located, and the data is read later when you use it.

"Only pay for what you use."

See Introduction to vroom, Get started with vroom and the vroom benchmarks.

The basic overview is that the initial read of a huge file will be much faster, and subsequent modifications to the data may be slightly slower. So depending on what your use is, it could be the best option.

See a simplified example from the vroom benchmarks below. The key parts to notice are the super fast read times, but slightly slower operations like aggregate etc.

    package                 read    print   sample   filter  aggregate   total
    read.delim              1m      21.5s   1ms      315ms   764ms       1m 22.6s
    readr                   33.1s   90ms    2ms      202ms   825ms       34.2s
    data.table              15.7s   13ms    1ms      129ms   394ms       16.3s
    vroom (altrep) dplyr    1.7s    89ms    1.7s     1.3s    1.9s        6.7s

User · Answer

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

(1) Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.

(2) Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.

(3) Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

(4) read.csv.raw from iotools provides a third option for quickly reading CSV files.

(5) Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

(6) Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.

The original answer

There are a couple of simple things to try, whether you use read.table or scan.

(1) Set nrows = the number of records in your data (nmax in scan).

(2) Make sure that comment.char = "" to turn off interpretation of comments.

(3) Explicitly define the classes of each column using colClasses in read.table.

(4) Setting multi.line = FALSE may also improve performance in scan.

If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with save or saveRDS. Next time you can retrieve it faster with load or readRDS.
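The read.table settings from the original answer, followed by the saveRDS/readRDS caching step, can be put together as below. This is a small sketch on simulated data: the file names, column names, and row count are hypothetical, and quote = "" is an extra setting borrowed from the "optimized read.table" benchmark elsewhere in this thread.

```r
# Simulate a tab-separated file without header, shaped like the question's data.
demo_file <- tempfile(fileext = ".tsv")
n <- 1000L
write.table(data.frame(url = sprintf("site%d.com", seq_len(n)),
                       popularity = runif(n),
                       mintime = seq_len(n),
                       maxtime = seq_len(n) + 10L),
            demo_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

tab <- read.table(demo_file, sep = "\t", header = FALSE,
                  nrows = n,            # known record count
                  comment.char = "",    # turn off comment interpretation
                  quote = "",           # turn off quote handling
                  colClasses = c("character", "numeric", "integer", "integer"),
                  col.names = c("url", "popularity", "mintime", "maxtime"))

# Cache the parsed result so later sessions skip text parsing entirely.
rds_file <- tempfile(fileext = ".rds")
saveRDS(tab, rds_file)
tab_again <- readRDS(rds_file)
```

Wrapping the read.table and readRDS calls in system.time() on a realistically sized file is the simplest way to check whether the cache is worth keeping for a given dataset.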

User · Answer

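Instead of the conventional read.table, I find fread to be a faster function. Specifying additional arguments, such as select to read only the required columns, colClasses, and stringsAsFactors, will reduce the time taken to import the file. Note that colClasses takes class names such as "numeric" or "Date", not function names such as "as.numeric"; the named-list form below assigns classes by column number, which works together with select.

    data_frame <- fread("filename.csv", sep = ",", header = FALSE, stringsAsFactors = FALSE,
                        select = c(1, 4, 5, 6, 7),
                        colClasses = list(numeric = c(1, 5), character = 4, Date = 6, factor = 7))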

User · Answer

I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40mm rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.

Here's my test code:

Set up the test data:

    bigdf <- data.frame(dim = sample(letters, replace = T, 4e7), fact1 = rnorm(4e7), fact2 = rnorm(4e7, 20, 50))
    write.csv(bigdf, 'bigdf.csv', quote = F)

I restarted R before running the following import routine:

    library(sqldf)
    f <- file("bigdf.csv")
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

I let the following line run all night but it never completed:

    system.time(big.df <- read.csv('bigdf.csv'))

User · Answer

I've tried all of the above and readr did the best job. I have only 8 GB of RAM.

Loop for 20 files, 5 GB each, 7 columns:

    read_fwf(arquivos[i],
             fwf_cols(cnpj = c(4, 17), nome = c(19, 168), cpf = c(169, 183),
                      fantasia = c(169, 223), sit.cadastral = c(224, 225),
                      dt.sitcadastral = c(226, 233), cnae = c(376, 382)),
             col_types = "ccccccc")
