How to import multiple .csv files at once?

Question

Suppose we have a folder containing multiple data.csv files, each containing the same number of variables but each from different times. Is there a way in R to import them all simultaneously rather than having to import them all individually?

My problem is that I have around 2000 data files to import, and importing them one at a time with the code

read.delim(file = "filename", header = TRUE, sep = "\t")

is not very efficient.

User · Answer

The following code should give you the fastest speed for big data, as long as your computer has many cores:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(doParallel, data.table, stringr)

# get the file names
dir() %>% str_subset("\\.csv$") -> fn

# use parallel setting
(cl <- detectCores() %>%
  makeCluster()) %>%
  registerDoParallel()

# read and bind all files together
system.time({
  big_df <- foreach(
    i = fn,
    .packages = "data.table"
  ) %dopar%
    {
      fread(i, colClasses = "character")
    } %>%
    rbindlist(fill = TRUE)
})

# end of parallel work
# the cluster was created explicitly with makeCluster(), so stop it with stopCluster()
stopCluster(cl)

Update (2020-04-16): Since I found a new package for parallel computation, here is an alternative solution using the following code.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(future.apply, data.table, stringr)

# get the file names
dir() %>% str_subset("\\.csv$") -> fn

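# `multiprocess` is deprecated in recent versions of future; `multisession` is the portable replacement
plan(multisession)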

future_lapply(fn,fread,colClasses = "character") %>% 
  rbindlist(fill = TRUE) -> res

# res is the merged data.table
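
Both variants above read every column as character and bind with fill = TRUE. A minimal base R sketch of why that combination is convenient is shown below: character columns avoid type clashes between files, and filling pads columns that are missing from some files. The helper name bind_fill and the toy files are illustrative assumptions, not part of the answer.

```r
# Base R sketch of "read everything as character, then bind with fill".
# bind_fill() and the two toy files are hypothetical, for illustration only.
tmp <- file.path(tempdir(), "fill_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("id,value", "001,1.5"), file.path(tmp, "a.csv"))
writeLines(c("id,value,note", "002,x,late"), file.path(tmp, "b.csv"))

fn <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# colClasses = "character" keeps "001" intact and avoids type clashes on bind
dfs <- lapply(fn, read.csv, colClasses = "character")

# Pad each data frame with NA columns so that all share the same column set
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA_character_
    d[all_cols]
  })
  do.call(rbind, padded)
}

big_df <- bind_fill(dfs)
print(big_df)
```

Types can be converted afterwards (for example with type.convert()) once the combined table is known to be consistent.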

User · Answer

This is the code I developed to read all csv files into R. It will create a dataframe for each csv file individually and title that dataframe the file's original name (removing spaces and the .csv). I hope you find it useful!

path <- "C:/Users/cfees/My Box Files/Fitness/"
files <- list.files(path = path, pattern = "\\.csv$")
for (file in files) {
  # position of the "." that starts the extension
  perpos <- which(strsplit(file, "")[[1]] == ".")
  # object name = file name without extension and without spaces
  assign(
    gsub(" ", "", substr(file, 1, perpos - 1)),
    read.csv(paste(path, file, sep = ""))
  )
}
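
The same name cleanup can also be applied to a named list rather than to separate objects, which keeps the workspace tidy. The following is only a sketch under assumptions: the temporary folder and the file names "week one.csv" and "week two.csv" are made up for the demonstration.

```r
# Sketch: one named list instead of one object per file.
# The folder and file names are hypothetical demo data.
tmp <- file.path(tempdir(), "named_list_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(tmp, "week one.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(tmp, "week two.csv"), row.names = FALSE)

files <- list.files(path = tmp, pattern = "\\.csv$")

# Same cleanup as the loop above: drop the extension, then remove spaces
clean_names <- gsub(" ", "", tools::file_path_sans_ext(files))

frames <- lapply(file.path(tmp, files), read.csv)
names(frames) <- clean_names

print(names(frames))
print(frames$weekone)
```

Each data frame is then reachable as frames$weekone, frames[["weektwo"]], and so on.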

User · Answer

A speedy and succinct tidyverse solution (more than twice as fast as Base R's read.csv):

tbl <-
    list.files(pattern = "\\.csv$") %>%
    map_df(~read_csv(.))

and data.table's fread() can even cut those load times by half again (for 1/4 of the Base R times):

library(data.table)

tbl_fread <-
    list.files(pattern = "\\.csv$") %>%
    map_df(~fread(.))

The stringsAsFactors = FALSE argument keeps the dataframe factor free (and, as marbel points out, is the default setting for fread).

If the typecasting is being cheeky, you can force all the columns to be read as characters with the col_types argument:

tbl <-
    list.files(pattern = "\\.csv$") %>%
    map_df(~read_csv(., col_types = cols(.default = "c")))

If you want to dip into subdirectories to construct your list of files to eventually bind, be sure to include the path name, and register the files with their full names in your list. This allows the binding work to go on outside of the current directory. (Think of the full pathnames as operating like passports that allow movement back across directory "borders".)

tbl <-
    list.files(path = "./subdirectory/",
               pattern = "\\.csv$",
               full.names = TRUE) %>%
    map_df(~read_csv(., col_types = cols(.default = "c")))

As Hadley describes here (about halfway down):

    map_df(x, f) is effectively the same as do.call("rbind", lapply(x, f))....

Bonus Feature: adding filenames to the records, per Niks' feature request in the comments below.

Add the original filename to each record. Code explained: make a function that appends the filename to each record during the initial reading of the tables, then use that function instead of the simple read_csv() function.

read_plus <- function(flnm) {
    read_csv(flnm) %>%
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(pattern = "\\.csv$",
               full.names = TRUE) %>%
    map_df(~read_plus(.))

(The typecasting and subdirectory handling approaches can also be handled inside the read_plus() function, in the same manner as illustrated in the second and third variants suggested above.)

Benchmark Code & Results

library(tidyverse)
library(data.table)
library(microbenchmark)

### Base R Approaches
#### Instead of a dataframe, this approach creates a list of lists
#### removed from analysis as this alone doubled analysis time reqd
# lapply_read.delim <- function(path, pattern = "\\.csv$") {
#     temp = list.files(path, pattern, full.names = TRUE)
#     myfiles = lapply(temp, read.delim)
# }

#### read.csv()
do.call_rbind_read.csv <- function(path, pattern = "\\.csv$") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
}

map_df_read.csv <- function(path, pattern = "\\.csv$") {
    list.files(path, pattern, full.names = TRUE) %>%
    map_df(~read.csv(., stringsAsFactors = FALSE))
}

### dplyr
#### read_csv()
lapply_read_csv_bind_rows <- function(path, pattern = "\\.csv$") {
    files = list.files(path, pattern, full.names = TRUE)
    lapply(files, read_csv) %>% bind_rows()
}

map_df_read_csv <- function(path, pattern = "\\.csv$") {
    list.files(path, pattern, full.names = TRUE) %>%
    map_df(~read_csv(., col_types = cols(.default = "c")))
}

### data.table / purrr hybrid
map_df_fread <- function(path, pattern = "\\.csv$") {
    list.files(path, pattern, full.names = TRUE) %>%
    map_df(~fread(.))
}

### data.table
rbindlist_fread <- function(path, pattern = "\\.csv$") {
    files = list.files(path, pattern, full.names = TRUE)
    rbindlist(lapply(files, function(x) fread(x)))
}

do.call_rbind_fread <- function(path, pattern = "\\.csv$") {
    files = list.files(path, pattern, full.names = TRUE)
    do.call(rbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE)))
}

read_results <- function(dir_size) {
    microbenchmark(
        # lapply_read.delim = lapply_read.delim(dir_size), # too slow to include in benchmarks
        do.call_rbind_read.csv = do.call_rbind_read.csv(dir_size),
        map_df_read.csv = map_df_read.csv(dir_size),
        lapply_read_csv_bind_rows = lapply_read_csv_bind_rows(dir_size),
        map_df_read_csv = map_df_read_csv(dir_size),
        rbindlist_fread = rbindlist_fread(dir_size),
        do.call_rbind_fread = do.call_rbind_fread(dir_size),
        map_df_fread = map_df_fread(dir_size),
        times = 10L)
}

read_results_lrg_mid_mid <- read_results('./testFolder/500MB_12.5MB_40files')
print(read_results_lrg_mid_mid, digits = 3)

read_results_sml_mic_mny <- read_results('./testFolder/5MB_5KB_1000files')
read_results_sml_tny_mod <- read_results('./testFolder/5MB_50KB_100files')
read_results_sml_sml_few <- read_results('./testFolder/5MB_500KB_10files')

read_results_med_sml_mny <- read_results('./testFolder/50MB_5OKB_1000files')
read_results_med_sml_mod <- read_results('./testFolder/50MB_5OOKB_100files')
read_results_med_med_few <- read_results('./testFolder/50MB_5MB_10files')

read_results_lrg_sml_mny <- read_results('./testFolder/500MB_500KB_1000files')
read_results_lrg_med_mod <- read_results('./testFolder/500MB_5MB_100files')
read_results_lrg_lrg_few <- read_results('./testFolder/500MB_50MB_10files')

read_results_xlg_lrg_mod <- read_results('./testFolder/5000MB_50MB_100files')

print(read_results_sml_mic_mny, digits = 3)
print(read_results_sml_tny_mod, digits = 3)
print(read_results_sml_sml_few, digits = 3)

print(read_results_med_sml_mny, digits = 3)
print(read_results_med_sml_mod, digits = 3)
print(read_results_med_med_few, digits = 3)

print(read_results_lrg_sml_mny, digits = 3)
print(read_results_lrg_med_mod, digits = 3)
print(read_results_lrg_lrg_few, digits = 3)

print(read_results_xlg_lrg_mod, digits = 3)

# display boxplot of my typical use case results & basic machine max load
par(oma = c(0,0,0,0)) # remove overall margins if present
par(mfcol = c(1,1)) # remove grid if present
par(mar = c(12,5,1,1) + 0.1) # to display just a single boxplot with its complete labels
boxplot(read_results_lrg_mid_mid, las = 2, xlab = "", ylab = "Duration (seconds)", main = "40 files @ 12.5MB (500MB)")
boxplot(read_results_xlg_lrg_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 50MB (5GB)")

# generate 3x3 grid boxplots
par(oma = c(12,1,1,1)) # margins for the whole 3 x 3 grid plot
par(mfcol = c(3,3)) # create grid (filling down each column)
par(mar = c(1,4,2,1)) # margins for the individual plots in 3 x 3 grid
boxplot(read_results_sml_mic_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 5KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_tny_mod, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "100 files @ 50KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_sml_few, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "10 files @ 500KB (5MB)")

boxplot(read_results_med_sml_mny, las = 2, xlab = "", ylab = "Duration (microseconds)", main = "1000 files @ 50KB (50MB)", xaxt = 'n')
boxplot(read_results_med_sml_mod, las = 2, xlab = "", ylab = "Duration (microseconds)", main = "100 files @ 500KB (50MB)", xaxt = 'n')
boxplot(read_results_med_med_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 5MB (50MB)")

boxplot(read_results_lrg_sml_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 500KB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_med_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 5MB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_lrg_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 50MB (500MB)")

[Figure: Middling Use Case (boxplot of read durations)]

[Figure: Larger Use Case (boxplot of read durations)]

[Figure: Variety of Use Cases (3 x 3 grid of boxplots). Rows: file counts (1000, 100, 10). Columns: final dataframe size (5MB, 50MB, 500MB).]

The base R results are better for the smallest use cases, where the overhead of bringing the C libraries of purrr and dplyr to bear outweighs the performance gains that are observed when performing larger scale processing tasks.

If you want to run your own tests, you may find this bash script helpful:

for ((i=1; i<=$2; i++)); do
  cp "$1" "${1:0:8}_${i}.csv";
done

bash what_you_name_this_script.sh "fileName_you_want_copied" 100 will create 100 copies of your file, sequentially numbered (after the initial 8 characters of the filename and an underscore).

Attributions and Appreciations

With special thanks to:

Tyler Rinker and Akrun for demonstrating microbenchmark.
Jake Kaupp for introducing me to map_df() here.
David McLaughlin for helpful feedback on improving the visualizations and for discussing/confirming the performance inversions observed in the small file, small dataframe analysis results.
marbel for pointing out the default behavior of fread(). (I need to study up on data.table.)
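
For readers without the tidyverse installed, the read_plus idea can be sketched in base R using the do.call(rbind, lapply(...)) form that the quoted equivalence mentions. This is an illustration under assumptions: read_plus_base is a hypothetical name, the toy files are invented, and basename() is used so that only the short file name is stored.

```r
# Base R sketch of "tag every record with its source file".
# read_plus_base() and the demo files are hypothetical.
tmp <- file.path(tempdir(), "read_plus_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, score = c(10, 20)),
          file.path(tmp, "jan.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(30, 40)),
          file.path(tmp, "feb.csv"), row.names = FALSE)

# Read one file, then append a column holding the file it came from
read_plus_base <- function(flnm) {
  df <- read.csv(flnm, stringsAsFactors = FALSE)
  df$filename <- basename(flnm)
  df
}

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# lapply() reads each file; do.call(rbind, ...) stacks the results
tbl_with_sources <- do.call(rbind, lapply(files, read_plus_base))
print(tbl_with_sources)
```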

User · Answer

With many files and many cores, fread xargs cat (described below) is about 50x faster than the fastest solution in the top 3 answers.

rbindlist lapply read.delim  500s <- 1st place & accepted answer
rbindlist lapply fread       250s <- 2nd & 3rd place answers
rbindlist mclapply fread      10s
fread xargs cat                5s

Time to read 121401 csvs into a single data.table. Each time is an average of three runs, then rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. Machine is a GCP VM with 96 cores.

The top three answers by @A5C1D2H2I1M1N2O1R2T1, @leerssej, and @marbel are all essentially the same: apply fread (or read.delim) to each file, then rbind/rbindlist the resulting data.tables. I usually use the rbindlist(lapply(list.files(pattern = "\\.csv$"), fread)) form. This is better than other R-internal alternatives, and fine for a small number of large csvs, but not the best for a large number of small csvs when speed matters. In that case, it can be much faster to first use cat, as @Spacedman suggests in the 4th-ranked answer. I'll add some detail on how to do this from within R:

x = fread(cmd='cat *.csv', header=F)

However, what if each csv has a header?

x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)

And what if you have so many files that the *.csv shell glob fails?

x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)

And what if all files have a header AND there are too many files?

header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
names(x) = names(header)

And what if the resulting concatenated csv is too big for system memory?

system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)

With headers?

system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)

Finally, what if you don't want all .csv in a directory, but rather a specific set of files? (Also, they all have headers.) (This is my use case.)

fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")

and this is about the same speed as plain fread xargs cat.

Note: for data.table pre-v1.11.6 (19 Sep 2018), omit the cmd= from fread(cmd=.

Addendum: using the parallel library's mclapply in place of serial lapply, e.g., rbindlist(mclapply(list.files(pattern = "\\.csv$"), fread)), is also much faster than rbindlist lapply fread.

To sum up, if you're interested in speed, and have many files and many cores, fread xargs cat is about 50x faster than the fastest solution in the top 3 answers.
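
The awk condition 'NR==1||FNR!=1' keeps the very first line read (the header of the first file) and then every line that is not the first line of its own file. A small base R sketch of that same rule is given below, purely to illustrate the logic; it will not match the speed of the shell pipeline, and the file names are invented for the demo.

```r
# Base R illustration of awk 'NR==1||FNR!=1': keep one header, drop the rest.
# The two demo files are hypothetical.
tmp <- file.path(tempdir(), "concat_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("a,b", "1,2", "3,4"), file.path(tmp, "f1.csv"))
writeLines(c("a,b", "5,6"), file.path(tmp, "f2.csv"))

paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# For the first file keep all lines; for later files drop the header line
lines <- unlist(lapply(seq_along(paths), function(k) {
  l <- readLines(paths[k])
  if (k == 1) l else l[-1]
}))

# Parse the concatenated text in a single pass
x <- read.csv(text = lines, stringsAsFactors = FALSE)
print(x)
```

The point of the technique is that the parser runs once on one stream instead of once per file, which is where the savings for many small files come from.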

User · Answer

It was requested that I add this functionality to the stackoverflow R package. Given that it is a tinyverse package (and can't depend on third party packages), here is what I came up with:

#' Bulk import data files
#'
#' Read in each file at a path and then unnest them. Defaults to csv format.
#'
#' @param path        a character vector of full path names
#' @param pattern     an optional \link[=regex]{regular expression}. Only file names which match the regular expression will be returned.
#' @param reader      a function that can read data from a file name.
#' @param ...         optional arguments to pass to the reader function (eg \code{stringsAsFactors}).
#' @param reducer     a function to unnest the individual data files. Use I to retain the nested structure.
#' @param recursive   logical. Should the listing recurse into directories?
#'
#' @author Neal Fultz
#' @references \url{https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once}
#'
#' @importFrom utils read.csv
#' @export
read.directory <- function(path='.', pattern=NULL, reader=read.csv, ...,
                           reducer=function(dfs) do.call(rbind.data.frame, dfs), recursive=FALSE) {
  files <- list.files(path, pattern, full.names = TRUE, recursive = recursive)

  reducer(lapply(files, reader, ...))
}

By parameterizing the reader and reducer function, people can use data.table or dplyr if they so choose, or just use the base R functions that are fine for smaller data sets.
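
A short usage sketch may help show how the reader and reducer parameters interact. The function is restated so the block runs on its own; the temporary folder and the files p1.csv and p2.csv are assumptions made for the demonstration.

```r
# Usage sketch for read.directory(); the function body is repeated from the
# answer so that this block is self-contained. Demo files are hypothetical.
read.directory <- function(path = ".", pattern = NULL, reader = read.csv, ...,
                           reducer = function(dfs) do.call(rbind.data.frame, dfs),
                           recursive = FALSE) {
  files <- list.files(path, pattern, full.names = TRUE, recursive = recursive)
  reducer(lapply(files, reader, ...))
}

tmp <- file.path(tempdir(), "read_directory_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = 1:2, y = c("a", "b")), file.path(tmp, "p1.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4, y = c("c", "d")), file.path(tmp, "p2.csv"), row.names = FALSE)

# Default reducer: one combined data frame; extra arguments go to the reader
combined <- read.directory(tmp, pattern = "\\.csv$", stringsAsFactors = FALSE)

# reducer = I keeps the nested structure: a list with one data frame per file
nested <- read.directory(tmp, pattern = "\\.csv$", reducer = I)

print(nrow(combined))
print(length(nested))
```

Swapping in another reader or reducer (for instance data.table's fread and rbindlist, if installed) only changes the two function arguments, not the calling code.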

User · Answer

In my view, most of the other answers are obsoleted by rio::import_list, which is a succinct one-liner:

library(rio)
my_data <- import_list(dir("path_to_directory", pattern = "\\.csv$", full.names = TRUE), rbind = TRUE)

Any extra arguments are passed to rio::import. rio can deal with almost any file format R can read, and it uses data.table's fread where possible, so it should be fast too.

User · Answer

Using purrr and including file IDs as a column:

library(tidyverse)

p <- "my/directory"
files <- list.files(p, pattern = "csv", full.names = TRUE) %>%
    set_names()
merged <- files %>% map_dfr(read_csv, .id = "filename")

Without set_names(), .id= will use integer indicators instead of actual file names.

If you then want just the short filename without the full path:

merged <- merged %>% mutate(filename = basename(filename))
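
The role of set_names() is easier to see in a base R analogue: the names attached to the vector of paths are what end up in the ID column. The sketch below is an assumption-laden illustration (toy files, base R functions in place of purrr), not the answer's own code.

```r
# Base R analogue of set_names() + .id: names on the path vector become file IDs.
# The demo files are hypothetical.
tmp <- file.path(tempdir(), "id_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(v = 1:2), file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(v = 3L), file.path(tmp, "b.csv"), row.names = FALSE)

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
files <- setNames(files, files)       # base R counterpart of set_names()

frames <- lapply(files, read.csv)     # lapply() carries the names over

# Stack the frames, then repeat each name once per row of its data frame
merged <- do.call(rbind, unname(frames))
merged$filename <- rep(basename(names(frames)), vapply(frames, nrow, integer(1)))
print(merged)
```

Without the setNames() step, names(frames) would be NULL and only positional indices would be available, which mirrors the integer indicators described above.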

User · Answer

Using plyr::ldply, there is roughly a 50% speed increase from enabling the .parallel option while reading 400 csv files of roughly 30-40 MB each. The example includes a text progress bar.

library(plyr)
library(data.table)
library(doSNOW)

csv.list <- list.files(path = "t:/data", pattern = "\\.csv$", full.names = TRUE)

cl <- makeCluster(4)
registerDoSNOW(cl)

pb <- txtProgressBar(max = length(csv.list), style = 3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel = TRUE, .paropts = list(.options.snow = list(progress = pbu))))

stopCluster(cl)

User · Answer

I like the approach using list.files(), lapply() and list2env() (or fs::dir_ls(), purrr::map() and list2env()). That seems simple and flexible.

Alternatively, you may try the small package {tor} (to-R): by default it imports files from the working directory into a list (list_*() variants) or into the global environment (load_*() variants).

For example, here I read all the .csv files from my working directory into a list using tor::list_csv():

library(tor)

dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "csv1.csv"
#>  [4] "csv2.csv"         "datasets"         "DESCRIPTION"
#>  [7] "docs"             "inst"             "LICENSE.md"
#> [10] "man"              "NAMESPACE"        "NEWS.md"
#> [13] "R"                "README.md"        "README.Rmd"
#> [16] "tests"            "tmp.R"            "tor.Rproj"

list_csv()
#> $csv1
#>   x
#> 1 1
#> 2 2
#>
#> $csv2
#>   y
#> 1 a
#> 2 b

And now I load those files into my global environment with tor::load_csv():

# The working directory contains .csv files
dir()
#>  [1] "_pkgdown.yml"     "cran-comments.md" "CRAN-RELEASE"
#>  [4] "csv1.csv"         "csv2.csv"         "datasets"
#>  [7] "DESCRIPTION"      "docs"             "inst"
#> [10] "LICENSE.md"       "man"              "NAMESPACE"
#> [13] "NEWS.md"          "R"                "README.md"
#> [16] "README.Rmd"       "tests"            "tmp.R"
#> [19] "tor.Rproj"

load_csv()

# Each file is now available as a dataframe in the global environment
csv1
#>   x
#> 1 1
#> 2 2
csv2
#>   y
#> 1 a
#> 2 b

Should you need to read specific files, you can match their file path with regexp, ignore.case and invert.

For even more flexibility use list_any(). It allows you to supply the reader function via the argument .f.

(path_csv <- tor_example("csv"))
#> [1] "C:/Users/LeporeM/Documents/R/R-3.5.2/library/tor/extdata/csv"
dir(path_csv)
#> [1] "file1.csv" "file2.csv"

list_any(path_csv, read.csv)
#> $file1
#>   x
#> 1 1
#> 2 2
#>
#> $file2
#>   y
#> 1 a
#> 2 b

Pass additional arguments via ... or inside the lambda function.

path_csv %>%
  list_any(readr::read_csv, skip = 1)
#> Parsed with column specification:
#> cols(
#>   `1` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_character()
#> )
#> $file1
#> # A tibble: 1 x 1
#>     `1`
#>   <dbl>
#> 1     2
#>
#> $file2
#> # A tibble: 1 x 1
#>   a
#>   <chr>
#> 1 b

path_csv %>%
  list_any(~read.csv(., stringsAsFactors = FALSE)) %>%
  map(as_tibble)
#> $file1
#> # A tibble: 2 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#>
#> $file2
#> # A tibble: 2 x 1
#>   y
#>   <chr>
#> 1 a
#> 2 b

User · Answer

As well as using lapply or some other looping construct in R, you could merge your CSV files into one file.

In Unix, if the files had no headers, then it's as easy as:

cat *.csv > all.csv

or if there are headers, and you can find a string that matches headers and only headers (i.e. suppose header lines all start with "Age"), you'd do:

cat *.csv | grep -v ^Age > all.csv

I think in Windows you could do this with COPY and SEARCH (or FIND or something) from the DOS command box, but why not install cygwin and get the power of the Unix command shell?
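
Where no Unix shell is available, the same filter can be approximated from within R. The sketch below mirrors cat *.csv | grep -v ^Age > all.csv with readLines(), grepl() and writeLines(); the file names and the column names Age and Height are assumptions for the demo, and the approach is meant for illustration rather than speed.

```r
# R sketch of: cat *.csv | grep -v ^Age > all.csv
# Demo files and the "Height" column are hypothetical.
tmp <- file.path(tempdir(), "grep_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("Age,Height", "31,170"), file.path(tmp, "s1.csv"))
writeLines(c("Age,Height", "45,182"), file.path(tmp, "s2.csv"))

paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
all_lines <- unlist(lapply(paths, readLines))   # the "cat" step

# The "grep -v ^Age" step: drop every line that starts with "Age"
body <- all_lines[!grepl("^Age", all_lines)]

# Write outside the input folder so a rerun does not pick up the output
out <- file.path(tempdir(), "all.csv")
writeLines(body, out)

# All header lines were removed, so column names are supplied by hand
all_data <- read.csv(out, header = FALSE, col.names = c("Age", "Height"))
print(all_data)
```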

User · Answer

This is my specific example to read multiple files and combine them into 1 data frame:

path <- file.path("C:/folder/subfolder")
files <- list.files(path = path, pattern = "\\.csv$", full.names = TRUE)
library(data.table)
data = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

User · Answer

Something like the following should result in each data frame as a separate element in a single list:

temp = list.files(pattern = "\\.csv$")
myfiles = lapply(temp, read.delim)

This assumes that you have those CSVs in a single directory (your current working directory) and that all of them have the lower-case extension .csv.

If you then want to combine those data frames into a single data frame, see the solutions in other answers using things like do.call(rbind, ...), dplyr::bind_rows() or data.table::rbindlist().

If you really want each data frame in a separate object, even though that's often inadvisable, you could do the following with assign:

temp = list.files(pattern = "\\.csv$")
for (i in seq_along(temp)) assign(temp[i], read.csv(temp[i]))

Or, without assign, and to demonstrate (1) how the file name can be cleaned up and (2) how to use list2env, you can try the following:

temp = list.files(pattern = "\\.csv$")
list2env(
  lapply(setNames(temp, make.names(gsub("\\.csv$", "", temp))),
         read.csv), envir = .GlobalEnv)

But again, it's often better to leave them in a single list.

User · Answer

Building on dnlbrk's comment, assign can be considerably faster than list2env for big files.

library(readr)
library(stringr)

List_of_file_paths <- list.files(path = "C:/Users/Anon/Documents/Folder_with_csv_files/", pattern = "\\.csv$", all.files = TRUE, full.names = TRUE)

By setting the full.names argument to TRUE, you will get the full path to each file as a separate character string in your list of files, e.g., List_of_file_paths[1] will be something like "C:/Users/Anon/Documents/Folder_with_csv_files/file1.csv".

for (f in seq_along(List_of_file_paths)) {
  # start = 46 skips the directory prefix of this particular path; adjust it for your own path
  file_name <- str_sub(string = List_of_file_paths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_file_paths[f])
  assign(x = file_name, value = file_df, envir = .GlobalEnv)
}

You could use the data.table package's fread or base R's read.csv instead of read_csv. The file_name step allows you to tidy up the name so that each data frame is not left with the full path to the file as its name. You could extend your loop to do further things to the data table before transferring it to the global environment, for example:

for (f in seq_along(List_of_file_paths)) {
  file_name <- str_sub(string = List_of_file_paths[f], start = 46, end = -5)
  file_df <- read_csv(List_of_file_paths[f])
  file_df <- file_df[, 1:3] # if you only need the first three columns
  assign(x = file_name, value = file_df, envir = .GlobalEnv)
}
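
Because fixed character offsets depend on the length of the directory path, a variant that derives the object name from the file name itself may be more portable. This is a sketch under assumptions: it uses base R's read.csv, a throwaway environment named target in place of .GlobalEnv, and an invented demo file.

```r
# Sketch: derive object names without hard-coded character offsets.
# `target` and the demo file are hypothetical stand-ins.
tmp <- file.path(tempdir(), "assign_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12),
          file.path(tmp, "file1.csv"), row.names = FALSE)

paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
target <- new.env()   # stand-in for .GlobalEnv so the demo leaves no objects behind

for (p in paths) {
  # basename() drops the directory, file_path_sans_ext() drops ".csv"
  file_name <- tools::file_path_sans_ext(basename(p))
  file_df <- read.csv(p)[, 1:3]   # keep only the first three columns
  assign(x = file_name, value = file_df, envir = target)
}

print(ls(target))
```

Replacing target with .GlobalEnv reproduces the behaviour of the loop in the answer.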

User · Answer

Here are some options to convert the .csv files into one data.frame using base R and some of the available packages for reading files in R.

This is slower than the options below.

# Get the file names
files = list.files(pattern = "\\.csv$")
# First apply read.csv, then rbind
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))

Edit: A few more choices using data.table and readr.

A fread() version, which is a function of the data.table package. This is by far the fastest option in R.

library(data.table)
DT = do.call(rbind, lapply(files, fread))
# The same using `rbindlist`
DT = rbindlist(lapply(files, fread))

Using readr, which is another package for reading csv files. It's slower than fread and faster than base R, but has different functionalities.

library(readr)
library(dplyr)
tbl = lapply(files, read_csv) %>% bind_rows()
