I would like to read several csv files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:
import glob
import pandas as pd
# get data file names
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop???
This question is related to: python, pandas, csv, dataframe, concatenation
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)
The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute()
to convert the dataframe into a Pandas dataframe.
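For example, a minimal sketch (assuming the files match data*.csv and the combined result fits into memory):
import dask.dataframe as dd

# Lazily read every matching file into a single Dask dataframe
ddf = dd.read_csv('data*.csv')

# Materialize it as a regular Pandas DataFrame
df = ddf.compute()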
This is how you can do it using Colab with Google Drive:
import pandas as pd
import glob
path = r'/content/drive/My Drive/data/actual/comments_only' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=True)
frame.to_csv('/content/drive/onefile.csv')
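If Drive is not mounted yet in the Colab session, it has to be mounted first so the path above exists (a minimal sketch):
from google.colab import drive

# Make Google Drive visible under /content/drive
drive.mount('/content/drive')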
Alternative using the pathlib library (often preferred over os.path). This method avoids iterative use of pandas concat()/append().
From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.
import pandas as pd
from pathlib import Path
dir = Path("../relevant_directory")
df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
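Note that Path.glob() does not guarantee any particular file order; if the concatenation order matters, you can sort the paths first (a small variation of the snippet above):
df = pd.concat(pd.read_csv(f) for f in sorted(dir.glob("*.csv")))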
Import two or more CSVs without having to make a list of names.
import glob
df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
import glob
import os
import pandas as pd
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional 3rd party libraries. You can do this in 2 lines using everything Pandas and python (all versions) already have built in.
For a few files - 1 liner:
df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))
For many files:
from os import listdir

# listdir() returns bare file names, so keep the directory prefix when building the paths
filepaths = ["./data/" + f for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
This pandas line, which sets df, utilizes three things: map(), which applies the function pd.read_csv() to the iterable (our list of every csv path in filepaths), and pd.concat(), which brings the resulting frames together into one df.
An alternative to darindaCoder's answer:
path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv")) # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
If the multiple csv files are zipped, you may use zipfile to read them all and concatenate as below:
import zipfile
import pandas as pd
ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
train = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist()]
df = pd.concat(train)
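If the archive also contains files that are not CSVs, you may want to filter namelist() first (a minor variation of the snippet above):
train = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() if f.endswith('.csv')]
df = pd.concat(train, ignore_index=True)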
Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However, of late I have found it faster to do any manipulation using numpy and then assign it to a dataframe once, rather than manipulating the dataframe itself iteratively, and that seems to work for this solution too.
I do sincerely want anyone hitting this page to consider this approach, but I didn't want to attach this huge piece of code as a comment and make it less readable.
You can leverage numpy to really speed up the dataframe concatenation.
import os
import glob
import pandas as pd
import numpy as np
path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))
np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    np_array_list.append(df.to_numpy())  # as_matrix() was removed from pandas; to_numpy() is the replacement

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.columns = ["col1", "col2", ....]  # fill in your column names
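Instead of typing the column names by hand, you could reuse the header of the first file (a small sketch, assuming all files share the same columns in the same order):
# Read just the header row of the first file to recover the column names
big_frame.columns = pd.read_csv(allFiles[0], nrows=0).columns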
Timing stats:
total files: 192
avg lines per file: 8492
-- approach 1 without numpy -- 8.248656988143921 seconds ---
total records old: 1630571
-- approach 2 with numpy -- 2.289292573928833 seconds ---
Based on @Sid's good answer.
Before concatenating, you can load the csv files into an intermediate dictionary, which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, for example when column names are not aligned.
import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
Note: OrderedDict is not necessary, but it'll keep the order of files, which might be useful for analysis.
dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)
Keys are the file names f and values are the data frame contents of the csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
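For example, a minimal sketch of that variant, where the keys are bare file names and the concatenated frame keeps the file name as the first level of its index:
dict_of_df = OrderedDict((os.path.basename(f), pandas.read_csv(f)) for f in filenames)
frame = pandas.concat(dict_of_df, sort=True)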
A one-liner using map, but if you'd like to specify additional arguments, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional arguments.
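Equivalently, a plain generator expression lets you pass extra arguments without functools (the sep='|' is just the same illustrative argument as above):
df = pd.concat(pd.read_csv(f, sep='|') for f in glob.glob("data/*.csv"))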
If you want to search recursively (Python 3.5 or above), you can do the following:
from glob import iglob
import pandas as pd
path = r'C:\user\your\path\**\*.csv'
all_rec = iglob(path, recursive=True)
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)
Note that the three last lines can be expressed in one single line:
df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
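If you prefer pathlib, a roughly equivalent sketch of the same recursive search (assuming the same directory layout) is:
from pathlib import Path
import pandas as pd

path_dir = Path(r'C:\user\your\path')
df = pd.concat((pd.read_csv(f) for f in path_dir.rglob("*.csv")), ignore_index=True)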
EDIT: Multiplatform recursive function:
You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:
df = read_df_rec(r'C:\user\your\path', r'*.csv')
Here is the function:
from glob import iglob
from os.path import join
import pandas as pd
def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
import pandas as pd
import glob
path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")
file_iter = iter(file_path_list)
list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))
for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))

df = pd.concat(list_df_csv, ignore_index=True)
import os
os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")
Where NR and FNR represent the number of the line being processed. FNR is the current line within each file. NR == 1 includes the first line of the first file (the header), while FNR > 1 skips the first line of each subsequent file.
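The merged file can then be read back into pandas as usual:
import pandas as pd

df = pd.read_csv("merged.csv")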
You can do it this way also:
import pandas as pd
import os
new_df = pd.DataFrame()
for r, d, f in os.walk(csv_folder_path):
    for file in f:
        if file.endswith('.csv'):
            # join against the directory currently being walked, not just the top-level folder
            complete_file_path = os.path.join(r, file)
            read_file = pd.read_csv(complete_file_path)
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
            new_df = pd.concat([new_df, read_file], ignore_index=True)

new_df.shape
Another one-liner with a list comprehension, which allows you to use arguments with read_csv:
import os
import pandas as pd

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
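For instance, with extra read_csv arguments (the sep=';' here is just a hypothetical illustration):
df = pd.concat([pd.read_csv(f'dir/{f}', sep=';') for f in os.listdir('dir') if f.endswith('.csv')])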