How to import multiple csv files in a single load

Question

Suppose I have a defined schema for loading 10 CSV files in a folder. Is there a way to load the tables automatically using Spark SQL? I know this can be done by using an individual DataFrame for each file (as shown below), but can it be automated with a single command? Rather than pointing to a file, can I point to a folder?

    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load("../Downloads/2008.csv"))

User · Answer

Using Spark 2.0+, we can load multiple CSV files from different directories by passing a list of paths:

    df = spark.read.csv(["directory_1", "directory_2", "directory_3", ...], header=True)

For more information, refer to the documentation.

User · Answer

Reader's Digest (Spark 2.x):

For example, if you have 3 directories holding CSV files:

    dir1, dir2, dir3

You then define paths as a string holding a comma-delimited list of paths, as follows:

    paths = "dir1/,dir2/,dir3/*"

Then use the following function and pass it this paths variable:

    def get_df_from_csv_paths(paths):
        # custom_schema must already be defined (e.g. a StructType)
        df = (spark.read.format("csv")
              .option("header", "false")
              .schema(custom_schema)
              .option("delimiter", "\t")
              .option("mode", "DROPMALFORMED")
              .load(paths.split(",")))
        return df

By then running:

    df = get_df_from_csv_paths(paths)

you will obtain in df a single Spark DataFrame containing the data from all the CSVs found in these 3 directories.

Full Version:

In case you want to ingest multiple CSVs from multiple directories, you simply need to pass a list and use wildcards.

For example, if your data_path looks like this:

    's3://bucket_name/subbucket_name/2016-09-*/184/*,
    s3://bucket_name/subbucket_name/2016-10-*/184/*,
    s3://bucket_name/subbucket_name/2016-11-*/184/*,
    s3://bucket_name/subbucket_name/2016-12-*/184/*, ... '

you can use the above function to ingest all the CSVs in all these directories and subdirectories at once.

This would ingest all directories in s3://bucket_name/subbucket_name/ according to the wildcard patterns specified. For example, the first pattern would look in

    bucket_name/subbucket_name/

for all directories with names starting with

    2016-09-

and for each of those take only the directory named

    184

and within that subdirectory look for all CSV files.

This would be executed for each of the patterns in the comma-delimited list.

This works way better than union.
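One practical detail with the comma-delimited approach above: if the path string is written across several lines, a plain split(",") leaves newlines and spaces attached to each pattern. The sketch below trims them before loading. The names split_paths and load_csv_paths are hypothetical, and it assumes spark is an existing SparkSession and schema is a StructType you have already defined.

```python
from typing import Any, List


def split_paths(paths: str) -> List[str]:
    """Split a comma-delimited string of paths, trimming whitespace and newlines."""
    return [p.strip() for p in paths.split(",") if p.strip()]


def load_csv_paths(spark: Any, paths: str, schema: Any, delimiter: str = "\t"):
    """Read every CSV matched by the given paths into a single DataFrame.

    Assumptions: `spark` is an existing SparkSession and `schema` is a
    StructType describing the columns; neither is created here.
    """
    return (
        spark.read.format("csv")
        .option("header", "false")
        .option("delimiter", delimiter)
        .option("mode", "DROPMALFORMED")
        .schema(schema)
        .load(split_paths(paths))  # load() accepts a list of paths
    )
```

Calling load_csv_paths(spark, paths, custom_schema) would then behave like the function in the answer, but tolerate multi-line path strings and trailing commas.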

User · Answer

Ex1: Reading a single CSV file. Provide the complete file path:

    val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\cars1.csv")

Ex2: Reading multiple CSV files, passing names:

    val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")

Ex3: Reading multiple CSV files, passing a list of names:

    val paths = List("C:spark\\sample_data\\tmp\\cars1.csv", "C:spark\\sample_data\\tmp\\cars2.csv")
    val df = spark.read.option("header", "true").csv(paths: _*)

Ex4: Reading multiple CSV files in a folder, ignoring other files:

    val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\tmp\\*.csv")

Ex5: Reading multiple CSV files from multiple folders:

    val folders = List("C:spark\\sample_data\\tmp", "C:spark\\sample_data\\tmp1")
    val df = spark.read.option("header", "true").csv(folders: _*)

User · Answer

The following will consider the files tmp, tmp1, tmp2, and so on:

    val df = spark.read.option("header", "true").csv("C:spark\\sample_data\\*.csv")

User · Answer

Use a wildcard, e.g. replace 2008 with *:

    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load("../Downloads/*.csv"))  # <-- note the star (*)

Spark 2.0:

    # these lines are equivalent in Spark 2.0
    spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
    spark.read.option("header", "true").csv("../Downloads/*.csv")

Notes:

1. Replace format("com.databricks.spark.csv") with format("csv"), or use the csv method instead. The com.databricks.spark.csv format has been integrated into Spark 2.0.
2. Use spark, not sqlContext.
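Since the question mentions a predefined schema and pointing at a folder, here is a minimal sketch that combines the wildcard with an optional schema. The helper names csv_glob and read_csv_folder are hypothetical, and the sketch assumes spark is an existing SparkSession (Spark 2.0 or later) and that schema, if given, is a StructType you built yourself.

```python
from typing import Any, Optional


def csv_glob(folder: str) -> str:
    """Build a pattern matching every .csv file directly inside `folder`."""
    return folder.rstrip("/") + "/*.csv"


def read_csv_folder(spark: Any, folder: str, schema: Optional[Any] = None):
    """Read all CSV files in `folder` into one DataFrame.

    Assumption: `spark` is an existing SparkSession; `schema` is an
    optional StructType applied to every file.
    """
    reader = spark.read.option("header", "true")
    if schema is not None:
        reader = reader.schema(schema)
    return reader.csv(csv_glob(folder))
```

For instance, read_csv_folder(spark, "../Downloads", custom_schema) would read the pattern ../Downloads/*.csv, so non-CSV files in the folder are skipped.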

User · Answer

Note that you can use other tricks like:

-- One or more wildcards:

    .../Downloads20*/*.csv

-- Braces and brackets:

    .../Downloads201[1-5]/book.csv
    .../Downloads201{11,15,19,99}/book.csv
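To get a feel for which directory names these patterns select, the following plain-Python sketch approximates the matching for a single path component. It is only an illustration: Spark resolves the patterns itself through Hadoop's glob support, and the helpers expand_braces and matches are hypothetical names built on Python's fnmatch, which does not handle braces natively and does not treat "/" specially.

```python
import fnmatch
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand {a,b,c} groups into a list of brace-free patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        expanded.extend(expand_braces(head + option + tail))
    return expanded


def matches(pattern: str, name: str) -> bool:
    """Check one path component against a pattern using *, [..] and {..}."""
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))


# Directory names used purely for illustration.
names = ["Downloads2008", "Downloads2011", "Downloads2016", "Downloads20115"]
wildcard_hits = [n for n in names if matches("Downloads20*", n)]
bracket_hits = [n for n in names if matches("Downloads201[1-5]", n)]
brace_hits = [n for n in names if matches("Downloads201{11,15,19,99}", n)]
```

Here the wildcard pattern selects all four names, the bracket pattern selects only Downloads2011, and the brace pattern selects only Downloads20115.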
