[scala] Provide schema while reading csv file as a dataframe

I am trying to read a CSV file into a DataFrame. I know what the schema of my DataFrame should be since I know my CSV file. I am also using the spark-csv package to read the file. I am trying to specify the schema like below.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

But when I check the schema of the DataFrame I created, it seems to have inferred its own schema. Am I doing anything wrong? How do I make Spark pick up the schema I specified?

> pagecount.printSchema
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)

Tags: scala, apache-spark, dataframe, apache-spark-sql, spark-csv

Answers:


Schema definition as a simple string

Just in case someone is interested in defining the schema as a simple string with date and timestamp columns:

Data file creation from the terminal or shell:

echo " 
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc  
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz 
" > data.csv

Defining the schema as a string:

    user_schema = 'timesta TIMESTAMP, date DATE, first_name STRING, last_name STRING'

Reading the data:

    df = spark.read.csv(path='data.csv', schema=user_schema, sep=',', dateFormat='MM/dd/yyyy', timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')

    df.show(10, False)

    +-----------------------+----------+----------+---------+
    |timesta                |date      |first_name|last_name|
    +-----------------------+----------+----------+---------+
    |2019-07-02 22:11:11.999|2019-01-01| Suresh   | abc     |
    |2019-01-02 22:11:11.001|2020-01-01| Aadi     | xyz     |
    +-----------------------+----------+----------+---------+

Please note that defining the schema explicitly, instead of letting Spark infer it, also improves read performance, since Spark does not need an extra pass over the data to infer the types.


In PySpark 2.4 onwards, you can simply use the header parameter to set the correct header:

data = spark.read.csv('data.csv', header=True)

Similarly, if you are using Scala you can use the header option as well.
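A minimal Scala sketch (assuming a SparkSession named spark and the same data.csv with a header row):

// take the column names from the file's first row
val data = spark.read
  .option("header", "true")
  .csv("data.csv")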


You can also do it like this, using sparkSession and the implicits:

import org.apache.spark.sql.DataFrame
import sparkSession.implicits._

val pagecount: DataFrame = sparkSession.read
  .option("delimiter", " ")
  .option("quote", "")
  .option("inferSchema", "true")
  .csv("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
  .toDF("project", "article", "requests", "bytes_served")


Try the code below; you need not specify the schema. When you set inferSchema to true, Spark should infer it from your CSV file.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to manually specify the schema, you can do it as below:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("project", StringType, true),
  StructField("article", StringType, true),
  StructField("requests", IntegerType, true),
  StructField("bytes_served", DoubleType, true))
)

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .schema(customSchema)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

Here's how you can work with a custom schema, a complete demo:

Shell code:

echo "
Slingo, iOS 
Slingo, Android
" > game.csv

Scala code:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{asc, desc}

val customSchema = StructType(Array(
  StructField("game_id", StringType, true),
  StructField("os_id", StringType, true)
))

val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show 

csv_df.orderBy(asc("game_id"), desc("os_id")).show
csv_df.createOrReplaceTempView("game_view")
val sort_df = spark.sql("select * from game_view order by game_id, os_id desc")
sort_df.show 

For those interested in doing this in Python, here is a working version.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

customSchema = StructType([
    StructField("IDGC", StringType(), True),        
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])
productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)

testProduct.csv
ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66

Hope this helps.


I'm using the solution provided by Arunakiran Nulu in my analysis (see the code). Although it is able to assign the correct types to the columns, all the values returned are null. Previously, I had tried the option .option("inferSchema", "true") and it returned the correct values in the dataframe (although with different types).

val customSchema = StructType(Array(
    StructField("numicu", StringType, true),
    StructField("fecha_solicitud", TimestampType, true),
    StructField("codtecnica", StringType, true),
    StructField("tecnica", StringType, true),
    StructField("finexploracion", TimestampType, true),
    StructField("ultimavalidacioninforme", TimestampType, true),
    StructField("validador", StringType, true)))

val df_explo = spark.read
        .format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") 
        .schema(customSchema)
        .load(filename)

Result

root
 |-- numicu: string (nullable = true)
 |-- fecha_solicitud: timestamp (nullable = true)
 |-- codtecnica: string (nullable = true)
 |-- tecnica: string (nullable = true)
 |-- finexploracion: timestamp (nullable = true)
 |-- ultimavalidacioninforme: timestamp (nullable = true)
 |-- validador: string (nullable = true)

and the table is:

+------+---------------+----------+-------+--------------+-----------------------+---------+
|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
+------+---------------+----------+-------+--------------+-----------------------+---------+
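A likely cause (my assumption, not stated in the original answer) is that the supplied timestampFormat does not match the timestamps actually present in the file: in the default PERMISSIVE mode, values that fail to parse against the declared schema are silently returned as null. A quick way to surface the parse error instead of silent nulls is to read with FAILFAST mode (a sketch reusing customSchema and filename from above):

// Same read, but throw on the first malformed row instead of nulling it out
val df_check = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
  .option("mode", "FAILFAST") // PERMISSIVE (default) nulls malformed values; FAILFAST throws
  .schema(customSchema)
  .load(filename)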

This is one option where we can pass the column names to the dataframe while loading the CSV (using pandas):

import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))

Output

    sepal-length  sepal-width  petal-length  petal-width        class
1            5.1          3.5           1.4          0.2  Iris-setosa
2            4.9          3.0           1.4          0.2  Iris-setosa
3            4.7          3.2           1.3          0.2  Iris-setosa
4            4.6          3.1           1.5          0.2  Iris-setosa
5            5.0          3.6           1.4          0.2  Iris-setosa
6            5.4          3.9           1.7          0.4  Iris-setosa
7            4.6          3.4           1.4          0.3  Iris-setosa
8            5.0          3.4           1.5          0.2  Iris-setosa
9            4.4          2.9           1.4          0.2  Iris-setosa
10           4.9          3.1           1.5          0.1  Iris-setosa

// import Library
import java.io.StringReader ;

import au.com.bytecode.opencsv.CSVReader

//filename

var train_csv = "/Path/train.csv";

//read as text file

val train_rdd = sc.textFile(train_csv)   

//use opencsv's CSVReader to parse each line into an array of fields

var full_train_data = train_rdd.map { line =>
  val csvReader = new CSVReader(new StringReader(line))
  csvReader.readNext()
}

//declare a type alias for brevity

type s = String

// declare case class for schema

case class trainSchema (Loan_ID :s ,Gender :s, Married :s, Dependents :s,Education :s,Self_Employed :s,ApplicantIncome :s,CoapplicantIncome :s,
    LoanAmount :s,Loan_Amount_Term :s, Credit_History :s, Property_Area :s,Loan_Status :s)

//create a DataFrame from the RDD with the case-class schema
//toDF needs the SparkSession implicits (available as `spark` in spark-shell)
import spark.implicits._

var full_train_data_with_schema = full_train_data.mapPartitionsWithIndex { (idx, itr) =>
  val rows = if (idx == 0) itr.drop(1) else itr // drop the header row in the first partition
  rows.map(x => trainSchema(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12)))
}.toDF
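A quick usage check (a sketch; the column names come from the case class above):

// every column is a string, since the case class declares them all as String
full_train_data_with_schema.printSchema()
full_train_data_with_schema.select("Loan_ID", "Gender", "Loan_Status").show(5)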


Thanks to the answer by @Nulu, it works for PySpark with minimal tweaking:

from pyspark.sql.types import StringType, IntegerType, DoubleType, StructField, StructType

customSchema = StructType([
    StructField("project", StringType(), True),
    StructField("article", StringType(), True),
    StructField("requests", IntegerType(), True),
    StructField("bytes_served", DoubleType(), True)])

pagecount = (spark.read.format("com.databricks.spark.csv")
             .option("delimiter", " ")
             .option("quote", "")
             .option("header", "false")
             .schema(customSchema)
             .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000"))

Here is my solution:

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame

val spark = org.apache.spark.sql.SparkSession.builder.
  master("local[*]").
  appName("Spark CSV Reader").
  getOrCreate()

val movie_rating_schema = StructType(Array(
  StructField("UserID", IntegerType, true),
  StructField("MovieID", IntegerType, true),
  StructField("Rating", DoubleType, true),
  StructField("Timestamp", TimestampType, true)))

val df_ratings: DataFrame = spark.read.format("csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  option("delimiter", ",").
  //option("inferSchema", "true").
  option("nullValue", "null").
  schema(movie_rating_schema).
  load(args(0)) //"file:///home/hadoop/spark-workspace/data/ml-20m/ratings.csv"

val movie_avg_scores = df_ratings.rdd.map(_.toString()).
  map(line => {
    // drop "[", "]" and then split the str 
    val fileds = line.substring(1, line.length() - 1).split(",")
    //extract (movie id, average rating)
    (fileds(1).toInt, fileds(2).toDouble)
  }).
  groupByKey().
  map(data => {
    val avg: Double = data._2.sum / data._2.size
    (data._1, avg)
  })
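The same per-movie average can also be computed directly with the DataFrame API instead of dropping to the RDD (a sketch, assuming the movie_rating_schema read above):

import org.apache.spark.sql.functions.avg

// average rating per movie via groupBy/agg, no string parsing needed
val movie_avg_scores_df = df_ratings
  .groupBy("MovieID")
  .agg(avg("Rating").as("avg_rating"))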

The previous solutions have used a custom StructType.

With spark-sql 2.4.5 (Scala version 2.12.10) it is now possible to specify the schema as a simple string using the schema function:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("sample-app")
  .master("local[2]")
  .getOrCreate()

val pageCount = sparkSession.read
  .format("csv")
  .option("delimiter","|")
  .option("quote","")
  .schema("project string ,article string ,requests integer ,bytes_served long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
