How can I change column types in Spark SQL's DataFrame?

Question

Suppose I'm doing something like:

    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
    df.printSchema()

    root
     |-- year: string (nullable = true)
     |-- make: string (nullable = true)
     |-- model: string (nullable = true)
     |-- comment: string (nullable = true)
     |-- blank: string (nullable = true)

    df.show()
    year make  model comment              blank
    2012 Tesla S     No comment
    1997 Ford  E350  Go get one now th...

But I really wanted the year as Int (and perhaps transform some other columns).

The best I could come up with was:

    df.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment, 'blank)
    org.apache.spark.sql.DataFrame = [year: int, make: string, model: string, comment: string, blank: string]

which is a bit convoluted.

I'm coming from R, and I'm used to being able to write, e.g.

    df2 <- df %>%
       mutate(year = year %>% as.integer,
              make = make %>% toupper)

I'm likely missing something, since there should be a better way to do this in Spark/Scala.

User · Answer

To convert the year from string to int, you can add the following option to the CSV reader: "inferSchema" -> "true" (see the Databricks documentation).
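
As a rough illustration of the option above, here is a minimal sketch using the CSV reader built into Spark 2.0 and later (an assumption; the question uses the older com.databricks.spark.csv package). The temp file and its sample rows are made up so the snippet runs on its own, either as a script or pasted into spark-shell.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("infer-schema-sketch")
  .getOrCreate()

// Write a tiny stand-in for cars.csv (hypothetical sample data).
val csvPath = Files.createTempFile("cars", ".csv")
Files.write(csvPath, "year,make,model\n2012,Tesla,S\n1997,Ford,E350\n".getBytes("UTF-8"))

val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // costs an extra pass over the data
  .csv(csvPath.toString)

// year should now be reported as integer rather than string.
inferred.printSchema()
```

Inference picks a type per column from the values it sees, so a column with a single stray non-numeric value will stay a string; an explicit cast is still needed in that case.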

User · Answer

Why not just do as described under http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.cast

    df.select(df.year.cast("int"), "make", "model", "comment", "blank")

User · Answer

You can use the code below:

    df.withColumn("year", df("year").cast(IntegerType))

This will convert the year column to an IntegerType column.

User · Answer

    df.select($"long_col".cast(IntegerType).as("int_col"))

User · Answer

To the answers suggesting to use cast: FYI, the cast method in Spark 1.4.1 is broken.

For example, a dataframe with a string column having value "8182175552014127960", when cast to bigint, has value "8182175552014128100":

    df.show
    +-------------------+
    |                  a|
    +-------------------+
    |8182175552014127960|
    +-------------------+

    df.selectExpr("cast(a as bigint) a").show
    +-------------------+
    |                  a|
    +-------------------+
    |8182175552014128100|
    +-------------------+

We had to face a lot of issues before finding this bug because we had bigint columns in production.

User · Answer

    val fact_df = df.select($"data"(30) as "TopicTypeId", $"data"(31) as "TopicId", $"data"(21).cast(FloatType).as("Data_Value_Std_Err")).rdd

    // Schema to be applied to the table
    val fact_schema = (new StructType).add("TopicTypeId", StringType).add("TopicId", StringType).add("Data_Value_Std_Err", FloatType)

    val fact_table = sqlContext.createDataFrame(fact_df, fact_schema).dropDuplicates()

User · Answer

This method will drop the old column and create new columns with the same values and a new datatype. My original datatypes when the DataFrame was created were:

    root
     |-- id: integer (nullable = true)
     |-- flag1: string (nullable = true)
     |-- flag2: string (nullable = true)
     |-- name: string (nullable = true)
     |-- flag3: string (nullable = true)

After this I ran the following code to change the datatype:

    df = df.withColumnRenamed(<old column name>, <dummy column>) // This was done for both flag1 and flag3
    df = df.withColumn(<old column name>, df.col(<dummy column>).cast(<datatype>)).drop(<dummy column>)

After this my result came out to be:

    root
     |-- id: integer (nullable = true)
     |-- flag2: string (nullable = true)
     |-- name: string (nullable = true)
     |-- flag1: boolean (nullable = true)
     |-- flag3: boolean (nullable = true)

User · Answer

In case you have to cast dozens of columns given by their name, the following example takes the approach of @dnlbrky and applies it to several columns at once:

    df.selectExpr(df.columns.map(cn => {
        if (Set("speed", "weight", "height").contains(cn)) s"cast($cn as double) as $cn"
        else if (Set("isActive", "hasDevice").contains(cn)) s"cast($cn as boolean) as $cn"
        else cn
    }):_*)

Uncasted columns are kept unchanged. All columns stay in their original order.

User · Answer

Using Spark SQL 2.4.0 you can do that:

    spark.sql("SELECT STRING(NULLIF(column, '')) as column_string")

User · Answer

You can use selectExpr to make it a little cleaner:

    df.selectExpr("cast(year as int) as year", "upper(make) as make",
        "model", "comment", "blank")

User · Answer

As the cast operation is available for Spark Columns (and as I personally do not favour UDFs as proposed by @Svend at this point), how about:

    df.select( df("year").cast(IntegerType).as("year"), ... )

to cast to the requested type? As a neat side effect, values not castable / "convertable" in that sense will become null.

In case you need this as a helper method, use:

    object DFHelper {
      def castColumnTo( df: DataFrame, cn: String, tpe: DataType ) : DataFrame = {
        df.withColumn( cn, df(cn).cast(tpe) )
      }
    }

which is used like:

    import DFHelper._
    val df2 = castColumnTo( df, "year", IntegerType )
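
The "becomes null" side effect mentioned above can be checked with a small sketch. It assumes ANSI mode is off, which the snippet sets explicitly; with spark.sql.ansi.enabled set to true (available in Spark 3.x), an invalid cast raises an error instead. The sample values, including the bad "n/a" entry, are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-null-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: legacy (non-ANSI) cast semantics, where bad input yields null.
spark.conf.set("spark.sql.ansi.enabled", "false")

// "n/a" is a hypothetical value that cannot be parsed as an integer.
val raw = Seq("2012", "1997", "n/a").toDF("year")
val casted = raw.withColumn("year", col("year").cast(IntegerType))

// Count the rows that failed to convert so they are not silently lost.
val failed = casted.filter(col("year").isNull).count()
casted.show()
```

Counting nulls before and after the cast is a cheap way to notice values that did not convert, since the cast itself reports nothing in this mode.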

User · Answer

If you want to change multiple columns of a specific type to another without specifying individual column names:

    /* Get names of all columns that you want to change type.
       In this example I want to change all columns of type Array to String */
    val arrColsNames = originalDataFrame.schema.fields.filter(f => f.dataType.isInstanceOf[ArrayType]).map(_.name)

    // iterate columns you want to change type and cast to the required type
    val updatedDataFrame = arrColsNames.foldLeft(originalDataFrame){(tempDF, colName) => tempDF.withColumn(colName, tempDF.col(colName).cast(DataTypes.StringType))}

    // display
    updatedDataFrame.show(truncate = false)

User · Answer

Java code for modifying the datatype of a DataFrame column from String to Integer:

    df.withColumn("col_name", df.col("col_name").cast(DataTypes.IntegerType))

It will simply cast the existing String datatype to Integer.

User · Answer

Edit: Newest version

Since Spark 2.x you can use .withColumn. Check the docs here:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame

Oldest answer

Since Spark version 1.4 you can apply the cast method with DataType on the column:

    import org.apache.spark.sql.types.IntegerType
    val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
        .drop("year")
        .withColumnRenamed("yearTmp", "year")

If you are using SQL expressions you can also do:

    val df2 = df.selectExpr("cast(year as int) year",
                            "make",
                            "model",
                            "comment",
                            "blank")

For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame

User · Answer

[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer. I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]

I think your approach is OK. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we're never really replacing a column, just creating a new DataFrame each time with a new schema.

Assuming you have an original df with the following schema:

    scala> df.printSchema
    root
     |-- Year: string (nullable = true)
     |-- Month: string (nullable = true)
     |-- DayofMonth: string (nullable = true)
     |-- DayOfWeek: string (nullable = true)
     |-- DepDelay: string (nullable = true)
     |-- Distance: string (nullable = true)
     |-- CRSDepTime: string (nullable = true)

And some UDFs defined on one or several columns:

    import org.apache.spark.sql.functions._

    val toInt    = udf[Int, String]( _.toInt)
    val toDouble = udf[Double, String]( _.toDouble)
    val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
    val days_since_nearest_holidays = udf(
      (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt-12
    )

Changing column types or even building a new DataFrame from another can be written like this:

    val featureDf = df
      .withColumn("departureDelay", toDouble(df("DepDelay")))
      .withColumn("departureHour",  toHour(df("CRSDepTime")))
      .withColumn("dayOfWeek",      toInt(df("DayOfWeek")))
      .withColumn("dayOfMonth",     toInt(df("DayofMonth")))
      .withColumn("month",          toInt(df("Month")))
      .withColumn("distance",       toDouble(df("Distance")))
      .withColumn("nearestHoliday", days_since_nearest_holidays(
                    df("Year"), df("Month"), df("DayofMonth"))
                  )
      .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
              "month", "distance", "nearestHoliday")

which yields:

    scala> featureDf.printSchema
    root
     |-- departureDelay: double (nullable = true)
     |-- departureHour: integer (nullable = true)
     |-- dayOfWeek: integer (nullable = true)
     |-- dayOfMonth: integer (nullable = true)
     |-- month: integer (nullable = true)
     |-- distance: double (nullable = true)
     |-- nearestHoliday: integer (nullable = true)

This is pretty close to your own solution. Simply keeping the type changes and other transformations as separate udf vals makes the code more readable and reusable.

User · Answer

So this only really works if you're having issues saving to a JDBC driver like SQL Server, but it's really helpful for errors you will run into with syntax and types.

    import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
    import org.apache.spark.sql.types._

    // Custom dialect that maps Spark SQL types to SQL Server column types
    val SQLServerDialect = new JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:jtds:sqlserver") || url.contains("sqlserver")

      override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
        case StringType => Some(JdbcType("VARCHAR(5000)", java.sql.Types.VARCHAR))
        case BooleanType => Some(JdbcType("BIT(1)", java.sql.Types.BIT))
        case IntegerType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
        case LongType => Some(JdbcType("BIGINT", java.sql.Types.BIGINT))
        case DoubleType => Some(JdbcType("DOUBLE PRECISION", java.sql.Types.DOUBLE))
        case FloatType => Some(JdbcType("REAL", java.sql.Types.REAL))
        case ShortType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
        case ByteType => Some(JdbcType("INTEGER", java.sql.Types.INTEGER))
        case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
        case TimestampType => Some(JdbcType("DATE", java.sql.Types.DATE))
        case DateType => Some(JdbcType("DATE", java.sql.Types.DATE))
        //      case DecimalType.Fixed(precision, scale) => Some(JdbcType("NUMBER(" + precision + "," + scale + ")", java.sql.Types.NUMERIC))
        case t: DecimalType => Some(JdbcType(s"DECIMAL(${t.precision},${t.scale})", java.sql.Types.DECIMAL))
        case _ => throw new IllegalArgumentException(s"Don't know how to save ${dt.json} to JDBC")
      }
    }

    JdbcDialects.registerDialect(SQLServerDialect)

User · Answer

Another solution is as follows:

1) Keep "inferSchema" as false.
2) While running 'Map' functions on the row, you can read 'asString' (row.getString...)

    // Read CSV and create dataset
    Dataset<Row> enginesDataSet = sparkSession
                .read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "false")
                .load(args[0]);

    JavaRDD<Box> vertices = enginesDataSet
                .select("BOX", "BOX_CD")
                .toJavaRDD()
                .map(new Function<Row, Box>() {
                    @Override
                    public Box call(Row row) throws Exception {
                        return new Box((String) row.getString(0), (String) row.get(1));
                    }
                });

User · Answer

Generate a simple dataset containing five values and convert int to string type:

    val df = spark.range(5).select( col("id").cast("string") )

User · Answer

One can change the data type of a column by using cast in Spark SQL. Suppose the table name is table, it has only two columns, column1 and column2, and the data type of column1 is to be changed:

    spark.sql("select cast(column1 as Double) column1NewName, column2 from table")

In place of Double, write your data type.
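
The query above presumes the table is already visible to Spark SQL. A minimal sketch of one way to get there from a DataFrame is to register it as a temporary view first (createOrReplaceTempView, Spark 2.0+). The view name my_table and the sample rows are hypothetical, chosen only to keep the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-cast-sketch")
  .getOrCreate()
import spark.implicits._

// Two string columns, mirroring column1 and column2 from the answer.
val source = Seq(("1.5", "a"), ("2", "b")).toDF("column1", "column2")

// Expose the DataFrame to SQL under a hypothetical view name.
source.createOrReplaceTempView("my_table")

val converted = spark.sql(
  "SELECT CAST(column1 AS DOUBLE) AS column1NewName, column2 FROM my_table")

converted.printSchema()
```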

User · Answer

Another way:

    // Generate a simple dataset containing five values and convert int to string type
    val df = spark.range(5).select( col("id").cast("string") ).withColumnRenamed("id", "value")

User · Answer

I think this is a lot more readable for me.

    import org.apache.spark.sql.types._
    df.withColumn("year", df("year").cast(IntegerType))

This will convert your year column to IntegerType without creating any temporary columns and dropping those columns. If you want to convert to any other datatype, you can check the types inside the org.apache.spark.sql.types package.

User · Answer

So many answers and not many thorough explanations.

The following syntax works using a Databricks notebook with Spark 2.4:

    from pyspark.sql.functions import *
    df = df.withColumn("COL_NAME", to_date(BLDFm["LOAD_DATE"], "MM-dd-yyyy"))

Note that you have to specify the entry format you have (in my case "MM-dd-yyyy"), and the import is mandatory as to_date is a Spark SQL function.

I also tried this syntax but got nulls instead of a proper cast:

    df = df.withColumn("COL_NAME", df["COL_NAME"].cast("Date"))

(Note I had to use brackets and quotes for it to be syntactically correct, though.)

PS: I have to admit this is like a syntax jungle; there are many possible entry points, and the official API references lack proper examples.
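
For readers following the thread in Scala, a rough counterpart of the to_date call above might look like this. It assumes Spark 2.2 or later, where to_date accepts a format string; the sample value is made up. A plain cast to date expects ISO-style yyyy-MM-dd strings, which is likely why the second attempt in this answer produced nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.DateType

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("to-date-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input in month-day-year order, as in the answer.
val loads = Seq("03-15-2019").toDF("LOAD_DATE")

// Parse with an explicit pattern instead of relying on cast("date").
val parsed = loads.withColumn("LOAD_DATE", to_date(col("LOAD_DATE"), "MM-dd-yyyy"))

parsed.printSchema()
```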

User · Answer

First, if you want to cast a type, then this:

    import org.apache.spark.sql
    df.withColumn("year", $"year".cast(sql.types.IntegerType))

With the same column name, the column will be replaced with the new one. You don't need to do add and delete steps.

Second, about Scala vs R. This is the code most similar to R that I can come up with:

    val df2 = df.select(
       df.columns.map {
         case year @ "year" => df(year).cast(IntegerType).as(year)
         case make @ "make" => functions.upper(df(make)).as(make)
         case other         => df(other)
       }: _*
    )

The code is a little longer than R's, but that has nothing to do with the verbosity of the language. In R, mutate is a special function for R data frames, while in Scala you can easily write an ad-hoc one thanks to the language's expressive power. In a word, it avoids specific solutions, because the language design is good enough for you to quickly and easily build your own domain language.

Side note: df.columns is surprisingly an Array[String] instead of an Array[Column]; maybe they want it to look like Python pandas' dataframe.
