[python] How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain 'nan', so when I read the data and check the schema of the dataframe, those columns come out as 'string' type. How can I change them to int type? I replaced the 'nan' values with 0 and checked the schema again, but it still shows the string type for those columns. I am using the code below:

data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

My data looks like this: [screenshot of the dataframe]

Here the columns 'Plays' and 'drafts' contain integer values, but because of the 'nan' entries present in these columns they are treated as string type.

This question is related to: python, dataframe, pyspark

The answer is


from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You could run a loop over the columns, but this is the simplest way to convert a string column into an integer column.
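
A minimal sketch of the loop variant mentioned above, assuming data_df from the question and that the list of integer-valued column names is known:

from pyspark.sql.types import IntegerType

int_cols = ["Plays", "drafts"]  # columns known to hold integer values
for c in int_cols:
    # cast() turns any value that cannot be parsed as an integer (e.g. 'nan') into null
    data_df = data_df.withColumn(c, data_df[c].cast(IntegerType()))
data_df.printSchema()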


You could cast the column to int after replacing NaN with 0, for example:

data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
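
Note that in the questioner's setup the column is string-typed, so fillna(0) does not touch the literal 'nan' values. A minimal sketch that casts first (turning 'nan' into null) and then fills, using the "Plays" column from the question:

data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
data_df = data_df.fillna(0, subset=["Plays"])  # the nulls produced by the cast become 0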

Another way to do it is to use StructField, if you have multiple fields that need to be modified.

Ex:

from pyspark.sql.types import StructField, IntegerType, StructType, StringType

# Declare the desired type for every field up front.
newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)
         ]
finalStruct = StructType(fields=newDF)
# Read the file with the explicit schema instead of inferring one.
df = spark.read.csv('ctor.csv', schema=finalStruct)

Output:

Before

root
 |-- CLICK_FLG: string (nullable = true)
 |-- OPEN_FLG: string (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)

After:

root
 |-- CLICK_FLG: integer (nullable = true)
 |-- OPEN_FLG: integer (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)

This is a slightly longer procedure for casting, but the advantage is that all the required fields can be converted at once.

Note that if the data types are assigned only to the required fields in the schema, the resulting dataframe will contain only those fields.
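
If you want to keep every column while overriding only a few types, one hedged workaround is to start from the inferred schema and patch just the fields you care about. The column names and the header=True option below are assumptions based on the example above:

from pyspark.sql.types import StructType, StructField, IntegerType

inferred = spark.read.csv('ctor.csv', header=True, inferSchema=True).schema
to_int = {'CLICK_FLG', 'OPEN_FLG'}  # fields to force to integer
full_schema = StructType([
    StructField(f.name, IntegerType() if f.name in to_int else f.dataType, f.nullable)
    for f in inferred.fields
])
df = spark.read.csv('ctor.csv', header=True, schema=full_schema)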

