Converting Pandas dataframe into Spark dataframe error

Question

I'm trying to convert a Pandas DF into a Spark one. DF head:

    10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
    10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
    10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691

Code:

    dataset = pd.read_csv("data/AS/test_v2.csv")
    sc = SparkContext(conf=conf)
    sqlCtx = SQLContext(sc)
    sdf = sqlCtx.createDataFrame(dataset)

And I got an error:

    TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

User · Answer

I have tried this with your data and it is working:

    %pyspark
    import pandas as pd
    from pyspark.sql import SQLContext

    # sc is the SparkContext that the notebook or pyspark shell already provides
    print(sc)

    df = pd.read_csv("test.csv")
    print(type(df))
    print(df)

    sqlCtx = SQLContext(sc)
    sqlCtx.createDataFrame(df).show()

User · Answer

Type-related errors can be avoided by imposing a schema as follows (note: a text file, test.csv, was created with the original data as above, and hypothetical column names were inserted: "col1", "col2", ..., "col25").

    import pyspark
    from pyspark.sql import SparkSession
    import pandas as pd

    spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

    pdDF = pd.read_csv("test.csv")

Contents of the pandas data frame:

           col1  col2  col3  col4   col5 col6   col7  col8  ...
    0  10000001     1     0     1  12:35   OK  10002     1  ...
    1  10000001     2     0     1  12:36   OK  10002     1  ...
    2  10000002     1     0     4  12:19   PA  10003     1  ...

Next, create the schema:

    from pyspark.sql.types import *

    mySchema = StructType([
        StructField("col1", LongType(), True),
        StructField("col2", IntegerType(), True),
        StructField("col3", IntegerType(), True),
        StructField("col4", IntegerType(), True),
        StructField("col5", StringType(), True),
        StructField("col6", StringType(), True),
        StructField("col7", IntegerType(), True),
        StructField("col8", IntegerType(), True),
        StructField("col9", IntegerType(), True),
        StructField("col10", IntegerType(), True),
        StructField("col11", StringType(), True),
        StructField("col12", StringType(), True),
        StructField("col13", IntegerType(), True),
        StructField("col14", IntegerType(), True),
        StructField("col15", IntegerType(), True),
        StructField("col16", IntegerType(), True),
        StructField("col17", IntegerType(), True),
        StructField("col18", IntegerType(), True),
        StructField("col19", IntegerType(), True),
        StructField("col20", IntegerType(), True),
        StructField("col21", IntegerType(), True),
        StructField("col22", IntegerType(), True),
        StructField("col23", IntegerType(), True),
        StructField("col24", IntegerType(), True),
        StructField("col25", IntegerType(), True),
    ])

Note: True implies nullable is allowed.

Create the pyspark dataframe:

    df = spark.createDataFrame(pdDF, schema=mySchema)

Confirm the pandas data frame is now a pyspark data frame:

    type(df)

Output:

    pyspark.sql.dataframe.DataFrame

Aside: to address Kate's comment below, you can impose a general (String) schema as follows:

    df = spark.createDataFrame(pdDF.astype(str))
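One caveat when declaring a column such as col12 as a string: by default pandas.read_csv treats the literal text NA as a missing value, so the column arrives as floating-point NaN rather than the string "NA", and a StringType field may then reject it. The sketch below shows the difference that keep_default_na=False makes. The three-column sample and its comma separator are assumptions made for illustration, not the asker's actual file.

```python
import io

import pandas as pd

# Hypothetical three-column excerpt; the real file has 25 columns.
csv_text = "col10,col11,col12\n9,f,NA\n9,f,NA\n7,f,NA\n"

# Default parsing: the token "NA" is read as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: "NA" is kept as ordinary text.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["col12"].isna().all())
print(literal_df["col12"].tolist())
```

Whether "NA" should be a real value or a null depends on the data; if it is meant as a null, leaving the default parsing and converting NaN to None before the conversion is the other option.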

User · Answer

In Spark version >= 3 you can convert a pandas dataframe to a pyspark dataframe in one line: use spark.createDataFrame(pandasDF).

    import pandas as pd

    dataset = pd.read_csv("data/AS/test_v2.csv")
    sparkDf = spark.createDataFrame(dataset)

If you are confused about the spark session variable, the spark session is created as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
    spark = SparkSession \
        .builder \
        .getOrCreate()

User · Answer

You need to make sure your pandas dataframe columns are appropriate for the type Spark is inferring. If your pandas dataframe lists something like:

    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5062 entries, 0 to 5061
    Data columns (total 51 columns):
    SomeCol                    5062 non-null object
    Col2                       5062 non-null object

And you're getting that error, try:

    df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)

Now, make sure .astype(str) is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type from an object in Python, it uses some observations and makes a guess. If that guess doesn't apply to all the data in the column(s) it's trying to convert from pandas to Spark, it will fail.
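To see which columns are likely to trip the inference, it can help to list the object columns that hold more than one Python type. The helper below is a sketch only: mixed_type_columns and the sample frame are hypothetical names, not part of pandas or PySpark.

```python
import pandas as pd


def mixed_type_columns(df):
    """Map each object column holding more than one Python type to its type names."""
    report = {}
    for name in df.columns:
        if df[name].dtype != object:
            continue  # numeric, datetime, etc. columns hold a single type
        kinds = sorted({type(value).__name__ for value in df[name]})
        if len(kinds) > 1:
            report[name] = kinds
    return report


# Hypothetical sample: "code" mixes strings with a float NaN.
sample = pd.DataFrame({
    "code": pd.Series(["f", float("nan"), "g"], dtype=object),
    "count": [1, 2, 3],
})

print(mixed_type_columns(sample))
```

A column reported with both str and float is a typical candidate for the "Can not merge type" error, since a NaN in an otherwise textual column is a float.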

User · Answer

I received a similar error message once. In my case it was because my pandas dataframe contained NULLs. I recommend handling these in pandas before converting to Spark; this resolved the issue in my case.
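One possible way to handle the NULLs on the pandas side is to replace NaN in object columns with None, which Spark reads as a null rather than as a float. This is a minimal sketch, assuming the object columns hold scalar values; nulls_to_none and the sample frame are hypothetical names used for illustration.

```python
import pandas as pd


def nulls_to_none(df):
    """Return a copy in which missing values in object columns are None, not NaN."""
    out = df.copy()
    for name in out.columns:
        if out[name].dtype == object:
            values = [None if pd.isna(v) else v for v in out[name]]
            # dtype=object keeps None as None instead of converting it back to NaN
            out[name] = pd.Series(values, index=out.index, dtype=object)
    return out


# Hypothetical sample: a text column with one missing entry.
raw = pd.DataFrame({
    "code": pd.Series(["f", float("nan"), "g"], dtype=object),
    "count": [1, 2, 3],
})

cleaned = nulls_to_none(raw)
print(cleaned["code"].tolist())
```

Filling with a placeholder instead, for example fillna("") on the text columns, is an alternative when a null is not wanted in the Spark dataframe.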

User · Answer

I made this script. It worked for my 10 pandas data frames.

    from pyspark.sql.types import *

    # Auxiliary functions
    def equivalent_type(f):
        # Map a pandas dtype to the corresponding Spark type
        if f == 'datetime64[ns]': return TimestampType()
        elif f == 'int64': return LongType()
        elif f == 'int32': return IntegerType()
        # float64 is a 64-bit float, so DoubleType (not FloatType) avoids losing precision
        elif f == 'float64': return DoubleType()
        else: return StringType()

    def define_structure(string, format_type):
        # Fall back to StringType if the dtype cannot be mapped
        try: typo = equivalent_type(format_type)
        except Exception: typo = StringType()
        return StructField(string, typo)

    # Given a pandas dataframe, it will return a Spark dataframe.
    # sqlContext is assumed to exist already (for example in the pyspark shell).
    def pandas_to_spark(pandas_df):
        columns = list(pandas_df.columns)
        types = list(pandas_df.dtypes)
        struct_list = []
        for column, typo in zip(columns, types):
            struct_list.append(define_structure(column, typo))
        p_schema = StructType(struct_list)
        return sqlContext.createDataFrame(pandas_df, p_schema)

You can see it also in this gist. With this you just have to call:

    spark_df = pandas_to_spark(pandas_df)
