I have a dataframe with a column of type String. I want to change the column type to Double in PySpark.
This is the way I did it:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))
I just wanted to know whether this is the right way to do it, since I am getting an error while running Logistic Regression, and I wonder if this is the cause of the trouble.
This question is related to: python, apache-spark, dataframe, pyspark, apache-spark-sql
The given answers are enough to deal with the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the existing answers did not cover it.
We can refer to a column in a Spark statement with the col("column_name") function:
from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))
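For context, here is a minimal self-contained sketch (the example data, and the joindf/"show" names, are assumptions for illustration) showing the cast together with a schema check:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: "show" arrives as a string column.
joindf = spark.createDataFrame([("1.5",), ("2.0",)], ["show"])

changedTypedf = joindf.withColumn("show", col("show").cast("double"))
changedTypedf.printSchema()  # show: double (nullable = true)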
Preserve the column name and avoid adding an extra column by using the same name as the input column:
from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))
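For contrast, a quick sketch (continuing the snippet above, with an assumed output name) of what happens when the output name differs from the input: withColumn appends a new column instead of replacing the old one.

# A different output name appends a column; "show" keeps its string type.
withBoth = joindf.withColumn("show_double", joindf["show"].cast(DoubleType()))
# withBoth now contains both "show" (string) and "show_double" (double).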
The solution was simple: the identity lambda returned the original string even though the UDF declared DoubleType, so the value needs an explicit conversion with float(x).
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
toDoublefunc = UserDefinedFunction(lambda x: float(x), DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))
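One caveat: float(x) raises a TypeError when the column contains nulls, whereas a built-in cast simply returns null. A defensive variant might look like this (a sketch using the public udf wrapper, not the only possible fix):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Guard against nulls: float(None) would raise a TypeError inside the UDF.
to_double = udf(lambda x: float(x) if x is not None else None, DoubleType())
changedTypedf = joindf.withColumn("label", to_double(joindf['show']))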
PySpark version:
from pyspark.sql.types import IntegerType

df = <source data>
df.printSchema()

# Change the column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()
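As an aside, cast also accepts the type's simple string name, which avoids importing from pyspark.sql.types (assuming the same df as above):

# Equivalent cast using the type's simple string name.
df_new = df.withColumn("myColumn", df["myColumn"].cast("integer"))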