Updating a DataFrame column in Spark

Question

Looking at the new Spark DataFrame API, it is unclear whether it is possible to modify DataFrame columns.

How would I go about changing a value in row x, column y of a DataFrame?

In pandas this would be:

    df.ix[x, y] = new_value

Edit: Consolidating what was said below: you can't modify the existing DataFrame, as it is immutable, but you can return a new DataFrame with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

    from pyspark.sql import functions as F

    update_func = (F.when(F.col('update_col') == replace_val, new_value)
                    .otherwise(F.col('update_col')))
    df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the DataFrame:

    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    def my_func(col):
        # do stuff to column here
        return transformed_value

    # if we assume that my_func returns a string
    my_udf = F.UserDefinedFunction(my_func, T.StringType())

    df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

    df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')

User · Answer

Commonly when updating a column, we want to map an old value to a new value. Here's a way to do that in PySpark without UDFs:

    # update df[update_col], mapping old_value --> new_value
    from pyspark.sql import functions as F

    df = df.withColumn(update_col,
        F.when(df[update_col] == old_value, new_value)
         .otherwise(df[update_col]))

User · Answer

DataFrames are based on RDDs. RDDs are immutable structures and do not allow updating elements in place. To change values, you will need to create a new DataFrame by transforming the original one, either using the SQL-like DSL or RDD operations like map.

A highly recommended slide deck: Introducing DataFrames in Spark for Large Scale Data Science

User · Answer

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply, and then selectively apply that function to the targeted column only. In Python:

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType

    name = 'target_column'
    udf = UserDefinedFunction(lambda x: 'new_value', StringType())
    new_df = old_df.select(*[udf(column).alias(name) if column == name else column
                             for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well), but all values in column target_column will be new_value.

User · Answer

Importing col and when from pyspark.sql.functions, and updating the fifth column to an integer (0, 1, 2) based on the string (string a, string b, string c) into a new DataFrame:

    from pyspark.sql.functions import col, when

    data_frame_temp = data_frame.withColumn(
        "col_5",
        when(col("col_5") == "string a", 0)
        .when(col("col_5") == "string b", 1)
        .otherwise(2))

User · Answer

Just as maasg says, you can create a new DataFrame from the result of a map applied to the old DataFrame. An example for a given DataFrame df with two rows:

    import org.apache.spark.sql.Row

    // df.rdd.map also works on Spark 2+, where df.map returns a Dataset rather than an RDD
    val newDf = sqlContext.createDataFrame(df.rdd.map(row =>
      Row(row.getInt(0) + SOMETHING, applySomeDef(row.getAs[Double]("y")))), df.schema)

Note that if the types of the columns change, you need to give it a correct schema instead of df.schema. Check out the API of org.apache.spark.sql.Row for available methods: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Row.html

Update: Or using UDFs in Scala:

    import org.apache.spark.sql.functions._

    val toLong = udf[Long, String](_.toLong)

    val modifiedDf = df.withColumn("modifiedColumnName", toLong(df("columnName"))).drop("columnName")

and if the column name needs to stay the same, you can rename it back:

    modifiedDf.withColumnRenamed("modifiedColumnName", "columnName")
