PySpark: replace strings in a Spark DataFrame column

Question

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this dataframe:

    id     address
    1      2 foo lane
    2      10 bar lane
    3      24 pants ln

Would become:

    id     address
    1      2 foo ln
    2      10 bar ln
    3      24 pants ln

User · Accepted Answer

For Spark 1.5 or later, you can use the functions package:

    from pyspark.sql.functions import regexp_replace
    newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

The function withColumn is called to add (or replace, if the name exists) a column to the data frame.

The function regexp_replace generates a new column by replacing all substrings that match the pattern. The pattern is a regular expression, so it matches anywhere in the string, not only whole words.
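Because the pattern is a regular expression, a bare 'lane' would also match inside longer words. The sketch below uses Python's standard re module to preview a pattern on a few sample strings before handing it to Spark. The helper normalize, the constant WORD_LANE, and the "planet" sample are hypothetical names made up for illustration; they are not part of PySpark. It also assumes that the word-boundary token \b behaves the same way in Spark's Java regex engine for plain ASCII input like this.

```python
import re

# Hypothetical constant: "lane" only when it stands alone as a word.
WORD_LANE = r"\blane\b"


def normalize(address: str, pattern: str = WORD_LANE, replacement: str = "ln") -> str:
    """Replace every match of `pattern` in `address` with `replacement`."""
    return re.sub(pattern, replacement, address)


# Sample addresses from the question, plus one made-up edge case.
samples = ["2 foo lane", "10 bar lane", "24 pants ln", "5 planet lane"]

for s in samples:
    bare = normalize(s, "lane")   # bare substring pattern, as in the answer
    bounded = normalize(s)        # word-bounded pattern
    print(f"{s!r}: bare={bare!r} bounded={bounded!r}")
```

If the word-bounded form is what you want, the same pattern string can be passed as the second argument in the answer's call, for example regexp_replace('address', r'\blane\b', 'ln').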

User · Answer

For Scala:

    import org.apache.spark.sql.functions.regexp_replace
    import org.apache.spark.sql.functions.col
    data.withColumn("addr_new", regexp_replace(col("addr_line"
