How to change dataframe column names in pyspark

Question

I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:

    df.columns = new_column_name_list

However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
    oldSchema = df.schema
    for i, k in enumerate(oldSchema.fields):
        k.name = new_column_name_list[i]
    df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This basically defines the variable twice: it infers the schema first, then renames the columns, and then loads the dataframe again with the updated schema.

Is there a better and more efficient way to do this, like we do in pandas?

My Spark version is 1.5.0.

User · Answer

    df = df.withColumnRenamed("colName", "newColName") \
           .withColumnRenamed("colName2", "newColName2")

Advantage of doing it this way: with a long list of columns, you often want to change only a few column names, and this is very convenient in those scenarios. It is also very useful when joining tables with duplicate column names.
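As a rough sketch of the join scenario mentioned above: one option is to prefix every non-key column on each side before joining, so the result has no duplicate names. The helper names `prefixed_names` and `add_prefix`, and the dataframes `left_df` and `right_df`, are made up for illustration; only `withColumnRenamed`, `columns` and `join` come from PySpark.

```python
def prefixed_names(columns, prefix, keep=()):
    """Return (old, new) name pairs; names listed in `keep` stay unchanged."""
    return [(c, c if c in keep else prefix + c) for c in columns]


def add_prefix(df, prefix, keep=()):
    """Apply the pairs to a PySpark DataFrame with withColumnRenamed."""
    for old, new in prefixed_names(df.columns, prefix, keep):
        if old != new:
            df = df.withColumnRenamed(old, new)
    return df


# Hypothetical usage: both tables have "id" and "name"; "id" is the join key.
# left = add_prefix(left_df, "l_", keep=("id",))
# right = add_prefix(right_df, "r_", keep=("id",))
# joined = left.join(right, on="id")
```

Keeping the name computation in a plain function makes it easy to check the new names before touching the dataframe.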

User · Answer

Another way to rename just one column (using import pyspark.sql.functions as F):

    df = df.select('*', F.col('count').alias('new_count')).drop('count')

User · Answer

I made an easy-to-use function to rename multiple columns of a pyspark dataframe, in case anyone wants to use it:

    def renameCols(df, old_columns, new_columns):
        for old_col, new_col in zip(old_columns, new_columns):
            df = df.withColumnRenamed(old_col, new_col)
        return df

    old_columns = ['old_name1', 'old_name2']
    new_columns = ['new_name1', 'new_name2']
    df_renamed = renameCols(df, old_columns, new_columns)

Be careful: both lists must be the same length.
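The warning about list lengths matters because `zip` silently stops at the shorter list, and `withColumnRenamed` does nothing when the old name is not in the schema, so mistakes go unnoticed. A possible checked variant is sketched below; the name `rename_cols_checked` and the error messages are my own assumptions, not part of the answer above.

```python
def rename_cols_checked(df, old_columns, new_columns):
    """Rename columns pairwise, failing loudly on mismatched input."""
    if len(old_columns) != len(new_columns):
        raise ValueError(
            "expected %d new names, got %d" % (len(old_columns), len(new_columns))
        )
    # withColumnRenamed is a no-op for unknown columns, so check up front.
    missing = [c for c in old_columns if c not in df.columns]
    if missing:
        raise ValueError("columns not in dataframe: %s" % missing)
    for old_col, new_col in zip(old_columns, new_columns):
        df = df.withColumnRenamed(old_col, new_col)
    return df
```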

User · Answer

If you want to rename a single column and keep the rest as it is:

    from pyspark.sql.functions import col
    new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])

User · Answer

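There are multiple approaches you can use:

    from pyspark.sql.functions import col

    # copy the column under the new name, then drop the original
    df1 = df.withColumn("new_column", col("old_column")).drop(col("old_column"))

    # copy the column under the new name and keep the original as well
    df1 = df.withColumn("new_column", col("old_column"))

    # select only the renamed column
    df1 = df.select(col("old_column").alias("new_column"))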

User · Answer

There are many ways to do that:

Option 1. Using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()

    # Output
    # +-------+----------+
    # |   Name|askdaosdka|
    # +-------+----------+
    # |Alberto|         2|
    # | Dakota|         2|
    # +-------+----------+
    #
    # root
    #  |-- Name: string (nullable = true)
    #  |-- askdaosdka: long (nullable = true)

    df = data.selectExpr("Name as name", "askdaosdka as age")
    df.show()
    df.printSchema()

    # Output
    # +-------+---+
    # |   name|age|
    # +-------+---+
    # |Alberto|  2|
    # | Dakota|  2|
    # +-------+---+
    #
    # root
    #  |-- name: string (nullable = true)
    #  |-- age: long (nullable = true)

Option 2. Using withColumnRenamed. Notice that this method allows you to "overwrite" the same column. In Python 2 you can use xrange instead of range.

    from functools import reduce

    oldColumns = data.schema.names
    newColumns = ["name", "age"]

    df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), data)
    df.printSchema()
    df.show()

Option 3. Using alias (in Scala you can also use as).

    from pyspark.sql.functions import col

    df = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
    df.show()

    # Output
    # +-------+---+
    # |   name|age|
    # +-------+---+
    # |Alberto|  2|
    # | Dakota|  2|
    # +-------+---+

Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.

    sqlContext.registerDataFrameAsTable(data, "myTable")
    df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
    df2.show()

    # Output
    # +-------+---+
    # |   name|age|
    # +-------+---+
    # |Alberto|  2|
    # | Dakota|  2|
    # +-------+---+

User · Answer

I use this one:

    from pyspark.sql.functions import col
    df.select(['vin', col('timeStamp').alias('Date')]).show()

User · Answer

Method 1: the existing name comes first, the new name second.

    df = df.withColumnRenamed("old_column_name", "new_column_name")

Method 2: if you want to do some computation and store the result under the new name, then drop the old column.

    import pyspark.sql.functions as F

    df = df.withColumn("new_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name")))
    df = df.drop("old_column_name")

User · Answer

You can put it into a for loop and use zip to pair up the column names from the two lists:

    new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

    new_df = df
    for old, new in zip(df.columns, new_name):
        new_df = new_df.withColumnRenamed(old, new)

User · Answer

I like to use a dict to rename the df:

    rename = {'old1': 'new1', 'old2': 'new2'}
    for col in df.schema.names:
        df = df.withColumnRenamed(col, rename[col])
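Note that the loop above looks up every column in the dict, so it raises a KeyError as soon as the dataframe has a column that is not a key. A small variation that leaves unlisted columns alone might look like the sketch below; `renamed_names` and `rename_with_dict` are hypothetical helper names, and `toDF` is used to apply all names in one call.

```python
def renamed_names(columns, mapping):
    """Map each column through `mapping`, keeping names that are not keys."""
    return [mapping.get(c, c) for c in columns]


def rename_with_dict(df, mapping):
    """Return a PySpark DataFrame with the columns in `mapping` renamed."""
    return df.toDF(*renamed_names(df.columns, mapping))


# Hypothetical usage:
# df = rename_with_dict(df, {'old1': 'new1', 'old2': 'new2'})
```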

User · Answer

You can use the following function to rename all the columns of your dataframe.

    def df_col_rename(X, to_rename, replace_with):
        """
        :param X: spark dataframe
        :param to_rename: list of original names
        :param replace_with: list of new names
        :return: dataframe with updated names
        """
        import pyspark.sql.functions as F
        mapping = dict(zip(to_rename, replace_with))
        X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
        return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list.

To rename all columns:

    df_col_rename(X, ['a', 'b', 'c'], ['x', 'y', 'z'])

To rename some columns:

    df_col_rename(X, ['a', 'b', 'c'], ['a', 'y', 'z'])

User · Answer

This is the approach that I used:

    # create pyspark session
    import pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('changeColNames').getOrCreate()

    # create dataframe
    df = spark.createDataFrame(data=[('Bob', 5.62, 'juice'), ('Sue', 0.85, 'milk')], schema=["Name", "Amount", "Item"])

    # view df with column names
    df.show()
    # +----+------+-----+
    # |Name|Amount| Item|
    # +----+------+-----+
    # | Bob|  5.62|juice|
    # | Sue|  0.85| milk|
    # +----+------+-----+

    # create a list with new column names
    newcolnames = ['NameNew', 'AmountNew', 'ItemNew']

    # change the column names of the df
    for c, n in zip(df.columns, newcolnames):
        df = df.withColumnRenamed(c, n)

    # view df with new column names
    df.show()
    # +-------+---------+-------+
    # |NameNew|AmountNew|ItemNew|
    # +-------+---------+-------+
    # |    Bob|     5.62|  juice|
    # |    Sue|     0.85|   milk|
    # +-------+---------+-------+

User · Answer

For a single column rename, you can still use toDF(). For example:

    df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()

User · Answer

In case you would like to apply a simple transformation to all column names, this code does the trick (I am replacing all spaces with underscores):

    new_column_name_list = list(map(lambda x: x.replace(" ", "_"), df.columns))

    df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for the toDF trick.
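The same idea extends to other clean-ups. Here is a minimal sketch that lower-cases names and collapses any run of non-word characters into a single underscore; the function names `clean_name` and `clean_column_names` and the exact cleaning rules are assumptions chosen for illustration.

```python
import re


def clean_name(name):
    """Lower-case a column name and replace runs of non-word chars with '_'."""
    return re.sub(r"\W+", "_", name.strip()).strip("_").lower()


def clean_column_names(df):
    """Return a PySpark DataFrame whose column names have been cleaned."""
    new_names = [clean_name(c) for c in df.columns]
    # Two different originals can clean to the same name; refuse to continue.
    if len(set(new_names)) != len(new_names):
        raise ValueError("cleaning produced duplicate names: %s" % new_names)
    return df.toDF(*new_names)
```

The duplicate check is there because a dataframe with two identically named columns is awkward to work with later.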

User · Answer

If you want to change all column names, try df.toDF(*cols)

User · Answer

    df.withColumnRenamed('age', 'age2')

User · Answer

We can use various approaches to rename a column.

First, let's create a simple DataFrame:

    df = spark.createDataFrame([("x", 1), ("y", 2)],
                               ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. Here are a few approaches to do that.

Approach 1: using the withColumnRenamed function.

    df.withColumnRenamed("col_1", "col_3").show()

Approach 2: using the alias function.

    df.select(df["col_1"].alias("col_3"), "col_2").show()

Approach 3: using the selectExpr function.

    df.selectExpr("col_1 as col_3", "col_2").show()

Rename all columns

Approach 4: using the toDF function. Here you need to pass the names of all the columns in the DataFrame.

    df.toDF("col_3", "col_2").show()

Here is the output:

    +-----+-----+
    |col_3|col_2|
    +-----+-----+
    |    x|    1|
    |    y|    2|
    +-----+-----+

I hope this helps.
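For completeness, newer PySpark releases (3.4.0 and later, as far as I know) also have DataFrame.withColumnsRenamed, which takes a dict of old-to-new names. The sketch below uses it when it is available and otherwise falls back to the withColumnRenamed loop shown in the approaches above; the wrapper name `rename_many` is hypothetical.

```python
def rename_many(df, mapping):
    """Rename columns of a PySpark DataFrame from a dict {old_name: new_name}."""
    # Assumption: withColumnsRenamed exists only on PySpark 3.4.0 and later.
    if hasattr(df, "withColumnsRenamed"):
        return df.withColumnsRenamed(mapping)
    # Older versions: rename one column at a time.
    for old, new in mapping.items():
        df = df.withColumnRenamed(old, new)
    return df


# Hypothetical usage:
# df = rename_many(df, {"col_1": "col_3"})
```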
