[python] How to change dataframe column names in pyspark?

I come from pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command:

df.columns = new_column_name_list

However, the same doesn't work in pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following:

df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i,k in enumerate(oldSchema.fields):
  k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)

This is basically defining the variable twice and inferring the schema first then renaming the column names and then loading the dataframe again with the updated schema.

Is there a better and more efficient way to do this like we do in pandas ?

My spark version is 1.5.0

This question is related to python apache-spark pyspark pyspark-sql

The answer is


df.withColumnRenamed('age', 'age2')


We can use various approaches to rename the column name.

First, let create a simple DataFrame.

df = spark.createDataFrame([("x", 1), ("y", 2)], 
                                  ["col_1", "col_2"])

Now let's try to rename col_1 to col_3. PFB a few approaches to do the same.

# Approach - 1 : using withColumnRenamed function.
df.withColumnRenamed("col_1", "col_3").show()

# Approach - 2 : using alias function.
df.select(df["col_1"].alias("col3"), "col_2").show()

# Approach - 3 : using selectExpr function.
df.selectExpr("col_1 as col_3", "col_2").show()

# Rename all columns
# Approach - 4 : using toDF function. Here you need to pass the list of all columns present in DataFrame.
df.toDF("col_3", "col_2").show()

Here is the output.

+-----+-----+
|col_3|col_2|
+-----+-----+
|    x|    1|
|    y|    2|
+-----+-----+

I hope this helps.


You can use the following function to rename all the columns of your dataframe.

def df_col_rename(X, to_rename, replace_with):
    """
    :param X: spark dataframe
    :param to_rename: list of original names
    :param replace_with: list of new names
    :return: dataframe with updated names
    """
    import pyspark.sql.functions as F
    mapping = dict(zip(to_rename, replace_with))
    X = X.select([F.col(c).alias(mapping.get(c, c)) for c in to_rename])
    return X

In case you need to update only a few columns' names, you can use the same column name in the replace_with list

To rename all columns

df_col_rename(X,['a', 'b', 'c'], ['x', 'y', 'z'])

To rename a some columns

df_col_rename(X,['a', 'b', 'c'], ['a', 'y', 'z'])

You can put into for loop, and use zip to pairs each column name in two array.

new_name = ["id", "sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]

new_df = df
for old, new in zip(df.columns, new_name):
    new_df = new_df.withColumnRenamed(old, new)

Method 1:

df = df.withColumnRenamed("new_column_name", "old_column_name")

Method 2: If you want to do some computation and rename the new values

df = df.withColumn("old_column_name", F.when(F.col("old_column_name") > 1, F.lit(1)).otherwise(F.col("old_column_name"))
df = df.drop("new_column_name", "old_column_name")

I use this one:

from pyspark.sql.functions import col
df.select(['vin',col('timeStamp').alias('Date')]).show()

I like to use a dict to rename the df.

rename = {'old1': 'new1', 'old2': 'new2'}
for col in df.schema.names:
    df = df.withColumnRenamed(col, rename[col])

There are multiple approaches you can use:

  1. df1=df.withColumn("new_column","old_column").drop(col("old_column"))

  2. df1=df.withColumn("new_column","old_column")

  3. df1=df.select("old_column".alias("new_column"))


I made an easy to use function to rename multiple columns for a pyspark dataframe, in case anyone wants to use it:

def renameCols(df, old_columns, new_columns):
    for old_col,new_col in zip(old_columns,new_columns):
        df = df.withColumnRenamed(old_col,new_col)
    return df

old_columns = ['old_name1','old_name2']
new_columns = ['new_name1', 'new_name2']
df_renamed = renameCols(df, old_columns, new_columns)

Be careful, both lists must be the same length.


For a single column rename, you can still use toDF(). For example,

df1.selectExpr("SALARY*2").toDF("REVISED_SALARY").show()

Another way to rename just one column (using import pyspark.sql.functions as F):

df = df.select( '*', F.col('count').alias('new_count') ).drop('count')

this is the approach that I used:

create pyspark session:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('changeColNames').getOrCreate()

create dataframe:

df = spark.createDataFrame(data = [('Bob', 5.62,'juice'),  ('Sue',0.85,'milk')], schema = ["Name", "Amount","Item"])

view df with column names:

df.show()
+----+------+-----+
|Name|Amount| Item|
+----+------+-----+
| Bob|  5.62|juice|
| Sue|  0.85| milk|
+----+------+-----+

create a list with new column names:

newcolnames = ['NameNew','AmountNew','ItemNew']

change the column names of the df:

for c,n in zip(df.columns,newcolnames):
    df=df.withColumnRenamed(c,n)

view df with new column names:

df.show()
+-------+---------+-------+
|NameNew|AmountNew|ItemNew|
+-------+---------+-------+
|    Bob|     5.62|  juice|
|    Sue|     0.85|   milk|
+-------+---------+-------+

If you want to change all columns names, try df.toDF(*cols)


df = df.withColumnRenamed("colName", "newColName")\
       .withColumnRenamed("colName2", "newColName2")

Advantage of using this way: With long list of columns you would like to change only few column names. This can be very convenient in these scenarios. Very useful when joining tables with duplicate column names.


If you want to rename a single column and keep the rest as it is:

from pyspark.sql.functions import col
new_df = old_df.select(*[col(s).alias(new_name) if s == column_to_change else s for s in old_df.columns])

In case you would like to apply a simple transformation on all column names, this code does the trick: (I am replacing all spaces with underscore)

new_column_name_list= list(map(lambda x: x.replace(" ", "_"), df.columns))

df = df.toDF(*new_column_name_list)

Thanks to @user8117731 for toDf trick.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to apache-spark

Select Specific Columns from Spark DataFrame Select columns in PySpark dataframe What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Spark dataframe: collect () vs select () How does createOrReplaceTempView work in Spark? Spark difference between reduceByKey vs groupByKey vs aggregateByKey vs combineByKey Filter df when values matches part of a string in pyspark Filtering a pyspark dataframe using isin by exclusion Convert date from String to Date format in Dataframes

Examples related to pyspark

Pyspark: Filter dataframe based on multiple conditions How to convert column with string type to int form in pyspark data frame? Select columns in PySpark dataframe How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Filter df when values matches part of a string in pyspark Filtering a pyspark dataframe using isin by exclusion PySpark: withColumn() with two conditions and three outcomes How to get name of dataframe column in pyspark? Spark RDD to DataFrame python PySpark 2.0 The size or shape of a DataFrame

Examples related to pyspark-sql

Pyspark: Filter dataframe based on multiple conditions How to find count of Null and Nan values for each column in a PySpark dataframe efficiently? Filtering a pyspark dataframe using isin by exclusion How to get name of dataframe column in pyspark? show distinct column values in pyspark dataframe: python Split Spark Dataframe string column into multiple columns Convert pyspark string to date format How to change dataframe column names in pyspark?