[pyspark] How to get name of dataframe column in pyspark?

In pandas, this can be done by column.name.

But how do you do the same when it's a column of a Spark dataframe?

e.g. the calling program has a Spark dataframe, spark_df:

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank']). Inside my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas dataframe, we could use this inside my_function:

>>> pandas_df['rank'].name
'rank'


The answer is


One way is to go down a level to the underlying JVM:

df.col._jc.toString().encode('utf8')

This is also how a column is converted to a str in the PySpark source itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
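Applied to the question's my_function, a minimal sketch could look like this (note that on Python 3 the .encode('utf8') call above would return bytes, so it is dropped here):

def my_function(col):
    # col is a pyspark.sql.Column; _jc is the underlying JVM Column object
    return col._jc.toString()

>>> my_function(spark_df['rank'])
'rank'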

You can get the names from the schema by doing

spark_df.schema.names

Printing the schema can be useful to visualize it as well

spark_df.printSchema()
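For the question's spark_df, that would look roughly like this (the exact types depend on how the data was loaded):

>>> spark_df.schema.names
['admit', 'gre', 'gpa', 'rank']
>>> spark_df.printSchema()
root
 |-- admit: long (nullable = true)
 |-- gre: long (nullable = true)
 |-- gpa: double (nullable = true)
 |-- rank: long (nullable = true)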


As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.

In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted even without using regular expressions: str(column).split(" AS ")[1].split("`")[1].
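A short sketch of that trick (the Column repr varies across Spark versions; this assumes one where the alias is rendered backtick-quoted, e.g. Column<b'rank AS `whatever`'>):

col = spark_df['rank'].alias("whatever")
# str(col) contains "rank AS `whatever`" on such versions, so splitting
# on " AS " and then on the backticks isolates the alias
alias = str(col).split(" AS ")[1].split("`")[1]
print(alias)  # whatever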

I don't know the Scala syntax, but I'm sure the same can be done there.


If you want the column names of your dataframe, you can use the pyspark.sql DataFrame API. I'm not sure whether it supports explicitly indexing a DataFrame by column name; I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, the columns attribute on your dataframe, which you have already used, will return a list of column names:

>>> df.columns
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column datatypes, you can use the dtypes attribute:

>>> df.dtypes
[('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column, you'll need to access it by index:

>>> df.columns[2]
'High'
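Going the other way, from a name to its position, is just a list lookup; a small sketch with the same hypothetical df:

>>> df.columns.index('High')
2
>>> df.dtypes[df.columns.index('High')]
('High', 'double')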


I found the answer to be very, very simple...

// This is Java, but the same idea applies in PySpark
Column col = ds.col("colName"); // the Column object
String theNameOftheCol = col.toString();

The variable theNameOftheCol is then "colName".
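One caveat when porting this to PySpark (a sketch using the question's spark_df): plain str() on a Column includes the Column<...> wrapper, so to get the bare name you still need the JVM toString() from the accepted answer:

col = spark_df['rank']
str(col)            # "Column<'rank'>" -- repr includes the wrapper (format varies by version)
col._jc.toString()  # 'rank' -- the bare name, matching the Java snippet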