[dataframe] PySpark 2.0 The size or shape of a DataFrame

I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this.

In pandas I can do

data.shape

Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...


The answer is


You can get its shape with:

print((df.count(), len(df.columns)))

Use df.count() to get the number of rows.


Add this to your code:

import pyspark
def spark_shape(self):
    return (self.count(), len(self.columns))
pyspark.sql.dataframe.DataFrame.shape = spark_shape

Then you can do

>>> df.shape()
(10000, 10)
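
If you would rather write df.shape without parentheses, as in pandas, the same idea can be expressed as a property. This is only a sketch: add_shape_property is a hypothetical helper name, not part of PySpark, and it assumes nothing about the class beyond count() and columns.

```python
def add_shape_property(df_class):
    """Attach a read-only, pandas-style `shape` property to df_class.

    Hypothetical helper: it only assumes instances of df_class expose
    count() and columns, as a PySpark DataFrame does.
    """
    def _shape(self):
        # count() runs a Spark job; columns is read from the schema.
        return (self.count(), len(self.columns))

    df_class.shape = property(_shape)
    return df_class


# With PySpark installed, this would be applied as:
#   from pyspark.sql import DataFrame
#   add_shape_property(DataFrame)
#   df.shape        # attribute access, no parentheses
```

Because the property hides a full count() behind what looks like a cheap attribute lookup, it may be worth keeping the explicit method form on large tables.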

Keep in mind that .count() can be very slow on a very large table that has not been persisted.
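
One way to soften that cost, if the DataFrame is going to be reused afterwards, is to cache it before counting. The sketch below is a plain helper function rather than a monkey patch; the name spark_shape and the cache flag are illustrative choices, and it assumes only that df provides count(), columns and cache().

```python
def spark_shape(df, cache=False):
    """Return (row_count, column_count) for a Spark DataFrame.

    Illustrative helper, not a PySpark API. It relies only on
    df.count(), df.columns and, optionally, df.cache().
    """
    if cache:
        # cache() is lazy: the count() below is the action that fills
        # the cache, so later actions on df can reuse the stored data.
        df = df.cache()
    return (df.count(), len(df.columns))


# Usage, assuming an existing SparkSession named `spark`:
#   df = spark.range(100)
#   spark_shape(df, cache=True)
```

Caching only pays off when the same DataFrame is used again; for a one-off count it just adds memory pressure.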


print((df.count(), len(df.columns)))

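is the simplest option for smaller datasets.

However, if the dataset is huge, an alternative approach is to use pandas with Arrow to convert the DataFrame to a pandas DataFrame and call shape. Note that toPandas() collects the entire dataset onto the driver, so this only works when the data fits in driver memory: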

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
print(df.toPandas().shape)

I don't think Spark has a function similar to data.shape. But I would use len(data.columns) rather than len(data.dtypes).

