Right now, I have to use df.count > 0 to check whether the DataFrame is empty. But this is rather inefficient. Is there a better way to do it?
Thanks.
PS: I want to check whether it's empty so that I only save the DataFrame if it's not empty.
This question is related to: apache-spark, apache-spark-sql
You can take advantage of the head() (or first()) functions to see if the DataFrame has at least one row. If so, it is not empty.
Since Spark 2.4.0 there is Dataset.isEmpty.
Its implementation is:
def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }
Note that a DataFrame is no longer a class in Scala; it's just a type alias (probably changed with Spark 2.0):
type DataFrame = Dataset[Row]
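For PySpark users, a similar built-in exists: as far as I know, DataFrame.isEmpty() was added in PySpark 3.3.0. Below is a minimal sketch that prefers the built-in when the class provides it and otherwise falls back to head(1). The helper name is_empty is hypothetical and not part of Spark.

```python
def is_empty(df):
    """Return True if the DataFrame-like object `df` has no rows.

    `is_empty` is a hypothetical helper, not part of the Spark API.
    """
    # Check the class rather than the instance, so that a column that
    # happens to be named "isEmpty" is not mistaken for the method.
    if hasattr(type(df), "isEmpty"):
        return bool(df.isEmpty())
    # Fallback for older versions: head(1) returns a list of at most one Row.
    return len(df.head(1)) == 0
```

Either branch only needs to look at a single row, so it avoids the full scan that count() performs.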
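You can do it like this (note that eq compares object references, so this only recognizes the sqlContext.emptyDataFrame object itself, not an arbitrary DataFrame that happens to contain no rows):
val df = sqlContext.emptyDataFrame
if (df.eq(sqlContext.emptyDataFrame))
  println("empty df")
else
  println("normal df")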
On PySpark, you can also use bool(df.head(1)) to obtain a True or False value.
It returns False if the DataFrame contains no rows.
df1.take(1).length > 0
The take method returns an array of rows, so if the array size is zero, there are no records in df1.
If you are using PySpark, you could also do:
len(df.head(1)) > 0
dataframe.limit(1).count > 0
This also triggers a job, but since we select only a single record, the time taken can be much lower than a full count, even with billions of records.
For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.
df.head(1).isEmpty
df.take(1).isEmpty
with Python equivalent:
len(df.head(1)) == 0 # or: not df.head(1)
len(df.take(1)) == 0 # or: not df.take(1)
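To connect this back to the original goal (only save the DataFrame when it has rows), here is a minimal sketch. The helper name save_if_not_empty, the Parquet output format, and the path argument are all assumptions made for illustration.

```python
def save_if_not_empty(df, path):
    """Write `df` to `path` as Parquet only when it has at least one row.

    Hypothetical helper; returns True if a write was issued.
    """
    # head(1) fetches at most one row, so this avoids a full count().
    if not df.head(1):
        return False
    # Assumption: Parquet output; swap in whatever writer you actually use.
    df.write.parquet(path)
    return True
```

Note that the check and the write are two separate actions, so an expensive DataFrame may be partly evaluated twice unless it is cached first.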
Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.
def first(): T = head()
def head(): T = head(1).head
head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
So instead of calling head(), use head(1) directly to get the array, and then you can use isEmpty.
take(n) is also equivalent to head(n)...
def take(n: Int): Array[T] = head(n)
And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
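The exception described above is the Scala/Java behavior. In PySpark, as far as I know, df.first() and df.head() with no argument return None for an empty DataFrame instead of raising, while df.head(1) returns a plain Python list. The same "ask for a list of one, then inspect it" pattern can be sketched as follows; first_row_or_default is a hypothetical helper name.

```python
def first_row_or_default(df, default=None):
    """Return the first row of `df`, or `default` when `df` has no rows.

    Hypothetical helper; relies only on df.head(1) returning a list.
    """
    rows = df.head(1)  # list with zero or one Row
    return rows[0] if rows else default
```

Because the list is inspected before indexing, no exception handling is needed for the empty case.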
I know this is an older question so hopefully it will help someone using a newer version of Spark.
I found that in some cases (PySpark):
>>> print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>
>>> df.take(1).isEmpty
'list' object has no attribute 'isEmpty'
The same happens with "length", or if you replace take() by head(), because PySpark returns a plain Python list here.
[Solution] For this issue we can use:
>>> df.limit(1).count() > 0
False
I had the same question, and I tested 3 main solutions:
(df != null) && (df.count > 0)
df.head(1).isEmpty, as @hulin003 suggests
df.rdd.isEmpty(), as @Justin Pihony suggests
Of course, all 3 work. However, in terms of performance, I compared the execution time of these methods on the same DataFrame on my machine, and on that basis I think the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
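If you want to repeat such a comparison on your own data, a rough timing harness might look like the sketch below. The names time_checks and CHECKS are hypothetical, the three lambdas are PySpark spellings of the candidates above, and the numbers you get will depend heavily on your data, cluster, and caching.

```python
import time


def time_checks(df, checks, repeats=3):
    """Return {name: best wall-clock seconds} for each check in `checks`."""
    results = {}
    for name, check in checks.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            check(df)  # run the check; its boolean result is ignored here
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Candidate non-emptiness checks, keyed by a display name.
CHECKS = {
    "count": lambda df: df.count() > 0,
    "head(1)": lambda df: len(df.head(1)) > 0,
    "rdd.isEmpty": lambda df: not df.rdd.isEmpty(),
}
```

Calling time_checks(df, CHECKS) returns one timing per candidate, which you can then compare for your own workload.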
I would say to just grab the underlying RDD
. In Scala:
df.rdd.isEmpty
in Python:
df.rdd.isEmpty()
That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered... just maybe slightly more explicit?
If you do df.count > 0, it takes the counts of all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.
The best way to do this is to perform df.take(1) and check if it's null. This will throw java.util.NoSuchElementException, so it is better to put a try around df.take(1).
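The cost difference between counting everything and looking for one row can be illustrated with a plain-Python model. This is not Spark code: partitions are simply modeled as iterables, and count_all and has_any_row are hypothetical names.

```python
def count_all(partitions):
    # Every partition is scanned in full and the per-partition counts are summed.
    return sum(sum(1 for _ in part) for part in partitions)


def has_any_row(partitions):
    # Stops as soon as the first row is seen; later partitions are never read.
    return any(True for part in partitions for _ in part)
```

In this model has_any_row does work proportional to the position of the first row, while count_all always touches every row, which is the intuition behind preferring take(1) or head(1) over count.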
The DataFrame returns an error when take(1) is called, instead of an empty row. I have highlighted the specific code lines where it throws the error.
In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.
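import org.apache.spark.sql.DataFrame

object DataFrameExtensions {
  // Implicitly wraps a DataFrame so the extra methods below become available on it
  implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
    new ExtendedDataFrame(dataFrame)

  class ExtendedDataFrame(dataFrame: DataFrame) {
    // On Spark 2.4+, Dataset already defines isEmpty, which takes precedence over this one
    def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
    def nonEmpty(): Boolean = !isEmpty
  }
}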
Here, other methods can be added as well. To use the implicit conversion, use import DataFrameExtensions._ in the file where you want to use the extended functionality. Afterwards, the methods can be used directly like so:
val df: DataFrame = ...
if (df.isEmpty) {
  // Do something
}
For Java users, you can use this on a Dataset:
public boolean isDatasetEmpty(Dataset<Row> ds) {
    boolean isEmpty;
    try {
        // head(1) returns at most one row; zero rows means the Dataset is empty
        isEmpty = ((Row[]) ds.head(1)).length == 0;
    } catch (Exception e) {
        // Any failure (for example, ds is null) is treated as empty
        return true;
    }
    return isEmpty;
}
This checks all possible scenarios (empty, null).
Source: Stackoverflow.com