Right now, I have to use
df.count > 0 to check if the
DataFrame is empty or not. But counting every row just to test emptiness is kind of inefficient. Is there any better way to do that?
PS: I want to check if it's empty so that I only save the DataFrame if it's not empty.
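The pattern in question, as currently written, looks roughly like this (a sketch only; `df` is the asker's DataFrame and `outputPath` and the Parquet format are placeholder assumptions, not from the question):

```scala
// Sketch: assumes df: DataFrame and a placeholder outputPath.
// df.count scans every row of the DataFrame, which is the
// inefficiency being asked about.
if (df.count > 0) {
  df.write.parquet(outputPath)
}
```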
~ Asked on 2015-09-22 02:52:55
For Spark 2.1.0, my suggestion would be to use
head(n: Int) or
take(n: Int) with
isEmpty, whichever one has the clearest intent to you.
with Python equivalent:
len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)
df.head() and df.first() will both throw a
java.util.NoSuchElementException if the DataFrame is empty.
first() calls
head() directly, which calls
head(1).head:
def first(): T = head()
def head(): T = head(1).head
head(1) returns an Array, so taking
head on that Array causes the
java.util.NoSuchElementException when the DataFrame is empty.
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
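To make the difference concrete, here is a hedged sketch (it assumes a SparkSession named `spark` with its implicits imported, so `toDF` is available) of the two behaviors on an empty DataFrame:

```scala
// Sketch: assumes `import spark.implicits._` is in scope for toDF.
val empty = Seq.empty[(Int, String)].toDF("id", "name")

empty.head(1)  // returns an empty Array[Row] -- safe to call isEmpty on
empty.head()   // throws java.util.NoSuchElementException
```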
So instead of calling
head(), call
head(1) directly to get the array, and then you can use
isEmpty.
take(n) is also equivalent to
head(n), since it is implemented as:
def take(n: Int): Array[T] = head(n)
And
limit(1).collect() is equivalent to
head(1) (notice
limit(n).queryExecution in the
head(n: Int) method above), so the following are all equivalent, at least from what I can tell, and you won't have to catch a
java.util.NoSuchElementException exception when the DataFrame is empty.
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
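Applied to the save-only-if-non-empty goal from the question, a hedged sketch (again, `outputPath` and the Parquet format are placeholders) would be:

```scala
// Sketch: assumes df: DataFrame and a placeholder outputPath.
// head(1) materializes at most one row, so this avoids the full
// scan that df.count > 0 would perform.
if (df.head(1).nonEmpty) {
  df.write.parquet(outputPath)
}
```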
I know this is an older question so hopefully it will help someone using a newer version of Spark.
~ Answered on 2017-04-13 04:10:19
I would say to just grab the underlying
RDD. In Scala:
df.rdd.isEmpty
That being said, all this does is call
take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?
~ Answered on 2015-09-22 04:14:38