How to check if Spark DataFrame is empty?

Question

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that? Thanks.

PS: I want to check if it's empty so that I only save the DataFrame if it's not empty.

User · Answer

dataframe.limit(1).count > 0

This also triggers a job, but since we are selecting a single record, even with billions of records the time consumption could be much lower.

From: https://medium.com/checking-emptiness-in-distributed-objects/count-vs-isempty-surprised-to-see-the-impact-fa70c0246ee0

User · Answer

You can take advantage of the head() (or first()) functions to see if the DataFrame has a single row. If so, it is not empty.

User · Answer

On PySpark, you can also use bool(df.head(1)) to obtain a True or False value. It returns False if the DataFrame contains no rows.
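As a minimal sketch of this idea, the check can be wrapped in a small helper. The names is_empty and FakeFrame are hypothetical; FakeFrame only mimics the head(n) method of a PySpark DataFrame so that the snippet runs without a Spark session.

```python
class FakeFrame:
    """Hypothetical stand-in that mimics DataFrame.head(n) for illustration."""

    def __init__(self, rows):
        self._rows = list(rows)

    def head(self, n):
        # Like DataFrame.head(n), return at most n rows as a list.
        return self._rows[:n]


def is_empty(df):
    """Return True when df has no rows, fetching at most one row."""
    # An empty list is falsy, so bool(df.head(1)) is False for an empty frame.
    return not bool(df.head(1))
```

With a real DataFrame the call would simply be is_empty(df); the helper only relies on head(1) returning a list.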

User · Answer

I would say to just grab the underlying RDD. In Scala:

    df.rdd.isEmpty

in Python:

    df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered... just maybe slightly more explicit?

User · Answer

If you do df.count > 0, it takes the counts of all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.

The best way to do this is to perform df.take(1) and check if it is null. In my case this returned java.util.NoSuchElementException, so it is better to put a try around df.take(1). The DataFrame returned an error when take(1) was done, instead of an empty row.
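The cost difference described here can be illustrated with a toy model in plain Python. This is not Spark code and does not reflect how Spark actually schedules tasks; lists stand in for partitions, and the function names count_all and take_one are hypothetical. It only shows that a full count has to visit every partition, while fetching one row can stop early.

```python
def count_all(partitions):
    """Toy analogue of df.count(): every partition is scanned and summed."""
    scanned = 0
    total = 0
    for part in partitions:
        scanned += 1
        total += len(part)
    return total, scanned


def take_one(partitions):
    """Toy analogue of df.take(1): stop at the first partition with a row."""
    scanned = 0
    for part in partitions:
        scanned += 1
        if part:
            return part[:1], scanned
    return [], scanned


partitions = [[], [1, 2, 3], [4, 5], [6]]
total, scanned_by_count = count_all(partitions)  # visits all 4 partitions
rows, scanned_by_take = take_one(partitions)     # stops after 2 partitions
```

In this model the emptiness check is just whether rows is an empty list.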

User · Answer

I found that in some cases:

    >>> print(type(df))
    <class 'pyspark.sql.dataframe.DataFrame'>
    >>> df.take(1).isEmpty
    'list' object has no attribute 'isEmpty'

This is the same for "length", or if you replace take() with head().

[Solution] For this issue we can use:

    >>> df.limit(2).count() > 1
    False

User · Answer

You can do it like:

    val df = sqlContext.emptyDataFrame
    if (df.eq(sqlContext.emptyDataFrame))
      println("empty df")
    else
      println("normal df")

User · Answer

For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

    df.head(1).isEmpty
    df.take(1).isEmpty

with Python equivalent:

    len(df.head(1)) == 0  # or not bool(df.head(1))
    len(df.take(1)) == 0  # or not bool(df.take(1))

Using df.first() and df.head() will both throw java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

    def first(): T = head()
    def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

    def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array, and then you can use isEmpty.

take(n) is also equivalent to head(n):

    def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.

    df.head(1).isEmpty
    df.take(1).isEmpty
    df.limit(1).collect().isEmpty

I know this is an older question, so hopefully it will help someone using a newer version of Spark.

User · Answer

Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:

    def isEmpty: Boolean =
      withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
        plan.executeCollect().head.getLong(0) == 0
      }

Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):

    type DataFrame = Dataset[Row]
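For PySpark users, a hedged sketch of preferring a built-in method when it exists: to the best of my knowledge DataFrame.isEmpty() arrived in PySpark later than in Scala (in 3.3.0, which is worth confirming against the documentation for your version), so the helper below falls back to head(1) when the method is missing. The names frame_is_empty, OldFrame and NewFrame are hypothetical; the two classes are stand-ins so the snippet runs without Spark.

```python
def frame_is_empty(df):
    """Prefer the built-in isEmpty() when the object provides it."""
    builtin = getattr(df, "isEmpty", None)
    if callable(builtin):
        return bool(builtin())
    # Fallback for versions without isEmpty(): fetch at most one row.
    return len(df.head(1)) == 0


class OldFrame:
    """Hypothetical stand-in for a DataFrame without isEmpty()."""

    def __init__(self, rows):
        self._rows = list(rows)

    def head(self, n):
        return self._rows[:n]


class NewFrame(OldFrame):
    """Hypothetical stand-in for a DataFrame that has isEmpty()."""

    def isEmpty(self):
        return not self._rows
```

Either branch fetches at most one row, so the fallback should cost roughly the same as the built-in.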

User · Answer

df1.take(1).length > 0

The take method returns an array of rows, so if the array size is equal to zero, there are no records in df.

User · Answer

In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.

    import org.apache.spark.sql.DataFrame

    object DataFrameExtensions {
      implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
        new ExtendedDataFrame(dataFrame: DataFrame)

      class ExtendedDataFrame(dataFrame: DataFrame) {
        def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
        def nonEmpty(): Boolean = !isEmpty
      }
    }

Here, other methods can be added as well. To use the implicit conversion, use import DataFrameExtensions._ in the file where you want to use the extended functionality. Afterwards, the methods can be used directly like so:

    val df: DataFrame = ...
    if (df.isEmpty) {
      // Do something
    }

User · Answer

If you are using PySpark, you could also do:

    len(df.head(1)) > 0

User · Answer

I had the same question, and I tested 3 main solutions:

1. (df != null) && (df.count > 0)
2. df.head(1).isEmpty() as @hulin003 suggests
3. df.rdd.isEmpty() as @Justin Pihony suggests

Of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DF on my machine, in terms of execution time:

1. it takes ~9366ms
2. it takes ~5607ms
3. it takes ~1921ms

Therefore I think that the best solution is df.rdd.isEmpty() as @Justin Pihony suggests.
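To reproduce this kind of comparison, a small timing harness along these lines may help. It is only a sketch: the name time_call is hypothetical, a plain list stands in for the DataFrame so the snippet runs without Spark, and real timings will depend heavily on data size, caching and the cluster.

```python
import time


def time_call(check, repeats=3):
    """Run check() several times; return (last result, best elapsed ms)."""
    best_ms = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = check()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best_ms = min(best_ms, elapsed_ms)
    return result, best_ms


# A plain list stands in for a DataFrame here; with Spark one would pass
# e.g. lambda: df.rdd.isEmpty() or lambda: len(df.head(1)) == 0 instead.
rows = list(range(1000))
result, ms = time_call(lambda: len(rows[:1]) == 0)
```

Taking the best of several runs reduces the influence of one-off warm-up costs, which can otherwise dominate a single measurement.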

User · Answer

For Java users, you can use this on a dataset:

    public boolean isDatasetEmpty(Dataset<Row> ds) {
        boolean isEmpty;
        try {
            isEmpty = ((Row[]) ds.head(1)).length == 0;
        } catch (Exception e) {
            return true;
        }
        return isEmpty;
    }

This checks all possible scenarios (empty, null).
