How to find count of Null and Nan values for each column in a PySpark dataframe efficiently

Question

import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', 'timestamp1', 'id2'))

Expected output: a dataframe with the count of NaN/null values for each column.

Note: The previous questions I found on Stack Overflow only check for null and not NaN. That's why I have created a new question.

I know I can use the isnull() function in Spark to find the number of null values in a Spark column, but how do I find NaN values in a Spark dataframe?

User · Answer

An alternative to the already provided ways is to simply filter on the column, like so:

import pyspark.sql.functions as F

df = df.where(F.col('columnNameHere').isNull())

This has the added benefit that you don't have to add another column to do the filtering and it's quick on larger data sets.

User · Answer

Here is my one-liner, where 'c' is the name of the column:

from pyspark.sql import functions as F

df.select('c').withColumn('isNull_c', F.col('c').isNull()).where('isNull_c = True').count()

User · Answer

To make sure it does not fail for string, date and timestamp columns:

import pyspark.sql.functions as F

def count_missings(spark_df, sort=True):
    """
    Counts number of nulls and nans in each column
    """
    # Columns of type timestamp, string and date are skipped entirely.
    df = spark_df.select([F.count(F.when(F.isnan(c) | F.isnull(c), c)).alias(c)
                          for (c, c_type) in spark_df.dtypes
                          if c_type not in ('timestamp', 'string', 'date')]).toPandas()

    if len(df) == 0:
        print("There are no missing values!")
        return None

    if sort:
        return df.rename(index={0: 'count'}).T.sort_values("count", ascending=False)

    return df

If you want to see the columns sorted by the number of NaNs and nulls in descending order:

count_missings(spark_df)

# | Col_A | 10 |
# | Col_C | 2  |
# | Col_B | 1  |

If you don't want ordering and want to see them as a single row:

count_missings(spark_df, False)

# | Col_A | Col_B | Col_C |
# |  10   |   1   |   2   |

User · Answer

You can use the method shown here and replace isNull with isnan:

from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  3|
+-------+----------+---+

or

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+-------+----------+---+
|session|timestamp1|id2|
+-------+----------+---+
|      0|         0|  5|
+-------+----------+---+

User · Answer

I prefer this solution:

from pyspark.sql.functions import count

df = spark.table(selected_table).filter(condition)

counter = df.count()

df = df.select([(counter - count(c)).alias(c) for c in df.columns])

User · Answer

For null values in a PySpark dataframe:

Dict_Null = {col: df.filter(df[col].isNull()).count() for col in df.columns}
Dict_Null

The output is a dict where the key is the column name and the value is the number of null values in that column:

{'#': 0,
 'Name': 0,
 'Type 1': 0,
 'Type 2': 386,
 'Total': 0,
 'HP': 0,
 'Attack': 0,
 'Defense': 0,
 'Sp Atk': 0,
 'Sp Def': 0,
 'Speed': 0,
 'Generation': 0,
 'Legendary': 0}
