Filter Pyspark dataframe column with None value

Question

I'm trying to filter a PySpark dataframe that has None as a row value:

    df.select('dt_mvmt').distinct().collect()

    [Row(dt_mvmt=u'2016-03-27'),
     Row(dt_mvmt=u'2016-03-28'),
     Row(dt_mvmt=u'2016-03-29'),
     Row(dt_mvmt=None),
     Row(dt_mvmt=u'2016-03-30'),
     Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:

    df[df.dt_mvmt == '2016-03-31']
    # some results here

but this fails:

    df[df.dt_mvmt == None].count()
    0
    df[df.dt_mvmt != None].count()
    0

But there are definitely values in each category. What's going on?

User · Answer

Try to just use the isNotNull function:

    df.filter(df.dt_mvmt.isNotNull()).count()

User · Answer

If you want to filter out records having a None value in a column, see the example below:

    df = spark.createDataFrame([[123, "abc"], [234, "fre"], [345, None]], ["a", "b"])

Now filter out the null value records:

    df = df.filter(df.b.isNotNull())
    df.show()

If you want to remove those records from the DF, see below:

    df1 = df.na.drop(subset=["b"])
    df1.show()

User · Answer

PySpark provides various filtering options based on arithmetic, logical and other conditions. The presence of NULL values can hamper further processing. Removing them or statistically imputing them could be a choice.

The code below can be considered:

    # Dataset is df
    # Column name is dt_mvmt
    # Before filtering, make sure you have the right count of the dataset
    df.count()  # Some number

    # Filter here
    df = df.filter(df.dt_mvmt.isNotNull())

    # Check the count to confirm whether NULL values were present
    # (this is important when dealing with a large dataset)
    df.count()  # Count should be reduced if NULL values were present

User · Answer

You can use Column.isNull / Column.isNotNull:

    from pyspark.sql.functions import col

    df.where(col("dt_mvmt").isNull())
    df.where(col("dt_mvmt").isNotNull())

If you want to simply drop NULL values you can use na.drop with the subset argument:

    df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:

    sqlContext.sql("SELECT NULL = NULL").show()
    ## +-------------+
    ## |(NULL = NULL)|
    ## +-------------+
    ## |         null|
    ## +-------------+

    sqlContext.sql("SELECT NULL != NULL").show()
    ## +-------------------+
    ## |(NOT (NULL = NULL))|
    ## +-------------------+
    ## |               null|
    ## +-------------------+

The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
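
To see why both counts in the question come out as 0, here is a minimal sketch of the same NULL semantics. It assumes SQLite (from the Python standard library) as a stand-in for Spark SQL so it runs without a cluster; the table name `events` and the helper `count_where` are hypothetical, and the values mirror the ones in the question. A WHERE clause keeps only rows whose predicate is TRUE, so a predicate that evaluates to NULL drops every row.

```python
import sqlite3

# SQLite stands in for Spark SQL here (an assumption made for portability);
# the table name `events` is hypothetical, the values mirror the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dt_mvmt TEXT)")
dates = ["2016-03-27", "2016-03-28", "2016-03-29", None, "2016-03-30", "2016-03-31"]
conn.executemany("INSERT INTO events VALUES (?)", [(d,) for d in dates])


def count_where(predicate):
    """Count the rows of `events` for which the predicate evaluates to TRUE."""
    query = f"SELECT COUNT(*) FROM events WHERE {predicate}"
    return conn.execute(query).fetchone()[0]


# Comparing anything with NULL yields NULL, which Python receives as None.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

eq_count = count_where("dt_mvmt = NULL")    # predicate is NULL for every row
ne_count = count_where("dt_mvmt != NULL")   # also NULL for every row
is_null_count = count_where("dt_mvmt IS NULL")
is_not_null_count = count_where("dt_mvmt IS NOT NULL")

print(null_equals_null, eq_count, ne_count, is_null_count, is_not_null_count)
# prints: None 0 0 1 5
```

The `=` and `!=` predicates match nothing, while IS NULL and IS NOT NULL split the six rows into 1 and 5, which is the behaviour isNull / isNotNull give you on a DataFrame column.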

User · Answer

None/Null is the sole value of the class NoneType in PySpark/Python, so the comparisons below will not work, as you are trying to compare a NoneType object with a string object.

Wrong way of filtering:

    df[df.dt_mvmt == None].count()
    0
    df[df.dt_mvmt != None].count()
    0

Correct:

    df = df.where(col("dt_mvmt").isNotNull())

This returns all records where dt_mvmt is not None/Null; use isNull() instead to get the records where it is None/Null.

User · Answer

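If the column holds the string 'None':

    COLUMN_OLD_VALUE
    ----------------
    None
    1
    None
    100
    20
    ----------------

Create a temp table on the data frame and use:

    sqlContext.sql("select * from tempTable where column_old_value = 'None'").show()

So use: column_old_value = 'None'. Note that this matches only the literal string 'None'; a real null still needs IS NULL.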
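
Keep in mind that a comparison against 'None' can only ever match the literal string, not a real null. A minimal sketch of the difference, assuming SQLite (from the Python standard library) in place of Spark SQL so it runs anywhere; the sample values are made up for illustration and contain one string 'None' and one real NULL:

```python
import sqlite3

# SQLite is an assumption used as a stand-in for a Spark temp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tempTable (column_old_value TEXT)")

# Illustrative values: the first entry is the string 'None', the third is a real NULL.
values = ["None", "1", None, "100", "20"]
conn.executemany("INSERT INTO tempTable VALUES (?)", [(v,) for v in values])

# Matches only the row holding the four-character string 'None'.
string_matches = conn.execute(
    "SELECT COUNT(*) FROM tempTable WHERE column_old_value = 'None'"
).fetchone()[0]

# Matches only the row holding a real NULL.
null_matches = conn.execute(
    "SELECT COUNT(*) FROM tempTable WHERE column_old_value IS NULL"
).fetchone()[0]

print(string_matches, null_matches)
# prints: 1 1
```

So if show() prints None because the value is actually missing, the string comparison will not find it and IS NULL is the predicate to use.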

User · Answer

isNull() / isNotNull() will return the respective rows which have dt_mvmt as Null or !Null.

    method_1 = df.filter(df['dt_mvmt'].isNotNull()).count()
    method_2 = df.filter(df.dt_mvmt.isNotNull()).count()

Both will return the same result.

User · Answer

To obtain entries whose values in the dt_mvmt column are not null we have

    df.filter("dt_mvmt is not NULL")

and for entries which are null we have

    df.filter("dt_mvmt is NULL")

User · Answer

There are multiple ways you can remove/filter the null values from a column in a DataFrame.

Let's create a simple DataFrame with the code below:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StringType

    date = ['2016-03-27', '2016-03-28', '2016-03-29', None, '2016-03-30', '2016-03-31']
    df = spark.createDataFrame(date, StringType())

Now you can try one of the approaches below to filter out the null values.

    # Approach - 1
    df.filter("value is not null").show()

    # Approach - 2
    df.filter(col("value").isNotNull()).show()

    # Approach - 3
    df.filter(df["value"].isNotNull()).show()

    # Approach - 4
    df.filter(df.value.isNotNull()).show()

    # Approach - 5
    df.na.drop(subset=["value"]).show()

    # Approach - 6
    df.dropna(subset=["value"]).show()

    # Note: You can also use the where function instead of filter.

You can also check the section "Working with NULL Values" on my blog for more information.

I hope it helps.

User · Answer

If you want to keep with the Pandas syntax, this worked for me:

    df = df[df.dt_mvmt.isNotNull()]
