I'm trying to filter a PySpark dataframe that has None
as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
Row(dt_mvmt=u'2016-03-28'),
Row(dt_mvmt=u'2016-03-29'),
Row(dt_mvmt=None),
Row(dt_mvmt=u'2016-03-30'),
Row(dt_mvmt=u'2016-03-31')]
and I can filter correctly with an string value:
df[df.dt_mvmt == '2016-03-31']
# some results here
but this fails:
df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0
But there are definitely values on each category. What's going on?
This question is related to
python
apache-spark
dataframe
pyspark
apache-spark-sql
You can use Column.isNull
/ Column.isNotNull
:
df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
If you want to simply drop NULL
values you can use na.drop
with subset
argument:
df.na.drop(subset=["dt_mvmt"])
Equality based comparisons with NULL
won't work because in SQL NULL
is undefined so any attempt to compare it with another value returns NULL
:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## | null|
## +-------------+
sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## | null|
## +-------------------+
The only valid method to compare value with NULL
is IS
/ IS NOT
which are equivalent to the isNull
/ isNotNull
method calls.
if column = None
COLUMN_OLD_VALUE
----------------
None
1
None
100
20
------------------
Use create a temptable on data frame:
sqlContext.sql("select * from tempTable where column_old_value='None' ").show()
So use : column_old_value='None'
To obtain entries whose values in the dt_mvmt
column are not null we have
df.filter("dt_mvmt is not NULL")
and for entries which are null we have
df.filter("dt_mvmt is NULL")
Try to just use isNotNull function.
df.filter(df.dt_mvmt.isNotNull()).count()
If you want to filter out records having None value in column then see below example:
df=spark.createDataFrame([[123,"abc"],[234,"fre"],[345,None]],["a","b"])
Now filter out null value records:
df=df.filter(df.b.isNotNull())
df.show()
If you want to remove those records from DF then see below:
df1=df.na.drop(subset=['b'])
df1.show()
If you want to keep with the Pandas syntex this worked for me.
df = df[df.dt_mvmt.isNotNull()]
isNull()/isNotNull() will return the respective rows which have dt_mvmt as Null or !Null.
method_1 = df.filter(df['dt_mvmt'].isNotNull()).count()
method_2 = df.filter(df.dt_mvmt.isNotNull()).count()
Both will return the same result
None/Null is a data type of the class NoneType in pyspark/python so, Below will not work as you are trying to compare NoneType object with string object
Wrong way of filretingdf[df.dt_mvmt == None].count() 0 df[df.dt_mvmt != None].count() 0
df=df.where(col("dt_mvmt").isNotNull()) returns all records with dt_mvmt as None/Null
There are multiple ways you can remove/filter the null values from a column in DataFrame.
Lets create a simple DataFrame with below code:
date = ['2016-03-27','2016-03-28','2016-03-29', None, '2016-03-30','2016-03-31']
df = spark.createDataFrame(date, StringType())
Now you can try one of the below approach to filter out the null values.
# Approach - 1
df.filter("value is not null").show()
# Approach - 2
df.filter(col("value").isNotNull()).show()
# Approach - 3
df.filter(df["value"].isNotNull()).show()
# Approach - 4
df.filter(df.value.isNotNull()).show()
# Approach - 5
df.na.drop(subset=["value"]).show()
# Approach - 6
df.dropna(subset=["value"]).show()
# Note: You can also use where function instead of a filter.
You can also check the section "Working with NULL Values" on my blog for more information.
I hope it helps.
PySpark provides various filtering options based on arithmetic, logical and other conditions. Presence of NULL values can hamper further processes. Removing them or statistically imputing them could be a choice.
Below set of code can be considered:
# Dataset is df
# Column name is dt_mvmt
# Before filtering make sure you have the right count of the dataset
df.count() # Some number
# Filter here
df = df.filter(df.dt_mvmt.isNotNull())
# Check the count to ensure there are NULL values present (This is important when dealing with large dataset)
df.count() # Count should be reduced if NULL values are present
Source: Stackoverflow.com