Here's a gotcha for those with their headspace in Pandas who are moving to PySpark.
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

spark_conf = SparkConf().setMaster("local").setAppName("MyAppName")
sc = SparkContext(conf=spark_conf)
sqlContext = SQLContext(sc)

records = [
    {"colour": "red"},
    {"colour": "blue"},
    {"colour": None},
]

pandas_df = pd.DataFrame.from_dict(records)
pyspark_df = sqlContext.createDataFrame(records)
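At this point the two DataFrames look pretty much identical. Roughly what I'd expect to see (a quick sketch assuming the setup above; exact formatting varies a little by version):
pandas_df
#   colour
# 0    red
# 1   blue
# 2   None
pyspark_df.show()
# +------+
# |colour|
# +------+
# |   red|
# |  blue|
# |  null|
# +------+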
So if we wanted the rows that are not red, in Pandas:
pandas_df[~pandas_df["colour"].isin(["red"])]
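In Pandas the None row comes along for the ride (expected output, given the records above):
#   colour
# 1   blue
# 2   None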
Looking good. And the same filter on our PySpark DataFrame:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"])).collect()
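This is where the surprise hits: the None row silently disappears. Roughly what collect() returns here:
# [Row(colour='blue')]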
Where did the None row go? After some digging, I found this: https://issues.apache.org/jira/browse/SPARK-20617. Spark follows SQL's three-valued logic: comparing NULL with anything yields NULL rather than False, and filter() drops rows whose predicate evaluates to NULL. So to include nothingness in our results, we have to ask for NULL explicitly:
pyspark_df.filter(~pyspark_df["colour"].isin(["red"]) | pyspark_df["colour"].isNull()).show()
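Now both the blue and the None rows show up, roughly like this:
# +------+
# |colour|
# +------+
# |  blue|
# |  null|
# +------+
As an aside, if you're on Spark 2.3 or later, Column.eqNullSafe does a null-safe comparison, so something like pyspark_df.filter(~pyspark_df["colour"].eqNullSafe("red")) should (I believe) give the same result without the extra isNull() check.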