PySpark: Filter dataframe based on multiple conditions

I want to filter a dataframe on two conditions: first, d < 5; and second, where the value in col1 equals its counterpart in col3, the value in col2 must not equal its counterpart in col4.

If the original dataframe df is as follows:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   C| xxx|   D|  vv| 10|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   E| xxx|   F| vvv|  6|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G| xxx|  4|
|   G| xxx|   G|  xx|  4|
|   G| xxx|   G| xxx| 12|
|   B|xxxx|   B|  xx| 13|
+----+----+----+----+---+

The desired dataframe is:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+

Here is the code I tried, which did not work as expected:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cols = [('A','xx','D','vv',4), ('C','xxx','D','vv',10), ('A','x','A','xx',3), ('E','xxx','B','vv',3), ('E','xxx','F','vvv',6),
        ('F','xxxx','F','vvv',4), ('G','xxx','G','xxx',4), ('G','xxx','G','xx',4), ('G','xxx','G','xxx',12), ('B','xxxx','B','xx',13)]
df = spark.createDataFrame(cols, ['col1', 'col2', 'col3', 'col4', 'd'])

df.filter((df.d < 5) & (df.col2 != df.col4) & (df.col1 == df.col3)).show()

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|   x|   A|  xx|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+

What should I do to achieve the desired result?

This question is related to: sql, filter, pyspark, apache-spark-sql, pyspark-sql

The answer is:

Your attempt AND-ed all three conditions together, so every row where col1 differs from col3 was dropped; the col2/col4 check should only apply when col1 equals col3. You can write the whole condition as a single SQL expression string (no pyspark.sql.functions import needed):

df.filter('d < 5 and (col1 <> col3 or (col1 = col3 and col2 <> col4))').show()

Result:

+----+----+----+----+---+
|col1|col2|col3|col4|  d|
+----+----+----+----+---+
|   A|  xx|   D|  vv|  4|
|   A|   x|   A|  xx|  3|
|   E| xxx|   B|  vv|  3|
|   F|xxxx|   F| vvv|  4|
|   G| xxx|   G|  xx|  4|
+----+----+----+----+---+
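
For completeness, here is a minimal sketch of the same filter using pyspark.sql.functions.col. Note that (col1 <> col3) or (col1 = col3 and col2 <> col4) reduces, by De Morgan's laws, to not (col1 = col3 and col2 = col4), i.e. keep a row unless both pairs match; this simplification assumes the columns contain no nulls:

from pyspark.sql.functions import col

# Keep rows with d < 5, except those where BOTH col1 == col3 and col2 == col4
df.filter(
    (col('d') < 5) &
    ~((col('col1') == col('col3')) & (col('col2') == col('col4')))
).show()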

Equivalently, using Column expressions instead of a SQL string (the explicit parentheses matter, because & binds more tightly than | in Python):

df.filter(
    (df.d < 5) &
    ((df.col1 != df.col3) |
     ((df.col2 != df.col4) & (df.col1 == df.col3)))
).show()
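
Incidentally, the SQL-string and Column-expression forms should optimize to the same physical plan, so neither is inherently faster; you can verify this yourself with explain(). A small sketch (where() is simply an alias for filter()):

# Compare the physical plans of the two equivalent filters
df.filter('d < 5 and (col1 <> col3 or (col1 = col3 and col2 <> col4))').explain()
df.where(
    (df.d < 5) &
    ((df.col1 != df.col3) |
     ((df.col2 != df.col4) & (df.col1 == df.col3)))
).explain()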
