I have a data frame with four fields. one of the field name is Status and i am trying to use a OR condition in .filter for a dataframe . I tried below queries but no luck.
df2 = df1.filter(("Status=2") || ("Status =3"))
df2 = df1.filter("Status=2" || "Status =3")
Has anyone used this before. I have seen a similar question on stack overflow here . They have used below code for using OR condition. But that code is for pyspark.
from pyspark.sql.functions import col
numeric_filtered = df.where(
(col('LOW') != 'null') |
(col('NORMAL') != 'null') |
(col('HIGH') != 'null'))
numeric_filtered.show()
This question is related to
apache-spark
apache-spark-sql
spark-dataframe
In java spark dataset it can be used as
Dataset userfilter = user.filter(col("gender").isin("male","female"));
df2 = df1.filter("Status = 2 OR Status = 3")
Worked for me.
If we want partial match just like contains, we can chain the contain call like this :
def getSelectedTablesRows2(allTablesInfoDF: DataFrame, tableNames: Seq[String]): DataFrame = {
val tableFilters = tableNames.map(_.toLowerCase()).map(name => lower(col("table_name")).contains(name))
val finalFilter = tableFilters.fold(lit(false))((accu, newTableFilter) => accu or newTableFilter)
allTablesInfoDF.where(finalFilter)
}
You can try, (filtering with 1 object like a list or a set of values)
ds = ds.filter(functions.col(COL_NAME).isin(myList));
or as @Tony Fraser suggested, you can try, (with a Seq of objects)
ds = ds.filter(functions.col(COL_NAME).isin(mySeq));
All the answers are correct but most of them do not represent a good coding style. Also, you should always consider the variable length of arguments for the future, even though they are static at a certain point in time.
df2 = df1.filter("Status=2")
.filter("Status=3");
In spark/scala, it's pretty easy to filter with varargs.
val d = spark.read...//data contains column named matid
val ids = Seq("BNBEL0608AH", "BNBEL00608H")
val filtered = d.filter($"matid".isin(ids:_*))
You need to use filter
package dataframe
import org.apache.spark.sql.SparkSession
/**
* @author [email protected]
*/
//
object DataFrameExample{
//
case class Employee(id: Integer, name: String, address: String, salary: Double, state: String,zip:Integer)
//
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
import spark.implicits._
// create a sequence of case class objects
// (we defined the case class above)
val emp = Seq(
Employee(1, "vaquar khan", "111 algoinquin road chicago", 120000.00, "AZ",60173),
Employee(2, "Firdos Pasha", "1300 algoinquin road chicago", 2500000.00, "IL",50112),
Employee(3, "Zidan khan", "112 apt abcd timesqure NY", 50000.00, "NY",55490),
Employee(4, "Anwars khan", "washington dc", 120000.00, "VA",33245),
Employee(5, "Deepak sharma ", "rolling edows schumburg", 990090.00, "IL",60172),
Employee(6, "afaq khan", "saeed colony Bhopal", 1000000.00, "AZ",60173)
)
val employee=spark.sparkContext.parallelize(emp, 4).toDF()
employee.printSchema()
employee.show()
employee.select("state", "zip").show()
println("*** use filter() to choose rows")
employee.filter($"state".equalTo("IL")).show()
println("*** multi contidtion in filer || ")
employee.filter($"state".equalTo("IL") || $"state".equalTo("AZ")).show()
println("*** multi contidtion in filer && ")
employee.filter($"state".equalTo("AZ") && $"zip".equalTo("60173")).show()
}
}
Another way is to use function expr with where clause
import org.apache.spark.sql.functions.expr
df2 = df1.where(expr("col1 = 'value1' and col2 = 'value2'"))
It works the same.
This question has been answered but for future reference, I would like to mention that, in the context of this question, the where
and filter
methods in Dataset/Dataframe supports two syntaxes:
The SQL string parameters:
df2 = df1.filter(("Status = 2 or Status = 3"))
and Col based parameters (mentioned by @David ):
df2 = df1.filter($"Status" === 2 || $"Status" === 3)
It seems the OP'd combined these two syntaxes. Personally, I prefer the first syntax because it's cleaner and more generic.
For future references :
we can use isInCollection
to filter ,here is a example :
Note : It will look for exact match
def getSelectedTablesRows(allTablesInfoDF: DataFrame, tableNames: Seq[String]): DataFrame = {
allTablesInfoDF.where(col("table_name").isInCollection(tableNames))
}
Source: Stackoverflow.com