Spark - SELECT WHERE or filtering?

Question

What's the difference between selecting with a where clause and filtering in Spark? Are there any use cases in which one is more appropriate than the other?

When do I use

    DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10))

and when is

    DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10")

more appropriate?

User · Answer

As Yaron mentioned, there isn't any difference between where and filter.

filter is an overloaded method that takes a column or string argument. The performance is the same, regardless of the syntax you use.

We can use explain() to see that all the different filtering syntaxes generate the same Physical Plan. Suppose you have a dataset with person_name and person_country columns. All of the following code snippets will return the same Physical Plan below.

    df.where("person_country = 'Cuba'").explain()
    df.where($"person_country" === "Cuba").explain()
    df.where('person_country === "Cuba").explain()
    df.filter("person_country = 'Cuba'").explain()

These all return this Physical Plan:

    == Physical Plan ==
    *(1) Project [person_name#152, person_country#153]
    +- *(1) Filter (isnotnull(person_country#153) && (person_country#153 = Cuba))
       +- *(1) FileScan csv [person_name#152,person_country#153] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/matthewpowers/Documents/code/my_apps/mungingdata/spark2/src/test/re..., PartitionFilters: [], PushedFilters: [IsNotNull(person_country), EqualTo(person_country,Cuba)], ReadSchema: struct<person_name:string,person_country:string>

The syntax doesn't change how filters are executed under the hood, but the file format / database that a query is executed on does. Spark will execute the same query differently on Postgres (predicate pushdown filtering is supported), Parquet (column pruning), and CSV files. See here for more details.

User · Answer

According to the Spark documentation, "where() is an alias for filter()".

filter(condition)

Filters rows using the given condition. where() is an alias for filter().

Parameters: condition - a Column of types.BooleanType or a string of SQL expression.

    >>> df.filter(df.age > 3).collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where(df.age == 2).collect()
    [Row(age=2, name=u'Alice')]

    >>> df.filter("age > 3").collect()
    [Row(age=5, name=u'Bob')]
    >>> df.where("age = 2").collect()
    [Row(age=2, name=u'Alice')]
