Filtering a spark dataframe based on date

Question

I have a dataframe of

date, string, string

I want to select dates before a certain period. I have tried the following with no luck:

data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime))

I'm getting an error stating the following:

org.apache.spark.sql.AnalysisException: resolved attribute(s) date#75 missing from date#72,uid#73,iid#74 in operator !Filter (date#75 < 16508);

As far as I can guess, the query is incorrect. Can anyone show me how the query should be formatted?

I checked that all entries in the dataframe have values - they do.

User · Answer

I find the most readable way to express this is using a sql expression:

df.filter("my_date < date'2015-01-01'")

We can verify that this works correctly by looking at the physical plan from .explain():

+- *(1) Filter (isnotnull(my_date#22) && (my_date#22 < 16436))
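
The integer in the plan is the date literal expressed as a number of days since the Unix epoch (1970-01-01), which is how Spark represents DateType values internally. A quick sanity check in plain Python (standard library only, not Spark; the helper name is made up for illustration) reproduces both the value above and the 16508 from the error message in the question:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_since_epoch(iso_day: str) -> int:
    """Return the number of days between 1970-01-01 and the given ISO date."""
    return (date.fromisoformat(iso_day) - EPOCH).days

print(days_since_epoch("2015-01-01"))  # 16436
print(days_since_epoch("2015-03-14"))  # 16508
```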

User · Answer

In PySpark (Python), one option is to have the column in unix_timestamp format. We can convert the string to a unix_timestamp and specify the format as shown below. Note that we need to import the unix_timestamp, to_date and lit functions:

from pyspark.sql.functions import unix_timestamp, to_date, lit

# Parse the string column "date" and keep the result as a date in "tx_date"
df_cast = df.withColumn("tx_date", to_date(unix_timestamp(df["date"], "MM/dd/yyyy").cast("timestamp")))

Now we can apply the filters:

df_cast.filter(df_cast["tx_date"] >= lit("2017-01-01")) \
       .filter(df_cast["tx_date"] <= lit("2017-01-31")).show()
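
The conversion matters because strings in a month-first pattern such as MM/dd/yyyy do not sort chronologically. The following plain-Python sketch (standard library only, not Spark; the sample values and helper name are made up, and `%m/%d/%Y` is Python's counterpart of the assumed pattern) shows the difference between comparing raw strings and comparing parsed dates:

```python
from datetime import datetime

def parse_us_date(text: str):
    """Parse a month/day/year string into a date object."""
    return datetime.strptime(text, "%m/%d/%Y").date()

earlier, later = "12/31/2016", "01/15/2017"  # hypothetical sample values

# Raw strings compare character by character, so "12/..." sorts after "01/...".
print(earlier < later)                                # False
# Parsed dates compare chronologically.
print(parse_us_date(earlier) < parse_us_date(later))  # True
```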

User · Answer

Don't use this, as suggested in other answers:

.filter(f.col("dateColumn") < f.lit("2017-11-01"))

But use this instead:

.filter(f.col("dateColumn") < f.unix_timestamp(f.lit("2017-11-01 00:00:00")).cast("timestamp"))

This will use the TimestampType instead of the StringType, which will be more performant in some cases. For example, Parquet predicate pushdown will only work with the latter form (the TimestampType comparison).

User · Answer

We can also use a SQL-style expression inside filter.

Note: here I am showing two conditions and a date range, for future reference:

ordersDf.filter("order_status = 'PENDING_PAYMENT' AND order_date BETWEEN '2013-07-01' AND '2013-07-31'")
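
BETWEEN is inclusive at both ends, so rows dated exactly 2013-07-01 or 2013-07-31 match. As a rough illustration, the same predicate can be tried outside Spark. The sketch below uses Python's built-in `sqlite3` as a stand-in (it is not Spark), with a made-up `orders` table and dates stored as ISO text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_status TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("PENDING_PAYMENT", "2013-06-30"),  # before the range
        ("PENDING_PAYMENT", "2013-07-01"),  # lower bound
        ("PENDING_PAYMENT", "2013-07-31"),  # upper bound
        ("COMPLETE", "2013-07-15"),         # wrong status
        ("PENDING_PAYMENT", "2013-08-01"),  # after the range
    ],
)

# Same WHERE clause as the filter expression above.
rows = conn.execute(
    "SELECT order_date FROM orders "
    "WHERE order_status = 'PENDING_PAYMENT' "
    "AND order_date BETWEEN '2013-07-01' AND '2013-07-31' "
    "ORDER BY order_date"
).fetchall()
matched = [r[0] for r in rows]
print(matched)  # ['2013-07-01', '2013-07-31']
```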

User · Answer

The following solutions are applicable since Spark 1.5:

For lower than:

// filter data where the date is less than 2015-03-14
data.filter(data("date").lt(lit("2015-03-14")))

For greater than:

// filter data where the date is greater than 2015-03-14
data.filter(data("date").gt(lit("2015-03-14")))

For equality, you can use either equalTo or ===:

data.filter(data("date") === lit("2015-03-14"))

If your DataFrame date column is of type StringType, you can convert it using the to_date function:

// filter data where the date is greater than 2015-03-14
data.filter(to_date(data("date")).gt(lit("2015-03-14")))

You can also filter according to a year using the year function:

// filter data where year is greater than or equal to 2016
data.filter(year($"date").geq(lit(2016)))

User · Answer

df = df.filter(df["columnname"] >= '2020-01-13')
