[date] Hive: Filtering Data between Specified Dates when Date is a String

I'm trying to filter data between September 1st, 2010 and August 31st, 2013 in a Hive table. The column containing the date is in string format (yyyy-mm-dd). I can use month() and year() on this column. But how do I use them to filter data between the above dates? Any examples/sample code would be welcome!

This question is related to date filter hive

The answer is


Try this:

select * from your_table
where date >= '2020-10-01'  

No need to extract the month and year.Just need to use the unix_timestamp(date String,format String) function.

For Example:

select yourdate_column
from your_table
where unix_timestamp(yourdate_column, 'yyyy-MM-dd') >= unix_timestamp('2014-06-02', 'yyyy-MM-dd')
and unix_timestamp(yourdate_column, 'yyyy-MM-dd') <= unix_timestamp('2014-07-02','yyyy-MM-dd')
order by yourdate_column limit 10; 

Hive has a lot of good date parsing UDFs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

Just doing the string comparison as Nigel Tufnel suggests is probably the easiest solution, although technically it's unsafe. But you probably don't need to worry about that unless your tables have historical data about the medieval ages (dates with only 3 year digits) or dates from scifi novels (dates with more than 4 year digits).

Anyway, if you ever find yourself in a situation where you would want to do fancier date comparisons, or if your date format is not in a "biggest to smallest" order, e.g. the American convention of "mm/dd/yyyy", then you could use unix_timestamp with two arguments:

select *
from your_table
where unix_timestamp(your_date_column, 'yyyy-MM-dd') >= unix_timestamp('2010-09-01', 'yyyy-MM-dd')
and unix_timestamp(your_date_column, 'yyyy-MM-dd') <= unix_timestamp('2013-08-31', 'yyyy-MM-dd')

Just like SQL, Hive supports BETWEEN operator for more concise statement:

SELECT *
  FROM your_table
  WHERE your_date_column BETWEEN '2010-09-01' AND '2013-08-31';

You have to convert string formate to required date format as following and then you can get your required result.

hive> select * from salesdata01 where from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') >= from_unixtime(unix_timestamp('2010-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') and from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') <= from_unixtime(unix_timestamp('2011-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') limit 10;
OK
1   3   13-10-2010  Low 6.0 261.54  0.04    Regular Air -213.25 38.94
80  483 10-07-2011  High    30.0    4965.7593   0.08    Regular Air 1198.97 195.99
97  613 17-06-2011  High    12.0    93.54   0.03    Regular Air -54.04  7.3
98  613 17-06-2011  High    22.0    905.08  0.09    Regular Air 127.7   42.76
103 643 24-03-2011  High    21.0    2781.82 0.07    Express Air -695.26 138.14
127 807 23-11-2010  Medium  45.0    196.85  0.01    Regular Air -166.85 4.28
128 807 23-11-2010  Medium  32.0    124.56  0.04    Regular Air -14.33  3.95
160 995 30-05-2011  Medium  46.0    1815.49 0.03    Regular Air 782.91  39.89
229 1539    09-03-2011  Low 33.0    511.83  0.1 Regular Air -172.88 15.99
230 1539    09-03-2011  Low 38.0    184.99  0.05    Regular Air -144.55 4.89
Time taken: 0.166 seconds, Fetched: 10 row(s)
hive> select * from salesdata01 where from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') >= from_unixtime(unix_timestamp('2010-09-01', 'yyyy-MM-dd'),'yyyy-MM-dd') and from_unixtime(unix_timestamp(Order_date, 'dd-MM-yyyy'),'yyyy-MM-dd') <= from_unixtime(unix_timestamp('2010-12-01', 'yyyy-MM-dd'),'yyyy-MM-dd') limit 10;
OK
1   3   13-10-2010  Low 6.0 261.54  0.04    Regular Air -213.25 38.94
127 807 23-11-2010  Medium  45.0    196.85  0.01    Regular Air -166.85 4.28
128 807 23-11-2010  Medium  32.0    124.56  0.04    Regular Air -14.33  3.95
256 1792    08-11-2010  Low 28.0    370.48  0.04    Regular Air -5.45   13.48
381 2631    23-09-2010  Low 27.0    1078.49 0.08    Regular Air 252.66  40.96
656 4612    19-09-2010  Medium  9.0 89.55   0.06    Regular Air -375.64 4.48
769 5506    07-11-2010  Critical    22.0    129.62  0.05    Regular Air 4.41    5.88
1457    10499   16-11-2010  Not Specified   29.0    6250.936    0.01    Delivery Truck  31.21   262.11
1654    11911   10-11-2010  Critical    25.0    397.84  0.0 Regular Air -14.75  15.22
2323    16741   30-09-2010  Medium  6.0 157.97  0.01    Regular Air -42.38  22.84
Time taken: 0.17 seconds, Fetched: 10 row(s)

The great thing about yyyy-mm-dd date format is that there is no need to extract month() and year(), you can do comparisons directly on strings:

SELECT *
  FROM your_table
  WHERE your_date_column >= '2010-09-01' AND your_date_column <= '2013-08-31';

Examples related to date

How do I format {{$timestamp}} as MM/DD/YYYY in Postman? iOS Swift - Get the Current Local Time and Date Timestamp Typescript Date Type? how to convert current date to YYYY-MM-DD format with angular 2 SQL Server date format yyyymmdd Date to milliseconds and back to date in Swift Check if date is a valid one change the date format in laravel view page Moment js get first and last day of current month How can I convert a date into an integer?

Examples related to filter

Monitoring the Full Disclosure mailinglist Pyspark: Filter dataframe based on multiple conditions How Spring Security Filter Chain works Copy filtered data to another sheet using VBA Filter object properties by key in ES6 How do I filter date range in DataTables? How do I filter an array with TypeScript in Angular 2? Filtering array of objects with lodash based on property value How to filter an array from all elements of another array How to specify "does not contain" in dplyr filter

Examples related to hive

select rows in sql with latest date for each ID repeated multiple times PySpark: withColumn() with two conditions and three outcomes java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient Hive cast string to date dd-MM-yyyy How to save DataFrame directly to Hive? How to calculate Date difference in Hive Select top 2 rows in Hive Just get column names from hive table Create hive table using "as select" or "like" and also specify delimiter Hive Alter table change Column Name