[hadoop] Select top 2 rows in Hive

I'm trying to retrieve the top 2 rows from my employee list based on salary in Hive (version 0.11). Since it doesn't support a TOP function, are there any alternatives? Or do we have to define a UDF?

This question is related to hadoop hive hiveql

The answer is


Yes, you can use LIMIT here.

You can try the query below:

SELECT * FROM employee_list SORT BY salary DESC LIMIT 2
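For context, here is a minimal sketch of the kind of table the query assumes (the schema is an assumption; only the employee_list name and the salary column come from the question):

CREATE TABLE IF NOT EXISTS employee_list (
  emp_id INT,      -- hypothetical column
  name   STRING,   -- hypothetical column
  salary DOUBLE    -- column referenced in the question
);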

Here I think it's worth mentioning both the SORT BY and ORDER BY clauses and how they differ.

SELECT * FROM <table_name> SORT BY <column_name> DESC LIMIT 2

If you use the SORT BY clause, the data is sorted per reducer, which means that if the job runs with more than one reducer the output is only partially ordered. On the other hand, the ORDER BY clause produces fully ordered data through a single final reducer. To understand more, please refer to this link.

SELECT * FROM <table_name> ORDER BY <column_name> DESC LIMIT 2

Note: even though the accepted answer uses the SORT BY clause, for general use I prefer ORDER BY, so that LIMIT is applied to a fully ordered result and none of the true top rows are missed.
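To see the difference in practice, here is a minimal sketch that forces more than one reducer (this assumes the classic MapReduce execution engine; mapred.reduce.tasks is the pre-YARN property name used around Hive 0.11):

-- Force several reducers so the per-reducer behaviour of SORT BY becomes visible
SET mapred.reduce.tasks = 3;

-- Each reducer sorts only its own slice of the data, so the combined output
-- is only partially ordered and LIMIT 2 may not return the true top 2 salaries
SELECT * FROM employee_list SORT BY salary DESC LIMIT 2;

-- ORDER BY funnels everything through a single reducer, giving a total order,
-- so LIMIT 2 is guaranteed to return the two highest salaries
SELECT * FROM employee_list ORDER BY salary DESC LIMIT 2;

As a side note, in strict mode (hive.mapred.mode=strict) Hive requires a LIMIT clause with ORDER BY, which the query above already satisfies.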


select * from employee_list order by salary desc limit 2;
