[database] How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:

insert overwrite directory '/home/output.csv' select books from table;

When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?

This question is related to: database, hadoop, hive, hiveql

The answer is


You can use INSERT … DIRECTORY …, as in this example:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.
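
Note that the files written this way are ^A (Ctrl-A) delimited by default, not comma delimited. On Hive 0.11 and later you can specify the field delimiter directly; a minimal sketch of the same export with commas:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

The per-reducer files can then be merged into a single CSV, for example with cat /tmp/ca_employees/* > ca_employees.csv.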


I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace characters (space, newline, tab) as well as commas.

To make the column data TSV-safe, I replaced all \t characters in the column data with a space and executed Python code on the command line to generate a CSV file, as shown below:

hive -e 'tab_replaced_hql_query' | python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'

This created a perfectly valid CSV. I hope this helps those who come looking for this solution.
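
The tab-replaced query itself isn't shown above; a minimal sketch of what it could look like, assuming hypothetical columns col1 and col2 in a table mytable:

hive -e 'select regexp_replace(col1, "\\t", " "), regexp_replace(col2, "\\t", " ") from mytable'

The output of that command is what gets piped into the Python one-liner above.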


Just to cover a few more steps after kicking off the query:

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select books from table;

In my case, the data generated under the temp folder is in deflate format, and it looks like this:

$ ls
000000_0.deflate  
000001_0.deflate  
000002_0.deflate  
000003_0.deflate  
000004_0.deflate  
000005_0.deflate  
000006_0.deflate  
000007_0.deflate

Here's the command to decompress the deflate files and put everything into one CSV file:

hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
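
Alternatively, you can avoid the deflate files altogether by disabling output compression in the same Hive session before running the INSERT; hive.exec.compress.output is a standard Hive setting:

set hive.exec.compress.output=false;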

If you want a CSV file, you can modify Lukas' solution as follows (assuming you are on a Linux box):

hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv
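
Be aware that this sed replaces every run of whitespace with a comma, including spaces inside field values. If your fields can contain spaces, converting only the tab delimiters is safer; a minimal sketch:

hive -e 'select books from table' | tr '\t' ',' > /home/lvermeer/temp.csv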

I may be late to this one, but this answer might help:

echo "COL_NAME1|COL_NAME2|COL_NAME3|COL_NAME4" > SAMPLE_Data.csv hive -e ' select distinct concat(COL_1, "|", COL_2, "|", COL_3, "|", COL_4) from table_Name where clause if required;' >> SAMPLE_Data.csv


This shell command writes the query output in CSV format to output.txt, without the column headers:

$ hive --outputformat=csv2 -f 'hivedatascript.hql' --hiveconf hive.cli.print.header=false > output.txt

hive  --outputformat=csv2 -e "select * from yourtable" > my_file.csv

or

hive  --outputformat=csv2 -e "select * from yourtable" > [your_path]/file_name.csv

For TSV output, just change csv2 to tsv2 in the commands above.
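
If you also want the column headers in the file, the hive.cli.print.header setting (used with false in the script example above) can be set to true; a sketch:

hive --outputformat=csv2 --hiveconf hive.cli.print.header=true -e "select * from yourtable" > my_file.csv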


You can use the Hive string function CONCAT_WS(string delimiter, string str1, string str2, ..., string strn).

For example:

hive -e 'select CONCAT_WS(",", cola, colb, colc, ..., coln) from Mytable' > /home/user/Mycsv.csv

If you are using HUE this is fairly simple as well. Simply go to the Hive editor in HUE, execute your hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.


The default separator is "^A" (Ctrl-A). In Python, it is "\x01".

When I want to change the delimiter, I use SQL like:

SELECT col1, delimiter, col2, delimiter, col3, ... FROM table

Then treat delimiter + "^A" as the new delimiter.
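
As a concrete sketch of this idea, assume a hypothetical table mytable with columns col1 and col2, using "|" as the injected delimiter:

hive -e "select col1, '|', col2 from mytable" > raw.txt

Each output line then contains the sequence ^A|^A between fields, which can be collapsed into a single "|", e.g. with GNU sed:

sed 's/\x01|\x01/|/g' raw.txt > result.txt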


I tried various options, but this would be one of the simplest solutions for Python pandas:

hive -e 'select books from table' | grep "|" > temp.csv

import pandas as pd

df = pd.read_csv("temp.csv", sep='|')

You can also use tr "|" "," to convert "|" to ","


Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as CSV.


If you are doing this from Windows, you can use the Python script hivehoney to extract table data to a local CSV file.

It will:

  1. Log in to the bastion host.
  2. Run pbrun.
  3. Run kinit.
  4. Run beeline (with your query).
  5. Save the beeline output to a file on Windows.

Execute it like this:

set PROXY_HOST=your_bastion_host
set SERVICE_USER=your_func_user
set LINUX_USER=your_SOID
set LINUX_PWD=your_pwd

python hh.py --query_file=query.sql

This is the most CSV-friendly way I have found to output the results of HiveQL. You don't need any grep or sed commands to format the data; Hive supports it directly, you just need to add the extra outputformat flag:

hive --outputformat=csv2 -e 'select * from <table_name> limit 20' > /path/toStore/data/results.csv

You should use a CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.

You also might have to do some trickery to convert the files from '\001'-delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.
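
A minimal sketch of that approach, assuming a hypothetical source table mytable and the default warehouse location:

CREATE TABLE csv_dump
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS SELECT * FROM mytable;

hadoop fs -getmerge /user/hive/warehouse/csv_dump /home/user/result.csv

Specifying ROW FORMAT DELIMITED in the CTAS writes the files comma-delimited in the first place, which avoids the '\001'-to-CSV conversion step.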


Use the command:

hive -e "use [database_name]; select * from [table_name] LIMIT 10;" > /path/to/file/my_file_name.csv

I had a huge dataset whose details I was trying to organize in order to determine the types of attacks and the number of each type. An example I used in practice that worked (and has a little more detail) goes something like this:

hive -e "use DataAnalysis;
select attack_cat, 
case when attack_cat == 'Backdoor' then 'Backdoors' 
when length(attack_cat) == 0 then 'Normal' 
when attack_cat == 'Backdoors' then 'Backdoors' 
when attack_cat == 'Fuzzers' then 'Fuzzers' 
when attack_cat == 'Generic' then 'Generic' 
when attack_cat == 'Reconnaissance' then 'Reconnaissance' 
when attack_cat == 'Shellcode' then 'Shellcode' 
when attack_cat == 'Worms' then 'Worms' 
when attack_cat == 'Analysis' then 'Analysis' 
when attack_cat == 'DoS' then 'DoS' 
when attack_cat == 'Exploits' then 'Exploits' 
when trim(attack_cat) == 'Fuzzers' then 'Fuzzers' 
when trim(attack_cat) == 'Shellcode' then 'Shellcode' 
when trim(attack_cat) == 'Reconnaissance' then 'Reconnaissance' end,
count(*) from actualattacks group by attack_cat;">/root/data/output/results2.csv

I had a similar issue and this is how I was able to address it.

Step 1 - Loaded the data from the Hive table into another table as follows:

DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT Column List FROM TestHiveTable;

Step 2 - Copied the blob from the Hive warehouse to the new location with the appropriate extension:

Start-AzureStorageBlobCopy `
    -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"
