[hadoop] Where does Hive store files in HDFS?

I'd like to know how to find the mapping between Hive tables and the actual HDFS files (or rather, directories) that they represent. I need to access the table files directly.

Where does Hive store its files in HDFS?

This question is related to hadoop hive hdfs

The answer is


Hive database is nothing but directories within HDFS with .db extensions.

So, from a Unix or Linux host which is connected to HDFS, search by following based on type of HDFS distribution:

hdfs dfs -ls -R / 2>/dev/null|grep db or hadoop fs -ls -R / 2>/dev/null|grep db

You will see full path of .db database directories. All tables will be residing under respective .db database directories.


In Hive, tables are actually stored in a few places. Specifically, if you use partitions (which you should, if your tables are very large or growing) then each partition can have its own storage.

To show the default location where table data or partitions will be created if you create them through default HIVE commands: (insert overwrite ... partition ... and such):

describe formatted dbname.tablename

To show the actual location of a particular partition within a HIVE table, instead do this:

describe formatted dbname.tablename partition (name=value)

If you look in your filesystem where a table "should" live, and you find no files there, it's very likely that the table is created (usually incrementally) by creating a new partition and pointing that partition at some other location. This is a great way of building tables from things like daily imports from third parties and such, which avoids having to copy the files around or storing them more than once in different places.


If you look at the hive-site.xml file you will see something like this

<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/usr/hive/warehouse </value>
   <description>location of the warehouse directory</description>
 </property>

/usr/hive/warehouse is the default location for all managed tables. External tables may be stored at a different location.

describe formatted <table_name> is the hive shell command which can be use more generally to find the location of data pertaining to a hive table.


Summarize few points posted earlier, in hive-site.xml, property hive.metastore.warehouse.dir specifies where the files located under hadoop HDFS

<property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
</property>

To view files, use this command:

hadoop fs -ls /user/hive/warehouse

or

http://localhost:50070
Utilities > Browse the file system
or
http://localhost:50070/explorer.html#/

tested under hadoop-2.7.3, hive-2.1.1


Hive tables may not necessarily be stored in a warehouse (since you can create tables located anywhere on the HDFS).

You should use DESCRIBE FORMATTED <table_name> command.

hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'

Please note that partitions may be stored in different places and to get the location of the alpha=foo/beta=bar partition you'd have to add partition(alpha='foo',beta='bar') after <table_name>.


Hive tables are stored in the Hive warehouse directory. By default, MapR configures the Hive warehouse directory to be /user/hive/warehouse under the root volume. This default is defined in the $HIVE_HOME/conf/hive-default.xml.


In sandbox , you need to go for /apps/hive/warehouse/ and normal cluster /user/hive/warehouse


In Hive terminal type:

hive> set hive.metastore.warehouse.dir;

(it will print the path)


It's also very possible that typing show create table <table_name> in the hive cli will give you the exact location of your hive table.


Another way to check where a specific table is stored would be execute this query on the hive interactive interface:

show create table table_name;

where table_name is the name of the subject table.

An example for the above query on 'customers' table would be something like this:

CREATE TABLE `customers`(
  `id` string, 
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/
   sqoop_workspace.db/customers'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='4', 
  'totalSize'='77', 
  'transient_lastDdlTime'='1456866115')

LOCATION in the example above is where you should focus on. That is your hdfs location for hive warehouse.

Don't forget to like if you like this solution. Cheers!


describe formatted <table_name>; inside hive shell.

Notice the "Location" value that shows the location of the table.


Examples related to hadoop

Hadoop MapReduce: Strange Result when Storing Previous Value in Memory in a Reduce Class (Java) What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check Spark Version What are the pros and cons of parquet format compared to other formats? java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient How to export data from Spark SQL to CSV How to copy data from one HDFS to another HDFS? How to calculate Date difference in Hive Select top 2 rows in Hive Spark - load CSV file as DataFrame?

Examples related to hive

select rows in sql with latest date for each ID repeated multiple times PySpark: withColumn() with two conditions and three outcomes java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient Hive cast string to date dd-MM-yyyy How to save DataFrame directly to Hive? How to calculate Date difference in Hive Select top 2 rows in Hive Just get column names from hive table Create hive table using "as select" or "like" and also specify delimiter Hive Alter table change Column Name

Examples related to hdfs

What are the pros and cons of parquet format compared to other formats? How to copy data from one HDFS to another HDFS? Spark - load CSV file as DataFrame? hadoop copy a local file system folder to HDFS What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? How to fix corrupt HDFS FIles How to copy file from HDFS to the local file system Name node is in safe mode. Not able to leave Hive load CSV with commas in quoted fields Permission denied at hdfs