[hadoop] Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external and internal tables? I know the difference shows up when dropping a table, but I don't understand what is meant by "the data and metadata are deleted" for internal tables and "only the metadata is deleted" for external tables. Can anyone explain it to me in terms of nodes, please?

This question is related to: hadoop, hive, hiveql

The answers are:


Also keep in mind that Hive is a big data warehouse. When you drop a table you don't want to lose gigabytes or terabytes of data, and generating, moving and copying data at that scale can be time consuming. When you drop a 'Managed' table, Hive will also trash its data. When you drop an 'External' table, only the schema definition is removed from the Hive metastore; the data on HDFS still remains.


An internal table's data is stored in the warehouse folder, whereas an external table's data is stored at the location you specified when creating the table.

So when you delete an internal table, it deletes the schema as well as the data under the warehouse folder, but for an external table it's only the schema that you will lose.

So when you want an external table back after deleting it, you can create a table with the same schema again and point it to the original data location. Hope it is clear now.
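For example, a minimal HiveQL sketch of that round trip (the table name, columns and HDFS path are invented for illustration):

    CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';

    DROP TABLE page_views;   -- removes only the metastore entry; /data/page_views is untouched

    -- the same DDL re-creates the table on top of the surviving files
    CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/page_views';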


Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  2. Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  3. You want to use a custom location such as ASV (Azure Storage).
  4. Hive should not own the data and control settings, directories, etc.; you have another program or process that will do those things.
  5. You are not creating the table based on an existing table (AS SELECT); see the sketch after this list.

Use INTERNAL tables when:

  1. The data is temporary.
  2. You want Hive to completely manage the lifecycle of the table and data.
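On the AS SELECT point, a hedged illustration (table names invented): in the older Hive releases this advice targets, CREATE TABLE ... AS SELECT always produces a managed table and CREATE EXTERNAL TABLE ... AS SELECT is rejected (newer versions may relax this), so a table built from an existing one ends up internal:

    -- the CTAS result is a managed table living under the warehouse directory
    CREATE TABLE clicks_2020
    AS SELECT * FROM clicks WHERE year = 2020;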


To answer your question:

For external tables, Hive stores the data at the LOCATION specified during creation of the table (generally not in the warehouse directory). If the external table is dropped, then the table metadata is deleted but not the data.

For internal tables, Hive stores the data in its warehouse directory. If the table is dropped, then both the table metadata and the data will be deleted.
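A quick way to verify which case a given table falls into (the table name is illustrative):

    DESCRIBE FORMATTED my_table;
    -- check the "Table Type" row (MANAGED_TABLE vs. EXTERNAL_TABLE)
    -- and the "Location" row (warehouse directory vs. the LOCATION you supplied)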


For your reference,

Difference between Internal & External tables:

For External Tables -

  • An external table stores its files on HDFS, but the table is only loosely coupled to those source files; Hive just records where they live.

  • If you delete an external table, the files still remain on HDFS.

    For example, if you create an external table called “table_test” in Hive using HiveQL and link the table to the file “file”, then deleting “table_test” from Hive will not delete “file” from HDFS.

  • External table files are accessible to anyone who has access to the HDFS file structure, and therefore security needs to be managed at the HDFS file/folder level.

  • Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.


For Internal Tables-

  • Stored in a directory based on the hive.metastore.warehouse.dir setting; by default internal tables are stored under “/user/hive/warehouse”. You can change this by updating the location in the configuration file (see the sketch after this list).
  • Deleting the table deletes the metadata and the data from the master node and HDFS respectively.
  • Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (depends on the organization).
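To see where that warehouse actually sits on your cluster, a small check from the Hive CLI (the path shown is just the usual default, and dfs hands the command to the Hadoop file-system shell):

    SET hive.metastore.warehouse.dir;
    dfs -ls /user/hive/warehouse/;   -- each internal table gets a sub-directory here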

Hive may have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • Hive should not own the data and control settings, directories, etc.; you may have another program or process that will do those things.
  • You are not creating the table based on an existing table (AS SELECT).

Use INTERNAL tables when:

  • The data is temporary.
  • You want Hive to completely manage the life-cycle of the table and data.

Source:

HDInsight: Hive Internal and External Tables Intro

Internal & external tables in Hadoop- HIVE


I would like to add that:

  1. Internal tables are used when the data needs to be updated or some rows need to be deleted, because ACID properties can be supported on internal tables but not on external tables (see the sketch after this list).
  2. Please ensure that there is a backup of the data in an internal table, because if the internal table is dropped then the data will also be lost.
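A hedged sketch of such an ACID (transactional) internal table - all names are invented, and this assumes the metastore/transaction manager is configured for ACID; older Hive versions additionally require bucketed ORC storage, which is why the table is bucketed here:

    CREATE TABLE user_events (id INT, status STRING)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    UPDATE user_events SET status = 'inactive' WHERE id = 42;
    DELETE FROM user_events WHERE status = 'stale';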

The only difference in behaviour (not the intended usage) based on my limited research and testing so far (using Hive 1.1.0 -cdh5.12.0) seems to be that when a table is dropped

  • the data of the Internal (Managed) tables gets deleted from the HDFS file system
  • while the data of the External tables does NOT get deleted from the HDFS file system.

(NOTE: see the section 'Managed and External Tables' in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL, which lists some other differences that I did not completely understand.)
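One related knob worth knowing about (check it against your Hive version; the table name is illustrative): the EXTERNAL table property can flip an existing table between managed and external behaviour without moving any data:

    ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');   -- DROP will now keep the data
    ALTER TABLE my_table SET TBLPROPERTIES ('EXTERNAL'='FALSE');  -- back to managed behaviour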

I believe Hive chooses the location where it needs to create the table based on the following precedence, from top to bottom:

  1. Location defined during the Table Creation
  2. Location defined for the database/schema in which the table is created.
  3. Default Hive warehouse directory (property hive.metastore.warehouse.dir in hive-site.xml)

When the "Location" option is not used during the "creation of a hive table", the above precedence rule is used. This is applicable for both Internal and External tables. This means an Internal table does not necessarily have to reside in the Warehouse directory and can reside anywhere else.

Note: I might have missed some scenarios, but based on my limited exploration, the behaviour of both internal and external tables seems to be the same except for the one difference (data deletion) described above. I tried the following scenarios for both internal and external tables.

  1. Creating table with and without Location option
  2. Creating table with and without Partition Option
  3. Adding new data using the Hive Load and Insert Statements
  4. Adding data files to the table location outside of Hive (using HDFS commands) and refreshing the table using the "MSCK REPAIR TABLE" command (see the sketch after this list)
  5. Dropping the tables
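Scenario 4 sketched for a partitioned table, run from the Hive CLI (paths and names are made up; this assumes the table page_views is partitioned by dt and has LOCATION '/data/page_views'):

    dfs -mkdir -p /data/page_views/dt=2017-01-01;
    dfs -put events.tsv /data/page_views/dt=2017-01-01/;   -- file placed directly on HDFS, bypassing Hive's LOAD/INSERT
    MSCK REPAIR TABLE page_views;                           -- registers the new partition in the metastore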

Internal tables are useful if you want Hive to manage the complete lifecycle of your data including the deletion, whereas external tables are useful when the files are being used outside of Hive.


INTERNAL: the table is created first and the data is loaded later.

EXTERNAL: the data is already present and the table is created on top of it.
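Side by side, with invented names and paths:

    -- INTERNAL: table first, data later (LOAD moves the file into the warehouse directory)
    CREATE TABLE sales (id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    LOAD DATA INPATH '/staging/sales.csv' INTO TABLE sales;

    -- EXTERNAL: data already sits in HDFS, the table is just a schema laid on top of it
    CREATE EXTERNAL TABLE sales_ext (id INT, amount DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/sales/';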


In simple words, there are two things:

Hive manages data inside its warehouse, i.e. it will not delete data that lives outside the warehouse. When we drop a table:

1) For internal tables the data is managed internally, inside the warehouse, so it will be deleted.

2) For external tables the data is managed externally, outside the warehouse, so Hive cannot delete it, and clients other than Hive can also use it.


If you drop an external table, only the schema of the table is deleted; the table data still exists at its physical location. To delete the data as well, use hadoop fs -rm -r <table_location>. For managed tables Hive has full control over the data; for external tables the user keeps control over it.


Consider this scenario, which is best suited for an external table:

A MapReduce (MR) job filters a huge log file and spits out n sub log files (e.g. each sub log file contains logs of a specific message type), and the output, i.e. the n sub log files, is stored in HDFS.

These log files are to be loaded into Hive tables for further analytics. In this scenario I would recommend external table(s), because the actual log files are generated and owned by an external process, i.e. the MR job; besides, you can avoid the additional step of loading each generated log file into its respective Hive table.
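A hedged sketch of that layout (all names and paths are invented): the MR job writes one directory per message type, and a partitioned external table is laid over the output:

    CREATE EXTERNAL TABLE mr_logs (ts STRING, message STRING)
    PARTITIONED BY (msg_type STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/mr/output/logs';

    ALTER TABLE mr_logs ADD PARTITION (msg_type='ERROR') LOCATION '/mr/output/logs/ERROR';
    ALTER TABLE mr_logs ADD PARTITION (msg_type='WARN')  LOCATION '/mr/output/logs/WARN';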


When there is data already in HDFS, an external Hive table can be created to describe the data. It is called EXTERNAL because the location of its data is specified in the LOCATION clause instead of the default warehouse directory.

When keeping data in the internal tables, Hive fully manages the life cycle of the table and data. This means the data is removed once the internal table is dropped. If the external table is dropped, the table metadata is deleted but the data is kept. Most of the time, an external table is preferred to avoid deleting data along with tables by mistake.


For managed tables, Hive controls the lifecycle of their data. Hive stores the data for managed tables in a sub-directory under the directory defined by hive.metastore.warehouse.dir by default.

When we drop a managed table, Hive deletes the data in the table. But managed tables are less convenient for sharing with other tools. For example, let's say we have data that is created and used primarily by Pig, but we want to run some queries against it without giving Hive ownership of the data.

In that case, an external table is defined that points to that data but doesn't take ownership of it.


In Hive we can also create an external table. It tells Hive to refer to data that is at an existing location outside the warehouse directory. Dropping external tables will delete the metadata but not the data.


The best use case for an external table in Hive is when you want to create the table from a file, either CSV or plain text.


An external Hive table has the advantage that it does not remove the files when we drop the table, and we can set row formats with different settings, like a SerDe or DELIMITED.
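Picking up those two points, a hedged sketch of two common row formats for an external table over CSV/text files (table names and paths are invented; note that OpenCSVSerde exposes every column as a string):

    CREATE EXTERNAL TABLE csv_simple (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/csv_simple';

    -- the CSV SerDe handles quoted fields, but all columns come back as strings
    CREATE EXTERNAL TABLE csv_quoted (id STRING, name STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    LOCATION '/data/csv_quoted';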


Hive stores only the metadata in the metastore and the original data outside of Hive. When we use an external table we can give a LOCATION, and because of this our original data won't be affected when we drop the table.

