How to delete and update a record in Hive

Question

I have installed Hadoop, Hive, and Hive JDBC, which are running fine for me. But I still have a problem: how do I delete or update a single record using Hive? The DELETE and UPDATE commands of MySQL do not work in Hive. Thanks.

    hive> delete from student where id=1;
    Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
    Query returned non-zero code: 1, cause: null

User · Answer

The CLI told you where your mistake is: delete WHAT? from student ...

Delete: How to delete/truncate tables from Hadoop-Hive?

Update: Update, SET option in Hive

User · Answer

If you want to delete all records, then as a workaround load an empty file into the table in OVERWRITE mode:

    hive> LOAD DATA LOCAL INPATH '/root/hadoop/textfiles/empty.txt' OVERWRITE INTO TABLE employee;
    Loading data to table default.employee
    Table default.employee stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
    OK
    Time taken: 0.19 seconds

    hive> SELECT * FROM employee;
    OK
    Time taken: 0.052 seconds
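Since Hive 0.11, TRUNCATE TABLE is another way to empty a table without preparing an empty file. A minimal sketch, reusing the employee table from the answer above and assuming it is a managed (non-external) table:

```sql
-- Assumption: employee is a managed table; TRUNCATE is rejected for external tables.
TRUNCATE TABLE employee;

-- Check that no rows are left.
SELECT COUNT(*) FROM employee;
```

Both approaches remove every row; neither can remove a selected subset.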

User · Answer

DELETE was added in Hive version 0.14. Deletes can only be performed on tables that support ACID. Below is the link from Apache:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete

User · Answer

To achieve your current need, you need to run the query below:

    > insert overwrite table student
    > select * from student
    > where id <> 1;

This overwrites the contents of the table with all of its rows except the ones you want to exclude (delete). I tried this on Hive 1.2.1.

User · Answer

Yes, rightly said: Hive does not support the UPDATE option. But the following alternative could be used to achieve the result:

Update records in a partitioned Hive table:

1. The main table is assumed to be partitioned by some key.
2. Load the incremental data (the data to be updated) into a staging table partitioned with the same keys as the main table.
3. Join the two tables (main & staging tables) using a LEFT OUTER JOIN operation as below:

    insert overwrite table main_table partition (c,d)
    select t2.a, t2.b, t2.c, t2.d
    from staging_table t2 left outer join main_table t1 on t1.a = t2.a;

In the above example, the main table & the staging table are partitioned using the (c,d) keys. The tables are joined via a LEFT OUTER JOIN and the result is used to OVERWRITE the partitions in the main table.

A similar approach could be used in the case of un-partitioned Hive table UPDATE operations too.
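Note that the query above selects only staging columns, so rows of an overwritten partition that have no match in the staging table are not carried over. If those rows should survive, one possible variant is sketched below. It assumes column a is a unique key, that both tables have data columns (a, b) and partition columns (c, d), and that a row's partition values do not change between main and staging; the table and column names are taken from the example above and are otherwise hypothetical.

```sql
-- Dynamic partitioning must be enabled for PARTITION (c, d) without fixed values.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE main_table PARTITION (c, d)
SELECT u.a, u.b, u.c, u.d
FROM (
  -- New and updated rows come from the staging table.
  SELECT s.a, s.b, s.c, s.d
  FROM staging_table s
  UNION ALL
  -- Unchanged rows: main rows whose key does not appear in staging.
  SELECT m.a, m.b, m.c, m.d
  FROM main_table m
  LEFT OUTER JOIN staging_table s2 ON m.a = s2.a
  WHERE s2.a IS NULL
) u;
```

Only partitions that receive at least one row are rewritten, so partitions untouched by the staging data stay as they are.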

User · Answer

Once you have installed and configured Hive, create a simple table:

    hive> create table testTable(id int, name string) row format delimited fields terminated by ',';

Then try to insert a few rows into the test table:

    hive> insert into table testTable values (1,'row1'),(2,'row2');

Now try to delete the records you just inserted:

    hive> delete from testTable where id = 1;

Error!

    FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations.

By default, transactions are configured to be off. The message says that the transaction manager currently in use does not support update or delete. To support update/delete, you must change the following configuration:

    cd $HIVE_HOME
    vi conf/hive-site.xml

Add the properties below to the file:

    <property>
      <name>hive.support.concurrency</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.enforce.bucketing</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
    </property>
    <property>
      <name>hive.txn.manager</name>
      <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>
    <property>
      <name>hive.compactor.initiator.on</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.compactor.worker.threads</name>
      <value>2</value>
    </property>

Restart the service and then try the delete command again.

Error!

    FAILED: LockException [Error 10280]: Error communicating with the metastore

There is a problem with the metastore. In order to use insert/update/delete operations, you need to change the following configuration in conf/hive-site.xml, as the feature is currently in development:

    <property>
      <name>hive.in.test</name>
      <value>true</value>
    </property>

Restart the service and then run the delete command again:

    hive> delete from testTable where id = 1;

Error!

    FAILED: SemanticException [Error 10297]: Attempt to do update or delete on table default.testTable that does not use an AcidOutputFormat or is not bucketed.

Only the ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.

Tables must be bucketed to make use of these features. Tables in the same system not using transactions and ACID do not need to be bucketed.

See the example below of a table built with the ORC file format, bucketing enabled, and ('transactional'='true'):

    hive> create table testTableNew(id int, name string) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

Insert:

    hive> insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');

Update:

    hive> update testTableNew set name = 'updateRow2' where id = 2;

Delete:

    hive> delete from testTableNew where id = 1;

Test:

    hive> select * from testTableNew;

User · Answer

You can delete rows from a table using a workaround, in which you overwrite the table with the dataset you want left in the table as a result of your operation:

    insert overwrite table your_table
        select * from your_table
        where id <> 1;

The workaround is useful mostly for bulk deletions of easily identifiable rows. Also, obviously, doing this can muck up your data, so back up the table first and plan the "deletion" rule with care.

User · Answer

Recently I was looking to resolve a similar issue. Apache Hive and Hadoop do not support Update/Delete operations, so you have two ways:

1. Use a backup table: save the whole table in a backup table, then truncate your input table, then re-write only the data you are interested in keeping.
2. Use Uber Hudi: it's a framework created by Uber to resolve the HDFS limitations, including deletion and update. You can have a look at this link: https://eng.uber.com/hoodie/

An example for point 1:

    Create table bck_table like input_table;
    Insert overwrite table bck_table
        select * from input_table;
    Truncate table input_table;
    Insert overwrite table input_table
        select * from bck_table where id <> 1;

NB: If the input table is an external table, you must follow this link: How to truncate a partitioned external table in hive?

User · Answer

The upcoming version of Hive is going to allow SET-based update/delete handling, which is of utmost importance when trying to do CRUD operations on a "bunch" of rows instead of taking one row at a time.

In the interim, I have tried a dynamic-partition-based approach documented here: http://linkd.in/1Fq3wdb

Please see if it suits your need.

User · Answer

As of Hive version 0.14.0, INSERT...VALUES, UPDATE, and DELETE are now available with full ACID support.

INSERT ... VALUES Syntax:

    INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is:

    ( value [, value ...] )

where a value is either null or any valid SQL literal.

UPDATE Syntax:

    UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

DELETE Syntax:

    DELETE FROM tablename [WHERE expression]

Additionally, from the Hive Transactions doc:

    "If a table is to be used in ACID writes (insert, update, delete) then the table property "transactional" must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited."

Hive DML reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Hive Transactions reference: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

User · Answer

Good news: inserts, updates, and deletes are now possible on Hive/Impala using Kudu.

You need to use Impala with Kudu to maintain the tables and to insert, update, and delete records. Details with examples can be found here: insert-update-delete-on-hadoop

Please share the news if you are excited.

-MIK
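As a rough illustration of what this looks like, here is a minimal sketch in Impala SQL (run from impala-shell, not the Hive CLI). It assumes Impala 2.8 or later with a running Kudu cluster; the table name student_kudu and the partition count are hypothetical.

```sql
-- Kudu tables need a primary key and a partitioning scheme.
CREATE TABLE student_kudu (
  id INT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

INSERT INTO student_kudu VALUES (1, 'row1'), (2, 'row2');

-- Row-level changes are applied in place by Kudu.
UPDATE student_kudu SET name = 'updateRow2' WHERE id = 2;
DELETE FROM student_kudu WHERE id = 1;
```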

User · Answer

There are a few properties to set to make a Hive table support ACID properties and to support UPDATE, INSERT, and DELETE as in SQL.

Conditions to create an ACID table in Hive:

1. The table should be stored as an ORC file. Only the ORC format can support ACID properties for now.
2. The table must be bucketed.

Properties to set to create an ACID table:

    set hive.support.concurrency = true;
    set hive.enforce.bucketing = true;
    set hive.exec.dynamic.partition.mode = nonstrict;
    set hive.compactor.initiator.on = true;
    set hive.compactor.worker.threads = 1;
    set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

Also set the property hive.in.test to true in hive-site.xml.

After setting all these properties, the table should be created with the table property 'transactional'='true'. The table should be bucketed and stored as ORC:

    CREATE TABLE table_name (col1 int, col2 string, col3 int) CLUSTERED BY (col1) INTO 4 BUCKETS STORED AS orc tblproperties('transactional'='true');

Now the Hive table can support UPDATE and DELETE queries.

User · Answer

Configuration Values to Set for INSERT, UPDATE, DELETE

In addition to the new parameters listed above, some existing parameters need to be set to support INSERT ... VALUES, UPDATE, and DELETE.

    Configuration key                   Must be set to
    hive.support.concurrency            true (default is false)
    hive.enforce.bucketing              true (default is false) (Not required as of Hive 2.0)
    hive.exec.dynamic.partition.mode    nonstrict (default is strict)

Configuration Values to Set for Compaction

If the data in your system is not owned by the Hive user (i.e., the user that the Hive metastore runs as), then Hive will need permission to run as the user who owns the data in order to perform compactions. If you have already set up HiveServer2 to impersonate users, then the only additional work to do is assure that Hive has the right to impersonate users from the host running the Hive metastore. This is done by adding the hostname to hadoop.proxyuser.hive.hosts in Hadoop's core-site.xml file. If you have not already done this, then you will need to configure Hive to act as a proxy user. This requires you to set up keytabs for the user running the Hive metastore and add hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to Hadoop's core-site.xml file. See the Hadoop documentation on secure mode for your version of Hadoop (e.g., for Hadoop 2.5.1 it is at Hadoop in Secure Mode).

UPDATE Statement

The UPDATE statement has the following limitations:

- The expression in the WHERE clause must be an expression supported by a Hive SELECT clause.
- Partition and bucket columns cannot be updated.
- Query vectorization is automatically disabled for UPDATE statements. However, updated tables can still be queried using vectorization.
- Subqueries are not allowed on the right side of the SET statement.

The following example demonstrates the correct usage of this statement:

    UPDATE students SET name = null WHERE gpa <= 1.0;

DELETE Statement

Use the DELETE statement to delete data already written to Apache Hive.

    DELETE FROM tablename [WHERE expression];

The DELETE statement has the following limitation: query vectorization is automatically disabled for the DELETE operation. However, tables with deleted data can still be queried using vectorization.

The following example demonstrates the correct usage of this statement:

    DELETE FROM students WHERE gpa <= 1.0;

User · Answer

Updating or deleting a record isn't allowed in Hive, but INSERT INTO is acceptable. A snippet from Hadoop: The Definitive Guide (3rd edition):

    "Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well."

    "Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table."

User · Answer

You should not think about Hive as a regular RDBMS. Hive is better suited for batch processing over very large sets of immutable data.

The following applies to versions prior to Hive 0.14; see the answer by ashtonium for later versions.

There is no operation supported for deletion or update of a particular record or particular set of records, and to me this is more a sign of a poor schema.

Here is what you can find in the official documentation:

    "Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when data sets involved are very small (say a few hundred megabytes). As a result it cannot be compared with systems such as Oracle where analyses are conducted on a significantly smaller amount of data but the analyses proceed much more iteratively with the response times between iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries."

    "Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs)."

A way to work around this limitation is to use partitions. I don't know what your id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop partitions for the ids you want to get rid of.
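A minimal sketch of that partition-based workaround, assuming the student table from the question has just the columns (id, name); the table name student_by_id is hypothetical. This only makes sense when id identifies a batch of rows, since one partition per individual row would create a very large number of tiny partitions.

```sql
-- Hypothetical redesign: same data, but partitioned by id.
CREATE TABLE student_by_id (name STRING)
PARTITIONED BY (id INT);

-- Copy the existing rows; the partition column must come last in the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE student_by_id PARTITION (id)
SELECT name, id FROM student;

-- "Delete" every row with id = 1 by dropping its partition.
ALTER TABLE student_by_id DROP IF EXISTS PARTITION (id = 1);
```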
