Hive External Table Skip First Row

Question

I am using Cloudera s version of Hive and trying to create an external table over a csv file that contains the column names in the first column   Here is the code that I am using to do that   CREATE EXTERNAL TABLE Test      RecordId int     FirstName string     LastName string     ROW FORMAT serde  com bizo hive serde csv CSVSerde   WITH SerDeProperties        separatorChar           STORED AS TEXTFILE  LOCATION   user File csv    Sample Data  RecordId FirstName LastName 1  John   Doe  2  Jane   Doe    Can anyone help me with how to skip the first row or do I need to add an intermediate step

User · Answer

While you have your answer from Daniel  here are some customizations possible using OpenCSVSerde   CREATE EXTERNAL TABLE  mydb   mytable        product name  string       brand id  string       brand  string       color  string       description  string       sale price  string  PARTITIONED BY        seller id  string  ROW FORMAT SERDE      org apache hadoop hive serde2 OpenCSVSerde  WITH SERDEPROPERTIES        separatorChar      t        quoteChar              escapeChar          STORED AS INPUTFORMAT      org apache hadoop mapred TextInputFormat  OUTPUTFORMAT      org apache hadoop hive ql io HiveIgnoreKeyTextOutputFormat  LOCATION      hdfs   namenode com port data mydb mytable  TBLPROPERTIES        serialization null format             skip header line count     1     With this  you have total control over the separator  quote character  escape character  null handling and header handling   Look here and here

User · Answer

I also struggled with this and found no way to tell hive to skip first row  like there is e g  in Greenplum  So finally I had to remove it from the files  e g   cat File csv   grep -v RecordId   File no header csv

User · Answer

Header rows in data are a perpetual headache in Hive  Short of modifying the Hive source  I believe you can t get away without an intermediate step   Edit  This is no longer true  see update below   Unfortunately  that answers you question  I ll throw in some ideas for the intermediate step for completeness   You can get away without an extra step in your data load if you are willing to filter out the header row on every query that touches the table  Unfortunately this adds an extra set just about everywhere else  And you will have to get clever messy when the header row violates your schema  If you go with this approach  you might consider writing a custom SerDe that makes this row easier to filter  Unfortunately  SerDe s cannot remove the row entirely  or that might form a possible solution   they must return something like null  I ve never seen this approach taken in practice to deal with header rows since it makes reading a pain  and reading tends to be much more common than writing  It might have a place if you are dealing with one-of tables or if the header row is just one row among many malformed rows   You could do this filtering once with variations on deleting that first row in data load  A WHERE clause in an INSERT statement would do it  You could use utilities like sed to get rid of it  I ve seen both approaches taken  There are trade-offs between which approach you take and neither is the one true way to deal with header rows  Unfortunately  both these approaches take time and require temporary duplication of the data  If you absolutely need the header row for another application  the duplication would be permanent   Update   From Hive v0 13 0  you can use skip header line count  You could also specify the same while creating the table  For example   create external table testtable  name string  message string  row format delimited  fields terminated by   t   lines terminated by   n   location   testtable  tblproperties   skip header line count   1

User · Answer

Just append below property in your query and the first header or line int the record will not load or it will be skipped   Try this  tblproperties   skip header line count   1

User · Answer

create external table table name   Year int   Month int  column name data type    row format delimited fields terminated by      location   user user name example data  TBLPROPERTIES  serialization null format       skip header line count   1

User · Answer

skip header line count will skip the header line   However  if you have some external tool accessing accessing the table  it will still see that actual data without skipping those lines

User · Answer

Just for those who have already created the table with the header  Here is the alter command for the same  This is useful in case you already have the table and want the first row to be ignored without dropping and recreating  It also helps with people to familiarize with ALTER as a option with TBLPROPERTIES   ALTER TABLE tablename SET TBLPROPERTIES   skip header line count   1

User · Answer

I am not quite sure if it works with ROW FORMAT serde  com bizo hive serde csv CSVSerde   but I guess that it should be similar to ROW FORMAT DELIMITED FIELDS TERMINATED BY       In your case first row will be treated like normal row  But first field fails to be INT so all fields  for first row  will be set as NULL  You need only one intermediate step to fix it   INSERT OVERWRITE TABLE Test SELECT   from Test WHERE RecordId IS NOT NULL   Only one drawback is that your original csv file will be modified  I hope it helps  GL

[hive] Hive External Table Skip First Row

Examples related to hive

Examples related to cloudera