[hive] Hive External Table Skip First Row

I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that.

CREATE EXTERNAL TABLE Test ( 
  RecordId int, 
  FirstName string, 
  LastName string 
) 
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' 
WITH SerDeProperties (  
  "separatorChar" = ","
) 
STORED AS TEXTFILE 
LOCATION '/user/File.csv'

Sample Data

RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"

Can anyone help me with how to skip the first row or do I need to add an intermediate step?

This question is related to hive cloudera

The answer is


Just append below property in your query and the first header or line int the record will not load or it will be skipped.

Try this

tblproperties ("skip.header.line.count"="1");

Just for those who have already created the table with the header. Here is the alter command for the same. This is useful in case you already have the table and want the first row to be ignored without dropping and recreating. It also helps with people to familiarize with ALTER as a option with TBLPROPERTIES.

ALTER TABLE tablename SET TBLPROPERTIES ("skip.header.line.count"="1");

I also struggled with this and found no way to tell hive to skip first row, like there is e.g. in Greenplum. So finally I had to remove it from the files. e.g. "cat File.csv | grep -v RecordId > File_no_header.csv"


create external table table_name( 
Year int, 
Month int,
column_name data_type ) 
row format delimited fields terminated by ',' 
location '/user/user_name/example_data' TBLPROPERTIES('serialization.null.format'='', 'skip.header.line.count'='1');

I am not quite sure if it works with ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' but I guess that it should be similar to ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
In your case first row will be treated like normal row. But first field fails to be INT so all fields, for first row, will be set as NULL. You need only one intermediate step to fix it:

INSERT OVERWRITE TABLE Test
SELECT * from Test WHERE RecordId IS NOT NULL

Only one drawback is that your original csv file will be modified. I hope it helps. GL!


While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde:

CREATE EXTERNAL TABLE `mydb`.`mytable`(
    `product_name` string,
    `brand_id` string,
    `brand` string,
    `color` string,
    `description` string,
    `sale_price` string)
PARTITIONED BY (
    `seller_id` string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '\t',
    'quoteChar' = '"',
    'escapeChar' = '\\')
STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
    'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
    'serialization.null.format' = '',
    'skip.header.line.count' = '1')

With this, you have total control over the separator, quote character, escape character, null handling and header handling.

Look here and here.


skip.header.line.count will skip the header line.

However, if you have some external tool accessing accessing the table, it will still see that actual data without skipping those lines