Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv
. But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30, 33? As an addendum, is there any way to extract directly with the header names rather than with column numbers?
Other languages have short cuts for ranges of field numbers, but not awk, you'll have to write your code as your fear ;-)
awk -F, 'BEGIN {OFS=","} { print $1, $2, $3, $4 ..... $30, $33}' infile.csv > outfile.csv
There is no direct function in awk to use field names as column specifiers.
I hope this helps.
Not using awk but the simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
csvtool format '%(2),%(3),%(4)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
If Perl is an option:
perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'
-a
autosplits line into @F
fields array. Indices start at 0 (not 1 as in awk)
-F,
field separator is ,
If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS
are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'
I provided more explanation within my answer here: parse csv file using gawk
Others have answered your earlier question. For this:
As an addendum, is there any way to extract directly with the header names rather than with column numbers?
I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.
for(i=0;i<$NF;i++){
hash[$i] = i;
}
Then later on, use it:
j = hash["header1"];
print $j;
You can use a for-loop to address a field with $i:
ls -l | awk '{for(i=3 ; i<8 ; i++) {printf("%s\t", $i)} print ""}'
Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:
name,sex,house_nr,height,shoe_size
arthur,m,42,181,11.5
berta,f,101,163,8.5
chris,m,1333,175,10
don,m,77,185,12.5
elisa,f,204,166,7
Then tblmap -k name,height test.csv
produces
name,height
arthur,181
berta,163
chris,175
don,185
elisa,166
As mentioned by @Tom, the cut and awk approaches actually don't work for CSVs with quoted strings. An alternative is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:
csvfilter -f 1,3,5 in.csv > out.csv
If you have python (and you should), you can install it simply like this:
pip install csvfilter
Please take note that the column indexing in csvfilter starts with 0 (unlike awk, which starts with $1). More info at https://github.com/codeinthehole/csvfilter/
Source: Stackoverflow.com