[bash] How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:

  1. The index of the identifier
  2. The identifier value

I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).

Any ideas, taking performance into special consideration?

This question is related to: bash, csv, shell

Answers:


A sed or awk solution would probably be shorter, but here's one for Perl:

perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"'

where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
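
For example, to print the rows whose second column (index 1) is exactly "foo" (the file name and values here are just placeholders):

perl -F/,/ -ane 'print if $F[1] eq "foo"' inputfile.csv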


For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.

If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.

You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):

csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' | csvquote -u

This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
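
For instance, with a tiny sample file containing an embedded comma (the data and the idx/val values below are purely illustrative, and this assumes csvquote has been built and installed from the repository above):

printf '%s\n' 'Name,Phone' '"Woo, John",425-555-1212' > sample.csv
csvquote sample.csv | awk -F, -v idx=2 -v val=425-555-1212 '$idx == val' | csvquote -u
# prints: "Woo, John",425-555-1212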


index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file

In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:

Name,Phone
"Woo, John",425-555-1212

You really need a library package that offers robust CSV support instead of relying on a comma as the field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:

#!/usr/bin/env tclsh

package require csv 
package require Tclx

# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue

# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1

for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue eq $expectedValue} {
        puts $line
    }   
}

Save this script to a file called csv.tcl and invoke it as:

$ tclsh csv.tcl filename indexNumber expectedValue

Explanation

The script reads the CSV file line by line and stores each line in the variable $line, then it splits each line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.


See this YouTube video: BASH scripting lesson 10 - working with CSV files

CSV file:

Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME

Bash script:

#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read -r user job uid location
 do

    echo -e "$user \
    ======================\n\
    Role :\t $job\n\
    ID :\t $uid\n\
    SITE :\t $location\n"
 done < "$1"
 IFS=$OLDIFS

Output:

Bob Brown     ======================
    Role :   Manager
    ID :     16581
    SITE :   Main

Sally Seaforth     ======================
    Role :   Director
    ID :     4678
    SITE :   HOME

Using awk:

export INDEX=2
export VALUE=bar

awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv

Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:

awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' inputfile.csv

Jeez...with variables, and everything, awk is almost a real programming language...
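
A quick sanity check with made-up data (the file contents below are purely illustrative):

printf '%s\n' 'foo,bar,baz' 'one,two,three' > inputfile.csv
export INDEX=2 VALUE=bar
awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' inputfile.csv
# prints: foo,bar,baz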


I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. It turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV.)

#!/usr/bin/env python3

import csv
import sys

# Read CSV records from stdin and write them out as tab-separated values
reader = csv.reader(sys.stdin)
for row in reader:
    print('\t'.join(row))

Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:

> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,

In my scripts I'm actually going to parse the TSV output line by line and use read or cut to get the fields I need.
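
For example, a rough sketch of that line-by-line variant, reusing the esxcli command above and reading each record into a bash array split on tabs:

esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py |
while IFS=$'\t' read -r -a fields; do
    echo "${fields[11]}"    # 0-based index 11 is the same column as cut -f12 (Uplinks)
done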


As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:

$ csvtool -t ',' col "$index" - < csvfile | grep "$value"

According to the docs, it handles escaping, quoting, etc.
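
If csvtool is installed, a quick check with a quoted field (sample data made up for illustration) suggests the embedded comma survives column extraction; it should print something like:

printf '%s\n' 'Name,Phone' '"Woo, John",425-555-1212' | csvtool col 1 -
# Name
# "Woo, John"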


CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.

So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV 'properly', bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
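
A minimal sketch of that last approach, wrapped in a shell one-liner so it fits in a bash script (a 0-based INDEX and the variable names are assumptions for illustration):

python3 -c '
import csv, sys
idx, val = int(sys.argv[1]), sys.argv[2]
w = csv.writer(sys.stdout, lineterminator="\n")
for row in csv.reader(sys.stdin):
    if len(row) > idx and row[idx] == val:
        w.writerow(row)
' "$INDEX" "$VALUE" < inputfile.csv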

