[bash] How to extract one column of a csv file

If I have a csv file, is there a quick bash way to print out the contents of only any single column? It is safe to assume that each row has the same number of columns, but each column's content would have different length.

This question is related to bash csv extract

The answer is


You could use GNU Awk, see this article of the user guide. As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this can not handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:

#!/bin/bash
gawk -- '
BEGIN{
    FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
    if (substr($c, 1, 1) == "\"") {
        $c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
        gsub("\"\"", "\"", $c)  # Normalize double quotes
    }
    print $c
}
' c=3 < <(dos2unix <textfile.csv)

Note the use of dos2unix to convert possible DOS style line breaks (CRLF i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as line break, see Wikipedia.

If the input may contain multiline fields, you can use the following script. Note the use of special string for separating records in output (since the default separator newline could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:

#!/bin/bash
gawk -- '
BEGIN{
    RS="\0" # Read the whole input file as one record;
    # assume there is no null character in input.
    FS="" # Suppose this setting eases internal splitting work.
    ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
    nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
    field=0;
    for (i=1; i<=nof; i++){
        field++
        if (field==c) {
            if (substr(a[i], 1, 1) == "\"") {
                a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within 
                # the two quotes.
                gsub(/""/, "\"", a[i])  # Normalize double quotes.
            }
            print a[i]
        }
        if (seps[i]!=",") field=0
    }
}
' c=3 < <(dos2unix <textfile.csv)

There is another approach to the problem. csvquote can output contents of a CSV file modified so that special characters within field are transformed so that usual Unix text processing tools can be used to select certain column. For example the following code outputs the third column:

csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u

csvquote can be used to process arbitrary large files.


You can also use while loop

IFS=,
while read name val; do
        echo "............................"

        echo Name: "$name"
done<itemlst.csv

I wonder why none of the answers so far have mentioned csvkit.

csvkit is a suite of command-line tools for converting to and working with CSV

csvkit documentation

I use it exclusively for csv data management and so far I have not found a problem that I could not solve using cvskit.

To extract one or more columns from a cvs file you can use the csvcut utility that is part of the toolbox. To extract the second column use this command:

csvcut -c 2 filename_in.csv > filename_out.csv 

csvcut reference page

If the strings in the csv are quoted, add the quote character with the q option:

csvcut -q '"' -c 2 filename_in.csv > filename_out.csv 

Install with pip install csvkit or sudo apt install csvkit.


You can't do it without a full CSV parser.


Landed here looking to extract from a tab separated file. Thought I would add.

cat textfile.tsv | cut -f2 -s

Where -f2 extracts the 2, non-zero indexed column, or the second column.


yes. cat mycsv.csv | cut -d ',' -f3 will print 3rd column.


I think the easiest is using csvkit:

Gets the 2nd column: csvcut -c 2 file.csv

However, there's also csvtool, and probably a number of other csv bash tools out there:

sudo apt-get install csvtool (for Debian-based systems)

This would return a column with the first row having 'ID' in it. csvtool namedcol ID csv_file.csv

This would return the fourth row: csvtool col 4 csv_file.csv

If you want to drop the header row:

csvtool col 4 csv_file.csv | sed '1d'


Here is a csv file example with 2 columns

myTooth.csv

Date,Tooth
2017-01-25,wisdom
2017-02-19,canine
2017-02-24,canine
2017-02-28,wisdom

To get the first column, use:

cut -d, -f1 myTooth.csv

f stands for Field and d stands for delimiter

Running the above command will produce the following output.

Output

Date
2017-01-25
2017-02-19
2017-02-24
2017-02-28

To get the 2nd column only:

cut -d, -f2 myTooth.csv

And here is the output Output

Tooth
wisdom
canine
canine
wisdom
incisor

Another use case:

Your csv input file contains 10 columns and you want columns 2 through 5 and columns 8, using comma as the separator".

cut uses -f (meaning "fields") to specify columns and -d (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.

cut -f 2-5,8 -d , myvalues.csv

cut is a command utility and here is some more examples:

SYNOPSIS
     cut -b list [-n] [file ...]
     cut -c list [file ...]
     cut -f list [-d delim] [-s] [file ...]

First we'll create a basic CSV

[dumb@one pts]$ cat > file 
a,b,c,d,e,f,g,h,i,k  
1,2,3,4,5,6,7,8,9,10  
a,b,c,d,e,f,g,h,i,k  
1,2,3,4,5,6,7,8,9,10

Then we get the 1st column

[dumb@one pts]$  awk -F , '{print $1}' file  
a  
1  
a  
1

I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do:

echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby

csvtool col 2 file.csv 

where 2 is the column you are interested in

you can also do

csvtool col 1,2 file.csv 

to do multiple columns


Been using this code for a while, it is not "quick" unless you count "cutting and pasting from stackoverflow".

It uses ${##} and ${%%} operators in a loop instead of IFS. It calls 'err' and 'die', and supports only comma, dash, and pipe as SEP chars (that's all I needed).

err()  { echo "${0##*/}: Error:" "$@" >&2; }
die()  { err "$@"; exit 1; }

# Return Nth field in a csv string, fields numbered starting with 1
csv_fldN() { fldN , "$1" "$2"; }

# Return Nth field in string of fields separated
# by SEP, fields numbered starting with 1
fldN() {
        local me="fldN: "
        local sep="$1"
        local fldnum="$2"
        local vals="$3"
        case "$sep" in
                -|,|\|) ;;
                *) die "$me: arg1 sep: unsupported separator '$sep'" ;;
        esac
        case "$fldnum" in
                [0-9]*) [ "$fldnum" -gt 0 ] || { err "$me: arg2 fldnum=$fldnum must be number greater or equal to 0."; return 1; } ;;
                *) { err "$me: arg2 fldnum=$fldnum must be number"; return 1;} ;;
        esac
        [ -z "$vals" ] && err "$me: missing arg2 vals: list of '$sep' separated values" && return 1
        fldnum=$(($fldnum - 1))
        while [ $fldnum -gt 0 ] ; do
                vals="${vals#*$sep}"
                fldnum=$(($fldnum - 1))
        done
        echo ${vals%%$sep*}
}

Example:

$ CSVLINE="example,fields with whitespace,field3"
$ $ for fno in $(seq 3); do echo field$fno: $(csv_fldN $fno "$CSVLINE");  done
field1: example
field2: fields with whitespace
field3: field3

The simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.

csvtool format '%(2)\n' input.csv

Replacing 2 with the column number will effectively extract the column data you are looking for.


The other answers work well, but since you asked for a solution using just the bash shell, you can do this:

AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10

And then you can pull out columns (the first in this example) like so:

AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1

So there's a couple of things going on here:

  • while IFS=, - this is saying to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So saying IFS=, is like saying "a,b" is the same as "a b" would be if the IFS=" " (which is what it is by default.)

  • read -a csv_line; - this is saying read in each line, one at a time and create an array where each element is called "csv_line" and send that to the "do" section of our while loop

  • do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we're saying echo the 0th element of the array "csv_line". This action is repeated on every line of the file. The < file part is just telling the while loop where to read from. NOTE: remember, in bash, arrays are 0 indexed, so the first column is the 0th element.

So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.


Many answers for this questions are great and some have even looked into the corner cases. I would like to add a simple answer that can be of daily use... where you mostly get into those corner cases (like having escaped commas or commas in quotes etc.,).

FS (Field Separator) is the variable whose value is dafaulted to space. So awk by default splits at space for any line.

So using BEGIN (Execute before taking input) we can set this field to anything we want...

awk 'BEGIN {FS = ","}; {print $3}'

The above code will print the 3rd column in a csv file.


Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to csv

Pandas: ValueError: cannot convert float NaN to integer Export result set on Dbeaver to CSV Convert txt to csv python script How to import an Excel file into SQL Server? "CSV file does not exist" for a filename with embedded quotes Save Dataframe to csv directly to s3 Python Data-frame Object has no Attribute (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape How to write to a CSV line by line? How to check encoding of a CSV file

Examples related to extract

python pandas extract year from datetime: df['year'] = df['date'].year is not working Accessing last x characters of a string in Bash How to extract one column of a csv file How can I extract the folder path from file path in Python? Extract and delete all .gz in a directory- Linux Read Content from Files which are inside Zip file Get string after character how to extract only the year from the date in sql server 2008? In Excel, how do I extract last four letters of a ten letter string? How can I extract all values from a dictionary in Python?