[linux] How to count number of unique values of a field in a tab-delimited text file?

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,

Red     Ball 1 Sold
Blue    Bat  5 OnSale
............... 

So, its like the first column has colors, so I want to know how many different unique values are there in that column and I want to be able to do that for each column.

I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.

What if I wanted a count of these unique values as well?

Update: I guess I didn't put the second part clearly enough. What I wanted to do is to have a count of "each" of these unique values not know how many unique values are there. For instance, in the first column I want to know how many Red, Blue, Green etc coloured objects are there.

This question is related to linux bash command-line

The answer is


# COLUMN is integer column number
# INPUT_FILE is input file name

cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l

This script outputs the number of unique values in each column of a given file. It assumes that first line of given file is header line. There is no need for defining number of fields. Simply save the script in a bash file (.sh) and provide the tab delimited file as a parameter to this script.

Code

#!/bin/bash

awk '
(NR==1){
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
} 
(NR!=1){
    for(fi=1; fi<=NF; fi++) 
        arr[fname[fi]][$fi]++;
} 
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' $1

Execution Example:

bash> ./script.sh <path to tab-delimited file>

Output Example

isRef    A_15      C_42     G_24     T_18
isCar    YEA_10    NO_40    NA_50
isTv     FALSE_33  TRUE_66

awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv

Assuming the data file is actually Tab separated, not space aligned:

<test.tsv awk '{print $4}' | sort | uniq

Where $4 will be:

  • $1 - Red
  • $2 - Ball
  • $3 - 1
  • $4 - Sold

You can use awk, sort & uniq to do this, for example to list all the unique values in the first column

awk < test.txt '{print $1}' | sort | uniq

As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l


Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.

#!/bin/bash
# Syntax: $0 filename   
# The input is assumed to be a .tsv file

FILE="$1"

cols=$(sed -n 1p $FILE | tr -cd '\t' | wc -c)
cols=$((cols + 2 ))
i=0
for ((i=1; i < $cols; i++))
do
  echo Column $i ::
  cut -f $i < "$FILE" | sort | uniq -c
  echo
done

Examples related to linux

grep's at sign caught as whitespace How to prevent Google Colab from disconnecting? "E: Unable to locate package python-pip" on Ubuntu 18.04 How to upgrade Python version to 3.7? Install Qt on Ubuntu Get first line of a shell command's output Cannot connect to the Docker daemon at unix:/var/run/docker.sock. Is the docker daemon running? Run bash command on jenkins pipeline How to uninstall an older PHP version from centOS7 How to update-alternatives to Python 3 without breaking apt?

Examples related to bash

Comparing a variable with a string python not working when redirecting from bash script Zipping a file in bash fails How do I prevent Conda from activating the base environment by default? Get first line of a shell command's output Fixing a systemd service 203/EXEC failure (no such file or directory) /bin/sh: apt-get: not found VSCode Change Default Terminal Run bash command on jenkins pipeline How to check if the docker engine and a docker container are running? How to switch Python versions in Terminal?

Examples related to command-line

Git is not working after macOS Update (xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools) Flutter command not found Angular - ng: command not found how to run python files in windows command prompt? How to run .NET Core console app from the command line Copy Paste in Bash on Ubuntu on Windows How to find which version of TensorFlow is installed in my system? How to install JQ on Mac by command-line? Python not working in the command line of git bash Run function in script from command line (Node JS)