[bash] Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address per line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that counts, for each IP address, how many times it appears in the file. For the previous input you need the following output:

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

One way to do this is:

cat ip_addresses |uniq |while read ip
do
    echo -n $ip" "
    grep -c $ip ip_addresses
done

However, it is really far from efficient.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved with perl or awk; I'm interested in a better solution in bash, not in those languages.)

ADDITIONAL INFO:

Suppose that the source file is 5 GB and the machine running the algorithm has 4 GB of memory. So sorting is not an efficient solution, nor is reading the file more than once.

I liked the hashtable-like solution - can anybody provide improvements to it?

ADDITIONAL INFO #2:

Some people asked why I would bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine where I had to do this, perl wasn't available. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.

So please, don't blame the question, just ignore it if you don't like it. :-)


I feel an awk associative array is also handy in this case:

$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt


For summing up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

US|A|3000
US|B|3000
US|C|3000
UK|1|9000

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is shorter and more concise than anything you could write in Perl or awk.

You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
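
If memory is the binding constraint, GNU sort also lets you cap its in-memory buffer and choose where the temporary spill files go. A hedged sketch (the -S and -T options are GNU sort specific; the trailing awk merely reorders the columns into the "ip count" format asked for):

# GNU sort: -S caps the in-memory buffer, -T picks the temp-file directory
sort -S 1G -T /var/tmp ip_addresses | uniq -c | awk '{print $2, $1}'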


I'd have done it like this:

perl -e 'while (<>) {chomp; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses

but uniq might work for you.


I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:

mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)

As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.

Edit: I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that would count occurrences.

mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1

The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as values.


The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the command's output to a variable and then loop through the results.
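
A minimal sketch of that loop (nothing extra is stored besides the two loop variables; plain sort is enough here because it only needs to bring identical lines together):

# read each "count address" pair emitted by uniq -c
while read -r count ip; do
    echo "$ip $count"
done < <(sort ip_addresses | uniq -c)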

PS

If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.


It seems that you either have to use a fair amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear (sort-based) versions.

Among those versions, saua's solution is the best (and simplest):

sort -n ip_addresses.txt | uniq -c

I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
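
That said, if your bash is version 4 or newer, associative arrays make the hash-in-bash approach fairly short. A minimal sketch, assuming the ip_addresses file from the question:

# one pass over the file, no external commands; requires bash 4+ for declare -A
declare -A count
while IFS= read -r ip; do
    ((count[$ip]++))
done < ip_addresses
for ip in "${!count[@]}"; do
    printf '%s %d\n' "$ip" "${count[$ip]}"
done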


Solution (group by, like MySQL)

grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n

Result

   3249 googleplus
   4211 linkedin
   5212 xing
   7928 facebook

Pure bash (no fork!)

There is a way using a bash function. This way is very quick as there is no fork!

...as long as the set of IP addresses stays small!

countIp () {
    local -a _ips=(); local _a
    # split each address on dots and pack the four octets into one integer index
    while IFS=. read -r -a _a ;do
        ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
    done
    # convert each index back to dotted-quad form and print its count
    for _a in ${!_ips[@]} ;do
        printf "%.16s %4d\n" \
          $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
    done
}

Note: each IP address is converted to a 32-bit unsigned integer, which is used as the array index. This uses plain bash indexed arrays, not an associative array (which is more expensive)!

time countIp < ip_addresses 
10.0.10.1    3
10.0.10.2    1
10.0.10.3    1
real    0m0.001s
user    0m0.004s
sys     0m0.000s

time sort ip_addresses | uniq -c
      3 10.0.10.1
      1 10.0.10.2
      1 10.0.10.3
real    0m0.010s
user    0m0.000s
sys     0m0.000s

On my host, doing so is a lot quicker than using forks, up to approximately 1,000 addresses, but it takes roughly one full second when I try to sort and count 10,000 addresses.


You can probably use the file system itself as a hash table. Pseudo-code as follows:

for every entry in the ip address file; do
  let addr denote the ip address;

  if file "addr" does not exist; then
    create file "addr";
    write a number "0" in the file;
  else 
    read the number from "addr";
    increase the number by 1 and write it back;
  fi
done

In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
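
A hedged, runnable version of that pseudo-code (the ip_counts directory name is only for illustration; one file per distinct address is only practical while the number of distinct addresses stays modest):

mkdir -p ip_counts
while IFS= read -r addr; do
    f="ip_counts/$addr"
    if [[ -f $f ]]; then
        # read the current count and write it back incremented
        n=$(<"$f")
        echo $((n + 1)) > "$f"
    else
        echo 1 > "$f"
    fi
done < ip_addresses

# print "address count" for every counter file
for f in ip_counts/*; do
    printf '%s %s\n' "${f##*/}" "$(<"$f")"
done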


Most of the other solutions count duplicates. If you really need to group key value pairs, try this:

Here is my example data (assume this listing is saved as table.txt):

find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt

This will print the key value pairs grouped by the md5 checksum.

cat table.txt | awk '{print $1}' | sort | uniq | xargs -I{} grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
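
If the key is always the first field, a single sort produces the same grouping without re-reading the file once per key; a minimal sketch under that assumption:

# rows sharing the first field (the checksum) end up adjacent, i.e. grouped
sort -k1,1 table.txt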


Sort may be omitted if the input is already grouped (uniq -c only counts consecutive identical lines):

uniq -c <source_file>

or

echo "$list" | uniq -c

if the source list is a variable


cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'

This command gives you the desired output (address followed by count, most frequent first).

