[bash] Best way to simulate "group by" from bash?

Suppose you have a file that contains IP addresses, one address per line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that counts, for each IP address, how many times it appears in the file. For the previous input you need the following output:

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

One way to do this is:

cat ip_addresses |uniq |while read ip
do
    echo -n $ip" "
    grep -c $ip ip_addresses
done

However, it is really far from efficient.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved with perl or awk; I'm interested in a better solution in bash, not in those languages.)

ADDITIONAL INFO:

Suppose that the source file is 5 GB and the machine running the algorithm has 4 GB of memory. So sorting is not an efficient solution, nor is reading the file more than once.

I liked the hashtable-like solution - can anybody provide improvements to it?

ADDITIONAL INFO #2:

Some people asked why I would bother doing it in bash when it is way easier in e.g. perl. The reason is that on the machine where I had to do this, perl wasn't available. It was a custom-built Linux machine without most of the tools I'm used to. And I think it was an interesting problem.

So please, don't blame the question, just ignore it if you don't like it. :-)


I feel an awk associative array is also handy in this case:

$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt


For summing up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

US|A|3000
US|B|3000
US|C|3000
UK|1|9000

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is shorter and more concise than anything you could write in Perl or awk.

You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
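
If memory is the binding constraint, GNU sort also lets you cap its in-memory buffer and choose where the temporary spill files go. A hedged sketch (the -S and -T options are GNU sort specific; the trailing awk merely reorders the columns into the "ip count" format asked for):

# GNU sort: -S caps the in-memory buffer, -T picks the temp-file directory
sort -S 1G -T /var/tmp ip_addresses | uniq -c | awk '{print $2, $1}'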


I'd have done it like this:

perl -e 'while (<>) {chomp; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses

but uniq might work for you.


I understand you are looking for something in Bash, but in case someone else might be looking for something in Python, you might want to consider this:

mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)

As values in the set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this might get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.

Edit: I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that would count occurrences.

mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1

The dictionary mydict now holds the unique IPs as keys and the number of times each occurred as values.


The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the command's output to a variable and then loop through the results.
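
A minimal sketch of that loop (nothing extra is stored besides the two loop variables; plain sort is enough here because it only needs to bring identical lines together):

# read each "count address" pair emitted by uniq -c
while read -r count ip; do
    echo "$ip $count"
done < <(sort ip_addresses | uniq -c)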

PS

If the sort command is omitted, you will not get the correct results as uniq only looks at successive identical lines.


It seems that you either have to use a fair amount of code to simulate hashes in bash to get linear behavior, or stick to the superlinear (sort-based) versions.

Among those versions, saua's solution is the best (and simplest):

sort -n ip_addresses.txt | uniq -c

I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
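
That said, if your bash is version 4 or newer, associative arrays make the hash-in-bash approach fairly short. A minimal sketch, assuming the ip_addresses file from the question:

# one pass over the file, no external commands; requires bash 4+ for declare -A
declare -A count
while IFS= read -r ip; do
    ((count[$ip]++))
done < ip_addresses
for ip in "${!count[@]}"; do
    printf '%s %d\n' "$ip" "${count[$ip]}"
done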


Solution (group by, like MySQL)

grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n

Result

   3249 googleplus
   4211 linkedin
   5212 xing
   7928 facebook

Pure bash (no fork!)

There is a way using a bash function. This way is very quick as there is no fork!

...as long as the set of IP addresses stays small!

countIp () {
    local -a _ips=(); local _a
    # split each address on dots and pack the four octets into one integer index
    while IFS=. read -r -a _a ;do
        ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
    done
    # convert each index back to dotted-quad form and print its count
    for _a in ${!_ips[@]} ;do
        printf "%.16s %4d\n" \
          $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
    done
}

Note: each IP address is converted to a 32-bit unsigned integer, which is used as the array index. This uses plain bash indexed arrays, not an associative array (which is more expensive)!

time countIp < ip_addresses 
10.0.10.1    3
10.0.10.2    1
10.0.10.3    1
real    0m0.001s
user    0m0.004s
sys     0m0.000s

time sort ip_addresses | uniq -c
      3 10.0.10.1
      1 10.0.10.2
      1 10.0.10.3
real    0m0.010s
user    0m0.000s
sys     0m0.000s

On my host, doing so is a lot quicker than using forks, up to approximately 1,000 addresses, but it takes roughly one full second when I try to sort and count 10,000 addresses.


You can probably use the file system itself as a hash table. Pseudo-code as follows:

for every entry in the ip address file; do
  let addr denote the ip address;

  if file "addr" does not exist; then
    create file "addr";
    write a number "0" in the file;
  else 
    read the number from "addr";
    increase the number by 1 and write it back;
  fi
done

In the end, all you need to do is to traverse all the files and print the file names and numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline each time to the file, and in the end just look at the file size in bytes.
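
A hedged, runnable version of that pseudo-code (the ip_counts directory name is only for illustration; one file per distinct address is only practical while the number of distinct addresses stays modest):

mkdir -p ip_counts
while IFS= read -r addr; do
    f="ip_counts/$addr"
    if [[ -f $f ]]; then
        # read the current count and write it back incremented
        n=$(<"$f")
        echo $((n + 1)) > "$f"
    else
        echo 1 > "$f"
    fi
done < ip_addresses

# print "address count" for every counter file
for f in ip_counts/*; do
    printf '%s %s\n' "${f##*/}" "$(<"$f")"
done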


Most of the other solutions count duplicates. If you really need to group key value pairs, try this:

Here is my example data (assume this listing is saved as table.txt):

find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt

This will print the key value pairs grouped by the md5 checksum.

cat table.txt | awk '{print $1}' | sort | uniq | xargs -I{} grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt
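
If the key is always the first field, a single sort produces the same grouping without re-reading the file once per key; a minimal sketch under that assumption:

# rows sharing the first field (the checksum) end up adjacent, i.e. grouped
sort -k1,1 table.txt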


Sort may be omitted if the input is already grouped (uniq -c only counts consecutive identical lines):

uniq -c <source_file>

or

echo "$list" | uniq -c

if the source list is a variable


cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'

This command gives you the desired output (address followed by count, most frequent first).

