Is there a way to uniq by column

Question

I have a  csv file like this   stack2 example com 2009-11-27 01 05 47 893000000 example net 127 0 0 1 overflow example com 2009-11-27 00 58 29 793000000 example net 255 255 255 0 overflow example com 2009-11-27 00 58 29 646465785 example net 256 255 255 0       I have to remove duplicate e-mails  the entire line  from the file  i e  one of the lines containing overflow example com in the above example   How do I use uniq on only field 1  separated by commas   According to man  uniq doesn t have options for columns   I tried something with sort   uniq but it doesn t work

User · Answer

By sorting the file with sort first  you can then apply uniq   It seems to sort the file just fine     cat test csv overflow domain2 com 2009-11-27 00 58 29 793000000 xx3 net 255 255 255 0 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 overflow domain2 com 2009-11-27 00 58 29 646465785 2x3 net 256 255 255 0  stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack3 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack4 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1    sort test csv overflow domain2 com 2009-11-27 00 58 29 646465785 2x3 net 256 255 255 0  overflow domain2 com 2009-11-27 00 58 29 793000000 xx3 net 255 255 255 0 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack3 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack4 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1    sort test csv   uniq overflow domain2 com 2009-11-27 00 58 29 646465785 2x3 net 256 255 255 0  overflow domain2 com 2009-11-27 00 58 29 793000000 xx3 net 255 255 255 0 stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack3 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack4 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1   You could also do some AWK magic     awk -F     lines  1     0   END   for  l in lines  print lines l     test csv stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack4 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 stack3 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1 overflow domain2 com 2009-11-27 00 58 29 646465785 2x3 net 256 255 255 0

User · Answer

sort -u -t  -k1 1 file    -u for unique -t  so comma is the delimiter -k1 1 for the key field 1   Test result   overflow domain2 com 2009-11-27 00 58 29 793000000 xx3 net 255 255 255 0  stack2 domain com 2009-11-27 01 05 47 893000000 xx2 net 127 0 0 1

User · Answer

well  simpler than isolating the column with awk  if you need to remove everything with a certain value for a given file  why not just do grep -v   e g  to delete everything with the value  col2  in the second place  line  col1 col2 col3 col4  grep -v   col2   file  gt  file minus offending lines   If this isn t good enough  because some lines may get improperly stripped by possibly having the matching value show up in a different column  you can do something like this   awk to isolate the offending column  e g    awk -F    print  2      line     the -F sets the field delimited to       2 means column 2  followed by some custom delimiter and then the entire line  You can then filter by removing lines that begin with the offending value    awk -F    print  2      line     grep -v  BAD VALUE   and then strip out the stuff before the delimiter   awk -F    print  2      line     grep -v  BAD VALUE   sed  s      g     note -the sed command is sloppy because it doesn t include escaping values  Also the sed pattern should really be something like          i e  anything not the delimiter   But hopefully this is clear enough

User · Answer

awk -F         1     file    -F sets the field separator   1 is the first field    val  looks up val in the hash   a regular variable      increment  and return old value    returns logical not  there is an implicit print at the end

User · Answer

To consider multiple column   Sort and give unique list based on column 1 and column 3   sort -u -t   -k 1 1 -k 3 3 test txt    -t   colon is separator -k 1 1 -k 3 3 based on column 1 and column 3

User · Answer

Here is a very nifty way   First format the content such that the column to be compared for uniqueness is a fixed width   One way of doing this is to use awk printf with a field column width specifier    15s     Now the -f and -w options of uniq can be used to skip preceding fields columns and to specify the comparison width  column s  width    Here are three examples   In the first example     1  Temporarily make the column of interest a fixed width greater than or equal to the field s max width   2  Use -f uniq option to skip the prior columns  and use the -w uniq option to limit the width to the tmp fixed width   3  Remove trailing spaces from the column to  restore  it s width  assuming there were no trailing spaces beforehand    printf   s    str      awk    tmp fixed width 15  uniq col 8  w tmp fixed width-length  uniq col   for  i 0 i lt w i       uniq col  uniq col      printf   s n    0        uniq -f 7 -w 15     awk    uniq col 8  gsub            uniq col   printf   s n    0      In the second example     Create a new uniq column 1   Then remove it after the uniq filter has been applied   printf   s    str      awk    uniq col 1 4  printf   15s  s n   uniq col 1   0        uniq -f 0 -w 15     awk     1     gsub             0   printf   s n    0      The third example is the same as the second  but for multiple columns   printf   s    str      awk    uniq col 1 4  uniq col 2 8  printf   5s  15s  s n   uniq col 1  uniq col 2   0        uniq -f 0 -w 5     uniq -f 1 -w 15     awk     1  2     gsub             0   printf   s n    0

User · Answer

If you want to retain the last one of the duplicates you could use   tac a csv   sort -u -t  -r -k1 1  tac   Which was my requirement  here   tac will reverse the file line by line

User · Answer

or if u want to use uniq    lt mycvs cvs tr -s           awk    print  3    2    1     uniq -c -f2  gives   1 01 05 47 893000000 2009-11-27 tack2 domain com 2 00 58 29 793000000 2009-11-27 overflow domain2 com 1

[linux] Is there a way to 'uniq' by column?

Examples related to linux

Examples related to shell

Examples related to sorting

Examples related to uniq