How to delete duplicate lines in a file without sorting it in Unix

Question

Is there a way to delete duplicate lines in a file in Unix? I can do it with the sort -u and uniq commands, but I want to use sed or awk. Is that possible?

User · Answer

The one-liner that Andre Miller posted works, except with recent versions of sed when the input file ends with a blank line that contains no characters. On my Mac, the CPU just spins.

Infinite loop if last line is blank and has no chars:

sed '$!N; /^\(.*\)\n\1$/!P; D'

Doesn't hang, but you lose the last line:

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".
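
If the end-of-file behavior of N is a concern, one way to sidestep it is to remove adjacent duplicates without N at all. The following is a minimal sketch in awk, not part of the sed FAQ; the function name adjacent_dedupe is hypothetical. It forces a string comparison so that lines such as 1 and 1.0 are not treated as equal.

```shell
# Hypothetical helper: drop adjacent duplicate lines without using sed's N.
adjacent_dedupe() {
  # Print a line when it is the first line or differs from the previous one.
  # Concatenating "" forces a string comparison instead of a numeric one.
  awk 'NR == 1 || ($0 "") != prev { print } { prev = $0 "" }' "$@"
}

# Input ends with a repeated line and contains a blank line in the middle.
printf 'a\na\n\nb\nb\n' | adjacent_dedupe
```

Because awk reads one record at a time, a blank last line is handled like any other line.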

User · Answer

This can be achieved using awk. The line below displays unique values:

awk '{ print }' file_name | uniq

You can write these unique values to a new file:

awk '{ print }' file_name | uniq > uniq_file_name

The new file uniq_file_name will contain only unique values, provided the duplicates are adjacent, because uniq only compares neighboring lines.

User · Answer

Perl one-liner similar to jonas's awk solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in-place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in-place, and makes a backup file.bak:

perl -i.bak -ne 'print if ! $x{$_}++' file
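
Since the question asks for sed or awk, the trailing-whitespace variation can also be sketched in awk. This is an illustration rather than part of the answer above; the function name dedupe_trimmed is hypothetical, and only spaces and tabs are trimmed.

```shell
# Hypothetical helper: trim trailing spaces and tabs, then keep the first
# occurrence of each resulting line.
dedupe_trimmed() {
  awk '{ sub(/[ \t]+$/, "") } !seen[$0]++' "$@"
}

# "a " and "a" compare equal after trimming, as do "b<TAB>" and "b".
printf 'a \na\nb\t\nb\n' | dedupe_trimmed
```

As with the Perl variation, the lines are printed in their trimmed form.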

User · Answer

uniq would be fooled by trailing spaces and tabs. To emulate how a human compares lines, I trim all trailing spaces and tabs before the comparison.

I think that the $!N; needs curly braces or else it continues, and that is the cause of the infinite loop.

I have bash 5.0 and sed 4.7 in Ubuntu 20.10. The second one-liner did not work, at the character set match.

Three variations, first to eliminate adjacent repeat lines, second to eliminate repeat lines wherever they occur, third to eliminate all but the last instance of lines in file.


# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.

dedupe() {
 sed -E '
  $!{
   N;
   s/[ \t]+$//;
   /^(.*)\n\1$/!P;
   D;
  }
 ';
}

# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons.
# Squeeze blank lines to one.

norepeat() {
 sed -n -E '
  s/[ \t]+$//;
  G;
  /^(\n){2,}/d;
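  # compare whole lines only, so a line is not dropped just because
  # an earlier line is a prefix of it
  /^([^\n]+)\n(.*\n)?\1(\n|$)/d;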
  h;
  P;
  ';
}

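# Keep only the last instance of each repeated line; earlier
# instances are deleted. Blank lines are squeezed to one.

lastrepeat() {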
 sed -n -E '
  s/[ \t]+$//;
  /^$/{
   H;
   d;
  };
  G;
  # delete previous repeated line if found
  s/^([^\n]+)((\n.*)?)\n\1(\n.*|$)/\1\2\4/;
  # after searching for previous repeat, move tested last line to end
  s/^([^\n]+)(\n)(.*)/\3\2\1/;
  $!{
   h;
   d;
  };
  # squeeze blank lines to one
  s/(\n){3,}/\n\n/g;
  s/^\n//;
  p;
 ';
}

User · Answer

cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'

Deletes the duplicate lines using awk. Note that this sorts the input and prints only the lines that occur exactly once.
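
To make the effect of this pipeline concrete, here is a small sketch that wraps it in a function; the name singletons_sorted is hypothetical. Because the awk condition keeps only counts below 2, every line that appears more than once is dropped entirely, the output is in sorted order, and only the first whitespace-separated word of each line is printed.

```shell
# Hypothetical wrapper around the pipeline above.
singletons_sorted() {
  # uniq -c prefixes each distinct line with its count; awk keeps count 1.
  sort "$@" | uniq -c | awk '$1 < 2 { print $2 }'
}

# "b" occurs twice, so it is removed completely; "a" and "c" remain.
printf 'b\na\nb\nc\n' | singletons_sorted
```

If the goal is to keep one copy of each line in the original order, the awk approach based on a seen array is a closer fit.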

User · Answer

The first solution is also from http://sed.sourceforge.net/sed1line.txt

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5

The core idea is: print each run of duplicate consecutive lines ONLY once, at its LAST appearance, and use the D command to implement the LOOP.

Explanation:

1. $!N; : if the current line is NOT the last line, use the N command to read the next line into the pattern space.

2. /^(.*)\n\1$/!P : if the pattern space contains two identical strings separated by \n, the next line is the same as the current line, so according to the core idea we do NOT print it. Otherwise the current line is the LAST appearance in its run of duplicate consecutive lines, and we use the P command to print the characters in the pattern space up to \n (\n is also printed).

3. D : we use the D command to delete the characters in the pattern space up to \n (\n is also deleted), so that the pattern space then contains only the next line.

4. The D command also forces sed to jump to its FIRST command, $!N, but does NOT read the next line from the file or standard input stream.

The second solution is easy to understand (from myself):

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' | sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5

The core idea is: print each run of duplicate consecutive lines ONLY once, at its FIRST appearance, and use the : command and the t command to implement the LOOP.

Explanation:

1. Read a new line from the input stream or file and print it once.

2. Use the :loop command to set a label named loop.

3. Use N to read the next line into the pattern space.

4. Use s/^(.*)\n\1$/\1/ to delete the current line if the next line is the same as the current line. The s command does the deleting.

5. If the s command succeeds, the tloop command forces sed to jump to the label named loop, which repeats the same steps for the following lines until there are no more consecutive duplicates of the most recently printed line. Otherwise, the D command deletes the line that is the same as the most recently printed line and forces sed to jump to the first command, which is the p command. At that point the pattern space contains the next new line.

User · Answer

awk '!seen[$0]++' file.txt

seen is an associative array that Awk will pass every line of the file to. If a line isn't in the array then seen[$0] will evaluate to false. The ! is the logical NOT operator and will invert the false to true. Awk will print the lines where the expression evaluates to true. The ++ increments seen so that seen[$0] == 1 after the first time a line is found and then seen[$0] == 2, and so on. Awk evaluates everything but 0 and "" (empty string) to true. If a duplicate line is placed in seen then !seen[$0] will evaluate to false and the line will not be written to the output.
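
The same logic can be written out step by step, which may make the one-liner easier to follow. This is only an expanded sketch of the explanation above; the function name dedupe_keep_first is hypothetical.

```shell
# Hypothetical helper: long-hand form of the seen-array idiom.
dedupe_keep_first() {
  awk '{
    # An entry that was never set compares equal to 0, so this is
    # true only the first time this exact line appears.
    if (seen[$0] == 0)
      print
    # Count the occurrence so later copies are skipped.
    seen[$0]++
  }' "$@"
}

# Keeps the first b, a and c in their original order.
printf 'b\na\nb\nc\na\n' | dedupe_keep_first
```

The array holds one entry per distinct line, so memory use grows with the number of unique lines rather than with the file size.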

User · Answer

An alternative way using Vim (Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq

User · Answer

From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )

# delete duplicate, consecutive lines from a file (emulates "uniq").
# First line in a set of duplicate lines is kept, rest are deleted.
sed '$!N; /^\(.*\)\n\1$/!P; D'

# delete duplicate, nonconsecutive lines from a file. Beware not to
# overflow the buffer size of the hold space, or else use GNU sed.
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
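
To compare the two one-liners side by side, they can be wrapped in small functions and run on the same input. The names uniq_adjacent and uniq_anywhere are hypothetical, and the sketch assumes GNU sed and input made of printable ASCII characters, since the second script matches lines with [ -~].

```shell
# Hypothetical wrappers around the two sed1line.txt one-liners.
uniq_adjacent() {
  # Removes a line only when it repeats the line directly before it.
  sed '$!N; /^\(.*\)\n\1$/!P; D' "$@"
}

uniq_anywhere() {
  # Keeps every line seen so far in the hold space and deletes any
  # incoming line that already appears there.
  sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' "$@"
}

# The final 2 is not adjacent to the earlier 2s.
printf '1\n2\n2\n3\n2\n' | uniq_adjacent   # keeps the trailing 2
printf '1\n2\n2\n3\n2\n' | uniq_anywhere   # drops the trailing 2
```

The second form holds all unique lines in memory, which is the reason for the hold space warning in the original comment.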
