[unix] How to delete duplicate lines in a file without sorting it in Unix?

Is there a way to delete duplicate lines in a file in Unix?

I can do it with sort -u and uniq commands, but I want to use sed or awk. Is that possible?

This question is related to unix shell scripting sed awk

The answer is


The one-liner that Andre Miller posted above works except for recent versions of sed when the input file ends with a blank line and no chars. On my Mac my CPU just spins.

Infinite loop if last line is blank and has no chars:

sed '$!N; /^\(.*\)\n\1$/!P; D'

Doesn't hang, but you lose the last line

sed '$d;N; /^\(.*\)\n\1$/!P; D'

The explanation is at the very end of the sed FAQ:

The GNU sed maintainer felt that despite the portability problems
this would cause, changing the N command to print (rather than
delete) the pattern space was more consistent with one's intuitions
about how a command to "append the Next line" ought to behave.
Another fact favoring the change was that "{N;command;}" will
delete the last line if the file has an odd number of lines, but
print the last line if the file has an even number of lines.

To convert scripts which used the former behavior of N (deleting
the pattern space upon reaching the EOF) to scripts compatible with
all versions of sed, change a lone "N;" to "$d;N;".


This can be achieved using awk
Below Line will display unique Values

awk file_name | uniq

You can output these unique values to a new file

awk file_name | uniq > uniq_file_name

new file uniq_file_name will contain only Unique values, no duplicates


Perl one-liner similar to @jonas's awk solution:

perl -ne 'print if ! $x{$_}++' file

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $x{$_}++' file

This variation edits the file in-place:

perl -i -ne 'print if ! $x{$_}++' file

This variation edits the file in-place, and makes a backup file.bak

perl -i.bak -ne 'print if ! $x{$_}++' file

uniq would be fooled by trailing spaces and tabs. In order to emulate how a human makes comparison, I am trimming all trailing spaces and tabs before comparison.

I think that the $!N; needs curly braces or else it continues, and that is the cause of infinite loop.

I have bash 5.0 and sed 4.7 in Ubuntu 20.10. The second one-liner did not work, at the character set match.

Three variations, first to eliminate adjacent repeat lines, second to eliminate repeat lines wherever they occur, third to eliminate all but the last instance of lines in file.

pastebin

# First line in a set of duplicate lines is kept, rest are deleted.
# Emulate human eyes on trailing spaces and tabs by trimming those.
# Use after norepeat() to dedupe blank lines.

dedupe() {
 sed -E '
  $!{
   N;
   s/[ \t]+$//;
   /^(.*)\n\1$/!P;
   D;
  }
 ';
}

# Delete duplicate, nonconsecutive lines from a file. Ignore blank
# lines. Trailing spaces and tabs are trimmed to humanize comparisons
# squeeze blank lines to one

norepeat() {
 sed -n -E '
  s/[ \t]+$//;
  G;
  /^(\n){2,}/d;
  /^([^\n]+).*\n\1(\n|$)/d;
  h;
  P;
  ';
}

lastrepeat() {
 sed -n -E '
  s/[ \t]+$//;
  /^$/{
   H;
   d;
  };
  G;
  # delete previous repeated line if found
  s/^([^\n]+)(.*)(\n\1(\n.*|$))/\1\2\4/;
  # after searching for previous repeat, move tested last line to end
  s/^([^\n]+)(\n)(.*)/\3\2\1/;
  $!{
   h;
   d;
  };
  # squeeze blank lines to one
  s/(\n){3,}/\n\n/g;
  s/^\n//;
  p;
 ';
}

cat filename | sort | uniq -c | awk -F" " '$1<2 {print $2}'

Deletes the duplicate lines using awk.


The first solution is also from http://sed.sourceforge.net/sed1line.txt

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr '$!N;/^(.*)\n\1$/!P;D'
1
2
3
4
5

the core idea is:

print ONLY once of each duplicate consecutive lines at its LAST appearance and use D command to implement LOOP.

Explains:

  1. $!N;: if current line is NOT the last line, use N command to read the next line into pattern space.
  2. /^(.*)\n\1$/!P: if the contents of current pattern space is two duplicate string separated by \n, which means the next line is the same with current line, we can NOT print it according to our core idea; otherwise, which means current line is the LAST appearance of all of its duplicate consecutive lines, we can now use P command to print the chars in current pattern space util \n (\n also printed).
  3. D: we use D command to delete the chars in current pattern space util \n (\n also deleted), then the content of pattern space is the next line.
  4. and D command will force sed to jump to its FIRST command $!N, but NOT read the next line from file or standard input stream.

The second solution is easy to understood (from myself):

$ echo -e '1\n2\n2\n3\n3\n3\n4\n4\n4\n4\n5' |sed -nr 'p;:loop;$!N;s/^(.*)\n\1$/\1/;tloop;D'
1
2
3
4
5

the core idea is:

print ONLY once of each duplicate consecutive lines at its FIRST appearance and use : command & t command to implement LOOP.

Explains:

  1. read a new line from input stream or file and print it once.
  2. use :loop command set a label named loop.
  3. use N to read next line into the pattern space.
  4. use s/^(.*)\n\1$/\1/ to delete current line if the next line is same with current line, we use s command to do the delete action.
  5. if the s command is executed successfully, then use tloop command force sed to jump to the label named loop, which will do the same loop to the next lines util there are no duplicate consecutive lines of the line which is latest printed; otherwise, use D command to delete the line which is the same with thelatest-printed line, and force sed to jump to first command, which is the p command, the content of current pattern space is the next new line.

An alternative way using Vim(Vi compatible):

Delete duplicate, consecutive lines from a file:

vim -esu NONE +'g/\v^(.*)\n\1$/d' +wq

Delete duplicate, nonconsecutive and nonempty lines from a file:

vim -esu NONE +'g/\v^(.+)$\_.{-}^\1$/d' +wq


From http://sed.sourceforge.net/sed1line.txt: (Please don't ask me how this works ;-) )

 # delete duplicate, consecutive lines from a file (emulates "uniq").
 # First line in a set of duplicate lines is kept, rest are deleted.
 sed '$!N; /^\(.*\)\n\1$/!P; D'

 # delete duplicate, nonconsecutive lines from a file. Beware not to
 # overflow the buffer size of the hold space, or else use GNU sed.
 sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'

Examples related to unix

Docker CE on RHEL - Requires: container-selinux >= 2.9 What does `set -x` do? How to find files modified in last x minutes (find -mmin does not work as expected) sudo: npm: command not found How to sort a file in-place How to read a .properties file which contains keys that have a period character using Shell script gpg decryption fails with no secret key error Loop through a comma-separated shell variable Best way to find os name and version in Unix/Linux platform Resource u'tokenizers/punkt/english.pickle' not found

Examples related to shell

Comparing a variable with a string python not working when redirecting from bash script Get first line of a shell command's output How to run shell script file using nodejs? Run bash command on jenkins pipeline Way to create multiline comments in Bash? How to do multiline shell script in Ansible How to check if a file exists in a shell script How to check if an environment variable exists and get its value? Curl to return http status code along with the response docker entrypoint running bash script gets "permission denied"

Examples related to scripting

What does `set -x` do? Creating an array from a text file in Bash Windows batch - concatenate multiple text files into one Raise error in a Bash script How do I assign a null value to a variable in PowerShell? Difference between ${} and $() in Bash Using a batch to copy from network drive to C: or D: drive Check if a string matches a regex in Bash script How to run a script at a certain time on Linux? How to make an "alias" for a long path?

Examples related to sed

Retrieve last 100 lines logs How to replace multiple patterns at once with sed? Insert multiple lines into a file after specified pattern using shell script Linux bash script to extract IP address Ansible playbook shell output remove white space from the end of line in linux bash, extract string before a colon invalid command code ., despite escaping periods, using sed RE error: illegal byte sequence on Mac OS X How to use variables in a command in sed?

Examples related to awk

What are NR and FNR and what does "NR==FNR" imply? awk - concatenate two string variable and assign to a third Printing column separated by comma using Awk command line Insert multiple lines into a file after specified pattern using shell script cut or awk command to print first field of first row How to run an awk commands in Windows? Linux bash script to extract IP address Print line numbers starting at zero using awk Trim leading and trailing spaces from a string in awk Use awk to find average of a column