I have two large files (sets of filenames), roughly 30,000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff:
diff file2 file1 | grep '^>' | sed 's/^> //'
Surely, there must be a better way?
The comm command (short for "common") may be useful. From its man page: "comm - compare two sorted files line by line".
#find lines only in file1
comm -23 file1 file2
#find lines only in file2
comm -13 file1 file2
#find lines common to both files
comm -12 file1 file2
The man page is actually quite readable for this.
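Note that comm requires its inputs to be sorted. If your files aren't already sorted, you can sort them on the fly with process substitution (a sketch assuming bash):

comm -23 <(sort file1) <(sort file2)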
Use combine from the moreutils package, a set-operations utility that supports not, and, or, and xor operations:
combine file1 not file2
i.e. give me lines that are in file1 but not in file2, or equivalently: give me lines in file1 minus lines in file2.
Note: combine sorts and finds unique lines in both files before performing any operation, but diff does not. So you might find differences between the output of diff and combine.
So in effect you are saying
Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2
In my experience, it's much faster than the other options.
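With the example files from the question, this should print:

$ combine file1 not file2
line2
line3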
This seems quick for me:
comm -1 -3 <(sort file1.txt) <(sort file2.txt) > output.txt
I found that a plain loop with an if statement worked perfectly for me:

while read -r i; do if grep -qiF -- "$i" file1; then echo "$i found" >> Matching_lines.txt; else echo "$i missing" >> missing_lines.txt; fi; done < file2
As konsolebox suggested, the poster's grep solution
grep -v -f file2 file1
actually works great (fast) if you simply add the -F
option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000 line file lists I had to compare. With -F
it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l
.
These tests also included the -x switch, which is a necessary part of the solution to ensure total accuracy in cases where file2 contains lines that match part of, but not all of, one or more lines in file1.
So a solution that does not require the inputs to be sorted, is fast, and is flexible (case sensitivity, etc.) is:
grep -F -x -v -f file2 file1
This doesn't work with all versions of grep; for example, it fails on macOS, where a line in file1 will be reported as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
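For example, using Homebrew (just one way to get GNU grep; the formula installs the binary with a g prefix):

brew install grep
ggrep -F -x -v -f file2 file1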
What's the speed of sort and diff?
sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted
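The raw diff output still includes <, >, and hunk markers. To reduce it to just the lines missing from file2, you can filter it the same way the question's edit does (a sketch, not part of the original answer):

diff file1.sorted file2.sorted | grep '^<' | sed 's/^< //'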
You can use Python:
python -c '
# collect every line of file2 into a set for fast membership tests
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
# print lines of file1 that are not in the set
with open("file1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
Test:
seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
# Output:
1
2
This is also relatively fast, compared to grep.
$ join -v 1 -t '' file1 file2
line2
line3
The -t '' makes sure that join compares the whole line; without it, a space in any of the lines would be treated as a field separator.
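Note that join, like comm, expects sorted input (and the empty -t separator works with GNU join; other implementations may reject it). If the files aren't sorted, a sketch:

join -v 1 -t '' <(sort file1) <(sort file2)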
Using fgrep, or adding the -F option to grep, could help. But for faster results you could use awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
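A typical awk approach along these lines (a sketch, not necessarily the exact method from the linked post) loads file2 into an array and prints each line of file1 that isn't in it, in a single pass with no sorting required:

# NR==FNR is only true while reading the first file (file2)
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1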
The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.
diff -y --suppress-common-lines file1.txt file2.txt
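Side-by-side output still shows lines from both files together with gutter markers. With GNU diff you can instead print only the lines unique to file1 by blanking the other line formats (a sketch assuming GNU diff; like the other diff-based answers, it behaves as a set difference only on sorted input):

diff --new-line-format="" --unchanged-line-format="" file1.txt file2.txt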