I have two large files (sets of filenames), roughly 30,000 lines in each file. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff, but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff:
diff file2 file1 | grep '^>' | sed 's/^> //'
Surely, there must be a better way?
The comm command (short for "common") may be useful. From its man page: "comm - compare two sorted files line by line".
#find lines only in file1
comm -23 file1 file2
#find lines only in file2
comm -13 file1 file2
#find lines common to both files
comm -12 file1 file2
The man page is actually quite readable for this.
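Note that comm requires its inputs to be sorted. If your files aren't already sorted, you can sort them on the fly with process substitution (a sketch assuming bash):

comm -23 <(sort file1) <(sort file2)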
Use combine from the moreutils package, a set-operations utility that supports not, and, or, and xor operations:
combine file1 not file2
i.e. give me lines that are in file1 but not in file2, or equivalently: give me lines in file1 minus lines in file2.
Note: combine sorts and finds unique lines in both files before performing any operation, but diff does not. So you might find differences between the output of diff and combine.
So in effect you are saying
Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2
In my experience, it's much faster than the other options.
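With the example files from the question, this should print:

$ combine file1 not file2
line2
line3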
This seems quick for me:
comm -1 -3 <(sort file1.txt) <(sort file2.txt) > output.txt
I found that a plain loop with an if statement worked perfectly for me:

while read -r i; do if grep -qiF -- "$i" file1; then echo "$i found" >> Matching_lines.txt; else echo "$i missing" >> missing_lines.txt; fi; done < file2
As konsolebox suggested, the poster's grep solution
grep -v -f file2 file1
actually works great (fast) if you simply add the -F
option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000 line file lists I had to compare. With -F
it took 0.031 s (real), while without it took 2.278 s (real), when redirecting grep output to wc -l
.
These tests also included the -x switch, which is a necessary part of the solution to ensure total accuracy in cases where file2 contains lines that match part of, but not all of, one or more lines in file1.
So a solution that does not require the inputs to be sorted, is fast, and is flexible (case sensitivity, etc.) is:
grep -F -x -v -f file2 file1
This doesn't work with all versions of grep; for example, it fails on macOS, where a line in file1 will be reported as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
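For example, using Homebrew (just one way to get GNU grep; the formula installs the binary with a g prefix):

brew install grep
ggrep -F -x -v -f file2 file1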
What's the speed of sort and diff?
sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted
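The raw diff output still includes <, >, and hunk markers. To reduce it to just the lines missing from file2, you can filter it the same way the question's edit does (a sketch, not part of the original answer):

diff file1.sorted file2.sorted | grep '^<' | sed 's/^< //'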
You can use Python:
python -c '
# collect every line of file2 into a set for fast membership tests
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
# print lines of file1 that are not in the set
with open("file1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
Test:
seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
# Output:
1
2
This is also relatively fast, compared to grep.
$ join -v 1 -t '' file1 file2
line2
line3
The -t '' makes sure that join compares the whole line; without it, a space in any of the lines would be treated as a field separator.
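Note that join, like comm, expects sorted input (and the empty -t separator works with GNU join; other implementations may reject it). If the files aren't sorted, a sketch:

join -v 1 -t '' <(sort file1) <(sort file2)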
Using fgrep, or adding the -F option to grep, could help. But for faster results you could use awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
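A typical awk approach along these lines (a sketch, not necessarily the exact method from the linked post) loads file2 into an array and prints each line of file1 that isn't in it, in a single pass with no sorting required:

# NR==FNR is only true while reading the first file (file2)
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1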
The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.
diff -y --suppress-common-lines file1.txt file2.txt
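Side-by-side output still shows lines from both files together with gutter markers. With GNU diff you can instead print only the lines unique to file1 by blanking the other line formats (a sketch assuming GNU diff; like the other diff-based answers, it behaves as a set difference only on sorted input):

diff --new-line-format="" --unchanged-line-format="" file1.txt file2.txt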