See if two files have the same content in Python

Question

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files are the same content-wise in Python?

One thing I can do is md5 each file and compare. Is there a better way?

User · Answer

I'm not sure if you want to find duplicate files or just compare two single files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.

There are lots of duplicate-file detection questions here. Assuming the files are not very small and that performance is important, you can:

- Compare file sizes first, discarding all which don't match.
- If file sizes match, compare using the biggest hash you can handle, hashing the files in chunks to avoid reading a whole big file at once.

Here is an answer with Python implementations (I prefer the one by nosklo, BTW).
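The two steps above can be sketched roughly as follows. This is only an illustration, not the implementation from the linked answer: the helper names `file_digest` and `find_duplicates`, the choice of SHA-256 as the "biggest hash", and the 1 MiB chunk size are all assumptions.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()  # assumed hash choice; any strong hash would do
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return a list of groups (lists) of paths whose contents hash the same."""
    # Step 1: bucket by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size, so the file never needs to be read
        # Step 2: hash only the files whose sizes collide.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```

Calling `find_duplicates(list_of_paths)` returns only the groups that contain two or more files. If a hash clash matters for your use case, the files in each group can still be confirmed with a byte-by-byte comparison.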

User · Answer

Yes, I think hashing the files would be the best way if you have to compare several files and store the hashes for later comparison. As hashes can clash, a byte-by-byte comparison may also be done, depending on the use case.

Generally, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does that (plus other things too).

See http://docs.python.org/library/filecmp.html, e.g.:

    >>> import filecmp
    >>> filecmp.cmp('file1.txt', 'file1.txt')
    True
    >>> filecmp.cmp('file1.txt', 'file2.txt')
    False

Speed consideration: Usually, if only two files have to be compared, hashing them and comparing the hashes is slower than a simple byte-by-byte comparison, if done efficiently. E.g. the code below tries to time hash vs. byte-by-byte.

Disclaimer: this is not the best way of timing or comparing two algorithms, and there is need for improvement, but it does give a rough idea. If you think it should be improved, do tell me and I will change it.

    # Python 2 code (uses xrange, the print statement and func_name)
    import random
    import string
    import hashlib
    import time

    def getRandText(N):
        # Build a random string of N printable characters
        return ''.join(random.choice(string.printable) for i in xrange(N))

    N = 1000000
    randText1 = getRandText(N)
    randText2 = getRandText(N)

    def cmpHash(text1, text2):
        # Hash both inputs with MD5 and compare the digests
        hash1 = hashlib.md5()
        hash1.update(text1)
        hash1 = hash1.hexdigest()

        hash2 = hashlib.md5()
        hash2.update(text2)
        hash2 = hash2.hexdigest()

        return hash1 == hash2

    def cmpByteByByte(text1, text2):
        # Direct comparison of the two strings
        return text1 == text2

    for cmpFunc in [cmpHash, cmpByteByByte]:
        st = time.time()
        for i in range(10):
            cmpFunc(randText1, randText2)
        print cmpFunc.func_name, time.time() - st

and the output is:

    cmpHash 0.234999895096
    cmpByteByByte 0.0
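For the two-file case on actual files rather than in-memory strings, a byte-by-byte check might look like the sketch below (Python 3). The function names `same_content` and `same_content_filecmp` and the 64 KiB chunk size are illustrative assumptions. The second function calls `filecmp.cmp` with `shallow=False`, which makes it compare file contents instead of accepting matching `os.stat()` signatures.

```python
import filecmp
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True if both files hold exactly the same bytes."""
    # Different sizes can never match, so skip reading entirely.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first differing chunk
            if not chunk_a:
                return True  # both files ended without any difference


def same_content_filecmp(path_a, path_b):
    """Same check via the standard library, forcing a content comparison."""
    # shallow=False: compare contents rather than trusting identical
    # os.stat() signatures (file type, size, modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Unlike hashing, the manual loop can return as soon as it finds a difference, so two large files that differ early are rejected without being read in full.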
