Generating an MD5 checksum of a file

Question

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files.)

User · Answer

    hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
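Since the question also asks about checking checksums for a list of files, here is a minimal sketch built on the same one-liner. The helper names `md5_hex` and `verify_checksums`, and the `{path: expected_hex_digest}` dict layout, are assumptions made for illustration, not part of the answer above. Like the one-liner, it reads each file fully into memory.

```python
import hashlib
from pathlib import Path


def md5_hex(path):
    """Return the MD5 hex digest of a file, reading it fully into memory."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def verify_checksums(expected):
    """Return {path: bool} telling whether each file matches its expected MD5.

    `expected` is a hypothetical dict of {path: expected_hex_digest}.
    Hex digests are compared case-insensitively.
    """
    return {
        path: md5_hex(path) == digest.lower()
        for path, digest in expected.items()
    }
```

Calling `verify_checksums({"a.bin": "5d41..."})` with real paths and digests would return a dict you can scan for `False` entries.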

User · Answer

I'm clearly not adding anything fundamentally new, but I added this answer before I was up to commenting status, plus the code regions make things clearer. Anyway, specifically to answer @Nemo's question from Omnifarious's answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

    $ ./sum_methods.py
    crc32_mmap(filename) 0.0241742134094
    crc32_read(filename) 0.0219960212708
    subprocess.check_output(['cksum', filename]) 0.0553209781647
    md5sum_mmap(filename) 0.0286180973053
    md5sum_read(filename) 0.0311000347137
    subprocess.check_output(['md5sum', filename]) 0.0332629680634
    $ time md5sum /tmp/test.data.300k
    d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

    real    0m0.043s
    user    0m0.032s
    sys     0m0.010s
    $ stat -c '%s' /tmp/test.data.300k
    11890400

So, it looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:

    import hashlib

    def md5sum(filename, blocksize=65536):
        hash = hashlib.md5()
        with open(filename, "rb") as f:
            # iter() with a sentinel stops at the first empty read (end of file)
            for block in iter(lambda: f.read(blocksize), b""):
                hash.update(block)
        return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine usually has an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation.

EDIT: Sorry for the long delay, I haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

    import zlib

    def adler32sum(filename, blocksize=65536):
        # Start from the checksum of empty input (b"" so this also works on Python 3)
        checksum = zlib.adler32(b"")
        with open(filename, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                checksum = zlib.adler32(block, checksum)
        return checksum & 0xffffffff

Note that this must start off with the empty byte string, as Adler sums do indeed differ when starting from zero versus their sum for "", which is 1; CRC can start with 0 instead. The AND-ing is needed to make it a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
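The timing listing mentions an `md5sum_mmap` variant without showing it. The following is only a guess at what such a function might look like, using the standard `mmap` module. The body is an assumption, not the answer author's code, and no timings are claimed for it.

```python
import hashlib
import mmap


def md5sum_mmap(filename):
    """MD5 hex digest of a file, hashing it through a read-only memory map."""
    h = hashlib.md5()
    with open(filename, "rb") as f:
        try:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            # mmap refuses to map an empty file; the digest of b"" is still valid.
            return h.hexdigest()
        with mm:
            # mmap objects support the buffer protocol, so update() accepts them.
            h.update(mm)
    return h.hexdigest()
```

Length 0 in the `mmap.mmap` call maps the whole file, so the operating system pages the data in rather than Python reading it block by block.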

User · Answer

You can use hashlib.md5().

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read chunks of 4096 bytes sequentially and feed them to the md5 method:

    import hashlib

    def md5(fname):
        hash_md5 = hashlib.md5()
        with open(fname, "rb") as f:
            # Read 4096 bytes at a time until read() returns b"" at end of file
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

Note: hash_md5.hexdigest() returns the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() so you don't have to convert back.
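To make the difference between the two return forms concrete, here is a small sketch. The input `b"hello"` and the variable names are chosen purely for illustration.

```python
import hashlib

h = hashlib.md5(b"hello")
raw = h.digest()       # packed bytes: 16 bytes for MD5
text = h.hexdigest()   # hex string: 32 characters for MD5

# The two forms carry the same information and convert both ways.
assert len(raw) == 16 and len(text) == 32
assert raw.hex() == text
assert bytes.fromhex(text) == raw
```

The hex form is what tools like md5sum print, so it is usually the one to compare against published checksums.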

User · Answer

There is a way that's pretty memory inefficient.

Single file:

    import hashlib

    def file_as_bytes(file):
        with file:
            return file.read()

    print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

List of files:

    [(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky, and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

    [(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again, I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

    import hashlib

    def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
        # Feed every block from the iterator into the hasher
        for block in bytesiter:
            hasher.update(block)
        return hasher.hexdigest() if ashexstr else hasher.digest()

    def file_as_blockiter(afile, blocksize=65536):
        # Yield the file's contents block by block, closing it when done
        with afile:
            block = afile.read(blocksize)
            while len(block) > 0:
                yield block
                block = afile.read(blocksize)

    [(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
        for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

    [(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
        for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
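On newer interpreters the block iteration can be left to the standard library: `hashlib.file_digest` was added in Python 3.11. The sketch below uses it when available and otherwise falls back to a manual loop. The function name `file_hash` and its parameters are assumptions for illustration, not part of the answer above.

```python
import hashlib


def file_hash(path, algorithm="sha256", blocksize=65536):
    """Return the raw digest of a file without loading it all into memory."""
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):  # available on Python 3.11+
            return hashlib.file_digest(f, algorithm).digest()
        # Fallback for older versions: feed the hasher block by block.
        hasher = hashlib.new(algorithm)
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
        return hasher.digest()
```

As in the answer, `file_hash(path)[:16]` keeps only the first 128 bits of the SHA-256 digest.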

User · Answer

In Python 3.8+, you can do:

    import hashlib

    with open("your_filename.txt", "rb") as f:
        file_hash = hashlib.md5()
        # The walrus operator assigns each chunk and stops on an empty read
        while chunk := f.read(8192):
            file_hash.update(chunk)

    print(file_hash.digest())
    print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
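As a sketch of the suggested swap, here is a BLAKE2b version of the same loop. The function name `blake2b_file` and its parameters are hypothetical. The `digest_size` argument is part of `hashlib.blake2b` and lets you ask for a shorter digest directly, for example 16 bytes to match the length of an MD5 digest.

```python
import hashlib


def blake2b_file(path, digest_size=64, chunk_size=8192):
    """Return the BLAKE2b hex digest of a file, read in chunks."""
    h = hashlib.blake2b(digest_size=digest_size)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                break
            h.update(chunk)
    return h.hexdigest()
```

Note that a 16-byte BLAKE2b digest is not the first 16 bytes of the 64-byte one: the requested size is an input to the hash, so the two values are unrelated.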

User · Answer

Change the file path to your file:

    import hashlib

    def getMd5(file_path):
        m = hashlib.md5()
        with open(file_path, 'rb') as f:
            # Reads the whole file into memory in one call
            line = f.read()
            m.update(line)
        md5code = m.hexdigest()
        return md5code
