[python] Get MD5 hash of big files in Python

I have used hashlib (which replaces md5 in Python 2.6/3.0), and it works fine if I open a file and pass its content to the hashlib.md5() function.

The problem is with very big files whose sizes can exceed the available RAM.

How to get the MD5 hash of a file without loading the whole file to memory?

This question is related to: python, md5, hashlib

The answer is


Here's my version of @Piotr Czapla's method:

import hashlib

def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        # md5.block_size is 64, so this reads 8192 bytes per iteration,
        # keeping memory usage constant regardless of file size
        for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
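
A quick usage sketch of the function above ('movie.avi' is just an example path); the hex string it returns should match what the md5sum command-line tool prints for the same file:

print(md5sum('movie.avi'))
# compare in a shell:  md5sum movie.avi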

Below I've incorporated suggestions from the comments. Thank you all!

Python < 3.8

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_num_blocks * h.block_size), b''):
            h.update(chunk)
    return h.digest()

Python 3.8 and above

import hashlib

def checksum(filename, hash_factory=hashlib.md5, chunk_num_blocks=128):
    h = hash_factory()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunk_num_blocks * h.block_size):
            h.update(chunk)
    return h.digest()
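
If you're on Python 3.11 or newer, hashlib.file_digest() can do the chunked reading for you; a minimal sketch (the function name here is just for illustration):

import hashlib

def checksum_file_digest(filename, algorithm='md5'):
    # hashlib.file_digest() (Python 3.11+) streams the file in chunks internally
    with open(filename, 'rb') as f:
        return hashlib.file_digest(f, algorithm).digest()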

Original post

If you care about a more Pythonic (no 'while True') way of reading the file, check this code:

import hashlib

def checksum_md5(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            md5.update(chunk)
    return md5.digest()

Note that the iter() function needs an empty byte string as its sentinel for the returned iterator to halt at EOF, since read() returns b'' (not just '').
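
A tiny illustration of the two-argument iter() form, using an in-memory stream:

import io

buf = io.BytesIO(b"abcdefgh")
# iter(callable, sentinel) keeps calling the callable until it returns the sentinel
chunks = list(iter(lambda: buf.read(3), b''))
print(chunks)  # [b'abc', b'def', b'gh']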


Using multiple comments/answers in this thread, here is my solution:

import hashlib

def md5_for_file(path, block_size=256*128, hr=False):
    '''
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The default reads 256*128 = 32768 octets at a time, a multiple of the
    4096-octet filesystem block size (default for NTFS).
    '''
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            md5.update(chunk)
    if hr:
        return md5.hexdigest()
    return md5.digest()

  • This is "pythonic"
  • This is a function
  • It avoids implicit values: always prefer explicit ones.
  • It allows (very important) performance optimizations

And finally,

- This has been built by a community; thanks to everyone for your advice and ideas.


I think the following code is more Pythonic:

from hashlib import md5

def get_md5(fname):
    m = md5()
    with open(fname, 'rb') as fp:
        for chunk in fp:
            m.update(chunk)
    return m.hexdigest()
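
One caveat: iterating over a binary file yields chunks split on b'\n', so the chunk size depends on the data, and a large file with no newlines still comes back as one huge block. A small illustration with an in-memory stream:

import io

data = b"x" * 10_000_000            # 10 MB of data with no newlines
stream = io.BytesIO(data)
chunks = list(stream)               # line iteration over a binary stream
print(len(chunks), len(chunks[0]))  # 1 chunk holding all 10,000,000 bytes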

I don't like loops. Based on @Nathan Feger's answer:

import functools
import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:  # 'filename' is the path of the file to hash
    # reduce() just drives the iterator; update() returns None, so the accumulator is never used
    functools.reduce(lambda _, c: md5.update(c), iter(lambda: f.read(md5.block_size * 128), b''), None)
md5.hexdigest()

A Python 2/3 portable solution

To calculate a checksum (MD5, SHA-1, etc.), you must open the file in binary mode, because you'll be hashing byte values, not text.

To be py27/py3 portable, you ought to use the io package, like this:

import hashlib
import io


def md5sum(src):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        content = fd.read()
        md5.update(content)
    return md5

If your files are big, you may prefer to read the file by chunks to avoid storing the whole file content in memory:

def md5sum(src, length=io.DEFAULT_BUFFER_SIZE):
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
    return md5

The trick here is to use the iter() function with a sentinel (the empty byte string b'').

The iterator created in this case will call o [the lambda function] with no arguments for each call to its next() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.

If your files are really big, you may also need to display progress information. You can do that by calling a callback function which prints or logs the amount of calculated bytes:

def md5sum(src, callback, length=io.DEFAULT_BUFFER_SIZE):
    calculated = 0
    md5 = hashlib.md5()
    with io.open(src, mode="rb") as fd:
        for chunk in iter(lambda: fd.read(length), b''):
            md5.update(chunk)
            calculated += len(chunk)
            callback(calculated)
    return md5
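
A usage sketch for the callback variant (the file name here is just an example); the callback simply prints a running byte count:

def print_progress(n_bytes):
    print("hashed {} bytes so far".format(n_bytes))

digest = md5sum("backup.tar.gz", callback=print_progress)
print(digest.hexdigest())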

import hashlib

# Note: this hashes each line of the file separately, not the file as a whole.
with open('/home/parrot/pass.txt', 'r') as opened:
    for line in opened:
        stripped = line.strip('\n')
        hash_object = hashlib.md5(stripped.encode())
        print(hash_object.hexdigest())

You need to read the file in chunks of suitable size:

import hashlib

def md5_for_file(f, block_size=2**20):
    # 'f' must be a file object opened in binary ('rb') mode
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()

NOTE: Make sure you open your file in 'rb' (binary) mode - otherwise you will get the wrong result.
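
For example, a quick usage sketch of the function above ('video.mp4' is only a placeholder name):

with open('video.mp4', 'rb') as f:  # 'rb' matters: hash the raw bytes, not decoded text
    print(md5_for_file(f))          # raw 16-byte digest; swap digest() for hexdigest() if you want hex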

So to do the whole lot in one method - use something like:

import hashlib
import os

def generate_file_md5(rootdir, filename, blocksize=2**20):
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest()

The update above was based on the comments provided by Frerich Raabe, and I tested it and found it to be correct on my Python 2.7.2 Windows installation.

I cross-checked the results using the 'jacksum' tool.

jacksum -a md5 <filename>

http://www.jonelo.de/java/jacksum/


You can't get its MD5 without reading the full content, but you can use the update() method to feed in the file's content block by block:
m.update(a); m.update(b) is equivalent to m.update(a+b)
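
A minimal check of that equivalence:

import hashlib

a, b = b"hello ", b"world"

h1 = hashlib.md5()
h1.update(a)
h1.update(b)

h2 = hashlib.md5(a + b)

print(h1.hexdigest() == h2.hexdigest())  # True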


Implementation of accepted answer for Django:

import hashlib
from django.db import models


class MyModel(models.Model):
    file = models.FileField()  # any field based on django.core.files.File

    def get_hash(self):
        hash = hashlib.md5()
        for chunk in self.file.chunks(chunk_size=8192):
            hash.update(chunk)
        return hash.hexdigest()

A remix of Bastien Semene's code that takes Hawkwing's comment about generic hashing functions into consideration:

def hash_for_file(path, algorithm=hashlib.algorithms[0], block_size=256*128, human_readable=True):
    """
    Block size directly depends on the block size of your filesystem
    to avoid performance issues.
    The filesystem here uses blocks of 4096 octets (default for NTFS), and the
    default block_size of 256*128 = 32768 is a multiple of that.

    Linux Ext4 block size
    sudo tune2fs -l /dev/sda5 | grep -i 'block size'
    > Block size:               4096

    Input:
        path: a path
        algorithm: an algorithm in hashlib.algorithms
                   ATM: ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
        block_size: a multiple of 128 corresponding to the block size of your filesystem
        human_readable: switch between digest() or hexdigest() output, default hexdigest()
    Output:
        hash
    """
    if algorithm not in hashlib.algorithms:
        raise NameError('The algorithm "{algorithm}" you specified is '
                        'not a member of "hashlib.algorithms"'.format(algorithm=algorithm))

    hash_algo = hashlib.new(algorithm)  # According to the hashlib documentation, using new()
                                        # is slower than calling the named
                                        # constructors, e.g. hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(block_size), b''):
            hash_algo.update(chunk)
    if human_readable:
        file_hash = hash_algo.hexdigest()
    else:
        file_hash = hash_algo.digest()
    return file_hash
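
A usage sketch (the path is only an example); note that hashlib.algorithms exists on Python 2.7, while on Python 3 you would check hashlib.algorithms_available instead:

# 'backup.tar.gz' is a hypothetical path; any readable file works.
print(hash_for_file('backup.tar.gz', algorithm='sha256'))                      # hex string
print(hash_for_file('backup.tar.gz', algorithm='md5', human_readable=False))   # raw bytes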

I'm not sure that there isn't a bit too much fussing around here. I recently had problems with md5 and files stored as blobs on MySQL so I experimented with various file sizes and the straightforward Python approach, viz:

FileHash = hashlib.md5(FileData).hexdigest()

I could detect no noticeable performance difference with a range of file sizes 2Kb to 20Mb and therefore no need to 'chunk' the hashing. Anyway, if Linux has to go to disk, it will probably do it at least as well as the average programmer's ability to keep it from doing so. As it happened, the problem was nothing to do with md5. If you're using MySQL, don't forget the md5() and sha1() functions already there.