[python] Hashing a file in Python

I want Python to read to the EOF so I can get an appropriate hash, whether it is SHA-1 or MD5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed


The answer is


TL;DR: use buffers so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 gigs of RAM for a 2 gigabyte file so, as pasztorpisti points out, we gotta deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KiB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

# Open in binary mode so the bytes are hashed exactly as stored on disk.
with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:  # an empty bytes object means EOF
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64 KiB chunks as we go along, using hashlib's handy dandy update method. This way we use a lot less memory than the 2 GB it would take to hash the guy all at once!
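
If you want to convince yourself that feeding the data in pieces gives the same digest as hashing it in one go, here is a minimal sketch. The chunked_md5 helper and the sample payload are made up for this demo, and io.BytesIO stands in for a real file:

```python
import hashlib
import io

def chunked_md5(stream, buf_size=65536):
    """Feed a binary stream to MD5 in buf_size pieces and return the hex digest."""
    md5 = hashlib.md5()
    while True:
        data = stream.read(buf_size)
        if not data:  # empty bytes means the stream is exhausted
            break
        md5.update(data)
    return md5.hexdigest()

# Sample bytes invented for the demo (about 1.2 MB).
payload = b"hello world\n" * 100000

one_shot = hashlib.md5(payload).hexdigest()
in_chunks = chunked_md5(io.BytesIO(payload))
print(one_shot == in_chunks)
```

Calling update() repeatedly is equivalent to a single call with all the pieces concatenated, so the comparison should come out equal whatever buf_size you pick.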

You can test this with (mkfile and md5 here are the macOS command-line tools):

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Hope that helps!

Also, all of this is outlined in the linked question: Get MD5 hash of big files in Python


Addendum!

In general, when writing Python it helps to get into the habit of following PEP 8. For example, in Python variable names are typically snake_case (underscore separated), not camelCase. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
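
As a rough illustration, here is how the script from the question might look in Python 3 with PEP 8 style names and the chunked reading from above. The hash_file and main names are my own choice, not something from the question:

```python
import hashlib

def hash_file(file_name, buf_size=65536):
    """Return (md5_hex, sha1_hex) for the file at file_name."""
    md5_hash = hashlib.md5()
    sha1_hash = hashlib.sha1()
    with open(file_name, 'rb') as opened_file:
        while True:
            data = opened_file.read(buf_size)
            if not data:
                break
            md5_hash.update(data)
            sha1_hash.update(data)
    return md5_hash.hexdigest(), sha1_hash.hexdigest()

def main():
    # input() replaces Python 2's raw_input()
    input_file = input("Enter the name of the file: ")
    md5_hashed, sha1_hashed = hash_file(input_file)
    print("File Name: %s" % input_file)
    print("MD5: %s" % md5_hashed)
    print("SHA1: %s" % sha1_hashed)
```

Call main() to get the same prompt-and-print behaviour as the original script.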


I would propose simply:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

The other answers here seem to overcomplicate things. Python already buffers reads (with sensible defaults, and you can configure that buffering if you know more about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes computing the hash faster or at least less CPU intensive. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
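
For reference, block_size is the hash algorithm's internal block size in bytes, and it is fairly small. The sketch below just prints a few values next to Python's default buffer size; the reads_needed helper is hypothetical and only counts how many read() calls a given chunk size implies:

```python
import hashlib
import io

# Internal block size of a few common algorithms, in bytes.
for name in ("md5", "sha1", "sha256", "sha512"):
    h = hashlib.new(name)
    print(name, "block_size =", h.block_size, "bytes")

# Size of the buffer Python uses by default for buffered binary files.
print("io.DEFAULT_BUFFER_SIZE =", io.DEFAULT_BUFFER_SIZE, "bytes")

def reads_needed(file_size, chunk_size):
    """Number of non-empty read() calls needed to consume file_size bytes."""
    return -(-file_size // chunk_size)  # ceiling division
```

With a 64-byte block, a 1 GiB file means millions of Python-level read() and update() calls, so it is probably worth timing this against a larger chunk size on your own data before settling on one.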


I have programmed a module which is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

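def sha256sum(filename):
    h = hashlib.sha256()
    b = bytearray(128*1024)  # reusable 128 KiB buffer
    mv = memoryview(b)       # lets us slice the buffer without copying
    with open(filename, 'rb', buffering=0) as f:
        # readinto() returns the number of bytes read, 0 at EOF.
        for n in iter(lambda: f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()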
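
On Python 3.11 and newer, the standard library also ships hashlib.file_digest(), which takes care of the read loop for you. A minimal sketch, assuming you may still need to run on older interpreters (the sha256sum_stdlib name and the 128 KiB fallback chunk size are my own choices):

```python
import hashlib

def sha256sum_stdlib(filename):
    """SHA-256 hex digest of a file, using hashlib.file_digest when present."""
    with open(filename, 'rb') as f:
        if hasattr(hashlib, "file_digest"):  # Python 3.11+
            return hashlib.file_digest(f, "sha256").hexdigest()
        # Fallback for older interpreters: plain chunked loop.
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(128 * 1024), b""):
            h.update(chunk)
        return h.hexdigest()
```

The file still has to be opened in binary mode for file_digest() to accept it.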

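import hashlib

# Hash a string typed by the user (not a file's contents).
user = input("Enter ")
h = hashlib.md5(user.encode())  # str must be encoded to bytes before hashing
h2 = h.hexdigest()

# Note: MD5 is a one-way hash, not encryption, despite the file name.
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

# Read the stored hex digest back and show it.
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)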

Here is a Python 3, POSIX solution (not Windows!) that uses mmap to map the object into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
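
If you need this to run on Windows too, a possible variant is to pass access=mmap.ACCESS_READ instead of the Unix-only prot argument. This is only a sketch: the sha256sum_mmap name is mine, and the empty-file check is there because a zero-length file cannot be mapped:

```python
import hashlib
import mmap
import os

def sha256sum_mmap(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # An empty file cannot be mapped, so return the hash of zero bytes.
        if os.fstat(f.fileno()).st_size == 0:
            return h.hexdigest()
        # access=ACCESS_READ is accepted on both Unix and Windows.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            h.update(mm)
    return h.hexdigest()
```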
