Reading binary file and looping over each byte

Question

In Python, how do I read in a binary file and loop over each byte of that file?

User · Answer

If you have a lot of binary data to read, you might want to consider the struct module. It is documented as converting "between C and Python types", but of course, bytes are bytes, and whether those were created as C types does not matter. For example, if your binary data contains two 2-byte integers and one 4-byte integer, you can read them as follows (example taken from struct documentation):

>>> import struct
>>> struct.unpack('>hhl', b'\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)

You might find this more convenient, faster, or both, than explicitly looping over the content of a file.
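
When the data is a run of identically laid out records, struct.iter_unpack can walk the whole buffer without an explicit byte loop. The following is a minimal sketch, assuming a hypothetical record layout of two big-endian 2-byte integers and one 4-byte integer; the FMT constant and the sample buffer are illustrative only.

```python
import struct

# Hypothetical record layout: two big-endian shorts and one 4-byte int.
FMT = ">hhl"

# Two records packed back to back; in practice this would come from f.read().
buffer = struct.pack(FMT, 1, 2, 3) + struct.pack(FMT, 4, 5, 6)

# iter_unpack requires len(buffer) to be a multiple of struct.calcsize(FMT).
records = list(struct.iter_unpack(FMT, buffer))
print(records)  # [(1, 2, 3), (4, 5, 6)]
```

The explicit ">" prefix selects standard sizes and big-endian order, so the result does not depend on the platform's native alignment.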

User · Answer

To sum up all the brilliant points of chrispy, Skurmedel, Ben Hoyt and Peter Hansen, this would be the optimal solution for processing a binary file one byte at a time:

    with open("myfile", "rb") as f:
        while True:
            byte = f.read(1)
            if not byte:
                break
            do_stuff_with(ord(byte))

This works for Python versions 2.6 and above, because:

- Python buffers internally - no need to read chunks
- DRY principle - do not repeat the read line
- the with statement ensures a clean file close
- 'byte' evaluates to false when there are no more bytes (not when a byte is zero)

Or use J. F. Sebastian's solution for improved speed:

    from functools import partial

    with open(filename, 'rb') as file:
        for byte in iter(partial(file.read, 1), b''):
            # Do stuff with byte
            pass

Or if you want it as a generator function like demonstrated by codeape:

    def bytes_from_file(filename):
        with open(filename, "rb") as f:
            while True:
                byte = f.read(1)
                if not byte:
                    break
                yield ord(byte)

    # example:
    for b in bytes_from_file('filename'):
        do_stuff_with(b)
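
The point about truthiness is easy to check. The sketch below uses an in-memory stream in place of a real file; count_bytes is a hypothetical helper written for this illustration.

```python
import io

def count_bytes(fileobj):
    """Count bytes by reading one at a time until read() returns b''."""
    count = 0
    while True:
        byte = fileobj.read(1)
        if not byte:  # only the empty bytes object b'' is falsy
            break
        count += 1
    return count

# A stream containing zero bytes: b'\x00' has length 1, so it is truthy.
stream = io.BytesIO(b"\x00\x01\x00")
total = count_bytes(stream)
print(total)          # 3
print(bool(b"\x00"))  # True
print(bool(b""))      # False
```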

User · Answer

To read a file - one byte at a time (ignoring the buffering) - you could use the two-argument iter(callable, sentinel) built-in function:

    with open(filename, 'rb') as file:
        for byte in iter(lambda: file.read(1), b''):
            # Do stuff with byte
            pass

It calls file.read(1) until it returns nothing, b'' (the empty bytestring). The memory doesn't grow without limit for large files. You could pass buffering=0 to open() to disable the buffering - it guarantees that only one byte is read per iteration (slow).

The with-statement closes the file automatically - including the case when the code underneath raises an exception.

Despite the presence of internal buffering by default, it is still inefficient to process one byte at a time. For example, here's the blackhole.py utility that eats everything it is given:

    #!/usr/bin/env python3
    """Discard all input. `cat > /dev/null` analog."""
    import sys
    from functools import partial
    from collections import deque

    chunksize = int(sys.argv[1]) if len(sys.argv) > 1 else (1 << 15)
    deque(iter(partial(sys.stdin.detach().read, chunksize), b''), maxlen=0)

Example:

    $ dd if=/dev/zero bs=1M count=1000 | python3 blackhole.py

It processes ~1.5 GB/s when chunksize == 32768 on my machine and only ~7.5 MB/s when chunksize == 1. That is, it is 200 times slower to read one byte at a time. Take it into account if you can rewrite your processing to use more than one byte at a time and if you need performance.

mmap allows you to treat a file as a bytearray and a file object simultaneously. It can serve as an alternative to loading the whole file in memory if you need access to both interfaces. In particular, you can iterate one byte at a time over a memory-mapped file just using a plain for-loop:

    from mmap import ACCESS_READ, mmap

    with open(filename, 'rb', 0) as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
        for byte in s:  # length is equal to the current file size
            # Do stuff with byte
            pass

mmap supports the slice notation. For example, mm[i:i+len] returns len bytes from the file starting at position i. The context manager protocol is not supported before Python 3.2; you need to call mm.close() explicitly in this case. Iterating over each byte using mmap consumes more memory than file.read(1), but mmap is an order of magnitude faster.
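
A self-contained way to try the memory-mapped approach is sketched below. It assumes Python 3.2 or later (for the mmap context manager) and writes its own small temporary file; the payload and variable names are arbitrary choices for the illustration.

```python
import mmap
import os
import tempfile

# Write a small temporary file to map (the payload is an arbitrary example).
payload = bytes(range(16))
fd, tmp_path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as out:
    out.write(payload)

try:
    with open(tmp_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)              # equals the current file size
        middle = mm[4:8]            # slicing returns a bytes object
        count = sum(1 for _ in mm)  # plain iteration over the mapping
finally:
    os.remove(tmp_path)  # the mapping is closed by now, so removal is safe

print(size, middle, count)
```

Note that a length of 0 in the mmap call maps the whole file, which is why the file must not be empty.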

User · Answer

Python 3, read all of the file at once:

    with open("filename", "rb") as binary_file:
        # Read the whole file at once
        data = binary_file.read()
        print(data)

You can then iterate over the data variable however you like.
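
What iteration over such a bytes object produces in Python 3 can be seen in a short sketch; the literal below merely stands in for the result of binary_file.read().

```python
data = b"\x00\xffA"  # stands in for the result of binary_file.read()

# Iterating a bytes object yields ints in Python 3.
values = [b for b in data]

# Slicing keeps the bytes type, one length-1 object per position.
singles = [data[i:i + 1] for i in range(len(data))]

print(values)   # [0, 255, 65]
print(singles)  # [b'\x00', b'\xff', b'A']
```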

User · Answer

Reading binary file in Python and looping over each byte

The pathlib module (added in Python 3.4) gained a convenience method in Python 3.5, read_bytes, specifically to read in a file as bytes, allowing us to iterate over the bytes. I consider this a decent (if quick and dirty) answer:

    import pathlib

    for byte in pathlib.Path(path).read_bytes():
        print(byte)

Interesting that this is the only answer to mention pathlib.

In Python 2, you probably would do this (as Vinay Sajip also suggests):

    with open(path, 'rb') as file:
        for byte in file.read():
            print(byte)

In the case that the file may be too large to iterate over in-memory, you would chunk it, idiomatically, using the iter function with the callable, sentinel signature - the Python 2 version:

    with open(path, 'rb') as file:
        callable = lambda: file.read(1024)
        sentinel = bytes()  # or b''
        for chunk in iter(callable, sentinel):
            for byte in chunk:
                print(byte)

(Several other answers mention this, but few offer a sensible read size.)

Best practice for large files or buffered/interactive reading

Let's create a function to do this, including idiomatic uses of the standard library for Python 3.5+:

    from pathlib import Path
    from functools import partial
    from io import DEFAULT_BUFFER_SIZE

    def file_byte_iterator(path):
        """given a path, return an iterator over the file
        that lazily loads the file
        """
        path = Path(path)
        with path.open('rb') as file:
            reader = partial(file.read1, DEFAULT_BUFFER_SIZE)
            file_iterator = iter(reader, bytes())
            for chunk in file_iterator:
                yield from chunk

Note that we use file.read1. file.read blocks until it gets all the bytes requested of it or EOF. file.read1 allows us to avoid blocking, and it can return more quickly because of this. No other answers mention this either.

Demonstration of best practice usage:

Let's make a file with a megabyte (actually mebibyte) of pseudorandom data:

    import random
    import pathlib
    path = 'pseudorandom_bytes'
    pathobj = pathlib.Path(path)

    pathobj.write_bytes(
        bytes(random.randint(0, 255) for _ in range(2**20)))

Now let's iterate over it and materialize it in memory:

    >>> l = list(file_byte_iterator(path))
    >>> len(l)
    1048576

We can inspect any part of the data, for example, the last 100 and first 100 bytes:

    >>> l[-100:]
    [208, 5, 156, 186, 58, 107, 24, 12, 75, 15, 1, 252, 216, 183, 235, 6, 136, 50, 222, 218, 7, 65, 234, 129, 240, 195, 165, 215, 245, 201, 222, 95, 87, 71, 232, 235, 36, 224, 190, 185, 12, 40, 131, 54, 79, 93, 210, 6, 154, 184, 82, 222, 80, 141, 117, 110, 254, 82, 29, 166, 91, 42, 232, 72, 231, 235, 33, 180, 238, 29, 61, 250, 38, 86, 120, 38, 49, 141, 17, 190, 191, 107, 95, 223, 222, 162, 116, 153, 232, 85, 100, 97, 41, 61, 219, 233, 237, 55, 246, 181]
    >>> l[:100]
    [28, 172, 79, 126, 36, 99, 103, 191, 146, 225, 24, 48, 113, 187, 48, 185, 31, 142, 216, 187, 27, 146, 215, 61, 111, 218, 171, 4, 160, 250, 110, 51, 128, 106, 3, 10, 116, 123, 128, 31, 73, 152, 58, 49, 184, 223, 17, 176, 166, 195, 6, 35, 206, 206, 39, 231, 89, 249, 21, 112, 168, 4, 88, 169, 215, 132, 255, 168, 129, 127, 60, 252, 244, 160, 80, 155, 246, 147, 234, 227, 157, 137, 101, 84, 115, 103, 77, 44, 84, 134, 140, 77, 224, 176, 242, 254, 171, 115, 193, 29]

Don't iterate by lines for binary files

Don't do the following - this pulls a chunk of arbitrary size until it gets to a newline character - too slow when the chunks are too small, and possibly too large as well:

        with open(path, 'rb') as file:
            for chunk in file:  # text newline iteration - not for bytes
                yield from chunk

The above is only good for what are semantically human readable text files (like plain text, code, markup, markdown etc... essentially anything ascii, utf, latin, etc... encoded) that you should open without the 'b' flag.
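
The difference between line iteration and fixed-size chunking can be made concrete with a small sketch. It uses an in-memory stream and an arbitrary payload, and a chunk size of 4 chosen only to keep the output short.

```python
import io

# Arbitrary binary payload with newline bytes (0x0A) at irregular positions.
payload = b"\x01\x02\n" + b"\x03" * 10 + b"\n" + b"\x04" * 3

# Line iteration: piece sizes depend on where 0x0A happens to occur.
line_sizes = [len(line) for line in io.BytesIO(payload)]

# Chunked iteration: piece sizes are bounded by the requested read size.
stream = io.BytesIO(payload)
chunk_sizes = [len(chunk) for chunk in iter(lambda: stream.read(4), b"")]

print(line_sizes)   # [3, 11, 3]
print(chunk_sizes)  # [4, 4, 4, 4, 1]
```

With real binary data the "lines" may be a single byte or the entire file, which is the unpredictability the warning above is about.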

User · Answer

This post itself is not a direct answer to the question. What it is instead is a data-driven extensible benchmark that can be used to compare many of the answers (and variations of utilizing new features added in later, more modern, versions of Python) that have been posted to this question - and should therefore be helpful in determining which has the best performance.

In a few cases I've modified the code in the referenced answer to make it compatible with the benchmark framework.

First, here are the results for what currently are the latest versions of Python 2 & 3:

    Fastest to slowest execution speeds with 32-bit Python 2.7.16
      numpy version 1.16.5
      Test file size: 1,024 KiB
      100 executions, best of 3 repetitions

    1                  Tcll (array.array) :   3.8943 secs, rel speed   1.00x,   0.00% slower (262.95 KiB/sec)
    2  Vinay Sajip (read all into memory) :   4.1164 secs, rel speed   1.06x,   5.71% slower (248.76 KiB/sec)
    3            codeape + iter + partial :   4.1616 secs, rel speed   1.07x,   6.87% slower (246.06 KiB/sec)
    4                             codeape :   4.1889 secs, rel speed   1.08x,   7.57% slower (244.46 KiB/sec)
    5               Vinay Sajip (chunked) :   4.1977 secs, rel speed   1.08x,   7.79% slower (243.94 KiB/sec)
    6           Aaron Hall (Py 2 version) :   4.2417 secs, rel speed   1.09x,   8.92% slower (241.41 KiB/sec)
    7                     gerrit (struct) :   4.2561 secs, rel speed   1.09x,   9.29% slower (240.59 KiB/sec)
    8                     Rick M. (numpy) :   8.1398 secs, rel speed   2.09x, 109.02% slower (125.80 KiB/sec)
    9                           Skurmedel :  31.3264 secs, rel speed   8.04x, 704.42% slower ( 32.69 KiB/sec)

    Benchmark runtime (min:sec) - 03:26

    Fastest to slowest execution speeds with 32-bit Python 3.8.0
      numpy version 1.17.4
      Test file size: 1,024 KiB
      100 executions, best of 3 repetitions

    1  Vinay Sajip + "yield from" + "walrus operator" :   3.5235 secs, rel speed   1.00x,   0.00% slower (290.62 KiB/sec)
    2                       Aaron Hall + "yield from" :   3.5284 secs, rel speed   1.00x,   0.14% slower (290.22 KiB/sec)
    3         codeape + iter + partial + "yield from" :   3.5303 secs, rel speed   1.00x,   0.19% slower (290.06 KiB/sec)
    4                      Vinay Sajip + "yield from" :   3.5312 secs, rel speed   1.00x,   0.22% slower (289.99 KiB/sec)
    5      codeape + "yield from" + "walrus operator" :   3.5370 secs, rel speed   1.00x,   0.38% slower (289.51 KiB/sec)
    6                          codeape + "yield from" :   3.5390 secs, rel speed   1.00x,   0.44% slower (289.35 KiB/sec)
    7                                      jfs (mmap) :   4.0612 secs, rel speed   1.15x,  15.26% slower (252.14 KiB/sec)
    8              Vinay Sajip (read all into memory) :   4.5948 secs, rel speed   1.30x,  30.40% slower (222.86 KiB/sec)
    9                        codeape + iter + partial :   4.5994 secs, rel speed   1.31x,  30.54% slower (222.64 KiB/sec)
    10                                        codeape :   4.5995 secs, rel speed   1.31x,  30.54% slower (222.63 KiB/sec)
    11                          Vinay Sajip (chunked) :   4.6110 secs, rel speed   1.31x,  30.87% slower (222.08 KiB/sec)
    12                      Aaron Hall (Py 2 version) :   4.6292 secs, rel speed   1.31x,  31.38% slower (221.20 KiB/sec)
    13                             Tcll (array.array) :   4.8627 secs, rel speed   1.38x,  38.01% slower (210.58 KiB/sec)
    14                                gerrit (struct) :   5.0816 secs, rel speed   1.44x,  44.22% slower (201.51 KiB/sec)
    15                 Rick M. (numpy) + "yield from" :  11.8084 secs, rel speed   3.35x, 235.13% slower ( 86.72 KiB/sec)
    16                                      Skurmedel :  11.8806 secs, rel speed   3.37x, 237.18% slower ( 86.19 KiB/sec)
    17                                Rick M. (numpy) :  13.3860 secs, rel speed   3.80x, 279.91% slower ( 76.50 KiB/sec)

    Benchmark runtime (min:sec) - 04:47

I also ran it with a much larger 10 MiB test file (which took nearly an hour to run) and got performance results which were comparable to those shown above.

Here's the code used to do the benchmarking:

    from __future__ import print_function
    import array
    import atexit
    from collections import deque, namedtuple
    import io
    from mmap import ACCESS_READ, mmap
    import numpy as np
    from operator import attrgetter
    import os
    import random
    import struct
    import sys
    import tempfile
    from textwrap import dedent
    import time
    import timeit
    import traceback

    try:
        xrange
    except NameError:  # Python 3
        xrange = range


    class KiB(int):
        """ KibiBytes - multiples of the byte units for quantities of information. """
        def __new__(self, value=0):
            return 1024*value


    BIG_TEST_FILE = 1  # MiBs or 0 for a small file.
    SML_TEST_FILE = KiB(64)
    EXECUTIONS = 100  # Number of times each "algorithm" is executed per timing run.
    TIMINGS = 3  # Number of timing runs.
    CHUNK_SIZE = KiB(8)
    if BIG_TEST_FILE:
        FILE_SIZE = KiB(1024) * BIG_TEST_FILE
    else:
        FILE_SIZE = SML_TEST_FILE  # For quicker testing.

    # Common setup for all algorithms -- prefixed to each algorithm's setup.
    COMMON_SETUP = dedent("""
        # Make accessible in algorithms.
        from __main__ import array, deque, get_buffer_size, mmap, np, struct
        from __main__ import ACCESS_READ, CHUNK_SIZE, FILE_SIZE, TEMP_FILENAME
        from functools import partial
        try:
            xrange
        except NameError:  # Python 3
            xrange = range
    """)


    def get_buffer_size(path):
        """ Determine optimal buffer size for reading files. """
        st = os.stat(path)
        try:
            bufsize = st.st_blksize  # Available on some Unix systems (like Linux)
        except AttributeError:
            bufsize = io.DEFAULT_BUFFER_SIZE
        return bufsize

    # Utility primarily for use when embedding additional algorithms into benchmark.
    VERIFY_NUM_READ = """
        # Verify generator reads correct number of bytes (assumes values are correct).
        bytes_read = sum(1 for _ in file_byte_iterator(TEMP_FILENAME))
        assert bytes_read == FILE_SIZE, (
            'Wrong number of bytes generated: got {:,} instead of {:,}'.format(
                bytes_read, FILE_SIZE))
    """

    TIMING = namedtuple('TIMING', 'label, exec_time')

    class Algorithm(namedtuple('CodeFragments', 'setup, test')):

        # Default timeit "stmt" code fragment.
        _TEST = """
            #for b in file_byte_iterator(TEMP_FILENAME):  # Loop over every byte.
            #    pass  # Do stuff with byte...
            deque(file_byte_iterator(TEMP_FILENAME), maxlen=0)  # Data sink.
        """

        # Must overload __new__ because (named)tuples are immutable.
        def __new__(cls, setup, test=None):
            """ Dedent (unindent) code fragment string arguments.
            Args:
              `setup` -- Code fragment that defines things used by `test` code.
                         In this case it should define a generator function named
                         `file_byte_iterator()` that will be passed that name of a test file
                         of binary data. This code is not timed.
              `test` -- Code fragment that uses things defined in `setup` code.
                        Defaults to _TEST. This is the code that's timed.
            """
            test = cls._TEST if test is None else test  # Use default unless one is provided.

            # Uncomment to replace all performance tests with one that verifies the correct
            # number of bytes values are being generated by the file_byte_iterator function.
            #test = VERIFY_NUM_READ

            return tuple.__new__(cls, (dedent(setup), dedent(test)))


    algorithms = {

        'Aaron Hall (Py 2 version)': Algorithm("""
            def file_byte_iterator(path):
                with open(path, "rb") as file:
                    callable = partial(file.read, 1024)
                    sentinel = bytes() # or b''
                    for chunk in iter(callable, sentinel):
                        for byte in chunk:
                            yield byte
        """),

        'codeape': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    while True:
                        chunk = f.read(chunksize)
                        if chunk:
                            for b in chunk:
                                yield b
                        else:
                            break
        """),

        'codeape + iter + partial': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    for chunk in iter(partial(f.read, chunksize), b''):
                        for b in chunk:
                            yield b
        """),

        'gerrit (struct)': Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f:
                    fmt = '{}B'.format(FILE_SIZE)  # Reads entire file at once.
                    for b in struct.unpack(fmt, f.read()):
                        yield b
        """),

        'Rick M. (numpy)': Algorithm("""
            def file_byte_iterator(filename):
                for byte in np.fromfile(filename, 'u1'):
                    yield byte
        """),

        'Skurmedel': Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f:
                    byte = f.read(1)
                    while byte:
                        yield byte
                        byte = f.read(1)
        """),

        'Tcll (array.array)': Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f:
                    arr = array.array('B')
                    arr.fromfile(f, FILE_SIZE)  # Reads entire file at once.
                    for b in arr:
                        yield b
        """),

        'Vinay Sajip (read all into memory)': Algorithm("""
            def file_byte_iterator(filename):
                with open(filename, "rb") as f:
                    bytes_read = f.read()  # Reads entire file at once.
                for b in bytes_read:
                    yield b
        """),

        'Vinay Sajip (chunked)': Algorithm("""
            def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                with open(filename, "rb") as f:
                    chunk = f.read(chunksize)
                    while chunk:
                        for b in chunk:
                            yield b
                        chunk = f.read(chunksize)
        """),

    }  # End algorithms

    #
    # Versions of algorithms that will only work in certain releases (or better) of Python.
    #
    if sys.version_info >= (3, 3):
        algorithms.update({

            'codeape + iter + partial + "yield from"': Algorithm("""
                def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                    with open(filename, "rb") as f:
                        for chunk in iter(partial(f.read, chunksize), b''):
                            yield from chunk
            """),

            'codeape + "yield from"': Algorithm("""
                def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                    with open(filename, "rb") as f:
                        while True:
                            chunk = f.read(chunksize)
                            if chunk:
                                yield from chunk
                            else:
                                break
            """),

            'jfs (mmap)': Algorithm("""
                def file_byte_iterator(filename):
                    with open(filename, "rb") as f, \
                         mmap(f.fileno(), 0, access=ACCESS_READ) as s:
                        yield from s
            """),

            'Rick M. (numpy) + "yield from"': Algorithm("""
                def file_byte_iterator(filename):
                #    data = np.fromfile(filename, 'u1')
                    yield from np.fromfile(filename, 'u1')
            """),

            'Vinay Sajip + "yield from"': Algorithm("""
                def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                    with open(filename, "rb") as f:
                        chunk = f.read(chunksize)
                        while chunk:
                            yield from chunk  # Added in Py 3.3
                            chunk = f.read(chunksize)
            """),

        })  # End Python 3.3 update.

    if sys.version_info >= (3, 5):
        algorithms.update({

            'Aaron Hall + "yield from"': Algorithm("""
                from pathlib import Path

                def file_byte_iterator(path):
                    ''' Given a path, return an iterator over the file
                        that lazily loads the file.
                    '''
                    path = Path(path)
                    bufsize = get_buffer_size(path)

                    with path.open('rb') as file:
                        reader = partial(file.read1, bufsize)
                        for chunk in iter(reader, bytes()):
                            yield from chunk
            """),

        })  # End Python 3.5 update.

    if sys.version_info >= (3, 8, 0):
        algorithms.update({

            'Vinay Sajip + "yield from" + "walrus operator"': Algorithm("""
                def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                    with open(filename, "rb") as f:
                        while chunk := f.read(chunksize):
                            yield from chunk  # Added in Py 3.3
            """),

            'codeape + "yield from" + "walrus operator"': Algorithm("""
                def file_byte_iterator(filename, chunksize=CHUNK_SIZE):
                    with open(filename, "rb") as f:
                        while chunk := f.read(chunksize):
                            yield from chunk
            """),

        })  # End Python 3.8.0 update.


    #### Main ####

    def main():
        global TEMP_FILENAME

        def cleanup():
            """ Clean up after testing is completed. """
            try:
                os.remove(TEMP_FILENAME)  # Delete the temporary file.
            except Exception:
                pass

        atexit.register(cleanup)

        # Create a named temporary binary file of pseudo-random bytes for testing.
        fd, TEMP_FILENAME = tempfile.mkstemp('.bin')
        with os.fdopen(fd, 'wb') as file:
             os.write(fd, bytearray(random.randrange(256) for _ in range(FILE_SIZE)))

        # Execute and time each algorithm, gather results.
        start_time = time.time()  # To determine how long testing itself takes.

        timings = []
        for label in algorithms:
            try:
                timing = TIMING(label,
                                min(timeit.repeat(algorithms[label].test,
                                                  setup=COMMON_SETUP + algorithms[label].setup,
                                                  repeat=TIMINGS, number=EXECUTIONS)))
            except Exception as exc:
                print('{} occurred timing the algorithm: "{}"\n  {}'.format(
                        type(exc).__name__, label, exc))
                traceback.print_exc(file=sys.stdout)  # Redirect to stdout.
                sys.exit(1)
            timings.append(timing)

        # Report results.
        print('Fastest to slowest execution speeds with {}-bit Python {}.{}.{}'.format(
                64 if sys.maxsize > 2**32 else 32, *sys.version_info[:3]))
        print('  numpy version {}'.format(np.version.full_version))
        print('  Test file size: {:,} KiB'.format(FILE_SIZE // KiB(1)))
        print('  {:,d} executions, best of {:d} repetitions'.format(EXECUTIONS, TIMINGS))
        print()

        longest = max(len(timing.label) for timing in timings)  # Len of longest identifier.
        ranked = sorted(timings, key=attrgetter('exec_time'))  # Sort so fastest is first.
        fastest = ranked[0].exec_time
        for rank, timing in enumerate(ranked, 1):
            print('{:<2d} {:>{width}} : {:8.4f} secs, rel speed {:6.2f}x, {:6.2f}% slower '
                  '({:6.2f} KiB/sec)'.format(
                        rank,
                        timing.label, timing.exec_time, round(timing.exec_time/fastest, 2),
                        round((timing.exec_time/fastest - 1) * 100, 2),
                        (FILE_SIZE/timing.exec_time) / KiB(1),  # per sec.
                        width=longest))
        print()
        mins, secs = divmod(time.time()-start_time, 60)
        print('Benchmark runtime (min:sec) - {:02d}:{:02d}'.format(int(mins),
                                                                   int(round(secs))))

    main()
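
For a quick feel of the same comparison without the full framework, a much smaller timing sketch is possible. The version below is an illustration only, not the benchmark above: it reads from an in-memory buffer of arbitrary size, and the function names one_at_a_time and chunked are hypothetical.

```python
import io
import timeit

DATA = bytes(range(256)) * 64  # 16 KiB of sample data (arbitrary size)

def one_at_a_time(data):
    """Count bytes using one read(1) call per byte."""
    f = io.BytesIO(data)
    return sum(1 for _ in iter(lambda: f.read(1), b""))

def chunked(data, chunksize=8192):
    """Count bytes by reading chunks and iterating within each chunk."""
    f = io.BytesIO(data)
    return sum(1 for chunk in iter(lambda: f.read(chunksize), b"")
               for _ in chunk)

# Timings vary by machine and Python version, so none are quoted here.
t_single = timeit.timeit(lambda: one_at_a_time(DATA), number=5)
t_chunk = timeit.timeit(lambda: chunked(DATA), number=5)
print("read(1): {:.4f}s  chunked: {:.4f}s".format(t_single, t_chunk))
```

Because no disk I/O is involved, this measures only the per-call overhead of read(1) against chunked iteration, which is one part of what the full benchmark captures.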

User · Answer

Here's an example of reading network-endian data using NumPy's fromfile, addressing @Nirmal's comments above:

    import numpy as np

    dtheader = np.dtype([('Start Name', 'b', (4,)),
                         ('Message Type', np.int32, (1,)),
                         ('Instance', np.int32, (1,)),
                         ('NumItems', np.int32, (1,)),
                         ('Length', np.int32, (1,)),
                         ('ComplexArray', np.int32, (1,))])
    dtheader = dtheader.newbyteorder('>')

    # iqfile is the path (or open file object) of the data file
    headerinfo = np.fromfile(iqfile, dtype=dtheader, count=1)

    print(headerinfo['Start Name'])

I hope this helps. The problem is that fromfile doesn't recognize an EOF and allow gracefully breaking out of the loop for files of arbitrary size.
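
The EOF problem mentioned above can be sidestepped with the standard library alone. The sketch below swaps NumPy for struct, which is a different technique from the answer's; it assumes the same layout (a 4-byte name followed by five big-endian 32-bit integers), and read_headers plus the sample data are hypothetical.

```python
import io
import struct

# Assumed layout mirroring the dtype above: a 4-byte name followed by
# five big-endian (network order) 32-bit integers.
HEADER = struct.Struct(">4s5i")

def read_headers(fileobj):
    """Yield header tuples until the stream ends."""
    while True:
        raw = fileobj.read(HEADER.size)
        if len(raw) < HEADER.size:  # short read means EOF; stop cleanly
            break
        yield HEADER.unpack(raw)

# In-memory stand-in for the data file, holding two headers.
stream = io.BytesIO(HEADER.pack(b"HDR1", 1, 2, 3, 4, 5) +
                    HEADER.pack(b"HDR2", 6, 7, 8, 9, 10))
headers = list(read_headers(stream))
print(headers[0][0], len(headers))
```

A trailing partial record is silently dropped here; raising an error instead may be preferable when truncation should be detected.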

User · Answer

If the file is small enough that holding it in memory is not a problem:

    with open("filename", "rb") as f:
        bytes_read = f.read()
    for b in bytes_read:
        process_byte(b)

where process_byte represents some operation you want to perform on the passed-in byte.

If you want to process a chunk at a time:

    with open("filename", "rb") as f:
        bytes_read = f.read(CHUNKSIZE)
        while bytes_read:
            for b in bytes_read:
                process_byte(b)
            bytes_read = f.read(CHUNKSIZE)

The with statement is available in Python 2.5 and greater.
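
As one concrete stand-in for process_byte, the chunked loop can feed a byte histogram. This is a sketch under assumptions: byte_histogram is a hypothetical helper, the stream is in memory, and the chunk size is deliberately tiny so the loop runs several times.

```python
import io
from collections import Counter

CHUNKSIZE = 4  # deliberately tiny so the loop runs several times

def byte_histogram(fileobj, chunksize=CHUNKSIZE):
    """Count how often each byte value occurs, reading chunk by chunk."""
    counts = Counter()
    chunk = fileobj.read(chunksize)
    while chunk:
        counts.update(chunk)  # iterating bytes yields ints in Python 3
        chunk = fileobj.read(chunksize)
    return counts

hist = byte_histogram(io.BytesIO(b"abracadabra"))
print(hist[ord("a")], hist[ord("b")])  # 5 2
```

Handing the whole chunk to Counter.update avoids a Python-level call per byte, which is usually where chunked processing gains its speed.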

User · Answer

This generator yields bytes from a file, reading the file in chunks:

    def bytes_from_file(filename, chunksize=8192):
        with open(filename, "rb") as f:
            while True:
                chunk = f.read(chunksize)
                if chunk:
                    for b in chunk:
                        yield b
                else:
                    break

    # example:
    for b in bytes_from_file('filename'):
        do_stuff_with(b)

See the Python documentation for information on iterators and generators.

User · Answer

After trying all the above and using the answer from @Aaron Hall, I was getting memory errors for a ~90 MB file on a computer running Windows 10 with 8 GB RAM and 32-bit Python 3.5. A colleague recommended that I use numpy instead, and it works wonders.

By far the fastest way to read an entire binary file (that I have tested) is:

    import numpy as np

    file = "binary_file.bin"
    data = np.fromfile(file, 'u1')

It is many times faster than any other method so far. Hope it helps someone!

User · Answer

Python 2.4 and Earlier

    f = open("myfile", "rb")
    try:
        byte = f.read(1)
        while byte != "":
            # Do stuff with byte.
            byte = f.read(1)
    finally:
        f.close()

Python 2.5-2.7

    with open("myfile", "rb") as f:
        byte = f.read(1)
        while byte != "":
            # Do stuff with byte.
            byte = f.read(1)

Note that the with statement is not available in versions of Python below 2.5. To use it in v 2.5 you'll need to import it:

    from __future__ import with_statement

In 2.6 this is not needed.

Python 3

In Python 3, it's a bit different. We will no longer get raw characters from the stream in byte mode but bytes objects, thus we need to alter the condition:

    with open("myfile", "rb") as f:
        byte = f.read(1)
        while byte != b"":
            # Do stuff with byte.
            byte = f.read(1)

Or as benhoyt says, skip the not equal and take advantage of the fact that b"" evaluates to false. This makes the code compatible between 2.6 and 3.x without any changes. It would also save you from changing the condition if you go from byte mode to text or the reverse.

    with open("myfile", "rb") as f:
        byte = f.read(1)
        while byte:
            # Do stuff with byte.
            byte = f.read(1)

Python 3.8

From now on, thanks to the := operator, the above code can be written in a shorter way:

    with open("myfile", "rb") as f:
        while (byte := f.read(1)):
            # Do stuff with byte.
            pass
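
In Python 3 each value returned by f.read(1) is a bytes object of length 1, so "doing stuff" with it often starts by converting it to an integer. A minimal sketch, using an in-memory stream in place of "myfile":

```python
import io

stream = io.BytesIO(b"A\x00\xff")  # in-memory stand-in for the opened file
values = []

byte = stream.read(1)
while byte:
    as_int = byte[0]            # Python 3: indexing bytes gives an int
    assert as_int == ord(byte)  # ord() accepts a length-1 bytes object too
    values.append(as_int)
    byte = stream.read(1)

print(values)  # [65, 0, 255]
```

ord(byte) also works on the 1-character str that Python 2 returns, which makes it the more portable of the two spellings.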

User · Answer

If you are looking for something speedy, here's a method I've been using that's worked for years:

    from array import array

    with open(path, 'rb') as file:
        data = array('B', file.read())  # buffer the file

    # evaluate its data
    for byte in data:
        v = byte        # int value
        c = chr(byte)

If you want to iterate chars instead of ints in Python 2, you can simply use data = file.read(), which is a str there. In Python 3 the same call returns a bytes object, and iterating it still yields ints.
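
The behaviour of the array approach can be checked without a file on disk. The sketch below assumes Python 3 and uses an in-memory stream with arbitrary sample bytes.

```python
import io
from array import array

raw = io.BytesIO(b"\x01\x02\xfe")  # in-memory stand-in for an opened file
data = array("B", raw.read())      # 'B' means unsigned 8-bit items

ints = list(data)                  # each item is an int in range 0..255
chars = [chr(b) for b in data]     # one-character strings, if needed

print(ints, data.itemsize)  # [1, 2, 254] 1
```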
