Extract images from PDF without resampling in python

Question

How might one extract all images from a PDF document, at native resolution and format? (Meaning extract TIFF as TIFF, JPEG as JPEG, etc., and without resampling.) Layout is unimportant; I don't care where the source image is located on the page.

I'm using Python 2.7 but can use 3.x if required.

User · Answer

You could use the pdfimages command in Ubuntu as well.

Install the poppler library using the commands below:

    sudo apt install poppler-utils
    sudo apt-get install python-poppler

Then run:

    pdfimages file.pdf image

The files created are (for example, when there are two images in the PDF):

    image-000.png
    image-001.png

It works! Now you can use subprocess.run to run this from Python.
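
The subprocess.run step mentioned above could look roughly like the sketch below. The helper names build_pdfimages_cmd and extract_images are made up for this example; the -all flag (keep JPEG, JPEG 2000, JBIG2 and CCITT images in their embedded format) is assumed to be available, which depends on the installed poppler version.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_prefix, native=True):
    """Return the argument list for a pdfimages call (hypothetical helper)."""
    cmd = ["pdfimages"]
    if native:
        # -all keeps JPEG, JPEG 2000, JBIG2 and CCITT images in their
        # embedded format instead of converting everything to PPM/PBM.
        cmd.append("-all")
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def extract_images(pdf_path, out_dir, prefix="image"):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; install poppler-utils first")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Passing a list (no shell=True) avoids quoting problems in file names.
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir / prefix), check=True)
    return sorted(out_dir.glob(prefix + "-*"))
```

Calling extract_images("file.pdf", "images_found") would then return the paths of the extracted files, provided pdfimages is on the PATH.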

User · Answer

PikePDF can do this with very little code:

    from pikepdf import Pdf, PdfImage

    filename = "sample-in.pdf"
    example = Pdf.open(filename)

    for i, page in enumerate(example.pages):
        for j, (name, raw_image) in enumerate(page.images.items()):
            image = PdfImage(raw_image)
            out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

extract_to will automatically pick the file extension based on how the image is encoded in the PDF.

If you want, you could also print some detail about the images as they get extracted:

            # Optional: print info about image
            w = raw_image.stream_dict.Width
            h = raw_image.stream_dict.Height
            f = raw_image.stream_dict.Filter
            size = raw_image.stream_dict.Length

            print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")

which can print something like:

    Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
    Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png

See the docs for more that you can do with images, including replacing them in the PDF file.

User · Answer

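First install pdf2image:

    pip install pdf2image==1.14.0

Use the code below to extract the pages of the PDF as images. Note that this renders whole pages at the given DPI; it does not pull out the embedded images themselves.

    from pdf2image import convert_from_path, pdfinfo_from_path

    file_path = "file path of PDF"
    image_path = "output folder"  # placeholder: directory for the PNG files

    info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
    maxPages = info["Pages"]
    image_counter = 0

    if maxPages > 10:
        # Convert ten pages at a time to keep memory use bounded.
        for page in range(1, maxPages, 10):
            pages = convert_from_path(file_path, dpi=300, first_page=page,
                                      last_page=min(page + 10 - 1, maxPages))
            for img in pages:
                img.save(image_path + '/' + str(image_counter) + '.png', 'PNG')
                image_counter += 1
    else:
        pages = convert_from_path(file_path, 300)
        for i, j in enumerate(pages):
            j.save(image_path + '/' + str(i) + '.png', 'PNG')

Hope it helps coders looking for an easy way to convert PDF files to images, one per page.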

User · Answer

You can use the module PyMuPDF. This outputs all images as .png files, but worked out of the box and is fast.

    import fitz

    doc = fitz.open("file.pdf")
    for i in range(len(doc)):
        for img in doc.getPageImageList(i):
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            if pix.n < 5:       # this is GRAY or RGB
                pix.writePNG("p%s-%s.png" % (i, xref))
            else:               # CMYK: convert to RGB first
                pix1 = fitz.Pixmap(fitz.csRGB, pix)
                pix1.writePNG("p%s-%s.png" % (i, xref))
                pix1 = None
            pix = None

See here for more resources.

User · Answer

Libpoppler comes with a tool called "pdfimages" that does exactly this. (On Ubuntu systems it's in the poppler-utils package.)

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows binaries: http://blog.alivate.com.au/poppler-windows/

User · Answer

I prefer minecart as it is extremely easy to use. The snippet below shows how to extract images from a PDF:

    # pip install minecart
    import minecart

    pdffile = open('Invoices.pdf', 'rb')
    doc = minecart.Document(pdffile)

    page = doc.get_page(0)  # getting a single page

    # iterating through all pages
    for page in doc.iter_pages():
        im = page.images[0].as_pil()  # requires pillow
        display(im)                   # display() is available in IPython/Jupyter

User · Answer

I started from the code of @sylvain. There were some flaws, like the exception "NotImplementedError: unsupported filter /DCTDecode" raised by getData, or the fact that the code failed to find images in some pages because they were at a deeper level than the page.

Here is my code:

    import PyPDF2
    from PIL import Image
    import sys
    from os import path
    import warnings
    warnings.filterwarnings("ignore")

    number = 0

    def recurse(page, xObject):
        global number

        xObject = xObject['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj]._data  # raw, still-encoded stream bytes
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"

                imagename = "%s - p. %s - %s" % (abspath[:-4], page, obj[1:])

                if xObject[obj]['/Filter'] == '/FlateDecode':
                    # Image.frombytes needs decompressed samples, so decode here.
                    img = Image.frombytes(mode, size, xObject[obj].getData())
                    img.save(imagename + ".png")
                    number += 1
                elif xObject[obj]['/Filter'] == '/DCTDecode':
                    img = open(imagename + ".jpg", "wb")
                    img.write(data)
                    img.close()
                    number += 1
                elif xObject[obj]['/Filter'] == '/JPXDecode':
                    img = open(imagename + ".jp2", "wb")
                    img.write(data)
                    img.close()
                    number += 1
            else:
                # Not an image: descend into the nested XObject.
                recurse(page, xObject[obj])

    try:
        _, filename, *pages = sys.argv
        *pages, = map(int, pages)
        abspath = path.abspath(filename)
    except BaseException:
        print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
        sys.exit()

    file = PyPDF2.PdfFileReader(open(filename, "rb"))

    for p in pages:
        page0 = file.getPage(p - 1)
        recurse(p, page0)

    print('%s extracted images' % number)

User · Answer

Well, I have been struggling with this for many weeks. Many of these answers helped me through, but there was always something missing; apparently no one here has ever had problems with JBIG2-encoded images.

In the bunch of PDFs that I have to scan, images encoded in JBIG2 are very popular. As far as I understand, many copy/scan machines scan papers and transform them into PDF files full of JBIG2-encoded images.

So after many days of tests I decided to go for the answer proposed here by @dkagedal a long time ago.

Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it's going to be much easier).

First step:

    apt-get install poppler-utils

Then I was able to run the command-line tool called pdfimages like this:

    pdfimages -all myfile.pdf ./images_found/

With the above command you will be able to extract all the images contained in myfile.pdf and have them saved inside images_found (you have to create images_found beforehand).

In the list you will find several types of images (png, jpg, tiff); all of these are easily readable with any graphics tool.

Then you will have some files named like -145.jb2e and -145.jb2g.

These two files contain ONE IMAGE encoded in JBIG2, saved in two different files: one for the header and one for the data.

Again, I lost many days trying to find out how to convert those files into something readable, and finally I came across a tool called jbig2dec.

So first you need to install this magic tool:

    apt-get install jbig2dec

Then you can run:

    jbig2dec -t png -145.jb2g -145.jb2e

You will finally be able to get all the extracted images converted into something useful.

Good luck!
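
When pdfimages leaves many .jb2g/.jb2e pairs behind, the matching step can be scripted. The following is only a sketch: pair_jbig2_files and jbig2dec_commands are hypothetical helpers, the example file names are invented, and the argument order simply copies the jbig2dec call shown above (.jb2g file first, then .jb2e).

```python
import os
from collections import defaultdict


def pair_jbig2_files(filenames):
    """Group '<stem>.jb2g' and '<stem>.jb2e' names by their shared stem."""
    by_stem = defaultdict(dict)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext in (".jb2g", ".jb2e"):
            by_stem[stem][ext] = name
    # Keep only stems for which both halves are present.
    return [(parts[".jb2g"], parts[".jb2e"])
            for stem, parts in sorted(by_stem.items())
            if ".jb2g" in parts and ".jb2e" in parts]


def jbig2dec_commands(directory, filenames):
    """Build one jbig2dec argument list per image pair."""
    commands = []
    for global_file, page_file in pair_jbig2_files(filenames):
        # Joining with the directory also keeps names that start with '-'
        # from being mistaken for command-line options.
        commands.append(["jbig2dec", "-t", "png",
                         os.path.join(directory, global_file),
                         os.path.join(directory, page_file)])
    return commands
```

Each returned list can be handed to subprocess.run(cmd, check=True) once jbig2dec is installed.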

User · Answer

Much easier solution:

Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line below installs poppler-utils using Homebrew. After installation, the second line (run from the command line) extracts images from a PDF file and names them "image*". To run this program from within Python, use the os or subprocess module. The third block uses the os module; beneath that is an example with subprocess (Python 3.5 or later for the run() function). More info here: https://www.cyberciti.biz/faq/easily-extract-images-from-pdf-file/

    brew install poppler

    pdfimages file.pdf image

    import os
    os.system('pdfimages file.pdf image')

or

    import subprocess
    subprocess.run('pdfimages file.pdf image', shell=True)

User · Answer

Here is my version from 2019 that recursively gets all images from a PDF and reads them with PIL. It is compatible with Python 2/3. I also found that sometimes an image in a PDF may be compressed with zlib, so my code supports decompression.

    #!/usr/bin/env python3
    try:
        from StringIO import StringIO
    except ImportError:
        from io import BytesIO as StringIO
    from PIL import Image
    from PyPDF2 import PdfFileReader, generic
    import zlib


    def get_color_mode(obj):
        try:
            cspace = obj['/ColorSpace']
        except KeyError:
            return None

        if cspace == '/DeviceRGB':
            return "RGB"
        elif cspace == '/DeviceCMYK':
            return "CMYK"
        elif cspace == '/DeviceGray':
            return "P"

        if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
            # /N is the number of colour components in the ICC profile.
            color_map = obj['/ColorSpace'][1].getObject()['/N']
            if color_map == 1:
                return "P"
            elif color_map == 3:
                return "RGB"
            elif color_map == 4:
                return "CMYK"


    def get_object_images(x_obj):
        images = []
        for obj_name in x_obj:
            sub_obj = x_obj[obj_name]

            if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
                # Nested XObject: recurse into its own resources.
                images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

            elif sub_obj['/Subtype'] == '/Image':
                zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                if zlib_compressed:
                    sub_obj._data = zlib.decompress(sub_obj._data)

                images.append((
                    get_color_mode(sub_obj),
                    (sub_obj['/Width'], sub_obj['/Height']),
                    sub_obj._data
                ))

        return images


    def get_pdf_images(pdf_fp):
        images = []
        try:
            pdf_in = PdfFileReader(open(pdf_fp, "rb"))
        except Exception:
            return images

        for p_n in range(pdf_in.numPages):
            page = pdf_in.getPage(p_n)

            try:
                page_x_obj = page['/Resources']['/XObject'].getObject()
            except KeyError:
                continue

            images += get_object_images(page_x_obj)

        return images


    if __name__ == "__main__":
        pdf_fp = "test.pdf"

        for image in get_pdf_images(pdf_fp):
            (mode, size, data) = image
            try:
                img = Image.open(StringIO(data))
            except Exception as e:
                print("Failed to read image with PIL: {}".format(e))
                continue
            # Do whatever you want with the image
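
To make the zlib step concrete: after inflating a FlateDecode stream, the result should be the raw samples, whose size follows from the image dimensions. The sketch below is an illustration rather than part of the answer's code. expected_raw_size and inflate_image_stream are hypothetical names, and it assumes the stream uses no PNG predictor (a predictor adds one extra byte per row, so the size check would not hold).

```python
import zlib


def expected_raw_size(width, height, components, bits_per_component=8):
    """Bytes of uncompressed sample data; each row is padded to a whole byte."""
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8
    return bytes_per_row * height


def inflate_image_stream(compressed, width, height, components,
                         bits_per_component=8):
    """Decompress Flate data and check it matches the declared image size."""
    raw = zlib.decompress(compressed)
    expected = expected_raw_size(width, height, components, bits_per_component)
    if len(raw) != expected:
        raise ValueError("got %d bytes, expected %d" % (len(raw), expected))
    return raw


# A 2x2 RGB image is 12 bytes of samples once decompressed.
pixels = bytes(range(12))
assert inflate_image_stream(zlib.compress(pixels), 2, 2, 3) == pixels
```

Data that passes this check is what Image.frombytes(mode, size, data) expects, as opposed to Image.open, which needs a complete image file.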

User · Answer

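Try the code below; it will extract all JPEG images from the PDF.

    import sys
    import PyPDF2
    from PIL import Image

    pdf = sys.argv[1]
    print(pdf)
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    for x in range(0, input1.numPages):
        xObject = input1.getPage(x)
        xObject = xObject['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                print(size)
                data = xObject[obj]._data
                # print(data)
                print(xObject[obj]['/Filter'])
                # "in" works whether /Filter is a single name or a list.
                if '/DCTDecode' in xObject[obj]['/Filter']:
                    # Include the object name so that several images on the
                    # same page do not overwrite each other.
                    img_name = str(x) + "-" + obj[1:] + ".jpg"
                    print(img_name)
                    img = open(img_name, "wb")
                    img.write(data)
                    img.close()
        print(str(x) + " is done")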

User · Answer

Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPG file. You can use this to very simply extract byte ranges from the PDF. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.

User · Answer

In Python with the PyPDF2 and Pillow libraries it is simple:

    import PyPDF2

    from PIL import Image

    if __name__ == '__main__':
        input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
        page0 = input1.getPage(0)
        xObject = page0['/Resources']['/XObject'].getObject()

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                else:
                    mode = "P"

                if xObject[obj]['/Filter'] == '/FlateDecode':
                    img = Image.frombytes(mode, size, data)
                    img.save(obj[1:] + ".png")
                elif xObject[obj]['/Filter'] == '/DCTDecode':
                    img = open(obj[1:] + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif xObject[obj]['/Filter'] == '/JPXDecode':
                    img = open(obj[1:] + ".jp2", "wb")
                    img.write(data)
                    img.close()

User · Answer

As of February 2019, the solution given by @sylvain (at least on my setup) does not work without a small modification: xObject[obj]['/Filter'] is not a value but a list, so in order to make the script work I had to modify the format checking as follows:

    import PyPDF2, traceback

    from PIL import Image

    src = "input.pdf"  # placeholder: path to the PDF to read

    input1 = PyPDF2.PdfFileReader(open(src, "rb"))
    nPages = input1.getNumPages()
    print(nPages)

    for i in range(nPages):
        print(i)
        page0 = input1.getPage(i)
        try:
            xObject = page0['/Resources']['/XObject'].getObject()
        except Exception:
            xObject = []

        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                data = xObject[obj].getData()
                try:
                    if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                        mode = "RGB"
                    elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                        mode = "CMYK"
                        # will cause errors when saving
                    else:
                        mode = "P"

                    fn = 'p%03d-%s' % (i + 1, obj[1:])
                    print('\t', fn)
                    if '/FlateDecode' in xObject[obj]['/Filter']:
                        img = Image.frombytes(mode, size, data)
                        img.save(fn + ".png")
                    elif '/DCTDecode' in xObject[obj]['/Filter']:
                        img = open(fn + ".jpg", "wb")
                        img.write(data)
                        img.close()
                    elif '/JPXDecode' in xObject[obj]['/Filter']:
                        img = open(fn + ".jp2", "wb")
                        img.write(data)
                        img.close()
                    elif '/LZWDecode' in xObject[obj]['/Filter']:
                        img = open(fn + ".tif", "wb")
                        img.write(data)
                        img.close()
                    else:
                        print('Unknown format:', xObject[obj]['/Filter'])
                except Exception:
                    traceback.print_exc()
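
Because /Filter can arrive either as a single name or as a list, it may help to normalise it once instead of testing with "in" everywhere. The helpers below are a hypothetical sketch using plain strings and lists; they assume the PDF library's name and array objects behave like str and list, which is my understanding of PyPDF2 but is not verified here.

```python
def filter_names(filter_value):
    """Return the /Filter entry as a list of names, whatever shape it has."""
    if filter_value is None:
        return []
    if isinstance(filter_value, str):
        return [filter_value]
    return [str(name) for name in filter_value]


# Filters whose encoded bytes already form a complete image file.
NATIVE_EXTENSIONS = {"/DCTDecode": ".jpg", "/JPXDecode": ".jp2"}


def native_extension(filter_value):
    """Extension to use when the raw stream can be written out unchanged."""
    names = filter_names(filter_value)
    # With a chain such as [/FlateDecode, /DCTDecode] the raw bytes are not
    # a JPEG yet, so only a single native filter qualifies.
    if len(names) == 1:
        return NATIVE_EXTENSIONS.get(names[0])
    return None


assert native_extension("/DCTDecode") == ".jpg"
assert native_extension(["/FlateDecode", "/DCTDecode"]) is None
```

A None result means the stream has to be decoded (or wrapped in a container such as TIFF) before it is saved.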

User · Answer

In Python with PyPDF2, for the CCITTFaxDecode filter:

    import PyPDF2
    import struct

    """
    Links:
    PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
    CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
    Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
    Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
    TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
    """


    def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
        # The final field is the 4-byte offset of the next IFD ('l', not 'h').
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'l'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: little endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, length
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                           262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                           0  # last IFD
                           )


    pdf_filename = 'scan.pdf'
    pdf_file = open(pdf_filename, 'rb')
    cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
    for i in range(0, cond_scan_reader.getNumPages()):
        page = cond_scan_reader.getPage(i)
        xObject = page['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                """
                The CCITTFaxDecode filter decodes image data that has been encoded using
                either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
                designed to achieve efficient compression of monochrome (1 bit per pixel) image
                data at relatively low resolutions, and so is useful only for bitmap image data, not
                for color images, grayscale images, or general data.

                K < 0 --- Pure two-dimensional encoding (Group 4)
                K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
                K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
                """
                if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                    if xObject[obj]['/DecodeParms']['/K'] == -1:
                        CCITT_group = 4
                    else:
                        CCITT_group = 3
                    width = xObject[obj]['/Width']
                    height = xObject[obj]['/Height']
                    data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                    img_size = len(data)
                    tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                    img_name = obj[1:] + '.tiff'
                    with open(img_name, 'wb') as img_file:
                        img_file.write(tiff_header + data)
                    #
                    # import io
                    # from PIL import Image
                    # im = Image.open(io.BytesIO(tiff_header + data))
    pdf_file.close()
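
A quick way to sanity-check a hand-built TIFF header like this is to read the tag directory back. The sketch below is a stand-alone illustration, not the answer's function: build_ccitt_tiff_header and read_ifd are hypothetical names, and the layout assumed is the baseline TIFF one (8-byte file header, 2-byte tag count, 12 bytes per tag, 4-byte next-IFD offset).

```python
import struct


def build_ccitt_tiff_header(width, height, img_size, group=4):
    """Minimal little-endian TIFF header with one IFD of eight tags."""
    tags = [
        (256, 4, 1, width),     # ImageWidth
        (257, 4, 1, height),    # ImageLength
        (258, 3, 1, 1),         # BitsPerSample
        (259, 3, 1, group),     # Compression: 3 = Group 3, 4 = Group 4
        (262, 3, 1, 0),         # PhotometricInterpretation: WhiteIsZero
        (273, 4, 1, 0),         # StripOffsets, patched below
        (278, 4, 1, height),    # RowsPerStrip
        (279, 4, 1, img_size),  # StripByteCounts
    ]
    header_size = 8 + 2 + 12 * len(tags) + 4
    tags[5] = (273, 4, 1, header_size)  # image data starts right after
    out = struct.pack("<2sHI", b"II", 42, 8)
    out += struct.pack("<H", len(tags))
    for tag in tags:
        out += struct.pack("<HHII", *tag)
    out += struct.pack("<I", 0)  # offset of next IFD: none
    return out


def read_ifd(header):
    """Return {tag id: value} for the first IFD of a little-endian TIFF."""
    byte_order, magic, ifd_offset = struct.unpack_from("<2sHI", header, 0)
    if byte_order != b"II" or magic != 42:
        raise ValueError("not a little-endian TIFF header")
    (count,) = struct.unpack_from("<H", header, ifd_offset)
    tags = {}
    for n in range(count):
        entry = ifd_offset + 2 + 12 * n
        tag, _type, _count, value = struct.unpack_from("<HHII", header, entry)
        tags[tag] = value
    return tags


header = build_ccitt_tiff_header(1728, 2200, 5000)
tags = read_ifd(header)
assert tags[273] == len(header)  # StripOffsets points just past the header
```

If StripOffsets does not equal the header length, a TIFF reader will start decoding the fax data at the wrong byte.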

User · Answer

I did this for my own program, and found that the best library to use was PyMuPDF. It lets you find out the "xref" numbers of each image on each page, and use them to extract the raw image data from the PDF.

    import fitz
    from PIL import Image
    import io

    filePath = "path/to/file.pdf"
    # opens doc using PyMuPDF
    doc = fitz.Document(filePath)
    # loads the first page
    page = doc.loadPage(0)
    # [First image on page described thru a list][First attribute on image list: xref n],
    # check PyMuPDF docs under getImageList()
    xref = page.getImageList()[0][0]
    # gets the image as a dict, check docs under extractImage
    baseImage = doc.extractImage(xref)
    # gets the raw string image data from the dictionary and wraps it in a
    # BytesIO object before using PIL to open it
    image = Image.open(io.BytesIO(baseImage['image']))
    # Displays image for good measure
    image.show()

Definitely check out the docs, though.

User · Answer

I added all of those together in PyPDFTK here.

My own contribution is the handling of /Indexed files, as such:

    # Fragment: xObject, pdf, img_modes, filename_prefix and i come from the
    # surrounding code.
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            color_space = xObject[obj]['/ColorSpace']
            if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
                color_space, base, hival, lookup = [v.getObject() for v in color_space]  # pg 262
            mode = img_modes[color_space]

            if xObject[obj]['/Filter'] == '/FlateDecode':
                data = xObject[obj].getData()
                img = Image.frombytes(mode, size, data)
                if color_space == '/Indexed':
                    img.putpalette(lookup.getData())
                    img = img.convert('RGB')
                img.save("{}{:04}.png".format(filename_prefix, i))

Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So we have to check the array and retrieve the indexed palette (lookup in the code) and set it in the PIL Image object; otherwise it stays uninitialized (zero) and the whole image shows as black.

My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that PNGs were smaller and looked the same.

I found those types of images when printing to PDF with Foxit Reader PDF Printer.
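
What putpalette followed by convert('RGB') achieves can be shown without any imaging library. This is a minimal sketch, assuming one byte per pixel index and a three-component (RGB) base colour space; expand_indexed is a hypothetical name and the palette values are invented.

```python
def expand_indexed(indices, lookup, components=3):
    """Map 8-bit palette indices to raw samples in the base colour space."""
    out = bytearray()
    for index in indices:
        start = index * components
        entry = lookup[start:start + components]
        if len(entry) != components:
            raise ValueError("index %d is outside the palette" % index)
        out += entry
    return bytes(out)


# Palette with three entries: black, red, green.
lookup = bytes([0, 0, 0, 255, 0, 0, 0, 255, 0])
rgb = expand_indexed(bytes([1, 2, 0]), lookup)
assert rgb == bytes([255, 0, 0, 0, 255, 0, 0, 0, 0])
```

With an all-zero lookup table every index maps to black, which matches the all-black output described above when the palette is never set.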

User · Answer

I installed ImageMagick on my server and then ran command-line calls through Popen:

    #!/usr/bin/python

    import sys
    import os
    import subprocess
    import settings

    IMAGE_PATH = os.path.join(settings.MEDIA_ROOT, 'pdf_input')

    def extract_images(pdf):
        output = 'temp.png'
        cmd = 'convert ' + os.path.join(IMAGE_PATH, pdf) + ' ' + os.path.join(IMAGE_PATH, output)
        subprocess.Popen(cmd.split(), stderr=subprocess.STDOUT, stdout=subprocess.PIPE)

This will create an image for every page and store them as temp-0.png, temp-1.png, and so on. It is only "extraction" if you have a PDF with only images and no text.

User · Answer

After some searching I found the following script, which works really well with my PDFs. It only tackles JPG, but it worked perfectly with my unprotected files. It also does not require any outside libraries.

Not to take any credit: the script originates from Ned Batchelder, and not me. Python 3 code: extract JPGs from PDFs. Quick and dirty.

    import sys

    with open(sys.argv[1], "rb") as file:
        file.seek(0)
        pdf = file.read()

    startmark = b"\xff\xd8"  # JPEG start-of-image marker
    startfix = 0
    endmark = b"\xff\xd9"    # JPEG end-of-image marker
    endfix = 2
    i = 0

    njpg = 0
    while True:
        istream = pdf.find(b"stream", i)
        if istream < 0:
            break
        istart = pdf.find(startmark, istream, istream + 20)
        if istart < 0:
            i = istream + 20
            continue
        iend = pdf.find(b"endstream", istart)
        if iend < 0:
            raise Exception("Didn't find end of stream!")
        iend = pdf.find(endmark, iend - 20)
        if iend < 0:
            raise Exception("Didn't find end of JPG!")

        istart += startfix
        iend += endfix
        print("JPG %d from %d to %d" % (njpg, istart, iend))
        jpg = pdf[istart:iend]
        with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
            jpgfile.write(jpg)

        njpg += 1
        i = iend

User · Answer

After reading the posts using PyPDF2:

The error raised by @sylvain's code, "NotImplementedError: unsupported filter /DCTDecode", must come from the method .getData(). It is solved by using ._data instead, as suggested by @Alex Paramonov.

So far I have only met "DCTDecode" cases, but I am sharing the adapted code that includes remarks from the different posts: zlib handling from @Alex Paramonov, and sub_obj['/Filter'] being a list from @mxl.

Hope it can help PyPDF2 users. The code follows:

    import sys
    import PyPDF2, traceback
    import zlib
    try:
        from PIL import Image
    except ImportError:
        import Image

    pdf_path = 'path/to/your/pdf_file.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
    nPages = input1.getNumPages()

    for i in range(nPages):
        page0 = input1.getPage(i)

        if '/XObject' in page0['/Resources']:
            try:
                xObject = page0['/Resources']['/XObject'].getObject()
            except Exception:
                xObject = []

            for obj_name in xObject:
                sub_obj = xObject[obj_name]
                if sub_obj['/Subtype'] == '/Image':
                    zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                    if zlib_compressed:
                        sub_obj._data = zlib.decompress(sub_obj._data)

                    size = (sub_obj['/Width'], sub_obj['/Height'])
                    data = sub_obj._data  # instead of sub_obj.getData()
                    try:
                        if sub_obj['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                            mode = "CMYK"
                            # will cause errors when saving (might need convert to RGB first)
                        else:
                            mode = "P"

                        fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                        if '/Filter' in sub_obj:
                            if '/FlateDecode' in sub_obj['/Filter']:
                                img = Image.frombytes(mode, size, data)
                                img.save(fn + ".png")
                            elif '/DCTDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jpg", "wb")
                                img.write(data)
                                img.close()
                            elif '/JPXDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jp2", "wb")
                                img.write(data)
                                img.close()
                            elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                                img = open(fn + ".tiff", "wb")
                                img.write(data)
                                img.close()
                            elif '/LZWDecode' in sub_obj['/Filter']:
                                img = open(fn + ".tif", "wb")
                                img.write(data)
                                img.close()
                            else:
                                print('Unknown format:', sub_obj['/Filter'])
                        else:
                            # No filter: the stream holds uncompressed samples.
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                    except Exception:
                        traceback.print_exc()
        else:
            print("No image found for page %d" % (i + 1))
