Best tool for inspecting PDF files

Question

How can I inspecting PDF files  preferable with a tool  Use case  I m trying to programmatically generate PDF files  using iText    I m having trouble achieving certain layouts  but I have PDF files with text laid out the way I want  generated from Word    I would like to reverse engineer how they do it  PDF Inspector seems to be good  but I m looking for something for Windows

User · Answer

Besides the GUI-based tools mentioned in the other answers, there are a few command line tools which can transform the original PDF source code into a different representation which lets you inspect the (now modified file) with a text editor. All of the tools below work on Linux, Mac OS X, other Unix systems or Windows.

`qpdf` (my favorite)

Use qpdf to uncompress (most) object's streams and also dissect ObjStm objects into individual indirect objects:

qpdf --qdf --object-streams=disable orig.pdf uncompressed-qpdf.pdf

qpdf describes itself as a tool that does "structural, content-preserving transformations on PDF files".

Then just open + inspect the uncompressed-qpdf.pdf file in your favorite text editor. Most of the previously compressed (and hence, binary) bytes will now be plain text.

`mutool`

There is also the mutool command line tool which comes bundled with the MuPDF PDF viewer (which is a sister product to Ghostscript, made by the same company, Artifex). The following command does also uncompress streams and makes them more easy to inspect through a text editor:

mutool clean -d orig.pdf uncompressed-mutool.pdf

`podofouncompress`

PoDoFo is an FreeSoftware/OpenSource library to work with the PDF format and it includes a few command line tools, including podofouncompress. Use it like this to uncompress PDF streams:

podofouncompress orig.pdf uncompressed-podofo.pdf

`peepdf.py`

PeePDF is a Python-based tool which helps you to explore PDF files. Its original purpose was for research and dissection of PDF-based malware, but I find it useful also to investigate the structure of completely benign PDF files.

It can be used interactively to "browse" the objects and streams contained in a PDF.

I'll not give a usage example here, but only a link to its documentation:

peepdf - PDF Analysis Tool

`pdfid.py` and `pdf-parser.py`

pdfid.py and pdf-parser.py are two PDF tools by Didier Stevens written in Python.

Their background is also to help explore malicious PDFs -- but I also find it useful to analyze the structure and contents of benign PDF files.

Here is an example how I would extract the uncompressed stream of PDF object no. 5 into a *.dump file:

pdf-parser.py -o 5 -f -d obj5.dump my.pdf

Final notes

Please note that some binary parts inside a PDF are not necessarily uncompressible (or decode-able into human readable ASCII code), because they are embedded and used in their native format inside PDFs. Such PDF parts are JPEG images, fonts or ICC color profiles.
If you compare above tools and the command line examples given, you will discover that they do NOT all produce identical outputs. The effort of comparing them for their differences in itself can help you to better understand the nature of the PDF syntax and file format.

User · Answer

If you want to work programmatically from within Python  pdfminer is a good option  It allows you to work with PDF structure in memory as an object hierarchy or serialize it as XML

User · Answer

Adobe Acrobat has a very cool but rather well hidden mode allowing you to inspect PDF files  I wrote a blog article explaining it at https   blog idrsolutions com 2009 04 viewing-pdf-objects

User · Answer

I ve used PDFBox with good success   Here s a sample of what the code looks like  back from version 0 7 2   that likely came from one of the provided examples      load the document System out println  Reading document      filename   PDDocument doc   null                                                                                                                                                                                                            doc   PDDocument load filename       look at all the document information PDDocumentInformation info   doc getDocumentInformation    COSDictionary dict   info getDictionary    List l   dict keyList    for  Object o   l          System out println o toString           dict getString o        System out println o toString           look at the document catalog PDDocumentCatalog cat   doc getDocumentCatalog    System out println  Catalog     cat    List lt PDPage gt  lp   cat getAllPages    System out println    Pages      lp size     PDPage page   lp get 4   System out println  Page      page   System out println   tCropBox      page getCropBox     System out println   tMediaBox      page getMediaBox     System out println   tResources      page getResources     System out println   tRotation      page getRotation     System out println   tArtBox      page getArtBox     System out println   tBleedBox      page getBleedBox     System out println   tContents      page getContents     System out println   tTrimBox      page getTrimBox     List lt PDAnnotation gt  la   page getAnnotations    System out println   t  Annotations      la size

User · Answer

The object viewer in Acrobat is good but Windjack Solution s PDF Canopener allows better inspection with an eyedropper for selecting objects on page  Also permits modifications to be made to PDF   http   www windjack com products pdfcanopener html

User · Answer

I use iText RUPS Reading and Updating PDF Syntax  in Linux  Since it s written in Java  it works on Windows  too  You can browse all the objects in PDF file in a tree structure  It can also decode Flate encoded streams on-the-fly to make inspecting easier   Here is a screenshot

User · Answer

PDF Analyzer is similar to PDFXplorer  but it has more options  It is also free  after a single registration    https   www winking be en products pdfanalyzer http   www o2sol com pdfxplorer overview htm

User · Answer

My sugession is Foxit PDF Reader which is very helpful to do important text editing work on pdf file

User · Answer

PDFXplorer from O2 Solutions does an outstanding job of displaying the internals   http   www o2sol com pdfxplorer overview htm   Free  distracting banner at the bottom

User · Answer

There is also another option  Adobe Acrobat Pro is also able to display the internal tree structure of the PDF     Open Preflight  Go to Options  right upper corner  Internal PDF Structure   On top Adobe Acrobat Pro can also display the internal structure of the Document Fonts in the PDF most of other  PDF tree structure viewer  don t have this otion

[pdf] Best tool for inspecting PDF files?

`qpdf` (my favorite)

`mutool`

`podofouncompress`

`peepdf.py`

`pdfid.py` and `pdf-parser.py`

Final notes

Examples related to pdf

[pdf] Best tool for inspecting PDF files?

qpdf (my favorite)

mutool

podofouncompress

peepdf.py

pdfid.py and pdf-parser.py

Final notes

Examples related to pdf

`qpdf` (my favorite)

`mutool`

`podofouncompress`

`peepdf.py`

`pdfid.py` and `pdf-parser.py`