[pdf] Structure of a PDF file?

For a small project I have to parse PDF files and extract a specific part of them (a simple string of characters). I'd like to use Python for this, and I've found several libraries that can do some of what I want.

But after some research, I'm wondering what the real structure of a PDF file is. Does anyone know if there is a spec or an explanation anywhere online? I found a link on Adobe's site, but it seems to be dead :(

The answer is


Here is a link to Adobe's reference material

http://www.adobe.com/devnet/pdf/pdf_reference.html

You should know though that PDF is only about presentation, not structure. Parsing will not come easy.


I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.



When I first started working with PDF, I found the PDF reference very hard to navigate. It might help you to know that the overview of the file structure is found in the Syntax chapter, and that what Adobe call the document structure is the object structure, not the file structure. That is also found in Syntax. The description of operators is hidden away in Appendix A, which is very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces, you will find them hidden in Graphics! Hopefully these pointers will help you find things more quickly than I did.

If you are using Windows, PDFTron CosEdit allows you to browse the object structure to understand it. There is a free demo available that lets you examine the file but not save it.




Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.




I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.



This may help shed a little light: (from page 11 of PDF32000.book)

PDF syntax is best understood by considering it as four parts, as shown in Figure 1:

• Objects. A PDF document is a data structure composed from a small set of basic types of data objects. Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects. Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream object.

• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level mechanism for protecting a document’s contents from unauthorized access.

• Document structure. The PDF document structure specifies how the basic object types are used to represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7, "Document Structure," describes the overall document structure; later clauses address the detailed semantics of the components.

• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of a page or other graphical entity. These instructions, while also represented as objects, are conceptually distinct from the objects that represent the document structure and are described separately. Sub-clause 7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.

Looks like navigating a PDF file will require a little more than a passing effort.
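To make the "file structure" part a bit more concrete, here is a minimal Python sketch that scans raw bytes for the landmarks of a classic PDF file: the header, the indirect objects, the trailer and the startxref pointer. The function name describe_file_structure and the tiny hand-written sample are my own illustrations, not something from the spec. It assumes an uncompressed file with a traditional cross-reference table. Files that use object streams or cross-reference streams (PDF 1.5 and later) hide most of this from a naive scan, and a real parser would follow the startxref offset rather than search with regular expressions.

```python
import re


def describe_file_structure(data: bytes) -> dict:
    """Report the file-structure landmarks found in raw PDF bytes (naive scan)."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    # Indirect objects look like "12 0 obj ... endobj".
    objects = [
        (int(num), int(gen))
        for num, gen in re.findall(rb"(?m)^(\d+) (\d+) obj\b", data)
    ]
    # The last "startxref" holds the byte offset of the cross-reference section.
    startxrefs = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "objects": objects,
        "xref_offset": int(startxrefs[-1]) if startxrefs else None,
        "has_trailer": b"trailer" in data,
    }


# A tiny PDF-shaped sample written by hand for illustration only
# (it has no pages, so a viewer may refuse to open it).
body = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)
sample = body + (
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n"
)

info = describe_file_structure(sample)
print(info)
```

For a real file, read it in binary mode (open(path, "rb").read()) and pass the bytes to the same function.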




If you want to parse PDF using Python, have a look at PDFMiner. In my opinion it is the best library for parsing PDF files to date.
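For the original goal of pulling one specific string out of a PDF, something along these lines might work. This is only a sketch: it assumes the maintained pdfminer.six fork is installed and uses its pdfminer.high_level.extract_text function. The helper names slice_between and extract_fragment, the file name and the marker strings are hypothetical.

```python
def slice_between(text: str, start: str, end: str):
    """Return the text between two markers, or None if either is missing."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = text.find(end, begin)
    if finish == -1:
        return None
    return text[begin:finish].strip()


def extract_fragment(pdf_path: str, start: str, end: str):
    """Extract all text from a PDF, then cut out the part between two markers."""
    # Imported lazily so slice_between stays usable without pdfminer.six.
    from pdfminer.high_level import extract_text

    return slice_between(extract_text(pdf_path), start, end)


# Hypothetical usage; the file name and markers are placeholders:
# print(extract_fragment("invoice.pdf", "Invoice No:", "\n"))
```

Because PDF stores layout rather than logical text, the extracted text may not come out in reading order, so the markers should be chosen from text that sits right next to the part you need.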



Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.




One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer, so I made a blank one-page WordPad document, printed it to a .pdf file, and then opened the .pdf file in Notepad.

Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.

I'm trying to make up a spreadsheet to create a PDF form from code.
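To illustrate how little is needed, here is a rough Python sketch that writes a blank one-page PDF by hand: three objects, a cross-reference table, a trailer and the startxref pointer. The function name build_minimal_pdf, the US Letter page size (612 x 792 points) and the output file name are my own assumptions. Viewers differ in how strict they are, so treat this as a starting point for experiments rather than a guaranteed-valid file.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a blank one-page PDF by hand, computing the xref offsets."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # Assumed page size: US Letter, 612 x 792 points.
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


pdf_bytes = build_minimal_pdf()
# To inspect it in a viewer or text editor (hypothetical file name):
# with open("blank.pdf", "wb") as handle:
#     handle.write(pdf_bytes)
```

The byte offsets in the cross-reference table are the fiddly part, which is why hand-editing a PDF in a text editor often produces a file that viewers have to repair.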




You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest starting with version 1.7.

On Windows I used a free tool, PDF Analyzer, to see the internal structure of PDF files. This will help your understanding when reading the reference manual.

(I'm affiliated with PDF Analyzer, no intention to promote)




To extract text from a PDF, try this on a Linux, BSD, or similar machine, or use Cygwin if you are on Windows:

pdftotext -layout some_pdf_file.pdf

A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.

Octal character codes (a backslash followed by three digits) are frequently present in the .txt file output and will look strange in text editors. These codes usually represent curly single and double quotes, bullet points, hyphens, etc. in the PDF.

To see the context where the octal codes appear, run this grep command, and keep the original PDF handy to see what character each code represents in the PDF:

grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt

This will provide a unique list of the different octal codes in the document:

grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq

To convert these octal codes to their character equivalents, a combination of grep, sed, and bc can be used; I'll post the procedure to do that soon.
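Since the question mentions Python, the conversion could also be sketched there instead of with grep, sed, and bc. The function name decode_octal_escapes is my own, and the choice of Windows-1252 (cp1252) as the byte encoding is an assumption: it maps codes such as \222 and \225 to a curly apostrophe and a bullet, but the right encoding really depends on the fonts used in the PDF.

```python
import re

# A backslash followed by three octal digits, limited to one byte (000-377).
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal_escapes(text: str, encoding: str = "cp1252") -> str:
    """Replace \\ddd octal escapes with the character they encode."""

    def replace(match):
        byte = bytes([int(match.group(1), 8)])
        # Assumption: bytes are interpreted in the given 8-bit encoding;
        # undefined bytes become U+FFFD instead of raising an error.
        return byte.decode(encoding, errors="replace")

    return OCTAL_ESCAPE.sub(replace, text)


print(decode_octal_escapes(r"\223quoted\224"))  # prints “quoted”
```

If the result looks wrong for a particular document, try a different encoding argument and compare against the original PDF.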



