[c#] Reading PDF content with itextsharp dll in VB.NET or C#

How can I read PDF content with the itextsharp with the Pdfreader class. My PDF may include Plain text or Images of the text.

This question is related to c# vb.net pdf itextsharp

The answer is


In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.

Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire page - PDF coordinate system 0,0 is bottom left corner.  72 points / inch
RenderFilter _renderfilter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);

As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.


Here is a VB.NET solution based on ShravankumarKumar's solution.

This will ONLY give you the text. The images are a different story.

Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)

    Dim sOut = ""

    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy

        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next

    Return sOut
End Function

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);

        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

LGPL / FOSS iTextSharp 4.x

var pdfReader = new PdfReader(path); //other filestream etc
byte[] pageContent = _pdfReader .GetPageContent(pageNum); //not zero based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);

None of the other answers were useful to me, they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.

Something else that might be very useful in conjunction with this:

const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);

List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);

    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1) //remove leading (
            .Remove(m.Value.Length - 4) //remove trailing )Tj
            .Replace(@"\)", ")") //unencode parens
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}

This will extract the text-only data from the PDF if the text displayed is Foo(bar) it will be encoded in the PDF as (Foo\(bar\))Tj, this method would return Foo(bar) as expected. This method will strip out lots of additional information such as location coordinates from the raw pdf content.


Here an improved answer of ShravankumarKumar. I created special classes for the pages so you can access words in the pdf based on the text rows and the word in that row.

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

//create a list of pdf pages
var pages = new List<PdfPage>();

//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
    //loop all the pages and extract the text
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        pages.Add(new PdfPage()
        {
           content = PdfTextExtractor.GetTextFromPage(reader, i)
        });
    }
}

//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y => 
    new PdfRow() { 
       content = y,
       words = y.Split(' ').ToList()
    }
).ToList());

The custom classes

class PdfPage
{
    public string content { get; set; }
    public List<PdfRow> rows { get; set; }
}


class PdfRow
{
    public string content { get; set; }
    public List<string> words { get; set; }
}

Now you can get a word by row and word index.

string myWord = pages[0].rows[12].words[4];

Or use Linq to find the rows containing a specific word.

//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

//find the rows in all pages containing a word
var myRows = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();

Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
        Dim sr As StreamReader = New StreamReader(sTxtfile)
    Dim doc As New Document()
    PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
    doc.Open()
    doc.Add(New Paragraph(sr.ReadToEnd()))
    doc.Close()
End Sub

Examples related to c#

How can I convert this one line of ActionScript to C#? Microsoft Advertising SDK doesn't deliverer ads How to use a global array in C#? How to correctly write async method? C# - insert values from file into two arrays Uploading into folder in FTP? Are these methods thread safe? dotnet ef not found in .NET Core 3 HTTP Error 500.30 - ANCM In-Process Start Failure Best way to "push" into C# array

Examples related to vb.net

How to get parameter value for date/time column from empty MaskedTextBox HTTP 415 unsupported media type error when calling Web API 2 endpoint variable is not declared it may be inaccessible due to its protection level Differences Between vbLf, vbCrLf & vbCr Constants Simple working Example of json.net in VB.net How to open up a form from another form in VB.NET? Delete a row in DataGridView Control in VB.NET How to get cell value from DataGridView in VB.Net? Set default format of datetimepicker as dd-MM-yyyy How to configure SMTP settings in web.config

Examples related to pdf

ImageMagick security policy 'PDF' blocking conversion How to extract table as text from the PDF using Python? Extract a page from a pdf as a jpeg How can I read pdf in python? Generating a PDF file from React Components Extract Data from PDF and Add to Worksheet How to extract text from a PDF file? How to download PDF automatically using js? Download pdf file using jquery ajax Generate PDF from HTML using pdfMake in Angularjs

Examples related to itextsharp

How to convert HTML to PDF using iTextSharp Add Header and Footer for PDF using iTextsharp how to set width for PdfPCell in ItextSharp Merging multiple PDFs using iTextSharp in c#.net Adding an image to a PDF using iTextSharp and scale it properly ITextSharp HTML to PDF? Reading PDF content with itextsharp dll in VB.NET or C# How to return PDF to browser in MVC? Convert HTML to PDF in .NET