Reading PDF content with itextsharp dll in VB.NET or C#

Question

How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of the text.

User · Answer

    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using System.IO;
    using System.Text;   // StringBuilder, Encoding

    public string ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName))
        {
            PdfReader pdfReader = new PdfReader(fileName);

            // iTextSharp page numbers start at 1
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

User · Answer

Here is an improved version of ShravankumarKumar's answer. I created special classes for the pages so you can access words in the PDF based on the text rows and the word in that row.

    using System.Collections.Generic;
    using System.Linq;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    //create a list of pdf pages
    var pages = new List<PdfPage>();

    //load the pdf into the reader. NOTE: path can also be replaced with a byte array
    using (PdfReader reader = new PdfReader(path))
    {
        //loop all the pages and extract the text
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            pages.Add(new PdfPage()
            {
                content = PdfTextExtractor.GetTextFromPage(reader, i)
            });
        }
    }

    //use linq to create the rows and words by splitting on newline and space
    pages.ForEach(x => x.rows = x.content.Split('\n').Select(y =>
        new PdfRow()
        {
            content = y,
            words = y.Split(' ').ToList()
        }
    ).ToList());

The custom classes:

    class PdfPage
    {
        public string content { get; set; }
        public List<PdfRow> rows { get; set; }
    }

    class PdfRow
    {
        public string content { get; set; }
        public List<string> words { get; set; }
    }

Now you can get a word by row and word index:

    string myWord = pages[0].rows[12].words[4];

Or use Linq to find the rows containing a specific word:

    //find the rows in a specific page containing a word
    var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

    //find the rows in all pages containing a word
    var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();
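The row and word lookup above does not depend on iTextSharp once the page text is in a string, so it can be tried on its own. Below is a minimal sketch under that assumption; `PageText`, `ToRows` and `WordAt` are hypothetical names that are not part of iTextSharp. It differs from the answer in two small ways: it trims the `\r` left behind by Windows line endings and skips blank rows and empty words, and it returns null instead of throwing when an index is out of range.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of iTextSharp: works on text that has
// already been extracted from a page.
static class PageText
{
    // Splits page text into rows and each row into words, trimming the
    // '\r' of Windows line endings and skipping blank rows and empty words.
    public static List<List<string>> ToRows(string pageText)
    {
        return pageText
            .Split('\n')
            .Select(row => row.TrimEnd('\r'))
            .Where(row => row.Trim().Length > 0)
            .Select(row => row
                .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .ToList())
            .ToList();
    }

    // Returns the word at the given zero-based row and word index,
    // or null when either index is out of range.
    public static string WordAt(List<List<string>> rows, int row, int word)
    {
        if (row < 0 || row >= rows.Count) return null;
        if (word < 0 || word >= rows[row].Count) return null;
        return rows[row][word];
    }
}
```

For example, `PageText.WordAt(PageText.ToRows(pageContent), 12, 4)` plays the role of `pages[0].rows[12].words[4]`, where `pageContent` stands for whatever string the extractor returned for that page.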

User · Answer

In my case, I just wanted the text from a specific area of the PDF document, so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools, so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.

    // Entire page - PDF coordinate system 0,0 is bottom left corner. 72 points / inch
    Rectangle pdfRect = new Rectangle(0f, 0f, 612f, 792f);
    RenderFilter renderfilter = new RegionTextRenderFilter(pdfRect);
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderfilter);
    string text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

As noted in the comments, the resulting text doesn't maintain any of the formatting found in the PDF document; however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.
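Guessing at coordinates gets easier if the region is first measured in inches from the top-left corner of the page (as it would be with a ruler on a printout) and then converted to PDF points measured from the bottom-left corner. The following is a rough sketch of that arithmetic only; `PdfRegion` and `FromTopLeftInches` are hypothetical names, and the page height is assumed to be known (792 points for the US Letter page used above).

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
static class PdfRegion
{
    const float PointsPerInch = 72f;

    // Converts a region given in inches from the TOP-left corner of the page
    // into PDF points measured from the BOTTOM-left corner.
    // Returns { llx, lly, urx, ury }, the same argument order as
    // new Rectangle(0f, 0f, 612f, 792f) in the answer above.
    public static float[] FromTopLeftInches(
        float pageHeightPoints, float leftIn, float topIn, float widthIn, float heightIn)
    {
        float llx = leftIn * PointsPerInch;
        float urx = (leftIn + widthIn) * PointsPerInch;
        float ury = pageHeightPoints - topIn * PointsPerInch;
        float lly = ury - heightIn * PointsPerInch;
        return new[] { llx, lly, urx, ury };
    }
}
```

With `var r = PdfRegion.FromTopLeftInches(792f, 1f, 1f, 2f, 0.5f);` the values can be passed on as `new Rectangle(r[0], r[1], r[2], r[3])`. Rather than hard-coding 792, the height can probably be read from the document with `pdfReader.GetPageSize(1).Height`.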

User · Answer

Here is a VB.NET solution based on ShravankumarKumar's solution. This will ONLY give you the text. The images are a different story.

    Public Shared Function GetTextFromPDF(PdfFileName As String) As String
        Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)

        Dim sOut = ""

        ' iTextSharp page numbers start at 1
        For i = 1 To oReader.NumberOfPages
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy

            sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
        Next

        Return sOut
    End Function

User · Answer

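    Imports System.IO
    Imports iTextSharp.text
    Imports iTextSharp.text.pdf

    ' Goes in the opposite direction: writes the contents of a text file into a new PDF.
    Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
        Dim sr As StreamReader = New StreamReader(sTxtfile)
        Dim doc As New Document()
        PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
        doc.Open()
        doc.Add(New Paragraph(sr.ReadToEnd()))
        doc.Close()
        sr.Close()
    End Sub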

User · Answer

LGPL / FOSS iTextSharp 4.x

    var pdfReader = new PdfReader(path); //other filestream etc
    byte[] pageContent = pdfReader.GetPageContent(pageNum); //not zero based
    byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
    string textFromPage = Encoding.UTF8.GetString(utf8);

None of the other answers were useful to me; they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.

Something else that might be very useful in conjunction with this:

    const string PdfTableFormat = @"\(.*\)Tj";
    Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);

    List<string> ExtractPdfContent(string rawPdfContent)
    {
        var matches = PdfTableRegex.Matches(rawPdfContent);

        var list = matches.Cast<Match>()
            .Select(m => m.Value
                .Substring(1) //remove leading (
                .Remove(m.Value.Length - 4) //remove trailing )Tj
                .Replace(@"\)", ")") //unencode parens
                .Replace(@"\(", "(")
                .Trim()
            )
            .ToList();
        return list;
    }

This will extract the text-only data from the PDF. If the text displayed is Foo(bar), it will be encoded in the PDF as (Foo\(bar\))Tj, and this method would return Foo(bar) as expected. This method will strip out lots of additional information, such as location coordinates, from the raw PDF content.
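To see what the Tj extraction does without opening a real PDF, it can be run against a small hand-written content stream. This is only a sketch: `TjDemo` and its `Sample` string are made up for illustration, and the pattern is assumed to be `\(.*\)Tj`, i.e. an opening parenthesis, any characters, then `)Tj`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class TjDemo
{
    // Assumed pattern: a literal "(", any characters, then ")Tj".
    static readonly Regex ShowText = new Regex(@"\(.*\)Tj", RegexOptions.Compiled);

    public static List<string> Extract(string rawPdfContent)
    {
        return ShowText.Matches(rawPdfContent)
            .Cast<Match>()
            .Select(m => m.Value
                .Substring(1, m.Value.Length - 4)  // drop leading "(" and trailing ")Tj"
                .Replace(@"\)", ")")               // unescape parentheses
                .Replace(@"\(", "(")
                .Trim())
            .ToList();
    }

    // Made-up content stream for illustration only.
    public const string Sample =
        "BT\n/F1 12 Tf\n72 720 Td\n(Hello)Tj\n0 -14 Td\n(Foo\\(bar\\))Tj\nET";
}
```

Calling `TjDemo.Extract(TjDemo.Sample)` yields the two strings `Hello` and `Foo(bar)`. Because `.*` is greedy, two Tj operators on the same line would be captured as a single match, and text written with other operators (for example TJ arrays) or as hex strings is not matched at all, so real documents may need a more careful parser.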
