Reading PDF documents in Net

Question

Is there an open source library that will help me with reading parsing PDF documents in  Net C

User · Accepted Answer

Since this question was last answered in 2008  iTextSharp has improved their api dramatically  If you download the latest version of their api from http   sourceforge net projects itextsharp   you can use the following snippet of code to extract all text from a pdf into a string   using iTextSharp text pdf  using iTextSharp text pdf parser   namespace PdfParser       public static class PdfTextExtractor               public static string pdfText string path                        PdfReader reader   new PdfReader path               string text   string Empty              for int page   1  page  lt   reader NumberOfPages  page                                  text    PdfTextExtractor GetTextFromPage reader page                             reader Close                return text

User · Answer

public string ReadPdfFile object Filename  DataTable ReadLibray        PdfReader reader2   new PdfReader  string Filename       string strText   string Empty       for  int page   1  page  lt   reader2 NumberOfPages  page              ITextExtractionStrategy its   new iTextSharp text pdf parser SimpleTextExtractionStrategy        PdfReader reader   new PdfReader  string Filename       String s   PdfTextExtractor GetTextFromPage reader  page  its        s   Encoding UTF8 GetString ASCIIEncoding Convert Encoding Default  Encoding UTF8  Encoding Default GetBytes s         strText   strText   s      reader Close              return strText

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

public string ReadPdfFile object Filename  DataTable ReadLibray        PdfReader reader2   new PdfReader  string Filename       string strText   string Empty       for  int page   1  page  lt   reader2 NumberOfPages  page              ITextExtractionStrategy its   new iTextSharp text pdf parser SimpleTextExtractionStrategy        PdfReader reader   new PdfReader  string Filename       String s   PdfTextExtractor GetTextFromPage reader  page  its        s   Encoding UTF8 GetString ASCIIEncoding Convert Encoding Default  Encoding UTF8  Encoding Default GetBytes s         strText   strText   s      reader Close              return strText

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

Have a look at Docotic Pdf library  It does not require you to make source code of your application open  like iTextSharp with viral AGPL 3 license  for example   Docotic Pdf can be used to read PDF files and extract text with or without formatting  Please have a look at the article that shows how to extract text from PDFs  Disclaimer  I work for Bit Miracle  vendor of the library

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

Have a look at Docotic Pdf library  It does not require you to make source code of your application open  like iTextSharp with viral AGPL 3 license  for example   Docotic Pdf can be used to read PDF files and extract text with or without formatting  Please have a look at the article that shows how to extract text from PDFs  Disclaimer  I work for Bit Miracle  vendor of the library

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

Have a look at Docotic Pdf library  It does not require you to make source code of your application open  like iTextSharp with viral AGPL 3 license  for example   Docotic Pdf can be used to read PDF files and extract text with or without formatting  Please have a look at the article that shows how to extract text from PDFs  Disclaimer  I work for Bit Miracle  vendor of the library

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

Have a look at Docotic Pdf library  It does not require you to make source code of your application open  like iTextSharp with viral AGPL 3 license  for example   Docotic Pdf can be used to read PDF files and extract text with or without formatting  Please have a look at the article that shows how to extract text from PDFs  Disclaimer  I work for Bit Miracle  vendor of the library

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

public string ReadPdfFile object Filename  DataTable ReadLibray        PdfReader reader2   new PdfReader  string Filename       string strText   string Empty       for  int page   1  page  lt   reader2 NumberOfPages  page              ITextExtractionStrategy its   new iTextSharp text pdf parser SimpleTextExtractionStrategy        PdfReader reader   new PdfReader  string Filename       String s   PdfTextExtractor GetTextFromPage reader  page  its        s   Encoding UTF8 GetString ASCIIEncoding Convert Encoding Default  Encoding UTF8  Encoding Default GetBytes s         strText   strText   s      reader Close              return strText

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

public string ReadPdfFile object Filename  DataTable ReadLibray        PdfReader reader2   new PdfReader  string Filename       string strText   string Empty       for  int page   1  page  lt   reader2 NumberOfPages  page              ITextExtractionStrategy its   new iTextSharp text pdf parser SimpleTextExtractionStrategy        PdfReader reader   new PdfReader  string Filename       String s   PdfTextExtractor GetTextFromPage reader  page  its        s   Encoding UTF8 GetString ASCIIEncoding Convert Encoding Default  Encoding UTF8  Encoding Default GetBytes s         strText   strText   s      reader Close              return strText

User · Answer

iText is the best library I know  Originally written in Java  there is a  NET port as well   See http   www ujihara jp iTextdotNET en

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

itext   http   www itextpdf com terms-of-use index php  Guide  http   www vogella com articles JavaPDF article html

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

itext   http   www itextpdf com terms-of-use index php  Guide  http   www vogella com articles JavaPDF article html

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

itext   http   www itextpdf com terms-of-use index php  Guide  http   www vogella com articles JavaPDF article html

User · Answer

PDFClown might help  but I would not recommend it for a big or heavy use application

User · Answer

itext   http   www itextpdf com terms-of-use index php  Guide  http   www vogella com articles JavaPDF article html

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

iTextSharp is the best bet  Used it to make a spider for lucene Net so that it could crawl PDF   using System  using System IO  using iTextSharp text pdf  using System Text RegularExpressions   namespace Spider Utils            lt summary gt          Parses a PDF file and extracts the text from it           lt  summary gt      public class PDFParser                   BT   Beginning of a text object operator              ET   End of a text object operator             Td move to the start of next line              5 Ts   superscript             -5 Ts   subscript           region Fields           region  numberOfCharsToKeep              lt summary gt              The number of characters to keep  when extracting text               lt  summary gt          private static int  numberOfCharsToKeep   15           endregion           endregion           region ExtractText              lt summary gt              Extracts a text from a PDF file               lt  summary gt               lt param name  inFileName  gt the full path to the pdf file  lt  param gt               lt param name  outFileName  gt the output file name  lt  param gt               lt returns gt the extracted text lt  returns gt          public bool ExtractText string inFileName  string outFileName                        StreamWriter outFile   null              try                                  Create a reader for the given PDF file                 PdfReader reader   new PdfReader inFileName                     outFile   File CreateText outFileName                   outFile   new StreamWriter outFileName  false  System Text Encoding UTF8                    Console Write  Processing                       int totalLen   68                  float charUnit     float totalLen     float reader NumberOfPages                  int totalWritten   0                  float curUnit   0                   for  int page   1  page  lt   reader NumberOfPages  page                                          outFile Write ExtractTextFromPDFBytes reader GetPageContent page                                   Write the progress                      if  charUnit  gt   1 0f                                                for  int i   0  i  lt   int charUnit  i                                                          Console Write                                   totalWritten                                                                        else                                               curUnit    charUnit                          if  curUnit  gt   1 0f                                                        for  int i   0  i  lt   int curUnit  i                                                                  Console Write                                       totalWritten                                                              curUnit   0                                                                                      if  totalWritten  lt  totalLen                                        for  int i   0  i  lt   totalLen - totalWritten   i                                                  Console Write                                                               return true                            catch                               return false                            finally                               if  outFile    null  outFile Close                                     endregion           region ExtractTextFromPDFBytes              lt summary gt              This method processes an uncompressed Adobe  text  object              and extracts text               lt  summary gt               lt param name  input  gt uncompressed lt  param gt               lt returns gt  lt  returns gt          public string ExtractTextFromPDFBytes byte   input                        if  input    null    input Length    0  return                  try                               string resultString                           Flag showing if we are we currently inside a text object                 bool inTextObject   false                      Flag showing if the next character is literal                     e g       to get a     character or      to get                     bool nextLiteral   false                         Bracket nesting level  Text appears inside                    int bracketDepth   0                      Keep previous chars to get extract numbers etc                   char   previousCharacters   new char  numberOfCharsToKeep                   for  int j   0  j  lt   numberOfCharsToKeep  j    previousCharacters j                           for  int i   0  i  lt  input Length  i                                          char c    char input i                       if  input i     213                          c       ToCharArray   0                        if  inTextObject                                                   Position the text                         if  bracketDepth    0                                                        if  CheckToken new string      TD    Td     previousCharacters                                                                 resultString      n r                                                             else                                                               if  CheckToken new string           T            previousCharacters                                                                         resultString      n                                                                     else                                                                       if  CheckToken new string      Tj     previousCharacters                                                                                 resultString                                                                                                                                                                     End of a text object  also go to a new line                          if  bracketDepth    0  amp  amp                              CheckToken new string      ET     previousCharacters                                                          inTextObject   false                              resultString                                                           else                                                          Start outputting text                             if   c          amp  amp   bracketDepth    0   amp  amp    nextLiteral                                                                 bracketDepth   1                                                            else                                                                  Stop outputting text                                 if   c          amp  amp   bracketDepth    1   amp  amp    nextLiteral                                                                         bracketDepth   0                                                                    else                                                                          Just a normal text character                                      if  bracketDepth    1                                                                                   Only print out next character no matter what                                              Do not interpret                                          if  c          amp  amp   nextLiteral                                                                                        resultString    c ToString                                                nextLiteral   true                                                                                    else                                                                                       if    c  gt         amp  amp   c  lt                                                              c  gt   128   amp  amp   c  lt  255                                                                                                  resultString    c ToString                                                                                               nextLiteral   false                                                                                                                                                                                                                          Store the recent characters for                         when we have to go back for a checking                     for  int j   0  j  lt   numberOfCharsToKeep - 1  j                                                  previousCharacters j    previousCharacters j   1                                             previousCharacters  numberOfCharsToKeep - 1    c                          Start of a text object                     if   inTextObject  amp  amp  CheckToken new string      BT     previousCharacters                                                 inTextObject   true                                                           return CleanupContent resultString                             catch                               return                                      private string CleanupContent string text                        string   patterns                           226       222       223       224       340       342       344       300       302       304       351       350       352       353       311       310       312       313       362       364       366       322       324       326       354       356       357       314       316       317       347       307       371       373       374       331       333       334       256       231       253       273       251       221                string   replace                           -                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       for  int i   0  i  lt  patterns Length  i                                  string regExPattern   patterns i                   Regex regex   new Regex regExPattern  RegexOptions IgnoreCase                   text   regex Replace text  replace i                               return text                      endregion           region CheckToken              lt summary gt              Check if a certain 2 character token just came along  e g  BT               lt  summary gt               lt param name  tokens  gt the searched token lt  param gt               lt param name  recent  gt the recent character array lt  param gt               lt returns gt  lt  returns gt          private bool CheckToken string   tokens  char   recent                        foreach  string token in tokens                                if   recent  numberOfCharsToKeep - 3     token 0    amp  amp                       recent  numberOfCharsToKeep - 2     token 1    amp  amp                        recent  numberOfCharsToKeep - 1                                  recent  numberOfCharsToKeep - 1     0x0d                          recent  numberOfCharsToKeep - 1     0x0a    amp  amp                        recent  numberOfCharsToKeep - 4                                  recent  numberOfCharsToKeep - 4     0x0d                          recent  numberOfCharsToKeep - 4     0x0a                                                               return true                                              return false                     endregion

User · Answer

http   www c-sharpcorner com UploadFile psingh PDFFileGenerator12062005235236PM PDFFileGenerator aspx is open source and may be a good starting point for you

User · Answer

There is also LibHaru  http   libharu org wiki Main Page

User · Answer

You could look into this  http   www codeproject com KB showcase pdfrasterizer aspx It s not completely free  but it looks very nice   Alex

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

User · Answer

aspose pdf works pretty well  then again  you have to pay for it

[c#] Reading PDF documents in .Net

Examples related to c#

Examples related to .net

Examples related to pdf