[c#] Tesseract OCR simple example

Hi, can anyone give me a simple example of testing Tesseract OCR, preferably in C#?
I tried the demo found here. I downloaded the English dataset, unzipped it on the C drive, and modified the code as follows:

    string path = @"C:\pic\mytext.jpg";
    Bitmap image = new Bitmap(path);
    Tesseract ocr = new Tesseract();
    ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
    ocr.Init(@"C:\tessdata\", "eng", false); // To use correct tessdata
    List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
    foreach (tessnet2.Word word in result)
        Console.WriteLine("{0} : {1}", word.Confidence, word.Text);

Unfortunately the code doesn't work. The program dies at the "ocr.Init(..." line. I couldn't even catch an exception with try-catch.

I was able to run VietOCR, but that is a very large project for me to follow. I need a simple example like the one above.

Thanks

This question is related to c# ocr tesseract

The answer is


I had the same problem, and it is now resolved. My tesseract2 download contains separate folders for the 32-bit and 64-bit builds. Since my system is 64-bit, I copied the files from the 64-bit folder into the main folder ("Tesseract2") and into bin/Debug. Now my solution works fine.
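The copy step can also be scripted. This is only a rough sketch: the folder names "64" and "32" and the class name `NativeBinaryPicker` are made up for illustration, so adjust them to whatever your download actually contains. Note that it checks the bitness of the running process rather than of the operating system, because a project built for x86 still needs the 32-bit files on a 64-bit machine.

```csharp
using System;
using System.IO;

public static class NativeBinaryPicker
{
    // Returns the sub-folder holding the build that matches the process.
    // The folder names "64" and "32" are placeholders, not tessnet2 names.
    public static string PickFolder(string packageRoot, bool is64BitProcess)
    {
        return Path.Combine(packageRoot, is64BitProcess ? "64" : "32");
    }

    // Copies every file from the matching folder into targetDir
    // (for example bin/Debug), overwriting older copies.
    public static int CopyMatchingBinaries(string packageRoot, string targetDir)
    {
        string source = PickFolder(packageRoot, Environment.Is64BitProcess);
        Directory.CreateDirectory(targetDir);
        int copied = 0;
        foreach (string file in Directory.GetFiles(source))
        {
            File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)), true);
            copied++;
        }
        return copied;
    }
}
```

You would call something like `NativeBinaryPicker.CopyMatchingBinaries(@"C:\Tesseract2", @"bin\Debug")` once before the first OCR call, or do the same thing in a post-build step.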


In my case all of this worked, except that the characters were not recognized correctly.

But you need to consider these few things:

  • Use the correct tessnet2 library
  • Use the correct tessdata language version
  • tessdata should be somewhere outside your application folder, so that you can pass the full path in the Init parameter. Use ocr.Init(@"c:\tessdata", "eng", true);
  • Debugging will give you a headache. You then need to update your app.config. (I can't put the XML code here. Give me your email and I will email it to you.)

Hope this helps.


A simple example of testing Tesseract OCR in C#:

    public static string GetText(Bitmap imgsource)
    {
        var ocrtext = string.Empty;
        using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        {
            using (var img = PixConverter.ToPix(imgsource))
            {
                using (var page = engine.Process(img))
                {
                    ocrtext = page.GetText();
                }
            }
        }

        return ocrtext;
    }

Info: The tessdata folder must exist in the build output folder, bin\Debug\, because the engine is given the relative path ./tessdata.
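Since "./tessdata" is resolved against the current working directory, which is not always the folder the exe lives in, it may be safer to build an absolute path first. A minimal sketch, assuming tessdata sits next to the executable (the class name `TessdataLocator` is made up for this example):

```csharp
using System;
using System.IO;

public static class TessdataLocator
{
    // Builds an absolute path to a "tessdata" folder under baseDirectory.
    public static string Resolve(string baseDirectory)
    {
        return Path.GetFullPath(Path.Combine(baseDirectory, "tessdata"));
    }

    // Resolves tessdata next to the running executable and fails early
    // with a readable message if the folder is missing.
    public static string ResolveOrThrow()
    {
        string path = Resolve(AppDomain.CurrentDomain.BaseDirectory);
        if (!Directory.Exists(path))
            throw new DirectoryNotFoundException("tessdata folder not found at: " + path);
        return path;
    }
}
```

The result of `TessdataLocator.ResolveOrThrow()` can then be passed as the first argument of the TesseractEngine constructor in place of @"./tessdata".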


Here's a great working example project: Tesseract OCR Sample (Visual Studio) with Leptonica Preprocessing

The Tesseract OCR 3.02.02 API can be confusing, so this guides you through including the Tesseract and Leptonica DLLs in a Visual Studio C++ project, and provides a sample file which takes an image path to preprocess and OCR. The preprocessing script in Leptonica converts the input image into black and white book-like text.

Setup

To include this in your own projects, you will need to reference the header files and lib and copy the tessdata folders and dlls.

Copy the tesseract-include folder to the root folder of your project. Now click on your project in the Visual Studio Solution Explorer, and go to Project>Properties.

VC++ Directories>Include Directories:

..\tesseract-include\tesseract;..\tesseract-include\leptonica;$(IncludePath)

C/C++>Preprocessor>Preprocessor Definitions:

_CRT_SECURE_NO_WARNINGS;%(PreprocessorDefinitions)

Linker>Input>Additional Dependencies:

..\tesseract-include\libtesseract302.lib;..\tesseract-include\liblept168.lib;%(AdditionalDependencies)

Now you can include headers in your project's file:

include

include

Now copy the two dll files in tesseract-include and the tessdata folder in Debug to the Output Directory of your project.

When you initialize tesseract, you need to specify the location of the parent folder (!important) of the tessdata folder if it is not already the current directory of your executable file. You can copy my script, which assumes tessdata is installed in the executable's folder.

tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init("D:\\tessdataParentFolder\\", ...

Sample

You can compile the provided sample, which takes one command line argument of the image path to use. The preprocess() function uses Leptonica to create a black and white book-like copy of the image which makes tesseract work with 90% accuracy. The ocr() function shows the functionality of the Tesseract API to return a string output. The toClipboard() can be used to save text to clipboard on Windows. You can copy these into your own projects.


I was able to get it to work by following these instructions.

  • Download the sample code: Tesseract sample code

  • Unzip it to a new location

  • Open ~\tesseract-samples-master\src\Tesseract.Samples.sln (I used Visual Studio 2017)

  • Install the Tesseract NuGet package for that project (or uninstall and reinstall it, as I had to): NuGet Tesseract

  • Uncomment the last two meaningful lines in Tesseract.Samples.Program.cs: Console.Write("Press any key to continue . . . "); Console.ReadKey(true);

  • Run (hit F5)

  • You should see the sample's output in a Windows console window


This worked for me. I have 3-4 other PDF-to-text extractors, and if one does not work another one will ... For Tesseract in particular, this code can be used on Windows 7, 8, and Server 2008. I hope this is helpful to you.

    do
    {
    // Sleep or Pause the Thread for 1 sec, if service is running too fast...
    Thread.Sleep(millisecondsTimeout: 1000);
    Guid tempGuid = ToSeqGuid();
    string newFileName = tempGuid.ToString().Split('-')[0];
    string outputFileName = appPath + "\\pdf2png\\" + fileNameithoutExtension + "-" + newFileName +
                            ".png";
    extractor.SaveCurrentImageToFile(outputFileName, ImageFormat.Png);
    // Create text file here using Tesseract
    foreach (var file in Directory.GetFiles(appPath + "\\pdf2png"))
    {
        try
        {
            var pngFileName = Path.GetFileNameWithoutExtension(file);
            string[] myArguments =
            {
                "/C tesseract ", file,
                " " + appPath + "\\png2text\\" + pngFileName
            }; // /C makes cmd.exe exit automatically when the command completes
            string strParam = String.Join(" ", myArguments);

            var myCmdProcess = new Process();
            var theProcess = new ProcessStartInfo("cmd.exe", strParam)
            {
                CreateNoWindow = true,
                UseShellExecute = false,
                RedirectStandardOutput = true,
                RedirectStandardError = true,
                WindowStyle = ProcessWindowStyle.Minimized
            }; // Keep the cmd.exe window minimized
            myCmdProcess.StartInfo = theProcess;
            myCmdProcess.Exited += myCmdProcess_Exited;
            myCmdProcess.Start();

            //if (process)
            {
                /*
                MessageBox.Show("cmd.exe process started: " + Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                Process.EnterDebugMode();
                //ShowWindow(hWnd: process.Handle, nCmdShow: 2);
                /*
                MessageBox.Show("After EnterDebugMode() cmd.exe process Exited: " +
                                Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                myCmdProcess.WaitForExit(60000);
                /*
                MessageBox.Show("After WaitForExit() cmd.exe process Exited: " +
                                Environment.NewLine +
                                "Process Name: " + myCmdProcess.ProcessName +
                                Environment.NewLine + " Process Id: " + myCmdProcess.Id
                                + Environment.NewLine + "process.Handle: " +
                                myCmdProcess.Handle);
                */
                myCmdProcess.Refresh();
                Process.LeaveDebugMode();
                //myCmdProcess.Dispose();
                /*
                MessageBox.Show("After LeaveDebugMode() cmd.exe process Exited: " +
                                Environment.NewLine);
                */
            }


            //process.Kill();
            // Give the process time to complete its task and exit on its own
            Thread.Sleep(millisecondsTimeout: 1000);

            // The code above works fine on Windows 7, but not on Windows 8,
            // so the following block is needed as well:
            // check whether the process has not completely exited

            if (!myCmdProcess.HasExited)
            {
                //process.WaitForExit(2000); // Try to wait for exit 2 more seconds
                /*
                MessageBox.Show(" Process of cmd.exe was exited by WaitForExit(); Method " +
                                Environment.NewLine);
                */
                try
                {
                    // If not, then Kill the process
                    myCmdProcess.Kill();
                    //myCmdProcess.Dispose();
                    //if (!myCmdProcess.HasExited)
                    //{
                    //    myCmdProcess.Kill();
                    //}

                    MessageBox.Show(" Process of cmd.exe exited ( Killed ) successfully " +
                                    Environment.NewLine);
                }
                catch (System.ComponentModel.Win32Exception ex)
                {
                    MessageBox.Show(
                        " Exception: System.ComponentModel.Win32Exception " +
                        ex.ErrorCode + Environment.NewLine);
                }
                catch (NotSupportedException notSupporEx)
                {
                    MessageBox.Show(" Exception: NotSupportedException " +
                                    notSupporEx.Message +
                                    Environment.NewLine);
                }
                catch (InvalidOperationException invalidOperation)
                {
                    MessageBox.Show(
                        " Exception: InvalidOperationException " +
                        invalidOperation.Message + Environment.NewLine);
                    foreach (
                        var textFile in Directory.GetFiles(appPath + "\\png2text", "*.txt",
                            SearchOption.AllDirectories))
                    {
                        loggingInfo += textFile +
                                       " In Reading Text from generated text file by Tesseract " +
                                       Environment.NewLine;
                        strBldr.Append(File.ReadAllText(textFile));
                    }
                    // Delete text file after reading text here
                    Directory.GetFiles(appPath + "\\pdf2png").ToList().ForEach(File.Delete);
                    Directory.GetFiles(appPath + "\\png2text").ToList().ForEach(File.Delete);
                }
            }
        }
        catch (Exception exception)
        {
            MessageBox.Show(
                " Caught Exception in Generating image do{...}while{...} function " +
                Environment.NewLine + exception.Message + Environment.NewLine);
        }
    }
    // Delete png image here
    Directory.GetFiles(appPath + "\\pdf2png").ToList().ForEach(File.Delete);
    Thread.Sleep(millisecondsTimeout: 1000);
    // Read text from text file here
    foreach (var textFile in Directory.GetFiles(appPath + "\\png2text", "*.txt",
        SearchOption.AllDirectories))
    {
        loggingInfo += textFile +
                       " In Reading Text from generated text file by Tesseract " +
                       Environment.NewLine;
        strBldr.Append(File.ReadAllText(textFile));
    }
    // Delete text file after reading text here
    Directory.GetFiles(appPath + "\\png2text").ToList().ForEach(File.Delete);
    } while (extractor.GetNextImage()); // Advance image enumeration...
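If you only need the core of the code above, which is running the tesseract command line on one image, the executable can be started directly instead of through cmd.exe. This is a simplified sketch and not the code I use in production. It assumes tesseract is on the PATH, the class and method names are made up, and the quoting is naive (a path that ends in a backslash would break it).

```csharp
using System;
using System.Diagnostics;

public static class TesseractCli
{
    // Wraps a path in double quotes so that spaces survive argument parsing.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the argument string "<image> <outputBase>". tesseract appends
    // ".txt" to the output base itself.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return Quote(imagePath) + " " + Quote(outputBase);
    }

    // Runs tesseract and returns its exit code, or -1 if it had to be killed.
    public static int Run(string imagePath, string outputBase, int timeoutMs)
    {
        var info = new ProcessStartInfo("tesseract", BuildArguments(imagePath, outputBase))
        {
            CreateNoWindow = true,
            UseShellExecute = false
        };
        using (Process process = Process.Start(info))
        {
            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();         // give up on a hung run
                process.WaitForExit();  // let the kill finish
                return -1;
            }
            return process.ExitCode;
        }
    }
}
```

The output streams are deliberately not redirected here. If you set RedirectStandardOutput or RedirectStandardError to true, you also have to read those streams, otherwise the child process can block once its output buffer fills up.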

Try updating the line to:

ocr.Init(@"C:\", "eng", false); // the path here should be the parent folder of tessdata
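If Init really does expect the parent folder, as suggested above, you can derive it from wherever tessdata was unzipped instead of hard-coding it. A small sketch (the helper name `TessdataParent` is made up):

```csharp
using System;
using System.IO;

public static class TessdataParent
{
    // Returns the folder that contains the given tessdata folder,
    // tolerating a trailing separator such as in "C:\tessdata\".
    public static string ParentOf(string tessdataDir)
    {
        string trimmed = tessdataDir.TrimEnd(
            Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
        DirectoryInfo parent = Directory.GetParent(trimmed);
        if (parent == null)
            throw new ArgumentException("Path has no parent folder: " + tessdataDir);
        return parent.FullName;
    }
}
```

The returned value is what you would pass as the first argument to ocr.Init.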


OK, I found the solution here: tessnet2 fails to load (the answer given by Adam).

Apparently I was using the wrong version of tessdata. I followed the source page's instructions as they seemed to read, and that caused the problem.

It says:

Quick Tessnet2 usage

  1. Download binary here, add a reference of the assembly Tessnet2.dll to your .NET project.

  2. Download language data definition file here and put it in tessdata directory. Tessdata directory and your exe must be in the same directory.

After you download the binary and follow the link to download the language file, you see many language files, but none of them is the right version. You need to select all versions and go to the next page to find the correct one (tesseract-2.00.eng)! They should either update the binary download link to version 3 or put the version 2 language file on the first page, or at least mention in bold that this version issue is a big deal!
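A quick check before calling Init could have saved me some time. The sketch below is based on my understanding of the file layouts, so treat the file names as assumptions: version 2 language data unpacks into several files such as eng.unicharset, while version 3 data is a single eng.traineddata file. The type names are made up for this example.

```csharp
using System;
using System.IO;

public enum TessdataKind { Missing, Version2Style, Version3Style }

public static class TessdataInspector
{
    // Guesses which generation of language data a tessdata folder holds.
    public static TessdataKind Inspect(string tessdataDir, string lang)
    {
        if (!Directory.Exists(tessdataDir))
            return TessdataKind.Missing;
        // Assumption: 3.x data is a single "<lang>.traineddata" file.
        if (File.Exists(Path.Combine(tessdataDir, lang + ".traineddata")))
            return TessdataKind.Version3Style;
        // Assumption: 2.x data is a set of files that includes "<lang>.unicharset".
        if (File.Exists(Path.Combine(tessdataDir, lang + ".unicharset")))
            return TessdataKind.Version2Style;
        return TessdataKind.Missing;
    }
}
```

With tessnet2 you would expect `TessdataInspector.Inspect(@"C:\tessdata", "eng")` to return Version2Style. Anything else means the wrong download or the wrong folder.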

Anyway I found it. Thanks everyone.