How to make tesseract recognize only numbers when they are mixed with letters?

Question

I want to use tesseract to recognize only numbers. The problem is that I have a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), tesseract returns a wrong digit for every symbol. Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

User · Answer

Adding --psm 7 -c tessedit_char_whitelist=0123456789 works for me when the image contains only one line.

User · Answer

This feature is not supported in version 4. You can still use it via -c tessedit_char_whitelist=0123456789 together with --oem 0, which reverts to the old model. There is a bounty to fix this issue.

Possible workarounds, as stated by amitdo:

- Use the --oem 0 option, so that the legacy engine is used.
- Retraining or fine-tuning: #751 (comment).
- Post-processing: #751 (comment).

User · Answer

For tesseract 3, I tried creating a config file according to the FAQ. Either set the variable BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:

    tessedit_char_whitelist 0123456789

Then it works using the command:

    tesseract imagename outputbase digits
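
The config-file step can also be scripted. The following is a minimal sketch in Python, assuming you have write access to a configs directory of your choice; the function name write_digits_config and the directory passed to it are hypothetical, and only the file name (digits) and its one-line content come from the answer.

```python
import tempfile
from pathlib import Path


def write_digits_config(configs_dir, whitelist="0123456789"):
    """Write a tesseract config file named 'digits' and return its path.

    configs_dir is a hypothetical location; point it at your own
    tessdata/configs directory.
    """
    configs_dir = Path(configs_dir)
    configs_dir.mkdir(parents=True, exist_ok=True)
    config_path = configs_dir / "digits"
    # The config file holds the variable name and its value on one line.
    config_path.write_text(f"tessedit_char_whitelist {whitelist}\n")
    return config_path


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of a real tessdata tree.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_digits_config(Path(tmp) / "tessdata" / "configs")
        print(path.read_text(), end="")
```

Once the file is in place, the command stays the same as in the answer: tesseract imagename outputbase digits.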

User · Answer

Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the command line:

    tesseract image.tif outputbase nobatch digits

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.
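
One possible reading of the threshold in the question is a confidence cutoff applied after recognition. The sketch below is an assumption on my part, not something the FAQ describes: it filters tesseract's TSV output (produced with the tsv config, as in tesseract imagename outputbase tsv) by the conf column. Note that this confidence is reported per word, not per symbol, and that the function name and the default cutoff of 60 are hypothetical.

```python
import csv
import io


def words_above_confidence(tsv_text, min_conf=60.0):
    """Return (text, conf) pairs from tesseract TSV output with conf >= min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept = []
    for row in reader:
        text = (row.get("text") or "").strip()
        raw_conf = row.get("conf")
        if not text or raw_conf is None:
            # Page, block and line rows carry no text; skip them.
            continue
        conf = float(raw_conf)
        if conf >= min_conf:
            kept.append((text, conf))
    return kept


if __name__ == "__main__":
    header = ("level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
              "left\ttop\twidth\theight\tconf\ttext\n")
    sample = (header
              + "1\t1\t0\t0\t0\t0\t0\t0\t100\t30\t-1\t\n"
              + "5\t1\t1\t1\t1\t1\t5\t5\t40\t20\t96\t123\n"
              + "5\t1\t1\t1\t1\t2\t50\t5\t40\t20\t31\t8\n")
    print(words_above_confidence(sample))
```

Words below the cutoff are dropped entirely, so a suitable value has to be found by trying it on your own images.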

User · Answer

For tesseract 3, the command is simpler according to the FAQ:

    tesseract imagename outputbase digits

But it doesn't work very well for me. I tried different psm options and found that -psm 6 works best for my case. See man tesseract for details.

User · Answer

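    import pytesseract as tess  # assumption: tess refers to pytesseract

    # Digits config, LSTM engine, single text line, digit-only whitelist
    custom_oem = r'digits --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789'
    text = tess.image_to_string(croped, config=custom_oem)  # croped: the cropped input image

I am using tesseract 4.1.1. For better results, you might want to consider image processing techniques.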

User · Answer

You can instruct tesseract to use only digits, and if that is not accurate enough, the best chance of getting better results is to go through the training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract

User · Answer

What I do is recognize everything, and once I have the text, I take out all the characters except numbers.

    // This replaces everything except the digits 0 to 9
    recognizedText = recognizedText.replaceAll("[^0-9]+", " ");

This works pretty well for me.
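
The same post-processing idea can be sketched in Python. This is a minimal sketch, assuming the OCR text is already available as a string; the function name keep_digits is hypothetical, and collapsing each run of non-digits into a single space is one choice among several.

```python
import re


def keep_digits(recognized_text: str) -> str:
    """Replace every run of non-digit characters with a single space."""
    # [^0-9]+ matches one or more characters that are not ASCII digits.
    return re.sub(r"[^0-9]+", " ", recognized_text).strip()


if __name__ == "__main__":
    print(keep_digits("Invoice A12 / total: 345O"))
```

Keep in mind that this discards misread characters rather than correcting them: a zero recognized as the letter O is simply lost.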

User · Answer

If one wants to match 0-9:

    tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789

Or if one almost wants to match 0-9, but with one or more different characters:

    tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE
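
If you call tesseract from a script, the command line can be assembled programmatically. The following is a rough sketch, assuming the tesseract binary is on your PATH; the helper names build_tesseract_cmd and recognize are hypothetical, and the optional --psm flag uses the spelling from the answer that passes --psm 7 (the tesseract 3 answers spell it -psm).

```python
import subprocess


def build_tesseract_cmd(image_path, whitelist="0123456789", psm=None):
    """Build the argument list for a whitelist-restricted tesseract call."""
    cmd = ["tesseract", image_path, "stdout"]
    if psm is not None:
        # Page segmentation mode, e.g. 7 for a single text line.
        cmd += ["--psm", str(psm)]
    # -c sets a config variable: the characters tesseract may output.
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def recognize(image_path, **kwargs):
    """Run tesseract and return its stdout (needs tesseract installed)."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the command; call recognize() to actually run OCR.
    print(" ".join(build_tesseract_cmd("myimage.png", psm=7)))
```

Passing the arguments as a list avoids shell quoting problems if the whitelist ever contains characters the shell treats specially.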

User · Answer

I did it a bit differently, with tess-two. Maybe it will be useful for somebody. First you need to initialize the API:

    TessBaseAPI baseApi = new TessBaseAPI();
    baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables (the blacklist covers punctuation such as &, <, > and -, plus all letters):

    baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "&<>-ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
    baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, "0123456789");
    baseApi.setVariable("classify_bln_numeric_mode", "1");

This way the engine will check only the numbers.
