[ocr] How to make tesseract to recognize only numbers, when they are mixed with letters?

I want to use tesseract to recognize only numbers. The problem is that I have mixture of numbers & letters and when I use SetVariable("tessedit_char_whitelist", "0123456789")
for every symbol tesseract returns wrong digit.

Can I set a threshold value so that tesseract omits the symbols with low resemblance?

NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.

This question is related to ocr tesseract

The answer is


What I do is to recognize everything, and when I have the text, I take out all the characters except numbers

//This replaces all except numbers from 0 to 9
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");

This works pretty well for me.


add "--psm 7 -c tessedit_char_whitelist=0123456789'" works for me when the image contain's only 1 line.


For tesseract 3, the command is simpler tesseract imagename outputbase digits according to the FAQ. But it doesn't work for me very well.

I turn to try different psm options and find -psm 6 works best for my case.

man tesseract for details.


Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the commandline:

tesseract image.tif outputbase nobatch digits

As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.


For tesseract 3, i try to create config file according FAQ.

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789                 

then, it works by using the command: tesseract imagename outputbase digits


custom_oem=r'digits --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789'

text = tess.image_to_string(croped,config=custom_oem)

I am using tesseract 4.1.1.

For better result you might want to consider Image processing techniques.


This feature is not supported in version 4. You can still use it via -c tessedit_char_whitelist=0123456789 with "--oem 0" which reverts to the old model.

There is a bounty to fix this issue.

Possible workarounds:

As stated by @amitdo


I made it a bit different (with tess-two). Maybe it will be useful for somebody.

So you need to initialize first the API.

TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);

Then set the following variables

baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");

In this way the engine will check only the numbers.


You can instruct tesseract to use only digits, and if that is not accurate enough then best chance of getting better results is to go trough training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/


If one want to match 0-9

tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789

Or if one almost wants to match 0-9, but with one or more different characters

tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE