I want to use tesseract to recognize only numbers. The problem is that I have a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), tesseract returns a wrong digit for every symbol.
Can I set a threshold value so that tesseract omits the symbols with low resemblance?
NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.
What I do is recognize everything and, once I have the text, strip out every character except the digits:
// Replace every run of non-digit characters with a single space
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");
This works pretty well for me.
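For anyone doing the same thing from Python, here is a minimal sketch of that post-filtering step. The function name `digits_only` is made up for illustration; the regular expression is the same one used in the Java line above.

```python
import re


def digits_only(recognized_text: str) -> str:
    """Replace every run of non-digit characters with a single space."""
    return re.sub(r"[^0-9]+", " ", recognized_text)
```

Note that this keeps a leading or trailing space where letters were removed, so you may want to call `.split()` on the result to get the individual numbers.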
Adding "--psm 7 -c tessedit_char_whitelist=0123456789" works for me when the image contains only one line.
For tesseract 3, the command is simpler, according to the FAQ: tesseract imagename outputbase digits
But it doesn't work very well for me. I tried different psm options and found that -psm 6 works best for my case. See man tesseract for details.
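Since the best psm value depends on the image, it can help to try several in a row. Below is a rough sketch that only builds the command lines; the helper names `tesseract_cmd` and `candidate_cmds` and the list of modes are my own choices, not anything tesseract provides. It assumes tesseract 4 or later spells the option `--psm`, while tesseract 3 uses `-psm`.

```python
def tesseract_cmd(image_path, psm, whitelist="0123456789", legacy_psm_flag=False):
    """Build an argv list that prints digits-only OCR output to stdout."""
    # tesseract 3 uses "-psm"; newer versions use "--psm"
    psm_flag = "-psm" if legacy_psm_flag else "--psm"
    return [
        "tesseract", image_path, "stdout",
        psm_flag, str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def candidate_cmds(image_path, psm_modes=(6, 7, 8, 10), legacy_psm_flag=False):
    """One command per page segmentation mode, so the outputs can be compared."""
    return [
        tesseract_cmd(image_path, psm, legacy_psm_flag=legacy_psm_flag)
        for psm in psm_modes
    ]
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the `stdout` of the runs compared by eye or against a known value.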
Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the command line:
tesseract image.tif outputbase nobatch digits
As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.
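If "threshold" means dropping results that tesseract itself is unsure about, one possible approach (different from the pruning threshold mentioned above) is to filter on the confidence values in tesseract's TSV output. This is only a sketch: it assumes the TSV was produced with the `tsv` config (for example `tesseract image.tif stdout tsv`) and has `conf` and `text` columns, it works per word rather than per symbol, and both the function name `confident_words` and the default cutoff of 60 are arbitrary choices of mine.

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return the words whose confidence is at least min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row["conf"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed or truncated row
        # layout rows (page, block, line) have no text and a conf of -1
        if text and conf >= min_conf:
            kept.append(text)
    return kept
```

The right cutoff depends on your images, so it is worth printing the confidences for a few known samples before settling on a value.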
For tesseract 3, I tried creating a config file as described in the FAQ. Either set the whitelist variable BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
Then it works with the command: tesseract imagename outputbase digits
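If you would rather create that file from a script, something like the following should do it. This is a sketch under assumptions: `write_digits_config` is a hypothetical helper, you have to pass the location of your own tessdata directory, and it overwrites any digits config that is already there (version 3 packages may ship one).

```python
from pathlib import Path


def write_digits_config(tessdata_dir, whitelist="0123456789"):
    """Write <tessdata_dir>/configs/digits containing only the whitelist setting."""
    config_path = Path(tessdata_dir) / "configs" / "digits"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    # config files use "name value" pairs, one per line
    config_path.write_text(f"tessedit_char_whitelist {whitelist}\n", encoding="utf-8")
    return config_path
```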
# "tess" is the OCR wrapper module and "croped" the cropped input image
custom_oem = r'digits --oem 1 --psm 7 -c tessedit_char_whitelist=0123456789'
text = tess.image_to_string(croped, config=custom_oem)
I am using tesseract 4.1.1.
For better results, you might want to consider image processing techniques.
This feature is not supported in version 4. You can still use it via -c tessedit_char_whitelist=0123456789 with "--oem 0" which reverts to the old model.
There is a bounty to fix this issue.
Possible workarounds:
As stated by @amitdo
I did it a bit differently (with tess-two). Maybe it will be useful for somebody.
First, initialize the API:
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);
Then set the following variables:
baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");
This way the engine will look only for numbers.
You can instruct tesseract to use only digits, and if that is not accurate enough, the best chance of getting better results is to go through the training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/
If you want to match 0-9:
tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789
Or, if you want to match something close to 0-9 but with one or more different characters:
tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE
Source: Stackoverflow.com