[ocr] Tesseract running error

I have a problem with running tesseract-ocr engine on linux. I've downloaded RUS language data and put it to tessdata directory (/usr/local/share/tessdata). When I'm trying to run tesseract with command tesseract blob.jpg out -l rus , it displays an error:

Error opening data file /usr/local/share/tessdata/eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.

Failed loading language eng
Tesseract couldn't load any languages!

Could not initialize tesseract.

According to compiling guide, I used export TESSDATA_PREFIX='/usr/local/share/' to point my tessdata directory. Maybe I should edit any config files? Tesseract try to load 'eng' data files instead of 'rus'.

Screenshot: http://i.stack.imgur.com/I0Guc.png

This question is related to ocr tesseract

The answer is


No previous solution worked for me.

I've installed both by apt-get and manually downloading the tessdata, moved around /usr and so on and no one worked even if i exported the variable thousand times.

Finally, on a last try before start to cry i've tried to pass the path directly to the instance of Tesseract().

In Python: tr = Tesseract("/usr/local/share/tesseract-ocr/") and now it works. To clarify, im using tesserwrap module.


You can call tesseract API function from C code:

#include <tesseract/baseapi.h>
#include <tesseract/ocrclass.h>; // ETEXT_DESC

using namespace tesseract;

class TessAPI : public TessBaseAPI {
    public:
    void PrintRects(int len);
};

...
TessAPI *api = new TessAPI();
int res = api->Init(NULL, "rus");
api->SetAccuracyVSpeed(AVS_MOST_ACCURATE);
api->SetImage(data, w0, h0, bpp, stride);
api->SetRectangle(x0,y0,w0,h0);

char *text;
ETEXT_DESC monitor;
api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
printf("m.count: %s\n", monitor.count);
printf("m.progress: %s\n", monitor.progress);

api->RecognizeForChopTest(&monitor);
text = api->GetUTF8Text();
printf("text: %s\n", text);
...
api->End();

And build this code:

g++ -g -I. -I/usr/local/include -o _test test.cpp -ltesseract_api -lfreeimageplus

(i need FreeImage for picture loading)


You can grab eng.traineddata Github:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.

When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz 
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/

I'm using windows OS, I tried all solutions above and none of them work.

Finally, I install Tesseract-OCR on D drive(Where I run my python script from) instead of C drive and it works.

So, if you are using windows, run your python script in the same drive as your Tesseract-OCR.


tessdata_dir_config = r'--tessdata-dir "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"'
pytesseract.image_to_string(imgCrop,lang='eng',config=tessdata_dir_config)

C# developer working on Windows here. What works for me is simply download the file eng.traineddata from the following URL:

https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

and copy it to the following directory in my Console Application project:

[Project Directory]\bin\Debug\tessdata

I did manually create the tessdata folder above.


Add this to your code :

instance.setDatapath("C:\\somepath\\tessdata");

instance.setLanguage("eng");

I'm using Visual Studio 2017 Community Edition.
I solved this problem by making a directory called tessdata in the Debug directory of my project. Then I put the eng.traineddata file into said directory.


For Windows Users:

In Environment Variables, add a new variable in system variable with name "TESSDATA_PREFIX" and value is "C:\Program Files (x86)\Tesseract-OCR\tessdata"


I had this error too on the Windows machine.

My solution.

1) Download your language files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

For example, for eng, I downloaded all files with eng prefix.

2) Put them into tessdata directory inside of some folder. Add this folder into System Path variables as TESSDATA_PREFIX.

Result will be System env var: TESSDATA_PREFIX=D:/Java/OCR And OCR folder has tessdata with languages files.

This is a screenshot of the directory:

enter image description here


tesseract  --tessdata-dir <tessdata-folder> <image-path> stdout --oem 2 -l <lng>

In my case, the mistakes that I've made or attempts that wasn't a success.

  • I cloned the github repo and copied files from there to
    • /usr/local/share/tessdata/
    • /usr/share/tesseract-ocr/tessdata/
    • /usr/share/tessdata/
  • Used TESSDATA_PREFIX with above paths
  • sudo apt-get install tesseract-ocr-eng

First 2 attempts did not worked because, the files from git clone did not worked for the reasons that I do not know. I am not sure why #3 attempt worked for me.

Finally,

  1. I downloaded the eng.traindata file using wget
  2. Copied it to some directory
  3. Used --tessdata-dir with directory name

Take away for me is to learn the tool well & make use of it, rather than relying on package manager installation & directories


The simpliest way is to install the needed package:

sudo apt-get install tesseract-ocr-eng  #for english
sudo apt-get install tesseract-ocr-tam  #for tamil
sudo apt-get install tesseract-ocr-deu  #for deutsch (German)

As you can notice, it opens the road to others languages (i.e. tesseract-ocr-fra).