image processing to improve tesseract OCR accuracy

Question

I've been using Tesseract to convert documents into text. The quality of the documents ranges wildly, and I'm looking for tips on what sort of image processing might improve the results. I've noticed that text that is highly pixellated (for example, text generated by fax machines) is especially difficult for Tesseract to process; presumably all those jagged edges on the characters confound the shape-recognition algorithms.

What sort of image processing techniques would improve the accuracy? I've been using a Gaussian blur to smooth out the pixellated images and have seen some small improvement, but I'm hoping that there is a more specific technique that would yield better results: say, a filter tuned to black-and-white images that would smooth out irregular edges, followed by a filter that would increase the contrast to make the characters more distinct.

Any general tips for someone who is a novice at image processing?

User · Answer

I am by no means an OCR expert, but this week I needed to extract text from a JPG.

I started with a colorized, RGB, 445x747 pixel JPG. I immediately tried Tesseract on it, and the program converted almost nothing. I then went into GIMP and did the following:

- Image > Mode > Grayscale
- Image > Scale Image > 1191x2000 pixels
- Filters > Enhance > Unsharp Mask, with radius = 6.8, amount = 2.69, threshold = 0

I then saved the result as a new JPG at 100% quality. Tesseract was then able to extract all the text into a .txt file.

GIMP is your friend.
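To make the unsharp-mask step less of a black box, here is a minimal pure-Python sketch of the idea: blur the image, then push each pixel away from its blurred value. It assumes a grayscale image stored as a list of rows and uses a 3x3 box blur rather than the Gaussian blur GIMP applies, so the numbers will not match GIMP's output; the function names are made up for illustration.

```python
def box_blur(img):
    """3x3 mean blur on a grayscale image (list of rows); edges are clamped."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def unsharp_mask(img, amount=1.0, threshold=0):
    """Sharpen by adding amount * (original - blurred) to each pixel.

    Pixels whose difference from the blurred image is below `threshold`
    are left untouched, so flat areas are not amplified.
    """
    blurred = box_blur(img)
    out = []
    for row, blurred_row in zip(img, blurred):
        new_row = []
        for p, b in zip(row, blurred_row):
            diff = p - b
            if abs(diff) < threshold:
                new_row.append(p)
            else:
                value = round(p + amount * diff)
                new_row.append(int(min(max(value, 0), 255)))
        out.append(new_row)
    return out


# A soft vertical edge: the sharpened result has a darker dark side
# and a brighter bright side next to the edge.
edge = [[100, 100, 150, 150] for _ in range(3)]
sharpened = unsharp_mask(edge, amount=1.0)
```

A larger `amount` exaggerates edges more strongly, which is roughly what the 2.69 setting above does in GIMP.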

User · Answer

As a rule of thumb, I usually apply the following image pre-processing techniques using the OpenCV library. The snippets assume `img` is a BGR image, for example one loaded with cv2.imread:

    import cv2
    import numpy as np

Rescaling the image (recommended if you're working with images that have a resolution below 300 DPI):

    img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

Converting the image to grayscale:

    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

Applying dilation and erosion to remove noise (you may play with the kernel size depending on your data set):

    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

Applying blur and thresholding, which can be done with any one of the following lines. Each has its pros and cons; however, median blur and bilateral filter usually perform better than Gaussian blur:

    img = cv2.threshold(cv2.GaussianBlur(img, (5, 5), 0), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.threshold(cv2.bilateralFilter(img, 5, 75, 75), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    img = cv2.adaptiveThreshold(cv2.GaussianBlur(img, (5, 5), 0), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    img = cv2.adaptiveThreshold(cv2.bilateralFilter(img, 9, 75, 75), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    img = cv2.adaptiveThreshold(cv2.medianBlur(img, 3), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

I've recently written a fairly simple guide to Tesseract. It should enable you to write your first OCR script and clear up some hurdles I ran into where the documentation was less clear than I would have liked.

In case you'd like to check it out, here are the links:

- Getting started with Tesseract - Part I: Introduction
- Getting started with Tesseract - Part II: Image Pre-processing
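For readers wondering what the THRESH_OTSU flag is doing, the following is a rough pure-Python sketch of Otsu's method: it tries every threshold from 0 to 255 and keeps the one that best separates dark and bright pixels (maximum between-class variance). It is an illustration on a flat list of pixel values, not OpenCV's implementation, and `otsu_threshold` and `binarize` are names chosen here.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels with value <= t
    sum_bg = 0    # sum of those pixel values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Pixels strictly above t become white (255), the rest black (0)."""
    return [255 if p > t else 0 for p in pixels]


# Dark "ink" around 30-40 and bright "paper" around 200-220.
pixels = [30] * 50 + [40] * 50 + [200] * 60 + [220] * 40
t = otsu_threshold(pixels)
binary = binarize(pixels, t)
```

Because the threshold is derived from the histogram, no hand-picked cutoff is needed, which is why the OpenCV calls above pass 0 as the threshold value.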

User · Answer

Adaptive thresholding is important if the lighting is uneven across the image. My preprocessing using GraphicsMagick is described in this post: https://groups.google.com/forum/#!topic/tesseract-ocr/jONGSChLRv4

GraphicsMagick also has the -lat option for local adaptive thresholding, which I will try soon.

Another method of thresholding, using OpenCV, is described here: http://docs.opencv.org/trunk/doc/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
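The point about uneven lighting can be illustrated with a small pure-Python sketch. It compares each pixel with the mean of its neighbourhood minus a constant, which is the mean-based flavour of adaptive thresholding (not the Gaussian-weighted variant, and not GraphicsMagick's code). The image, block size, and constant below are invented for the demonstration.

```python
def adaptive_threshold(img, block=5, c=10):
    """White (255) where a pixel exceeds its local mean minus c, else black (0)."""
    h, w = len(img), len(img[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [img[yy][xx] for yy in ys for xx in xs]
            local_mean = sum(window) / len(window)
            out[y][x] = 255 if img[y][x] > local_mean - c else 0
    return out


# Background brightens from left (100) to right (190), as under uneven lighting.
img = [[100 + 10 * x for x in range(10)] for _ in range(5)]
# Two "ink" pixels, each 60 levels darker than the paper around them.
img[2][1] -= 60
img[2][8] -= 60

adaptive = adaptive_threshold(img)
# A single global cutoff turns the dim left side of the page black as well.
global_result = [[255 if p > 128 else 0 for p in row] for row in img]
```

With the local comparison only the two ink pixels end up black, whereas the global cutoff also blackens the dim part of the background.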

User · Answer

Reading text from image documents with any OCR engine involves many issues that affect accuracy. There is no fixed solution for all cases, but here are a few things to consider to improve OCR results:

1. Presence of noise due to poor image quality, or unwanted elements/blobs in the background region. This requires pre-processing operations such as noise removal, which can easily be done using a Gaussian filter or a normal median filter. These are also available in OpenCV.

2. Wrong orientation of the image. With the wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives the worst accuracy.

3. Presence of lines. While doing word or line segmentation, the OCR engine sometimes also tries to merge the words and lines together, thus processing the wrong content and giving wrong results.

There are other issues as well, but these are the basic ones.

The post "OCR application" is an example case where some image pre-processing and post-processing of the OCR result can be applied to get better OCR accuracy.
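As an illustration of the noise-removal point, here is a small pure-Python median filter. Each pixel is replaced by the median of its 3x3 neighbourhood, which removes isolated specks while leaving strokes at least two pixels wide in place. This is only a sketch on a list-of-rows grayscale image; in practice a library routine such as OpenCV's medianBlur would be used.

```python
from statistics import median


def median_filter(img, size=3):
    """Replace each pixel with the median of its size x size neighbourhood."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp coordinates so border pixels reuse the nearest edge value.
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = int(median(window))
    return out


# A white page with one isolated black speck.
speck = [[255] * 5 for _ in range(5)]
speck[2][2] = 0
cleaned = median_filter(speck)

# A two-pixel-wide vertical stroke (columns 2 and 3) on a white page.
stroke = [[0 if x in (2, 3) else 255 for x in range(6)] for _ in range(5)]
kept = median_filter(stroke)
```

Strokes only one pixel wide would be erased by the same filter, so the window size has to stay small relative to the text.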

User · Answer

Java (Android) version of Sathyaraj's C# code from this thread:

    // Resize (bilinear interpolation), then convert to grayscale and remove noise
    public Bitmap resize(Bitmap img, int newWidth, int newHeight) {
        // Allocate the output at the requested size; copying img would keep the old dimensions.
        Bitmap bmap = Bitmap.createBitmap(newWidth, newHeight, img.getConfig());

        double nWidthFactor = (double) img.getWidth() / (double) newWidth;
        double nHeightFactor = (double) img.getHeight() / (double) newHeight;

        double fx, fy, nx, ny;
        int cx, cy, fr_x, fr_y;
        int color1;
        int color2;
        int color3;
        int color4;
        // ints rather than bytes: Java bytes are signed, so values above 127 would wrap negative.
        int nRed, nGreen, nBlue;
        int bp1, bp2;

        for (int x = 0; x < bmap.getWidth(); ++x) {
            for (int y = 0; y < bmap.getHeight(); ++y) {
                // Top-left source pixel and its right/bottom neighbours (clamped at the border)
                fr_x = (int) Math.floor(x * nWidthFactor);
                fr_y = (int) Math.floor(y * nHeightFactor);
                cx = fr_x + 1;
                if (cx >= img.getWidth())
                    cx = fr_x;
                cy = fr_y + 1;
                if (cy >= img.getHeight())
                    cy = fr_y;
                // Interpolation weights
                fx = x * nWidthFactor - fr_x;
                fy = y * nHeightFactor - fr_y;
                nx = 1.0 - fx;
                ny = 1.0 - fy;

                color1 = img.getPixel(fr_x, fr_y);
                color2 = img.getPixel(cx, fr_y);
                color3 = img.getPixel(fr_x, cy);
                color4 = img.getPixel(cx, cy);

                // Blue
                bp1 = (int) (nx * Color.blue(color1) + fx * Color.blue(color2));
                bp2 = (int) (nx * Color.blue(color3) + fx * Color.blue(color4));
                nBlue = (int) (ny * (double) (bp1) + fy * (double) (bp2));

                // Green
                bp1 = (int) (nx * Color.green(color1) + fx * Color.green(color2));
                bp2 = (int) (nx * Color.green(color3) + fx * Color.green(color4));
                nGreen = (int) (ny * (double) (bp1) + fy * (double) (bp2));

                // Red
                bp1 = (int) (nx * Color.red(color1) + fx * Color.red(color2));
                bp2 = (int) (nx * Color.red(color3) + fx * Color.red(color4));
                nRed = (int) (ny * (double) (bp1) + fy * (double) (bp2));

                bmap.setPixel(x, y, Color.argb(255, nRed, nGreen, nBlue));
            }
        }

        bmap = setGrayscale(bmap);
        bmap = removeNoise(bmap);

        return bmap;
    }

    // SetGrayscale
    private Bitmap setGrayscale(Bitmap img) {
        Bitmap bmap = img.copy(img.getConfig(), true);
        int c;
        for (int i = 0; i < bmap.getWidth(); i++) {
            for (int j = 0; j < bmap.getHeight(); j++) {
                c = bmap.getPixel(i, j);
                // Weighted luma; int instead of byte to avoid signed overflow
                int gray = (int) (.299 * Color.red(c) + .587 * Color.green(c)
                        + .114 * Color.blue(c));

                bmap.setPixel(i, j, Color.argb(255, gray, gray, gray));
            }
        }
        return bmap;
    }

    // RemoveNoise
    private Bitmap removeNoise(Bitmap bmap) {
        // Push pixels darker than 162 in every channel to black...
        for (int x = 0; x < bmap.getWidth(); x++) {
            for (int y = 0; y < bmap.getHeight(); y++) {
                int pixel = bmap.getPixel(x, y);
                if (Color.red(pixel) < 162 && Color.green(pixel) < 162 && Color.blue(pixel) < 162) {
                    bmap.setPixel(x, y, Color.BLACK);
                }
            }
        }
        // ...and pixels brighter than 162 in every channel to white.
        for (int x = 0; x < bmap.getWidth(); x++) {
            for (int y = 0; y < bmap.getHeight(); y++) {
                int pixel = bmap.getPixel(x, y);
                if (Color.red(pixel) > 162 && Color.green(pixel) > 162 && Color.blue(pixel) > 162) {
                    bmap.setPixel(x, y, Color.WHITE);
                }
            }
        }
        return bmap;
    }

User · Answer

What was EXTREMELY HELPFUL to me along the way was the source code of the Capture2Text project: http://sourceforge.net/projects/capture2text/files/Capture2Text/

BTW: Kudos to its author for sharing such a painstaking algorithm.

Pay special attention to the file Capture2Text\SourceCode\leptonica_util\leptonica_util.c; that is the essence of the image preprocessing for this utility.

If you run the binaries, you can check the image before and after the transformation in the Capture2Text\Output\ folder.

P.S. The solution mentioned uses Tesseract for OCR and Leptonica for preprocessing.

User · Answer

- fix DPI (if needed); 300 DPI is the minimum
- fix text size (e.g. 12 pt should be OK)
- try to fix text lines (deskew and dewarp text)
- try to fix the illumination of the image (e.g. no dark parts of the image)
- binarize and de-noise the image

There is no universal command line that fits all cases (sometimes you need to blur and sharpen the image). But you can give TEXTCLEANER from Fred's ImageMagick Scripts a try.

If you are not a fan of the command line, you can try the open-source scantailor.sourceforge.net or the commercial bookrestorer.
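The DPI and text-size targets above translate into simple arithmetic. The sketch below assumes the usual convention of 72 points per inch and treats the point size as the nominal (em) height of the font; the helper names are invented for this example.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Factor by which to enlarge a scan so it reaches the target DPI.

    Images already at or above the target are left at their original size.
    """
    return max(1.0, target_dpi / current_dpi)


def text_height_px(point_size, dpi):
    """Nominal height in pixels of text set at point_size (1 pt = 1/72 inch)."""
    return point_size * dpi / 72


# A 150 DPI scan needs to be doubled in size to reach 300 DPI.
factor = scale_factor_for_dpi(150)
# 12 pt text at 300 DPI has a nominal height of 50 pixels.
height = text_height_px(12, 300)
```

If the text in an image is much smaller than this, upscaling before OCR is worth trying first.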

User · Answer

You can do noise reduction and then apply thresholding. Beyond that, you can play around with the configuration of the OCR engine by changing the --psm and --oem values. Try:

    --psm 5 --oem 2

You can also look at the linked page for further details.

User · Answer

Text recognition depends on a variety of factors to produce good-quality output. OCR output depends heavily on the quality of the input image. This is why every OCR engine provides guidelines regarding the quality and size of the input image; these guidelines help the OCR engine produce accurate results.

I have written a detailed article on image processing in Python. Please follow the link below for more explanation; it also includes the Python source code implementing those steps.

Please leave a comment if you have a suggestion or a better idea for improving it.

https://medium.com/cashify-engineering/improve-accuracy-of-ocr-using-image-preprocessing-8df29ec3a033

User · Answer

Three points to improve the readability of the image:

1. Resize the image with a variable height and width (multiply the image height and width by 0.5, 1 and 2).
2. Convert the image to grayscale (black and white).
3. Remove the noise pixels to make the image clearer (filter the image).

Refer to the code below.

Resize:

    public Bitmap Resize(Bitmap bmp, int newWidth, int newHeight)
    {
        Bitmap temp = (Bitmap)bmp;

        Bitmap bmap = new Bitmap(newWidth, newHeight, temp.PixelFormat);

        double nWidthFactor = (double)temp.Width / (double)newWidth;
        double nHeightFactor = (double)temp.Height / (double)newHeight;

        double fx, fy, nx, ny;
        int cx, cy, fr_x, fr_y;
        Color color1 = new Color();
        Color color2 = new Color();
        Color color3 = new Color();
        Color color4 = new Color();
        byte nRed, nGreen, nBlue;

        byte bp1, bp2;

        for (int x = 0; x < bmap.Width; ++x)
        {
            for (int y = 0; y < bmap.Height; ++y)
            {
                fr_x = (int)Math.Floor(x * nWidthFactor);
                fr_y = (int)Math.Floor(y * nHeightFactor);
                cx = fr_x + 1;
                if (cx >= temp.Width) cx = fr_x;
                cy = fr_y + 1;
                if (cy >= temp.Height) cy = fr_y;
                fx = x * nWidthFactor - fr_x;
                fy = y * nHeightFactor - fr_y;
                nx = 1.0 - fx;
                ny = 1.0 - fy;

                color1 = temp.GetPixel(fr_x, fr_y);
                color2 = temp.GetPixel(cx, fr_y);
                color3 = temp.GetPixel(fr_x, cy);
                color4 = temp.GetPixel(cx, cy);

                // Blue
                bp1 = (byte)(nx * color1.B + fx * color2.B);
                bp2 = (byte)(nx * color3.B + fx * color4.B);
                nBlue = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

                // Green
                bp1 = (byte)(nx * color1.G + fx * color2.G);
                bp2 = (byte)(nx * color3.G + fx * color4.G);
                nGreen = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

                // Red
                bp1 = (byte)(nx * color1.R + fx * color2.R);
                bp2 = (byte)(nx * color3.R + fx * color4.R);
                nRed = (byte)(ny * (double)(bp1) + fy * (double)(bp2));

                bmap.SetPixel(x, y, System.Drawing.Color.FromArgb(255, nRed, nGreen, nBlue));
            }
        }

        bmap = SetGrayscale(bmap);
        bmap = RemoveNoise(bmap);

        return bmap;
    }

SetGrayscale:

    public Bitmap SetGrayscale(Bitmap img)
    {
        Bitmap temp = (Bitmap)img;
        Bitmap bmap = (Bitmap)temp.Clone();
        Color c;
        for (int i = 0; i < bmap.Width; i++)
        {
            for (int j = 0; j < bmap.Height; j++)
            {
                c = bmap.GetPixel(i, j);
                byte gray = (byte)(.299 * c.R + .587 * c.G + .114 * c.B);

                bmap.SetPixel(i, j, Color.FromArgb(gray, gray, gray));
            }
        }
        return (Bitmap)bmap.Clone();
    }

RemoveNoise:

    public Bitmap RemoveNoise(Bitmap bmap)
    {
        for (var x = 0; x < bmap.Width; x++)
        {
            for (var y = 0; y < bmap.Height; y++)
            {
                var pixel = bmap.GetPixel(x, y);
                if (pixel.R < 162 && pixel.G < 162 && pixel.B < 162)
                    bmap.SetPixel(x, y, Color.Black);
                else if (pixel.R > 162 && pixel.G > 162 && pixel.B > 162)
                    bmap.SetPixel(x, y, Color.White);
            }
        }

        return bmap;
    }

[Images: input image, output image]
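For readers more comfortable with Python, the grayscale and noise-removal steps of the code above boil down to two per-pixel rules, sketched here on plain (R, G, B) tuples. The luma weights are written in integer form (299, 587 and 114 out of 1000) to avoid floating-point rounding; that is a choice made for this sketch, not something taken from the answer.

```python
def to_gray(rgb):
    """Weighted luma of an (R, G, B) pixel: 0.299 R + 0.587 G + 0.114 B."""
    r, g, b = rgb
    return (299 * r + 587 * g + 114 * b) // 1000


def remove_noise(rgb, cutoff=162):
    """Snap clearly dark pixels to black and clearly bright ones to white.

    Pixels that are not uniformly below or above the cutoff in every channel
    are returned unchanged, mirroring the if / else-if in the code above.
    """
    r, g, b = rgb
    if r < cutoff and g < cutoff and b < cutoff:
        return (0, 0, 0)
    if r > cutoff and g > cutoff and b > cutoff:
        return (255, 255, 255)
    return rgb


def clean_pixel(rgb):
    """Grayscale first, then the black/white snap, as in the pipeline above."""
    gray = to_gray(rgb)
    return remove_noise((gray, gray, gray))


# A washed-out dark blue becomes black; a pale yellow becomes white.
dark = clean_pixel((40, 60, 120))
pale = clean_pixel((250, 240, 180))
```

The fixed cutoff of 162 works for evenly lit scans; for uneven lighting a data-driven or local threshold is usually the safer choice.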

User · Answer

This question was asked some time ago, but this might still be useful.

My experience shows that resizing the image in memory before passing it to Tesseract sometimes helps.

Try different interpolation modes. The post https://stackoverflow.com/a/4756906/146003 helped me a lot.
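To show why the interpolation mode matters, here is a hedged pure-Python comparison of nearest-neighbour and bilinear upscaling on a grayscale image stored as a list of rows. The coordinate mapping (output position times old size over new size, rounded down) is one common convention chosen for this sketch; image libraries may align pixels differently.

```python
import math


def resize_nearest(img, new_w, new_h):
    """Each output pixel copies the source pixel it maps onto."""
    h, w = len(img), len(img[0])
    return [
        [img[y * h // new_h][x * w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


def resize_bilinear(img, new_w, new_h):
    """Each output pixel blends the four surrounding source pixels."""
    h, w = len(img), len(img[0])
    wf, hf = w / new_w, h / new_h
    out = []
    for y in range(new_h):
        y0 = int(math.floor(y * hf))
        y1 = min(y0 + 1, h - 1)      # clamp at the bottom edge
        fy = y * hf - y0
        row = []
        for x in range(new_w):
            x0 = int(math.floor(x * wf))
            x1 = min(x0 + 1, w - 1)  # clamp at the right edge
            fx = x * wf - x0
            top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
            bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
            row.append(round((1 - fy) * top + fy * bottom))
        out.append(row)
    return out


# A hard black-to-gray step, doubled in width.
step = [[0, 100]]
blocky = resize_nearest(step, 4, 1)    # [[0, 0, 100, 100]]
smooth = resize_bilinear(step, 4, 1)   # [[0, 50, 100, 100]]
```

Nearest-neighbour keeps the jagged, blocky edges of the original, while bilinear inserts intermediate values along edges; which one gives better OCR results depends on the image, hence the advice to try several modes.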

User · Answer

The Tesseract documentation contains some good details on how to improve OCR quality via image processing steps.

To some degree, Tesseract applies them automatically. It is also possible to tell Tesseract to write an intermediate image for inspection, i.e. to check how well the internal image processing works (search for tessedit_write_images in that documentation).

More importantly, the new neural network system in Tesseract 4 yields much better OCR results, both in general and especially for images with some noise. It is enabled with --oem 1, e.g.:

    tesseract --oem 1 -l deu page.png result pdf

(This example selects the German language.)

Thus, it makes sense to first test how far you get with the new Tesseract LSTM mode before applying custom image pre-processing steps.
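When trying several engine settings from a script, it can help to build the command line programmatically. The helper below is a hypothetical convenience function, not part of Tesseract; it only assembles an argument list using the --oem, --psm and -l options mentioned in this thread, in the same order as the command above, and does not run anything itself.

```python
def tesseract_command(image, output_base, lang=None, oem=None, psm=None,
                      output_format=None):
    """Build an argument list for the tesseract CLI (hypothetical helper)."""
    cmd = ["tesseract"]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if lang:
        cmd += ["-l", lang]
    cmd += [image, output_base]
    if output_format:
        # A trailing config name such as "pdf" selects the output type.
        cmd.append(output_format)
    return cmd


# Reproduces the command shown above.
cmd = tesseract_command("page.png", "result", lang="deu", oem=1,
                        output_format="pdf")
```

Assuming Tesseract is installed and on the PATH, the list can be passed to `subprocess.run(cmd, check=True)` to perform the conversion.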

User · Answer

I did the following to get good results from an image whose text is not very small:

1. Apply blur to the original image.
2. Apply adaptive threshold.
3. Apply a sharpening effect.

And if you are still not getting good results, scale the image to 150% or 200%.
