Optical Character Recognition with imagerExtra

Shota Ochi

2019-01-25

To do OCR with imagerExtra, you need the R package tesseract, which provides bindings to a powerful optical character recognition (OCR) engine.

If you haven’t installed tesseract yet, see its installation guide.

The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.

The ocr function performs poorly on degraded images, as shown below.

library(imagerExtra)
plot(papers, main = "Original")

OCR(papers) %>% print
[1] ""
OCR_data(papers) %>% print
[1] word       confidence bbox      
<0 rows> (or 0-length row.names)

The OCR and OCR_data functions of imagerExtra are wrappers for the ocr and ocr_data functions of tesseract.
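For comparison, the tesseract engine can also be reached without the wrappers. The following is a minimal sketch, not the wrappers' actual implementation: it assumes the image is first written to a temporary PNG file with imager's save.image, and the variable names tmp, eng, and text are ours.

```r
library(imagerExtra)  # provides the papers image used in this document

# tesseract::ocr accepts a file path, so write the image to a temporary PNG.
tmp <- tempfile(fileext = ".png")
imager::save.image(papers, tmp)

# Call tesseract directly with the English model.
eng <- tesseract::tesseract("eng")
text <- tesseract::ocr(tmp, engine = eng)
print(text)

# Remove the temporary file.
unlink(tmp)
```

The wrappers save this round trip through a file by accepting the image object directly.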

We can see that both OCR and OCR_data failed to recognize the text “Hello”.

We need to clean up the image before passing it to the OCR function.

hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")

OCR(hello) %>% print
[1] "Hello\n"
OCR_data(hello) %>% print
   word confidence       bbox
1 Hello   93.99038 8,9,118,54

We can see that the text “Hello” was recognized this time.
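The cleanup can also be written step by step, which makes it easier to inspect the intermediate image and tune each stage separately. The sketch below performs the same two calls as the one-liner above. The variable names denoised and binarized are ours, and the comments on the arguments reflect our reading of the imagerExtra documentation rather than anything stated here.

```r
library(imagerExtra)

# Step 1: DCT denoising. The second argument (0.01) is taken to be
# the assumed standard deviation of the noise.
denoised <- DenoiseDCT(papers, 0.01)

# Step 2: local adaptive thresholding. 0.1 is the sensitivity parameter,
# and range gives the value range of the input image.
binarized <- ThresholdAdaptive(denoised, 0.1, range = c(0, 1))

# Compare the three stages side by side.
layout(matrix(1:3, nrow = 1))
plot(papers, main = "Original")
plot(denoised, main = "Denoised")
plot(binarized, main = "Thresholded")
layout(1)

# Run OCR on the cleaned image.
print(OCR(binarized))
```

If the text is still not recognized on another image, the noise level and the thresholding parameter are the first values to adjust.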

Using tesseract in combination with imagerExtra enables us to extract text from degraded images.