Optical character recognition (OCR) is a method that helps machines recognize texts. Traditional OCR uses patterns and correlation to differentiate words from other elements. However, these techniques don’t tend to produce results with high accuracy for complex text or in-motion streams. For these tasks, you’re better served using deep learning models. In that spirit, in this article we’ll explore three deep learning models for OCR.
What is OCR?
OCR takes images that include textual elements and attempts to recognize that text. The output is a text string, and accuracy is measured as the degree of similarity between the recognized text and the text a human would be able to read from the image.
OCR is used to recognize printed text in paper documents, handwritten characters, and text elements in the physical environment, such as license plate numbers, street signs, and street numbers.
Traditional OCR algorithms are based on pattern matching, pattern recognition, or image correlation. These techniques, in a standard use case such as a document scanner, can recognize words and sentences with a very high level of accuracy.
However, in more challenging cases that include multiple font styles, camera motion blurring the image, or unclear human handwriting, legacy algorithms fall short and are gradually being replaced by deep learning methods.
Modern OCR Use Cases
OCR is an age-old technology that’s used in countless software programs and hardware devices. However, new use cases are evolving that are taking OCR capabilities to the edge:
- Parking validation — Cities and towns are using mobile OCR to automatically validate if cars are parked according to city regulations. Parking inspectors can use a mobile device with OCR to scan license plates of vehicles and check with an online database to see if they are permitted to park.
- Mobile document scanning — A variety of mobile applications allow users to take a photo of a document and convert it to text. This OCR task is more challenging than traditional document scanners, because photos have unpredictable image angles, lighting conditions, and text quality.
- Digital asset management (DAM) — Digital asset management software helps organize rich media assets such as images, videos, and animations. A key aspect of DAM systems is search-ability of rich media. By running OCR on uploaded images and video frames, DAM can make rich media searchable, and enrich it with meaningful tags.
3 Deep Learning OCR Models
Text recognition involves two steps: first, detecting and identifying a bounding box for text areas in the image, and within each text area, individual text characters. Second, identifying the characters.
To detect characters and words in images, you can use standard deep learning models, like Mask RCNN, SSD, or YOLO. However, deep learning models that are good at identifying objects in images (i.e. animals or vehicles), can find it difficult to identify text characters, and may perform worse than legacy OCR algorithms discussed above.
Therefore, specialized deep learning models have evolved to help localize and detect text in images. Here are several commonly-used models.
Convolutional-Recurrent Neural Network (CRNN)
The CRNN approach identifies words using three steps:
- A standard convolutional neural network (CNN) — the first layer breaks the image into features and is divided into “feature columns”.
- These columns are fed into a deep-bidirectional long short term memory (LSTM) cell, which provides a sequence, identifying the relationship between the characters.
- The output of the LSTM cell is fed into a transcription layer, which takes the character sequence, including redundant characters, and uses a probabilistic approach to clean the output.
Recurrent Attention Model (RAM)
The RAM model is based on the idea that when the human eye is presented with a new scene, certain parts of the image catch its attention. The eye focuses on those “glimpses” of information first and obtains information from them.
In the model, the image is cropped to different sizes around a common center, and glimpse vectors are created with prominent features from each cropped version. These glimpse vectors are flattened and passed through a “glimpse network” based on visual attention.
Glimpse vectors are then passed to a location network, which uses an RNN to predict the next part of the image to pay attention to. This location is the next input for the glimpse network. Gradually, the model explores additional parts of the image, each time performing backpropagation to see if the information from the previous glimpses is good enough to achieve a high level of accuracy.
Attention-OCR is an OCR project available on TensorFlow, and was originally designed for the problem of image captioning. It’s structured as a CRNN followed by an attention decoder.
First, the model uses convolutional network layers to extract image features. It encodes these features to strings and passes them to an RNN, followed by an attention mechanism, borrowed from the Seq2Seq machine translation model. The attention-based decoder is used to predict the text in the input image.
OCR Datasets You Can Use in Your Deep Learning Projects
When you approach a deep learning OCR project, consider using one of the following standard datasets, which can help you pre-train your model on a large number of labeled images.
- MNIST — A well-known computer vision challenge, showing one handwritten character at a time, which may be one of the 10 numerical digits (0–9). MNIST is considered an easy task for neural network classification models.
- SVHN — The Street View House Numbers dataset, extracted from Google’s street view. Digits are in assorted shapes and writing styles; however, the house number is always in the middle of the image, so text detection isn’t needed.
- Street View Text Dataset (SVT) — A more challenging OCR task involving outdoor images with noise, image artifacts, and lighting issues to deal with. The images in this dataset generally cannot be processed by legacy OCR algorithms.
Deep learning can help you significantly improve your OCR performance. Specifically, there are three highly effective models you can use.
- A convolutional-recurrent neural network (CRNN) works in layers that break images into columns, identifying relationships between characters, and then producing results.
- A recurrent attention model (RAM) achieves a high level of accuracy by “glimpsing” at the image from different perspectives, and then defining output.
- And the attention OCR model builds on the CRNN by adding an attention-based decoder. When implementing a model, you can use standard datasets like MNIST, SVHN, and SVT, as a starting point or as guidelines for building your own dataset.