In the previous articles of our series, we covered some of the deep learning techniques used in text detection, which is the first stage of an OCR system. In this article, we cover the second stage: text recognition.

Text recognition without text segmentation

Text recognition takes as input a region of an image that contains text and outputs the text written in that region. Deep learning is well suited to this task.

Before deep learning, recognizing text in images required a lot of preprocessing and postprocessing. With deep learning, a large portion of that process has been simplified because a neural network can learn discriminative features of the text directly from the image. The same network then uses these features to predict the text written in the image.

Special deep learning layers for text recognition

To perform text recognition using deep learning, we need a specific type of layer: recurrent layers.

Transformers are another option, but we will not cover them here. We may detail them in a future article.

Recurrent layers are the "R" in CRNNs (Convolutional Recurrent Neural Networks). The two main types of recurrent layers are LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
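To make the difference between the two layer types a bit more concrete, here is a small arithmetic sketch that counts trainable weights. It relies on the standard formulations, where an LSTM layer has four weight blocks (input gate, forget gate, output gate, candidate cell state) and a GRU layer has three (reset gate, update gate, candidate hidden state). The function name, the layer sizes, and the choice of a single bias vector per block are assumptions made for illustration; some frameworks keep two bias vectors per block, which changes the totals slightly.

```python
def recurrent_layer_params(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Count the trainable weights of one recurrent layer.

    Each block (a gate or a candidate state) has an input weight matrix,
    a recurrent weight matrix and, by assumption here, one bias vector.
    """
    per_block = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    return num_blocks * per_block


LSTM_BLOCKS = 4  # input gate, forget gate, output gate, candidate cell state
GRU_BLOCKS = 3   # reset gate, update gate, candidate hidden state

# Hypothetical sizes: 512 input features per time step, 256 hidden units.
input_size, hidden_size = 512, 256

lstm_params = recurrent_layer_params(input_size, hidden_size, LSTM_BLOCKS)
gru_params = recurrent_layer_params(input_size, hidden_size, GRU_BLOCKS)

print(lstm_params)  # 787456
print(gru_params)   # 590592
```

Under these assumptions, a GRU layer carries three quarters of the weights of an LSTM layer of the same size.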

A neural network architecture used for performing text recognition will generally be composed of two parts: a convolutional feature extractor and a recurrent decoder.

As in many computer vision tasks, we use convolutional layers to extract features from images. The outputs of these layers are feature maps. These feature maps are then used as an input to recurrent layers such as LSTM.
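The two-part structure described above can be sketched in PyTorch as follows. This is a minimal illustration and not a reference implementation: the class name TinyCRNN, the layer sizes, the 32x128 grayscale input, and the 37-symbol character set are all assumptions chosen to keep the example short. The key step is the reshape in forward, where each column of the feature map becomes one time step for the LSTM.

```python
import torch
from torch import nn


class TinyCRNN(nn.Module):
    """Hypothetical minimal CRNN: convolutional feature extractor + LSTM."""

    def __init__(self, num_classes: int, img_height: int = 32, hidden_size: int = 64):
        super().__init__()
        # Convolutional part: two conv/pool stages, each halving height and width.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feature_height = img_height // 4  # height left after two 2x poolings
        # Recurrent part: reads the feature map column by column.
        self.rnn = nn.LSTM(
            input_size=64 * feature_height,
            hidden_size=hidden_size,
            batch_first=True,
        )
        # Per-time-step scores over the character set.
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        fmap = self.features(images)  # (batch, channels, height, width)
        b, c, h, w = fmap.shape
        # Treat each column of the feature map as one time step.
        seq = fmap.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (batch, width, c*h)
        out, _ = self.rnn(seq)  # (batch, width, hidden_size)
        return self.classifier(out)  # (batch, width, num_classes)


model = TinyCRNN(num_classes=37)    # assumed character set of 37 symbols
batch = torch.randn(2, 1, 32, 128)  # two random grayscale crops, 32x128 pixels
logits = model(batch)
print(logits.shape)  # torch.Size([2, 32, 37])
```

Each of the 32 output positions holds scores over the character set. Turning those scores into a final string requires a decoding step that this sketch leaves out.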

LSTMs are widely used in NLP (Natural Language Processing), where they have been shown to be effective at learning sequential patterns such as the ones we find in text. So it’s no surprise that they are also effective at learning from features extracted from images that contain text.

The figure below shows an example of such a text recognition system, with convolutional layers followed by recurrent layers.