This article is the second part of “Deep learning for text detection.” In part 1, we outlined some deep learning approaches to text detection that are based on object detection.

In this part, we will outline some approaches based on image segmentation.

There are different types of image segmentation, but for text detection, we will focus on instance segmentation.

Instance segmentation

Instance segmentation is the process of classifying each pixel in an image into one of N categories while also distinguishing between pixels that belong to the same class but to different objects. For example, if two people appear in the same image, we need to tell a pixel belonging to person A apart from a pixel belonging to person B, even though both pixels have the class "person".
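
To make the distinction between class labels and instance labels concrete, here is a minimal sketch in plain Python. It assumes a tiny binary mask (1 = "text", 0 = "background") and separates it into instances by grouping connected pixels. The function name `label_instances` and the toy mask are made up for illustration, and connected-component grouping is only a simple stand-in: learned instance segmentation models separate objects in more sophisticated ways.

```python
from collections import deque


def label_instances(semantic_mask):
    """Split a binary class mask into instances using 4-connected components.

    Returns (instance_map, num_instances), where instance_map holds 0 for
    background and 1..num_instances for each separate object.
    """
    height, width = len(semantic_mask), len(semantic_mask[0])
    instance_map = [[0] * width for _ in range(height)]
    next_id = 0

    for r in range(height):
        for c in range(width):
            # Skip background pixels and pixels that already have an instance id.
            if semantic_mask[r][c] == 0 or instance_map[r][c] != 0:
                continue
            next_id += 1
            instance_map[r][c] = next_id
            queue = deque([(r, c)])
            # Breadth-first flood fill over up/down/left/right neighbours.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < height
                        and 0 <= nx < width
                        and semantic_mask[ny][nx] == 1
                        and instance_map[ny][nx] == 0
                    ):
                        instance_map[ny][nx] = next_id
                        queue.append((ny, nx))

    return instance_map, next_id


# Toy mask: every 1 has the same class, but there are three separate blobs.
semantic_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]

instance_map, num_instances = label_instances(semantic_mask)
for row in instance_map:
    print(row)
print("instances found:", num_instances)
```

In the class mask, all foreground pixels look identical. In the instance map, each blob carries its own id, which is the extra information instance segmentation provides on top of per-pixel classification.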

Instance segmentation, just like object detection, is a general-purpose method that can be used for several tasks, not just text detection. In the following sections, we’re going to look at some of the deep learning architectures that perform text detection using instance segmentation.

Two well-known instance segmentation techniques for text detection are based on fully convolutional networks (FCNs). We will detail them next.

Multi-Oriented Text Detection with Fully Convolutional Networks

An FCN is an architecture built mainly from convolution layers, without any fully connected layers. Convolution-only blocks of this kind are often used as the first stage of computer vision deep learning models. For example, VGG16 and InceptionV4 have convolutional blocks that come right after the input and just before the dense layers.
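
One practical consequence of dropping the fully connected layers is that the same weights can be applied to inputs of any size, and the output remains a spatial map rather than a fixed-length vector. The sketch below illustrates this in plain Python under simplifying assumptions: a single channel, zero padding, and two hand-picked 3x3 kernels standing in for learned weights. The names `conv2d_same`, `relu`, and `tiny_fcn` are hypothetical and do not come from any library or from the paper discussed next.

```python
def conv2d_same(image, kernel):
    """Slide a square kernel over a 2D image with zero padding.

    The output has the same height and width as the input. As in most deep
    learning libraries, the kernel is not flipped (cross-correlation).
    """
    k = len(kernel)
    pad = k // 2
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    y, x = r + i - pad, c + j - pad
                    # Positions outside the image count as zeros.
                    if 0 <= y < height and 0 <= x < width:
                        total += image[y][x] * kernel[i][j]
            output[r][c] = total
    return output


def relu(feature_map):
    """Element-wise ReLU activation."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def tiny_fcn(image, kernels):
    """Apply a stack of convolution + ReLU layers, with no dense layer."""
    feature_map = image
    for kernel in kernels:
        feature_map = relu(conv2d_same(feature_map, kernel))
    return feature_map


# Hypothetical fixed weights; a real network would learn these.
kernels = [
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
]

small = [[1.0] * 4 for _ in range(3)]  # 3 x 4 input
large = [[1.0] * 7 for _ in range(5)]  # 5 x 7 input

small_out = tiny_fcn(small, kernels)
large_out = tiny_fcn(large, kernels)
print(len(small_out), len(small_out[0]))
print(len(large_out), len(large_out[0]))
```

Both inputs pass through the same two kernels, and each output keeps the spatial size of its input. A dense layer would instead fix the input size at construction time, which is one reason convolution-only designs are convenient for per-pixel tasks such as segmentation.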

For text detection, Zhang et al., in their paper “Multi-Oriented Text Detection with Fully Convolutional Networks” [1], describe an approach based on FCNs. Their system is composed of several components, but the main one is the FCN block. The figure below shows how text detection works using their method.