Optical Character Recognition: an Introduction

Hi folks,

This week we are publishing our first article on Optical Character Recognition (OCR), a key feature in the document imaging domain (but not limited to it). In later articles we’ll go into some particularly important aspects and best practices of OCR.
But for now, let’s start with a basic introduction.

Definition

Among the many definitions of OCR, the simplest accurate one would be: Optical Character Recognition distinguishes text from non-text inside a digital image and outputs that text in encoded form.

History of OCR

The history of OCR is quite fascinating, not only because of how quickly its complexity has grown but also because of its surprisingly early beginnings.
As a technology well ahead of its time, OCR started as discreet military research (as did computers, the internet, and many other advanced technologies).
But can you believe that its first developments started around 1914, or that during the 1960s (a time when much of the general public had only just started communicating by landline telephone) some national postal services, such as the US Postal Service or the British General Post Office, were already using OCR to automatically sort our grandparents’ mail?

How does it work?

Well, the reason we need to extract text from images is that software cannot handle text unless it is encoded as text data.
We need text in encoded form so that it can be:

edited
indexed (so we can retrieve it later using text-based searches)
processed further, for more advanced refinements (such as text mining)
rendered back to us as spoken information (thanks to Text To Speech technologies and applications)

In other words, from an IT point of view « text » means characters encoded according to a character-encoding standard, such as ASCII or Unicode.
The text within an image file (i.e., the bitmap produced when a document is scanned) is « text » only for us humans, who are able to recognize it.
For almost all computer software, a bitmap containing text is nothing but a series of pixel values, the same as any other bitmap not containing text.
The exception is OCR software, which analyzes the pixel values, performs highly complex processing, and determines whether it can find « patterns » that match the ones corresponding to « text ».
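To make the difference concrete, here is a minimal Python sketch. The 5x5 glyph is hand-drawn for illustration and the name dark_pixel_count is hypothetical; nothing here comes from a real scanner or OCR engine. It simply shows the letter A once as encoded text and once as bare pixel values.

```python
# Illustrative sketch only: the glyph below is a hand-drawn 5x5 bitmap,
# not the output of any real scanner or OCR engine.

# 1) "Text" as software sees it: a sequence of encoded characters.
encoded = "A".encode("ascii")   # bytes object holding a single byte
code_point = ord("A")           # Unicode code point of the letter

# 2) The "same" letter as a scanned image: only pixel values
#    (1 = dark pixel, 0 = light pixel). Nothing here says "A".
GLYPH_A = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def dark_pixel_count(bitmap):
    """Count dark pixels: about all that generic software can say about a bitmap."""
    return sum(sum(row) for row in bitmap)


if __name__ == "__main__":
    print("encoded bytes:", list(encoded))
    print("code point:", code_point)
    print("dark pixels:", dark_pixel_count(GLYPH_A))
```

The encoded form can be searched, edited, or spoken aloud right away, while the bitmap first has to be interpreted.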

Basically, OCR makes a kind of best-guess attempt, and the result is output as encoded text.
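The « best guess » idea can be illustrated with a toy example. The sketch below is not how any particular OCR engine works; it assumes a deliberately simple technique (comparing a glyph against stored templates pixel by pixel), and the TEMPLATES table and the function names are made up for illustration.

```python
# Toy example: nearest-template matching on 3x3 bitmaps.
# TEMPLATES and the glyph shapes are invented for illustration only.

TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}


def similarity(a, b):
    """Fraction of pixel positions where two equally sized bitmaps agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def best_guess(glyph):
    """Return (character, score) for the most similar template."""
    return max(
        ((char, similarity(glyph, tmpl)) for char, tmpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )


# A damaged "T": the bottom pixel of the stem was lost during "scanning".
noisy_t = [[1, 1, 1],
           [0, 1, 0],
           [0, 0, 0]]

if __name__ == "__main__":
    char, score = best_guess(noisy_t)
    print(f"best guess: {char!r} (score {score:.2f})")
```

Note that the function always returns some character along with a score, never a certainty, which is why the result is a guess rather than a fact.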

Improving results

This is why OCR accuracy depends on many different factors:

Printed text is much easier to recognize correctly than handwritten text.
If the language/character set of the text to be recognized is known in advance and the settings are configured accordingly, OCR results are dramatically better.
The input image matters too. For instance, the page should have the correct orientation (or else use the automatic orientation detection component of the OCR software, if available), the image quality might need to be enhanced before submitting it to OCR, and so on.
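The image-preparation steps mentioned above (fixing the orientation, enhancing the image) can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the image is a list of grayscale rows with values from 0 to 255, the fixed threshold of 128 is an arbitrary choice, and binarize and rotate_180 are hypothetical helper names. Real OCR software typically uses far more sophisticated methods.

```python
# Rough preprocessing sketch. Assumptions: a grayscale image stored as a list
# of rows with values 0-255, and an arbitrary fixed threshold of 128.

def binarize(gray, threshold=128):
    """Map grayscale values to 1 (dark, i.e. ink) or 0 (light, i.e. paper)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def rotate_180(bitmap):
    """Turn an upside-down page the right way up."""
    return [row[::-1] for row in bitmap[::-1]]


if __name__ == "__main__":
    # A tiny upside-down "scan": one dark pixel on light paper.
    scan = [[250, 30],
            [200, 240]]
    cleaned = rotate_180(binarize(scan))
    print(cleaned)
```

The order of the two steps does not matter in this toy version; in practice the cleaned image is what would be handed to the OCR engine.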

In our forthcoming article on OCR, we will further explain some best practices and the various factors to consider when choosing an OCR engine (e.g., quality vs royalties, time vs hardware resources, number of supported languages, etc.).

Cheers!

Bogdan
