Hi folks,

This week we continue with the Optical Character Recognition (OCR) topic by sharing a few tips for the pre-scan and post-scan stages of document archiving.
A properly done OCR task is not simply about text extraction; it also implies a set of operations meant to optimize the OCR process and increase efficiency in overall document-management practices.
In other words, operations commonly considered “adjacent” can significantly improve or completely ruin text recognition, making your life either comfortable or a living hell.
Here are just a few things to keep in mind:

OCR optimization before scanning

  • when placing the paper in the scanner, make sure the pages have the correct text orientation. This way, you won’t waste time later, either waiting for the OCR software to detect the orientation automatically or, even worse, fixing it manually by checking file by file.
  • adjust the scan settings to ensure the best quality for OCR. For example, a resolution of 300 dpi or higher is generally recommended for most documents.
  • test the OCR output on a few pages before starting a batch scan to make sure your settings are properly tuned.
  • select a lossless file format (such as TIFF) and do not be afraid of large file sizes if the documents are important to you: storage space is not an issue these days and you can later convert the files to any other format for handling (or sharing) purposes.
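
To get a feel for what resolution and color depth mean for file size, here is a rough back-of-the-envelope sketch in Python. The function name and the choice of a US Letter page are our own assumptions for illustration, and the numbers are for raw, uncompressed pixel data only; real TIFF files are usually smaller once compression (LZW or CCITT Group 4, for instance) is applied.

```python
import math


def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumption: each pixel row is padded to a whole number of bytes.
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return bytes_per_row * height_px


# US Letter page (8.5 x 11 inches) scanned at 300 dpi
for label, bpp in [("black and white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_scan_bytes(8.5, 11, 300, bpp)
    print(f"{label:>15}: {size / 1_000_000:.1f} MB")
```

At 300 dpi this comes to roughly 1 MB per page in black and white, 8 MB in grayscale, and 25 MB in color before compression, which is worth knowing before you launch a batch of a few thousand pages.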

03/24/2020 edit:

Actually, storage space (especially online) does matter now, for environmental reasons, as we mentioned in this article.
And since last week, with a big chunk of the world’s population working from home on sometimes sketchy internet connections, it makes sense not to upload, download, or store huge files if we can make them smaller. Netflix, YouTube, and now Facebook have reduced their video quality in Europe for the next 30 days to avoid straining the Internet.

For important document archives, the best idea may be to store the original files in TIFF format, move them to an external storage device or medium (external hard disk, DVD, etc.), and use for everyday work a duplicate archive containing files converted to a format that you consider optimal for your needs (JBIG2, PDF, etc.).
To a certain extent, this approach is similar to how the camera RAW format works for professionals in digital photography.
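
If you keep a master archive and a working archive side by side, a small script can tell you which working copies are missing or out of date. The sketch below is only one possible way to do it, using nothing but the Python standard library; the function name, the folder layout, and the `.pdf` target extension are assumptions made for the example. It only builds a to-do list: the conversion itself is left to whatever tool you use.

```python
from pathlib import Path


def plan_working_copies(master_dir, working_dir, target_ext=".pdf"):
    """Return (master, working) path pairs that still need converting.

    A pair is listed when the working copy is missing or older than its
    TIFF master. Nothing is converted, moved, or deleted here.
    """
    master_dir, working_dir = Path(master_dir), Path(working_dir)
    todo = []
    for master in sorted(master_dir.rglob("*")):
        if not master.is_file() or master.suffix.lower() not in {".tif", ".tiff"}:
            continue
        # Mirror the master's relative path, swapping only the extension.
        relative = master.relative_to(master_dir).with_suffix(target_ext)
        working = working_dir / relative
        if not working.exists() or working.stat().st_mtime < master.stat().st_mtime:
            todo.append((master, working))
    return todo
```

Calling `plan_working_copies("masters", "working")` (both folder names are hypothetical) returns the list of files to convert, so the TIFF masters themselves are never touched.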

OCR optimization after scanning

  • use relevant filenames for the resulting files, and do not mind if they become lengthy. This is easy to do with automated file-naming tools and, even if it takes a bit more of your time at the file-creation stage, it can be a real lifesaver later. Also make sure that the filename contains important data, such as the language of the text, to name just one detail that matters for OCR.
  • do not hesitate to use image enhancement techniques. The quality of the paper documents cannot always be controlled, and the particulars of your scanner can influence output quality (just one example among dozens: tiny scratches on the scanner’s glass).
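
To give an idea of what one such technique does, here is a minimal pure-Python sketch of a 3x3 median filter on a grayscale image stored as a list of rows. This is a toy illustration under our own assumptions (the function name, the list-of-lists image, and the clipped windows at the borders are all choices made for the example), not how any particular imaging product implements it.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Return a filtered copy of a grayscale image (rows of 0-255 ints).

    Each pixel is replaced by the median of its 3x3 neighborhood.
    Windows are clipped at the image borders.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value that exists in the window.
            row.append(median_low(window))
        result.append(row)
    return result


# A light page (value 200) with a single dark speck (value 0) in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
print(cleaned[2][2])  # 200: the isolated speck is gone
```

A filter like this only handles isolated specks; longer scratches, skewed pages, and poor contrast are separate defects.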

To overcome such problems, professional document imaging software vendors provide their users with a wide range of image correction features.
In these posts you will find some explanations on brightness/contrast/gamma, median filtering, and auto-deskew.
But more explanations are yet to come.
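
Coming back to the file-naming tip above, here is a small sketch of what an automated naming helper could look like in Python. The naming scheme (date, title, language code, page number) and the function name are purely hypothetical; adapt them to whatever details matter in your own archive.

```python
import re
from datetime import date


def build_scan_filename(title, scan_date, language, page, ext="tif"):
    """Build a descriptive, sortable filename for one scanned page."""
    # Lowercase the title and collapse every run of characters that is not
    # an ASCII letter or digit into a single "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{scan_date.isoformat()}_{slug}_{language}_p{page:04d}.{ext}"


print(build_scan_filename("Invoice ACME #42", date(2020, 3, 24), "eng", 7))
# 2020-03-24_invoice-acme-42_eng_p0007.tif
```

Note that this simple version also replaces accented and other non-ASCII letters with a dash, so titles in other languages would need a gentler cleanup rule.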

Cheers!

Bogdan
