Posts Tagged ‘tips and tricks’

Optical Character Recognition: some advice

Hi folks,

This week we continue the Optical Character Recognition subject by offering our general public some advice for the pre-scan and post-scan stages of document archiving.
A properly done OCR task is not simply about text extraction: it also implies a set of operations meant to optimize the OCR process and increase efficiency in overall document-management practice.
In other words, operations commonly considered “adjacent” can greatly improve or totally destroy text recognition, making your later life either comfortable or a living hell.
Here are just a few things to keep in mind:

(1) Before scanning

  • when placing the paper in the scanner, make sure the pages have the correct text orientation, so you won’t waste time later either waiting for the OCR software to determine the orientation automatically or, even worse, fixing it manually by checking file after file;
  • choose scan settings that ensure the best quality for OCR (for example, resolutions of 250 or 300 dpi are considered optimal for most documents);
  • test the OCR output on a few pages before starting a batch scanning operation, to make sure your settings are properly fine-tuned;
  • select a lossless file format (such as TIFF) and do not be afraid of big file sizes if the documents are important to you: storage space is rarely an issue these days and you can later convert the files to any other format for handling (or sharing) purposes.
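
To get a feeling for what “big file sizes” means, here is a minimal Python sketch that estimates the pixel dimensions and the uncompressed pixel data size of a scanned page. The function names are ours, for illustration only; the figures ignore file headers and any compression, so real TIFF files will differ.

```python
def scan_dimensions(width_in, height_in, dpi):
    """Pixel dimensions of a page of the given size (in inches) scanned at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data, assuming each row is padded to a whole byte."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    # US Letter page (8.5 x 11 inches) scanned at 300 dpi
    width, height = scan_dimensions(8.5, 11, 300)
    print(width, height)  # 2550 3300
    for label, bits in (("black and white", 1), ("greyscale", 8), ("color", 24)):
        megabytes = uncompressed_bytes(width, height, bits) / 1_000_000
        print(f"{label}: about {megabytes:.1f} MB")
```

Under these assumptions, a single Letter page at 300 dpi takes roughly 1 MB in black and white, 8 MB in greyscale and 25 MB in color before compression.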

 

Actually, for important document archives, maybe the best idea would be to store the “original” files in TIFF format, move them to an external storage device or medium (external hard disk, DVD, etc.) and, for current work, use a duplicate archive containing files converted to a format that you consider optimal for your needs (JBIG2, PDF, etc.).
To a certain extent, this approach is similar to how the camera RAW format works for professionals in digital photography.

 

(2) After scanning

  • use relevant filenames for the resulting files and do not mind if the filenames tend to become lengthy: this isn’t hard to do with automated file naming tools and, even if it takes a bit more of your time at the file creation stage, it can be a real life saver later. Make sure that the filename contains important data, such as the language of the text, to name just one detail that matters for OCR;
  • do not hesitate to use image enhancement techniques: you cannot control the quality of the paper documents, nor the hardware (i.e. scanner) details that might influence output quality (just one example among dozens: tiny scratches on the scanner’s glass).

 

To overcome such quality problems, professional document imaging software vendors provide their users with a wide range of image correction features.
On this blog you will find some explanations of brightness/contrast/gamma, median filtering and auto-deskew.
More explanations are yet to come.
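
As for the file naming advice from the “After scanning” list, it could be automated along these lines. This is only a sketch: the naming scheme (date, title, language code, page number) and the function name are our own assumptions, not a standard, so adapt them to your archive.

```python
import re
from datetime import date


def build_scan_filename(title, language, scan_date, page, extension="tif"):
    """Build a descriptive filename that includes the text language.

    The scheme used here is hypothetical: date_title_language_pNNNN.ext
    """
    # Keep only lowercase letters and digits, joined by single hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{scan_date.isoformat()}_{slug}_{language}_p{page:04d}.{extension}"


if __name__ == "__main__":
    print(build_scan_filename("Lease Contract, Annex 2", "eng", date(2012, 4, 27), 3))
    # 2012-04-27_lease-contract-annex-2_eng_p0003.tif
```

With the language stored in the name, a batch OCR job can later pick the right recognition language for each file without opening it.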

Cheers!

Bogdan

 

Big Browser on April 27

  • Google's secret weapon to fight Redmond and Cupertino
  • Eugene Kasperski: "In terms of security, Apple is 10 years behind Microsoft"
  • Repetitive tasks: geeks vs. non-geeks
  • Read this before naming your startup
  • Why the iPad Has to be Made in China

Casual Friday on April 27

A beautiful day to play outside.

Camera RAW file formats explained

Hi folks,

This week we will provide our general public with explanations of camera RAW file formats, because this subject is often ignored or misunderstood and because our software supports more than 40 such formats.

Let’s start by specifying that RAW is not an acronym for anything: in this rare case, “raw” literally means “raw” (that is, “unprocessed”) and the explanation for the term lies in the way digital cameras work.
Each time you take a picture, you are actually exposing the digital camera’s light-sensitive chip to light.
The chip has millions of sensor units (i.e. pixels), each one translating the amount of light that hit it into a voltage level, which is then converted to a digital value.
Usually, this digital value is recorded with 12 or 14 bits, meaning that each pixel can distinguish 4096 brightness levels (= 2^12) or 16384 brightness levels (= 2^14).
In most cameras the sensor units do not record colors themselves: they only record light intensity (greyscale values), and color is reconstructed afterwards with the help of a color filter array placed over the sensor, such as the Bayer matrix.
Finally, when saving a raw file, the camera software adds various metadata (information on camera type, camera settings, etc.), but this information has no influence on the stored raw image: it is simply added as tags.
In other words, the raw image data is unprocessed and, in most cases, uncompressed or only losslessly compressed, and the various settings associated with it are not applied: they are stored as metadata for later use.
To conclude the description of this stage of image generation in digital cameras, we should add that raw files are large, their format is proprietary to the camera manufacturer (sometimes even specific to a certain camera model) and they are often compared to the “negative film” of classic photography.
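
The brightness-level arithmetic mentioned above is easy to check with a few lines of Python. The helper names are ours, and the 8-bit reduction is a deliberately crude illustration (it simply drops the lowest bits), not what a real raw converter does.

```python
def brightness_levels(bits):
    """Number of distinct values a pixel can take with the given bit depth."""
    return 2 ** bits


def to_8bit(value, bits):
    """Crude reduction of a raw sample to 8 bits by dropping the lowest bits."""
    return value >> (bits - 8)


if __name__ == "__main__":
    for bits in (8, 12, 14):
        print(f"{bits} bits -> {brightness_levels(bits)} levels")
    # The brightest 12-bit sample maps to the brightest 8-bit value
    print(to_8bit(4095, 12))  # 255
```

This also shows why raw files leave more room for later corrections: a 12-bit sample has 16 times more levels than the 256 levels of an 8-bit image.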


Image enhancement: median filtering

Hi folks,

We continue the series of explanations on image enhancement techniques meant for our general public, and this week we are going to give you some additional info about median filtering.

Images quite often contain artifacts known as “noise”.
“Noise” originally means, of course, unwanted sounds in an audio context, but the term quickly spread to other domains, where it designates unwanted, randomly scattered artifacts.
In imaging, for instance, one of the most frequent noise types is called “salt and pepper noise”.
It is quite an intuitive name, as images affected by this type of noise look as if salt and pepper grains had been sprinkled over the “clean” image (bright pixels in darker areas and dark pixels in brighter areas of the image).
The usual causes of this issue are hardware related (analog-to-digital conversion errors, bit errors in transmission, etc.).

Which brings us to median filtering: one of the most effective methods for removing such noise from images is to apply the median filter.
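
The idea behind the median filter is to replace each pixel with the median of the values in its neighborhood, so isolated very bright or very dark pixels simply disappear. Here is a minimal pure-Python sketch on a greyscale image stored as a list of rows. The function name and the border handling (the window is cropped at the edges) are our own choices for illustration; production software uses far faster implementations.

```python
def median_filter(image, radius=1):
    """Apply a median filter to a greyscale image given as a list of rows.

    radius=1 means a 3x3 window. At the borders the window is cropped,
    and for windows with an even number of pixels the upper median is used.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sorted(window)[len(window) // 2])
        result.append(row)
    return result


if __name__ == "__main__":
    # Uniform grey image with one "salt" pixel and one "pepper" pixel
    noisy = [[100] * 5 for _ in range(5)]
    noisy[2][2] = 255  # salt
    noisy[0][4] = 0    # pepper
    for row in median_filter(noisy):
        print(row)
```

In this small example both noisy pixels are replaced by the surrounding grey value, while the rest of the image stays unchanged.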


Deskew/Autodeskew: what’s that?

Hi folks,

This week we thought we would offer our general public some explanations about deskew/autodeskew, mainly to answer two questions: what is it and why is it important to have?

Skew is an artifact that may appear during the document scanning process: the document’s text and images end up rotated at a slight angle.
It can have various causes, but the most common one is the paper being misplaced during the scan.
Deskew, therefore, is the process of detecting and fixing this issue in scanned files (i.e. bitmaps), so that deskewed images have their text and images correctly, horizontally aligned.

And why is this important?
Well, a first benefit is that you don’t have to rescan the skewed documents.
Instead of the manual and time-consuming actions that a rescan involves, everything is done automatically and efficiently by the software providing the deskew feature.
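
To give an idea of how the detection step can work, here is a small Python sketch of one well-known approach, the projection profile: try several candidate rotation angles and keep the one for which the dark pixels pile up most sharply into horizontal rows. This is not necessarily what any particular product does; the function name, the parameters and the scoring (sum of squared row counts) are assumptions made for illustration.

```python
import math
from collections import Counter


def estimate_skew_degrees(dark_pixels, max_angle=10.0, step=0.5):
    """Return the rotation (in degrees) that best aligns dark pixels into rows.

    dark_pixels is an iterable of (x, y) coordinates of dark pixels.
    Rotating a point by an angle t gives the new row x*sin(t) + y*cos(t).
    """
    dark_pixels = list(dark_pixels)
    best_angle, best_score = 0.0, -1.0
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Count how many dark pixels land in each row after the rotation
        rows = Counter(round(x * sin_t + y * cos_t) for x, y in dark_pixels)
        # Well-aligned text lines give a few very full rows, hence a high score
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

The returned value is the correction to apply: rotating the page content by that angle, in the same coordinate system, brings the text lines back to horizontal. A real implementation would then resample the bitmap accordingly and would usually search with a finer step.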


About JBIG2 compression

Hello folks,

Although disk storage and internet bandwidth keep increasing and getting cheaper, worldwide efforts toward better file compression are increasing as well.
This is no paradox: there are too many reasons to mention them all but we all know that, for instance, file transfers are never fast enough.

For the document imaging domain, there is the JBIG2 compression scheme.
