Posts Tagged ‘feature explanations’

Automatic color detection: how to dramatically reduce the size of your documents

Hi folks,

In this article we are going to explain to our general public what color detection is all about and how it can be used to dramatically reduce the size of your electronically stored documents.

In a previous article, we showed that bitmaps (or raster images) are made of pixels (ordered in arrays or matrices, each pixel having its own coordinates and color) in a way similar to how mosaics are made out of pieces of coloured glass.
Since bits (“0” and “1”) are used to store information about color, it is quite logical that the more colours need to be encoded in an image, the more bits per pixel (or “bpp”) are necessary to store that information and, therefore, the larger the bitmap image file will be.

From a color point of view, bitmaps can be:

– Black and white
With only 2 colors, they are encoded in 1 bpp (“0” or “1” for black or white), so these bitmaps need the smallest possible amount of data for color information.

– Grayscale

Such images contain black, white and a range of intermediate shades of grey.
Generally, an 8 bpp color encoding (256 shades) is considered acceptable, but note that each pixel already requires 8 times more data than in a B/W image.

– Color

Such images use color palettes of various sizes, but 24 bpp color encoding is considered satisfactory as it can store over 16.7 million colours while the human eye can discern only about 10 million.
Of course, each pixel in such images takes 3 times more data than at 8 bpp and 24 times more data than at 1 bpp.
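
To put numbers on this, here is a minimal Python sketch that computes the uncompressed pixel-data size of one page at each color depth. The page dimensions (an A4 sheet scanned at 300 dpi, roughly 2480 x 3508 pixels) and the function name are assumptions chosen for illustration; real files are usually compressed and carry headers, so actual sizes will differ.

```python
def raw_size_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed pixel data size, ignoring headers, row padding and compression."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes

# Assumed page: A4 at 300 dpi, roughly 2480 x 3508 pixels.
WIDTH, HEIGHT = 2480, 3508

for label, bpp in [("Black and white", 1), ("Grayscale", 8), ("Color", 24)]:
    size = raw_size_bytes(WIDTH, HEIGHT, bpp)
    print(f"{label:16s} {bpp:2d} bpp -> {size / 1_000_000:.1f} MB")
```

Under these assumptions the same page goes from about 1.1 MB at 1 bpp to about 26.1 MB at 24 bpp before any compression is applied.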

Now, why is all this so important?

In real life, not only professionals in document storage but most of us are forced to compromise between storing documents at the highest possible quality and keeping them at the smallest possible size (mainly for sharing purposes).
To achieve that, scanning operators have to separate B/W pages from grayscale and colored ones and scan each of those sets at 1 bpp, 8 bpp and 24 bpp, respectively.
This task is terribly slow, painful and prone to human error.

What if everything could be done instantly, automatically and with no scanning constraints?

Well, we at ORPALIS have developed a patent-pending, proprietary technology for automatic color detection.
All you have to do is put all your documents in one batch, no matter their color type, and scan them all in color mode: our software will automatically determine the color type of each page.
Then, depending on the detected color type, the filter will automatically encode the image using the best-suited bits-per-pixel encoding.
In other words, it provides the best quality at the smallest possible size.
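
To make the idea concrete, here is a deliberately naive Python sketch of per-page color type detection. It is not the ORPALIS technology, which is proprietary; the function name, the tolerance value and the representation of a page as a list of (r, g, b) tuples are all assumptions made for illustration.

```python
def detect_color_type(pixels, tolerance=16):
    """Classify a page given as an iterable of (r, g, b) tuples (0-255 each).

    Naive heuristic for illustration only; not the proprietary ORPALIS method.
    """
    is_bw = True
    for r, g, b in pixels:
        # A pixel counts as "colored" if its channels differ by more than the tolerance.
        if max(r, g, b) - min(r, g, b) > tolerance:
            return "color"
        level = (r + g + b) // 3
        # A mid-gray pixel is far from both pure black and pure white.
        if tolerance < level < 255 - tolerance:
            is_bw = False
    return "black_and_white" if is_bw else "grayscale"

# Target encoding for each detected type, as described in the article.
BITS_PER_PIXEL = {"black_and_white": 1, "grayscale": 8, "color": 24}

pages = {
    "text page": [(0, 0, 0), (255, 255, 255), (250, 250, 250)],
    "photocopy": [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
    "brochure": [(200, 30, 30), (255, 255, 255)],
}
for name, pixels in pages.items():
    kind = detect_color_type(pixels)
    print(name, "->", kind, "->", BITS_PER_PIXEL[kind], "bpp")
```

A practical detector would need to be more forgiving than this, for example by tolerating a small fraction of mid-gray pixels along the edges of scanned black text.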

This feature is already implemented in PaperScan Pro starting with version 1.6 and will be fully available programmatically in the next major GdPicture.NET release.

Care for a practical test?

Make sure you have the latest PaperScan Pro (even a trial version) installed.
For your convenience, we provide 3 TIFF test files in a zipped folder to use for batch import, but you can test using your own images, either acquired from a scanner or imported from existing image files.
Each TIFF file is bigger than 1 MB, so the 3 total more than 3 MB in size.
Import them into PaperScan, then save them in multipage PDF format.
The resulting PDF file (PaperScan creates it using JPEG optimization and PDF pack technology) will be about 800 KB in size.
Not bad, but if you think we can’t do even better, you’ll have to think again!

From the main menu, go to “Options / Batch Acquisition/Import Filters…“.

PaperScan Pro Batch Acquisition/Import Filters...

Select the “Automatic Color Detection” option and click “Save”.

PaperScan Pro Automatic Color Detection

Now import the TIFF files again and save as a multipage PDF: the resulting file is 65 KB in size!
Ta-daaam!

Our next step is to provide automatic color detection for regions within a single document.
This will be available to end users in one of the upcoming PaperScan versions and, of course, programmatically to developers using our next GdPicture.NET toolkit!

Cheers!

Bogdan

Optical Character Recognition: some advice

Hi folks,

This week we continue the Optical Character Recognition subject by providing our general public with some advice for the pre-scan and post-scan stages of document archiving.
A properly done OCR task is not simply about text extraction; it also implies a set of operations meant to optimize the OCR process and increase efficiency in overall document-management practice.
To put it another way, operations commonly considered “adjacent” can actually greatly improve or totally destroy text recognition, making your later life either comfortable or a living hell.
Here are just a few things to keep in mind:

(1) Before scanning

  • when placing the paper in the scanner, make sure the pages have the correct text orientation so you won’t waste time later, either waiting for the OCR software to determine the orientation automatically or, even worse, fixing it manually by checking file by file;
  • choose proper scan settings to ensure the best quality for OCR (for example, 250 or 300 dpi resolutions are considered optimal for most documents);
  • test OCR output on a few pages before starting a batch scanning operation to make sure your settings are properly fine-tuned;
  • select a lossless file format (such as TIFF) and do not be afraid of big sizes if the documents are important to you: storage space is not an issue these days and you can later convert the files to any other format for handling (or sharing) purposes.

 

Actually, for important document archives, maybe the best idea would be to store the “original” files in TIFF format, move them to an external storage device or medium (external hard disk, DVD, etc.) and, for current work, use a duplicate archive containing files converted into a format that you consider optimal for your needs (JBIG2, PDF, etc.).
To a certain extent, this approach is similar to how the camera RAW format works for professionals in digital photography.

 

(2) After scanning

  • use relevant filenames for the resulting files and do not mind if they tend to become lengthy: this isn’t hard to do with automated file naming tools and, even if it takes a bit more of your time at the file creation stage, it can be a real lifesaver later. Also make sure that the filename contains important data, such as the language of the text, to name just one important detail for OCR.
  • do not hesitate to use image enhancement techniques: you cannot control the quality of the paper documents, nor the particular hardware (i.e. scanner) details which might influence output quality (just one example among dozens: tiny scratches on the scanner’s glass).

 

To overcome such problems, professional document imaging software vendors provide their users with a wide range of image correction features.
In this blog you will find some explanations of brightness/contrast/gamma, median filtering and auto-deskew.
More explanations are yet to come.
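
The file naming advice above is easy to automate. Here is a minimal Python sketch; the function name, the choice of fields and their order are hypothetical and only meant to show how details such as the text language can be carried in the filename.

```python
from datetime import date

def archive_filename(title, language, scan_date, page_count, extension="tif"):
    """Build a descriptive file name; the field order and separators are hypothetical."""
    # Keep only letters and digits from the title, joined by hyphens.
    words = "".join(c if c.isalnum() else " " for c in title).split()
    slug = "-".join(w.lower() for w in words)
    return f"{scan_date.isoformat()}_{slug}_{language}_{page_count}p.{extension}"

print(archive_filename("Lease contract, signed", "eng", date(2012, 5, 4), 12))
# prints: 2012-05-04_lease-contract-signed_eng_12p.tif
```

Putting the date first keeps files sorted chronologically, and the language code can later be read back to choose the right OCR settings.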

Cheers!

Bogdan

 

Big Browser on April 27

  • Google's secret weapon to fight Redmond and Cupertino
  • Eugene Kaspersky: "In terms of security, Apple is 10 years behind Microsoft"
  • Repetitive tasks: geeks vs. non-geeks
  • Read this before naming your startup
  • Why the iPad Has to be Made in China

Casual Friday on April 27

A beautiful day to play outside.

Optical Character Recognition: an introduction

Hi folks,

This week we will provide our general public with a first article about Optical Character Recognition, a key feature in the document imaging domain (but not limited to it). Later we’ll go into detail on some particularly important aspects and best practices in OCR.
But for now, let’s just make a basic introduction.

Despite the many definitions of OCR, a simple and accurate one would be: Optical Character Recognition is meant to distinguish text from non-text inside a digital image and to output it as text.
The history of OCR is quite fascinating, not only because of its fast-growing complexity, but also because of its unbelievably early beginnings.
Being a technology ahead of its time, OCR started as discreet military research (as did computers, the internet and many other advanced technologies).
But can you believe that its first developments started around 1914 or that, during the 1960s (a time when the general public was barely communicating by wired telephone), some national postal services, such as the US Postal Service or the British General Post Office, were already using OCR to automatically sort our grandparents’ handwritten mail?

Well, the reason we need to extract text from images is that software cannot handle text unless it is encoded as text.
We need text to be edited, indexed (so we can retrieve it later using text-based searches) and processed for more complex refinements (such as text mining); we even need text as such so it can be rendered back to us as spoken information!
In other words, “text” from an IT point of view means character-encoding standards, such as ASCII, Unicode, etc.
The text within an image file (i.e. the bitmap that results when a document is scanned) means “text” only for us humans, who are able to recognize it.
But for almost all computer software, a bitmap containing text is nothing but a series of pixel values, the same as any other bitmap not containing text.
The exception is OCR software, which is able to analyse all pixel values, perform highly complex processing and determine whether “patterns” can be found that match the ones corresponding to “text”.
Basically, what happens is a kind of best-guess attempt, and the result is output as text-encoded information.
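
The “best-guess” idea can be illustrated with a toy pattern matcher in Python. This is only a sketch: the 5x5 glyph templates and function names are invented for the example, and real OCR engines use far more sophisticated processing than pixel-by-pixel comparison.

```python
# Hypothetical 5x5 glyph templates: "#" is an ink pixel, "." is background.
TEMPLATES = {
    "I": ["#####", "..#..", "..#..", "..#..", "#####"],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}

def similarity(bitmap, template):
    """Count the pixels on which the two 5x5 grids agree."""
    return sum(
        1
        for row_a, row_b in zip(bitmap, template)
        for a, b in zip(row_a, row_b)
        if a == b
    )

def recognize(bitmap):
    """Return the best-guess character and its score (out of 25 pixels)."""
    best = max(TEMPLATES, key=lambda ch: similarity(bitmap, TEMPLATES[ch]))
    return best, similarity(bitmap, TEMPLATES[best])

# A noisy scan of a "T": one stray ink pixel in the bottom-left corner.
scanned = ["#####", "..#..", "..#..", "..#..", "#.#.."]
char, score = recognize(scanned)
print(char, score, char.encode("ascii"))  # prints: T 24 b'T'
```

The input is just a grid of pixel values; only after matching does it become the character “T”, which software can then store as an encoded byte, index and search.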

This is why OCR accuracy depends on many different aspects: printed text is much easier to recognize correctly than handwritten text; if the language/character set of the text is known in advance and settings are made accordingly, OCR results are dramatically better; the page should have the correct orientation (or else use the automatic orientation detection component of the OCR software, if available); image quality might need to be enhanced before the image is submitted to OCR; and so on.

In our forthcoming articles on OCR we will further explain some best practices and various factors to be considered when choosing an OCR engine (i.e. quality vs royalties, time vs hardware resources, number of supported languages, etc.).

Cheers!

Bogdan

Big Browser on April 20

  • FTP 40 years anniversary
  • Tim Berners-Lee: demand your data from Google and Facebook
  • Visual Studio 11: A refined and disappointing experience
  • Software Professional Code of Ethics
  • Tech Republic: 10 Google services you can live without

Casual Friday on April 20

Tools for students

Camera RAW file formats explained

Hi folks,

This week we will provide our general public with explanations of camera RAW file formats, because this subject is often ignored or misunderstood and because our software supports more than 40 such formats.

Let’s start by specifying that RAW is not an acronym for anything: in this rare case, “raw” literally means “raw” (“unprocessed”, that is) and the explanation for this term lies in the way digital cameras work.
Each time you take a picture, you are actually exposing the digital camera’s photo-sensitive chip to light.
The chip has millions of sensor units (i.e. pixels), each one translating the amount of light that hit it into a voltage level which is then converted to a digital value.
Usually, this digital value is recorded using 12 or 14 bits, meaning that each pixel can distinguish 4096 brightness levels (= 2^12) or 16384 brightness levels (= 2^14).
Commonly, sensors do not record color directly: imaging chips record brightness levels, and color is reconstructed by using filters and color schemes such as the Bayer matrix.
Finally, when saving a raw file, the camera software adds various metadata (information on camera type, camera settings, etc.), but this information has no influence on the stored raw image; it is simply added as tags.
In other words, the raw image data is unprocessed and uncompressed and the various settings associated with it are not applied: they are stored as metadata for later use.
To conclude the description of this stage of image generation in digital cameras, we should add that raw files are large, that their format is proprietary to the camera manufacturer (sometimes even specific to a certain camera model) and that they are often compared to the “negative film” of the classic photography process.
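
The bit-depth arithmetic above can be checked with a short Python sketch. The `quantize` function models an idealized analog-to-digital conversion and is an assumption made for illustration, not a description of any particular camera; 8 bits is included only for comparison, as it is the per-channel depth of a typical JPEG.

```python
def brightness_levels(bits):
    """Number of distinct values that fit in the given number of bits."""
    return 2 ** bits

def quantize(voltage, max_voltage, bits):
    """Map an analog sensor reading to an integer digital value (idealized ADC)."""
    levels = brightness_levels(bits)
    ratio = min(max(voltage / max_voltage, 0.0), 1.0)  # clamp to [0, 1]
    return min(int(ratio * levels), levels - 1)

for bits in (8, 12, 14):
    print(bits, "bits ->", brightness_levels(bits), "levels;",
          "half-scale reading ->", quantize(0.5, 1.0, bits))
```

The extra levels are what give “dark room software” more room to adjust exposure and contrast later without visible banding.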

Let’s keep this good and widespread analogy to describe the next stage of digital photo image generation: “developing” the “negative film” (inside the “dark room”) to obtain the actual photo.
Raw files have to be converted to standard formats such as TIFF or JPEG, similar to how negative films need to be developed to get prints.
This is usually done by the camera’s built-in software immediately after the image is captured. It consists of applying various color corrections and file compressions considered optimal by the manufacturer and satisfactory by most users, but it gives the user only little control over the “development” process.
For professionals, however, such an approach might simply be insufficient, as they might require full control over processing to determine the final appearance of the image.
Therefore, they would instead use more powerful software and hardware to achieve this.
For example, they can control brightness, contrast, gamma, sharpening, temperature adjustment (white balance), noise reduction, tint, etc., not to mention file-saving formats and compression options.

To summarize: raw format files contain all the image data and information needed for later processing (“development”), up to the highest levels of image quality or customization.
One can store a photo as a raw file and then, based on it, create any number of versions of that picture using “dark room software”, either existing or yet to come!
By contrast, in-camera software has limited processing performance compared to dedicated third-party software. It outputs lossy or lossless images in formats such as JPEG or TIFF, but everything is based on a range of settings of which only some are controllable by the user.
This option benefits amateur users as it is fast, painless and the quality meets, or even exceeds, their expectations.

We should not finish this article without mentioning Adobe’s efforts to introduce a standardization model for raw formats: they’ve created an openly documented file format named “DNG” (which stands for Digital Negative), not very widely adopted, at least not yet.
But of course, our software supports the DNG format as well.

Cheers!

Bogdan

Big Browser on April 13

  • Jack Tramiel, the founder of Commodore computers, passed away
  • The history of super computers
  • Technical books are broken
  • Open source software in C#
  • Poll: Does it matter if Microsoft open sources .NET technologies?

Casual Friday on April 13

Wireless Technology

Raster vs vector graphics images

Hi folks,

Today we are going to explain the differences between raster graphics and vector graphics for our general public.

Raster graphics images (or bitmaps) are based on the elementary concept of pixel.
A pixel (picture element) is the smallest controllable “dot” or “point of colour” or “unit” of a picture.
Raster images are made of pixels (ordered in arrays or matrices), each pixel having its own coordinates and color, similar to how a mosaic is made out of small pieces of colored glass.
Hence the name “bitmap” used for files: the image is encoded as a “map of bits” holding the position and colour of each and every pixel.
Therefore, a bitmap image is technically defined by its width and height (in pixels) and by the number of bits per pixel used for storing colour information (“colour depth”).
Hence, the greater the quality (resolution and colour depth), the bigger the file size.
For that reason, bitmap files can be uncompressed or compressed (either lossy or lossless), resulting in a large variety (including sub-varieties) of popular file formats such as TIFF, BMP, PNG, JPEG, JBIG2, etc.

Vector graphics, on the other hand, do not store image information as pixels.
Instead, they contain mathematical expressions to generate and represent all details of an image.
In other words, they contain a description of “how to draw” the image, instead of “what colour each pixel must have in order to obtain the image”.
This approach has 2 main advantages: the image quality remains at its highest regardless of zoom level, and the file size is about the same no matter the resolution of the image.
Vector graphics use complicated (and often proprietary) algorithms, therefore such file formats are generally restricted for use with the application that generated them (such as the AI format for Adobe Illustrator, the .CDR format for CorelDraw, AutoCAD DXF, etc.), but this is not an absolute rule.

Some file formats contain both pixel and vector data and, among them, one is of particular interest because it is one of the most widespread formats in the world: PDF.
We will talk about PDF basics in a dedicated series of articles in the near future, as the PDF format is quite a vast subject and is highly important and relevant to our products, too.

But for now let’s just add that raster images can be converted to vector through a process named “vectorisation”, and vector images can be converted to raster (bitmap), this process being known as “rasterisation”.
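
Rasterisation is easy to demonstrate with a small Python sketch. The “vector image” here is a single filled circle described by three numbers, and the function name and the text-based output are assumptions made for illustration; real rasterisers handle arbitrary shapes, colours and anti-aliasing.

```python
def rasterise_circle(cx, cy, radius, size):
    """Rasterise a filled circle described in unit coordinates (0..1)
    onto a size x size grid of pixels, sampling each pixel centre."""
    rows = []
    for y in range(size):
        row = ""
        for x in range(size):
            # Pixel centre in unit coordinates.
            px, py = (x + 0.5) / size, (y + 0.5) / size
            inside = (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
            row += "#" if inside else "."
        rows.append(row)
    return rows

# The "vector image" is just three numbers; it never changes.
circle = (0.5, 0.5, 0.4)

for size in (8, 16):
    print(f"{size} x {size} pixels:")
    print("\n".join(rasterise_circle(*circle, size)))
```

The same three numbers produce a smoother circle at 16 x 16 than at 8 x 8: the vector description stays the same size, while the bitmap grows with the resolution.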

Our products are strongly oriented towards raster image formats and PDF, as we specialize in document imaging.
But powerful and efficient bitmap-to-PDF and PDF-to-bitmap (rasterisation) conversion features are particularly worth noting among our general “convert any supported format to any other supported format” capabilities.
You might be surprised at how complex the technology that sometimes lies behind the trivial “Save as …” option can be.

Cheers,

Bogdan

Wikipedia - VectorBitmapExample

Big Browser on March 30

  • .NET vs Windows 8
  • Unlearn, young programmer!
  • Is Seattle the next Silicon Valley? (Infographic)
  • Linus Torvalds: The King of Geeks (And Dad of 3)
  • The .NET blog: Improving Launch Performance for Your Desktop Applications