Posts Tagged ‘OCR’

PaperScan Version 3 New Major Version

Hi folks,

PaperScan Scanning Software version 3 is here!

New version, new interface!
When you open PaperScan you will see the first major change: the interface got a rather extreme makeover. A lot of new features are also available, and some of them were specifically requested by you, the PaperScan users.

What’s new in PaperScan V3?

Overall, the software performs better in terms of speed and accuracy, with improvements in the following areas:
–    Scanning: preview feature in scanning wizard, improved support for camera devices and large documents.
–    Formats: SVG, EMF and WMF now supported.
–    Printing: advanced printing dialog to specify alignment, adjustment, orientation…
–    Automatic color detection: improvement of the engine to get a better compression of the files created.
–    OCR (Optical Character Recognition): now more than 60 languages are supported by our OCR engine.
–    Automatic image orientation.
–    New settings panel and profile manager to create, remove and switch configuration settings.
–    Custom keyboard shortcuts management.
Image Processing
–    More than 20 new filters and effects.
–    New despeckle filter in batch acquisition/import filters.
–    Improvement of all document imaging filters such as auto-deskew, punch hole removal…
–    Replacement of autocrop by automatic black borders removal in batch acquisition/import filters.
–    Pre-set annotations support with a featured designer.
–    New annotation: polygon ruler.

You can see the full history of changes on our dedicated page.

The download of PaperScan V3 is available here.


To evaluate V3, just uninstall your old version and install the new one.
Each download of a commercial version (PaperScan Home and PaperScan Pro) comes with a 30-day trial. And as usual, the free version stays free!
Give it a try and let us know what you think on the forum!

The pricing of PaperScan V3 is still the same: 49 USD for the Home Edition and 149 USD for the Pro Edition.
Customers who purchased a license after December 26th 2014 are eligible for a free upgrade.
For an upgrade from version 2, the upgrade price for the Home edition is 25 USD. The upgrade price for the Pro edition is 75 USD.
If you own V1 or V2 and are not sure about your options for upgrading to V3, you can contact us here.
Our sales team will be happy to answer all your questions!


Optical Character Recognition: some advice

Hi folks,

This week we continue the Optical Character Recognition subject by offering our readers some advice for the pre-scan and post-scan stages of document archiving.
A properly done OCR task is not simply about text extraction; it also involves a set of operations meant to optimize the OCR process and increase efficiency in overall document-management practice.
In other words, operations commonly considered “adjacent” can greatly improve or completely destroy text recognition, making your later life either comfortable or a living hell.
Here are just a few things to keep in mind:

(1) Before scanning

  • when placing the paper in the scanner, make sure the pages have the correct text orientation, so you won’t waste time later either waiting for the OCR software to determine the orientation automatically or, even worse, correcting it manually, file by file;
  • choose scan settings that ensure the best quality for OCR (for example, resolutions of 250 or 300 dpi are considered optimal for most documents);
  • test the OCR output on a few pages before starting a batch scanning operation, to make sure your settings are properly tuned;
  • select a lossless file format (such as TIFF) and do not be afraid of big file sizes if the documents are important to you: storage space is not an issue these days, and you can later convert the files to any other format for handling (or sharing) purposes.
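
To get a feel for what these resolution and format choices mean in practice, the arithmetic can be sketched as follows. This is only an illustration and not part of PaperScan: the function names `scan_pixels` and `uncompressed_bytes` are hypothetical, and the sizes are for raw, uncompressed pixel data (a losslessly compressed TIFF is usually smaller).

```python
def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Size of the raw pixel data, with each row padded to a whole byte."""
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    # US Letter page (8.5 x 11 inches) scanned at 300 dpi
    w, h = scan_pixels(8.5, 11, 300)
    print(w, h)  # 2550 3300
    for label, bpp in [("black and white", 1), ("grayscale", 8), ("color", 24)]:
        size_mb = uncompressed_bytes(w, h, bpp) / 1_000_000
        print(f"{label}: {size_mb:.1f} MB of raw pixel data")
```

A single color page at 300 dpi is therefore in the range of tens of megabytes before compression, which is the kind of "big size" the last point above asks you not to be afraid of.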


Actually, for important document archives, the best idea may be to store the “original” files in TIFF format, move them to an external storage device or medium (external hard disk, DVD, etc.) and, for day-to-day work, use a duplicate archive containing files converted to the format you consider optimal for your needs (JBIG2, PDF, etc.).
To a certain extent, this approach is similar to the way professionals in digital photography use camera RAW formats.


(2) After scanning

  • use relevant filenames for the resulting files and do not mind if they become lengthy: this is not hard to do with automated file naming tools and, even if it takes a bit more of your time at the file creation stage, it can be a real lifesaver later. Make sure that the filename contains important data, such as the language of the text, to name just one detail that matters for OCR;
  • do not hesitate to use image enhancement techniques: you cannot control the quality of the paper documents, nor the hardware (i.e. scanner) particularities that might influence output quality (just one example among dozens: tiny scratches on the scanner’s glass).


To overcome such quality issues, professional document imaging software vendors provide their users with a wide range of image correction features.
In this blog you will find some explanations of brightness/contrast/gamma, median filtering and auto-deskew.
More explanations are yet to come.
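
As a closing illustration of the file naming advice in the “After scanning” list, here is a minimal sketch of an automated naming helper. The naming scheme (date, title, language code, page number) and the function name `build_scan_filename` are assumptions made for this example, not a feature of any particular scanning tool.

```python
import re
from datetime import date


def build_scan_filename(title, language, scan_date, page, ext="tif"):
    """Build a descriptive filename: date, title slug, language code, page."""
    # Lowercase the title and replace every run of characters other than
    # a-z and 0-9 with a single hyphen (non-ASCII letters are dropped too).
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one ASCII letter or digit")
    return f"{scan_date.isoformat()}_{slug}_{language}_p{page:04d}.{ext}"


if __name__ == "__main__":
    print(build_scan_filename("Invoice ACME #42", "eng", date(2012, 4, 27), 3))
    # 2012-04-27_invoice-acme-42_eng_p0003.tif
```

Keeping the language code in a fixed position makes it easy to select the right OCR language later without opening each file.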




Big Browser on April 27

  • Google's secret weapon to fight Redmond and Cupertino
  • Eugene Kaspersky: "In terms of security, Apple is 10 years behind Microsoft"
  • Repetitive tasks: geeks vs. non-geeks
  • Read this before naming your startup
  • Why the iPad Has to be Made in China

Casual Friday on April 27

A beautiful day to play outside.


Optical Character Recognition: an introduction

Hi folks,

This week we offer our readers a first article about Optical Character Recognition, a key feature in the document imaging domain (but not limited to it). In later articles we will detail some particularly important aspects and best practices in OCR.
But for now, let’s start with a basic introduction.

Among the many definitions of OCR, a simple and accurate one would be: Optical Character Recognition is meant to identify text, as opposed to non-text, inside a digital image.
The history of OCR is quite fascinating, not only because of its fast-growing complexity, but also because of its surprisingly early beginnings.
Being a technology ahead of its time, OCR started as discreet military research (like computers, the internet and many other advanced technologies).
But can you believe that its first developments started around 1914, or that during the 1960s (a time when the general public was barely communicating by wired telephone) some national postal services, such as the US Postal Service or the British General Post Office, were already using OCR to automatically sort our grandparents’ handwritten mail?

Well, the reason we need to extract text from images is that software cannot handle text unless it is encoded as text.
We need text to be editable, indexable (so we can retrieve it later with text-based searches) and processable for more complex refinements (such as text mining); we even need text as such so that it can be rendered back to us as speech!
In other words, “text” from an IT point of view means character-encoding standards, such as ASCII, Unicode, etc.
The text within an image file (i.e. the bitmap that results when a document is scanned) is “text” only for us humans, who are able to recognize it.
For almost all computer software, a bitmap containing text is nothing but a series of pixel values, the same as any other bitmap not containing text.
The exception is OCR software, which is able to analyse all pixel values, perform highly complex processing and determine whether “patterns” can be found that match the ones corresponding to “text”.
Basically, what happens is a kind of best-guess attempt, and the result is output as text-encoded information.
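
The “best-guess” idea can be illustrated with a deliberately tiny sketch: each character is a 3x5 grid of pixels, and the recognizer picks the known template that differs from it in the fewest pixels. This is a toy under stated assumptions (the glyph templates, `NOISY_T` and the function names are made up for the example); real OCR engines are far more sophisticated and do not necessarily work this way.

```python
# Toy 3x5 glyph templates ('#' = ink, '.' = background); illustrative data only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

# A "scanned" letter T with one stray ink pixel in the middle row.
NOISY_T = ["###", ".#.", "##.", ".#.", ".#."]


def to_pixels(rows):
    """Flatten a glyph into a list of 0/1 pixel values: all the software sees."""
    return [1 if ch == "#" else 0 for row in rows for ch in row]


def best_guess(glyph_rows):
    """Return (character, mismatching_pixel_count) for the closest template."""
    pixels = to_pixels(glyph_rows)
    scores = {
        char: sum(a != b for a, b in zip(pixels, to_pixels(template)))
        for char, template in TEMPLATES.items()
    }
    char = min(scores, key=scores.get)
    return char, scores[char]


def recognize(glyphs):
    """Turn a sequence of glyph bitmaps into an encoded text string."""
    return "".join(best_guess(glyph)[0] for glyph in glyphs)


if __name__ == "__main__":
    print(best_guess(NOISY_T))  # ('T', 1)
    text = recognize([TEMPLATES["L"], TEMPLATES["O"], NOISY_T])
    print(text, list(text.encode("ascii")))  # LOT [76, 79, 84]
```

The input is nothing but pixel values; only the output is “text” in the IT sense, that is, a sequence of character codes that other software can search, index or edit.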

This is why OCR accuracy depends on many different aspects: printed text is much easier to recognize correctly than handwritten text; if the language/character set of the text to be recognized is known in advance and the settings are chosen accordingly, OCR results are dramatically better; the page should have the correct orientation (or else use the automatic orientation detection component of the OCR software, if available); image quality might need to be enhanced before the image is submitted to OCR; and so on.
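
Since accuracy varies with all of these factors, it helps to put a number on it when comparing settings. One common way, sketched below, is the character error rate: the edit distance between a hand-checked reference text and the OCR output, divided by the length of the reference. The function names are hypothetical and the sample strings are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the reference length (0.0 means a perfect match)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    reference = "Invoice 1001"
    ocr_output = "Invoice l00l"  # digit 1 misread as lowercase L, twice
    print(edit_distance(reference, ocr_output))  # 2
    print(f"{character_error_rate(reference, ocr_output):.1%}")
```

Running the same comparison on a few sample pages with different settings (language, resolution, enhancement filters) shows which combination works best for a given batch.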

In our forthcoming articles on OCR we will further explain some best practices and the various factors to consider when choosing an OCR engine (e.g. quality vs royalties, time vs hardware resources, number of supported languages, etc.).



Big Browser on April 20

  • FTP 40 years anniversary
  • Tim Berners-Lee: demand your data from Google and Facebook
  • Visual Studio 11: A refined and disappointing experience
  • Software Professional Code of Ethics
  • Tech Republic: 10 Google services you can live without

Casual Friday on April 20

Tools for students
