Post by cjm on Apr 13, 2014 17:45:23 GMT
As it took me a while to gain familiarity with the concept and the issues, here is a basic account to provide some elementary background for those starting out.
Firstly, I should point out that my *expertise* is mainly in regard to Linux although I have used an OCR program on Windows XP as well.
The main problem arises with scanners, which generally save documents in JPG or PDF format. The scanned image is essentially a photograph of the document. For pictures and photos this is fine, but scanned text cannot readily be edited with word processors and the like. One can use software like Photoshop (or GIMP on Linux) to make changes, but it is tedious and messy.
Fortunately there is software which converts the image to editable text - this is called Optical Character Recognition (OCR) software.
If you can afford it, Adobe Acrobat includes an OCR function. I have used Readiris on XP, which works fine.
There is also free software for Windows.
On Linux (Ubuntu 12.04) I use gImageReader (free, open-source software). It provides the graphical interface for Tesseract (packaged as tesseract-ocr), which is the real power behind the throne. It seems that both can also be used on Windows - I have not tried it.
I only have praise for the combination. It works well and even has an Afrikaans library. Tesseract has a long history.
It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.
So Google does some good things as well!!
Both pieces of software are available in the Ubuntu repositories and can be installed with Synaptic Package Manager - make sure that the language packs for the languages you will be working in are installed as well. There is also a separate *language* for maths.
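For those who like to check things from a script, here is a rough Python sketch that asks Tesseract which languages it can see and reports the ones still missing. It relies on the `tesseract --list-langs` command; the function names are my own, and the codes `afr`, `eng` and `equ` (Afrikaans, English, maths) are what I understand the language data to be called, so treat them as assumptions and check your own install.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Language codes are single words; skip blanks, the header and warnings.
        if not line or " " in line:
            continue
        langs.add(line)
    return langs


def installed_langs():
    """Ask the tesseract binary which language data files it can find."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some releases print the list on stderr rather than stdout, so read both.
    return parse_langs(result.stdout + "\n" + result.stderr)


def missing_langs(wanted, available):
    """Return the wanted language codes that are not installed, sorted."""
    return sorted(set(wanted) - set(available))


if __name__ == "__main__":
    print(missing_langs(["eng", "afr", "equ"], installed_langs()))
```

An empty list means everything you asked for is installed.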
Just to run through the basic process:
1. Save the scanned document to your computer in JPG format - or whatever image format your scanner uses.
2. Run your OCR software and open the scanned JPG file.
3. Convert the JPG file with your OCR software to an editable document format and save it.
4. Now you can edit the document with your word processor.
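The steps above can also be done without the graphical interface by calling Tesseract directly. Below is a minimal Python sketch, assuming the `tesseract` command is installed and on your PATH; the function names and the file name `scan.jpg` are made up for illustration. Tesseract takes an input image, an output base name and a language, and writes plain text to the base name plus `.txt`.

```python
import subprocess
from pathlib import Path


def text_path(image):
    """Where tesseract will write the text: the image name with .txt instead of .jpg."""
    base = Path(image).with_suffix("")
    return Path(str(base) + ".txt")


def build_command(image, out_base, lang="eng"):
    """Build the tesseract call; tesseract adds the .txt ending itself."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_to_text(image, lang="eng"):
    """Run OCR on one scanned image and return the path of the text file."""
    out_base = Path(image).with_suffix("")
    subprocess.run(build_command(image, out_base, lang), check=True)
    return text_path(image)


if __name__ == "__main__":
    # Hypothetical file name; replace with your own scan.
    result = ocr_to_text("scan.jpg", lang="eng")
    print("Editable text saved to", result)
```

The resulting text file can then be opened in any word processor, which is step 4.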
I find that the gImageReader results can be improved by ensuring that the lines of text run straight across the page. The general quality of the image plays an important role as well.
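To give an idea of how crooked lines can be detected automatically, here is a small Python sketch of the projection-profile approach: try a range of angles and keep the one at which the dark pixels pile up into the fewest, fullest rows. This is a generic technique, not necessarily what gImageReader or Tesseract do internally. It assumes you already have the dark pixels as a list of (x, y) coordinates (getting them out of an image needs an imaging library, which I leave out), and the angle is simply the slope of the text lines in those coordinates.

```python
import math


def profile_score(pixels, angle_deg):
    """Score how well the pixels line up in rows after undoing a tilt of angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    rows = {}
    for x, y in pixels:
        # Shift each pixel vertically as if the page were sheared back.
        row = round(y - x * slope)
        rows[row] = rows.get(row, 0) + 1
    # Few, full rows give a higher sum of squares than many thin rows.
    return sum(count * count for count in rows.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (in degrees) with the sharpest row profile."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_score(pixels, angle))


if __name__ == "__main__":
    # Three synthetic text lines tilted by 2 degrees.
    tilt = math.tan(math.radians(2.0))
    pixels = [(x, round(base + x * tilt))
              for base in (10, 40, 70) for x in range(200)]
    print(estimate_skew(pixels))
```

Once the angle is known, the image can be rotated back by that amount in an image editor before running the OCR.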