Recently, I talked to a client who has a 20-year-old book he’d like to reproduce using more "modern" technology. It’s got good information, but it looks outdated because it was typed on an old typewriter. Plus, it’s a lot easier to make changes once the document is in digital format.
Most people know about scanning, but a lot of folks are only familiar with picture scanning software. With that type of software, when you scan a printed page, the scanner saves it as a picture, such as a JPG or TIFF file. But you can also get the text off a printed page and into a word processor without retyping it.
The answer is optical character recognition, or OCR. This process translates printed characters into text that can be edited and otherwise manipulated on a computer using a standard word processor. It’s incredibly cool technology that’s often bundled with scanners.
In fact, my ancient HP scanner came with a copy of an OCR product called WordScan, which I used a number of times over the years. However, it’s so old that I wasn’t optimistic about running it under Windows XP. (Some experiments are not worth trying.) So for my client’s book, I got a new copy of OmniPage Pro 12. I was able to buy it at the upgrade price because the upgrade offer applies to owners of any other OCR software, including programs bundled with scanners. (Even old ones.)
OCR is extremely easy to use. Basically, you lay the page you want to scan on the scanner glass. Generally, the process happens in two phases: scanning and recognition. When you scan, you can tell the OCR software to scan either particular areas or the whole page. During the recognition phase, you can set up the OCR software to alert you to any terms it can’t figure out. Sometimes smears on the printout, weird fonts, or underlining can confuse the recognition engine. But it’s no big deal because you, as the smart human, can easily tell the software what the word is, even under the coffee stain.
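If you’re curious what that "alert me when you’re unsure" step might look like under the hood, here is a minimal Python sketch. It is not how OmniPage or any other real OCR product works. The `RecognizedWord` class, the confidence scores, the 0.8 threshold, and the `review_words` function are all made up for illustration, on the assumption that a recognition engine hands back each word along with a number saying how sure it is.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One word from a hypothetical recognition engine."""
    text: str
    confidence: float  # assumed scale: 0.0 (pure guess) to 1.0 (certain)


def review_words(words, threshold=0.8, ask=input):
    """Rebuild the text, asking a human about any low-confidence word.

    `ask` is whatever function poses the question; it defaults to the
    built-in input() so a person can type the correction.
    """
    corrected = []
    for word in words:
        if word.confidence < threshold:
            # The engine is unsure, so let the smart human decide.
            answer = ask(f"Unsure about '{word.text}'. Correct word (Enter to keep): ")
            corrected.append(answer.strip() or word.text)
        else:
            corrected.append(word.text)
    return " ".join(corrected)


# Pretend recognition output: the middle word sat under a coffee stain.
sample = [
    RecognizedWord("The", 0.99),
    RecognizedWord("qu1ck", 0.42),
    RecognizedWord("fox", 0.97),
]

# A canned answer stands in for a person typing at the keyboard.
result = review_words(sample, ask=lambda prompt: "quick")
print(result)  # The quick fox
```

In this sketch, raising the threshold means the software asks you about more words, and lowering it means it trusts its own guesses more often.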