How to Convert Documents to Text Using Your Scanner

Many ideas for articles come from clients who have computer-related problems they need solved. Recently, a client needed to transfer historical records from the courthouse to digital media. Paper copies alone would not do. This can easily be done using a scanner.

I’m sure that there are a lot of other people who would like to do the same thing, not necessarily with historical records, but with personal documents such as marriage certificates, letters, grade cards, deeds, contracts, etc.

Scanning a document is easy. You just lay the document on the scanner, press ‘scan’ and then ‘save’ the scan as a file. This file may be all that you’ll need unless you want to be able to “edit” the document later. When you scan something, you’re really just taking an electronic picture of it. Even if you’re scanning text, it’s still stored in your computer as a picture. You can view it, even edit it somewhat with a graphics program, but you can’t load it into your word processor or spreadsheet and change it. If you want to do that, you’ll have to convert the picture to ‘text.’

To do this, you’ll need an OCR (optical character recognition) program. Fortunately, most scanners include this program in the suite of software that comes with the scanner.

There are several reasons to convert your documents to text. For one, once they are converted, they will take much less storage space on the hard drive. But the main reason is to be able to manipulate them as you would any other text or data file. This also means that you’ll be able to search through your documents for keywords and phrases. In the case of the historical records I mentioned earlier, once converted, it will be possible to include these records in a computerized database. If they were left as scanned files, you’d have to look through each scan by eye to find a specific record.
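As a small illustration of the search benefit, here is a minimal Python sketch that looks through a folder of converted documents for a keyword. It assumes the OCR output was saved as plain ‘.txt’ files in one folder; the function name find_keyword is made up for this example.

```python
from pathlib import Path


def find_keyword(folder, keyword):
    """Return (file name, line number, line) for every line containing keyword."""
    matches = []
    needle = keyword.lower()  # compare in lower case so the search ignores capitals
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                matches.append((path.name, number, line.strip()))
    return matches
```

Calling find_keyword("records", "miller") would list every line mentioning that name, something that is not possible while the records are still stored as pictures.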

Every OCR program works a little differently, but basically, this is how you convert graphic files to text. First, before you scan your document, set the scanner’s resolution to 300 dpi. The software works by identifying the shapes of the scanned letters, then comparing these shapes to those that are stored in its database. There is enough intelligence built into these programs that they can recognize letters in different fonts, often including fonts that are not in the software’s database at all. But in order to do this, the software needs a good scan to work with, and 300 dpi seems to work well.
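To make the idea of comparing shapes concrete, here is a toy Python sketch. It is not how any particular OCR product works; real programs are far more sophisticated. It simply stores three made-up 3-by-5 letter shapes and picks the one that differs from the scanned shape in the fewest pixels.

```python
# Tiny stored "database" of letter shapes: '#' is ink, '.' is blank paper.
# These three 3x5 shapes are invented for illustration only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def mismatches(shape, template):
    """Count the pixels where two equally sized shapes differ."""
    return sum(
        a != b
        for row_s, row_t in zip(shape, template)
        for a, b in zip(row_s, row_t)
    )


def recognize(shape):
    """Return the stored letter whose shape differs least from the scan."""
    return min(TEMPLATES, key=lambda letter: mismatches(shape, TEMPLATES[letter]))


# A scanned "T" with one stray speck of ink in the fourth row.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
print(recognize(noisy_t))  # prints "T"
```

The speck costs the scan one mismatch against the stored "T", but it is still the closest shape. A blurry or low-resolution scan adds many more stray pixels, which is why the quality of the scan matters so much.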

Your scanning and OCR software will most likely be two different programs. Bring up your scanning software first, scan the document, and save it to the hard drive. You may want to save it as a TIFF or GIF file. These formats store the image either uncompressed or with lossless compression (GIF is limited to 256 colors, which is fine for black-and-white or grayscale text), so the scan will be sharper than the JPG files you normally use on the Internet. JPG’s lossy compression always loses some detail, and the OCR software needs all the detail it can get.
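For a sense of scale, the size of an uncompressed scan can be worked out from the page size and the resolution. The short Python sketch below does the arithmetic; the letter-size page and the one-byte-per-pixel grayscale setting are assumptions chosen for the example.

```python
def scan_size_bytes(width_in, height_in, dpi=300, bytes_per_pixel=1):
    """Uncompressed size: pixels across times pixels down times bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# An 8.5 x 11 inch page at 300 dpi is 2550 x 3300 pixels.
gray_page = scan_size_bytes(8.5, 11)                      # 8,415,000 bytes
color_page = scan_size_bytes(8.5, 11, bytes_per_pixel=3)  # 25,245,000 bytes
print(gray_page, color_page)
```

A page of plain text holds a few thousand characters at roughly one byte each, so the converted text is a tiny fraction of the size of the scan.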

Once you’ve saved the scan (remember where you saved it), call up your OCR software and load the saved file. At this point, your OCR software is ready to convert the graphics file into the type of document you want, either a Word document or a plain text file. After the conversion, you’ll need to save the converted document with a new filename. That’s all there is to it. You can bring up your word processor and load the converted file just like any other word processing document.
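If you are comfortable with a little programming, the same load, convert, and save steps can be scripted. The Python sketch below is only one possible approach and is not the software bundled with your scanner: it assumes the free Tesseract OCR engine is installed, along with the pytesseract and Pillow packages. The function names text_path_for and convert_scan are made up for this example.

```python
from pathlib import Path


def text_path_for(image_path):
    """Pick a new filename for the converted text, e.g. deed.tif becomes deed.txt."""
    return Path(image_path).with_suffix(".txt")


def convert_scan(image_path):
    """Load a saved scan, run OCR on it, and save the text under a new filename."""
    # Assumption: Pillow, pytesseract, and the Tesseract engine are installed.
    # They are imported here so the rest of the sketch loads without them.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image)  # the picture becomes plain text

    out_path = text_path_for(image_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling convert_scan("deed.tif") would write the recognized text to deed.txt in the same folder, leaving the original scan untouched.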

The conversion from a graphic to text is usually less than perfect. I always load the converted document into Word and check it for errors. It’s amazing how well the conversion usually goes, especially when the scanned text has a mixture of different fonts and font sizes. Even graphics appearing in the text don’t present a problem for most OCR programs. If your document is divided into several columns per page, you might run into some additional problems. But again, it just depends on the OCR software you’re using.
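One common kind of OCR mistake is a digit landing in the middle of a word, such as a zero read in place of the letter ‘o’. As a rough aid to proofreading, the Python sketch below lists every word that mixes letters and digits. It is a simple heuristic of my own, not a substitute for a spell check, and it will also flag legitimate items such as ‘2nd’ or ‘A4’.

```python
import re

# A whole word that contains at least one letter and at least one digit.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def suspicious_words(text):
    """Return the words that mix letters and digits, in the order they appear."""
    return MIXED.findall(text)


sample = "The c0urthouse records from 1901 list 3 deeds and l2 wills."
print(suspicious_words(sample))  # prints ['c0urthouse', 'l2']
```

Plain numbers such as 1901 are left alone, so only the words that are likely to need a second look are reported.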

This process works well, but it can be S-L-O-W going. Switching between the scanning and OCR programs, and finally loading the finished document into a word processor to check the spelling, is time-consuming. But if your only other option is to retype the whole thing, it might not seem like such a daunting job after all. If you don’t see OCR listed in your scanner’s software, check your setup disk. It may have been an option that you just never installed. Standalone OCR software can also be purchased separately.