Genealogy: Digitizing text with your scanner and OCR software
By Barry J. Ewell

As a genealogist, I have my fair share of copied and printed material that ranges from family history books and stories to documentation. I am fine with most of the material simply being stored as digital images; however, there are situations where I prefer to turn a document into a Word document (doc) for easy distribution and sharing. In these instances, I use Optical Character Recognition (OCR).

Recently I scanned a journal that included photos, histories, and related material for several generations. Much of the text was typewritten. After I scanned each page:

  • I used my image editing software to separate the photos and save them as separate files.
  • Then I imported the images into OCR software, from which I was able to create a Word document.

The OCR software recognized 98% of the text. By my calculations, that cut the time it would have taken me to transcribe the journal by 95%.

In another situation, I had thirty family histories in my files (each two to three pages long) that were copies of copies from the 1960s. After I scanned the histories, I imported them into OCR software, which recognized over 95% of the characters even though they were faded. I was able to correct the text and use the spell-check function built right into the OCR software.
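
As a rough illustration of what these recognition rates mean for cleanup work, the sketch below estimates how many characters would still need correcting by hand. The function name and the figure of 2,000 characters per page are assumptions made for the example, not numbers from the scans described above.

```python
def expected_errors(total_chars: int, recognition_rate: float) -> int:
    """Estimate how many characters are left to correct by hand.

    recognition_rate is a fraction between 0 and 1 (0.98 means 98%).
    """
    if not 0.0 <= recognition_rate <= 1.0:
        raise ValueError("recognition_rate must be between 0 and 1")
    return round(total_chars * (1.0 - recognition_rate))


# Hypothetical page of about 2,000 characters (an assumed figure).
clean_typescript = expected_errors(2000, 0.98)
faded_copy = expected_errors(2000, 0.95)
print(clean_typescript, faded_copy)
```

Under those assumptions, a 98% rate leaves about 40 characters per page to fix and a 95% rate leaves about 100, which is still far less work than retyping the page.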

The most common OCR programs are OmniPage and TextBridge. The automatic mode in these programs is fairly foolproof, but not infallible: accuracy depends on the quality of the original document. OCR tries to match the font and layout, but the match probably won’t be exact, and you will need to check the spelling after you have saved the scanned text as a Word (or equivalent) document. You may also need to remove unnecessary tabs if your OCR program inserted them. Just remember that these problems are still better than typing a document from scratch.
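
If the OCR output has been saved as plain text, stray tabs can also be removed with a short script instead of by hand. The following is a minimal sketch in Python; the function name is hypothetical, and it assumes you do not need to preserve any tab-aligned columns, since it flattens them.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Replace tabs with spaces and collapse runs of spaces, line by line."""
    cleaned_lines = []
    for line in text.splitlines():
        line = line.replace("\t", " ")      # turn each tab into a space
        line = re.sub(r" {2,}", " ", line)  # collapse repeated spaces
        cleaned_lines.append(line.strip())  # drop leading/trailing spaces
    return "\n".join(cleaned_lines)


sample = "Born\t\tin  1842\t\nMarried\t1865"
print(clean_ocr_text(sample))
```

Line breaks are kept as they are, so paragraphs stay separate; only the spacing inside each line changes.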

Most scanners come with some level of OCR software. A basic OCR package will usually block out text on a page (to distinguish it from any photos or other non-text elements) and translate it automatically. The software should be able to reasonably imitate the typefaces used on the page (for example, keeping headlines larger) and preserve separate paragraphs, outlines, etc.

More complex OCR software lets users specify exactly which areas of a scanned document should be converted into text, or can convert text in other languages and alphabets, such as Russian, Hebrew, or Arabic.

A few OCR tips  
When you use OCR, you can scan and save documents and then import them into the software, or you can scan the document directly into the OCR software. For example, after you open the OCR program, put the first page in the scanner and click on the “Auto” button. The program will first scan the page as an image; then the OCR magic begins. When all the text from the first page has appeared (mistakes and all), put the second page in the scanner and press Auto again . . . and so on, until you have finished scanning the entire original. Note: there is usually a limit to the number of pages the program can hold in memory, but you do not want a separate document for each page.

Next, go to the “File” menu and select “Save As.” Choose Word doc or an equivalent format. You can now use Word to open the new document and start cleaning it up.
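
If the page limit forces you to save the text in several batches, the pieces can be stitched back into one document afterward. Here is a minimal sketch in Python, assuming each batch has been saved or exported as plain text; the function name and sample strings are hypothetical.

```python
from typing import List


def combine_pages(pages: List[str]) -> str:
    """Join per-page (or per-batch) OCR text into a single document.

    Empty pages are skipped, and a blank line separates each piece.
    """
    non_empty = [page.strip() for page in pages if page.strip()]
    return "\n\n".join(non_empty)


batches = ["Page one text.\n", "", "  Page two text."]
print(combine_pages(batches))
```

The combined text can then be pasted into Word for cleanup as a single document.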

The following are a few tips on how to use OCR more effectively:

  • Align the text to be scanned squarely on the scanner. Lines of text that run uphill or downhill have higher error rates.
  • Start with a good original. Is the paper wrinkled? Try ironing it (warm, not hot iron) or pressing between heavy books. Also, erase any smudges.
  • Photocopy text that curves into the binding of a book. Such text has higher error rates. In some cases you can improve the results by photocopying the page and then scanning the photocopy.
  • Photocopying may help originals that are on thin paper. It can help when a thin original has printing on the reverse side: sometimes the scanner picks up the ink on the reverse side, which confuses the OCR process.
  • Make the scan the best you can. Make sure the scanner bed/glass is clean and smudge-free. Keep the document straight and even, so you don’t end up with a “skewed” image. Adjust the color/contrast/brightness to ensure the text is dark and the background is light/white and free of “artifacts” (such as a pattern in the paper). Scan at 300 dpi or better.
  • Error rates will rise with some letter forms. Error rates are higher when letterforms degrade (as with poor photocopies), or when the original is of poor quality (like copy produced on onion skin with a typewriter that has a clogged “E”).
  • Turn one document into many. If you have an older version of OCR software, try breaking the scanned original down into smaller chunks (crop out non-text elements or save columns of text as individual images) and running your OCR software on each part separately. You’ll lose formatting but gain a more accurate text document. You can avoid this trade-off by investing in newer OCR software, which is getting better and better at retaining the formatting of forms and tables. Knowing the benefits of updated OCR programs may give you the incentive to trade in your old OCR software for a newer solution.
  • Proofread. No matter how accurate the program, all are fallible. Proofread, proofread, and proofread the finished document.
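
To support the proofreading step, a short script can point out words that deserve a second look. The sketch below is one possible approach, not a feature of any particular OCR program: it flags words that mix letters and digits, a common sign of OCR confusion between characters such as "l" and "1". The function name and sample sentence are made up for illustration.

```python
import re
from typing import List

# Matches whole words that contain at least one letter and at least one digit,
# such as "Wi11iam" or "l842".
MIXED_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def flag_suspicious_tokens(text: str) -> List[str]:
    """Return words that mix letters and digits, in order of appearance."""
    return MIXED_TOKEN.findall(text)


sample = "Wi11iam Ewell was born in l842 and died in 1910."
for token in flag_suspicious_tokens(sample):
    print("Check:", token)
```

This will also flag legitimate forms such as "1960s" or "2nd", so treat the result as a list of places to look, not a list of errors. It is no substitute for reading the finished document yourself.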