Editing for Liberty #13: Fixing Common Mistakes in Optical Character Recognition

This week I worked on a couple of classic books that were converted from scans or old PDFs using optical character recognition (OCR). OCR is the computerized process that “reads” an image of a text and outputs actual machine-encoded text that can be republished in a new format.

OCR is what allows us to take a faded old manuscript, rescue the text, and make a sleek new ebook out of it. But the process is far from perfect.

This week, I had to reteach myself what mistakes the machine commonly makes. Here are some of the most frequent things I found.

First, OCR can delete whole lines of text (particularly the last line of a page). It can also delete words. If a phrase seems off, check the original: chances are that a word (or more) got dropped or changed in the scanning process. OCR can also repeat a line — sometimes several sentences later where it almost makes sense.
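Dropped lines can only be found by checking against the original, but repeated lines can be hunted down mechanically. Here is a minimal sketch in Python, assuming the OCR output is available as plain text; the function name and the `min_length` cutoff are my own inventions for illustration, not part of any tool mentioned here.

```python
from collections import defaultdict

def find_repeated_lines(text, min_length=20):
    """Map each repeated line to the line numbers where it appears.

    Lines shorter than min_length (an arbitrary, assumed cutoff) are
    skipped so that blank lines and short headings are not flagged.
    """
    seen = defaultdict(list)
    for number, line in enumerate(text.splitlines(), start=1):
        # Collapse runs of whitespace so spacing differences don't hide a repeat.
        normalized = " ".join(line.split())
        if len(normalized) >= min_length:
            seen[normalized].append(number)
    return {line: numbers for line, numbers in seen.items() if len(numbers) > 1}
```

A repeat that the machine also garbled slightly will slip past an exact comparison like this one, so it supplements a read-through rather than replacing it.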

Roth- bard

When a document is scanned, words that were split across a line break with a hyphen keep that hyphen, even though the line break itself is gone. The easiest way to find all of these quickly is to do a search on hyphens (or even hyphen + space) and delete the unnecessary ones.
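The hyphen + space search can also be scripted. The sketch below, in Python, lists each suspect and a suggested join without changing anything; the pattern is an assumption about how these breaks usually look (fragment, hyphen, space, lowercase continuation), and the function name is made up for this example.

```python
import re

# A word fragment, a hyphen, whitespace, then a lowercase continuation,
# as in "Roth- bard". Ordinary compounds like "free-market" have no
# space after the hyphen, so they are not matched.
BROKEN_HYPHEN = re.compile(r"\b(\w+)-\s+([a-z]\w*)")

def find_broken_hyphens(text):
    """Return (matched text, suggested join) pairs for a human to review."""
    return [
        (m.group(0), m.group(1) + m.group(2))
        for m in BROKEN_HYPHEN.finditer(text)
    ]

sample = "Murray Roth- bard wrote about free-market economics."
print(find_broken_hyphens(sample))
```

The suggestions still need a human eye: a phrase like "pre- and postwar" legitimately has a hyphen followed by a space and would be flagged too.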

(Hayek [I96811978)

This citation contains three different errors.

First, the machine scanned the number one as a capital I. Using a font like Courier when you audit will help you see this. Or you can do a search on I^# (in Microsoft Word’s Find box, ^# stands for any digit), which will find any instance where the letter I is followed immediately by a number.

Then, the closing bracket (]) was changed into a numeral one (1). This isn’t something you can search for, but it is something to look for wherever there are dates in the original text.

Finally, somehow a space also got lost. Corrected, this entry should read (Hayek [1968] 1978).
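A plain search can't single out a bracket that turned into a 1, since it looks like any other digit. A short script can at least flag the symptoms, though. This Python sketch, with a hypothetical function name, reports lines where a capital I sits directly before a digit or where the square brackets don't pair up. It assumes each citation fits on a single line.

```python
import re

# Capital I directly followed by a digit, as in "I968"
# (assumed here to be a misread numeral one).
I_BEFORE_DIGIT = re.compile(r"I\d")

def audit_citations(text):
    """Return (line_number, problem, line) tuples for a human to review."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if I_BEFORE_DIGIT.search(line):
            findings.append((number, "capital I before a digit", line))
        # A "]" misread as "1" leaves an opening bracket with no partner.
        if line.count("[") != line.count("]"):
            findings.append((number, "unbalanced square brackets", line))
    return findings

sample = "(Hayek [1968] 1978)\n(Hayek [I96811978)"
for finding in audit_citations(sample):
    print(finding)
```

The lost space is not detected here; once a line is flagged, the fix still comes from comparing it with the original.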

Spellcheck vs. OCR

Spellcheck is usually a good partner for locating OCR errors, which are often hard to catch if the font you’re using doesn’t show them well.

For example, the word “ldeas” should, of course, read “Ideas.” I had been reading the document using Arial, in which lowercase l (el) and capital I look nearly identical. I couldn’t see the error, but spellcheck pointed it out for me.

Here’s another example: “are anlong man’s natural rights” should be “are among man’s natural rights.”

The big lesson I (re)learned this week was that even though something looked right and I thought spellcheck was being ornery, it paid off to actually check what the spellcheck thought was wrong with the word.

Of course, there are also errors that spellcheck can’t help you with, because the wrong word is still a real word. The one that pops up most often (and that BK has written an automated script to catch) is “modem” instead of “modern,” since OCR tends to read the letters “rn” as “m.” You can just do a search on the word “modem.”
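I haven't seen BK's script, so what follows is not it. It is only a rough Python sketch of how such a check might look, with a made-up function name and an arbitrary amount of surrounding context. It finds the whole word “modem” in any capitalization and returns each hit with a little text on either side, so you can judge whether “modern” was meant.

```python
import re

# Whole-word match only; \b keeps "modems" and similar words from matching.
MODEM = re.compile(r"\bmodem\b", re.IGNORECASE)

def find_modem(text, context=20):
    """Return each occurrence of "modem" with `context` characters on each side."""
    hits = []
    for m in MODEM.finditer(text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        hits.append(text[start:end])
    return hits
```

In a book on economics nearly every hit will be a scanning error, but a text that really does discuss modems needs each one checked by hand.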

Name That Reference

From an old Mises Daily:

The long run is at hand for Keynesianism, and it is ready to be interred.

What famous line is the author alluding to here?

Just for Fun

Hey, What Happened to “Spot the Error”?

Since I’m focused on OCR errors this week, I’ve skipped “Spot the Error” today. But no worries, it’ll be back next week, full of juicy misteaks.

Is there something you’d like to contribute or see covered in Editing for Liberty? Post a comment! Cartoons, quips, and contributions to “Spot the Error” and “Name That Reference” are especially welcome.
