This week I worked on a couple of classic books that were converted from scans or old PDFs using optical character recognition (OCR). OCR is the computerized process that “reads” an image of a text and outputs actual machine-encoded text that can be republished in a new format.
OCR is what allows us to take a faded old manuscript, rescue the text, and make a sleek new ebook out of it. But the process is far from perfect.
In the process, I had to reteach myself which mistakes the machine commonly makes. Here are some of the most frequent errors I found.
First, OCR can delete whole lines of text (particularly the last line of a page). It can also delete words. If a phrase seems off, check the original: chances are that a word (or more) got dropped or changed in the scanning process. OCR can also repeat a line — sometimes several sentences later where it almost makes sense.
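Dropped lines can only be caught by checking against the original, but repeated lines can be flagged mechanically. Here is a minimal Python sketch of that idea. The function name, the min_length cutoff, and the assumption that the converted text still has one printed line per line of the file are all mine, not part of any existing tool.

```python
from collections import defaultdict

def find_repeated_lines(text, min_length=20):
    """Map each repeated line (whitespace-normalized, lowercased) to the
    1-based line numbers where it appears. Short lines are ignored because
    headings and page numbers repeat legitimately."""
    seen = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        key = " ".join(line.split()).lower()
        if len(key) >= min_length:
            seen[key].append(line_no)
    return {key: nums for key, nums in seen.items() if len(nums) > 1}

# Made-up sample text, for illustration only.
sample = (
    "Prices convey information that no planner has.\n"
    "Chapter 2\n"
    "They coordinate plans across the whole economy.\n"
    "Prices convey  information that no planner has.\n"
)
for line, numbers in find_repeated_lines(sample).items():
    print(numbers, line)
```

Anything it reports still needs a human look, since an author may repeat a sentence on purpose.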
When a document is scanned, words that once wrapped from one line to the next, connected by a hyphen, keep their hyphen. The easiest way to quickly find all of these is to do a search on hyphens (or even hyphen + space) and delete the unnecessary ones.
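If the text is available as a plain file, the hyphen + space search can also be scripted. The following is a rough Python sketch. The pattern and the function name are illustrative assumptions, and it only lists candidates rather than deleting anything.

```python
import re

# A word fragment, a hyphen, at least one space, then a lowercase continuation.
# Ordinary compounds such as "well-known" have no space and are not matched.
LEFTOVER_HYPHEN = re.compile(r"\b([A-Za-z]+)-\s+([a-z]+)\b")

def find_leftover_hyphens(text):
    """Return (text as found, joined candidate) pairs for a human to review."""
    return [(m.group(0), m.group(1) + m.group(2))
            for m in LEFTOVER_HYPHEN.finditer(text)]

# Made-up sample sentence, for illustration only.
sample = "The govern- ment relied on a well-known eco- nomic theory."
for found, joined in find_leftover_hyphens(sample):
    print(f"{found!r} -> {joined!r}")
```

Review each hit before joining the halves: a suspended hyphen, as in “pre- and postwar,” matches the same pattern and should stay.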
Here there are three different errors.
First, the machine scanned the number one as a capital I. Using a font like Courier when you audit will help you see this. Or you can do a search on I^# (in Microsoft Word’s Find box, ^# stands for any digit), which will find any instance where the letter I is followed immediately by a number.
Then, the closing bracket (]) was changed into a numeral one (1). This isn’t something you can search for, but it is something to look for wherever there are dates in the original text.
Finally, somehow a space also got lost. Corrected, this entry should read (Hayek 1978).
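Outside a word processor, the search for a capital I followed immediately by a number can be approximated with a regular expression. This is only a sketch in Python. The function name and the context parameter are hypothetical, and the sample string is invented rather than taken from the book.

```python
import re

# Capital I immediately followed by a digit, e.g. "I978" where "1978" was meant.
I_BEFORE_DIGIT = re.compile(r"I\d")

def find_i_before_digit(text, context=10):
    """Return (position, surrounding snippet) for every capital I that is
    directly followed by a digit."""
    hits = []
    for m in I_BEFORE_DIGIT.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        hits.append((m.start(), text[start:end]))
    return hits

# Invented sample, for illustration only.
sample = "See Hayek I978 for details."
for position, snippet in find_i_before_digit(sample):
    print(position, repr(snippet))
```

The snippet gives enough surrounding text to locate the spot in the manuscript and compare it with the original scan.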
Spellcheck vs. OCR
Spellcheck is usually a good partner for locating OCR errors, which are often hard to catch if the font you’re using doesn’t show them well.
For example, the word “ldeas” should, of course, read “Ideas.” I had been reading the document using Arial, in which lowercase l (el) and capital I look similar. I couldn’t see the error, but spellcheck pointed it out for me.
Here’s another example: “are anlong man’s natural rights” should be “are among man’s natural rights.”
The big lesson I (re)learned this week was that even though something looked right and I thought spellcheck was being ornery, it paid off to actually check what the spellcheck thought was wrong with the word.
Of course, there are also the errors that spellcheck can’t help you with. The one that pops up the most often (and that BK has written an automated script to catch) is “modem” instead of “modern.” You can just do a search on the word “modem.”
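I haven’t seen BK’s script, so the following is not it, just a minimal Python sketch of how such a check might work. The word list, the function name, and the extra “bum” entry are my own assumptions. Only “modem” comes from the example above.

```python
import re

# Correctly spelled words that OCR tends to produce by mistake, mapped to the
# word that was probably intended. Extend the list as new culprits turn up.
SUSPECTS = {
    "modem": "modern",  # the example discussed above
    "bum": "burn",      # assumed extra example of the same rn-read-as-m mix-up
}

def flag_suspects(text, suspects=SUSPECTS):
    """Return (line number, word as found, likely intended word) tuples."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, suspects)) + r")\b", re.IGNORECASE
    )
    flags = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in pattern.finditer(line):
            flags.append((line_no, m.group(0), suspects[m.group(0).lower()]))
    return flags

# Invented sample, for illustration only.
sample = "The modem state grew.\nHe bought a new modem for the office."
for line_no, found, intended in flag_suspects(sample):
    print(f"line {line_no}: {found!r} (did the original say {intended!r}?)")
```

The second line of the sample uses the word correctly, which is why a script like this should only flag candidates and leave the decision to the editor.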
Name That Reference
From an old Mises Daily:
The long run is at hand for Keynesianism, and it is ready to be interred.
What famous line is the author alluding to here?
Just for Fun
Hey, What Happened to “Spot the Error”?
Since I’m focused on OCR errors this week, I’ve skipped “Spot the Error” today. But no worries, it’ll be back next week, full of juicy misteaks.
Is there something you’d like to contribute or see covered in Editing for Liberty? Post a comment! Cartoons, quips, and contributions to “Spot the Error” and “Name That Reference” are especially welcome.