Using Tesseract

Introduction

The tesseract OCR engine is a very complicated software system, with more than 600 adjustable parameters. It can perform very well, but you often have to tweak some of those parameters. Just remember that a complex system that's infinitely adjustable is always out of adjustment.

Unfortunately, the documentation for tesseract isn't very clear, so it's difficult for beginners to learn what needs to be tweaked, or how to do it. This page explains some basic ways to improve its performance.

Pre-processing

The OCR process is surprisingly sensitive to the resolution and sharpness of the page images: even small defects that are barely visible to the human eye can seriously degrade the extracted text. If you have control over the original digital scanning, be sure you don't have specks of dust, hairs, or scratches on the glass window of a flat-bed scanner.

Also, make sure you have enough resolution to record all the details of the printed words on the page. Normally, average-sized print in books and journals can be scanned at 300 dpi; but unusually small print, or a typeface with very thin lines, will produce better results at 400 or 600 dpi. A rule of thumb is that the thinnest lines in the letter glyphs should be 2 pixels wide; this is usually possible if the x-height of the font is between 20 and 30 pixels.
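For example, if the x-height of the type is only about 1.5 mm (smallish book print), that comes to roughly 18 pixels at 300 dpi but about 24 pixels at 400 dpi, which is just inside the suggested range.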

And be careful to record page images using lossless compression. Often page images are compressed as JPEG images with some default quality (like 70 or 80) that is perfectly legible to the eye, and looks reasonable on casual inspection, but contains lots of compression artifacts that will make OCR detection of text very inaccurate. (The Internet is full of such images.) If you can't avoid JPEG compression, at least use the highest quality setting available.

If all you have is pages scanned at some distant library, you may discover that they're contaminated with noise — particularly if the pages are old, or if they were printed on rag paper that contained lots of little colored fibers. This typically shows up when tesseract complains that there are “lots of diacritics” when it searches for letters. In extreme cases, it may be useful to turn on the textord_heavy_nr setting, which is normally zero (i.e., off). But that is so heavy-handed that it usually makes an appreciable fraction of the text unreadable. [The ocrmypdf front end has its own -c (or --clean) option, which is much milder.]
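
As a rough illustration (the file names here are just placeholders), heavy noise removal can be switched on directly on the tesseract command line:

	tesseract noisy-page.png noisy-page -c textord_heavy_nr=1 pdf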

Often, you can tell whether the noise-removal process has also removed important information by checking the little dots in glyphs like periods, commas, colons, and semicolons. If numerical values tend to lose their decimal points, the cleaning process should be toned down (even if tesseract continues to complain about diacritical marks).

If you tell ocrmypdf to include the cleaned images in the final product (by using its -i option), you can see what kind of noise has caused errors in the OCR text. This may suggest what other parameters might be tweaked to reduce the noise further.
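
For instance, a run like the following (file names are placeholders) cleans the page images and keeps the cleaned versions in the output PDF, so you can inspect exactly what tesseract saw:

	ocrmypdf --clean --clean-final dirty-scan.pdf cleaned-ocr.pdf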

Custom tuning

What you need to adjust depends on what you are trying to do. If all you need is a file of text, you don't need PDF output; in that case, the hocr or tsv output formats may be more useful than plain text.

Most of my own use of tesseract has been to make PDFs of old books searchable. However, I've also tried to extract numerical tables from scans of technical references; a particular irritation has been the omission of tables and figures from the scans available at Google Books. (Google was evidently interested only in extracting text from them.)

tesseract and ocrmypdf

A more user-friendly interface to tesseract is provided by ocrmypdf, which can read a wide variety of image formats. However, some tesseract features are difficult to adjust from ocrmypdf.

For simple tasks, the ocrmypdf script is very handy. You can get some idea of what ocrmypdf is doing by turning on its verbose (i.e., -v) option, which produces a surprisingly large amount of output. But for more complicated problems, like extracting numerical values from tables, you need finer control over tesseract than ocrmypdf alone can provide.

Whatever you want to do, you need a better understanding of how both these commands operate, and how they can be controlled, than the regular documentation provides.

How to turn the knobs

If you only read the man pages, you get the impression that the only adjustments possible are those provided by command-line options, unless you construct a whole new configfile. In a sense, that's true; but in fact almost everything in tesseract can be adjusted from the command line by using its -c option, which can be repeated as many times as you need. In addition, ocrmypdf has a --tesseract-config option that lets you use a small local incremental configuration file to adjust just one or a few of tesseract's many parameters.
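
For example, several parameters can be set in a single tesseract run (the file names and the particular settings here are only illustrations):

	tesseract page.png page -c textord_heavy_nr=1 -c preserve_interword_spaces=1 pdf txt

The equivalent incremental configuration file is just a list of parameter names and values, one pair per line, such as

	textord_heavy_nr 1
	preserve_interword_spaces 1

which could be saved under an arbitrary name (say, my.cfg) and handed to ocrmypdf:

	ocrmypdf --tesseract-config my.cfg input.pdf output.pdf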

What to adjust

Options

Another problem with the man pages is that they give no indication of which adjustments affect almost every task, and which ones are useful only for very special purposes. For example, the tesseract documentation devotes lots of space to the -psm option (which can also be set with the --tesseract-pagesegmode option of ocrmypdf); but, apart from the fact that its 0 and 2 settings turn off OCR entirely, the other settings have almost no effect on the common task of adding OCR text to a plain PDF file of page images. (Note, however, that this option can be important for some special purposes; see https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/ for details.)

On the other hand, the -d (or --deskew) and -c (or --clean) options to ocrmypdf almost always improve the accuracy of the extracted text. So those options should be used routinely for most tasks. But notice that many PDF files available from Google Books and other libraries have already been properly cleaned, and don't need the -c option.
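
So a routine run on a typical scan might look something like this (file names are placeholders):

	ocrmypdf -d -c oldbook.pdf oldbook-ocr.pdf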

Vocabulary

In fact, you can augment the standard configuration simply by adding a list of words that occur in the document you're trying to OCR. Such a supplemental word-list often helps, particularly when the document contains proper nouns (place names or personal names) that aren't in the default dictionary for the language(s) you're dealing with.

This local wordlist will be different for every document, so it makes sense to provide a different list for each one. Reading the man page can mislead you into thinking that such a supplemental wordlist must be named with the suffix specified by the user_words_suffix parameter, and/or that it has to be placed in the tessdata directory — which would allow only one such file for every language. Actually, it's possible to have a single supplemental wordlist if it's in the tessdata directory and has the specified suffix; but that doesn't prevent you from having a local list with an arbitrary filename. You just have to specify the path to that file on the command line, either with tesseract's --user-words option, or with its -c option followed by a user_words_file= argument.

In recent versions of ocrmypdf, the local wordlist file can be named with a --user-words option on the ocrmypdf command line. If you are using version 3 of tesseract, you have to point to any local wordlist(s) in a local config file, which in turn can be named in a --tesseract-config option to ocrmypdf. Either way, a supplemental dictionary can be provided. (Note that you can have only one local wordlist file, and only one local config file. If you have more than one wordlist file, they should all be concatenated into one before being named on the command line or in the local configuration file.) The wordlists do not need to be sorted.
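
For example (the word-list name is a placeholder; the list itself is plain text, one word per line), the same local list can be passed either to tesseract directly or through ocrmypdf:

	tesseract page.png page --user-words mywords.txt txt
	ocrmypdf --user-words mywords.txt input.pdf output.pdf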

Bear in mind that these dictionary words are only hints to tesseract; it isn't a spell-corrector like aspell. However, you can “load the dice” by changing the relative weights assigned to dictionary and non-dictionary words. These are the language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word variables, which are only 0.1 and 0.15 by default. Increasing these values puts more weight on the dictionaries. CAUTION: putting too much weight on the dictionaries will make the engine turn noise, or real words that aren't in the dictionary, into dictionary words; so be careful.
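
These penalties, too, can be raised with -c; the values here are only guesses to illustrate the idea, and some experimenting will be needed:

	tesseract page.png page -c language_model_penalty_non_freq_dict_word=0.2 -c language_model_penalty_non_dict_word=0.3 txt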

A complication

I first thought that it would be easy to pick words for a supplemental dictionary: just select the words that tesseract failed to OCR correctly when run with the default configuration parameters. But when I tried this, I found that some words that had been detected correctly on the first run became OCR errors on a second one that used a local dictionary of words it had missed on the first.

The problem is that tesseract stores several different internal images for every letter, because a document might contain the glyph in several different font sizes and styles. So if you tell it to pay more attention to the bad glyphs, that shifts its mapping of shapes on the page to characters in the text encoding. Putting words that were missed into the dictionary shifts the detection criteria away from well-formed glyphs and favors bad ones.

Evidently, we need to tell it to pay a little more attention to the shapes it misinterpreted initially, while continuing to pay attention to the things it got right (rather than ignoring them). So all the correct words need to be included in the wordlist file; but we also need to include the words we can read but tesseract couldn't.

In short: we need to make it look for just the words that actually occur in the file, and ignore all other possibilities. Ideally, the dictionary should contain all the real words, and no others.

Then, in principle, we could force tesseract to accept only the words in this perfect dictionary by raising the penalties on non-dictionary words.

That still won't guarantee perfect character recognition, for two reasons. First, we won't have a list of all the real words in the PDF without manually checking every word in it; spell-checkers always seem to miss a few uncommon words that occur in real texts. And second, even if the new dictionary list were perfect, there still might be indistinct glyphs in the page images that tesseract could mistake for other characters that form a correctly spelled (but wrong) word. Human readers can usually fix such errors by understanding the context of the ambiguous word, but machines don't understand anything.

Note: using a local dictionary tuned to the right context on the first pass could also help appreciably.

Multiple Columns

Many books and journals are printed with two or more columns per page. But tesseract does not handle multi-column layouts reliably; so the best way to OCR this kind of document is to split the page into columns before invoking tesseract.

And there is another front end to tesseract called pdfsandwich that can split PDF pages into two columns. The problem with using it is that it does not pass as many options to tesseract itself as ocrmypdf does. So you might want to first use pdfsandwich to split the pages vertically, and then invoke tesseract on the separated columns to get the desired results.

Tables

Layout

One big problem with OCRing tables is that tesseract produces little information about document structure. However, recent versions can write a tab-separated output file that contains some layout information. So, consider using either hocr or tsv output instead of plain text.

Another problem with tables is that they have lots of whitespace between columns. You can help handle this by setting the parameter preserve_interword_spaces to 1 — but it does not preserve space at the left end of a line; instead, the lines are all left-justified (because there is no interword space to the left of the first word on a line). However, if you extract the text by using pdftotext with its -layout option, some whitespace at the left margin appears in the text.

A related problem is that tesseract often breaks up the rows of tables, usually because the lines of type were not perfectly level. (That can be cured by using the -d option to ocrmypdf, which de-skews the tilted text.) You can keep all the parts of a table row together by setting the -psm mode to 4 or 6, instead of the default 3. Mode 4 keeps rows together even if they contain a variety of fonts; if you can rely on one font being used across an entire row, use psm mode 6.
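
Putting these table-related settings together, a run on a single table image might look something like this (file names are placeholders; newer versions of tesseract spell the option --psm rather than -psm):

	tesseract table.png table --psm 6 -c preserve_interword_spaces=1 tsv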

There is a parameter called textord_tablefind_recognize_tables, which is normally turned off, but can be turned on by setting it to 1 (i.e., True). Similarly, there is another called textord_show_tables. It appears that these only find tables with psm set to 1 through 4. These parameters seem to be used only in the layout analysis, without affecting the actual OCR of tables.

To extract the OCRed text from a PDF that has been processed by tesseract, you can use pdftotext with its -layout option:

	pdftotext -layout ocred-file.pdf

(the file name here is just a placeholder; the extracted text ends up in ocred-file.txt). This separates fields with <TAB> characters (or their equivalent: 8 spaces), which makes the result fairly unwieldy.

Another way to copy table data from an OCRed image is to display the image with a browser, and copy the (invisible) OCR text to the clipboard by scanning the cursor along the rows on the displayed page. Then you can re-copy the text to a file from the clipboard. This involves a lot of mouse work, but it's still better than trying to copy a table manually.

Entropy

A less obvious reason why tables are more difficult than text to OCR correctly is that text is about 75% redundant, while numerical data are quite unpredictable. However, there is a way to overcome this problem.

Whitelisting and blacklisting

Text can contain several dozen different glyphs, even if it is set in only a single font. Numerical tables contain only the ten digits from 0 to 9, plus a decimal point, and possibly + or − signs. If you have control over the OCR process, you can restrict the set of glyphs that tesseract looks for by setting the string-valued parameter tessedit_char_whitelist to something like 0123456789.+- (note that the decimal point is included). These will be the only characters tesseract will produce in its output; all other characters are blacklisted, and cannot appear in the OCR text.

A similar but less restrictive way to focus tesseract's attention on numbers is to set the parameter classify_bin_numeric_mode to 1 in your supplemental config file or command-line option. This, plus careful blacklisting of inappropriate characters, can produce fairly good OCR of numerical tables.
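
For instance, a small supplemental config file for a purely numerical table might contain just (the exact character set is up to you)

	tessedit_char_whitelist 0123456789.+-
	classify_bin_numeric_mode 1

and can be named as a configfile argument at the end of the tesseract command line, or in ocrmypdf's --tesseract-config option.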

Notice that there are two “blacklisting” parameters: tessedit_char_blacklist and tessedit_char_unblacklist, which are often misunderstood. Blacklisting a character only prevents it from being produced in tesseract's output; but some other character will be produced in its place. Sometimes the minus signs in numerical tables get OCR'd as em-dashes or underscores; then blacklisting those will prevent them from appearing where minus signs should be in the extracted text.
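
For example, if the minus signs in a table keep coming out as em-dashes or underscores, you might blacklist those two characters (file names are placeholders):

	tesseract table.png table -c 'tessedit_char_blacklist=—_' txt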

Any character that is normally killed in a blacklist can be revived by putting it in a tessedit_char_unblacklist string. This can make the OCR engine look for some special glyph, like a case fraction or a special Unicode symbol.

Column headings

While we are thinking about special treatments, what about column headings in tables? These contain words, or sometimes just abbreviations of words, or symbols. All those things might be left in the part of the page image used to OCR a table, because they will help you proofread the final result. But then you'd need to leave the characters used in the headings whitelisted, which might degrade the overall recognition accuracy.

One way to work around this problem is to put the words, or word fragments, used in the column headings into the supplemental wordlist described above; having those strings in the dictionary should help tesseract get them right.

Restricted numerical ranges

Numbers in tables are often pretty random; but not always. You can use some regularities to help steer tesseract away from trouble.

For example, a table column that contains only hours and minutes must have only 1- or 2-digit numbers smaller than 13 in the hours column, and less than 60 in the minutes. You can put those small sets of numbers into your supplemental “user-words” dictionary, as long as the OCR engine is looking for numbers as well as dictionary words. Even if there are other columns in the table with less limited numerical values, this might still help prevent a mis-OCRed value of 84 appearing in a “minutes” column.
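
One quick way to build such a list, assuming the usual Unix seq command is available (the file name is arbitrary), is:

	seq 1 12 >  hours-and-minutes.txt
	seq 0 59 >> hours-and-minutes.txt

which produces one number per line, the same format used for the supplemental wordlists described earlier.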

 

Copyright © 2023 – Andrew T. Young

