Using Tesseract

Introduction

The tesseract OCR engine is a very complicated software system, with more than 600 adjustable parameters. It can perform very well, but you often have to tweak some of those parameters. Just remember that a complex system that's infinitely adjustable is always out of adjustment.

Unfortunately, the documentation for tesseract isn't very clear, so it's difficult for beginners to learn what needs to be tweaked, or how to do it. This page explains some basic ways to improve its performance.

Pre-processing

The OCR process is surprisingly sensitive to the resolution and sharpness of the page images: even small defects that are barely visible to the human eye can seriously degrade the extracted text. If you have control over the original digital scanning, make sure there are no specks of dust, hairs, or scratches on the glass window of a flat-bed scanner.

Also, make sure you have enough resolution to record all the details of the printed words on the page. Normally, average-sized print in books and journals can be scanned at 300 dpi; but unusually small print, or a typeface with very thin lines, will produce better results at 400 or 600 dpi. A rule of thumb is that the thinnest lines in the letter glyphs should be 2 pixels wide; this is usually possible if the x-height of the font is between 20 and 30 pixels. (That typically corresponds to a space between lines of type of at least 50 or 60 pixels.)

Even higher resolution is needed when imaging pages of text that contain small print, such as footnotes, or technical papers with mathematical symbols that have small subscripts and superscripts. Another problem occurs when more than one language is involved: it's difficult to distinguish between an unadorned vowel and one festooned with accents, or between Latin letter “a” and Greek letter “α”. If tesseract has to distinguish among a larger variety of glyphs, it needs more pixels to tell similar ones apart.

And be careful to record page images using lossless compression. Often page images are compressed as JPEG images with some default quality (like 70 or 80) that is perfectly legible to the eye, and looks reasonable on casual inspection, but contains lots of compression artifacts that will make OCR detection of text very inaccurate. (The Internet is full of such images.) If you can't avoid JPEG compression, at least use the highest quality setting available.

If all you have is pages scanned at some distant library, you may discover that they're contaminated with noise — particularly if the pages are old, or if they were printed on rag paper that contained lots of little colored fibers. This typically shows up when tesseract complains that there are “lots of diacritics” when it searches for letters. In extreme cases, it may be useful to turn on the textord_heavy_nr setting, which is normally zero (i.e., off). But that is so heavy-handed that it usually makes an appreciable fraction of the text unreadable. [The ocrmypdf front end has its own -c (or --clean) option, which is much milder.]
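
If you do want to experiment with that heavy-handed cleaning, one quick way to try it on a single page image is a command like this sketch (the file names are only placeholders):

	# turn on heavy noise removal for one run:
	tesseract noisy-page.png page -c textord_heavy_nr=1 pdf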

Often, you can tell whether the noise-removal process has also removed important information by checking the little dots in glyphs like periods, commas, colons, and semicolons. If numerical values tend to lose their decimal points, the cleaning process should be toned down (even if tesseract continues to complain about diacritical marks).

If you tell ocrmypdf to include the cleaned images in the final product (by using its -i option), you can see what kind of noise has caused errors in the OCR text. This may suggest what other parameters might be tweaked to reduce noise further. For a list of all the hundreds of parameters in the version you are using, enter

                    tesseract --print-parameters

Custom tuning

What you need to adjust depends on what you are trying to do. If all you need is a file of text, you don't need PDF output. But the hocr or tsv types of text output may be more useful than plain-text output.

Most of my own use of tesseract has been to make PDFs of old books searchable. However, I've also tried to extract numerical tables from scans of technical references; a particular irritation has been the omission of tables and figures from the scans available at Google Books. (Google was evidently interested only in extracting text from them.)

tesseract and ocrmypdf

A more user-friendly interface to tesseract is provided by ocrmypdf, which can read a wide variety of image formats. However, some tesseract features are inconvenient to adjust from ocrmypdf.

For simple tasks, the ocrmypdf script is very handy. You can get some idea of what ocrmypdf is doing by turning on its verbose (i.e., -v) option, which produces a surprisingly large amount of output. But for more complicated problems, like extracting numerical values from tables, you need finer control over tesseract than ocrmypdf alone can easily provide.

Whatever you want to do, you need a better understanding of how both these commands operate, and how they can be controlled, than the regular documentation provides.

How to turn the knobs

If you only read the man pages, you get the impression that the only adjustments possible are those provided by command-line options, unless you construct a whole new configfile. In a sense, that's true; but in fact almost everything in tesseract can be adjusted from the command line by using its -c option — which you can use as many times as you need. Furthermore, ocrmypdf has a --tesseract-config option that lets you use a small local incremental configuration file to adjust just one or a few of tesseract's many parameters.
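
For example, here is a sketch of both styles (the file names, and the particular parameters chosen, are just placeholders):

	# several -c options on one tesseract command line:
	tesseract page.png page -c textord_heavy_nr=1 -c preserve_interword_spaces=1 pdf

	# or put the same two settings in a small local config file, and pass it
	# through ocrmypdf (separate each name from its value with a TAB):
	printf 'textord_heavy_nr\t1\npreserve_interword_spaces\t1\n' > local.cfg
	ocrmypdf --tesseract-config local.cfg input.pdf output.pdf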

What to adjust

Options

Another problem with the man pages is that they give no indication of which adjustments affect almost every task, and which ones are useful only for very special purposes. For example, the tesseract documentation devotes lots of space to the --psm option (written -psm in older versions; it can also be set with the --tesseract-pagesegmode option of ocrmypdf); but, apart from the fact that its 0 and 2 settings turn off OCR entirely, it has almost no effect on the common task of adding OCR text to a plain PDF file of page images. (Note, however, that this option can be important for some special purposes; see https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/ for details.) A brief description of the PSM settings is produced by
	tesseract --help-extra
or
	tesseract --help-psm

On the other hand, the -d (or --deskew) and -c (or --clean) options to ocrmypdf almost always improve the accuracy of the extracted text. So those options should be used routinely for most tasks. But notice that many PDF files available from Google Books and other libraries have already been properly cleaned, and don't need the -c option.
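
For example, a routine invocation (with placeholder file names, and assuming English text) might be:

	ocrmypdf -d -c -l eng scan.pdf scan-ocr.pdf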

Despite the utility of the -d option, it can make ocrmypdf choke on a page that contains no text, such as the blank back of a halftoned image, or a page that contains only diagrams. Don't try to OCR text that doesn't exist: either omit the --deskew option, or edit out such pages before running ocrmypdf on the rest of the document.

Another problem with ocrmypdf is that its routines for re-writing a PDF file can't handle Unicode characters reliably. Anything that's not plain ASCII text can make this problem occur. In particular, asking ocrmypdf to deskew a page image, or to remove existing text on a page (e.g., by using the --force-ocr option) will make it re-write the page image. But if the page contains multi-byte UTF-8 codes, like EN- or EM-dashes, the Section sign (§), accented letters, etc., sometimes tesseract will OCR the whole file, and then ocrmypdf will choke on the UTF-8 codes when it tries to write the text into the regenerated PDF/A file. You will get some unhelpful error message like “An exception occurred while executing the pipeline”, or a vague complaint that a subprocess failed.

This error can be very mysterious, because it depends on the positions of the offending Unicode glyphs in the input file: a page that triggers the error can be converted to a PDF/A without problems by itself, but may cause the error when other pages of the original PDF precede it.

The solution is to avoid re-generating the PDF image. Instead of asking ocrmypdf to deskew the page image, or to override existing text with its -f option, re-use the original page images by invoking the --redo-ocr option. If the input PDF has badly skewed (tilted) text, you will have to straighten the lines of text by rotating the page images before attempting to re-do the OCR.

Alternatively, if you need to deskew and/or clean the input PDF image, you can run ocrmypdf twice. First, to improve the image without attempting to OCR it, set the ocrmypdf option --tesseract-timeout to 0 seconds. Then, add OCR text to the resulting improved PDF without changing the image, as described in the paragraph above.
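
A sketch of that two-pass procedure (the file names are only placeholders) might look like this:

	# pass 1: deskew and clean the page images, but skip the OCR step entirely
	ocrmypdf --deskew --clean --clean-final --tesseract-timeout 0 original.pdf improved.pdf

	# pass 2: add the OCR text without rewriting the (already improved) images
	ocrmypdf --redo-ocr improved.pdf final.pdf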

Vocabulary

You can augment the standard configuration simply by adding a list of words that occur in the document you're trying to OCR. Such a supplemental word-list often helps, particularly when the document contains proper nouns (place names or personal names) that aren't in the default dictionary for the language(s) you're dealing with.

This local wordlist will be different for every document, so it makes sense to provide a different list for each one. Reading the man page can mislead you to think that such a supplemental wordlist must be named with the suffix specified by the user_words_suffix parameter, and/or that it has to be placed in the tessdata directory — which would allow only one such file for every language. Actually, it's possible to have a single supplemental wordlist if it's in the tessdata directory and has the specified suffix; but that doesn't prevent you from having a local list with an arbitrary filename. You just have to specify the path to that file on the command line, either with tesseract's --user-words option, or with its -c option followed by a user_words_file= argument.

In recent versions of ocrmypdf, the local wordlist file can be named with a --user-words option on the ocrmypdf command line. If you are using version 3 of tesseract, you have to point to any local wordlist(s) in a local config  file, which in turn can be named in a --tesseract-config option to ocrmypdf. Either way, a supplemental dictionary can be provided. (Note that you can have only one local wordlist file, and only one local config file. If you have more than one wordlist file, they should all be concatenated before being named in a command line or the local configuration file.) The wordlists do not need to be sorted.
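
For example (mywords.txt here stands for your own one-word-per-line list, and the other file names are placeholders):

	# directly with tesseract:
	tesseract page.png page --user-words mywords.txt pdf

	# the same thing using the -c form:
	tesseract page.png page -c user_words_file=mywords.txt pdf

	# through a recent version of ocrmypdf:
	ocrmypdf --user-words mywords.txt input.pdf output.pdf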

Bear in mind that these dictionary words are only hints to tesseract; it isn't a spell-corrector like aspell. The dictionaries tell the engine what words to look  for; what it will actually find  is a different matter that depends heavily on a sharp, clean image. However, you can “load the dice” by changing the relative weights assigned to dictionary and non-dictionary words. These are the  language_model_penalty_non_freq_dict_word  and  language_model_penalty_non_dict_word  variables, which are only 0.1 and 0.15 by default. Increasing these values puts more weight on the dictionaries. CAUTION: putting too much weight on the dictionaries will make the engine turn noise, or real words not in a dictionary, into dictionary words; so be careful. Making these penalties larger than about 0.5 is rarely useful.
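
For instance, a modest nudge toward the dictionaries might look like this sketch (the 0.2 and 0.3 values, and the file names, are only illustrative):

	tesseract page.png page \
	  -c language_model_penalty_non_freq_dict_word=0.2 \
	  -c language_model_penalty_non_dict_word=0.3 \
	  --user-words mywords.txt txt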

The freq_dict_word part of the variable named above refers to a list of common words that most language packages contain. Sometimes, when tesseract complains that it found “no best words” on a page, it means that it found nothing in this frequent-word list. Probably you tried to make it OCR a blank page, or one that contains only photographs or diagrams. Raising a non-word penalty often makes it imagine words in background noise or a line drawing.

Remember that what tesseract thinks is a word is just a string of characters; it doesn't need to be a word in your desk dictionary. Strings of arbitrary numbers and letters can be words to tesseract. You might find things like pilcrows and section marks useful “words” to include in the wordlist.

Finer control over special words and characters is possible by using whitelists.

A complication

I first thought that it would be easy to pick words for a supplemental dictionary: just select the words that tesseract failed to OCR correctly when run with the default configuration parameters. It's easy to find most of these errors: just use ispell or aspell to produce a list of mis-spelled words in the ocr'd text. But when I tried this, I found that some words that were detected correctly on the first run became OCR errors on a second one that used a local dictionary containing words it had missed on the first.

The problem is that tesseract stores several different internal patterns for every letter, because a document might contain the glyph in several different font sizes and styles. So if you tell it to pay more attention to the bad glyphs, that shifts its mapping between shapes on the page and characters in the text. Putting words that were missed into the dictionary shifts the detection criteria away from well-formed glyphs, and favors bad ones.

Evidently, we need to tell it to pay more attention to the shapes it misinterpreted initially, while continuing  to pay attention to the things it got right (rather than ignoring them). So all the correct words should be included in the wordlist file; but we also need to include the words we  can read but tesseract couldn't.

In short: we really need to make it look for just  the words that really exist in the input file, and ignore all other possibilities. Ideally, the dictionary should contain all the real words, and no others.

Then, in principle, we could force tesseract to accept only the words in this perfect dictionary by raising the penalties on non-dictionary words.

That still won't guarantee perfect character recognition, for two reasons. First, we won't have a list of all the real words in the PDF without manually checking every word in it; spell-checkers always seem to miss a few uncommon words that occur in real texts. And second, even if the new dictionary list were perfect, there still might be indistinct glyphs in the page images that tesseract could mistake for other characters that form a different  (but correctly spelled) word. Human readers can usually fix such errors by understanding the context of the ambiguous word, but machines don't understand anything.

Note: using a local dictionary tuned to the right context on the first pass can also help appreciably.

Multiple Columns

Many books and journals are printed with two or more columns per page. But tesseract handles multi-column layouts poorly; so the best way to OCR this kind of document is to split the page into columns before invoking tesseract.

Another front end to tesseract, called pdfsandwich, can split PDF pages into two columns. The drawback is that it does not pass as many options through to tesseract as ocrmypdf does. So you might first use pdfsandwich to split the pages vertically, and then invoke tesseract on the separated columns to get the desired results.

Alternatively, the “Advanced features” section of the ocrmypdf documentation says that unpaper, which it uses to edit images for the --clean option, can be told to expect two pages side by side on each page image. The example there shows how to do this:

	 ocrmypdf --clean --clean-final --unpaper-args '--layout double' input.pdf output.pdf
I have not tried this, but it looks useful.

Tables

Layout

One big problem with OCRing tables is that tesseract produces little information about document structure. However, recent versions can write a tab-separated output file that contains some layout information. So, consider using either hocr or tsv output instead of plain text.
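
For example (the file names are placeholders):

	# write page.tsv (or page.hocr) instead of plain text; each row of the tsv
	# output carries a bounding box and a confidence value along with the word
	tesseract page.png page tsv
	tesseract page.png page hocr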

Another problem with tables is that they have variable whitespace between columns. You can help handle this by setting the parameter  preserve_interword_spaces  to 1 — but it does not preserve space at the left end of a line; instead, the lines are all left-justified (because there is no interword space to the left of the first word on a line). However, if you extract the text by using pdftotext with its -layout option, some whitespace at the left margin appears in the text.
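
For instance (again with placeholder file names):

	# keep the runs of spaces between table columns in the plain-text output:
	tesseract table.png table -c preserve_interword_spaces=1 txt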

A related problem is that tesseract often breaks up the rows of tables — often because the lines of type were not perfectly level. (That can be cured by using the -d option to ocrmypdf, which de-skews the tilted text.) You can keep all the parts of a table row together by setting the page-segmentation mode (--psm) to 4 or 6, instead of the default 3. Mode 4 keeps rows together even if they contain a variety of fonts; if you can rely on one font being used across an entire row, use psm mode 6. And notice that mode 4 turns the margins of the page into rows of <SPACE> characters; you probably should crop the image before invoking  tesseract --psm 4  or  ocrmypdf --tesseract-pagesegmode 4 .
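
For example (placeholder file names again):

	# --psm 4: assume a single column of text of variable sizes
	# --psm 6: assume a single uniform block of text
	tesseract table.png table --psm 6 txt
	ocrmypdf --tesseract-pagesegmode 6 table.pdf table-ocr.pdf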

There is a parameter called textord_tablefind_recognize_tables, which is normally turned off, but can be turned on by setting it to 1 (i.e., True). Similarly, there is another called textord_show_tables. It appears that these only find tables with psm set to 1 to 4. These parameters seem to be used only in the layout analysis, without affecting the actual OCR of tables.
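
If you want to experiment with them anyway, something along these lines should turn them on (the file names are placeholders):

	tesseract page.png page --psm 4 \
	  -c textord_tablefind_recognize_tables=1 \
	  -c textord_show_tables=1 tsv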

To extract the OCRed text from a PDF that has been processed by tesseract, you can use

	pdftotext -layout ocred.pdf
to show the text from the OCRed file. This will separate fields with <TAB> characters (or their equivalent: 8 spaces), which makes it fairly unwieldy.

Another way to copy table data from an OCRed image is to display the image with a browser, and copy the (invisible) OCR text to the clipboard by scanning the cursor along the rows on the displayed page. Then you can re-copy the text to a file from the clipboard. This involves a lot of mouse work, but it's still better than trying to copy a table manually.

Entropy

A less obvious reason why tables are more difficult than text to OCR correctly is that text is about 75% redundant, while numerical data are quite unpredictable. However, there is a way to overcome this problem.

Whitelisting and blacklisting

Text can contain several dozen different glyphs, even if it is set in only a single font. Numerical tables contain only the ten digits from 0 to 9, plus a decimal point, and possibly + or − signs. If you have control over the OCR process, you can restrict the set of glyphs that tesseract uses by setting the string-valued parameter tessedit_char_whitelist to something like " 0123456789+-". These will be the only characters tesseract will print in its OCR output. All other characters will be blacklisted, and cannot appear  in the OCR text.

A similar but less restrictive way to focus tesseract's attention on numbers is to set the parameter classify_bin_numeric_mode to 1 in your supplemental config file or command-line option. This, plus careful blacklisting of inappropriate characters, can produce fairly good OCR of numerical tables.
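
A sketch combining these two settings (the file names, and the exact whitelist string, are only examples):

	# restrict the output to digits, signs, a decimal point, and the space,
	# and bias the classifier toward numerals:
	tesseract table.png table \
	  -c tessedit_char_whitelist=' 0123456789.+-' \
	  -c classify_bin_numeric_mode=1 txt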

Technical papers that contain a few Greek letters and mathematical symbols can be OCR'd fairly well by invoking ocrmypdf with a local tesseract config file that sets the whitelist to all the characters used in the text — including a space, the Greek letters, and all punctuation marks (like quotes, brackets, and braces) and other special glyphs. Then put the individual special glyphs and their commonest combinations (like "Δx") into a local "user-words" file; that file should also contain individual Greek letters and any other isolated special characters. It can be named in the local --tesseract-config file, as the value of the user_words_file parameter.

Although parameters and their values are supposed to be separated by “whitespace” in the config file, they are sometimes combined into one word if only a single <SPACE> character intervenes; then tesseract complains that it can't find this phony parameter. Use a tab or multiple spaces to avoid this bug. And be careful to include every  printed character in the “words” file, including the = sign and any string of numbers, letters and special characters isolated from neighboring words in the text by spaces. Adding the Greek language as "ell" after the main text language(s) then makes tesseract look for some of the mathematical symbols and expressions in the running text. Of course, displayed equations are still garbled; but in-line math can be more or less correct, apart from subscripts and superscripts (and exponents). Adding "+equ" to the list of languages may help remove some clutter from displayed equations.
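
Pulling those pieces together, a sketch of such a setup (the file names, and of course the whitelist itself, are only placeholders) might be:

	# contents of math.cfg (one parameter per line; a TAB between name and value):
	#     user_words_file	mathwords.txt
	#     tessedit_char_whitelist	...every character that occurs in the text, including a space...
	ocrmypdf -l eng+ell+equ --tesseract-config math.cfg paper.pdf paper-ocr.pdf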

Unfortunately, even this isn't enough to make tesseract find all the mathematical symbols correctly, because the inclusion of multiple “language” packages does not suffice to teach the engine the difference between text glyphs and those in equations: the partial derivative symbol ∂ looks a lot like the letter d; the Greek capital Δ resembles the top of the text letter A; the German eszett ß is close to the Greek letter β; etc. Depending on the fonts used in the pages you want to OCR, these differences will often turn even in-line equations into nonsense. To resolve these ambiguities, you'd need to add a special set of disambiguation patterns to tesseract's trained data files.

Another problem occurs when only a few foreign words appear in the input PDF. Should we add another language to the list tesseract is asked to look for? Sometimes it's easier just to add the foreign words to the user-words file. In any case, be sure to include any special accented letters in the whitelisted set of characters.

Then pretty good pages of the text can be produced by extracting the OCR text from the PDF with

	pdftotext -layout OCRed.pdf
and editing the resulting OCRed.txt file to collapse the excess groups of spaces with the vim command
                      :%s/  */ /g
(or by piping the pdftotext output through an equivalent sed command), which reduces each group of spaces to a single <SPACE> character.
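
For instance, the non-interactive equivalent is a pipeline like this (OCRed.pdf is a placeholder name):

	pdftotext -layout OCRed.pdf - | sed 's/  */ /g' > OCRed.txt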

Notice that there are two “blacklisting” parameters: tessedit_char_blacklist and tessedit_char_unblacklist, which are often misunderstood. Blacklisting a character only prevents it from being produced  in tesseract's output; but some other character will be produced in its place. Sometimes the minus signs in numerical tables and equations get OCR'd as em-dashes or underscores; then blacklisting those characters will prevent them from appearing where minus signs should be in the extracted text.

Any character that is normally killed in a blacklist can be revived by putting it in a tessedit_char_unblacklist string. This can help the ocr engine look for some special glyph, like a case fraction or a special Unicode symbol.
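
For example, a sketch of both parameters in use (placeholder file names; the particular characters chosen are only illustrative):

	# keep em-dashes and underscores out of the output, so that minus signs
	# are not misread as either of them:
	tesseract table.png table -c tessedit_char_blacklist='—_' txt

	# revive a character that some other config has blacklisted:
	tesseract page.png page -c tessedit_char_unblacklist='§' txt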

Column headings

While we are thinking about special treatments, what about column headings in tables? These contain words, or sometimes just abbreviations of words, or symbols. All those things might be left in the part of the page image used to OCR a table, because they will help you proofread the final result. But then you'd need to whitelist the characters used in the headings, which might degrade the overall recognition accuracy of numbers.

One way to work around this problem is to put the words, or word fragments, used in the headings into a supplemental wordlist, as described above.

Restricted numerical ranges

Numbers in tables are often pretty random; but not always. You can use some regularities to help steer tesseract away from trouble.

For example, a table column that contains only hours and minutes must have only 1- or 2-digit numbers smaller than 13 (or 25) in the hours column, and less than 60 in the minutes. You can put those small sets of numbers into your supplemental “user-words” dictionary, as long as the OCR engine is looking for numbers as well as dictionary words. Even if there are other columns in the table with less limited numerical values, this might still help prevent a mis-OCRed value of 84 from appearing in a “minutes” column.

If you do include such short text fragments in the wordlist, you'll also need to change tesseract's value for the minimum word length. The default length is 3; it's 1 more than the parameter stopper_smallword_size, which defaults to 2. Reduce this by 1 or 2 to include isolated individual or paired glyphs in the dictionary test.
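
Here is a sketch of that idea for an hours-and-minutes table (all file names and values are only illustrative):

	# build a little "dictionary" of legal hour and minute values:
	seq 0 23 > timewords.txt        # hours 0-23
	seq -w 0 59 >> timewords.txt    # minutes 00-59

	# let one- and two-character "words" take part in the dictionary test:
	tesseract table.png table \
	  --user-words timewords.txt \
	  -c stopper_smallword_size=0 txt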

 

Copyright © 2023 – 2024, Andrew T. Young

