Unfortunately, the documentation for tesseract isn't very clear, so it's difficult for beginners to learn what needs to be tweaked, or how to do it. This page explains some basic ways to improve its performance.
Also, make sure you have enough resolution to record all the details of the printed words on the page. Normally, average-sized print in books and journals can be scanned at 300 dpi; but unusually small print, or a typeface with very thin lines, will produce better results at 400 or 600 dpi. A rule of thumb is that the thinnest lines in the letter glyphs should be 2 pixels wide; this is usually possible if the x-height of the font is between 20 and 30 pixels. (That typically corresponds to a space between lines of type of at least 50 or 60 pixels.)
Even higher resolution is needed when imaging pages of text that contain small print, such as footnotes, or technical papers with mathematical symbols that have small subscripts and superscripts. Another problem occurs when more than one language is involved: it's difficult to distinguish between an unadorned vowel and one festooned with accents, or between Latin letter “a” and Greek letter “α”. If tesseract has to distinguish among a larger variety of glyphs, it needs more pixels to tell similar ones apart.
And be careful to record page images using lossless compression. Often page images are compressed as JPEG images with some default quality (like 70 or 80) that is perfectly legible to the eye, and looks reasonable on casual inspection, but contains lots of compression artifacts that will make OCR detection of text very inaccurate. (The Internet is full of such images.) If you can't avoid JPEG compression, at least use the highest quality setting available.
If all you have is pages scanned at some distant library, you may discover that they're contaminated with noise — particularly if the pages are old, or if they were printed on rag paper that contained lots of little colored fibers. This typically shows up when tesseract complains that there are “lots of diacritics” when it searches for letters. In extreme cases, it may be useful to turn on the textord_heavy_nr setting, which is normally zero (i.e., off). But that is so heavy-handed that it usually makes an appreciable fraction of the text unreadable. [The ocrmypdf front end has its own -c (or --clean) option, which is much milder.]
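For instance (a sketch; the file name here is arbitrary), the setting can be supplied through a local config file:

```shell
# Create a local tesseract config file that turns on heavy noise removal.
# Parameter name and value go on one line, separated by whitespace.
cat > heavy-clean.cfg <<'EOF'
textord_heavy_nr   1
EOF

# Then pass it through ocrmypdf (not run here):
#   ocrmypdf --tesseract-config heavy-clean.cfg input.pdf output.pdf
```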
Often, you can tell whether the noise-removal process has also removed important information by checking the little dots in glyphs like periods, commas, colons, and semicolons. If numerical values tend to lose their decimal points, the cleaning process should be toned down (even if tesseract continues to complain about diacritical marks).
If you tell tesseract to include the cleaned images in the final product (by using the -i option of ocrmypdf), you can see what kind of noise has caused errors in the OCR text. This may suggest what other parameters might be tweaked to reduce noise further. For a list of all the hundreds of parameters in the version you are using, enter
tesseract --print-parameters
Most of my own use of tesseract has been to make PDFs of old books searchable. However, I've also tried to extract numerical tables from scans of technical references; a particular irritation has been the omission of tables and figures from the scans available at Google Books. (Google was evidently interested only in extracting text from them.)
For simple tasks, the ocrmypdf script is very handy. You can get some idea of what ocrmypdf is doing by turning on its verbose (i.e., -v) option, which produces a surprisingly large amount of output. But for more complicated problems, like extracting numerical values from tables, you need finer control over tesseract than ocrmypdf alone can easily provide.
Whatever you want to do, you need a better understanding of how both these commands operate, and how they can be controlled, than the regular documentation provides. A good place to start is tesseract's built-in help screens:

tesseract --help-extra

or

tesseract --help-psm
On the other hand, the -d (or --deskew) and -c (or --clean) options to ocrmypdf almost always improve the accuracy of the extracted text. So those options should be used routinely for most tasks. But notice that many PDF files available from Google Books and other libraries have already been properly cleaned, and don't need the -c option.
This local wordlist will be different for every document, so it makes sense to provide a different list for each one. Reading the man page can mislead you into thinking that such a supplemental wordlist must be named with the suffix specified by the user_words_suffix parameter, and/or that it has to be placed in the tessdata directory — which would allow only one such file for every language. Actually, it's possible to have a single supplemental wordlist if it's in the tessdata directory and has the specified suffix; but that doesn't prevent you from having a local list with an arbitrary filename. You just have to specify the path to that file on the command line, either with tesseract's --user-words option, or with its -c option followed by a user_words_file= argument.
In recent versions of ocrmypdf, the local wordlist file can be named with a --user-words option on the ocrmypdf command line. If you are using version 3 of tesseract, you have to point to any local wordlist(s) in a local config file, which in turn can be named in a --tesseract-config option to ocrmypdf. Either way, a supplemental dictionary can be provided. (Note that you can have only one local wordlist file, and only one local config file. If you have more than one wordlist file, they should all be concatenated before being named in a command line or the local configuration file.) The wordlists do not need to be sorted.
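For instance (all file names and words here are made up), several topic lists can be combined into the single file those options accept:

```shell
# Two topic-specific wordlists (contents are made-up examples):
printf '%s\n' Fraunhofer airmass > astronomy-words.txt
printf '%s\n' pilcrow obelus     > typography-words.txt

# Only one wordlist can be supplied, so concatenate them first;
# the combined list does not need to be sorted.
cat astronomy-words.txt typography-words.txt > all-words.txt

# Then (not run here):
#   ocrmypdf --user-words all-words.txt input.pdf output.pdf
```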
Bear in mind that these dictionary words are only hints to tesseract; it isn't a spell-corrector like aspell. The dictionaries tell the engine what words to look for; what it will actually find is a different matter that depends heavily on a sharp, clean image. However, you can “load the dice” by changing the relative weights assigned to dictionary and non-dictionary words. These are the language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word variables, which are only 0.1 and 0.15 by default. Increasing these values puts more weight on the dictionaries. CAUTION: putting too much weight on the dictionaries will make the engine turn noise, or real words not in a dictionary, into dictionary words; so be careful. Making these penalties larger than about 0.5 is rarely useful.
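As a sketch, with the values 0.3 and 0.4 chosen only for illustration (staying below the 0.5 ceiling mentioned above):

```shell
# Raise the penalties on non-dictionary words, nudging the engine
# toward the supplied dictionaries (defaults are 0.1 and 0.15).
cat > dict-weights.cfg <<'EOF'
language_model_penalty_non_freq_dict_word   0.3
language_model_penalty_non_dict_word        0.4
EOF

# Then (not run here):
#   ocrmypdf --tesseract-config dict-weights.cfg input.pdf output.pdf
```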
Furthermore, what tesseract thinks is a word is just a string of characters; it doesn't need to be a word in your desk dictionary. Strings of arbitrary numbers and letters can be words to tesseract. You might find things like pilcrows and section marks useful “words” to include in the wordlist.
Finer control over special words and characters is possible by using whitelists.
The problem is that tesseract stores several different internal images for every letter, because a document might contain the glyph in several different font sizes and styles. So if you tell it to pay more attention to the bad glyphs, that shifts its mapping between shapes on the page and characters in the text encoding. Putting words that were missed into the dictionary shifts the detection criteria away from well-formed glyphs, and favors bad ones.
Evidently, we need to tell it to pay a little more attention to the shapes it misinterpreted initially, while continuing to pay attention to the things it got right (rather than ignoring them). So all the correct words need to be included in the wordlist file; but we also need to include the words we can read but tesseract couldn't.
In short: we really need to make it look for just the words that really exist in the input file, and ignore all other possibilities. Ideally, the dictionary should contain all the real words, and no others.
Then, in principle, we could force tesseract to accept only the words in this perfect dictionary by raising the penalties on non-dictionary words.
That still won't guarantee perfect character recognition, for two reasons. First, we won't have a list of all the real words in the PDF without manually checking every word in it; spell-checkers always seem to miss a few uncommon words that occur in real texts. And second, even if the new dictionary list were perfect, there still might be indistinct glyphs in the page images that tesseract could mistake for other characters that form a correctly spelled (but wrong) word. Human readers can usually fix such errors by understanding the context of the ambiguous word, but machines don't understand anything.
Note: using a local dictionary tuned to the right context on the first pass could also help appreciably.
And there is another front end to tesseract called pdfsandwich that can split PDF pages into two columns. The problem with using it is that it does not pass as many options to tesseract itself as ocrmypdf does. So you might want to first use pdfsandwich to split the pages vertically, and then invoke tesseract on the separated columns to get the desired results.
Another problem with tables is that they have variable whitespace between columns. You can help handle this by setting the parameter preserve_interword_spaces to 1 — but it does not preserve space at the left end of a line; instead, the lines are all left-justified (because there is no interword space to the left of the first word on a line). However, if you extract the text by using pdftotext with its -layout option, some whitespace at the left margin appears in the text.
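A sketch of that setting as a local config file (the file name is arbitrary):

```shell
# Keep the wide gaps between table columns in the OCR output.
cat > table.cfg <<'EOF'
preserve_interword_spaces   1
EOF

# Then (not run here):
#   ocrmypdf --tesseract-config table.cfg scan.pdf out.pdf
#   pdftotext -layout out.pdf      # recovers some left-margin whitespace
```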
A related problem is that tesseract often breaks up the rows of tables — often because the lines of type were not perfectly level. (That can be cured by using the -d option to ocrmypdf, which de-skews the tilted text.) You can keep all the parts of a table row together by setting the -psm mode to 4 or 6, instead of the default 3. Mode 4 keeps rows together even if they contain a variety of fonts; if you can rely on one font being used across an entire row, use psm mode 6. And notice that mode 4 turns the margins of the page into rows of <SPACE> characters; you probably should crop the image before invoking tesseract -psm 4 or ocrmypdf --tesseract-pagesegmode 4 .
There is a parameter called textord_tablefind_recognize_tables, which is normally turned off, but can be turned on by setting it to 1 (i.e., True). Similarly, there is another called textord_show_tables. These appear to find tables only with psm set between 1 and 4, and they seem to be used only in the layout analysis, without affecting the actual OCR of tables.
To extract the OCRed text from a PDF that has been processed by tesseract, you can use

pdftotext -layout ocred.pdf

to show the text from the OCRed file. This will separate fields with <TAB> characters (or their equivalent: 8 spaces), which makes it fairly unwieldy.
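If the goal is tabular data, those runs of whitespace can be turned into a single delimiter. A sketch, using a made-up sample line:

```shell
# A made-up line of the kind pdftotext -layout emits for a table row:
line='12.3        45.6        78.9'

# Squeeze each run of two or more blanks (spaces or tabs) to one comma:
echo "$line" | sed -E 's/[[:blank:]]{2,}/,/g'
# → 12.3,45.6,78.9
```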
Another way to copy table data from an OCRed image is to display the image with a browser, and copy the (invisible) OCR text to the clipboard by scanning the cursor along the rows on the displayed page. Then you can re-copy the text to a file from the clipboard. This involves a lot of mouse work, but it's still better than trying to copy a table manually.
A similar but less restrictive way to focus tesseract's attention on numbers is to set the parameter classify_bin_numeric_mode to 1 in your supplemental config file or command-line option. This, plus careful blacklisting of inappropriate characters, can produce fairly good OCR of numerical tables.
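As a sketch (the file name is arbitrary), such a config file might be:

```shell
# Bias the character classifier toward digits for numerical tables.
cat > numbers.cfg <<'EOF'
classify_bin_numeric_mode   1
EOF

# Then (not run here):
#   ocrmypdf --tesseract-config numbers.cfg table-scan.pdf out.pdf
```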
Technical papers that contain a few Greek letters and mathematical symbols can be OCR'd fairly well by invoking ocrmypdf with a local tesseract config file that sets the whitelist to all the characters used in the text, including a space, Greek letters, and all punctuation marks (like quotes, brackets, and braces) and other special glyphs; and then putting the individual special glyphs and their commonest combinations (like "Δx") into a local "user-words" file. This “words” file should contain individual Greek letters and any isolated special characters as well. It can be named in the local --tesseract-config file, as the value of the user_words_file parameter.
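A sketch of those two local files (all names and contents here are illustrative; the character set must be adjusted to the actual text):

```shell
# A local tesseract config naming the whitelist and the user-words file.
# (Getting a literal space into the whitelist is awkward; see the
# whitespace caveat below.)
cat > greek.cfg <<'EOF'
tessedit_char_whitelist   abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789αβγδλμπΔ.,;:'"()[]{}=+-
user_words_file           greek-words.txt
EOF

# Individual special glyphs and their commonest combinations as "words":
cat > greek-words.txt <<'EOF'
Δx
δλ
α
π
EOF
```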
Although parameters and their values are supposed to be separated by “whitespace” in the config file, they are sometimes combined into one word if only a single <SPACE> character intervenes; then tesseract complains that it can't find this phony parameter. Use a tab or multiple spaces to avoid this bug. And be careful to include every printed character in the “words” file, including the = sign and any string of numbers, letters and special characters isolated from neighboring words in the text by spaces.
Another problem occurs when only a handful of foreign words appear in the input PDF. Should we add another language to the list tesseract is asked to look for? Sometimes it's easier just to add the foreign words to the user-words file. In any case, be sure to include special accented letters in the whitelisted set of characters.
Then pretty good pages of text can be produced by extracting the OCR'd text from the PDF with pdftotext, as described at the end of this page.
Notice that there are two “blacklisting” parameters: tessedit_char_blacklist and tessedit_char_unblacklist, which are often misunderstood. Blacklisting a character only prevents it from being produced in tesseract's output; some other character will be produced in its place. Sometimes the minus signs in numerical tables get OCR'd as em-dashes or underscores; then blacklisting those characters will prevent them from appearing where minus signs should be in the extracted text.

Any character that is normally killed in a blacklist can be revived by putting it in a tessedit_char_unblacklist string. This can help the ocr engine look for some special glyph, like a case fraction or a special Unicode symbol.
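A sketch of the minus-sign case (the unblacklist line is purely illustrative, reviving a case fraction that some other setting might have suppressed):

```shell
# Blacklist em-dash and underscore so that mis-read minus signs
# cannot appear as either of them; revive the case fraction ½.
cat > signs.cfg <<'EOF'
tessedit_char_blacklist     —_
tessedit_char_unblacklist   ½
EOF

# Then (not run here):
#   ocrmypdf --tesseract-config signs.cfg table-scan.pdf out.pdf
```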
One way to work around the column-heading problem (see “Column headings” below) is to put the words, or word fragments, used in the headings into a supplemental wordlist, as described above. Adding the strings used in column headers to a supplemental dictionary list should help.
For example, a table column that contains only hours and minutes must have only 1- or 2-digit numbers smaller than 13 (or 25) in the hours column, and smaller than 60 in the minutes. You can put those small sets of numbers into your supplemental “user-words” dictionary, as long as the OCR engine is looking for numbers as well as dictionary words. Even if there are other columns in the table with less limited numerical values, this might still help prevent a mis-OCRed value of 84 from appearing in a “minutes” column.
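Such restricted sets are easy to generate mechanically. A sketch for a 24-hour timetable (file names hypothetical; drop the -w if the table's values are not zero-padded):

```shell
# Legal hours (00-23) and minutes (00-59), zero-padded, one per line:
seq -w 0 23  > time-words.txt
seq -w 0 59 >> time-words.txt

# The list need not be sorted or de-duplicated, but it can be:
sort -u time-words.txt -o time-words.txt

# Then (not run here):
#   ocrmypdf --user-words time-words.txt timetable.pdf out.pdf
```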
Copyright © 2023 – 2024, Andrew T. Young
To clean up the extracted text, run

pdftotext -layout OCRed.pdf

and edit the resulting OCRed.txt file to collapse the excess groups of spaces with the vim command

:%s/  */ /g

(or pipe the pdftotext output through an equivalent sed command), which reduces each group of spaces to a single <SPACE> character.
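The equivalent sed command might look like this (a sketch, shown here on a made-up sample line):

```shell
# Collapse each run of spaces to a single space, like :%s/  */ /g in vim:
printf 'a    b  c\n' | sed 's/  */ /g'
# → a b c

# In practice (not run here):
#   pdftotext -layout OCRed.pdf - | sed 's/  */ /g' > OCRed.txt
```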
Column headings

While we are thinking about special treatments, what about column headings in tables? These contain words, or sometimes just abbreviations of words, or symbols. All those things might be left in the part of the page image used to OCR a table, because they will help you proofread the final result. But then you'd need to leave the characters used in the headings whitelisted, which might degrade the overall recognition accuracy.
Restricted numerical ranges

Numbers in tables are often pretty random; but not always. You can use some regularities to help steer tesseract away from trouble.