Multi-Lingual Spell Checking with ispell

Introduction

When I began compiling the bibliography file, I did not think it would be possible to run a spell-checker on it, because so much of the content was in German, French, etc. So I tried to rely on careful proofreading to minimize errors.

Well, just as the man who defends himself in court has a fool for a lawyer, someone who tries to proofread his own typing is kidding himself. You know what should be there, so that's what you perceive on the screen; lots of errors are overlooked.

I became aware of this gradually over a period of time, as I kept discovering my errors by accident. I'd be revising an entry in the bibliography that I hadn't looked at in a couple of years, and suddenly see a glaring mistake. So I finally decided to try running the spell-checker on the file, knowing there'd be a lot of extraneous garbage it would reject because of the foreign words.

The result was horrifying. Once the errors were removed from their context, they became much more visible. But, of course, the large number of legitimate foreign words produced so much clutter that it was still difficult to catch all the errors. I determined to do the work systematically, removing all the foreign words (or at least, as many as possible) to reduce the noise level.

A first cut

The obvious way to go about this was to download foreign-language dictionaries for the spell checker — I'm using ispell here — and just run the text through the spell checker for each language successively. The idea is to remove legal spellings in all the languages seriatim, leaving only the misspelled words.

The simple-minded way to do this with  ispell  under UNIX or Linux is to pipe the file of references through  ispell  for one language after another, like this:

	ispell -x -l < references |
	ispell -x -l -d german |
	ispell -x -l -d french |
	ispell -x -l -d italian |
	. . . | sort | uniq | more

and so on, language after language. (Don't ask how this would be done under some other operating system; I only use Linux or UNIX systems.) The -x flag tells  ispell  not to try to make a backup file, the -l tells it to just list the spelling errors on the standard output stream, and -d says to use an alternate language dictionary, instead of the default American-English one.

Problems

Of course, nothing is as easy as it looks at first glance; this doesn't work. There are several problems:

1.   I have used a variety of encodings for accented letters in the raw reference file. Some are in LaTeX markup, like \"{a} for ä; some are nroff/troff overstrikes, like \o'a"' for the same thing; and some are written plainly as a", which (it turns out) is the convention the Debian igerman dictionary uses.

This means that I have to write a  sed  script to convert everything to a common encoding scheme. I chose ISO-8859-1 for the accented characters.
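Here is a minimal sketch of such a pre-filter, showing only the a-umlaut rules; the real script repeats them for the other letters and the capitals, and the file name toiso.sed is just for illustration:

	# toiso.sed -- run as:  sed -f toiso.sed references > references.iso
	# LaTeX markup:  \"{a}  ->  ä
	s/\\"{a}/ä/g
	# troff overstrike:  \o'a"'  ->  ä
	s/\\o'a"'/ä/g
	# bare ASCII convention:  a"  ->  ä  (this rule must come last, or
	# it would eat the  a"  inside the troff form above)
	s/a"/ä/g

Beware that the last rule also fires on an ordinary a followed by a closing double quote, so the output is worth a quick check.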

2.   But then the German dictionary doesn't like the ISO encodings! It turns out that you have to add the -Tlatin1 flag to make ispell accept the ISO Latin-1 codes. [Thanks to Debian maintainer Roland Rosenfeld for explaining the need for -Tlatin1 .]

And these days, since the Germans decided to change their spelling conventions, we need the ogerman option (from the iogerman package) — not just german. That also means I have to have a local supplemental wordlist called .ispell_ogerman, not just .ispell_german .
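So the German stage of the pipeline ends up looking like this (setting aside, for the moment, the word-character problem discussed below):

	ispell -x -l -d ogerman -Tlatin1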

3.   On the other hand, the other dictionaries like ISO-8859-1 to begin with; so if you try adding the -Tlatin1 incantation to them, you get a

	ispell:  unrecognized formatter type 'latin1'

complaint. It turns out that the igerman dictionary has this encoding flagged in the german.aff affix file, which (under Debian, at least) is in the /usr/lib/ispell directory. [Despite the  ispell: in the error message, the problem is in the dictionary files, not the program. The program is just where the error is detected.]

4.   So, are we ready to do multi-lingual spell-checking? We are not. The problem now is that the accented characters don't occur in English, so they're not in the list of characters that make legitimate words. Therefore, ispell assumes they are some exotic form of punctuation, and breaks words where they occur. So an input word like Verhältnis gets chopped into Verh and ltnis on the English-language pass of  ispell, neither of which is correct in any language; so both these word-fragments appear in the output. Instead of eliminating a correctly-spelled word, we end up having two meaningless pieces cluttering the final results.

The solution to this problem is to tell  ispell that all the accented letters are legitimate. That's done with the -w switch:

	ispell -x -l -w "äëïöüàèìòùáéíóúâêîôûß" -d whatever

— and we have to do this for every language (i.e., every invocation of  ispell  in the pipeline), if we are to get complete words out at the end.

In addition, we have to include the apostrophe in the string of word-making characters. The reason is that (a) I've included a bunch of possessives in my .ispell_words dictionary, which is read by every invocation of  ispell, for every language in the pipeline; and (b) some of those languages (like French) include words with embedded apostrophes anyway, which have to be protected from the word-breaking propensities of other languages in the list.

Whoops! We forgot the accented uppercase characters. (Think of all the German nouns that begin with Ä and Ü.)
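With the capitals and the apostrophe included, an invocation looks more like this (the character list is abbreviated here; the complete one appears in the finished script below):

	ispell -x -l -w "äöüÄÖÜàéèìòùç'" -d whatever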

Now, we can pass the original file through the pipeline and filter out all the things  ispell  thinks are correctly spelled, leaving only spelling errors to come out at the end.

Are we there yet? Almost.

5.   Of course, lots of technical and uncommon or obsolescent words appear in the annotated bibliography. So I have to (a) determine what language they're from, and (b) add them to a local supplemental dictionary, like .ispell_ogerman or .ispell_french. This works well most of the time; but when I copy atmosphärer into my local .ispell_ogerman dictionary, the pipeline still spits it out as a spelling error.

Well, it turns out that you have to convert your supplemental dictionary to the same format as the main hashed dictionary, rather than the encoding of the data being spell-checked. For most languages, that means ISO-8859-1 encoding; but the German dictionary distributed in the igerman package of Debian Linux uses the convention that ä is represented as a", ö as o", and so on. It also has sS for ß. So the treatment is to run your .ispell_ogerman file through another sed script, to turn the extra words back from ISO-8859-1 to the simple ASCII markup for umlauts that I was using to begin with. (The stanza at the end of the mspell script below does exactly this.)

In general, the language.aff file is where such information is hidden.
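A crude but quick way to see which affix files mention the Latin-1 string type is simply to grep for it (this assumes the Debian layout mentioned above):

	grep -il latin1 /usr/lib/ispell/*.aff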

Now, we're done, right?

6.   No. There's still a problem with the Latin.

There is no Latin dictionary for  ispell  distributed by Debian. There is a relatively small (13 000 base words) dictionary, called mlatin, that can serve as a starting point. I picked up a copy and am using it; but of course a lot of technical vocabulary is missing, so I'm accumulating a .ispell_latin file.

But the main problem is that many of the Latin documents I'm using employ the æ ligature, while the mlatin dictionary spells it out as ae. So, once again, a little sed script is needed to convert the ISO-8859-1 code for æ back to plain ae.

Note that this has to be done at the end, after all the languages that use ISO-8859-1 encodings beyond 7 bits have been checked. Once you've converted æ to ae, you've lost the distinction between this ae pair and the ae pairs that represent an a-umlaut, say, in German.

The mspell script

So, what does my mspell script finally look like? Here it is, minus the preliminary massaging by sed to get everything into ISO-8859-1 (which I assume most readers can do without):

	#!/bin/bash

	# Usage:  mspell textfile

	# set up the list of allowed accented characters:
	ACCENTS="áéíóúÉàèìòùâêîôûäïöüÄÖÜëçßæÆåÅ'"

	ispell -l -x -w "$ACCENTS" < "$1" |
	ispell -l -x -d british -w "$ACCENTS" |
	ispell -l -x -d french -w "$ACCENTS" |
	ispell -l -x -d ogerman -Tlatin1 -C -w "$ACCENTS" |
	ispell -l -x -d italian -w "$ACCENTS" |
	ispell -l -x -d dutch -w "$ACCENTS" |
	# make the mlatin dictionary happy:
	sed -e 's/æ/ae/g;s/Æ/AE/g' |
	ispell -l -x -d latin -w "$ACCENTS" |
	sort | uniq -c | more

	# Finally, if we have modified the .ispell_german dictionary,
	# convert it to canonical form:

	sed -e 's/ä/a"/g;s/Ä/A"/g
		s/ö/o"/g;s/Ö/O"/g
		s/ü/u"/g;s/Ü/U"/g
		s/ß/sS/g' < $HOME/.ispell_german  >$HOME/.ispell_german.new

	cmp $HOME/.ispell_german $HOME/.ispell_german.new  || \
	mv $HOME/.ispell_german.new $HOME/.ispell_german

A few comments are in order here. First, notice that I've run the input file through  ispell  with the -d british flag as the first “foreign” dictionary. That's because I have a lot of British-English quotations in the annotations, and don't want British spellings to be flagged as errors by my default American-English dictionary. You could use this trick by itself to make a bilingual British/American spell-checker that accepts both sets of spellings. (In that case, the -w $ACCENTS would not be needed.)
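Stripped to its essentials, such a bilingual checker would be just a two-stage pipeline; a minimal sketch:

	ispell -x -l < textfile | ispell -x -l -d british | sort | uniq | more

Only the words rejected by both the American and the British dictionaries come out the end.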

The accent list has been stuffed into a shell variable for brevity. Notice that it contains things like the c-cedilla (needed for French, among other languages) and, as mentioned above, the apostrophe. A smaller list of accents would do for a simple bilingual spell-checker.

The hyphen might be added, to get hyphenated French and Dutch words through the English and German passes; otherwise, those passes break the hyphenated combinations at the hyphen. But the hyphen has the disadvantage of flagging many legitimate hyphenated word-pairs in English as errors; because more of my text is in English than in French or Dutch, I find it preferable to omit it, and to enter the legal French and Dutch fragments that get split off in the local supplemental dictionaries.
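If you want to experiment with the hyphen anyway, the change is a single line added after the ACCENTS assignment in the script above; I don't use it myself:

	ACCENTS="$ACCENTS-"	# let the hyphen build words, too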

Only German needs the -Tlatin1 flag. I've also used the -C flag to try to catch German compounds; but, as noted elsewhere on the Web, this is not very helpful.

Furthermore, I've had to make .ispell_ogerman a symbolic link to .ispell_german — because who wants to keep typing ogerman instead of german?
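Setting that link up is a one-time command:

	ln -s $HOME/.ispell_german $HOME/.ispell_ogerman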

I used the -c flag on the uniq at the end of the pipeline, to provide a count of how many times each misspelled word appears. This is useful in deciding whether or not to add it to one of the supplementary dictionaries.

I put the little stanza that modifies the supplementary German dictionary to get it into the right encoding at the end, because I often edit the .ispell_german file in another xterm window while I'm looking at the list of spelling errors with more. Maybe it should go at the beginning; feel free to move it there if you like. Notice the use of the shell `or' operator, || : the mv is executed only when cmp finds that the files differ, so nothing happens if no changes have been made.

You might think all this activity would take forever; but on my 1.4 GHz Athlon machine, the whole 1.34 MB file of references was spell-checked in about 2 seconds — a testimony to the efficiency of hashing, at the very least.

Real-world usage

Actually, I ended up not using the mspell script from the command line to correct the bibliography. Instead, because I noticed that many of the words still missing from the dictionaries were the names of authors, I incorporated it into another script, which strips out the authors' names (by parsing the references in the bibliography) to a separate file. That list is then run through sort and uniq to produce a list of authors, one per line. Finally, the authors file and the spelling-errors file are merged in a pipeline fed by cat, re-sorted, and passed through uniq -u to eliminate the authors' names entirely from the final spelling-error list.
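In outline, the subtraction works like this; a sketch, with errors and authors standing for the two sorted, one-word-per-line files just described:

	# feeding the author list in twice guarantees that every author's
	# name occurs at least twice, so uniq -u cannot let any through
	cat errors authors authors | sort | uniq -u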

This method of effectively subtracting one wordlist from another is a handy trick to know — though  comm  would also work. Another way to handle such problems is to put a local file named .ispell_words in the same directory as the file you want to spell-check; then  ispell  will use that as well as your global .ispell_words in your home directory. This local list of correctly-spelled words can be useful in projects that use a few peculiar spellings.
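For the record, the comm version of that subtraction (both files must already be sorted) is

	comm -23 errors authors

where -2 suppresses the lines found only in authors, and -3 the lines common to both files, leaving just the genuine misspellings.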

I also noticed that some noise is added to the list of spelling errors by the URLs that are occasionally mentioned in the annotations. To fix this, I added a sed-script line to my pre-filter to null out all the URLs:

		s/http:[^ ]*//g

I really should add Spanish and Danish dictionaries to the list. But I'm using a slow modem here, and it takes a while to download them — particularly if I get the wordlists as well.

Anyway, the machine ends up doing most of the work, which is the way I wanted it.

General remarks on spell-checkers

If the different languages were in clearly-marked sections (as they might be if the reference file were in XML instead of raw text), it would have been more effective to strip out the separate languages and spell-check each of them separately. But I had to run the whole file through all the languages in the list, because the various languages are intermixed in a way that is not possible to disentangle automatically.

A significant danger with this sort of undifferentiated spell-checking is that a mis-spelling of a word in one of the languages might be a correct spelling of some other word in a different language, and so get passed over. For example, it's easy for a native English-speaker to type and when copying und from a line of German, or fair for fait in a line of French. This kind of undetected error is a general problem with spell-checkers: the “wrong word” problem.

Now that spell-checking has become a common crutch to lean on, an attentive reader sees this problem in print more and more, from the daily newspaper to magazines and technical journals. The only safeguard against wrong words is careful proofreading. Spell-checkers are not proofreaders!

I have had some debate in the past with the dictionary compilers about the lack of possessives and plurals in the default  ispell  dictionaries. They argue that so many people fail to distinguish between its and it's, for example, that all uses of the apostrophe should be flagged.

That may make sense for the average semi-literate cretin who thinks a spell-checker will make up for his ignorance of grammar, but it's crazy from the point of view of someone who doesn't need to read the recent book Eats, Shoots and Leaves. It might be nice to provide separate dictionaries for illiterates and educated people; but how many of the illiterates would use the wrong dictionary, and then claim the spell-checker isn't any good?

So I have an elitist attitude. Tough.

Thanks

Thanks to Debian maintainer Roland Rosenfeld and ispell upstream curator Geoff Kuenning for essential advice about the use of this program!  I couldn't have made this work without their help.

And thanks to all the compilers of dictionaries for  ispell  in the various languages.

And particular thanks to the Debian folks who wrote apt-get, which makes installing a new package as easy as typing

	apt-get install iogerman

— the main reason I switched to Debian.

 

Copyright © 2004 – 2007, 2011 Andrew T. Young

