Spell-checking with aspell

Introduction

Some time ago, I had to deal with using the ispell utility to find spelling errors in the bibliography. That was difficult, in part because different languages used different encodings in their ispell dictionaries. Now that Unicode has taken over, it's much easier to check spelling with aspell, which can deal with any language in Unicode.

But aspell is an ungodly mess to configure. Its documentation was written by its original author, who had more problems with English than just spelling: his syntax and punctuation are as haphazard as his orthography. So aspell itself is very disorganized. It combines many disparate functions in a single executable, so it's not easy to figure out how to use it. And it suffers from “creeping featurism”, and has become a complex “little language” of its own, with command-line arguments that correspond to interactive commands.

Despite these drawbacks, aspell has been widely adopted, largely because of its flexibility in suggesting possible correct spellings for a badly mangled word. Probably this is due to its multiple ways of finding possible variations of a misspelled word.

The user's problem is to make sense of this chaotic collection of utilities. Unfortunately, the documentation for aspell is so disorganized that you have to read through it several times before you can find all the pieces that might solve your particular problem. What's needed is a more coherent overview of aspell. This Web page is a start in that direction.

The pieces

Like ispell, aspell is a command-line utility that can make a list of the misspelled words in a file. It can also be used interactively. In both cases, aspell can suggest replacements for the incorrect words: it's not only a spell-checker; it's also a spell-corrector.
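As a rough illustration of the non-interactive use, here is a minimal sketch built on aspell's list command, which reads text on standard input and prints the words it doesn't recognize. The function name misspelled is just my own label, and English is assumed as the fallback language.

```shell
# misspelled FILE [LANG]: print each unrecognized word in FILE once, sorted.
# The function name is hypothetical; LANG defaults to "en" (an assumption).
misspelled() {
  aspell -l "${2:-en}" list < "$1" | sort -u
}

# Example use (assumes aspell and an English dictionary are installed):
#   misspelled draft.txt
#   misspelled pismo.txt hr
```

Piping through sort -u is only a convenience: a word that is misspelled twenty times in the file would otherwise be reported twenty times.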

But aspell knows nothing about any human language. The identification of errors, as well as their suggested replacements, depends on a dictionary, or word-list, of correct spellings. So aspell also embodies utilities for maintaining and extending the dictionaries it uses. However, those dictionaries come in several different formats, are stored in multiple locations, and have different attributes — so there are also utilities to convert dictionaries from one kind to another.

In addition to dictionaries, there are grammar files that have rules for affixes; suggestion files that have rules describing common spelling errors; and configuration files that specify the locations of all the other files, and which ones should be invoked for each language. (And nearly everything that can be specified in a configuration file can also be specified on the aspell command line, or in an environment variable.)

In Debian, the simplest installation of aspell also installs dictionary and other language-dependent files for a default language. But you will quickly find problems with these defaults: usually the default dictionary has too many or too few words for your purposes. It certainly won't have your e-mail address or your username, and probably won't know the names of your friends and co-workers, which will all be flagged as spelling errors if you try to spell-check your outgoing e-mail. So let's start with dictionary problems.

Dictionaries

Missing words

Words that you commonly type but are missing from the default dictionary should be put into what the aspell documentation calls a “personal” dictionary, or wordlist. It's almost, but not quite, just a list of words: there has to be a special one-line header that tells aspell what language it's for, and what its encoding is. For English, this header line is

personal_ws-1.1 en 0 utf-8
The en is the standard 2-letter ISO 639-1 language code. The zero is a default estimated length of the list; any moderate value is acceptable. (The encoding isn't really necessary if it's the same as the default encoding assumed by aspell itself; but it never hurts.)
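To make the file layout concrete, here is a minimal sketch that writes a small personal wordlist: the header line first, then one word per line. It deliberately writes into a scratch directory rather than your real home directory, and the three sample words are only placeholders of mine.

```shell
# Sketch: build a personal wordlist in a scratch directory, so that
# nothing in the real home directory gets overwritten.
lang=en
pws_dir=$(mktemp -d)
pws_file="$pws_dir/.aspell.$lang.pws"

# The header must be the first line; the words follow, one per line.
{
  printf 'personal_ws-1.1 %s 0 utf-8\n' "$lang"
  printf '%s\n' Debian ispell wordlist
} > "$pws_file"

cat "$pws_file"
```

Once you're happy with the contents, the file can be copied to the default location described below.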

This personal wordlist file is normally named .aspell.en.pws and is placed in your home directory. That's the default location, which, like everything else, is configurable: it can be specified in a configuration file, an environment variable, or even on the aspell command line. But in this complicated system it's best to use the defaults unless you have a really urgent reason (and not just a whim, or curiosity) to do otherwise.

Defaults

The default location [ ~/ ] and default filename [ .aspell.XX.pws ] — where XX is the language code — are configured into aspell's environment. If this *.pws file has the expected name and location, aspell will automatically add its contents to the default list of correctly-spelled words for this language.

You can find the many configurable variables that aspell uses with the command

aspell dump config
but this will produce a confusing list of dozens of variables, most of which have uninformative names. If you just want the value of a particular variable, add its name to the above command line. For example, to find the directory that contains the main default dictionary, do
aspell dump config dict-dir
(which gives /usr/lib/aspell); and to verify that aspell knows your home directory, do
aspell dump config home-dir
And you can check the name of your personal dictionary, which should be in that directory, with
aspell dump config personal

Finally, if you want to spell-check a different language whose code is XX, you can add the option “-l XX” to those aspell commands. For example,

aspell -l hr dump config personal
will tell you the name you should use for a personal wordlist for Croatian.
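If you look these values up often, the individual commands above can be wrapped in a small helper. This is only a sketch: the function name show_aspell_paths is my own invention, and it does nothing more than loop over the three variable names already mentioned.

```shell
# show_aspell_paths LANG: print three configuration values for one language.
# The function name is hypothetical; the keys are the ones discussed above.
show_aspell_paths() {
  for key in dict-dir home-dir personal; do
    printf '%s = %s\n' "$key" "$(aspell -l "$1" dump config "$key")"
  done
}

# Example use (assumes aspell is installed):
#   show_aspell_paths en
#   show_aspell_paths hr
```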

Kinds of dictionary files

If you have listed some of those dictionary directories, you may have noticed that the files in the main dictionary directory /usr/lib/aspell have a variety of filename extensions, such as .rws and .multi; while your personal dictionary names all end with .pws. These extensions tell aspell about differences in the contents of the files.

As I mentioned above, the personal wordlist names always end with .pws. These files (apart from the one-line header) contain only lists of correctly-spelled words, and are plain text files, readable and editable with any text editor.

But the system dictionaries that end in .rws are large files that contain many thousands of words. Back around the turn of the century, when aspell was first introduced, personal computers were still fairly limited in mass storage, so these files of several tens or hundreds of kilobytes were stored in a special compressed format to save disk space.

One of the tricks used to make dictionaries compact is to separate the grammatical inflections, which change word forms in languages like Latin and Russian, from word stems. By separating stems from prefixes and suffixes, the main dictionary can be reduced to some tens of thousands of words, plus a much shorter file describing the rules for conjugations and declensions. In aspell, these rules are put into a separate file, named XX_affix.dat for the language coded as XX. (Despite the misleading “dat” extension, these are not binary data files, but plain text.) On Debian, you'll find one affix file in the /usr/lib/aspell directory for each language you have installed.

These affix files gather all the affixes that belong to each inflection pattern as a set of prefixes or suffixes, denoted by a single letter. These one-letter codes are then attached to the stem or uninflected form in a “compressed word list”, stored as a binary file with a .cwl extension. You'll find gzipped versions of some of these files in the /usr/share/aspell/ directory.

Languages

To see what language aspell will use by default, do:
aspell dump config lang
In addition to the default language's 2-letter code, you may see some additional information in the output. Often, the code is followed by an underscore and the code for a regional dialect.

Many languages have regional dialects with slightly different spelling conventions. For example, it's well known that American English uses different spelling conventions than British English.

Dictionary varieties

Sometimes there is additional specialization of the dictionary that aspell calls a “variety”. Use
aspell dump config variety
to see if you have a variety configured; there may be some string that provides additional information, or it may just be empty.

Dictionary sizes

Another fine distinction refers to the dictionary's size. Use
aspell dump config size
to see what size your default dictionary has. This is not the number of words in the dictionary, but a peculiar subjective designation on a scale from 0 to 100; the usual default size is "60".

If you find that aspell doesn't recognize a lot of correctly-spelled words in the text you are checking, you should increase the size setting in your configuration file.

If you need a bigger dictionary, but no larger size is available, you will just have to augment the personal word list in your home directory.

Special dictionaries

General

If you find that aspell suggests many alternative spellings that look obviously wrong, your dictionary size may be too big; try reducing it. But the trouble may also be that you are just asking aspell to try too hard to generate alternatives to words it does not recognize. In that case, change the default setting of the sug-mode variable from "normal" to "fast", or otherwise tweak the selection of suggestions (such as by changing the "sug-edit-dist" variable from 2 to 1, or changing "sug-typo-analysis" from "true" to "false").
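Before changing your configuration file, it may help to compare suggestion modes on a single word from the command line. The sketch below feeds one word to aspell's ispell-compatible pipe mode (the -a option) and sets sug-mode for that run only; the function name suggest and the English default are my own assumptions.

```shell
# suggest WORD [MODE]: show aspell's pipe-mode reply for one word.
# MODE is a sug-mode value such as "normal" or "fast"; default "normal".
# The function name is hypothetical; English is assumed.
suggest() {
  printf '%s\n' "$1" | aspell -a -l en --sug-mode="${2:-normal}"
}

# Example use (assumes aspell and an English dictionary are installed):
#   suggest recieve normal
#   suggest recieve fast
```

Running the same word through both modes shows directly whether the shorter list is the one you want before you make the change permanent.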

Local

If you need more correctly-spelled words only for special purposes, remember that you can put a personal wordlist file anywhere, and add it to aspell's vocabulary only when you need it. Just make your supplemental file in a directory where it's needed, make sure it has the required .aspell.XX.pws name, and explicitly call for it in either the command-line arguments or the ASPELL_CONF environment variable. In the command line, just add "--add-p ./.aspell.XX.pws" to aspell's arguments.

But be careful: if you ask aspell to use a file that doesn't exist, it will complain. So don't call for a local spelling list that doesn't exist. (You can always make a dummy file that just has the 1-line header but no list of words, if you're not sure you need it.)
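The dummy-file trick is easy to automate. Here is a minimal sketch, with a function name of my own choosing, that creates a header-only wordlist when the file is missing and leaves an existing file untouched.

```shell
# ensure_wordlist FILE LANG: if FILE does not exist, create it containing
# only the one-line header for LANG. An existing file is left unchanged.
# The function name is hypothetical.
ensure_wordlist() {
  [ -f "$1" ] || printf 'personal_ws-1.1 %s 0 utf-8\n' "$2" > "$1"
}

# Example use, before asking aspell for a local list:
#   ensure_wordlist ./.aspell.en.pws en
```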

General advice

As the original documentation for aspell is so confusing, the best place to turn for help on a Debian system is the command
info aspell
which is a clearer reference manual than the old "man aspell" page.

Debian now has "man" pages for the associated commands that come in the aspell package, like preunzip and prezip-bin. These help you convert among the several dictionary-file formats.

 

Copyright © 2021, 2023 Andrew T. Young

