Principles of LaTeX formatting

Introduction

I'm an unhappy user of the LaTeX text-formatting system. The journals I publish in all require it. But I hate it.

It's hard to find the information needed to understand how to use LaTeX. Lamport's book lacks much of what you need; the on-line documentation focuses on minutiae, and does not provide a conceptual orientation to the typesetting system. I'll try to provide that orientation here. If you find LaTeX as confusing and user-hostile as I do, this page is for you.

Overview

TeX Live

LaTeX is only one part of a vast text-formatting system adopted by Debian: TeX Live, which is installed as the Debian package texlive. Its main parts are documented in PDF files that can be displayed with its lookup tool, texdoc. If you're on a Debian system, try entering

	texdoc texdoc

at a shell prompt to see how that works.

Then the command

	texdoc texlive

will show the documentation for texlive itself. (You can skip its Chapter 3 about how to install the package, if you've used Debian's apt tool to install texlive.) Installing the texlive package actually installs several other packages, so you get dozens of TeX-related commands, like latex and bibtex and pdflatex, as well as management commands like tlmgr, recommended fonts, and other utilities.

Now, let's concentrate on LaTeX.

LaTeX

Superficially, LaTeX appears simple: take the text you want to typeset, add some formatting commands to it, run the marked-up text through the latex program, and the output (after a little further processing) is a nicely-typeset PostScript or PDF file that can be printed. It seems like a fancier version of HTML. But this notion is oversimplified and misleading.

The input to the latex command is a flat ASCII file, which contains both text to be formatted, and instructions to the formatter (LaTeX/TeX). The immediate output is not itself printable; it's a “device-independent” or DVI file, which must be post-processed before printing. The ultimate output should be a beautifully typeset document. But a lot happens in the meantime; if anything goes wrong, you have a mess on your hands. So you need to understand a few details.

LaTeX is really a set of macros for the underlying typesetting system, TeX, whose formatting commands are very low-level. LaTeX is intended to provide a high-level language to access the power of TeX. Unfortunately, it hides so much of what's actually going on that it's usually difficult to track down the problems that inevitably occur.

Furthermore, LaTeX and TeX aren't just markup codes like HTML. They're full-fledged programming languages. Think of their formatting commands as forming a program, whose data are the text to be set in type, and whose final output is a printed page. From this point of view, the LaTeX macros are like subroutines that you call to perform particular tasks (like setting a heading in larger type). Supplemental macro packages, like graphicx (which allows images to be added to the typeset pages), are like subroutine libraries.
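
To make the subroutine analogy concrete, here's a minimal sketch of defining a macro and then calling it. The macro name \keyword is invented for this illustration; \newcommand and \textbf are standard LaTeX.

```latex
\documentclass{article}

% Define a "subroutine" with one parameter (#1).
% \keyword is a hypothetical name chosen for this example.
\newcommand{\keyword}[1]{\textbf{#1}}

\begin{document}
A \keyword{macro} is expanded wherever it is called,
much as a subroutine call runs the subroutine's body.
\end{document}
```

Changing the one definition in the preamble changes how every \keyword in the document is set, which is the main reason to bother with such macros.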

Another programming language related to typesetting and printing is PostScript. It, too, is a complete programming language; in many ways, a more complete one than TeX. And TeX documents often are converted to PostScript for printing. But, while PostScript is intimately connected to the marks that get put on the page, TeX and LaTeX know nothing of the actual glyphs and graphics to be printed or displayed. They deal only with how things are arranged or “laid out” on the page, not the forms of the things themselves. So, while a PostScript font describes in excruciating detail the actual shapes of letters and other characters, TeX understands only their bounding boxes. The characters themselves are hidden from TeX in font files, instead of being available to the program as they are in PostScript.

But, although fonts are central to typesetting, the TeX user doesn't see much information related to fonts. Unfortunately, fonts come in several quite different forms; so there have to be programs — invoked automatically by LaTeX — that convert raw font data to the form TeX needs. Indeed, there is a whole infrastructure of scripts and libraries that are used by LaTeX, which relies on a special system of directories and subdirectories and indexes to find everything it needs.

That's fine when everything works smoothly in the background. But problems arise if this system is misconfigured in any way; not everything that goes wrong is due to errors in the user's input.

Let's begin with that input file. In what follows, I'll call it ms.tex.

Structure of the input file

The input file, ms.tex, contains both high-level instructions to the LaTeX interpreter, and the data (i.e., text) that TeX ultimately formats. This mixture of program and data can be confusing.

Modes

It's useful to think of the input as switching LaTeX back and forth between different modes, like the vi editor. Where vi uses the <ESC> key to switch from input to command mode, LaTeX uses the backslash character ( \ ) to go from text-input to command mode. And, like vi, TeX has more than one command mode. Just as vi has both visual and ex (and other) modes, TeX and LaTeX have horizontal, vertical, math, and other modes.

But in vi, the editor switches from command to input mode only when explicitly told to do so. LaTeX and TeX aren't as explicit: when they finish processing a command, the formatters return to input mode automatically. This makes the input file more difficult to read, because a command can be terminated by just a blank. So it's a little hard to keep track of what mode you're in at a particular place in the input document.
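
Here's a small sketch of what "terminated by just a blank" means in practice. It uses the standard \LaTeX logo command; the sentences are just filler.

```latex
\documentclass{article}
\begin{document}
% The blank after \LaTeX only ends the command name; it is not typeset,
% so the logo and the word "is" come out with no space between them.
\LaTeX is a macro package.

% An empty group ends the command name instead, so the blank that
% follows it survives as an ordinary inter-word space.
\LaTeX{} is a macro package.
\end{document}
```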

TeX recognizes math mode in a special way: material to be set as mathematics is delimited by $ signs, which toggle between normal-text input and math-input modes.
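
A minimal sketch of the $ toggle, with an invented example sentence:

```latex
\documentclass{article}
\begin{document}
% Each $ toggles the mode: the first one enters math mode,
% the second returns to ordinary text.
A circle of radius $r$ has area $\pi r^2$ and circumference $2\pi r$.
\end{document}
```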

Groups

One thing that makes it easier (for both you and TeX) to understand where one mode ends and another begins is to use curly braces — { and } — to delimit groups. For example, if you type

some text in Roman type and then {\it something in Italics}

you get

some text in Roman type and then something in Italics

as the formatted output, with the last three words set in italic type.

Unfortunately, curly braces are used for other purposes as well. They're used to delimit the arguments to some of those formatting subroutines (or macros). A common example would be

\section{This is a section heading}

which sets the words This is a section heading in larger, boldface type. The text of the heading, in braces, is the argument of the \section command, which acts like a LaTeX subroutine.

This “overloading” of braces is confusing: the command to switch to a section heading is placed outside the braces, but the command to switch typefaces is inside the braces. Just another example of bad user-interface design.
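
The two uses of braces can be put side by side. This sketch uses \itshape and \textit, the current LaTeX equivalents of the older \it switch; both lines should typeset identically.

```latex
\documentclass{article}
\begin{document}
% Declaration form: the switch sits inside the braces, and the group
% limits how far its effect reaches.
Some text in Roman type and then {\itshape something in italics}.

% Command-with-argument form: the command sits outside the braces,
% like \section, and the braces delimit its argument.
Some text in Roman type and then \textit{something in italics}.
\end{document}
```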

Environments

One particularly important kind of formatting subroutine is called an environment. These are usually delimited by pairs of LaTeX commands that look like

\begin{environment}
input to be formatted in a particular way goes here
\end{environment}

An environment sets up a particular style in which text is formatted. Some environments make lists: itemize, enumerate, description. Some control alignment: center, flushleft, flushright. Some make other types of displayed matter: figure, table, equation. There are many others. (Again, the variety of treatments is potentially confusing.)

The most important environment is the one that contains all the text of a document: you might have noticed that it must be contained within a

\begin{document} . . . \end{document}

pair. (The document environment sets up ordinary paragraph-style text formatting.) As other environments can be nested inside the document environment, you could (rightly) expect that further nesting is possible. A list, or a displayed equation, might be inside a table element, for example.
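
Here's a sketch of that nesting, using only standard environments; the text in it is placeholder material.

```latex
\documentclass{article}
\begin{document}
Ordinary paragraph text.
\begin{enumerate}
  \item First numbered item, containing a nested list:
    \begin{itemize}
      \item a bulleted sub-item
      \item another, with a displayed equation inside it:
        \begin{equation}
          a^2 + b^2 = c^2
        \end{equation}
    \end{itemize}
  \item Second numbered item.
\end{enumerate}
\end{document}
```

The indentation is only for the human reader; LaTeX pays attention to the \begin and \end pairs, which must close in the reverse of the order in which they were opened.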

Preamble

Everything that precedes the \begin{document} command is called the preamble of the document. It tells LaTeX what macro packages to load. For example,

\documentclass{article}

loads the article document class (every document must name a class, which is itself a large set of macros); and

\documentclass{article}
\usepackage{graphicx}

tells it to load an additional macro package needed to include illustrations. (Here, the graphicx package makes available commands like \includegraphics. You can think of the macro package as a subroutine library that makes available special operations for some particular purpose.)

Often the preamble also contains a few special commands or definitions to alter the defaults of the plain-vanilla macros.
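
As a sketch of such a preamble, the file below alters two defaults of the article class. The particular choices (no paragraph indentation, a retitled abstract) are arbitrary examples, not recommendations.

```latex
\documentclass{article}
\usepackage{graphicx}

% Alter two defaults of the article class:
\setlength{\parindent}{0pt}              % no paragraph indentation
\renewcommand{\abstractname}{Summary}    % retitle the abstract

\begin{document}
\begin{abstract}
This abstract is headed ``Summary''.
\end{abstract}
Body text.
\end{document}
```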

LaTeX syntax

Optional arguments

Even in these first lines of ms.tex, we encounter some of the quirks of LaTeX syntax, because the \documentclass and \usepackage commands usually have options. These optional arguments are placed in square brackets just before the {. . .} main argument. For example,

\documentclass[12pt]{article}
\usepackage[dvips]{graphicx}

modifies the typesetting style by using 12-point body type instead of the default 10-point; and the [dvips] option to the graphicx macro package tells it to add information that will be needed by a post-processor called dvips (more about this later).

Some formatting commands have no required arguments (in braces), but may take an optional argument (in square brackets). For example, the \\ command (which forces a line break) can be followed by an optional vertical space in square brackets; so

\\[.5cm]

ends the current line and adds half a centimeter of vertical space before the next one.
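
A few more optional arguments, sketched with standard commands (\sqrt, \item, and \\); the content is invented for illustration.

```latex
\documentclass{article}
\begin{document}
% \sqrt takes one required argument; the optional one sets the root index.
The square root is $\sqrt{x}$ and the cube root is $\sqrt[3]{x}$.

% \item has no required argument, but accepts an optional label.
\begin{itemize}
  \item a default bullet
  \item[*] an item labelled with an asterisk instead
\end{itemize}

First line\\[.5cm]
second line, half a centimeter further down than usual.
\end{document}
```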

Alternate forms

Another irritating inconsistency is the use of alternate forms of common commands. For example, the \section command mentioned above has an alter ego called \section*. (The difference is that \section{Section heading} sets a running number in front of its “Section heading” text, while \section*{Section heading} produces a heading without a number.)

One might have expected a sensible programmer to have allowed for such variations by including another regular argument, instead of modifying the name of the command with an asterisk. Perhaps it's useful to think of this as an implicit logical argument, whose value (true or false) depends on the presence or absence of an asterisk at the end of the name.

To compound confusion, some commands take both optional arguments and these “star forms” (as they're called in LaTeX jargon).
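
Here's a sketch that puts optional arguments and star forms together, using \section and \\ as they're defined in standard LaTeX. The headings are made up.

```latex
\documentclass{article}
\begin{document}
\tableofcontents

% Optional argument: a short title for the table of contents.
\section[Short title]{A much longer title that appears in the text}

% Star form: no number, and no table-of-contents entry.
\section*{Acknowledgments}

% \\ accepts both at once: the star forbids a page break after
% the line break, and the optional argument adds vertical space.
First line\\*[1ex]
second line.
\end{document}
```

The table of contents needs a second run of latex before it shows up, because it's read from a file written during the previous run.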

Comments

If LaTeX is a programming language, you can expect it to have comment statements, so the programmer can explain what's going on. Like PostScript, TeX and LaTeX use the percent sign ( % ) to indicate comments; everything from % to the end of the line is ignored.

This can have side effects. In input mode, the end of a line (like spaces and tabs) is treated as “white space”, which normally becomes just an inter-word space in the typeset text. But the white space implied by the end of a line is commented out if the last character on a line is %. So

You can comment out white%
space at the end of a line.

typesets as

You can comment out whitespace at the end of a line.

Actually, all following whitespace, including any at the start of the next line, is obliterated by this trick. The same thing happens in command mode, where a % at the end of a line is sometimes used to suppress whitespace.

The percent sign commonly causes problems when you want to use it literally in the input text. To typeset “99% of LaTeX users have problems,” you have to type 99\% of LaTeX users . . . . In fact, \% is really a TeX command that means “set a percent sign”.
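
The sketch below shows the three uses of % side by side: a whole-line comment, a trailing comment, and the whitespace-suppressing % inside a macro definition. The macro name \shout is hypothetical, made up for this example.

```latex
\documentclass{article}

% \shout is a hypothetical macro, defined only for this example.
% Each trailing % hides the end of its line, which would otherwise
% turn into an unwanted space around the argument.
\newcommand{\shout}[1]{%
  \textbf{#1}%
}

\begin{document}
% This whole line is a comment and produces no output.
A literal percent sign: 99\% of users. % this remark is ignored too
Try \shout{this}!
\end{document}
```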

Other special characters

This technique of using the backslash to avoid (or “escape”) the usual significance of special characters applies to most of the others:

char	literal	special meaning (unescaped)
%	\%	introduces a comment
#	\#	introduces a macro's formal parameters
$	\$	math-mode delimiter
&	\&	data separator in tables
_	\_	introduces a subscript in math mode
{	\{	begins a group
}	\}	ends a group
^	\^{}	introduces a superscript in math mode
~	\~{}	non-breakable space in text
\	\textbackslash	begins a command

Most of these obey the rule: put a backslash before a special character, to make it appear literally in the formatted text. The last three don't quite. \^ and \~ are accent commands, so they need an empty pair of braces after them to produce the bare character. And we can't double the backslash itself, because \\ is the special command to force a line break (yet another confusing inconsistency); use \textbackslash in text, or \backslash in math mode.
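
Here's a sketch that sets every one of the special characters literally. The sentences are invented. Note that the caret and tilde are written with an empty group, because \^ and \~ by themselves are accent commands, and that \textbackslash is the text-mode command for a backslash.

```latex
\documentclass{article}
\begin{document}
% Seven specials become literal with a simple backslash prefix.
Prices rose 5\% to \$20 for item \#3 at Smith \& Sons;
the file is named my\_file, and sets are written \{a, b\}.

% \^ and \~ are accent commands, so the bare characters need an
% empty group; the backslash itself has a named command.
A caret \^{} a tilde \~{} and a backslash \textbackslash{} in text.
\end{document}
```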

Errors and debugging

Usually, the first time you try to run latex on your input file, you'll get an error message. Unfortunately, most of the error messages are unhelpful or misleading. In fact Alan Hoenig, in his book TeX Unbound, refers to

. . . error messages that are mystical, opaque, and vaguely frightening. Experience soon teaches you that the best thing is to ignore the messages . . . .

The error messages are certainly one of the most frustrating features of TeX and LaTeX. To begin with, there's no useful indication of how to recover from an error, or how to quit and try again: standard responses such as quit and bye just generate new errors.

And it doesn't help that TeX and LaTeX present their errors in slightly different formats. However, Sections 8.2 and 8.3 of Lamport's book do explain some of the commoner errors of LaTeX and TeX, respectively.

Responding to errors: how to quit

The most useful response is H (for HELP). But entering a question-mark will elicit some possibly useful alternatives. Among them, LaTeX will tell you that X (for eXit, of course) will let you get out.

Interpreting error messages

There is something to be learned from error messages, but it takes some practice to learn. First, there's the line of ms.tex that caused the error. You'll see a marker like l.163 at the left margin, just above the ? prompt. That first character is a lowercase letter l (for “line”), not the digit 1, so the marker means line 163 of the file being read; really obvious, right?

Of course, that's just the line where the error was detected. The actual cause of the error might be many lines above it. Sometimes there's a mis-typed command name, and in these cases, the error message is usually useful: it breaks the input line where the error was found, so look closely at the break in the line to see if something is obviously wrong there.
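
To see the mis-typed-command case for yourself, here's a deliberately broken sketch. The command \sectoin is a typo for \section, planted on purpose.

```latex
\documentclass{article}
\begin{document}
% \sectoin is a deliberate typo for \section.
\sectoin{Introduction}
Some text.
\end{document}
```

Running latex on this should stop with the message "! Undefined control sequence." and show line 4 broken into two pieces immediately after \sectoin, which is exactly where the trouble lies.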

TeX vs. LaTeX

Remember that LaTeX is just a front-end to TeX, which does the real work. Both TeX and LaTeX generate error messages. You can tell which is which by looking at the first line of the message. An error that begins with

! LaTeX Error:

is a LaTeX error. But if it begins with just ! and doesn't actually say LaTeX, it's an error detected by TeX.

TeX error messages are the most mysterious, because they often refer to cryptic internal details of LaTeX macros. If you find TeX complaining about some command you don't recognize, and which is certainly not in your input file, it's likely that something you did (often much earlier) has so confused LaTeX that it has expanded a macro in a way that's nonsense to TeX itself. Such errors can be difficult to figure out.
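
Here's a sketch of one way to provoke such a message: a heading whose closing brace has been left off on purpose. On the installations I'd expect, TeX reports a runaway argument and names an internal LaTeX macro (something like \@xdblarg, though the exact name may depend on the LaTeX version) that appears nowhere in the input file.

```latex
\documentclass{article}
\begin{document}
% The closing brace of this heading is missing (deliberately).
\section{Results

TeX keeps reading this text as part of the heading's argument,
looking for the missing brace, and only gives up much later.
\end{document}
```

The line number in the message points at where TeX gave up, not at the \section line where the brace was actually forgotten.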

Fortunately, there is a good explanation of both TeX and LaTeX error messages at MIT's website.

Prevention

A good way to catch potential mystifying errors before they can elicit puzzling messages is to run your file through lacheck, the LaTeX syntax checker. It won't catch all syntax errors, but it does look for simple things like mis-matched braces, which are often the cause of mysterious error messages from TeX. (You'll need to install the Debian package lacheck to get this utility; then see man lacheck for details.)

What happens after LaTeX

Running the command line

latex ms.tex

does not produce a printed page. Instead, it produces a “device-independent” file, ms.dvi. Why?

When TeX and LaTeX were written, computers were slow. It made sense to split the text-formatting task into two parts: determining where to put the characters on a page, and actually putting them there. And printers were slow, so it made sense (and still does) to check the formatted output on the terminal screen before committing it to paper.

The ms.dvi file doesn't contain any actual characters. It just tells some post-processing program where to put them. Then you can use one post-processor to view the result on the screen, and a different one to print a hard copy, without having to re-run latex.

This division of labor is examined more thoroughly in the next section of this discussion, which goes into more technical detail.
