Putting the cork back on the bottle: Improving Unicode support in TeX extensions

Mojca Miklavec*, Arthur Reutenauer**
* University of Ljubljana, ** GUTenberg, France

In the TeX world, the name of Cork is associated with a standardization effort dating back to 1990, the Cork font encoding, which can be used for most European languages written in the Latin script. At about the same time, though, a much wider standardization effort was initiated, as the Unicode Consortium was created to devise a universal character set suitable for any language and writing system. Of course, it wasn’t long before people felt the need to support Unicode in TeX-like systems.

How far are we today? The latest extensions to the TeX engine are all labelled as “supporting Unicode”, but on closer inspection this label turns out to be rather imprecise: does it mean enabling UTF-8 input, handling multibyte characters, or implementing all the Unicode character properties and algorithms?

In the framework of Google Summer of Code, one of us (Arthur) is sponsored to improve Unicode support in TeX. The original accepted proposal covered three aspects: combining characters, the bidirectional algorithm, and line-breaking properties. The work targets LuaTeX and XeTeX, which both handle multibyte characters and UTF-8 input natively.

Combining characters are Unicode’s diacritical marks: you put them after a character (called “base character”) to add an accent to it. This is very similar to TeX’s accent primitive, except that they come after the character they apply to, and that you can stack them. There are also equivalences between sequences containing combining characters and precomposed characters, and algorithms to transform them (called normalization).
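As a rough illustration of the equivalence between combining sequences and precomposed characters, here is a minimal sketch using Python's standard `unicodedata` module. It is not part of the project described here and says nothing about how LuaTeX or XeTeX handle the matter; the sample strings and variable names are chosen purely for the example.

```python
import unicodedata

# Base character 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# The precomposed equivalent, U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "\u00e9"

# Normalization maps one form to the other:
# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# Stacking: a base character followed by two combining marks
# (diaeresis, then macron).
stacked = "a\u0308\u0304"

# The canonical combining class is 0 for base characters and
# non-zero for combining marks.
combining_classes = [unicodedata.combining(ch) for ch in stacked]

print(nfc == precomposed, nfd == decomposed, combining_classes)
```

The combining class is what distinguishes the base character from the marks that follow it, which is the information an engine would need in order to attach the accents to the right glyph.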

The bidirectional algorithm (“bidi” for short) specifies how to handle text mixing characters from writing systems with different directions, for example, English and Arabic.
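The input to the bidirectional algorithm is the bidirectional class of each character. The sketch below only looks up those classes with Python's standard `unicodedata` module; it does not implement the algorithm itself (resolving embedding levels and reordering), and the helper name `bidi_classes` and the sample string are assumptions made for the illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidirectional class) pairs for each character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letters, a space, four Arabic letters, a space, European digits.
sample = "TeX \u0639\u0631\u0628\u064a 2008"

for ch, cls in bidi_classes(sample):
    # 'L' = left-to-right, 'AL' = Arabic letter (right-to-left),
    # 'EN' = European number, 'WS' = whitespace.
    print(f"U+{ord(ch):04X} {cls}")
```

Even in this short string, four different classes appear; the algorithm has to decide, among other things, which direction the neutral spaces and the digits take from their context.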

Finally, Unicode characters have properties pertaining to line breaks: for example, characters like NBSP forbid breaks after them (very much like TeX’s “tie”); others allow breaks under certain conditions, etc.

Those three aspects were chosen because it was felt they all related to capabilities that TeX had from the beginning and that they were, therefore, among the most interesting aspects of Unicode with respect to TeX processing.

In the meantime, Mojca has initiated an effort to convert TeX Live’s hyphenation patterns to UTF-8. The challenge here is to integrate the current pattern files in such a way that they can be used both with 8-bit engines (TeX, pdfTeX) and with engines supporting native UTF-8 input (LuaTeX, XeTeX). This effort has uncovered interesting issues in the relationship between TeX and Unicode: it helps us understand to what extent, and how, TeX input can be converted to Unicode, and vice versa.