Making TeX support Unicode: The Quest of the Holy Grail

Arthur Reutenauer
Paris, France

As everybody knows, TeX is moving towards full Unicode support with the new XeTeX and LuaTeX engines, which enable UTF-8 input, multidirectional typesetting, direct use of OpenType fonts, and more. Does that mean we’re there yet? Are XeTeX and LuaTeX really Unicode-compliant?

Of course, those engines can read Unicode-encoded text directly, without resorting to complicated macros, and they can typeset Unicode characters using advanced fonts. But Unicode is much more than a pile of characters: it associates a number of properties with each one of them, and it defines transformations of character streams.
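To make "properties" concrete, here is a minimal sketch that looks up a few of them outside TeX, using Python's standard `unicodedata` module. The helper name `char_properties` and the choice of properties shown are illustrative assumptions, not something taken from the talk or from any TeX engine.

```python
import unicodedata

def char_properties(ch):
    """Return a few Unicode Character Database properties for one character.

    Hypothetical helper for illustration only.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }

# A Latin letter, a combining acute accent, and a Hebrew letter.
for ch in ("A", "\u0301", "\u05D0"):
    print(char_properties(ch))
```

The point of the sketch is only that every character carries this kind of metadata, and that a Unicode-aware engine is expected to act on it rather than treat the character as a bare glyph index.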

Examples include the handling of combining marks: when you put such a character after a “normal” one (called the base character), the combining mark is supposed to be displayed as a diacritical mark on that base. This isn’t supported at all in TeX. Of course, TeX has always been able to handle accents, but it has done so in its own way, without relating it to Unicode. Another very interesting example is bidirectional typesetting: experiments were made as early as 1987 to mix texts with different writing directions in TeX, but there has been very little effort to support the “Unicode way” of doing so, namely the so-called bidirectional algorithm.
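Both examples can be illustrated at the plain-text level with Python's standard `unicodedata` module. The sketch below is an assumption-laden illustration, not TeX code: the helpers `split_clusters` and `bidi_classes` are hypothetical names, the grouping rule is a simplification of real grapheme cluster segmentation, and the module only exposes each character's bidirectional class, which is the input to the bidirectional algorithm and not the algorithm itself.

```python
import unicodedata

def is_combining_mark(ch):
    """True for characters whose general category is a mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def split_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified on purpose: this is not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and is_combining_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def bidi_classes(text):
    """Return the bidirectional class of every character in the text."""
    return [unicodedata.bidirectional(ch) for ch in text]

decomposed = "e\u0301"  # base letter followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(split_clusters("e\u0301a"))         # two clusters: accented e, then a
print(composed == "\u00e9")               # the same text as one precomposed character
print(bidi_classes("a\u05D01"))           # Latin letter, Hebrew letter, digit
```

A Unicode-conformant renderer is expected to display the decomposed and the precomposed form identically, and to derive the display order of mixed-direction text from the per-character classes, whereas TeX traditionally relies on explicit accent macros and explicit direction commands.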

There are more such examples, and each one of them uncovers what I would like to call the “TeX way” of text processing, which is often very different from the “Unicode way”. The two approaches even look completely opposed: while TeX strives to produce the best final result and loses a lot of the original input in the process, Unicode prescribes ways to handle plain text and shows little interest in the displayed appearance. This talk is an attempt to give an overview of those two ways, and to discuss options for uniting them in order to implement Unicode support in the different “modern” TeX engines, with a particular emphasis on LuaTeX with ConTeXt as a macro package (also known as “Mark IV”).