Pango+TeX Follow-Up
A few readers asked me to elaborate on why I think a Pango-enabled TeX is useful, how it would work, and what to anticipate. In this post I'm calling such a combination PangoTeX. To understand why PangoTeX is useful, we need to know what each of them is good at, and what not.
TeX is pretty good at breaking paragraphs into lines and composing paragraphs into pages. What it is not good at is complex text layout, meaning non-one-to-one character-to-glyph mapping and more than one text direction. e-TeX has primitives to set text direction, but not more.
Pango, on the other hand, knows nothing about pages and columns. It does break paragraphs too, but nothing to envy. Patches exist that implement the TeX h&j algorithm for Pango, but that doesn't matter here. What Pango is pretty good at is the character-to-glyph mapping, where it implements the OpenType specification for quite a bunch of scripts.
Pango contains modules for the following scripts: Arabic, Hebrew, Indic, Khmer, Tibetan, Thai, and Syriac, and a Burmese module has recently been proposed. Other than that, it has a module to use Uniscribe on Windows, and there's also a module available on the Internet (that may be integrated into Pango soon) to use the SIL Graphite engine.
So the plan is to make TeX pass streams of characters to Pango and ask it to shape them. The way XeTeX is implemented is that TeX passes words of text to the higher-level rendering engine, and all it asks for is the width the word would occupy. XeTeX has backends for Apple's ATSUI and ICU at this time.
Pango has the advantage of abstracting OpenType in general, Uniscribe, Graphite, and hopefully ATSUI in the future. So it would be enough to only have a Pango backend.
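To make that concrete, here is a minimal sketch of the kind of question TeX would ask Pango: given a word, how wide will it be once shaped? It uses Pango's real layout API, but the setup of the PangoContext (which is backend-specific) is assumed to happen elsewhere, and the TeX-side plumbing is left out:

    /* Minimal sketch: ask Pango how wide a shaped word would be.
     * The PangoContext is assumed to be created elsewhere; its setup
     * depends on the backend (FreeType, Win32, ...). */
    #include <pango/pango.h>

    static int
    word_width (PangoContext *context, const char *word)
    {
      PangoLayout *layout = pango_layout_new (context);
      int width, height;

      pango_layout_set_text (layout, word, -1);
      pango_layout_get_size (layout, &width, &height); /* in Pango units */

      g_object_unref (layout);
      return width; /* divide by PANGO_SCALE for points */
    }

Pango does the itemization, bidi resolution, and OpenType shaping behind that one call, which is exactly the division of labor described above: TeX keeps the line- and page-breaking, Pango keeps the shaping.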
That level of integration is pretty much what XeTeX does, which is quite useful on its own, but that doesn't mean it should be the end of it. Much more can be done by using Pango's language/script detection features, its bidirectional handling engine, etc., such that you don't have to mark left-to-right and right-to-left runs manually. Moreover, while doing this, we would introduce the Unicode Character Database to TeX, such that (for example) character category codes would be automatically set for the whole BMP range, and you could query other properties of characters should the need arise.
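As an illustration of what UCD support could buy us, here is a hypothetical sketch that derives a default TeX category code from GLib's Unicode character data, which Pango itself builds on. The mapping below is illustrative, not a real catcode table:

    /* Hypothetical sketch: a default TeX catcode from the UCD, via
     * GLib's character data.  The mapping is illustrative only. */
    #include <glib.h>

    static int
    default_catcode (gunichar c)
    {
      switch (g_unichar_type (c))
        {
        case G_UNICODE_LOWERCASE_LETTER:
        case G_UNICODE_UPPERCASE_LETTER:
        case G_UNICODE_TITLECASE_LETTER:
        case G_UNICODE_MODIFIER_LETTER:
        case G_UNICODE_OTHER_LETTER:
          return 11;            /* TeX catcode "letter" */
        case G_UNICODE_SPACE_SEPARATOR:
          return 10;            /* "space" */
        default:
          return 12;            /* "other" */
        }
    }

The same g_unichar_type() call answers the non-spacing-mark question (G_UNICODE_NON_SPACING_MARK) that, as we'll see below, Omega makes you answer by hand.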
The way Omega approached the problem of Unicode+TeX was to add a push-down automaton layer that could convert the character stream at as many stages as desired. So you could have an input layer to convert from legacy character sets to Unicode, then a complex shaping engine, and finally a conversion to the font encoding. The problem with this approach is that it's very complex, so it introduced a zillion bugs. Of course, bugs can be fixed, but then comes the next problem: duplication. The powerful idea behind having shaping information in OpenType fonts was left unused there. For each font you had to implement the shaping logic (ligatures, etc.) in an Omega Transformation Format (OTF) file. Moreover, the whole machinery was more like Apple's AAT than OpenType, which means it has no support for individual scripts: if you want to do Arabic shaping, you have to code all the joining logic in OTF. Neither did it provide Unicode character properties: if you need to know whether a character is a non-spacing mark, you have to list all NSMs in an OTF file. If you wanted to normalize the string, well, you had to code normalization in OTF, which is quite possible and interesting to code, but don't ask me about performance... Putting all this together, I believe that Omega cannot become a unified Unicode rendering engine without importing support from outside libraries. And once you do import some support from Pango and gNUicode, for example, all of a sudden you don't need all those push-down automata anymore. A charset-conversion input layer that uses iconv is desirable, though.
About the output layer: XeTeX generates an extended DVI and converts it to PDF afterwards, using a backend-specific extended DVI driver. We can do that with Pango+Cairo too, writing to a PDF or PS backend. Or, since Pango computes glyph-strings when analyzing the text, we may not even need Pango when converting the DVI to PDF. Anyway, what I'm more interested in is extending pdfTeX directly.
We don't really need DVI these days. As I said before, the assumptions Knuth made have proved wrong in the new millennium. It's not like the same DVI renders the same everywhere; no, you need the fonts. That's why the fonts used in today's TeX systems are separate from the fonts you use to render your desktop to your screen: they are isolated and packaged separately in a TeX distribution, such that almost everyone has the same set of fonts... This should change too, with PDF-only output and some kpathsea configuration. You would still need the fonts to compile the TeX sources, but the output would be portable.
To conclude, I have changed my mind about cleaning up and adding UCD support to Omega, and believe that we badly need a pdfTeX+Pango engine to go with our otherwise-rocking GNOME desktop.
That's all for now. I'm very much interested in hearing some feedback.