Pango+TeX Follow-Up
A few readers asked me to elaborate on why I think a Pango-enabled TeX is useful, how it would work, and what to anticipate. In this post I'm calling such a combination PangoTeX. To understand why PangoTeX is useful, we need to know what each of them is good at, and what not.
TeX is pretty good at breaking paragraphs into lines and composing paragraphs into pages. What it is not good at is complex text layout, meaning non-one-to-one character-to-glyph mapping and more than one text direction. e-TeX has primitives to set text direction, but not more.
Pango, on the other hand, knows nothing about pages and columns. It does break paragraphs too, but nothing to envy. Patches exist that implement the TeX h&j algorithm for Pango, but that doesn't matter here. What Pango is pretty good at is the character-to-glyph mapping, where it implements the OpenType specification for quite a bunch of scripts.
Pango contains modules for the following scripts: Arabic, Hebrew, Indic, Khmer, Tibetan, Thai, and Syriac, and a Burmese module has recently been proposed. Other than that, it has a module to use Uniscribe on Windows, and there's also a module available on the Internet (that may be integrated into Pango soon) to use the SIL Graphite engine.
So the plan is to make TeX pass streams of characters to Pango and ask it to shape them. The way XeTeX is implemented is that TeX passes words of text to the higher-level rendering engine, and all it asks for is the width the word would occupy. XeTeX has backends for Apple's ATSUI and ICU at this time.
Pango has the advantage of abstracting OpenType in general, Uniscribe, Graphite, and hopefully ATSUI in the future. So it would be enough to only have a Pango backend.
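To make that concrete, here is a minimal sketch of the kind of question TeX would ask Pango: given a word, how wide will it be once shaped? It uses Pango's real layout API, but the setup of the PangoContext (which is backend-specific) is assumed to happen elsewhere, and the TeX-side plumbing is left out:

    /* Minimal sketch: ask Pango how wide a shaped word would be.
     * The PangoContext is assumed to be created elsewhere; its setup
     * depends on the backend (FreeType, Win32, ...). */
    #include <pango/pango.h>

    static int
    word_width (PangoContext *context, const char *word)
    {
      PangoLayout *layout = pango_layout_new (context);
      int width, height;

      pango_layout_set_text (layout, word, -1);
      pango_layout_get_size (layout, &width, &height); /* in Pango units */

      g_object_unref (layout);
      return width; /* divide by PANGO_SCALE for points */
    }

Pango does the itemization, bidi resolution, and OpenType shaping behind that one call, which is exactly the division of labor described above: TeX keeps the line- and page-breaking, Pango keeps the shaping.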
That level of integration is pretty much what XeTeX does, which is quite useful on its own, but that doesn't mean it should be the end of it. Much more can be done by using Pango's language/script detection features, its bidirectional handling engine, etc., such that you don't have to mark left-to-right and right-to-left runs manually. Moreover, while doing this, we would introduce the Unicode Character Database to TeX, such that (for example) character category codes would be automatically set for the whole BMP range, and you could query other properties of characters should the need arise.
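As an illustration of what UCD support could buy us, here is a hypothetical sketch that derives a default TeX category code from GLib's Unicode character data, which Pango itself builds on. The mapping below is illustrative, not a real catcode table:

    /* Hypothetical sketch: a default TeX catcode from the UCD, via
     * GLib's character data.  The mapping is illustrative only. */
    #include <glib.h>

    static int
    default_catcode (gunichar c)
    {
      switch (g_unichar_type (c))
        {
        case G_UNICODE_LOWERCASE_LETTER:
        case G_UNICODE_UPPERCASE_LETTER:
        case G_UNICODE_TITLECASE_LETTER:
        case G_UNICODE_MODIFIER_LETTER:
        case G_UNICODE_OTHER_LETTER:
          return 11;            /* TeX catcode "letter" */
        case G_UNICODE_SPACE_SEPARATOR:
          return 10;            /* "space" */
        default:
          return 12;            /* "other" */
        }
    }

The same g_unichar_type() call answers the non-spacing-mark question (G_UNICODE_NON_SPACING_MARK) that, as we'll see below, Omega makes you answer by hand.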
The way Omega approached the problem of Unicode+TeX was to add a push-down automaton layer that could convert the character stream at as many stages as desired. So you could have an input layer to convert from legacy character sets to Unicode, then a complex shaping engine, and finally a conversion to the font encoding. The problem with this approach is that it's very complex, so it introduced a zillion bugs. Of course, bugs can be fixed, but then comes the next problem: duplication. The powerful idea behind having shaping information in OpenType fonts was left unused there. For each font you had to implement the shaping logic (ligatures, etc.) in an Omega Transformation Format (OTF) file. Moreover, the whole machinery was more like Apple's AAT than OpenType, which means it has no support for individual scripts: if you want to do Arabic shaping, you have to code all the joining logic in OTF. Neither did it provide Unicode character properties: if you need to know whether a character is a non-spacing mark, you have to list all NSMs in an OTF file. If you wanted to normalize the string, well, you had to code normalization in OTF, which is quite possible and interesting to code, but don't ask me about performance... Putting all this together, I believe that Omega cannot become a unified Unicode rendering engine without importing support from outside libraries. And once you do import some support from Pango and gNUicode, for example, all of a sudden you don't need all those push-down automata anymore. A charset-conversion input layer that uses iconv is desirable, though.
About the output layer: XeTeX generates an extended DVI and converts it to PDF afterwards, using a backend-specific extended DVI driver. We can do that with Pango+Cairo too, writing to a PDF or PS backend. Or, since Pango computes glyph-strings when analyzing the text, we may not even need Pango when converting the DVI to PDF. Anyway, what I'm more interested in is extending pdfTeX directly.
We don't really need DVI these days. As I said before, the assumptions Knuth made have proved wrong in the new millennium. It's not like the same DVI renders the same everywhere; no, you need the fonts. That's why the fonts used in today's TeX systems are separate from the fonts you use to render your desktop to your screen: they are isolated and packaged separately in a TeX distribution, such that almost everyone has the same set of fonts... This should change too, with PDF-only output and some kpathsea configuration. You would still need the fonts to compile the TeX sources, but the output would be portable.
To conclude, I have changed my mind about cleaning up and adding UCD support to Omega, and believe that we badly need a pdfTeX+Pango engine to go with our otherwise-rocking GNOME desktop.
That's all for now. I'm very much interested in hearing some feedback.