We have seen how to use Parsley to parse BibTeX, but now we need to parse the LaTeX code inside the BibTeX entries and convert it to something else: plain text or, here, Markdown. Here again, Parsley proves really handy.

Of course, only LaTeX can actually parse full LaTeX¹, so we will build a very limited parser, which should nevertheless be enough to handle the simple LaTeX code found in a BibTeX file that aims to be portable (i.e., one with no fancy macros). Note also that we assume that the code is correct: we won't check this, and the generated Markdown may be wrong if the LaTeX code is invalid.
First we need a grammar to match the various tokens of LaTeX source code:

- `text` matches regular text with no white space and no special characters inside. This text is processed using method `text` of object `tex`, which is responsible for emulating LaTeX behaviour (see later).
- `blank` matches white space, including newlines, and is processed by `tex.blank`.
- `name` and `macro` together allow matching a call to a macro, which can be a name as in `\emph` or a single character as in `\\` or `\'`.
- `bgroup` and `egroup` respectively match `{` and `}` while calling the appropriate methods of `tex`.
- `arg` matches a macro argument, which can be a group `{...}` or a single character.
- `call` matches a call to a macro together with its arguments. Note how `(-> tex.arity(m)):n` allows us to get the number of arguments expected by `m` and bind it to `n`, which is then used in `arg{n}:a` to collect exactly `n` arguments in `a`. This is really a killer feature of Parsley!
- `comment` matches comments and drops them.
- `math` matches `$` and toggles math mode. Note that we won't parse maths; we will just typeset them in italics.
- `data` matches any of the text blocks above.
- finally, `doc` matches a full string of LaTeX source code, with possibly nested groups.
The resulting grammar is as follows:
```python
grammar = r"""
text = (anything:x ?(x not in '{\\\n\t\r %$}') -> x)+:d
       -> tex.text("".join(d))
blank = (' '|'\n'|'\t'|'\r')+:d -> tex.blank("".join(d))
name = letterOrDigit+:d -> "".join(d)
macro = '\\' ((name:n ws -> n)|anything)
bgroup = '{' -> tex.bgroup()
egroup = '}' -> tex.egroup()
arg = ((bgroup !(tex.pushpar()) doc:a !(tex.poppar()) egroup -> a)
       |anything)
call = macro:m (-> tex.arity(m)):n arg{n}:a -> tex.call(m, *a)
comment = '%' (anything:x ?(x not in '\n'))+ '\n' -> ''
math = '$' -> tex.call('math')
data = (text|blank|comment|call|math)+:d -> "".join(d)
doc = (data|(bgroup:b doc:d egroup:e -> b+d+e))+:d -> "".join(d)
"""
```
Next, we need to build the class for the object `tex` used in the grammar. Its constructor simply compiles the grammar, passing the object itself as `tex` in the parser's environment. Method `__call__` does the actual parsing by calling method `doc` of the parser.
```python
# -*- coding: utf-8 -*-
import inspect
import parsley

class LaTeX (object) :
    def __init__ (self) :
        self.parser = parsley.makeGrammar(grammar, {"tex": self})
    def __call__ (self, data) :
        self.tags = [[]]
        self.newpar = True
        self.pars = []
        self.envs = []
        return self.parser(data).doc().strip()
```
The various attributes assigned by `__call__` allow us to keep track of the current state while parsing and converting code:

- `tags` is a stack of lists corresponding to the nested groups, allowing us to close the various tags when exiting a group. For instance, if we parse `{\it hello world}`, we need to close italics at the end of the group. In this respect, our parser differs from LaTeX in that macros like `\it` have cumulative effects: for instance, `{\it hello \bf world}` will be rendered as `_hello **world**_`, which is _hello **world**_. To achieve this in LaTeX, we should have used `\itshape` and `\bfseries` instead of `\it` and `\bf`. Tags management is made using the following methods:

  ```python
  def tag (self, tag) :
      self.tags[-1].append(tag)
  def bgroup (self) :
      self.tags.append([])
      return ""
  def egroup (self) :
      tags = self.tags.pop(-1)
      return "".join(self.call(tag, False) for tag in reversed(tags))
  ```

  See how `egroup` pops the tags and calls the appropriate method using `self.call`, with a second parameter set to `False` indicating that we are closing a tag (more below).

- `newpar` and `pars` allow us to manage the empty lines between paragraphs and the white space at the beginning of each paragraph. The former is `True` when we are currently beginning a paragraph, and the latter is a stack to save/restore this information when we parse a nested `doc` in the grammar. This is the role of `tex.pushpar()` and `tex.poppar()` that we've encountered in rule `arg`. This white space management is made by the following methods:

  ```python
  def text (self, txt) :
      self.newpar = False
      return txt
  def blank (self, txt) :
      if self.newpar :
          return ""
      elif txt.count("\n") > 1 :
          self.newpar = True
          return "\n\n"
      else :
          return " "
  def pushpar (self) :
      self.pars.append(self.newpar)
  def poppar (self) :
      self.newpar = self.pars.pop(-1)
  ```

  We see how `text` sets `newpar` to `False` and how `blank` avoids white space at the beginning of paragraphs.

- Finally, `envs` is a stack of environments corresponding to the nesting of `\begin{...}` and `\end{...}` in the LaTeX source code.
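To see this paragraph handling in isolation, here is a standalone sketch of the `text`/`blank` pair above (the class name `Par` is mine, and `text`'s other duties in the full converter are omitted):

```python
class Par (object) :
    def __init__ (self) :
        self.newpar = True           # a document starts a new paragraph
    def text (self, txt) :
        self.newpar = False          # any text leaves paragraph-start state
        return txt
    def blank (self, txt) :
        if self.newpar :
            return ""                # drop white space at paragraph start
        elif txt.count("\n") > 1 :
            self.newpar = True       # an empty line separates paragraphs
            return "\n\n"
        else :
            return " "               # collapse other white space runs

p = Par()
out = (p.blank("\n  ") + p.text("hello") + p.blank(" ") + p.text("world")
       + p.blank("\n\n") + p.text("next"))
# leading blank dropped, inner blank collapsed, paragraph break kept:
# out == "hello world\n\nnext"
```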
Then comes macro emulation: `shortcut` maps 1-character macros to names, then `call(MACRO, ...)` is a dispatcher to `call_MACRO(...)`. Finally, `arity` uses module `inspect` to compute how many arguments a function `call_MACRO` expects. Note that we do not count a last argument `opentag=True` that is used for tags in groups (which avoids having both `open_TAG` and `close_TAG` methods).
```python
shortcut = {"'"  : "acute",
            "`"  : "grave",
            '"'  : "diaeresis",
            "^"  : "circumflex",
            "~"  : "tilde",
            "\\" : "newline",
            "$"  : "dollar",
            }
def call (self, name, *args) :
    name = self.shortcut.get(name, name)
    handler = getattr(self, "call_%s" % name)
    return handler(*args)
def arity (self, name) :
    name = self.shortcut.get(name, name)
    a, _, _, d = inspect.getargspec(getattr(self, "call_%s" % name))
    if a[-1] == "opentag" and d and d[-1] == True :
        return len(a) - 2
    else :
        return len(a) - 1
```
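Note that `inspect.getargspec` is Python 2 vintage (it was deprecated in Python 3 and removed in 3.11); under Python 3, the same arity computation can be sketched with `inspect.getfullargspec`. The class and handler names below are mine, chosen to mirror the `call_MACRO` convention:

```python
import inspect

class Demo (object) :
    # hypothetical handlers following the call_MACRO naming convention
    def call_emph (self, text) :
        return "_%s_" % text
    def call_it (self, opentag=True) :
        return "_"
    def arity (self, name) :
        spec = inspect.getfullargspec(getattr(self, "call_%s" % name))
        a, d = spec.args, spec.defaults
        # do not count `self`, nor a trailing `opentag=True` used for tags
        if a[-1] == "opentag" and d and d[-1] is True :
            return len(a) - 2
        else :
            return len(a) - 1

demo = Demo()
# call_emph expects one argument; call_it expects none (opentag not counted)
```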
Here comes the implementation of the macros that add accents to characters, like `\'a`. We use hard-coded dicts that should probably be extended to handle more cases, but the principle would remain the same.
```python
_acute = dict(zip("aeiouy", u"áéíóúý"))
_grave = dict(zip("aeiouy", u"àèìòùỳ"))
_diaeresis = dict(zip("aeiouy", u"äëïöüÿ"))
_circumflex = dict(zip("aeiouy", u"âêîôûŷ"))
_tilde = dict(zip("aon", u"ãõñ"))
def accent (self, text, accent) :
    return getattr(self, "_" + accent).get(text[0], text[0]) + text[1:]
def call_acute (self, text) :
    return self.accent(text, "acute")
def call_grave (self, text) :
    return self.accent(text, "grave")
def call_diaeresis (self, text) :
    return self.accent(text, "diaeresis")
def call_circumflex (self, text) :
    return self.accent(text, "circumflex")
def call_tilde (self, text) :
    return self.accent(text, "tilde")
```
Then come a few useful macros, which can easily be completed with more.
```python
def call_newline (self) :
    return self.text("<br/>")
def call_href (self, url, text) :
    return self.text("[%s](%s)" % (text, url))
def call_dollar (self) :
    return self.text("$")
def call_math (self) :
    return self.text("*")
def call_l (self) :
    return self.text("l")
def call_L (self) :
    return self.text("L")
def call_ae (self) :
    return self.text(u"æ")
def call_oe (self) :
    return self.text(u"œ")
```
We come to text styling. Tags are macros like `\it` that apply immediately, inserting an opening marker, and push their names onto the current group using method `tag`, so that at the end of the group they will be called again, but with `opentag=False` this time. When generating Markdown, the opening and closing tags are the same, but we could easily insert things like `<tag>` and `</tag>` instead. Other macros like `\emph` apply to a group, which is passed as a string after being processed recursively by grammar rule `arg`. Note that we have used marker `_` for italics and `*` for maths.²
```python
def call_it (self, opentag=True) :
    if opentag :
        self.tag("it")
    return self.text("_")
def call_bf (self, opentag=True) :
    if opentag :
        self.tag("bf")
    return self.text("**")
def call_tt (self, opentag=True) :
    if opentag :
        self.tag("tt")
    return self.text("`")
def call_emph (self, text) :
    return self.text("_%s_" % text)
def call_textit (self, text) :
    return self.text("_%s_" % text)
def call_texttt (self, text) :
    return self.text("`%s`" % text)
def call_textbf (self, text) :
    return self.text("**%s**" % text)
```
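The cumulative behaviour of `\it` and `\bf` inside a group can be checked with a standalone sketch combining the tags stack with these two macros (the class name `Style` is mine, and `self.text` is reduced to the bare marker):

```python
class Style (object) :
    def __init__ (self) :
        self.tags = [[]]
    def tag (self, tag) :
        self.tags[-1].append(tag)
    def bgroup (self) :
        self.tags.append([])
        return ""
    def egroup (self) :
        # close the tags opened in this group, in reverse order
        tags = self.tags.pop(-1)
        return "".join(self.call(tag, False) for tag in reversed(tags))
    def call (self, name, opentag=True) :
        return getattr(self, "call_%s" % name)(opentag)
    def call_it (self, opentag=True) :
        if opentag :
            self.tag("it")
        return "_"
    def call_bf (self, opentag=True) :
        if opentag :
            self.tag("bf")
        return "**"

s = Style()
# emulates parsing "{\it hello \bf world}"
out = s.bgroup() + s.call("it") + "hello " + s.call("bf") + "world" + s.egroup()
# out == "_hello **world**_"
```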
Finally, we have macros for environments, i.e., `\begin{...}` and `\end{...}`, which are directly implemented as macros `\begin` and `\end` that manage the environment names on stack `envs` as explained above. These macros call respectively `begin_ENV` and `end_ENV` to perform the appropriate operations. This is how environment `itemize` is emulated, together with macro `\item`.
```python
def call_begin (self, name) :
    handler = getattr(self, "begin_%s" % name)
    self.bgroup()
    self.envs.append(name)
    return handler()
def call_end (self, name) :
    handler = getattr(self, "end_%s" % name)
    self.egroup()
    self.envs.pop(-1)  # drop the environment entered in call_begin
    return handler()
def begin_itemize (self) :
    newpar, self.newpar = self.newpar, True
    if newpar :
        return ""
    elif self.envs.count("itemize") > 1 :
        return "\n"
    else :
        return "\n\n"
def call_item (self) :
    if self.newpar :
        self.newpar = False
        return " " * self.envs.count("itemize") + "* "
    else :
        return "\n" + " " * self.envs.count("itemize") + "* "
def end_itemize (self) :
    self.newpar = True
    if "itemize" in self.envs :
        return "\n"
    else :
        return "\n\n"
```
In this implementation of `itemize`, we take care to handle `newpar` appropriately in order to insert the correct number of newlines. For instance, on `\begin{itemize}`, if there is already a paragraph separation above, we don't need to insert more newlines. Notice also how we use stack `envs` to insert the correct indentation on `\item` and the correct number of newlines at the beginning and end of an `itemize`.
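The `itemize` emulation can likewise be exercised on its own. In the sketch below, `begin` and `end` are simplified stand-ins (mine) for `call_begin` and `call_end` that only maintain the `envs` stack, skipping group handling:

```python
class Items (object) :
    def __init__ (self) :
        self.newpar = True     # start of document counts as a new paragraph
        self.envs = []
    def begin_itemize (self) :
        newpar, self.newpar = self.newpar, True
        if newpar :
            return ""
        elif self.envs.count("itemize") > 1 :
            return "\n"
        else :
            return "\n\n"
    def call_item (self) :
        # indentation depends on the nesting depth of itemize environments
        if self.newpar :
            self.newpar = False
            return " " * self.envs.count("itemize") + "* "
        else :
            return "\n" + " " * self.envs.count("itemize") + "* "
    def end_itemize (self) :
        self.newpar = True
        if "itemize" in self.envs :
            return "\n"
        else :
            return "\n\n"
    # simplified drivers standing in for call_begin/call_end
    def begin (self, name) :
        self.envs.append(name)
        return getattr(self, "begin_%s" % name)()
    def end (self, name) :
        self.envs.pop(-1)
        return getattr(self, "end_%s" % name)()

t = Items()
# emulates parsing "\begin{itemize}\item one \item two\end{itemize}"
out = (t.begin("itemize") + t.call_item() + "one"
       + t.call_item() + "two" + t.end("itemize"))
# out == " * one\n * two\n\n"
```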
1. LaTeX is built on top of TeX, which is a terribly complex language to parse and interpret. This process is very clearly explained in Donald Knuth's TeXbook, but it probably could not be implemented using traditional grammar-based parsing. My goal here is not to have a Python implementation of TeX, but instead a quick parser/translator that will do the job in simple situations. ↩

2. Markdown allows both, but doing so, we avoid problems with math within italics, like in `\textit{variable $x$ is zero}`, which is correctly interpreted (i.e., maths will be consistently rendered in italics independently of the context) when translated to `_variable *x* is zero_` (what we do), but not when translated to `*variable *x* is zero*` or `_variable _x_ is zero_`, which typeset `x` in roman. On the other hand, `\textit{this \emph{is} important}` is translated to `_this _is_ important_`, which is not interpreted as we would like by python-markdown (because it is sensitive to spaces before/after `_`). ↩