We have seen how to use Parsley to parse BibTeX; now we need to parse the LaTeX code inside the BibTeX entries and convert it to something else: plain text or, here, Markdown. Once again, Parsley proves really handy.

Of course, only LaTeX can actually parse full LaTeX [1], so we will build a very limited parser, which should nevertheless be enough to handle the simple LaTeX code found in a BibTeX file that aims to be portable (i.e., no fancy macros). Note also that we assume the code is correct: we do not check it, and the generated Markdown may be wrong if the LaTeX code is invalid.

First, we need a grammar to match the various tokens of a LaTeX source:

grammar = r"""
text = (anything:x ?(x not in '{\\\n\t\r %$}') -> x)+:d
     -> tex.text("".join(d))
blank = (' '|'\n'|'\t'|'\r')+:d -> tex.blank("".join(d))
name = letterOrDigit+:d -> "".join(d)
macro = '\\' ((name:n ws -> n)|anything)
bgroup = '{' -> tex.bgroup()
egroup = '}' -> tex.egroup()
arg = ((bgroup !(tex.pushpar()) doc:a !(tex.poppar()) egroup -> a)
       |anything)
call = macro:m (-> tex.arity(m)):n arg{n}:a -> tex.call(m, *a)
comment = '%' (anything:x ?(x not in '\n'))+ '\n' -> ''
math = '$' -> tex.call('math')
data = (text|blank|comment|call|math)+:d -> "".join(d)
doc = (data|(bgroup:b doc:d egroup:e -> b+d+e))+:d -> "".join(d)
"""

Next, we need to build the class for the object tex used in the grammar. Its constructor just compiles the grammar, passing the instance itself as tex in the parser's environment. Method __call__ does the actual parsing by calling method doc of the parser.

# -*- coding: utf-8 -*-
import inspect
import parsley

class LaTeX (object) :
    def __init__ (self) :
        self.parser = parsley.makeGrammar(grammar, {"tex": self})
    def __call__ (self, data) :
        self.tags = [[]]
        self.newpar = True
        self.pars = []
        self.envs = []
        return self.parser(data).doc().strip()

The various attributes assigned by __call__ keep track of the current state while parsing and converting the code: tags records, for each open group, the styling tags that will have to be re-emitted when the group is closed; newpar tells whether we are at the beginning of a new paragraph; pars saves paragraph states around macro arguments (see pushpar and poppar in rule arg); and envs is the stack of the currently open environments.
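
The grammar and the macros below also use a few helper methods (text, blank, tag, bgroup, egroup, pushpar and poppar) whose implementation is not detailed here. A minimal sketch of what they could look like, assuming a simple bookkeeping of blanks, paragraph breaks and opened tags (a plausible guess, not necessarily the original implementation), is:

    def text (self, data) :
        # sketch only: plain text ends any pending paragraph break
        self.newpar = False
        return data
    def blank (self, data) :
        # sketch only: an empty line starts a new paragraph, any other
        # run of blanks collapses to a single space
        if data.count("\n") > 1 :
            self.newpar = True
            return "\n\n"
        return "" if self.newpar else " "
    def tag (self, name) :
        # remember that this tag was opened in the current group
        self.tags[-1].append(name)
    def bgroup (self) :
        # a new group starts with its own (empty) list of opened tags
        self.tags.append([])
        return ""
    def egroup (self) :
        # closing a group re-emits the markers of the tags it opened,
        # calling each handler again with opentag=False
        return "".join(self.call(t, False) for t in reversed(self.tags.pop()))
    def pushpar (self) :
        # save/restore the paragraph state around a macro argument
        self.pars.append(self.newpar)
    def poppar (self) :
        self.newpar = self.pars.pop()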

Then comes macro emulation: shortcut maps one-character macros to names, and call(MACRO, ...) is a dispatcher to call_MACRO(...). Finally, arity uses module inspect to compute how many arguments a function call_MACRO expects. Note that we do not count a last argument opentag=True, which is used for tags in groups (this avoids having both open_TAG and close_TAG methods).

    shortcut = {"'" : "acute",
                "`" : "grave",
                '"' : "diaeresis",
                "^" : "circumflex",
                "~" : "tilde",
                "\\" : "newline",
                "$" : "dollar",
                }
    def call (self, name, *args) :
        name = self.shortcut.get(name, name)
        handler = getattr(self, "call_%s" % name)
        return handler(*args)
    def arity (self, name) :
        name = self.shortcut.get(name, name)
        a, _, _, d = inspect.getargspec(getattr(self, "call_%s" % name))
        if a[-1] == "opentag" and d and d[-1] == True :
            return len(a) - 2
        else :
            return len(a) - 1
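
As an illustration, once all the macros below are defined, arity should behave as follows (hypothetical interactive session):

>>> tex = LaTeX()
>>> tex.arity("emph")    # call_emph expects one group argument
1
>>> tex.arity("it")      # opentag=True is not counted
0
>>> tex.arity("href")    # call_href expects a URL and a text
2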

Here comes the implementation of the macros that add accents to characters, like \'a. We use hard-coded dicts that should probably be extended to handle more cases, but the principle would remain the same.

    _acute = dict(zip("aeiouy", u"áéíóúý"))
    _grave = dict(zip("aeiouy", u"àèìòùỳ"))
    _diaeresis = dict(zip("aeiouy", u"äëïöüÿ"))
    _circumflex = dict(zip("aeiouy", u"âêîôûŷ"))
    _tilde = dict(zip("aon", u"ãõñ"))
    def accent (self, text, accent) :
        return getattr(self, "_" + accent).get(text[0], text[0]) + text[1:]
    def call_acute (self, text) :
        return self.accent(text, "acute")
    def call_grave (self, text) :
        return self.accent(text, "grave")
    def call_diaeresis (self, text) :
        return self.accent(text, "diaeresis")
    def call_circumflex (self, text) :
        return self.accent(text, "circumflex")
    def call_tilde (self, text) :
        return self.accent(text, "tilde")
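
For instance, accented words should be translated as follows (hypothetical session; the exact whitespace handling depends on the helper methods sketched earlier):

>>> tex = LaTeX()
>>> print(tex(r"Universit\'e de Gen\`eve"))
Université de Genève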

Then come a few useful macros; more can easily be added in the same way.

    def call_newline (self) :
        return self.text("<br/>")
    def call_href (self, url, text) :
        return self.text("[%s](%s)" % (text, url))
    def call_dollar (self) :
        return self.text("$")
    def call_math (self) :
        return self.text("*")
    def call_l (self) :
        return self.text("l")
    def call_L (self) :
        return self.text("L")
    def call_ae (self) :
        return self.text(u"æ")
    def call_oe (self) :
        return self.text(u"œ")
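
For example, \href should yield a Markdown link (hypothetical session, with the same assumptions as above):

>>> tex = LaTeX()
>>> print(tex(r"see \href{http://example.org}{this page}"))
see [this page](http://example.org)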

We now come to text styling. Tags are macros like \it that apply immediately: they insert an opening marker and push their names onto the current group using method tag, so that at the end of the group they are called again, this time with opentag=False. When generating Markdown, the opening and closing markers are the same, but we could easily insert things like <tag> and </tag> instead. Other macros like \emph apply to a group, which is passed as a string after being processed recursively by grammar rule arg. Note that we use marker _ for italics and * for maths [2].

    def call_it (self, opentag=True) :
        if opentag :
            self.tag("it")
        return self.text("_")
    def call_bf (self, opentag=True) :
        if opentag :
            self.tag("bf")
        return self.text("**")
    def call_tt (self, opentag=True) :
        if opentag :
            self.tag("tt")
        return self.text("`")
    def call_emph (self, text) :
        return self.text("_%s_" % text)
    def call_textit (self, text) :
        return self.text("_%s_" % text)
    def call_texttt (self, text) :
        return self.text("`%s`" % text)
    def call_textbf (self, text) :
        return self.text("**%s**" % text)
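
Both styles of italics should then produce the same Markdown, the group form relying on the tag mechanism described above (hypothetical session, same assumptions as above):

>>> tex = LaTeX()
>>> print(tex(r"a \textit{few} words"))
a _few_ words
>>> print(tex(r"a {\it few} words"))
a _few_ words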

Finally, we have macros for environments, i.e., \begin{...} and \end{...}, which are directly implemented as macros \begin and \end that push and pop the environment names on the stack envs as explained above. These macros call respectively begin_ENV and end_ENV to perform the appropriate operations. This is how environment itemize is emulated, together with macro \item.

    def call_begin (self, name) :
        handler = getattr(self, "begin_%s" % name)
        self.bgroup()
        self.envs.append(name)
        return handler()
    def call_end (self, name) :
        handler = getattr(self, "end_%s" % name)
        self.egroup()
        # pop the name pushed by call_begin so that end_ENV only sees
        # the enclosing environments
        self.envs.pop()
        return handler()
    def begin_itemize (self) :
        newpar, self.newpar = self.newpar, True
        if newpar :
            return ""
        elif self.envs.count("itemize") > 1 :
            return "\n"
        else :
            return "\n\n"
    def call_item (self) :
        if self.newpar :
            self.newpar = False
            return "  " * self.envs.count("itemize") + "* "
        else :
            return "\n" + "  " * self.envs.count("itemize") + "* "
    def end_itemize (self) :
        self.newpar = True
        if "itemize" in self.envs :
            return "\n"
        else :
            return "\n\n"

In this implementation of itemize, we take care to handle newpar appropriately in order to insert the correct number of newlines. For instance, on \begin{itemize}, if there is already a paragraph separation above, we do not need to insert more newlines. Notice also how we use the stack envs to insert the correct indentation on \item and the correct number of newlines at the beginning and end of an itemize.
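
Putting everything together, the example discussed in note 2 below should be translated as follows (hypothetical session; exact whitespace depends on the helper methods sketched earlier):

>>> tex = LaTeX()
>>> print(tex(r"\textit{variable $x$ is zero}"))
_variable *x* is zero_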


  1. LaTeX is built on top of TeX, which is a terribly complex language to parse and interpret. This process is very clearly explained in Donald Knuth's TeXbook, but it probably could not be implemented using traditional grammar-based parsing. My goal here is not to have a Python implementation of TeX, but instead to have a quick parser/translator that will do the job in simple situations.

  2. Markdown allows both, but this choice avoids problems with maths within italics, as in \textit{variable $x$ is zero}. This is correctly interpreted (i.e., the maths will be consistently rendered in italics independently of the context) when translated to \_variable \*x\* is zero\_ (what we do), but not when translated to \*variable \*x\* is zero\* or \_variable \_x\_ is zero\_, which typeset x in roman. On the other hand, \textit{this \emph{is} important} is translated to \_this \_is\_ important\_, which is not interpreted as we would like by python-markdown (because it is sensitive to spaces before/after _).