We have seen how to use Parsley to parse BibTeX, but now we need to parse the LaTeX code inside the BibTeX entries and convert it to something else: plain text or, here, Markdown. Here again, Parsley proves really handy.

Of course, only LaTeX can actually parse full LaTeX¹, so we will build a very limited parser, which should nevertheless be enough to handle the simple LaTeX code found in a BibTeX file that aims to be portable (i.e., one with no fancy macros). Note also that we assume that the code is correct: we won't check this, and the generated Markdown may be wrong if the LaTeX code is invalid.
First we need a grammar to match the various tokens of LaTeX source code:

- `text` matches regular text with no white space and no special characters inside. This text is processed using method `text` of object `tex`, which is responsible for emulating LaTeX behaviour (see later).
- `blank` matches white space, including newlines, and is processed by `tex.blank`.
- `name` and `macro` together allow matching a call to a macro, which can be a name as in `\emph` or a single character as in `\\` or `\'`.
- `bgroup` and `egroup` respectively match `{` and `}` while calling the appropriate methods of `tex`.
- `arg` matches a macro argument, which can be a group `{...}` or a single character.
- `call` matches a call to a macro together with its arguments. Note how `(-> tex.arity(m)):n` allows us to get the number of arguments expected by `m` and bind it to `n`, which is then used in `arg{n}:a` to collect exactly `n` arguments in `a`. This is really a killer feature of Parsley!
- `comment` matches comments and drops them.
- `math` matches `$` and toggles math mode. Note that we won't parse maths; we will just typeset them in italics.
- `data` matches any of the text blocks above.
- finally, `doc` matches a full string of LaTeX source code, with possibly nested groups.
The resulting grammar is as follows:
```python
grammar = r"""
text = (anything:x ?(x not in '{\\\n\t\r %$}') -> x)+:d
       -> tex.text("".join(d))
blank = (' '|'\n'|'\t'|'\r')+:d -> tex.blank("".join(d))
name = letterOrDigit+:d -> "".join(d)
macro = '\\' ((name:n ws -> n)|anything)
bgroup = '{' -> tex.bgroup()
egroup = '}' -> tex.egroup()
arg = ((bgroup !(tex.pushpar()) doc:a !(tex.poppar()) egroup -> a)
       |anything)
call = macro:m (-> tex.arity(m)):n arg{n}:a -> tex.call(m, *a)
comment = '%' (anything:x ?(x not in '\n'))+ '\n' -> ''
math = '$' -> tex.call('math')
data = (text|blank|comment|call|math)+:d -> "".join(d)
doc = (data|(bgroup:b doc:d egroup:e -> b+d+e))+:d -> "".join(d)
"""
```
Next, we need to build the class for the object `tex` used in the grammar. Its constructor simply compiles the grammar, passing the object itself as `tex` in the parser's environment. Method `__call__` does the actual parsing by calling method `doc` of the parser.
```python
# -*- coding: utf-8 -*-
import inspect
import parsley

class LaTeX (object) :
    def __init__ (self) :
        self.parser = parsley.makeGrammar(grammar, {"tex": self})
    def __call__ (self, data) :
        self.tags = [[]]
        self.newpar = True
        self.pars = []
        self.envs = []
        return self.parser(data).doc().strip()
```
The various attributes assigned by `__call__` allow us to keep track of the current state while parsing and converting code:

- `tags` is a stack of lists corresponding to the nested groups, allowing us to close the various tags when exiting a group. For instance, if we parse `{\it hello world}`, we need to close italics at the end of the group. In this respect, our parser differs from LaTeX in that macros like `\it` have cumulative effects: for instance, `{\it hello \bf world}` will be rendered as `_hello **world**_`, which is _hello **world**_. To achieve this in LaTeX, we should have used `\itshape` and `\bfseries` instead of `\it` and `\bf`. Tags management is made using the following methods:

  ```python
  def tag (self, tag) :
      self.tags[-1].append(tag)
  def bgroup (self) :
      self.tags.append([])
      return ""
  def egroup (self) :
      tags = self.tags.pop(-1)
      return "".join(self.call(tag, False) for tag in reversed(tags))
  ```

  See how `egroup` pops the tags and calls the appropriate method using `self.call`, with a second parameter set to `False` indicating that we are closing a tag (more below).

- `newpar` and `pars` allow us to manage the empty lines between paragraphs and the white space at the beginning of each paragraph. The former is `True` when we are currently beginning a paragraph, and the latter is a stack to save/restore this information when we parse a nested `doc` in the grammar. This is the role of `tex.pushpar()` and `tex.poppar()` that we've encountered in rule `arg`. This white space management is made by the following methods:

  ```python
  def text (self, txt) :
      self.newpar = False
      return txt
  def blank (self, txt) :
      if self.newpar :
          return ""
      elif txt.count("\n") > 1 :
          self.newpar = True
          return "\n\n"
      else :
          return " "
  def pushpar (self) :
      self.pars.append(self.newpar)
  def poppar (self) :
      self.newpar = self.pars.pop(-1)
  ```

  We see how `text` sets `newpar` to `False` and how `blank` avoids white space at the beginning of paragraphs.

- Finally, `envs` is a stack of environments corresponding to the nesting of `\begin{...}` and `\end{...}` in the LaTeX source code.
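To see this paragraph handling in isolation, here is a standalone sketch of the `text`/`blank` pair above (the class name `Par` is mine, and `text`'s other duties in the full converter are omitted):

```python
class Par (object) :
    def __init__ (self) :
        self.newpar = True           # a document starts a new paragraph
    def text (self, txt) :
        self.newpar = False          # any text leaves paragraph-start state
        return txt
    def blank (self, txt) :
        if self.newpar :
            return ""                # drop white space at paragraph start
        elif txt.count("\n") > 1 :
            self.newpar = True       # an empty line separates paragraphs
            return "\n\n"
        else :
            return " "               # collapse other white space runs

p = Par()
out = (p.blank("\n  ") + p.text("hello") + p.blank(" ") + p.text("world")
       + p.blank("\n\n") + p.text("next"))
# leading blank dropped, inner blank collapsed, paragraph break kept:
# out == "hello world\n\nnext"
```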
Then comes macro emulation: `shortcut` maps 1-character macros to names, then `call(MACRO, ...)` is a dispatcher to `call_MACRO(...)`. Finally, `arity` uses module `inspect` to compute how many arguments a function `call_MACRO` expects. Note that we do not count a last argument `opentag=True` that is used for tags in groups (which avoids having both `open_TAG` and `close_TAG` methods).
```python
shortcut = {"'"  : "acute",
            "`"  : "grave",
            '"'  : "diaeresis",
            "^"  : "circumflex",
            "~"  : "tilde",
            "\\" : "newline",
            "$"  : "dollar",
            }
def call (self, name, *args) :
    name = self.shortcut.get(name, name)
    handler = getattr(self, "call_%s" % name)
    return handler(*args)
def arity (self, name) :
    name = self.shortcut.get(name, name)
    a, _, _, d = inspect.getargspec(getattr(self, "call_%s" % name))
    if a[-1] == "opentag" and d and d[-1] == True :
        return len(a) - 2
    else :
        return len(a) - 1
```
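Note that `inspect.getargspec` is Python 2 vintage (it was deprecated in Python 3 and removed in 3.11); under Python 3, the same arity computation can be sketched with `inspect.getfullargspec`. The class and handler names below are mine, chosen to mirror the `call_MACRO` convention:

```python
import inspect

class Demo (object) :
    # hypothetical handlers following the call_MACRO naming convention
    def call_emph (self, text) :
        return "_%s_" % text
    def call_it (self, opentag=True) :
        return "_"
    def arity (self, name) :
        spec = inspect.getfullargspec(getattr(self, "call_%s" % name))
        a, d = spec.args, spec.defaults
        # do not count `self`, nor a trailing `opentag=True` used for tags
        if a[-1] == "opentag" and d and d[-1] is True :
            return len(a) - 2
        else :
            return len(a) - 1

demo = Demo()
# call_emph expects one argument; call_it expects none (opentag not counted)
```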
Here comes the implementation of the macros that add accents to characters, like `\'a`. We use hard-coded dicts that should probably be extended to handle more cases, but the principle would remain the same.
```python
_acute = dict(zip("aeiouy", u"áéíóúý"))
_grave = dict(zip("aeiouy", u"àèìòùỳ"))
_diaeresis = dict(zip("aeiouy", u"äëïöüÿ"))
_circumflex = dict(zip("aeiouy", u"âêîôûŷ"))
_tilde = dict(zip("aon", u"ãõñ"))
def accent (self, text, accent) :
    return getattr(self, "_" + accent).get(text[0], text[0]) + text[1:]
def call_acute (self, text) :
    return self.accent(text, "acute")
def call_grave (self, text) :
    return self.accent(text, "grave")
def call_diaeresis (self, text) :
    return self.accent(text, "diaeresis")
def call_circumflex (self, text) :
    return self.accent(text, "circumflex")
def call_tilde (self, text) :
    return self.accent(text, "tilde")
```
Then come a few useful macros, which can easily be completed with more.
```python
def call_newline (self) :
    return self.text("<br/>")
def call_href (self, url, text) :
    return self.text("[%s](%s)" % (text, url))
def call_dollar (self) :
    return self.text("$")
def call_math (self) :
    return self.text("*")
def call_l (self) :
    return self.text("l")
def call_L (self) :
    return self.text("L")
def call_ae (self) :
    return self.text(u"æ")
def call_oe (self) :
    return self.text(u"œ")
```
We come to text styling. Tags are macros like `\it` that apply immediately, inserting an opening marker, and push their names onto the current group using method `tag`, so that at the end of the group they will be called again, but with `opentag=False` this time. When generating Markdown, the opening and closing tags are the same, but we could easily insert things like `<tag>` and `</tag>` instead. Other macros like `\emph` apply to a group, which is passed as a string after being processed recursively by grammar rule `arg`. Note that we have used marker `_` for italics and `*` for maths.²
```python
def call_it (self, opentag=True) :
    if opentag :
        self.tag("it")
    return self.text("_")
def call_bf (self, opentag=True) :
    if opentag :
        self.tag("bf")
    return self.text("**")
def call_tt (self, opentag=True) :
    if opentag :
        self.tag("tt")
    return self.text("`")
def call_emph (self, text) :
    return self.text("_%s_" % text)
def call_textit (self, text) :
    return self.text("_%s_" % text)
def call_texttt (self, text) :
    return self.text("`%s`" % text)
def call_textbf (self, text) :
    return self.text("**%s**" % text)
```
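The cumulative behaviour of `\it` and `\bf` inside a group can be checked with a standalone sketch combining the tags stack with these two macros (the class name `Style` is mine, and `self.text` is reduced to the bare marker):

```python
class Style (object) :
    def __init__ (self) :
        self.tags = [[]]
    def tag (self, tag) :
        self.tags[-1].append(tag)
    def bgroup (self) :
        self.tags.append([])
        return ""
    def egroup (self) :
        # close the tags opened in this group, in reverse order
        tags = self.tags.pop(-1)
        return "".join(self.call(tag, False) for tag in reversed(tags))
    def call (self, name, opentag=True) :
        return getattr(self, "call_%s" % name)(opentag)
    def call_it (self, opentag=True) :
        if opentag :
            self.tag("it")
        return "_"
    def call_bf (self, opentag=True) :
        if opentag :
            self.tag("bf")
        return "**"

s = Style()
# emulates parsing "{\it hello \bf world}"
out = s.bgroup() + s.call("it") + "hello " + s.call("bf") + "world" + s.egroup()
# out == "_hello **world**_"
```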
Finally, we have macros for environments, i.e., `\begin{...}` and `\end{...}`, which are directly implemented as macros `\begin` and `\end` that manage the environment names on stack `envs` as explained above. These macros call respectively `begin_ENV` and `end_ENV` to perform the appropriate operations. This is how environment `itemize` is emulated, together with macro `\item`.
```python
def call_begin (self, name) :
    handler = getattr(self, "begin_%s" % name)
    self.bgroup()
    self.envs.append(name)
    return handler()
def call_end (self, name) :
    handler = getattr(self, "end_%s" % name)
    self.egroup()
    self.envs.pop(-1)  # drop the environment entered in call_begin
    return handler()
def begin_itemize (self) :
    newpar, self.newpar = self.newpar, True
    if newpar :
        return ""
    elif self.envs.count("itemize") > 1 :
        return "\n"
    else :
        return "\n\n"
def call_item (self) :
    if self.newpar :
        self.newpar = False
        return " " * self.envs.count("itemize") + "* "
    else :
        return "\n" + " " * self.envs.count("itemize") + "* "
def end_itemize (self) :
    self.newpar = True
    if "itemize" in self.envs :
        return "\n"
    else :
        return "\n\n"
```
In this implementation of `itemize`, we take care to handle `newpar` appropriately in order to insert the correct number of newlines. For instance, on `\begin{itemize}`, if there is already a paragraph separation above, we don't need to insert more newlines. Notice also how we use stack `envs` to insert the correct indentation on `\item` and the correct number of newlines at the beginning and end of an `itemize`.
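The `itemize` emulation can likewise be exercised on its own. In the sketch below, `begin` and `end` are simplified stand-ins (mine) for `call_begin` and `call_end` that only maintain the `envs` stack, skipping group handling:

```python
class Items (object) :
    def __init__ (self) :
        self.newpar = True     # start of document counts as a new paragraph
        self.envs = []
    def begin_itemize (self) :
        newpar, self.newpar = self.newpar, True
        if newpar :
            return ""
        elif self.envs.count("itemize") > 1 :
            return "\n"
        else :
            return "\n\n"
    def call_item (self) :
        # indentation depends on the nesting depth of itemize environments
        if self.newpar :
            self.newpar = False
            return " " * self.envs.count("itemize") + "* "
        else :
            return "\n" + " " * self.envs.count("itemize") + "* "
    def end_itemize (self) :
        self.newpar = True
        if "itemize" in self.envs :
            return "\n"
        else :
            return "\n\n"
    # simplified drivers standing in for call_begin/call_end
    def begin (self, name) :
        self.envs.append(name)
        return getattr(self, "begin_%s" % name)()
    def end (self, name) :
        self.envs.pop(-1)
        return getattr(self, "end_%s" % name)()

t = Items()
# emulates parsing "\begin{itemize}\item one \item two\end{itemize}"
out = (t.begin("itemize") + t.call_item() + "one"
       + t.call_item() + "two" + t.end("itemize"))
# out == " * one\n * two\n\n"
```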
1. LaTeX is built on top of TeX, which is a terribly complex language to parse and interpret. This process is very clearly explained in Donald Knuth's TeXbook, but it probably could not be implemented using traditional grammar-based parsing. My goal here is not to have a Python implementation of TeX, but instead a quick parser/translator that will do the job in simple situations. ↩

2. Markdown allows both, but doing so, we avoid problems with math within italics, like in `\textit{variable $x$ is zero}`, which is correctly interpreted (i.e., maths will be consistently rendered in italics independently of the context) when translated to `_variable *x* is zero_` (what we do), but not when translated to `*variable *x* is zero*` or `_variable _x_ is zero_`, which typeset `x` in roman. On the other hand, `\textit{this \emph{is} important}` is translated to `_this _is_ important_`, which is not interpreted as we would like by python-markdown (because it is sensitive to spaces before/after `_`). ↩