Franck Pommereau - Blog - Pelican articles out of BibTeX files

This post is outdated because I’m no longer using BibTeX files for my website, but its content remains valid.

I’ve moved to Pelican to generate my website. My home-brewed generator needed some updates in order to work with Python 3 and new versions of some packages, and I decided I had no more time for such games. One feature I miss in Pelican is the ability to generate my publication list out of BibTeX files. Here is how I’ve rebuilt this feature with Pelican.

First, I store my publications as pairs of files content/Publications/foo.bib and content/Publications/foo.pdf (the PDF is optional), so they’ll become articles in a blog category Publications (see the menu bar above). The goal is to treat .bib files as source files from which an article is generated.

Next, I’ve added a directory bibreader in my site’s directory, alongside the content directory. It’s a package consisting of two files:

__init__.py is the plugin itself
latex.py is an auxiliary module that parses (limited) LaTeX source and generates Markdown from it; it’s explained in this old post

Then, I’ve added one line in pelicanconf.py to load the plugin:

PLUGINS=["bibreader"]

Let’s look at bibreader/__init__.py. It first imports a bunch of objects that we’ll use later on. Then, it defines function tex to process BibTeX text, interpret its LaTeX content, and return a Markdown string. It also defines a function _get that tries several keys in turn on a dictionary and returns the first value found. And finally, dictionary _entries converts BibTeX entry names into plain English.

from pelican.readers import BaseReader
from pelican import signals

from datetime import datetime
from io import StringIO
from urllib.parse import urlparse
from pathlib import Path
from tempfile import NamedTemporaryFile

from pybtex.database import parse_file as parse_bibfile
from markdown import Markdown

from .latex import tex as _tex

def tex (txt) :
    # convert LaTeX to Markdown, passing None through for missing fields
    if txt is not None :
        return _tex(txt)

def _get (d, *keys) :
    for k in keys :
        if k in d :
            return d[k]

_entries = {"phdthesis" : "PhD thesis",
            "inproceedings" : "Conference paper",
            "proceedings" : "Conference proceedings",
            "techreport" : "Technical report",
            "inbook" : "Book chapter",
            "article" : "Journal paper",
            "book" : "Book"}

Then, class BibTexReader is defined to handle .bib source files, and it is registered as a Pelican reader.

class BibTexReader (BaseReader) :
    enabled = True
    file_extensions = ["bib"]
    def read (self, source_path) :
        # skipped, continued below

def add_reader (readers) :
    readers.reader_classes["bib"] = BibTexReader

def register () :
    signals.readers_init.connect(add_reader)

Method read has to read the given source_path, parse its content, and return some HTML together with a meta-data dictionary. It starts with the latter, and provides just the required information:

        # skipped, continued from above
        path = Path(source_path)
        bib = parse_bibfile(source_path)
        entry = list(bib.entries.values())[0]
        fields = entry.fields
        metadata = {"slug": path.stem,
                    "title": tex(_get(fields, "title", "booktitle")),
                    "date" : datetime(year=int(fields.get("year", 1)),
                                      month=int(fields.get("month", 1)),
                                      day=int(fields.get("day", 1)))}
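
The date computation above assumes that the month field holds a number. BibTeX files often use names or abbreviations such as jan or March instead, so here is a hedged sketch of a more tolerant conversion. The helper month_number and its lookup table are hypothetical and not part of the plugin, and the sample fields are made up.

```python
from datetime import datetime

# hypothetical lookup table: three-letter English month prefixes to numbers
_MONTHS = {name: num for num, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def month_number(value, default=1):
    """Convert a BibTeX month field ("3", "mar", "March") to an int."""
    if value is None:
        return default
    text = str(value).strip().lower()
    if text.isdigit():
        num = int(text)
        return num if 1 <= num <= 12 else default
    # unknown names fall back to the default month
    return _MONTHS.get(text[:3], default)

# made-up fields for the example
fields = {"year": "2019", "month": "March"}
date = datetime(year=int(fields.get("year", 1)),
                month=month_number(fields.get("month")),
                day=int(fields.get("day", 1)))
```

Falling back to a default month keeps the article date valid even when the field is unusual, at the cost of silently hiding typos in the .bib file.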

Note that each .bib file has only one BibTeX entry. Then, we build the article content as Markdown that will be parsed at the end. To start with, we handle a potential subtitle, and the authors’ or editors’ names.

        content = StringIO()
        if txt := fields.get("subtitle", None) :
            content.write(f"> {tex(txt)}\n\n")
        if persons := _get(entry.persons, "author", "editor") :
            for i, who in enumerate(persons) :
                if i :
                    content.write(", ")
                names = who.first_names + who.middle_names + who.last_names
                content.write(" ".join(tex(n) for n in names))
            content.write("\n\n")
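
To see what the names line looks like, here is a small standalone sketch of the same joining logic. Person is only a stand-in with the three attributes used above, not pybtex’s own class, the LaTeX conversion is left out, and the authors are made up.

```python
from collections import namedtuple

# stand-in for pybtex's Person, with only the attributes used above
Person = namedtuple("Person", "first_names middle_names last_names")

def format_persons(persons):
    """Join each person's name parts with spaces, and persons with commas."""
    full_names = []
    for who in persons:
        names = who.first_names + who.middle_names + who.last_names
        full_names.append(" ".join(names))
    return ", ".join(full_names)

# made-up authors for the example
authors = [Person(["Ada"], [], ["Lovelace"]),
           Person(["Alan"], ["M."], ["Turing"])]
line = format_persons(authors)
```

Using ", ".join gives the same result as writing the separator by hand before every person but the first.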

Then we handle the publication type, with journal/conference name, and so on. I’ve used a simple method that tries to generate a string from a list of format strings, each requiring some fields. They are tried in turn, and the first one whose fields are all present is used. If all of them fail, an exception is raised so it will be reported by Pelican.

        _fields = {k : tex(v) for k, v in fields.items()}
        if "type" not in _fields and entry.type in _entries :
            _fields["type"] = _entries[entry.type]
        for info in ["**{type}:** {school}",
                     "**{type}:** {booktitle}, {series} {volume}",
                     "{booktitle}, {series} {volume}",
                     "**{type}:** {booktitle}",
                     "{booktitle}",
                     "**{type}:** {journal} {volume}",
                     "{journal} {volume}",
                     "**{type}:** {journal} {number}",
                     "{journal} {number}",
                     "**{type}:** {institution}",
                     "**{type}:** {publisher} (ISBN {isbn})",
                     "{publisher} (ISBN {isbn})"] :
            try :
                txt = info.format(**_fields)
            except KeyError :
                continue
            content.write(f"_{txt}_\n\n")
            break
        else :
            raise ValueError("missing publication context")
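
The fallback mechanism can be tried in isolation with the sketch below. The function name first_matching, the shortened template list, and the sample fields are made up for the example; only the try/format/continue logic mirrors the code above.

```python
def first_matching(templates, fields):
    """Format the first template whose placeholders all exist in fields."""
    for template in templates:
        try:
            return template.format(**fields)
        except KeyError:
            # a placeholder has no matching field: try the next template
            continue
    raise ValueError("missing publication context")

templates = ["**{type}:** {booktitle}, {series} {volume}",
             "**{type}:** {booktitle}",
             "**{type}:** {journal} {volume}"]

# made-up fields: no "series", so the first template is skipped
fields = {"type": "Conference paper", "booktitle": "Some Workshop"}
info = first_matching(templates, fields)
```

The order of the templates matters: the most detailed ones come first, so that a less detailed one is used only when some field is missing.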

Next, we handle external links, such as DOI or HAL ids, as well as the PDF file. Finally come the abstract and a copy of the BibTeX source, to be copied/pasted by visitors.

        if url := fields.get("DOI", None) :
            doi = urlparse(url).path.lstrip("/")
            content.write(f" * DOI: [{doi}]({url})\n")
        if halid := fields.get("hal-id", None) :
            content.write(f" * HAL: [{halid}](https://hal.archives-ouvertes.fr/"
                          f"{halid})\n")
        if path.with_suffix(".pdf").exists() :
            content.write(f" * [get PDF]({{static}}{path.stem}.pdf)\n")
        if abstract := fields.get("abstract", None) :
            content.write("## Abstract\n\n"
                          f"{tex(abstract)}\n\n")
        content.write("## BibTeX\n\n")
        # indent the source by four spaces so Markdown renders it as code
        with path.open() as src :
            for line in src :
                content.write(f"    {line.rstrip()}\n")
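
Two details of this step are easy to check on their own: how the DOI is extracted from its link, and how the BibTeX source becomes a Markdown code block. The sketch below uses a made-up DOI link and a made-up BibTeX entry.

```python
from urllib.parse import urlparse

# made-up DOI link for the example
url = "https://doi.org/10.1000/example.123"
# the path of the URL, without its leading slash, is the DOI itself
doi = urlparse(url).path.lstrip("/")
item = f" * DOI: [{doi}]({url})\n"

# indenting by four spaces makes Markdown render the lines as a code block
source = "@article{key,\n  title = {Some title}\n}\n"
block = "".join(f"    {line.rstrip()}\n" for line in source.splitlines())
```

Note that the code above uses the field value as the link target, so it appears to assume that the DOI field holds a full URL rather than a bare identifier.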

Finally, HTML content is rendered and we return it together with meta-data:

        return Markdown().convert(content.getvalue()), metadata