For my benchmarks, I used to store logs of every executions into CSV files that I had to aggregate for analysis. I've decided to switch to a database. It is simpler and globally more efficient.

Discussion

My benchmarks are usually organised as follows:

Relational model of the database I'm not a database enthusiast, actually, I must confess I don't like databases. But to be honest, this situation is typically the kind of problems databases have been made to solve. So, let's go. The relational diagram is shown here on the right, (drawn from a particular example I'm working on using ERAlchemy). We see that we have tables corresponding to the objects described above, plus a table runs that records the information related to any tool execution. I won't go into the details since it doesn't serve my topic. How this is implemented is described below, with code excerpts, but note that parts of the remarks below is dependent on this implementation.

So, what have we gained now we have a database?

First, a database is a storage system much simpler to manage that tons of CSV files, in particular, we don't have to manage how the different data types are encoded. I can forget about the column types, even complex ones like list:string that is automatically handled. Moreover, the definition of the database includes integrity constraints, for example a net cannot refer to a case that does not exist.

Then, a database can be queried. Here, I can easily get all the cases for which a net exists. Or: all the cases for a given model for which there is a net with at most n transitions that could be built within at most 10 seconds, ordered by the number of transitions. With CSV files, I would have to load all the data, compile it into a single Pandas table (for instance) and then perform the query on the table, which is possible of course, but not as convenient as with the database (and probably less efficient).

Moreover, with CSV files, I have to compute the joins myself, either while dumping the data (for instance, having one series of files that joins cases, nets, builds, and runs, plus another series for cases, nets, benchs, and runs), or by computing the joins after loading the data. The first solution is simpler but wastes disk space, especially for thousands or runs; the second solution is complicated and requires delicate programming. Of course Pandas can help a lot to compute joins, but again, it's not its job.

Finally, the storage is more efficient in a database and, using SQLite in particular, it's easier to copy around various versions of the database. For example, I copy it from the cluster that runs the benchmark to my laptop on which I do the data analysis.

Conclusion

There is no definitive argument in favour of using a database, but my experience is that it's a better tool because its designed exactly for this purpose. And it's not more complicated using the right libraries as shown below.

Implementation

Database

I've used pyDAL that is the standalone version of the database abstraction layer from the excellent web2py. Here is the code to declare the model depicted above:

from pydal import DAL, Field

def opendb (path="benchmark.db") :
    db = DAL("sqlite://" + path)
    db.define_table("cases",
                    Field("model", "string"),
                    Field("path", "string", unique=True),
                    Field("params", "list:string"))
    db.define_table("nets",
                    Field("case_id", "reference cases"),
                    Field("transitions", "integer"),
                    Field("places", "integer"))
    db.define_table("runs",
                    Field("date", "datetime"),
                    Field("command", "string"),
                    Field("code", "integer"),
                    Field("stdout", "text"),
                    Field("stderr", "text"),
                    Field("status", "string"),
                    Field("max_mem", "double"),
                    Field("max_time", "double"),
                    Field("clock_time", "double"),
                    Field("cpu_time", "double"),
                    Field("cpu_usage", "double"),
                    Field("mem_peak", "double"))
    db.define_table("builds",
                    Field("case_id", "reference cases"),
                    Field("run_id", "reference runs"),
                    Field("tool", "string"))
    db.define_table("benchs",
                    Field("case_id", "reference cases"),
                    Field("run_id", "reference runs"),
                    Field("tool", "string"),
                    Field("procs", "integer"),
                    Field("rate", "double"),
                    Field("events", "integer"),
                    Field("clock_time", "double"),
                    Field("cpu_time", "double"),
                    Field("cpu_usage", "double"),
                    Field("ipc_count", "integer"),
                    Field("ipc_wait", "double"),
                    Field("ipc_io", "double"),
                    Field("ipc_compute", "double"))
    return db

So this is really straightforward. I've chosen SQLite backend so that there is no database installation at all and it is easy to copy it between the cluster and my laptop.

One very interesting feature of pyDAL is that it does automatic migration of an existing database. So if I decide to add or remove tables or columns, it will convert the existing data into the new structure. (There's no magic, if you completely transform the structure, it won't work automatically.)

Now, this is the code to convert the result of any request into a Pandas DataFrame.

import pandas as pd

def row2dict (row) :
    "Convert a row into a flat dict"
    ret = {}
    for key, val in row.as_dict().items() :
        if key == "id" or key.endswith("id") or val is None :
            # don't copy database ids
            pass
        elif isinstance(val, list) :
            # parameters in lists are encoded as key=value
            for v in val :
                k, d = v.split("=", 1)
                try :
                    d = ast.literal_eval(d)
                except :
                    pass
                ret[k] = d
        elif isinstance(val, dict) :
            # recursively decode a row
            ret.update(row2dict(row[key]))
        elif isinstance(val, (str, int, long, float, datetime.datetime)) :
            # simple values
            ret[key] = val
        else :
            # could be extended here
            assert False, "unsupported %s=%r" % (key, val)
    return ret

def db2pd (rows) :
    "Convert a list of rows into a DataFrame"
    return pd.DataFrame(row2dict(r) for r in rows)

So, for instance, to analyse how the nets have been generated, I can do:

df = db2pd(db((db.cases.id == db.nets.case_id)
              & (db.cases.id == db.builds.case_id)
              & (db.runs.id == db.builds.run_id)).select())

Ordering the results by transitions just requires to add orderby=db.nets.transitions as a select() argument. More generally, pyDAL offers a rich set of possibilities to query the databases, and it is of course possible to evaluate raw SQL as well.

Measuring process performances

This is not the topic of the post, but here is the code I use to run a process while measuring its CPU and memory usage.

import collections, psutil, subprocess, datetime, os, time

# information about a process run
run = collections.namedtuple("run", ["date", "command", "status", "code",
                                     "clock_time", "cpu_time", "cpu_usage",
                                     "mem_peak", "stdout", "stderr"])

def newrun (oldrun, **update) :
    "Build a new run updating some fields"
    return run(*(update.get(attr, oldrun[pos])
                 for pos, attr in enumerate(oldrun._fields)))

def spawn (*command, timeout=0, maxmem=90, env=None, cwd=None) :
    "Run command and return the corresponding run object"
    command = [str(c) for c in command]
    if env is not None :
        newenv = os.environ.copy()
        newenv.update(env)
        env = newenv
    status = None
    peak = 0
    main = psutil.Popen(["time", "-f", "%%%e/%S/%U/%P"] + command,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        cwd=cwd, env=env)
    time.sleep(0.01) # let the process start
    child = main.children()[0]
    start = main.create_time()
    while True :
        try :
            # memory usage is measured every 1 second, this may be
            # sampled more often for short living processes
            if main.wait(1) is not None :
                break
        except psutil.TimeoutExpired :
            # process has not finished
            pass
        try :
            # get memory usage
            peak = max(peak, child.memory_info().vms)
            pmem = child.memory_percent()
        except psutil.NoSuchProcess :
            # process has finished
            break
        if timeout and time.time() - start > timeout :
            # process runs out of time
            status = "reached time limit"
        elif maxmem and pmem > maxmem :
            # process consumes too much memory (90% by default)
            status = "reached mem limit"
        if status :
            # kill the process because of timeout/maxmem
            try :
                # SIGTERM (be gentle)
                child.terminate()
            except :
                pass
            try :
                # give it 1 second to stop
                child.wait(1)
            except :
                pass
            try :
                # SIGKILL (enough is enough)
                child.kill()
            except :
                pass
    # parse stderr to get the output from time utility
    stderr, times, _ = main.stderr.read().rsplit("%", 2)
    times = times.split("/")
    if times[3] == "?" :
        times[3] = "0"
    return run(date        = datetime.datetime.fromtimestamp(start),
               command     = command,
               status      = status or "clean",
               code        = main.returncode,
               clock_time  = float(times[0]),
               cpu_time    = float(times[1]) + float(times[2]),
               cpu_usage   = float(times[3]) / 100.0,
               mem_peak    = peak,
               stdout      = main.stdout.read(),
               stderr      = stderr)

def pyspawn (*command, **options) :
    "Same as spawn but also parse Python exception traceback"
    t = spawn(*command, **options)
    status = None
    if t.code != 0 and t.status == "clean" :
        if any(l == "MemoryError" for l in t.stderr.splitlines()) :
            status = "out of memory"
        elif any(l == "Traceback (most recent call last):"
                 for l in t.stderr.splitlines()) :
            status = "raised Exception"
            lines = iter(t.stderr.splitlines())
            for line in lines :
                if line == "Traceback (most recent call last):" :
                    break
            for line in lines :
                if not line.startswith("  ") :
                    status = "raised " + line
                    break
        else :
            status = "error"
    return run(date        = t.date,
               command     = t.command,
               status      = status or t.status,
               code        = t.code,
               clock_time  = t.clock_time,
               cpu_time    = t.cpu_time,
               cpu_usage   = t.cpu_usage,
               mem_peak    = t.mem_peak,
               stdout      = t.stdout,
               stderr      = t.stderr)

def run2dict (run) :
    """Convert a run into a dict for easy insertion in the database
    For example:
    >>> r = run(...)
    >>> db.runs.insert(**run2dict(r))
    """
    return dict(zip(run._fields, run))

Drawing the database schema

Last, just to show how ERAlchemy is used:

def db2diagram (db, path) :
    "Draw the relational diagram of a database"
    with open(path, "w") as out :
        links = {}
        for table in db.tables :
            out.write("[%s]\n" % table)
            for field in db[table].fields :
                if field == "id" :
                    star = "*"
                else :
                    star = ""
                kind = getattr(db[table], field).type
                if kind.startswith("reference ") :
                    _, target = kind.split()
                    links[target, table] = "*--1"
                    kind = "%s*" % target
                out.write('    %s%s {label: "%s"}\n' % (star, field, kind))
        for (src, tgt), lnk in links.items() :
            out.write("%s %s %s\n" % (src, lnk, tgt))
    os.system("eralchemy -i %s -o %s.pdf" % (path, os.path.splitext(path)[0]))