The ability to reproduce and compare the results of other researchers is essential for scientific progress. In many research fields, however, it is often impossible to specify the complete experimental setting, e.g., within the scope of a scientific publication. As a consequence, a reliable comparison becomes difficult, if not impossible. TIRA is our approach to address this shortcoming. TIRA (jokingly, "The Incredible Research Assistant") provides a means for evaluation as a service. It focuses on hosting shared tasks and facilitates the submission of software, as opposed to the output of running software on a test dataset (a so-called run). TIRA encapsulates the submitted software in virtual machines. This way, even after a shared task is over, the submitted software can be re-evaluated at the click of a button, which greatly increases the reproducibility of the corresponding shared task. An overview of existing shared tasks is available at [service] [video]


TIRA is currently one of the few platforms (if not the only one) that support software submissions with little extra effort. We have used it to organize 12 shared tasks within PAN@CLEF, CoNLL, and the currently running WSDM Cup. All told, 300 pieces of software have been collected to date, all archived for re-execution. This ensures replicability as well as reproducibility (e.g., by re-evaluating the collected software on new datasets).

A recent example of applied reproducibility: we hosted a shared task in which participants submitted software adversarial to the software submitted to a previous shared task, namely author obfuscation vs. authorship verification. Evaluating the obfuscators involved running the obfuscated texts through all 44 previously submitted verifiers to check whether the authors could still be identified. This would have been virtually impossible without TIRA.
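
The cross-evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: in TIRA the verifiers are archived pieces of software running in virtual machines, whereas here each verifier is modeled as a plain Python callable. The names `Verifier` and `flip_rates`, as well as the "flip rate" measure itself, are hypothetical and not the measure used in the actual shared task.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a verifier takes a text of known authorship and a
# questioned text, and returns True if it believes both share an author.
Verifier = Callable[[str, str], bool]

# One case: (known text, original questioned text, obfuscated questioned text).
Case = Tuple[str, str, str]


def flip_rates(verifiers: Dict[str, Verifier], cases: List[Case]) -> Dict[str, float]:
    """For each verifier, compute the fraction of cases it originally
    accepted as 'same author' but rejects after obfuscation."""
    rates: Dict[str, float] = {}
    for name, verify in verifiers.items():
        # Only cases the verifier got right before obfuscation can be flipped.
        accepted = [(known, obf) for known, orig, obf in cases if verify(known, orig)]
        if not accepted:
            rates[name] = 0.0
            continue
        flipped = sum(1 for known, obf in accepted if not verify(known, obf))
        rates[name] = flipped / len(accepted)
    return rates
```

Running every obfuscator's output through every archived verifier in this manner yields one such table per obfuscator, which is only feasible when the verifiers remain executable after their shared task has ended.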

- supports almost any working environment and software stack (incl. Windows)
- apparently does not impede participation in shared tasks
  (so far, we have not observed a drop in registrations or heard any serious complaints afterward)
- prevents participants from directly accessing the test datasets (blind evaluation)
- prevents leakage of test datasets
- allows for controlling the amount of information passed back to participants when they run software on test datasets
- for the above reasons supports the use of proprietary and sensitive datasets
- allows for many different task setups (e.g., for the source retrieval task, participants accessed our in-house ClueWeb search engine ChatNoir)

TIRA's only requirement for participants is that
- their software is executable from a POSIX command line (Cygwin on Windows) with a number of parameters
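
To illustrate what "executable from a command line with a number of parameters" can look like, here is a minimal sketch of a submission entry point. The option names (`-i`/`--input-dir`, `-o`/`--output-dir`), the `*.txt` file layout, and the placeholder `process` function are assumptions made for this example; the actual parameters are defined per shared task.

```python
import argparse
from pathlib import Path
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical options: one directory to read the dataset from,
    # one directory to write the run to.
    parser = argparse.ArgumentParser(description="Hypothetical shared-task submission")
    parser.add_argument("-i", "--input-dir", type=Path, required=True,
                        help="directory containing the dataset to process")
    parser.add_argument("-o", "--output-dir", type=Path, required=True,
                        help="directory where the run is written")
    return parser


def process(text: str) -> str:
    # Placeholder for the participant's actual approach.
    return text.upper()


def main(argv: Sequence[str]) -> int:
    args = build_parser().parse_args(argv)
    args.output_dir.mkdir(parents=True, exist_ok=True)
    # Write one output file per input file, keeping the file name.
    for source in sorted(args.input_dir.glob("*.txt")):
        target = args.output_dir / source.name
        target.write_text(process(source.read_text(encoding="utf-8")), encoding="utf-8")
    return 0
```

With the usual `if __name__ == "__main__":` guard calling `main(sys.argv[1:])`, such a script could be started as, say, `python software.py -i /path/to/dataset -o /path/to/run`. Because the software only ever sees the directories it is given, the same command can be pointed at a test dataset the participant never accesses directly.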

TIRA's requirements for organizers are that they
- supply datasets
- supply run evaluation software
- review participant runs for errors
- moderate evaluation results and decide whether they should be published
- help to answer participant questions as they arise
That is nothing more than what they are doing anyway.
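
As an illustration of the "run evaluation software" an organizer supplies, the following minimal sketch compares a participant's run against the ground truth and reports accuracy. The file format (one `<item-id> <label>` pair per line) and the function names `read_labels` and `accuracy` are assumptions made for this example; each task defines its own formats and measures.

```python
from pathlib import Path
from typing import Dict


def read_labels(path: Path) -> Dict[str, str]:
    """Parse lines of the assumed form '<item-id> <label>' into a dictionary."""
    labels: Dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():  # skip blank lines
            item_id, label = line.split(maxsplit=1)
            labels[item_id] = label.strip()
    return labels


def accuracy(run: Dict[str, str], truth: Dict[str, str]) -> float:
    """Fraction of ground-truth items labeled correctly in the run.
    Items missing from the run count as wrong."""
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(1 for item_id, label in truth.items() if run.get(item_id) == label)
    return correct / len(truth)
```

Counting missing items as wrong is a design choice of this sketch: it keeps the score comparable across runs that skip items, and it surfaces incomplete runs that an organizer may want to review for errors.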

TIRA does not currently support
- GPU acceleration inside virtual machines
- accessing cluster computers to run, e.g., MapReduce jobs
These are things that will be available eventually.

TIRA's operational costs include
- running the virtual machines and the servers that host them

We are currently running TIRA on our Betaweb cluster at the Digital Bauhaus Lab (130 machines with 64 GB RAM each). We can afford to host hundreds of virtual machines simultaneously, and we would be willing to offer hosting free of charge. In return,
- we'd ask to be highlighted in appropriate places on web pages, presentations, papers, etc.
- we'd ask task organizers to make sure participants properly cite TIRA in case they mention it in their papers

A caveat: TIRA is still a prototype (beta), and it is rough around the edges in some places. We are working toward an open source release.


Students: Anna Beyer, Matthias Busse, Clement Welsch, Arnd Oberländer, Johannes Kiesel, Adrian Teschendorf, Manuel Willem