Monday, May 12, 2008

Experimental Repeatability or simply Open Source?

This year SIGMOD and KDD started playing with the idea of experimental repeatability. The basic idea is to generate guidelines and processes that will encourage repeatability of the experiments presented in many papers.

The reasons are rather obvious: We need to be able to reproduce the experiment, to avoid any hidden bias, catch errors, and even avoid outright fraud. Furthermore, this encourages publication of techniques that are easy to implement and test. Why do we care? If a method is impossible to implement, then it is an obstacle to research progress. A published paper that claims to be the state of the art but is not reproducible may prevent other, reproducible methods from being published, just for lack of comparison with the current state of the art.

Now, to achieve experimental repeatability we need two things:

  • Access to the data sets
  • Access to the code

Both parts tend to have issues: When someone uses multi-terabyte data sets, it is highly unclear how to give access to such data to outsiders. (Our work on the evolution of web databases used a 3.3TB dataset -- I have no idea how to even make the data available.) Other issues include copyrighted datasets, e.g., archives of newspaper articles. Despite these issues, I believe that in the end it is relatively easy to give access to the datasets used. See, for example, the UCI Machine Learning Repository, the UCR Time Series, the Linguistic Data Consortium, the Wharton Research Data Services (WRDS), and Daniel Lemire's set of pointers. (Feel free to post more pointers in the comments.)

The second aspect is access to the underlying code. One may argue that instead of giving access to the code we should describe clearly how to implement the algorithms, give the settings, and so on. This avoids any intellectual property issues, and everyone is happy. Personally, I do not buy this. No matter how nicely someone implements someone else's algorithms, nobody is going to spend much time optimizing the code for a competing technique. This may lead to flawed experimental comparisons. Another alternative is to use common datasets and simply pick the performance numbers from the published paper, without reimplementing the competing technique. (This works only when the underlying hardware is irrelevant -- e.g., for precision/recall experiments in information retrieval.)

My own take? Encourage publication of open source software. If the code is open and available, comparisons are easy, and the whole issue of experimental repeatability becomes moot. No need for committees to verify that the reported results are indeed correct, no need to upload code onto machines with a different architecture and make sure that it runs without any segmentation faults, and so on. If the code is available, even if the results are incorrect, someone will catch that in the future. (If the results are incorrect, the code and data are available, and nobody cares to replicate the results, then experimental repeatability is a moot point.)

Now, it is easy to talk about open source, but anyone who tries knows what a pain it is to take the scripts used to run experiments and make them ready for use by anyone else. (Or even for reuse later by the author :-) Therefore, we need to give further incentives. One such incentive is the idea of the JMLR journal to have a track for submissions of open source software; this track serves as "a venue for collection and dissemination of open source software."

Perhaps this is the way to proceed, an alternative to the "experimental repeatability requirements" that may be too difficult to follow.

3 comments:

Daniel Lemire said...

Very good post.

Why not have a "citation" system for open source software? If you use someone's software, you have to cite them (as usual).

Imagine you have a guy who crafted a nice implementation of a very useful piece of software that all scientists in his field are using... this person would become a highly cited author.

Why not?

You can help science by discovering new things, or by helping others discover new things. I believe we should reward both things.

As for data, we need a benevolent company or organization to come forward and allow us to make available terabytes of data to all. I cannot afford to make available 1 TB of data for download on my lab's servers. The sysadmin would probably kill me.

Spiros said...

When I was a young, naive grad student (2000 maybe?), I posted the same question in "Ask Slashdot" -- never saw a response.

I recently asked why don't we do wikis/blogs instead of conferences, as a way to scale the discussion sections of yore in, e.g., Royal Society journals (which are half the paper, after it was presented in the society, if you've ever read one). Better have 500 eyeballs commenting in public, some of which may be genuinely interested, rather than 3 eyeballs (the reviewers) reading it in bed or whatever. Never got a response to that either.

So I was becoming a bit cynical about the whole process -- glad to see a post on these! I recommend "strong accept" for this post! :-)

Spiros said...

Wrt 1st comment: Have you seen Ohloh, btw? I think their main idea is very related -- the current implementation might be another story, though.

Citations at the level of "using libfoo" are easy to extract. There are also "citations" at a different level, in bboards, blogs and wikis -- or even commit logs (at the single file level!). Here is a thought, e.g. (choice of GNOME random, they have a long history, though)

svn log --xml http://svn.gnome.org/svn/ >gnome-logs.xml

Maybe you can find who are really the key "authors" and who are the "freeloaders".
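A rough sketch of that idea, assuming the command above has already written gnome-logs.xml in the usual `svn log --xml` layout (a `<log>` root holding `<logentry>` elements, each with an optional `<author>` child). The function name commits_per_author is made up for illustration, and a raw commit count is only a crude proxy for who the key "authors" are:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def commits_per_author(xml_text):
    """Count log entries per author in `svn log --xml` output."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for entry in root.iter("logentry"):
        # Some revisions carry no <author> element; bucket them separately.
        author = entry.findtext("author", default="(unknown)")
        counts[author] += 1
    return counts


if __name__ == "__main__":
    # Assumes gnome-logs.xml was produced by the svn command above.
    with open("gnome-logs.xml", encoding="utf-8") as f:
        counts = commits_per_author(f.read())
    # Print the 20 most active committers, busiest first.
    for author, n in counts.most_common(20):
        print(f"{n:6d}  {author}")
```

Running svn log with -v would also list the changed paths per revision, which is what one would need for the single-file level mentioned above.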

As for data, there are efforts out there (eg google for Datapository at CMU). We don't really need benevolent companies (although it would be nice -- e.g., Yahoo is making some data available, I believe, on M45). These efforts do need more publicity, although I'm optimistic they will eventually take off.

However, the only problem with these is it might leave people in industry a bit out of the picture. Being in industry myself now, I have doubts about whether this would fly -- even with companies that purport to embrace open source...
