Why Open Source Will Rule Scientific Computing (Part 2)

In the first post of this six-part series, I offered five reasons why open source will rule scientific computing. In this post, I discuss reason #1: Open Science.

It breaks my heart to say this, but too often scientists and engineers (of all people) have forgotten the importance of openness, transparency and reproducibility in their practice of science. For example, in the fields I am most familiar with (visualization, graphics, and biomedical research), recent years have seen the publication of many papers with impressive claims and high-level descriptions of algorithms that are extremely difficult, if not impossible, to replicate because the source code and data are not available. Further, some researchers treat their data as proprietary and seem to make a career out of analyzing it, with no chance for others to replicate their results. Other researchers publish complex algorithms but leave out important implementation details or parameter settings. As a result, some algorithms may take even expert researchers years to reproduce, and the results they obtain can vary widely from those originally published.

Many of us in the scientific computing community have come to the conclusion that the only solution to this sad state of affairs is to practice Open Science, in which open source software, open data, and open access play key roles. Open source software enables researchers to rapidly reproduce the results of computational experiments and explore the behavior of algorithms. Open data enables researchers to apply their software to pertinent test cases and to compare competing algorithms. And, of course, open access enables others to read publications in order to understand the science. Further, from the evidence I’ve seen, the practice of Open Science provides many other benefits, including fostering rapid innovation, enabling fair comparison of technology, and providing an ideal resource for educating the technologists of the future.

Many others have independently come to the same conclusion. What is particularly heartening is the emergence of open access journals such as the now well-known PLoS series of journals. Open data initiatives are becoming widespread, ranging from Data.gov (which aims to increase public access to high-value, machine-readable datasets generated by the Executive Branch of the Federal Government) to sites that aggregate data such as InfoChimps, Freebase, and Information Aesthetics. Even the US Library of Congress has started exploring ways to release open source software. Just recently, Jill P. Mesirov, a member of the Broad Institute of Massachusetts Institute of Technology and Harvard, proposed a Reproducible Research System (RRS), based on the concept of reproducible research proposed by Jon Claerbout in 1990. In Dr. Mesirov’s vision, the RRS

“consists of two components. The first element is a Reproducible Research Environment (RRE) for doing the computational work. An RRE provides computational tools together with the ability to automatically track the provenance of data, analyses, and results and to package them (or pointers to persistent versions of them) for redistribution. The second element is a Reproducible Research Publisher (RRP), which is a document-preparation system, such as standard word-processing software, that provides an easy link to the RRE. The RRS thus makes it easy to perform analyses and then to embed them directly into a paper. A reader can readily reproduce the analysis and, in fact, can extend it within the document itself by changing parameters, data, filters, and so on.”

(By the way, I’d provide a link to the full article quoted above, but it is only available through subscription, membership or fee. Such is the state of scientific discourse in 2010: an article espousing open research environments requires payment to access, sigh…)

Obviously, Kitware has taken the lessons of Open Science to heart by developing open source software and providing open data; however, we feel we’ve added a unique twist to the mix, as demonstrated by the Insight Journal, the VTK Journal, and the many offshoots of the MIDAS Journal. What makes these journals special is that they are not simply collections of PDF documents and data; they also support source code submissions, which are compiled, executed and tested as part of the submission process. Thus, automated evaluation of source code and data is combined with human review of the published technology to generate a final assessment of the (software) technology. Another area where we’ve had some impact is in working with a mainstream publisher, the Optical Society (OSA). With Kitware’s help, they have produced the Interactive Science Publishing system (ISP), which enables the creation of active documents. These documents have special embedded links that launch a viewer and download (open) data, enabling readers to interactively explore the data.

At the heart of what we do as technologists is the practice of science. At the heart of science is the ability to reproduce the results of others. Thus Open Science, and with it the practice of open source, is critical to the future of scientific computing.

In the next post, I will discuss reason #2: The Search for Authenticity.

Questions or comments are always welcome!