Open Science, Open Access and Open Source

I have been thinking this over for quite a while, and have written this post several times over in my mind. As an undergraduate student I remember admiring scientists and imagining how amazing it must be to have a job where you got to discover new things, think of better solutions to problems facing our society, and make the world a better place. As my studies continued I aspired to become one of those researchers, so I decided to take my studies further and applied to do a PhD.

As a PhD student I enjoyed learning more about materials, and was excited to be working with gold nanoparticles and researching how we might make real devices out of this novel material in the Nanomaterial Engineering Group. It was exciting, challenging, and fascinating using techniques such as X-ray and neutron reflectometry, electron and atomic force microscopy, and Langmuir-Blodgett troughs. As I learned more through my work, I became frustrated with the quality of the software I used, as I had always imagined that “real scientists” had better tools available to them. It became even more frustrating when I realized how bad some of the instrument control software was, and how so many of the file formats could only be used in one or two expensive and hard-to-use programs that only worked on one or two platforms.

Towards the end of my PhD, I decided I would like to take some action. I had been trying to draw and render images of molecular structures, and wanted a way to do simple geometry optimizations for posters, papers and web pages. At first I tried to do some of this using an existing commercial package, but it only worked on Windows and we only had one license for the department. The training provided to me as a researcher in areas such as programming and analysis was disappointing, and all too often generic tools such as Word, PowerPoint, and Excel were the most viable choice for preparing, analyzing, and presenting our work. I began writing more software, but much of it was written from scratch with little guidance. As I searched for a better way I came across some open-source libraries and tools.

I found a program run by Google called “Summer of Code” where they offered me the opportunity to “flip bits not burgers.” I was extremely lucky to find an idea on KDE’s idea page for a molecule editor in Kalzium. I was very excited, as I had been using KDE for many years. This was a pivotal moment for me, where my life and career took a twist I never expected into the world of open science – and I have loved every minute of it.

It was through that work that I became involved in the Avogadro project and later Open Babel, and met Geoff, who later that year offered me a position in his new research group. This was an exciting opportunity, as not only did we share a passion for correlating experimental and computational techniques, but Geoff was also very active in open chemistry. After I moved out to Pittsburgh, Geoff introduced me to the Blue Obelisk, and I now proudly count myself as one of their un-members. Last year we published an open access paper on the Blue Obelisk, five years on.

After a two-year postdoctoral position with Geoff, who was extremely supportive of my work in open chemistry, I met Bill Hoffman from Kitware. I knew that Kitware developed CMake, but beyond that was not really aware of what they did. It turned out that they were involved in much more than just CMake, with open source tools and frameworks such as VTK, ParaView, ITK, CDash and more. They had been working on open scientific software for over a decade, and they were hiring! They weren’t just making applications either; they were tackling the whole problem, including development, testing, and validation of open-source, cross-platform applications and frameworks.

After accepting a position with Kitware in 2009, I discovered something I had never really appreciated: just how poor access to publicly-funded research is. I can no longer access scientific papers that I and others wrote, and that were funded with taxpayer money from both the UK and the US! I think that is terrible, and later realized I had become part of the scholarly poor, on which Peter wrote a follow-up detailing the plight of those of us in industry. There is currently a raging debate on open access, and campaigns such as The Cost of Knowledge need our support. The products of publicly-funded research should be available to all, whether they are in academia, industry, government or anywhere else.

There are too many black boxes in science today, and too much published work that is not available to all or reproduced by others. Mathematics used to be the language of science, but more and more it is computer software that is needed to learn more, and too much of this code is closed, unpublished and poorly shared. Papers must include mathematical proofs, or refer to proofs already published, but it is common to see work published that used closed, proprietary package X to conduct a simulation. This is changing, as Scientific American recently published an article on how “Secret Computer Code Threatens Science.” Science also published an article about “Shining Light into Black Boxes,” detailing the growing problem of withheld source code preventing meaningful peer review and reproducibility of research.

Michael Nielsen published a book called “Reinventing Discovery” that talks about the value of networked science, and is well worth a read if you have not yet had a chance. The Panton Principles outline the need to make scientific data open, and the Science Code Manifesto calls for openly-available code in science. The core goals of the Blue Obelisk are open data, open standards, and open source. I think for science to progress we must embrace openness and sharing, and resist the urge to hoard data and build up small empires on proprietary code and data.

One thing I hope to see come from all of the controversy over the Research Works Act is a clarification that publicly-funded research should be available to all, whether you think they will understand it or not. Scientists need to get better at communicating with the general public, and at being more transparent about how research is done. I think open science will give us a chance to increase public engagement in science, the lack of which seems to be a growing problem in an age where we can all access the internet and the wealth of knowledge available on it.

I think that we need to figure out sustainable ways to fund the development of open software platforms to enable the next generation of researchers to push back the frontiers of science. We need to remember that we are publishing to share the results of (often publicly-funded) research, and so we should be using liberal licenses such as CC-BY and CC0, which allow reuse and further analysis. We also need liberally licensed software that allows those same things, with simple licenses such as BSD and Apache 2.0. These libraries should contain well-tested implementations of data structures, algorithms and best structures, along with training for researchers to help them take advantage of these resources. If there is a better way to do something, contributions and integration should be encouraged, as is the case in most open source communities.

Our Open Chemistry project recently got Phase II SBIR funding, and I am very excited to be leading that work at Kitware. It is part of a collaborative, open effort to improve the tools and frameworks available in the area, leveraging new software processes to enable wider community involvement.

Questions or comments are always welcome!