Virtually Everywhere

April 15, 2011

What is a computer?
There was a time when it was easy to answer this question, a time when a computer was mostly a physical, or hardware, device (see Figure 1). On top of that hardware device, a thin layer of logic controlled the tasks performed by the physical layer.



Figure 1. The ENIAC, which became operational in 1946,
is considered to be the first general-purpose electronic
computer. It was a digital computer that was capable of
being programmed to address a range of computing issues
(http://en.wikipedia.org/wiki/ENIAC).

There have been significant changes since those early days. Modern computers are an amalgam of physical hardware complemented with layer upon layer of abstraction logic. Users of modern computers interact with the higher abstraction layers and are no longer exposed to the details of the lower layers. This multi-layered organization has made it possible to intercept the middle layers and fully disconnect the operating system from the actual physical device on which it is running. As a consequence, what we used to call “the computer” is now disembodied from the physical layer and can therefore be moved across different receptacles where it can be incarnated.

This technology, generally called virtualization, has recently transitioned from advanced and exclusive to mainstream. We have therefore started taking advantage of it, mostly to disseminate the use of open-source software.
The beauty of virtualization is that the virtual computer is literally a stream of bytes, and can therefore be stored, copied, modified, and redistributed just like any other digital good. Granted, it tends to be quite a large file, but so are movies in digital formats.

The main scenarios in which we have recently been using virtualization are:
    Teaching
    Debugging
    Providing reference systems
    Running the reproducibility verification of the Insight Journal
    Trying new OS distributions without committing to fully reinstalling our computers

Here we elaborate on our experience using virtualization in some of these scenarios.

Debugging
Although we had been aware of virtualization for some time, the event that truly sparked our attention came as a secondary effect of teaching our class, “Open Source Software Practices,” at Rensselaer Polytechnic Institute. To expose students to the inner working practices of an open source community, they are required to work on a joint class software project and then on small group projects. In a recent class, for the joint project we chose to work with Sahana [1], a piece of humanitarian free and open source software (HFOSS) [2], designed to coordinate the delivery of humanitarian relief to disaster zones.

One of the easiest ways to get introduced to the system was for the students to sign up with the Sahana bug tracking database and select easy bugs that could be tackled in a couple of days. The Sahana development team has done an excellent job of preparing an easy entry path for new contributors. One of the most remarkable items in that reception area was the presence of a virtual appliance with a fully configured build of Sahana, along with a minimal mockup database intended for testing purposes [3]. Students were able to download the virtual appliance (for VirtualBox), boot it on their own laptops, and be working on fixing a bug in a matter of minutes. This was an eye-opening experience.
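As a rough illustration, the whole download-boot-discard cycle can be driven from the command line. The sketch below, in Python wrapping VirtualBox's VBoxManage tool, is only an approximation; the appliance file name and the registered VM name are hypothetical placeholders, not the actual Sahana artifacts.

    # Hedged sketch, not the exact Sahana workflow: import the downloaded
    # appliance, boot it, and later discard it, using VirtualBox's VBoxManage
    # command-line tool. File and VM names are hypothetical.
    import subprocess

    APPLIANCE = "sahana-eden-demo.ova"   # hypothetical name of the downloaded appliance
    VM_NAME = "Sahana Eden Demo"         # hypothetical name it registers under

    # Register the appliance with VirtualBox (File > Import Appliance in the GUI).
    subprocess.run(["VBoxManage", "import", APPLIANCE], check=True)

    # Boot the freshly imported virtual machine.
    subprocess.run(["VBoxManage", "startvm", VM_NAME], check=True)

    # ...explore and fix the bug inside the guest, submit the patch...

    # Then discard the machine, leaving the host computer untouched.
    subprocess.run(["VBoxManage", "unregistervm", VM_NAME, "--delete"], check=True)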

We know quite well, from our maintenance experience with our open source toolkits, that bringing new developers into an open source project is not a trivial feat. The fact that Sahana succeeded so well in delivering a “portable” platform (in the form of these virtual machines) that new developers can take and use instantly, without installing additional software, dealing with version incompatibilities, or having to get familiar with the installation details of a database and an associated web server (such as permissions and policies), makes this approach a clear winner.

One of the very appealing properties of this method is that new developers do not need to compromise the configuration of their current computers just to work on a particular bug of a given project. We have seen on many occasions that the effort to replicate the conditions in which a bug happens may require the installation of specific versions of libraries or software tools and their configuration in a particular fashion. Over time, the computers of developers who must do this on a continuous basis end up with an overwhelming mixture of installed libraries, which can easily generate conflicts and obstruct maintenance. With a virtual machine, on the other hand, the process is perfectly clean. The machine is downloaded and booted; the bug is explored and fixed; a patch is submitted; and the virtual machine is shut down and discarded. The developer then returns to a clean computer, with no secondary traces or inconvenient remnants from the recent bug-fixing excursion.

Teaching
When teaching a course to a group of, let’s say, 30 people, it is highly desirable to ensure that all of them have the software correctly installed, a similar directory structure, and access to the necessary source code and, possibly, any other binary tools needed for the course. For example, a typical ITK course requires the source code of ITK, the proper version of CMake, a well-configured compiler, and a set of data files suitable for use as input in hands-on exercises. This has to be done despite the fact that attendees will use their personal laptops for the course, and therefore will bring a large variety of hardware platforms, operating systems, and development environments.

In this context, virtualization offers an interesting alternative. Once a virtualization application is installed on the course attendees’ computers, it becomes possible to give each of them a virtual appliance that has been carefully crafted to contain all the software tools needed for the course. Such an appliance can be delivered to attendees in the form of a USB memory stick or a traditional CD.
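To make this concrete, the sketch below shows the kind of provisioning script that could be baked into such an appliance so that every attendee starts from an identical ITK build. It is an illustration only; the directory paths, repository URL, and CMake options are assumptions rather than the exact setup used in our courses.

    # Hypothetical provisioning script for a course appliance: fetch ITK,
    # configure it with CMake, and build it, so attendees boot into a ready
    # environment. Paths and the repository URL are assumptions.
    import os
    import subprocess

    SRC = "/home/student/src/ITK"     # hypothetical source location inside the appliance
    BUILD = "/home/student/bin/ITK"   # hypothetical build directory

    subprocess.run(["git", "clone",
                    "https://github.com/InsightSoftwareConsortium/ITK.git", SRC],
                   check=True)
    os.makedirs(BUILD, exist_ok=True)

    # Configure a Release build with the examples used in the hands-on exercises.
    subprocess.run(["cmake", "-DCMAKE_BUILD_TYPE=Release", "-DBUILD_EXAMPLES=ON", SRC],
                   cwd=BUILD, check=True)
    subprocess.run(["make", "-j4"], cwd=BUILD, check=True)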

The Microscopy Tutorial at MICCAI
At the MICCAI 2010 conference, we delivered a tutorial on “Microscopy Image Analysis.” As usual, following a pragmatic approach to training, we wanted to incorporate hands-on exercises in this tutorial, but we were challenged by the need to install a full application (V3D, developed by Hanchuan Peng’s team at the HHMI Janelia Campus) along with a full build of ITK and the set of input data required to run the exercises.

The computers used for the tutorial, however, were the laptops that attendees brought as their personal machines to the conference. This was a double challenge. First, a wide variety of machines was used (Mac, Linux, and Windows), and second, the configurations of these machines were vastly different. They had different versions of operating systems and different types and versions of build systems. Virtual machines were therefore a natural choice to isolate that heterogeneity from the uniform platform that we needed for delivering a common experience to the course attendees.

The preparation for the course involved two independent stages. The first was installing the virtualization software (in this case VirtualBox, from Oracle). The second was installing the image of the actual virtual machine (also known as a “virtual appliance”). Preparing the virtual machine certainly requires considerable time and attention, but it has the advantage that the outcome is reusable and redistributable.
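Once the machine has been prepared by hand, packaging it for redistribution is a single export step. The sketch below, again driving VBoxManage from Python, uses a hypothetical VM name and output file rather than the actual tutorial appliance.

    # Hedged sketch: export a hand-prepared virtual machine as a redistributable
    # appliance (.ova). The VM name and output file name are hypothetical.
    import subprocess

    subprocess.run(["VBoxManage", "export", "MICCAI-Microscopy-Tutorial",
                    "--output", "miccai-microscopy-tutorial.ova"],
                   check=True)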

The slides used for this tutorial [4] and the virtual machine [5] are available on the MIDAS database.

In this particular case, the file containing the virtual machine is about 2 GB in size, although there are better ways to compact a virtual appliance than the one we used. This virtual appliance can be run both in the VirtualBox application and in a VMware server.

 

Our use of virtual appliances in the MICCAI Microscopy tutorial was so rewarding that we will be using them again for these upcoming tutorials:
    CVPR 2011: Video Bridge between ITK and OpenCV
    MICCAI 2011: Microscopy Image Analysis
    MICCAI 2011: SimpleITK: Image analysis for human beings
    MICCAI 2011: ITKv4: The next generation

The Insight Journal
The Insight Journal is the vehicle for members of the ITK community to share the classes they develop with others. One of the most unique characteristics of the Insight Journal is that it is the only journal that verifies the reproducibility of the work described in a paper. To do this, authors must include with their papers the full set of source code, data, and parameters that enable others to replicate the content of the paper. The system run by the Insight Journal takes this source code, compiles it, runs it, and finally compares the generated data with reference data provided by the authors. Given that the Journal receives submissions from any registered user, and that registration requirements are minimal, the implementation of the Journal translates into: “Here is an online system in which we are going to take source code submitted by anyone on the Internet, and we will compile it and run it.” This is not necessarily the most prudent thing to do.
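The heart of that verification is a build-run-compare loop. The sketch below is only an approximation of what such a loop could look like; the directory layout, the test driver name, and the byte-for-byte comparison are assumptions, and the actual Insight Journal system is considerably more elaborate.

    # Hedged sketch of a build-run-compare verification loop. Paths, the test
    # driver name, and the comparison strategy are hypothetical.
    import filecmp
    import os
    import subprocess

    def verify(submission):
        """Configure, build, and run a submitted paper's code, then compare its
        output against the reference data supplied by the authors."""
        build_dir = os.path.join(submission, "build")
        os.makedirs(build_dir, exist_ok=True)

        subprocess.run(["cmake", os.path.join(submission, "source")],
                       cwd=build_dir, check=True)
        subprocess.run(["make"], cwd=build_dir, check=True)

        # Run the author-provided driver with the author-provided input and parameters.
        subprocess.run([os.path.join(build_dir, "PaperExample"),
                        os.path.join(submission, "data", "input.mha"),
                        os.path.join(build_dir, "output.mha")],
                       check=True)

        # Reproducibility check: the generated output must match the reference data.
        return filecmp.cmp(os.path.join(build_dir, "output.mha"),
                           os.path.join(submission, "data", "reference.mha"),
                           shallow=False)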

To limit the risk of damage, whether from malicious or defective code, a Xen virtualization platform was put in place. On this platform, a Linux virtual machine in which commonly used libraries and software tools have been preinstalled is instantiated from scratch to test each individual paper submission. In this way, every paper is tested in a uniform environment that is in pristine condition. Should anything go wrong with the build or the execution process, the damage is contained, because the instance of the virtual machine is discarded after the paper verification terminates.
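In outline, that per-submission lifecycle could be driven as sketched below. This is an assumption-laden illustration using the classic Xen “xm” toolstack that was current at the time; the configuration file and domain name are hypothetical, and the hand-off of the submission to the guest is elided.

    # Hedged sketch: boot one disposable Xen guest per submission and destroy it
    # afterwards. Config file and domain name are hypothetical placeholders.
    import subprocess

    def test_in_pristine_vm(config="/etc/xen/ij-tester.cfg", domain="ij-tester"):
        # Boot a fresh guest from the pristine, preinstalled image.
        subprocess.run(["xm", "create", config], check=True)
        try:
            pass  # hand the submission to the guest and wait for verification to finish
        finally:
            # Whatever happened inside, the guest is destroyed, so a malicious or
            # defective submission cannot contaminate the next one.
            subprocess.run(["xm", "destroy", domain], check=True)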

The virtualization environment also enables us to create a safe “walled garden” in which the code is tested with limited access to risky services such as networking. The image of the virtual machine is updated regularly to include recently released versions of ITK, VTK, and CMake, among other tools.

Given that in some cases users may want to further customize the configuration in which the paper source code is evaluated, we are considering an option in which authors would start from a publicly available copy of a pre-configured virtual appliance, customize it by installing additional software (including the software that implements the content of their paper), and finally package the resulting virtual appliance and submit it to the Insight Journal as the paper itself.

This, of course, carries the overhead of transmitting and storing very large files, but it also opens a whole new horizon of possibilities when it comes to the richness of content that can be made part of a technical or scientific publication. It would be one step closer to an ideal environment for the verification of reproducibility in the computational sciences.

The Cloud
Cloud environments are yet another implementation of virtualization technologies. In this case, the cloud provides three main components:
    A repository of virtual machine images.
    A computation service in which users can request hardware on a pay-per-use basis.
    A storage service in which users can deposit data and pay per data size and transfer.

These platforms enable us to provide preinstalled virtual machine images, containing a configured and built version of ITK, that users can instantiate for their own testing. Prices for running machines in the cloud are on the order of $1 per hour, and users only pay for the time between the instantiation of the machine and its shutdown.
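As a rough sketch of what that instantiation looks like in practice, the snippet below uses the boto3 SDK (a present-day interface that postdates this article) to launch and later terminate a single instance; the image id, region, and instance type are hypothetical placeholders rather than the actual ITK image.

    # Hedged sketch: launch one instance of a pre-built ITK image, then terminate
    # it when done so that billing stops. Identifiers are hypothetical.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Instantiate the pre-configured image.
    reservation = ec2.run_instances(ImageId="ami-0123456789abcdef0",  # hypothetical ITK image
                                    InstanceType="t2.micro",
                                    MinCount=1, MaxCount=1)
    instance_id = reservation["Instances"][0]["InstanceId"]

    # ...connect to the instance, build, and experiment...

    # Billing stops once the instance is terminated.
    ec2.terminate_instances(InstanceIds=[instance_id])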

Users in the cloud can also take pre-existing virtual machine images, modify them (for example, by installing additional software), and then put these new images back into the cloud repository for others to use. A permission system makes it possible to make some of these images fully public, or to restrict access to a limited set of users.
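Continuing the same hedged boto3 sketch, capturing a customized instance as a new image and adjusting its launch permissions might look like the following; the instance id and account number are placeholders.

    # Hedged sketch: snapshot a customized instance as a new image, then either
    # share it with a specific account or make it fully public. Identifiers are
    # hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Capture the modified machine as a new, reusable image.
    image = ec2.create_image(InstanceId="i-0123456789abcdef0",
                             Name="itk-build-plus-extras")
    image_id = image["ImageId"]

    # Share the image with one collaborator's account...
    ec2.modify_image_attribute(ImageId=image_id,
                               LaunchPermission={"Add": [{"UserId": "123456789012"}]})

    # ...or make it fully public instead.
    ec2.modify_image_attribute(ImageId=image_id,
                               LaunchPermission={"Add": [{"Group": "all"}]})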

Cloud computing is a virtualization paradigm in which a collection of computing resources is made available to pay-per-use customers in the form of virtual computers. The cloud service provider (for example, Amazon EC2, Rackspace, or Microsoft Azure) actually owns a large collection of hardware distributed across different geographical locations. That hardware has been configured to run virtual computers at the request of customers. A virtual computer is equivalent to a one-to-one copy of the byte data existing on the hard drive of any modern desktop or laptop. Such a copy includes not only the software applications that users interact with, but also the operating system layers that normally interact with real hardware. In the context of cloud computing, those virtual machines are essentially run as emulated computers in an environment where the emulation introduces minimal overhead.

Customers of the service can choose among many pre-configured virtual computers (also known as “images”), and can instantiate them on hardware platforms of different capacity (memory, number of processors, disk space), also known as “instances.” Users can also instantiate as many of these virtual devices as they need, adding or releasing them according to their usage, and, all along, pay only for the resources they actually use. Software infrastructure is available to perform this scaling automatically, according to the load that an application may be experiencing.

Open-source software platforms are a natural fit for cloud computing environments because open-source software is not crippled by licensing limitations, and therefore can be copied, instantiated, modified and redistributed without any legal concerns.

By storing scientific data in cloud storage services, the data becomes available directly to cloud computing devices without further transfer. Modern cloud storage providers offer multiple options for uploading large amounts of data, ranging from high-speed multi-channel upload networks to mail-shipped high-capacity storage media (such as multi-terabyte hard drives), which is still the most cost-effective way of transferring very large amounts of data. Once customers have uploaded data to the cloud, they can make it available to the virtual machines they instantiate to process it.
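A minimal sketch of that staging pattern, again using the present-day boto3 SDK with hypothetical bucket, key, and file names, is shown below: the data is uploaded once from the lab and then pulled down inside whichever cloud instance needs it.

    # Hedged sketch: stage a dataset in S3 once, then retrieve it from a running
    # cloud instance close to the compute. Bucket, key, and file names are
    # hypothetical placeholders.
    import boto3

    s3 = boto3.client("s3")

    # Upload from the lab workstation (done once).
    s3.upload_file("confocal_stack.mha", "my-microscopy-data", "confocal_stack.mha")

    # Download inside a running cloud instance.
    s3.download_file("my-microscopy-data", "confocal_stack.mha", "/tmp/confocal_stack.mha")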

We currently host an “image” in the Amazon EC2 elastic computing service. You can instantiate that image and have a functional computer in which an ITK source tree and its corresponding binary build are already available.

Conclusion
The luxury of being able to configure, pack, and ship around the digital version of a fully configured computer gives us plenty of opportunities to address more effectively the challenges of large-scale software development and to enjoy building the communities that form around it.

References
[1] http://sahanafoundation.org/
[2] http://hfoss.org/
[3] http://eden.sahanafoundation.org/wiki/InstallationGuidelinesVirtualMachine
[4] http://midas.kitware.com/collection/view/30
[5] http://midas.kitware.com/item/view/450

Resources
If you are interested in trying some of the available pre-configured images, the following resources are helpful:

VirtualBox

The community provides repositories of pre-built images, for example:

http://virtualboximages.com/

VMware

http://store.vmware.com/store/vmware/en_US/DisplayProductDetailsPage/productID.221027300

Luis Ibáñez is a Technical Leader at Kitware, Inc. He is one of the main developers of the Insight Toolkit (ITK). Luis is a strong supporter of Open Access publishing and of the verification of reproducibility in scientific publications.

 
