Simple, Parallel Computing with vtkSMPTools

Introduction

The Visualization Toolkit system has been in active development for over twenty years, and from the earliest days the community has concerned itself with parallel computing. The class vtkMultiThreader was introduced in 1997, mainly to support threaded, shared-memory data processing in the imaging pipeline. Scalable, distributed parallel computing has also been a major focus. Applications such as ParaView depend on these capabilities to execute on large leadership computing facilities. For example, ParaView/VTK has been successfully run using 256,000 processors in a co-processing simulation with the Phasta turbulent flow analysis code on Argonne's Mira IBM Blue Gene Q system [https://vimeo.com/120504264]. Such data parallel, distributed computing applications require a sophisticated execution model, and VTK has seen several incarnations of the pipeline architecture as a result. Similar development efforts continue to this day, with client-server models extending the reach of server-side parallel computing through systems such as VTK/ParaViewWeb [http://www.paraview.org/web/]. In fact with the advent of cloud computing and inexpensive systems such as Amazon AWS and Google Compute Engine, it is now possible to use VTK/ParaView in a distributed, customized, relatively inexpensive, on-demand basis to meet the computational needs of many organizations.

However as we all know the computing industry is not standing still and the field of parallel computing is advancing rapidly due to innovations in GPU and multicore technologies. With the plateauing of serial clock speeds (due to power and other physical constraints), chip manufacturers are adding computing capacity in the form of additional computing cores and vectorized processing units. While such hardware represents a significant leap in computing potential—with the distinct possibility of a dawning exascale computing era—these hardware advances pose significant challenges to software systems. Not only are popular programming languages unsuited for most parallel computing applications, the diversity of emerging parallel approaches makes it very difficult to create general programs that can take advantage of different hardware and software libraries across a variety of platforms. There is also the algorithm design challenge: Many developers simply adapt applications that were initially developed in a serial computing environment. Such adaptations typically use relatively limited task- and data-parallel approaches that require significant synchronization and interprocess communication, as well as locks and mutexes to access serial data structures. These approaches often do not scale well on massively parallel hardware. Instead, in the interest of scaling across the emerging parallel hardware, developers must rethink parallel algorithms that may be slower and less efficient at small scale, but execute efficiently as data becomes large.

These challenges are significant and require concerted effort to realize the full potential of parallel computing. The VTK/ParaView community, led by Dr. Berk Geveci at Kitware in collaboration with prestigious computing organizations such as DOE's Sandia, Los Alamos, and Oak Ridge National Labs, is taking these challenges to heart and has developed a broad vision for the future, one that depends on a multi-pronged approach to address differing parallel computing needs. The major components of this vision include:

  • vtkSMPTools, an abstraction for threaded processing which under the hood uses different libraries such as TBB, OpenMP and X-Kaapi. The typical target application is coarse-grained shared-memory computing as provided by mainstream multicore, threaded CPUs such as Intel's i5 and i7 architectures.
  • VTK-m, a toolkit of scientific visualization algorithms for emerging processor architectures (GPUs and co-processors). VTK-m is designed for fine-grained concurrency and provides abstract data and execution models that can be utilized to construct a variety of scalable algorithms.
  • New Algorithms, requiring the redesign and implementation of algorithms and data structures that best take advantage of new parallel computing hardware and libraries. This requires identifying essential capabilities for a variety of visualization and data analysis tasks.

Currently all of these areas are under active development. In particular this blog post addresses vtkSMPTools due to its relative simplicity and maturity. Future articles will describe the other approaches as they mature and are further incorporated into VTK and ParaView.

vtkSMPTools

The best place to start is to read this short but informative description. As stated in the introduction to this document, the objective of the SMP (symmetric multiprocessing) framework is to provide a basic infrastructure for the easy development of shared memory parallel algorithms in VTK. The resulting framework is quite simple supporting:

  • atomic integers and associated operations;
  • thread local storage; and
  • parallel building blocks such as parallel for loops.

A noteworthy feature of vtkSMPTools is that it defines an abstraction to a variety of underlying threading systems, and at compile time (using the CMake advanced option VTK_SMP_IMPLEMENTATION_TYPE) it is possible to choose one of several possible threading facilities. Currently the choices are the Intel Threading Building Blocks library TBB; Kaapi; Simple (soon to be replaced by the OpenMP backend); and Sequential (as in serial processing). At the time of this writing, OpenMP is also being added to the mix as well and will be available soon. In addition, key general-purpose algorithms such as parallelsort will also be supported by vtkSMPTools.

With vtkSMPTools, rather than worrying about scheduling and balancing work among multiple threads, a developer can focus on writing algorithms as a sequences of small, independent tasks. vtkSMPTools takes care of distributing the work to the underlying thread resources.

Parallel For()

This framework implements a thin wrapper around existing tools such as TBB and OpenMP, and provides a simple API for writing parallel code. Useful algorithms that scale well can be created using the parallel for loop vtkSMPTools::For(). The arguments to For() consist of start and end parameters, expressed as integers of type vtkIdType and which define the semi-open range [start,end), and a functor to be executed over this range. The functor is a C++ class with an implementation of operator(). Optionally, a "chunk" or "grain" size can be provided that provides a hint as how to divide up the problem for effective load balancing vs scheduling overhead.

In practice the parallel for construct will look like the following:

This high level specification of the computation enables the different backends to employ various scheduling strategies for best performance. For example, the OpenMP backend supports static, dynamic, guided or implementation-dependent, default scheduling strategies, which can be chosen at runtime by setting the OMP_SCHEDULE environment variable. (Please refer to the OpenMP specification for a description of these strategies.) The TBB backend implements a technique called work stealing. In work stealing, each thread has its own work pool, from where it takes the next task to execute and adds any new tasks that are generated. Using local queues removes contention that is present when using a single work pool. When a thread has exhausted its own pool, it steal more work from other threads' pools hence balancing the workload.

The optional grain size parameter to parallel-for gives a hint about the granularity of the computation to be performed. This is useful for a good balance between scheduling overhead and load balancing. Parallelism is achieved by splitting the input range into smaller sub-ranges based on the grain size and assigning these sub ranges to the threads. If the grain size is too low, the functor is invoked many times and the overhead of invocation adds up resulting in poor performance. If it is too high, the overhead is reduced, but it can result in poor load balancing as some threads may finish their work sooner (if some of the sub-ranges are less demanding) and become idle since there are not enough sub-ranges to keep them fed. The grain size depends on the type of computation and it is recommended that an appropriate value be specified. The TBB backend can chose a grain size heuristically (and adaptively adjust it) but this feature is not available on the other backends. Other backends simply assign tasks based on a small multiple of the number of available threads.

Example

Here's a simple example inspired by the VTK class vtkWarpVector to create the class vtkSMPWarpVector. vtkSMPWarpVector simply computes new point coordinates by displacing input points by an associated, scaled vector field. First the functor is defined, including overloading operator(), which in this case is templated over the type of points and vectors.

Next in the VTK class, the functor is used as follows (in the templated execution method vtkSMPWarpVectorExecute2):

(The full implementation is available in Filters/SMP/vtkSMPWarpVector.h/.cxx) That's it, with very few lines of code, a scalable, threaded, load-balanced algorithm has been implemented.

Advanced Capabilities and Future Work

In an ideal parallel algorithm, threads operate on data without retaining state between invocations; and read and write from and to shared memory without contention. In the example above, no initialization or final reduction is required on each thread, and writing to memory is uniquely determined by which point id the thread is working on. In more realistic applications, however, many algorithms do require synchronization, initialization, and maintenance of local state. To provide these capabilities, vtkSMPTools provides the following tools (see the document referenced previously for more information):

Thread local storage is used to store VTK instances and other objects. Developers can instantiate objects using vtkSMPThreadLocal and vtkSMPThreadLocalObject (for VTK classes that require VTK-specific allocation and deletion). These objects are created once per thread, and persist as each edge processes its assigned tasks. Such local thread objects are a great way to accumulate output over multiple thread invocations. Functor initialization and reduction is typically used in combination with the creation of local thread objects, since these objects must often be initialized; and the results combined (or reduced) into final form as execution concludes. vtkSMPTools provides these capabilities by invoking the optional Initialize() and Reduce() method (defined by the functor in combination with the operator() method). Atomic Integers can be used to ensure data integrity when reading from and writing to shared memory. vkSMPTools supports a subset of the C++11 atomic API; mainly the operators ++, -, +=, -=, load, and store. These integers can be used when multiple threads must access and manipulate a common data structure.

Currently many VTK algorithms are being rewritten with these simple constructs with good performance scaling. New algorithms are also being designed and implemented (both in the vtkSMPTools framework as well as in VTK-m). If you are interested in realizing performance gains in your own code, we recommend that you build VTK with vtkSMPTools enabled (currently TBB is the most popular). We also look forward to receiving contributions of your new and old algorithms implemented using this framework.

2 Responses to Simple, Parallel Computing with vtkSMPTools

  1. Andrew Maclean says:

    This is really exciting work. Whilst I understand that a lot of the VTK algorithms need to be rewritten to support parallelisaion.
    My question relates to how many other aspects of VTK need to be rewritten?
    For instance, I suspect that there is potential in, for example, vtkParametricFunctionSource for a rewrite that has some sort of adaptive tesselation near the poles of regular surfaces (ellipses) or in in some regions of non-orientiable surfaces. Maybe there is scope for parallelisation here with some master process selecting optimal results from various parellel processes? This could get rid of the ugly dark shadows at the poles.

  2. Will Schroeder says:

    Andrew- The nice thing about the vtkSMPTools backend is that it is fairly non-intrusive in that if an algorithm is rewritten using it, then even if the code is built without a parallel backend (i.e., built in Sequential) it will perform well. But building with something like TBB will enable most modern multi-core systems to realize significant performance gains (my development laptop is 4/core 8/threads and linear scaling gains are not uncommon for small numbers of threads).

    As far as rewriting other algorithms: VTK is chock full of code that can be rewritten for (parallel) performance, clarity, design flexibility, and new algorithmic advances. Obviously as a community we need to prioritize and then produce new implementations. I’d be happy to work with you and the community to create a priority list so everybody can pitch in, and find something that matches their skill set and/or interests. Although there is no need to wait if a particular task has got you excited — go for it!

Questions or comments are always welcome!