Simple, Parallel Computing with vtkSMPTools

Introduction
The Visualization Toolkit (VTK) system has been in active development for over 20 years, and from the earliest days, the community has concerned itself with parallel computing. The class vtkMultiThreader was introduced in 1997, mainly to support threaded, shared-memory data processing in the imaging pipeline. Scalable, distributed parallel computing has also been a major focus. Applications such as ParaView depend on these capabilities to execute on large leadership computing facilities. For example, ParaView/VTK has been successfully run using 256K processors in a coprocessing simulation with the Phasta turbulent flow analysis code on Mira, Argonne National Laboratory’s IBM Blue Gene/Q system [1]. Such data-parallel, distributed computing applications require a sophisticated execution model, and VTK has seen several incarnations of the pipeline architecture as a result.

Similar development efforts continue to this day, with client-server models extending the reach of server-side parallel computing through systems such as VTK/ParaViewWeb [2]. In fact, with the advent of cloud computing and systems such as Amazon Web Services (AWS) and Google Compute Engine, it is now possible for many organizations to use VTK/ParaView in a distributed, customized, relatively inexpensive, on-demand basis to meet computational needs.

However, as we all know, the computing industry is not standing still. The field of parallel computing is advancing rapidly due to innovations in graphics processor unit (GPU) and multi-core technologies. With the plateauing of serial clock speeds, due to power and other physical constraints, chip manufacturers are adding computing capacity in the form of additional computing cores and vectorized processing units. While such hardware represents a significant leap in computing potential—with the distinct possibility of a dawning exascale computing era—these hardware advances pose significant challenges to software systems. Not only are popular programming languages unsuited for most parallel computing applications, but the diversity of emerging parallel approaches makes it very difficult to create general programs that can take advantage of different hardware and software libraries across a variety of platforms.

There is also the algorithm design challenge: Many developers simply adapt applications that were initially developed in a serial computing environment. Such adaptations typically use relatively limited task- and data-parallel approaches that require significant synchronization and inter-process communication, as well as locks and mutexes to access serial data structures. These approaches often do not scale well on massively-parallel hardware. Instead, in the interest of scaling across the emerging parallel hardware, developers must rethink parallel algorithms that may be slower and less efficient at a small scale, but that execute efficiently as data becomes large.

These challenges are significant and require a concerted effort to realize the full potential of parallel computing. The VTK/ParaView community—led by Dr. Berk Geveci at Kitware, in collaboration with prestigious computing organizations such as Department of Energy National Laboratories—is taking these challenges to heart and has developed a broad vision for the future. This vision is one that depends on a multi-pronged approach to address differing parallel computing needs. The major components of this vision include the following:

1. vtkSMPTools

This is an abstraction for threaded processing, which under the hood uses different libraries such as Threaded Building Blocks (TBB), OpenMP, and X-Kaapi. The typical target application is coarse-grained shared-memory computing as provided by mainstream multi-core, threaded central processing units (CPUs) such as Intel’s i5 and i7 architectures.

2. VTK-m

This is a toolkit of scientific visualization algorithms for emerging processor architectures (GPUs and coprocessors). VTK-m is designed for fine-grained concurrency and provides abstract data and execution models that can be utilized to construct a variety of scalable algorithms.

3. New Algorithms

This component requires the redesign and implementation of algorithms and data structures that best take advantage of new parallel computing hardware and libraries. To do so, it is necessary to identify essential capabilities for a variety of visualization and data analysis tasks.

Currently, all of these components are under active development. This article addresses vtkSMPTools due to its relative simplicity and maturity. Future articles will describe the other components (VTK-m and new algorithms) as they mature and are further incorporated into VTK and ParaView.

vtkSMPTools

The best place to start learning about vtkSMPTools is a short but informative description, titled “Parallel Processing with the SMP Framework in VTK” [3]. As stated in the introduction to this document, the objective of the symmetric multiprocessing (SMP) framework is to provide a basic infrastructure for the easy development of shared-memory parallel algorithms in VTK. The resulting framework is quite simple, supporting the following:

  • atomic integers and associated operations
  • thread local storage
  • parallel building blocks such as parallel for loops

A noteworthy feature of vtkSMPTools is that it defines an abstraction to a variety of underlying threading systems. At compile time (using the CMake advanced option VTK_SMP_IMPLEMENTATION_TYPE), it is possible to choose one of several possible threading facilities. Currently, the choices are the Intel TBB library [4], Kaapi [5], Simple (soon to be replaced by the OpenMP backend), and Sequential (as in serial processing). As noted, OpenMP is being added to the mix and will be available soon. In addition, key general-purpose algorithms such as parallel_sort will be supported by vtkSMPTools.

With vtkSMPTools, rather than worrying about scheduling and balancing work among multiple threads, a developer can focus on writing algorithms as a sequence of small, independent tasks. vtkSMPTools takes care of distributing the work to the underlying thread resources.

Parallel For()

The vtkSMPTools framework implements a thin wrapper around existing tools such as TBB and OpenMP, and it provides a simple application programming interface (API) for writing parallel code. Useful algorithms that scale well can be created using the parallel for loop vtkSMPTools::For(). The arguments to For() consist of start and end parameters, which are expressed as vtkIdType integers and define the semi-open range [start,end), and a functor to be executed over this range. The functor is a C++ class with an implementation of operator(). Optionally, a “chunk” or “grain” size can be provided that offers a hint as to how to divide up the problem for effective load balancing versus scheduling overhead.

In practice, the parallel for construct looks like this:

vtkSomeFunctor func;
vtkSMPTools::For(begin,end,func);

This high-level specification of the computation enables the different backends to employ various scheduling strategies for optimal performance. For example, the OpenMP backend supports static, dynamic, guided, or implementation-dependent default scheduling strategies, which can be selected at runtime by setting the OMP_SCHEDULE environment variable. (Please refer to the OpenMP specification for a description of these strategies.) The TBB backend implements a technique called work stealing [6]. In work stealing, each thread has its own work pool, from which it takes the next task to execute and adds any new tasks that are generated. Using local queues removes contention, which is present when using a single work pool. When a thread has exhausted its own pool, it steals more work from other threads’ pools, and hence, balances the workload.

The optional grain size parameter to the parallel for construct provides a hint regarding the granularity of the computation to be performed. This is useful for equilibrium between scheduling overhead and load balancing. Parallelism is achieved by splitting the input range into smaller sub-ranges based on the grain size and assigning these sub-ranges to the threads. If the grain size is too low, the functor is invoked many times, and the overhead of invocation adds up, resulting in poor performance. If the grain size is too high, the overhead is reduced. This can result in poor load balancing, as some threads may finish their work sooner (if particular sub-ranges are less demanding) and become idle, since there are not enough sub-ranges to keep them fed.

The grain size depends on the type of computation, and it is recommended that an appropriate value be specified. The TBB backend can chose a grain size heuristically (and adaptively adjust it), but this feature is not available on other backends discussed in this article. Other backends simply assign tasks based on a small multiple of the number of available threads.

Example

The following is a simple example, inspired by the VTK class vtkWarpVector, for creating the class vtkSMPWarpVector. vtkSMPWarpVector simply computes new point coordinates by displacing input points by an associated, scaled vector field. First, the functor is defined, including the overloading operator(), which in this case is templated over the type of points and vectors.

#include "vtkSMPTools.h"

template <class T1, class T2>
class vtkSMPWarpVectorOp
{
public:
T1 *InPoints;
T1 *OutPoints;
T2 *InVector;
T1 scaleFactor;
void  operator()(vtkIdType begin, vtkIdType end)
{
T1* inPts = this->InPoints + 3*begin;
T1* outPts = this->OutPoints + 3*begin;
T2* inVec = this->InVector + 3*begin;
T1 sf = this->scaleFactor;
vtkIdType size = 3*(end-begin);
vtkIdType nend = begin + size;

   for (vtkIdType index = begin; index < nend;
index++)
{
*outPts = *inPts + sf * (T1)(*inVec);
inPts++; outPts++; inVec++;
}
}
};

Next, in the VTK class, the functor is used as follows (in the templated execution method vtkSMPWarpVectorExecute2):

template <class T1, class T2>
void vtkSMPWarpVectorExecute2(vtkSMPWarpVector *self, T1 *inIter, T1 *outIter,
T2 *inVecIter, vtkIdType size)
{
vtkSMPWarpVectorOp<T1, T2> func;
func.InPoints = inIter;
func.OutPoints = outIter;
func.InVector = inVecIter;
func.scaleFactor = (T1)self->GetScaleFactor();

 vtkSMPTools::For(0, size, func);
}

(The full implementation is available in Filters/SMP/vtkSMPWarpVector.h/.cxx.)

Thus, with very few lines of code, a scalable, threaded, load-balanced algorithm has been implemented!

Advanced Capabilities and Future Work

In an ideal parallel algorithm, threads operate on data without retaining state between invocations, and they read from and write to shared memory without contention. In the example above, no initialization or final reduction is required on each thread, and writing to memory is uniquely determined by the point ID on which the thread is working.

In more realistic applications, however, many algorithms require synchronization, initialization, and maintenance of local state. To provide these capabilities, vtkSMPTools offers the following tools. (For more information, see the previously referenced document, “Parallel Processing with the SMP Framework in VTK.”)

1. Thread local storage

Thread local storage is used to store VTK instances and other objects. Developers can instantiate objects using
vtkSMPThreadLocal and vtkSMPThreadLocalObject (for VTK classes that require VTK-specific allocation and deletion). These objects are created once per thread, and they persist as each edge processes its assigned tasks. Implementing such local thread objects is a great way to accumulate output over multiple thread invocations.

2. Functor initialization and reduction

Functor initialization and reduction is typically used in combination with the creation of local thread objects, since these objects must often be initialized. The results are combined (or reduced) into final form as execution concludes. vtkSMPTools provides these capabilities by invoking the optional Initialize() and Reduce() methods (defined by the functor in combination with the operator() method).

3. Atomic integers

Atomic integers can be used to ensure data integrity when reading from and writing to shared memory. vkSMPTools supports a subset of the C++11 atomic API, mainly the operators ++, –, +=, -=, load, and store. These integers can be used when multiple threads must access and manipulate a common data structure.

Currently, many VTK algorithms are being rewritten with these simple constructs with good performance scaling. New algorithms are also being designed and implemented (both in the vtkSMPTools framework and in VTK-m). If you are interested in realizing performance gains in your own code, we recommend that you build VTK with vtkSMPTools enabled. (At the moment, TBB is the most popular.) We also look forward to receiving contributions of your new and old algorithms implemented using this framework.

References

[1] https://vimeo.com/120504264
[2] http://www.paraview.org/web
[3] http://www.vtk.org/Wiki/images/3/3b/VTK_SMP_Guide.pdf
[4] https://www.threadingbuildingblocks.org
[5] https://github.com/perarnau/xkaapi/blob/kcomputer/README
[6] http://en.wikipedia.org/wiki/Work_stealing

Will Schroeder is president and CEO of Kitware. His role is to identify technology and business opportunities and to obtain the necessary support for Kitware to meet these opportunities. Will also provides technical leadership for large open-source projects such as the National Library of Medicine Insight Toolkit project and VTK.

 

Sujin Philip is a member of Kitware’s Scientific Computing team. His background is in using massive and hybrid parallelism (combining message passing, shared memory multithreading, and GPGU) for scientific computing and visualization applications.

 

Berk Geveci is the Senior Director of Scientific Computing at Kitware. He is one of the lead developers of the ParaView visualization application and VTK. His research interests include large-scale parallel computing, computational dynamics, finite elements, and visualization algorithms.

Questions or comments are always welcome!