2. Exploiting Data Parallelism

Vector and SIMD machines

It is reasonably clear how to exploit the data parallelism implicit in array syntax on vector and SIMD machines. Indeed, these architectures are designed specifically to exploit data parallelism!

(Aside: SIMD means 'Single Instruction Multiple Data'. A SIMD machine is typically an array of identical simple processors, which are all fed the same instruction stream but execute it on different data. Normally each processor has a mask bit which may be set or unset for each instruction and governs whether or not it executes the instruction. It is evident that Fortran 90 array assignments and WHERE statements are very closely matched to the capabilities of SIMD architectures, perhaps not surprisingly as Fortran array syntax was originally developed specifically for them!)
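
For example, in the following Fortran 90 fragment the array assignment updates every element independently, and the WHERE construct corresponds directly to setting or unsetting the processors' mask bits:

      REAL, DIMENSION(100,100) :: A, B, C

      ! Element-wise array assignment: all elements may be computed in parallel.
      C = A + B

      ! Masked assignment: conceptually, processors whose element of A fails
      ! the test have their mask bit unset for the body of the WHERE.
      WHERE (A > 0.0)
         B = SQRT(A)
      ELSEWHERE
         B = 0.0
      END WHERE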

Distributed memory MIMD machines

However, an increasingly important type of high performance computer architecture is the Distributed Memory MIMD architecture (DM MIMD for short). `MIMD' stands for `Multiple Instruction Multiple Data'. Such a machine comprises a set of fairly independent processors, each of which can execute an independent instruction stream and has its own `local' memory for storing code and data, connected together by a communications network for exchanging data. Such architectures have proved to be cost effective, scalable and relatively general purpose. In addition to purpose-built multi-processor machines of this type, workstation networks are becoming increasingly popular and may also be regarded as falling into this category.

Because each processor can execute its own instruction stream, a problem can be solved on such a machine by decomposing it into a number of different programs, or `tasks', each performing part of the full problem. This is known as `algorithmic' or `task' parallelism, or indeed as a `MIMD programming model', since it fully exploits the capabilities of a MIMD architecture. However, this approach is only practical if a small number of processors is used, because of the difficulty of designing and writing a large number of different interacting tasks to solve a single problem. (This approach is also not very scalable, i.e. it is not easy to change the number of tasks to suit different numbers of processors.)

SPMD programming model

By far the most popular style of programming `massively parallel' DM MIMD machines (i.e. those with large numbers of processors) is the SPMD ('Single Program Multiple Data') programming model. The same program, though not necessarily the same instruction stream, is executed on all processors, each processor operating on a part of the data. Thus different processors can execute different branches, loops, etc in the program, on different data. It is evident that the SPMD model is based on data parallelism (though in general it can now also be combined with algorithmic parallelism, i.e. different processors may execute different execution threads, and thus perform different operations, on their own part of the data). Henceforth we shall concentrate on the SPMD programming model.
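
As a minimal sketch of the idea (not taken from the original lecture): suppose NUM_PROCS() and MY_PROC() stand for whatever routines a message-passing library provides to return the number of processors and this processor's index. Every processor executes exactly the same program text, but the loop bounds are computed from MY_PROC(), so each processor updates a different block of the array:

      INTEGER, EXTERNAL :: NUM_PROCS, MY_PROC   ! hypothetical library routines
      INTEGER :: NP, ME, LO, HI, I
      REAL    :: X(1024)

      NP = NUM_PROCS()        ! how many processors are running this program
      ME = MY_PROC()          ! which one am I?  (0 .. NP-1)

      ! Each processor derives the bounds of its own block of X from its
      ! index, so the same statements operate on different data everywhere.
      LO = ME * (1024 / NP) + 1
      HI = (ME + 1) * (1024 / NP)

      DO I = LO, HI
         X(I) = 2.0 * X(I)
      END DO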

Message-passing

Whichever programming model is used, if the program running on one processor requires data that is stored in the local memory of another processor, the data must be communicated through the interconnection network. Up to now, this has normally been achieved by explicit `message passing', that is, by inserting `send' and `receive' statements in the program. For example, if processor `p1' requires a data item `x' stored in the local memory of processor `p2', then p2 must execute a send statement (e.g. send (x, p1)) and p1 must execute a corresponding receive statement (e.g. recv (x,p2)). (See the forthcoming lectures on `PVM' for a specific example of a message-passing library which can be used in Fortran.) Typically, accesses to local data are much faster than non-local accesses (i.e. communications). Therefore, for efficiency it is important to partition the program and data so as to try to (i) minimise communications and (ii) maximise potential parallelism.
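
In an SPMD program both processors execute the same statements, so the matching send and receive are usually selected by testing the processor's own identity. A minimal sketch, keeping the illustrative send and recv calls just described (a real library such as PVM has its own calling conventions):

      INTEGER :: ME, P1, P2   ! ME is this processor's id, as in the sketch above
      REAL    :: X

      ! P1 needs the value of X, which is stored in P2's local memory.
      IF (ME == P2) THEN
         CALL send(X, P1)     ! the owner sends its copy of X to P1
      ELSE IF (ME == P1) THEN
         CALL recv(X, P2)     ! P1 waits until the matching message arrives
      END IF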

For example, to run the Jacobi iteration program (example (ii) of the first lecture) on a DM MIMD machine using message-passing, it must be modified as follows (a code sketch follows the list):

  1. The data arrays (A (100,100) and OLD_A (100,100)) must be `distributed' over the processor memories. E.g. if there are 16 processors which are regarded as being logically arranged as a 4 * 4 array, A and OLD_A may be partitioned into 16 subarrays of size (25,25), each of which is stored on the corresponding processor.

  2. The program must be modified to update only the local segment of the array, i.e. the subarray stored on this processor.

  3. Message-passing calls must be inserted to communicate data where necessary. In this example, since the update of each point depends on the values of its 4 nearest neighbours, the `edge' values of each segment must be swapped between processors in every iteration. Care must be taken with special cases, e.g. the outside edges of the overall array are not communicated. Messages must also be exchanged to evaluate the global termination condition (ANY (A - OLD_A > 1E-7 * A)).
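
The fragment below is a sketch only, not the actual program from the first lecture: it shows how steps 1-3 combine on one processor of the 4 * 4 grid. The extra `halo' row and column on each side of the local arrays, the neighbour indices NORTH, SOUTH, EAST and WEST, and the routines send, recv and GLOBAL_OR are all assumptions standing in for whatever a particular message-passing library provides.

      ! Sketch only.  Local (25,25) segments of A and OLD_A, padded with a
      ! one-element halo on each side to hold neighbours' edge values.
      REAL    :: A(0:26,0:26), OLD_A(0:26,0:26)
      INTEGER :: NORTH, SOUTH, EAST, WEST   ! neighbour ids, derived from this
                                            ! processor's position in the grid
      LOGICAL :: NOT_DONE
      LOGICAL, EXTERNAL :: GLOBAL_OR        ! stand-in for a global reduction
      INTEGER :: I, J

      DO
         OLD_A(1:25,1:25) = A(1:25,1:25)

         ! Step 3: swap edge values with the four nearest neighbours.
         ! (Interior processors shown; processors on the boundary of the
         ! grid skip the missing sides.  A real program must also order
         ! these calls so that matching sends and receives pair up
         ! without deadlock.)
         CALL send(OLD_A(1,1:25),  NORTH);  CALL recv(OLD_A(26,1:25), SOUTH)
         CALL send(OLD_A(25,1:25), SOUTH);  CALL recv(OLD_A(0,1:25),  NORTH)
         CALL send(OLD_A(1:25,1),  WEST);   CALL recv(OLD_A(1:25,26), EAST)
         CALL send(OLD_A(1:25,25), EAST);   CALL recv(OLD_A(1:25,0),  WEST)

         ! Step 2: update only the local segment, using halo values where a
         ! neighbouring point is stored on another processor.
         DO J = 1, 25
            DO I = 1, 25
               A(I,J) = 0.25 * (OLD_A(I-1,J) + OLD_A(I+1,J)   &
                              + OLD_A(I,J-1) + OLD_A(I,J+1))
            END DO
         END DO

         ! Combine the local convergence tests into the global termination
         ! condition of the original program.
         NOT_DONE = ANY(A(1:25,1:25) - OLD_A(1:25,1:25) > 1E-7 * A(1:25,1:25))
         IF (.NOT. GLOBAL_OR(NOT_DONE)) EXIT
      END DO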

It is evident that message-passing programming is difficult! It is also error-prone and makes programs hard to modify. The difficulty of programming DM MIMD systems has so far been a big obstacle to using them!

Automatic parallelisation

This situation has motivated a great deal of research in recent years towards the goal of `automatic parallelisation', i.e. the automatic transformation of data parallel applications written in a standard sequential language like Fortran into SPMD message-passing programs that can be executed on DM MIMD machines. It has become clear that this can be at least partly achieved: if the required data partitioning and distribution is prescribed, a compiler can automatically partition the data and computation according to this prescription, and insert the necessary communications.

The really difficult part of fully automatic transformation is to automatically determine a suitable data partitioning and distribution. As we have said, an efficient data distribution must spread out the data arrays over the processors (rather than replicating them, i.e. storing a copy on each processor) as much as possible in order to maximise the potential parallelism, while distributing them in such a way as to minimise communications. To determine a suitable distribution therefore requires global analysis of the program's data access patterns and their relative importance (i.e. how often each is executed). Often this information cannot be determined statically as it depends on runtime values. Much research is being conducted on this problem, but currently no satisfactory conclusion has been reached. (Incidentally, note the difference between the analysis required here and that required for vectorisation or SIMD execution. The latter only require local analysis of each statement or loop to determine whether it is potentially data parallel.)

This situation has prompted the development of an informal standard for extensions to Fortran to declare data distribution, called `High Performance Fortran' (HPF). This long preamble has now led us to the main topic of this lecture! The rest of the lecture will give a brief overview of data distribution in HPF.
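
For instance (as a preview of what follows; the directive forms are standard HPF), the distribution described in step 1 of the Jacobi example could be declared as below, leaving the compiler to partition the computation and insert the communications:

      REAL A(100,100), OLD_A(100,100)
      !HPF$ PROCESSORS P(4,4)
      !HPF$ DISTRIBUTE A(BLOCK,BLOCK) ONTO P
      !HPF$ DISTRIBUTE OLD_A(BLOCK,BLOCK) ONTO P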


Written by John Merlin (jhm@vcpc.univie.ac.at).