3. High Performance Fortran (HPF)

Introduction

HPF is an informal standard for extensions to Fortran 90 to assist its implementation on parallel architectures, particularly for data parallel computation. Among other things it includes directives for expressing data distribution across multiple memories, and features for expressing concurrent execution.

It was developed in a series of meetings in the USA between January 1992 and June 1993 by a working group comprising most parallel computer manufacturers, compiler vendors, and a number of government and academic research groups. The effort was launched and driven by Prof. Ken Kennedy of Rice University and Prof. Geoffrey Fox of Syracuse University. It is supported by virtually all parallel computer manufacturers, and the next revision of Fortran, in 1995, will include everything necessary to make an HPF program a legal Fortran 95 program.

This lecture will just give an overview of HPF's data mapping declarations.

Directives

A directive is a structured comment that has a special meaning to a particular compiler, but is treated as a comment (i.e. ignored) by other compilers.

In HPF, all of the data mapping declarations are directives. HPF directives are comments starting `HPF$' immediately after the comment character (`!', or `C' in column 1 in old-style source form).
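For example, the same directive could be written in either source form as follows (the PROCESSORS declaration is just an illustrative placeholder; in the old-style form the `C' must be in column 1):

!HPF$ PROCESSORS  P (4)
CHPF$ PROCESSORS  P (4)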

Being comments they are ignored by non-HPF compilers, so an HPF program should be compilable by an ordinary Fortran compiler for any platform. (However, HPF does introduce some actual syntax extensions to Fortran, so this claim only applies if those extensions are not used. All of these HPF syntax extensions will, however, be incorporated in the next revision of the Fortran standard, Fortran 95.)

They also do not change the semantics of a program - that is, it should perform the same actions and produce the same results regardless of the particular data mapping. Data mapping just affects the performance of a program, not its semantics. (There are two qualifications to this statement, however: (i) operations may be repeated redundantly if data are replicated, and (ii) results may vary because of the effects of different orderings for reduction operations like SUM, in which the elements may be summed in any order).
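A standard illustration (not taken from the lecture) of why the ordering of a floating-point reduction can matter: in finite-precision arithmetic

     (1.0E20 + (-1.0E20)) + 1.0  gives  1.0
     1.0E20 + ((-1.0E20) + 1.0)  gives  0.0

because in the second grouping the 1.0 is lost when it is added to -1.0E20.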

Data mapping in HPF

Data mapping means the mapping of data arrays to processor memories. In HPF, this may be specified in 2 stages, alignment and distribution:

Alignment
relates the elements of one array to the elements of another array or a template (see below), such that elements that are aligned are mapped to the same processor regardless of distribution directives.

Distribution
specifies how a data array or template is to be distributed over an (abstract) processor array.

HPF introduces the concept of a template, which is a virtual array, i.e. one that occupies no storage. Its sole use is as an abstract array with which to align data arrays, and which can then be distributed over the processor array, i.e. it provides an intermediary in the mapping of data arrays to processors.

It is not mandatory to use templates, though they prove convenient in some contexts. Neither is alignment mandatory - arrays can be distributed directly onto processors. However, if it is intended that arrays will be aligned (i.e. distributed in a related fashion), then it is clearer and safer to specify this explicitly rather than relying on it being achieved as a side-effect of distribution.
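As a preview of the two-stage mapping (a sketch only; the array, template and processor names here are invented for this illustration, and the individual directives are explained in the following sections), an array may be aligned with a template which is in turn distributed over a processor array:

           REAL  A (100, 100)
     !HPF$ TEMPLATE  T (100, 100)
     !HPF$ PROCESSORS  P (4, 4)
     !HPF$ ALIGN  A (:,:)  WITH  T (:,:)
     !HPF$ DISTRIBUTE  T (BLOCK, BLOCK)  ONTO  P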

We shall now show the form of the various HPF directives for data mapping.

Templates and processors

Examples of the directives for declaring templates and processors are:

     !HPF$ TEMPLATE  S,  T (16),  U (2*M, 2*N)

     !HPF$ PROCESSORS  P(4),  Q(8, 8)

Alignment

The ALIGN directive relates the elements of an array to the elements of another array or a template, such that elements that are `aligned' with each other are guaranteed to be mapped to the same processor(s), regardless of the distribution directives.

For example, given 2-dimensional arrays A and T with the same shape:

     !HPF$ ALIGN  A (:,:)  WITH  T (:,:)
declares that each element of A is `aligned' with the corresponding element of T. This ensures that, for any values of i and j, element A(i,j) will be mapped to the same processor(s) as element T(i,j).

In this example, A is called the alignee and T is called the align target. We shall use A and T as shorthand for alignee and align target respectively.

In general one can express any linear mapping of the elements of A to those of T, by using subscript triplets to specify a regular section of T. The regular section of T must conform with A. Each element of A is then aligned with the corresponding element of T (i.e. that in the same position within the regular section).

For example, with:

           REAL  A(4)
     !HPF$ TEMPLATE T(8)
some possible alignments are:
     !HPF$ ALIGN  A (:)  WITH  T (1:4)
     !HPF$ ALIGN  A (:)  WITH  T (2:8:2)
     !HPF$ ALIGN  A (:)  WITH  T (4:1:-1)

As normal for regular sections, any dimension of T can contain a scalar subscript rather than a subscript triplet. The only requirement is that the section selected from T must conform with A. In this way an array can be `embedded' within a higher-dimensional array or template, as in the following example, where T2 is a 2-dimensional template:

     !HPF$ ALIGN  A (:)  WITH  T2 (:,2)

A `*' may appear in any dimension of A or T, and it means that the subscript in that dimension plays no part in the alignment relation. Thus, if a `*' appears in one of A's dimensions, it means that each set of elements whose subscripts differ only in that dimension is aligned with the same element(s) of T (which is called collapsing a dimension). E.g.:
     !HPF$ ALIGN  A2 (:,*)  WITH  T (:)
     (2nd dimension of `A2' collapsed)
This means that every element in a given row of A2 is mapped to the same processor(s) (since each row is aligned with a single element of T, which cannot be split across multiple processors however T is distributed). Therefore operations and assignments involving different elements in the same row will be executed without communications, at the expense of preventing concurrent operations and assignments on the elements of a row.

If a `*' appears in one of T's dimensions, it means that each element of A is aligned with a whole set of elements of T whose subscripts differ only in that dimension (i.e. A is copied, or replicated, over that dimension of T). For example:
     !HPF$ ALIGN  A (:)  WITH  T2 (:,*)
     (`A' replicated over 2nd dimension of `T2')
causes A to be replicated over the second dimension of T2, and consequently also replicated over whatever processor array dimension the second dimension of T2 is distributed over.

Replicating a variable has the advantage that its value can be read by multiple processors without communication, but the disadvantage of complicating its updating: since semantically there is just one copy of any given variable in an HPF program, all of its physical copies must be kept consistent, i.e. they must all have the same value at any point in the program, and so every copy must be updated.

An array cannot appear more than once as an alignee in an ALIGN directive in a given scoping unit; once it has been aligned it cannot be aligned again.

Notice that, by aligning data objects with other data objects rather than templates, it is possible to create an `alignment tree' as shown below. Only the root of an alignment tree can be distributed.
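For instance, a minimal sketch of such a tree (the names here are invented for illustration): B is aligned with the template T, A is aligned with B, and only the root T is distributed:

           REAL  A (100),  B (100)
     !HPF$ TEMPLATE  T (100)
     !HPF$ ALIGN  B (:)  WITH  T (:)
     !HPF$ ALIGN  A (:)  WITH  B (:)
     !HPF$ DISTRIBUTE  T (BLOCK)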

Distribution

The DISTRIBUTE directive specifies how an array or template is to be distributed over a processor array.

A so-called distribution format is specified for each dimension of the distributee (the object that is distributed). Only four distribution formats are provided: block, cyclic, block-cyclic and collapsed (i.e. undistributed). For simplicity we shall illustrate them for a 1-dimensional template:

     !HPF$ TEMPLATE  T (12)
distributed over a 1-dimensional processor array:
     !HPF$ PROCESSORS P (4)

Block distribution

Here the elements are divided into uniform (or nearly uniform) blocks of consecutive elements, which are allocated in order to consecutive processors. If the number of elements, n, is exactly divisible by the number of processors, p, then the blocks are of uniform size n/p. Otherwise blocks of size ⌈n/p⌉ (i.e. n/p rounded up) are allocated to the first ⌊n/⌈n/p⌉⌋ processors, the remaining elements form a small block which is allocated to the next processor, and no elements are allocated to any remaining processors.

For example:

     !HPF$ DISTRIBUTE  T (BLOCK)  ONTO  P
results in the elements being allocated to processors as follows: T(1:3) on P(1), T(4:6) on P(2), T(7:9) on P(3), and T(10:12) on P(4).
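(If instead the template had, say, 10 elements - a hypothetical variation on this example - the blocksize would be ⌈10/4⌉ = 3, so P(1), P(2) and P(3) would receive T(1:3), T(4:6) and T(7:9) respectively, and P(4) would receive only the remaining element T(10).)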

An explicit blocksize can be specified in parentheses after the BLOCK keyword. Thus in the above example, BLOCK (4) would mean that the elements are distributed in blocks of 4 rather than the default value of 3 (in which case processor P(4) would be unoccupied).

The specified blocksize must be such that the elements do not `wrap around' the processor array. To allow `wrap around' a cyclic or block-cyclic distribution must be specified, as we shall now describe.
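For instance, continuing the T(12) and P(4) example, BLOCK(2) would be illegal, since blocks of 2 on 4 processors cover only 8 of the 12 elements and the remainder would have to wrap around; a CYCLIC(2) distribution, described next, must be used instead.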

Cyclic and block-cyclic distribution

In the basic type of cyclic distribution, the first element is allocated to the first processor, the second to the second processor, etc. If there are more elements than processors then the distribution `wraps around' the processor array cyclically until all the elements are allocated.

For example:

     !HPF$ DISTRIBUTE  T (CYCLIC)  ONTO  P
results in the following allocation of elements to processors: P(1) holds T(1), T(5) and T(9); P(2) holds T(2), T(6) and T(10); P(3) holds T(3), T(7) and T(11); and P(4) holds T(4), T(8) and T(12).

An explicit blocksize may be specified in parentheses after the CYCLIC keyword, just as for the block distribution. In this case, however, the elements are allowed to wrap around the processor array. If they do, the distribution is often called block-cyclic. For example,

     !HPF$ DISTRIBUTE  T (CYCLIC (2))  ONTO  P
results in the following distribution of elements to processors: P(1) holds T(1:2) and T(9:10); P(2) holds T(3:4) and T(11:12); P(3) holds T(5:6); and P(4) holds T(7:8).

Incidentally, notice that BLOCK(b) and CYCLIC(b) are identical when there is no wrap around, i.e. when b*p >= n, where b is the blocksize, p the number of processors and n the number of elements. For instance, with T(12) and P(4), BLOCK(3) and CYCLIC(3) produce exactly the same mapping.

Cyclic distributions are sometimes useful for spreading the computation load uniformly over processors, in cases where computation is only performed on a subset of array elements or is otherwise irregular over an array.
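As a hedged sketch of this (the array, processor arrangement and loop here are invented, not taken from the lecture), consider a loop in which step k only updates columns k+1 onwards of an array. A CYCLIC distribution of the column dimension keeps the active columns spread over all the processors at every step, whereas a BLOCK distribution would leave the processors holding the leading columns idle during the later steps:

       REAL     a (100, 100)
       INTEGER  k
     !HPF$ PROCESSORS  p (4)
     !HPF$ DISTRIBUTE  a (*, CYCLIC)  ONTO  p

       DO k = 1, 99
         ! only columns k+1:100 are updated at step k
         a (:, k+1:100) = 0.5 * a (:, k+1:100)
       ENDDO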

Collapsed distribution

An asterisk in a distributee dimension means that it is collapsed, i.e. not distributed. For example, assuming SCALAR_PROC has been declared as a scalar (single-processor) processor arrangement, the following maps the whole of T onto that one processor:

     !HPF$ DISTRIBUTE  T (*)  ONTO  SCALAR_PROC

Multi-dimensional distributions

These descriptions generalise straightforwardly to multi-dimensional distributees and processor arrays, with the words `element' and `processor' replaced by `subscript value' and `processor subscript value'.

Examples of distributing a 2-dimensional template T2 onto processor arrays P(4) and Q(2,2) are as follows:
     !HPF$ DISTRIBUTE  T2 (BLOCK, *)  ONTO  P
     !HPF$ DISTRIBUTE  T2 (*, BLOCK)  ONTO  P
     !HPF$ DISTRIBUTE  T2 (BLOCK, BLOCK)  ONTO  Q

The ONTO clause may be omitted from the DISTRIBUTE directive, in which case the distribution is onto an implementation-dependent processor array. For example, some implementations may allow a default processor array to be specified by a command line argument or environment variable when the HPF compiler is invoked; others may have a built-in default.
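For example, the following (a sketch; which processor arrangement is actually used is implementation-dependent) distributes T2 block-wise in both dimensions onto the default processor arrangement:

     !HPF$ DISTRIBUTE  T2 (BLOCK, BLOCK)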

Example

We conclude with a simple example, showing how the Jacobi iteration program (example (ii) of the first lecture) may be converted to HPF to run on a distributed-memory (DM) MIMD machine with 16 processors.

This program involves 2 arrays, A and OLD_A, and it is seen that corresponding elements of these arrays are always accessed together. (That is, in the 2 statements involving both arrays, namely:

     DO WHILE ( ANY (a - old_a > 1.0E-07 * a) )
and
     old_a = a
the equivalent `elemental' expression or assignment involves corresponding elements, e.g. A (i,j) and OLD_A (i,j)). Therefore it is sensible to align A and OLD_A so that corresponding elements are mapped to the same processor, avoiding communications in these statements (except for the communications inevitably required for the reduction function ANY).

Since A and OLD_A are 2-dimensional square arrays, we choose to configure the processors also as a 2-dimensional square array of size 4 * 4, and to distribute each dimension of A and OLD_A blockwise along a dimension of the processor array. Recall that the update step involves replacing each element A(i,j) with the average of its nearest neighbours. Therefore, with a block distribution, only the edge elements of the array segment on each processor have to be communicated to the neighbouring processors, and the chosen distribution minimises the `surface to volume' ratio, i.e. the communication to calculation ratio.

To describe this distribution only 3 simple HPF directives must be added to the program - no other changes are necessary:

     PROGRAM jacobi
       REAL  a (100,100),  old_a (100,100)
     
     !HPF$ PROCESSORS  p (4,4)
     !HPF$ ALIGN a (:,:) WITH old_a (:,:)
     !HPF$ DISTRIBUTE old_a (BLOCK, BLOCK) ONTO p
     
       ...  set boundary values of A as required
       a (2:99, 2:99) = 0.0      ! initialise interior of A to 0
       old_a = 0.0               !     "      all of OLD_A to 0
     
       DO WHILE ( ANY (a - old_a > 1.0E-07 * a) )
         old_a = a
         a (2:99, 2:99) = 0.25 * (a (1:98, 2:99) + a (3:100, 2:99)  &
                                       + a (2:99, 1:98) + a (2:99, 3:100))
       ENDDO
     END PROGRAM

Notice that this is the same data distribution as that described earlier in this lecture (in the message-passing section), so it will generate code with the same communications as described there. The HPF version, however, is considerably clearer and easier to write than the corresponding message-passing version would be!


Written by John Merlin (jhm@vcpc.univie.ac.at).