HPF Users Group Meeting
Porto, Portugal
June 25-26, 1998
Notes taken by Charles Koelbel

 

Executive Summary

This was the annual meeting of the HPF Users Group, this year organized by the Vienna Center for Parallel Computing. Like the first year’s meeting, it was a good overview of recent developments in HPF compilers and usage. The talks were divided more-or-less 50-50 between the compiler/tool developers and the users (with some bias toward the developers, who often represented their users).

The very short summary is that, contrary to popular belief, HPF is not a dead language. Comparisons to other systems (notably MPI) showed performance differences ranging from tens of percent in HPF’s favor to larger factors (up to 10 in some worst-case examples) against HPF; a factor of 2 (against HPF) was probably close to the median difference. Quantitative comparisons of programming effort came out more positively for HPF, with estimates running a bit above half the development time of other approaches. Of course, additional improvements on the performance front are also in progress, so the "cost per performance" factor should continue to shift in HPF’s favor in the current year.

This was, of course, tempered by HPF’s well-known difficulties finding acceptance among the high performance computing community. Various suggestions were offered for how to increase that acceptance, including yet better tools and access to truly high-performance libraries through HPF. Additional HPF extensions, aimed more at improving compiler analysis and performance than at increasing user expressivity, are another option. There are, however, not many additional resources that can be thrown at the HPF problem — as PGI representatives put it, "we’re programming as fast as we can."

The somewhat longer version of the summary would highlight the two keynote talks by Ken Kennedy and Yoshiki Seo/Hitoshi Sakagami as interesting and important presentations. (Only the Japanese talk is described below, because my arrival in Porto was delayed by an Italian civil aviation authority strike. I’m taking word-of-mouth as evidence that Ken did his usual excellent job.) Other than that, the talks were generally of high quality. Next year’s meeting is tentatively scheduled for Maui, to allow relatively easy access by both American and Japanese researchers.

HUG’99 is tentatively scheduled for Maui, Hawaii (US) next spring or summer. Piyush Mehrotra will chair the meeting. Watch for more information on the HPFF web sites (http://www.crpc.rice.edu/HPFF/ and http://www.vcpc.univie.ac.at/information/mirror/HPFF/).

Detailed Notes

Mike Delves (NA Software), "SEMC3D Code Port to HPF"

SEMC3D is an EM code used by the auto industry to model electromagnetic fields inside cars, to ensure that interference between systems doesn’t cause bad effects ("for example, people’s heads don’t explode"). The original code was F77 using a finite difference leap-frog scheme, with a regular mesh around the car and refinements around individual parts. Translating to HPF followed the now-familiar model: cleaning the code (using FORESYS), translating to F90, tuning F90 performance (and debugging), adding HPF directives, and tuning again. The hardest part was the port to F90; lots of cosmetic features were trivial, but working around sequence and storage association was harder. (In SEMC3D, this turned out not to be an issue, thanks to clean coding in the original.) The real fun came in teasing out the bugs uncovered in the F90 version (and in the F90 compilers — "I won’t mention any other names, but our compiler didn’t have any bugs"). The natural decomposition was a 3D BLOCK(m) distribution, where m was chosen to optimize block sizes. To parallelize the code, array syntax and EXTRINSIC(HPF_SERIAL) were used. (NASL’s compiler uses a somewhat unusual definition of SERIAL, where it means "do this in parallel".) Performance tests show that the NASL HPF compiler beats the NASL F95 compiler by 10% or more on one node. In answer to an audience question, this was due to the compiler backends, not the F90 constructs themselves (as demonstrated by running the F77 codes through the F90 compiler). Speedup was very good relative to the initial times for each compiler, although it trailed MPI for this code by a factor of 3. Other HPF codes were much closer, in particular in cases where the MPI codes had been unable to distribute the arrays efficiently.
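
For concreteness, here is a minimal sketch of the kind of 3D BLOCK(m) distribution plus array-syntax update described above (illustrative only: the array names, sizes, the assumed eight processors, and the update itself are invented, not taken from SEMC3D):

    PROGRAM block_sketch
    ! A sketch only: names, sizes, and the update are invented, not from SEMC3D
      INTEGER, PARAMETER :: NX = 128, NY = 128, NZ = 64
      REAL, DIMENSION(NX, NY, NZ) :: E, H
    !HPF$ PROCESSORS P(2, 2, 2)
    !HPF$ DISTRIBUTE (BLOCK(64), BLOCK(64), BLOCK(32)) ONTO P :: E, H
      E = 0.0
      H = 1.0
    ! Interior update in array syntax; the compiler generates the nearest-neighbor
    ! communication implied by the shifted references to H
      E(2:NX-1, :, :) = E(2:NX-1, :, :) + 0.5 * (H(3:NX, :, :) - H(1:NX-2, :, :))
    END PROGRAM block_sketch

The BLOCK(m) sizes must be chosen so that m times the processor count in each dimension covers the array extent; that is the "m chosen to optimize block sizes" tuning mentioned above.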

Thomas Brandes (GMD), "Porting to HPF: Experiences with DBETSY3D within PHAROS"

The code uses the boundary element method to compute stresses and deformations in object surfaces; it is used by Mercedes for engine design and other things. The original code was written in F77 (actually, the original code was in Fortran IV) with the usual vectorization tricks (linearized arrays, etc.). Actually, it’s a system of 7 programs, but they only had time to parallelize one of them — the solver for a block-structured matrix, using LU decomposition on each block. The solver for individual blocks was an out-of-core solver as well. Again, they went ahead with a port to F90 using FORESYS and hand-cleanup, then inserted HPF directives. One tricky bit was that performance portability between the PGI and NA Software compilers wasn’t perfect — PGI correctly parallelized an INDEPENDENT loop containing a CALL, while NA Software needed the HPF_SERIAL interface described in the last talk. The parallelism was between subdomains, rather than straight data parallelism. Performance on one processor was very close (within 10%, it looked like). Scalability depended on the test case used, particularly on how the data distribution determined the load balancing. The main messages were that HPF needs parallel I/O and dynamic load balancing. In evaluating the tools, the only one that they had problems with was VAMPIR, which had difficulties with very large trace files. The effort analysis was that most of the effort went into conversion of the code to F90, which (apparently due to construct cleanups) improved the performance on the C90 enough that they didn’t need to run on workstation clusters as originally planned.
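
A sketch of the coarse-grain pattern described here, an INDEPENDENT loop over subdomain blocks in which each iteration calls a per-block solver through an HPF_SERIAL interface (block sizes and routine names are invented, not taken from DBETSY3D):

    PROGRAM subdomain_sketch
    ! Illustrative only: sizes and names are not from DBETSY3D
      INTEGER, PARAMETER :: NB = 100, NBLOCKS = 8
      REAL, DIMENSION(NB, NB, NBLOCKS) :: BLOCKS
    ! Each diagonal block lives entirely on one processor
    !HPF$ DISTRIBUTE BLOCKS(*, *, BLOCK)
      INTERFACE
        EXTRINSIC(HPF_SERIAL) SUBROUTINE LU_FACTOR(A)
          REAL, DIMENSION(:, :) :: A
        END SUBROUTINE LU_FACTOR
      END INTERFACE
      INTEGER :: K
      BLOCKS = 1.0
    ! Coarse-grain parallelism between subdomains, each iteration calling the solver
    !HPF$ INDEPENDENT
      DO K = 1, NBLOCKS
        CALL LU_FACTOR(BLOCKS(:, :, K))
      END DO
    END PROGRAM subdomain_sketch

    EXTRINSIC(HPF_SERIAL) SUBROUTINE LU_FACTOR(A)
      REAL, DIMENSION(:, :) :: A
    ! Stand-in for the real (out-of-core) block LU factorization
      A(1, 1) = A(1, 1) + 1.0
    END SUBROUTINE LU_FACTOR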

Christian Borel (MATRA) & Thomas Brandes (GMD), "Porting of the Industrial Computational Fluid Dynamics Code AEROLOG to HPF"

As you might guess from the name, AEROLOG is a real industrial application for compressible flow. Their problem is that they have both a shared- and a distributed-memory code, with active development going on in the shared-memory version (in particular, adding new physics). Load balancing is achieved by splitting the (dynamically varying, linearized) arrays appropriately between processors. "That is the problem of this code, that you have this very strange version of memory management." Most of the subroutines run over a single subdomain; loops over the subdomains can be parallelized at coarse granularity as before. The PHAROS development process went ahead as usual. The addition relative to the DBETSY code from the last talk was that subdomains needed to be saved across subroutine calls (rather than being created at each call). They chose the inelegant solution that didn’t break the existing HPF compilers — adding a dimension over the processors. Unfortunately, this meant a lot of rewriting of the code (especially parts using indirection arrays, which had to have their contents adjusted). Now that some compilers implement the GEN_BLOCK distribution (and will soon handle pointers to mapped data), they’d like to go back and test the more intuitive directives. The PGI compiler failed to produce a scalable program, due to over-pessimistic broadcasts used to implement the necessary indirection. Replicating the indirection array (the boundaries of subdomains) improved the efficiency on all machines, and was recommended as the best style to use. Performance comparisons with MPI showed how important that optimization, together with not repeatedly executing the executor code, really was to HPF execution. The overall conclusion is that HPF is useful, but not a complete solution for all applications.
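
The GEN_BLOCK distribution mentioned above assigns contiguous blocks of unequal size to the processors, which maps naturally onto subdomains of different sizes. A minimal sketch (the sizes and names are invented for illustration):

    PROGRAM genblock_sketch
    ! Hypothetical uneven block sizes, one entry per processor, summing to the extent
      INTEGER, PARAMETER :: NP = 4
      INTEGER, DIMENSION(NP), PARAMETER :: SIZES = (/ 300, 200, 260, 240 /)
      REAL, DIMENSION(1000) :: WORK
    !HPF$ PROCESSORS P(NP)
    !HPF$ DISTRIBUTE WORK(GEN_BLOCK(SIZES)) ONTO P
      WORK = 0.0
    END PROGRAM genblock_sketch

The size array has one entry per processor and must sum to the array extent; choosing those entries is where the load-balancing decision lives.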

Panel Discussion

Panelists:

Henk started the panel with a number of issues to be addressed:

Discussing Fortran, the panelists agreed that Fortran is not seen as a growth market, but is definitely still alive and kicking. "The fact that we’re having this discussion is proof that Fortran is not dead, but it’s also proof that it’s not healthy." A lot of the problem is pure marketing — people’s perception of Fortran is fixed on FORTRAN 77 (and earlier), not on more recent developments. Moreover, little interesting development (from a CS viewpoint) is coming out of Fortran. "If Fortran 2000 doesn’t address object-oriented programming, it will be dead." "If Fortran 2000 becomes object-oriented and loses efficiency, it’s dead too." However, it does look like the OO proposal for F2000 is sensible. It may be that the important thing for Fortran to do is make sure it can interface with other languages; multi-language programs are becoming the norm these days.

With that deft transition, we talked about interoperability for a while. "Fortran is successful because of its culture, and a big part of that is the libraries it has developed. We simply cannot afford to lose those." There are very good reasons for HPF to call parallel libraries written in other paradigms (e.g. ScaLAPACK), as well as for other paradigms to call HPF (e.g. Scott Baden’s KeLP system). Setting up such cross-language interoperability is difficult to do in a standard way. ISO approved a standard for interlanguage interoperability, which has since been sitting on a shelf, ignored. We could define an HPF interface to this level of interoperability, but it’s not clear that anyone would notice. (Also, it’s not clear how easy it would be technically, as HPF goes out of its way to avoid specifying low-level data layout. Different applications might well store distributed arrays in different orders.) Anybody can propose such an extension of the standard to the Fortran 2000 committee, although major changes are unlikely to be adopted at this point.

On to more current topics. "If OpenMP is the answer, I’d like to know what the question is." Two or three participants reported difficulty getting real information about OpenMP. "The specification does leave some things to the imagination, doesn’t it?" However, the simpler OpenMP constructs seem to be well worthwhile on shared-memory machines. A marriage of those constructs with (some of) HPF’s data distributions might well be very successful. However, a better OpenMP linkage to Fortran 90 is needed first. An EXTRINSIC interface from HPF seems simple as well, if the HPF implementation is built directly on OpenMP. If the implementation is more aggressive, for example generating OpenMP calling MPI for a hierarchical system, then things become dicier.

Everybody has new extensions to suggest for HPF. It’s not clear how harmonious they all are. One suggestion was to require an (unimplemented) extension be dropped from the list of approved extensions for every new extension added. Vendors were not overjoyed to hear more requests for features. However, it is clear that more advice to the compilers is needed for advanced and irregular applications. Many of us wish for ways to provide more information without telling the compiler all the details of how to do the implementation. Some of the proposals may even supply this, but it’s hard to tell at first glance.

Yoshiki Seo (NEC) and Hitoshi Sakagami (Himeji Institute of Technology), "HPF/JA: HPF Extensions for Real-World Parallel Applications"

This was a combined invited talk about HPF activities in Japan. The background is that, although Japan has several distributed parallel machines installed and under development, they lack message-passing programming expertise. They therefore see HPF as vital to providing parallel software. HPF/JA is a consortium of vendors (the original founders) and users; all members must commit to trying to implement or use HPF in their work. Their most important conclusion so far is that implementation-independent HPF programming techniques are needed — there should be clear advice for how to get high performance that does not depend on which implementation you use. In addition, they have developed a number of proposed extensions, all suggested by real-world applications and designed to fit with the HPF single-threaded, global address space model. These are somewhat at right angles to the HPF 2 approved extensions. The list is:

Dr. Sakagami then took over to talk about benchmarking of real-world applications. This included both evaluating standard HPF and their new extensions. There were 5 codes:

Guy Lonsdale (NEC Europe), "Contact-Impact Kernels in HPF+"

See http://www.par.univie.ac.at/hpf+/ for the details of the project, including downloadable codes. The purpose of the HPF+ project was to do a fair evaluation of HPF for irregular applications, starting from the standpoint that extensions of HPF (version 1, when they started) would definitely be needed. The key features are:

Reusing schedules starts by using the PUREST directive — going beyond PURE to say there is no communication at all. This ensures that local routines are executed locally. The REUSE directive tells the compiler that the communication schedule does not change (i.e. that the inspector need not be re-executed). Extensions allow conditional scheduling and labeling of schedules to be saved. HALO is a generalization of SHADOW for irregular distributions. The idea is to give the compiler knowledge of the array distribution, the ON clause, and the access pattern; the user must explicitly include an UPDATE_HALO directive to make the shadow region consistent. "Yes, this is dangerous; if you get this wrong, the program doesn’t run correctly." The result of all this was performance within a factor of 2 or 3 on one processor (slowdown due primarily to F90 compilers), and near-linear scaling equivalent to an MPI program. The talk then applied all this to an auto crash simulation; the main problem is detecting when (originally) disconnected elements intersect (i.e. when two parts hit each other) in order to repair the data structure. The performance of the contact penetration correction showed seriously bad behavior — even with reuse of the schedules it was more than a factor of 10 off from optimized MPI. The difference was that the MPI developers "spent years of blood, sweat, and tears" on minimizing communication. The overall conclusion was that HPF+ had made great advances with INDEPENDENT loops and the new directives, making this type of computation feasible; the bad news is that single-processor overhead is still too high and storage requirements for the generated code are too large.
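
For readers who have not seen the regular-distribution ancestor of HALO, here is a minimal sketch of the standard SHADOW extension that it generalizes (this is generic HPF 2.0 syntax, not the HPF+ directives themselves, whose details are project-specific; names and sizes are invented):

    PROGRAM shadow_sketch
    ! Generic SHADOW example; HALO generalizes the idea to irregular access patterns,
    ! with the shadow update placed explicitly under user control (UPDATE_HALO)
      INTEGER, PARAMETER :: N = 256
      REAL, DIMENSION(N, N) :: A, B
    !HPF$ DISTRIBUTE (BLOCK, BLOCK) :: A, B
    !HPF$ SHADOW A(1, 1)
      A = 1.0
    ! The compiler fills the one-cell shadow region before evaluating the shifted references
      B(2:N-1, 2:N-1) = 0.25 * (A(1:N-2, 2:N-1) + A(3:N, 2:N-1) +  &
                                A(2:N-1, 1:N-2) + A(2:N-1, 3:N))
    END PROGRAM shadow_sketch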

Frederic Bregier (LaBRI), "Propositions for Handling Irregular Problems in HPF 2"

The usual problems with irregular computations were listed: lack of dependence information, difficulty in generating communication, need for complex alignments, and need for generalized block or indirect distributions. Their approach is to provide a regular-computation-like programming style with explicit support for inspector-executor computations. They do this by beginning with a generic TREE as the basic data structure; a new directive identifies this structure (with the limitation that the tree have static depth). Distribution of the lower levels is done logically, while upper levels of the tree are replicated. Some related extensions move HPF further toward use of trees. They currently have a prototype implementation. All HPF 2 distributions are implemented internally using GEN_BLOCK (INDIRECT is handled by reordering). Inspectors and executors are implemented for trees. The experimental results used sparse Cholesky factorization. Inspection is a small part of the overall cost. The conclusion is that irregular tree codes can provide a lot more information to the compiler, and this can lead to better implementations. Future work includes supporting more properties of trees, extending the inspectors, and better integrating their tree library.

Ken Kennedy (Rice), "Advanced Optimization Techniques for High Performance Fortran"

"The goal of our work is to push compiler technology as far as it can be, not to be competitive with any other compiler product." Unlike the previous talks, the goal was to compile a "pure" version of the NAS benchmarks in HPF rather than create new extensions. Basically, as few source changes as possible were made to the benchmarks (as opposed to other versions of the NAS benchmarks, which had a 2x code size blowup). Some of the key features of the dHPF compiler are a general computation partitioning model ("not just the LHS owner-computes rule") based on integer communication and computation sets handled by the Omega calculator, and aggressive communication placement optimizations. This was applied to the NAS Parallel Benchmarks. Some interesting issues were:

Results from BT showed HPF at 97% of MPI’s performance on 4 processors, and 74% of MPI at 24 processors. The difference was in MPI’s use of a wrapped and skewed ordering of blocks, which is not an HPF distribution pattern. (Ken does think that compilers could automate this, but "we haven’t thought about it that hard yet.") "The lesson is that compiler technology has further to go, but that including those technologies makes HPF more useful to programmers."

Piyush Mehrotra (ICASE), "A Comparison of the PETSc Library and HPF Implementations for a Structured-grid PDE Computation"

"Let me say that the work here was done almost entirely by the first author." [M. Ehtesham Hayder, now of Rice] The BRATU problem being solved by the code is fuel ignition, a nonlinear elliptic PDE boundary value problem on a regular grid. There is a Newton iterative solver, plus a Krylov solver for the linear approximation and a Schwarz method on the subdomains. This translates to global reductions, nearest-neighbor stencils, and communication-free solves for the local domains. This was solved with PETSc, a library of parallel iterative solvers developed at Argonne with tremendous amounts of expertise built into its high-level abstractions. The programmer has to provide initialization, function evaluation and Jacobian evaluation for her application, and a convergence criterion. The conversion of BRATU to HPF required conversion of linearized arrays to multi-dimensional versions, conversion of some loops to F90 array syntax, inlining of a (sparse) matrix multiply routine that handled boundary conditions, and use of simple one- or two-dimensional BLOCK distributions. The initial performance was a disaster, since the ILU preconditioner was sequential. (PETSc had used a different preconditioner in its parallelization.) Changing to the PETSc preconditioner improved matters, but the HPF compiler still introduced superfluous communication. HPF_LOCAL was used to avoid block-to-block communication. Performance figures took up the rest of the talk; HPF was slower on the non-preconditioned problem by about a factor of 1.5, but scaled pretty much the same. PETSc had more efficient inner loops (computation), so its overhead was proportionally higher. When preconditioning was included, HPF was much closer to PETSc, even beating it for some problem sizes. In conclusion, HPF puts its effort into the compiler and PETSc into its library, with equivalent results. There was slightly more effort to use HPF, since it required understanding of the compiler and/or its output code. Tools for understanding HPF codes are definitely needed. The single-node performance of HPF was below par, apparently because of internal handling of multi-dimensional arrays (when they are handed off to the native compiler).
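
The HPF_LOCAL escape hatch mentioned above looks roughly like the following sketch (hypothetical names, not the actual BRATU code): each processor applies a purely local operation to its own block of a BLOCK-distributed array, and no communication is generated for the call.

    PROGRAM local_sketch
    ! Hypothetical: a 2D BLOCK-distributed unknown and a block-local solve
      INTEGER, PARAMETER :: N = 256
      REAL, DIMENSION(N, N) :: U
    !HPF$ DISTRIBUTE U(BLOCK, BLOCK)
      INTERFACE
        EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_PRECOND(U)
          REAL, DIMENSION(:, :) :: U
        END SUBROUTINE LOCAL_PRECOND
      END INTERFACE
      U = 1.0
    ! Each processor runs the routine on its local block; no block-to-block communication
      CALL LOCAL_PRECOND(U)
    END PROGRAM local_sketch

    EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_PRECOND(U)
      REAL, DIMENSION(:, :) :: U
    ! Stand-in for the real block-local preconditioner
      U = 0.5 * U
    END SUBROUTINE LOCAL_PRECOND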

Gavin Pringle (EPCC), "A Comparison of Various PGHPF Models on the Cray T3D/T3E"

The talk was basically about common performance bottlenecks on the SGI/Cray machines of the title. The background is that Edinburgh has 512 processors on the T3D and 128 processors on the T3E. PGHPF gives good single-processor performance, but scaling "is an issue" (although the current version seems to be much better). PGHPF on the T3E uses CRAFT/hardware direct remote memory access, rather than two-sided communication. The first bottleneck came from an n-body code which copied columns of data to a temporary, using either DO loops or array syntax. The conclusion was that index notation was faster, due to better packing of messages. Next came unstructured data movement, implemented as a gather of an array. The conclusion was that this was not a problem for the T3E. Tying it together, he looked at a real application — magnetohydrodynamics using a 6th order finite difference code. T3E efficiency was 70% on 64 processors, although the CRAFT compiler option actually slowed the code down (due to turning off message aggregation). Comparing equivalent constructs (FORALL, CSHIFT, and EOSHIFT), they found that CSHIFT was always faster than EOSHIFT, often by large factors (up to 9 times on small arrays). Moreover, FORALL was faster still, with or without the INDEPENDENT directive. Again, the CRAFT option on PGHPF sped up single-element communication and slowed down passing of sections. EPCC is now producing some small test programs to demonstrate the effects of coding style on HPF programs. "I find it very easy to promote HPF, and I find it hard to dispel negative opinions of HPF in general."
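
The kind of comparison being made is between a shift written with the intrinsics and the same data movement written as a FORALL. A generic illustration (not the EPCC test kernels themselves):

    PROGRAM shift_compare
    ! Generic illustration of the constructs being timed, not the EPCC kernels
      INTEGER, PARAMETER :: N = 1024
      REAL, DIMENSION(N) :: A, B
      INTEGER :: I
    !HPF$ DISTRIBUTE (BLOCK) :: A, B
      B = 1.0
    ! Circular shift via the intrinsic
      A = CSHIFT(B, SHIFT=1)
    ! The same data movement written as a FORALL (reported faster in this study)
      FORALL (I = 1:N) A(I) = B(MOD(I, N) + 1)
    END PROGRAM shift_compare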

Steve Piacsek (NRL), "Performance of Explicit Ocean Models on Shared and Distributed Memory Computers Using HPF"

"I underlined the ‘U’ in the name of the conference because I am a user." They were studying the parallelization of the 2-D component of ocean models, since this is the key part (3-D models add the complexity of Helmholtz solvers for the free surface). The code was first developed in CM Fortran, and he is now running as many models as he can before NRL turns off their CM-5 in September. There is also a separate effort to explore MPI and SHMEM on the Origin 2000 for the same codes. The model is high-resolution in order to correctly capture flow through the various straits in the domain (the Mediterranean). The code is centered in space and time, with leapfrog time-stepping and nearest-neighbor communication. Scaling is doing well enough, with an improvement of about 1.85 at every doubling of the number of processors. PFA was not doing nearly as well for large systems. A real scaling study in MPI (up to 4 GB of storage, i.e. a resolution of 200 meters over the whole Mediterranean) shows continued feasibility for large codes. The 3D code scales less well. "I would really like to use the HPF compiler improvements when they are available." Other future work includes improved models of boundary conditions and other scientific aims.

Mike Delves (NA Software), "The Real Benefits and Costs to Industry of the Ownership of HPF Codes"

"I’m Tim Cooper today." [Filling in for the first listed author of the paper] The HEIDI project is aimed at evaluating emerging technologies (including HPF) and showing "that they are worth the cost of emergence." In this case, the partners are Alfa Romeo Avio, doing combustion modeling; the primary aim was estimating the cost of developing and maintaining a real HPF code. "The group there is perfect for us: they do not know about HPF, they do know about MPI and want nothing to do with it, and they need to see speedup on their code." Technically, the code is a fluid flow code on body-fitted coordinates, with emphasis on 3D steady-state RANS equations, spray models, turbulence, radiative heat transfer, and reaction chemistry. "You’ll be getting sick of the sequence of steps in porting a code to HPF." They have inserted the HPF directives and are in the process of optimization. "I hope you get the impression from the speed I’m going through these slides that this has been a routine project." In this case, tidying up the code to F90 cut the run time in half (using the NA Software compiler); adding HPF on one processor added 25% overhead to F90. On a multiprocessor cluster connected by 100Mb Ethernet (and using the wrong distribution), preliminary results show a speedup of 288/224 (about 1.3) on two processors and 288/210 (about 1.4) on three processors. This does not include optimization of the I/O sections. In terms of cost of ownership, the compiler costs are negligible. Conversion costs (for F77 codes) are 8 man-years per code for MPI, 4 man-years in PHAROS, and 1.5 man-years in HEIDI. (The further reduced cost was due to learning effects — they didn’t have to come up to speed on HPF, having lived through PHAROS.) They can’t estimate relative runtime costs, having no MPI code to compare against. They are happy with HPF, and dismissed MPI out of hand early on. The one drawback was compile and link times for F90 and HPF.

Arild Dyrseth (Bergen), "HPF and OpenMP"

His motivation was to evaluate the available directive-based languages on the Origin 2000, using short kernels taken from seismic processing applications. Their work started on distributed-memory machines with HPF, producing acceptable performance on the Paragon and a workstation network. Initial observations of the Origin were that it was difficult to time the program or control execution at a fine level (e.g. mapping data onto processors). They compared OpenMP (simply adding the directives to the loops that were parallel, plus sharing the data) and PFA (SGI’s parallelizer, with directive help) to the PGI HPF implementation, using both its SMP and its distributed-memory code generation. Lots of speedup curves were shown; the bottom line is that PGHPF is well short of the SGI performance levels. In conclusion, OpenMP works. So does PFA. HPF on IRIX "needs more work" on the performance side. However, feedback to OpenMP is needed about F95 support; in particular, FORALL, array syntax, ALLOCATABLE, and reductions need to be considered. "For an open standard, it is hard to find who to talk to" about OpenMP.
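
The "just add directives to the parallel loops, plus sharing the data" style of OpenMP Fortran looks roughly like the sketch below (a generic seismic-flavored loop; the routine and array names are invented, not the Bergen kernels):

    ! Generic OpenMP Fortran sketch; SMOOTH, TRACE, FILT and the bounds are invented
    SUBROUTINE SMOOTH(TRACE, FILT, NSAMP, NTRACE)
      INTEGER, INTENT(IN) :: NSAMP, NTRACE
      REAL, INTENT(IN)    :: TRACE(NSAMP, NTRACE)
      REAL, INTENT(OUT)   :: FILT(NSAMP, NTRACE)
      INTEGER :: I, J
    ! Outer loop over traces runs in parallel; the arrays are shared among threads
    !$OMP PARALLEL DO PRIVATE(I, J) SHARED(TRACE, FILT, NSAMP, NTRACE)
      DO J = 1, NTRACE
        FILT(1, J) = TRACE(1, J)
        DO I = 2, NSAMP
          FILT(I, J) = 0.5 * (TRACE(I, J) + TRACE(I-1, J))
        END DO
      END DO
    !$OMP END PARALLEL DO
    END SUBROUTINE SMOOTH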

Henk Sips (Delft), "A Methodology for Converting Applications to HPF"

The motivation for the talk, as earlier in the day, was providing advice on managing the transition to HPF. In particular, they wanted to apply a realistic software engineering model to porting a short-range n-body solver (SIMMIX) to HPF. The algorithm uses explicit neighbor lists and an FFT pass; it is fairly cleanly written. Using a simple Amdahl’s law model, they came up with a potential speedup of 8.3 (with half that achieved at 9 processors). Differing-length neighbor lists imply an unbalanced workload, which they worked around (in HPF 1) by changing data structures to an edge list, which also removed the need for irregular distributions and the like. The performance went up by a factor of 2 or so, rather less than they expected (but accomplished in 24.5 person-days). They had expected 25 days of effort, so that was well on target. Unfortunately, they were far off on their estimate of where the effort would go — 9 days of the 25 went into debugging the compiler rather than working on the code. The conclusion was that better communication between programmers and compiler writers is definitely needed at this stage of HPF’s evolution.
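
As a back-of-the-envelope check of those numbers (a reconstruction only; their exact model was not shown), the basic Amdahl form with serial fraction s is

    S(p) = 1 / (s + (1 - s)/p)

    potential speedup:  S(infinity) = 1/s = 8.3   =>   s is about 0.12
    half of that:       S(p) = 8.3/2              =>   p = (1 - s)/s, about 7

which is in the same neighborhood as the 9 processors they quoted; the gap presumably reflects parallel overheads that their model also charged.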

Thomas Fahringer (Vienna), "On the Development of HPF Tools as Part of the Aurora Project"

Aurora is a long-term project working on high-level languages and associated tools for parallel computation. Given where it’s located, it’s no surprise that one of the tools is the Vienna Fortran compiler. However, the talk focused on other tools, starting with an HPF debugger, which is needed because serial debugging does not suffice: bugs surface in parallel on live data (which may not fit on one processor), due to erroneous HPF features (e.g. untrue INDEPENDENT assertions), and may need visualization (of data mappings or processor arrays) to find them. Relating program behavior to the program source is, of course, also problematic due to extensive optimizing transformations. The debugger is built on an MPI debugger, with interfaces to the compiler symbol tables and a visualization system. P3T is a performance estimator for regular, well-structured programs that includes handling of compiler optimizations and machine parameters (e.g. cache behavior). They are now using program traces, and hope to incorporate symbolic analysis soon. "I can show you one slide about the estimation." (Slide full of very small print, with estimate errors from 0.0% to 25%.) Parallel code instrumentation is also supported by SCALA, with the granularity of instrumentation selected by the user. For example, it is possible to break out time for the inspector and executor generated by an INDEPENDENT loop. The output is displayed by MADERIA, which looks like it includes the usual sorts of graphic feedback. Finally, Migrator is designed to port F77 and F90 codes to (partially extended) HPF. It includes "algorithmic patterns", loop restructuring, and loop parallelization. The pattern matching is at a semantic level (apparently based on dataflow graphs) in a bottom-up fashion. Questions from the audience seemed skeptical (starting with a muffed answer regarding whether the pattern match considered side effects that the intrinsics might remove). More information available at http://www.par.univie.ac.at/~tf/aurora/project4/index.html.

Doug Miles (PGI), "Vendor Presentation"

PGHPF can generate F77, F90, C or C++ code that can be compiled with its own node compilers (or another vendor’s) for several chips. Coming soon will be compilers for the IA-64 (a.k.a. Merced) and a native code generator for HPF. Most of the time was spent on a success story ported from CRAFT. The application was Hydra, a particle-particle particle-mesh code using adaptive gridding. The bottom line is that the HPF version (without adaptation) has a speedup of 128 on a 256-processor T3E. With adaptation (just completed in HPF, considered infeasible in CRAFT), HPF is 2.5 times faster due to the better algorithm. New features include SHADOW, caller remapping, ON HOME, REDUCTION, HPF_CRAFT, and incremental improvements in compiler performance and in the profiler. They have almost all of the base language of HPF 2, except some dark corners of Fortran 95 and SORT_UP and SORT_DOWN. Many of the approved extensions are already there; GEN_BLOCK is still in development, as is the character array language. The INDIRECT distribution is not in development yet, nor are processor subsets or the ON directive targeting processor subsets. The high-priority projects now are GEN_BLOCK, more support for shared memory and OpenMP, native compilation, Windows NT support including the VI stuff from HPVM, and Extreme Linux.

David McNamara (Pacific Sierra), "Vendor Presentation"

This is the company that produces VAST (in several versions). "We don’t make your compiler, we make it better." A free Linux VAST-HPF implementation can be found at http://www.psrv.com/ (but is limited to 2 processors). DEEP (their environment) includes "everything that would be needed to get a program up and running". Most important to them is retaining the HPF single-thread model for the programmer, targeted at a variety of systems including shared memory and distributed memory. As you might guess, this relies on giving the environment access to all of the compiler’s information. It seems to be a reasonably nice interface, judging from the screen shot. Both static (compiler) and dynamic (profiling) information can be displayed. The debugger includes a nice GUI, plus links to the HPDF debugger concept. Future directions include cache tuning advice and other NUMA support. They are in the final stages of development now.

Mike Delves (NA Software), "Vendor Presentation"

"I’ll assume that all you are interested in is what language features we support." They have just released version 2.0 of the Fortran90Plus compiler, which emphasizes optimization. It handles full Fortran 95, plus VARYING_STRING, IEEE Exception Handling, the proposed C Interoperability standard, and OpenGL. In general, they compare themselves to g77 on Linux for Pentiums, beating it by 10%. HPFPlus handles the HPF 2 core language with some (unnamed) restrictions, compiling to Fortran + MPI. Their main market is Linux clusters, although they do support other machines (beating IBM’s xlhpf). "We are committed to supporting full Fortran 95 and full HPF 2, but I am not prepared to give you a time schedule… So this slide tells you absolutely nothing."

Werner Krotz-Vogel (Pallas), "Vendor Presentation"

TotalView’s advantages mirror what users (are thought to) want:

Besides HPF, TotalView supports C, C++, and Fortran. It has a modular structure, allowing new features to be plugged in in short order. Pallas also distributes the VAMPIR performance tool, which gives very nice visualizations of parallel execution traces.

Closing Discussion

Ken Kennedy and Harold Ehold started the closing discussion with short presentations of surveys of HPF usage. Both can be found by links from the HPFF web page (http://www.crpc.rice.edu/HPFF/home.html), so I won’t detail them here. Suffice it to say that the number of HPF applications and HPF compilers continues to grow.

The real meat of the discussion was planning the future of the HUG meetings. The general sentiment was summed up at one point by the comment "May I suggest that ‘Who thinks we should have another round of HPF Forum meetings?’ and ‘Who wants to take part in those meetings?’ are separate questions?" It was generally agreed that some of the more successful HPF 2 Approved Extensions should be moved into the "core language". The GEN_BLOCK data distribution was mentioned as one that should be "promoted", while the INDIRECT distribution was not. However, there is no regularly-meeting body that could promote such changes, even though they are necessary for continued acceptance of the language. Similarly, several of the suggested extensions to HPF (notably explicit handling of overlap areas) were independently developed by multiple groups; the HUG consensus was that if these proposals could be unified, they would be good additions to the Approved Extensions list. Again, however, the HPF Forum is not meeting to consider such adoptions, nor was it terribly obvious that there was enthusiasm for starting another long series of meetings.

On reflection, it was not obvious that the series had to be long, either. The key to creating solid improvements to the language through extensions is having ample discussion of any proposals. Chuck Koelbel of Rice University agreed to set up a repository for proposals on the web, where they could be downloaded and discussed by any interested parties. With electronic discussion, it seemed that there was no need for multiple Forum meetings.

However, the need for HPF Forum meetings still exists in order to ratify "promotions" of features from Approved Extensions to Core Language. The audience agreed that a good opportunity to do this would be immediately after the HPF Users Group meeting, presumably every year. Therefore, the HUG organizer will arrange for a 1- or 1.5-day meeting of the Forum for the purposes of voting to accept or reject any pending proposals (as maintained on the aforementioned web site). The same meeting will consider moving features from the Approved Extensions to the Core Language to make them more standard.

This led to the discussion of scheduling the HUG’99 meeting. At the HUG’97 meeting, it had been suggested that the meeting alternate between America, Europe, and Asia (really, Japan) on an annual basis. The Japanese contingent, however, was somewhat hesitant to host the meeting until their implementations were closer to the product stage. In addition, travel costs to Asia are still high. A "compromise" solution of having the meeting in Hawaii was suggested and instantly met with acclaim. Brian Smith (associated with the Maui High Performance Computer Center) has been involved with HPFF and HUG in the past, and Ken Kennedy and Chuck Koelbel agreed to contact him about local arrangements on Maui. It was felt that appointing Brian chair in his absence was a bad idea, however. After suitable back-and-forth, Piyush Mehrotra volunteered to chair the meeting. He set about forming the program committee from people in the room, and the meeting was adjourned.
