
Compiling Techniques for HPF

Compiler writers were probably the first people to take a serious interest in HPF, because of the advanced techniques that would be needed to compile it efficiently. Ken Kennedy, himself a ``compiler person,'' proposed in 1991 a single Fortran dialect ``for all seasons'': uniprocessor, multiprocessor, host-slave, SPMD, MIMD, and SIMD machines. The HPFF was formed, and a spate of compiler papers ensued from Houston, Vienna, and Syracuse, including a number of papers intended to guide HPF compiler developers.

One is Hiranandani's excellent paper in Supercomputing '93, which covers a number of optimization techniques [95]. Another covers run-time statistics that a compiler could use to generate better code [8]. A number of compile-time optimizations for HPF are discussed by Zima and Chapman [188].

Early Testbeds for Studying HPF Compiler Optimizations

Fortran D started at Rice University in 1992 as a research project to explore HPF compiler techniques, e.g. I/O [149,180]. Initial experience with Fortran D showed that early HPF compilers were too machine-specific, and that while they worked well on stencil programs, they had to do better on linear algebra and pipelined codes. Preliminary experience with Rice's Fortran D compiler showed its very early state (it compiled only BLOCK decompositions, for example; see the sketch below), and a plea was made for people to be patient with HPF implementers. In the meantime (1995), a new Fortran D effort (Fortran D95), under the direction of John Mellor-Crummey, began from scratch. The Fortran D95 compiler accepts HPF plus extensions to handle indirect addressing and other new features.
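
For flavor, here is a minimal sketch of the Fortran D-style mapping declarations such an early compiler handled (a BLOCK decomposition; the array name and size are illustrative only):

      REAL A(1024)
      DECOMPOSITION D(1024)
      ALIGN A WITH D
      DISTRIBUTE D(BLOCK)

DECOMPOSITION declares an abstract index domain, ALIGN maps the array onto it, and DISTRIBUTE partitions the domain into one contiguous block per processor.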

Syracuse University also put together a prototype HPF compiler, described in Bozkus et al. [33,35], to investigate techniques that could be used by other HPF implementers. Their general approach was to skip dependence and interprocedural analysis and instead recognize certain patterns of Fortran 90 and generate calls to a run-time library. (The work is based on Li and Marina Chen's work on Crystal [122], but it was also influenced by the ARF compiler [25].) Their compiler, called Fortran 90D/HPF, has many intrinsic functions, implemented in EXPRESS. The performance of these functions on an iPSC hypercube is described in a 76-page technical report by Ahmad et al. [12]. One interesting part of the report is the array specification passed to the intrinsic routines (shown on page 3 of that report), which could become a de facto standard. The report also includes algorithms that could be used by compiler implementers.
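
To illustrate the pattern-matching approach (an example of mine, written with HPF-style directives, not code from their papers): an array statement built from Fortran 90 intrinsics can be translated without any dependence analysis, because each recognized intrinsic maps directly onto a run-time library routine:

      INTEGER, PARAMETER :: N = 512
      REAL A(N,N), B(N,N), C(N,N)
!HPF$ DISTRIBUTE (BLOCK,*) :: A, B, C
      C = B + CSHIFT(A, SHIFT=1, DIM=1)

The compiler matches the CSHIFT pattern and emits a call to the corresponding shift routine in the run-time library, followed by a purely local elementwise addition.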

Perhaps the most effective route to efficient parallel programming is Fortran M interoperating with HPF [43]. A paper by the Kennedy group gives a good overview of both compilers and points the way toward more portable parallel programming in the future.

Work on Specific HPF Compiler Optimization Techniques

At Rice University, Laurie Liebrock looked at linearized arrays and wrote a paper with Kennedy describing Fortran D techniques for indirect indexing and for handling linearized arrays [124]. The idea is to recover the ``natural topology'' of the arrays and thereby get a better volume-to-surface ratio, as sketched below. This approach compares favorably with handling such arrays at run time with PARTI (see Runtime Libraries for more information on PARTI). In any case, one must deal with the linearized arrays in a program before converting it to HPF, so tools to do this are welcome.
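
To make the term concrete (a fragment of mine, not from the paper): a linearized array is a multidimensional grid stored and indexed as a single vector. Distributing the one-dimensional vector yields strip-like partitions with a poor volume-to-surface ratio, whereas recovering the two-dimensional form lets HPF distribute both dimensions:

      ! Linearized: a 2-D grid flattened by hand;
      ! element (I,J) is accessed as U((J-1)*NX + I)
      REAL U(NX*NY)

      ! Delinearized: the natural topology, distributable in 2-D
      REAL V(NX,NY)
!HPF$ DISTRIBUTE V(BLOCK,BLOCK)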

This work was updated in 1997 [125], where Liebrock and Kennedy show how automatic mapping could work in real-world applications, specifically multiblock problems. The goal is to take the meshes, along with communication costs, as input and have their tool generate ALIGN, DISTRIBUTE, and PROCESSORS directives.
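
The generated output would be mapping directives of roughly the following shape (the names and sizes here are invented for illustration):

!HPF$ PROCESSORS P(4,4)
!HPF$ TEMPLATE T(512,512)
!HPF$ ALIGN GRID(I,J) WITH T(I,J)
!HPF$ DISTRIBUTE T(BLOCK,BLOCK) ONTO P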

Ken Kennedy et al. [108] addressed the problem of finding the local memory access sequence for programs with BLOCK-CYCLIC distribution. They find a faster method than the one previously proposed by Chatterjee [47]. This result was confirmed by Hank Sips [171], who evaluates several different ways to generate local loop bounds for the irregular local accesses that arise from regularly spaced accesses to global arrays. Sips considers both the local enumeration problem and the local storage compression problem. Is all this work on block-cyclic distributions worthwhile? According to [65], Dongarra argued for the importance of BLOCK-CYCLIC; IBM's ESSL and ScaLAPACK require BLOCK-CYCLIC in order to achieve maximum performance.
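
To see the flavor of the problem (a sketch of mine, not the cited algorithms): under a CYCLIC(B) distribution of an N-element array over NP processors, processor P (numbered from 0) owns the blocks starting at global index P*B+1 and repeating every NP*B elements. Even this simple enumeration is doubly nested; the cited papers solve the harder problem of enumerating, in order, the local addresses touched by a strided section A(L:H:S):

      INTEGER N, B, NP, P, FIRST, I
      DO FIRST = P*B + 1, N, NP*B
         DO I = FIRST, MIN(FIRST + B - 1, N)
            ! global index I resides on processor P
         END DO
      END DO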

Several papers address the issue of run-time data redistribution. There is work by John Gilbert [48] that uses a dataflow graph approach. Kalns and Ni [105] present a technique for data-to-processor mapping that minimizes the total amount of data that must be communicated (this paper does not reference the work by Gilbert and Chatterjee). A technical report from Vienna [135] gives a uniform method for eliminating unnecessary data redistributions, including those at procedure boundaries. The necessity of performing redistributions at procedure boundaries was pointed out by Hall [87]. Work by Li and Chen shows that the data alignment problem is NP-complete, even when source and target distribution patterns are given [122]. A more recent paper by Chen and Wu [185] proposes an algebraic analysis of data motion in order to optimize data redistribution: the communications are first expressed as statements in a communications algebra, the statements are reduced, and pattern matching then maps certain idioms to efficient routines in a runtime library.
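
A sketch of where a procedure-boundary redistribution arises (hypothetical code; the explicit interface HPF requires is elided): when a dummy argument prescribes a distribution different from that of the actual argument, the compiler must remap the data on entry and remap it back on exit.

      REAL A(N,N)
!HPF$ DISTRIBUTE A(BLOCK,*)
      CALL SOLVE(A)              ! A remapped to CYCLIC on entry,
                                 ! restored to BLOCK on return
      ...
      SUBROUTINE SOLVE(X)
      REAL X(N,N)
!HPF$ DISTRIBUTE X(CYCLIC,*)     ! prescriptive mapping on the dummy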

Other compiler research projects attempt to optimize communications, which is crucial for HPF compilers. Typical optimizations include message vectorization [94,188] and efficient collective communication [83,122]. For example, Gautier de Lahaut [77] attempts to model communications at compile time and emit code that performs the communication in an optimal fashion. This contrasts with the commonly used PARTI approach, which builds communication schedules at runtime [10,113,168]. Agrawal et al. suggest how to use interprocedural analysis to optimize communications even across procedure boundaries.
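
A sketch of what message vectorization buys (my example, not from the cited papers): in the shift loop below, a naive translation sends one element per iteration at each block boundary, while the vectorized code communicates once before the loop.

      REAL A(N), B(N)
!HPF$ DISTRIBUTE (BLOCK) :: A, B
      DO I = 2, N
         A(I) = B(I-1)
      END DO

After message vectorization, each processor receives its left neighbor's single boundary element of B into an overlap (``ghost'') cell before the loop; the loop body then makes only local accesses.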

A general framework for optimizing communications at run time can be found in Gupta et al. [86]. The approach includes analyzing the availability of data at processors other than the owner, which can eliminate redundant communication. Bordawekar et al. [31] describe how to compile out-of-core stencil problems from HPF into efficient computation schedules that can overlap computation with data movement.

One aspect of HPF compiler analysis that has received less attention than loop analysis and communication optimization is the handling of reductions. Most compilers are capable of idiom recognition and can generate reasonable code for simple reduction loops. IBM Japan's compiler group [174] went further: it detected reductions within complex loops, privatized the reduction variables, and generated coalesced global communication outside the loop. Their method is based on analyzing strongly connected components of the program graph.
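
The simple case that idiom recognition handles (a sketch of mine): the loop below is recognized as a sum reduction, so each processor accumulates a private partial sum over its local iterations, and a single global combine runs after the loop instead of serialized updates to S.

      S = 0.0
      DO I = 1, N
         S = S + A(I)*A(I)
      END DO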

A very important paper on compilation technology for data-parallel programs, though it does not directly mention HPF, is Rogers' work on locality [165].

A special issue of Scientific Programming in Spring 1997 summarized a number of compilation techniques [13,37,106,145].


Donna Bergmark
2/18/1998