next up previous contents
Next: Applications Up: An HPF Encyclopedia Previous: Runtime Libraries and HPF

Tools

As the PCRC [73] has noted, ``the requirements posed by emerging software systems challenge existing performance and debugging technology.'' They go on to say

``Run-time systems for ... HPF and HPC++ ...; environments for creating and accessing parallel, distributed data structures; and software for adaptive application execution and run-time decision analysis will all require new performance and debugging techniques, particularly for dynamic instrumentation, run-time queries, dynamic guidance, and execution state access.''

As David LaFrance-Linden notes [119], HPF debuggers have the same requirements as other debuggers: present control flow and data in terms of the original source. This is not easy for HPF:

Since HPF compilers automatically distribute data and computation, thereby widening the gap between actual execution and original source, meeting this requirement is both more important and more difficult.
We would add that this observation applies to all the execution analysis tools one might use with HPF.

A readable survey of tools for HPF has been started by Jean-Louis Pazat as part of the PAMPA project. This was meant to cover work being done on HPF in Europe, but includes some pointers to other parts of the world as well. The D system is a collection of HPF programming tools [94,93]. The Annai toolset aids programming, debugging, and tuning of HPF programs (and MPI programs, too) [55]. In fact, many of the tools that work for debugging and tuning MPI programs also work for HPF programs, since most HPF compilers for distributed memory machines now use MPI as the communications layer. Finally, there is HPFIT [40], an environment that includes an editor, a parser, a dependence analysis tool, and an optimization kernel. It is a framework into which other HPF tools can be plugged.

Debuggers

The earliest HPF debugger was Cheng and Hood's p2d2 (Portable Parallel/Distributed Debugger) [49]. p2d2 is a protocol-based debugger; essentially they defined a special protocol variant just for HPF. After Cheng left NAS, Hood continued development on the debugger, and PTOOLS even contemplated making it a standard.[*] The SC'94 paper said that the debugger would rely on the HPF processor for data distribution information. The authors acknowledged the potential usefulness of the defined HPF runtime, and would have a debugger ``shim'' linked in with the user's HPF program. As of February 1995, its functionality was not yet complete. However, in his 1996 paper [97], Hood mentions the HPF interface as a possibility, e.g., for determining an array mapping at runtime. Work to adapt it to pghpf was done at LANL, but as yet p2d2 is not in wide use for debugging HPF programs.

Both DEC and IBM have parallel debuggers, but they do not necessarily connect back to the HPF source. IBM's pdbx can be used on HPF programs, but in a limited way. DEC distributes a modified version of its Ladebug debugger for use with HPF [64]. DEC also had a fancier debugger (code-named Aardvark) in Beta testing (April 1996). This debugger, described in [119], has its own approach to handling control flow and a flexible data model.

Dolphin Interconnect Solutions has a debugger (originally BBN's [18]) that will work with the pghpf compiler, announced for release later in 1997. This debugger should be useful across a variety of platforms. It has a variety of control flow options, and can display distributed data. It even has a graphical display feature for 2-D and 3-D arrays.

NA Software is developing an HPF debugger called mdb, which works with their HPF FortranPlus compiler. It is HPF source level and can graphically display data distributions.

Performance Analysis  

HPF experience at Cornell shows that the tracing and debugging tools that work with the underlying message passing library (pdbx for all SP2-hosted libraries, vt for MPL and MPI, and at one time, nupshot for MPI) are excellent ways of finding out what each HPF processor is doing. Also, since xHPF and PGHPF [162,163,158] emit editable, compilable FORTRAN, it is possible to insert program markers and trace one's way through the calls to the runtime library. With compilers that do not emit editable, compilable code, this performance analysis option disappears. Therefore, specialized performance analysis tools will be all the more necessary for compiled code.

Profiling tools are very important. FORGE software comes with a parallel profiling tool, which breaks out time spent on communication and on computation at the loop level. This tool, called polytime, summarizes the results from a FORGE-instrumented parallel code and is extremely helpful. Code instrumentation can be done interactively via the FORGExplorer toolkit or at the command line by the xHPF compiler.

On Unix systems, prof and gprof can be used if the runtime can emit multiple files of per-node profile information, as is the case with IBM and DEC. IBM also has a graphical profiler (xprofiler) that can give a line-by-line profile of hot spots in your HPF program, which must have been compiled with IBM's own HPF compiler, xlhpf [101]. Traces from xlhpf programs can also be viewed by IBM's Visualization Tool (vt), with HPF source referback. The PGHPF compiler comes with its own profiler, called pgprof, which can also give a line-by-line profile. Atexpert is available for Cray machines' native HPF compilers.

Adve, Koelbel and Mellor-Crummey [8] explore the use of the Fortran D compiler in support of performance tuning. They discuss their experiences with the self-tuning model for pipelined computations. This paper is directed at compiler writers. The basic idea is that certain parameters (such as the time it takes to handle an interrupt) can be measured by instrumentation inserted by the compiler. The compiler can generate code to print out the mean and variation for a number of system statistics, while subtracting out the time spent in its own routines [9]. The insight is that compiler instrumentation can collect summary statistics far more concisely than system traces, and that detailed trace results are not always required for tuning.
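As a rough illustration of this summary-statistics idea, the following sketch (in Python rather than compiler-generated code, with invented names) shows the kind of constant-space accumulator an instrumenting compiler could attach to each program region, using Welford's algorithm for a running mean and variance:

```python
# Hypothetical sketch: a summary-statistics accumulator a compiler
# could insert around an instrumented region, instead of logging a
# full trace event per execution.  Names are illustrative only.
import math

class RegionStats:
    """Running mean/variance via Welford's algorithm: O(1) space
    per instrumented region, however many times it executes."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations

    def record(self, elapsed):
        self.n += 1
        delta = elapsed - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (elapsed - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

# e.g. timing one region over several iterations (stand-in values)
stats = RegionStats()
for t in [1.0, 1.2, 0.9, 1.1, 1.3]:
    stats.record(t)
print(stats.n, round(stats.mean, 3))   # 5 1.1
```

Only the few accumulated numbers per region need be reported at program exit, which is exactly the conciseness advantage over a full event trace.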

At the Performance and Debugging workshop on Cape Cod in October 1994 [*], several papers dealt with HPF performance analysis. Adve presented work done with Dan Reed and John Mellor-Crummey. Information can be collected from the Fortran D compiler in the form of SDDF files, so PABLO can be used to do performance analysis of HPF programs. Static records describe what the compiler knows about the program; in addition, dynamic records with pointers back to the static records are emitted at runtime by the instrumented program. Program instrumentation comes in two forms: instrumentation calls inserted directly into the source code at DO loops and procedure boundaries, and instrumented wrappers around the iPSC communications calls. Later they changed the Fortran 77D compiler to instrument the compiled code itself, as reported in their SC'95 paper [9].
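The two-level record scheme can be pictured with a small sketch; the field names here are illustrative and are not the actual SDDF schema:

```python
# Illustrative sketch (not the real SDDF schema): static records
# describe source constructs once, at compile time; compact dynamic
# records emitted at run time point back to them by id.
from dataclasses import dataclass

@dataclass
class StaticRecord:          # emitted once, by the compiler
    id: int
    construct: str           # e.g. "do_loop", "procedure"
    file: str
    line: int

@dataclass
class DynamicRecord:         # emitted at run time, per event
    static_id: int           # back-pointer to the static record
    processor: int
    start: float
    duration: float

statics = {1: StaticRecord(1, "do_loop", "solver.f", 120)}
events = [DynamicRecord(1, 0, 0.00, 0.41),
          DynamicRecord(1, 1, 0.02, 0.39)]

# An analysis tool joins the two streams to report in source terms:
for ev in events:
    s = statics[ev.static_id]
    print(f"{s.file}:{s.line} on proc {ev.processor}: {ev.duration}s")
```

The point of the back-pointer is that the per-event records stay small while the analysis tool can still present everything in terms of the original source, as LaFrance-Linden's requirement demands.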

Since then, the Pablo performance analyzer has been adapted to work with Fortran D and with PGHPF. The adapted visualizers are called ``dPablo'' and ``svPablo'', respectively.

Adve, Reed, and Mellor-Crummey strongly believe in generating editable and compilable code, and Adve admitted privately that they sometimes ``tweak'' this output to fix up things the compiler missed; but he stated that their compiler never emits unnecessary communications (which the APR preprocessor often did, before Version 2.0). Communications recognized by the Fortran D compiler include shifts, broadcasts, reductions (including vector reductions), and pipelining. The Fortran D editor is a slight extension of PED; in particular it says what kinds of communications will be generated (similar to DMP in FORGE 90). Performance data is then fed back into the editor, where run times are displayed on the source listing. All this research was finally ``productized'' in svPablo and may well be added to pgprof.

The pC++ project is also doing work on tools under the guidance of Allen Malony [131]. They stress the importance of designing the performance analysis environment along with the compiler. As of mid 1994, they had a profiling tool, a portable event trace capturing library, a source code instrumenter, and instrumented runtime system libraries. They have a trace conversion utility to go from pC++ tracefiles into SDDF and/or ALOG. The work shown by Dennis Gannon and posted by Allen Malony at the Cape Cod workshop looked really good, and was a persuasive case for compiler data bases.

The PALLAS toolset [117] consists of TotalView for debugging, vampir for tracing, and pghpf for compiling and profiling, making it quite a complete MPI/HPF toolset. It became available in 1996 for the IBM SP2 and other machines. At Cornell, we use vampir for tracing all HPF programs, whether compiled by xlhpf, pghpf, or xhpf, as well as for MPI programs. It is quite useful for (1) giving a breakdown of computation vs. communications and (2) breaking down user-defined states (if you instrument your program ``by hand'' first).

The Annai toolset comes with a Performance Monitor and Analyzer (PMA) that instruments programs and lets users select their analysis views. Certain regions can be instrumented, which is typical of instrumentation tools [186].

Finally there is a handy tool from NASA Ames called ntv which can be used to view Ames-generated traces or SP2-generated traces. Its strong advantage is its simplicity, and it also has source-level referback to xlhpf-compiled programs.

Performance Prediction

Closely related to Performance Analysis, this category of HPF tools attempts to predict the performance of an HPF program as one varies the target architecture, problem size, or number of processors. One of the goals of the HPF project is to provide reliable high performance across a spectrum of platforms.[*] (This is one area of HPF that clearly requires excellent benchmarking data.) Parashar et al. at NPAC designed an application development environment that includes a tool [151,152] that predicts performance on different architectures. This software uses an abstraction of the hardware and of the application and can respond to statistical queries or produce a PICL trace.

An earlier paper by Balasundaram, Fox, Kennedy and Kremer [16] proposed a static estimator to evaluate the relative efficiency of different data decomposition schemes given a Fortran D or message passing program and a distributed memory multiprocessor. A set of kernel routines ``train'' the estimator for each target machine. This work probably led to the self-tuning concept described earlier.

The Fortran D system includes the D Editor which promotes programming ease in HPF and/or Fortran D. It helps the user zoom in on portions of the program containing sequentialized code or expensive communication, and advises them how to proceed, much in the style of its PED predecessor [93]. In these cases, it's predicting poor performance based on the code the compiler was forced to generate.

Yet another approach to performance prediction is the ``program model'' approach [140,166]. Combining run-time statistics with the compiler analysis described in [9], performance scalability is predicted as a function of N (problem size) and P (number of processors). Each program phase gets its own parameterized execution model. This idea was later incorporated into research by LeBlanc, Poulos, and Meira [136] on a tool called ``Carnival''. The main difficulty with this approach is that it requires a lot of support from the compiler. The tool must also learn to parse calls to the runtime library in order to do its instrumentation.
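A minimal sketch of the program-model idea follows; the phase models and coefficients are invented purely for illustration (in the actual approach they come from compiler analysis combined with measured run-time statistics):

```python
# Sketch of the "program model" approach: each program phase gets a
# parameterized cost model in N and P, and their sum predicts how
# total run time scales.  Models and coefficients are invented here.

def phase_compute(N, P, a=1e-6):
    # perfectly parallel computation: work divides evenly over P
    return a * N / P

def phase_pipeline(N, P, b=1e-8):
    # a pipelined phase: cost grows with both pipeline length and P
    return b * (N + P)

def predict(N, P):
    return phase_compute(N, P) + phase_pipeline(N, P)

# predicted scalability for a fixed problem size as P grows
for P in (1, 4, 16, 64):
    print(P, round(predict(10**6, P), 4))
```

The per-phase decomposition is what lets such a tool report not just a total time but which phase stops scaling first as P grows.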

Automatic Data Distribution and Alignment

This is a current area of research. Ayguade and Garcia at Barcelona [14,76] have created a ParaScope-based tool which, given a Fortran 77 program, will emit directives for aligning the data. By 1997 [15], this research had yielded a tool called DDT that inputs Fortran 77 programs and generates HPF directives to align and distribute data. The tool attempts to achieve an optimal mapping by examining loops and code blocks in the program. This paper also includes a good review of the literature on automatic mapping.
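To make concrete what such generated directives decide, here is a sketch (in Python, for illustration) of the element-to-processor mappings implied by HPF's BLOCK and CYCLIC distributions of a one-dimensional array:

```python
# Sketch of the ownership functions implied by HPF distribution
# directives for a 1-D array of n elements over p processors.

def block_owner(i, n, p):
    """Owner of element i (0-based) under DISTRIBUTE (BLOCK):
    contiguous blocks of ceiling(n/p) elements per processor."""
    block = -(-n // p)             # ceiling(n / p)
    return i // block

def cyclic_owner(i, p):
    """Owner of element i under DISTRIBUTE (CYCLIC):
    elements dealt out round-robin."""
    return i % p

print([block_owner(i, 8, 4) for i in range(8)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, 4) for i in range(8)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```

An automatic distribution tool is, in effect, choosing among such mappings (and alignments between arrays) to minimize the communication its loop analysis predicts.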

Rice University's Fortran D system includes an automatic data partitioner. It uses the performance estimator and the compiler itself to select the best decomposition. This is similar to the work of Garcia and Ayguade, except that their tool looks at the source code itself to predict the best decomposition. Presumably, FORGE's Automatic Partitioning does the same thing; a comparative study should be made.

The data layout assistant tool described by Kennedy and Kremer [107] proposes a data layout for an F77 program or extends a partial layout to a full decomposition. It was developed in conjunction with the Fortran D project; as of 1995 it was very much a prototype tool (only 1D decomposition, single-procedure programs only, number of processors specified at compile time). However, Kremer is proposing to continue this research at Rutgers.

The pedLambda tool at Cornell [24] automatically reorganizes loop nests, in the presence of distribution directives, to optimize message traffic on a distributed memory machine, but it will not change data layout.

Orlando and Perego [147] present a template for run-time load balancing of loops with irregular iteration durations, dynamically moving work from busier processors to less busy ones. It is not clear, however, that this could be coded in HPF. (They used MPI.) Further, while the distribution of work is uneven, the distribution does not change during computation.

Program Data Visualization

Runtime data distribution, use, and change are valuable clues for analyzing the performance of a running HPF program. A number of projects are underway to accomplish visualization of distributed data. Many compilers come with tools to display such data at runtime: DEC Fortran 90, PGHPF, and SofTech have all shown such tools on the conference floor.

The PTOOLS DAQV project is an attempt to design a visualization tool that will have the same look and feel across HPF compilers, and across platforms. This tool was originally (April 96) designed to work with HPF on SGI Power series, using PGHPF. In May '96, it was ported to the IBM SP2 at Cornell. In late summer 1996, DAQV was ported to XL HPF. Thus the PGHPF reference implementation exists (on two machines) and an HPF reference exists (on two compilers). Using either of two different methods (``push'' or ``pull'') the programmer can arrange for array data to be dumped out at runtime. This can be visualized by other tools; a sample tool is included with DAQV, called the ``dandy'' DAQV.
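The ``pull'' idea can be sketched schematically; the function and names below are invented for illustration and are not the DAQV API:

```python
# Schematic sketch of pull-style access to a distributed array
# (names invented): the visualization side requests an array, and
# each process's local block is copied into its place in a
# reassembled global view.

def gather_block_1d(local_blocks):
    """local_blocks: list of (global_start, values), one per process,
    covering the array without overlap.  Returns the global array."""
    n = max(start + len(vals) for start, vals in local_blocks)
    out = [None] * n
    for start, vals in local_blocks:
        out[start:start + len(vals)] = vals
    return out

# e.g. a 6-element BLOCK-distributed array on 3 processes
blocks = [(0, [10, 11]), (2, [12, 13]), (4, [14, 15])]
print(gather_block_1d(blocks))   # [10, 11, 12, 13, 14, 15]
```

In the ``push'' variant the roles reverse: the program dumps the data at points of its own choosing rather than on demand from the tool.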

A very nice paper by Kimelman et al. [109] shows some possible ways of viewing HPF data structures at run time, similar to the DAQV project. It is not yet clear how to visualize HPF program execution but Kimelman and Zernik's tool shows where the data is moving, in terms of sub-arrays. Color is used to good effect.

The Graphical Data Distribution Tool (GDDT) [81,114] being developed at the University of Linz by Rainer Koppler and Siegfried Grabner is a prototype tool for visualizing distributed data structures. It, too, uses color to good effect. It supports both source-based and trace-based visualization of distributed data structures at the level of data-parallel programming languages. It also offers analysis of HPF-like source codes and small visual programming facilities, allowing interactive, graphical specification of distributed data structures and automatic source code generation. Finally, because run-time information is so important to visualize, the tool provides an open interface to external compilation systems and run-time environments; a few visualization tools are provided that utilize data gained from this interface. GDDT was originally developed to work with Vienna Fortran, so the current version supports this HPF predecessor. Since that language covers a superset of HPF's facilities, the tool may also be used for visualization of HPF data structures.

The Parallel Debugging Tool (PDT) [54] can show distributed data in two forms: the values in the global array and the distribution itself. VIA (from CST Images) can also display distributed data.


Donna Bergmark
2/18/1998