The observation commonly referred to as Amdahl's law was first formulated by Amdahl [12]. Asymptotic analysis of parallel algorithms is discussed in many computer science texts, such as those by Akl [8], Leighton [187], and Smith [267]. Cook [64] discusses problems for which no efficient parallel algorithms have been discovered.
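For reference, Amdahl's law states that if a fraction f of a computation is inherently sequential, then the speedup achievable on P processors satisfies

\[ S(P) \;\le\; \frac{1}{f + (1-f)/P} \;\le\; \frac{1}{f} . \]

Thus a sequential fraction of just 5 percent (f = 0.05) bounds speedup by 20, no matter how many processors are used.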
Many different approaches to performance modeling have been proposed, each appropriate for different purposes. See, for example, the papers by Culler et al. [67], Eager et al. [89], Flatt and Kennedy [97], Karp and Flatt [167], and Nussbaum and Agarwal [216]. Patel [224] discusses the modeling of shared-memory computers. The book by Kumar et al. [179] provides many example models and a more detailed treatment of the concept of isoefficiency. Gustafson et al. [129,130] introduce the concept of scaled speedup. Singh, Hennessy, and Gupta [259], Sun and Ni [274], and Worley [297,298] discuss various constraints on the scalability of parallel programs. Lai and Sahni [183] and Quinn and Deo [237] discuss speedup anomalies in search problems. Faber et al. [93] argue against the concept of superlinear speedup. Fromm et al. [115], Harrison and Patel [134], and Thomasian and Bay [284] use queuing models to study performance of parallel systems. Kleinrock [173] reviews techniques used for performance analysis of networks and discusses issues that arise in high-speed (gigabit/sec) WANs.
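Two of these concepts can be summarized briefly. Gustafson's scaled speedup holds execution time, rather than problem size, fixed as processors are added: if a fraction s of the scaled run time is serial, then

\[ S_{scaled}(P) \;=\; s + (1-s)P \;=\; P - (P-1)s , \]

which grows linearly in P. The isoefficiency function of Kumar et al. asks how fast the problem size W (measured as sequential execution time) must grow with P to hold efficiency E constant; writing T_o(W,P) for the total overhead, we have

\[ E \;=\; \frac{W}{W + T_o(W,P)} \quad\Longrightarrow\quad W \;=\; \frac{E}{1-E}\,T_o(W,P) . \]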
The chapter notes in Chapter 1 provide references on parallel computer architecture. Feng [94] provides a tutorial on interconnection networks. Hypercube networks have been used in a variety of multicomputers such as the Cosmic Cube [254], nCUBE-2 [212], Intel iPSC, and Thinking Machines CM-2 [281]. The Intel DELTA and Intel Paragon [276] use two-dimensional mesh networks. The Cray T3D and MIT J-Machine [72] use a three-dimensional torus. Adams, Agrawal, and Siegel [2] survey multistage interconnection networks, and Harrison [133] discusses the analytic modeling of these networks. Various forms of multistage network have been used in the BBN Butterfly [31], NYU Ultracomputer [123], IBM RP3 [226], and IBM SP [271]. The IBM SP uses a bidirectional multistage network constructed from 4×4 crossbars (a modified 4-ary n-fly) similar to that illustrated in Figure 3.18. Seitz [253,255] provides an introduction to multicomputers and their interconnection networks. Dally [69] discusses networks and the concept of bisection width, while Leighton [187] provides a more detailed and theoretical treatment. Dally and Seitz [70,71] discuss routing techniques. The material in Example 3.8 is based on work by Foster and Worley [110]. Ethernet was designed by Metcalfe and Boggs [205]; Shoch, Dalal, and Redell [257] describe its evolution.
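Recall that the bisection width of a network is the minimum number of links that must be cut to divide the network into two halves. For the P-node networks mentioned above (assuming square and cubic layouts and P a power of two),

\[ B_{hypercube} = \frac{P}{2}, \qquad B_{2D\,mesh} = \sqrt{P}, \qquad B_{3D\,torus} = 2P^{2/3} . \]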
Miller and Katz [208], Foster, Henderson, and Stevens [103], and Pool et al. [229] discuss the I/O requirements of scientific and engineering applications. Del Rosario and Choudhary [76] discuss problems and prospects for parallel I/O. Henderson, Nickless, and Stevens [145] discuss application I/O requirements and describe a flexible I/O architecture for parallel computers. Plank and Li [228] discuss checkpointing. Bordawekar, del Rosario, and Choudhary [41] explain the utility of a two-phase I/O strategy. A special issue of the Journal of Parallel and Distributed Computing [60] discusses various aspects of parallel I/O, as do Aggarwal and Vitter [4] and Katz, Gibson, and Patterson [168]. DeWitt and Gray [79] discuss parallel database machines. Gibson [120] examines the design and performance analysis of redundant disk arrays (RAID disks). Hennessy and Patterson [134] provide a good description of I/O system performance analysis and design.
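To make the two-phase idea concrete, here is a minimal sketch in C with MPI; the file name, element count, and the all-to-all redistribution pattern are illustrative assumptions, not details of the cited work. In phase one each process reads a single contiguous block of the file, giving the file system a sequential access pattern; in phase two the blocks are reshuffled in memory to match the decomposition the application actually requires.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        const long n = 1 << 20;      /* total doubles in file (assumed divisible by p*p) */
        long chunk = n / p;          /* contiguous block per process */
        double *block = malloc(chunk * sizeof(double));
        double *mine  = malloc(chunk * sizeof(double));

        /* Phase 1: each process reads one large contiguous block. */
        FILE *f = fopen("data.bin", "rb");   /* hypothetical input file */
        if (f == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fseek(f, rank * chunk * (long) sizeof(double), SEEK_SET);
        fread(block, sizeof(double), chunk, f);
        fclose(f);

        /* Phase 2: redistribute in memory so that each process ends up
           with the pieces its decomposition requires (here, an
           illustrative all-to-all exchange). */
        MPI_Alltoall(block, (int)(chunk / p), MPI_DOUBLE,
                     mine,  (int)(chunk / p), MPI_DOUBLE, MPI_COMM_WORLD);

        free(block);
        free(mine);
        MPI_Finalize();
        return 0;
    }

The point of the reorganization is that many small noncontiguous reads are replaced by a few large contiguous ones plus fast interprocessor communication.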
The parallel versions of Floyd's shortest-path algorithm [98] are due to Jenq and Sahni [158], while the parallel version of Dijkstra's single-source algorithm [80] is described by Paige and Kruskal [217]. Our analysis of these algorithms follows Kumar and Singh [182], who also present analyses that take into account bandwidth limitations on hypercube and two-dimensional mesh architectures. Bertsekas and Tsitsiklis [35] describe a pipelined variant of the Floyd 2 algorithm that improves performance by allowing iterations to proceed concurrently, subject only to dataflow constraints. Aho, Hopcroft, and Ullman [7] and Cormen, Leiserson, and Rivest [65] provide good introductions to sequential graph algorithms. Quinn and Deo [236] and Das, Deo, and Prasad [73,74] describe parallel graph algorithms. Ranka and Sahni's book on parallel algorithms for image processing and pattern recognition [238] includes relevant material.
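For concreteness, here is a minimal sequential sketch of Floyd's algorithm in C; the parallel versions analyzed in this chapter distribute the distance matrix by rows (Floyd 1) or by submatrices (Floyd 2) and broadcast the k th row and column at each iteration.

    #include <stdio.h>

    #define N   4
    #define INF 1000000      /* "no edge": large, but safe from overflow */

    /* Floyd's all-pairs shortest-path algorithm.  After step k,
       d[i][j] is the length of the shortest path from i to j whose
       intermediate vertices are drawn from {0, ..., k}. */
    void floyd(int d[N][N])
    {
        int i, j, k;
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    if (d[i][k] + d[k][j] < d[i][j])
                        d[i][j] = d[i][k] + d[k][j];
    }

    int main(void)
    {
        int i, j;
        int d[N][N] = { { 0,   5,   INF, 10  },
                        { INF, 0,   3,   INF },
                        { INF, INF, 0,   1   },
                        { INF, INF, INF, 0   } };
        floyd(d);
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++)
                printf("%8d", d[i][j]);
            printf("\n");
        }
        return 0;
    }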