Recall that in Section 2.4.1 we developed a parallel algorithm to sum P values distributed among P tasks. This algorithm is essentially Algorithm 11.1 with an addition operator used as OP. That is, the algorithm maintains a partial sum as the local state in each node, and in each step accumulates a partial sum received from another node into this partial sum. After log P steps, the sum of the P input values is available in every node.
This same algorithm can be used to perform a reduction using any commutative, associative operator, such as multiplication or maximum: that operator simply replaces addition as OP in Algorithm 11.1. The algorithm can also be used to implement a barrier operation, which synchronizes the tasks that execute it. In this case, the values communicated are simply null tokens, and the operation performed on each pair of incoming messages is a synchronization operation that waits for the two tokens to be available.
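To make the communication pattern concrete, here is a minimal sketch of Algorithm 11.1's hypercube exchange, written in C with MPI rather than the book's pseudocode. It specializes OP to addition of a single value per task and assumes that the number of tasks P is a power of two; the variable names are illustrative, not taken from the text.

#include <mpi.h>
#include <stdio.h>

/* Illustrative sketch (not from the text): Algorithm 11.1's hypercube
 * exchange, with OP specialized to addition of one value per task.
 * Assumes the number of tasks P is a power of two.                     */
int main(int argc, char **argv)
{
    int rank, P;
    double state, incoming;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    state = rank + 1.0;                    /* this task's input value    */

    /* log P steps: in step i, exchange partial results with the task
     * whose number differs in bit i, then combine with OP (addition).  */
    for (int mask = 1; mask < P; mask <<= 1) {
        MPI_Sendrecv(&state, 1, MPI_DOUBLE, rank ^ mask, 0,
                     &incoming, 1, MPI_DOUBLE, rank ^ mask, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        state += incoming;                 /* OP = addition              */
    }

    /* after log P steps every task holds the sum 1 + 2 + ... + P       */
    printf("task %d: sum = %g\n", rank, state);
    MPI_Finalize();
    return 0;
}

Replacing the addition with a maximum, a product, or a wait on two null tokens gives the other reductions and the barrier described above.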
Figure 11.1: Using the hypercube algorithm to reduce four vectors of length N=4 distributed among four tasks. The computation is performed in log P = 2 steps, with each task in each step exchanging N data values with a neighbor and performing N combine operations. The labels in the boxes denote the origin of the values that they contain; hence, 0.1 and 2.3 represent intermediate results obtained when contributions from tasks 0 and 1, or 2 and 3, are combined. R represents the final reduced values.
In the related vector reduction problem, each of P tasks supplies a vector of N values, and N separate reductions are performed to produce a vector of N results. As illustrated in Figure 11.1, these N reductions can be achieved in log P steps by using Algorithm 11.1. The operator OP is defined as follows: it takes two vectors of N values as input and applies the commutative, associative operator elementwise, N times in all, to produce a vector of N results. The per-processor cost of this simple exchange algorithm is

T_exchange = log P ( t_s + t_w N + t_op N ),

where t_op is the cost of applying the reduction operator and t_s and t_w are, as usual, the message startup and per-word transfer costs. This algorithm is efficient for small N, when message startup costs dominate. However, for larger N it is inefficient, since it performs many redundant operations.
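To see the redundancy concretely: each task performs N combine operations in each of the log P steps, so the P tasks together perform P N log P operations, whereas only N(P-1) combine operations are strictly needed to reduce P vectors of length N. For the P=4, N=4 example of Figure 11.1, that is 32 operations where 12 would suffice.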
An alternative recursive halving algorithm uses the same hypercube communication structure but applies a divide-and-conquer technique to reduce message volume (Figure 11.2). In effect, Algorithm 11.1 is applied twice. In the reduction phase, each processor communicates (and combines) N/2 data in the first stage, half as much (N/4) in the second, and so on, so that each processor communicates a total of N(P-1)/P data in log P steps. The reduction is then complete, and the vector of N reduced values is evenly distributed over the P processors. This process is reversed (without the reductions) to broadcast the result. Communication cost is

T_halving = 2 log P t_s + 2 (P-1)/P N t_w .
Figure 11.2: Using the recursive halving algorithm to reduce four vectors of length N=4 distributed over four tasks. In the first two stages, values are combined to compute the N reduced values, represented as R; these values are distributed over the four tasks. In the third and fourth stages, the process is reversed in order to broadcast the values.
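The recursive halving algorithm can be sketched in the same style. This is again an illustrative sketch rather than code from the text: it assumes P is a power of two, N is divisible by P, OP is addition, and tmp is a caller-supplied scratch buffer with room for at least N/2 values.

#include <mpi.h>

/* Illustrative sketch of the recursive halving reduction.  Assumes P is
 * a power of two, N is divisible by P, and OP is addition; tmp must
 * hold at least N/2 values.                                             */
void halving_reduce(double *buf, double *tmp, int N, int rank, int P)
{
    int count = N, offset = 0;

    /* Stages 1..log P: each task sends the half of its current segment
     * that it will not keep, receives the matching half from its
     * neighbor across one hypercube dimension, and combines it.         */
    for (int mask = P / 2; mask >= 1; mask >>= 1) {
        int partner = rank ^ mask;
        count /= 2;
        int keep = (rank & mask) ? offset + count : offset;
        int give = (rank & mask) ? offset : offset + count;
        MPI_Sendrecv(buf + give, count, MPI_DOUBLE, partner, 0,
                     tmp, count, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++)
            buf[keep + i] += tmp[i];       /* OP = addition              */
        offset = keep;
    }
    /* buf[offset .. offset+count) now holds this task's N/P fully
     * reduced values: R is distributed evenly over the P tasks.          */

    /* Stages log P+1 .. 2 log P: reverse the pattern (no combining) so
     * that every task ends up with the complete vector of N results.     */
    for (int mask = 1; mask < P; mask <<= 1) {
        int partner = rank ^ mask;
        int recv_off = (rank & mask) ? offset - count : offset + count;
        MPI_Sendrecv(buf + offset, count, MPI_DOUBLE, partner, 0,
                     buf + recv_off, count, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank & mask) offset -= count;
        count *= 2;
    }
}

The first loop corresponds to the first two stages of Figure 11.2, and the second loop to the third and fourth stages.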
The recursive halving algorithm sends twice as many messages as the simpler algorithm does, but less data. It also performs less computation. Hence it will be more efficient for certain values of N and P and on certain machines. A robust hybrid algorithm can be designed that starts with the recursive halving approach and switches to an exchange algorithm after a certain number of stages so as to avoid some of the broadcast communication.
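Using the approximate costs above and ignoring the computation term, we can estimate where the crossover lies: recursive halving communicates less in total when

( log P - 2 (P-1)/P ) N t_w > log P t_s ,

that is, for sufficiently large N whenever P > 2, while for small N the extra log P message startups favor the simple exchange algorithm.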
We can use similar techniques to define an efficient vector broadcast algorithm. Here, the problem is to replicate N values located in a single task (the "root") in each of P-1 other tasks. A simple algorithm uses the binary tree communication structure illustrated in Figure 2.8. The root task first sends the data to two other tasks; each of these tasks forwards the data to two other tasks, and so on, until the data are completely distributed. Total cost is approximately

T_tree = 2 log P ( t_s + t_w N ),

since the data must traverse roughly log P levels of the tree and each task sends the complete set of N values to each of its two children in turn.
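One concrete layout of the tree, sketched again in C with MPI under illustrative naming (not code from the text), puts the children of task i at tasks 2i+1 and 2i+2, so that task 0 is the root and the tree has depth of about log P.

#include <mpi.h>

/* Illustrative sketch of the binary tree broadcast: task 0 is the root;
 * task i receives from task (i-1)/2 and forwards to tasks 2i+1, 2i+2.   */
void tree_bcast(double *buf, int N, int rank, int P)
{
    if (rank != 0)
        MPI_Recv(buf, N, MPI_DOUBLE, (rank - 1) / 2, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    for (int child = 2 * rank + 1; child <= 2 * rank + 2; child++)
        if (child < P)
            MPI_Send(buf, N, MPI_DOUBLE, child, 0, MPI_COMM_WORLD);
}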
This algorithm is efficient for small N and P. For larger problems and processor configurations, it has the disadvantage that most processors are idle most of the time and the total time is dominated by the t_w N log P term. In these situations, it can be more efficient to break the message into pieces and then to route these pieces separately by using the hypercube communication structure. Communication costs are then approximately as follows (the chapter notes provide pointers to descriptions of this algorithm):

T_hypercube = 2 log P t_s + 2 (P-1)/P N t_w .
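One way to realize this idea, sketched under the same illustrative assumptions as before (C with MPI, task 0 as the root, P a power of two, N divisible by P) and not necessarily the exact algorithm the chapter notes describe, is to scatter the N/P-sized pieces down the hypercube and then recombine them with the same recursive doubling loop used in the reduction sketch.

#include <mpi.h>

/* Illustrative sketch: broadcast by splitting the N values into P pieces,
 * scattering them over the hypercube, and then recombining them so that
 * every task holds all N values.  Task 0 is assumed to be the root.      */
void hypercube_bcast(double *buf, int N, int rank, int P)
{
    int chunk = N / P;
    int count = N, offset = 0;

    /* Scatter: at the step for dimension mask, every task that already
     * holds a segment passes the upper half of it across that dimension. */
    for (int mask = P / 2; mask >= 1; mask >>= 1) {
        count /= 2;
        if ((rank & (2 * mask - 1)) == 0)
            MPI_Send(buf + offset + count, count, MPI_DOUBLE,
                     rank ^ mask, 0, MPI_COMM_WORLD);
        else if ((rank & (2 * mask - 1)) == mask) {
            offset = rank * chunk;        /* where this task's piece lives */
            MPI_Recv(buf + offset, count, MPI_DOUBLE,
                     rank ^ mask, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    /* All-gather: recursive doubling, as in the reduction sketch.         */
    for (int mask = 1; mask < P; mask <<= 1) {
        int partner = rank ^ mask;
        int recv_off = (rank & mask) ? offset - count : offset + count;
        MPI_Sendrecv(buf + offset, count, MPI_DOUBLE, partner, 0,
                     buf + recv_off, count, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank & mask) offset -= count;
        count *= 2;
    }
}

The critical path then comprises 2 log P messages carrying a total of about 2 N (P-1)/P data, which is the cost quoted above.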