Abstract. A hypercube parallel computer is a network of processors, each with only local memory, whose activities are coordinated by messages the processors send between themselves. The interconnection network corresponds to the edges of an $n$-dimensional cube with a processor at each vertex. This paper gives an overview of the hypercube architecture and its relation to other distributed-memory message-passing multiprocessors.

The computation of a crossproducts matrix is an important part of many applications in statistics and provides a simple yet interesting example of a hypercube algorithm. This example is presented to illustrate the concepts in programming a hypercube.

1. Introduction. Interest in parallel computing has been growing for many years because of its potential for very high performance. Hypercube designs were already discussed in the 1960s and 1970s (for example [11]), but building such machines was impractical because of their complexity and the large number of components they would require. Only after recent advances in VLSI technology did hypercube parallel computers become practical. Researchers at Caltech completed the first hypercube parallel computer (the 64-processor Cosmic Cube) in 1983 [10] and since then have successfully applied it and its successors to numerous scientific applications [5].

Commercial hypercubes started to appear in 1985 and are currently available from at least four vendors: the Intel iPSC with up to 128 processors, the Ametek System/14 with up to 256 processors, the NCUBE Corporation NCUBE/six-ten with up to 1024 processors, and the Floating Point Systems T Series with up to 16384 processors (the FPS design makes a 16K-processor machine possible, but large configurations have yet to be built). Hypercube overviews specific to some of these machines are given in [7] and [8]. A performance comparison of three hypercubes is given in [2]. At ORNL we have the Intel and NCUBE machines, so most of our experience is based on these two; however, this overview is intended to be more general. The hypercube is now perhaps the most successful large-scale parallel architecture, and the currently available machines have the potential of supercomputer or near-supercomputer performance at a relatively low cost.

The hypercube parallel architecture falls into the MIMD (Multiple Instruction Multiple Data) category of Flynn’s taxonomy [3]. From a user's point of view, there are two broad classes of MIMD multiprocessors, defined by the placement of memory and the mode of communication. In a shared-memory multiprocessor, the processors communicate by accessing a memory or a set of memories common to all processors over a switching network. In a distributed-memory multiprocessor, each processor has only local memory, and the processors communicate by sending messages between themselves over a communication network. Of course, systems with both local and shared memory are possible. The distinction between the two broad classes is not always clear-cut at the hardware level; it is the programming environment that is either shared-memory or distributed-memory message-passing. The hypercube is considered a distributed-memory message-passing multiprocessor. The communication network between processors is perhaps the most critical issue in the design of distributed-memory multiprocessors. A further breakdown of distributed-memory message-passing multiprocessors is thus based on the communication network topology, and one of these is the hypercube topology.

The remainder of this paper is organized as follows. Section 2 describes the hypercube architecture in detail and shows how other architectures can be embedded in the hypercube. The programming environment is presented in § 3 and an example application and its implementation are given in § 4.

2. The Hypercube Architecture. A hypercube parallel computer is a network of processors, each with only local memory, whose activities are coordinated by messages the processors send between themselves. A $d$-dimensional hypercube has $2^d$ processors. The communication network corresponds to the edges of a $d$-dimensional cube, where each vertex or node is a processor. Each node is uniquely identified by $d$ binary digits, and the binary tags of any two neighbors differ by exactly one bit. Thus, to communicate with each of its neighbors, each node has degree $d$ (that is, $d$ communication channels). To illustrate the interconnection scheme, consider the hypercubes of dimension 0 through 4 in Fig. 1.

A $d$-dimensional hypercube is obtained by duplicating a $(d-1)$-dimensional hypercube and connecting corresponding nodes. The binary tags are prefixed by a 0 in the original hypercube and by a 1 in the duplicate to obtain the new binary tags. Note that each bit in the tags is associated with a dimension of the hypercube. Flipping a given bit in the tags defines pairs of neighbors across a given dimension. For example, flipping the middle bit in the $d=3$ hypercube of Fig. 1 defines neighbors in the vertical dimension.
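The tag construction and the bit-flip rule for neighbors can be sketched in a few lines of Python. This is only an illustration: the function names `grow` and `neighbors` are hypothetical, and dimensions are numbered from 0 starting at the rightmost bit, which is an assumption rather than the numbering used in Fig. 1.

```python
def grow(tags):
    """Tags of a d-cube built from the tags of a (d-1)-cube.

    The original cube's tags are prefixed by 0, the duplicate's by 1.
    """
    return ["0" + t for t in tags] + ["1" + t for t in tags]


def neighbors(node, d):
    """The d neighbors of `node`: flip one tag bit per dimension."""
    return [node ^ (1 << k) for k in range(d)]


if __name__ == "__main__":
    tags = [""]                      # a 0-dimensional cube has a single node
    for _ in range(3):
        tags = grow(tags)
    print(tags)                      # the eight 3-bit tags
    print([format(m, "03b") for m in neighbors(0b010, 3)])
```

Each node ends up with exactly $d$ neighbors, and the neighbor relation is symmetric because flipping the same bit twice returns the original tag.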

Communication between neighbors is sent directly over the communication link between them. If a message needs to be sent between two nodes that are not neighbors, the message is routed through intermediate nodes until it reaches its destination. The routing algorithm is very simple. At each stage, the message is sent to a neighbor whose binary tag is one bit closer to the destination binary tag. The path length is thus the number of bit positions in which binary tags of the two nodes differ. Several routes are possible depending on the order in which the binary tag digits are considered. Often a communication coprocessor is also present on each node to free each main node processor of most of the communication overhead. In fact, messages that are only passing through may be processed entirely by the coprocessor.
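A minimal sketch of this routing rule follows, assuming the differing bits are corrected from the lowest dimension to the highest. That ordering is one of the several possible routes mentioned above, not necessarily the one used by any particular machine, and the function name `route` is hypothetical.

```python
def route(src, dst, d):
    """Nodes visited by a message travelling from src to dst in a d-cube."""
    path = [src]
    node = src
    for k in range(d):
        bit = 1 << k
        if (node ^ dst) & bit:       # tags still differ in dimension k
            node ^= bit              # hop to the neighbor across dimension k
            path.append(node)
    return path


if __name__ == "__main__":
    hops = route(0b000, 0b101, 3)
    print([format(m, "03b") for m in hops])
```

The number of hops is `len(path) - 1`, which equals the number of bit positions in which the two tags differ.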

The hypercube architecture has a good balance between node degree (number of channels per node) and diameter (maximum path length between any two nodes). Ideally, an interconnection scheme should have a small diameter for fast communication and nodes should have small degree, so that the scheme easily scales up to a large number of processors. Table 1 gives these two parameters for a number of interconnection schemes, including the hypercube, for $n$ processors. Both the tree and the hypercube have a good balance in terms of these two parameters. However, the tree is not a homogeneous structure and a communication bottleneck will occur at the root in many applications, because only a single path exists between any pair of nodes. In a hypercube, on the other hand, there are $d$ disjoint paths between any pair of nodes. This can be exploited to enhance communication bandwidth and fault tolerance.

Many interconnection schemes can be embedded in a hypercube. These include the ring, the 1-d, 2-d and 3-d mesh, as well as trees. Fig. 2 illustrates some of these interconnection schemes and Fig. 3 illustrates their embeddings in a hypercube. Most regular embeddings are accomplished by the use of Gray codes. These are sequences of $d$-bit strings such that successive strings differ by a single bit. Binary tags of neighbor nodes in a hypercube differ by a single bit, so Gray codes define paths in hypercubes. Those communication patterns that cannot be embedded in a hypercube, such as a complete graph, will still execute efficiently, because the longest path (diameter) in a hypercube with $2^d$ processors is only $d$ and the average path length is $d/2$.
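As an illustration of a Gray-code embedding, the sketch below maps a ring of $2^d$ positions onto hypercube nodes. It assumes the binary-reflected Gray code, which is one common choice; the text does not say which Gray code the embeddings of Fig. 3 use, and the function names are hypothetical.

```python
def gray(i):
    """Binary-reflected Gray code of the integer i."""
    return i ^ (i >> 1)


def ring_embedding(d):
    """Hypercube node that hosts each ring position 0 .. 2**d - 1."""
    return [gray(i) for i in range(2 ** d)]


if __name__ == "__main__":
    ring = ring_embedding(3)
    print([format(m, "03b") for m in ring])
```

Successive ring positions land on neighboring hypercube nodes, and so do the last and first positions, so the ring closes without any multi-hop link.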
The most popular method of programming a hypercube is to write a single program that runs asynchronously on each node, working on different data. Some applications can be split into homogeneous sections that naturally result in an identical program on each node. When this is not the case, it is still easier to write a single program whose function differs depending on the node where it is running than to write 64 or 128 different programs. The best way to illustrate this is by an example, which is given in the next section.

Debugging facilities are improving, but still have a long way to go. For example, processor lights on the Intel iPSC indicate whether processors are busy, idle, or communicating. This is an ironic throwback to the early days of computing, when it was useful to watch computer console lights to tell what was going on. Similar information can now be obtained more conveniently from post-processors of system logs. For example, SEEcube, developed at Tufts University, gives a color graphics slow-motion replay of a parallel program execution from the system log produced during execution. Among other information, it displays communication patterns and busy-idle states of processors, and it currently works on the Intel and NCUBE hypercubes. Such displays are useful not only in debugging but also for improving the efficiency of parallel programs. Much current research in computer science is concentrating on programming aids for parallel processing. Some real-time debuggers have already been developed and will likely become available soon. Hypercube simulators (for example [11]), however, are still the most used debugging tools, because they run in familiar sequential environments and allow the use of simple debugging techniques such as inserting print statements.

Programming a hypercube is becoming easier because of several factors. First, many bugs present in early versions of operating systems and compilers have been corrected and more efficient versions have been developed. Second, more debugging aids are becoming available. Third, libraries of communication subroutines, matrix operations and factorizations, and other useful subroutines are being developed and some are already available. Also, an ever increasing number of researchers, particularly within computer science, are becoming familiar with programming hypercube multiprocessors. Thus the researcher in statistical computing has a better chance of finding a knowledgeable source of help.

4. An Example. The computation of a crossproducts matrix is an important part of many methods in statistics and provides a simple yet interesting implementation on a hypercube. Suppose we have an \( n \times p \) matrix \( X \) and we wish to compute the \( p \times p \) crossproducts matrix \( A = X'X \). If \( \gamma \) is the time for a multiply and an add operation, then the computation of \( A \), which is symmetric so that only one triangle is needed, requires approximately \( \frac{1}{2} n p^2 \gamma \) time on a single processor.

Note that the computation of \( A \) can be divided into a sum of \( r \) crossproduct computations by partitioning \( X \) into \( r \) blocks of rows

\[
X = \begin{bmatrix}
X_1 \\
X_2 \\
\vdots \\
X_r
\end{bmatrix}
\]

and computing \( A = \sum_{i=1}^{r} A_i \), where \( A_i = X_i' X_i \). Given a $d$-dimensional hypercube with \( r = 2^d \) processors, we can compute all \( A_i \)'s concurrently. This takes approximately \( \frac{1}{2} \frac{n p^2}{r} \gamma \) time. We have thus reduced the time \( r \)-fold, but we are not done because the \( A_i \)'s need to be summed.

FIG. 4. Computation of \( A = X'X \) on an \( 8 \)-processor hypercube, with final result on processor 0.

FIG. 5. Communication pattern for Fig. 4.
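The row-block decomposition can be checked on a single machine with a short sketch. Plain Python lists stand in for matrices, the helper names `crossprod`, `add` and `blocked_crossprod` are hypothetical, and the number of blocks is assumed to divide the number of rows evenly.

```python
def crossprod(X):
    """X'X for a matrix stored as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]


def add(A, B):
    """Elementwise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def blocked_crossprod(X, r):
    """Sum of the crossproducts of r consecutive row blocks of X."""
    size = len(X) // r               # assumes r divides the number of rows
    total = None
    for i in range(r):
        Ai = crossprod(X[i * size:(i + 1) * size])
        total = Ai if total is None else add(total, Ai)
    return total


if __name__ == "__main__":
    X = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(blocked_crossprod(X, 2) == crossprod(X))
```

On a hypercube each block crossproduct would be computed by a different node; here the loop over blocks only demonstrates that the partial results add up to the full crossproducts matrix.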

The fastest way to compute a sum in parallel is by pairwise summation. Incidentally, pairwise summation has better numerical properties than simple sequential summation. Fig. 4 illustrates how this may be done on a \( d=3 \) hypercube. Each column lists the matrices computed by a given processor and arrows indicate where a partial sum is communicated to another processor. If no matrix or communication is listed for a given time step, the processor is idle. Column headings give the processor tag with its binary representation, and the time required for each step is given on the right. A communication step takes approximately \( \alpha + \frac{1}{2} p^2 \beta \) time, where \( \alpha \) is the communication startup time and \( \beta \) is the communication time per matrix element. Each sum then takes approximately \( \frac{1}{2} p^2 \gamma \) time. There are exactly \( d = \log_{2} r \) communication-summation steps and the final cross-products matrix \( A \) ends up on processor 0. Note that communication is always between neighbors (binary tags differ by one bit only). Fig. 5 illustrates this communication pattern.
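The pairwise summation onto processor 0 can be simulated as follows. This is a sketch under stated assumptions: scalars stand in for the partial matrices, the function name `fan_in_sum` is hypothetical, and the pairing order (lowest dimension first) is one possible schedule that may differ from the one drawn in Fig. 4.

```python
def fan_in_sum(values):
    """Sum one value per node onto node 0; values[i] is held by node i.

    Returns the total and the list of (step, sender, receiver) messages.
    """
    r = len(values)
    d = r.bit_length() - 1           # assumes r is a power of two
    partial = list(values)
    schedule = []
    for k in range(d):
        bit = 1 << k
        for node in range(r):
            # Senders at step k have bit k set and all lower bits clear.
            if node % bit == 0 and node & bit:
                dest = node ^ bit    # neighbor across dimension k
                partial[dest] += partial[node]
                schedule.append((k, node, dest))
    return partial[0], schedule


if __name__ == "__main__":
    total, schedule = fan_in_sum([1, 2, 3, 4, 5, 6, 7, 8])
    print(total)
    for step in schedule:
        print(step)
```

With 8 nodes the schedule contains 4, then 2, then 1 messages over 3 steps, and every message travels between nodes whose tags differ in a single bit.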

In some applications it may be desirable to have the cross-products matrix on every node for further computation. Examples include the orthogonal decomposition discussed in [12] and any application requiring regression computations with many response variables, such as bootstrapping, where each node processor would perform regressions based on the same cross-products matrix. Putting the cross-products matrix on each node can be done in exactly the same amount of time as putting a single cross-products matrix on node 0, simply by keeping all processors busy during the summing process. This is illustrated in Figs. 6 and 7. Note that all communication is still only between neighbors. This communication pattern is known as the exchange algorithm, since we exchange data once over each dimension of the hypercube. The exchange algorithm arises in many applications. For example, the Fast Fourier Transform algorithm also requires this communication pattern, but in reverse order.

The total time for this cross product computation is

\[
\frac{1}{2} \frac{n p^2}{r} \gamma + \left(\alpha + \frac{1}{2} p^2 \beta + \frac{1}{2} p^2 \gamma\right) \log_{2} r ,
\]

where \( r = 2^d \) is the number of processors. The first part of the expression (local cross product computation) is inversely proportional to \( r \) and the second part (summation) increases logarithmically with \( r \). Thus for a given matrix, there is a maximum number of processors that can be used before the total time begins to increase due to the summation process.
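The trade-off between the two terms can be explored numerically. The sketch below takes the local term as $\frac{1}{2} n p^2 \gamma / r$ and each communication-summation step as $\alpha + \frac{1}{2} p^2 \beta + \frac{1}{2} p^2 \gamma$. The function names are hypothetical, and the parameter values in the usage example are made up for illustration; they are not measurements from any machine.

```python
def total_time(n, p, d, alpha, beta, gamma):
    """Modelled time on r = 2**d processors."""
    r = 2 ** d
    local = 0.5 * n * p * p * gamma / r                   # local crossproduct
    step = alpha + 0.5 * p * p * beta + 0.5 * p * p * gamma
    return local + step * d                               # d = log2(r) steps


def best_dimension(n, p, alpha, beta, gamma, max_d=10):
    """Smallest cube dimension that minimizes the modelled time."""
    return min(range(max_d + 1),
               key=lambda d: total_time(n, p, d, alpha, beta, gamma))


if __name__ == "__main__":
    # Hypothetical parameters in arbitrary time units.
    alpha, beta, gamma = 50.0, 2.0, 1.0
    for d in range(8):
        print(d, total_time(1600, 100, d, alpha, beta, gamma))
    print("best d:", best_dimension(1600, 100, alpha, beta, gamma))
```

Under this model the best dimension grows with $n$, because a larger local term can absorb more communication-summation steps before the total starts to rise.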

One measure of parallel algorithm performance is speedup, the execution time on a single processor divided by the execution time on \( r \) processors. Speedup for the above computation is given by

$$\frac{r}{1 + \frac{r \log_2 r}{n}\,\delta},$$

where $\delta = \frac{2\alpha}{p^2 \gamma} + \frac{\beta}{\gamma} + 1$. As the number of rows, $n$, increases, the speedup approaches $r$. The number of columns, $p$, affects only $\delta$, which becomes roughly constant for large values of $p$. Thus speedup of the crossproducts algorithm is affected mostly by $n$ and very little by $p$.
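For readers who want the intermediate steps, the speedup can be derived from the time estimates of this section. The symbols $T_1$ and $T_r$ are introduced here only as labels for the single-processor and $r$-processor times; they do not appear elsewhere in the text. Dividing numerator and denominator by $\frac{1}{2} n p^2 \gamma / r$ gives

\[
\frac{T_1}{T_r}
= \frac{\frac{1}{2} n p^2 \gamma}{\frac{1}{2} \frac{n p^2}{r} \gamma + \left(\alpha + \frac{1}{2} p^2 \beta + \frac{1}{2} p^2 \gamma\right) \log_2 r}
= \frac{r}{1 + \frac{r \log_2 r}{n} \cdot \frac{\alpha + \frac{1}{2} p^2 \beta + \frac{1}{2} p^2 \gamma}{\frac{1}{2} p^2 \gamma}}
= \frac{r}{1 + \frac{r \log_2 r}{n} \left( \frac{2\alpha}{p^2 \gamma} + \frac{\beta}{\gamma} + 1 \right)} .
\]

The factor in parentheses depends on $p$ only through the startup term $2\alpha / (p^2 \gamma)$, which vanishes as $p$ grows, while the whole correction term shrinks as $1/n$.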

This algorithm runs an identical program on each node. We start with one partition $X_i$ of $X$ on each node. How $X_i$ is provided to each node differs between machines. For example, on the Intel machine, the host communicates each partition directly to each node via a direct communication link. Provided that the $X_i$'s have been distributed to the appropriate nodes, the following node program will compute $A$ on each node.

1. $A = X_i' X_i$
2. for $k = 1$ to $d$ do:
   3. send $A$ over dimension $k$
   4. receive $B$ over dimension $k$
   5. $A = A + B$

This is a sequential program that executes asynchronously on each node. Synchronization is achieved by the send and receive. Both have to be completed before the addition in step 5 takes place, because we need $B$ for the addition and we need to send $A$ before $B$ is added to it.
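The node program can be simulated on one machine to confirm that every node finishes with the full sum. In this sketch scalars stand in for the local matrices, the function name `exchange_sum` is hypothetical, and the simultaneous send and receive of steps 3 and 4 is modelled by reading all partners' values before any addition takes place.

```python
def exchange_sum(values):
    """Simulate the exchange algorithm; values[i] is the local A on node i."""
    r = len(values)
    d = r.bit_length() - 1                     # assumes r is a power of two
    a = list(values)
    for k in range(1, d + 1):
        bit = 1 << (k - 1)
        # Steps 3 and 4: every node sends A to, and receives B from,
        # its neighbor across dimension k.
        b = [a[node ^ bit] for node in range(r)]
        # Step 5: the addition uses only values from before the exchange.
        a = [a[node] + b[node] for node in range(r)]
    return a


if __name__ == "__main__":
    print(exchange_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Building the list `b` before updating `a` plays the role of the synchronization described above: no node's $A$ is modified until it has been sent.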

A FORTRAN implementation of this program was written for the Intel iPSC hypercube. A host program (not shown here) sends one partition $X_i$ of $X$ to each node of the hypercube. The node program that runs on each processor is given below.

```fortran
      integer ci, p, d, copen, npar(3)
      real S(60000)
c     open communication channel
      ci = copen(0)
c     wait to receive local parameters and matrix from host
      call recvw ( ci , 0 , npar , 12 , ln , nhost , idp )
      n = npar(1)
      p = npar(2)
      d = npar(3)
c     length of the packed upper triangle of A
      lenmat = p*(p+1)/2
c     X_i is stored in S right after A
      mx = lenmat + 1
      call recvw ( ci , 0 , S(mx) , p*n*4 , ln , nhost , idp )
c     compute local crossproduct
      call crossp ( S(1) , S(mx) , n , p )
```

Both $A$ and $X_i$ are stored in the array $S$ starting in locations 1 and $mx$, respectively. The storage locations for $X_i$ are used for $B$ once $X_i$ is no longer needed. Both $A$ and $B$ are symmetric, so only their upper triangle is stored. The subroutines `crossp` and `add` are the usual sequential routines that compute the crossproduct of a matrix and add two matrices, respectively. The `recvw` routine is a FORTRAN-callable communication routine of the iPSC. The `irecvw` and `isendw` routines are FORTRAN subroutines distributed by Intel for communicating long messages on the iPSC. The second parameter in these routines is the message type, which is used to select incoming messages. That is, any incoming message is stored by the node operating system, but a message is read by the program only when that type of message is requested. The function `mynode()` returns the tag of the node where the program is executing. The exclusive-or function `ieor`, applied to `mynode()` and $2^{k-1}$, flips the $k$-th bit of `mynode()`, giving the tag of the dimension-$k$ neighbor. Synchronization is achieved by the communication routines, which halt the program until a message is sent or until a message of the appropriate type is received.
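Storing only the upper triangle of a symmetric $p \times p$ matrix needs $p(p+1)/2$ locations, which is the `lenmat` value in the node program. The sketch below shows one way to index such packed storage. The column-by-column layout and the 0-based indices are assumptions made for illustration; the text does not state which layout `crossp` and `add` use, and the function names here are hypothetical.

```python
def packed_length(p):
    """Number of stored elements for a symmetric p-by-p matrix."""
    return p * (p + 1) // 2


def packed_index(i, j):
    """0-based position of element (i, j) in column-by-column packed storage."""
    if i > j:
        i, j = j, i                  # (i, j) and (j, i) share one location
    return i + j * (j + 1) // 2


def pack(A):
    """Copy the upper triangle of a symmetric matrix into a flat list."""
    p = len(A)
    S = [0] * packed_length(p)
    for j in range(p):
        for i in range(j + 1):
            S[packed_index(i, j)] = A[i][j]
    return S


if __name__ == "__main__":
    print(pack([[84, 100], [100, 120]]))
```

With this layout, adding two packed matrices is a single loop over `packed_length(p)` elements, about half the work of adding the full matrices.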

Timing the crossproducts program for various matrix sizes and numbers of processors on the Intel iPSC hypercube produced Table 2. The resulting speedup graphs are given in Fig. 8. As expected, Fig. 8 shows that speedup depends mostly on $n$ and very little on $p$. Both Table 2 and Fig. 8 show that the use of more than approximately $n/10$ processors produces little additional reduction in execution time; thus efficient use of a hypercube requires that the application be "large enough". This behavior is typical of many parallel algorithms.

TABLE 2

<table>
<thead>
<tr>
<th>matrix size</th>
<th>120</th>
<th>200</th>
<th>300</th>
<th>400</th>
<th>500</th>
<th>600</th>
<th>800</th>
<th>1000</th>
<th>1600</th>
<th>3200</th>
<th>6400</th>
</tr>
</thead>
<tbody>
<tr>
<td>$r$</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td>100</td>
<td>200</td>
<td>400</td>
<td>800</td>
</tr>
<tr>
<td>$p$</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
<td>100</td>
</tr>
<tr>
<td>execution time in seconds for $r$ processors</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Many computationally intensive methods in statistics can potentially benefit from parallel computation. The crossproduct computation was selected only to illustrate how to program a hypercube. The flexibility of a hypercube makes it ideal for testing parallel algorithms. Perhaps the greatest advantage of the hypercube parallel architecture is that it has been commercially available for almost two years and that a large amount of effort has already been invested in developing algorithms and subroutine libraries such as those discussed in [4], [6] and [9].

Acknowledgments. I am grateful to several members of the Computer Science group in our Mathematical Sciences Section of the Oak Ridge National Laboratory for numerous discussions and helpful comments on this paper. This work was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400.

REFERENCES


