## Title

Determining the idle time of a tiling
Permalink
https://escholarship.org/uc/item/6tn8504v

## Authors

Hogstedt, Karin
Carter, Larry
Ferrante, Jean
Publication Date
1999-07-06
Peer reviewed

# Determining the Idle Time of a Tiling 

Karin Högstedt, Larry Carter, Jeanne Ferrante ${ }^{\dagger}$<br>Department of Computer Science and Engineering, UCSD<br>9500 Gilman Drive, La Jolla, CA 92093-0114


#### Abstract

This paper investigates the idle time associated with a parallel computation, that is, the time that processors are idle because they are either waiting for data from other processors or waiting to synchronize with other processors. We study doubly-nested loops corresponding to parallelogram- or trapezoidal-shaped iteration spaces that have been parallelized by the wellknown tiling transformation. We introduce the notion of rise $r$, which relates the shape of the iteration space to that of the tiles. For parallelogram- shaped iteration spaces, we show that when $r \leq-2$, the idle time is linear in $P$, the number of processors, but when $r \geq-1$, it is quadratic in $P$. In the context of hierarchical tiling, where multiple levels of tiling are used, a good choice of rise can lead to less idle time and better performance. While idle time is not the only cost that should be considered in evaluating a tiling strategy, current architectural trends (of deeper memory hierarchies and multiple levels of parallelism) suggest it has increasing importance.


## 1 Introduction

Tiling is a well-known transformation that has been used for parallelism and to obtain better locality in the memory hierarchy.

Tiling starts with the iteration space determined by a perfectly nested loop in the program, and partitions the iteration space into tiles of chosen size and shape. The iterations represented by each tile are executed in proximity: on a given processor, all the iterations of one tile are executed before any iterations of the next tile.

In the parallel context, different cost functions have been used to evaluate different tiling choices. For example, [RS91] minimizes communication costs; [WL91] maximizes the granularity of parallelism. None of this previous work considers minimizing idle time, that is,

[^0]the time a processor is idle because it is waiting for data from another processor or to synchronize with other processors.

Current architectural trends toward multiple levels of parallelism and deeper memory hierarchies warrant a closer look at idle time. Handling these multiple levels entails recursively partitioning tiles into subtiles multiple times; at each subdivision, the number of subtiles per tile is roughly the ratio of the larger to the smaller memory capacity. Because of the trend towards more levels, this ratio is decreasing. Thus the number of subtiles per tile is decreasing. This in turn means the idle time can be a larger fraction of the total and should not be ignored.

In hierarchical tiling [CFH95a, CFH95b], tiling is done for each level of memory hierarchy and parallelism. Typically, tiling for a given level corresponds to introducing one to three levels of loop nesting. This partitions the iteration space of the larger level into smaller tiles. Thus, the tiles at a given level become the iteration space at the next level, which is itself tiled. The tiles are typically parallelograms, triangles, or trapezoids (or their higher-dimensional equivalents), and their size and shape is chosen by the tiling transformation. This gives rise to a need to examine the relationship of the size and shape of the iteration space and tiles to the overall execution costs, and particularly to the idle time.

In this paper, we define the notion of rise, which relates the shape of the iteration space to that of the tiles. We show that the idle time depends on the rise, and that large values of rise lead to larger idle time.

In hierarchical tiling, choosing the tile shape and size for one level affects the possible values of the rise at the next finer level of granularity. The interaction of the choices at different levels must guide the best overall choice. To evaluate the global solution it is important to have a reasonably accurate and efficiently computed estimation of idle time. In the paper, we

- Give simple formulas that bound the idle time of two-dimensional tilings of parallelogram-shaped iteration spaces. When the rise $r$ is greater than or equal to -1 , the idle time is roughly proportional to the square of the number processors $P$. The case of $r \leq-2$ is more favorable, with the idle time being smaller and proportional to $P$. These formulas hold for both block and block cyclic distribution of
iterations to processors. (We don't have a simple formula for the case of $-2<r<-1$ ).
- Show our formula is exact when $r \geq-1$ for "sufficiently high" iteration spaces.
- Derive bounds for the more general trapezoidal- or triangular-shaped iteration spaces.
- Illustrate that a good choice of $r$ can lead to better performance.

To illustrate the idea behind the rise, consider the three tilings of the same iteration space in Figure 5. In this example, all tiles have identical height, width, and communication times, but different values of rise. The result is vastly different parallel execution times due to different idle times. The best execution times corresponds to the tiling with negative rise, and the worst to positive rise.

Because a given tiling choice determines the rise, we can use our formula to advantage for multi-level tiling. In particular, at a level with small number of processors, $P$ (most notably $P=1$ ), we can with only a small penalty choose a tiling which slopes downward with respect to the dependences, even though this results in a (locally unfavorable) positive rise. Since this downwardly sloping tile is itself the iteration space for the next level of tiling, it will allow a negative rise for its subtiles, resulting in reduced idle time for this level. This is illustrated in Figure 7.

## 2 Related Work

Iteration space tiling is an important and well-known technique [JM86, GJG88, IT88, W87, RAP87, W89, HA90, L90, LRW91, RS91, WL91, WL91b, B92, KM92] used previously to achieve both parallelism and locality for a given loop nest. Tiling partitions the iteration space of a loop nest into uniform tiles of a given size and shape (except at the boundaries of the iteration space) which tessellate the iteration space. Tiling can often be achieved by first strip mining and then interchanging loops [W89].

Unimodular transformations [IT88, B92, WL91b] are often used before tiling, for instance, to change the original loop order. In two dimensions, unimodular transformations can skew the iteration space so that only rectangular (as opposed to parallelogram-shaped) tiles need be considered. Our approach considers parallelepiped shapes (the generalization of 2-dimensional parallelograms).

The cost measures used by previous work on tiling for parallelism all differ from idle time. For instance, [RS91] attempts to minimize communication time, and [WL91] considers maximizing the granularity of parallel loops. [KP93] shows how to estimate performance factors such as parallelism granularity and number of cache misses for transformations such as tiling. It does not explicitly consider the number of wasted cycles.

## 3 Nomenclature and Definitions

### 3.1 Tilings, Tiles and Iteration Spaces

In this paper, we give some non-standard but roughly equivalent definitions to the optimizing compiler literature [IT88]. In our formulation, tiles are subsets of $\Re^{K}, K$-dimensional real space. This allows us to use the tile's volume as a measure of execution time, and thereby to avoid unimportant complications having to do with there being a variable number of lattice points per tile, depending on the alignment of the tile with respect to the lattice points. Furthermore, it is easier to make geometric arguments about the volume of partial tiles than to compute the number of iterations involved.

Definition 1 Let $\vec{n}$ be a vector in $\Re^{K}$. The $\mathbf{j}$-th halfopen hyperswath with normal $\vec{n}$, denoted $\boldsymbol{H}_{\boldsymbol{j}}(\vec{n})$, is the set of points

$$
\left\{\vec{x} \left\lvert\, j-1 \leq \frac{\vec{x} \cdot \vec{n}}{\vec{n} \cdot \vec{n}}<j\right.\right\} .
$$

Thus, $H_{j}(\vec{n})$ is bounded by two hyperplanes orthogonal to $\vec{n}$.

Observe that the family $\mathcal{F}(\vec{n})$ of half-open hyperswaths $H_{j}(\vec{n})$ for integral $j$ partitions $\Re^{K}$ into disjoint pieces.

Definition 2 Let $\vec{n}_{1}, \vec{n}_{2}, \ldots, \vec{n}_{K}$ be a set linearly independent vectors in $\Re^{K}$. A tiling $\mathcal{T}$ is the $K$ families of hyperswaths $\mathcal{F}\left(\vec{n}_{i}\right), 1 \leq i \leq K$.

In the standard approach to tiling [W87, W89], the set of iterations that a nested loop executes, and the data dependences between them, are represented by an iteration space graph (ISG). For the purposes of this paper, the iteration space is simply a subset of $\Re^{K}$, typically one bounded by a polygonal region. The intuition behind our representation is that the lattice points of the iteration space (those points with all coordinates integral) are the iterations of the ISG. However, none of our theorems are concerned with these lattice points; instead in our model, the amount of computation is the volume of the iteration space.

We assume the program's data dependences are identical at each iteration, and can be modeled by a finite set of dependence vectors in $\Re^{K}$. These vectors provide constraints on the allowable order of execution. Furthermore, they dictate where communication is needed.

Definition 3 Given an iteration space I, a tiling
$\mathcal{F}\left(\vec{n}_{1}\right), \ldots, \mathcal{F}\left(\vec{n}_{K}\right)$, and a sequence of $K$ integers
$i_{1}, i_{2}, \ldots, i_{K}$, the tile $\boldsymbol{T}_{\boldsymbol{i}_{1}, i_{2}, \ldots, i_{K}}$ is $I \cap H_{i_{1}}\left(\vec{n}_{1}\right) \cap \ldots \cap$ $H_{i_{K}}\left(\vec{n}_{K}\right)$.

When $H_{i_{1}}\left(\vec{n}_{1}\right) \cap \ldots \cap H_{i_{K}}\left(\vec{n}_{K}\right)$ is a subset of $I$, then $T_{i_{1}, i_{2}, \ldots, i_{K}}$ is called a full tile, otherwise it is either a partial or empty (or non-existing) tile.

Figure 1 shows a tile in 2-dimensional space.
The full tiles are identical parallelepipeds, and it is (relatively) easy to generate code that iterates through the lattice points contained in the tile. Parallelepipedshaped tiles are sufficiently general to implement the tiling methodologies presented in [IT88] and [WL91b],


Figure 1: A tile which is the intersection of the iteration space (shown as a rectangle) and half-open hyperswaths with normals given by $\vec{n}_{1}$ and $\vec{n}_{2}$. The lattice points correspond to iterations; those included in the tile are shown as filled circles. Note that lattice points on the upper and the right-hand lines are not included in the tile.
which consider unimodular transformations with skewing. Some papers give a more general definition of a tiling, including hexagonal-shaped tiles [RAP87] or tiles of varying sizes. In particular, Flynn Hummel [H93] uses tiles of decreasing sizes to improve load balancing.

Given a tiling, there are certain dependences between tiles inherited from the program. We say there is a tile dependence from tile $T_{1}$ to tile $T_{2}$ if there is a path of dependences from an iteration in $T_{1}$ to an iteration in $T_{2}$.

The data input to and output from the iterations represented by a tile, referred to as the surface of the tile, may need to be moved to or from the processor executing the tile. The volume of the tile intuitively represents the computational work of the tile. It is desirable to have a small surface to volume ratio so that the computation time dominates the communication overhead, and so that data movement can be overlapped with computation.

In the remainder of this paper we will consider in detail a common instance of tiling. We assume the original program had the form:

```
for jj = 1 to n
    for kk = a1*jj+a2 to a1*jj+a2+m
        ... loop body ...
```

where $\mathrm{n}, \mathrm{m}, \mathrm{a} 1$, and a 2 are constants.
Note that the iteration space is a two-dimensional parallelogram. Later in the paper we will also consider triangular and trapezoidal iteration spaces, which arise when the lines representing the upper and lower bound on the second loop index (kk) have different slopes.

A tiling of the parallelogram will partition it into tiles $\left\{T_{j, k}\right\}$, which are smaller parallelograms (plus some partial tiles). It is well known that to ensure a legal tiling (one that allows tiles to be executed atomically in an order that satisfies the dependences), it is sufficient that $\vec{n} \cdot \vec{d} \geq 0$ for all hyperswath normals $\vec{n}$ and dependence vectors $\vec{d}\left[\right.$ IT88]. ${ }^{1}$ Pictorially, a tiling is certain to be legal if the lower left-hand angle of a tile includes all dependence vectors. We will investigate the desirability of using tiles with wider angles. To rule out the uninteresting case of "embarrassingly parallel"

[^1]problems, we assume there is a tile dependences from $T_{j-1, k}$ to $T_{j, k}$.

In our pictures (such as Figure 2), we draw the kk axis vertically. We consider tiling where the first family of hyperswaths are parallel to this axis. Let $w$ be the width of these swaths. Each swath will be assigned to a processor. We consider two common ways of doing this assignment: a block distribution, where there are exactly $P$ vertical swaths (one for each processor), and a block cyclic distribution, where there are $b P$ swaths assigned $P$ processors in a cyclic fashion (that is, swath $j$ is assigned to processor $j$ modulo $P$ ).

Each swath is partitioned into tiles via a second family of swaths, each intersecting the first family $h$ apart as shown in Figure 2. Each full tile has area $h w$. In our figures, the vertical columns appear as a stack of tiles.


Figure 2: A typical picture of a 2-dimensional iteration space. The regions between the vertical hyperplanes, are called stacks of tiles, and are all of width $w$. The distance between the intersections of the crossing hyperplanes and the vertical hyperplanes is called the height $h$ of a tile. The area of all full tiles is equal to $h w$. The arrows in the picture corresponds to either the execution order or to tile dependences between the tiles. Each tile is always dependent on the tile to its left, and we execute the tiles in a stack from bottom to top.

Both the block distribution and the block cyclic distribution are very advantageous scheduling algorithms. They reduce the number of memory accesses needed, since consecutive tiles executed on a processor share a surface (which increases locality), and limit the interprocessor communication to the dependences that cross the vertical cuts. They also allow pipelined execution of tiles, and are furthermore simple to implement.

### 3.2 Idle time

The cost function for a tiling we use in this paper is the total number of idle processor cycles that occur during the execution of a program on $P$ processors. This idle time can occur for two different reasons: (a) communication idle time when a processor is waiting for data from another processor, or (b) synchronization idle time where it has finished all of the tiles assigned to it, and is waiting at a barrier for all other processors to finish their tiles.

We model communication by assuming that a processor sends its output surface as soon as the sender has
finished computing it, and the data is queued by the receiving processor. This strategy is desirable (though not optimal [VSM96]) in that the receiver will never unnecessarily wait for data, and also the sender can occupy itself with useful work while it itself is waiting for data. The time spent both sending and receiving the data are considered to be necessary work and not counted as idle time.

Idle time of type (a) occurs when a sender has not initiated sending the data early enough to eliminate all waiting time by the receiver. In a typical computer system, in order to achieve the most efficient communication, it is necessary for the sender to initiate the send operation some number $L$ of cycles before the receiver has completed the previous tile. We will call $L$ the lead time of the communication system. One can think of $L$ as the amount of time it takes the first bit of a message to travel from the sender to the receiver, but in fact, there are many more complicated factors (such as operating system overheads and the ability of a system to overlap communication with computation) that contribute to $L$, and $L$ is usually a function of the message length.

We define the parameter $\boldsymbol{c}=L / h w$. Thus, $c$ is the lead time amortized over the computation of a full tile. Generally $c$ is between 0 and 1 , and we will assume ${ }^{2}$ that $c>0$.

Another delicate modeling question concerns the lead time needed to prevent idle cycles when the data corresponds to a partial tile. In this paper, we make the simplifying assumption that the lead time needed for a partial tile is proportional to the height of the tile's output surface (i.e., to the amount of data being transferred). Although beyond the scope of this paper, modifying this assumption would have little effect on our results.

Our assumption that computation time is proportional to the area of the tile is also inaccurate for several reasons: first, it doesn't accurately account for the communication overheads of partial tiles; second, there may be more computation overhead for the irregularly shaped partial tiles; third, the first and last stack of tiles have reduced communication costs; and fourth, there may be varying numbers of iterations in a given area, depending on the alignment of lattice points. Nevertheless our approximation is relatively accurate for large tiles.

Idle time is not the only cost that must be considered in evaluating a choice of tile size and shape. Other important factors are the total number of messages, and the total size of all the messages. These costs have been considered by many other papers that use a latencybandwidth model of communication cost. But for fixed width $w$ and height $h$ tiles, both the number of messages and their total length will be constant. This allows us to isolate the effect of tile shape (as given by the direction of the second family of swaths) on idle time.

### 3.3 Notation

We now define the notation that will be used in the rest of the paper.

[^2]The set of tiles is $\left\{T_{j, k} \mid 1 \leq j \leq b P\right\}$. The first non-empty tile in the first stack is numbered $T_{1,1}$. Depending on the relationship between the slope of the iteration space and the slope of the tiles, the lowest non-empty tile of other stacks might have any integer as its second subscript. Let $l_{\boldsymbol{j}}$ be the smallest integer such that $T_{j, l_{j}}$ is non-empty.

For each $(j, k)$ pair, we define $\boldsymbol{S}_{\boldsymbol{j}, \boldsymbol{k}}$ to be the union of of the tiles $T_{j, m}$ with $m \leq k$, that is, all tiles up to the $k$-th in the $j$-th stack.

Let $\boldsymbol{A}_{\boldsymbol{T}_{\boldsymbol{j}, \boldsymbol{k}}}$ denote the area, which represents the execution time, of tile $T_{j, k}$. If $T_{j, k}$ is a full tile of width $w$ and height $h$, then $A_{T_{j, k}}=h w$.

Similarly, we define $\boldsymbol{A}_{\boldsymbol{S}_{\boldsymbol{j}, \boldsymbol{k}}}$ to be the area of $S_{j, k}$, and $\boldsymbol{A}_{S_{j}}$ to be the total area of the $j$-th stack.
$\boldsymbol{F}_{\boldsymbol{T}_{\boldsymbol{j}, k}}$ will denote the finishing time of tile $T_{j, k}$, that is, the elapsed time from the start of execution of $T_{1,1}$ to the end of execution of tile $T_{j, k}$ (including sending its output surface) by the processor assigned to it.

Let $\boldsymbol{y}_{\boldsymbol{T}_{j, k}}$ denote the height of the right-hand output surface of $T_{j, k}$. As described in the discussion of lead time, we assume that tile $T_{j+1, k}$ cannot begin execution until $c w y_{T_{j, k}}$ after tile $T_{j, k}$ is completed.

Tile $T_{j, k}$ must wait for tile $T_{j-1, k}$ to its left, and tile $T_{j, k-1}$ below. Any other tile dependences will be from tiles that finished earlier than these two, and so will not further delay the finishing time of $T_{j, k}$. This gives the following formula:

## Formula 1

$$
\begin{aligned}
& F_{T_{j, k}}=A_{T_{j, k}}+\max \left(F_{T_{j, k-1}}, F_{T_{j-1, k}}+c w y_{T_{j-1, k}}\right) \\
& F_{T_{j, k}}=0 \quad \text { if } j<1 \text { or } k<l_{j}
\end{aligned}
$$

The next observation follows from the monotonicity of plus and max.

Observation 1 Decreasing the finishing times of any number of tiles can never increase the finishing time of any other tile.

The execution time of an application, $\boldsymbol{E}$, is the maximum finishing time of all tiles.

The idle time of processor $p_{j}$, denoted $\boldsymbol{I}_{\boldsymbol{p}_{j}}$, is $E \mathrm{mi}-$ nus the the sum of $A_{T_{m, k}}$ for all tiles $T_{m, k}$ that are executed on processor $p_{j}$. If $I_{p_{1}}=\ldots=I_{p_{P}}$ we refer to the processor idle time by $\boldsymbol{I}_{\boldsymbol{p}}$. The definition of idle time includes both types of idle time, the communication idle time $\boldsymbol{I}_{\boldsymbol{p}_{j}}^{c}$ that $p_{j}$ spends waiting for data from other processors, and the synchronization idle time $\boldsymbol{I}_{\boldsymbol{p}_{\boldsymbol{j}}}^{\boldsymbol{s}}$ after it has finished its last tile and before the application has finished.

It will turn out that when a processor is executing a stack of tiles, it may have idle time at the lowest tiles in the stack and additional idle time for the highest tiles in the stack, but if the stack is sufficiently high (in a technical sense defined later), the tiles in between will be executed in a "wavefront" with no additional idle time. We will write $\boldsymbol{I}_{\boldsymbol{p}_{\boldsymbol{j}}}^{\boldsymbol{c}}=\boldsymbol{I}_{\boldsymbol{p}_{\boldsymbol{j}} \text { bottom }}^{\boldsymbol{c}}+\boldsymbol{I}_{\boldsymbol{p}_{\boldsymbol{j} \boldsymbol{t o p}}}^{\boldsymbol{c}}$ to differentiate the communication idle time incurred by processor $p_{j}$ at the stacks' bottoms from that at the stacks' tops.

Definition 4 The idle time of an application, $\boldsymbol{I}_{\boldsymbol{a}}$, is the sum of the idle times of the individual processors.


Figure 3: a) Illustration of the rise $r_{b}=\frac{w}{h}\left(s_{1}-s_{2}\right)$, where $s_{1}$ is the slope of the iteration space and $s_{2}$ is the slope of the tile. Geometrically $r_{b}$ is the number of tile heights that the iteration space boundary rises in the width of one tile. The rise turns out to greatly affect the processor idle time. b) $r_{b}=0$. c) $r_{b}=1.5$. d) $r_{b}=-1$.


Figure 4: Rise $r_{t}$ at the top of the iteration space.

| $P=2$ | Execution Time |
| :--- | :---: |
| $r=1$ | $(14+5 c) h w$ |
| $r=0$ | $\max (13+c, 9+5 c) h w$ |
| $r=-1$ | $\max (12+c, 4+4 c) h w$ |


| $P=3$ | Execution Time |
| :--- | :---: |
| $r=1$ | $(14+5 c) h w$ |
| $r=0$ | $\max (10+2 c, 9+5 c) h w$ |
| $r=-1$ | $\max (8+2 c, 4+4 c) h w$ |


| $P=6$ | Execution Time |
| :--- | :---: |
| $r=1$ | $(14+5 c) h w$ |
| $r=0$ | $(9+5 c) h w$ |
| $r=-1$ | $(4+4 c) h w$ |

Table 1: Total execution times for varying values of $r$. The rise affects the execution time up to a factor of 3 , depending on the values of $c(c h w$ is the finishing time of one full surface of length $h)$ and $P$ (the number of processors).

An important concept relating the iteration space to a tiling is the rise.

Definition 5 The rise $\boldsymbol{r}$ is $\frac{w}{h}\left(s_{1}-s_{2}\right)$, where $w$ and $h$ are the dimensions of the tile, and $s_{1}$ and $s_{2}$ are the


Figure 5: This picture illustrates the importance of choosing a tiling with a good value of $r$. The numbers inside the tiles are the finishing times for a normalized area $=h w=1$ and $P=6$. The iteration space and tiles in all three examples have the same area. Each tile depends on the tiles below it and to the left of it. In (a) $r=1$, (b) $r=0$, and (c) $r=-1$. Table 1 has the execution times for other values of $P$.
slopes of the iteration space and the tile respectively. If the top and bottom slope of an iteration space are not the same, we distinguish between the bottom rise $\boldsymbol{r}_{\boldsymbol{b}}$ (Figure 3) and the top rise $\boldsymbol{r}_{\boldsymbol{t}}$ (Figure 4).

Figures 3 and 4 gives some examples of different values of rise. Geometrically $r$ is the number of tile heights that the iteration space boundary rises in the width of one tile.

The following two observations, easy to see geometrically, are central to our proofs.

Observation 2 When $r_{b} \geq-1, A_{T_{j, k}} \leq A_{T_{j-1, k+1}}$
Observation 3 When $r_{b} \leq-1, A_{T_{j, k}} \geq A_{T_{j-1, k+1}}$

## 4 Preview of Results

It turns out that the execution time varies significantly with the rise, as illustrated in Figure 5(a), (b), and (c), where the rise is equal to 1,0 , and -1 respectively. The number written inside each tile is the finishing time, given at least six processors. Table 1 contains the execution times for 2,3 , and 6 processors.

The first processor in Figure 5(a) for example, can execute the five tiles in its first stack (two half tiles, and three whole) in 4 time steps. The second processor on the other hand, has to wait for the first processor, both before starting executing its first tile and also later, before executing its second tile, in order to receive the output values sent from the first processor.

For typical values of $c$ in our example with $P=6$, the total execution time for the tiling in Figure 5(a) is

| Infinite iteration spaces |  |
| :--- | :--- |
| $r_{b} \geq-1$ | $I_{p_{j} \text { bottom }}^{c}=(j-1)\left(1+r_{b}+c\right) h w$ |
| $r_{b} \leq-2$ | $I_{p_{j o t t o m}}^{c} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)$ |


| Parallelogram-shaped iteration spaces |  |
| :--- | :--- |
| Block distribution |  |
| $r \geq-1$ | $I_{a}=P(P-1)(1+r+c) h w$ |
| $r \leq-2$ | $I_{a} \leq \max \left(\left(c-\frac{1-r^{2}}{2 r}\right) P h w, 0\right)-\frac{r P h w}{2}$ |


| Parallelogram-shaped iteration spaces |  |
| :--- | :--- |
| Block cyclic distribution |  |
| $r \geq-1\left(A_{S_{j}} \geq P h w(1+r+c)\right)$ | $I_{a}=P(P-1)(1+r+c) h w$ |
| $r \leq-2$ | $I_{a} \leq \max \left(\left(c-\frac{1}{2 r}\right) b P h w, \frac{-r b P h w}{2}\right)$ |


|  |  |
| :--- | :--- |
| Trapezoidal iteration spaces $r_{b}<r_{t}$ |  |
| Block distribution |  |


| Trapezoidal iteration spaces $r_{t}<r_{b}$ <br> Block distribution |  |
| :--- | :--- |
| $-1 \leq r_{t}<r_{b}$ | $I_{a}=P(P-1)\left(1+\frac{r_{t}+r_{b}}{2}+c\right) h w$ |
| $-(1+c) \leq r_{t}<-1 \leq r_{b}$ | $I_{a} \leq P(P-1)\left(1+\frac{r_{t}+r_{b}}{2}+c\right) h w-\frac{r_{t} P h w}{2}$ |
| $r_{t} \leq-(1+c)<-1 \leq r_{b}$ | $I_{a} \leq-P(P-1) \frac{r_{t}-r_{b}}{2} h w-\frac{r_{t} P h w}{2}$ |
| $r_{t}<r_{b} \leq-2$ | $\left.I_{a} \leq \max \left(c-\frac{1-r^{2}}{2 r}\right) P h w, 0\right)-P(P-1) \frac{r_{t}-r_{b}}{2} h w-\frac{r_{t} P h w}{2}$ |

Table 2: Summary of results.
about three times longer than for the tiling shown in Figure 5(c). This is because in $5(\mathrm{a})$, each processor needs to wait while two tiles are executed at the previous processor, whereas in 5 (c) all processors can start at the same time. Furthermore we see in Figure 5(b) that a tiling using rectangular tiles $(r=0)$ results in an execution time that is about twice as long as when $r=-1$.

Table 2 is a summary of the results that will be presented in the rest of the paper. We look at five tiling scenarios as given by the table headings. The case of "infinite iteration space" examines the idle time $I_{p_{j}}^{c}{ }_{\text {bottom }}$ that occurs at the bottom of $P$ sufficiently tall stacks. It will be shown that, although there may be idle time for several tiles at the bottom of each stack, once a processor reaches a row that contains only full tiles, it will incur no additional idle time (until the stack top is encountered). The next two cases are for parallelogramshaped iteration spaces with block and block cyclic distributions. The last cases examine trapezoidal (which includes triangular) iteration spaces with block distribution, with different cases depending on whether successive stacks are becoming taller or shorter.

Each of these cases are further divided into subcases, depending on the rise(s). For parallelograms, the reason there are two intervals of the rise is that when $r_{b} \leq$ -2 , the only possible communication idle time for a stack of tiles is waiting for the first two messages at the bottom of the stack, whereas in the case when $r_{b} \geq-1$, messages never arrive ahead of when they are needed.

When $-2<r_{b}<-1$, neither statement is necessarily true, but depend on the where the bottom edge of the iteration space intersects the stack. This complicates the formulas considerably and we will not cover this interval of $r_{b}$ in this paper.

It is convenient, because of this different behavior in the two cases, to define a constant $v$ that in some cases gives the slope of the "wavefront" of parallel execution of tiles.

Definition 6 We define $\boldsymbol{v}$ to $b e-(1+c) \frac{h}{w}$ when $r_{b} \geq$ -1 and $v=r_{b} \frac{h}{w}$ when $r_{b} \leq-2$.

Section 5 examines infinitely high iteration spaces, parallelograms are covered in Section 6.1, and trapezoids and triangles are covered in Section 6.2. Because of space limitations, only the proofs of the first lemmas are included here. The full set of proofs can be found in the technical report [HCF96].

## 5 Infinite iteration spaces

### 5.1 Total processor idle time for $r_{b} \geq-1$

Assume that we extend the iteration space to infinity towards the top edge of the original iteration space. We will start by obtaining a formula for the finishing time for any tile $T_{j, k}$ in a row where all tiles to its left are full tiles.

Since it isn't well defined what a block cyclic distribution for an infinite iteration space is, we are only covering block distribution in this section.

Lemma 1 When $r_{b} \geq-1$ and $T_{j, k}$ is a tile in row $k$ (for $j, k \geq 1$ ) where all tiles $T_{l, k}$ are full (for $1 \leq l \leq j$ ), then

$$
F_{T_{j, k}}=A_{S_{1, k}}+(j-1)(c+1) h w
$$

Proof: See appendix.
The lemma shows that after the $k$-th row, the finishing time of the tiles in the $j$-th stack are a constant larger than the corresponding area of the first stack. Since all tiles are full above the $k$-th row, this means both the work and the finishing time of tiles in the $j$ th stack are increasing by $h w$. Thus no additional idle time is introduced above the $k$-th row. This justifies the earlier discussion of the definition of $I_{p_{j} \text { bottom }}^{c}$ where "sufficiently high" means "above row $k$ ". We have:

Corollary 1 When $r_{b} \geq-1$,

$$
I_{p_{j} \text { bottom }}^{c}=(j-1)\left(1+r_{b}+c\right) h w
$$

Proof: See appendix.

### 5.2 Total processor idle time for $r_{b} \leq-2$



Figure 6: Iteration space for $r_{b} \leq-2$. The shaded area correspond to the computation that can be executed immediately without waiting for data from another processor.

When the rise is less than -2 we see in Figure 6 that all processors can start executing at the same time; not only their first tile but also the second and, depending on the rise, maybe more. (The shaded areas in Figure 6 correspond to these tiles.) This changes the formula considerably, since now more of the time that otherwise would be spent waiting for earlier processors is being overlapped by computation. We will show that the total idle time for an arbitrary stack up to and including its first full tile is given by the following formula in an infinite iteration space.

$$
I_{p_{j b o t t o m}}^{c} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)
$$

Lemma 2 When $r_{b} \leq-2, j \geq 2$, and $T_{j, k}$ is a full tile, then

$$
F_{T_{j, k}} \leq A_{S_{j, k}}+\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)
$$

Proof: See appendix.
Corollary 2 When $r_{b} \leq-2$ and $j \geq 2$,

$$
I_{p_{j \text { bottom }}}^{c} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)
$$

Proof: See appendix.
It can be shown that this approximation is off by at most $\frac{h w}{4}$.

## 6 Finite iteration spaces

We have so far derived formulas for the idle time in infinite iteration spaces, but we need now to derive similar formulas for finite iteration spaces. Removing computation from the infinite iteration space to make it finite will in some cases introduce extra idle time at the top of each stack.

The idle time that occur at the top of the $l$-th stack in an iteration space can occur for two reasons. The processor executing the $l$-th stack might be idle while it still have computation left to do in the $l$-th stack. We call this idle time $I_{l \text { top }}^{c}$. If $l$ is the last stack to be executed by processor $p_{j}$, then there might also be idle time due to synchronization after $l$ has been executed. The following two lemmas state the formulas for these two idle times. (The proofs for these lemmas, and all subsequent lemmas and theorems can be found in [HCF96].)
Lemma 3

$$
I_{l t o p}^{c} \begin{cases}=0 & \left(\text { for } r_{t} \geq-1\right) \\ \leq \frac{-r_{t} h w}{2} & \left(\text { for } r_{t}<-1\right)\end{cases}
$$

## Lemma 4

$$
I_{p_{j}}^{s}= \begin{cases}h w\left(r_{t}-v \frac{w}{h}\right)(P-j) & \left(\text { for } r_{t} \geq v \frac{w}{h}\right) \\ -h w\left(r_{t}-v \frac{w}{h}\right)(j-1) & \left(\text { for } r_{t} \leq v \frac{w}{h}\right)\end{cases}
$$

where $r_{t}$ is the top rise and $v$ is the wavefront constant.
We are now ready to derive formulas for the idle time. These formulas all come from the definition of $I_{p_{j}}=I_{p_{j} \text { botom }}^{c}+I_{p_{j} \text { top }}^{c}+I_{p_{j}}^{s}$, where $I_{p_{j} \text { top }}^{c}=\sum_{l=j \bmod P} I_{l_{\text {top }}}^{c}$. The formulas concerning block distribution are then all obtained straightforwardly using lemmas 3,4 and Corollary 1 or 2 . In the case of block cyclic distribution, $I_{p_{j} \text { bottom }}^{c}$ might also differ from the formulas in Corollary 1 and 2 (see proofs of Lemma 7 and 8).

### 6.1 Parallelogram shaped iteration spaces

When the iteration space is a parallelogram, all stacks have the same height, and assuming that they also have the same width, all processors will therefore perform the same amount of computation. According to the definition of processor idle time we can conclude that $I_{p_{1}}=\ldots=I_{p_{P}}=I_{p}$. The total idle time, $I_{a}$, therefore becomes equal to $P I_{p}$.

### 6.1.1 Block Distribution for $r \geq-1$

We claim that the idle time for the processor executing the $j$-th stack is given by the following lemma.

Lemma 5 Let $r \geq-1$ and assume that there exists a row $k$, such that for all stacks, there is a full tile in $k .{ }^{3}$ The total processor idle time for the processor executing the $j$-th stack is then given by the following formula.

$$
I_{p}=(P-1) h w(1+r+c)
$$

where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r$ is the rise.

The next theorem follows from this lemma using the fact that the idle time of an application is the sum of the idle times of the individual processors.

Theorem 1 Let $r \geq-1$. If there are $P$ stacks distributed one per processor, and there exists a row which includes full tiles in every stack, then the total idle time will be

$$
I_{a}=P(P-1) h w(1+r+c)
$$

where $h$ is the height and $w$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r$ is the rise.

### 6.1.2 Block Distribution for $r \leq-2$

The total idle time for an arbitrary stack is given by the following lemma.

Lemma 6 If $r \leq-2$ and each processor only executes one stack, then the following formula is an upper bound on the total idle time for one processor.

$$
I_{p} \leq \max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r h w}{2}\right)
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h$, and $r$ is the rise.

As before, this lemma gives the following theorem.
Theorem 2 If $r \leq-2$ and there are $P$ stacks, then

$$
I_{a} \leq \max \left(\left(c-\frac{1}{2 r}\right) P h w, \frac{-r P h w}{2}\right)
$$

is an upper bound on the total idle time.

[^3]
### 6.1.3 Block Cyclic Distribution for $r \geq-1$

In the case of block cyclic distribution, we assume there are $P$ processors executing $b P$ stacks, such that $p_{j}$ executes stack numbers $j \bmod P$. As before we assume the stacks to be "sufficiently high" (there is at least one row of full tiles), but in the block cyclic case we also need the assumption that they have an area of at least $P h w(1+r+c)$. This last assumption turns out to be sufficient to assure that a processor can start executing the next stack immediately after the finishing its current stack.

We claim that the idle time is given by the following lemma.

Lemma 7 Let $r \geq-1$ and $p_{j}$ be the processor that executes stack numbers $j, j+P, j+2 P, \ldots, j+(b-1) P$. Let all stacks have an area of at least Phw $(1+r+c)$, and let there exist a row in which all tiles are full.

$$
I_{p}=(P-1)(1+r+c) h w
$$

Theorem 3 If $r \geq-1$ and there are $b P$ stacks as described in Lemma 7, then the total idle time is given by

$$
I_{a}=P(P-1)(1+r+c) h w
$$

### 6.1.4 Block Cyclic Distribution for $r \leq-2$

When $r \leq-2$ we do not need to make any assumptions on the height of the stacks.
Lemma 8 Let $r \leq-2$ and $p_{j}$ be the processor that executes stack numbers $j, j+P, j+2 P, \ldots, j+(b-1) P$.

$$
I_{p} \leq \max \left(\left(c-\frac{1}{2 r}\right) b h w, \frac{-r b h w}{2}\right)
$$

Theorem 4 If $r \leq-2$ and there are $b P$ stacks, then the total idle time $\bar{i}$ s given by

$$
I_{a} \leq \max \left(\left(c-\frac{1}{2 r}\right) b P h w, \frac{-r b P h w}{2}\right)
$$

### 6.2 Trapezoidal iteration spaces

### 6.2.1 Block Distribution for $-1 \leq r_{b}<r_{t}$

We claim that the idle time for the processor executing the $j$-th stack is given by the following lemma.
Lemma 9 Let $r_{t}>r_{b} \geq-1$ and assume that there exists a row $k$ such that for all stacks there is a full tile in $k$. The total processor idle time for the processor executing the $j$-th stack is then given by the following formula.
$I_{p_{j}}=h w\left((j-1)\left(1+r_{b}+c\right)+\left(r_{t}+1+c\right)(P-j)\right)$
where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r$ is the rise.

Theorem 5 If there are $P$ stacks distributed one per processor, and there exist a row which includes full tiles in every stack, then the total idle time for $-1 \leq r_{b}<r_{t}$ will be

$$
I_{a}=P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)
$$

where $h$ is the height and $w$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r_{b}$ and $r_{t}$ are the bottom and top rise.

### 6.2.2 Block Distribution for $r_{b} \leq-2, r_{b}<r_{t}$

The total idle time for the processor executing the $j$-th stack is given by the following lemma.

Lemma 10 If $r_{b} \leq-2$ and each processor only executes one stack, then the following formula is a upper bound on the total idle time for one processor.
For $r_{b} \leq-2<-1 \leq r_{t}$ :

$$
I_{p_{j}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right)(P-j)
$$

For $r_{b}<r_{t}<-1, r_{b} \leq-2$ :

$$
\begin{aligned}
I_{p_{j}} \leq & \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+ \\
& +h w\left(r_{t}-r_{b}\right)(P-j)-\frac{r_{t} h w}{2}
\end{aligned}
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h, r_{b}$ is the rise at the bottom of the iteration space and $r_{t}$ is the rise at the top.

Theorem 6 If $r_{b} \leq-2, r_{b}<r_{t}$ and there are $P$ stacks, then
For $r_{b} \leq-2<-1 \leq r_{t}$ :

$$
I_{a} \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+P(P-1) h w \frac{r_{t}-r_{b}}{2}
$$

For $r_{b}<r_{t}<-1, r_{b} \leq-2$ :

$$
\begin{aligned}
I_{a} \leq & \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+ \\
& +P(P-1) h w^{\frac{r_{t}-r_{b}}{2}-\frac{r_{t} h w P}{2}}
\end{aligned}
$$

is an upper bound on the total idle time.

### 6.2.3 Block Distribution for $r_{b}>r_{t}, r_{b} \geq-1$

We claim that the idle time for the processor executing the $j$-th stack is given by the following lemma.

Lemma 11 Let $r_{b} \geq-1$ and assume that there exists a row $k$, such that for all stacks, there is a full tile in $k$. The total processor idle time for the processor executing
the $j$-th stack is then given by the following formula.
For $-1 \leq r_{t}<r_{b}$ :

$$
I_{p_{j}}=h w\left((j-1)\left(1+r_{b}+c\right)+\right.
$$

$$
\left.+(P-j)\left(1+r_{t}+c\right)\right)
$$

For $-(1+c) \leq r_{t}<-1 \leq r_{b}$ :

$$
I_{p_{j}} \leq h w \overline{( }(j-1)\left(1 \overline{+} r_{b}+c\right)+
$$

$$
\left.+(P-j)\left(1+r_{t}+c\right)-\frac{r_{t}}{2}\right)
$$

For $r_{t} \leq-(1+c)<-1 \leq r_{b}$ :

$$
I_{p_{j}} \leq-h w\left((j-1)\left(r_{t}-r_{b}\right)+\frac{r_{t}}{2}\right)
$$

where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r_{b}$ is the rise at the bottom of the iteration space and $r_{t}$ is the rise at the top.

Theorem 7 Let $r_{b} \geq-1$. If there are $P$ stacks distributed one per processor, and there exist a row which includes full tiles in every stack, then the total idle time will be
For $-1 \leq r_{t}<r_{b}$ :

$$
I_{a}=P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)
$$

For $-(1+c) \leq r_{t}<-1 \leq r_{b}$ :

$$
I_{a} \leq P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)-\frac{r_{t} P h w}{2}
$$

For $r_{t} \leq-(1+c)<-1 \leq r_{b}$ :
$I_{a} \leq-P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{r_{t} P h w}{2}$
where $\bar{h}$ is the height and $w^{2}$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r_{b}$ and $r_{t}$ are the bottom and top rise.

### 6.2.4 Block Distribution for $r_{t}<r_{b} \leq-2$

The total idle time for the processor executing the $j$-th stack is given by the following lemma.

Lemma 12 If $r_{t}<r_{b} \leq-2$ and each processor only executes one stack, then the following formula is an upper bound on the total idle time for one processor.

$$
\begin{aligned}
I_{p_{j}} \leq & \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)- \\
& -h w\left(r_{t}-r_{b}\right)(j-1)-\frac{r_{t} h w}{2}
\end{aligned}
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h$, and $r_{b}$ and $r_{t}$ are the bottom and top rise.

Theorem 8 If $r_{t}<r_{b} \leq-2$ and there are $P$ stacks, then

$$
\begin{aligned}
I_{a} \leq & \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)- \\
& -P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{P r_{t} h w}{2}
\end{aligned}
$$

## 7 Idle time tradeoffs for multilevel tiling

This section will illustrate how paying attention to idle time can be important in the context of tiling for multiple levels of memory and parallelism hierarchy.

In Figure 7 an iteration space with vertical and horizontal dependences is partitioned into tiles and these tiles are again partitioned into subtiles. We will assume that the original iteration space and the subtiles are rectangular and consider the effect of the shape of the intermediate tiles. In Figure 7a) we use a square intermediate tile whereas in b) we use a sloping parallelogram. Conventional tiling techniques would choose the tiling of Figure 7a). We argue that in some circumstances the tiling in b) is more desirable.

Suppose for example the intermediate tiles are brought in from disk to main memory and the subtiles are executed on individual processors. Since there is only a single main memory, i.e. $P=1$ for the coarse granularity tiles, there is no idle time at this level. However,
the idle time to execute each subtile using multiple processors is significant. In the case of square tiles in Figure 7a) each tile incur a substantial idle time penalty. However for the sloping tiles, even though there is a similar penalty for the partial tiles, the full tiles will be able to execute without idle time.


Figure 7: This figure shows two different two level tilings of an iteration space. The first has rise of zero in both cases, whereas the second has negative rise for subtiles and positive for tiles. Which of the two is more efficient depends on the degree of parallel execution at the two levels.

We have observed the importance of idle time in actual programs [CFHAG96]. The first example is a PDElike nested loop, tiled for both cache and registers on the IBM RS/6000. Conventional tiling would produce rectangular cache tiles and an unrolled loop (which can be considered to be a very small rectangular "register tile"). Hierarchical tiling would also choose rectangular cache tiles; however, because of the pipelined functional unit of depth 2 in the RS/6000, which is modeled as two-way parallelism, register tiles are chosen to be parallelograms with rise of +1 with respect to the cache tiles. Since there is only one set of registers per cache, having a positive rise doesn't lead to any idle time. However, each register tile is further partitioned into "instruction tiles". The choice of register tile shape makes the instruction tiles have rise of -1 , which gives greater instruction level parallelism. The code produced by hierarchical tiling ran $64 \%$ faster than the conventionally tiled code on the RS/6000, taking an average of 3.23 cycles per iteration point compared to 5.31 . The second example is a protein matching code with a similar set of considerations. In this example on the IBM SP2, the inner loop code with rectangular register tiles ran at 8.0 cycles per iteration, whereas a register tile a parallelogram with rise of -1 ran at 6.37 cycles per iteration, i.e., $26 \%$ faster. The difference in performance on these two codes illustrates that idle time can have a significant effect on real performance.

## References

[B92] Banerjee, U., "Unimodular Transformations of Double Loops", in Advances in Languages and Compilers for Parallel Processing, A. Nicolau and D. Padua, editors, MIT Press, 1992.
[CFH95a] Carter, L., J. Ferrante and S. Flynn Hummel, "Efficient Parallelism via Hierarchical Tiling," Proc. of SIAM Conference on Parallel Processing for Scientific Computing, (February, 1995).
[CFH95b] Carter, L., J. Ferrante and S. Flynn Hummel, "Hierarchical Tiling for Improved Superscalar Performance," Ninth International Parallel Processing Symposium, Santa Barbara, CA, April 1995.
[CFHAG96] Carter, L., J. Ferrante, S. Flynn Hummel, B. Alpern, K. S. Gatlin, "Hierarchical Tiling: A Methodology for High Performance," UCSD Computer Science and Engineering Technical Report, available at http://wwwcse.ucsd.edu/users/carter/ppbib.html.
[CK92] Carr, S. and K. Kennedy, "Compiler Blockability of Numerical Algorithms," Proc. of Supercomputing, November, 1992.
[GJG88] Gannon, D., W. Jalby and K. Gallivan, "Strategies for Cache and Local Memory Management by Global Program Transformation," Journal of Parallel and Distributed Computing, Vol. 5, No. 5, October 1988, pp. 587-616.
[HCF96] Högstedt, K., L. Carter and J. Ferrante, "Determining the Idle Time of a Tiling," UCSD Computer Science and Engineering Technical Report, no. CS96-489, 1996, available at http://wwwcse.ucsd.edu/users/carter/ppbib.html.
[HA90] Hudak, D. E. and S. G. Abraham, "Compiler Techniques for Data Partitioning of Sequentially Iterated Parallel Loops," IEEE Transactions on Parallel and Distributed Systems, Vol. 2 No. 3, July, 1991, pp. 318-328.
[H93] Hummel, S. F., "Fractiling: A Method for Scheduling Parallel Loops on NUMA Machines", IBM RC 18958, June 1993.
[IT88] Irigoin, F. and R. Triolet, "Supernode Partitioning," Proc. 15th ACM Symp. on Principles of Programming Languages, January 1988, pp. 319-328.
[JM86] Jalby, W. and U. Meier, "Optimizing Matrix Operations on a Parallel Multiprocessor with a Hierarchical Memory System," Proc. International Conference on Parallel Processing, August 1986, pp. 429-432.
[KM92] Kennedy, K. and K. S. Mc Kinley, "Optimizing for Parallelism and Data Locality," Proc. International Conference on Supercomputing, July 1992, pp. 323-334.
[KP93] Kelly, W. and W. Pugh, "Determining Schedules based on Performance Estimation," Technical report, Dept. of Computer Science, University of Maryland, College Park UMIACS-TR-9367, December 1993.
[LRW91] Lam, M, E. Rothberg, and M. Wolf, "The Cache Performance and Optimization of Blocked Algorithms," Proc. ASPLOS, April 1991, pp. 63-74.
[L90] Lee, F. F., "Partitioning of Regular Computation on Multiprocessor Systems," Journal of Parallel and Distributed Computing, Vol. 9, 1990, pp. 312-317.
[RS91] Ramanujam, J. and P. Sadayappan, "Tiling Multidimensional Iteration Spaces for Nonshared Memory Machines," Proceedings of Supercomputing, November, 1991, pp. 111-120.
[RAP87] Reed, D. A., L. M. Adams and M. L. Patrick, "Stencil and Problem Partitionings: Their Influence on the Performance of Multiple Processor Systems," IEEE Transactions on Computers, July, 1987, pp. 845-858.
[VSM96] Van der Wijngaart, R.F., S.R. Sarukkai, and P. Mehra, "The Effects of Interrupts on Software Pipeline Execution on Message-passing Architectures," International Conference on Supercomputing, May, 1996.
[WL91] Wolf, M. E. and M. S. Lam, "A Data Locality Optimizing Algorithm," Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June, 1991, pp. 30-44.
[WL91b] Wolf, M. E. and M. S. Lam, "A Loop Transformation Theory and an Algorithm to Maximize Parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 4, October, 1991, pp. 452-471.
[W87] Wolfe, M., "Iteration Space Tiling for Memory Hierarchies," Parallel Processing for Scientific Computing, G. Rodrigue (Ed), SIAM 1987, pp. 357-361.
[W89] Wolfe, M., "More Iteration Space Tiling ," Proceedings of Supercomputing, November, 1989, pp. 655-664.

## Appendix

## Infinite iteration spaces

Lemma 1 When $r_{b} \geq-1, j, k \geq 1$ and $T_{j, k}$ is a tile in row $k$ where all tiles $T_{l, k}, l \leq j$ are full, then $F_{T_{j, k}}=A_{S_{1, k}}+(j-1)(c+1) h w$

Proof of Lemma 1: We first need two additional lemmas (13 and 14).
Lemma 13 Let $T_{j, k}$ be a tile in row $k$ such that $T_{l, k}, l \leq j$ are full tiles.
$F_{T_{j, k}} \leq F_{T_{j-1, k+1}}+c h w \quad($ for $j \geq 2)$
Proof of Lemma 13:

## Induction Hypothesis:

$F_{T_{j, k}} \leq \bar{F}_{T_{j-1, k+1}}+c h w \quad($ for $j \geq 2)$
for all tiles within the iteration space boundaries. $F_{T_{j, k}}=0$ otherwise.
Base Case ( $j+k=2$ ):
$A_{T_{j, k}}=0$ :
Formula 1 states that the finishing time of a non-existing tile is equal to 0 , and the induction hypothesis is therefore trivially true in this case.
$A_{T_{j, k}} \neq 0$ :
$F_{T_{j, k}}=\max \left(F_{T_{j-1, k}}+c w y_{T_{j-1, k}}, F_{T_{j, k-1}}\right)+A_{T_{j, k}} \quad$ (Form. 1)
Now, Formula 1 says that the finishing time for non-existing tiles is equal to 0 , and when $r_{b} \geq-1$ we know that both $T_{j-1, k}$ and $T_{j, k-1}$ do not exist. We therefore have that

$$
F_{T_{j, k}} \leq \max (0+0,0)+A_{T_{j, k}} \quad\left(r_{b} \geq-1,\right. \text { Form. 1) }
$$

$$
\left.\begin{array}{l}
\overline{\leq} \max (0+0,0)+A_{T_{j-1, k+1}} \\
<\max \left(F_{T_{m}}\right. \\
+\operatorname{cwl}_{T}
\end{array} . F_{T}\right)+A_{T} \quad(\mathrm{Obs} .2)
$$

$$
\lesssim \quad \max \left(F_{T_{j-2, k+1}}+c w y_{T_{j-2, k+1}}, F_{T_{j-1, k}}\right)+A_{T_{j-1, k+1}}
$$

for $j+k=2$.

$$
\begin{align*}
& F_{T_{j-1, k+1}}+c h w  \tag{Form.1}\\
& F_{T_{j-1, k+1}}+c
\end{align*}
$$

Induction Step: Suppose the induction hypothesis holds for $j+k=x-1$, show it also holds for $j+k=x$.

$$
\begin{array}{rlrl}
F_{T_{j, k}} & =\max \left(F_{T_{j-1, k}}+c w y_{T_{j-1, k}}, F_{T_{j, k-1}}\right)+A_{T_{j, k}} & & (\text { Form. 1) } \\
& \leq \max \left(F_{T_{j-1, k}}+c h w, F_{T_{j, k-1}}\right)+A_{T_{j, k}} & & \\
& \leq \max \left(F_{T_{j-1, k}}+c h w, F_{T_{j-1, k}}+c h w\right)+A_{T_{j, k}} & & \text { (Induction hypothesis) } \\
& \leq F_{T_{j-1, k}+c h w+A_{T_{j, k}}}+c h w+A_{T_{j-1, k+1}} & & \text { (Obs. 2) } \\
& \leq F_{T_{j-1, k}}+c h w  \tag{Obs.2}\\
& \leq F_{T_{j-1, k+1}}+c h w & &
\end{array}
$$

since $F_{T_{j-1, k+1}} \geq F_{T_{j-1, k}}+A_{T_{j-1, k+1}}$ by the definition of finishing time (Form 1). We have now proved the induction hypothesis and lemma.
End of Proof of Lemma 13
Lemma 14 Let $T_{j, k}$ be a tile in row $k$ such that $T_{l, k}, l \leq j$ are full tiles.
$F_{T_{j, k}}=F_{T_{j-1, k}}+(c+1) h w \quad($ for $j \geq 2)$
Proof of Lemma 14

$$
\begin{aligned}
F_{T_{j, k}} & =\max \left(F_{T_{j-1, k}}+c w y_{T_{j-1, k}}, F_{T_{j, k-1}}\right)+A_{T_{j, k}} & & (\text { Form. } 1) \\
& =\max \left(F_{T_{j-1, k}}+c h w, F_{T_{j, k-1}}\right)+A_{T_{j, k}} & & \left(A_{T_{j-1, k}}=h w\right) \\
& =F_{T_{j-1, k}}+c h w+A_{T_{j, k}} & & (\text { Lemma } 13) \\
& =F_{T_{j-1, k}}+c h w+h w & & \left(A_{T_{j, k}}=h w\right) \\
& =F_{T_{j-1, k}}+(c+1) h w & &
\end{aligned}
$$

End of Proof of Lemma 14
We are now ready to complete the proof of Lemma 1.

$$
\begin{array}{rlrl}
F_{T_{j, k}} & =F_{T_{j-1, k}}+(c+1) h w & & (\text { for } j \geq 2) \\
& =F_{T_{1, k}}+(j-1)(c+1) h w & & \text { (femma 14) } \\
& =A_{S_{1, k}}+(j-1)(c+1) h w & & \\
& & \text { (by simple induction) } \\
& & &
\end{array}
$$

## End of Proof of Lemma 1

## Corollary 1

$$
I_{p_{j} \text { bottom }}^{c}=(j-1)\left(1+r_{b}+c\right) h w
$$


a)

b)

Figure 8: Iteration spaces showing that the difference between the area of stack 1 and stack $j$ up to a certain level always differs with $r_{b} h(j-1) w$. This fact is used in the proof of Corollary 1.
when $r_{b} \geq-1$.
Proof of Corollary 1: The idle time for $p_{j}$ in an infinite iteration space is given by the finishing time of a tile $T_{j, k}$ in row $k$ such that $T_{l, k}, l \leq j$ are full tiles, minus the amount of computation $p_{j}$ has executed before $F_{T_{j, k}}$

Using the notation in Figure 8, and Lemma 1, we have that

$$
\begin{array}{rlr}
I_{p_{j \text { bottom }}}^{c} & =F_{T_{j, k}}-A_{S_{j, k}} & (\text { for } j \geq 1) \\
& =A_{S_{1, k}}+(j-1)(c+1) h w-A_{S_{j, k}} & (\text { Lemma } 1) \\
& =\left(A_{S_{1, k}}-A_{S_{j, k}}\right)+(j-1)(c+1) h w & \\
& =r_{b} h(j-1) w+(j-1)(c+1) h w & \\
& =(j-1)\left(1+r_{b}+c\right) h w & \tag{Figure8}
\end{array}
$$

where $A_{S_{j, k}}$ is the area of the $j$-th stack up to, and including row $k$.
End of Proof of Corollary 1

Lemma 2 When $r_{b} \leq-2, j \geq 2$ and $T_{j, k}$ is a full tile, then

$$
F_{T_{j, k}} \leq A_{S_{j, k}}+\max \left(\left(c+\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)
$$

Proof of Lemma 2: We first need to prove the following lemma.
Lemma $15 \quad F_{T_{j, k}} \geq F_{T_{j-1, k+1}}+c h w \quad\left(\right.$ for $\left.j \geq 2, k \geq f_{j}\right)$
where $f_{j}$ is the value of $k$ such that $T_{j, k}$ is the first full tile in stack $j$.
Proof of Lemma 15:
Induction Hypothesis:
$F_{T_{j, k}} \geq F_{T_{j-1, k+1}}+c w y_{T_{j-1, k+1}} \quad($ for $j \geq 1)$
for all tiles within the iteration space boundaries. $F_{T_{j, k}}=0$ otherwise.
Base Case $\left(j+k=l_{b P}+1\right.$, where $l_{b P}$ is the lowest value of $k$ such that $T_{b P, k}$ exists.):
$A_{T_{j, k}}=0$ :
Formula 1 states that the finishing time of a non-existing tile is equal to 0 , and the induction hypothesis is therefore trivially true in this case.
$A_{T_{j, k}} \neq 0:$

$$
F_{T_{j, k}}=\max \left(F_{T_{j-1}, k}+c w y_{T_{j-1}, k}, F_{T_{j, k-1}}\right)+A_{T_{j, k}} \quad \text { (Form. 1) }
$$

Now, Formula 1 says that the finishing time for non-existing tiles is equal to 0 , and when $r_{b} \leq-2$ we know that both $T_{j-1, k}$ and $T_{j, k-1}$ do not exist when $j+k=l_{b P}+1$. We therefore have that

$$
\begin{array}{rlrl}
F_{T_{j, k}} & =\max (0+0,0)+A_{T_{j, k}} & & \left(r_{b} \leq-2,\right. \text { Form. 1) } \\
& \geq \max (0+0,0)+A_{T_{j-1, k+1}} & & (\text { Obs. } 3) \\
& \geq \max \left(F_{T_{j-2, k+1}}+c w y_{T_{j-2, k+1}}, F_{T_{j-1, k}}\right)+A_{T_{j-1, k+1}} & \left(A_{T_{j-2, k+1}}=A_{T_{j-1, k}}=0\right) \\
& \geq F_{T_{j-1, k+1}}+c w y_{T_{j-1, k+1}} . & & (\text { Form. }) \\
& \left.\geq F_{T_{j-1, k+1}}\right) & \left(A_{T_{j-1, k+1}}=0\right)
\end{array}
$$

Induction Step: Suppose the induction hypothesis holds for $j+k=x-1$, show it also holds for $j+k=x$.
$F_{T_{j, k}}=\max \left(F_{T_{j-1, k}}+c w y_{T_{j-1, k}}, F_{T_{j, k-1}}\right)+A_{T_{j, k}}$
(Form. 1)
$\geq \max ^{2}\left(F_{T_{j-1, k}}+c w y_{T_{j-1}, k}, F_{T_{j-1, k}}+c w y_{T_{j-1, k}}\right)+A_{T_{j, k}} \quad$ (Induction hypothesis)
$\geq F_{T_{j-1, k}}+c w y_{T_{j-1, k}}+A_{T_{j, k}}$
$\geq F_{T_{j-1, k}}+c w y_{T_{j-1, k}}+A_{T_{j-1}}$
(Obs. 3)
$\left.\geq \max _{T_{j-1, k+1}+c w y_{T_{j-1, k}}} F_{T_{j-2, k+1}}+c w y_{T_{j-2}}, F_{T_{j-1, k}}\right)+c w y_{T_{j-1, k}}+A_{T_{j-1, k+1}} \quad$ (Induction hypothesis)
Now we have shown that $F_{T_{j, k}} \geq F_{T_{j-1, k}}+c w y_{T_{j-1, k}}$ for all tiles. We want to show $F_{T_{j, k}} \geq F_{T_{j-1, k}}+c w h$ for $k \geq f_{j}$ and this is trivially true since when $T_{j, k}$ is full, $y_{T_{j-1, k}}=h$, and we have now proved the lemma.
End of Proof of Lemma 15

Using Lemma 15 we now have

$$
F_{T_{j, k}}=\max \left(F_{T_{j-1, k}}+c w y_{T_{j-1, k}}, F_{T_{j, k-1}}\right)+A_{T_{j, k}}
$$

(Form. 1)

$$
\begin{equation*}
=F_{T_{j, k-1}}+A_{T_{j, k}} \tag{1}
\end{equation*}
$$

(for $k>f_{j}$ ) (Lemma 15)
To calculate the finishing time for $k>f_{j}$ we now need to calculate the finishing time for $T_{j, k}$, the first full tile in stack $j$.
Lemma 16 Let $T_{j, f_{j}}$ be the first full tile in stack $j$. Then

$$
F_{T_{j, f_{j}}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, f_{j}}} \quad(\text { for } j \geq 2)
$$

Proof of Lemma 16 Let us first calculate the finishing time for the first full tiles neighboring tiles.

$$
\begin{array}{rlrl}
F_{T_{j-1, f_{j}}} & =\max \left(F_{T_{j-2, f_{j}}}+c w y_{T_{j-2, f_{j}}}, F_{T_{j-1, f_{j}-1}}\right)+A_{T_{j-1, f_{j}}} & & (\text { Form. } 1) \\
& =\max \left(F_{T_{j-2, f_{j}}}+c w y_{T_{j-2, f_{j}}}, A_{T_{j-1, f_{j}-1}}\right)+A_{T_{j-1, f_{j}}} & & \left(A_{T_{j-2, f_{j}-1}}=A_{T_{j-1, f_{j}-2}}=0 \text { when } r_{b} \leq-2\right) \\
& =\max \left(0+0, A_{T_{j-1, f_{j}-1}}\right)+A_{T_{j-1, f_{j}}} & & \left(A_{T_{j-2, f_{j}}}=0\right) \\
& =A_{T_{j-1, f_{j}-1}}+A_{T_{j-1, f_{j}}} \\
& =A_{S_{j-1, f_{j}}} & &  \tag{2}\\
& & \\
F_{T_{j, f_{j}-1}} & =\max \left(F_{T_{j-1, f_{j}-1}}+c w y_{T_{j-1, f_{j}-1}}, F_{T_{j, f_{j}-2}}\right)+A_{T_{j, f_{j}-1}} & & (\text { Form. } 1) \\
& =\max \left(F_{T_{j-1, f_{j}-1}}+c w y_{T_{j-1, f_{j}-1}}, A_{S_{j, f_{j}-2}}\right)+A_{T_{j, f_{j}-1}} & & \left(l_{j-1}>f_{j}-2 \text { whenr} r_{b} \leq-2\right) \\
& =\max \left(A_{T_{j-1, f_{j}-1}}+c w y_{T_{j-1, f_{j}-1}}, A_{S_{j, f_{j}-2}}\right)+A_{T_{j, f_{j}-1}} & \left(A_{T_{j-2, f_{j}-1}}=A_{T_{j-1, f_{j}-2}}=0 \text { when } r_{b} \leq-2\right) \\
& =\max \left(c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)+A_{S_{j, f_{j}-2}}+A_{T_{j, f_{j}-1}} \\
& =\max \left(c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)+A_{S_{j, f_{j}-1}}
\end{array}
$$

Now we can calculate the finishing time for the first full tile $T_{j, k}$ in stack $j$ using these two formulas.

$$
\begin{align*}
F_{T_{j, f_{j}}} & =\max \left(F_{T_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}, F_{T_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}  \tag{Form.1}\\
& =\max \left(A_{S_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}, F_{T_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}  \tag{eqn.2}\\
& =\max \left(A_{S_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}\right. \\
& \left.\quad \max \left(c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)+A_{S_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}  \tag{eqn.3}\\
& =\max \left(c w y_{T_{j-1, f_{j}}}-\left(A_{S_{j, f_{j}-1}}-A_{S_{j-1, f_{j}}}\right)\right. \\
& \left.\quad \max \left(c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)\right)+A_{S_{j, f_{j}-1}}+A_{T_{j, f_{j}}} \\
& =\max \left(c w y_{T_{j-1, f_{j}}}-\left(A_{S_{j, f_{j}-1}}-A_{S_{j-1, f_{j}}}\right), c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)+A_{S_{j, f_{j}}} \\
& =\max \left(c h w-\left(A_{S_{j, f_{j}-1}}-A_{S_{j-1, f_{j}}}\right), c w y_{T_{j-1, f_{j}-1}}-\left(A_{S_{j, f_{j}-2}}-A_{T_{j-1, f_{j}-1}}\right), 0\right)+A_{S_{j, f_{j}}} \\
& =\max \left(D_{1}, D_{2}, 0\right)+A_{S_{j, f_{j}}}
\end{align*}
$$

Nown


Figure 9: Iteration space for $r_{b} \leq-2$.

$$
\begin{array}{rlrl}
D_{2} & =c w y_{T_{j-1}, f_{j}-1}-\left(A_{2}-A_{1}\right) & & \\
& =c w y-\left(\frac{u x}{2}-\frac{y v}{2}\right) & & \left(\frac{x}{w}=\frac{u}{-r_{b} h}\right) \\
& =c w y-\left(\frac{u w u}{-2 r_{b} h}-\frac{y v}{2}\right) & & \left(\frac{v}{w}=\frac{y}{-r_{b} h}\right) \\
& =c w y-\left(\frac{w u^{2}}{-2 r_{b} h}-\frac{y w y}{-2 r_{r} h}\right) & & \left(u=-r_{b} h-t\right) \\
& =c w y-\left(\frac{w y^{2}}{2 r_{b} h}-\frac{w u^{2}}{2 r_{b} h}\right) & & (t=h-y) \\
& =c w y-\left(\frac{w y^{2}}{2 r_{b} h}-\frac{w\left(-r_{b} h-t\right)^{2}}{2 r_{b} h}\right) & \\
& =c w y-\left(\frac{w y^{2}}{2 r_{b} h}-\frac{w\left(-r_{b} h-(h-y)\right)^{2}}{2 r_{b} h}\right) & \\
& =c w y-\left(\frac{w}{2 r_{b} h}\left(y^{2}-\left(-r_{b} h-(h-y)\right)^{2}\right)\right) & & \\
& =c w y-\frac{w}{2 r_{b} h}\left(y^{2}-\left(-r_{b} h-h+y\right)^{2}\right) &
\end{array}
$$

We see that $D_{2}$ is a linear function of $y$, and since $0 \leq y \leq h$, we know that the maximum value of $D_{2}$ is $\max \left(D_{2}[y=h], D_{2}[y=0]\right) . D_{1}$ on the other hand is a parabola, but taking the second derivative with respect to $y$, we see it is concave and the maximum value of $D_{1}$ is therefore bounded by $\max \left(D_{1}[y=h], D_{1}[y=0]\right)$.

Going back to the expression for $F_{T_{j, f_{j}}}$ we have for $j \geq 2$

$$
\begin{align*}
F_{T_{j, f_{j}}}= & \max \left(D_{1}, D_{2}, 0\right)+A_{S_{j, f_{j}}}  \tag{eqn.4}\\
\leq & \max \left(D_{1}[y=h], D_{1}[y=0], D_{2}[y=h], D_{2}[y=0], 0\right)+A_{S_{j, f_{j}}} \\
\leq & \max \left(c h w-\frac{w}{2 r_{b} h}\left(\left(r_{b} h+h+h\right)^{2}-2 r_{b}^{2} h^{2}-2 r_{b} h^{2}\right)\right. \\
& \operatorname{chw}-\frac{w}{2 r_{b} h}\left(\left(r_{b} h+h\right)^{2}-2 r_{b}^{2} h^{2}-2 r_{b} h^{2}\right) \\
& \left.\operatorname{chw}-\frac{w}{2 r_{b} h}\left(h^{2}-\left(-r_{b} h-h+h\right)^{2}\right),-\frac{w}{2 r_{b} h}\left(-\left(-r_{b} h-h\right)^{2}\right), 0\right)+A_{S_{j, f_{j}}} \\
\leq & \max \left(c h w-\frac{h w}{2 r_{b}}\left(\left(r_{b}+2\right)^{2}-2 r_{b}^{2}-2 r_{b}\right), \operatorname{chw}-\frac{h w}{2 r_{b}}\left(\left(r_{b}+1\right)^{2}-2 r_{b}^{2}-2 r_{b}\right),\right. \\
& \left.\operatorname{chw}-\frac{h w}{2 r_{b}}\left(1-r_{b}^{2}\right), \frac{h w}{2 r_{b}}\left(r_{b}+1\right)^{2}, 0\right)+A_{S_{j, f}} \\
\leq & \max \left(c h w-\frac{h w}{2 r_{b}}\left(r_{b}^{2}+4 r_{b}+4-2 r_{b}^{2}-2 r_{b}\right), c h w-\frac{h w}{2 r_{b}}\left(1-r_{b}^{2}\right),\right. \\
& \left.\operatorname{chw}-\frac{\left(1-r_{b}^{2}\right)}{2 r_{b}} h w, \frac{\left(r_{b}+1\right)^{2}}{2 r_{b}} h w, 0\right)+A_{S_{j, f}} \\
\leq & \max \left(c h w-\frac{h w}{2 r_{b}}\left(1-r_{b}^{2}+2 r_{b}+3\right), c h w-\frac{\left(1-r_{b}^{2}\right)}{2 r_{b}} h w, \frac{\left(r_{b}+1\right)^{2}}{2 r_{b}} h w, 0\right)+A_{S_{j, f_{j}}} \\
\leq & \max \left(c h w-\frac{1-r_{b}^{2}}{2 r_{b}} h w-\frac{2 r_{b}+3}{2 r_{b}} h w, \operatorname{chw}-\frac{1-r_{b}^{2}}{2 r_{b}} h w, \frac{\left(r_{b}+1\right)^{2}}{2 r_{b}} h w, 0\right)+A_{S_{j, f}} \\
\leq & \max \left(c h w-\frac{1-r_{b}^{2}}{2 r_{b}} h w, \frac{\left(r_{b}+1\right)^{2}}{2 r_{b}} h w, 0\right)+A_{S_{j, f}} \\
\leq & \max \left(c h w-\frac{1-r_{b}^{2}}{2 r_{b}} h w, 0\right)+A_{S_{j, f_{j}}}
\end{align*}
$$

## End of Proof of Lemma 16

We are now ready to complete the proof of Lemma 2.

$$
\begin{aligned}
F_{T_{j, k}} & =F_{T_{j, k-1}}+A_{T_{j, k}} \\
& =F_{T_{j, f_{j}}}+\left(k-1-f_{j}\right) h w+A_{T_{j, k}} \\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, f_{j}}}+\left(k-1-f_{j}\right) h w+A_{T_{j, k}} \\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, k}}
\end{aligned}
$$

(for $k>f_{j}, j \geq 2$ ) (eqn. 1)
(by simple induction)
(Lemma 16)
and we have now proven the lemma.
End of Proof of Lemma 2

## Corollary 2

$$
I_{p_{j} \text { bottom }}^{c} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right) \quad(\text { for } j \geq 2)
$$

when $r_{b} \leq-2$.
Proof of Corollary 2: The idle time for $p_{j}$ in an infinite iteration space is given by the finishing time of a full tile $T_{j, k}$ in stack $j$, minus the amount of computation $p_{j}$ has executed before $F_{T_{j, k}}$. In other words,

$$
\begin{aligned}
I_{p_{j b o t t o m}^{c}}^{c} & =F_{T_{j, k}}-A_{S_{j, k}} \\
& \leq A_{S_{j, k}}+\max \left(\left(c+\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-A_{S_{j, k}} \quad(\text { for } j \geq 2) \quad \text { (Lemma 2) } \\
& \leq \max \left(\left(c+\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)
\end{aligned}
$$

## End of Proof of Corollary 2

## Finite parallelogram-shaped iteration spaces



Figure 10: Iteration space for $r_{t} \leq-2$. Light shaded areas at each stack correspond to added idle time if that stack finishes last.


Figure 11: Upper right corner of iteration space with $r_{t} \geq-1$. The shaded area correspond to the computation that is removed from the first partial tile in each stack when making the infinite iteration space finite.

## Lemma 3

$$
I_{l}^{c} \begin{cases}\leq \frac{-r_{t} h w}{2} & \left(\text { for } r_{t}<-1\right) \\ =0 & \left(\text { for } r_{t} \geq-1\right)\end{cases}
$$

Proof of Lemma 3: If $r_{t}<-1$, the amount of time a processor can be forced to be idle even though it still has computation left to do, is less than or equal to the shaded area in Figure 10, $j$ being an arbitrary stack. Since this area varies in size from processor to processor, the processors will not finish at the same time. The maximum amount of area that will be added to the execution time of a single processor ${ }^{4}$ is equal to $\frac{-r_{t} h w}{2}$.

If on the other hand $r_{t} \geq-1$, the $j$-th stack will not have to wait for the $(j-1)$-th processor since the amount of computation removed from the $j$-th stack always will be smaller or equal to the amount of computation removed in the $(j-1)$-th stack. (See the shaded areas in Figure 11.) The $(j-1)$-th processor will therefore always have the data ready for $p_{j}$ at the time $p_{j}$ needs it, and $I_{p_{j}}^{c}$ will be equal to 0 .
End of Proof of Lemma 3

[^4]
## Lemma 4

$$
I_{p_{j}}^{s}= \begin{cases}h w\left(r_{t}-v \frac{w}{h}\right)(P-j) & \left(\text { for } r_{t} \geq v \frac{w}{h}\right) \\ -h w\left(r_{t}-v \frac{w}{h}\right)(j-1) & \left(\text { for } r_{t} \leq v \frac{\underline{w}}{h}\right)\end{cases}
$$

where $r_{t}$ is the top rise and $v$ is the wavefront constant.
Proof of Lemma 4: The amount of time $p_{j}$ will be idle after it has finished all its computation is given by the following cases in the case of block distribution.


Figure 12: Top of iteration space. The shaded area correspond to the computation that $p_{1}$ is going to be idle after having executed its stack.
Case $1\left(v \frac{w}{h} \leq r_{t}\right)$ : After $p_{1}$ has finished executing, it will be idle during the time $p_{P}$ executes the shaded area in Figure 12 (follows from the definition of wavefront), i.e. during $\left(-v w(P-1)+r_{t} h(P-1)\right) w$ time. The $j$-th processor, will similarly be idle during $\left(-v w(P-j)+r_{t} h(P-j)\right) w$ time.


Figure 13: Top of iteration space. The shaded area correspond to the computation that $p_{P}$ is going to be idle after having executed its stack.
Case $2\left(r_{t} \leq v \frac{w}{h}\right)$ : After $p_{P}$ has finished executing, it will be idle during the time $p_{1}$ executes the shaded area in Figure 13 (follows from the definition of wavefront), i.e. during $(P-1)\left(-r_{t} h-(-v w)\right) w$ time. The $j$-th processor, will similarly be idle during $(j-1)\left(-r_{t} h-(-v w)\right) w$ time.

The additional idle time after $p_{j}$ has finished computing its stack is thus equal to

$$
\begin{array}{rll}
\left(-v w(P-j)+r_{t} h(P-j)\right) w & =h w\left(r_{t}-v \frac{w}{h}\right)(P-j) & \left(\text { for } v \frac{w}{h} \leq r_{t}\right) \\
(j-1)\left(-r_{t} h-(-v w)\right) w & =-h w\left(r_{t}-v \frac{w}{h}\right)(j-1) & \left(\text { for } r_{t} \leq v \frac{w}{h}\right)
\end{array}
$$

In the case of block cyclic distribution we need to note that the only synchronization idle time for $p_{j}$ occurs after $p_{j}$ has executed all its stacks. Since the amount of time $p_{j}$ needs to wait at this point will be the same as in the block distribution case, we have proven the lemma also for block cyclic distribution.

## End of Proof of Lemma 4

Lemma 5 Let $r \geq-1$ and assume that there exists a row $k$, such that for all stacks, there is a full tile in $k$. The total processor idle time for the processor executing the $j$-th stack is then given by the following formula.

$$
I_{p}=(P-1) h w(1+r+c)
$$

where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r$ is the rise.

## Proof of Lemma 5:

$$
\begin{array}{rlrl}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j} \text { bottom }}^{c}+I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =(j-1)\left(1+r_{b}+c\right) h w+\sum_{l=j \bmod P} I_{l t o p}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & \text { (by definition) } \\
& =(j-1)\left(1+r_{b}+c\right) h w+0+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & (\text { Cor. } 1, \text { Lemma } 3 \text { and } 4) \\
& =(j-1)(1+r+c) h w+(P-j)(r-(-(1+c))) h w & & \left(v \frac{w}{h}=-(1+c)\right) \\
& =(j-1)(1+r+c) h w+(P-j)(1+r+c) h w &
\end{array}
$$

Since $(P-1)(1+r+c) h w$ is independent on $j, I_{p}=I_{p_{j}}=(P-1)(1+r+c) h w$, and we have proven the lemma.

## End of Proof of Lemma 5

Theorem 1 Let $r \geq-1$. If there are $P$ stacks distributed one per processor, and there exists a row which includes full tiles in every stack, then the total idle time will be

$$
I_{a}=P(P-1) h w(1+r+c)
$$

where $h$ is the height and $w$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r$ is the rise.

Proof of Theorem 1: Follows directly from Lemma 5 End of Proof of Theorem 1

Lemma 6 If $r \leq-2$ and each processor only executes one stack, then the following formula is an upper bound on the total idle time for one processor.

$$
I_{p} \leq \max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r h w}{2}\right)
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h$, and $r$ is the rise.

## Proof of Lemma 6:

$$
\begin{aligned}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j}}^{c}+I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by deftom definition) } \\
& =I_{p_{j}}^{c} \text { bottom }+I_{p_{j}}^{c}-(j-1)\left(r_{t}-v \frac{w}{h}\right) h w & & (\text { Lemma } 4) \\
& =I_{p_{j}}^{c}{ }_{\text {bottom }}+I_{p_{j}}^{c}-0 & & \left(v \frac{w}{h}=r_{b}=r_{t}\right) \\
& =I_{p_{j} \text { bottom }}^{c}+\sum_{l=1} l=j \bmod P I_{l \text { lop }}^{c} & & \text { (by definition) } \\
& \leq \max \left(\left(c-\frac{1-r_{b}}{2 r_{b}}\right) h w, 0\right)+\frac{-r_{t} h w}{2} & & \text { (Cor. 2 and Lemma 3) } \\
& \leq \max \left(\left(c-\frac{1-r}{2 r}\right) h w, 0\right)+\frac{-r h w}{2} & & \\
& \leq \max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r h w}{2}\right) & &
\end{aligned}
$$

Since $\max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r h w}{2}\right)$ is independent on $j, I_{p}=I_{p_{j}}=\max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r h w}{2}\right)$, and we have proven the lemma.

## End of Proof of Lemma 6

Theorem 2 If $r \leq-2$ and there are $P$ stacks, then

$$
I_{a} \leq \max \left(\left(c-\frac{1}{2 r}\right) P h w, \frac{-r P h w}{2}\right)
$$

is an upper bound on the total idle time.
Proof of Theorem 2: Follows directly from Lemma 6.
End of Proof of Theorem 2

## Lemma 7

Let $r_{b} \geq-1$ and $p_{j}$ be the processor that executes stack numbers $j \bmod P$. Let all stacks have an area of at least Phw $(1+\bar{r}+c)$, and let there exist a row in which all tiles are full.
$I_{p}=(P-1)(1+r+c) h w$
Proof of Lemma 7:
Let us first note that

$$
\begin{array}{rlrl}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j}}^{c} \text { bottom }+I_{p_{j} t o p}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j} \text { bottom }}^{c}+\sum_{l=j \bmod P I_{\text {top }}^{c}+I_{p_{j}}^{s}} & \text { (by definition) }
\end{array}
$$

$I_{l_{\text {top }}}^{c}$ and $I_{p_{j}}^{s}$ are given by Lemma 3 and 4 , but we need to derive a formula for $I_{p_{j}}^{c}{ }_{\text {bottom }}$ in the case of block cyclic distribution.

Let $I_{j \text { bottom }}^{c}$ be the amount of time the processor executing $S_{j}$ would be idle before starting executing $S_{j}$ if $S_{j}$ was the first stack to be executed on that processor. (This amount of idle time is given by Corollary 1.)

Since $S_{j-1}$ and $S_{j}$ are executed in lock step and all stacks take the same amount of time to execute since they are all of the same area, $S_{j+P-1}$ and $S_{j+P}$ also execute in lock step. Data calculated in $S_{j+P-1}$ will therefore not delay the execution of $S_{j+P}$ any more than already accounted for in $I_{j+P_{\text {bottom }}^{c}}$.

An very course estimation of $I_{p_{j} \text { bottom }}^{c}$ would be to say it is the sum of all $I_{l \text { bottom }}^{c}$ where $l$ is a stack executed by $p_{j}$. But if $p_{j}$ can overlap $I_{l b o t t o m}^{c}$ for some stack $S_{l}$ with useful computation, then $I_{l b o t t o m}^{c}$ will not be part of $I_{p_{j b t t o m}}^{c}$. Overlapping can occur if

$$
I_{j+P_{\text {bottom }}}^{c} \leq F_{S_{j}}
$$

where $F_{S_{j}}$ is the finishing time of the $j$-th stack.
So if we can show that $I_{j+P_{b o t t o m}^{c}}^{c} \leq F_{S_{j}}$ then we know that $I_{j+P_{b o t t o m}}^{c}$ can be fully overlapped by the computation of $S_{j}$ and will therefore not be added to $I_{p_{j} \text { bottom }}^{c}$. Inductive reasoning can then be used to show that $I_{p_{j b o t t o m}}^{c}$ will infact only consist of $I_{j \text { bottom }}^{c}$.

$$
\begin{aligned}
F_{S_{j}} & =I_{j \text { bottom }}^{c}+I_{j}^{c} & & \text { (by definition) } \\
& =(j-1)(1+r+c) h w+0+A_{S_{j}} & & \text { (Cor. } 1 \text { and Lemma } 3) \\
& \geq(j-1)(1+r+c) h w+P h w(1+r+c) & & (\text { by assumption }) \\
& \geq(j+P-1)(1+r+c) h w & & \\
& \geq I_{j+P_{b o t t o m}}^{c} & & \text { (Cor. } 1)
\end{aligned}
$$

So we have

$$
\begin{aligned}
& I_{p_{j} \text { bottom }}^{c}=I_{j \text { bottom }}^{c} \\
& =(j-1)(1+r+c) h w \quad(\text { Cor. } 1) \\
& I_{p_{j}}=I_{p_{j}}^{c}+I_{p_{j}}^{s} \quad \text { (by definition) } \\
& =I_{p_{j} \text { bottom }}^{c}+I_{p_{j} \text { top }}^{c}+I_{p_{j}}^{s} \quad \text { (by definition) } \\
& =I_{p_{j} b o t t o m}^{c}+0+h w\left(r_{t}-v \frac{w}{h}\right)(P-j) \quad \text { (Lemma } 3 \text { and 4) } \\
& =I_{p_{j} \text { bottom }}^{c}+h w(r-(-(1+c)))(P-j) \quad\left(v \frac{w}{h}=-(1+c)\right) \\
& =(j-1)(1+r+c) h w+h w(1+r+c)(P-j) \quad \text { (from above) } \\
& =(P-1)(1+r+c) h w
\end{aligned}
$$

and we have proven the lemma.

## End of Proof of Lemma 7:

Theorem 3 If $r \geq-1$ and there are bP stacks, then the total idle time is given by

$$
I_{a}=P(P-1)(1+r+c) h w
$$

Proof of Theorem 3: Follows directly from Lemma 7.
End of Proof of Theorem 3

## Lemma 8

Let $r_{b} \leq-2$ and $p_{j}$ be the processor that executes stack numbers $j, j+P, j+2 P, \ldots, j+(b-1) P$.
$I_{p} \leq \max \left(\left(c-\frac{1}{2 r}\right) b h w, \frac{-r b h w}{2}\right)$

Proof of Lemma 8: For $p_{j} I_{p_{j} \text { top }}^{c}$ will be $b \frac{-r h w}{2}$ (Lemma 3), since $p_{j}$ will need to wait for communication at the top of all stacks. $I_{p_{j}}^{s}$ is by Lemma 4 equal to 0 , since $r_{t}=r_{b}=v \frac{w}{h}$. $I_{p_{j} b o t t o m}^{c}$ will be less or equal to $b \max \left(\left(c-\frac{1-r_{b}}{2 r_{b}}\right) h w, 0\right)$ (Corollary 2). ${ }^{5}$

$$
\begin{array}{rlrl}
\text { We therefore have } & & \\
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j}}^{c}+I_{p_{j o t t o m}^{c}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& \leq b \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+b \frac{-r_{t} h w}{2}+0 & & \left(\text { Cor. 2, Lemma 3 and } 4\left(v \frac{w}{h}=r_{b}=r_{t}\right)\right. \\
& \leq b \max \left(\left(c-\frac{1-r^{2}}{2}\right) h w, 0\right)+b \frac{r h w}{2} & & \left(v \frac{w}{h}=r_{b}=r_{t}\right) \\
& \leq \max \left(\left(c-\frac{1-r^{2}}{2 r}\right) b h w, 0\right)-\frac{r b h w}{2} & & \\
& \leq \max \left(\left(c-\frac{1}{2 r}\right) b h w, \frac{-r b h w}{2}\right) & &
\end{array}
$$

## End of Proof of Lemma 8

Theorem 4 If $r \leq-2$ and there are bP stacks, then the total idle time is given by

$$
I_{a} \leq \max \left(\left(c-\frac{1}{2 r}\right) b P h w, \frac{-r b P h w}{2}\right)
$$

Proof of Theorem 4:
$I_{a} \leq \sum_{j=1}^{P} \max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r b h w}{2}\right) \quad($ Lemma 8$)$

$$
\begin{aligned}
& \leq P \max \left(\left(c-\frac{1}{2 r}\right) h w, \frac{-r b h w}{2}\right) \\
& <\max \left(\left(c-\frac{1}{0}\right) b P h w, \frac{-r b P h w}{2}\right)
\end{aligned}
$$

## End of Proof of Theorem 4

## Finite trapezoidal iteration spaces

Lemma 9 Let $r_{b} \geq-1$ and assume that there exists a row $k$, such that for all stacks, there is a full tile in $k$.
The total processor idle time for the processor executing the $j$-th stack is then given by the following formula.
$I_{p_{j}}=h w\left((j-1)\left(1+r_{b}+c\right)+\left(r_{t}+1+c\right)(P-j)\right)\left(\right.$ for $\left.-1 \leq r_{b}<r_{t}\right)$
where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r$ is the rise.

$$
\begin{array}{rlrl}
\text { Proof of Lemma 9: } & & \\
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s}+I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
I_{p_{j}} & =I_{p_{j} \text { bottom }}^{c}+I_{p_{j o p}}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & \text { (by definition) } \\
I_{p_{j}} & =I_{p_{j}}^{c} \text { bottom }+I_{p_{j}}^{c}+(\text { Lemma. 4) } \\
I_{p_{j}} & =I_{p_{j}}^{c} \text { bottom }+I_{p_{j}}^{c}+(P-j)\left(1+r_{t}+c\right) h w & & \left(v \frac{w}{h}=-(1+c)\right) \\
I_{p_{j}} & =(j-1)\left(1+r_{b}+c\right) h w+I_{p_{j}}^{c}+(P-j)\left(1+r_{t}+c\right) h w & & \text { (Cor. 1) } \\
I_{p_{j}} & =(j-1)\left(1+r_{b}+c\right) h w+0+(P-j)\left(1+r_{t}+c\right) h w & & \text { (Lemma 3) } \\
I_{p_{j}} & =(j-1)\left(1+r_{b}+c\right) h w+(P-j)\left(1+r_{t}+c\right) h w & &
\end{array}
$$

## End of Proof of Lemma 9

Theorem 5 If there are $P$ stacks distributed one per processor, and there exist a row which includes full tiles in every stack, then the total idle time will be
$I_{a}=P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right) \quad\left(\right.$ for $\left.-1 \leq r_{b}<r_{t}\right)$
where $h$ is the height and $w$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r_{b}$ and $r_{t}$ are the bottom and top rise.

Proof of Theorem 5: Let $-1 \leq r_{b}<r_{t}$.

[^5]\[

$$
\begin{aligned}
I_{a} & =\sum_{j=1}^{P} h w\left((j-1)\left(1+r_{b}+c\right)+\left(r_{t}+1+c\right)(P-j)\right) \\
& =h w\left(1+r_{b}+c\right) \sum_{j=1}^{P}(j-1)+h w\left(r_{t}+1+c\right) \sum_{j=1}^{P}(P-j) \\
& =h w\left(1+r_{b}+c\right) \frac{P(P-1)}{2}+h w\left(r_{t}+1+c\right) \frac{P(P-1)}{2} \\
& =h w\left(2+r_{t}+r_{b}+2 c\right) \frac{P(P-1)}{2} \\
& =P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)
\end{aligned}
$$
\]

## End of Proof of Theorem 5

Lemma 10 If $r_{b} \leq-2$ and each processor only executes one stack, then the following formula is a upper bound on the total idle time for one processor.

$$
\begin{array}{ll}
I_{p_{j}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right)(P-j) & \left(\text { for } r_{b} \leq-2<-1 \leq r_{t}\right) \\
I_{p_{j}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right)(P-j)-\frac{r_{t} h w}{2} & \left(\text { for } r_{b}<r_{t}<-1, r_{b} \leq-2\right)
\end{array}
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h, r_{b}$ is the rise at the bottom of the iteration space and $r_{t}$ is the rise at the top.

## Proof of Lemma 10:

For $r_{b} \leq-2<-1 \leq r_{t}$ :

$$
\begin{array}{rlrl}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j}}^{c} \text { (by dettom } & I_{p_{j}}^{c}+I_{p_{j}}^{s} & \\
& =I_{p_{j}}^{c} \text { bottom }+I_{p_{j}}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & (\text { Lemma } 4 \text { ) } \\
& =I_{p_{j}}^{c} \text { botinition) } \\
& =I_{p_{j}}^{c}+(P-j)\left(r_{t}-r_{b}\right) h w & & \left(v \frac{w}{h}=r_{b}\right) \\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+(P-j)\left(r_{t}-r_{b}\right) h w & & (\text { Cor. } 2)
\end{array}
$$

For $r_{b}<r_{t}<-1, r_{b} \leq-2$ :

$$
I_{p_{j}}=I_{p_{j}}^{c}+I_{p_{j}}^{s} \quad \quad \text { (by definition) }
$$

$$
=I_{p_{j} \text { bottom }}^{P_{c}^{c}}+I_{p_{j} o p}^{c}+I_{p_{j}}^{s} \quad \quad \text { (by definition) }
$$

$$
=I_{p_{j} \text { bottom }}^{c}+I_{p_{j} \text { top }}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w \quad \quad(\text { Lemma } 4)
$$

$$
=I_{p_{j} \text { bottom }}^{c}+I_{p_{j t o p}}^{c}+(P-j)\left(r_{t}-r_{b}\right) h w \quad\left(v \frac{w}{h}=r_{b}\right)
$$

$$
\leq I_{p_{j} \text { bottom }}^{c}+\frac{-r_{t} h w}{2}+(P-j)\left(r_{t}-r_{b}\right) h w \quad \quad \text { (Lemma 3) }
$$

$$
\leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-\frac{r_{t} h w}{2}+(P-j)\left(r_{t}-r_{b}\right) h w \quad \text { (Cor. 2) }
$$

## End of Proof of Lemma 10

Theorem 6 If $r_{b} \leq-2, r_{b}<r_{t}$ and there are $P$ stacks, then
$I_{a} \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right) \quad\left(\right.$ for $\left.r_{b} \leq-2<-1 \leq r_{t}\right)$
$I_{a} \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{r_{t} h w P}{2} \quad\left(\right.$ for $r_{b}<r_{t}<-1, r_{b} \leq-2$ ) is an upper bound on the total idle time.

## Proof of Theorem 6:

$$
\begin{array}{rlrl}
I_{a} & \leq \sum_{j=1}^{P}\left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right)(P-j)\right) & & \left(\text { for } r_{b} \leq-2<-1 \leq r_{t}\right) \\
& \leq P \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right) \sum_{j=1}^{P}(P-j) & & (\text { Lemma 10) } \\
& \leq P \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right) \frac{P(P-1)}{2} & & \\
& \leq \max \left(P\left(c-\frac{1-r_{b}}{2 r_{b}}\right) h w, 0\right)+P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right) & & \\
I_{a} & \leq \sum_{j=1}^{P}\left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right)(P-j)-\frac{r_{t} h w}{2}\right) & & \left(\text { for } r_{b}<r_{t}<-1, r_{b} \leq-2\right) \\
& \left.\leq \operatorname{Pmax}\left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right) \sum_{j=1}^{P}(P-j)-\frac{r_{t} h w P}{2}\right) & & \\
& \leq \operatorname{Lemmax} 10) \\
& \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+h w\left(r_{t}-r_{b}\right) \frac{P(P-1)}{2}-\frac{r_{t} h w P}{2} & \\
& \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{r_{t} h w P}{2} & &
\end{array}
$$

## End of Proof of Theorem 6

Lemma 11 Let $r_{b} \geq-1$ and assume that there exists a row $k$, such that for all stacks, there is a full tile in $k$. The total processor idle time for the processor executing the $j$-th stack is then given by the following formula.

$$
\begin{array}{ll}
I_{p_{j}}=h w\left((j-1)\left(1+r_{b}+c\right)+(P-j)\left(1+r_{t}+c\right)\right) & \left(\text { for }-1 \leq r_{t}<r_{b}\right) \\
I_{p_{j}} \leq h w\left((j-1)\left(1+r_{b}+c\right)+(P-j)\left(1+r_{t}+c\right)-\frac{r_{t}}{2}\right) & \left(\text { for }-(1+c) \leq r_{t}<-1 \leq r_{b}\right) \\
I_{p_{j}} \leq-h w\left((j-1)\left(r_{t}-r_{b}\right)+\frac{r_{t}}{2}\right) & \left(\text { for } r_{t} \leq-(1+c)<-1 \leq r_{b}\right)
\end{array}
$$

where $h$ and $w$ are the height and width of a tile, chw is the time is takes to communicate one tile surface of height $h$, and $r_{b}$ is the rise at the bottom of the iteration space and $r_{t}$ is the rise at the top.

## Proof of Lemma 11:

$$
\text { For } \begin{array}{rlrl}
-1 & \leq r_{t}<r_{b}: & & \\
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j}}^{c}+I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definitiom) } \\
& =I_{p_{j}}^{c}+j o t o m \\
& =I_{p_{j}}^{c} I_{p_{j}}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & \text { (Lemma 4) } \\
& =I_{p_{j}}^{c}+I_{p_{j}}^{c}+(P-j)\left(1+r_{t}+c\right) h w & & \left(v \frac{w}{h}=-(1+c)\right) \\
& =(j-1)\left(1+r_{b}+c\right) h w+(P-j)\left(1+r_{t}+c\right) h w & & \text { (Com. } 1)
\end{array}
$$

For $-(1+c) \leq r_{t}<-1 \leq r_{b}$ :

$$
\begin{aligned}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j}}^{c} \text { bottom } & I_{p_{j}}^{c}+I_{p_{j}}^{s} & \\
& =I_{p_{j}}^{c} \text { by dettom }+I_{p_{j}}^{c}+(P-j)\left(r_{t}-v \frac{w}{h}\right) h w & & \text { (Lemma definition) } \\
& =I_{p_{j} \text { bottom }}^{c}+I_{p_{j} \text { top }}^{c}+(P-j)\left(1+r_{t}+c\right) h w & & \left(v \frac{w}{h}=-(1+c)\right) \\
& \leq I_{p_{j}}^{c} \text { bottom }+\frac{-r_{t} h w}{2}+(P-j)\left(1+r_{t}+c\right) h w & & \text { (Lemma 3) } \\
& \leq(j-1)\left(1+r_{b}+c\right) h w-\frac{r_{t} h w}{2}+(P-j)\left(1+r_{t}+c\right) h w & & \text { (Cor. 1) }
\end{aligned}
$$

For $r_{t} \leq-(1+c)<-1 \leq r_{b}$ :

$$
\begin{aligned}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j}}^{c}+I_{p_{j o t t o m}^{c}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j} \text { bottom }}^{c}+I_{p_{j}}^{c}-(j-1)\left(r_{t}-v \frac{w}{h}\right) h w & & \text { (Lemma 4) } \\
& =I_{p_{j}}^{c}-\left(\text { bottom }+I_{p_{j}}^{c}-(j-1)\left(1+r_{t}+c\right) h w\right. & & \left(v \frac{w}{h}=-(1+c)\right) \\
& \leq I_{p_{j}}^{c} \text { bottom }+\frac{r_{t} h w}{2}-(j-1)\left(1+r_{t}+c\right) h w & & \text { (Lemma 3) } \\
& \leq(j-1)\left(1+r_{b}+c\right) h w-\frac{r_{t} h w}{2}-(j-1)\left(1+r_{t}+c\right) h w & & \text { (Cor. 1) } \\
& \leq-(j-1) \frac{r_{t}-r_{b}}{2} h w-\frac{r_{t} w}{2} & &
\end{aligned}
$$

## End of Proof of Lemma 11

Theorem 7 Let $r_{b} \geq-1$. If there are $P$ stacks distributed one per processor, and there exist a row which includes full tiles in every stack, then the total idle time will be

| $I_{a}=P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)$ | $\left(\right.$ for $\left.-1 \leq r_{t}<r_{b}\right)$ |
| :--- | :--- |
| $I_{a} \leq P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)-\frac{r_{t} P h w}{2}$ | $\left(\right.$ for $\left.-(1+c) \leq r_{t}<-1 \leq r_{b}\right)$ |
| $I_{a} \leq-P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{r_{t} P h w}{2}$ | (for $\left.r_{t} \leq-(1+c)<-1 \leq r_{b}\right)$ |

where $h$ is the height and $w$ the width of the tiles, chw is the communication time for one tile surface of length $h$, and $r_{b}$ and $r_{t}$ are the bottom and top rise.

## Proof of Theorem 7:

$$
\begin{aligned}
I_{a}= & \sum_{j=1}^{P}\left(h w\left((j-1)\left(1+r_{b}+c\right)+(P-j)\left(1+r_{t}+c\right)\right)\right) & & \left(\text { for }-1 \leq r_{t}<r_{b}\right) \\
= & h w\left(1+r_{b}+c\right) \sum_{j=1}^{P}(j-1)+ & & \\
& +h w\left(1+r_{t}+c\right) \sum_{j=1}^{P}(P-j) & & \\
= & h w\left(1+r_{b}+c\right) \frac{P(P-1)}{2}+h w\left(1+r_{t}+c\right) \frac{P(P-1)}{2} & & \\
= & P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right) & & \\
I_{a} \leq & \sum_{j=1}^{P}\left(h w\left((j-1)\left(1+r_{b}+c\right)+(P-j)\left(1+r_{t}+c\right)-\frac{r_{t}}{2}\right)\right) & \left(\text { for }-(1+c) \leq r_{t}<-1 \leq r_{b}\right) & \text { (Lemma 11) 11) } \\
\leq & h w\left(1+r_{b}+c\right) \sum_{j=1}^{P}(j-1)+ & & \\
& +h w\left(1+r_{t}+c\right) \sum_{j=1}^{P}(P-j)-\frac{P r_{t} h w}{2} & & \\
\leq & h w\left(1+r_{b}+c\right) \frac{P(P-1)}{2}+h w\left(1+r_{t}+c\right) \frac{P(P-1)}{2}-\frac{P r_{t} h w}{2} & & \\
I_{a} \leq & P(P-1) h w\left(1+c+\frac{r_{t}+r_{b}}{2}\right)-\frac{r_{t} P h w}{2} & & \\
& \sum_{j=1}^{P}\left(-h w\left((j-1)\left(r_{t}-r_{b}\right)+\frac{r_{t}}{2}\right)\right) & \text { for } \left.r_{t} \leq-(1+c)<-1 \leq r_{b}\right) & \text { (Lemma 11) } \\
\leq & -h w\left(r_{t}-r_{b}\right) \sum_{j=1}^{P}(j-1)-\frac{P r_{t} h w}{2} & & \\
\text { End of Proof of Theorem } \mathbf{7} & -h w\left(r_{t}-r_{b}\right) \frac{P(P-1)}{2}-\frac{P r_{t} h w}{2} & -P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{P r_{t} h w}{2} &
\end{aligned}
$$

Lemma 12 If $r_{t}<r_{b} \leq-2$ and each processor only executes one stack, then the following formula is an upper bound on the total idle time for one processor.

$$
I_{p_{j}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-h w\left(r_{t}-r_{b}\right)(j-1)-\frac{r_{t} h w}{2}
$$

where $w$ is the width and $h$ the height of the tile, chw is the time it takes to communicate one tile surface of length $h$, and $r$ is the rise.

Proof of Lemma 12:

$$
\begin{aligned}
I_{p_{j}} & =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \\
& =I_{p_{j}}^{c} \text { bottom } & & \text { (by definition) } \\
& =I_{p_{j}}^{c}+I_{p_{j}}^{s} & & \text { (by definition) } \\
& =I_{p_{j}}^{c} \text { bottom }+I_{p_{j}}^{c}+(j-1)\left(r_{t}-v \frac{w}{h}\right) h w & & (\text { Lemma 4) } \\
& \leq I_{p_{j}}^{c}-(j-1)\left(r_{t}-r_{b}\right) h w & & \left(v \frac{w}{h}=r_{b}\right) \\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-\frac{r_{t} h w}{2}-(j-1)\left(r_{t}-r_{b}\right) h w & & (\text { Cor. } 2)
\end{aligned}
$$

## End of Proof of Lemma 12

Theorem 8 If $r_{t}<r_{b} \leq-2$ and there are $P$ stacks, then
$I_{a} \leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{P r_{t} h w}{2}$
is a tight upper bound on the total idle time.

## Proof of Theorem 8:

$I_{a} \leq \sum_{j=1}^{P}\left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-h w\left(r_{t}-r_{b}\right)(j-1)-\frac{r_{t} h w}{2}\right) \quad$ (Lemma 12)
$\leq \operatorname{Pmax}\left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b_{2}}}\right) h w, 0\right)-h w\left(r_{t}-r_{b}\right) \sum_{j=1}^{P}(j-1)-\frac{P r_{t} h w}{2}$
$\leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-h w\left(r_{t}-r_{b}\right) \frac{P(P-1)}{2}-\frac{P r_{t} h w}{2}$
$\leq \max \left(P\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)-P(P-1) h w\left(\frac{r_{t}-r_{b}}{2}\right)-\frac{P r_{t} h w}{2}$

## End of Proof of Theorem 8

## New proofs

Lemma 17 is to replace Lemma 2. Changes need to be propagated to all places where Lemma 2 is directly or indirectly used, namely in Corollary 2, Lemma 6, 8, 10, 12, Theorem 2, 4, 6 and 8.
Lemma 17 When $r_{b} \leq-2, j \geq 2$ and $k \geq f_{2}$, then

$$
F_{T_{j, k}} \leq A_{S_{j, k}}+\max \left(\left(c+\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left((j-2)\left(1+r_{b}+c\right) h w, 0\right)
$$

Proof of Lemma 17: We first need to prove a couple of lemmas (18, 19, 20 and 21). Each lemma determines the finishing time for tiles in each of the shaded areas in Figure 14.


Figure 14: The tiles in the shaded areas are the tiles for which the corresponding lemmas determine the finishing time.

Lemma 18 Let $f_{j}$ be the value of $k$ such that $T_{j, k}$ is the first full tile in stack $j$. $F_{T_{j, k}}=F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w \quad\left(\right.$ for $\left.j \geq 1, f_{j} \leq k<f_{j-1}-1\right)$

## Proof of Lemma 18:

Induction Hypothesis:
$F_{T_{j, k}}=F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w \quad\left(\right.$ for $\left.j \geq 1, f_{j} \leq k<f_{j-1}-1\right)$
Base Case ( $k=f_{j}$ ): Trivially true.
Induction Step: Suppose the induction hypothesis holds for $k$, show it also holds for $k+1$.

$$
\begin{array}{rlrl}
F_{T_{j, k+1}} & =\max \left(F_{T_{j-1, k+1}}+c w y_{T_{j}-1, k+1}, F_{T_{j, k}}\right)+A_{T_{j, k+1}} & & \text { (Form. 1) } \\
& =\max \left(F_{T_{j}, 1, k+1}+c h w, F_{T_{j, k}}\right)+h w & & \left(A_{T_{j, k+1}}=h w \text { when } k+1 \geq f_{j}\right) \\
& =\max \left(\max \left(F_{T_{j-2, k+1}}+c w y_{T_{j-2, k+1}}, F_{T_{j-1, k}}\right)+A_{T_{j-1, k+1}}+c h w,\right. & & \text { (Form. 1) } \\
& \left.=F_{T_{j, k}}\right)+h w & & \text { ( } A_{T_{j-2}, k+1}=0 \text { when } k+1<f_{j-} \\
& =\max \left(F_{T_{j-1, k}}+A_{T_{j-1, k+1}}+c h w, F_{T_{j, k}}\right)+h w & \max \left(F_{T_{j-1, k}}+A_{T_{j-1, k+1}}+c h w, F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w\right)+h w &  \tag{Form.1}\\
\text { (Induction hypothesis) }
\end{array}
$$

We know that $F_{T_{j-1, k}}<F_{T_{j-1, f_{j}}}+\left(k-f_{j}\right) h w$ since the $j$-th stack consists of partial tiles between row $f_{j}$ and $f_{j-1}$ and $k<f_{j-1}-1$. We also know that $F_{T_{j-1, f_{j}}}+c h w+h w<F_{T_{j, f_{j}}}$ from Formula 1. So we have that $F_{T_{j-1, k}}<F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w-h w-c h w$ and therefore $F_{T_{j-1, k}}+A_{T_{j-1, k+1}}+c h w<F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w$. This is used to complete the proof.

$$
\begin{align*}
F_{T_{j, k}} & =\max \left(F_{T_{j-1, k}}+A_{T_{j-1, k+1}}+\operatorname{chw}, F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w\right)+h w \\
& =F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w+h w  \tag{Reasoningabove}\\
& =F_{T_{j, f_{j}}}+\left(k+1-f_{j}\right) h w
\end{align*}
$$

## End of Proof of Lemma 18

Lemma 19 Let $f_{j}$ be the value of $k$ such that $T_{j, k}$ is the first full tile in stack $j$.
$F_{T_{l, k}}=F_{T_{l, f_{j}}}+\left(k-f_{j}\right) h w \quad\left(\right.$ for $\left.j \geq 1, l \geq j, f_{j} \leq k<f_{j-1}-1\right)$
Proof of Lemma 19:
Induction Hypothesis:
$F_{T_{l, k}}=F_{T_{l, f_{j}}}+\left(k-f_{j}\right) h w \quad\left(\right.$ for $\left.j \geq 1, l \geq j, f_{j} \leq k<f_{j-1}-1\right)$
Base Case $(l=j)$ :
$F_{T_{j, k}}=F_{T_{j, f_{j}}}+\left(k-f_{j}\right) h w \quad($ Lemma 18)

Induction Step: Suppose the induction hypothesis holds for $l$, show it also holds for $l+1$.
Induction Hypothesis:
$F_{T_{l+1, k}}=F_{T_{l+1, f_{j}}}+\left(k-f_{j}\right) h w \quad\left(\right.$ for $\left.j \geq 1, l+1 \geq j, f_{j} \leq k<f_{j-1}-1\right)$
Base Case ( $k=f_{j}$ ): Trivially true.
Induction Step: Suppose the induction hypothesis holds for $k$, show it also holds for $k+1$.

$$
\begin{array}{rlr}
F_{T_{l+1, k+1}}= & \max \left(F_{T_{l, k+1}}+c w y_{T_{l, k+1}},\right. & \\
& \left.F_{T_{l+1}, k}\right)+A_{T_{l+1, k+1}} & \text { (Form. 1) } \\
= & \max \left(F_{T_{l, k+1}}+c h w,\right. & \\
& \left.F_{T_{l+1}, k}\right)+h w & \left(A_{T_{l+1, k+1}}=h w\right) \\
= & \max \left(F_{T_{l, f_{j}}}+\left(k+1-f_{j}\right) h w+c h w,\right. & \\
& \left.F_{T_{l+1}, k}\right)+h w & \text { (1st induction hypothesis) } \\
= & \max \left(F_{T_{l, f_{j}}}+\left(k+1-f_{j}\right) h w+c h w,\right. & \\
& \left.F_{T_{l+1, f_{j}}}+\left(k-f_{j}\right) h w\right)+h w & \\
= & \max \left(F_{T_{l, f_{j}}}+\left(k+1-f_{j}\right) h w+h w+c h w,\right. & \\
& \left.F_{T_{l+1, f_{j}}}+\left(k+1-f_{j}\right) h w\right) & \\
= & F_{T_{l+1, f_{j}}}+\left(k+1-f_{j}\right) h w & \text { (Form. 1) }
\end{array}
$$

## End of Proof of Lemma 19

Lemma 20 Let $f_{j}$ be the value of $k$ such that $T_{j, k}$ is the first full tile in stack $j$.

$$
F_{T_{l, f_{j}}}=\max \left(F_{T_{j, f_{j}}}+(l-j)(c+1) h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right) \quad(\text { for } j \geq 1, l>j)
$$

## Proof of Lemma 20:

## Induction Hypothesis:

$F_{T_{l, f_{j}}}=\max \left(F_{T_{j, f_{j}}}+(l-j)(c+1) h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right) \quad($ for $j \geq 1, l>j)$
Base Case $(l=j+1)$ :

$$
\begin{align*}
F_{T_{j+1, f_{j}}} & =\max \left(F_{T_{j, f_{j}}}+c w y_{T_{j, f_{j}}}, F_{T_{j+1, f_{j}-1}}\right)+A_{T_{j+1, f_{j}}} & & \text { (Form. 1) }  \tag{Form.1}\\
& =\max \left(F_{T_{j, f_{j}}}+\operatorname{chw}, F_{T_{j+1, f_{j}-1}}\right)+h w & & \left(A_{T_{j+1, f_{j}}}=h w\right)
\end{align*}
$$

$$
=\max _{F_{\infty}}\left(\max \left(F_{T_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}, F_{T_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}+c h w\right.
$$

$$
\begin{equation*}
\left.F_{T_{j+1, f_{j}-1}}\right)+h w \tag{Form.1}
\end{equation*}
$$

$$
=\max \left(\max \left(F_{T_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}, F_{T_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}+\operatorname{chw}\right.
$$

$$
\left.\max \left(F_{T_{j, f_{j}-1}}+c w y_{T_{j, f_{j}-1}}, F_{T_{j+1, f_{j}-2}}\right)+A_{T_{j+1, f_{j}-1}}\right)+h w \quad \text { (Form. 1) }
$$

$$
=\max \left(\max \left(F_{T_{j-1, f_{j}}}+c w y_{T_{j-1, f_{j}}}, F_{T_{j, f_{j}-1}}\right)+A_{T_{j, f_{j}}}+c h w\right.
$$

$$
\left.F_{T_{j+1, f_{j}-2}}+h w\right)+h w
$$

$=\max \left(F_{T_{j, f_{j}}}+\operatorname{chw}, F_{T_{j+1, f_{j}-2}}+h w\right)+h w$
$\left(A_{T_{j+1, f_{j}-1}}=A_{T_{j, f_{j}}}=h w\right)$
$\max \left(F_{I_{j, f_{j}}}+c h w, F_{T_{j+1, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w+h w\right)+h w \quad$ (Lemma 18)
$=\max \left(F_{T_{j, f_{j}}}+(j+1-j)(c+1) h w, F_{T_{j+1, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right)$
Induction Step: Suppose the induction hypothesis holds for $l$, show it also holds for $l+1$.

$$
\begin{array}{rlrl}
F_{T_{l+1, f_{j}}} & =\max \left(F_{T_{l, f_{j}}}+c w y_{T_{l, f_{j}}}, F_{T_{l+1, f_{j}-1}}\right)+A_{T_{l+1, f_{j}}} & & \text { (Form. 1) } \\
& =\max \left(F_{T_{l, f_{j}}}+c w y_{T_{l, f_{j}}}\right. \\
& \left.\max \left(F_{T_{l, f_{j}-1}}+c w y_{T_{l, f_{j}-1}}, F_{T_{l+1, f_{j}-2}}\right)+A_{T_{l+1, f_{j}-1}}\right)+A_{T_{l+1, f_{j}}} & \text { (Form. 1) } \\
& =\max \left(\max \left(F_{T_{l-1, f_{j}}}+c w y_{T_{l-1, f_{j}}}, F_{T_{l, f_{j}-1}}\right)+A_{T_{l, f_{j}}}+c w y_{T_{l, f_{j}}}\right. & & \\
& \left.\max \left(F_{T_{l, f_{j}-1}}+c w y_{T_{l, f_{j}-1}}, F_{T_{l+1, f_{j}-2}}\right)+A_{T_{l+1, f_{j}-1}}\right)+A_{T_{l+1, f_{j}}} & \text { (Form. 1) } \\
& =\max \left(\max \left(F_{T_{l-1, f_{j}}}+c w y_{T_{l-1, f_{j}}}, F_{T_{l, f_{j}-1}}\right)+A_{T_{l, f_{j}}}+c w y_{T_{l, f_{j}}},\right. & & \\
& \left.\quad F_{T_{l+1, f_{j}-2}}+A_{T_{l+1, f_{j}-1}}\right)+A_{T_{l+1, f_{j}}} & & \left(A_{T_{l, f_{j}}}=A_{T_{l+1, f_{j}-1}}=h w\right) \\
& =\max \left(F_{T_{l, f_{j}}}+c w y_{T_{l, f_{j}}}, F_{T_{l+1, f_{j}-2}}+A_{T_{l+1, f_{j}-1}}\right)+A_{T_{l+1, f_{j}}} & & \text { (Form.1) }  \tag{Form.1}\\
& =\max \left(F_{\left.T_{l, f_{j}}+c h w, F_{T_{l+1, f_{j}-2}}+h w\right)+h w}\right. & \left(A_{T_{l+1, f_{j}}}=A_{T_{l+1, f_{j}-1}}=h w\right)
\end{array}
$$

$$
=\max \left(\max \left(F_{T_{j, f_{j}}}+(l-j)(c+1) h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right)\right.
$$

$$
\left.+c h w, F_{T_{l+1, f_{j}-2}}+h w\right)+h w
$$

(Induction hypothesis)

$$
=\max \left(\max \left(F_{T_{j, f_{j}}}+(l-j)(c+1) h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right)\right.
$$

$$
\begin{equation*}
\left.+c h w, F_{T_{l+1, f_{j+1}}}+\left(f_{j}-2-f_{j+1}\right) h w+h w\right)+h w \tag{Lemma19}
\end{equation*}
$$

$$
=\max \left(F_{T_{j, f_{j}}}+(l+1-j)(c+1) h w, F_{T_{l, f_{j}+1}}+\left(f_{j}-f_{j+1}\right) h w\right.
$$

$$
\left.+(c+1) h w, F_{T_{l+1}, f_{j+1}}+\left(f_{j}-f_{j+1}\right) h w\right)
$$

## End of Proof of Lemma 20

$$
=\max \left(F_{T_{j, f_{j}}}+(l+1-j)(c+1) h w, F_{T_{l+1, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right) \quad \text { (Form. 1) }
$$

Lemma 21 Let $f_{j}$ be the value of $k$ such that $T_{j, k}$ is the first full tile in stack $j$.

$$
F_{T_{l, f_{j}}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(l-j) h w, 0\right)+A_{S_{l, f_{j}}} \quad(\text { for } j \geq 1, l \geq j)
$$

## Proof of Lemma 21:

## Induction Hypothesis:

$F_{T_{l, f_{j}}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(l-j) h w, 0\right)+A_{S_{l, f_{j}}} \quad($ for $j \geq 1, l \geq j)$
Base Case ( $j=J$ where $J$ is the last stack): When $j=J$ we also know $l=j=J$, since $l \geq j$ and $J$ is the maximum value of both $l$ and $j$.

$$
\begin{aligned}
F_{T_{J, f_{J}}} & \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{J, f_{J}}} \\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(J-J) h w, 0\right)+A_{S_{J, f_{J}}}
\end{aligned}
$$

(Lemma 16)

Induction Step: Suppose the induction hypothesis holds for $j+1$ show it also holds for $j$.
Case $l=j$ :

$$
\begin{align*}
F_{T_{j, f_{j}}} & \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, f_{j}}}  \tag{Lemma16}\\
& \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-j) h w, 0\right)+A_{S_{j, f_{j}}}
\end{align*}
$$

Case $l>j$ :

$$
F_{T_{l, f_{j}}}=\max \left(F_{T_{j, f_{j}}}+(l-j)(c+1) h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right) \quad \quad(\text { Lemma } 20)
$$

$$
\leq \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, f_{j}}}+(l-j)(c+1) h w\right.
$$

$$
\begin{equation*}
\left.F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}^{2 r_{b}}\right) h w\right) \tag{Lemma16}
\end{equation*}
$$

$\leq \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+(l-j)(c+1) h w+A_{S_{l, f_{j}}}\right.$ $\left.+(l-j) r_{b} h w, F_{T_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right)$
$\leq \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+(l-j)(c+1) h w+A_{S_{l, f_{j}}}\right.$ $+(l-j) r_{b} h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)$ $\left.+\max \left(\left(1+r_{b}+c\right)(l-j-1) h w, 0\right)+A_{S_{l, f_{j+1}}}+\left(f_{j}-f_{j+1}\right) h w\right) \quad$ (Induction hypothesis)
$\leq \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+(l-j)\left(1+r_{b}+c\right) h w\right.$,
$\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(l-j-1) h w, 0\right)$
$\left.-\left(f_{j}-f_{j+1}\right) h w+\left(f_{j}-f_{j+1}\right) h w\right)+A_{S_{l, f_{j}}}$
$\left(A_{S_{l, f_{j}}}-A_{S_{l, f_{j+1}}}=\left(f_{j}-f_{j+1}\right) h w\right)$
$\leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{l, f_{j}}}$
$+\max \left((l-j)\left(1+r_{b}+c\right) h w,\left(1+r_{b}+c\right)(l-j-1) h w, 0\right)$
$\leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{l, f_{j}}}+\max \left((l-j)\left(1+r_{b}+c\right) h w, 0\right) \quad(l-j-1 \geq 0)$
End of Proof of Lemma 21
Now we can continue to prove Lemma 17, using a doubly nested induction proof.
Induction Hypothesis:
$F_{T_{j, k}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)+A_{S_{j, k}} \quad\left(\right.$ for $\left.j \geq 2, k \geq f_{2}\right)$
Base Case ( $k=f_{2}$ ):
$F_{T_{j, f_{2}}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{j, f_{2}}}+\max \left((j-2)\left(1+r_{b}+c\right) h w, 0\right) \quad$ (Lemma 21)
Induction Step: Suppose the induction hypothesis holds for $k-1$, show it also holds for $k$.
Induction Hypothesis:

$$
F_{T_{j, k}} \leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)+A_{S_{j, k}} \quad\left(\text { for } j \geq 2, k>f_{2}\right)
$$

Base Case ( $j=2$ ):
$F_{T_{2, k}}=\max \left(F_{T_{1, k}}+\operatorname{chw}, F_{T_{2, k-1}}\right)+h w$
$=\max \left(A_{S_{1, k}}+(c+1) h w, F_{T_{2, k-1}}+h w\right)$
$\leq \max \left(A_{S_{1, k}}+(c+1) h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)\right.$
$\left.+\max \left(\left(1+r_{b}+c\right)(2-2) h w, 0\right)+A_{S_{2, k-1}}+h w\right)$
(Induction hypothesis)
$\leq \max \left(A_{S_{1, k}}+(c+1) h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{2, k}}\right)$
$\left(A_{S_{2, k}}=A_{S_{2, k-1}}+h w\right)$
$\leq \max \left(\left(1+r_{b}+c\right) h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)\right)+A_{S_{2, k}}$
$\left(A_{S_{1, k}}-A_{S_{2, k}}=r_{b} h w\right)$
$\leq \max \left(\left(1+r_{b}+c\right) h w,\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{2, k}}$
$\leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+A_{S_{2, k}}$
$\left(1+r_{b}+c \leq c-\frac{1-r_{b}^{2}}{2 r_{b}}\right)$
$\leq \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left((2-2)\left(1+r_{b}+c\right) h w, 0\right)+A_{S_{2, k}}$

Induction Step: Suppose the induction hypothesis holds for $j-1$, show it also holds for $j$.

$$
\begin{array}{rlr}
F_{T_{j, k}}= & \max \left(F_{T_{j-1, k}}+c h w, F_{T_{j, k-1}}\right)+h w & \text { (Form. 1) } \\
\leq & \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2-1) h w, 0\right)\right. & \\
& \left.+A_{S_{j-1, k}}+c h w, F_{T_{j, k-1}}\right)+h w & \text { (2nd induction hypothesis) } \\
\leq & \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2-1) h w, 0\right)\right. & \\
& +A_{S_{j-1, k}}+c h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right) & \\
& \left.+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)+A_{S_{j, k-1}}\right)+h w & \\
\leq & \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-3) h w, 0\right)\right. & \\
& +A_{S_{j-1, k}}+(c+1) h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right) & \\
& \left.+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)+A_{S_{j, k}}\right) & \\
\leq & \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max ^{2}\left(\left(1+r_{b}+c\right)(j-3) h w, 0\right)\right. & \\
& +\left(1+r_{b}+c\right) h w, \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right) & \\
& \left.+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)\right)+A_{S_{j, k}} \\
\leq & \max \left(\max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2) h w,\right.\right. & \\
& \left.\left(1+r_{b}+c\right) h w\right), \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right) & \\
& \left.\left.+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)\right)+A_{S_{j, k}}-A_{S_{j, k}}=r_{b} h w\right) \\
\leq & \max \left(\left(c-\frac{1-r_{b}^{2}}{2 r_{b}}\right) h w, 0\right)+\max \left(\left(1+r_{b}+c\right)(j-2) h w, 0\right)+A_{S_{j, k}} & \\
\text { End of Proof } \operatorname{of~} \operatorname{Lemma} \mathbf{1 7}
\end{array}
$$


[^0]:    *Also at San Diego Supercomputing Center.
    ${ }^{\dagger}$ email: $\{$ hogstedt, carter, ferrante\}@cs.ucsd.edu. This work was partly supported by NSF grant CCR-9504150.

[^1]:    ${ }^{1}$ This condition is not a necessary condition.

[^2]:    ${ }^{2}$ In fact, it is possible to design a system that has negative $c$, though we know of no such real system.

[^3]:    ${ }^{3}$ Here and in subsequent sections we assume that if the stacks are so short that there is no row with a full tile in each stack, then we will assume that the stacks are higher and the results that we derive will then be an upper bound on the idle time. This follows from Observation 1.

[^4]:    ${ }^{4}$ The processor that finishes last will not necessarily be the $P$-th processor.

[^5]:    ${ }^{5}$ This step is really only true for $j>2$ since the first processor will not be delayed waiting for communication from an earlier processor, whereas the second processor will. But even though $p_{1}$ does not have to wait at the beginning of the execution of the first stack, it has to wait the same amount of time after the execution of its last stack. This means that we can, without losing accuracy in our calculations, assume that $p_{1}$ is in fact being idle at the beginning of the execution of the first stack.

