Any operation may be broken down into a *collection of operands*, a *calculation* and a *distribution of results*. Collection and distribution are data movement operations.

#### Data movement

- Simple memory access for a sequential machine.
- Local or non-local memory access for a distributed memory machine.

For our SIMD distributed memory machine, a non-local memory access requires data transfer via the inter-PE communication links.

#### Available Data Transfer Primitives



Given the network topology and the constraint that all PEs must perform identical operations, the only data transfer primitive possible is a shift.

## 1D Configuration



We have already mentioned that the MCU is responsible for reconfiguring the network for 1D arrays.

## Multi Sheet Arrays



Edge registers within the interface can provide the temporary storage required for multi-sheet shift.



With a little extra connectivity we can perform transpose operations with our edge registers.

## **Broadcast & Selection**



Edge registers can also be used for column/row broadcast, and column/row selection.



Alternatively we might add column/row *wire or* broadcast lines, allowing single cycle column/row extraction and column/row broadcast.<sup>1</sup>

<sup>&</sup>lt;sup>1</sup>Combining these connections can give global status information

Let us consider the coding of a problem which requires more complex data transfer.

## Matrix multiplication

For simplicity we will assume that the two matrices to be multiplied are the same size as the processor array.

e.g. all values zero?

Examine the data flow for a single result:



There is a similar data flow graph for each result.

We must allocate each node on the data flow graph to a single processor.

#### We wish to:

- Minimize Communications
- Maximize Parallelism

Looking at the problem we appear to have quite complex communications. The allocation of nodes to processors is not obvious.

#### In our favour:

- The result matrix is this the same size as the processor matrix.
- We have an excess of parallelism. ( $n^3$  parallelism,  $n^2$  processors)

We can partition the work such that each processor does all the calculations required to produce a single element of the result matrix.

Simply this involves providing a Processor P(i,j) with a copy of the two vectors  $A(,j)^2$  and B(i,). Then the calculation of R(i,j) is a local calculation.

In order to accomplish the required data movement we must take element A(i,j) and smear it across all processors P(-,j) using SMEAR\_X.<sup>3</sup>

Similarly we must take element B(i,j) and smear it across all processors P(i, j) using SMEAR\_Y.

 $<sup>^{2}</sup>$ A( ,j) = A(1:n,j)

<sup>&</sup>lt;sup>3</sup>SMEAR\_X is accomplished in hardware via column selection and broadcast.

We can perform the relevant data movements and calculations on the array in the following manner:

#### (Pseudo Fortran 90)

```
R\_tot = 0
DO loop k = 1, n
A\_col\_k = SMEAR\_X(A, k)
B\_row\_k = SMEAR\_Y(B, k)
R\_tot = R\_tot + (A\_col\_k * B\_row\_k)
loop
```

Note that all operations act on all array elements at once. The Dot Product of the two n by n matrices is the elemental sum of n elemental multiplications.

#### Data Transforms

The Matrix multiplication can be divided into two distinct tasks:

- Data Transform
- Calculation

We have shown how regular data transforms can be built from simple shift operations which make use of the edge registers.

We have also shown that certain operations may be accelerated using column and row broadcast lines.

Other Transforms require the use of *activity control* in order selectively move data items. With *activity control* it is possible to produce an arbitrary data transform. Although this may take a very long time.

#### Hardware Data Transforms

Data Transforms are needed in most real problems.

Depending on the problem we might spend more time re-arranging data than processing it.

• Hardware Data Transforms.

We can implement specific transforms with extra hardware in order to reduce transform time.

We must balance our communications capability and our processing capability.

How do we know if we have done it right?

It is often easier to say that it is wrong!

#### Conclusions

#### **Process**

- Decide Algorithm
- Map Problem onto array
  - Minimize Communications
  - Maximize Parallelism

## We can't ignore

- Our machine is SIMD
- Processor array size
- Processor connectivity