High Performance Computing Center North
In this exercise you will implement the Scalable Universal Matrix Multiplication Algorithm (SUMMA)  to compute the matrix product C = AB; A mxn and B nxp. You will implement the multiplication with m = n = p = dim (it is not much more difficult to implement the general case).
The algorithm is presented in algorithm 4.3-4.4 in Scientific Computing on High Performance Computers. For the broadcast operation use a pipelined operation according to algorithm 4.3, not logarithmic scheme (why?).
In the file matmult.c and matmul.f90 there is a framework for initiating the processors and distributing a matrix over the 2D grid in rectangular blocks. Look in the files for more information. matmult.c uses nrutil.c and nrutil.h so copy all three files if you are going to use matmult.c.
Observe that the distributed blocks of the matrices need not be square, they are often rectangular. This is not a problem when computing the matrix product. To obtain a higher performance use BLAS for the rank-1 update, see man sger.
The pictures in this PDF describes the variables in matmult.c
Run the routine for a different number of processors with a matrix of the size 2000. Plot (with Matlab or similar) 1) M-flops against the number of processors, 2) the ratio M-flops/n against n. Try to explain the result. You could try a smaller/larger matrix and see if you get a similar result.
Source code that may become useful: matmul.f90, matmult.c, nrutil.c and nrutil.h
The files can be downloaded by this link: ass1.tar.gz
References  R. A. van de Geijn and J. Watts. Summa: Scalable universal matrix multiplication algorithm. Technical Report CS-95-286, Univ. Tennessee, 1995.
Original assignment by Mikael Adlers, Linköpings Universitet.