The performance (in particular the speedup) of the micromagnetics application has been measured on a Compaq SC45 cluster consisting of 11 AlphaServer ES45 nodes, each with four Alpha processors (EV68 @ 1 GHz, 8 MB cache per CPU) and 16 GB of shared memory. The nodes are interconnected with a Quadrics switch, which provides a maximum MPI bandwidth of 600 MB/s. Since this machine has been shared with several other users, at most 24 processors have been available for the speedup measurements.
The speedup has been measured as
\[
  S(p) = \frac{t_{\mathrm{wall}}(1)}{t_{\mathrm{wall}}(p)},
\]
where $t_{\mathrm{wall}}(p)$ denotes the wall-clock time required for the run on $p$ processors.
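For illustration, the following minimal sketch shows how such wall-clock times can be obtained in an MPI code by bracketing the solution phase with \texttt{MPI\_Wtime()}; the routine \texttt{solve\_problem()} is merely a hypothetical placeholder for the solution part of the application and not part of the actual code.
\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

/* Placeholder for the solution part of the application
   (energy minimization or time integration). */
static void solve_problem(void)
{
  /* ... actual parallel computation ... */
}

int main(int argc, char **argv)
{
  int    rank;
  double t_start, t_end;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before starting the clock */
  t_start = MPI_Wtime();

  solve_problem();

  MPI_Barrier(MPI_COMM_WORLD);   /* wait until all processes have finished */
  t_end = MPI_Wtime();

  /* t_wall(p); the speedup follows as S(p) = t_wall(1)/t_wall(p) */
  if (rank == 0)
    printf("wall-clock time: %g s\n", t_end - t_start);

  MPI_Finalize();
  return 0;
}
\end{verbatim}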
The energy minimization method, which uses the LMVM method of the TAO package (cf. Sec. 4.1), has been applied to calculate the nucleation field of FePt nanoparticles (cf. Sec. 8.3). The timing results are summarized in Fig. 6.5. On 8 and 16 processors we find ``superlinear'' behavior of the solution part of the application. This is a well-known phenomenon in parallel computing and can be attributed to caching effects: since the same total amount of data is distributed over more processors, the share per processor decreases and may eventually become small enough to fit into the fast cache memory of modern computer architectures. As a result, the data no longer have to be fetched from the much slower main memory and the calculations complete considerably faster. However, as still more processors are used, communication takes up an increasing fraction of the time, which eventually leads to a saturation of the speedup.
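To illustrate how the LMVM minimizer is driven through the TAO library, the following sketch sets up and solves an energy minimization problem. The calls follow the TAO interface as it appears in recent PETSc releases, so the exact function names may differ from the TAO version used for these measurements; \texttt{FormEnergyAndGradient} and \texttt{AppCtx} are hypothetical placeholders rather than the routines of the actual application.
\begin{verbatim}
#include <petsctao.h>

typedef struct {
  PetscReal exchange_constant;   /* hypothetical material parameter */
} AppCtx;

/* Evaluates the total energy E(x) and its gradient g for the
   distributed magnetization vector x (placeholder implementation). */
static PetscErrorCode FormEnergyAndGradient(Tao tao, Vec x, PetscReal *E,
                                            Vec g, void *ctx)
{
  *E = 0.0;          /* ... sum of exchange, anisotropy, stray field,
                            and Zeeman energy ... */
  VecSet(g, 0.0);    /* ... corresponding gradient ... */
  return 0;
}

int main(int argc, char **argv)
{
  Tao    tao;
  Vec    x;
  AppCtx user;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* distributed vector of magnetization unknowns (3 per mesh node) */
  VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 3000, &x);
  VecSet(x, 1.0);                    /* ... initial magnetization ... */

  TaoCreate(PETSC_COMM_WORLD, &tao);
  TaoSetType(tao, TAOLMVM);          /* limited memory variable metric method */
  TaoSetInitialVector(tao, x);
  TaoSetObjectiveAndGradientRoutine(tao, FormEnergyAndGradient, &user);
  TaoSetFromOptions(tao);            /* allow -tao_* runtime options */
  TaoSolve(tao);                     /* minimize the energy in parallel */

  TaoDestroy(&tao);
  VecDestroy(&x);
  PetscFinalize();
  return 0;
}
\end{verbatim}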
As Fig. 6.6 shows, the parallel time integration using PVODE does not parallelize as efficiently as the TAO package.
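PVODE has since been merged into the CVODE integrator of the SUNDIALS suite. As a rough illustration of the parallel time integration, the sketch below therefore uses CVODE calls from a newer SUNDIALS release; function names and signatures differ between releases and from the original PVODE interface, and \texttt{RhsFn} is a hypothetical placeholder for the Landau-Lifshitz-Gilbert right-hand side.
\begin{verbatim}
#include <mpi.h>
#include <cvode/cvode.h>               /* BDF/Adams integrators */
#include <cvode/cvode_spgmr.h>         /* scaled preconditioned GMRES */
#include <nvector/nvector_parallel.h>

/* Placeholder for the right-hand side dm/dt = f(t, m) of the
   Landau-Lifshitz-Gilbert equation on the locally owned mesh nodes. */
static int RhsFn(realtype t, N_Vector m, N_Vector mdot, void *user_data)
{
  N_VConst(0.0, mdot);   /* ... effective field and LLG torque ... */
  return 0;
}

int main(int argc, char **argv)
{
  int nprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  long int local_n  = 300;              /* e.g. 100 local nodes x 3 components */
  long int global_n = (long int)nprocs * local_n;

  N_Vector m = N_VNew_Parallel(MPI_COMM_WORLD, local_n, global_n);
  N_VConst(1.0, m);                     /* ... initial magnetization ... */

  void *cvode_mem = CVodeCreate(CV_BDF, CV_NEWTON);   /* stiff BDF integrator */
  CVodeInit(cvode_mem, RhsFn, 0.0, m);
  CVodeSStolerances(cvode_mem, 1e-6, 1e-8);
  CVSpgmr(cvode_mem, PREC_NONE, 0);     /* Krylov solver for the Newton step */

  realtype t;
  CVode(cvode_mem, 1e-9, m, &t, CV_NORMAL);   /* advance to t_out */

  CVodeFree(&cvode_mem);
  N_VDestroy(m);
  MPI_Finalize();
  return 0;
}
\end{verbatim}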
For comparison, Tab. 6.7 shows the speedup obtained (for a different problem) on a Beowulf-type cluster of 900 MHz AMD PCs running Linux [79]. These machines are linked with a standard switched 100 Mbit Ethernet network.