It is difficult to overestimate the magnitude of the
discontinuity that the high performance computing (HPC) community is about to
experience because of the emergence of the next generation of multi-core and heterogeneous
processor designs. For at least two decades, HPC programmers have taken it for
granted that each successive generation of microprocessors would, either
immediately or after minor adjustments, make their old software run
substantially faster. But three main factors are converging to bring this "free
ride" to an end.
First, system builders have encountered intractable physical barriers - too
much heat, too much power consumption, and too much current leakage - to
further increases in clock speeds. Second, physical limits on the number and
bandwidth of pins on a single chip mean that the gap between processor
performance and memory performance, which is already substantial, will only
continue to widen.
Finally, the design trade-offs being made to address the previous two factors
will render commodity processors, absent any further augmentation, inadequate
for the purposes of tera- and petascale systems for advanced applications.
This daunting combination of obstacles has forced the designers of new multi-core
and hybrid systems, searching for more computing power, to explore
architectures that software built on the old model is unable to exploit
effectively without radical modification. Currently available linear algebra
software packages rely on parallel implementations of the Basic Linear Algebra
Subprograms (BLAS) to take advantage of multiple execution units. This solution
is characterized by a fork-join model of parallel execution, which may result
in suboptimal performance on current and future generations of multi-core
processors, since the non-parallelizable portions of code between successive
parallel BLAS calls introduce strict dependencies. The PLASMA project aims to
overcome the
shortcomings of this approach by introducing a pipelined model of parallel
execution.
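
To make the contrast concrete, the following sketch caricatures both models
for a one-sided tile factorization in C, using OpenMP pragmas as a stand-in
scheduling mechanism. It is a minimal illustration under stated assumptions,
not PLASMA code: the kernel names (panel_factor, trailing_update), the tile
count NT, and the use of OpenMP task dependencies are placeholders chosen
for exposition.

/* Sketch contrasting the fork-join and pipelined models for a
 * one-sided tile factorization.  All names here are hypothetical
 * placeholders for exposition, not the PLASMA API.
 * Compile with: cc -fopenmp sketch.c */
#include <stdio.h>

#define NT 4  /* number of tile columns (illustrative) */

static void panel_factor(int k)           { (void)k;          /* factor panel k (stub) */ }
static void trailing_update(int k, int j) { (void)k; (void)j; /* update tile j (stub)  */ }

/* Fork-join: each step forks a parallel region for the trailing update
 * and joins at an implicit barrier; the sequential panel factorization
 * then runs alone while every other execution unit sits idle. */
static void factor_fork_join(void)
{
    for (int k = 0; k < NT; k++) {
        panel_factor(k);                     /* sequential portion */
        #pragma omp parallel for             /* fork */
        for (int j = k + 1; j < NT; j++)
            trailing_update(k, j);
        /* implicit join: no thread proceeds until all updates finish */
    }
}

/* Pipelined: each kernel becomes a task whose data dependencies are
 * declared explicitly, so the runtime may begin panel k+1 as soon as
 * tile k+1 has been updated, overlapping it with the rest of step k. */
static void factor_pipelined(void)
{
    static char tile[NT];                    /* dependency handles only */
    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < NT; k++) {
            #pragma omp task depend(inout: tile[k])
            panel_factor(k);
            for (int j = k + 1; j < NT; j++) {
                #pragma omp task depend(in: tile[k]) depend(inout: tile[j])
                trailing_update(k, j);
            }
        }
    }   /* all tasks complete at the end of the parallel region */
}

int main(void)
{
    factor_fork_join();
    factor_pipelined();
    puts("both variants completed");
    return 0;
}

In the fork-join variant, the implicit barrier after each parallel loop
forces every execution unit to wait for the slowest update and then idle
through the sequential panel factorization. In the pipelined variant, the
declared dependencies give the runtime enough dataflow information to start
the next panel as soon as the one tile it needs has been updated,
overlapping the sequential portion of step k+1 with the remaining updates
of step k.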