Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations

Kashi, Aditya; Nayak, Pratik; Kulkarni, Dhruva; Scheinberg, Aaron; Lin, Paul; Anzt, Hartwig


Title	Batched sparse iterative solvers on GPU for the collision operator for fusion plasma simulations
Publication Type	Conference Paper
Year of Publication	2022
Authors	Kashi, A., P. Nayak, D. Kulkarni, A. Scheinberg, P. Lin, and H. Anzt
Conference Name	2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Date Published	2022-07
Publisher	IEEE
Conference Location	Lyon, France
Abstract	Batched linear solvers, which solve many small related but independent problems, are important in several applications. This is increasingly the case for highly parallel processors such as graphics processing units (GPUs), which need a substantial amount of work to keep them operating efficiently, so that solving smaller problems one-by-one is not an option. Because of the small size of each problem, the task of coming up with a parallel partitioning scheme and mapping the problem to hardware is not trivial. In recent history, significant attention has been given to batched dense linear algebra. However, there is also an interest in utilizing sparse iterative solvers in a batched form, and this presents further challenges. An example use case is found in a gyrokinetic Particle-In-Cell (PIC) code used for modeling magnetically confined fusion plasma devices. The collision operator has been identified as a bottleneck, and a proxy app has been created for facilitating optimizations and porting to GPUs. The current collision kernel linear solver does not run on the GPU, a major bottleneck. As these matrices are well-conditioned, batched iterative sparse solvers are an attractive option. A batched sparse iterative solver capability has recently been developed in the Ginkgo library. In this paper, we describe how the software architecture can be used to develop an efficient solution for the XGC collision proxy app. Comparisons for the solve times on NVIDIA V100 and A100 GPUs and AMD MI100 GPUs with one dual-socket Intel Xeon Skylake CPU node with 40 OpenMP threads are presented for matrices representative of those required in the collision kernel of XGC.
The results suggest that Ginkgo's batched sparse iterative solvers are well suited for efficient utilization of the GPU for this problem, and the performance portability of Ginkgo in conjunction with Kokkos (used within XGC as the heterogeneous programming model) allows seamless execution for exascale-oriented heterogeneous architectures at the various leadership supercomputing facilities.
URL	https://ieeexplore.ieee.org/document/9820663
DOI	10.1109/IPDPS53621.2022.00024

File:

icl-utk-1608-2022.pdf
