Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product

Title	Accelerating the LOBPCG method on GPUs using a blocked Sparse Matrix Vector Product
Publication Type	Conference Paper
Year of Publication	2015
Authors	Anzt, H., S. Tomov, and J. Dongarra
Conference Name	Spring Simulation Multi-Conference 2015 (SpringSim'15)
Date Published	2015-04
Publisher	SCS
Conference Location	Alexandria, VA
Abstract	This paper presents a heterogeneous CPU-GPU implementation of a sparse iterative eigensolver, the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method. For the key routine, which generates the Krylov search spaces via the product of a sparse matrix and a block of vectors, we propose a GPU kernel based on a modified sliced ELLPACK format. Blocking a set of vectors and processing them simultaneously significantly accelerates the computation of a set of consecutive SpMVs. Comparing the performance against similar routines from Intel's MKL and NVIDIA's cuSPARSE library, we identify appealing performance improvements. We integrate the kernel into the highly optimized LOBPCG implementation. Compared to the BLOPEX CPU implementation running on two eight-core Intel Xeon E5-2690s, we accelerate the computation of a small set of eigenvectors using NVIDIA's K40 GPU by typically more than an order of magnitude.
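
To illustrate the idea of a blocked sparse matrix vector product, the following is a minimal CPU sketch in plain Python. It is not the paper's GPU kernel and it uses the standard sliced ELLPACK layout rather than the modified variant the abstract mentions; the function names `to_sliced_ell` and `blocked_spmv`, the slice height, and the list-of-lists data layout are all assumptions made for this example. The point it shows is that each stored matrix entry is read once and applied to every vector in the block, instead of being re-read in separate SpMV calls.

```python
# Sketch only: standard sliced ELLPACK on the CPU, not the paper's
# modified format or its GPU kernel. All names here are hypothetical.

def to_sliced_ell(dense, slice_height):
    """Convert a dense matrix (list of rows) to a sliced ELLPACK layout.

    Rows are grouped into slices of `slice_height` rows. Each slice is
    padded only to the length of its own longest row, which wastes less
    storage than padding every row to the global maximum.
    """
    slices = []
    for start in range(0, len(dense), slice_height):
        rows = dense[start:start + slice_height]
        # Collect the (column, value) pairs of the nonzeros in each row.
        nz = [[(j, v) for j, v in enumerate(row) if v != 0] for row in rows]
        width = max((len(r) for r in nz), default=0)
        # Pad short rows with column 0 and value 0.0 (contributes nothing).
        cols = [[c for c, _ in r] + [0] * (width - len(r)) for r in nz]
        vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in nz]
        slices.append((start, cols, vals))
    return slices


def blocked_spmv(slices, n_rows, X):
    """Compute Y = A * X, where X holds k vectors as an n-by-k list of rows.

    Each matrix entry is loaded once and reused for all k vectors.
    """
    k = len(X[0])
    Y = [[0.0] * k for _ in range(n_rows)]
    for start, cols, vals in slices:
        for r, (row_cols, row_vals) in enumerate(zip(cols, vals)):
            out = Y[start + r]
            for c, a in zip(row_cols, row_vals):
                x_row = X[c]
                for v in range(k):  # reuse entry `a` across the whole block
                    out[v] += a * x_row[v]
    return Y


if __name__ == "__main__":
    A = [[4, 0, 1, 0],
         [0, 3, 0, 0],
         [2, 0, 5, 1],
         [0, 0, 0, 6]]
    X = [[1, 0], [0, 1], [1, 1], [2, 0]]  # block of two vectors
    print(blocked_spmv(to_sliced_ell(A, slice_height=2), len(A), X))
```

Storing the block of vectors row by row, as assumed here, keeps the k values needed for one matrix entry adjacent in memory; whether the paper's kernel uses the same layout is not stated in the abstract.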
