Performance of Portable Sparse Matrix-Vector Product Implemented Using OpenACC

Abstract—The aim of this paper is to study the performance of OpenACC implementations of the sparse matrix-vector product (SpMV) for several storage formats (CSR, ELL, JAD, pJAD, and BSR) on Intel CPU and NVIDIA GPU platforms, and to compare them with the SpMV implementations based on the BSR storage format provided by the Intel MKL and NVIDIA cuSPARSE libraries. Numerical experiments show that the vendor-provided BSR implementation is the fastest on CPUs, while on GPUs the pJAD storage format achieves better performance.


I. INTRODUCTION
Sparse matrix-vector product (SpMV) is a central part of many numerical algorithms, and its performance can have a significant impact on the performance of scientific and engineering applications [1], [2]. There are many sparse matrix storage formats and sophisticated techniques for developing efficient SpMV implementations that utilize the underlying hardware of modern multicore CPUs and GPUs [3], [4], [5], [6], [7], [8], [9]. Unfortunately, these methods are rather complicated and usually depend on a particular computer architecture, so developing efficient and portable sparse matrix source code is still a challenge. However, the results presented in [10] and [11] show that simple SPARSKIT SpMV routines using various storage formats (CSR, ELL, JAD) [1] can be easily and efficiently adapted to modern CPU-based or GPU-accelerated architectures. Loops in the source code can be easily parallelized using OpenMP [12] or OpenACC [13], [14] directives, while the rest of the work is done by the compiler. Such parallelized SpMV routines achieve performance comparable to that of the SpMV routines available in libraries optimized by hardware vendors (i.e., Intel MKL and NVIDIA cuSPARSE). OpenACC, a standard for accelerated computing, provides compiler directives for offloading C/C++ programs from a host to attached accelerator devices. These simple directives allow marking regions of source code for automatic acceleration in a portable, vendor-independent manner. Moreover, OpenACC programs can be compiled with the multicore option, and such programs can then also run on CPU-based architectures [15], [16], [17] without any changes to the source code.
Recently, the Block Compressed Sparse Row (BSR) format [18], [19], which is a generalization of the Compressed Sparse Row (CSR) format, has become very popular. Intel MKL and NVIDIA cuSPARSE provide optimized SpMV implementations for this format, while other formats have been deprecated; in particular, BSR has replaced the HYB format in cuSPARSE. In this paper we compare the performance of portable OpenACC implementations of the sparse matrix-vector product for the CSR, ELL, JAD, pJAD, and BSR formats with the performance of the SpMV implementations based on the BSR storage format provided in the Intel MKL and NVIDIA cuSPARSE libraries.

II. SPARSE MATRIX REPRESENTATIONS
Let us assume that A is a sparse matrix with a significant number of zero entries, and x, y are dense vectors. The SpMV operation is defined as follows:

  y ← y + A x.    (1)

It is clear that if we do not multiply entries of x by zero entries of A, then (1) requires 2 · n_nz floating point operations (one multiplication and one addition per nonzero entry of A). The structure of a sparse matrix can be characterized by n, n_nz, n_nz/n, and max_nz, where n is the number of rows, n_nz is the total number of nonzero elements, n_nz/n is the average number of nonzero elements per row, and max_nz is the largest number of nonzero elements in a row. Table I shows the values of these parameters for a set of test matrices selected from the Matrix Market [20] and the University of Florida Sparse Matrix Collection [21]. It is clear that the performance of SpMV depends on how well the matrix storage format utilizes the underlying hardware. For the purpose of describing several possible sparse matrix storage formats, let us consider an example matrix with n = 4, n_nz = 8, n_nz/n = 2, and max_nz = 3. Now let us consider a few basic (ELL, JAD, CSR [1], [22]) as well as more sophisticated (pJAD [11], BSR [18], [19]) storage formats for sparse matrices.

A. ELL
The ELL storage format was introduced in the Ellpack-Itpack package. It assumes that a sparse matrix is represented by two arrays (Figure 1). Nonzero elements are stored in the first one, called a. The second one, called ja, contains the corresponding column indices [23]. Both arrays are of size n × ncol, where ncol = max_nz. While ELL is simple and provides easy access to matrix entries, when n_nz/n ≪ max_nz the number of stored zero entries of the matrix increases significantly.

B. JAD
The JAD (i.e., Jagged Diagonal) storage format is represented by three arrays (Figure 2). It is similar to ELL, but removes the assumption of fixed-length rows [22]. First, the rows of a sparse matrix need to be sorted in non-increasing order of the number of nonzeros per row. The arrays a and ja of size n_nz contain the nonzero elements (i.e., the jagged diagonals) and the corresponding column indices. The array ia contains the beginning position of each jagged diagonal. Additionally, we can add an array rlen which contains the number of nonzero elements in each row. Entries of this array can be calculated (in parallel) using the following formula. Let jdiag be the number of jagged diagonals. Then for each row i = 0, ..., n − 1 we have

  rlen[i] = |{ d : 0 ≤ d < jdiag and ia[d+1] − ia[d] > i }|,

i.e., the number of jagged diagonals long enough to contain an entry of row i. Note that this format is free of the inconvenience associated with storing the zero elements needed to pad rows to the width of max_nz.

C. pJAD
The pJAD storage format is an optimized version of JAD (Figure 3). This format assumes aligning (padding) the columns of the arrays a and ja [11]. Zero elements are added so that the number of elements in each column is a multiple of a given bsize and the rows of each block have the same length. Entries of the array brlen contain the widths of the blocks of bsize rows. Note that pJAD stores at most jdiag · (bsize − 1) additional zero entries, where jdiag is the number of jagged diagonals stored in a. Padding the jagged diagonals is especially important for GPUs: it allows coalesced memory access and reduces thread divergence within a block of threads [24].

D. CSR
In the CSR (Compressed Sparse Row) format a sparse matrix is represented by three arrays (Figure 4): the array a contains the nonzero elements stored row by row, ja contains the corresponding column indices, and ia contains n + 1 row pointers, where ia[i] is the position in a of the first nonzero element of row i and ia[n] = n_nz.

E. BSR
The BSR storage format can be treated as a generalization of CSR in which nonzero elements are grouped into small dense blocks. A sparse matrix is represented by four arrays (Figure 5).

III. ALGORITHMS
The SpMV operation for all storage formats presented in Section II can be implemented using OpenACC so that it can be executed on both GPU-accelerated and CPU-based systems. OpenACC offers compiler directives for offloading selected computations from the host to attached accelerator devices. It makes it possible to mark regions of source code for automatic parallelization in a portable manner. Algorithms 1, 2, 3, 4, and 5 show how to implement SpMV in C/C++ using OpenACC for the ELL, JAD, pJAD, CSR, and BSR formats, respectively. The OpenACC-specific parts of the implementation start with #pragma acc directives. The parallel loop directive defines a loop to be accelerated on the GPU. The additional clauses gang and vector_length specify that gangs (i.e., blocks of threads) should perform the iterations of the loop; threads within gangs work in vector (SIMD) mode [13]. The loop seq construct placed before a loop inside a parallel loop region indicates that this loop should be executed sequentially by a single thread. The present clause states that the indicated variables have already been allocated on the GPU, which avoids unnecessary data movement between the host and device memory. OpenACC provides the data construct that can be used to specify such a scope of data in accelerated regions; data transfers can also be initiated using the enter data and exit data constructs [13]. Figure 6 shows the output messages generated by the compiler with the -acc=gpu option.
When OpenACC programs are compiled with the -acc=multicore option, the compiler generates appropriate parallel regions to be executed in parallel on the CPU cores (Figure 7). Note that if we omit the OpenACC directives, we get sequential implementations of SpMV.
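Assuming the NVIDIA HPC SDK C compiler nvc and a source file spmv.c (both names are placeholders), the build variants discussed above could be produced as follows; the -Minfo=accel option prints compiler feedback of the kind shown in Figures 6 and 7:

```shell
# GPU offload build (feedback as in Figure 6)
nvc -acc=gpu -Minfo=accel -O3 -o spmv_gpu spmv.c

# Multicore CPU build (feedback as in Figure 7)
nvc -acc=multicore -Minfo=accel -O3 -o spmv_cpu spmv.c

# Plain sequential build: the OpenACC directives are simply ignored
gcc -O3 -o spmv_seq spmv.c
```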

IV. PERFORMANCE OF SPMV
All OpenACC implementations of SpMV have been tested on a computer equipped with two Xeon Gold 6342 @ 2.80 GHz CPUs (48 cores in total) and an NVIDIA A40 GPU (10752 cores, FP64 peak performance 584.6 GFLOPS), running under the Linux operating system with the Intel oneAPI and NVIDIA HPC compiler suites. The results have been compared with the SpMV implementations using the BSR storage format provided in the Intel MKL and NVIDIA cuSPARSE libraries. Table II shows the performance (in GFLOPS) obtained for all considered implementations, for both the CPUs and the GPU and the set of sparse matrices from Table I, calculated as 2 · n_nz / (t · 10^9), where t is the execution time of SpMV (in seconds). Note that in the case of BSR the table shows the best performance achieved for the optimal block size, determined empirically. All experiments have been performed in FP64 arithmetic.

V. RESULTS OF EXPERIMENTS
On the CPU, the best performance for the majority of matrices is obtained by the Intel MKL BSR implementation. For the smaller matrices, the best results are achieved by the OpenACC implementation of SpMV using the CSR format. The other OpenACC implementations achieve worse performance than Intel MKL BSR; in particular, OpenACC BSR is much slower than its well-optimized counterpart. In most cases pJAD achieves better performance than JAD, but its performance is still worse than that of the optimized BSR provided by Intel. The ELL format gives the worst performance for matrices with n_nz/n ≪ max_nz, for which the number of stored zero entries increases significantly.
On the GPU, the pJAD implementation achieves the best results for the largest number of matrices (eleven). It outperforms the JAD format for all matrices and NVIDIA cuSPARSE BSR for almost all matrices. Moreover, for several matrices pJAD outperforms Intel MKL BSR significantly. The second best format is CSR, which gives the best results for seven matrices; for the other matrices, the pJAD format is always better and the JAD format is almost always better. The ELL format achieves the best results for five matrices (cry10000, af23560, ecology2, thermal2, atmosmodl), most of which have almost the same number of nonzeros in every row (n_nz/n ≈ max_nz). NVIDIA cuSPARSE BSR gains the highest performance for only one matrix (ldoor); however, for larger matrices it is much faster than OpenACC BSR. As on the CPU, the difference between the OpenACC and NVIDIA cuSPARSE implementations of BSR is significant.

VI. CONCLUSIONS AND FUTURE WORK
We have shown that the sparse matrix-vector product using several storage formats can easily be implemented with OpenACC in order to utilize the underlying hardware of modern CPUs and GPUs. Our implementations achieve reasonable performance on both GPU and CPU, in some cases comparable with the performance of vendor-optimized implementations using the BSR format, and sometimes even better.
The use of pJAD seems very promising for GPUs: its portable OpenACC implementation achieves much better performance than the BSR implementation optimized by the vendor. In the future we plan to provide a non-portable optimized version of SpMV using pJAD.

Fig. 7: Compiler output messages of Algorithm 1 compiled using -acc=multicore

Proceedings of the 18th Conference on Computer Science and Intelligence Systems, pp. 1155-1160