Performance of Portable Sparse Matrix-Vector Product Implemented Using OpenACC

Kinga Stec; Przemyslaw Stpiczynski

Performance of Portable Sparse Matrix-Vector Product Implemented Using OpenACC

Kinga Stec, Przemyslaw Stpiczynski

DOI: http://dx.doi.org/10.15439/2023F9640

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1155–1160 (2023)

Full text

Abstract. The aim of this paper is to study the performance of OpenACC implementations of sparse matrix-vector product for several storage formats: CSR, ELL, JAD, pJAD, and BSR, achieved on various Intel CPU and NVIDIA GPU platforms to compare them with the performance of SpMV implementations using the BSR storage format provided by Intel MKL and NVIDIA cuSPARSE libraries. Numerical experiments show that vendor-provided BSR is the best format for CPUs but in case of GPUs, the pJAD storage format allows to achieve better performance.

References

Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003.
R. Helfenstein and J. Koko, “Parallel preconditioned conjugate gradient algorithm on GPU,” J. Computational Applied Mathematics, vol. 236, no. 15, pp. 3584–3590, 2012. http://dx.doi.org/10.1016/j.cam.2011.04.025
X. Feng, H. Jin, R. Zheng, Z. Shao, and L. Zhu, “A segment-based sparse matrix-vector multiplication on CUDA,” Concurrency and Computation: Practice and Experience, vol. 26, no. 1, pp. 271–286, 2014. http://dx.doi.org/10.1002/cpe.2978
J. C. Pichel, J. A. Lorenzo, F. F. Rivera, D. B. Heras, and T. F. Pena, “Using sampled information: is it enough for the sparse matrix-vector product locality optimization?” Concurrency and Computation: Practice and Experience, vol. 26, no. 1, pp. 98–117, 2014. http://dx.doi.org/10.1002/cpe.2949
F. Vázquez, G. O. López, J. Fernández, and E. M. Garzón, “Improving the performance of the sparse matrix vector product with GPUs,” in 10th IEEE International Conference on Computer and Information Technology, CIT 2010, Bradford, West Yorkshire, UK, June 29-July 1, 2010, 2010. http://dx.doi.org/10.1109/CIT.2010.208 pp. 1146–1151.
S. Williams, L. Oliker, R. W. Vuduc, J. Shalf, K. A. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging multicore platforms,” Parallel Computing, vol. 35, no. 3, pp. 178–194, 2009. http://dx.doi.org/10.1016/j.parco.2008.12.006
K. K. Matam and K. Kothapalli, “Accelerating sparse matrix vector multiplication in iterative methods using GPU,” in International Conference on Parallel Processing, ICPP 2011, Taipei, Taiwan, September 13-16, 2011, 2011. http://dx.doi.org/10.1109/ICPP.2011.82 pp. 612–621.
C. Stylianou and M. Weiland, “Exploiting dynamic sparse matrices for performance portable linear algebra operations,” in IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC, P3HPC@SC 2022, Dallas, TX, USA, November 13-18, 2022. IEEE, 2022. http://dx.doi.org/10.1109/P3HPC56579.2022.00010 pp. 47–57.
B. Yilmaz, “A novel graph transformation strategy for optimizing SpTRSV on CPUs,” Concurrency and Computation Practice and Experience, 2023. http://dx.doi.org/10.1002/cpe.7761
B. Bylina, J. Bylina, P. Stpiczyński, and D. Szałkowski, “Performance analysis of multicore and multinodal implementation of SpMV operation,” in Proceedings of the Federated Conference on Computer Science and Information Systems, September 7-10, 2014, Warsaw, Poland. IEEE, 2014. http://dx.doi.org/10.15439/2014F313 pp. 575–582.
P. Stpiczyński, “Semiautomatic acceleration of sparse matrix-vector product using OpenACC,” in Parallel Processing and Applied Mathematics, 11th International Conference, PPAM 2015, Kraków, Poland, September 6-9, 2015, Revised Selected Papers, Part II, ser. Lecture Notes in Computer Science, vol. 9574. Springer, 2016. http://dx.doi.org/10.1007/978-3-319-32152-3_14 pp. 143–152.
R. van der Pas, E. Stotzer, and C. Terboven, Using OpenMP – The Next Step. Affinity, Accelerators, Tasking, and SIMD. Cambridge MA: MIT Press, 2017.
S. Chandrasekaran and G. Juckeland, Eds., OpenACC for Programmers: Concepts and Strategies. Addison-Wesley, 2018.
R. Farber, Ed., Parallel Programming with OpenACC. Morgan Kaufmann, 2017.
H. J. Eberl and R. Sudarsan, “OpenACC parallelisation for diffusion problems, applied to temperature distribution on a honeycomb around the bee brood: A worked example using BiCGSTAB,” in Parallel Processing and Applied Mathematics - 10th International Conference, PPAM 2013, Warsaw, Poland, September 8-11, 2013, Revised Selected Papers, Part II, 2013. http://dx.doi.org/10.1007/978-3-642-55195-6_29 pp. 311–321.
P. Stpiczyński, “Algorithmic and language-based optimization of Marsa-LFIB4 pseudorandom number generator using OpenMP, OpenACC and CUDA,” Journal of Parallel and Distributed Computing, vol. 137, pp. 238–245, 2020. http://dx.doi.org/10.1016/j.jpdc.2019.12.004
B. Dmitruk and P. Stpiczyński, “Solving tridiagonal Toeplitz systems of linear equations on GPU-accelerated computers,” Concurrency and Computation Practice and Experience, vol. 34, p. 6449, 2022. http://dx.doi.org/10.1002/cpe.6449
R. W. Vuduc and H. J. Moon, “Fast sparse matrix-vector multiplication by exploiting variable block structure,” in High Performance Computing and Communications, First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005, Proceedings, ser. Lecture Notes in Computer Science, vol. 3726. Springer, 2005. doi: 10.1007/11557654_91 pp. 807–816.
R. Shahnaz and A. Usman, “Blocked-based sparse matrix-vector multiplication on distributed memory parallel computers,” Int. Arab J. Inf. Technol., vol. 8, no. 2, pp. 130–136, 2011.
R. F. Boisvert, R. Pozo, K. A. Remington, R. F. Barrett, and J. Dongarra, “Matrix Market: a web resource for test matrix collections,” in Quality of Numerical Software - Assessment and Enhancement, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Quality of Numerical Software, Assessment and Enhancement, Oxford, UK, 8-12 July 1996, ser. IFIP Conference Proceedings, R. F. Boisvert, Ed., vol. 76. Chapman & Hall, 1997, pp. 125–137.
T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, 2011. http://dx.doi.org/10.1145/2049662.2049663
R. Li and Y. Saad, “GPU-accelerated preconditioned iterative linear solvers,” The Journal of Supercomputing, vol. 63, no. 2, pp. 443–466, 2013. http://dx.doi.org/10.1007/s11227-012-0825-3
R. Grimes, D. Kincaid, and D. Young, “ITPACK 2.0 user’s guide,” Center for Numerical Analysis, University of Texas, Tech. Rep. CNA-150, 1979.
J. Cheng, M. Grossman, and T. McKercher, Eds., Professional CUDA C Programming. Wiley and Sons, 2014.