Influence of loop transformations on performance and energy consumption of the multithreded WZ factorization
Beata Bylina, Jarosław Bylina, Monika Piekarz
Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 479–488 (2022)
Abstract. High-level loop transformations are a key instrument to effectively exploit the resource in modern architectures. Energy consumption on multi-core architectures is one of the major issues connected with high-performance computing. We examine the impact of four loop transformation strategies on performance and energy consumption. The investigated strategies include: loop fission, loop interchange (permutation), strip-mining, and loop tiling. Additionally, a column-wise and row-wise store formats for dense matrices are considered. Parallelization and vectorization are implemented using OpenMP directives. As a test, the WZ factorization algorithm is used. The comparison of selected strategies of the loop transformation is done for Intel architecture, namely Cascade Lake. It has been shown that for WZ factorization, which is an example of an application in which we can use the loop transformation, optimization towards high-performance can also be an effective strategy for improving energy efficiency. Our results show also that block size selection in loop tilling has a significant impact on energy consumption.
- Beata Bylina and Jarosław Bylina. OpenMP Thread Affinity for Matrix Factorizationon Multicore Systems. Proceedings of the Federated Conference onComputer Science and Information Systems, 11:489–492, 2017. https://doi.org/10.15439/2017F231.
- Beata Bylina and Jarosław Bylina. Nested loop transformations on multi- and many-core computers with shared memory. In Selected Topics in Applied Computer Science, volume I, pages 167–186. Maria Curie-Skłodowska University Press, Lublin, 2021. http://stacs.matrix.umcs.pl/v01/stacs_v01.pdf.
- Simplice Donfack, Jack Dongarra, Mathieu Faverge, Mark Gates, Jakub Kurzak, Piotr Luszczek, and Ichitaro Yamazaki. A survey of recent developments in parallel implementations of Gaussian elimination. Concurrency and Computation: Practice and Experience, 27(5):1292–1309, 2014. https://doi.org/10.1002/cpe.3306.
- D.J. Evans and M. Hatzopoulos. A parallel linear system solver. International Journal of Computer Mathematics, 7(3):227–238, 1979. https://doi.org/10.1080/00207167908803174.
- Franz Franchetti, Yevgen Voronenko, and Markus Püschel. Formal loop merging for signal transforms. SIGPLAN Not., 40(6):315–326, June 2005. https://doi.org/10.1145/1064978.1065048.
- Fred G. Gustavson. New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms, pages 211–234. Springer US, Boston, MA, 2001. https://doi.org/10.1007/978-0-387-35407-1_13.
- D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer. An energy efficiency feature survey of the Intel Haswell processor. In 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pages 896–904, 2015. https://doi.org/10.1109/IPDPSW.2015.70.
- Vasilios Kelefouras and Karim Djemame. A methodology for efficient code optimizations and memory management. In Proceedings of the 15th ACM International Conference on Computing Frontiers, CF ’18, page 105–112, New York, NY, USA, 2018. Association for Computing Machinery. https://doi.org/10.1145/3203217.3203274.
- K. Khan, M. Hirki, T. Niemi, J. Nurminen, and Z. Ou. RAPL in action: Experiences in using RAPL for power measurements. ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), 3, 01 2018. https://doi.org/10.1145/3177754.
- Martin Kong and Louis-Noël Pouchet. Model-driven transformations for multi- and many-core CPUs. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, page 469–484, New York, NY, USA, 2019. Association for Computing Machinery. https://doi.org/10.1145/3314221.3314653.
- João V.F. Lima, Issam Raïs, Laurent Lefevre, and Thierry Gautier. Performance and energy analysis of openmp runtime systems with dense linear algebra algorithms. In 2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pages 7–12, 2017. https://doi.org/10.1109/SBAC-PADW.2017.10.
- João Vicente Ferreira Lima, Issam Raïs, Laurent Lefèvre, and Thierry Gautier. Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms. The International Journal of High Performance Computing Applications, 33(3):431–443, 2019. https://doi.org/10.1177/1094342018792079.
- Maxime Mirka, Guillaume Devic, Florent Bruguier, Gilles Sassatelli, and Abdoulaye Gamatié. Automatic energy-efficiency monitoring of openmp workloads. In 2019 14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), pages 43–50, 2019. https://doi.org/10.1109/ReCoSoC48741.2019.9034988.
- Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, J. Ramanujam, P. Sadayappan, and Nicolas Vasilache. Loop transformations: Convexity, pruning and optimization. SIGPLAN Not., 46(1):549–562, January 2011. https://doi.org/10.1145/1925844.1926449.
- Yukinori Sato, Tomoya Yuki, and Toshio Endo. An autotuning framework for scalable execution of tiled code via iterative polyhedral compilation. ACM Trans. Archit. Code Optim., 15(4), January 2019. https://doi.org/10.1145/3293449.
- Md Abdullah Shahneous Bari, Abid M. Malik, Ahmad Qawasmeh, and Barbara Chapman. Performance and energy impact of openmp runtime configurations on power constrained systems. Sustainable Computing: Informatics and Systems, 23:1–12, 2019. https://doi.org/10.1016/j.suscom.2019.04.002.
- Przemysław Stpiczyński. Vectorized algorithm for multidimen- sional Monte Carlo integration on modern GPU, CPU and MIC architectures. J. Supercomput., 74(2):936–952, February 2018. https://doi.org/10.1007/s11227-017-2172-x.
- Aleksandar Vitorović, Milo V. Tomašević, and Veljko M. Milutinović. Chapter five - manual parallelization versus state-of-the-art parallelization techniques: The spec cpu2006 as a case study. In Ali Hurson, editor, Advances in Computers, volume 92 of Advances in Computers, pages 203 – 251. Elsevier, 2014. https://doi.org/10.1016/B978-0-12-420232-0.00005-2.
- P. Yalamov and D.J. Evans. The WZ matrix factorisation method. Parallel Computing, 21(7):1111–1120, 1995. https://doi.org/10.1016/0167-8191(94)00088-R.