
Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 39

HPC operation with time-dependent cluster-wide power capping


DOI: http://dx.doi.org/10.15439/2024F1066

Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 385–393 (2024)


Abstract. HPC systems have grown in size and power consumption. This has led to a shift from a purely performance-centric standpoint towards power- and energy-aware scheduling and management of HPC systems. The trend was further accelerated by rising energy prices and the energy crisis that began in 2022. Digital Twins have become valuable tools that enable energy- and power-aware scheduling of HPC clusters. This paper extends an existing Digital Twin with a node energy model that allows the cluster's power consumption to be predicted. The Digital Twin is then used to simulate system-wide power capping under energy shortage functions of varying degree. Several policies are proposed and tested for their effectiveness in improving job wait times and overall throughput under these limiting conditions. These policies are implemented based on a real-world HPC cluster. Depending on the pattern of the energy limitation and the workload, improvements of up to 40 percent are possible compared to scheduling without such policies.

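To make the core idea concrete, the following minimal Python sketch illustrates what cluster-wide, time-dependent power capping means for a scheduler: queued jobs are started only while the predicted cluster power stays below a cap that changes over time. This is not the paper's Digital Twin or its policies; the job names, the 200-node cluster size, the per-node power figures and the cap profile are all illustrative assumptions.

    # Toy simulation of a time-dependent cluster-wide power cap (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        nodes: int              # number of nodes requested
        runtime: int            # runtime in simulation steps
        watts_per_node: float   # assumed node power while the job runs

    def power_cap(t: int) -> float:
        """Hypothetical time-dependent cluster-wide cap in watts (shortage window at t = 20..39)."""
        return 40_000.0 if 20 <= t < 40 else 100_000.0

    def simulate(queue, total_nodes=200, idle_watts=100.0, horizon=100):
        """Greedy policy: start the next waiting job only if nodes are free AND the
        predicted cluster power (running jobs plus idle nodes) stays under the cap."""
        running = []            # list of (job, end_time)
        waiting = list(queue)
        t = 0
        while (waiting or running) and t < horizon:
            running = [(j, end) for j, end in running if end > t]     # retire finished jobs
            used_nodes = sum(j.nodes for j, _ in running)
            busy_power = sum(j.nodes * j.watts_per_node for j, _ in running)
            # try to start jobs in queue order (no backfilling in this toy version)
            while waiting:
                job = waiting[0]
                new_busy = busy_power + job.nodes * job.watts_per_node
                idle_power = (total_nodes - used_nodes - job.nodes) * idle_watts
                if used_nodes + job.nodes <= total_nodes and new_busy + idle_power <= power_cap(t):
                    running.append((job, t + job.runtime))
                    used_nodes += job.nodes
                    busy_power = new_busy
                    waiting.pop(0)
                    print(f"t={t:3d}: start {job.name} ({job.nodes} nodes)")
                else:
                    break
            t += 1

    if __name__ == "__main__":
        simulate([Job("A", 64, 25, 400.0), Job("B", 64, 25, 400.0),
                  Job("C", 64, 25, 400.0), Job("D", 96, 20, 400.0)])

With these illustrative numbers, jobs A, B and C start immediately, while job D cannot start when they finish at t = 25 because the capped window leaves too little power headroom; it is delayed until the cap is lifted at t = 40. Reducing this kind of cap-induced waiting is what the policies studied in the paper aim at.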