Towards an HPC cluster digital twin and scheduling framework for improved energy efficiency
Alexander Kammeyer, Florian Burger, Daniel Lübbert, Katinka Wolter
DOI: http://dx.doi.org/10.15439/2023F3797
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 265–268 (2023)
Abstract. Demand for compute resources, and with it the energy demand of HPC, is steadily increasing, while the energy market is shifting to renewable energy and facing significant price increases. Optimizing the energy efficiency of HPC clusters is therefore a major concern. This paper discusses several possible optimization dimensions and presents a digital twin design for analyzing and reducing the energy consumption of a real-world HPC system, based on the HPC cluster at PTB. The digital twin receives information from multiple internal and external data sources to cover the different optimization opportunities. It also includes a scheduling simulation framework that uses data from the digital twin and real-world job traces to test the influence of different parameters on the HPC cluster.
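To make the idea of a trace-driven scheduling simulation more concrete, the following is a minimal sketch and not the framework described in the paper. It assumes a strict first-come-first-served policy, a constant per-node power draw, and an hourly electricity price series as the external data source. All names (`Job`, `simulate_fcfs`, `node_power_kw`, `price_per_kwh`) and all numbers are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    submit: int   # submission time in hours (assumed trace field)
    runtime: int  # runtime in whole hours
    nodes: int    # number of nodes requested


def simulate_fcfs(jobs, total_nodes, price_per_kwh, node_power_kw=0.5):
    """Replay a job trace with strict FCFS and return (start_times, energy_cost).

    price_per_kwh is a list of hourly prices that is reused cyclically;
    node_power_kw is an assumed constant power draw per busy node.
    Start times are returned in submission order.
    """
    running = []          # min-heap of (end_time, nodes) for running jobs
    free = total_nodes
    clock = 0
    starts = []
    cost = 0.0
    for job in sorted(jobs, key=lambda j: j.submit):
        if job.nodes > total_nodes:
            raise ValueError("job requests more nodes than the cluster has")
        clock = max(clock, job.submit)
        # Release nodes of jobs that have finished by now.
        while running and running[0][0] <= clock:
            free += heapq.heappop(running)[1]
        # Wait for running jobs to finish until enough nodes are free.
        while free < job.nodes:
            end, nodes = heapq.heappop(running)
            clock = max(clock, end)
            free += nodes
        starts.append(clock)
        free -= job.nodes
        heapq.heappush(running, (clock + job.runtime, job.nodes))
        # Charge each hour of execution at that hour's price.
        for hour in range(clock, clock + job.runtime):
            price = price_per_kwh[hour % len(price_per_kwh)]
            cost += job.nodes * node_power_kw * price
    return starts, cost


if __name__ == "__main__":
    trace = [Job(0, 2, 2), Job(0, 1, 2), Job(1, 1, 1)]
    starts, cost = simulate_fcfs(trace, total_nodes=2,
                                 price_per_kwh=[0.1, 0.2, 0.3, 0.4])
    print(starts, round(cost, 2))
```

Replaying the same trace with a different price series, node count, or power value gives a simple way to compare how such parameters affect start times and energy cost. A realistic study would use a measured power model and a production scheduler policy instead of these simplifications.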