
Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 39

HPC operation with time-dependent cluster-wide power capping


DOI: http://dx.doi.org/10.15439/2024F1066

Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 385–393 (2024)


Abstract. HPC systems have grown in size and power consumption. This has led to a shift from a purely performance-centric standpoint towards power- and energy-aware scheduling and management of HPC systems. The trend was further accelerated by rising energy prices and the energy crisis that began in 2022. Digital Twins have become valuable tools that enable energy- and power-aware scheduling of HPC clusters. This paper extends an existing Digital Twin with a node energy model that allows the cluster's power consumption to be predicted. The Digital Twin is then used to simulate system-wide power capping under energy shortage functions of varying degree. Several policies are proposed and tested for their effectiveness in improving job wait times and overall throughput under these limiting conditions. These policies are implemented based on a real-world HPC cluster. Depending on the pattern of the energy limitation and the workload, improvements of up to 40 percent are possible compared to scheduling without such policies.

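To make the core idea concrete, the following minimal Python sketch illustrates what cluster-wide, time-dependent power capping means for a scheduler: queued jobs are started only while the predicted cluster power stays below a cap that changes over time. This is not the paper's Digital Twin or its policies; the job names, the 200-node cluster size, the per-node power figures and the cap profile are all illustrative assumptions.

    # Toy simulation of a time-dependent cluster-wide power cap (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        nodes: int              # number of nodes requested
        runtime: int            # runtime in simulation steps
        watts_per_node: float   # assumed node power while the job runs

    def power_cap(t: int) -> float:
        """Hypothetical time-dependent cluster-wide cap in watts (shortage window at t = 20..39)."""
        return 40_000.0 if 20 <= t < 40 else 100_000.0

    def simulate(queue, total_nodes=200, idle_watts=100.0, horizon=100):
        """Greedy policy: start the next waiting job only if nodes are free AND the
        predicted cluster power (running jobs plus idle nodes) stays under the cap."""
        running = []            # list of (job, end_time)
        waiting = list(queue)
        t = 0
        while (waiting or running) and t < horizon:
            running = [(j, end) for j, end in running if end > t]     # retire finished jobs
            used_nodes = sum(j.nodes for j, _ in running)
            busy_power = sum(j.nodes * j.watts_per_node for j, _ in running)
            # try to start jobs in queue order (no backfilling in this toy version)
            while waiting:
                job = waiting[0]
                new_busy = busy_power + job.nodes * job.watts_per_node
                idle_power = (total_nodes - used_nodes - job.nodes) * idle_watts
                if used_nodes + job.nodes <= total_nodes and new_busy + idle_power <= power_cap(t):
                    running.append((job, t + job.runtime))
                    used_nodes += job.nodes
                    busy_power = new_busy
                    waiting.pop(0)
                    print(f"t={t:3d}: start {job.name} ({job.nodes} nodes)")
                else:
                    break
            t += 1

    if __name__ == "__main__":
        simulate([Job("A", 64, 25, 400.0), Job("B", 64, 25, 400.0),
                  Job("C", 64, 25, 400.0), Job("D", 96, 20, 400.0)])

With these illustrative numbers, jobs A, B and C start immediately, while job D cannot start when they finish at t = 25 because the capped window leaves too little power headroom; it is delayed until the cap is lifted at t = 40. Reducing this kind of cap-induced waiting is what the policies studied in the paper aim at.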