Proceedings of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 43

Slurm plugin for HPC operation with time-dependent cluster-wide power capping

DOI: http://dx.doi.org/10.15439/2025F0376

Citation: Proceedings of the 20th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 43, pages 175–183 (2025)

Abstract. HPC systems are shared between many users, and managing their resources and scheduling compute jobs is a central task on these clusters. Scheduling also makes it possible to control the workload and energy consumption of an HPC system. A Digital Twin of an HPC cluster can aid the scheduling process by providing energy measurements of the system and by predicting scheduling decisions through simulation. For real-world use, the Digital Twin must be integrated with the scheduler. One possible use case is energy limitation as part of a demand response process between the HPC operator and the energy supplier. This paper therefore introduces a plugin for Slurm, an open-source scheduler, that implements a scheduling algorithm for time-dependent cluster-wide power capping. The plugin uses a node energy model to predict the energy consumption of jobs and can start jobs at different frequencies to stay below the configured power limit. It interfaces with the Digital Twin, which provides energy measurements for the compute nodes, to track the system power consumption in real time and to update the power limits when necessary. The plugin is tested on a cluster and compared against a scheduling simulation of the algorithm. The analysis compares the power profiles of the simulation and the real system as well as the allocation of jobs over time; differences in the execution and the power traces are analysed and discussed.
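To make the idea from the abstract concrete, the following is a minimal, self-contained Python sketch of the kind of admission check such a power-capping scheduler performs: before a job is started, its predicted power draw at a candidate frequency is added to the current cluster power (as reported, for instance, by the Digital Twin) and compared against the time-dependent power cap. All names and the simple linear power model here (PowerCap, NodeEnergyModel, choose_start_frequency) are illustrative assumptions for this sketch only; they are not the plugin's actual code and not part of any Slurm API.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple


    @dataclass
    class PowerCap:
        # Piecewise-constant, time-dependent cluster-wide power limit in watts:
        # (start_time_s, limit_w) pairs, sorted by start time.
        segments: List[Tuple[float, float]]

        def limit_at(self, t: float) -> float:
            limit = float("inf")
            for start, value in self.segments:
                if t >= start:
                    limit = value
            return limit


    @dataclass
    class NodeEnergyModel:
        # Toy per-node power model: idle power plus a frequency-dependent load term.
        idle_w: float = 50.0
        load_w_per_ghz: float = 40.0

        def node_power(self, freq_ghz: float) -> float:
            return self.idle_w + self.load_w_per_ghz * freq_ghz


    def choose_start_frequency(
        cap: PowerCap,
        model: NodeEnergyModel,
        current_cluster_power_w: float,   # live reading, e.g. supplied by the Digital Twin
        num_nodes: int,
        candidate_freqs_ghz: List[float],
        now_s: float,
    ) -> Optional[float]:
        # Return the highest candidate frequency at which the job still fits under
        # the cap that is active right now, or None if it has to stay in the queue.
        limit = cap.limit_at(now_s)
        for freq in sorted(candidate_freqs_ghz, reverse=True):
            predicted = current_cluster_power_w + num_nodes * model.node_power(freq)
            if predicted <= limit:
                return freq
        return None


    if __name__ == "__main__":
        # Cap of 2000 W, tightened to 1200 W after one hour.
        cap = PowerCap(segments=[(0.0, 2000.0), (3600.0, 1200.0)])
        model = NodeEnergyModel()
        freq = choose_start_frequency(
            cap, model,
            current_cluster_power_w=900.0,  # power the running jobs already draw
            num_nodes=2,                    # nodes requested by the waiting job
            candidate_freqs_ghz=[1.2, 1.8, 2.4],
            now_s=4000.0,                   # we are past the cap reduction
        )
        if freq is None:
            print("cap would be exceeded: keep the job queued")
        else:
            print(f"start the job at {freq} GHz")

In this sketch the highest frequency that still fits under the active cap is preferred; a job that fits at no candidate frequency stays queued until the cap rises or running jobs finish. This mirrors the behaviour described in the abstract, where jobs may be started at reduced frequencies rather than being delayed outright.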
