Towards an HPC Cluster Digital Twin and Scheduling Framework for Improved Energy Efficiency

Demand for compute resources, and thus the energy demand of HPC, is steadily increasing while the energy market is shifting to renewable energy and facing significant price increases. Optimizing the energy efficiency of HPC clusters is therefore a major concern. This paper discusses the possible optimization dimensions and presents a digital twin design for analyzing and reducing the energy consumption of a real-world HPC system. The digital twin is based on the HPC cluster at PTB. It receives information from multiple internal and external data sources to cover the different optimization opportunities. It also includes a scheduling simulation framework that uses the data from the digital twin and real-world job traces to test the influence of the different parameters on the HPC cluster.


I. INTRODUCTION
The Physikalisch-Technische Bundesanstalt (PTB) operates a compute cluster at its Berlin site that is used by various departments within PTB. The cluster is the backbone of numerous research activities. The HPC cluster currently consists of 60 CPU nodes with CPUs from two generations and two special GPU nodes with 10 GPUs. This amounts to a power consumption of approximately 30 kW. PTB plans to extend the installed compute power by another 50 kW in Berlin and, in the longer term, to install a new cluster with over 200 kW installed power at its Brunswick site.
PTB relies on HPC for many research activities in different departments including physics, mathematics and medicine. The new AI strategy of PTB will require more compute resources for AI applications and model training. Energy efficiency becomes a concern with this increased demand. While the energy efficiency of modern CPUs has improved over the years, the total energy consumption of HPC systems has increased at an even faster rate [1]. With this increasing energy usage, the total cost of ownership is impacted by increasing energy prices in a volatile market environment. Additionally, PTB, as a federal institution, is bound by the climate protection plan 2050 [2] and climate protection program 2030 [3] of the German federal government. New data centres, like the planned Brunswick site, need to meet the standards defined by the Blue Angel certificate for data centres [4].
Currently, the influence of different internal and external factors on the energy efficiency of the HPC cluster at PTB is not well understood. A digital twin can help to understand those factors in detail and to test changes to the system configuration without adverse effects on the production system, and digital twins are commonly used for this purpose [5]. However, optimization goals and systems differ between sites and clusters. Thus, a specific, tailored solution is needed for this multi-factor optimization problem. Other HPC centres are also making efforts to improve their energy efficiency and are reporting on those efforts [6].
This paper introduces a design specification for a digital twin of the HPC cluster at PTB. The digital twin shall help the operators of the cluster to improve the cluster operation with regard to several optimization goals. The digital twin collects data from various data sources for that purpose. In addition, the digital twin contains a simulation framework with a scheduling model. This simulation framework is used to test parameters and settings with regard to the combined optimization goals.
The next chapter presents the optimization dimensions that have to be taken into consideration. Chapter III presents the design specification of the digital twin and its different data sources. Finally, Chapter IV introduces the scheduling framework created to simulate the HPC cluster with data from the digital twin.

II. BACKGROUND
Improving the energy efficiency and reducing the energy usage of a compute cluster is a multi-dimensional optimization problem. Energy consumption, CO2 emissions, energy cost, cooling and energy limitations have been identified to be of particular interest to PTB [7]. Optimizations for each of these goals are possible. Each of these dimensions needs to be monitored and requires a representation within the digital twin. The rest of this chapter describes these dimensions in detail:

A. Energy consumption
The most obvious optimization is the overall reduction of energy consumption by the HPC cluster. Different mechanisms have been proposed, such as Dynamic Voltage and Frequency Scaling (DVFS) [8], turning off idle nodes [1], adapting jobs to the energy budget and running nodes at reduced frequencies [9].
The digital twin can be used to test different parameters and their effect on the cluster and monitor the long-term effect of configuration changes.

B. CO2 emissions
The energy in the energy grid comes from different sources with different CO2 emissions associated with them. Tracking the equivalent CO2 emissions of the energy used and moving jobs to times of low CO2 emissions can help to reduce the CO2 equivalent associated with the operation of the cluster. Since each energy source has different emissions, days with strong winds or low cloud coverage reduce emissions, while sources like natural gas and coal used to cover base load in the energy grid increase emissions.
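As an illustration of moving a job to a time of low CO2 emissions, the following minimal sketch picks the start hour whose window has the lowest summed grid carbon intensity. The function names, the hourly forecast values and the constant job power are assumptions made for this example and are not taken from the digital twin.

```python
from typing import Sequence


def best_start_hour(intensity_g_per_kwh: Sequence[float], duration_h: int) -> int:
    """Return the start index of the contiguous window with the lowest summed intensity."""
    n = len(intensity_g_per_kwh)
    if duration_h <= 0 or duration_h > n:
        raise ValueError("duration must be between 1 and the forecast length")
    # Sliding-window sum: add the hour entering the window, drop the one leaving it.
    window = sum(intensity_g_per_kwh[:duration_h])
    best_sum, best_start = window, 0
    for start in range(1, n - duration_h + 1):
        window += intensity_g_per_kwh[start + duration_h - 1]
        window -= intensity_g_per_kwh[start - 1]
        if window < best_sum:
            best_sum, best_start = window, start
    return best_start


def job_emissions_kg(intensity_g_per_kwh: Sequence[float], start: int,
                     duration_h: int, power_kw: float) -> float:
    """Estimate emissions of a job drawing constant power, in kg CO2 equivalent."""
    grams = power_kw * sum(intensity_g_per_kwh[start:start + duration_h])
    return grams / 1000.0


# Hypothetical hourly forecast in g CO2eq/kWh (illustrative values only).
forecast = [450, 420, 300, 250, 260, 400]
start = best_start_hour(forecast, duration_h=2)
emissions = job_emissions_kg(forecast, start, duration_h=2, power_kw=10.0)
```

A real scheduler would additionally have to respect job deadlines and queue order, which this sketch ignores.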

C. Energy cost
Energy prices for large consumers are dynamically determined at energy spot markets. These prices are volatile, so shifting compute jobs to low-cost times can directly reduce energy costs. Using the data collected by the digital twin, it may be possible to identify a correlation between pricing and high energy availability, e.g. due to a large amount of available renewable energy.
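One simple way to look for such a correlation is the Pearson correlation coefficient between the renewable share of the energy mix and the spot price. The sketch below is only an illustration: the function name and the sample values are assumptions, and the paper does not state which statistical method is used.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly samples (illustrative values only).
renewable_share = [0.2, 0.4, 0.6, 0.8]        # fraction of the energy mix
price_eur_per_mwh = [120.0, 90.0, 60.0, 30.0]  # spot price
r = pearson(renewable_share, price_eur_per_mwh)
```

A coefficient close to -1 would indicate that prices fall as the renewable share rises. Real data would be noisier and would need a much longer observation period.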

D. External influences
The weather has a direct influence on energy efficiency. For this reason, most metrics, such as Power Usage Effectiveness (PUE) [10], are averaged over a year-long period to smooth out weather and season related effects. PTB intends to move the cluster cooling to a free cooling system, which uses temperature differences to cool the cluster. Such systems are more efficient than classic compressor-based cooling machines but only work up to a certain outside temperature. Therefore, they might not provide sufficient cooling on extremely warm days.
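PUE is defined as the total facility energy divided by the energy used by the IT equipment. The following minimal sketch computes it over a set of meter readings, so that summing over a full year gives the annual average. The function name and the readings are assumptions made for illustration.

```python
from typing import Sequence


def pue(total_facility_kwh: Sequence[float], it_equipment_kwh: Sequence[float]) -> float:
    """Power Usage Effectiveness over a period: total facility energy / IT energy."""
    if len(total_facility_kwh) != len(it_equipment_kwh):
        raise ValueError("both series must cover the same intervals")
    it_energy = sum(it_equipment_kwh)
    if it_energy <= 0:
        raise ValueError("IT energy must be positive")
    return sum(total_facility_kwh) / it_energy


# Hypothetical readings for a warm and a cold interval (illustrative values only).
total_kwh = [150.0, 130.0]  # cluster plus cooling infrastructure
it_kwh = [100.0, 100.0]     # cluster only
period_pue = pue(total_kwh, it_kwh)
```

Summing the energies before dividing weights each interval by its consumption, which differs from averaging per-interval PUE values.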
The cluster needs to adapt to such conditions, e.g. by limiting the number of active nodes. By monitoring system temperatures and integrating weather forecast data into the digital twin, the HPC system can be configured to stay within defined operational parameters, e.g. by reducing system load.
Another problem might be limited energy availability. Other industries with high energy usage have developed so-called demand response mechanisms to reduce energy consumption in cooperation with energy service providers when not enough energy is available. With the increasing power demands of HPC clusters, such strategies might be required for HPC as well, and some strategies have been proposed [11]. Similar to the limitations imposed by insufficient cooling capacity, the cluster can be adjusted to use less energy when the energy provider cannot supply enough energy.
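A cooling or power limit can be translated into a cap on the number of active nodes. The sketch below assumes a fixed per-node power draw and a fixed overhead for infrastructure. Both the function name and the numbers are hypothetical and serve only to illustrate the idea.

```python
def max_active_nodes(budget_w: int, overhead_w: int, node_power_w: int,
                     total_nodes: int) -> int:
    """Largest number of nodes that fits into the power budget (all values in watts)."""
    if node_power_w <= 0:
        raise ValueError("node power must be positive")
    usable_w = budget_w - overhead_w
    if usable_w <= 0:
        return 0
    # Integer division avoids rounding surprises from floating-point watt values.
    return min(total_nodes, usable_w // node_power_w)


# Hypothetical demand response event: the provider limits the site to 20 kW.
allowed = max_active_nodes(budget_w=20_000, overhead_w=2_000,
                           node_power_w=450, total_nodes=60)
```

In practice the per-node draw depends on the workload, so a measured or modelled value per node type would replace the constant used here.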

III. DIGITAL TWIN FOR HPC
A digital twin is used to reason about a real-world object with data available digitally. In order to do so, real-world data needs to be collected and models of the object have to be created. The previous section introduced the relevant dimensions for the optimization problem. These dimensions need to be represented in the digital twin with data and models.
The HPC cluster receives compute jobs from the users in a job queue. The scheduler decides which jobs are run on which nodes at a certain time. A resource manager module manages the available system resources. The compute nodes execute the jobs submitted to the system. Each node has its own CPU, RAM and network connection. Some nodes also have GPUs. The general operation of the cluster is simulated via a scheduling simulation. This simulation is key to understanding the cluster operation. Chapter IV introduces the scheduling simulation. The energy use of the nodes can be approximated with energy estimates, so no actual jobs need to run during the scheduling simulation.
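To make the idea of simulating scheduling with energy estimates concrete, the following sketch implements a strict first-come, first-served policy and estimates energy from a constant per-node power draw. It is not the simulator described in this paper: the policy, the class and function names and the power model are assumptions, and idle power is ignored.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Job:
    job_id: int
    submit_s: int   # submission time in seconds
    runtime_s: int  # runtime in seconds
    nodes: int      # number of nodes requested


def simulate_fcfs(jobs: Iterable[Job], total_nodes: int,
                  node_power_w: float) -> Tuple[Dict[int, int], float]:
    """Return ({job_id: start_s}, energy_kwh) for strict FCFS without backfilling."""
    free = total_nodes
    running: List[Tuple[int, int]] = []  # min-heap of (end_s, nodes)
    now = 0
    starts: Dict[int, int] = {}
    energy_j = 0.0
    for job in sorted(jobs, key=lambda j: (j.submit_s, j.job_id)):
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} requests more nodes than exist")
        now = max(now, job.submit_s)
        # Release the nodes of all jobs that have finished by now.
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]
        # Not enough free nodes: advance time to the next job completions.
        while free < job.nodes:
            end_s, nodes = heapq.heappop(running)
            now = max(now, end_s)
            free += nodes
        starts[job.job_id] = now
        free -= job.nodes
        heapq.heappush(running, (now + job.runtime_s, job.nodes))
        # Energy estimate: busy nodes times runtime times constant power (joules).
        energy_j += job.nodes * job.runtime_s * node_power_w
    return starts, energy_j / 3.6e6  # joules to kWh


trace = [Job(1, 0, 3600, 3), Job(2, 10, 1800, 2), Job(3, 20, 600, 1)]
start_times, energy_kwh = simulate_fcfs(trace, total_nodes=4, node_power_w=500.0)
```

In this example job 2 has to wait until job 1 finishes, and job 3 starts together with job 2 because strict FCFS does not let it overtake.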
The simulation results and data are used by the optimizer component, which influences the scheduler and the resource manager to change the operation parameters of the HPC cluster.
The cluster status, including job and node information, is collected with ClusterCockpit [12]. Energy usage is collected via energy meters installed on site. This includes energy used by the cluster and the cooling infrastructure. Heat meters are used to monitor the amount of heat removed from the cluster by the cooling system.
All external data sources have been chosen specifically for PTB's location and requirements as well as the optimization goals described in Chapter II. The external data sources for the digital twin include energy grid information, weather forecast data, energy prices and the CO2 emissions associated with the current energy mix in the energy grid. Information about the energy grid is obtained from Bundesnetzagentur [13]. This federal agency offers information about the German energy grid and the energy sources used at any given time. Energy prices are also retrieved from Bundesnetzagentur. CO2 estimations are retrieved from Electricity Maps [14]. Finally, weather information is retrieved from the German Meteorological Service (DWD) [15] via the Bright Sky API [16].
Figure 1 shows the architecture described in this chapter in a schematic representation. The individual data collectors described in this chapter can be found in Figure 2. They are grouped into internal and external data sources with InfluxDB being the core component.

IV. SCHEDULING SIMULATION FRAMEWORK
This section gives a brief overview of existing scheduling simulators and introduces a new scheduling simulator that uses the digital twin data of the PTB HPC cluster as input for its simulations and models the operation of the cluster.
Energy-aware scheduling has been of interest in the community with various goals, metrics and proposed solutions [17]. Scheduling simulators have been used to test different scheduling algorithms or additional parameters without changing production systems or the need to run actual jobs. Slurm [18] is a scheduler used in production HPC systems. Simakov et al. developed multiple Slurm simulator versions [19], [20] that follow the steady Slurm development. All versions allow testing different Slurm parameters without altering the corresponding production system. Slurm can be combined with other schedulers like NQSV and digital twins [5]. However, since every setup is different, a custom solution is necessary. Yang et al. [21] proposed a scheduling simulator that offers two pricing options for scheduling but does not support dynamic pricing based on constantly changing energy prices.
The data of the digital twin is used for the scheduling simulations. The simulator connects to the database in order to get both the internal and external data. Because the digital twin is a purpose-built solution, no suitable existing simulator was available. Therefore, a new simulator was developed.
For development, testing and validation purposes, the simulator component presented in this paper can generate synthetic job traces. For longer, realistic job traces, the simulator can read the Standard Workload Format (SWF) used by the Parallel Workloads Archive [22] and can use job traces from this archive as input. Additionally, job traces from the PTB compute cluster can be used as input.
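For orientation, an SWF file contains one job per line with whitespace-separated numeric fields, and header or comment lines start with a semicolon. The following minimal sketch reads the first five fields (job number, submit time, wait time, run time and number of allocated processors). The class and function names are hypothetical, and the sketch does not reflect how the simulator in this paper parses traces.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SwfJob:
    job_id: int
    submit_s: int         # seconds since the start of the log
    wait_s: int           # -1 if unknown
    runtime_s: int        # -1 if unknown
    allocated_procs: int  # -1 if unknown


def parse_swf(lines: Iterable[str]) -> Iterator[SwfJob]:
    """Yield one SwfJob per data line, skipping blank and ';' comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"malformed SWF line: {line!r}")
        # Only the first five fields are used here; int(float(...)) also
        # tolerates values that some logs write with a decimal point.
        yield SwfJob(*(int(float(f)) for f in fields[:5]))


# Two made-up jobs in SWF layout (18 fields each, -1 marks unknown values).
sample = """\
; Version: 2.2
1 0 10 3600 4 -1 -1 4 7200 -1 1 1 1 -1 1 -1 -1 -1
2 60 0 120 1 -1 -1 1 600 -1 1 2 1 -1 1 -1 -1 -1
"""
jobs = list(parse_swf(sample.splitlines()))
```

Reading a real trace would replace `sample.splitlines()` with an open file object.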
Furthermore, the data from the digital twin database as well as the job traces are time dependent. The scheduling simulation can run at arbitrary time points. The simulator was designed in such a way that it can run a given job trace at a chosen start time and match the data from the digital twin database accordingly. This allows testing varying internal and external factors on the same job trace or the same internal and external factors on different job traces.
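The matching of a job trace to a chosen start time can be sketched in two steps: shift all submit times by a common offset, then look up the most recent time-series sample at or before each timestamp. The function names and values below are assumptions for illustration and do not describe the actual database queries of the simulator.

```python
from bisect import bisect_right
from typing import List, Sequence


def shift_trace(submit_times_s: Sequence[int], new_start_s: int) -> List[int]:
    """Shift a trace so that its earliest submit time equals new_start_s."""
    if not submit_times_s:
        return []
    offset = new_start_s - min(submit_times_s)
    return [t + offset for t in submit_times_s]


def value_at(timestamps_s: Sequence[int], values: Sequence[float], t_s: int) -> float:
    """Most recent sample at or before t_s; timestamps_s must be sorted ascending."""
    i = bisect_right(timestamps_s, t_s) - 1
    if i < 0:
        raise ValueError("no sample available at or before the requested time")
    return values[i]


# Hypothetical hourly price series and a short trace (illustrative values only).
price_times_s = [0, 3600, 7200]
prices = [80.0, 40.0, 95.0]
shifted = shift_trace([100, 160, 400], new_start_s=3600)
prices_at_submit = [value_at(price_times_s, prices, t) for t in shifted]
```

Running the same trace with a different `new_start_s` exposes it to a different part of the external data, which corresponds to the experiments described above.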
The results of the simulations can be used to tune the system parameters to meet the optimization goals from Chapter II. An automatic optimizer can be implemented as a component of the HPC cluster for automatic tuning and adjustment of parameters.

V. CONCLUSION
With increasing compute demand and thus energy demand of HPC clusters, energy consumption and availability become a concern for HPC cluster operators. Possible goals for optimizations like energy consumption, CO2 emissions, energy cost and external influences that affect energy and cooling have been discussed in this paper.
A design of a digital twin, a representation of a real-world HPC cluster at PTB, has been proposed. It has a time-series database at its core and is connected to data sources for the internal cluster aspects as well as external factors relevant for the optimization goals. The digital twin makes data about the cluster available to the administrators. This data helps to assess and monitor the efforts towards meeting desired efficiency goals over time.
As part of the digital twin, a scheduling simulator has been developed that simulates the operation of the cluster. It uses the data from the digital twin and allows testing different job traces. Because the data is time dependent, the simulator can map a job trace onto different time windows of the data, so that the same trace can be tested under different conditions. The digital twin can be used by the administrators to test various configuration parameters and algorithms for their effect on the optimization goals.
So far, the digital twin is only used for simulations. It has not yet been integrated with the scheduler of the real-world HPC system; this integration is planned as future work.
With the scheduling simulation framework, as part of the digital twin, further empirical studies of the optimization goals are planned. These simulations are a first step towards more efficient cluster operation at PTB and are the basis for future improvements of the real-world HPC cluster, especially toward an automatic optimizer component as part of the scheduler.
Proceedings of the 18th Conference on Computer Science and Intelligence Systems (FedCSIS), Warsaw, Poland, 2023, pp. 265-268. DOI: 10.15439/2023F3797. ISSN 2300-5963. ACSIS, Vol. 35. IEEE Catalog Number: CFP2385N-ART. ©2023, PTI. Topical area: Information Technology for Business and Society. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

Figure 1. Schematic representation of the digital twin for PTB's HPC cluster

Figure 2. Software components of the digital twin with all data sources and models