Simulator of a Supercomputer Job Management System as a Scientific Service
Gennadiy Savin, Boris Shabanov, Dmitriy Lyakhovets, Anton Baranov, Pavel Telegin
DOI: http://dx.doi.org/10.15439/2020F208
Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 413–416 (2020)
Abstract. A job management system (JMS) is an important part of any supercomputer. The JMS creates a schedule for launching the jobs of different users. Real-world job management systems are complex software systems with many settings. These settings have a significant impact on JMS metrics such as supercomputer resource utilization and the mean waiting time of a job in the queue. JMS simulators are widely used to study how JMS settings or modifications, new scheduling algorithms, parameters of the input job stream, or the available computing resources affect JMS efficiency metrics. The article presents a comparative analysis of current JMS simulators (Alea, ScSF, Batsim, AccaSim, Slurm simulator) and their application areas. The authors consider new ways to use a JMS simulator as a scientific service for researchers. With such a service, researchers can test hypotheses about JMS efficiency, algorithms, or parameters. This provides the following: (1) research is performed on the service side around the clock, (2) the accuracy and adequacy of the simulator are ensured by the service, and (3) the reproducibility of research results is ensured. The simulator-as-a-service also becomes a single entry point for researchers.
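To make the two metrics named in the abstract concrete, the following is a minimal sketch of a toy simulator that replays a job stream under strict first-come-first-served scheduling and reports resource utilization and mean waiting time. It is an illustration only and does not reproduce any of the simulators compared in the article; the names `Job` and `simulate_fcfs`, the FCFS policy without backfilling, and the choice to measure utilization from the first submission to the last job completion are all assumptions made for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    submit: float   # submission time, seconds
    nodes: int      # number of nodes requested
    runtime: float  # actual run time, seconds


def simulate_fcfs(jobs, total_nodes):
    """Replay jobs under strict FCFS (no backfilling); hypothetical helper.

    Returns a dict with 'mean_wait' (seconds) and 'utilization' (0..1),
    where utilization is busy node-seconds divided by the node-seconds
    available between the first submission and the last job completion.
    """
    if not jobs:
        return {"mean_wait": 0.0, "utilization": 0.0}

    ordered = sorted(jobs, key=lambda j: j.submit)
    running = []            # min-heap of (finish_time, nodes)
    free = total_nodes
    clock = ordered[0].submit
    waits = []
    busy_node_seconds = 0.0
    last_finish = clock

    for job in ordered:
        if job.nodes > total_nodes:
            raise ValueError("job requests more nodes than the machine has")
        # The head of the queue cannot start before it is submitted.
        clock = max(clock, job.submit)
        # Release nodes of jobs that have already finished.
        while running and running[0][0] <= clock:
            _, n = heapq.heappop(running)
            free += n
        # Otherwise advance time until enough nodes are free.
        while free < job.nodes:
            finish, n = heapq.heappop(running)
            clock = max(clock, finish)
            free += n
        free -= job.nodes
        finish_time = clock + job.runtime
        heapq.heappush(running, (finish_time, job.nodes))
        waits.append(clock - job.submit)
        busy_node_seconds += job.nodes * job.runtime
        last_finish = max(last_finish, finish_time)

    span = last_finish - ordered[0].submit
    utilization = busy_node_seconds / (total_nodes * span) if span > 0 else 0.0
    return {"mean_wait": sum(waits) / len(waits), "utilization": utilization}


if __name__ == "__main__":
    stream = [Job(0, 2, 10), Job(0, 2, 10), Job(5, 4, 10)]
    print(simulate_fcfs(stream, total_nodes=4))
```

Changing the policy (for example, adding backfilling) or the job stream and comparing the returned metrics is the kind of experiment that a full-featured simulator supports at realistic scale.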