Influence of Locality on the Scalability of Method- and System-Parallel Explicit Peer Methods
Matthias Korch, Thomas Rauber, Matthias Stachowski, Tim Werner
DOI: http://dx.doi.org/10.15439/2016F464
Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 685–694 (2016)
Abstract. Because the numerical solution of initial value problems (IVPs) for systems of ordinary differential equations (ODEs) can be computationally intensive, several parallel methods have been proposed in the past. One class of modern parallel IVP methods is the class of peer methods proposed by Schmitt and Weiner, some of which are publicly available in the software package EPPEER released in 2012. Since they possess eight independent stages, these methods offer natural parallelism across the method that suits the typical numbers of CPU cores in modern multicore workstations. EPPEER is written in Fortran 95 and uses OpenMP as its parallel programming model. In this paper, we investigate the influence of the locality of memory references on the scalability of method- and system-parallel explicit peer methods. In particular, we examine the interplay between the linear combination of the stages and the function evaluations by applying different program transformations to the loop structure and evaluating their performance in detailed runtime experiments. These experiments show that loop tiling is required to improve cache utilization while still allowing the compiler to vectorize along the system dimension. To show that a stage-parallel execution is not optimal for certain classes of right-hand-side functions, and to extend the scalability of the peer methods to core numbers larger than the number of stages, system-parallel implementations have been derived. Runtime experiments show that there are IVPs for which these new implementations outperform stage-parallel implementations even on numbers of cores less than or equal to the number of stages. Moreover, by utilizing larger core numbers, speedups higher than the number of stages have been reached.
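To make the distinction between stage-parallel and system-parallel execution concrete, the following C/OpenMP sketch outlines one time step of an explicit peer method of the general form Y_i^(m) = sum_j b_ij Y_j^(m-1) + h sum_j a_ij f(t_(m-1) + c_j h, Y_j^(m-1)). This is a minimal sketch under assumptions about the loop structure; all identifiers (S, TILE, eval_f, the step functions) are illustrative and are not taken from EPPEER or the paper.

```c
/* Hedged sketch: one step of an explicit parallel peer method
 *   Y_i^(m) = sum_j b[i][j] * Yold[j] + h * sum_j a[i][j] * Fold[j]
 * with S independent stages over an ODE system of dimension n.
 * All identifiers are illustrative; this is not EPPEER code. */

#define S    8     /* number of stages */
#define TILE 1024  /* tile size along the system dimension (tunable) */

/* user-supplied right-hand side F = f(t, Y) of the ODE system */
void eval_f(double t, const double *Y, double *F, int n);

/* Stage-parallel variant ("parallelism across the method"): each thread
 * computes one of the S independent stages, so at most S cores are used. */
void step_stage_parallel(double t, double h, int n,
                         const double a[S][S], const double b[S][S],
                         const double c[S],
                         double Yold[S][n], double Fold[S][n],
                         double Ynew[S][n], double Fnew[S][n])
{
    #pragma omp parallel for
    for (int i = 0; i < S; i++) {
        for (int k = 0; k < n; k++) {          /* linear combination */
            double acc = 0.0;
            for (int j = 0; j < S; j++)
                acc += b[i][j] * Yold[j][k] + h * a[i][j] * Fold[j][k];
            Ynew[i][k] = acc;
        }
        eval_f(t + c[i] * h, Ynew[i], Fnew[i], n);  /* stage evaluation */
    }
}

/* System-parallel, tiled variant ("parallelism across the system"): the
 * system dimension is split into tiles that stay in cache while all S
 * stages are computed on them; the innermost k-loop stays contiguous and
 * therefore vectorizable. The thread count is no longer bounded by S. */
void step_system_parallel(double h, int n,
                          const double a[S][S], const double b[S][S],
                          double Yold[S][n], double Fold[S][n],
                          double Ynew[S][n])
{
    #pragma omp parallel for schedule(static)
    for (int kk = 0; kk < n; kk += TILE) {
        int kend = kk + TILE < n ? kk + TILE : n;
        for (int i = 0; i < S; i++)
            for (int k = kk; k < kend; k++) {  /* vectorizable */
                double acc = 0.0;
                for (int j = 0; j < S; j++)
                    acc += b[i][j] * Yold[j][k] + h * a[i][j] * Fold[j][k];
                Ynew[i][k] = acc;
            }
    }
    /* function evaluations for the next step would follow; they can be
     * fused into the tiled loop only if f has local data dependencies
     * (e.g., limited access distance), otherwise they form a second pass */
}
```

In the tiled variant, each tile of Yold and Fold is reused across all S stages before it leaves the cache, while the contiguous innermost loop over k leaves the system dimension open to compiler vectorization, illustrating the interplay described in the abstract.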
References
- E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems, 2nd ed. Berlin: Springer, 2000.
- K. Burrage, Parallel and Sequential Methods for Ordinary Differential Equations. New York: Oxford University Press, 1995.
- B. A. Schmitt, “Peer methods for ordinary differential equations,” http://www.mathematik.uni-marburg.de/~schmitt/peer/, last checked 2015-08-17.
- B. A. Schmitt and R. Weiner, “Parallel two-step W-methods with peer variables,” SIAM J. Numer. Anal., vol. 42, no. 1, pp. 265–282, 2004.
- B. A. Schmitt, R. Weiner, and S. Beck, “Two-step peer methods with continuous output,” BIT Numer. Math., vol. 53, pp. 717–739, 2013.
- B. A. Schmitt, R. Weiner, and S. Jebens, “Parameter optimization for explicit parallel peer two-step methods,” Appl. Numer. Math., vol. 59, pp. 769–782, 2009.
- R. Weiner, K. Biermann, B. A. Schmitt, and H. Podhaisky, “Explicit two-step peer methods,” Comput. Math. Appl., vol. 55, pp. 609–619, 2008.
- B. A. Schmitt and R. Weiner, Manual for explicit parallel peer code EPPEER, Aug. 2012.
- Y. Maday and G. Turinici, “A parareal in time procedure for the control of partial differential equations,” C. R. Acad. Sci. Paris, Sér. I Math., vol. 335, pp. 387–391, 2002.
- P. J. van der Houwen and E. Messina, “Parallel Adams methods,” J. Comput. Appl. Math., vol. 101, pp. 153–165, Jan. 1999.
- K. Ahnert, D. Demidov, and M. Mulansky, “Solving ordinary differential equations on GPUs,” in Numerical Computations with GPUs, V. Kindratenko, Ed. Springer, 2014, ch. 7, pp. 125–157.
- S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, “Efficient management of parallelism in object oriented numerical software libraries,” in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, Eds. Birkhäuser Press, 1997, pp. 163–202.
- M. Korch and T. Rauber, “Parallel low-storage Runge-Kutta solvers for ODE systems with limited access distance,” Int. J. High Perf. Comput. Appl., vol. 25, no. 2, pp. 236–255, 2011. http://dx.doi.org/10.1177/1094342010384418
- M. Korch and T. Rauber, “Locality optimized shared-memory implementations of iterated Runge-Kutta methods,” in Euro-Par 2007, ser. LNCS, vol. 4641. Springer, 2007, pp. 737–747.
- R. Karrenberg, “Automatic SIMD vectorization of SSA-based control flow graphs,” Dissertation, Universität des Saarlandes, Saarbrücken, Jul. 2014.
- S. Ghosh, M. Martonosi, and S. Malik, “Cache miss equations: A compiler framework for analyzing and tuning memory behavior,” ACM Trans. Prog. Lang. Syst. (TOPLAS), vol. 21, no. 4, pp. 703–746, 1999.
- D. Feld, T. Soddemann, M. Jünger, and S. Mallach, “Facilitate SIMD-Code-Generation in the Polyhedral Model by Hardware-aware Automatic Code-Transformation,” in Proc. of the 3rd International Workshop on Polyhedral Compilation Techniques, A. Größlinger and L.-N. Pouchet, Eds., Berlin, Germany, Jan. 2013, pp. 45–54.
- S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
- G. Hager, J. Treibig, J. Habich, and G. Wellein, “Exploring performance and power properties of modern multi-core chips via simple machine models,” Concurrency and Computation: Practice and Experience, vol. 28, pp. 189–210, 2016.
- J. Ansel, “Autotuning programs with algorithmic choice,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA, Feb. 2014.
- N. Kalinnik, M. Korch, and T. Rauber, “Online auto-tuning for the time-step-based parallel solution of ODEs on shared-memory systems,” Journal of Parallel and Distributed Computing, vol. 74, no. 8, pp. 2722–2744, 2014. http://dx.doi.org/10.1016/j.jpdc.2014.03.006
- B. A. Schmitt and R. Weiner, “Parallel start for explicit parallel two-step peer methods,” Numerical Algorithms, vol. 53, no. 2–3, pp. 363–381, 2010.