
Annals of Computer Science and Information Systems, Volume 17

Communication Papers of the 2018 Federated Conference on Computer Science and Information Systems

Benchmarking overlapping communication and computations with multiple streams for modern GPUs

DOI: http://dx.doi.org/10.15439/2018F17

Citation: Communication Papers of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 17, pages 105–110 (2018)


Abstract. The paper presents a benchmark of a multi-stream application processing a set of input data arrays. Tests have been performed and execution times measured for various numbers of streams and for various compute intensities, defined as the ratio of kernel compute time to data transfer time. As such, the application and the benchmark are representative of frequently used operations such as weighted vector sum, matrix multiplication, etc. The paper shows the benefits of using multiple data streams over a single stream across these compute intensities, benchmarked on four GPUs: the professional NVIDIA Tesla V100 and Tesla K20m, the desktop GeForce GTX 1060, and the mobile GeForce 940MX. Additionally, relative performance is shown for various numbers of kernel computations on these GPUs.
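To illustrate the pattern the abstract refers to, the following CUDA sketch shows one common way to overlap host-device transfers with kernel execution using multiple streams. It is not the paper's actual benchmark code: the chunking scheme, the identifiers (NUM_STREAMS, CHUNK, weightedSum) and the weights are illustrative assumptions; only the general technique (per-stream asynchronous copies plus a weighted-vector-sum kernel, one of the operations the abstract names) comes from the text.

// A minimal sketch, assuming the input arrays are split into equal chunks
// and each chunk's copy-in, kernel launch and copy-out are issued on its
// own CUDA stream so that transfers and computation can overlap.
// NUM_STREAMS, CHUNK, weightedSum and the weights a, b are illustrative
// choices, not taken from the paper.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_STREAMS 4
#define CHUNK (1 << 20)   // elements processed per stream

// Weighted vector sum: one of the representative operations the abstract
// mentions; its compute intensity can be raised by doing more math per element.
__global__ void weightedSum(const float *x, const float *y, float *out,
                            float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + b * y[i];
}

int main() {
    const int n = NUM_STREAMS * CHUNK;
    float *hx, *hy, *hout, *dx, *dy, *dout;

    // Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
    cudaMallocHost(&hx, n * sizeof(float));
    cudaMallocHost(&hy, n * sizeof(float));
    cudaMallocHost(&hout, n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dout, n * sizeof(float));
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamCreate(&streams[s]);

    // Issue copy-in, kernel and copy-out per chunk on its own stream; the
    // driver may overlap one stream's kernel with another stream's transfers.
    for (int s = 0; s < NUM_STREAMS; s++) {
        int off = s * CHUNK;
        size_t bytes = CHUNK * sizeof(float);
        cudaMemcpyAsync(dx + off, hx + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        cudaMemcpyAsync(dy + off, hy + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        weightedSum<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(
            dx + off, dy + off, dout + off, 0.5f, 0.5f, CHUNK);
        cudaMemcpyAsync(hout + off, dout + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    printf("hout[0] = %f\n", hout[0]);   // expect 1.5 for these weights

    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamDestroy(streams[s]);
    cudaFreeHost(hx); cudaFreeHost(hy); cudaFreeHost(hout);
    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}

With a single stream the three phases (copy-in, compute, copy-out) serialize, which is the baseline the paper compares against; with several streams, GPUs that have separate copy engines can run one chunk's kernel while another chunk's data is still in flight, and the benefit grows or shrinks with the compute-to-transfer ratio the paper varies.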
