Superlinear Speedup in HPC Systems: why and when?
Sasko Ristov, Radu Prodan, Marjan Gusev, Karolj Skala
DOI: http://dx.doi.org/10.15439/2016F498
Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 889–898 (2016)
Abstract. In high-performance computing, speedup is usually bounded by two main laws: Amdahl's and Gustafson's. However, the speedup can sometimes reach far beyond the linear limit, a phenomenon known as superlinear speedup, in which the speedup is greater than the number of processors used. Although superlinear speedup is not a new concept and many authors have already reported its existence, most report it only as a side effect, without explaining why and how it happens. In this paper, we analyze several different types of superlinear speedup and define a taxonomy for them. Additionally, we present several explanations and cases of superlinearity for different types of granular algorithms (tasks), i.e., algorithms that can be divided into many sub-tasks and scattered across the processors for execution. Apart from the frequent explanation that the larger aggregate cache memory available in parallel execution is the main cause, we summarize other effects that lead to superlinearity, including superlinear speedup in cloud virtual environments under both vertical and horizontal scaling.
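The two classical bounds and the superlinearity condition mentioned in the abstract can be written down concisely. The sketch below is illustrative only (the function names and sample numbers are not from the paper): it encodes Amdahl's law for a fixed workload, Gustafson's law for a scaled workload, and the check that a measured speedup T(1)/T(p) exceeds the processor count p.

```python
def amdahl_speedup(f, p):
    """Amdahl's law: upper bound on speedup for a fixed-size workload,
    where f is the parallelizable fraction and p the processor count."""
    return 1.0 / ((1.0 - f) + f / p)

def gustafson_speedup(f, p):
    """Gustafson's law: scaled speedup when the workload grows with p,
    keeping the parallel execution time fixed."""
    return p - (1.0 - f) * (p - 1)

def is_superlinear(t1, tp, p):
    """Superlinear speedup: measured T(1)/T(p) is greater than p."""
    return t1 / tp > p

# Both laws keep the speedup at or below p; e.g. with f = 0.95:
print(amdahl_speedup(0.95, 8))     # about 5.93, well under 8
print(gustafson_speedup(0.95, 8))  # 7.65, still under 8

# A measured superlinear case: T(1) = 100 s, T(4) = 20 s gives S = 5 > 4,
# as may happen when each processor's working set suddenly fits in cache.
print(is_superlinear(100.0, 20.0, 4))
```

Under both models the speedup never exceeds p, because they assume the sequential work per unit of data is unchanged; superlinear measurements signal that this assumption is violated, for instance by cache effects.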