Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing
Morgan Geldenhuys, Ben Pfister, Dominik Scheinert, Lauritz Thamsen, Odej Kao
Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 553–561 (2022)
Abstract. Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results is dependent on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead.In this paper we present Khaos, a new approach which utilizes the parallel processing capabilities of cloud orchestration technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three subsequent phases which borrows from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented Khaos prototypically together with Apache Flink and demonstrate its usefulness experimentally.
- H. Isah, T. Abughofa, S. Mahfuz, D. Ajerla, F. H. Zulkernine, and S. Khan, “A survey of distributed data stream processing frameworks,” IEEE Access, vol. 7, 2019.
- H. Nasiri, S. Nasehi, and M. Goudarzi, “Evaluation of distributed stream processing frameworks for iot applications in smart cities,” J. Big Data, vol. 6, p. 52, 2019.
- H. Li, J. Wu, Z. Jiang, X. Li, X. Wei, and Y. Zhuang, “Integrated recovery and task allocation for stream processing,” in IPCCC. IEEE Computer Society, 2017.
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. V. Ryaboy, “Storm@twitter,” in MOD, C. E. Dyreson, F. Li, and M. T. Özsu, Eds. ACM, 2014.
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in HotCloud, E. M. Nahum and D. Xu, Eds. USENIX Association, 2010.
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache flinkTM: Stream and batch processing in a single engine,” IEEE Data Eng. Bull., vol. 38, no. 4, pp. 28–38, 2015.
- S. Jayasekara, A. Harwood, and S. Karunasekera, “A utilization model for optimization of checkpoint intervals in distributed stream processing systems,” Future Gener. Comput. Syst., vol. 110, pp. 68–79, 2020.
- J. W. Young, “A first order approximation to the optimum checkpoint interval,” Commun. ACM, vol. 17, pp. 530–531, 1974.
- J. Daly, “A model for predicting the optimum checkpoint interval for restart dumps,” in ICCS, ser. LNCS, P. M. A. Sloot, D. Abramson, A. V. Bogdanov, J. J. Dongarra, A. Y. Zomaya, and Y. E. Gorbachev, Eds., vol. 2660. Springer, 2003.
- J. T. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Gener. Comput. Syst., vol. 22, no. 3, 2006.
- M. K. Geldenhuys, L. Thamsen, and O. Kao, “Chiron: Optimizing fault tolerance in qos-aware distributed stream processing jobs,” in IEEE International Conference on Big Data, Big Data 2020, Atlanta, GA, USA, December 10-13, 2020. IEEE, 2020, pp. 434–440.
- L. A. Bautista-Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka, “Low-overhead diskless checkpoint for hybrid computing systems,” in HiPC. IEEE Computer Society, 2010.
- L. A. Bautista-Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, “Distributed diskless checkpoint for large scale systems,” in CCGrid. IEEE Computer Society, 2010.
- H. Li, L. Pang, and Z. Wang, “Two-level incremental checkpoint recovery scheme for reducing system total overheads,” PLoS ONE, vol. 9, 2014.
- A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, “Design, modeling, and evaluation of a scalable multi-level checkpointing system,” in SC. IEEE, 2010.
- L. A. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka, “FTI: high performance fault tolerance interface for hybrid systems,” in SC, S. A. Lathrop, J. Costa, and W. Kramer, Eds. ACM, 2011.
- A. Kulkarni, A. Manzanares, L. Ionkov, M. Lang, and A. Lumsdaine, “The design and implementation of a multi-level content-addressable checkpoint file system,” in HiPC. IEEE Computer Society, 2012.
- A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal, “Chaos engineering,” IEEE Softw., vol. 33, no. 3, pp. 35–41, 2016.
- P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas, “Lightweight asynchronous snapshots for distributed dataflows,” CoRR, vol. abs/1506.08603, 2015.
- M. K. Geldenhuys, D. Scheinert, O. Kao, and L. Thamsen, “Phoebe: Qos-aware distributed stream processing through anticipating dynamic workloads,” ICWS, 2022.
- S. Di, Y. Robert, F. Vivien, and F. Cappello, “Toward an optimal online checkpoint solution under a two-level HPC checkpoint model,” IEEE Trans. Parallel Distributed Syst., vol. 28, no. 1, pp. 244–259, 2017.
- T. Hérault, T. Largillier, S. Peyronnet, B. Quétier, F. Cappello, and M. Jan, “High accuracy failure injection in parallel and distributed systems using virtualization,” in CF, G. Johnson, C. Trinitis, G. Gaydadjiev, and A. V. Veidenbaum, Eds. ACM, 2009.
- G. Jacques-Silva, B. Gedik, H. Andrade, K. Wu, and R. K. Iyer, “Fault injection-based assessment of partial fault tolerance in stream processing applications,” in DEBS, D. M. Eyers, O. Etzion, A. Gal, S. B. Zdonik, and P. Vincent, Eds. ACM, 2011.
- C. Pham, L. Wang, B. Tak, S. Baset, C. Tang, Z. T. Kalbarczyk, and R. K. Iyer, “Failure diagnosis for distributed systems using targeted fault injection,” IEEE Trans. Parallel Distributed Syst., vol. 28, no. 2, pp. 503–516, 2017.
- A. Basiri, L. Hochstein, N. Jones, and H. Tucker, “Automating chaos experiments in production,” in ICSE, H. Sharp and M. Whalen, Eds. IEEE / ACM, 2019.
- A. Blohowiak, A. Basiri, L. Hochstein, and C. Rosenthal, “A platform for automating chaos experiments,” in ISSRE. IEEE Computer Society, 2016.
- F. Schmidt, F. Suri-Payer, A. Gulenko, M. Wallschläger, A. Acker, and O. Kao, “Unsupervised anomaly event detection for VNF service monitoring using multivariate online arima,” in CloudCom. IEEE Computer Society, 2018.
- A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, “Large-scale cluster management at google with borg,” in EuroSys, L. Réveillère, T. Harris, and M. Herlihy, Eds. ACM, 2015.
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The hadoop distributed file system,” in MSST, M. G. Khatib, X. He, and M. Factor, Eds. IEEE Computer Society, 2010.
- P. Á. López, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. WieBner, “Microscopic traffic simulation using SUMO,” in ITSC, W. Zhang, A. M. Bayen, J. J. S. Medina, and M. J. Barth, Eds. IEEE, 2018.