Model-Agnostic Machine Learning Model Updating – A Case Study on a real-world Application
Julia Poray, Bogdan Franczyk, Thomas Heller
DOI: http://dx.doi.org/10.15439/2024F4426
Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 157–167 (2024)
Abstract. Applying scientific developments in the real world is the ultimate aim of all research. In Data Science and Machine Learning, this means additional tasks must be addressed beyond the rather academic part of "just" building a model from the available data. In the widely accepted Cross-Industry Standard Process for Data Mining (CRISP-DM), one of these tasks is the maintenance of the deployed application. This task can be extremely important, since in real-world applications model performance often decreases over time, usually due to Concept Drift. This directly leads to the need to adapt or update the deployed Machine Learning model. In this work, available model-agnostic model update methods are evaluated on a real-world industrial application: Virtual Metrology in semiconductor fabrication. The results show that sliding-window techniques performed best for this use case. The models used in the experiments were an XGBoost model and a Neural Network. For the Neural Network, Model-Agnostic Meta-Learning and Learning to Learn by Gradient Descent by Gradient Descent were applied as update techniques (among others) and showed no improvement over the baseline of not updating the Neural Network. The implementation of the update techniques was validated on an artificial use case, for which they worked well.
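To illustrate the best-performing approach mentioned in the abstract, the sketch below shows how a sliding-window update can be realized for a regression model: the model is periodically refitted on only the most recent labelled samples, so that older data affected by Concept Drift is forgotten. This is a minimal illustrative sketch, not the authors' implementation; the window size, retraining interval, and the use of xgboost.XGBRegressor are assumptions chosen for the example.

```python
# Minimal sketch of a sliding-window model update for a regression task.
# Hypothetical parameters: WINDOW_SIZE and RETRAIN_EVERY are illustrative,
# not values taken from the paper.
from collections import deque

import numpy as np
import xgboost as xgb

WINDOW_SIZE = 500    # keep only the most recent 500 labelled samples
RETRAIN_EVERY = 50   # refit the model after every 50 new samples

window_X = deque(maxlen=WINDOW_SIZE)   # oldest samples drop out automatically
window_y = deque(maxlen=WINDOW_SIZE)
model = xgb.XGBRegressor(n_estimators=100, max_depth=4)
fitted = False
seen = 0

def process_sample(x, y_true):
    """Predict for one incoming sample, then add it to the window and
    periodically refit the model on the current window contents."""
    global fitted, seen
    # Predict with the current model once it has been trained at least once.
    y_pred = float(model.predict(np.asarray([x]))[0]) if fitted else None
    window_X.append(x)
    window_y.append(y_true)
    seen += 1
    # Refit from scratch on the most recent WINDOW_SIZE samples at a fixed interval.
    if len(window_X) == WINDOW_SIZE and seen % RETRAIN_EVERY == 0:
        model.fit(np.asarray(window_X), np.asarray(window_y))
        fitted = True
    return y_pred
```

Variants of this idea differ mainly in how the window size is chosen (fixed versus drift-triggered) and in whether the model is retrained from scratch or updated incrementally on the new samples.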