Considering various aspects of models’ quality in the ML pipeline - application in the logistics sector
Eyad Kannout, Michał Grodzki, Marek Grzegorowski
DOI: http://dx.doi.org/10.15439/2022F296
Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 403–412 (2022)
Abstract. Industrial machine learning applications today involve developing and deploying MLOps pipelines that keep forecasting models versatile in quality over an extended period, simultaneously assuring the model's accuracy, stability, short training time, and resilience. In this study, we present an ML pipeline that addresses all of these aspects of model quality, formulated as a constrained multi-objective optimization problem. We also provide a reference implementation of state-of-the-art methods for data preprocessing, feature extraction, dimensionality reduction, feature and instance selection, model fitting, and ensemble blending. An experimental study on a real data set from the logistics industry confirmed the qualities of the proposed approach, as did successful participation in an international data competition.
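To make the phrase "constrained multi-objective optimization" concrete, the following minimal sketch selects non-dominated pipeline configurations under a training-time budget. It is an illustration only, not the paper's implementation: the `Candidate` fields, the three objectives (resilience is left out for brevity), the budget value, and all numbers are hypothetical assumptions.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical pipeline configuration with its measured qualities."""
    name: str
    error: float          # validation error, lower is better
    instability: float    # e.g. spread of error across folds, lower is better
    train_seconds: float  # training time, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    objs_a = (a.error, a.instability, a.train_seconds)
    objs_b = (b.error, b.instability, b.train_seconds)
    no_worse = all(x <= y for x, y in zip(objs_a, objs_b))
    strictly_better = any(x < y for x, y in zip(objs_a, objs_b))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate],
                 max_train_seconds: float) -> List[Candidate]:
    """Drop candidates violating the constraint, then keep non-dominated ones."""
    feasible = [c for c in candidates if c.train_seconds <= max_train_seconds]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


# Made-up example values, for illustration only.
candidates = [
    Candidate("A", 0.20, 0.05, 30.0),
    Candidate("B", 0.18, 0.07, 45.0),
    Candidate("C", 0.25, 0.06, 40.0),   # dominated by A on all objectives
    Candidate("D", 0.15, 0.04, 400.0),  # exceeds the training-time budget
]
front = pareto_front(candidates, max_train_seconds=120.0)
front_names = [c.name for c in front]
print(front_names)
```

In this toy setting, candidates A and B trade error against stability and training time, so neither dominates the other; choosing between them would require an additional preference, such as a scalarization of the objectives.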