Beating Gradient Boosting: Target-Guided Binning for Massively Scalable Classification in Real-Time
Dymitr Ruta, Ming Liu, Ling Cen
DOI: http://dx.doi.org/10.15439/2023F7166
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1301–1306 (2023)
Abstract. Gradient Boosting (GB) consistently outperforms other ML predictors, especially in binary classification based on multi-modal data of different forms and types. Its newest efficient implementations, including XGBoost, LGBM and CatBoost, push GB even further ahead with fast GPU-accelerated compute engines and optimized handling of categorical features. In an attempt to beat GB in both performance and processing speed, we propose a new simple yet fast and very robust classification model based on predictive binning. First, all features undergo massively parallelized binning into a unified, ordinal, compressed (uint8) risk representation, each independently guided by the target and optimized to maximize the AUC score against it. The resultant array of summarized micro-predictors, resembling 0-depth decision trees that ordinally express the target risk, is then passed through greedy feature selection to compose a robust wide-margin voting classifier. Its performance can beat GB, while its extremely fast build and execution, along with the highly compressed representation, welcome extreme data sizes and real-time applicability. The model was applied to detect cyber-security attacks on IoT devices within the FedCSIS'2023 Challenge and scored 2nd place with AUC≈1, leaving all the latest GB variants behind in both performance and speed.
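The per-feature binning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `risk_binning`, the quantile-based choice of bin edges, and the fixed bin count are assumptions made here for simplicity, whereas the paper optimizes the binning directly against the AUC score of each feature versus the target.

```python
import numpy as np

def risk_binning(x, y, n_bins=16):
    """Encode one numeric feature as a uint8 ordinal risk code.

    Bins are ranked by their empirical target rate so the resulting
    code increases monotonically with observed risk, acting like a
    0-depth decision tree over that single feature.
    """
    # Quantile-based bin edges (an assumed proxy; the paper tunes
    # the binning to maximize AUC against the target directly).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_id = np.searchsorted(edges, x, side="right")
    # Empirical target rate per bin (0.0 for empty bins).
    rates = np.array([y[bin_id == b].mean() if np.any(bin_id == b) else 0.0
                      for b in range(n_bins)])
    # Rank bins by risk and recode each sample with its bin's rank.
    ranks = np.argsort(np.argsort(rates))
    return ranks[bin_id].astype(np.uint8)
```

Each such code fits in a single byte, so a wide table of features compresses to one uint8 column per feature, and the ordinal codes of the selected features can be summed or averaged to form the wide-margin voting score.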
References
- F. Alaba, M. Othman, I. Hashem, F. Alotaibi. Internet of Things security: A survey, Journal of Network and Computer Applications 88:10-28, 2017.
- M. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, A. Sheth, Machine learning for internet of things data analysis: a survey, Digital Communications and Networks, 2018.
- M.A. Al-Garadi, A. Mohamed, A. Ali, X. Du, I. Ali and M. Guizani. A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security, IEEE Communications Surveys & Tutorials, 2020.
- L. Mason, J. Baxter, P.L. Bartlett, and M. Frean. Boosting Algorithms as Gradient Descent. In S.A. Solla, T.K. Leen and K. Müller (eds): Advances in Neural Information Processing Systems 12: 512–518, MIT Press, 1999.
- J.H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5): 1189-1232, 2001.
- D. Ruta, M. Liu, L. Cen. Feature Engineering for Prediction of Frags in Tactical Games. Proc. 2023 IEEE Int. Conf. on Multimedia and Expo, 2023.
- D. Ruta, M. Liu, L. Cen and Q. Hieu Vu. Diversified gradient boosting ensembles for prediction of the cost of forwarding contracts. Proc. 17th Int. Conf. on Computer Science and Intelligence Sys., pp 431-436, 2022.
- D. Ruta, L. Cen, M. Liu and Q. Hieu Vu. Automated feature engineering for prediction of victories in online computer games. Proc. Int. Conf on Big Data, pp 5672-5678, 2021.
- Q. Hieu Vu, D. Ruta, L. Cen and M. Liu. A combination of general and specific models to predict victories in video games. Proc. Int. Conf. on Big Data, pp 5683-5690, 2021.
- D. Ruta, L. Cen and Q. Hieu Vu. Deep Bi-Directional LSTM Networks for Device Workload Forecasting. Proc. 15th Int. Conf. Comp. Science and Inf. Sys., pp 115-118, 2020.
- D. Ruta, L. Cen, Q. Hieu Vu. Greedy Incremental Support Vector Regression. Proc. Fed. Conf. on Comp. Sci. and Inf. Sys., pp 7-9, 2019.
- Q. Hieu Vu, D. Ruta and L. Cen. Gradient boosting decision trees for cyber-security threats detection based on network events logs. Proc. IEEE Int. Conf. Big Data, pp 5921-5928, 2019.
- Q. Hieu Vu, D. Ruta, A. Ruta and L. Cen. Predicting Win-rates of Hearthstone Decks: Models and Features that Won AAIA’2018 Data Mining Challenge. Int. Symp. Advances in Artificial Intelligence and Apps (AAIA), pp 197-200, 2018.
- F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin 1:80–83, 1945.
- H.B. Mann, D.R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18: 50–60, 1947.
- C.X. Ling, J. Huang, and H. Zhang. AUC: a statistically consistent and more discriminating measure than accuracy. In Int. Joint Conf. on Artificial Intelligence, pp 519–526, 2003.
- Z. Yang, Q. Xu, S. Bao, Y. He, X. Cao, Q. Huang. Optimizing Two-way Partial AUC with an End-to-end Framework. IEEE Trans. on Pattern Analysis and Machine Intelligence 45:10228-10246, 2023.
- L. Prokhorenkova, G. Gusev, A. Vorobev, A.V. Dorogush and A. Gulin. CatBoost: unbiased boosting with categorical features. In S. Bengio et al. (eds.): Advances in Neural Information Processing Systems 31:6638–6648, Curran Associates, Inc., 2018.