Considering various aspects of models’ quality in the ML pipeline - application in the logistics sector

The industrial machine learning applications today involve developing and deploying MLOps pipelines to ensure the versatile quality of forecasting models over an extended period, simultaneously assuring the model’s accuracy, stability, short training time, and resilience. In this study, we present the ML pipeline conforming to all the abovementioned aspects of models’ quality formulated as a constrained multi-objective optimization problem. We also provide the reference implementation on state-of-the-art methods for data preprocessing, feature extraction, dimensionality reduction, feature and instance selection, model fitting, and ensemble blending. The experimental study on the real data set from the logistics industry confirmed the qualities of the proposed approach, as the successful participation in an international data competition did.


I. INTRODUCTION
M ACHINE learning (ML) algorithms are widely used within decision-support [1] or recommender systems [2] in many branches of the industry, like fast-moving consumer goods (FMCG) [3], e-commerce [4], logistics [5], or even hard-coal mining [6]. However, as ML models continue to run in production environments for an extended period, new expectations and concerns have also arisen. Some time ago, data scientists were expected to deliver a fine-tuned model, today, attention is paid to building ML operations (MLOps) pipelines responsible for continuous monitoring and ensuring the quality of the developed models during their functioning [7], [8]. It is also worth paying attention to the ongoing shift in the quality assurance of models' predictions, which are no longer limited to optimizing a single measure, such as accuracy or root mean square error (RMSE). It should cover more models' characteristics like stability [9], resilience [10], or interpretability [11].
The fundamental task of data analysis is to represent the data accordingly to the investigated problem. The selection of appropriate criteria for assessing the quality of generated predictions is no less important. The choice of such measure primarily depends on the nature of the problem, e.g., there are different ones for classification and regression. The quality measure we choose in the model optimization process will have a crucial impact on its performance. Once decided, we can rely on AutoML meta-learning to ignite a versatile exploration of several learning algorithms meanwhile optimizing their parameters, which, however, is very costly and timeconsuming [12]. Whereas, over-optimizing a single measure in many applications is simply unnecessary. In particular, further tuning a model of sufficient quality may lead to overfitting, increase complexity, and reduce interpretability, not to mention the longer learning time and the increased cost of computing resources. Furthermore, in many cases, optimizing a single quality measure is insufficient. Reaching the optimal regression model according to the RMSE, which meanwhile is vulnerable to data deficiencies (e.g., unavailability of selected attributes), is pointless. One of the ways we may address those concerns is to refer to the multi-objective optimization (MOO) [13]. However, as the result of MOO, we do not end up with a single solution but many Pareto optimal models. Selecting the best one is still a complex and time-consuming task related to the multi-criteria decision aiding (MCDA) [14].
In response to the above expectations and challenges, let us present the ML pipeline for training forecasting models that allows optimizing not only a single quality measure, such as RMSE, accuracy, or F1-score, but also taking into account the robustness and resilience of the ensemble blended. The developed pipeline assumes that during the training procedure, the goal is not to optimize the model over days to achieve even a minimal quality improvement on a single error measure but to adhere to many business expectations possibly fast. Accordingly, we define the task as a MOO and adapt ·constrained scalarization for the investigated criteria [14]. By referring to quality thresholds that correspond directly to business expectations, we could significantly limit the time of model fitting (from days to seconds), which obviously determines the lower cost of cloud computing resources [15]. Furthermore, the adopted principle of building a model on random subsets of attributes and rows allows to achieve a variety of different models within an ensemble [16], [17]. Such an approach to training set selection enables a very straightforward parallelization of the learning procedure and hence provides significant acceleration of computation [18].
Yet another material aspect of the developed solution is the proposed feature extraction mechanism. The method is composed of several steps. Firstly, we combine available data sources into one flat file and aggregate the one-to-many relations with the common SELECT . . . GROU P BY SQL-based approach to extract some generic statistics. Later, we use feature extraction methods like one-hot encoding and ordinal coding to obtain a numerical representation of the data. This way, we achieve a sparse data representation, which poses a big problem for the boosting tree algorithms by impacting the quality of their cuts on the attributes. Such a situation imposes the construction of deeper trees, making generalization difficult and leading to over-fitting. Therefore, after encoding a given feature, we apply one of the most popular dimensionality reduction methods -principal component analysis (PCA) -to use the first few components.
To show the particular qualities of our solution, we present a case study in the logistics industry for predicting costs associated with forwarding contracts. For this purpose, we used three data sets from the machine learning contest organized on the KnowledgePit.ml platform [19], which we combined together, preprocessed, and analyzed with the developed solution. In the conducted research, we assume that the acceptable level of the prediction error measured with the RMSE measure should not exceed 2.5% of the average cost of forwarding contracts in data that corresponds to RMSE of approx 0.17. We also assume the robustness threshold of 0.02, understood as the maximal acceptable difference of RMSE achieved by the model on the training and validation set during the training procedure. Furthermore, we set the resilience threshold so the constructed ensemble should consist of at least 10 models. This way, we could provide reliable forecasts even if some of the models within the ensemble became unreliable and could impair the overall prediction quality of the ensemble. Such a situation may occur in production environments, e.g., due to a software error or unavailability of the critical attributes for this model.
The main contributions of this paper are as follows: 1) The ML pipeline considering various aspects of models' quality formulated as a constrained multi-objective optimization problem. 2) The complete reference implementation of the ML pipeline providing methods for preprocessing, feature extraction, dimensionality reduction, feature and instance selection, model fitting, and ensemble blending.
3) The experimental study on the real data set from the logistics industry that confirmed several qualities of the proposed approach, including small prediction error (RMSE), robustness to over-fitting, fast computing time, and resilience. The rest of the paper is organized as follows. In Section II, we review the related literature. Section III provides a complete reference for the developed ML pipeline. In Section IV, we describe in detail the experiments conducted in this study including the description of the data, experimental setup, and the results. Finally, in Section V, we draw conclusions and suggest possible future research directions.

II. RELATED WORKS
Due to the general availability and affordability of cloud services [15], and the proven effectiveness of machine learning [5], [17], modern enterprises massively automate their processes and optimize decision-making with intelligent use of the collected data [3]. This trend is beneficial to many industries, including supply management and logistics [20]. Let us pay special attention to international freight transportation, which is related to moving goods between countries and may involve many stakeholders: shippers, carriers, forwarders, third-party logistics services, and customs of two or more countries for each movement [21]. In this context, machine learning is seen as one of the primary enablers for the dynamic development of enterprises, allowing for apt data-driven decisions, including route planning, travel time prediction, vehicle scheduling, estimated time of arrival, and foremost accurately predicting costs related to the execution of forwarding contracts [22], [23].
We can model this task as a regression of the forwarding contract costs conditioned by the attributes of orders, such as the type of order, basic characteristics of the shipped goods (e.g., dimensions, special requirements), and the expected route that a driver will have to cover. Among the ML algorithms commonly applied to solve the regression problems, we may point out eXtreme gradient boosting trees (XGBoost), deep neural networks, or support vector machines [21], [24]. Considering the industry specifics and the dynamics of changes in the business and technological environment, the developed data-driven decision-making system should promptly adapt to changes and operate reliably even in the event of data deficiencies. One of the ways to simultaneously address several potentially conflicting concerns is multiobjective optimization (MOO) [13].
Classically, MOO problems are often solved using scalarization techniques. In brief, scalarization means that the objective functions are aggregated (or reformulated as constraints), and then a single-objective problem is solved [25]. However, this method requires defining the perfect balance between objectives' importance. Another possibility to solve such a problem is to rely on Pareto front (PF) methods. For instance, the ÷-constraint approach can obtain a set of PF solutions by keeping only one objective and subdividing the others into several segments with some thresholds. Here, we do not end up with a single solution but potentially many models, and selecting the optimal one requires further effort [14]. In the proposed framework, we refer to ÷-constraint filtering, but instead of choosing a single model, we blend the ensemble of several solutions [17]. This way, we not only avoid the multi-criteria decision task but also introduce the additional resilience level to our solution [10].
Among the popular ensembling techniques, we may mention random forest and XGBoost. These approaches of blending tree models minimize the regression (or decision) trees' tendency of overfitting, hence, ensuring better robustness and stability [9], [24]. The stability, RMSE, and resilience can be further improved by ensuring that the trained models in the final ensemble are relatively different from each other. One way to do this is to train models on diverse subsets of objects and attributes [17]. The training set selection, complemented by parallelization of computation, can lead to better general- ization and minimize the overall training latency [16], [18]. It is usually essential to ingest several diverse data sources to provide adequately rich data representation with many different attributes. However, models such as XGBoost or deep neural networks operate on numerical features, whereas in most databases, we will also find some other data types.
A typical approach, in this case, would be to apply feature extraction, in particular, to encode each feature in numerical form [26]. One of the weaknesses of this approach may be the creation of a very sparse data representation related to the one-hot encoding of categorical features, which in turn can have a negative impact on tree models' performance.
Furthermore, as the number of features increases, the model training takes far more time and consumes more resources. There are many methods allowing to project or embed the data into a lower-dimensional space while retaining as much information as possible, just to mention independent component analysis, multidimensional scaling, or principal component analysis (PCA) [26]. The central idea of PCA is to reduce the dimensionality retaining as much as possible of the variation present in the data [27]. In the proposed framework, we refer to all the above-mentioned techniques. Furthermore, putting the things together, we propose the end-to-end ML pipeline adhering to the ML operations paradigm that ensures the solution's quality over time.

III. SOLUTION OVERVIEW
The developed ML pipeline consisted of several stages related to data ingestion, preprocessing, extending representation, reducing dimensionality, robust model training, and resilient ensemble blending. The first step in the developed pipeline is data ingestion and integration. Next, we perform data cleansing, encoding categorical variables to the numeric form, extracting some custom characteristics from text columns, imputing missing values, and conducting further feature extraction (FE) [18], [28].
FE addresses the problem of finding the most compact and informative data representation and is fundamental for every ML pipeline. The importance of proper data representation was aptly identified by Pedro Domingos: "At the end of the day, some machine learning projects succeed, and some fail. What makes the difference? Easily the most important factor is the features used". Using more relevant data sources and better knowledge representation may have a crucial impact on the final model quality. Therefore we plan to further extend the developed FE methods by introducing histogram-based feature engineering [29]. Among other relevant, recently reported approaches that could be in the future implemented in our ML framework, we may indicate embedding selected statistics from survival analysis or features derived from deep learning methods into data representation [5], [30].
These activities sometimes require additional effort that may depend on the data. Hence, they may not always be fully automated. For example, in the discussed case study of forwarding contracts, we ingest three data sources: css_main, css_routes, and fuel_prices tables (cf. Section IV-A). However, to integrate css_routes, we have to aggregate the data first. We execute this with the aggregating query: SELECT AVG(.), MIN(.), MAX(.), SUM(.), COUNT(.) ... GROUP BY id_contract for each interesting variable in the table. Considering that the attributes obtained in this way are not intended for financial settlements but to feed the machine learning procedure, a vital extension of our approach would be to relay on approximated results of SQL instead of exact ones [31]. Approximate query engines can generate summaries of Big Data sets much faster with only a slight loss in precision, which may be negligible in model generalization [32], [33].
Instead of using all the SQL results for fitting the predictive model, the outcomes of the aggregation queries are transformed with PCA, and several first components are integrated into the main data. The schematic view of this process, related to the discussed case study, you may find on the left part of Figure 1. In general, whenever the encoding of categorical features into numerics significantly increases data sparsity, the derived variables are encoded with PCA, and a few first components are kept. At the same time, the rest may be omitted. To provide an example, in Figure 2, we present the variance explained by the first 10 principal components (PC) of one-hot encoded first_load_country attribute.
The central part of the pipeline is selecting the training set, i.e., features and instances, to achieve the best performance and robustness of the models. In the developed approach, we iteratively draw a subset of attributes and instances as a ratio of the original data controlled by two thresholds: É r and É c (cf. Algorithm 1). In the next step, we train the selected predictive model (e.g., XGBoost or LightGBM) [34]. Since we fit the predictive model to significantly smaller data chunks, we naturally minimize the time of this process. The selection of training subsets is random and independent from each other. Hence, this process may also be easily parallelized, e.g., by drawing several subset candidates simultaneously and training the models in parallel. We also see a potential to extend our framework with the heuristic search over the subsets of attributes and features to reach the optimal quality (i.e., minimalize reported error) faster [35]- [37], and to introduce more advanced feature selection techniques [26]. In this context, granular feature selection techniques could be a perfect fit [38], including r-C-reducts, bi-reducts [39], or reviving the concept of dynamic reducts [40]. Besides more advanced feature selection algorithms, instance selection has space for improvement as well. Ordering the records by date and considering only the newest instances while drawing the subsets for training data is one of many possible ideas. Combining those in an ensemble could yield interesting results.

Data:
-dataTrain, dataTest -training and test data θ -acceptable RMSE threshold ϑ -stability threshold for RMSE ρ -expected resilience level ωr and ωc -instance and feature selection thresholds score()-quality measure, e.g., RMSE -N -maximal number of unsuccesful attempts Result: ensemble of models / * Initialization * / ensemble ← ∅; k ← 0 validationSet ← dataT rain.sample dataT rain ← dataT rain \ validationSet while |ensemble| < ρ ∧ k < N do trainSet ← draw ωr rows and ωc cols from dataTrain presented in Algorithm 1. In particular, the primary objective function is to minimize an error measure, e.g., RMSE. We guarantee that each of the trained models yields an error lower than the predefined threshold ¹ on validation data. Furthermore, we introduce a stability threshold Ó to avoid overfitting. We calculate it as an absolute difference between errors reported on training and validation sets in each round. The best models are blended to form an ensemble of possibly diverse models. The additional parameter Ä determines the number of models within the ensemble to provide a certain resilience level in case some o the models were put out of action.
In the future, we plan to extend a multicriteria evaluation with a specific approach to assure difference between the models explicitly. One of the possible solutions could be measuring the distances between the reported scores on a validation set. Alternatively, we may assure the feature importance rankings reported by models are possibly dissimilar (cf. Figures 4). Another approach to construct ensembles of possibly diverse models could be achieved by referring to r-C-reducts on nonoverlapping feature subsets or by ensuring constraints between attribute sets [10], [41]. A combination of the above techniques would also be an exciting future research direction, primarily since many attributes are somehow related, 406 PROCEEDINGS OF THE FEDCSIS. SOFIA, BULGARIA, 2022 e.g., principal components generated from the same original features or all the columns in databases referring to the same source of knowledge. Combining these techniques with stateof-the-art instance selection and training set selection methods would also be of great importance [42].
Both in wrapper search and model training, we focused mainly on XGBoost [43]. However, we also complemented it with an evaluation of LightGBM [34]. We conducted grid search model parameter tuning for best performing data subsets in the feature selection stage This part was conducted iteratively and was alternated with the wrapper-based feature selection. Such an iterative approach allowed us to, over time, fine-tune the wrappers' configuration. In the future, we plan to continue experiments with more advanced techniques of searching the hyper-parameter space in order to optimize the model faster and more efficiently [44].

A. Data
The data sets contain 6 years of history of orders appearing on the transport exchange, along with details such as the type of order, basic characteristics of the shipped goods (e.g., dimensions, special requirements), as well as the expected route that a driver will have to cover (cf. Table I). In particular, the training data consist of two tables: css_main_training.csv and css_routes_training.csv, and the additional data table containing historical wholesale fuel prices for the period of training and test data. The first file (i.e., css_main_training.csv) contains basic information about the contracts, and the second one (i.e., css_routes_training.csv) describes the main sections of the planned routes associated with each contract. In both tables, the first column (i.e., id_contract) contains identifiers that allow matching records from css_main_training.csv and css_routes_training.csv files. Additionally, the second column in the css_main_training.csv file (i.e., expenses) contains information about the prediction target. Values in this column are available only for the training data.

B. Task and experimental setup
In our study, we investigated the task of predicting the costs related to the execution of forwarding contracts, which was defined within the 8th data mining competition organized online on the KnowledgePit platform in association with the Federated Conference on Computer Science and Information Systems (FedCSIS '22). The task is to design an accurate method for regression of costs associated with forwarding contracts [45], based on contract data and planned routes (cf. Section IV-A). The quality of predictions was evaluated with the RMSE measure. The experiments were also planned to validate the relation between training speed and the size of training data, controlled with the É r and É c parameters.
Many threats may impact the models' performance or impede their operations, including missing data or software errors. Therefore, the experimental setup was designed to also evaluate the resilience of the final solution, understood as the RMSE error achieved when some models within the ensemble cannot be applied (e.g., due to a software error or missing data attributes). However, up to our knowledge, there is no established methodology allowing us to assess the resilience of the ML models. One of the approaches could be to randomly delete subsets of test data, e.g., by dropping particular columns. It is, however, not straightforward how to implement this kind of test. Shall we randomly drop a certain percent (e.g., 5% or 10%) or number (3 or 5) of all columns? For data sets containing 20000 attributes, such operation may have minimal impact on predictive models [6], [17]. Or shall we drop the model's most important feature(s)? Different approaches would be preferable for other modeling techniques. Consider a random forest that relies on many redundant weak tree models, XGBoost where particular predictors are boosting the formerly selected ones, or a single tree that depends on just a few attributes with one surrogate or verifying cut per split  [46]. In our case, we decided to implement a straightforward approach that may be dedicated to the ensembling techniques. Namely, we were dropping randomly selected models from the ensemble to measure the impact of such an operation on the RMSE. In the conducted experiments, we used the features extraction, dimensionality reduction, model training, and resilient ensemble blending method, as described in Section III. As the base model, we used XGBoost a [24]. We optimized the model parameters only once on the selected subset of data with the grid search procedure, and since it was not the major point of our research, we kept those parameters later unchanged. The É r and É c parameters were set to 0.5, meaning that each of the xgbt models within the ensemble was trained on a random subset of 50% features and 50% instances from the training set. The expected model quality threshold ¹ was set to 0.17, which corresponds to 2.5% of the median expenses in data (cf. Figures 3). The robustness threshold Ó was 0.02, and the resilience threshold Ä was 10. To visualize the results, we use box plots -a standard way of displaying data distribution by encoding their five key characteristics: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. For the purpose of evaluation, we split the data into three sets: training, validation, and test. The first two constituted a part of the training procedure. The last was used only for evaluation. Furthermore, we present the results achieved by our method within the "FedCSIS 2022 Challenge: Predicting the Costs of Forwarding Contracts" on the preliminary and final competition data. Figure 5 shows how the values of RMSE are spread out for ensemble models on the training, validation, and test data. The results show that RMSE for training, validation, and test dataset are very close to each other avoiding model overfitting. This confirmed that the ensemble yields not only satisfactory quality but also guarantees high robustness and stability, which is especially visible between validation and test sets (cf. Figure 5). This confirms the effectiveness of the multicriterial evaluation, such as acceptable RMSE threshold ¹ and stability threshold for RMSE Ó, which are applied while selecting the ensemble models. When we compare the results achieved by our method within our experimental setup with the results achieved during the FedCSIS 2022 Challenge: Predicting the Costs of Forwarding Contracts, we may notice it yielded very similar results on both preliminary and final data sets, 0.165 and 0.161, respectively. The results confirm the predictability, repeatability, and stability of the proposed method.

C. Results
Next, we evaluate the resilience of our solution. In this experiment, we randomly deleted 1, 3, 5, 7, and 9 models from the ensemble to simulate the scenario when some models would be unavailable. Then, we calculated RMSE against the test dataset. We repeated the procedure 10 times and plotted the results in Figure 6. The results show that RMSE is slightly decreasing when deleting some models from the ensemble. However, the largest impact was spotted when deleting the majority of the models (9 out of 10). In fact, this leads us to investigate the most important features which are very correlated or have a high impact on the target variable. Therefore, we calculated the F-score based on the Information Gain (IG) measure. It is worth noting that IG in the decision/regression tree-based models is a measure of how much information a feature provides about the target feature. Figure 4 shows the relative importance of features for two selected XGBoost models from the ensemble. The values on the x-axis show the average gain for the top ten features across all splits where those features were used. Observably, the proposed training set selection procedure allowed us to train several significantly different models that rely on different features, which leveraged ensemble smoothing and enabled the high resilience of the final solution. Considering the importance of features for 10 XG-Boost models constituting the final solution (cf. Fig 4), we may notice that only seven features were relatively impactful (i.e., F-score > 100) for more than one model, namely: diff_start_end_days, diff_start_end_weeks, direc-tion_PC1, km_nonempty, km_total, last_unload_country_PC2, last_unload_country_PC5.
For some of the potential threats, like data deficiencies, we may notice that several attributes were derived from the same (or similar) sources of information, e.g., features derived from the start and end dates or principal components of one-hot-encoded last_unload_country. Thus, in such a case,  applying a more formal methodology for resilience assessment would be advisable and may be considered a valid future research direction. Another valuable extension of the proposed framework would be to include information from experts in the machine learning process to ensure that the features which, according to the experts, have high importance are selected in at least some of the ensemble components [26], [47].
Finally, we focus on measuring the speed and costeffectiveness of the solution. This factor has recently gained a lot of attention because, in many practical applications of ML, like recommender or threat detection systems, the predictions are highly influenced by the most recent data. Thus, the model must be continually retrained to consider the most recent information, and it is very important to make a balance between speed and accuracy. In our proposed solution, this can be achieved by training data sets of randomly selected chunks of data. Figure 7 shows the time taken for building the XGBoost model using 100%, 75%, 50%, and 25% of data rows (cf. É r parameter) in the training data set, with the unchanged model hyper-parameters. This experiment was repeated 10 times, considering different (randomly chosen) data rows in each run. Furthermore, in Figure 8, we also plot RMSE each XGBoost model would achieve in FedCSIS'22 data competition. We may notice that depending on the data subset, the final model quality varies and slightly decreases along with the declining size of training data chunks.
V. SUMMARY The industrial machine learning applications today involve the development and deployment of MLOps pipelines, which consist of automated activities that were once manually performed by data analysts, including data ingestion and preprocessing, feature extraction and selection, model fitting, etc. These solutions are designed to ensure the quality of predictions during the production use of forecasting models, which may be months or even years. Quality assurance in such a long period requires the development of a repeatable  learning procedure to (re)train the models in case of shifts or drifts in data. Furthermore, apart from confirming the model's accuracy, it is expected to assure many quality criteria, such as stability, resilience, and low computational cost, in a fast and reliable way.
In this study, we present the ML pipeline that considers several qualities of the models in a multicriterial manner. Besides assuring RMSE optimization, our solution also ensures robustness, resilience, and instant (re)training time. The developed pipeline consists of several states, including preprocessing, feature extraction, dimensionality reduction, robust model training, and resilient ensemble blending. In this paper, we also elaborate on several promising future research directions, including applying more advanced features and instances selection techniques, incorporating experts' knowledge into the machine learning processes, ensuring the ensemble diversity more explicitly, or providing a formal methodology to assess the resilience of the predictive models.
We confirmed the qualities of our pipeline with the versatile experimentation on the real data from international freight forwarders and by participating in an international data mining competition organized along to FedSCSIS'22 conference. The achieved RMSE is comparable to the best and most complex models reported by 135 teams from 24 countries in the FedC-SIS contest, meanwhile conforming to more requirements. We may conclude that the proposed solution provides high-quality results with excellent resilience and stability, and the models are developed within seconds of training on low-cost compute resources. In the future, we plan to augment the developed framework with the discussed extensions and subject it to in-depth experimental analysis on a more significant number of real data sets from various fields, including the mining industry [17], [18], fire service [28], FMCG [3], cloud resource management [15], and for predicting escalations in customer support [48]. We believe that the developed approach will be equally effective in all those applications.