KnowledgePit Meets BrightBox: A Step Toward Insightful Investigation of the Results of Data Science Competitions

Abstract: We discuss the benefits of integrating the KnowledgePit data science competition platform with the BrightBox technology aimed at diagnostics of machine learning models embedded within complex software systems. We briefly recall the history of international challenges held at KnowledgePit and we also discuss in what sense such technologies as BrightBox can be helpful during the post-challenge analysis. In particular, we show how to combine solutions submitted by the competition participants in order to obtain even more accurate predictions. The discussed functionalities are of significant importance for the sponsors and organizers of data science / machine learning online contests because they support adoption of submissions while designing ultimate solutions of real-world problems.
Index Terms: Data science competitions; machine learning; model stacking; KnowledgePit platform; BrightBox technology

I. INTRODUCTION
KnowledgePit is an online platform for organizing data science / machine learning (ML) challenges. Its architecture was first presented in [1], and it has been improved continually since then. Currently, KnowledgePit combines the functionalities of a typical competition platform, such as Kaggle, with additional tools that let competition sponsors and organizers investigate the submitted solutions with respect to their true usefulness in the corresponding real-life decision problems. These tools are available thanks to integrating KnowledgePit with BrightBox, the technology developed by QED Software for assessing decision models based on an analysis of the mistakes they make [2]. In this paper, we discuss one such functionality, designed at the border of KnowledgePit and BrightBox, which lets us create better models by mixing solutions acquired from the competition participants.
The paper is organized as follows: In Section II, we recall the main ideas behind KnowledgePit and, as an illustration, we report the history of the KnowledgePit contests held in cooperation with the FedCSIS conference series. Analogously, Section III introduces the main ideas behind BrightBox, with special emphasis on its contributions to KnowledgePit's functionality. Section IV refers to the data science challenge associated with this year's FedCSIS conference. Besides describing the competition itself, we include some KnowledgePit-supported visualizations that can be helpful for sponsors and organizers. In Section V, we explain our aforementioned idea of mixing the competition solutions and report the experimental results obtained for the challenge outlined in Section IV. Section VI concludes the paper.
This research was co-funded by the Polish National Centre for Research and Development in the frame of project MAZOWSZE/0198/19.

II. THE HISTORY OF KNOWLEDGEPIT
Platforms such as Kaggle attract thousands of data scientists to participate in challenges aimed at solving real-life problems. Such challenges not only address specific problems but often facilitate innovative applications of ML algorithms. On the one hand, they are appealing to those for whom competitive challenges can be a source of new and interesting research topics. They can also be an attractive addition to academic courses for students who are interested in practical applications. On the other hand, setting up a public data science competition is a form of outsourcing a given task to the community [3]. It can be beneficial to the sponsors and organizers who set up the contest, as it is an inexpensive approach to solving the problem that they are after [4].
Accordingly, it should not be surprising that the scope of our own platform, KnowledgePit, has shifted over the years from organizing smaller, mostly student-focused challenges and projects to international data science competitions. Although KnowledgePit still hosts several student competitions for ML-related university courses every year, the most prestigious events are those prepared for big industry clients and partners, in association with international conferences [5], [6].
One may say that our competitions grew together with the recognition of the FedCSIS conferences. Including the one reported in Section IV [7], there have already been nine challenges held at KnowledgePit in cooperation with FedCSIS. The series started in 2014 with the AAIA'14 Data Mining Competition: Key Risk Factors for Polish State Fire Service [8]. Other competition topics included the recognition of firefighters' activities based on inertial sensor readings [9], predicting seismic activity in coal mines [10], [11], marking hair follicles on microscopic images (see footnote 5), predicting win-rates of custom card decks in collectible card games [12], [13], and predicting typical patterns in network device workloads [14]. All of these competitions were highly successful. With more than 1,300 participating teams and several thousand submitted solutions, they significantly contributed to solving important real-life challenges. They also provided us with a comprehensive survey of the state-of-the-art ML approaches in the related fields, such as time series forecasting [15], feature extraction [16], and prediction model ensembling [17].

III. KNOWLEDGEPIT MEETS BRIGHTBOX
During its journey, KnowledgePit had to evolve to fit the needs of our industrial partners. One of the most significant needs has been related to the post-competition analysis of the submitted solutions. In this regard, it was possible to meet the industry expectations thanks to integrating KnowledgePit with the aforementioned BrightBox technology.
As highlighted in Section I, BrightBox is a software technology which assesses decision models (aimed at classification, regression, prediction, etc.) based on their mistakes (i.e., differences between their outputs and the observed ground truth). Its main application field is the diagnostics of ML models deployed within complex systems [18]. Its methods have deep roots in the theory of rough sets [19]. A diagnosed model is approximated by rough-set-based surrogate models, namely ensembles of so-called approximate decision reducts [20]. Then, particular cases are investigated by looking at their neighborhoods, that is, the groups of other cases that are classified similarly by the surrogate reducts (i.e., objects that fit the same rules induced by reducts [21]). If a mistake that happened for a given case often repeats in its neighborhood, then it is likely that the diagnosed model was not trained correctly for such objects from the beginning. As another example, if the neighborhood is almost empty (i.e., the diagnosed object seems to be classified by different rules than the objects observed before), then BrightBox can conclude that the model seems to find itself in a new situation. Such hints are useful from an operational viewpoint when it comes to rebuilding ML models as parts of the aforementioned complex systems.
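The neighborhood-based reasoning described above can be illustrated with a toy sketch. It is not the BrightBox implementation. It assumes that each case has already been mapped to a signature of surrogate rules that it matches, it treats cases with identical signatures as neighbors, and all names and thresholds (`diagnose`, `min_neighbors`, `repeat_threshold`, the rule labels) are hypothetical.

```python
from collections import defaultdict

def diagnose(cases, min_neighbors=2, repeat_threshold=0.5):
    """Label each mistaken case by how often the mistake repeats nearby.

    `cases` is a list of (rule_signature, is_mistake) pairs. The signature
    is a hypothetical tuple of surrogate rules matched by the case.
    """
    # Group case indices by signature; a group plays the role of a neighborhood.
    groups = defaultdict(list)
    for index, (signature, _) in enumerate(cases):
        groups[signature].append(index)

    labels = {}
    for index, (signature, is_mistake) in enumerate(cases):
        if not is_mistake:
            continue
        neighbors = [i for i in groups[signature] if i != index]
        if len(neighbors) < min_neighbors:
            # Almost empty neighborhood: the model may be in a new situation.
            labels[index] = "new situation"
            continue
        repeat_rate = sum(cases[i][1] for i in neighbors) / len(neighbors)
        # A mistake that repeats among neighbors hints at a training problem.
        labels[index] = "systematic" if repeat_rate >= repeat_threshold else "isolated"
    return labels

# Made-up cases: (rules matched by the surrogate, did the model make a mistake?)
cases = [
    (("r1", "r4"), True),
    (("r1", "r4"), True),
    (("r1", "r4"), True),
    (("r2", "r5"), True),
    (("r2", "r5"), False),
    (("r2", "r5"), False),
    (("r3", "r6"), True),
]
labels = diagnose(cases)
```

In this toy data, the three mistakes sharing the signature `("r1", "r4")` are flagged as systematic, while the single case with signature `("r3", "r6")` has no neighbors and is flagged as a new situation. A real surrogate would use a softer notion of similarity than exact signature equality.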
Although adapting the default settings of BrightBox to the specifics of a data science competition platform was a non-trivial effort (for instance, we had to address different scenarios of access to the diagnosed models' behavior on the training data), it enabled us to extend the basic functionalities of KnowledgePit (such as the analysis of trends in the quality scores or survey-based summaries of commonly applied ML techniques) with in-depth diagnostics of individual submissions. Since BrightBox does not require direct access to the diagnosed models, it can be applied to construct the above-discussed surrogate models that approximate submitted solutions and allow reasoning about their properties. For example, it makes it possible to approximate the feature importance coefficients of models used to create the submissions (see Figure 1). It can also provide insightful information on the types of errors committed by models and on similarities between solutions submitted by competition participants. Prototype implementations of some selected functionalities provided by the BrightBox technology have already been tested in our previous competitions [5], [22].
Fig. 1: Exemplary visualizations of one of the solutions from challenge [5]. Points on the upper plot correspond to cases from the test data and their color reflects prediction errors. The lower plot shows approximated (using BrightBox) feature importance for the model used to create the analyzed submission. An experimental evaluation showed that the Spearman correlation between the approximated feature importance values and the actual values estimated for the model was ≈ 0.7.
Footnote 5: https://knowledgepit.ai/esensei-challenge/

IV. FEDCSIS 2022 CHALLENGE
As mentioned in Section II, FedCSIS 2022 hosted the ninth KnowledgePit competition associated with the FedCSIS series. We use this competition as an illustration of how KnowledgePit works, as well as a prerequisite for the discussion in Section V. We refer to [7] for more details about this particular contest. The best solutions submitted by its participants are described in [23] (1st place), [24] (2nd), [25] (3rd), and [26] (4th).
The task was to predict the execution costs of so-called forwarding contracts. The data about contracts was provided by the competition sponsor, a company that develops decision support and optimization systems in the transportation, freight forwarding, and logistics areas. That company was highly interested in a deeper analysis of the submitted solutions with respect to their potential deployment within the designed systems.
For evaluation purposes, the data was divided into the training set (330,055 historical contracts) and the test set (72,452 newer contracts). The data was carefully anonymized, but in such a way that its analytical value was not lost (see also our other competitions [5]). Given the regression nature of the considered prediction task, we selected one of the typical evaluation measures, the root mean square error (RMSE). However, let us note that specifying a measure that truly reflects a real-life decision problem is sometimes not easy [27].
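For reference, RMSE over n test cases is the square root of the mean squared difference between the observed and predicted values, i.e., sqrt((1/n) * sum_i (y_i - p_i)^2). A minimal sketch follows; the function name and the cost values are made up for illustration and do not come from the competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up contract costs, for illustration only.
observed = [100.0, 250.0, 400.0]
predicted = [110.0, 240.0, 430.0]
score = rmse(observed, predicted)
```

Because the errors are squared before averaging, a single badly mispredicted contract influences the score more than several small errors of the same total size.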
The competition was conducted in a standard way, including: (i) the online preliminary evaluation on an unknown subset of the test data (the participants submit solutions for the full test set, but the preliminary evaluation is done on a subset which is unknown to them), (ii) the associated public leaderboard (only preliminary results are shown during the competition), and (iii) the final evaluation on the full test set (which is calculated after the competition is closed, after the participants select their final solutions to be evaluated, and only if they submit to KnowledgePit the reports that describe their solutions). Table I displays the top results. Let us note that the preliminary and final rankings are different. Actually, the solution at the 3rd final place would be the best one if only preliminary scores were taken into account. Such cases are especially worth investigating with the BrightBox technology. Table I also shows the baseline solution, that is, the model that we prepared ourselves prior to the beginning of the contest [7]. In this challenge, only four teams exceeded the baseline score (though sometimes there may be even fewer, see e.g. [14]).
[...] [24], [25], [26], we decided to check how accurate the predictions could be if we mixed submissions from different teams. In this way, we could not only verify which solutions are the most influential and potentially worth deploying, but also gain better insight into where the limit of prediction quality lies when working with this type of data.
To create the ensemble of submissions, we used the post-competition data analytics tools provided by KnowledgePit thanks to its integration with BrightBox (see also [22]). We compared the results of the LASSO regression-based model stacking [28] to simple model averaging methods. We also investigated two ensembling scenarios. In the first one, we used all correctly formatted submissions. In the second one, we restricted the submissions to the one with the best preliminary evaluation score from each team. In both cases, we trained a linear regression model with LASSO regularization on predictions for the competition's preliminary evaluation set. Then, we verified the performance of each ensemble on the same final test set as all individual submissions.
We tuned the regularization parameter λ of the LASSO regression using cross-validation, following the guidelines of the glmnet package. We chose the λ for which the validation loss was lowest. Table II shows the results for the obtained ensembles. As a reference, we include the results obtained using simple averaging of the best solutions from the top 5 teams (solutions [23], [24], [25], [26] and the baseline model), and the results of weighted averaging in which the weights correspond to the preliminary scores. We also include the results of simple and weighted averaging when the number of solutions used equals the number of solutions selected at the optimal value of λ.
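The stacking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: it replaces glmnet with a plain-Python coordinate descent for a LASSO-regularized linear model, it uses synthetic predictions from three hypothetical submissions, and the λ grid, fold count, and all function names are chosen only for the example.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the LASSO coordinate update)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def fit_lasso(X, y, lam, sweeps=500):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Covariance form: each sweep costs O(p^2) instead of O(n*p).
    # Assumes no column of X is constant (gram[j][j] > 0).
    gram = [[sum(r[j] * r[k] for r in Xc) / n for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * v for r, v in zip(Xc, yc)) / n for j in range(p)]
    coef = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = xty[j] - sum(gram[j][k] * coef[k] for k in range(p) if k != j)
            coef[j] = soft_threshold(rho, lam) / gram[j][j]
    intercept = y_mean - sum(c * m for c, m in zip(coef, x_mean))
    return intercept, coef

def predict(model, X):
    intercept, coef = model
    return [intercept + sum(c * v for c, v in zip(coef, row)) for row in X]

def choose_lambda(X, y, grid, folds=5):
    """Pick the lambda with the lowest total validation RMSE over k folds."""
    best_lam, best_loss = None, float("inf")
    for lam in grid:
        loss = 0.0
        for f in range(folds):
            train = [i for i in range(len(X)) if i % folds != f]
            valid = [i for i in range(len(X)) if i % folds == f]
            model = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
            loss += rmse([y[i] for i in valid], predict(model, [X[i] for i in valid]))
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam

# Synthetic stand-in for the test set: 400 targets and three submissions
# whose predictions differ only in noise level (all values are made up).
rng = random.Random(0)
truth = [rng.gauss(100.0, 20.0) for _ in range(400)]
noise_levels = [1.0, 2.0, 10.0]
subs = [[t + rng.gauss(0.0, s) for s in noise_levels] for t in truth]

# The first 100 cases play the role of the preliminary evaluation subset.
X_pre, y_pre = subs[:100], truth[:100]
grid = [0.01, 0.1, 1.0, 10.0]
lam = choose_lambda(X_pre, y_pre, grid)
stacker = fit_lasso(X_pre, y_pre, lam)

# Both ensembles are scored on the full toy test set.
stacked_rmse = rmse(truth, predict(stacker, subs))
average_rmse = rmse(truth, [sum(row) / len(row) for row in subs])
```

In this toy setting the stacker learns to down-weight the noisiest submission, so it scores better than the unweighted average, which gives every submission the same influence. The coefficients in `stacker[1]` play the role of the per-submission weights discussed in the text.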
The ensemble that achieved the best RMSE was trained on all submissions. It had non-zero coefficients for 28 solutions submitted by nine different teams. All of those coefficients were positive. The submissions of hieuvq had the highest total impact on predictions (0.321), followed by Lord of the ML (0.189), Cyan (0.156), and Dymitr (0.151). The combined impact of the remaining teams was less than 0.187.
We analyzed the residuals of the resulting predictions. Figure 3 depicts their distribution across the test data. The top-left plot (Figure 3a) shows the relation between ground-truth targets and predictions for test cases from the FedCSIS 2022 Challenge. The points are aligned along the diagonal (the red line), with relatively few outlying cases. Similarly, the top-right part of Figure 3 shows a scatter plot of residuals against the predictions of the ensemble. It shows that the residuals are not evenly distributed and that there are a few prediction ranges with a slightly larger magnitude (variance) of prediction errors. Both plots also suggest that the ensemble is nearly unbiased. This observation was confirmed: the mean difference between the ground truths and the predictions is nearly zero, i.e., −0.00027. However, the distribution of residuals is not Gaussian. The normality hypothesis was rejected with high confidence using the Shapiro-Wilk test. This can also be seen in Figures 3c and 3d. The distribution of residuals has long tails, and the frequency of high error values is far from the theoretical quantile values of the Gaussian distribution. This observation suggests that the considered ensemble model could be further improved. Here, a thorough BrightBox-based investigation of the morphology of mistakes made by each of the aforementioned 28 solutions can be helpful.
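The two residual checks discussed above, the mean residual as a bias estimate and the comparison of sample quantiles with Gaussian quantiles, can be sketched with the Python standard library alone. The residuals below are synthetic and heavy-tailed by construction, and the function name `qq_points` is hypothetical. The sketch does not implement the Shapiro-Wilk test; in practice one could use a statistical library for that, for example `scipy.stats.shapiro`.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted standardized residual with its Gaussian quantile."""
    n = len(residuals)
    mu, sigma = mean(residuals), stdev(residuals)
    ordered = sorted((r - mu) / sigma for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

rng = random.Random(1)
# Synthetic residuals: mostly small errors, occasionally much larger ones.
residuals = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.95 else rng.gauss(0.0, 8.0)
    for _ in range(2000)
]

bias = mean(residuals)            # close to zero for an unbiased model
points = qq_points(residuals)     # (theoretical, observed) quantile pairs
theory_max, sample_max = points[-1]
```

For Gaussian residuals the pairs in `points` would lie close to the diagonal. With long tails, the largest observed quantile (`sample_max`) lies well above its theoretical counterpart (`theory_max`), which is the pattern described for Figure 3d.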

VI. CONCLUSIONS
We discussed the idea of organizing online data science competitions, with some examples taken from our own experience. We referred to KnowledgePit, our competition platform which, besides standard functionalities, includes some advanced analytical and visualization tools. We paid special attention to the BrightBox technology, which is used by KnowledgePit to approximate and diagnose solutions submitted by the competition participants. One of the methods that can be used within the resulting KnowledgePit-BrightBox environment is mixing different solutions together, which leads toward more accurate ML models, as well as additional insights with regard to particular submissions.
Consequently, in order to provide even more value to future competition sponsors, we are extending KnowledgePit's functionalities to utilize a broader range of XAI tools for the purpose of advanced post-challenge research. We also believe that such functionalities can be useful from an educational perspective, whereby KnowledgePit may be considered as a platform for assessing and improving data science competencies at universities, as well as in companies.
We also plan to continue the integration of KnowledgePit with BrightBox. In particular, we are going to develop better tools for inter-competition analysis of individual platform users. Such functionality would help us to monitor the progress and skill development of participants. It could also provide value to our industrial partners, who often look for prospective skilled employees. In this context, KnowledgePit could be used by our partners as a tool that facilitates the recruitment of researchers for projects related to the competition topics, and for the evaluation of job candidates. Actually, QED Software is already using KnowledgePit for such evaluations.