An End-to-end Machine Learning System for Mitigating Checkout Abandonment in E-Commerce

Abstract: Electronic commerce (e-commerce) has become one of the most significant consumer-facing tech industries in recent years. It has considerably improved people's lives by allowing them to shop online from the comfort of their own homes. Although many people are accustomed to online shopping, e-commerce merchants face a significant problem: a high rate of checkout abandonment. In this study, we propose an end-to-end Machine Learning (ML) system that helps merchants reduce the rate of checkout abandonment through better-informed decisions and strategy. As part of the system, we developed a robust ML model that predicts, based on the customer's activity, whether a customer will check out the products added to the cart. Our system also lets merchants explore the underlying reasons for each individual prediction. This can help online merchants with business growth and effective stock management.


I. INTRODUCTION
As we are living in an era where digitalization and technology are evolving day by day, our dependency on the internet has noticeably increased. E-commerce has made shopping easy and safe for internet users all over the world. People nowadays prefer browsing websites for their daily needs to walking around shopping malls, supermarkets, and shops. They do not have to go through the hassle of finding a product and waiting in a long billing queue, which makes purchasing simple and quick. E-commerce also makes it easier for companies to reach new customers all over the world. A report from Statista [1] shows that global e-commerce sales have jumped from 1,336 billion to 5,542 billion USD in the last 6 years. In the near future, the dependency on online shopping is expected to increase significantly.
Recently, online retailers have been encountering numerous business challenges such as lack of trust, customer churn, and product returns. With the rapid technological advancement in the data science domain, researchers have already started to address these types of problems using data science approaches. Some existing research concerns product review classification; for example, the authors in [2] proposed DNN networks to train a classifier that identifies product quality from product reviews. Another major problem in the e-commerce industry is product returns. In [3] and [4], the researchers tried to address this issue using different predictive modeling techniques.
Apart from these, one of the most common business challenges in the e-commerce industry is a high checkout abandonment rate. According to a study by the Baymard Institute, a research institute in Denmark, the average checkout abandonment rate is 69.82% [5]. In addition, online shopping behavior changed significantly during the COVID-19 pandemic. When individuals browse an online store for a particular product, they often add many additional items to their cart. Some of these items are needed, while others are merely liked and not urgently wanted; most of the time, the majority of them are never checked out, which results in a high checkout abandonment rate. However, very few researchers have worked on problems related to online shopping carts. Jian et al. [6] proposed a framework to predict buyers' repurchase intention from cart information. In another study [7], the author built a recommendation system using shopping cart information.
In this research, we address the aforementioned business problem and propose an end-to-end ML system that automatically performs all the required steps: data collection, transformation, preprocessing, statistical analysis, and predictive analytics. For predictive analytics, we conducted an extensive experiment and found that CatBoost outperforms the other approaches, predicting the checkout likelihood of users with the highest accuracy (0.76) and precision (0.694). Moreover, we applied a model-agnostic local explanation approach for explainability, which helps e-commerce merchants analyze how each individual customer is influenced by different factors.
Our contribution can be considered from two perspectives: a research standpoint and a commercial standpoint. From the research perspective, no previous study has attempted to solve this particular problem or proposed an end-to-end ML-based solution. From the commercial perspective, if business people conduct targeted marketing or apply other business strategies to consumers who are most likely to purchase the products in their cart, sales will increase. In addition, the prediction will help the merchant maintain effective stock management.
The combination of statistical insights, predictive output, and local explanations aids the seller in developing proper strategies for business growth.

Proceedings of the 17th Conference on Computer Science and Intelligence Systems, pp. 129-132

II. PROPOSED SOLUTION
As a solution to the checkout abandonment issue in e-commerce business, we propose the end-to-end system shown in Fig. 1. As Fig. 1 shows, the proposed system takes raw data from the database and outputs analytical and predictive insights to a dashboard. Development of this system consists of several steps: data understanding, preprocessing, exploratory data analysis, data modeling, and a model-agnostic explanation approach.
To conduct the experiment, a large real-world dataset was collected from a prominent SaaS platform that integrates with online stores to track behavior in real time. The dataset contains 27 features in total, of which 21 are numerical and 6 are categorical. Table I and Table II describe the numerical and categorical features, respectively. Of the 28,410 instances, 55% belong to class 0 ('not checked out') and the other 45% belong to class 1 ('checked out'), which indicates that the dataset is almost balanced.
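After collecting the data, in the preprocessing phase, we applied the James-Stein encoder to convert the categorical features into informative numerical representations. In its commonly stated form, the James-Stein encoder for a category $i$ is

$$JS_i = (1 - B)\,\mathrm{mean}(y_i) + B\,\mathrm{mean}(y),$$

where $\mathrm{mean}(y_i)$ is the mean target of category $i$, $\mathrm{mean}(y)$ is the global mean target, and the shrinkage weight is $B = \mathrm{var}(y_i) / (\mathrm{var}(y_i) + \mathrm{var}(y))$. The idea of the James-Stein encoder is to shrink each category's mean target toward a more central average, here the global mean.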
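The shrinkage behind James-Stein target encoding can be illustrated with a minimal pure-Python sketch. It assumes the simplified weighting in which each category mean is blended with the global mean according to the ratio of the category variance to the total variance; it is not the authors' implementation, and the `device` values and targets below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def james_stein_encode(categories, targets):
    """Map each category to a shrunken mean of the binary target.

    Assumed form: JS_i = (1 - B) * mean(y_i) + B * mean(y),
    with B = var(y_i) / (var(y_i) + var(y)).
    """
    global_mean = mean(targets)
    global_var = pvariance(targets)

    # Group the target values by category.
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)

    encoding = {}
    for cat, ys in groups.items():
        cat_mean = mean(ys)
        cat_var = pvariance(ys)
        denom = cat_var + global_var
        # B lies in [0, 1]: noisier categories lean more on the global mean.
        b = cat_var / denom if denom > 0 else 0.0
        encoding[cat] = (1 - b) * cat_mean + b * global_mean
    return encoding


# Hypothetical toy data: device type and whether the cart was checked out.
devices = ["mobile", "mobile", "desktop", "desktop", "desktop", "tablet"]
checked_out = [1, 0, 1, 1, 1, 0]
print(james_stein_encode(devices, checked_out))
```

In this toy example, the `mobile` category has mixed outcomes, so its encoded value is pulled from its own mean (0.5) toward the global mean, while categories with no within-category variance keep their own mean.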
As most of the features do not follow a Gaussian distribution, we applied min-max scaling to rescale the features to the range from 0 to 1.

Categorical features (Table II):
Feature       Description
[...]         The page where the user started the visit.
week day      Day of the week.
origin        From which origin the user landed on the store.
utm source    From which source the user landed on the store.
utm medium    From which medium the user landed on the store.
device        The user's device type.
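The min-max scaling step mentioned in this section can be sketched as follows. This is a minimal illustration, not the authors' code; fitting the minimum and maximum on the training split only and reusing them for the validation split is an assumption, since the text does not say how the scaler was fitted. The function names are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from one feature column."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values with x' = (x - lo) / (hi - lo)."""
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column, e.g. a visit count per customer.
train_column = [2, 4, 10]
lo, hi = fit_min_max(train_column)
print(apply_min_max(train_column, lo, hi))
```

Values outside the fitted range (for example, in unseen data) fall outside [0, 1] under this scheme, which is worth keeping in mind when the scaler is reused at prediction time.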
Then, during the data modeling phase, we followed the systematic workflow illustrated in Fig. 2. This workflow starts immediately after the data preprocessing steps are complete. First, we split the entire processed dataset into training (80%) and validation (20%) data to avoid bias during the model evaluation phase. Second, we chose the ML algorithms based on the objectives as well as the characteristics of the data. In this study, we applied five state-of-the-art algorithms, namely XGBoost [9], LightGBM [10], CatBoost [11], mGBDTs [13], and TabNet [12], targeting not only the best prediction output but also model interpretability. To establish a baseline performance, we also included a DNN model.
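The 80/20 split described above can be sketched in a few lines. The text does not say whether the split was stratified, so the per-class sampling below is an assumption, as are the function name and the fixed seed.

```python
import random


def train_validation_split(rows, labels, validation_fraction=0.2, seed=0):
    """Split rows into training and validation sets, class by class."""
    # Collect the row indices of each class.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * validation_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    train_rows = [rows[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    val_rows = [rows[i] for i in val_idx]
    val_labels = [labels[i] for i in val_idx]
    return train_rows, train_labels, val_rows, val_labels


# Toy data mirroring the reported 55% / 45% class ratio.
toy_rows = list(range(100))
toy_labels = [0] * 55 + [1] * 45
tr_x, tr_y, va_x, va_y = train_validation_split(toy_rows, toy_labels)
print(len(tr_x), len(va_x))
```

Sampling each class separately keeps the class ratio of the validation set close to that of the full dataset, which matters less here than for a heavily imbalanced problem but costs nothing.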
Third, for evaluation, we selected five metrics: Accuracy, Precision, Recall, $F_{0.5}$-score, and ROC-AUC. The fourth and most important step is to tune the hyperparameters of each model carefully to obtain the best prediction output without over-fitting or under-fitting. For hyperparameter tuning, we chose the Bayesian approach because it takes less time than the alternatives to find the best set of parameters and improves generalization performance on the test data. After obtaining the optimized hyperparameters for all of the models, in the fifth stage, we trained the models on the training dataset.
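For reference, the threshold-based metrics named above can be computed from the confusion counts, with the F-beta score defined as $F_\beta = (1 + \beta^2) \cdot P \cdot R / (\beta^2 P + R)$ for precision $P$ and recall $R$. The sketch below is illustrative only; the labels and predictions are made up and do not come from the paper's experiments.

```python
def classification_metrics(y_true, y_pred, beta=0.5):
    """Return accuracy, precision, recall and F-beta for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # beta < 1 weights precision more heavily than recall.
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}


# Hypothetical predictions: precise but with low recall.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred, beta=0.5))
```

With $\beta = 0.5$, a model that rarely raises false positives is rewarded even if it misses some positives, which fits a setting where acting on a wrongly predicted checkout is the more costly error.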
With the training phase complete, we moved on to the sixth stage, in which we used the trained models to make predictions on the unseen validation dataset and tracked their performance against each evaluation metric. In the final step, we compared the performance of the models to identify the best model for this particular business problem.

III. BEST MODEL AND FEATURES SELECTION
In this phase, we divide our analysis into two parts. We start by explaining the performance of each model and selecting the best model from their comparison. We then explore the top features of the best model to find insights that can support important business decisions.
Table III shows both the training and testing results for each model in our experiment. From Table III, it can be observed that the test results are very close to the training results, which indicates that the models have learned the underlying patterns in the data without over-fitting. Considering the business problem we were trying to solve, we focused mainly on Precision, $F_{0.5}$-score, and Accuracy. Notably, all the models performed better than our baseline DNN model in terms of Precision, $F_{0.5}$-score, and Accuracy. Although the pretrained TabNet model performs well on tabular data, in our case it fails to outperform the tree-based models. Similarly, mGBDT fails to outperform the other non-differentiable boosting algorithms. The tree-based boosting algorithms XGBoost, CatBoost, and LightGBM yielded results close to each other. Comparing the results of all models, CatBoost consistently outperformed the others, with an accuracy of 0.76 and a precision of 0.694. Based on these comparisons, we chose CatBoost as the best-performing model for checkout prediction.
Furthermore, we extracted the five most important features from the CatBoost classifier. Table IV lists these features in descending order of feature importance.

IV. MODEL EXPLANATION AND DECISION SUPPORT
To make our model's decisions more transparent, we built a support system using the model-agnostic local explanation technique LIME [14]. This method helps us understand the factors that influence a complex black-box model around a single instance of interest.
In Table V, we take two instances from the unseen validation data for local explanation. Instance 1 has a checkout-abandonment status, and Instance 2 has a checkout status. The predictions and explanations for these examples can be found in Fig. 3 and Fig. 4.
Fig. 3 illustrates the decision rules and feature significance on which the CatBoost model based its decision for Instance 1, a customer who actually abandoned the checkout after adding items to the cart. We can observe that the model predicts with 95% probability that this person will abandon the checkout. Also, the three most important features, "total visits", "customer since days", and "total purchased sum", significantly influenced the model to decide in favor of checkout abandonment. In Fig. 4, we can observe that the model predicts with 99% probability that Instance 2 will check out the added items, which is correct. The value of the "customer since days" feature indicates that being a long-standing customer gave the model confidence that Instance 2 would check out the products. Such local explanations of individual predictions can greatly assist business analysts or decision-makers in making meaningful and unbiased decisions. In particular, this explanation technique can help in ambiguous situations where it is difficult to decide whether a person will check out the items in the cart.
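The local-surrogate idea behind LIME can be sketched in pure Python: perturb the instance, query the black-box model, weight the samples by proximity, and fit a weighted linear model whose coefficients serve as the explanation. This is a simplified illustration under stated assumptions, not the LIME library and not the authors' setup: it uses Gaussian perturbations on features assumed to be scaled to [0, 1], an exponential kernel, and a hypothetical logistic `toy_black_box` standing in for the trained CatBoost model.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def explain_locally(predict_proba, instance, n_samples=500, scale=0.1,
                    kernel_width=0.25, seed=0):
    """Return one local linear coefficient per feature around `instance`."""
    rng = random.Random(seed)
    rows, ys, ws = [], [], []
    for _ in range(n_samples):
        # Perturb the instance with Gaussian noise.
        z = [v + rng.gauss(0.0, scale) for v in instance]
        offsets = [zi - vi for zi, vi in zip(z, instance)]
        dist2 = sum(o * o for o in offsets)
        rows.append([1.0] + offsets)  # intercept column + feature offsets
        ys.append(predict_proba(z))
        # Nearby samples get larger weights (exponential kernel).
        ws.append(math.exp(-dist2 / kernel_width ** 2))

    # Weighted least squares via (X^T W X) beta = X^T W y.
    k = len(instance) + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, ws))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, ws, ys))
            for i in range(k)]
    for i in range(1, k):
        xtwx[i][i] += 1e-6  # tiny ridge term for numerical stability
    return solve_linear_system(xtwx, xtwy)[1:]


def toy_black_box(z):
    """Hypothetical stand-in model: probability of checkout."""
    total_visits, customer_since_days, total_purchased_sum = z
    score = 3 * total_visits + 2 * customer_since_days + total_purchased_sum - 3
    return 1.0 / (1.0 + math.exp(-score))


names = ["total visits", "customer since days", "total purchased sum"]
coefs = explain_locally(toy_black_box, [0.2, 0.1, 0.3])
for name, coef in zip(names, coefs):
    print(f"{name}: {coef:+.3f}")
```

A positive coefficient means that increasing the feature near this instance pushes the predicted checkout probability up. The actual LIME method additionally discretizes features and selects a small subset of them, so its output is not numerically comparable to this sketch.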

V. CONCLUSION
In this study, we aimed to reduce the checkout abandonment rate, a key concern in modern e-commerce business, by proposing an end-to-end system. One of the most important components of the system is the ML model that predicts the probability of checkout abandonment for each customer. The system also explains the decisions taken by the model to provide further business support. For predictive analytics, CatBoost was found to be the best performer, with a Precision of 0.694, an $F_{0.5}$-score of 0.719, and an Accuracy of 0.76. For reliable decision making with additional support, we integrated LIME, which interprets the model's output and extracts the decision rules.

TABLE II
DATA DESCRIPTION OF CATEGORICAL VARIABLES

TABLE III
MODEL PERFORMANCE ON TRAINING AND TESTING DATA

TABLE IV
FEATURE IMPORTANCE (TOP 5) OF CATBOOST CLASSIFIER

MD RIFATUL ISLAM RIFAT ET AL.: AN END-TO-END MACHINE SYSTEM FOR MITIGATING CHECKOUT ABANDONMENT