Data Mining for Bankruptcy Prediction: An Experiment in Vietnam

Abstract: In the history of the world economy, the bankruptcy of some large companies has caused global financial crises. This study aims to build a bankruptcy prediction model for companies listed on Vietnam's stock market. The research used six popular data mining algorithms to predict bankruptcy risk, with data comprising 4693 observations from the period 2009-2020. The results showed that Logistic Regression, Artificial Neural Network, and Decision Tree predict bankruptcy well, with an accuracy of 98%. The study identified the three most important indicators affecting corporate bankruptcy prediction: total asset turnover ratio, debt to equity ratio, and debt ratio. The study also showed the threshold points of 10 indicators for avoiding bankruptcy likelihood. These results suggest that the model could be applied in practice to reduce risks for businesses and investors in the Vietnamese market.


I. INTRODUCTION
THE REPORT of the World Bank [1] indicates that Vietnam is an active emerging economy with rapid economic growth in the East Asia region. Alongside business development, the economy faces many potential risks. The global economic growth outlook is somewhat bleak in the face of uncertain risks such as the US-China trade war, Brexit, and inflationary trends due to unpredictable price changes. The Coronavirus (Covid-19) epidemic strongly affects people's psychology in general and that of stock investors in particular. Measuring the health of enterprises in Vietnam is therefore urgent. There are several models to predict corporate bankruptcy, for example, the market approach model [2] and the accounting approach model [3]. Employing such models in Vietnam's stock market is essential because qualitative prediction is difficult in an increasingly unpredictable environment. The models support measuring bankruptcy likelihood arising from potential business risks. In Vietnam, the quality of accounting information is not high [4], and companies, whether listed or unlisted on the stock market, report losses that lead to a high risk of bankruptcy. To protect the rights and interests of enterprises and creditors, Vietnamese law has issued and concretized the relevant regulations in the Law on Bankruptcy 2014 and the Law on Securities 2010 (most recently the Law on Securities 2019, which took effect on January 1, 2021).

Previous studies have proposed diverse financial ratios as criteria for predicting corporate bankruptcy. Some studies show that the Z-Score model has strong practical applicability in using financial status to predict bankruptcy, as studied by Liang, et al. [5], Barboza, et al. [6], Chou, et al. [7], Antunes, et al. [8], Le, et al. [9], Le, et al. [10], Veganzones and Séverin [11], Mai, et al. [12], Son, et al. [13], Chen, et al. [14]. However, previous studies were mainly conducted in developed countries, and few studies have applied data mining to bankruptcy prediction, especially in emerging securities markets such as Vietnam.
This study uses several data mining techniques to predict corporate bankruptcy in a Vietnamese case study. The main contributions of this study are as follows: (i) building a framework model for predicting bankruptcy; (ii) collecting Vietnamese data sets covering the past twelve years for bankruptcy prediction; (iii) testing and comparing the performance of techniques for predicting bankruptcy on the Vietnamese dataset; and (iv) combining Bagging and Boosting methods, for which the test results show the best overall accuracy of 98%, to improve bankruptcy forecasting.
Researchers and practitioners encourage adopting and combining new techniques to improve the accuracy of corporate bankruptcy forecasts. The results help to reinforce and enhance the bankruptcy prediction model.

II. LITERATURE REVIEW
Altman [15] studied the prediction of corporate financial distress through the Z-score and Zeta models; this handbook presents the quantitative techniques commonly used in empirical finance research, along with real, modern research examples. By referring to this handbook, the authors have understood and applied it to the study of the Z-score model. Konglai and Jingjing [16] compiled a sample of failed and normally managed groups containing 130 listed companies from the Shanghai and Shenzhen exchanges in 2009. Using the multiple discriminant analysis (MDA) model and the logistic model, the authors chose five financial factors: profitability, debt repayment ability, operating ability, growth ability, and capital structure. Ohlson [17] was the first to apply the logistic regression model to predict the probability of default of enterprises. Related studies include Meeampol, et al. [18] on the Thai stock market. Kumar and Rao [19] studied a new method to estimate internal credit risk and predict bankruptcy under the Basel II regime. The results of that study showed that the Z-score could predict bankruptcy with 98.6% accuracy, compared with 93.5% according to Altman's score.
Researchers use various algorithms of intelligent techniques to solve the problem of corporate bankruptcy [20]. According to Serrano-Cinca [21] and Fletcher and Goss [22], neural networks (NNs) are the most commonly used technique. The data mining algorithms used to predict bankruptcy risk also include decision trees (DT) and support vector machines (SVM) [23]. A decision tree is a structured hierarchical tree used to classify objects based on a series of rules. Given data about objects containing attributes along with their classes, the decision tree generates rules to predict the class of unknown objects (unseen data). A support vector machine (SVM) is a supervised machine learning model used to analyze and classify data. SVM takes incoming data and classifies them into two different classes. Many studies have used data mining techniques in predicting bankruptcy; some of them are listed in Table 1.

III. METHODOLOGY
A. Measuring Variables
There are many measures for predicting bankruptcy; however, each measure has both advantages and disadvantages. Ghazali, et al. [24] state that the Altman Z-Score is probably the most popular measure of a company's financial health and has been used to determine bankruptcy prediction in numerous studies. This study determines the bankruptcy prediction based on the Z-score approach of Altman [3]. Altman's Z-score is calculated as a weighted combination of five financial ratios, in which: A1 is current assets minus current liabilities, divided by total assets; A2 is retained profit divided by total assets; A3 is profit before tax and interest divided by total assets; A4 is book value of equity divided by total liabilities; and A5 is revenue divided by total assets.
If the Z-score is below 1.81, the company is in the bankruptcy prediction zone and the likelihood of bankruptcy is assigned a value of 1. Otherwise, it is assigned a value of 0.
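The labelling rule above can be illustrated with a minimal sketch. The weights used here are an assumption: they are the coefficients commonly quoted for Altman's original Z-score model and are not shown in the text, so they should be checked against Altman [3] before any real use. The function names and the example firm are hypothetical.

```python
# Assumption: coefficients commonly quoted for Altman's original Z-score model.
# The formula itself is not shown in the text, so treat them as illustrative.
ASSUMED_WEIGHTS = (1.2, 1.4, 3.3, 0.6, 1.0)
DISTRESS_CUTOFF = 1.81  # cut-off stated in the text


def z_score(a1, a2, a3, a4, a5, weights=ASSUMED_WEIGHTS):
    """Weighted sum of the five ratios A1..A5 defined above."""
    return sum(w * a for w, a in zip(weights, (a1, a2, a3, a4, a5)))


def bankruptcy_label(z, cutoff=DISTRESS_CUTOFF):
    """Return 1 (bankruptcy prediction zone) if Z < cutoff, otherwise 0."""
    return 1 if z < cutoff else 0


# Hypothetical firm, not taken from the study's data.
z = z_score(a1=0.10, a2=0.05, a3=0.04, a4=0.50, a5=0.80)
label = bankruptcy_label(z)
```

Keeping the weights and the cut-off as parameters makes it easy to substitute another variant of the model if the assumed coefficients do not match the one used in the study.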
This study uses 30 attributes of financial indicators, including liquidity ratios, capital budgeting ratios, profitability ratios, efficiency ratios (activity ratios), market ratios, and debt ratios (leverage ratios). The attributes are briefly described in Appendix 1.

B. Applying Data Mining Algorithms
Data mining has been defined in many different ways. It is the process of automatically extracting valuable, predictive information hidden in huge amounts of real-world data. Data mining emphasizes automated and predictive aspects. This study uses Logistic Regression, Bayesian Network, K-nearest neighbor, Artificial Neural Network (ANN), Support Vector Machine (SVM), and Decision Tree, which are commonly used to predict bankruptcy.
1) Logistic Regression: The logistic regression model introduced by Berkson [25] is a commonly used tool for analyzing data with a binary dependent variable. Some developments by Altman, et al. [26] and Flitman [27] are used in multivariate regression analysis and discriminant analysis. From this binary dependent variable, a procedure predicts the probability of the event occurring according to the following rule: if the predicted probability is greater than 0.5 (the default cut-off point), the predicted result is "yes" (the event occurs); otherwise, the predicted result is "no". The binary logistic regression model is as follows:

$$P = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}$$

Fig. 1 Binary logistic regression model

P is the probability that Y = 1 when the independent variables take on a particular value. Accordingly, the probability that the event does not occur is:

$$1 - P = P(Y = 0) = \frac{1}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}$$

The logit regression model can be used to estimate the log(odds) ratio for each independent variable, following the model of Ohlson [17]. The regression coefficients β0, β1, ..., βn are estimated by the method of Maximum Likelihood (ML).
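As an illustration of the cut-off rule, the following minimal sketch evaluates the logistic probability for given coefficients and applies the default 0.5 threshold. The coefficient values and ratios are hypothetical and are not estimates from this study; in practice the coefficients would come from Maximum Likelihood estimation.

```python
import math


def logistic_probability(x, beta0, betas):
    """P(Y=1 | x) = 1 / (1 + exp(-(beta0 + sum(beta_i * x_i))))."""
    linear = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(p, cutoff=0.5):
    """Predict 1 ("yes") only when the probability exceeds the cut-off."""
    return 1 if p > cutoff else 0


# Hypothetical coefficients and ratios, for illustration only.
p = logistic_probability(x=[0.5, 2.0], beta0=-1.0, betas=[3.0, 0.25])
log_odds = math.log(p / (1.0 - p))  # recovers beta0 + sum(beta_i * x_i)
prediction = classify(p)
```

The log(odds) recovered from the probability equals the linear predictor, which is why each coefficient can be read as the change in log(odds) per unit change of its variable.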
2) Bayesian: A Bayesian Network is applied for classification based on a probabilistic graphical model, and the probability it produces is a value from 0 to 1. A Bayesian Network is a set of variables and their conditional dependencies, linked together by probabilistic associations. According to Carlin and Louis [28], the Bayesian method is more about statistics than regression. For bankruptcy prediction, a Bayesian network is built with Bayes' rule along with the condition P(Y=1) + P(Y=0) = 1, written as follows:

$$P(Y=1 \mid X) = \frac{P(X \mid Y=1)\,P(Y=1)}{P(X)}$$

$$P(Y=0 \mid X) = \frac{P(X \mid Y=0)\,P(Y=0)}{P(X)}$$

In which:

$$P(X) = P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)$$

The components are calculated as follows: P(Y=1) is the rate of Y = 1 cases (bankruptcy likelihood) in the sample used to run the model, assuming the variables are independent.
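The posterior calculation can be sketched as follows. This is a simplified, naive-Bayes style illustration of the rule above rather than a full Bayesian Network; the per-variable probabilities and the prior are hypothetical numbers, and the function names are introduced only for this example.

```python
def naive_likelihood(feature_probs):
    """P(X | Y) as a product of per-variable probabilities (independence assumption)."""
    result = 1.0
    for p in feature_probs:
        result *= p
    return result


def posterior_y1(prior_y1, lik_x_given_y1, lik_x_given_y0):
    """P(Y=1 | X) by Bayes' rule, using P(Y=0) = 1 - P(Y=1)."""
    prior_y0 = 1.0 - prior_y1
    evidence = prior_y1 * lik_x_given_y1 + prior_y0 * lik_x_given_y0  # P(X)
    return prior_y1 * lik_x_given_y1 / evidence


# Hypothetical per-variable probabilities for one company.
lik_y1 = naive_likelihood([0.8, 0.5])  # P(X | Y=1)
lik_y0 = naive_likelihood([0.2, 0.5])  # P(X | Y=0)
post_y1 = posterior_y1(prior_y1=0.5, lik_x_given_y1=lik_y1, lik_x_given_y0=lik_y0)
post_y0 = 1.0 - post_y1
```

Because both posteriors share the same denominator P(X), they sum to 1, which matches the condition stated in the text.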
3) K-Nearest Neighbors (K-NN): The K-Nearest Neighbors algorithm is widely used in data mining. K-NN classifies an object based on the distance between a query point and all the objects in the training data. An object is classified based on its K nearest neighbors. K is a positive integer that is determined before running the algorithm. Euclidean distance is often used to calculate the distance between objects.
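A minimal sketch of this procedure, using Euclidean distance and a majority vote among the K nearest training objects, is given below. The tiny data set is hypothetical and serves only to show the mechanics.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training objects."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-ratio observations: label 1 = bankruptcy likelihood, 0 = normal.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_x, train_y, query=(0.5, 0.5), k=3)
far_corner = knn_predict(train_x, train_y, query=(5.5, 5.5), k=3)
```

In practice the ratios would usually be rescaled first, since Euclidean distance is dominated by variables with large numeric ranges.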

4) Artificial Neural Network (ANN):
An Artificial Neural Network is an information processing model that simulates the activity of the nervous system of an organism. A neural network consists of one or more neurons; each neuron is an information processing unit, and the connections between neurons form the network structure. A neural network is a computational model defined by three parameters: neuron type, connection architecture, and learning algorithm. The neurons are connected by a weight matrix. The typical structure of a neural network consists of three layers: input, hidden, and output [29] (see Fig. 2).

5) Support Vector Machine (SVM): A support vector machine (SVM) is a classical algorithm that solves problems of big data classification [30]. SVM takes input data and classifies them into two different classes. For the best classification, it is necessary to determine the optimal hyperplane located as far away from the data points of all classes as possible.

Fig. 3 Support vector machine

Fig. 3 depicts the SVM algorithm: given a training set represented in a vector space where each observation is a point, this method finds a decision hyperplane h that best divides the points in this space into two separate classes, represented by the black dots and the white dots, respectively. The quality of this hyperplane is determined by the margin, that is, the distance from the nearest data point of each class to the plane. The purpose of the SVM algorithm is to find the hyperplane with the maximum margin.

6) Decision Tree: The Decision Tree [31] is widely used in many different fields. After the introduction of the machine learning method system, the Decision Tree was further developed with the C4.5 algorithm by Quinlan [32] and the ID3 algorithm by Quinlan [33]. A Decision Tree is a structured classification tree that classifies objects based on sequences of rules. To determine which variable to use for classification first and which to use later, the information weight (entropy) of each variable is calculated; the higher the entropy, the more categorical information the variable carries.
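The entropy-based ordering of variables can be illustrated with a short sketch. It assumes that entropy is turned into a split criterion through information gain, as in ID3; the text only mentions entropy, so this choice is an assumption, and the sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value < threshold."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - weighted


# Hypothetical turnover values with labels (1 = bankruptcy likelihood, 0 = normal).
turnover = [0.5, 0.8, 1.9, 2.4]
labels = [1, 1, 0, 0]
gain_good_split = information_gain(turnover, labels, threshold=1.0)
gain_poor_split = information_gain(turnover, labels, threshold=0.6)
```

A tree-building algorithm would evaluate such gains for every candidate variable and threshold and place the most informative split at the top of the tree.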

C. Combining Techniques for Data Mining
To improve accuracy in the classification problem through model hybridization, this research employed Boosting and Bagging on top of the classification algorithms.
The name Bagging combines two terms, Bootstrap and Aggregation [34]. Bagging combines independent base models, which leads to a significant reduction in errors. Therefore, the goal is to obtain base models that are as independent as possible. Bagging generates N training sets by repeated selection from the original training data set. These subsets are Bootstrap samples, drawn with replacement, and a machine learning algorithm generates a base classifier from each of them. The classifiers are combined by majority voting: when an example needs to be classified, each classifier produces a result, and the result that appears most often is taken as the final result.
Boosting is a method of building a set of weak classifiers to improve their efficiency. After each iteration, the weak classifier focuses on learning the elements that were misclassified in previous iterations. To classify newly arrived data, the majority voting rule is applied to the classification results of the weak classifiers [35].
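The Bagging procedure described above can be sketched in a few lines. The base learner here is a hypothetical one-variable nearest-neighbour rule chosen only to keep the example short; the study itself applies Bagging to its own classifiers in Weka, and the data rows are invented for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement from the original training set."""
    return [rng.choice(data) for _ in data]


def train_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on a single ratio."""
    def predict(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict


def fit_bagging(data, train_base, n_models=11, seed=0):
    """Train one base classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_base(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict_majority(models, x):
    """Combine the base classifiers by majority voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical rows of (ratio value, label); label 1 = bankruptcy likelihood.
data = [(0.2, 1), (0.4, 1), (0.6, 1), (2.0, 0), (2.2, 0), (2.5, 0)]
models = fit_bagging(data, train_1nn)
low_ratio_vote = predict_majority(models, 0.3)
high_ratio_vote = predict_majority(models, 2.3)
```

An odd number of base models avoids ties in a two-class vote. Boosting would differ in that each new model is trained with more weight on the rows misclassified by the previous ones.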

D. Evaluating The Model
The Confusion Matrix is commonly used in model evaluation. This study calculates the indices of the Confusion Matrix as shown in Table 2. The effectiveness of the classification model is evaluated based on four indexes: Accuracy, Precision, Recall, and Harmonized Mean (F1-score):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

In which TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the Confusion Matrix.

E. Collecting Data
This study uses data collected from the Vietnamese stock exchange in the period 2009-2020. Data are collected from the audited financial statements of listed companies, after excluding companies in the banking, securities, and insurance sectors. After determining the indicators, the data used for analysis and forecasting comprise 4693 observations, presented in Table 3 by year and by field.
The study objectives are to use data mining algorithms, including Logistic Regression, Bayesian Network, K-nearest neighbor, Artificial Neural Network, Support Vector Machine, and Decision Tree, to predict bankruptcy and to determine the accuracy of these algorithms. The data are randomly divided into two parts to build and test the model: training data are used to build the research model, and testing data are used to test its predictive ability. The description of indicator characteristics in the research model is presented in Appendix 1. Out of 4693 observations, 2395 observations are at risk of bankruptcy, accounting for 51.03%; the remaining 48.97% are normal. Thus, the data are quite balanced between normal enterprises and those with bankruptcy likelihood.
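The random division into training and testing data, together with the class shares quoted above, can be sketched as follows. The 30% test fraction and the seed are assumptions made for illustration; the text does not state the proportion used in the study.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Observation counts reported in the text.
total_observations = 4693
at_risk = 2395
share_at_risk = round(at_risk / total_observations * 100, 2)
share_normal = round(100 - share_at_risk, 2)

# Placeholder row identifiers stand in for the real observations.
train_rows, test_rows = train_test_split(range(total_observations))
```

Fixing the seed makes the split reproducible, so that accuracy figures from different algorithms are compared on the same testing data.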
Appendix 1 reports the results of testing the difference in the mean values of the 30 indicators in the research model between the normal enterprise group and the bankruptcy likelihood group. For 27 of the 30 indicators, the difference between the two groups is statistically significant; only three growth indicators show no difference between the two groups: X18-Operating profit growth, X19-Net profit growth, and X20-Equity growth.

IV. RESULTS AND DISCUSSIONS
To answer the question of which commonly used classification algorithm in data mining gives the best predictive results, Weka software is applied to the research data to conduct experiments. Logistic Regression and ANN give high bankruptcy prediction performance (accuracy over 97%). To improve classification accuracy through model hybridization, the Boosting and Bagging methods are employed. The results presented in Figure 5, Figure 6, and Figure 7 show the accuracy. The bankruptcy prediction results of the Bagging and Boosting methods improve on those of the original base methods. With 30 ratios used for forecasting in the model, which ratio is the most important and has the most predictive significance? Weka software is used to identify the significant variables in bankruptcy prediction, and Figure 8 shows the 10 ratios that have the greatest impact on corporate bankruptcy prediction. Among them, X13-Total asset turnover is the most important indicator, followed by X5-Debt to equity ratio and X24-Debt ratio.
We select the 10 most significant indicators as the set of important indicators in predicting bankruptcy: X13-Total assets turnover ratio, X5-Debt to equity ratio, X24-Debt ratio, X3-Receivables turnover ratio, X26-Receivables conversion period, X27-Payables conversion period, X7-Operating cash flows ratio, X1-Current ratio, X14-Inventory conversion period, and X2-Quick ratio. To test again whether these financial ratios are the most important in predicting bankruptcy, we use a dataset in which the set of 10 ratios replaces the set of 30 ratios. The results of this study are consistent with, and have higher accuracy than, those of Liang, et al. [5], Barboza, et al. [6], Chou, et al. [7], Antunes, et al. [8], Chen, et al. [14].
The results show that, for algorithms with high prediction accuracy such as Logistic Regression, ANN, and DT, the reduced data set with 10 ratios gives the same accuracy as the data set with 30 ratios. For the Bayesian and KNN algorithms, the prediction accuracy with the reduced data set is even far superior to that with the full set of indicators. From these results, this research suggests choosing a set of the most important indicators for predicting bankruptcy, which saves resources in forecasting while retaining high accuracy.
The research then applies the Decision Tree algorithm (J48) after removing the ratios that have no influence or little importance. The results show that the Decision Tree algorithm predicts bankruptcy with an accuracy of 97.9%, implying that the Decision Tree model is appropriate for predicting bankruptcy for Vietnamese enterprises. Appendix 7 depicts the Decision Tree results for the 10 most important indicators that lead to corporate bankruptcy risk. At level 1, X13-Total asset turnover ratio is the most important ratio for predicting bankruptcy risk: when asset turnover is less than 1.4654, the business is forecast to be at risk of bankruptcy. The next most important indicator is X5-Debt to equity ratio, at level 2, with a threshold of 0.511 that leads to a bankruptcy prediction.

V. CONCLUSIONS AND RECOMMENDATIONS
This study uses data mining to predict corporate bankruptcy. The sample consists of companies listed in Vietnam in the period 2009-2020. The study evaluates whether data mining algorithms can accurately predict the bankruptcy of companies in Vietnam and which financial indicators are the most effective predictors. To achieve the research objectives, the research used, in turn, Logistic Regression, Bayesian Network, K-nearest neighbor, Artificial Neural Network (ANN), Support Vector Machine, and Decision Tree. Based on the research results, it can be seen that all 6 methods are accurate in predicting the status, normal or at risk of bankruptcy, of the companies in the sample. In addition, we recommend the use of Decision Tree and ANN, which give the highest prediction accuracy. It can be concluded that these models are suitable for predicting bankruptcy for Vietnamese enterprises in the current period. Moreover, the research has identified the 10 financial ratios that are most important in predicting bankruptcy risk. From these results, the research offers some recommendations for businesses and investors, as well as practical suggestions for listed companies to minimize bankruptcy risk.
Total assets turnover ratio, debt to equity ratio, and debt ratio are the three most important indicators in predicting the bankruptcy risk of a business. The results also show that Vietnamese enterprises during the study period are at risk of bankruptcy due to poorly implemented investment decisions stemming from excessive financial leverage and inefficient business activities. The evidence of this research is an important scientific basis for financial managers when planning strategies.
It is necessary to improve the health of the business on its existing foundation by making fundamental changes that allow it to operate more efficiently and by creating a better "new normal" environment in which the business can achieve its strategies and goals. The research makes some recommendations: prepare financial statements under the current regulations of the Ministry of Finance; have financial statements audited by reputable auditing agencies; and, in addition to cultivating knowledge about management and law, regularly improve the listed companies' knowledge of corporate finance, especially of the financial ratios that measure business health.
The results show that financial managers need to be careful with regulations on mobilizing funding sources and should fully exploit internal capital sources, especially retained earnings, to reduce the cost of capital and to limit the use of debt, especially short-term debt. Moreover, financial managers need to make greater use of highly liquid assets to improve investment efficiency. Furthermore, financial managers need to regularly re-check the investment regulations so that the business plan can be adjusted in time.
However, the study still has certain limitations. The factors that affect corporate bankruptcy lie not only in financial ratios but also in human behavior. This study has not addressed intervening factors such as human behavior, crowd psychology, and speculation, which affect the increase or decrease in the bankruptcy risk of listed companies in Vietnam.

Fig. 2 Artificial Neural Network model

Fig. 7 Comparison of accuracy among methods

Fig. 9 Accuracy of algorithms with two datasets

Fig. 8 The most important indicators

TABLE 3
DANG NGOC HUNG, VU THI THANH BINH: DATA MINING FOR BANKRUPTCY PREDICTION