A Conjoint Analysis of Road Accident Data using K-modes Clustering and Bayesian Networks (Road Accident Analysis using clustering and classification)

Abstract: Road and traffic accidents are among the important concerns in today’s world. Every country suffers heavy losses from road accidents in terms of public health and property. Road accident analysis therefore plays an important role in the public health domain. It is performed to identify the factors responsible for road accidents; knowledge of these factors helps in understanding the circumstances of road accidents and can be used to prevent them. One problem in accident analysis is that most road accident data is biased in nature. For example, critical road accidents are very few in comparison to slight or minor injury accidents. Various studies have reported that clustering prior to analysis can increase the efficiency and accuracy of classification. The motive of this study is to perform a conjoint analysis on road accident data to investigate whether the performance of classification of unbiased data improves after clustering.


I. INTRODUCTION
Road and traffic accidents [1] are one of the greatest harms that transportation inflicts on public health. The transportation system itself is not responsible for these accidents; several other factors are [2,3]. These include environmental factors such as weather and temperature; road-specific factors such as road type, road width, and road shoulder width; human factors such as wrong-side driving and excessive speed; and other factors. Whenever a road accident takes place anywhere in the world, some of these factors are involved. Their influence is not the same in all countries; they affect road accidents in different countries in different ways. Several studies [4][5][6][7][8][9][10][11][12][13] have focused on identifying these factors so that the relationship between accident factors and accident severity can be established. This relationship can be used to reduce the accident rate by providing preventive measures [13].

The analysis of road and traffic accidents is widely known as road and traffic safety, in which the outcome of accident analysis is used for traffic accident prevention. The literature in the traffic safety domain is rich, comprising several research studies [14][15][16][17][18][19][20] on road accident data analysis using statistical techniques, mathematical models, data mining, and machine learning techniques. Classification accuracy is one of the most important parameters for evaluating the performance of a classifier on a given data set. But if the data is not balanced, that is, if the class values of the target attribute are not uniformly distributed, the classifier accuracy can be biased.

In this study, we use k-modes clustering and Bayesian networks to perform a conjoint analysis on imbalanced road accident data from Leeds, UK, in which the counts of severe injury accidents and slight injury accidents differ greatly. The results reveal that although conjoint analysis on imbalanced data can improve the accuracy of the classifier, there is no guarantee that every cluster will achieve an unbiased classification or better performance than can be achieved without clustering. The rest of the paper is organized as follows. Section 2 discusses the data set used and the methodology adopted for this study. Section 3 presents the experimental results and discussion. Finally, we conclude in Section 4.

A. Data Set
The data set used for this study is obtained from Leeds, UK [21]. It consists of 14 attributes and 1246 accident records covering the five years from 2011 to 2015. The accident attributes include the geo-coordinates of the accident location, number of vehicles, accident date, time, month and year, type of victim, sex of victim, type of accident, severity of accident, type of vehicle, road type, road surface conditions, and weather conditions.

B. Cluster Analysis
Clustering [22] partitions a large data set into homogeneous segments. It is usually applied to large data sets in which class labels are missing. After clustering, each homogeneous segment can be assigned a label by investigating the properties of the data objects in the group. We have used the k-modes clustering technique to segment our accident data into homogeneous groups. The k-modes algorithm [23] is an extension of the traditional k-means algorithm that uses modes as cluster centres and differs in the similarity measure, which is given as follows.
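Given a set of categorical data objects D defined by n attributes A_1, A_2, ..., A_n, the distance between two objects X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) in the k-modes algorithm can be defined as

$$ d(X, Y) = \sum_{j=1}^{n} \delta(x_j, y_j) $$

where

$$ \delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases} $$
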
The k-modes algorithm is well suited to nominal or categorical data sets. Our accident data consists of categorical attributes; hence we selected k-modes clustering for road accident analysis. The procedure of the k-modes algorithm is as follows.

K-modes clustering algorithm
Input: data set D; k, the number of clusters to be formed
Output: k clusters
1. Select k random objects as the initial cluster centres (modes).
2. Compute the distance between every object and every cluster centre using the k-modes distance measure.
3. Assign each object to the cluster whose centre is at minimum distance from it.
4. Select a new centre (mode) for every cluster and compare it with the previous one; if the values differ, continue with step 2.
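
The four steps of the k-modes procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: it assumes the simple matching dissimilarity (the number of attributes on which two objects differ), and the function names, the `seed` and `max_iter` parameters, and the toy records are all hypothetical.

```python
import random
from collections import Counter


def matching_distance(x, y):
    """Simple matching dissimilarity: count of attributes where x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def column_modes(rows):
    """Most frequent category of each attribute among the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Cluster categorical records; returns (labels, modes)."""
    rng = random.Random(seed)
    # Step 1: pick k random objects as the initial modes.
    modes = [tuple(x) for x in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest mode.
        labels = [
            min(range(k), key=lambda c: matching_distance(x, modes[c]))
            for x in data
        ]
        # Step 4: recompute the mode of every cluster.
        new_modes = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            # An empty cluster keeps its previous mode (an assumption of this sketch).
            new_modes.append(column_modes(members) if members else modes[c])
        if new_modes == modes:  # modes unchanged: converged
            break
        modes = new_modes
    return labels, modes


# Toy records (weather, lighting, vehicle); values are made up for illustration.
records = [
    ("wet", "dark", "car"), ("wet", "dark", "bus"), ("wet", "day", "car"),
    ("dry", "day", "bike"), ("dry", "day", "car"), ("dry", "dark", "bike"),
]
labels, modes = k_modes(records, k=2)
```

Because the initial modes are chosen at random, different seeds can give different partitions; in practice the algorithm is usually run several times and the partition with the lowest total within-cluster distance is kept.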

C. Number of Cluster Selection
To determine the number of clusters to be formed from the data, the Bayesian information criterion (BIC) is used [24]. The BIC is defined in Eq. 4.

$$ \mathrm{BIC} = -2 \ln L + p \ln n \qquad (4) $$

where L is the maximized likelihood of the model, p is the number of model parameters, and n is the sample size.
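
As a rough illustration of how BIC values can be compared across candidate cluster models, the sketch below assumes the common form BIC = -2 ln L + p ln n, under which a lower value indicates a better trade-off between fit and complexity. The log-likelihoods and parameter counts in the example are placeholders, not values from this study, and the function names are hypothetical.

```python
import math


def bic(log_likelihood, p, n):
    """BIC in the form -2 ln L + p ln n (lower is better under this convention)."""
    return -2.0 * log_likelihood + p * math.log(n)


def select_by_bic(candidates, n):
    """candidates maps a cluster count to (log_likelihood, number_of_parameters)."""
    scores = {k: bic(ll, p, n) for k, (ll, p) in candidates.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Placeholder values for illustration only.
candidates = {2: (-120.0, 4), 3: (-100.0, 6), 4: (-99.0, 8)}
best_k, scores = select_by_bic(candidates, n=100)
```

The penalty term p ln n grows with the number of parameters, so a model with more clusters is preferred only when its likelihood gain outweighs the added complexity.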

D. Bayesian Networks
Bayesian Networks (BNs) have a proven track record in the field of data analysis. They are widely used to establish relationships between different sets of attributes using probabilistic calculations, with applications in bioinformatics, text classification, medicine, information retrieval, gaming, and transportation. In a BN, variables are represented as nodes of a graph, and the relationships between variables are represented by arcs (edges). A detailed description of Bayesian Networks can be found in [25][26].
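
To make the node-and-arc representation concrete, the following sketch builds a toy three-node network (weather -> surface -> severity) and answers a query by enumerating the joint distribution. The structure, the variable names, and every probability are invented for illustration; they are not learned from the Leeds data and do not represent the networks built in this study.

```python
from itertools import product

# Arcs are stored as a parent list per node: weather -> surface -> severity.
parents = {"weather": (), "surface": ("weather",), "severity": ("surface",)}
values = {
    "weather": ("fine", "rain"),
    "surface": ("dry", "wet"),
    "severity": ("slight", "severe"),
}
# Conditional probability tables, keyed by the tuple of parent values (made-up numbers).
cpt = {
    "weather": {(): {"fine": 0.7, "rain": 0.3}},
    "surface": {
        ("fine",): {"dry": 0.9, "wet": 0.1},
        ("rain",): {"dry": 0.2, "wet": 0.8},
    },
    "severity": {
        ("dry",): {"slight": 0.9, "severe": 0.1},
        ("wet",): {"slight": 0.8, "severe": 0.2},
    },
}


def joint(assignment):
    """Joint probability as the product of each node's conditional probability."""
    prob = 1.0
    for node, pars in parents.items():
        key = tuple(assignment[p] for p in pars)
        prob *= cpt[node][key][assignment[node]]
    return prob


def posterior(query, evidence):
    """P(query | evidence), summing the joint over all unobserved variables."""
    hidden = [n for n in parents if n != query and n not in evidence]
    scores = {}
    for q in values[query]:
        total = 0.0
        for combo in product(*(values[h] for h in hidden)):
            assignment = dict(evidence)
            assignment.update(zip(hidden, combo))
            assignment[query] = q
            total += joint(assignment)
        scores[q] = total
    norm = sum(scores.values())
    return {q: s / norm for q, s in scores.items()}


severity_given_rain = posterior("severity", {"weather": "rain"})
```

Enumeration is exponential in the number of hidden variables, so it is only practical for very small networks; dedicated libraries use more efficient inference algorithms.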

E. Performance Evaluation Parameters
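In this paper, several performance parameters [22] are used to assess the model fit for every cluster formed from the data. These indicators are accuracy, sensitivity, specificity, the harmonic mean of sensitivity and specificity (HMSS), and ROC area. They can be calculated using the following equations:

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \mathrm{Sensitivity} = \frac{TP}{TP + FN} $$

$$ \mathrm{Specificity} = \frac{TN}{TN + FP} $$

$$ \mathrm{HMSS} = \frac{2 \times \mathrm{Sensitivity} \times \mathrm{Specificity}}{\mathrm{Sensitivity} + \mathrm{Specificity}} $$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.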
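
The confusion-matrix indicators can be computed directly from the four counts. The sketch below uses the standard definitions of accuracy, sensitivity, and specificity, with HMSS taken as their harmonic mean; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and HMSS from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sensitivity + specificity
    hmss = 2 * sensitivity * specificity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "hmss": hmss,
    }


# Made-up counts: a classifier that labels every record as the majority class.
majority_only = classification_metrics(tp=0, tn=90, fp=0, fn=10)
# Made-up counts: a classifier that also detects most minority-class records.
balanced = classification_metrics(tp=8, tn=80, fp=10, fn=2)
```

In the first example accuracy is 0.9 while sensitivity and HMSS are both 0, which shows why accuracy alone can be misleading on imbalanced data.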

III. RESULTS AND DISCUSSION
This section presents the experimental analysis and a discussion of the results. Initially, data preprocessing is performed on the road accident data to bring it into the form required for analysis. Several attributes are transformed into a suitable form using data transformation methods.

A. Cluster Analysis
After data selection and preprocessing, the selected data is used for cluster analysis with the k-modes clustering algorithm. The number of clusters is determined by observing the BIC values of different cluster models. Fig. 1 illustrates the cluster selection using BIC values.
Based on facts reported in previous studies [27][28], a cluster model with 4 clusters is selected. The k-modes technique is then applied to the data, and four clusters are obtained. A description of these four clusters is given in Table 1.

B. Performance Evaluation of Bayesian Network
Next, Bayesian Networks (BNs) are used to investigate the factors that contribute to accident severity. Separate BNs were built for each cluster and for the whole data set.
The main objective of this study is to determine whether a conjoint analysis (k-modes and BN) yields new findings. The BNs built for the 4 clusters and the whole data set were compared using performance indicators and complexity to validate the goodness of model fit. Table 2 reports the accuracy, sensitivity, specificity, HMSS, and ROC area for each cluster and for the whole data set (WD).
It can be seen from Table 2 that the lowest accuracy is achieved in C3 and the highest in C1. Because the accident data is imbalanced, ROC values are also taken into consideration. The ROC values indicate that classification performance is better in C2, whereas in the other clusters the ROC values are lower than that of WD. This indicates that although higher accuracy can be achieved as a result of the clustering process, if the data is imbalanced there is no guarantee that efficient classification results will be achieved.
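
The gap between accuracy and ROC area on imbalanced data can be seen in a small sketch. It computes the ROC area as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one (ties count as one half), which is a standard equivalent of the area under the ROC curve. The labels and scores are made up, and `roc_auc` is a hypothetical helper, not the tool used in this study.

```python
def roc_auc(labels, scores):
    """ROC area via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Nine slight (0) and one severe (1) record, all given the same score.
labels = [0] * 9 + [1]
constant_scores = [0.1] * 10
auc_constant = roc_auc(labels, constant_scores)

# Accuracy of always predicting the majority class on the same labels.
accuracy_majority = sum(1 for y in labels if y == 0) / len(labels)
```

A model that cannot separate the classes at all still reaches an accuracy of 0.9 here, while its ROC area stays at 0.5, the value expected from random ranking.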

IV. CONCLUSION
The paper presents a conjoint analysis using k-modes clustering and Bayesian Networks on imbalanced road accident data from Leeds, UK. The main objective of this study was to compare the performance of classification before and after the clustering process. First, the k-modes algorithm is used to cluster the data into 4 homogeneous groups; these clusters and the whole data set are then analyzed using Bayesian Networks. A separate Bayesian Network is built for each cluster and for the entire data set, and the networks are evaluated on the basis of performance indicators. The results indicate that classification accuracy improves slightly as a result of the clustering process, but the ROC values decrease slightly for some clusters. This indicates that the performance of the classifier in terms of accuracy is biased towards the class value that has a comparatively large number of instances. Future work will comprise a detailed analysis of these Bayesian Networks to establish the relationships between road accident attributes and to identify which attributes have a higher impact on accident severity.

Table 2: Bayesian network performance on each cluster and whole data set