A data mining approach for classification of traffic violations types

a Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, 86400 Batu Pahat, Johor, Malaysia b Faculty Science Computer and Mathematics, Universiti Teknologi MARA (UiTM), Segamat, Johor, Malaysia. c Traffic Systems Sdn. Bhd., 30-1, Jln Radin Bagus 3, Sri Petaling, 57000 Kuala Lumpur, Malaysia 1 qiqiiyla.othman@gmail.com; 2 feresa@uthm.edu.my; 3 aidam@uthm.edu.my; 4 salama@uthm.edu.my; 5 shamalap@uitm.edu.my; 6 shafiza@senatraffic.com.my


Introduction
A traffic summons (also known as a traffic ticket) is issued by a law enforcement official to inform a motorist, meaning anyone who drives a car, truck, or bus or rides a motorcycle, that they have been stopped [1] [2]. At some point, most drivers are cited for a moving violation because they are speeding, running a red light, or committing some other type of traffic infraction [3]. The consequences of a ticket are not calamitous, but at the very least dealing with one requires an investment of time. A significant proportion of people do not regard traffic offenses as criminal wrongdoing [7]; however, nothing could be further from the truth [4]. Because of the high number of fatalities that can result from traffic offenses around the world, they have become a major source of public concern [5] [6].
Data mining allows the processing of large amounts of historical data and the condensing of that data into valuable information that can be used to construct various models, such as prediction models, clustering models, and anomaly models [8]. Data mining software such as RapidMiner lets users analyze data with various data mining approaches until knowledge is extracted, so that organizations holding the information can make the best possible move [3] [7].
International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 7, No. 3, November 2021, pp. 282-291. Othman et al. (A data mining approach for classification of traffic violations types)
According to the literature, several studies have examined the demographic and socioeconomic features of criminal offenders from varied backgrounds. However, few studies characterize traffic offenders and drivers who receive citations or warnings. Understanding traffic offenses is critical because it allows for the development of more effective prevention and enforcement strategies to reduce these offenses and, ultimately, road accidents [4] [9]. Moreover, data mining can be applied in many industries to support improvement and forecasting. For traffic violations, prior research has primarily looked at how well the characteristics of drivers who are ticketed for traffic violations can be predicted [5] [7].
This study aims to build a comparative model for classifying traffic violation types based on a data mining approach. Traffic violation types are categorized into Citation, Warning, and ESERO (Electronic Safety Equipment Repair Order), as referred to in [7]. The classification algorithms used are Naïve Bayes, Gradient Boosted Trees, and Deep Learning [10] [11]. This research is scoped to traffic violation data from Montgomery County from 2013 to 2016, and the dataset was extracted from a website called data.world. The classification models developed in this paper are measured by accuracy, recall, precision, and f-measure.
In the existing work [11], the data came from two different sources: the Southwest City Police Department (SWCPD) and the United States Census Bureau for the year 2000. All traffic citation data was obtained from the SWCPD and covered all traffic offenses committed between January 1, 1999, and October 10, 1999, totaling 87,792 traffic violations and 211,689 fines within that time period. In addition to driver demographics and the day, date, and time of the violation, the data on these violations includes the types of charges levied against the driver, the driver's speed, and the legal speed limit in the area. The dataset also records the accident day and year, other variables, the vehicle involved, and the people involved. In total, 39 attributes were chosen from both datasets, and after data cleansing, 573 accident records involving a driver casualty remained.
The remainder of this work is organized as follows. Section 2 outlines the methodology used to complete the data mining work, together with the dataset and the evaluation metrics employed. Section 3 summarizes the findings, and Section 4 closes with a conclusion and some recommendations for further research.

Method
The Knowledge Discovery in Databases (KDD) framework is used in this study to classify the different types of traffic violations. KDD is a data mining process that seeks to uncover interesting patterns in the underlying data. It is well suited to large datasets and can process data from a database according to user requirements. KDD also covers how data is prepared, which algorithms can be applied to process a large amount of data efficiently, and how the results can be interpreted and visualized [12] [13]. KDD begins with data warehousing, in which the relevant fields are drawn from the database. Data warehousing sets the stage for KDD in two important ways: data cleaning and data access [14].
There are five phases in KDD that need to be implemented to get the results from classification or prediction techniques. The five phases include selection, pre-processing, transformation, data mining, and evaluation [15]. In this research, the data mining part will involve the classification process to predict traffic violation offenders based on the summons issued. The phases in these experiments are shown in Fig. 1 as adopted from the KDD methodology.

Data Selection
According to [16], data selection is the process of selecting, from an initial dataset, the attributes that are most relevant to the data mining task at hand. Removing irrelevant or redundant features reduces the execution time of the data mining operation and can raise its precision. Feature selection is crucial for the effectiveness of data mining algorithms because it ensures that only meaningful and useful attributes are used. The final set of chosen features should also meet the key requirement of reducing the overall size of the data [17].
Data gathering is a required technical stage and should be completed thoroughly so that the data can later be run and tested to compare the performance of the classifiers in predicting traffic offenses [18]. The primary purpose of this stage is to obtain an appropriate dataset with suitable attributes and run the test. In this study, the dataset was extracted from data.world. The dataset originally contained 35 attributes and 7,700 rows. After data preparation with Turbo Prep, the dataset consists of 8 attributes and 7,700 rows. The eight attributes are date, description, vehicle type, year, make, model, violation type, and gender. Fig. 2 shows the dataset after data pre-processing.
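The reduction from 35 attributes to 8 can be pictured with a minimal Python sketch. It is an illustration only, not the Turbo Prep workflow used in the study: the column names and the two sample rows are hypothetical stand-ins, since the paper names the attributes only informally.

```python
import csv
import io

# Hypothetical column names: the paper lists the eight attributes only informally.
KEPT_ATTRIBUTES = [
    "date", "description", "vehicle_type", "year",
    "make", "model", "violation_type", "gender",
]

def select_attributes(rows, kept=KEPT_ATTRIBUTES):
    """Return rows restricted to the kept attributes; absent columns become None."""
    return [{name: row.get(name) for name in kept} for row in rows]

# Made-up sample standing in for the original export (only 10 columns shown).
SAMPLE_CSV = """date,description,vehicle_type,year,make,model,violation_type,gender,latitude,longitude
2014-03-02,FAILURE TO STOP AT SIGN,Automobile,2008,TOYOTA,COROLLA,Citation,M,39.0,-77.1
2015-07-19,EXPIRED REGISTRATION,Motorcycle,2011,HONDA,CBR,Warning,F,39.1,-77.2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
selected = select_attributes(rows)
print(len(selected), "rows,", len(selected[0]), "attributes")
```

The row count is unchanged by this step, which matches the 7,700 rows reported before and after preparation.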

Data Pre-Processing
The development of data mining cannot be separated from the rapid advancement of information technology, which allows large amounts of data to accumulate [19]. Mining implies an effort to extract value from a large quantity of raw material. Based on best practice, experts and practitioners who work to discover information through data mining propose step-by-step workflows that increase the chances of success in putting the analysis to use.
First, the dataset obtained from the website has entries that are incomplete: missing, invalid, or unnecessary data. Likewise, some attributes are not relevant to the data mining analysis. Data that is not relevant is better removed because its presence can reduce the quality or accuracy of the data mining later. Data cleaning is essential in every research project to detect and remove errors from the raw data [20]. In the pre-processing phase, the dataset was read into RapidMiner and processed with Turbo Prep. Turbo Prep is designed to make data preparation less time-consuming and difficult [21]. It provides a user interface in which the data is always visible front and center, so the user can make changes step by step and immediately see the results, with a comprehensive range of supporting functions for preparing the data for model building or presentation.
During this process, the user first chooses whether to do prediction or clustering. RapidMiner then displays the details of every attribute in the dataset. In Turbo Prep, data can be transformed: for example, attributes can be renamed, their types changed, and selected columns removed from the dataset [22]. Turbo Prep can also replace missing values in the dataset. A further advantage is that the tool provides quality measures, so the user can see typical data quality problems at a glance, along with details of how the quality measures are calculated. The details show missing values, infinite values, IDs, stability, and validity. Users can check these details and then transform the data so that all attributes with a high proportion of missing values and low stability are removed from the dataset.
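To make two of the quality measures concrete, the following Python sketch computes a missing-value fraction and a stability score per column. It is a rough illustration under stated assumptions, not RapidMiner's implementation: stability is taken here to be the share of the most frequent non-missing value, the markers treated as missing are a guess, and the 0.5 threshold and sample rows are hypothetical.

```python
from collections import Counter

MISSING = (None, "", "?")  # assumed markers for a missing value

def quality_measures(rows, column):
    """Return (missing_fraction, stability) for one column.

    Stability is assumed to be the share of the most frequent non-missing
    value among the non-missing values (1.0 means a constant column).
    """
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in MISSING]
    missing_fraction = 1 - len(present) / len(values) if values else 0.0
    if not present:
        return missing_fraction, 0.0
    top_count = Counter(present).most_common(1)[0][1]
    return missing_fraction, top_count / len(present)

def drop_by_missing(rows, max_missing=0.5):
    """Keep only columns whose missing fraction is at most max_missing."""
    keep = [c for c in rows[0].keys() if quality_measures(rows, c)[0] <= max_missing]
    return [{c: row.get(c) for c in keep} for row in rows]

# Hypothetical sample rows, not taken from the traffic violation dataset.
sample = [
    {"gender": "M", "agency": "MCP", "geo": ""},
    {"gender": "F", "agency": "MCP", "geo": ""},
    {"gender": "M", "agency": "MCP", "geo": "x"},
    {"gender": "", "agency": "MCP", "geo": ""},
]
for column in sample[0]:
    print(column, quality_measures(sample, column))
cleaned = drop_by_missing(sample)
```

The sketch filters on the missing fraction only; how stability should enter the decision is left to the analyst.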

Classification Algorithms
Gradient Boosted Trees, Naïve Bayes, and Deep Learning were used in this classification experiment. According to [23], Naïve Bayes is a probabilistic classifier based on Bayes' theorem in which all variables are presumed to be independent of one another. The algorithm is straightforward to design and performs well on very large datasets. According to Bayes' theorem, $P(A \mid X) = \frac{P(X \mid A)\,P(A)}{P(X)}$, where $P(A)$ is the relative frequency of class $A$ samples and $P(X)$ is the same for every class, so the posterior $P(A \mid X)$ increases as $P(X \mid A)\,P(A)$ increases [23].
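A minimal Python sketch of a categorical Naïve Bayes classifier is given here to illustrate the rule. It is not the RapidMiner operator used in the experiments: the class name, the Laplace smoothing parameter `alpha`, and the toy rows are assumptions made for illustration. Because P(X) is identical for every class, the sketch compares only log P(A) plus the sum of log P(x_j | A).

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing; an assumption, not from the paper

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        n_features = len(X[0])
        # value_counts[j][c][v] = number of class-c rows whose feature j equals v
        self.value_counts = [defaultdict(Counter) for _ in range(n_features)]
        self.values = [set() for _ in range(n_features)]
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                self.value_counts[j][label][v] += 1
                self.values[j].add(v)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior per class: log P(A) + sum_j log P(x_j | A)."""
        scores = {}
        for c in self.classes:
            score = math.log(self.class_counts[c] / self.n)
            for j, v in enumerate(row):
                num = self.value_counts[j][c][v] + self.alpha
                den = self.class_counts[c] + self.alpha * len(self.values[j])
                score += math.log(num / den)
            scores[c] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)

# Toy rows (vehicle type, gender) with made-up labels.
X = [("Automobile", "M"), ("Automobile", "F"), ("Motorcycle", "M"),
     ("Motorcycle", "M"), ("Automobile", "F"), ("Motorcycle", "F")]
y = ["Citation", "Warning", "Citation", "Citation", "Warning", "Citation"]

nb_model = CategoricalNaiveBayes().fit(X, y)
print(nb_model.predict(("Motorcycle", "M")))
print(nb_model.predict(("Automobile", "F")))
```

Working in log space avoids numerical underflow when many conditional probabilities are multiplied.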
The second classification algorithm used in this experiment is Gradient Boosted Trees [24]. A boosted decision tree is trained with an ensemble learning method in which each new tree corrects the errors of the trees before it. If the first tree makes a mistake, it is corrected by a second one; the errors left by the first and second trees are corrected by a third, and so on. According to [25], boosting is one of the most effective learning concepts established in the last twenty years, since it can combine a large number of weak learners into a single strong learner with little effort. A gradient boosted decision tree is a classification model that aggregates many tree-based models and refines its estimates step by step to reach its prediction. Boosting is an adaptive nonlinear regression technique that helps improve the accuracy of trees. Boosted trees outperform single trees in terms of accuracy but are slower and less interpretable by humans. The gradient boosting approach is designed to address these concerns.
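The idea that each tree corrects the errors of the previous ones can be sketched in a few lines of Python. This is a deliberately simplified illustration, assuming squared loss, a single numeric feature, and one-split trees (stumps); the tree count, learning rate, and toy data are hypothetical, and the H2O-based operator used in the experiments works differently in detail.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit each stump to the current residuals and add a damped copy of it."""
    base = sum(ys) / len(ys)          # initial constant prediction
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Toy data: a bump that a single stump cannot represent.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1, 0, 0]

baseline = gradient_boost(xs, ys, n_trees=0)   # constant prediction only
boosted = gradient_boost(xs, ys)
print(mean_squared_error(baseline, xs, ys), mean_squared_error(boosted, xs, ys))
```

The learning rate damps each correction, which is the usual way to trade training speed for stability in boosting.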
The last algorithm used in this research work is Deep Learning, which mimics human intelligence [26]. Many recognition problems with huge training samples, numerous representations, and high-speed streams benefit from this technique. Deep Learning, which is based on machine learning technology (in particular, neural networks), can analyze information from multiple sources to support better-informed decisions.

Experimental Setup
The software specification describes the tool used to run the experiments and generate the results, so that the researcher can obtain accurate results. The traffic violation prediction models are built with a data mining tool called RapidMiner. RapidMiner is an open-source, Java-based data mining program and data science platform. It provides a collaborative environment for data preparation, deep learning, machine learning, predictive analytics, and text mining. RapidMiner is developed on an open-core model. It combines tools and applications to provide a user-friendly integrated environment for up-to-date data mining techniques [27].

Evaluation Metrics
This section presents the evaluation metrics applied in this research: accuracy, precision, recall, and F-measure [28]-[30]. Accuracy is defined as the ratio of correctly classified observations to the total number of observations in a dataset, while precision is the ratio of true positives (TP) to all observations predicted as positive.

Results and Discussion
This section presents the comparative results of the classification experiments using Gradient Boosted Trees, Naïve Bayes, and Deep Learning. The results are reported in terms of accuracy, precision, recall, and F-measure. Implementation of the models is described in terms of processes in RapidMiner. In RapidMiner, a process is visualized as a series of connected operators that transform the data for further analysis. Each operator is an element that takes input and produces output, such as a function, a formula, or a node. The processes shown in Fig. 5 include feature set, Deep Learning training, cross-validation, model simulator, and explain prediction.

Naïve Bayes Classifiers
The analysis is performed using RapidMiner Studio. This research work uses Naïve Bayes in RapidMiner to construct the prediction model. Data is read with the Retrieve operator and passed to the Cross Validation operator. The Set Role and Discretize operators are used in the pre-processing step. Cross-validation is applied to evaluate the accuracy of the model. The Cross Validation operator is a nested operator with two sub-processes, training and testing.
Within the Cross Validation operator, the model is first trained in the training sub-process. The trained model is then applied in the testing sub-process, where its performance is also measured. A Naïve Bayes operator is used in the training sub-process. The Apply Model operator tests the model, while the Performance operator evaluates the performance. Fig. 3 shows the processes in Naïve Bayes classification in RapidMiner.
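The train-then-test cycle of cross-validation can be sketched in plain Python as follows. This is an illustration rather than the RapidMiner operator: the number of folds is not stated in the paper, so `k=10` is an assumption, and the function names and the majority-class baseline are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumes n_rows >= k so that no test fold is empty.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(rows, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = train_fn([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical baseline learner: always predict the most common training label.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

toy_rows = list(range(20))
toy_labels = ["Citation"] * 12 + ["Warning"] * 8
print(cross_validate(toy_rows, toy_labels, train_majority, predict_majority, k=5))
```

Any learner can be plugged in through `train_fn` and `predict_fn`, which mirrors how the training and testing sub-processes are swapped per algorithm.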

Fig. 3. Process of Naïve Bayes classifier in RapidMiner
RapidMiner's Performance operator provides several options for evaluating the Naïve Bayes model. The classifier's accuracy was 65.75 percent, precision was 77.24 percent, recall was 69.01 percent, and f-measure was 72.89 percent.

Gradient Boosted Trees Classifier
Fig. 4 shows the Gradient Boosted Trees classifier process in RapidMiner. The H2O GBT operator is used to classify the traffic violation dataset used in this experiment. During this phase, the Apply Model and Performance operators are used to calculate the Gradient Boosted Trees classifier's accuracy, precision, recall, and f-measure.

Fig. 4. Process of Gradient Boosted Trees classifier in RapidMiner
The performance of the Gradient Boosted Trees classifier shows that the accuracy is 69.59 percent. The precision performance was evaluated as 70.92 percent, while recall performance was recorded at 91.92 percent, and the f-measure in this experiment was 80.06 percent.

Deep Learning Classifier
As mentioned before, this research work compares the three algorithms, and the last of them is Deep Learning. Fig. 5 shows the process in RapidMiner. For the Deep Learning classifier, the accuracy was 69.22 percent, the precision was 72.52 percent, the recall was 87.01 percent, and the f-measure was 79.10 percent.

Performance Comparison
In this research work, a traffic violation dataset was used to test the efficiency of three methods: Naïve Bayes, Gradient Boosted Trees, and Deep Learning. As shown in Table 4, Naïve Bayes achieved an accuracy of 65.75 percent, Gradient Boosted Trees 69.59 percent, and Deep Learning 69.22 percent. The results are shown in Fig. 6. Table 1 compares the performance of all three models. Gradient Boosted Trees has the highest accuracy, Deep Learning the second-highest, and Naïve Bayes the lowest. For precision, Naïve Bayes is highest at 77.24 percent, Deep Learning follows at 72.52 percent, and Gradient Boosted Trees is lowest at 70.92 percent. For recall, Gradient Boosted Trees is highest at 91.92 percent, followed by Deep Learning at 87.01 percent and Naïve Bayes at 69.01 percent. For f-measure, Gradient Boosted Trees is highest at 80.06 percent, while Deep Learning reached 79.10 percent and Naïve Bayes 72.89 percent. According to the findings, the Gradient Boosted Trees and Deep Learning algorithms have good accuracy and recall but lower precision. This situation can arise for a variety of reasons when the predicted labels disagree with the actual labels. In general, a low precision indicates that the results contain a higher proportion of false positives. Naïve Bayes, by contrast, has the lowest accuracy and recall but the highest precision. In this sense Naïve Bayes is a picky classifier: it does not capture all of the positive cases, and it performs well mainly in terms of precision.
The challenge, however, is that as the sample data size grows, improving the recall rate becomes more difficult since precision diminishes.
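The reported figures can be cross-checked with a short Python sketch. It assumes the standard definitions of the metrics from binary confusion-matrix counts and takes the F-measure to be the harmonic mean of precision and recall; the function names are illustrative, and the values in `REPORTED` are the percentages quoted above.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all observations that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (precision, recall, reported f-measure) in percent, as quoted in the text.
REPORTED = {
    "Naive Bayes": (77.24, 69.01, 72.89),
    "Gradient Boosted Trees": (70.92, 91.92, 80.06),
    "Deep Learning": (72.52, 87.01, 79.10),
}

for name, (p, r, f_reported) in REPORTED.items():
    f = f_measure(p, r)
    print(f"{name}: recomputed F = {f:.2f}, reported = {f_reported:.2f}")
```

The recomputed values agree with the reported f-measures to within about 0.01 percentage points, a gap that is consistent with rounding of the published precision and recall.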

Conclusion
Using three classification techniques, namely Gradient Boosted Trees, Naïve Bayes, and Deep Learning, the classification of traffic violation types has been accomplished. In the performance assessment, Gradient Boosted Trees scores the highest accuracy of 69.59%, Deep Learning the second-highest at 69.22%, and Naïve Bayes the lowest at 65.75%. In the future, the results of this comparative classification experiment could serve as a standard or baseline for the development of classification or prediction models for traffic infractions. In later stages of the research project, a visualization method might be implemented to illustrate the intensity of traffic violations across geographical areas and accident-prone areas. The dataset also plays an essential role in this research work, and finding an excellent dataset is not easy. Therefore, we will continue to look for a high-quality dataset to improve prediction performance in the future. The prediction results could also be improved with different classifiers, such as Support Vector Machines and other Deep Learning models, to achieve better performance and accuracy.