Improved point center algorithm for k-means clustering to increase software defect prediction

International Journal of Advances in Intelligent Informatics ISSN 2442-6571 Vol. 6, No. 3, November 2020, pp. 328-339

K-Means is a clustering algorithm that is widely used and easy to apply. However, the algorithm is sensitive to its randomly chosen initial centroids, so it may not produce optimal results. This research aimed to improve the performance of the K-Means algorithm by applying a proposed algorithm called point center. The point center algorithm determines the initial centroid values for K-Means, replacing the random centroid values, and was applied to predict the errors of software defect modules. The selection of the variables X and Y then determines the cluster center members. Ten datasets were used for testing, nine of which were used for predicting software defects. The proposed point center algorithm showed the lowest errors. It also improved the performance of the K-Means algorithm, with an average of 12.82% cluster errors in the software datasets, compared with simple K-Means using randomly obtained centroid values. The findings contribute to developing a clustering model that handles data such as software defect modules more accurately.

Introduction
The current era is characterized by constant advances in technology and information, and software is everywhere: on the Web, mobile devices, desktops, and embedded systems, helping people achieve goals more easily, quickly, and efficiently [1]. These advances make software systems larger and more complex than before, so it is necessary to prevent software defects. Predicting the number of defects in a software module is therefore needed and can help developers allocate limited resources [2]. The prediction results categorize software modules as fault-prone or non-fault-prone [2]-[7].
Software defect prediction utilizes software metrics and fault data that originate from previous versions of the current software project or are retrieved from other similar software projects. The software data used to train learning models significantly influence the efficacy of software defect prediction techniques [5], [8]. A study investigating 3747 defects from 70 software systems developed by 29 Chinese aviation organizations showed 87% of software defects [9].
Unsupervised machine learning is increasingly being applied to software defect prediction. It is a useful approach for software practitioners because it reduces the need for labeled training data. Over the past few years, various software defect prediction models have been proposed to improve software quality, increasingly using machine learning. This approach can be divided into supervised methods, where training data require labels, and unsupervised methods, where data do not need to be labeled [10].
Many defect prediction approaches have been proposed; the majority of studies are on defect prediction techniques [7], [11]-[18]. Clustering analysis is an unsupervised machine learning technique that organizes patterns into groups. It is widely used in many fields, such as data mining, machine learning, pattern recognition, and image processing. Among clustering algorithms, K-Means is often used because of its simplicity and efficiency [19], [20].
One family of clustering techniques is partitional clustering. The most widely used partitional clustering algorithm is K-Means, in which n instances are partitioned into k clusters. A centroid is selected for each cluster so that it lies close to the instances of its group [21], [22]. K-Means begins by choosing k random data points as the initial set of centroids, which is then refined by repeating the next two steps. First, each point is assigned to the cluster of its nearest centroid. Then, the center of each cluster is recalculated as the average of all data points assigned to that cluster [23]-[25]. However, the main problem of this method is that it does not ensure optimal results, because the initial centroids are selected randomly [19], [26].
In this study, an algorithm called point center was proposed for K-Means clustering to overcome the random initial centroids, with a focus on clustering software modules according to their defects. The proposed point center algorithm finds the initial centroids for the K-Means algorithm, which is then applied to predict the errors of software defect modules. The overall error rate of this prediction approach was compared with that of the K-Means algorithm with random centroids. To demonstrate its effectiveness, the proposed approach was used to obtain the best cluster center values for the K-Means algorithm.

Data and experimental design
This study uses the NASA MDP datasets because they are very commonly used for software defect prediction and can be obtained from the PROMISE repository. From 2000 to 2013, 64.79% of software defect research used the NASA MDP datasets [27]. Each NASA MDP dataset consists of several software modules and their attribute characteristics. Modules that contain defects are categorized as fault-prone, and non-defective ones are categorized as non-fault-prone. The datasets also include the McCabe and Halstead complexity attributes listed in Table 1.
The calculations of the proposed method were performed on a computer. The hardware and operating system were a DELL laptop with an Intel Core i5-3340M CPU @ 2.70 GHz, 4.00 GB of RAM, and the 64-bit Windows 10 Pro operating system. The tools used in this study were Microsoft Excel, RapidMiner, and RStudio.

Point center algorithm
The weakness of the K-Means algorithm is that its initial centroids are chosen at random, so the results can be less than optimal. This study proposes an algorithm, named point center, to determine the initial centroid values for K-Means. The algorithm is based on selecting two variables, X and Y, that determine the cluster members. For variable selection, the first stage calculates the average of each attribute using Equation (1).
$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \qquad (1)$$
where $\bar{x}_j$ is the average of attribute $j$, $x_{ij}$ is the value of attribute $j$ for data point $i$ ($i = 1, \dots, n$), and $n$ is the amount of data. Then calculate the standard deviation (SD) of each attribute through (2). After calculating the averages and standard deviations, specify the center data point of the dataset as in (3).
$$m = [\bar{x}_a, \bar{x}_b] \qquad (3)$$
where $m$ is the first midpoint, $\bar{x}_a$ is the maximum of the attribute averages, and $\bar{x}_b$ is the average of the attribute with the minimum SD. Then, calculate the Euclidean distance between each data point and this starting point with (4).
$$d_i = \sqrt{(x_{ia} - \bar{x}_a)^2 + (x_{ib} - \bar{x}_b)^2} \qquad (4)$$
where $d_i$ is the Euclidean distance between data point $i$ ($i = 1, \dots, n$) and the first midpoint, and $\bar{x}_a$ and $\bar{x}_b$ are as defined for (3). Then, Equation (5) is used to select X as the first variable and Y as the second variable of the cluster center members.
$$m_{XY} = [\bar{x}_X, \bar{x}_Y] \qquad (5)$$
where $m_{XY}$ is the midpoint of the first and second variables, $\bar{x}_X$ is the minimum of the attribute averages, and $\bar{x}_Y$ is the average of the attribute with the maximum SD. The data point with the highest Euclidean distance in Equation (4), expressed in the variables X and Y selected by Equation (5), is chosen as the first candidate for the initial point center ($c_1$). Then, calculate the Euclidean distance from every data point to the current candidate $c_t$ with (6).
$$d(x_i, c_t) = \sqrt{(x_{iX} - c_{tX})^2 + (x_{iY} - c_{tY})^2} \qquad (6)$$
The data point with the highest distance in Equation (6) is chosen as the second point center candidate ($c_2$). The same step is repeated from each new candidate until the highest distance $d_t$ equals the previous distance $d_{t-1}$; the number of clusters is then $k = t - 1$. Finally, the cluster members of each point are determined by the candidate point centers and the variables X and Y.
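The procedure described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation (the paper reports using Excel, RapidMiner, and R). The function name `point_center_init`, the use of the sample standard deviation (`ddof=1`), the tie-breaking by first index, and the toy data are all assumptions made for the example.

```python
import numpy as np

def point_center_init(data):
    """Sketch of point center initialisation.

    Returns (centers, k, (x_var, y_var)), where centers are expressed
    in the two selected variables X and Y.
    """
    data = np.asarray(data, dtype=float)
    means = data.mean(axis=0)           # Equation (1): average of each attribute
    sds = data.std(axis=0, ddof=1)      # Equation (2); sample SD is an assumption

    # Equation (3): first midpoint from the attribute with the maximum
    # average and the attribute with the minimum SD.
    a, b = int(np.argmax(means)), int(np.argmin(sds))
    midpoint = np.array([means[a], means[b]])

    # Equation (4): distance of every data point to the first midpoint.
    dist = np.linalg.norm(data[:, [a, b]] - midpoint, axis=1)
    first = int(np.argmax(dist))

    # Equation (5): X is the attribute with the minimum average,
    # Y is the attribute with the maximum SD.
    x_var, y_var = int(np.argmin(means)), int(np.argmax(sds))
    xy = data[:, [x_var, y_var]]

    # Equation (6): keep jumping to the farthest point until the highest
    # distance stops changing.
    centers, prev = [xy[first]], None
    for _ in range(len(data)):          # safety bound on the number of candidates
        dist = np.linalg.norm(xy - centers[-1], axis=1)
        far = int(np.argmax(dist))
        if prev is not None and np.isclose(dist[far], prev):
            break
        centers.append(xy[far])
        prev = dist[far]
    return np.array(centers), len(centers), (x_var, y_var)

# Hypothetical toy data: two attributes, six data points.
toy = [[10, 1], [11, 1], [20, 2], [21, 2], [30, 3], [33, 3]]
centers, k, selected = point_center_init(toy)
print(k, centers.tolist(), selected)
```

Because each new candidate is at least as far from its predecessor as the predecessor was from the one before, the highest distance never decreases, so the loop stops once it repeats.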

K-Means Clustering Algorithm
The K-Means algorithm is a simple method for partitioning a given dataset into a specified number of clusters k. The algorithm was discovered by several researchers from various disciplines, notably Lloyd (1957, 1982), Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967). Because the K-Means cost is non-convex, convergence is guaranteed only to a local optimum, and the algorithm is usually quite sensitive to the initial centroid locations [29].
K-Means is a fairly simple clustering algorithm that partitions a dataset into k clusters. The main principle of this technique is to build a partition around the centroid (average) of each group of data. The algorithm starts by forming an initial cluster partition and then iteratively refines it until there is no significant change in the partition [30].
Standard K-Means initializes the clusters by randomly generating k data points, usually by producing uniformly random values for each dimension, whereas the proposed method supplies initial centroid values that are not random. Each K-Means iteration consists of two steps: i) cluster assignment and ii) centroid update. First determine the k centroid points, then group the data into k clusters, with the pre-selected centroid points as the cluster centers. Update each centroid point value with (7).
$$\mu_j = \frac{1}{N_j} \sum_{q=1}^{N_j} x_q \qquad (7)$$
where $\mu_j$ is the centroid point of the $j$-th cluster, $N_j$ is the amount of data in that cluster, and $x_q$ is the $q$-th data point in the cluster. Repeat the grouping and the centroid update until the centroid values no longer change.
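The two-step iteration can be sketched as follows. This is a generic Lloyd-style loop that takes the initial centroids as an argument, so either random or point center values can be passed in. The function name `kmeans`, the `max_iter` cap, and the toy data are assumptions for illustration.

```python
import numpy as np

def kmeans(data, init_centroids, max_iter=100):
    """Plain K-Means starting from caller-supplied initial centroids."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Step i) cluster assignment: index of the nearest centroid per point.
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step ii) centroid update, Equation (7): mean of each cluster's members.
        updated = centroids.copy()
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members) > 0:        # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break                       # centroids no longer change
        centroids = updated
    return centroids, labels

# Hypothetical toy data with two obvious groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
final_centroids, final_labels = kmeans(points, [[0, 0], [10, 10]])
print(final_centroids.tolist(), final_labels.tolist())
```

Passing the initial centroids explicitly keeps the result deterministic, which is the property the proposed initialisation relies on.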

Proposed Method
This research proposes a method that uses an algorithm called point center to determine the initial centroid values for K-Means clustering (Fig. 1). The steps of the proposed method are as follows:
Step 1: Preliminary data processing checks for and eliminates missing data using the RapidMiner 7.3 Library application. Each dataset with empty or missing values is filled in and kept numerical using the Replace Missing Value operator. Since the replacement affects only data values and not the attributes, the change can be applied to all copies of the data. Blank entries are filled with the average value of the data. The data are then stored again in Excel format for subsequent processing.
Step 2: Run the point center algorithm to get the k value and the point center values that serve as the initial centroids.
Step 3: Calculate the clusters with the K-Means algorithm, starting from the initial centroids and the k value obtained by the point center algorithm.
Step 4: Calculate the error rate and the Rand index from the confusion matrix of the K-Means result. Testing compares the clusters obtained by the proposed algorithm with those obtained by simple K-Means. The grouping produced by a clustering algorithm is treated as the predicted label, while the label in the dataset is the actual label. The final process compares the two.

Fig. 1. Steps of the point center algorithm.

K-Means is an unsupervised learning algorithm that works without labels; however, labels were needed here as a reference to measure the performance of the tested algorithms. The agreement between the predicted and actual clusters can be presented using the confusion matrix in Table 2. From Table 2, the error value and the Rand index (accuracy) are calculated as in (8) and (9). Furthermore, the results obtained are compared with simple K-Means calculations using the same evaluation technique.
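The evaluation step can be illustrated with a short sketch. It assumes, following the text's description of the Rand index as accuracy, that the Rand index is the share of data points on the diagonal of the confusion matrix and that the error rate is its complement. The function name and the example matrix are hypothetical; the matrix is not the paper's Table 4, only a matrix whose totals agree with the cluster sizes (54, 50, 46) and the 5.3% error reported for Iris.

```python
import numpy as np

def error_and_rand_index(confusion):
    """Return (error rate, Rand index) from a square confusion matrix.

    Rows are actual labels, columns are predicted clusters, and the
    clusters are assumed to be already matched to the labels.
    """
    confusion = np.asarray(confusion, dtype=float)
    rand_index = np.trace(confusion) / confusion.sum()  # share of agreeing points
    return 1.0 - rand_index, rand_index

# Illustrative matrix only (assumption): 150 points, 8 of them off the diagonal.
example = [[50, 0, 0],
           [0, 48, 2],
           [0, 6, 44]]
error, rand = error_and_rand_index(example)
print(f"error = {error:.1%}, Rand index = {rand:.1%}")
```

Under this definition the two measures always add up to 100%, which matches how the results are reported.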

Results and Discussion
This section compares the results measured with simple K-Means against those of K-Means using the proposed method, called Point Center K-Means (PCKM). Ten datasets were tested: the Iris dataset, which has 3 clusters, and 9 NASA MDP datasets (PC1, PC2, PC3, PC4, MW1, CM1, KC1, KC3, and MC2), each of which has 2 clusters.
The Iris dataset experiment used 3 classes (Setosa, Versicolor, Virginica) and 150 samples. Each sample has 4 attributes: sepal length, sepal width, petal length, and petal width. The proposed algorithm starts by calculating the average and the standard deviation of each attribute, presented in Table 3. After that, the center data point is set. It is determined as $m$ = [5.8433, 3.0540], obtained from the sepal length and sepal width attributes, and the first point center candidate ($c_1$) is the data point with the maximum Euclidean distance from it.
Next, the variables X and Y are determined, giving $m_{XY}$ = [1.1987, 3.7587] from the attributes petal width (X) and petal length (Y). The second point center candidate ($c_2$) and the subsequent candidates ($c_t$) were determined using the Euclidean formula. The distance values were $d_1$ = 2.1878, $d_2$ = 5.6921, and $d_3$ = 6.2626. Because the fourth distance was the same as the third, the number of clusters in this dataset is 3 ($k$ = 3). The point centers obtained from the cluster membership were $c_1$ = [2; 6.4], $c_2$ = [0.2; 1], and $c_3$ = [2.3; 6.9]. These point centers were used to cluster the data with K-Means. This was done in R by entering the point center values first and then clustering the data based on them, which produced the updated centroid values. Fig. 2 presents the division of clusters based on the proposed algorithm: cluster 1 has 54 data points, cluster 2 has 50, and cluster 3 has 46. The comparison between the cluster labels obtained and the actual class labels is presented in Table 4. Using Table 2 and equations (8) and (9), the error rate for Table 4 is 5.3% and the Rand Index is 94.7%. By comparison, simple K-Means, which determines the initial centroids randomly, gives an error rate of 10.7% and a Rand Index of 89.3%, also presented in Table 4.
Furthermore, for each NASA MDP dataset, the point center algorithm was used to get the initial centroid values, which were then used by the K-Means clustering algorithm (PCKM). A confusion matrix is presented in Table 4. Each dataset was then recalculated using K-Means with initial centroids generated randomly, and the confusion matrix results are also presented in Table 5.
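The three Iris distances reported in this section can be reproduced from the quoted points. In the sketch below, the midpoint and the three point centers are taken from the text. The sepal values (7.9, 3.8) of the farthest sample are an assumption: they are the sepal measurements of the standard Iris sample whose petal width and length are 2.0 and 6.4, which is consistent with the first point center [2; 6.4].

```python
import math

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

midpoint = (5.8433, 3.0540)    # reported center data point (sepal length, sepal width)
far_sample = (7.9, 3.8)        # assumed sepal values of the farthest sample

c1 = (2.0, 6.4)                # reported point centers (petal width, petal length)
c2 = (0.2, 1.0)
c3 = (2.3, 6.9)

d1 = euclid(far_sample, midpoint)   # distance to the first midpoint, Equation (4)
d2 = euclid(c1, c2)                 # distance from the first to the second candidate
d3 = euclid(c2, c3)                 # distance from the second to the third candidate
print(round(d1, 4), round(d2, 4), round(d3, 4))
```

Under these assumptions the rounded values agree with the distances 2.1878, 5.6921, and 6.2626 given above.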
From the experimental results, the proposed algorithm can determine both the number of clusters (k) and the initial centroid values for the K-Means algorithm. Table 6 shows the number of clusters (k) calculated for the ten test datasets. Table 4 compares the number of clusters (k) of the actual labels with the number of clusters (k) obtained through the proposed method. The number of clusters obtained using the proposed method matched the actual number of cluster labels. The proposed algorithm therefore determines the number of clusters for K-Means in advance, together with the initial centroid values. Experiments in R were carried out two to three times on the K-Means algorithm with random centroid values. The results showed that the centroid values changed in each experiment, but the error rate stayed the same.
The proposed method for clustering aims to determine the initial centroid values, which are then used by the K-Means method. A comparison between the proposed method and K-Means with random initial centroids is shown in Table 7. The tests show that the errors of the proposed method are lower than those of simple K-Means. The difference in error value between them over the ten datasets is 13.1%. In the comparison using ten datasets (Table 7), the proposed method achieved lower errors on five datasets (Iris, PC2, PC4, MW1, KC3). In the experiments, the initial centroid values produced by the proposed method lead to fixed cluster centers, and the error rate is lower than that of simple K-Means, whose random centroid values mean that the resulting cluster centers are not fixed.
The proposed algorithm had a better Rand Index value on NASA MDP datasets such as PC2, PC4, MW1, and KC3, while the other NASA datasets were at a level comparable to simple K-Means (Table 8). The Rand index lies between 0 and 1; a value close to 1 indicates near-perfect agreement, so a higher Rand index is better. Of the ten test datasets used to evaluate the proposed method, 9 were NASA MDP datasets, in which software modules are clustered as defective or non-defective. The proposed algorithm captured cluster errors of software (Table 9).

Table 9. Number of software defect modules used in testing the proposed method.

Table 9 presents the number of software modules and the number of defective module labels in the NASA MDP datasets. How well the Point Center K-Means clustering algorithm (PCKM) captures cluster errors in software can be seen from the error level obtained.
For the 9 NASA MDP datasets (PC1, PC2, PC3, PC4, MW1, CM1, KC1, KC3, and MC2), the errors total 115.38, which gives an average of approximately 12.82% cluster errors of software defect modules captured by PCKM.
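The averaging step can be written out explicitly. This assumes that 115.38 is the sum of the nine per-dataset PCKM error percentages, written here as $e_1, \dots, e_9$ (symbols introduced only for this illustration).

```latex
\bar{e}_{\mathrm{PCKM}} = \frac{1}{9} \sum_{d=1}^{9} e_d = \frac{115.38}{9} = 12.82\%
```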

Conclusion
This paper proposed an algorithm, called the point center algorithm, for calculating the initial centroid values of the K-Means algorithm. Besides calculating the initial centroid value, the point center