An improved K-nearest neighbour with grasshopper optimization algorithm for imputation of missing data

Data is an essential asset in any discipline, since efficient analysis of data supports better decisions. Data is available in every aspect of life and provides different insights. In data mining, the first step is data collection, where a researcher must confront problems that any data are prone to. In practice, collected data that are noisy, incomplete, inconsistent, or redundant are the major source of poor data quality. Moreover, more than 40% of the datasets in the UCI Machine Learning Repository, which is used extensively for empirical analysis, contain missing values [1]. Missing data can significantly influence the validity of results, leading to biased parameter estimates, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings [2]. Missing data is commonly described as a significant issue in most scientific research domains and may originate from mishandled samples, low signal-to-noise ratio, measurement error, nonresponse, or deleted aberrant values [3]. There are many possible reasons a dataset contains missing data, especially in surveys: respondents may not respond because of stress, fatigue, or lack of knowledge, and some questions are sensitive or lack suitable answer options [4].

statistical analyses. Parameter estimation, in contrast, uses maximum likelihood techniques to estimate the parameter values that are most likely to have produced the observed data. This method does not impute any data; rather, it uses every available case to compute maximum likelihood estimates [8]. Although the parameter estimation approach is generally superior to case deletion, these two methods still suffer from high complexity, high sensitivity to outliers, and substantial loss of information. The third category, imputation, replaces the missing values with plausible estimates close to the actual values to make the data complete [9]. The objective is to use known relationships identified in the valid values of the dataset to help estimate the missing values. Imputation preserves all cases by replacing each missing value with an estimate based on other available information. Imputation theory is constantly developing and has attracted attention from both the statistics and machine learning communities. A well-known statistical technique for handling missing values is mean imputation. Mean imputation (sometimes called mean substitution) replaces the missing values of a variable with the mean computed from all cases that have data for that variable [10] [11]. This technique can lead to bias and to underestimated standard errors. Beyond statistical techniques, many machine learning algorithms have been proposed and evaluated for handling missing data. Machine learning has gained increasing attention as a general approach to missing data imputation.
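To make the mean imputation step concrete, the following is a minimal Python sketch, not the implementation used in this paper. The function name mean_impute and the toy column ages are hypothetical, and missing entries are assumed to be encoded as None or NaN. The sketch also illustrates why the technique underestimates spread: every filled entry sits exactly at the mean.

```python
import math
from statistics import mean, pstdev


def mean_impute(column):
    """Replace missing entries (None or NaN) with the mean of the observed entries."""
    def is_missing(v):
        return v is None or math.isnan(v)

    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in column]


# Hypothetical toy column with two missing entries.
ages = [20.0, None, 30.0, 40.0, float("nan")]
imputed = mean_impute(ages)

# Spread before (observed values only) and after imputation.
spread_observed = pstdev([20.0, 30.0, 40.0])
spread_imputed = pstdev(imputed)
```

On this toy column both missing entries are filled with 30.0, and the standard deviation of the completed column is smaller than that of the observed values alone, which is the underestimation mentioned above.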
The imputation strategy based on K-nearest neighbors (KNN) has been extensively applied to the ubiquitous problem of incomplete data. KNN is a straightforward, robust, efficient, and powerful algorithm that matches a point with its closest neighbors and works for all data types, such as continuous, discrete, ordinal, and categorical. KNN imputation is known as a lazy, instance-based estimation method [12] [13]. The main benefits of KNN imputation are the ability to predict both qualitative and quantitative attributes, to easily treat instances with multiple missing values, and to consider the correlation structure of the data [6]. The success of the KNN imputation algorithm relies on a good choice of the value k, the number of nearest neighbors. However, one of the well-known drawbacks of this approach is its inability to deal with high-dimensional and sparse data, which motivates the objective of this paper [14] [15]. To overcome this limitation, we propose to optimize KNN imputation with one of the optimization algorithms, the Grasshopper Optimization Algorithm (GOA). GOA is a recent population-based metaheuristic that has shown improved results and efficiency in tackling issues with missing data [16]. The performance of the proposed algorithm is compared with other optimization algorithms (Particle Swarm Optimization, Genetic Algorithm, Dragonfly Optimization) in terms of imputation accuracy.
The accuracy of the state-of-the-art KNN imputation algorithm is not necessarily sufficient, and a more versatile KNN variant with better accuracy is still needed. Therefore, this paper proposes a KNN-based approach with an additional optimization algorithm developed to improve the overall performance.

K-nearest neighbors (KNN) Algorithm
K-nearest neighbors (KNN) is widely recognized as one of the most powerful learning algorithms and is used in a wide range of real-world applications. The efficacy and performance of the KNN algorithm depend mainly on the distance or similarity measure and on an appropriate value for the parameter k [17]-[19].
KNN is the most straightforward algorithm for imputing missing values [20]. This algorithm has been used to solve many predictive problems. To impute a value of a variable, KNN defines a set of nearest neighbors for a sample and substitutes the missing value with the average of the nonmissing values of its neighbors [21]-[23]. KNN has both merits and demerits for imputation. Despite its strengths, KNN suffers from high time complexity and from sensitivity to the choice of k and of the distance function.
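The neighbor-averaging step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: missing values are encoded as None, the distance is Euclidean over the features that both rows have observed (averaged over the shared count), and the names knn_impute, data, and the choice k=2 are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill each None with the mean of that feature over the k nearest donor rows."""
    def distance(a, b):
        # Root mean squared difference over features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common features, so the rows cannot be compared
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]  # copy so the input stays untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor must have feature j observed
                d = distance(row, other)
                if math.isfinite(d):
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                completed[i][j] = sum(nearest) / len(nearest)
    return completed


# Hypothetical toy data: the second row is missing its third feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
filled = knn_impute(data, k=2)
```

Here the two closest donors are the first and third rows, so the missing entry is filled with the average of 3.0 and 2.8. The distant fourth row is ignored, and a larger k would pull it in, which is one way the choice of k changes the estimate. The nested loops also show the full scan over the dataset behind the time complexity noted above.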
Many articles [1], [24]-[26] have presented novel methods based on KNN to impute missing data. Most of the experimental work found that KNN consistently imputes datasets efficiently and more accurately than other state-of-the-art algorithms. In addition, combining the KNN approach with ensemble approaches produced the highest robustness and accuracy [27]. Batista and Monard [8] analyzed one advantage of KNN, namely that it works independently of the missing data treatment, which makes the algorithm suitable for imputation in a wide range of circumstances.

Grasshopper Optimization Algorithm (GOA)
The Grasshopper Optimization Algorithm (GOA) is a recent swarm intelligence algorithm developed by Mirjalili et al. [28] and Luo et al. [29] that mimics the behavior of grasshopper swarms in nature. The grasshopper is an insect that is considered a pest because it damages crop production and agriculture. Grasshoppers are usually seen individually, yet they form some of the largest swarms of all creatures [30] [31]. A swarm of grasshoppers is a nightmare for farmers, as its size can be of continental scale [32] [33]. The grasshopper's life cycle passes through three main stages: egg, nymph, and adult (Fig. 1). A unique quality of grasshoppers is that swarming behavior is found in both the nymph and adult stages [34]. Nymph grasshoppers do not have wings; thus, they move slowly and eat all vegetation in their path [35]. After a period of time, the grasshopper becomes an adult with wings, forms a swarm in the air, and moves quickly across a large region [30]. The inspiration for GOA comes from the way grasshoppers attack crops in swarms. Grasshoppers are herbivores that feed on grasses, leaves, and stems of plants, and when a swarm infests farms or gardens it can cause extensive plant damage and loss. Their movement is shaped by gravity and wind, and these factors help them reach the crops they target [36] [37]. A grasshopper readily enters a 'gregarious' state when an increase in the chemical serotonin (which boosts mood in humans) in certain parts of its nervous system initiates swarming behavior. Moreover, as claimed by Melina [38], a solitary grasshopper can be made gregarious within 2 hours simply by tickling its hind legs to simulate the jostling it experiences in the wild. The Grasshopper Optimization Algorithm is illustrated in Fig. 2.
According to the US Department of Agriculture (USDA), swarms of grasshoppers are punctual as well as structured. They swarm to migrate in search of food strictly between 10 am and 6 pm, when skies are clear and the temperature has risen to at least 75 degrees Fahrenheit (24 degrees Celsius). Moreover, the swarm is reported to be highly structured: a single flying grasshopper follows its own random path, but when approached by a dense group of flying grasshoppers it joins the formation and flies in an organized way as a member of the swarm [39].
In this study, GOA benefits KNN imputation through its ability to avoid local optima and find the global optimum in the given search space. In addition, GOA balances exploration and exploitation to drive grasshoppers towards the global optimum. A fundamental property of GOA that may improve KNN imputation lies in the way GOA finds its optimum solution. KNN estimates a value from its nearest neighbors, while GOA tends to avoid settling on a solution within a single neighborhood and instead searches among all possible solutions. Furthermore, one of the limitations of KNN imputation is that the algorithm searches through the whole dataset to find the most similar instances, which takes a great deal of time. GOA can help KNN imputation in terms of time complexity, because a main characteristic of adult grasshoppers is long-range, abrupt movement.

Fig. 2. Grasshopper Optimization Algorithm
Three forces influence the position of each grasshopper in the swarm: the social interaction between the grasshopper and the other grasshoppers, $S_i$; the gravity force on the grasshopper, $G_i$; and the wind advection, $A_i$. The mathematical model of the three forces and the simulated grasshopper behavior is presented as follows:
$$X_i = S_i + G_i + A_i \qquad (1)$$
Note that, to provide random behavior, the equation can be written as $X_i = r_1 S_i + r_2 G_i + r_3 A_i$, where $r_1$, $r_2$, and $r_3$ are random numbers in [0,1]. The social interaction is calculated as follows:
$$S_i = \sum_{j=1,\, j \neq i}^{N} s(d_{ij})\, \hat{d}_{ij} \qquad (2)$$
where $N$ is the number of grasshoppers, $d_{ij}$ is the distance between the i-th and the j-th grasshopper, calculated as $d_{ij} = |x_j - x_i|$, $s$ is the function that defines the strength of social forces in equation 3, and $\hat{d}_{ij} = (x_j - x_i) / d_{ij}$ is a unit vector from the i-th grasshopper to the j-th grasshopper.
The s function, which defines the strength of the two social forces between grasshoppers, attraction and repulsion, is calculated as follows:
$$s(r) = f e^{-r/l} - e^{-r} \qquad (3)$$
where $f$ is the intensity of the attraction and $l$ is the attractive length scale. Social behavior is affected by changing the parameters $f$ and $l$.
The second force affecting the position of a grasshopper is the gravity force, which is calculated as follows:
$$G_i = -g\, \hat{e}_g \qquad (4)$$
where $g$ is the gravitational constant and $\hat{e}_g$ is a unit vector towards the center of the Earth. The A component in equation 1 is calculated as follows:
$$A_i = u\, \hat{e}_w \qquad (5)$$
where $u$ is a constant drift and $\hat{e}_w$ is a unit vector in the wind direction. Nymph grasshopper movement is highly correlated with wind direction because nymphs have no wings.
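The three forces can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the authors' code: it uses the standard GOA form of the social function, s(r) = f·exp(-r/l) - exp(-r), and the defaults f = 0.5 and l = 1.5 are values commonly used in the GOA literature, not values stated in this text. The function names and the toy positions are hypothetical.

```python
import math


def s(r, f=0.5, l=1.5):
    """Social force strength: attraction term minus repulsion term."""
    return f * math.exp(-r / l) - math.exp(-r)


def social_force(i, positions, f=0.5, l=1.5):
    """S_i: sum over j != i of s(d_ij) times the unit vector from grasshopper i to j."""
    xi = positions[i]
    total = [0.0] * len(xi)
    for j, xj in enumerate(positions):
        if j == i:
            continue
        diff = [b - a for a, b in zip(xi, xj)]
        d = math.sqrt(sum(c * c for c in diff))
        if d == 0.0:
            continue  # coincident grasshoppers: direction undefined, skip the pair
        strength = s(d, f, l)
        total = [t + strength * c / d for t, c in zip(total, diff)]
    return total


def position(i, positions, g=0.0, e_g=None, u=0.0, e_w=None):
    """X_i = S_i + G_i + A_i, with G_i = -g * e_g and A_i = u * e_w."""
    dim = len(positions[i])
    e_g = e_g or [0.0] * dim  # unit vector towards the center of the Earth
    e_w = e_w or [0.0] * dim  # unit vector in the wind direction
    S = social_force(i, positions, )
    return [S[k] - g * e_g[k] + u * e_w[k] for k in range(dim)]


# Distance at which attraction and repulsion cancel for the assumed f and l.
comfort_distance = -math.log(0.5) / (1.0 - 1.0 / 1.5)

# Hypothetical 2-D example: one neighbor 3 units away along the first axis.
x0 = position(0, [[0.0, 0.0], [3.0, 0.0]],
              g=1.0, e_g=[0.0, -1.0], u=0.5, e_w=[1.0, 0.0])
```

With these assumed parameters, s is negative (repulsion) for distances below roughly 2.08 and positive (attraction) above it, so a neighbor 1 unit away pushes a grasshopper back while a neighbor 3 units away pulls it forward. Changing f and l shifts this comfort distance, which is how the two parameters shape social behavior.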
The main process first imputes the dataset with KNN by calculating the distances from each instance with missing data to its nearest neighbors. The imputed values are then optimized with GOA according to the information in the incomplete dataset.

Experiments Design
In this section, the nine datasets used for this research are described. Data were acquired from public access websites such as data.world, the UC Irvine Machine Learning Repository, Kaggle.com, and Public Library of Science (PLOS One). The description of the selected datasets is shown in the table below, including the domain, sources, number of instances, number of attributes, data types, and percentage of missing values. The data used in this paper are from the medical, engineering, and transformation domain. These three domains are reported to be among those that benefit most from missing data research. The quality of imputation was evaluated by comparing the imputed values against the original values. The experiments assess the accuracy, time complexity, and sensitivity of each imputation method. Performance is evaluated, and the error between values measured, using Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). These measures are negatively oriented, which implies that lower values are better. The three criteria give a meaningful representation of the error between two numeric vectors. As an alternative to the corresponding significance tests, the Vargha-Delaney A test is used. The A test helps to assess the difference between two populations with respect to a variable. In testing, each swarm optimization algorithm is compared against the KNN-imputed values to determine whether its results are larger or smaller [40].
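The evaluation measures named above have short closed forms, sketched here in plain Python. The function names and the toy vectors are hypothetical and are not results from the paper. The Vargha-Delaney A measure is written in its usual pairwise form, A = P(X > Y) + 0.5 · P(X = Y), which is assumed to be the variant used.

```python
import math


def mae(actual, imputed):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, imputed)) / len(actual)


def mse(actual, imputed):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, imputed)) / len(actual)


def rmse(actual, imputed):
    """Root Mean Squared Error."""
    return math.sqrt(mse(actual, imputed))


def vargha_delaney_a(xs, ys):
    """Probability that a draw from xs exceeds a draw from ys, counting ties as half."""
    score = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return score / (len(xs) * len(ys))


# Toy vectors for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
imputed = [2.0, 5.0, 6.0, 6.0]
errors = (mae(actual, imputed), mse(actual, imputed), rmse(actual, imputed))

# A = 0.5 means no difference; values near 0 or 1 mean one group is
# almost always smaller or larger than the other.
a_same = vargha_delaney_a([1, 2, 3], [1, 2, 3])
a_shifted = vargha_delaney_a([1, 2, 3], [2, 3, 4])
```

For the toy vectors the absolute errors are 0, 1, 0, and 2, so MAE is 0.75 and MSE is 1.25. RMSE is the square root of MSE, which brings the error back to the units of the data. When the A measure is applied to error values, an A below 0.5 for the first group means its errors tend to be smaller.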

Results and Discussion
The following table shows the analysis of the performance of four machine learning algorithms: Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Bayesian Network. Imputation with machine learning algorithms performs better than statistical tools because machine learning is more flexible and has better predictive accuracy. Nonetheless, standard machine learning imputation algorithms can still produce ambiguous analysis results [41].

Statistical Correlation
To visualize the actual and predicted values of all datasets, a scatterplot is chosen because it illustrates the relationship between two variables. In a scatterplot, the points can reveal a clear trend in the data. The scatterplot figures in this subsection visualize the differences between the actual and the imputed values for the datasets described in the previous section.
In a good scatterplot, the data points lie close to a straight line running from the origin out to high y-values. The best fit for this description is a strong, positive, linear association between the two variables. Fig. 3 to Fig. 11 illustrate the scatterplot correlations for all nine datasets, which are KD1 (Fig. 3), KD2 (Fig. 4), HCC survival (Fig. 5), AKI (Fig. 6), EHP phthalates (Fig. 7), ECG (Fig. 8), Blood test (Fig. 9), Automobile (Fig. 10), and Air Quality (Fig. 11).
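A numeric companion to reading a scatterplot is the Pearson correlation coefficient between actual and imputed values. The sketch below is an assumption added for illustration: the text reads the association from the plots and does not state that this coefficient was computed, and the name pearson_r and the toy vectors are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: +1 for a perfect increasing line, -1 for a decreasing one."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (spread_x * spread_y)


# Toy vectors for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
close_fit = [1.1, 1.9, 3.2, 3.9, 5.1]  # points near the diagonal
loose_fit = [3.0, 1.0, 4.0, 2.0, 3.5]  # points scattered away from it

r_close = pearson_r(actual, close_fit)
r_loose = pearson_r(actual, loose_fit)
```

A value near 1 corresponds to the strong, positive, linear pattern described above, while values closer to 0 correspond to scattered points. Note that the coefficient measures linearity only, so imputed values that are all shifted by a constant would still score 1.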
The results in Fig. 3 demonstrate two things. First, there is a positive linear association between the two variables in all subfigures. Second, the association in Fig. 3(b) and (d) looks weaker than in Fig. 3(a) and (c). This result indicates that only the GOA algorithm yields a higher correlation between actual and imputed values when it optimizes the values imputed by the conventional KNN algorithm. The result in Fig. 4 shows that the pattern between the actual and imputed values is essentially identical for all metaheuristic algorithms.
The result in Fig. 6 demonstrates a moderate, positive, linear relationship between actual and imputed values for the GOA algorithm. Unlike GOA, the scatterplots for all other metaheuristic algorithms display scattered patterns that can be statistically regarded as showing no relationship. Fig. 7 shows a similar pattern for all eight metaheuristic algorithms: the data generally follow a line going uphill. This indicates a positive linear association between the two variables, actual and imputed values. Moreover, Fig. 7(b), (c), and (d) show a stronger relationship than Fig. 7(a), which means a higher correlation.
The data points in Fig. 8 are far from a straight line. However, Fig. 8(b) and (c) display a weak positive correlation, which indicates that the two variables tend to increase together, but the relationship is not very strong. Among all the subfigures in Fig. 9, GOA demonstrates a strong, positive, linear relationship between the two variables, whereas Fig. 9(b) and (c) illustrate positive, moderate correlations.
Fig. 10 clearly illustrates that only Fig. 10(b) shows a different trend, which is weakly correlated. In Fig. 10(a), (c), and (d), the graphs show a strong, positive, linear relationship between the two variables. Fig. 11 shows a similar trend for all eight algorithms. Fig. 11(c) and (d) show an identical strong, positive, linear relationship, as evidenced by the much cleaner line formed by the data points.
To conclude, an interpretation that relies on scatterplots alone is subjective. Therefore, extensive experiments are carried out to support the discussion in the following sections.
International Journal of Advances in Intelligent Informatics ISSN 2442-6571 Vol. 7, No. 3, November 2021, pp. 304-317 Abidin and Ismail (An improved K-nearest neighbour with grasshopper optimization algorithm for …)

Error Accuracy
In general, the most promising finding for error accuracy is the optimization of KNN with the Grasshopper Optimization Algorithm (GOA). KNNGOA showed the lowest error for all nine datasets, regardless of dataset size and missing value rate, except for the ECG heartbeat dataset. Table 4 reports all four relative error parameters. KNNGOA consistently performed well, yielding lower error than the KNN imputation algorithm. According to Table 4, the ECG heartbeat and Air Quality datasets display an unusual result for MAPE. For all algorithms, KNN and the eight metaheuristic-based KNN algorithms, the results reflect that the MAPE function returns -Inf, Inf, or NaN when an actual value is zero and is unstable when actual values are near zero.
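The unusual MAPE values follow directly from its definition, which divides each error by the actual value. The sketch below is illustrative and is not the function used in the experiments: the name mape and the toy inputs are hypothetical. Plain Python raises an exception on division by zero, so the zero case is handled explicitly to mirror the Inf and NaN values that IEEE 754 floating-point arithmetic produces in numerical environments.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent, with IEEE-style results for zero actuals."""
    terms = []
    for a, p in zip(actual, predicted):
        if a == 0:
            # |a - p| / 0 is inf for a nonzero error and nan for 0 / 0.
            terms.append(math.nan if p == 0 else math.inf)
        else:
            terms.append(abs((a - p) / a))
    return 100.0 * sum(terms) / len(terms)


# Toy inputs for illustration only.
well_scaled = mape([100.0, 200.0], [110.0, 180.0])  # both errors are 10% of the actual
zero_actual = mape([0.0, 200.0], [1.0, 180.0])      # one actual value is exactly zero
zero_both = mape([0.0], [0.0])                      # 0 / 0 case
near_zero = mape([0.001], [0.101])                  # absolute error of only 0.1
```

In the near-zero case an absolute error of 0.1 becomes a percentage error of about 10,000%, which is why a signal that crosses or approaches zero can dominate the MAPE column even when MAE and RMSE remain small.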

Computation Time
Another measure of an algorithm's efficiency is its computation time. The tradeoff between error accuracy and time complexity is assessed by comparing the results. Table 3 shows that GOA has the fastest time on 4 out of 9 datasets. The time tradeoff refers to accepting slower execution in exchange for the lowest error. Although GOA achieved the fastest computation time on only four datasets, it still achieved the highest accuracy.

Conclusion
In this paper, we present a novel method that improves imputation performance based on K-nearest neighbors by using the Grasshopper Optimization Algorithm (GOA). The hybrid model KNNGOA is applied to optimize the imputation algorithm and address missing value problems. This is essential because any analysis can draw inaccurate inferences when values are missing. Experiments are conducted to evaluate the imputation accuracy of the proposed KNNGOA on nine real-world datasets from public websites. According to three different evaluation criteria, error accuracy, statistical test, and computation time, the proposed KNNGOA consistently outperforms the other algorithms. Currently, the proposed solution is time-consuming because the training procedure for GOA is repeated many times to find the optimal solution and attribute weights for big datasets. Therefore, some modifications are needed as a tradeoff to reduce the computational time. In future work, we will attempt to tailor the model for big datasets by speeding up the training time of KNN with methods that reduce the size of datasets.