A particle swarm optimization levy flight algorithm for imputation of missing creatinine dataset

a Department of Computer Science, Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, Kuala Lumpur, Malaysia b Department of Anaesthesiology, Kulliyyah of Medicine, International Islamic University Malaysia, Kuantan Pahang, Malaysia 1 amelia@iium.edu.my; 2 naa@iium.edu.my; 3 azrinar@iium.edu.my; 4 nadzurah.abidin@gmail.com; 5 samarsalim7076@gmail.com * corresponding author


Introduction
Acute Kidney Injury (AKI) is an abrupt loss of kidney function that develops within a few hours or a few days. Kidney Disease: Improving Global Outcomes (KDIGO) defines AKI as a syndrome of diverse etiology that is characterized by a rapid decline in the glomerular filtration rate (GFR) [1], [2]. AKI is common in hospitalized patients and carries high mortality because severe injury is associated with poor outcomes [3]. AKI has also been recognized as a global public health problem, with over 50 percent of AKI mortality occurring in intensive care unit (ICU) settings [4]. Besides its association with higher mortality, a high incidence of AKI may also be attributed to sepsis, reported in about 60 percent of patients in Malaysia. However, the burden of AKI in hospitalized patients is vastly underestimated, especially in developing countries. Even where data generation was detailed and reliable, the underestimation was obvious, let alone in the diagnosis rate by disease code. According to a study entitled "Forecasting the Incidence and Prevalence of Patients with End-Stage Renal Disease in Malaysia up to the year 2040", Malaysia has the seventh-highest dialysis treatment rate globally.
Acute kidney failure usually happens when the kidneys lose the ability to eliminate excess salts, fluids, and waste materials from the blood. This elimination process is the kidney's primary function. AKI can lead to stroke, heart attack, and other serious diseases. The Kidney Disease: Improving Global Outcomes (KDIGO) AKI guideline defines this disorder by an abrupt decrease in kidney function [1]; it may develop within a few hours or a few days. Patients with AKI need special attention and care, especially regarding records such as creatinine and urine values, because AKI stages are defined by the maximum serum creatinine or urine output [5]. Missing creatinine and urine values are a common obstacle to assessing AKI [6]. This obstacle forces the use of surrogate estimates, leads to poor estimation of kidney function, misclassifies AKI, and adversely affects the study of associated outcomes [7].
Machine learning and optimization algorithms are successful approaches employed in the recent decade to treat missing baseline creatinine [8]. These approaches allow missing data to be estimated for statistical analysis. Therefore, this research proposes an improved machine learning approach using particle swarm optimization enhanced with Levy flight.

Method
Creatinine and urine values are frequently missing in AKI studies. Therefore, this paper aims to identify the best machine learning algorithm for imputing missing creatinine and urine values.
The methodology proceeds in phases (Fig. 1). The first step exposes new issues and challenges; it is instructive to consider a variety of problems when applying supervised learning methods. This phase also identifies different techniques for developing the rules and classification so as to concentrate on the information needed, such as creatinine and urine values. The estimation process is applied to real data stored at the International Islamic University of Malaysia Hospital (HUIAM). The second step identifies the data that need to be analyzed. The Bayesian approach relies on data collection and then calculates the probability that the data are significantly related to the extracted information. During this phase, the dataset is extracted and identified, and its information and structure are turned into a result. The second and third steps cover the implementation of the processes and the decision-making that generate the ultimate results. The next phase covers the identification of relevant values and information, substituting missing values with valid estimates. This phase also defines the appropriate approach for imputing missing values in the AKI dataset. The performance of each approach is compared, and the results are presented.
The last phase resolves the information into a more understandable model with quantifiable values, in order to choose the best methodology. The data extracted in earlier stages are compiled into the final result.

Data Collection
We collected data on demographic characteristics, past medical history, laboratory results, severity of illness, and care processes from IIUM Medical Centre (IIUMMC). The data were collected under ethics reference number NMRR-13-1631-18970 from Kementerian Kesihatan Malaysia (the Ministry of Health Malaysia). We also retrieved serum creatinine (SCr) levels for each patient for up to one year before hospital admission from the dataset. Outcomes included AKI diagnosis and staging, mortality at hospital discharge, and renal recovery at least three months after hospital discharge.

Study Designs and Participants
We performed a retrospective study of critically ill adult patients admitted to our tertiary care academic center between January 1 and December 31, 2012. In this study, we assessed the performance of four surrogate methods: 1) the first SCr level at hospital admission; 2) the minimal SCr level within two weeks after intensive care unit (ICU) admission; 3) SCr computed from the MDRD formula for an eGFR of 75 ml/min per 1.73 m²; and 4) SCr computed from the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula for an eGFR of 75 ml/min per 1.73 m². We fitted a multilinear regression model to identify the patient characteristics that best predict preadmission SCr. We then applied imputation strategies using the SCr values calculated from the multilinear regression models to assess AKI diagnosis.
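As an illustration of surrogate method 3, the sketch below back-calculates a baseline SCr by inverting the MDRD equation at an eGFR of 75 ml/min per 1.73 m². It assumes the widely published four-variable, IDMS-traceable form of the MDRD study equation (coefficient 175, SCr in mg/dL); the paper does not state which variant was used, and the function names are hypothetical.

```python
def mdrd_egfr(scr_mg_dl, age_years, female=False, black=False):
    """eGFR in ml/min per 1.73 m^2 from the 4-variable MDRD equation.

    Assumption: IDMS-traceable form with coefficient 175.
    """
    egfr = 175.0 * (scr_mg_dl ** -1.154) * (age_years ** -0.203)
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr


def mdrd_baseline_scr(age_years, female=False, black=False, target_egfr=75.0):
    """Solve the MDRD equation for SCr (mg/dL) at an assumed target eGFR."""
    factor = 175.0 * (age_years ** -0.203)
    if female:
        factor *= 0.742
    if black:
        factor *= 1.212
    # target = factor * scr^-1.154  =>  scr = (target / factor)^(-1 / 1.154)
    return (target_egfr / factor) ** (-1.0 / 1.154)


if __name__ == "__main__":
    scr = mdrd_baseline_scr(age_years=60)
    print(f"Surrogate baseline SCr for a 60-year-old man: {scr:.2f} mg/dL")
```

Feeding the result back into `mdrd_egfr` returns the target eGFR, which is a quick consistency check on the inversion.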
We included randomly selected critically ill patients aged 18 or more and excluded readmissions, patients on chronic dialysis, those having a kidney transplant, or those who stayed in the ICU less than 24 h. We followed the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) guidelines for observational studies.

Evaluation Methods
The quality of imputation is evaluated by comparing the imputed values against the original values. The evaluation of the optimized KNN algorithm with GOA and other optimization algorithms involves three performance measures: error, running time, and a statistical significance test.
Performance is evaluated, and the error between values is measured, using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). These measures are negatively oriented, which means that lower values are better. Each gives a meaningful representation of the error between two numeric vectors. MAE measures the average magnitude of the errors in a set of predictions, where the absolute differences between prediction and actual observation are calculated. All the individual differences are weighted equally in the average, as in (1):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (1)$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.
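MSE is an estimator that measures the average squared difference between the $n$ predicted values $\hat{y}_i$ and the actual values $y_i$, as shown in (2):
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (2)$$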
RMSE is a quadratic scoring rule that measures the average magnitude of the error as the square root of the average squared difference between the actual observations $y_i$ and the predictions $\hat{y}_i$, as shown in (3):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (3)$$
Apart from accuracy, the efficiency of an algorithm is also assessed by its running time, which is the amount of time the algorithm takes to execute.
A statistical significance test is used to assess the performance of the KNN imputation algorithm optimized with GOA against other optimization algorithms. The purpose of statistical significance testing is to gather evidence of the extent to which the results returned by an evaluation metric represent the general behavior of the proposed algorithm. The Vargha-Delaney A test is a non-parametric test used to evaluate the performance of the optimized KNN imputation algorithm with GOA [9]. The actual and predicted values are compared to determine whether there is a significant difference between the two.
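The four error measures and the running-time measurement described above can be sketched as follows. This is a minimal illustration using the standard definitions, not the authors' evaluation code; the function names and the sample values are made up for the example, and MAPE as written assumes that no actual value is zero.

```python
import math
import time


def mae(actual, predicted):
    """Mean Absolute Error: average of |y_i - yhat_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of (y_i - yhat_i)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def timed(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    actual = [1.0, 2.0, 4.0]      # illustrative values only
    imputed = [1.5, 2.0, 3.0]
    print(mae(actual, imputed), mse(actual, imputed),
          rmse(actual, imputed), mape(actual, imputed))
```

All four measures are zero for a perfect imputation and grow as the imputed values move away from the originals, which is what "negatively oriented" means in practice.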

Machine Learning Imputation Algorithm
Imputation is a common way to deal with missing values, in which substitutes for the missing values are found through statistical or machine learning approaches. Although the statistical approach has been used for decades, machine learning-based data imputation techniques are becoming popular for handling missing values, especially in large data sets [10].
Many machine learning-based imputation methods have been introduced to resolve the missing data problem [11]. These methods work by using machine learning techniques to find rules from the input data to estimate the possible value of the missing data. These methods have several advantages [12].
The machine learning approach has produced a wide range of algorithms to aid data analysis. Recent studies on imputation indicate that four popular machine learning classifiers are K-nearest neighbors (KNN), Decision Tree, Naïve Bayes, and Support Vector Machine (SVM), as shown in Fig. 2. Hence, the focus of this work is on machine learning methods that have been proposed for data imputation.
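To make the idea of machine learning-based imputation concrete, the sketch below fills a missing value with the mean of that value among the k most similar complete records, which is the basic KNN imputation scheme. It is a simplified illustration, not the optimized method evaluated in this paper: it assumes that only one column has missing entries (marked `None`), that the other features are complete, and that unscaled Euclidean distance is acceptable. The records are made-up numbers.

```python
import math


def knn_impute(rows, target_idx, k=3):
    """Fill None entries in column target_idx with the mean of that column
    over the k nearest complete rows (Euclidean distance on other columns)."""
    complete = [r for r in rows if r[target_idx] is not None]

    def distance(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2
                             for j in range(len(a)) if j != target_idx))

    filled_rows = []
    for row in rows:
        filled = list(row)
        if row[target_idx] is None:
            nearest = sorted(complete, key=lambda c: distance(row, c))[:k]
            filled[target_idx] = sum(c[target_idx] for c in nearest) / len(nearest)
        filled_rows.append(filled)
    return filled_rows


if __name__ == "__main__":
    # Hypothetical records: [age, urine output, creatinine]
    records = [
        [30, 1.0, 0.8],
        [32, 1.1, 0.9],
        [70, 0.4, 2.0],
        [72, 0.3, 2.2],
        [31, 1.0, None],   # creatinine missing
    ]
    print(knn_impute(records, target_idx=2, k=2)[-1])
```

In a setting like the one studied here, an optimizer such as PSO would typically tune choices like `k` or feature weights against an error measure; that coupling is not shown.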

Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) is a simple method introduced by Kennedy and Eberhart in 1995, inspired by communication within a swarm of birds. PSO is considered one of the most efficient algorithms for solving optimization problems [13]-[15]. Because of its simplicity and robust performance, PSO attracts researchers and engineers [16]. PSO has been widely applied to real-world optimization problems, including feature selection [17], control systems in an Internet of Things (IoT) environment [18], tracking 3D objects in RGB-D images [19], path planning for mobile robots [20], face recognition [21], training recurrent neural networks [22], network security [23], gene selection [24], digital image watermarking [25], the design of digital proportional-integral-derivative (PID) controllers, and various science and engineering problems [16].
PSO depends on the movement and intelligence of the swarm [26], [27]. The swarm consists of a number of particles that tend to move toward a better solution [28]. The particles in the search space represent the solutions. PSO relies on two quantities that belong to every particle: position and velocity. The velocity of each particle is updated using equation (4) [29].
$$V_i(t+1) = w\,V_i(t) + c_1 r_1 \left(Pbest_i - X_i(t)\right) + c_2 r_2 \left(Gbest - X_i(t)\right) \qquad (4)$$
where $i$ is the particle index; $t$ is the iteration number; $V_i(t)$ is the current velocity of the particle; $w$ is the inertia weight; $X_i(t)$ is the current position of the particle; $Pbest_i$ represents the best previous position of particle $i$; $Gbest$ represents the best position among all particles; $r_1$ and $r_2$ are random numbers with values in (0,1); and $c_1$ and $c_2$ are positive numbers, called acceleration coefficients, that guide the particle toward the particle best and swarm best positions. PSO uses equation (5) to update the position of the particles [29].
$$X_i(t+1) = X_i(t) + V_i(t+1) \qquad (5)$$
where $X_i(t)$ is the previous position of the particle and $V_i(t+1)$ is the particle's updated velocity. A number of studies have improved PSO. One of these is Levy PSO [30], which shows high performance compared with PSO. Algorithm 1 presents the pseudocode of PSO (Fig. 3).

Algorithm 1: Pseudocode of PSO
    Initialize the parameters NP, w, C1, C2, and the maximum number of iterations
    Initialize particle velocity and position
    Evaluate the fitness value
    If (the current fitness value < the particle best value)
        Assign the current value to Pbest
    End if
    Set the particle with the best fitness value to Gbest
    Iteration = 1
    While the stopping criterion is not reached Do
        For i = 1 to NP
            Update particle velocity according to equation (4)
            Update particle position according to equation (5)
            Evaluate the fitness value and update Pbest and Gbest
        End for
        Iteration = Iteration + 1
    End while
    Output the best solution
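The PSO loop described above can be sketched in a few lines of Python. This is a minimal illustration for minimization, not the implementation used in the experiments: the parameter defaults (`w=0.7`, `c1=c2=1.5`), the clamping of positions to the search bounds, and the sphere test function are assumptions chosen for the example.

```python
import random


def pso(fitness, dim, bounds, n_particles=20, w=0.7, c1=1.5, c2=1.5,
        max_iter=200, seed=0):
    """Minimize `fitness` with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_val = [fitness(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(max_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update, equation (4)
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                # Position update, equation (5), clamped to the bounds (assumption)
                X[i][d] = min(hi, max(lo, X[i][d] + V[i][d]))
            val = fitness(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = X[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    sphere = lambda x: sum(z * z for z in x)
    best, best_val = pso(sphere, dim=2, bounds=(-5.0, 5.0))
    print(best, best_val)
```

For an imputation task, `fitness` would be replaced by an error measure such as RMSE computed on held-out known values.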

Particle Swarm Optimization Levy Flight (PSO-LF)
Particle Swarm Optimization (PSO) is classified as a meta-heuristic that mimics the swarm intelligence of schools of fish and flocks of birds. PSO has two important control parameters: inertia and acceleration [31]. The inertia coefficient, in particular, governs the convergence property. Various control methods for the inertia coefficient have been proposed to improve the search performance.
The inertia coefficient determines the speed of convergence. As the inertia coefficient increases, convergence slows, and if it exceeds a certain threshold the system does not converge; in this sense the inertia coefficient controls a phase transition and is critical to the dynamics of PSO. A large inertia coefficient is associated with slow convergence, which keeps the system searching for the optimum solution and helps to improve the search performance. However, if the inertia coefficient is less than 1, the system cannot escape a local minimum, whereas if it is larger than 1, the system can diverge. This divergence property provides the ability to escape from a local minimum, and if it is tamed, the ability to find solutions can improve. Therefore, we propose a novel PSO that applies Levy flight to the inertia coefficient in order to control the divergence property.
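The role of the inertia coefficient can be illustrated with a deliberately simplified model: a single particle in one dimension, with the personal and global best both fixed at the origin and the random attraction term `c1*r1 + c2*r2` replaced by a constant `phi`. These simplifications are assumptions made for the sketch and are not part of the proposed algorithm; they turn the update rules into a linear recurrence whose growth rate per step is governed by w.

```python
import math


def final_amplitude(w, phi=1.5, steps=200, x0=1.0, v0=0.0):
    """Iterate the simplified PSO recurrence and return sqrt(x^2 + v^2).

    v <- w*v - phi*x   (velocity update with Pbest = Gbest = 0)
    x <- x + v         (position update)
    """
    x, v = x0, v0
    for _ in range(steps):
        v = w * v - phi * x
        x = x + v
    return math.hypot(x, v)


if __name__ == "__main__":
    for w in (0.7, 1.1):
        print(f"w = {w}: amplitude after 200 steps = {final_amplitude(w):.3e}")
```

With `w = 0.7` the particle spirals into the attractor, while with `w = 1.1` the oscillation grows without bound, which is the divergence that a Levy-driven inertia coefficient is meant to exploit occasionally rather than permanently.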
A Levy flight is a random walk whose step sizes are drawn from a heavy-tailed Levy distribution [31], [32]. Fig. 4 depicts two-dimensional Levy flight and random walk examples. As shown in Fig. 4, a Levy flight occasionally takes a very large step.

Fig. 4. Example of two-dimensional motion
The proposed Levy flight method addresses premature convergence and enables PSO to produce more efficient results (Fig. 5). With this approach, PSO, which on its own cannot perform global search well, searches globally more efficiently and avoids becoming stuck in local minima [33], [34].

Algorithm 2: The proposed PSO-LF algorithm
    Initialize the parameters NP, D, C1, C2, max_iter, Vmin, and Vmax
    Initialize particle velocity and position
    Trial (keeps the limit value for each particle) = 0
    Evaluate the fitness value
    Set X to be Pbest
    Set the particle with the best fitness to be Gbest
    While iter <= max_iter do
        For i = 1 : NP
            If trial(i) < limit (the current particle has not exceeded the limit value)
                Update the velocity Vi of the particle using equation (4)
                Update the position of the particle using equation (5)
            Else (the current particle has exceeded the limit value)
                Trial(i) = ...

The Levy flight method is used to update the velocity in order to increase the performance of the PSO algorithm. As in state-of-the-art PSO, particles are first distributed randomly in the search space, the fitness values of all particles are assessed, and each particle's Pbest and the swarm's Gbest are obtained. The velocity and position of each particle are then updated based on a random probability. If the random value is greater than or equal to 0.5, the particle's velocity and position are updated by equations (4) and (5), just as in traditional PSO. If the random value is less than 0.5, the particle's velocity is updated using Levy flight, and this velocity becomes the particle's position. By using the Levy flight technique to update its velocity, the particle takes a long hop towards its Pbest and Gbest, which increases the diversity of the swarm and allows the algorithm to explore the search space globally.
Occasionally, the PSO-LF inertia coefficient, w, reaches a high value. As a result, a PSO-LF particle can escape a local minimum and continue to seek the best solution in the global domain [33]. PSO-LF thus combines the capacity to search locally and globally. However, there is a chance that the resulting move will be excessively long. In Algorithm 2, NP is the number of particles, D is the dimension of the benchmark function, C1 and C2 are the acceleration coefficients, max_iter is the maximum number of iterations, and Vmin and Vmax are the minimum and maximum limits on the velocity.
The proposed PSO-LF method (Fig. 5) makes two changes to the traditional PSO method. First, a limit value is assigned to each particle, and the particle's trial counter is increased by 1 in each iteration in which the particle fails to improve its own solution. Second, particles that exceed the limit value are redistributed in the search space using the Levy flight method. Employing the randomness of Levy flight while updating the velocity prevents the loss of diversity. Introducing the benefits of a random walk into PSO thus improves the efficiency of the algorithm, because the particles' positions in each iteration reflect increased exploration and exploitation of the search space.
In the Levy flight technique, the β parameter has a significant impact on the distribution; using a different value of β alters the random distribution [30]. The distribution is frequently expressed as equation (6), where β is an index in the range (0, 2] [35].
$$L(s) \sim |s|^{-1-\beta}, \quad 0 < \beta \le 2 \qquad (6)$$
For the random walk, the step length $s$ is derived by Mantegna's algorithm as
$$s = \frac{u}{|v|^{1/\beta}} \qquad (7)$$
where β is referred to as the Levy index, and $u$ and $v$ are drawn from normal distributions as follows:
$$u \sim N(0, \sigma_u^2), \quad v \sim N(0, \sigma_v^2) \qquad (8)$$
$$\sigma_u = \left\{ \frac{\Gamma(1+\beta)\,\sin(\pi\beta/2)}{\Gamma\left[(1+\beta)/2\right]\,\beta\,2^{(\beta-1)/2}} \right\}^{1/\beta}, \quad \sigma_v = 1 \qquad (9)$$
where Γ is the standard Gamma function. Then, the step size is calculated by
$$\text{step size} = 0.01 \times s \qquad (10)$$
Here, the step size is the length of the move in the search space, and the scaling factor (0.01 in equation (10)) is set relative to the scale of the problem of interest. Without this scaling, Levy flight may behave very aggressively and generate new solutions outside the design space [36].
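Mantegna's algorithm for drawing a Levy-distributed step can be sketched as below, using only the Python standard library. The function names are hypothetical, σ_v is taken as 1 (the usual choice in Mantegna's method), and the redraw of `v` when it is exactly zero is a defensive assumption added to avoid division by zero.

```python
import math
import random


def levy_sigma_u(beta):
    """Standard deviation of u in Mantegna's algorithm (with sigma_v = 1)."""
    numerator = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    denominator = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (numerator / denominator) ** (1 / beta)


def levy_step(beta, rng):
    """Draw one step length s = u / |v|^(1/beta)."""
    u = rng.gauss(0.0, levy_sigma_u(beta))
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:          # guard against division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


if __name__ == "__main__":
    rng = random.Random(42)
    steps = [levy_step(1.5, rng) for _ in range(10000)]
    scaled = [0.01 * s for s in steps]   # scaled step size
    print("largest raw step:", max(abs(s) for s in steps))
    print("largest scaled step:", max(abs(s) for s in scaled))
```

Most draws are small, but the heavy tail occasionally produces a step many times larger than the typical one, which is why the 0.01 scaling is applied before the step is used to move a particle.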
A nontrivial approach for producing samples of the step size S is explained in depth in [32], and it can be summarized as follows:
$$S = \text{random}(\text{size}(D)) \oplus \text{Levy}(\beta) \sim 0.01\,\frac{u}{|v|^{1/\beta}}\left(X_j^{t} - Gbest^{t}\right) \qquad (11)$$
The D-dimensional value S obtained from equation (11) is added to the position $X_i$ of the particle, which determines the position of the new particle. The fitness value of this new particle is then assessed. If the particle achieves a better result than its Pbest, the Pbest value is updated and the trial value of this particle is set to zero; otherwise, the trial value is increased by one. The algorithm is repeated until the stopping criterion is met.
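Putting the pieces together, a PSO-LF loop with per-particle trial counters might look like the sketch below. It is one reading of the description above, not the authors' code: resetting the trial counter to zero after a Levy move, clamping positions to the bounds, the parameter defaults, and the sphere test function are all assumptions made for illustration.

```python
import math
import random


def levy_step(beta, rng):
    """Mantegna's algorithm: s = u / |v|^(1/beta), with sigma_v = 1."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
               / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u, v = rng.gauss(0.0, sigma_u), rng.gauss(0.0, 1.0)
    while v == 0.0:
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def pso_lf(fitness, dim, bounds, n_particles=20, w=0.7, c1=1.5, c2=1.5,
           max_iter=300, limit=5, beta=1.5, seed=0):
    """Minimize `fitness` with PSO plus Levy-flight redistribution."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_val = [fitness(x) for x in X]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    trial = [0] * n_particles

    for _ in range(max_iter):
        for i in range(n_particles):
            if trial[i] < limit:
                for d in range(dim):          # equations (4) and (5)
                    r1, r2 = rng.random(), rng.random()
                    V[i][d] = (w * V[i][d]
                               + c1 * r1 * (pbest[i][d] - X[i][d])
                               + c2 * r2 * (gbest[d] - X[i][d]))
                    X[i][d] = min(hi, max(lo, X[i][d] + V[i][d]))
            else:
                for d in range(dim):          # Levy move, equation (11)
                    s = 0.01 * levy_step(beta, rng) * (X[i][d] - gbest[d])
                    X[i][d] = min(hi, max(lo, X[i][d] + rng.random() * s))
                trial[i] = 0                  # assumption: reset after the move
            val = fitness(X[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = X[i][:], val
                trial[i] = 0
                if val < gbest_val:
                    gbest, gbest_val = X[i][:], val
            else:
                trial[i] += 1
    return gbest, gbest_val


if __name__ == "__main__":
    print(pso_lf(lambda x: sum(z * z for z in x), dim=2, bounds=(-5.0, 5.0)))
```

The `limit` parameter controls how patient the swarm is: a small value triggers Levy moves often and favors exploration, while a large value makes the method behave like plain PSO.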

Results and Discussion
The proposed algorithm is evaluated in terms of error and running time on the creatinine dataset. The comparison covers three optimization algorithms (Genetic Algorithm, Particle Swarm Optimization, and Particle Swarm Optimization Levy Flight) combined with four well-established machine learning imputation algorithms (K-nearest neighbors, Decision Tree, Naïve Bayes, and Support Vector Machine).
Overall, the most promising finding is the optimization of the machine learning imputation algorithms with Particle Swarm Optimization Levy Flight (PSOLF). Table 1 shows that the four machine learning algorithms optimized with PSOLF consistently perform well on all three error measures, giving lower error than the traditional GA and PSO algorithms. Among the four optimized machine learning imputation algorithms, SVMPSOLF shows the lowest error on all three error measures. The efficiency of an algorithm is also assessed by measuring its running time. Table 2 shows that all machine learning algorithms optimized with PSOLF have the fastest running times, and that SVMPSOLF is the fastest of the four for imputing missing baseline creatinine. However, error and running time alone are insufficient to fully evaluate the difference in performance between the algorithms. The result must also be checked to verify that the differences in performance are statistically significant and not merely coincidental. To analyze the performance of each optimization algorithm in depth, a statistical significance test compares the actual values with the imputed values. In the majority of statistical analyses, an alpha of 0.05 is used as the cutoff for significance. Vargha and Delaney suggested thresholds for interpreting the effect size, where 0.5 means no difference at all; up to 0.56 indicates a small difference; up to 0.64 indicates a medium difference; and anything over 0.71 is large.
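The Vargha-Delaney A statistic and the thresholds quoted above can be sketched as follows. The statistic itself follows its standard definition (the probability that a value from the first sample exceeds one from the second, counting ties as one half). The labelling function is an assumption: the text leaves the band between 0.64 and 0.71 unnamed, so this sketch assigns it to "medium", and it treats values below 0.5 symmetrically.

```python
def vargha_delaney_a(x, y):
    """A = P(X > Y) + 0.5 * P(X = Y), estimated over all pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def effect_magnitude(a_stat):
    """Label the effect size using the thresholds quoted in the text."""
    d = 0.5 + abs(a_stat - 0.5)   # fold values below 0.5 onto the upper side
    if d == 0.5:
        return "none"
    if d <= 0.56:
        return "small"
    if d <= 0.71:                 # assumption: 0.64 to 0.71 counted as medium
        return "medium"
    return "large"


if __name__ == "__main__":
    actual = [0.8, 0.9, 1.1, 2.0, 2.2]     # illustrative values only
    imputed = [0.85, 0.9, 1.0, 2.1, 2.2]
    a_stat = vargha_delaney_a(actual, imputed)
    print(a_stat, effect_magnitude(a_stat))
```

An A value close to 0.5 means that neither sample tends to produce larger values than the other, which is the outcome one hopes for when comparing imputed values with the actual ones.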
The results in Table 3 show that machine learning imputation optimized with PSOLF consistently gives the smallest difference between actual and imputed values. This result indicates that the differences are statistically negligible. In the statistical significance test, SVMPSOLF is the most acceptable, with the value closest to 0.5. To assess the consistency of each optimized machine learning imputation method, a hypothesis is formulated for the comparison of performance and efficiency. For a dataset with missing values that achieves the lowest error and fastest time, the hypothesis is that the difference between the actual and imputed values must be zero.
The hypothesis is H10: There is no statistical difference between actual and imputed values for optimized machine learning with the PSOLF algorithm. As shown in Table 4, PSOLF has the p-value closest to 0.5, consistent with its error and running time. Therefore, H10 is accepted. All machine learning algorithms optimized with PSOLF have almost the same p-value, with the lowest error and fastest running time compared with traditional GA and PSO optimization. Hence, PSOLF is accepted for all four machine learning imputation algorithms. From all results, it can be concluded that PSOLF performs well for optimizing all machine learning imputation algorithms.

Conclusion
Missing baseline creatinine values can be a root cause of poor estimation of kidney function and of biased estimates that misclassify AKI. In this regard, four popular machine learning imputation methods (K-nearest neighbors, Decision Tree, Naïve Bayes, and Support Vector Machine) were each combined with an optimization approach and analyzed. This paper demonstrates the application of the PSO algorithm with machine learning to treat missing baseline creatinine values. The results show that the PSOLF imputation algorithm is reliable, as PSOLF consistently outperformed the alternatives in error and running time, which was verified with a statistical significance test. The results also show that SVMPSOLF is superior to the other proposed algorithms, with the lowest error and fastest running time. We recommend applying SVMPSOLF in other medical applications to determine whether it can treat missing values there. Further comparison of SVMPSOLF with other optimization algorithms is also recommended to judge its performance.