Bootstrap-based model selection in subset polynomial regression

The subset polynomial regression model is a polynomial regression in which some regression coefficients are zero. The advantage of this model is that the user can select a regression model from all possible subsets of the polynomial regression model. The model has been studied by several researchers. Jekabsons and Lavendels [1] compared the formation of polynomial regression models using the subset selection approach and the adaptive basis function construction approach; in the subset selection approach, the least squares method is used to approximate the solution. Overall, the adaptive basis function construction approach was found to be superior. O'Neill et al. [2] used a subset polynomial neural network to predict breast cancer, with better results than mammography. Xie et al. [3] used polynomial regression in medical image segmentation. Suparman [4] proposed a subset polynomial regression model in which the error has an exponential distribution, using the reversible jump Markov Chain Monte Carlo (MCMC) method to estimate the model parameters. The subset polynomial regression model usually assumes that the error has a normal or exponential distribution. In practice, however, the error distribution is often unknown.


Introduction
The Bootstrap method developed by Efron and Tibshirani [5] is widely used in statistics and can be very useful in the context of regression [6]. A principle of the Bootstrap method is to obtain a good estimate from minimal resources. In statistical inference, minimal resources can mean small data sets, data that deviate from certain assumptions, or data whose distribution is unknown. Warton [6] used a bootstrap algorithm to estimate the parameters of a regression model, with applications in ecology. Garcia-Soidan et al. [7] used the Bootstrap method for spatial data, basing its implementation on an estimator of the multivariate distribution function. Yazici et al. [8] used the Bootstrap method to obtain the empirical distribution of the parameters in the nonparametric regression of Conic Multivariate Adaptive Regression Splines (CMARS); the results showed that the bootstrap method provides accurate parameter estimates. Beda et al. [9] used the Bootstrap method to calculate confidence limits for spectral indices of heart-rate variability (HRV), modeling the spectral indices with an autoregressive model. Hall and Maiti [10] used the Bootstrap method to construct a mean error estimator and to calculate prediction regions; the technique can be applied to non-normal models. Colugnati et al. [11] used the Bootstrap method to obtain interval estimates for percentiles in the diagnosis of obesity and overweight in children and adolescents. Kant et al. [12] used a bootstrap-based neural network model for flood estimates and showed that it is a stable model. Ren et al. [13] used the Bootstrap method to determine confidence intervals for multihop distances; the Bootstrap method can eliminate the risks posed by small sample sizes and unknown distributions. Kleiner et al. [14] used the Bootstrap for massive data. Jacek et al. [15] used the Bootstrap approach to estimate the uncertainty of surface response models. Chen et al. [16] used a bootstrap analysis to measure individual and regional differences in relative concentrations of gamma-aminobutyric acid in the human brain. Dongping [17] used the Bootstrap method to determine prediction points and prediction intervals, reducing the risk of misleading maintenance decisions for prognostic devices. Liang et al. [18] used the Bootstrap Metropolis-Hastings algorithm for model selection and optimization. Mei et al. [19] used a residual-based bootstrap test to detect constant coefficients in the geographically weighted regression model. Mikshowsky et al. [20] used bootstrap aggregation sampling to improve the reliability of genomic predictions for Jersey sires. Olaniran et al. [21] used Bootstrap techniques to improve Bayesian feature selection and classification. Zhen [22] used Bootstrap resampling to detect the number of wideband signals. Boubaka et al. [23] used the Bootstrap method for parameter identification with dependent data.

This paper aims to estimate the parameters of the subset polynomial regression model using the Bootstrap method.

Method
The method used to estimate the parameters of the subset polynomial regression model is as follows:

The Least Squares Estimate
Suppose that (y_t, x_t) are pairs of the dependent and independent variables and z_t is the error, for t = 1, 2, ..., n, where n is the number of observations. Let kmax be the maximum order. The subset polynomial regression model of order k (k = 0, 1, ..., kmax) can be written as

y_t = β_0 + β_{n_1} x_t^{n_1} + β_{n_2} x_t^{n_2} + ... + β_{n_k} x_t^{n_k} + z_t.   (1)

Here {n_1, n_2, ..., n_k} is a subset of {1, 2, ..., k} and β = (β_0, β_{n_1}, ..., β_{n_k})′ is the coefficient vector. The errors z_t (t = 1, 2, ..., n) are identically distributed with mean 0 and variance σ², but their distribution is unknown. Based on the data (y_t, x_t) for t = 1, 2, ..., n, the parameters β and σ² and the subset polynomial regression model are estimated.
Equation (1) is shorthand for the following n simultaneous equations:

y_1 = β_0 + β_{n_1} x_1^{n_1} + ... + β_{n_k} x_1^{n_k} + z_1
...
y_n = β_0 + β_{n_1} x_n^{n_1} + ... + β_{n_k} x_n^{n_k} + z_n.   (2)

In matrix form, equation (2) can be written as

y = Xβ + z,   (3)

where y = (y_1, ..., y_n)′, z = (z_1, ..., z_n)′, and X is the n × (k + 1) design matrix whose t-th row is (1, x_t^{n_1}, ..., x_t^{n_k}). To obtain the least squares estimate of β, first write the sample subset polynomial regression

y_t = β̂_0 + β̂_{n_1} x_t^{n_1} + ... + β̂_{n_k} x_t^{n_k} + e_t,   t = 1, 2, ..., n,   (4)

which can be written briefly in matrix notation as

y = Xβ̂ + e,   (5)

where β̂ is the column vector of least squares estimators of the subset polynomial regression coefficients and e is the column vector of the n residuals. According to the least squares method, the estimator is obtained by minimizing the residual sum of squares

e′e = (y − Xβ̂)′(y − Xβ̂).   (6)
This is achieved by partially differentiating (6) with respect to β̂_0, β̂_{n_1}, ..., β̂_{n_k} and setting each derivative equal to zero. This process produces k + 1 simultaneous equations (the normal equations) in k + 1 unknowns.   (7)

In matrix form, equation (7) can be presented as

X′X β̂ = X′y.   (8)

If the inverse of X′X exists, say (X′X)⁻¹, then multiplying both sides of (8) by this inverse gives

β̂ = (X′X)⁻¹ X′y.   (9)

This is the least squares estimator of β = (β_0, β_{n_1}, ..., β_{n_k})′, and the corresponding estimator of σ² is the residual mean square

σ̂² = e′e / (n − k − 1).   (10)
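The least squares computation above can be sketched in NumPy as follows. The function names are ours, not the paper's, and the divisor n − k − 1 for σ̂² is the usual unbiased choice, since the paper's exact formula is not reproduced:

```python
import numpy as np

def design_matrix(x, subset):
    """n x (k+1) matrix X: a column of ones plus one column x**n_j
    for each power n_j in the chosen subset {n_1, ..., n_k}."""
    return np.column_stack([np.ones_like(x)] + [x**n for n in subset])

def fit_subset_poly(x, y, subset):
    """Least squares fit of a subset polynomial regression.

    Returns (beta_hat, sigma2_hat, e), where beta_hat solves the
    normal equations X'X beta = X'y and sigma2_hat = e'e / (n - k - 1)."""
    X = design_matrix(x, subset)
    n, p = X.shape                                  # p = k + 1 parameters
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # solve the normal equations
    e = y - X @ beta_hat                            # residual vector
    sigma2_hat = (e @ e) / (n - p)                  # residual mean square
    return beta_hat, sigma2_hat, e
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse of X′X, which is numerically preferable to computing (X′X)⁻¹ directly.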

Statistical Criteria
The C_k statistical criterion [5] is used to select the best subset polynomial regression model: the subset polynomial regression model with the smallest C_k value is chosen. In its standard (Mallows) form, the C_k value is calculated as

C_k = SSE_k / σ̂²_{kmax} − n + 2(k + 1),

where SSE_k is the residual sum of squares of the candidate model and σ̂²_{kmax} is the error variance estimate from the full model of order kmax.
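The C_k-based selection can be sketched as follows, assuming the standard Mallows form of the statistic with the error variance estimated from the full model of order kmax; the paper's exact C_k equation may differ, so treat this as an illustrative assumption:

```python
import numpy as np

def sse_and_p(x, y, subset):
    """Residual sum of squares and parameter count for one candidate
    subset polynomial model (column of ones plus x**n for each power)."""
    X = np.column_stack([np.ones_like(x)] + [x**n for n in subset])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e, X.shape[1]

def mallows_ck(x, y, subset, kmax):
    """Mallows-type C_k: SSE_k / sigma2_full - n + 2*(k + 1), with the
    error variance estimated from the full model of order kmax."""
    sse_full, p_full = sse_and_p(x, y, list(range(1, kmax + 1)))
    sigma2_full = sse_full / (len(y) - p_full)
    sse_k, p_k = sse_and_p(x, y, subset)
    return sse_k / sigma2_full - len(y) + 2 * p_k
```

The candidate with the smallest returned value would be selected as the best subset polynomial regression model.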

Bootstrap Method
The Bootstrap method developed in [5] is a data-based simulation method that can be applied to statistical inference problems. A basic principle of bootstrapping is resampling, i.e., drawing artificial observations from the existing sample z_1, z_2, ..., z_n; members of the original sample may appear once, appear twice, appear more than twice, or not appear at all in a given bootstrap sample. The computational steps to determine the 100(1 − α)% confidence interval for the prediction ŷ_{n+1} are as follows:

1) Calculate β̂ and σ̂² from the original data.
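A minimal sketch of a residual-bootstrap percentile interval, consistent with step 1 above. The steps after the first (resampling residuals, rebuilding responses, refitting, and taking empirical quantiles) follow the standard residual bootstrap and are an assumption on our part; the function name and defaults are ours:

```python
import numpy as np

def bootstrap_prediction_interval(x, y, subset, x_new, B=2000, alpha=0.05,
                                  rng=None):
    """Percentile bootstrap interval for the fitted value at x_new.

    Residual bootstrap: fit by least squares (step 1 in the text),
    resample residuals with replacement, rebuild bootstrap responses
    y*, refit, and take empirical quantiles of the predictions."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.column_stack([np.ones_like(x)] + [x**n for n in subset])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                         # residuals from the original fit
    x_row = np.array([1.0] + [x_new**n for n in subset])
    preds = np.empty(B)
    for b in range(B):
        e_star = rng.choice(e, size=len(e), replace=True)  # resampled errors
        y_star = X @ beta + e_star                         # bootstrap responses
        beta_star = np.linalg.lstsq(X, y_star, rcond=None)[0]
        preds[b] = x_row @ beta_star
    return tuple(np.quantile(preds, [alpha / 2, 1 - alpha / 2]))
```

With B = 2000 and α = 0.05 this mirrors the settings used in the Results section.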

Results and Discussion
As an illustration, we apply the Bootstrap algorithm to determine prediction intervals for simulated data (simulation study) and real data (case study). The simulation study was undertaken to confirm that the bootstrap algorithm works properly; the case study gives an example of applying the method to an everyday problem. Here resampling is done B = 2000 times with α = 0.05.

Simulated Data
Fig. 1 shows a graph of 1000 synthetic data points from the subset polynomial regression model of order 2. The values of x are fixed, and the values of y are generated using equation (1). The regression coefficients and the error variance are β_0 = 1, β_2 = 0.5, and σ² = 9.

Fig. 1. Simulated data
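The simulation setup can be reproduced along the following lines. The paper does not state how the x values or the errors were generated, so a uniform design and normal errors are assumed here purely for illustration:

```python
import numpy as np

# Synthetic data from the order-2 subset model y = 1 + 0.5*x**2 + z,
# with error variance sigma^2 = 9 (so sigma = 3), as in the text.
rng = np.random.default_rng(42)
n = 1000
x = rng.uniform(0.0, 20.0, n)   # assumed design; not stated in the paper
z = rng.normal(0.0, 3.0, n)     # assumed normal errors with variance 9
y = 1.0 + 0.5 * x**2 + z
```

Any zero-mean error distribution would do here; the bootstrap procedure makes no use of normality.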
The simulated data in Fig. 1 are fitted with the subset polynomial regression model, with kmax = 2. The Bootstrap algorithm is used to estimate the best subset polynomial regression model, the subset polynomial regression coefficients, and the variance σ². The model is selected by comparing the C_k statistic across the three candidate subset polynomial regression models; the values are shown in Table 1. From Table 1 it can be seen that the smallest C_k value is achieved by the second subset polynomial regression model, which is therefore the best subset polynomial regression model. Its parameters are then estimated using the least squares method, giving β̂_0 = 0.9323, β̂_2 = 0.5070, and σ̂² = 9.1756. Comparing the estimates with the true coefficient and variance values shows that the Bootstrap algorithm estimates the parameters well from synthetic data. The prediction for y_1000 at x = 16.4176 is 9.2569, and the corresponding 95% confidence interval is (9.0984, 9.4117).

Real Data
Table 2 shows the business tendency index (y) and the consumer tendency index (x) from the second quarter of 2000 to the fourth quarter of 2009. The data in Table 2 are fitted with the subset polynomial regression model, with kmax = 3. The bootstrap algorithm is used to obtain the subset polynomial regression model, the regression model parameters, and the variance σ². The model is selected by comparing the C_k statistic across the 7 candidate models. From Table 3 it can be seen that the smallest C_k value is achieved by the 4th subset polynomial regression model, which is therefore the best subset polynomial regression model.
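The 7 candidate models for kmax = 3 match the 2³ − 1 = 7 non-empty subsets of the powers {1, 2, 3} (and likewise the three candidates for kmax = 2 in the simulation study). A hypothetical helper for enumerating them, not given in the paper:

```python
from itertools import combinations

def candidate_subsets(kmax):
    """All non-empty subsets of the powers {1, ..., kmax}; each subset
    {n_1, ..., n_k} defines one candidate subset polynomial model."""
    powers = range(1, kmax + 1)
    return [list(c) for r in powers for c in combinations(powers, r)]
```

For kmax = 3 this yields the 7 candidates whose C_k values would be compared in Table 3.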

F̂ is the empirical distribution that places probability 1/n on each observed value z_1, z_2, ..., z_n. Let B be the number of resamplings. A Bootstrap sample is a random sample of size n drawn from F̂; the b-th Bootstrap sample (b = 1, 2, ..., B) is a sample of size n drawn with replacement from the population z_1, z_2, ..., z_n.

Table 2. The business tendency index (BTI) and the consumer tendency index (CTI)