Overdispersion study of Poisson and zero-inflated Poisson regression for some characteristics of the data on lambda, n, p


I. Introduction
The Poisson distribution is a discrete distribution that is often used to model rare events. The data take the form of counts, that is, non-negative integers. One form of analysis used to model count data is Poisson regression, which describes the relationship between the explanatory variables and a response variable that follows a Poisson distribution. Poisson regression assumes equidispersion, a condition in which the mean and the variance of the response variable are equal. The violations of this assumption that occur in Poisson regression are overdispersion and underdispersion. Overdispersion means that the variance is greater than the mean, while underdispersion means that the variance is smaller than the mean of the response variable. Underdispersion rarely occurs in applications of Poisson regression, because the response variable in real data rarely has a variance that low [1].
A problem often encountered in Poisson regression is overdispersion. It arises when explanatory variables that affect the response are not included in the model, so that the high variability of the response variable may be caused by these other variables. A frequent cause of overdispersion in Poisson regression is an excess probability of zeros in the response variable. One consequence is that the standard errors of the parameter estimates are underestimated and the significance of the explanatory variables is overstated, which leads to invalid conclusions [2].
Several models can be used to handle overdispersion caused by an excess probability of zeros in the response variable of a Poisson regression, such as hurdle Poisson regression, zero-inflated Poisson (ZIP) regression, and semiparametric hurdle Poisson regression [3]. The model used in this paper is ZIP regression, because it is more convenient than the hurdle Poisson and semiparametric hurdle Poisson regression models. An advantage of ZIP regression is that it is easily applied in several fields [4], such as agriculture, animal husbandry, biostatistics, and industry.

In addition, the parameter estimates of the ZIP regression model are easy to interpret, and they can explain why the mean of the response variable is lower.
Previous research began with the development of the ZIP regression model as a way to handle overdispersion in Poisson regression models using simulation studies. Xie et al. [5] used the type II error in a simulation that combined the parameter of the Poisson distribution, the zero probability, and the sample size of the response variable. Numna [6] developed a Wald test for comparing Poisson and ZIP regression models. The Wald test was developed through simulations in which the zero probability of the response variable was determined from the value of the Poisson parameter.
Building on this previous research, this study examines overdispersion under several characteristics of the data. The overdispersion examined here is simulated by combining values of the Poisson parameter, the zero probability, and the sample size of the response variable. The Poisson and ZIP regression models are then compared based on exploration of the response variable and evaluation of the model predictions. Simulating over these data characteristics is expected to help determine the cause of overdispersion in the response variable.

II. Poisson Regression
Hardin and Hilbe [7] stated that the Poisson regression model provides a standard framework for the analysis of count data. Poisson regression is a form of generalized linear model. Let yi, i=1,2,…,n, represent the count of rare events in a period, with Poisson parameter λi. Then yi is a Poisson random variable with probability mass function

$$P(Y_i = y_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots \tag{1}$$

and the assumption of Poisson regression is

$$E(y_i) = \mathrm{Var}(y_i) = \lambda_i. \tag{2}$$

If Poisson regression is used under overdispersion, the results are not exact, because the mean and variance then contain a dispersion term,

$$E(y_i) = \lambda_i, \qquad \mathrm{Var}(y_i) = \tau\lambda_i, \tag{3}$$

where τ is the dispersion ratio. When there is overdispersion in Poisson regression, τ is greater than one and constant. Dispersion measures the spread of a group of data around their mean. Small dispersion values indicate homogeneous data, while large dispersion values indicate heterogeneous data.
The coefficients of the Poisson regression are estimated by the maximum likelihood method. Suppose X is the matrix of explanatory variables, of size n x (p+1), where p here counts the explanatory variables. The random variable yi and xi, the i-th row vector of X, are linked through the log link function,

$$\ln(\lambda_i) = x_i\beta, \tag{4}$$

so that

$$\lambda_i = \exp(x_i\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}). \tag{5}$$

The model in (5) is a Poisson regression model in which β is the vector of regression coefficients to be estimated.
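As an illustration of maximum likelihood estimation for the log-link model, the following minimal Python sketch fits a single-covariate Poisson regression by Newton-Raphson. The function name `fit_poisson`, the starting values, and the stopping rule are assumptions made for this example; they are not taken from the study, which used R.

```python
import math

def fit_poisson(xs, ys, iters=50, tol=1e-10):
    """Fit ln(lambda_i) = b0 + b1 * x_i by Newton-Raphson on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(xs, mu))
        h11 = sum(m * x * x for x, m in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information times score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1
```

The sketch needs a positive mean count and at least two distinct covariate values; in practice a standard generalized linear model routine would be used instead.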

III. Zero-Inflated Poisson Regression
Jansakul and Hinde [1] state that if the Yi are independent random variables that have a ZIP distribution, then the zeros are assumed to arise from two steps. The first step, which occurs with probability pi, produces only zero observations. The second step, which occurs with probability (1-pi), produces counts that follow a Poisson distribution with parameter λ. In general, zeros from the first step are called structural zeros, and zeros from the second step are called sampling zeros. The variable Yi is overdispersed if pi>0, and the ZIP model reduces to the Poisson model when pi=0. A value pi>0 means that the response variable contains more zeros than the Poisson distribution alone would produce.
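The two-step description corresponds to the standard ZIP probability mass function, with P(Y=0) = p + (1-p)e^{-λ} and P(Y=y) = (1-p)e^{-λ}λ^y/y! for y>0. The Python sketch below, with illustrative function names, sums this mass function numerically to show that the mean is (1-p)λ and the variance is (1-p)λ(1+pλ), so the variance exceeds the mean whenever p>0.

```python
import math

def zip_pmf(y, lam, p):
    """P(Y = y) for a ZIP variable with Poisson mean lam and structural-zero probability p."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return p + (1 - p) * poisson if y == 0 else (1 - p) * poisson

def zip_moments(lam, p, y_max=200):
    """Mean and variance by direct summation of the pmf up to y_max."""
    probs = [zip_pmf(y, lam, p) for y in range(y_max + 1)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return mean, var

# Example: lam = 6 and p = 0.3, two of the values used in the simulation design.
mean, var = zip_moments(6, 0.3)
print(mean, var, var / mean)
```

The truncation point `y_max=200` is an arbitrary choice that is more than sufficient for the values of λ considered in this study.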
The coefficients of the ZIP regression are estimated by the maximum likelihood method. The log-likelihood function for the observations y1,…,yn of the ZIP regression model is used to simplify the calculation of the coefficient estimates, since maximizing the log-likelihood gives the same result as maximizing the likelihood. The ZIP regression model is divided into two components, namely a discrete data model for λ and a zero-inflation model for p. If the same explanatory variables are used in the ln and logit models, then the ZIP regression model is

$$\ln(\lambda) = X\beta \quad \text{and} \quad \mathrm{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = X\gamma, \tag{8}$$

where X is the matrix of explanatory variables, and β and γ are the vectors of ZIP regression coefficients, of size (q+1)x1 and (r+1)x1 respectively, in equation (8). The explanatory variables used in the ln model may be the same as or different from those used in the logit model. The maximum likelihood estimates of β and γ are obtained with the expectation maximization (EM) algorithm, which provides a simple approach that can be implemented with standard software for fitting generalized linear models.
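To make the objective function concrete, the following Python sketch evaluates the ZIP log-likelihood for one explanatory variable, with a log link for λ and a logit link for p. The function name and the two-element coefficient tuples are assumptions for illustration; the sketch only evaluates the log-likelihood and does not implement the EM algorithm.

```python
import math

def zip_loglik(beta, gamma, xs, ys):
    """ZIP log-likelihood with ln(lambda_i) = beta[0] + beta[1]*x_i
    and logit(p_i) = gamma[0] + gamma[1]*x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        lam = math.exp(beta[0] + beta[1] * x)
        p = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * x)))
        if y == 0:
            # A zero is either structural or a sampling zero from the Poisson part.
            total += math.log(p + (1 - p) * math.exp(-lam))
        else:
            # A positive count can only come from the Poisson part.
            total += math.log(1 - p) - lam + y * math.log(lam) - math.lgamma(y + 1)
    return total
```

A numerical optimizer, or the EM algorithm mentioned above, would maximize this function over the coefficients of both component models.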

IV. Design of Simulation
The data used in this study are simulated. The simulated data were generated according to the characteristics of the data: the Poisson parameter λ = 0.6, 0.8, 1, 6, 8, 10, and 20; the zero probability p = 0.1, 0.3, 0.5, and 0.7; and the sample size n = 100, 300, and 500. The generated data are used to obtain the parameter estimates of the Poisson and ZIP regressions. The regression coefficients were set to β0=3 and β1=0.01. The variables used to build the Poisson and ZIP regression models are the explanatory variable (X) and the response variable (Y).
Variable X is a normal random variable with distribution N(μ,1) and is treated as a fixed variable. The variables X and Y are generated in the simulation study in the following stages:
1. Generate variable Y based on the chosen values of λ, n, and p.
2. Generate variable X in a first loop:
   (i) Separate variable Y into Y zero and Y not zero.
   (ii) Transform with the formula xi = (ln(yi) - β0)/β1, where yi is taken from Y not zero.
   (iii) Store the result of the transformation as X not zero.
3. Generate variable X for Y not zero and Y zero in a second loop:
   (i) If Y is zero, then xi is obtained by sampling with replacement from X not zero.
   (ii) If Y is not zero, then xi is generated from a Normal distribution with mean equal to the transformation result from 2(ii) and variance 1, with sample size n=1.
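The generation stages can be sketched in Python as follows. The text does not state how Y is drawn from λ, n, and p, so the sketch assumes a ZIP mechanism (a structural zero with probability p, otherwise a Poisson(λ) count); that mechanism, the function names, and the use of Knuth's method for Poisson draws are all assumptions for this example and may differ from the R code used in the study.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small lam values used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n, lam, p, beta0, beta1, rng):
    # Stage 1 (assumed mechanism): structural zero with probability p, else Poisson(lam).
    ys = [0 if rng.random() < p else poisson_draw(lam, rng) for _ in range(n)]
    # Stage 2: transform the non-zero counts, x = (ln(y) - beta0) / beta1.
    x_nonzero = [(math.log(y) - beta0) / beta1 for y in ys if y > 0]
    if not x_nonzero:
        raise ValueError("all counts are zero; nothing to resample from")
    # Stage 3: resample for zeros; draw N(transform, 1) for non-zero counts.
    xs = []
    for y in ys:
        if y == 0:
            xs.append(rng.choice(x_nonzero))
        else:
            xs.append(rng.gauss((math.log(y) - beta0) / beta1, 1.0))
    return xs, ys

# One replication for one of the 84 conditions (seed chosen arbitrarily).
xs, ys = simulate(n=100, lam=6, p=0.3, beta0=3, beta1=0.01, rng=random.Random(1))
```

Repeating the call 500 times for each combination of λ, n, and p would reproduce the structure of the design described in this section.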
The simulated data for the variables X and Y were generated with R version 2.15.2, with r=500 replications. There are 84 simulation conditions in this study. The accuracy of the parameter estimators in the Poisson and ZIP regression models is assessed with the relative absolute bias (RAB) and the relative root mean square error (RRMSE) [8]. The accuracy of the estimate of y in the Poisson and ZIP regression models is assessed with the Pearson residual (PR) and the sum of absolute Pearson residuals (SAPR) [9]. RAB, RRMSE, PR, and SAPR are defined in (9), (10), (11), and (12), respectively:

$$\mathrm{RAB} = \frac{1}{r}\sum_{i=1}^{r}\left|\frac{\hat{\beta}_i - \beta}{\beta}\right| \tag{9}$$

$$\mathrm{RRMSE} = \sqrt{\frac{1}{r}\sum_{i=1}^{r}\left(\frac{\hat{\beta}_i - \beta}{\beta}\right)^{2}} \tag{10}$$

$$\mathrm{PR}_{ij} = \frac{y_{ij} - \hat{y}_{ij}}{\sqrt{\mathrm{Var}(Y)}} \tag{11}$$

then

$$\mathrm{SAPR} = \sum_{j=1}^{n}\left|\mathrm{PR}_{ij}\right| \tag{12}$$

where r is the number of simulation replications, $\hat{\beta}_i$ is the parameter estimator of the Poisson or ZIP regression in the i-th replication, and β is the actual parameter. Further, $y_{ij}$ is the response variable in replication i and observation j, $\hat{y}_{ij}$ is the estimate of y from the Poisson or ZIP regression model in replication i and observation j, and Var(Y) is the estimated variance under the Poisson or ZIP regression. The smaller the values of RAB, RRMSE, and SAPR, the better the regression model.
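One common reading of these evaluation measures can be sketched in Python as follows. The exact formulas are assumptions based on the verbal definitions (bias and squared error taken relative to the true parameter and averaged over replications; Pearson residuals scaled by the square root of the model variance), and the function names are illustrative.

```python
import math

def rab(estimates, true_value):
    """Relative absolute bias, averaged over the replications."""
    return sum(abs((b - true_value) / true_value) for b in estimates) / len(estimates)

def rrmse(estimates, true_value):
    """Relative root mean square error over the replications."""
    return math.sqrt(sum(((b - true_value) / true_value) ** 2 for b in estimates) / len(estimates))

def sapr(ys, fitted, variances):
    """Sum of absolute Pearson residuals for one replication."""
    return sum(abs(y - f) / math.sqrt(v) for y, f, v in zip(ys, fitted, variances))
```

For a Poisson fit the variance of each observation equals its fitted mean, while for a ZIP fit it would be (1-p)λ(1+pλ); the caller passes whichever variance applies.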

V. Results
The simulation study consisted of 84 cases, each a data characteristic formed by a combination of λ, n, and p. The simulations were performed to evaluate the parameter estimates of the Poisson and ZIP regressions using the percentage RAB, the RRMSE, and the average SAPR. Each value was obtained from 500 replications of the simulation. The evaluation of the simulated data is supported by the results of exploring and testing the variable Y.

A. Exploration and testing of the variable Y
Exploration of the simulated data against the tested λ, n, and p indicates that the effect of raising p depends on λ. For λ=0.6 with a tested p of 0.3, the variable Y yields a zero proportion in the range 0.3 to 0.5. This is because a small λ still gives a relatively large zero probability under the Poisson distribution itself, as can be seen from the cumulative Poisson distribution table. The tested values λ = 0.6, 0.8, 1, 6, and 8 still have a non-zero probability of zero under the Poisson distribution in that table, whereas for the other tested values, λ = 10 and 20, the tabulated probability of zero is already zero.
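The statement about the Poisson table can be checked directly, since the Poisson probability of a zero is P(Y=0) = e^{-λ}. The short Python sketch below tabulates this probability to four decimals for the tested values of λ; the variable names are illustrative.

```python
import math

lambdas = [0.6, 0.8, 1, 6, 8, 10, 20]
# Probability that a Poisson(lam) count equals zero.
poisson_zero = {lam: math.exp(-lam) for lam in lambdas}

for lam in lambdas:
    print(f"lambda={lam:<4} P(Y=0)={poisson_zero[lam]:.4f}")
```

To four decimals the probability is about 0.5488 for λ=0.6 and 0.0003 for λ=8, and it rounds to 0.0000 for λ=10 and λ=20, which matches the distinction drawn in the text.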
Exploration of variable Y against the tested λ, n, and p indicates an excess probability of zeros, so tests on variable Y are needed. The tests, a score test and a chi-square test, make it possible to generalize the conclusions drawn from the exploration of variable Y. An excess probability of zeros in variable Y is associated with overdispersion. Flynn and Francis [10] state that when the score test indicates excess zeros in a variable, the variable probably does not follow a Poisson distribution but a ZIP distribution.
The score test results at α=0.05 for variable Y in the simulated data, for each combination of λ, n, and p, are shown in Table 1. The score test indicates that the larger the tested λ, n, and p, the larger the percentage of excess zeros in variable Y. The results of the chi-square test at α=0.05 for the Poisson and ZIP distributions, for each combination of λ, n, and p, are shown in Table 2. The chi-square test for the Poisson distribution shows that the larger the tested λ, n, and p, the smaller the percentage of cases in which variable Y follows a Poisson distribution. The score test and the chi-square test for the Poisson distribution thus move in opposite directions as the tested λ, n, and p increase. The chi-square test for the ZIP distribution indicates that ZIP regression is able to overcome overdispersion due to excess p in variable Y: as λ grows, the percentage following a Poisson distribution reaches 0%, while the percentage following a ZIP distribution lies in the range 60% to 80%.
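The text does not give the form of the score test. As an illustration only, the Python sketch below uses the score statistic for zero inflation without covariates that is usually attributed to van den Broek, which compares the observed number of zeros with the number expected under a Poisson distribution and is referred to a chi-square distribution with one degree of freedom. Whether this is the exact statistic used in the study is an assumption, and the function name is illustrative.

```python
import math

def zero_inflation_score(ys):
    """Score statistic for excess zeros against a Poisson model without covariates."""
    n = len(ys)
    mean = sum(ys) / n
    p0 = math.exp(-mean)                   # Poisson probability of a zero at the sample mean
    n0 = sum(1 for y in ys if y == 0)      # observed number of zeros
    return (n0 - n * p0) ** 2 / (n * p0 * (1 - p0) - n * mean * p0 ** 2)

# Half the sample is zero although the mean is 1, so zeros are in excess.
stat = zero_inflation_score([0] * 50 + [2] * 50)
print(stat, stat > 3.841)  # 3.841 is the 5% critical value of chi-square with 1 df
```

A statistic above the critical value would count as one rejection in the percentages of the kind reported in Table 1.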

B. Testing overdispersion in Poisson and ZIP regression
Exploring and testing the simulated variable Y is a step that must be carried out before performing Poisson regression analysis. When variable Y has data characteristics that lead to overdispersion due to excess zeros, ZIP regression is one of the alternatives to Poisson regression. Overdispersion for each tested combination of λ, n, and p in the Poisson and ZIP regressions can be assessed from the dispersion ratio (τ) and the Pearson chi-square test at the 5% significance level.
The ratio τ is the Pearson chi-square statistic divided by its degrees of freedom (n-k). The degrees of freedom of the Poisson and ZIP regressions differ, because Poisson regression uses k=2, for the parameter estimates β0 and β1, whereas ZIP regression uses k=4, based on the discrete model for λ and the zero-inflation model for p, for γ0 and γ1, then λ0 and λ1.
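The dispersion ratio can be sketched in Python as below: the Pearson chi-square statistic is the sum of squared residuals scaled by the model variance, and τ divides it by n-k. The function name and argument layout are assumptions for this example.

```python
def dispersion_ratio(ys, fitted, variances, k):
    """Pearson chi-square statistic divided by its degrees of freedom n - k."""
    chi2 = sum((y - f) ** 2 / v for y, f, v in zip(ys, fitted, variances))
    return chi2 / (len(ys) - k)

# Toy example with a constant fitted mean of 3 and Poisson variance equal to the mean.
tau = dispersion_ratio([0, 2, 4, 6], [3, 3, 3, 3], [3, 3, 3, 3], k=1)
print(tau)
```

For a Poisson fit the variances equal the fitted means with k=2; for a ZIP fit the variance of each observation would be (1-p)λ(1+pλ) with k=4.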
The average ratio τ obtained from 500 replications of each tested combination of λ, n, and p is given in Table 3. Overdispersion is indicated by a ratio τ greater than one. The ratio τ of the Poisson regression is compared with that of the ZIP regression. The ratios τ most at risk occur at p=0.7 in Poisson regression, and they grow with λ. For Poisson regression, the larger λ and p, the more the ratio τ exceeds one at every tested n, so Poisson regression suffers from overdispersion as the tested λ, p, and n increase. The ratio τ of the ZIP regression is less than one at every tested λ, p, and n, so ZIP regression is able to overcome the overdispersion caused by the excess zero probability in variable Y.
The percentage of overdispersion gives results similar to the ratio τ. A significant Pearson chi-square test indicates overdispersion. Table 4 shows that at every tested λ, p, and n the percentage of overdispersion is 0% for ZIP regression. These results indicate that ZIP regression is better at handling overdispersion caused by excess zero probability in variable Y. For Poisson regression, the larger λ and p, the greater the percentage of overdispersion at each tested n, reaching 100%. These results are consistent with the score test and the chi-square test, which indicate that the larger the tested λ, n, and p, the greater the zero probability that appears and the less variable Y follows a Poisson distribution.

C. Evaluation of the estimation in Poisson and ZIP regression models
The goodness of fit of the Poisson and ZIP regression models showed that the larger the tested λ, n, and p, the better ZIP regression performs relative to Poisson regression. The Poisson and ZIP regression models are further compared by evaluating the parameter estimates and the estimate of y. The average RAB of the estimate of β1 for each tested combination of λ, n, and p is given in Table 5. The average RAB of the ZIP regression in Table 5 becomes smaller as the tested λ, n, and p increase. The RAB of the Poisson regression reaches its minimum at λ = 6. The average RAB of the estimate of β1 indicates that Poisson regression is better than ZIP regression for λ = 0.6, 0.8, and 1 at each tested p and n, whereas for λ = 6, 8, 10, and 20 at each tested p and n ZIP regression is better than Poisson regression.
The RRMSE results for the estimate of β1 in Table 6 are similar to those for the RAB. The MSE contains two components, namely the variance of the estimator (precision) and the bias (accuracy) [11]. A good estimator in terms of MSE is one that controls both the variance and the bias. A large RRMSE indicates a large variance of the estimator, and hence a lower accuracy of the estimation results.
The average SAPR of the estimate of y is obtained from the Poisson and ZIP regression models. The estimate of y in the ZIP regression model uses two models, the discrete model for λ and the zero-inflation model for p. Table 7 shows that the larger the tested λ, n, and p, the larger the SAPR. ZIP regression has a smaller average SAPR than Poisson regression at every tested λ, n, and p.
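The two components of the MSE mentioned above follow from the identity MSE = variance + bias². The Python sketch below checks this identity on a small set of made-up estimates; the numbers and the function name are purely illustrative and are not results from the simulation.

```python
def mse_parts(estimates, true_value):
    """Return (MSE, variance, bias) of a list of estimates of true_value."""
    r = len(estimates)
    mean = sum(estimates) / r
    bias = mean - true_value
    variance = sum((b - mean) ** 2 for b in estimates) / r
    mse = sum((b - true_value) ** 2 for b in estimates) / r
    return mse, variance, bias

# Illustrative estimates of a parameter whose true value is 1.0.
mse, variance, bias = mse_parts([0.8, 1.0, 1.5], 1.0)
print(mse, variance + bias ** 2)
```

The two printed values agree up to floating-point rounding, which is why a large RRMSE can reflect either a large variance or a large bias of the estimator.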

Table 1. Percentage of score test on combination of λ, n, p (%)

Table 3. Dispersion ratio on Poisson and ZIP regression (* Overdispersion)

Table 4. Percentage of overdispersion on Poisson and ZIP regression

Lili Puspita Rahayu et al. (Overdispersion study of Poisson and zero-inflated Poisson regression for ...)

Table 5. Average of RAB against the estimate of β1 on Poisson and ZIP regression (* Average of RAB is smaller)

Table 6. Average of RRMSE against the estimate of β1 on Poisson and ZIP regression (* Average of RRMSE is smaller)

Table 7. Average of SAPR against the estimate of y on Poisson and ZIP regression (* Average of SAPR is smaller)