Queen Mary, University of London
MTH731U Computational Statistics
The Kolmogorov-Smirnov test for normality
Idlir Shkurti – 120192308
1.0 Introduction

Suppose we have n observations y_1, …, y_n coming from independent and identically distributed random variables Y_1, …, Y_n with a common cumulative distribution function (cdf) F. If we wish to test the hypothesis H_0: F = F_0 against H_1: F ≠ F_0, where F_0 is the cdf of a known continuous distribution, the Kolmogorov-Smirnov statistic is an appropriate and valid way of doing so. The Kolmogorov-Smirnov statistic D_n is defined as

D_n = sup_y |F̂_n(y) − F_0(y)|        (1)
where F̂_n(y) is the empirical cdf (ecdf) obtained from the observations y_1, …, y_n. In other words, the KS statistic is the maximum absolute distance between the graph of the ecdf and the cdf of the known distribution from which we are testing whether our data come. When testing for normality there is another way of defining the test statistic. Suppose we have a sample of n observations ordered such that x_1 ≤ x_2 ≤ ⋯ ≤ x_n. The ecdf of the sample is the step function which jumps from (k−1)/n to k/n at each x_k. If F_0 is the cdf of a normal distribution with mean µ and standard deviation σ, then the KS statistic is given by

D_n = max_{1 ≤ k ≤ n} { k/n − Φ((x_k − µ)/σ),  Φ((x_k − µ)/σ) − (k−1)/n }        (2)
where Φ is the cumulative distribution function of the standard normal distribution.

A more conventional way of testing a similar hypothesis is the chi-square test; however, the Kolmogorov-Smirnov statistic has notable advantages, since it can be used with small samples and is generally more powerful.

In his 1967 paper, Lilliefors provided a way of adapting the Kolmogorov-Smirnov statistic to test whether a set of observations comes from a continuous distribution F_0(y) of a specified form when certain parameters of the distribution must be estimated from the sample. If the conventional Kolmogorov-Smirnov test is used in this case the results will be conservative, compromising the power of the test. Lilliefors presents a method of testing whether a set of observations comes from a normal population with unknown mean and variance. In order to apply the test we must first specify the continuous distribution F_0(y) completely. Since the mean and the variance are unknown, we estimate them by ȳ and s², the sample mean and sample variance of the given observations. Hence we take F_0(y) to be the cdf of a normal distribution with mean ȳ and variance s². Once these values are calculated, the Kolmogorov-Smirnov statistic D_n is computed exactly as above with F_0(y) = Φ_{ȳ,s²}.
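As a hedged illustration (this is not the code from the appendices, just a minimal sketch), the statistic of equation (2) with the sample mean and standard deviation plugged in can be computed in a few lines of R:

```r
# Sketch of the Lilliefors statistic of equation (2), with the sample mean
# and standard deviation substituted for the unknown mu and sigma.
lilliefors_stat <- function(y) {
  n <- length(y)
  x <- sort(y)                               # order statistics x_1 <= ... <= x_n
  z <- pnorm(x, mean = mean(y), sd = sd(y))  # Phi((x_k - mean)/sd)
  max((1:n) / n - z, z - (0:(n - 1)) / n)    # larger of the two one-sided gaps
}

set.seed(1)
lilliefors_stat(rnorm(20))
```

For a continuous F_0 this agrees with the supremum in equation (1), since the largest gap between the step-function ecdf and F_0 must occur at one of the jump points.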
2.0 Adjusted Critical Values

Once the test statistic is obtained, we cannot use the critical values from the Kolmogorov-Smirnov tables to draw a conclusion about the test because, as mentioned before, the results would be conservative. In his paper, Lilliefors calculates new critical values for this specific test using a Monte Carlo calculation. This was done by drawing 1000 random normally distributed samples y_1, …, y_n for each of several values of n and thereby estimating the distribution of D_n. The first R code in appendix 1 is used to obtain a table similar to Table 1 of Lilliefors' paper. The output table is given below.

Table 1: Monte Carlo critical values of D_n. Critical values estimated by a Monte Carlo calculation using 1000 samples for each sample size N. Any value of the Kolmogorov-Smirnov test statistic D_n greater than the corresponding critical value at a given level of significance leads to rejection of the null hypothesis of normality at that significance level.
Sample        Level of significance for D_n
size N     0.20      0.15      0.10      0.05      0.01
  4       0.300     0.318     0.342     0.378     0.410
  5       0.290     0.303     0.323     0.350     0.403
  6       0.272     0.282     0.297     0.321     0.373
  7       0.251     0.263     0.277     0.307     0.351
  8       0.234     0.246     0.259     0.280     0.332
  9       0.227     0.237     0.254     0.274     0.313
 10       0.215     0.225     0.243     0.265     0.307
 11       0.209     0.217     0.228     0.248     0.290
 12       0.204     0.212     0.224     0.245     0.279
 13       0.196     0.204     0.216     0.231     0.263
 14       0.185     0.196     0.205     0.222     0.256
 15       0.178     0.185     0.197     0.215     0.257
 16       0.176     0.183     0.195     0.210     0.251
 17       0.170     0.179     0.190     0.206     0.234
 18       0.166     0.173     0.184     0.200     0.236
 19       0.161     0.170     0.178     0.195     0.220
 20       0.158     0.165     0.174     0.189     0.220
 25       0.143     0.150     0.160     0.171     0.195
 30       0.131     0.136     0.143     0.155     0.185
Over 30   0.723/√N  0.760/√N  0.814/√N  0.880/√N  1.022/√N
Table 1 gives the estimated critical values at the 0.20, 0.15, 0.10, 0.05 and 0.01 significance levels for sample sizes from N = 4 to N = 30. For any sample of size N greater than 30, we use the "Over 30" row: the critical value obtained at N = 40 is multiplied by √40 to give the numerator constant, and the result is divided by √N. Comparing this table with the standard Kolmogorov-Smirnov tables, we can see that the critical values at the 0.01 significance level here are slightly smaller than the standard critical values at the 0.20 significance level. Hence if we used the standard Kolmogorov-Smirnov tables as critical values we
would obtain very conservative results. This means that the actual significance level would be much lower than the one claimed by the test. A big question which needs to be addressed at this stage is: is the modified KS test still more powerful than the chi-square test?
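The "Over 30" rule reduces to dividing the tabulated constant by √N. A minimal sketch in R, using the 0.05-level constant 0.880 estimated above:

```r
# Critical value of Dn for N > 30: constant from the "Over 30" row of Table 1
# divided by sqrt(N). The default 0.880 is the 0.05-level Monte Carlo estimate.
crit_over30 <- function(N, const = 0.880) const / sqrt(N)

crit_over30(100)  # 0.880 / 10 = 0.088
```

Larger samples tolerate only smaller gaps between the ecdf and the fitted normal cdf before normality is rejected.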
3.0 Power of the test

One of the advantages of this specialised Kolmogorov-Smirnov test for normality is that, unlike the chi-square test, it can still be used for small sample sizes. Kac, Kiefer and Wolfowitz (1955) showed that it is asymptotically more powerful than the chi-square test. However, we want to know whether the test also performs well for relatively small sample sizes.

In Lilliefors' paper a small investigation was made to compare the powers of these two tests. 500 samples of size 20 were drawn from several distributions: normal, chi-square with 3 d.f., Student's t with 3 d.f., exponential and uniform. The probabilities of rejecting the null hypothesis of normality using the Kolmogorov-Smirnov statistic and the chi-square statistic were found and compared. In this comparison Lilliefors also used Monte Carlo critical values for the chi-square test, rather than the standard chi-square points, to avoid a high probability of type I error.

From Table 2 of Lilliefors' paper we can see that the probabilities of type I error are satisfactory and similar for both tests. However, the Kolmogorov-Smirnov test is much more powerful than the chi-square test: the probabilities of correctly rejecting the hypothesis of normality are significantly greater when using the D statistic than when using the chi-square statistic, for every non-normal underlying distribution, at both the 5% and the 10% significance level. This is another advantage of using the specialised test for normality. However, this only tells us that the specialised Kolmogorov-Smirnov test is superior to the chi-square test; the power of the test is still far from ideal, particularly when the observations come from a uniform distribution, as the same table shows.
The probability of correctly rejecting the null hypothesis of normality when the 20 observations come from a uniform distribution is 12% at a 5% significance level and 22% at a 10% significance level. Hence this test is not ideal for certain distributions.

Table 3 of Lilliefors' paper gives a calculation similar to that of Table 2. It gives the probabilities of rejecting the hypothesis of normality for the same underlying distributions as Table 2, but now using 500 samples of size 10 rather than 20. The test used is the Kolmogorov-Smirnov test with the adjusted critical values from Table 1 at α = 0.05 and α = 0.10. We can see from this table that the probabilities of type I error are still satisfactory, but that the power of the test decreases further now that the sample size has dropped. At a 5% significance level, when the observations come from a uniform distribution, the test correctly rejects the null hypothesis of normality only 7% of the time, and only 13% of the time when α = 0.10. Hence another important factor affecting the power of the test is the size of the sample: the larger the sample, the more powerful the test. The code in appendix 2 will generate
the proportion of samples which correctly reject the null hypothesis of normality at a 10% level of significance, out of 500 samples each of size N = 100 drawn from a uniform distribution. The output obtained is 0.774, which means that the null hypothesis of normality is correctly rejected for 77.4% of the 500 samples. This is much higher than the corresponding figures in Tables 2 and 3 of Lilliefors' paper. The following table, obtained using R, is equivalent to Table 3 of Lilliefors' paper, the only difference being that the samples are now of size 100 rather than 10.
Table 2: Probability of rejecting the hypothesis of normality when the sample size is 100
(Kolmogorov-Smirnov test, using critical values from Table 1)

Underlying distribution      α = 0.05    α = 0.10
Normal                        0.050       0.098
Chi-square, 3 d.f.            0.990       0.998
Student's t, 3 d.f.           0.730       0.842
Exponential                   1.000       1.000
Uniform                       0.578       0.774
Comparing this table to Table 3, or even Table 2, of Lilliefors' paper, we can clearly observe dramatic increases in the probabilities of correctly rejecting the null hypothesis of normality for the bottom four distributions, particularly for the exponential distribution. The power of the test has increased; hence the sample size is a very important factor in determining the power of the test. The probabilities of type I error remain close to the nominal significance levels.
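The power calculations above can be reproduced in outline with a short simulation. This sketch is a generic parameterisation of the appendix 2 idea (the function name and arguments are my own); crit is the Table 1 critical value and the statistic is computed as in equation (2):

```r
# Monte Carlo estimate of the probability of rejecting normality when samples
# of size n are drawn by 'rdist'. 'crit' is the Table 1 critical value.
power_lilliefors <- function(rdist, n, crit, reps = 500) {
  rejections <- replicate(reps, {
    y <- rdist(n)
    z <- pnorm(sort(y), mean(y), sd(y))
    D <- max((1:n) / n - z, z - (0:(n - 1)) / n)  # statistic of eq. (2)
    D > crit                                      # TRUE if normality rejected
  })
  mean(rejections)                                # proportion rejected
}

set.seed(2)
power_lilliefors(runif, 100, 0.814 / sqrt(100))  # power against a uniform, 10% level
```

With rdist = rnorm the same function estimates the type I error rate, which should sit near the nominal significance level.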
4.0 Outliers

One problem with using the sample mean and sample variance as estimates of the mean and variance of the null distribution is that they are sensitive to outliers in the data sample. Since the test is directly affected by the choice of mean and variance, this can lead to errors. Type I error is a bigger problem when outliers are present, particularly when the sample size is small, because the smaller the sample, the greater the effect of an outlier on the sample estimates.

The third R code (appendix 3) uses code similar to that used to obtain the values for Table 2 (appendix 2). 500 samples of size 10 are drawn; 9 observations in each sample come from a standard normal distribution, whilst the remaining observation is an outlier. If we include the outlier as a correct observation when estimating the sample mean and sample variance, then the null hypothesis of normality is rejected for a large proportion of the 500 samples (30.2% in one run). This means that the probability of type I error is much higher than the level of significance. However, if we first install and load the 'outliers' package in R and use the command 'rm.outlier' to locate the outlier and replace it by the mean of the remaining observations, the proportion of samples rejecting the hypothesis of normality returns to roughly the nominal level.

In appendix 4, a code was used to observe the effect of a single outlier on the value of the test statistic D_n for samples of size 10 to 101. Just as expected, the effect of the outlier decreases as the sample size increases. The code draws 500 samples for each sample size from 10 to 101, with one outlier in each sample, and finds the average value of the
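The sensitivity of the sample estimates can be illustrated directly. This sketch uses a hypothetical helper (not from the appendices) that mimics what rm.outlier from the 'outliers' package does, without requiring the package:

```r
# Replace the observation farthest from the sample mean by the mean of the
# remaining observations (the idea behind outliers::rm.outlier(fill = TRUE)).
fill_outlier <- function(y) {
  i <- which.max(abs(y - mean(y)))  # index of the most extreme point
  y[i] <- mean(y[-i])               # fill it with the mean of the others
  y
}

set.seed(3)
y <- c(rnorm(9), 4)            # nine standard-normal draws plus an outlier at 4
c(sd(y), sd(fill_outlier(y)))  # the outlier inflates the standard deviation
```

With the inflated standard deviation, the fitted normal cdf is stretched, which distorts the statistic of equation (2) and raises the type I error rate.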
KS test statistic for each sample size. The graph below shows the average KS statistic plotted against the corresponding sample size.
Graph 1: The average KS statistic for different sample sizes

4.2 Modified Lilliefors Test

One problem with Lilliefors' test is that the mean and the variance are obtained from fixed sample estimates. These sample estimates, particularly for relatively small sample sizes, are sensitive to sample outliers. In their 2008 paper, Drezner, Turel and Zerom introduced a modified version of Lilliefors' test which they believed to be superior to the conventional procedure. They also use equation (2) to find the test statistic; however, they do not regard fixed sample estimators such as the sample mean and the sample variance as appropriate estimates of the mean and variance of the random sample. In contrast to the traditional KS test introduced by Lilliefors, in which the data are compared against a normal distribution with fixed parameters, this method tries to find a normal distribution which fits the data sample better than the fixed-parameter distribution. The traditional KS statistic is obtained by using the sample mean ȳ and the sample standard deviation s as the estimates of the mean µ and standard deviation σ in (2). The modified test instead uses an algorithm to obtain values µ̃ and σ̃ which minimize the value of the test statistic D_n. The critical values needed to make a decision about the outcome of the test were also calculated differently from Lilliefors', in order to match this procedure.

4.2.1 Algorithm

Equation (2) depends on the choice of µ and σ², hence we denote the test statistic by D_n(µ, σ²). For the conventional Lilliefors test this is simply D_n(ȳ, s²). The test statistic for the modified test is D_n(µ̃, σ̃²), where (µ̃, σ̃²) is the solution of the minimization problem

min over (µ, σ²) of { D_n(µ, σ²) }        (3)
We can write equation (2) as the following inequalities:

D_n(µ, σ²) ≥ k/n − Φ((x_k − µ)/σ)        (4)

D_n(µ, σ²) ≥ Φ((x_k − µ)/σ) − (k−1)/n        (5)
The solution to problem (3) is the smallest possible value of D_n(µ, σ²) which satisfies (4) for k < nL and (5) for k < n(1 − L) + 1. The values of µ and σ² for which this holds form the vector (µ̃, σ̃²). The benefit of this modified version of the Lilliefors test is that it accounts for possible outliers in the data. With the standard KS test, outliers in the data can significantly affect the values of the sample estimates, leading to errors in the test; this method, by contrast, takes possible outliers into account. When compared to the standard KS test, the modified version was much more powerful for most distributions, particularly when the data came from a uniform distribution.
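The minimization in (3) can be sketched numerically. This is not the authors' algorithm, just a generic search with optim; parameterising by log σ (my assumption) keeps the standard deviation positive:

```r
# Dn as a function of (mu, log sigma), i.e. equation (2) with free parameters.
dn <- function(par, x) {
  n <- length(x)
  z <- pnorm(sort(x), mean = par[1], sd = exp(par[2]))
  max((1:n) / n - z, z - (0:(n - 1)) / n)
}

# Minimize Dn over (mu, sigma), starting from the sample estimates. The
# minimized statistic can never exceed the plug-in Lilliefors statistic.
min_dn <- function(x) optim(c(mean(x), log(sd(x))), dn, x = x)

set.seed(4)
x <- rnorm(15)
c(plugin = dn(c(mean(x), log(sd(x))), x), minimized = min_dn(x)$value)
```

Because the minimized statistic is smaller, the modified test needs its own (smaller) critical values, which is why Drezner, Turel and Zerom recompute them rather than reuse Table 1.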
5.0 Conclusion

The Kolmogorov-Smirnov test provides a good method of testing whether a sample comes from a completely known distribution with cdf F_0. When the cdf F_0 is not fully known we can test for normality using Lilliefors' test, which is simply a modified version of the KS test in which the sample mean and the sample variance are used as the mean and the variance of the unknown distribution. Lilliefors introduced new critical values in his paper, obtained from a Monte Carlo calculation with 1000 samples for each sample size. Once the Monte Carlo critical values were obtained, it was noticeable that for each sample size the standard critical values from the Kolmogorov-Smirnov tables were much higher than the ones obtained in his paper. Hence using the standard critical values would result in a very conservative test, and hence a loss of power.

This version of the KS test was still more powerful than the chi-square test, as shown in Table 2 of reference 1, where the probabilities of correctly rejecting the null hypothesis were significantly higher for the KS test than for the chi-square test. The test can also be used for small sample sizes, just like the standard KS test, which is another advantage over the chi-square test.

Outliers strongly affect the sample estimates, particularly for small sample sizes, which can lead to errors. This was shown in Graph 1, where the average values of the KS statistic from 500 samples of each sample size were calculated and plotted. The values of the KS statistic clearly decrease as the sample size increases, which means that an outlier is much more likely to lead to a rejection of the null hypothesis when the sample size is small. This is also shown in appendix 3, where the inclusion of an outlier in the data sample increases the proportion of type I errors.
In their 2008 paper, Drezner, Turel and Zerom introduced a modified version of the Lilliefors test which chooses estimates of the mean and variance such that the test statistic is minimized. This method was shown to be more powerful than the Lilliefors test for many underlying distributions, though not for the t distribution: when the data came from a t distribution, the method was less powerful than the Lilliefors test.
6.0 Appendices

6.1 Appendix 1

N <- 30      # largest sample size tabulated directly
N1 <- 1000   # number of Monte Carlo samples per sample size
Dn <- c(); CT.2 <- c(); CT.15 <- c(); CT.1 <- c(); CT.05 <- c(); CT.01 <- c()  # empty vectors

for (i in 4:N) {                      # loop over sample sizes
  for (j in 1:N1) {                   # loop over the Monte Carlo samples
    S <- rnorm(i)                     # random normal sample of size i
    ord <- sort(S)                    # order the sample to compute the KS statistic
    SD <- sd(S); m <- mean(S)         # sample standard deviation and sample mean
    Ecdf <- ecdf(S)
    F0 <- function(x) pnorm(x, mean = m, sd = SD)         # fitted normal cdf
    DnStarP <- max(Ecdf(ord) - F0(ord))                   # k/n - F0(x_k)
    DnStarM <- max(F0(ord) - c(0, Ecdf(ord)[1:(i - 1)]))  # F0(x_k) - (k-1)/n
    Dn[j] <- max(DnStarP, DnStarM)    # KS statistic for this sample
  }
  # The critical values are the 0.80, 0.85, 0.90, 0.95 and 0.99 empirical
  # quantiles of the 1000 simulated statistics, for significance levels
  # a = 0.2, 0.15, 0.1, 0.05 and 0.01.
  CT.2[i]  <- quantile(Dn, 0.80)
  CT.15[i] <- quantile(Dn, 0.85)
  CT.1[i]  <- quantile(Dn, 0.90)
  CT.05[i] <- quantile(Dn, 0.95)
  CT.01[i] <- quantile(Dn, 0.99)
}
# Combine the vectors into a 27x5 matrix, dropping the unused entries 1-3.
CT <- cbind(CT.2[4:30], CT.15[4:30], CT.1[4:30], CT.05[4:30], CT.01[4:30])
CT  # output the final matrix
6.2 Appendix 2

C <- c()
for (i in 1:500) {            # 500 samples of 100 observations from a uniform distribution
  S <- runif(100)
  ord <- sort(S)              # ordered sample, needed for the test statistic
  SD <- sd(S); m <- mean(S)   # sample estimates, as in appendix 1
  Ecdf <- ecdf(S)
  F0 <- function(x) pnorm(x, mean = m, sd = SD)
  DnStarP <- max(Ecdf(ord) - F0(ord))
  DnStarM <- max(F0(ord) - c(0, Ecdf(ord)[1:(100 - 1)]))
  DnStar <- max(DnStarP, DnStarM)   # KS test statistic
  # Record 1 if the statistic exceeds the Monte Carlo critical value for a
  # sample of size 100 at the 10% significance level, and 0 otherwise.
  C[i] <- ifelse(DnStar > 0.805 / sqrt(100), 1, 0)
}
sum(C) / 500   # proportion of the samples for which the hypothesis was rejected
# [1] 0.774

6.3 Appendix 3

C <- c()
for (i in 1:500) {            # 500 samples of size 10, each with one outlier
  S <- c(rnorm(9), 4)
  NS <- S
  # Replace the line above with 'NS <- rm.outlier(S, fill = TRUE)' (from the
  # 'outliers' package) to replace the outlier by the mean of the remaining
  # observations, and rerun to observe the change in the rejection proportion.
  ord <- sort(NS)
  SD <- sd(NS); m <- mean(NS)
  Ecdf <- ecdf(NS)
  F0 <- function(x) pnorm(x, mean = m, sd = SD)
  DnStarP <- max(Ecdf(ord) - F0(ord))
  DnStarM <- max(F0(ord) - c(0, Ecdf(ord)[1:(length(NS) - 1)]))
  DnStar <- max(DnStarP, DnStarM)   # KS statistic, found as in the previous codes
  # Record 1 if the statistic exceeds the Monte Carlo critical value for a
  # sample of size 10, and 0 otherwise.
  C[i] <- ifelse(DnStar > 0.2425861, 1, 0)
}
sum(C) / 500   # proportion of rejected null hypotheses, as in appendix 2
# [1] 0.302

6.4 Appendix 4

C <- c()
AV <- c()
for (j in 9:100) {            # sample sizes 10 to 101: j normal draws plus one outlier
  for (i in 1:500) {          # 500 samples for each sample size
    S <- c(rnorm(j), 4)
    ord <- sort(S)
    SD <- sd(S); m <- mean(S)
    Ecdf <- ecdf(S)
    F0 <- function(x) pnorm(x, mean = m, sd = SD)
    DnStarP <- max(Ecdf(ord) - F0(ord))
    DnStarM <- max(F0(ord) - c(0, Ecdf(ord)[1:(length(S) - 1)]))
    C[i] <- max(DnStarP, DnStarM)   # KS statistic, computed as before
  }
  AV[j] <- mean(C)  # the 92 entries of AV are the average statistics per sample size
}
plot(AV, xlab = 'sample size', ylab = 'KS statistic')  # average statistic vs sample size
7.0 References

1. Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.
2. Drezner, Z., Turel, O. and Zerom, D. (2008). A modified Kolmogorov-Smirnov test for normality. Munich Personal RePEc Archive.
3. Kac, M., Kiefer, J. and Wolfowitz, J. (1955). On tests of normality and other tests of goodness of fit based on distance methods. The Annals of Mathematical Statistics, 26, 189-211.