Handout No. 6 Phil.015 April 12, 2016

HYPOTHESIS TESTING IN BINOMIAL AND NORMAL DISTRIBUTION MODELS

1. Hypotheses
Recall that in one old segment of the TV show “The Odd Couple,” Felix claimed to have ESP. Oscar was skeptical and suggested to test Felix’s claim. Oscar would draw a card at random from a deck of four large cards, each with a different geometric figure on it (e.g., a circle, square, triangle and a cross). Without showing the card, Felix was asked to identify it. Felix and Oscar repeated the basic card identification experiment 10 times. Remarkably, Felix made 6 correct identifications. Felix’s score is surprisingly good in view of the fact that the average number of correct identifications is only 2.5 for anyone who does not have ESP. Well, does Felix’s score prove that he actually has ESP? Of course not. This may have been his lucky day. But Felix’s high score may provide some evidence for Felix’s ESP capability. But just how much evidence? It depends on how strict Oscar chooses to be in terms of the discrepancy between the gathered data (i.e., 6 correct identifications) and the predictions of his “no ESP” statistical model (i.e., on average one can make only 2.5 correct identifications).
2. Statistical model building
To build a statistical model for the Felix-Oscar ESP experiment, we need the following two conceptual ingredients: (i) A test statistic X_n, i.e., a random variable that counts the number of Felix’s correct identifications in n independent trials. It is easy to see that the possible values of the test statistic X_n (with n = 10) are: 0, 1, 2, ..., 10.
(ii) A parametrized family of seriously possible probability distribution functions of X_n. Because (i) the probability that Felix correctly identifies a card drawn by Oscar in any trial is always the same, namely p = 1/4, (ii) there are only two possible outcomes for each trial, called success (correct identification) and failure (wrong identification), (iii) there are n = 10 trials, where n is fixed, and since (iv) all n trials are statistically independent of each other, the binomial probability distribution is an appropriate model for the experiment.
Remember that there are 4 cards and without ESP Felix will guess correctly any card drawn by Oscar with probability p = 1/4. If Felix has ESP, then the probability could be much higher, but we may not know exactly how much higher. Given the foregoing problem description, it is most adequate to consider as a model the following parametrized family of binomial probability distribution functions:

    Bin_{X_10}(k | p) =_df P(X_10 = k | p) = (10 choose k) · p^k · (1 − p)^{10−k},
where the parameter that parametrizes the possible binomial statistical models is the probability p. Of course, we do not know the exact value of p. The business of hypothesis testing is to generate statistical inferences about the likely values of p.
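The pmf above is easy to evaluate directly. A minimal Python sketch (not part of the original handout), using only the standard library:

```python
from math import comb

def bin_pmf(k, n=10, p=0.25):
    """Bin_{X_n}(k | p) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Under H0: p = 1/4, this reproduces the probability table given below.
print(round(bin_pmf(2), 3))  # 0.282, the most likely score under pure guessing
```

Since the pmf ranges over all possible scores 0, ..., 10, its values sum to 1.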
Statisticians often write X_n ∼ Bin(k | p) to indicate that the random variable (i.e., the test statistic) X_n has a pdf specified by Bin(k | p) with parameter p whose value is unknown. This is their way to introduce a parametric family of statistical models that hopefully includes the correct model with a specific value for p. Because this value is not known, the next move is to hypothesize a specific value of p and then let the experimental outcome decide whether that hypothesis about p’s value is acceptable.
Because Oscar does not believe that Felix has ESP, he starts pessimistically with the so-called null hypothesis

    H_0 : p = 1/4,
stating plausibly that Felix has no ESP and Felix is only guessing. In other words, Oscar believes that the binomial statistical model
    X_10 ∼ Bin(k | 1/4)

correctly characterizes Felix’s ESP capabilities. To emphasize the extant hypothesis H_0, it is common to symbolize the extant statistical model

    X_10 ∼ Bin(k | H_0)

in place of the above notation. (Think of H_0 as a condition in the model.) It is easy to verify that the mean (expected value) of the test statistic X_n under H_0 is µ = E(X_10) = 10 · 1/4 = 2.5 (from E(X_n) = np). Recall also that under H_0, the pdf of X_10 is given by the table below:
Specification of Bin_{X_10}(k | p) – Probability Assignment

    x          :  0      1      2      3      4      5      6      7      8
    p_{X_10}(x):  0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000
The graphical representation of pdf p_{X_10} is displayed next. Note that the diagram has its highest values at X_10 = 2 and X_10 = 3, justifying the mean value µ_{X_10} = 2.5. As alluded to above, with no ESP, the correct identification scores will be quite close to 2 or 3.

[Graph: p_{X_10}(x) = Bin_{X_10}(x | H_0) = P(X_10 = x | p), plotted for x = 0, 1, ..., 10.]
Remember that Felix has correctly identified 6 cards in a sample of n = 10 trials. I.e., the observed value x_obs of X_10 is 6. It is easy to see that the probability of correctly identifying 6 or more cards is P(X_10 ≥ 6 | H_0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019. The fact that this probability is rather small tells us that X_10 = 6 is quite far from what we expect (namely only E(X_10) = 2.5 under hypothesis H_0, or under Oscar’s extant statistical model X_10 ∼ Bin(k | H_0)). Note again that the sample result X_10 = 6 seems rather inconsistent with the null hypothesis H_0. To put it another way, the null hypothesis H_0 does not explain in a satisfactory manner the observed value X_10 = 6. For example, if p were considerably larger, say p = 0.5, captured by a different null hypothesis, say H_0′, then the mean of X_10 would be µ = 10 · 1/2 = 5 and the graph of the revised probability distribution function would be as follows:
[Graph: p_{X_10}(x) = Bin_{X_10}(x | H_0′) = P(X_10 = x | p), plotted for x = 0, 1, ..., 10.]
This model would indeed “explain” Felix’s results much better, but Oscar does not accept it! Oscar is a skeptic! Be that as it may, we should definitely consider the so-called alternative hypothesis

    H_a : p > 1/4,
stating that in the case of Felix’s ESP performance the probability of correctly identifying a card drawn by Oscar is actually greater than the “guessing-type” probability. In other words, the correct model for the experiment is somewhere in the binomial family
    X_10 ∼ Bin(k | H_a)
of seriously possible statistical models. The next problem is how to decide which hypothesis we should accept – H_0 or H_a – in the face of the fresh observation X_10 = 6. Before moving on, note also that if Oscar were a true believer in Felix’s ESP capabilities, he may as well consider yet another hypothesis, say H_0′′ : p = 0.75, that leads to the binomial probability distribution
[Graph: p_{X_10}(x) = Bin_{X_10}(x | H_0′′) = P(X_10 = x | p), plotted for x = 0, 1, ..., 10.]
giving the mean µ = 10 · 0.75 = 7.5, which treats Felix’s ESP performance far too optimistically. This is something Oscar is not prepared to do. What hypotheses H_0′ and H_0′′ indicate is that there are many hypotheses that perhaps explain Felix’s experimental result much better than H_0. But Oscar is not convinced as yet that this might be the case.
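The competing models can be compared numerically. A minimal Python sketch (not part of the original handout) that evaluates how likely a score of 6 or more is under each candidate value of p:

```python
from math import comb

def right_tail(n, p, x):
    """P(X_n >= x | p) for a binomial test statistic X_n."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# How well does each hypothesized p "explain" Felix's score of 6 out of 10?
for p in (0.25, 0.5, 0.75):          # H0, H0', H0''
    print(p, round(right_tail(10, p, 6), 4))
```

Under p = 0.25 the exact tail probability is about 0.0197 (the handout’s 0.019 comes from summing rounded table entries), while under p = 0.5 and p = 0.75 a score of 6 or more is quite ordinary.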
3. Model validation in binomial right-tail tests
One way to indicate the discrepancy between what we expect based on H_0 and what we actually observe is to calculate the so-called P-value, i.e., the right-tail probability
    P(X_10 ≥ 6 | H_0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019,

or in general the P-value P(X_n ≥ x_obs | H_0), where x_obs is the value of X_n observed in an experiment consisting of n trials. The fact that in Felix’s case the P-value is small indicates that the observation X_10 = 6 is far from what we expect when p = 1/4 (i.e., when we hold the view that H_0 is the correct hypothesis). To repeat, the model based on H_0 cannot explain Felix’s data! However, the result X_10 = 6 is not at all surprising if Felix really has some degree of ESP. As a matter of fact, the evidence tends to favor the conclusion that Felix has an ESP capability. It was Ronald Fisher (1890-1962) who stated by convention or by way of benchmarks that
(i) If the P-value satisfies

    P(X_n ≥ x_obs | H_0) < 0.01,

then x_obs shows strong evidence against H_0 (or simply the test result is highly statistically significant), prompting a rejection of H_0 at the 1% significance level! Because the P-value 0.019 in Felix’s case is greater than 0.01, Oscar can still retain H_0 at the 1% significance level!

(ii) However, if the P-value satisfies

    0.01 ≤ P(X_n ≥ x_obs | H_0) < 0.05,

then x_obs shows moderate evidence against H_0 (or simply the test result is statistically significant) – still prompting a rejection of H_0 but this time only at the 5% significance level! Since now the P-value 0.019 (obtained for Felix) is strictly less than 0.05, Oscar must give up his pessimistic hypothesis H_0 at the 5% significance level!

(iii) Finally, the P-value satisfying

    0.10 ≤ P(X_n ≥ x_obs | H_0)

indicates no evidence against H_0.
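Fisher’s benchmarks translate directly into a classification rule. A sketch in Python (not part of the handout; note that the text leaves the band between 0.05 and 0.10 unclassified, so the “weak/borderline” label for that band is my addition):

```python
def fisher_benchmark(p_value):
    """Classify a right-tail P-value by Fisher's conventional benchmarks."""
    if p_value < 0.01:
        return "strong evidence against H0 (highly significant)"
    if p_value < 0.05:
        return "moderate evidence against H0 (significant)"
    if p_value < 0.10:
        return "weak/borderline evidence"   # band left implicit in the text
    return "no evidence against H0"

print(fisher_benchmark(0.019))  # Felix's case: moderate evidence against H0
```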
In addition to the P-value approach, described above, there is also a dual critical value approach, according to which a designated critical value x_cr of the test statistic X_n will determine when H_0 ought to be rejected. Specifically, in a right-tailed test (a prime example is Felix’s ESP experiment) H_0 is rejected precisely when x_cr ≤ x_obs. In a right-tail test, the set of values of X_n that are equal to or larger than the critical value is called the rejection region. All other values of X_n belong to the nonrejection region.
Question: How do we find the critical value? Answer: The statistician specifies it prior to the experiment or calculates it from the equation
    P(X_n ≥ x_cr | H_0) = 0.01

at the 1% significance level. Since

    P(X_n ≥ x_cr | H_0) = 1 − P(X_n ≤ x_cr − 1 | H_0),

and hence P(X_n ≤ x_cr − 1 | H_0) = 0.99, we can look up the value of x_cr − 1 in the table for the binomial cumulative probability distribution for sample n and probability specified by H_0.
In the case of Felix’s ESP experiment, the values n = 10 and p = 1/4 give x_cr − 1 = 6 (the cumulative probability P(X_10 ≤ 6 | H_0) ≈ 0.996 is the first to reach 0.99) and therefore the critical value at the 1% significance level is x_cr = 7. What this means is that any score x_obs of Felix belonging to the rejection region {7, 8, 9, 10} on the right of the binomial pdf graph calls for a rejection of H_0 at the 1% significance level.
Now, at a 5% significance level we solve the equation P(X_n ≥ x_cr | H_0) = 0.05 for x_cr. Equivalently, we look up the value of x_cr − 1 in the table for the binomial cumulative probability distribution for sample n and probability specified by H_0, satisfying the formula

    P(X_n ≤ x_cr − 1 | H_0) = 0.95.

In Felix’s ESP example, we find that the critical value at the 5% significance level is approximately x_cr − 1 = 4.5, i.e., we have x_cr = 5.5. What this means is that any test score above 5.5 leads to the rejection of H_0 at the 5% significance level. Thus now the rejection region for Felix’s ESP experiment is given by the set {6, 7, 8, 9, 10}.
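The table lookups above can be replaced by an exact computation. A hedged Python sketch (not part of the handout); note that the exact computation gives x_cr = 7 at the 1% level, since P(X_10 ≥ 7 | H_0) ≈ 0.0035 ≤ 0.01, and x_cr = 6 at the 5% level, agreeing with the rejection region {6, 7, 8, 9, 10}:

```python
from math import comb

def right_tail(n, p, x):
    """P(X_n >= x | H0) for a binomial test statistic."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def critical_value(n, p, alpha):
    """Smallest x_cr with P(X_n >= x_cr | H0) <= alpha (right-tail test)."""
    for x in range(n + 1):
        if right_tail(n, p, x) <= alpha:
            return x

print(critical_value(10, 0.25, 0.01))  # 7
print(critical_value(10, 0.25, 0.05))  # 6
```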
The binomial graph for a biased coin that makes heads more likely (i.e., the alternative hypothesis has the form H_a : p(head) > 1/2) in n = 16 tosses has the form

[Graph: p_{X_16}(x) = Bin_{X_16}(x | H_0) = P(X_16 = x | p), plotted for x = 0, 1, ..., 16, with the rejection region marked on the right.]

The rejection region is given by head numbers X_16 = 12, 13, 14, 15, 16.
4. Model validation in left-tail tests
So far we have been analyzing the so-called one-sided or one-tailed tests in which the alternative hypothesis H_a has the general “right-tail” or “greater than” form θ > θ_0, indicating that the unknown parameter θ is greater than the parameter θ_0 in the null hypothesis H_0. However, in one-tailed tests the inequality can also go in the reversed “left-tail” or “smaller than” direction θ < θ_0. For example, suppose we have a coin that may be biased (loaded) in such a way that a head is less likely to occur than a tail. In this case the null hypothesis has the equational form

    H_0 : p = 1/2,

stating that the coin is fair, so that the probability of getting a head is P(H) = 0.5. But because it appears that a head is less likely (i.e., P(H) < 0.5), the obvious alternative (or research) hypothesis has the form

    H_a : p < 1/2.
Suppose the coin has been tossed n = 16 times. As above, the statistical model is once again given by a binomial probability distribution. Before the performance of the coin-tossing experiment we may stipulate that the rejection region will be specified by the set {0, 1, 2, 3, 4, 5}. In other words, this time the critical value is x_cr = 5, and the hypothesis H_0 will be rejected whenever only 5 or a smaller number of heads occurs in 16 independent tosses of the coin. The graph of the corresponding statistical model looks as follows:

[Graph: p_{X_16}(x) = Bin_{X_16}(x | H_0) = P(X_16 = x | p), plotted for x = 0, 1, ..., 16, with the rejection region marked on the left.]
In the graph the cutoff point (critical value) x_cr is indicated by a vertical line. In this setting hypothesis testing is really quite simple. In a given experiment of 16 coin tosses, simply count the total number of heads and verify whether it is above the cutoff point x_cr. If so, hypothesis H_0 is retained, and otherwise H_0 is rejected.
Of course, we can calculate the P-value P(X_n ≤ x_cr | H_0). If it turns out to be smaller than 0.01, then H_0 is rejected at the 1% significance level. And of course similarly for 0.05. Specifically, the binomial table gives P(X_16 ≤ 5 | H_0) = 0.1051, which is a bit too weak for rejecting H_0. However, because P(X_16 ≤ 3 | H_0) = 0.010, only 3 heads in total in 16 coin tosses would be on the border of a highly significant test at the 1% significance level.
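These left-tail probabilities are easy to verify. A small Python sketch (not part of the handout):

```python
from math import comb

def left_tail(n, p, x):
    """P(X_n <= x | H0), the left-tail P-value."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

print(round(left_tail(16, 0.5, 5), 4))  # 0.1051 -- too weak to reject H0
print(round(left_tail(16, 0.5, 3), 4))  # 0.0106 -- borderline at the 1% level
```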
5. Model validation in two-tail tests
As you may have guessed, there is also a two-tail test, in which the null hypothesis H_0 has the general form θ = θ_0 and the alternative hypothesis H_a is expressed by the inequality θ ≠ θ_0. A typical example is a coin that may be biased but we do not know which way. Therefore the hypotheses are as follows:
    H_0 : p = 1/2,

stating that the coin is fair, so that the probability of getting a head is P(H) = 0.5. But because it appears that a head and a tail may not be equally likely, the alternative hypothesis has the form

    H_a : p ≠ 1/2.
Once again, let us assume that the coin is tossed n = 16 times and let us stipulate that the rejection region is given by the set {0, 1, 2, 3} ∪ {13, 14, 15, 16}. Thus the critical value on the left is x_cr = 3 and on the right it is x_cr = 13. In this case the P-value is given by the sum P(X_16 ≤ 3 | H_0) + P(X_16 ≥ 13 | H_0) of P-values on the left and right.
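The two-tailed P-value for this rejection region can be computed directly. A minimal Python sketch (not part of the handout):

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_tail(n, p, lo, hi):
    """P(X_n <= lo | H0) + P(X_n >= hi | H0)."""
    left = sum(pmf(k, n, p) for k in range(lo + 1))
    right = sum(pmf(k, n, p) for k in range(hi, n + 1))
    return left + right

# Fair-coin null, rejection region {0,...,3} U {13,...,16}
print(round(two_tail(16, 0.5, 3, 13), 4))  # 0.0213
```

By the symmetry of the fair-coin distribution, the two tails contribute equally here.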
The graph of a two-tailed binomial test for a biased coin is as follows:
[Graph: p_{X_16}(x) = Bin_{X_16}(x | H_0) = P(X_16 = x | p), plotted for x = 0, 1, ..., 16, with rejection regions marked on both the left and the right.]
6. Model validation in normal distribution experiments
If the sample in a binomial model is large, say n ≥ 30, then the limit theorems of statistics suggest using a normal distribution, which approximates binomial distributions with large n to high degrees of accuracy. Recall that the normal distribution has the graphic form
[Graph: p_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, drawn with σ = 0.5 and µ = 0.]
The approximation of a binomial distribution by a normal distribution can be done as follows:
Because it is often easier and far more universal to work with the normal distribution, for large binomial samples hypothesis testing is performed in a normal distribution setting. There is one important technical problem, however. Tables for the normal probability distribution are available only for special cases, where the mean is µ = 0 and the standard deviation is σ = 1. In order to be able to use this rather specialized table also for the other normal probability distributions (in which in general µ ≠ 0 and σ ≠ 1), we must transform the original test statistic X_n into its so-called Z-statistic or Z-score (or standardized random variable), defined by
    Z =_df (X̄_n − µ) / (σ/√n),

where the distribution of Z is the standard normal distribution (with µ = 0 and σ = 1) and therefore can be found in the table. If σ is not known, it can be replaced by the sample standard deviation S_n.
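The standardization can be illustrated with a short Python sketch (not part of the handout; the sample values below are hypothetical):

```python
from math import sqrt

def z_score(x_bar, mu, sigma, n):
    """Z = (sample mean - mu) / (sigma / sqrt(n)), standardized under H0."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Hypothetical example: sample mean 5.3 from n = 36 observations,
# H0 mean mu = 5.0, known sigma = 1.0.
z = z_score(5.3, 5.0, 1.0, 36)  # = 1.8, beyond the 5% right-tail cutoff 1.645
```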
7. Model validation in normal right-tail tests
From the standard normal tables one finds that in normal right-tail tests the critical value Z = z_cr at a 1% significance level is z_cr = 2.33. Likewise, at the 5% significance level the critical value of Z is z_cr = 1.645. What this means is that hypothesis H_0 is rejected at a 0.05 significance level precisely when the Z-score computed from the test statistic X_n lies beyond the value z_cr = 1.645 in the rejection region on the right of the graph, indicated in the diagram below.
[Graph: standard normal density with the rejection region to the right of z_cr = 1.645 shaded.]
8. Model validation in normal left-tail tests
In the case of normal left-tail tests the situation is symmetrical with right-tail tests. This time, however, in order to reject H_0, the observed value Z = z_obs of the Z-score computed from the test statistic X_n should be smaller than, and hence to the left of, the critical value z_cr = −1.645, as indicated by the rejection region in the graph below:
[Graph: standard normal density with the rejection region to the left of z_cr = −1.645 shaded.]
9. Model validation in normal two-tail tests
As in the case of binomial tests, two-sided normal tests divide the rejection region into two areas – on the left and on the right. However, in this case the critical values z_cr at the 0.01 level of significance are −2.58 and 2.58. Likewise, at the 0.05 significance level the critical values z_cr are −1.96 and 1.96, as shown in the graph below:
[Graph: standard normal density with rejection regions to the left of −1.96 and to the right of 1.96 shaded.]
Once again, hypothesis H_0 is rejected provided that the observed result Z = z_obs in a pertinent experiment falls into the rejection region.
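The normal critical values quoted in sections 7-9 can be recovered from the Python standard library, as a short sketch (not part of the handout):

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mu = 0, sigma = 1

# One-tail critical values: z_cr = inverse CDF at 1 - alpha
print(round(std.inv_cdf(1 - 0.01), 2))      # 2.33
print(round(std.inv_cdf(1 - 0.05), 3))      # 1.645

# Two-tail critical values: z_cr = inverse CDF at 1 - alpha/2
print(round(std.inv_cdf(1 - 0.01 / 2), 2))  # 2.58
print(round(std.inv_cdf(1 - 0.05 / 2), 2))  # 1.96
```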