ICICI Centre for Mathematical Sciences
Mathematical Finance 2004-5
A Probability Primer
Dr Amber Habib

We will sum up the most essential aspects of Probability, and the transition to Statistical Inference. We avoid technical distractions. The gaps we leave can be filled in by consulting one of the references provided at the end. The figures have been produced using Mathematica.
Contents
1 Sample Space, Random Variables
2 Probability
3 Probability Distributions
4 Binomial Distribution
5 Normal Distribution
6 Expectation
7 Variance
8 Lognormal Distribution
9 Bivariate Distributions
10 Conditional Probability and Distributions
11 Independence
12 Chi Square Distribution
13 Random Samples
14 Sample Mean and Variance
15 Large Sample Approximations
16 Point Estimation
17 Method of Moments
18 Maximum Likelihood Estimators
19 Sampling Distributions
20 Confidence Intervals
1 Sample Space, Random Variables

The sample space is the collection of objects whose statistical properties are to be studied. Each such object is called an outcome, and a collection of outcomes is called an event.

• Mathematically, the sample space is a set, an outcome is a member of this set, and an event is a subset of it.

For example, our sample space may be the collection of stocks listed on the National Stock Exchange. We may ask statistical questions about this space, such as: What is the average change in the value of a listed stock (an outcome) over the past year?

Alternately, our sample space might be the history of a single stock: All its daily closing prices, say, over the last 5 years. Our question might be: What is the average weekly change of the value over the last year? Or, how large is the variation in the prices over its 5 year history? We could ask more subtle questions: How likely is it that the stock will have a sudden fall in value over the next month?

• Typical notation is to use S for sample space, lower case letters such as x for individual outcomes, and capital letters such as E for events.

In a particular context, we would be interested in some specific numerical property of the outcomes: such as the closing price on October 25, 2004, of all the stocks listed on the NSE. This property allots a number to each outcome, so it can be viewed as a function whose domain is the sample space S and range is in the real numbers R.

• A function X : S → R is called a random variable.
Our interest is in taking a particular random variable X and studying how its values are distributed. What is the average? How much variation is there in its values? Are very large values unlikely enough to be ignored?
2 Probability

We present two viewpoints on the meaning of probability. Both are relevant to Finance and, interestingly, they lead to the same mathematical definition of probability!

Viewpoint 1: The probability of an event should predict the relative frequency of its occurrence. That is, suppose we say the probability of a random stock having increased in value over the last month is 0.6. Then, if we look at 100 different stocks, about 60 of them should have increased in value. The prediction should be more accurate if we look at larger numbers of stocks.

Viewpoint 2: The probability of an event reflects our (subjective) opinion about how likely it is to occur, in comparison to other events. Thus, if we allot probability 0.4 to event A and 0.2 to event B, we are expressing the opinion that A is twice as likely as B.

Viewpoint 1 is appropriate when we are analyzing the historical data to predict the future. Viewpoint 2 is useful in analyzing how an individual may act when faced with certain information. Both viewpoints are captured by the following mathematical formulation:

• Let $\mathcal{A}$ be the collection of all subsets of the sample space S: we call it the event algebra.

• A probability function is a function $P : \mathcal{A} \to [0, 1]$ such that:
1. $P(S) = 1$.
2. $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$, if the $E_i$ are pairwise disjoint events (i.e., $i \neq j$ implies $E_i \cap E_j = \emptyset$).

Let $P : \mathcal{A} \to [0, 1]$ be a probability function. Then it automatically has the following properties:
1. $P(\emptyset) = 0$.
2. $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$, if the $E_i$ are pairwise disjoint events.
3. $P(A^c) = 1 - P(A)$.
4. $P\left(\bigcup_i E_i\right) \le \sum_i P(E_i)$, for any collection of events $E_i$.

3 Probability Distributions
We return to our main question: How likely are different values (or ranges of values) of a random variable X : S → R? If we just plug in the definition of a random variable, we realize that our question can be phrased as follows: What is the probability of the event whose outcomes correspond to a given value (or range of values) of X?

Thus, suppose I ask, what is the probability that X takes on a value greater than 100? This is to be interpreted as: what is the probability of the event whose outcomes t all satisfy X(t) > 100? That is,
$$P(X > 100) = P(\{t : X(t) > 100\}).$$

It is convenient to consider two types of random variables, ones whose values vary discretely (discrete random variables) and those whose values vary continuously (continuous random variables).[1]

Examples:
1. Let us allot +1 to the stocks whose value rose on a given day, and −1 to those whose value fell. Then we have created a random variable whose possible values are ±1. This is a discrete random variable.
2. Let us allot the whole number +n to the stocks whose value rose by between n and n + 1 on a given day, and −n to those whose value fell by between n and n + 1. Then we have created a random variable whose possible values are all the integers. This is also a discrete random variable.
3. If, in the previous example, we let X be the actual change in value, it is still discrete (since all changes are in multiples of Rs 0.01). However, now the values are so close that it is simpler to ignore the discreteness and model X as a continuous random variable.

[1] These are not the only types. But these types contain all the ones we need.
Discrete Random Variables

Let S be the sample space, $\mathcal{A}$ the event algebra, and $P : \mathcal{A} \to [0, 1]$ a probability function. Let X : S → R be a discrete random variable, with range $x_1, x_2, \ldots$ (the range can be finite or infinite). The probability of any particular value x is
$$P(X = x) = P(\{t \in S : X(t) = x\}).$$

• We call these values the probability distribution or probability density of X and denote them by $f_X : \mathbb{R} \to [0, 1]$, $f_X(x) = P(X = x)$.

One can find the probability of a range of values of X by just summing up the probabilities of all the individual values in that range. For instance,
$$P(a < X < b) = \sum_{x_i \in (a,b)} f_X(x_i).$$

In particular, summing over the entire range gives
$$\sum_i f_X(x_i) = 1,$$
since the total probability must be 1.

Discrete Uniform Distribution: Consider an X whose range is {0, 1, 2, . . . , n} and each value is equally likely. Then
$$f_X(x) = \begin{cases} 0 & x \notin \{0, 1, 2, \ldots, n\} \\ 1/(n+1) & \text{else.} \end{cases}$$
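The summation rules for a discrete distribution can be tried out on the discrete uniform distribution. The following is a minimal sketch, not part of the original notes; the function names and the choice n = 5 are illustrative assumptions.

```python
from fractions import Fraction

def discrete_uniform_pmf(x, n):
    """f_X(x) for X uniform on {0, 1, ..., n} (exact arithmetic)."""
    if x in range(n + 1):
        return Fraction(1, n + 1)
    return Fraction(0)

def prob_open_interval(a, b, n):
    """P(a < X < b): sum f_X over the range values strictly between a and b."""
    return sum(discrete_uniform_pmf(x, n) for x in range(n + 1) if a < x < b)

n = 5  # illustrative choice
total = sum(discrete_uniform_pmf(x, n) for x in range(n + 1))
print("total probability:", total)
print("P(1 < X < 4):", prob_open_interval(1, 4, n))
```

Exact fractions are used so that the total comes out as exactly 1 rather than a rounded floating point value.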
Continuous Random Variables

Suppose the values of a random variable X vary continuously over some range, such as [0, 1]. Then, it is not particularly useful to ask for the likelihood of X taking on any individual value such as 1/2: since there are infinitely many choices, each individual choice is essentially impossible and has probability zero. From a real life viewpoint also, since exact measurements of a continuously varying quantity are impossible, it is only reasonable to ask for the probability of an observation lying in a range, such as (0.49, 0.51), rather than its having the exact value 0.5.

The notion of a probability distribution of a continuous random variable is developed with this in mind. Recall that in the discrete context, probability of a range was obtained by summing over that range. So, in the continuous case, we seek to obtain probability of a range by integrating over it.

• Given a continuous random variable X, we define its probability density to be a function $f_X : \mathbb{R} \to [0, \infty)$ such that for any a, b with a ≤ b,
$$P(a \le X \le b) = \int_a^b f_X(x)\,dx.$$
In particular,
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$

Important. The number $f_X(x)$ does not represent the probability that X = x. Individual values of $f_X$ have no significance, only the integrals of $f_X$ do! (Contrast this with the discrete case.)

Continuous Uniform Distribution:[2] Suppose we want a random variable X that represents a quantity which varies over [0, 1] without any bias: this is taken to mean that P(X ∈ [a, b]) should not depend on the location of [a, b] but only on its length.

[Figure: three subintervals $I_1$, $I_2$, $I_3$ of [0, 1] of equal length, for which $P(X \in I_1) = P(X \in I_2) = P(X \in I_3)$.]

This is achieved by taking the following probability density:
$$f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{else.} \end{cases}$$
For then, with 0 ≤ a ≤ b ≤ 1,
$$P(a \le X \le b) = \int_a^b 1\,dx = b - a.$$

[Figure: graph of $f_X$, equal to 1 on [0, 1] and 0 elsewhere.]

[2] Usually, this is just called the uniform distribution.
Cumulative Probability Distribution

• The cumulative probability distribution of a random variable X, denoted $F_X$, is defined by:
$$F_X(x) = P(X \le x).$$
It is easy to see that
$$F_X(x) = \begin{cases} \displaystyle\sum_{x_i \le x} f_X(x_i) & X \text{ is discrete with range } x_1, x_2, \ldots \\ \displaystyle\int_{-\infty}^{x} f_X(t)\,dt & X \text{ is continuous.} \end{cases}$$

Now we look at two distributions, one discrete and one continuous, which are especially important.
4 Binomial Distribution

Consider a random variable X which can take on only two values, say 0 and 1 (the choice of values is not important). Suppose the probability of the value 0 is q and of 1 is p. Then we have:
1. 0 ≤ p, q ≤ 1,
2. p + q = 1.

Question: Suppose we observe X n times, each observation being independent of the others. What are the likely distributions of 0’s and 1’s? Specifically, we ask: What is the probability of observing 1 exactly k times?

We calculate as follows. Let us consider all possible combinations of n 0’s and 1’s:

Number of combinations with k 1’s = Ways of picking k slots out of n = $\binom{n}{k}$.

Probability of each individual combination with k 1’s = $p^k (1-p)^{n-k}$.

Therefore,
$$P(1 \text{ occurs } k \text{ times}) = \binom{n}{k} p^k (1-p)^{n-k}.$$

• A random variable Y has a binomial distribution with parameters n and p if it has range 0, 1, . . . , n and its probability distribution is:
$$f_Y(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
We call Y a binomial random variable and write Y ∼ B(n, p).

As illustrated above, binomial distributions arise naturally wherever we are faced with a sequence of independent two-way choices. In Finance, the binomial distribution is part of the Binomial Tree Model for pricing options.

Figure 1 illustrates the binomial distributions with p = 0.2 and p = 0.5, using different types of squares for each. Both have n = 10.

Exercise. In Figure 1, identify which points correspond to p = 0.2 and which to p = 0.5.
Figure 1: Two binomial distributions.
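The binomial probabilities can be tabulated directly from the formula. The sketch below is an illustration only (the helper name `binomial_pmf` is our own); it uses the two parameter choices of Figure 1 and reports where each distribution peaks.

```python
from math import comb

def binomial_pmf(k, n, p):
    """f_Y(k) = C(n, k) p^k (1 - p)^(n - k) for Y ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
for p in (0.2, 0.5):
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mode = max(range(n + 1), key=lambda k: pmf[k])  # most likely number of 1's
    print(f"p = {p}: total = {sum(pmf):.6f}, "
          f"most likely k = {mode}, f_Y({mode}) = {pmf[mode]:.4f}")
```

The location and height of the peak for each p may help with the Exercise on Figure 1.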
5 Normal Distribution

This kind of probability distribution is at once the most common in nature, among the easiest to work with mathematically, and theoretically at the heart of Probability and Statistical Inference. Among its remarkable properties is that any phenomenon occurring on a large scale tends to be governed by it. When in doubt about the nature of a distribution, assume it is (nearly) normal, and you will usually get good results!

We first define the standard normal distribution. This is a probability density of the form
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$
It has a ‘bell-shaped’ graph.

[Figure: graph of the standard normal density over [−3, 3], with peak height about 0.4 at x = 0.]

Exercise. Can you explain the factor $\frac{1}{\sqrt{2\pi}}$? (Hint: Think about the requirement that total probability should be 1.)

Note that the graph is symmetric about the y-axis. The axis of symmetry can be moved to another position m, by replacing x by x − m:
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(x-m)^2/2}.$$

[Figure: the density with axis of symmetry at x = m; the dashed line represents the standard normal distribution.]

Also, starting from the standard normal distribution, we can create one with a similar shape but bunched more tightly around the y-axis. We achieve this by replacing x with x/s for some 0 < s < 1, and dividing by s so that the total probability remains 1:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\, s}\, e^{-\frac{1}{2}(x/s)^2}.$$

By combining both kinds of changes, we reach the definition of a general normal distribution:

• A random variable X has a normal distribution with parameters µ, σ, if its density function has the form:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$
We call X a normal random variable and write X ∼ N(µ, σ).

The axis of symmetry of this distribution is determined by µ and its clustering about the axis of symmetry is controlled by σ.

Exercise. Will increasing σ make the graph more tightly bunched around the axis of symmetry? What will it do to the peak height of the graph?

In the empirical sciences, errors in observation tend to be normally distributed: they are clustered around zero, small errors are common, and very large errors are very rare. Regarding this, observe from the graph that by ±3 the density of the standard normal distribution has essentially become zero: in fact the probability of a standard normal variable taking on a value outside [−3, 3] is just 0.0027. In theoretical work, the normal distribution is the main tool in determining whether the gap between theoretical predictions and observed reality can be attributed solely to errors in observation.
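The figure 0.0027 quoted for the tails of the standard normal distribution can be reproduced numerically. This is a minimal sketch assuming the standard identity that expresses the normal cumulative distribution through the error function `erf`; the function names are our own.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma), with sigma the standard deviation."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sqrt(2 * pi) * sigma)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

outside = 1 - (normal_cdf(3) - normal_cdf(-3))
print(f"P(|Z| > 3) = {outside:.4f}")
print(f"peak height, sigma = 1: {normal_pdf(0):.4f}")
print(f"peak height, sigma = 2: {normal_pdf(0, sigma=2):.4f}")
```

Comparing the two peak heights gives a hint for the Exercise on the effect of increasing σ.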
6 Expectation

If we have some data consisting of numbers $x_i$, each occurring $f_i$ times, then the average of this data is defined to be:
$$\bar{x} = \frac{\text{Sum of all the data}}{\text{Number of data points}} = \frac{\sum_i f_i x_i}{\sum_j f_j} = \sum_i \left( \frac{f_i}{\sum_j f_j} \right) x_i.$$

Now, if we have a discrete random variable X, then $f_X(x_i)$ predicts the relative frequency with which $x_i$ will occur in a large number of observations of X, i.e., we view $f_X(x_i)$ as a prediction of $f_i / \sum_j f_j$. And then,
$$\sum_i f_X(x_i)\, x_i$$
becomes a predictor for the average $\bar{x}$ of the observations of X.

• The expectation of a discrete random variable is defined to be
$$E[X] = \sum_i f_X(x_i)\, x_i.$$

On replacing the sum by an integral we arrive at the notion of expectation of a continuous random variable:

• The expectation of a continuous random variable is defined to be
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

Expectation is also called mean and denoted by $\mu_X$ or just µ.

Exercise. Make the following calculations:
1. X has the discrete uniform distribution with range 0, . . . , n. Then E[X] = n/2.
2. X has the uniform distribution on [0, 1]. Then E[X] = 1/2.
3. X ∼ B(n, p) implies E[X] = np.
4. X ∼ N(µ, σ) implies E[X] = µ.

Some elementary properties of expectation:
1. E[c] = c, for any constant c.[3]
2. E[cX] = c E[X], for any constant c.

Suppose X : S → R is a random variable and g : R → R is any function. Then their composition g ◦ X : S → R, defined by (g ◦ X)(w) = g(X(w)), is a new random variable which we will call g(X).

Example. Let $g(x) = x^r$. Then g ◦ X is denoted $X^r$.

Suppose X is discrete with range $\{x_i\}$. Then, the range of g(X) is $\{g(x_i)\}$. Therefore we can calculate the expectation of g(X) as follows:[4]
$$E[g(X)] = \sum_i g(x_i)\, P\big(g(X) = g(x_i)\big) = \sum_i g(x_i)\, f_X(x_i).$$

[3] A constant c can be viewed as a random variable whose range consists of the single value c.
[4] Our calculation is valid if g is one-one. With slightly more effort we can make it valid for any g.
If X is continuous, one has the analogous formula:
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

Example. Let $g(x) = x^2$. Then
$$E[X^2] = \begin{cases} \displaystyle\sum_i x_i^2 f_X(x_i) & X \text{ is discrete,} \\ \displaystyle\int_{-\infty}^{\infty} x^2 f_X(x)\,dx & X \text{ is continuous.} \end{cases}$$

With these facts in hand, the following result is easy to prove.

• Let X be any random variable, and g, h two real functions. Then E[g(X) + h(X)] = E[g(X)] + E[h(X)].
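The definition of expectation for a discrete random variable, and the formula for E[g(X)], translate directly into a short computation. The sketch below is illustrative: it represents a distribution as a dictionary from values to probabilities (our own convention) and checks two of the calculations from the Exercise for one choice of n and p.

```python
from fractions import Fraction
from math import comb

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x_i) f_X(x_i), for pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

n = 6                 # illustrative choice
p = Fraction(1, 4)    # illustrative choice
uniform = {x: Fraction(1, n + 1) for x in range(n + 1)}
binomial = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

print("uniform:  E[X] =", expectation(uniform), " n/2 =", Fraction(n, 2))
print("binomial: E[X] =", expectation(binomial), " np  =", n * p)

# Additivity: E[g(X) + h(X)] = E[g(X)] + E[h(X)] with g(x) = x^2, h(x) = x.
lhs = expectation(binomial, lambda x: x**2 + x)
rhs = expectation(binomial, lambda x: x**2) + expectation(binomial)
print("additivity holds:", lhs == rhs)
```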
7 Variance

Given some data $\{x_i\}_{i=1}^n$, its average $\bar{x}$ is seen as a central value about which the data is clustered. The significance of the average is greater if the clustering is tight, less otherwise. To measure the tightness of the clustering, we use the variance of the data:
$$s^2 = \frac{1}{n} \sum_i (x_i - \bar{x})^2.$$

Variance is just the average of the squared distance from each data point to the average (of the data). Therefore, in the analogous situation where we have a random variable X, if we wish to know how close to its expectation its values are likely to be, we again define a quantity called the variance of X:
$$\mathrm{var}[X] = E[(X - E[X])^2].$$
Alternate notation for variance is $\sigma_X^2$ or just $\sigma^2$. The quantity $\sigma_X$ or σ, the square root of the variance, is called the standard deviation of X. Its advantage is that it is in the same units as X.

Exercise. Will a larger value of variance indicate tighter clustering around the mean?

Sometimes, it is convenient to use the following alternative formula for variance:

• $\mathrm{var}[X] = E[X^2] - E[X]^2$.

This is obtained as follows.
$$\begin{aligned}
\mathrm{var}[X] &= E[(X - E[X])^2] \\
&= E[X^2 - 2E[X]X + E[X]^2] \\
&= E[X^2] - 2E[\,E[X]X\,] + E[\,E[X]^2\,] \\
&= E[X^2] - 2E[X]^2 + E[X]^2 \\
&= E[X^2] - E[X]^2.
\end{aligned}$$

Elementary properties of variance:
1. var[X + a] = var[X], if a is any constant.
2. $\mathrm{var}[aX] = a^2\, \mathrm{var}[X]$, if a is any constant.

Exercise. Will it be correct to say that $\sigma_{aX} = a\,\sigma_X$ for any constant a?

Exercise. Let X be a random variable with expectation µ and standard deviation σ. Then $Z = \frac{X - \mu}{\sigma}$ has expectation 0 and standard deviation 1.

Exercise. Suppose X has the discrete uniform distribution with range {0, . . . , n}. We have seen that E[X] = n/2. Show that its variance is $\frac{1}{12} n(n + 2)$.

Exercise. Suppose X has the continuous uniform distribution with range [0, 1]. We have seen that E[X] = 1/2. Show that its variance is 1/12.
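Both formulas for the variance, and its scaling property, can be checked exactly on the discrete uniform distribution, whose variance works out to n(n + 2)/12. The following sketch is illustrative only; the helper names and the tested values of n, a and b are our own choices.

```python
from fractions import Fraction

def mean(pmf):
    """E[X] for a discrete distribution given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """var[X] = E[(X - E[X])^2]."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def variance_shortcut(pmf):
    """var[X] = E[X^2] - E[X]^2."""
    return sum(x * x * p for x, p in pmf.items()) - mean(pmf) ** 2

for n in (1, 2, 5, 10):
    uniform = {x: Fraction(1, n + 1) for x in range(n + 1)}
    assert variance(uniform) == variance_shortcut(uniform) == Fraction(n * (n + 2), 12)

# var[aX + b] = a^2 var[X]: transform the values, keep the probabilities.
a, b = -3, 7
uniform = {x: Fraction(1, 6) for x in range(6)}
shifted = {a * x + b: p for x, p in uniform.items()}
print(variance(shifted), "=", a**2, "*", variance(uniform))
```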
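Example. Suppose X ∼ B(n, p). We know E[X] = np. Therefore,
$$\begin{aligned}
E[X(X-1)] &= \sum_{k=0}^{n} k(k-1) \binom{n}{k} p^k (1-p)^{n-k} \\
&= \sum_{k=2}^{n} \frac{n!}{(k-2)!\,(n-k)!}\, p^k (1-p)^{n-k} \\
&= n(n-1) \sum_{k=2}^{n} \binom{n-2}{k-2} p^k (1-p)^{n-k} \\
&= n(n-1)p^2 \sum_{i=0}^{n-2} \binom{n-2}{i} p^i (1-p)^{(n-2)-i} \\
&= n(n-1)p^2.
\end{aligned}$$
And so,
$$\begin{aligned}
\mathrm{var}[X] &= E[X^2] - E[X]^2 \\
&= E[X(X-1)] + E[X] - E[X]^2 \\
&= n(n-1)p^2 + np(1 - np) \\
&= np(1-p).
\end{aligned}$$

Example. Suppose X ∼ N(µ, σ). We know E[X] = µ. Therefore,
$$\begin{aligned}
\mathrm{var}[X] &= \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz.
\end{aligned}$$
Now we integrate by parts:
$$\begin{aligned}
\int_{-\infty}^{\infty} z^2 e^{-z^2/2}\,dz &= \int_{-\infty}^{\infty} z \left( z e^{-z^2/2} \right) dz \\
&= \left[ -z e^{-z^2/2} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^2/2}\,dz \\
&= 0 + \sqrt{2\pi}.
\end{aligned}$$
Therefore $\mathrm{var}[X] = \sigma^2$.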
Thus, if X ∼ N(µ, σ), then $Z = \frac{X - \mu}{\sigma}$ has expectation 0 and standard deviation 1. In fact, Z is again normally distributed and hence has the standard normal distribution (it has parameters 0 and 1). Through this link, all questions about normal distributions can be converted to questions about the standard normal distribution. For instance, let X, Z be as above. Then:
$$P(X \le a) = P\left( \frac{X - \mu}{\sigma} \le \frac{a - \mu}{\sigma} \right) = P\left( Z \le \frac{a - \mu}{\sigma} \right).$$

Now we can also illustrate our earlier statement about how the normal distribution serves as a substitute for other distributions. Figure 2 compares a binomial distribution (n = 100 and p = 0.2) with the normal distribution with the same mean and variance (µ = np = 20 and $\sigma^2 = np(1-p) = 16$, so σ = 4). Generally, the normal distribution is a good approximation to the binomial distribution if n is large. One criterion that is often used is that, for a reasonable approximation, we should have both np and n(1 − p) greater than 5.

Figure 2: Normal approximation to a binomial distribution.
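The comparison in Figure 2 can be repeated numerically. The sketch below is an illustration under the same parameters (n = 100, p = 0.2); it evaluates the binomial probabilities and the matching normal density at each integer and reports the largest difference. The function names are our own.

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.2
mu = n * p                       # mean np
sigma = sqrt(n * p * (1 - p))    # standard deviation sqrt(np(1 - p))

def binomial_pmf(k):
    """Probability that a B(n, p) variable equals k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sqrt(2 * pi) * sigma)

largest_gap = max(abs(binomial_pmf(k) - normal_pdf(k)) for k in range(n + 1))
print(f"mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"binomial at 20: {binomial_pmf(20):.4f}, normal at 20: {normal_pdf(20):.4f}")
print(f"largest pointwise gap: {largest_gap:.4f}")
```

Here np = 20 and n(1 − p) = 80, so the rule of thumb quoted above is comfortably satisfied.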
Figure 3: Lognormal density functions: (A) µ = 0, σ = 1, (B) µ = 1, σ = 1, (C) µ = 0, σ = 0.4.
8 Lognormal Distribution

• If Z ∼ N(µ, σ) then $X = e^Z$ is called a lognormal random variable with parameters µ and σ.[5]

Exercise. Let Z ∼ N(µ, σ). Then
$$E[e^{tZ}] = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$

The t = 1, 2 cases of the Exercise immediately give the following:

• Let X be a lognormal variable with parameters µ and σ. Then
$$E[X] = e^{\mu + \frac{1}{2}\sigma^2}, \qquad \mathrm{var}[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}.$$

The lognormal distribution is used in Finance to model the variation of stock prices with time. Without explaining how, we show in Figure 4 an example of the kind of behaviour predicted by this model.

[5] The name comes from “The log of X is normal.”
Figure 4: A simulation of the lognormal model for variation of stock prices with time.
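The formulas for the mean and variance of a lognormal variable can be checked by numerical integration. The sketch below is illustrative and is not the simulation behind Figure 4: it approximates $E[e^{tZ}]$ for Z ∼ N(µ, σ) with a simple midpoint rule (the step count and integration width are assumptions chosen for accuracy) and compares the t = 1, 2 cases with the closed forms.

```python
from math import exp, pi, sqrt

def mgf_normal_numeric(t, mu, sigma, steps=20000, width=12.0):
    """Midpoint-rule approximation of E[exp(tZ)] for Z ~ N(mu, sigma).

    Integrates over [mu - width*sigma, mu + width*sigma]; the tails
    beyond that range are negligible for the small t used here.
    """
    lo = mu - width * sigma
    h = 2 * width * sigma / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        density = exp(-0.5 * ((z - mu) / sigma) ** 2) / (sqrt(2 * pi) * sigma)
        total += exp(t * z) * density * h
    return total

mu, sigma = 0.0, 0.4   # parameters of curve (C) in Figure 3
mean_numeric = mgf_normal_numeric(1, mu, sigma)
var_numeric = mgf_normal_numeric(2, mu, sigma) - mean_numeric**2

mean_formula = exp(mu + sigma**2 / 2)
var_formula = exp(2 * mu + 2 * sigma**2) - exp(2 * mu + sigma**2)

print(f"E[X]:   numeric {mean_numeric:.6f}, formula {mean_formula:.6f}")
print(f"var[X]: numeric {var_numeric:.6f}, formula {var_formula:.6f}")
```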
9 Bivariate Distributions

So far we have dealt with individual random variables, i.e. with models for particular features of a population. The next step is to study the relationships that exist between different features of a population. For instance, an investor might like to know the nature and strength of the connection between her portfolio and a stock market index such as the NIFTY. If the NIFTY goes up, is her portfolio likely to do the same? How much of a rise is it reasonable to expect?

This leads us to the study of pairs of random variables and the probabilities associated with their joint values. Are high values of one associated with high or low values of the other? Is there a significant connection at all?

• Let S be a sample space and X, Y : S → R two random variables. Then we say that X, Y are jointly distributed.

• Let X, Y be jointly distributed discrete random variables. The joint distribution $f_{X,Y}$ of X and Y is defined by
$$f_{X,Y}(x, y) = P(X = x, Y = y).$$

Since $f_{X,Y}$ is a function of two variables, we call it a bivariate distribution. It can be used to find any probability associated with X and Y. Let X have range $\{x_i\}$ and Y have range $\{y_j\}$. Then:
1. $P(a \le X \le b,\ c \le Y \le d) = \sum_{x_i \in [a,b]} \sum_{y_j \in [c,d]} f_{X,Y}(x_i, y_j)$.
2. $f_X(x) = \sum_j f_{X,Y}(x, y_j)$ and $f_Y(y) = \sum_i f_{X,Y}(x_i, y)$.
3. $\sum_{i,j} f_{X,Y}(x_i, y_j) = 1$.

We will be interested in various combinations of X and Y. Therefore, consider a function $g : \mathbb{R}^2 \to \mathbb{R}$. We use it to define a new random variable g(X, Y) : S → R by
$$g(X, Y)(w) = g(X(w), Y(w)).$$
The expectation of this new random variable can be obtained, as usual, by multiplying its values with their probabilities:
$$E[g(X, Y)] = \sum_{i,j} g(x_i, y_j)\, f_{X,Y}(x_i, y_j).$$
We create analogous definitions when X, Y are jointly distributed continuous random variables.

• Let X, Y be jointly distributed continuous random variables. Their joint probability density $f_{X,Y}$ is a function of two variables whose integrals give the probability of X, Y lying in any range:
$$P(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx.$$

Then the following are easy to prove:
1. $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = f_X(x)$ and $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = f_Y(y)$.
2. $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.

• Let X, Y be jointly distributed continuous random variables and $g : \mathbb{R}^2 \to \mathbb{R}$. Then the expectation of g(X, Y) is given by
$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\,dx\,dy.$$
Some important special cases:

1. Suppose X, Y are jointly distributed discrete random variables. Then
$$\begin{aligned}
E[X + Y] &= \sum_{i,j} (x_i + y_j)\, f_{X,Y}(x_i, y_j) \\
&= \sum_i x_i \Big( \sum_j f_{X,Y}(x_i, y_j) \Big) + \sum_j y_j \Big( \sum_i f_{X,Y}(x_i, y_j) \Big) \\
&= \sum_i x_i f_X(x_i) + \sum_j y_j f_Y(y_j) = E[X] + E[Y].
\end{aligned}$$
There is a similar proof when X, Y are continuous. Thus expectation always distributes over sums.

2. $E[(X - \mu_X)(Y - \mu_Y)]$ is called the covariance of X and Y and is denoted by cov[X, Y] or $\sigma_{XY}$. If large values of X tend to go with large values of Y then covariance is positive. If they go with small values of Y, covariance is negative. A zero covariance indicates that there is no such tendency in either direction, i.e. no linear association between X and Y. (See Figure 5.)

3. We have the following identity:[6]
$$\mathrm{var}[X + Y] = \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X, Y].$$

Exercise. Show that cov[X, Y] = E[XY] − E[X]E[Y].

• The correlation coefficient of X, Y is defined to be
$$\rho = \rho_{X,Y} = \frac{\mathrm{cov}[X, Y]}{\sigma_X \sigma_Y}.$$

[6] Compare this with the identity connecting the dot product and length of vectors: $||u + v||^2 = ||u||^2 + ||v||^2 + 2u \cdot v$. This motivates us to think of covariance as a kind of dot product between different random variables, with variance as squared length. Geometric analogies then lead to useful statistical insights and even proofs, such as the statement below that |ρ| ≤ 1.
Figure 5: Observed values of two jointly distributed standard normal variables with covariances 0.95, −0.6 and 0 respectively.
The advantage of the correlation coefficient is that it is not affected by the units used in the measurement. If we replace X by X′ = aX + b and Y by Y′ = cY + d, with a, c > 0, we will have $\rho_{X',Y'} = \rho_{X,Y}$.

An interesting fact is that $|\mathrm{cov}[X, Y]| \le \sigma_X \sigma_Y$.[7] This immediately implies:

• $-1 \le \rho_{X,Y} \le 1$.

Exercise. Suppose a die is tossed once. Let X take on the value 1 if the result is ≤ 4 and the value 0 otherwise. Similarly, let Y take on the value 1 if the result is even and the value 0 otherwise.

1. Show that the values of $f_{X,Y}$ are given by:

X\Y     0      1
0      1/6    1/6
1      1/3    1/3

2. Show that $\mu_X = 2/3$ and $\mu_Y = 1/2$.
3. Show that cov[X, Y] = 0.
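The die exercise can be verified by enumerating the six outcomes. This is a minimal sketch, assuming a fair die; the variable names are our own.

```python
from collections import defaultdict
from fractions import Fraction

def x_value(result):
    """X = 1 if the die shows at most 4, else 0."""
    return 1 if result <= 4 else 0

def y_value(result):
    """Y = 1 if the die shows an even number, else 0."""
    return 1 if result % 2 == 0 else 0

# Joint distribution f_{X,Y}: each face has probability 1/6.
joint = defaultdict(Fraction)
for result in range(1, 7):
    joint[(x_value(result), y_value(result))] += Fraction(1, 6)

mu_x = sum(x * p for (x, _), p in joint.items())
mu_y = sum(y * p for (_, y), p in joint.items())
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

for (x, y), p in sorted(joint.items()):
    print(f"f(X={x}, Y={y}) = {p}")
print("mu_X =", mu_x, " mu_Y =", mu_y, " cov[X, Y] =", cov)
```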
[7] This is the analogue of the geometric fact that $|u \cdot v| \le ||u||\, ||v||$.

10 Conditional Probability and Distributions

Consider a sample space S with event algebra $\mathcal{A}$ and a probability function $P : \mathcal{A} \to [0, 1]$. Let $A, B \in \mathcal{A}$ be events, with P(B) > 0. If we know B has occurred, what is the probability that A has also occurred? We reason that since we know B has occurred, in effect B has become our sample space. Therefore all probabilities of events inside B should be scaled by 1/P(B), to keep the total probability at 1. As for the occurrence of A, the points outside B are irrelevant, so our answer should be P(A ∩ B) times the correcting factor 1/P(B).

• The conditional probability of A, given B, is defined to be
$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

We apply this idea to random variables:

• Let X, Y be jointly distributed random variables (both discrete or both continuous). Then the conditional probability distribution of Y, given X = x, is defined to be
$$f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

In the discrete case, we have $f_{Y|X=x}(y) = P(Y = y | X = x)$. Note that the conditional probability distribution is a valid probability distribution in its own right. For example, in the discrete case, we have
1. $0 \le f_{Y|X=x}(y) \le 1$,
2. $\sum_i f_{Y|X=x}(y_i) = \sum_i \frac{f_{X,Y}(x, y_i)}{f_X(x)} = \frac{f_X(x)}{f_X(x)} = 1$.

Since $f_{Y|X=x}$ is a probability distribution, we can use it to define expectations.

• The conditional expectation of Y, given X = x, is defined to be
$$E[Y | X = x] = \begin{cases} \displaystyle\sum_i y_i\, f_{Y|X=x}(y_i) & X, Y \text{ are discrete,} \\ \displaystyle\int_{-\infty}^{\infty} y\, f_{Y|X=x}(y)\,dy & X, Y \text{ are continuous.} \end{cases}$$
• Note that E[Y |X = x] is a function of x. It is also denoted by $\mu_{Y|X=x}$ or $\mu_{Y|x}$, and is called the curve of regression of Y on X.

Figure 6: Regression curve for two normal variables with ρ = 0.7.

The function E[Y |X = x] creates a new random variable E[Y |X]. Below, we calculate the expectation of this new random variable in the continuous case:
$$\begin{aligned}
E[E[Y|X]] &= \int_{-\infty}^{\infty} E[Y | X = x]\, f_X(x)\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\, f_{Y|X=x}(y)\, f_X(x)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\, f_{X,Y}(x, y)\,dy\,dx \\
&= E[Y].
\end{aligned}$$

Similar calculations can be carried out in the discrete case, so that we have the general result:
$$E[Y] = E[E[Y|X]].$$

This result is useful when we deal with experiments carried out in stages, and we have information on how the results of one stage depend on those of the previous ones.
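The identity E[Y] = E[E[Y|X]] can be seen at work in a small two-stage experiment. The example below is hypothetical and chosen only for illustration: a fair die is tossed to give X, then a fair coin is tossed X times and Y counts the heads. The code builds the joint distribution, forms the conditional expectations, and compares both sides.

```python
from fractions import Fraction
from math import comb

# Hypothetical two-stage experiment: X = fair die, Y = heads in X coin tosses.
# f_{X,Y}(x, y) = P(X = x) * P(Y = y | X = x) = (1/6) * C(x, y) / 2^x.
joint = {(x, y): Fraction(1, 6) * Fraction(comb(x, y), 2**x)
         for x in range(1, 7) for y in range(x + 1)}

def marginal_x(x):
    """f_X(x), obtained by summing the joint distribution over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_expectation(x):
    """E[Y | X = x] = sum over y of y * f_{X,Y}(x, y) / f_X(x)."""
    fx = marginal_x(x)
    return sum(y * p / fx for (xi, y), p in joint.items() if xi == x)

direct = sum(y * p for (_, y), p in joint.items())
staged = sum(conditional_expectation(x) * marginal_x(x) for x in range(1, 7))

print("E[Y | X = 4] =", conditional_expectation(4))
print("E[Y] directly  =", direct)
print("E[E[Y | X]]    =", staged)
```

In this example E[Y | X = x] = x/2, so the staged calculation reduces to E[X]/2.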
11 Independence

Let X, Y be jointly distributed random variables. We consider Y to be independent of X, if knowledge of the value taken by X tells us nothing about the value taken by Y. Mathematically, this means:
$$f_{Y|X=x}(y) = f_Y(y).$$
This is easily rearranged to:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$
Note that the last expression is symmetric in X, Y.

• Jointly distributed X, Y are independent if we have the identity:
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$

Exercise. If X, Y are independent random variables and $g : \mathbb{R}^2 \to \mathbb{R}$ any function of the form g(x, y) = m(x)n(y), then E[g(X, Y)] = E[m(X)]E[n(Y)].

Exercise. If X, Y are independent, then cov[X, Y] = 0.

A common error is to think zero covariance implies independence; in fact it only indicates the possibility of independence.

Exercise. If X, Y are independent, then var[X + Y] = var[X] + var[Y].

We return again to the normal distribution. The following facts about it are the key to its wide usability:
1. If X ∼ N(µ, σ) and a ≠ 0, then aX ∼ N(aµ, |a|σ).
2. If $X_i \sim N(\mu_i, \sigma_i)$ with i = 1, . . . , n are (mutually) independent, then $\sum_i X_i \sim N(\mu, \sigma)$, where
$$\mu = \sum_i \mu_i, \qquad \sigma^2 = \sum_i \sigma_i^2.$$

Of course, the mean and variance would add up like this for any collection of independent random variables. The important feature here is the preservation of normality.
Figure 7: Chi square distributions with various degrees of freedom n (n = 1, 2, 3 in one panel; n = 10, 20 in the other).
12 Chi Square Distribution

We have remarked that measurements of continuously varying quantities tend to follow normal distributions. Therefore, a consistent method of measurement will amount to looking at a sequence of normal variables. The errors in the measurements can be treated as a sequence of independent normal variables with mean zero. By scaling units, we can take them to be standard normal.

Suppose we consider independent standard normal variables $X_1, \ldots, X_n$. If we view them as representing a sequence of errors, it is natural to ask if, in total, the errors are large or small. The sum $\sum_i X_i$ won’t do as a measure of this, because individual $X_i$ could take on large values yet cancel out to give a small value of the sum. This problem can be avoided by summing either $|X_i|$ or $X_i^2$. The latter choice is more tractable.

• Let $X_1, \ldots, X_n$ be independent standard normal variables. The variable
$$X = \sum_{i=1}^{n} X_i^2$$
is called a chi square random variable with n degrees of freedom. We write $X \sim \chi^2(n)$, and say it has the chi square distribution.

Exercise. Let X be a standard normal variable. Show that:
1. $E[X^2] = 1$.
2. $E[X^4] = 3$.
3. $\mathrm{var}[X^2] = 2$.

Exercise. Let $X \sim \chi^2(n)$. Show that E[X] = n and var[X] = 2n.

Figure 7 shows chi square distributions with different degrees of freedom. Note that as the degrees of freedom increase, the distributions look more and more normal. Figure 8 compares the $\chi^2(30)$ and $N(30, \sqrt{60})$ distributions: they are very close to each other, and so a chi square distribution $\chi^2(n)$ with n ≥ 30 is usually replaced by a normal approximation $N(n, \sqrt{2n})$.

Figure 8: Density functions of chi square and normal distributions.

• If $X \sim \chi^2(m)$ and $Y \sim \chi^2(n)$ are independent, then X + Y is a sum of m + n squared independent standard normal variables. Therefore $X + Y \sim \chi^2(m + n)$.

• The converse is also true. If X, Y are independent, $X \sim \chi^2(m)$ and $X + Y \sim \chi^2(m + n)$, then $Y \sim \chi^2(n)$.
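The definition of a chi square variable as a sum of squared standard normal variables suggests a simple simulation. The sketch below is illustrative only: the degrees of freedom, the number of draws and the random seed are arbitrary assumptions, and the simulated mean and variance will only be close to n and 2n, not equal to them.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

def chi_square_draw(n):
    """One observation of a chi square variable: sum of n squared N(0, 1) draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 5            # degrees of freedom (illustrative)
trials = 20000   # number of simulated observations (illustrative)
draws = [chi_square_draw(n) for _ in range(trials)]

sample_mean = sum(draws) / trials
sample_var = sum((d - sample_mean) ** 2 for d in draws) / trials

print(f"simulated mean     = {sample_mean:.3f}  (theory: n = {n})")
print(f"simulated variance = {sample_var:.3f}  (theory: 2n = {2 * n})")
```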
13 Random Samples

The subject of statistical inference is concerned with the task of using observed data to draw conclusions about the distribution of properties in a population. The main obstruction is that it may not be possible (or even desirable) to observe all the members of the population. We are forced to draw conclusions by observing only a fraction of the members, and these conclusions are necessarily probabilistic rather than certain.

In such situations, probability enters at two levels. Let us illustrate this by a familiar example: the use of opinion polls. An opinion poll consists of a rather small number of interviews, and from the opinions observed in these, the pollster extrapolates the likely distribution of opinions in the whole population. Thus, one may come across a poll which, after mailing forms to 10,000 people and getting responses from about half of them, concludes that 42% (of a population of 200 million) support candidate A, while 40% support B, and the remainder are undecided. These conclusions are not certain and come with error limits, say of 3%. Typically, this is done by reviewing all the possible distributions of opinions, and finding out which one is most likely to lead to the observations. This is one level at which probability is used. The other level is in evaluating the confidence with which we can declare our conclusions: the error limit of 3% is not certain either, but has a probability attached, say 90%. If we wish to have an error bar we are more certain about (say 95%), we would have to raise the bar (say, to 5%).

Sampling is the act of choosing certain members of a population, and then measuring their properties. Since the results depend on the choices and are not known beforehand, each measurement is naturally represented by a random variable. We will also make the usually reasonable assumptions that the measurements do not disturb the population, and that each choice is independent of the previous ones.

• A random sample is a finite sequence of jointly distributed random variables $X_1, \ldots, X_n$ such that
1. Each $X_i$ has the same probability density: $f_{X_i} = f_{X_j}$ for all i, j.
2. The $X_i$ are independent: $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_i f_{X_i}(x_i)$.

The common density function $f_X$ for all the $X_i$ is called the density function of the population. We also say that we are sampling from a population of type X.
14
Sample Mean and Variance
Broadly, the task of Statistics is to estimate the type of a population, as well as the associated parameters. Thus we might first try to establish that a population can be reasonably described by a binomial variable, and then estimate $n$ and $p$. The parameters of a population can be estimated in various ways, but are most commonly approached through the mean and variance. For instance, if we have estimates $\hat{\mu}$ and $\hat{\sigma}^2$ for the mean and variance of a binomial population, we could then use $\mu = np$ and $\sigma^2 = np(1-p)$ to estimate $n$ and $p$.

• Let $X_1, \dots, X_n$ be a random sample. Its sample mean is a random variable $\bar{X}$ defined by
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$

Observed values of the sample mean are used as estimates of the population mean $\mu$. Therefore, we need to be reassured that, on average at least, we will see the right value:
$$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu.$$

Moreover, we would like the variance of $\bar{X}$ to be small so that its values are more tightly clustered around $\mu$. We have
$$\operatorname{var}[\bar{X}] = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{var}[X_i] = \frac{\sigma^2}{n},$$
where $\sigma^2$ is the population variance. Thus the variance of the sample mean goes to zero as the sample size increases: the sample mean becomes a more reliable estimator of the population mean. In summary:
$$E[\bar{X}] = \mu, \qquad \operatorname{var}[\bar{X}] = \frac{\sigma^2}{n}.$$
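The two identities for the sample mean can be checked empirically. The following is a minimal Python sketch, assuming a binomial population with n = 10 and p = 0.2 (so that µ = 2 and σ² = 1.6); the function names, the seed and the simulation sizes are illustrative choices of ours, not part of the text.

```python
import random
import statistics

def binomial_draw(rng, n, p):
    """One observation from a binomial(n, p) population (sum of n Bernoulli trials)."""
    return sum(1 for _ in range(n) if rng.random() < p)

def simulate_sample_means(num_samples, sample_size, n=10, p=0.2, seed=0):
    """Draw num_samples random samples and return the list of their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [binomial_draw(rng, n, p) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means

sample_size = 50
means = simulate_sample_means(num_samples=1000, sample_size=sample_size)

mu = 10 * 0.2               # population mean np
sigma2 = 10 * 0.2 * 0.8     # population variance np(1 - p)

# The average of the observed means should be close to mu,
# and their variance close to sigma^2 / (sample size).
print("average of sample means:", statistics.fmean(means), "vs mu =", mu)
print("variance of sample means:", statistics.variance(means),
      "vs sigma^2/n =", sigma2 / sample_size)
```

Increasing `sample_size` should shrink the observed variance of the means roughly in proportion, which is the content of var[X̄] = σ²/n.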
• Let $X_1, \dots, X_n$ be a random sample. Its sample variance is a random variable $S^2$ defined by
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
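The sample variance also has the right average. To see this, first expand the sum of squares:
$$\sum_i (X_i - \bar{X})^2 = \sum_i \left(X_i^2 + \bar{X}^2 - 2X_i\bar{X}\right) = \sum_i X_i^2 - n\bar{X}^2.$$
Taking expectations, and using $E[Y^2] = \operatorname{var}[Y] + E[Y]^2$ for both $X_i$ and $\bar{X}$:
$$\begin{aligned}
E\Big[\sum_i (X_i - \bar{X})^2\Big] &= \sum_i E[X_i^2] - nE[\bar{X}^2] \\
&= \sum_i (\sigma^2 + \mu^2) - n\left(\operatorname{var}[\bar{X}] + E[\bar{X}]^2\right) \\
&= n(\sigma^2 + \mu^2) - \sigma^2 - n\mu^2 \\
&= (n-1)\sigma^2.
\end{aligned}$$
Hence
$$E[S^2] = \sigma^2.$$
A rather longer calculation, which we do not include, shows that for a normal population:
$$\operatorname{var}[S^2] = \frac{2\sigma^4}{n-1}.$$
Again, we see that the sample variance has the right average ($\sigma^2$) and clusters more tightly around it if we use larger samples.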
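The role of the divisor n − 1 can be illustrated numerically. The sketch below assumes a normal population with σ = 2 and small samples of size 5 (both arbitrary choices of ours) and compares the average of S² with the average of the estimate that divides by n instead.

```python
import random
import statistics

def average_variance_estimates(num_samples, sample_size, sigma=2.0, seed=1):
    """Average, over many normal samples, the (n-1)-divisor and n-divisor variance estimates."""
    rng = random.Random(seed)
    unbiased, biased = [], []
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, sigma) for _ in range(sample_size)]
        unbiased.append(statistics.variance(sample))    # divides by n - 1: this is S^2
        biased.append(statistics.pvariance(sample))     # divides by n
    return statistics.fmean(unbiased), statistics.fmean(biased)

n = 5
avg_s2, avg_biased = average_variance_estimates(num_samples=20000, sample_size=n)

# With sigma^2 = 4 we expect the first average near 4
# and the second near (n - 1)/n * 4 = 3.2.
print("average of S^2:", avg_s2)
print("average of n-divisor estimate:", avg_biased)
```

The n-divisor estimate is systematically too small by the factor (n − 1)/n, which is exactly what the calculation of E[S²] predicts.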
15
Large Sample Approximations
Chebyshev's Inequality. Let $X$ be any random variable, with mean $\mu$ and standard deviation $\sigma$. Then for any $r > 0$,
$$P\big(|X - \mu| \ge r\sigma\big) \le \frac{1}{r^2}.$$
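Proof. We give the proof for a discrete random variable $X$ with range $\{x_i\}$.
$$\begin{aligned}
\sigma^2 &= \sum_i (x_i - \mu)^2 f_X(x_i) \\
&\ge \sum_{i:\,|x_i - \mu| \ge r\sigma} (x_i - \mu)^2 f_X(x_i) \\
&\ge r^2\sigma^2 \sum_{i:\,|x_i - \mu| \ge r\sigma} f_X(x_i) \\
&= r^2\sigma^2\, P\big(|X - \mu| \ge r\sigma\big).
\end{aligned}$$
Rearranging gives the desired inequality. □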
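Chebyshev's inequality can be compared against an exact tail probability. The sketch below is an illustration under our own choice of population, a binomial variable with n = 10 and p = 0.2, for which the tail can be summed directly from the probability function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) variable."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def chebyshev_check(n, p, r):
    """Return (exact P(|X - mu| >= r*sigma), Chebyshev bound 1/r^2)."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    tail = sum(binomial_pmf(k, n, p)
               for k in range(n + 1)
               if abs(k - mu) >= r * sigma)
    return tail, 1.0 / r**2

for r in (1.0, 1.5, 2.0, 3.0):
    tail, bound = chebyshev_check(10, 0.2, r)
    print(f"r = {r}: exact tail {tail:.4f}, Chebyshev bound {bound:.4f}")
```

The bound always holds but is typically far from tight, since it uses nothing about the distribution beyond its mean and standard deviation.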
Weak Law of Large Numbers. Let $X_1, \dots, X_n$ be a random sample from a population with mean $\mu$ and standard deviation $\sigma$. Then for any $c > 0$:
$$P\big(|\bar{X} - \mu| \ge c\big) \le \frac{\sigma^2}{nc^2}.$$
In particular, the probability diminishes to zero as $n \to \infty$.

Proof. Apply Chebyshev's inequality to $\bar{X}$, noting that it has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$:
$$P\Big(|\bar{X} - \mu| \ge \frac{r\sigma}{\sqrt{n}}\Big) \le \frac{1}{r^2}.$$
Substituting $c = r\sigma/\sqrt{n}$ gives the Weak Law. □

The Weak Law allows us to make estimates of the mean of arbitrary accuracy and certainty, by just increasing the sample size.

A remarkable feature of the sample mean is that as the sample size increases, its distribution looks more and more like a normal distribution, regardless of the population distribution! We give an illustration of this in Figure 9.

Central Limit Theorem. Let $X_1, \dots, X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2$. Then the distribution of the standardized sample mean approaches the standard normal distribution as $n$ increases:
$$\lim_{n\to\infty} \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

Figure 9: The histogram represents the means of 5000 samples of size 100, from a binomial population with parameters $n = 10$ and $p = 0.2$. It is matched against a normal distribution with $\mu = 2$ and $\sigma = \sqrt{0.016}$.

When we work with large samples (a commonly used definition of "large" is $n \ge 30$), the Central Limit Theorem allows us to use the normal approximation to make sharper error estimates (relative to the Weak Law). We will not prove this result, but we shall use it frequently.

Example. Suppose we are trying to estimate the mean of a population whose variance is known to be $\sigma^2 = 1.6$ (as in Figure 9). Using samples of size 100, we want error bars that are 90% certain. Then the Weak Law suggests we find a $c$ such that
$$\frac{1.6}{100c^2} = 0.1, \quad\text{or}\quad c = 0.4.$$
Thus we are 90% sure that any particular observed mean is within 0.4 of the actual population mean. If we consider the particular data that led to the histogram of Figure 9, we find that in fact 90% of the observed sample means are within 0.21 of the actual mean, which is 2. Additionally, 99.9% of the observed means are within 0.5 of 2. So the Weak Law estimates are correct, but inefficient.

On the other hand, taking 100 as a large sample size, we see from the Central Limit Theorem that
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = 7.905(\bar{X} - \mu)$$
should be approximately a standard normal variable (here $\sqrt{100}/\sqrt{1.6} = 7.905$). Therefore we are 90% sure that its observed values are within 1.65 of 0. So we are 90% sure that any observed sample mean is within $1.65/7.905 = 0.209$ of the actual population mean. Similarly, we are 99.9% sure that any observed mean is within 0.5 of the actual. Thus, the Central Limit Theorem provides estimates which are close to reality (in this case, identical). □
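The two error bars in this example can be recomputed directly. The following sketch assumes the values of the example (σ² = 1.6, samples of size 100, 90% certainty); the helper names are ours. The Weak Law half-width solves σ²/(nc²) = 0.1, and the Central Limit Theorem half-width uses the standard normal quantile from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def weak_law_halfwidth(sigma2, n, confidence):
    """Solve sigma^2 / (n c^2) = 1 - confidence for c (Chebyshev-based bound)."""
    return math.sqrt(sigma2 / (n * (1.0 - confidence)))

def clt_halfwidth(sigma2, n, confidence):
    """Solve P(|Z| < c sqrt(n) / sigma) = confidence for c, with Z standard normal."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * math.sqrt(sigma2 / n)

c_weak = weak_law_halfwidth(1.6, 100, 0.90)
c_clt = clt_halfwidth(1.6, 100, 0.90)
print("Weak Law half-width:", round(c_weak, 3))
print("CLT half-width:     ", round(c_clt, 3))
```

The normal approximation gives a half-width of roughly half the Chebyshev-based one, in line with the 0.21 observed in the simulated data.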
16
Point Estimation
We now start our enquiry into the general process of estimating statistical parameters. So far we have become familiar with estimating mean and variance via sample mean and sample variance.

• A statistic of a random sample $X_1, \dots, X_n$ is any function $\Theta = f(X_1, \dots, X_n)$.

Thus a statistic is itself a random variable. $\bar{X}$ and $S^2$ are examples of statistics. The probability distribution of a statistic is called its sampling distribution.

• A statistic $\Theta$ is called an estimator of a population parameter $\theta$ if values of the statistic are used as estimates of $\theta$.

$\bar{X}$ is an estimator of the population mean, and $S^2$ is an estimator of the population variance.

• If a statistic $\Theta$ is an estimator of a population parameter $\theta$, then any observed value of $\Theta$ is denoted $\hat{\theta}$ and called a point estimate of $\theta$.

According to this convention, observed values of $\bar{X}$ should be denoted by $\hat{\mu}$, and of $S^2$ by $\hat{\sigma}^2$. We will also denote them by $\bar{x}$ and $s^2$, respectively.

• A statistic $\Theta$ is an unbiased estimator of a population parameter $\theta$ if $E[\Theta] = \theta$.

$\bar{X}$ is an unbiased estimator of the population mean, and $S^2$ is an unbiased estimator of the population variance.
Exercise. Suppose we use $S = \sqrt{S^2}$ as an estimator of the standard deviation of the population. Is it an unbiased estimator?

In general, if $\Theta$ is an estimator of $\theta$, the gap $E[\Theta] - \theta$ is called its bias.

• $\Theta$ is an asymptotically unbiased estimator of $\theta$ if the bias goes to 0 as the sample size goes to infinity.

Exercise. Show the following statistics are asymptotically unbiased estimators:
1. $\frac{1}{n}\sum_i (X_i - \bar{X})^2$, of $\sigma^2$.
2. $S = \sqrt{S^2}$, of $\sigma$.

• $\Theta$ is a consistent estimator of $\theta$ if for any $c > 0$,
$$P\big(|\Theta - \theta| \ge c\big) \to 0 \quad\text{as } n \to \infty,$$
where $n$ is the sample size.

The Weak Law of Large Numbers states that $\bar{X}$ is a consistent estimator of $\mu$. The proof of the Weak Law can be trivially generalized to yield the following result:

Theorem. Let $\Theta$ be an unbiased estimator for $\theta$, such that $\operatorname{var}[\Theta] \to 0$ as the sample size $n \to \infty$. Then $\Theta$ is a consistent estimator of $\theta$. □

It follows that $S^2$ is a consistent estimator of $\sigma^2$.
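Consistency can be observed in a simulation. The sketch below assumes a standard normal population and the tolerance c = 0.25 (both our choices) and estimates P(|S² − σ²| ≥ c) for increasing sample sizes.

```python
import random
import statistics

def miss_fraction(sample_size, c, num_samples=400, sigma=1.0, seed=2):
    """Estimate P(|S^2 - sigma^2| >= c) for normal samples of the given size."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, sigma) for _ in range(sample_size)]
        if abs(statistics.variance(sample) - sigma**2) >= c:
            misses += 1
    return misses / num_samples

# Estimated miss probabilities for sample sizes 10, 100 and 1000.
fractions = [miss_fraction(n, c=0.25) for n in (10, 100, 1000)]
print(fractions)
```

The estimated probabilities should fall towards zero as the sample size grows, which is what consistency of S² asserts for every fixed c > 0.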
17
Method of Moments
Having created the abstract notion of an estimator of a population parameter, we now face the problem of creating such estimators. The Method of Moments is based on the idea that parameters of a population can be recovered from its moments, and for the moments there are certain obvious estimators.
• Given a random variable $X$, its $r$th moment is defined to be $\mu_r = E[X^r]$.

Note that $\mu_X = \mu_1$ and $\sigma_X^2 = \mu_2 - \mu_1^2$.

• Consider a population with probability distribution $f_X$. Then the statistic
$$M_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$$
is the Method of Moments estimator for the population moment $\mu_r$.

We have $M_1 = \bar{X}$. Further, the identity $\sigma_X^2 = \mu_2 - \mu_1^2$ leads to the use of $M_2 - M_1^2$ as an estimator of $\sigma^2$.

Exercise. Show that $M_2 - M_1^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$.
The Method of Moments is easy to implement. Its drawback is that it comes with no guarantees about the performance of the estimators that it creates.
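As an illustration of how easy the method is to implement, the sketch below estimates both parameters of a binomial population from M₁ and M₂ − M₁², using µ = np and σ² = np(1 − p). The simulated population (n = 10, p = 0.2), the seed and the function names are assumptions made for this example only.

```python
import random

def sample_moment(data, r):
    """Method of Moments estimator M_r = (1/n) * sum of x_i^r."""
    return sum(x**r for x in data) / len(data)

def binomial_mom_estimates(data):
    """Estimate (n, p) of a binomial population from M1 and M2 - M1^2."""
    m1 = sample_moment(data, 1)
    var_hat = sample_moment(data, 2) - m1**2
    p_hat = 1.0 - var_hat / m1      # from sigma^2 / mu = 1 - p
    n_hat = m1 / p_hat              # from mu = n p
    return n_hat, p_hat

# Simulated observations from a binomial(10, 0.2) population.
rng = random.Random(3)
data = [sum(1 for _ in range(10) if rng.random() < 0.2) for _ in range(20000)]

n_hat, p_hat = binomial_mom_estimates(data)
print("estimated n:", n_hat, "estimated p:", p_hat)
```

Even with 20000 observations the estimate of n can be noticeably off, which illustrates the lack of performance guarantees mentioned above.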
18
Maximum Likelihood Estimators
Maximum Likelihood Estimation (MLE) will be our main tool for generating estimators. The reason lies in the following guarantee:

Fact. Under suitable regularity conditions, Maximum Likelihood Estimators are asymptotically unbiased and asymptotically of minimum variance.

Thus, for large samples, they are essentially the best estimators. Before describing the method, we ask the reader to consider the following graph and question:
[Figure: graphs of two conjectured probability densities for a population, plotted over the range −2 to 4.]
Question. The graphs above represent two conjectured distributions for a population. If we pick a random member of the population and it turns out to be 2, which of these conjectures appears more plausible? What if the random member turns out to be −1? 1?

• Consider a population whose probability density $f(\,\cdot\,; t)$ depends on a parameter $t$. We draw a random sample $X_1, \dots, X_n$ from this population. The corresponding likelihood function is defined to be
$$L(x_1, \dots, x_n; t) = \prod_i f(x_i; t),$$
where $x_i$ is a value of $X_i$. Thus $L$ is the joint probability density of the random sample.

• Suppose we observe the values $x_1, \dots, x_n$ of a random sample $X_1, \dots, X_n$. The corresponding maximum likelihood estimate of $t$ is the value $\hat{t}$ that maximizes $L(x_1, \dots, x_n; t)$.

Example. Suppose we have a binomial population with $n = 1$ and we wish to estimate $p$ using a sample of size $m$. The corresponding likelihood function is:
$$\begin{aligned}
L(x_1, \dots, x_m; p) &= \prod_i \binom{1}{x_i} p^{x_i}(1-p)^{1-x_i} \\
&= \Big(\prod_{i:\,x_i = 1} p\Big)\Big(\prod_{i:\,x_i = 0} (1-p)\Big) \\
&= p^{\sum_i x_i}(1-p)^{m - \sum_i x_i} \\
&= p^{m\bar{x}}(1-p)^{m(1-\bar{x})}.
\end{aligned}$$
To maximize $L$, we use Calculus:
$$\frac{dL}{dp} = 0 \implies p = \bar{x}.$$
So the sample mean $\bar{X}$ is the ML Estimator of $p$. □
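The conclusion p = x̄ can be confirmed by brute force. The sketch below uses ten hypothetical 0/1 observations (not from the text) and maximizes the logarithm of the likelihood over a grid of values of p; since the logarithm is increasing, this has the same maximizer as L itself.

```python
import math

def bernoulli_log_likelihood(xs, p):
    """ln L(x_1..x_m; p) = (sum x_i) ln p + (m - sum x_i) ln(1 - p)."""
    successes = sum(xs)
    return successes * math.log(p) + (len(xs) - successes) * math.log(1.0 - p)

def grid_mle(xs, steps=1000):
    """Maximize the log-likelihood over the grid p = 1/steps, ..., (steps-1)/steps."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda p: bernoulli_log_likelihood(xs, p))

xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]    # hypothetical observations
p_hat = grid_mle(xs)
x_bar = sum(xs) / len(xs)
print("grid maximizer:", p_hat, "sample mean:", x_bar)
```

A grid search is crude compared with the calculus argument, but it needs no differentiation and extends to likelihoods with no closed-form maximizer.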
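Example. Suppose we have a normal population whose $\sigma$ is known, and we want the maximum likelihood estimator of $\mu$. The likelihood function is
$$L(x_1, \dots, x_n; \mu) = \prod_i \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2} = \frac{1}{(\sigma\sqrt{2\pi})^n}\, e^{-\frac{1}{2}\sum_i \left(\frac{x_i - \mu}{\sigma}\right)^2}.$$
Matters simplify if we take logarithms:
$$\ln L(x_1, \dots, x_n; \mu) = -n\ln(\sigma\sqrt{2\pi}) - \frac{1}{2}\sum_i \left(\frac{x_i - \mu}{\sigma}\right)^2.$$
We differentiate with respect to $\mu$:
$$0 = \frac{d(\ln L)}{d\mu} = \sum_i \frac{x_i - \mu}{\sigma^2} \implies \mu = \bar{x}.$$
So the sample mean is the ML Estimator of $\mu$. □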
Exercise. Suppose we have a normal population whose $\mu$ is known. Show that the ML Estimator of $\sigma^2$ is $\frac{1}{n}\sum_i (X_i - \mu)^2$.

The MLE technique can also be applied when many parameters $t_1, \dots, t_k$ have to be estimated. Then the likelihood function has the form $L(x_1, \dots, x_n; t_1, \dots, t_k)$ and we again seek the values $\hat{t}_1, \dots, \hat{t}_k$ which maximize it.

Exercise. Suppose we have a normal population and neither parameter is known. Find their ML Estimators.
19
Sampling Distributions
In earlier sections, we have explored the characteristics of the sample mean and variance. We could give their expectation and variance, and for large samples the Central Limit Theorem gives a close approximation to the actual distribution of $\bar{X}$. Other than this, we do not have an explicit description of the sampling distributions of $\bar{X}$ and $S^2$. However, if the population is known to be normal, we can do better.

Theorem. Let $X_1, \dots, X_n$ be a random sample from a normal population. Then
1. $\bar{X}$ and $S^2$ are independent.
2. $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$.
3. $(n-1)\dfrac{S^2}{\sigma^2} \sim \chi^2(n-1)$.

Proof. We skip the proof of the first item: it is hard and no elementary Statistics book contains it! The next item is trivial, since linear combinations of independent normal variables are again normal. For the last claim, we start with the identity
$$(n-1)S^2 = \sum_i X_i^2 - n\bar{X}^2 = \sum_i (X_i - \mu)^2 - n(\bar{X} - \mu)^2.$$
Hence,
$$\sum_i \left(\frac{X_i - \mu}{\sigma}\right)^2 = (n-1)\frac{S^2}{\sigma^2} + \left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2.$$
Let us recall the following fact: If $Z, Y$ are independent, $Z \sim \chi^2(k)$ and $Y + Z \sim \chi^2(k+m)$, then $Y \sim \chi^2(m)$. The two terms on the right are independent by the first item, the last term is the square of a standard normal variable, and the left side is a sum of $n$ such squares. So we can apply this fact to the last equation with $k = 1$ and $k + m = n$, to get $(n-1)\dfrac{S^2}{\sigma^2} \sim \chi^2(n-1)$. □
Exercise. Let $S^2$ be the sample variance of a sample of size $n$ from an $N(\mu, \sigma)$ population. Show that
$$\operatorname{var}[S^2] = \frac{2\sigma^4}{n-1}.$$
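The statements about a normal population can be probed by simulation. The sketch below assumes samples of size 5 from a standard normal population (our choice) and checks that (n − 1)S²/σ² has mean and variance close to those of a χ²(n − 1) variable, namely n − 1 and 2(n − 1), and that the correlation between the sample mean and sample variance is close to zero. A near-zero correlation is consistent with independence but does not by itself prove it.

```python
import math
import random
import statistics

def simulate_normal_samples(num_samples, n, mu=0.0, sigma=1.0, seed=4):
    """Return lists of sample means and of scaled variances (n-1) S^2 / sigma^2."""
    rng = random.Random(seed)
    means, scaled_vars = [], []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        scaled_vars.append((n - 1) * statistics.variance(sample) / sigma**2)
    return means, scaled_vars

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx)**2 for x in xs)
    var_y = sum((y - my)**2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

n = 5
means, scaled_vars = simulate_normal_samples(20000, n)
chi2_mean = statistics.fmean(scaled_vars)       # expected to be near n - 1 = 4
chi2_var = statistics.variance(scaled_vars)     # expected to be near 2(n - 1) = 8
corr = correlation(means, scaled_vars)          # expected to be near 0
print(chi2_mean, chi2_var, corr)
```

Dividing the observed variance of the scaled values by (n − 1)² and multiplying by σ⁴ gives an empirical value for var[S²] that can be compared with 2σ⁴/(n − 1).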
Figure 10: This diagram illustrates the independence of $\bar{X}$ and $S^2$ for a normal population. It shows $(\bar{x}, s^2)$ pairs for a thousand samples (each of size 100) from a standard normal population.

Figure 11: The graphs compare t distributions with the standard normal distribution. We have A: t(1), B: t(10), C: N(0, 1).
t Distribution

• Consider independent random variables $Y, Z$, where $Y \sim \chi^2(n)$ and $Z \sim N(0, 1)$. Then the random variable
$$T = \frac{Z}{\sqrt{Y/n}}$$
is said to have a t distribution with $n$ degrees of freedom. We write $T \sim t(n)$.

This definition is motivated by the following example:

Example. Suppose we have a random sample $X_1, \dots, X_n$ from a normal population, and we seek to estimate $\mu$. Then we have:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
We can't use $Z$ directly because $\sigma$ is also unknown. One option is to replace $\sigma$ by its estimator $S$, but then the distribution is no longer normal. However, we recall that
$$Y = (n-1)\frac{S^2}{\sigma^2} \sim \chi^2(n-1).$$
Further, since $\bar{X}, S^2$ are independent, so are $Y$ and $Z$. Therefore
$$\frac{Z}{\sqrt{Y/(n-1)}} \sim t(n-1).$$
To round off the example, note that
$$\frac{Z}{\sqrt{Y/(n-1)}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}.$$ □

Figure 11 shows that the t distribution is only needed for small sample sizes. Beyond $n = 30$, the t distributions are essentially indistinguishable from the standard normal one, and so we can directly use the observed value $s$ (of $S$) for $\sigma$ and work with the standard normal distribution.
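The effect of replacing σ by S can be seen in a simulation. The sketch below assumes standard normal populations and compares how often |T| exceeds the standard normal 97.5% point (about 1.96) for samples of size 5 and of size 100; the sample sizes, seed and function names are our own choices.

```python
import math
import random
import statistics
from statistics import NormalDist

def t_statistic(sample, mu):
    """T = (x_bar - mu) / (s / sqrt(n)) for one observed sample."""
    n = len(sample)
    return (statistics.fmean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))

def tail_fraction(num_samples, n, cutoff, seed=5):
    """Fraction of simulated T values with |T| > cutoff, for standard normal samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample, 0.0)) > cutoff:
            hits += 1
    return hits / num_samples

z = NormalDist().inv_cdf(0.975)                    # about 1.96
small = tail_fraction(20000, n=5, cutoff=z)        # T ~ t(4): heavier tails than normal
large = tail_fraction(2000, n=100, cutoff=z)       # T ~ t(99): close to normal
print("n = 5:", small, " n = 100:", large)
```

If T were standard normal, both fractions would be near 0.05; for the small sample the fraction is markedly larger, which is why the t distribution is needed there.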
Figure 12: The first graph shows various F distributions (A: F(3, 10), B: F(10, 10), C: F(50, 50)). The second graph compares the F(200, 200) distribution (A) and the normal distribution with the same mean and standard deviation (B).
F Distribution

• Suppose $X_1, X_2$ are independent random variables, and $X_1 \sim \chi^2(n_1)$, $X_2 \sim \chi^2(n_2)$. Then the random variable
$$F = \frac{X_1/n_1}{X_2/n_2}$$
is said to have an F distribution with $n_1, n_2$ degrees of freedom. We write $F \sim F(n_1, n_2)$.

Example. Suppose we wish to compare the variances of two normal populations. We could do this by estimating their ratio. Therefore, for $i = 1, 2$, let $S_i^2$ be the sample variance of a random sample of size $n_i$ from the $i$th population. Then
$$(n_i - 1)\frac{S_i^2}{\sigma_i^2} \sim \chi^2(n_i - 1).$$
Therefore, on taking the ratio and simplifying,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1).$$
Given observations of $S_i$, we can use this relation to estimate $\sigma_1/\sigma_2$. □

From Figure 12 we see that F distributions also become more normal with larger sample sizes. However, they do so much more slowly than $\chi^2$ or t distributions.
You may have noted that we have not provided the density functions for the $\chi^2$, t and F distributions. Their density functions are not very complicated, but they are rarely used directly. What we really need are their integrals (to get probabilities) and these do not have closed form formulae! Instead, we have to look up the values from tables (provided at the end of every Statistics book) or use computational software such as Mathematica.

Remark. From the definition of the F distribution, it is clear that $F \sim F(m_1, m_2)$ implies $1/F \sim F(m_2, m_1)$. So tables for the F distribution are printed only for the case $m_1 \le m_2$.
20
Confidence Intervals
An interval estimate for a population parameter $t$ is an interval $[\hat{t}_1, \hat{t}_2]$ where $t$ is predicted to lie. The numbers $\hat{t}_1, \hat{t}_2$ are obtained from a random sample: so they are values of statistics $T_1$ and $T_2$. Let $P(T_1 < t < T_2) = 1 - \alpha$. Then $1 - \alpha$ is called the degree of confidence and $\hat{t}_1, \hat{t}_2$ are the lower and upper confidence limits. The interval $(\hat{t}_1, \hat{t}_2)$ is called a $(1-\alpha)100\%$ confidence interval.

To obtain a confidence interval for a parameter $t$ we need to create a statistic $T$ in which $t$ is the only unknown. If $T$ is of one of the standard types (normal, chi square, etc.) we can use tables or statistical software to find $t_1, t_2$ such that
$$P(t_1 < T < t_2) = 1 - \alpha.$$
By rearranging $t_1 < T < t_2$ in the form $T_1 < t < T_2$ we obtain a $(1-\alpha)100\%$ confidence interval for $t$.
Confidence Interval For Mean

Example. Suppose we need to estimate the mean $\mu$ of a normal population whose standard deviation $\sigma$ is known. If we have a random sample of size $n$, then
$$\bar{X} \sim N(\mu, \sigma/\sqrt{n}).$$
If we desire a degree of confidence $1 - \alpha$, we need a $c$ such that
$$P\big(|\bar{X} - \mu| < c\big) = 1 - \alpha.$$
Then $(\bar{x} - c, \bar{x} + c)$ will be a confidence interval for $\mu$ with degree of confidence $1 - \alpha$.

Define $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$. This is a standard normal variable. Also,
$$P\big(|\bar{X} - \mu| < c\big) = P\Big(|Z| < \frac{c\sqrt{n}}{\sigma}\Big).$$
The problem can be finished off by locating $z = \dfrac{c\sqrt{n}}{\sigma}$ such that (see Figure 13):
$$P(Z > z) = \frac{\alpha}{2}.$$

Figure 13: The point $z$ such that $P(Z > z) = \alpha/2$ also satisfies $P(|Z| < z) = 1 - \alpha$.

For example, suppose we have $n = 150$, $\sigma = 6.2$ and $\alpha = 0.01$. Then we have $z = 2.575$, since
$$\int_{2.575}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 0.005.$$
From $z = \dfrac{c\sqrt{n}}{\sigma}$ we obtain
$$c = \frac{2.575 \times 6.2}{\sqrt{150}} = 1.30.$$
Hence $(\bar{x} - 1.30, \bar{x} + 1.30)$ is a 99% confidence interval for $\mu$. □

Figure 14: A one-sided confidence interval.
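Instead of a table lookup, the value of z can be obtained from software. The sketch below uses `statistics.NormalDist` from the Python standard library with the numbers of the example (n = 150, σ = 6.2, α = 0.01); the function name is ours.

```python
import math
from statistics import NormalDist

def z_interval_halfwidth(sigma, n, alpha):
    """Return (z, c) with P(Z > z) = alpha/2 and c = z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z, z * sigma / math.sqrt(n)

z, c = z_interval_halfwidth(sigma=6.2, n=150, alpha=0.01)
print("z =", round(z, 3), " c =", round(c, 2))
```

The computed quantile is about 2.576, marginally different from the tabulated 2.575, and the resulting half-width agrees with 1.30 to two decimals.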
Let us note some features of this example:
1. We found a confidence interval which was symmetric about the point estimate $\bar{x}$. We can also find asymmetric ones. In particular, we have the one-sided confidence interval $\big(-\infty,\ \bar{x} + z\frac{\sigma}{\sqrt{n}}\big)$, where $z$ is chosen to satisfy $P(Z > z) = \alpha$. (See Figure 14.)
2. We assumed a normal population. This can be bypassed with the help of the Central Limit Theorem and a large sample size $n$, for then $\bar{X}$ will be approximately normally distributed.
3. We assumed $\sigma$ was somehow known. If not, we can work with the t distribution $T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$. Further, if $n \ge 30$, we can just use the observed value $s$ of $S$ for $\sigma$.

The next example illustrates the last item.

Example. Suppose we have a small random sample of size $n$ from a normal population, $N(\mu, \sigma)$. Then
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$$
To obtain a confidence interval for $\mu$, with degree of confidence $1 - \alpha$, we seek a value $t$ such that $P(|T| \le t) = 1 - \alpha$. For then, the interval
$$\Big\{\mu : -t \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le t\Big\} = \Big[\bar{x} - \frac{s}{\sqrt{n}}\,t,\ \bar{x} + \frac{s}{\sqrt{n}}\,t\Big]$$
is the required confidence interval for $\mu$.

For instance, suppose the random sample consists of the following numbers: 2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0 and 1.9. Then we have
$$\bar{x} = 2.28 \quad\text{and}\quad s^2 = 0.391.$$
Also, $T \sim t(11)$. If we want a confidence level of 95%, then we can take $t = 2.201$ since $P(T \ge 2.201) = 0.025 = \alpha/2$ (and the density of $T$ is symmetric about the y-axis). So the 95% confidence interval is $(1.89, 2.68)$. □
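The arithmetic of this example can be reproduced in a few lines. The sketch below takes the twelve observations and the table value t = 2.201 from the text; the function name is ours. It keeps full precision throughout: computed this way s² is about 0.391, whereas the shortcut Σxᵢ² − n x̄² is sensitive to rounding and gives about 0.407 if x̄ is first rounded to 2.28.

```python
import math
import statistics

data = [2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0, 1.9]
T_CRITICAL = 2.201   # table value with P(T >= t) = 0.025 for t(11), as quoted in the text

def t_interval(sample, t_critical):
    """Confidence interval x_bar -/+ t * s / sqrt(n) for the population mean."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)            # square root of the sample variance S^2
    half = t_critical * s / math.sqrt(n)
    return x_bar - half, x_bar + half

x_bar = statistics.fmean(data)
s2 = statistics.variance(data)
lower, upper = t_interval(data, T_CRITICAL)
print(round(x_bar, 2), round(s2, 3), round(lower, 2), round(upper, 2))
```

The critical value is passed in as a parameter because the Python standard library has no t distribution; any statistical table or package can supply it.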
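Confidence Interval For Variance

Consider the task of estimating the variance $\sigma^2$ of a normal population. If we have a random sample of size $n$, then
$$(n-1)\frac{S^2}{\sigma^2} \sim \chi^2(n-1).$$
We take $x_1, x_2$ such that
$$P\Big((n-1)\frac{S^2}{\sigma^2} \ge x_1\Big) = 1 - \frac{\alpha}{2} \quad\text{and}\quad P\Big((n-1)\frac{S^2}{\sigma^2} \ge x_2\Big) = \frac{\alpha}{2},$$
so that $P\Big(x_1 < (n-1)\dfrac{S^2}{\sigma^2} < x_2\Big) = 1 - \alpha$. Therefore,
$$P\Big(\frac{n-1}{x_2}S^2 < \sigma^2 < \frac{n-1}{x_1}S^2\Big) = 1 - \alpha.$$
Thus, $\Big(\dfrac{n-1}{x_2}s^2,\ \dfrac{n-1}{x_1}s^2\Big)$ is a $(1-\alpha)100\%$ confidence interval for $\sigma^2$.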
Figure 15: Choosing a confidence interval for variance, using a chi square distribution.
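As a worked illustration of the interval for the variance, the sketch below reuses the twelve observations of the earlier t example (the text itself does not compute this interval, so this is our own illustration). The chi square points 3.816 and 21.920 for 11 degrees of freedom and α = 0.05 are taken from standard tables and should be treated as assumed inputs.

```python
import statistics

def variance_interval(sample, chi2_lower, chi2_upper):
    """Interval ((n-1) s^2 / x2, (n-1) s^2 / x1) with x1 = chi2_lower, x2 = chi2_upper."""
    n = len(sample)
    scaled = (n - 1) * statistics.variance(sample)
    return scaled / chi2_upper, scaled / chi2_lower

data = [2.3, 1.9, 2.1, 2.8, 2.3, 3.6, 1.4, 1.8, 2.1, 3.2, 2.0, 1.9]
X1 = 3.816     # assumed table value: P(chi2(11) >= x1) = 0.975
X2 = 21.920    # assumed table value: P(chi2(11) >= x2) = 0.025

lower, upper = variance_interval(data, X1, X2)
print(round(lower, 3), round(upper, 3))
```

Note that the interval is not symmetric about s², because the chi square distribution itself is not symmetric.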
Confidence Interval For Difference of Means

We now consider the problem of comparing two different populations: We do this by estimating either the difference or the ratio of their population parameters such as mean and variance.

Example. Consider independent normal populations with parameters $(\mu_i, \sigma_i)$, $i = 1, 2$, respectively. We will estimate $\mu_1 - \mu_2$, assuming the variances are known. Now,
$$\bar{X}_i \sim N\Big(\mu_i, \frac{\sigma_i}{\sqrt{n_i}}\Big).$$
Assuming the random samples to be independent, we have
$$\bar{X}_1 - \bar{X}_2 \sim N(\mu_1 - \mu_2, \sigma), \quad\text{where } \sigma = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$
Therefore,
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma} \sim N(0, 1).$$
We are now in a familiar situation. We choose $z$ so that $P(Z > z) = \alpha/2$. Then $P(|Z| < z) = 1 - \alpha$. Hence,
$$P\big(|(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)| < \sigma z\big) = 1 - \alpha.$$
And so, $(\bar{x}_1 - \bar{x}_2 - \sigma z,\ \bar{x}_1 - \bar{x}_2 + \sigma z)$ is a $(1-\alpha)100\%$ confidence interval for the difference of means. □

If the variances are not known, we can take large samples ($n_i \ge 30$) and use $s_i^2$ for $\sigma_i^2$. If even that is not possible, we need to at least assume that $\sigma_1 = \sigma_2 = \sigma$. Then
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim N(0, 1).$$
For $\sigma^2$, we have the pooled estimator
$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2},$$
which is a weighted average of the two sample variances. We have
$$(n_i - 1)\frac{S_i^2}{\sigma^2} \sim \chi^2(n_i - 1), \quad i = 1, 2.$$
Hence
$$Y = \sum_{i=1}^{2} (n_i - 1)\frac{S_i^2}{\sigma^2} = (n_1 + n_2 - 2)\frac{S_p^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2).$$
Therefore
$$T = \frac{Z}{\sqrt{Y/(n_1 + n_2 - 2)}} \sim t(n_1 + n_2 - 2).$$
Now $T$ can also be expressed as:
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
So we see that the t distribution can be used to find confidence intervals for $\mu_1 - \mu_2$. For example, suppose we find $t$ such that $P(T \ge t) = \alpha/2$. Then $P(|T| \le t) = 1 - \alpha$. This can be rearranged into
$$P\Big(|(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)| \le t\,S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\Big) = 1 - \alpha.$$
Therefore,
$$\Big(\bar{x}_1 - \bar{x}_2 - t\,s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ \ \bar{x}_1 - \bar{x}_2 + t\,s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\Big)$$
is a $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$.
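The pooled interval can be sketched as follows. The two small samples are hypothetical, and the critical value 2.306 (for t(8) with α = 0.05) is an assumed table lookup. The pooled variance divides by n₁ + n₂ − 2, which is what makes the resulting statistic follow t(n₁ + n₂ − 2).

```python
import math
import statistics

def pooled_t_interval(sample1, sample2, t_critical):
    """Confidence interval for mu1 - mu2, assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * statistics.variance(sample1)
           + (n2 - 1) * statistics.variance(sample2)) / (n1 + n2 - 2)
    half = t_critical * math.sqrt(sp2) * math.sqrt(1.0 / n1 + 1.0 / n2)
    diff = statistics.fmean(sample1) - statistics.fmean(sample2)
    return diff - half, diff + half

sample1 = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical observations
sample2 = [2.0, 3.0, 4.0, 5.0, 6.0]    # hypothetical observations
T_CRITICAL = 2.306                     # assumed table value: P(T >= t) = 0.025 for t(8)

lower, upper = pooled_t_interval(sample1, sample2, T_CRITICAL)
print(round(lower, 3), round(upper, 3))
```

Here both sample variances equal 2.5, so the pooled variance is 2.5 as well, and the interval is centred at the observed difference of means, −1.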
Confidence Interval For Ratio Of Variances

For comparing the variances of two normal populations, it is convenient to look at their ratio through the F distribution.

Example. Suppose we have independent random samples from $N(\mu_i, \sigma_i)$ populations, with $i = 1, 2$. Let $n_i$ be the sample size and $S_i^2$ the sample variance for the sample from the $i$th population. Then
$$(n_i - 1)\frac{S_i^2}{\sigma_i^2} \sim \chi^2(n_i - 1),$$
and hence
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1).$$
Therefore, we look for $f_1, f_2$ such that
$$P(F > f_1) = 1 - \alpha/2 \quad\text{and}\quad P(F > f_2) = \alpha/2,$$
so that
$$P(f_1 < F < f_2) = 1 - \alpha.$$
This yields the $(1-\alpha)100\%$ confidence interval
$$\Big(\frac{s_1^2}{s_2^2}\,\frac{1}{f_2},\ \ \frac{s_1^2}{s_2^2}\,\frac{1}{f_1}\Big)$$
for $\sigma_1^2/\sigma_2^2$. □
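A small numerical sketch of this interval follows. The sample variances (4.0 and 2.0) and sample sizes (11 each) are hypothetical, and the upper point 3.717 for F(10, 10) with α = 0.05 is an assumed table value. Because both degrees of freedom are equal here, the lower point f₁ is simply the reciprocal of the upper point, by the relation 1/F ∼ F(m₂, m₁).

```python
def variance_ratio_interval(s1_sq, s2_sq, f1, f2):
    """Interval ((s1^2/s2^2)/f2, (s1^2/s2^2)/f1) for sigma1^2 / sigma2^2."""
    ratio = s1_sq / s2_sq
    return ratio / f2, ratio / f1

F_UPPER = 3.717          # assumed table value: P(F > f) = 0.025 for F(10, 10)
f2 = F_UPPER
f1 = 1.0 / F_UPPER       # lower point, since 1/F is again F(10, 10)

lower, upper = variance_ratio_interval(s1_sq=4.0, s2_sq=2.0, f1=f1, f2=f2)
print(round(lower, 3), round(upper, 3))
```

With such small samples the interval is wide and contains 1, so these hypothetical data would not allow us to conclude that the two variances differ.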
References
1. A.M. Mood, F.A. Graybill and D.C. Boes, Introduction to the Theory of Statistics, Third Edition, Tata McGraw-Hill, New Delhi.
2. J.E. Freund, Mathematical Statistics, Fifth Edition, Prentice-Hall India.