Probability and Statistics (ENM 503)
Michael A. Carchidi
October 5, 2015

Chapter 9 - Limit Theorems

The following notes are based on the textbook entitled: A First Course in Probability by Sheldon Ross (9th edition), and these notes can be viewed at https://canvas.upenn.edu/ after you log in using your PennKey user name and password.

1. Introduction

In this chapter, we shall develop some of the more important and beautiful aspects of probability, which lie in its inequality and limit theorems. Of these, the most important and profound are the laws of large numbers and the central limit theorems, since these seem to explain the statistics associated with so many of our activities. The laws of large numbers are concerned with stating conditions under which the average of a sequence of random variables approaches the "expected" average. The central limit theorems, which are much stronger, are concerned with stating conditions under which the sum of a large number of random variables approaches the "normal" distribution, thereby explaining the empirical fact that so many natural populations exhibit the well-known bell-shaped (normal) curves. The limit theorems also supply a nice tie between the axiomatic approach to probability and the relative frequency (or measure of belief) approach to probability. Let us begin with a few inequality results, which are quite simple, but which serve as necessary stepping stones to the stronger limit results.
2. Chebyshev's Inequality and the Weak Law of Large Numbers

The first step in arriving at the laws of large numbers and the central limit theorems involves a very simple result (with a very simple proof) known as Markov's inequality.

Markov's Inequality

Markov's inequality states that if X is a random variable that takes on only non-negative values, so that

P(X < 0) = 0,   (1a)

and if a is any positive real number (i.e., a > 0), then

P(X ≥ a) ≤ (1/a)E(X).   (1b)

Proof: To prove this we note that if I is the random variable defined as

I = 1 when X ≥ a, and I = 0 when X < a,   (2)

then (since a > 0), it should be clear that

I = 1 when 1 ≤ X/a, and I = 0 when 1 > X/a,

and this says that I ≤ X/a for all values of the random variable X and all values of a > 0. Taking expected values of I ≤ X/a, we have

E(I) ≤ E(X/a) = (1/a)E(X).

But by the definition of I in Equation (2), we see that

E(I) = (1)P(X ≥ a) + (0)P(X < a) = P(X ≥ a)

and so we find that

P(X ≥ a) ≤ (1/a)E(X)
and the proof of the theorem is complete. Note that this can also be seen using only the Calculus (in the case when X is a continuous random variable with pdf f) by writing

P(X ≥ a) = ∫_a^∞ f(t) dt = (1/a) ∫_a^∞ a f(t) dt ≤ (1/a) ∫_a^∞ t f(t) dt,

since ∫_a^∞ (·) dt means that 0 < a ≤ t, and the definition of the pdf says that f(t) ≥ 0. But we also note that, since 0 < a, we have

∫_a^∞ t f(t) dt ≤ ∫_0^a t f(t) dt + ∫_a^∞ t f(t) dt = ∫_0^∞ t f(t) dt,

and then, since X is non-negative so that f(t) = 0 for t < 0,

E(X) = ∫_{−∞}^∞ t f(t) dt = ∫_{−∞}^0 t(0) dt + ∫_0^∞ t f(t) dt = ∫_0^∞ t f(t) dt,

and we see that

P(X ≥ a) ≤ (1/a)E(X)

follows. Note that we may also write Equation (1b) as

1 − P(X < a) ≤ (1/a)E(X)

or

P(X < a) ≥ 1 − (1/a)E(X).   (1c)
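As a quick numerical illustration (a sketch added to these notes, not from the text), the following Python fragment compares the empirical tail probability P(X ≥ a) with the Markov bound E(X)/a for an Exponential(1) random variable, whose exact tail e^{−a} is also known; the distribution and sample size are illustrative choices.

    import math
    import random

    random.seed(1)
    mean = 1.0  # E(X) for an Exponential(1) random variable
    samples = [random.expovariate(1.0) for _ in range(100_000)]

    for a in (1.0, 2.0, 4.0):
        empirical = sum(x >= a for x in samples) / len(samples)
        markov = mean / a        # Markov's bound: P(X >= a) <= E(X)/a
        exact = math.exp(-a)     # exact tail for Exponential(1)
        print(f"a={a}: empirical={empirical:.4f}, exact={exact:.4f}, bound={markov:.4f}")

Note how quickly the bound becomes loose: at a = 4 it gives 0.25 while the true tail is about 0.018, which is the price paid for using only the mean.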
Chebyshev’s Inequality
An immediate consequence of Markov's inequality is Chebyshev's inequality, which states that if X is any random variable with finite mean µ and standard deviation σ, and if k is any positive real number, then

P(|X − µ| ≥ k) ≤ σ²/k².   (3a)

This follows by noting that (X − µ)² is a non-negative random variable, so that setting a = k² > 0 in Markov's inequality, we have

P((X − µ)² ≥ k²) ≤ (1/k²)E((X − µ)²) = σ²/k²

or

P(|X − µ| ≥ k) ≤ σ²/k²,

which is Equation (3a). Note that we may also write Equation (3a) as

1 − P(|X − µ| < k) ≤ σ²/k²

or

P(|X − µ| < k) ≥ 1 − σ²/k²

or

P(µ − k < X < µ + k) ≥ 1 − σ²/k².   (3b)

Equations (1b,c) and (3a,b) enable one to derive bounds on probabilities when only the mean, or both the mean and variance, are known for a random variable X. Let us now consider two examples.

Example #1: Markov's and Chebyshev's Inequality
Suppose that the number of items produced in a factory is a random variable X with mean 50, and let us see what we can say about the probability that a given week's production will exceed 74 items, assuming that X is a discrete random variable. For one, we may use only the mean value of 50 and Markov's inequality (with a = 75) to say that

P(X > 74) = P(X ≥ 75) ≤ (1/75)E(X) = 50/75 = 2/3.

But suppose that we also know that the variance in a week's production is 25; then we can say something about the probability that a given week's production will be between 40 and 60 items. Using Chebyshev's inequality, in the form of Equation (3b), with k = 10, we have

P(50 − 10 < X < 50 + 10) ≥ 1 − (σ/k)² = 1 − (5/10)²

or

P(40 < X < 60) ≥ 3/4,

so that the probability that a given week's production is between 40 and 60 is at least as large as 3/4, regardless of the actual distribution of X. ¥
Since Chebyshev’s inequality is valid for all random variables X , we should not expect the bound on the probability to be very close to the actual probability in most cases, as the following example shows. Example #2: Chebyshev’s Inequality is Numerically Weak
Suppose that the number of items produced in the factory example above is normal with mean 50 and variance 25, i.e., X ∼ N(50, 25). Then the probability that a given week's production will be between 40 and 60 items can be computed exactly as

P(40 < X < 60) = P((40 − 50)/5 < (X − 50)/5 < (60 − 50)/5)

or

P(40 < X < 60) = P(−2 < Z < 2) = Φ(2) − Φ(−2) = Φ(2) − (1 − Φ(2))

or

P(40 < X < 60) = 2Φ(2) − 1 = (2/√(2π)) ∫_{−∞}^2 e^{−t²/2} dt − 1 ≈ 0.9545,

which is much larger than the 3/4 = 0.75 provided by Chebyshev's inequality in Example #1 above. ¥
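The comparison in Example #2 can be reproduced in a few lines of Python (an added sketch using only the standard library, with Φ(z) computed as (1 + erf(z/√2))/2):

    import math

    def Phi(z):
        """Standard normal cdf via the error function."""
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    mu, sigma, k = 50.0, 5.0, 10.0
    chebyshev_lower = 1.0 - (sigma / k) ** 2                 # 1 - (5/10)^2 = 0.75
    exact = Phi((60 - mu) / sigma) - Phi((40 - mu) / sigma)  # 2*Phi(2) - 1
    print(f"Chebyshev lower bound: {chebyshev_lower:.4f}")   # 0.7500
    print(f"exact N(50,25) value : {exact:.4f}")             # 0.9545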
Example #3: Zero Variance ⇒ Not Random at All

Suppose that X is a random variable having mean µ and zero variance. Then Chebyshev's inequality would say that

P(|X − µ| < k) ≥ 1 − (σ/k)²

or

P(|X − µ| < k) ≥ 1,

since σ² = 0. But since P(|X − µ| < k) ≤ 1 for all probabilities, we see then that

1 ≤ P(|X − µ| < k) ≤ 1,

which says that

P(|X − µ| < k) = 1

for all k > 0. Taking the limit as k → 0⁺ then says that

P(|X − µ| ≤ 0) = 1,

which says that P(X = µ) = 1. In other words, the only random variables having zero variance are those that are equal to their means with probability 1, and hence they are not random at all. ¥

You will notice that in the previous example, we used the "limit-inequality" that if f(x) < g(x) for all x > 0, then

lim_{x→0⁺} f(x) ≤ lim_{x→0⁺} g(x).

For example, it is true that f(x) = 1 − x < g(x) = 1 for all x > 0, but

lim_{x→0⁺} (1 − x) < lim_{x→0⁺} (1),

resulting in 1 < 1, is not true, but rather

lim_{x→0⁺} (1 − x) ≤ lim_{x→0⁺} (1),

resulting in 1 ≤ 1, which is a true statement. This is why

P(|X − µ| < k) = 1 and lim_{k→0⁺} P(|X − µ| < k) = lim_{k→0⁺} (1)

becomes

P(lim_{k→0⁺} |X − µ| < lim_{k→0⁺} k) = 1 or P(|X − µ| ≤ 0) = 1,

where the "<" in P(|X − µ| < k) changes to a "≤" in P(|X − µ| ≤ 0).
The Weak Law of Large Numbers
Let X₁, X₂, X₃, ... be a sequence of independent and identically distributed random variables, each having the same finite mean E(Xᵢ) = µ. Then, for any ε > 0,

lim_{n→∞} P(|(X₁ + X₂ + X₃ + ··· + Xₙ)/n − µ| ≥ ε) = 0.   (4a)

To prove this result (assuming, in addition, a finite common variance V(Xᵢ) = σ²), we define the sequence of random variables

Yₙ = (X₁ + X₂ + X₃ + ··· + Xₙ)/n

for n = 1, 2, 3, ..., and note that

E(Yₙ) = (1/n)E(X₁ + X₂ + ··· + Xₙ) = (1/n)(E(X₁) + E(X₂) + ··· + E(Xₙ)) = (1/n)(µ + µ + ··· + µ)

or E(Yₙ) = µ for n = 1, 2, 3, .... We also have

V(Yₙ) = (1/n²)V(X₁ + X₂ + ··· + Xₙ)

or, since the Xᵢ's are independent,

V(Yₙ) = (1/n²)(V(X₁) + V(X₂) + ··· + V(Xₙ)) = (1/n²)(σ² + σ² + ··· + σ²)

or V(Yₙ) = σ²/n for n = 1, 2, 3, .... Then, using Chebyshev's inequality with X = Yₙ and k = ε, we have

P(|Yₙ − E(Yₙ)| ≥ ε) ≤ V(Yₙ)/ε²

or

P(|(X₁ + X₂ + X₃ + ··· + Xₙ)/n − µ| ≥ ε) ≤ σ²/(nε²)   (4b)

for n = 1, 2, 3, .... Taking the limit as n → ∞, we have

lim_{n→∞} P(|(X₁ + X₂ + X₃ + ··· + Xₙ)/n − µ| ≥ ε) = 0

for any ε > 0, and the result is now proven. It should be noted that we may not take the limit of this result as ε → 0⁺, since then the limit

lim_{n→∞} σ²/(nε²) = 0

taken above would no longer be true.
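To see Equation (4a), together with the bound of Equation (4b), numerically, here is a small simulation sketch (an addition to the notes) using an illustrative Uniform(0, 1) sequence, for which µ = 1/2 and σ² = 1/12:

    import random

    random.seed(2)
    mu, var, eps, reps = 0.5, 1.0 / 12.0, 0.05, 2000

    for n in (10, 100, 1000):
        misses = 0
        for _ in range(reps):
            mean = sum(random.random() for _ in range(n)) / n
            if abs(mean - mu) >= eps:
                misses += 1
        bound = var / (n * eps * eps)  # Chebyshev bound of Equation (4b)
        print(f"n={n}: P(|mean - mu| >= eps) ~ {misses / reps:.4f} (bound {bound:.4f})")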
3. The Central Limit Theorem

Suppose that a large number n of independent random variables X₁, X₂, X₃, ..., Xₙ, having means E(Xᵢ) = µᵢ and variances V(Xᵢ) = σᵢ², satisfy the assumptions that: (a) the Xᵢ's are uniformly bounded, which means that there is an M such that P(|Xᵢ| < M) = 1 for all i, and (b)

lim_{n→∞} Σ_{i=1}^n σᵢ² = ∞;

then the central limit theorem states that the sum of these Xᵢ's,

X = Σ_{i=1}^n Xᵢ,

has a distribution that is approximately normal, with mean and variance given by

µ = µ₁ + µ₂ + µ₃ + ··· + µₙ and σ² = σ₁² + σ₂² + σ₃² + ··· + σₙ².

Therefore, this theorem provides a simple method for computing approximate probabilities for "large" sums of independent random variables via the standard normal cdf

Φ(z) = (1/√(2π)) ∫_{−∞}^z e^{−t²/2} dt.

In addition, this theorem helps explain the remarkable fact that the empirical frequencies of so many natural populations exhibit bell-shaped (normal) curves, as shown in the following figure.

[Figure: A Bell-Shaped Curve]

One simpler form of the central limit theorem states that if X₁, X₂, X₃, ..., Xₙ is a sequence of independent and identically distributed random variables, each having mean µ and variance σ², and if

Zₙ = (X₁ + X₂ + X₃ + ··· + Xₙ − nµ)/(σ√n)   (5a)

for n = 1, 2, 3, ..., then, for any −∞ < z < +∞,

lim_{n→∞} P(Zₙ ≤ z) = Φ(z) = (1/√(2π)) ∫_{−∞}^z e^{−t²/2} dt.   (5b)
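Before turning to the proof, here is a brief simulation sketch of Equation (5b) (added here, not part of the text), using illustrative Exponential(1) summands, for which µ = σ = 1, and comparing P(Zₙ ≤ 1) with Φ(1) ≈ 0.8413:

    import math
    import random

    random.seed(3)

    def Phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    mu = sigma = 1.0   # Exponential(1) has mean 1 and standard deviation 1
    z, reps = 1.0, 10_000

    for n in (5, 50, 500):
        count = 0
        for _ in range(reps):
            s = sum(random.expovariate(1.0) for _ in range(n))
            if (s - n * mu) / (sigma * math.sqrt(n)) <= z:
                count += 1
        print(f"n={n}: P(Z_n <= 1) ~ {count / reps:.4f} (Phi(1) = {Phi(z):.4f})")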
This says that the Zₙ's tend to the standard normal distribution as n gets large. To prove this result, let us assume that the moment generating function of each of the Xᵢ's exists and is finite. Now if E(Xᵢ) = µ and V(Xᵢ) = σ², we first note that

Zₙ = (X₁ + X₂ + ··· + Xₙ − nµ)/(σ√n) = ((X₁ − µ) + (X₂ − µ) + ··· + (Xₙ − µ))/(σ√n)

or

Zₙ = (X₁* + X₂* + ··· + Xₙ*)/√n, where Xᵢ* = (Xᵢ − µ)/σ,

and E(Xᵢ*) = 0 and V(Xᵢ*) = 1. Then

M_Zₙ(t) = E(exp(t(X₁* + X₂* + ··· + Xₙ*)/√n)) = E(exp(tX₁*/√n) exp(tX₂*/√n) ··· exp(tXₙ*/√n)).

Since the Xᵢ's, and hence the Xᵢ*'s, are independent, the exp(tXᵢ*/√n)'s are also independent, and so

M_Zₙ(t) = E(exp(tX₁*/√n)) E(exp(tX₂*/√n)) ··· E(exp(tXₙ*/√n)),

but since the Xᵢ*'s are all identically distributed, we have

M_Zₙ(t) = (M(t/√n))ⁿ, where M(t) = E(e^{tXⱼ*}).

Taking the natural logarithm of both sides of this equation, we have

ln(M_Zₙ(t)) = n L(t/√n), where L(t) = ln(M(t)).   (6)

Now L(0) = ln(M(0)) = ln(1) = 0 and, using what we know about moment generating functions from the previous chapter, we have

L′(t) = M′(t)/M(t), so that L′(0) = M′(0)/M(0) = E(Xᵢ*)/1 = 0,

and

L″(t) = (M″(t)M(t) − M′(t)M′(t))/(M(t))²,

so that

L″(0) = (M″(0)M(0) − M′(0)M′(0))/(M(0))² = (E((Xᵢ*)²)(1) − E(Xᵢ*)E(Xᵢ*))/(1)²

or

L″(0) = E((Xᵢ*)²) − (E(Xᵢ*))² = V(Xᵢ*) = 1.

Now taking the limit of Equation (6) as n → ∞, we find that

lim_{n→∞} ln(M_Zₙ(t)) = lim_{n→∞} n L(tn^{−1/2}) = lim_{n→∞} L(tn^{−1/2})/n^{−1},

which is of the indeterminate form L(0)/0 = 0/0, and so, using L'Hopital's rule (differentiating with respect to n), we have

lim_{n→∞} ln(M_Zₙ(t)) = lim_{n→∞} (−(1/2)tn^{−3/2} L′(tn^{−1/2}))/(−n^{−2}) = (t/2) lim_{n→∞} L′(tn^{−1/2})/n^{−1/2},

which is again of the indeterminate form (t/2)(L′(0)/0) = 0/0, and using L'Hopital's rule again, we find that

lim_{n→∞} ln(M_Zₙ(t)) = (t/2) lim_{n→∞} (−(1/2)tn^{−3/2} L″(tn^{−1/2}))/(−(1/2)n^{−3/2}) = (t²/2) lim_{n→∞} L″(tn^{−1/2}) = (t²/2)L″(0) = t²/2.

This says that

lim_{n→∞} M_Zₙ(t) = e^{t²/2} = M_Z(t), where Z ∼ N(0, 1),

and using

lim_{n→∞} M_Zₙ(t) = M_{lim_{n→∞} Zₙ}(t) = M_Z(t)

and the uniqueness of the moment generating function, we conclude that

lim_{n→∞} Zₙ = Z ∼ N(0, 1)
and the proof is complete.

Example #4: Measuring Large Distances - Normal Approximation
An astronomer is interested in measuring the distance (in light years) from the earth to a distant star. Although the astronomer has a measuring technique, he knows that because of changing atmospheric conditions and normal error, each measurement will not yield the same distance, and hence not the exact distance every time. Instead, merely an estimate of the exact distance is obtained each time. As a result, the astronomer plans to make a series of measurements and then use the average of these as an estimate for the distance from the earth to the star. If all of these measurements are independent and identically distributed (which are both reasonable assumptions), and if each measurement has a common mean d (the actual distance in light years) and a common standard deviation of 2 (light years), as governed by the measuring conditions, how many measurements must the astronomer make if he wants to be at least 95% certain that his estimate is accurate to within ±0.5 light years?

To solve this we assume that the astronomer makes n measurements, denoted by X₁, X₂, X₃, ..., Xₙ, such that E(Xᵢ) = d (in light years) and σ(Xᵢ) = σ = 2 light years. Then from the central limit theorem, it follows that

Zₙ = (X₁ + X₂ + X₃ + ··· + Xₙ − nd)/(σ√n) ∼ N(0, 1)

for large n. Setting

Aₙ = (X₁ + X₂ + X₃ + ··· + Xₙ)/n,

we see that Aₙ = d + (σ/√n)Zₙ. Then

P(d − 0.5 ≤ Aₙ ≤ d + 0.5) = P(d − 0.5 ≤ d + (σ/√n)Zₙ ≤ d + 0.5)

or

P(d − 0.5 ≤ Aₙ ≤ d + 0.5) = P(−√n/(2σ) ≤ Zₙ ≤ √n/(2σ))

or

P(d − 0.5 ≤ Aₙ ≤ d + 0.5) = Φ(√n/(2σ)) − Φ(−√n/(2σ)) = 2Φ(√n/(2σ)) − 1.

Setting this so that

2Φ(√n/(2σ)) − 1 ≥ 0.95,

we find that

Φ(√n/(2σ)) ≥ 0.975 or √n/(2σ) ≥ Φ⁻¹(0.975) = 1.96.

This says that n ≥ (3.92σ)² = (3.92 × 2)² ≈ 61.47, so that the astronomer must take at least 62 observations. ¥
Example #5: Measuring Large Distances - Chebyshev’s Inequality
The previous example assumes that n = 62 is "large enough" so that the central limit theorem is valid. If the astronomer is worried about the validity of this, he may solve the problem using Chebyshev's inequality, which does not require a large-n assumption. Toward this end, we have

E(Xᵢ) = d and V(Xᵢ) = σ²

for each i = 1, 2, 3, ..., n, and with

Aₙ = (1/n) Σ_{i=1}^n Xᵢ

we have

E(Aₙ) = d and V(Aₙ) = σ²/n.

Using Chebyshev's inequality with X = Aₙ and k = 0.5, we then have

P(|Aₙ − d| ≥ k) ≤ V(Aₙ)/k² = σ²/(nk²)

and so

1 − P(|Aₙ − d| < k) ≤ σ²/(nk²)

or

P(|Aₙ − d| < k) ≥ 1 − σ²/(nk²)

or

P(d − k < Aₙ < d + k) ≥ 1 − σ²/(nk²).

Setting σ = 2 and requiring this probability to be at least 0.95, we find that

1 − σ²/(nk²) ≥ 0.95, yielding n ≥ σ²/((0.05)k²) = 2²/((0.05)(0.5)²) = 320,

which is more than five times the value of 62 obtained in the previous example. Whether to take 62 measurements or 320 measurements must now be weighed against the amount of time and cost required for each measurement. ¥
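Both sample-size computations can be checked in a few lines of Python (an added sketch in which NormalDist().inv_cdf from the standard library plays the role of Φ⁻¹):

    import math
    from statistics import NormalDist

    sigma, k, confidence = 2.0, 0.5, 0.95
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # Phi^{-1}(0.975) ~ 1.9600

    # Example #4 (CLT): need k*sqrt(n)/sigma >= z, i.e. n >= (z*sigma/k)^2
    n_clt = math.ceil((z * sigma / k) ** 2)
    # Example #5 (Chebyshev): need 1 - sigma^2/(n k^2) >= confidence
    n_cheb = math.ceil(sigma ** 2 / ((1 - confidence) * k ** 2))
    print(n_clt, n_cheb)  # 62 320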
Example #6: A Poisson Distribution
Suppose that the number of students who enroll in a psychology course is a Poisson random variable with mean E(X) = 100. The professor in charge of the course has decided that if the number enrolling is 120 or more, he will teach the course in two separate sections, whereas if fewer than 120 students enroll, he will teach all the students together in a single section. Determine the probability that the professor will have to teach two sections. The exact solution is

P(X ≥ 120) = Σ_{k=120}^∞ e^{−100}(100)^k/k! = 1 − Σ_{k=0}^{119} e^{−100}(100)^k/k!,

which gives P(X ≥ 120) = 1 − 0.9718 = 0.0282, or P(X ≥ 120) ≈ 2.82%. Using the fact that a Poisson random variable with mean 100 is the sum of 100 independent Poisson random variables, each with mean E(Xᵢ) = 1 and V(Xᵢ) = 1, we can make use of the central limit theorem and the fact that

E(X) = 100E(Xᵢ) = 100 and V(X) = 100V(Xᵢ) = 100

to approximate P(X ≥ 120) as P(X ≥ 119.5), which is the continuity correction, and get

P(X ≥ 119.5) = P((X − 100)/10 ≥ (119.5 − 100)/10) = P(Z ≥ 1.95) = 1 − Φ(1.95),

which says that P(X ≥ 119.5) = 1 − 0.9744 = 0.0256, which is about

((0.0256 − 0.0282)/0.0282) × 100% = −9.22%,

or 9.22% lower than the exact result. ¥
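The exact Poisson tail and its continuity-corrected normal approximation can be verified with the following standard-library sketch (an addition to the notes); the Poisson cdf is accumulated term by term to avoid huge factorials:

    import math
    from statistics import NormalDist

    lam = 100.0
    # Exact: P(X >= 120) = 1 - sum_{k=0}^{119} e^{-lam} lam^k / k!
    term, cdf = math.exp(-lam), 0.0      # term = P(X = 0)
    for k in range(120):
        cdf += term
        term *= lam / (k + 1)            # P(X = k+1) from P(X = k)
    exact = 1.0 - cdf
    # Normal approximation with the continuity correction: P(X >= 119.5)
    approx = 1.0 - NormalDist(mu=lam, sigma=math.sqrt(lam)).cdf(119.5)
    print(f"exact = {exact:.4f}, approximation = {approx:.4f}")  # 0.0282 vs 0.0256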
Example #7: Rolling the Dice
If 10 fair six-sided dice are rolled, find the approximate probability that the sum obtained is between 30 and 40, inclusive. To solve this we let Xᵢ be the random variable on the roll of the ith die, so that

E(Xᵢ) = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 7/2

and

E(Xᵢ²) = (1/6)(1)² + (1/6)(2)² + (1/6)(3)² + (1/6)(4)² + (1/6)(5)² + (1/6)(6)² = 91/6

and

V(Xᵢ) = 91/6 − (7/2)² = 35/12.

Then since X = X₁ + X₂ + ··· + X₁₀, we have (after using the continuity correction)

P(29.5 ≤ X ≤ 40.5) = P((29.5 − 10(7/2))/√(10(35/12)) ≤ (X − 10(7/2))/√(10(35/12)) ≤ (40.5 − 10(7/2))/√(10(35/12))),

which gives

P(29.5 ≤ X ≤ 40.5) = P(−1.0184 ≤ Z ≤ 1.0184) = 2Φ(1.0184) − 1

or

P(29.5 ≤ X ≤ 40.5) ≈ 2(0.8458) − 1 ≈ 0.6915. ¥
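A quick simulation sketch (added here; the replication count is an arbitrary choice) that estimates the same dice-sum probability directly:

    import random

    random.seed(4)
    reps = 200_000
    hits = sum(30 <= sum(random.randint(1, 6) for _ in range(10)) <= 40
               for _ in range(reps))
    print(f"P(30 <= sum <= 40) ~ {hits / reps:.4f}")  # close to the 0.6915 above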
Example #8: Designing a Parking Lot

Prior to the construction of a new mega shopping mall, the builders must plan the size of the mall's parking lot in terms of the number of shoppers that can be simultaneously in the mall. If the mall is to thrive, customers are expected to arrive at a constant rate of 625 per hour, according to a Poisson process, and they are expected to shop for an average of 4 hours. In the real mall there will be an upper limit on the number of shoppers, but for planning purposes, the builders can pretend that the number of shoppers is infinite. Determine the minimum number of parking places the builders should plan to make if they want to ensure that they have adequate capacity at least 97.5% of the time. You may assume that there is only one shopper per car.

The solution to this problem involves modeling the system as a self-service queueing system with λ = 625 customers/hour and E(S) = 1/µ = 4 hours, so that the average number of customers in the mall at any given time is L = λ/µ = 2500, and this is distributed in a Poisson way, so that the builders should plan on the number of shoppers to be the smallest value of c such that the probability there are no more than c shoppers is

P(N ≤ c) = Σ_{n=0}^c Pₙ = Σ_{n=0}^c (Lⁿ/n!)e^{−L} ≥ p = 0.975.

Solving this for c is quite difficult since L = 2500 is large, making e^{−L} very small and Lⁿ very large. However, we know from our discussion of Example #6 above that a normal approximation to the Poisson distribution, using E(N) = L and V(N) = L = 2500, can be used, and using this we have

P((N − L)/√L ≤ (c − L)/√L) ≥ p

or, since (N − L)/√L ∼ N(0, 1) approximately,

Φ((c − L)/√L) ≥ p,

which leads to

c ≥ L + √L Φ⁻¹(p).

With the numbers given in the problem, namely L = 2500 and p = 0.975, we see that

c ≥ 2500 + √2500 Φ⁻¹(0.975) = 2500 + 50(1.96) = 2598,

which means the builders should plan on having a parking complex that can accommodate at least c_min = 2598 cars. ¥
Example #9: The Sum of Uniform Random Variables

Suppose that Xᵢ (for i = 1, 2, ..., n) are all independent and identically distributed random variables, each uniformly distributed over the interval from a to b (with b > a). Since

E(Xᵢ) = ∫_a^b x/(b − a) dx = (a + b)/2

and

E(Xᵢ²) = ∫_a^b x²/(b − a) dx = (a² + ab + b²)/3

and

V(Xᵢ) = E(Xᵢ²) − (E(Xᵢ))² = (a² + ab + b²)/3 − ((a + b)/2)² = (b − a)²/12,

we see that

Yₙ = Σ_{k=1}^n Xₖ ∼ N(µ, σ²) (approximately, for large n)

with

µ = Σ_{k=1}^n E(Xᵢ) = (a + b)n/2 and σ² = Σ_{k=1}^n V(Xᵢ) = (b − a)²n/12,

and so

P(Yₙ > c) = P((Yₙ − µ)/σ > (c − µ)/σ)

or

P(Yₙ > c) = 1 − Φ((c − µ)/σ).

Using a = 0, b = 1, n = 10 and c = 6, we find that µ = 5, σ² = 5/6 and

P(Y₁₀ > 6) = 1 − Φ((6 − 5)/√(5/6)) = 1 − Φ(√(6/5)),

which reduces to P(Y₁₀ > 6) ≈ 0.1367. ¥
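The normal estimate 0.1367 can be checked against a direct simulation of Y₁₀ (an added sketch; the simulated value will land near the approximation, though the two need not agree exactly since n = 10 is modest):

    import random

    random.seed(5)
    reps = 200_000
    count = sum(sum(random.random() for _ in range(10)) > 6 for _ in range(reps))
    print(f"P(Y_10 > 6) ~ {count / reps:.4f}")  # near the normal estimate 0.1367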
Example #10: Grading Exams

An instructor has N exams that will be graded in sequence. Suppose that the times required to grade the N exams are independent and identically distributed with mean µ and standard deviation σ. Estimate the probability that the instructor will grade at least n of the exams in the first T minutes of work. Letting E(Xᵢ) = µ and V(Xᵢ) = σ², then

Xₙ = Σ_{i=1}^n Xᵢ

is the time to grade the first n exams, and we want to estimate P(Xₙ ≤ T). Now

E(Xₙ) = Σ_{i=1}^n E(Xᵢ) = nµ and V(Xₙ) = Σ_{i=1}^n V(Xᵢ) = nσ².

Then

P(Xₙ ≤ T) = P((Xₙ − nµ)/(σ√n) ≤ (T − nµ)/(σ√n)) ≈ Φ((T − nµ)/(σ√n)).

Using n = 25, µ = 20 minutes, σ = 4 minutes and T = 450 minutes, we find that

P(X₂₅ ≤ 450) ≈ Φ((450 − 25(20))/(4(5))) = Φ(−5/2) = (1/√(2π)) ∫_{−∞}^{−5/2} e^{−t²/2} dt

or P(X₂₅ ≤ 450) ≈ 0.0062. ¥
à X
1 P lim n→∞ n
X i = µ
i=1
!
= 1.
(7)
Computing Probabilities Using The Strong Law of Large Numbers
As an important application of the strong law of large numbers, suppose that a sequence of independent trials of some experiment is performed, and suppose that E is some fixed event of the experiment that occurs with probability P(E) on any particular trial. Letting

Xᵢ = 1 if E does occur on the ith trial, and Xᵢ = 0 if E does not occur on the ith trial,

for i = 1, 2, 3, ..., then by the strong law of large numbers

lim_{n→∞} (1/n) Σ_{i=1}^n Xᵢ = E(Xᵢ) = (1)P(Xᵢ = 1) + (0)P(Xᵢ = 0) = (1)P(E) + (0)P(E̅)   (8)

or

P(E) = lim_{n→∞} (1/n) Σ_{i=1}^n Xᵢ, which we may also write as P(E) ≈ (1/n) Σ_{i=1}^n Xᵢ   (9)

for large n. This result is very important in Simulation Theory, which is covered in detail in the ESE 603 course.
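As a concrete illustration of Equation (9) (a sketch added to these notes), the following fragment estimates P(E) by relative frequency for the event E that two fair dice sum to 7, whose exact probability is 1/6:

    import random

    random.seed(6)
    n = 100_000
    # X_i = 1 if the event E (two dice summing to 7) occurs on trial i, else 0
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7 for _ in range(n))
    print(f"estimate of P(E): {hits / n:.4f} (exact 1/6 = {1/6:.4f})")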
To prove the strong law of large numbers, let us assume (although the theorem can be proven without this assumption) that the random variable Xᵢ has its first four moments finite, so that E(Xᵢ) = µ < ∞, E(Xᵢ²) = σ² + µ² < ∞, E(Xᵢ³) = L < ∞ and E(Xᵢ⁴) = K < ∞. To begin, note that Yᵢ = Xᵢ − µ has E(Yᵢ) = 0. Setting

Sₙ = Σ_{i=1}^n Yᵢ,

we see that

Sₙ⁴ = (Σ_{i=1}^n Yᵢ)⁴ = (Σ_{p=1}^n Yₚ)(Σ_{q=1}^n Y_q)(Σ_{r=1}^n Yᵣ)(Σ_{s=1}^n Yₛ)

or

Sₙ⁴ = Σ_{p=1}^n Σ_{q=1}^n Σ_{r=1}^n Σ_{s=1}^n Yₚ Y_q Yᵣ Yₛ.

The right-hand side of this expression will contain terms of the form

Yᵢ⁴, Yᵢ³Yⱼ, Yᵢ²Yⱼ², Yᵢ²YⱼYₖ and YᵢYⱼYₖYₗ,

where i, j, k and l are all different. Since E(Yᵢ) = 0 and all the Yᵢ's are independent, we see that E(Yᵢ³Yⱼ) = E(Yᵢ³)E(Yⱼ) = 0, along with E(Yᵢ²YⱼYₖ) = E(YᵢYⱼYₖYₗ) = 0. Each unordered pair {i, j} produces the term Yᵢ²Yⱼ² for 4!/(2!2!) = 6 of the orderings of (p, q, r, s), i.e., 3 for each ordered pair (i, j) with i ≠ j, so that

E(Sₙ⁴) = Σ_{i=1}^n E(Yᵢ⁴) + 3 Σ_{i=1}^n Σ_{j≠i} E(Yᵢ²)E(Yⱼ²).

But

E(Yᵢ⁴) = E((Xᵢ − µ)⁴) = E(Xᵢ⁴ − 4Xᵢ³µ + 6Xᵢ²µ² − 4Xᵢµ³ + µ⁴)
      = E(Xᵢ⁴) − 4µE(Xᵢ³) + 6µ²E(Xᵢ²) − 4µ³E(Xᵢ) + µ⁴
      = K − 4µE(Xᵢ³) + 6µ²E(Xᵢ²) − 4µ³µ + µ⁴
      = K − 4µE(Xᵢ³) + 6µ²E(Xᵢ²) − 3µ⁴

and

E(Yⱼ²) = E(Yᵢ²) = E((Xᵢ − µ)²) = V(Xᵢ) = σ².

Then

E(Sₙ⁴) = Σ_{i=1}^n (K − 4µE(Xᵢ³) + 6µ²(σ² + µ²) − 3µ⁴) + 3 Σ_{i=1}^n Σ_{j≠i} σ²σ²

and, setting E(Xᵢ³) = L, we have

E(Sₙ⁴) = n(K + 6µ²σ² + 3µ⁴ − 4µL) + 3n(n − 1)σ⁴,

so that

E(Sₙ⁴/n⁴) = (K + 6µ²σ² + 3µ⁴ − 4µL − 3σ⁴)/n³ + 3σ⁴/n².

Then

Σ_{n=1}^∞ E(Sₙ⁴/n⁴) = (K + 6µ²σ² + 3µ⁴ − 4µL − 3σ⁴) Σ_{n=1}^∞ 1/n³ + 3σ⁴ Σ_{n=1}^∞ 1/n²,

showing that

Σ_{n=1}^∞ E(Sₙ⁴/n⁴) = E(Σ_{n=1}^∞ (Sₙ/n)⁴) < ∞,

since

Σ_{n=1}^∞ 1/n³ = ζ(3) ≈ 1.202 and Σ_{n=1}^∞ 1/n² = ζ(2) = π²/6 ≈ 1.645

are both finite. But the fact that

E(Σ_{n=1}^∞ (Sₙ/n)⁴) < ∞

says that

Σ_{n=1}^∞ (Sₙ/n)⁴ < ∞

with probability one, and hence

lim_{n→∞} (Sₙ/n)⁴ = 0 or lim_{n→∞} Sₙ/n = 0

with probability 1. Thus we find that

lim_{n→∞} Sₙ/n = lim_{n→∞} (1/n) Σ_{i=1}^n Yᵢ = lim_{n→∞} (1/n) Σ_{i=1}^n (Xᵢ − µ) = lim_{n→∞} ((1/n) Σ_{i=1}^n Xᵢ) − µ = 0

or

lim_{n→∞} (1/n) Σ_{i=1}^n Xᵢ = µ

with probability 1, and now the proof is complete.

Many students are initially confused about the difference between the weak and strong laws of large numbers. The weak law of large numbers states that, for any specified large value of n, say n*,

(X₁ + X₂ + X₃ + ··· + X_{n*})/n*

is likely to be near µ. However, it does not say that

(X₁ + X₂ + X₃ + ··· + Xₙ)/n

is bound to stay near µ for all values of n larger than n*. Thus it leaves open the possibility that large values of

|(X₁ + X₂ + X₃ + ··· + Xₙ)/n − µ|

can occur infinitely often (though at infrequent intervals). The strong form shows that this cannot occur. In particular, it implies that, with probability 1, for any ε > 0,

|(X₁ + X₂ + X₃ + ··· + Xₙ)/n − µ| > ε
only for a finite number of values of n.

5. Other Inequalities

We are sometimes confronted with situations in which we are interested in obtaining an upper bound for a probability of the form P(X − µ ≥ a), where a > 0, when only the mean µ = E(X) and variance σ² = V(X) of the random variable X are known. We may use Chebyshev's inequality and the fact that

P(E) ≤ P(E ∪ F)   (10)

for any two events E and F, together with the fact that |A| ≥ a is equivalent to A ≥ a or A ≤ −a for a > 0, to write

P(X − µ ≥ a) ≤ P((X − µ ≥ a) ∪ (X − µ ≤ −a)) = P(|X − µ| ≥ a) ≤ σ²/a²,

so that

P(X − µ ≥ a > 0) ≤ σ²/a².   (11a)

But the following result (known as the one-sided Chebyshev's inequality) says that we can do better; it states that

P(X − µ ≥ a > 0) ≤ σ²/(a² + σ²)   (11b)

and

P(X − µ ≤ −a < 0) ≤ σ²/(a² + σ²).   (11c)

To prove this we set Y = X − µ, so that E(Y) = E(X) − µ = 0 and V(Y) = V(X) = σ². Suppose that b > 0; then Y ≥ a is equivalent to Y + b ≥ a + b. Hence

P(Y ≥ a) = P(Y + b ≥ a + b) ≤ P(|Y + b| ≥ a + b) = P((Y + b)² ≥ (a + b)²),

since |Y + b| ≥ a + b is equivalent to (Y + b)² ≥ (a + b)² for a + b > 0. Applying Markov's inequality, we may write that

P(Y ≥ a) ≤ P((Y + b)² ≥ (a + b)²) ≤ E((Y + b)²)/(a + b)² = E(Y² + 2bY + b²)/(a + b)²

or

P(Y ≥ a) ≤ (E(Y²) + 2bE(Y) + b²)/(a + b)² = (σ² + 2b(0) + b²)/(a + b)²

or

P(Y ≥ a) ≤ (σ² + b²)/(a + b)² = g(b)

for all b > 0. We may now choose the value of b which minimizes g(b) by setting

g′(b) = d/db((σ² + b²)/(a + b)²) = 2(ab − σ²)/(a + b)³ = 0,

resulting in b = σ²/a, and hence

g_min = g(σ²/a) = (σ² + (σ²/a)²)/(a + σ²/a)² = σ²/(a² + σ²),

which then says that

P(Y ≥ a) ≤ g_min = σ²/(a² + σ²),

and the proof of Equation (11b) is complete. To prove Equation (11c), we use the fact that Y ≤ −a is equivalent to −Y ≥ a, and apply the result just proven to −Y (which also has mean 0 and variance σ²).
Example #11: The One-Sided Chebyshev’s Inequality
Suppose that the number of items produced in a factory during a week is a random variable with mean µ = 100 and variance σ² = 400. Let us compute an upper bound on the probability that this week's production will be at least 120. To do this we use

P(X ≥ 120) = P(X − 100 ≥ 20) = P(Y ≥ 20) ≤ σ²/(a² + σ²) = 400/((20)² + 400),

so that

P(X ≥ 120) ≤ 1/2 = 0.5.

Hence the probability that this week's production will be at least 120 is at most one-half. Note that Equation (11a) would give the weaker (and quite useless) result

P(X ≥ 120) ≤ σ²/a² = 400/(20)² = 1.

If we attempted to obtain a bound using Markov's inequality (which does not utilize the variance σ²), we would get

P(X ≥ 120) ≤ E(X)/a = 100/120 = 5/6,

which is a weaker result than P(X ≥ 120) ≤ 1/2. ¥
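The three bounds of Example #11 are easily tabulated (an added sketch):

    mu, sigma2, threshold = 100.0, 400.0, 120.0
    a = threshold - mu                      # a = 20
    markov = mu / threshold                 # Markov: E(X)/120 = 5/6
    two_sided = min(sigma2 / a ** 2, 1.0)   # Equation (11a): 400/400 = 1 (useless)
    one_sided = sigma2 / (a ** 2 + sigma2)  # Equation (11b): 400/800 = 0.5
    print(f"Markov {markov:.3f}, two-sided {two_sided:.3f}, one-sided {one_sided:.3f}")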
It should be noted that Equations (11b,c) may also be written as

P(X ≥ µ + a) ≤ σ²/(a² + σ²)   (12a)

and

P(X ≤ µ − a) ≤ σ²/(a² + σ²)   (12b)

for any a > 0.
Example #12: The One-Sided Chebyshev’s Inequality
A set of 200 people consisting of 100 men and 100 women is randomly divided into 100 pairs of 2 each. Let us compute an upper bound on the probability that at most 30 of these pairs will consist of a man and a woman. To solve this, let us number the men from 1 to 100 and, for i = 1, 2, 3, ..., 100, let

Xᵢ = 1 if man i is paired with a woman, and Xᵢ = 0 if man i is not paired with a woman,

so that E(Xᵢ) = (1)P(Xᵢ = 1) + (0)P(Xᵢ = 0) = P(Xᵢ = 1). Then the number of man-woman pairs is given by

X = Σ_{j=1}^{100} Xⱼ.

Because man i is equally likely to be paired with any of the other 199 people, of which 100 are women, we have

E(Xᵢ) = P(Xᵢ = 1) = 100/199.

Similarly, for i ≠ j,

E(XᵢXⱼ) = P(Xᵢ = 1|Xⱼ = 1)P(Xⱼ = 1) = ((100 − 1)/(199 − 2)) × (100/199) = (99/197) × (100/199).

Note that

P(Xᵢ = 1|Xⱼ = 1) = (100 − 1)/(199 − 2) = 99/197

follows because, given that man j is paired with a woman, man i is equally likely to be paired with any of the remaining 197 people, of which 99 are women. Using these results, we now have

E(X) = Σ_{j=1}^{100} E(Xⱼ) = Σ_{j=1}^{100} P(Xⱼ = 1) = 100 × (100/199)

or

E(X) = 10000/199 ≈ 50.251,

and

V(X) = Σ_{j=1}^{100} V(Xⱼ) + 2 Σ Σ_{i<j} Cov(Xᵢ, Xⱼ),

with

V(Xⱼ) = E(Xⱼ²) − (E(Xⱼ))² = P(Xⱼ = 1) − (P(Xⱼ = 1))² = (100/199)(1 − 100/199) = 9900/39601

and

Cov(Xᵢ, Xⱼ) = E(XᵢXⱼ) − E(Xᵢ)E(Xⱼ) = (99/197)(100/199) − (100/199)² = 100/7801397,

resulting in

V(X) = 100 × (9900/39601) + 2 × ((100)(100 − 1)/2) × (100/7801397)

or

V(X) = 196020000/7801397 ≈ 25.126.

The one-sided Chebyshev's inequality can now be applied to get

P(X ≤ 30) = P(X ≤ 50.25 − 20.25) ≤ 25.126/((20.25)² + 25.126)

or P(X ≤ 30) ≤ 0.0577. ¥
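A simulation sketch of Example #12 (added here): random pairings of the 200 people are generated directly. The event {X ≤ 30} lies roughly four standard deviations below the mean, so the empirical estimate is typically 0 in a modest number of replications, underscoring how conservative the 0.0577 bound is.

    import random

    random.seed(7)
    reps = 20_000
    at_most_30 = 0
    for _ in range(reps):
        people = [0] * 100 + [1] * 100  # 0 = man, 1 = woman
        random.shuffle(people)
        # pair off consecutive people and count the mixed (man-woman) pairs
        mixed = sum(people[2 * i] != people[2 * i + 1] for i in range(100))
        if mixed <= 30:
            at_most_30 += 1
    print(f"P(X <= 30) ~ {at_most_30 / reps:.6f} (one-sided bound: 0.0577)")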
Another Form For The One-Sided Chebyshev’s Inequality
It should be noted that by setting b = µ + a (or b = µ − a) in Equations (12a,b), we have

P(X ≥ b) ≤ σ²/((b − µ)² + σ²)   (13a)

when µ < b, and

P(X ≤ b) ≤ σ²/((µ − b)² + σ²)   (13b)

when µ > b, which is what was used in the previous example, with b = 30, µ = 50.25 and σ² = 25.126. By also writing

P(X ≤ b) = 1 − P(X > b) ≤ σ²/((µ − b)² + σ²)

for µ > b, we see that

P(X > b) ≥ 1 − σ²/((µ − b)² + σ²)   (14a)

when µ > b. We also have

P(X < b) ≥ 1 − σ²/((b − µ)² + σ²)   (14b)

when µ < b.

Chernoff's Bounds
Suppose that the moment generating function M_X(t) = E(e^{tX}) of a random variable X is known. Then for t > 0, we have

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E(e^{tX})/e^{ta} = e^{−ta}M(t).

For t < 0, we have

P(X ≤ a) = P(e^{tX} ≥ e^{ta}) ≤ E(e^{tX})/e^{ta} = e^{−ta}M(t).

Thus we find that

P(X ≥ a) ≤ e^{−ta}M(t)   (15a)

for all t > 0, and

P(X ≤ a) ≤ e^{−ta}M(t)   (15b)

for all t < 0, and these are known as the Chernoff bounds. By choosing the value of t which minimizes e^{−ta}M(t), we find that

P(X ≥ a) ≤ min_{t>0}(e^{−ta}M(t))   (16a)

and

P(X ≤ a) ≤ min_{t<0}(e^{−ta}M(t)).   (16b)
Example #13: The Chernoff Bounds for N(0, 1)

If Z ∼ N(0, 1), then M(t) = e^{t²/2}, and since

e^{−ta}M(t) = e^{−ta}e^{t²/2} = e^{t²/2 − at},

we see that

d/dt(e^{−ta}M(t)) = d/dt(e^{t²/2 − at}) = (t − a)e^{t²/2 − at} = 0

occurs when t = a, which says that for a > 0,

min_{t>0}(e^{−ta}M(t)) = e^{a²/2 − a·a} = e^{−a²/2}.

In a similar way we have, for a < 0,

min_{t<0}(e^{−ta}M(t)) = e^{a²/2 − a·a} = e^{−a²/2}.

Thus we find, using Equations (16a,b), that

P(Z ≥ a > 0) ≤ e^{−a²/2} and P(Z ≤ a < 0) ≤ e^{−a²/2},   (17)

which can serve as quick bounds on the standard normal distribution. ¥
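A short comparison (an added sketch) of the bound in Equation (17) with the exact tail 1 − Φ(a):

    import math
    from statistics import NormalDist

    for a in (0.5, 1.0, 2.0, 3.0):
        chernoff = math.exp(-a * a / 2)    # Equation (17)
        exact = 1.0 - NormalDist().cdf(a)  # true tail 1 - Phi(a)
        print(f"a={a}: Chernoff {chernoff:.5f} vs exact {exact:.5f}")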
Example #14: Chernoff Bounds for the Poisson Random Variable

If X is a Poisson random variable with parameter λ, then

M(t) = e^{λ(e^t − 1)}

and then

d/dt(e^{−ta}e^{λ(e^t − 1)}) = (λe^t − a)e^{λ(e^t − 1) − ta} = 0

occurs when t = ln(a/λ), which is positive provided that a > λ. This says that, for a > λ,

min_{t>0}(e^{−ta}M(t)) = e^{−a ln(a/λ)}e^{λ(a/λ − 1)} = (λ/a)^a e^{a−λ}.

Thus we find, using Equation (16a), that

P(X ≥ a) ≤ e^{−λ}(λe)^a/a^a = e^{−λ}(λe/a)^a,   (18)

which can serve as a quick bound on the Poisson distribution. ¥
Example #15: A Problem with Gambling
Consider a gambler who is likely to win a units with probability p or lose b units (i.e., win −b units) with probability q = 1 − p on every play, independently of his past results. That is, if Xᵢ is the gambler's winnings on the ith play, then the Xᵢ's are independent and

P(Xᵢ = a) = p and P(Xᵢ = −b) = q = 1 − p.

Letting

Sₙ = Σ_{i=1}^n Xᵢ

denote the gambler's winnings after n plays, let us use Chernoff's results to determine a bound on P(Sₙ ≥ w). Toward this end, we first note that

M_{Xᵢ}(t) = E(e^{tXᵢ}) = pe^{ta} + qe^{−tb}.

Then

M_{Sₙ}(t) = E(e^{tSₙ}) = E(e^{t(X₁+X₂+···+Xₙ)}) = E(e^{tX₁}e^{tX₂} ··· e^{tXₙ})

or

M_{Sₙ}(t) = E(e^{tX₁})E(e^{tX₂}) ··· E(e^{tXₙ}) = (pe^{ta} + qe^{−tb})ⁿ.

Using Chernoff's bound, we now have

P(Sₙ ≥ w) ≤ e^{−wt}(pe^{ta} + qe^{−tb})ⁿ

for all t > 0. Since

d/dt(e^{−wt}(pe^{ta} + qe^{−tb})ⁿ) = −we^{−wt}(pe^{ta} + qe^{−tb})ⁿ + ne^{−wt}(pae^{ta} − qbe^{−tb})(pe^{ta} + qe^{−tb})^{n−1}

or

d/dt(e^{−wt}(pe^{ta} + qe^{−tb})ⁿ) = e^{−wt}(pe^{ta} + qe^{−tb})^{n−1}(n(pae^{ta} − qbe^{−tb}) − w(pe^{ta} + qe^{−tb})) = 0

gives

n(pae^{ta} − qbe^{−tb}) − w(pe^{ta} + qe^{−tb}) = 0 or (na − w)pe^{t(a+b)} = (nb + w)q,

which says that

t_min = (1/(a + b)) ln((nb + w)q/((na − w)p)) = (1/(a + b)) ln(R), with R = (nb + w)q/((na − w)p),

provided that t_min > 0, i.e., R > 1. Then

P(Sₙ ≥ w) ≤ e^{−w t_min}(pe^{t_min a} + qe^{−t_min b})ⁿ,

which becomes

P(Sₙ ≥ w) ≤ R^{−w/(a+b)}(pR^{a/(a+b)} + (1 − p)R^{−b/(a+b)})ⁿ   (19a)

provided that

R = (nb + w)(1 − p)/((na − w)p) > 1.   (19b)

For example, suppose that n = 10, w = 6, a = b = 1 and p = 1/2. Then

R = (10(1) + 6)(1 − 1/2)/((10(1) − 6)(1/2)) = 4 > 1

and

P(S₁₀ ≥ 6) ≤ 4^{−6/2}((1/2)(4)^{1/2} + (1 − 1/2)(4)^{−1/2})^{10},

which reduces to

P(S₁₀ ≥ 6) ≤ (1/64)(5/4)^{10},

showing that P(S₁₀ ≥ 6) ≤ 0.14552. It should be noted that the exact probability in this case is

P(S₁₀ ≥ 6) = P(gambler wins at least 8 of the first 10 games),

which says that

P(S₁₀ ≥ 6) = (C(10,8) + C(10,9) + C(10,10))/2¹⁰ = 7/128

or P(S₁₀ ≥ 6) ≈ 0.0547. ¥
Jensen’s Expected-Value Inequality
This inequality is on expected values rather than probabilities. It states that if f(x) is a twice-differentiable convex function (which means that f″(x) ≥ 0 for all x), then

E(f(X)) ≥ f(E(X)),   (20a)

provided both E(f(X)) and E(X) exist and are finite. Also, if f(x) is a twice-differentiable concave function (which means that f″(x) ≤ 0 for all x), then

E(f(X)) ≤ f(E(X)),   (20b)

provided both E(f(X)) and E(X) exist and are finite. To prove this, we expand f(x) as a Taylor series about E(X) = µ and write

f(x) = f(µ) + f′(µ)(x − µ) + (1/2)f″(ξ)(x − µ)²,

where ξ is some value between x and µ. Since f″(ξ) ≥ 0 in the convex case, we have

f(x) ≥ f(µ) + f′(µ)(x − µ)

and then

f(X) ≥ f(µ) + f′(µ)(X − µ),

showing that

E(f(X)) ≥ E(f(µ) + f′(µ)(X − µ)) = E(f(µ)) + f′(µ)E(X − µ) = f(µ),

and so we find that E(f(X)) ≥ f(E(X)). Of course, if f(x) is a twice-differentiable concave function, then a similar proof leads to E(f(X)) ≤ f(E(X)).
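A quick numerical illustration of Equation (20a) (a sketch added here) with the convex function f(x) = x² and X ∼ Uniform(0, 1), for which E(f(X)) = 1/3 and f(E(X)) = 1/4:

    import random

    random.seed(8)
    xs = [random.random() for _ in range(100_000)]  # X ~ Uniform(0,1)

    def f(x):
        return x * x          # convex: f''(x) = 2 >= 0

    E_fX = sum(f(x) for x in xs) / len(xs)  # ~ E(X^2) = 1/3
    f_EX = f(sum(xs) / len(xs))             # ~ (E(X))^2 = 1/4
    print(f"E(f(X)) = {E_fX:.4f} >= f(E(X)) = {f_EX:.4f}")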
Example #16: A Problem in Investing
An investor is faced with the following choices: either she can invest all of her money in a risky proposition that would lead to a random return X that has mean m, or she can put the money into a risk-free venture that will lead to a return of m with probability 1. Suppose that her decision will be made on the basis of maximizing the expected value of u(R), where R is her return and u is her utility function. By Jensen's inequality, it follows that if u is a concave function, then

E(u(X)) ≤ u(E(X)) = u(m),

so the risk-free alternative is preferable, whereas if u is convex, then

E(u(X)) ≥ u(E(X)) = u(m),

so the risky investment alternative would be preferred. ¥
6. Bernoulli and Poisson Random Variables

In this final section we consider providing bounds on error probabilities when approximating a sum of independent Bernoulli random variables by a Poisson random variable. In other words, we establish bounds on how closely a sum of independent Bernoulli random variables is approximated by a Poisson random variable with the same mean. Toward this end, suppose that we want to approximate the sum of independent Bernoulli random variables with respective parameters p₁, p₂, ..., pₙ. Starting with a sequence Y₁, Y₂, ..., Yₙ of independent Poisson random variables, with Yᵢ having mean pᵢ, we will construct a sequence X₁, X₂, ..., Xₙ of independent Bernoulli random variables with parameters p₁, p₂, ..., pₙ, respectively, so that
Xᵢ = 0 with probability 1 − pᵢ, and Xᵢ = 1 with probability pᵢ.   (21a)
Plots of e^{−p} and 1 − p (for 0 ≤ p ≤ 1), as seen below,

[Figure: plots of e^{−p} (bold) and 1 − p (thin) for 0 ≤ p ≤ 1]

show that e^{−p} ≥ 1 − p, or (1 − p)e^p ≤ 1, and so we may define a sequence U₁, U₂, ..., Uₙ of independent random variables that are also independent of the Yᵢ's and are such that

Uᵢ = 0 with probability (1 − pᵢ)e^{pᵢ}, and Uᵢ = 1 with probability 1 − (1 − pᵢ)e^{pᵢ}.

We also define random variables Zᵢ so that

Zᵢ = 0 if Yᵢ = Uᵢ = 0, and Zᵢ = 1 otherwise.   (21b)

Note that

P(Zᵢ = 0) = P(Yᵢ = Uᵢ = 0) = P(Yᵢ = 0)P(Uᵢ = 0) = e^{−pᵢ}(1 − pᵢ)e^{pᵢ} = 1 − pᵢ

and

P(Zᵢ = 1) = 1 − P(Zᵢ = 0) = pᵢ,

and this shows that Zᵢ has exactly the Bernoulli distribution required of Xᵢ in Equation (21a), so we may take Xᵢ = Zᵢ for i = 1, 2, 3, ..., n.
Now if Xᵢ = 0 (or Zᵢ = 0, since Zᵢ = Xᵢ), then so must Yᵢ = 0, by the definition of Zᵢ. Therefore

P(Xᵢ ≠ Yᵢ) = P(Xᵢ = 1, Yᵢ ≠ 1) = P(Xᵢ = 1, Yᵢ = 0) + P(Yᵢ > 1),

which becomes

P(Xᵢ ≠ Yᵢ) = P(Uᵢ = 1, Yᵢ = 0) + P(Yᵢ > 1),

since Yᵢ = 0 and Uᵢ = 0 would give Zᵢ = 0 = Xᵢ, which is not the case. Thus we now find that

P(Xᵢ ≠ Yᵢ) = P(Uᵢ = 1)P(Yᵢ = 0) + 1 − P(Yᵢ ≤ 1)

or

P(Xᵢ ≠ Yᵢ) = (1 − (1 − pᵢ)e^{pᵢ})(e^{−pᵢ}) + 1 − P(Yᵢ = 0) − P(Yᵢ = 1)

or

P(Xᵢ ≠ Yᵢ) = e^{−pᵢ} − (1 − pᵢ) + 1 − e^{−pᵢ} − pᵢe^{−pᵢ},

which reduces to

P(Xᵢ ≠ Yᵢ) = pᵢ − pᵢe^{−pᵢ} = pᵢ(1 − e^{−pᵢ}) ≤ pᵢ(pᵢ) = pᵢ²,

and so we find that

P(Xᵢ ≠ Yᵢ) ≤ pᵢ²   (21c)

for i = 1, 2, 3, ..., n. Setting

X = Σ_{i=1}^n Xᵢ and Y = Σ_{i=1}^n Yᵢ   (21d)

and using the fact that X ≠ Y implies that Xᵢ ≠ Yᵢ for some value of i, together with Boole's inequality, we see that

P(X ≠ Y) ≤ P(Xᵢ ≠ Yᵢ for some i) ≤ Σ_{i=1}^n P(Xᵢ ≠ Yᵢ) ≤ Σ_{i=1}^n pᵢ².   (21e)
Next, let us define an indicator function for any event B as

I_B = 1 if B does occur, and I_B = 0 if B does not occur,   (21f)

and note that

E(I_B) = (1)P(B) + (0)P(B̅) = P(B),   (21g)

and consider, for any set of real numbers A, the quantity I_{X∈A} − I_{Y∈A}. By the definition of indicator functions (which equal either 0 or 1), we see that I_{X∈A} − I_{Y∈A} is equal to either 0 − 1 = −1, 1 − 1 = 0 − 0 = 0, or 1 − 0 = +1, and I_{X∈A} − I_{Y∈A} = 1 occurs only when I_{X∈A} = 1 and I_{Y∈A} = 0, which says that X ∈ A and Y ∉ A, so that X ≠ Y. Thus, we may conclude that

I_{X∈A} − I_{Y∈A} ≤ I_{X≠Y}.   (21h)

Taking expected values of Equation (21h) and using Equation (21g), we find that

E(I_{X∈A}) − E(I_{Y∈A}) ≤ E(I_{X≠Y})

or

P(X ∈ A) − P(Y ∈ A) ≤ P(X ≠ Y).

Since the same argument with X and Y interchanged gives P(Y ∈ A) − P(X ∈ A) ≤ P(X ≠ Y), this also says that

|P(X ∈ A) − P(Y ∈ A)| ≤ P(X ≠ Y).   (21i)
Finally, since Y is the sum of independent Poisson random variables (each with parameter λᵢ = pᵢ), we know from the reproductive property of the Poisson random variable that Y is also Poisson with parameter

λ = Σ_{i=1}^n λᵢ = Σ_{i=1}^n pᵢ,   (21j)

and so we have shown through Equation (21i) that

|P(X ∈ A) − Σ_{i∈A} e^{−λ}λ^i/i!| ≤ Σ_{i=1}^n pᵢ².   (21k)

For the special case when all the pᵢ are the same (and equal to p), we know that X is binomial with parameters n and p, λ = np, and Equation (21k) becomes

|Σ_{i∈A} C(n, i)p^i(1 − p)^{n−i} − Σ_{i∈A} e^{−np}(np)^i/i!| ≤ np².   (21l)
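As a closing numerical illustration of Equation (21l) (an added sketch), the following compares the binomial and Poisson probabilities of the illustrative event A = {0, 1, 2, 3} and checks them against the bound np²:

    import math

    n, p = 50, 0.1  # X ~ Binomial(n, p), Y ~ Poisson(np)
    lam = n * p
    A = range(0, 4)  # an illustrative event: {X <= 3}
    binom = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in A)
    poiss = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in A)
    print(f"|{binom:.5f} - {poiss:.5f}| = {abs(binom - poiss):.5f} <= {n * p * p:.2f}")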