Handout No. 4
PHIL.015
March 28, 2016

RANDOM VARIABLES AND THEIR PROBABILITY DISTRIBUTIONS

1. Passage from Probabilistic to Statistical Reasoning
In this handout I introduce the basic concepts of a random variable and its associated probability distribution function. Roughly speaking, a random variable is a numerical-valued quantity whose value depends on the outcome of a probabilistic experiment. Its associated probability distribution specifies the probability that the variable will assume a given numerical value. In applications of probability theory to probabilistic (random) experiments, the first step consists of specifying the experiment’s correct probability model ⟨Ω, A, P⟩ with a suitable sample space Ω, event algebra A, and a probability measure P thereon. The next step involves the calculation of probability values of events that are of interest. In complex (composite) experiments several probability models may become necessary. When an experiment is performed, statistically oriented experimenters are often less interested in the specific outcome (or event) of it than in some aspect of the outcome that may be shared with other particular outcomes. For example, a statistician may be curious about the number of babies born in a certain hospital each day (which is of course not a fixed quantity, because it depends on many random factors that vary from one day to another) and not at all in the particular babies themselves. In statistical usage, such a quantity is called a random variable, doubtless because its value tends to vary from one outcome to the next (hence the term “variable”) and the outcome itself depends on chance (thus, the term “random”). The mathematical concept that captures these quantities is that of a real-valued function defined on the sample space Ω. Specifically, given a probability model ⟨Ω, A, P⟩, a random variable is any function of the form X : Ω −→ R that assigns to each outcome ω in Ω a unique numerical value X(ω) in the set R of real numbers and satisfies the measurability condition
{ ω | X(ω) ≤ a } ∈ A for all real numbers a in R.

This last condition is thrown in for theoretical reasons because, strictly speaking, not all real-valued functions are considered
to be random variables. In general, there are extremely complex functions (e.g., not summable or integrable, etc.) that simply resist mathematical tractability. Fortunately, we will not have to worry about any of these, so that for us a random variable on a probability model ⟨Ω, A, P⟩ is a real-valued function defined on Ω.[1] Generally, as is common in statistics, we shall denote random variables by capital letters X, Y, and Z from the end of the Latin alphabet. In many finitary applications involving small sample spaces, a particular random variable X may be completely described by making a table of its values like the one shown below. (This is of course not practical for Ω with a large number of sample points. In such cases, a random variable is specified analytically, by providing a rule or a graph for X(ω).)
Specification of X : From Ω to R

Sample points (Ω)    Assignment    Values of X (in R)
ω1                   −→            X(ω1)
ω2                   −→            X(ω2)
· · ·                −→            · · ·
ωn                   −→            X(ωn)
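For a small finite sample space, such a table can be mirrored directly in code as a lookup structure. The following Python sketch is purely illustrative (the outcome labels and numerical values are hypothetical placeholders):

# A random variable on a small finite sample space, stored as a lookup table.
# The sample points and values below are hypothetical placeholders.
X = {
    "omega_1": 0.0,
    "omega_2": 1.0,
    "omega_3": 2.0,
}

def X_of(outcome):
    """Return the value X(omega) assigned to a sample point omega."""
    return X[outcome]

print(X_of("omega_2"))  # -> 1.0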
Example 1:
Consider an experiment in which three fair coins are tossed once. We already know that the probability model ⟨Ω8 , A, P⟩ for this experiment is specified by the 8-element sample space
Ω8 = Ω2 × Ω2 × Ω2 = { TTT, TTH, THT, HTT, HHT, HTH, THH, HHH },

the event algebra A consisting of all subsets of Ω8, and the Laplacean probability measure P, defined by

P(A) = #{A} / #{Ω8}

for any event A in A.
As statisticians, we may only be interested in the random variable X : Ω8 −→ R defined by the total number of heads in each trial. That is to say, the random variable X : Ω8 −→ R is defined argumentwise by
[Footnote 1] In what follows, we use the standard mathematical notation f : A −→ B to indicate that f is a function defined on the set A (known as its domain) and taking values in the set B (known as its codomain or range). Please do not confuse this notation with the logical conditional P → Q.
X(TTT) = 0
X(TTH) = X(THT) = X(HTT) = 1
X(THH) = X(HTH) = X(HHT) = 2
X(HHH) = 3

Example 2:
Suppose a probabilistic experiment consists of rolling two fair dice once. Its associated probability model ⟨Ω36, A, P⟩ is given by the 36-element sample space

Ω36 = { 11, 12, · · · , 16, 21, 22, · · · , 26, · · · , 61, 62, · · · , 66 },
the usual event algebra A of all subsets of Ω36, and the Laplacean probability measure, defined by the fraction

P(A) = #{A} / #{Ω36}

for any event A in A. Note that each event A gives rise to a unique 2-valued random variable I_A : Ω36 −→ R, given by I_A(ω) = 1 if ω is in A, and I_A(ω) = 0 otherwise. Here I_A is called the indicator random variable of A. Of course, I_∅ = 0 at each sample point in Ω36. Likewise, I_{Ω36} = 1 at every sample point. Indicator random variables encode information about outcomes into numbers. We have seen that sentences in deductive logic and events in event algebras are qualitative entities; so far, only the probabilities themselves were quantitative (numerical). The introduction of random variables turns probabilistic reasoning into a variant of quantitative reasoning.
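To make the definition concrete, here is a minimal Python sketch of an indicator random variable; the event chosen below (“doubles”) is just an illustrative example:

# Sample space for two dice: pairs (i, j) with i, j = 1, ..., 6.
omega36 = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# An illustrative event A: both dice show the same number ("doubles").
A = {(i, j) for (i, j) in omega36 if i == j}

def indicator_A(outcome):
    """I_A(omega): 1 if the outcome lies in A, and 0 otherwise."""
    return 1 if outcome in A else 0

print(indicator_A((3, 3)), indicator_A((3, 4)))  # -> 1 0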
Returning to the topic of random variables, suppose we are interested only in the sum of the outcomes in rolling two dice. For this we need to define a random variable Y : Ω36 −→ R by setting Y(ij) =df i + j for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. Clearly,
Y(11) = 2
Y(12) = Y(21) = 3
Y(13) = Y(22) = Y(31) = 4
Y(23) = Y(32) = Y(41) = Y(14) = 5
Y(33) = Y(42) = Y(24) = Y(15) = Y(51) = 6
Y(43) = Y(34) = Y(25) = Y(52) = Y(61) = Y(16) = 7
Y(44) = Y(53) = Y(35) = Y(26) = Y(62) = 8
Y(45) = Y(54) = Y(63) = Y(36) = 9
Y(64) = Y(46) = Y(55) = 10
Y(56) = Y(65) = 11
Y(66) = 12

Example 3:
Suppose a probability experiment consists, once again, of rolling two fair dice once. We know that its associated probability model ⟨Ω36, A, P⟩ is given by the 36-element sample space

Ω36 = { 11, 12, · · · , 16, 21, 22, · · · , 26, · · · , 61, 62, · · · , 66 }

with the event algebra A of all subsets of Ω36 and the Laplacean probability measure

P(A) = #{A} / #{Ω36}

for any event A in A.
However, this time let the random variable Z : Ω36 −→ R be specified by the maximum of the two numbers on the upturned faces. That is to say, let Z(ij) =df max{i, j} for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. Clearly, we have
Z(11) = 1
Z(12) = Z(21) = Z(22) = 2
Z(13) = Z(31) = Z(23) = Z(32) = Z(33) = 3
Z(14) = Z(24) = Z(34) = Z(44) = Z(43) = Z(42) = Z(41) = 4
Z(15) = Z(25) = Z(35) = Z(45) = Z(55) = Z(54) = Z(53) = Z(52) = Z(51) = 5
Z(16) = Z(26) = Z(36) = Z(46) = Z(56) = Z(66) = Z(65) = Z(64) = Z(63) = Z(62) = Z(61) = 6
We see that the random variable Z takes only 6 values, from 1 to 6. It is not surprising that the measurable space ⟨Ω36, A⟩ carries many, many random variables, but only a few of them are of practical interest. For example, we can have a random variable W : Ω36 −→ R that assigns to each outcome the same constant value, say 1, i.e., we have W(ij) = 1 for all outcomes ij. Needless to add, in applications this variable is not important at all.

2. Probability Distribution Functions, Expectations and Moments
Up to this point, random variables have been treated simply as functions on the sample space of a probability model. It is now time to take into account also the probability model’s probability measure. Given a random variable X : Ω −→ R on a general probability model ⟨Ω, A, P⟩, the real-valued function pX : X (Ω) −→ [0, 1], defined by
pX(xi) =df P(X = xi) = P({ ω | X(ω) = xi })
on the set
X (Ω) = { X (ω1 ), X (ω2 ), · · · , X (ωn )}
of values xi of variable X , is called the probability distribution function or simply the pdf of X . Here P (X = x i ) denotes the probability that random variable X takes value xi in a given random experiment. We now return to Example 1 , wherein we have specified X as the number of heads in tossing three fair coins. From the associated probability model we find that
pX(0) = P(X = 0) = P({TTT}) = 1/8 = 0.125
pX(1) = P(X = 1) = P({TTH, HTT, THT}) = 3/8 = 0.375
pX(2) = P(X = 2) = P({THH, HHT, HTH}) = 3/8 = 0.375
pX(3) = P(X = 3) = P({HHH}) = 1/8 = 0.125
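For readers who like to verify such tables by machine, a brute-force enumeration of the eight equally likely outcomes reproduces these values; the following Python sketch is illustrative only:

from itertools import product
from collections import Counter

# All 8 outcomes of tossing three fair coins, each with probability 1/8.
outcomes = list(product("HT", repeat=3))

# X counts the number of heads in each outcome.
counts = Counter(seq.count("H") for seq in outcomes)
pX = {x: counts[x] / len(outcomes) for x in sorted(counts)}

print(pX)  # -> {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}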
Of course, the sum of probabilities of a probability distribution function pX (x) must be equal to 1. A probability distribution function (pdf ) can also be shown graphically by labeling the x axis with the values of the random variable (belonging to the set of all values X (Ω8 )) and letting the values on the y axis represent the probabilities for equational statements of the form X = 0, X = 1, X = 2, · · · about the random variable X . The graph for the pdf of X is shown below:
[Figure: bar graph of the pdf pX(x) = P(X = x), with x = 0, 1, 2, 3 on the horizontal axis and the probabilities 0.125, 0.375, 0.375, 0.125 on the vertical axis.]
As we will note later, this is a special kind of discrete probability distribution function, called a binomial distribution. In probabilistic studies of a random variable X, statisticians are also interested in calculating the probability

FX(xi) =df P(X ≤ xi) = P(X = x1) + P(X = x2) + · · · + P(X = xi)
for any 1 ≤ i ≤ n. Given a random variable X, the above-defined function FX : R −→ [0, 1] is called the cumulative distribution function of X.
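Computationally, the cumulative distribution function is obtained by accumulating the pdf values. A minimal sketch, assuming the pdf of Example 1 computed above:

# Cumulative distribution function built from the pdf of Example 1.
pX = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

FX = {}
running_total = 0.0
for x in sorted(pX):
    running_total += pX[x]
    FX[x] = running_total

print(FX)  # -> {0: 0.125, 1: 0.5, 2: 0.875, 3: 1.0}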
The following theorems are easily verified:

THEOREM 1
(i) P(X > x) = 1 − P(X ≤ x).
(ii) P(x < X ≤ x′) = P(X ≤ x′) − P(X ≤ x) = FX(x′) − FX(x).
(iii) P(X = x) = P(X ≤ x) − P(X < x).

Suppose a coin is tossed many times and the number of heads is systematically recorded. It is possible to predict ahead of time the average number of heads. The mathematical tool for calculating the “mean” or “average” of a random variable X is the so-called expectation functional E. More generally, in studying the probability distribution of a random variable, it is often useful to be able to summarize a given aspect of the distribution by means of a single number that then serves to ‘measure’ that aspect. Statisticians call such a number a parameter of the distribution. In what follows we introduce two such parameters of a probability distribution, namely the crucial expectation and variance. Now we make the pertinent formal definition. Given a discrete random variable X : Ω −→ R on a probability model ⟨Ω, A, P⟩, the expected value of X (or simply the expectation of X) with values X(Ω) = {x1, x2, · · · , xn} is defined by the weighted average (or center of gravity)
µX = E (X ) =df x1 · pX (x1 ) + x2 · pX (x2 ) + · · · + xn · pX (xn )
We see that the expectation of a random variable is a type of “average” value of that random variable. It is a value about which the possible values of the random variable are scattered. Note that the expectation of X is determined by pX. To illustrate the idea behind expectation, let us go back to Example 1. The expected value of the random variable X (introduced in Example 1) is given by

E(X) = 0 · 1/8 + 1 · 3/8 + 2 · 3/8 + 3 · 1/8 = 1.5.
Although we think of 1.5 as an “average” value of X, it is clearly not a possible value of X at all in X(Ω8). It is, however, a value in a central location relative to the possible values 0, 1, 2, and 3 of X. The number E(X) is often called a measure of location. It is clear that it falls between the smallest value (namely 0) assumed by X and the largest value (namely 3) assumed by X. Thus, knowledge of E(X) gives a rough idea of the size of the possible values of X. Finally, we might consider a beautiful analogy between the concept of expectation and the concept of center of gravity in mechanics. If we imagine
masses of size pX(xi) being placed at points xi on the line (with i = 1, 2, · · · , n), then E(X) is exactly what physicists call the center of gravity of this mass distribution. Suppose again that we are given a probability model ⟨Ω, A, P⟩ of a random experiment and a random variable X : Ω −→ R thereon. Let g : R −→ R be any (measurable) real-valued function. Then the composite g(X) : Ω −→ R is again a random variable, defined by g(X)(ω) =df g(X(ω)) for all ω in Ω, with expectation
E(g(X)) =df g(x1) · pX(x1) + g(x2) · pX(x2) + · · · + g(xn) · pX(xn).
For example, if g(x) = 2x + 5, then the composite random variable is g(X) = 2X + 5. Given two (or more) random variables X1 : Ω −→ R and X2 : Ω −→ R, their sum X1 + X2 : Ω −→ R is again a random variable, defined coordinatewise by (X1 + X2)(ω) = X1(ω) + X2(ω) for all sample points ω in Ω. Of course, (X + c)(ω) = X(ω) + c, where c is a real constant. All other operations (subtraction, multiplication, division, etc.) on random variables are defined similarly, argumentwise. The power of reasoning in terms of random variables relies precisely on the algebraic richness of operations on random variables. For example, the so-called sample average, to be discussed later, is a random variable defined by

X̄n = (X1 + X2 + · · · + Xn) / n.
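These argumentwise definitions are easy to mimic in code: random variables become ordinary functions on the sample space, and new random variables are built from them pointwise. A small illustrative sketch, using the three-coin sample space of Example 1:

# Random variables on the three-coin sample space, treated as functions.
def X(outcome):
    """Number of heads in a three-coin outcome such as 'HTH'."""
    return outcome.count("H")

def g_of_X(outcome):
    """The composite random variable g(X) = 2X + 5."""
    return 2 * X(outcome) + 5

def X_plus_c(outcome, c=1):
    """The shifted random variable X + c, defined argumentwise."""
    return X(outcome) + c

print(X("HTH"), g_of_X("HTH"), X_plus_c("HTH"))  # -> 2 9 3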
Expectation by itself does not give much information about the random variable of interest. Additional measures are needed. One such quantity is the variance of a random variable. It measures the fluctuation or dispersion of X from its expectation (or the “spread” of pX). Let µ =df E(X) be the expectation of the random variable X, with the same conditions regarding X(Ω) as above. For technical reasons, the variance of X, denoted σX² = Var(X), is defined by the quadratic distance

Var(X) = E((X − µ)²) = (x1 − µ)² · pX(x1) + (x2 − µ)² · pX(x2) + · · · + (xn − µ)² · pX(xn).
Since Var(X) ≥ 0 and the units (inches, degrees Fahrenheit, etc.) of random variables are important, statisticians are more interested in the standard deviation σX of X, defined by the square root

σX =df √(Var(X)) = √(E((X − µ)²)).

Since σX² = Var(X) = E(X²) − µ², the variance of X from Example 1 is given by Var(X) = 3 − 9/4 = 3/4 = 0.75, and hence the standard deviation is σX = √(3/4) ≈ 0.866. We see that the dispersion is quite large; about one head in each trial.
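These numbers are easy to recompute directly from the pdf of Example 1; the following Python sketch is illustrative only:

import math

# pdf of X (number of heads in three fair coin tosses), from Example 1.
pX = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

mu = sum(x * p for x, p in pX.items())               # E(X)
var = sum((x - mu) ** 2 * p for x, p in pX.items())  # Var(X) = E((X - mu)^2)
sigma = math.sqrt(var)                               # standard deviation

print(mu, var, round(sigma, 3))  # -> 1.5 0.75 0.866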
The probability distribution function defined in Example 1 is a special case of the so-called binomial distribution function, defined by

B_{Sn}(k) =df P^⊗n(Sn = k) = C(n, k) · p^k · (1 − p)^(n−k),

where C(n, k) denotes the binomial coefficient (see Section 3), and the sample space is the n-fold product Ω2^n = {0, 1} × {0, 1} × · · · × {0, 1} of the 2-element or doubleton set {0, 1}, representing the outcome of a coin toss performed n times, or any other binary outcome of a binary experiment executed independently n times, so that the entries in Ω2^n are n-fold sequences of 0s and 1s (heads and tails, etc.). In addition, the function Sn : Ω2^n −→ R is a so-called success random variable on Ω2^n that counts the number of 1s, heads (as in Example 1), or other “success” outcomes in each n-fold sequence of independent trials. Finally, 0 ≤ p ≤ 1 is a parameter capturing the probability of getting 1, a head, or anything else of interest, which is set equal to 1/2 for a fair coin.

To summarize, Sn : Ω2^n −→ R is a random variable whose probability distribution function is B_{Sn} with parameter p. This means that Sn can take the values 0, 1, 2, · · · , n, and the probability that Sn equals k is given by C(n, k) · p^k · (1 − p)^(n−k), where 0 ≤ k ≤ n. It is not hard to show that the expectation of Sn is E(Sn) = np and that its variance is Var(Sn) = np(1 − p).
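A short computational sketch of this pmf (illustrative only, using Python’s math.comb for the binomial coefficient); with n = 3 and p = 1/2 it reproduces the distribution of Example 1:

from math import comb

def binomial_pmf(k, n, p):
    """P(Sn = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# n = 3 fair coin tosses reproduce the pdf of Example 1.
print([binomial_pmf(k, 3, 0.5) for k in range(4)])  # -> [0.125, 0.375, 0.375, 0.125]

# Expectation and variance: E(Sn) = n*p and Var(Sn) = n*p*(1 - p).
n, p = 3, 0.5
print(n * p, n * p * (1 - p))  # -> 1.5 0.75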
Returning to Example 2, from the associated probability model we find that the probability distribution function pY of Y is given by the list of equations
pY(2) = P(Y = 2) = P({11}) = 1/36
pY(3) = P(Y = 3) = P({12, 21}) = 2/36
pY(4) = P(Y = 4) = P({13, 22, 31}) = 3/36
pY(5) = P(Y = 5) = P({23, 32, 41, 14}) = 4/36
pY(6) = P(Y = 6) = P({33, 42, 24, 15, 51}) = 5/36
pY(7) = P(Y = 7) = P({43, 34, 25, 52, 16, 61}) = 6/36
pY(8) = P(Y = 8) = P({44, 53, 35, 26, 62}) = 5/36
pY(9) = P(Y = 9) = P({54, 45, 63, 36}) = 4/36
pY(10) = P(Y = 10) = P({64, 46, 55}) = 3/36
pY(11) = P(Y = 11) = P({56, 65}) = 2/36
pY(12) = P(Y = 12) = P({66}) = 1/36
It is easy to check that the sum of the probability values P(Y = xi) is 1. Note that Y is not defined at sample points that fall outside the sample space Ω36. The graph of the probability distribution function pY is shown in the figure below:
[Figure: bar graph of the pdf pY(x) = P(Y = x), with x = 2, 3, · · · , 12 on the horizontal axis and the probabilities 1/36, 2/36, · · · , 6/36, · · · , 2/36, 1/36 on the vertical axis.]
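The distribution pY can also be obtained by brute-force enumeration of the 36 equally likely pairs; the following Python sketch is illustrative only:

from fractions import Fraction
from collections import Counter

# All 36 equally likely outcomes ij of rolling two fair dice.
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]

counts = Counter(i + j for (i, j) in pairs)
pY = {y: Fraction(counts[y], 36) for y in sorted(counts)}

print(pY[7], sum(pY.values()))  # -> 1/6 1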
It is not hard (but tedious) to show that E(Y) = 7. A simpler method is to calculate the expected value of the outcome of rolling a fair die and then apply the additivity property of the expectation functional. Specifically, recall that the experiment specified by rolling a fair die has the obvious Laplacean probability model ⟨Ω6, A, P⟩ with a 6-element sample space Ω6. Now, let the function X1 : Ω6 −→ R be the random variable defined by the number of dots on the die’s upturned face in a given trial. Obviously, X1 takes six possible values, from 1 to 6. Using the definition of expectation of a random variable, it is easy to check that E(X1) = 7/2. In addition, by a related definition we have E(X1²) = 91/6 for the square random variable X1², important for calculating the variance of X1. Because the expectation functional is linear, i.e., it satisfies the linearity conditions for any pair of random variables X1, X2 : Ω −→ R, as recalled in the theorem below,

THEOREM 2
(i) E(X1 + X2) = E(X1) + E(X2);
(ii) E(aX1) = aE(X1);
(iii) E(−X1) = −E(X1);

and since the random variable Y in Example 2 is given by the sum Y = X1 + X2 of two random variables (characterizing the upturned faces of the two dice), we have E(Y) = E(X1 + X2) = E(X1) + E(X2) = 7/2 + 7/2 = 7.
Now, because Var(X1) = E(X1²) − (E(X1))², the variance of the random variable X1 (capturing the numbers on a die’s upturned faces) is

Var(X1) = 91/6 − 49/4 = 35/12 ≈ 2.9167.
Turning to the variance Var(Y) = Var(X1 + X2) of the random variable Y, defined by the sum of the numbers on the upturned faces of two fair dice that are rolled (discussed in Example 2), in view of the independence of X1 and X2 (rolling two dice is a statistically independent experiment), the additivity Var(X1 + X2) = Var(X1) + Var(X2) holds, so that

Var(Y) = Var(X1) + Var(X2) = 35/12 + 35/12 = 35/6 ≈ 5.8333.
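A brief numerical check of these values (an illustrative sketch that enumerates all 36 pairs directly):

# Direct verification of E(Y) = 7 and Var(Y) = 35/6 over all 36 pairs.
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
sums = [i + j for (i, j) in pairs]

EY = sum(sums) / 36
VarY = sum((y - EY) ** 2 for y in sums) / 36

print(EY, VarY, 35 / 6)  # -> 7.0 5.8333... 5.8333...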
We now return to Example 3 to illuminate further the concepts of probability distribution functions (pdf) and cumulative distribution functions (cdf). Recall that the experiment consists of rolling two balanced dice, with the probability model ⟨Ω36, A, P⟩ given by the 36-element sample space

Ω36 = { 11, 12, · · · , 16, 21, 22, · · · , 26, · · · , 61, 62, · · · , 66 }

with the event algebra A of all subsets of Ω36 and the Laplacean probability measure

P(A) = #{A} / #{Ω36}

for any event A in A. The random variable Z : Ω36 −→ R of interest is given by the maximum of the two numbers on the upturned faces, i.e., Z(ij) =df max{i, j} for all outcomes or (sample) pairs ij in Ω36 = Ω6 × Ω6. From the associated probability model we find that the probability distribution function pZ of Z is given by the list of equations
pZ(1) = P(Z = 1) = P({11}) = 1/36
pZ(2) = P(Z = 2) = P({12, 22, 21}) = 3/36
pZ(3) = P(Z = 3) = P({13, 23, 33, 32, 31}) = 5/36
pZ(4) = P(Z = 4) = P({14, 24, 34, 44, 43, 42, 41}) = 7/36
pZ(5) = P(Z = 5) = P({15, 25, 35, 45, 55, 54, 53, 52, 51}) = 9/36
pZ(6) = P(Z = 6) = P({16, 26, 36, 46, 56, 66, 65, 64, 63, 62, 61}) = 11/36
The graphical representation of pZ is shown below.
[Figure: bar graph of the pdf pZ(x) = P(Z = x), with x = 1, 2, · · · , 6 on the horizontal axis and the probabilities 1/36, 3/36, 5/36, 7/36, 9/36, 11/36 on the vertical axis.]
The cumulative distribution function FZ, defined by

FZ(xi) =df P(Z ≤ xi) = P(Z = x1) + P(Z = x2) + · · · + P(Z = xi)
for any 1 ≤ i ≤ 6, is specified by
FZ(x) =
  0       if x < 1
  1/36    if 1 ≤ x < 2
  4/36    if 2 ≤ x < 3
  9/36    if 3 ≤ x < 4
  16/36   if 4 ≤ x < 5
  25/36   if 5 ≤ x < 6
  1       if x ≥ 6
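The following illustrative Python sketch recomputes pZ by enumeration and evaluates the step function FZ above:

from fractions import Fraction
from collections import Counter

# pZ by enumeration: Z(ij) = max(i, j) over the 36 equally likely pairs.
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]
counts = Counter(max(i, j) for (i, j) in pairs)
pZ = {z: Fraction(counts[z], 36) for z in range(1, 7)}

def FZ(x):
    """Cumulative distribution function P(Z <= x)."""
    return sum(pZ[z] for z in pZ if z <= x)

print(pZ[4], FZ(4.5))  # -> 7/36 4/9 (i.e., 16/36)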
3. Counting Algorithms and Combinatorics
Combinatorial analysis deals with counting. To count the number of outcomes of a finitary probability experiment, it is often useful to look for special patterns and algorithms supporting counting techniques. The simplest rule is called the Fundamental counting rule or Multiplication Principle: for a sequence of n events in which the first event can occur in k1 ways, the second event can occur in k2 ways, the third event can occur in k3 ways, and so on, the total number of ways the sequence of n events can occur is the product k1 · k2 · k3 · · · · · kn.
The number of permutations (i.e., possible ordered arrangements) of n different objects is given by the descending product

n! =df n · (n − 1) · (n − 2) · · · · · 2 · 1,

read “n factorial”. For example, there are exactly 6 = 3! ordered arrangements of three events A, B, C, namely ABC, ACB, BAC, BCA, CBA, CAB. The point is that one can put any of n objects in the first place, but then only n − 1 are left for the second place, and so forth, until only one is left for the last place. By convention, we set 0! = 1, and of course 1! = 1. A slightly more interesting case arises when some of the objects are identical or alike, as for example the letters “I”, “S”, and “P” in MISSISSIPPI. In this case, the number of permutations of n objects, where k1 are identical (or alike) of type or kind one, k2 are identical (or alike) of type or kind two, and so on, and km are identical of type m, with k1 + k2 + · · · + km = n, is given by the fraction
C(n; k1, k2, · · · , km) =df n! / (k1! · k2! · · · · · km!).
Question: How many different permutations can be made from the letters of the word “MISSISSIPPI”? Answer: Since n = 11 and there are 4 types in total (types M, S, I, and P), we can set k1 = 4 for type S, k2 = 4 for type I, k3 = 2 for type P, and k4 = 1 for type M. Therefore, in this example the number of permutations is 11!/(4! · 4! · 2! · 1!) = 34,650.
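A quick illustrative check of this count in code, using factorials and collections.Counter for the letter multiplicities:

from math import factorial
from collections import Counter

word = "MISSISSIPPI"
multiplicities = Counter(word)  # letter counts: I -> 4, S -> 4, P -> 2, M -> 1

denominator = 1
for k in multiplicities.values():
    denominator *= factorial(k)

print(factorial(len(word)) // denominator)  # -> 34650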
The number of ordered arrangements of n objects using k (k ≤ n) objects at a time is given by the formula

n! / (n − k)!,
and is called the permutation of n distinct objects taking k objects at a time. For example, two letters from the set of three {A, B, C} can be arranged in exactly 3!/1! = 6 ways, namely AB, BA, AC, CA, BC, CB. Finally, the most important algorithm is the combination rule: the number of ways of selecting k objects (k ≤ n) from a list of n distinct objects without regard to order is
C(n, k) =df n! / (k! · (n − k)!).
For example, two letters from the set of three {A, B, C} can be selected without regard to order in exactly 3!/(2! · 1!) = 3 ways, namely AB, BC, AC.
Another example: In a classroom there are 8 women and 5 men. A committee of 3 women and 2 men is to be formed for a project. How many different possibilities are there? Answer: There are C(8, 3) · C(5, 2) = 56 · 10 = 560 different ways to make the selection.
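An illustrative sketch of these counting rules using Python’s math.perm and math.comb:

from math import comb, perm

print(perm(3, 2))               # ordered arrangements of 2 letters out of 3 -> 6
print(comb(3, 2))               # unordered selections of 2 letters out of 3 -> 3
print(comb(8, 3) * comb(5, 2))  # committees of 3 women and 2 men -> 560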