Handout No. 5    Phil.015    April 4, 2016

JOINT AND CONDITIONAL PROBABILITY DISTRIBUTIONS, INDEPENDENT VARIABLES, AND HYPOTHESIS TESTING

1. Joint Probability Distribution Functions
Studying the relationships between two (or more) random variables requires their joint probability distribution.1 In Handout No. 4 we treated probability distributions of single discrete random variables. We now consider two (or more) random variables that are defined simultaneously on the same sample space. In this section we focus on examples involving two random variables.

First we make the following definition of the concept of a joint or, more specifically, bivariate probability distribution function (jpdf): Given a probability model (Ω, A, P) of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x1, x2, · · · , xn} and Y(Ω) = {y1, y2, · · · , ym}. The probability distribution function

pX,Y(x, y) =df P(X = x & Y = y)

is called the joint probability distribution function of X and Y. Thus, to specify a joint probability distribution of X and Y, we must specify the pairs of values (xi, yj) together with the probabilities pX,Y(xi, yj) =df P(X = xi & Y = yj) for all i = 1, 2, · · · , n and j = 1, 2, · · · , m.2 Joint probability distributions of two random variables with finitely many values are best exhibited by an n × m double-entry table, similar to the one below:
1 For example, changing one of the two related random variables may cause a change in the other. Also, a relationship between random variables can be used for predicting one variable from the knowledge of others, even when the relationship is not causal.

2 Remember that we may define a bivariate probability distribution function simply by giving the function pX,Y(x, y) without any reference to the underlying sample space Ω, event algebra A, and a probability measure thereon. Statisticians seldom refer to the underlying sample space (domain) of a random variable. All one has to know are the values of the random variables.
X \ Y |  y1             y2             · · ·   ym             |  pX
------+------------------------------------------------------+----------
 x1   |  pX,Y(x1, y1)   pX,Y(x1, y2)   · · ·   pX,Y(x1, ym)   |  pX(x1)
 x2   |  pX,Y(x2, y1)   pX,Y(x2, y2)   · · ·   pX,Y(x2, ym)   |  pX(x2)
 ...  |  ...            ...            · · ·   ...            |  ...
 xn   |  pX,Y(xn, y1)   pX,Y(xn, y2)   · · ·   pX,Y(xn, ym)   |  pX(xn)
------+------------------------------------------------------+----------
 pY   |  pY(y1)         pY(y2)         · · ·   pY(ym)         |  1
Note that pX,Y(x, y) ≥ 0 and that the double sum of all values of pX,Y(x, y) is equal to 1:

pX,Y(x1, y1) + pX,Y(x2, y1) + · · · + pX,Y(xn, y1) + · · · + pX,Y(xn, ym) = 1.
Suppose, as above, that random variables X and Y come with the joint probability distribution function pX,Y(x, y). Then the functions defined by the sums

pX(x) =df pX,Y(x, y1) + pX,Y(x, y2) + · · · + pX,Y(x, ym)

and

pY(y) =df pX,Y(x1, y) + pX,Y(x2, y) + · · · + pX,Y(xn, y)
are called the marginal probability distribution functions of X and Y, respectively. These are entered in the last column and last row of the extended double-entry table above. They are simply the probability distributions of random variables X and Y, and are used in defining conditional probability distribution functions. Please remember that from the joint probability distribution pX,Y(x, y) of X and Y we can calculate a large variety of probabilities, including not only P(X ≤ xi) and P(Y ≤ yj) (obtained by summing up the probabilities of X = x for all x ≤ xi and the probabilities of Y = y for all y ≤ yj), but also probabilities of the form P(X < Y), P(X + Y ≤ 2), P(X · Y ≥ 2), P(X ≤ 1 & Y ≥ 1), and so forth.3
3 As expected, the sum X + Y of random variables X and Y is the unique random variable Z, defined argumentwise as follows: Z(ω) =df X(ω) + Y(ω) for all ω ∈ Ω. Similarly for the product and the other algebraic operations on random variables.
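To make the computation of such probabilities from a joint table concrete, here is a minimal Python sketch. The joint table below is a hypothetical example of my own, not taken from the handout; it shows how a bivariate table stored as a dictionary yields the marginals pX and pY and event probabilities such as P(X < Y) and P(X + Y ≤ 2).

    # Hypothetical joint distribution pX,Y(x, y) for X, Y taking values 0, 1, 2.
    # The entries are nonnegative and the double sum equals 1.
    p_xy = {
        (0, 0): 0.10, (0, 1): 0.05, (0, 2): 0.05,
        (1, 0): 0.10, (1, 1): 0.20, (1, 2): 0.10,
        (2, 0): 0.05, (2, 1): 0.15, (2, 2): 0.20,
    }
    assert abs(sum(p_xy.values()) - 1.0) < 1e-9

    # Marginals: pX sums across rows, pY sums across columns.
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p

    # Event probabilities read off directly from the joint table.
    p_x_less_y = sum(p for (x, y), p in p_xy.items() if x < y)
    p_sum_le_2 = sum(p for (x, y), p in p_xy.items() if x + y <= 2)

    print("pX:", p_x)
    print("pY:", p_y)
    print("P(X < Y) =", p_x_less_y)          # 0.05 + 0.05 + 0.10 = 0.20
    print("P(X + Y <= 2) =", p_sum_le_2)     # 0.10 + 0.05 + 0.05 + 0.10 + 0.20 + 0.05 = 0.55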
Given two random variables X and Y with data as above, we say that they are statistically independent, in symbols X ⊥⊥ Y, provided that their joint probability distribution is the product of their marginal probability distributions. That is to say, the equation

pX,Y(x, y) = pX(x) · pY(y)

holds for all values x ∈ X(Ω) and y ∈ Y(Ω). Obviously, independence is a symmetric relation, so that we have X ⊥⊥ Y if and only if Y ⊥⊥ X. Furthermore, independence crucially depends on the associated probability distribution functions. Informally, independence means that knowledge that X has assumed a given value, say xi, does not affect at all the probability that Y will assume any given value, say yj. The notion of independence can be carried over to more than two random variables. We mention in passing that in the case of three random variables X, Y and Z, their joint (trivariate) probability distribution is defined by

pX,Y,Z(x, y, z) =df P(X = x & Y = y & Z = z)

and mutual independence is defined by the equation

pX,Y,Z(x, y, z) = pX(x) · pY(y) · pZ(z).
Independence implies many pleasing properties for the expectation and variance. In particular, recall that while E(X + Y) = E(X) + E(Y) always holds, for the multiplication of independent random variables we have the following:

If X ⊥⊥ Y, then E(X · Y) = E(X) · E(Y).

(The implication cannot be reversed.) Correspondingly, while the additivity law does not hold for variance in general, we have the following special situation:

If X ⊥⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
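As a numerical sanity check of these two implications, the following Python sketch builds a joint distribution as the product of two hypothetical marginals of my own choosing, so X and Y are independent by construction, and verifies both identities.

    from itertools import product

    # Hypothetical marginals; the joint is their product, so X and Y are independent by construction.
    p_x = {1: 0.2, 2: 0.5, 3: 0.3}
    p_y = {0: 0.6, 1: 0.4}
    p_xy = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}

    def expectation(dist):
        return sum(v * p for v, p in dist.items())

    def variance(dist):
        mu = expectation(dist)
        return sum((v - mu) ** 2 * p for v, p in dist.items())

    E_XY = sum(x * y * p for (x, y), p in p_xy.items())
    print(abs(E_XY - expectation(p_x) * expectation(p_y)) < 1e-12)   # True: E(X*Y) = E(X)*E(Y)

    # Distribution of X + Y, needed for Var(X + Y).
    p_sum = {}
    for (x, y), p in p_xy.items():
        p_sum[x + y] = p_sum.get(x + y, 0.0) + p
    print(abs(variance(p_sum) - (variance(p_x) + variance(p_y))) < 1e-12)   # True: additivity under independence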
Example 1: Consider the following probability experiment: Roll a balanced die once and let X : Ω6 → R be the random variable whose values are determined by the number on the die's upturned face. Hence the possible values of X are specified by the set X(Ω6) = {1, 2, 3, 4, 5, 6}. Next, after the die's outcome has been observed, toss a fair coin exactly as many times as the value of X. Let Y : (Ω2 + Ω2^2 + · · · + Ω2^6) → R be the random variable that counts the number of tails in all possible coin tossing experiments. Clearly, the possible values of Y form the set {0, 1, 2, 3, 4, 5, 6}.4

Problem: Determine the joint probability distribution pX,Y(x, y) of the random variables specified above!

Solution: All we have to do is calculate the values of pX,Y(x, y) for x = 1, 2, 3, · · · , 6 and for y = 0, 1, 2, · · · , 6:
pX,Y(1, 0) = P(X = 1 & Y = 0) = P(X = 1) · P(Y = 0 | X = 1) = (1/6)(1/2) = 1/12
pX,Y(1, 1) = P(X = 1 & Y = 1) = P(X = 1) · P(Y = 1 | X = 1) = (1/6)(1/2) = 1/12
pX,Y(2, 0) = P(X = 2 & Y = 0) = P(X = 2) · P(Y = 0 | X = 2) = (1/6)(1/4) = 1/24
pX,Y(2, 1) = P(X = 2 & Y = 1) = P(X = 2) · P(Y = 1 | X = 2) = (1/6)(2/4) = 1/12
pX,Y(2, 2) = P(X = 2 & Y = 2) = P(X = 2) · P(Y = 2 | X = 2) = (1/6)(1/4) = 1/24
pX,Y(3, 0) = P(X = 3 & Y = 0) = P(X = 3) · P(Y = 0 | X = 3) = (1/6) · C(3, 0) (1/2)^0 (1/2)^3 = 1/48
pX,Y(3, 1) = P(X = 3 & Y = 1) = P(X = 3) · P(Y = 1 | X = 3) = (1/6) · C(3, 1) (1/2)^1 (1/2)^2 = 3/48
pX,Y(3, 2) = P(X = 3 & Y = 2) = P(X = 3) · P(Y = 2 | X = 3) = (1/6) · C(3, 2) (1/2)^2 (1/2)^1 = 3/48
pX,Y(3, 3) = P(X = 3 & Y = 3) = P(X = 3) · P(Y = 3 | X = 3) = (1/6) · C(3, 3) (1/2)^3 (1/2)^0 = 1/48
· · ·

(here C(n, k) denotes the binomial coefficient "n choose k");
the remaining values of pX,Y(x, y) (there are 6 × 7 = 42 of them in total) are calculated similarly.

4 Astute readers will object that X and Y are not defined on the same sample space! This can easily be fixed by assuming that a slightly modified variant of X, denoted by X′, is actually defined on the product sample space Ω6 × Ω′, where Ω′ =df Ω2 + Ω2^2 + · · · + Ω2^6, i.e., X′ : Ω6 × Ω′ → R, constant in the second coordinate (X′ does not depend on the coin experiments), i.e., we have X′(ω, ω′) =df X(ω). Likewise, Y′ : Ω6 × Ω′ → R is a slightly modified variant of Y, presumed to be constant in the first coordinate (it does not depend directly on the die's outcome), i.e., we have Y′(ω, ω′) =df Y(ω′).
Recall that the marginal probability distribution pX(x) is calculated by summing up the rows of the table for pX,Y(x, y), so that pX(1) = 1/6, pX(2) = 1/6, · · · , and pX(6) = 1/6. Obviously, we get the same values, since rolling the die does not depend on flipping the coin. On the other hand, the marginal probability distribution pY(y) is calculated by summing up the columns of the table for pX,Y(x, y), so that pY(0) = 63/384, pY(1) = 120/384, pY(2) = 99/384, pY(3) = 64/384, pY(4) = 29/384, pY(5) = 8/384, and pY(6) = 1/384. Here the values are significantly different from case to case, since Y depends on X.
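As a check on this bookkeeping, here is a short Python sketch of my own (using the fractions module for exact arithmetic) that rebuilds the joint table for the die-and-coin experiment and reproduces the marginal values above; note that 63/384 reduces to 21/128 and 64/384 to 1/6.

    from fractions import Fraction
    from math import comb

    # pX,Y(x, y) = P(die shows x) * P(y tails in x tosses of a fair coin)
    joint = {}
    for x in range(1, 7):
        for y in range(0, x + 1):
            joint[(x, y)] = Fraction(1, 6) * comb(x, y) * Fraction(1, 2) ** x

    # Marginals: rows give pX, columns give pY.
    p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in range(1, 7)}
    p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in range(0, 7)}

    print(p_x[1], p_x[6])        # 1/6 1/6
    print(p_y[0])                # 21/128, i.e. 63/384
    print(p_y[3])                # 1/6, i.e. 64/384
    print(sum(joint.values()))   # 1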
2. Conditional Probability Distribution Functions

Given a probability model (Ω, A, P) of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x1, x2, · · · , xn} and Y(Ω) = {y1, y2, · · · , ym}. The probability distribution function of the form pY|X(y|x), defined below, is called the conditional probability distribution of random variable Y given variable X:

pY|X(y|x) =df P(X = x & Y = y) / P(X = x) = pX,Y(x, y) / pX(x),

provided the marginal distribution satisfies pX(x) > 0 for all x. Note that the foregoing distribution is in general different from the conditional probability distribution function pX|Y(x|y) of random variable X given variable Y, defined by

pX|Y(x|y) =df P(Y = y & X = x) / P(Y = y) = pX,Y(x, y) / pY(y),

provided pY(y) > 0 for all y. Of course, for a fixed value y, pX|Y(x|y) is itself a probability distribution (i.e., it is nonnegative and sums to 1). It is easy to see that if X ⊥⊥ Y, then pX|Y(x|y) = pX(x).

Example 2: Suppose we are given two random variables X and Y with respective possible values X(Ω) = {1, 2, 3} and Y(Ω) = {1, 2, 3}, and joint probability distribution pX,Y(x, y) =df (1/36) x · y.
Problem: Determine whether X and Y are independent.

Answer: Because the marginal for X is pX(x) = x/6 (with x = 1, 2, 3), the marginal for Y is pY(y) = y/6 (with y = 1, 2, 3), and pX,Y(x, y) = (1/36) x · y = pX(x) · pY(y), we have X ⊥⊥ Y. Therefore pX|Y(x|y) = pX(x) and pY|X(y|x) = pY(y).
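The check is easy to automate; this short Python sketch confirms pX,Y(x, y) = pX(x) · pY(y) for every pair of values in Example 2.

    from fractions import Fraction

    values = [1, 2, 3]
    joint = {(x, y): Fraction(x * y, 36) for x in values for y in values}

    p_x = {x: sum(joint[(x, y)] for y in values) for x in values}   # equals x/6
    p_y = {y: sum(joint[(x, y)] for x in values) for y in values}   # equals y/6

    # Independence: the joint factors as the product of the marginals everywhere.
    print(all(joint[(x, y)] == p_x[x] * p_y[y] for x in values for y in values))   # True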
Mathematical expectation automatically generalizes to the conditional case. Specifically, let X and Y be discrete random variables, and let the set of possible values of X be X(Ω) = {x1, x2, · · · , xn}. Then the conditional expectation of random variable X, given that Y = y, is defined by the weighted sum

E(X | Y = y) =df x1 · pX|Y(x1|y) + x2 · pX|Y(x2|y) + · · · + xn · pX|Y(xn|y)

of conditional probability distribution functions, where pY(y) > 0.

As expected, the conditional variance Var(X | Y = y) is defined by the sum

(x1 − µy)^2 · pX|Y(x1|y) + (x2 − µy)^2 · pX|Y(x2|y) + · · · + (xn − µy)^2 · pX|Y(xn|y),

where µy =df E(X | Y = y).
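Here is a minimal Python sketch, using a small hypothetical joint table of my own, that computes E(X | Y = y) and Var(X | Y = y) exactly as the two definitions above prescribe.

    # Hypothetical joint distribution for X in {0, 1, 2} and Y in {0, 1}.
    p_xy = {
        (0, 0): 0.15, (0, 1): 0.10,
        (1, 0): 0.25, (1, 1): 0.20,
        (2, 0): 0.10, (2, 1): 0.20,
    }

    def conditional_mean_var(p_xy, y):
        """Return E(X | Y = y) and Var(X | Y = y), computed from the joint table."""
        p_y = sum(p for (_, yy), p in p_xy.items() if yy == y)          # marginal pY(y), assumed > 0
        cond = {x: p / p_y for (x, yy), p in p_xy.items() if yy == y}   # pX|Y(x | y)
        mu = sum(x * q for x, q in cond.items())
        var = sum((x - mu) ** 2 * q for x, q in cond.items())
        return mu, var

    print(conditional_mean_var(p_xy, 0))   # E(X | Y = 0), Var(X | Y = 0)
    print(conditional_mean_var(p_xy, 1))   # E(X | Y = 1), Var(X | Y = 1)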
The purpose of conditional probability distribution functions is best seen in applications of Bayes' theorem. In particular, Bayesians write the binomial probability distribution in the conditional form

BinSn(k | p) =df C(n, k) p^k (1 − p)^(n − k),

where Sn denotes the so-called success random variable in n trials and 0 ≤ p ≤ 1 is the probability of achieving success in any single trial. Recall that BinSn(k | p) = P(Sn = k), from which all other probabilities of interest regarding Sn can be calculated. The conditional symbolization of the binomial distribution above suggests that the parameter p is best viewed as a random variable that interacts with the values of Sn. In particular, an experimenter may receive valuable information from Sn about the unknown value of p. This works in general, since whenever Y ∼ Binn(k | p),5 we have a connection, namely E(Y) = np and Var(Y) = np(1 − p).6
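As a quick numerical illustration (a sketch, not part of the handout), the following Python code evaluates the binomial formula above for n = 10 and p = 1/4 and confirms the stated connection E(Y) = np and Var(Y) = np(1 − p).

    from math import comb

    def binom_pmf(k, n, p):
        """BinSn(k | p) = C(n, k) p^k (1 - p)^(n - k)."""
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    n, p = 10, 0.25
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

    mean = sum(k * pk for k, pk in enumerate(pmf))
    var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

    print(round(mean, 10), n * p)              # 2.5    2.5
    print(round(var, 10), n * p * (1 - p))     # 1.875  1.875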
3. Classical Hypothesis Testing
A typical problem in scientific disciplines that rely on statistical models is to learn something about a particular population. (For example, a politician wants to know the opinion of a certain group of people.) It may be impractical or even impossible to examine the whole population, so one must rely on a sample from the population. From the standpoint of probability theory, a sample is modeled by a sequence of mutually independent random variables, each of which has the same probability distribution function.
5 Statisticians often write X ∼ pX(x) to indicate that random variable X comes with distribution pX(x).

6 The values of the cumulative binomial probability distribution P(Sn ≤ k | p) can be obtained from a binomial table handed out in class. Also, you can open Excel on your laptop and calculate the value of P(Sn ≤ k | p) by typing BINOMDIST(k, n, p, 1) with appropriate values for k, n and p.
Consider a probability experiment about which we formally reason in terms of a probability model (Ω, A, P). Next, suppose that n replications of the experiment are performed independently and under identical conditions (e.g., imagine some coin flipping experiments). Suppose that X is a random variable associated with the repeated experiment with a probability distribution function pX(x). The induced pair (X, pX(x)) is called a statistical model of the repeated target experiment. Using a somewhat sloppy notation, statisticians often write X ∼ p(x) to indicate that the representing random variable X of the target experiment has distribution p(x). For example, if the experiment consists of rolling a fair die once and the interest is in the number on the die's upturned face, then X will have the possible values 1, 2, 3, 4, 5, 6, and the associated probability distribution function will be the so-called uniform distribution pX(x) = 1/6, giving the same probability value to all x.
Now, if the target experiment is repeated n times (so that we have n trials, where n is any natural number), resulting in n-fold repeated observations of the values of random variable X, the correct probability model for the repeated experiment consists of n random variables X1, X2, · · · , Xn such that

(i) the probability distribution of each Xi (for i = 1, 2, · · · , n) is exactly the same, namely pX(x);

(ii) the joint probability distribution function of X1, X2, · · · , Xn is the product of the marginal distributions:

pX1,X2,··· ,Xn(x1, x2, · · · , xn) = pX(x1) · pX(x2) · · · pX(xn),

and the joint probability distribution is symmetric in all of its arguments x1, x2, · · · , xn, so that

(iii) all X1, X2, · · · , Xn are mutually independent.

The sequence X1, X2, · · · , Xn is commonly referred to as an exchangeable random sample or simply a sample. In probability theory, alternatively, the sequence X1, X2, · · · , Xn is called an iid sequence, meaning an identically and independently distributed sequence of random variables. In any case, the sequence X1, X2, · · · , Xn together with pX(x) is a probabilistic representation of n independent outcomes of a target experiment, repeated n times.

Samples have a mean. Specifically, the sample mean of an iid sequence X1, X2, · · · , Xn is defined by the random variable

X̄ =df (1/n) (X1 + X2 + · · · + Xn).
It is simply the arithmetic average of n observations. Remember that the lower case symbols x1, x2, · · · , xn denote the distinct deterministic values in a particular data set, whereas the capital symbols X1, X2, · · · , Xn are the random variables representing individual observations in each trial that take the respective values x1, x2, · · · , xn in some repeated experiment with a certain probability.

In addition, samples have a variance. We tentatively define the sample variance of an iid sequence X1, X2, · · · , Xn by the random variable

V =df (1/n) [(X1 − X̄)^2 + (X2 − X̄)^2 + · · · + (Xn − X̄)^2].

However, the usual statistical convention is to replace n in the denominator with n − 1. Therefore the sample variance is standardly defined by the sum

S^2 =df (1/(n − 1)) [(X1 − X̄)^2 + (X2 − X̄)^2 + · · · + (Xn − X̄)^2].
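For a concrete data set (the numbers below are made up for illustration), the sample mean and the sample variance with the n − 1 convention can be computed in Python as follows.

    import statistics

    # Hypothetical observed values x1, ..., xn of an iid sample X1, ..., Xn.
    data = [2.3, 1.7, 3.1, 2.9, 2.0, 2.6]
    n = len(data)

    x_bar = sum(data) / n                                    # sample mean
    v = sum((x - x_bar) ** 2 for x in data) / n              # tentative variance V (divide by n)
    s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)       # sample variance S^2 (divide by n - 1)

    print(x_bar, v, s2)

    # The statistics module uses the same n - 1 convention for the sample variance:
    assert abs(statistics.variance(data) - s2) < 1e-12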
So now we have the population mean µX = E(X) of variable X, parametrizing the associated probability distribution function pX(x | µ, σ), and the sample mean X̄ of the sample X1, X2, · · · , Xn of size n, taken from the target population. The exact relationship between the population mean µ and the sample mean X̄ is described by the weak and strong laws of large numbers. Specifically, if the purpose of sampling is to obtain information about the population average µ, the weak law of large numbers tells us that the sample average X̄ = (1/n)(X1 + X2 + · · · + Xn) is likely to be near µ when n is large. So, for large samples, we expect to find the sample average (which we can compute) close to the population average µ, which is unknown.
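The following Python sketch illustrates the weak law of large numbers by simulation: fair-die rolls have population mean µ = 3.5, and the simulated sample averages settle near 3.5 as n grows.

    import random

    random.seed(2016)

    def sample_mean(n):
        """Average of n simulated fair-die rolls; the population mean is 3.5."""
        return sum(random.randint(1, 6) for _ in range(n)) / n

    for n in (10, 100, 10_000, 100_000):
        print(n, sample_mean(n))
    # The printed averages settle near 3.5 as n increases (the weak law of large numbers).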
In parallel, we have the population variance Var(X) of X, parametrizing the associated probability distribution function pX(x | µ, σ), and the sample variance S^2 of the sample X1, X2, · · · , Xn of size n, taken from the target population. Of course, the sample standard deviation of the sample X1, X2, · · · , Xn is the random variable (statistic)

S =df √( (1/(n − 1)) [(X1 − X̄)^2 + (X2 − X̄)^2 + · · · + (Xn − X̄)^2] ).
We also have the notions of sample correlation, sample moments, and a host of other concepts, paralleling the population terminology. Generally, probability appears only to relate the calculations of the sample mean, sample variance, etc., to the population mean, population variance, and so on.

Often it is easier to work with a standardized variant of a random variable X that transforms its expectation to 0 and its standard deviation to 1. This is achieved by using the so-called standard score or Z-score, defined by the linear transformation

Z =df (X − µX) / σX.
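As a quick check of this standardization, the Python sketch below converts a made-up data set to Z-scores (using the data's own mean and standard deviation in place of µX and σX) and verifies that the resulting scores have mean 0 and standard deviation 1.

    from statistics import mean, pstdev

    data = [12.0, 15.5, 9.8, 14.2, 11.1, 13.4]     # hypothetical observations

    mu, sigma = mean(data), pstdev(data)           # mean and (population-style) standard deviation
    z = [(x - mu) / sigma for x in data]           # Z-scores

    print(round(abs(mean(z)), 12), round(pstdev(z), 12))   # 0.0 and 1.0, up to rounding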
It is easy to check that µZ = E(Z) = 0 and σZ = √Var(Z) = 1. We shall use the Z-score in calculating the so-called P-values.

Sample data provide evidence concerning hypotheses about the population from which they are drawn. Here is a typical example:

Extrasensory Perception: In attempting to determine whether or not a subject may be said to have extrasensory perception, something akin to the following procedure is commonly used. The subject is placed in one room, while, in another room, a card is randomly selected from a deck of cards. The subject is then asked to guess a particular feature (e.g., color, number, suit, etc.) of the card drawn. In this experiment, a person with no extrasensory powers might be expected to guess correctly an average number of cards. On the other hand, a person who claims that (s)he has extrasensory powers should presumably be able to guess correctly an impressively large number of cards.

For specificity, Felix from the TV show The Odd Couple claims to have ESP. Oscar tested Felix's claim by drawing a card at random from a set of four large cards, each with a different number on it, and, without showing it, he asked Felix to identify the card. They repeated this basic experiment many times. At each such trial, an individual without ESP has one chance in four (1/4) of correctly identifying the card. In 10 trials, Felix made six correct identifications. Although he did not claim to be perfect, six is rather more than 2.5 = 10 · (1/4), the average number of correctly guessed cards if Felix does not have ESP.

Question: Does this prove anything about Felix having ESP? Here is where hypothesis testing comes in. Let p denote the probability that Felix correctly identifies a card. The so-called null hypothesis H0 is that Felix has no ESP and is only guessing; in terms of the parameter p, we specify the hypothesis formally by setting

H0 : p = 1/4.
Now we need a test statistic, say the random variable Y10, that represents exactly 10 trials, each consisting of Oscar drawing a card and then asking Felix to identify it. Because the trials are independent and the probability p is the same at each trial, the statistical model is given by the binomial distribution

BinY10(k | p).

In words, Y10 is a Bernoulli process with a binomial distribution. Under hypothesis H0 (i.e., p = 1/4), the probability distribution function of Y10 is given by the following table:
Specification of BinY10(k | p) under H0 (probability assignment):

  y        :   0      1      2      3      4      5      6      7      8
  pY10(y)  :   0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000
We left out the values pY10(9) = pY10(10) = 0.000, because they are practically zero. Recall that the population mean is µ = np = 10 · (1/4) = 2.5. And Felix's score (he guessed six times correctly) is rather far from 2.5, out in a tail of the null hypothesis distribution BinY10(k | 1/4). In this sense, 6 correct is rather surprising when H0 is true.
From the table above we have to calculate P(Y10 ≥ 6 | H0):

P(6 or more correct | H0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019.
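Both the table and the tail probability are easy to reproduce in Python; the exact tail probability is about 0.0197, which the three-decimal table rounds to 0.019.

    from math import comb

    def binom_pmf(k, n, p):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    n, p0 = 10, 0.25                                     # 10 trials under H0: p = 1/4
    pmf = {k: binom_pmf(k, n, p0) for k in range(n + 1)}

    print({k: round(v, 3) for k, v in pmf.items()})      # reproduces the table above
    p_value = sum(pmf[k] for k in range(6, n + 1))       # P(Y10 >= 6 | H0)
    print(round(p_value, 4))                             # 0.0197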
That this probability is small tells us that Y10 = 6 is quite far from what we expect when p = 1/4. Simply, the sample results are rather inconsistent with the null hypothesis. On the other hand, 6 correct is not very surprising if Felix really has some degree of ESP. So let's consider the alternative hypothesis

Ha : p > 1/4,

the alternative to H0, capturing the informal conjecture that Felix has ESP. Now, the so-called P-value (or observed level of significance) is the probability in the right tail at and beyond the observed number of successes (Y10 = 6), that is

P(Y10 ≥ 6) = 0.019.
Many statisticians take the following interpretations as benchmarks:

(i) Highly statistically significant: a P-value < 0.01 is strong evidence against H0;
(ii) Statistically significant: a P-value between 0.01 and 0.05 is moderate evidence against H0; and
(iii) a P-value > 0.10 is little or no evidence against H0.

In view of the foregoing classification, at the 5% level the test is statistically significant, and therefore H0 should be rejected and Ha should be accepted instead.