Handout No. 3
Phil.015 February 29, 2016

ELEMENTS OF CLASSICAL PROBABILITY THEORY

1. Passage from Deductive to Inductive Reasoning
In this handout I introduce the basic measure-theoretic machinery for probability calculus. The philosophy behind the various definitions is discussed as we move along.

Recall that a logically valid deductive inference always provides certainty, but generally only at the cost of a loss of information. For example, we can deductively infer from the premise "All swans are white" the conclusion that a particular swan – named (say) Alma – is also white. In our formal language, the inference has the obvious predicate logic form

∀x[Sx → W x] ⊢ Sa → W a.

Note that the information (in a classical sense of the term, to be discussed later) always flows downhill from the premises to the conclusion. That is to say, for any information measure Info (that assigns non-negative real numbers, e.g., in bits, to sentences) the inequality Info(∀x[Sx → W x]) ≥ Info(Sa → W a) holds. For another example, to arrive deductively at the much desired conclusion that the universe is finite, it is necessary to come up with powerful premises that carry sufficient information about spacetime, the distribution of matter in the universe, and so forth, together with sophisticated astronomical data.

Inductive logicians are interested in a reverse form of inference, having the form

Sa → W a, Sb → W b, · · · |∽ ∀x[Sx → W x]

that goes from particular (observational) premises with less information to conclusions that generally carry vastly more information.
Because observing a couple of swans and finding them to be white does not necessarily mean that all swans are white, we use the wiggly (or wavy) entailment sign |∽, indicating that the inference is "risky" and no longer guaranteed. As is well known, David Hume and Karl Popper argued that the foregoing inductive inference (and others like it) cannot be justified, period. Simply, we cannot be certain that the conclusion is correct. A hasty conclusion is that there is no inductive logic in the foregoing sense. However, if we agree to measure the strength (possibility, plausibility, probability, degree of certainty or risk) of the inference by a number
0 ≤ p ≤ 1 that indexes or parametrizes the inductive entailment relation, as in |∽p, then the revised inductive inference

Sa → W a, Sb → W b, · · · |∽p ∀x[Sx → W x]

begins to make some sense. Simply, there is a trade-off between certainty and information flow from premises to conclusions. Given a string of premises, the more information is in the conclusion, the less certain is the inference. What is troubling in this approach is the identification of the (indexing) parameter values. For example, because induction is non-monotonic, there is no reason to think that under additional evidence, as in the inductive inference

Sa → W a, Sb → W b, Sc → W c, Sd → W d, · · · |∽p′ ∀x[Sx → W x],

the strength of the inference is bound to go up and up, i.e., that first we will have p < p′, then p′ ≪ p′′, and then p′′ ≪ p′′′, and so forth. Clearly, at any point of adding new observational data, the ornithologists may discover a black swan, forcing p to take a dive to the lowest value 0.
Another problem regarding the individuation of the parameter value p has to do with the empirical content of the sentences used in an inductive inference. For example, if the claim is "All reptiles have three-chambered hearts", in symbols ∀x[Rx → Hx], then in view of evolutionary species theory, a tiny sample (say a small n) of "normal" reptiles will suffice to accept the inductive inference

Ra1 → Ha1, Ra2 → Ha2, · · · , Ran → Han |∽p ∀x[Rx → Hx]

with a near-maximal parameter value p. This happens for those empirical laws that are discovered via exploratory data. In brief, the parameter value p has more to do with an empirical law than with some principles of inductive logic. Since much of our consideration regarding inductive inference depends on the structural properties of the parameter p, we should go deeper to uncover its nature.

Recall that deductive inference in classical sentential logic is based on truth-value assignments or truth functions. We say that a function T : L −→ {0, 1} that assigns to each sentence P a unique value T(P) (say T(P) = 1 if P is true, and T(P) = 0 if P is false) is a truth function just in case it satisfies the following conditions:

(i) T(∼P) = 1 − T(P).
(ii) T(P & Q) = T(P) · T(Q) = min{T(P), T(Q)}.
(iii) T(P ∨ Q) = max{T(P), T(Q)} = T(P) + T(Q) − T(P) · T(Q).
(iv) T(P → Q) = max{1 − T(P), T(Q)}.
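The four clauses can be verified mechanically. A minimal sketch in Python (the helper names t_not, t_and, t_or, t_if are my own, not the handout's):

```python
# Truth values are 0 (false) and 1 (true), as in the handout.
# We check that the product/min/max clauses agree with one another
# for negation, conjunction, disjunction, and the conditional.

def t_not(p):      # T(~P) = 1 - T(P)
    return 1 - p

def t_and(p, q):   # T(P & Q) = T(P)*T(Q) = min{T(P), T(Q)}
    return p * q

def t_or(p, q):    # T(P v Q) = max{T(P), T(Q)} = T(P)+T(Q)-T(P)*T(Q)
    return max(p, q)

def t_if(p, q):    # T(P -> Q) = max{1 - T(P), T(Q)}
    return max(1 - p, q)

for p in (0, 1):
    for q in (0, 1):
        assert t_and(p, q) == min(p, q)
        assert t_or(p, q) == p + q - p * q
        assert t_if(p, q) == t_or(t_not(p), q)   # P -> Q behaves as ~P v Q
```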
Thus, a truth function preserves the meanings of logical connectives, and it may be specified by the states of affairs "out there" (e.g., we set T(P) = 1 because the sentence P happens to be true in reality). Now, recall that a deductive inference Ψ1, Ψ2, · · · ⊢ Φ is logically valid if and only if for all truth functions T, whenever T(Ψ1) = 1, T(Ψ2) = 1, · · · hold simultaneously for all premises, we have T(Φ) = 1. Or simply, we have

T(Ψ1 & Ψ2 & · · · ) ≤ T(Φ)

and equivalently

T((Ψ1 & Ψ2 & · · · ) → Φ) = 1
for all truth functions T. This is the crux of the truth table method. Because, as we will note later, truth functions T are limit cases of probability measures,¹ we may wish to consider a conditional truth function

T_P(Q) =df T(P → Q)

for any sentence P. It is easy to see that T_P satisfies all the conditions itemized above. There is an obvious iteration (T_P)_Q = T_(P & Q) of conditioning, and the multiplication law

T(P & Q) = T(P) · T_P(Q),

important also in classical probability theory.

I want to make two points from this digression into deductive inference and truth functions. First, deductive logic abstracts from all truth functions T by quantifying them away from a valid deductive inference. In so doing, logical claims and inferences turn out to be claims about "all possible worlds" and not about any specific world in particular.² In contrast, in applied probability theory, certain probability functions P (to be defined later) are given or presumed to be identifiable, and not quantified away. Of course, the axioms and basic theorems of probability calculus hold for all probability functions P.

¹ We should remember that truth functions T are special probability measures that take only two values, namely 0 and 1 (earlier we used the symbols T and F). In view of this simplicity, any pair of sentences turns out to be probabilistically independent for any T.
² This sheds some light on the information loss problem.

Second, deductive logic prompts us to treat inductive inference in a logically familiar manner, having the form Ψ1, Ψ2, · · · |∽p Φ, which completely suppresses the crucial role of a function that actually assigns the numerical value p to the relationship between the premises Ψ1, Ψ2, · · · and the conclusion Φ. The inference above says that the premises Ψ1, Ψ2, · · · entail with probability p the conclusion Φ. In complete analogy with the role of truth functions T in

T((Ψ1 & Ψ2 & · · · ) → Φ) = 1,

we now rewrite the foregoing inductive inference in the equational form

P(Φ | Ψ1 & Ψ2 & · · · ) = p.

In particular, the example about white swans is now symbolized in the form of an equation, belonging to probability logic:

P(∀x[Sx → W x] | (Sa → W a) & (Sb → W b)) = p
with some P, where (i) the measure P of (un)certainty is now explicitly mentioned, (ii) the conclusion is always put first, and (iii) the converse of the inductive entailment relation |∽ is symbolized quite simply by a vertical bar |, intended to separate the conclusion from the premise. The foregoing formula expresses the fact that the degree of (un)certainty (probability) that the hypothesis ∀x[Sx → W x] is true, given the evidence or premise (Sa → W a) & (Sb → W b), is p.

Unfortunately, the problem has a further twist for applications. Exactly how should the measure P be identified? Bayesians hold that P measures the probability theorist's degrees of belief in all sentences of a given formal language, and that it should be identified via subsequent revisions in the face of new information. The so-called objectivists or frequentists argue that since P measures an objective property of a system "out there", it can be identified by calculating the relative frequencies of occurrences of events, or it may be obtained "a priori" from geometric or other symmetry considerations. Both interpretations of P are held quite widely among working statisticians.

There is still a worthwhile debate to be had on what kinds of assumptions should be made about the underlying formal language of sentences. Since in practice probability applies to finitary or countable populations, or possibly to a continuum situation, quantifiers are not needed in their fullest generality. On the other hand, because countably long disjunctions P1 ∨ P2 ∨ P3 ∨ · · · (and also conjunctions) of sentences are important in characterizing various probability experiments, classical sentential logic turns out to be too weak for the purposes of statistical inference. Statistical problems differ from most problems encountered in probability logic in that almost invariably they rely on the notions of random variables and parameters indexing probability distributions. A major attempt to overcome these problems is to develop a probability calculus quite independently of logical systems, using only set theory. All formal probability concepts are then built from set-theoretic notions. The next section develops this approach in some detail.

2. Axioms and Basic Theorems of Classical Probability Theory
Random, chance, and other processes involving uncertainty (e.g., flipping a coin, rolling a die many times, measuring a quantity, testing a product, winning the next presidential election, observing a stochastic dynamical system, and so on) are collectively called – somewhat misleadingly – probability experiments, experiments of chance, or random experiments.³ It is assumed that each outcome of a probability experiment is random or uncertain in some way. In order to be able to compute the probability of occurrence of events associated with the experiment, scientists need to find the correct sample space and an event algebra for the probability experiment under consideration – often referred to as the target experiment. One of the most basic issues in probabilistic (and statistical) modeling is the right (or at least a reasonably good) mathematical representation of the events associated with the target experiment.

A sample space – usually symbolized by Ω, Ω′, · · · – is a nonempty set of sample points, mathematically encoding all possible outcomes of the probability experiment under consideration. Simply, a sample space is a universe of sample points, encoding the possible outcomes of the target experiment we are interested in studying. Perhaps the most trivial and most familiar probability experiment is flipping a coin once. The corresponding sample space Ω =df {H, T} consists of exactly two distinct points labeled H and T.⁴ In this notation, the first encodes the empirical outcome "heads" and the second encodes the empirical outcome "tails". It is strictly a question of good mathematical modeling whether or not Ω should include a sample point E for additional outcomes, say "edge", and so forth.⁵

³ Often probability experiments are entirely conceptual. For example, in discrete probability calculus it is common to assume that a hypothetical urn contains x marbles, each colored differently from any of the others. A marble is drawn from the urn and its color is noted. It does not matter whether or not this is a real-life situation. The interest is in determining the probability that a marble of a designated color is drawn.
⁴ Because we are conceptually encoding the empirical outcomes, the points in Ω could equally be 0 and 1.

Unfortunately, sample spaces are not directly available in the logical approach to probabilistic inference. Yet they have proved to be extremely useful in upholding the linear laws of random variables and the convex manipulations of probability density functions. In many applications, sample spaces are introduced in a mixing or conditional manner. Suppose the target probability experiment consists of two steps. First a coin is flipped. Then, if the outcome is tails, a die is rolled; but if the outcome is heads, the coin is flipped again. It is immediately obvious that the correct sample space is specified by

Ω0 =df {T1, T2, T3, T4, T5, T6, HT, HH}.

Here T1 encodes the first outcome of tails and the subsequent outcome of the upturned face numbered 1, after the die has been rolled, and so on. Observe that the sample space Ω0 is actually defined by the "mixture" or "conditional sample space"

Ω ⊸ (Ω′, Ω) =df {T} × Ω′ + {H} × Ω,

where Ω = {H, T} encodes the outcomes of the first trial, and Ω′ = {1, 2, 3, 4, 5, 6} is used in the second trial if T occurs, while Ω is used again in the second trial if H occurs in the first trial. Here the sample spaces Ω′ and Ω are invoked conditionally on Ω.
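The two-step coin-then-die experiment can be spelled out concretely. A small Python sketch, with the mixture construction ⊸ rendered as a function mix of my own devising:

```python
# Conditional ("mixture") construction of the two-step sample space:
# tails is followed by a die roll, heads by a second coin flip.
OMEGA = ["H", "T"]                      # first trial
OMEGA_PRIME = [1, 2, 3, 4, 5, 6]        # die, used only after tails

def mix(first, branches):
    """Union of {w} x branches[w] over the outcomes w of the first trial."""
    return [(w, x) for w in first for x in branches[w]]

omega0 = mix(OMEGA, {"T": OMEGA_PRIME, "H": OMEGA})
# omega0 encodes {T1, ..., T6, HT, HH} as ordered pairs such as
# ('H', 'H'), ('H', 'T'), ('T', 1), ..., ('T', 6)
assert len(omega0) == 8
```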
Set theory has two standard concepts we shall need: Cartesian product and disjoint sum. Given two sets Ω and Ω′, we use the expression Ω × Ω′ to denote their Cartesian product set, i.e., the set of all pairs (ω, ω′), where ω is a member of Ω and ω′ is a member of Ω′. For example, the sample space for flipping a coin twice is defined by the Cartesian product Ω × Ω = {HH, HT, TH, TT}, which should be written more pedantically as {(H, H), (H, T), (T, H), (T, T)}. Clearly, in general Ω × Ω′ ≠ Ω′ × Ω.
Given two sets Ω and Ω′, we use the expression Ω + Ω′ to denote the disjoint sum of the sets Ω and Ω′. It is the set containing both the elements of Ω and the elements of Ω′. Possible double-counting is allowed for by labeling the elements of Ω separately by 0 and those of Ω′ by 1.
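Both constructions are easy to realize concretely. A sketch (the variable names are mine; itertools.product does the Cartesian product):

```python
from itertools import product

# Cartesian product: all ordered pairs (w, w') with w in A and w' in B.
A = ["H", "T"]
B = [1, 2]
prod = list(product(A, B))          # [('H', 1), ('H', 2), ('T', 1), ('T', 2)]
assert len(prod) == len(A) * len(B)

# Disjoint sum: tag elements of A with 0 and elements of B with 1, so
# elements common to both sets would be kept twice rather than collapsed.
disjoint_sum = [(0, a) for a in A] + [(1, b) for b in B]
assert len(disjoint_sum) == len(A) + len(B)
```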
Compositions of experiments are typically represented by Cartesian products of sample spaces (e.g., flipping a coin n times requires the n-fold Cartesian product Ω × Ω × · · · × Ω = Ωⁿ) and by iterated mixing (conditioning) constructions.

⁵ By agreeing on the universe of sample points that encodes all 'seriously' possible outcomes, we are of course resorting to idealization. If "edge" were to occur reasonably often in repeated tosses, then we would be making an unrealistic assumption by insisting on Ω = {H, T}. On the other hand, if "edge" never seems to occur, then it is inconvenient to specify the working sample space by Ω = {H, T, E}.

Slightly more complicated sample spaces are needed in cascade probability
experiments without replacement. Suppose a box contains a red ball (encoded by the point R), a blue ball (encoded by the point B), and a yellow ball (encoded by the point Y), comprising the sample space Ω =df {R, B, Y}. Two balls are selected at random in succession, where the first ball is not replaced before the second ball is drawn. It is easy to see that the correct sample space is given by Ω0 =df {RB, RY, BR, BY, YR, YB}. However, a slightly deeper conceptual analysis reveals that

Ω0 = Ω ⊸ (Ω/R, Ω/B, Ω/Y) = {R} × Ω/R + {B} × Ω/B + {Y} × Ω/Y,

where Ω/R = Ω − {R} = {B, Y}, Ω/B = Ω − {B} = {R, Y}, and Ω/Y = Ω − {Y} = {R, B} are the reduced sample spaces, conditioned by the previous outcomes. This kind of decomposition or "tree" analysis of complex sample spaces will prove to be very useful in setting up conditional probabilities.

These are all finite sample spaces, but often countably infinite sample spaces are needed. For example, consider a stream of potential customers entering a gift shop. Starting at an arbitrary point in time (e.g., 9:00 a.m.), we count how many different people come in before one makes a purchase. This number is a nonnegative integer. Because we cannot be certain that a purchase will be made within any specified finite number of potential customers, it is correct to define Ω = {0, 1, 2, · · · }. Spinning a pointer that has a continuous scale showing the angle (measured from a reference position with infinite accuracy) at which it stops requires a real half-open interval (or circle) sample space [0, 2π).

According to the objectivist approach, a probability measure P measures the probability that a certain event associated with the target experiment will occur.⁶ Events occurring in a probability experiment are mathematically encoded by the so-called measurable subsets of points of the target experiment's sample space.
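Returning to the box-of-balls cascade above, the reduced-sample-space decomposition can be generated programmatically. A sketch under the handout's encoding of outcomes as two-letter strings:

```python
# Two draws without replacement from {R, B, Y}: the second draw is
# conditioned on the first, using the reduced spaces Omega - {first}.
OMEGA = ["R", "B", "Y"]

omega0 = [first + second
          for first in OMEGA
          for second in OMEGA if second != first]

assert sorted(omega0) == ["BR", "BY", "RB", "RY", "YB", "YR"]
```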
For example, if a die is rolled once, its six possible outcomes are mathematically encoded by the sample points 1, 2, 3, 4, 5 and 6, comprising the die's sample space Ω′′ = {1, 2, 3, 4, 5, 6}. Now, the event described by "getting a prime number after rolling the die once" is encoded by the subset {2, 3, 5} of Ω′′, because there are exactly three ways to get a prime number, namely having 2, 3, or 5 dots on the die's upturned face after it lands. Likewise, the event described by "getting an odd number" is encoded by the subset {1, 3, 5} of Ω′′, since there are exactly three ways to get an odd number in the experiment defined by rolling a die once, namely, getting 1, 3, or 5 in a roll. So an event represented by a subset E of Ω′′ occurs precisely when the target experiment results in an outcome that is encoded by a sample point in the subset E. And the event represented by the subset E does not occur provided that the target experiment results in an outcome that is encoded by a sample point not in the subset E of Ω′′.

Often, by 'abuse of language', we shall call {1, 3, 5} an event, instead of, pedantically, the 'subset of' Ω′′ that mathematically encodes the event described by "getting an odd number". Likewise, by abuse of terminology, Ω′′ will often be referred to as the set of outcomes, instead of the sample space whose elements mathematically encode the empirical outcomes of the target experiment. These distinctions are meant to ensure that a mathematical model is not confused with reality.

⁶ In probability logic, a probability measure measures the probability that a sentence of a designated formal language is true.
Because combinations of and relations between events (associated with a probability experiment) play a crucial role, we need the notion of an event algebra. A mathematical reason for event algebras is prompted by measurability considerations. It turns out that not any old subset of a general sample space is measurable by a probability measure! Fortunately, in finitary settings – our major focus – all subsets are measurable. Given a sample space Ω of the target experiment, a system A of subsets of Ω is said to be an event algebra of Ω provided that it satisfies the following closure conditions:

(i) Ω ∈ A, i.e., Ω itself represents an 'event' (the so-called sure event) and hence is a member of A;⁷

(ii) Closure under Complementation: If E ∈ A, then Eᶜ ∈ A, where Eᶜ denotes the complement of E, i.e., the set of elements of Ω that do not belong to E. For example, since Ωᶜ = ∅, the so-called impossible event, encoded by ∅, is also in A. In this regard, it is well to emphasize the difference between a sample point ω of Ω and the singleton event {ω}, consisting of ω as its only member. Now, depending on how A is specified, it may but need not contain {ω}.

(iii) Closure under Union: If E ∈ A and E′ ∈ A, then E ∪ E′ ∈ A, where E ∪ E′ denotes the union of the subsets E and E′ of Ω. Recall that for two subsets E and E′ of Ω, their union, written E ∪ E′, is the subset of Ω consisting of the sample points that are in E or in E′, or in both.⁸

(iv) Closure under Countable Union: If E1 ∈ A, E2 ∈ A, · · · , then E1 ∪ E2 ∪ · · · ∈ A. In brief, for each sequence of subsets belonging to A, their countable union is also in A. This is a technical condition we shall not use.

⁷ In general, to indicate the membership of an object x in a set S, we write x ∈ S and read "x is in S".
⁸ For two subsets E and E′ of Ω, their intersection, written E ∩ E′, is the subset of Ω consisting of the sample points that are in both of the subsets E and E′.

Formally, the event algebra is a so-called Boolean (sigma) algebra of measurable subsets of a sample space that upholds all laws of sentential
logic, formulated set-theoretically. For example, since in propositional logic we have the theorem ⊢ (P & Q) ↔ [P & (P → Q)], in A we have its set-theoretic counterpart in the form of the equality A ∩ B = A ∩ (A → B), where the set-theoretic conditional (using the same symbol) is defined by A → B =df Aᶜ ∪ B. Simply, A is the set-theoretic embodiment of the formal language of enriched sentential logic (enriched with countable conjunctions and disjunctions). For example, suppose Ω′′ = {1, 2, 3, 4, 5, 6} is the sample space associated with the die-rolling experiment discussed earlier. Then it is easy to check that the system of four subsets A =df {∅, {2, 3, 5}, {1, 4, 6}, Ω′′} is an event algebra of Ω′′, convenient for calculating the probability of getting a prime number. In general, a sample space has many possible event algebras, but typically only one of them is chosen for any target experiment, contingent upon the events of interest. The simplest choice is the system of all subsets of Ω, which is trivially an event algebra. This is what we shall typically choose in the case of all finite sample spaces. In advanced probability textbooks the sample space together with its event algebra, written as a pair ⟨Ω, A⟩, is called a measurable space.
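The closure conditions for the four-subset algebra of the die example can be checked by brute force. A sketch (finite, so closure under countable union reduces to finite union):

```python
# Check the closure conditions for the die-example event algebra
# {∅, {2,3,5}, {1,4,6}, Ω}.
OMEGA = frozenset({1, 2, 3, 4, 5, 6})
ALGEBRA = {frozenset(), frozenset({2, 3, 5}), frozenset({1, 4, 6}), OMEGA}

assert OMEGA in ALGEBRA                              # (i) the sure event
for e in ALGEBRA:
    assert OMEGA - e in ALGEBRA                      # (ii) complements
    for f in ALGEBRA:
        assert e | f in ALGEBRA                      # (iii) unions
```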
The main reason for the introduction of event algebras is that they are needed for specifying the probability measures associated with target experiments. We have already noted that in some (possibly pathological) sample spaces not all subsets can be "measured" by a probability measure. Event algebras ensure that only 'measurable subsets' will be used in encoding events associated with a target experiment. So formally, probability measures are always defined on a suitable event algebra. This takes us to the next definition.

Let Ω be a sample space of a probability experiment of interest and let A be the designated event algebra of Ω, representing all events associated with the target experiment. A real-valued function P : A −→ R that assigns to each event E in A a unique real number P(E), called the probability of E (or more precisely the probability that event E occurs), is called a probability measure on the measurable space ⟨Ω, A⟩ provided that the following axioms are satisfied:

(i) Normality: P(Ω) = 1. Since Ω represents the so-called sure event that always occurs, it is fully meaningful to assign the value 1 to it.

(ii) Non-negativity: P(E) ≥ 0 for all events E in A. This axiom is motivated by the fact that it is empirically meaningless to consider negative probability values. Together, the axioms ensure that probability values always fall into the real unit interval [0, 1].

(iii) Finite Additivity: P(E ∪· E′) = P(E) + P(E′), where E ∪· E′ =df E ∪ E′ with E ∩ E′ = ∅ denotes the disjoint union of the events E and E′. Recall that two events E and E′ are said to be mutually exclusive or simply disjoint just in case E ∩ E′ = ∅.

(iv) Countable Additivity: P(E1 ∪· E2 ∪· · · · ) = P(E1) + P(E2) + · · · for any sequence E1, E2, · · · of pairwise (mutually) disjoint events in A, i.e., when Ei ∩ Ej = ∅ for all i ≠ j, i, j = 1, 2, · · · . This last axiom is a technical condition that supports many non-trivial theorems about probability measures.

A triple ⟨Ω, A, P⟩ of mathematical objects, consisting of a sample space Ω, an event algebra A on it, and a probability measure P on the event algebra, is called a probability model. Probability models are used to represent the chance mechanisms of probability experiments. Observe that this approach to inductive reasoning differs from deductive inference in that here we work with only one probability measure P at a time, whereas deduction generally relies on all truth functions T, behaving as counterparts of P.

As one immediate application of the definition above, we present the following simple example. Suppose the target probability experiment is specified by flipping an unbiased coin once. Then we can calculate the probabilities of obtaining heads and tails a priori with the help of the probability model ⟨Ω, A, P⟩, where Ω = {H, T} is the correct sample space with the meanings introduced earlier, A = {∅, {H}, {T}, Ω} is the correct event algebra (the set of all subsets of Ω), and P is defined for all events in A as follows: P(∅) = 0, P({H}) = 1/2, P({T}) = 1/2, and P(Ω) = 1. Clearly, this definition of P is highly redundant, since the probability of Ω is axiomatically 1 and, because P(Eᶜ) = 1 − P(E), the probability of {T} is immediately obtained from that of {H} (its complement), and likewise for the empty set ∅ – encoding the so-called impossible event that never occurs. So the only probability value needed to specify all the rest is P({H}). Now why is it set equal to 1/2? Because the coin is unbiased, meaning that the likelihood of obtaining heads is presumed to be the same as that of obtaining tails.
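The coin model above can be coded directly and the finite axioms checked by brute force. A sketch using Python's Fraction for exact arithmetic (the dictionary representation of P is my own device):

```python
from fractions import Fraction

# The a priori model for one flip of a fair coin: P is defined on all
# four events of the algebra, and the finite axioms are checked below.
OMEGA = frozenset({"H", "T"})
P = {frozenset(): Fraction(0),
     frozenset({"H"}): Fraction(1, 2),
     frozenset({"T"}): Fraction(1, 2),
     OMEGA: Fraction(1)}

assert P[OMEGA] == 1                                 # (i) normality
assert all(v >= 0 for v in P.values())               # (ii) non-negativity
for e in P:
    for f in P:
        if not e & f and (e | f) in P:               # disjoint events
            assert P[e | f] == P[e] + P[f]           # (iii) additivity
```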
In general, the probability measure P in a so-called Laplacean (a priori, or equally likely) probability model ⟨Ω, A, P⟩ is defined by

P(E) = #{E} / #{Ω},

where #{E} denotes the number of sample points in the subset E. Whenever we can think of the outcomes as finitary and equally likely (e.g., for symmetry and other geometric or physical reasons), the Laplacean model applies. Remember,
however, that the Laplacean model is confined to finite sample spaces only.

Problem: Suppose two dice are rolled. Find the probability that the sum of the outcomes shown by the first and second die is exactly 6.

Solution: The sample space for the experiment is given by the Cartesian product Ω′ × Ω′ = {(1, 1), (1, 2), · · · , (2, 1), (2, 2), · · · , (6, 6)}, where Ω′ = {1, 2, 3, 4, 5, 6}. Now, because there are 5 pairs of outcomes with sum 6, namely (5, 1), (4, 2), (3, 3), (2, 4), (1, 5), the probability of the event described by "The sum of outcomes shown by the first and second die is 6" is equal to 5/36.
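The Laplacean formula P(E) = #{E}/#{Ω} makes this solution a one-line enumeration. A sketch:

```python
from fractions import Fraction
from itertools import product

# Laplacean model for rolling two dice: P(E) = #E / #Omega.
OMEGA = list(product(range(1, 7), repeat=2))     # 36 equally likely pairs
event = [w for w in OMEGA if sum(w) == 6]        # (1,5), (2,4), ..., (5,1)

prob = Fraction(len(event), len(OMEGA))
assert prob == Fraction(5, 36)
```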
Note that, in general, a measurable space ⟨Ω, A⟩ admits infinitely many distinct probability measures, forming a kind of meta-sample space Prob⟨Ω, A⟩ with its own meta-event algebra, important in higher-order probability theory. Observe also that unlike in deductive inference, where usually all possible truth functions T are considered and then quantified away, in probability theory the focus tends to be on just one particular probability measure, namely the one that is believed to be (in some sense) the 'correct' probability measure for the target experiment. Hence, probabilistic reasoning tends to be local and model-driven. We will see more of this style in statistical reasoning.

Now we turn to some of the basic theorems of probability theory. Henceforth, we shall assume that we have an arbitrary but fixed probability model ⟨Ω, A, P⟩ with events A, B, C, · · · belonging to its event algebra A. We ignore a host of trivial facts, including ⊢ P(∅) = 0, ⊢ P(A) ≤ 1, ⊢ P(Aᶜ) = 1 − P(A), and ⊢ P(A) ≤ P(B) if A ⊆ B. Since the probability model above is arbitrary, the theorems below in effect hold for all probability measures, and they do not depend on Bayesian, objectivist or other interpretations at all. For this reason, we should preface each theorem below with the quantifier "for all P in Prob⟨Ω, A⟩".

P1 ⊢ P(B − A) = P(B) − P(A), if A ⊆ B.⁹

Proof. In view of A ⊆ B we can write B = (B − A) ∪· A, where B − A denotes the set-theoretic difference, meaning the set of those elements that are in B but not in A. Quite simply, we have B − A = B ∩ Aᶜ. Now, using the additivity law for probability measures we get P(B) = P((B − A) ∪· A) = P(B − A) + P(A). And this immediately gives P(B) − P(A) = P(B − A), as wanted.

P1′ ⊢ P(A − B) = P(A) − P(A ∩ B).

⁹ Suppose that A and B are two sets. Then the set A is said to be a subset of B provided that every element of A is also an element of B. This important relationship between two sets is symbolized succinctly as A ⊆ B.
P1′′ In general, we always have ⊢ P(A − B) ≥ P(A) − P(B) and ⊢ P(A ∪ B) ≤ P(A) + P(B) for any A and B.

Proof. Follows directly from Theorem P1 above.

P2 ⊢ P(A ∪ B) = P(A) + P(B) − P(A ∩ B).¹⁰ The theorem generalizes to the case of P(A ∪ B ∪ C) and larger finitary unions.

Proof. First note that A ∪ B = A ∪· [B − (A ∩ B)]. Therefore the additivity law gives P(A ∪ B) = P(A) + P(B − (A ∩ B)). Now Theorem P1 allows us to write P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
P3 ⊢ P(A) = P(AB) + P(ABᶜ).¹¹

Proof. First note that Ω = B ∪· Bᶜ for any event B. Now, since A = A ∩ Ω = A ∩ (B ∪· Bᶜ) = AB ∪· ABᶜ, by the additivity law we immediately have P(A) = P(AB) + P(ABᶜ). This theorem is generalized below in the form of the so-called law of total probability.
P4 ⊢ P(A) = P(AB1) + P(AB2) + · · · + P(ABn), where the string of events B1, B2, · · · , Bn forms a measurable partition of Ω. Recall that the events B1, B2, · · · , Bn form a measurable partition of Ω just in case they all belong to A, are pairwise (mutually) disjoint, and fill out the entire sample space, meaning B1 ∪ B2 ∪ · · · ∪ Bn = Ω. Because the events AB1, AB2, · · · , ABn are also pairwise disjoint, the theorem immediately follows.
P5 ⊢ P(A ∩ B) = 0, if P(A) = 0; and ⊢ P(A ∪ B) = 1, if P(B) = 1.
P6 ⊢ P(AB) ≥ P(A) + P(B) − 1.

Proof. First, note that the general addition law has the alternative form P(A ∩ B) = P(A) + P(B) − P(A ∪ B). Now since P(A ∪ B) ≤ 1, if we subtract the larger value (namely 1) on the right (instead of P(A ∪ B)), we immediately obtain the desired inequality.

P7 ⊢ max{0, P(A) + P(B) − 1} ≤ P(A ∩ B) ≤ min{P(A), P(B)}.

¹⁰ This theorem is often referred to as the general addition law, because there is no assumption in it about the disjointness of events.
¹¹ Given two sets A and B, often we shall write AB as short for A ∩ B.
P8 ⊢ P(A ∪ B) ≥ max{P(A), P(B)}.
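Since theorems P1–P8 hold for every probability measure, they can be spot-checked in any concrete model. A sketch in the two-dice Laplacean model (the particular events A and B are my own choices):

```python
from fractions import Fraction
from itertools import product

# Spot-check several of P1-P8 in the Laplacean model of two dice,
# where the probability of a set of outcomes is its size over 36.
OMEGA = set(product(range(1, 7), repeat=2))
P = lambda e: Fraction(len(e), len(OMEGA))

A = {w for w in OMEGA if w[0] <= 3}       # first die shows at most 3
B = {w for w in OMEGA if sum(w) == 7}     # the sum is seven

assert P(A - B) == P(A) - P(A & B)                              # P1'
assert P(A | B) == P(A) + P(B) - P(A & B)                       # P2
assert P(A) == P(A & B) + P(A - B)                              # P3
assert max(0, P(A) + P(B) - 1) <= P(A & B) <= min(P(A), P(B))   # P7
assert P(A | B) >= max(P(A), P(B))                              # P8
```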
We conclude this section with probability logic, pausing to define probability functions as analogs of truth functions. Inductive logic studies risky arguments using probability ideas. Probabilities are expressed by numbers from the unit interval. Probability logic is deductive logic enriched with a designated probability measure. In this framework logicians can analyze questions of the form "What is the probability that proposition Φ is true?". Thus, the focus is on the probability of propositions or sentences being true (or false), and not on the probability that a given event occurs (or does not occur). Suppose we are given a language L of sentential, predicate or some other logic. The notion of probability in probability logic is treated as a generalization of truth functions. Specifically, a function P : L −→ [0, 1] that assigns to each sentence Φ of the formal language L a unique real number 0 ≤ P(Φ) ≤ 1 (interpreted as the probability that Φ is true) is called a probability function or probability measure (or belief state) on L provided that the following conditions hold for all sentences Φ and Ψ in L:

(i) P(Φ) = 1, if ⊢ Φ.
(ii) P(Φ) ≤ P(Ψ), if Φ ⊢ Ψ.
(iii) P(∼Φ) = 1 − P(Φ).
(iv) P(Φ ∨ Ψ) = P(Φ) + P(Ψ) − P(Φ & Ψ).
(v) P(Φ & Ψ) = P(Φ) · P(Ψ | Φ), where P(Ψ | Φ) =df P(Ψ & Φ)/P(Φ).
(vi) P(Φ → Ψ) = P(∼Φ ∨ Ψ).
(vii) P(∀xΦ(x)) = Inf over a1, · · · , an of P(Φ(a1) & · · · & Φ(an)).
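A concrete way to obtain such a probability function on a sentential language is to fix a weight for each truth-value assignment and let P(Φ) be the total weight of the assignments making Φ true. A toy sketch over two atoms (the particular weights and the predicate representation of sentences are invented for illustration):

```python
from fractions import Fraction
from itertools import product

# A belief state over sentences built from two atoms p, q: fix a weight
# for each of the four truth-value assignments, and let P(sentence) be
# the total weight of the assignments that make the sentence true.
WORLDS = list(product([False, True], repeat=2))        # values of (p, q)
WEIGHT = dict(zip(WORLDS, [Fraction(1, 8), Fraction(1, 8),
                           Fraction(1, 4), Fraction(1, 2)]))

def P(sentence):
    return sum(WEIGHT[w] for w in WORLDS if sentence(*w))

p_ = lambda p, q: p
q_ = lambda p, q: q
assert P(lambda p, q: p or not p) == 1                 # (i): a tautology
assert P(lambda p, q: p and q) <= P(p_)                # (ii): p & q |- p
assert P(lambda p, q: not p) == 1 - P(p_)              # (iii)
assert P(lambda p, q: p or q) == \
       P(p_) + P(q_) - P(lambda p, q: p and q)         # (iv)
```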
Where a typical probability function P differs from truth functions T is in specifying logical truth and logical validity. If we wanted to define logical truth by setting ⊢ Φ precisely when P(Φ) = 1 for all probability functions P on L, we would not get anything new beyond the usual logical theorems we obtain from truth functions. However, if we set

⊢_P Φ ↔ P(Φ) = 1

for just one particular probability function P, then we obtain a deductively closed system of sentences enjoying "full belief" or certainty in the belief state P, which of course always includes all logically true sentences. The most important conclusion to be drawn from this conceptual analysis is that probability functions in logic are not quantified away but used individually. Thus, the inductive inference Ψ1, Ψ2, · · · |∽p Φ can only mean the equality

P(Φ | Ψ1 & Ψ2 & · · · ) = p

for some specific probability function P, and not for all probability functions.

3. Probabilistic Independence and Conditional Probability
Given a probability model ⟨Ω, A, P⟩, two events A and B in the algebra of events A are said to be probabilistically (or statistically) independent, in symbols A ⊥⊥ B, just in case P(A ∩ B) = P(A) · P(B). This relation enjoys several simple properties that drop out from its definition:

(i)   Triviality: Ω ⊥⊥ B and ∅ ⊥⊥ B.
(ii)  Symmetry: If A ⊥⊥ B, then B ⊥⊥ A.
(iii) Complementation: If A ⊥⊥ B, then A ⊥⊥ B̄.
(iv)  Double-complementation: If A ⊥⊥ B, then Ā ⊥⊥ B̄.
(v)   If C ⊥⊥ A and C ⊥⊥ B, then C ⊥⊥ A ∪ B, where A and B are disjoint.
(vi)  If A and B are disjoint events with 0 ≠ P(A) and 0 ≠ P(B), then not A ⊥⊥ B.
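These properties are easy to test numerically. A minimal sketch, assuming a fair six-sided die; the two events A and B are my own choice, not from the handout:

```python
from fractions import Fraction

# Laplacean model on Omega = {1, ..., 6}.
Omega = set(range(1, 7))

def P(E):
    return Fraction(len(E), len(Omega))

def indep(A, B):
    """A is probabilistically independent of B."""
    return P(A & B) == P(A) * P(B)

A = {2, 4, 6}       # "an even number is rolled"
B = {1, 2, 3, 4}    # "at most four is rolled"

assert indep(A, B)                   # 1/3 = 1/2 * 2/3
assert indep(B, A)                   # (ii) Symmetry
assert indep(A, Omega - B)           # (iii) Complementation
assert indep(Omega - A, Omega - B)   # (iv) Double-complementation
```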
As an illustrative example, consider the nickel-and-dime experiment. We toss both a nickel and a dime into the air once, the object being to observe the upturned faces on both coins. A suitable description of the four possible outcomes is afforded by four ordered pairs, specifying the sample space Ω × Ω =df {(H, H), (H, T), (T, H), (T, T)}. The event algebra A consists of all subsets of the sample space. Since we are assuming the Laplacean (equally likely) model, the probability of the event described by "The nickel falls heads" and modeled by the subset A = {(H, H), (H, T)} is P(A) = 1/2. Similarly, the probability of the event described by "The dime falls tails" and represented by the subset B = {(H, T), (T, T)} is P(B) = 1/2. From the physical nature of the experiment (the coins are physically separate objects) we know that events A and B should be independent. Since A ∩ B = {(H, T)}, we see that P(A ∩ B) = 1/4 = 1/2 · 1/2 = P(A) · P(B), hence A ⊥⊥ B, as suspected. Clearly, a given pair of events A and B may be probabilistically independent with respect to a probability measure P, but not necessarily with respect
to some other probability measure P′. In particular, note that if the probability measure P′ on the measurable space ⟨Ω × Ω, A⟩ above is now intended to represent biased coins, given by (say) P′({(H, H)}) = 0.5, P′({(H, T)}) = 0.4, P′({(T, H)}) = 0.05, and P′({(T, T)}) = 0.05, then we have P′(A) = 0.9, P′(B) = 0.45, P′(A ∩ B) = 0.400, but P′(A) · P′(B) = 0.405, so that with respect to P′ the events A and B are not probabilistically independent.
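The fair and biased measures of the nickel-and-dime example can be recomputed mechanically with exact fractions; the events A and B below are the ones named in the text:

```python
from fractions import Fraction

# Sample space for the nickel-and-dime experiment.
Omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

fair   = {w: Fraction(1, 4) for w in Omega}
biased = {("H", "H"): Fraction(1, 2), ("H", "T"): Fraction(2, 5),
          ("T", "H"): Fraction(1, 20), ("T", "T"): Fraction(1, 20)}

def P(measure, E):
    """Probability of event E under the given measure."""
    return sum(measure[w] for w in E)

A = {("H", "H"), ("H", "T")}   # "the nickel falls heads"
B = {("H", "T"), ("T", "T")}   # "the dime falls tails"

# Under the fair measure A and B are independent...
assert P(fair, A & B) == P(fair, A) * P(fair, B)
# ...but under the biased measure they are not: 0.4 != 0.9 * 0.45 = 0.405.
assert P(biased, A & B) != P(biased, A) * P(biased, B)
```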
In general, with respect to a so-called Dirac {0, 1}-valued probability measure Pω (defined by Pω(A) = 1 if ω ∈ A and Pω(A) = 0 if ω ∉ A, for all events A), all events are probabilistically independent. In particular, since each truth function T is a probability measure over the formal language of sentential logic, all pairs of sentences are automatically T-independent. Suppose an experiment consists of rolling a die twice. It is easy to see that the event described by "Getting an odd number in the first roll" is probabilistically independent of "Getting an even number in the second roll". If there are three events A, B, C, they are called pairwise independent just in case A ⊥⊥ B, A ⊥⊥ C, and B ⊥⊥ C. However, multiple independence or mutual independence, in symbols ⊥⊥ (A, B, C), means considerably more, namely, pairwise independence together with the equality P(A ∩ B ∩ C) = P(A) · P(B) · P(C). The last condition is not reducible to pairwise independence. The concept of multiple independence readily extends to any string of events.

Suppose we are given an arbitrary but fixed probability model ⟨Ω, A, P⟩ and a fixed event C with P(C) > 0. The conditional probability¹² of event A, given conditioning event C, denoted by P(A | C), is defined by the numerical fraction

P_C(A) = P(A | C) =df P(A ∩ C)/P(C).
As an example, let Ω = {1, 2, 3, 4, 5, 6} be the sample space for the roll of a die. Let B = {2, 4, 6} and A = {2}. Assume the Laplacean (equally likely) model, so that P(B) = 1/2 and P(A) = 1/6. Then it is easy to calculate that P(A | B) = 1/3 and P(B | A) = 1.
In the Laplacean (equally likely) model we have

P(A | C) = #{A ∩ C}/#{C}.
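In code, the Laplacean counting formula reduces conditional probability to set intersection; a minimal sketch of the die example above:

```python
from fractions import Fraction

# Laplacean model: P(A | C) = #(A intersect C) / #C.
Omega = {1, 2, 3, 4, 5, 6}
B = {2, 4, 6}   # "an even number is rolled"
A = {2}         # "a two is rolled"

def cond(A, C):
    """Conditional probability P(A | C) by counting, assuming #C > 0."""
    return Fraction(len(A & C), len(C))

assert cond(A, B) == Fraction(1, 3)   # P(A | B) = 1/3
assert cond(B, A) == 1                # P(B | A) = 1
```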
A conditional probability measure is just like any other probability measure, in that it satisfies the following properties of probability measures, referred to earlier:

¹² Another reading is: the probability that event A will occur, given that event C occurs.
CP1  ⊢ P(Ω | C) = 1.
CP2  ⊢ P(A | C) = 1, if C ⊆ A.
CP3  ⊢ P(A ∩ C) = P(C) · P(A | C).
CP4  ⊢ P(A | C) ≥ 0.
CP5  ⊢ P(Ā | C) = 1 − P(A | C).
CP6  ⊢ P(B − A | C) = P(B | C) − P(A | C), if A ⊆ B.
CP7  ⊢ P(A ∪ B | C) = P(A | C) + P(B | C) − P(AB | C).
CP8  ⊢ P(A | C) = P(AB | C) + P(AB̄ | C).
CP9  ⊢ P(A ∩ B | C) = P(A | C) · P(B | A ∩ C).
CP10 ⊢ P(A ∩ B ∩ C) = P(A) · P(B | A) · P(C | A ∩ B), where the conditioning events have a positive probability.
CP11 ⊢ P(A) = P(B) · P(A | B) + P(B̄) · P(A | B̄).
The proofs are not hard and are left as exercises.
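As a numerical sanity check (not a proof), CP7 and CP11 can be verified in the fair-die model; the particular events A, B, C below are my own choice, purely for illustration:

```python
from fractions import Fraction

# Laplacean model on the faces of a fair die.
Omega = frozenset(range(1, 7))

def P(E):
    return Fraction(len(E), len(Omega))

def cond(A, C):
    """P(A | C), assuming P(C) > 0."""
    return P(A & C) / P(C)

A = frozenset({1, 2, 3})
B = frozenset({2, 4, 6})
C = frozenset({1, 2, 4, 5})

# CP7: P(A ∪ B | C) = P(A | C) + P(B | C) − P(AB | C)
assert cond(A | B, C) == cond(A, C) + cond(B, C) - cond(A & B, C)

# CP11: P(A) = P(B) · P(A | B) + P(B-bar) · P(A | B-bar)
assert P(A) == P(B) * cond(A, B) + P(Omega - B) * cond(A, Omega - B)
```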
To get a better feeling for the meaning of probabilistic conditionalization P(A | C), we recall some of the most amazing similarities between logical and probabilistic conditionals. First, recall the tautology ⊢ (P & Q) ↔ [P & (P → Q)] and its truth function evaluation ⊢ T(P & Q) = T(P) · T(P → Q) for any T. Now, if we remind ourselves of the definition of a conditional truth function T_P(Q) =df T(P → Q) from Section 1, we immediately note the striking similarity between Theorem CP3 above and the truth functional identity ⊢ T(P & Q) = T(P) · T_P(Q). The similarity goes even further. Since the tautology

⊢ (P & Q & R) ↔ [P & (P → Q) & [(P & Q) → R]]

has the truth function evaluation

⊢ T(P & Q & R) = T(P) · T(P → Q) · T[(P & Q) → R],

upon reformulation into conditional truth functions we get

⊢ T(P & Q & R) = T(P) · T_P(Q) · T_{P & Q}(R),

which is an exact counterpart of the theorem

⊢ P(ABC) = P(A) · P(B | A) · P(C | AB),

where P(A) ≠ 0 and P(AB) ≠ 0. The only difference is that in the setting of deductive logic conditioning (using the symbol →) is quite simple, whereas in probability theory conditioning (using the symbol | ) can be rather complex. Earlier we mentioned that iterated truth function conditioning of the form (T_P)_Q = T_{P & Q} does not offer anything new.¹³ This is also true in the case of iterated conditionalization of probability measures, since we have P((A | B) | C) = P(A | BC). In probability theory a major complication arises from the fact that a probability measure P can take its values in the entire unit interval [0, 1].

4. Bayes' Theorem and Its Applications to Complex Experiments
Given an arbitrary but fixed probability model ⟨Ω, A, P⟩, for any pair of events A and B with P(A) · P(B) ≠ 0, we can write down two conditional probability formulas

P(A | B) = P(A ∩ B)/P(B)   and   P(B | A) = P(A ∩ B)/P(A),

giving

P(A) · P(B | A) = P(B) · P(A | B),

which is easily seen to be the probabilistic counterpart of the tautology ⊢ P & (P → Q) ↔ Q & (Q → P) and that of the truth functional identity T(P) · T_P(Q) = T(Q) · T_Q(P). It is easy to see that the foregoing equality can be rewritten into one of the simplest forms of the so-called Bayes' formula

P(A | B) = P(A) · P(B | A)/P(B)

or

P(A | B) = P(A) · P(B | A) / [P(A) · P(B | A) + P(Ā) · P(B | Ā)].
These formulas are used whenever the quantities P(A), P(B | A) and P(B | Ā) are given or can be calculated. The typical situation arises when event B occurs logically or temporally after A, so that the probabilities P(A) and P(B | A) can be readily computed. In applications, Bayes' formula is used when we know the effect of a cause and we wish to make some inference about the cause.

¹³ This fact follows from the tautology ⊢ [P → (Q → R)] ↔ [(P & Q) → R].
The general form of Bayes' Theorem (Rule) applies to a system of mutually (pairwise) exclusive and jointly exhaustive hypotheses about some potential causes, i.e., a string of potential cause events H1, H2, · · · , Hn such that Hi ∩ Hj = ∅ for i ≠ j and H1 ∪ H2 ∪ · · · ∪ Hn = Ω. Further, it is assumed that these hypotheses are not impossible, i.e., P(Hi) ≠ 0 for all i = 1, 2, · · · , n. Now, suppose event E with P(E) ≠ 0 is an effect (serving as evidence of a cause). Then we have

P(Hi | E) = P(Hi) · P(E | Hi) / [P(H1) · P(E | H1) + · · · + P(Hn) · P(E | Hn)].
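The general rule translates directly into code. In the sketch below the priors and likelihoods are invented numbers, used only to exercise the formula:

```python
from fractions import Fraction

def bayes(priors, likelihoods, i):
    """P(H_i | E) for mutually exclusive, jointly exhaustive H_1, ..., H_n."""
    total = sum(p * l for p, l in zip(priors, likelihoods))   # P(E), by CP11
    return priors[i] * likelihoods[i] / total

priors      = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]    # P(H_j)
likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]  # P(E | H_j)

posterior = bayes(priors, likelihoods, 1)   # P(H_2 | E)
assert posterior == Fraction(5, 11)
# The posteriors over an exhaustive system of hypotheses always sum to 1:
assert sum(bayes(priors, likelihoods, i) for i in range(3)) == 1
```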
Calculation of Conditional Probabilities:

Example 1: Three balls are drawn from a box containing 6 red balls, 4 white balls, and 5 blue balls. Find the probability that they are drawn in the order red, white and blue, if each ball is replaced.

Solution: In most examples below we shall tacitly assume that the sample space is obvious. The event algebra is always the family of all subsets of the sample space, and the probability is of the equal likelihood (Laplacean) type. This avoids several pedantic technicalities that do not add much to problem solving skills. Let

1. R1 = event described by "Red ball is drawn on first draw."
2. W2 = event described by "White ball is drawn on second draw."
3. B3 = event described by "Blue ball is drawn on third draw."

In view of Theorem CP10 and probabilistic independence of color drawings we have

P(R1 ∩ W2 ∩ B3) = P(R1) · P(W2 | R1) · P(B3 | R1 ∩ W2) = P(R1) · P(W2) · P(B3),

and hence

P(R1 ∩ W2 ∩ B3) = (6/(6+4+5)) · (4/(6+4+5)) · (5/(6+4+5)) = 8/225.
Example 2: Three balls are drawn from a box containing 6 red balls, 4 white balls, and 5 blue balls. Find the probability that they are drawn in the order red, white and blue, if none of the balls is replaced.

Solution: We shall use the same notation as in Example 1. Now, in view of

P(R1 ∩ W2 ∩ B3) = P(R1) · P(W2 | R1) · P(B3 | R1 ∩ W2),

we have

P(R1 ∩ W2 ∩ B3) = (6/(6+4+5)) · (4/(5+4+5)) · (5/(5+3+5)) = 4/91.
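Both ball-drawing computations can be reproduced with exact fractions:

```python
from fractions import Fraction

# Examples 1 and 2 recomputed: 6 red, 4 white, 5 blue balls (15 in all).
F = Fraction

# Example 1: with replacement the three draws are independent.
with_replacement = F(6, 15) * F(4, 15) * F(5, 15)
assert with_replacement == F(8, 225)

# Example 2: without replacement we condition on the earlier draws (CP10);
# after red is removed 14 balls remain, after white is removed 13 remain.
without_replacement = F(6, 15) * F(4, 14) * F(5, 13)
assert without_replacement == F(4, 91)
```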
Example 3: Suppose Box No. I contains 3 red and 2 blue marbles, while Box No. II contains 2 red and 8 blue marbles. A fair coin is tossed. If the coin turns up heads, a marble is chosen from Box I; if it turns up tails, a marble is chosen from Box II. Find the probability that a red marble is chosen. (Note that here the sample space for the coin can be identified with Ω = {Box I, Box II}, and the sample spaces for the boxes are Ω5 = {r1, r2, r3, b1, b2} and Ω10 = {r′1, r′2, b′1, b′2, · · · , b′8}, respectively. The composite sample space has the "conditional" form Ω ⊸ (Ω5, Ω10), prompting the use of conditional probability.)

Solution: Let R denote the event described by "A red marble is chosen." and let I denote the event that Box I is chosen. Likewise, let II denote the event that Box II is chosen. Now the probability of choosing a red marble is calculated using Theorem CP11 as follows:

P(R) = P(I) · P(R | I) + P(II) · P(R | II) = (1/2) · (3/(3+2)) + (1/2) · (2/(2+8)) = 2/5.
Example 4: Suppose Box No. I contains 3 red and 2 blue marbles, while Box No. II contains 2 red and 8 blue marbles. A fair coin is tossed, but this time it is not revealed whether it has turned up heads or tails, so that the box from which a marble was chosen is not revealed. But we do know that a red marble was chosen. What is the probability that Box I was chosen, i.e., that the coin turned up heads, given that a red marble was chosen?

Solution: This problem requires Bayes' Theorem. Let R denote the event described by "A red marble is chosen." and let I denote the event that Box I is chosen. Likewise, let II denote the event that Box II is chosen. Now the desired probability is calculated using Bayes' Theorem:

P(I | R) = P(I) · P(R | I) / [P(I) · P(R | I) + P(II) · P(R | II)] = ((1/2) · (3/(3+2))) / ((1/2) · (3/(3+2)) + (1/2) · (2/(2+8))) = 3/4.
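Examples 3 and 4 share the same ingredients, so they can be computed together; a minimal sketch with exact fractions:

```python
from fractions import Fraction

# Examples 3 and 4: total probability (CP11), then Bayes' Theorem.
F = Fraction
P_I, P_II = F(1, 2), F(1, 2)        # the fair coin chooses the box
P_R_given_I  = F(3, 5)              # Box I:  3 red marbles out of 5
P_R_given_II = F(2, 10)             # Box II: 2 red marbles out of 10

# Example 3: P(R) by the law of total probability (Theorem CP11).
P_R = P_I * P_R_given_I + P_II * P_R_given_II
assert P_R == F(2, 5)

# Example 4: P(I | R) by Bayes' Theorem.
P_I_given_R = P_I * P_R_given_I / P_R
assert P_I_given_R == F(3, 4)
```

Observe how the evidence "a red marble was chosen" raises the probability of Box I from the prior 1/2 to the posterior 3/4.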