Artificial Intelligence EECS 491
Probabilistic Reasoning
The process of probabilistic inference:
1. define a model of the problem
2. derive posterior distributions and estimators
3. estimate parameters from data
4. evaluate model accuracy
Axioms of probability

• Axioms (Kolmogorov):
  - 0 ≤ P(A) ≤ 1
  - P(true) = 1, P(false) = 0
  - P(A or B) = P(A) + P(B) − P(A and B)

• Corollaries:
  - The probabilities of the values of a single random variable must sum to 1:

      Σ_{i=1}^{n} P(D = d_i) = 1

  - The joint probability of a set of variables must also sum to 1.
  - If A and B are mutually exclusive: P(A or B) = P(A) + P(B)
Rules of probability

• conditional probability:

    P(A|B) = P(A and B) / P(B),   for P(B) > 0

• corollary (Bayes' rule):

    P(A and B) = P(A|B) P(B) = P(B|A) P(A)
    ⇒ P(B|A) = P(A|B) P(B) / P(A)
Discrete probability distributions

• discrete probability distribution
• joint probability distribution
• marginal probability distribution
• Bayes' rule
• independence
Basic concepts

Making rational decisions when faced with uncertainty:

• Probability: the precise representation of knowledge and uncertainty
• Probability theory: how to optimally update your knowledge based on new information
• Decision theory (probability theory + utility theory): how to use this information to achieve maximum expected utility

Basic probability theory:
• random variables
• probability distributions (discrete) and probability densities (continuous)
• rules of probability
• expectation and the computation of 1st and 2nd moments
• joint and multivariate probability distributions and densities
• covariance and principal components
The Joint Distribution

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

  A  B  C  Prob
  0  0  0  0.30
  0  0  1  0.05
  0  1  0  0.10
  0  1  1  0.05
  1  0  0  0.05
  1  0  1  0.10
  1  1  0  0.25
  1  1  1  0.10

[Venn-diagram figure of A, B, C showing the eight region probabilities]

Slides like this one are from Andrew Moore, CMU. Copyright © 2001, Andrew W. Moore.
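The recipe above can be sketched in code (a minimal illustration; the slides themselves contain no code — Python and the variable names are editorial choices). The joint distribution is just a table from value combinations to probabilities, and the axioms require the entries to sum to 1:

```python
# The joint distribution of three Boolean variables from the example table,
# stored as a mapping from (A, B, C) value combinations to probabilities.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Axiom check: the entries of a joint distribution must sum to 1.
total = sum(joint.values())
print(round(total, 10))  # 1.0
```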
Using the Joint

Once you have the joint distribution you can ask for the probability of any logical expression involving your attributes:

  P(E) = Σ_{rows matching E} P(row)
Examples:

  P(Poor and Male) = 0.4654
  P(Poor) = 0.7604
Inference with the Joint

  P(E1 | E2) = P(E1 and E2) / P(E2)
             = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)

Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612
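The marginal and conditional formulas above can be written as small query functions over a joint table (a sketch; the `joint` table reuses the Boolean A, B, C example from earlier, and the helper names are editorial, not from the slides):

```python
# Joint table over (A, B, C) from the earlier example.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of P(row) over rows matching the event predicate."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

# P(A = 1): sum of the four rows with A = 1.
p_a = prob(lambda r: r[0] == 1)  # 0.05 + 0.10 + 0.25 + 0.10 = 0.50
# P(B = 1 | A = 1) = P(A = 1 and B = 1) / P(A = 1) = 0.35 / 0.50 = 0.7
p_b_given_a = cond_prob(lambda r: r[1] == 1, lambda r: r[0] == 1)
print(p_a, p_b_given_a)
```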
Continuous probability distributions

• probability density function (pdf)
• joint probability density
• marginal probability
• calculating probabilities using the pdf
• Bayes' rule
A PDF of American Ages in 2000

[figure: density of American ages in 2000, from Andrew Moore's slides]
Let X be a continuous random variable. If p(x) is a probability density function for X, then

  P(a ≤ X < b) = ∫_a^b p(x) dx

For the age distribution above:

  P(30 ≤ Age < 50) = ∫_30^50 p(age) d(age) = 0.36
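The integral above can be illustrated numerically. The real age pdf is not available here, so this sketch substitutes a hypothetical exponential density (rate 1/30, chosen only for illustration) and compares a simple Riemann-sum integration against the known closed form:

```python
import math

# Illustrative stand-in density: exponential with rate 1/30.
# (The actual age pdf from the slides is not available; the point is only
# that probabilities are computed by integrating p(x).)
rate = 1.0 / 30.0
def p(x):
    return rate * math.exp(-rate * x)

def prob_between(a, b, n=100000):
    """Midpoint Riemann sum approximating the integral of p over [a, b]."""
    dx = (b - a) / n
    return sum(p(a + (i + 0.5) * dx) for i in range(n)) * dx

approx = prob_between(30, 50)
# Closed form for this density: P(a <= X < b) = exp(-a/30) - exp(-b/30).
exact = math.exp(-30.0 / 30.0) - math.exp(-50.0 / 30.0)
print(round(approx, 6), round(exact, 6))
```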
What does p(x) mean?

• It does not mean a probability! First of all, it's not necessarily a value between 0 and 1; it's just a (non-negative) value, and an arbitrary one at that.
• A density value p(a) can only be compared relatively to other values p(b).
• Talking about the density at a point indicates the relative probability integrated over a small delta:

  If p(a) = α p(b), then

    lim_{h→0} P(a − h ≤ X ≤ a + h) / P(b − h ≤ X ≤ b + h) = α
Expectations

E[X] = the expected value of random variable X
     = the average value we'd see if we took a very large number of random samples of X

  E[X] = ∫_{−∞}^{∞} x p(x) dx

For the age example: E[age] = 35.897

• = the first moment of the shape formed by the axes and the density curve
• = the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error
Expectation of a function

μ = E[f(X)] = the expected value of f(x) where x is drawn from X's distribution
            = the average value we'd see if we took a very large number of random samples of f(X)

  μ = ∫_{−∞}^{∞} f(x) p(x) dx

Example: E[age²] = 1786.64 ≠ (E[age])² = 1288.62

Note that in general:

  E[f(X)] ≠ f(E[X])
Variance

σ² = Var[X] = the expected squared difference between x and E[X]:

  σ² = ∫_{−∞}^{∞} (x − μ)² p(x) dx

Var[age] = 498.02

• = the amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, assuming you play optimally
Standard Deviation

σ² = Var[X] = the expected squared difference between x and E[X]:

  σ² = ∫_{−∞}^{∞} (x − μ)² p(x) dx

Var[age] = 498.02, σ = 22.32

σ = Standard Deviation = the "typical" deviation of X from its mean:

  σ = √Var[X]
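The moment formulas above can be checked numerically. As before, a hypothetical exponential density with mean 30 stands in for the unavailable age pdf (for that density the exact values are E[X] = 30, E[X²] = 1800, Var[X] = 900, σ = 30, which also illustrates E[f(X)] ≠ f(E[X])):

```python
import math

# Stand-in density: exponential with rate 1/30 (mean 30). Illustrative only.
rate = 1.0 / 30.0
def p(x):
    return rate * math.exp(-rate * x)

def integrate(f, a=0.0, b=600.0, n=200000):
    """Midpoint Riemann sum; [0, 600] covers essentially all the mass here."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

mean = integrate(lambda x: x * p(x))               # E[X]   = ∫ x p(x) dx
second = integrate(lambda x: x * x * p(x))         # E[X^2] = ∫ x^2 p(x) dx
var = integrate(lambda x: (x - mean) ** 2 * p(x))  # Var[X] = ∫ (x-μ)^2 p(x) dx
std = math.sqrt(var)                               # σ = sqrt(Var[X])

# Note E[X^2] ≈ 1800 while (E[X])^2 ≈ 900: E[f(X)] ≠ f(E[X]).
print(round(mean, 2), round(second, 1), round(var, 1), round(std, 2))
```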
Simple example: medical test results

• A test report for a rare disease comes back positive; the test is 90% accurate.
• What's the probability that you have the disease?
• What if the test is repeated?
• This is the simplest example of reasoning by combining sources of information.

How do we model the problem?

• Which is the correct description of "Test is 90% accurate"?

    P(T = true) = 0.9
    P(T = true | D = true) = 0.9
    P(D = true | T = true) = 0.9

• What do we want to know? One of those same quantities:

    P(T = true),  P(T = true | D = true),  P(D = true | T = true)

• More compact notation:

    P(T = true | D = true)   →  P(T|D)
    P(T = false | D = false) →  P(T̄|D̄)
Evaluating the posterior probability through Bayesian inference

• We want P(D|T) = "the probability of having the disease given a positive test"
• Use Bayes' rule to relate it to what we know, the likelihood P(T|D):

    P(D|T) = P(T|D) P(D) / P(T)

  i.e. posterior = likelihood × prior / normalizing constant

• What's the prior P(D)? The disease is rare, so let's assume P(D) = 0.001
• What about P(T)? What's the interpretation of that?
Evaluating the normalizing constant

    P(D|T) = P(T|D) P(D) / P(T)

• P(T) is the marginal probability of the joint P(T, D) = P(T|D) P(D)
• So, compute it with a summation:

    P(T) = Σ_{all values of D} P(T|D) P(D)

• For true-or-false propositions:

    P(T) = P(T|D) P(D) + P(T|D̄) P(D̄)

  What are these?
Refining our model of the test

• We also have to consider the negative case to incorporate all information:

    P(T|D) = 0.9
    P(T|D̄) = ?    What should it be?

• What about P(D̄)?
Plugging in the numbers

• Our complete expression is

    P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|D̄) P(D̄) ]

• Plugging in the numbers we get:

    P(D|T) = (0.9 × 0.001) / (0.9 × 0.001 + 0.1 × 0.999) = 0.0089

• Does this make intuitive sense?
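The calculation above can be packaged as a small function (a sketch; the function and parameter names are editorial):

```python
# Bayes' rule for a binary disease D and a positive test T:
# P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|~D) P(~D) ].
def posterior(p_d, p_t_given_d, p_t_given_not_d):
    num = p_t_given_d * p_d
    den = num + p_t_given_not_d * (1.0 - p_d)
    return num / den

# Rare disease (prior 0.001), 90% accurate test.
p = posterior(0.001, 0.9, 0.1)
print(round(p, 4))  # 0.0089
```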
Same problem, different situation

• Suppose we have a test to determine if you won the lottery.
• It's 90% accurate.
• What is P($ = true | T = true) then?
Playing around with the numbers

    P(D|T) = P(T|D) P(D) / [ P(T|D) P(D) + P(T|D̄) P(D̄) ]

• What if the test were 100% reliable?

    P(D|T) = (1.0 × 0.001) / (1.0 × 0.001 + 0.0 × 0.999) = 1.0

• What if the test were the same, but the disease weren't so rare?

    P(D|T) = (0.9 × 0.1) / (0.9 × 0.1 + 0.1 × 0.9) = 0.5
Repeating the test

• We can relax, P(D|T) = 0.0089, right?
• Just to be sure, the doctor recommends repeating the test.
• How do we represent this?

    P(D|T1, T2)

• Again, we apply Bayes' rule:

    P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

• How do we model P(T1, T2|D)?
Modeling repeated tests

    P(D|T1, T2) = P(T1, T2|D) P(D) / P(T1, T2)

• Easiest is to assume the tests are conditionally independent given the disease state:

    P(T1, T2|D) = P(T1|D) P(T2|D)

• We also factorize the marginal (strictly, this is an additional simplifying assumption; conditional independence given D does not by itself imply it):

    P(T1, T2) = P(T1) P(T2)

• Plugging these in, we have

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [ P(T1) P(T2) ]
Evaluating the normalizing constant again

• Expanding as before, we have

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / Σ_{D∈{t,f}} P(T1|D) P(T2|D) P(D)

• Plugging in the numbers gives us

    P(D|T1, T2) = (0.9 × 0.9 × 0.001) / (0.9 × 0.9 × 0.001 + 0.1 × 0.1 × 0.999) = 0.075

• Another way to think about this:
  - What's the chance of 1 false positive from the test? What's the chance of 2 false positives?
  - The chance of 2 false positives (0.01) is still 10× more likely than the prior probability of having the disease (0.001).
Simpler: Combining information the Bayesian way

• Let's look at the equation again:

    P(D|T1, T2) = P(T1|D) P(T2|D) P(D) / [ P(T1) P(T2) ]

• If we rearrange slightly:

    P(D|T1, T2) = [ P(T2|D) / P(T2) ] × [ P(T1|D) P(D) / P(T1) ]

• We've seen this before! The second factor is the posterior for the first test, which we just computed:

    P(D|T1) = P(T1|D) P(D) / P(T1)
We’ve seen this before!
The old posterior is the new prior
• •
We can just plugin the value of the old posterior It plays exactly the same role as our old prior
P (T2 |D)P (T1 |D)P (D) P (D|T1 , T2 ) = P (T2 )P (T1 )
P (T2 |D) × 0.0089 P (D|T1 , T 2) = P (T2 )
•
Plugging in the numbers gives the same answer:
P (D|T ) =
P (T |D)P ! (D) ¯ ! (D) ¯ P (T |D)P ! (D) + P (T |D)P
0.9 × 0.0089 P (D|T ) = = 0.075 0.9 × 0.0089 + 0.1 × 0.9911
This is how Bayesian reasoning combines old information with new information to update our belief states.
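The sequential-updating view can be sketched as a single update function applied twice: the posterior after the first test becomes the prior for the second, and the result matches the joint two-test calculation (names are editorial):

```python
# One Bayesian update step: prior belief in D, one positive test result.
def update(prior, p_t_given_d, p_t_given_not_d):
    num = p_t_given_d * prior
    return num / (num + p_t_given_not_d * (1.0 - prior))

belief = 0.001                        # prior P(D)
belief = update(belief, 0.9, 0.1)     # after test 1: ~0.0089
belief = update(belief, 0.9, 0.1)     # after test 2: ~0.075
print(round(belief, 3))  # 0.075
```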
Example 1.2 (Hamburgers). Consider the following fictitious scientific information: Doctors find that people with Kreuzfeld-Jacob disease (KJ) almost invariably ate hamburgers, thus p(Hamburger Eater|KJ) = 0.9. The probability of an individual having KJ is currently rather low, about one in 100,000.

1. Assuming eating lots of hamburgers is rather widespread, say p(Hamburger Eater) = 0.5, what is the probability that a hamburger eater will have Kreuzfeld-Jacob disease? This may be computed as

     p(KJ|Hamburger Eater) = p(Hamburger Eater, KJ) / p(Hamburger Eater)
                           = p(Hamburger Eater|KJ) p(KJ) / p(Hamburger Eater)   (1.2.1)
                           = (9/10 × 1/100000) / (1/2) = 1.8 × 10⁻⁵             (1.2.2)

2. If the fraction of people eating hamburgers was rather small, p(Hamburger Eater) = 0.001, what is the probability that a regular hamburger eater will have Kreuzfeld-Jacob disease? Repeating the above calculation, this is given by

     (9/10 × 1/100000) / (1/1000) ≈ 1/100   (1.2.3)

This is much higher than in scenario (1) since here we can be more sure that eating hamburgers is related to the illness.
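Both scenarios of Example 1.2 reduce to one application of Bayes' rule (a sketch; the function name is editorial):

```python
# p(KJ | Hamburger Eater) = p(Hamburger Eater | KJ) p(KJ) / p(Hamburger Eater)
def p_kj_given_he(p_he_given_kj, p_kj, p_he):
    return p_he_given_kj * p_kj / p_he

s1 = p_kj_given_he(0.9, 1e-5, 0.5)    # scenario 1: hamburger eating widespread
s2 = p_kj_given_he(0.9, 1e-5, 0.001)  # scenario 2: hamburger eating rare
print(round(s1, 10), round(s2, 3))  # 1.8e-05 0.009
```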
Example 1.3 (Inspector Clouseau). Inspector Clouseau arrives at the scene of a crime. The victim lies dead in the room alongside the possible murder weapon, a knife. The Butler (B) and Maid (M) are the inspector's main suspects and the inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These beliefs are independent in the sense that p(B, M) = p(B) p(M). (It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:

    dom(B) = dom(M) = {murderer, not murderer},  dom(K) = {knife used, knife not used}   (1.2.4)

    p(B = murderer) = 0.6,  p(M = murderer) = 0.2   (1.2.5)

    p(knife used | B = not murderer, M = not murderer) = 0.3
    p(knife used | B = not murderer, M = murderer)     = 0.2
    p(knife used | B = murderer,     M = not murderer) = 0.6
    p(knife used | B = murderer,     M = murderer)     = 0.1   (1.2.6)

In addition, p(K, B, M) = p(K|B, M) p(B) p(M). Assuming that the knife is the murder weapon, what is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.) Using b for the two states of B and m for the two states of M:

    p(B|K) = Σ_m p(B, m|K)
           = Σ_m p(B, m, K) / p(K)
           = Σ_m p(K|B, m) p(B, m) / Σ_{m,b} p(K|b, m) p(b, m)
           = p(B) Σ_m p(K|B, m) p(m) / Σ_b p(b) Σ_m p(K|b, m) p(m)   (1.2.7)

DRAFT February 27, 2012
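Equation (1.2.7) can be evaluated directly by marginalizing the Maid out of the joint p(K, B, M) = p(K|B, M) p(B) p(M) (a sketch; the dictionary layout and names are editorial):

```python
# Priors and the conditional table p(knife used | B, M) from Example 1.3.
p_b = {'murderer': 0.6, 'not murderer': 0.4}
p_m = {'murderer': 0.2, 'not murderer': 0.8}
p_k_used = {
    ('not murderer', 'not murderer'): 0.3,
    ('not murderer', 'murderer'): 0.2,
    ('murderer', 'not murderer'): 0.6,
    ('murderer', 'murderer'): 0.1,
}

def p_b_given_k(b):
    """p(B = b | K = knife used), marginalizing over the Maid (eq. 1.2.7)."""
    num = p_b[b] * sum(p_k_used[(b, m)] * p_m[m] for m in p_m)
    den = sum(p_b[bb] * sum(p_k_used[(bb, m)] * p_m[m] for m in p_m)
              for bb in p_b)
    return num / den

# num = 0.6 * (0.6*0.8 + 0.1*0.2) = 0.30; den = 0.30 + 0.4*0.28 = 0.412
print(round(p_b_given_k('murderer'), 3))  # 0.728
```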
Example 1.5 (Aristotle: Resolution). We can represent the statement 'All apples are fruit' by p(F = tr|A = tr) = 1. Similarly, 'All fruits grow on trees' may be represented by p(T = tr|F = tr) = 1. Additionally, we assume that whether or not something grows on a tree depends only on whether or not it is a fruit, p(T|A, F) = p(T|F). From this we can compute

    p(T = tr|A = tr) = Σ_F p(T = tr|F, A = tr) p(F|A = tr)
                     = Σ_F p(T = tr|F) p(F|A = tr)
                     = p(T = tr|F = fa) p(F = fa|A = tr) + p(T = tr|F = tr) p(F = tr|A = tr)
                     = p(T = tr|F = fa) × 0 + 1 × 1 = 1   (1.2.16)

using p(F = fa|A = tr) = 0 and p(T = tr|F = tr) = p(F = tr|A = tr) = 1. In other words, we have deduced that 'All apples grow on trees' is a true statement, based on the information presented. (This kind of reasoning is called resolution and is a form of transitivity: from the statements A ⇒ F and F ⇒ T we can infer A ⇒ T.)
Example 1.6 (Aristotle: Inverse Modus Ponens). According to logic, from the statement 'If A is true then B is true', one may deduce that 'if B is false then A is false'. To see how this fits in with a probabilistic reasoning system, we can first express the statement 'If A is true then B is true' as p(B = tr|A = tr) = 1. Then we may infer
Next time

• Bayesian belief networks