Fundamentals of Bayesian Epistemology
Michael G. Titelbaum
Contents

Quick Reference

I Introducing the Subject

1 Beliefs and Degrees of Belief
  1.1 Binary beliefs
    1.1.1 Classificatory, comparative, quantitative
    1.1.2 Shortcomings of binary belief
  1.2 From binary to graded
    1.2.1 Comparative confidence
    1.2.2 Bayesian Epistemology
    1.2.3 Relating beliefs and credences
  1.3 The rest of this book
  1.4 Exercises
  1.5 Further reading

II The Bayesian Formalism

2 Probability Distributions
  2.1 Propositions and propositional logic
    2.1.1 Relations among propositions
    2.1.2 State-descriptions
    2.1.3 Predicate logic
  2.2 Probability distributions
    2.2.1 Consequences of the probability axioms
    2.2.2 A Bayesian approach to the Lottery scenario
    2.2.3 Doxastic possibilities
    2.2.4 Probabilities are weird! The Conjunction Fallacy
  2.3 Alternative representations of probability
    2.3.1 Probabilities in Venn diagrams
    2.3.2 Probability tables
    2.3.3 Using probability tables
    2.3.4 Odds
  2.4 What the probability calculus adds
  2.5 Exercises
  2.6 Further reading

3 Conditional Credences
  3.1 Conditional credences and the Ratio Formula
    3.1.1 The Ratio Formula
    3.1.2 Consequences of the Ratio Formula
    3.1.3 Bayes' Theorem
  3.2 Relevance and independence
    3.2.1 Conditional independence and screening off
    3.2.2 The Gambler's Fallacy
    3.2.3 Probabilities are weird! Simpson's Paradox
    3.2.4 Correlation and causation
  3.3 Conditional credences and conditionals
  3.4 Exercises
  3.5 Further reading

4 Updating by Conditionalization
  4.1 Conditionalization
    4.1.1 Consequences of Conditionalization
    4.1.2 Probabilities are weird! The Base Rate Fallacy
  4.2 Evidence and Certainty
    4.2.1 Probabilities are weird! Total Evidence and the Monty Hall Problem
  4.3 Hypothetical Priors and Evidential Standards
  4.4 Exercises
  4.5 Further reading

5 Further Rational Constraints
  5.1 Subjective and Objective Bayesianism
    5.1.1 Frequencies and Propensities
    5.1.2 Two Distinctions
  5.2 Deference Principles
    5.2.1 The Principal Principle
    5.2.2 Expert principles and Reflection
  5.3 The Principle of Indifference
  5.4 Credences for Infinite Possibilities
  5.5 Jeffrey Conditionalization
  5.6 Exercises
  5.7 Further reading

III Applications

6 Confirmation
  6.1 Formal features of the confirmation relation
    6.1.1 Confirmation is weird! The Paradox of the Ravens
    6.1.2 Further adequacy conditions
  6.2 Carnap's Theory of Confirmation
    6.2.1 Confirmation as relevance
    6.2.2 Finding the right function
  6.3 Grue
  6.4 Subjective Bayesian confirmation
    6.4.1 Confirmation measures
    6.4.2 Subjective Bayesian solutions to the Paradox of the Ravens
  6.5 Exercises
  6.6 Further reading

7 Decision Theory
  7.1 Calculating expectations
    7.1.1 The move to utility
  7.2 Expected Utility Theory
    7.2.1 Preference orderings, and money pumps
    7.2.2 Savage's expected utility
    7.2.3 Jeffrey's theory
    7.2.4 Risk aversion, and Allais' paradox
  7.3 Causal Decision Theory
    7.3.1 Newcomb's Problem
    7.3.2 A causal approach
    7.3.3 Responses and extensions
  7.4 Exercises
  7.5 Further reading

IV Arguments for Bayesianism

8 Representation Theorems
  8.1 Ramsey's four-step process
  8.2 Savage's representation theorem
  8.3 Representation theorems and probabilism
    8.3.1 Objections to the argument
    8.3.2 Reformulating the argument
  8.4 Exercises
  8.5 Further reading

9 Dutch Book Arguments
  9.1 Dutch Books
    9.1.1 Dutch Books for probabilism
    9.1.2 Other Dutch Books
  9.2 The Dutch Book Argument
    9.2.1 Dutch Books depragmatized
  9.3 Objections to Dutch Book Arguments
    9.3.1 The Package Principle
    9.3.2 Dutch Strategy objections
  9.4 Exercises
  9.5 Further reading

10 Accuracy Arguments
  10.1 Accuracy as calibration
  10.2 The gradational accuracy argument for probabilism
    10.2.1 The Brier score
    10.2.2 Joyce's accuracy argument for probabilism
  10.3 Objections to the accuracy argument for probabilism
    10.3.1 The absolute-value score
    10.3.2 Proper scoring rules
    10.3.3 Do we really need Finite Additivity?
  10.4 An accuracy argument for Conditionalization
  10.5 Exercises
  10.6 Further reading

V Challenges and Objections

11 Problem of the Priors
  11.1 The Problem of the Priors
    11.1.1 Understanding the problem
    11.1.2 Washing out of priors
  11.2 Frequentism
    11.2.1 Significance testing
    11.2.2 Troubles with significance testing
  11.3 Likelihoodism
    11.3.1 Troubles with likelihoodism
    11.3.2 Why we need priors
  11.4 Exercises
  11.5 Further reading

12 Credence Ranges
  12.1 Exercises
  12.2 Further reading

13 Logical Omniscience and Old Evidence
  13.1 Exercises
  13.2 Further reading

14 Memory Loss and Self-Locating Belief
  14.1 Exercises
  14.2 Further reading

Glossary
Index of Names
Bibliography
ROSENCRANTZ: Eighty-five in a row—beaten the record!
GUILDENSTERN: Don't be absurd.
ROS: Easily!
GUIL: Is that it, then? Is that all?
ROS: What?
GUIL: A new record? Is that as far as you are prepared to go?
ROS: Well. . . .
GUIL: No questions? Not even a pause?
ROS: You spun them yourself.
GUIL: Not a flicker of doubt?
ROS: Well, I won—didn't I?
GUIL: And if you'd lost? If they'd come down against you, eighty-five times, one after another, just like that?
ROS: Eighty-five in a row? Tails?
GUIL: Yes! What would you think?
ROS: Well. . . . Well, I'd have a good look at your coins for a start!

—Tom Stoppard, Rosencrantz and Guildenstern are Dead
Quick Reference

The Core Bayesian Rules
Non-Negativity: For any P in L, cr(P) ≥ 0.

Normality: For any tautology T in L, cr(T) = 1.

Finite Additivity: For any mutually exclusive P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q).

Ratio Formula: For any P and Q in L, if cr(Q) > 0 then cr(P | Q) = cr(P & Q) / cr(Q).

Conditionalization: For any time t_i and later time t_j, if E in L represents everything the agent learns between t_i and t_j and cr_i(E) > 0, then for any H in L, cr_j(H) = cr_i(H | E).

Consequences of These Rules

Negation: For any P in L, cr(~P) = 1 − cr(P).

Maximality: For any P in L, cr(P) ≤ 1.

Contradiction: For any contradiction F in L, cr(F) = 0.

Entailment: For any P and Q in L, if P ⊨ Q then cr(P) ≤ cr(Q).

Equivalence: For any P and Q in L, if P and Q are logically equivalent then cr(P) = cr(Q).

General Additivity: For any P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q) − cr(P & Q).

Finite Additivity (Extended): For any finite set of mutually exclusive propositions {P1, P2, ..., Pn}, cr(P1 ∨ P2 ∨ ... ∨ Pn) = cr(P1) + cr(P2) + ... + cr(Pn).

Decomposition: For any P and Q in L, cr(P) = cr(P & Q) + cr(P & ~Q).

Partition: For any finite partition of propositions in L, the sum of their unconditional cr-values is 1.

Law of Total Probability: For any proposition P and finite partition {Q1, Q2, ..., Qn} in L, cr(P) = cr(P | Q1) · cr(Q1) + cr(P | Q2) · cr(Q2) + ... + cr(P | Qn) · cr(Qn).

Bayes' Theorem: For any H and E in L, cr(H | E) = cr(E | H) · cr(H) / cr(E).

Multiplication: P and Q with nonextreme cr-values are independent relative to cr if and only if cr(P & Q) = cr(P) · cr(Q).
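As a quick illustration (not part of the author's reference sheet), the following sketch assigns made-up credences to the four state-descriptions of two propositions P and Q and numerically checks a few of the rules above. The particular values 0.3, 0.2, 0.4, and 0.1 are arbitrary.

    # Toy credence distribution over the state-descriptions of P and Q.
    # The numbers are hypothetical; any non-negative values summing to 1 would do.
    cr_state = {(True, True): 0.3, (True, False): 0.2,
                (False, True): 0.4, (False, False): 0.1}

    def cr(prop):
        """Unconditional credence in a proposition, given as a function from states to bools."""
        return sum(v for state, v in cr_state.items() if prop(state))

    def cr_given(a, b):
        """Conditional credence cr(A | B) via the Ratio Formula (assumes cr(B) > 0)."""
        return cr(lambda s: a(s) and b(s)) / cr(b)

    P = lambda s: s[0]
    Q = lambda s: s[1]
    not_P = lambda s: not P(s)
    disj = lambda s: P(s) or Q(s)
    conj = lambda s: P(s) and Q(s)

    # Negation: cr(~P) = 1 - cr(P)
    assert abs(cr(not_P) - (1 - cr(P))) < 1e-9
    # General Additivity: cr(P v Q) = cr(P) + cr(Q) - cr(P & Q)
    assert abs(cr(disj) - (cr(P) + cr(Q) - cr(conj))) < 1e-9
    # Bayes' Theorem: cr(P | Q) = cr(Q | P) * cr(P) / cr(Q)
    assert abs(cr_given(P, Q) - cr_given(Q, P) * cr(P) / cr(Q)) < 1e-9

Any distribution satisfying the three core axioms will pass these checks; the point of the sketch is only to show that the listed consequences really are consequences of how the unconditional and conditional credences fit together.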
Part I
Introducing the Subject
Chapter 1
Beliefs and Degrees of Belief

Most of epistemology concerns propositional attitudes. A propositional attitude is an attitude an agent adopts towards a proposition, or towards a set of propositions. While much philosophical ink has been spilled over the nature of propositions, we will assume only that a proposition is an abstract entity expressible by a declarative sentence and capable of being true or false. (True and false are truth-values, so we say that a proposition is capable of "having a truth-value".) For example, the sentence "Nuclear fusion is a viable energy source" expresses a proposition. If I believe that fusion is viable, this belief is a propositional attitude—it is an attitude I take towards the proposition that fusion is viable. Humans adopt a variety of attitudes towards propositions. I might hope that fusion is a viable energy source, desire that fusion be viable, wonder whether fusion is viable, fear that fusion is viable, or intend to make it the case that fusion is a viable energy source. While some propositional attitudes involve plans to change the world, others attempt to represent what the world is already like. Epistemology focuses on the latter kind of propositional attitude—representational attitudes. Examples of such attitudes include belief and knowledge. (Knowledge will not be a major focus of this book.1) Belief is in some sense a purely representational attitude: when we attribute a belief to an agent, we are simply trying to describe how she takes the world to be. A belief attribution does not indicate any emotional affect towards the proposition, level of justification in that proposition, etc. Yet belief is not the only purely representational attitude; an agent might be certain that a proposition is true, or disbelieve a particular proposition. Philosophers often discuss the class of doxastic attitudes ("belief-like" attitudes) into which belief, disbelief, and
certainty fall. Bayesian Epistemology focuses on a type of doxastic attitude known variously as degree of belief, degree of confidence, or credence. Over the last few decades discussion of credences has become much more common in epistemology, as well as in other areas of philosophy (not to mention psychology, economics, and nearby disciplines). This chapter tries to explain why credences are important to epistemology. I’ll begin by contrasting degree of belief talk with other doxastic attitude attributions—especially attributions of “binary” belief that have historically been significant in epistemology. I’ll then consider what working with degrees of belief adds to our account of an agent’s doxastic life. Finally I’ll introduce a basic characterization of Bayesian Epistemology, and outline how we will explore that view in the chapters to come.
1.1 Binary beliefs

1.1.1 Classificatory, comparative, quantitative
In his (1950), Rudolf Carnap helpfully distinguishes classificatory, comparative, and quantitative concepts:

Classificatory concepts are those which serve for the classification of things or cases into two or a few [kinds]. . . . Quantitative concepts . . . are those which serve for characterizing things or events or certain of their features by the ascription of numerical values. . . . Comparative concepts . . . stand between the two other kinds. . . . [They] serve for the formulation of the result of a comparison in the form of a more-less-statement without the use of numerical values. (p. 9)
In Carnap's famous example, describing the air in a room as warm or cold employs classificatory concepts. Characterizing one room as warmer than another uses a comparative concept. The temperature scale describes the heat of a room with a quantitative concept. Both our everyday talk about doxastic attitudes and our philosophical theorizing about them use classificatory, comparative, and quantitative concepts. Classificatory concepts include belief, disbelief, suspension of judgment, and certainty. The doxastic attitudes picked out by these concepts are monadic; each is adopted towards a single proposition. Moreover, given any particular proposition, agent, and classificatory doxastic attitude, the agent either has that attitude towards the proposition or she doesn't. So
classificatory doxastic attitudes are sometimes called "binary". (I'll alternate between "classificatory" and "binary" terminology in what follows.) A comparative attitude, on the other hand, is adopted towards an ordered pair of propositions. For example, I am more confident that fission is a viable energy source than I am that fusion is. A quantitative attitude assigns a numerical value to a single proposition; my physicist friend is 90% confident that fusion is viable. Until the last few decades, much of epistemology revolved around classificatory concepts. (Think of debates about the justification of belief, or about necessary and sufficient conditions for knowledge.) This wasn't an exclusive focus, but more a matter of emphasis. So-called "traditional" or "mainstream" epistemologists certainly employed comparative and quantitative terms.2 Moreover, their classificatory attitude ascriptions were subtly shaded by various modifiers: a belief, for example, might be reluctant, intransigent, or deeply-held. Nevertheless, Bayesian epistemologists place much more emphasis on quantitative attitudes such as credences. This chapter examines reasons for such a shift: Why should epistemologists be so interested in credences? To aid our understanding, I'll introduce a character who has probably never existed in real life: the Simple Binarist. A Simple Binarist insists on describing agents' doxastic propositional attitudes exclusively in terms of belief, disbelief, and suspension of judgment. The Simple Binarist eschews all other doxastic attitude attributions, and even refuses to add shading modifiers like the ones above. I introduce the Simple Binarist not as a plausible rival to the Bayesian, but instead as an illustrative contrast. By highlighting doxastic phenomena for which the Simple Binarist has trouble accounting, I will illustrate the importance of quantitative attitude attributions. Nowadays most everyone uses a mix of classificatory, comparative, and quantitative doxastic concepts to describe agents' doxastic lives. I hope to demonstrate the significance of quantitative concepts within that mix by imagining what would happen if our epistemology lacked them entirely. And I will suggest that epistemologists' growing understanding of the advantages of degree-valued doxastic concepts helps explain the preponderance of quantitative attitude ascriptions in epistemology today.

1.1.2 Shortcomings of binary belief
My physicist friend believes that nuclear fusion is a viable energy source. She also believes that her car will stop when she presses the brake pedal. She is willing to bet her life on the latter belief, and in fact does so multiple
times daily during her commute. She is not willing to bet her life on the former belief. This difference in the decisions she's willing to make seems like it should be traceable to a difference between her doxastic attitudes towards the proposition that fusion is viable and the proposition that pressing her brake pedal will stop her car. Yet the Simple Binarist—who is willing to attribute only beliefs, disbeliefs, and suspensions—can make out no difference between my friend's doxastic attitudes towards those propositions. Once the Simple Binarist says my friend believes both propositions, he has said all he has to say. Now suppose my physicist friend reads about some new research into nuclear energy. The research reveals new difficulties with tokamak design, which will make fusion power more challenging. After learning of this research, she still believes fusion is a viable energy source. Nevertheless, it seems this evidence should cause some change in her attitudes towards the proposition that fusion is viable. Yet the Simple Binarist lacks the tools to ascribe any such change; my friend believed the proposition before, and she still believes it now. What do these two examples show? The Simple Binarist doesn't say anything false—it's true that my friend believes the propositions in question at the relevant times. But the Simple Binarist's descriptive resources don't seem fine-grained enough to capture some further things we want to say about my friend's doxastic attitudes. Now maybe there's some complicated way the Simple Binarist could account for these examples within his classificatory scheme. Or maybe a complex binarist with more classificatory attitudes in his repertoire than the Simple Binarist could do the trick. But it's most natural to respond to these examples with confidence comparisons: my friend is more confident that her brakes will work than she is that fusion is viable; reading the new research makes her less confident in the viability of fusion than she was before. Comparative doxastic attitudes fine-grain our representations in a manner that feels appropriate to these examples. We've now seen two difficulties the Simple Binarist has in describing an agent's doxastic attitudes. But in addition to descriptive adequacy, we often want to work with concepts that figure in plausible norms.3 Historically, epistemologists were often driven to work with comparative and quantitative doxastic attitudes because of their difficulties with framing defensible rational norms for binary belief. The normative constraints most commonly considered for binary belief are:
Belief Consistency: Rationality requires the set of propositions an agent believes to be logically consistent.

Belief Closure: If some subset of the propositions an agent believes entails a further proposition, rationality requires the agent to believe that further proposition as well.

Belief Consistency and Belief Closure are proposed as necessary conditions for an agent's belief set to be rational. They are also typically proposed as requirements of theoretical rather than practical rationality. Practical rationality concerns connections between attitudes and actions. Our earlier contrast between my friend's fusion beliefs and her braking beliefs was a practical one; it concerned how those doxastic attitudes influenced her betting behavior. Our other problematic example for the Simple Binarist was a purely theoretical one, having to do with my friend's fusion beliefs as evidence-responsive representations of the world (and without considering those beliefs' consequences for her acts). What kinds of constraints does practical rationality place on attitudes? In Chapter 7 we'll see that if an agent's preferences fail to satisfy certain axioms, this can lead to a disastrous course of actions known as a "money pump". Practical rationality therefore requires agents' preferences to satisfy those axioms. Similarly, we'll see in Chapter 9 that if an agent's credences fail to satisfy the probability axioms, her betting behavior is susceptible to a troublesome "Dutch Book". This fact has been used to argue that practical rationality requires credences to satisfy the probability axioms. One might think that practical rationality provides all the rational constraints there are.4 The standard response to this proposal invokes Pascal's Wager. Pascal (1670/1910, Section III) argues that it is rational to believe the proposition that the Christian god exists. If that proposition is true, having believed it will yield vast benefits in the afterlife. If the proposition is false, whether one believed it or not won't have nearly as dramatic consequences. Assuming Pascal has the consequences right, this seems to provide some sort of reason for maintaining religious beliefs. Nevertheless, if an agent's evidence points much more strongly to atheism than to the existence of a deity, it feels like there's a sense of rationality in which religious belief would be a mistake. This is theoretical rationality, a standard that assesses representational attitudes in their capacity as representations—how well they do at depicting the world, being responsive to evidence, etc.—without considering how they influence action. Belief Consistency and Closure are usually offered as requirements of theoretical rationality. The idea is that a set of beliefs has failed as a responsible representation of the world
if it contradicts itself or fails to admit its own logical consequences.5 The versions of Belief Consistency and Closure I've stated above are pretty implausible as genuine rational requirements. Belief Closure, for instance, requires an agent to believe any arbitrarily complex proposition entailed by what she already believes, even if she's never come close to entertaining that proposition. And since any set of beliefs has infinitely many logical consequences, Closure also requires rational agents to have infinitely many beliefs. Belief Consistency, meanwhile, forbids an agent from maintaining a logically inconsistent set of beliefs even if the inconsistency is so recondite that she is incapable of seeing it. One might find these requirements far too demanding to be rational constraints. It could be argued, though, that these flaws in Belief Consistency and Closure have to do with the particular way in which I've stated the norms. Perhaps we could make a few tweaks to these principles that would leave their spirit intact while inoculating them against these particular flaws. In Chapter ?? we will consider such tweaks to a parallel set of Bayesian constraints with similar problems. In the meantime, though, there are counterexamples to Belief Consistency and Closure that require much more than a few tweaks to resolve. Kyburg (1961) first described the Lottery Paradox:

A fair lottery has sold one million tickets. Because of the poor odds, an agent who has purchased a ticket believes her ticket will not win. She also believes, of each other ticket purchased in the lottery, that it will not win. Nevertheless, she believes that at least one purchased ticket will win.

The beliefs attributed to the agent in the story seem rational. Yet these beliefs are logically inconsistent—you cannot consistently believe that at least one ticket will win while believing of each ticket that it will lose. So if the agent's beliefs in the story are rationally permissible, we have a counterexample to Belief Consistency. Moreover, if we focus just on the agent's beliefs about the individual tickets, that set of beliefs entails that none of the tickets will win. Yet it seems irrational for the agent to believe that no ticket will win. So the Lottery also provides a counterexample to Belief Closure. Some defenders of Belief Consistency and Closure have responded that, strictly speaking, it is irrational for the agent in the Lottery to believe her ticket will lose. (If you believe your ticket will lose, why buy it to begin with?6) If true, this resolves the problem. But it's difficult to resolve Makinson's (1965) Preface Paradox in a similar fashion:
You write a long nonfiction book with many claims in its main text, each of which you believe. In the acknowledgments at the beginning of the book you write, “I’m sure there are mistakes in the main text, for which I take full responsibility.”
Many authors write such statements in the prefaces to their books, and it's hard to deny that it's rational for them to do so. It's also very plausible that nonfiction authors believe the contents of what they write. Yet if the concession that there are mistakes is an assertion that there is at least one falsehood in the main text, then the belief asserted in the preface is logically inconsistent with belief in all of the claims in the text.7 The Lottery and Preface pose a different kind of problem from our earlier examples. The examples with my friend the physicist didn't show that descriptions in classificatory belief terms were false; they simply suggested that classificatory descriptions don't capture all the important aspects of doxastic life. The Lottery and Preface, however, are meant to demonstrate that Belief Consistency and Belief Closure—the most natural normative principles for binary belief—are actually false. An extensive literature has grown up around the Lottery and Preface, attempting to resolve them in a number of ways. One might deny that the sets of beliefs described in the paradoxes are in fact rational. One might find a clever way to establish that those sets of beliefs don't violate Belief Consistency or Belief Closure. One might drop Belief Consistency and/or Belief Closure for alternative normative constraints on binary belief. All of these responses have been tried, and I couldn't hope to adjudicate their successes and failures here. For our purposes, the crucial point is that while it remains controversial how to square norms for binary belief with the Lottery and Preface, norms for rational credence have no trouble with those examples at all. In Chapter 2 we'll see that Bayesian norms tell a natural, intuitive story about the rational credences to adopt in the Lottery and Preface situations. The ease with which Bayesianism handles cases that are paradoxical for binary belief norms has been seen as a strong advantage for credence-centered epistemology.
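To preview that credal treatment, here is a minimal numerical sketch (my own illustration, not the author's). The specific figures (one million lottery tickets; 300 book claims, each held to credence 0.99 and treated as independent) are hypothetical stand-ins.

    # Hypothetical numbers only: a sketch of how graded confidence handles the
    # Lottery and Preface without any inconsistency among the credences themselves.
    N_TICKETS = 1_000_000
    cr_my_ticket_loses = 1 - 1 / N_TICKETS   # 0.999999: very high, but short of certainty
    cr_some_ticket_wins = 1.0                # the lottery is stipulated to have a winner
    # Probabilism permits credence 0.999999 in each "ticket i loses" alongside
    # credence 1 that some ticket wins, since those propositions are not independent.

    N_CLAIMS = 300
    cr_each_claim_true = 0.99
    cr_whole_text_true = cr_each_claim_true ** N_CLAIMS     # roughly 0.049 under the independence assumption
    cr_at_least_one_mistake = 1 - cr_whole_text_true        # roughly 0.951
    print(cr_my_ticket_loses, cr_whole_text_true, cr_at_least_one_mistake)

On this picture the preface-writer is highly confident of each individual claim and, at the same time, even more confident that the text contains at least one mistake; since no binary beliefs are in play, nothing like Belief Consistency is violated.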
1.2 From binary to graded

1.2.1 Comparative confidence
The previous section articulated both descriptive and normative difficulties for restricting one's attention exclusively to classificatory doxastic attitude ascriptions (belief, disbelief, suspension of judgment, etc.). We imagined a Simple Binarist who works only with these kinds of attitudes, and posed both descriptive and normative problems for him. The first descriptive problem was that an agent may believe two propositions while nevertheless treating these propositions quite differently when it comes to action. The second descriptive problem was that new evidence may change an agent's doxastic attitudes towards a proposition despite her believing the proposition both before and after incorporating the evidence. We could address both of these shortcomings in a natural fashion by moving beyond strictly classificatory concepts to comparisons between an agent's levels of confidence in two different propositions, or her levels of confidence in a single proposition at two different times. So let's augment our resources a bit beyond what the Simple Binarist has available. We'll still allow ourselves to say that an agent believes, disbelieves, or suspends judgment in a proposition. But we'll also allow ourselves to describe an agent as at least as confident of one proposition as another, more confident in one proposition than another, or equally confident in the two. Some of these comparisons follow directly from classificatory claims. For instance, when I say that my friend believes nuclear fusion is a viable energy source, we typically infer that she is more confident in the proposition that fusion is viable than she is in the proposition that fusion is nonviable. But there are also comparisons which, while consistent with classificatory information, are not entailed by such information. My friend believes both that fusion is viable and that her brakes are functional. We go beyond this description when we add that she is more confident in the latter proposition than the former. Introducing confidence comparisons between the propositions in a set creates a formal structure called an ordering on that set. For example, Figure 1.1 depicts my confidence ordering over a particular set of propositions. Here D represents the proposition that the Democrats will win the next presidential election, and W represents the proposition that anthropogenic global warming has occurred. The arrows indicate more confident than relations: for instance, I am more confident that warming either has or hasn't occurred than I am that it has, but I am also more confident that
[Figure 1.1: A confidence ordering. The diagram arranges the propositions "D and not D", "W and not W", "D", "not W", "not D", "W", "D or not D", and "W or not W", with arrows indicating more-confident-than relations.]
warming has occurred than I am that it has not. It's important that not every confidence ordering is a total ordering—there may be some pairs of propositions for which the ordering says nothing about the agent's relative confidences. Don't be fooled by the fact that "not D" and "W" are at the same height in Figure 1.1. In that diagram only the arrows reflect features of the ordering; the ordering depicted remains silent about whether I am more confident in "not D" or "W". This reflects an important truth about my doxastic attitudes: while I'm more confident in warming than nonwarming and in a Democratic loss than a win, I may genuinely be incapable of making a confidence comparison across those two unrelated issues. In other words, I may view warming propositions and election propositions as incommensurable. We now have the basic elements of a descriptive scheme for attributing comparative doxastic attitudes. How might we add a normative element to this scheme? A typical norm for confidence comparisons is:

Comparative Entailment: For any pair of propositions such that the first entails the second, rationality requires an agent to be at least as confident of the second as the first.

Comparative Entailment is intuitively plausible. For example, it would be irrational to be more confident in the proposition that Arsenal is the best soccer team in the Premier League than the proposition that Arsenal is a soccer team. Being the best soccer team in the Premier League entails that Arsenal is a soccer team!8 Although it's a simple norm, Comparative Entailment has a number of substantive consequences. For instance, assuming we are working with a
classical entailment relation on which any proposition entails a tautology and every tautology entails every other, Comparative Entailment requires a rational agent to be equally confident of every tautology and at least as confident of any tautology as she is of anything else. Comparative Entailment also requires a rational agent to be equally confident of every contradiction. While Comparative Entailment (or something close to it9) has generally been endorsed by authors working on comparative confidence relations, there is great disagreement over which additional comparative norms are correct. We will present some alternatives in Chapter ??, when we delve into the technical details of comparative confidence orderings.
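Since the discussion above leans on the formal notions of a partial ordering and the Comparative Entailment norm, here is a minimal sketch (my own illustration, not the author's). Propositions are modelled as sets of possible worlds over the two issues from Figure 1.1; the particular "at least as confident in" pairs are invented to echo that figure.

    # Toy worlds: (D, W) truth-value pairs for "Democrats win" and "warming has occurred".
    worlds = {(d, w) for d in (True, False) for w in (True, False)}

    D      = frozenset(x for x in worlds if x[0])
    not_D  = frozenset(x for x in worlds if not x[0])
    W      = frozenset(x for x in worlds if x[1])
    not_W  = frozenset(x for x in worlds if not x[1])
    taut   = frozenset(worlds)
    contra = frozenset()
    props  = [D, not_D, W, not_W, taut, contra]

    def entails(a, b):
        return a <= b   # every a-world is a b-world

    # Pairs (a, b) meaning "at least as confident in b as in a": contradiction below
    # D below not_D below tautology, and contradiction below not_W below W below tautology.
    # No pair compares an election proposition with a warming proposition.
    base = {(contra, D), (D, not_D), (not_D, taut), (contra, not_W), (not_W, W), (W, taut)}

    def close(pairs, items):
        rel = set(pairs) | {(p, p) for p in items}   # reflexive closure
        changed = True
        while changed:                               # transitive closure
            changed = False
            for a, b in list(rel):
                for c, d in list(rel):
                    if b == c and (a, d) not in rel:
                        rel.add((a, d))
                        changed = True
        return rel

    ordering = close(base, props)

    # Comparative Entailment: if a entails b, the agent is at least as confident in b as in a.
    violations = [(a, b) for a in props for b in props
                  if entails(a, b) and (a, b) not in ordering]
    print(len(violations))          # 0: this partial ordering satisfies the norm
    print((not_D, W) in ordering)   # False: the two issues remain incommensurable

The point of keeping the relation partial is exactly the incommensurability discussed above: nothing in the ordering forces a comparison between the election propositions and the warming propositions, yet the norm can still be checked and satisfied.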
1.2.2 Bayesian Epistemology
There is no single Bayesian Epistemology; instead, there are many Bayesian epistemologies.10 Every view I would call a Bayesian epistemology endorses the following two principles:

1. Agents have doxastic attitudes that can usefully be represented by assigning real numbers to claims.

2. Rational requirements on those doxastic attitudes can be represented by mathematical constraints on the real-number assignments closely related to the probability calculus.

The first of these principles is descriptive, while the second is normative—reflecting the fact that Bayesian epistemologies have both descriptive and normative commitments. Most of the rest of this chapter concerns the descriptive element; extensive coverage of Bayesian Epistemology's normative content begins in Chapter 2.11 I've articulated these two principles vaguely to make them consistent with the wide variety of views (many of which we'll see later in this book) that call themselves Bayesian epistemologies. For instance, the first principle mentions "claims" because some Bayesian views assign real numbers to sentences or other entities in place of propositions. Still, the most common Bayesian descriptive approach—and the one we will stick with for most of this book—assigns numerical degrees of confidence to propositions.12 In the previous section, we took the Simple Binarist's repertoire of belief, disbelief, and suspension descriptions and added confidence comparisons. What more can we gain by moving to a full numerical representation of confidence? Comparative confidence relations create orderings—they put things in order. But they cannot tell us how relatively big the gaps are
between items in the ordering. Lacking quantitative credal concepts we can say that an agent is more confident in one proposition than she is in another, but we cannot say how much more confident she is. These matters of degree can be very important. Suppose you've been offered a job teaching at a university, but there's another university at which you'd much rather teach. The first university has given you two weeks to respond to their offer, and you know you won't have a hiring decision from the preferred school by then. Trying to decide whether to turn down the offer in hand, you contact a friend at the preferred university. She says you're one of only two candidates for their job, and she's more confident that you'll get the offer than the other candidate. At this point you want to ask how much more confident she is in your prospects than the other candidate's. A 51-49 split might not be enough for you to hang in! Like our earlier brake pedal story, this is an example about the practical consequences of doxastic attitudes. It suggests that distinctions between doxastic attitudes affecting action cannot all be captured by a confidence ordering—important decisions may depend on the sizes of the gaps. Put another way, this example suggests that one needs more than just confidence orderings to do decision theory (which will be the subject of Chapter 7). In Chapter 6, we will use quantitative confidence measures to investigate a topic of great significance for theoretical rationality: degrees of confirmation. Numerical credence values are very important in determining whether a body of experimental evidence supports one scientific hypothesis more than it supports another. These are some of the advantages of numerically measuring degrees of belief. But credal descriptions have disadvantages as well. For instance, numerical representations may provide more specific information than is actually present in the situation being represented. The Beatles were better than the Monkees, but there was no numerical amount by which they were better. Similarly, I might be more confident that the Democrats will lose the next election than I am that they will win without there being a fact of the matter about exactly how much more confident I am. Representing my attitudes by assigning precise credence numbers to the proposition that the Democrats will lose and the proposition that they will win attributes to me a confidence gap of a particular size—which may be an over-attribution in the actual case. Numerical degree of belief representations also impose complete commensurability. It is possible to build a Bayesian representation that assigns credal values to some propositions but not others—representing the fact that an agent takes attitudes towards the former but not the latter.13 But once
our representation assigns a numerical credence to some particular proposition, that proposition immediately becomes comparable to every other proposition to which a credence is assigned. Suppose I am 60% confident that the Democrats will lose, 40% confident that they will win, and 80% confident that anthropogenic global warming has occurred. One can immediately rank all three of these propositions with respect to my confidence. Assigning numerical credences over a set of propositions creates a total ordering on the set, making it impossible to retain any incommensurabilities among the propositions involved. This is worrying if you think confidence incommensurability is a common and rational feature in real agents' doxastic lives. Epistemologists sometimes complain that working with numerical credences is unrealistic, because agents "don't have numbers in their heads". This is a bit like refusing to measure gas samples with numerical temperature values because molecules don't fly around with numbers pinned to their backs.14 The relevant question is whether agents' doxastic attitudes have a level of structure that can be well-represented by numbers, by a comparative ordering, by classificatory concepts, or by something else. This is the context in which it's appropriate to worry whether agents' confidence gaps have important size characteristics, or whether an agent's assigning doxastic attitudes to any two propositions should automatically make them confidence-commensurable. We will return to these issues a number of times in this book.

1.2.3 Relating beliefs and credences
I've said a lot about representing agents as having various doxastic attitudes. But presumably these attitudes aren't just things we can represent agents as having; presumably agents actually have at least some of the attitudes in question. The metaphysics of doxastic attitudes raises a huge number of questions. For instance: What is it—if anything—for an agent to genuinely possess a mental attitude beyond being usefully representable as having such? Or: If an agent can have both binary beliefs and degrees of belief in the same set of propositions, how are those different sorts of doxastic attitudes related? The latter question has generated a great deal of discussion, which I cannot hope to summarize here. Yet I do want to mention some of the general issues and best-known proposals. Before doing so, let me pause to discuss terminology. There are two different ways to employ the terms "belief" and "doxastic attitude". In this book I will use "belief" as a synonym for "binary belief", one of the classificatory representational attitudes.
"Doxastic attitude" will then be an umbrella term for propositional attitudes that are belief-like in particular ways, including not only binary belief but also disbelief, certainty, doubt, suspension of belief, comparative confidence, numerical credence, and others. Yet there is another approach on which "belief" is the umbrella term, and "doxastic attitude" means something like "variety of belief". On this approach, binary beliefs are sometimes called "full beliefs", and credences may be called "partial beliefs" or "graded beliefs". On this approach one also hears the aphorism "Belief comes in degrees." These last few locutions wouldn't make sense if "belief" meant exclusively binary belief. (A credence is not a partial or incomplete binary belief.) But they make more sense when "belief" is an umbrella term. Going forward, I will refer to the quantitative representational attitudes that are our main topic as either "credences" or "degrees of belief". I will also use "belief" and "doxastic attitude" according to the first of the two approaches just described. Now suppose some philosopher asserts a particular connection between (binary) beliefs and credences. That connection might do any of the following: (1) define attitudes of one kind in terms of attitudes of the other; (2) reduce attitudes of one kind to attitudes of the other; (3) assert a descriptively true conditional (or biconditional) linking one kind of attitude to the other; (4) offer a normative constraint to the effect that any rational agent with an attitude of one kind will have a particular attitude of the other. For example, the Lockean thesis connects believing a proposition with possessing a degree of confidence in that proposition surpassing some numerical threshold. Taking inspiration from John Locke (1689/1975, Bk. IV, Ch. 15-16), Richard Foley entertains the idea that:

To say that you believe a proposition is just to say that you are sufficiently confident of its truth for your attitude to be one of belief. Then it is rational for you to believe a proposition just in case it is rational for you to have sufficiently high degree of confidence in it. (1993, p. 140)

Foley presents the first sentence—identifying belief with sufficiently high degree of belief—as the Lockean thesis. The latter sentence is presented as following from the former. But notice that the latter sentence's normative claim could be secured by a weaker, purely normative Lockean thesis, asserting only that a rational agent believes a proposition just in case she is sufficiently confident of it.
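Read as a threshold rule, the Lockean thesis admits a very small formal gloss. The sketch below is my own illustration (the threshold value and the toy lottery are hypothetical stand-ins), and it previews how, as noted in the next paragraph, a threshold short of certainty turns perfectly coherent credences into a belief set that violates Belief Consistency.

    # A hypothetical illustration of the Lockean thesis as a threshold rule:
    # believe P just in case cr(P) >= t. The toy lottery has 1,000 tickets rather
    # than the million in Kyburg's example, purely to keep the sketch small.
    def lockean_belief_set(credences, t):
        return {p for p, c in credences.items() if c >= t}

    N = 1_000
    credences = {f"ticket {i} does not win": 1 - 1 / N for i in range(1, N + 1)}
    credences["at least one ticket wins"] = 1.0

    beliefs = lockean_belief_set(credences, t=0.99)
    print(len(beliefs))  # 1001: each "ticket i does not win" plus "at least one ticket wins"
    # The credences themselves obey the probability rules, yet for any threshold at
    # or below 1 - 1/N the induced belief set is logically inconsistent: the Lottery
    # Paradox reappears as a consequence of the Lockean threshold.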
On any reading of the Lockean thesis, there are going to be questions about exactly how high this threshold must be. One might suggest that the confidence threshold for belief is certainty (i.e. 100% confidence). But many of us believe propositions of which we are not certain, and this seems perfectly rational. Working down the confidence spectrum, it seems that in order to believe a proposition one should be more confident of it than not. But that leaves a lot of space to pin down the threshold between 50% and 100% confidence. Here it may help to suggest that the relevant threshold for belief is vague, or varies with context. The Lockean thesis also causes problems when we try to layer traditional norms of rational belief and credence on top of it. If we adopt Bayesian probabilistic norms for credence, the Lockean thesis generates rational belief sets for the Lottery and Preface that violate Belief Consistency and Closure. We will see why when we give a probabilistic solution to the Lottery in Section 2.2.2. The Lockean thesis works by identifying belief with a particular kind of credence. But we might try connecting these attitudes in the opposite direction. For instance, we might say that I have a 60% credence that the Democrats will lose the next election just in case I believe the proposition that their probability of losing is 60%. The general strategy here is to align my credence in one proposition with belief in a second proposition about the probability of the first. This connective strategy—whether meant definitionally, reductively, normatively, etc.—is now generally viewed as unlikely to succeed. For one thing, it requires thinking that whenever a (rational) agent has a degree of confidence, she also has a belief about probabilities. David Christensen (2004, Ch. 2) wonders about the content of these probability beliefs. In Chapter 5 we will explore various "interpretations of probability" that attempt to explain the meaning of "probability" claims. The details need not concern us here; what matters is that for each possible interpretation, it's implausible to think that whenever a (rational) agent has a degree of confidence she (also?) has a belief with that kind of probabilistic content. If "probability" talk is, for instance, always talk about frequency within a reference class, must I have beliefs about frequencies and reference classes in order to be pessimistic about the Democrats? The idea that the numerical value of a credence occurs inside a proposition towards which the agent adopts some attitude also generates deeper problems. We will discuss some of them when we cover conditional credences in Chapter 3. Generally, contemporary Bayesians think of the numerical value of a credence not as part of the content towards which the
agent adopts the attitude, but instead as an attribute of the attitude itself. I adopt a credence of 60% towards the proposition that the Democrats will lose; no proposition containing the value 60% is involved.15 This is a small sample of the positions and principles that have been proposed relating beliefs to degrees of belief. One might embrace some connecting principle I haven't mentioned here. Or one might deny the existence of attitudes in one category altogether. (Perhaps there are no beliefs. Perhaps there are no degrees of belief.) Yet I'd like to note that it is possible that both types of attitudes exist without there being any fully general, systematic connections between them. Here's an analogy:16 Consider three different maps of the same square mile of earthly terrain. One is a topographic map; another is a satellite image; another shows streets marked with names. Each map represents different features of the underlying terrain. The features represented on each map are equally real. There are some connections between the information on one map and the information on another; a street that appears on the satellite photo will presumably appear on the streetmap as well. But there are no fully general, systematic connections that would allow you to derive everything one map provides from any of the others. For instance, nothing on the topo or the streetmap says anything about the location of a tree picked up by the satellite. Similarly, describing agents as possessing beliefs or as possessing degrees of belief might be equally valid representations of a complex underlying reality, useful for different purposes. The features of an agent's cognitive state picked out by each representation might also be equally real. Yet there might nevertheless be no general, systematic connections between one representation and the other (even for a fully rational agent). Going forward, we will assume that it is at least sometimes philosophically useful to represent agents as having numerical degrees of belief. We will not assume any systematic connection between credences and beliefs, and indeed we will only rarely mention the latter.
1.3 The rest of this book
Hopefully I have now given you some sense of what credences are, and of why one might incorporate them into one's epistemology. Our first task in Chapter 2 will be to develop a Bayesian formalism in which credences can be descriptively represented. After that, much of our focus will be on the norms Bayesians require of rational degrees of belief.
There is a great deal of disagreement among Bayesians about exactly what these norms should be. Nevertheless, we can identify five core normative Bayesian rules: Kolmogorov's three probability axioms for unconditional credence, the Ratio Formula for conditional credence, and Conditionalization for updating credences over time. These are not core rules in the sense that all Bayesian epistemologists agree with them. Some Bayesians accept all five rules and want to add more; some don't even accept these five. They are core in the sense that one needs to understand them in order to understand any further Bayesian position entertained. This chapter completes Part I of this book. Part II is primarily concerned with the five core Bayesian rules. Chapter 2 covers Kolmogorov's axioms; Chapter 3 covers the Ratio Formula; and Chapter 4 covers Conditionalization. Chapter 5 then discusses a variety of norms Bayesians have proposed either to supplement or to replace the core five. The presence of all these alternatives raises the question of why we should accept any of these rules as genuinely normative to begin with. To my mind, one can see the advantages of Bayesianism best by seeing its consequences for applications. For instance, I've already mentioned that Bayesian credal norms accommodate a natural story about doxastic attitudes in the Lottery Paradox. Part III of this book discusses the two historically most important applications of Bayesian Epistemology: confirmation theory (Chapter 6) and decision theory (Chapter 7). Along with their benefits in application, Bayesian normative rules have been directly defended with a variety of philosophical arguments. I discuss the three most popular arguments in Part IV, and explain why I find each ultimately unconvincing. Chapter 8 discusses Representation Theorem arguments; Chapter 9 Dutch Books; and Chapter 10 arguments based on the goal of accurate credences. Finally, a number of important challenges have been raised to Bayesian Epistemology—both to its descriptive framework and to its normative rules. Many of these (though admittedly not all) are covered in Part V.
1.4 Exercises
Problem 1.1. What do you think the agent in the Lottery Paradox should believe? In particular, should she believe of each ticket in the lottery that that ticket will lose? Does it make a difference how many tickets there are in the lottery? Explain and defend your answers.

Problem 1.2. Explain why (given a classical logical entailment relation)
Comparative Entailment requires a rational agent to be equally confident of every contradiction.

Problem 1.3. Assign numerical confidence values (between 0% and 100%, inclusive) to each of the propositions mentioned in Figure 1.1. These confidence values should be arranged so that if there's an arrow in Figure 1.1 from one proposition to another, then the first proposition has a lower confidence value than the second.

Problem 1.4. The arrows in Figure 1.1 represent "more confident in" relations between pairs of propositions. Comparative Entailment, on the other hand, concerns the "at least as confident in" relation. So suppose we reinterpreted Figure 1.1 so that the arrows represented "at least as confident in" relations. (For example, Figure 1.1 would now tell you that I'm at least as confident of a Democratic loss as a win.)

(a) Explain why—even with this reinterpretation—the arrows in Figure 1.1 do not provide an ordering that satisfies Comparative Entailment.

(b) Describe a bunch of arrows you could add to the (reinterpreted) diagram to create an ordering satisfying Comparative Entailment.

Problem 1.5. Is it ever helpful to describe an agent's attitudes in terms of binary beliefs? Or could we get by just as well using only more fine-grained (comparative and quantitative) concepts? Explain and defend your answer.
1.5 Further reading
Classic Texts
Henry E. Kyburg Jr (1970). Conjunctivitis. In: Induction, Acceptance, and Rational Belief. Ed. by M. Swain. Boston: Reidel, pp. 55–82.

David C. Makinson (1965). The Paradox of the Preface. Analysis 25, pp. 205–7.
Classic discussions of the Lottery and Preface Paradoxes (respectively), by the authors who introduced these paradoxes to the philosophical literature.
Extended Discussion
Richard Foley (2009). Beliefs, Degrees of Belief, and the Lockean Thesis. In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 37–48.

Ruth Weintraub (2001). The Lottery: A Paradox Regained and Resolved. Synthese 129, pp. 439–449.

David Christensen (2004). Putting Logic in its Place. Oxford: Oxford University Press.

Foley, Weintraub, and Christensen each discuss the relation of binary beliefs to graded, and the troubles for binary rationality norms generated by the Lottery and Preface Paradoxes. They end up leaning in different directions: Christensen stresses the centrality of credence to norms of theoretical rationality, while Foley and Weintraub emphasize the role of binary belief in a robust epistemology.
Notes

1. While Bayesian Epistemology has historically focused on doxastic representational attitudes, some authors have recently applied Bayesian ideas to the study of knowledge. See, for instance, (Moss ms).

2. John Bengson, who has greatly helped me with this chapter, brought up the interesting historical example of how we might characterize David Hume's (1739–40/1978) theory of belief vivacity in classificatory/comparative/quantitative terms.

3. On some epistemologies the descriptive and normative projects cannot be prized apart, because various normative conditions are either definitional or constitutive of what it is to possess particular doxastic attitudes. See, for instance, (Davidson 1984) and (Kim 1988).

4. See, for example, (Kornblith 1993). Kornblith has a response to the Pascalian argument I'm about to offer, but chasing down his line would take us too far afield.

5. Has Pascal demonstrated that practical rationality requires religious belief? I defined practical rationality as concerning an attitude's connection to action. One odd aspect of Pascal's Wager is that it seems to treat believing as a kind of action in itself. Many philosophers have wondered whether we have the kind of direct control over our beliefs to deliberately follow Pascal's advice. For our purposes, the crucial point is that the pressure to honor atheistic evidence
doesn't seem immediately connected to action. This establishes a standard of theoretical rationality distinct from concerns of practical rationality.

6. This is why I never play the lottery.

7. If you find the Preface Paradox somehow unrealistic or too distant from your life, consider that (1) you have a large number of beliefs (each of which, presumably, you believe); and (2) you may also believe (quite reasonably) that at least one of your beliefs is false. This combination is logically inconsistent.

8. In an article dated January 2, 2014 on grantland.com, a number of authors made bold predictions for the forthcoming year. Amos Barshad wrote,
"And so, here goes, my two-part prediction: 1. The Wu-Tang album will actually come out. 2. It'll be incredible. I'm actually, illogically more sure of no. 2."

9. Comparative Entailment shares some of the intuitive flaws we pointed out earlier for Belief Closure: (1) as stated, Comparative Entailment requires an agent to compare infinitely many ordered pairs of propositions (including propositions the agent has never entertained); (2) Comparative Entailment places demands on agents who have not yet recognized that some particular proposition entails another. So it is tempting to tweak Comparative Entailment in ways similar to the tweaks we will later propose for Belief Consistency, Belief Closure, and their Bayesian cousins.

10. I.J. Good famously argued in a letter to the editor of The American Statistician that there are at least 46,656 varieties of Bayesians. (Good 1971)

11. These days philosophers sometimes talk about "Formal Epistemology". A formal epistemology is any epistemological theory that uses formal tools. Bayesian Epistemology is just one example of a formal epistemology; other examples include AGM theory (Alchourrón, Gärdenfors, and Makinson 1985) and ranking theory (Spohn 2012).

12. There exists a quantitative strand of epistemology (e.g. (Pollock 2001)) focusing on the numerical degree of justification granted a proposition by a particular body of evidence. I will return to this degree-of-justification approach in Section 6.4.1. For now, it suffices to note that even if an agent's evidence confers some numerically-measurable degree of justification on a particular proposition for her, that degree of justification is conceptually distinct from her degree of belief in the proposition, and the norms we will study apply to the latter.

13. I'll describe some details of this construction in Chapter XXX.

14. Joel Velasco reminded me that doctors often ask us to rate our pain on a scale of 1 to 10. May we respond only if we have numbers in our heads? In our nerves?

15. If we shouldn't think of the number in a numerical credence as part of the content of the proposition towards which the attitude is adopted, how exactly should we think of it? I tend to think of the numerical value as a sort of property or adjustable parameter of a particular doxastic attitude-type, credence. An agent adopts a credence towards a specific proposition, and it's a fact about that credence that it has degree 60% (or whatever). For a contrasting view, and arguments in favor of putting the numerical value in the content of a proposition believed, see (Holton 2014). Another option—which would take us too far afield to address in this book—is to read a credence as a belief in a more complex kind of content, one component of which is propositional and a distinct component of which is numeric. (Moss ms) adopts this approach.

16. Thanks to Elizabeth Bell for discussion.
Part II
The Bayesian Formalism
There are five core normative rules of Bayesian Epistemology: Kolmogorov's three probability axioms, the Ratio Formula, and updating by Conditionalization. That is not to say that these are the only normative rules Bayesians accept, or that all Bayesians accept all five of these. But one cannot understand any additional rules or replacement rules without understanding these five first.

Chapter 2 begins with some review of propositions and propositional logic. It then discusses unconditional credence, an agent's general degree of confidence that a particular proposition is true. The Kolmogorov axioms are introduced as rational constraints on unconditional credences, then their consequences are explored. Finally, I discuss how the resulting normative system goes beyond what one gets from simple non-numerical norms for comparative confidence.

Chapter 3 then introduces conditional credence—an agent's confidence that one proposition is true on the supposition that another proposition is. The Ratio Formula is a normative rule relating an agent's conditional credences to her unconditional credences. Chapter 3 applies the Ratio Formula to develop Bayesian notions of relevance and probabilistic independence. It then discusses relationships between conditional credences, causes, and conditional propositions.

The probability axioms and the Ratio Formula relate credences held by an agent at a given time to other credences held by that agent at the same time. Updating by Conditionalization relates an agent's credences at different times. After introducing Conditionalization, Chapter 4 discusses the roles that evidence and certainty play in that rule. It then explains how Conditionalization does the useful epistemological work of distinguishing an agent's evidence from the evidential standards she brings to bear on that evidence.

Chapter 5 begins by discussing notions of "Subjective" and "Objective" Bayesianism, and various interpretations of "probability" talk. It then covers a number of popular Bayesian norms that go beyond the core five, including: the Principal Principle, the Reflection Principle, various other deference principles, the Principle of Indifference, Countable Additivity, and Jeffrey Conditionalization.
Chapter 2
Probability Distributions

The main purpose of this chapter is to introduce Kolmogorov's probability axioms. These are the first three core normative rules of Bayesian Epistemology. They represent constraints that an agent's unconditional credence distribution at a given time must satisfy in order to be rational.

The chapter begins with a quick overview of propositional and predicate logic. The goal is to remind readers of logical notation and terminology we will need later; if this material is new to you, you can learn it from any introductory logic text. Next I introduce the notion of a numerical distribution over a propositional language, the tool Bayesians use to represent an agent's degrees of belief. Then I present the probability axioms, which are mathematical constraints on such distributions. Once the probability axioms are on the table, I point out some of their more intuitive consequences. The probability calculus is then used to analyze the Lottery Paradox scenario from Chapter 1, and Tversky and Kahneman's Conjunction Fallacy example.

Kolmogorov's axioms are the canonical way of defining what it is to be a probability distribution, and they are useful for doing probability proofs. Yet there are other, equivalent mathematical structures that Bayesians often use to illustrate points and solve problems. After presenting the axioms, this chapter describes how to work with probability distributions in three alternate forms: Venn diagrams, probability tables, and odds. I end the chapter by explaining what I think are the most distinctive elements of probabilism, and how probability distributions go beyond what one obtains from a comparative confidence ordering.
Figure 2.1: The space of possible worlds
2.1 Propositions and propositional logic
While other approaches are sometimes used, we will assume that degrees of belief are assigned to propositions.¹ In any particular application we will be interested in the degrees of belief an agent assigns to the propositions in some language L. L will contain a finite number of atomic propositions, which we will usually represent with capital letters (P, Q, R, etc.). The rest of the propositions in L are constructed in standard fashion from atomic propositions using five propositional connectives: ∼, &, ∨, ⊃, and ≡. A negation ∼P is true just in case P is false. A conjunction P & Q is true just in case its conjuncts P and Q are both true. "∨" represents inclusive "or"; a disjunction P ∨ Q is false just in case its disjuncts P and Q are both false. "⊃" represents the material conditional; P ⊃ Q is false just in case its antecedent P is true and its consequent Q is false. A material biconditional P ≡ Q is true just in case P and Q are both true or P and Q are both false.

Philosophers sometimes think about propositional connectives using sets of possible worlds. Possible worlds are somewhat like the alternate universes to which characters travel in science-fiction stories—events occur in a possible world, but they may be different events than occur in the actual world (the possible world in which we live). Possible worlds are maximally specified, such that for any event and any possible world that event either does or does not occur in that world. And the possible worlds are plentiful enough such that for any combination of events that could happen, there is a possible world in which that combination of events does happen. We can associate with each proposition the set of possible worlds in which that proposition is true.
Figure 2.2: The set of worlds associated with P ∨ Q
Imagine that in the Venn diagram of Figure 2.1 (named after a logical technique developed by John Venn), the possible worlds are represented as points inside the rectangle. Proposition P might be true in some of those worlds, false in others. We can draw a circle around all the worlds in which P is true, label it P, and then associate proposition P with the set of all possible worlds in that circle (and similarly for proposition Q). The propositional connectives can also be thought of in terms of possible worlds. ∼P is associated with the set of all worlds lying outside the P-circle. P & Q is associated with the set of worlds in the overlap of the P-circle and the Q-circle. P ∨ Q is associated with the set of worlds lying in either the P-circle or the Q-circle. (The set of worlds associated with P ∨ Q has been shaded in Figure 2.2 for illustration.) P ⊃ Q is associated with the set containing all the worlds except those that lie both inside the P-circle and outside the Q-circle. P ≡ Q is associated with the set of worlds that are either in both the P-circle and the Q-circle or in neither one.²
Warning: I keep saying that a proposition can be "associated" with the set of possible worlds in which that proposition is true. It's tempting to think that the proposition just is that set of possible worlds, but we will avoid that temptation. Here's why: The way we've set things up, any two logically equivalent propositions (such as P and ∼P ⊃ P) are associated with the same set of possible worlds. So if propositions just were their associated sets of possible worlds, P and ∼P ⊃ P would be the same proposition. Since we're taking credences to be assigned to propositions, that would mean that of necessity every agent assigns P and ∼P ⊃ P the same credence.
Eventually we're going to suggest that if an agent assigns P and ∼P ⊃ P different credences she's making a rational mistake. But we want our formalism to suggest it's a rational requirement that agents assign the same credence to logical equivalents, not a necessary truth. It's useful to think about propositions in terms of their associated sets of possible worlds, so we will continue to do so. But to keep logically equivalent propositions separate entities we will not say that a proposition just is a set of possible worlds.
Before we discuss logical relations among propositions, a word about notation. I said we will use capital letters as atomic propositions. We will also use capital letters as metavariables ranging over propositions. I might say, "If P entails Q, then. . . ". Clearly the atomic proposition P doesn't entail the atomic proposition Q. So what I'm trying to say in such a sentence is "Suppose we have one proposition (which we'll call 'P' for the time being) that entails another proposition (which we'll call 'Q'). Then. . . ". At first it may be confusing sorting atomic proposition letters from metavariables, but context will hopefully make my usage clear. (Look especially for such phrases as: "For any propositions P and Q . . . ".)³

2.1.1 Relations among propositions
Propositions P and Q are equivalent just in case they are associated with the same set of possible worlds—in each possible world, P is true just in case Q is. In that case I will write "P ⫤⊨ Q". P entails Q ("P ⊨ Q") just in case there is no possible world in which P is true but Q is not. On a Venn diagram, P entails Q when the P-circle is entirely contained within the Q-circle. (Keep in mind that one way for the P-circle to be entirely contained in the Q-circle is for them to be the same circle! When P is equivalent to Q, P entails Q and Q entails P.) P refutes Q just in case P ⊨ ∼Q. When P refutes Q, every world that makes P true makes Q false.⁴

For example, suppose I roll a six-sided die. The proposition that the die came up six entails the proposition that it came up even. The proposition that the die came up six refutes the proposition that it came up odd. The proposition that the die came up even is equivalent to the proposition that it did not come up odd—and each of those propositions entails the other.

P is a tautology just in case it is true in every possible world. In that case we write "⊨ P". I will sometimes use the symbol "T" to stand for a tautology. A contradiction is false in every possible world. I will
sometimes use "F" to stand for a contradiction. A contingent proposition is neither a contradiction nor a tautology.

Finally, we have properties of proposition sets of arbitrary size. The propositions in a set are consistent if there is at least one possible world in which all those propositions are true. The propositions in a set are inconsistent if no world makes them all true. The propositions in a set are mutually exclusive if no possible world makes more than one of them true. Put another way, each proposition in the set refutes each of the others. (For any propositions P and Q in the set, P ⊨ ∼Q.) The propositions in a set are jointly exhaustive if each possible world makes at least one of the propositions in the set true. In other words, the disjunction of all the propositions in the set is a tautology.

We will often work with proposition sets whose members are both mutually exclusive and jointly exhaustive. A mutually exclusive, jointly exhaustive set of propositions is called a partition. Intuitively, a partition is a way of dividing up the available possibilities. For example, in our die-rolling example the proposition that the die came up odd and the proposition that the die came up even together form a partition. When you have a partition, each possible world makes exactly one of the propositions in the partition true. On a Venn diagram, the regions representing the propositions in a partition combine to fill the entire rectangle without overlapping at any point.
2.1.2 State-descriptions
Suppose we are working with a language that has just two atomic propositions, P and Q. Looking back at Figure 2.1, we can see that these propositions divide the space of possible worlds into four mutually exclusive, jointly exhaustive regions. Figure 2.3 labels those regions s1, s2, s3, and s4. Each of the regions corresponds to one of the lines in the following truth-table:

       P   Q   state-description
  s1   T   T   P & Q
  s2   T   F   P & ∼Q
  s3   F   T   ∼P & Q
  s4   F   F   ∼P & ∼Q
Each line on the truth-table can also be described by a kind of proposition called a state-description. A state-description in language L is a conjunction in which (1) each conjunct is either an atomic proposition of L or its negation; and (2) each atomic proposition of L appears exactly
once. For example, P & Q and ∼P & Q are each state-descriptions. A state-description succinctly describes the possible worlds associated with a line on the truth-table. For example, the possible worlds in region s3 are just those in which P is false and Q is true; in other words, they are just those in which the state-description ∼P & Q is true. Given any language, its state-descriptions will form a partition.⁵

Figure 2.3: Four mutually exclusive, jointly exhaustive regions

Notice that the state-descriptions available for use are dependent on the language we are working with. If instead of language L we are working with a language L′ containing three atomic propositions (P, Q, and R), we will have eight state-descriptions available instead of L's four. (You'll work with these eight state-descriptions in Exercise 2.1. For now we'll go back to working with language L and its paltry four.)

Every non-contradictory proposition in a language has an equivalent that is a disjunction of state-descriptions. We call this disjunction the proposition's disjunctive normal form. For example, the proposition P ∨ Q is true in regions s1, s2, and s3. Thus

P ∨ Q ⫤⊨ (P & Q) ∨ (P & ∼Q) ∨ (∼P & Q)    (2.1)

The proposition on the righthand side is the disjunctive normal form equivalent of P ∨ Q. To find the disjunctive normal form of a non-contradictory proposition, figure out which lines of the truth-table it's true on, then make a disjunction of the state-descriptions associated with each such line.⁶
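The recipe just described is easy to mechanize. Below is a minimal Python sketch (my own illustration, not from the text; the function names and data layout are assumptions) that enumerates the state-descriptions of a two-atomic language and reads off a proposition's disjunctive normal form from the truth-table lines on which it is true.

```python
from itertools import product

atomics = ["P", "Q"]

def state_descriptions(atomics):
    """Yield each truth-value assignment as a dict, e.g. {'P': True, 'Q': False}."""
    for values in product([True, False], repeat=len(atomics)):
        yield dict(zip(atomics, values))

def describe(state):
    """Render a state-description as a conjunction of literals, e.g. 'P & ~Q'."""
    return " & ".join(p if v else "~" + p for p, v in state.items())

# The proposition P v Q, represented as a function from states to truth values.
prop = lambda s: s["P"] or s["Q"]

# Disjunctive normal form: disjoin the state-descriptions on which prop is true.
dnf = " v ".join("(" + describe(s) + ")" for s in state_descriptions(atomics) if prop(s))
print(dnf)   # (P & Q) v (P & ~Q) v (~P & Q)
```

Running the same routine on any non-contradictory proposition of the language reproduces the truth-table method just described.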
2.1.3 Predicate logic
Sometimes we will want to work with languages that represent objects and properties. To do so, we will first identify a universe of discourse, the
total set of objects under discussion. Each object in the universe of discourse will be represented by a constant, which will usually be a lower-case letter (a, b, c, . . . ). Properties of those objects and relations among them will be represented by predicates, which will be capital letters. Relations among propositions in such a language are exactly as described in the previous sections, except that we have two new kinds of propositions. First, our atomic propositions are now generated by applying a predicate to a constant, as in "Fa". Second, we can generate quantified sentences, as in "(∀x)(Fx ⊃ ∼Fx)". Since we will rarely be using predicate logic, I won't work through the details here; a thorough treatment can be found in any introductory logic text. I do want to emphasize, though, that as long as we restrict our attention to finite universes of discourse and finite property sets, all the logical relations we need can be handled by the propositional machinery discussed above. If, say, our only two constants are a and b and our only predicate is F, then the only atomic propositions in L will be Fa and Fb, for which we can build a standard truth-table:

  Fa   Fb   state-description
  T    T    Fa & Fb
  T    F    Fa & ∼Fb
  F    T    ∼Fa & Fb
  F    F    ∼Fa & ∼Fb
For any proposition in this language containing a quantifier, we can find an equivalent composed entirely of atomic propositions and propositional connectives. A universally-quantified sentence will be equivalent to a conjunction of its substitution instances, while an existentially-quantified sentence will be equivalent to a disjunction of its substitution instances. For example, when our only two constants are a and b we have:

(∃x)Fx ⫤⊨ Fa ∨ Fb    (2.2)

(∀x)(Fx ⊃ ∼Fx) ⫤⊨ (Fa ⊃ ∼Fa) & (Fb ⊃ ∼Fb)    (2.3)
As long as we stick to finite universes of discourse, every proposition will have an equivalent that uses only propositional connectives. So even when we work in predicate logic, every non-contradictory proposition will have an equivalent in disjunctive normal form.
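Under the same finite-domain assumption, here is a rough Python sketch (my own, not the book's) of the reduction: the quantified sentence (∃x)Fx and its propositional expansion Fa ∨ Fb agree on every line of the truth-table for the domain {a, b}.

```python
from itertools import product

constants = ["a", "b"]                      # a finite universe of discourse
atomics = ["F" + c for c in constants]      # atomic propositions Fa, Fb

def exists_F(state):
    """(Ex)Fx over this domain: true when F holds of at least one constant."""
    return any(state["F" + c] for c in constants)

def expansion(state):
    """Fa v Fb: the propositional expansion of (Ex)Fx for this domain."""
    return state["Fa"] or state["Fb"]

# Brute-force check of equivalence (2.2) over every truth-value assignment.
for values in product([True, False], repeat=len(atomics)):
    state = dict(zip(atomics, values))
    assert exists_F(state) == expansion(state)
print("(Ex)Fx and Fa v Fb agree on every line of the truth-table.")
```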
2.2 Probability distributions
A distribution over language L assigns a real number to each proposition in the language.⁷ Bayesians represent an agent's degrees of belief as a distribution over a language; I will use "cr" to symbolize an agent's credence distribution. For example, if an agent is 70% confident that it will rain tomorrow, we will write

cr(R) = 0.7    (2.4)
where R is the proposition that it will rain tomorrow. Another way to put this is that the agent's unconditional credence in rain tomorrow is 0.7. (Unconditional credences contrast with conditional credences, which we will discuss in Chapter 3.) Bayesians hold that a rational credence distribution satisfies certain rules. Among these are our first three core rules, Kolmogorov's axioms:

Non-Negativity: For any proposition P in L, cr(P) ≥ 0.

Normality: For any tautology T in L, cr(T) = 1.

Finite Additivity: For any mutually exclusive propositions P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q).
Kolmogorov’s axioms are often referred to as “the probability axioms”. Mathematicians call any distribution that satisfies these axioms a probability distribution. Kolmogorov (1933/1950) was the first to articulate these axioms as the foundation of mathematical probability theory.8
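As a concrete illustration (a sketch of my own, not part of the text), suppose we record an agent's credences over the state-descriptions of a small language. Checking that those values are non-negative and sum to 1 is then enough to guarantee that credences computed by summing rows satisfy Non-Negativity, Normality, and Finite Additivity.

```python
def is_probabilistic(cr_states, tol=1e-9):
    """Check that credences assigned over a partition of state-descriptions
    are non-negative and sum to 1.  Credences in other propositions are then
    obtained by summing rows, and Kolmogorov's axioms follow."""
    values = cr_states.values()
    return all(v >= -tol for v in values) and abs(sum(values) - 1) <= tol

# Two candidate assignments over the partition {P & Q, P & ~Q, ~P & Q, ~P & ~Q}:
good = {"P & Q": 0.1, "P & ~Q": 0.3, "~P & Q": 0.2, "~P & ~Q": 0.4}
bad  = {"P & Q": 0.1, "P & ~Q": 0.3, "~P & Q": 0.2, "~P & ~Q": 0.1}  # sums to 0.7

print(is_probabilistic(good))   # True
print(is_probabilistic(bad))    # False
```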
Warning: Kolmogorov's work inaugurated a mathematical field of probability theory distinct from the philosophical study of what probability is. While this was an important advance, it gave the word "probability" a special meaning in mathematical circles that can generate confusion elsewhere. For a 21st-century mathematician, Kolmogorov's axioms define what it is for a distribution to be a "probability distribution". This is distinct from the way people use "probability" in everyday life. For one thing, the word "probability" in English may not mean the same thing in every use. And even if it does, it would be a substantive philosophical thesis that probabilities (in the everyday sense) can be represented by a numerical distribution satisfying Kolmogorov's axioms. Going in the other direction, there are numerical distributions
satisfying the axioms that don't count as "probabilistic" in any ordinary sense. For example, we could invent a distribution "tv" that assigns 1 to every true proposition and 0 to every false proposition. To a mathematician, the fact that tv satisfies Kolmogorov's axioms makes it a probability distribution. But a proposition's tv-value might not match its probability in the everyday sense. Improbable propositions can turn out to be true (I just rolled snake-eyes!), and propositions with high probabilities can turn out to be false (the Titanic should've made it to port).

Probabilism is the philosophical view that rationality requires an agent's credences to form a probability distribution (that is, to satisfy Kolmogorov's axioms). Probabilism is attractive in part because it has intuitively appealing consequences. For example, from the probability axioms we can prove:

Negation: For any proposition P in L, cr(∼P) = 1 − cr(P).

According to Negation, rationality requires an agent with cr(R) = 0.7 to have cr(∼R) = 0.3. Among other things, Negation embodies the sensible thought that if you're highly confident that a proposition is true, you should be unconfident that its negation is. Usually I'll leave it as an exercise to prove that a particular consequence follows from the probability axioms, but here I will prove Negation as an example for the reader.

Negation Proof:
(1) P and ∼P are mutually exclusive          logic
(2) cr(P ∨ ∼P) = cr(P) + cr(∼P)              (1), Finite Additivity
(3) P ∨ ∼P is a tautology                    logic
(4) cr(P ∨ ∼P) = 1                           (3), Normality
(5) 1 = cr(P) + cr(∼P)                       (2), (4)
(6) cr(∼P) = 1 − cr(P)                       (5), algebra
2.2.1 Consequences of the probability axioms
Below are a number of further consequences of the probability axioms. Again, these consequences are listed in part to demonstrate the intuitive things that follow from the probability axioms. But I’m also listing them because they’ll be useful in future proofs.
Maximality: For any proposition P in L, cr(P) ≤ 1.

Contradiction: For any contradiction F in L, cr(F) = 0.

Entailment: For any propositions P and Q in L, if P ⊨ Q then cr(P) ≤ cr(Q).

Equivalence: For any propositions P and Q in L, if P ⫤⊨ Q then cr(P) = cr(Q).

General Additivity: For any propositions P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q) − cr(P & Q).

Finite Additivity (Extended): For any finite set of mutually exclusive propositions {P1, P2, ..., Pn}, cr(P1 ∨ P2 ∨ ... ∨ Pn) = cr(P1) + cr(P2) + ... + cr(Pn).

Decomposition: For any propositions P and Q in L, cr(P) = cr(P & Q) + cr(P & ∼Q).

Partition: For any finite partition of propositions in L, the sum of their unconditional cr-values is 1.
Together, Non-Negativity and Maximality establish the bounds of our credence scale. Rational credences will always fall between 0 and 1 (inclusive). Given these bounds, Bayesians represent absolute certainty that a proposition is true as a credence of 1 and absolute certainty that a proposition is false as credence 0. The upper bound is arbitrary—we could have set it at whatever positive real number we wanted. But using 0 and 1 lines up nicely with everyday talk of being 0% confident or 100% confident in particular propositions, and also with various considerations of frequency and chance discussed later in this book.

Entailment is plausible for all the same reasons Comparative Entailment was plausible in Chapter 1; we've simply moved from an expression in terms of confidence orderings to one using numerical credences. Understanding equivalence as mutual entailment, Entailment entails Equivalence. General Additivity is a generalization of Finite Additivity that allows us to calculate an agent's credence in any disjunction, whether the disjuncts are mutually exclusive or not. (When the disjuncts are mutually exclusive, their conjunction is a contradiction, the cr(P & Q) term equals 0, and General Additivity takes us back to Finite Additivity.)

Finite Additivity (Extended) can be derived by repeatedly applying Finite Additivity. Begin with any finite set of mutually exclusive propositions {P1, P2, ..., Pn}. By Finite Additivity,

cr(P1 ∨ P2) = cr(P1) + cr(P2)    (2.5)
Logically, since P1 and P2 are each mutually exclusive with P3, P1 ∨ P2 is also mutually exclusive with P3. So Finite Additivity yields

cr([P1 ∨ P2] ∨ P3) = cr(P1 ∨ P2) + cr(P3)    (2.6)
Combining Equations (2.5) and (2.6) then gives us

cr(P1 ∨ P2 ∨ P3) = cr(P1) + cr(P2) + cr(P3)    (2.7)

Next we would invoke the fact that P1 ∨ P2 ∨ P3 is mutually exclusive with P4 to derive

cr(P1 ∨ P2 ∨ P3 ∨ P4) = cr(P1) + cr(P2) + cr(P3) + cr(P4)    (2.8)

Clearly this process iterates as many times as we need to reach

cr(P1 ∨ P2 ∨ ... ∨ Pn) = cr(P1) + cr(P2) + ... + cr(Pn)    (2.9)
The idea here is that once you have Finite Additivity for proposition sets of size 2, you have it for proposition sets of any larger finite size as well. When the propositions in a finite set are mutually exclusive, the probability of their disjunction equals the sum of the probabilities of the disjuncts.

Combining Finite Additivity and Equivalence yields Decomposition. For any P and Q, P is equivalent to the disjunction of the mutually exclusive propositions P & Q and P & ∼Q, so cr(P) must equal the sum of the cr-values of those two. Partition then takes a finite set of mutually exclusive propositions whose disjunction is a tautology. By Finite Additivity (Extended), the cr-values of the propositions in the partition must sum to the cr-value of the tautology, which by Normality must be 1.
2.2.2 A Bayesian approach to the Lottery scenario
In future sections I'll explain some alternative ways of thinking about probabilities. But first let's use the probability calculus to do something: a Bayesian analysis of the situation in the Lottery Paradox. Recall the scenario from Chapter 1: A fair lottery has one million tickets.⁹ An agent is skeptical of each ticket that it will win, but takes it that some ticket will win. In Chapter 1 we saw that it's difficult to articulate norms on binary belief that depict this agent as believing rationally. But once we move to degrees of belief, the analysis is straightforward.

We'll use a language in which the constants a, b, c, . . . stand for the various tickets in the lottery, and the predicate W says that a particular ticket wins. A reasonable credence distribution over the resulting language sets

cr(Wa) = cr(Wb) = cr(Wc) = . . . = 1/1,000,000    (2.10)
Negation then gives us

cr(∼Wa) = cr(∼Wb) = cr(∼Wc) = 1 − 1/1,000,000 = 0.999999    (2.11)
reflecting the agent's high confidence for each ticket that that ticket won't win. What about the disjunction saying that some ticket will win? Since the Wa, Wb, Wc, . . . propositions are mutually exclusive, Finite Additivity (Extended) yields

cr(Wa ∨ Wb ∨ Wc ∨ Wd ∨ . . .) = cr(Wa) + cr(Wb) + cr(Wc) + cr(Wd) + . . .    (2.12)
On the righthand side of Equation (2.12) we have one million terms, each of which has a value of 1/1,000,000. Thus the credence on the lefthand side equals 1.

The Lottery Paradox is a problem for certain norms on binary belief. We haven't done anything to resolve that paradox here. Instead, we've shown that the lottery situation giving rise to the paradox can be easily modeled by Bayesian means. We've built a model of the lottery situation in which the agent is both highly confident that some ticket will win and highly confident of each ticket that it will not. (Constructing a similar model for the Preface is left as an exercise for the reader.) There is no tension with the rules of rational confidence represented in Kolmogorov's axioms. The Bayesian model not only accommodates but predicts that if an agent has a small confidence in each proposition of the form Wx, is certain that no two of those propositions can be true at once, and yet has a high enough number of Wx propositions available, that agent will be certain (or close to certain) that at least one of the Wx is true.

This analysis also reveals why it's difficult to simultaneously maintain both the Lockean thesis and the Belief Consistency norm from Chapter 1. The Lockean thesis implies that a rational agent believes a proposition just in case her credence in that proposition is above some numerical threshold. For any such threshold we pick (less than 1), it's possible to generate a lottery-type scenario in which the agent's credence that at least one ticket will win clears the threshold, but her credence for any given ticket that that ticket will lose also clears the threshold. Given the Lockean thesis, a rational agent will therefore believe that at least one ticket will win but also believe of each ticket that it will lose. This violates Belief Consistency, which says that every rational belief set is logically consistent.
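Here is a minimal numerical rendering of the model (my own sketch; the variable names are not the text's). With a credence of 1/1,000,000 in each Wx and the Wx treated as mutually exclusive, Negation and Finite Additivity (Extended) deliver exactly the combination of attitudes described above.

```python
n_tickets = 1_000_000
cr_win_each = 1 / n_tickets              # cr(Wa) = cr(Wb) = ... = 1/1,000,000

# Negation: high confidence, for each ticket, that it will lose.
cr_lose_each = 1 - cr_win_each
print(cr_lose_each)                      # 0.999999

# Finite Additivity (Extended): credence that some ticket or other wins.
cr_some_winner = sum(cr_win_each for _ in range(n_tickets))
print(round(cr_some_winner, 6))          # 1.0
```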
2.2.3 Doxastic possibilities
In the previous section we considered propositions of the form Wx, each of which says of some particular ticket that it will win the lottery. To perform various calculations involving these W propositions, we assumed that they form a partition—that is, that they are mutually exclusive and jointly exhaustive. But this isn't exactly right: there are possible worlds in which ticket a and ticket b both win the lottery, worlds in which no ticket wins the lottery, worlds in which the lottery never takes place, worlds in which humans never evolve, etc. The credence distribution we crafted for our agent assigns these sorts of worlds degree of belief 0. But could it ever be rational for an agent to assign these possibilities no credence whatsoever?

We will refer to the set of possible worlds an agent entertains as her doxastically possible worlds.¹⁰ Perhaps a fully rational agent never rules out any logically possible world; if so, then a rational agent's set of doxastic possibilities is always the full set of logical possibilities. We will discuss this claim when we turn to the Regularity Principle in Chapters 4 and 5. For the time being I want to note that even if a rational agent should never actually rule out a logically possible world, it might be convenient in particular contexts for her to temporarily ignore certain worlds as live possibilities. Pollsters calculating confidence intervals for their latest sampling data don't factor in the possibility that the United States will be overthrown before the next presidential election.

How is the probability calculus affected when an agent restricts her doxastically possible worlds to a proper subset of the logically possible worlds? Section 2.1 defined various relations among propositions in terms of possible worlds. In that context, the appropriate set of possible worlds to consider was the full set of logically possible worlds. But we can reinterpret those definitions as quantified over an agent's doxastically possible worlds. In our analysis of the Lottery scenario above, we effectively ignored possible worlds in which no tickets win the lottery or in which more than one ticket wins. For our purposes it was simpler to suppose that the agent rules them out of consideration. So our Bayesian model treated each Wx proposition as mutually exclusive with all the others, allowing us to apply Finite Additivity to generate equations like (2.12). If we were working with the full space of logically possible worlds we would have worlds in which more than one Wx proposition was true, so those propositions wouldn't count as mutually exclusive. But relative to the set of possible worlds we've supposed the agent entertains, they are.
2.2.4 Probabilities are weird! The Conjunction Fallacy
As you work with credences it's important to remember that probabilistic relations can function very differently from the relations among categorical concepts that inform many of our intuitions. In the Lottery situation it's perfectly rational for an agent to be highly confident of a disjunction while having low confidence in each of its disjuncts. That may seem strange. Tversky and Kahneman (1983) offer another probabilistic example that runs counter to most people's intuitions. In a famous study, they presented subjects with the following prompt:

Linda is 31 years old, single, outspoken and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

The subjects were then asked to rank the probabilities of the following propositions (among others):

• Linda is active in the feminist movement.
• Linda is a bank teller.
• Linda is a bank teller and is active in the feminist movement.
The "great majority" of Tversky and Kahneman's subjects ranked the conjunction as more probable than the bank teller proposition. But this violates the probability axioms! A conjunction will always entail each of its conjuncts. By our Entailment rule—which follows from the probability axioms—the conjunct must be at least as probable as the conjunction. Being more confident in a conjunction than its conjunct is known as the Conjunction Fallacy.
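A small brute-force check (my own sketch, with an arbitrary grid of values) makes the point vivid: however credences are spread over the state-descriptions for B ("Linda is a bank teller") and F ("Linda is a feminist"), cr(B & F) never exceeds cr(B).

```python
from itertools import product

# Spread credences over the state-descriptions B&F, B&~F, ~B&F, ~B&~F
# in steps of 0.05, keeping only assignments whose values sum to 1.
grid = [i * 0.05 for i in range(21)]
for a, b, c in product(grid, repeat=3):
    d = 1 - (a + b + c)
    if d < -1e-9:
        continue                      # not a probability distribution; skip
    cr_B = a + b                      # Decomposition: cr(B) = cr(B&F) + cr(B&~F)
    cr_B_and_F = a
    assert cr_B_and_F <= cr_B + 1e-9  # the conjunction never beats its conjunct
print("cr(B & F) <= cr(B) on every probabilistic assignment checked.")
```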
2.3 Alternative representations of probability
2.3.1 Probabilities in Venn diagrams

Earlier we used Venn diagrams to visualize propositions and the relations among them. We can also use Venn diagrams to picture probability distributions. All we have to do is attach significance to something that was unimportant before: the size of regions in the diagram. We stipulate that the area of the entire rectangle is 1. The area of a region inside the rectangle equals the agent's unconditional credence in any proposition associated with that region. (Note that this visualization technique works only for credence functions that satisfy the probability axioms.¹¹)

Figure 2.4: Areas equal to unconditional credences

For example, consider Figure 2.4. There we've depicted a probabilistic credence distribution in which the agent is more confident of proposition P than she is of proposition Q, as indicated by the P-circle's being larger than the Q-circle. What about cr(Q & P) versus cr(Q & ∼P)? On the diagram the region labeled s3 has slightly more area than the region labeled s1, so the agent is slightly more confident of Q & ∼P than Q & P. (When you construct your own Venn diagrams you need not include state-description labels like "s3"; I've added them for later reference.)
Warning: It is tempting to think that the size of a region in a Venn diagram represents the number of possible worlds in that region—the number of worlds that make the associated proposition true. But this would be a mistake. Just because an agent is more confident of one proposition than another does not necessarily mean she associates more possible worlds with the former than the latter. For example, if I tell you I have a weighted die that is more likely to come up 6 than any other number, your increased confidence in 6 does not necessarily mean that you think there are disproportionately many worlds in which the die lands 6. The area of a region in a Venn diagram is a useful visual representation of an agent's confidence in its associated proposition. We should not read too much out of it about the distribution of possible worlds.¹²
Figure 2.5: P ⊨ Q
Venn diagrams make it easy to see why certain probabilistic relations hold. For example, take the General Additivity rule from Section 2.2.1. In Figure 2.4, the P ∨ Q region contains every point that is in the P-circle, in the Q-circle, or in both. We could calculate the area of that region by adding up the area of the P-circle and the area of the Q-circle, but in doing so we'd be counting the P & Q region (labeled s1) twice. We adjust for the double-counting as follows:

cr(P ∨ Q) = cr(P) + cr(Q) − cr(P & Q)    (2.13)
That's General Additivity. Figure 2.5 depicts a situation in which proposition P entails proposition Q. As discussed earlier, this requires the P-circle to be wholly contained within the Q-circle. But since areas now represent unconditional credences, the diagram makes it obvious that the cr-value of proposition Q must be at least as great as the cr-value of proposition P. That's exactly what our Entailment rule requires. (It also shows why the Conjunction Fallacy is a mistake—imagine Q is the proposition that Linda is a bank teller and P is the proposition that Linda is a feminist bank teller.)

Venn diagrams can be a useful way of visualizing probabilistic relationships. Bayesians often clarify a complex situation by sketching a quick Venn diagram of the agent's credence distribution. There are limits to this technique; when our languages grow beyond 3 or so atomic propositions it becomes difficult to get all the overlapping regions one needs and to make areas proportional to credences. But there are also cases in which it's much easier to understand why a particular theorem holds by looking at a diagram than by working with the axioms.
2.3.2 Probability tables
Besides being represented visually in a Venn diagram, a probability distribution can be represented precisely and efficiently in a probability table. To build a probability table, we begin with a set of propositions forming a partition of the agent's doxastic possibilities. For example, suppose an agent is going to roll a loaded six-sided die that comes up six on half of its rolls (with the remaining rolls distributed equally among the other numbers). A natural partition of the agent's doxastic space uses the propositions that the die comes up one, the die comes up two, the die comes up three, etc. The resulting probability table looks like this:

  proposition            cr
  Die comes up one.      1/10
  Die comes up two.      1/10
  Die comes up three.    1/10
  Die comes up four.     1/10
  Die comes up five.     1/10
  Die comes up six.      1/2
The probability table first lists the propositions in the partition. Then for each proposition it lists the agent's unconditional credence in that proposition. The credence values must follow two important rules:

1. Each value must be non-negative.
2. The values in the column must sum to 1.

The first rule follows from Non-Negativity, while the second follows from our Partition theorem. Once we know the credences of partition members, we can calculate the agent's unconditional credence in any other proposition expressible in terms of that partition. First, any contradiction receives credence 0. Then for any other proposition, we figure out which rows of the table it's true on, and calculate its credence by summing the values on those rows. For example, we might be interested in the agent's credence that the die roll comes up even. The proposition that the roll comes up even is true on the second, fourth, and sixth rows of the table, so the agent's credence in that proposition is 1/10 + 1/10 + 1/2 = 7/10. We can calculate the agent's credence this way because

E ⫤⊨ 2 ∨ 4 ∨ 6    (2.14)
where E is the proposition that the die came up even, "2" represents its coming up two, etc. By Equivalence,

cr(E) = cr(2 ∨ 4 ∨ 6)    (2.15)

Since the propositions on the right are members of a partition, they are mutually exclusive, so Finite Additivity (Extended) yields

cr(E) = cr(2) + cr(4) + cr(6)    (2.16)
So the agent's unconditional credence in E can be found by summing the values on the second, fourth, and sixth rows of the table.

Given a propositional language L, it's often useful to build a probability table using the partition containing L's state-descriptions. For example, for a language with two atomic propositions P and Q, I might give you the following probability table:

       P   Q   cr
  s1   T   T   0.1
  s2   T   F   0.3
  s3   F   T   0.2
  s4   F   F   0.4
The state-descriptions in this table are fully specified by the Ts and Fs appearing under P and Q in each row, but I've also provided labels (s1, s2, ...) for each state-description to show how they correspond to regions in Figure 2.4. Suppose a probabilistic agent has the unconditional credences specified in this table. What credence does she assign to P ∨ Q? From the Venn diagram we can see that P ∨ Q is true on state-descriptions s1, s2, and s3. So we find cr(P ∨ Q) by adding up the cr-values on the first three rows of our table. In this case cr(P ∨ Q) = 0.6.

A probability table over state-descriptions is a particularly efficient way of specifying an agent's unconditional credence distribution over an entire propositional language.¹³ A language L closed under the standard connectives contains infinitely many propositions, so a distribution over that language contains infinitely many values. If the agent's credences satisfy the probability axioms, the Equivalence rule tells us that equivalent propositions must all receive the same credence. So we can specify the entire distribution just by specifying its values over a maximal set of non-equivalent propositions in the language.
But that can still be a lot! If L has n atomic propositions, it will contain 2^(2^n) non-equivalent propositions (see Exercise 2.3). For 2 atomics that's only 16 credence values to specify, but by the time we reach 4 atomics it's up to 65,536 distinct values.

On the other hand, a language with n atomics will contain only 2^n state-descriptions. And once we provide unconditional credences for these propositions in our probability table, all the remaining values in the distribution follow. Every contradictory proposition receives credence 0, while each non-contradictory proposition is equivalent to a disjunction of state-descriptions (its disjunctive normal form). By Finite Additivity (Extended), the credence in a disjunction of state-descriptions is just the sum of the credences assigned to those state-descriptions. So the probability table contains all the information we need to specify the full distribution.¹⁴
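A short Python sketch (mine, not the book's) shows the economy at work: store only the 2^n state-description credences, then compute the credence of any other proposition by summing the rows on which it is true.

```python
# The probability table over state-descriptions (rows s1-s4 from the text).
table = {
    (True, True): 0.1,    # s1: P & Q
    (True, False): 0.3,   # s2: P & ~Q
    (False, True): 0.2,   # s3: ~P & Q
    (False, False): 0.4,  # s4: ~P & ~Q
}

def cr(prop):
    """Sum the credences of the state-descriptions on which prop is true."""
    return sum(w for row, w in table.items() if prop(*row))

print(round(cr(lambda p, q: p or q), 3))        # cr(P v Q)  = 0.6
print(round(cr(lambda p, q: (not p) or q), 3))  # cr(P -> Q) = 0.7
print(cr(lambda p, q: p and not p))             # contradiction: 0
```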
2.3.3 Using probability tables
Probability tables describe an entire credence distribution in an efficient manner; instead of specifying a credence value for each non-equivalent proposition in the language, we need only specify values for its state-descriptions. Credences in state-descriptions can then be used to calculate credences in other propositions. But probability tables can also be used to prove theorems and solve problems. To do so, we replace the numerical credence values in the table with variables:

       P   Q   cr
  s1   T   T   a
  s2   T   F   b
  s3   F   T   c
  s4   F   F   d
This probability table for an L with two atomic propositions makes no assumptions about the agent’s specific credence values. It is therefore fully general, and can be used to prove general theorems about probability distributions. For example, on this table
cr(P) = a + b    (2.17)

But a is just cr(P & Q), and b is cr(P & ∼Q).
This gives us a very quick proof of the Decomposition rule from Section 2.2.1. It's often much easier to prove a general probability result using a probability table built on state-descriptions than it is to prove the same result from Kolmogorov's axioms.
As for problem-solving, suppose I tell you that my credence distribution satisfies the probability axioms and also has the following features: I am certain of P ∨ Q, and I am equally confident in Q and ∼Q. I then ask you to tell me my credence in P ⊃ Q.
You might be able to solve this problem by drawing a careful Venn diagram—perhaps you can even solve it in your head! If not, the probability table provides a purely algebraic solution method. We start by expressing the constraints on my distribution as equations using the variables from the table. From our second rule for filling out probability tables we have:

a + b + c + d = 1    (2.18)
(Sometimes it also helps to invoke the first rule, writing inequalities specifying that a, b, c, and d are each greater than or equal to 0. In this particular problem those inequalities aren't needed.) Next we represent the fact that I am equally confident in Q and ∼Q:

cr(Q) = cr(∼Q)    (2.19)

a + c = b + d    (2.20)
Finally, we represent the fact that I am certain of P ∨ Q. The only line of the table on which P ∨ Q is false is line s4; if I'm certain of P ∨ Q, I must assign this state-description a credence of 0. So

d = 0    (2.21)
Now what value are we looking for? I've asked you for my credence in P ⊃ Q; that proposition is true on lines s1, s3, and s4; so you need to find a + c + d. Applying a bit of algebra to Equations (2.18), (2.20), and (2.21), you should be able to determine that a + c + d = 1/2.
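The same algebra can be handed to a computer-algebra system. Here is a rough sketch using sympy (an outside tool, not something the text relies on), encoding constraints (2.18), (2.20), and (2.21) and evaluating a + c + d.

```python
from sympy import symbols, solve

a, b, c, d = symbols("a b c d", nonnegative=True)

constraints = [
    a + b + c + d - 1,   # (2.18): the table's values sum to 1
    (a + c) - (b + d),   # (2.20): cr(Q) = cr(~Q)
    d,                   # (2.21): certainty in P v Q forces the s4 value to 0
]

solution = solve(constraints, [b, c, d], dict=True)[0]
target = (a + c + d).subs(solution)   # cr(P -> Q) is the sum of rows s1, s3, s4
print(target)                         # 1/2
```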
2.3.4 Odds
Agents sometimes report their levels of confidence using odds rather than probabilities. If an agent's unconditional credence in P is cr(P), her odds for P are cr(P) : cr(∼P), and her odds against P are cr(∼P) : cr(P). For example, there are 37 pockets on a European roulette wheel. (American wheels have more.) 18 of those pockets are black. Suppose an agent's credences obey the probability axioms, and she assigns equal credence to the roulette ball's landing in any of the 37 pockets. Then her credence that
the ball will land in a black pocket is 18/37, and her credence that it won't is 19/37. Her odds for black are therefore

18/37 : 19/37, or 18 : 19    (2.22)
(2.22)
(Since the agent assigns equal credence to each of the pockets, these odds are easily found by comparing the number of pockets that make the proposition true to the number of pockets that make it false.) Yet in gambling contexts we usually report odds against a proposition. So in a casino someone might say that the odds against the ball’s landing in the single green pocket are “36 to 1”. The odds agains t an event are tightly connected to the stakes at which it would be fair to gamble on that event, which we will discuss in Chapter 7. Warning: Instead of using a colon or the word “to”, people sometimes quote odds as fractions . So someone might say that the odds for the roulette ball’s landing in a black pocket are “18 19”. It’s important not to mistake this fraction for a probability value. If
{
your odds for black are 18 : 19, you take the ball’s landing on black to a bit less lik ely to happen than not. But if your unconditional credence in black were 18 19, you would always bet on black!
{
It can be useful to think in terms of odds not only for calculating betting stakes, but also because odds highlight differences that may be obscured by probability values. Suppose you hold a single ticket in a lottery that you take to be fair. Initially you think that the lottery contains only 2 tickets, of which yours is one. But then someone tells you there are 100 tickets in the lottery. This is a significant blow to your chances, witnessed by the fact that your assessment of the odds against winning has gone from 1 : 1 to 99 : 1. The significance of this change can also be seen in your unconditional credence that you will lose, which has jumped from 50% to 99%. But now it turns out that your informant was misled, and there are actually 10,000 tickets in the lottery! This is another significant blow to your chances, intuitively at least as bad as the first jump in size. And indeed, your odds against winning go from 99 : 1 to 9,999 : 1. Yet your credence that you'll lose moves only from 99% to 99.99%. Probabilities work on an additive scale; from that perspective a move from 0.5 to 0.99 looks important while a move from 0.99 to 0.9999 looks like a rounding error. But odds use ratios, which highlight multiplicative effects more obviously.
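A couple of helper functions (my own sketch) make the probability/odds translation, and the multiplicative contrast just described, easy to compute:

```python
from fractions import Fraction

def odds_against(p):
    """Odds against an event of probability p, returned as the ratio against : 1."""
    return (Fraction(1) - p) / p

def credence_from_odds_against(against, in_favor=1):
    """Probability of an event given odds of `against : in_favor` against it."""
    return Fraction(in_favor, against + in_favor)

print(odds_against(Fraction(1, 2)))       # 1     (1 : 1 against)
print(odds_against(Fraction(1, 100)))     # 99    (99 : 1 against)
print(odds_against(Fraction(1, 10000)))   # 9999  (9,999 : 1 against)
print(credence_from_odds_against(36))     # 1/37  (the single green roulette pocket)
```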
2.4 What the probability calculus adds
In Chapter 1 we moved from thinking of agents' doxastic attitudes in terms of binary (categorical) beliefs and confidence comparisons to working with numerical degrees of belief. At a first pass, this is a purely descriptive maneuver, yielding descriptions of an agent's attitudes at a higher fineness of grain. As we saw in Chapter 1, this added level of descriptive detail confers both advantages and disadvantages. On the one hand, credences allow us to say how much more confident an agent is of one proposition than another. On the other hand, assigning numerical credences over a set of propositions introduces a total ordering, making all the propositions commensurable with respect to the agent's confidences. This may be an unrealistic result.

Chapter 1 also offered a norm for comparative confidence orderings:

Comparative Entailment: For any pair of propositions such that the first entails the second, rationality requires an agent to be at least as confident of the second as the first.

We have now introduced Kolmogorov's probability axioms as a set of norms on credences. Besides the descriptive changes that happen when we move from comparative confidences to numerical credences, how do the probability axioms go beyond Comparative Entailment? What more do we demand of an agent when we require that her credences be probabilistic?

Comparative Entailment can be derived from the probability axioms—we've already seen that by the Entailment rule, if P ⊨ Q then rationality requires cr(P) ≤ cr(Q). But how much of the probability calculus can be recreated simply by assuming that Comparative Entailment holds? We saw in Chapter 1 that if Comparative Entailment holds, a rational agent will assign equal, maximal confidence to all tautologies and equal, minimal confidence to all contradictions. This doesn't give specific numerical confidence values to contradictions and tautologies, because Comparative Entailment doesn't work with numbers. But the probability axioms' 0-to-1 scale for credence values is fairly stipulative and arbitrary anyway. The real essence of Normality, Contradiction, Non-Negativity, and Maximality can be obtained from Comparative Entailment.

That leaves one axiom unaccounted for. To me the key insight of probabilism—and the element most responsible for Bayesianism's distinctive contributions to epistemology—is Finite Additivity. Finite Additivity places demands on rational credence that don't follow from any comparative norms we've seen. To see how, consider the following two credence distributions over a language with one atomic proposition:
  Mr. Prob:   cr(F) = 0    cr(P) = 1/6     cr(∼P) = 5/6      cr(T) = 1
  Mr. Weak:   cr(F) = 0    cr(P) = 1/36    cr(∼P) = 25/36    cr(T) = 1
From a confidence ordering point of view, Mr. Prob and Mr. Weak are identical; they each rank P above ∼P and both those propositions between a tautology and a contradiction. Both agents satisfy Comparative Entailment. Both agents also satisfy the Non-Negativity and Normality probability axioms. But only Mr. Prob satisfies Finite Additivity. His credence in the tautologous disjunction P ∨ ∼P is the sum of his credences in its mutually exclusive disjuncts. Mr. Weak's credences, on the other hand, are superadditive: he assigns more credence to the disjunction than the sum of his credences in its mutually exclusive disjuncts. (1 > 1/36 + 25/36.) Probabilism goes beyond Comparative Entailment by exalting Mr. Prob over Mr. Weak. In endorsing Finite Additivity, the probabilist holds that Mr. Weak's credences have an irrational feature not present in Mr. Prob's. When we apply Bayesianism in later chapters, we'll see that Finite Additivity—a kind of linearity constraint—gives rise to some of the theory's most interesting and useful results.
Of course, the fan of comparative confidence orderings need not restrict herself to the Comparative Entailment norm. Chapter ?? will explore further comparative constraints that have been proposed. We will ask whether those non-numerical norms can replicate all the desirable results secured by Finite Additivity for the Bayesian credal regime. This will be an especially pressing question because the impressive numerical credence results come with a price. When we examine explicit philosophical arguments for the probability axioms in Part IV of this book, we'll find that while Normality and Non-Negativity can be straightforwardly argued for, Finite Additivity is the most difficult part of Bayesian Epistemology to successfully defend.
2.5 Exercises
Problem 2.1. (a) List all eight state-descriptions available in a language with the three atomic sentences P , Q, and R.
(b) Give the disjunctive normal form of (P ∨ Q) ⊃ R.
Problem 2.2. Here's a fact: For any non-contradictory propositions X and Y, X ⊨ Y if and only if every disjunct in the disjunctive normal form equivalent of X is also a disjunct of the disjunctive normal form equivalent of Y.
(a) Use this fact to show that (P ∨ Q) & R ⊨ (P ∨ Q) ⊃ R.
(b) Explain why the fact is true. (Be sure to explain both the "if" direction and the "only if" direction!)

Problem 2.3. Explain why a language L with n atomic propositions can express exactly 2^(2^n) non-equivalent propositions. (Hint: Think about the number of state-descriptions available, and the number of distinct disjunctive normal forms.)

Problem 2.4. Suppose your universe of discourse contains only two objects, named by the constants "a" and "b".
(a) Find a quantifier-free equivalent of the proposition (∀x)[Fx ⊃ (∃y)Gy].
(b) Find the disjunctive normal form of your quantifier-free proposition from part (a).
Problem 2.5. Can a probabilistic credence distribution assign cr(P) = 0.5, cr(Q) = 0.5, and cr(∼P & ∼Q) = 0.8? Explain why or why not.∗
Problem 2.6. Starting with only the probability axioms and Negation, write out proofs for all of the probability rules listed in Section 2.2.1. Your proofs must be straight from the axioms—no using Venn diagrams or probability tables! Once you prove a rule you may use it in further proofs. (Hint: You may want to prove them in an order different from the one in which they're listed. And I did Finite Additivity (Extended) for you.)

Problem 2.7. In The Empire Strikes Back, C-3PO tells Han Solo that the odds against successfully navigating an asteroid field are 3,720 to 1.
(a) What is C-3PO's unconditional credence that they will successfully navigate the asteroid field? (Express your answer as a fraction.)
(b) Suppose C-3PO is certain that they will survive if they either successfully navigate the asteroid field, or unsuccessfully navigate it but hide in a cave. He is also certain that those are the only two ways they can survive, and his odds against unsuccessfully navigating and hiding in a cave are 59 to 2. Assuming C-3PO's credences obey the probability axioms, what are his odds against their surviving?
(c) How does Han respond to 3PO's odds declaration? (Hint: Apparently Han prefers to be quoted linear probabilities.)
∗ I owe this problem to Julia Staffel.
Problem 2.8. Consider the probabilistic credence distribution specified by this probability table:

  P   Q   R   cr
  T   T   T   0.1
  T   T   F   0.2
  T   F   T   0
  T   F   F   0.3
  F   T   T   0.1
  F   T   F   0.2
  F   F   T   0
  F   F   F   0.1
Calculate each of the following values on this distribution:
(a) cr(P ≡ Q)  (b) cr(R ⊃ Q)  (c) cr(P & R) − cr(~P & R)  (d) cr(P & Q & R)/cr(R)
Problem 2.9. Can an agent have a probabilistic cr-distribution meeting all of the following constraints?

1. The agent is certain of A ⊃ (B ≡ C).
2. The agent is equally confident of B and ~B.
3. The agent is twice as confident of C as C & A.
4. cr(B & C & ~A) = 1/5.

If not, prove that it's impossible. If so, provide a probability table and demonstrate that the resulting distribution satisfies each of the four constraints.

Problem 2.10. Tversky and Kahneman's finding that ordinary subjects commit the Conjunction Fallacy has held up to a great deal of experimental replication. Kolmogorov's axioms make it clear that the propositions involved cannot range from most probable to least probable in the way subjects consistently rank them. Do you have any suggestions for why subjects might consistently make this mistake? Is there any way to read what the subjects are doing as rationally acceptable?
Problem 2.11. Recall Mr. Prob and Mr. Weak from Section 2.4. Mr. Weak assigns lower credences to each contingent proposition than does Mr. Prob. While Mr. Weak's distribution satisfies Non-Negativity and Normality, it violates Finite Additivity by being superadditive: it contains a disjunction whose credence is greater than the sum of the credences of its mutually exclusive disjuncts. Construct a credence distribution for "Mr. Bold" over language L with single atomic proposition P. Mr. Bold should rank every proposition in the same order as Mr. Prob and Mr. Weak. Mr. Bold should also satisfy Non-Negativity and Normality. But Mr. Bold's distribution should be subadditive: it should contain a disjunction whose credence is less than the sum of the credences of its mutually exclusive disjuncts.
2.6 Further reading
Introductions and Overviews
Merrie Bergmann, James Moor, and Jack Nelson (2013). The Logic Book. 6th edition. New York: McGraw Hill
One of many available texts that thoroughly covers the logical material assumed in this book.

Ian Hacking (2001). An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press

Brian Skyrms (2000). Choice & Chance: An Introduction to Inductive Logic. 4th edition. Stamford, CT: Wadsworth

Each of these books contains a Chapter 6 offering an entry-level, intuitive discussion of the probability rules—though neither explicitly uses Kolmogorov's axioms. Hacking has especially nice applications of probabilistic reasoning, along with many counter-intuitive examples like the Conjunction Fallacy from our Section 2.2.4.

Classic Texts
A. N. Kolmogorov (1933/1950). Foundations of the Theory of Probability. Translation edited by Nathan Morrison. New York: Chelsea Publishing Company
Text in which Kolmogorov laid out his famous axiomatization of probability theory.

Extended Discussion
J. Robert G. Williams (ta). Probability and Non-Classical Logic. In: Oxford Handbook of Probability and Philosophy. Ed. by Alan Hájek and Christopher R. Hitchcock. Oxford University Press

Covers probability distributions in non-classical logics, such as logics with non-classical entailment rules and logics with more than one truth-value. Also briefly discusses probability distributions in logics with extra connectives and operators, such as modal logics.

Branden Fitelson (2008). A Decision Procedure for Probability Calculus with Applications. The Review of Symbolic Logic 1, pp. 111–125

Fills in the technical details of solving probability problems algebraically using probability tables (which Fitelson calls "stochastic truth-tables"), including the relevant meta-theory. Also describes a Mathematica package that will solve probability problems and evaluate probabilistic conjectures for you, downloadable for free at http://fitelson.org/PrSAT/.
Notes

1 Among various alternatives, some authors assign degrees of belief to sentences, statements, or sets of events. Also, some views of propositions make them identical to one of these alternatives. I will not assume much about what propositions are, except that: they are capable of having truth-values (that is, capable of being true or false); they are expressible by declarative sentences; and they have enough internal structure to contain logical operators. This last assumption could be lifted with a bit of work.

2 Bayesians sometimes define degrees of belief over a sigma algebra. A sigma algebra is a set of sets that is closed under (countable) union, (countable) intersection, and complementation. Given a language L, the sets of possible worlds associated with the propositions in that language form a sigma algebra. The algebra is closed under union, intersection, and complementation because the propositions in L are closed under disjunction, conjunction, and negation (respectively).

3 I'm also going to be fairly cavalier about the use-mention distinction, corner-quotes, and the like.

4 Throughout this book we will be assuming a classical logic, in which each proposition has exactly one of two available truth-values (true/false) and entailment obeys the inference
rules taught in standard introductory logic classes. For information about probability in non-classical logics, see the Further Readings at the end of this chapter.

5 The cognoscenti will note that in order for the state-descriptions of L to form a partition, the atomic propositions of L must be (logically) independent. We will assume throughout this book that every propositional language employed contains logically independent atomic propositions, unless explicitly noted otherwise.

6 Strictly, in order to get the result that the state-descriptions in a language form a partition and the result that each non-contradictory proposition has a unique disjunctive normal form, we need to further regiment our definitions. To our definition of a state-description we add that the atomic propositions must appear in alphabetical order. We then introduce a canonical ordering of the state-descriptions in a language (say, the order in which they appear in a standardly-ordered truth-table) and require disjunctive normal form propositions to contain their disjuncts in canonical order with no repetition.

7 In the statistics community, probability distributions are often assigned over the possible values of sets of random variables. Propositions are then thought of as dichotomous random variables capable of taking only the values 1 and 0 (for "true" and "false", respectively). Only rarely in this book will we look past distributions over propositions to more general random variables.

8 The axioms I've presented are not precisely identical to Kolmogorov's, but the differences are insignificant for our purposes. Some authors also include Countable Additivity—which we'll discuss in Chapter 5—among "Kolmogorov's axioms", but I'll use the phrase to pick out only Non-Negativity, Normality, and Finite Additivity. Galavotti (2005, pp. 54–5) notes that authors such as (Mazurkiewicz 1932) and (Popper 1938) also provided axioms for probability around the time Kolmogorov was working. She recommends (Roeper and Leblanc 1999) for an extensive survey of the axiomatizations available.

9 This analysis could easily be generalized to any large number of tickets.

10 Philosophers sometimes describe the worlds an agent entertains as her "epistemically possible worlds". Yet that term also carries a connotation of being determined by what the agent knows. So I'll discuss doxastically possible worlds, which are determined by what an agent takes to be possible rather than what she knows.

11 A probability distribution over sets of possible worlds is an example of what mathematicians call a "measure". The function that takes any region of a two-dimensional space and outputs its area is also a measure. That's what makes probabilities representable by areas in a rectangle.

12 To avoid the confusion discussed here, some authors use "muddy" Venn diagrams in which all atomic propositions have regions of the same size and probability weights are indicated by piling up more or less "mud" on top of particular regions. Muddy Venn diagrams are difficult to depict on two-dimensional paper, so I've stuck with representing increased confidence as increased region size.

13 Truth-tables famously come to us from (Wittgenstein 1921/1961), in which Wittgenstein also proposed a theory of probability assigning equal value to each state-description. But to my knowledge the first person to characterize probability distributions in general by the values they assign to state-descriptions was Carnap, as in his (1945, Sect. 3).

14 We have argued from the assumption that an agent's credences satisfy the probability axioms to the conclusion that her unconditional credence in any non-contradictory proposition is the sum of her credences in the disjuncts of its disjunctive normal form. One can also argue in the other direction. Suppose I stipulate an agent's credence distribution over language L as follows: (1) I stipulate unconditional credences for L's state-descriptions
that are non-negative and sum to 1; (2) I stipulate that for every other non-contradictory proposition in L, the agent's credence in that proposition is the sum of her credences in the disjuncts of that proposition's disjunctive normal form; and (3) I stipulate that the agent's credence in each contradiction is 0. We can then prove that the credence distribution I've just stipulated satisfies Kolmogorov's three probability axioms. I'll leave the (somewhat challenging) proof as an exercise for the reader.
Chapter 3
Conditional Credences Chapter 2’s discussion was confined to unconditional credence, an agent’s outright degree of confidence that a particular proposition is true. This chapter takes up conditional credence, an agent’s credence that one proposition is true on the supposition that another one is. The main focus of this chapter is our fourth core normative Bayesian rule: the Ratio Formula. This rational constrain t on conditional credenc es has a number of important consequences, including Bayes’ Theorem (which gives Bayesianism its name). Conditional credences are also central to the way Bayesians understand evidential relevance. I will define relevance as positive correlation, then explain how this notion has been used to investigate causal relations through the concept of screening off. Having achieved a deeper understanding of the mathematics of conditional credences, I return at the end of the chapter to what exactly a conditional credence is. In particular, I discuss an argument by David Lewis that a conditional credence can’t be understood as an unconditional credence in a conditional.
3.1 Conditional credences and the Ratio Formula
Andy and Bob know that two events will occur simultaneously in separate rooms: a fair coin will be flipped, and a clairvoyant will predict how it will land. Let H represent the proposition that the coin comes up heads, and C represent the proposition that the clairvoyant predicts heads. Suppose Andy and Bob each assign an unconditional credence of 1/2 to H and an unconditional credence of 1/2 to C.
Although Andy and Bob assign the same unconditional credences as each other to H and C, they still might take these propositions to be related in different ways. We could tease out those differences by saying to each agent, "I have no idea how the coin is going to come up or what the clairvoyant is going to say. But suppose for a moment the clairvoyant predicts heads. On this supposition, how confident are you that the coin will come up heads?" If Andy says 1/2 and Bob says 99/100, that's a good indication that Bob has more faith in the mystical than Andy.

The quoted question in the previous paragraph elicits Andy and Bob's conditional credences, as opposed to the unconditional credences discussed in Chapter 2. An unconditional credence is a degree of belief assigned to a single proposition, indicating how confident the agent is that that proposition is true. A conditional credence is a degree of belief assigned to an ordered pair of propositions, indicating how confident the agent is that the first proposition is true on the supposition that the second is. We symbolize conditional credences as follows:

    cr(H | C) = 1/2    (3.1)

This equation says that a particular agent (in this case, Andy) has a 1/2 credence that the coin comes up heads conditional on the supposition that the clairvoyant predicts heads. The vertical bar indicates a conditional credence; to the right of the bar is the proposition supposed; to the left of the bar is the proposition evaluated in light of that supposition. The proposition to the right of the bar is sometimes called the condition; I am not aware of any generally-accepted name for the proposition on the left.

To be clear: A real agent never assigns any credences ex nihilo, without assuming at least some background information. An agent's unconditional credences in various propositions (such as H) are informed by her background information at that time. To assign a conditional credence, the agent combines her stock of background information with a further supposition that the condition is true. She then evaluates the other proposition in light of this combination.

A conditional credence is assigned to an ordered pair of propositions. It makes a difference which proposition is supposed and which is evaluated. Consider a case in which I'm going to roll a fair die and you have various credences involving the proposition E that it comes up even and the proposition 6 that it comes up six. Compare:

    cr(6 | E) = 1/3    (3.2)
    cr(E | 6) = 1    (3.3)
[Figure 3.1: cr(P | Q). A rectangle of possible worlds containing overlapping circles for P and Q; the worlds outside the Q-circle are shaded.]
3.1.1 The Ratio Formula
Section 2.2 described Kolmogorov’s probability axioms, which Bayesians take to represent rational constraints on an agent’s unconditional credences. Bayesians then add a constraint relating conditional to unconditional credences:
Ratio Formula: For any P and Q in L, if cr(Q) > 0 then

    cr(P | Q) = cr(P & Q)/cr(Q)

Stated this way, the Ratio Formula remains silent on the value of cr(P | Q) when cr(Q) = 0. There are various positions on how one should assign conditional credences when the condition has credence 0; we'll address some of them in our discussion of the infinite in Chapter 5.

Why should an agent's conditional credences equal the ratio of those unconditionals? Consider Figure 3.1. The rectangle represents all the possible worlds the agent entertains. The agent's unconditional credence in P is the fraction of that rectangle taken up by the P-circle. (The area of the rectangle is stipulated to be 1, so that fraction is the area of the P-circle divided by 1, which is just the area of the P-circle.) When we ask the agent to evaluate a credence conditional on the supposition that Q, she temporarily narrows her focus to just those possibilities that make Q true. In other words, she excludes from her attention the worlds I've shaded in the diagram, and considers only what's in the Q-circle. The agent's credence in P conditional on Q is the fraction of the Q-circle occupied by P-worlds. So it's the area of the P & Q overlap divided by the area of the entire Q-circle, which is cr(P & Q)/cr(Q).
In the scenario in which I roll a fair die, your initial doxastic possibilities include all six outcomes of the die roll. I then ask for your credence that the die comes up 6 conditional on its coming up even—that is, cr(6 | E). To assign this value, you exclude from consideration all the odd outcomes. That doesn't mean you've actually learned that the die outcome is even; I've just asked you to suppose momentarily that it comes up even and assign a confidence to other propositions in light of that supposition. You distribute your credence equally over the outcomes that remain under consideration (2, 4, and 6), so your credence in 6 conditional on even is 1/3. We get the same result from the Ratio Formula:

    cr(6 | E) = cr(6 & E)/cr(E) = (1/6)/(1/2) = 1/3    (3.4)
The Ratio Formula allows us to calculate your conditional credences (confidences under a supposition) in terms of your unconditional credences (confidences relative to no suppositions beyond your background information). Hopefully it's obvious why E gets an unconditional credence of 1/2 in this case; as for 6 & E, that's equivalent to just 6, so it gets an unconditional credence of 1/6.¹

Warning: Mathematicians often take the Ratio Formula to be a definition of conditional probability. From their point of view, a conditional probability has the value it does in virtue of two unconditional probabilities' standing in a certain ratio. But I do not want to reduce the possession of a conditional credence to the possession of two unconditional credences standing in a particular relation. I take a conditional credence to be a genuine mental state (an attitude towards an ordered pair of propositions) capable of being elicited in various ways, such as by asking an agent her confidence in a proposition given a supposition. So I will interpret the Ratio Formula as a rational constraint on how an agent's conditional credences should relate to her unconditional credences. As a normative constraint (rather than a definition), it can be violated—by assigning a conditional credence that doesn't equal the specified ratio.

The point of the previous warning is that the Ratio Formula is a rational constraint, and not all agents meet all the rational constraints on their credences. Yet for an agent who does satisfy the Ratio Formula, there can
be no difference in her conditional credences without a difference in her unconditional credences as well. (We say that a rational agent's conditional credences supervene on her unconditional credences.) Fully specifying an agent's unconditional credence distribution suffices to specify her conditional credences as well.2 For instance, we might specify Andy's and Bob's credence distributions using the following probability table:

    C  H  |  crA   |  crB
    T  T  |  1/4   |  99/200
    T  F  |  1/4   |  1/200
    F  T  |  1/4   |  1/200
    F  F  |  1/4   |  99/200
Here crA represents Andy's credences and crB represents Bob's. Andy's unconditional credence in C is identical to Bob's—the values on the first two rows sum to 1/2 for each of them. Similarly, Andy and Bob have the same unconditional credence in H (the sum of the first and third rows). Yet Andy and Bob disagree in their confidence that the coin will come up heads (H) given that the clairvoyant predicts heads (C). Using the Ratio Formula, we calculate this conditional credence by dividing the value on the first row of the table by the sum of the values on the first two rows. This yields:

    crA(H | C) = (1/4)/(1/2) = 1/2  ≠  99/100 = (99/200)/(100/200) = crB(H | C)    (3.5)
Bob has high confidence in the clairvoyant's abilities. So on the supposition that the clairvoyant predicts heads, Bob is almost certain that the flip comes up heads. Andy, on the other hand, is skeptical, so supposing that the clairvoyant predicts heads leaves his opinions about the flip outcome unchanged.
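The table calculation above is easy to mechanize. Here is a minimal Python sketch (my own illustration, not part of the text; the function and variable names are invented for the example) that recovers Andy's and Bob's conditional credences from the probability table via the Ratio Formula:

    from fractions import Fraction

    # Rows of the probability table above: truth-values for C and H,
    # followed by Andy's credence (crA) and Bob's credence (crB).
    rows = [
        # C      H      crA              crB
        (True,  True,  Fraction(1, 4),  Fraction(99, 200)),
        (True,  False, Fraction(1, 4),  Fraction(1, 200)),
        (False, True,  Fraction(1, 4),  Fraction(1, 200)),
        (False, False, Fraction(1, 4),  Fraction(99, 200)),
    ]

    def conditional(col, antecedent, condition):
        """Ratio Formula: cr(antecedent | condition) = cr(antecedent & condition) / cr(condition)."""
        joint = sum(r[col] for r in rows if antecedent(r) and condition(r))
        return joint / sum(r[col] for r in rows if condition(r))

    C = lambda r: r[0]   # the clairvoyant predicts heads
    H = lambda r: r[1]   # the coin comes up heads

    print(conditional(2, H, C))   # Andy: 1/2
    print(conditional(3, H, C))   # Bob: 99/100

Exact fractions are used so the outputs match the values in Equation (3.5) exactly.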
3.1.2 Consequences of the Ratio Formula
Combining the Ratio Formula with the probability axioms yields further useful probability rules. First we have the

Law of Total Probability: For any proposition P and finite partition Q1, Q2, ..., Qn in L,

    cr(P) = cr(P | Q1) · cr(Q1) + cr(P | Q2) · cr(Q2) + ... + cr(P | Qn) · cr(Qn)
Suppose you’re trying to predict whether I will bike to work tomorrow, but you’re unsure if the weather will rain, hail, or be clear. The Law of Total Probability allows you to systematically work through the possibilities in that partition. You multiply your confidence that it will rain by your confidence that I’ll bike should it rain. Then you multiply your confidence that it’ll hail by your confidenc e in my biking given hail. Finally you multiply your unconditional credence that it’ll be clear by your conditional credence that I’ll bike give n that it’s clear . Adding these thre e products together yields your unconditional credence that I’ll bike. (In the formula the proposition that I’ll bike plays the role of P and the three weather possibilities are Q1 , Q2 , and Q3 .) Next, the Ratio Formula connects conditional credences to Kolmogorov’s axioms in a special way. Conditional credence is a two-place function, taking in an ordered pair of propositions and yielding a real number. Now suppose we designate some particular proposition R as our condition, and look at all of an agent’s credences conditional on the supposition of that proposition. We now have a one-place function (because the second place has been filled by R) which we can think of as a distribution over the propositions L
in Remarkably, the agent’s unconditio nal this credences satisfydistribution the probability. axioms, then ifthe Ratio Formula requires conditional cr R to satisfy those axioms as well. More formally, for any proposition R in L such that cr R 0, the following will all be true:
p¨| q
p qą
• For any proposition P in L, cr(P | R) ≥ 0.
• For any tautology T in L, cr(T | R) = 1.
• For any mutually exclusive propositions P and Q in L, cr(P ∨ Q | R) = cr(P | R) + cr(Q | R).
(You’ll prove these three facts in Exercise 3.3.) Knowing that a conditional credence distribution is a probability distribution can be a handy shortcut. (It also has a significance for updatin g credences that we’ll discuss in Chapter 4.) Because it’s a probability distribution, a conditional credence distribution must satisfy all the consequences of the probability axioms we saw in Section 2.2.1. For example, if I tell you that cr P R 0.7, you can immediately tell that cr P R 0.3, by the following conditional implemen tation of the Negation rule:
p | q“
p„ | q “
p„P | Rq “ 1 ´ crpP | Rq Similarly, by Entailment if P ( Q then cr pP | Rq ď crpQ | Rq. cr
(3.6)
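To illustrate (a sketch of my own, reusing the fair-die example from earlier in this section), one can check these facts numerically for a particular condition, here the proposition E that the die comes up even:

    from fractions import Fraction

    # The fair die: six doxastically possible worlds, each with credence 1/6.
    cr = {w: Fraction(1, 6) for w in range(1, 7)}

    def conditional(prop, cond):
        """cr(prop | cond) via the Ratio Formula; propositions are sets of worlds."""
        return sum(cr[w] for w in prop & cond) / sum(cr[w] for w in cond)

    E = {2, 4, 6}                     # the die comes up even
    six, not_six = {6}, {1, 2, 3, 4, 5}
    tautology = set(range(1, 7))

    assert conditional(six, E) >= 0                              # Non-Negativity
    assert conditional(tautology, E) == 1                        # Normality
    assert conditional(six | not_six, E) == \
           conditional(six, E) + conditional(not_six, E)         # Finite Additivity
    assert conditional(not_six, E) == 1 - conditional(six, E)    # conditional Negation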
One special conditional distribution is worth investigating at this point: What happens when the condition R is a tautology? Imagine I ask you to report your unconditional credences in a bunch of propositions. Then I ask you to assign credences to those same propositions conditional on the further supposition of. . . nothing. I give you nothing more to suppose. Clearly you'll just report back to me the same credences. Bayesians represent vacuous information as a tautology, so this means that a rational agent's credences conditional on a tautology equal her unconditional credences. In other words, for any P in L,

    cr(P | T) = cr(P)    (3.7)

This fact will be important to our theory of updating later on.3

3.1.3 Bayes' Theorem
The most famous consequence of the Ratio Formula and Kolmogorov's axioms is

Bayes' Theorem: For any H and E in L,

    cr(H | E) = cr(E | H) · cr(H) / cr(E)

The first thing to say about Bayes' Theorem is that it is a theorem—it can be proven straightforwardly from the axioms and Ratio Formula. This is worth remembering, because there is a great deal of controversy about how Bayesians apply the theorem. (The significance they attach to this theorem is why Bayesians came to be called "Bayesians".) What philosophical significance could attach to an equation that is, in the end, just a truth of mathematics?

The theorem was first articulated by the Reverend Thomas Bayes in the 1700s.4 Prior to Bayes, much of probability theory was concerned with problems of direct inference. Direct inference starts with the supposition of some probabilistic hypothesis, then asks how likely that hypothesis makes a particular experimental result. You probably learned to solve many direct inference problems in school, such as "Suppose I flip a fair coin 20 times; how likely am I to get exactly 19 heads?" Here the probabilistic hypothesis H says that the coin is fair, while the experimental result E is that 20 flips yield exactly 19 heads. Your credence that the experimental result will occur on the supposition that the hypothesis is true—cr(E | H)—is called the likelihood.5
Yet Bayes was also interested in inverse inference. Instead of making suppositions about hypotheses and determining probabilities of courses of evidence, his theorem allows us to calculate probabilities of hypotheses from suppositions about evidence. Instead of calculating the likelihood cr(E | H), Bayes' Theorem shows us how to calculate cr(H | E). A problem of inverse inference might ask, "Suppose a coin comes up heads on exactly 19 of 20 flips; how probable is it that the coin is fair?"

Assessing the significance of Bayes' Theorem, Hans Reichenbach wrote,

The method of indirect evidence, as this form of inquiry is called, consists of inferences that on closer analysis can be shown to follow the structure of the rule of Bayes. The physician's inferences, leading from the observed symptoms to the diagnosis of a specified disease, are of this type; so are the inferences of the historian determining the historical events that must be assumed for the explanation of recorded observations; and, likewise, the inferences of the detective concluding criminal actions from inconspicuous observable data. . . . Similarly, the general inductive inference from observational data to the validity of a given scientific theory must be regarded as an inference in terms of Bayes' rule. (Reichenbach 1935/1949, pp. 94–5)6

Here's an example of inverse inference: You're a biologist studying a particular species of fish, and you want to know whether the genetic allele coding for blue fins is dominant or recessive. Based on some other work you've done on fish, you're leaning towards recessive—initially you assign a 0.4 credence that the blue-fin allele is dominant. Given some background assumptions we won't worry about here,7 a direct inference from the theory of genetics tells you that if the allele is dominant, roughly 3 out of 4 species members will have blue fins; if the allele is recessive blue fins will appear on roughly 25% of the fish. But you're going to perform an inverse inference, making experimental observations to decide between genetic hypotheses. You will capture fish from the species at random and examine their fins. How significant will your first observation be to your credences in dominant versus recessive? When you contemplate various ways that observation might turn out, how should supposing one outcome or the other affect your credences about the allele? Before we do the calculation, try estimating how confident you should be that the allele is dominant on the supposition that the first fish you observe has blue fins.

In this example our hypothesis H will be that the blue-fin allele is dominant. The evidence E to be supposed is that a randomly-drawn fish has
blue fins. We want to calculate the posterior value cr(H | E). This value is called the "posterior" because it's your credence in the hypothesis H after the evidence E has been supposed. In order to calculate this posterior, Bayes' Theorem requires the values of cr(E | H), cr(H), and cr(E).

cr(E | H) is the likelihood of drawing a blue-finned fish on the hypothesis that the allele is dominant. On the supposition that the allele is dominant, 75% of the fish have blue fins, so your cr(E | H) value should be 0.75. The other two values are known as priors; they are your unconditional credences in the hypothesis and the evidence before anything is supposed. We already said that your prior in the blue-fin dominant hypothesis H is 0.4. So cr(H) is 0.4. But what about the prior in the evidence? How confident are you before observing any fish that the first one you draw will have blue fins? Here we can apply the Law of Total Probability to the partition consisting of H and ~H. This yields:

    cr(E) = cr(E | H) · cr(H) + cr(E | ~H) · cr(~H)    (3.8)

The values on the righthand side are all either likelihoods, or priors related to the hypothesis. These values we can easily calculate. So

    cr(E) = 0.75 · 0.4 + 0.25 · 0.6 = 0.45    (3.9)

Plugging all these values into Bayes' Theorem gives us

    cr(H | E) = cr(E | H) · cr(H) / cr(E) = (0.75 · 0.4)/0.45 = 2/3    (3.10)
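For readers who like to check such calculations mechanically, here is a short Python sketch of Equations (3.8) through (3.10) (my own illustration; the variable names are invented):

    from fractions import Fraction

    cr_H = Fraction(2, 5)             # prior that the blue-fin allele is dominant (0.4)
    cr_E_given_H = Fraction(3, 4)     # likelihood of a blue-finned fish if dominant
    cr_E_given_notH = Fraction(1, 4)  # likelihood of a blue-finned fish if recessive

    # Law of Total Probability, Equations (3.8) and (3.9)
    cr_E = cr_E_given_H * cr_H + cr_E_given_notH * (1 - cr_H)
    print(cr_E)                          # 9/20, i.e. 0.45

    # Bayes' Theorem, Equation (3.10)
    print(cr_E_given_H * cr_H / cr_E)    # 2/3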
Observing a single fish has the potential to change your credences substantially. On the supposition that the fish you draw has blue fins, your credence that the blue-fin allele is dominant goes from its prior value of 2/5 to a posterior of 2/3.

Again, all of this is strictly mathematics from a set of axioms that are rarely disputed. So why has Bayes' Theorem been the focus of controversy? One issue is the role Bayesians see the theorem playing in updating our attitudes over time; we'll return to that application of the theorem in Chapter 4. But the main idea that Bayesians take from Bayes—and that has proven controversial—is that probabilistic inverse inference is the key to induction. Bayesians think the primary way we ought to draw conclusions from data—how we ought to reason about scientific hypotheses, say, on the basis of experimental evidence—is by calculating posterior credences using Bayes' Theorem. This view stands in direct conflict with other statistical methods, such as frequentism and likelihoodism. Advocates of those methods
also have deep concerns about where Bayesians get the priors that Bayes' Theorem requires. Once we've considerably deepened our understanding of Bayesian Epistemology, we will discuss these issues in Chapter 11.

Before moving on, I'd like to highlight two useful alternative forms of Bayes' Theorem. We've just seen that calculating the prior of the evidence—cr(E)—can be easier if we break it up using the Law of Total Probability. Incorporating that maneuver into Bayes' Theorem yields

    cr(H | E) = cr(E | H) · cr(H) / [cr(E | H) · cr(H) + cr(E | ~H) · cr(~H)]    (3.11)

When a particular hypothesis H is under consideration, its negation ~H is known as the catchall hypothesis. So this form of Bayes' Theorem calculates the posterior in the hypothesis from the priors and likelihoods of the hypothesis and its catchall.

In other situations we have multiple hypotheses under consideration instead of just one. Given a finite partition of n hypotheses H1, H2, ..., Hn, the Law of Total Probability transforms the denominator of Bayes' Theorem to yield

    cr(Hi | E) = cr(E | Hi) · cr(Hi) / [cr(E | H1) · cr(H1) + cr(E | H2) · cr(H2) + ... + cr(E | Hn) · cr(Hn)]    (3.12)

This version allows you to calculate the posterior of one particular hypothesis Hi in the partition from the priors and likelihoods of all the hypotheses.
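Equation (3.12) is straightforward to implement. The sketch below (mine, not the author's; the fish example is reused as a two-member partition consisting of H and its catchall) computes every posterior in the partition at once:

    from fractions import Fraction

    def posteriors(priors, likelihoods):
        """Equation (3.12): cr(Hi | E) for each hypothesis Hi in a finite partition.
        priors[i] = cr(Hi); likelihoods[i] = cr(E | Hi)."""
        cr_E = sum(l * p for l, p in zip(likelihoods, priors))  # Law of Total Probability
        return [l * p / cr_E for l, p in zip(likelihoods, priors)]

    # Blue-fin example: H (dominant) and its catchall ~H (recessive).
    print(posteriors(priors=[Fraction(2, 5), Fraction(3, 5)],
                     likelihoods=[Fraction(3, 4), Fraction(1, 4)]))
    # [Fraction(2, 3), Fraction(1, 3)], i.e. posteriors of 2/3 and 1/3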
3.2 Relevance and independence
Andy doesn’t believe in hocus pocus; from his point of view, information about what a clairvoyant predicts is irrelevant to determining how a coin flip will come out. So supposing that a clairvoyant predicts heads makes no difference to Andy’s confidence in a heads outcome. If C says the clairvoyant predicts heads, H says the coin lands heads, and cr A is Andy’s credence distribution, we have
p | q “ 1{2 “ cr pH q
crA H C
A
(3.13)
Generalizing this idea yields a key definition: Proposition P is probabilistically independent of proposition Q relative to distribution cr just in case cr P Q cr P (3.14)
p | q“ p q
In this case Bayesians also say that Q is irrelevant to P. When Q is irrelevant to P, supposing Q leaves an agent's credence in P unchanged.8 Notice that probabilistic independence is always relative to a distribution cr. The very same propositions P and Q might be independent relative to one distribution but dependent relative to another. (Relative to Andy's credences the clairvoyant's prediction is irrelevant to the flip outcome, but relative to the credences of his friend Bob—who believes in psychic powers—it is not.) In what follows I may omit reference to a particular distribution when context makes it clear, but you should keep the relativity of independence to probability distribution in the back of your mind.

While Equation (3.14) will be our official definition of probabilistic independence, there are many equivalent tests for independence. Given the probability axioms and Ratio Formula, the following equations are all true just when Equation (3.14) is:

    cr(P) = cr(P | ~Q)    (3.15)
    cr(P | Q) = cr(P | ~Q)    (3.16)
    cr(Q | P) = cr(Q) = cr(Q | ~P)    (3.17)
    cr(P & Q) = cr(P) · cr(Q)    (3.18)

The equivalence of Equations (3.14) and (3.15) tells us that if supposing Q makes no difference to an agent's confidence in P, then supposing ~Q makes no difference as well. The equivalence of (3.14) and (3.17) shows us that independence is symmetric: if supposing Q makes no difference to an agent's credence in P, supposing P won't change the agent's attitude towards Q either. Finally, Equation (3.18) embodies a useful probability rule:

Multiplication: P and Q are probabilistically independent relative to cr if and only if cr(P & Q) = cr(P) · cr(Q).

(Some authors define probabilistic independence using this biconditional, but we will define independence using Equation (3.14) and then treat Multiplication as a consequence.) We can calculate the probability of a conjunction by multiplying the probabilities of its conjuncts only when those conjuncts are independent. This trick will not work for any arbitrary propositions. The general formula for probability in a conjunction can be derived quickly from the Ratio Formula:

    cr(P & Q) = cr(P | Q) · cr(Q)    (3.19)
When P and Q are probabilistically independent, the first term on the righthand side equals cr(P).

It's important not to get Multiplication and Finite Additivity confused. Finite Additivity says that the credence of a disjunction is the sum of the credences of its mutually exclusive disjuncts. Multiplication says that the credence of a conjunction is the product of the credences of its independent conjuncts. If I flip two fair coins in succession, heads on the first and heads on the second are independent, while heads on the first and tails on the first are mutually exclusive.

Probabilistic independence fails to hold when one proposition is relevant to the other. Replace the "=" signs in Equations (3.14) through (3.18) with ">" signs and you have tests for Q's being positively relevant to P. Once more the tests are equivalent—if any of the resulting inequalities is true, all of them are. Q is positively relevant to P when assuming Q makes you more confident in P. For example, since Bob believes in mysticism he takes the clairvoyant's predictions to be highly relevant to the outcome of the coin flip—supposing that the clairvoyant has predicted heads takes him from equanimity to near-certainty in a heads outcome. Bob assigns

    crB(H | C) = 99/100 > 1/2 = crB(H)    (3.20)
Like independence, positive relevance is symmetric. Given his high confidence in the clairvoyant's accuracy, supposing that the coin came up heads will make Bob highly confident that the clairvoyant predicted it would. Similarly, replacing the "=" signs with "<" signs above yields tests for negative relevance. For Bob, the clairvoyant's predicting heads is negatively relevant to the coin's coming up tails. Like positive correlation, negative correlation is symmetric (supposing a tails outcome makes Bob less confident in a heads prediction). Note also that there are many synonyms in the statistics community for "relevance". Instead of finding "positively/negatively relevant" terminology, you'll sometimes find "positively/negatively dependent", "positively/negatively correlated", or even "correlated/anti-correlated" used.
The strongest forms of positive and negative relevance are entailment and refutation. Suppose a hypothesis H has nonextreme prior credence. If a particular piece of evidence E entails the hypothesis, the probability axioms and Ratio Formula tell us that

    cr(H | E) = 1    (3.21)

Supposing E takes H from a middling credence to the highest credence
allowed. Similarly, if E refutes H (what philosophers of science call falsification), then

    cr(H | E) = 0    (3.22)
Relevance will be most important to us because of its connection to confirmation, the Bayesian notion of evidential support. A piece of evidence confirms a hypothesis only if it's relevant to that hypothesis. Put another way, learning a piece of evidence changes a rational agent's credence in a hypothesis only if that evidence is relevant to the hypothesis. (Much more on all this later.)
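Relevance claims like (3.13) and (3.20) can also be checked mechanically. The following Python sketch (my own; the helper names are invented) runs the independence and Multiplication tests on Andy's and Bob's distributions from Section 3.1:

    from fractions import Fraction

    rows = [(True, True), (True, False), (False, True), (False, False)]  # (C, H)
    cr_A = dict(zip(rows, [Fraction(1, 4)] * 4))
    cr_B = dict(zip(rows, [Fraction(99, 200), Fraction(1, 200),
                           Fraction(1, 200), Fraction(99, 200)]))

    C = lambda w: w[0]
    H = lambda w: w[1]

    def cr(dist, prop):
        return sum(v for w, v in dist.items() if prop(w))

    def cond(dist, prop, given):
        return cr(dist, lambda w: prop(w) and given(w)) / cr(dist, given)

    # Equation (3.14): compare cr(H | C) with cr(H).
    print(cond(cr_A, H, C), cr(cr_A, H))   # 1/2 and 1/2: C is irrelevant to H for Andy
    print(cond(cr_B, H, C), cr(cr_B, H))   # 99/100 and 1/2: C is positively relevant for Bob

    # Multiplication (Equation (3.18)) holds for Andy but fails for Bob.
    print(cr(cr_A, lambda w: C(w) and H(w)) == cr(cr_A, C) * cr(cr_A, H))   # True
    print(cr(cr_B, lambda w: C(w) and H(w)) == cr(cr_B, C) * cr(cr_B, H))   # False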
3.2.1 Conditional independence and screening off
The definition of probabilistic independence compares an agent's conditional credence in a proposition to her unconditional credence in that proposition. But we can also compare conditional credences. When Bob, who believes in the occult, hears a clairvoyant's prediction about the outcome of a fair coin flip, he takes it to be highly correlated with the true flip outcome. But what if we ask Bob to suppose that this particular clairvoyant is an impostor? Once he supposes the clairvoyant is an impostor, Bob may take the clairvoyant's predictions to be completely irrelevant to the flip outcome. Let C be the proposition that the clairvoyant predicts heads, H be the proposition that the coin comes up heads, and I be the proposition that the clairvoyant is an impostor. It's possible for Bob's credences to satisfy both of the following equations at once:

    cr(H | C) > cr(H)    (3.23)
    cr(H | C & I) = cr(H | I)    (3.24)

Inequality (3.23) tells us that unconditionally, Bob takes C to be relevant to H. But conditional on the supposition of I, C becomes independent of H (Equation (3.24)); supposing C & I gives Bob the same confidence in H as supposing I alone. Generalizing this idea yields the following definition of conditional independence: P is probabilistically independent of Q conditional on R just in case

    cr(P | Q & R) = cr(P | R)    (3.25)
If this equality fails to hold, we say that P is relevant to (or dependent on) Q conditional on R.
One more piece of terminology: We will say that R screens off P from Q when P is unconditionally dependent on Q but independent of Q conditional on both R and ~R. That is, R screens off P from Q just in case all three of the following are satisfied:

    cr(P | Q) ≠ cr(P)    (3.26)
    cr(P | Q & R) = cr(P | R)    (3.27)
    cr(P | Q & ~R) = cr(P | ~R)    (3.28)

When these equations are met, P and Q are correlated but supposing either R or ~R makes that correlation disappear.9 Conditional independence and screening off are both best understood through real-world examples. We'll see a number of those in the next few sections.
3.2.2 The Gambler's Fallacy
People often act as if future chancy events will "compensate" for unexpected past results. When a good hitter strikes out many times in a row, someone will say he's "due" for a hit. If a fair coin comes up heads 19 times in a row, many people become more confident that the next outcome will be tails. This mistake is known as the Gambler's Fallacy.10 A person who makes the mistake is thinking along something like the following lines: In twenty flips of a fair coin, it's more probable to get 19 heads and 1 tail than it is to get 20 heads. So having seen 19 heads, it's much more likely that the next flip will come up tails.

This person is providing the right answer to the wrong question. The answer to the question "When a fair coin is flipped 20 times, is 19 heads and 1 tail more likely than 20 heads?" is "yes"—in fact, it's 20 times as likely! But that's the wrong question to ask in this case. Instead of wondering what sorts of outcomes are probable when one flips a fair coin 20 times in general, it's more appropriate to ask of this specific case: given that the coin has already come up heads 19 times, how confident are we that the next flip will be tails? This is a question about our conditional credence

    cr(next flip heads | previous 19 flips heads)    (3.29)

How should we calculate this conditional credence? Ironically, it might be more reasonable to make a mistake in the opposite direction from the Gambler's Fallacy. If I see a coin come up heads 19 times, shouldn't that make me suspect that it's biased towards heads? If anything, shouldn't
supposing 19 consecutive heads make me more confident that the next flip will come up heads than tails? This line of reasoning would be appropriate to the present case if we hadn't stipulated in setting things up that the coin is fair. For a rational agent, the outcome of the 20th flip is independent of the outcomes of the first 19 flips conditional on the fact that the coin is fair. That is,

    cr(next flip heads | previous 19 flips heads & fair coin) = cr(next flip heads | fair coin)    (3.30)
We can justify this equation as follows: Typically, information that a coin came up heads 19 times in a row would alter your opinion about whether it's a fair coin. Changing your opinion about whether it's a fair coin would then affect your prediction for the 20th flip. So typically, information about the first 19 flips alters your credences about the 20th flip by way of your opinion about whether the coin is fair. But if you've already established that the coin is fair, information about the first 19 flips has no further significance for your prediction about the 20th. So conditional on the coin's being fair, the first 19 flips' outcomes are irrelevant to the outcome of the 20th flip.

The lefthand side of Equation (3.30) captures the correct question to ask about the Gambler's Fallacy case. The righthand side is easy to calculate; it's 1/2. So after seeing a coin known to be fair come up heads 19 times, we should be 1/2 confident that the next flip will be heads.11
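Both numerical claims in this passage can be verified with a few lines of Python (a sketch of my own, assuming only that the coin is known to be fair):

    from fractions import Fraction
    from math import comb

    # The "right answer to the wrong question": in 20 flips of a fair coin,
    # 19 heads and 1 tail is 20 times as likely as 20 heads.
    p_19_heads_1_tail = comb(20, 19) * Fraction(1, 2) ** 20
    p_20_heads = Fraction(1, 2) ** 20
    print(p_19_heads_1_tail / p_20_heads)     # 20

    # The right question: cr(20th flip heads | first 19 flips heads & fair coin).
    p_first_19_heads = Fraction(1, 2) ** 19
    p_all_20_heads = Fraction(1, 2) ** 20
    print(p_all_20_heads / p_first_19_heads)  # 1/2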
3.2.3 Probabilities are weird! Simpson's Paradox
Perhaps you’re too much of a probabilistic sophisticate to ever commit the Gambler’s Fallacy. Perhaps you successfully navigated Tversky and Kahneman’s Conjunction Fallacy (Section 2.2.4) as well. But even probabilit y experts sometimes have trouble with the counterintuitive relations that arise between conditional and unconditional dependence. Here’s an example of how odd things can get: In a famous case, the University of California, Berkeley’s graduate departments were investigated for gender bias in admissions. The concern arose because in 1973 about 44% of overall male applicants were admitted to graduate school at Berkeley, while only 35% of female applicants were. Yet when the graduate departments (where admissions decisions are made) were studied one at a time, it turned out that individual departments either were admitting men and women at roughly equal rates, or in some cases were admitting a higher percentage of female applicants.
This is an example of Simpson's Paradox, in which probabilistic dependencies point in one direction conditional on each member of a partition yet point the opposite way unconditionally. A Simpson's Paradox case involves a collection with a number of subgroups. Each of the subgroups displays the same probabilistic correlation between two traits. Yet when we examine the collection as a whole, that correlation is reversed!12

To see how this can happen, consider another example: In 1995, David Justice had a higher batting average than Derek Jeter. In 1996, Justice also had a higher average than Jeter. Yet over that entire two-year span, Jeter's average was better than Justice's.13 Here are the data for the two hitters:

                1995             1996             Combined
    Jeter       12/48    .250    183/582  .314    195/630  .310
    Justice     104/411  .253    45/140   .321    149/551  .270

The first number in each box is the number of hits; the second is the number of at-bats; the third is the batting average (hits divided by at-bats). Looking at the table, you can see how Justice managed to beat Jeter for average in each individual year yet lose to him overall. In 1995 Justice beat Jeter but both batters hit in the mid-.200s; in 1996 Justice beat Jeter while both hitters had a much better year. Jeter's trick was to have fewer at-bats than Justice during the off year and many more at-bats when both hitters were going well. Totaling the two years, many more of Jeter's at-bats produced hits at the over-.300 rate, while the preponderance of Justice's at-bats came while he was toiling in the .200s.14

Scrutiny revealed a similar effect in Berkeley's 1973 admissions data. Bickel, Hammel, and O'Connell (1975) concluded, "The proportion of women applicants tends to be high in departments that are hard to get into and low in those that are easy to get into." Although individual departments were just as willing to admit women as men, female applications were less successful overall because more were directed at departments with low admission rates.

Simpson's Paradox cases can be thought of entirely in terms of numerical proportions, as we've just done with the baseball and admissions examples. But these examples can also be analyzed using conditional probabilities. Suppose, for instance, that you are going to select a Jeter or Justice at-bat at random from the 1,181 at-bats in the combined 1995 and 1996 pool, making your selection so that each of the 1,181 at-bats is equally likely to be selected. How confident should you be that the selected at-bat is a hit? How should
that confidence change if you suppose a Jeter at-bat is selected, or an at-bat from 1995?

Below is a probability table for your credences, assembled from the real-life statistics above. Here E says that it's a Jeter at-bat; 5 says it's from 1995; and H says it's a hit. (Given the pool from which we're sampling, ~E means a Justice at-bat and ~5 means it's from 1996.)

    E  5  H  |  cr
    T  T  T  |  12/1181
    T  T  F  |  36/1181
    T  F  T  |  183/1181
    T  F  F  |  399/1181
    F  T  T  |  104/1181
    F  T  F  |  307/1181
    F  F  T  |  45/1181
    F  F  F  |  95/1181
A bit of calculation with this probability table reveals the following:
    cr(H | E) > cr(H | ~E)    (3.31)
    cr(H | E & 5) < cr(H | ~E & 5)    (3.32)
    cr(H | E & ~5) < cr(H | ~E & ~5)    (3.33)
If you’re selecting an at-bat from the total sample, Jeter is more likely to get a hit than Justice. Put another way, Jeter’s batting is unconditionally positively relevant to an at-bat’s being a hit. But Jeter’s batting is negatively relevant to a hit conditional on each of the two years in the sample. If you’re selecting from only the at-bats associated with a particular year, you’re more likely to get a hit if you go with Justice. 3.2.4
3.2.4 Correlation and causation
You may have heard the expression "correlation is not causation." People typically use this expression to point out that just because two events have both occurred—and maybe occurred in close spatio-temporal proximity—that doesn't mean they had anything to do with each other. But "correlation" is a technical term in probability discussions. The propositions describing two events may both be true, or you might have high credence in both of them, yet they still might not be probabilistically correlated. For the propositions to be correlated, supposing one to be true must increase the
probability of the other. I'm confident that I'm under 6 feet tall and that my eyes are blue, but that doesn't mean I take those facts to be correlated.

Once we've understood probabilistic correlation correctly, does its presence always indicate a causal connection? Perhaps not. If I suppose that the fiftieth Fibonacci number is even, that makes me highly confident that it's the sum of two primes. But being even and being the sum of two primes are not causally related; Goldbach's Conjecture that every even number greater than 2 is the sum of two primes is an arithmetic fact (if it's a fact at all).15

On the other hand, most correlations we encounter in everyday life are due to empirical conditions. When two propositions are correlated due to empirical facts, must the event described by one cause the event described by the other? Hans Reichenbach offered a classic counterexample. He wrote,

Suppose two geysers which are not far apart spout irregularly, but throw up their columns of water always at the same time. The existence of a subterranean connection of the two geysers with a common reservoir of hot water is then practically certain. (1956, p. 158)

If you've noticed that two nearby geysers always spout simultaneously, seeing one spout will increase your confidence that the other is spouting as well. So your credences about the geysers are correlated. But you don't think one geyser's spouting causes the other to spout. Instead, you hypothesize an unobserved reservoir of hot water that is the common cause of both spouts. Reichenbach proposed a famous principle about empirically correlated events:

Principle of the Common Cause: When event outcomes are probabilistically correlated, either one causes the other or they have a common cause.16

In addition to this principle, he offered a key mathematical insight about causation: a common cause screens its effects off from each other.

Let's work through an example of this insight concerning causation and screening off. Suppose the proposition that a particular individual is a drinker is positively relevant to the proposition that she's a smoker. This may be because drinking causes smoking—the kinds of places and social situations in which one drinks may encourage smoking. But there's another possible explanation: being a smoker and being a drinker may both be promoted by an addictive personality, which we can imagine results from a
[Figure 3.2: A causal fork. Addictive personality (P) has causal arrows to smoker (S) and to drinker (D).]
genetic endowment unaffected by one's behavior. In that case, an addictive personality would be a common cause of both being a drinker and being a smoker. (See Figure 3.2; the arrows indicate causal influence.)

Imagine the latter explanation is true, and moreover is the only true explanation of the correlation between drinking and smoking. That is, being a smoker and being a drinker are positively correlated only due to their both being caused by an addictive personality. Given this assumption, let's take a particular subject whose personality you're unsure about, and consider what happens to your credences when you make various suppositions about her. If you begin by supposing that the subject drinks, this will make you more confident that she smokes—but only because it makes you more confident that the subject has an addictive personality. On the other hand, you might start by supposing that the subject has an addictive personality. That will certainly make you more confident that she's a smoker. But once you've made that adjustment, going on to suppose that she's a drinker won't affect your confidence in smoking. Information about drinking affects your smoking opinions only by way of helping you figure out whether she has an addictive personality, and the answer to the personality question was filled in by your initial supposition. Once an addictive personality is supposed, drinking has no further relevance to smoking. (Compare: Once a coin is supposed to be fair, the outcomes of its first 19 flips have no relevance to the outcome of the 20th.) Drinking becomes probabilistically independent of smoking conditional on suppositions about whether the subject has an
[Figure 3.3: A causal chain. Parents' genes (G) has a causal arrow to addictive personality (P), which has a causal arrow to smoker (S).]
addictive personality. That is,
    cr(S | D) > cr(S)    (3.34)
    cr(S | D & P) = cr(S | P)    (3.35)
    cr(S | D & ~P) = cr(S | ~P)    (3.36)
Causal forks (as in Figure 3.2) give rise to screening off. P is a common cause of S and D, so P screens off S from D. But that's not the only way screening off can occur. Consider Figure 3.3. Here we've focused on a different portion of the causal structure. Imagine that the subject's parents' genes causally determine whether she has an addictive personality, which in turn causally promotes smoking. Now her parents' genetics are probabilistically relevant to the subject's smoking, but that correlation is screened off by facts about her personality. Again, if you're uncertain whether the subject's personality is addictive, facts about her parents' genes will affect your opinion of whether she's a smoker. But once you've made a firm supposition about the subject's personality, suppositions about her parents' genetics have no further influence on your smoking opinions. In equation form:
    cr(S | G) > cr(S)    (3.37)
    cr(S | G & P) = cr(S | P)    (3.38)
    cr(S | G & ~P) = cr(S | ~P)    (3.39)
P screens off S from G.17
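To see screening off emerge from a causal chain numerically, here is a Python sketch with made-up numbers (the values 1/2, 4/5, 1/5, 7/10, and 3/10 are my own assumptions, chosen only so that S depends on G solely by way of P); it verifies the analogues of Equations (3.37) through (3.39):

    from fractions import Fraction
    from itertools import product

    cr_G = Fraction(1, 2)                                          # cr(G)
    cr_P_given = {True: Fraction(4, 5), False: Fraction(1, 5)}     # cr(P | G), cr(P | ~G)
    cr_S_given = {True: Fraction(7, 10), False: Fraction(3, 10)}   # cr(S | P), cr(S | ~P)

    # Build the joint distribution for the chain G -> P -> S.
    cr = {}
    for g, p, s in product([True, False], repeat=3):
        prob = cr_G if g else 1 - cr_G
        prob *= cr_P_given[g] if p else 1 - cr_P_given[g]
        prob *= cr_S_given[p] if s else 1 - cr_S_given[p]
        cr[(g, p, s)] = prob

    def cond(prop, given=lambda w: True):
        return sum(v for w, v in cr.items() if prop(w) and given(w)) / \
               sum(v for w, v in cr.items() if given(w))

    G = lambda w: w[0]
    P = lambda w: w[1]
    S = lambda w: w[2]

    print(cond(S, G), cond(S))                                    # 31/50 > 1/2, as in (3.37)
    print(cond(S, lambda w: G(w) and P(w)) == cond(S, P))         # True, as in (3.38)
    print(cond(S, lambda w: G(w) and not P(w)) ==
          cond(S, lambda w: not P(w)))                            # True, as in (3.39)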
Relevance, conditional relevance, and causation can interact in very complex ways.18 My goal here has been to introduce the main ideas and terminology employed in their analysis. The state of the art in this field has come
a long way from Reichenbach; computational tools now available can look at statistical correlations among a large number of variables and hypothesize a causal structure lying beneath them. The resulting causal diagrams are known as Bayes Nets, and have practical applications from satellites to health care to car insurance to college admissions. These causal methods all start from Reichenbach's insight that common causes screen off their effects. And what of his more metaphysically radical Principle of the Common Cause? It remains highly controversial.
3.3 Conditional credences and conditionals
I now want to circle back and get clearer on the nature of conditional credence. First, it's important to note that the conditional credences we've been discussing are indicative, not subjunctive. This distinction is familiar from the theory of conditional propositions. Compare:

If Shakespeare didn't write Hamlet, then someone else did.

If Shakespeare hadn't written Hamlet, then someone else would have.

The former conditional is indicative, while the latter is subjunctive. Typically one evaluates the truth of a conditional by considering possible worlds in which the antecedent is satisfied, then seeing if the consequent is true in those worlds as well. When you evaluate an indicative conditional, you're restricted to considering worlds among your doxastic possibilities.19 Evaluating a subjunctive conditional, on the other hand, permits you to engage in counterfactual reasoning involving worlds you've actually ruled out. So for the subjunctive conditional above, you can consider worlds that make the antecedent true because Hamlet never exists. But for the indicative conditional, you have to take into account that Hamlet does exist, and entertain only worlds in which that's true. So you consider bizarre "author-conspiracy" worlds which, while far-fetched, satisfy the antecedent and are among your current doxastic possibilities. In the end, I'm guessing you take the indicative conditional to be true but the subjunctive to be false.

Now suppose I ask for your credence in the proposition that someone wrote Hamlet, conditional on the supposition that Shakespeare didn't. This value will be high, again because you take Hamlet to exist. In assigning this conditional credence, you aren't bringing into consideration possible worlds you'd otherwise ruled out (such as Hamlet-free worlds). Instead, you're focusing in on the narrow set of author-conspiracy worlds you currently entertain. As we saw in Figure 3.1, assigning a conditional credence strictly
narrows the worlds under consideration; it doesn't expand your attention to worlds previously ruled out. Thus the conditional credences discussed in this book—and typically discussed in the Bayesian literature—are indicative rather than subjunctive.20

Are there more features of conditional propositions that can help us understand conditional credences? Might we understand conditional credences in terms of conditionals? Initiating his study of conditional degrees of belief, F.P. Ramsey warned against assimilating them to conditional propositions:

We are also able to define a very useful new idea—"the degree of belief in p given q". This does not mean the degree of belief in "If p then q", or that in "p entails q", or that which the subject would have in p if he knew q, or that which he ought to have. (1931, p. 82)

Yet many authors failed to heed Ramsey's warning. It's very tempting to equate conditional credences with some simple combination of conditional propositions and unconditional credences. For example, when I ask, "How confident are you in P given Q?", it's easy to hear that as "Given Q, how confident are you in P?" or just "If Q is true, how confident are you in P?" This simple slide might suggest that
cr(P | Q) = r is equivalent to Q → cr(P) = r    (3.40)

Here I'm using the symbol "→" to represent some kind of conditional. For the reasons discussed above, it should be an indicative conditional. But it need not be the material conditional symbolized by "⊃"; many authors think the material conditional's truth-function fails to accurately represent the meaning of natural-language indicative conditionals.

Endorsing the equivalence in (3.40) would require serious changes to the traditional logic of conditionals. We can demonstrate this in two ways. First, we usually take indicative conditionals to satisfy the Disjunctive Syllogism rule. (The material conditional certainly does!) This rule tells us that

X → Z and Y → Z jointly entail (X ∨ Y) → Z    (3.41)

for any propositions X, Y, and Z. Thus for any propositions A, B, and C and constant k we have

A → [cr(C) = k] and B → [cr(C) = k] entail (A ∨ B) → [cr(C) = k]    (3.42)
Combining (3.40) and (3.42) yields

cr(C | A) = k and cr(C | B) = k entail cr(C | A ∨ B) = k    (3.43)
which is false. Not only can one design a credence distribution satisfying the probability axioms and Ratio Formula such that cr(C | A) = k and cr(C | B) = k but cr(C | A ∨ B) ≠ k; one can even describe real-life examples in which it's rational for an agent to assign such a distribution. (See Exercise 3.12.) The failure of (3.43) is another case in which credences confound expectations developed by our experiences with classificatory states.

Second, we usually take indicative conditionals to satisfy Modus Tollens. But consider the following facts about me: Unconditionally, I am highly confident that I will be alive tomorrow. But conditional on the proposition that the sun just exploded, my confidence that I will be alive tomorrow is very low. Given these facts, Modus Tollens, and (3.40), I could run the following argument:
cr(alive tomorrow | sun exploded) is low.    [given]    (3.44)
If the sun exploded, cr(alive tomorrow) is low.    [(3.44), (3.40)]    (3.45)
cr(alive tomorrow) is high.    [given]    (3.46)
The sun did not explode.    [(3.45), (3.46), Modus Tollens]    (3.47)
This argument would allow a neat bit of astronomy by introspection. Yet I take it that's not rational. So I conclude that as long as indicative conditionals satisfy classical logical rules such as Disjunctive Syllogism and Modus Tollens, any analysis of conditional credences in terms of conditionals that uses (3.40) must be false.21

Perhaps we've mangled the transition from conditional credences to conditional propositions. Perhaps we should hear "How confident are you in P given Q?" as "How confident are you in 'P, given Q'?", which is in turn "How confident are you in 'If Q, then P'?" Maybe a conditional credence is a credence in a conditional. Or perhaps more weakly: an agent assigns a particular conditional credence value whenever she unconditionally assigns that value to a conditional. In symbols, the proposal is
cr(P | Q) = r is equivalent to cr(Q → P) = r    (3.48)

for any propositions P and Q, any credence distribution cr, and some indicative conditional →. If true, this equivalence would offer another possibility
for analyzing conditional credences in terms of unconditional credences and conditionals.

We can quickly show that (3.48) fails if "→" is read as the material conditional ⊃. Under the material reading, the proposal entails that

cr(P | Q) = cr(Q ⊃ P)    (3.49)
Using the probability calculus and Ratio Formula, we can show that Equation (3.49) holds only when cr(Q) = 1 or cr(Q ⊃ P) = 1. (See Exercise 3.13.) This is a triviality result: It shows that Equation (3.49) can hold only in trivial cases, namely over the narrow range of conditionals for which the agent is either certain of the antecedent or certain of the conditional itself. Equation (3.49) does not express a truth that holds for all conditional credences in all propositions; nor does (3.48) when "→" is read materially.

Perhaps the equivalence in (3.48) can be saved from this objection by construing its "→" as something other than a material conditional. But Lewis (1976) provided a clever objection that works whichever conditional we choose. Begin by selecting arbitrary propositions P and Q. We then derive the following from the proposal on the table:
cr(Q → P) = cr(P | Q)    [from (3.48)]    (3.50)
cr(Q → P | P) = cr(P | Q & P)    [see below]    (3.51)
cr(Q → P | P) = 1    [Q & P entails P]    (3.52)
cr(Q → P | ~P) = cr(P | Q & ~P)    [see below]    (3.53)
cr(Q → P | ~P) = 0    [Q & ~P refutes P]    (3.54)
cr(Q → P) = cr(Q → P | P) · cr(P) + cr(Q → P | ~P) · cr(~P)    [Law of Tot. Prob.]    (3.55)
cr(Q → P) = 1 · cr(P) + 0 · cr(~P)    [(3.52), (3.54), (3.55)]    (3.56)
cr(Q → P) = cr(P)    (3.57)
cr(P | Q) = cr(P)    [(3.50)]    (3.58)
Some of these lines require explanation. The idea of lines (3.51) and (3.53) is this: We've already seen that a credence distribution conditional on a particular proposition satisfies the probability axioms. This suggests that we should think of a distribution conditional on a proposition as being just like any other credence distribution. (We'll see more reason to think this in Chapter 4, note 3.) So a distribution conditional on a proposition should satisfy the proposal of (3.48) as well. If you conditionally suppose X, then
under that supposition you should assign Y → Z the same credence you would assign Z were you to further suppose Y. In other words,

cr(Y → Z | X) = cr(Z | Y & X)    (3.59)
In line (3.51) the roles of X, Y, and Z are played by P, Q, and P; in line (3.53) it's ~P, Q, and P.

This result of Lewis's is another triviality result. Assuming the probability axioms and Ratio Formula, the proposal in (3.48) can hold only for propositions P and Q such that cr(P | Q) = cr(P). In other words, it can hold only for propositions the agent takes to be independent. Or (taking things from the other end), the proposed equivalence can hold for all the conditionals an agent entertains only if the agent treats every pair of propositions in L as independent!22 So a rational agent's conditional credence will not in general equal her unconditional credence in a conditional.

This is not to say that conditional credences have nothing to do with conditionals. A popular idea now usually called "Adams' Thesis" (Adams 1965) holds that an indicative conditional Q → P is acceptable to a degree equal to cr(P | Q).23 But we cannot maintain that an agent's conditional credence is equal to her credence that some conditional is true.

This brings us back to a proposal I discussed in Chapter 1. One might try to relate degrees of belief to binary beliefs by suggesting that whenever an agent has an r-valued credence, she has a binary belief in a traditional proposition with r as part of its content. Working out this proposal for conditional credences reveals how hopeless it is. Suppose an agent assigns cr(P | Q) = r. Would we suggest that the agent believes that if Q, then the probability of P is r? This proposal mangles the logic of conditional credences. Perhaps the agent believes that the probability of "if Q, then P" is r? Lewis's argument dooms this idea. I said in Chapter 1 that the numerical value of an unconditional degree of belief is an attribute of the attitude taken towards the proposition, not a constituent of that proposition itself. As for conditional credences, cr(P | Q) = r does not say that an agent takes some attitude towards a conditional proposition with a probability value in its consequent. Nor does it say that the agent takes some attitude towards a single, conditional proposition composed of P and Q. cr(P | Q) = r says that the agent takes an r-valued attitude towards an ordered pair of propositions—neither of which need involve the number r.
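Before turning to the exercises, a quick numerical check may make the gap between conditional credences and credences in material conditionals vivid. The sketch below (in Python; the particular numbers and variable names are mine, chosen only for illustration) builds a small credence distribution over the state-descriptions of P and Q and computes both cr(P | Q) and cr(Q ⊃ P); on this distribution the two come apart dramatically, consistent with the Fact stated in Exercise 3.13.

```python
# Illustrative only: compare cr(P | Q) with cr(Q ⊃ P) on one example distribution.
# The numbers below are my own; any probabilistic distribution would do.

# Unconditional credences in the four state-descriptions (they sum to 1).
cr = {
    ("P", "Q"):   0.1,
    ("P", "~Q"):  0.3,
    ("~P", "Q"):  0.4,
    ("~P", "~Q"): 0.2,
}

cr_Q = cr[("P", "Q")] + cr[("~P", "Q")]   # cr(Q) = 0.5
cr_P_given_Q = cr[("P", "Q")] / cr_Q      # Ratio Formula: 0.1 / 0.5 = 0.2

# The material conditional Q ⊃ P is false only on the ~P & Q state-description.
cr_Q_horseshoe_P = 1 - cr[("~P", "Q")]    # 0.6

print(cr_P_given_Q)      # ≈ 0.2
print(cr_Q_horseshoe_P)  # 0.6 — at least as great as cr(P | Q), and here much greater
```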
3.4 Exercises
Unless otherwise noted, you should assume when completing these exercises that the cr-distributions under discussion satisfy the probability axioms and Ratio Formula. You may also assume that whenever a conditional cr-expression occurs, the condition has a nonzero unconditional credence so that the conditional credence is well-defined.

Problem 3.1. Suppose there are 30 people in a room. For each person, you're equally confident that their birthday falls on any of the 365 days in a year. (You're certain none of them was born in a leap year.) Your credences about each person's birthday are independent of your credences about all the other people's birthdays. How confident are you that at least two people in the room share a birthday? (Hint: Imagine the people enter the room one at a time. After n people have entered, what is your credence that no two people in the room so far share a birthday?)

Problem 3.2. One might think that real humans only ever assign credences that are rational numbers—and perhaps only rational numbers involving relatively small whole-number numerators and denominators. But we can write down simple conditions that require an irrational-valued credence function. For example, take these three conditions:
1. cr(Y | X) = cr(X ∨ Y)
2. cr(Y) = cr(~Y)
3. cr(X & Y) = cr(~X & Y)
Show that there is exactly one credence distribution over language L with atomic propositions X and Y that satisfies all three of these conditions, and that that distribution contains irrational-valued credences.∗

Problem 3.3. Prove that credences conditional on a particular proposition form a probability distribution. That is, prove that for any proposition R in L such that cr(R) > 0, the following three conditions hold:

(a) For any proposition P in L, cr(P | R) ≥ 0.

(b) For any tautology T in L, cr(T | R) = 1.

∗ I owe this problem to Branden Fitelson.
(c) For any mutually exclusive propositions P and Q in L, cr(P ∨ Q | R) = cr(P | R) + cr(Q | R).
Problem 3.4. Pink gumballs always make my sister sick. (They remind her of Pepto Bismol.) Blue gumballs make her sick half of the time (they just look unnatural), while white gumballs make her sick only one-tenth of the time. Yesterday, my sister bought a single gumball from a machine that's one-third pink gumballs, one-third blue, and one-third white. Applying the version of Bayes' Theorem in Equation (3.12), how confident should I be that my sister's gumball was pink conditional on the supposition that it made her sick?

Problem 3.5. (a) Prove Bayes' Theorem from the probability axioms and Ratio Formula. (Hint: Start by using the Ratio Formula to write down expressions involving cr(H & E) and cr(E & H).)
(b) Exactly which unconditional credences must we assume to be positive in order for your proof to go through?

(c) Where exactly does your proof rely on the probability axioms (and not just the Ratio Formula)?

Problem 3.6. Once more, consider the probabilistic credence distribution specified by this probability table (from Exercise 2.8):

P   Q   R   cr
T   T   T   0.1
T   T   F   0.2
T   F   T   0
T   F   F   0.3
F   T   T   0.1
F   T   F   0.2
F   F   T   0
F   F   F   0.1
Answer the following questions about this distribution:
(a) What is cr(P | Q)?

(b) Relative to this distribution, is Q positively relevant to P, negatively relevant to P, or probabilistically independent of P?

(c) What is cr(P | R)?
(d) What is cr(P | Q & R)?

(e) Conditional on R, is Q positively relevant to P, negatively relevant to P, or probabilistically independent of P?

(f) Does R screen off P from Q? Explain why or why not.

Problem 3.7. Prove that all the alternative statements of probabilistic independence in Equations (3.15) through (3.18) follow from our original independence definition. That is, prove that each of Equations (3.15) through (3.18) follows from Equation (3.14), the probability axioms, and the Ratio Formula. (Hint: Once you prove that a particular equation follows from Equation (3.14), you may use it in subsequent proofs.)

Problem 3.8. Show that probabilistic independence is not transitive. That is, provide a single probability distribution on which all of the following are true: X is independent of Y, and Y is independent of Z, but X is not independent of Z. Show that your distribution satisfies all three conditions. (For an added challenge, have your distribution assign every state-description a nonzero unconditional credence.)

Problem 3.9. In the text we discussed what makes a pair of propositions probabilistically independent. If we have a larger collection of propositions, what does it take to make them all independent of each other? You might think all that's necessary is pairwise independence—for each pair within the set of propositions to be independent. But pairwise independence doesn't guarantee that each proposition will be independent of combinations of the others. To demonstrate this fact, describe a real-world example (spelling out the propositions represented by X, Y, and Z) in which it would be rational for an agent to assign credences meeting all four of the following conditions:
1. cr(X | Y) = cr(X)
2. cr(X | Z) = cr(X)
3. cr(Y | Z) = cr(Y)
4. cr(X | Y & Z) ≠ cr(X)
Show that your example satisfies all four conditions.
Problem 3.10. Using the program PrSAT referenced in the Further Readings for Chapter 2, find a probability distribution satisfying all the conditions in Exercise 3.9, plus the following additional condition: Every state-description expressible in terms of X, Y, and Z must have a nonzero unconditional cr-value.

Problem 3.11. After laying down probabilistic conditions for a causal fork, Reichenbach demonstrated that a causal fork induces correlation. Consider the following four conditions:
1. cr(A | C) > cr(A | ~C)
2. cr(B | C) > cr(B | ~C)
3. cr(A & B | C) = cr(A | C) · cr(B | C)
4. cr(A & B | ~C) = cr(A | ~C) · cr(B | ~C)
(a) Assuming C is the common cause of A and B , explain what each of the four conditions means in terms of relevance, independence, conditional relevance, or conditional independence.
(b) Prove that if all four conditions hold, then cr(A & B) > cr(A) · cr(B). (This is a tough one!)
Problem 3.12. In Section 3.3 I pointed out that the following statement (labeled Equation (3.43) there) is false:
cr(C | A) = k and cr(C | B) = k entail cr(C | A ∨ B) = k
(a) Describe a real-world example (involving dice, or cards, or something more interesting) in which it's rational for an agent to assign cr(C | A) = k and cr(C | B) = k but cr(C | A ∨ B) ≠ k. Show that your example meets this description.
(b) Prove that if A and B are mutually exclusive, then whenever cr(C | A) = k and cr(C | B) = k it's also the case that cr(C | A ∨ B) = k.
Problem 3.13. Fact: For any propositions P and Q, if cr(Q) > 0 then cr(Q ⊃ P) ≥ cr(P | Q).

(a) Starting from a language L with atomic propositions P and Q, build a probability table on its state-descriptions and use that table to prove the fact above.

(b) Show that Equation (3.49) in Section 3.3 entails that either cr(Q) = 1 or cr(Q ⊃ P) = 1.
3.5 Further reading
Introductions and Overviews
Alan Hájek (2011a). Conditional Probability. In: Philosophy of Statistics. Ed. by Prasanta S. Bandyopadhyay and Malcolm R. Forster. Vol. 7. Handbook of the Philosophy of Science. Amsterdam: Elsevier, pp. 99–136

Describes the Ratio Formula and its motivations. Then works through a number of philosophical applications of conditional probability, and a number of objections to the Ratio Formula. Also discusses conditional-probability-first formalizations (as described in note 3 of this chapter).

Todd A. Stephenson (2000). An Introduction to Bayesian Network Theory and Usage. Tech. rep. 03. IDIAP

Section 1 provides a nice, concise overview of what a Bayes Net is and how it interacts with conditional probabilities. (Note that the author uses A, B to express the conjunction of A and B.) Things get fairly technical after that as he covers algorithms for creating and using Bayes Nets. Sections 6 and 7, though, contain real-life examples of Bayes Nets for speech recognition, Microsoft Windows troubleshooting, and medical diagnosis.

Christopher R. Hitchcock (2012). Probabilistic Causation. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2012

While this entry is primarily about analyses of the concept of causation using probability theory, along the way Hitchcock includes impressive coverage of the Principle of the Common Cause, Simpson's Paradox, causal modeling with Bayes Nets, and related material.

Classic Texts
Hans Reichenbach (1956). The Principle of Common Cause. In: The Direction of Time. University of California Press, pp. 157–160

Text in which Reichenbach introduces his account of common causes in terms of screening off. (Note that Reichenbach uses a period to express conjunction, and a comma rather than a vertical bar for conditional probabilities—what we would write as cr(A | B) he writes as P(B, A).)
David Lewis (1976). Probabilities of Conditionals and Conditional Probabilities. The Philosophical Review 85, pp. 297–315

Article in which Lewis presents his triviality argument concerning probabilities of conditionals.

Extended Discussion
Bas C. van Fraassen (1982). Rational Belief and the Common Cause Principle. In: What? Where? When? Why? Ed. by Robert McLaughlin. Dordrecht: Reidel, pp. 193–209

Frank Arntzenius (1993). The Common Cause Principle. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 2, pp. 227–237

Discuss the meaning and significance of Reichenbach's Principle of the Common Cause, then present possible counterexamples (including counterexamples from quantum mechanics).

Alan Hájek and Ned Hall (1994). The Hypothesis of the Conditional Construal of Conditional Probability. In: Probability and Conditionals: Belief Revision and Rational Decision. Ed. by Ellery Eells and Brian Skyrms. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge University Press, pp. 75–112

Hájek and Hall extensively assess views about conditional credences and credences in conditionals in light of Lewis's and other triviality results.
Notes

1 Here's a good way to double-check that 6 & E is equivalent to 6: Remember that equivalence is mutual entailment. Clearly 6 & E entails 6. Going in the other direction, 6 entails 6, but 6 also entails E. So 6 entails 6 & E. When evaluating conditional credences using the Ratio Formula we'll often find ourselves simplifying a conjunction down to just one or two of its conjuncts. To make this work, the conjunct that remains has to entail each of the conjuncts that was removed.

2 When I refer to an agent's "credence distribution" going forward, I will often be referring to both her unconditional and conditional credences. Strictly speaking this extends our definition of a "distribution", but since conditional credences supervene on the unconditional for rational agents, not much damage will be done.
3 Some authors also take advantage of Equation (3.7) to formalize probability theory in exactly the opposite order from what I've presented here. They begin by introducing conditional credences and subject them to a number of constraints somewhat like Kolmogorov's axioms. The desired rules for unconditional credences are then obtained by introducing the single constraint that cr(P) = cr(P | T). For more on this approach, its advocates, and its motivations, see Section 5.4.

4 Bayes never published the theorem; Richard Price found it in Bayes' notes and published it after Bayes' death in 1761. Pierre-Simon Laplace independently rediscovered the theorem later on and was responsible for much of its early popularization.

5 In everyday English "likely" is a synonym for "probable". Yet R.A. Fisher introduced the technical term "likelihood" to represent a particular kind of probability—the probability of some evidence given a hypothesis. This somewhat peculiar terminology has stuck.

6 Quoted in (Galavotti 2005, p. 51).

7 For instance, we have to assume that the base rates of the alleles are equal in the population, none of the relevant phenotypes is fitter than any of the others, and the blue-finned fish don't assortatively mate with other blue-finned fish. (Thanks to Hayley Clatterbuck for discussion.)

8 Throughout this section and Section 3.2.1, I will assume that any proposition appearing in the condition of a conditional cr-expression has a nonzero cr-value. Defining probabilistic independence for propositions with probability 0 can get complicated. (See e.g. (Fitelson and Hájek ta).)

9 One will sometimes see "screening off" defined without Equation (3.28) or its analogue. (That is, some authors define screening off in terms of R's making the correlation between P and Q disappear, without worrying whether ~R has the same effect.) Equation (3.28) makes an important difference to our definition: in the Bob example I does not screen off H from C according to our definition, because when ~I is supposed, C becomes very relevant to H. I have included Equation (3.28) in our definition because it connects our approach to the more general notion of screening off used in the statistics community. In statistics one often works with continuous random variables, and the idea is that random variable X screens off Y from Z if Y and Z are independent conditional on each possible value of X. Understanding proposition R as a dichotomous random variable (Chapter 2, note 7) takes this general definition of screening off and yields the definition I've given in the text. Many authors also leave Equation (3.26) (or its analogue) implicit in their definitions of "screening off". Conditional independence is only interesting when the propositions in question are unconditionally correlated, so I have made this condition explicit in my definition of "screening off". (I suppose if one wanted, one could alter my definition so that unconditionally-independent P and Q would count as trivially screened off by anything.)

10 Not to be confused with the Rambler's Fallacy: I've said so many false things in a row, the next one must be true!

11 20 flips of a fair coin provide a good example of what statisticians call IID trials. "IID" stands for "independent, identically distributed." Each of the coin flips is probabilistically independent of all the others; information about the outcomes of other coin flips doesn't change the probability that a particular flip will come up heads. The flips are identically distributed because each has the same probability of producing a heads outcome. Anyone who goes in for the Gambler's Fallacy and thinks that future flips will make up for past outcomes is committed to the existence of some mechanism by which future
flips can respond to what happened in the past. Understanding that no such mechanism exists leads one to treat repeated flips of the same coin as IID.

12 This paradoxical phenomenon is named after E.H. Simpson because of a number of striking examples he gave in his (1951). Yet the phenomenon had been known to statisticians as early as (Pearson, Lee, and Bramley-Moore 1899) and (Yule 1903).

13 I learned about the Jeter/Justice example from the Wikipedia page on Simpson's Paradox. (The batting data for the two hitters is widely available.) The UC Berkeley example was brought to the attention of philosophers by (Cartwright 1979).

14 An analogy: Suppose we each have some gold bars and some silver bars. Each gold bar you're holding is heavier (and therefore more valuable) than each of my gold bars. Each silver bar you're holding is heavier (and more valuable) than each of my silver bars. Then how could I possibly be richer than you? If I have many more gold bars than you.

15 You may be concerned that arithmetic facts are true in every possible world, and so cannot rationally receive nonextreme credences, and so cannot be probabilistically correlated. We'll come back to that concern in Chapter XXX.

16 I'm playing a bit fast and loose with the objects of discussion here. Throughout this chapter we're considering correlations in an agent's credence distribution. Reichenbach was concerned not with probabilistic correlations in an agent's credences but instead with correlations in objective frequency or chance distributions (more about which in Chapter 5). But presumably if the Principle of the Common Cause holds for objective probability distributions, that provides an agent who views particular propositions as empirically correlated some reason to suppose that the events described in those propositions either stand as cause to effect or share a common cause.

17 You might worry that Figure 3.3 presents a counterexample to Reichenbach's Principle of the Common Cause, because G and S are unconditionally correlated yet G doesn't cause S and they have no common cause. It's important to the principle that the causal relations need not be direct; for Reichenbach's purposes G counts as a cause of S even though it's not the immediate cause of S.

18 Just to indicate a few more complexities that may arise: While our discussion in the text concerns "direct" common causes, one can have an "indirect" common cause that doesn't screen off its effects from each other. For example, if we imagine merging Figures 3.2 and 3.3 to show how the subject's parents' genes are a common cause of both smoking and drinking by way of her addictive personality, it is possible to arrange the numbers so that her parents' genetics don't screen off smoker from drinker. Even more complications arise if some causal arrows do end-arounds past others—what if in addition to the causal structure just described, the parents' genetics tend to make them smokers, which in turn directly influences the subject's smoking behavior?

19 Here I assume that a rational agent will entertain an indicative conditional only if she takes its antecedent to be possible. For arguments in favor of this position, and citations to the relevant literature, see (Moss ms, Sect. 4.3) and (Titelbaum 2013, Sect. 5.3.2). The analogous assumption for conditional credences is that an agent assigns a conditional credence only when the condition is consistent with her background information.

20 One could study a kind of attitude different from the conditional credences considered in this book—something like a subjunctive degree of belief. Joyce (1999) does exactly that, but is careful to distinguish his analysandum from standard conditional degrees of belief. Note also that arguments for applying the Ratio Formula to standard conditional degrees of belief do not cover its application to Jamesian subjunctive credences.
21 A variety of recent positions in linguistics and the philosophy of language suggest that indicative conditionals with modal expressions in their consequents do not obey
classical inference rules. Yalcin (2012), among others, classes probability locutions with these modals and so argues that, inter alia, indicative conditionals with probabilistic consequents do not keep Modus Tollens truth-preserving. (His argument could easily be extended to Disjunctive Syllogism as well.) Yet the alternative positive theory of indicative conditionals Yalcin offers does not analyze conditional credences in terms of conditionals either, so even if he's correct we would still need an independent understanding of what conditional credences are. (Thanks to Fabrizio Cariani for discussion of these points.)

22 Fitelson (2015) proves a triviality result like Lewis's using probability tables (instead of proceeding axiomatically). Moreover, he traces that triviality specifically to the combination of (3.48) with the assumption that the conditional → satisfies the "import-export" condition.

23 Interestingly, this idea is often traced back to a suggestion in Ramsey, known as "Ramsey's test". (Ramsey 1929/1990, p. 155n)
Chapter 4
Updating by Conditionalization

Up to this point we have discussed synchronic credence constraints—rationally required relations among the degrees of belief an agent assigns at a given time. This chapter introduces the fifth (and final) core normative Bayesian rule, Conditionalization. Conditionalization is a diachronic rule, requiring an agent's degrees of belief to line up in particular ways across times.

I begin by laying out the rule and some of its immediate consequences. We will then practice applying Conditionalization using Bayes' Theorem. Some of Conditionalization's consequences will prompt us to ask what notions of learning and evidence pair most naturally with the rule. I will also explain why it's important to attend to an agent's total evidence in evaluating her responses to learning. Finally, we will see how Conditionalization helps Bayesians distinguish two influences on an agent's opinions: the content of her evidence, and her tendencies to respond to evidence in particular ways. This will lead to Chapter 5's discussion of how many distinct responses to the same evidence could be rationally permissible. Differing answers to that question provide a crucial distinction between Subjective and Objective Bayesianism.
4.1 Conditionalization
Suppose I tell you I just rolled a fair 6-sided die, and give you no further information about how the roll came out. Presumably you assign equal unconditional credence to each of the 6 possible outcomes, so your credence that the die came up 6 will be 1/6. I then ask you to suppose that the roll
came up even (while being very clear that this is just a supposition—I'm still not revealing anything about the outcome). Applying the Ratio Formula to your unconditional distribution, we find that rationality requires your credence in 6 conditional on the supposition of even to be 1/3. Finally, I
break down and tell you that the roll actually did come up even. Now how confident should you be that it came up 6? I hope the obvious answer is 1/3. When you learn that the die actually came up even, the effect on your confidence in a 6 is identical to the effect of merely supposing evenness. This relationship between learning and supposing is captured in Bayesians' credence-updating rule:
Conditionalization: For any time ti and later time tj, if proposition E in L represents everything the agent learns between ti and tj and cri(E) > 0, then for any H in L,

crj(H) = cri(H | E)
where cri and crj are the agent's credence distributions at the two times. Conditionalization captures the idea that an agent's credence in H at tj—after learning E—should equal her earlier credence in H had she merely supposed E. If we label the two times in the die-roll case t1 and t2, Conditionalization tells us that
cr2(6) = cr1(6 | E)    (4.1)
which equals 1/3 (given your unconditional distribution at t1).

Warning: Some theorists take Conditionalization to define conditional credence. For them, to assign the conditional credence cri(H | E) = r just is to be disposed to assign crj(H) = r should you learn E. As I said in Chapter 3, I take conditional credence to be a genuine mental state, manifested by the agent in various ways at ti (what she'll say in conversation, what sorts of bets she'll accept, etc.) beyond just her dispositions to update. For us, Conditionalization represents a normative constraint relating the agent's unconditional credences at a later time to her conditional credences earlier on.
Combining Conditionalization with the Ratio Formula gives us
crj(H) = cri(H | E) = cri(H & E) / cri(E)    (4.2)
Figure 4.1: Updating H on E
(when cri(E) > 0). A Venn diagram shows why dividing these particular ti credences should yield the agent's credence in H at tj. In Chapter 3 we used a diagram like Figure 4.1 to understand conditional credences. There the white circle represented a set of possibilities to which the agent had temporarily narrowed her focus in order to entertain a supposition. Now let's imagine the rectangle represents all the possible worlds the agent entertains at ti (her doxastically possible worlds). The size of the H circle represents the agent's unconditional ti credence in H. Between ti and tj the agent learns that E is true. Among the worlds she had entertained before, the agent now excludes all the non-E worlds. Her set of doxastic possibilities narrows down to the E-circle; in effect, the E-circle becomes the agent's new rectangle. How unconditionally confident is the agent in H now? That depends what fraction of her new doxastic space is occupied by H-worlds. And this is what Equation (4.2) calculates: it tells you what fraction of the E-circle is occupied by H & E worlds.

As stated, the Conditionalization rule is useful for calculating a single unconditional credence value after an agent has gained evidence. But what if you want to generate the agent's entire tj credence distribution at once? We saw in Chapter 2 that a rational agent's entire ti credence distribution can be specified by a probability table that gives the agent's unconditional credence in each state-description of L. To satisfy the probability axioms, the credence values in a probability table must be non-negative and sum to 1. The agent's unconditional credence in any (non-contradictory) proposition can then be determined by summing her credences in the state-descriptions on which that proposition is true.

When an agent updates her credence distribution by applying Condition-
alization to some learned proposition E, we say that she "conditionalizes on E". To calculate the probability table values resulting from such an update, we apply a two-step process:

1. Give credence 0 to all state-descriptions inconsistent with the evidence learned.

2. Multiply each remaining nonzero credence by the same constant so that they all sum to 1.

As an example, let's consider what happens to your confidence that the fair die roll came up prime1 when you learn that it came up even.
P   E   cr1   cr2
T   T   1/6   1/3
T   F   1/3   0
F   T   1/3   2/3
F   F   1/6   0

Here we've used a language L with atomic propositions P and E repre-
senting "prime" and "even". The cr1 column represents your unconditional credences at time t1, while the cr2 column represents your t2 credences. Between t1 and t2 you learn that the die came up even. That's inconsistent with the second and fourth state-descriptions, so in the first step of our update process their cr2-values go to 0. The cr1-values of the first and third state-descriptions (1/6 and 1/3 respectively) add up to only 1/2. So we multiply both of these values by 2 to obtain unconditional t2-credences summing to 1.2 In this manner, we generate your unconditional state-description credences at t2 from your state-description credences at t1. We can then calculate cr2-values for other propositions. For instance, adding up the cr2-values on the lines that make P true, we find that
cr2(P) = 1/3    (4.3)
Given your initial distribution, your credence that the die came up prime after learning that it came up even is required to be 1/3. Hopefully that squares with your intuitions about rational requirements in this case!

One final note: Our two-step process for updating probability tables yields a handy fact. Notice that in the second step of the process, every state-description that hasn't been set to zero is multiplied by the same constant. When two values are multiplied by the same constant, the ratio between them remains intact. This means that if two state-descriptions have nonzero credence values after an update by Conditionalization, those values will stand in the same ratio as they did before the update. This fact will prove useful for problem-solving later on. (Notice that it applies only to state-descriptions; propositions that are not state-descriptions may not maintain their credence ratios after a conditionalization.)
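For readers who like to check these updates mechanically, here is a minimal sketch of the two-step procedure in Python (the function and variable names are my own, not notation from the text). It zeroes out the state-descriptions inconsistent with the evidence, rescales the survivors by a common constant, and reproduces the die-roll table above.

```python
# A sketch of updating a probability table by Conditionalization:
# step 1 zeroes out state-descriptions inconsistent with the evidence;
# step 2 rescales the surviving credences so they again sum to 1.
# Names are my own illustration, not anything official.

def conditionalize(table, consistent_with_evidence):
    """table: dict from state-descriptions to credences summing to 1.
    consistent_with_evidence: returns True for state-descriptions consistent with E."""
    surviving = {s: c for s, c in table.items() if consistent_with_evidence(s)}
    total = sum(surviving.values())        # this is cr_i(E); it must be positive
    return {s: surviving.get(s, 0.0) / total for s in table}

# The die-roll example, with state-descriptions over P ("prime") and E ("even").
cr1 = {("P", "E"): 1/6, ("P", "~E"): 1/3, ("~P", "E"): 1/3, ("~P", "~E"): 1/6}
cr2 = conditionalize(cr1, lambda s: s[1] == "E")    # learn that the roll came up even

print(cr2[("P", "E")], cr2[("~P", "E")])            # 1/3 and 2/3; the other two lines are 0
print(cr2[("P", "E")] + cr2[("P", "~E")])           # cr2(P) = 1/3, matching Equation (4.3)
```

Note how the two nonzero survivors (originally 1/6 and 1/3) keep their 1 : 2 ratio after the update, just as the handy fact above says they must.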
4.1.1 Consequences of Conditionalization
If we adopt Conditionalization as our updating norm, what follows? When an agent updates by conditionalizing on E, her new credence distribution is just her earlier distribution conditional on E. In Section 3.1.2 we saw that if an agent's credence distribution obeys the probability axioms and Ratio Formula, then the distribution she assigns conditional on any particular proposition (in which she has nonzero credence) will be probabilistic as well. This yields the important result that if an agent starts off obeying the probability axioms and Ratio Formula and then updates by Conditionalization, her resulting credence distribution will satisfy the probability axioms as well.3

The process may then iterate. Having conditionalized her probabilistic cr1 distribution on some evidence E to obtain probabilistic credence distribution cr2, the agent may then gain further evidence E′, which she conditionalizes upon to obtain cr3 (and so on). Moreover, Conditionalization has the elegant mathematical property of being cumulative: Instead of obtaining cr3 from cr1 in two steps—first conditionalizing cr1 on E to obtain cr2, then conditionalizing cr2 on E′ to obtain cr3—we can generate the same cr3 distribution by conditionalizing cr1 on E & E′, a conjunction representing all the propositions learned between t1 and t3. (You'll prove this in Exercise 4.3.) Because Conditionalization is cumulative it is also commutative: Conditionalizing first on E and then E′ has the same effect as conditionalizing in the opposite order.

Besides being mathematically elegant, cumulativity and commutativity are intuitively plausible features of a learning process. Suppose a detective investigating a crime learns that the victim was a Caucasian male, and updates her credences accordingly. Intuitively, it shouldn't matter if we describe this episode as the detective's learning first one piece of evidence and then another (first that the victim was a Caucasian, and then that he was male) or as the detective's learning a single conjunction containing both. Because Conditionalization is cumulative, it will prescribe the same ultimate credences for the detective on either construal. Similarly, it shouldn't matter
whether we take her to have learned that the victim was a Caucasian male or a male Caucasian. Because Conditionalization is commutative, the order in which pieces of evidence are presented makes no difference to an agent's ultimate credences.4

When an agent conditionalizes on evidence E, what happens to her unconditional credence in that very evidence? Substituting E for H in Equation (4.2) (and recalling that E & E is equivalent to E), we can see that if an agent learns E between ti and tj then
crj(E) = 1    (4.4)
Conditionalization creates certainties; conditionalizing on a piece of evidence makes an agent certain of that evidence. Moreover, any proposition entailed by that evidence must receive at least as high a credence as the evidence (by our Entailment rule). So an agent who conditionalizes also becomes certain of any proposition entailed by the evidence she learns.

And Conditionalization doesn't just create certainties; it also maintains them. If an agent is certain of a proposition at ti and updates by Conditionalization, she will remain certain of that proposition at tj. That is, if cri(H) = 1 then Conditionalization yields crj(H) = 1 as well. On a probability table, this means that once a state-description receives credence 0 at a particular time (the agent has ruled out that possible state of the world), it will receive credence 0 at all subsequent times as well. In Exercise 4.2 you'll prove that Conditionalization retains certainties from the probability axioms and Ratio Formula. But it's easy to see why this occurs on a Venn diagram. You're certain of H at ti when H is true in every world you consider a live doxastic possibility. Conditionalizing on E strictly narrows the set of possible worlds you entertain. So if H was true in every world you entertained before conditionalizing, it'll be true in every world you entertain afterwards as well.

Combining these consequences of Conditionalization yields a somewhat counterintuitive result, to which we'll return in later discussions. Conditionalizing on E between two times makes that proposition (and any proposition it entails) certain. Future updates by Conditionalization will then retain that certainty. So if an agent updates by conditionalizing throughout her life, any piece of evidence she learns at any point will remain certain for her ever after.

What if an agent doesn't learn anything between two times? Bayesians represent an empty evidence set as a tautology. So when an agent gains no
information between ti and tj, Conditionalization yields

crj(H) = cri(H | T) = cri(H)    (4.5)
for any H in L. (The latter half of this equation comes from Equation (3.7), in which we showed that credences conditional on a tautology equal unconditional credences.) If an agent learns nothing between two times and updates by Conditionalization, her degrees of confidence will remain unchanged.
4.1.2 Probabilities are weird! The Base Rate Fallacy
Bayes' Theorem expresses a purely synchronic relation; as we saw in Section 3.1.3, for any time ti it calculates cri(H | E) in terms of other credences assigned at that time. Nevertheless, our diachronic Conditionalization rule gives Bayes' Theorem added significance. Conditionalization says that your unconditional tj credence in hypothesis H after learning E should equal cri(H | E). Bayes' Theorem is a tool for calculating this crucial value (your "posterior" at ti) from other credences you assign at ti. As new evidence comes in over time and we repeatedly update by conditionalizing, Bayes' Theorem can be a handy tool for generating new credences from old.

For example, we could've used Bayes' Theorem to answer our earlier question of what happens to your credence in 6 when you learn that a fair die roll has come up even. The hypothesis is 6, and the evidence is E (for even). By Conditionalization and then Bayes' Theorem,
cr2(6) = cr1(6 | E) = cr1(E | 6) · cr1(6) / cr1(E)    (4.6)
cr1(6), your prior credence in a 6, is 1/6, and cr1(E), your prior credence in E, is 1/2. The likelihood of E, cr1(E | 6), is easy—it's 1. So the numerator is 1/6, the denominator is 1/2, and the posterior cr2(6) = 1/3 as we saw before.5
Let's apply Bayes' Theorem to a more interesting case:

1 in 1,000 people have a particular disease. You have a test for the presence of the disease that is 90% accurate, in the following sense: If you apply the test to a subject who has the disease it will yield a positive result 90% of the time, and if you apply the test to a subject who lacks the disease it will yield a negative result 90% of the time.
You randomly select a person and apply the test. The test yields a positive result. How confident should you be that this subject actually has the disease?

Most people—including trained medical professionals!—answer this question with a value around 80% or 90%. But if you set your credences by the statistics given in the problem, the rationally-required degree of confidence that the subject has the disease is less than 1%.

We'll use Bayes' Theorem to work that out. Let D be the proposition that the subject has the disease and P the proposition that when applied to the subject, the test yields a positive result. Here D is our hypothesis, and P is the evidence acquired between t1 and t2. At t1 (before applying the test) we take the subject to be representative of the population, giving us priors for the hypothesis and the catchall:
cr1(D) = 0.001        cr1(~D) = 0.999
The accuracy profile of the test gives us likelihoods for the hypothesis and catchall:
cr1(P | D) = 0.9        cr1(P | ~D) = 0.1

In words, you're 90% confident that the test will yield a positive result given that the subject has the disease, and 10% confident that we'll get a "false positive" on the supposition that the subject lacks the disease. Now we'll apply a version of Bayes' Theorem from Section 3.1.3, in which the Law of Total Probability has been used to expand the denominator:

cr2(D) = cr1(P | D) · cr1(D) / [cr1(P | D) · cr1(D) + cr1(P | ~D) · cr1(~D)]
       = (0.9 · 0.001) / (0.9 · 0.001 + 0.1 · 0.999)
       ≈ 0.009 = 0.9%    (4.7)
So there's the calculation. After learning of the positive test result, your credence that the subject has the disease should be a little bit less than 1%. But even having seen this calculation, most people find it hard to believe. Why shouldn't we be more confident that the subject has the disease? Wasn't the test 90% accurate?

Tversky and Kahneman (1974) suggested that in cases like this one, people's intuitive responses ignore the "base rate" of a phenomenon. The base rate in our example is the prior credence that the subject has the disease. In this case, that base rate is extremely low. But people tend to
forget about that fact and be overwhelmed by accuracy statistics (such as likelihoods) concerning the test. This is known as the Base Rate Fallacy.

Why is the base rate so important? To illustrate, let's suppose you applied this test to 10,000 people. Using the base rate statistics, we would expect about 10 of those people to have the disease. Since the test gives a positive result for 90% of people who have the disease, we would expect these 10 diseased people to yield about 9 positive results—so-called "true positives". Then there would be about 9,990 people lacking the disease. Since cri(P | ~D)—the false positive rate—is 10%, we'd expect to get about 999 false positive results. Out of 1,008 positive results the test would yield, only 9 of those subjects (or about 0.9%) would actually have the disease. This particular disease is so rare—its base rate is so tiny—that even with an accurate test we should expect the false positives to swamp the true positives. So when a single individual takes the test and gets a positive result, we should be much more confident that this is a false positive than a true one.

Another way to see what's going on is to consider the Bayes factor of the evidence you receive in this case. Using Conditionalization and the Ratio Formula, we can derive
crj(H) / crj(~H) = cri(H | E) / cri(~H | E) = [cri(H) / cri(~H)] · [cri(E | H) / cri(E | ~H)]    (4.8)
That last fraction on the right—the ratio of the likelihood of the hypothesis to the likelihood of the catchall—is the Bayes factor. Personally, I found this equation fairly impenetrable until I remembered that cr(H)/cr(~H) is an agent's odds for the proposition H (Section 2.3.4). That means we can rewrite Equation (4.8) as
odds for H after update = odds for H before update · Bayes factor    (4.9)
If you update by Conditionalization, learning E multiplies your odds for H by the Bayes factor. The Bayes factor thus provides a handy way of
measuring how much learning changes your opinions about the hypothesis. In our disease example, the Bayes factor is
cr1(P | D) / cr1(P | ~D) = 0.9 / 0.1 = 9    (4.10)
At t1, your odds for D are 1 : 999. Applying the test has a substantial influence on these odds; as the Bayes factor reveals, a positive test result
multiplies the odds by 9. This reflects the high accuracy of the test. Yet since the odds were so small initially, multiplying them by 9 only brings them up to 9 : 999. So even after seeing the test outcome, you should be much more confident that the subject doesn't have the disease than you are that she does.6
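If the arithmetic in this section went by quickly, the following sketch reruns the disease-test numbers two ways—once via the expanded Bayes' Theorem of Equation (4.7), and once via the odds-times-Bayes-factor form of Equation (4.9). (The variable names are my own; the statistics are those given in the example.)

```python
# Base Rate Fallacy example: compute the posterior two equivalent ways.

prior_D = 0.001                  # cr1(D): the base rate
prior_notD = 1 - prior_D         # cr1(~D)
lik_P_given_D = 0.9              # cr1(P | D): true positive rate
lik_P_given_notD = 0.1           # cr1(P | ~D): false positive rate

# Equation (4.7): Bayes' Theorem with the Law of Total Probability in the denominator.
posterior_D = (lik_P_given_D * prior_D) / (
    lik_P_given_D * prior_D + lik_P_given_notD * prior_notD)
print(posterior_D)               # about 0.0089 — a bit less than 1%

# Equations (4.8)-(4.10): multiply the prior odds by the Bayes factor.
prior_odds = prior_D / prior_notD                  # 1 : 999
bayes_factor = lik_P_given_D / lik_P_given_notD    # 9
posterior_odds = prior_odds * bayes_factor         # 9 : 999
print(posterior_odds / (1 + posterior_odds))       # same ~0.0089, recovered from the odds
```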
4.2 Evidence and Certainty
Combining Conditionalization with the probability axioms and Ratio Formula creates a Bayesian approach to evidence that many have found troubling. Conditionalization works with a proposition E representing everything the agent learns between two times. (If many propositions are learned, E is their conjunction.) We also speak of E as the evidence the agent gains between those two times. Yet Conditionalization gives E properties that epistemologists don't typically attribute to evidence.

We've already seen that a piece of evidence E (along with anything it entails) becomes certain once conditionalized upon. When an agent learns E, the set of doxastically possible worlds she entertains shrinks to a set of worlds all of which make E true; on the Venn diagram, what once was merely an E-circle within her rectangle of worlds now becomes the entire rectangle. And as we saw in Section 4.1.1, this change is permanent: as long as the agent keeps updating by Conditionalization, any evidence she once learned remains certain and possible worlds inconsistent with it continue to be ruled out.

What realistic conception of evidence—and of learning—meets these requirements? When I learn that my sister is coming over for Thanksgiving dinner, I become highly confident in that proposition. But do I become 100% certain? Do I rule out all possible worlds in which she doesn't show, refusing to consider them ever after? As Richard C. Jeffrey put it,

Certainty is quite demanding. It rules out not only the far-fetched uncertainties associated with philosophical skepticism, but also the familiar uncertainties that affect real empirical inquiry in science and everyday life. (2004, p. 53)

This concern about certainties motivates the

Regularity Principle: In a rational credence distribution, no logically contingent proposition receives unconditional credence 0.

The Regularity Principle captures the common-sense idea that one's evidence is never so strong as to entirely rule out any logical possibility. (Recall
that a logically contingent proposition is neither a logical contradiction nor a logical tautology.7) As damning evidence against a contingent proposition mounts up, we may keep decreasing and decreasing our credence in it, but our unconditional credence distribution should always remain regular—it should assign each logically contingent proposition at least a tiny bit of confidence.

The Regularity Principle adds to the synchronic Bayesian rules we have seen so far—it is not entailed by the probability axioms, the Ratio Formula, or any combination of them. As our Contradiction result showed in Section 2.2.1, those rules do entail that all logical contradictions receive credence 0. But Regularity is the converse of Contradiction; instead of saying that all contradictions receive credence 0, it entails that only contradictions do. Similarly, Regularity (along with the probability axioms) entails the converse of Normality: instead of saying that all tautologies receive credence 1, it entails that only tautologies do. (The negation of a contingent proposition is contingent; if we were to assign a contingent proposition credence 1 its negation would receive credence 0, in violation of Regularity.) This captures the common-sense idea that one should never be absolutely certain of a proposition that's not logically true.8

Conditionalization conflicts with Regularity; the moment an agent conditionalizes on contingent evidence, she assigns credence 1 to a non-tautology. As we saw earlier, Conditionalization on contingent evidence rules out doxastic possibilities the agent had previously entertained; on the Venn diagram, it narrows the set of worlds under consideration. Regularity, on the other hand, fixes an agent's doxastic possibility set as the full set of logical possibilities. While evidence might shift the agent's credences around among various possible worlds, an agent who satisfies Regularity will never eliminate a possible world outright.

We might defend Conditionalization by claiming that whenever agents receive contingent evidence, it is of a highly specific kind, and Regularity is false for this kind of evidence. Perhaps I don't actually learn that my sister is coming over for Thanksgiving—I learn that she told me she's coming; or that it seemed to me that she said that; or that I had a phenomenal experience as of . . . . Surely I can be certain what phenomenal experiences I've had, or at least what experiences I'm having right now. When in the midst of having a particular phenomenal experience, can't I entirely rule out the logical possibility that I am having a different experience instead? C.I. Lewis defended this approach as follows:

If anything is to be probable, then something must be certain.
The data which themselves support a genuine probability, must themselves be certainties. We do have such absolute certainties, in the sense data initiating belief and in those passages of experience which later may confirm it. (1946, p. 186)

Yet foundationalist epistemologies based on sense data and indubitable phenomenology have become unpopular in recent years. So it's worth considering other ways to make sense of Conditionalization's conception of evidence.

Levi (1980) took credence-1 propositions to represent "standards of serious possibility":

When witnessing the toss of a coin, [an agent] will normally envisage as possibly true the hypothesis that the coin will land heads up and that it will land tails up. He may also envisage other possibilities—e.g., its landing on its edge. However, if he takes for granted even the crudest folklore of modern physics, he will rule out as impossible the coin's moving upward to outer space in the direction of Alpha Centauri. He will also rule out the hypothesis that the Earth will explode. (p. 3)

However, Levi formalized his standards of serious possibility so that they could change—growing either stronger or weaker—for a given agent over time. So his approach did not fully embrace Conditionalization.

Alternatively, we could represent agents as ruling out contingent possibilities only relative to a particular inquiry. Consider a scientist who has just received a batch of experimental data and wants to weigh its import for a set of hypotheses. There are always outlandish possibilities to consider: the data might have been faked; the laws of physics might have changed a moment ago; she might be a brain in a vat. But to focus on the problem at hand, she might conditionalize on the data and see where that takes her credences in the hypotheses. Updating by Conditionalization might fail as a big-picture, permanent strategy, but nevertheless could be useful in carefully-delimited contexts. (I mentioned this possibility in Section 2.2.3.)

Perhaps these interpretations of evidence conditionalized-upon remain unsatisfying. We will return to this problem in Chapter 5, and consider a generalized updating rule (Jeffrey Conditionalization) that allows agents to redistribute their credences over contingent possibilities without eliminating any of them entirely. For the rest of this chapter we will simply assume that Conditionalization on some kind of contingent evidence is a rational updating rule, so as to draw out further features of such updates.
4.2.1 Probabilities are weird! Total Evidence and the Monty Hall Problem
Classical entailment is monotonic in the following sense: If a piece of evidence E you have received entails H, any augmentation of that evidence (any conjunction that includes E as a conjunct) will continue to entail H as well. Probabilistic relations, however, can be nonmonotonic: H might be highly probable given E, but improbable given E & E′. For this reason, it's important for an agent assigning credences on the basis of her evidence to consider all of that evidence, and not simply draw conclusions from a subpart of it. Carnap (1947) articulated the Principle of Total Evidence that a rational agent's credence distribution takes into account all of the evidence available to her.

We sometimes violate the Principle of Total Evidence by failing to note the manner in which an agent gained particular information. If the agent is aware of the mechanism by which a piece of information was received, it can be important to recognize facts about that mechanism as a component of her total evidence (along with the information itself). In Eddington's (1939) classic example, you draw a sample of fish from a lake, and all the fish are longer than six inches. Normally, updating on this information would increase your confidence that every fish in the lake is at least that long. But if you know the net used to draw the sample has big holes through which shorter fish fall, a confidence increase is unwarranted. Here it's important to conditionalize not only on the lengths of the fish but also on how they were caught. The method by which your sample was collected has introduced an observation selection effect into the data.9

The process by which information is obtained is also crucial to a famously counterintuitive probability puzzle, the Monty Hall Problem (Selvin 1975):

In one of the games played on Let's Make a Deal, a prize is randomly hidden behind one of three doors. The contestant selects one door, then the host (Monty Hall) opens one of the doors the contestant didn't pick. Monty knows where the prize is, and makes sure to always open a door that doesn't have the prize behind it. (If both the unselected doors are empty, he randomly chooses which one to open.) After he opens an empty door, Monty asks the contestant if she wants what's behind the door she initially selected, or what's behind the other remaining closed door. Assuming she understands the details of Monty's procedure, how confident should the contestant be that the door she initially selected contains the prize?
Most people's initial reaction is to answer 1/2: the contestant originally spread her credence equally among the three doors; one of them has been revealed to be empty; so she should be equally confident that the prize is behind each of the remaining two. This analysis is illustrated by the following probability table:

                           cr1    cr2
  Prize behind door A      1/3    1/2
  Prize behind door B      1/3    0
  Prize behind door C      1/3    1/2
Here we’ve used the obvious partition of three locations where the prize might be. Without loss of generality, I’ve imagined that the contestant initially selects door A and Monty then opens door B. At time t 1 —after the contestant has selected door A but before Monty has opened anything—she is equally confident that the prize is hidden behind each of the three doors. When Monty opens door B at t2 , the contestant should conditionalize on the prize’s not being behind that door. It looks lik e this yields the cr 2 distribution listed above, which matches most people’s intuitions. Yet the contestant’s total evidence at t2 includes not only the fact that the prize isn’t behind door B, but also the fact that Monty opened that one. These two propositions aren’t equivalent among the agent’s doxastically possible worlds; there are possible worlds consistent with what the contestant knows about Monty’s procedure in which door B is empty yet Monty opens door C. That door B was not only empty but was revealed to be empty is not expressible in the partition used abov e. So we need a richer partition, containing information both about the location of the prize and about what Monty does: Prize behind door A & Monty reveals B Prize behind door A & Monty reveals C Prize behind door B & Monty reveals C Prize behind door C & Monty reveals B
cr1 1 6 1 6
cr2
{ 1{3 { 0 1 {3 0 1 {3 2{3
Given what the agent knows of Monty's procedure, these four propositions partition her doxastic possibilities at t1. At that time she doesn't know where the prize is, but she has initially selected door A (and Monty hasn't opened anything yet). If the prize is indeed behind door A, Monty randomly chooses whether to open B or C. So the contestant divides her 1/3 credence that the prize is behind door A equally between those two options. If the prize is behind door B, Monty is forbidden to open that door as well as the door the contestant selected, so Monty must open C. Similarly, if the prize is behind door C, Monty must open B.

At t2 Monty has opened door B, so the contestant conditionalizes by setting her credence in the second and third partition elements to 0, then multiplying the remaining values by a constant so that they sum to 1. This maintains the ratio between her credences on the first and fourth lines; initially she was twice as confident of the fourth as the first, so she remains twice as confident after the update. She is now 2/3 confident that the prize isn't behind the door she initially selected, and 1/3 confident that her initial selection was correct. If she wants the prize, the contestant should switch doors.
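To see the arithmetic worked out mechanically, here is a minimal sketch in Python (my own illustration, not from the text) that performs exactly this update over the four-cell partition: write zeroes on the cells ruled out by the evidence that Monty opened door B, then renormalize.

    from fractions import Fraction as F

    # cr1 over the richer partition (the contestant has picked door A)
    cr1 = {
        ("prize A", "Monty opens B"): F(1, 6),
        ("prize A", "Monty opens C"): F(1, 6),
        ("prize B", "Monty opens C"): F(1, 3),
        ("prize C", "Monty opens B"): F(1, 3),
    }

    # Evidence at t2: Monty opened door B.  Conditionalize by zeroing out the
    # inconsistent cells and dividing the rest by the credence that remains.
    kept = {cell: v for cell, v in cr1.items() if cell[1] == "Monty opens B"}
    total = sum(kept.values())
    cr2 = {cell: v / total for cell, v in kept.items()}

    print(cr2)
    # {('prize A', 'Monty opens B'): Fraction(1, 3),
    #  ('prize C', 'Monty opens B'): Fraction(2, 3)}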
This is the correct analysis. If you find that surprising, the following explanation may help: When the contestant originally selected her door, she was 1/3 confident that the prize was behind it and 2/3 confident that the prize was somewhere else. If her initial pick was correct, she claims the prize just in case she sticks with that pick. But if her initial selection was wrong, she wins by switching to the other remaining closed door, because it must contain the prize. So there's a 1/3 chance that sticking is the winning strategy, and a 2/3 chance that switching will earn her the prize. Clearly switching is a better idea.

When I first heard the Monty Hall Problem, even that explanation didn't convince me. I only became convinced after I simulated the scenario over and over and found that sticking made me miss the prize roughly 2 out of 3 times. If you're not convinced, try writing a quick computer program or finding a friend with a free afternoon to act as Monty Hall for you a few hundred times. You'll eventually find that the probability table taking total evidence into account provides the correct analysis.10
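If you would rather simulate than calculate, here is one way such a quick program might look. This is a sketch of my own (the function name and trial count are arbitrary), not code from the text:

    import random

    def play(switch, trials=100_000):
        """Return the fraction of trials in which the contestant gets the prize."""
        wins = 0
        for _ in range(trials):
            doors = ["A", "B", "C"]
            prize = random.choice(doors)
            pick = "A"  # without loss of generality
            # Monty opens an empty, unpicked door, choosing randomly if two qualify
            monty = random.choice([d for d in doors if d != pick and d != prize])
            if switch:
                pick = next(d for d in doors if d != pick and d != monty)
            wins += (pick == prize)
        return wins / trials

    print("stick: ", play(switch=False))   # roughly 1/3
    print("switch:", play(switch=True))    # roughly 2/3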
One final note about total evidence: I may have convinced you that taking your total evidence into account is a good idea. But you might be concerned that this is impossible. After all, at every conscious moment an agent receives torrents of information from her environment. How can she take it all into account when setting her credence in a particular proposition—say, the proposition that the cheese sandwich on the counter in front of her has not yet gone bad?

The nonmonotonicity of probabilistic relations means that a rational agent cannot afford to ignore any of her evidence. But many of the propositions an agent learns in a given moment will be irrelevant to the matter under consideration relative to her current credence distribution. That is, for many pieces of evidence her credence in the proposition at issue would be the same whether she conditionalized on that particular piece of evidence or not. As the agent ponders her cheese sandwich, information about the color of the bird that just flew by or the current position of her right hand makes no difference to her credence that the sandwich is edible. So while a rational agent doesn't ignore any of her total evidence, the irrelevance of much of that evidence permits her to focus in on the few pieces of evidence that are relevant to the proposition under consideration. For this reason, Bayesians often bypass discussion of an agent's total evidence in favor of discussing her total relevant evidence.11
4.3 Hypothetical Priors and Evidential Standards
Consider a rational agent with probabilistic credences who updates by Conditionalization each time she gains new evidence, for her entire life. At a given moment ti she has a credence distribution cri. She then gains new evidence E and updates by Conditionalization. Her unconditional cri values provide the priors for that update, and her cri values conditional on E provide the posteriors. By Conditionalization, these posteriors become her unconditional credences at the next time, tj. Then she receives a new piece of evidence E′. Her unconditional crj values supply the priors for a new update, and her crj values conditional on E′ become the posteriors. And so it goes.

We have already seen that if this agent updates by Conditionalization every time she learns something new, she will gain contingent certainties over time and never lose any of them. So her entire doxastic life will be a process of accumulating empirical evidence from her environment, building a snowball of information that never loses any of its parts.

What happens if we view that process backwards, working from the agent's present doxastic state back through the states she assigned in the past? Her current unconditional credences resulted from an earlier update by Conditionalization. Relative to that update, her current credences were the posteriors and some other distribution provided the priors. But those priors, in turn, came from a conditionalization. So they were once the posteriors of an even earlier set of priors. As we go backwards in time, we find a sequence of credence distributions, each of which was conditionalized to form the next. And since each conditionalization strictly added evidence, the earlier distributions contain successively less and less contingent information as we travel back.

Bayesian Epistemologists often imagine continuing this process until
there's no farther back you can go. They imagine that if you went back far enough, you would find a point at which the agent possessed literally no contingent information. This was the starting point from which she gained her first piece of evidence, and made her first update by Conditionalization. This distribution is sometimes referred to as the agent's initial prior distribution (or her "ur-prior").

Let's think about the properties an initial prior distribution would have. First, since the credence distributions that develop from an initial prior by Conditionalization are probability distributions, it's generally assumed that the initial prior satisfies the Kolmogorov axioms (and Ratio Formula) as well. Second, it's thought that since at the imagined initial moment (call it t0) the agent possessed no contingent information, she should not have been certain of any contingent propositions. In other words, the initial prior distribution cr0 should be regular (assign nonextreme values to all contingent propositions).

Finally, think about how cr0 relates to a credence distribution our agent assigns at some arbitrary moment ti later on. We could recover cri from cr0 by conditionalizing cr0 on the first piece of evidence the agent ever learned, then conditionalizing the result of that update on the second piece of evidence she learned, and so on until we reach cri. But since Conditionalization is cumulative, we could also skip the intermediate steps and get from cr0 to cri in one move. Suppose the proposition Ei represents all the evidence the agent possesses at ti (perhaps Ei is the conjunction of all the individual pieces of evidence the agent has learned since t0). Then as long as the agent has updated by conditionalizing at every step between t0 and ti, cumulativity guarantees that cri(·) = cr0(· | Ei). A rational agent's credence distribution at any given time is her initial prior distribution conditional on all the evidence she possesses at that time.

The idea is illustrated in Figure 4.2. Each distribution in the series is generated from the previous one by conditionalizing on the evidence learned (solid arrows). But we can also derive each distribution directly (dashed arrows) by conditionalizing cr0 on the agent's total evidence at the relevant time.

Figure 4.2: An initial prior?
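The following toy sketch (my own numbers and world labels, purely illustrative) checks the cumulativity claim numerically: conditionalizing step by step on E and then E′ yields the same distribution as conditionalizing the starting distribution once on their conjunction.

    from fractions import Fraction as F

    def conditionalize(cr, evidence):
        """Zero out worlds outside the evidence set, then renormalize."""
        total = sum(p for w, p in cr.items() if w in evidence)
        return {w: (p / total if w in evidence else F(0)) for w, p in cr.items()}

    # A toy starting distribution cr0 over four doxastic possibilities
    cr0 = {"w1": F(1, 2), "w2": F(1, 4), "w3": F(1, 8), "w4": F(1, 8)}

    E       = {"w1", "w2", "w3"}   # first piece of evidence
    E_prime = {"w1", "w3", "w4"}   # second piece of evidence

    stepwise = conditionalize(conditionalize(cr0, E), E_prime)
    one_shot = conditionalize(cr0, E & E_prime)   # total evidence E & E′

    print(stepwise == one_shot)   # True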
The initial-priors picture is an attractive one, and bears a certain mathematical elegance. The trouble is that it can at best be a myth. Was there ever a time in a real agent's life when she possessed no contingent information? Since cr0 satisfies the probability axioms, it must be perfectly attuned to logical relations (such as mutual exclusivity and entailment) and assign a value of 1 to all tautologies. So an initial prior is omniscient logically while totally ignorant empirically. David Lewis used to refer to such highly intelligent, blank creatures in conversation as "superbabies"; sadly, I doubt the world has ever seen their like.12 Moreover, I'm not even sure it makes sense for an agent with no contingent information to assign precise numerical credences to the kinds of elaborate, highly-detailed empirical claims that are real agents' stock in trade.

Yet the formal mechanism employed by the initial priors myth—a regular probability distribution conditionalized on total evidence to generate credence distributions at arbitrary times—can be repurposed to play an important epistemological role. To get a sense of that role, let's consider an example.

Question: When playing a hand of five-card stud, how confident should the news that your last card will be the two of clubs make you that you'll win the hand?

Answer: Depends where you are in the game, and what's happened up to that point.

Five-card stud is a poker game in which you receive a total of five cards, one at a time. Four of a kind (four out of five cards showing the same number) is an excellent, almost unbeatable hand in this game. So let's suppose that your first four cards in this particular hand of five-card stud were the jack of spades, the two of diamonds, the two of hearts, and then the two of spades. With that background information, discovering that your last card will be the two of clubs should make you almost certain that you'll win. (Depending in part on what you know of the other players' hands.)

An agent's evidential standards govern how she reacts to particular pieces of news. These evidential standards are determined in part by an agent's total evidence, and as such evolve over time. At the beginning of a hand of five-card stud, before any cards are dealt, learning that your last card will be the two of clubs (perhaps by peeking into the deck) would not make you very confident of winning the hand. Similarly, after seeing your first card (the jack of spades), a final two of clubs doesn't seem like very
good news. But for each successive two you receive after that point, your evidential standards change such that learning the final card will be a two would make you more and more confident of a win.

How might we represent evolving evidential standards in a formal fashion? In general, what we need is a function indexed to each point in time that takes as input various pieces of evidence the agent might receive after that time and outputs attitudes to assign towards various propositions in light of each possible piece of evidence. In the Bayesian context, that function is supplied by an agent's conditional credences. For any time ti and potential evidential proposition E in L, the conditional distribution cri(· | E) specifies credences the agent will assign if she updates by conditionalization on learning E. As the agent gains evidence over time, her conditional credences change, representing changes in her ongoing evidential standards.

In real life, when we observe two people reacting differently to acquiring the same piece of information (thereby applying different evidential standards), we usually attribute this difference in evidential standards to differences in their previous experience. When one student in a class insists on answering every question, pontificates at length, and refuses to consider
others' ideas, some of his fellow students might conclude that this person is the most knowledgeable student in the room. But the teacher (or others with more experience) might draw the opposite conclusion, informed by a broader pool of evidence about how particular personality types behave in the classroom.

Yet how should we understand cases in which agents draw different conclusions despite sharing the same total evidence? Hiring committees form different beliefs about candidates' suitability from the same application files; jurors disagree about a defendant's guilt after witnessing the same trial; scientists embrace different hypotheses consistent with the same experimental data. These seem to be cases in which agents share a common body of total evidence, or at least total evidence relevant to the question at hand. So it can't be some further, unshared piece of extra evidence that's leading the agents to draw different conclusions from each other.

One could stubbornly maintain that in every real-life case in which agents interpret a piece of evidence differently, that difference is entirely attributable to the vagaries of their background information. But I think this would be a mistake. In addition to variations in their total evidence, agents have varying ways of interpreting their total evidence. Some people are naturally more skeptical than others, and so require more evidence to become confident of a particular proposition (that humans actually landed on the moon, that a lone gunman shot JFK, that the material world exists).
Some people are interested in avoiding high confidence in falsehoods, while others are more interested in achieving high confidence in truths. (The former will tend to prefer non-committal credences, while the latter will be more willing to adopt credence values near 0 and 1.) Some scientists are more inclined to believe elegant theories, while others incline towards the theory that hews closest to the data. (When the Copernican theory was first proposed, heliocentrism fit the available astronomical data worse than Ptolemaic approaches. (Kuhn 1957))

In the five-card stud example we discussed what I will call ongoing evidential standards. Ongoing evidential standards reflect how an agent is disposed at a given time to assign attitudes in light of particular pieces of evidence she might receive. At any given time, an agent's credences can be determined from the last piece of evidence she acquired and the ongoing standards she possessed before she acquired it (with the latter having been influenced by pieces of evidence she acquired even earlier than that). Yet there's another way to think about the influences on an agent's attitudes at a given time: we can separate out the influence of her total evidence from the influence of whatever other, non-evidential factors dictate how she assesses that evidence. I refer to the latter as the agent's ultimate evidential standards. An agent's ultimate evidential standards represent her evidence-independent tendencies to respond to whatever package of total evidence might come her way.13

How might we formally represent an agent's ultimate evidential standards? Again, what's needed is some kind of function, this time from bodies of total evidence to sets of attitudes adopted in their light. This function could be specified in any way one likes (as a table listing inputs and outputs, as a very complicated kind of graph, etc.), but Bayesians have a particularly useful representation already to hand. To wit:

Hypothetical Priors Theorem: Given any finite series of credence distributions cr1, cr2, ..., crn, each of which satisfies the probability axioms and Ratio Formula, let Ei be a conjunction of the agent's total evidence at ti. If the cri update by Conditionalization, then there exists at least one regular probability distribution PrH such that for all 1 ≤ i ≤ n,

    cri(·) = PrH(· | Ei)
Take any agent who has obeyed the Bayesian rational norms (probability axioms, Ratio Formula, updating by Conditionalization) throughout her life. The Hypothetical Priors Theorem guarantees that, given the (finite) series
of credence distributions she has adopted at various times, we will be able to generate at least one regular, probabilistic numerical distribution that stands in a particular mathematical relationship to every member of that series. I will call this distribution a hypothetical prior distribution, and abbreviate it PrH. For any time ti at which the agent's total evidence is Ei, the agent's unconditional credence distribution cri will equal PrH(· | Ei).
A hypothetical prior distribution is a particularly elegant way of representing a rational Bayesian agent's ultimate evidential standards. It is defined so that at any moment in the agent's life, we can recover her credence distribution by conditionalizing her hypothetical prior on her total contingent evidence at that moment. Yet being regular, the hypothetical prior does not assign any contingent certainties itself. This means that it contains no contingent evidence, and is not influenced by empirical experiences.14 So when we are confronted with an agent's credences at any given time, we can cleanly separate out two distinct influences on those credences: her total evidence and her hypothetical priors.

Hypothetical priors are convenient because they take on a mathematical form with which we are already familiar: a regular distribution over the language L. Yet this does not mean that a hypothetical prior distribution is a credence distribution. An agent's hypothetical priors are not degrees of belief we imagine she espoused at some particular point in her life, or would espouse under some strange conditions. This is what distinguishes them from the mythical initial priors.15 Instead, hypothetical priors encapsulate an agent's abstract evidential assessment tendencies, and stay constant throughout her life as long as she obeys the Conditionalization update rule. Instead of appearing somewhere within the series of credence distributions the agent assigns over time, the hypothetical prior "hovers above" that series, combining with the agent's total evidence to create elements of the series at each given time. As de Finetti puts it, "If we reason according to Bayes' theorem we do not change opinion. We keep the same opinion and we update it to the new situation." ((de Finetti 1995, p. 100), translated by and quoted in (Galavotti 2005, p. 215)) This arrangement is depicted in Figure 4.3. Again, the solid arrows represent conditionalizations from one moment to the next, while the dashed arrows represent the possibility of generating an agent's distribution at any given time by conditionalizing PrH on her total evidence at that time.

Figure 4.3: A hypothetical prior PrH

Let's work through an example of extracting hypothetical priors from a series of updates. Suppose Ava has drawn two coins from a bin that contains only fair coins and coins biased towards heads. Prior to time t1
she has inspected both coins and determined them to be fair. Between t1 and t2 she flips the first coin, which comes up heads. Between t2 and t3, the second coin comes up tails. Our language L will contain three atomic propositions: N, that neither coin Ava picked is biased; Ha, that the first flip comes up heads; and Hb, that the second flip is heads. Presumably the following probability table describes Ava's credences over time:
  N  Ha  Hb    cr1   cr2   cr3
  T  T   T     1/4   1/2   0
  T  T   F     1/4   1/2   1
  T  F   T     1/4   0     0
  T  F   F     1/4   0     0
  F  T   T     0     0     0
  F  T   F     0     0     0
  F  F   T     0     0     0
  F  F   F     0     0     0
In this example, Ava's total evidence at t1 (or at least her total evidence representable in language L) is N. We'll call this proposition E1. Ava's total evidence at t2, E2, is N & Ha. Then E3 is N & Ha & ~Hb. Notice that since N is part of Ava's evidence at all times reflected in this table, she assigns credence 0 throughout the table to any state-description on which N is false. Since Ava's credence distributions cr1 through cr3 are probabilistic, and update by Conditionalization, the Hypothetical Priors Theorem guarantees the existence of at least one hypothetical prior PrH standing in a particular relation to Ava's credences. I've added a column to the probability table below representing one such hypothetical prior:
  N  Ha  Hb    cr1   cr2   cr3   PrH
  T  T   T     1/4   1/2   0     1/16
  T  T   F     1/4   1/2   1     1/16
  T  F   T     1/4   0     0     1/16
  T  F   F     1/4   0     0     1/16
  F  T   T     0     0     0     21/64
  F  T   F     0     0     0     11/64
  F  F   T     0     0     0     11/64
  F  F   F     0     0     0     5/64
As the Hypothetical Priors Theorem requires, PrH is regular—it assigns positive credence to every contingent proposition in L. The values in the PrH column also sum to 1, so PrH satisfies the probability axioms. Finally, PrH stands in the desired relationship to each of cr1, cr2, and cr3: each of those distributions can be obtained from PrH by conditionalizing on Ava's total evidence at the relevant time. To take one example, consider cr2. E2 is N & Ha. To conditionalize PrH on N & Ha, we write a zero on each line whose state-description is inconsistent with N & Ha. That puts zeroes on the third through eighth lines of the truth-table. We then multiply the PrH values on the first and second lines by a constant (in this case, 8) so that the results sum to 1. This yields the cr2 distribution above. With a bit of work you can verify that cr1 results from conditionalizing PrH on E1, and cr3 is the result of conditionalizing PrH on E3.
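Those checks can also be run mechanically. Here is a small sketch of my own (using the PrH column from the table above) that conditionalizes PrH on E1, E2, and E3 and recovers cr1, cr2, and cr3:

    from fractions import Fraction as F

    # State-descriptions as (N, Ha, Hb) truth-value triples, with the PrH values
    # from the table above
    PrH = {
        (True,  True,  True):  F(1, 16),
        (True,  True,  False): F(1, 16),
        (True,  False, True):  F(1, 16),
        (True,  False, False): F(1, 16),
        (False, True,  True):  F(21, 64),
        (False, True,  False): F(11, 64),
        (False, False, True):  F(11, 64),
        (False, False, False): F(5, 64),
    }

    def conditionalize(dist, consistent_with):
        """Zero out state-descriptions inconsistent with the evidence, renormalize."""
        total = sum(p for w, p in dist.items() if consistent_with(w))
        return {w: (p / total if consistent_with(w) else F(0)) for w, p in dist.items()}

    cr1 = conditionalize(PrH, lambda w: w[0])                        # E1 = N
    cr2 = conditionalize(PrH, lambda w: w[0] and w[1])               # E2 = N & Ha
    cr3 = conditionalize(PrH, lambda w: w[0] and w[1] and not w[2])  # E3 = N & Ha & ~Hb

    print(cr1[(True, True, True)])    # 1/4
    print(cr2[(True, True, True)])    # 1/2
    print(cr3[(True, True, False)])   # 1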
The hypothetical prior I wrote down isn't unique. I could have written down (infinitely) many other regular, probabilistic distributions that stand in the required relation to cr1 through cr3. This reveals that the information in the original table underdescribes Ava's ultimate evidential standards, even over our fairly limited language L. For instance, the original table doesn't tell us what credences Ava would've assigned had she learned before t1 that at least one of the coins was biased. The PrH I've provided makes very specific assumptions about Ava's tendencies for that case. (For a fun exercise, figure out what that PrH assumes about the biased coins in the bin.) But I could've made different assumptions, and generated a different hypothetical prior consistent with cr1 through cr3. Interestingly, those assumptions don't matter much for practical purposes. While different hypothetical priors would encode different tendencies for dealing with counterfactual cases, every hypothetical prior consistent with cr1 through cr3 yields identical credence distributions for every time after cr3, no matter what evidence Ava gains at those later times. (That is, assuming Ava continues to conditionalize on the evidence received.) So the differences among these hypothetical priors won't make any difference to Ava's actual credences going forward.

On the other hand, when different agents have differing hypothetical priors, those differences can be important. A hypothetical prior represents the extra-evidential factors that combine with an agent's evidence to generate her credences at a given time. Plugging the agent's total evidence at a given time into her hypothetical prior yields her credence distribution at that time. When two agents have different hypothetical priors, plugging in the same body of total evidence may yield different results. So two agents may assign different credences to the same proposition in light of the same total evidence. The obvious next question is whether they can both be rational in doing so.

Evidence and evidential standards come up in a variety of contexts in epistemology. As we've just seen, Bayesian epistemology provides an elegant formal apparatus for isolating each of these elements. But once we've separated them, the question to ask about ultimate evidential standards is whether anything goes. Is any hypothetical prior rational, so long as it's probabilistic? Some probabilistic hypothetical priors will be anti-inductive, or will recommend highly skeptical attitudes in the face of everyday bundles of total evidence. Can we rule out such hypothetical priors as rationally impermissible? Can we go even farther than that, laying down constraints on ultimate evidential standards so that any time two agents interpret the same evidence differently, at least one of them must be interpreting it irrationally? This will be our first topic in Chapter 5, as we distinguish Objective from Subjective Bayesianism.
4.4 Exercises
Unless otherwise noted, you should assume when completing these exercises that the cr-distributions under discussion satisfy the probability axioms and Ratio Formula. You may also assume that whenever a conditional cr expression occurs or a proposition is conditionalized upon, the needed proposition has nonzero unconditional credence so that conditional credences are well-defined.

Problem 4.1. Galileo intends to determine whether gravitational acceleration is independent of mass by dropping two cannonballs of differing mass off the Leaning Tower of Pisa. Conditional on the quantities' being independent, he is 95% confident that the cannonballs will land within 0.1 seconds of each other. (The experiment isn't perfect—one ball might hit a bird.) Conditional on the quantities' being dependent, he is 80% confident that
the balls won't land within 0.1 seconds of each other. (There's some chance that although mass affects acceleration, it doesn't have much of an effect.)∗

(a) Before performing the experiment, Galileo is 30% confident that mass and gravitational acceleration are independent. How confident is he that the cannonballs will land within 0.1 seconds of each other?

(b) After Galileo conditionalizes on the evidence that the cannonballs landed within 0.1 seconds of each other, how confident is he in each hypothesis?

Problem 4.2. Prove that Conditionalization retains certainties. In other words, prove that if cr_i(H) = 1 and cr_j is generated from cr_i by Conditionalization, then cr_j(H) = 1 as well.
Problem 4.3. Prove that Conditionalization is cumulative. That is, prove that for any cr_i, cr_j, and cr_k, conditions 1 and 2 below entail condition 3.

1. For any proposition X in L, cr_j(X) = cr_i(X | E).

2. For any proposition Y in L, cr_k(Y) = cr_j(Y | E′).

3. For any proposition Z in L, cr_k(Z) = cr_i(Z | E & E′).
Problem 4.4.

(a) Provide an example in which an agent conditionalizes on new evidence, yet her credence in a proposition compatible with the evidence decreases. That is, provide an example in which H and E are consistent, yet cr2(H) < cr1(H) when E is learned between t1 and t2.

(b) Prove that when an agent conditionalizes on new evidence, her credence in a proposition that entails the evidence cannot decrease. That is, when H ⊨ E, it must be the case that cr2(H) ≥ cr1(H) when E is learned between t1 and t2.

(c) Prove that as long as cr1(H) and cr1(E) are both nonextreme, conditionalizing on E increases the agent's credence in H when H ⊨ E.†
∗ This is a version of a problem from Julia Staffel.
† This problem was inspired by a problem of Sarah Moss'.

Problem 4.5. Reread the details of the Base Rate Fallacy example in Section 4.1.2. After you apply the diagnostic test once and it yields a positive result, your odds for D should be 9 : 999.
(a) Suppose you apply the test a second time to the same subject, and it yields a positive result once more. What should your odds for the subject's having the disease be now? (Assume that D screens off the results of the first test from the results of the second.)

(b) How many consecutive tests (each independent of all the prior test results conditional on both D and ~D) would have to yield positive results before your confidence that the subject has the disease exceeded 50%?
(c) Does this shed any light on why patients diagnosed with rare diseases are often advised to seek a second opinion? Explain.

Problem 4.6. Your friend Jones is a gambler. He even gambles about whether to gamble! Every time he goes to the track, he flips a fair coin to determine whether to bet that day. If it comes up heads he bets on his favorite horse, Speedy. If it comes up tails he doesn't bet at all. On your way to the track today, you were 1/6 confident that out of the six horses running, Speedy would win. You were 1/2 confident that Jones's coin would come up heads. And you considered the outcome of the horse race independent of the outcome of the coin flip. But then you saw Jones leaving the track with a smile on his face. The smile tells you that either Jones bet on Speedy and won, or Jones didn't bet and Speedy didn't win.‡
(a) Using a language with the atomic propositions H (for heads on the coin) and S (for a Speedy win), express the information you learn when you see Jones smiling.

(b) After updating on this information by conditionalizing, how confident are you that Speedy won? How confident are you that the coin came up heads?

(c) Explain why one of the unconditional credences you calculated in part (b) differs from its prior value and the other one doesn't. Be sure to include an explanation of why that unconditional credence was the one that changed out of the two. ("Because that's what the math says" is not an adequate explanation—we want to know why the mathematical outcome makes sense.)

‡ This story is adapted from (Hart and Titelbaum ta).

Problem 4.7. At t1, t2, and t3, Jane assigns credences over the language L constructed from atomic propositions P and Q. Jane's distributions satisfy constraints 1 through 6:
1. At t1, Jane is certain of Q ⊃ P, anything that proposition entails, and nothing else.
2. Between t1 and t2 Jane learns P and nothing else. She updates by conditionalizing between those two times.

3. cr1(Q | P) = 2/3.

4. cr1(Q | ~P) = 1/2.

5. cr2(P ⊃ Q) = cr3(P ⊃ Q).

6. At t3, Jane is certain of ~(P & Q), anything that proposition entails, and nothing else.
(a) Completely specify Jane's credence distributions at t2 and t3.

(b) Create a hypothetical prior for Jane. In other words, specify a regular probabilistic distribution PrH over L such that cr1 can be generated from PrH by conditionalizing on Jane's set of certainties at t1; cr2 is PrH conditionalized on Jane's certainties at t2; and cr3 is PrH conditionalized on Jane's t3 certainties.

(c) Does Jane update by Conditionalization between t2 and t3? Explain how you know.

(d) The Hypothetical Priors Theorem says that if an agent always updates by conditionalizing, then her credences can be represented by a hypothetical prior distribution. Is the converse of this theorem true?
Problem 4.8. Suppose you have a finite partition {B1, B2, ..., Bn}. Suppose also that between t1 and t2 you conditionalize on evidence equivalent to a disjunction of some of the Bs. Prove that for any A in L and any Bi such that cr2(Bi) > 0, cr2(A | Bi) = cr1(A | Bi).
Problem 4.9. Do you think only one set of ultimate evidential standards is rationally permissible? Put another way: If two agents’ series of credence distributions cannot be represented by the same hypothetical prior distribution, must at least one of them have assigned irrational credences at some point?
4.5 Further reading
Introductions and Overviews
Ian Hacking (2001). An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press.

Chapter 15 works through many excellent examples of applying Bayes' Theorem to manage complex updates.

Classic Texts
Rudolf Carnap (1947). On the Application of Inductive Logic. Philosophy and Phenomenological Research 8, pp. 133–148.

Section 3 contains Carnap's original discussion of the Principle of Total Evidence.

Extended Discussion
Paul Teller (1973). Conditionalization and Observation. Synthese 26, pp. 218–258.

Offers a number of arguments for the Conditionalization updating norm. (We'll discuss the Dutch Book argument for Conditionalization that Teller provides in Chapter 9.)

Isaac Levi (1980). The Enterprise of Knowledge. Boston: The MIT Press.

Though Levi's notation and terminology are somewhat different from mine, Chapter 4 thoroughly works through the mathematics of hypothetical priors. Levi also discusses various historically-important Bayesians' positions on how many distinct hypothetical priors are rationally permissible.

Christopher J.G. Meacham (ms). Ur-Priors, Conditionalization, and Ur-Prior Conditionalization. Unpublished manuscript.

Meacham considers a number of possible interpretations of hypothetical priors, and how they might be linked to an agent's credences at specific times by Conditionalization.
Notes

1 Remember that 1 is not a prime number, while 2 is!

2 A bit of reflection on Equation (4.2) will reveal that the constant we multiply by in the second step of our probability table updating process—the normalization factor—will always be the reciprocal of the agent's earlier unconditional credence in the evidence. In other words, the second step divides all nonzero state-description credences by cr_i(E).

3 We can also now see an alternate explanation for steps (3.51) and (3.53) of Lewis's triviality proof from Section 3.3. The proposal assessed there is that for some conditional →, the agent's conditional credence cr(Z | Y) for any Y and Z in L equals her unconditional credence in Y → Z. Whatever motivates that proposal, we should want the proposal to remain true even after the agent learns some information X. If the relevant values are going to match after conditionalization on X, it must be true before conditionalization that cr(Y → Z | X) = cr(Z | Y & X), which is just Equation (3.59).

4 Thanks to Joel Velasco for discussion, and for the example.

5 For reasons we are now in a position to understand, the term "posterior" is sometimes used ambiguously in the Bayesian literature. I have defined "posterior" as an agent's conditional credence in the hypothesis given the evidence—cr(H | E). If the agent updates by conditionalizing on E, this will equal her unconditional credence in the hypothesis after the update. The terms "prior" and "posterior" come from the fact that on an orthodox Bayesian position, those quantities pick out the agent's unconditional credences in the hypothesis before and after the update. But unorthodox Bayesians who prefer an alternative updating rule to Conditionalization nevertheless sometimes refer to an agent's post-update credence in a hypothesis as her "posterior". As I've defined the term, this is strictly speaking incorrect.

6 An acquaintance involved with neuroscientific research recently told me that when a prisoner in the American penal system comes up for parole, a particular kind of brain scan can predict with greater than 90% accuracy whether that prisoner will, if released, be sent back to jail within a specified period of time. He suggested that we use this brain scan in place of traditional parole board hearings, whose predictive accuracy is much lower. I asked why we don't just apply the brain scan to everyone in America, rather than waiting to see if a person commits a crime worth sending them to jail. He replied that the base rates make this impractical: While the recidivism rate among prisoners is fairly high, the percentage of ordinary Americans committing crimes is low, so the scan would generate far too many false positives if used on the general population.

7 In Section 2.2.3 I mentioned that Bayesians often work with an agent's set of doxastically possible worlds instead of the full set of logically possible worlds, understanding "mutually exclusive" and "tautology" in the Kolmogorov axioms in terms of this restricted doxastic set. The Regularity Principle concerns the full set of logically possible worlds—it forbids assigning credence 0 to any proposition that is true in at least one of them. So for the rest of this section, references to "contingent propositions", "tautologies", etc. should be read against that full logical set of possibilities.

8 Throughout this section I identify credence 1 with absolute certainty in a proposition and credence 0 with ruling that proposition out. This becomes more complicated when we consider events with infinitely many possible outcomes; the relevant complications will be addressed in Chapter 5.
9 Observation selection effects pop up all over the place in real life—perhaps you think the refrigerator light is always on because it's on whenever you open the door to look. Here's my favorite real-world example: During World War II, the American military
showed mathematician Abraham Wald data indicating that planes returning from engagements had more bullet holes in the fuselage than in the engine. The military was considering shifting armor from the engine to the fuselage. Wald recommended exactly the opposite, on the grounds that it was the planes returning from engagements that had holes in the fuselage but not the engines. ((Wainer 2011), recounted in (Ellenberg 2014, pp. 12–3))

10 A similar example appears in (Bradley 2010). Colin Howson argued that a so-called "Thomason case" provides a counterexample to Conditionalization. Bradley replies that if we analyze the agent's total evidence in the case—including evidence about how he came to have his evidence—the supposed counterexample disappears.

11 You may have noticed that in the Monty Hall Problem, accounting for the agent's total relevant evidence required us to move from a coarser-grained partition of her doxastic possibilities (Prize behind door A/B/C) to a finer-grained partition (Prize behind A & Monty reveals B, Prize behind A & Monty reveals C, etc.). Whether a conditionalization yields the right results often depends on the richness of the language in which the agent represents her doxastic possibilities; a language without enough detail may miss aspects of her total relevant evidence. For more on selecting an appropriately detailed language for updating, and some formal results on how one can know when one's language is rich enough, see (Titelbaum 2013, Ch. 8).

12 I learned of Lewis's "superbaby" talk from Alan Hájek. Susan Vineberg suggested to me that Lewis's inspiration for the term may have been I.J. Good's (1968) discussion of "an infinitely intelligent newborn baby having built-in neural circuits enabling him to deal with formal logic, English syntax, and subjective probability"—to which we shall return in Chapter 6.

13 White (2005) and Schoenfield (2014) use the term "epistemic standards" for what I'm calling "ultimate evidential standards". I prefer the "evidential" terminology because it emphasizes that these standards take bodies of evidence as their inputs, and also removes any association with the notion of knowledge. Feldman (2007) talks about epistemic "starting points", while Levi (1980) speaks of "confirmational commitments". Levi's discussion is particularly important, because it lays out the mathematical formalism for evidential standards I'm about to present.

14 While the Hypothetical Priors Theorem stipulates that hypothetical priors are regular, this doesn't involve any commitment to the Regularity Principle as a rational constraint. Hypothetical priors are defined to be regular so they will be independent of contingent evidence, and can represent extra-evidential influences on the attitudes an agent assigns. The Regularity Principle is a constraint on rational credences, while a hypothetical prior is not a credence distribution the agent ever assigns. Moreover, the Hypothetical Priors Theorem applies only to agents who update by Conditionalization, while Conditionalization conflicts with the Regularity Principle.

15 In the Bayesian literature, the terms "initial prior", "ur-prior", and "hypothetical prior" are often used interchangeably. To me, the former two carry the connotation that the prior was assigned by the agent at some early time. So I've selected the term "hypothetical prior" to emphasize the use of a mathematical representation that does not correspond to any credences the agent actually ever assigned. Unfortunately, the term "hypothetical prior" has also been used for a very specific notion within the literature on the Problem of Old Evidence (as in (Bartha and Hitchcock 1999, p. S349)). Here I will simply note the distinction between that usage and the one I intend; I'll discuss the relationship between the two notions of hypothetical prior in Chapter 13.
Chapter 5
Further Rational Constraints

The previous three chapters have discussed five core normative Bayesian rules: Kolmogorov's three probability axioms, the Ratio Formula, and Conditionalization. Bayesians offer these rules as necessary conditions for an agent's credences to be rational. We have not discussed whether these five rules are jointly sufficient for rational credence. Agents can satisfy the core rules and still have wildly divergent attitudes.

Suppose 1,000 balls have been drawn from an urn and every one of them has been black. In light of this evidence, I might be highly confident that the next ball drawn will be black. But I might also have a friend Mina, whose credences satisfy all the rational constraints we have considered so far, yet who nevertheless responds to the same evidence by being 99% confident that the next ball will be white. Similarly, if you tell me you rolled a fair die but don't say how the roll came out, I might assign credence 1/6 that it came up 3. Mina, however, could be 5/6 confident of that proposition, without violating the core Bayesian rules in any way.

If we think Mina's credences in these examples are irrational, we need to identify additional rational requirements beyond the Bayesian core that rule them out. We have already seen one potential requirement that goes beyond the core: the Regularity Principle (Section 4.2) prohibits assigning credence 0 to logically contingent propositions. What other requirements on rational credence might there be? When all the requirements are put together, are they strong enough to dictate a single rationally-permissible credence distribution for each body of total evidence?

The answer to this last question is sometimes taken to separate Subjective from Objective Bayesians. Unfortunately, "Objective/Subjective Bayesian" terminology is used ambiguously, so this chapter begins by
clarifying two different ways in which those terms are used. In the course of doing so we'll discuss various interpretations of probability, including frequency and propensity views. Then we will consider a number of additional rational credence constraints proposed in the Bayesian literature. We'll begin with synchronic constraints: the Principal Principle (relating credences to chances); the Reflection Principle (concerning one's current credences about one's future credences); principles for deferring to experts; indifference principles (for distributing credences in the absence of evidence); and principles for distributing credences over infinitely many possibilities. Finally, we will turn to Jeffrey Conditionalization, a diachronic updating principle proposed as a generalization of standard Conditionalization.
5.1 Subjective and Objective Bayesianism
When a weather forecaster comes on television and says, "The probability of snow tomorrow is 30%," what does that mean? What exactly has this forecaster communicated to her audience? Such questions have been asked throughout the history of mathematical probability theory; in the twentieth century, rival answers became known as interpretations of probability. There is an excellent literature devoted to this topic and its history (see the Further Readings of this chapter for recommendations), so I don't intend to let it take over this book. But for our purposes it's important to touch on some of the main interpretations, and at least mention some of their advantages and disadvantages.

5.1.1 Frequencies and Propensities
The earliest European practitioners of mathematical probability theory applied what we now call the classical interpretation of probability. This interpretation, championed most famously by Pierre-Simon Laplace, calculates the probability of a proposition by counting up the number of possible event outcomes consistent with that proposition, then dividing by the total number of outcomes possible. For example, if I roll a six-sided die, there are 6 possible outcomes, and 3 of them are consistent with the proposition that the die came up even. So the classical probability of even is 1/2. (This is almost certainly the kind of probability you first encountered in school.) Laplace advocated this procedure for any situation in which "nothing leads us to believe that one of [the outcomes] will happen rather than the others." (Laplace 1814/1995, p. 3) Applying what Jacob Bernoulli (1713)
had earlier called the "principle of insufficient reason", Laplace declared that in such cases we should view the outcomes as "equally possible", and calculate the probabilities as described above.

The notion of "equally possible" at the crux of this approach clearly needs more philosophical elucidation. But even setting that aside, the classical interpretation leaves us adrift the moment someone learns to shave a die. With the shape of the die changed, our interpretation of probability needs to allow the possibility that some faces are more probable than others. For instance, it might now be 20% probable that you will roll a six. While Laplace recognized and discussed such cases, it's unclear how his view can interpret the probabilities involved. There are no longer possible outcomes of the roll that can be tallied up and put into a ratio equaling 20%.

So suppose a shady confederate offers to sell you a shaved die with a 20% probability of landing 6. How might she explain—or back up—that probability claim? Well, if an event has a 20% probability of producing a certain outcome, we expect that were the event repeated it would produce that type of outcome about 20% of the time. The frequency theory of probability uses this fact to analyze "probability" talk. On this interpretation, when your confederate claims the die has a 20% probability of landing 6 on a given roll, she means that repeated rolls of the die will produce a 6 about 20% of the time. According to the frequency theory, the probability is x that event A will have outcome B just in case proportion x of events like A have outcomes like B.1

The frequency theory originated in work by Robert Leslie Ellis (1849) and John Venn (1866), then was famously developed by the logical positivist Richard von Mises (1928/1957). The frequency theory has a number of problems; I will mention only a few.2

Suppose my sixteen-year-old daughter asks for the keys to my car; I wonder what the probability is that she will get into an accident should I give her the keys. According to the frequency theory, the probability that the event of my giving her the keys will have the outcome of an accident is determined by how frequently this type of event leads to accidents. But what type of event is it? Is the frequency in question how often people who go driving get into accidents? How often sixteen-year-olds get into accidents? How often sixteen-year-olds with the courage to ask their fathers for the keys get into accidents? How often my daughter gets into accidents? Presumably these frequencies will differ—which one is the probability of an accident should I give my daughter the keys right now? Any event can be subsumed under many types, and the frequency theory leaves it unclear which event-types determine probability values. Event types are sometimes known as reference classes, so this is the reference
class problem. In response, one might suggest that outcomes have frequencies—and therefore probabilities—only relative to the specification of a particular reference class (either implicitly or explicitly). But it seems we can meaningfully inquire about the probabilities of particular event outcomes (or of propositions simpliciter) without specifying a reference class. I need to decide whether to give the keys to my daughter; I want to know how probable it is that she will crash. To which reference class should I relativize?

Frequency information about specific event-types seems more relevant to determining probabilities than information about general types. (The probability that my daughter will get into an accident on this occasion seems much closer to her frequency of accidents than to the accident frequency of drivers in general.) Perhaps probabilities are frequencies in the maximally specific reference class? But the maximally specific reference class containing a particular event contains only that individual event. The frequency with which my daughter gets into an accident when I give her my keys on this occasion is either 0 or 1—but we often think probabilities for such events have nonextreme values.

This brings us to another problem for frequency theories. Suppose I have a penny, and I think that if I flip it, the probability that the flip will come out heads is 1/2. Let's just grant, arguendo, that the correct reference class for this event is penny flips. According to the frequency theory, the probability that this flip will come up heads is the fraction of all penny flips that ever occur which come up heads. Yet while I'd be willing to bet that fraction is close to 1/2, I'd be willing to bet even more that the fraction is not exactly 1/2. (For one thing, the number of penny flips that will ever occur in the history of the universe might be an odd number!) For any finite run of trials of a particular event-type, it seems perfectly coherent to imagine—indeed, to expect—that a particular outcome will occur with a frequency not precisely equal to that outcome's probability. Yet if the frequency theory is correct, this is conceptually impossible when the run in question encompasses every event trial that will ever occur.

One might respond that the probability of heads on the flip of a penny is not the frequency with which penny flips actually come up heads over the finite history of our universe; instead, it's the frequency in the limit—were pennies to continue being flipped forever. This gives us hypothetical frequency theory. Yet this move undermines one of the original appeals of the frequency approach: its empiricism. The proportion of event repetitions that produce a particular outcome in the actual world is the sort of thing that could be observed (at least in principle)—providing a sound empirical base for otherwise-mysterious "probability" talk. Empirically grounding
hypothetical frequencies is a much more difficult task.

Moreover, there seem to be events that couldn't possibly be repeated many many times, and even events that couldn't be repeated once. Before the Large Hadron Collider was switched on, physicists were asked for the probability that doing so would destroy the Earth. Were that to have happened, switching on the Large Hadron Collider would not have been a repeatable event. Scientists also sometimes discuss the probability that our universe began with a Big Bang; arguably, that's not an event that will happen over and over or even could happen over and over. So it's difficult to understand talk about how frequently the universe would begin with a Bang were the number of times the universe started increased toward the limit. This problem of assigning meaningful nonextreme probabilities to individual, perhaps non-repeatable events is called the problem of the single case. The frequentist still has moves available. Faced with a single event that's non-repeatable in the actual world, she might ask what proportion of times that event produces a particular outcome across other possible worlds.3 But now the prospects for analyzing "probability" talk in empirically-observable terms have grown fairly dim.

An alternate interpretation of probability admits that probabilities are related to frequencies, but draws our attention to the features that cause particular outcomes to appear with the frequencies that they do. What is it about a penny that makes it come up heads about half the time? Presumably something about its physical attributes, the symmetries with which it interacts with surrounding air as it flips, etc. These traits lend the penny a certain tendency to come up heads, and an equal tendency to come up tails. This quantifiable disposition—or propensity—would generate certain frequencies were a long run of trials to be staged. But the propensity is also at work in each individual flip, whether that flip is ever repeated or could ever be repeated. A non-repeatable experimental setup may possess a nonextreme propensity to generate a particular outcome.

While an early propensity theory appeared in the work of Charles Sanders Peirce (1910/1932), propensity's most famous champion was Karl Popper (1957). Popper was especially motivated by developments in quantum mechanics. In quantum theory the Born rule calculates probabilities of experimental outcomes from a particular quantity (the amplitude of the wavefunction) with independent significance in the theory's dynamics. Moreover, this quantity can be determined for a particular experimental setup even if that setup is never to be repeated (or couldn't be repeated) again. This gives propensities a respectable place within an empirically-established scientific
theory. Propensities may also figure in such theories as thermodynamics and population genetics.

Yet even if there are propensities in the world, it seems difficult to interpret all probabilities as propensities. Suppose we're discussing the likelihood that a particular outcome occurs given that a quantum experiment is set up in a particular fashion. This is a conditional probability, and it has a natural interpretation in terms of physical propensities. But where there is a likelihood, probability mathematics suggests there will also be a posterior—if there's a probability of outcome given setup, there should also be a probability of setup given outcome. Yet the latter hardly makes sense as a physical propensity—does an experimental outcome have a quantifiable tendency to set up the experiment that produces it in a particular way?4

Some philosophers—especially those of a Humean bent—are also suspicious of the metaphysics of propensities. From their point of view, causes are objectionable enough; even worse to admit propensities that seem to be a kind of graded causation. Nowadays most philosophers of science agree that we need some notion of physical probability that applies to the single case. Call this notion objective chance. But whether objective chances are best understood via propensity theory, a "best systems" analysis (Lewis 1994), or some other approach is a hotly contested matter.

Finally, whatever objective chances turn out to be, they are governed by the physical laws of our world. That means there can be no objective chance that the physical laws are one way or another. (What set of laws beyond the physical might determine such chances?) Yet it seems physicists can meaningfully discuss the probability that the physical laws of the universe will turn out to be such-and-such. While the notion of objective chance makes sense of some of our "probability" talk, it nevertheless seems to leave a remainder.

5.1.2 Two Distinctions
What are physicists talking about when they discuss the probability that the physical laws of the universe are one way rather than another? Perhaps they are expressing their degrees of confidence in alternative physical hypotheses. Perhaps there are no probabilities out in the world, independent of us, about which our opinions change as we gain evidence. Instead, it may be that facts in the world are simply true or false, probability-free, and “probability” talk records our changing confidences in those facts in the face of changing evidence. Bayesian theories are often characterized as “Subjective” or “Objective”,
but this terminology can be used to draw two different distinctions. One of them concerns the interpretation of "probability" talk. On this distinction—which I'll call the semantic distinction—Subjective Bayesians adopt the position I proposed in the previous paragraph. For them, "probability" talk expresses or reports the degrees of confidence of the individuals doing the talking, or perhaps of communities to which they belong. Objective Bayesians, on the other hand, interpret "probability" assertions as having truth-conditions independent of the attitudes of particular agents or groups of agents.5 In the twentieth century, talk of "Objective" and "Subjective" Bayesianism was usually used to draw this semantic distinction.6

More recently the "Subjective Bayesian/Objective Bayesian" terminology has been used to draw a different distinction, which I will call the normative distinction. However we interpret the meaning of "probability" talk, we can grant that agents assign different degrees of confidence to different propositions (or, more weakly, that it is at least useful to model agents as if they do). Once we grant that credences exist and are subject to rational constraints, we may inquire about the stringency of those constraints. On one end of the spectrum, Objective Bayesians (in the normative sense) endorse what Richard Feldman (2007) and Roger White (2005) have called the Uniqueness Thesis:

Given any proposition and body of total evidence, there is exactly one attitude it is rationally permissible for agents with that body of total evidence to adopt towards that proposition.

Assuming the attitudes in question are degrees of belief, the Uniqueness Thesis says that given any evidential situation, there's exactly one credence that any agent is rationally permitted to adopt towards a given proposition in that situation. The Uniqueness Thesis entails evidentialism, according to which the attitudes rationally permissible for an agent supervene on her evidence.

Suppose we have two agents with identical total evidence who adopt different credences towards some propositions. Because Objective Bayesians (in the normative sense) endorse the Uniqueness Thesis, they will maintain that at least one of these agents is responding to her evidence irrationally. In most real-life situations, different agents have different bodies of total evidence—and even different bodies of relevant evidence—so many discrepancies in their attitudes can be chalked up to evidential differences. But we have stipulated in this case that the agents have identical evidence, so whatever causes the differences in their attitudes, it can't be the contents of
their evidence. In Section 4.3 we identified the extra-evidential factors that determine an agent's attitudes in light of her total evidence as her "ultimate evidential standards". These evidential standards might reflect pragmatic influences, a predilection for hypotheses with certain features, a tendency towards mistrust or skepticism, etc. In a credal context, the Hypothetical Priors Theorem tells us that whenever an agent's credence distributions over time satisfy the probability axioms, Ratio Formula, and Conditionalization, her evidential standards can be represented by a hypothetical prior distribution. This regular, probabilistic distribution stays constant as the agent gains evidence over time. Yet we can always recover the agent's credence distribution at a given time by conditionalizing her hypothetical prior on her total evidence at that time.

The core Bayesian rules (probability axioms, Ratio Formula, Conditionalization) leave a wide variety of hypothetical priors available. Assuming they satisfy the core rules, our two agents who assign different credences in response to the same total evidence must have different hypothetical priors. According to the Objective Bayesian (in the normative sense), any time such a situation arises at least one of the agents must be violating rational requirements. Thus the Objective Bayesian thinks there is exactly one set of rationally permissible hypothetical priors—one set of correct evidential standards embodying all rational agents' common responses to evidence. If we think of hypothetical priors as the input which, given a particular evidential situation, produces an agent's credence distribution as the output, then the Objective Bayesian secures unique outputs in every situation by demanding a universal, unique input.

How might the unique rational hypothetical prior be generated, and how might we justify the claim that it is uniquely correct? Our ongoing evidential standards for responding to new pieces of empirical evidence are often informed by other pieces of evidence we have received in the past. I take a fire alarm to support a particular belief about what's going on in my building because I have received past evidence about the import of such alarms. But when we abstract far enough this process must end somewhere; our ultimate evidential standards, represented by our hypothetical priors, encode responses to our total evidence, and so cannot be influenced by elements of that evidence. If we are to select and justify a unique set of such ultimate evidential standards, we must do so a priori. Extending a tradition that dated back to Bolzano (1837/1973) and perhaps even Leibniz,7 Keynes (1921) and Carnap (1950) argued that just as there are objective facts about which propositions are logically entailed by a given body of evidence, there are objective logical facts about the degree
to which a body of evidence probabilifies a particular proposition. Carnap went on to offer a mathematical algorithm for calculating the unique, logical hypothetical priors from which these facts could be determined; we will discuss that algorithm in Chapter 6. (The logical interpretation of probability holds that an agent's "probability" talk concerns logical probabilities relative to her current total evidence.8) Many recent theorists, while backing away from Keynes's and Carnap's position that these values are logical, nevertheless embrace the idea of evidential probabilities reflecting the degree to which a proposition is probabilified by a given body of evidence. If you think that rationality requires an agent to assign credences equal to the unique, true evidential probabilities on her current total evidence, you have an Objective Bayesian view in the normative sense.9

At the other end of the spectrum from Objective Bayesians (in the normative sense) are theorists who hold that the probability axioms and Ratio Formula are the only rational constraints on hypothetical priors.10 The literature often defines "Subjective Bayesians" as people who hold this view. But that terminology leaves no way to describe theorists in the middle of the spectrum—the vast majority of Bayesian epistemologists, who believe in rational constraints on hypothetical priors that go beyond the core rules but are insufficient to narrow us down to a single permissible standard. I will use the term "Subjective Bayesian" (in the normative sense) to refer to anyone who thinks more than one hypothetical prior is rationally permissible. I will call people who think the Ratio Formula and probability axioms are the only rational constraints on hypothetical priors "extreme Subjective Bayesians".

Subjective Bayesians allow for what White calls permissive cases: examples in which two agents reach different conclusions on the basis of the same total evidence without either party's making a rational mistake. This is because each agent interprets the evidence according to different (yet equally rational) evidential standards, which allow them to draw different conclusions.

I have distinguished the semantic and normative Objective/Subjective Bayesian distinctions because they can cross-cut one another. Historically, Ramsey (1931) and de Finetti (1931/1989) reacted to Keynes's Objective Bayesianism with groundbreaking theories that were Subjective in both the semantic and normative senses. But one could be a Subjective Bayesian in the semantic sense—taking agents' "probability" talk to express their own current credences—while maintaining that strictly speaking only one credence distribution is rationally permitted in each situation (thereby adhering to Objective Bayesianism in the normative sense). Going in the other direction, one could admit the existence of degrees of belief while holding that
they're not what "probability" talk concerns. This would give an Objective Bayesian semantic view that combined with either Subjective or Objective Bayesianism in the normative sense. Finally, probability semantics need not be monolithic; many Bayesians now hold that some "probability" assertions express credences, others report objective chances, and still others indicate what would be reasonable to believe given one's evidence.11

Regardless of her position on the semantics, any Bayesian who isn't an extreme Subjective Bayesian in the normative sense will concede that there are rational constraints on agents' hypothetical priors beyond the probability axioms and Ratio Formula. The rest of this chapter investigates what those additional constraints might be. I should note at the outset, though, that the more powerful and widely-applicable these constraints get, the more they seem to be beset by problems. Many Subjective Bayesians (in the normative sense) would be happy to adopt an Objective position, if only they could see past the numerous shortcomings of the principles Objective Bayesians use to generate unique rational priors. Richard Jeffrey characterized his Subjective Bayesian position as follows:

As a practical matter, I think one can give necessary conditions for reasonableness of a set of partial beliefs that go beyond mere [probabilistic] coherence—in special cases. The result is a patchwork quilt, where the patches have frayed edges, and there are large gaps where we lack patches altogether. It is not the sort of seamless garment philosophers like to wear; but (we ragged pragmatists say), the philosophers are naked! Indeed we have no proof that no more elegant garb than our rags is available, or ever will be, but we haven't seen any, yet, as far as we know. We will be the first to snatch it off the racks, when the shipments come in. But perhaps they never will. Anyway, for the time being, we are dressed in rags, tied neatly at the waist with a beautiful cord—probabilistic coherence. (It is the only cord that visibly distinguishes us from the benighted masses.) (1970, p. 169)
5.2 Deference Principles

5.2.1 The Principal Principle
Bayesian Epistemology concerns agents' degrees of belief. Yet most contemporary Bayesian epistemologists also believe that the world contains objective
chances of some sort—physical probabilities that particular events will produce particular outcomes. This raises the question of how subjective credences and objective chances should relate. One obvious response is a principle of direct inference: roughly, rational agents set their credences in line with what they know of the chances. If you're certain a die is fair (has an equal objective chance of landing on each of its faces), you should assign equal credence to each possible roll outcome.

While direct inference principles have a long history, the most famous such principle relating credence and chance is David Lewis's (1980) Principal Principle. The Principal Principle's most straightforward consequence is that if you are certain an event has objective chance x of producing a particular outcome, and you have no other information about that event, then your credence that the outcome will occur should be x. For many Bayesian purposes this is all one needs to know about the Principal Principle. But in fact the Principle is a more finely-honed instrument, because Lewis wanted it to deal with complications like the following: (1) What if you're uncertain about the objective chance of the outcome? (2) What if the outcome's chance changes over time? (3) What if you have additional information about the event besides what you know of the chances? The rest of this section explains how the Principal Principle deals with those eventualities. If you're not interested in those details, feel free to skip to Section 5.2.2.

So: Suppose it is now 1pm on a Monday. I tell you that over the weekend I found a coin from a foreign country that is somewhat irregular in shape. Despite being foreign, one side of the coin is clearly the "Heads" side and the other is "Tails". I also tell you that I flipped the foreign coin today at noon. Let H be the proposition that the noon coin flip landed heads. Consider each of the propositions below one at a time, and decide what your credence in H would be if that proposition was all you knew about the coin besides the information in the previous paragraph:

E1: After discovering the coin I spent a good part of my weekend flipping it, and out of my 100 weekend flips 64 came up heads.

E2: The coin was produced in a factory that advertises its coins as fair, but also has a side business generating black-market coins biased towards tails.

E3: The coin is fair (has a 1/2 chance of landing heads).

E4: Your friend Amir was with me at noon when I flipped the coin, and he told you it came up heads.
Hopefully it's fairly clear how to respond to each of these pieces of evidence, taken singly. For instance, in light of the frequency information in E1, it seems rational to have a credence in H somewhere around 0.64. We might debate whether precisely 0.64 is required,12 but certainly a credence in H of 0.01 (assuming E1 is your only evidence about the coin) seems unreasonable. This point generalizes to a rational principle that whenever one's evidence includes the frequency with which events of type A have produced outcomes of type B, one should set one's credence that the next A-event will produce a B-outcome equal to (or at least in the vicinity of) that frequency.13

While some version of this principle ought to be right, working out the specifics creates problems like those faced by the frequency interpretation of probability. For instance, we have a reference class problem: Suppose my evidence includes accident frequency data for drivers in general, for sixteen-year-old drivers in general, and for my sixteen-year-old daughter in particular. Which value should I use to set my credence that my daughter will get in a car accident tonight? The more specific data seems more relevant, but the more general data contains a larger sample size.14 There are statistical tools available for dealing with some of these problems, which we will discuss in Chapter 11. But for now let's focus on a different question about frequency data: Why do we use known flip outcomes to predict the outcome of unobserved flips? Perhaps because known outcomes indicate something about the physical properties of the coin itself; they help us figure out its objective chance of coming up heads. Known flip data influence our unknown flip predictions because they make us think our coin has a particular chance profile. In this case, frequency data influences predictions by way of our opinions about objective chances.

This relationship between frequency and chance is revealed when we combine pieces of evidence listed above. We've already said that if your only evidence about the coin is E1—it came up heads on 64 of 100 known tosses—then your credence that the noon toss (of uncertain outcome) came up heads should be around 0.64. On the other hand, if your only evidence is E3, that the coin is fair, then I hope it's plausible that your credence in H should be 0.5. But what if you're already certain of E3, and then learn E1? In that case your credence in heads should still be 0.5. Keep in mind we're imagining you're certain that the coin is fair before you learn the frequency data; we're not concerning ourselves with the possibility that, say, learning about the frequencies makes you suspicious of the source from which you learned that the coin is fair. If it's a fixed, unquestionable truth for you that the coin is fair, then learning that it came up 64
heads on 100 flips will not change your credence in heads. If all you had was the frequency information, that would support a different hypothesis about the chances. But it's not as if 64 heads on 100 flips is inconsistent with the coin's being fair—a fair coin usually won't come up heads on exactly half the flips in a given sample. So once you're already certain of the chances, the frequency information becomes redundant, irrelevant to your opinions about unknown flips. Frequencies help you learn about chances, so if you are already certain of the chances there's nothing more for frequency information to do.

David Lewis called information that can change your credences about an event only by way of changing your opinions about its chances admissible information. His main insight about admissible information was that when the chance values for an event have already been established, admissible information becomes irrelevant to a rational agent's opinions about the outcome.

Here's another example: Suppose your only evidence about the noon flip outcome is E2, that the coin was produced in a factory that advertises its coins as fair but has a side business in tails-biased coins. Given only this information your credence in H should be somewhere below 0.5. (Exactly how far below depends on how extensive you estimate the side business to be.) On the other hand, suppose you learn E2 after already learning E3, that this particular coin is fair. E2 then becomes unimportant information, at least with respect to predicting flips of this coin. E2 is relevant in isolation because it informs you about the chances associated with the coin. But once you're certain that the coin is fair, information E2 only teaches you that you happened to get lucky not to have a black-market coin; it doesn't do anything to push your credence in H away from 0.5. E2 is admissible information.

Contrast that with E4, your friend Amir's report that he observed the flip landing heads. Assuming you trust Amir, E4 should make you highly confident in H. And this should be true even if you already possess information E3 that the coin is fair. Notice that E3 and E4 are consistent; the coin's being fair is consistent with its having landed heads on this particular flip, and with Amir's reporting that outcome. But E4 trumps the chance information; it moves your credence in heads away from where it would be (0.5) if you knew only E3. Information about this particular flip's outcome does not change your credences about the flip by way of influencing your opinions about the chances. You still think the coin is fair, and was fair at the time it was flipped. You just know now that the fair coin happened to come up heads on this occasion. Information about this flip's outcome is
inadmissible with respect to H. Lewis expressed his insight about the irrelevance of admissible information in his famous chance-credence principle, the Principal Principle:

Let Pr_H be any reasonable initial credence function. Let ti be any time. Let x be any real number in the unit interval. Let Ch_i(A) = x be the proposition that the chance, at time ti, of A's holding equals x. Let E be any proposition compatible with Ch_i(A) = x that is admissible at time ti. Then

Pr_H(A | Ch_i(A) = x & E) = x
(I have copied this principle verbatim from (Lewis 1980, p. 266), though I have altered Lewis's notation to match our own.) There's a lot to unpack in the Principal Principle, so we'll take it one step at a time. First, Lewis's "reasonable initial credence function" sounds a lot like an initial prior distribution. Yet we saw in Section 4.3 that the notion of an initial prior is problematic, and there are passages in Lewis that make it sound more like he's talking about a hypothetical prior.15 So I will interpret the "reasonable initial credence function" as your hypothetical prior distribution, and designate it with our notation "Pr_H". The Principal Principle is proposed as a rational constraint on hypothetical priors, one that goes beyond the probability axioms and Ratio Formula.

Why frame the Principal Principle around hypothetical priors, instead of focusing on the credences of rational agents at particular times? One advantage of the hypothetical-priors approach is that it makes the total evidence at work explicit, and therefore easy to reference in the principle. Recall from Section 4.3 that a hypothetical prior is a probabilistic, regular distribution containing no contingent evidence. A rational agent is associated with a particular hypothetical prior, in the sense that if you conditionalize that hypothetical prior on the agent's total evidence at any given time, you get the agent's credence distribution at that time. In the Principal Principle, we imagine that a real-life agent is considering some proposition A about the outcome of a chance event. She has some information about the chance of A, Ch_i(A) = x, and then some further evidence E. So her total evidence is Ch_i(A) = x & E, and by the definition of a hypothetical prior her credence in A equals Pr_H(A | Ch_i(A) = x & E). Lewis claims that as long as E is both admissible for A, and is compatible (which we can take to mean "logically consistent") with Ch_i(A) = x, E should make no difference to the agent's credence in A. In other words, as long as E is admissible and compatible, the agent should be just as confident in A as she would be if all she knew were Ch_i(A) = x. That is, her credence in A should be x.
Figure 5.1: Chances screen off frequencies. (Causal fork: the coin's objective chances influence both the known flip frequencies and the unknown flip outcome H.)
Return to our example about the noon coin flip, and the relationship between chance and frequency information. Suppose that at 1pm your total evidence about the flip outcome consists of E1 and E3. E3, the chance information, says that Ch(H) = 0.5. E1, the frequency information, comprises the rest of your total evidence, which will play the role of E in the Principal Principle. Because this additional evidence is both consistent with Ch(H) = 0.5 and admissible for H, the Principal Principle says your 1pm credence in H should be 0.5. Which is exactly the result we came to before.

We can gain further insight into this result by connecting it to our earlier (Section 3.2.4) discussion of causation and screening off. Figure 5.1 illustrates the causal relationships in the coin example between chances, frequencies, and unknown results. The coin's physical structure, and associated objective chances, causally influenced the frequency with which it came up heads in the previous trials. The coin's physical makeup also affects the outcome of the unknown flip. Thus previous frequency information is relevant to the unknown flip, but only by way of the chances.16 We saw in Section 3.2.4 that when this kind of causal fork structure obtains, the common cause screens its effects off from each other.17 Conditional on the chances, frequency information becomes irrelevant to flip predictions. That is,

Pr_H(H | Ch(H) = 0.5 & E) = Pr_H(H | Ch(H) = 0.5)    (5.1)
and intuitively the expression on the right should equal 0.5.

A similar analysis applies if your total evidence about the coin flip contains only Ch(H) = 0.5 and E2, the evidence about the coin factory. This time our structure is a causal chain, as depicted in Figure 5.2.
Figure 5.2: Chance in a causal chain. (The coin factory details influence the chance of heads, Ch(H), which in turn influences the unknown flip outcome H.)
The situation in the coin factory causally affects the chance profile of the coin, which in turn causally affects the unknown flip outcome. Thus the coin factory information affects opinions about H by way of the chances, and if the chances are already determined then factory information becomes irrelevant. Letting the factory information play the role of E in the Principal Principle, the chances screen off E from H and we have the relation in Equation (5.1).

Finally, information E4, your friend Amir's report, is not admissible information about H. E4 affects your opinions about H, but not by way of affecting your opinions about the chances. The Principal Principle applies only when E, the information possessed in addition to the chances, is admissible. Since E4 is inadmissible, the Principal Principle supplies no guidance about setting your credences in light of it.

There are still a few details in the principle to unpack. For instance, the chance expression Ch_i(A) is indexed to a time ti. That's because the chance that a particular proposition will obtain can change as time goes on. For instance, suppose that at 11am our foreign coin was fair, but at 11:30 I stuck a particularly large, non-aerodynamic wad of chewing gum to one of its sides. In that case, the proposition H that the coin comes up heads at noon would have a chance of 0.5 at 11am but might have a different chance after 11:30. The physical details of an experimental setup determine its chances, so as physical conditions change chances may change as well.18

Finally, the Principal Principle's formulation in terms of conditional credences allows us to apply it even when an agent doesn't have full information about the chances. Suppose your total evidence about the outcome A of some chance event is E. E influences your credences in A by way of informing you about A's chances (so E is admissible), but E does not tell you what the chances are exactly. Instead, E tells you that the chance of A (at some specific time, which I'll suppress for the duration of this example)
is either 0.7 or 0.4. E also supplies you with a favorite among these two chance hypotheses: it sets your credence that 0.7 is the true chance at 2/3, and your credence that 0.4 is the true chance at 1/3. How can we analyze this situation using the Principal Principle?

Since your total evidence is E, the definition of a hypothetical prior distribution tells us that your current credences cr should be related to your hypothetical prior Pr_H as follows:

cr(A) = Pr_H(A | E)    (5.2)

This value is not dictated directly by the Principal Principle. However, the Principal Principle does set

Pr_H(A | Ch(A) = 0.7 & E) = 0.7    (5.3)

because we stipulated that E is admissible. Similarly, the Principal Principle sets

Pr_H(A | Ch(A) = 0.4 & E) = 0.4    (5.4)

Since E narrows the possibilities down to two mutually exclusive chance hypotheses, Ch(A) = 0.7 and Ch(A) = 0.4 form a partition relative to E. Thus we can apply the Law of Total Probability (in its conditional credence form19) to those hypotheses to obtain

Pr_H(A | E) = Pr_H(A | Ch(A) = 0.7 & E) · Pr_H(Ch(A) = 0.7 | E)
            + Pr_H(A | Ch(A) = 0.4 & E) · Pr_H(Ch(A) = 0.4 | E)    (5.5)

By Equations (5.3) and (5.4), this is

Pr_H(A | E) = 0.7 · Pr_H(Ch(A) = 0.7 | E) + 0.4 · Pr_H(Ch(A) = 0.4 | E)    (5.6)

As Equation (5.2) suggested, Pr_H(· | E) is just cr(·). So this last equation becomes

cr(A) = 0.7 · cr(Ch(A) = 0.7) + 0.4 · cr(Ch(A) = 0.4)    (5.7)

Finally, we fill in the values stipulated in the problem to conclude

cr(A) = 0.7 · 2/3 + 0.4 · 1/3 = 0.6    (5.8)
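To make the weighted-average recipe concrete, here is a minimal Python sketch of the calculation in Equations (5.5) through (5.8). The function name and the dictionary of chance hypotheses are my own illustrative choices, not anything from Lewis or the text; the sketch simply assumes, as the example stipulates, that your evidence E is admissible and that it assigns credences 2/3 and 1/3 to the two chance hypotheses.

    # Minimal sketch: credence in A as a weighted average of the possible chance
    # values, each weighted by your credence that it is the true chance.
    # Assumes all of your evidence about A is admissible, so the Principal
    # Principle applies to each chance hypothesis.

    def credence_from_chances(chance_hypotheses):
        """chance_hypotheses maps each possible chance of A to your credence
        that it is the true chance; those credences should sum to 1."""
        assert abs(sum(chance_hypotheses.values()) - 1.0) < 1e-9
        return sum(chance * credence
                   for chance, credence in chance_hypotheses.items())

    # The example from the text: the chance of A is either 0.7 or 0.4,
    # with credences 2/3 and 1/3 respectively.
    print(credence_from_chances({0.7: 2/3, 0.4: 1/3}))  # about 0.6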
That’s a lot of calculation, but the overall lesson comes to this: When your total evidence is admissible and restricts you to a finite set of chance values for A, the Principal Principle sets your credence in A equal to a weighted
average of those chance values (where each chance value is weighted by your credence that it's the true chance). This is an extremely useful conclusion, provided we can discern what kinds of evidence are admissible. Lewis writes that, "Admissible propositions are the sort of information whose impact on credence about outcomes comes entirely by way of credence about the chances of those outcomes." (1980, p. 272) He then sketches out some categories of information that we should expect to be admissible, and inadmissible. For example, evidence about events causally upstream from the chances will be admissible; such events will form the first link in a causal chain like Figure 5.2. This includes information about the physical laws that give rise to chances, information which affects our credences about experimental outcomes by affecting our views about their chances. On the other hand, evidence about effects of the chance outcome is inadmissible, as we saw in the example of Amir's report. Generally, then, it's a good rule of thumb that facts concerning events temporally before the chance outcome are admissible, and inadmissible information is always about events after the outcome. (Though Lewis does remark at one point (1980, p. 274) that if backward causation is possible, seers of the future or time-travelers might yet give us inadmissible information about chance events to come.)

We'll close our discussion of the Principal Principle with a couple of caveats. First, I have been talking about coin flips, die rolls, etc. as if their outcomes have non-extreme objective chances. If you think that these outcomes are fully determined by the physical state of the world prior to such events, you might think these examples aren't really chancy at all—or if there are chances associated with their outcomes, the world's determinism makes those chances either 1 or 0. There are authors who think non-extreme chance assignments are compatible with an event's being fully deterministic. This will be especially plausible if you think a single phenomenon may admit of causal explanations at multiple levels of description. (Though the behavior of a gas sample is fully determined by the positions and velocities of its constituent particles, we might still apply a thermodynamical theory that treats the sample's behavior as chancy.) In any case, if the compatibility of determinism and non-extreme chance concerns you, you can replace my coin-flipping and die-rolling examples with genuinely indeterministic quantum events.

Second, you might think frequency data can affect rational credences without operating through opinions about chances. Suppose a new patient walks into a doctor's office, and the doctor assigns a credence that the patient has a particular disease equal to that disease's frequency in the general
population. In order for this to make sense, must the doctor assume that physical chances govern who gets the disease, or that the patient was somehow brought to her through a physically chancy process? (That is, must the frequency affect the doctor's credences by informing her opinions about chances?) This will depend on how broadly we are willing to interpret macroscopic events as having objective chances. But unless chances are literally everywhere, inferences governed by the Principal Principle form a proper subset of the legitimate instances of inductive reasoning. To move from frequencies in an observed population to predictions about the unobserved when chances are not present, we may need something like the frequency-credence principle (perhaps made more plausible by incorporating statistical tools) with which this section began. Or we may need a theory of inductive confirmation in general—something we will try to construct in Chapter 6.

For the time being, the message of the Principal Principle is clear: Where there are objective chances in the world, we should align our credences with them to the extent we can determine what they are. While there are exceptions to this rule, they can be worked out by thinking about the causal relations between our information and the chances of which we're aware.
5.2.2 Expert principles and Reflection
The Principal Principle is sometimes described as a deference principle: to the extent you can determine what the objective chances are, the principle directs you to defer to them by making your credences match. In a certain sense, you treat the chances as authorities on what your credences should be. Might other sorts of authorities demand such rational deference?

Testimonial evidence plays a large role in how we learn about the world. Suppose an expert on some subject reveals her credences to you. Instead of coming on television and talking about the "probability" of snow, the weather forecaster simply tells you she's 30% confident that it will snow tomorrow. It seems intuitive that—absent other evidence about tomorrow's weather—you should set your credence in snow to 0.30 as well. We can generalize this intuition with a principle for deference to experts modeled on the Principal Principle:

Pr_H(A | cr_E(A) = x) = x    (5.9)

Here Pr_H is a rational agent's hypothetical prior distribution, representing her ultimate evidential standards for assigning attitudes on the basis of total evidence. A is a proposition within some particular subject matter, and cr_E(A) = x is the proposition that an expert on that subject matter
assigns credence x to A. As we've discussed before (Section 4.3), an agent's credences at a given time equal her hypothetical prior conditionalized on her total evidence at that time. So Equation (5.9) has consequences similar to the Principal Principle's: When a rational agent is certain that an expert assigns credence x to A, and that fact constitutes her total evidence relevant to A, satisfying Equation (5.9) will leave her with an unconditional credence of cr(A) = x. On the other hand, an agent who is uncertain of the expert's opinion can use Equation (5.9) to calculate a weighted average of all the values she thinks the expert might assign.20

Equation (5.9) helps us figure out how to defer to someone we've identified as an expert. But it doesn't say anything about how to make that identification! Ned Hall helpfully distinguishes two kinds of experts we might look for:
Let us call the first kind of expert a database-expert: she earns her epistemic status simply because she possesses more information. Let us call the second kind an analyst-expert: she earns her epistemic status because she is particularly good at evaluating the relevance of one proposition to another. (Hall 2004, p. 100)

A database expert possesses strictly more evidence than me (or at least, more evidence relevant to the matter at hand). While she may not reveal the contents of that evidence, I can still take advantage of it by assigning the credences she assigns on its basis. On the other hand, I defer to an analyst expert not because of her superior evidence but because she is particularly skilled at forming opinions from the evidence we share. Clearly these categories can overlap; relative to me, a weather forecaster is probably both an analyst expert and a database expert with respect to the weather.

One particular database expert has garnered a great deal of attention in the deference literature: an agent's future self. Because Conditionalization retains certainties (Section 4.1.1), at any given time a conditionalizing agent will possess all the evidence possessed by each of her past selves—and typically quite a bit more. So an agent who is certain she will update by conditionalizing should treat her future self as a database expert.21 On the supposition that her future self will assign credence x to a proposition A, she should now assign credence x to A as well. This is van Fraassen's (1984) Reflection Principle:

For any proposition A in L, real number x, and times ti and tj with j > i, rationality requires

cr_i(A | cr_j(A) = x) = x
Although the Reflection Principle mentions both the agent's ti and tj credences, strictly speaking it is a synchronic principle, relating various credences the agent assigns at ti. If we apply the Ratio Formula and then cross-multiply, Reflection gives us:

cr_i[A & cr_j(A) = x] = x · cr_i[cr_j(A) = x]    (5.10)
The two credences related by this equation are both assigned at ti; they just happen to be credences in some propositions about tj. Despite this synchronic nature, Reflection bears an intimate connection to Conditionalization. If an agent is certain she will update by conditionalizing between ti and tj—and meets a few other side conditions—Reflection follows. For instance, the Reflection Principle can be proven from the following set of conditions:

1. The agent is certain at ti that cr_j will result from conditionalizing cr_i on the total evidence she learns between ti and tj (call it E).

2. The agent is certain at ti that E (whatever it may be) is true.

3. cr_i(cr_j(A) = x) > 0

4. At ti the agent can identify a set of propositions S in L such that:

   (a) The elements of S form a partition relative to the agent's certainties at ti.

   (b) At ti the agent is certain that E is one of the propositions in S.

   (c) For each element in S, the agent is certain at ti what cr_i-value she assigns to A conditional on that element.

References to a proof can be found in the Further Readings. Here I'll simply provide an example that illustrates the connection between Conditionalization and Reflection. Suppose that I've rolled a die you're certain is fair, but as of t1 have told you nothing about the outcome. However, at t1 you're certain that between t1 and t2 I'll reveal to you whether the die came up odd or even. The Reflection Principle suggests you should assign
cr_1(6 | cr_2(6) = 1/3) = 1/3    (5.11)
Assuming the enumerated conditions hold in this example, we can reason to Equation (5.11) as follows: In this case the partition S contains the
proposition that the die came up odd and the proposition that it came up even. You are certain at t1 that one of these propositions will provide the E you learn before t2. You're also certain that your cr_2(6) value will result from conditionalizing your t1 credences on E. So you're certain at t1 that

cr_2(6) = cr_1(6 | E)    (5.12)
Equation (5.11) involves your t1 credence in 6 conditional on the supposition that cr_2(6) = 1/3. To determine this value, let's see what conditional reasoning you could do at t1, not yet certain what credences you will actually assign at t2, but temporarily supposing that cr_2(6) = 1/3. We just said that at t1 you're certain of Equation (5.12), so given the supposition you can conclude that cr_1(6 | E) = 1/3. Then you can examine your current t1 credences conditional on both odd and even, and find that cr_1(6 | E) will equal 1/3 only if E is the proposition that the die came up even. (Conditional on the die's coming up odd, your credence in a 6 would be 0.) Thus you can conclude that E is the proposition that the die came up even. You're also certain at t1 that E (whatever its content) is true, so concluding that E says the die came up even allows you to conclude that the die did indeed come up even. And on the condition that the die came up even, your t1 credence in a 6 is 1/3.

All of the reasoning in the previous paragraph was conditional, starting with the supposition that cr_2(6) = 1/3. We found that conditional on this supposition, your rational credence in 6 would be 1/3. And that's exactly what the Reflection Principle gave us in Equation (5.11).22

Information about your future credences tells you something about what evidence you'll receive between now and then. And information about what evidence you'll receive in the future should be incorporated into your credences in the present. But how often do we really get information about our future opinions? Approached the way I've just done, the Reflection Principle seems to have little real-world applicability.

But van Fraassen originally proposed Reflection in a very different spirit. He saw the principle as stemming from basic commitments we undertake when we form opinions. van Fraassen drew an analogy to making promises. Suppose I make a promise at a particular time, but at the same time admit to being unsure whether I will actually carry it out. van Fraassen writes that "To do so would mean that I am now less than fully committed (a) to giving due regard to the felicity conditions for this act, or (b) to standing by the commitments I shall overtly enter." (1984, p. 255) To fully stand behind a promise requires full
confidence that you will carry it out. And what goes for current promises goes for future promises as well: if you know you'll make a promise later on, failing to be fully confident now that you'll enact the future promise is a betrayal of solidarity with your future promising self. Now apply this lesson to the act of making judgments: assigning a different credence now to a proposition than the credence you know you'll assign in the future is a failure to stand by the commitments implicit in that future opinion. As van Fraassen puts it in a later publication, "Integrity requires me to express my commitment to proceed in what I now classify as a rational manner, to stand behind the ways in which I shall revise my values and opinions." (1995, pp. 25–26) This is his motivation for endorsing the Reflection Principle.23

For van Fraassen, Reflection brings out a substantive commitment inherent in judgment, which underlies various other rational requirements. For instance, since van Fraassen's argument for Reflection does not rely on Conditionalization, van Fraassen at one point (1999) uses Reflection to argue for Conditionalization!24 Of course, one might not agree with van Fraassen that assigning a credence necessarily involves such strong commitments. And even if Reflection can be supported as van Fraassen suggests, moving from that principle to Conditionalization is going to require substantive further premises. As we've seen, Reflection itself is a synchronic principle, relating an agent's attitudes at one time to other attitudes she assigns at the same time. At best, Reflection will support the conclusion that an agent with certain attitudes at a given time is required to predict that she will update by Conditionalization. To actually establish Conditionalization as a diachronic norm, we would need a further principle to the effect that rational agents update in the manner they antecedently predict.25
5.3 The Principle of Indifference
The previous section discussed various deference principles (the Principal Principle, expert principles, the Reflection Principle) that place additional rational constraints on credence beyond the probability axioms, Ratio Formula, and Conditionalization. Yet each of those deference principles works with a particular kind of evidence—evidence about the chances, about an expert's credences, or about future attitudes. When an agent lacks these sorts of evidence about a proposition she's considering, the deference principles will do little to constrain her credences. If an Objective Bayesian (in the normative sense) wants to narrow what's rationally permissible to a single
hypothetical prior, he is going to need a stronger principle than these three. The Principle of Indifference is often marketed to do the trick. This is John Maynard Keynes's name for what used to be known as the "principle of insufficient reason":

The Principle of Indifference asserts that if there is no known reason for predicating of our subject one rather than another of several alternatives, then relatively to such knowledge the assertions of each of these alternatives have an equal probability. (1921, p. 42, emphasis in original)

Applied to degrees of belief, the Principle of Indifference holds that if an agent has no evidence favoring any particular proposition in a partition over any other, she should spread her credence equally over the members of the partition. If I tell you I have painted my house one of the seven colors of the rainbow but tell you nothing more about my selection, the Principle of Indifference requires 1/7 credence that my house is now violet.

The Principle of Indifference looks like it could settle all open questions about rational credence. An agent could assign specific credences as dictated
by portions of her evidence (say, evidence that engages one of the deference principles), then use the Principle of Indifference to settle all remaining questions about her distribution. For example, suppose I tell you that I flipped a fair coin to decide on a house color—heads meant gray, while tails meant a color of the rainbow. You could follow the Principal Principle and assign credence 1/2 to my house's being gray, then follow the Principle of Indifference to distribute the remaining 1/2 credence equally among each of the rainbow colors (so each would receive credence 1/14). This plan seems to dictate a unique rational credence for every proposition in every evidential situation, thereby specifying a unique hypothetical prior distribution.

Unfortunately, the Principle of Indifference has a serious flaw, which was pointed out by Keynes (among others).26 Suppose I tell you only that I painted my house some color—I don't tell you what palette I chose from—and you wonder whether it was violet. You might partition the possibilities into the proposition that I painted the house violet and the proposition that I didn't. In that case, lacking further information the Principle of Indifference will require you to assign credence 1/2 that the house is violet. But if you use the seven colors of the rainbow as your partition, you will assign 1/7 credence that my house is now violet. And if you use the colors in a box of crayons. . . . The trouble is that faced with the same evidential situation and same proposition to be evaluated, the Principle of Indifference will recommend different credences depending on which partition you consider.
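To see the partition-dependence numerically, here is a minimal Python sketch; the partitions and the helper function are illustrative choices of mine, not anything from Keynes or the text. Applying indifference to two different partitions of the same possibility space yields two different credences for the very same proposition, that the house is violet.

    # Minimal sketch: the Principle of Indifference applied to two different
    # partitions of the same question ("Is the house violet?").

    def indifference_credences(partition):
        """Spread credence equally over the cells of a partition."""
        return {cell: 1 / len(partition) for cell in partition}

    coarse = indifference_credences(["violet", "not violet"])
    fine = indifference_credences(
        ["red", "orange", "yellow", "green", "blue", "indigo", "violet"])

    print(coarse["violet"])  # 0.5
    print(fine["violet"])    # 0.142857... (1/7)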
Might one partition be superior to all the others, perhaps on grounds of the naturalness with which it divides the space of possibilities? (The selection of colors in a crayon box is pretty arbitrary!) Well, consider this example: I just drove 80 miles to visit you. I tell you it took between 2 and 4 hours to make the trip, and ask how confident you are that it took less than 3. 3 hours seems to neatly divide the possibilities in half, so by the Principle of Indifference you assign credence 1/2. Then I tell you I maintained a constant speed throughout the drive, and that speed was between 20 and 40 miles per hour. You consider the proposition that I drove faster than 30mph, and since that neatly divides the possible speeds the Indifference Principle again recommends a credence of 1/2. But these two credence assignments conflict. I drove over 30mph just in case it took me less than two hours and forty minutes to make the trip. So are you 1/2 confident that it took me less than 3 hours, or that it took me less than 2 hours 40 minutes? If you assign any positive credence that my travel time fell between those durations, the two answers are inconsistent. But thinking about my trip in velocity terms is just as natural as thinking about how long it took.27 This example is different from the painting example, in that time and speed require us to consider continuous ranges of possibilities.
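A minimal Python sketch can make the conflict explicit. Modeling "indifference" over each continuous range as a uniform distribution is my own gloss, not something the text specifies: for an 80-mile trip, driving faster than 30mph is the very same event as finishing in under 8/3 hours, yet indifference over time and indifference over speed assign that event different credences.

    # Minimal sketch of the travel-time vs. speed conflict for an 80-mile trip.
    # Indifference over time: travel time uniform on [2, 4] hours.
    # Indifference over speed: speed uniform on [20, 40] mph.

    DISTANCE = 80.0  # miles

    def prob_uniform_below(lo, hi, threshold):
        """P(X < threshold) for X uniform on [lo, hi]."""
        return min(max((threshold - lo) / (hi - lo), 0.0), 1.0)

    # "Faster than 30 mph" is the same proposition as "took less than 80/30 hours".
    time_cutoff = DISTANCE / 30.0  # 8/3 hours, i.e., 2 hours 40 minutes

    p_by_speed = 1 - prob_uniform_below(20, 40, 30)    # indifference over speed
    p_by_time = prob_uniform_below(2, 4, time_cutoff)  # indifference over time

    print(p_by_speed)  # 0.5
    print(p_by_time)   # 0.333... (same proposition, different credence)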
Infinite possibility spaces introduce a number of complexities we will discuss in the next section, but hopefully the intuitive problem here is clear. Joseph Bertrand (1888/1972) produced a number of infinite-possibility paradoxes for principles like Indifference. His most famous puzzle (now usually called Bertrand's Paradox) asks how probable it is that a chord of a circle will be longer than the side of an inscribed equilateral triangle. Indifference reasoning yields conflicting answers depending on how one specifies the chord in question—by specifying its endpoints, by specifying its orientation and length, by specifying its midpoint, etc.

Since Keynes's discussion, a number of authors have modified his Indifference Principle. Chapter 6 will look in detail at Carnap's proposal. Another well-known suggestion is E.T. Jaynes' (1957a,b) Maximum Entropy Principle. Given a partition of the space of possibilities, and a set of constraints on allowable credence distributions over that partition, the Maximum Entropy Principle selects the allowable distribution with the highest entropy. If the partition is finite, consisting of the propositions Q1, Q2, ..., Qn, the entropy of a distribution is calculated as

−Σ_{i=1}^{n} cr(Qi) · log cr(Qi)    (5.13)
The technical details of Jaynes’ proposal are beyond the level of this book.
Figure 5.3: Possible urn distributions. (Credence plotted against the number of black balls, from 0 to 100: a flat dashed distribution and a peaked solid distribution.)
The intuitive idea, though, is that by maximizing entropy in a distribution we minimize information. To illustrate, suppose you know an urn contains 100 balls, each of which is either black or white. Initially, you assign an equal credence to each available hypothesis about how many black balls are in the urn. This "flat" distribution over the urn hypotheses is reflected by the dashed line in Figure 5.3. Then I tell you that the balls were created by a process that tends to produce roughly as many white balls as black. This moves you to the more "peaked" distribution of Figure 5.3's solid curve. The peaked distribution reflects the fact that at the later time you have more information about the contents of the urn. There are various mathematical ways to measure the informational content of a distribution, and it turns out that a distribution's information content goes up as its entropy goes down. So in Figure 5.3, the flat (dashed) distribution has a higher entropy than the peaked (solid) distribution. Maximizing entropy is thus a strategy for selecting the lowest-information distribution consistent with what we already know.

Jaynes' principle says that within the bounds imposed by your evidence, you should select the "flattest" credence distribution available. In a sense, this is a directive not to make any presumptions beyond what you know. As van Fraassen puts it, "one should not jump to unwarranted conclusions, or add capricious assumptions, when accommodating one's belief state to the deliverances of experience." (1981, p. 376) If all your evidence about my urn is that it
contains 100 black or white balls, it would be strange for you to peak your credences around any particular number of black balls. What in your evidence would justify such a maneuver? The flat distribution seems the most rational option available.28

The Maximum Entropy approach has a number of advantages. First, it can easily be extended from finite partitions to infinite partitions by replacing the summation in Equation (5.13) with an integral (and making a few further adjustments). Second, for cases in which an agent's evidence simply delineates a space of doxastic possibilities (without favoring some of those possibilities over others), the Principle of Maximum Entropy yields the same results as the Principle of Indifference. But Maximum Entropy also handles cases involving more complicated sorts of information. Besides restricting the set of possibilities, an agent's evidence might require her credence in one possibility to be twice that in another, or might require a particular conditional credence value for some ordered pair of propositions. No matter the constraints, Maximum Entropy chooses the "flattest" (most entropic) distribution consistent with those constraints. Third, probability distributions selected by the Maximum Entropy Principle have been highly useful in various scientific applications, ranging from statistical mechanics to CT scans to natural language processing.

Yet the Maximum Entropy Principle also has flaws. It suffers from a version of the Indifference Principle's partitioning problem. Maximum Entropy requires us to first select a partition, then accept the most entropic distribution over that partition. But the probability value assigned to a particular proposition by this process often depends on what other propositions appear in the partition. Also, in some evidential situations satisfying the Maximum Entropy Principle both before and after an update requires agents to violate Conditionalization. You can learn more about these problems by studying this chapter's Further Reading.
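To connect the urn illustration back to Equation (5.13), here is a minimal Python sketch comparing the entropy of a flat distribution with that of a peaked one over the 101 urn hypotheses. The peaked distribution used here (a binomial) is just an illustrative stand-in for the solid curve in Figure 5.3, not something specified in the text.

    import math

    def entropy(distribution):
        """Shannon entropy, as in Equation (5.13): -sum of cr(Qi) * log cr(Qi)."""
        return -sum(p * math.log(p) for p in distribution if p > 0)

    # 101 hypotheses about the number of black balls (0 through 100).
    flat = [1 / 101] * 101

    # An illustrative peaked distribution centered near 50 black balls.
    peaked = [math.comb(100, k) / 2**100 for k in range(101)]

    print(entropy(flat))    # about 4.6: higher entropy, less information
    print(entropy(peaked))  # about 3.0: lower entropy, more information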
5.4 Credences for Infinite Possibilities
Suppose I tell you a positive integer was just selected by some process, and tell you nothing more about that process. You need to distribute your credence across all the possible integers that might have been selected. Let's further suppose that you want to do so in such a way that each positive integer receives the same credence. In the last section we asked whether, given your scant evidence in this case about the selection process, such an assignment is obligatory—whether you're rationally required to assign each
positive integer an equal credence. In this section I want to set aside the question of whether an equal distribution is required, and ask whether it's even possible.

We're going to have a small, technical problem here with the propositional language over which your credence distribution is assigned. In Chapter 2 we set up propositional languages with a finite number of atomic propositions, while a distribution over every positive integer requires infinitely many atomic propositions. Yet there are standard logical methods for dealing with languages containing infinitely many atomic propositions, and even for representing them using a finite number of symbols. For example, we could use "1" to represent the atomic proposition that the number 1 was selected, "2" to represent 2's being selected, "12" to represent 12's being selected, etc. This will allow us to represent infinitely many atomic propositions with only the standard 10 Arabic digits. So the language isn't the real problem; the real problem is what credence value you could possibly assign to each and every one of those positive integers.

To start seeing the problem, imagine you pick some positive real number r and assign it as your unconditional credence in each positive integer's being picked. For any positive real r you pick, there exists an integer n such that r > 1/n. Select such an n, and consider the proposition that the positive integer selected was less than or equal to n. By Finite Additivity (Extended),

cr(1 ∨ 2 ∨ . . . ∨ n) = cr(1) + cr(2) + . . . + cr(n)    (5.14)
Each of the credences on the righthand side equals r, so your credence in the disjunction is r · n. But we selected n such that r > 1/n, so r · n > 1. And now you've violated the probability axioms.

This argument rules out assigning the same positive real credence to each and every positive integer. What other options are there? Historically the most popular proposal has been to assign each positive integer a credence of 0. Yet this proposal creates its own problems.

The first problem with assigning each integer zero credence is that we must reconceive what an unconditional credence of 0 means. So far in this book we have equated assigning credence 0 to a proposition with ruling that proposition out as a live possibility. In this case, though, we've proposed assigning credence 0 to each positive integer while still treating each as a live possibility. So while we will still assign credence 0 to propositions that have been ruled out, there will now be other types of propositions that receive credence 0 as well. Similarly, we may assign credence 1 to propositions of which we are not certain.
Among other things, this reconception of credence 0 will undermine arguments for the Regularity Principle. As stated (Section 4.2), Regularity forbids assigning credence 0 to any logically contingent proposition. The argument there was that one should never entirely rule out a proposition that's logically possible, so one should never assign such a proposition 0 credence. Now we've opened up the possibility of assigning credence 0 to a proposition without having ruled it out. So while we can endorse the idea that no contingent proposition should be ruled out, Regularity no longer follows. Moreover, the current proposal provides infinitely many explicit counterexamples to Regularity: we have proposed assigning credence 0 to the contingent proposition that the positive integer selected was 1, to the proposition that the integer was 2, that it was 3, etc.

Once we've decided to think about credence 0 in this new way, we encounter a second problem: the Ratio Formula. In Section 3.1.1 I framed the Ratio Formula as follows:

Ratio Formula: For any P and Q in L, if cr(Q) > 0 then

    cr(P | Q) = cr(P & Q) / cr(Q)
This constraint relates an agent's conditional credence cr(P | Q) to her unconditional credences only when cr(Q) > 0. As stated, it remains silent on how an agent's conditional and unconditional credences relate when cr(Q) = 0. Yet we surely want to have some rational constraints on that relation for cases in which an agent assigns credence 0 to a contingent proposition that she hasn't ruled out.29 For example, in the positive integer case consider your conditional credence cr(2 | 2). Surely this conditional credence should equal 1. Yet because the current proposal sets cr(2) = 0, the Ratio Formula cannot tell us anything about cr(2 | 2). And since we've derived all of our rational constraints on conditional credence from the Ratio Formula, the Bayesian system we've set up isn't going to deliver a requirement that cr(2 | 2) = 1.30

There are various ways to respond to this problem. One interesting suggestion is to reverse the order in which we proceeded with conditional and unconditional credences: We began by laying down fairly substantive constraints (Kolmogorov's probability axioms) on unconditional credences, then tied conditional credences to those via the Ratio Formula. On the reverse approach, substantive constraints are first placed on conditional credences,
then some further rule relates unconditional to conditional. The simplest such rule is that for any proposition P, cr(P) = cr(P | T). Some advocates of this technique describe it as making conditional credence "basic", but we should be careful not to read too much into debates about what's basic. The way I've approached conditional and unconditional credences in this book, neither is more fundamental than the other in any sense significant to metaphysics or the philosophy of mind. Each is an independently existing type of doxastic attitude, and any rules we offer relating them are strictly normative constraints. The only sense in which our unconditionals-first approach has made unconditional credences prior to conditionals is in its order of normative explanation. The Ratio Formula helped us transform constraints on unconditional credences into constraints on conditional credences (as in Section 3.1.2). On the conditionals-first approach, the rule that cr(P) = cr(P | T) transforms constraints on conditionals into constraints on unconditionals. Examples of the conditionals-first technique include (Hosiasson-Lindenbaum 1940), (Popper 1955), (Rényi 1970), and (Roeper and Leblanc 1999).31 Like many of these approaches, Popper's axiom system entails that cr(Q | Q) = 1 for any Q that the agent deems possible, regardless of its unconditional credence value. This ensures that cr(2 | 2) = 1.

The final problem I want to address with assigning each positive integer 0 unconditional credence of being selected has to do with your unconditional credence that any integer was selected at all. The proposition that some integer was selected is equivalent to the disjunction of the proposition that 1 was selected, the proposition that 2 was selected, the proposition that 3 was selected, etc. Finite Additivity directly governs unconditional credences in disjunctions of two (mutually exclusive) disjuncts; iterating that rule gives us Finite Additivity (Extended), which applies to disjunctions of finitely many disjuncts. But this case concerns an infinite disjunction, and none of the constraints we've seen so far relates the unconditional credence of an infinite disjunction to the credences of its disjuncts. It might seem natural to supplement our credal constraints with the following:
Countable Additivity: For any countable partition Q1, Q2, Q3, ... in L,

    cr(Q1 ∨ Q2 ∨ Q3 ∨ ...) = cr(Q1) + cr(Q2) + cr(Q3) + ...

Notice that Countable Additivity does not apply to every partition of infinite size; it applies only to partitions of countably many members. The set of positive integers is countable, while the set of real numbers is not. (If you
are unfamiliar with infinite sets of differing sizes, I would suggest studying the brief explanation referenced in this chapter's Further Reading.)

Countable Additivity naturally extends the idea behind Finite Additivity to sets of (countably) infinite size. Many authors have found it attractive. Yet in our example it rules out assigning credence 0 to each proposition stating that a particular positive integer was selected. Taken together, the proposition that 1 was selected, the proposition that 2 was selected, the proposition that 3 was selected, etc. form a countable partition (playing the role of Q1, Q2, Q3, etc. in Countable Additivity). Countable Additivity therefore requires your credence in the disjunction of these propositions to equal the sum of your credences in the individual disjuncts. Yet the latter credences are each 0, while your credence in their disjunction (namely, the proposition that some positive integer was selected) should be 1. So perhaps Countable Additivity wasn't such a good idea after all.

The trouble is, without Countable Additivity we lose a very desirable property:

Conglomerability: For each proposition P and partition Q1, Q2, Q3, ... in L, cr(P) is no greater than the largest cr(P | Qi) and no less than the least cr(P | Qi).

In other words, if Conglomerability holds then finding the largest cr(P | Qi) and the smallest cr(P | Qi) creates a set of bounds into which cr(P) must fall. In defining Conglomerability I didn't say how large the Q-partitions in question are allowed to be. We might think of breaking up the general Conglomerability principle into a number of sub-cases: Finite Conglomerability applies to finite partitions, Countable Conglomerability applies to countable partitions, Continuous Conglomerability applies to partitions of continuum-many elements, etc. Finite Conglomerability is guaranteed by the standard probability axioms. You'll prove this in Exercise 5.6, but the basic idea is that by the Law of Total Probability cr(P) must be a weighted average of the various cr(P | Qi), so it can't be greater than the largest of them or less than the smallest. With the standard axioms in place, Countable Conglomerability then stands or falls with our decision about Countable Additivity; without Countable Additivity, Countable Conglomerability is false.32

We've already seen that the strategy of assigning 0 credence to each positive integer's being selected violates Countable Additivity; let's see how it violates (Countable) Conglomerability as well.33 Begin with the following definition: For any positive integer n that's not a multiple of 10, define the n-set as the set of all positive integers that start with n, followed by some
number (perhaps 0) of zeroes. So the 1-set is {1, 10, 100, 1000, ...}; the 11-set is {11, 110, 1100, 11000, ...}; the 36-set is {36, 360, 3600, 36000, ...}; etc. Now take the proposition that the integer selected was a member of the 1-set, and the proposition that the integer selected was a member of the 2-set, and the proposition that the integer selected was a member of the 3-set, etc. (Though don't include any n's that are multiples of 10.) The set of these propositions forms a partition. (If you think about it carefully, you'll see that any positive integer that might have been selected belongs to exactly one of these sets.) The distribution strategy we're considering is going to want to assign
cr(the selected integer is not a multiple of 10 | the selected integer is a member of the 1-set) = 0     (5.15)
Why is that? Well, the only number in the 1-set that is not a multiple of 10 is the number 1. The 1-set contains infinitely many positive integers; on the supposition that one of those integers was selected you want to assign equal credence to each one's being selected; so you assign 0 credence to each one's being selected (including the number 1) conditional on that supposition. This gives us Equation (5.15). The argument then generalizes; for any n-set you'll have
cr(the selected integer is not a multiple of 10 | the selected integer is a member of that n-set) = 0     (5.16)
Yet unconditionally it seems rational to have

cr(the selected integer is not a multiple of 10) = 9/10     (5.17)
Conditional on any particular member of our n-set partition, your credence that the selected integer isn't a multiple of 10 is 0. Yet unconditionally, you're highly confident that the integer selected is not a multiple of ten. This is a flagrant violation of (Countable) Conglomerability—your credences in a particular proposition conditional on each member of a (countable) partition are all the same, yet your unconditional credence in that proposition has a very different value!

Why is violating Conglomerability a problem? Well, imagine I'm about to give you some evidence on which you're going to conditionalize. In particular, I'm about to tell you to which of the n-sets the selected integer belongs.
Whichever piece of evidence you're about to get, your credence that the integer isn't a multiple of ten conditional on that evidence is 0. So you can be certain right now that immediately after receiving the evidence, your credence that the integer isn't a multiple of ten will be 0. Yet despite being certain that your better-informed future self will assign a particular proposition a credence of 0, you continue to assign that proposition a credence of 9/10 right now. This is a flagrant violation of the Reflection Principle, as well as generally good principles for attitude management. Our opinions are usually compromises among the pieces of evidence we think we might receive; we expect that some of them would change our views in one direction while others would press in the other. If we know that no matter what evidence comes in we're going to be pulled away from our current opinion in the same direction, it seems irrationally stubborn to maintain our current opinion and not move in that direction right now. Conglomerability embodies these principles of good evidential hygiene; without Conglomerability our interactions with evidence begin to look absurd.
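Because the argument just given concerns a genuinely infinite domain, no finite computation reproduces it exactly. Still, a rough finite stand-in (my own sketch, with a uniform distribution over {1, ..., 10^k} playing the role of the equal-credence assignment) may help convey why the conditional credences collapse toward 0 while the unconditional credence stays put at 9/10.

```python
def one_set_members(N):
    """Members of the 1-set (1, 10, 100, ...) that are <= N."""
    members, m = [], 1
    while m <= N:
        members.append(m)
        m *= 10
    return members

for k in [2, 4, 8, 16]:
    N = 10 ** k                                    # uniform credence over {1, ..., N}
    unconditional = (N - N // 10) / N              # cr(not a multiple of 10) = 0.9
    members = one_set_members(N)
    conditional = sum(1 for m in members if m % 10 != 0) / len(members)
    print(f"N = 10^{k}: unconditional = {unconditional}, conditional on the 1-set = {conditional:.4f}")
```

In the genuinely infinite case the conditional credence is 0 for every n-set, which is exactly the Conglomerability violation described above.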
Where does this leave us? We wanted to find a way to assign an equal credence to each positive integer's being selected. We quickly concluded that that equal credence could not be a positive real number. So we considered assigning credence 0 to each integer's being selected. Doing so violates Countable Additivity (a natural extension of our finite principles for calculating credences in disjunctions) and Conglomerability, which looks desirable for a number of reasons. Are there any other options?

I will briefly mention two further possibilities. The first possibility is to assign each positive integer an infinitesimal credence of having been selected. To work with infinitesimals, we extend the standard real-number system to include numbers that are greater than 0 but smaller than all the positive reals. If we assign each integer an infinitesimal credence of having been picked, we avoid the problems with assigning a positive real and also the problems of assigning 0. (For instance, if you pile enough infinitesimals together they can sum to 1.) Yet infinitesimal numbers have a great deal of mathematical structure, and it's not clear that the extra structure plausibly represents any feature of agents' attitudes.34 Moreover, the baroque mathematics of infinitesimals introduces troubles of its own (see Further Reading). So perhaps only one viable option remains: Perhaps if you learn a positive integer was just selected, it's impossible to assign equal credence to each of the possibilities consistent with what you know.35
5.5 Jeffrey Conditionalization
Section 4.1.1 showed that conditionalizing on new evidence creates and retains certainties; evidence gained between two times becomes certain at the later time and remains so ever after. Contraposing, if an agent updates by Conditionalization and gains no certainties between two times, it must be because she gained no evidence between those times. In that section we also saw that if an agent gains no evidence between two times, Conditionalization keeps her credences fixed. Putting all this together, we see that under Conditionalization an agent's credences change just in case she gains new certainties.

As we noted in Section 4.2, mid-twentieth-century epistemologists like C.I. Lewis defended this approach by citing sense data as the foundational evidential certainties. Many contemporary epistemologists are uncomfortable with this kind of foundationalism (and with appeals to sense data in general). Richard C. Jeffrey, however, had a slightly different concern, which he expressed with the following example and analysis:

The agent inspects a piece of cloth by candlelight, and gets the impression that it is green, although he concedes that it might be blue or even (but very improbably) violet. If G, B, and V are the propositions that the cloth is green, blue, and violet, respectively, then the outcome of the observation might be that, whereas originally his degrees of belief in G, B, and V were .30, .30, and .40, his degrees of belief in those same propositions after the observation are .70, .25, and .05. If there were a proposition E in his preference ranking which described the precise quality of his visual experience in looking at the cloth, one would say that what the agent learned from the observation was that E is true.... But there need be no such proposition E in his preference ranking; nor need any such proposition be expressible in the English language. Thus, the description "The cloth looked green or possibly blue or conceivably violet," would be too vague to convey the precise quality of the experience. Certainly, it would be too vague to support such precise conditional probability ascriptions as those noted above. It seems that the best we can do is to describe, not the quality of the visual experience itself, but rather its effects on the observer, by saying, "After the observation, the agent's degrees of belief in G, B, and V were .70, .25, and .05."
(1965, p. 154) Jeffrey worried that even if we grant the existence of a sense datum for each potential learning experience, the quality of that sense datum might not be representable in a proposition to which the agent could assign certainty, or at least might not be representable in a precise-enough proposition to differentiate that sense datum from other nearby data with different effects on the agent’s credences. At the time Jeffrey was writing, the standard Bayesian updating norm (updating by Conditionalization) relied on the availability of such propositions. So Jeffrey proposed a new updating rule, capable of handling examples like the cloth one above. While he called it probability kinematics, it is now universally known as
Jeffrey Conditionalization: Given any ti and tj with i < j, any A in L, and a finite partition B1, B2, ..., Bn in L whose elements each have nonzero cri,

    crj(A) = cri(A | B1)·crj(B1) + cri(A | B2)·crj(B2) + ... + cri(A | Bn)·crj(Bn)
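Stated as a calculation, the rule just given is a weighted sum. Here is a minimal sketch (my own illustration; the dictionaries of credence values are hypothetical inputs, not from the text) of computing crj(A) from the earlier conditional credences and the later credences over the partition.

```python
def jeffrey_update(cond_cr_i, cr_j_partition):
    """cond_cr_i: the earlier conditional credences cr_i(A | B_m), keyed by B_m.
    cr_j_partition: the later unconditional credences cr_j(B_m), summing to 1."""
    return sum(cond_cr_i[b] * cr_j_partition[b] for b in cond_cr_i)

# Hypothetical numbers over a three-cell partition:
cond = {"B1": 0.9, "B2": 0.4, "B3": 0.1}
later = {"B1": 0.2, "B2": 0.5, "B3": 0.3}
print(jeffrey_update(cond, later))   # 0.9*0.2 + 0.4*0.5 + 0.1*0.3 = 0.41
```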
Let's apply Jeffrey Conditionalization to the cloth example. Suppose I'm fishing around in a stack of my family's clean laundry hoping to pull out any shirt that belongs to me, but the lighting is dim because I don't want to turn on the overheads and awaken my wife. The color of a shirt in the stack would be a strong clue as to whether it was mine, as reflected by my conditional credences:
cr1(mine | G) = 0.80
cr1(mine | B) = 0.50
cr1(mine | V) = 0     (5.18)
(For simplicity's sake we imagine green, blue, and violet are the only color shirts I might fish out of the stack.) At t1 I pull out a shirt. Between t1 and t2 I take a glimpse of the shirt. According to Jeffrey's story, my
unconditional credence distributions across the G/B/V partition are:

cr1(G) = 0.30    cr1(B) = 0.30    cr1(V) = 0.40
cr2(G) = 0.70    cr2(B) = 0.25    cr2(V) = 0.05     (5.19)
Applying Jeffrey Conditionalization, I find my credence in the target proposition at the later time by combining my post-update unconditional credences across the partition with my pre-update credences in the target
proposition conditional on elements of the partition. This yields:
cr2(mine) = cr1(mine | G)·cr2(G) + cr1(mine | B)·cr2(B)
              + cr1(mine | V)·cr2(V)
          = 0.80·0.70 + 0.50·0.25 + 0·0.05 = 0.685     (5.20)

At t2 I'm fairly confident that the shirt I've selected is mine. How confident was I at t1, before I caught my low-light glimpse? A quick calculation with the Law of Total Probability reveals that cr1(mine) = 0.39. But it's more interesting to see what happens when we apply the Law of Total Probability to my credences at t2:
cr2(mine) = cr2(mine | G)·cr2(G) + cr2(mine | B)·cr2(B)
              + cr2(mine | V)·cr2(V)     (5.21)
Take a moment to compare Equation (5.21) with the first two lines of Equation (5.20). Equation (5.21) expresses a feature that my t2 credence distribution must have if it is to satisfy the probability axioms and Ratio Formula. Equation (5.20) tells me how to set my t2 credences by Jeffrey Conditionalization. The only way to make these two equations match—the only way to square the Jeffrey update with the probability calculus—is if cr1(mine | G) = cr2(mine | G), cr1(mine | B) = cr2(mine | B), etc.

Why should these conditional credences stay constant over time? Well, at any given time my credence that the shirt I've selected is mine is a function of two kinds of credences: first, my unconditional credence that the shirt is a particular color; and second, my conditional credence that the shirt is mine given that it's a particular color. When I catch a glimpse of the shirt between t1 and t2, only the first kind of credence changes. I change my opinion about what color the shirt is, but I don't change my confidence that it's my shirt given that (say) it's green. Throughout the example I have a fixed opinion about what percentage of the green shirts in the house are mine; I simply gain information about whether this shirt is green. So while my unconditional color credences change, my credences conditional on the colors remain.

This discussion reveals a general feature of Jeffrey Conditionalization. You'll prove in Exercise 5.8 that an agent's credences between two times update by Jeffrey Conditionalization just in case the following condition obtains:
Rigidity: For any A in L and any Bm in B1, B2, ..., Bn,

    crj(A | Bm) = cri(A | Bm)

Figure 5.4: Jeffrey Conditionalization across a partition

partition element    cr1      cr2
G & mine             0.24     0.56
G & ~mine            0.06     0.14
B & mine             0.15     0.125
B & ~mine            0.15     0.125
V & mine             0        0
V & ~mine            0.40     0.05
So Jeffrey Conditionalization using a particular partition B1, B2, ..., Bn is appropriate only when the agent's credences conditional on the Bm remain constant across the two times.36 Jeffrey thought this was reasonable for updates that "originate" in the Bm partition. In the cloth example, all my credal changes between t1 and t2 are driven by the changes in my color credences caused by my experience. So if I tell you my credences at t1, and then tell you my unconditional credences in the color propositions at t2, this should suffice for you to work out the rest of my opinions at t2. Jeffrey Conditionalization makes that possible.

Rigidity can help us perform Jeffrey Conditionalization updates on a probability table. Given the partition B1, B2, ..., Bn in which an update originates, we divide the lines of the table into "blocks": the B1 block contains all the lines consistent with B1; the B2 block contains all the lines consistent with B2; etc. The agent's experience between times ti and tj directly sets her unconditional crj-values for the Bm; in other words, it tells us what each block must sum to at tj. Once we know a block's crj total, we set individual line credences within the block by keeping them in the same proportions as at ti. (This follows from Rigidity's requirement that each line have the same credence conditional on a given Bm at tj as it did at ti.) That is, we multiply all the cri-values in a block by the same constant so that their crj-values achieve the appropriate sum.

Figure 5.4 shows this process for the colored shirt example. I've built the table around a simplified partition of doxastic possibilities in the problem, but I could've made a probability table with the full list of state-descriptions
and everything would proceed the same way. I calculated the cr1-values in the table from Equations (5.18) and (5.19). How do we then derive the credences at t2? The credal change between t1 and t2 originates in the G/B/V partition.
So the "blocks" on this table will be adjacent pairs of lines: the first pair of lines (on which G is true), the second pair of lines (B lines), and the third pair of V lines. Let's work with the B-block first. In Jeffrey's story, glimpsing the shirt sends me to cr2(B) = 0.25. So on the table, the third and fourth lines must have cr2-values summing to 0.25. At t1 these lines were in a 1:1 ratio, so they must maintain that ratio at t2. This leads to cr2-values of 0.125 on both lines. Applying a similar process to the G- and V-blocks yields the remaining cr2-values.

Once you understand this block-updating process, you can see that traditional updating by Conditionalization is a special case of updating by Jeffrey Conditionalization. When you update by Conditionalization on some evidential proposition E, your probability table divides into two blocks: lines consistent with E versus ~E lines. After the update, the ~E lines go to zero, while the E lines are multiplied by a constant so that they sum to 1.
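Here is a short sketch (my own illustration, not from the text) of the block-rescaling procedure just described, using the simplified six-line partition of Figure 5.4. It reproduces the cr2 column of the figure, and summing the "mine" lines recovers the 0.685 of Equation (5.20).

```python
cr1 = {
    ("G", "mine"): 0.24, ("G", "~mine"): 0.06,
    ("B", "mine"): 0.15, ("B", "~mine"): 0.15,
    ("V", "mine"): 0.00, ("V", "~mine"): 0.40,
}
cr2_colors = {"G": 0.70, "B": 0.25, "V": 0.05}    # from Equation (5.19)

cr2 = {}
for color, new_total in cr2_colors.items():
    block = {line: v for line, v in cr1.items() if line[0] == color}
    old_total = sum(block.values())                # this equals cr1(color)
    for line, v in block.items():
        cr2[line] = v * new_total / old_total      # rescale the whole block by one constant

print(cr2)                                                 # matches the cr2 column of Figure 5.4
print(sum(v for (_, m), v in cr2.items() if m == "mine"))  # 0.685, as in Equation (5.20)
```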
This tells us how Jeffrey Conditionalization relates mathematically to traditional (or "strict") Conditionalization. But how should we understand their relation philosophically? Suppose we class learning experiences into two kinds: those that send some proposition to certainty and those that don't. Jeffrey Conditionalization seems to be a universal updating rule, applying to both kinds of experience. When experience does send a proposition to certainty, Jeffrey Conditionalization provides the same advice as strict Conditionalization. But Jeffrey Conditionalization also provides guidance for learning experiences of the second kind.

Now the defender of Regularity (the principle forbidding extreme unconditional credence in logically contingent propositions) will maintain that only the second kind of learning experience ever occurs (at least to rational agents), and therefore that strict Conditionalization should never be applied in practice. All experience ever does is shuffle an agent's unconditional credences over some partition, without sending any partition members to extremity. Jeffrey Conditionalization tells us how such changes over a partition affect the rest of the agent's credence distribution. But one can identify an important role for Jeffrey Conditionalization even without endorsing Regularity. To establish the need for his new kinematics, Jeffrey only had to argue that some experiences of the second kind exist—sometimes we learn without gaining certainties. In that case we need a more general updating rule than strict Conditionalization, and Jeffrey
Conditionalization provides one.

Yet despite being such a flexible tool, Jeffrey Conditionalization has its drawbacks. For instance, while applications of strict Conditionalization are always commutative, Jeffrey updates that do not send any proposition to certainty may not be. The simplest example of this phenomenon (which Jeffrey readily acknowledged) occurs when one learning experience sends some Bm in the partition to unconditional credence p while the next experience sends that same partition member to credence q ≠ p. Applying Jeffrey Conditionalization to the experiences in that order will leave the agent with a final unconditional credence in Bm of q, while applying Jeffrey's rule to the same experiences in the opposite order will result in a final Bm credence of p. This commutativity failure is problematic if you think that the effects of evidence on an agent should not depend on the order in which pieces of evidence arrive.37

Finally, Jeffrey Conditionalization may not provide a recipe for every type of learning experience. Traditional Conditionalization covers experiences that set unconditional credences to certainty. Jeffrey Conditionalization generalizes to experiences that set unconditional credences to nonextreme values. But what if an experience affects an agent by directly altering her conditional credences? How can we calculate the effects of such an experience on her other degrees of belief? van Fraassen (1981) describes a "Judy Benjamin Problem" in which direct alteration of conditional credences plausibly occurs, and which cannot be addressed by Jeffrey Conditionalization.38
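Returning to the commutativity point above, a small numerical illustration (my own example, with made-up starting credences) shows how two Jeffrey updates over the same B/~B partition leave different final credences depending on their order: the update performed last simply wins.

```python
def jeffrey_over_B(cr, new_B):
    """Jeffrey update of a joint distribution over (B, A)-cells, originating
    in the B/~B partition and setting the new unconditional credence in B."""
    updated = {}
    for (b, a), v in cr.items():
        old_block = sum(w for (b2, _), w in cr.items() if b2 == b)
        target = new_B if b == "B" else 1 - new_B
        updated[(b, a)] = v * target / old_block
    return updated

cr = {("B", "A"): 0.2, ("B", "~A"): 0.3, ("~B", "A"): 0.1, ("~B", "~A"): 0.4}
p, q = 0.6, 0.9

p_then_q = jeffrey_over_B(jeffrey_over_B(cr, p), q)
q_then_p = jeffrey_over_B(jeffrey_over_B(cr, q), p)
print(sum(v for (b, _), v in p_then_q.items() if b == "B"))   # 0.9 (= q, the value set last)
print(sum(v for (b, _), v in q_then_p.items() if b == "B"))   # 0.6 (= p)
```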
5.6 Exercises
Unless otherwise noted, you should assume when completing these exercises that the cr-distributions under discussion satisfy the probability axioms and Ratio Formula. You may also assume that whenever a conditional cr expression occurs, the needed proposition has nonzero unconditional credence so that conditional credences are well-defined.

Problem 5.1. At noon I rolled a 6-sided die. It came from either the Fair Factory (which produces exclusively fair dice), the Snake-Eyes Factory (which produces dice with a 1/2 chance of coming up 1 and equal chance of each other outcome), or the Boxcar Factory (which produces dice with a 1/4 chance of coming up 6 and equal chance of each other outcome).
(a) Suppose you use the Principle of Indifference to assign equal credence to each of the three factories from which the die might have come. Applying
the Principal Principle, what is your credence that my die roll came up 3?

(b) Maria tells you that the die I rolled didn't come from the Boxcar Factory. If you update on this new evidence by Conditionalization, how confident are you that the roll came up 3?

(c) Is Maria's evidence admissible with respect to the outcome of the die roll? Explain.

(d) After you've incorporated Maria's information into your credence distribution, Ron tells you the roll didn't come up 6. How confident are you in a 3 after conditionalizing on Ron's information?

(e) Is Ron's evidence admissible with respect to the outcome of the die roll? Explain.

Problem 5.2. The expert deference principle in Equation (5.9) resembles the Principal Principle in many ways. Yet the expert deference principle makes no allowance for anything like inadmissible information. What kind of information should play the role for expert deference that inadmissible information plays for deference to chances? How should Equation (5.9) be modified to take such information into account?

Problem 5.3. Suppose it is currently t1, and t2 and t3 are times in the future (with t3 after t2). At t1, you satisfy the probability axioms, Ratio Formula, and Reflection Principle. You are also certain at t1 that you will satisfy these constraints at t2. However, for some proposition X your t1 credences are equally divided between the following two (mutually exclusive and exhaustive) hypotheses about what your t2 self will think of your t3 credences:
Y: (cr2[cr3(X) = 1/10] = 1/3) & (cr2[cr3(X) = 2/5] = 2/3)

Z: (cr2[cr3(X) = 3/8] = 3/4) & (cr2[cr3(X) = 7/8] = 1/4)

Given all this information, what is cr1(X)? (Be sure to explain your reasoning clearly.)
Problem 5.4. Can you think of any kind of real-world situation in which it would be rationally permissible to violate the Reflection Principle? Explain the situation you're thinking of, and why it would make a Reflection violation okay.
Problem 5.5. Jingyi assigns the t1 credences indicated by the probability table below. Then between t1 and t2, she learns P ⊃ Q.
P    Q    cr1
T    T    0.4
T    F    0.2
F    T    0.2
F    F    0.2
(a) Determine Jingyi's credence distribution at t2. Then use Equation (5.13) to calculate the entropy of both cr1 and cr2 over the partition containing the four P/Q state-descriptions.

(b) Use the concept of information content to explain why the entropy of Jingyi's distribution changed in the direction it did between t1 and t2.

(c) Create a probabilistic credence distribution that assigns the same unconditional value to P as cr1, but has a higher entropy over the P/Q state-description partition.

(d) Use the partition containing just P and ~P to calculate the entropy for cr1 and for your distribution from part (c). What does this tell you about the partition-dependence of entropy comparisons?

Problem 5.6. Using Non-Negativity, Normality, Finite Additivity, the Ratio Formula, and any results we've proven from those four, prove Finite Conglomerability. (Hint: The Law of Total Probability may be useful here.)

Problem 5.7. Suppose that at t1 you assign a "flat" credence distribution over language L whose only two atomic propositions are B and C—that is, you assign equal credence to each of the four state-descriptions of L. Between t1 and t2 you perform a Jeffrey Conditionalization that originates in the B/~B partition and sets cr2(B) = 2/3. Between t2 and t3 you perform a Jeffrey Conditionalization that originates in the C/~C partition and sets cr3(C) = 3/4.
(a) Calculate your cr2 and cr3 distributions.

(b) Does your credence in B change between t2 and t3?

(c) Does your credence in C change between t1 and t2?

(d) Explain why the answers to parts (b) and (c) are different, using the notion of probabilistic independence.
Problem 5.8. Prove that Jeffrey Conditionalization is equivalent to Rigidity. That is: Given any times ti and tj, proposition A in L, and finite partition B1, B2, ..., Bn in L whose elements each have nonzero cri, the following two conditions are equivalent:

1. crj(A) = cri(A | B1)·crj(B1) + cri(A | B2)·crj(B2) + ... + cri(A | Bn)·crj(Bn)

2. For all Bm in the partition, crj(A | Bm) = cri(A | Bm).
(Hint: Complete two proofs—first condition 2 from condition 1, then vice versa.)

Problem 5.9. Suppose we apply Jeffrey Conditionalization over a finite partition B1, B2, ..., Bn in L to generate cr2 from cr1. Show that we could have obtained the same cr2 from cr1 in the following way: start with cr1; Jeffrey Conditionalize it in a particular way over a partition containing only two propositions; Jeffrey Conditionalize the result of that operation in a particular way over a partition containing only two propositions (possibly different from the ones used the first time); repeat this process a finite number of times until cr2 is eventually obtained.∗
5.7 Further reading
Subjective and Objective Bayesianism
Maria Carla Galavotti (2005). Philosophical Introduction to Probability. CSLI Lecture Notes 167. Stanford, CA: CSLI Publications

Excellent historical introduction to the many ways "probability" has been understood by the philosophical and statistical community.

Alan Hájek (2011b). Interpretations of Probability. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2011. URL: http://plato.stanford.edu/archives/win2011/entries/probability-interpret/

Survey of the various interpretations of probability, with extensive references.

∗ I owe this problem to Sarah Moss.
Bruno de Finetti (1931/1989). Probabilism: A Critical Essay on the Theory of Probability and the Value of Science. Erkenntnis 31, pp. 169–223. Translation of B. de Finetti, Probabilismo, Logos 14: 163–219.

Classic paper critiquing objective interpretations of probability and advocating a Subjective Bayesian (in the semantic sense) approach.

Donald Gillies (2000). Varieties of Propensity. British Journal for the Philosophy of Science 51, pp. 807–835

Reviews different versions of the propensity theory and their motivations. Focuses at the end on how propensity theories might respond to Humphreys' Paradox.

Deference Principles
David Lewis (1980). A Subjectivist's Guide to Objective Chance. In: Studies in Inductive Logic and Probability. Ed. by Richard C. Jeffrey. Vol. 2. Berkeley: University of California Press, pp. 263–294

Lewis's classic article laying out the Principal Principle and its consequences for theories of credence and chance.

Adam Elga (2007). Reflection and Disagreement. Noûs 41, pp. 478–502

Offers principles for deferring to many different kinds of agents, including experts, gurus (individuals with good judgment who lack some of your evidence), past and future selves, and peers (whose judgment is roughly as good as your own).

Bas C. van Fraassen (1984). Belief and the Will. The Journal of Philosophy 81, pp. 235–256

Article in which van Fraassen proposes and defends the Reflection Principle.

Jonathan Weisberg (2007). Conditionalization, Reflection, and Self-Knowledge. Philosophical Studies 135, pp. 179–197

Discusses conditions under which Reflection can be derived from Conditionalization, and vice versa.
Richard Pettigrew and Michael G. Titelbaum (2014). Deference Done Right. Philosophers' Imprint 14.35

Attempts to get the formulation of deference principles precisely right, including expert deference principles, the Reflection Principle, and principles for higher-order credences. Particularly concerned with making those principles consistent with Conditionalization and with the possibility of ignorance about what's rationally required.

The Principle of Indifference
John Maynard Keynes (1921). Treatise on Probability. London: MacMillan and Co., Limited

Chapter IV contains Keynes's famous discussion of the Principle of Indifference.

E. T. Jaynes (1957a). Information Theory and Statistical Mechanics I. Physical Review 106, pp. 620–30

E. T. Jaynes (1957b). Information Theory and Statistical Mechanics II. Physical Review 108, pp. 171–90

E.T. Jaynes introduces the Maximum Entropy approach.

Colin Howson and Peter Urbach (2006). Scientific Reasoning: The Bayesian Approach. 3rd. Chicago: Open Court

Section 9.a covers the Indifference Principle, Harold Jeffreys's attempts to make it partition-invariant, and then Jaynes's Maximum Entropy theory. Very clear on the flaws of all of these approaches.

Teddy Seidenfeld (1986). Entropy and Uncertainty. Philosophy of Science 53, pp. 467–491

A general discussion of the flaws with Jaynes's Maximum Entropy approach; especially good on its incompatibility with Bayesian conditionalization. Also contains useful references to Jaynes's many defenses of Maximum Entropy over the years and to the critical discussion that has ensued.

Credences for Infinite Possibilities
David Papineau (2012). Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press

Chapter 2 offers a highly accessible introduction to the cardinalities of various infinite sets. (Note that Papineau uses "denumerable" where we use the term "countable".)

Alan Hájek (2003). What Conditional Probability Could Not Be. Synthese 137, pp. 273–323

Assesses the viability of the Ratio Formula as a definition of conditional probability in light of various infinite phenomena and plausible violations of Regularity.

Colin Howson (2014). Finite Additivity, Another Lottery Paradox and Conditionalisation. Synthese 191, pp. 989–1012

Neatly surveys arguments for and against Countable Additivity, then argues for dropping Conditionalization as a universal update rule over accepting infinite additivity principles.

Timothy Williamson (2007). How Probable Is an Infinite Sequence of Heads? Analysis 67, pp. 173–80

Brief introduction to the use of infinitesimals in probability distributions, followed by an argument against using infinitesimals to deal with infinite cases.

Kenny Easwaran (2014b). Regularity and Hyperreal Credences. Philosophical Review 123, pp. 1–41

Excellent, comprehensive discussion of the motivations for Regularity, the mathematics of infinitesimals, arguments against using infinitesimals to secure Regularity (including Williamson's argument), and an alternative approach.

Jeffrey Conditionalization
Richard C. Jeffrey (1965). The Logic of Decision. 1st. McGraw-Hill series in probability and statistics. New York: McGraw-Hill

Chapter 11 contains Jeffrey's classic presentation of his "probability kinematics", now universally known as "Jeffrey Conditionalization".
Notes

1 The frequency theory is sometimes referred to as "frequentism" and its adherents as "frequentists". However "frequentism" more often refers to a school of statistical practice at odds with Bayesianism (which we'll discuss in Chapter 11). The ambiguity probably
comes from the fact that most people in that statistical school also adopt the frequency theory as their interpretation of probability. But the positions are logically distinct and should b e denoted by different terms. So I will use “frequenc y theory” here, and reserve “frequentism” for my later discussion of the statistical approach. 2 For many, many more see (H´ ajek 1996) and its sequel (H´ajek 2009b). 3 The frequency theory will also need to work with counterfactuals if nonextreme probabillities can be meaningfully ascribed to a priori truths, or to metaphysical necessities. (Might a chemist at some point have said, “It’s highly probable that water is H 2 O”?) Assigning nonextreme frequencies to such propositions’ truth involves possible worlds far away from the actual. 4 This difficulty for the propensity theory is often known as Humphreys’ Paradox, since it was proposed in (Humphreys 1985). One might respond to Humphreys’ Paradox by suggesting that propensities don’t follow the standard mathematical rules of probability. And honestly, it’s not obvious why they should. The frequency theory clearl y yields probabilistic values: in any sequence of event repetitions a given outcome has a non-negative frequency, the tautologous outcome has a frequency of 1, and mutually exclusive outcomes have frequencies summing to the frequency of their disjunctio n. In fact, Kolmogoro v’s axioms can be read as a generalization of the mathematics of event frequencies to cases involving irrational and infinite quantities. But establishing that propensity values (or objective chances) satisfy the probability axioms takes argumentation from one’s metaphysics of propensity. Nevertheless, most authors who work with propensities assum e that they satisfy the axioms; if they didn’t, the propensity interpretation’s probabilities wouldn’t count as probabilities in the mathematician’s sense (Section 2.2). 5 One could focus here on a metaphysical distinction rather than a semantic one— instead of asking what “probability” talk means, I could ask what probabilities are. But some of the probability interpretations we will discuss don’t have clear metaphysical commitments. The logical interpr etation, for instance, takes probabili ty to be a logical relation, but need not go on to specify an ontol ogy for suc h relations. So I will stick with a semantic distinction, which in any case matches how these questions were discussed in much of twentieth-century analytic philosophy. 6 In the twentieth century Subjective Bayesianism was also typically read as a form of expressivism; an agent’s “probability” talk expressed her credal attitudes towards propositions without having truth-conditions. Nowadays alternative semantics are available that could interpret “probability” talk in a more cognitivist mode while still reading such talk as reflecting subjective degrees of belief. (Weatherson and Egan 2011) 7 See (Hacking 1971) for discussion of Leibniz’s position. 8 Carnap himself did not believe all “probability” talk picked out the logical values just described. Instead, he thought “probability” was ambiguous between two meanings, one of which was logical probability and the other of which had more of a frequency interpretation. 9 There is disagreement about whether the logical and evidential interpretations of probability should be considered Objective Bayesian in the semantic sense. Popper (1957) says that objective interpretations make probability values objectively testable. Logical and ev-
idential probabilities don’t satisfy that criterion, and Popper seems to class them as subjective interpretations. Yet other authors (such as (Galavotti 2005)) distinguish between logical and subjective interpret ations. I have defined the semantic Subjective/Objective Bayesian distinction so that logical and evidential interpretations count as Objective; while they may be normative for the attitudes of agents, logical and evidential probabilities do not10vary with the attitudes particular agents or groups of agents possess. As I explained in Chapter 4, note 14, defining hypothetical priors as regular does not commit us to the Regularity Principle as a rational constraint. 11 Those who believe that “probability” is used in many ways—or that there are many different kinds of entities that count as probabilities—sometimes use the terms “subjective probability” and “objective probabili ty”. On this usage, subjective probabilities are agents’ credences, while objective probabilities include all the kinds of probabilities we’ve mentioned that are independent of particular agents’ attitudes. 12 To assign H a credence exactly equal to the observed frequency of heads would be to follow what Reichenbach (1938) called the straight rule. Interestingly, it’s impossible to construct a hypothetical prior satisfying the probability axioms that allows an agent to obey the straight rule in its full generality. However, Laplace (1814/1995) proved that if an agent’s prior satisfies the Principle of Indifference (adopting a “flat” distribution somewhat like the dashed line in Figure 5.3), her posteriors will obey the rule of succession : after seeing h of n tosses come up heads, her credence in H will be ph ` 1q{pn ` 2q. As the number of tosses increase s, this credence approach es the observed frequency of heads. Given these difficulties aligning credences and observed frequencies, anyone who thinks credences should match chances needs to describe a hypothetical prior making such a match possible. In a moment we’ll see Lewis doing this with the Principal Principle. 13 Since the ratio of B -outcomes to A-events must always fall between 0 and 1, this principle sheds some light on why credence values are usually scaled from 0 to 1. (Compare note 4 above.) 14 There’s also the problem that we sometimes have data from overlapping reference The Book of classes applying to the same case, neither of which is a subclass of the other. Odds (Shapiro, Campbell, and Wright 2014, p. 137) reports that 1 in 41 .7 adults in the U.S. aged 20 or older experiences heart failu re in a given year. For non-Hispanic white men 20 or older, the num ber is 1 in 37. But only 1 in 500 men aged 20–3 9 experi ences heart failure in a given year. In setting my creden ce that I will have a heart attack this year, should I use the data for non-Hispanic white men over 20 or the data for men aged 20–39? 15 Here I’m thin king especially of the follo wing: “What makes it be so that a certain reasonable initial credence function and a certain reasonable system of basic intrinsic values are both yours is that you are disposed to act in more or less the ways that are rationalized by the pair of them together, taking into account the modification of credence by conditionalizing on total evidence.” (Lewis 1980, p. 288) 16
Depending on one's theory of the metaphysics of chance, it may be a category mistake to say something was caused by a chance value (or by the fact that a particular chance value obtained). In that case, we can focus on the underlying physical makeup associated with the chance value as the relevant cause. I should admit, though, that the explanation I'm giving of screening-off in the Principal Principle fits most naturally with a propensity-style account of chance. I'm unsure whether it could be made to work on Lewis's own "best system" theory of chance (Lewis 1994). As far as I know, Lewis himself never explains why the screening-off captured by the Principal Principle should obtain, except to say that it matches our best intuitions about how rational agents assign credences to
chance events. 17 The notion of screening off in play here is the one I described in Chapter 3, Note 9 for continuous random variables. The objective chance of H is a continuous variable, so facts about Ch pH q screen off known flip frequencies from H in the sense that conditional on setting Ch pH q to any particular value, known frequency information becomes irrelevant H. to 18 Notice that the time ti to which the chance in the Principal Principle is indexed need not be the time at which an agent assigns her credence concerning the experimental outcome A. In our coin examp le, the agent form s her cred ence at 1pm about the coin flip outcome at noon using information about the chances at noon . This is signi ficant because on some metaphysical theories of chance, once the coin flip lands heads (or tails) the chance of H goes to 1 (or 0) forevermore. Yet even if the chance of H has become extreme by 1pm, the Principal Principle may still direct an agent to assign a nonextreme 1pm credence to H if all she knows are the chances from an earlier time. I should also note that because chances are time-indexed, the notion of admissibility must be time-indexed as well. The information about the wad of chewing gum is admissible relative to 11:30am chances—learning about the chewing gum affects your credence about the flip outcome by way of your opinions about the 11:30am chances. But the information that chewing gum was stuck to the coin after 11 is in admissible relative to the 11:00am chances. (Chewing gum information affects your crede nce in H , but not by influencing your opinions about the chances associated with the coin at 11:00am.) So strictly speaking we should ask whether a piece of information is admissible for a particular proposition relative to the chances at a given time. I have suppressed this comp lication in the main text. 19 For a partition containing only two elements (call them C 1 and C 2 ), the unconditional credence form of the Law of Total Probability tells us that
cr(A) = cr(A | C1)·cr(C1) + cr(A | C2)·cr(C2)

The conditional credence form (generated by the procedure described in Section 3.1.2) tells us that for any E with cr(E) > 0,

cr(A | E) = cr(A | C1 & E)·cr(C1 | E) + cr(A | C2 & E)·cr(C2 | E)

20
Equation (5.9) directs the assignment of your unconditional credences only when information about the opinion of a particular expert is your total relevant evidence concerning proposition A. If you have addit ional information about A (perhaps the opinion of a second expert?), the releva nt condition in the conditional credence on the lefthand side of Equation (5.9) is no longer just cr E pAq “ x. (See Exercise (5.2) for more on this point.) 21 Supposing that your future credences result from your present credences by conditionalization guarantees that your future self will possess at least as much evidence as your present self. But it also has the advantage of guaranteeing that future and present self both work from the same hypothetical prior distributi on (because of the Hypothetical Priors Theorem, Section 4.3). It’s worth think ing about whether an agent should defer to the opinions of a database expert who—despite having strictly more information than the agent—analyzes that evidence using different ultimate evidential standards. 22 The justification I’ve just provided for Equation (5.11) explicitly uses every one of the enumerated conditions except Condition 3. Condition 3 is necessary so that the conditional credence in Equation (5.11) is well-defined according to the Ratio Formula. 23 One complication here is that van Fraassen sometimes describes Reflection as relating attitudes, but at other times portrays it as being about various acts of commitment, and
therefore more directly concerned with assertions and avowals than with particular mental states. 24 Earlier we saw that under the Reflection Principle, opinions about your future credences may influence other credences you assign now. van Fraassen’s argument for Conditionalization runs in the opposite direction, from credences you assign now to what you’ll do25in the future. The Reflection Principle applies to times t i and t j with j strictly greater than i . What would happen if we applied it when j “ i? In that case we’d have a principle for how an agent’s current credences should line up with her credences about her current credences. This principle would engage the results of an agent’s introspecting to determine what her current credences are. An agent’s credenc es about her own current credences are called higher-order credences, and they have been the subject of much Bayesian scrutiny (e.g. (Skyrms 1980b)). The core issue is how much access a rational agen t is required to have to the contents of her own mind. 26 Joyce (2005) reports that this sort of problem was first identified by John Venn in the 1800s. 27 This example is adapted from one in (Salmon 1966, pp. 66-7). A related example is van Fraassen’s (1989) Cube Factory, which describes a factory making cubes of various sizes and asks how confident I should be that a given manufactured cube has a size falling within a particular range. The Principle of Indifference yields conflict ing answers dependin g on whether cube size is described in terms of side length, face area, or volume. 28 In Chapter ?? we will discuss a different credal response to this kind of ignorance. 29 What about cases in which an agent has ruled out the proposition Q ? Should rational agents assign credences conditiona l on conditions that they’ve ruled out? For discussion and references on this question, see (Titelbaum 2013, Ch. 5). 30 I was careful to define the Ratio Formula so that it simply goes silent when crpQq “ 0, and is therefore in need of supplementation if we want to constrain values like cr p2 | 2q. Other authors define the Ratio Formula so that it contains the same equation as ours but leaves off the restriction to cr pQq ą 0 cases. This forces an impossi ble calculation when cr pQq “ 0. Alternatively, one can leave the Ratio Formula unrestricted but make its equation cr pP | Qq ¨ crpQq “ cr pP & Qq. This has the adv antage of being true even when crpQq “ 0 (because cr pP & Qq will presumably equal 0 as well), but does no better than our Ratio Formula on constraining the value of cr p2 | 2q. (Any value we fill in for tha t conditional credence will make the relevant multiplication-equation true.) 31 For a historical overview of this technique and detailed comparison of the disparate approaches, see (Makinson 2011). 32 (Seidenfeld, Schervish, and Kadane ms) shows that this pattern generalizes: At each infinite cardinality, we cannot secure the relevant Conglomerability principle with Additivity principles of lower cardinalities; Conglomerability at a particular level requires Additivity at that same level. 33
I got the example that follows from Brian Weatherson.

34 Contrast our move from comparative to quantitative representations of doxastic attitudes in Chapter 1. There the additional structure of a numerical representation allowed us to model features like confidence-gap sizes, which plausibly make a difference to agents' real-world decisions.

35 Let me quickly tie up one loose end: This section discussed cases in which it might be rational for an agent to assign unconditional credence 0 to a proposition without ruling it out. All the cases in which this might be rational involve credence assignments over infinite partitions. For the rest of this book we will be working with finite partitions, and
will revert to the assumption we were making prior to this section that credence 0 always represents ruling something out.

36 Actually, Jeffrey's original proposal was a bit more complicated than that. In (Jeffrey 1965) he began with a set of propositions B1, B2, ..., Bn in which the credence change originated, but did not require the Bm to form a partition. Instead, he constructed a set of "atoms", which we can think of as state-descriptions constructed from the Bm. (Each atom was a consistent conjunction in which each Bm appeared exactly once, either affirmed or negated.) The Rigidity condition (which Jeffrey sometimes called "invariance") and Jeffrey Conditionalization were then applied to these atoms rather than directly to the Bm in which the credence change originated. Notice that in this construction the atoms form a partition. Further, Jeffrey recognized that if the Bm themselves formed a partition, the atoms wound up in a one-to-one correspondence with the Bm to which they were logically equivalent. I think it's for this reason that Jeffrey later (2004, Ch. 3) dropped the business with "atoms" and applied his probability kinematics directly to any finite partition.

37 Though see (Lange 2000) for an argument that this order-dependence is not a problem because the character of the experiences changes when they're temporally rearranged.

38 Interestingly, the main thrust of van Fraassen's article is that while Maximum Entropy is capable of providing a solution to the Judy Benjamin Problem, that solution is intuitively unappealing.
Part III
Applications
We have now seen the five core normative rules of Bayesian Epistemology (Chapters 2 through 4), plus a number of additional norms that have been proposed to supplement them (Chapter 5). In Part IV of this book we will consider explicit premise-conclusion style, philosophical arguments for various of these norms. But as I see it, what actually convinced most practitioners to adopt Bayesian Epistemology—to accept that agents can be usefully represented as assigning numerical degrees of belief, and that rationality requires those degrees of belief to satisfy certain mathematical constraints—were the applications in which this approach found success.

Our discussion has already covered some minor successes of Bayesian Epistemology. For example, while a purely binary doxastic view has trouble furnishing agents with a rational, plausible set of attitudes to adopt in the Lottery Paradox (Section 1.1.2), Bayesian Epistemology has no trouble sketching a set of credences that are intuitively appropriate and entirely consistent with Bayesian norms (Section 2.2.2).

Now we are after bigger targets. At one time or another, Bayesianism has been applied to offer positive theories of such central philosophical concepts as explanation, coherence, causation, and information. Yet the two applications most central to the historical development of Bayesian Epistemology were confirmation theory and decision theory. As these two subjects grew and cemented their significance in philosophy (as well as economics and other nearby disciplines) over the course of the twentieth century, Bayesian Epistemology came to be viewed more and more as an indispensable philosophical tool.

Each chapter in this part of the book takes up one of those two applications. Confirmation is tied to a number of central notions in theoretical rationality, such as induction, justification, evidential support, and epistemic reasons. Bayesian Epistemology provides the most detailed, substantive, and plausible account of confirmation philosophers have available, not only accounting for the broad contours of the concept but also yielding particular results concerning specific evidential situations. Decision theory, meanwhile, concerns rational action under uncertainty, and so is a central plank of practical rationality and the theory of rational choice. Degrees of belief have been indispensable to decision theory since its inception.

Volumes have been written on each of these subjects, so my goal in these two chapters is merely to introduce you to their historical development, identify some successes that have been achieved, and point to some controversies that carry on today. More information can be found through the Further Reading sections in each chapter. As for the applications of Bayesian Epistemology not covered here, you might start with the book
cited below.
Further Reading

Luc Bovens and Stephan Hartmann (2003). Bayesian Epistemology. Oxford: Oxford University Press.
Discusses applications of Bayesian Epistemology to information, coherence, reliability, confirmation, and testimony.
Chapter 6
Confirmation

When evidence supports a hypothesis, philosophers of science say that the evidence "confirms" that hypothesis. Bayesians place this confirmation relation at the center of their theory of induction. But confirmation is also closely tied to such epistemological notions as justification and reasons. Bayesian Epistemology offers a systematic theory of confirmation (and its opposite, disconfirmation) that not only deepens our understanding of this relation but also provides specific answers about which hypotheses are supported (and to what degree) in particular evidential situations.

Since its early days, the analysis of confirmation has been driven by a perceived analogy to deductive entailment. In Chapter 4 we discussed evidential standards that relate a body of evidence (represented as a proposition) to the doxastic attitudes it supports. But confirmation—though intimately linked with evidential standards in ways we'll presently see—is a different kind of relation: instead of relating a proposition and an attitude, it relates two propositions (evidence and hypothesis). Confirmation shares this feature with deductive entailment. In fact, Rudolf Carnap thought of confirmation as a generalization of standard logical relations, with deductive entailment and refutation as two extremes of a continuous confirmational scale.

In the late nineteenth and early twentieth centuries, logicians produced ever-more-powerful syntactical theories capable of answering specific questions about which propositions deductively entailed which. Impressed by this progress, theorists such as Carl Hempel and Carnap envisioned a syntactical theory that would do the same for confirmation. As Hempel put it,

The theoretical problem remains the same: to characterize, in precise and general terms, the conditions under which a body of
evidence can be said to confirm, or to disconfirm, a hypothesis of empirical character. (1945a, p. 7)

Hempel identified various formal properties that the confirmation relation might or might not possess. Carnap then argued that we get a confirmation relation with exactly the right formal properties by identifying confirmation with positive probabilistic relevance.

This chapter begins with Hempel's formal conditions on the confirmation relation. Identifying the right formal conditions for confirmation will not only help us assess various theories of confirmation; it will also help us understand exactly what relation philosophers of science have in mind when they talk about "confirmation".1 We then move on to Carnap's Objective Bayesian theory of confirmation, which roots confirmation in probability theory. While Carnap's theory has a number of attractive features, we will also identify two drawbacks: its failure to capture particular patterns of inductive inference that Carnap found appealing; and the language-dependence suggested by Goodman's "grue" problem. We'll respond to these problems with a confirmation theory grounded in Subjective Bayesianism (in the normative sense).

Confirmation is fairly undemanding, in one sense: we say that evidence confirms a hypothesis when it provides any amount of support for that hypothesis, no matter how small. But we might want to make more fine-grained distinctions among cases of support than that. Probabilistic theories of confirmation offer a number of different ways to measure the strength of confirmation in a particular case. We will survey these different measures of confirmational strength, assessing the pros and cons of each. Finally, we'll apply probabilistic confirmation theory to provide a Bayesian solution to Hempel's Paradox of the Ravens.
6.1 Formal features of the confirmation relation
6.1.1 Confirmation is weird! The Paradox of the Ravens

One way to begin thinking about confirmation is to consider the simplest possible cases in which a piece of evidence confirms a general hypothesis. For example, the proposition that a particular frog is green seems to confirm the hypothesis that all frogs are green. On the other hand, the proposition that a particular frog is not green disconfirms the hypothesis that all frogs are green. (In fact, it refutes that hypothesis!) If we think this pattern
always holds, we will maintain that confirmation satisfies the following constraint:

Nicod's Criterion: For any predicates F and G and constant a of L, (∀x)(Fx ⊃ Gx) is confirmed by Fa & Ga and disconfirmed by Fa & ∼Ga.
Hempel (1945a,b) named this condition after Jean Nicod (1930), who built his theory of induction around the criterion. We sometimes summarize the Nicod Criterion by saying that a universal generalization is confirmed by its positive instances and disconfirmed by its negative instances. Notice that one can endorse the Nicod Criterion as a sufficient condition for confirmation without taking it to be necessary; we need not think all cases of confirmation follow this pattern.

Yet Hempel worries about the Nicod Criterion even as a sufficient condition for confirmation, because of how it interacts with another principle he endorses:

Equivalence Condition (for hypotheses): Suppose H and H′ in L are logically equivalent (H ⫤⊨ H′). Then any E in L that confirms H also confirms H′.
Hempel endorses the Equivalence Condition because he doesn't want confirmation to depend on the particular way a hypothesis is formulated; logically equivalent hypotheses say the same thing, so they should enter equally into confirmation relations. Hempel is also concerned with how working scientists use confirmed hypotheses; for instance, practitioners will often deduce predictions and explanations from confirmed hypotheses. Equivalent hypotheses have identical deductive consequences, and scientists don't hesitate to substitute logical equivalents for each other.

But combining Nicod's Criterion with the Equivalence Condition yields counterintuitive consequences, which Hempel calls the "paradoxes of confirmation". The most famous of these is the Paradox of the Ravens. Consider the hypothesis that all ravens are black, representable as (∀x)(Rx ⊃ Bx). By Nicod's Criterion this hypothesis is confirmed by the evidence that a particular raven is black, Ra & Ba. But now consider the evidence that a particular non-raven is non-black, ∼Ba & ∼Ra. This is a positive instance of the hypothesis (∀x)(∼Bx ⊃ ∼Rx), so by Nicod's Criterion it confirms that hypothesis. By contraposition, that hypothesis is equivalent to the hypothesis that all ravens are black. So by the Equivalence Condition, ∼Ba & ∼Ra confirms (∀x)(Rx ⊃ Bx) as well. The hypothesis that all ravens are black is
confirmed by the observation of a red herring, or a white shoe. This result seems counterintuitive, to say the least. Nevertheless, Hempel writes that "the impression of a paradoxical situation. . . is a psychological illusion" (1945a, p. 18); on his view, we reject the confirmational result because we misunderstand what it says.

Hempel highlights the fact that in everyday life people make confirmation judgments relative to an extensive corpus of background knowledge. For example, a candidate's performance in an interview may confirm that she'd be good for the job, but only relative to a great deal of background information about how the questions asked relate to the job requirements, how interviewing reveals qualities of character, etc. In assessing confirmation, then, we should always be explicit about the background we're assuming. This is especially important because background knowledge can dramatically alter confirmation relations. For example, in Section 4.3 we discussed a poker game in which you receive the cards that will make up your hand one at a time. At the beginning of the game, your background knowledge contains facts about how a deck is constructed and about which poker hands are winners. At that point the proposition that your last card will be the two of clubs does not confirm the proposition that you will win the hand. But as the game goes along and you're dealt some other twos, your total background knowledge changes such that the proposition that you'll receive the two of clubs now strongly confirms that you'll win.

While Nicod's Criterion states a truth about confirmation for some combinations of evidence, hypothesis, and background corpus, there are other corpora against which applying the Criterion is a bad idea. For instance, suppose I know I'm in the Hall of Atypically-Colored Birds. A bird is placed in the Hall only if the majority of his species-mates are one color but he happens to be another color. Against a background corpus which includes the fact that I'm in the Hall of Atypically-Colored Birds, observing a black raven disconfirms the hypothesis that all ravens are black.2

Hempel thinks the only background against which the Nicod Criterion states a general confirmational truth about all hypotheses and bodies of evidence is the tautological background. The tautological background corpus contains no contingent propositions; it is logically equivalent to a tautology T. When we intuitively reject the Nicod Criterion's consequence that a red herring confirms the ravens hypothesis, we are sneaking non-tautological information into the background. Hempel thinks we're imagining a situation in which we already know in advance (as part of the background) that we will be observing a herring and checking its color. Relative to that background—which includes the information ∼Ra—we know that whatever we're about
to observe will have no evidential import for the hypothesis that ravens are black. So when we then get the evidence that ∼Ba, that evidence is confirmationally inert with respect to the hypothesis (∀x)(Rx ⊃ Bx). But the original question was whether ∼Ba & ∼Ra (taken all together, at once) confirmed (∀x)(Rx ⊃ Bx). On Hempel's view, this is a fair test of the Nicod Criterion only against an empty background corpus (since that's the background against which he thinks the Criterion applies). And against that corpus, Hempel thinks the confirmational result is correct.

Here's a way of understanding why: Imagine you've decided to test the hypothesis that all ravens are black. You will do this by selecting objects from the universe one at a time and checking them for ravenhood and blackness. It's the beginning of the experiment, you haven't checked any objects yet, and you have no background information about the tendency of objects to be ravens and/or black. Moreover, you've found a way to select objects from the universe at random, so you have no background information about what kind of object you'll be getting. Nevertheless, you start thinking about what sorts of objects might be selected, and whether they would be good or bad news for the hypothesis. Particularly important would be any ravens that weren't black, since any such negative instance would immediately refute the hypothesis. (Here it helps to realize that the ravens hypothesis is logically equivalent to ∼(∃x)(Rx & ∼Bx).) So when the first object arrives and you see it's a red herring—∼Ba & ∼Ra—this is good news for the hypothesis (at least, some good news). After all, the first object could've been a non-black raven, in which case the hypothesis would've been sunk.

This kind of reasoning defuses the seeming paradoxicality of a red herring's confirming that all ravens are black, and the objection to the Nicod Criterion that results. As long as we're careful not to smuggle in illicit background information, observing a red herring confirms the ravens hypothesis to at least a small degree. Nevertheless, I.J. Good worries about the Nicod Criterion, even against a tautological background:
[T]he closest I can get to giving [confirmation relative to a tautological background] a practical significance is to imagine an infinitely intelligent newborn baby having built-in neural circuits enabling him to deal with formal logic, English syntax, and subjective probability. He might now argue, after defining a crow in detail, that it is initially extremely unlikely that there are any crows, and therefore that it is extremely likely that all crows are black. "On the other hand," he goes on to argue, "if there are
crows, then there is a reasonable chance that they are of a variety of colors. Therefore, if I were to discover that even a black crow exists I would consider [the hypothesis that all crows are black] to be less probable than it was initially.” (1968, p. 157) 3
Here Good takes advantage of the fact that (∀x)(Rx ⊃ Bx) is true if there are no ravens (or crows, in his example).4 Before taking any samples from the universe, the intelligent newborn might consider four possibilities: there are no ravens; there are ravens but they come in many colors; there are ravens and they're all black; there are ravens and they all share some other color. The first and third of these possibilities would make (∀x)(Rx ⊃ Bx) true. When the baby sees a black raven, the first possibility is eliminated; this might be such a serious blow to the ravens hypothesis that the simultaneous elimination of the fourth possibility would not be able to compensate. In other words, the observation of a black raven might fail to confirm that all ravens are black, therefore violating the Nicod Criterion even against a tautological background.
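Good's point can be made concrete with a toy calculation. The numbers below are entirely made up for illustration (they are neither Good's nor drawn from this text); the sketch simply shows that, with suitable priors over the four possibilities, conditionalizing on the existence of a black raven can lower the probability that all ravens are black.

    from fractions import Fraction as F

    # Hypothetical priors over the newborn's four possibilities (made up for illustration):
    priors = {
        "no ravens":               F(1, 2),
        "ravens, many colors":     F(3, 10),
        "ravens, all black":       F(1, 10),
        "ravens, all other color": F(1, 10),
    }

    # Rough likelihoods of "a black raven exists" on each possibility (also made up):
    likelihood = {
        "no ravens":               F(0),
        "ravens, many colors":     F(1),
        "ravens, all black":       F(1),
        "ravens, all other color": F(0),
    }

    # "All ravens are black" is true on the first and third possibilities.
    hypothesis = {"no ravens", "ravens, all black"}

    prior_h = sum(priors[w] for w in hypothesis)                     # 3/5
    evidence = sum(priors[w] * likelihood[w] for w in priors)        # 2/5
    posterior_h = sum(priors[w] * likelihood[w] for w in hypothesis) / evidence

    print(prior_h, posterior_h)   # 3/5 versus 1/4: the hypothesis becomes less probable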
6.1.2 Further adequacy conditions
We have already seen two general conditions (Nicod's Criterion and the Equivalence Condition) that one might take the confirmation relation to satisfy. We will now consider a number of other such conditions, most of them discussed (and given the names we will use) by Hempel. Sorting out which of these are genuine properties of confirmation has a number of purposes. First, Hempel thought the correct list provided a set of adequacy conditions for any positive theory of confirmation. Second, sorting through these conditions will help us understand the abstract features of evidential support. These are features about which epistemologists, philosophers of science, and others (including working scientists and ordinary folk!) often make strong assumptions—many of them incorrect. Finally, we are going to use the word "confirmation" in subsequent sections as a somewhat technical term, distinct from some of the ways "confirm" is used in everyday speech. Working through the properties of the confirmation relation will help illustrate how we're using the term.

The controversy between Hempel and Good leaves it unclear whether the Nicod Criterion should be endorsed as a constraint on confirmation, even when it's restricted to tautological background. On the other hand, the Equivalence Condition can be embraced in a strong form:

Equivalence Condition (full version): Suppose H ⫤⊨ H′, E ⫤⊨ E′,
and K ⫤⊨ K′. Then E confirms (/disconfirms) H relative to background K just in case E′ confirms (/disconfirms) H′ relative to background K′.
Here we can think of K as a conjunction of all the propositions in an agent's background corpus, just as E is often a conjunction of multiple pieces of evidence. This full version of the Equivalence Condition captures the idea that two logically equivalent propositions should enter into confirmation relations in all the same ways as each other. Our next candidate constraint is the

Entailment Condition: For any consistent E, H, and K in L, if E & K ⊨ H but K ⊭ H, then E confirms H relative to K.
This condition enshrines the idea that entailing a hypothesis is one way to support, or provide evidence for, that hypothesis. If E entails H in light of background corpus K (in other words, if E and K together entail H), then E confirms H relative to K. The only exception to this rule is when K already entails H, in which case the fact that E and K together entail H does not indicate any particular relation between E and H.5 Notice that a tautological H will be entailed by every K, so this restriction on the Entailment Condition prevents the condition from saying anything about the confirmation of tautologies. Hempel thinks of his adequacy conditions as applying only to empirical hypotheses and bodies of evidence, so he generally restricts them to logically contingent Es and Hs.

Hempel considers a number of adequacy conditions motivated by the following intuition:

Confirmation Transitivity: For any A, B, C, and K in L, if A confirms B relative to K and B confirms C relative to K, then A confirms C relative to K.

It's tempting to believe confirmation is transitive, as well as other nearby notions such as justification or evidential support. This temptation is buttressed by the fact that logical entailment is transitive. Confirmation, however, is not in general transitive. Here's an example of Confirmation Transitivity failure: Suppose our background is the fact that a card has just been selected at random from a standard 52-card deck. Consider these three propositions:

A: The card is a spade.
B: The card is the Jack of spades.
C: The card is a Jack.
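The verdicts stated in the next paragraph can be checked with elementary deck arithmetic. The following sketch is my own illustration, not part of the text; it compares conditional probabilities to unconditional ones, the probabilistic reading of support that Section 6.2 will make official.

    from fractions import Fraction as F

    # Standard 52-card deck: 13 spades, 4 Jacks, 1 Jack of spades.
    pr_B         = F(1, 52)   # Jack of spades
    pr_C         = F(4, 52)   # some Jack
    pr_B_given_A = F(1, 13)   # Jack of spades, given a spade
    pr_C_given_B = F(1, 1)    # a Jack, given the Jack of spades
    pr_C_given_A = F(1, 13)   # a Jack, given a spade

    print(pr_B_given_A > pr_B)   # True:  A raises the probability of B
    print(pr_C_given_B > pr_C)   # True:  B raises the probability of C
    print(pr_C_given_A > pr_C)   # False: A leaves the probability of C unchanged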
Relative to our background, A would confirm B, at least to some extent. And relative to our background, B clearly would confirm C. But relative to the background that a card was picked from a fair deck, A does nothing to support the conclusion that C.

The failure of Confirmation Transitivity has a number of important consequences. First, it explains why in the study of confirmation we take evidence to be propositional rather than objectual. In everyday language we often use "evidence" to refer to objects rather than propositions; police don't store propositions in their Evidence Room. But as possible entrants into confirmation relations, objects have an ambiguity akin to the Reference Class Problem (Section 5.1.1). Should I consider this bird evidence that all ravens are black? If we describe the bird as a black raven, the answer might be yes. But if we describe it as a black raven found in the Hall of Atypically-Colored Birds, the answer seems to be no. Yet a black raven in the Hall of Atypically-Colored Birds is still a black raven. If confirmation were transitive, knowing that a particular description of an object confirmed a hypothesis would guarantee that more precise descriptions confirmed the hypothesis as well. Logically stronger descriptions (it's a black raven in the Hall of Atypically-Colored Birds) entail logically weaker descriptions (it's a black raven) of the same object; by the Entailment Condition, the logically stronger description confirms the logically weaker; so if confirmation were transitive anything confirmed by the weaker description would be confirmed by the stronger as well. But confirmation isn't transitive, so putting more or less information in our description of the very same object can alter what's confirmed. (Black raven? Might confirm the ravens hypothesis. Black raven in the Hall of Atypically-Colored Birds? Disconfirms. Black raven mistakenly placed in the Hall of Atypically-Colored Birds when it shouldn't have been? Perhaps confirms again.) We solve this problem by letting propositions rather than objects enter into the confirmation relation. If we state our evidence as a proposition—such as the proposition that I observed a black raven in the Hall of Atypically-Colored Birds—there's no question how the objects involved are being described.

Confirmation's lack of transitivity also impacts epistemology more broadly. For instance, it may cause trouble for Richard Feldman's (2007) principle that "evidence of evidence is evidence". Suppose I read in a magazine that anthropologists have reported evidence that Neanderthals cohabitated with homo sapiens. I don't actually have the anthropologists' evidence for that
hypothesis—the body of information that they think supports it. But the magazine article constitutes evidence that they have such evidence; one might think that the magazine article therefore also constitutes evidence that Neanderthals and homo sapiens cohabitated. (After all, reading the article seems to provide me with some justification for that hypothesis.) Yet we cannot adopt this "evidence of evidence is evidence" principle with full generality. Suppose I've randomly picked a card from a standard deck and examined it carefully. If I tell you my card is a spade, you have evidence that I know my card is the Jack of spades. If I know my card is the Jack of spades, I have (very strong) evidence that my card is a Jack. Yet your evidence that my card is a spade is not evidence that my card is a Jack.6

Finally, the failure of Confirmation Transitivity shows what's wrong with two confirmation constraints Hempel embraces:

Consequence Condition: If E in L confirms every member of a set of propositions relative to K and that set jointly entails H′ relative to K, then E confirms H′ relative to K.

Special Consequence Condition: For any E, H, H′, and K in L, if E confirms H relative to K and H & K ⊨ H′, then E confirms H′ relative to K.
The Consequence Condition states that if a set of propositions together entails a hypothesis (relative to some background), then any evidence that confirms every member of the set also confirms that hypothesis (relative to that background). The Special Consequence Condition says that if a single proposition entails some hypothesis, then anything that confirms the proposition also confirms the hypothesis (again, all relative to some background corpus). The Special Consequence Condition is so-named because it can be derived from the Consequence Condition (by considering singleton sets).

Yet each of these conditions is a bad idea, as can be demonstrated by our earlier Jack of spades example. Proposition A that the card is a spade confirms proposition B that it's the Jack of spades; B entails proposition C that the card is a Jack; yet A does not confirm C. We can even create examples in which H entails H′ relative to K, but evidence E which confirms H disconfirms H′ relative to K. Relative to the background corpus most of us have concerning the kinds of animals people keep as pets, the evidence E that Bob's pet is hairless confirms (at least slightly) the hypothesis H that Bob's pet is a Peruvian Hairless Dog. Yet relative to K that same evidence E disconfirms the hypothesis H′ that Bob's pet is a dog.7
Why might the Special Consequence Condition seem plausible? It certainly looks tempting if one reads "confirmation" in a particular way. In everyday language it's a fairly strong claim that a hypothesis has been "confirmed"; this suggests our evidence is sufficient for us to accept the hypothesis. (Consider the sentences "That confirmed my suspicion" and "Your reservation has been confirmed.") Reading "confirmed" that way might motivate us to endorse Glymour's view that "when we accept a hypothesis we commit ourselves to accepting all of its logical consequences" (1980, p. 31). This would tell us that evidence confirming a hypothesis also confirms its logical consequences, as the Special Consequence Condition requires.

But hopefully the discussion to this point has indicated that we are not using "confirms" in this fashion. On our use, evidence confirms a hypothesis if it provides any amount of support for that hypothesis; the support need not be decisive. We will often possess evidence that confirms a hypothesis without requiring or even permitting us to accept that hypothesis. If your only evidence about a card is that it's a spade, this evidence confirms in our sense that the card is the Jack of spades. But this evidence doesn't authorize you to accept or believe that the card is the Jack of spades.

Another motivation for the Special Consequence Condition—perhaps this was Hempel's motivation—comes from the way we often treat hypotheses in science. Suppose we make a set of atmospheric observations confirming a particular global warming hypothesis. Suppose further that in combination with our background knowledge, the hypothesis entails that average global temperatures will increase by five degrees in the next fifty years. It's very tempting then to say that the atmospheric observations support the conclusion that temperatures will rise five degrees in fifty years. Yet that's to unthinkingly apply the Special Consequence Condition.

I hope you're getting the impression that denying Confirmation Transitivity can have serious consequences for the ways we think about everyday and scientific reasoning. Yet it's important to realize that denying the Special Consequence Condition as a general principle does not mean that the transitivity it posits never holds. It simply means that we need to be careful about assuming confirmation will transmit across an entailment, and perhaps also that we need a precise, positive theory of confirmation to help us understand when it will and when it won't.

Rejecting the Special Consequence Condition does open up some intriguing possibilities in epistemology. Consider these three propositions:

E: I am having a perceptual experience as of a hand before me.
H: I have a hand.
H′: There is a material world.
This kind of evidence figures prominently in G.E. Moore's (1939) proof of the existence of an external world. Yet for some time it was argued that E could not possibly be evidence for H. The reasoning was, first, that E could not discriminate between H′ and various skeptical hypotheses (such as Descartes' evil demon), and therefore could not provide evidence for H′. Next, H entails H′, so if E were evidence for H it would be evidence for H′ as
well. But this step assumes the Special Consequence Condition. Epistemologists have recently explored positions that allow E to support H without supporting H′, by denying Special Consequence.8

Hempel's unfortunate endorsement of the Consequence Condition also pushes him towards the:

Consistency Condition: For any E and K in L, the set of all hypotheses confirmed by E relative to K is logically consistent with E & K.

In order for the set of all hypotheses confirmed by E to be consistent with E & K, it first has to be a logically consistent set in its own right. So among other things, the Consistency Condition bans a single piece of evidence from confirming two hypotheses that are mutually exclusive with each other. It seems easy to generate confirmational examples that violate this requirement: evidence that a randomly drawn card is red confirms both the hypothesis that it's a heart and the hypothesis that it's a diamond, but these two confirmed hypotheses are mutually exclusive. Hempel also notes that scientists often find themselves in the position of entertaining a variety of theories that are mutually exclusive with each other; experimental data eliminates some of those theories while confirming all of the ones that remain. Yet Hempel is trapped into the Consistency Condition by his allegiance to the Consequence Condition. Taken together, the propositions in an inconsistent set entail a contradiction; so any piece of evidence that confirmed all the members of an inconsistent set would also (by the Consequence Condition) confirm a contradiction. Hempel refuses to grant that anything could confirm a contradiction! So he tries to make the Consistency Condition work.9

Hempel rightly rejects the

Converse Consequence Condition: For any E, H, H′, and K (with H′ consistent with K), if E confirms H relative to K and H′ & K ⊨ H, then E confirms H′ relative to K.
The Converse Consequence Condition says that relative to a given background, evidence that confirms a hypothesis also confirms anything that entails that hypothesis. Here's a counterexample. Suppose our background knowledge is that a fair six-sided die has been rolled, and our propositions are:

E: The roll outcome is prime.
H: The roll outcome is odd.
H′: The roll outcome is 1.
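Before reading off the verdicts in the next paragraph, it may help to tabulate the outcomes. The following is a small sketch of my own (not part of the text) for a fair die, on which the primes are 2, 3, and 5 and the odds are 1, 3, and 5:

    from fractions import Fraction as F

    outcomes = range(1, 7)
    prime = {2, 3, 5}
    odd   = {1, 3, 5}

    pr_H          = F(len(odd), len(outcomes))         # 1/2
    pr_H_given_E  = F(len(prime & odd), len(prime))    # 2/3
    pr_H1_given_E = F(len(prime & {1}), len(prime))    # 0: 1 is not prime

    print(pr_H_given_E > pr_H)    # True: E (prime) raises the probability of H (odd)
    print(pr_H1_given_E)          # 0:    E rules out H' (the outcome is 1)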
In this case E confirms H relative to our background, H′ entails H, yet E refutes H′. (Recall that 1 is not a prime number!)

Still, there's a good idea in the vicinity of Converse Consequence. Suppose our background consists of the fact that we are going to run a certain experiment. A particular scientific theory, in combination with that background, entails that the experiment will produce a particular result. If this result does in fact occur when the experiment is run, we take that to support the theory. This is an example of the

Converse Entailment Condition: For any consistent E, H, and K in L, if H & K ⊨ E but K ⊭ E, then E confirms H relative to K.
Converse Entailment says that if, relative to a given background, a hypothesis entails some evidence, then that evidence confirms that hypothesis relative to that background. (Again, this condition omits cases in which the background K entails the experimental result E all on its own, because such cases need not reveal any connection between H and E.) Converse Entailment doesn't give rise to examples like the die roll case above (because in that case E is not entailed by either H or H′ in combination with K). But because deductive entailment is transitive, Converse Entailment does generate the problem of irrelevant conjunction. Consider the following propositions:

E: My pet is a flightless bird.
H: My pet is an ostrich.
H′: My pet is an ostrich and beryllium is a good conductor.
Here H entails E, so by the Converse Entailment Condition E confirms H, which seems reasonable.10 Yet despite the fact that H′ also entails E
(because H′ entails H), it seems worrisome that E would confirm H′. What does my choice in pets indicate about the conductivity of beryllium? Nothing—and that's completely consistent with the Converse Entailment Condition. Just because E confirms a conjunction one of whose conjuncts concerns beryllium doesn't mean E confirms that beryllium-conjunct all on its own. To assume that it does would be to assume the Special Consequence Condition, which we've rejected. So facts about my pet don't confirm any conclusions that are about beryllium but not about birds. On the other hand, it's reasonable that E would confirm H′ at least to some extent, by virtue of eliminating such rival hypotheses as "beryllium is a good conductor and my pet is an iguana." Rejecting the Special Consequence Condition therefore allows us to accept Converse Entailment.

But again, this should make us very careful about how we reason in our everyday lives. A scientific theory, for instance, will often have wide-ranging consequences, and might be thought of as a massive conjunction. When the theory makes a prediction that is borne out by experiment, that experimental result confirms the theory. But it need not confirm the rest of the theory's conjuncts, taken in isolation. In other words, experimental evidence that confirms a theory may not confirm that theory's further predictions.11

Finally, we should say something about disconfirmation. Hempel takes the following position:

Disconfirmation Duality: For any E, H, and K in L, E confirms H relative to K just in case E disconfirms ∼H relative to K.
Disconfirmation Duality pairs confirmation of a hypothesis with disconfirmation of its negation. It allows us to immediately convert many of our constraints on confirmation into constraints on disconfirmation. For example, the Entailment Condition now tells us that if E & K deductively refutes H (yet K doesn't refute H all by itself), then E disconfirms H relative to K. (See Exercise 6.2.) We should be careful, though, not to think of confirmation and disconfirmation as exhaustive categories: for many propositions E, H, and K, E will neither confirm nor disconfirm H relative to K.

Figure 6.1 summarizes the formal conditions on confirmation we have accepted and rejected. The task now is to find a positive theory of which evidence confirms which hypotheses relative to which backgrounds that satisfies the right conditions and avoids the wrong ones.
Figure 6.1: Accepted and rejected conditions on confirmation

Name                             Brief, Somewhat Imprecise Description                                    Verdict
Equivalence Condition            equivalent hypotheses, evidence, backgrounds behave the same             accepted
                                 confirmationally
Entailment Condition             evidence confirms what it entails                                        accepted
Converse Entailment Condition    a hypothesis is confirmed by what it entails                             accepted
Disconfirmation Duality          a hypothesis is confirmed just when its negation is disconfirmed         accepted
Confirmation Transitivity        anything confirmed by a confirmed hypothesis is also confirmed           rejected
Consequence Condition            anything entailed by a set of confirmed hypotheses is also confirmed     rejected
Special Consequence Condition    anything entailed by a confirmed hypothesis is also confirmed            rejected
Consistency Condition            all confirmed hypotheses are consistent                                  rejected
Converse Consequence Condition   anything that entails a confirmed hypothesis is also confirmed           rejected
Nicod's Criterion                Fa & Ga confirms (∀x)(Fx ⊃ Gx)                                           ???
6.2 Carnap's Theory of Confirmation

6.2.1 Confirmation as relevance
Carnap saw that we could get a confirmation theory with exactly the properties we want by basing it on probability. Begin by taking any probabilistic distribution Pr over L. (I've named it "Pr" because we aren't committed at this stage to its being any kind of probability in particular—much less a credence distribution. All we know is that it's a distribution over the propositions in L satisfying the Kolmogorov axioms.) Define Pr's background corpus K as the conjunction of all propositions X in L such that Pr(X) = 1.12 Given an E and H in L, we apply the Ratio Formula to calculate Pr(H | E). Two distinct theories of confirmation now suggest themselves: (1) E confirms H relative to K just in case Pr(H | E) is high; (2) E confirms H relative to K just in case Pr(H | E) > Pr(H). In the preface to the second edition of his Logical Foundations of Probability, Carnap calls the first of these options a "firmness" concept of confirmation and the second an "increase in firmness" concept.13 (1962, p. xvff.)

The firmness concept of confirmation has a number of problems. First,
there are questions about where exactly the threshold for a "high" value of Pr(H | E) falls, what determines that threshold, how we discover it, etc. Second, there will be cases in which E is irrelevant to H, yet Pr(H | E) is high because Pr(H) was already high. For example, take the background K that a fair lottery with a million tickets has been held, the hypothesis H that ticket 942 did not win, and the evidence E that elephants have trunks. In this example Pr(H | E) may very well be high, but that need not be due to any confirmation of lottery results by the endowments of elephants. Finally, the firmness concept doesn't match the confirmation conditions we approved in the previous section. Wherever the threshold for "high" is set, whenever E confirms H relative to K it will also confirm any H′ entailed by H. As a probability distribution, Pr must satisfy the Entailment rule and its extension to conditional probabilities (see Section 3.1.2), so if H ⊨ H′ then Pr(H′ | E) ≥ Pr(H | E). If Pr(H | E) surpasses the threshold, Pr(H′ | E) will as well. But that means the firmness concept of confirmation satisfies the Special Consequence Condition, to which we've already seen counterexamples.
Warning: Conflating firmness and increase in firmness, or just blithely assuming the firmness concept is correct, is one of the most frequent mistakes made in the confirmation literature and more generally in discussions of evidential support.14 For example, it is often claimed that an agent's evidence supports or justifies a conclusion just in case the conclusion is probable on that evidence. But for conclusions with a high prior, the conclusion may be probable on the evidence not because of anything the evidence is doing, but instead because the conclusion was probable all along. Then it's not the evidence that's justifying anything!
Increase in firmness has none of these disadvantages; it is the concept of confirmation we'll work with going forward. Given a probability distribution Pr with background K (as defined above), E confirms H relative to K just in case Pr(H | E) > Pr(H). In other words, given Pr, evidence E confirms H relative to K just in case E is positively relevant to H. We identify disconfirmation with negative relevance: Given Pr, E disconfirms H relative to K just in case Pr(H | E) < Pr(H). If Pr(H | E) = Pr(H), then E is irrelevant to H and neither confirms nor disconfirms it relative to K.

This account of confirmation meets exactly those conditions we endorsed
in the previous section: Disconfirmation Duality and the Equivalence, Entailment, and Converse Entailment Conditions. Disconfirmation Duality follows immediately from our definitions of positive and negative relevance. The Equivalence Condition follows from the Equivalence rule for probability distributions; logically equivalent propositions will always receive identical Pr-values. We get the Entailment Condition because if E, H, and K are consistent, E & K ⊨ H, but K ⊭ H, then Pr(H | E) = 1 while Pr(H) < 1. (You'll prove this in Exercise 6.3.) The key result for Converse Entailment was established in Exercise 4.4. Identifying confirmation with positive relevance yields an account of confirmation with the general contours we want, without our having to commit on the specific numerical values of Pr.
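To make the contrast with the firmness concept vivid, here is a minimal sketch (my own illustration, not the author's, written in plain Python with exact fractions). It applies the increase-in-firmness verdict to the lottery example from above, where Pr(H | E) is high but E is intuitively irrelevant:

    from fractions import Fraction as F

    def verdict(pr_h_given_e, pr_h):
        """Increase-in-firmness verdict: confirms, disconfirms, or irrelevant."""
        if pr_h_given_e > pr_h:
            return "confirms"
        if pr_h_given_e < pr_h:
            return "disconfirms"
        return "irrelevant"

    # H: ticket 942 did not win a fair million-ticket lottery.
    # E: elephants have trunks (intuitively irrelevant to the lottery).
    pr_h = F(999_999, 1_000_000)
    pr_h_given_e = F(999_999, 1_000_000)   # conditioning on E changes nothing

    print(verdict(pr_h_given_e, pr_h))     # "irrelevant", even though Pr(H | E) is high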
6.2.2 Finding the right function
Yet Carnap wants more than the general contours of confirmation—he wants a substantive theory that says in every case which bodies of evidence support which hypotheses relative to which backgrounds. A theory like that seems obtainable to Carnap because he sees confirmation as a logical relation. As with other logical relations, whether E confirms H relative to K is independent of the truth-values of those propositions and of the attitudes particular individuals adopt toward them. Like Hempel, Carnap thinks confirmation relations emerge from the logical form of propositions, and therefore can be captured by a syntactical theory working with strings of symbols representing logical forms. (Nicod's Criterion is a good example of a confirmation principle that works with logical form.) Enormous progress in formal deductive logic in the decades just before Logical Foundations makes Carnap confident that a formalism for inductive logic is within reach.

To construct the formalism Carnap wants, we begin with a formal language L.15 We then take each consistent corpus in that language (represented as a non-contradictory, conjunctive proposition K) and associate it with a particular Pr distribution over L. That done, we can test whether evidence E confirms hypothesis H relative to a particular K by seeing whether E is positively relevant to H on the Pr associated with that K. The crucial step for Carnap is to associate each K with the unique, correct distribution Pr. Of course Pr will assign an unconditional value of 1 to each conjunct of K, but that leaves a lot of latitude with respect to the members of L that aren't conjuncts of K. A full Pr distribution must be specified for each K so that for any E, H, and K we might select in L, there will be a definite answer to the question of whether E confirms, disconfirms, or is irrelevant to H on K. (Just as there's always a definite
answer as to whether a given P deductively entails a given Q, refutes it, or neither.) And it's important to get the right Pr for each K; the wrong Pr distribution could make evidential support counterinductive, or could have everyday evidence confirming skeptical hypotheses.

Even in a language L with finitely many atomic propositions, there will typically be many, many possible non-contradictory background corpora K. Specifying a Pr-distribution for each such K could be a great deal of trouble. Carnap simplifies the process by constructing every Pr from a single, regular probability distribution over L that he calls m. As a regular probability distribution, m contains no contingent evidence. (In other words, m has a tautological background corpus.) The Pr distribution relative to any non-contradictory K is then specified as m(· | K). (This guarantees that Pr(K) = 1.) Evidence E confirms hypothesis H relative to K just in case Pr(H | E) > Pr(H), which is equivalent to m(H | E & K) > m(H | K). So instead of working with particular Pr-distributions we can now focus our attention on m.16

m also fulfills a number of other roles for Carnap. Carnap thinks of an agent's background corpus at a given time as her total evidence at that
time. Moreover, if an agent's total evidence is K, Carnap thinks m(H | K) provides the logical probability of H on her total evidence. Carnap thinks that logical probabilities dictate rational credences. A rational agent with total evidence K will assign credence cr(H) = m(H | K) for any H in L. Since m is a particular, unique distribution, this means there is a unique credence any agent is required to assign a particular proposition H given a body of total evidence K. So Carnap endorses the Uniqueness Thesis (Section 5.1.2), with m playing the role of the uniquely rational hypothetical prior distribution. On Carnap's view, logic provides the unique correct evidential standard that all rational agents should apply, represented numerically by the distribution m. Carnap is thus an Objective Bayesian in both senses of the term: in the normative sense, because he thinks there's a unique rational hypothetical prior; and in the semantic sense, because he defines "probability" as an objective concept independent of agents' particular attitudes.17

m allows us to separate out two questions that are sometimes run together in the confirmation literature. Up until now we have been asking whether evidence E confirms hypothesis H relative to background corpus K. For Carnap this question can be read: For a rational agent with total evidence K, would some further piece of evidence E be positively relevant to H? Carnap answers this question by checking whether m(H | E & K) > m(H | K). But we might also ask about confirmational relations involving K itself. Bayesians sometimes ask how an agent's total evidence bears on
a hypothesis—does the sum total of information in the agent's possession tell in favor of or against H? From a Carnapian perspective this question is usually read as comparing the probability of H on K to H's probability relative to a tautological background. So we say that the agent's total evidence K confirms H just in case m(H | K) > m(H).

Carnap doesn't just talk about the hypothetical distribution m; he provides a recipe for calculating its numerical values. To see how, let's begin with a very simple language, containing only one predicate F and two constants a and b. This language has only two atomic propositions (Fa and Fb), so we can specify distribution m over the language using a probability table with four state-descriptions. Carnap runs through a few candidates for distribution m; he calls the first one m†:
Fa   Fb   m†
T    T    1/4
T    F    1/4
F    T    1/4
F    F    1/4
In trying out various candidates for m, Carnap is attempting to determine the logical probabilities of particular propositions relative to a tautological background. m† captures the natural thought that a tautological background should treat each of the available possibilities symmetrically. m† applies a principle of indifference and assigns each state-description the same value.18 Yet m† has a serious drawback:

m†(Fb | Fa) = m†(Fb) = 1/2    (6.1)
On m†, Fa is irrelevant to Fb; so according to m†, Fa does not confirm Fb relative to the empty background. Carnap thinks the fact that one object has property F should confirm that the next object will have F, even against a tautological background. Yet m† does not yield this result. Even worse, the failure continues as m† is extended to larger languages. m† makes each proposition Fa, Fb, Fc, etc. independent not only of each of the others but also of logical combinations of the others; even the observation that 99 objects all have property F will not confirm that the 100th object is an F. (See Exercise 6.4.) This is an especially bad result because m† is supposed to play the role of unique hypothetical prior for rational agents. According to m†, if a rational agent's total evidence consists of the fact that 99 objects all have property F, this total evidence does not confirm in the slightest that the next object will have F. m† does not allow "learning from experience"; as Carnap puts it,
The choice of [m†] as the degree of confirmation would be tantamount to the principle never to let our past experiences influence our expectations for the future. This would obviously be in striking contradiction to the basic principle of all inductive reasoning. (1950, p. 565)

Carnap wants a theory of confirmation that squares with commonsense notions of rational inductive reasoning; m† clearly fails in that role. To address this problem, Carnap proposes distribution m*. According to m*, logical probability is indifferent not among the state-descriptions in a language but instead among its structure-descriptions. To understand structure-descriptions, start by thinking about property profiles. A property profile specifies exactly which of the language's predicates an object does or does not satisfy. In a language with the single predicate F, the two available property profiles would be "this object has property F" and "this object lacks property F." In a language with two predicates F and G, the property profiles would be "this object has both F and G," "this object has F but lacks property G," etc. Given a language L, a structure-description describes how many objects in the universe of discourse possess each of the
available property profiles, but doesn't say which particular objects possess which profiles. For example, the language containing one property F and two constants a and b has the two property profiles just mentioned. Since there are two objects, this language allows three structure-descriptions: "both objects have F", "one object has F and one object lacks F", and "both objects lack F". Written in disjunctive normal form, the three structure-descriptions are:

i.   Fa & Fb
ii.  (Fa & ∼Fb) ∨ (∼Fa & Fb)
iii. ∼Fa & ∼Fb
(6.2)
Note that one of these structure-descriptions is a disjunction of multiple state-descriptions.19

m* works by assigning equal value to each structure-description in a language. If a structure-description contains multiple state-description disjuncts, m* then divides the value of that structure-description equally among its state-descriptions. For our simple language, the result is:
Fa   Fb   m*
T    T    1/3
T    F    1/6
F    T    1/6
F    F    1/3
Each structure-description receives m*-value 1/3; the structure-description containing the middle two lines of the table divides its m*-value between them.

m* allows learning from experience. From the table above, we can calculate

m*(Fb | Fa) = 2/3 > 1/2 = m*(Fb)    (6.3)
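Equations (6.1) and (6.3) can be re-derived mechanically from the two tables above. The sketch below is my own illustration (plain Python with exact fractions), not part of the text:

    from fractions import Fraction as F

    # State-descriptions for the language with one predicate F and constants a, b,
    # written as (Fa, Fb) truth-value pairs.
    states = [(True, True), (True, False), (False, True), (False, False)]

    m_dagger = {s: F(1, 4) for s in states}                      # indifference over state-descriptions
    m_star   = {(True, True): F(1, 3), (True, False): F(1, 6),   # indifference over structure-descriptions
                (False, True): F(1, 6), (False, False): F(1, 3)}

    def prob(m, pred):
        return sum(v for s, v in m.items() if pred(s))

    def cond(m, pred, given):
        return prob(m, lambda s: pred(s) and given(s)) / prob(m, given)

    Fa = lambda s: s[0]
    Fb = lambda s: s[1]

    print(cond(m_dagger, Fb, Fa), prob(m_dagger, Fb))   # 1/2 and 1/2: no learning, as in (6.1)
    print(cond(m_star, Fb, Fa), prob(m_star, Fb))       # 2/3 and 1/2: learning from experience, as in (6.3)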
On m*, the fact that a possesses property F confirms that b will have F relative to the tautological background. Nevertheless, m* falls short in a different way. Suppose our language contains two predicates F and G and two constants a and b. Carnap thinks that on the correct, logical m-distribution we should have
m(Fb | Fa & Ga & Gb) > m(Fb | Fa) > m(Fb | Fa & Ga & ∼Gb) > m(Fb)    (6.4)
While evidence that a has F should increase a rational agent's confidence that b has F, that rational confidence should increase even higher if we throw in the evidence that a and b share property G. If a and b both have G, in some sense they're the same kind of object, so one should expect them to be alike with respect to F as well. When I tell you that one object in my possession has a beak, this might make you more confident that the other object in my possession is beaked as well. But if you already know that both objects are animals of the same species, beak information about one is much more relevant to your beak beliefs about the other. On the other hand, information that a and b are unalike with respect to G should make F-facts about a less relevant to F-beliefs about b. Telling you that my two objects are not animals of the same species reduces the relevance of beak information about one object to beak conclusions about the other. In general, Carnap thinks a successful m-distribution should track these analogical effects, expressed in Equation (6.4).

To see if Equation (6.4) holds for m*, one would need to identify the structure-descriptions in this language. The available property profiles are: object has both F and G, object has F but not G, object has G but not F, object has neither. Some examples of structure-descriptions are: both objects have F and G, one object has both F and G while the other has
neither, one object has F but not G while the other object has G but not F, etc. I'll leave the details to the reader (see Exercise 6.5), but suffice it to say that m* is unable to capture the analogical effects of Equation (6.4).

Carnap responded to this problem (and others) by introducing a continuum of m-distributions with properties set by two adjustable parameters.20 The parameter λ was an "index of caution", controlling how reluctant m made an agent to learn from experience. m† was the m-distribution with λ-value ∞ (because it made the agent infinitely cautious and forbade learning from experience), while m* had λ-value 2. Adjusting the other parameter, γ, made analogical effects possible. Carnap suggested the values of these parameters be set by pragmatic considerations, which threatened the Objective Bayesian aspects of his project. Even then, Mary Hesse (1963, p. 121) and Peter Achinstein (1963) uncovered more subtle learning effects that even Carnap's parameterized m-distributions were unable to capture. In the end, Carnap never constructed an m-distribution (or set of m-distributions) with which he was entirely satisfied.
6.3 Grue
Nelson Goodman (1946, 1955) offered another kind of challenge to Hempel and Carnap's theories of confirmation. Here is the famous passage:

Suppose that all emeralds examined before a certain time t are green. At time t, then, our observations support the hypothesis that all emeralds are green; and this is in accord with our definition of confirmation. Our evidence statements assert that emerald a is green, that emerald b is green, and so on; and each confirms the general hypothesis that all emeralds are green. So far, so good. Now let me introduce another predicate less familiar than "green". It is the predicate "grue" and it applies to all things examined before t just in case they are green but to other things just in case they are blue. Then at time t we have, for each evidence statement asserting that a given emerald is green, a parallel evidence statement asserting that that emerald is grue. And the statements that emerald a is grue, that emerald b is grue, and so on, will each confirm the general hypothesis that all emeralds are grue. Thus according to our definition, the prediction that all emeralds subsequently examined will be green and the
prediction that all will be grue are alike confirmed by evidence statements describing the same observations. But if an emerald subsequently examined is grue, it is blue and hence not green. Thus although we are well aware which of the two incompatible predictions is genuinely confirmed, they are equally well confirmed according to our definition. Moreover, it is clear that if we simply choose an appropriate predicate, then on the basis of these same observations we shall have equal confirmation, by our definition, for any prediction whatever about other emeralds. (1955, pp. 73–4)

The target here is any theory of confirmation on which the observation that multiple objects all have property F confirms that the next object will have F as well. As we saw, Carnap built this "learning from experience" feature into his theory of confirmation. It was also a feature of Hempel's positive theory of confirmation, so Goodman is objecting to both Carnap's and Hempel's theories. We will focus on the consequences for Carnap, since I did not present the details of Hempel's positive approach.

Goodman's concern is as follows: Suppose we have observed 99 emeralds before time t, and they have all been green. On Carnap's theory, this total evidence confirms the hypothesis that the next emerald observed will be green. So far, so good. But Goodman says this evidence can be re-expressed as the proposition that the first 99 emeralds are grue. On Carnap's theory, this evidence confirms the hypothesis that the next emerald observed will be grue. But for the next emerald to be grue it must be blue. Thus it seems that on Carnap's theory our evidence confirms both the prediction that the next emerald will be green and the prediction that the next emerald will be blue. Goodman thinks it's intuitively obvious that the former prediction is confirmed by our evidence while the latter is not, so Carnap's theory is getting things wrong.

Let's look more carefully at the details. Begin with a language L containing constants a1 through a100 representing objects, and predicates G and O representing the following properties:

Gx: x is green
Ox: x is observed by time t
We then define "grue" as follows in language L:

Gx ≡ Ox: x is grue; it is either green and observed by time t or non-green and not observed by t
The grue predicate says that the facts about whether an emerald is green match the facts about whether it was observed by t.21 Goodman claims that according to Carnap's theory, our total evidence in the example confirms (∀x)Gx and Ga100 (which is good), but also (∀x)(Gx ≡ Ox) and Ga100 ≡ Oa100 (which are supposed to be bad).
But what exactly is our evidence in the example? Goodman agrees with Hempel that in assessing confirmation relations we must explicitly and precisely state the contents of our total evidence. Evidence that the first 99 emeralds are green would be:

E: Ga1 & Ga2 & . . . & Ga99
But E neither entails nor is equivalent to the statement that the first 99 emeralds are grue (because it doesn't say anything about whether those emeralds' G-ness matches their O-ness), nor does E confirm (∀x)(Gx ≡ Ox) on Carnap's theory. A better statement of the evidence would be:

E′: (Ga1 & Oa1) & (Ga2 & Oa2) & . . . & (Ga99 & Oa99)
Here we've added an important fact included in the example: that emeralds a1 through a99 are observed by t. This evidence statement entails both that all those emeralds were green and that they all were grue. A bit of technical work with Carnap's theory22 will also show that according to that theory, E′ confirms (∀x)Gx, Ga100, (∀x)(Gx ≡ Ox), and Ga100 ≡ Oa100. It looks like Carnap is in trouble. As long as his theory is willing to "project" past observations of any property onto future predictions that that property will appear, it will confirm grue predictions alongside green predictions. The theory seems to need a way of preferring greenness over grueness for projection purposes; it seems to need a way to play favorites among properties.

Might this need be met by a technical fix? One obvious difference between green and grue is the more complex logical form of the grue predicate in L. There's also the fact that the definition of "grue" involves a predicate O
that makes a reference to times; perhaps for purposes of induction predicates referring to times are suspicious. So maybe we could build a new version of Carnap's theory that only projects logically simple predicates, or predicates that involve no reference to time. Yet Goodman shows that we can point all these distinctions in the other direction by re-expressing the problem in an alternate language L′, built on the following two predicates:

GRx: x is grue
Ox: x is observed by time t
We can define the predicate "green" in language L1; it will look like this:

GRx ≡ Ox: x is green; it is either grue and observed by time t or non-grue and not observed by t

Expressed in L1, the evidence E1 is

E1: (GRa1 & Oa1) & (GRa2 & Oa2) & . . . & (GRa99 & Oa99)
This expression of E1 in L1 is true in exactly the same possible worlds as the expression of E1 we gave in L. And once more, when applied to L1 Carnap's theory has E1 confirming both that all emeralds are grue and that they are green, and that a100 will be grue and that it will be green. But in L1 all the features that were supposed to help us discriminate against grue now work against green—it's the definition of greenness that is logically complex and involves the predicate O referring to time. If you believe that it's logical complexity or reference to times that makes the difference between green and grue, you now need a reason to prefer the expression of the problem in language L over its expression in L1.

This is why Goodman's grue problem is sometimes described as a problem of language dependence: We could build a formal confirmation theory that projected logically simple predicates but not logically complex ones, yet such a theory would yield different answers when the very same problem was expressed in different languages (such as L and L1).

Why is language dependence such a concern? Recall that Hempel endorsed the Equivalence Condition in part because he didn't want confirmation to depend on the particular way hypotheses and evidence were presented. For theorists like Hempel and Carnap who take confirmational relations to be objective, it shouldn't make a difference how particular subjects choose to represent certain propositions linguistically. Two scientists shouldn't draw different conclusions from the same data just because one speaks English and the other speaks Japanese!23 Hempel and Carnap sought a theory of confirmation that worked exclusively with the syntactical forms of propositions represented in language. Goodman charges that such theories can yield consistent verdicts only if appropriate languages are selected for them to operate within. Since a syntactical theory operates only once a language has been provided, it cannot choose among languages for us. Goodman concludes that "Confirmation of a hypothesis by an instance depends rather heavily upon features of the hypothesis other than its syntactical form." (1955, pp. 72–3)
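If the claim that the L and L1 expressions of E1 agree in every possible world seems suspicious, it can be checked mechanically, conjunct by conjunct. The short sketch below (Python, which is of course no part of Goodman's or Carnap's apparatus; the variable names are mine) runs through the four possible combinations of Gx and Ox for a single emerald and confirms that "Gx & Ox" and "GRx & Ox" come out true in exactly the same cases once GRx is defined as Gx ≡ Ox.

    from itertools import product

    # For one emerald x, compare the conjunct of E1 as written in L ("Gx & Ox")
    # with its counterpart as written in L1 ("GRx & Ox"), where the grue
    # predicate GRx is defined in L as Gx ≡ Ox.
    for Gx, Ox in product([True, False], repeat=2):
        GRx = (Gx == Ox)               # x is grue
        conjunct_in_L = Gx and Ox      # x is green and observed by t
        conjunct_in_L1 = GRx and Ox    # x is grue and observed by t
        assert conjunct_in_L == conjunct_in_L1
    print("The two expressions of E1 agree in every possible world.")

The equivalence is unsurprising: whenever Ox holds, GRx and Gx have the same truth value. The point of the grue problem is precisely that this purely notational difference is the only difference between the two languages.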
Warning: It is sometimes suggested that—although this is certainly not a syntactical distinction—the grue hypothesis can be dismissed out of hand on the grounds that it is "metaphysically weird". This usually involves reading "All emeralds are grue" as saying that all the emeralds in the universe are green before time t then switch to being blue after t. But that reading is neither required to get the problem going nor demanded by anything in (Goodman 1946) or (Goodman 1955). Suppose, for instance, that each emerald in the universe is either green or blue, and no emerald ever changes color. By an unfortunate accident, it just so happens that the emeralds you observe before t are all and only the green emeralds. In that case it will be true that all emeralds are grue, and no metaphysical sleight-of-hand is required.
As the previous warning suggests, the metaphysical details of Goodman's grue example have sometimes obscured its philosophical point. "Grue" indicates a correlation between two properties: being green and being observed by time t. It happens to be a perfect correlation, expressed by a biconditional. Some such correlations are legitimately projectible in science: If you observe that fish are born with a fin on the left side whenever they are born with a fin on the right, this bilateral symmetry is a useful, projectible biconditional correlation. The trouble is that any sizable body of data will contain many correlations, and we need to figure out which ones to project as regularities that will extend into the future. (The women in this meeting room all have non-red hair, all belong to a particular organization, and all are under 6 feet tall. Which of those properties will also be exhibited by the next woman to enter the room?) Grue is a particularly odd, particularly striking example of a spurious correlation, but is emblematic of the problem of sorting projectible from unprojectible hypotheses.24 It is not at all essential to the example that one of the properties involved refers to times, or that one of the properties is a relatively simple physical property (color). Sorting spurious from significant correlations is a general problem, for all sorts of variables.25

Goodman offers his own proposal for detecting projectible hypotheses, and many authors have made further proposals since then. Instead of investigating those, I'd like to examine exactly what the grue problem establishes about Carnap's theory (and others). The first thing to note is that although
evidence E1 confirms on Carnap's theory that emerald a100 is grue, it does not confirm that emerald a100 is blue. Recall that Carnap offers a hypothetical prior distribution m* that is supposed to capture the unique, logically mandated ultimate evidential standard. A hypothesis H is confirmed by E1 (the total evidence we settled on for the grue example) just in case m*(H | E1) > m*(H). For example, it turns out that

m*(Ga100 | E1) > m*(Ga100)     (6.5)
So on Carnap's theory, E1 confirms Ga100. But if that's true, then E1 must be negatively relevant to ~Ga100, the proposition that emerald a100 is blue. So while E1 confirms that a100 is green and confirms that a100 is grue, it does not confirm that a100 is blue. How is this possible, given that ~Oa100 (i.e., a100 is not observed by t)? The key point is that ~Oa100 is not stated in the total evidence E1. E1 says that every emerald a1 through a99 was observed by t and is green. If that's all we put into the evidence, that evidence is going to confirm that a100 both is green and was observed by t. After all, if every object described in the evidence has the property Gx & Ox, Carnapian "learning from experience" will confirm that other objects have that property as well. Once we understand that Carnap's theory is predicting from E1 that a100 bears both Ox and Gx, the prediction that a100 will have Gx ≡ Ox is no longer so startling.

In fact, the assessment of E1 one gets from Carnap's theory is intuitively plausible. If all you knew about the world was that there existed 99 objects and all of them were green and observed before t, you would expect that if there were a 100th object it would be green and observed before t as well. In other words, you'd expect the 100th object to be grue—by virtue of being green (and observed), not blue!26 We can read the prediction that the 100th object is grue as a prediction that it's not green only if we smuggle extra background knowledge into the case—namely, the assumption that a100 is an unobserved emerald. (This is similar to what happened in Hempel's analysis of the Paradox of the Ravens.) What happens if we explicitly state this extra fact, by adding to the evidence that a100 is not observed by t?

E2: (Ga1 & Oa1) & (Ga2 & Oa2) & . . . & (Ga99 & Oa99) & ~Oa100
Skipping the calculations (see Exercise 6.6), it turns out that

m*(Ga100 ≡ Oa100 | E2) = m*(~Ga100 | E2) = m*(Ga100 | E2) = m*(Ga100) = 1/2     (6.6)
On Carnap's probabilistic distribution m*, E2 confirms neither that a100 will be grue, nor that a100 will be green, nor—for that matter—that all emeralds are grue or that all emeralds are green.

Perhaps it's a problem for Carnap's theory that none of these hypotheses are confirmed by E2, when intuitively some of them should be. Or perhaps it's a problem that on m*, E1 confirms that all emeralds are grue—even if that doesn't have the consequence of confirming that the next emerald will be blue. Suffice it to say that while language-dependence problems can be found for Carnap's theory as well as various other positive theories of confirmation,27 it's very subtle to determine exactly where those problems lie and what their significance is.
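Readers who want to see these verdicts emerge from the machinery can compute m* directly for a miniature version of the example. The sketch below is a rough illustration in Python, not anything Carnap provides: it uses three emeralds a1 through a3 in place of the chapter's one hundred, and the helper names (m_star, conditional, and so on) are my own. It builds every state-description for the predicates G and O, weights them by Carnap's rule of splitting probability equally across structure-descriptions and then equally within each, and checks the analogues of Equations (6.5) and (6.6).

    from itertools import product
    from fractions import Fraction
    from collections import Counter

    n = 3  # a miniature language: emeralds a1..a3 instead of a1..a100

    # A state-description assigns each object one of four "cells":
    # (green?, observed-by-t?).
    cells = list(product([True, False], repeat=2))
    state_descriptions = list(product(cells, repeat=n))

    # Two state-descriptions share a structure-description when one can be
    # obtained from the other by permuting constants, i.e. when they contain
    # the same multiset of cells.
    def structure(sd):
        return tuple(sorted(Counter(sd).items()))

    structure_sizes = Counter(structure(sd) for sd in state_descriptions)
    num_structures = len(structure_sizes)

    def m_star(prop):
        # m*: equal weight to each structure-description, split equally
        # among the state-descriptions within it.
        total = Fraction(0)
        for sd in state_descriptions:
            if prop(sd):
                total += Fraction(1, num_structures) / structure_sizes[structure(sd)]
        return total

    def conditional(prop, evidence):
        return m_star(lambda sd: prop(sd) and evidence(sd)) / m_star(evidence)

    G = lambda i: (lambda sd: sd[i][0])   # ai is green
    O = lambda i: (lambda sd: sd[i][1])   # ai is observed by t

    E1 = lambda sd: G(0)(sd) and O(0)(sd) and G(1)(sd) and O(1)(sd)
    grue_last = lambda sd: G(2)(sd) == O(2)(sd)    # Ga3 ≡ Oa3

    print(conditional(G(2), E1), m_star(G(2)))           # 2/3 > 1/2: green confirmed
    print(conditional(grue_last, E1), m_star(grue_last)) # 2/3 > 1/2: grue confirmed too

    E2 = lambda sd: E1(sd) and not O(2)(sd)              # add ~Oa3 to the evidence
    print(conditional(G(2), E2), m_star(G(2)))           # both 1/2: no confirmation

Run as written, the miniature agrees with the text: conditional on the E1 analogue both Ga3 and Ga3 ≡ Oa3 rise from 1/2 to 2/3, while adding ~Oa3 to the evidence pushes the probability of Ga3 back down to 1/2.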
6.4 Subjective Bayesian confirmation
I began my discussion of Carnap's confirmation theory by pointing out its central insight: We can get a confirmation relation with exactly the features we want by equating confirmation with probabilistic relevance. In 1980, Clark Glymour reported the influence of this insight on philosophical theories of confirmation:

Almost everyone interested in confirmation theory today believes that confirmation relations ought to be analysed in terms of probability relations. Confirmation theory is the theory of probability plus introductions and appendices. Moreover, almost everyone believes that confirmation proceeds through the formation of conditional probabilities of hypotheses on evidence. The basic tasks facing confirmation theory are thus just those of explicating and showing how to determine the probabilities that confirmation involves, developing explications of such metascientific notions as "confirmation," "explanatory power," "simplicity," and so on in terms of functions of probabilities and conditional probabilities, and showing that the canons and patterns of scientific inference result. It was not always so. Probabilistic accounts of confirmation really became dominant only after the publication of Carnap's Logical Foundations of Probability, although of course many probabilistic accounts had preceded Carnap's. An eminent contemporary philosopher has compared Carnap's achievement in inductive logic with Frege's in deductive logic: just as before Frege there was only a small and theoretically uninteresting collection of principles of deductive inference, but
after him the foundation of a systematic and profound theory of demonstrative reasoning, so with Carnap and inductive reasoning. (1980, pp. 64–5)

Carnap holds that if a rational agent's credence distribution is cr, evidence E confirms hypothesis H relative to that agent's background corpus just in case cr(H | E) > cr(H). The distinctive feature of Carnap's positive theory is that he thinks only one credence distribution is rationally permissible for each agent given her total evidence: the distribution obtained by conditionalizing the "logical probability" distribution m on her background corpus. So if we wanted, we could give Carnap's entire account of confirmation without mentioning agents at all: E confirms H relative to K just in case m(H | E & K) > m(H | K). In the end, confirmation facts are logical and objective, existing "out there" among the propositions. Carnap's commitment to Uniqueness makes him an Objective Bayesian in the normative sense.

Subjective Bayesians appreciate Carnap's central insight about probabilistic relevance, and agree with him that a piece of evidence E confirms hypothesis H relative to an agent's credence distribution
cr just in case cr(H | E) > cr(H). But these points of agreement are separable from the commitment to a unique distribution m determining the correct cr-distribution relative to total evidence for all rational agents. Subjective Bayesians think that specifying an agent's background corpus/total evidence K is insufficient to fully determine her rational credences. They are willing to let different rational agents construct their credences using different hypothetical priors, encoding those agents' differing evidential standards. So two rational agents with the same total evidence may assign different credences to the same propositions. This makes agents' particular credence distributions much more significant to the Subjective Bayesian account of confirmation than they are to Carnap's approach.

For Subjective Bayesians, whether E confirms H cannot be relative simply to a background corpus, because such a corpus is insufficient to determine an entire probability distribution. Without a unique function m to rely on, Subjective Bayesians need something else to fill out the details around K and generate a full distribution. For this, they usually rely on the opinions of a particular agent. A Subjective Bayesian will say that E confirms H for a specific agent just in case cr(H | E) > cr(H) on that agent's current credence distribution cr. A piece of evidence confirms a hypothesis for an agent when the evidence is positively relevant to the hypothesis relative to that agent's current credences. Put another way, evidence confirms a hypothesis for an agent just in case conditionalizing on
that evidence would increase her confidence in the hypothesis.

Since Subjective Bayesians permit rational agents to assign different credence distributions (even against the same background corpus!), this means that the same evidence will sometimes confirm different hypotheses for different rational agents. Two agents with, say, different levels of trust in authority might draw differing conclusions about whether a particular fact confirms that Oswald acted alone or is further evidence of government conspiracy. For a Subjective Bayesian, it may turn out that due to differences in the agents' credence distributions, the fact in question confirms one hypothesis for one agent and a different hypothesis for the other. There need be no independent, absolute truth about what's really confirmed. And this confirmatory difference need not be traceable to any difference in the agents' background corpora; the agents may possess different credence distributions because of differences in their evidential standards, even when their bodies of total evidence are the same.

The Subjective Bayesian approach to confirmation is still a probabilistic relevance account, so it still displays all the desirable features we identified in Section 6.1.2. Take any credence distribution cr a rational agent might assign that satisfies the Kolmogorov axioms and Ratio Formula. Let K be the conjunction of all propositions X in L such that cr(X) = 1. Now specify that E confirms H relative to K and that credence distribution just in case cr(H | E) > cr(H). Confirmation relative to cr will now display exactly the features we accepted in Figure 6.1: Disconfirmation Duality and the Equivalence, Entailment, and Converse Entailment Conditions. So, for instance, one will get the desirable result that relative to any rational credence distribution, a hypothesis is confirmed by evidence it entails.28

While Subjective Bayesians usually talk about confirmation relative to a particular agent's credence distribution, they are not committed to do so. The central claim of the Subjective Bayesian account of confirmation is that confirmation is always relative to some probability distribution, which cannot be sufficiently specified by providing a corpus of background evidence. The determining distribution is often—but need not always be—an agent's credence distribution.29 For example, a scientist may assess her experimental data relative to a commonly-accepted probabilistic model of the phenomenon under examination (such as a statistical model of gases), even if that model doesn't match her personal credences about the events in question. Similarly, a group may agree to judge evidence relative to a probability distribution distinct from the credence distributions of each of its members. Whatever probability distribution we consider, the Kolmogorov axioms and Ratio Formula ensure that the confirmation relation relative to
that distribution will display the general conditions we desire.

The most common objection to the Subjective Bayesian view of confirmation is that for confirmation to play the objective role we require in areas like scientific inquiry, it should never be relative to something so subjective as an agent's degrees of belief about the world. (A Congressional panel's findings about the evidence related to the Kennedy assassination shouldn't reflect the committee members' personal levels of trust in authority!) We will return to this objection—and to some theories of confirmation that try to avoid it—in Chapter 11. For now I want to consider another objection to the Subjective Bayesian view, namely that it is so empty as to be near-useless. There are so many probability distributions available that for any E and H we will be able to find some distribution on which they are positively relevant (except in extreme cases when E ⊨ ~H). It looks, then, like the Subjective Bayesian view tells us almost nothing substantive about which particular hypotheses are confirmed by which bodies of evidence.

While the Subjective Bayesian denies the existence of a unique probability distribution to which all confirmation relations are relative, the view need not be anything-goes.30 Often we are interested in confirmation
relations relative to some rational agent's credences, and Chapter 5 proposed a number of plausible constraints beyond the Kolmogorov axioms and Ratio Formula that such credences may satisfy. These constraints, in turn, impose some substantive shape on any confirmation relation defined relative to a rational agent's credences. For example, David Lewis shows at his (1980, p. 285ff.) that if a credence distribution satisfies the Principal Principle, then relative to that distribution the evidence that a coin has come up heads on x percent of its tosses will confirm that the objective chance of heads on a single toss is close to x.

This result of Lewis's has the form: if your credences have features such-and-such, then confirmation relative to those credences will have features so-and-so. The fact that features such-and-such are required by rationality is neither here nor there. For example, if you assign equal credence to each possible outcome of the roll of a six-sided die, then relative to your credence distribution the evidence that the roll came up odd will confirm that it came up prime. This will be true regardless of whether your total evidence rationally required equanimity over the possible outcomes. Subjective Bayesianism can yield interesting, informative results about which bodies of evidence confirm which hypotheses once the details of the relevant probability distribution are specified.

The theory can also work in the opposite direction: it can tell us what features in a probability distribution will generate particular kinds of con-
firmational relations. But before I can outline some of Subjective Bayesianism's more interesting results on that front, I need to explain how Bayesians measure the strength of evidential support.

6.4.1 Confirmation measures

We have been considering a classificatory question: Under what conditions does a body of evidence E confirm a hypothesis H? But related to that classificatory question are various comparative confirmational questions: Which of E or E′ confirms H more strongly? Is E better evidence for H or H′? etc. These comparative questions could obviously be answered if we had the answer to an underlying quantitative question: To what degree does E confirm H? (Clearly if we knew the degree to which E confirms H and the degree to which E′ confirms the same H, we could say whether E or E′ confirms H more strongly.) Popper (1935/1959) introduced the notion of degrees of confirmation. Since then various Bayesian confirmation measures have been proposed to quantify degree of confirmation: they take propositions E and H and some probability distribution Pr (perhaps an agent's credence distribution) and try to measure how much E confirms H relative to Pr.
Warning: When we set out to understand degree of confirmation in terms of Pr, it's important not to conflate firmness and increase in firmness (Section 6.2.1). It's also important to get clear on how degree of confirmation relates to various notions involving justification. Compare the following:

• the degree to which E confirms H relative to Pr

• Pr(H | E)

• the degree to which an agent with total evidence E would be justified in believing (or accepting) H

The degree to which E confirms H relative to Pr cannot be measured as Pr(H | E). The confirmation of H by E is a relation between E and H, while the value of Pr(H | E) may be affected as much by the value of Pr(H) as it is by the influence of E. So Pr(H | E) is not solely reflective of the relationship between H and E (relative to Pr). Pr(H | E) tells us how probable H is given E (relative to Pr). If Pr represents an agent's hypothetical prior distribution, Pr(H | E) tells us the degree of confidence rationality requires that agent to assign H when her total evidence is E.
Some authors discuss the degree to which E "justifies" H. This may or may not be meant as synonymous with the degree to which E confirms H. Even so, it cannot be identical to Pr(H | E), for the reasons just explained. But other authors think it's a category mistake to speak of one proposition's justifying another; evidence may only justify particular attitudes towards H. When Pr is an agent's hypothetical prior, we might speak of Pr(H | E) as the credence in H an agent is justified in possessing when her total evidence is E.

Yet even this is distinct from the degree to which such an agent is justified in believing H. Belief is a binary doxastic attitude. We might propose a theory that quantifies how much an agent is justified in possessing this binary attitude. But there's no particular reason to think that the resulting measure should satisfy the Kolmogorov axioms, much less be precisely equal to Pr(H | E) for any Pr with independent significance. (See (Shogenji 2012).)

Finally, there is the view that an agent is justified in believing or accepting H only if Pr(H | E) is high (where E represents total evidence and Pr her hypothetical prior). Here Pr(H | E) is not supposed to measure how justified such an acceptance would be; it's simply part of a necessary condition for such acceptance to be justified. Whether one accepts this proposal depends on one's views about the rational relations between credences and binary acceptances/beliefs.
So if the degree to which E confirms H relative to Pr cannot be measured by Pr(H | E), how should it be measured? There is now a sizable literature that attempts to answer this question. Almost all of the measures that have been seriously defended are relevance measures: They agree with our earlier analysis that E confirms H relative to Pr just in case Pr(H | E) > Pr(H). In other words, the relevance measures all concur that confirmation goes along with positive probabilistic relevance (and disconfirmation goes with negative probabilistic relevance). Yet there turn out to be a wide variety of confirmation measures satisfying this basic constraint. The following measures have all been extensively discussed in the historical
literature:31

d(H, E) = Pr(H | E) − Pr(H)
s(H, E) = Pr(H | E) − Pr(H | ~E)
r(H, E) = log[ Pr(H | E) / Pr(H) ]
l(H, E) = log[ Pr(E | H) / Pr(E | ~H) ]
These measures are to be read such that, for instance, d(H, E) is the degree to which E confirms H relative to Pr on the d-measure. Each of the measures has been defined such that if H and E are positively relevant on Pr, then the value of the measure is positive; if H and E are negatively relevant, the value is negative; and if H is independent of E then the value is 0. In other words: positive values represent confirmation, negative values represent disconfirmation, and 0 represents irrelevance.32 For example, if Pr assigns each of the six faces on a die equal probability of coming up on a given roll, then

d(2, prime) = Pr(2 | prime) − Pr(2) = 1/3 − 1/6 = 1/6     (6.7)
This value is positive because evidence that the die roll came up prime would confirm the hypothesis that it came up 2. Beyond the fact that it's positive, the particular value of the d-measure has little significance here. (It's not as if a d-value of, say, 10 has any particular meaning.) But the specific values do allow us to make comparisons. For example, d(3 ∨ 5, prime) = 1/3. So according to the d-measure (sometimes called the "difference measure"), on this Pr-distribution evidence that the die came up prime more strongly supports the disjunctive conclusion that it came up 3 or 5 than the conclusion that it came up 2.

Since they are all relevance measures, the confirmation measures I listed will agree on classificatory facts about whether a particular E supports a particular H relative to a particular Pr. Nevertheless, they are distinct measures because they disagree about various comparative facts. A bit of calculation will reveal that r(2, prime) = log 2. Again, that particular number has no special significance, nor is there really much to say about how an r-score of log 2 compares to a d-score of 1/6. (r and d measure confirmation on different scales, so to speak.) But it is significant that
r(3 ∨ 5, prime) = log 2 as well. According to the r-measure (sometimes called the "log ratio measure"), evidence that the roll came up prime confirms the hypothesis that it came up 2 to the exact same degree as the hypothesis that it came up either 3 or 5. That is a substantive difference with the d-measure on a comparative confirmation claim.

Since the various confirmation measures can disagree about comparative confirmation claims, to the extent that we are interested in making such comparisons we will need to select among the measures available. Arguing for some measures over others occupies much of the literature in this field. What kinds of arguments can be made? Well, we might test our intuitions on individual cases. For instance, it might just seem intuitively obvious to you that the primeness evidence favors the 3 ∨ 5 hypothesis more strongly than the 2 hypothesis, in which case you will favor the d-measure over the r-measure. Another approach parallels Hempel's approach to the qualitative confirmation relation: We first identify abstract features we want a confirmation measure to display, then we test positive proposals for each of those features.

For example, suppose E confirms H strongly while E′ confirms H only
weakly. If we let c represent the "true" confirmation measure (whichever that turns out to be), c(H, E) and c(H, E′) will both be positive numbers (because E and E′ both confirm H), but c(H, E) will be the larger of the two. Intuitively, since E is such good news for H it should also be very bad news for ~H; since E′ is only weakly good news for H it should be only weakly bad news for ~H. This means that while c(~H, E) and c(~H, E′) are both negative, c(~H, E) is the lower (farther from zero) of the two. That relationship is guaranteed by the following formal condition:

Hypothesis Symmetry: For all H and E in L and every probabilistic distribution Pr, c(H, E) = −c(~H, E).
Hypothesis Symmetry says that evidence which favors a hypothesis will disfavor the negation of that hypothesis just as strongly. It guarantees that if c(H, E) > c(H, E′) then c(~H, E) < c(~H, E′).33
Hypothesis Symmetry won't do all that much work in narrowing our field; of the confirmation measures under consideration, only r is ruled out by this condition. A considerably stronger condition can be obtained by following Carnap's thought that entailment and refutation are the two extremes of confirmation.34 If that's right, then confirmation measures must satisfy the following adequacy condition:

Logicality: All entailments receive the same degree of confirmation, and
have a higher degree of confirmation than any non-entailing confirmations.35

If we combine Logicality with Hypothesis Symmetry, we get the further result that refutations are the strongest form of disconfirmation, and all refutations are equally strong.

Logicality is violated by, for instance, confirmation measure d. It's easy to see why. d subtracts the prior of H from its posterior. Since the posterior can never be more than 1, the prior will therefore put a cap on how high d can get. For example, if Pr(H) = 9/10, then no E will be able to generate a d-value greater than 1/10, which is the value one will get when E ⊨ H. On the other hand, we saw in Equation (6.7) that d-values greater than 1/10 are possible even for evidence that doesn't entail the hypothesis (e.g., d(2, prime) = 1/6), simply because the prior of the hypothesis in question begins so much lower. As with the firmness concept of confirmation, the prior of H interferes with the d-score's assessment of the relation between E and H. This interference generates a violation of Logicality.

Out of all the confirmation measures prominently defended in the historical literature (including all the measures described above), only measure l satisfies Logicality.36 This constitutes a strong argument in favor of measure l (sometimes called the "log likelihood-ratio measure" of confirmation). If l looks familiar to you, that may be because it simply applies a logarithm to the Bayes factor, which we studied in Section 4.1.2. There we saw that the Bayes factor equals the ratio of posterior odds to prior odds, and is a good way of measuring the impact a piece of evidence has on an agent's opinion about a hypothesis. Moreover, the log-likelihood ratio has a convenient mathematical feature often cited approvingly by statisticians: When pieces of evidence E1 and E2 are screened off by H on Pr, l(H, E1 & E2) = l(H, E1) + l(H, E2). (See Exercise 6.8.) We often have cases in which independent pieces of evidence stack up in favor of a hypothesis. Measure l makes confirmation by independent evidence additive; the strength of a stack of independent pieces of evidence equals the sum of the individual pieces' strengths.

However, a new confirmation measure37 has recently been proposed (Crupi, Tentori, and Gonzalez 2007) that also satisfies both Hypothesis Symmetry and Logicality:
z(H, E) = [Pr(H | E) − Pr(H)] / [1 − Pr(H)]   if Pr(H | E) ≥ Pr(H)

z(H, E) = [Pr(H | E) − Pr(H)] / Pr(H)         if Pr(H | E) < Pr(H)
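Because each of these measures is just a simple function of Pr, the die-roll comparisons from the last few paragraphs are easy to verify numerically. Here is a minimal sketch (Python, with the uniform die distribution modeled as sets of outcomes; the function names simply mirror d, s, r, l, and z above), which reproduces d(2, prime) = 1/6, d(3 ∨ 5, prime) = 1/3, and the fact that r assigns log 2 to both hypotheses.

    from math import log
    from fractions import Fraction

    # The fair die: six equiprobable outcomes; propositions are sets of outcomes.
    outcomes = {1, 2, 3, 4, 5, 6}
    Pr = lambda A: Fraction(len(A), len(outcomes))
    cond = lambda A, B: Fraction(len(A & B), len(B))   # Pr(A | B)

    def d(H, E): return cond(H, E) - Pr(H)
    def s(H, E): return cond(H, E) - cond(H, outcomes - E)
    def r(H, E): return log(float(cond(H, E) / Pr(H)))
    def l(H, E): return log(float(cond(E, H) / cond(E, outcomes - H)))
    def z(H, E):
        num = cond(H, E) - Pr(H)
        return num / (1 - Pr(H)) if num >= 0 else num / Pr(H)

    prime, odd = {2, 3, 5}, {1, 3, 5}
    two, three_or_five = {2}, {3, 5}

    print(d(two, prime), d(three_or_five, prime))   # 1/6 vs 1/3: d ranks 3-or-5 higher
    print(r(two, prime), r(three_or_five, prime))   # both log 2: r ranks them equally
    print(z(two, prime), z(three_or_five, prime))   # 1/5 vs 1/2: z agrees with d here
    print(d(prime, odd) > 0)                        # True: "odd" confirms "prime"

The last line also checks the earlier remark that, relative to the uniform die distribution, evidence that the roll came up odd confirms that it came up prime.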
This measure is particularly interesting because it measures confirmation differently from disconfirmation (hence the piecewise definition). That means confirmation and disconfirmation may satisfy different general conditions under the z-measure. For example, the following condition is satisfied for cases of disconfirmation but not for cases of confirmation:

z(H, E) = z(E, H)     (6.8)
Interestingly, Crupi, Tentori, and Gonzalez have conducted empirical studies in which subjects' comparative judgments seem to track z-scores better than the other confirmation measures. In particular, subjects seem to intuitively treat disconfirmation cases differently from confirmation cases. (See Exercise 6.9.)

6.4.2 Subjective Bayesian solutions to the Paradox of the Ravens
Earlier (Section 6.1.1) we saw Hempel endorsing conditions on confirmation according to which the hypothesis that all ravens are black would be confirmed not only by the observation of a black raven but also by the observation of a red herring. Hempel explained this so-called result—the Paradox of the Ravens—by arguing that its seeming paradoxicality results from background assumptions we illicitly smuggle into the question. Our confirmation intuitions are driven by contingent facts we typically know about the world, but for Hempel the only fair test of ravens confirmation was against an empirically empty background. Hempel would ultimately defend a positive theory of confirmation on which black raven and red herring observations stand in exactly the same relations to the ravens hypothesis, as long as we stick to a tautological background corpus.

Subjective Bayesians take the paradox in exactly the opposite direction. They examine our contingent background assumptions about what the world is like, and try to explain the intuitive confirmation judgment that results. As Charles Chihara puts it (in a slightly different context), the problem is "that of trying to see why we, who always come to our experiences with an encompassing complex web of beliefs," assess the paradox the way we do. (1981, p. 437)

Take the current knowledge you actually have of what the world is like. Now suppose that against the background of that knowledge, you are told that you will soon be given an object a to observe. You will record whether it is a raven and whether it is black; you are not told in advance whether a will have either of these properties. Recall that on the Subjective Bayesian view
of confirmation, evidence E confirms hypothesis H relative to probability distribution Pr just in case E is positively relevant to H on Pr. In this situation it's plausible that, when you gain evidence E about whether a is a raven and whether it is black, you will judge the confirmation of various hypotheses by this evidence relative to your personal credence distribution. So we will let your cr play the role of Pr.

The key judgment we hope to explain is that the ravens hypothesis (all ravens are black) is more strongly confirmed by the observation of a black raven than by the observation of a non-black non-raven (a red herring, say). One might go further and suggest that observing a red herring shouldn't confirm the ravens hypothesis at all. But if we look to our considered judgments (rather than just our first reactions) here, we should probably grant that insofar as a non-black raven would be absolutely disastrous news for the ravens hypothesis, any observation of a that doesn't reveal it to be a non-black raven should be at least some good news for the hypothesis.38

Expressing our key judgment formally requires us to measure degrees of confirmation, a topic we discussed in the previous section. If c(H, E) measures the degree to which E confirms H relative to cr, the Bayesian claims that

c(H, Ba & Ra) > c(H, ~Ba & ~Ra)     (6.9)

where H is the ravens hypothesis (∀x)(Rx ⊃ Bx). Again, the idea is that relative to the credence distribution cr you assign before observing a, observing a to be a black raven would confirm H more strongly than observing a to be a non-black non-raven. Fitelson and Hawthorne (2010b) show that Equation (6.9) will hold relative to cr if both the following conditions are met:

cr(~Ba) > cr(Ra)     (6.10)

cr(~Ba | H) / cr(Ra | H) ≤ cr(~Ba) / cr(Ra)     (6.11)
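Before unpacking what these conditions say, it may help to see them satisfied in a concrete toy model. The sketch below (Python; the two candidate worlds and all the counts are invented purely for illustration, and are not Fitelson and Hawthorne's) builds a credence distribution by mixing two 1000-object worlds, verifies (6.10) and (6.11), and then computes Equation (6.9) using the difference measure d: the black raven raises cr(H) by 1/6, the red herring by only 1/898.

    from fractions import Fraction

    # A toy credence distribution: the agent splits her prior evenly between two
    # candidate worlds of 1000 objects each. Counts are (black ravens,
    # non-black ravens, black non-ravens, non-black non-ravens); the numbers are
    # invented so that conditions (6.10) and (6.11) both come out true.
    counts = {
        "H":     (10, 0, 90, 900),   # all ravens are black
        "not-H": (5, 5, 94, 896),    # some ravens are not black
    }
    prior = {"H": Fraction(1, 2), "not-H": Fraction(1, 2)}
    TOTAL = 1000                     # object a is drawn at random

    def cr(event):
        """Credence that a satisfies `event` (a map from counts to a tally)."""
        return sum(prior[w] * Fraction(event(counts[w]), TOTAL) for w in counts)

    def cr_H_given(event):
        return prior["H"] * Fraction(event(counts["H"]), TOTAL) / cr(event)

    Ra = lambda c: c[0] + c[1]           # a is a raven
    nBa = lambda c: c[1] + c[3]          # a is non-black
    black_raven = lambda c: c[0]
    red_herring = lambda c: c[3]         # non-black non-raven

    print(cr(nBa) > cr(Ra))                               # (6.10): True
    lhs = Fraction(nBa(counts["H"]), Ra(counts["H"]))     # cr(~Ba | H) / cr(Ra | H)
    print(lhs <= cr(nBa) / cr(Ra))                        # (6.11): True (90 vs 90.05)
    print(cr_H_given(black_raven) - prior["H"])           # d(H, Ra & Ba)   = 1/6
    print(cr_H_given(red_herring) - prior["H"])           # d(H, ~Ra & ~Ba) = 1/898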
These conditions are jointly sufficient for the confirmational result in Equation (6.9). They are not necessary; in fact, Bayesians have proposed a number of different sufficient sets over the years.39 But these have the advantage of being simple and compact; they also work for every construal of c canvassed in the previous section except for confirmation measure s.

What do these conditions say? You satisfy Equation (6.10) if you are more confident prior to observing the object a that it will be non-black than you are that a will be a raven. This would make sense if, for example, you thought a was going to be randomly selected for you from a universe
that contained more non-black things than ravens.40 Equation (6.11) then considers the ratio of your confidence that a will be non-black to your confidence that it will be a raven. Meeting condition (6.10) makes this ratio greater than 1; now we want to know how the ratio would change were you to suppose all ravens are black. Equation (6.11) says that when you make this supposition the ratio doesn't go up—supposing all ravens are black wouldn't, say, dramatically increase how many non-black things you thought were in the pool or dramatically decrease your count of ravens. (It turns out from the math that for the confirmational judgment in Equation (6.9) to go false, the left-hand ratio in (6.11) would have to be much larger than the right-hand ratio; hence my talk of dramatic changes.) This constraint seems sensible. Under normal circumstances, for instance, supposing that all ravens are black should if anything increase the number of black things you think there are, not increase your count of non-black items.

Subjective Bayesians suggest that relative to our real-life knowledge of the world, were we to confront a selection situation like the one proposed in the ravens scenario, our credence distribution would satisfy Equations (6.10) and (6.11). Relative to such a credence distribution, the observation of a black raven confirms the ravens hypothesis more strongly than the observation of a red herring. This is how a Subjective Bayesian explains the key intuitive judgment that the ravens hypothesis is better confirmed by a black raven observation than a red herring observation: by showing how that judgment follows from more general assumptions about the composition of the world. Given that people's outlook on the world typically satisfies Equations (6.10) and (6.11), it follows from the Subjective Bayesian's quantitative theory of confirmation that if they are rational they will take the black raven observation to be more highly confirmatory.41

Now one might object that people who endorse the key ravens judgment have credence distributions that don't actually satisfy the conditions specified (or other sets of sufficient conditions Bayesians have proposed). Or an Objective Bayesian might argue that a confirmation judgment can be vindicated only by grounding it in something firmer than personal credences. I am not going to take up those arguments here. But I hope to have at least fought back the charge that Subjective Bayesianism about confirmation is empty. The Subjective Bayesian account of confirmation tells us when evidence E confirms hypothesis H relative to credence distribution cr. You might think that because it does very little to constrain the values of cr, this account can tell us nothing interesting about when evidence confirms a hypothesis. But we have just seen a substantive, unexpected result. It was not at all obvious at the start of our inquiry that any rational credence
distribution satisfying Equations (6.10) and (6.11) would endorse the key ravens judgment. Any Subjective Bayesian result about confirmation will have to take the form, "If your credences are such-and-such, then these confirmation relations follow," but such conditionals can nevertheless be highly informative. For instance, the result we've just seen not only reveals what confirmation judgments agents will make in typical circumstances, but also which atypical circumstances may legitimately undermine those judgments.

Return to the Hall of Atypically-Colored Birds, where a bird is displayed only if the majority of his species-mates are one color but his color is different. Suppose it is part of an agent's background knowledge (before she observes object a) that a is to be selected from the Hall of Atypically-Colored Birds. If at that point—before observing a—the agent were to suppose that all ravens are black, that would dramatically decrease her confidence that a will be a raven. If all ravens are black, there are no atypically-colored ravens, so there should be no ravens in the Hall.42 Thus given the agent's background knowledge about the Hall of Atypically-Colored Birds, supposing the ravens hypothesis H decreases her confidence that a will be a raven (that is, Ra). This makes the left-hand side of Equation (6.11) greater than the right-hand side, and renders Equation (6.11) false. So one of the sufficient conditions in our ravens result fails, and Equation (6.9) cannot be derived. This provides a tidy explanation of why, if you know you're in the Hall of Atypically-Colored Birds, observing a black raven should not be better news for the ravens hypothesis than observing a non-black non-raven.

Besides this account of the Paradox of the Ravens, Subjective Bayesians have offered solutions to various other confirmational puzzles. For example, Hawthorne and Fitelson (2004) approach the problem of irrelevant conjunction (Section 6.1.2) by specifying conditions under which adding an irrelevant conjunct to a confirmed hypothesis yields a new hypothesis that—while still confirmed—is less strongly confirmed than the original. Similarly, Chihara (1981) and Eells (1982, Ch. 2) respond to Goodman's grue example (Section 6.3) by specifying credal conditions under which a run of observed green emeralds more strongly confirms the hypothesis that all emeralds are green than the hypothesis that all emeralds are grue.

Even more intriguingly, the Subjective Bayesian account of confirmation has recently been used to explain what look like irrational judgments on the part of agents. The idea here is that sometimes when subjects are asked questions about probability, they respond with answers about confirmation. In Tversky and Kahneman's Conjunction Fallacy experiment (Section 2.2.4), the hypothesis that Linda is a bank teller is entailed by the hypothesis that
Linda is a bank teller and active in the feminist movement. This entailment means that an agent satisfying the probability axioms must be at least as confident in the former hypothesis as the latter. But it does not mean that evidence must confirm the former as strongly as the latter. Crupi, Fitelson, and Tentori (2008) outline credal conditions under which the evidence presented to subjects in Tversky and Kahneman's experiment would confirm the feminist-bank-teller hypothesis more strongly than the bank-teller hypothesis. It may be that subjects who rank the feminist-bank-teller hypothesis more highly in light of that evidence are reporting confirmational judgments instead of credences.

Similarly, in analyzing the Base Rate Fallacy (Section 4.1.2) we noted the strong Bayes factor of the evidence one gets from a highly reliable disease test. Since the Bayes factor tracks the log likelihood-ratio measure of confirmation, this tells us that a positive result from a reliable test strongly confirms that the patient has the disease (as it should!). When doctors are asked for their confidence that the patient has the disease in light of such a positive test result, the high values they report may reflect their confirmational judgments.

The Subjective Bayesian account of confirmation may therefore provide an explanation of what subjects are doing when they seem to make irrational credence reports. Nevertheless, having an explanation for subjects' behavior does not change the fact that these subjects may be making serious mistakes. It's one thing when a doctor is asked in a study to report a credence value and reports a confirmation value instead. But if that doctor goes on to make treatment decisions based on the confirmation value rather than the posterior probability, this can have significant consequences. Confusing how probable a hypothesis is on some evidence with how strongly that hypothesis is confirmed by that evidence is a version of the firmness/increase-in-firmness conflation. If the doctor recommends a drastic treatment for a patient on the basis that the test applied was highly reliable (even though, with the base rates taken into account, the posterior probability that a disease is present remains quite low), her confusion about probability and confirmation may prove highly dangerous for her patient.
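The gap between the two quantities is easy to exhibit with made-up numbers. In the sketch below (Python; the 1-in-1000 base rate, 99% sensitivity, and 2% false-positive rate are hypothetical figures, not data from any actual test), the positive result has a Bayes factor of 49.5, so on the l-measure it confirms the disease hypothesis strongly, yet the posterior probability of disease is still under 5 percent.

    from fractions import Fraction
    from math import log

    # Hypothetical numbers: a disease with a 1-in-1000 base rate, a test with
    # 99% sensitivity and a 2% false-positive rate.
    prior = Fraction(1, 1000)
    sens = Fraction(99, 100)       # Pr(positive | disease)
    false_pos = Fraction(2, 100)   # Pr(positive | no disease)

    posterior = (sens * prior) / (sens * prior + false_pos * (1 - prior))
    bayes_factor = sens / false_pos
    l_measure = log(float(bayes_factor))   # log likelihood-ratio confirmation

    print(float(posterior))      # about 0.047: the disease is still improbable
    print(float(bayes_factor))   # 49.5: the evidence confirms strongly
    print(l_measure)             # about 3.9 (natural log)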
6.5 Exercises
Unless otherwise noted, you should assume when completing these exercises that the distributions under discussion satisfy the probability axioms and Ratio Formula. You may also assume that whenever a conditional proba-
bility expression occurs, the needed proposition has nonzero unconditional credence so that conditional probabilities are well-defined.

Problem 6.1. Suppose the Special Consequence Condition and Converse Consequence Condition were both true. Show that under those assumptions, if evidence E confirms some proposition H relative to K, then relative to K evidence E will also confirm any other proposition X we might choose.∗ (Hint: Start with the problem of irrelevant conjunction.)

Problem 6.2. For purposes of this problem, assume that the Equivalence Condition, the Entailment Condition, and Disconfirmation Duality are all true of the confirmation relation.

(a) Show that if E & K deductively refutes H but K does not refute H on its own, then E disconfirms H relative to K.

(b) Show that if H & K deductively refutes E but K does not refute H on its own, then E disconfirms H relative to K.

Problem 6.3. Suppose we have propositions E, H, and K in L meeting the following conditions: (1) the set containing E, H, and K is logically consistent; (2) E & K ⊨ H; and (3) K ⊭ H. Suppose also probabilistic distribution Pr over L is such that for any proposition X, Pr(X) = 1 just in case K ⊨ X.

(a) Prove that Pr(E) > 0.

(b) Prove that Pr(H | E) = 1.

(c) Prove that Pr(H) < 1.
Problem 6.4. Suppose we have a language whose only atomic propositions are Fa1, Fa2, . . . , Fan for some integer n > 1. In that case, m†(Fan) = 1/2.

(a) Show that for any non-contradictory proposition K expressible solely in terms of Fa1 through Fan−1, m†(Fan | K) = 1/2.

(b) What does the result you demonstrated in part (a) have to do with Carnap's point that m† does not allow "learning from experience"?

∗ For purposes of this problem you may assume that E, H, X, and K stand in no special logical relationships.
Problem 6.5. (a) Make a probability table over state-descriptions for the four atomic propositions Fa, Fb, Ga, Gb. In the right-hand column, enter the values Carnap's m* assigns to each state-description. (Hint: Keep in mind that Fa & Fb & ~Ga & ~Gb belongs to a different structure-description than Fa & ~Fb & ~Ga & Gb.)

(b) Use your table to show that m*(Fb | Fa & Ga & Gb) > m*(Fb | Fa).

(c) Use your table to show that m*(Fb | Fa & Ga & ~Gb) = m*(Fb).
(d) For each of parts (b) and (c) above, explain how your answer relates to m*'s handling of "analogical effects".†

Problem 6.6. Suppose E2 is the proposition

(Ga1 & Oa1) & (Ga2 & Oa2) & . . . & (Ga99 & Oa99) & ~Oa100
Without actually making a probability table, argue convincingly that on Carnap's confirmation theory:

(a) m*(Ga100 ≡ Oa100 | E2) = m*(~Ga100 | E2)

(b) m*(Ga100 ≡ Oa100 | E2) = m*(Ga100 | E2)

(c) m*(Ga100 | E2) = 1/2

(d) m*(Ga100) = 1/2

(e) m*(Ga100 ≡ Oa100 | E2) = m*(Ga100 | E2) = m*(Ga100)
Problem 6.7. Provide examples showing that the r-measure of confirmation violates each of the following constraints:

(a) Hypothesis Symmetry

(b) Logicality

Problem 6.8. Prove that on the l-measure of degree of confirmation, if E1 is screened off from E2 by H on Pr, then the degree to which E1 & E2 confirms H can be found by summing the degrees to which E1 and E2 each confirm H individually. (Hint: Remember that log(x · y) = log x + log y.)

† I owe this entire problem to Branden Fitelson.
Problem 6.9. Crupi, Tentori, and Gonzalez think it's intuitive that on whatever measure c correctly gauges confirmation, the following constraint will be satisfied for cases of disconfirmation but not confirmation:

c(H, E) = c(E, H)

(a) Provide a real-world example of two propositions A and B such that, intuitively, A confirms B but B does not confirm A to the same degree. (Don't forget to specify the Pr distribution to which your confirmation judgments are relative!)

(b) Provide a real-world example of two propositions C and D such that, intuitively, C disconfirms D and D disconfirms C to the same degree. (Don't make it too easy on yourself—pick a C and D that aren't mutually exclusive!)

(c) Does it seem to you intuitively that for any propositions C and D and probability distribution Pr, if C disconfirms D then D will disconfirm C to the same degree? Explain why or why not.

Problem 6.10. (a) Provide an example in which the l- and z-measures disagree on a comparative confirmational claim. That is, provide an example in which the l-measure says that E1 confirms H1 more strongly than E2 confirms H2, but the z-measure says E2 confirms H2 more strongly than E1 confirms H1.

(b) (Note: This one's fairly difficult!) Prove that the l- and z-measures never disagree on how strongly two pieces of evidence confirm the same hypothesis. That is, prove that there do not exist H, E1, E2, and Pr such that l(H, E1) > l(H, E2) but z(H, E1) < z(H, E2).
Problem 6.11. The solution to the Paradox of the Ravens presented in Section 6.4.2 is not the only Subjective Bayesian solution that has been defended. An earlier solution‡ invoked the following four conditions (where H abbreviates (∀x)(Rx ⊃ Bx)):

(i) Pr(Ra & ~Ba) > 0

(ii) Pr(~Ba) > Pr(Ra)

(iii) Pr(Ra | H) = Pr(Ra)

(iv) Pr(~Ba | H) = Pr(~Ba)

Assuming Pr satisfies these conditions, complete each of the following. (Hint: Feel free to write H instead of the full, quantified proposition it represents, but don't forget what H entails about Ra and Ba.)

(a) Prove that Pr(~Ra & ~Ba) > Pr(Ra & Ba).

(b) Prove that Pr(Ra & Ba & H) = Pr(H) · Pr(Ra).

(c) Prove that Pr(~Ra & ~Ba & H) = Pr(H) · Pr(~Ba).

(d) Show that on confirmation measure d, if Pr satisfies conditions (i) through (iv) then Ra & Ba confirms H more strongly than ~Ra & ~Ba does.

(e) Where in your proofs did you use condition (i)?

(f) Suppose Pr is your credence distribution when you know you are about to observe an object a drawn from the Hall of Atypically-Colored Birds. Which of the conditions (i) through (iv) will Pr probably not satisfy? Explain.

‡ See (Fitelson and Hawthorne 2010a, Sect. 7) for discussion.
6.6 Further reading
Introductions and Overviews
Ellery Eells (1982). Rational Decision and Causality. Cambridge Studies in Philosophy. Cambridge: Cambridge University Press

The latter part of Chapter 2 (pp. 52–64) offers an excellent discussion of Hempel's adequacy conditions for confirmation, how the correct conditions are satisfied by a probabilistic relevance approach, and Subjective Bayesian solutions to the Paradox of the Ravens and Goodman's grue puzzle.

Rudolf Carnap (1955/1989). Statistical and Inductive Probability. In: Readings in the Philosophy of Science. Ed. by Baruch A. Brody and Richard E. Grandy. 2nd. Prentice-Hall

A brief, accessible overview by Carnap of his position on the meaning of "probability" and the development of his various confirmation functions. (Here he uses "individual distribution" to refer to state-descriptions and "statistical distribution" to refer to structure-descriptions.) Includes a probability table with diagrams!
Alan Hájek and James M. Joyce (2008). Confirmation. In: The Routledge Companion to Philosophy of Science. Ed. by Stathis Psillos and Martin Curd. New York: Routledge, pp. 115–128

Besides providing an overview of much of the material in this chapter, suggests that there may not be one single correct function for measuring degree of confirmation.

Classic Texts
Carl G. Hempel (1945a). Studies in the Logic of Confirmation (I). Mind 54, pp. 1–26

Carl G. Hempel (1945b). Studies in the Logic of Confirmation (II). Mind 54, pp. 97–121

Hempel's classic papers discussing his adequacy conditions on the confirmation relation and offering his own positive, syntactical account of confirmation.

Rudolf Carnap (1950). Logical Foundations of Probability. Chicago: University of Chicago Press

While much of the material earlier in this book is crucial for motivating Carnap's probabilistic theory of confirmation, his discussion of distributions m† and m* occurs in the Appendix. (Note that the preface distinguishing "firmness" from "increase in firmness" concepts of confirmation does not appear until the second edition of this text, in 1962.)

Janina Hosiasson-Lindenbaum (1940). On Confirmation. Journal of Symbolic Logic 5, pp. 133–148

Early suggestion that the Paradox of the Ravens might be resolved by first admitting that both a black raven and a red herring confirm that all ravens are black, but then second arguing that the former confirms more strongly than the latter. Also anticipates some of Carnap's later conclusions about which adequacy conditions could be met by a confirmation theory based on probability.

Nelson Goodman (1955). Fact, Fiction, and Forecast. Cambridge, MA: Harvard University Press
Chapter III contains Goodman's "grue" discussion.

Extended Discussion
Michael G. Titelbaum (2010). Not Enough There There: Evidence, Reasons, and Language Independence. Philosophical Perspectives 24, pp. 477–528

Proves a general language-dependence result for all objective accounts of confirmation (including accounts that are Objective Bayesian in the normative sense), then evaluates the result's philosophical significance.

Katya Tentori, Vincenzo Crupi, and Selena Russo (2013). On the Determinants of the Conjunction Fallacy: Probability versus Inductive Confirmation. Journal of Experimental Psychology: General 142, pp. 235–255

Assessment of various explanations of the Conjunction Fallacy in the psychology literature, including the explanation that subjects are reporting confirmation judgments rather than posterior credences.
Notes

1. Scientists—and philosophers of science—are interested in a number of properties and relations of evidence and hypotheses besides confirmation. These include predictive power, informativeness, simplicity, unification of disparate phenomena, etc. An interesting ongoing Bayesian line of research asks whether and how these various other notions relate to confirmation.

2. (Good 1967) offers a more detailed example in the same vein. Good describes the population distributions of two worlds constructed so that observing a black raven confirms that we are in the world in which not all ravens are black.

3. As I pointed out in Chapter 4's note 12, this passage may have been the inspiration for David Lewis's referring to hypothetical priors (numerical distributions reflecting no contingent evidence) as "superbaby" credences.

4. In discussing the Paradox of the Ravens, one might wonder in general whether (∀x)(Rx ⊃ Bx)—especially with its material conditional, and its lack of existential import—is a faithful translation of "All ravens are black." Strictly speaking, Hempel is examining what confirms the proposition expressed by the sentence in logical notation, rather than the proposition expressed by the sentence in English. But if the two come apart, intuitions about "All ravens are black" may be less relevant to Hempel's discussion.

5. Despite his attention to background corpora, Hempel isn't careful about backgrounds in the adequacy conditions he proposes. So I will add the relevant background restrictions to Hempel's official definitions of his conditions, and explain the motivations for those restrictions as we go along.
Also, in case you're wondering why E, H, and K are required to be consistent in the Entailment Condition, consider a case in which K refutes E and E is entirely irrelevant to H. E & K will be a contradiction, and so will entail H, but we don't want to say E confirms H relative to K. I will insert similar consistency requirements as needed going forward.
6. An interesting literature has sprung up among Bayesian epistemologists about the precise conditions under which one can rely on evidence of evidence to constitute evidence for a hypothesis. See, for example, (Tal and Comesaña ta), (Roche 2014), and (Fitelson 2012), building off foundational work in (Shogenji 2003).

7. I learned of this example from Bradley (2015, § 1.3); as far as I know it first appeared at (Pryor 2004, pp. 350–1). Note, by the way, that while one might want a restriction to keep the Special Consequence Condition from applying when K ⊨ H′, in the stated counterexamples H′ is not entailed by K. Out of desperation we could try to save Special Consequence by claiming it holds only relative to tautological backgrounds (as Hempel did with Nicod's Criterion). But we can recreate our cards counterexample to Special Consequence by emptying out the background and adding facts about how the card was drawn as conjuncts to each of A, B, and C. Similar remarks apply to the counterexamples we'll soon produce for other putative confirmation constraints.

8. For one of many recent articles on confirmational intransitivity and skepticism, see (White 2006).

9. We could provide another argument for the Consistency Condition from the premises that (1) if a hypothesis is confirmed by some evidence then we should accept that hypothesis; and (2) one should never accept inconsistent propositions. But we've already rejected (1) for our notion of confirmation.

10. I'm assuming the definition of an ostrich includes its being a flightless bird, and whatever K is involved doesn't entail E, H, or H′ on its own.

11. Hypothetico-deductivism is a positive view of confirmation that takes the condition in Converse Entailment to be not only sufficient but also necessary for confirmation: E confirms H relative to K just in case H & K ⊨ E and K ⊭ E. This is implausible for a number of reasons (see (Hempel 1945b)). Here's one: Evidence that a coin of unknown bias has come up heads on exactly half of a huge batch of flips supports the hypothesis that the coin is fair; yet that evidence isn't entailed by that hypothesis.

12. Strictly speaking there will be infinitely many X in L such that Pr(X) = 1, so we will take K to be a proposition in L logically equivalent to the conjunction of all such X. I'll ignore this detail in what follows.

13. Carnap's preface to the second edition distinguishes the firmness and increase in firmness concepts because he had equivocated between them in the first edition. Carnap was roundly criticized for this by Popper (1954).
14 I even made this mistake once in an article, despite my intense awareness of the issue! Luckily the error was caught before the offending piece was published. 15 Here we assume that, as pointed out in Chapter 2’s note 5, the atomic propositions of L are logically independent. 16 A word about Carnap’s notation in his (1950). Carnap actually introduces two confirmation functions, m(·) and c(·, ·). For any non-contradictory proposition K in L, c(·, K) is just the function I’ve been describing as Pr(·) relative to K; in other words, c(·, K) = m(· | K) = m(· & K)/m(K). As I’ve just mentioned in the main text, this makes c somewhat redundant in the theory of confirmation, so I won’t bring it up again.
17 As I mentioned in Chapter 5, note 8, Carnap actually thinks “probability” is ambiguous between two meanings. What he calls “probability1” is the logical notion of probability we’ve been discussing. Carnap’s “probability2” is based on frequencies, and is therefore objective as well. 18 Among other things, m† represents the technique for determining probabilities which
Ludwig Wittgenstein proposed in his Tractatus Logico-Philosophicus. (1921/1961, Proposition 5.15ff.) 19 Formally, two state-descriptions are disjuncts of the same structure-description just in case one state-description can be obtained from the other by permuting its constants. 20 Interestingly, Carnap’s continuum proposal shared a number of features with a much earlier proposal by Johnson (1932). 21 To make matters simpler, I’m going to assume going forward that (1) each object under discussion in the grue example is observed exactly once (so that “not observed by t” is equivalent to “observed after t”); (2) each object is either green or blue (so “not green” is equivalent to “blue”); and (3) each object is an emerald. Strictly speaking these assumptions should be made explicit as part of the agent’s total evidence, but since doing so would make no difference to the forthcoming calculations, I won’t bother. This approach is backed up by Goodman’s position in his (1955, p. 73, n. 9) that the grue problem is “substantially the same” as the problem he offered in (Goodman 1946). The earlier version of the problem was both more clearly laid-out and cleaner from a logical point of view. For instance, instead of green and blue, he used red and not-red. The earlier paper also made clearer exactly whose positive theories of confirmation Goodman took the problem to target. 22 I’m going to assume Goodman is criticizing the version of Carnap’s theory committed to m*; Carnap’s subsequent changes to handle analogical effects make little difference here. 23 Compare the difficulties with partition selection we encountered for indifference principles in Section 5.3. 24 To emphasize that not every pattern observed in the past should be expected to hold in the future, John Venn once provided the following example: “I have given a false alarm of fire on three different occasions and found the people came to help me each time.” (1866, p. 180) One wonders if his false alarms were intentional experiments in induction. (Quoted in (Galavotti 2005, pp. 77–8).) 25 Hume’s (1739–40/1978) problem of induction asked what justifies us in projecting any past correlations into the future. Goodman’s “new riddle of induction” asks, given that we are justified in projecting some correlations, which ones we ought to project. 26 Hempel’s theory of confirmation displays a similar effect. And really, any close reader of Hempel should’ve known that some of Goodman’s claims against Hempel were overstated. I mentioned that Hempel endorses the Consistency Condition (Section 6.1.2); he goes on to prove that it is satisfied by his positive theory of confirmation. On Hempel’s theory, the hypotheses confirmed by any piece of evidence must be consistent both with that evidence and with each other. So contra Goodman, it just can’t be that on Hempel’s theory we get the “devastating result that any statement will confirm any statement.” (1955, p. 81) 27 For more on this topic, see (Hooker 1968), (Fine 1973, Ch. VII), (Maher 2010), and (Titelbaum 2010). 28 I should also point out that the Subjective Bayesian account of confirmation does not suffer from any language-dependence problems. Suppose credence distribution cr, defined over language L, makes it the case that cr(H | E) > cr(H). We might define a different
language L′ that expresses all the same propositions as L, and a distribution cr′ over L′. Intuitively, cr′ expresses the same credences as cr just in case cr′(X′) = cr(X) whenever X′ in L′ is the same proposition as X in L. In that case, cr′ will satisfy the Kolmogorov axioms just in case cr does. And if cr(H | E) > cr(H), we will have cr′(H′ | E′) > cr′(H′) for the H′ and E′ in L′ that express H and E. So confirmation relations are unaffected by translation into a different language. (The same will be true of the confirmation measures we discuss in Section 6.4.1.) 29 Keep in mind that we’re discussing Subjective Bayesians in the normative sense. (Section 5.1.2) Their Subjective Bayesianism is an epistemological position, about the variety of rational hypothetical priors available. Subjective Bayesians in the normative sense need not be Subjective Bayesians in the semantic sense; they need not read every “probability” assertion as the attribution of a credence to a particular individual. 30 As we put it in Chapter 5, the Subjective Bayesian need not be an extreme Subjective Bayesian, who denies any constraints on rational hypothetical priors beyond the probability axioms. 31 For citations to various historical authors who defended each measure, see (Eells and Fitelson 2002). 32 The logarithms have been added to the r- and l-measures to achieve this centering on 0. Removing the logarithms would yield measures ordinally equivalent to their logged versions, but whose values ran from 0 to infinity (with a value of 1 indicating probabilistic independence). Notice also that the base of the logarithms is irrelevant for our purposes. 33 Hypothesis Symmetry was defended as a constraint on degree of confirmation by (Kemeny and Oppenheim 1952); see also (Eells and Fitelson 2002), who gave it that particular name. 34 Carnap thought of confirmation as a “generalization of entailment” in a number of senses. Many Subjective Bayesians are happy to accept Carnap’s idea that deductive cases are limiting cases of confirmation. But they aren’t willing to follow Carnap in taking those limiting cases as a model for the whole domain. Whether E entails H relative to K depends just on the content of those propositions, and Carnap thought matters should be the same for all confirmatory relations. To a Subjective Bayesian, though, whether E confirms H relative to K depends on something more—a full probability distribution Pr. 35 See (Fitelson 2006) for Logicality. 36 A few technical notes: First, when E ⊨ H the denominator in l goes to zero. We think of l as assigning an infinite positive value in these cases, and an infinite negative value when E refutes H. Second, any confirmation measure ordinally equivalent to l (such as l without the logarithm out front) will satisfy Logicality as well. Third, in discussing Logicality I am restricting my attention to “contingent” cases, in which neither E nor H is entailed or refuted by the K associated with Pr. 37 (Glass and McCartney 2015) notes that Crupi, Tentori, and Gonzalez’s z-measure adapts to the problem of confirmation the so-called “certainty factor” that has been used in the field of expert systems since (Shortliffe and Buchanan 1975). 38 Another thought one might have is that while a red herring confirms that all ravens are black, its degree of confirmation of that hypothesis is exceedingly weak in absolute terms. While some Bayesian analyses of the paradox also try to establish that result, we won’t consider it here.
(See (Vranas 2004) for discussion and citations on the proposal that a red herring confirms the ravens hypothesis to a degree that is “positive but minute.”) 39 For citations to many historical proposals, see (Fitelson and Hawthorne 2010a, esp. n. 10). (Fitelson and Hawthorne 2010b) goes beyond these historical sources by also proposing necessary conditions for Equation (6.9), which unfortunately are too complex
to detail here. 40 The result assumes you assign non-extreme unconditional credences to the proposition that a is black and to the proposition that it’s a raven. This keeps various denominators in the Ratio Formula positive. We also assume you have a non-extreme prior in H. 41 Why “if they are rational”? The mathematical result assumes not only that the credence distribution in question satisfies Equations (6.10) and (6.11), but also that it satisfies the probability axioms and Ratio Formula. (This allows us to draw out conclusions about values in the credence distribution beyond what is directly specified in Equations (6.10) and (6.11).) Subjective Bayesians assume a rational credence distribution satisfies the probability axioms and Ratio Formula. 42 Perhaps even with the supposition that all ravens are black, the agent’s confidence that a will be a raven is slightly above zero because once in a long while the Hall’s curators make a mistake.
Chapter 7
Decision Theory

Up to this point most of our discussion has been about epistemology. But probability theory originated in attempts to understand games of chance, and historically its most extensive application has been to practical decision-making. The Bayesian theory of probabilistic credence is a central element of decision theory, which developed throughout the twentieth century in philosophy, psychology, and economics. Decision theory searches for rational principles to evaluate the various acts available to an agent at any given moment. Given what she values (her utilities) and how she sees the world (her credences), decision theory recommends the act that is most efficacious for achieving those values from her point of view. Decision theory has always been a crucial application of Bayesian theory. In his The Foundations of Statistics, L.J. Savage wrote,

    Much as I hope that the notion of probability defined here is consistent with ordinary usage, it should be judged by the contribution it makes to the theory of decision. (1954, p. 27)

Decision theory has also been extensively studied, and a number of excellent book-length introductions are now available. (I recommend one in the Further Readings section of this chapter.) As a result, I haven’t packed as much information into this chapter as the preceding chapter on confirmation. I hope only to equip the reader with the terminology and ideas we will need later in this book, and that she would need to delve further into the philosophy of decision theory. We will begin with the general mathematical notion of an expectation, followed by the philosophical notion of utility. We will then see how Savage calculates expected utilities to determine rational preferences among acts,
and the formal properties of rational preference that result. Next comes Richard Jeffrey’s Evidential Decision Theory, which improves on Savage by applying to probabilistically dependent states and acts. We will then discuss Jeffrey’s troubles with certain kinds of risk-aversion (especially the Allais Paradox), and with Newcomb’s Problem. Causal Decision Theory will be proposed as a better response to Newcomb. I will close by briefly tracing some of the historical back-and-forth about which decision theory handles Newcomb’s problem best.
7.1 Calculating expectations
Suppose there’s a numerical quantity—say, the number of hits a particular batter will have in tonight’s baseball game—and you have opinions about what value that quantity will take. We can then calculate your expectation for the quantity. While there are subtleties we will return to later, the basic idea of an expectation is to multiply each value the quantity might take by your credence that it’ll take that value, then add up the results. So if you’re 30% confident the batter will have 1 hit, 20% confident she’ll have 2 hits, and 50% confident she’ll have 3, your expectation for the number of hits is
0.30 · 1 + 0.20 · 2 + 0.50 · 3 = 2.2    (7.1)
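As a quick illustration, here is a minimal Python sketch of that calculation, using the credences just stipulated; the expectation is simply a credence-weighted sum of the values the quantity might take:

```python
# Credences over the possible values of the quantity
# (here, the number of hits in tonight's game).
credences = {1: 0.30, 2: 0.20, 3: 0.50}

def expectation(cr):
    """Multiply each possible value by the credence that the quantity
    takes that value, then sum the results."""
    return sum(value * weight for value, weight in cr.items())

print(expectation(credences))  # 2.2
```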
Your expectation of a quantity is not the value you anticipate the quantity will actually take, or even the value you think it’s most probable the quantity will take—in the baseball example, you’re certain the batter won’t have 2.2 hits in tonight’s game! Your expectation of a quantity is a kind of estimate of the value the quantity will take. When you’re uncertain about the value of a quantity, a good estimate may straddle the line between multiple options. While your expectation for a quantity isn’t necessarily the exact value you think it will take on a given occasion, it should equal the average value you expect that quantity to take in the long run. Suppose you’re certain that our batter will play in many, many games. The law of large numbers says that if you satisfy the probability axioms, you’ll have credence 1 that as the number of games increases, her average number of hits per game will tend towards your expectation for that quantity. In other words, you’re highly confident that as the number of games approaches the limit, the batter’s average hits per game will approach 2.2.1 We’ve already calculated expectations for a few different quantities in this book. For example, when you lack inadmissible evidence the Principal
Principle requires your credence in a proposition to equal your expectation of its chance. (See especially our calculation in Equation (5.7).) But by far the most commonly calculated expectations in life are monetary values. For example, suppose you have the opportunity to buy stock in a company just before it announces quarterly earnings. If the announcement is good you’ll be able to sell shares at $100 each, but if the announcement is bad you’ll be forced to sell at $10 apiece. The value you place in these shares depends on your confidence in a good report. If you’re 40% confident in a good earnings report, your expected value for each share is
$100 · 0.40 + $10 · 0.60 = $46    (7.2)
As a convention, we let positive monetary values stand for money accrued to the agent; negative monetary values are amounts the agent pays out. So your expectation of how much money you will receive for each share is $46. An agent’s fair price for an investment is what she takes to be that investment’s break-even point—she’d pay anything up to that amount of money in exchange for the investment. If you use expected values to make your investment decisions, your fair price for each share of the stock just described will be $46. If you buy shares for less than $46 each, your expectation for that transaction will be positive (you’ll expect to make money on it). If you buy shares for more than $46, you’ll expect to lose money. The idea that your fair price for an investment should equal your expectation of its monetary return dates to Blaise Pascal, in a famous 17th-century correspondence with Pierre Fermat. (Fermat and Pascal 1654/1929) There are a couple of reasons why this is a sensible idea. First, suppose you know you’re going to be confronted with this exact investment situation many, many times. The law of large numbers says that you should anticipate a long-run average return of $46 per share. So if you’re going to adopt a standing policy for buying and selling such investments, you are highly confident that any price higher than $46 will lose you money and any price lower than $46 will make you money in the long term. Second, expectations vary in intuitive ways when conditions change. If you become more confident in a good earnings report, each share becomes more valuable to you, and you should be willing to pay a higher price. This is exactly what the expected value calculation predicts. If you learn that a good earnings report will send the share value to only $50, this decreases the expected value of the investment and also decreases the price you should be willing to pay. An investment is a type of bet, and fair betting prices play a significant role in Bayesian lore. (We’ll see one reason why in Chapter 9.) A bet that
pays $1 if proposition P is true and nothing otherwise has an expected value of

$1 · cr(P) + $0 · cr(~P) = $cr(P)    (7.3)
If you use expectations to calculate fair betting prices, your price for a gamble that pays $1 on P equals your unconditional credence in P. We can also think of fair betting prices in terms of odds. We saw in Section 2.3.4 that an agent’s odds against P equal cr(~P) : cr(P). So if the agent’s credence in P is 0.25, her odds against P are 3 : 1. What will she consider to be a fair bet on P? Consider what the casinos would call a bet on P at 3 : 1 odds. If you place such a bet and win, you get back the original amount you bet plus 3 times that amount. If you lose your bet, you’re out however much you bet. So suppose the agent with 0.25 credence in P places a $20 bet on P at 3 : 1 odds. Her expected net return is
cr(P) · (net return on winning bet) + cr(~P) · (net return on losing bet)
    = 0.25 · $60 + 0.75 · (−$20) = $0    (7.4)
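Here is a minimal Python sketch of both calculations (the fair price for a $1 bet on P, and the expected net return of a $20 bet at 3 : 1 odds), using the 0.25 credence from the example:

```python
cr_P = 0.25                                      # credence in P

# Equation (7.3): fair price of a bet paying $1 if P, $0 otherwise
fair_price = 1.00 * cr_P + 0.00 * (1 - cr_P)     # $0.25, i.e. cr(P) in dollars

# Odds against P are cr(~P) : cr(P); here 0.75 : 0.25, i.e. 3 : 1
odds_against = (1 - cr_P) / cr_P                 # 3.0

# Equation (7.4): expected net return of a $20 bet on P at 3 : 1 odds
stake = 20
expected_net = cr_P * (odds_against * stake) + (1 - cr_P) * (-stake)

print(fair_price, odds_against, expected_net)    # 0.25 3.0 0.0
```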
This agent expects a bet on P at 3 : 1 odds to be a break-even gamble—from her perspective, it’s a fair bet. She will be willing to bet on P at those odds or anything higher. In general, an agent who bets according to her expectations will accept a bet on a proposition at odds equal to her odds against it, or anything higher. Remember that an agent’s odds against a proposition increase as her credence in the proposition decreases. So if an agent becomes less confident in P, you need to offer her higher odds on P before she’ll be willing to gamble. A lottery ticket is a type of bet, and in the right situation calculating its expected value can be highly lucrative. Ellenberg (2014, Ch. 11) relates the story of Massachusetts’ Cash WinFall state lottery game, which was structured so that if the jackpot got large enough, the expected payoff for a single ticket would climb higher than the price the state charged for that ticket. For example, on February 7, 2005 the expected value of a $2 lottery ticket was $5.53. The implications of this arrangement were understood by three groups of individuals—led respectively by an MIT student, a medical researcher in Boston, and a retiree in Michigan who had played a short-lived similar game in his home state. Of course, the expected value of a ticket isn’t necessarily what you will win if you buy a single ticket, but because of the long-run behavior of expectations your confidence in a net profit goes up the more tickets you buy. So these groups bought a lot of tickets. For instance, on August 13, 2010 the MIT group bought around 700,000 tickets,
almost 90% of the Cash WinFall tickets purchased that day. Their $1.4 million investment netted about $2.1 million in payouts, for a 50% profit in one day. Expected value theory can be extremely effective.
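A quick back-of-the-envelope check of those figures (a Python sketch; the numbers are the approximate ones reported above):

```python
ticket_price = 2.00
expected_value_per_ticket = 5.53            # reported figure for February 7, 2005
print(expected_value_per_ticket - ticket_price)   # expected profit per ticket: about $3.53

tickets_bought = 700_000                    # approximate MIT-group purchase, August 13, 2010
investment = tickets_bought * ticket_price  # 1,400,000 dollars
payouts = 2_100_000                         # approximate payouts received
print(payouts / investment - 1)             # ~0.5, i.e. roughly a 50% one-day profit
```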
7.1.1 The move to utility
Yet sometimes we value something other than money. For example, suppose it’s late at night, it’s cold out, you’re trying to catch a bus that costs exactly $1, and you’ve got no money on you. A stranger offers either to give you $1 straight up, or to flip a fair coin and give you $2.02 if it comes up heads. It might be highly rational for you to prefer the guaranteed dollar even though its expected monetary value is less than that of the coin bet. Decision theorists and economists explain this preference with the notion of utility. Introduced by Daniel Bernoulli and Gabriel Cramer in the 18th century,2 utility is a numerical quantity meant to directly measure how much an agent values an arrangement of the world. Just as we suppose that each agent has her own credence distribution, we will suppose that each agent has a utility distribution over the propositions in language L. The utility an agent assigns a proposition represents how much she values that proposition’s being true (or if you like, how happy that proposition’s being true would make her). If an agent would be just as happy for one proposition to be true as another, she assigns them equal utility. But if it would make her happier for one of those propositions to be true, she assigns it the higher utility of the two. Utilities provide a uniform value-measurement scale. In the bus example above, you don’t value each dollar equally. Going from zero dollars to one dollar would mean a lot to you; it would get you out of the cold and on your way home. Going from one dollar to two dollars would not mean nearly as much in your present context. Not every dollar represents the same amount of value in your hands, so counting the number of dollars in your possession is not a consistent measure of how much you value your current state. On the other hand, utilities measure value uniformly. We stipulate that each added unit of utility (sometimes called a util) is equally valuable to an agent. She is just as happy to go from −50 utils to −49 as she is to go from 1 util to 2, and so on. Having introduced this uniform value scale, we can explain your preferences in the bus case using expectations. Admittedly, the coin flip gamble has a higher expected monetary payoff ($1.01) than the guaranteed dollar. But monetary value doesn’t always translate neatly to utility, and utility reflects the values on which you truly make your decisions. Let’s say that
having no money is worth 0 utils to you in this case, receiving one dollar and being able to get on the bus is worth 100 utils, and receiving $2.02 is worth 102 utils. (The larger amount of money is still more valuable to you; just not much more valuable.) When we calculate the expected utility of the gamble, it only comes to 51 utils, which is much less than the 100 expected utils associated with the guaranteed dollar. So you prefer the dollar guarantee. The setup of this example is somewhat artificial, because it makes the value of money change radically at a particular cutoff point. But economists think money generally has a decreasing marginal utility for agents. While an agent always receives some positive utility from each additional dollar (or peso, or yuan, or . . . ), the more dollars she already has the less extra utility it will bring. The first billion you earn makes your family comfortable; the second billion doesn’t have as much significance for your life. Postulating an underlying locus of value distinguishable from net worth helps explain why we don’t always chase the next dollar as hard as we chased the first. With that said, quantifying value on a constant numerical scale introduces many of the same problems that we found with quantifying confidence. First, it’s not clear that a real agent’s psychology will always be as nuanced as a numerical utility structure seems to imply. And second, the moment you assign numerical utilities to every arrangement of the world you make them all comparable; the possibility of incommensurable values is lost. (Compare Section 1.2.2.)
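Before moving on, here is the bus example worked numerically (a minimal Python sketch using the utilities stipulated above); the coin flip wins on expected money but loses on expected utility:

```python
cr_heads = 0.5                          # fair coin

# Expected monetary value of the coin-flip offer versus the sure dollar
expected_money_flip = cr_heads * 2.02 + (1 - cr_heads) * 0.00   # $1.01
expected_money_sure = 1.00

# Stipulated utilities: $0 -> 0 utils, $1 (and the bus ride) -> 100, $2.02 -> 102
expected_utils_flip = cr_heads * 102 + (1 - cr_heads) * 0       # 51 utils
expected_utils_sure = 100                                       # certain outcome

print(expected_money_flip > expected_money_sure)   # True: the gamble wins on money
print(expected_utils_flip > expected_utils_sure)   # False: the sure dollar wins on utility
```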
7.2 Expected Utility Theory

7.2.1 Preference orderings, and money pumps
A decision problem presents an agent with a partition of acts, from which she must choose exactly one. Her choice tracks her preferences among the acts. If the available acts are A and B, and she prefers A to B (we write A > B), then the agent decides to perform action A. A similar point applies when B > A. Yet it might be that the agent is indifferent between A and B (we write A ∼ B), in which case she may choose either one. Sometimes these decisions are easy. If the agent is certain how much utility will be generated by the performance of each act, the choice is simple—she prefers the act leading to the highest-utility result. Yet the utility resulting from an act often depends on features of the world beyond the agent’s control (think, for instance, of the factors determining whether a particular
career choice turns out well), and the agent may be uncertain how those features stand. In that case, the agent needs a technique for factoring uncertainty into her decision. She needs a technique for combining credences and utilities to generate preferences. Decision theory responds to this problem by providing a valuation function, which combines credences and utilities to assign each act a numerical score. The agent’s preferences follow from these scores: A > B just in case A receives a higher score than B, while A ∼ B when the scores are equal. Given a particular decision problem, a rational agent will select the available act with the highest score (or—if there are ties at the top—one of the acts with the highest score). Just to give one example of a valuation function, suppose you assigned each act a numerical score as follows: consider all the possible worlds to which you assign nonzero credence, find the one in which that act produces the lowest utility, and then assign that utility value as the act’s score. This valuation function generates preferences that satisfy the maximin rule, so-called because it selects the act with the highest minimum utility payoff. Maximin attends only to the worst case scenario for each available act.
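A minimal Python sketch of the maximin valuation function just described, with hypothetical acts, states, credences, and utilities. Because every act receives a numerical score, sorting by score yields a complete preference ordering:

```python
# Hypothetical credences over states and utilities for each act/state pair.
credences = {"S1": 0.4, "S2": 0.6}
utilities = {
    "A": {"S1": 10, "S2": -1},
    "B": {"S1": 2,  "S2": 1},
    "C": {"S1": 0,  "S2": 0},
}

def maximin_score(act):
    # The act's utility in the worst state it might face,
    # restricted to states assigned nonzero credence.
    return min(u for state, u in utilities[act].items() if credences[state] > 0)

# Scoring every act induces a total (complete, transitive) preference ordering.
ordering = sorted(utilities, key=maximin_score, reverse=True)
print([(act, maximin_score(act)) for act in ordering])
# [('B', 1), ('C', 0), ('A', -1)]
```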
While maximin is just one valuation function (we’ll see others later), any approach that ties preferences to numerical scores assigned over acts imposes a certain structure on an agent’s preferences. For instance, it guarantees that her preferences will display:

Preference Transitivity: For any acts A, B, and C, if the agent prefers A to B and B to C, then the agent prefers A to C.

This follows from the fact that if act A’s score is greater than act B’s, and B’s is greater than C’s, then A’s must be greater than C’s as well. Preference Transitivity is suggested as a rational constraint on agents’ preferences. One might object that an agent may prefer A to B and prefer B to C, but never have thought to compare A to C. In other words, one might think that an agent’s preference ordering could go silent on the comparison between A and C and still be rational. Yet by coordinating preference with a numerical valuation over the entire partition of acts, we have already settled this issue; we have required the agent’s preferences to form a total ordering. Since every act receives a score, every act is comparable, and the agent has a preference (or indifference) between any two acts. Decision theorists sometimes express this as:

Preference Completeness: For any acts A and B, exactly one of the following is true: the agent prefers A to B, the agent prefers B to A, or the agent is indifferent between the two.
Notice that Preference Completeness entails the following:

Preference Asymmetry: There do not exist acts A and B such that the agent both prefers A to B and prefers B to A.

To recap: The first move of decision theory is to coordinate preferences with the output of a valuation function combining credences and utilities. By making this move, decision theory requires preferences to satisfy Preference Transitivity and Asymmetry. Hopefully it’s intuitive that rational preferences satisfy these two conditions. But we can do better than that: We can provide an argument for them. Consider a situation in which some of us find ourselves frequently. On any given weeknight, I would prefer to do something else over washing the dishes. (Going to a movie? Great! Watching the game? Good idea!) But when the week ends and the dishes have piled up, I realize that I would’ve preferred foregoing one of those weeknight activities in order to avoid a disgusting kitchen. Each of my individual decisions was made in accordance with my preferences among the acts I was choosing between at the time, yet together those local preferences added up to a global outcome I disprefer. A student once suggested to me that he prefers eating out to cooking for himself, prefers eating at a friend’s to eating out, but prefers cooking for himself to eating at a friend’s. Imagine one night my student is preparing himself dinner, then decides he’d prefer to order out. He calls up the takeout place, but before they pick up the phone he decides he’d rather drive to his friend’s for dinner. He gets in his car and is halfway to his friend’s, when he decides he’d rather cook for himself. At which point he turns around and goes home, having wasted a great deal of time and energy. Each of those choices reflects the student’s preference between the two options he considers at the time, yet their net effect is to leave him right back where he started meal-wise and out a great deal of effort overall. My student’s preferences violate Transitivity; as a result he’s susceptible to a money pump. In general, a money pump against intransitive preferences (preferring A to B, B to C, and C to A) can be constructed like this: Suppose you’re about to perform act B, and I suggest I could make it possible to do A instead. Since you prefer A to B, there must be some amount of something (we’ll just suppose it’s money) you’d be willing to pay me for the option to perform A. So you pay the price, are about to perform A, but then I hold out the possibility of performing C instead. Since you prefer C to A, you pay me a small amount to make that switch. But then I offer you the opportunity to perform B rather than C—for a small price, of
course. And now you’re back to where you started with respect to A, B, and C, but out a few dollars for your trouble. To add insult to injury, I could repeat this set of trades again, and again, milking more and more money out of you until I decide to stop. Hence the “money pump” terminology.3 Violating Preference Transitivity leaves one susceptible to a money-pumping set of trades. (If you violate Preference Asymmetry, the money pump is even simpler.) In a money pump, the agent proceeds through a series of exchanges, each of which looks favorable given his preferences between the two acts involved. But when those exchanges are combined, the total package produces a net loss (which the agent would prefer to avoid). The money pump therefore seems to reveal an inconsistency between the agent’s local and global preferences, as in my dishwashing example. (We will further explore this kind of inconsistency in our Chapter 9 discussion of Dutch Books.) The irrationality of being susceptible to a money pump has been taken as a strong argument against violating Preference Asymmetry or Transitivity.4

7.2.2 Savage’s expected utility

Savage (1954) frames decision problems using a partition of acts available to the agent and a partition of states the world might be in. A particular act performed with the world in a particular state produces a particular outcome. Agents assign numerical utility values to outcomes; given partial information they also assign credences over states.5 Here’s a simple example: Suppose you’re trying to decide whether to carry an umbrella today, but you’re uncertain whether it’s going to rain. This table displays the utilities you assign various outcomes:
                 rain    dry
take umbrella       0     −1
leave it          −10      0

You have two available acts, represented in the rows of the table. There are two possible states of the world, represented in the columns. Performing a particular act when the world is in a particular state produces a particular outcome. If you leave your umbrella behind and it rains, the outcome is you walking around wet. The cells in the table report your utilities for the outcomes produced by various act/state combinations. Your utility for
walking around wet is −10 utils, while carrying an umbrella on a dry day is inconvenient but not nearly as unpleasant (−1 util). How should you evaluate available acts and set your preferences among them? For a finite partition S1, S2, . . . , Sn of possible states of the world,
Savage endorses the following valuation function:

EUsav(A) = u(A & S1) · cr(S1) + u(A & S2) · cr(S2) + . . . + u(A & Sn) · cr(Sn)    (7.5)
n
Here A is the particular act being evaluated. Savage evaluates acts by calculating their expected utilities; EUsav A represents the expected utility of act A calculated in the manne r Savage prefers. (We’ll see other ways of calculating expected utility later on.) cr Si is the agent’s unconditional credence that the world is in state Si ; u A & S i is the utility she assigns to the outcome that will eventuate should she perform act A in state Si .6 So EU sav calculates the weighted average of the utilities the agent might receive if she performs A, weighted by her credence that she will receive each one. Savage hold s that given a decision among a partition of acts, a rational agent will set her preferences in line with her expected utilitie s. She
p q p q p q
will choose to perform an act with at least as great an expected utility as that of any act on offer. Now suppose that in the umbrella case you have a 0 .30 credence in rain. We can calculate expected utilities for each of the available acts as follows:
p p
q “ 0 ¨ 0.30 ` ´1 ¨ 0.70 “ ´0.7 q “ ´10 ¨ 0.30 ` 0 ¨ 0.70 “ ´3
EUsav take
EUsav leave
(7.6)
Taking the umbrella has the higher expected utility, so Savage thinks that if you’re rational you’ll prefer to take the umbrella. You’re more confiden t it’ll be dry than rain, but this is outweighed by the much greater disutility of a disadvantageous decision in the latter case than the former. EUsav is a valuation function that combines credences and utilities in a specific way to assign numerical scores to acts. As a numerical valuation function, it generates a preference ordering satisfying Preference Asymmetry, Transitivity, and Completeness. But calculating expected utilities this way also introduces new features not shared by all valuation functions. For example, Savage’s expected utility theory yields preferences that satisfy the: Dominance Principle: If act A produces a higher-utility outcome than act B in each possible state of the world, then A is preferred to B.
The Dominance Principle7 seems intuitively like a good rational principle. Yet (surprisingly) there are decision problems in which it yields very bad results. Since Savage’s expected utility theory entails the Dominance Principle, it can be relied upon only when we don’t find ourselves in decision problems like that.
7.2.3 Jeffrey’s theory
To see what can go wrong with dominance reasoning, consider this example from (Weirich 2012):

    A student is considering whether to study for an exam. He reasons that if he will pass the exam, then studying is wasted effort. Also, if he will not pass the exam, then studying is wasted effort. He concludes that because whatever will happen, studying is wasted effort, it is better not to study.

The student entertains two possible acts—study or not study—and two possible states of the world—he either passes the exam or he doesn’t. His utility table looks something like this:

                 pass   fail
study              18     −5
don’t study        20     −3
Because studying costs effort, passing having not studied is better than passing having studied, and failing having not studied is also better than failing having studied. So whether he passes or fails, not studying yields a higher utility. By the Dominance Principle, the student should prefer not studying to studying. This is clearly a horrible argument; it ignores the fact that whether the student studies affects whether he passes the exam.8 The Dominance Principle—and Savage’s expected utility theory in general—breaks down when the state of the world depends on the act the agent performs. Savage recognizes this limitation, and so requires that the acts and states used in framing decision problems be independent of each other. Jeffrey (1965), however, notes that in real life we often analyze decision problems in terms of dependent acts and states. Moreover, he worries that agents might face decision problems in which they are unable to identify independent acts and
states.9 So it would be helpful to have a decision theory that didn’t require acts and states to be independent. Jeffrey offers just such a theory. The key innovation is a new valuation function that calculates expected utilities differently from Savage’s.10 Given an act A and a finite partition S1, S2, . . . , Sn of possible states of the world, Jeffrey calculates

EUedt(A) = u(A & S1) · cr(S1 | A) + u(A & S2) · cr(S2 | A) + . . . + u(A & Sn) · cr(Sn | A)    (7.7)
I’ll explain the “EDT” subscript later on; for now, it’s crucial to see that Jeffrey alters Savage’s approach (Equation (7.5)) by replacing the agent’s unconditional credence that a given state Si obtains with the agent’s conditional credence that Si obtains given A. This incorporates the possibility that performing the act the agent is evaluating will change the probabilities of various states of the world. To see how this works, consider Jeffrey’s (typically civilized) example of a guest deciding whether to bring white or red wine to dinner. The guest is certain his host will serve either chicken or beef, but doesn’t know which. The guest’s utility table is as follows:

           chicken   beef
white          1      −1
red            0       1
For this guest, bringing the right wine is always pleasurable. Red wine with chicken is merely awkward, while white wine with beef is a disaster. Typically, the entree for an evening is settled well before the guests arrive. But let’s suppose our guest suspects his host is especially accommodating. The guest is 75% confident that the host will select a meat in response to the wine provided. (Perhaps the host has a stocked pantry, and waits to prepare dinner until the wine has arrived.) In that case, the state (meat served) depends on the agent’s act (wine chosen). This means the agent cannot assign a uniform unconditional credence to each state prior to his decision. Instead, the guest assigns one credence to chicken conditional on his bringing white, and another credence to chicken conditional on his bringing red. These credences are reflected in the following table:

           chicken   beef
white        0.75     0.25
red          0.25     0.75
It’s important to read the credence table differently from the utility table. In the utility table, the entry in the white/chicken cell is the agent’s utility assigned to the outcome of chicken served and white wine. In the credence table, the white/chicken entry is the agent’s credence in chicken served given white wine. The probability axioms and Ratio Formula together require all the credences conditional on white wine sum to 1, so the values in the first row sum to 1. The values in the second row sum to 1 for a similar reason. (In this example the values in each column sum to 1 as well, but that won’t always be the case.) We can now use Jeffrey’s formula to calculate the agent’s expected utility for each act. For instance,
EUedt(white) = u(white & chicken) · cr(chicken | white) + u(white & beef) · cr(beef | white)
             = 1 · 0.75 + (−1) · 0.25 = 0.5    (7.8)
(We multiply the values in the first row of the utility table by the corresponding values in the first row of the credence table, then sum the results.) A similar calculation yields EUedt(red) = 0.75. Bringing red wine has a higher expected utility for the agent than bringing white, so the agent should prefer bringing red. Earlier I said somewhat vaguely that Savage requires acts and states to be “independent”; Jeffrey’s theory gives that notion a precise meaning. EUedt revolves around an agent’s conditional credences, so for Jeffrey the relevant notion of independence is probabilistic independence relative to the agent’s credence function. That is, an act A and state Si are independent for Jeffrey just in case

cr(Si | A) = cr(Si)    (7.9)
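A minimal Python sketch of Jeffrey’s calculation for the wine example, using the two tables above; the only change from the Savage sketch is that the weights are now conditional credences cr(state | act):

```python
u = {                                      # utility table
    "white": {"chicken": 1, "beef": -1},
    "red":   {"chicken": 0, "beef": 1},
}
cr_given = {                               # credence table: cr(state | act)
    "white": {"chicken": 0.75, "beef": 0.25},
    "red":   {"chicken": 0.25, "beef": 0.75},
}

def eu_edt(act):
    # Equation (7.7): weight each outcome's utility by cr(state | act).
    return sum(u[act][state] * cr_given[act][state] for state in cr_given[act])

print(eu_edt("white"))   # 0.5
print(eu_edt("red"))     # 0.75 -- so the guest should bring red
```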
In the special case where the act A being evaluated is independent of each state Si, the cr(Si | A) expressions in Jeffrey’s formula may be replaced with cr(Si) expressions. This makes Jeffrey’s expected utility calculation identical to Savage’s. When acts and states are probabilistically independent, Jeffrey’s theory yields the same preferences as Savage’s. And since Savage’s theory entails the Dominance Principle, Jeffrey’s theory will also embrace Dominance in this special case. But what happens to Dominance when acts and states are dependent? Here Jeffrey offers a nuclear deterrence example. Suppose a nation is choosing whether to arm itself with nuclear weapons, and knows its rival nation
will follow its lead. The possible states of the world under consideration are war versus peace. The utility table might be:

             war   peace
arm         −100      0
disarm       −50     50
Wars are worse when both sides have nuclear arms; peace is also better without nukes on hand (because of nuclear accidents, etc.). A dominance argument is now available since whichever state obtains, disarming provides the greater utility. So applying Savage’s theory to this example would yield a preference for disarming. Yet the advocate of nuclear deterrence takes the states in this example to depend on the acts. The deterrence advocate’s credence table might be:

             war   peace
arm          0.1    0.9
disarm       0.8    0.2
The idea of deterrence is that if both countries have nuclear arms, war becomes much less likely. If arming increases the probability of peace, the acts and states in this example are probabilistically dependent. Jeffrey’s theory calculates the following expected utilities from these tables:
EUedt(arm) = −100 · 0.1 + 0 · 0.9 = −10
EUedt(disarm) = −50 · 0.8 + 50 · 0.2 = −30    (7.10)
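The same Python sketch as before, now applied to the deterrence tables; it also checks state-by-state dominance, so you can see the two verdicts come apart:

```python
u = {
    "arm":    {"war": -100, "peace": 0},
    "disarm": {"war": -50,  "peace": 50},
}
cr_given = {                               # the deterrence advocate's cr(state | act)
    "arm":    {"war": 0.1, "peace": 0.9},
    "disarm": {"war": 0.8, "peace": 0.2},
}

def dominates(a, b):
    # a dominates b iff a yields a higher utility than b in every state
    return all(u[a][s] > u[b][s] for s in u[a])

def eu_edt(act):
    return sum(u[act][s] * cr_given[act][s] for s in cr_given[act])

print(dominates("disarm", "arm"))        # True: disarming dominates
print(eu_edt("arm"), eu_edt("disarm"))   # roughly -10 and -30: Jeffrey's theory prefers arming
```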
Relative to the deterrence advocate’s credences, Jeffrey’s theory yields a preference for arming. Act/state dependence has created a preference ordering at odds with the Dominance Principle.11 When an agent takes the acts and states in a decision problem to be independent, Jeffrey’s and Savage’s decision theories are interchangeable, and dominance reasoning is reliable. But Jeffrey’s theory also provides reliable verdicts when acts and states are dependent, a case in which Savage’s theory and the Dominance Principle may fail.
7.2.4 Risk aversion, and Allais’ paradox
Different people respond to risks differently. Many agents are risk-averse; they would rather have a sure $10 than take a 50-50 gamble on $30, even though the expected dollar value of the latter is greater than that of the former.
Economists have traditionally explained this preference by appealing to the declining marginal utility of money. If the first $10 yields much more utility than the next $20 for the agent, then the sure $10 may in fact have a higher expected utility than the 50-50 gamble. This makes the apparently risk-averse behavior perfectly rational. But it does so by portraying the agent as only apparently risk-averse. The suggestion is that the agent would be happy to take a risk if only it offered her a higher expectation of what she really values—utility. But might some agents genuinely be willing to give up a bit of expected utility if it meant they didn’t have to gamble? If we could offer agents a direct choice between a guaranteed 10 utils and a 50-50 gamble on 30, might some prefer the former? (Recall that utils are defined so as not to decrease in marginal value.) And might that preference be rationally permissible? Let’s grant for the sake of argument that risk-aversion concerning monetary gambles can be explained by attributing to the agent a decreasing marginal utility distribution over dollars. Other documented responses to risk cannot be explained by any kind of utility distribution. Suppose a fair lottery is to be held with 100 numbered tickets. You get to choose between two gambles, with the following payoffs should particular tickets be drawn:

             Ticket 1   Tickets 2–11   Tickets 12–100
Gamble A       $1M          $1M             $1M
Gamble B       $0           $5M             $1M

(Here “$1M” is short for 1 million dollars.) Which gamble would you prefer? After recording your answer somewhere, consider the next two gambles (on the same lottery) and decide which of them you would prefer if they were your only options:

             Ticket 1   Tickets 2–11   Tickets 12–100
Gamble C       $1M          $1M             $0
Gamble D       $0           $5M             $0

When subjects are surveyed, they often prefer Gamble D to C; they’re probably not going to win anything, but if they do they’d like a serious shot at $5 million. On the other hand, many of the same subjects prefer Gamble A to B, because A guarantees them a payout of $1 million. Yet anyone who prefers A to B while at the same time preferring D to C violates Savage’s12

Sure-Thing Principle: If two acts yield the same outcome on a particular state, any preference between them remains the same if that outcome is changed.
In our example, Gambles A and B yield the same outcome for tickets 12 through 100: 1 million dollars. If we change that common outcome to 0 dollars, we get Gambles C and D. The Sure-Thing Principle requires an agent who prefers A to B also to prefer C to D. Put another way: if the Sure-Thing Principle holds, we can determine a rational agent’s preferences between any two acts by focusing exclusively on the states for which those acts produce different outcomes. In both the decision problems here, tickets 12 through 100 produce the same outcome no matter which act the agent selects. So we ought to be able to determine her preferences by focusing exclusively on the outcomes for tickets 1 through 11. Yet if we focus exclusively on those tickets, A stands to B in exactly the same relationship as C stands to D. So the agent’s preferences across the two decisions should be aligned. The Sure-Thing Principle is a theorem of Savage’s decision theory. It is therefore also a theorem of Jeffrey’s decision theory for cases in which acts and states are independent, as they are in the present gambling example. Thus preferring A to B while preferring D to C—as real-life subjects often do—is incompatible with those two decision theories. And here we can’t chalk up the problem to working with dollars rather than utils. There is no possible utility distribution over dollars on which Gamble A has a higher expected utility than Gamble B while Gamble D has a higher expected utility than Gamble C. (See Exercise 7.6.) Jeffrey and Savage, then, must shrug off these commonly-paired preferences as irrational. Yet Maurice Allais, the Nobel-winning economist who introduced the gambles in his (1953), thought that this combination of preferences could be perfectly rational. Because it’s impossible to maintain these seemingly-reasonable preferences while hewing to standard decision theory, the example is now known as Allais’ Paradox. Allais thought the example revealed a deep flaw in the decision theories we’ve been considering.13 We have been discussing decision theories as normative accounts of how rational agents behave. Economists, however, often assume that decision theory provides an accurate descriptive account of real agents’ market decisions. Real-life subjects’ responses to cases like the Allais Paradox prompted economists to develop new descriptive theories of agents’ behavior, such as Kahneman and Tversky’s Prospect Theory (Kahneman and Tversky 1979; Tversky and Kahneman 1992). More recently, Buchak (2013) has proposed a generalization of standard decision theory that accounts for risk aversion without positing declining marginal utilities, and is consistent with the Allais preferences subjects often display.
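The impossibility claim is easy to check numerically. In the sketch below (Python, with randomly generated but monotone utilities for the three payoffs), the expected-utility difference between A and B always equals the difference between C and D, because the gambles agree on tickets 12 through 100; so no utility assignment can rank A above B while ranking D above C:

```python
import random

probs = {"t1": 0.01, "t2_11": 0.10, "t12_100": 0.89}   # a fair 100-ticket lottery

gambles = {
    "A": {"t1": "1M", "t2_11": "1M", "t12_100": "1M"},
    "B": {"t1": "0",  "t2_11": "5M", "t12_100": "1M"},
    "C": {"t1": "1M", "t2_11": "1M", "t12_100": "0"},
    "D": {"t1": "0",  "t2_11": "5M", "t12_100": "0"},
}

def eu(gamble, u):
    return sum(probs[t] * u[gambles[gamble][t]] for t in probs)

for _ in range(5):
    u0, u1, u5 = sorted(random.uniform(0, 100) for _ in range(3))
    u = {"0": u0, "1M": u1, "5M": u5}        # any monotone utilities for the payoffs
    diff_AB = eu("A", u) - eu("B", u)
    diff_CD = eu("C", u) - eu("D", u)
    print(abs(diff_AB - diff_CD) < 1e-9)     # True every time
```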
7.3 Causal Decision Theory
Although we have been focusing on the expected values of propositions describing acts, Jeffrey’s valuation function can be applied to any sort of proposition. For example, suppose my favorite player has been out of commission for weeks with an injury, and I am waiting to hear whether he will play in tonight’s game. I start wondering whether I would prefer that he play tonight or not. Usually it would make me happy to see him on the field, but there’s the possibility that he will play despite his injury’s not being fully healed. That would definitely be a bad outcome. So now I combine my credences about states of the world (is he fully healed? is he not?) with my utilities for the various possible outcomes (plays fully healed, plays not fully healed, etc.) to determine how happy I would be to hear that he’s playing or not playing. Having calculated expected utilities for both “plays” and “doesn’t play”, I decide whether I’d prefer that he play or not. Put another way, I can use Jeffrey’s expected utility theory to determine whether I would consider it good news or bad were I to hear that my favorite player will be playing tonight. And I can do so whether or not I have any influence on the truth of that proposition. Jeffrey’s theory is sometimes described as calculating the “news value” of a proposition. Even for propositions describing our own acts, Jeffrey’s expected utility calculation assesses news value. I might be given a choice between a sure $1 and a 50-50 chance of $2.02. I would use my credences and utility function to determine expected values for each act, then declare which option I preferred. But notice that this calculation would go exactly the same if instead of my selecting among the options, someone else was selecting on my behalf. What’s ultimately being compared are the proposition that I receive a sure dollar and the proposition that I receive whatever payoff results from a particular gamble. Whether I have the ability to make one of those propositions true rather than the other is irrelevant to Jeffrey’s preference calculations.
7.3.1 Newcomb’s Problem
Jeffrey’s focus on news value irrespective of agency leads him into trouble with Newcomb’s Problem. This problem was introduced to philosophy by Robert Nozick, who attributed its construction to the physicist William Newcomb. Here’s how Nozick introduced the problem:

    Suppose a being in whose power to predict your choices you have enormous confidence. (One might tell a science-fiction story
    about a being from another planet, with an advanced technology and science, who you know to be friendly, etc.) You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below. One might tell a longer story, but all this leads you to believe that almost certainly this being’s prediction about your choice in the situation to be discussed will be correct. There are two boxes. [The first box] contains $1,000. [The second box] contains either $1,000,000, or nothing. . . . You have a choice between two actions: (1) taking what is in both boxes (2) taking only what is in the second box. Furthermore, and you know this, the being knows that you know this, and so on: (I) If the being predicts you will take what is in both boxes, he does not put the $1,000,000 in the second box. (II) If the being predicts you will take only what is in the second box, he does put the $1,000,000 in the second box. The situation is as follows. First the being makes its prediction. Then it puts the $1,000,000 in the second box, or does not, depending upon what it has predicted. Then you make your choice. What do you do? (1969, pp. 114–5)

Historically, Newcomb’s Problem prompted the development of a new kind of decision theory, now known as Causal Decision Theory (sometimes just “CDT”). At the time of Nozick’s discussion, extant decision theories (such as Jeffrey’s) seemed to recommend taking just one box in Newcomb’s Problem (so-called “one-boxing”). But many philosophers thought two-boxing was the rational act.14 By the time you make your decision, the being has already made its prediction and taken its action. So the money is already either in the second box, or it’s not—nothing you decide can affect whether the money is there. However much money is in the second box, you’re going to get more money ($1,000 more) if you take both boxes. So you should two-box. I’ve quoted Nozick’s original presentation of the problem because in the great literature that has since grown up around Newcomb, there is often
debate about what exactly counts as “a Newcomb Problem”. Does it matter whether the agent is certain that the prediction will be correct? Does it matter how the predictor makes its predictions, and whether backward causation (some sort of information fed backwards from the future) is involved? Perhaps more importantly, who cares about such a strange and fanciful problem? But our purpose is not generalized Newcombology—we want to understand why Newcomb’s Problem spurred the development of Causal Decision Theory. That can be understood by working with just one version of the problem. Or better yet, it can be understood by working with a kind of problem that comes up in everyday life, and is much less fanciful: I’m standing at the bar, trying to decide whether to order a third appletini. Drinking a third appletini is the kind of act much more typical of people with addictive personalities. People with addictive personalities also tend to become smokers. I’d kind of like to have another drink, but I really don’t want to become a smoker (smoking causes lung cancer, is increasingly frowned upon in my social circle, etc.). So I shouldn’t order that next appletini. Let’s work through the reasoning here on decision-theoretic grounds. First, stipulate that I have the following utility table:

                     smoker   non-smoker
third appletini        −99         1
no more               −100         0
Ordering the third appletini is a dominant act. But dominance should dictate preference only when acts and states are independent, and my concern here is that they’re not. My credence distribution has the following features (with A, S, and P representing the propositions that I order the appletini, that I become a smoker, and that I have an addictive personality, respectively):
cr(S | P) > cr(S | ~P)    (7.11)
cr(P | A) > cr(P | ~A)    (7.12)
I’m more confident I’ll become a smoker if I have an addictive personality than if I don’t. And having that third appletini is a positive indication that I
have an addictive personality. Combining these two equations (and making a couple more assumptions I won’t bother spelling out), we get:
cr(S | A) > cr(S | ~A)    (7.13)
From my point of view, ordering the third appletini is positively correlated with becoming a smoker. Looking back at the utility table, I do not consider the states listed along the top to be probabilistically independent of the acts along the side. Luckily, Jeffrey’s decision theory works even when acts and states are dependent. So I apply Jeffrey’s valuation function to calculate expected utilities for the two acts:
EUedt(A) = −99 · cr(S | A) + 1 · cr(~S | A)
EUedt(~A) = −100 · cr(S | ~A) + 0 · cr(~S | ~A)    (7.14)
Looking at these equations, you might think that A receives the higher expected utility. But I assign a considerably higher value to cr(S | A) than to cr(S | ~A), so the -99 in the top equation is multiplied by a significantly larger quantity than the -100 in the bottom equation. Assuming the correlation between S and A is strong enough, ~A receives the better expected utility and I prefer to perform ~A.

But this is all wrong! Whether I have an addictive personality is (let's say) determined by genetic factors, not anything I could possibly affect at this point in my life. The die is cast (so to speak); I either have an addictive personality or I don't; it's already determined (in some sense) whether an addictive personality is going to lead me to become a smoker. Nothing about this appletini—whether I order it or not—is going to change that. So I might as well enjoy the drink.15

Assuming the reasoning in the previous paragraph is correct, it's an interesting question why Jeffrey's decision theory yields the wrong result. The answer is that on Jeffrey's theory ordering the appletini gets graded down because it would be bad news about my future. If I order the drink, that's evidence that I have an addictive personality (as indicated in Equation (7.12)). Having an addictive personality is unfortunate because of its potential consequences for becoming a smoker. I expect a world in which I order another drink to be a worse world than a world in which I don't, and this is reflected in the EU_edt calculation. Jeffrey's theory assesses the act of ordering a third appletini not in terms of the consequences it will cause to come about, but instead in terms of the consequences it provides evidence will come about. For this reason Jeffrey's theory is described as an Evidential Decision Theory (or "EDT").
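A quick numerical sketch of the calculation just described. The conditional credences below are invented for illustration; the utilities come from the table above.

```python
# Utilities from the third-appletini table.
u = {("A", "S"): -99, ("A", "notS"): 1, ("notA", "S"): -100, ("notA", "notS"): 0}

# Illustrative conditional credences satisfying cr(S | A) > cr(S | ~A).
cr_S_given_A, cr_S_given_notA = 0.4, 0.1

EU_edt_A    = u[("A", "S")] * cr_S_given_A       + u[("A", "notS")] * (1 - cr_S_given_A)
EU_edt_notA = u[("notA", "S")] * cr_S_given_notA + u[("notA", "notS")] * (1 - cr_S_given_notA)

print(EU_edt_A, EU_edt_notA)   # -39.0 vs -10.0: Jeffrey's theory prefers skipping the drink
```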
Figure 7.1: Third drink causal fork. [Diagram: addictive personality (P) is a common cause, with arrows to smoker (S) and to third appletini (A).]
The trouble with Evidential Decision Theory is that an agent's performing an act may be evidence of a consequence that it's too late for her to cause (or prevent). Even though the act indicates the consequence, it seems irrational to factor the value of that consequence into a decision about whether to perform the act. As Skyrms (1980a, p. 129) puts it, my not having the third drink in order to avoid becoming a smoker would be "a futile attempt to manipulate the cause by suppressing its symptoms." In making decisions we should focus on what we can control—the causal consequences of our acts. Weirich writes,

Deliberations should attend to an act's causal influence on a state rather than an act's evidence for a state. A good decision aims to produce a good outcome rather than evidence of a good outcome. It aims for the good and not just signs of the good. Often efficacy and auspiciousness go hand in hand. When they come apart, an agent should perform an efficacious act rather than an auspicious act. (2012)
7.3.2 A causal approach
The causal structure of our third drink example is depicted in Figure 7.1. As we saw in Chapter 3, correlation often indicates causation—but not always. Propositions on the tines of a causal fork will be correlated even though neither causes the other. This accounts for A's being relevant to S on my credence function (Equation (7.13)) even though my ordering the third appletini has no causal influence on whether I'll become a smoker. The causally spurious correlation in my credences affects Jeffrey's expected utility calculation because that calculation works with credences in
states conditional on acts (cr(Si | A)). Jeffrey replaced Savage's cr(Si) with this conditional expression to track dependencies between states and acts. The Causal Decision Theorist responds that while credal correlation is a kind of dependence, it's not the kind of dependence that decisions should track. Preferences should be based on causal dependencies. So the Causal Decision Theorist's valuation function is:
EU_cdt(A) = u(A & S1) · cr(A □→ S1) + u(A & S2) · cr(A □→ S2) + ... + u(A & Sn) · cr(A □→ Sn)    (7.15)
Here A □→ S represents the subjunctive conditional "If the agent were to perform act A, state S would occur."16 Causal Decision Theory uses such conditionals to track causal relations in the world.17 Of course, an agent may be uncertain what consequences a given act A would cause. So EU_cdt looks across the partition of states S1, ..., Sn and invokes the agent's credence that A would cause any given Si. For many decision problems, Causal Decision Theory yields the same results as Evidential Decision Theory. In Jeffrey's wine example, it's plausible that
cr(chicken | white) = cr(white □→ chicken) = 0.75    (7.16)

The guest's credence that chicken is served on the condition that she brings white wine is equal to her credence that if she were to bring white, chicken would be served. So one may be substituted for the other in expected utility calculations, and CDT's evaluations turn out the same as Jeffrey's. But when conditional credences fail to track causal relations (as in cases with causal forks), the two theories may yield different results. This is in part due to their differing notions of independence. EDT treats act A and state S as independent when they are probabilistically independent relative to the agent's credence function. CDT focuses on whether the agent takes A and S to be causally independent, which occurs just when
cr(A □→ S) = cr(S)    (7.17)
When an agent thinks A has no causal influence on S, her credence that S will occur if she performs A is just her credence that S will occur. In the third drink example my ordering another appletini may be evidence that I'll become a smoker, but I know it has no causal bearing on whether I take up smoking. So from a Causal Decision Theory point of view, the acts and states in that problem are independent. When acts and states are independent, dominance reasoning is appropriate, so I should prefer the dominant act and order the third appletini.
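For contrast with the evidential calculation above, here is the same decision evaluated causally. Since cr(A □→ S) = cr(~A □→ S) = cr(S) on my credences (Equation (7.17)), the dominant act wins whatever value cr(S) takes; the unconditional credence below is again just an illustrative stand-in.

```python
u = {("A", "S"): -99, ("A", "notS"): 1, ("notA", "S"): -100, ("notA", "notS"): 0}

cr_S = 0.2  # illustrative; cr(A []-> S) = cr(~A []-> S) = cr(S) by causal independence

EU_cdt_A    = u[("A", "S")] * cr_S + u[("A", "notS")] * (1 - cr_S)
EU_cdt_notA = u[("notA", "S")] * cr_S + u[("notA", "notS")] * (1 - cr_S)

# EU_cdt(A) - EU_cdt(~A) = (-99 + 100)*cr_S + (1 - 0)*(1 - cr_S) = 1 for every cr_S,
# so Causal Decision Theory always prefers ordering the appletini.
print(EU_cdt_A - EU_cdt_notA)   # 1.0
```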
Now we can return to a version of the Newcomb Problem that distinguishes Causal from Evidential Decision Theory. Suppose that the "being" in Nozick's story makes its prediction by analyzing your brain state prior to your making the decision and applying a complex neuro-psychological theory. The being's track record makes you 99% confident that its predictions will be correct. And to simplify matters, let's suppose you assign exactly 1 util to each dollar, no matter how many dollars you already have. Then your utility and credence matrices for the problem are:

  Utilities       P1           P2
  T1           1,000,000         0
  T2           1,001,000       1,000

  Credences       P1      P2
  T1             0.99    0.01
  T2             0.01    0.99
where T1 and T2 represent the acts of taking one box or two boxes (respectively), and P1 and P2 represent the states of what the being predicted. Jeffrey calculates expected values for the acts as follows:
EU_edt(T1) = u(T1 & P1) · cr(P1 | T1) + u(T1 & P2) · cr(P2 | T1) = 990,000
EU_edt(T2) = u(T2 & P1) · cr(P1 | T2) + u(T2 & P2) · cr(P2 | T2) = 11,000    (7.18)
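A quick check of the arithmetic in Equation (7.18), using the utility and credence matrices above (the dictionary layout is just a convenient encoding of those tables):

```python
u = {("T1", "P1"): 1_000_000, ("T1", "P2"): 0,
     ("T2", "P1"): 1_001_000, ("T2", "P2"): 1_000}
cr = {("P1", "T1"): 0.99, ("P2", "T1"): 0.01,   # cr(prediction | act)
      ("P1", "T2"): 0.01, ("P2", "T2"): 0.99}

EU_edt = {act: sum(u[(act, p)] * cr[(p, act)] for p in ("P1", "P2")) for act in ("T1", "T2")}
print(EU_edt)   # {'T1': 990000.0, 'T2': 11000.0} -- EDT recommends one-boxing
```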
So Evidential Decision Theory recommends one-boxing. Yet we can see from Figure 7.2 that this version of the Newcomb Problem contains a causal fork; the being's prediction is based on your brain state, which also has a causal influence on the number of boxes you take. This should make us suspicious of EDT's recommendations. The agent's act and the being's prediction are probabilistically correlated in the agent's credences, as the credence table reveals. But that's not because the number of boxes taken has any causal influence on the prediction. Causal Decision Theory calculates expected utilities in the example like this:
EU_cdt(T1) = u(T1 & P1) · cr(T1 □→ P1) + u(T1 & P2) · cr(T1 □→ P2)
           = 1,000,000 · cr(T1 □→ P1) + 0 · cr(T1 □→ P2)
EU_cdt(T2) = u(T2 & P1) · cr(T2 □→ P1) + u(T2 & P2) · cr(T2 □→ P2)
           = 1,001,000 · cr(T2 □→ P1) + 1,000 · cr(T2 □→ P2)    (7.19)
Figure 7.2: Newcomb Problem causal fork. [Diagram: brain state is a common cause, with arrows to prediction and to boxes taken.]
It doesn’t matter what particular values the credences in these expressions take, because the act has no causal influence on the prediction. That is,
cr(T1 □→ P1) = cr(P1) = cr(T2 □→ P1)    (7.20)

and

cr(T1 □→ P2) = cr(P2) = cr(T2 □→ P2)    (7.21)
With these causal independencies in mind, you can tell by inspection of Equation (7.19) that EU_cdt(T2) will be greater than EU_cdt(T1), and Causal Decision Theory endorses two-boxing.
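The point can be checked directly: whatever value cr(P1) takes, the causal expected utilities in Equation (7.19) differ by exactly 1,000 utils in favor of two-boxing. A minimal sketch (the three credence values tried are arbitrary):

```python
u = {("T1", "P1"): 1_000_000, ("T1", "P2"): 0,
     ("T2", "P1"): 1_001_000, ("T2", "P2"): 1_000}

for cr_P1 in (0.01, 0.5, 0.99):   # cr(T1 []-> P1) = cr(T2 []-> P1) = cr(P1)
    EU_cdt_T1 = u[("T1", "P1")] * cr_P1 + u[("T1", "P2")] * (1 - cr_P1)
    EU_cdt_T2 = u[("T2", "P1")] * cr_P1 + u[("T2", "P2")] * (1 - cr_P1)
    print(cr_P1, EU_cdt_T2 - EU_cdt_T1)   # always 1000.0
```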
7.3.3 Responses and extensions
So is that it for Evidential Decision Theory? Philosophical debates rarely end cleanly; Evidential Decision Theorists have made a number of responses to the Newcomb Problem. First, one might respond that one-boxing is the rationally mandated act. Representing the two-boxers, David Lewis once wrote,

The one-boxers sometimes taunt us: if you're so smart, why ain'cha rich? They have their millions and we have our thousands, and they think this goes to show the error of our ways. They think we are not rich because we have irrationally chosen not to have our millions. (1981b, p. 377)

Lewis' worry is this: Suppose a one-boxer and a two-boxer each go through the Newcomb scenario many times. As a successful predictor, the being in the story will almost always predict that the one-boxer will one-box, and so place the $1,000,000 in the second box for him. Meanwhile, the two-boxer
will almost always find the second box empty. The one-boxer will rack up millions of dollars, while the two-boxer will gain only thousands. Each agent has the goal of making as much money as possible, so one-boxing (and, by extension, EDT) seems to provide a better rational strategy for reaching one's goals than two-boxing (and CDT). The Causal Decision Theorist's response (going at least as far back as (Gibbard and Harper 1978/1981)) is that some unfortunate situations reward agents monetarily for behaving irrationally, and the Newcomb Problem is one of them. The jury is still out on whether this response is convincing. In November 2009 the PhilPapers Survey polled over three thousand philosophers, and found that 31.4% of them accepted or leaned towards two-boxing in the Newcomb Problem, while 21.3% accepted or leaned towards one-boxing. (The remaining respondents were undecided or offered a different answer.) So it's unclear that EDT's embrace of one-boxing is a fatal defect. Meanwhile, there are other cases in which EDT seems to give the intuitively rational result while CDT does not (Egan 2007). Jeffrey, on the other hand, was convinced that two-boxing is rationally required in the Newcomb Problem. So he defended Evidential Decision Theory in various ways. In the second edition of The Logic of Decision (1983), Jeffrey added a ratifiability condition to his EDT. Ratifiability holds that an act is rationally permissible only if the agent assigns it the highest expected utility conditional on the supposition that she chooses to perform it. Ratifiability avoids regret—if choosing to perform an act would make you wish you'd done something else, then you shouldn't choose it. In the Newcomb Problem, supposing that you'll choose to one-box makes you confident that the being predicted one-boxing, and so makes you confident that the $1,000,000 is in the second box. So supposing that you'll choose to one-box makes two-boxing seem the better choice. One-boxing is unratifiable, and so can be rationally rejected. We won't cover the technical details of ratifiability here, in part because Jeffrey ultimately abandoned that response. Jeffrey eventually (1993, 2004) came to believe that the Newcomb Problem isn't really a decision problem. Suppose that in the Newcomb Problem the agent assigns the credences we described earlier because she takes the causal structure of her situation to be something like Figure 7.2. In that case, she will see her physical brain state as having such a strong influence on how many boxes she takes that whether she one-boxes or two-boxes will no longer seem like a free choice. Jeffrey held that in order to make a genuine decision, an agent must see her choice as the cause of the act (and ultimately the outcome) produced. Read in this light, the Newcomb case seemed to involve too much causal influence
on the agent's act from factors beyond her choice. In the final sentences of his last work, Jeffrey wrote, "I now conclude that in Newcomb problems, 'One box or two?' is not a question about how to choose, but about what you are already set to do, willy-nilly. Newcomb problems are not decision problems." (2004, p. 113)
7.4 Exercises
Unless otherwise noted, you should assume when completing these exercises that credence distributions under discussion satisfy the probability axioms and Ratio Formula. You may also assume that whenever a conditional probability expression occurs, the needed proposition has nonzero unconditional credence so that conditional probabilities are well-defined.

Problem 7.1. When you play craps in a casino there are a number of different bets you can make at any time. Some of these are "proposition bets" on the outcome of the next roll of two fair dice. Below is a list of some proposition bets, and the odds at which casinos offer them.

  Name of Bet    Wins when                  Odds paid
  Big red        Dice total 7               4 : 1
  Any craps      Dice total 2, 3, or 12     7 : 1
  Snake eyes     Dice total 2               30 : 1

Suppose you place a $1 bet on each proposition at the odds listed above. Rank the three bets from highest expected dollar value to lowest.

Problem 7.2. The St. Petersburg game is played as follows: A fair coin is flipped repeatedly until it comes up heads. If the coin comes up heads on the first toss, the player wins $2. Heads on the second toss pays $4, heads on the third toss pays $8, etc.∗
(a) If you assign fair prices equal to expected monetary payouts (and credences equal to objective chances), how much should you be willing to pay to play the St. Petersburg game?
(b) If you were confronted with this game in real life, how much would you be willing to pay to play it? Explain your answer.

Problem 7.3. (a) Suppose an agent is indifferent between two gambles with the following utility outcomes:
∗ This game was invented by Nicolas Bernoulli in the 18th century.
               P     ~P
  Gamble 1     x      y
  Gamble 2     y      x

where P is a proposition about the state of the world, and x and y are utility values with x ≠ y. Assuming this agent maximizes EU_sav, what can you determine about the agent's cr(P)?

(b) Suppose the same agent is also indifferent between these two gambles:

               P     ~P
  Gamble 3     d      w
  Gamble 4     m      m

where cr(P) = cr(~P), d = 100, and w = -100. What can you determine about m?

(c) Finally, suppose the agent is indifferent between these two gambles:

               Q     ~Q
  Gamble 5     r      s
  Gamble 6     t      t

where r = 100, s = 20, and t = 80. What can you determine about cr(Q)?
Problem 7.4. You are confronted with a decision problem involving two possible states of the world (S and ~S) and three available acts (A, B, and C).
(a) Suppose that of the three S-outcomes, B & S does not have the highest utility for you. Also, of the three ~S-outcomes, B & ~S does not have the highest utility. Applying Savage's decision theory, does it follow that you should not choose act B? Defend your answer.
(b) Suppose that of the S-outcomes, B & S has the lowest utility for you. Also, of the three ~S-outcomes, B & ~S has the lowest utility. Still applying Savage's decision theory, does it follow that you should not choose act B? Defend your answer.
(c) Suppose now that you apply Jeffrey's decision theory to the situation in part (b). Do the same conclusions necessarily follow about whether you should choose act B? Explain.†
† This problem was inspired by a problem of Brian Weatherson's.
Problem 7.5. Suppose an agent faces a decision problem with two acts A and B and finitely many states.
(a) Prove that if the agent sets her preferences using EU_sav, those preferences will satisfy the Dominance Principle.
(b) If the agent switches from EU_sav to EU_edt, exactly where will your proof from part (a) break down?

Problem 7.6. Referring to the payoff tables for Allais' Paradox in Section 7.2.4, show that no assignment of values to u($0), u($1M), and u($5M) that makes EU_edt(A) > EU_edt(B) will also make EU_edt(D) > EU_edt(C). (You may assume that the agent assigns equal credence to each numbered ticket's being selected, and this holds regardless of which gamble is made.)
Problem 7.7. Having gotten a little aggressive on a routine single to center field, you're now halfway between first base and second base. You must decide whether to proceed to second base or run back to first. The throw from the center fielder is in midair, and given the angle you can't tell whether it's headed to first or second base. But you do know that this center fielder has a great track-record at predicting where runners will go—your credence in his throwing to second conditional on your going there is 90%, while your credence in his throwing to first conditional on your going to first is 80%. If you and the throw go to the same base, you will certainly be out, but if you and the throw go to different bases you'll certainly be safe. Being out has the same utility for you no matter where you're out. Being safe at first is better than being out, and being safe at second is better than being safe at first by the same amount that being safe at first is better than being out.
(a) Of the two acts available (running to first or running to second), which should you prefer according to Evidential Decision Theory (that is, according to Jeffrey's decision theory)?
(b) Does the problem provide enough information to determine which act is preferred by Causal Decision Theory? If so, explain which act is preferred. If not, explain what further information would be required and how it could be used to determine a preference.

Problem 7.8. In the Newcomb Problem, do you think it's rational to take just one box or take both boxes? Explain your thinking.
7.5 Further reading
Introductions and Overviews
Martin Peterson (2009). An Introduction to Decision Theory. Cambridge Introductions to Philosophy. Cambridge: Cambridge University Press.
A book-length general introduction to decision theory, including chapters on game theory and social choice theory.

Classic Texts

Leonard J. Savage (1954). The Foundations of Statistics. New York: Wiley.
Savage's classic book laid the foundations for modern decision theory and much of contemporary Bayesian statistics.

Richard C. Jeffrey (1983). The Logic of Decision. 2nd ed. Chicago: University of Chicago Press.
In the first edition, Jeffrey's Chapter 1 introduced a decision theory capable of handling dependent acts and states. In the second edition, Jeffrey added an extra section to this chapter explaining his "ratifiability" response to the Newcomb Problem.

Extended Discussion

Lara Buchak (2013). Risk and Rationality. Oxford: Oxford University Press.
Presents a generalization of the decision theories discussed in this chapter that is consistent with a variety of real-life agents' responses to risk. For instance, Buchak's theory accommodates genuine risk-aversion, and allows agents to simultaneously prefer Gamble A to Gamble B and Gamble D to Gamble C in Allais' Paradox.

James M. Joyce (1999). The Foundations of Causal Decision Theory. Cambridge: Cambridge University Press.
A systematic explanation and presentation of causal decision theory, unifying that approach under a general framework with evidential decision theory and proving a representation theorem that covers both.
Notes 1 The law of large numbers comes in many different forms, each of which has slightly different conditions and a slightly different conclusion. Most versions require the repeated trials to be independent and identically distributed (IID), meaning that each trial has the
same probability of yielding a given result and the result on a given trial is independent of all previous results. (In other words, you think our batter is consistent across games and unaffected by previous performance.) Most versions also assume Countable Additivity for their proof. Finally, since we are dealing with results involving the infinite, we should remember that in this context credence 1 doesn’t necessa rily mean certain ty. An agent who satisfies the probability axioms, the Ratio Formula, and Countable Additivity will assign creden ce 1 to the average’s approaching the expectation in the limit, but that doesn’t mean she rules out all possibilities in which those values don’t conve rge. (For Countable Additivity and cases of credence-1 that don’t mean certainty, see Section 5.4. For more details and proofs concerning laws of large numbers, see (Feller 1968, Ch. X).) 2 See (Bernoulli 1738/1954) for both his discussion and a reference to Cramer. 3 The first money pump was presented by (Davidson, McKinsey, and Suppes 1955, p. 146), who attributed the inspiration for their example to Norman Dalkey. I don’t know who introduced the “money pump” terminology. By the way, if you’ve ever read Dr. Seuss’ story “The Sneetches”, the Fix-it-Up Chappie (Sylvester McMonkey McBean) gets a pretty good money pump going before he packs up and leaves. 4
Though Quinn (1990) presents a case ("the puzzle of the self-torturer") in which it may be rational for an agent to have intransitive preferences. 5 While Savage thought of acts as functions from states to outcomes, it will be simpler for us to treat acts, states, and outcomes as propositions—the proposition that the agent will perform the act, the proposition that the world is in a particular state, and the proposition that a particular outcome occurs. 6 For simplicity's sake we set aside cases in which some Si make particular acts impossible. Thus A & Si will never be a contradiction. 7 The Dominance Principle I've presented is sometimes known as the Strong Dominance Principle. The Weak Dominance Principle says that if A produces at least as good an outcome as B in each possible state of the world, plus a better outcome in at least one possible state of the world, then A is preferred to B. The names of the principles can be a bit confusing—it's not that Strong Dominance is a stronger principle; it's that it involves a stronger kind of dominance. In fact, the Weak Dominance Principle is logically stronger than the Strong Dominance Principle, in the sense that the Weak Dominance Principle entails the Strong Dominance Principle. (Thanks to David Makinson for suggesting this clarification.) Despite being a logically stronger principle, Weak Dominance is also a consequence of Savage's expected utility theory, and has the same kinds of problems as Strong Dominance. 8 In a similar display of poor reasoning, Shakespeare's Henry V (Act 4, Scene 3) responds to Westmoreland's wish for more troops on their side of the battle—"O that we now had here but one ten thousand of those men in England, that do no work today"—with the following: If we are marked to die, we are enough to do our country loss; and if to live, the fewer men, the greater share of honor. God's will, I pray thee wish not one man more.
9 For a brief discussion and references, see (Jeffrey 1983, §1.8).
10 Instead of referring to "acts", "states", "outcomes", and "utilities", Jeffrey speaks of "acts", "conditions", "consequences", and "desirabilities" (respectively). As in my presentation of Savage's theory, I have made some changes to Jeffrey's approach for the sake of simplicity and consistency with the rest of the discussion.
11
The decision-theoretic structure here bears striking similarities to Simpson’s Paradox. We saw in Section 3.2.3 that while David Justice had a better batting average than Derek Jeter in each of the years 1995 and 1996, over the entire two-year span Jeter’s average was better. This was because Jeter had a much high er proportion of his bats in 1996, which was a better year for both hitters. So selecting a Jeter at-bat is much more likely to land you in a good year for hitting. Similarly, the deterrence utility table shows that disarming yields better outcom es than arming on each possible state of the world. Yet arming is much more likely than disarming to land you in the peace state (the right-hand column of the table), and so get you a desirable outcome. 12 While Savage coined the phrase “Sure-Thing Principle”, it’s actually a bit difficult to tell from his text exactly what he meant by it. I’ve presented a contemporary cleaning-up of Savage’s discussion, inspired by the Sure-Thing formulation in (Eells 1982, p. 10). It’s also worth noting that the Sure-Thing Principle is intimately related to decision-theoretic axioms known as Separability and Independence, but we won’t delve into those conditions here. 13 (Heukelom 2015) provides an accessible history of the Allais Paradox, and of Allais’ disputes with Savage over it. 14 By the way, in case you’re looking for a clever way out Nozick specifies in a footnote to the problem that if the being predicts you will decide what to do via some random process (like flipping a coin), he does not put the $1,000,000 in the second box. 15 Eells (1982, p. 91) gives a parallel example from theolog y: “Calvinism is sometimes thought to involve the thesis that election for salvation and a virtuous life are effects of a common cause: a certain kind of soul. Thus, while leading a virtuous life does not cause one to be elected, still the probability of salvation is higher conditional on a virtuous life than conditional on an unvirtuous life. Should one lead a virtuous life?” 16 It’s important for Causal Decision Theory that A S conditionals be “causal” counterfactuals rather than “backtracking” counterfactuals; we hold facts about the past fixed when assessing A ’s influence on S . (See (Lewis 1981a) for the distinction and some explanation.) 17 There are actually many ways of executing a causal decision theory; the approach presented here is that of (Gibbard and Harper 1978/1981), drawing from (Stalnaker 1972/1981). Lewis (1981a) thought Causal Decision Theory should instead return to Savage’s unconditional credences and independence assumptions, but with the specification that acts and states be causally independent. For a comparison of these approaches along with various others, plus a general formulation of Causal Decision Theory that attempts to cover them all, see (Joyce 1999).
Part IV
Arguments for Bayesianism
To my mind, the best argument for Bayesian Epistemology is the uses to which it can be put. In the previous part of this book we saw how the Bayesian approach interacts with confirmation and decision theory, two central topics in the study of theoretical and practical rationality (respectively). The five core normative Bayesian rules grounded formal representations of how an agent should assess what her evidence supports and how she should make decisions in the face of uncertainty. These are just two of the many applications of Bayesian Epistemology, which have established its significance in the minds of contemporary philosophers.

Nevertheless, Bayesian history also offers more direct arguments for the normative Bayesian rules. The idea is to prove from premises acceptable on independent grounds that, say, a rational agent's unconditional credences at a given time satisfy Kolmogorov's probability axioms. These days the three most prominent kinds of arguments for Bayesianism are those based on representation theorems, Dutch Books, and accuracy measurements. This part of the book will devote one chapter to each type of argument. Some of these argument-types can be used to establish more than just the probability axioms as requirements of rationality; the Ratio Formula, Conditionalization, Countable Additivity, and other norms we have discussed may be argued for. Each argument-type has particular norms it can and can't be used to support; I'll mention these applications as we go along. But they all can be used to argue for the probability axioms.

As I mentioned in Chapter 2, probabilism is the thesis that a rational agent's unconditional credence distribution at a given time satisfies Kolmogorov's three axioms. (I sometimes call a distribution that satisfies the axioms a "probabilistic" distribution; other authors call such a distribution coherent.) Among the probability axioms, by far the most difficult to establish is Finite Additivity. We'll see why as we dig into the arguments' particulars, but it's worth a quick reminder at this point what Finite Additivity does. In Chapter 2 we met three characters: Mr. Prob, Mr. Weak, and Mr. Bold. For a given proposition P, the three of them assign the following credences:

               cr(F)    cr(P)     cr(~P)     cr(T)
  Mr. Prob:      0       1/6       5/6          1
  Mr. Weak:      0       1/36      25/36        1
  Mr. Bold:      0       1/√6      √5/√6        1
All three of these characters satisfy the Non-Negativity and Normality axioms. They also satisfy such intuitive credal norms as Entailment: the rule that a proposition must receive at least as much credence as any proposition
that entails it. Yet of the three, only Mr. Prob satisfies Finite Additivity. This demonstrates that Finite Additivity is logically independent of these other norms; they can be satisfied even if Finite Additivity is not. Mr. Weak's credences are obtained by squaring each of Mr. Prob's. This makes Mr. Weak's level of confidence in logically contingent propositions (P, ~P) lower than Mr. Prob's. Mr. Weak is comparatively conservative, unwilling to be very confident in contingent claims. So while Mr. Weak is certain of P ∨ ~P, his individual credences in P and ~P sum to less than 1. Mr. Bold's distribution, on the other hand, is obtained by square-rooting Mr. Prob's credences. Mr. Bold is highly confident of contingent propositions, to the point that his credences in P and ~P sum to more than 1.
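The arithmetic behind this contrast is easy to verify; the value 1/6 for Mr. Prob's cr(P) is just the illustrative figure from the table above.

```python
from math import sqrt

cr_P_prob = 1 / 6
characters = {
    "Mr. Prob": (cr_P_prob,       1 - cr_P_prob),
    "Mr. Weak": (cr_P_prob ** 2,  (1 - cr_P_prob) ** 2),   # squares of Mr. Prob's credences
    "Mr. Bold": (sqrt(cr_P_prob), sqrt(1 - cr_P_prob)),    # square roots of Mr. Prob's credences
}
for name, (p, not_p) in characters.items():
    # Finite Additivity requires cr(P) + cr(~P) = cr(P v ~P) = 1
    print(f"{name}: cr(P) + cr(~P) = {p + not_p:.3f}")
# Mr. Prob: 1.000, Mr. Weak: 0.722, Mr. Bold: 1.321
```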
When we argue for Finite Additivity, we argue that Mr. Weak and Mr. Bold display a rational flaw not present in Mr. Prob. It's worth wondering in exactly what respect Mr. Weak and Mr. Bold make a rational mistake. This is especially pressing because empirical findings suggest that real humans consistently behave like Mr. Bold: they assign credences to mutually exclusive disjuncts that sum to more than their credence in the disjunction. Tversky and Koehler (1994) summarize a great deal of evidence on this front. In one particularly striking finding, subjects were asked to write down the last digit of their phone number and then estimate the percentage of American married couples with exactly that many children. The subjects with numbers ending in 0, 1, 2, and 3 each assigned their digit a value greater than 25%. If these values reflect the subjects' credences, then we've exceeded 100% before we even mention families of more than 3 kids!

Each of the three argument-types we consider will explain what's wrong with violating Finite Additivity in a slightly different way. And for each argument, I will ultimately have the same complaint. In order to support Finite Additivity—a mathematical linearity constraint on the combination of credences—each of the arguments assumes some other linearity constraint. It's then unclear how the normativity of this other constraint is any better established than that of Finite Additivity. I call this the Linearity In, Linearity Out problem, and it threatens to make each of the arguments for Finite Additivity viciously circular.

If the traditional arguments for probabilism are revealed to be question-begging, probabilism's applications become all the more significant. Near the end of Chapter 10 I'll ask whether Finite Additivity is necessary for those. We'll briefly examine whether Bayesian Epistemology's successes in confirmation and decision theory could still be secured if we weakened our commitment to probabilism.
Further Reading

Alan Hájek (2009a). Arguments For—Or Against—Probabilism? In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 229–251.
Excellent introduction to, and assessment of, all the arguments for probabilism discussed in this part of the book.
Chapter 8
Representation Theorems

Decision theory aligns a rational agent's credence and utility distributions with her preferences among available acts. It does so in two steps: first, a valuation function combines the agent's credences and utilities to assign each act a numerical score; second, if the agent is rational she will prefer acts with higher scores. Savage's decision theory evaluates an act A by calculating its expected utility as follows:
EU(A) = u(A & S1) · cr(S1) + u(A & S2) · cr(S2) + ... + u(A & Sn) · cr(Sn)    (8.1)
where u represents the agent's utilities, cr represents her credences, and states S1 through Sn form a finite partition. A rational agent will have A ≻ B just in case
u(A & S1) · cr(S1) + ... + u(A & Sn) · cr(Sn) > u(B & S1) · cr(S1) + ... + u(B & Sn) · cr(Sn)    (8.2)
where "A ≻ B" indicates that the agent prefers act A to act B. Equation (8.2) relates three types of attitudes that impact an agent's practical life: her preferences among acts, her credences in states, and her utilities over outcomes.1 It's a bit like an equation with three variables; if we know two of them we can solve for the third. For instance, if I specify a rational agent's full utility and credence distributions, you can determine her preference between any two acts using Equation (8.2). Going in a different direction, de Finetti (1931/1989) showed that if you know an agent's utilities and her preferences among certain kinds of bet-making acts, you can determine her credences. Meanwhile von Neumann and Morgenstern (1947)
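A minimal sketch of the "straightforward" direction, in which full credence and utility distributions determine preferences via Equation (8.2); the function name and the particular numbers below are invented for illustration.

```python
def eu_savage(act, states, u, cr):
    """Savage-style expected utility: sum of u(act & state) * cr(state)."""
    return sum(u[(act, s)] * cr[s] for s in states)

states = ("S1", "S2")
cr = {"S1": 0.7, "S2": 0.3}                      # illustrative credences
u = {("A", "S1"): 10, ("A", "S2"): 0,            # illustrative utilities
     ("B", "S1"): 4,  ("B", "S2"): 12}

prefers_A_to_B = eu_savage("A", states, u, cr) > eu_savage("B", states, u, cr)
print(prefers_A_to_B)   # True: 7.0 > 6.4, so an agent with these attitudes prefers A
```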
Figure 8.1: Results deriving some decision-theoretic attitudes from others

  Author                                   Preferences    Utilities    Credences
  (straightforward)                        determine      given        given
  de Finetti (1931)                        given          given        determine
  von Neumann and Morgenstern (1947)       given          determine    given
  Ramsey (1931)                            given          determine    determine
showed that given an agent’s preferences over risky acts with specified credal profiles (called “lotteries”), one can determine her utilities. (See Figure 8.1.) Yet at some point during the 1920s, Frank Ramsey discovered how to do something remarkable: given only one of the variables in Equation (8.2), he figured out how to determine the other two. (The relevant paper, (Ramsey 1931), was published after Ramsey’s death in 1930 at age 27.) Given an agent’s full preference ranking over acts, Ramsey showed how to determine both that agent’s credences and her utilit ies. Ramsey’s method laid the groundwork for representation theorems later proven by Savage and others. And these represe ntation theorems ground an important argument for probabilism. This chapter begins with an overview of Ramsey’s method for determining credences and utilities from preferences. I will then present Savage’s representation theorem and discuss how it is taken to support probabilism. Finally, I will present contemporary criticisms of the representation theorem argument for probabilism. Especially eager readers may skip over the Ramsey section; strictly speaking one needn’t know how Ramsey pulled the trick to understand representation theorems and their relation to probabilism. Yet I will not be presenting any proof of the representation theorem, so if you want to know how it’s possible to get both credences and utilities from preferences it may be worth studying Ramsey ’s approach. Ramsey’s process also illustrates why certain structural assumptions are necessary for the theorems that came later. One side note before we begin: Readers familiar with deci sion theor y (perhaps from Chapter 7) will know that many contemporary decision theorists have found fault with Savage’s expected utility formula as a valuation function. But since we will mainly be discussing Savage’s representation theorem, I will use Savage-style expected utilities (as defined in Equation
(8.1)) throughout this chapter. One can find similar representation theorems for Jeffrey-style Evidential Decision Theory in (Jeffrey 1965) and for Causal Decision Theory in (Joyce 1999).
8.1 Ramsey's four-step process
Here's how Ramsey's process works. We imagine we are given an agent's complete preference ranking over acts, some of which are acts of accepting various "gambles" (which provide one outcome if a proposition is true, another outcome if that proposition is false). We assume that the agent assigns finite numerical utilities, credences satisfying the probability axioms, and preferences in line with her (Savage-style) expected utilities. Yet we are given no further information about which credence and utility values she assigns to particular propositions. That's what we want to determine. Ramsey's process works by sorting through the agent's preference rankings until we find preferences that fit certain patterns. Those patterns allow us to determine particular features of the agent's credences and utilities, which we then leverage to determine further features, until we can set a utility and credence value for each proposition in the agent's language L.

Step One: Find ethically neutral propositions

Ramsey defines a proposition P as ethically neutral for an agent if the agent is indifferent between any two gambles whose outcomes differ only in replacing P with ~P. The intuitive idea is that an agent just doesn't care how an ethically neutral proposition comes out, so she values any outcome in which P occurs just as much as she values an otherwise-identical outcome in which ~P occurs. (Despite the terminology, Ramsey is clear that a proposition's "ethical neutrality" has little to do with ethics at all.) For instance, a particular agent might care not one whit about hockey teams and how they fare; this lack of caring will show up in her preferences among various acts (including gambles). Suppose this agent is confronted with two acts: one will make the Blackhawks win the Stanley Cup and also get her some ice cream, while another will make the Blackhawks lose but still get her the same ice cream. If propositions about hockey results are ethically neutral for the agent, she will be indifferent between performing those two acts. In Step One of Ramsey's process, we scour the agent's preferences to find a number of propositions that are ethically neutral for her. We can tell an ethically neutral proposition P because every time P appears in
the outcomes of a gamble, she will be indifferent between that gamble and another gamble in which every P in an outcome has been replaced by a ~P.
Step Two: Find ethically neutral P, ~P with equal credence

We now examine the agent's preferences until we find three propositions X, Y, and P such that P is ethically neutral for the agent and the agent is indifferent between these two gambles:
               P        ~P
  Gamble 1   X & P    Y & ~P
  Gamble 2   Y & P    X & ~P
In this decision table the possible states of the world are listed across the top row, while the acts available to the agent are listed down the first column. Since we don't know the agent's utility values, we can't put them in the cells. So I've listed there the outcome that will result from each act-state pair. For instance, Gamble 1 yields outcome X & P if P is true, Y & ~P if P is false. If we've established in Step One that hockey results are ethically
neutral for our agent, then Gamble 1 might make it the case that the agent receives chocolate ice cream (X) if the Blackhawks win and vanilla (Y) if they lose, while Gamble 2 gives her vanilla if they win and chocolate if they lose. If the agent is indifferent between the acts of making Gamble 1 and Gamble 2, and if the agent's preferences reflect her expected utilities, then we have
u(X & P) · cr(P) + u(Y & ~P) · cr(~P) = EU(Gamble 1) = EU(Gamble 2) = u(Y & P) · cr(P) + u(X & ~P) · cr(~P)    (8.3)
But we’ve already ascertained that P is ethically neutral for the agent—she doesn’t care whether P is true or false. So
u(X & P) = u(X & ~P) = u(X)    (8.4)
Since the agent gets no utility advantage from P's being either true or false, her utility for X & P is just her utility for X, which is also her utility for X & ~P.2 A similar equation holds for Y. Substituting these results into Equation (8.3), we obtain
u(X) · cr(P) + u(Y) · cr(~P) = u(Y) · cr(P) + u(X) · cr(~P)    (8.5)
One way to make this equation true is to have u(X) = u(Y). How can we determine whether those utilities are equal strictly from the agent's preferences? We might offer her a gamble that produces X no matter what—a gamble sometimes referred to as a constant act. If the agent is indifferent between the constant act that produces X and the constant act that produces Y, she must assign X and Y the same utilities. But now suppose we offer the agent a choice between those constant acts and she turns out to have a preference between X and Y. In that case, the only way to make Equation (8.5) true is to have cr(P) = cr(~P). So if the agent is indifferent between Gambles 1 and 2, considers P ethically neutral, and assigns distinct utilities to X and Y, she must be equally confident in P and ~P.

Intuitively, here's how this step works: If you prefer one outcome to another then you'll lean toward gambles that make you more confident you'll receive the preferred result. The only way you'll be indifferent between a gamble that gives you the preferred outcome on P and a gamble that gives you that preferred outcome on ~P is if your confidence in P is equal to your confidence in ~P. To return to our earlier example: Suppose hockey
propositions are ethically neutral for our agent, she prefers chocolate ice cream to vanilla, and she is offered two gambles. The first gamble provides chocolate on a Blackhawks win and vanilla on a loss; the second provides vanilla on a Blackhawks win and chocolate on a loss. If she thinks the Blackhawks are likely to win she'll prefer the first gamble (because she wants that chocolate); if she thinks the Blackhawks are likely to lose she'll prefer the second. Being indifferent between the gambles makes sense only if she thinks a Blackhawks loss is just as likely as a win.

Step Three: Determine utilities

We've now found an ethically neutral proposition P that the agent takes to be as likely as not. Next we survey the agent's preferences until we find three propositions D, M, and W satisfying the following two conditions: First, u(D) > u(M) > u(W). (We can determine this by examining the agent's preferences among constant acts involving D, M, and W.) Second, the agent is indifferent between these two gambles:
               P        ~P
  Gamble 3   D & P    W & ~P
  Gamble 4   M & P    M & ~P
Because P is ethically neutral for the agent, u(D) = u(D & P), u(W) = u(W & ~P), and u(M) = u(M & P) = u(M & ~P). So the agent's indifference
between these gambles tell us that
p q ¨ crpP q ` upW q ¨ crp„P q “ upM q ¨ crpP q ` upM q ¨ crp„P q
uD
We’ve also selected a P such that cr P through by this value, leaving
cr
(8.6)
P . So we can ju st div ide
p q “ p„ q upD q ` upW q “ upM q ` upM q upD q ´ upM q “ upM q ´ upW q
(8.7) (8.8)
In other words, the gap between the agent’s utilities in D and M must equal the gap between her utilities in M and W . Intuitively, the agent prefers D to M , so if P is true then the agent would rather have Gamble 3 than Gamble 4. On the other hand, the agent prefers M to W , so if P then the agent would rather have Gamble 4. If the agent considered P much more likely than P , then a small preference for D over M could balance a much stronger preference for M over W . But we’ve chosen a P that the agent finds just as likely as P . So if the ag ent is indifferent between Gambles 3 and 4, the advantage conferred on Gamble
„
„
„
D instead M mustto 3advantage by its potential to provide precisely outthan the conferred on Gamble 4 by itsofpotential providebalance M rather W . The agent must value D over M by the exact same amount that she values M over W . This kind of gamble allows us to establish equal utility gaps between various propositions. In this case, the utility gap betwe en D and M must equal that between M and W . Suppose we stip ulate that u D 100 and uW 100. (As we’ll see in the next section, any finite v alues would’ve worked equally well here as long as u D u W .) Equation (8.8) then tells us that u M 0. By repeatedly applying this technique, we can find a series of benchmark propositions for the agen t’s utili ty scale. For example, we might find a proposition C such that the utility gap between C and D is equal to that between D and M . In that case w e know that u C 200. On the oth er hand, we might find a proposition I whose utility is just as far from M as it is from D; I has utility 50. Then we find proposition G with utility 75. As we find more and more of these propos itions with speci al utility values, we can use them to establish the utilities of other propositions (even propositions that don’t enter into convenient Gambles like 3 and 4 between which the agent is indifferent). If the agent prefers the const ant act that produces E to the constant act that produces G, her utility for E must be greater than 75. But if she prefe rs D ’s constant act to E ’s, u E must be
p q“´ p q“
p q“
p qą p q
p q“
p q
less than 100. By drawing finer and finer such distinctions, we can specify the agent's utility for an arbitrary proposition to as narrow an interval as we like. Repeated applications of this step will determine the agent's full utility distribution over a propositional language to any desired level of precision.

Step Four: Determine credences

We've now determined the agent's utilities for every proposition in her language; the final step is to determine her credences. To determine the agent's credence in an arbitrarily selected proposition Q, we find propositions R, S, and T such that the agent is indifferent between a constant act providing T and the following gamble:
               Q        ~Q
  Gamble 5   R & Q    S & ~Q
We then have
u(T) = u(R & Q) · cr(Q) + u(S & ~Q) · cr(~Q)
(8.9)
We assumed at the outset that the agent's credence distribution satisfies the probability axioms. So we can replace cr(~Q) with 1 - cr(Q), yielding

u(T) = u(R & Q) · cr(Q) + u(S & ~Q) · [1 - cr(Q)]
(8.10)
We then apply a bit of algebra to obtain
cr(Q) = [u(T) - u(S & ~Q)] / [u(R & Q) - u(S & ~Q)]
(8.11)
Since we already know the agent’s utilities for every proposition in her language, we can fill out all the values on the right-hand side and calculate her credence in Q. And since this method works for arbitra rily selected Q, we can apply it repeatedly to determine the agent’s entire credence distribution over her language.3
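Given the utilities recovered in Step Three, Equation (8.11) becomes a one-line computation. The function name and the utility values below are invented purely for illustration.

```python
def credence_from_utilities(u_T, u_RandQ, u_SandnotQ):
    """cr(Q) = (u(T) - u(S & ~Q)) / (u(R & Q) - u(S & ~Q)), per Equation (8.11)."""
    return (u_T - u_SandnotQ) / (u_RandQ - u_SandnotQ)

# Illustrative values: the agent is indifferent between a constant act worth 40
# and a gamble worth 100 if Q, 20 if ~Q.
print(credence_from_utilities(40, 100, 20))   # 0.25
```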
8.2 Savage's representation theorem
The previous section didn’t flesh out all the details of Ramsey’s process for determining credences and utilities from preferences. But Savage (1954) proved a representation theorem which guarantees that the necessary details can be provided. I’ll start by presenting the theorem, then explain some of its individual parts.
Representation Theorem: If an agent's preferences satisfy certain constraints, then there exists a unique probabilistic credence distribution and unique utility distribution (up to positive affine transformation) that yield those preferences when the agent maximizes expected utility.

We saw the basic idea in Ramsey's four-step process: The Representation Theorem says that starting from an agent's preferences, we'll always be able to construct a unique probabilistic credence distribution and unique utility distribution (up to positive affine transformation, which I'll explain shortly) that generate those preferences through expected utility maximization. In order for this to work, the preferences must satisfy certain constraints, often called the preference axioms. These constraints are called "axioms" because we take the agent's satisfaction of them as given in applying the representation theorem; calling them "axioms" does not mean they cannot be independently argued for. For example, Savage assumes the preferences under discussion will satisfy these two constraints (introduced in Chapter 7):

Preference Asymmetry: There do not exist acts A and B such that the agent both prefers A to B and prefers B to A.

Preference Transitivity: For any acts A, B, and C, if the agent prefers A to B and B to C, then the agent prefers A to C.

Section 7.2.1's money pump argument tries to show that these axioms will be satisfied by the preferences of any rational agent.4 Preference Asymmetry and Transitivity are substantive constraints on an agent's preferences—the kinds of things we might rationally fault her for failing to meet. Yet many of Savage's axioms merely require the agent's preference structure to display a certain level of richness; Suppes (1974) calls these "structure axioms". We saw one good example in Chapter 7:

Preference Completeness: For any acts A and B, exactly one of the following is true: the agent prefers A to B, the agent prefers B to A, or the agent is indifferent between the two.
Even more demanding assumptions5 popped up in Ramsey’s four-step process: At various stages, we had to assume that if we combed through enough of the agent’s preferences, we’d eventually find propositions falling into a very specific prefe rence pattern. In Step Four, for examp le, we assum ed that for any arbitrary proposition Q there would be propositions R , S , and
T such that the agent was indifferent between T ’s constant act and a gamble that generated R on Q and S otherwise. More generally, we assumed
a large supply of propositions the agent treated as ethically neutral, and among these some propositions the agent took to be as likely as not. It’s doubtful that any agent has ever had preferences rich enough to satisfy all of these assumptions. And we wouldn’t want to rationally fault agents for failing to do so. Yet decision theorists tend to view the structure axioms as harmless assumptions added in to make the math come out nicely. Since Savage’s work a number of alternative representation theorems have been proven, many of which relax his srcinal structural assumptions. 6 If an agent’s preferences satisfy the preference axioms, Savage’s Representation Theorem guarantees the existence of a unique probabilistic credence distribution for the agent and a unique utility distribution “up to positive affine transformation”. Why can’t we determi ne a unique utility distribution for the agent full stop? Recall that in Step Three of Ramsey’s process—the step in which we determined the agent’s utility distribution— we stipulated that proposition D had a utility of 100 and proposition W a utility of 100. I chose those va lues because they were nice , round num-
bers; they had no special significance, and we could easily have chosen other values (as long as D came out more valuable than W). Stipulating other utilities for these propositions would have affected our utility assignments down the line. For example, if we had chosen u(D) = 100 and u(W) = 0 instead, the proposition M that we proved to have the same utility distance from D as W would have landed at a utility of 50 (rather than 0). Yet I hope it's clear that differing utility scales resulting from different utility stipulations for D and W would have many things in common. This is because they measure the same underlying quantity: the extent to which an agent values a particular arrangement of the world. Different numerical scales that measure the same quantity may be related in a variety of ways; we will be particularly interested in measurement scales related by scalar and affine transformations. Two measurement scales are related by a scalar transformation when values on one scale are constant multiples of values on the other. A good example are the kilogram and pound measurement scales for mass. An object's mass in pounds is its mass in kilograms times 2.2. Thus the kilogram and pound scales are related by a scalar transformation. In this case the multiplying constant (2.2) is positive, so we call it a positive scalar transformation. Scalar transformations maintain zero points and ratios, and positive scalar transformations maintain ordinal ranks. Taking these one at a time: Anything that weighs 0 kilograms also weighs 0 pounds; the pound and
kilogram scales have the same zero point. Moreover, if I’m twice as heavy as you in kilograms then I’ll be twice as heav y as you in pounds. Scalar transformations preserve ratios among values. Finally, since it’s a positive scalar transformation, putting people in order by their weight in kilograms will also order them by their weight in pounds. Affine transformations are a bit more complex: the conversion not only multiplies by a constant but also adds a constant. Celsius and Fahrenheit temperatures are related by an affine transformation; to get Fahrenheit from Celsius you multiply by 1 .8 th en a dd 3 2. This is a positive affine transformation (determined again by the sign of the multiplying constant). Positive affine transformations maintain ordinal ranks, but not necessarily zero points or ratios among v alues. Again, one at a time: Tahiti is hotter than Alaska whatever temperature scale you use; a positive affine transformation keeps things in the same order . While 0 ˝ C is the (usual) freezing point of water, 0 ˝ F is a much colder tem perature. So a value of 0 does not indicate the same worldly situati on on both temperature scale s. Positive affine transformations may also distort ratios: 20 ˝ C is twice 10 ˝ C, but their equivalents 68˝ F and 50 ˝ F (respectively) do not fall in a ratio of 2 to 1. Affine transformations do, however, equality gaps. Suppose I tell you, “Tomorrow willpreserve be hotterfacts thanabout todaythe by the sameof number of degrees that today was hotter than yesterday.” This will be true on the Fahrenheit scale just in case it’s true on the Celsi us scale . (Scalar transformations preserve gap equality as well, since scalar transformations are the special case of affine transformations in which the added constant is 0.) Savage’s representation theorem guarantees that if an agent’s preferences satisfy the preference axioms, we will be able to find a probabilistic credence distribution and a utility distribution that match those preferences via expected utility maximization. In fact, we will be able to find many such utility distributions, but all the utility distributions that match this particular agent’s preferences will be related by positive affine transformation. Decision theorists tend to think that if two utility distributions are related by a positive affine transformation, there is no real underlying difference between an agent’s having one and the agent’s having another. Each distribution will rank states of affairs in the same order with respect to value, and when put into an expected utility calculation with the same credence distribution each will produce the same preferences among acts. The difference between such distributions is really in some particular utility values we—the utility measure rs—stipulate to set up our measurement scale. No matter which measurement scale we choose, the agent will still prefer choco-
late ice cream to van illa, and vanilla ice cream to none. And it may turn out that she prefers chocolate to vanilla by exactly the same amount that she prefers vanilla to none. (Affine transformations preserve the equivalence of gaps; establishing equivalent utility gaps was the main business of Ramsey’s third step.) This equanimity among utility scales related by positive affine transformation does lose us absolute zero points and ratios among utility assignments. Different utility scales will yield different results about whether our agent likes chocolate ice cream twice as much as vanilla. If we’re going to treat each of those measurement scales as equally accurate, we’ll have to deny that there’s any fact of the matter about the ratio between the agent’s utility for chocolate and utility for vani lla. But it’s unclear what it would even mean for an agent to value chocolate twice as much as vanilla (especially since such facts could have no bearing on her preferences among acts). So decision theorists tend not to mourn their inability to make utility ratio claims.
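A small sketch of the point about positive affine transformations: rescaling a utility function by u' = a·u + b with a > 0 (here a = 1.8 and b = 32, echoing the Celsius-to-Fahrenheit conversion) leaves every expected-utility comparison, and hence every preference, unchanged. The numbers are illustrative only.

```python
cr = {"S1": 0.6, "S2": 0.4}
u = {("A", "S1"): 2, ("A", "S2"): 10, ("B", "S1"): 5, ("B", "S2"): 4}

def eu(act, utils):
    return sum(utils[(act, s)] * cr[s] for s in cr)

a, b = 1.8, 32          # any a > 0 and any b give the same preference ordering
u_prime = {k: a * v + b for k, v in u.items()}

print(eu("A", u) > eu("B", u), eu("A", u_prime) > eu("B", u_prime))   # True True
```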
8.3 Representation theorems and probabilism
It’s time to pause and ask what decision theory—and representation theorems— are really for. Kenny Easwaran writes, Naive applications of decision theory often assume that it works by taking a specification of probabilities and utilities and using them to calculate the expected utilities of various acts, with a rational agent being required to take whichever act has the highest (or sufficiently high) expected utility. However, justifications of the formal framework of expected utility theory generally work in the opposite way—they start with an agent’s preferences among acts, and use them to calculate an implied probability and utility function.... The orthodox view of decision theory endorsed by Savage (1954) and Jeffrey (1965) takes preferences over acts with uncertain outcomes to be the fundamental concept of decision theory, and shows that if these preferences satisfy a particular set of axioms, then they can be represented by a probability function and a utility function. . . . This confl icts with a naive reading of the concept of expected utility, which was perhaps the dominant understanding of theories that were popular in the 17th to 19th
centuries. One often assumes that utilities and probabilities are prior to preference, and that decision theory says that you should prefer an act with a higher expected utility over any act with a lower expected utilit y. And this is how the theory of expected utility is often applie d in practi cal contexts. (2014a, pp. 1–2, emphasis in srcinal) Decision theory is often presented (and was largely presented in Chapter 7) as a first-personal guide to decision-making and problem-s olving. (Blaise Pascal initiated his famous calculations of expected returns because he and Pierre Fermat wanted to find the proper way of settling up the payouts for a casino game. (Fermat and Pascal 1654/19 29)) Once an agent has assigned her credences that various states of the world obtain, and her utilities to all the relevant outcomes, she can combine them via a valuation function to determine which available act she ought rationally to prefer. Representation theorems belong to a fairly different approach to decision theor y—what we might call a third-personal approach. Suppose an economist has been studying a subject, noting the decisions she has made some when confronted by various choices in the reveals of the the subject’s preferences to the economist, but past. certai This nly not all. Suppose economist also assumes that the subject is rational in the sense that her total set of preferences (both revealed and as-yet-unrevealed) together satisfies the preference axioms (Transitivity, Asymmetry, Completeness, etc.). A representation theorem then guarantees that the agent can be understood as if her past and future preferences are the result of maximizing expected utility relative to some utility distribution and probabilistic credence distribution. So the econo mist can take the subject’s past preferences and deduce features of credence and utility distributions that would generate those preferences were she maximizing expected utility. The economist then uses what’s known about those (imagined ) credences and utilities to predict preferences not yet observed. The subject’s future decisi ons must match these predictions, on pain of violating the prefe rence axioms. (In part (b) of Exercise 8.3 you’ll use Ramsey’s four-step process to make a prediction in this way.) To the extent that decision theorists and economis ts can assume real agents satisfy the preference axioms, this makes decision theory a powerful predictive tool. The third-personal approach can also be applied in a more abstract fashion. Since any agen t who is rational in the sens e of satisf ying the preference axioms is representable as maximizing expected utility, we can prove results about the preferences of rational agents by proving that maximiz-
maximizing expected utility requires certain kinds of preference relationships. For instance, we could argue that any agent displaying the preferences in the Allais Paradox (Section 7.2.4) must be irrational by showing that no possible utility distribution would generate those preferences for an expected-utility maximizer.

All of these third-personal results—both abstract and particular—suppose at some point that the agent sets her preferences by maximizing expected utility relative to a utility distribution and probabilistic credence distribution. But this supposition is a kind of bridge, taking the theorist from premises about the agent's preferences to conclusions that concern her preferences as well. To get the relevant results, we need not demonstrate that the agent actually sets her preferences using utilities and probabilistic credences. Expected-utility maximization acts as a mathematical model, making the acquisition of preference results more tractable. Resnik (1987, p. 99) writes, "the [representation] theorem merely takes information already present in facts about the agent's preferences and reformulates it in more convenient numerical terms."

Could we do more? Some Bayesian Epistemologists have used representation theorems to argue that rational agents must have probabilistic degrees of belief. At a first pass, the argument runs something like this:

Representation Theorem Argument for Probabilism
(Premise) A rational agent's preferences satisfy the preference axioms.
(Theorem) Any agent whose preferences satisfy the preference axioms can be represented as maximizing expected utility relative to a probabilistic credence distribution.
(Conclusion) All rational agents have probabilistic credence distributions.

8.3.1 Objections to the argument
As usual, a first approach to refuting this Representation Theorem Argument would be to deny its premise. Whatever representation theorem one uses in the argument (Savage's or one of its descendants), that theorem will assume that the agent satisfies a particular set of preference axioms. One might then deny that rationality requires satisfying those axioms. For example, some philosophers have argued that Preference Transitivity is not a rational requirement. (See Chapter 7, note 4.)

On the other hand, one might accept the premise, but only in a way that doesn't generate the desired conclusion. Chapter 1 distinguished practical
rationality, which concerns connections between attitudes and action, from theoretical rationality, which assesses representational attitudes considered as such. The preference axioms, being constraints on preferences between acts, are requirements of practical rationality. So if it's successful, the Representation Theorem Argument demonstrates that any agent who satisfies the requirements of practical rationality has probabilistic credences. As Ramsey put it after laying out his four-step process,

Any definite set of degrees of belief which broke [the probability rules] would be inconsistent in the sense that it violated the laws of preference between options, such as that preferability is a transitive asymmetrical relation. (1931, p. 84)

Yet we have offered probabilism as a thesis about theoretical rationality. The Representation Theorem Argument seems to show that an agent with non-probabilistic credences will make irrational decisions about how to behave in her life. But we wanted to show that non-probabilistic credences are flawed as representations in themselves, independently of how they lead to action. Adding the word "practically" before the word "rational" in the argument's Premise forces us to add "practically" before "rational" in its Conclusion as well, but that isn't the Conclusion we were hoping to obtain.7

Setting aside concerns about the Premise, one might worry about the validity of the Representation Theorem Argument as I've reconstructed it. Logically, the Premise and Theorem together entail that a rational agent can be represented as maximizing expected utility relative to a probabilistic credence distribution. How does it follow that a rational agent has probabilistic credences? To establish that a rational agent has probabilistic credences, we need to establish two claims: (1) that such an agent has numerical credences to begin with; and (2) that those credences satisfy the probability axioms.

It's unclear that the Representation Theorem Argument can even establish the first of these claims. Alan Hájek explains the trouble as follows:

The concern is that for all we know, the mere possibility of representing you one way or another might have less force than we want; your acting as if the representation is true of you does not make it true of you. To make this concern vivid, suppose that I represent your preferences with Voodooism. My voodoo theory says that there are warring voodoo spirits inside you. When you prefer A to B, then there are more A-favouring spirits inside you than B-favouring spirits. I interpret all of the usual
rationality axioms in voodoo terms. Transitivity: if you have more A-favouring spirits than B-favouring spirits, and more B-favouring spirits than C-favouring spirits, then you have more A-favouring spirits than C-favouring spirits. . . . And so on. I then 'prove' Voodooism: if your preferences obey the usual rationality axioms, then there exists a Voodoo representation of you. That is, you act as if there are warring voodoo spirits inside you in conformity with Voodooism. Conclusion: rationality requires you to have warring Voodoo spirits in you. Not a happy result. (2009a, p. 238, emphases in original)

It's possible to defend the representation theorem approach—and close this gap in the argument—by adopting a metaphysically thin conception of the attitudes in question. Voodoo (or Voudou) is a complex set of cultural traditions involving a variety of ontological and metaphysical commitments. Demonstrating that an agent behaves as if there were voodoo spirits inside her seems insufficient to establish such metaphysical claims. On the other hand, one might define the notion of a credence such that all there is to possessing a particular credence distribution is acting in accordance with a particular preference structure. At one point Ramsey writes, "I suggest that we introduce as a law of psychology that [an agent's] behaviour is governed by what is called the mathematical expectation. . . . We. . . define degree of belief in a way which presupposes the use of the mathematical expectation." (1931, p. 76) Bruno de Finetti (1937/1964) employs a similar definitional approach. On such a metaphysically thin behaviorist or functionalist account,8 an agent's acting as if she has probabilistic credences may be tantamount to her having such credences. (This point of view also makes it less worrisome that the argument's Premise invokes constraints of practical rationality.) The prominence of such operationalist views during parts of the twentieth century explains why so little distance was perceived between conclusions that follow uncontroversially from Savage's Representation Theorem and the more controversial claims of the Representation Theorem Argument.

Yet such straightforward, metaphysically thin operationalisms have fallen out of favor, for a variety of reasons. For example, even if we identify mental states using their functional roles, it's too restrictive to consider only roles related to preferences among acts. Doxastic attitudes have a variety of functions within our reasoning, and may even be directly introspectible. Christensen (2004) also notes that even if they aren't directly introspectible, degrees of belief affect other mental states that can be introspected, such as our emotions. (Consider how confidence that you will perform well affects
your feelings upon taking the stage.) With a thicker conception of credences in place, it's difficult to imagine that merely observing an agent's preferences among acts would suffice to attribute such doxastic attitudes to her.9

Perhaps, then, we can scale back what the Representation Theorem Argument is meant to show. Suppose we have convinced ourselves on independent grounds that agents have numerical degrees of belief, perhaps via considerations about comparisons of confidence like those adduced in Chapter 1. With the existence of credences already established, could the Representation Theorem Argument show that rationality requires those credences to satisfy the probability axioms?

Unfortunately the logic of the argument prevents it from even achieving that. Go back and carefully read the Representation Theorem on page 270. The phrase "there exists a unique probabilistic credence distribution" contains a key ambiguity. One might be tempted to read it as saying that given an agent's full preference ordering among acts, there will be exactly one credence distribution that matches those preferences via expected utility maximization, and moreover that credence distribution will be probabilistic. But that's not how the theorem works. The proof begins by assuming that we're looking for a probabilistic credence distribution, and then showing that out of all the probabilistic distributions there is exactly one that will match the agent's preferences. (If you look closely at Step Four of Ramsey's process—the step that determines credence values—you'll notice that halfway through we had to assume those values satisfy the probability axioms.) In the course of an argument for probabilism, this is extremely question-begging—how do we know that there isn't some other, non-probabilistic distribution that would lead to the same preferences by expected utility maximization? What if it turns out that any agent who can be represented as if she is maximizing expected utility with respect to a probabilistic distribution can also be represented as maximizing expected utility with respect to a non-probabilistic distribution? This would vitiate the argument's ability to privilege probabilistic distributions.10

8.3.2 Reformulating the argument
We can address these questions by proving a new version of the Representation Theorem:11

Revised Representation Theorem: If an agent's preferences satisfy certain constraints, then there exists a unique credence distribution (up to positive scalar transformation) and unique utility
distribution (up to positive affine transformation) that yield those preferences when the agent maximizes expected utility. Moreover, one of those scalar credence transforms satisfies the probability axioms.

The revised theorem shows that if an agent maximizes expected utility, her full set of preferences narrows down her credences to a very narrow class of possible distributions. These distributions are all positive scalar transformations of each other; they all satisfy Non-Negativity and Finite Additivity; and they all satisfy something like Normality. The main difference between them is in the particular numerical value they assign to tautologies. One distribution will assign a credence of 100 to every logical truth (we can think of this distribution as working on a percentage scale), while another distribution will assign tautologies a degree of belief of 1. For any particular proposition P, the former distribution will assign a value exactly 100 times that assigned by the latter. If we stipulate that our credence scale tops out at 1, there will be only one credence distribution matching the agent's preferences, and that distribution will satisfy the Kolmogorov axioms. The Revised Representation Theorem then demonstrates that any agent who generates preferences satisfying the preference axioms by maximizing expected utilities has a probabilistic credence distribution.

The revised theorem provides a revised argument:

Revised Representation Theorem Argument for Probabilism
(Premise 1) A rational agent's preferences satisfy the preference axioms.
(Premise 2) A rational agent's preferences align with her credences by expected utility maximization.
(Theorem) If an agent's preferences satisfy the preference axioms and align with her credences by expected utility maximization, then that agent has probabilistic credences (up to a positive scalar transformation).
(Conclusion) All rational agents have probabilistic credence distributions (or a positive scalar transformation thereof).

This argument has the advantage of being valid. Its conclusion isn't quite probabilism, but if we treat the maximum numerical credence value as a stipulated matter, it's close enough to probabilism to do the trick. Yet this version of the argument highlights another crucial assumption of representation theorems, embodied in Premise 2. Why should we assume
that rational preferences maximize expected utility? Savage's expected utility equation is just one of many valuation functions that could be used to combine credences and utilities into preferences. In Chapter 7 we considered other valuation functions endorsed by Jeffrey and by causal decision theorists. But those valuation functions all maximized expected utilities in some sense—they worked by calculating linear averages. An agent might instead determine her preferences using the following "squared credence" rule: A ≻ B just in case

u(A & S_1) · cr(S_1)^2 + . . . + u(A & S_n) · cr(S_n)^2 > u(B & S_1) · cr(S_1)^2 + . . . + u(B & S_n) · cr(S_n)^2    (8.12)
This valuation function behaves differently than the expected utility rules. For example, if an agent has a 2/3 credence that a particular bet will pay her $4 and a 1/3 credence that it will pay her nothing, applying the squared credence rule will lead her to prefer that bet to a guaranteed $3. Expected utility maximization recommends the opposite preference.

Here's another interesting feature of the squared credence rule: Return to our friends Mr. Prob and Mr. Bold from the introduction to this part of the book. Mr. Prob's credences are probabilistic, while Mr. Bold's credence in any proposition is the square-root of Mr. Prob's. Mr. Bold satisfies Non-Negativity and Normality, but not Finite Additivity. His credence in any contingent proposition is higher than Mr. Prob's. Now suppose that while Mr. Prob determines his preferences by maximizing Savage-style expected utilities, Mr. Bold's preferences are generated using Equation (8.12). In that case, Mr. Prob and Mr. Bold have the exact same preferences between any two acts.12 It's easy to see why: Mr. Bold's credence in a given state S_i is the square-root of Mr. Prob's, but Mr. Bold squares his credence values in the process of calculating his valuation function. Mr. Bold's aggressive attitude assignments and risk-averse act selections cancel out precisely, leaving him with preferences identical to Mr. Prob's. This means that if Mr. Prob's preferences satisfy the preference axioms, Mr. Bold's do as well.13

If all you know about an arbitrary agent is that her preferences satisfy the preference axioms, it will be impossible to tell whether she has probabilistic credences and maximizes expected utility, or has non-probabilistic credences and a different valuation function. If I assure you that this agent is fully rational, does that break the tie? Why does rationality require maximizing expected utility—what's rationally wrong with the way Mr. Bold proceeds?
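To make this concrete, here is a minimal numerical sketch (my own illustration, not from the text; the states, credences, and utilities are invented) checking two things: that the squared credence rule reverses the $4-bet example above, and that Mr. Bold, applying Equation (8.12) to the square roots of Mr. Prob's credences, ranks acts exactly as Mr. Prob does.

```python
import itertools
import math

# The $4-bet example (treating dollars as utils, as the text does for simplicity):
bet_sq = 4 * (2/3)**2 + 0 * (1/3)**2    # squared-credence value of the bet, about 1.78
sure_sq = 3 * (2/3)**2 + 3 * (1/3)**2   # squared-credence value of a sure $3, about 1.67
bet_eu = 4 * (2/3)                      # expected utility of the bet, about 2.67 (< 3)
print(bet_sq > sure_sq, bet_eu > 3)     # True, False: the two rules disagree

# Hypothetical three-state, three-act decision problem.
states = ["S1", "S2", "S3"]
prob_cr = {"S1": 0.5, "S2": 0.3, "S3": 0.2}              # Mr. Prob: probabilistic credences
bold_cr = {s: math.sqrt(c) for s, c in prob_cr.items()}  # Mr. Bold: their square roots
utils = {
    "A": {"S1": 10, "S2": 0, "S3": 4},  # u(A & S_i); values invented for illustration
    "B": {"S1": 3, "S2": 8, "S3": 6},
    "C": {"S1": 5, "S2": 5, "S3": 5},
}

def expected_utility(act, cr):
    """Savage-style valuation: sum of u(act & S_i) * cr(S_i)."""
    return sum(utils[act][s] * cr[s] for s in states)

def squared_credence_value(act, cr):
    """The squared credence rule of Equation (8.12): sum of u(act & S_i) * cr(S_i)**2."""
    return sum(utils[act][s] * cr[s] ** 2 for s in states)

for a, b in itertools.combinations(utils, 2):
    prob_prefers_a = expected_utility(a, prob_cr) > expected_utility(b, prob_cr)
    bold_prefers_a = squared_credence_value(a, bold_cr) > squared_credence_value(b, bold_cr)
    print(a, b, prob_prefers_a, bold_prefers_a)   # the last two columns always match
```

Since Mr. Bold's credence in each state is the square root of Mr. Prob's, squaring it inside the valuation function recovers Mr. Prob's credence exactly, so every pairwise comparison comes out the same.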
For the (revised) Representation Theorem Argument to succeed, we need a convincing argument that maximizing expected utility is rationally required.14 The argument cannot be that any agent who fails to maximize expected utility will adopt intuitively unappealing preferences among acts. (This was the strategy of our money pump argument for some of the preference axioms.) The alternative to maximizing expected utility we're considering here is capable of generating all the same act preferences—intuitively appealing or otherwise—that expected utilities can.

I said in the introduction to this part of the book that Finite Additivity is the most difficult to establish of Kolmogorov's axioms. The revised Representation Theorem Argument shows that if we can assume rational agents set their preferences by maximizing expected utility, then Finite Additivity is entailed by the preference axioms. But now we have what I call a Linearity In, Linearity Out problem. In order to demonstrate that rational agents satisfy one linearity constraint (Finite Additivity, which adds credences in mutually exclusive disjuncts in a straightforward linear fashion), we need to assume another linearity constraint (maximizing expected utility, which calculates expected utilities by straightforward linear averaging). We can criticize Mr. Bold for being non-linear in his credences only if it's antecedently permissible to criticize him for having a non-linear valuation function.

To be clear: I have no problem with a decision theory that lists both probabilism and expected-utility maximization among its rationally-required norms. We saw earlier how these norms complement each other and allow rational choice theorists to derive interesting and substantive results. My complaint is with expected-utility maximization as a premise in what's meant to be an independent argument for probabilism. Representation theorem arguments for probabilism rely on an assumption that looks just as nonobvious and in need of independent support as probabilism did.15
8.4 Exercises
Problem 8.1. Show that any real-number measurement scale with finite upper and lower bounds can be converted by a positive affine transformation into a scale with the bounds 0 and 1.

Problem 8.2. (a) List three different real-life cases in which two distinct measuring scales measure the same quantity and are related by a positive scalar transformation. (Measurements of mass cannot be one of your examples.)
(b) List three different real-life cases in which two distinct measuring scales measure the same quantity and are related by a positive affine transformation that is not a scalar transformation. (Measurements of temperature cannot be one of your examples.)

Problem 8.3. Shane is a graduate student who doesn't care much about the outcomes of sporting events (though he may have opinions about them). Assume the propositions the Heat win the NBA Finals and the Blackhawks win the Stanley Cup are ethically neutral for Shane. Among Shane's preferences between various acts and gambles are the following:

Go to movie
—preferred to—
Read book
—indifferent with—
Go to movie if Heat win the NBA Finals, work on dissertation if Heat don't win
—indifferent with—
Go to movie if Heat don't win, dissertate if Heat win
—preferred to—
Go to gym
—indifferent with—
Read book if Heat win, dissertate if Heat don't win
—indifferent with—
Go to movie if Blackhawks win the Stanley Cup, dissertate if Blackhawks don't win
—preferred to—
Dissertate

For the sake of definiteness, suppose Shane assigns a utility of 100 to going to a movie and a utility of 0 to working on his dissertation. Suppose also that Shane's preferences satisfy the preference axioms, his credences satisfy the probability axioms, and he determines his preferences by maximizing expected utilities in the standard way.

(a) Use Ramsey's four-step process to determine as much about Shane's utility and credence values as you can. Be sure to explain your method.

(b) Imagine Shane is offered a gamble on which he reads a book if the Blackhawks win the Stanley Cup, but dissertates if they don't win. Would Shane prefer to accept this gamble or go to the gym?
Problem 8.4. (a) Suppose an agent assigns cr(P) = 1/3 and sets her preferences according to standard expected utility calculations. Explain how she might nevertheless prefer a guaranteed $10 to a gamble that pays $40 on P and nothing otherwise, if dollars have declining marginal utility for her.

(b) Now suppose the agent doesn't have declining marginal utility for money—in fact, she assigns exactly one util per dollar gained or lost, no matter how many she already has. Show that such an agent could still prefer a guaranteed $10 to a gamble that pays $40 on P and nothing otherwise, if she assigns preferences using the "squared credences" valuation function (Equation (8.12)) rather than standard expected utility calculations.

Problem 8.5. Suppose I've been cross-examining an agent for some time about her preferences, and all the preferences I've elicited satisfy the preference axioms. Mr. Prob comes along and calculates a utility distribution and probabilistic credence distribution that would generate the elicited preferences if the agent is an expected-utility maximizer. Mr. Bold then claims that Mr. Prob is wrong about the agent's credence values—according to Mr. Bold, the agent's non-extreme credences are actually the square-root of what Mr. Prob has suggested, but the agent is a squared-credence maximizer. Do you think there's any way to tell if Mr. Prob or Mr. Bold is correct about the agent's credence values? Could there even be a fact of the matter to the effect that one of them is right and the other is wrong?
8.5 Further reading
Introductions and Overviews
Richard C. Jeffrey (1965). The Logic of Decision. 1st ed. McGraw-Hill Series in Probability and Statistics. New York: McGraw-Hill.

Chapter 3 carefully explains techniques for drawing out credences and utilities from preferences, including a step-by-step walkthrough of Ramsey's technique with examples.

Classic Texts
Frank P. Ramsey (1931). Truth and Probability. In: The Foundations of Mathematics and Other Logical Essays. Ed. by R.
B. Braithwaite. New York: Harcourt, Brace and Company, pp. 156–198.

Ramsey inspired all future representation theorems with his four-step process for determining an agent's credences and utilities from her preferences.

Leonard J. Savage (1954). The Foundations of Statistics. New York: Wiley.

Though the proof is spread out over the course of the book, this work contains the first general representation theorem.

Extended Discussion
Patrick Maher (1993). Betting on Theories. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge: Cambridge University Press.

Mark Kaplan (1996). Decision Theory as Philosophy. Cambridge: Cambridge University Press.

David Christensen (2004). Putting Logic in its Place. Oxford: Oxford University Press.

Each of these authors explains and defends some version of a representation theorem argument for rational constraints on degrees of belief: Maher in his Chapter 8, Kaplan in his Chapter 1, and Christensen in his Chapter 5.

Lyle Zynda (2000). Representation Theorems and Realism About Degrees of Belief. Philosophy of Science 67, pp. 45–69.

Demonstrates that rational preferences representable as maximizing expected utilities based on probabilistic credences can also be represented as maximizing some other quantity based on non-probabilistic credences, then explores the consequences for realism about probabilistic credence.

Christopher J.G. Meacham and Jonathan Weisberg (2011). Representation Theorems and the Foundations of Decision Theory. Australasian Journal of Philosophy 89, pp. 641–663.

A critical examination of representation theorem arguments, assessing their potential to establish both descriptive and normative claims about degrees of belief.
Notes

1. As in Chapter 7, we will read "A" as the proposition that the agent performs a particular act and "S_i" as the proposition that a particular state obtains. Thus preferences, credences, and utilities will all be propositional attitudes.
2. To be slightly more careful about Equation (8.4): Standard expected utility theories (such as Savage's) endorse a principle for utilities similar to the Conglomerability principle for credences we saw in Section 5.4. For any X and P, u(X & P) and u(X & ~P) set the bounds for u(X). If u(X & P) = u(X & ~P), u(X) must equal this value as well.

3. Ramsey points out that this approach will not work for propositions Q with extreme unconditional credences. But those can be ferreted out easily: for instance, if the agent is certain of Q then she will be indifferent between receiving D for certain and a gamble that yields D on Q and M on ~Q. Also, we have to be sure in Step 4 that we've selected Q, R, S, T such that u(T) ≠ u(S & ~Q) ≠ u(R & Q).

4. (Maher 1993, Ch. 2 and 3) provides an excellent general defense of the preference axioms as rational requirements.

5. As I pointed out in Chapter 7, Preference Completeness actually has some substantive consequences when considered on its own. For example, it entails Preference Asymmetry. But decision theorists often think of these principles in the following order: If we first take on substantive constraints such as Preference Asymmetry and Transitivity, then adding Preference Completeness is just a matter of requiring preference orderings to be total.

6. See (Fishburn 1981) for a useful survey. One popular move is to replace the requirement that an agent's preference ordering actually satisfy a particular richness constraint with a requirement that the ordering be extendable into a fuller ordering satisfying that constraint.

7. Of course, if one holds the position (mentioned in Section 1.1.2) that all requirements of theoretical rationality ultimately boil down to requirements of practical rationality, this objection may not be a concern.

8. Historically, it's interesting to consider exactly what philosophy of mind Ramsey was working with. On the one hand, he was surrounded by a very positivist, behaviorist milieu. (Among other things, Ramsey was Wittgenstein's supervisor at Cambridge and produced the first English translation of Tractatus Logico-Philosophicus.) On the other hand, Ramsey's writings contain suggestions of an early functionalism. Brian Skyrms writes that, "Ramsey thinks of personal probabilities as theoretical parts of an imperfect but useful psychological model, rather than as concepts given a strict but operational definition." (1980b, p. 115)

9. That's not to say that the links between credence and rational preference become useless once operationalism about doxastic attitudes is abandoned. Savage (1954, pp. 27–28) has a nice discussion of the advantages of determining an agent's numerical credence values by observing her preferences over trying to determine them by asking her to introspect.

10. This concern is forcefully put by Meacham and Weisberg (2011).

11. This revised theorem is proven for a relevant set of preference axioms in unpublished work by myself and Lara Buchak. The result is a fairly straightforward elaboration of proofs used in standard Dutch Book arguments for probabilism.

12. The idea of mimicking a probabilistic agent's preference structure by giving a non-probabilistic agent a non-standard valuation function comes from (Zynda 2000).

13. One might have thought that I dismissed our original Representation Theorem Argument for Probabilism too quickly, for the following reason: Even if we're not operationalists
about degrees of belief, we might think that if probabilistic degrees of belief figure in the simplest, most useful explanation of observed rational agent preferences then that's a good reason to maintain that rational agents possess them. In this vein, Patrick Maher writes:

I suggest that we understand attributions of probability and utility as essentially a device for interpreting a person's preferences. On this view, an attribution of probabilities and utilities is correct just in case it is part of an overall interpretation of the person's preferences that makes sufficiently good sense of them and better sense than any competing interpretation does. . . . If a person's preferences all maximize expected utility relative to some cr and u, then it provides a perfect interpretation of the person's preferences to say that cr and u are the person's probability and utility functions. Thus, having preferences that all maximize expected utility relative to cr and u is a sufficient (but not necessary) condition for cr and u to be one's probability and utility functions. (1993, p. 9)

Suppose we accept Maher's criterion for correct attitude attribution. The trouble is that even if the interpretation of a rational agent's preferences based on probabilistic credences and maximizing expected utilities is a perfect one, the alternate interpretation based on Bold-style credences and the squared credence valuation function looks just as perfect as well. Thus the probabilist interpretation fails to make better sense than competing interpretations, and a representation theorem argument for probabilism cannot go through.

14. Notice that in our earlier quote from (Ramsey 1931, p. 76), Ramsey simply introduces "as a law of psychology" that agents maximize expected utility.
15. In Section 7.1 we suggested that the law of large numbers provides one reason to use expectations in estimating values. The idea is that one's expectation of a numerical quantity equals the average value one anticipates that quantity will approach in the limit. Why doesn't this provide an argument for making decisions on the basis of expected utilities? One might worry here that using the long-run average smuggles in a linearity bias. But there's an even deeper problem with the proposed argument: The law of large numbers says that if you satisfy the probability axioms, then you'll have credence 1 that the average in the limit equals your expectation. A result that assumes probabilism cannot be used to ground maximizing expected utility if we hope to use the latter as part of our argument for probabilism.
Chapter 9

Dutch Book Arguments

Chapter 8 presented the Representation Theorem Argument for probabilism. In its best form, this argument shows that any agent who satisfies certain preference axioms and maximizes expected utility assigns credences satisfying Kolmogorov's probability rules.1 Contraposing, an agent who maximizes expected utility but fails to assign probabilistic credences will violate at least one of the preference axioms.

But why should rationality require satisfying the preference axioms? In Chapter 7 we argued that an agent who violates certain of the preference axioms—Preference Asymmetry and/or Preference Transitivity—will be susceptible to a money pump: a series of decisions, each of which is recommended by the agent's preferences, but which together leave the agent back where she started with less money on her hands. It looks irrational to leave yourself open to such an arrangement, and therefore irrational to violate the preference axioms.

While money pumps may be convincing, it's an awfully long and complicated road from them to probabilism. This chapter assesses a set of arguments that are fairly similar to money pump arguments, but which constrain credences in a much more direct fashion. These arguments show that if an agent's credences violate particular constraints, we can construct a Dutch Book against her—a set of bets, each of which the agent views as fair, but which together guarantee that she will lose money come what may. Dutch Books can be constructed not only against agents whose credences violate Kolmogorov's probability axioms, but also against agents whose credences violate the Ratio Formula, updating by Conditionalization, and many of the other credal constraints proposed in Chapter 5.

This chapter begins by working through those putative norms, showing
how to construct Dutch Books against agents who violate each one. We then ask whether the possibility of constructing a Dutch Book against agents who violate a particular norm can be turned into an argument for that norm's being rationally required. After offering the most plausible version of a Dutch Book Argument that we can, we will canvass a number of traditional objections.
9.1 Dutch Books
Dutch Book Arguments revolve around agents' betting behavior, so we'll begin by discussing how an agent's credences influence the bets she'll accept. For simplicity's sake we will assume throughout this chapter that agents assign each dollar the same amount of utility (no matter how many dollars they already have). That way we can express bets in dollar terms instead of worrying about the logistics of paying off a bet in utils.

Suppose I offer to sell you the following ticket:

This ticket entitles the bearer to $1 if P is true, and nothing otherwise.

for some particular proposition P. If you're rational, what is your fair price for that ticket—that is, how much would you be willing to pay to possess it? It depends how confident you are that P is true. If you think P is a long shot, then you think this ticket is unlikely to be worth anything, so you won't pay much for it. The more confident you are in P, however, the more you'll pay for the ticket. For example, if P is the proposition that a fair coin flip comes up heads, you might be willing to pay $0.50 for that ticket. If you pay $0.50 for the ticket, then you've effectively made a bet on which you net $0.50 if P is true (coin comes up heads) but lose $0.50 if P is false (coin comes up tails). Seems like a fair bet.

A ticket that pays off on P is worth more to a rational agent the more confident she is of P. In fact, we typically assume that a rational agent's fair betting price for a $1 ticket on P is cr(P)—she will purchase a ticket that pays $1 on P for any amount up to cr(P) dollars. For example, suppose neither you nor I knows anything about the day on which Frank Sinatra was born. Nevertheless, I offer to sell you the following ticket:
This ticket entitles the bearer to $1 if Sinatra was born on a weekend, and nothing otherwise.
If you spread your credences equally among the days of the week, $2/7—or about $0.29—is a fair betting price for this ticket. To buy the ticket at that price is to place a particular kind of bet that Sinatra was born on a weekend. If you lose the bet, you're out the $0.29 you paid for the ticket. If you win the bet, it cost you $0.29 to buy a ticket which is now worth $1, so you're up $0.71. Why do you demand such a premium—why do you insist on a higher potential payout for this bet than the amount of your potential loss? Because you think you're more likely to lose than win, so you'll only make the bet if the (unlikely) payout is greater than the (probable) loss.

Now look at the same transaction from my point of view—the point of view of someone who's selling the ticket, and will be on the hook for $1 if Sinatra was born on a weekend. You spread your credences equally among the days, and are willing to buy the ticket for up to $0.29. If I spread my credences in a similar fashion, I should be willing to sell you this ticket for any amount of at least $0.29. On the one hand, I'm handing out a ticket that may entitle you to $1 from me once we find out about Sinatra's birthday. On the other hand, I don't think it's very likely that I'll have to pay out, so I'm willing to accept as little as $0.29 in exchange for selling you the ticket.

In general, an agent's fair betting price for a gambling ticket is both the maximum amount she would pay for that ticket and the minimum amount for which she would sell it. All the tickets we've considered so far pay out $1 if a particular proposition is true. Tickets can be bought or sold for other potential payoffs, though. In general, the rational fair betting price for a ticket that pays $S if P is true and nothing otherwise is $S · cr(P).2 (Think of this as the fair betting price of S tickets, each of which pays $1 on P.) This formula works both for run-of-the-mill betting cases and for cases in which the agent has very extreme opinions. For instance, consider an agent's behavior when her credence in P is 0. Our formula sets her fair betting price at $0, whatever the stakes S. Since the agent doesn't think the ticket has any chance of paying off, she will not pay any amount of money to possess it. On the other hand, she will be willing to sell such a ticket for any amount you like, since she doesn't think she's incurring any liability in doing so.

Bayesians (and bookies) often quote bets using odds instead of fair betting prices. For instance, a bet that Sinatra was born on a weekend would typically go off at 5 to 2 odds. This means that the ratio of your potential
net payout to your potential net loss is 5:2 (0.71 : 0.29). A rational agent will accept a bet on P at her odds against P (that is, cr(~P) : cr(P))—or better. Yet despite the ubiquity of odds talk in professional gambling, our calculations will be expressed in terms of fair betting prices going forward.
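As a small arithmetic sketch (the function names here are my own, purely illustrative), a single credence fixes both the fair betting price for a ticket and the odds against the proposition:

```python
def fair_betting_price(credence, stake=1.0):
    """Fair price for a ticket that pays `stake` if P is true and nothing otherwise: stake * cr(P)."""
    return stake * credence

def odds_against(credence):
    """Odds against P, i.e. cr(~P) : cr(P), returned as (potential net payout, potential net loss)."""
    return 1 - credence, credence

# Sinatra example: credence 2/7 that he was born on a weekend.
cr_weekend = 2 / 7
print(round(fair_betting_price(cr_weekend), 2))     # 0.29, the fair price for a $1 ticket
print(round(fair_betting_price(cr_weekend, 3), 2))  # 0.86, the fair price if the ticket pays $3
print(odds_against(cr_weekend))                     # (5/7, 2/7), i.e. 5 to 2 against
```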
9.1.1 Dutch Books for probabilism
Suppose we have an agent who violates the probability axioms by assigning both cr(P) = 0.7 and cr(~P) = 0.7 for some particular proposition P. (Perhaps he's a character like Mr. Bold.) Given his credence in P, this agent's fair betting price for a ticket that pays $1 if P is true will be $0.70. Given his credence in ~P, his fair betting price for a ticket that pays $1 if ~P is true will also be $0.70. So let's sell him both of these tickets, at $0.70 each.

Our agent is now in trouble. He has paid a total of $1.40 for the two tickets, and there's no way he can make all that money back. If P is true his first ticket is worth $1 but his second ticket is worth nothing. If P is false his first ticket is worth nothing and his second ticket pays only $1. Either way, he's going to wind up out $0.40. We can summarize the agent's situation with the following table:

                          P       ~P
  Ticket pays on P       0.30   -0.70
  Ticket pays on ~P     -0.70    0.30
  TOTAL                 -0.40   -0.40
The columns of this table partition the possible states of the world. In this case, our partition consists of the propositions P and ~P. The agent purchases two tickets; each ticket is recorded on one row. The entries in the cells report the agent's net payout for that ticket in that state; all values are in dollars, and negative numbers indicate a loss. So, for instance, the upper-right cell reports that if ~P is true then the agent loses $0.70 on his P ticket (the ticket cost him $0.70, and doesn't win him anything in that state). The upper-left cell records that the P ticket cost the agent $0.70, but he makes $1 on it if P is true, for a net profit of $0.30. The final row reports the agent's total payout for all his tickets in a given state of the world. As we can see, an agent who purchases both tickets will lose $0.40 no matter which state the world is in. Purchasing this set of tickets guarantees him a net loss.

A Dutch Book is a set of bets, each placed with an agent at her fair betting price (or better), that together guarantee her a sure loss come what
may.3 The idea of a Dutch Book is much like that of a money pump (Section 7.2.1): we make a series of exchanges with the agent, each of which individually looks fair (or favorable) from her point of view, but which together yield an undesirable outcome. In a Dutch Book, each bet is placed at a price the agent considers fair given her credence in the proposition in question, but when all the bets are added up she's guaranteed to lose money no matter which possible world is actual.

Ramsey (1931, p. 84) recognized a key point about Dutch Books, which was proven by de Finetti (1937/1964):

Dutch Book Theorem: If an agent's credences violate at least one of the probability axioms (Non-Negativity, Normality, or Finite Additivity), a Dutch Book can be constructed against her.

We will prove this theorem by going through each of the axioms one at a time, and seeing how to make a Dutch Book against an agent who violates it.

Non-Negativity and Normality are relatively easy. An agent who violates Non-Negativity will set a negative betting price for a ticket that pays $1 on some proposition P. Since the agent assigns a negative betting price to that ticket, she is willing to sell it at a negative price. In other words, this agent is willing to pay you some amount of money to take a ticket which, if P is true, entitles you to an extra $1 from her on top of what she already paid you to take it. Clearly this is a losing proposition for the agent.

Now suppose an agent violates Normality by assigning, say, cr(P ∨ ~P) = 1.4. This agent will pay $1.40 for a ticket that pays $1 if P ∨ ~P is true. That $1 will definitely come in, but will still represent a net loss for the agent. On the other hand, if the agent assigns a credence less than 1 to a tautology, she will sell for less than $1 a ticket that pays $1 if the tautology is true. The tautology will be true in every possible world, so in every world the agent will lose money on this bet.

Finally, suppose that for mutually exclusive P and Q, an agent violates Finite Additivity by assigning cr(P) = 0.5, cr(Q) = 0.5, and cr(P ∨ Q) = 0.8. Because of these credences, the agent pays $0.50 for a ticket that pays $1 on P, and then pays another $0.50 for a ticket that pays $1 on Q. Then we have her sell us for $0.80 a ticket that pays $1 if P ∨ Q. At this point, the agent has collected $0.80 from us and paid a total of $1 for the two tickets she bought. So she's down $0.20. Can she hope to make this money back? Well, the tickets she's holding will be worth $1 if either P or Q is true. She can't win on both tickets, because P and Q were stipulated to be mutually exclusive. So at most, the agent's tickets are going
to earn her $1. But if either P or Q is true, P ∨ Q will be true as well, so she will have to pay out $1 on the ticket she sold us. The moment she earns her $1 she'll have to pay it back out to us. There's no way for the agent to make her money back, so no matter what happens she'll be out a net $0.20. The situation is summed up in this table:

                         P & ~Q    ~P & Q    ~P & ~Q
  Ticket pays on P        0.50     -0.50     -0.50
  Ticket pays on Q       -0.50      0.50     -0.50
  Ticket pays on P ∨ Q   -0.20     -0.20      0.80
  TOTAL                  -0.20     -0.20     -0.20
Since P and Q are mutually exclusive, there is no possible world in which P & Q is true, so our partition has only three elements. On the first row, the P-ticket for which the agent paid $0.50 nets her a positive $0.50 in the state where P is true. Similarly for Q on the second row. The third row represents a ticket the agent sold, so she makes $0.80 on it unless P ∨ Q is true, in which case she suffers a net loss. The final row sums the rows above it to show that each possible state guarantees the agent a $0.20 loss from her bets.

A similar Book can be constructed for any agent who assigns cr(P ∨ Q) < cr(P) + cr(Q). For a Book against agents who violate Finite Additivity by assigning cr(P ∨ Q) > cr(P) + cr(Q), see Exercise 9.1.
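Here is a brief sketch (my own illustration, not the book's) that recomputes the table above: for each state it sums the agent's net payoff over the tickets she has bought and sold, confirming the guaranteed $0.20 loss.

```python
# Each bet is (states_where_it_pays, payoff, price, bought); `bought` is False when the agent sells.
states = ["P&~Q", "~P&Q", "~P&~Q"]         # P and Q are mutually exclusive, so P&Q is impossible
bets = [
    ({"P&~Q"}, 1.00, 0.50, True),           # ticket paying $1 on P, bought at cr(P) = 0.5
    ({"~P&Q"}, 1.00, 0.50, True),           # ticket paying $1 on Q, bought at cr(Q) = 0.5
    ({"P&~Q", "~P&Q"}, 1.00, 0.80, False),  # ticket paying $1 on P ∨ Q, sold at cr(P ∨ Q) = 0.8
]

def net_payoff(state, pays_in, payoff, price, bought):
    """Agent's net gain in `state` from a single ticket."""
    wins = payoff if state in pays_in else 0.0
    return (wins - price) if bought else (price - wins)

for state in states:
    total = sum(net_payoff(state, *bet) for bet in bets)
    print(state, round(total, 2))           # -0.2 in every state: a Dutch Book
```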
9.1.2 Other Dutch Books
Dutch Books can also be constructed against agents who violate other rational credence requirements. For example, suppose an agent has the probabilistic unconditional credence distribution specified by the following probability table:

  P    Q    cr
  T    T    1/4
  T    F    1/4
  F    T    1/4
  F    F    1/4

But now suppose that this agent violates the Ratio Formula by assigning cr(P | Q) = 0.6. To construct a Dutch Book against this agent, we need the idea of a conditional bet. Suppose we sell the agent the following ticket:
If Q is true, this ticket entitles the bearer to $1 if P is true and nothing otherwise. If Q is false, this ticket may be returned to the seller for a full refund of its purchase price.

If Q turns out to be false, it doesn't matter how much the agent paid for this ticket; her full purchase price will be refunded to her. So if Q is false the agent's purchase of this ticket will net her exactly $0. That means the agent's purchase price for this ticket should be dictated by her opinion of P in worlds where Q is true. In other words, the agent's purchase price for this ticket should be driven by cr(P | Q). We call the resulting bet a conditional bet on P given Q. A conditional bet on P given Q wins or loses money for the agent only if Q is true; if the payoff on P (given Q) is $1, the agent's fair betting price for such a bet is cr(P | Q). In general, conditional bets are always priced using conditional credences.

Since our imagined agent sets cr(P | Q) = 0.6, she will purchase the ticket above for $0.60. We now ask her to sell us two more tickets:
1. We pay the agent $0.25 for a ticket that pays us $1 if P & Q.

2. We pay the agent $0.30 for a ticket that pays us $0.60 if ~Q.
Notice that Ticket 2 is for stakes other than $1; we've calculated the agent's fair betting price ($0.30) by multiplying her credence in ~Q (1/2) by the ticket's payoff ($0.60). The agent has received $0.55 from us, but she's also paid out $0.60 for the conditional ticket. So she's down $0.05. If Q is false, she'll get a refund of $0.60 for the conditional ticket, but she'll also owe us $0.60 on Ticket 2. If Q is true and P is true, she gets $1 from the conditional ticket but owes us $1 on Ticket 1. And if Q is true and P is false, she neither pays nor collects on any of the tickets and so is still out $0.05. No matter what, the agent loses $0.05. The following table summarizes the situation:
                         P & Q    ~P & Q     ~Q
  Ticket 1              -0.75      0.25     0.25
  Ticket 2               0.30      0.30    -0.30
  Conditional ticket     0.40     -0.60     0
  TOTAL                 -0.05     -0.05    -0.05
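The refund clause on the conditional bet can make the bookkeeping fiddly, so here is a short sketch (again my own illustration, not the book's) that tallies the three tickets' payoffs in each cell of the partition and recovers the guaranteed $0.05 loss.

```python
states = ["P&Q", "~P&Q", "~Q"]

def conditional_ticket(state, price=0.60, payoff=1.00):
    """Bought conditional bet on P given Q: pays on P&Q; full refund (net 0) if ~Q."""
    if state == "~Q":
        return 0.0                                   # purchase price refunded
    return (payoff if state == "P&Q" else 0.0) - price

def ticket1(state, price=0.25, payoff=1.00):
    """Sold to the bookie: pays out $1 on P&Q; otherwise the agent keeps the price."""
    return price - (payoff if state == "P&Q" else 0.0)

def ticket2(state, price=0.30, payoff=0.60):
    """Sold to the bookie: pays out $0.60 on ~Q."""
    return price - (payoff if state == "~Q" else 0.0)

for s in states:
    print(s, round(ticket1(s) + ticket2(s) + conditional_ticket(s), 2))   # -0.05 everywhere
```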
A similar Dutch Book can be constructed against any agent who violates the Ratio Formula. David Lewis figured out how to turn this Dutch Book against Ratio Formula violators into a strategy against anyone who fails to update by
Conditionalization.4 Suppose we have an agent who assigns the unconditional credence distribution described above at t_i. Suppose also that this agent satisfies the Ratio Formula at all times (and so isn't bookable by virtue of probability axiom or Ratio Formula violations). But now suppose that
if the agent learns Q between t_i and t_j, she will assign cr_j(P) = 0.6. Since she satisfies the Ratio Formula, the agent assigns cr_i(P | Q) = 0.5, so this cr_j assignment will violate Conditionalization.

We take advantage of this agent's Conditionalization violation by first purchasing Tickets 1 and 2 described above from her at t_i. The prices on these tickets match the agent's unconditional t_i credences, so she will be willing at t_i to sell them at the prices listed. We then formulate the following strategy: If the agent learns Q between t_i and t_j, we will sell her a ticket at t_j that pays $1 on P. We know that this agent will assign cr_j(P) = 0.6 if she learns Q, so in that circumstance she will be willing to buy this ticket for $0.60. If the agent doesn't learn Q between t_i and t_j, we will not engage in any transactions with her beyond Tickets 1 and 2. 5

Putting all this together, the agent's payoffs once more are:
                         P & Q    ~P & Q     ~Q
  Ticket 1              -0.75      0.25     0.25
  Ticket 2               0.30      0.30    -0.30
  Ticket if Q learned    0.40     -0.60     0
  TOTAL                 -0.05     -0.05    -0.05
(Because the agent purchases the third ticket only if Q is true, it neither costs nor pays her anything if Q is false.) This agent received $0.55 from us for selling two tickets at t_i. If Q is false, no more tickets come into play, but she owes us $0.60 on Ticket 2, so she's out a total of $0.05. If Q is true, she purchases the third ticket, and so is out $0.05. If P is also true, she wins $1 on that third ticket but has to pay us $1 on Ticket 1, so she's still down $0.05. If P is false (while Q is true), none of the tickets pays, and her net loss remains at $0.05. No matter what, the agent loses money over the course of t_i to t_j.

Quick terminological remark: A Dutch Book is a set of bets guaranteed to generate a loss. Strictly speaking, we haven't just built a Dutch Book against Conditionalization violators, because we haven't described a single set of bets that can be placed against the agent to guarantee a sure loss in every case. Instead, we've specified two sets of bets, one to be placed if the agent learns Q and the other to be placed if not. (The former set contains three bets, while the latter contains two.) We've given the bookie a strategy for placing different sets of bets in different circumstances, such that each
potential set of bets is guaranteed to generate a loss in the circumstances in which it's placed. For this reason, Lewis's argument supporting Conditionalization is usually known as a Dutch Strategy argument rather than a Dutch Book argument.

Dutch Books or Strategies have been constructed to punish violators of many of the additional Bayesian constraints we considered in Chapter 5: Regularity (Kemeny 1955; Shimony 1955), the Principal Principle (Howson 1992), the Reflection Principle (van Fraassen 1984), Countable Additivity (Adams 1962), and Jeffrey Conditionalization (Armendt 1980; Skyrms 1987b). I will not work through the details here. Instead, we will consider the normative consequences of these Dutch Book constructions.
9.2 The Dutch Book Argument
A Dutch Book is not an argument. A Dutch Book is simply a set of bets, and a set of bets doesn't argue for anything. But once we know that a Dutch Book can be constructed in a particular kind of situation, we can use that fact to argue for various putative rational norms. For instance:

Dutch Book Argument for Probabilism
(Premise) It is not possible to construct a Dutch Book against a rational agent.
(Theorem) If an agent's credences violate at least one of the probability axioms, a Dutch Book can be constructed against her.
(Conclusion) Rational agents' credences do not violate the probability axioms.

The key premise is that no rational agent is susceptible to being Dutch Booked; just as rational preferences help one avoid money pumps, so should rational credences save us from Dutch Books. Once we have this premise, similar Dutch Book Arguments can be constructed for the Ratio Formula, Conditionalization, and all the other norms mentioned in the previous section.

But is the premise plausible? What if it turns out that a Dutch Book can be constructed against any agent, no matter what rules her credences do or don't satisfy? In other words, imagine that Dutch Books were just a rampant, unavoidable fact of life. Then the premise of the Dutch Book Argument would be false. To reassure ourselves that we don't live in this dystopian universally-bookable world, we need a series of what Hájek (2009a) calls Converse Dutch Book Theorems. The usual Dutch Book Theorem tells us that
if an agent violates the probability axioms, she is susceptible to a Dutch Book. A Converse Dutch Book Theorem would tell us that if an agent satisfies the probability axioms, she is not susceptible to a Dutch Book. If we had a Converse Dutch Book Theorem, then we wouldn't need to worry that whatever credences we assigned, we could be Dutch Booked. The Converse Dutch Book Theorem would guarantee us safety from Dutch bookies as long as we maintained probabilistic credences; together with the standard Dutch Book Theorem, the Converse Theorem would constitute a powerful consideration in favor of assigning probabilistic over non-probabilistic credence distributions.

Unfortunately we can't get a Converse Dutch Book Theorem of quite the kind I just described. Satisfying the probability axioms with her unconditional credences does not suffice to inoculate an agent against Dutch Books. An agent whose unconditional credences satisfy the probability axioms might still violate the Ratio Formula with her conditional credences, and as we've seen this would leave her open to Book. So it can't be a theorem that no agent with probabilistic credences can ever be Dutch Booked (because that isn't true!). Instead, our Converse Dutch Book Theorem has to say that as long as an agent's credences satisfy the probability axioms, she can't be Dutch Booked with the kind of Book we deployed against agents who violate the axioms. For instance, if an agent satisfies Non-Negativity, there won't be any propositions to which she assigns a negative credence, so we won't be able to construct a Book against her that requires selling a ticket at a negative fair betting price (as we did against the Non-Negativity violator). Lehman (1955) and Kemeny (1955) each independently proved that if an agent's credences satisfy the probability axioms, she isn't susceptible to Books of the sort in Section 9.1.1. 6 This makes it plausible that Dutch Book avoidance is a necessary condition for rationality, as the premise of our Dutch Book Argument requires.

Converse Dutch Book Theorems can also help us beat back another challenge to Dutch Book Arguments. Hájek (2009a) defines a Czech Book as a set of bets, each placed with an agent at her fair betting price (or better), that together guarantee her a sure gain come what may. It's easy to see that whenever one can construct a Dutch Book against an agent, one can also construct a Czech Book. We simply take each ticket contained in the Dutch Book, leave its fair betting price intact, but have the agent sell it rather than buy it (or vice versa). In the betting tables associated with each Book, this will flip all the negative payouts to positive and positive payouts to negative. So the total payouts on the bottom row of each column in the table will be positive, and the agent will profit come what may.
According to the Dutch Book Theorem, a Dutch Book can be constructed against any agent who violates a probability axiom. We now know that whenever a Dutch Book can be constructed, a Czech Book can be as well. This gives us the

Czech Book Theorem: If an agent's credences violate at least one of the probability axioms, a Czech Book can be constructed against her.

Violating the probability axioms leaves an agent susceptible to Dutch Books, which seems to be a disadvantage. But violating the probability axioms also opens up the possibility that an agent will realize Czech Books, which seems to be an advantage. Perhaps a rational agent would make herself susceptible to Dutch Books in order to be ready for Czech Books, in which case the premise of our argument is false once more. In general, Hájek worries that any argument for probabilism based on Dutch Books will be canceled by Czech Books, leaving the Dutch Book Theorem normatively inert.

At this point, converse theorems become significant. The Converse Dutch Book Theorem says that satisfying the probability axioms protects an agent from the disadvantage of susceptibility to particular kinds of Dutch Book. But there is no Converse Czech Book Theorem—it's just not true that any agent who satisfies the probability axioms must forgo Czech Books. That's because a rational agent will purchase betting tickets at anything up to her fair betting price (and sell at anything at least that number). For instance, an agent who satisfies the axioms will assign credence 1 to any tautology and so set $1 as her fair betting price for a ticket that pays $1 if some tautology is true. But if we offer her that ticket for, say, $0.50 instead, she will be perfectly happy to take it off our hands. Since the ticket pays off in every possible world, the agent will make a profit come what may. So here we have a Czech Book available to agents who satisfy the probability axioms.

Non-probabilistic agents are susceptible to certain kinds of Dutch Book, while probabilistic agents are not. Non-probabilistic agents can take advantage of Czech Books, but probabilistic agents can too. The advantage seems to go to probabilism.
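To see why every Dutch Book has a Czech Book counterpart, here is a toy sketch (mine, not the text's) using the payoff table from Section 9.1.1: reversing the direction of each bet simply negates every entry, turning the guaranteed loss into a guaranteed gain.

```python
# Payoff table for the Dutch Book against cr(P) = cr(~P) = 0.7 (rows: tickets; columns: P, ~P).
dutch_payoffs = [
    [0.30, -0.70],    # ticket on P, bought by the agent for $0.70
    [-0.70, 0.30],    # ticket on ~P, bought by the agent for $0.70
]

# Czech Book: same prices, but the agent sells each ticket instead of buying it.
czech_payoffs = [[-x for x in row] for row in dutch_payoffs]

dutch_totals = [round(sum(col), 2) for col in zip(*dutch_payoffs)]
czech_totals = [round(sum(col), 2) for col in zip(*czech_payoffs)]
print(dutch_totals)   # [-0.4, -0.4]: guaranteed loss
print(czech_totals)   # [0.4, 0.4]: guaranteed gain
```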
9.2.1 Dutch Books depragmatized
Recent authors have reformulated the Dutch Book Argument in response to two objections. First, like the Representation Theorem Argument of
Chapter 8, the Dutch Book Argument moves from a practical premise to a theoretical conclusion. The argument establishes that an agent with nonprobabilistic credences may behave in ways that are practically disadvantageous—buying and selling gambling tickets that together guarantee a loss. At best, then, the argument seems to establish that it is practically irrational to assign credences violating the probability axioms (or various other credal norms). Yet we wanted to establish these constraints on credence as requirements of theoretical rationality (see Chapter 1).

The distinction between requirements of practical and theoretical rationality may disappear if one understands doxastic attitudes purely in terms of their effects on action. de Finetti, for example, explored a position that defines an agent's credences in terms of her betting behavior:

Let us suppose that an individual is obliged to evaluate the rate p at which he would be ready to exchange the possession of an arbitrary sum S (positive or negative) dependent on the occurrence of a given event E, for the possession of the sum pS; we will say by definition that this number p is the measure of the degree of probability attributed by the individual considered to the event E, or, more simply, that p is the probability of E (according to the individual considered). (1937/1964, pp. 101–102)

Yet even if we take this definitional approach, a second objection to the Dutch Book Argument remains: As a practical matter, susceptibility to Book doesn't seem that significant. Few of us are surrounded by bookies ready to press gambles on us should we violate the probability axioms. If the Dutch Book Argument is supposed to talk us into probabilistic credences on the grounds that failing to be probabilistic will lead to bad practical consequences, those practical consequences had better be a fairly realistic threat.

From time to time one still hears an agent's credences defined as her fair betting prices, but (as we discussed in Chapter 8) this kind of behaviorism is increasingly unpopular. Instead, recent authors have tried to make out the Dutch Book Argument as establishing a true requirement of theoretical rationality. The idea is that despite the Dutch Book's pragmatic appearance, the bookie and his Books are merely a device for dramatizing an underlying doxastic inconsistency. These authors take their inspiration from the original passage in which Ramsey mentioned Dutch Books:

These are the laws of probability, which we have proved to be necessarily true of any consistent set of degrees of belief. . . . If
anyone's mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd. He could have a book made against him by a cunning better and would then stand to lose in any event. (1931, p. 84)
Interpreting this passage, Skyrms writes that "For Ramsey, the cunning bettor is a dramatic device and the possibility of a dutch book a striking symptom of a deeper incoherence." (1987a, p. 227) Since the bookie is only a device for revealing this deeper incoherence, it doesn't matter whether we are actually surrounded by bookies or not. (The fact that a child's condition would lead her to break 100°F on a thermometer indicates an underlying problem whether or not any thermometers are around.) As Brad Armendt puts it,
We should resist the temptation to think that a Dutch book argument demonstrates that the violations (violations of probability, for the synchronic argument) are bound to lead to dire outcomes for the unfortunate agent. The problem is not that violators are bound to suffer, it is that their action-guiding beliefs exhibit an inconsistency. That inconsistency can be vividly depicted by imagining the betting scenario, and what would befall the violators were they in it. The idea is that the irrationality lies in the inconsistency, when it is present; the inconsistency is portrayed in a dramatic fashion when it is linked to the willing acceptance of certain loss. The value of the drama lies not in the likelihood of its being enacted, but in the fact that it is made possible by the agent's own beliefs, rather than a harsh, brutal world. (1992, p. 218)
To argue that Dutch Book vulnerability reveals a deeper rational inconsistency, we start by relating credences to betting behavior in a more nuanced manner than de Finetti's. Howson and Urbach (2006), for instance, say that an agent who assigns credence cr(P) to proposition P won't necessarily purchase a $1 ticket on P at price $cr(P), but will regard such a purchase as fair. (This ties the credal attitude to another attitude—regarding as fair—rather than tying credences directly to behavior.) The Dutch Book Theorem then tells us that an agent with nonprobabilistic credences will regard each of a set of bets as fair that together guarantee a sure loss. Since such a set of bets is clearly unfair, a nonprobabilistic agent's degrees of belief are theoretically inconsistent because they regard as fair something that is guaranteed not to be.
Christensen (2004, Ch. 5) attenuates the connection between credences and betting rates even more. As a purely descriptive matter, an agent with particular degrees of belief may or may not regard any particular betting arrangement as fair (perhaps she makes a calculation error; perhaps she doesn't have any thoughts about betting arrangements; etc.). Nevertheless, Christensen argues for a normative link between credences and fair betting prices. If an agent assigns a particular degree of belief to P, that degree of belief sanctions as fair purchasing a ticket for $cr(P) that pays $1 on P; it justifies the agent's evaluating such a purchase as fair; and it makes it rational for the agent to purchase such a ticket at (up to) that price. Christensen then argues for probabilism from three premises:7
Depragmatized Dutch Book Argument for Probabilism
(Premise) An agent's degrees of belief sanction as fair monetary bets at odds matching her degrees of belief. (Christensen calls this premise "Sanctioning".)
(Premise) A set of bets that is logically guaranteed to leave an agent monetarily worse off is rationally defective. ("Bet Defectiveness")
(Premise) If an agent's beliefs sanction as fair each of a set of bets, and that set of bets is rationally defective, then the agent's beliefs are rationally defective. ("Belief Defectiveness")
(Theorem) If an agent's degrees of belief violate the probability axioms, there exists a set of bets at odds matching her degrees of belief that is logically guaranteed to leave her monetarily worse off.
(Conclusion) If an agent's degrees of belief violate the probability axioms, that agent's degrees of belief are rationally defective.
The theorem in this argument is, once more, the Dutch Book Theorem, and the argument's conclusion is a version of probabilism. Christensen assesses this kind of Dutch Book Argument ("DBA") as follows:
This distinctively non-pragmatic version of the DBA allows us to see why its force does not depend on the real possibility of being duped by clever bookies. It does not aim at showing that probabilistically incoherent degrees of belief are unwise to harbor for practical reasons. Nor does it locate the problem with probabilistically incoherent beliefs in some sort of preference inconsistency. Thus it does not need to identify, or define, degrees
of belief by the ideally associated bet evaluations. Instead, this DBA aims to show that probabilistically incoherent beliefs are rationally defective by showing that, in certain particularly revealing circumstances, they would provide justification for bets that are rationally defective in a particularly obvious way. The fact that the diagnosis can be made a priori indicates that the defect is not one of fitting the beliefs with the way the world happens to be: it is a defect internal to the agent's belief system. (2004, p. 121, emphasis in original)
9.3 Objections to Dutch Book Arguments
If we can construct both a Dutch Book Theorem and a Converse Dutch Book Theorem for a particular norm (probabilism, the Ratio Formula, updating by Conditionalization, etc.), then we have a Dutch Book Argument that rationality requires honoring that norm. I now want to review various objections to Dutch Book Arguments that have arisen over the years; these objections apply equally well to depragmatized versions of such arguments. It's worth beginning with a concern that is often overlooked yet has plagued the Dutch Book program from its start. Dutch Book Arguments assume that a rational agent's fair betting price for a bet that pays $1 on P is $cr(P). An author like de Finetti who identifies an agent's credence in P with the amount she's willing to pay for a ticket that yields $1 on P is free to make this move. But contemporary authors unwilling to grant that identification need some argument that these betting prices are rational. A simple argument comes from expected value calculations. As we saw in Chapter 7, Equation (7.3), an agent's expected monetary payout for a ticket that pays $1 on P is

$1 · cr(P) + $0 · cr(∼P) = $cr(P)     (9.1)
So an agent whose preferences are driven by expected value calculations will assign that ticket a fair betting price of $cr(P). (This calculation can be generalized to bets at other stakes.) But this argument for the fair betting prices we've been assuming takes as a premise that rational agents maximize expected utility. (Or expected monetary return—recall that we assumed for the duration of this chapter that agents assign constant marginal utility to money.) If we had that premise available, we could argue much more directly for probabilism via the Revised Representation Theorem of Chapter 8.
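To make the calculation concrete, here is a minimal Python sketch (the function name and sample numbers are mine, not the book's); it simply evaluates the expectation in Equation (9.1) and its generalization to other stakes.

```python
# Minimal sketch (not from the text): the expected dollar payout of a ticket
# that pays $stake if P is true, for an agent with credence cr(P), per Equation (9.1).

def fair_betting_price(stake, cr_P):
    """Expected monetary payout: stake * cr(P) + 0 * cr(~P)."""
    return stake * cr_P + 0 * (1 - cr_P)

print(fair_betting_price(1.0, 0.6))   # 0.6 -- a $1 ticket on P is worth $0.60 to her
print(fair_betting_price(5.0, 0.6))   # 3.0 -- the same reasoning at a larger stake
```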
At the beginning of Section 9.1 I tried to motivate the typical formula for fair betting prices without invoking expectations. I invoked intuitions about how an agent's fair betting price for a ticket should rise and fall as her credences and the stakes change. Unfortunately these intuitive motivations can't get us quite far enough. An agent could assign fair betting prices that rise and fall in the manner described without setting those fair betting prices equal to her credences. Recall Mr. Bold, who assigns to each proposition the square-root of the credence assigned by Mr. Prob. Mr. Prob's credences satisfy the probability axioms, while Mr. Bold's violate Finite Additivity. Now suppose that Mr. Bold sets his fair betting prices for various gambling tickets equal not to his credences, but instead to the square of his credences. Mr. Bold's fair betting prices (for tickets on contingent propositions) will still rise and fall in exactly the ways that intuition requires. In fact, he will be willing to buy or sell any gambling ticket at exactly the same prices as Mr. Prob. And since Mr. Prob isn't susceptible to various kinds of Dutch Book, Mr. Bold won't be either. In general, an agent who assigns nonprobabilistic credences may be able to avoid Book by assigning his betting prices in nonstandard fashion. Without a strong assumption about how rational agents set betting prices, the Dutch Book Argument cannot show that nonprobabilistic credences are irrational.8
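The following sketch (my own illustration, with made-up credence values) shows the point numerically: Mr. Bold's credences are nonprobabilistic, yet because he prices tickets at the square of his credences, his prices coincide with Mr. Prob's.

```python
import math

# Sketch (mine, not the author's): Mr. Bold assigns the square root of Mr. Prob's
# credence to each proposition, but sets his betting price to the *square* of his
# own credence, so his prices match Mr. Prob's exactly.

cr_prob = {"P": 0.25, "Q": 0.25, "P or Q": 0.5}            # hypothetical probabilistic credences
cr_bold = {X: math.sqrt(c) for X, c in cr_prob.items()}    # violates Finite Additivity

price_prob = {X: c for X, c in cr_prob.items()}            # price = credence
price_bold = {X: c ** 2 for X, c in cr_bold.items()}       # price = credence squared

for X in cr_prob:
    assert abs(price_prob[X] - price_bold[X]) < 1e-12      # identical betting prices
print(price_bold)   # same prices as Mr. Prob, so no Book can be run against Mr. Bold either
```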
9.3.1 The Package Principle
The objection just raised applies to any Dutch Book Argument, because it questions how fair betting prices are set for the bets within a Book. Another, more traditional objection applies only to Books involving more than one gambling ticket; for instance, it applies to the Dutch Book against Finite Additivity violators but not to the Books against Non-Negativity and Normality offenders. (As I keep saying, Finite Additivity is the most difficult of the three axioms to establish as a rational rule.) The traditional objection begins with interference effects that may be generated by placing a series of bets in succession. Interference effects occur when the initial bets in a series interfere with an agent's willingness to accept the remaining bets. While she might have accepted the remaining bets as fair had they been offered to her in isolation, the bets she's made already turn her against them. For example, the agent might have a personal policy never to tie up more than a certain total amount of money in gambles at one time. Or the third bet might be on the proposition "I will never make more than two bets in my life." More to the point, suppose we have an agent
whose credences violate the probability axioms; we carefully construct a set of bets guaranteeing a sure loss, each of which will be placed at odds matching her degree of belief in the relevant proposition. We offer these bets to her one at a time. There's no guarantee that placing the first few wagers won't interfere with the agent's willingness to accept the remainder. Besides the interference effects just mentioned, the agent might see her sure loss coming down the pike, and simply refuse to place any more bets past some point! Interference effects undermine the claim that any agent with nonprobabilistic credences can be trapped into placing a sure-loss set of bets. Interference effects are often introduced (as I've just done) by talking about placing bets with an agent one at a time. A Dutch Book defender might respond by suggesting that the bookie place his bets with the agent all at once—as a package deal. Yet the agent might still reject this package on the grounds that she doesn't like to tie up so much money in gambles, or that she can see a sure loss on the way. The sequential offering of the bets over time is ultimately irrelevant to the dialectic. A more promising response to interference effects points out how heavily they rely on the transactional pragmatics of betting. Depragmatized Dutch Book Arguments indict a nonprobabilistic agent on the grounds that her credences sanction a sure-loss set of bets; whether interference effects would impede her actually placing those bets is neither here nor there. Yet there's a problem in the vicinity even for depragmatized arguments. Howson and Urbach's and Christensen's arguments contend that Dutch Bookability reveals an underlying doxastic inconsistency. What's the nature of that inconsistency? Earlier we saw Ramsey suggesting of the probability rules that, "If anyone's mental condition violated these laws, his choice would depend on the precise form in which the options were offered him, which would be absurd." From this suggestion, Skyrms takes the principle that for a rational agent, "A betting arrangement gets the same expected utility no matter how described." (1987a, p. 230) Similarly, Joyce writes that an agent's nonprobabilism "leads her to commit both the prudential sin of squandering happiness and the epistemic sin of valuing prospects differently depending on how they happen to be described." (1998, p. 96) The rational inconsistency revealed by Dutch Bookability seems to be that the agent evaluates one and the same entity differently depending on how it is presented. Skyrms calls the entity being evaluated a "betting arrangement". To illustrate how one and the same betting arrangement might be presented in two different ways, let's return to our Dutch Book against a Finite Additivity violator (Section 9.1.1). That Book begins by
selling the agent a ticket that pays $1 if P is true and another ticket that pays $1 on Q (call these the "P-ticket" and the "Q-ticket", respectively). The agent assigns cr(P) = cr(Q) = 0.5, so she will buy these tickets for $0.50 each. At that point the agent has purchased a package consisting of
two tickets: This ticket entitles the bearer to $1 if P is true, and nothing otherwise.
This ticket entitles the bearer to $1 if Q is true, and nothing otherwise.
Call this the "P∨Q-package". We assume that since the agent is willing to pay $0.50 for each of the two tickets in the package, she will pay $1 for the package as a whole. In the next step of the Finite Additivity Dutch Book, we buy the following ticket from the agent (which we'll call the "P∨Q-ticket"):
This ticket entitles the bearer to $1 if P ∨ Q is true, and nothing otherwise.
Our agent assigns cr(P ∨ Q) = 0.8, so she sells us this ticket for $0.80. Now compare the P∨Q-package with the P∨Q-ticket, and keep in mind that in this example P and Q are mutually exclusive. If either P or Q turns out to be true, the P∨Q-package and the P∨Q-ticket will each pay exactly $1. Similarly, each one pays $0 if neither P nor Q is true. So the P∨Q-package and the P∨Q-ticket have identical payoff profiles in every possible world. This is the sense in which they represent the same "betting arrangement". When we offer the agent that betting arrangement as a package of two bets on atomic propositions, she values the arrangement at $1. When we offer that arrangement as a single bet on a disjunction, she values it at $0.80. She values the same thing—the same betting arrangement—differently under these two presentations. If she's willing to place bets based on those evaluations, we can use them to take money from her. (We sell the arrangement to her in the form she prices expensively, then buy it back in the form she'll part with for cheap.) But even if the agent won't actually place the bets, the discrepancy in her evaluations reveals a rational flaw in her underlying credences. The general idea is that any Dutch Book containing multiple bets reveals a violation of
Extensional Equivalence: If two betting arrangements have the same payoff in every possible world, a rational agent will value them equally.
I certainly won't question Extensional Equivalence. But the argument above sneaks in another assumption as well. How did we decide that our agent valued the P∨Q-package at $1? We assumed that since she was willing to pay $0.50 for the P-ticket on its own and $0.50 for the Q-ticket as well, she'd pay $1 for these two tickets bundled together as a package. We assumed the
Package Principle: A rational agent's value for a package of bets equals the sum of her values for the individual bets it contains.
Our argument needed the Package Principle to get going (as does every Dutch Book consisting of more than one bet). We wanted to indict the set of credences our agent assigns to P, Q, and P ∨ Q. But bets based on those individual propositions would not invoke Extensional Equivalence, because no two such bets have identical payoffs in every possible world. So we
combined the P- and Q-tickets into the P∨Q-package, a betting arrangement extensionally equivalent to the P∨Q-ticket. We then needed a value for the P∨Q-package, a new object not immediately tied to any of the agent's credences. So we applied the Package Principle.9 Is it legitimate to assume the Package Principle in arguing for Finite Additivity? I worry that we face a Linearity In, Linearity Out problem again. In order to get a Dutch Book Argument for Finite Additivity, we need to assume that a rational agent values a package of bets on mutually exclusive propositions at the sum of her values for bets on the individual propositions. Schick (1986, p. 113) calls this "the unspoken assumption. . . of value additivity" in Dutch Book Arguments; it seems to do exactly for bet valuations what Finite Additivity does for credences.10 And without an independent argument for this Package Principle, the Dutch Book Argument for probabilism cannot succeed.11
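A quick numerical check of the argument above (a sketch of mine; the helper functions are hypothetical, not from the text): the two-ticket package and the single disjunction ticket pay the same amount in every world the agent entertains, yet she prices them differently once the Package Principle is assumed for the package.

```python
# Sketch (my own illustration): P and Q are mutually exclusive, so the agent's
# live possibilities are P-only, Q-only, and neither.
worlds = [(True, False), (False, True), (False, False)]   # (P, Q)

def package_payoff(P, Q):
    return (1 if P else 0) + (1 if Q else 0)   # $1 ticket on P plus $1 ticket on Q

def disjunction_payoff(P, Q):
    return 1 if (P or Q) else 0                # single $1 ticket on P or Q

for P, Q in worlds:
    assert package_payoff(P, Q) == disjunction_payoff(P, Q)   # identical payoff profiles

package_price = 0.50 + 0.50   # Package Principle: sum of her two ticket prices
ticket_price = 0.80           # her credence cr(P or Q) = 0.8
print(package_price - ticket_price)   # 0.2 -- the guaranteed profit the bookie can extract
```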
9.3.2 Dutch Strategy objections
The first objection discussed in this section—concerning fair betting prices—applies to any Dutch Book. The Package Principle objection applies to any Book containing multiple bets. But even beyond those objections, special problems arise for Dutch arguments involving credences at multiple times. I will focus here on Lewis's Dutch Strategy Argument for Conditionalization;
similar points apply to Strategies supporting Jeffrey Conditionalization and other potential diachronic norms. To get the concern going, we need to focus on an aspect of Dutch Books and Strategies not yet remarked upon. I keep saying that a Dutch Book guarantees the agent will lose money in every possible world. What set of possible worlds am I talking about? It can't be the set of logically possible worlds; after all, there are logically possible worlds in which no bets are ever placed. When we say that a Dutch Book guarantees the agent a sure loss in every world, we usually mean something like the agent's doxastically possible worlds—the worlds she entertains as a live option. It makes sense to construct Dutch Books around worlds the agent considers possible. Dutch Book susceptibility is supposed to be a rational flaw, and rationality concerns how things look from the agent's own point of view. Imagine a bookie sells you for $0.50 a bet that pays $1 if a particular fair coin flip comes up heads. The bookie then claims he has Dutch Booked you, because he's already seen that coin flip and it came up tails! Your willingness to purchase that bet didn't reveal any rational flaw in your credences. Admittedly there's some sense in which the bookie sold you a bet that's a loser in every live possibility. But that's a sense of "live possibility" to which you didn't have access when you placed the bet; relative to your information the bet wasn't a sure loss. To fix our attention on Dutch Books or Strategies that may reveal rational flaws, we usually require them to generate a sure loss across the agent's entire space of doxastically possible worlds. A convenient way to do this is to stipulate that the bookie in a Dutch Book or Strategy must be capable of constructing the Book or implementing the Strategy without relying on any contingent information the agent lacks. With that in mind, let's return to Lewis's Dutch Strategy against Conditionalization violators. Here's the particular set of bets we used in Section 9.1.2 (entries are the agent's net payoffs in dollars):

                       P & Q    ∼P & Q     ∼Q
Ticket 1               −0.75     0.25      0.25
Ticket 2                0.30     0.30     −0.30
Ticket if Q learned     0.40    −0.60      0
TOTAL                  −0.05    −0.05     −0.05
This Strategy was constructed against an agent who assigns equal unconditional credence to each of the four P/Q state-descriptions at ti, assigns cri(P | Q) = 0.5, yet assigns crj(P) = 0.6 if she learns that Q between ti and tj.
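The arithmetic behind this Strategy can be verified directly. In the sketch below (mine, not the book's), the dollar amounts come from the table above; the verbal descriptions of the individual tickets are reconstructions of one way to realize those payoffs, since Section 9.1.2 itself is not reproduced here.

```python
# Sketch (payoffs taken from the table; ticket descriptions are my reconstruction).
outcomes = ["P & Q", "~P & Q", "~Q"]

# Ticket 1: the bookie buys from the agent, for $0.25, a ticket paying $1 on P & Q.
ticket1 = {"P & Q": 0.25 - 1.00, "~P & Q": 0.25, "~Q": 0.25}
# Ticket 2: the bookie buys from the agent, for $0.30, a ticket paying $0.60 on ~Q.
ticket2 = {"P & Q": 0.30, "~P & Q": 0.30, "~Q": 0.30 - 0.60}
# Ticket placed only if Q is learned: the agent buys, for $0.60, a ticket paying $1 on P.
ticket3 = {"P & Q": 1.00 - 0.60, "~P & Q": -0.60, "~Q": 0.00}

for w in outcomes:
    total = ticket1[w] + ticket2[w] + ticket3[w]
    print(w, round(total, 2))   # -0.05 in every outcome: a guaranteed loss for the agent
```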
At a first pass, this Strategy seems to meet our requirement that the bookie never need know more than the agent. Tickets 1 and 2 are purchased from the agent at ti using betting prices set by her ti credences. The third ticket is sold to the agent at tj only if she learns Q between ti and tj. But by tj the agent (and the bookie) know whether she has learned Q, so the bookie
needn't know more than the agent to decide whether to sell that ticket. Yet matters turn out to be more subtle than that. To see why, I'd suggest that the reader go off and construct a Dutch Strategy against an agent who assigns the same ti credences as in our example but assigns crj(P) = 0.4 if she learns that Q between ti and tj. Obviously the change in crj(P) value changes the bet made if Q is learned—that bet must be keyed to the agent's fair betting prices at the later time. More interestingly, though, you'll find that while the bets placed at ti have much the same structure as Ticket 1 and Ticket 2, the bookie needs to sell them at ti rather than buying them in order to guarantee a sure loss. Now imagine a bookie is confronted at ti by an agent who assigns equal credence to all four P/Q state-descriptions and satisfies the probability axioms and Ratio Formula. The bookie wants to initiate a Dutch Strategy
that will cost the agent money should she fail to update by Conditionalization at tj. But the bookie doesn't know which Strategy to pursue: the Strategy against agents who assign crj(P) < 0.5, or the Strategy against crj(P) > 0.5. These Strategies require the bookie to take different sides on his ti bets, so in order to pursue a course that will definitely cost the agent, the bookie must know at ti what the agent's credences will be at tj. Given our stipulation that the bookie knows only what the agent does, this means that a Dutch Strategy can be constructed against an agent who violates Conditionalization only if that agent knows in advance exactly how she'll be violating it. How might an agent know in advance what credences she'll assign in the future? One possibility is if the agent has a standing policy, or plan, for updating in response to evidence. If that plan is at variance with the recommendations of Conditionalization, the agent (and bookie) will be able to tell at ti that she'll violate Conditionalization at tj. Moreover, if the agent knows at ti that, say, her updating plan would lead her to assign crj(P) = 0.6 in light of Q, the bookie can take advantage of this information to set up a Dutch Strategy by placing the appropriate bets at ti. But now we have to ask whether the resulting Dutch Strategy is really a defense of Conditionalization. Conditionalization is a diachronic norm; to violate Conditionalization is to assign one set of credences at ti and then a mismatched set of credences at tj. Yet our Dutch Strategy seems to focus
on a mismatch at ti—a mismatch between the agent's ti credences and her plans at ti for updating going forward. An agent wise to the ways of the Dutch will be able to see at ti that her updating plans at that time will, if implemented, combine with her current credences to guarantee a sure loss. van Fraassen writes,
Let us emphasize especially that these features are demonstrable beforehand, without appeal to any but logical considerations, and the strategy's implementation requires no information inaccessible to the agent himself. The general conclusion must be that an agent vulnerable to such a Dutch strategy has an initial state of opinion and practice of changing his opinion, which together constitute a demonstrably bad guide to life. (1984, p. 240, emphasis in original)
Notice that it's the initial state of opinion plus the initial practice of changing opinion that together constitute a bad guide. The Dutch Strategy targets a synchronic inconsistency between stances adopted at ti, not the kind of diachronic inconsistency Conditionalization concerns. The point can be dramatized as follows: Consider an agent who plans at ti to conditionalize, but when tj comes around actually violates Conditionalization. No Dutch Strategy can be implemented against such an agent; since the bookie won't at ti know the details of the violation, he won't be able to place the requisite ti bets. On the other hand, consider an agent who plans at ti not to conditionalize, yet nevertheless winds up doing so. At ti the bookie will place various bets with her, and she will anticipate assigning tj credences that allow the bookie to complete a sure-loss contract. Admittedly, when tj comes around and the agent violates her own updating plan, she will no longer be willing to accept the bets she thought she would at ti. But at ti the agent doesn't know this is going to happen; from her point of view at ti it looks like she has guaranteed a loss. So the agent still seems to be doing something irrational at ti. To sum up: An agent who plans to conditionalize but doesn't is not susceptible to a Dutch Strategy, while an agent who doesn't plan to conditionalize but does can still be accused of irrationality by Dutch Strategy means. If Dutch Strategies reveal any kind of rational tension, it seems to be one that exists not between an agent's ti and tj credences, but instead between her ti credences and her ti plans for updating going forward. Perhaps we could salvage the Dutch Strategy defense of Conditionalization with the help of another rational rule: Rational updates faithfully implement the agent's plans from earlier times. With this rule in place, any
agent who fails to update by Conditionalization must have made a rational mistake somewhere: either her failure to conditionalize was the faithful implementation of an earlier anti-Conditionalization plan (in which case a Dutch Strategy can be executed against her); or she planned to conditionalize but failed to properly implement that plan (running afoul of our new rational rule). Notice that this argument works only if we assume that rational agents always have updating plans; an agent who doesn't plan in advance how she's going to update can't be caught on either horn of the previous dilemma. But even if we suppose that rational agents always have updating plans, there's a deeper problem with the proposed approach: How do we argue that rationality requires agents to honor their earlier commitments at later times? Such a requirement cannot be established by Dutch Strategy means. To see why, consider an analogous situation: Suppose my sister and I each have credences satisfying the probability axioms and Ratio Formula, but I assign cr(P) = 0.7 while she assigns cr(∼P) = 0.7. A clever bookie could place bets with each of us that together guaranteed him a sure profit. But that doesn't reveal a rational fault in either my credences or my sister's, nor
does it reveal anything rationally wrong with our having differing opinions. What if an agent at tj takes her credences to stand in the same relation to her ti assignments that I take my credences to stand in to my sister's? What if the tj agent doesn't see anything rationally pressing about lining up her credences with the credences or plans she constructed at ti? A bookie may have placed a set of bets with the agent's ti and tj selves that together guarantee a sure loss. But the tj agent will find that no more impressive than the sure-loss contract constructible against me and my sister. If the tj agent doesn't think there's any antecedent rational pressure for her to coordinate her current attitudes with those of her ti self, the fact that their combined activities result in a loss will be of little normative interest to her. A Dutch Strategy may establish that it's rational for an agent to plan to Conditionalize. But it cannot establish the diachronic principle that an agent is rationally required at a later time to do what she planned earlier. There may of course be other arguments for such a diachronic rational requirement, but they must be independently established before a Dutch Strategy can have any bite. As Christensen puts it,
Without some independent reason for thinking that an agent's present beliefs must cohere with her future beliefs, her potential vulnerability to the Dutch strategy provides no support at all for [conditionalization]. (1991, p. 246)
While a Dutch Strategy Argument may fill out the details of rational updating norms should any exist, it is ill-suited to establish the existence of diachronic rational requirements in the first place.12
9.4 Exercises
Problem 9.1. In Section 9.1.1 we constructed a Dutch Book against an agent like Mr. Bold whose credences are subadditive. Now construct a Dutch Book against an agent whose credences are superadditive: for mutually exclusive P and Q, he assigns cr(P) = 0.3, cr(Q) = 0.3, but cr(P ∨ Q) = 0.8. Describe the bets composing your Book, say why the agent will find each one acceptable, and show that the bets guarantee him a loss in every possible world.
Problem 9.2. Roxanne's credence distribution at a particular time includes the following values:
cr(A & B) = 0.5     cr(A) = 0.1     cr(B) = 0.5     cr(A ∨ B) = 0.8
(a) Show that Roxanne's distribution violates the probability axioms.
(b) Construct a Dutch Book against Roxanne's credences. Lay out the bets involved, then show that those bets actually constitute a Dutch Book against Roxanne. Note: The Book must be constructed using only the credences described above; since Roxanne is non-probabilistic you may not assume anything about the other credences she assigns. However, the Book need not take advantage of all four credences.
(c) Construct a Czech Book "against" Roxanne. Lay out the bets involved and show that they guarantee her a profit in every possible world.
(d) Does the success of your Dutch Book against Roxanne require her to satisfy the Package Principle? Explain.
Problem 9.3. You are currently certain that you are not the best singer in the world. You also currently satisfy the probability axioms and the Ratio Formula. Yet you assign credence 0.5 that you will go to a karaoke bar tonight, and while under the influence of cheap beer and persuasive friends will be certain that you are the best singer in the world. Suppose a bookie offers you the following two betting tickets right now:
This ticket entitles you to $20 if you go to the bar, and nothing otherwise.
If you go to the bar, this ticket entitles you to $40 if you are not the world's best singer, and nothing if you are. If you don't go to the bar, this ticket may be returned to the seller for a full refund of its purchase price.
(a) Suppose that right now, a bookie offers to sell you the first ticket above for $10 and the second ticket for $30. Explain why, given your current credences, you will be willing to buy the two tickets at those prices. (Remember that the second ticket involves a conditional bet, so its fair betting price is determined by your current conditional credences.)
(b) Describe a Dutch Strategy the bookie can plan against you right now. In particular, describe a third bet that he can plan to place with you later tonight only if you're at the bar, such that he's guaranteed to make a net profit from you come what may. Be sure to explain why you'll be willing to accept that third bet later on, and how it creates a Dutch Strategy against you.∗
Problem 9.4. Do you think there is any kind of Dutch Book that reveals a rational flaw in an agent's attitudes? If so, say why, and say which kinds of Dutch Books you take to be revealing. If not, explain why not.
9.5 Further reading
Introductions and Overviews
Susan Vineberg (2011). Dutch Book Arguments. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer 2011
Covers all the topics discussed in this chapter in much greater depth, with extensive citations.
∗ I owe this entire problem to Sarah Moss.
Classic Texts
Frank P. Ramsey (1931). Truth and Probability. In: The Foundations of Mathematics and Other Logical Essays. Ed. by R. B. Braithwaite. New York: Harcourt, Brace and Company, pp. 156–198
Bruno de Finetti (1937/1964). Foresight: Its Logical Laws, its Subjective Sources. In: Studies in Subjective Probability. Ed. by Henry E. Kyburg Jr and H. E. Smokler. Originally published as "La prévision; ses lois logiques, ses sources subjectives" in Annales de l'Institut Henri Poincaré, Volume 7, 1–68. New York: Wiley, pp. 94–158
On p. 84, Ramsey notes that any agent whose degrees of belief violated the laws of probability "could have a book made against him by a cunning better and would then stand to lose in any event." de Finetti goes on to prove it.
Frederic Schick (1986). Dutch Bookies and Money Pumps. The Journal of Philosophy 83, pp. 112–9
Compares Dutch Book and money pump arguments, then offers a Package Principle objection to each.
Paul Teller (1973). Conditionalization and Observation. Synthese 26, pp. 218–258
First presentation of Lewis's Dutch Strategy Argument for Conditionalization; also contains a number of other interesting arguments for Conditionalization.
Extended Discussion
David Christensen (2001). Preference-Based Arguments for Probabilism. Philosophy of Science 68, pp. 356–76
Presents depragmatized versions of both the Representation Theorem and Dutch Book Arguments for probabilism, then responds to objections.
Notes
1 Or something very close to them—the agent may assign a positive, real numerical credence other than 1 to tautologies.
2 This formula is easy to derive if we assume that the agent selects her fair betting prices so as to maximize expected dollar return, as we did in Section 7.1. Yet I've been scrupulously avoiding making that assumption here, for reasons I'll explain in Section 9.3.
3 A "book" is a common term for a bet placed with a "bookmaker" (or "bookie"), but why Dutch? While Hacking (2001, p. 169) traces the term back to Ramsey, and suggests it may have been English betting slang of Ramsey's day, I haven't been able to find "Dutch" terminology in Ramsey's text. So the origins remain a mystery (at least to me). Meanwhile Hacking avoids the suggestion of an ethnic slur by speaking in terms of "sure-loss contracts". But "Dutch Book" is so ubiquitous in the Bayesian literature that it would only be confusing to avoid it here; I will read it simply as a testament to the craftiness and probabilistic acumen of the Dutch people.
4 Lewis didn't publish this argument against non-Conditionalizers. Instead it was reported by Teller (1973), who attributed the innovation to Lewis.
5 Notice that the ticket sold to the agent at tj does not constitute a conditional bet. It's a normal ticket paying out $1 on P no matter what, and we set the agent's fair betting price for this ticket using her unconditional credences at tj. It's just that we decide whether or not to sell her this (normal, unconditional) ticket on the basis of what she learns between ti and tj.
Notice also that for purposes of this example we're assuming that the agent learns Q between ti and tj just in case Q is true.
6 The phrase "Books of the sort in Section 9.1.1" is unclear as stated; Lehman and Kemeny each specify precisely what kinds of betting packages their results concern. See also an early, limited Converse result in (de Finetti 1937/1964, p. 104).
7 I have slightly altered Christensen's premises to remove his references to a "simple agent". Christensen uses the simple agent to avoid worries about the declining marginal utility of money and about the way winning one bet may alter the value of a second's payout. Inserting the simple agent references back into the argument would not protect it from the objections I raise in the next section.
8 All this should sound very reminiscent of my criticisms of Representation Theorem Arguments in Chapter 8. Mathematically, the proof of the Revised Representation Theorem from that chapter is very similar to standard proofs of the Dutch Book Theorem.
9 A similar move is hidden in Christensen's "Belief Defectiveness" principle. The principle says that if an agent's degrees of belief sanction as fair each bet in a set of bets, and that set of bets is rationally defective, then the agent's beliefs are rationally defective. The intuitive idea is that beliefs which sanction something defective are themselves defective. Yet without the Package Principle, an agent's beliefs might sanction as fair each of the bets in a particular set without sanctioning the entire set as fair. And it's the entire set of bets that is the rationally defective object—it's the set that guarantees the agent a sure loss.
10 Schick contends that money pump arguments for the Preference Axioms (Section 7.2.1) also assume something like the Package Principle.
11 If we assume that rational agents maximize expected utility, we can generate straightforward arguments for both Extensional Equivalence and the Package Principle. But again, if we are allowed to make that assumption then probabilism already follows quickly.
(In fact, Extensional Equivalence is a key lemma in proving the Revised Representation Theorem.)
12 For what it's worth, one could make a similar point about Dutch Book Arguments for synchronic norms relating distinct degrees of belief. To dramatize the point, imagine that an agent's propositional attitudes were in fact little homunculi, each assigned its own proposition to tend to and adopt a degree of belief towards. If we demonstrated to one such homunculus that combining his assignments with those of other homunculi would generate a sure loss, he might very well not care. The point of this fanciful scenario is that while Dutch Books may fill out the details of rational relations among an agent's degrees of belief at a given time, they are ill-suited to establish that rationality requires such synchronic relations in the first place. Absent an antecedent rational pressure to coordinate attitudes adopted at the same time, the fact that such attitudes could be combined into a sure loss would be of little normative interest. No one ever comments on this point about Dutch Books because we all assume as part of the background that contemporaneous degrees of belief stand in important rational relations to each other.
Chapter 10
Accuracy Arguments
The previous two chapters considered arguments for probabilism based on Representation Theorems and Dutch Books. We criticized both types of argument on the grounds that they begin with premises about practical rationality—premises that restrict a rational agent's attitudes towards acts, or towards sets of bets. Probabilism hopes to establish that the probability axioms are requirements of theoretical rationality on an agent's credences, and it's difficult to see how one could move from practical premises to such a theoretical conclusion. This chapter builds arguments for probabilism from explicitly epistemic premises. The basic idea is that, as a type of representational attitude, credences can be assessed for accuracy. We are used to assessing other doxastic attitudes, such as binary beliefs, in terms of their accuracy. A belief in the proposition P is accurate if P is true; disbelief in P is accurate if P is false. A traditional argument moves from such accuracy assessments to a rational requirement that agents' belief sets be logically consistent (Chapter 1's Belief Consistency norm). The argument begins by noting that if a set of propositions is logically inconsistent, there is no (logically) possible world in which all those propositions are true. (That's our definition of logical inconsistency.) So if an agent's beliefs are logically inconsistent, she's in a position to know that at least some of them are inaccurate. Moreover, she can know this a priori—without invoking any contingent truths. Since an inconsistent set contains falsehoods in every possible world, no matter which world is actual her inconsistent belief set misrepresents how things are.1
There are plenty of potential flaws in this argument—starting with its assumption that beliefs have a teleological "aim" of being accurate. I present the argument here because it offers a good template for the arguments for
probabilism to be discussed in this chapter. Whatever concerns you have about the Belief Consistency argument above, keep them in mind as you consider accuracy arguments for probabilism. To argue about credences on accuracy-based grounds, we need some way of assessing credal accuracy. This presents a puzzle: a credence of, say, 0.6 in proposition P doesn't really say that P is true, but neither does it say that P is false. So we can't assess the accuracy of this credence by asking whether the truth-value it assigns to P exactly matches P's truth-value in the world. Nor can we say that cr(P) = 0.6 is accurate just in case P is true "to degree 0.6"; we've assumed that propositions are wholly true or wholly false, full-stop. So just as we moved from classificatory to quantitative doxastic attitudes in Chapter 1, we will move from a classificatory to a quantitative concept of accuracy. We will consider various numerical measures of just how accurate a particular credence (or set of credences) is. We'll begin with historical "calibration" approaches that measure credal accuracy by comparing credences to frequencies. Then we'll reject calibration in favor of the more contemporary "gradational accuracy" approach. The most commonly-used gradational accuracy measure is known as the
Brier score. Using the Brier score we will construct an argument for probabilism similar to the Belief Consistency argument above: violating the probability axioms damages a set of credences' accuracy in every possible world. This argument turns out to be available not just for the Brier score, but for all the gradational accuracy measures in a class known as the "strictly proper scoring rules". Our question will then become why the strictly proper scoring rules are superior to other accuracy-measurement options, especially options that rule out probabilism. The spectre will arise once more that our argument for probabilism is question-begging by virtue of a Linearity-In, Linearity-Out construction. This will lead us to ask something you may have wondered over the last couple of chapters: How important is it, really, that rational credences satisfy Finite Additivity, as opposed to related norms with similar consequences for thought and behavior? Besides arguing for probabilism, Bayesian epistemologists have offered accuracy-based arguments for other norms such as the Principal Principle (Pettigrew 2013a), the Principle of Indifference (Pettigrew 2014), Reflection (Easwaran 2013), and Conglomerability (Easwaran 2013). We'll close this chapter with an argument for Conditionalization based on minimizing expected future inaccuracy. Unfortunately this argument has the same drawback as Dutch Strategy arguments for Conditionalization; it ultimately fails to establish any truly diachronic norms.
10.1 Accuracy as calibration
In Section 5.2.1 we briefly considered a putative rational principle for matching one's credence that a particular outcome will occur to the frequency with which that outcome occurs. In that context, the match was supposed to be between one's credence that outcome B will occur and the frequency with which one's evidence suggests B occurs. But we might instead assess an agent's credences relative to actual frequencies in the world: If events of type A actually produce outcomes of type B with frequency x, an agent's credence that a particular A-event will produce a B-outcome is more accurate the closer it is to x. Now imagine that an agent managed to be perfectly accurate with respect to the actual frequencies. In that case, she would assign credence 2/3 to outcomes that occurred 2/3 of the time, credence 1/2 to outcomes that occurred 1/2 of the time, etc. Or—flipping this around—propositions to which she assigned credence 2/3 would turn out to be true 2/3 of the time, propositions to which she assigned credence 1/2 would turn out to be true 1/2 of the time, etc. This approach to accuracy—getting the frequencies right, as it were—generates the notion of
Calibration: A credence distribution over a finite set of propositions is perfectly calibrated when, for any x, the set of propositions to which the distribution assigns credence x contains exactly fraction x of truths.
For example, suppose your weather forecaster comes on television every night and reports her degree of confidence that it will snow the next day. You might notice that every time she says she's 20% confident of snow, it snows the next day. In that case she's not a very accurate forecaster. But if it snows on just about 20% of those days, we'd say she's doing her job well. If exactly 20% of the days on which she's 20% confident of snow turn out to have snow (and exactly 30% of the days on which she's 30% confident. . . etc.), we say the forecaster is perfectly calibrated. Calibration seems like an intuitive way to gauge accuracy.2 I've defined only what it means to be perfectly calibrated; there are also numerical measures for assessing degrees of calibration short of perfection (see (Murphy 1973)).3 But all the good and bad features of accuracy as calibration can be understood by thinking solely about perfect calibration. First, the good: van Fraassen (1983) and Abner Shimony (1988) both argued for probabilism by showing that in order for a credence distribution to be embeddable in larger and larger systems with calibration scores approaching
perfection, that credence distribution must satisfy the probability axioms. This seems to be a powerful argument for probabilism—if we're on board with calibration as a measure of accuracy. Here's why we might not be. Consider two agents, Sam and Diane, who assign the following credence distributions over propositions X1 through X4:

          X1     X2     X3     X4
Sam       1/2    1/2    1/2    1/2
Diane     1      1      1/10   0
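Using the truth-values the text assumes just below (X1 and X2 true, X3 and X4 false), a short sketch (mine; the helper function is hypothetical) confirms that Sam counts as perfectly calibrated while Diane does not.

```python
from collections import defaultdict

# Sketch (mine): perfect calibration check for the table above, with X1, X2 true
# and X3, X4 false as the text stipulates below.
truth = {"X1": True, "X2": True, "X3": False, "X4": False}
sam   = {"X1": 0.5, "X2": 0.5, "X3": 0.5, "X4": 0.5}
diane = {"X1": 1.0, "X2": 1.0, "X3": 0.1, "X4": 0.0}

def perfectly_calibrated(cr):
    groups = defaultdict(list)
    for prop, x in cr.items():
        groups[x].append(truth[prop])
    return all(sum(vals) / len(vals) == x for x, vals in groups.items())

print(perfectly_calibrated(sam))    # True  -- yet Sam is far from the truth on every proposition
print(perfectly_calibrated(diane))  # False -- her lone 0.1 credence is assigned to a falsehood
```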
Now suppose that propositions X1 and X2 are true, while X3 and X4 are false. Look at the table and ask yourself whose credences intuitively seem more accurate.4 I take it the answer is Diane. Yet Sam's credences are perfectly calibrated—he assigns credence 1/2 to all four propositions, exactly half of which are true—while Diane's credences are not. This is an intuitive flaw with measuring accuracy by calibration. A similar point can be made by considering the following (real life!) example: On the morning of February 1, 2015, I looked outside and found it
was snowing heavily. At least four inches had accumulated during the night, the snow was still coming down, and it showed no signs of stopping. The online weather report on my smartphone, though, showed an at-the-moment 90% probability of snow. Why hadn't the forecaster simply looked out her window and updated the report to 100%? I was suddenly struck by a possible explanation. Let's imagine (what's probably not true) that the forecaster posts to the online weather report her current credence that it will snow on the current day. Suppose also that weather forecasting sites are graded for accuracy, and promoted on search engines based on how well they score. Finally, suppose this accuracy scoring is done by measuring calibration. What if, up to February 1, it had snowed every time the forecaster reported a 100% credence, but it had snowed on only 8 of the 9 occasions on which she had expressed a 90% credence? The snow on February 1 would then present her with an opportunity. She could report her true, 100% confidence in snow for February 1 on the website. Or she could post a 90% probability of snow. Given that it was clearly snowing on February 1, this would bring her up to a perfect calibration score, and shoot her website to the top of the search rankings. Calibration gives the forecaster an incentive to misreport her own credences—and the content of her own evidence. Calibration is one example of a scoring rule: a procedure for rating distributions with respect to accuracy. James M. Joyce reports that "the
term 'scoring rule' comes from economics, where values of [such rules] are seen as imposing penalties for making inaccurate probabilistic predictions." (2009, p. 266) Done right, the imposition of such penalties can be a good way of finding out what experts really think—what's known as credence elicitation. If you reward (or punish) an expert according to the accuracy of her reports, you incentivize her to gather the best evidence she can, consider it carefully, and then report to you her genuine conclusions. Seen through this lens of credence elicitation, calibration fails as a scoring rule. As we've just seen, a forecaster being rewarded according to her level of calibration may be incentivized to misreport her true opinions, and what she takes to be the import of her evidence. Yet perhaps it's unfair to criticize calibration on the grounds that it perversely incentivizes credence reports—norms for assertion can be messy, and anyway probabilism is a norm on agents' thoughts, not their words. So let's consider calibration as a direct accuracy measure of our forecaster's credences. Prior to February 1 it has snowed whenever the forecaster was certain of snow, but of the days on which she had a 0.9 credence in snow it has snowed 8 times. Looking out her window and seeing snow, the forecaster5
assigns credence 1 to snow. Yet if her goal is to be as accurate as possible with her credences, and if accuracy is truly measured by calibration, the forecaster will wish that her credence in snow was 0.9. After all, that would make her perfectly calibrated! Assessing the forecaster's credences according to calibration makes those credences unstable. By the forecaster's own lights—given the credences she has formed in light of her evidence—she thinks it would be better if she had different credences. Such instability is an undesirable feature in a credence distribution, and is generally thought to be a hallmark of irrationality. David Lewis offers the following analogy:
It is as if Consumer Bulletin were to advise you that Consumer Reports was a best buy whereas Consumer Bulletin itself was not acceptable; you could not possibly trust Consumer Bulletin completely thereafter. (1971, p. 56)
If we use calibration to measure accuracy, the weather forecaster's credence distribution based on what she sees out the window becomes unstable. Such instability is a sign of irrationality. So from a calibration point of view, there's something rationally wrong with the forecaster's credence. But in reality there's nothing wrong with the forecaster's credences—they are a perfectly rational response to her evidence! The problem lies with calibration as
a measure of accuracy; it makes some credence distributions look rationally suboptimal that are in fact perfectly permissible (if not required!).
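Here is a minimal sketch of the forecaster's incentive (my own illustration; the day counts other than the 8-of-9 record are invented): misreporting 0.9 on February 1 yields perfect calibration, while reporting her true credence of 1 does not.

```python
from collections import defaultdict

# Sketch (mine): her record before February 1 -- snow on every day given credence 1.0,
# and on 8 of the 9 days given credence 0.9. The count of 1.0-days (12) is arbitrary.
record = [(1.0, True)] * 12 + [(0.9, True)] * 8 + [(0.9, False)]

def perfectly_calibrated(days):
    groups = defaultdict(list)
    for credence, snowed in days:
        groups[credence].append(snowed)
    return all(sum(v) / len(v) == x for x, v in groups.items())

# February 1: it is clearly snowing. Compare reporting her true credence (1.0)
# with misreporting 0.9:
print(perfectly_calibrated(record + [(1.0, True)]))   # False: the 0.9 group stays at 8/9
print(perfectly_calibrated(record + [(0.9, True)]))   # True: the 0.9 group becomes 9/10
```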
This is just one way in which calibration can reward an agent for ignoring her evidence. To see another, notice that any agent assigning credences over a partition of n propositions can guarantee herself a perfect calibration score by assigning each proposition a credence of 1/n. For instance, if a six-sided die is to be rolled, an agent can guarantee herself perfect calibration (no matter how the roll comes out!) by assigning each possible outcome a credence of 1/6. Depending on how you feel about the Principle of Indifference (Section 5.3), this might be a reasonable assignment when the agent has no evidence relevant to the members of the partition. But now suppose the agent has highly reliable evidence that the die is biased in favor of coming up 6. Letting her credences reflect the bias won't earn her a better calibration score than the uniform 1/6 distribution, and might very well serve her worse.
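The 1/n trick can be checked in a couple of lines (again a sketch of mine, using a die roll as in the text):

```python
# Sketch (mine): over an n-cell partition, the uniform 1/n assignment is perfectly
# calibrated no matter which cell turns out true, since the single 1/n-credence group
# always contains exactly one truth among its n propositions.
n = 6   # the six faces of a die
for true_face in range(n):
    group_truths = [face == true_face for face in range(n)]   # every face gets credence 1/n
    assert sum(group_truths) / n == 1 / n
print("uniform 1/n credences are perfectly calibrated in every outcome")
```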
One could make various moves here in an attempt to save calibration as a plausible measure of accuracy. For instance, calibration scores are less easily manipulable if we measure them only in the long run. But this generates questions about the accuracy of credences in non-repeatable events, and soon we're assessing not actual long-run calibration but instead hypothetical calibration in the limit. Before long, we've made all the desperate moves used to prop up the frequency theory of "probability" (Section 5.1.1), and run into all the same problems.
The response here is the same as it was with the frequency theory: Instead of employing a notion that emerges only when events are situated in a larger collective, we find a notion that can be meaningfully applied to single cases considered one at a time (like propensity). Looking back at Sam and Diane, our intuitive judgment that Diane is globally more accurate than Sam arises from local judgments that she was more accurate than him on each individual proposition. If you knew only the truth-value of X1, you could still have said that Diane was more accurate than Sam on that one proposition. Our accuracy intuitions apply piece-wise; we assess credences on one proposition at a time, then combine the results into a global accuracy measure.
10.2 The gradational accuracy argument for probabilism
10.2.1 The Brier score
We will now develop what's known as the "gradational accuracy" approach to evaluating credences. Our guiding idea will be that inaccuracy is distance from truth—a credence distribution gains accuracy by moving its values closer to the truth-values of propositions. Of course, credence values are real numbers, while truth-values are not. But it's natural to overcome that obstacle by letting 1 stand for truth and 0 stand for falsehood. Just as we have a distribution cr expressing the agent's credences in propositions, we'll have another distribution tv reflecting the truth-values of those propositions. Distribution tv assigns numerical values to the propositions in L such that tv(X) = 1 if X is true and tv(X) = 0 if X is false.6 Once we have distribution cr representing the agent's credences and distribution tv representing the truth, we want a scoring rule that measures how far apart these distributions are from each other. It's easiest to visualize
the challenge on a diagram. To simplify matters, consider a credence distribution over only two propositions, X and Y. Our agent assigns cr(X) = 0.7 and cr(Y) = 0.6. I have depicted this credence assignment in Figure 10.1. In this diagram the horizontal axis represents the proposition X while the vertical axis represents Y. Any credence assignment to these two propositions can be represented as an ordered pair; I have placed a dot at the agent's cr-distribution of (.7, .6). What about the values of tv? Let's suppose that propositions X and Y are both true. So tv(X) = tv(Y) = 1. I have marked (1, 1)—the location of tv on the diagram—with another dot. Now our question is how to measure the inaccuracy of the agent's credences; how should we gauge how far cr is from tv? A natural suggestion is to use distance as the crow flies, indicated by the arrow in Figure 10.1. A quick calculation tells us that the length of the
arrow is:

(1 − 0.7)² + (1 − 0.6)² = (0.3)² + (0.4)² = 0.25     (10.1)
Pythagorean Theorem aficionados will note the lack of a square-root in this distance expression (the arrow is actually 0.5 units long). But since we're going to be using inaccuracy measurements only for ordinal comparisons (which credence distribution is farther from the truth), particular numerical values don't matter much—and neither does the square-root.
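As a quick numerical check of Equation (10.1), here is a sketch (mine) of the distance measure; it anticipates the general formula the chapter states below, and also evaluates the world in which X and Y are both false.

```python
# Sketch (my own, not the text's): sum of squared distances between credences
# and truth-values, with 1 standing for truth and 0 for falsehood.
def brier(cr, tv):
    return sum((tv[X] - cr[X]) ** 2 for X in cr)

cr = {"X": 0.7, "Y": 0.6}
print(round(brier(cr, {"X": 1, "Y": 1}), 2))   # 0.25, as in Equation (10.1)
print(round(brier(cr, {"X": 0, "Y": 0}), 2))   # 0.85, the world where X and Y are both false
```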
[Figure 10.1: The Brier score. The horizontal axis represents proposition X and the vertical axis proposition Y; one dot marks cr = (.7, .6), another marks tv = (1, 1), and an arrow runs between them.]
When generalized to a credence distribution over finitely-many propositions X1, X2, ..., Xn, this distance measure of inaccuracy becomes

Ibr(cr, ω) = (tvω(X1) − cr(X1))² + (tvω(X2) − cr(X2))² + . . . + (tvω(Xn) − cr(Xn))²     (10.2)
A few notes about this equation: First, what are the ω's doing in there? We usually want to evaluate the inaccuracy of your credence distribution relative to conditions in the actual world. But sometimes we'll wonder how inaccurate your credences would've been if you'd maintained your distribution but lived in a different possible world. For example, in Figure 10.1 we might wonder how inaccurate the credence distribution cr would have been had X and Y both been false. That is, we might want to calculate the distance between cr and the point (0, 0). Equation (10.2) calculates the inaccuracy of credence distribution cr in an arbitrary possible world ω. tvω(Xi) represents the truth-value of proposition Xi in world ω; Ibr(cr, ω) then measures the inaccuracy of cr relative to conditions in that world. (So for the credence distribution (0.7, 0.6) and the world (0, 0), Equation (10.2) would yield an Ibr-value of 0.7² + 0.6² = 0.85.)7 Second, Equation (10.2) measures inaccuracy by tallying up one proposition at a time, then summing the results. For any credence distribution cr and particular proposition Xi, evaluating (tvω(Xi) − cr(Xi))² is one way of gauging how far off distribution cr is on that particular proposition. Equation (10.2) makes that calculation for each individual proposition Xi, then
p q
p p q ´ p qq
q
2
adds up the results. In general, a scoring rule that sums the results of separate calculations on individual propositions is referred to as separable. Separable scoring rules track our intuition that accuracy assessments of an entire credence distribution can be built up piece-wise, considering the accuracy of one credence at a time; this was exactly the feature we found lacking in calibration's evaluation of Sam and Diane.

The particular separable scoring rule in Equation (10.2) is known as the Euclidean distance, the quadratic loss function, or most commonly as the Brier score.8 (This accounts for the "Br" subscript in I_Br.) The Brier score is hardly the only scoring rule available, but it is natural and widely used. So we will stick with it for the time being, until we examine other options in Section 10.3.1. At that point we'll find that even among the separable scoring rules, there may be ordinal non-equivalence—two separable scoring rules may disagree about which distribution is most accurate in a given world.

Nevertheless, all the separable scoring rules have some features in common. For instance, while I_Br(cr, ω) is in some sense a global measure of the inaccuracy of cr in world ω, it doesn't take into account any holistic or interactive features among the individual credences cr assigns. Separable scores can't, for example, take into account the sum or difference of cr(X_i) and cr(X_j) for i ≠ j. Moreover, each X_i contributes equally to the sum I_Br(cr, ω) in a separable rule. Thus each proposition to which the agent assigns a credence is treated in some sense equally. If you think that in particular circumstances it may be more important to be accurate about some X_j than others, this inequity will not be capturable in a separable scoring rule. Still, the main mathematical results of this chapter would go through even if we accommodated such biases by disparately weighting the (tv_ω(X_i) − cr(X_i))² expressions before summing.9

Finally, the scoring rules we consider from this point on will measure the inaccuracy of credence distributions in particular worlds. So an agent looking to be as accurate as possible will seek to minimize her score. Some authors prefer to work with credence distributions' epistemic utility, a numerical measure of epistemic value that rational agents maximize. Now there may be many aspects of a credence distribution that make it epistemically valuable or disvaluable besides its distance from the truth. But many authors work under the assumption that accuracy is the sole determiner of a distribution's epistemic value, in which case that value can be calculated directly from the distribution's inaccuracy. (The simplest way is to let the epistemic utility of distribution cr in world ω equal 1 − I_Br(cr, ω).) If you find yourself reading elsewhere about accuracy arguments, you should be sure to notice whether the author asks agents to minimize inaccuracy or
maximize utility. On either approach, the best credence is the one closest to the pin (the distribution tv). But with inaccuracy, as in golf, lowest score wins.

10.2.2 Joyce's accuracy argument for probabilism

In our discussion of calibration we saw that it's rationally problematic for an agent's credence distribution to be "unstable"—for it to seem to the agent, by her own lights, like another credence distribution would be preferable to her own. We ultimately rejected assessing agents' credences using calibration, but now we have an alternative accuracy measure: the Brier score. If we could convince an agent that her credences are less accurate, as measured by the Brier score, than some other distribution over the same set of propositions, then it would seem irrational for her to have her own credence distribution (as opposed to the other one).

How can we convince an agent that her credences are less accurate than some alternative? Inaccuracy is always measured relative to a world. Presumably the agent is interested in how things stand in the actual world, but presumably she also has some uncertainty as to which propositions are true or false in the actual world. If she doesn't know the tv-values, she won't be able to calculate her own Brier score, much less that of an alternative distribution. But what if we could show her that there exists a single distribution that fares better than her own with respect to accuracy in every logically possible world? Then she wouldn't need to know which world was actual; she could determine on an a priori basis that however things stand in the world, she would do better from an accuracy standpoint if she had that other distribution. In light of information like this, her present credences would look irrational. This line of thought leads to the following principle:
Admissibles Not Dominated: If an agent's credence distribution is rationally permissible, and she measures inaccuracy with an acceptable scoring rule, then there does not exist another distribution that is more accurate than her own in every possible world.

Admissibles Not Dominated is a conditional. Contraposing it, we get that any credence distribution accuracy-dominated by another distribution on an acceptable scoring rule is rationally impermissible (or "inadmissible", in the accuracy literature's jargon). Repurposing a theorem of de Finetti's (1974), and following on the work of Rosenkrantz (1981), Joyce (1998) demonstrated the
Gradational Accuracy Theorem: Given a credence distribution cr over a finite set of propositions X_1, X_2, ..., X_n, if we use the Brier score I_Br(cr, ω) to measure inaccuracy, then:

• If cr does not satisfy the probability axioms, then there exists a probabilistic distribution cr′ over the same propositions such that I_Br(cr′, ω) < I_Br(cr, ω) in every logically possible world ω; and

• If cr does satisfy the probability axioms, then there does not exist any cr′ over those propositions such that I_Br(cr′, ω) < I_Br(cr, ω) in every logically possible world.
The Gradational Accuracy Theorem has two parts. The first part says that if an agent has a non-probabilistic credence distribution cr, we will be able to find a probabilistic distribution cr′ defined over the same propositions as cr that accuracy-dominates cr. No matter what the world is like, distribution cr′ is guaranteed to be less inaccurate than cr. So the agent with distribution cr can be certain that, come what may, she is leaving a certain amount of accuracy on the table by assigning cr rather than cr′. There's a cost in accuracy, independent of what you think the world is like and therefore discernible a priori, to assigning a non-probabilistic credence distribution—much as there's a guaranteed accuracy cost to assigning logically inconsistent beliefs. On the other hand (and this is the second part of the theorem), if an agent's credence distribution is probabilistic, then no distribution (probabilistic or otherwise) is more accurate in every possible world. This seems a strong advantage of probabilistic credence distributions.10

Proving the second part of the theorem is difficult, but I will show how to prove the first part. There are three probability axioms—Non-Negativity, Normality, and Finite Additivity—so we need to show how violating each one leaves a distribution susceptible to accuracy domination. We'll take them one at a time, in order.

Suppose credence distribution cr violates Non-Negativity by assigning some proposition a negative credence. In Figure 10.2 I've imagined that cr assigns credences to two propositions, X and Y, bearing no special logical relations to each other. cr violates Non-Negativity by assigning cr(X) < 0. (The value of cr(Y) is irrelevant to the argument, but I've supposed it lies between 0 and 1.) We introduce probabilistic cr′ such that cr′(Y) = cr(Y) but cr′(X) = 0; cr′ is the closest point on the Y-axis to distribution cr.
[Figure 10.2: Violating Non-Negativity. Axes X and Y with the four worlds ω_1: (1, 1), ω_2: (1, 0), ω_3: (0, 1), and ω_4: (0, 0). The distribution cr lies to the left of the Y-axis (its X-coordinate is negative); cr′ is the nearest point to cr on the Y-axis. Arrows mark the distances from cr and cr′ to ω_3.]
We need to show that cr′ is less inaccurate than cr no matter which possible world is actual. Given our two propositions X and Y, there are four possible worlds.11 I've marked them on the diagram as ω_1, ω_2, ω_3, and ω_4, determining the coordinates of each world by the truth-values it assigns to X and Y. (In ω_2, for instance, X is true and Y is false.) We now need to show that for each of these worlds, cr′ receives a lower Brier score than cr. In other words, we need to show that for each world cr′ is closer as the crow flies than cr is. Clearly cr′ is closer to ω_2 and ω_1 than cr is, so cr′ is less inaccurate than cr relative to both ω_2 and ω_1. What about ω_3? I've indicated the distances from cr and cr′ to ω_3 with arrows. Because cr′ is the closest point on the Y-axis to cr, the points cr, cr′, and ω_3 form a right triangle. The arrow from cr to ω_3 is the hypotenuse of that triangle, while the arrow from cr′ to ω_3 is a leg. So the latter must be shorter, and cr′ is less inaccurate by the Brier score relative to ω_3. A parallel argument shows that cr′ is less inaccurate relative to ω_4. So cr′ is less inaccurate than cr relative to each possible world. That takes care of Non-Negativity.12

The accuracy argument against violating Normality is depicted in Figure 10.3. Suppose X is a tautology and cr assigns it some value other than 1. Since X is a tautology, there are no logically possible worlds in which it is false, so we need consider only the possible worlds marked as ω_2 and ω_1 in the diagram.
[Figure 10.3: Violating Normality. Axes X and Y with the two worlds ω_1: (1, 1) and ω_2: (1, 0). The distribution cr assigns X a value other than 1; cr′ is the nearest point to cr on the line X = 1, at the same height.]
We construct cr′ such that cr′(Y) = cr(Y) and cr′(X) = 1. cr′ is closer than cr to ω_1 because the arrow from cr to ω_1 is the hypotenuse of a right triangle of which the arrow from cr′ to ω_1 is one leg. A similar argument shows that cr′ is closer than cr to ω_2, demonstrating that cr′ is less inaccurate than cr in every logically possible world.

Explaining how to accuracy-dominate a Finite Additivity violator requires a three-dimensional argument sufficiently complex that I will leave it for an endnote.13 But we can show in two dimensions what happens if you violate one of the rules that follows from Finite Additivity, namely our Negation rule. Suppose your credence distribution assigns cr-values to two propositions X and Y such that X is the negation of Y. If you violate Negation, you'll have cr(Y) ≠ 1 − cr(X). I've depicted only ω_2 and ω_3 in Figure 10.4 because only those two worlds are logically possible (since X and Y must have opposite truth-values). The diagonal line connecting ω_2 and ω_3 has the equation Y = 1 − X; it contains all the credence distributions satisfying Negation. If cr violates Negation, it will fail to lie on this line. Then we can accuracy-dominate cr with the point closest to cr lying on that diagonal line (call that point cr′). Once more, we've created a right triangle with cr, cr′, and world ω_3. The arrow representing the distance from cr to ω_3 is the hypotenuse of this triangle, while the arrow from cr′ to ω_3 is its leg.
[Figure 10.4: Violating Negation. Axes X and Y with the two worlds ω_2: (1, 0) and ω_3: (0, 1), connected by the diagonal line Y = 1 − X. The distribution cr lies off that line; cr′ is the nearest point to cr on the line. Arrows mark the distances from cr and cr′ to ω_3.]
So cr′ has the shorter distance, and if ω_3 is the actual world, cr′ will be less inaccurate than cr according to the Brier score. A parallel argument applies to ω_2, so cr′ is less inaccurate than cr in each of the two logically possible worlds.14

Joyce (1998, 2009) leverages the advantage of probabilistic credence distributions displayed by the Gradational Accuracy Theorem into an argument for probabilism:

Gradational Accuracy Argument for Probabilism
(Premise 1) A rationally-permissible credence distribution cannot be accuracy-dominated on any acceptable scoring rule.
(Premise 2) The Brier score is an acceptable scoring rule.
(Theorem) If we use the Brier score, then any non-probabilistic credence distribution can be accuracy-dominated.
(Conclusion) All rationally-permissible credence distributions satisfy the probability axioms.

The first premise of this argument is Admissibles Not Dominated. The theorem is the Gradational Accuracy Theorem. The conclusion of this argument is Probabilism.
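As a quick numerical illustration of the theorem's first clause (my own sketch, not part of Joyce's proof), one can verify that a Non-Negativity violator is Brier-dominated in all four worlds by its projection onto the Y-axis. The particular numbers below are hypothetical.

```python
# Check that replacing a negative credence in X with 0 lowers the Brier score
# in every logically possible world over X and Y.
def brier(cr, w):
    return sum((w[p] - cr[p]) ** 2 for p in cr)

cr       = {"X": -0.3, "Y": 0.5}   # violates Non-Negativity
cr_prime = {"X": 0.0,  "Y": 0.5}   # nearest point on the Y-axis
worlds = [{"X": x, "Y": y} for x in (1, 0) for y in (1, 0)]
for w in worlds:
    assert brier(cr_prime, w) < brier(cr, w)
print("cr' is less inaccurate than cr in every world")
```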
[Figure 10.5: Truth-Directedness. Axes X and Y with tv at (1, 1), the agent's credences cr at (.7, .6), and a second distribution at (.48, .9). A gray box spans the region between cr and tv; a dashed diagonal line through (.7, .6) marks distributions with the same absolute-value score as cr, and a dashed quarter-circle through (.7, .6) marks distributions with the same Brier score.]
10.3 Objections to the accuracy argument for probabilism

Unlike Representation Theorem and Dutch Book Arguments, the Gradational Accuracy Argument for Probabilism has nothing to do with an agent's decision-theoretic preferences over practical acts. It clearly pertains to the theoretical rationality of credences assigned in pursuit of an epistemic goal: accuracy. (This is why Joyce's original (1998) paper was titled "A Nonpragmatic Vindication of Probabilism".) This has been seen as a major advantage of the accuracy argument for probabilism. Of course, one has to be comfortable with the idea that belief-formation is a goal-directed activity—teleological, so to speak—and commentators have objected to that position. (You can find some examples in the Further Reading.) But I want to focus on a more technical objection that has been with the gradational accuracy approach from its inception.

Premise 2 of the Gradational Accuracy Argument states that the Brier score is an acceptable scoring rule. The Brier score is certainly not the only scoring rule possible; why do we think it's acceptable? And what does it even mean for a scoring rule to be acceptable in this context?
10.3.1 The absolute-value score
In his original (1998) presentation of the argument, Joyce selected the Brier score on the grounds that it exhibits a number of appealing formal properties—what we might think of as adequacy conditions for an acceptable scoring rule. We've already seen that the Brier score is a separable rule. The Brier score also displays

Truth-Directedness: If a distribution cr is altered by moving at least one cr(X_i) value closer to tv_ω(X_i), and no individual cr-values are moved farther away from tv_ω, then I(cr, ω) decreases.
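A tiny sketch (my own, with hypothetical numbers) of what Truth-Directedness demands of the Brier score: moving one credence closer to its truth-value while leaving the other alone lowers the score.

```python
# Moving the X-credence from 0.7 toward the truth-value 1, holding Y fixed,
# must lower a truth-directed score such as the Brier score.
def brier(cr, tv):
    return sum((t - c) ** 2 for c, t in zip(cr, tv))

tv = (1, 1)
before = (0.7, 0.6)
after  = (0.8, 0.6)   # X-credence moved closer to 1, Y-credence unchanged
assert brier(after, tv) < brier(before, tv)
```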
The intuitive idea of Truth-Directedness is that if you change your credence distribution by moving some propositions closer to their truth-values, and leaving the rest alone, this should make you less inaccurate. This condition is depicted in Figure 10.5. (Ignore the dashed elements in that diagram for now.) Assume once more that the agent assigns credences only to the propositions X and Y, and that both these propositions are true in the actual world. If the agent's credence distribution is (0.7, 0.6), every point on or in the gray box (except for (0.7, 0.6) itself) assigns an X-credence or a Y-credence closer to 1 than hers. On a truth-directed scoring rule, all of those distributions are more accurate than the agent's.

The Brier score isn't the only truth-directed scoring rule, or the only way of measuring distance on a diagram. Brier measures distance as the crow flies. But suppose you had to travel from the distribution (0.7, 0.6) to the truth (1, 1) by traversing a rectangular street grid, which permitted movement only parallel to the axes. The shortest distance between those two points measured in this fashion—what's sometimes called the "taxicab distance"—is

|1 − 0.7| + |1 − 0.6| = 0.3 + 0.4 = 0.7    (10.3)

I've illustrated this distance in Figure 10.6, for a credence distribution over two propositions X and Y. Generalizing this calculation to a distribution over finitely-many propositions X_1, X_2, ..., X_n yields

I_abs(cr, ω) = |tv_ω(X_1) − cr(X_1)| + |tv_ω(X_2) − cr(X_2)| + ... + |tv_ω(X_n) − cr(X_n)|    (10.4)

We'll call this the absolute-value scoring rule. Both the absolute-value score and the Brier score satisfy Truth-Directedness. We can see this by attending to the dashed elements in Figure 10.5. The dashed line passing through (0.7, 0.6) shows distributions that have the exact same inaccuracy as (0.7, 0.6) if we measure inaccuracy by the absolute-value
[Figure 10.6: The absolute-value score. Axes X and Y with cr at (.7, .6) and tv at (1, 1); the taxicab path from cr to tv runs parallel to the axes.]
score.15 Any point between that dashed line and (1, 1) is more accurate than (0.7, 0.6) by the absolute-value score. Notice that all the points in the gray box fall into that category, so the absolute-value score is truth-directed. The dashed quarter-circle shows distributions that are just as inaccurate as (0.7, 0.6) if we measure inaccuracy by the Brier score. Points between the dashed quarter-circle and (1, 1) are less inaccurate than (0.7, 0.6) according to the Brier score. Again, the gray box falls into that region, so the Brier score is truth-directed.

Perhaps more interestingly, we can see in Figure 10.5 that the Brier score and the absolute-value score are ordinally non-equivalent measures of inaccuracy. To bring out the contrast, consider the distribution (0.48, 0.9). Notice that Truth-Directedness doesn't settle whether this distribution is more or less accurate than (0.7, 0.6)—given that both X and Y have truth-values of 1, (0.48, 0.9) does better than (0.7, 0.6) with respect to Y but worse with respect to X. We have to decide whether the Y improvement is dramatic enough to merit the X sacrifice; Truth-Directedness offers no guidance concerning such tradeoffs. The Brier score and absolute-value score render opposite verdicts on this point. (0.48, 0.9) lies inside the dashed line, so the absolute-value score evaluates this distribution as less inaccurate than (0.7, 0.6). But (0.48, 0.9) lies outside the quarter-circle, so the Brier score evaluates it as more inaccurate. Here we have a concrete case in which the absolute and Brier scores disagree in their accuracy rankings of two distributions.
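The disagreement just described is easy to reproduce numerically; the following sketch (my own, assuming both X and Y are true so that tv = (1, 1)) scores the two distributions under each rule.

```python
# The Brier score and the absolute-value score rank (0.7, 0.6) and (0.48, 0.9)
# in opposite orders relative to the truth-values (1, 1).
def brier(cr, tv):
    return sum((t - c) ** 2 for c, t in zip(cr, tv))

def abs_score(cr, tv):
    return sum(abs(t - c) for c, t in zip(cr, tv))

tv = (1, 1)
a, b = (0.7, 0.6), (0.48, 0.9)
print(abs_score(a, tv), abs_score(b, tv))   # 0.7 vs 0.62: absolute-value favors b
print(brier(a, tv), brier(b, tv))           # 0.25 vs about 0.28: Brier favors a
```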
Such disagreement is especially important when it comes to the Gradational Accuracy Argument. A Gradational Accuracy Theorem cannot be proven for the absolute-value score; in fact, uniformly replacing the Brier score with the absolute-value score in the statement of that theorem yields a falsehood. (We'll demonstrate this in the next section.) So the Gradational Accuracy Argument for Probabilism cannot be run with the absolute-value score in place of the Brier score. If you thought the absolute-value score was an acceptable accuracy measure while the Brier score was not, the argument for probabilism would fail.

10.3.2 Proper scoring rules
Clearly it makes a difference to the Gradational Accuracy Argument for Probabilism whether the Brier score or the absolute-value score (or both, or neither) is an acceptable measure of inaccuracy. In his (1998), Joyce offered adequacy conditions beyond Truth-Directedness and separability that favored the Brier score over the absolute-value score. Maher (2002), however, argued that these properties were implausible as requirements on rationally-acceptable scoring rules, and defended the absolute-value score. So we're left wondering how to select one over the other.

Historically, the Brier score was favored over the absolute-value score because Brier belongs to a broad class of scoring rules called the "proper" scoring rules. To understand this notion of propriety, we first need to understand expected inaccuracies. Suppose I want to assess the inaccuracy of my friend Reyna's credence distribution. We'll simplify matters by stipulating that Reyna assigns only two credence values, cr_R(X) = 0.7 and cr_R(Y) = 0.6. Stipulate also that I am going to use the absolute-value score for inaccuracy measurement. We know from Equation (10.3) that if X and Y are both true, Reyna's I_abs score is 0.7. The trouble is, I'm not certain whether X or Y is true; I assign positive credence to each of the four truth-value assignments over X and Y. The table below shows my credence distribution (cr) over the four possibilities—which is distinct from Reyna's.
        X    Y    cr     I_abs(cr_R, ·)
ω_1     T    T    0.1    0.7
ω_2     T    F    0.2    0.9
ω_3     F    T    0.3    1.1
ω_4     F    F    0.4    1.3
The last column in this table shows the inaccuracy of Reyna's distribution in each of the four possible worlds according to the absolute-value score. If X and Y are both true, her inaccuracy is 0.7; if X is true but Y is false, it's 0.9, etc.

The table tells me the inaccuracy of Reyna's distribution in each possible world. I can't calculate her actual inaccuracy, because I'm not certain which possible world is actual. But I can calculate how inaccurate I expect Reyna's distribution to be. The inaccuracy of a credence distribution is a numerical quantity, and just like any numerical quantity I may calculate my expectation for its value. My expectation for the I_abs value of Reyna's distribution cr_R is

EI_cr(cr_R) = I_abs(cr_R, ω_1) · cr(ω_1) + I_abs(cr_R, ω_2) · cr(ω_2) + I_abs(cr_R, ω_3) · cr(ω_3) + I_abs(cr_R, ω_4) · cr(ω_4)
            = 0.7 · 0.1 + 0.9 · 0.2 + 1.1 · 0.3 + 1.3 · 0.4 = 1.10    (10.5)

For each world, I calculate how inaccurate cr_R would be in that world, and multiply by my credence cr that that world is actual.16 I then sum the results across all four worlds. Notice that because I'm more confident in, say, worlds ω_3 and ω_4 than I am in worlds ω_1 and ω_2, my expected inaccuracy value for Reyna's distribution falls near the higher end of the values in the fourth column of the table.

In general, if an agent employs the scoring rule I to measure inaccuracy, the agent's credence distribution is cr, and the finite set of worlds under consideration is ω_1, ω_2, ..., ω_n, the agent's expected inaccuracy for any distribution cr′ is:

EI_cr(cr′) = I(cr′, ω_1) · cr(ω_1) + I(cr′, ω_2) · cr(ω_2) + ... + I(cr′, ω_n) · cr(ω_n)    (10.6)

This equation generalizes the expected inaccuracy calculation of Equation (10.5) above. The notation EI_cr(cr′) indicates that we are calculating the expected inaccuracy of credence distribution cr′, as judged from the point of view of credence distribution cr.17

Equation (10.6) allows me to calculate my expected inaccuracy for any credence distribution, probabilistic or otherwise. If I wanted, I could even calculate my expected inaccuracy for my own credence distribution. That is, I could calculate EI_cr(cr). But this is a fraught calculation. When I calculate my expected inaccuracy for my own current credences and compare it to the inaccuracy I expect for someone else's credences, I might find that I expect
that other distribution to be more accurate than my own. We will say that distribution cr′ defeats cr in expectation if

EI_cr(cr′) < EI_cr(cr)    (10.7)
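Here is a brief sketch, in my own notation, of the expected-inaccuracy calculation of Equations (10.5) and (10.6), applied to Reyna's distribution from the table above. The function name and dictionary representation are assumptions of the sketch, not the text's.

```python
# Expected inaccuracy EI_cr(cr') from Equation (10.6), using the absolute-value
# score and the Reyna example.
def abs_score(target, world):
    return sum(abs(world[p] - target[p]) for p in target)

reyna  = {"X": 0.7, "Y": 0.6}
worlds = [{"X": 1, "Y": 1}, {"X": 1, "Y": 0}, {"X": 0, "Y": 1}, {"X": 0, "Y": 0}]
my_cr  = [0.1, 0.2, 0.3, 0.4]   # my credence in each of the four worlds

EI = sum(c * abs_score(reyna, w) for c, w in zip(my_cr, worlds))
print(EI)   # 0.7*0.1 + 0.9*0.2 + 1.1*0.3 + 1.3*0.4 = 1.10
```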
Your credence distribution defeats mine in expectation when, from the point of view of my own credence distribution, I expect yours to be less inaccurate than mine. Being defeated in accuracy expectation is not quite as bad as being accuracy-dominated. Being defeated in expectation is kind of like having a twin sister who takes all the same classes as you but has a better GPA. Being accuracy-dominated is like your twin's getting a better grade than you in every single class. Still, being defeated in expectation is a rational flaw. Joyce writes,

If, relative to a person's own credences, some alternative system of beliefs has a lower expected epistemic [inaccuracy], then, by her own estimation, that system is preferable from the epistemic perspective. This puts her in an untenable doxastic situation. She has a prima facie epistemic reason, grounded in her beliefs, to think that she should not be relying on those very beliefs. This is a probabilistic version of Moore's paradox. Just as a rational person cannot fully believe "X but I don't believe X," so a person cannot rationally hold a set of credences that require her to estimate that some other set has higher epistemic utility. [This] person is . . . in this pathological position: her beliefs undermine themselves. (2009, p. 277)

The idea that rational agents avoid being defeated in expectation is related to our earlier weather-forecaster discussion of stability and credence elicitation. Lewis (1971) calls a distribution that assigns itself the highest expected accuracy "immodest". ("When asked which method has the best estimated accuracy, the immodest method answers: 'I have'.") He then relates immodesty to an agent's epistemic goals:

If you wish to maximize accuracy in choosing a [credence-assignment] method, and you have knowingly given your trust to any but an immodest method, how can you justify staying with the method you have chosen? If you really trust your method, and you really want to maximize accuracy, you should take your method's advice and maximize accuracy by switching to some other method
that your original method recommends. If that method is also not immodest, and you trust it, and you still want to maximize accuracy, you should switch again; and so on, unless you happen to hit upon an immodest method. Immodesty is a condition of adequacy because it is a necessary condition for stable trust. (1971, p. 62)

These arguments from Joyce and Lewis support the following principle:

Admissibles Not Defeated: If an agent's credence distribution is rationally permissible, and she measures inaccuracy with an acceptable scoring rule, then she will not expect any distribution to be more accurate than her own.

Admissibles Not Defeated says that under an acceptable scoring rule, no credence distribution that is rationally permissible will take itself to be defeated in expectation by another distribution.18

Admissibles Not Defeated relates two elements: a credence distribution and a scoring rule. If we've already settled on an acceptable scoring rule, we can use Admissibles Not Defeated to test the rational permissibility of a credence distribution. But we can also argue in the other direction: if we know a particular credence distribution is rational, we can use Admissibles Not Defeated to argue that particular scoring rules are not acceptable.

For example, suppose I'm certain a fair die has just been rolled, but I know nothing about the outcome. I entertain six propositions, one for each possible outcome of the roll, and let's imagine that I assign each of those propositions a credence of 1/6. That is, my credence distribution cr assigns cr(1) = cr(2) = cr(3) = cr(4) = cr(5) = cr(6) = 1/6. This is at least a rationally permissible distribution in my situation. But now suppose that, in addition to having this perfectly permissible credence distribution, I also use the absolute-value scoring rule to assess accuracy. I entertain six possible worlds—call them ω_1 through ω_6, with the subscripts indicating how the roll comes out in each world. In world ω_1, the roll comes out 1, so tv_ω1(1) = 1 while the tv-value of each of the other outcomes is zero. Thus we have
I_abs(cr, ω_1) = |1 − 1/6| + 5 · |0 − 1/6| = 10/6 = 5/3    (10.8)

A bit of reflection will show that I_abs(cr, ω_2) through I_abs(cr, ω_6) also equal 5/3, for similar reasons. Recalling that I assign credence 1/6 to each of the six
possible worlds, my expected inaccuracy for my own credence distribution is

EI_cr(cr) = 6 · (5/3 · 1/6) = 5/3    (10.9)
Next I consider my crazy friend Ned, who has the same evidence as me but assigns credence 0 to each of the six roll-outcome propositions. That is, Ned's distribution cr_N assigns cr_N(1) = cr_N(2) = cr_N(3) = cr_N(4) = cr_N(5) = cr_N(6) = 0. How inaccurate do I expect Ned to be? Again, in ω_1, tv_ω1(1) = 1 while the tv-value of each other outcome is 0. So

I_abs(cr_N, ω_1) = |1 − 0| + 5 · |0 − 0| = 1    (10.10)
Similar calculations show that, as measured by the absolute-value score, in each possible world Ned's distribution will have an inaccuracy of 1. When I calculate my expected inaccuracy for Ned, I get

EI_cr(cr_N) = 6 · (1 · 1/6) = 1    (10.11)
If I calculate inaccuracy using the absolute-value rule, I will expect Ned's distribution to be less inaccurate than my own; my credence distribution is defeated in expectation by Ned's. Yet Ned's distribution isn't better than mine in any epistemic sense—in fact, the Principal Principle would say that my distribution is rationally required while his is rationally forbidden! Something has gone wrong, and it isn't the credences I assigned. Instead, it's the scoring rule I used to compare my credences with Ned's.

In fact, we can use this example to construct an argument against the absolute-value score as an acceptable scoring rule. In the example, my credence distribution is rationally permissible. According to Admissibles Not Defeated, a rationally permissible distribution cannot be defeated in expectation on any acceptable scoring rule. On the absolute-value rule, my credence distribution in the example is defeated in expectation (by Ned's). So the absolute-value scoring rule is not an acceptable inaccuracy measure. (This argument is similar to the argument we made against calibration as an accuracy measure, on the grounds that calibration made perfectly rational forecaster credences look unstable and therefore irrational.)

The Ned example cannot be used to make a similar argument against the Brier score. Exercise 10.4 shows that if I had used the Brier score, I would have expected my own credence distribution to be more accurate than Ned's. In fact, the Brier score is an example of a proper scoring rule:

Proper Scoring Rule: A scoring rule is proper just in case any agent with a probabilistic credence distribution who uses that rule takes
herself to defeat in expectation every other distribution over the same set of propositions.

The absolute-value scoring rule is not proper. The Brier score is: a probabilistic agent who uses the Brier score will always expect herself to do better with respect to accuracy than any other distribution she considers.19 The Brier score is not the only scoring rule with this feature. For the sake of illustration, here's another proper scoring rule:20

I_log(cr, ω) = [−log(1 − |tv_ω(X_1) − cr(X_1)|)] + ... + [−log(1 − |tv_ω(X_n) − cr(X_n)|)]    (10.12)
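To see propriety in action, the following sketch (my own construction, not the text's) recomputes the fair-die example under both the absolute-value score and the Brier score; the same comparison can be run for any other proper rule.

```python
# Compare my expected inaccuracy for my own uniform distribution (1/6 on each
# outcome) with my expected inaccuracy for Ned's all-zeros distribution.
def abs_score(cr, w): return sum(abs(wi - ci) for ci, wi in zip(cr, w))
def brier(cr, w):     return sum((wi - ci) ** 2 for ci, wi in zip(cr, w))

mine, ned = [1/6] * 6, [0] * 6
worlds = [[1 if i == k else 0 for i in range(6)] for k in range(6)]

for score in (abs_score, brier):
    ei_mine = sum((1/6) * score(mine, w) for w in worlds)
    ei_ned  = sum((1/6) * score(ned, w) for w in worlds)
    print(score.__name__, round(ei_mine, 3), round(ei_ned, 3))
# abs_score: 1.667 vs 1.0 -- my distribution is defeated in expectation by Ned's
# brier:     0.833 vs 1.0 -- under the proper Brier score, I expect myself to do better
```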
Historically, the Brier score has been favored over the absolute-value score for inaccuracy measurement because Brier is a proper scoring rule. Of course, propriety gives us no means of choosing between the Brier score and other proper scores such as the logarithmic rule of Equation (10.12). But it turns out we don't need to. Predd et al. (2009) showed that a Gradational Accuracy Theorem can be proven for any separable, proper scoring rule (not just the Brier score). So, for instance, on the logarithmic scoring rule any non-probabilistic credence distribution will be accuracy-dominated by some probabilistic distribution over the same propositions. The same is not true for the absolute-value score. In fact, if you look back to the Crazy Ned example, you'll find that Crazy Ned's non-probabilistic distribution accuracy-dominates my probabilistic distribution cr. In each of the six possible worlds, I_abs(cr_N, ω) = 1 while I_abs(cr, ω) = 5/3. On an improper scoring rule, a non-probabilistic distribution may accuracy-dominate a probabilistic one.

Since any proper scoring rule may figure in the Gradational Accuracy Theorem, we could substitute any other proper scoring rule in for the Brier score and the Gradational Accuracy Argument for Probabilism would still run fine. Does this mean we've found a good way to support the Brier score (or any other proper score) as an acceptable scoring rule for establishing probabilism? A proper scoring rule is one on which probabilistic distributions always expect themselves to be more accurate than the alternatives. But why focus on what probabilistic distributions expect? Inaccuracy measurement has many applications, and in many of those applications (including one we'll see in Section 10.4), it is already assumed that probabilistic credence distributions are rational. In such situations we want an accuracy measure that interacts well with probabilistic distributions, so proper scoring rules are a natural fit, and it's traditional to apply the Brier score because of
its propriety. But when an inaccuracy measure is going to be used to argue for probabilism—as in the Gradational Accuracy Argument—it seems question-begging to privilege probabilistic distributions in selecting our scoring rule. For instance, our Crazy Ned argument against the absolute-value score started by assuming that my probabilistic distribution assigning credence 1/6 to each of the possible roll outcomes was rationally permissible. We then criticized the absolute-value score on the grounds that it made that distribution look unstable and therefore irrational. Yet this criticism looks circular in the course of a debate about the rational status of credences satisfying the probability axioms.

In his (2009), Joyce moved from his old approach to defending probabilism to a new argument that explicitly begins with the rational permissibility of probabilistic distributions. While I won't go into the specifics of that argument here, it takes as a premise that given any numerical distribution satisfying the probability axioms, there exists some situation in which it would be rationally permissible for an agent to assign those values as her credences. Admittedly, this premise—that probabilistic credences are rationally permitted—is weaker than the ultimate conclusion of Joyce's argument—that probabilistic credences are rationally required. Still, without any independent support for the premise, it feels like we're simply assuming something about the rationality of probabilism in order to prove something about the rationality of probabilism. It sounds like Linearity In, Linearity Out to me.21

Joyce does try to provide independent support for his premise. He argues that for any probabilistic distribution, we could imagine a situation in which an agent is rationally certain that those values reflect the objective chances of the propositions in question. By the Principal Principle, the agent would then be rationally required to assign the relevant values as her credences. Yet recall our characters Mr. Prob, Mr. Bold, and Mr. Weak. Mr. Prob satisfies the probability axioms, while Mr. Bold violates Finite Additivity by having his credence in each proposition be the square-root of Mr. Prob's credence in that proposition. Mr. Bold happily assigns a higher credence to every uncertain proposition than Mr. Prob does. In arguing for probabilism, we look to establish that Mr. Bold's (and Mr. Weak's) credences are rationally forbidden. If we could establish that rational credences must match the numerical values of known frequencies or objective chances, then in many situations Mr. Bold's distribution could be ruled out immediately, because frequencies and chances must each be additive.22 But part of Mr. Bold's boldness is that even when he and Mr. Prob are both certain that a particular proposition has a particular nonextreme chance, he's willing to
assign that proposition a higher credence than its chance value. Mr. Bold is willing to be more confident of a given experimental outcome than its numerical chance! What if, when confronted with a fair die roll like the one in the Crazy Ned example, Mr. Bold maintains that it is rationally impermissible to assign a credence of 1/6 to each outcome? It's not that Mr. Bold disagrees with us about what the chances are; it's that he disagrees with us about whether rationally-permissible credences equal the chances.23 Faced with this position, our argument against the absolute-value score could not get off the ground, and we would have no way to favor the Brier score over absolute-value in constructing a Gradational Accuracy Argument. Similarly, Joyce's argument for his premise would go nowhere, because Mr. Bold clearly rejects the Principal Principle.24 While we might intuitively feel like Mr. Bold's position is crazy, the accuracy-based arguments against it are all question-begging.
10.3.3 Do we really need Finite Additivity?
Let's take a step back and get a broader view on the arguments discussed in this chapter. Some authors don't think accuracy considerations are central to assessing doxastic attitudes for rationality. But among those who embrace an accuracy-based approach, a few principles are uncontroversial. Everyone accepts Admissibles Not Dominated, and most authors seem okay with Admissibles Not Defeated. Everyone thinks accuracy measures should be truth-directed, and most are on board with separability. Controversy arises when we try to put more substantive constraints on the set of acceptable scoring rules. In order to run a gradational accuracy argument for probabilism, we need to narrow the acceptable scoring rules to the set of proper scores (or one of the other restricted sets Joyce considers in his (1998) and (1999)). But arguments for such a restricted set that look convincing ultimately turn out to be question-begging.

What if we didn't try to narrow the set so far—what if we worked only with constraints on scoring rules that are entirely uncontroversial? In Exercise 10.3, you'll show that as long as one's scoring rule is truth-directed, Admissibles Not Dominated endorses Normality and Non-Negativity as rational constraints on credence. As usual, Finite Additivity is the most difficult Kolmogorov axiom to establish. But an excellent (1982) paper by Dennis Lindley shows how close we can get to full probabilism without strong constraints on our scoring rules. Lindley assumes Admissibles Not Dominated, then lays down some very
minimal constraints on acceptable scoring rules. I won't work through the details, but besides separability and Truth-Directedness he assumes (for instance) that an acceptable scoring rule must be smooth—your score doesn't suddenly jump when you slightly increase or decrease your credence in a proposition. Lindley shows that these thin constraints on scoring rules suffice to narrow down the class of rationally-permissible credence distributions, and narrow it down more than just Normality and Non-Negativity would. In fact, every rationally permissible credence distribution is either probabilistic (it satisfies all three Kolmogorov axioms) or can be altered by a simple transformation into a probabilistic distribution. The permissible credence distributions stand to the probabilistic ones in something like the relation Mr. Bold and Mr. Weak stand to Mr. Prob. While Mr. Prob satisfies Finite Additivity, Mr. Bold and Mr. Weak don't; but their credences can be converted into Mr. Prob's by a simple mathematical operation (squaring for Mr. Bold; square-rooting for Mr. Weak).

Depending on how you think about credences, you might draw one of two diametrically opposed lessons from Lindley's result. First, you might think that if two credence distributions are relatable by a simple mathematical transformation like the one Lindley applies, there is no significant cognitive difference between them. Admittedly, Mr. Prob and Mr. Weak assign different numerical credence values. If I tell them I've flipped a fair coin, Mr. Prob might assign credence 1/2 that it came up heads while Mr. Bold will assign 1/√2 ≈ .707. But there's an awful lot on which Mr. Prob, Mr. Weak, and Mr. Bold agree. Their distributions are ordinally equivalent—Mr. Prob is more confident of X than Y just in case Mr. Bold and Mr. Weak are as well. And all three distributions satisfy certain structural constraints, such as Normality, Non-Negativity, and our credal Entailment rule. So one might think that all three distributions are really mathematical variants of the same basic outlook on the world. Perhaps the differences between these characters in our numerical credal models are an artifact of those models' excessive precision. In real life these characters would think and act in much the same ways; a functionalist might try to argue that the doxastic attitudes in their heads are identical.

If one takes that approach, then Lindley's result would seem to establish Finite Additivity in the only way that could possibly matter. Given Lindley's minimal conditions on an acceptable accuracy score, every rationally-permissible credence distribution either satisfies Finite Additivity or is indistinguishable from an additive distribution in any significant sense. All of the various distributions that can be transformed into a given probabilistic distribution are alternate representations of the same underlying mental
state. If we like, we can choose to work with the probabilistic distribution, because it's the representation of that mental state which is most mathematically convenient. But there's nothing more substantive than that to the claim that rationality requires Finite Additivity.25

In this book I have rejected the approach to doxastic attitudes just described, assigning a more realist significance to the numbers employed by Bayesian credal models. Chapter 1 motivated the move from comparative to quantitative confidence models by noting that agents with ordinally equivalent opinions may nevertheless disagree on the relative sizes of confidence gaps. Given a tautology, a contradiction, and the proposition that a fair coin came up heads, Mr. Prob and Mr. Bold will rank these three propositions in the same order with respect to confidence. But Mr. Prob will also say that he is more confident in heads than in the contradiction by exactly the same amount that he is more confident in the tautology than in heads. Mr. Bold won't say that. (Mr. Bold has a larger gap between heads and the contradiction than he has between heads and the tautology.) If we think this is a real cognitive difference, then the distinction between Prob and Bold drawn by our numerical Bayesian models is picking up on a genuine difference in their doxastic attitudes.

Adopting this realist approach will lead us to draw a different lesson from Lindley's result. If Lindley's constraints are the only constraints we are willing to accept on an accuracy scoring rule, then accuracy enthusiasts will have to accept that while there are rational restrictions on credence going beyond Normality and Non-Negativity, they aren't strong enough to establish Finite Additivity. Credence distributions that fail to satisfy the probability axioms (in a real, substantive sense) may still be rationally permissible.

Is that bad news? Over and over in this part of the book we have been unable to argue for Finite Additivity without sneaking in some linearity assumption. What would happen if we abandoned Finite Additivity in favor of weaker rational constraints on credence, such as the ones that come out of Lindley's result? In Part III I suggested we assess Bayesian Epistemology by considering its applications; I focused especially on applications to confirmation and decision theory. In decision theory the kinds of distinctions picked up by a quantitative confidence measure but not by a comparative ranking may be significant. If I am offered a gamble that yields a small profit on P but a major loss on ~P, my decision will depend not only on whether P is more likely than ~P, but on how much more likely it is. So differences in confidence gaps between ordinally-equivalent credence distributions may
be highly important to decision theory. Yet we saw in Chapter 8 that the differences between Mr. Prob's and Mr. Bold's credence distributions may be practically neutralized if those agents apply suitably chosen valuation functions. If Mr. Prob combines his credences and utilities to generate preferences by maximizing expected value, and Mr. Bold combines his credences and the same utilities to generate preferences using a different function, Mr. Prob and Mr. Bold can wind up with identical preferences. In that case, the numerical differences between Mr. Prob's and Mr. Bold's credences will make no difference to how they choose to act. Moreover, Mr. Prob and Mr. Bold will both satisfy the preference axioms that underlie the intuitive appeal of decision theory's account of practical rationality. To the extent that decision theory yields a fruitful, appealing account of real-life agents' rational choices, that account could run just as well without assuming those agents satisfy Finite Additivity.

The significance of Finite Additivity to Bayesian accounts of confirmation is a much more open question. As with decision theory, confirmation results depend not just on confidence orderings but also on quantitative relations among numerical credence values. In Section 6.4.2 we investigated credence distributions relative to which observing a black raven more strongly confirms the hypothesis that all ravens are black than does observing a non-black, non-raven. The Bayesian solution to the Ravens Paradox presented there describes two conditions on such distributions (Equations (6.10) and (6.11)). The second of those conditions is about the sizes of gaps—it asks whether learning a particular hypothesis would change the degree to which you are more confident in one proposition than another. Despite their ordinal agreements, characters like Mr. Prob and Mr. Bold have different ratios between their credences in particular propositions. So Equation (6.11) might be satisfied by one of them but not by the other. This means that if Mr. Prob and Mr. Bold apply traditional Bayesian confirmation measures, they may disagree on whether the ravens hypothesis is more strongly confirmed by a black raven or by a red herring.26

Confirmation is one of many non-decision-theoretic applications of Bayesian Epistemology (coherence of a belief set, measuring information content, etc.) that seems to rely on the additivity of rational credences. Perhaps in each of those applications we could play a trick similar to the one we used in decision theory. In decision theory we compensated for Mr. Bold's non-additive credence distribution by having him use a non-standard valuation function; the combination yielded act preferences identical to Mr. Prob's. What happens if Mr. Bold also uses a non-traditional confirmation measure? Perhaps there's an odd-looking confirmation measure Mr. Bold
could apply which, despite Mr. Bold's credal differences with Mr. Prob, would leave the two agents with identical judgments about confirmational matters.27 It's unclear, though, how such a non-traditional measure would stand up to the arguments, intuitive considerations, and adequacy conditions that have been deployed in the debate over confirmation measures. I know of no literature on this subject. As it stands, I tend to think that maintaining Finite Additivity is more important for Bayesian applications to theoretical rationality (how we infer, how we reason, how we determine what supports what) than it is for applications to practical rationality. But that is pure speculation on my part.
10.4 An accuracy argument for Conditionalization
Up to this point we've considered accuracy-based arguments for only synchronic Bayesian norms. We've found that establishing probabilism on non-circular grounds is somewhat difficult. But if you've already accepted probabilism, a remarkable accuracy-based argument for updating by Conditionalization becomes available. The relevant result was proven by Hilary Greaves and David Wallace (2006).28

We start by restricting our attention to proper scoring rules. Doing so is non-circular in this context, because we imagine that we've already accepted probabilism as rationally required. This allows us to appeal to the fact that proper scores are credence-eliciting for probabilistic credences as a reason to prefer them.

Greaves and Wallace think of Conditionalization as a plan one could adopt for how to change one's credences in response to one's future evidence. Imagine we have an agent at time t_i with probabilistic credence distribution cr_i, who is certain she will gain some evidence before t_j. Imagine also that there's a finite partition of propositions E_1, E_2, ..., E_n in L such that the agent is certain the evidence gained will be a member of that partition. The agent can then form a plan for how she intends to update—she says to herself, "If I get evidence E_1, I'll update my credences to such-and-such"; "If I get evidence E_2, I'll update my credences to so-and-so"; etc. In other words, an updating plan is a function from members of the evidence partition to cr_j distributions she would assign in response to receiving that evidence. Conditionalization is the plan that directs an agent receiving partition member E_m as evidence between t_i and t_j to set cr_j(·) = cr_i(· | E_m).

Next, Greaves and Wallace show how, given a particular updating plan, the agent can calculate from her point of view at t_i an expectation for how inaccurate that plan will be.29 Roughly, the idea is to figure out what
credence distribution the plan would generate in each possible world, measure how inaccurate that distribution would be in that world, multiply by the agent's t_i confidence in that possible world, then sum the results. More precisely, the expectation calculation proceeds in six steps:

1. Pick a possible world ω to which the agent assigns non-zero credence at t_i.

2. Figure out which member of the partition E_1, E_2, ..., E_n the agent will receive as evidence between t_i and t_j if ω turns out to be the actual world. (Because possible worlds are maximally specified, there will always be a unique answer to this question.) We'll call that piece of evidence E.

3. Take the updating plan being evaluated and figure out what credence distribution it recommends to the agent if she receives evidence E between t_i and t_j. This is the credence distribution the agent will assign at t_j if ω is the actual world and she follows the plan in question. We'll call that distribution cr_j.

4. Whichever scoring rule we've chosen (among the proper scoring rules), use it to determine the inaccuracy of cr_j if ω is the actual world. (In other words, calculate I(cr_j, ω).)

5. Multiply that inaccuracy value by the agent's t_i credence that ω is the actual world. (In other words, calculate cr_i(ω) · I(cr_j, ω).)

6. Repeat this process for each world to which the agent assigns positive credence at t_i, then sum the results.

This calculation has the t_i agent evaluate an updating plan by determining what cr_j distribution that plan would recommend in each possible world. She assesses the recommended distribution's accuracy in that world, weighting the result by her confidence that the world in question will obtain. Repeating this process for each possible world and summing the results, she develops an overall expectation of how accurate her t_j credences will be if she implements the plan. Greaves and Wallace go on to prove the following theorem:

Accuracy Updating Theorem: For any proper scoring rule, probabilistic distribution cr_i, and evidential partition in L, a t_i agent who calculates expected inaccuracies as described above will find Conditionalization more accurate than any updating plan that diverges from it.
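A short sketch of the six-step recipe follows; the propositions, numbers, and function names here are my own invention for illustration. It compares Conditionalization with a plan that never changes the agent's credences, using the Brier score.

```python
# Expected inaccuracy of an updating plan, per the six-step recipe above.
# The agent assigns credences over the four state-descriptions of P and Q
# and is certain she will learn whether Q is true.
def brier(cr, w):
    return sum((w[s] - cr[s]) ** 2 for s in cr)

states = ["P&Q", "P&~Q", "~P&Q", "~P&~Q"]
cr_i = {"P&Q": 0.2, "P&~Q": 0.1, "~P&Q": 0.4, "~P&~Q": 0.3}
worlds = [{s: (1 if s == actual else 0) for s in states} for actual in states]

def conditionalize(evidence):                # evidence is "Q" or "~Q"
    live = ["P&Q", "~P&Q"] if evidence == "Q" else ["P&~Q", "~P&~Q"]
    total = sum(cr_i[s] for s in live)
    return {s: (cr_i[s] / total if s in live else 0.0) for s in states}

def stay_put(evidence):                      # a rival plan: keep the old credences
    return dict(cr_i)

def expected_inaccuracy(plan):
    total = 0.0
    for actual, w in zip(states, worlds):    # steps 1-2: pick a world, find the evidence
        evidence = "Q" if actual in ("P&Q", "~P&Q") else "~Q"
        cr_j = plan(evidence)                # step 3: the plan's recommendation
        total += cr_i[actual] * brier(cr_j, w)   # steps 4-5: score it, weight it
    return total                             # step 6: sum over worlds

print(expected_inaccuracy(conditionalize))   # about 0.42
print(expected_inaccuracy(stay_put))         # 0.70: conditionalizing is expected to do better
```

With these hypothetical numbers, Conditionalization has the lower expected inaccuracy, which is what the theorem predicts for any proper scoring rule.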
The Accuracy Updating Theorem demonstrates that from her vantage point at t_i, an agent with probabilistic credences and a proper scoring rule will expect to be most accurate at t_j if she updates by Conditionalization. Given a principle something like Admissibles Not Defeated for updating plans, we can use this result to argue that no updating plan deviating from Conditionalization is rationally acceptable.

Does this argument show that the agent is rationally required to update by Conditionalization between t_i and t_j? If she's interested in minimizing expected inaccuracy, then at t_i she should certainly plan to update by conditionalizing—of all the updating plans available to the agent at t_i, she expects Conditionalization to be most accurate. Yet being required to make a plan is different from being required to implement it. Even if the agent remembers at t_j what she planned at t_i, why should the t_j agent do what her t_i self thought best? Among other things, the t_j agent has more evidence than her t_i self did. This is the same problem we identified in Chapter 9 for diachronic Dutch Strategy arguments. The Accuracy Updating Theorem establishes a synchronic point about which policy a t_i agent concerned with accuracy will hope her t_j self applies.30 But absent a substantive premise that agents are rationally required later on to honor their earlier plans, we cannot move from this synchronic point to a genuinely diachronic norm like Conditionalization.
10.5 Exercises
Problem 10.1. On each of ten consecutive mornings, a weather forecaster reports her credence that it will rain that day. Below is a record of the credences she reported and whether it rained that day.

Day        1    2    3    4    5    6    7    8    9    10
cr(rain)   1/2  1/4  1/3  1/3  1/2  1/4  1/3  1    1/2  1/4
Rain?      Y    N    N    N    Y    Y    N    Y    N    N
Unfortunately, the forecaster’s reports turned out not to be perfectly calibrated over this ten-day span. But now imagine she is given the opportunity to go back and change two of the credences she reported over those ten days. ∗ What two changes should she make so that her reports over the span become perfectly calibrated? (Assume that changing her credence report does not change whether it rains on a given day.) ∗
Perhaps via time-machine?
Problem 10.2. Throughout this problem, assume the Brier score is used to measure inaccuracy.

(a) Suppose we have an agent who assigns credences to two propositions, X and Y. Draw a box diagram (like those in Figures 10.2, 10.3, and 10.4) illustrating the possible distributions she might assign over these two propositions. Then shade in the parts of the box in which cr(X) ≥ cr(Y).

(b) Now suppose that Y ⊨ X. Use your diagram from part (a) to show that if an agent's credence distribution violates the Entailment rule by assigning cr(Y) > cr(X), there will be another distribution that is more accurate than hers in every logically possible world. (Hint: When Y ⊨ X, only three of the four corners of your box represent logically possible worlds.)
(c) In Exercise 9.2 we encountered Roxanne, who assigns the following credences (among others) at a given time:

cr(A & B) = 0.5
cr(A) = 0.1
Construct an alternate credence distribution over these two propositions that is more accurate than Roxanne's in every logically possible world. (Hint: Let A & B play the role of proposition Y, and A play the role of X.) To demonstrate that you've succeeded, calculate Roxanne's inaccuracy and the alternate distribution's inaccuracy in each of the three available possible worlds.

Problem 10.3. Assuming only that our inaccuracy scoring rule is truth-directed and separable, argue for each of the following from Admissibles Not Dominated:

(a) Non-Negativity

(b) Normality

Problem 10.4. Return to the Crazy Ned example of Section 10.3.2, in which you assign 1/6 credence to each of the six possible die roll outcomes while Ned assigns each a credence of 0. This time we'll use the Brier score (rather than the absolute-value score) to measure inaccuracy in this example.
(a) Calculate the inaccuracy of your credence distribution in a world in which the die comes up 1. Then calculate Ned's inaccuracy in that world.
(b) Calculate your expected inaccuracy for your own distribution, then calculate your expected inaccuracy for Ned's distribution.

(c) How do your results illustrate the fact that the Brier score is a proper scoring rule?

Problem 10.5. Use results discussed in this chapter to show that the Brier score fails to be credence-eliciting for all non-probabilistic credence distributions. That is, argue that for any agent who assigns a non-probabilistic distribution and measures inaccuracy using the Brier score, there will be another distribution that she expects to have lower inaccuracy than her own.

Problem 10.6. Suppose that at t_i an agent assigns credences to exactly four propositions, as follows:

proposition   P & Q   P & ~Q   ~P & Q   ~P & ~Q
cr_i          0.1     0.2      0.3      0.4
The agent is certain that between ti and tj, she will learn whether Q is true or false.
(a) Imagine the agent has a very bizarre updating plan: No matter what she learns between ti and tj, she will assign the exact same credences to the four propositions at tj that she did at ti. Using the six-step process described in Section 10.4, and the Brier score to measure inaccuracy, calculate the agent's expected inaccuracy for this updating plan from her point of view at ti. (Hint: You only need to consider four possible worlds, one for each of the four possible truth-value assignments to the propositions P and Q.)
(b) Now imagine instead that the agent's updating plan is to generate her tj credences by conditionalizing her ti credences on the information she learns between the two times. Calculate the agent's ti expected inaccuracy for this updating plan (using the Brier score to measure inaccuracy once more).
(c) How do your results illustrate Greaves and Wallace's Accuracy Updating Theorem?
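To check hand calculations for this problem, here is a small Python sketch of my own construction—not the book's six-step recipe, and the function and variable names are invented—that computes the ti expected Brier inaccuracy of an updating plan world by world. It should agree numerically with the six-step method.

```python
def brier(cr, world):
    """cr: credences for each proposition; world: truth-values (1 or 0) in that world."""
    return sum((world[p] - cr[p]) ** 2 for p in cr)

props = ["P&Q", "P&~Q", "~P&Q", "~P&~Q"]
cr_i = dict(zip(props, [0.1, 0.2, 0.3, 0.4]))

# Four candidate worlds, one per state-description; exactly one proposition is true in each.
worlds = [{p: int(p == w) for p in props} for w in props]

def conditionalize(cr, true_props):
    total = sum(cr[p] for p in true_props)
    return {p: (cr[p] / total if p in true_props else 0.0) for p in cr}

def expected_inaccuracy(plan):
    """plan maps the evidence learned ('Q' or '~Q') to a tj credence distribution."""
    exp = 0.0
    for w, world in zip(props, worlds):
        evidence = "Q" if world["P&Q"] or world["~P&Q"] else "~Q"
        exp += cr_i[w] * brier(plan[evidence], world)
    return exp

lazy_plan = {"Q": cr_i, "~Q": cr_i}                        # part (a): never change anything
cond_plan = {"Q": conditionalize(cr_i, ["P&Q", "~P&Q"]),    # part (b): update by Conditionalization
             "~Q": conditionalize(cr_i, ["P&~Q", "~P&~Q"])}

print(expected_inaccuracy(lazy_plan), expected_inaccuracy(cond_plan))
```

As the Accuracy Updating Theorem predicts, the conditionalizing plan's expected inaccuracy comes out lower than the lazy plan's.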
Problem 10.7. In this exercise you will prove a limited version of Greaves and Wallace's Accuracy Updating Theorem. Suppose we have an agent with probabilistic, regular credence distribution cri over only four propositions: X & Y, X & ~Y, ~X & Y, and ~X & ~Y. Suppose the agent is certain at ti that between then and tj she will learn the truth about whether Y obtains. Moreover, assume the agent uses the Brier score to measure inaccuracy.
(a) To begin, suppose that the agent has an updating plan on which she assigns nonzero tj credence to X & ~Y or ~X & ~Y in the event she learns that Y is true. Explain how to construct an alternate updating plan that assigns zero credence to both of these propositions after learning Y and that has a lower expected inaccuracy than the agent's plan from her point of view at ti. (A similar argument can be made about what the agent should do when she learns ~Y.)
(b) Your work in part (a) allows us to restrict our attention to updating plans that assign 0 credence to propositions denying Y once Y is learned. Use the Gradational Accuracy Theorem to argue that among such plans, for any plan that has the agent assign a non-probabilistic tj distribution after learning Y there exists another plan that has her assign a probabilistic distribution at tj after learning Y and that she expects to have a lower inaccuracy from her point of view at ti. (A similar argument can be made for the agent's learning ~Y.)
(c) Given your results in parts (a) and (b), we may now confine our attention to updating plans that respond to learning Y by assigning a probabilistic tj distribution with zero credence for X & ~Y and ~X & ~Y. Argue that among such plans, any plan that agrees with Conditionalization about what the agent should assign if she learns ~Y but disagrees with Conditionalization about what she should assign if she learns Y will have a higher ti expected inaccuracy than updating by Conditionalization. (A similar argument could be made for any plan that agrees with Conditionalization on Y but disagrees on ~Y.)
Useful algebra fact for part (c): A quadratic function of the form f(x) = ax² + bx + c with positive a attains its minimum when x = −b/2a.
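As a quick illustration of how that algebra fact gets used (a worked example of my own, not part of the exercise): suppose your credence in Y is q and you consider reporting credence x in Y. Your expected Brier penalty on that one proposition is f(x) = q(1 − x)² + (1 − q)x² = x² − 2qx + q, a quadratic with a = 1 and b = −2q, so its minimum falls at x = −b/2a = q. Expected inaccuracy is minimized by reporting your actual credence—the simplest case of the Brier score's propriety.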
Problem 10.8. Of the three kinds of arguments for probabilism we’ve considered in this part of the book—Representation Theorem arguments, Dutch Book arguments, and accuracy-based arguments—do you find any of them convincing? If so, which do you find most convincing? Explain your answer.
10.6 Further reading
Introductions and Overviews
Richard Pettigrew (2013b). Epistemic Utility and Norms for Credences. Philosophy Compass 8, pp. 897–908
Eminently-readable introduction to accuracy-based arguments for Bayesian norms and particular arguments for probabilism and Conditionalization.
Richard Pettigrew (2011). Epistemic Utility Arguments for Probabilism. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2011
More comprehensive overview of the current accuracy-argument literature.
Classic Texts
Bas C. van Fraassen (1983). Calibration: A Frequency Justification for Personal Probability. In: Physics, Philosophy and Psychoanalysis. Ed. by R. Cohen and L. Laudan. Dordrecht: Reidel, pp. 295–319
Abner Shimony (1988). An Adamite Derivation of the Calculus of Probability. In: Probability and Causality. Ed. by J.H. Fetzer. Dordrecht: Reidel, pp. 151–161
Classic arguments for probabilism on calibration grounds.
Bruno de Finetti (1974). Theory of Probability. Vol. 1. New York: Wiley
Contains de Finetti's proof of the mathematical result underlying Joyce's Gradational Accuracy Theorem.
James M. Joyce (1998). A Nonpragmatic Vindication of Probabilism. Philosophy of Science 65, pp. 575–603
Foundational article that first made the accuracy-dominance argument for probabilism.
Hilary Greaves and David Wallace (2006). Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility. Mind 115, pp. 607–632
Presents the minimizing-expected-inaccuracy argument for updating by Conditionalization.
Extended Discussion
James M. Joyce (2009). Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 263–297
Joyce further discusses the arguments in his earlier accuracy article and various conditions yielding privileged classes of accuracy scores.
Dennis V. Lindley (1982). Scoring Rules and the Inevitability of Probability. International Statistical Review 50, pp. 1–26
Paper discussed in Section 10.3.3 in which Lindley shows that even with very minimal conditions on acceptable accuracy scores, every rationally permissible credence distribution is either probabilistic or can be converted to a probabilistic distribution via a simple transformation.
Hannes Leitgeb and Richard Pettigrew (2010a). An Objective Justification of Bayesianism I: Measuring Inaccuracy. Philosophy of Science 77, pp. 201–235
Hannes Leitgeb and Richard Pettigrew (2010b). An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy. Philosophy of Science 77, pp. 236–272
Presents alternate accuracy-based arguments for synchronic and diachronic Bayesian norms.
Kenny Easwaran (2013). Expected Accuracy Supports Conditionalization—and Conglomerability and Reflection. Philosophy of Science 80, pp. 119–142
Shows how expected inaccuracy minimization can be extended in the infinite case to support such controversial norms as Reflection and Conglomerability.
Hilary Greaves (2013). Epistemic Decision Theory. Mind 122, pp. 915–52
Jennifer Carr (ms). Epistemic Utility Theory and the Aim of Belief. Unpublished manuscript.
Selim Berker (2013). Epistemic Teleology and the Separateness of Propositions. Philosophical Review 122, pp. 337–93
These papers criticize the teleological epistemology of accuracy-based arguments for rational constraints.
Notes
1 In Chapter 9 I suggested that rational appraisals concern how things look from the agent's own point of view. (It was important that an agent be able to tell for herself that her credences left her susceptible to a Dutch Book.) An agent isn't typically in a position to assess the accuracy of her own beliefs, since she doesn't have access to the truth-values of all propositions. This makes the a priori aspect of the argument for Belief Consistency crucial—an agent with inconsistent beliefs can see from her own standpoint that at least some of those beliefs are false, without invoking contingent facts not necessarily at her disposal.
2 There's also been some interesting empirical research on how well-calibrated agents' credences are in the real world. A robust finding is that everyday people tend to be overconfident in their opinions—only, say, 70% of the propositions to which they assign credence 0.9 turn out to be true. (For a literature survey see (Lichtenstein, Fischoff, and Phillips 1982).) On the other hand, Murphy and Winkler (1977) found weather forecasters' precipitation predictions to be fairly well-calibrated—even before the introduction of computer, satellite, and radar improvements we've made since the 1970s!
3 Like so many notions in Bayesian Epistemology, the idea of accuracy as calibration was hinted at in Ramsey. In the latter half of his (1931), Ramsey asks what it would be for credences "to be consistent not merely with one another but also with the facts." (p. 93) He later writes, "Granting that [an agent] is going to think always in the same way about all yellow toadstools, we can ask what degree of confidence it would be best for him to have that they are unwholesome. And the answer is that it will in general be best for his degree of belief that a yellow toadstool is unwholesome to be equal to the proportion of yellow toadstools which are in fact unwholesome." (p. 97)
4 This example is taken from (Joyce 1998).
5 If you're a Regularity devotee (Section 4.2), you may think the forecaster shouldn't assign absolute certainty to snow—what she sees out the window could be clever Hollywood staging! Setting the forecaster's credence in snow to 1 makes the numbers in this example easier, but the same point could be made using an example with regular credences.
6 Compare the practice in statistics of treating a proposition as a dichotomous random variable with value 1 if true and 0 if false.
7 Notice that we're keeping the numerical values of the distribution cr constant as we measure inaccuracy relative to different possible worlds. Ibr(cr, ω) doesn't somehow measure the inaccuracy in world ω of the credence distribution the agent would have in that world. Instead, given a particular credence distribution cr of interest to us, we will use Ibr(cr, ω) to measure how inaccurate that very numerical distribution is relative to each of a number of distinct possible worlds.
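To make note 7 concrete, here is a tiny Python sketch (my own, with invented names, not the book's notation): one fixed numerical distribution cr is scored against several different worlds, and only the world argument varies.

```python
def brier_inaccuracy(cr, world):
    # cr: credences for each proposition; world: that proposition's truth-value (1 or 0)
    return sum((world[p] - cr[p]) ** 2 for p in cr)

cr = {"X": 0.7, "Y": 0.2}          # one fixed credence distribution
for world in ({"X": 1, "Y": 1}, {"X": 1, "Y": 0}, {"X": 0, "Y": 1}, {"X": 0, "Y": 0}):
    print(world, brier_inaccuracy(cr, world))
```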
8 Named after George Brier—another meteorologist!—who discussed it in his (1950).
9 The crucial point turns out to be that a disparately-weighted Brier score is still a "proper scoring rule" (about which much more later). This is proven as a lemma called "Stability" at (Greaves and Wallace 2006, p. 627).
10 The second part of the Gradational Accuracy Theorem stands to the first part much as the Converse Dutch Book Theorem stands to the Dutch Book Theorem (Chapter 9).
11 Strictly speaking there are four world-types here, a world being assigned to a type according to the truth-values it gives X and Y. But since all the worlds of a particular type will enter into accuracy calculations in the same way, I will simplify discussion by pretending there is exactly one world in each type.
12 Notice that a similar argument could be made for any cr lying outside the square defined by ω4, ω2, ω3, and ω1. So this argument also shows how to accuracy-dominate a distribution that violates our Maximum rule. Now one might wonder why we need an argument that credence-values below 0 or above 1 are irrational—didn't we stipulate our scale for measuring degrees of belief such that no value could ever fall outside that range? On some ways of understanding credence, arguments for Non-Negativity are indeed superfluous. But one might define credences purely in terms of their role in generating preferences (as discussed in Chapter 8) or in sanctioning bets (as discussed in Chapter 9), in which case there would be no immediate reason why a credence couldn't take on a value below zero.
13 Suppose you assign credences to three propositions X, Y, and Z such that X and Y are mutually exclusive and Z ⟚ X ∨ Y. We establish X-, Y-, and Z-axes, then notice that only three points in this space represent logically possible worlds: (0, 0, 0), (1, 0, 1), and (0, 1, 1). The distributions in this space satisfying Finite Additivity all lie on the plane passing through those three points. If your credence distribution cr violates Finite Additivity, it will not lie on that plane. We can accuracy-dominate it with distribution cr′ that is the closest point to cr lying on the plane. If you pick any one of the three logically possible worlds (call it ω), it will form a right triangle with cr and cr′, with the segment from cr to ω as the hypotenuse and the segment from cr′ to ω as a leg. That makes cr′ closer than cr to ω.
14 To give the reader a sense of how the second part of the Gradational Accuracy Theorem is proven, I will now argue that no point lying inside the box in Figure 10.4 and on the illustrated diagonal may be accuracy-dominated with respect to worlds ω2 and ω3. In other words, I'll show how satisfying Negation wards off accuracy domination (assuming one measures inaccuracy by the Brier score). Start with distribution cr′ in Figure 10.4, which lies on the diagonal and therefore satisfies Negation. Imagine drawing two circles through cr′, one centered on ω2 and the other centered on ω3. To improve upon the accuracy of cr′ in ω2, one would have to choose a distribution closer to ω2 than cr′—in other words, a distribution lying inside the circle centered on ω2. To improve upon the accuracy of cr′ in ω3, one would have to choose a distribution lying inside the circle centered on ω3. But since cr′ lies on the line connecting ω2 and ω3, those circles are tangent to each other at cr′, so there is no point lying inside both circles. Thus no distribution is more accurate than cr′ in both ω2 and ω3.
15 The dashed line is like a contour line on a topographical map. There, every dashed point on a given contour line lies at the same altitude. Here, every dashed point has the same level of inaccuracy.
16 Here I'm employing a convention that "cr(ω1)" is the value cr assigns to the proposition that X and Y have the truth-values they possess in world ω1. In other words, cr(ω1) is the cr-value on the first line of the probability table.
17 Readers familiar with decision theory (perhaps from Chapter 7) may notice that the expected-inaccuracy calculation of Equation (10.6) strongly resembles Savage's formula for calculating expected utilities. Here a "state" is a possible world ωi that might be actual, an "act" is assigning a particular credence distribution cr′, and an "outcome" is the inaccuracy that results if ωi is actual and one assigns cr′. Savage's expected utility formula was abandoned by Jeffrey because it yielded implausible results when states and acts were not independent. Might we have a similar concern about Equation (10.6)? What if the act of assigning a particular credence distribution is not independent of the state that a particular one of the ωi obtains? Should we move to a Jeffrey-style expected inaccuracy calculation, and perhaps from there to some analogue of Causal Decision Theory? As of this writing, this question is only just beginning to be explored in the accuracy literature, in articles such as (Greaves 2013) and (Konek and Levinstein ms).
18 Notice that Admissibles Not Defeated entails our earlier principle Admissibles Not Dominated. If distribution cr′ accuracy-dominates distribution cr, it will also have a lower expected inaccuracy than cr from cr's point of view (because it will have a lower inaccuracy in every possible world). So being accuracy-dominated is a particularly bad way of being defeated in expectation. (As in sports, it's bad enough to get defeated, but even worse to get dominated.) Admissibles Not Defeated says that permissible credence distributions are never defeated in expectation; this entails that they are also never dominated.
19 On a proper scoring rule, a probabilistic agent will always expect her own accuracy to be better than that of any other distribution. On the absolute-value rule, a probabilistic agent will sometimes expect other distributions to be better than her own. Some scoring rules fall in the middle: on such rules, a probabilistic agent will never expect anyone else to do better than herself, but she may find other distributions whose expected accuracy is tied with her own. To highlight this case, some authors distinguish "strictly proper" scoring rules from just "proper" ones. On a strictly proper scoring rule a probabilistic agent will never find any other distribution that ties hers for accuracy expectation; a merely proper rule allows such ties. I am using the term "proper" the way these authors use "strictly proper". For an assessment of how the distinction between propriety and strict propriety interacts with the results of this chapter and with varying notions of accuracy dominance (such as "strong" vs. "weak" accuracy domination), see (Schervish, Seidenfeld, and Kadane 2009).
20 This rule is only intended to be applied for cr-values between 0 and 1 (inclusive).
21 From a Linearity-In, Linearity-Out point of view, Joyce's (2009) argument does have one advantage over attempts to favor the Brier score using propriety considerations. If you're truly worried about making linear assumptions in the process of establishing probabilism, you might be concerned that Admissibles Not Defeated centers around linear expectations of inaccuracy. Joyce's (2009) argument runs from his premise to probabilism using only Admissibles Not Dominated along the way, and without invoking Admissibles Not Defeated at all.
22 See note 4 in Chapter 5.
23 Compare (Fine 1973, Sect. IIID).
24 See (Hájek 2009a) for a very different kind of objection to Joyce's argument.
25 I'm inclined to read Lindley's own interpretation of his result along these lines. For one thing, Lindley titles his paper "Scoring Rules and the Inevitability of Probability". For another, after noting on page 8 that Admissibles Not Defeated is a kind of Pareto optimality rule, he writes that an agent who chooses any of the distributions permitted by that rule and a minimally-acceptable scoring rule is thereby "effectively introducing probabilities".
26 The same goes for Bayesian results mentioned in Chapter 6, note 38 showing that a red herring cannot confirm the ravens hypothesis to anything more than an exceedingly weak degree. These results depend on particular credal differences and ratios being "minute" in absolute terms, so they might go through for Mr. Prob but not for Mr. Bold (or vice versa).
27 Since Mr. Bold's credences are the square-root of Mr. Prob's, an obvious move would be to take whatever confirmation measure Mr. Prob uses and replace all of its credal expressions with their squares.
28 As we'll see, the Greaves and Wallace result focuses on minimizing expected inaccuracy. For Conditionalization arguments based on accuracy-domination, see (Briggs ms) and (Williams ms). For an alternative expected-accuracy approach to updating, see (Leitgeb and Pettigrew 2010a,b).
29 It's important that Greaves and Wallace restrict attention to what they call "available" updating plans. Available plans guide an agent's credal response to her total evidence (including the evidence she imagines she'll receive); they do not allow an agent to set her credences based on further factors not in evidence. For instance, consider the updating plan according to which an agent magically assigns credence 1 to each proposition just in case it's true and credence 0 just in case it's false—even if her evidence isn't fine-grained enough to indicate the truth-values of all the relevant propositions. This would be an excellent plan in terms of minimizing inaccuracy, but isn't a feasible updating strategy for an agent going forward. This updating plan does not count as available in Greaves and Wallace's sense, and so does not compete with Conditionalization for the most accurate updating plan.
30 Like Reflection, the resulting norm is a synchronic requirement on the agent's attitudes towards diachronic contents.
Part V
Challenges and Objections
Glossary
accuracy a doxastic attitude adopted towards a particular proposition is accurate to the extent that it appropriately reflects the truth-value of that proposition. 315
accuracy domination given distributions cr and cr′ over the same set of propositions, cr′ accuracy-dominates cr just in case cr′ is less inaccurate than cr in each and every logically possible world. 325
Accuracy Updating Theorem For any proper scoring rule, probabilistic distribution cri, and evidential partition in L, an agent's ti expected inaccuracy for updating by Conditionalization will be lower than that of any updating plan that diverges from it. 345
act In a decision problem, an agent must choose exactly one of the available acts. Depending on the state of the world, that act will produce one of a number of outcomes, to which the agent may assign varying utilities. 230
actual world the possible world in which we live. Events that actually happen happen in the actual world. 28
admissible evidence evidence that, if it has any effect on an agent's credence about the outcome of an event, does so by way of affecting the agent's credences about the outcome's objective chance. 133
affine transformation Two measurement scales are related by an affine transformation when values on one scale can be obtained by multiplying values on the other scale by a particular constant, then adding another specified constant. The Fahrenheit and Celsius scales for temperature provide one example. 272
Allais' Paradox a set of gambles for which subjects' intuitions often fail to satisfy the Sure-Thing Principle; proposed by Maurice Allais as a counterexample to standard decision theory. 240
analogical effects a cluster of effects involving analogical reasoning, such as: the degree to which evidence that one object has a property confirms that another object has that property should increase in light of information that the objects have other properties in common. 194
analyst expert expert to which one defers because of her skill at forming attitudes on the basis of evidence. 140
antecedent In a conditional of the form "If P, then Q," P is the antecedent. 28
atomic proposition a proposition in language L that does not contain any connectives or quantifiers. An atomic proposition is usually represented either as a single capital letter (P, Q, R, etc.) or as a predicate applied to some constants (Fa, Lab, etc.). 28
Base Rate Fallacy assigning a posterior credence to a hypothesis that over-emphasizes the likelihoods associated with one's evidence and under-emphasizes one's prior in the hypothesis. 99
Bayes Factor for a given piece of evidence, the ratio of the likelihood of the hypothesis to the likelihood of the catchall. An update by Conditionalization multiplies your odds for the hypothesis by the Bayes factor. 99
Bayes Net a diagram of causal relations between variables developed from information about probabilistic dependencies among them. 77
Bayes' Theorem for any H and E in L, cr(H | E) = cr(E | H) · cr(H) / cr(E). 63
Belief Closure If some subset of the propositions an agent believes entails a further proposition, rationality requires the agent to believe that further proposition as well. 7
Belief Consistency Rationality requires the set of propositions an agent believes to be logically consistent. 7
Bertrand's Paradox When asked how probable it is that a chord of a circle is longer than the side of an inscribed equilateral triangle, the Principle of Indifference produces different answers depending on how the chord is specified. 145
Brier score a scoring rule that measures the inaccuracy of a distribution by its Euclidean distance from the truth. 323
calibration a credence distribution over a finite set of propositions is perfectly calibrated when, for any x, the set of propositions to which the distribution assigns credence x contains exactly fraction x of truths. 317
catchall the proposition that the hypothesis H under consideration is false (in other words, the proposition ~H). 66
Causal Decision Theory decision theory in which expected utility depends on an act's causal tendency to promote various outcomes. 246
classificatory concept places an entity in one of a small number of kinds. 4
coherent A coherent credence distribution satisfies Kolmogorov's probability axioms. 259
common cause a single event that causally influences at least two other events. 74
commutative Updating by Conditionalization is commutative in the sense that updating first on E then on E′ has the same effect as updating in the opposite order. 95
comparative concept places one entity in order with respect to another. 4
Comparative Entailment For propositions P and Q, if P ⊨ Q then rationality requires an agent to be at least as confident of Q as P. 11
condition in a conditional credence, the proposition the agent supposes. 58
conditional bet a conditional bet on P given Q wins or loses money for the agent only if Q is true; if Q is false the bet is called off. An agent's fair betting price for a conditional bet that pays $1 on P (given Q) is typically cr(P | Q). 293
conditional credence a degree of belief assigned to an ordered pair of propositions, indicating how confident the agent is that the first proposition is true on the supposition that the second is. 58
conditional independence When cr(Q & R) > 0, P is probabilistically independent of Q conditional on R just in case cr(P | Q & R) = cr(P | R). 69
Conditionalization for any time ti and later time tj, if proposition E in L represents everything the agent learns between ti and tj and cri(E) > 0, then for any H in L, crj(H) = cri(H | E). (Bayesians' traditional updating rule). 92
confirmation Evidence confirms a hypothesis just in case the evidence supports that hypothesis (to any degree). 175
confirmation measure a numerical measure of the degree to which evidence E confirms hypothesis H relative to probability distribution Pr. 205
Confirmation Transitivity for any A, B, C, and K in L, if A confirms B relative to K and B confirms C relative to K, then A confirms C relative to K. 182
Conglomerability For each proposition P and partition Q1, Q2, Q3, ... in L, cr(P) is no greater than the largest cr(P | Qi) and no less than the least cr(P | Qi). 151
conjunction P & Q is a conjunction; P and Q are its conjuncts. 28
Conjunction Fallacy being more confident in a conjunction than you are in one of its conjuncts. 40
Consequence Condition if E in L confirms every member of a set of propositions relative to K and that set jointly entails H′ relative to K, then E confirms H′ relative to K. 183
consequent In a conditional of the form "If P, then Q," Q is the consequent. 28
Consistency Condition for any E and K in L, the set of all hypotheses confirmed by E relative to K is logically consistent with E & K . 185 consistent The propositions in a set are consistent when at least one possible world makes all the propositions true. 31
constant a lower-case letter in language L representing an object in the universe of discourse. 33
constant act a decision-theoretic act that produces the same outcome for an agent regardless which state of the world obtains. 267
contingent a proposition that is neither a tautology nor a contradiction. 31
contradiction a proposition that is false in every possible world. 31
Contradiction rule for any contradiction F in L, cr(F) = 0. 36
Converse Consequence Condition for any E, H, H′, and K (with H′ consistent with K), if E confirms H relative to K and H′ & K ⊨ H, then E confirms H′ relative to K. 186
Converse Dutch Book Theorem a theorem showing that if an agent satisfies particular constraints on her credences, she will not be susceptible to a particular kind of Dutch Book. 296
Converse Entailment Condition for any consistent E, H, and K in L, if H & K ⊨ E but K ⊭ E, then E confirms H relative to K. 186
Countable Additivity For any countable partition Q1, Q2, Q3, ... in L, cr(Q1 ∨ Q2 ∨ Q3 ∨ ...) = cr(Q1) + cr(Q2) + cr(Q3) + .... 151
credence degree of belief. 4
credence elicitation structuring incentives so that rational agents will report the truth about the credence values they assign. 319
cumulative Updating by Conditionalization is cumulative in the sense that updating first on evidence E and then on evidence E′ has the same net effect as updating once, on the conjunction E & E′. 95
Czech Book a set of bets, each placed with an agent at her fair betting price (or better), that together guarantee her a sure gain come what may. 297
database expert expert to which one defers because her evidence includes one's own, and more. 140
decision problem a situation in which an agent must choose exactly one out of a partition of available acts, in hopes of attaining particular outcomes. Decision problems are the targets of analysis in decision theory. 230
decision theory searches for rational principles to evaluate the acts available to an agent in a decision problem. 225
Decomposition rule for any propositions P and Q in L, cr(P) = cr(P & Q) + cr(P & ~Q). 36
decreasing marginal utility When a quantity has decreasing marginal utility, less utility is derived from each additional unit of that quantity the more units you already have. Economists often suggest that money has decreasing marginal utility for the typical agent. 230
defeat in expectation given distributions cr and cr′ over the same set of propositions, cr′ defeats cr in expectation if cr calculates a lower expected inaccuracy for cr′ than it does for cr. 334
deference principle any principle directing an agent to align her current credences with some other distribution (such as objective chances, credences of an expert, or credences of her future self). 139
direct inference determining how likely one is to obtain a particular experimental result from probabilistic hypotheses about the setup. 63
Disconfirmation Duality for any E, H, and K in L, E confirms H relative to K just in case E disconfirms ~H relative to K. 187
disjunction P ∨ Q is a disjunction; P and Q are its disjuncts. 28
disjunctive normal form The disjunctive normal form of a non-contradictory proposition is the disjunction of state-descriptions that is equivalent to that proposition. 32 distribution an assignment of real numbers to each proposition in language L. 34 Dominance Principle if act A produces a higher-utility outcome than act B in each possible state of the world, then A is preferred to B . 235 doxastic attitude a belief-like representational propositional attitude. 4
doxastically possible worlds the subset of possible worlds that a given agent entertains. 39
Dutch Book a set of bets, each placed with an agent at her fair betting price (or better), that together guarantee her a sure loss come what may. 291
Dutch Book Theorem If an agent's credences violate at least one of the probability axioms (Non-Negativity, Normality, or Finite Additivity), a Dutch Book can be constructed against her. 291
Dutch Strategy a strategy for placing different sets of bets with an agent over a period of time, depending on what the agent learns during that period of time. If the strategy is implemented correctly, the bets placed will guarantee the agent a sure loss come what may. 295
entailment P entails Q (P ⊨ Q) just in case there is no possible world in which P is true and Q is false. On a Venn diagram, the P-region is wholly contained in the Q-region. 30
Entailment Condition for any consistent E, H, and K in L, if E & K ⊨ H but K ⊭ H, then E confirms H relative to K. 181
Entailment rule for any propositions P and Q in L, if P ⊨ Q then cr(P) ≤ cr(Q). 36
epistemic utility a numerical measure of the epistemic value of a set of doxastic attitudes. 324
Equivalence Condition Suppose H ⟚ H′, E ⟚ E′, and K ⟚ K′. Then E confirms (/disconfirms) H relative to background K just in case E′ confirms (/disconfirms) H′ relative to background K′. 181
Equivalence rule for any propositions P and Q in L, if P ⟚ Q then cr(P) = cr(Q). 36
equivalent Equivalent propositions are associated with the same set of possible worlds. 30
ethically neutral A proposition P is ethically neutral for an agent if the agent is indifferent between any two gambles whose outcomes differ only in replacing P with ~P. 265
Evidential Decision Theory decision theory in which expected utility is calculated using an agent's credences in states conditional on the available acts. 244
evidential probability the degree to which a body of evidence probabilifies a hypothesis, understood as independent of any particular agent's attitudes. 129
evidential standards Applying an agent's ultimate evidential standards to her total evidence at a given time yields her doxastic attitudes at that time. Bayesians represent ultimate evidential standards as hypothetical priors. 109
evidentialism the position that what attitudes are rationally permissible for an agent supervene on her evidence. 127
exhaustive The propositions in a set are jointly exhaustive if each possible world makes at least one of the propositions in the set true. 31
expectation An agent's expectation for the value of a particular quantity is a weighted average of the values that quantity might take, with weights provided by the agent's credences across those possible values. 226
Extensional Equivalence If two betting arrangements have the same payoff in every possible world, a rational agent will value them equally. 305
fair price An agent's break-even point for a bet or investment. She will be willing to pay anything up to that amount of money in exchange for the bet/investment. 227
falsification A piece of evidence falsifies a hypothesis if it refutes that hypothesis relative to one's background assumptions. 69
Finite Additivity for any mutually exclusive propositions P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q). (one of the three probability axioms). 34
Finite Additivity (Extended) for any finite set of mutually exclusive propositions {P1, P2, ..., Pn}, cr(P1 ∨ P2 ∨ ... ∨ Pn) = cr(P1) + cr(P2) + ... + cr(Pn). 36
firmness concept of confirmation E confirms H relative to K just in case a probability distribution built on background K makes the probability of H on E high. 188 frequency theory an interpretation of probability according to which the probability is x that event A will have outcome B just in case fraction x of events like A have outcomes like B . 123 Gambler’s Fallacy expecting later outcomes of an experiment to “compensate” for unexpected previous results despite the probabilistic independence of future results from those in the past. 70
General Additivity rule for any propositions P and Q in L, cr(P ∨ Q) = cr(P) + cr(Q) − cr(P & Q). 36
Gradational Accuracy Theorem Given a credence distribution cr over a finite set of propositions X1, X2, ..., Xn, if we use the Brier score Ibr(cr, ω) to measure inaccuracy then: (1) If cr does not satisfy the probability axioms, there exists a probabilistic distribution cr′ over the same propositions such that Ibr(cr′, ω) < Ibr(cr, ω) in every logically possible world ω; and (2) If cr does satisfy the probability axioms, no such cr′ exists. 325
higher-order credences an agent's credences about her own current credences. (Includes both her credences about what her current credence values are and her credences about what those values should be.) 169
Humphreys' Paradox difficulty for the propensity interpretation of probability that when the probability of E given H can be understood in terms of propensities it is often difficult to interpret the probability of H given E as a propensity as well. 166
Hypothesis Symmetry for all H and E in L and every probabilistic Pr, c(H, E) = −c(~H, E). 208
hypothetical frequency theory interpretation of probability that looks not at the proportion of actual events producing a particular outcome but instead at the proportion of such events that would produce that outcome in the limit. 125
hypothetical prior distribution a regular, probabilistic distribution used to represent an agent's ultimate evidential standards. The agent's credence distribution at a given time can be recovered by conditionalizing her hypothetical prior on her total evidence at that time. 111
Hypothetical Priors Theorem Given any finite series of credence distributions cr1, cr2, ..., crn, each of which satisfies the probability axioms and Ratio Formula, let Ei be a conjunction of the agent's total evidence at ti. If the cri update by Conditionalization, then there exists at least one regular probability distribution PrH such that for all 1 ≤ i ≤ n, cri(·) = PrH(· | Ei). 111
hypothetico-deductivism theory of confirmation on which E confirms H relative to background corpus K just in case H & K ⊨ E and K ⊭ E. 221
IID trials independent, identically distributed probabilistic events. Trials are IID if the probabilities associated with a given trial are unaffected by the outcomes of other trials (independence), and if each trial has the same probability of producing particular outcomes as every other trial does (identically distributed). 89, 254 inconsistent The propositions in a set are inconsistent when there is no possible world in which all of them are true. 31 increase in firmness concept of confirmation E confirms H relative to K just in case a probability distribution built on K makes the posterior of H on E higher than the prior of H . 188
independence When cr(Q) > 0, proposition P is probabilistically independent of proposition Q relative to cr just in case cr(P | Q) = cr(P). 67
infinitesimal a number that is greater than zero but less than any positive real number. 153 initial prior distribution credence distribution assigned by an agent before she possessed any contingent evidence. 107 interference effect any effect of placing the initial bets in a Dutch Book that makes an agent unwilling to accept the remaining bets (which she otherwise would have regarded as fair). 303
interpretations of probability philosophical theories about the nature of probability and the meaning of linguistic probability expressions. 122
inverse inference determining how likely a probabilistic hypothesis is on the basis of a particular run of experimental data. 64
irrelevant probabilistically independent. 67
Jeffrey Conditionalization Proposed by Richard C. Jeffrey as an alternative updating rule to Conditionalization, holds that for any ti and tj with i < j, any A in L, and a finite partition B1, B2, ..., Bn in L whose elements each have nonzero cri, crj(A) = cri(A | B1) · crj(B1) + cri(A | B2) · crj(B2) + ... + cri(A | Bn) · crj(Bn). 155
Judy Benjamin Problem An example proposed by Bas van Fraassen in which an agent's experience directly alters some of her conditional credence values. van Fraassen argued that this example could not be addressed by traditional Conditionalization or by Jeffrey Conditionalization. 159
just in case if and only if. 28
Kolmogorov's axioms the three axioms (Non-Negativity, Normality, and Finite Additivity) that provide necessary and sufficient conditions for a probability distribution. 34
language dependence A theory is language dependent if it ascribes conflicting properties to the same propositions when those propositions are expressed in different languages. 198
law of large numbers any one of a number of mathematical results indicating roughly the following: the probability is 1 that as the number of trials approaches the limit, the average value of a quantity will approach its expected value. 226
Law of Total Probability for any proposition P and finite partition Q1, Q2, ..., Qn in L, cr(P) = cr(P | Q1) · cr(Q1) + cr(P | Q2) · cr(Q2) + ... + cr(P | Qn) · cr(Qn). 62
likelihood the probability of some particular piece of evidence on the supposition of a particular hypothesis—cr(E | H). 63
Lockean thesis connects believing a proposition with having a degree of confidence in that proposition above a numerical threshold. 15
logical probability the degree to which a body of evidence probabilifies a hypothesis, understood as a logical relation similar to deductive entailment. 129
Logicality All entailments receive the same degree of confirmation, and have a higher degree of confirmation than any non-entailing confirmations. 209
Lottery Paradox paradox for requirements of logical belief consistency and closure involving a lottery with a large number of tickets. 8
material biconditional A material biconditional P ≡ Q is true just in case P and Q are both true or P and Q are both false. 28
material conditional A material conditional P ⊃ Q is false just in case its antecedent P is true and its consequent Q is false. 28
Maximality rule for any proposition P in L, cr(P) ≤ 1. 36
maximin rule decision rule that prefers the act with the highest minimum payoff. 231
Maximum Entropy Principle Given any partition of the space of possibilities, and any set of constraints on allowable credence distributions over that partition, the Maximum Entropy Principle selects the allowable distribution with the highest entropy. 146
money pump a situation in which an agent's preferences endorse her making a series of decisions, the net effect of which is to cost her a great deal of utility but otherwise leave her exactly where she began. Money pumps are used to argue that preferences violating Preference Transitivity or Preference Asymmetry are irrational. 233
Monty Hall Problem a famous probabilistic puzzle case, demonstrating the importance of taking an agent's total evidence into account. 104
Multiplication rule When P and Q have nonextreme cr-values, P and Q are probabilistically independent relative to cr if and only if cr(P & Q) = cr(P) · cr(Q). 68
mutually exclusive The propositions in a set are mutually exclusive when there is no possible world in which more than one of the propositions is true. 31
negation ~P is the negation of P. 28
Negation rule for any proposition P in L, cr(~P) = 1 − cr(P). 35
negative instance Fa & ~Ga is a negative instance of the universal generalization (∀x)(Fx ⊃ Gx). 177
negative relevance When cr(Q) > 0, Q is negatively relevant to P relative to cr just in case cr(P | Q) < cr(P). 68
Newcomb's Problem a puzzle that prompted the introduction of Causal Decision Theory. Introduced to philosophy by Robert Nozick, who attributed its construction to William Newcomb. 242
Nicod's Criterion for any predicates F and G and constant a of L, (∀x)(Fx ⊃ Gx) is confirmed by Fa & Ga and disconfirmed by Fa & ~Ga. 177
Non-Negativity for any proposition P in L, cr(P) ≥ 0. (one of the three probability axioms). 34
nonmonotonicity Probabilistic relations are nonmonotonic in the sense that even if H is highly probable given E, H might be improbable given the conjunction of E with some E′. 103
Normality for any tautology T in L, cr(T) = 1. (one of the three probability axioms). 34
normalization factor In an update by Conditionalization, state-descriptions inconsistent with E (the evidence learned) have their unconditional credences sent to zero. The remaining state-descriptions all have their unconditional credences multiplied by the same normalization factor, equal to the reciprocal of E's prior. 119
normative distinction The normative distinction between Subjective and Objective Bayesians concerns the strength of rationality's requirements. Distinguished this way, Objective Bayesians hold that there is exactly one rationally-permissible set of evidential standards (/hypothetical priors), so that any body of total evidence gives rise to a
unique rational attitude towards any particular proposition. Subjective Bayesians deny that rational requirements are strong enough to mandate a unique attitude in every case. 127
objective chance a type of physical probability that can be applied to the single case. 126
observation selection effect the effect that the manner in which evidence was obtained (say, the method by which a sample was drawn) may have on the appropriate conclusions to draw from that evidence. 103
odds If an agent's unconditional credence in P is cr(P), her odds for P are cr(P) : cr(~P), and her odds against P are cr(~P) : cr(P). 46
ordering formal structure introduced by a comparative relation over a particular set. For example, specifying pairs of propositions for which I am more confident in one than the other introduces a confidence ordering over the set of propositions. 11
outcome the result of an agent's performing a particular act with the world in a particular state. Agents assign utilities to outcomes. 233
Package Principle A rational agent's value for a package of bets equals the sum of her values for the individual bets it contains. 305
Paradox of the Ravens counterintuitive consequence of many formal theories of confirmation that the proposition that a particular object is a non-black non-raven confirms the hypothesis that all ravens are black. 178
partition a mutually exclusive, jointly exhaustive set of propositions. On a Venn diagram, the regions representing propositions in a partition combine to fill the entire rectangle without overlapping at any point. 31
Partition rule for any finite partition of propositions in L, the sum of their unconditional cr-values is 1. 36
permissive case an example in which two agents with identical total evidence assign different credences without either agent's thereby being irrational. Objective Bayesians in the normative sense deny the existence of permissive cases. 129
positive instance Fa & Ga is a positive instance of the universal generalization (∀x)(Fx ⊃ Gx). 177
positive relevance When cr(Q) > 0, Q is positively relevant to P relative to cr just in case cr(P | Q) > cr(P). 68
possible worlds different ways the world might have come out. Possible worlds are maximally specified—for any event and any possible world that event either does or does not occur in that world—and the possible worlds are plentiful enough such that for any combination of events that could happen, there is a possible world in which that combination of events does happen. 28
posterior the probability of some hypothesis on the supposition of a particular piece of evidence—cr(H | E). 65
practical rationality concerns the connections between attitudes and actions. 7
predicate a capital letter representing a property or relation in language L. 33
Preface Paradox paradox for requirements of logical belief consistency and closure in which the preface to a nonfiction book asserts that at least one of the claims in the book is false. 9
Preference Asymmetry condition there do not exist acts A and B such that the agent both prefers A to B and prefers B to A. 232
preference axioms formal constraints we assume a rational agent's preferences satisfy in order to apply a representation theorem. 270
Preference Completeness for any acts A and B, exactly one of the following is true: the agent prefers A to B, the agent prefers B to A, or the agent is indifferent between the two. 232
Preference Transitivity condition for any acts A, B, and C, if the agent prefers A to B and B to C, then the agent prefers A to C. 231
Principal Principle David Lewis's proposal for how rational credences concerning an event incorporate suppositions about the objective chances of that event's possible outcomes. 134
Principle of Indifference if an agent has no evidence favoring any proposition in a partition over any other, he should spread his credence equally over the members of the partition. 144
Principle of Total Evidence a rational agent's credence distribution takes into account all the evidence available to her. 103
prior an unconditional probability; the probability of a proposition before anything has been supposed. For example, an agent's prior credence in a particular hypothesis H is cr(H). 65
probabilism the thesis that rationality requires an agent's credences to satisfy the probability axioms. 35
probabilistically independent When cr(Q) > 0, P is probabilistically independent of Q relative to cr just in case cr(P | Q) = cr(P). 67
probability axioms Kolmogorov’s axioms. 34 probability distribution any distribution satisfying Kolmogorov’s probability axioms. 34 probability kinematics what Richard C. Jeffrey, its inventor, called the updating rule now generally known as “Jeffrey Conditionalization”. 155 probability table a table that assigns unconditional credences to each member in a partition. To satisfy the probability axioms, the values in each row must be non-negative and all the values must sum to 1. When the partition mem bers are state-des criptions of a language L , the values in the probability table suffice to specify all of the agent’s credences over L. 43 problem of irrelevant conjunction counterintuitive consequence of many formal theories of confirmation that whenever evidence E confirms hypothesis H it will also confirm H & X for various X s irrelevant to E and H . 187 problem of the single case the challenge of interpreting probability such that single (and perhaps non-repeatable) events may receive nonextreme probabilities. 125 propensity theory interpretation of probability identifying probability with a physical arrangement’s quantifiable tendency to produce outcomes of a particular kind. 125
Proper Scoring Rule A scoring rule is proper just in case any agent with a probabilistic credence distribution who uses that rule assigns her own credences a lower expected inaccuracy than any other distribution over the same set of propositions. 337 proposition an abstract entity expressible by a declarative sentence and capable of having a truth-value. 3 propositional attitude an attitude adopted by an agent towards a proposition or set of propositions. 3
propositional connective one of five truth-functional symbols (~, &, ∨, ⊃, ≡) used to construct larger propositions from atomic propositions. 28
quantitative concept characterizes an entity by ascribing it a numerical value. 4 ratifiability decision-theoretic requirement that an act is rationally permissible only if the agent assigns it the highest expected utility conditional on the supposition that she chooses to perform it. 249
Ratio Formula for any P and Q in L, if cr(Q) > 0 then cr(P | Q) = cr(P & Q) / cr(Q). The Bayesian rational constraint relating an agent's conditional credences to her unconditional credences. 59
reference class problem when considering a particular event and one of its possible outcomes, the frequency with which this type of event produces that type of outcome depends on which reference class (event-type) we choose out of the many to which the event belongs. 124
Reflection Principle For any proposition A in L, real number x, and times ti and tj with j > i, rationality requires cri(A | crj(A) = x) = x. 141
refutation P refutes Q just in case P entails ~Q. When P refutes Q, every world that makes P true makes Q false. 30
regular a distribution that does not assign the value 0 to any logically contingent propositions. 101
Regularity Principle In a rational credence distribution, no logically contingent proposition receives unconditional credence 0. 100
relevance measure a confirmation measure that indicates confirmation just in case E is positively relevant to H on Pr; disconfirmation just in case E is negatively relevant to H on Pr; and neither just in case E is independent of H on Pr. 207
relevant not probabilistically independent. 68
Representation Theorem If an agent's preferences satisfy certain constraints, then there exists a unique probabilistic credence distribution and unique utility distribution (up to positive affine transformation) that yield those preferences when the agent maximizes expected utility. 270
Rigidity condition For any A in L and any Bm in the finite partition B1, B2, ..., Bn, crj(A | Bm) = cri(A | Bm). This condition obtains between ti and tj just in case the agent Jeffrey Conditionalizes across B1, B2, ..., Bn. 157
risk aversion preferring an act with lesser expected value because it offers a surer payout. 238
rule of succession Laplace's rule directing an agent who has witnessed h heads on n independent flips of a coin to set credence (h + 1)/(n + 2) that the next flip will come up heads. 167
scalar transformation Two measurement scales are related by a scalar transformation when values on one scale can be converted to values on the other by multiplying by a specified constant. The pound and kilogram scales for mass provide one example. 272
scoring rule a quantitative measure of the accuracy (or inaccuracy) of distributions. 319
screening off R screens off P from Q when P is unconditionally relevant to Q but not relevant to Q conditional on either R or ~R. 70
semantic distinction When classified according to the semantic distinction, Subjective Bayesians take “probability” talk to reveal the credences of agents, while Objective Bayesians assign “probability” assertions truth-conditions independent of the attitudes of particular agents or groups of agents. 127
separable a separable scoring rule measures how far a distribution is from the truth one proposition at a time, then sums the results. 323
sigma algebra a set of sets closed under union, intersection, and complementation. A probability distribution can be assigned over a sigma algebra containing sets of possible worlds instead of over a language containing propositions. 53
Simple Binarist a made-up character who describes agents' doxastic propositional attitudes exclusively in terms of belief, disbelief, and suspension of judgment. 5
Simpson's Paradox Two propositions may be correlated conditional on each member of a partition yet anti-correlated unconditionally. 72
Special Consequence Condition for any E, H, H′, and K in L, if E confirms H relative to K and H & K ⊨ H′, then E confirms H′ relative to K. 183
state as in decision theory, an arrangement of the world (usually represented by a proposition). Which state obtains affects which outcome will be generated by the agent's performing a particular act. 233
state-description a conjunction of language L in which (1) each conjunct is either an atomic proposition of L or its negation; and (2) each atomic proposition of L appears exactly once. 32
straight rule Reichenbach's name for the norm setting an agent's credence that the next event of type A will produce an outcome of type B exactly equal to the observed frequency of B-outcomes in past A-events. 167
strict Conditionalization another name for the Conditionalization updating rule. The "strict" is usually used to emphasize a contrast with Jeffrey Conditionalization. 158
structure-description Given a particular language, a structure-description says how many objects possess each of the available property profiles, but doesn't say which particular objects have which profiles. 194
subadditive In a subadditive distribution, there exist mutually exclusive P and Q in L such that cr(P ∨ Q) < cr(P) + cr(Q). 52
superadditive In a superadditive distribution, there exist mutually exclusive P and Q in L such that cr(P ∨ Q) > cr(P) + cr(Q). 49
supervenience A-properties supervene on B-properties just in case any two objects that differ in their A-properties differ in their B-properties. For example, one's score on a test supervenes on the answers one provides; if two students got different scores on the same test, their answers must have differed. 61
Sure-Thing Principle if two acts yield the same outcome on a particular state, any preference between them remains the same if that outcome is changed. 240
tautological background a background corpus containing no contingent information, logically equivalent to a tautology T. 178
tautology a proposition that is true in every possible world. 31
theoretical rationality evaluates representational attitudes in their capacity as representations, without considering how they influence action. 8
total ordering an ordering in which the comparative relation is applied to every pair of items in the set. 11
Truth-Directedness If an inaccuracy score is truth-directed, altering a distribution by moving some of its values closer to the truth and none of its values farther from the truth will decrease that distribution's inaccuracy. 330
truth-value True and false are truth-values. We assume propositions are capable of having truth-values. 3
unconditional credence an agent's degree of belief in a proposition, without making any suppositions beyond her current background information. 34
Uniqueness Thesis Given any proposition and body of total evidence, there is exactly one attitude it is rationally permissible for agents with that body of total evidence to adopt towards that proposition. 127
universe of discourse the set of objects under discussion. 33
ur-prior alternate name for a hypothetical prior distribution. 111
util a single unit of utility. 229
utility a numerical measure of the degree to which an agent values a particular proposition’s being true. 229
valuation function In a decision problem, the agent’s valuation function combines her credences and utilities to assign each available act a numerical score. The agent then prefers the act with the highest score. 231
Venn Diagram diagram in which an agent’s doxastically possible worlds are represented as points in a rectangle. Propositions are represented by regions containing those points, with the area of a region often representing the agent’s credence in an associated proposition. 29
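One familiar way of filling in the valuation-function entry is to score each act by its expected utility; the Python sketch below uses invented acts, states, credences, and utilities purely for illustration. Other valuation functions combine the same ingredients differently (for instance, by using credences conditional on the acts), so the sketch shows only the general shape of the calculation.

```python
# A sketch of one familiar valuation function: expected utility.
# Acts, states, credences, and utilities below are all invented for illustration.

credences = {"rain": 0.3, "no rain": 0.7}          # credences over a partition of states

utilities = {                                       # utility of each act-state outcome
    "take umbrella": {"rain": 5, "no rain": 3},
    "leave umbrella": {"rain": -10, "no rain": 6},
}

def valuation(act):
    """Combine credences and utilities into a single numerical score for the act."""
    return sum(credences[s] * utilities[act][s] for s in credences)

scores = {act: valuation(act) for act in utilities}
print(scores)                          # take umbrella ≈ 3.6, leave umbrella ≈ 1.2
print(max(scores, key=scores.get))     # the preferred act: 'take umbrella'
```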
Index of Names
Achinstein, Peter, 195 Adams, Ernest, 81, 295 Alchourrón, Carlos E., 21 Allais, Maurice, 240 Armendt, Brad, 295, 299 Arntzenius, Frank, 87
Carnap, Rudolf, 4, 54, 103, 118, 128, 166, 175, 188–202, 208, 218, 219 Carr, Jennifer, 351 Cartwright, Nancy, 89 Chihara, Charles, 210, 213 Christensen, David, 16, 20, 277, 284, 299–301, 309, 312, 313 Clatterbuck, Hayley, 88 Comesaña, Juan, 221 Cramer, Gabriel, 254 Crupi, Vincenzo, 209, 214, 217, 220
Bartha, Paul, 120 Bayes, Reverend Thomas, 63 Bell, Elizabeth, 21 Bengson, John, 20 Bergmann, Merrie, 52 Berker, Selim, 351 Bernoulli, Daniel, 254 Bernoulli, Jacob, 122 Bernoulli, Nicolas, 250 Bertrand, Joseph, 145 Bickel, P.J., 72 Bolzano, Bernard, 128 Bovens, Luc, 174 Bradley, Darren, 120, 221 Bramley-Moore, L., 89
Dalkey, Norman, 254 Davidson, Donald, 20, 254 de Finetti, Bruno, 111, 129, 163, 263, 277, 291, 298, 312, 313, 324, 349 Descartes, René, 185 Easwaran, Kenny, 165, 274, 316, 350 Eddington, A., 103 Eells, Ellery, 213, 218, 223, 255 Egan, Andy, 166, 249 Elga, Adam, 163 Ellenberg, Jordan, 120, 228 Ellis, Robert Leslie, 123
Brier, George, 352 Briggs, Rachael, 354 Buchak, Lara, 240, 253, 285 Buchanan, B., 223 Campbell, Louise Firth, 167 Cariani, Fabrizio, 90
Feldman, Richard, 120, 127, 182 Feller, William, 254 Fermat, Pierre, 227, 274 Fine, T.L., 222, 353
Hitchcock, Christopher R., 86, 120 Holton, Richard, 21 Hooker, C.A., 222
Fischoff, B., 351 Fishburn, Peter C., 285 Fisher, R.A., 88 Fitelson, Branden, 53, 82, 88, 211, 213, 214, 216, 217, 221, 223, 224 Foley, Richard, 15, 20
Hosiasson-Lindenbaum, Janina, 150, 219 Howson, Colin, 120, 164, 165, 295, 299 Hume, David, 20, 222 Humphreys, Paul, 166 Hájek, Alan, 86, 88, 162, 165, 166, 219, 261, 276
Gärdenfors, Peter, 21 Galavotti, Maria Carla, 54, 88, 111, 162, 167, 222 Gibbard, Allan, 249, 255 Gillies, Donald, 163 Glass, David H., 223 Glymour, Clark, 184, 202 Gonzalez, Michel, 209, 217 Good, I.J., 21, 120, 179, 220 Goodman, Nelson, 195–201, 220 Greaves, Hilary, 343, 350–353 Hájek, Alan, 87, 295–297, 353 Hacking, Ian, 52, 118, 166, 313 Hall, Monty, 103 Hall, Ned, 87, 140 Hammel, E.A., 72 Harper, William, 249, 255 Hart, Casey, 116 Hartmann, Stephan, 174 Hawthorne, James, 211, 213, 217, 224 Hempel, Carl, 222 Hempel, Carl G., 175, 177–187, 195, 210, 219 Hesse, Mary, 195 Heukelom, Floris, 255
Jaynes, E.T., 145, 164 Jeffrey, Richard, 265, 283 Jeffrey, Richard C., 100, 130, 154–159, 165, 235–245, 249, 253 Jeffreys, Harold, 164 Jeter, Derek, 72 Johnson, W.E., 222 Joyce, James M., 89, 169, 219, 253, 255, 265, 303, 319, 324–339, 349–351 Justice, David, 72 Kadane, J.B., 169, 353 Kahneman, Daniel, 40, 51, 98, 213, 240 Kaplan, Mark, 284 Kemeny, John, 296 Kemeny, John G., 223, 295 Keynes, John Maynard, 128, 144, 164 Kim, Jaegwon, 20 Koehler, Derek J., 260 Kolmogorov, Andrey, 34, 53 Konek, Jason, 353 Kornblith, Hilary, 20 Kuhn, Thomas S., 110
Kulkarni, S., 337 Kyburg, Jr, Henry E., 8, 19 Lange, Marc, 170 Laplace, Pierre-Simon, 88, 122, 167 Leblanc, H., 54, 150 Lee, A., 89 Lehman, R. Sherman, 296 Leibniz, Gottfried Wilhelm von, 128 Leitgeb, Hannes, 350, 354 Levi, Isaac, 102, 118, 120 Levinstein, Ben, 353 Lewis, C.I., 102, 154 Lewis, David, 80, 87, 108, 119, 126, 131–139, 163, 204, 220, 248, 255, 293, 319, 335 Lichtenstein, S., 351 Lieb, E.H., 337 Lindley, Dennis V., 339–341, 350 Locke, John, 15 Maher, Patrick, 222, 284–286, 332 Makinson, David, 254 Makinson, David C., 8, 19, 21, 169 Mazurkiewicz, Stefan, 54 McCartney, Mark, 223 McKinsey, J.C.C., 254 Meacham, Christopher, 284 Meacham, Christopher J.G., 118, 285 Moor, James, 52 Moore, G.E., 185 Morgenstern, Oskar, 263 Moss, Sarah, 20, 89, 115, 162, 311 Murphy, A., 317 Murphy, Allan H., 351
Nelson, Jack, 52 Newcomb, William, 241 Nicod, Jean, 177 Nozick, Robert, 241 O’Connell, J.W., 72 Oppenheim, Paul, 223 Osherson, D., 337 Papineau, David, 165 Pascal, Blaise, 7, 227, 274 Pearson, K., 89 Peirce, Charles Sanders, 125 Peterson, Martin, 253 Pettigrew, Richard, 164, 316, 349, 350, 354 Phillips, L., 351 Pollock, John L., 21 Poor, V., 337 Popper, Karl, 125, 150, 166 Popper, Karl R., 54, 205, 221 Predd, J., 337 Price, Richard, 88 Pryor, James, 221 Quinn, Warren S., 254 Ramsey, F.P., 78, 90, 129, 264–269, 276, 277, 284, 285, 291, 298, 303, 312, 313, 351 Reichenbach, Hans, 64, 74, 86, 167 Renyi, Alfred, 150 Resnik, Michael D., 275 Roche, William, 221 Roeper, P., 54, 150 Rosenkrantz, Roger, 324 Russo, Selena, 220 Salmon, Wesley, 169
Savage, L.J., 225, 233–240, 253, 269–273, 284, 285 Schervish, M.J., 169, 353 Schick, Frederic, 305, 312 Schoenfield, Miriam, 120 Seidenfeld, Teddy, 164, 169, 353 Seiringer, R., 337 Selvin, Steve, 103 Seuss, Dr., 254 Shapiro, Amram, 167 Shimony, Abner, 295, 317, 349 Shogenji, Tomoji, 206, 221 Shortliffe, E., 223 Simpson, E.H., 89 Skyrms, Brian, 52, 169, 245, 285, 295, 299, 303 Spohn, Wolfgang, 21 Staffel, Julia, 50, 115 Stalnaker, Robert C., 255 Stephenson, Todd A., 86 Stoppard, Tom, viii Suppes, Patrick, 254, 270 Tal, Eyal, 221 Teller, Paul, 118, 312, 313 Tentori, Katya, 209, 214, 217, 220 Thomason, Richard, 120 Titelbaum, Michael G., 89, 116, 120, 164, 169, 220, 222, 285 Tversky, Amos, 40, 51, 98, 213, 240, 260
Urbach, Peter, 164, 299 van Fraassen, Bas, 87, 140–143, 146, 159, 163, 169, 295, 308, 317, 349 Velasco, Joel, 21, 119 Venn, John, 29, 123, 169, 222 Vineberg, Susan, 311 von Mises, Richard, 123 von Neumann, John, 263 Vranas, Peter B. M., 223 Wainer, Howard, 120 Wald, Abraham, 120 Wallace, David, 343, 350, 352 Weatherson, Brian, 166, 169, 251 Weintraub, Ruth, 20 Weirich, Paul, 235, 245 Weisberg, Jonathan, 163, 284, 285 White, Roger, 120, 127, 221 Williams, J. Robert G., 53 Williams, J.R.G., 354 Williamson, Timothy, 165 Winkler, Robert L., 351 Wittgenstein, Ludwig, 54, 222, 285 Wright, Rosalind, 167 Yalcin, Seth, 90 Yule, G.U., 89 Zynda, Lyle, 284, 285
Bibliography
Achinstein, Peter (1963). Variety and Analogy in Confirmation Theory. Philosophy of Science 3, pp. 207–221. Adams, Ernest (1962). On Rational Betting Systems. Archiv für mathematische Logik und Grundlagenforschung 6, pp. 7–29. — (1965). The Logic of Conditionals. Inquiry 8, pp. 166–97. Alchourrón, Carlos E., Peter Gärdenfors, and David Makinson (1985). On the Logic of Theory Change: Partial Meet Contraction and Revision Functions. The Journal of Symbolic Logic 50, pp. 510–530. Allais, Maurice (1953). Le Comportement de l’homme rationnel devant le risque: Critique des postulats et axiomes de l’école Américaine. Econometrica 21, pp. 503–46. Armendt, Brad (1980). Is There a Dutch Book Argument for Probability Kinematics? Philosophy of Science 47, pp. 583–588. — (1992). Dutch Strategies for Diachronic Rules: When Believers See the Sure Loss Coming. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 1, pp. 217–229. Arntzenius, Frank (1993). The Common Cause Principle. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 2, pp. 227–237. Bartha, Paul and Christopher R. Hitchcock (1999). No One Knows the Date or the Hour: An Unorthodox Application of Rev. Bayes’s Theorem. Philosophy of Science 66, S339–53. Bergmann, Merrie, James Moor, and Jack Nelson (2013). The Logic Book. 6th edition. New York: McGraw Hill. Berker, Selim (2013). Epistemic Teleology and the Separateness of Propositions. Philosophical Review 122, pp. 337–93. Bernoulli, Daniel (1738/1954). Exposition of a New Theory on the Measurement of Risk. Econometrica 22, pp. 23–36. Bernoulli, Jacob (1713). Ars Conjectandi. Basiliae.
Bertrand, Joseph (1888/1972). Calcul des probabilités. 2nd. New York: Chelsea Publishing Company. Bickel, P.J., E.A. Hammel, and J.W. O’Connell (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187, pp. 398–404. Bolzano, Bernard (1837/1973). Wissenschaftslehre. Translated by Jan Berg under the title Theory of Science. Dordrecht: Reidel. Bovens, Luc and Stephan Hartmann (2003). Bayesian Epistemology. Oxford: Oxford University Press. Bradley, Darren (2010). Conditionalization and Belief De Se. Dialectica 64, pp. 247–250. — (2015). A Critical Introduction to Formal Epistemology. Bloomsbury. Brier, George (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78, pp. 1–3. Briggs, Rachael (ms). An Accuracy-Dominance Argument for Conditionalization. Unpublished manuscript. Buchak, Lara (2013). Risk and Rationality. Oxford: Oxford University Press. Carnap, Rudolf (1945). On Inductive Logic. Philosophy of Science 12, pp. 72–97. — (1947). On the Application of Inductive Logic. Philosophy and Phenomenological Research 8, pp. 133–148. — (1950). Logical Foundations of Probability. Chicago: University of Chicago Press. — (1955/1989). Statistical and Inductive Probability. In: Readings in the Philosophy of Science. Ed. by Baruch A. Brody and Richard E. Grandy. 2nd. Prentice-Hall. — (1962). Logical Foundations of Probability. 2nd. Chicago: University of Chicago Press. Carr, Jennifer (ms). Epistemic Utility Theory and the Aim of Belief. Unpublished manuscript. Cartwright, Nancy (1979). Causal Laws and Effective Strategies. Noûs 13, pp. 419–437. Chihara, C. (1981). Quine and the Confirmational Paradoxes. In: Midwest Studies in Philosophy 6: Foundations of Analytic Philosophy. Ed. by P. French, H. Wettstein, and T. Uehling. University of Minnesota Press, pp. 425–52.
Crupi, Vincenzo, Branden Fitelson, and Katya Tentori (2008). Probability, Confirmation, and the Conjunction Fallacy. Thinking & Reasoning 14, pp. 182–199. Crupi, Vincenzo, Katya Tentori, and Michel Gonzalez (2007). On Bayesian Measures of Evidential Support: Theoretical and Empirical Issues. Philosophy of Science 74, pp. 229–252. Davidson, Donald (1984). Inquiries into Truth and Interpretation. Oxford: Clarendon Press. Davidson, Donald, J.C.C. McKinsey, and Patrick Suppes (1955). Outlines of a Formal Theory of Value, I. Philosophy of Science 22, pp. 140–60. de Finetti, Bruno (1931/1989). Probabilism: A Critical Essay on the Theory of Probability and the Value of Science. Erkenntnis 31. Translation of B. de Finetti, Probabilismo, Logos 14: 163–219., pp. 169–223. — (1937/1964). Foresight: Its Logical Laws, its Subjective Sources. In: Studies in Subjective Probability. Ed. by Henry E. Kyburg Jr and H.E. Smokler. Originally published as “La prévision; ses lois logiques, ses sources subjectives” in Annales de l’Institut Henri Poincaré, Volume 7, 1–68. New York: Wiley, pp. 94–158. — (1995). Filosofia della probabilità. Ed. by Alberto Mura. Milan: Il Saggiatore. Easwaran, Kenny (2013). Expected Accuracy Supports Conditionalization—and Conglomerability and Reflection. Philosophy of Science 80, pp. 119–142. — (2014a). Decision Theory without Representation Theorems. Philosophers’ Imprint 14, pp. 1–30. — (2014b). Regularity and Hyperreal Credences. Philosophical Review 123, pp. 1–41. Eddington, A. (1939). The Philosophy of Physical Science. Cambridge: Cambridge University Press. Eells, Ellery (1982). Rational Decision and Causality. Cambridge Studies in Philosophy. Cambridge: Cambridge University Press. Eells, Ellery and Branden Fitelson (2002). Symmetries and Asymmetries in Evidential Support. Philosophical Studies 107, pp. 129–142. Egan, Andy (2007). Some Counterexamples to Causal Decision Theory. Philosophical Review 116, pp. 93–114. Elga, Adam (2007). Reflection and Disagreement. Noûs 41, pp. 478–502. Ellenberg, Jordan (2014). How Not to Be Wrong: The Power of Mathematical Thinking. New York: Penguin Press.
Ellis, Robert Leslie (1849). On the Foundations of the Theory of Probabilities. Transactions of the Cambridge Philosophical Society VIII, pp. 1– 6. Feldman, Richard (2007). Reasonable Religious Disagreements. In: Philosophers without Gods: Meditations on Atheism and the Secular Life. Ed. by Louise M. Antony. Oxford: Oxford University Press. Feller, William (1968). An Introduction to Probability Theory and Its Applications. 3rd. New York: Wiley. Fermat, Pierre and Blaise Pascal (1654/1929). Fermat and Pascal on Probability. In: A Source Book in Mathematics . Ed. by D. Smith. Translated by Vera Sanford. New York: McGraw-Hill, pp. 546–65. Fine, Terrence L. (1973). Theories of Probability: An Examination of Foundations. New York, London: Academic Press. Finetti, Bruno de (1974). Theory of Probability. Vol. 1. New York: Wiley. Fishburn, Peter C. (1981). Subjective Expected Utility: A Review of Normative Theories. Theory and Decision 13, pp. 129–99. Fitelson, Branden (2006). Logical Foundations of Evidential Support. Philosophy of Science 73, pp. 500–512.
— (2008). A Decision Procedure for Probability Calculus with Applications. The Review of Symbolic Logic 1, pp. 111–125. — (2012). Evidence of Evidence is Not (Necessarily) Evidence. Analysis 72, pp. 85–88. — (2015). The Strongest Possible Lewisian Triviality Result. Thought 4, pp. 69–74. Fitelson, Branden and Alan Hájek (ta). Declarations of Independence. Synthese. Published online October 2, 2014. Fitelson, Branden and James Hawthorne (2010a). How Bayesian Confirmation Theory Handles the Paradox of the Ravens. Boston Studies in the Philosophy of Science 284. Ed. by Ellery Eells and J. Fetzer. — (2010b). The Wason Task(s) and the Paradox of Confirmation. Philosophical Perspectives 24. Ed. by John Hawthorne and J. Turner. Foley, Richard (1993). Working Without a Net. Oxford: Oxford University Press. — (2009). Beliefs, Degrees of Belief, and the Lockean Thesis. In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 37–48. Galavotti, Maria Carla (2005). Philosophical Introduction to Probability. CSLI Lecture Notes 167. Stanford, CA: CSLI Publications. Gibbard, A. and W. Harper (1978/1981). Counterfactuals and Two Kinds of Expected Utility. In: Ifs: Conditionals, Belief, Decision, Chance, and
Time. Ed. by W. Harper, Robert C. Stalnaker, and G. Pearce. Dordrecht: Reidel, pp. 153–190. Gillies, Donald (2000). Varieties of Propensity. British Journal for the Philosophy of Science 51, pp. 807–835.
Glass, David H. and Mark McCartney (2015). A New Argument for the Likelihood Ratio Measure of Confirmation. Acta Analytica 30, pp. 59–65. Glymour, Clark (1980). Theory and Evidence. Princeton, NJ: Princeton University Press. Good, I. J. (1967). The White Shoe is a Red Herring. British Journal for the Philosophy of Science 17, p. 322. — (1968). The White Shoe qua Herring is Pink. British Journal for the Philosophy of Science 19, pp. 156–7. — (1971). Letter to the Editor. The American Statistician 25, pp. 62–3. Goodman, Nelson (1946). A Query on Confirmation. The Journal of Philosophy 43, pp. 383–385. — (1955). Fact, Fiction, and Forecast. Cambridge, MA: Harvard University Press. Greaves, Hilary (2013). Epistemic Decision Theory. Mind 122, pp. 915–52. Greaves, Hilary and David Wallace (2006). Justifying Conditionalization: Conditionalization Maximizes Expected Epistemic Utility. Mind 115, pp. 607–632. Hacking, Ian (1971). The Leibniz-Carnap Program for Inductive Logic. The Journal of Philosophy 68, pp. 597–610. — (2001). An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press. Hájek, Alan (1996). ‘Mises Redux’—Redux: Fifteen Arguments Against Finite Frequentism. Erkenntnis 45, pp. 209–227. — (2003). What Conditional Probability Could Not Be. Synthese 137, pp. 273–323. — (2009a). Arguments For—Or Against—Probabilism? In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 229–251. — (2009b). Fifteen Arguments Against Hypothetical Frequentism. Erkenntnis 70, pp. 211–235. — (2011a). Conditional Probability. In: Philosophy of Statistics. Ed. by Prasanta S. Bandyopadhyay and Malcolm R. Forster. Vol. 7. Handbook of the Philosophy of Science. Amsterdam: Elsevier, pp. 99–136.
Hájek, Alan (2011b). Interpretations of Probability. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2011. URL: http://plato.stanford.edu/archives/win2011/entries/probability-interpret/. Hájek, Alan and Ned Hall (1994). The Hypothesis of the Conditional Construal of Conditional Probability. In: Probability and Conditionals: Belief Revision and Rational Decision. Ed. by Ellery Eells and Brian Skyrms. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge University Press, pp. 75–112. Hájek, Alan and James M. Joyce (2008). Confirmation. In: The Routledge Companion to Philosophy of Science. Ed. by Stathis Psillos and Martin Curd. New York: Routledge, pp. 115–128. Hall, Ned (2004). Two Mistakes About Credence and Chance. Australasian Journal of Philosophy 82, pp. 93–111. Hart, Casey and Michael G. Titelbaum (ta). Intuitive Dilation? Thought. Hawthorne, James and Branden Fitelson (2004). Re-solving Irrelevant Conjunction with Probabilistic Independence. Philosophy of Science 71, pp. 505–514. Hempel, Carl G. (1945a). Studies in the Logic of Confirmation (I). Mind 54, pp. 1–26. — (1945b). Studies in the Logic of Confirmation (II). Mind 54, pp. 97–121. Hesse, Mary (1963). Models and Analogies in Science. London: Sheed & Ward. Heukelom, Floris (2015). A History of the Allais Paradox. The British Journal for the History of Science 48, pp. 147–69. Hitchcock, Christopher R. (2012). Probabilistic Causation. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2012. Holton, Richard (2014). Intention as a Model for Belief. In: Rational and Social Agency: The Philosophy of Michael Bratman. Ed. by Manuel Vargas and Gideon Yaffe. Oxford: Oxford University Press, pp. 12–37. Hooker, C. A. (1968). Goodman, ‘Grue’ and Hempel. Philosophy of Science 35, pp. 232–247. Hosiasson-Lindenbaum, Janina (1940). On Confirmation. Journal of Symbolic Logic 5, pp. 133–148. Howson, Colin (1992). Dutch Book Arguments and Consistency. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 2, pp. 161–8. — (2014). Finite Additivity, Another Lottery Paradox and Conditionalisation. Synthese 191, pp. 989–1012. Howson, Colin and Peter Urbach (2006). Scientific Reasoning: The Bayesian Approach. 3rd. Chicago: Open Court.
Hume, David (1739–40/1978). A Treatise of Human Nature. Ed. by L. A. Selby-Bigge and Peter H. Nidditch. Second. Oxford: Oxford University Press. Humphreys, Paul (1985). Why Propensities Cannot Be Probabilities. Philosophical Review 94, pp. 557–70. Jaynes, E. T. (1957a). Information Theory and Statistical Mechanics I. Physical Review 106, pp. 620–30. — (1957b). Information Theory and Statistical Mechanics II. Physical Review 108, pp. 171–90. Jeffrey, Richard C. (1965). The Logic of Decision. 1st. McGraw-Hill series in probability and statistics. New York: McGraw-Hill. — (1970). Dracula meets Wolfman: Acceptance vs. Partial Belief. In: Induction, Acceptance, and Rational Belief. Ed. by M. Swain. Dordrecht: Reidel, pp. 157–185. — (1983). The Logic of Decision. 2nd. Chicago: University of Chicago Press. — (1993). Causality and the Logic of Decision. Philosophical Topics 21, pp. 139–51. — (2004). Subjective Probability: The Real Thing. Cambridge: Cambridge University Press.
Johnson, W.E. (1932). Probability: The Deductive and Inductive Problems. Mind 41, pp. 409–23. Joyce, James M. (1998). A Nonpragmatic Vindication of Probabilism. Philosophy of Science 65, pp. 575–603. — (1999). The Foundations of Causal Decision Theory. Cambridge: Cambridge University Press. — (2005). How Probabilities Reflect Evidence. Philosophical Perspectives 19. — (2009). Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In: Degrees of Belief. Ed. by Franz Huber and Christoph Schmidt-Petri. Vol. 342. Synthese Library. Springer, pp. 263–297. Kahneman, Daniel and Amos Tversky (1979). Prospect Theory: An Analysis of Decision Under Risk. Econometrica XLVII, pp. 263–291. Kaplan, Mark (1996). Decision Theory as Philosophy. Cambridge: Cambridge University Press. Kemeny, John G. (1955). Fair Bets and Inductive Probabilities. The Journal of Symbolic Logic 20, pp. 263–273. Kemeny, John G. and Paul Oppenheim (1952). Degree of Factual Support. Philosophy of Science 19, pp. 307–324. Keynes, John Maynard (1921). Treatise on Probability. London: MacMillan and Co., Limited.
Kim, Jaegwon (1988). What Is “Naturalized Epistemology”? In: Philosophical Perspectives. Ed. by J. Tomberlin. Vol. 2. Atascadero, CA: Ridgeview Publishing Co., pp. 381–405. Kolmogorov, A. N. (1933/1950). Foundations of the Theory of Probability. Translation edited by Nathan Morrison. New York: Chelsea Publishing Company. Konek, Jason and Ben Levinstein (ms). The Foundations of Epistemic Decision Theory. Unpublished manuscript. Kornblith, Hilary (1993). Epistemic Normativity. Synthese 94, pp. 357–76. Kuhn, Thomas S. (1957). The Copernican Revolution: Planetary Astronomy in the Development of Western Thought. New York: MJF Books. Kyburg Jr, Henry E. (1961). Probability and the Logic of Rational Belief. Middletown: Wesleyan University Press. — (1970). Conjunctivitis. In: Induction, Acceptance, and Rational Belief. Ed. by M. Swain. Boston: Reidel, pp. 55–82. Lange, Marc (2000). Is Jeffrey Conditionalization defective by virtue of being non-commutative? Remarks on the sameness of sensory experience. Synthese 123, pp. 393–403. Laplace, Pierre-Simon (1814/1995). Philosophical Essay on Probabilities. Translated from the French by Andrew Dale. New York: Springer. Lehman, R. Sherman (1955). On Confirmation and Rational Betting. Journal of Symbolic Logic 20, pp. 251–262. Leitgeb, Hannes and Richard Pettigrew (2010a). An Objective Justification of Bayesianism I: Measuring Inaccuracy. Philosophy of Science 77, pp. 201–235. — (2010b). An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy. Philosophy of Science 77, pp. 236–272. Levi, Isaac (1980). The Enterprise of Knowledge. Boston: The MIT Press. Lewis, C. I. (1946). An Analysis of Knowledge and Valuation. La Salle, Illinois: Open Court. Lewis, David (1971). Immodest Inductive Methods. Philosophy of Science 38, pp. 54–63. — (1976). Probabilities of Conditionals and Conditional Probabilities. The Philosophical Review 85, pp. 297–315. — (1980). A Subjectivist’s Guide to Objective Chance. In: Studies in Inductive Logic and Probability. Ed. by Richard C. Jeffrey. Vol. 2. Berkeley: University of California Press, pp. 263–294. — (1981a). Causal Decision Theory. Australasian Journal of Philosophy 59, pp. 5–30. — (1981b). ‘Why Ain’cha Rich?’ Noûs 15, pp. 377–80.
— (1994). Humean Supervenience Debugged. Mind 103, pp. 473–490. Lichtenstein, S., B. Fischoff, and L. Phillips (1982). Calibration of Probabilities: The State of the Art to 1980. In: Judgment under Uncertainty: Heuristics and Biases. Ed. by Daniel Kahneman, P. Slovic, and Amos Tversky. Cambridge: Cambridge University Press, pp. 306–334. Lindley, Dennis V. (1982). Scoring Rules and the Inevitability of Probability. International Statistical Review 50, pp. 1–26. Locke, John (1689/1975). An Essay Concerning Human Understanding. Ed. by Peter H. Nidditch. Oxford: Oxford University Press. Maher, Patrick (1993). Betting on Theories. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge: Cambridge University Press. — (2002). Joyce’s Argument for Probabilism. Philosophy of Science 96, pp. 73–81. — (2010). Explication of Inductive Probability. Journal of Philosophical Logic 39, pp. 593–616. Makinson, David C. (1965). The Paradox of the Preface. Analysis 25, pp. 205–7. — (2011). Conditional Probability in the Light of Qualitative Belief Change. Journal of Philosophical Logic 40, pp. 121–53. Mazurkiewicz, Stefan (1932). Zur Axiomatik der Wahrscheinlichkeitsrechnung. Comptes rendus des séances de la Société des Sciences et des Lettres de Varsovie 25, pp. 1–4. Meacham, Christopher J.G. (ms). Ur-Priors, Conditionalization, and Ur-Prior Conditionalization. Unpublished manuscript. Meacham, Christopher J.G. and Jonathan Weisberg (2011). Representation Theorems and the Foundations of Decision Theory. Australasian Journal of Philosophy 89, pp. 641–663. Moore, G.E. (1939). Proof of an External World. Proceedings of the British Academy 25. Moss, Sarah (ms). Probabilistic Knowledge. Forthcoming. Oxford University Press. Murphy, A. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology 12, pp. 595–600. Murphy, Allan H. and Robert L. Winkler (1977). Reliability of Subjective Probability Forecasts of Precipitation and Temperature. Journal of the Royal Statistical Society, Series C 26, pp. 41–7. Nicod, Jean (1930). Foundations of Geometry and Induction. Translated by Philip Wiener. New York: Harcourt, Brace and Company.
Nozick, Robert (1969). Newcomb’s Problem and Two Principles of Choice. In: Essays in Honor of Carl G. Hempel. Synthese Library. Dordrecht: Reidel, pp. 114–115. Papineau, David (2012). Philosophical Devices: Proofs, Probabilities, Possibilities, and Sets. Oxford: Oxford University Press. Pascal, Blaise (1670/1910). Pensées. Translated by W.F. Trotter. London: Dent. Pearson, K., A. Lee, and L. Bramley-Moore (1899). Genetic (Reproductive) Selection: Inheritance of Fertility in Man. Philosophical Transactions of the Royal Society A 73, pp. 534–539. Peirce, Charles Sanders (1910/1932). Notes on the Doctrine of Chances. In: Collected Papers of Charles Sanders Peirce. Ed. by Charles Hartshorne and Paul Weiss. Cambridge, MA: Harvard University Press, pp. 404–14. Peterson, Martin (2009). An Introduction to Decision Theory. Cambridge Introductions to Philosophy. Cambridge: Cambridge University Press. Pettigrew, Richard (2011). Epistemic Utility Arguments for Probabilism. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2011.
— (2013a). A New Epistemic Utility Argument for the Principal Principle. Episteme 10, pp. 19–35. — (2013b). Epistemic Utility and Norms for Credences. Philosophy Compass 8, pp. 897–908. — (2014). Accuracy, Risk, and the Principle of Indifference. Philosophy and Phenomenological Research 90. Pettigrew, Richard and Michael G. Titelbaum (2014). Deference Done Right. Philosophers’ Imprint 14.35. Pollock, John L. (2001). Defeasible Reasoning with Variable Degrees of Justification. Artificial Intelligence 133, pp. 233–282. Popper, Karl (1955). Two Autonomous Axiom Systems for the Calculus of Probabilities. British Journal for the Philosophy of Science 6, pp. 51–57. Popper, Karl R. (1935/1959). The Logic of Scientific Discovery. London: Hutchinson & Co. — (1938). A Set of Independent Axioms for Probability. Mind 47, pp. 275–9. — (1954). Degree of Confirmation. British Journal for the Philosophy of Science 5, pp. 143–9. — (1957). The Propensity Interpretation of the Calculus of Probability and the Quantum Theory. The Colston Papers 9. Ed. by S. Körner, pp. 65–70.
Predd, J. et al. (2009). Probabilistic Coherence and Proper Scoring Rules. IEEE Transactions on Information Theory 55, pp. 4786–4792. Pryor, James (2004). What’s Wrong with Moore’s Argument? Philosophical Issues 14, pp. 349–378. Quinn, Warren S. (1990). The Puzzle of the Self-Torturer. Philosophical Studies 59, pp. 79–90. Ramsey, Frank P. (1929/1990). General Propositions and Causality. In: Philosophical Papers. Ed. by D.H. Mellor. Cambridge: Cambridge University Press, pp. 145–163. — (1931). Truth and Probability. In: The Foundations of Mathematics and other Logic Essays. Ed. by R. B. Braithwaite. New York: Harcourt, Brace and Company, pp. 156–198. Reichenbach, Hans (1935/1949). The Theory of Probability. English expanded version of the German original. Berkeley: University of California Press. — (1938). Experience and Prediction. Chicago: University of Chicago Press. — (1956). The Principle of Common Cause. In: The Direction of Time. University of California Press, pp. 157–160. Renyi, Alfred (1970). Foundations of Probability. San Francisco: Holden-Day. Resnik, Michael D. (1987). Choices: An Introduction to Decision Theory. Minneapolis: University of Minnesota Press. Roche, William (2014). Evidence of Evidence is Evidence Under Screening-Off. Episteme 11, pp. 119–24. Roeper, P. and H. Leblanc (1999). Probability Theory and Probability Logic. Toronto: University of Toronto Press. Rosenkrantz, Roger (1981). Foundations and Applications of Inductive Probability. Atascadero, CA: Ridgeview Press. Salmon, Wesley (1966). The Foundations of Scientific Inference. Pittsburgh: University of Pittsburgh Press. Savage, Leonard J. (1954). The Foundations of Statistics. New York: Wiley. Schervish, M. J., T. Seidenfeld, and J.B. Kadane (2009). Proper Scoring Rules, Dominated Forecasts, and Coherence. Decision Analysis 6, pp. 202–221. Schick, Frederic (1986). Dutch Bookies and Money Pumps. The Journal of Philosophy 83, pp. 112–9. Schoenfield, Miriam (2014). Permission to Believe: Why Permissivism Is True and What It Tells Us About Irrelevant Influences on Belief. Noûs 48, pp. 193–218. Seidenfeld, Teddy (1986). Entropy and Uncertainty. Philosophy of Science 53, pp. 467–491.
Seidenfeld, Teddy, M. J. Schervish, and J.B. Kadane (ms). Non-Conglomerability for Countably Additive Measures that are Not κ-Additive. Unpublished manuscript. Selvin, Steve (1975). A Problem in Probability. The American Statistician 29. Published among the Letters to the Editor., p. 67. Shapiro, Amram, Louise Firth Campbell, and Rosalind Wright (2014). The Book of Odds. New York: Harper Collins. Shimony, Abner (1955). Coherence and the Axioms of Confirmation. Journal of Symbolic Logic 20, pp. 1–28. — (1988). An Adamite Derivation of the Calculus of Probability. In: Probability and Causality. Ed. by J.H. Fetzer. Dordrecht: Reidel, pp. 151–161. Shogenji, Tomoji (2003). A Condition for Transitivity in Probabilistic Support. British Journal for the Philosophy of Science 54, pp. 613–6. — (2012). The Degree of Epistemic Justification and the Conjunction Fallacy. Synthese 184, pp. 29–48. Shortliffe, E. and B. Buchanan (1975). A Model of Inexact Reasoning in Medicine. Mathematical Biosciences 23, pp. 351–79. Simpson, E.H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Series B 13, pp. 238–241. Skyrms, Brian (1980a). Causal Necessity: A Pragmatic Investigation of the Necessity of Laws. New Haven: Yale University Press. — (1980b). Higher Order Degrees of Belief. In: Prospects for Pragmatism. Ed. by D. H. Mellor. Cambridge: Cambridge University Press, pp. 109–137. — (1987a). Coherence. In: Scientific Inquiry in Philosophical Perspective. Ed. by N. Rescher. Pittsburgh: University of Pittsburgh Press, pp. 225–42. — (1987b). Dynamic Coherence and Probability Kinematics. Philosophy of Science 54, pp. 1–20. — (2000). Choice & Chance: An Introduction to Inductive Logic. 4th. Stamford, CT: Wadsworth. Spohn, Wolfgang (2012). The Laws of Belief: Ranking Theory & Its Philosophical Applications. Oxford: Oxford University Press. Stalnaker, Robert C. (1972/1981). Letter to David Lewis. In: Ifs: Conditionals, Belief, Decision, Chance, and Time. Ed. by W. Harper, Robert C. Stalnaker, and G. Pearce. Dordrecht: Reidel, pp. 151–2. Stephenson, Todd A. (2000). An Introduction to Bayesian Network Theory and Usage. Tech. rep. 03. IDIAP.
Suppes, Patrick (1974). Probabilistic Metaphysics. Uppsala: University of Uppsala Press. Tal, Eyal and Juan Comesaña (ta). Is Evidence of Evidence Evidence? Noûs. Forthcoming. Teller, Paul (1973). Conditionalization and Observation. Synthese 26, pp. 218–258. Tentori, Katya, Vincenzo Crupi, and Selena Russo (2013). On the Determinants of the Conjunction Fallacy: Probability versus Inductive Confirmation. Journal of Experimental Psychology: General 142, pp. 235–255. Titelbaum, Michael G. (2010). Not Enough There There: Evidence, Reasons, and Language Independence. Philosophical Perspectives 24, pp. 477–528. — (2013). Quitting Certainties: A Bayesian Framework Modeling Degrees of Belief. Oxford: Oxford University Press. Tversky, Amos and Daniel Kahneman (1974). Judgment under Uncertainty: Heuristics and Biases. Science 185, pp. 1124–1131. — (1983). Extensional Versus Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment. Psychological Review 90, pp. 293–315. — (1992). Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty 5, pp. 297–323. Tversky, Amos and Derek J. Koehler (1994). Support Theory: A Nonextensional Representation of Subjective Probability. Psychological Review 101, pp. 547–567. van Fraassen, Bas C. (1981). A Problem for Relative Information Minimizers. British Journal for the Philosophy of Science 32, pp. 375–379. — (1982). Rational Belief and the Common Cause Principle. In: What? Where? When? Why? Ed. by Robert McLaughlin. Dordrecht: Reidel, pp. 193–209. — (1983). Calibration: A Frequency Justification for Personal Probability. In: Physics Philosophy and Psychoanalysis. Ed. by R. Cohen and L. Laudan. Dordrecht: Reidel, pp. 295–319. — (1984). Belief and the Will. The Journal of Philosophy 81, pp. 235–256. — (1989). Laws and Symmetry. Oxford: Clarendon Press. — (1995). Belief and the Problem of Ulysses and the Sirens. Philosophical Studies 77, pp. 7–37. — (1999). Conditionalization: A New Argument For. Topoi 18, pp. 93–96. Venn, John (1866). The Logic of Chance. London-Cambridge: MacMillan. Vineberg, Susan (2011). Dutch Book Arguments. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Summer 2011.
von Mises, Richard (1928/1957). Probability, Statistics and Truth. (English edition of the original German Wahrscheinlichkeit, Statistik und Wahrheit.) New York: Dover. von Neumann, J. and O. Morgenstern (1947). Theory of Games and Economic Behavior. 2nd. Princeton, NJ: Princeton University Press. Vranas, Peter B.M. (2004). Hempel’s Raven Paradox: A lacuna in the standard Bayesian solution. British Journal for the Philosophy of Science 55, pp. 545–560. Wainer, Howard (2011). Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies. Princeton, NJ: Princeton University Press. Weatherson, Brian and Andy Egan (2011). Epistemic Modals and Epistemic Modality. In: Epistemic Modality. Ed. by Andy Egan and Brian Weatherson. Oxford: Oxford University Press, pp. 1–18. Weintraub, Ruth (2001). The Lottery: A Paradox Regained and Resolved. Synthese 129, pp. 439–449. Weirich, Paul (2012). Causal Decision Theory. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Winter 2012. Weisberg, Jonathan (2007). Conditionalization, Reflection, and Self-Knowledge. Philosophical Studies 135, pp. 179–197. White, Roger (2005). Epistemic Permissiveness. Philosophical Perspectives 19, pp. 445–459. — (2006). Problems for Dogmatism. Philosophical Studies 131, pp. 525–557. Williams, J. Robert G. (ms). A Non-Pragmatic Dominance Argument for Conditionalization. Unpublished manuscript. — (ta). Probability and Non-Classical Logic. In: Oxford Handbook of Probability and Philosophy. Ed. by Alan Hájek and Christopher R. Hitchcock. Oxford University Press. Williamson, Timothy (2007). How Probable Is an Infinite Sequence of Heads? Analysis 67, pp. 173–80. Wittgenstein, Ludwig (1921/1961). Tractatus Logico-Philosophicus. Translated by D.F. Pears and B.F. McGuinness. London: Routledge. Yalcin, Seth (2012). A Counterexample to Modus Tollens. Journal of Philosophical Logic 41, pp. 1001–1024. Yule, G.U. (1903). Notes on the Theory of Association of Attributes in Statistics. Biometrika 2, pp. 121–134. Zynda, Lyle (2000). Representation Theorems and Realism About Degrees of Belief. Philosophy of Science 67, pp. 45–69.