Handout No. 7
Phil.015 April 12, 2016

ELEMENTS OF DECISION THEORY

The elements of decision theory are similar to those of the theory of games: decision theory may be considered as the theory of a two-person game in which nature takes the role of one of the players, and the other player is of course the statistician. Because the problems are simpler, it is useful to start with decisions in the absence of data. Often certain decision problems involving data can cleverly be converted into no-data problems.

I. NO-DATA DECISION PROBLEMS
The basic conceptual ingredients of a no-data decision model are the same as those of a zero-sum two-person game-theoretic model, namely:

(i) A non-empty set Θ = {θ1, θ2, · · ·} – called the state space (sometimes referred to as the parameter space) – of possible states of nature.

(ii) A non-empty set A = {a1, a2, · · ·} – called the action space – of actions available to the decision-maker.

(iii) A loss function ℓ : Θ × A −→ R that assigns a real number ℓ(θ, a) to each pair (θ, a), specified by the state of nature θ and action a.

The value ℓ(θ, a) represents the loss incurred when the decision-maker takes action a and nature is in state θ. Although ℓ(θ, a) can even be a negative number, representing gain or payoff, statisticians usually think of ℓ(θ, a) conservatively as a loss and take 0 to be the smallest loss (i.e., no loss at all). That is to say, ℓ(θ, a) is always a non-negative real number. Simply, nature chooses a point θ in Θ, and the statistician (without being informed of the choice nature has made) chooses an action a in A. As a consequence of these two choices, the statistician (decision-maker) loses an amount ℓ(θ, a). In a mathematical sense,¹ the triple ⟨Θ, A, ℓ⟩ with the specifications itemized above is called a no-data decision model (or a two-person game). Here are some examples:
¹ An essential part of understanding how a mathematical method works in decision theory is being able to view it somewhat abstractly as a mathematical phenomenon. This approach helps to see how an abstract idea might work in areas other than the one we may be interested in at the moment.
1. Consider the decision problem faced by Alma: whether or not to hold a garage sale the following Saturday. To announce the sale, she would have to pay $30 to place a notice in the local newspaper. If the day is sunny, she expects to receive about $200 from her customers; the net return after deducting the cost of the ad is $170. If it rains, however, no sale is likely to occur, and Alma suffers a loss due to the newspaper ad. Her default action is to hold no sale, for which she neither gains nor loses anything, come rain or shine. The obvious no-data decision model ⟨Θ, A, ℓ⟩ is specified by setting:

(a) Θ = {θ1, θ2}, where θ1 = “sunny”, θ2 = “rainy”. (This set belongs to “nature”, viewed as player I.)

(b) A = {a1, a2}, where a1 = “no sale”, a2 = “sale”. (These are the possible choices of player II, the decision-maker.)

(c) The loss function (in dollars) ℓ : Θ × A −→ R is given by the loss table:

          θ1     θ2
    a1   170    170
    a2     0    200
Thus ℓ(θ1, a1) = 170, ℓ(θ2, a2) = 200, etc. Note that the real-valued payoff or utility function (gain in dollars) U : Θ × A −→ R is given by the table

          θ1     θ2
    a1     0      0
    a2   170    -30
The loss function is obtained from the payoff (utility) function U by subtracting each payoff from the maximum attainable payoff of $170. We do this in order to get non-negative real numbers for all losses, so that the smallest possible loss is 0. This example is discussed later with additional information and conceptual enrichment. Formally, we set ℓ(θi, a) =df Umax − U(θi, a), where Umax = max{U(θ, a) : θ in Θ, a in A} is the largest entry of the payoff table.
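As a quick illustration, here is a minimal Python sketch (not part of the original handout; the string labels for the pairs (θ, a) are purely illustrative) of this utility-to-loss conversion for Alma's payoff table:

```python
# Sketch: derive Alma's loss table from her payoff table by subtracting
# each payoff from the largest attainable payoff (here U_max = 170).
U = {("theta1", "a1"): 0,   ("theta2", "a1"): 0,    # no sale: nothing gained or lost
     ("theta1", "a2"): 170, ("theta2", "a2"): -30}  # sale: net gain, or cost of the ad

U_max = max(U.values())                             # 170
loss = {pair: U_max - u for pair, u in U.items()}

print(loss[("theta1", "a1")], loss[("theta2", "a2")])  # 170 200, as in the loss table
```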
2. Homemaker Alma Jones can cook spaghetti, hamburger, or steak for dinner. She has learned from her past experience that if her husband is in a good mood, she can serve him spaghetti and save some money, but if he is in a bad mood, only a juicy steak will calm him down and make him bearable. Clearly, there are three states of nature (Mr. Jones's possible “modes of being”) and there are three actions available to Alma, leading to another finitary decision model ⟨Θ, A, ℓ⟩, where

(a) Θ = {θ1, θ2, θ3}, where θ1 = “Mr. Jones is in a good mood”, θ2 = “Mr. Jones is in a normal mood”, and θ3 = “Mr. Jones is in a bad mood”;

(b) A = {a1, a2, a3}, where a1 = “prepare spaghetti”, a2 = “prepare hamburger”, and a3 = “prepare steak”;

(c) The loss function (for food, gasoline, etc.) ℓ : Θ × A −→ R is defined by the loss table:

          a1     a2     a3
    θ1     0      2      4
    θ2     5      3      5
    θ3    10      9      6
3. In estimation theory it is standard to specify the pertinent decision model ⟨Θ, A, ℓ⟩ by simply setting Θ = A = R (the real line) and ℓ(θ, a) = (θ − a)² (quadratic loss). In particular, suppose a coin has an unknown probability (“state” or “propensity”) θ in Θ = [0, 1] (the closed unit interval) of coming up heads. We wish to estimate this probability on the basis of one toss of the coin. Clearly, each toss-based estimate corresponds to an element a in A = [0, 1]. Here the loss function is given by the quadratic loss ℓ(θ, a) = (θ − a)².

Given the decision models above, the fundamental problem is: what course of action should Alma take? It is natural to search for the “best course of action”, i.e., an action that brings the smallest loss no matter what the “true” state of nature is. For this we need a principle (scheme, procedure) that algorithmically specifies the action that is “best” according to the principle used. Generally, a principle leads to a partial ordering ⪯ on the set A of available courses of action, and any such ordering may be considered a principle. Here the relation a1 ⪯ a2 means that action a2 is at least as good as action a1. In brief, a decision principle is modeled by the pair ⟨A, ⪯⟩. As we shall see, we may (for example) define the ordering a1 ⪯ a2 by the formula ∀θ[ℓ(θ, a2) ≤ ℓ(θ, a1)]. Two important elementary principles are basic to the study of decision models:

1. The Minimax Principle: A distinct type of partial ordering of actions in a decision model is obtained by ordering the actions according to the worst that could happen to the decision maker. In other words, an action a is preferred to an action a′ if and only if

max{ℓ(θ1, a), ℓ(θ2, a), · · · , ℓ(θn, a)} < max{ℓ(θ1, a′), ℓ(θ2, a′), · · · , ℓ(θn, a′)}.

Here we consider, for each action, its losses over the various possible states of nature and record the state for which the loss is maximal. This provides
an ordering among the possible actions. Now, according to the minimax principle, the decision maker takes the action a for which the maximum loss max{ℓ(θ1, a), ℓ(θ2, a), · · · , ℓ(θn, a)} is the smallest. It is easy to see that in Example 1 (garage sale) we have

M(a1) = max{ℓ(θ1, a1), ℓ(θ2, a1)} = 170
M(a2) = max{ℓ(θ1, a2), ℓ(θ2, a2)} = 200
so that Alma should choose a1, because it has a smaller “max”. In the same way as above, we find that in Example 2 (what food to prepare), according to the minimax principle the best choice is action a3 (prepare steak). We see that the minimax principle reflects a pessimistic attitude.

2. The Bayes Minimum Expected Loss Principle: Bayesians take the point of view that the possible states of nature (parameters) can be seen as values of a single random variable θ, so that we have outcomes (events) “θ = θ1”, “θ = θ2”, · · · with known prior probabilities p0(θ1) = P(θ = θ1), p0(θ2) = P(θ = θ2), · · · . For example, Alma may believe, based on her past experience, that sunny days are far more frequent than rainy days, and she may even be able to put a reasonably accurate percentage value on the occurrence of “sunny” vs. “rainy”. Given a decision model ⟨Θ, A, ℓ⟩ together with a prior probability distribution p0(θ1), p0(θ2), p0(θ3), · · · on the possible states in Θ (serving as an epistemic enrichment of Θ in the extant decision model, so that now we use ⟨Θ, p0⟩), we define the Bayes loss of action ai by the expected value (average loss)
B(ai) =df Bp0(ai) = ℓ(θ1, ai)p0(θ1) + ℓ(θ2, ai)p0(θ2) + · · · + ℓ(θn, ai)p0(θn).
Thus, given a prior distribution p0(θi) on the states of nature, the loss incurred for a given action a is now a random variable, whose expected value is the Bayes loss B(a). The Bayes action is then defined to be the action a that minimizes the Bayes loss B(a). Thus, the computation of expected losses B(a1), B(a2), B(a3), · · · according to a given prior distribution provides a means of partially ordering the available actions in A, say in the form B(a1) < B(a2) < B(a3) < · · · . The action that is farthest to the left on this scale is the most desirable from the Bayesian perspective. Suppose the prior probability distribution p0(θ) = P(θ = θ) on states of nature in Example 1 (garage sale) is given by the table
    θ        θ1     θ2
    p0(θ)   0.7    0.3
Find the Bayes action that minimizes the average losses. Do the same in Example 2 (prepare dinner for Mr. Jones), given that the prior probability distribution on states of nature is given by the table

    θ        θ1     θ2     θ3
    p0(θ)   0.5    0.3    0.2
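Both exercises can be checked mechanically. The following minimal Python sketch (not part of the original handout; the state and action labels are rendered as plain strings) computes the minimax and the Bayes actions for the two loss tables above:

```python
# Sketch: minimax and Bayes actions for the two no-data examples.

def minimax_action(loss):
    """Action whose worst-case loss over the states is smallest."""
    return min(loss, key=lambda a: max(loss[a].values()))

def bayes_action(loss, prior):
    """Action minimizing the prior-weighted average loss B(a)."""
    return min(loss, key=lambda a: sum(prior[t] * loss[a][t] for t in prior))

# Example 1 (garage sale): loss[a][theta], read off the loss table above.
loss1  = {"a1": {"th1": 170, "th2": 170},
          "a2": {"th1": 0,   "th2": 200}}
prior1 = {"th1": 0.7, "th2": 0.3}

# Example 2 (dinner for Mr. Jones).
loss2  = {"a1": {"th1": 0, "th2": 5, "th3": 10},
          "a2": {"th1": 2, "th2": 3, "th3": 9},
          "a3": {"th1": 4, "th2": 5, "th3": 6}}
prior2 = {"th1": 0.5, "th2": 0.3, "th3": 0.2}

print(minimax_action(loss1), bayes_action(loss1, prior1))  # a1 a2
print(minimax_action(loss2), bayes_action(loss2, prior2))  # a3 a1
```

Note how the two principles can disagree: in Example 1 the pessimistic minimax action is a1 (no sale), while under the sunny-heavy prior the Bayes action is a2 (hold the sale).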
Considerably better decisions become available if, in addition to prior knowledge, there is access to various observation results of a designated parent random variable X, whose values are assumed to depend on the states of nature in the form of (conditional) probability distributions p(x|θ) = P(X = x | θ = θ). Simply, in an observation setting we think of the states of nature θ as “causes” and the values x of the parent variable X as “effects”, but of course a precisely formulated cause-effect relationship is only statistical. Unfortunately, the welcome feature of having additional information from observation data somewhat complicates the no-data decision models treated above.

II. DATA-BASED DECISION PROBLEMS
To give a correct mathematical structure to the process of information gathering, we assume that there is a parent random variable X – taking its values in an observation space (sample space) 𝒳 – whose known probability distribution p(x|θ) depends on the “true” state of nature θ. Thus, what we have so far is a decision model ⟨Θ, A, ℓ⟩ coupled with a random variable X with range 𝒳 and known probability distribution pX(x|θ). For each θ there is a probability measure

P(X ≤ x | θ) = ∫_{−∞}^{x} pX(x′ | θ) dx′,

specified by pX(x | θ).
For example, let Θ = {3/4, 1/2} be the set of possible (unknown) probabilities of getting a head in one toss of a (fair or biased) coin. We think of θ1 = 3/4 as the state a coin would be in if it were considerably biased in favor of heads, and of course θ2 = 1/2 encodes the state of a fair (balanced) coin. The problem is that the decision maker does not know the “true” state of the coin. One major way to find out is to observe the coin's behavior in a longer sequence of tosses. Let X be the “success” random variable, counting the number of heads that come up in each given sequence of trials, whose distribution is given by the binomial

P(X = x | θ = θi) = pX(x|θi) = C(n, x) θi^x (1 − θi)^(n−x),

where C(n, x) is the binomial coefficient, x = 0, 1, 2, · · · , n are the possible values of the “success” random variable X, i = 1, 2 indexes the possible states, and the natural number n is the sample size. The pertinent action space A = {a1, a2} consists of two decisions: a1 = “the coin is fair” and a2 = “the coin is biased”. The fundamental question here is: what “decision rule” or “strategy” should the decision maker use in light of the information obtained from observing the values of X? Given a decision rule d, on the basis of recording the value x of random variable X, the decision maker chooses action d(x) = a in A. What is new here is the decision function (also known as a decision rule or a strategy) d : 𝒳 −→ A that assigns to each outcome x of X in its observation (sample) space 𝒳 a unique action d(x) in A. Of course, the decision maker may apply several different decision functions – some good and some not so good, or even foolish. So, now the primary problem is no longer the choice of the “right” action in A but the choice of the “right” decision function d in the decision function space A^𝒳, comprised of all possible mappings from 𝒳 to A. Because X is a random variable, the loss ℓ(θ, d(X)) becomes a random quantity that has to be averaged over all possible values of X. Note that in the coin example above, the decision function space A^𝒳 consists of 2^(n+1) decision functions (one choice from A for each of the n + 1 possible values of X). Fortunately, since most of them are uninteresting from the standpoint of decision theory, we shall focus only on a much smaller subset of so-called admissible decision functions.

The first step in setting up a data-based decision model is the introduction of the risk function ℜ : Θ × A^𝒳 −→ R that assigns to each state of nature θ in Θ and to each decision function d : 𝒳 −→ A in A^𝒳 the numerical value ℜ(θ, d), interpreted as the risk incurred by using decision rule d when the “true” state of nature is θ. Remember, the loss ℓ(θ, d(X)) is now a random quantity, because it is a function of X. The classical decision-theoretic definition of the risk function ℜ : Θ × A^𝒳 −→ R (expected value of the “random loss”) is as follows:
ℜ(θ, d) = ℓ(θ, d(x1))p(x1|θ) + ℓ(θ, d(x2))p(x2|θ) + · · · + ℓ(θ, d(xn))p(xn|θ),

where x1, x2, · · · , xn are the possible values of the parent random variable X, and p(x1|θ) = P(X = x1 | θ = θ) denotes the probability that the parent random variable X takes value x1, given that the state of nature is θ. Likewise for p(x2|θ) = P(X = x2 | θ = θ), etc. Finally, d(x1) is the action assigned by decision rule d to outcome x1. We interpret d(x2), · · · , d(xn) similarly. Notice that the no-data decision model ⟨Θ, A, ℓ⟩, treated earlier, has been replaced by a brand-new, data-based decision model ⟨Θ, A^𝒳, ℜ⟩, in which the decision function space A^𝒳 has an underlying structure, including the probability distributions p(x|θ) (one for each θ) for X, whose exploitation is the main
objective of data-based decision theory. We do not yet know which is the best decision function in a data-based decision model. As in the no-data case, we can use:

1. The minimax principle, by calculating the maximum risk

M(d) =df max{ℜ(θ1, d), ℜ(θ2, d), · · · , ℜ(θm, d)}

for all admissible d over the various possible states of nature θ1, θ2, · · · , θm. Then choose the decision function d∗ that minimizes the maximums M(d), M(d′), M(d″), · · · in the model. In this way, a partial ordering is defined over the decision function space A^𝒳, ranking all decision rules from a (pessimistic) minimax perspective.

2. The Bayes minimum expected value principle assumes that the states of nature come with prior probability “weights” p0(θ), so that one can determine the Bayes risk (average risk over the possible states of nature)
B(d) =df Bp0(d) = ℜ(θ1, d)p0(θ1) + ℜ(θ2, d)p0(θ2) + · · · + ℜ(θm, d)p0(θm)
for using decision function d. The preferred decision function d∗ is the one that minimizes the Bayes risk B(d). Thus, there is a family of decision functions d1, d2, d3, · · · with respective Bayes risks B(d1), B(d2), B(d3), · · · that have to be calculated. A decision rule d∗ is called a Bayes decision function just in case its Bayes risk B(d∗) is the smallest in the list B(d1), B(d2), B(d3), · · · .

Let us consider further the examples introduced earlier.

1. Suppose that in addition to the givens in Example 1 (Alma's garage sale), Alma has two possible forecasts (based on how her neighbors have profited from their recent garage sales), captured by the values of the parent random variable X: outcome x1 forecasts a “great sale” and x2 forecasts a “weak sale”. The pertinent frequencies, in the form of probability distributions p(x | θ1) and p(x | θ2), are given by the table

          x1     x2
    θ1   0.7    0.3
    θ2   0.4    0.6
How many distinct decision functions are there? We have 4 = 2² possible decision rules, including some patently wrong ones that totally ignore the data:

          x1     x2
    d1    a1     a1
    d2    a1     a2
    d3    a2     a1
    d4    a2     a2
Note that decision rule d1 picks action a1 = d1(x1) = d1(x2) independently of what the observation result tells Alma. The only decision rule that makes good empirical sense is d3: it assigns action a2 (“sale”) to the forecast x1 of a “great sale”, and it assigns action a1 (“no sale”) to the forecast x2 of a “weak sale”. In applications, the “bad” decision rules are automatically eliminated by the dominance relation. Recall that a decision function d dominates another decision function d′ just in case ∀θ[ℜ(θ, d) ≤ ℜ(θ, d′)]. We say that d dominates d′ strictly, or alternatively that decision function d is better than d′, just in case ∀θ[ℜ(θ, d) ≤ ℜ(θ, d′)] and ∃θ[ℜ(θ, d) < ℜ(θ, d′)]. A decision function d is said to be admissible if and only if d is not strictly dominated by any other decision function. From now on, we confine our attention to the subset ∂A^𝒳 ⊂ A^𝒳 of admissible decision functions. In view of dominance, the other decision functions will never be picked. But of course, in general there are many admissible decision functions in A^𝒳, forming a kind of boundary.
Getting back to Alma's problem, first we calculate the risks ℜ(θ1, d2), ℜ(θ1, d3), ℜ(θ2, d2) and ℜ(θ2, d3). Then, depending on the availability of the prior p0(θ1), p0(θ2), we calculate the Bayes risks B(d2) and B(d3). Finally, we choose the decision function d∗ with the smallest Bayes risk. (A computational sketch of this procedure, applied to all four rules, is given at the end of this handout.)

2. Returning to the Alma Jones example (Example 2, choices in preparing dinner), suppose Alma lost the afternoon paper that Mr. Jones loves to read after work. When Mr. Jones returns home, she will have to tell him that she lost the damn paper. That could make Mr. Jones terribly mad, or maybe not too mad. This has happened many times before. Alma foresees 4 possible responses from Mr. Jones that will help her decide which food to prepare for dinner. Here the parent random variable X takes four values, x1, x2, x3, x4, where Mr. Jones's responses are encoded precisely by these values:

x1 = “Newspapers will get lost.”
x2 = “I keep telling you ‘a place for everything and everything in its place’.”
x3 = “Why did I ever get married?”
x4 = an absent-minded, far-away look.
Based on past experience, the frequencies p(x|θ1), p(x|θ2), p(x|θ3) of Mr. Jones's possible responses are given by the table

          x1     x2     x3     x4
    θ1   0.5    0.4    0.1     0
    θ2   0.2    0.5    0.2    0.1
    θ3    0     0.2    0.5    0.3
Here the first row specifies the probability distribution p(x|θ1), the second row gives p(x|θ2), and the third row defines p(x|θ3) for X. Evidently, here the risk function for θ1 and decision rule d is given by
ℜ(θ1, d) = ℓ(θ1, d(x1))p(x1|θ1) + ℓ(θ1, d(x2))p(x2|θ1) + · · · + ℓ(θ1, d(x4))p(x4|θ1).

Now, suppose Alma knows that Mr. Jones is in a good mood 30% of the time, in a normal mood 50% of the time, and in a bad mood 20% of the time. The resulting enriched data-based decision model ⟨Θ, A^𝒳, ℜ⟩ allows us to calculate the Bayes risks
B(d) = ℜ(θ1, d)p0(θ1) + ℜ(θ2, d)p0(θ2) + ℜ(θ3, d)p0(θ3)
for all admissible d, by substituting the given percentages for p0(θ). Alma can now choose the best strategy, based on her data about Mr. Jones's behavior, reflecting whether Mr. Jones is in a good, normal, or bad mood.

Suppose we observe a parent random variable X with possible values x1, x2, x3 and note that the observation outcome was “X = x1”. Because we can calculate the marginal probability P(X = x1) from the weighted average
P(X = x1) = p(x1|θ1)p0(θ1) + p(x1|θ2)p0(θ2) + · · · + p(x1|θm)p0(θm)
of the conditional probabilities and the prior, the posterior probability

p1(θi|x1) = P(θ = θi | X = x1) = p0(θi) · p(x1|θi) / P(X = x1)
(known from Bayes' theorem) can be used in determining the conditional Bayes risk

B(d|x1) = ℜ(θ1, d)p1(θ1|x1) + ℜ(θ2, d)p1(θ2|x1) + ℜ(θ3, d)p1(θ3|x1)
for all d. As before, the decision maker chooses the decision function d∗ whose Bayes risk B(d∗|x1) is the smallest among all the Bayes risks B(dj|x1) of the available decision functions.
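Finally, here is the computational sketch promised earlier for Alma's garage sale (again not part of the original handout; it assumes the loss table, the forecast distributions p(x|θ), and the prior p0(θ1) = 0.7, p0(θ2) = 0.3 given above, with the labels rendered as plain strings). It enumerates all four decision rules, computes every risk ℜ(θ, d), and picks the rule with the smallest Bayes risk:

```python
# Sketch: risks and Bayes risks for all decision rules in the garage-sale example.
from itertools import product

states, actions, outcomes = ["th1", "th2"], ["a1", "a2"], ["x1", "x2"]

loss  = {("th1", "a1"): 170, ("th1", "a2"): 0,     # loss table from Example 1
         ("th2", "a1"): 170, ("th2", "a2"): 200}
p_x   = {("x1", "th1"): 0.7, ("x2", "th1"): 0.3,   # forecast distributions p(x|theta)
         ("x1", "th2"): 0.4, ("x2", "th2"): 0.6}
prior = {"th1": 0.7, "th2": 0.3}

# All |A|^|X| = 2^2 = 4 decision rules d : X -> A, i.e. d1, ..., d4.
rules = [dict(zip(outcomes, choice)) for choice in product(actions, repeat=2)]

def risk(theta, d):
    """R(theta, d): loss of rule d averaged over outcomes x, given theta."""
    return sum(p_x[(x, theta)] * loss[(theta, d[x])] for x in outcomes)

def bayes_risk(d):
    """B(d): risk averaged over the states with the prior weights."""
    return sum(prior[t] * risk(t, d) for t in states)

for d in rules:
    print(d, [risk(t, d) for t in states], bayes_risk(d))

best = min(rules, key=bayes_risk)   # rule with the smallest Bayes risk
print("Bayes decision rule:", best)
```

Under this sunny-heavy prior the data-ignoring rule d4 (always hold the sale) actually attains the smallest Bayes risk; re-running the sketch with a moderately rainy prior such as p0(θ1) = 0.2 makes the data-respecting rule d3 win, which illustrates how the prior and the observations jointly determine the Bayes decision function.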