Lecture Notes for MS&E 325: Topics in Stochastic Optimization (Stanford) and CIS 677: Algorithmic Decision Theory and Bayesian Optimization (UPenn)

Ashish Goel, Stanford University
[email protected]

Sudipto Guha, University of Pennsylvania
[email protected]

Winter 2008-09 (Stanford); Spring 2008-09 (UPenn)

Under Construction: Do not Distribute
Chapter 1
Introduction: Class Overview, Markov Decision Processes, and Priors

This class deals with optimization problems where the input comes from a probability distribution, or in some cases, is generated iteratively by an adversary. The first part of the class deals with "Algorithmic Decision Theory", where we will study algorithms for designing decision-making strategies that are provably (near-)optimal, computationally efficient, and that use available and acquired data, as well as probabilistic models thereof. This field touches upon statistics, machine learning, combinatorial algorithms, and convex/linear optimization, and some of the results we study are several decades old. The field is seeing a resurgence because of the large scale of data being generated by Internet applications. The second part of this class will deal with combinatorial optimization problems, such as knapsack, scheduling, routing, network design, and inventory management, given stochastic inputs.
1.1 Input models and objectives
Consider N alternatives, and assume that you have to choose one alternative during every time step t, where t goes from 0 to $\infty$. This series of choices is called a "strategy"; the arm chosen by the strategy at time t is denoted $a_t$. Generally, the alternatives are called "arms", and choosing an alternative is called "playing" the corresponding arm. Arm i gives a reward of $r_i(t)$ at time t, where $r_i(t)$ may depend on all the past choices made by the strategy. The quantity $r_i(t)$ may be a random variable, with a distribution that is unknown, known, or on which you have some "prior beliefs". The quantity $r_i(t)$ may also be chosen adversarially. This gives two broad classes of input models, probabilistic and adversarial, with many important variations. We will study these models in great detail. There are also several objectives we could have. The first is a "finite-horizon objective", where we are given a finite horizon T and the goal is to:

$$\text{Maximize } E\left[\sum_{t=0}^{T-1} r_{a_t}(t)\right].$$
The second is the infinite-horizon discounted reward model. Here, we are given a discount factor $\theta \in [0, 1)$; informally, this is today's value for a Dollar that we will make tomorrow. Think of this as (1 − the interest rate). If you have a long planning horizon, θ should be chosen to be very close to 1. If you are very short-sighted, θ will be close to 0. The objective is:

$$\text{Maximize } E\left[\sum_{t=0}^{\infty} \theta^t r_{a_t}(t)\right].$$
By linearity of expectation, we can exchange the expectation and the (possibly infinite) sum, so that we obtain the equivalent objectives:

$$\text{Maximize } \sum_{t=0}^{T-1} E[r_{a_t}(t)], \qquad\text{and}\qquad \text{Maximize } \sum_{t=0}^{\infty} \theta^t E[r_{a_t}(t)].$$
We will now redefine $r_{a_t}(t)$ to be this expected reward, and get rid of the expectation altogether. The expectation is over both the input (if the input is probabilistic) and the strategy (if the strategy is randomized). These problems are all collectively called "multi-armed bandit problems". Colloquially, a slot machine in a casino is also called a one-armed bandit, since it has one lever and usually robs you of your money. Hence, a multi-armed bandit is an appropriate name for this class of problems, where we can think of each alternative as one arm of a multi-armed slot machine. Remember, our goal in this class is to design these strategies algorithmically.
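To make the discounted objective concrete, here is a tiny Python helper (ours, not from the notes) that evaluates $\sum_t \theta^t r_t$ for a realized reward stream:

```python
def discounted_value(rewards, theta):
    """Sum of theta^t * r_t over a (finite prefix of a) reward stream.
    A perpetual reward of 1 is worth 1/(1 - theta); e.g., 10 when theta = 0.9."""
    return sum(theta ** t * r for t, r in enumerate(rewards))

print(discounted_value([1.0] * 200, theta=0.9))  # ~10, close to 1/(1 - 0.9)
```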
1.2 An illustrative example
Suppose you are in Las Vegas for a year, and will go to a casino every day. In the casino there are two slot machines, each of which gives a Dollar as a reward with some unknown probability, which may not be the same for both machines. For the first, there is data available from 3 trials: one of them was a success (i.e., gave a reward) and two were failures (i.e., gave no reward). For the second machine, there is data available from $2 \times 10^8$ trials, $10^8$ of which gave a reward. We will say that the first machine is a (1, 2) machine: the first component of the tuple refers to the number of "successful" trials, and the second refers to the number of unsuccessful ones. Thus, the second machine is a $(10^8, 10^8)$ machine. What would be your best guess of the expected reward from the first machine? Clearly 1/3. For the second machine? Clearly 1/2. Also, it seems clear that the second machine is less risky. So which machine should you play on the very first day? Surprisingly, it is the first. If you get a reward the first day, the first machine becomes a (2, 2) machine and you can play it again. If you get a reward again, the first machine becomes a (3, 2) machine, and starts to look better than the second. If, on the other hand, the first machine does not give you a reward on the first day, or the second day, then you can revert to playing the second machine. We will see how to make this statement more formal. Here, we sacrifice some expected reward (i.e., choose not to exploit the best available machine) on the first day in order to explore an arm with a high "upside"; this is an example of an "exploration-exploitation" tradeoff, and is a central feature of algorithmic decision theory.
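A quick sanity check of the arithmetic in this example, using exact fractions (the helper is a hypothetical illustration, not part of the notes):

```python
from fractions import Fraction

def success_estimate(successes, failures):
    """The natural estimate used in the example: successes / total trials."""
    return Fraction(successes, successes + failures)

print(success_estimate(1, 2))          # (1, 2) machine: 1/3
print(success_estimate(10**8, 10**8))  # (10^8, 10^8) machine: 1/2
print(success_estimate(3, 2))          # after two lucky plays: 3/5 > 1/2
```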
1.3 Markov Decision Processes
Before starting on the main topics in this class, it is worth seeing one of the most basic and useful tools in stochastic optimization: Markov Decision Processes (MDPs). When you see a stochastic optimization problem, this is probably the first tool you should try, the second being Bellman's formula for dynamic programming, which we will briefly see later. Only when these are inefficient or inapplicable should you try more advanced approaches such as the ones we are going to see in the rest of this class.

Assume you are given a finite state space S, an initial state $s_0$, a set of actions A, a reward function $r : S \times A \to \mathbb{R}$, and a function $P : S \times A \times S \to [0, 1]$ such that $\sum_{v \in S} P(u, a, v) = 1$ for all states $u \in S$ and all actions $a \in A$. Informally, an MDP is like a Markov chain, but the transitions happen only when an action is taken, and the transition probabilities depend on the state as well as the action taken. Also, depending on the state u and the action a, we get a reward r(u, a). Of course, the reward itself could be a random variable, but as pointed out earlier, we replace it by its expected value in that case.

Given an MDP, you might want to maximize the finite-horizon reward or the infinite-horizon discounted reward. We will focus on the latter for now; the former is tractable as well. Let θ be the discount factor. Let π(s) be the expected discounted reward obtained by the optimum strategy starting from state s, and let π(s, a) be the expected discounted reward obtained by the optimum strategy starting from state s assuming that action a is performed first. Then the following linear constraints must be satisfied by the π's:

$$\forall u \in S, a \in A: \quad \pi(u) \ge \pi(u, a) \qquad (1.1)$$

$$\forall u \in S, a \in A: \quad \pi(u, a) \ge r(u, a) + \theta \sum_{v \in S} P(u, a, v)\,\pi(v) \qquad (1.2)$$

The optimum solution can now be found by a linear program with the constraints given above and the linear objective:

$$\text{Minimize } \pi(s_0) \qquad (1.3)$$

Thus, MDPs can be solved very efficiently, provided the state space is small. MDPs can also be defined with countable state spaces, but then the LP formulation above is not directly useful as a solution procedure.
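Here is a minimal sketch of this LP in Python using scipy (an illustration, not the notes' code). After eliminating the $\pi(u, a)$ variables, the constraints become $\pi(u) \ge r(u, a) + \theta \sum_v P(u, a, v)\pi(v)$ for every state-action pair; we minimize $\sum_u \pi(u)$ rather than just $\pi(s_0)$, which recovers the optimal value at every state under the same constraints:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, theta):
    """Solve a discounted MDP by LP.
    P[u, a, v]: transition probabilities, shape (S, A, S); r[u, a]: expected rewards.
    Returns pi(u), the optimal discounted value from each state u."""
    S, A = r.shape
    rows, rhs = [], []
    for u in range(S):
        for a in range(A):
            # theta * P[u, a, :] . pi - pi[u] <= -r[u, a]
            row = theta * P[u, a, :].copy()
            row[u] -= 1.0
            rows.append(row)
            rhs.append(-r[u, a])
    res = linprog(c=np.ones(S), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * S, method="highs")
    return res.x

# A toy 2-state, 2-action MDP (made-up numbers, purely illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.5, 0.2]])
print(solve_mdp_lp(P, r, theta=0.9))
```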
1.4 Priors and posteriors
Let us revisit the illustrative example. Consider the (1, 2) machine. Given that this is all you know about the machine, what would be a reasonable estimate of its success probability? It seems reasonable to say 1/3; remember, this is an illustrative example, and there are scenarios in which some other estimate might make sense. The tuple (1, 2) together with the probability estimate (1/3) represents a prior belief on the machine. If you play this machine and you get a success (which we now believe will happen with probability 1/3), then the machine becomes a (2, 2) machine, with a success probability estimate of 1/2; this is called a "posterior" belief. Of course, if the first trial results in a failure, the posterior would have been (1, 3), with a success probability estimate of 1/4.

This is an example of what are known as Beta priors. We will use (α, β) to denote, roughly¹, the number of successes and failures, respectively. We will see these priors in some detail later on, and will refine and motivate the definitions. These are the most important class of priors. Another prior we will use frequently (despite it being trivial) is the "fixed prior", where we believe we know the success probability p and it never changes. We will refer to an arm with this prior as a standard arm with probability p.

For the purpose of this class, we can think of a prior as a Markov chain with a countable state space S, a transition probability matrix (or kernel) P, a current state $u \in S$, and a reward function $r : S \to \mathbb{R}$. Typically, the range of the reward function will be [0, 1], and we will interpret it as the probability of success. Beta priors can then be interpreted as having state space $\mathbb{Z}^+ \times \mathbb{Z}^+$, reward function $r(\alpha, \beta) = \alpha/(\alpha + \beta)$, and transition probabilities $P((\alpha, \beta), (\alpha+1, \beta)) = \alpha/(\alpha+\beta)$ and $P((\alpha, \beta), (\alpha, \beta+1)) = \beta/(\alpha+\beta)$ (and 0 elsewhere). When we perform an experiment on, i.e., play, an arm with prior (S, r, P, u), the state of the arm changes to v according to the transition probability matrix P. We assume that we observe this change, and we now obtain the "posterior" (S, r, P, v), which acts as the prior for the next step. For much of this class, we will assume that the prior changes only when an arm is played.

How do we know we have the correct prior? Where does a prior come from? On one level, these are philosophical questions for which there is no formal answer – since this is a class in optimization, we can assume that the priors are given to us, that they are reliable, and we will optimize assuming they are correct. On another level, priors often come from a generative model, i.e., from some knowledge of the underlying process. For example, we might know (or believe) that a slot machine gives Bernoulli i.i.d. rewards, with an unknown probability parameter p. We might further make the Bayesian assumption that the probability that this parameter is p, given the number of successes and failures we have observed, is proportional to the probability that we would get the observed number of successes and failures if the parameter were p. This leads to the Beta priors, as we will discuss later. Of course, this just brings up the philosophical question of whether the Bayesian assumption is valid or not. In this class, we will be agnostic with respect to this question. Given a prior, we will attempt to obtain optimum strategies with respect to that prior. But we will also discuss the scenario where the rewards are adversarial.

¹More precisely, α − 1 and β − 1 denote the number of successes and failures, respectively.
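The Beta prior as a Markov chain translates directly into code; the sketch below (our notation, not from the notes) tracks the state (α, β) and performs the posterior update on each play:

```python
import random

class BetaArm:
    """A Beta prior (alpha, beta): expected reward alpha/(alpha+beta), and
    transitions (alpha, beta) -> (alpha+1, beta) with probability alpha/(alpha+beta)
    (a success), and -> (alpha, beta+1) otherwise (a failure)."""

    def __init__(self, alpha=1, beta=1):
        self.alpha, self.beta = alpha, beta

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def play(self):
        """Play once: observe a success/failure and move to the posterior state,
        which acts as the prior for the next step."""
        success = random.random() < self.mean()
        if success:
            self.alpha += 1
        else:
            self.beta += 1
        return success

# The (1, 2) machine of Section 1.2: a success turns it into a (2, 2) machine.
arm = BetaArm(1, 2)
print(arm.mean())  # 1/3
```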
Chapter 2
Discounted Multi-Armed Bandits and the Gittins Index

We will now study the problem of maximizing the expected discounted infinite-horizon reward, given a discount factor θ, and given n arms with the i-th arm having a prior $(S^i, r^i, P^i, u^i)$. Our goal is to maximize the expected discounted reward by playing exactly one arm in every time step. The state $u^i$ is observable, an arm makes a transition according to $P^i$ when it is played, and this transition takes exactly one time unit, during which no other arm can be played. This problem is useful in many domains, including advertising, marketing, clinical trials, oil exploration, etc.

Since $S^i, r^i, P^i$ remain fixed as the prior evolves, the state of the system at any time is described by $(u^1, u^2, \ldots, u^n)$. If the state spaces of the individual arms are finite and of size k each, the state space of the system can be of size $O(k^n)$, which precludes a direct dynamic programming or MDP based approach. We will still be able to solve this efficiently using a striking and beautiful theorem, due to Gittins and Jones. We will assume for now that the range of each reward function is [0, 1]. Also recall that $\theta \in [0, 1)$.
Theorem 2.1 [The Gittins' index theorem] Given a discount factor θ, there exists a function $g_\theta$ from the space of all priors to $[0, 1/(1-\theta)]$ such that it is an optimum strategy to play an arm i for which $g_\theta(S^i, r^i, P^i, u^i)$ is the largest.
This theorem is remarkable in terms of how sweeping it is. Notice that the function $g_\theta$, also called the Gittins' index, depends only on one arm, and is completely oblivious to how many other arms there are in the system and what their priors are. For common priors such as the Beta priors, one can purchase or pre-compute the Gittins' index for various values of α, β, θ, and then the optimum strategy can be implemented using a simple table-lookup based process, much simpler than an MDP or a dynamic program over the joint space of all the arms. This theorem is existential, but as it turns out, the very existence of the Gittins' index leads to an efficient algorithm for computing it. We will outline a method for finite state spaces.
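Given a precomputed index, the optimum strategy of Theorem 2.1 is a one-liner; a sketch (using the hypothetical BetaArm class from Chapter 1, and an index function assumed to be supplied, e.g., from a lookup table):

```python
def index_policy(arms, index, steps):
    """At every step, play an arm maximizing the Gittins index of its current
    state (alpha, beta). `index(alpha, beta)` is assumed precomputed."""
    for _ in range(steps):
        i = max(range(len(arms)), key=lambda j: index(arms[j].alpha, arms[j].beta))
        arms[i].play()
```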
2.1 Computing the Gittins' index
First, recall the definition of the standard arm with success probability p; this arm, denoted $R_p$, always gives a reward with probability p. Given two standard arms $R_p$ and $R_q$, where p < q, which would you rather play? Clearly, $R_q$. Thus, the Gittins' index of $R_q$ must be higher than that of $R_p$. $R_1$ dominates all possible arms, and $R_0$ is dominated by all possible arms. The arm $R_p$ yields a total discounted profit of $p/(1-\theta)$, by summing up the infinite geometric progression $p, p\theta, p\theta^2, \ldots$

Given an arm J with prior (S, r, P, u), there must be a standard arm $R_p$ such that, given these two arms, the optimum strategy is indifferent between playing J or $R_p$ in the first step. The Gittins' index of arm J must then be the same as the Gittins' index of arm $R_p$, and hence any strictly increasing function of p can be used as the Gittins' index. In this class, we will use $g_\theta = p/(1-\theta)$. The goal then is merely to find the arm $R_p$ given arm J. This is easily accomplished using a simple LP. Consider the MDP with the finite state space S and action space $\{a_J, a_R\}$, where the first action corresponds to playing arm J and the second corresponds to playing arm $R_p$, in which case we will keep playing $R_p$ forever, since nothing changes in the next step. The first action yields reward r(u) when in state u, whereas the second yields the total reward $x = p/(1-\theta)$. The first action leads to a transition according to the matrix P, whereas the second leads to termination of the process. The optimum solution to this MDP is given by:

$$\text{Minimize } \pi(u), \text{ subject to: (a) } \forall s \in S,\ \pi(s) \ge x, \text{ and (b) } \forall s \in S,\ \pi(s) \ge r(s) + \theta \sum_{v \in S} P(s, v)\,\pi(v).$$

If this objective function is bigger than x, then it must be better to play arm J in state u. Let the optimum objective function value from this LP be denoted $z^*(x)$. Our goal is to find the smallest x such that $z^*(x) \le x$ (which will denote the point of indifference between $R_p$ and J). We can obtain this by performing a binary search over x: if $z^*(x) = x$ then the Gittins' index cannot be larger than x, and if $z^*(x) > x$ then the Gittins' index must be larger than x. In fact, the index can also be obtained directly from the LP:

$$\text{Minimize } x, \text{ subject to: (a) } \pi(u) \le x, \text{ (b) } \forall s \in S,\ \pi(s) \ge x, \text{ and (c) } \forall s \in S,\ \pi(s) \ge r(s) + \theta \sum_{v \in S} P(s, v)\,\pi(v).$$
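The binary search above is easy to implement for Beta priors, where transitions only increase α + β, so the value of the "play J or retire for x" MDP can be computed by exact backward induction over a truncated state space. A sketch (the truncation at `depth` is our approximation, not part of the notes):

```python
def retirement_value(alpha, beta, theta, x, depth=200):
    """pi(alpha, beta) for the MDP with actions 'play J' and 'retire for x'.
    States deeper than `depth` are approximated by max(x, mean/(1 - theta))."""
    base = alpha + beta
    V = {}
    for a in range(alpha, alpha + depth + 1):      # deepest level first
        b = base + depth - a
        V[(a, b)] = max(x, (a / (a + b)) / (1.0 - theta))
    for level in range(depth - 1, -1, -1):
        Vnew = {}
        for a in range(alpha, alpha + level + 1):
            b = base + level - a
            mean = a / (a + b)
            play = mean + theta * (mean * V[(a + 1, b)] + (1 - mean) * V[(a, b + 1)])
            Vnew[(a, b)] = max(x, play)
        V = Vnew
    return V[(alpha, beta)]

def gittins_index(alpha, beta, theta, tol=1e-6):
    """Smallest x with z*(x) = x; in the scaling used here this is g_theta = p/(1 - theta)."""
    lo, hi = 0.0, 1.0 / (1.0 - theta)
    while hi - lo > tol:
        x = (lo + hi) / 2
        if retirement_value(alpha, beta, theta, x) > x + 1e-12:
            lo = x   # playing J beats retiring: the index exceeds x
        else:
            hi = x
    return (lo + hi) / 2

print(gittins_index(1, 2, theta=0.95))  # compare against the spreadsheet below
```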
A spreadsheet for approximately computing the Gittins' index for arms with Beta priors is available at http://www.stanford.edu/~ashishg/msande325_09/gittins_index.xls.

Exercise 2.1 In this problem, we will assume that three advertisers have made bids on the same keyword in a search engine. The search engine (which acts as the auctioneer) assigns (α, β) priors to each advertiser, and uses their Gittins' indices to compute the winner. The advertisers are:

1. Advertiser (a) has α = 2, β = 5, has bid $1 per click, and has no budget constraint.
2. Advertiser (b) has α = 1, β = 4, pays $0.2 per impression and additionally $1 if his ad is clicked. He has no budget constraint.
3. Advertiser (c) has α = 1, β = 2, has bid $1.5 per click, and his ad can only be shown 5 times (including this one).

There is a single slot, the discount factor is θ = 0.95, and a first-price auction is used. Compute the Gittins' index for each of the three advertisers. Which ad should the auctioneer allocate the slot to? Briefly speculate on what might be a reasonable second-price auction.
2.2 A proof of the Gittins' index theorem
We will follow the proof of Tsitsiklis; the paper is on the class web-page.
Chapter 3
Bayesian Updates, Beta Priors, and Martingales

As mentioned before, priors often come from a parametrized generative model, followed by Bayesian updates. We will call such priors Bayesian. We will assume that the generative model is a family of single-parameter distributions, parametrized by θ. Let $f_\theta$ denote the probability distribution of the reward if the underlying parameter of the generative model is θ. If we knew θ, the prior would be trivial. At time t, the prior is given as a probability distribution $p_t$ on the parameter θ. Let $x_t$ denote the reward obtained the t-th time an arm is played (i.e., the observation at time t). We are going to assume there exist suitable probability measures over which we can integrate the functions $f_\theta$ and $p_t$.

A Bayesian update essentially says that the posterior probability at time t (i.e., the prior for time t + 1) of the parameter being θ, given an observation $x_t$, is proportional to the probability of the observation being $x_t$ given parameter θ. This of course is modulated by the probability of the parameter being θ at time t. Hence, $p_{t+1}(\theta \mid x_t)$ is proportional to $f_\theta(x_t)\, p_t(\theta)$. We have to normalize this to make $p_{t+1}$ a probability distribution, which gives us

$$p_{t+1}(\theta \mid x_t) = \frac{f_\theta(x_t)\, p_t(\theta)}{\int_{\theta'} f_{\theta'}(x_t)\, p_t(\theta')\, d\theta'}.$$

3.1 Beta priors
Recall that the prior Beta$(\alpha_t, \beta_t)$ corresponds to having observed $\alpha_t - 1$ successes and $\beta_t - 1$ failures up to time t. We will show that we can also interpret the prior Beta$(\alpha_t, \beta_t)$ as one that comes from the generative model of Bernoulli distributions with Bayesian updates, where the parameter θ corresponds to the success probability of the Bernoulli distribution. Suppose $\alpha_0 = 1$ and $\beta_0 = 1$, i.e., we have observed 0 successes and 0 failures initially. It seems natural to have this correspond to a uniform prior on θ over the range [0, 1]. Applying Bayes' rule repeatedly, we get that $p_t(\theta \mid \alpha_t, \beta_t)$ is proportional to the probability of observing $\alpha_t - 1$ successes and $\beta_t - 1$ failures if the underlying parameter is θ, i.e., proportional to

$$\binom{\alpha_t + \beta_t - 2}{\alpha_t - 1}\, \theta^{\alpha_t - 1} (1 - \theta)^{\beta_t - 1}.$$

Normalizing, and using the fact that $\int_{x=0}^{1} x^{a-1}(1-x)^{b-1}\, dx = \Gamma(a)\Gamma(b)/\Gamma(a+b)$, we get

$$p_t(\theta) = \frac{\Gamma(\alpha_t + \beta_t)\, \theta^{\alpha_t - 1}(1-\theta)^{\beta_t - 1}}{\Gamma(\alpha_t)\,\Gamma(\beta_t)}.$$

This is known as the Beta distribution. The following exercise completes the proof that the transition matrix over state spaces defined earlier is the same as the Bayesian-update based prior derived above.

Exercise 3.1 Given the prior $p_t$ as defined above, show that the probability of obtaining a reward at time t is $\alpha_t/(\alpha_t + \beta_t)$.
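The update rule and Exercise 3.1 can be checked numerically on a discretized parameter grid; a small sketch (the grid discretization is ours, purely for illustration):

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One Bayesian update on a grid: posterior ~ likelihood * prior, normalized."""
    post = likelihood * prior
    return post / post.sum()

grid = np.linspace(0.001, 0.999, 999)   # discretized theta in (0, 1)
p = np.ones_like(grid) / len(grid)      # uniform prior = Beta(1, 1)
p = bayes_update(p, grid)               # a success: f_theta(1) = theta
p = bayes_update(p, 1 - grid)           # a failure: f_theta(0) = 1 - theta
# Posterior is (approximately) Beta(2, 2); the predictive probability of a
# reward equals the posterior mean, alpha/(alpha+beta) = 1/2.
print((grid * p).sum())
```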
3.2 Bayesian priors and martingales
We will now show that Bayesian updates result in a martingale process for the rewards. Consider the case where the prior can be given both over a state space (S, r, P, u) and as a generative model with Bayesian updates. Let $s_t$ denote the state at time t, and let $r_t$ denote the expected reward in state $s_t$. We will show that the sequence of rewards $r_t$ is a martingale with respect to the sequence of states $s_t$. This fact will come in handy multiple times later in this class. Formally, the claim that the sequence of rewards $r_t$ is a martingale with respect to the sequence of states $s_t$ is the same as saying that

$$E[r_{t+1} \mid s_0, s_1, \ldots, s_t] = r_t.$$

We will use x to denote the observed reward at time t and y to denote the observed reward at time t + 1. Thinking of the prior as coming from a generative model, we get¹

$$r_t = \int_x x \int_\theta p_t(\theta)\, f_\theta(x)\, d\theta\, dx,$$

where $p_t$ depends only on $s_t$. Similarly, we get

$$E[r_{t+1} \mid s_t] = \int_y y \int_\theta f_\theta(y)\, p_{t+1}(\theta \mid s_t)\, d\theta\, dy.$$

Using Bayesian updates, we get

$$p_{t+1}(\theta \mid s_t, x) = \frac{f_\theta(x)\, p_t(\theta)}{\int_{\theta'} f_{\theta'}(x)\, p_t(\theta')\, d\theta'}.$$

In order to remove the conditioning over x we need to integrate over x, i.e.,

$$p_{t+1}(\theta \mid s_t) = \int_x p_{t+1}(\theta \mid s_t, x) \int_{\theta''} p_t(\theta'')\, f_{\theta''}(x)\, d\theta''\, dx.$$

Combining, we obtain:

$$E[r_{t+1} \mid s_t] = \int_y y \int_\theta f_\theta(y) \int_x \frac{f_\theta(x)\, p_t(\theta)}{\int_{\theta'} f_{\theta'}(x)\, p_t(\theta')\, d\theta'} \int_{\theta''} p_t(\theta'')\, f_{\theta''}(x)\, d\theta''\, dx\, d\theta\, dy.$$

The integrals over θ′ and θ′′ cancel out, giving

$$E[r_{t+1} \mid s_t] = \int_y y \int_\theta f_\theta(y) \int_x f_\theta(x)\, p_t(\theta)\, dx\, d\theta\, dy.$$

Since $f_\theta$ is a probability distribution, the inner-most integral evaluates to $p_t(\theta)$, giving

$$E[r_{t+1} \mid s_t] = \int_y y \int_\theta f_\theta(y)\, p_t(\theta)\, d\theta\, dy.$$

This is the same as $r_t$.

Exercise 3.2 Give an example of a prior over a state space such that this prior can not be obtained from any generative model using Bayesian updates. Prove your claim.

¹Here θ′ and θ′′ are just different symbols; they are not derivatives or second derivatives of θ.
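The martingale claim above can also be checked in closed form for Beta priors; a quick sketch with exact arithmetic (the helper is ours, not from the notes):

```python
from fractions import Fraction

def expected_next_reward(alpha, beta):
    """E[r_{t+1} | s_t] for a Beta(alpha, beta) prior: with probability
    alpha/(alpha+beta) the state moves to (alpha+1, beta), else to (alpha, beta+1)."""
    a, b = Fraction(alpha), Fraction(beta)
    p = a / (a + b)
    return p * (a + 1) / (a + b + 1) + (1 - p) * a / (a + b + 1)

# The martingale property: the expected next reward equals the current reward.
assert expected_next_reward(3, 5) == Fraction(3, 8)
```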
Exercise 3.3 Extra experiments cannot hurt: The budgeted learning problem is defined as follows. You are given n arms with a separate prior $(S^i, r^i, P^i, u^i)$ on each arm. You are allowed to make T plays. At the end of the T plays, you must pick a single arm i, and you will earn the expected reward of the chosen arm at that time. Let $z^*$ denote the expected reward obtained by the optimum strategy for this problem. Show that $z^*$ is non-decreasing in T if the priors are Bayesian.

The next two exercises illustrate how surprising the Gittins' index theorem is.

Exercise 3.4 Show that the budgeted learning problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples. Hint: Define arms A, B such that given A, B the optimum choice is to play A, whereas given A and two copies of B, the optimum choice is to play B. Hence, there can not be a total order on all priors.

Exercise 3.5 Show that the finite-horizon multi-armed bandit problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples.
Chapter 4
Minimizing Regret Against Unknown Distributions

The algorithm in this chapter is based on the paper: Finite-time Analysis of the Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, and P. Fischer, http://www.springerlink.com/content/l7v1647363415h1t/.

The Gittins' index is efficiently computable, decouples different arms, and gives an optimum solution. There is one problem, of course: it assumes a prior, and is optimum only in the class of strategies which have no additional information. We will now make the problem one level more complex. We will assume that the reward for each arm comes from an unknown probability distribution.

Consider N arms. Let $X_{i,s}$ denote the reward obtained when the i-th arm is played for the s-th time. We will assume that the random variables $X_{i,s}$ are independent of each other, and we will further assume that for any arm i, the variables $X_{i,s}$ are identically distributed. No other assumptions will be necessary. We will assume that these distributions are generated by an adversary who knows the strategy we are going to employ (but not any random coin tosses we may make). Let $\mu_i = E[X_{i,s}]$ be the expected reward each time arm i is played. Let $i^*$ denote the arm with the highest expected reward, $\mu^*$. Let $\Delta_i = \mu^* - \mu_i$ denote the difference between the expected reward of the optimal arm and that of arm i. A strategy must choose an arm to play during each time step. Let $I_t$ denote the arm chosen at time t, and let $k_{i,t}$ denote the number of times arm i is played during the first t steps.

Ideally, we would like to maximize the total reward obtained over all time horizons T simultaneously. Needless to say, there is no hope of achieving this: the adversary may randomly choose a special arm j, make all $X_{j,s}$ equal to 1 deterministically, and for all $i \neq j$, make all $X_{i,s} = 0$ deterministically. For any strategy, there is a probability of at least half that the strategy will not play arm j in the first N/2 steps, and hence the expected profit of any strategy is at most N/4 over the first N/2 steps, whereas the optimum profit if we knew the distributions would be N/2. Instead, we can set a more achievable goal. Define the regret of a strategy at time T to be the total difference between the optimal reward over the first T steps and the reward of the strategy over the same period. The expected regret is then given by

$$E[\text{Regret}(T)] = T\mu^* - \sum_{t=1}^{T} E[\mu_{I_t}] = \sum_{i: \mu_i < \mu^*} \Delta_i\, E[k_{i,T}].$$
In a classic paper, Lai and Robbins showed that this expected regret can be no less than

$$\left(\sum_{i=1}^{N} \frac{\Delta_i}{D(X_i \,\|\, X^*)}\right) \cdot \log T.$$

Here $D(X_i \,\|\, X^*)$ is the Kullback-Leibler divergence (also known as the information divergence) of the distribution of $X_i$ with respect to the distribution of $X^*$, and is given by $\int f_i \ln(f_i/f^*)$, where the integral is over some suitable underlying measure space. Surprisingly, they showed that this lower bound can be matched for special classes of distributions asymptotically, i.e., as $T \to \infty$. We are going to see an even more surprising result, due to Auer, Cesa-Bianchi, and Fischer, where a very similar bound is achieved simultaneously for all T and for all distributions with support in [0, 1]. We will skip the proof of correctness since that is clearly specified in the paper (with a different notation), but will describe the algorithm, state their main theorem, and see a useful balancing trick.

The algorithm, which they call UCB1, assigns an upper confidence bound to each arm. Let $\bar X_i(t)$ denote the average reward obtained from all the times arm i was played up to and including time t. Let $c_{t,s} = \sqrt{\frac{2\ln t}{s}}$ denote a "confidence interval." Then, at the end of time t, assign the following index, called the upper-confidence index and denoted $U_i(t)$, to arm i:

$$U_i(t) = \bar X_i(t) + c_{t, k_i(t)}.$$

During the first N steps, each arm is played exactly once in some arbitrary order. After that, the algorithm repeatedly plays the arm with the highest upper-confidence index, i.e., at time t + 1 it plays the arm with the highest index $U_i(t)$. Ties are broken arbitrarily. This strikingly simple rule leads to the following powerful theorem (proof scribed by Pranav Dandekar):

Theorem 4.1 The expected number of times arm i is played up to time T, $E[k_i(T)]$, is at most $\frac{8\ln T}{\Delta_i^2} + \delta$, where δ is some fixed constant independent of the distributions, T, or the number of arms.
Proof: In the first N steps, the algorithm will play each arm once. Therefore, we have

$$\forall i,\quad k_i(T) = 1 + \sum_{t=N+1}^{T} \mathbf{1}\{I_t = i\},$$

where $\mathbf{1}\{I_t = i\}$ is an indicator variable which takes value 1 if arm i is played at time t and 0 otherwise. For any integer $l \ge 1$, we can similarly write

$$\forall i,\quad k_i(T) \le l + \sum_{t=N+1}^{T} \mathbf{1}\{I_t = i \,\wedge\, k_i(t-1) \ge l\}.$$

Let $i^* = \arg\max_i \mu_i$, and write $\mu^* = \mu_{i^*}$, $U^*(t) = U_{i^*}(t)$, $\bar X^*(t) = \bar X_{i^*}(t)$, and $k^*(t) = k_{i^*}(t)$.

If arm i was played at time t, this implies its index, $U_i(t-1)$, was at least the index of the arm with the highest mean, $U^*(t-1)$ (note that this is a necessary but not sufficient condition). Therefore we have

$$\forall i,\quad k_i(T) \le l + \sum_{t=N+1}^{T} \mathbf{1}\{U_i(t-1) \ge U^*(t-1) \,\wedge\, k_i(t-1) \ge l\}.$$
$$\forall i,\quad k_i(T) \le l + \sum_{t=N+1}^{T} \mathbf{1}\Big\{\max_{l \le s_i < t} U_i(s_i) \ge \min_{0 < s < t} U^*(s) \,\wedge\, k_i(t-1) \ge l\Big\}.$$

Instead of taking the max over $l \le s_i < t$ and the min over $0 < s < t$, we sum over all occurrences where $U_i(s_i) \ge U^*(s)$:

$$\forall i,\quad k_i(T) \le l + \sum_{t=1}^{T} \sum_{s_i=l}^{t} \sum_{s=1}^{t} \mathbf{1}\{U_i(s_i) \ge U^*(s)\} = l + \sum_{t=1}^{T} \sum_{s_i=l}^{t} \sum_{s=1}^{t} \mathbf{1}\{\bar X_i(s_i) + c_{t,s_i} \ge \bar X^*(s) + c_{t,s}\}. \qquad (4.1)$$
Observe that $\bar X_i(s_i) + c_{t,s_i} \ge \bar X^*(s) + c_{t,s}$ implies that at least one of the following must hold:

$$\bar X^*(s) \le \mu^* - c_{t,s} \qquad (4.2)$$

$$\bar X_i(s_i) \ge \mu_i + c_{t,s_i} \qquad (4.3)$$

$$\mu^* < \mu_i + 2c_{t,s_i} \qquad (4.4)$$

We choose l such that the last condition is false: since $c_{t,s_i} \le c_{T,l} = \sqrt{\frac{2\ln T}{l}}$ whenever $s_i \ge l$ and $t \le T$, setting $l \ge \frac{8\ln T}{\Delta_i^2}$ gives $2c_{t,s_i} \le \Delta_i$, so (4.4) fails. Ignoring the $k_i(t-1) \ge l$ condition, we have

$$E[k_i(T)] \le \left\lceil \frac{8\ln T}{\Delta_i^2} \right\rceil + \sum_{t=1}^{\infty} \sum_{s_i=1}^{t} \Pr[\bar X_i(s_i) \ge \mu_i + c_{t,s_i}] + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \Pr[\bar X^*(s) \le \mu^* - c_{t,s}].$$
To bound the probabilities in the above expression, we make use of the Chernoff-Hoeffding bound:

Fact 4.2 (Chernoff-Hoeffding Bound) Given a sequence of t i.i.d. random variables $z_1, z_2, \ldots, z_t$ such that $z_i \in [0, 1]$ for all i, let $S = \sum_{i=1}^{t} z_i$ and $\mu = E[z_i]$. Then for all $a \ge 0$,

$$\Pr\left[\frac{S}{t} > \mu + a\right] \le e^{-2ta^2} \qquad\text{and}\qquad \Pr\left[\frac{S}{t} < \mu - a\right] \le e^{-2ta^2}.$$
Using the Chernoff-Hoeffding bound, we get

$$\Pr[\bar X_i(s_i) \ge \mu_i + c_{t,s_i}] \le e^{-2 s_i c_{t,s_i}^2} = e^{-4\ln t} = \frac{1}{t^4}.$$

Similarly,

$$\Pr[\bar X^*(s) \le \mu^* - c_{t,s}] \le e^{-2 s c_{t,s}^2} = e^{-4\ln t} = \frac{1}{t^4}.$$
Substituting these bounds on the probabilities, we get

$$E[k_i(T)] \le \frac{8\ln T}{\Delta_i^2} + \sum_{t=1}^{\infty} \sum_{s_i=1}^{t} \frac{1}{t^4} + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \frac{1}{t^4} = \frac{8\ln T}{\Delta_i^2} + 2\sum_{t=1}^{\infty} \frac{1}{t^3} = \frac{8\ln T}{\Delta_i^2} + \delta,$$

where δ is a constant. Plugging into the expression for the regret, we get an expected regret of $O\big((\ln T) \sum_{i: \mu_i < \mu^*} (1/\Delta_i)\big)$, which is very close (both qualitatively and quantitatively) to the lower bound of Lai and Robbins.
Exercise 4.1 Prove that if the discount factor θ is larger than $1 - 1/N^2$, then the algorithm UCB1 results in a near-optimal solution to the discounted infinite-horizon problem.
Exercise 4.2 Prove that the algorithm UCB1 plays every arm infinitely often.

The above algorithm appears to have very high regret when the means are all very close together, i.e., when the $\Delta_i$'s are very small. That doesn't seem right, since intuitively the algorithm should do well in that setting. The key to understanding this case is a balancing argument. Let us ignore the constant δ in the above theorem, and divide the arms into two classes – those with $\Delta_i \ge \sqrt{N\ln T/T}$ and those with $\Delta_i < \sqrt{N\ln T/T}$. Let

$$A^+ = \left\{i : \Delta_i \ge \sqrt{\frac{N\ln T}{T}}\right\}, \qquad A^- = \left\{i : \Delta_i < \sqrt{\frac{N\ln T}{T}}\right\}.$$

Then the total regret of the algorithm is given by

$$E[\text{Regret}(T)] = \sum_{i \in A^+} E[\text{Regret}_i(T)] + \sum_{i \in A^-} E[\text{Regret}_i(T)]$$
$$\le \sum_{i \in A^+} \left(\frac{8\ln T}{\Delta_i} + O(\Delta_i)\right) + \sum_{i \in A^-} \Delta_i\, E[k_i(T)]$$
$$\le \sum_{i \in A^+} \left(8\sqrt{\frac{T\ln T}{N}} + O(\Delta_i)\right) + \sqrt{\frac{N\ln T}{T}} \sum_{i \in A^-} E[k_i(T)]$$
$$\le 8\sqrt{NT\ln T} + O(N) + \sqrt{\frac{N\ln T}{T}} \cdot T = O(\sqrt{NT\ln T}).$$

Thus, we have the corollary:

Corollary 4.3 The algorithm UCB1 has expected regret $O(\sqrt{NT\ln T})$ for all T. Notice that the algorithm itself does not depend on the choice of the balancing factor – only the analysis does.
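A compact implementation sketch of UCB1 as described above (the Bernoulli test arms at the end are a hypothetical example):

```python
import math
import random

def ucb1(arms, T):
    """Play T rounds of UCB1. `arms` is a list of zero-argument functions,
    each returning a reward in [0, 1]. Returns the total reward collected."""
    N = len(arms)
    counts = [0] * N    # k_i(t): number of plays of arm i so far
    sums = [0.0] * N    # total reward collected from arm i
    total = 0.0
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1   # play each arm once during the first N steps
        else:
            # upper-confidence index: average reward + sqrt(2 ln t / k_i)
            i = max(range(N), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        x = arms[i]()
        counts[i] += 1
        sums[i] += x
        total += x
    return total

# Two Bernoulli arms with means 0.4 and 0.5.
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.5)]
print(ucb1(arms, 10000))
```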
Chapter 5
Minimizing Regret in the Partial Information Model

Scribed by Michael Kapralov. Based primarily on the paper: The Nonstochastic Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. SIAM J. on Computing, 32(1), 48-77.

We use $x_i(t) \in [0, 1]$ to denote the reward obtained from playing arm i at time t, but now we assume that the values $x_i(t)$ are determined by an adversary and need not come from a fixed distribution. We assume that the adversary chooses a value for $x_i(t)$ at the beginning of time t. The algorithm must then choose an arm $I_t$ to play, possibly using some random coin tosses, and receives profit $x_{I_t}(t)$. It is easy to see that any deterministic strategy can be forced to obtain revenue 0, so any solution with regret bounds needs to be randomized. It is important that the adversary does not see the outcomes of the random coin tosses made by the algorithm. This is called the partial information model, since only the profit from the chosen arm is revealed. We will compare the profit obtained by an algorithm to the profit obtained by the best arm in hindsight.

Our algorithm maintains weights $w_i(t) \ge 0$, where t is the timestep. We set $w_i(0) := 1$ for all i, and denote $W(t) := \sum_{i=1}^{N} w_i(t)$. At time t, arm i is chosen with probability

$$p_i(t) = (1-\gamma)\frac{w_i(t)}{W(t)} + \frac{\gamma}{N},$$

where γ > 0 is a parameter that will be assigned a value later. We denote the index of the arm played at time t by $I_t$. We define the random variable $\hat x_i(t)$ as

$$\hat x_i(t) = \begin{cases} x_i(t)/p_i(t) & \text{if arm } i \text{ was chosen at time } t, \\ 0 & \text{otherwise.} \end{cases}$$

Note that $E[\hat x_i(t)] = x_i(t)$. We can now define the update rule for the weights $w_i(t)$:

$$w_i(t+1) = w_i(t)\exp\left(\frac{\gamma}{N}\hat x_i(t)\right).$$

The regret after T steps is

$$\text{Regret}[T] = \max_i \sum_{t=1}^{T} E[x_i(t)] - \sum_{t=1}^{T} E[x_{I_t}(t)].$$
Note that the uniform exploration term γ/N by itself contributes a regret of roughly γT, i.e., linear in T; but we will be able to tune γ to a fixed value of T to obtain sublinear regret. In order to handle unbounded T, we can play using a fixed γ for some time and then adjust γ. We will use the following facts, which follow from the definition of $\hat x_i(t)$:

$$\hat x_i(t) = \frac{x_i(t)}{p_i(t)} \le \frac{N}{\gamma} \qquad (5.1)$$

$$\sum_{i=1}^{N} p_i(t)\, \hat x_i(t) = x_{I_t}(t) \qquad (5.2)$$

$$\sum_{i=1}^{N} p_i(t)\, (\hat x_i(t))^2 = \sum_{i=1}^{N} \hat x_i(t)\, x_i(t) \le \sum_{i=1}^{N} \hat x_i(t) \qquad (5.3)$$
We have

$$\frac{W(t+1)}{W(t)} = \sum_{i=1}^{N} \frac{w_i(t+1)}{W(t)} = \sum_{i=1}^{N} \frac{w_i(t)}{W(t)} \exp\left(\frac{\gamma}{N}\hat x_i(t)\right) \le \sum_{i=1}^{N} \frac{w_i(t)}{W(t)}\left[1 + \frac{\gamma}{N}\hat x_i(t) + \frac{\gamma^2}{N^2}(\hat x_i(t))^2\right] = 1 + \sum_{i=1}^{N} \frac{w_i(t)}{W(t)}\left[\frac{\gamma}{N}\hat x_i(t) + \frac{\gamma^2}{N^2}(\hat x_i(t))^2\right].$$

We applied the inequality $e^x \le 1 + x + x^2$ for $x \in [0, 1]$ to $\exp(\frac{\gamma}{N}\hat x_i(t))$ (justified by (5.1)), and used the definition of W(t) to substitute 1 for the first sum. Since $p_i(t) \ge (1-\gamma)\frac{w_i(t)}{W(t)}$, we have $\frac{w_i(t)}{W(t)} \le \frac{p_i(t)}{1-\gamma}$. Using this estimate together with (5.2) and (5.3), we get

$$\frac{W(t+1)}{W(t)} \le 1 + \sum_{i=1}^{N} \frac{p_i(t)}{1-\gamma}\left[\frac{\gamma}{N}\hat x_i(t) + \frac{\gamma^2}{N^2}(\hat x_i(t))^2\right] \le 1 + \frac{\gamma}{N(1-\gamma)}\, x_{I_t}(t) + \frac{\gamma^2}{N^2(1-\gamma)} \sum_{i=1}^{N} \hat x_i(t).$$
We now take logarithms of both sides and sum over t from 1 to T. The left-hand side telescopes, and after applying the inequality $\log(1+x) \le x$ to the right-hand side we get

$$\log \frac{W(T)}{W(0)} \le \frac{\gamma}{N(1-\gamma)} \sum_{t=1}^{T} x_{I_t}(t) + \frac{\gamma^2}{N^2(1-\gamma)} \sum_{t=1}^{T} \sum_{j=1}^{N} \hat x_j(t).$$

We denote the reward obtained by the algorithm by $G = \sum_{t=1}^{T} x_{I_t}(t)$ and the optimal reward by $G^* = \max_i \sum_{t=1}^{T} x_i(t)$. Using the fact that $E[G^*] \ge E[\sum_{t=1}^{T} \hat x_i(t)]$ for all i, and that $W(0) = N$, we get

$$E\left[\log \frac{W(T)}{N}\right] \le \frac{\gamma}{N(1-\gamma)}\, E[G] + \frac{\gamma^2}{N(1-\gamma)}\, E[G^*]. \qquad (5.4)$$
On the other hand, since $w_j(T) = \exp\left(\frac{\gamma}{N}\sum_{t=1}^{T} \hat x_j(t)\right)$ and $W(T) \ge w_j(T)$, we have

$$\log \frac{W(T)}{N} \ge \frac{\gamma}{N} \sum_{t=1}^{T} \hat x_j(t) - \log N.$$

Using the fact that $E[\hat x_j(t)] = x_j(t)$ and setting $j = \arg\max_{1 \le j \le N} \sum_{t=1}^{T} x_j(t)$, we get

$$E\left[\log \frac{W(T)}{N}\right] \ge \frac{\gamma}{N}\, E[G^*] - \log N. \qquad (5.5)$$
Putting (5.4) and (5.5) together, we get

$$\frac{\gamma}{N} E[G^*] - \log N \le \frac{\gamma}{N(1-\gamma)} E[G] + \frac{\gamma^2}{N(1-\gamma)} E[G^*]. \qquad (5.6)$$

This implies that

$$E[G] \ge (1-\gamma)E[G^*] - \gamma E[G^*] - \frac{(1-\gamma)N\log N}{\gamma}, \qquad (5.7)$$

i.e.,

$$E[G^*] - E[G] \le 2\gamma E[G^*] + \frac{N\log N}{\gamma}. \qquad (5.8)$$

To balance the first two terms, we set $\gamma := \sqrt{\frac{N\log N}{2E[G^*]}}$, getting

$$E[G^*] - E[G] \le 2\sqrt{2N\log N\, E[G^*]}. \qquad (5.9)$$

Since $G^* \le T$, we also have

$$E[G^*] - E[G] \le 2\gamma T + \frac{N\log N}{\gamma}, \qquad (5.10)$$

and setting $\gamma := \sqrt{\frac{N\log N}{2T}}$ yields

$$E[G^*] - E[G] \le 2\sqrt{2NT\log N}. \qquad (5.11)$$
Exercise 5.1 This is the second straight algorithm we have seen that has a regret that depends on $\sqrt{T}$. Unlike the previous algorithm, this one requires knowledge of T. Present a technique that converts an algorithm which achieves a regret of $O(f(N)\sqrt{T})$ for any given T into one that achieves a regret of $O(f(N)\sqrt{T})$ for all T.

Exercise 5.2 Imagine now that there are M distinct types of customers. During each time step, you are told which type of customer you are dealing with. You must show the customer one product (equivalent to playing an arm), which the customer will either purchase or discard. If the customer purchases the product, then you make some amount between 0 and 1. The regret is computed relative to the best product-choice for each customer type, in hindsight. Present an algorithm that achieves regret $O(\sqrt{MNT\log N})$ against an adversary. Prove your result. What lower bound can you deduce from material that has been covered in class or pointed out in the reading list?
Exercise 5.3 Designed by Bahman Bahmani. Assume a seller with an unlimited supply of a good is sequentially selling copies of the good to n buyers, each of whom is interested in at most one copy of the good and has a private valuation for the good which is a number in [0, 1]. At each instance, the seller offers a price to the current buyer, and the buyer will buy the good if the offered price is less than or equal to his private valuation. Assume the buyers' valuations are i.i.d. samples from a fixed but unknown (to the seller) distribution with cdf $F(x) = \Pr(\text{valuation} \le x)$. Define $D(x) = 1 - F(x)$ and $f(x) = xD(x)$. In this problem, we will prove that if f(x) has a unique global maximum $x^*$ in (0, 1) and $f''(x^*) < 0$, then the seller has a pricing strategy that achieves $O(\sqrt{n\log n})$ regret compared to the adversary who knows the exact value of each buyer's valuation but is restricted to offer a single price to all the buyers. To do this, assume the seller restricts herself to offering one of the prices in $\{1/K, 2/K, \ldots, (K-1)/K, 1\}$. Define $\mu_i$ as the expected reward of offering price i/K, $\mu^* = \max\{\mu_1, \ldots, \mu_K\}$, and $\Delta_i = \mu^* - \mu_i$.

a) Prove that there exist constants $C_1, C_2$ such that for all $x \in [0, 1]$: $C_1(x^* - x)^2 < f(x^*) - f(x) < C_2(x^* - x)^2$.

b) Prove that $\Delta_i \ge C_1(x^* - i/K)^2$ for all i. Also, prove that the j-th smallest value among the $\Delta_i$'s is at least as large as $C_1 \cdot \left(\frac{j-1}{2K}\right)^2$.

c) Prove that $\mu^* > f(x^*) - C_2/K^2$.

d) Prove that using a good choice of K and some of the results on MAB discussed in class, the seller can achieve $O(\sqrt{n\log n})$ regret.
Chapter 6
The Full Information Model, along with Linear Generalizations

This chapter is based on the paper Efficient algorithms for online decision problems by A. Kalai and S. Vempala, http://people.cs.uchicago.edu/~kalai/papers/onlineopt/onlineopt.pdf.

Exercise 6.1 Read the algorithm by Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, http://www.cs.ualberta.ca/~maz/publications/ICML03.pdf. How would you apply his technique and proof in a black-box fashion to the simplest linear generalization of the multi-armed bandit problem in the full information model (e.g., as described in the paper by Kalai and Vempala)? Note that Zinkevich requires a convex decision space, whereas Kalai and Vempala assume that the decision space is arbitrary, possibly just a set of points.

Exercise 6.2 Modify the analysis of UCB1 to show that in the full information model, playing the arm with the best average return so far has statistical regret $O(\sqrt{T(\log T + \log N)})$ (i.e., regret against a distribution).