Lecture 9

Exploration & Bandits

The exploration-exploitation dilemma formalized through multi-armed bandits, regret analysis, and epsilon-greedy and UCB strategies.


Why Exploration Matters

Throughout this course we have studied four fundamental challenges in reinforcement learning: generalization, optimization, delayed outcomes, and exploration. The first three have dominated our attention so far—we have learned to evaluate and improve policies using dynamic programming, Monte Carlo methods, TD learning, and function approximation. But all of those algorithms assumed, implicitly or explicitly, that the agent would somehow visit all relevant parts of the state-action space often enough to learn. This lecture confronts the fourth challenge: how should an agent balance exploiting what it currently believes is the best action against exploring alternatives that might be better?

This is the exploration-exploitation dilemma, and it arises whenever an agent must learn and act at the same time. A doctor prescribing treatments must weigh a drug with a known track record against a newer drug that might be superior. A recommendation system must decide whether to show a user content similar to what they have clicked before or to test novel content that might reveal new preferences. Exploring too little risks missing the best option forever, while exploring too much wastes time on clearly inferior choices.

Insight — Why Exploration Cannot Be Ignored

Without deliberate exploration, an RL agent can lock onto a suboptimal action permanently. It may never try a better action—or try it too few times to recognize its superiority through the noise of stochastic rewards. Exploration is not a luxury; it is a prerequisite for finding optimal behavior.

To study exploration in its purest form, we strip away sequential decision-making—states, transitions, delayed rewards—and focus on the simplest possible setting: the multi-armed bandit. Bandits isolate the exploration-exploitation tradeoff in a single-step decision problem, where precise theoretical guarantees are within reach. The ideas we develop here—regret, confidence bounds, optimism—extend directly to full MDPs in the next lecture.

The Multi-Armed Bandit Problem

Definition — Multi-Armed Bandit (MAB)

A multi-armed bandit is defined by a tuple $(\mathcal{A}, \mathcal{R})$ where:

  • $\mathcal{A}$ is a known set of $K$ actions (also called arms).
  • $\mathcal{R}_a(r) = P(r \mid a)$ is an unknown probability distribution over rewards for each arm $a$.

At each time step $t$, the agent selects an action $a_t \in \mathcal{A}$ and the environment generates a reward $r_t \sim \mathcal{R}_{a_t}$. The agent's goal is to maximize its cumulative reward $\sum_{\tau=1}^{T} r_\tau$ over $T$ rounds.

The name "multi-armed bandit" comes from the image of a gambler facing a row of slot machines (one-armed bandits), each with an unknown payout distribution. The gambler must decide which machines to play to maximize total winnings—learning the payouts through experience while trying to accumulate as much reward as possible.

We write $Q(a) = \mathbb{E}[r \mid a]$ for the true mean reward of arm $a$. The optimal arm is $a^* = \arg\max_{a \in \mathcal{A}} Q(a)$, and the optimal expected reward per round is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$.

Example — Treating Broken Toes

Consider a doctor deciding how to treat patients presenting with broken toes. There are three treatment options (arms): (1) surgery, (2) buddy-taping the broken toe to an adjacent toe, and (3) doing nothing. The outcome is binary: the toe has healed after six weeks ($r = +1$) or has not ($r = 0$), as assessed by X-ray.

Each treatment option $a_i$ is a Bernoulli random variable with unknown parameter $\theta_i$, so $Q(a_i) = \theta_i$. Each patient is one round: the doctor selects a treatment, observes the outcome, and updates their beliefs. A bandit is the right model because each patient's treatment is a single decision—no states or transitions to worry about.

Suppose the true (unknown) parameters are $\theta_1 = 0.95$ (surgery), $\theta_2 = 0.90$ (buddy-taping), and $\theta_3 = 0.10$ (doing nothing). Surgery is optimal, but the doctor does not know this initially. The challenge is to identify the best treatment quickly while minimizing the number of patients who receive suboptimal care.

The Greedy Algorithm and Its Failure

The simplest idea: estimate the value of each arm and always pick the one with the highest estimate. We maintain a running estimate $\hat{Q}_t(a)$ for each arm via Monte Carlo averaging:

$$\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t} r_i \cdot \mathbf{1}(a_i = a)$$

where $N_t(a)$ is the number of times arm $a$ has been pulled up to time $t$. The greedy algorithm then selects:

$$a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_{t-1}(a)$$

The problem with greedy is that it can lock onto a suboptimal arm forever. Consider the broken-toe example: suppose we try each arm once and observe surgery yields $r = 0$ (which happens with probability 0.05), buddy-taping yields $r = 1$, and doing nothing yields $r = 0$. Now $\hat{Q}(a_1) = 0$, $\hat{Q}(a_2) = 1$, $\hat{Q}(a_3) = 0$. Greedy selects buddy-taping for every subsequent patient and never tries surgery again, even though surgery is the best treatment. One unlucky early outcome caused the algorithm to write it off permanently.
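As a minimal sketch (the Bernoulli arm parameters and the function name are illustrative, not from the lecture), the greedy rule with incremental Monte Carlo averaging might look like:

```python
import random

def greedy_bandit(thetas, T, seed=0):
    """Pure greedy on Bernoulli arms: always pull the arm with the
    highest running estimate. Returns pull counts, showing lock-on."""
    rng = random.Random(seed)
    K = len(thetas)
    Q, N = [0.0] * K, [0] * K
    for _ in range(T):
        best = max(Q)
        a = rng.choice([i for i in range(K) if Q[i] == best])  # ties at random
        r = 1 if rng.random() < thetas[a] else 0               # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                              # incremental mean
    return N

counts = greedy_bandit([0.95, 0.90, 0.10], T=1000)
```

Running this on the broken-toe parameters typically shows one arm absorbing nearly all pulls after a few rounds, regardless of whether it is actually the best arm.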

Insight — Greedy Locks On

Because greedy never explores, a single unlucky early outcome can permanently steer the agent away from the optimal arm. This is the most basic failure mode of pure exploitation.

Regret: Measuring the Cost of Learning

To evaluate bandit algorithms rigorously, we need a formal performance measure. Raw cumulative reward is hard to interpret without knowing what is achievable, so we compare the agent to an oracle that always pulls the optimal arm.

Definition — Regret

The per-step regret (or instantaneous regret) at time $t$ is the expected gap between the optimal arm's reward and the reward of the arm actually selected:

$$l_t = \mathbb{E}\!\left[V^* - Q(a_t)\right]$$

The total regret (or cumulative regret) over $T$ rounds is:

$$L_T = \mathbb{E}\!\left[\sum_{t=1}^{T} \big(V^* - Q(a_t)\big)\right] = T \cdot V^* - \sum_{t=1}^{T} \mathbb{E}\!\left[Q(a_t)\right]$$

Maximizing cumulative reward is equivalent to minimizing total regret.

We can rewrite regret in terms of the gap $\Delta_a = V^* - Q(a)$ for each arm—the difference between the optimal mean reward and the mean reward of arm $a$—and the expected number of times each arm is pulled:

$$L_T = \sum_{a \in \mathcal{A}} \mathbb{E}[N_T(a)] \cdot \Delta_a$$

A good algorithm must ensure that arms with large gaps (highly suboptimal arms) are pulled infrequently. But the gaps are unknown—if we knew them, we would know the optimal arm and there would be nothing to learn. The art of bandit algorithm design is keeping $N_T(a)$ small for large-gap arms without knowing which arms have large gaps.
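This decomposition is easy to check numerically. A small sketch (the pull counts below are hypothetical, chosen to mirror the greedy lock-on scenario):

```python
def total_regret(thetas, counts):
    """Regret via the gap decomposition: L_T = sum_a N_T(a) * Delta_a."""
    v_star = max(thetas)
    return sum(n * (v_star - q) for n, q in zip(counts, thetas))

# Hypothetical counts: greedy tried each arm once, then locked onto
# buddy-taping for the remaining rounds of T = 10000.
L = total_regret([0.95, 0.90, 0.10], [1, 9998, 1])   # ~ 0.05 * 10000
```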

Example — Regret of Greedy on Broken Toes

In the broken-toe scenario where greedy locks onto buddy-taping ($Q(a_2) = 0.90$), the per-step regret is $\Delta_{a_2} = 0.95 - 0.90 = 0.05$, so total regret grows as $0.05 \cdot T$—linear in $T$. If greedy instead locked onto doing nothing ($Q(a_3) = 0.10$), per-step regret would be $0.85$ and total regret $0.85 \cdot T$. Either way, regret grows without bound, proportional to the number of decisions.

In practice we cannot compute regret directly because it requires knowledge of $V^*$. Instead, we prove upper bounds on an algorithm's regret that hold for any bandit problem.

Types of Regret Bounds

Two flavors of regret bounds appear throughout the bandit literature. Problem-dependent bounds are stated in terms of the gaps $\Delta_a$ of the particular instance and are typically logarithmic in $T$. Problem-independent bounds hold uniformly over all instances, depend only on $K$ and $T$, and are typically of order $\sqrt{KT}$ up to logarithmic factors. We will see both kinds for UCB later in this lecture.

A celebrated lower bound due to Lai and Robbins (1985) shows that no algorithm can achieve total regret growing slower than logarithmically in $T$:

Theorem — Lai-Robbins Lower Bound

For any uniformly good algorithm (one whose regret grows subpolynomially on every instance) and any bandit problem, the asymptotic total regret satisfies:

$$\liminf_{T \to \infty} \frac{L_T}{\log T} \geq \sum_{a : \Delta_a > 0} \frac{\Delta_a}{D_{\mathrm{KL}}(\mathcal{R}_a \| \mathcal{R}_{a^*})}$$

where $D_{\mathrm{KL}}(\mathcal{R}_a \| \mathcal{R}_{a^*})$ is the Kullback-Leibler divergence between the reward distributions of arm $a$ and the optimal arm $a^*$.

The hardest problems are those where suboptimal arms have reward distributions very similar to the optimal arm (small KL divergence), making them hard to distinguish. The good news: the lower bound is sublinear—specifically logarithmic—so algorithms that learn efficiently do exist.

Epsilon-Greedy: Simple Exploration

The simplest fix: force exploration by occasionally selecting a random arm. $\epsilon$-greedy does exactly this—with probability $1 - \epsilon$ it picks the arm with the highest estimated value (exploitation), and with probability $\epsilon$ it picks an arm uniformly at random (exploration).

Algorithm: $\epsilon$-Greedy Bandit
  1. Initialize $\hat{Q}(a) = 0$ and $N(a) = 0$ for all $a \in \mathcal{A}$.
  2. For $t = 1, 2, \ldots, T$:
    1. With probability $\epsilon$: select $a_t$ uniformly at random from $\mathcal{A}$.
    2. With probability $1 - \epsilon$: select $a_t = \arg\max_{a} \hat{Q}(a)$ (break ties randomly).
    3. Observe reward $r_t$.
    4. Update: $N(a_t) \leftarrow N(a_t) + 1$ and $\hat{Q}(a_t) \leftarrow \hat{Q}(a_t) + \frac{1}{N(a_t)}\big(r_t - \hat{Q}(a_t)\big)$.

$\epsilon$-greedy ensures every arm is tried infinitely often (so the agent eventually learns the true values), and it is trivial to implement. But fixed $\epsilon$-greedy has a significant drawback: it explores at the same rate forever, even after the optimal arm has been identified with high confidence.
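The algorithm above can be sketched as follows (arm parameters and function name are illustrative):

```python
import random

def epsilon_greedy(thetas, T, eps=0.1, seed=0):
    """Epsilon-greedy on Bernoulli arms: explore uniformly with prob. eps,
    otherwise exploit the highest current estimate (ties broken randomly)."""
    rng = random.Random(seed)
    K = len(thetas)
    Q, N = [0.0] * K, [0] * K
    total_reward = 0
    for _ in range(T):
        if rng.random() < eps:
            a = rng.randrange(K)                                   # explore
        else:
            best = max(Q)
            a = rng.choice([i for i in range(K) if Q[i] == best])  # exploit
        r = 1 if rng.random() < thetas[a] else 0
        total_reward += r
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                                  # running mean
    return Q, N, total_reward

Q, N, total_reward = epsilon_greedy([0.95, 0.90, 0.10], T=2000)
```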

In the broken-toe example with $\epsilon = 0.1$, even after 10,000 patients the algorithm still picks a random arm 10% of the time. "Doing nothing" ($Q(a_3) = 0.10$) gets selected about $0.1 \times 1/3 \approx 3.3\%$ of the time—far more than necessary—each pull incurring a gap of $\Delta_{a_3} = 0.85$.

Theorem — Linear Regret of Fixed $\epsilon$-Greedy

For fixed $\epsilon > 0$, $\epsilon$-greedy has linear total regret. Even after identifying the optimal arm, it continues selecting suboptimal arms a constant fraction $\epsilon$ of the time:

$$L_T = \Omega(\epsilon \cdot T)$$

Informally, an algorithm has linear regret whenever it selects a non-optimal action a constant fraction of the time. Both $\epsilon$-greedy with $\epsilon > 0$ and pure greedy ($\epsilon = 0$) can exhibit linear regret—the former because it explores too much, the latter because it explores too little.

One mitigation is decaying $\epsilon$-greedy, where $\epsilon_t$ decreases over time (e.g., $\epsilon_t = 1/t$). As the agent gathers more data, it explores less. With the right decay schedule, this achieves sublinear regret—but the optimal schedule typically depends on unknown problem-specific quantities (like the gaps $\Delta_a$), making it hard to tune in practice.
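A decay schedule of this form is one line of code; the constant $c$ below is an illustrative choice, not a recommendation from the lecture:

```python
def epsilon_schedule(t, c=5.0):
    """Decaying exploration rate: eps_t = min(1, c / t)."""
    return min(1.0, c / t)

# exploration rate shrinks as experience accumulates
rates = [epsilon_schedule(t) for t in (1, 10, 100, 1000)]
```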

Optimism in the Face of Uncertainty

Can we do better than brute-force random exploration? Yes—using one of the most elegant principles in decision-making under uncertainty: optimism in the face of uncertainty.

Insight — The Optimism Principle

When uncertain about the value of an action, assume it is as good as it could plausibly be, then act greedily on these optimistic estimates. This drives exploration automatically: well-sampled arms have tight confidence intervals and optimistic estimates close to their true value, while under-sampled arms have wide intervals and receive a large optimistic bonus. The algorithm naturally gravitates toward under-explored arms.

Only two things can happen when you pull an optimistically-valued arm: either it really does have a high mean reward (and you collect high reward), or it does not (and the observed low reward tightens the confidence interval, so the arm gets selected less often in the future). Either way, you win.

To implement optimism, we need to quantify uncertainty about each arm's value. This is where concentration inequalities come in.

Confidence Bounds via Concentration Inequalities

Suppose we have pulled arm $a$ a total of $N_t(a)$ times and observed rewards $r_1, r_2, \ldots, r_{N_t(a)}$. The sample mean $\hat{\mu}_a = \hat{Q}_t(a)$ is our point estimate of the true mean $\mu_a = Q(a)$. How far can $\hat{\mu}_a$ be from $\mu_a$?

Theorem — Concentration for Sub-Gaussian Random Variables

(Corollary 5.5, Lattimore and Szepesvari, Bandit Algorithms.) Assume that $X_1, X_2, \ldots, X_n$ are independent random variables with mean $\mu$ and that $X_i - \mu$ is $\sigma$-sub-Gaussian. Let $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for any $\varepsilon \geq 0$:

$$P(\hat{\mu} \geq \mu + \varepsilon) \leq \exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2}\right), \qquad P(\hat{\mu} \leq \mu - \varepsilon) \leq \exp\!\left(-\frac{n\varepsilon^2}{2\sigma^2}\right)$$

This is a generalization of Hoeffding's inequality to sub-Gaussian variables. Bounded random variables (e.g., rewards in $[0, 1]$) are $\frac{1}{2}$-sub-Gaussian ($\sigma^2 = \frac{1}{4}$), and for convenience we often assume $\sigma = 1$ (1-sub-Gaussian).
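The tail bound can be sanity-checked by simulation. Bernoulli(0.5) rewards are $\frac{1}{2}$-sub-Gaussian, so with $n = 100$ samples and $\varepsilon = 0.1$ the bound gives $e^{-2} \approx 0.135$; the empirical tail frequency should come in well under that (all parameters here are illustrative):

```python
import math
import random

def upper_tail_freq(n, eps, trials=10000, seed=0):
    """Empirical frequency of the sample mean of n fair coin flips
    exceeding its true mean 0.5 by at least eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if mean >= 0.5 + eps:
            hits += 1
    return hits / trials

n, eps, sigma2 = 100, 0.1, 0.25                  # Bernoulli is 1/2-sub-Gaussian
bound = math.exp(-n * eps ** 2 / (2 * sigma2))   # ~ 0.135
empirical = upper_tail_freq(n, eps)
```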

Rearranging the concentration inequality, we can construct a confidence interval for $\mu_a$. For any confidence level $\delta \in (0, 1]$, with probability at least $1 - \delta$:

$$\mu_a \leq \hat{\mu}_a + \sqrt{\frac{2\sigma^2 \log(1/\delta)}{N_t(a)}}$$

The term $\sqrt{\frac{2\sigma^2 \log(1/\delta)}{N_t(a)}}$ is an upper confidence bound (UCB) on the estimation error. It shrinks as $N_t(a)$ grows (more data, tighter bounds) and grows as $\delta$ decreases (higher confidence, wider interval)—exactly the optimistic estimate we need.

Upper Confidence Bound (UCB) Algorithm

UCB puts the optimism principle into practice: at each round, select the arm with the highest upper confidence bound on its mean reward.

Algorithm: UCB1 (Auer, Cesa-Bianchi, and Fischer, 2002)
  1. Initialize: Pull each arm once (rounds $t = 1, \ldots, K$).
  2. For $t = K+1, K+2, \ldots, T$:
    1. Compute the UCB for each arm: $$\mathrm{UCB}_t(a) = \hat{Q}_t(a) + \sqrt{\frac{2\ln t}{N_t(a)}}$$
    2. Select $a_t = \arg\max_{a \in \mathcal{A}}\; \mathrm{UCB}_t(a)$.
    3. Observe reward $r_t$ and update $\hat{Q}_t(a_t)$ and $N_t(a_t)$.

The UCB formula has two terms. $\hat{Q}_t(a)$ is the exploitation component—the current estimated value. $\sqrt{2\ln t / N_t(a)}$ is the exploration bonus—large when $N_t(a)$ is small (the arm is under-explored), shrinking as the arm is pulled more often. The $\ln t$ in the numerator ensures that even well-explored arms receive a slowly-growing bonus, so no arm is abandoned entirely.
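A compact sketch of UCB1 on Bernoulli arms (arm parameters and helper names are illustrative):

```python
import math
import random

def ucb1(thetas, T, seed=0):
    """UCB1: pull each arm once, then repeatedly select the arm
    maximizing Q_hat(a) + sqrt(2 ln t / N(a))."""
    rng = random.Random(seed)
    K = len(thetas)

    def pull(a):
        return 1 if rng.random() < thetas[a] else 0

    Q = [float(pull(a)) for a in range(K)]   # initialization: one pull per arm
    N = [1] * K
    for t in range(K + 1, T + 1):
        scores = [Q[a] + math.sqrt(2 * math.log(t) / N[a]) for a in range(K)]
        a = max(range(K), key=scores.__getitem__)
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental mean update
    return Q, N

Q, N = ucb1([0.95, 0.90, 0.10], T=1000)
```

With the broken-toe parameters, the low-value arm's pull count should stay far below the other two, while no arm is abandoned entirely.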

Example — UCB on the Broken-Toe Problem

Back to the broken-toe scenario ($\theta_1 = 0.95$, $\theta_2 = 0.90$, $\theta_3 = 0.10$). Suppose initialization yields $\hat{Q}(a_1) = 1$, $\hat{Q}(a_2) = 1$, $\hat{Q}(a_3) = 0$ (each arm pulled once).

At $t = 4$ (after the three initialization pulls), the UCB for each arm with $\delta = 1/t$ is:

$$\mathrm{UCB}(a_i) = \hat{Q}(a_i) + \sqrt{\frac{2\ln 4}{1}}$$

Since each arm has $N(a_i) = 1$, the exploration bonus is identical, so UCB picks whichever arm has the highest $\hat{Q}$—either $a_1$ or $a_2$ (tied). Suppose it picks $a_1$ and observes reward 1. Now $N(a_1) = 2$ while $N(a_2) = N(a_3) = 1$, so $a_1$'s bonus is smaller than $a_2$'s, and the algorithm will likely try $a_2$ next. This cycling ensures all arms are explored, but arms with consistently low rewards have their UCB values dragged down and get selected less and less.

Contrast this with greedy (locks onto $a_2$ forever) or $\epsilon$-greedy (wastes 3.3% of pulls on the clearly inferior $a_3$). UCB concentrates pulls on the most promising arms while giving a fair hearing to under-explored alternatives.

Regret Bound for UCB

The key result: UCB achieves logarithmic regret, matching the Lai-Robbins lower bound up to constant factors.

Theorem — UCB Regret Bound

Problem-dependent bound. The UCB1 algorithm with $\delta = 1/t$ achieves total regret:

$$L_T \leq \sum_{a : \Delta_a > 0} \frac{4\log T}{\Delta_a} + \sum_{a} \Delta_a$$

Problem-independent bound. Converting to a bound that does not depend on the gaps:

$$L_T = O\!\left(\sqrt{KT \log T}\right)$$

Both bounds are sublinear in $T$: per-step regret $L_T / T$ vanishes as $T$ grows, so UCB eventually performs almost as well as the oracle.

Proof Sketch

The proof of the UCB regret bound proceeds in three steps. Let $\Delta_i = \mu^* - \mu_i$ denote the gap for arm $i$, and let $N_i(T)$ denote the number of times arm $i$ is pulled over $T$ rounds.

Step 1: Decompose regret in terms of counts. Total regret is $L_T = \sum_{i : \Delta_i > 0} \Delta_i \cdot \mathbb{E}[N_i(T)]$, so we need to bound $\mathbb{E}[N_i(T)]$ for each suboptimal arm.

Step 2: Bound pulls under the "good event." Define the good event $G_i$ as the event that the true mean $\mu_i$ is always below its UCB: $\mu_i \leq \mathrm{UCB}_i(t, \delta)$ for all $t$. Let $U_i = 2\log(1/\delta) / \Delta_i^2$. If $G_i$ holds and arm $i$ has been pulled more than $U_i$ times, then its UCB drops below $\mu^*$, while the optimal arm's UCB stays above $\mu^*$ (by definition of $G_{a^*}$). So arm $i$ cannot be selected—a contradiction. Hence under $G_i$, arm $i$ is pulled at most $U_i$ times.

Step 3: Account for the bad event. Using a union bound over the $T$ rounds and setting $\delta = 1/T^2$, the probability that $G_i$ fails is at most $1/T$. In the worst case when $G_i$ fails, arm $i$ could be pulled up to $T$ times. Combining: $\mathbb{E}[N_i(T)] \leq U_i + P(G_i^c) \cdot T \leq \frac{4\log T}{\Delta_i^2} + 1$. Plugging into the regret decomposition gives the problem-dependent bound. The problem-independent bound follows by applying Cauchy-Schwarz.

Choosing the Confidence Level

A subtle point is the choice of $\delta$. Setting $\delta = 1/t$ (yielding the $\ln t$ term in the exploration bonus) is common and theoretically justified. The reasoning uses a union bound: we need the UCB to hold simultaneously across all time steps and all arms. The probability that any arm at any time has its true mean above its UCB is at most $\sum_t \sum_a \delta = KT\delta$. Setting $\delta = 1/(KT)$ or $\delta = 1/t^2$ makes this failure probability negligible.

What about choosing the arm with the highest lower confidence bound—a "pessimistic" approach? This does not ensure low regret. With two arms of similar means, pessimism keeps selecting whichever arm currently has the higher lower bound and starves the other of samples; if the neglected arm is in fact optimal, the algorithm never collects the evidence needed to discover this, yielding linear regret for the same reason greedy fails.

Looking Ahead: Bayesian Approaches

UCB is a frequentist approach: it makes no assumptions about prior distributions over the arm parameters. The Bayesian alternative maintains a posterior distribution over each arm's mean reward and uses it to guide exploration. The most prominent example is Thompson sampling (probability matching): at each round, sample a reward estimate from the posterior for each arm and select the arm with the highest sample. Thompson sampling achieves Bayesian regret bounds that match or improve upon UCB in many settings, and it has become widely used due to its simplicity and strong empirical performance. We will explore it in the next lecture.

Summary

This lecture introduced the exploration-exploitation dilemma through the multi-armed bandit—the simplest setting that isolates the tension between learning and acting. We defined regret as the formal performance measure: how much cumulative reward is lost compared to an oracle that always selects the optimal arm.

We examined three algorithmic strategies in order of increasing sophistication: pure greedy, which never explores and can lock onto a suboptimal arm, incurring linear regret; $\epsilon$-greedy, which guarantees exploration but with fixed $\epsilon$ keeps exploring forever, again yielding linear regret; and UCB, which uses optimism backed by confidence bounds to achieve logarithmic regret, matching the Lai-Robbins lower bound up to constant factors.

The key ideas—regret as a performance measure, concentration inequalities for confidence bounds, and optimism in the face of uncertainty—are not limited to bandits. Next lecture, we will see how they extend to the full MDP setting, enabling data-efficient RL algorithms that achieve low regret even when the agent must plan over sequences of decisions with delayed rewards.