Lecture 10

Fast RL & Regret Bounds

Formal analysis of UCB regret bounds, lower bounds on bandit regret, and extending exploration strategies to MDPs.


Recap: Multi-Armed Bandits and UCB

Last lecture, we formalized the multi-armed bandit problem and introduced the Upper Confidence Bound (UCB) algorithm. Recall that a bandit is defined by a tuple $(\mathcal{A}, \mathcal{R})$: $\mathcal{A}$ is a known set of $K$ arms and $R_a(r) = P(r \mid a)$ is an unknown reward distribution for each arm. At each step $t$, the agent picks an action $a_t \in \mathcal{A}$, receives a stochastic reward $r_t \sim R_{a_t}$, and aims to maximize cumulative reward over $T$ rounds.

The quality of a bandit algorithm is measured by its total regret—the difference between the reward of always playing the best arm and the reward actually collected:

$$\text{Regret}(T) = \sum_{\tau=1}^{T} \mathbb{E}\!\left[\mu^* - r_\tau\right] = T\mu^* - \sum_{\tau=1}^{T} \mathbb{E}\!\left[r_\tau\right]$$

where $\mu^* = \max_{a \in \mathcal{A}} \mu_a$ is the expected reward of the optimal arm. The UCB1 algorithm addresses this by selecting at each step the arm that maximizes an optimistic estimate of its value:

$$a_t = \arg\max_{a \in \mathcal{A}} \left[\hat{Q}_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}}\right]$$

where $\hat{Q}_t(a)$ is the empirical mean reward and $N_t(a)$ is the number of pulls. The bonus $\sqrt{2 \ln t / N_t(a)}$ is large for rarely-tried arms (encouraging exploration) and shrinks with more pulls (favoring exploitation). In this lecture, we rigorously prove that UCB achieves sublinear regret, establish fundamental lower bounds on what any algorithm can achieve, and extend these ideas from bandits to full MDPs.
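The selection rule above is short enough to implement directly. Below is a minimal sketch on a Bernoulli bandit; the environment and parameter names are illustrative assumptions, not part of the lecture.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """UCB1 on a Bernoulli bandit (a minimal sketch).

    At each step t, pick argmax_a  Q_hat(a) + sqrt(2 ln t / N(a)),
    after an initialization round that pulls every arm once.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k                      # N_t(a): pulls of each arm
    sums = [0.0] * k                      # running reward totals per arm
    for t in range(1, horizon + 1):
        if t <= k:                        # pull each arm once so N(a) > 0
            a = t - 1
        else:
            a = max(range(k),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < arm_means[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        sums[a] += r
    return counts, sums
```

On a two-armed instance with means 0.9 and 0.5, the pull counts concentrate on the first arm, while the suboptimal arm is sampled only on the order of $\ln n$ times.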

Definition — Suboptimality Gap

For each arm $a$, the suboptimality gap (or "gap") is defined as:

$$\Delta_a = \mu^* - \mu_a$$

This is the per-step cost of playing arm $a$ instead of the best arm. For $a^*$, $\Delta_{a^*} = 0$. The gaps are unknown to the learner but drive regret analysis: they determine both how costly each suboptimal pull is and how hard it is to distinguish each arm from the best one.

Decomposing Regret by Arm

The key insight is that total regret decomposes as a sum over suboptimal arms, each weighted by how often the algorithm plays it. Since pulling the optimal arm incurs zero regret, only suboptimal arms contribute:

$$\text{Regret}(T) = \sum_{a:\, \Delta_a > 0} \Delta_a \cdot \mathbb{E}\!\left[N_a(T)\right]$$

where $N_a(T) = \sum_{t=1}^{T} \mathbf{1}(a_t = a)$ counts pulls of arm $a$. This decomposition reveals two factors driving regret: the gap $\Delta_a$ (how bad the arm is) and $\mathbb{E}[N_a(T)]$ (how often the algorithm is fooled into playing it). A good algorithm ensures arms with large gaps are pulled infrequently, while arms with small gaps—hard to distinguish from the optimum—contribute little per-pull regret even if pulled often.
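The decomposition is an identity for pseudo-regret (regret computed with the true means), and can be checked mechanically on any fixed action sequence. A small illustrative helper; the names are assumptions:

```python
def pseudo_regret_two_ways(mu, actions):
    """Check Regret(T) = sum over arms of gap_a * N_a(T) on a fixed
    action sequence. 'mu' holds the true arm means and 'actions' the
    sequence of arms pulled."""
    mu_star = max(mu)
    # Direct form: optimal total minus expected reward actually collected.
    direct = len(actions) * mu_star - sum(mu[a] for a in actions)
    # Decomposed form: gap times pull count, summed over the arms played.
    counts = {a: actions.count(a) for a in set(actions)}
    by_arm = sum((mu_star - mu[a]) * n for a, n in counts.items())
    return direct, by_arm
```

For example, with means [0.9, 0.5, 0.2] and actions [0, 1, 1, 2, 0, 0], both forms give 0.4 · 2 + 0.7 · 1 = 1.5.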

UCB Regret Bound

Theorem — UCB Regret Bound (Theorem 7.1, Lattimore & Szepesvari)

For UCB applied to a stochastic $K$-armed bandit with rewards in $[0,1]$, any horizon $n$, and confidence parameter $\delta = 1/n^2$:

$$\text{Regret}(n) \leq \sum_{i:\, \Delta_i > 0} \left(\frac{16 \ln n}{\Delta_i} + 3\Delta_i\right)$$

In particular, this yields the gap-dependent bound $O\!\left(\sum_{i:\, \Delta_i > 0} \frac{\ln n}{\Delta_i}\right)$ and, by a worst-case argument over the gaps, the gap-independent bound $O\!\left(\sqrt{Kn \ln n}\right)$.

Note — Comparison with Lecture 9

In Lecture 9, we stated the UCB regret bound with a leading constant of 4 rather than 16. Both bounds are correct—the difference arises from the proof technique. The Lecture 9 bound uses a simpler analysis that directly requires the suboptimal arm's UCB to drop below $\mu^*$ after $U_i = 2\ln(1/\delta)/\Delta_i^2$ pulls, yielding a tighter constant. The bound presented here follows Lattimore & Szepesvari's Theorem 7.1, which uses a two-condition good event with a slack parameter $c = 1/2$, requiring the confidence width to shrink to only half the gap. This is more forgiving but increases the required sample count to $u_i = 16\ln n / \Delta_i^2$, producing the larger constant. Both approaches yield $O(\sum_i \ln n / \Delta_i)$ regret—only the leading constant differs.

Two takeaways. First, regret grows only logarithmically in $n$—dramatically better than the linear regret of naive strategies like pure exploration or pure exploitation. Second, the bound depends on $1/\Delta_i$: arms close to optimal are harder to rule out, require more samples, and contribute more regret. We now sketch the proof.

Proof Sketch: Bounding $\mathbb{E}[N_i(n)]$ for Suboptimal Arms

The strategy: bound $\mathbb{E}[N_i(n)]$ for each suboptimal arm $i$, then plug into the regret decomposition. Without loss of generality, assume arm 1 is optimal ($a^* = a_1$).

Proof Sketch

Step 1: Define the "good event." For each suboptimal arm $i$, define the good event $G_i$ as two conditions holding simultaneously:

Condition (*): The UCB for the optimal arm always exceeds its true mean:

$$\mu_1 \leq \min_{t \leq n}\, \text{UCB}_1(t, \delta) \quad \text{where } \text{UCB}_a(t, \delta) = \hat{Q}_t(a) + \sqrt{\frac{2\ln(1/\delta)}{N_t(a)}}$$

Condition (**): For a carefully chosen number of pulls $u_i$, the UCB of arm $i$ falls below the true mean of arm 1:

$$\hat{\mu}_{i,u_i} + \sqrt{\frac{2\ln(1/\delta)}{u_i}} < \mu_1$$

When both hold, the algorithm can distinguish arm $i$ from the optimum after $u_i$ pulls: the optimistic estimate for arm $i$ drops below what arm 1 can achieve.

Step 2: Bound pulls under the good event. When $G_i$ holds, arm $i$ cannot be pulled more than $u_i$ times. Suppose for contradiction that arm $i$ has been pulled $u_i$ times and is selected again at time $t$. By (**), $\text{UCB}_i(t, \delta) < \mu_1$, and by (*), $\mu_1 \leq \text{UCB}_1(t, \delta)$. Chaining gives $\text{UCB}_i(t, \delta) < \text{UCB}_1(t, \delta)$, so the algorithm should have selected arm 1 instead—contradiction. Therefore:

$$\mathbb{E}\!\left[\mathbf{1}(G_i)\, N_i(n)\right] \leq u_i$$

Step 3: Bound the probability of the bad event. We decompose the expected pull count using the law of total expectation:

$$\mathbb{E}[N_i(n)] = \mathbb{E}[\mathbf{1}(G_i)\, N_i(n)] + \mathbb{E}[\mathbf{1}(G_i^c)\, N_i(n)] \leq u_i + n \cdot P(G_i^c)$$

To bound $P(G_i^c)$, analyze each condition separately. For (*), union-bounding over all time steps and applying Hoeffding's inequality:

$$P\!\left(\mu_1 > \min_{t \leq n} \text{UCB}_1(t, \delta)\right) \leq \sum_{s=1}^{n} P\!\left(\mu_1 > \hat{\mu}_{1,s} + \sqrt{\frac{2\ln(1/\delta)}{s}}\right) \leq n\delta$$

For (**), applying Hoeffding to the empirical mean after $u_i$ pulls, provided $u_i$ is large enough that $\sqrt{2\ln(1/\delta)/u_i} \leq c\Delta_i$ for some $c \in (0,1)$:

$$P\!\left(\hat{\mu}_{i,u_i} + \sqrt{\frac{2\ln(1/\delta)}{u_i}} \geq \mu_1\right) \leq \exp\!\left(-\frac{u_i\, c^2 \Delta_i^2}{2}\right)$$

Step 4: Choose $u_i$ and combine. Rearranging the constraint:

$$u_i \geq \frac{2\ln(1/\delta)}{c^2 \Delta_i^2}$$

Setting $c = 1/2$ and $\delta = 1/n^2$, so that $\ln(1/\delta) = 2\ln n$, yields:

$$u_i = \frac{16\ln n}{\Delta_i^2}$$

Substituting back and combining failure probabilities:

$$\mathbb{E}[N_i(n)] \leq \frac{16\ln n}{\Delta_i^2} + 3$$

Plugging into the regret decomposition $\text{Regret}(n) = \sum_{i:\Delta_i > 0} \Delta_i \cdot \mathbb{E}[N_i(n)]$ yields:

$$\text{Regret}(n) \leq \sum_{i:\,\Delta_i > 0} \left(\frac{16\ln n}{\Delta_i} + 3\Delta_i\right)$$

which completes the proof. $\square$
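To make the choice of $u_i$ concrete, a quick numerical check (values illustrative): with $\delta = 1/n^2$ and $c = 1/2$, after $u_i = 16\ln n/\Delta_i^2$ pulls the confidence width has shrunk to exactly half the gap.

```python
import math

def required_pulls(gap, n, c=0.5):
    """u_i = 2 ln(1/delta) / (c * gap)^2 with delta = 1/n^2,
    i.e. 16 ln n / gap^2 at c = 1/2."""
    return 2.0 * (2.0 * math.log(n)) / (c * gap) ** 2

n = 10_000
for gap in (0.5, 0.1, 0.02):
    u = required_pulls(gap, n)
    # After u pulls the confidence width sqrt(2 ln(1/delta) / u)
    # equals c * gap = half the gap, by construction.
    width = math.sqrt(2.0 * (2.0 * math.log(n)) / u)
    print(f"gap={gap}: u_i ~ {u:.0f} pulls, width/gap = {width / gap:.2f}")
```

Note the quadratic dependence: halving the gap quadruples the required pulls, which is exactly why small-gap arms dominate the regret bound.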

Insight — Why Logarithmic Regret is Achievable

The proof reveals the mechanism behind logarithmic regret. Each suboptimal arm $i$ needs $O(\ln n / \Delta_i^2)$ pulls before its confidence interval is tight enough to rule it out. After that, the algorithm almost never selects it again. The regret contribution from arm $i$ is $\Delta_i \times O(\ln n / \Delta_i^2) = O(\ln n / \Delta_i)$. The logarithmic dependence on $n$ arises because the confidence bonus $\sqrt{\ln t / N_t(a)}$ shrinks at a rate that exactly balances the growing number of rounds—the algorithm "wastes" only logarithmically many pulls per suboptimal arm to become confident it is suboptimal.

The Gap-Independent Bound

The gap-dependent bound is informative when the gaps are moderate, but it blows up as some $\Delta_i \to 0$—even though an arm with a tiny gap costs almost nothing per pull. We can derive a gap-independent ("minimax") bound by splitting the arms at a threshold $\Delta > 0$. Arms with gap below $\Delta$ contribute at most $n\Delta$ total regret, since each of their pulls costs less than $\Delta$; each arm with gap at least $\Delta$ contributes at most $16\ln n/\Delta_i + 3\Delta_i \leq 16\ln n/\Delta + 3$ (using $\Delta_i \leq 1$). Hence:

$$\text{Regret}(n) \leq n\Delta + \frac{16 K \ln n}{\Delta} + 3K$$

Choosing $\Delta = \sqrt{16 K \ln n / n}$ to balance the first two terms gives $\text{Regret}(n) \leq 8\sqrt{Kn\ln n} + 3K = O(\sqrt{Kn\ln n})$, independent of the gaps or reward distributions.
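The balancing step can be checked numerically. A sketch, using the split-threshold form $n\Delta + 16K\ln n/\Delta$, which is one standard route to the minimax bound:

```python
import math

def minimax_tradeoff(n, k):
    """Trade off n*d (total cost of arms with gap below d) against
    16*k*ln(n)/d (cost of ruling out arms with gap at least d).
    The balancing threshold gives a bound scaling as sqrt(k*n*ln n)."""
    d_star = math.sqrt(16.0 * k * math.log(n) / n)   # equalizes both terms
    bound = n * d_star + 16.0 * k * math.log(n) / d_star
    return d_star, bound

d, bound = minimax_tradeoff(n=100_000, k=10)
# bound / sqrt(k * n * ln n) == 8: the balanced bound is 8 sqrt(K n ln n)
```

At the balancing point both terms equal $\sqrt{16 K n \ln n} = 4\sqrt{Kn\ln n}$, so their sum is $8\sqrt{Kn\ln n}$; the constant is loose, but the scaling is what matters.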

Lower Bounds on Bandit Regret

Theorem — Bandit Regret Lower Bound (Lai & Robbins, 1985)

For any consistent bandit algorithm (one whose regret is $o(n^p)$ for all $p > 0$ on every bandit instance), the expected regret on a $K$-armed bandit satisfies:

$$\liminf_{n \to \infty} \frac{\text{Regret}(n)}{\ln n} \geq \sum_{i:\, \Delta_i > 0} \frac{\Delta_i}{D_{\text{KL}}(\nu_i \,\|\, \nu^*)}$$

where $D_{\text{KL}}(\nu_i \| \nu^*)$ is the KL divergence between the reward distribution of arm $i$ and the optimal arm. For Gaussian rewards with unit variance, this simplifies to:

$$\liminf_{n \to \infty} \frac{\text{Regret}(n)}{\ln n} \geq \sum_{i:\, \Delta_i > 0} \frac{2}{\Delta_i}$$

This shows UCB's $O(\ln n / \Delta_i)$ per-arm scaling is order-optimal—no algorithm can do fundamentally better. The intuition is information-theoretic: distinguishing arm $i$ from the optimum requires enough samples to tell their reward distributions apart, which takes $\Omega(1/D_{\text{KL}}(\nu_i \| \nu^*))$ pulls, each costing $\Delta_i$ regret.
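The Gaussian specialization follows from the closed-form KL divergence between unit-variance Gaussians, which is worth writing out:

$$D_{\text{KL}}\!\big(\mathcal{N}(\mu_i, 1) \,\big\|\, \mathcal{N}(\mu^*, 1)\big) = \frac{(\mu^* - \mu_i)^2}{2} = \frac{\Delta_i^2}{2}, \qquad \text{hence} \qquad \frac{\Delta_i}{D_{\text{KL}}(\nu_i \,\|\, \nu^*)} = \frac{\Delta_i}{\Delta_i^2/2} = \frac{2}{\Delta_i}$$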

In the gap-independent (minimax) setting, the corresponding lower bound is:

$$\text{Regret}(n) = \Omega\!\left(\sqrt{Kn}\right)$$

Since UCB achieves $O(\sqrt{Kn\ln n})$, the gap to the lower bound is only $\sqrt{\ln n}$. Algorithms like MOSS (Minimax Optimal Strategy in the Stochastic case) close this gap entirely, achieving the minimax-optimal $O(\sqrt{Kn})$ rate. UCB remains popular nonetheless for its simplicity and strong practical performance.

Insight — The Exploration-Information Tradeoff

The lower bound reveals a deep connection between regret minimization and hypothesis testing. An algorithm must gather enough evidence to distinguish the optimal arm from each alternative. KL divergence quantifies the "statistical distance" between two reward distributions: when distributions are close (small KL divergence), more samples are needed, and more regret is incurred. This information-theoretic perspective unifies exploration across bandits, MDPs, and Bayesian optimization.

From Bandits to MDPs: The Exploration Challenge

Bandits are a powerful abstraction, but real RL problems involve sequential decision-making: actions affect not only the immediate reward but the next state, which determines future actions and rewards. A $K$-armed bandit is really a single-state MDP. Moving to full MDPs introduces several new exploration challenges:

  • Actions have delayed consequences: the value of an action depends on the states it leads to, not just its immediate reward.
  • The agent must learn the transition dynamics as well as the rewards, and reaching informative states may require long, deliberate action sequences.
  • Exploration decisions are coupled across states: data gathered in one state informs value estimates used elsewhere.

These challenges motivate formal frameworks for exploration efficiency. The two dominant ones are regret minimization (extending the bandit notion to MDPs) and PAC-MDP (bounding the number of suboptimal actions).

PAC-MDP: Probably Approximately Correct in MDPs

Definition — PAC-MDP

An algorithm $\mathfrak{A}$ is PAC-MDP (Probably Approximately Correct for MDPs) if, for any $\epsilon > 0$ and $\delta > 0$, with probability at least $1 - \delta$, the number of time steps $t$ at which $\mathfrak{A}$ takes an action that is not $\epsilon$-optimal is bounded by a polynomial in $(|\mathcal{S}|, |\mathcal{A}|, 1/\epsilon, 1/\delta, 1/(1-\gamma))$. Formally, the sample complexity of exploration $N_\epsilon$ satisfies:

$$P\!\left(\left|\{t : V^{\pi_t}(s_t) < V^*(s_t) - \epsilon \}\right| > N_\epsilon \right) \leq \delta$$

where $\pi_t$ is the policy executed by the algorithm at time $t$.

PAC-MDP differs from regret minimization in a key way. Regret measures the total cost of suboptimal actions, weighting each by its degree of suboptimality. PAC-MDP instead counts the number of $\epsilon$-suboptimal actions, treating all such actions equally. After a polynomial "burn-in" period, a PAC-MDP algorithm acts near-optimally for all remaining steps—it eventually figures things out and stops making costly mistakes, with high probability.

Insight — PAC-MDP vs. Regret

A PAC-MDP bound of $N_\epsilon$ suboptimal steps implies a regret bound of roughly $O(N_\epsilon)$ (each suboptimal step costs at most $O(V_{\max})$). But the conversion is often loose: PAC-MDP bounds tend to be polynomially larger than the best regret bounds. Conversely, low-regret algorithms are not always PAC-MDP, because regret allows many mildly-suboptimal actions as long as their total cost is controlled. The two frameworks offer complementary perspectives on exploration quality.

Model-Based Exploration: Optimism in MDPs

The "optimism in the face of uncertainty" principle that powered UCB extends naturally to MDPs. When the agent does not know the true dynamics, it assumes they are as favorable as possible (consistent with the data) and acts optimally under this optimistic model. This drives the agent toward unfamiliar states, because those states might harbor high rewards under the optimistic assumption.

R-MAX

R-MAX (Brafman and Tennenholtz, 2002) is one of the earliest provably efficient exploration algorithms for MDPs. It classifies each state-action pair as known or unknown and constructs an optimistic model:

Algorithm: R-MAX
  1. Initialize: Mark all state-action pairs $(s, a)$ as unknown. Set visit threshold $m$.
  2. Construct optimistic MDP $\tilde{M}$:
    1. For known $(s,a)$ (visited $\geq m$ times): use empirical transition $\hat{P}(s' \mid s, a)$ and empirical reward $\hat{R}(s, a)$.
    2. For unknown $(s,a)$: set reward to $R_{\max}$ (maximum possible reward) and transitions to a self-absorbing state with reward $R_{\max}$.
  3. Plan: Compute the optimal policy $\tilde{\pi}^*$ for the optimistic MDP $\tilde{M}$ (via value iteration).
  4. Act: Execute $\tilde{\pi}^*$ in the real environment. Update visit counts and empirical estimates.
  5. Repeat: When any $(s,a)$ transitions from unknown to known, re-plan on the updated $\tilde{M}$.

Unknown state-action pairs are maximally attractive in the optimistic MDP, so $\tilde{\pi}^*$ seeks them out—driving exploration—until they become known. Once known, reward and dynamics in $\tilde{M}$ match reality (approximately), so the policy is near-optimal for the known portion of the state space.
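Steps 2–3 above (build the optimistic model, plan on it) can be sketched as follows. The dictionary-based model layout is an assumption for illustration, not the lecture's notation:

```python
def plan_optimistic(counts, trans, rews, n_states, n_actions,
                    m, r_max, gamma=0.95, sweeps=200):
    """Value iteration on the R-MAX optimistic model (illustrative sketch).

    counts[(s, a)]       visit count for the pair
    trans[(s, a)][s2]    empirical next-state counts
    rews[(s, a)]         sum of observed rewards

    Pairs with fewer than m visits are 'unknown' and are valued as if they
    led to a self-absorbing state with reward r_max, i.e. r_max / (1 - gamma).
    """
    v_opt = r_max / (1.0 - gamma)           # value of the absorbing state

    def q(s, a, V):
        c = counts.get((s, a), 0)
        if c < m:                           # unknown: maximally optimistic
            return v_opt
        r_hat = rews[(s, a)] / c            # empirical mean reward
        return r_hat + gamma * sum(k / c * V[s2]
                                   for s2, k in trans[(s, a)].items())

    V = [0.0] * n_states
    for _ in range(sweeps):                 # value iteration on the model
        V = [max(q(s, a, V) for a in range(n_actions))
             for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a: q(s, a, V))
              for s in range(n_states)]
    return V, policy
```

With no data at all, every pair is unknown, so the plan assigns every state the optimistic value $R_{\max}/(1-\gamma)$ and the greedy policy chases unknown pairs—exactly the exploration drive described above.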

Theorem — R-MAX Sample Complexity

R-MAX is PAC-MDP. With an appropriate choice of the threshold $m$, the number of $\epsilon$-suboptimal time steps is at most:

$$N_\epsilon = O\!\left(\frac{|\mathcal{S}|^2 |\mathcal{A}|}{\epsilon^3 (1-\gamma)^6} \ln\frac{1}{\delta} \ln\frac{1}{\epsilon(1-\gamma)}\right)$$

This is polynomial in all relevant quantities, guaranteeing efficient exploration.

Explicit Explore or Exploit (E3)

E3 (Kearns and Singh, 2002) takes a different approach. Rather than a single optimistic MDP, it maintains a set of "known" states and explicitly decides between exploitation (following the best policy in the known region) and exploration (seeking unknown states). E3 computes a "balanced wandering" policy guaranteed to either collect high reward or quickly reach an unknown state, ensuring steady exploration progress. Like R-MAX, E3 achieves polynomial sample complexity, with somewhat different polynomial dependence on the parameters.

Optimistic Initialization

A simpler, heuristic form of optimism is optimistic initialization: set $Q(s,a)$ or $V(s)$ to an optimistically high value (e.g., $V_{\max} = R_{\max}/(1-\gamma)$) before learning begins. Combined with Q-learning, this creates a natural exploration drive—the agent believes unvisited pairs have very high value and preferentially visits them. As experience accumulates, estimates decrease toward their true values and the exploratory pressure fades.

Optimistic initialization is easy to implement and often effective, but lacks R-MAX's formal guarantees. It does not ensure every pair is visited sufficiently, and in large or adversarial environments the initial optimism can decay before the agent has explored adequately. Still, it remains widely used for its simplicity and compatibility with existing algorithms.
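The mechanism is easy to see in code. Below is a sketch of Q-learning with optimistic initialization on a toy chain MDP; the environment (action 1 moves right, action 0 resets to state 0, arriving at the last state pays $R_{\max}$ and restarts) is an illustrative assumption, not from the lecture.

```python
def optimistic_q_learning(n_states=5, n_actions=2, r_max=1.0,
                          gamma=0.9, alpha=0.5, steps=20000):
    """Q-learning with optimistic initialization on a toy chain MDP.

    Initializing every Q(s, a) at V_max = r_max / (1 - gamma) makes
    unvisited pairs look valuable, so even purely greedy action
    selection keeps exploring until optimism decays.
    """
    v_max = r_max / (1.0 - gamma)
    Q = [[v_max] * n_actions for _ in range(n_states)]    # optimistic init
    goal = n_states - 1           # arriving here pays r_max, then reset to 0
    s = 0
    for _ in range(steps):
        a = max(range(n_actions), key=lambda i: Q[s][i])  # greedy, no epsilon
        s2 = s + 1 if a == 1 else 0
        if s2 == goal:
            r, s2 = r_max, 0      # goal reached: reward, restart the chain
        else:
            r = 0.0
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
    return Q
```

With no explicit exploration bonus, the agent still sweeps the whole chain: each greedy choice either pays off or drags that pair's inflated estimate down, until only the genuinely best action (moving right) survives.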

Sample Complexity of Exploration

Sample complexity of exploration gives a unified lens for comparing algorithms: how many suboptimal steps must an algorithm endure before converging to near-optimal performance? The answer depends on the MDP's structure.

For tabular MDPs with $|\mathcal{S}|$ states and $|\mathcal{A}|$ actions, the best known PAC-MDP bounds scale as:

$$N_\epsilon = \tilde{O}\!\left(\frac{|\mathcal{S}|^2 |\mathcal{A}|}{\epsilon^2(1-\gamma)^4}\right)$$

where $\tilde{O}$ hides logarithmic factors—polynomial in all quantities, confirming efficient exploration is possible. For regret minimization in episodic MDPs, algorithms like UCBVI achieve:

$$\text{Regret}(T) = \tilde{O}\!\left(\sqrt{|\mathcal{S}||\mathcal{A}| H^3 T}\right)$$

where $H$ is the episode length and $T$ the total number of episodes. The minimax-optimal rate replaces $H^3$ with $H^2$; the result above is near-optimal and illustrates that efficient exploration in tabular MDPs is well understood in theory.

Insight — The Bandit-MDP Connection

The progression from bandits to MDPs reveals a recurring theme: optimism in the face of uncertainty is remarkably versatile. In bandits, UCB adds a bonus to each arm's estimated reward. In MDPs, R-MAX adds a bonus to unknown state-action pairs. Both make poorly-understood options look favorable, ensuring they are tried before being discarded. The formal analysis follows the same template:

  • Decompose regret by identifying where suboptimality occurs.
  • Bound the frequency of suboptimal decisions using concentration inequalities and the optimistic structure.
  • Sum over all sources of regret.

The core machinery—Hoeffding bounds, union bounds, careful counting—transfers directly from bandits to MDPs, with the main complication being coupling between states across time.

Summary

This lecture established the theoretical foundations of fast reinforcement learning. We proved that UCB achieves $O\!\left(\sum_{i:\Delta_i > 0} \ln n / \Delta_i\right)$ regret in bandits, with the proof hinging on regret decomposition by arm, a "good event" analysis via Hoeffding concentration, and a careful choice of exploration pulls $u_i$ per suboptimal arm. We then showed this logarithmic rate is essentially optimal: the Lai-Robbins lower bound shows no algorithm can achieve $o(\ln n)$ regret, and the minimax lower bound is $\Omega(\sqrt{Kn})$.

Turning to MDPs, we introduced PAC-MDP as a formal exploration-efficiency measure: an algorithm is PAC-MDP if it takes at most polynomially many suboptimal actions. R-MAX and E3 are provably efficient model-based algorithms, both building on optimism in the face of uncertainty, while optimistic initialization offers a simpler but less theoretically grounded alternative. The sample complexity of exploration in tabular MDPs is now well understood, with matching upper and lower bounds confirming that polynomial-time efficient exploration is achievable.

Next, we turn to Bayesian approaches—Thompson sampling and Bayesian regret bounds—which offer an alternative to the frequentist optimism-based methods studied here.