Lecture 11

Bayesian Bandits

Thompson sampling, Beta-Bernoulli bandits, Bayesian approaches to exploration, and connections to information-theoretic bounds.


From Frequentist to Bayesian Bandits

In the previous lecture, we studied multi-armed bandits from a frequentist perspective. We assumed only that rewards are bounded, and derived Upper Confidence Bound (UCB)—an algorithm that constructs confidence intervals around estimated means, picks the arm with the highest upper bound, and guarantees $O(\sqrt{KT \ln T})$ regret without any prior knowledge.

But what if we do have prior knowledge? In clinical trials, online advertising, and recommendation systems, we often have domain expertise or historical data that gives us beliefs about reward rates before we begin. The Bayesian approach provides a principled framework for incorporating this information: maintain a full probability distribution over each arm's unknown parameters, and update those beliefs as data arrives.

Recall the setup: a tuple $(\mathcal{A}, R)$ where $\mathcal{A}$ is a set of $K$ arms and $R_a(r) = P[r \mid a]$ is an unknown reward distribution for each arm $a$. At each step $t$, the agent selects $a_t \in \mathcal{A}$, observes $r_t \sim R_{a_t}$, and aims to minimize total regret:

$$L_T = \mathbb{E}\!\left[\sum_{\tau=1}^{T} V^* - Q(a_\tau)\right]$$

where $V^* = \max_{a \in \mathcal{A}} Q(a)$ is the value of the best arm. The Bayesian perspective adds a prior $p(R)$ over the unknown reward distributions, updating it into a posterior as data accumulates.

Bayesian Inference for Bandits

In the Bayesian view, the unknown reward parameters are random variables, and we maintain a posterior distribution capturing our uncertainty. Let each arm $i$ have a reward distribution parameterized by $\phi_i$. We begin with a prior $p(\phi_i)$ and update after each observation using Bayes' rule.

Definition — Posterior Update

Given a prior $p(\phi_i)$ over the reward parameter of arm $i$ and an observed reward $r_{i1}$ from pulling that arm, the posterior distribution is computed via Bayes' rule:

$$p(\phi_i \mid r_{i1}) = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{p(r_{i1})} = \frac{p(r_{i1} \mid \phi_i)\, p(\phi_i)}{\int_{\phi_i} p(r_{i1} \mid \phi_i)\, p(\phi_i)\, d\phi_i}$$

The numerator is likelihood times prior; the denominator is a normalizing constant ensuring the posterior integrates to one.

Computing this posterior exactly can be intractable—the denominator integral may lack a closed form. But when the prior and likelihood belong to certain matched families, the update becomes analytically tractable. This leads us to conjugate priors.
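To see what the update costs without conjugacy, here is a minimal sketch that approximates the posterior numerically on a grid, for a Bernoulli likelihood and a uniform prior. The grid size and variable names are illustrative; the point is that the normalizing "evidence" in the denominator must be computed explicitly at every update.

```python
import numpy as np

# Numerical Bayes update for a Bernoulli parameter on a grid over [0, 1].
# Without conjugacy, the normalizing integral must be approximated.
theta = np.linspace(0.001, 0.999, 999)       # grid of candidate parameters
prior = np.ones_like(theta) / len(theta)     # discretized uniform prior

def grid_update(posterior, r):
    """One Bayes update after observing reward r in {0, 1}."""
    likelihood = theta**r * (1 - theta)**(1 - r)   # Bernoulli likelihood
    unnormalized = likelihood * posterior           # likelihood times prior
    return unnormalized / unnormalized.sum()        # divide by the evidence

post = grid_update(prior, 1)   # observe a success
post = grid_update(post, 1)    # another success
post = grid_update(post, 0)    # then a failure

posterior_mean = (theta * post).sum()
# With a uniform prior, 2 successes and 1 failure give the exact posterior
# Beta(3, 2) with mean 3/5; the grid estimate should be close to 0.6.
print(posterior_mean)
```

The conjugate update derived below collapses all of this into incrementing two counters.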

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same parametric family as the prior. When this holds, the Bayesian update reduces to a simple parameter update rather than numerical integration. Exponential family distributions always have conjugate priors.

Key Insight

Why conjugacy matters for bandits. We update beliefs after every arm pull—potentially millions of times. If each update required MCMC or numerical integration, the algorithm would be impractical. Conjugate priors make the posterior update a constant-time operation: just increment a few sufficient statistics. This is what makes Thompson Sampling computationally competitive with UCB.

The Beta-Bernoulli Bandit

The most important instance of a Bayesian bandit is the Beta-Bernoulli model. Each arm's reward is binary—0 or 1—drawn from a Bernoulli with unknown parameter $\theta$. This naturally models click/no-click on an ad, success/failure of a treatment, or conversion/bounce on a website variant.

Definition — Beta Distribution

The Beta distribution $\text{Beta}(\alpha, \beta)$ is a continuous probability distribution on $[0, 1]$ with probability density function:

$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$$

where $\Gamma(\cdot)$ is the Gamma function. The mean is $\frac{\alpha}{\alpha + \beta}$ and the variance is $\frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$. Intuitively, $\alpha$ counts "pseudo-successes" and $\beta$ counts "pseudo-failures."
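The mean and variance formulas can be sanity-checked by Monte Carlo; the parameter values below are arbitrary illustrative choices, not from the lecture.

```python
import numpy as np

# Check the Beta mean and variance formulas against empirical samples.
alpha, beta = 3.0, 7.0                      # illustrative pseudo-counts
rng = np.random.default_rng(0)
samples = rng.beta(alpha, beta, size=1_000_000)

mean_formula = alpha / (alpha + beta)
var_formula = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))

print(samples.mean(), mean_formula)   # both near 0.3
print(samples.var(), var_formula)     # both near 0.019
```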

The Beta distribution is conjugate to the Bernoulli likelihood: if our prior over $\theta$ is $\text{Beta}(\alpha, \beta)$, then after observing a reward $r \in \{0, 1\}$, the posterior is again Beta with updated parameters:

$$\text{Prior: } \theta \sim \text{Beta}(\alpha, \beta) \quad\xrightarrow{\text{observe } r}\quad \text{Posterior: } \theta \mid r \sim \text{Beta}(\alpha + r,\; \beta + 1 - r)$$

A success ($r = 1$) increments $\alpha$; a failure ($r = 0$) increments $\beta$. After $s$ successes and $f$ failures, the posterior is $\text{Beta}(\alpha + s, \beta + f)$—just add to two counters.
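The conjugate update above is literally two counters; a minimal sketch:

```python
# Beta-Bernoulli conjugate update: a constant-time counter increment.
def beta_update(alpha, beta, r):
    """Posterior Beta parameters after observing reward r in {0, 1}."""
    return alpha + r, beta + (1 - r)

# Start from the uniform prior Beta(1, 1); observe 3 successes, 1 failure.
a, b = 1, 1
for r in [1, 1, 0, 1]:
    a, b = beta_update(a, b, r)

print(a, b)         # Beta(4, 2)
print(a / (a + b))  # posterior mean 4/6
```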

Example — Treating Broken Toes

Suppose we have three treatment options (arms) for a broken toe, with unknown true Bernoulli success rates: Surgery ($\theta_1 = 0.95$), Taping ($\theta_2 = 0.90$), Nothing ($\theta_3 = 0.10$). We initialize a uniform prior $\text{Beta}(1, 1)$ over each arm's success probability. Here is how the posterior evolves over several rounds of Thompson Sampling:

Round 1: Sample from each arm's posterior: $\tilde{\theta}_1 = 0.3$, $\tilde{\theta}_2 = 0.5$, $\tilde{\theta}_3 = 0.6$. The highest sample is arm 3 (Nothing), so we pull it. We observe reward $r = 0$ (failure). Update: arm 3's posterior becomes $\text{Beta}(1, 2)$, which shifts its mean down to $1/3$.

Round 2: Sample again: $\tilde{\theta}_1 = 0.7$, $\tilde{\theta}_2 = 0.5$, $\tilde{\theta}_3 = 0.3$ (arm 3's posterior is now pessimistic). We pull arm 1 (Surgery). Observe $r = 1$ (success). Update: arm 1's posterior becomes $\text{Beta}(2, 1)$, mean $= 2/3$.

Round 3: Sample: $\tilde{\theta}_1 = 0.71$, $\tilde{\theta}_2 = 0.65$, $\tilde{\theta}_3 = 0.1$. Pull arm 1 again. Observe $r = 1$. Posterior: $\text{Beta}(3, 1)$, mean $= 3/4$.

Round 4: Sample: $\tilde{\theta}_1 = 0.75$, $\tilde{\theta}_2 = 0.45$, $\tilde{\theta}_3 = 0.4$. Pull arm 1 again. Observe $r = 1$. Posterior: $\text{Beta}(4, 1)$, mean $= 4/5$.

After just four rounds, the algorithm has concentrated on the best arm (Surgery), having quickly discovered that Nothing is ineffective. Arm 1's posterior is becoming increasingly concentrated, while arm 3's has shifted decidedly leftward.
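The posterior evolution in this walkthrough can be replayed exactly. The sampled values $\tilde{\theta}$ are random in general, but the (arm, reward) sequence above fully determines the posteriors:

```python
# Replay the (arm, reward) sequence from the broken-toe example.
posteriors = {arm: [1, 1] for arm in ["Surgery", "Taping", "Nothing"]}

history = [("Nothing", 0), ("Surgery", 1), ("Surgery", 1), ("Surgery", 1)]
for arm, r in history:
    a, b = posteriors[arm]
    posteriors[arm] = [a + r, b + (1 - r)]   # conjugate counter update

print(posteriors["Surgery"])   # [4, 1], posterior mean 4/5
print(posteriors["Nothing"])   # [1, 2], posterior mean 1/3
print(posteriors["Taping"])    # [1, 1], never pulled
```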

Using the Posterior for Exploration

The Bayesian framework gives us a posterior $p(R \mid h_t)$ over reward parameters given history $h_t = (a_1, r_1, \ldots, a_{t-1}, r_{t-1})$. How should we use this posterior to guide action selection?

Bayesian UCB uses posterior quantiles as optimistic estimates instead of Hoeffding-based confidence intervals. This inherits the theoretical guarantees of optimism while being more data-efficient when the prior is well-calibrated.

Probability matching selects each action with a probability equal to the posterior probability that it is optimal:

$$\pi(a \mid h_t) = P\!\left[Q(a) > Q(a') \;\text{for all } a' \neq a \;\middle|\; h_t\right]$$

Computing this probability analytically is typically intractable—it requires integrating over all arms' posteriors simultaneously. But there is a remarkably simple algorithm that implements probability matching exactly: Thompson Sampling.
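Although the probability-matching integral is intractable in closed form, it is easy to estimate by Monte Carlo: repeatedly sample from every arm's posterior and record the argmax. The Beta parameters below are illustrative posteriors chosen for the example.

```python
import numpy as np

# Monte Carlo estimate of the posterior probability that each arm is
# optimal. Thompson Sampling performs exactly ONE such joint draw per
# round, so its action distribution equals these probabilities.
rng = np.random.default_rng(0)
params = [(4, 1), (2, 2), (1, 3)]   # illustrative Beta posteriors per arm

n = 100_000
samples = np.column_stack([rng.beta(a, b, size=n) for a, b in params])
winners = samples.argmax(axis=1)                     # optimal arm per draw
p_optimal = np.bincount(winners, minlength=len(params)) / n
print(p_optimal)   # arm 0 (posterior mean 0.8) should dominate
```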

Thompson Sampling

Thompson Sampling (also called posterior sampling), proposed by Thompson in 1933, is one of the oldest heuristics for the exploration-exploitation trade-off. Despite its simplicity, it was not rigorously analyzed until the 2010s, when a series of papers established guarantees matching or nearly matching UCB.

Thompson Sampling (General Form)
  1. Initialize prior $p(R_a)$ for each arm $a \in \mathcal{A}$.
  2. for $t = 1, 2, \ldots$ do
  3. for each arm $a$: sample a reward model $\tilde{R}_a$ from its posterior $p(R_a \mid h_t)$.
  4. Compute $\tilde{Q}(a) = \mathbb{E}[\tilde{R}_a]$ for each arm.
  5. Select action $a_t = \arg\max_{a \in \mathcal{A}} \tilde{Q}(a)$.
  6. Observe reward $r_t$ from the environment.
  7. Update posterior $p(R_{a_t} \mid h_{t+1})$ using Bayes' rule.
  8. end for

For the Beta-Bernoulli case, the algorithm becomes especially clean.

Thompson Sampling for Beta-Bernoulli Bandits
  1. Initialize $\alpha_a = 1,\; \beta_a = 1$ for each arm $a$. // Uniform prior
  2. for $t = 1, 2, \ldots$ do
  3. for each arm $a$: sample $\tilde{\theta}_a \sim \text{Beta}(\alpha_a, \beta_a)$.
  4. Select $a_t = \arg\max_{a} \tilde{\theta}_a$.
  5. Observe reward $r_t \in \{0, 1\}$.
  6. if $r_t = 1$: $\alpha_{a_t} \leftarrow \alpha_{a_t} + 1$. else: $\beta_{a_t} \leftarrow \beta_{a_t} + 1$.
  7. end for

Each iteration involves only sampling from Beta distributions ($O(1)$ per arm) and incrementing counters—negligible per-step overhead compared to UCB.
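The pseudocode translates directly into a short runnable sketch; the simulated arm means below reuse the broken-toe example, and the helper name is our own.

```python
import numpy as np

# Minimal Beta-Bernoulli Thompson Sampling, run against simulated arms.
def thompson_sampling(true_thetas, T, seed=0):
    rng = np.random.default_rng(seed)
    K = len(true_thetas)
    alpha = np.ones(K)                       # Beta(1, 1) uniform priors
    beta = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    for _ in range(T):
        theta_tilde = rng.beta(alpha, beta)  # one posterior sample per arm
        a = int(theta_tilde.argmax())        # play the argmax
        r = rng.binomial(1, true_thetas[a])  # Bernoulli reward
        alpha[a] += r                        # conjugate update
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls, alpha / (alpha + beta)

pulls, post_means = thompson_sampling([0.95, 0.90, 0.10], T=2000)
print(pulls)   # the Nothing arm (index 2) should be pulled rarely
```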

Thompson Sampling Implements Probability Matching

Thompson Sampling exactly implements probability matching: the probability it selects arm $a$ at time $t$ equals the posterior probability that arm $a$ is optimal:

$$\pi(a \mid h_t) = P\!\left[Q(a) > Q(a'),\; \forall a' \neq a \;\middle|\; h_t\right] = \mathbb{E}_{R \mid h_t}\!\left[\mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q(a')\right)\right]$$
Proof sketch: Thompson Sampling equals probability matching

Consider the event that Thompson Sampling selects arm $a$. This occurs when the sampled value $\tilde{Q}(a)$ exceeds $\tilde{Q}(a')$ for all other arms $a' \neq a$. Since $\tilde{R}_a$ is drawn independently from the posterior $p(R_a \mid h_t)$, the probability of this event is:

$$P[\text{TS selects } a] = P\!\left[\tilde{Q}(a) > \tilde{Q}(a'),\; \forall a' \neq a\right]$$

But by construction, the vector of sampled values $(\tilde{Q}(a))_{a \in \mathcal{A}}$ has exactly the same joint distribution as $(Q(a))_{a \in \mathcal{A}}$ under the posterior $p(R \mid h_t)$. The probability above is therefore:

$$P\!\left[Q(a) > Q(a'),\; \forall a' \neq a \;\middle|\; h_t\right]$$

which is the probability matching criterion. The critical step is that sampling from the posterior and then maximizing is distributionally equivalent to asking which arm is optimal under the posterior. $\blacksquare$

Key Insight

Why posterior sampling naturally explores. Probability matching is inherently optimistic, though in a softer way than UCB. Arms we are uncertain about have a wide posterior, so their samples will occasionally be very high—causing them to be selected. As data accumulates and the posterior concentrates, sampling variance shrinks and the algorithm naturally shifts toward exploitation. Unlike UCB, which deterministically picks the arm with the highest upper bound, Thompson Sampling introduces randomization—a significant practical advantage in settings with delayed feedback.

Thompson Sampling vs. Optimism

How does Thompson Sampling compare to UCB in practice? Return to the broken-toe example with three arms (Surgery $\theta_1 = 0.95$, Taping $\theta_2 = 0.90$, Nothing $\theta_3 = 0.10$). Both algorithms quickly concentrate on Surgery; the difference lies in how they explore along the way. UCB deterministically follows the highest confidence bound, while Thompson Sampling explores through the randomness of its posterior samples.

This illustrates a broader empirical pattern: Thompson Sampling's randomized exploration naturally adapts to the observed reward structure, rather than exploring systematically. It excels in contextual bandit settings—such as online news recommendation (Chapelle and Li, 2010)—where the action space is large and deterministic strategies would repeatedly select the same suboptimal arm before receiving feedback.

Key Insight

Randomization as a feature, not a bug. Consider a news website where thousands of users arrive per second before any click data is returned. UCB is deterministic—given the same data, it always selects the same arm—so it would recommend the same article to every user until feedback arrives. Thompson Sampling randomly varies recommendations across concurrent users, achieving better coverage even without real-time feedback. This is one reason it has become the de facto standard for production A/B testing and recommendation systems.
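This effect is easy to demonstrate: freeze a posterior (no updates, as if feedback is delayed) and serve many users from it. A deterministic rule would give every user the same arm; posterior sampling spreads traffic across the plausible ones. The Beta parameters are illustrative.

```python
import numpy as np

# Serving 1000 concurrent users from the SAME frozen posterior.
rng = np.random.default_rng(0)
params = [(8, 4), (7, 5), (2, 10)]   # illustrative Beta posteriors per arm

actions = []
for _ in range(1000):                # no updates between users
    theta_tilde = [rng.beta(a, b) for a, b in params]
    actions.append(int(np.argmax(theta_tilde)))

counts = np.bincount(actions, minlength=3)
print(counts)   # arms 0 and 1 both receive substantial traffic
```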

Bayesian Regret and Theoretical Guarantees

To evaluate Bayesian bandit algorithms, we need a notion of regret that accounts for the prior. Frequentist regret fixes a true parameter $\theta$ and measures performance against the best arm under $\theta$:

Definition — Frequentist vs. Bayesian Regret

Frequentist regret for algorithm $\mathcal{A}$ over $T$ rounds under true parameters $\theta$:

$$\text{Regret}(\mathcal{A}, T;\, \theta) = \mathbb{E}_\tau\!\left[\sum_{t=1}^{T} Q(a^*) - Q(a_t) \;\middle|\; \theta\right]$$

where $\mathbb{E}_\tau$ denotes expectation over the randomness in actions and rewards.

Bayesian regret additionally averages over the prior on $\theta$:

$$\text{BayesRegret}(\mathcal{A}, T) = \mathbb{E}_{\theta \sim p_\theta}\!\left[\text{Regret}(\mathcal{A}, T;\, \theta)\right] = \mathbb{E}_{\theta \sim p_\theta, \tau}\!\left[\sum_{t=1}^{T} Q(a^*) - Q(a_t) \;\middle|\; \theta\right]$$

Bayesian regret is a weaker criterion than frequentist regret: it measures average-case performance over the prior, not worst-case over all parameters. An algorithm can have low Bayesian regret by performing well on "most" parameter settings according to the prior, even if it struggles on some adversarial configurations.
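The definition suggests a direct Monte Carlo estimator: draw true parameters from the prior, run the algorithm, and average the realized pseudo-regret. A small sketch with uniform (Beta(1,1)) priors; the horizon and environment counts are arbitrary.

```python
import numpy as np

# Monte Carlo estimate of Bayesian regret for Thompson Sampling.
def run_ts(thetas, T, rng):
    K = len(thetas)
    alpha, beta = np.ones(K), np.ones(K)
    best = float(np.max(thetas))
    regret = 0.0
    for _ in range(T):
        a = int(rng.beta(alpha, beta).argmax())   # Thompson step
        r = rng.binomial(1, thetas[a])
        alpha[a] += r
        beta[a] += 1 - r
        regret += best - thetas[a]                # expected per-step regret
    return regret

rng = np.random.default_rng(0)
T, K, n_envs = 1000, 3, 30
# theta ~ Uniform(0,1) per arm is exactly the Beta(1,1) prior.
regrets = [run_ts(rng.uniform(size=K), T, rng) for _ in range(n_envs)]
bayes_regret = float(np.mean(regrets))
print(bayes_regret)   # far below the linear worst case of order T
```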

Theorem — Bayesian Regret of Thompson Sampling

For a $K$-armed stochastic bandit with Bernoulli rewards and a Beta prior, Thompson Sampling achieves Bayesian regret:

$$\text{BayesRegret}(\text{TS}, T) = O\!\left(\sqrt{KT \ln T}\right)$$

This matches the frequentist regret bound of UCB up to logarithmic factors. For many problem structures, Thompson Sampling also enjoys an instance-dependent regret bound of $O\!\left(\sum_{a: \Delta_a > 0} \frac{\ln T}{\Delta_a}\right)$, where $\Delta_a = Q(a^*) - Q(a)$ is the suboptimality gap of arm $a$.

Proof sketch: Bounding Bayesian regret via optimism

The key idea is to relate Thompson Sampling's regret to a UCB-style bound. For UCB, regret decomposes into terms involving the upper confidence bound $U_t(a_t)$:

$$\text{Regret}(\text{UCB}, T;\, \theta) = \mathbb{E}_\tau\!\left[\sum_{t=1}^{T} Q(a^*) - Q(a_t)\right] \leq \mathbb{E}_\tau\!\left[\sum_{t=1}^{T} U_t(a_t) - Q(a_t)\right]$$

where the inequality holds because $a_t$ maximizes $U_t$: when $U_t$ is a valid upper bound on $Q$ for all arms and times, $U_t(a_t) \geq U_t(a^*) \geq Q(a^*)$. For Thompson Sampling, an analogous decomposition works because the sampled $\tilde{Q}(a)$ plays a role similar to the upper confidence bound: on average over the prior, it is as optimistic as needed. Combining this with concentration arguments on the posterior width yields the stated bound. See Agrawal and Goyal (2012) and Russo and Van Roy (2014) for details. $\blacksquare$

An important subtlety: the frequentist regret bounds for Thompson Sampling (holding for every fixed $\theta$, not just on average) are slightly weaker than the best UCB bounds—the tightest instance-dependent UCB bound is not quite matched. But the Bayesian regret bounds are essentially optimal, and empirically Thompson Sampling frequently outperforms UCB.

Bayes-Optimal Exploration

Thompson Sampling is highly effective, but is it optimal? Given a prior and a known horizon $T$, we can formulate the bandit problem as an MDP where the state is the current posterior (or equivalently, the history) and the action is which arm to pull. The Bayes-optimal policy is the solution to this MDP via dynamic programming.

The catch: the state space is all possible posteriors, which grows exponentially with the horizon. For a Beta-Bernoulli bandit with $K$ arms, the state after $t$ steps is a tuple $(\text{Beta}(\alpha_1, \beta_1), \ldots, \text{Beta}(\alpha_K, \beta_K))$ where the parameters sum to $2K + t$. The number of possible states is combinatorial in $K$ and $t$, making exact dynamic programming intractable except for tiny problems.
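The combinatorial growth is easy to quantify. Each pull increments exactly one of the $2K$ Beta counters, so the posteriors reachable after exactly $t$ pulls are the weak compositions of $t$ into $2K$ parts, of which there are $\binom{t + 2K - 1}{2K - 1}$:

```python
from math import comb

# Number of reachable posterior states for a K-armed Beta-Bernoulli
# bandit after exactly t pulls: weak compositions of t into 2K parts.
def n_states(K, t):
    return comb(t + 2 * K - 1, 2 * K - 1)

print(n_states(3, 10))    # 3003 states after only 10 pulls
print(n_states(3, 100))   # ~9.7e7: exact DP is hopeless at modest horizons
```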

Gittins Indices

A remarkable result partially sidesteps this intractability. The Gittins index theorem (Gittins, 1979) shows that for discounted Bayesian bandits, the optimal policy has a simple index form: compute a single real-valued index for each arm (depending only on that arm's posterior and the discount factor) and play the arm with the highest index.

The Gittins index for an arm with posterior state $s$ and discount factor $\gamma$ is the "fair subsidy" that would make the agent indifferent between pulling that arm and receiving a known constant reward—the value $\nu$ solving a certain optimal stopping problem. Computing Gittins indices requires solving a one-dimensional dynamic program per arm, but the index structure decouples the arms—a dramatic reduction from the full joint state space.

The Gittins index result is limited, though: it requires discounting (not a finite horizon), independent arms, and does not extend to contextual bandits or MDPs. For these reasons, Thompson Sampling remains the practical algorithm of choice despite not being exactly Bayes-optimal.

Extensions and Practical Considerations

Contextual Bandits

A major strength of Thompson Sampling is its natural extension to contextual bandits, where a context (feature vector) is observed before choosing an action. In news recommendation, for instance, the context includes user features (browsing history, demographics), each arm is an article, and the reward (click or not) depends on the user-article match.

We maintain a posterior over parameters of a reward model $r = f(x, a;\, \phi) + \epsilon$, where $x$ is the context. For a linear model with Gaussian noise, the posterior over $\phi$ stays Gaussian, and Thompson Sampling proceeds by sampling $\tilde{\phi}$, computing $\tilde{Q}(a) = f(x, a;\, \tilde{\phi})$ for each arm, and selecting the maximum. Chapelle and Li (2010) showed this substantially outperforms UCB-based methods on the Yahoo! news dataset.
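A minimal sketch of this linear-Gaussian case, assuming $r = \phi^\top x_a + \epsilon$ with known noise variance; the class name, dimensions, and hyperparameters are illustrative, not from the lecture or from Chapelle and Li.

```python
import numpy as np

# Linear-Gaussian Thompson Sampling: the posterior over phi stays
# Gaussian, so sampling a model and maximizing is cheap.
class LinearTS:
    def __init__(self, d, noise_var=0.25, prior_var=1.0):
        self.noise_var = noise_var
        self.precision = np.eye(d) / prior_var   # posterior precision
        self.b = np.zeros(d)                     # accumulates x*r/noise_var

    def posterior(self):
        cov = np.linalg.inv(self.precision)
        return cov @ self.b, cov                 # posterior mean, covariance

    def act(self, arm_features, rng):
        mean, cov = self.posterior()
        phi_tilde = rng.multivariate_normal(mean, cov)  # sample a model
        return int((arm_features @ phi_tilde).argmax()) # argmax of Q~

    def update(self, x, r):
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * r / self.noise_var

rng = np.random.default_rng(0)
true_phi = np.array([1.0, -0.5])                 # hidden reward model
agent = LinearTS(d=2)
for _ in range(500):
    arms = rng.normal(size=(5, 2))               # 5 random arm features
    a = agent.act(arms, rng)
    r = arms[a] @ true_phi + rng.normal(scale=0.5)
    agent.update(arms[a], r)

mean, _ = agent.posterior()
print(mean)   # posterior mean should be close to true_phi
```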

Sensitivity to the Prior

If the prior is well-calibrated—accurately reflecting the true distribution of reward parameters—Thompson Sampling will outperform frequentist methods. But if the prior is badly misspecified (assigning very low probability to the true reward rates), Thompson Sampling can underperform UCB in the short term, since it may take many observations to overcome a misleading prior.

In practice, weakly informative priors like $\text{Beta}(1, 1)$ (uniform) are common defaults that avoid strong assumptions while still enabling the Bayesian machinery. Since any Beta prior places positive density on every $\theta \in (0, 1)$, the data eventually overwhelm the prior and the algorithm converges to optimal behavior regardless.

Summary

This lecture developed the Bayesian approach to multi-armed bandits as an alternative to the frequentist UCB framework. The key ideas:

- Treat each arm's reward parameter as a random variable and maintain a posterior over it; with conjugate priors (Beta for Bernoulli rewards), each update is a constant-time counter increment.
- Thompson Sampling draws one sample from each arm's posterior and plays the argmax, which exactly implements probability matching.
- Thompson Sampling achieves Bayesian regret $O(\sqrt{KT \ln T})$, matching UCB up to logarithmic factors, and its randomization is an advantage under delayed feedback.
- The Bayes-optimal policy is defined by dynamic programming over posteriors but is intractable; Gittins indices solve the discounted case, yet do not extend to finite horizons, contexts, or MDPs.

Together with UCB, Thompson Sampling forms the core toolkit for bandit exploration. The two perspectives are complementary: UCB provides worst-case guarantees independent of any prior, while Thompson Sampling leverages prior knowledge and tends to be more data-efficient when that prior is reasonable. In subsequent lectures, we extend these ideas from bandits to the full RL setting, where exploration must be balanced across states and time steps in an MDP.