What Is Reinforcement Learning?
Reinforcement learning (RL) is about learning from experience to make good decisions under uncertainty. Unlike supervised learning, where we get labeled input-output pairs, an RL agent must discover which actions yield the most reward by trying them. Humans and animals do this constantly—learning from the consequences of their actions without explicit instruction.
The field traces back to Richard Bellman's dynamic programming in the 1950s. Over the following decades, ideas from control theory, operations research, and psychology converged into the modern RL framework. In the last ten years, RL has produced a series of striking successes.
Recent Successes
Several landmark results illustrate RL's power and breadth:
- AlphaGo Zero — DeepMind's system achieved superhuman performance on the board game Go, a domain with roughly $10^{170}$ possible board positions, far beyond brute-force search. The key ingredient was large-scale RL combined with deep neural networks, learning entirely from self-play without human data (Silver et al., Nature 2017).
- DeepSeek-R1-Zero — A large language model trained via large-scale reinforcement learning without supervised fine-tuning as a preliminary step, demonstrating remarkable reasoning capabilities and showing that RL alone can unlock sophisticated chain-of-thought behavior.
- Plasma control for fusion — RL was used to control the plasma configuration inside a tokamak reactor, a critical step toward practical nuclear fusion energy (Degrave et al., Nature 2022).
- COVID-19 border testing — An RL-based policy enabled efficient and targeted testing of travelers, allocating scarce testing resources to maximize detection rates (Bastani et al., Nature 2021).
- ChatGPT and OpenAI o1 — Reinforcement learning from human feedback (RLHF) is a central technique behind aligning large language models to human preferences. OpenAI's o1 model further uses RL to teach the model "how to think productively" via chain-of-thought reasoning.
- Mathematical Olympiad — Google DeepMind achieved a gold-medal score (35 out of 42 points) at the International Mathematical Olympiad, with solutions described as "astonishing" by the IMO president.
The Four Pillars of Reinforcement Learning
What distinguishes RL from other branches of machine learning? Four fundamental challenges arise in virtually every RL problem:
Optimization
The goal of RL is to find an optimal (or near-optimal) way to make decisions. This requires an explicit objective. A simple analogy: finding the shortest route between two cities in a road network. In RL the objective is typically to maximize cumulative reward. The "Reward is Enough" hypothesis (Silver, Singh, Precup, and Sutton) posits that maximizing reward is a sufficiently generic objective to drive behavior exhibiting most, if not all, abilities studied in natural and artificial intelligence.
Delayed Consequences
Actions taken now can have impacts far into the future. Saving money for retirement, or finding a key early in a video game like Montezuma's Revenge to unlock a door much later, are both instances of delayed consequences. This introduces two intertwined challenges:
- Planning: Decisions must account not just for immediate benefit but also for long-term ramifications.
- Temporal credit assignment: When learning from experience, it is difficult to determine which past decisions were responsible for later success or failure.
Exploration
An RL agent learns about the world by making decisions—acting as a scientist running experiments. When you learn to ride a bicycle, you must try (and fall) many times. Decisions influence what you learn: you only observe the reward for the action you actually took, never for the alternatives you did not try. If you choose to attend Stanford instead of MIT, you experience only the Stanford trajectory. This creates a fundamental tension between exploration (trying new things to gather information) and exploitation (choosing actions known to yield high reward).
Generalization
A policy is a mapping from past experience to action. In principle one could pre-program a lookup table covering every possible situation, but in practice the state space is far too large. The agent must generalize from limited experience to new, previously unseen states—exactly the kind of challenge that deep learning excels at, which is why the combination of RL and deep neural networks has been so fruitful.
The Sequential Decision-Making Framework
RL formalizes decision making as a discrete-time loop between an agent and an environment. The goal is to select actions that maximize total expected future reward, balancing immediate and long-term payoffs. This framework applies broadly—web advertising, robot manipulation, medical treatment planning, and more.
The Agent-Environment Loop
At each discrete time step $t$:
- The agent selects an action $a_t$.
- The environment transitions to a new state based on $a_t$, and emits an observation $o_t$ and a scalar reward $r_t$.
- The agent receives $o_t$ and $r_t$, and the cycle repeats.
History and State
The history at time $t$ is the complete sequence of past interactions:
$$h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$$

The agent chooses its next action based on the history, but maintaining the entire history quickly becomes intractable. Instead, we define a state as a compact summary of the history sufficient to predict what happens next:
$$s_t = f(h_t)$$

The choice of state representation has profound implications for computational complexity, the amount of data required for learning, and the quality of the resulting policy.
The Markov Property
A state $s_t$ is Markov if the future is independent of the past given the present:

$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$

In other words, the state captures everything in the history that matters for predicting what happens next. The Markov assumption is mathematically simple and can often be satisfied by including a small amount of recent history in the state. In practice, a common choice is $s_t = o_t$ (the most recent observation). Different state representations lead to different trade-offs in computational cost, data efficiency, and performance.
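One common way to approximate the Markov property when a single observation is not enough is to build the state from the $k$ most recent observations, in the spirit of frame stacking for Atari agents. A minimal sketch (the helper names here are hypothetical, not from the lecture):

```python
from collections import deque

def make_state_fn(k):
    """Return a function mapping each new observation to a state
    made of the k most recent observations."""
    window = deque(maxlen=k)  # oldest observation is dropped automatically

    def update(obs):
        window.append(obs)
        return tuple(window)  # tuple, so the state is hashable

    return update

# Example: state = last 2 observations
to_state = make_state_fn(2)
states = [to_state(o) for o in ["a", "b", "c"]]
# states == [("a",), ("a", "b"), ("b", "c")]
```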
Types of Sequential Decision Processes
Sequential decision problems vary along several dimensions:
- Full vs. partial observability: Can the agent observe the true state, or only a noisy projection of it? Partially observable settings lead to POMDPs.
- Deterministic vs. stochastic dynamics: Does the same action from the same state always lead to the same outcome, or is there randomness?
- Bandits vs. full RL: Do actions influence only the immediate reward (the bandit setting), or do they also affect the next state and hence future rewards?
Components of a Markov Decision Process
When we combine the Markov assumption with actions and rewards, we arrive at the Markov Decision Process (MDP), the central mathematical framework of this course. An RL algorithm often includes one or more of three core components: a model, a policy, and a value function.
The Model
The model is the agent's internal representation of how the world works. It consists of two parts:
- Transition (dynamics) model: Predicts the next state given the current state and action: $$p(s_{t+1} = s' \mid s_t = s,\; a_t = a)$$
- Reward model: Predicts the expected immediate reward: $$r(s, a) = \mathbb{E}[r_t \mid s_t = s,\; a_t = a]$$
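In a tabular setting, both parts of the model can be stored directly as arrays. A minimal sketch for a made-up 3-state, 2-action world (all numbers here are invented for illustration):

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[a, s, s'] = probability of moving to s' after taking action a in state s
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0],   # action 0: mostly stay, small drift right
        [0.1, 0.8, 0.1],
        [0.0, 0.1, 0.9]]
P[1] = [[0.2, 0.8, 0.0],   # action 1: push right
        [0.0, 0.2, 0.8],
        [0.0, 0.0, 1.0]]

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])  # reward only in the rightmost state

assert np.allclose(P.sum(axis=2), 1.0)  # each row is a distribution

def sample_next_state(s, a):
    """Draw s' ~ p(. | s, a) from the transition model."""
    return int(rng.choice(n_states, p=P[a, s]))
```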
The Policy
A policy $\pi$ determines how the agent chooses actions. It can take two forms:
- Deterministic: $\pi(s) = a$, mapping each state to a single action.
- Stochastic: $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$, assigning a probability distribution over actions for each state.
For example, a Mars Rover policy might be $\pi(s_1) = \pi(s_2) = \cdots = \pi(s_7) = \text{TryRight}$. Since every state maps to a single action, this is a deterministic policy.
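Both forms of policy are easy to write down for a small finite state space. A minimal sketch, with the state and action encodings chosen arbitrarily for illustration:

```python
import numpy as np

n_states, n_actions = 7, 2
TRY_LEFT, TRY_RIGHT = 0, 1
rng = np.random.default_rng(0)

# Deterministic policy: one action per state
# (every state maps to TryRight, as in the Mars Rover example)
pi_det = np.full(n_states, TRY_RIGHT)

def act_det(s):
    return int(pi_det[s])

# Stochastic policy: pi[s, a] = Pr(a_t = a | s_t = s)
pi_stoch = np.full((n_states, n_actions), 0.5)  # uniform over both actions

def act_stoch(s):
    return int(rng.choice(n_actions, p=pi_stoch[s]))
```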
The Value Function
The value function $V^\pi$ quantifies how good it is to be in a given state under a particular policy $\pi$. It is defined as the expected discounted sum of future rewards:
$$V^\pi(s) = \mathbb{E}_\pi\!\left[r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \gamma^3\, r_{t+3} + \cdots \;\middle|\; s_t = s\right]$$

The discount factor $\gamma$ weighs immediate versus future rewards. The value function can be used to compare policies: if $V^{\pi_1}(s) > V^{\pi_2}(s)$ for all states $s$, then $\pi_1$ is strictly better than $\pi_2$.
Evaluation and Control
Two fundamental tasks arise repeatedly in RL:
- Evaluation (prediction): Given a fixed policy $\pi$, estimate the expected rewards from following it—that is, compute $V^\pi$.
- Control (optimization): Find the policy $\pi^*$ that maximizes the value function across all states.
Types of RL Agents
RL agents can be categorized along two axes based on which components they explicitly maintain:
- Model-based agents maintain an explicit model of the environment (transition and reward functions). They may or may not also maintain a policy or value function. Because they can simulate future trajectories using the model, they can plan ahead.
- Model-free agents have no explicit model. Instead, they rely directly on a learned value function, a learned policy, or both. They learn entirely from sampled experience without trying to reconstruct the environment dynamics.
Orthogonally, agents can be value-based (deriving actions from a learned value function), policy-based (directly optimizing a parameterized policy), or actor-critic (combining both). This taxonomy determines which algorithms apply and what trade-offs they entail.
Markov Processes and Markov Reward Processes
Before tackling the full MDP, we build up in complexity. We begin with Markov processes (Markov chains) and then add rewards to obtain Markov Reward Processes (MRPs).
Markov Process (Markov Chain)
A Markov process is a memoryless stochastic process: a sequence of random states satisfying the Markov property, with no rewards and no actions. It is defined by:
- $\mathcal{S}$: a finite set of states ($s \in \mathcal{S}$)
- $P$: a transition model specifying $P(s_{t+1} = s' \mid s_t = s)$
When the number of states $N$ is finite, we can represent $P$ as an $N \times N$ transition matrix where entry $(i, j)$ gives $P(s_j \mid s_i)$:
$$P = \begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{pmatrix}$$

Each row sums to 1. Given a starting state, we can sample episodes—trajectories of states generated by repeatedly applying the transition probabilities. For example, starting from $s_4$ in the Mars Rover chain, possible episodes include $s_4, s_5, s_6, s_7, s_7, \ldots$ or $s_4, s_3, s_2, s_1, \ldots$
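Sampling episodes from a transition matrix takes only a few lines of code. A sketch using an illustrative 7-state chain (the probabilities here are invented for the example, not the actual Mars Rover numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 7-state chain: drift left/right with prob 0.4 each, stay with 0.2;
# probability mass piles up at the two end states
N = 7
P = np.zeros((N, N))
for s in range(N):
    P[s, max(s - 1, 0)] += 0.4
    P[s, s] += 0.2
    P[s, min(s + 1, N - 1)] += 0.4

assert np.allclose(P.sum(axis=1), 1.0)  # each row is a distribution

def sample_episode(P, s0, length):
    """Roll out a trajectory by repeatedly sampling s' ~ P[s]."""
    traj = [s0]
    for _ in range(length - 1):
        traj.append(int(rng.choice(len(P), p=P[traj[-1]])))
    return traj

print(sample_episode(P, s0=3, length=8))  # a length-8 trajectory starting at state 3
```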
Markov Reward Process (MRP)
A Markov Reward Process extends a Markov chain by adding rewards—but still has no actions.
- $\mathcal{S}$: a finite set of states
- $P$: transition model, $P(s_{t+1} = s' \mid s_t = s)$
- $R$: reward function, $R(s) = \mathbb{E}[r_t \mid s_t = s]$
- $\gamma \in [0, 1]$: discount factor
Return and Value Function
The horizon $H$ is the number of time steps in each episode. It can be finite or infinite. The return $G_t$ from time step $t$ is the discounted sum of rewards up to the horizon:
$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots + \gamma^{H-1}\, r_{t+H-1}$$

The state value function for an MRP is the expected return starting from state $s$:
$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\!\left[r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots + \gamma^{H-1}\, r_{t+H-1} \;\middle|\; s_t = s\right]$$

The Discount Factor
The discount factor $\gamma$ controls the trade-off between short-term and long-term reward:
- $\gamma = 0$: The agent is purely myopic—it cares only about the immediate reward $r_t$. In this case $V(s) = R(s)$.
- $\gamma = 1$: Future rewards are weighted equally with immediate ones. This is only well-defined when episodes are guaranteed to terminate ($H < \infty$).
- $0 < \gamma < 1$: Intermediate values create a preference for sooner rewards while still accounting for the future. This is also mathematically convenient because it ensures that the infinite sum converges even when $H = \infty$.
Empirically, humans and animals often behave as if they use a discount factor strictly less than 1, preferring immediate gratification over delayed rewards.
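The return $G_t$ can be computed from a recorded reward sequence with a single backward pass, using the recursion $G_t = r_t + \gamma\, G_{t+1}$:

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... via a backward pass."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# gamma = 0: purely myopic, only the immediate reward counts
assert discounted_return([1.0, 5.0, 5.0], gamma=0.0) == 1.0
# gamma = 0.5: 1 + 0.5*2 + 0.25*4 = 3.0
assert discounted_return([1.0, 2.0, 4.0], gamma=0.5) == 3.0
```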
The Bellman Equation for MRPs
The Markov property gives the value function a recursive structure: the value of any state decomposes into the immediate reward plus the discounted value of successor states:

$$V(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s)\, V(s')$$

This equation is the cornerstone of nearly all RL algorithms. It expresses a consistency condition: the value of a state must equal the immediate reward plus the expected discounted value of the next state.
Matrix Form and Analytic Solution
For a finite-state MRP with $N$ states, the Bellman equation can be written in matrix form. Let $V$ and $R$ be $N$-dimensional column vectors and $P$ the $N \times N$ transition matrix:
$$V = R + \gamma\, P\, V$$

Rearranging algebraically:
$$V - \gamma\, P\, V = R \quad \Longrightarrow \quad (I - \gamma\, P)\, V = R \quad \Longrightarrow \quad V = (I - \gamma\, P)^{-1}\, R$$

The matrix $(I - \gamma P)$ is always invertible when $\gamma < 1$: the eigenvalues of a stochastic matrix $P$ have magnitude at most 1, so the spectral radius of $\gamma P$ is at most $\gamma < 1$ and $I - \gamma P$ has no zero eigenvalue. The direct solution costs $O(N^3)$—feasible for small state spaces but prohibitive for large ones.
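In code, the analytic solution is a single linear solve; `np.linalg.solve` is preferable to forming the inverse explicitly. A sketch on a made-up 3-state MRP (the numbers are invented for illustration):

```python
import numpy as np

# Tiny illustrative 3-state MRP: state 2 is absorbing and pays reward 1 per step
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.4, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Solve (I - gamma P) V = R directly, without forming the matrix inverse
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Sanity check: V satisfies the Bellman equation V = R + gamma P V
assert np.allclose(V, R + gamma * P @ V)
```

For the absorbing state the geometric series gives $V(s_3) = 1/(1-\gamma) = 10$, which the linear solve reproduces.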
Iterative Computation via Dynamic Programming
For larger state spaces, we can compute $V$ iteratively rather than inverting a matrix:
- Initialize $V_0(s) = 0$ for all $s \in \mathcal{S}$.
- For $k = 1, 2, \ldots$ until convergence:
- For all $s \in \mathcal{S}$: $$V_k(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s)\, V_{k-1}(s')$$
Each iteration has complexity $O(|\mathcal{S}|^2)$, since updating one state requires summing over all possible successor states. The iterates $V_k$ converge to the true value function $V$ as $k \to \infty$.
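The iteration above can be vectorized as one matrix-vector product per sweep. A minimal sketch on an illustrative 3-state MRP (numbers invented for the example):

```python
import numpy as np

def evaluate_mrp(P, R, gamma, tol=1e-8):
    """Iterate V_k = R + gamma P V_{k-1} until successive iterates agree."""
    V = np.zeros(len(R))            # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V   # one Bellman backup over all states: O(N^2)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Illustrative 3-state MRP: state 2 is absorbing and pays reward 1 per step
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.4, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])
V = evaluate_mrp(P, R, gamma=0.9)

# The iterates converge to the analytic solution V = (I - gamma P)^{-1} R
assert np.allclose(V, np.linalg.solve(np.eye(3) - 0.9 * P, R), atol=1e-6)
```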
Looking Ahead
This lecture established the foundations of reinforcement learning: the agent-environment loop, the Markov property, Markov chains, and Markov Reward Processes. The Bellman equation for MRPs gives us both an analytic solution and an iterative algorithm for computing state values.
Next, we add actions, graduating from MRPs to full Markov Decision Processes (MDPs). We will introduce the Bellman equations for MDPs, define optimal value functions $V^*$ and $Q^*$, and study planning algorithms—policy evaluation, policy iteration, and value iteration—that find optimal policies when the model is known.