Lecture 1

Introduction to Reinforcement Learning

Learning through experience to make good sequential decisions under uncertainty, from core concepts to Markov Reward Processes.


What Is Reinforcement Learning?

Reinforcement learning (RL) is about learning from experience to make good decisions under uncertainty. Unlike supervised learning, where we get labeled input-output pairs, an RL agent must discover which actions yield the most reward by trying them. Humans and animals do this constantly—learning from the consequences of their actions without explicit instruction.

The field traces back to Richard Bellman's dynamic programming in the 1950s. Over the following decades, ideas from control theory, operations research, and psychology converged into the modern RL framework. In the last ten years, RL has produced a series of striking successes.

Definition
Reinforcement Learning. A computational framework in which an agent learns to make sequential decisions by interacting with an environment, receiving reward signals, and adjusting its behavior to maximize cumulative reward over time.

Recent Successes

Several landmark results illustrate RL's power and breadth: deep Q-networks reaching human-level play on Atari games directly from pixels, AlphaGo defeating a world champion at Go, learned control of robotic manipulation, and reinforcement learning from human feedback (RLHF) used to fine-tune large language models.

Key Insight
RL is particularly powerful in two settings: (1) when no examples of desired behavior exist—for instance because the goal is to surpass human performance—and (2) when the problem involves an enormous search space with delayed outcomes, making hand-designed solutions infeasible.

The Four Pillars of Reinforcement Learning

What distinguishes RL from other branches of machine learning? Four fundamental challenges arise in virtually every RL problem:

Optimization

The goal of RL is to find an optimal (or near-optimal) way to make decisions. This requires an explicit objective. A simple analogy: finding the shortest route between two cities in a road network. In RL the objective is typically to maximize cumulative reward. The "Reward is Enough" hypothesis (Silver, Singh, Precup, and Sutton) posits that maximizing reward is a sufficiently generic objective to drive behavior exhibiting most, if not all, abilities studied in natural and artificial intelligence.

Delayed Consequences

Actions taken now can have impacts far into the future. Saving money for retirement, or finding a key early in a video game like Montezuma's Revenge to unlock a door much later, are both instances of delayed consequences. This introduces two intertwined challenges:

  • When planning, decisions must weigh not only their immediate reward but also their long-term ramifications.
  • When learning, temporal credit assignment is hard: which of the many past actions caused the reward observed now?

Exploration

An RL agent learns about the world by making decisions—acting as a scientist running experiments. When you learn to ride a bicycle, you must try (and fall) many times. Decisions influence what you learn: you only observe the reward for the action you actually took, never for the alternatives you did not try. If you choose to attend Stanford instead of MIT, you experience only the Stanford trajectory. This creates a fundamental tension between exploration (trying new things to gather information) and exploitation (choosing actions known to yield high reward).

Generalization

A policy is a mapping from past experience to action. In principle one could pre-program a lookup table covering every possible situation, but in practice the state space is far too large. The agent must generalize from limited experience to new, previously unseen states—exactly the kind of challenge that deep learning excels at, which is why the combination of RL and deep neural networks has been so fruitful.

The Sequential Decision-Making Framework

RL formalizes decision making as a discrete-time loop between an agent and an environment. The goal is to select actions that maximize total expected future reward, balancing immediate and long-term payoffs. This framework applies broadly—web advertising, robot manipulation, medical treatment planning, and more.

The Agent-Environment Loop

At each discrete time step $t$:

  1. The agent selects an action $a_t$.
  2. The environment transitions to a new state based on $a_t$, and emits an observation $o_t$ and a scalar reward $r_t$.
  3. The agent receives $o_t$ and $r_t$, and the cycle repeats.
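As a concrete sketch, the loop above is only a few lines of Python. The `CoinFlipEnv` environment and its `reset`/`step` interface below are illustrative assumptions, not part of the lecture:

```python
import random

class CoinFlipEnv:
    """Toy environment: reward +1 if the action matches a hidden coin flip."""
    def reset(self):
        return 0  # initial observation

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1 if action == coin else 0
        return coin, reward  # next observation o_t, reward r_t

def run_episode(env, policy, horizon=10):
    """The agent-environment loop: act, observe, repeat."""
    obs = env.reset()
    history = []
    for _ in range(horizon):
        action = policy(obs)            # agent selects a_t
        obs, reward = env.step(action)  # environment emits o_t, r_t
        history.append((action, obs, reward))
    return history

# A policy that ignores observations and guesses at random.
episode = run_episode(CoinFlipEnv(), policy=lambda obs: random.randint(0, 1))
```

The returned `history` is exactly the sequence $(a_1, o_1, r_1, \ldots)$ discussed in the next subsection.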

History and State

The history at time $t$ is the complete sequence of past interactions:

$$h_t = (a_1, o_1, r_1, \ldots, a_t, o_t, r_t)$$

The agent chooses its next action based on the history, but maintaining the entire history quickly becomes intractable. Instead, we define a state as a compact summary of the history sufficient to predict what happens next:

$$s_t = f(h_t)$$

The choice of state representation has profound implications for computational complexity, the amount of data required for learning, and the quality of the resulting policy.

The Markov Property

Definition
A state $s_t$ is Markov if and only if it is a sufficient statistic of the history for predicting the future: $$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$ In words: the future is independent of the past, given the present state.

The Markov assumption is mathematically simple and can often be satisfied by including a small amount of recent history in the state. In practice, a common choice is $s_t = o_t$ (the most recent observation). Different state representations lead to different trade-offs in computational cost, data efficiency, and performance.

Types of Sequential Decision Processes

Sequential decision problems vary along several dimensions:

  • Bandits: actions influence only the immediate reward, not what happens next; each decision is effectively independent.
  • MDPs and POMDPs: actions also influence the state of the world—fully observed in an MDP, only partially observed in a POMDP.
  • How the world changes: deterministically (one outcome per action) or stochastically (a distribution over outcomes).

Components of a Markov Decision Process

When we combine the Markov assumption with actions and rewards, we arrive at the Markov Decision Process (MDP), the central mathematical framework of this course. An RL algorithm often includes one or more of three core components: a model, a policy, and a value function.

The Model

The model is the agent's internal representation of how the world works. It consists of two parts:

  • A transition (dynamics) model that predicts the next state: $p(s_{t+1} = s' \mid s_t = s, a_t = a)$.
  • A reward model that predicts the immediate reward: $R(s, a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$.

Example
Mars Rover. Consider a rover that can occupy one of seven locations $s_1, \ldots, s_7$ along a line. It has two actions: TryLeft and TryRight. The transitions are stochastic—for instance, $P(s_1 \mid s_1, \text{TryRight}) = 0.5$ and $P(s_2 \mid s_1, \text{TryRight}) = 0.5$. Rewards are $+1$ in state $s_1$, $+10$ in state $s_7$, and $0$ everywhere else. Note that the agent's model may be wrong—it is only an approximation of the true environment dynamics.
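The rover's dynamics can be written out as one transition matrix per action. The slide specifies only the transitions from $s_1$; the code below assumes, purely for illustration, that every attempted move succeeds with probability 0.5 and otherwise leaves the rover in place:

```python
import numpy as np

N = 7
R = np.zeros(N)
R[0], R[6] = 1.0, 10.0  # rewards: +1 in s1, +10 in s7, 0 elsewhere

def transition_matrix(direction):
    """P[i, j] = P(s_j | s_i, action). Assumption: the move succeeds with
    prob 0.5, otherwise the rover stays put (the slide gives only s1's row)."""
    P = np.zeros((N, N))
    for i in range(N):
        j = min(max(i + direction, 0), N - 1)  # clip at the ends of the line
        P[i, i] += 0.5
        P[i, j] += 0.5
    return P

P_right = transition_matrix(+1)  # TryRight
P_left = transition_matrix(-1)   # TryLeft
```

Under these assumptions `P_right[0, 0]` and `P_right[0, 1]` are both 0.5, matching the $s_1$ transitions given in the example.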

The Policy

A policy $\pi$ determines how the agent chooses actions. It can take two forms:

  • Deterministic: $\pi(s) = a$, mapping each state to a single action.
  • Stochastic: $\pi(a \mid s)$, a probability distribution over actions in each state.

For example, a Mars Rover policy might be $\pi(s_1) = \pi(s_2) = \cdots = \pi(s_7) = \text{TryRight}$. Since every state maps to a single action, this is a deterministic policy.
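In code, a deterministic policy is just a lookup table, while a stochastic policy stores a distribution per state. A minimal sketch (the integer state labels and the 50/50 mix are illustrative):

```python
import random

STATES = range(1, 8)  # s1 ... s7, as in the Mars Rover example

# Deterministic policy: every state maps to a single action.
pi_det = {s: "TryRight" for s in STATES}

# Stochastic policy (hypothetical): a distribution over actions per state.
pi_stoch = {s: {"TryLeft": 0.5, "TryRight": 0.5} for s in STATES}

def sample_action(pi, s):
    """Draw an action from a stochastic policy's distribution at state s."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

a = sample_action(pi_stoch, 4)
```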

The Value Function

The value function $V^\pi$ quantifies how good it is to be in a given state under a particular policy $\pi$. It is defined as the expected discounted sum of future rewards:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \gamma^3\, r_{t+3} + \cdots \;\middle|\; s_t = s\right]$$

The discount factor $\gamma$ weighs immediate versus future rewards. The value function can be used to compare policies: if $V^{\pi_1}(s) > V^{\pi_2}(s)$ for all states $s$, then $\pi_1$ is strictly better than $\pi_2$.

Evaluation and Control

Two fundamental tasks arise repeatedly in RL:

  • Evaluation: estimate the expected reward a given, fixed policy will accumulate.
  • Control: find the best policy—an optimization over policies, which typically uses evaluation as a subroutine.

Types of RL Agents

RL agents can be categorized along two axes based on which components they explicitly maintain:

  • Model-based agents maintain an explicit model of the environment's dynamics and rewards; they may or may not also maintain a policy or value function.
  • Model-free agents maintain no model, learning a value function and/or policy directly from experience.

Orthogonally, agents can be value-based (deriving actions from a learned value function), policy-based (directly optimizing a parameterized policy), or actor-critic (combining both). This taxonomy determines which algorithms apply and what trade-offs they entail.

Markov Processes and Markov Reward Processes

Before tackling the full MDP, we build up in complexity. We begin with Markov processes (Markov chains) and then add rewards to obtain Markov Reward Processes (MRPs).

Markov Process (Markov Chain)

Definition
A Markov Process is a memoryless random process—a sequence of random states satisfying the Markov property. It is defined by:
  • $\mathcal{S}$: a finite set of states ($s \in \mathcal{S}$)
  • $P$: a transition model specifying $P(s_{t+1} = s' \mid s_t = s)$
There are no rewards and no actions.

When the number of states $N$ is finite, we can represent $P$ as an $N \times N$ transition matrix where entry $(i, j)$ gives $P(s_j \mid s_i)$:

$$P = \begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{pmatrix}$$

Each row sums to 1. Given a starting state, we can sample episodes—trajectories of states generated by repeatedly applying the transition probabilities. For example, starting from $s_4$ in the Mars Rover chain, possible episodes include $s_4, s_5, s_6, s_7, s_7, \ldots$ or $s_4, s_3, s_2, s_1, \ldots$
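Sampling an episode is just a loop over the rows of $P$. The chain below is an illustrative seven-state random walk with absorbing endpoints—an assumed example, not the rover's actual dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(P, start, length):
    """Sample a state trajectory by repeatedly drawing s' ~ P(. | s)."""
    states = [start]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

# Illustrative 7-state chain: from interior states move left/right with
# prob 0.5 each; the endpoint states are absorbing.
N = 7
P = np.zeros((N, N))
P[0, 0] = P[N - 1, N - 1] = 1.0
for i in range(1, N - 1):
    P[i, i - 1] = P[i, i + 1] = 0.5

episode = sample_episode(P, start=3, length=10)  # e.g. starting from s_4
```

Each run produces a different trajectory, exactly like the example episodes listed above.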

Markov Reward Process (MRP)

A Markov Reward Process extends a Markov chain by adding rewards—but still has no actions.

Definition
A Markov Reward Process (MRP) is a tuple $(\mathcal{S}, P, R, \gamma)$ where:
  • $\mathcal{S}$: a finite set of states
  • $P$: transition model, $P(s_{t+1} = s' \mid s_t = s)$
  • $R$: reward function, $R(s) = \mathbb{E}[r_t \mid s_t = s]$
  • $\gamma \in [0, 1]$: discount factor
With $N$ states, $R$ can be expressed as an $N$-dimensional vector.

Return and Value Function

The horizon $H$ is the number of time steps in each episode. It can be finite or infinite. The return $G_t$ from time step $t$ is the discounted sum of rewards up to the horizon:

$$G_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots + \gamma^{H-1}\, r_{t+H-1}$$

The state value function for an MRP is the expected return starting from state $s$:

$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}\!\left[r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots + \gamma^{H-1}\, r_{t+H-1} \;\middle|\; s_t = s\right]$$
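Because $V(s)$ is an expectation over sampled returns, one simple way to estimate it is Monte Carlo: roll out many episodes and average their discounted returns. A sketch on a toy two-state chain (an assumed example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_value(P, R, gamma, start, horizon, n_episodes=200):
    """Estimate V(start) by averaging sampled discounted returns G_t."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]            # accumulate gamma^k * r_{t+k}
            discount *= gamma
            s = rng.choice(len(P), p=P[s])  # draw s' ~ P(. | s)
        total += g
    return total / n_episodes

# Toy chain: the two states alternate deterministically; reward 1 in state 0.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([1.0, 0.0])
v0 = mc_value(P, R, gamma=0.5, start=0, horizon=50)
```

For this chain the return from state 0 is the geometric series $1 + \gamma^2 + \gamma^4 + \cdots$, so with $\gamma = 0.5$ the estimate approaches $1/(1 - 0.25) = 4/3$.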

The Discount Factor

The discount factor $\gamma$ controls the trade-off between short-term and long-term reward:

  • $\gamma = 0$: only the immediate reward matters.
  • $\gamma = 1$: future rewards count as much as immediate ones.
  • $\gamma < 1$ is also mathematically convenient: it keeps infinite-horizon returns finite.

Empirically, humans and animals often behave as if they use a discount factor strictly less than 1, preferring immediate gratification over delayed rewards.

The Bellman Equation for MRPs

The Markov property gives the value function a recursive structure: the value of any state decomposes into the immediate reward plus the discounted value of successor states.

Theorem
Bellman Equation for MRPs. For any state $s$ in a Markov Reward Process: $$V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \gamma \underbrace{\sum_{s' \in \mathcal{S}} P(s' \mid s)\, V(s')}_{\text{discounted future value}}$$

This equation is the cornerstone of nearly all RL algorithms. It expresses a consistency condition: the value of a state must equal the immediate reward plus the expected discounted value of the next state.

Matrix Form and Analytic Solution

For a finite-state MRP with $N$ states, the Bellman equation can be written in matrix form. Let $V$ and $R$ be $N$-dimensional column vectors and $P$ the $N \times N$ transition matrix:

$$V = R + \gamma\, P\, V$$

Rearranging algebraically:

$$V - \gamma\, P\, V = R \quad \Longrightarrow \quad (I - \gamma\, P)\, V = R \quad \Longrightarrow \quad V = (I - \gamma\, P)^{-1}\, R$$

The matrix $(I - \gamma P)$ is always invertible when $\gamma < 1$ (via spectral radius arguments). The direct solution costs $O(N^3)$—feasible for small state spaces but prohibitive for large ones.
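The analytic solution is a one-line linear solve. A sketch on a toy two-state chain (an assumed example chosen so the answer can be checked by hand):

```python
import numpy as np

def mrp_value_exact(P, R, gamma):
    """Solve the Bellman equation in matrix form: (I - gamma P) V = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Toy chain: states alternate deterministically; reward 1 in state 0.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([1.0, 0.0])
V = mrp_value_exact(P, R, gamma=0.5)  # analytic: V = [4/3, 2/3]
```

By hand: $V_0 = 1 + 0.5\,V_1$ and $V_1 = 0.5\,V_0$, giving $V_0 = 4/3$ and $V_1 = 2/3$, which the solver reproduces.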

Iterative Computation via Dynamic Programming

For larger state spaces, we can compute $V$ iteratively rather than inverting a matrix:

Dynamic Programming for MRP Value Computation
  1. Initialize $V_0(s) = 0$ for all $s \in \mathcal{S}$.
  2. For $k = 1, 2, \ldots$ until convergence:
  3. For all $s \in \mathcal{S}$: $$V_k(s) = R(s) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s)\, V_{k-1}(s')$$

Each iteration has complexity $O(|\mathcal{S}|^2)$, since updating one state requires summing over all possible successor states. The iterates $V_k$ converge to the true value function $V$ as $k \to \infty$.
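The iteration above is a few lines of NumPy: the matrix-vector product `P @ V` performs the sum over successors for every state at once. The two-state chain is the same illustrative example used for the analytic solution:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-12):
    """Repeat the Bellman backup V <- R + gamma P V until it stops changing."""
    V = np.zeros(len(R))  # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * (P @ V)  # O(N^2): each state sums over successors
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy chain: states alternate deterministically; reward 1 in state 0.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
R = np.array([1.0, 0.0])
V = mrp_value_iterative(P, R, gamma=0.5)
```

With $\gamma < 1$ the backup is a contraction, so the iterates approach the same answer as the direct solve, here $V = [4/3,\, 2/3]$.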

Key Insight
The Bellman equation transforms a potentially intractable problem—evaluating an infinite sum of future rewards—into a set of simultaneous linear equations (for MRPs) or a fixed-point iteration. This recursive decomposition is the single most important structural property exploited by RL algorithms.

Looking Ahead

This lecture established the foundations of reinforcement learning: the agent-environment loop, the Markov property, Markov chains, and Markov Reward Processes. The Bellman equation for MRPs gives us both an analytic solution and an iterative algorithm for computing state values.

Next, we add actions, graduating from MRPs to full Markov Decision Processes (MDPs). We will introduce the Bellman equations for MDPs, define optimal value functions $V^*$ and $Q^*$, and study planning algorithms—policy evaluation, policy iteration, and value iteration—that find optimal policies when the model is known.