Monte Carlo and Temporal-Difference Prediction

Abstract. Dynamic programming evaluated policies by sweeping over a known environment model. Monte Carlo and temporal-difference methods remove that assumption: the agent estimates values from sampled experience. We stay on the evaluation side of the story: the policy is held fixed while we learn what return its states tend to produce.

Dynamic programming is exact and clean, but it assumes access to $p(s', r \mid s,a)$ . In many RL problems the model is unavailable. The agent can act, observe $S_t$ , $A_t$ , $R_{t+1}$ , and $S_{t+1}$ , but it cannot enumerate every possible next state or average over the transition distribution.

This chapter asks a narrower question: if a policy $\pi$ is fixed, how can we estimate its state-value function $v_\pi(s)$ from sampled episodes or transitions? Control methods such as Sarsa and Q-learning come next. Here we only do model-free prediction.

Leaving Dynamic Programming Behind

For a fixed policy, dynamic programming used the Bellman expectation update:

v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \left[r + \gamma v_k(s')\right]

The inner sum, $\sum_{s',r} p(s',r \mid s,a)[r + \gamma v_k(s')]$ , is where dynamic programming uses its strongest assumption. To compute it, the agent must know the full one-step model: which next states can follow from $(s,a)$ , which rewards can appear, and how likely each $(s', r)$ pair is. In a real learning problem, this information is usually unavailable. The agent does not get to inspect the transition distribution from the outside; it only gets to interact with the environment.

Model-free prediction keeps the Bellman idea but changes how the expectation is accessed. Instead of averaging over all possible next states and rewards explicitly, the environment is treated as a sampler. The agent takes an action, observes one reward and one next state, and uses that single transition as evidence about the expectation dynamic programming would have computed from the model. One sample is noisy, but many samples can gradually pull the value estimate toward the same object: the expected return under the fixed policy.

The two basic model-free prediction methods are Monte Carlo learning and temporal-difference learning; both estimate values from experience, but they use different pieces of that experience as their learning target.

Monte Carlo Prediction

Monte Carlo prediction estimates $v_\pi(s)$ by averaging returns observed after visits to $s$ . Here the expectation is the value function,

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s],

and the samples are complete returns generated by following the policy $\pi$ in the environment. Suppose an episode visits state $s$ at time $t$ . The return from that visit is the total discounted reward observed after that time:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots

This $G_t$ is one sample of the random return whose expectation defines $v_\pi(s)$ . One episode gives a noisy sample; many episodes give many such samples. Monte Carlo prediction stores the returns observed after visits to each state and uses their sample average as the value estimate:

V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s)

The important point is that this update does not use the transition probabilities $p(s',r \mid s,a)$ . It does not ask what could have happened from $s$ . It waits to see what actually happened after visiting $s$ , then averages those observed outcomes. In this sense MC prediction is model-free, but it is not online in the one-step sense: ordinary MC must wait until the episode terminates before the return $G_t$ is known.

There are two standard variants:

First-visit MC: in each episode, update a state only from the first time it appears.
Every-visit MC: update from every occurrence of the state in the episode.

For first-visit MC, each episode contributes at most one return sample per state. If state $s$ appears several times in the same episode, only the return after the first occurrence is added to the average for $s$ . Every-visit MC uses all occurrences instead. Both are based on complete returns; the distinction is only which visits are allowed to contribute samples.

A first-visit Monte Carlo prediction procedure looks like this:

Input: policy pi to evaluate
Initialize V(s) arbitrarily for all states s
Initialize Returns(s) as an empty list for all states s

Repeat forever, or for a chosen number of episodes:
    Generate an episode following pi:
        S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T

    G <- 0
    For t = T-1, T-2, ..., 0:
        G <- gamma * G + R_{t+1}
        If S_t does not appear in S_0, S_1, ..., S_{t-1}:
            Append G to Returns(S_t)
            V(S_t) <- average(Returns(S_t))

Because MC uses the actual sampled return rather than a learned estimate of the future, first-visit MC gives an unbiased sample-based estimate in the usual episodic setting. Its weakness is variance: the return includes every random reward from the visit until termination, so a single unusual episode can push the estimate far from its typical value. This high variance is the price MC pays for not bootstrapping.

Temporal-Difference Prediction

Temporal-difference prediction keeps Monte Carlo's model-free setting but changes when the update happens. Monte Carlo waits until the end of an episode because it needs the full return $G_t$ . TD(0), the simplest temporal-difference method, updates as soon as the next reward and next state are observed.

The reason this is possible is that TD(0) uses a one-step target. After the transition $S_t, A_t, R_{t+1}, S_{t+1}$ , it treats the observed reward plus the current estimate of the next state's value as a new estimate for $V(S_t)$ :

\text{TD target} = R_{t+1} + \gamma V(S_{t+1})

Then it moves the old estimate a small step toward that target:

V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]

The bracketed quantity is the TD error:

\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

It measures the discrepancy between two predictions. Before seeing the transition, the agent's prediction for state $S_t$ was $V(S_t)$ . After seeing one reward and landing in $S_{t+1}$ , it has a revised one-step prediction: take the reward already observed, then add the discounted value currently assigned to the next state. If this revised prediction is larger, $V(S_t)$ moves upward. If it is smaller, $V(S_t)$ moves downward.

This is the sense in which TD combines the two ideas we have already seen. Like Monte Carlo, it learns from sampled experience and does not require the dynamics $p(s', r \mid s,a)$ . Like dynamic programming, it bootstraps: the target contains $V(S_{t+1})$ , another learned estimate, instead of waiting for the actual future return all the way to termination. It can be described as learning a guess from another guess, but with one real reward inserted between them.

For policy evaluation, the tabular TD(0) procedure is:

Input: policy pi to evaluate
Parameter: step size alpha in (0, 1]
Initialize V(s) arbitrarily for all non-terminal states s
Set V(terminal) = 0

Repeat forever, or for a chosen number of episodes:
    Initialize S

    Repeat until S is terminal:
        Choose A according to pi(. | S)
        Take action A, observe R and S'
        V(S) <- V(S) + alpha [R + gamma V(S') - V(S)]
        S <- S'

The important difference from the Monte Carlo pseudocode is the position of the update. TD(0) does not wait for the episode to finish and then sweep backward through a completed trajectory. It updates inside the episode, immediately after each transition. That makes TD natural for long episodes and continuing tasks, where a final return may be delayed for a long time or may not exist at all.

The learning rate $\alpha$ controls how strongly a single transition changes the estimate. A small $\alpha$ averages information slowly but steadily; a large $\alpha$ makes the value function react quickly to new experience, at the cost of more noise from individual transitions.

Bias and Variance

Monte Carlo and TD differ most clearly through bias and variance.

Monte Carlo uses the complete return $G_t$ . If episodes are sampled under the policy $\pi$ , the return is a direct sample of what $v_\pi(s)$ means. That gives MC low bias, but high variance: long episodes accumulate many random events, and all of them affect the update.

TD(0) uses the one-step target $R_{t+1} + \gamma V(S_{t+1})$ . This usually has lower variance because it contains only one sampled reward before bootstrapping. But the target is biased whenever $V(S_{t+1})$ is wrong. Early in learning, it often is.

Monte Carlo waits longer and uses a cleaner target. TD updates sooner and uses a noisier estimate of the future. The right choice depends on episode length, reward noise, and how inaccurate the current value function is.

TD's ability to update online is a major practical advantage. It can learn during an episode, and it can be applied naturally to continuing tasks where waiting for termination is impossible.

The $n$ -Step Bridge

Monte Carlo and TD(0) are endpoints of a larger family. An $n$ -step return waits for $n$ sampled rewards, then bootstraps:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n})

When $n=1$ , this is the TD(0) target. When $n$ reaches the end of the episode, it becomes the Monte Carlo return. Larger $n$ usually means less bootstrap bias and more sampled-return variance. TD( $\lambda$ ) and eligibility traces, which we only preview here, combine many $n$ -step targets with exponentially decaying weights.

Experiment: Blackjack

The runnable example uses Blackjack, the card-game environment used to introduce Monte Carlo methods. The player's goal is to finish with a hand closer to 21 than the dealer's hand, without going over 21. Face cards count as 10. An ace is usable if we can count it as 11 without going over 21.

This is a good model-free prediction example because the agent does not receive a transition table. It only gets episodes: a sequence of visible states, actions, and rewards from one game. The state used in the script is the standard tabular Blackjack observation:

S_t = (\text{player sum}, \text{dealer showing}, \text{usable ace flag})

The action set is small: 0 means stick and 1 means hit. To keep this chapter on prediction, the policy is fixed rather than improved: the player sticks on sums of 20 or 21 and hits otherwise. The task is to estimate the value function of that policy,

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s],

from many sampled games. A win gives reward $+1$ , a loss gives $-1$ , and a draw gives $0$ . Because rewards arrive only when the game ends, Blackjack makes the Monte Carlo idea concrete: after a state is visited, the return is simply the final game outcome.

The full script implements the Blackjack dynamics directly instead of relying on a library environment. That keeps the task visible: drawing cards, checking for a usable ace, applying the fixed policy, and producing one episode are all plain Python functions.

def fixed_policy(state: State, stick_threshold: int = 20) -> int:
    """Stick on high sums and hit otherwise. 0 = stick, 1 = hit."""
    player_sum, _, _ = state
    return 0 if player_sum >= stick_threshold else 1

The episode generator returns transitions of the form (state, action, reward, next_state). Most hit transitions have reward 0.0, because the game has not ended yet. If the player busts, the transition receives reward -1.0 and the next state is terminal. If the player sticks, the dealer plays out the rest of the game and the final transition receives -1.0, 0.0, or 1.0.

if action == 1:
    player.append(draw_card(rng))
    if is_bust(player):
        episode.append((state, action, -1.0, None))
        return episode
    next_state = observation(player, dealer)
    episode.append((state, action, 0.0, next_state))
    continue

First-visit Monte Carlo prediction waits for the episode to finish, then walks backward through the completed game. For each state, it uses only one return sample from that episode. The value update below is an incremental sample average: instead of storing every return in a list, it updates the running mean directly.

for state, _, reward, _ in reversed(episode):
    returns = reward + gamma * returns
    if state in values and state not in visited:
        counts[state] += 1
        values[state] += (returns - values[state]) / counts[state]
        visited.add(state)

TD(0) uses the same sampled episodes but updates after each transition. When next_state is terminal, the next value is defined as zero. Otherwise, the target bootstraps from the current estimate of the next Blackjack state.

for state, _, reward, next_state in episode:
    next_value = 0.0 if next_state is None else values.get(next_state, 0.0)
    target = reward + gamma * next_value
    values[state] += alpha * (target - values[state])

Running the script for 200_000 episodes prints a few sample values and opens a four-panel value-surface plot: Monte Carlo and TD(0), each split into states with and without a usable ace. The exact numbers vary with the random seed, but the signs are interpretable. A state such as (20, 10, False) tends to be good under the stick-on-20 policy: the player already has a strong hand, even though the dealer shows a 10. A state such as (13, 2, False) is usually bad: the player must hit under the fixed policy, and many draws can still lead to weak hands or busts.

The point of the example is not that this policy is optimal. It is not doing control, and it does not change its behavior from the learned values. The point is to show the prediction machinery on a real episodic task: Monte Carlo waits for the final outcome of a game and averages complete returns, while TD(0) starts pushing values around as soon as one reward and one next state are available.

Full code

The complete runnable example, including Blackjack episode generation, first-visit Monte Carlo prediction, TD(0) prediction, and displayed value-surface plotting: monte-carlo-and-temporal-difference.py

You may also find these implementations useful:

What Comes Next

Monte Carlo and TD prediction estimate $v_\pi$ for a fixed policy. The next chapter moves from prediction to control: using action-value estimates to improve behavior, first with on-policy Sarsa and then with off-policy Q-learning.

Monte Carlo and Temporal-Difference Prediction

On this page