What is Reinforcement Learning?
Overview of the RL paradigm — agents, environments, rewards, and the learning loop.
Every day, we make decisions without explicit instructions. A child touches a hot stove and learns to avoid it. A chess player sacrifices a pawn now for a positional advantage ten moves later. A driver adjusts speed on an unfamiliar road based on feedback from the steering wheel, the curve ahead, and past experience. None of these involve a teacher providing the "correct" answer — instead, we learn by interacting with the world and observing the consequences of our actions. Reinforcement learning formalizes this process: an agent learns a behavior strategy by trial and error, guided only by a scalar reward signal.
This chapter lays the groundwork for everything that follows. We introduce the core abstractions — agent, environment, state, action, reward — and show how they fit together into a mathematical framework.
Learning from Interaction
Supervised learning requires a dataset of input-output pairs: images labeled "cat" or "dog", sentences paired with translations. The learner memorizes patterns from these examples and generalizes to new inputs. This works remarkably well — but it assumes someone already knows the right answers and has taken the time to label them.
Reinforcement learning starts from a fundamentally different premise. There are no labels. Instead, an agent takes actions in an environment and receives rewards — numerical signals indicating how good or bad the outcome was. The agent's goal is to discover, through repeated interaction, a strategy that maximizes the total reward it accumulates over time.
Think of learning to ride a bicycle. No one hands you a dataset of "correct muscle activations for each body position." You get on, wobble, fall, adjust, and gradually develop balance. The reward signal is simple: staying upright is good, falling is bad. But the relationship between your actions (leaning, pedaling, steering) and the outcome (balance or crash) is complex, delayed, and noisy. This is the essence of the RL problem.
Three features distinguish reinforcement learning from other forms of machine learning:
- **No supervisor, only rewards.** The agent never sees the "correct" action. It only knows whether what it did led to high or low reward — and even that signal may arrive much later.
- **Sequential decision-making.** Actions have consequences that extend beyond the immediate moment. Sacrificing short-term reward for long-term gain is often the right strategy, and the agent must learn when.
- **The agent's actions affect its data.** In supervised learning, the training set is fixed. In RL, the data the agent sees depends on the decisions it makes. Explore a different part of the environment, and you encounter different states and rewards. This creates a feedback loop between learning and data collection that has no analogue in supervised or unsupervised learning.
The Agent-Environment Interface
We now formalize the intuition above. The RL framework consists of two entities — an agent and an environment — that interact in discrete time steps $t = 0, 1, 2, \dots$
At each step $t$:
- The agent observes the current state $s_t \in \mathcal{S}$.
- The agent selects an action $a_t \in \mathcal{A}$.
- The environment transitions to a new state $s_{t+1}$ according to the transition dynamics $P(s_{t+1} \mid s_t, a_t)$.
- The environment emits a reward $r_{t+1} = R(s_t, a_t)$.
This produces a trajectory $\tau$ — a sequence of states, actions, and rewards:

$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \dots)$$
In words: the agent perceives the world, acts, and receives feedback. Then the world changes — partly because of the agent's action, partly due to its own dynamics — and the cycle repeats. This loop is the atomic unit of reinforcement learning. Everything we build in this handbook — from tabular Q-learning to PPO to AlphaZero — operates within this framework.
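The loop above can be sketched in a few lines of Python. The environment here — a toy two-state world with a `reset`/`step` interface — and the random agent are hypothetical, chosen only to make the perceive–act–feedback cycle concrete:

```python
import random

class TwoStateEnv:
    """A toy environment (hypothetical, for illustration only)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 from state 0 reaches the goal state and ends the episode.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True    # next_state, reward, done
        return self.state, 0.0, False

def random_agent(state):
    return random.choice([0, 1])            # ignores the state entirely

env = TwoStateEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:                             # the agent-environment loop
    action = random_agent(state)            # agent selects a_t
    state, reward, done = env.step(action)  # environment returns s_{t+1}, r_{t+1}
    total_reward += reward
print(total_reward)  # 1.0 once the goal state is reached
```

Real environment libraries expose essentially this same `reset`/`step` loop; only the state and action spaces get richer.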
States vs. observations
We assume the agent fully observes the state $s_t$. In many real-world problems, the agent only sees a partial or noisy observation $o_t$ — for example, a robot's camera image rather than the full physical state of the room. This setting is called a Partially Observable MDP (POMDP). For now, we work with full observability. The key ideas carry over.
Let us pin down each component more precisely.
**State space $\mathcal{S}$.** The set of all possible situations the agent can encounter. It may be finite (the 64 squares of a chessboard), countably infinite (positions in an unbounded grid), or continuous (joint angles of a robotic arm, $\mathcal{S} \subseteq \mathbb{R}^n$).

**Action space $\mathcal{A}$.** The set of moves available to the agent. Like states, it can be discrete (left, right, up, down) or continuous (a torque applied to each joint, $\mathcal{A} \subseteq \mathbb{R}^m$). In some formulations, the action space depends on the state: $\mathcal{A}(s)$.

**Transition function $P(s' \mid s, a)$.** The probability of arriving in state $s'$ given that the agent was in state $s$ and took action $a$. This function encodes the environment's dynamics. The agent typically does not know $P$ — it must learn from experience.

**Reward function $R(s, a)$.** A scalar signal the environment returns after the agent takes action $a$ in state $s$. Rewards can also depend on the next state: $R(s, a, s')$. The reward is the agent's sole source of feedback about the quality of its behavior.

**Discount factor $\gamma \in [0, 1]$.** A parameter that controls how much the agent cares about future rewards relative to immediate ones. We return to this below.
Together, these components define a Markov Decision Process (MDP), which we study in depth in the next section. For now, the key property is the Markov assumption: the next state and reward depend only on the current state and action, not on the full history.
In other words, the state contains all the information the agent needs to make decisions. The past is irrelevant once the present is known. This is a strong assumption, but it makes the problem tractable — and many real-world problems satisfy it or can be reformulated to satisfy it.
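As a concrete illustration, a small finite MDP can be written down as plain tables. The two-state "machine" below is entirely made up; the point is only that $P$ and $R$ are ordinary lookup tables keyed by state-action pairs:

```python
# A tiny finite MDP encoded as plain dictionaries (hypothetical example).
# P[(s, a)] maps each next state s' to its probability; R[(s, a)] is the reward.
states = ["cool", "hot"]
actions = ["wait", "run"]

P = {
    ("cool", "wait"): {"cool": 1.0},
    ("cool", "run"):  {"cool": 0.5, "hot": 0.5},
    ("hot",  "wait"): {"cool": 0.8, "hot": 0.2},
    ("hot",  "run"):  {"hot": 1.0},
}
R = {
    ("cool", "wait"): 1.0,
    ("cool", "run"):  2.0,
    ("hot",  "wait"): 0.0,
    ("hot",  "run"):  -1.0,
}
gamma = 0.9

# Markov check: each transition distribution sums to 1 and depends only on
# the current (state, action) pair, never on the history before it.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())
```

Every algorithm in the value-based chapters can, in principle, be run against a table like this; function approximation only becomes necessary when the tables grow too large to enumerate.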
Where RL Fits in Machine Learning
Machine learning is often divided into three paradigms. Understanding how they differ clarifies what makes RL unique.
| | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Signal | Correct label for each input | No labels | Scalar reward, possibly delayed |
| Goal | Predict outputs from inputs | Find structure in data | Maximize cumulative reward |
| Data | Fixed dataset | Fixed dataset | Generated by the agent's own actions |
| Feedback timing | Immediate (each sample has a label) | None | Delayed (reward may come many steps later) |
| Example | Image classification | Clustering, dimensionality reduction | Game playing, robot control |
The most important distinction is the data row. In supervised and unsupervised learning, the dataset exists before training begins and does not change. In RL, the agent generates its own training data through its actions. This creates a fundamental challenge: the data distribution depends on the current policy, but the policy is exactly what we are trying to improve. As the policy changes, the distribution of visited states shifts — a phenomenon with no parallel in classical machine learning.
This also means that RL faces a problem the other paradigms do not: the exploration-exploitation dilemma. The agent must decide whether to take actions it already knows are good (exploit) or try something new in hopes of discovering something better (explore). We study this problem in depth in the chapter on multi-armed bandits.
Formalizing the Goal
An agent that merely maximizes the next reward is myopic. Consider a chess player who captures an opponent's piece at the cost of losing the queen two moves later — the immediate reward is positive, but the long-term consequence is catastrophic. We need a notion of cumulative, long-term reward.
The return $G_t$ is the total accumulated reward from time step $t$ onward:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
The discount factor $\gamma$ determines the balance between present and future. When $\gamma$ is close to 0, the agent is short-sighted — it cares mostly about the next reward. When $\gamma$ is close to 1, it is far-sighted — it values future rewards almost as much as immediate ones.
Why discount at all? Three reasons:
- Mathematical convenience. Without discounting ($\gamma = 1$), the sum may diverge for infinite-horizon tasks, making the objective ill-defined.
- Uncertainty. The further into the future, the less certain we are about what will happen. Discounting reflects this uncertainty.
- Preference for sooner rewards. All else equal, receiving reward sooner is better. This mirrors economic concepts like the time value of money.
Notice that the return satisfies a recursive relationship:

$$G_t = r_{t+1} + \gamma G_{t+1}$$
This simple identity — the return at time $t$ equals the immediate reward plus the discounted return from the next step — is the seed from which the Bellman equations grow. We explore this in the chapter on MDPs and dynamic programming.
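Applied right to left over a finite reward sequence, this recursion is also the standard way to turn a list of rewards into a return. A minimal sketch, with illustrative reward values:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 for a finite reward list by scanning right to left."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g   # applies G_t = r_{t+1} + gamma * G_{t+1}
    return g

# One immediate reward, then a large reward delayed by three steps.
rewards = [1.0, 0.0, 0.0, 10.0]
print(round(discounted_return(rewards, gamma=0.9), 4))   # 8.29 = 1 + 0.9**3 * 10
```

With $\gamma = 0.9$ the delayed reward of 10 is worth only $0.9^3 \cdot 10 = 7.29$ today, which is exactly the trade-off the discount factor is meant to encode.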
Episodic vs. continuing tasks
Some problems have a natural endpoint — a game ends, a robot reaches its target. These are episodic tasks, and the return is a finite sum. Other problems run indefinitely — a thermostat controls temperature around the clock. These are continuing tasks, and discounting ($\gamma < 1$) is essential to keep $G_t$ finite. The framework handles both.
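To see why discounting keeps the continuing-task return finite: with a constant reward of $r$ per step, $G_t$ is a geometric series with closed form $r / (1 - \gamma)$. A quick numerical check:

```python
gamma, r = 0.9, 1.0

# Closed form of the infinite series sum_{k>=0} gamma**k * r
exact = r / (1 - gamma)

# A long truncated sum agrees to high precision, since gamma**1000 is negligible
approx = sum(gamma**k * r for k in range(1000))
print(round(exact, 6), round(approx, 6))   # 10.0 10.0
```

So a reward of 1 forever is worth a finite 10 at $\gamma = 0.9$, while at $\gamma = 1$ the same sum would diverge.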
Policies
A policy $\pi$ is the agent's strategy: a rule that maps states to actions. It is the central object we want to learn.
A policy can be deterministic — for every state, it prescribes a single action:

$$a = \pi(s)$$
Or it can be stochastic — for every state, it defines a probability distribution over actions:

$$\pi(a \mid s) = \Pr(a_t = a \mid s_t = s), \qquad \sum_{a \in \mathcal{A}} \pi(a \mid s) = 1$$
Why would we ever want a stochastic policy? Two reasons. First, in partially observable environments, randomness can be optimal — think of rock-paper-scissors, where any deterministic strategy is exploitable. Second, stochastic policies are easier to optimize with gradient-based methods, as we shall see in the policy gradient chapter.
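In the tabular case, a stochastic policy is just a probability table we can sample from; a deterministic policy is the special case that puts all its mass on one action. The states, actions, and probabilities below are arbitrary illustrations:

```python
import random

# pi[state][action] = probability of taking that action in that state (made up).
pi = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, state, rng=random):
    """Draw one action from the policy's distribution at the given state."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

# Sampling frequencies approach the policy's probabilities.
counts = {"left": 0, "right": 0}
for _ in range(10_000):
    counts[sample_action(pi, "s0")] += 1
print(counts)   # roughly {'left': 2000, 'right': 8000}
```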
The goal of RL, stated precisely, is to find a policy that maximizes the expected return from every state:

$$\pi^* = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s\right] \quad \text{for all } s \in \mathcal{S}$$
This is the optimal policy $\pi^*$. A central result in RL theory (which we prove in the MDP chapter) is that at least one deterministic optimal policy always exists for any finite MDP.
Value Functions
Directly searching over all possible policies is intractable. Instead, we evaluate how good it is to be in a given state (or to take a given action in a given state) under a particular policy. These evaluations are called value functions.
The state-value function $V^{\pi}(s)$ answers: "If I am in state $s$ and follow policy $\pi$ from now on, what return can I expect?"

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s\right]$$
The action-value function $Q^{\pi}(s, a)$ is more specific: "If I am in state $s$, take action $a$, and then follow policy $\pi$, what return can I expect?"

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[G_t \mid s_t = s,\, a_t = a\right]$$
These two functions are closely related. If we know $Q^{\pi}$, we can recover $V^{\pi}$ by averaging over the policy:

$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, Q^{\pi}(s, a)$$
In plain terms: the value of a state is the average value of the actions we might take there, weighted by how likely we are to take each one.
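This averaging is a one-liner once $\pi$ and $Q^{\pi}$ are stored as tables. The numbers here are made up purely for illustration:

```python
# Recovering V^pi from Q^pi by averaging over the policy (toy numbers).
pi = {"s": {"a1": 0.25, "a2": 0.75}}           # pi(a|s)
Q  = {("s", "a1"): 4.0, ("s", "a2"): 8.0}      # Q^pi(s, a)

def v_from_q(pi, Q, state):
    """V(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(p * Q[(state, a)] for a, p in pi[state].items())

print(v_from_q(pi, Q, "s"))   # 0.25*4 + 0.75*8 = 7.0
```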
The optimal value functions $V^*$ and $Q^*$ correspond to the best possible policy:

$$V^*(s) = \max_{\pi} V^{\pi}(s), \qquad Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$$
Once we know $Q^*$, finding the optimal policy is trivial — just pick the action with the highest value:

$$\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$$
This is why so much of RL reduces to the problem of estimating value functions accurately. The methods differ in how they estimate these values — from exact dynamic programming to deep neural network approximations — but the underlying logic is the same.
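Extracting the greedy policy from a table of action values is a single `max` over actions. The $Q^*$ values below are invented for illustration:

```python
# Greedy policy extraction from an action-value table (illustrative values).
Q_star = {
    ("s0", "left"): 1.0, ("s0", "right"): 3.0,
    ("s1", "left"): 2.5, ("s1", "right"): 0.5,
}

def greedy_policy(Q, state, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy(Q_star, "s0", ["left", "right"]))   # right
print(greedy_policy(Q_star, "s1", ["left", "right"]))   # left
```

The hard part, of course, is obtaining accurate values for this table in the first place; that is what the algorithms in the coming chapters are for.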
The Exploration-Exploitation Dilemma
Imagine you are choosing a restaurant for dinner. You know a place nearby that is reliably good — an 8 out of 10. But there is a new restaurant you have never tried. It might be a 10. It might be a 4. Do you exploit your known option or explore the unknown one?
This is the exploration-exploitation dilemma, and it pervades all of reinforcement learning. An agent that only exploits will converge to a locally good policy but may miss the globally optimal one. An agent that only explores will gather information but never use it. The challenge is to balance the two.
We study this problem through the lens of the multi-armed bandit — a simplified setting that isolates the exploration-exploitation trade-off from the complexities of sequential decision-making. There, we develop principled strategies: $\varepsilon$-greedy, upper confidence bounds (UCB), and Thompson sampling. These ideas carry forward into every RL algorithm we encounter later.
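As a preview, the $\varepsilon$-greedy rule fits in a few lines: with probability $\varepsilon$ take a uniformly random action, otherwise take the action with the highest estimated value. A sketch with made-up value estimates:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])    # exploit

Q = {("s", "stay"): 8.0, ("s", "try_new"): 0.0}         # current estimates
picks = [epsilon_greedy(Q, "s", ["stay", "try_new"], 0.1) for _ in range(10_000)]
print(picks.count("try_new") / len(picks))   # about 0.05: half of the 10% explorations
```

Even this crude rule guarantees that every action keeps being tried, so a wrong initial estimate can eventually be corrected.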
What Lies Ahead
This chapter introduced the framework and vocabulary of reinforcement learning — the agent-environment loop, policies, value functions, and the objective of maximizing discounted returns.
The next chapter maps out the landscape of RL algorithms along three axes: model-free vs. model-based, value-based vs. policy-based, and on-policy vs. off-policy. After that, we trace the history of the field from Bellman's equation to RLHF.
From the Introduction section, we move to Value-Based Methods, starting with multi-armed bandits and building up through MDPs, dynamic programming, and deep Q-networks. The notation is consistent throughout — refer back to this chapter whenever you need to check a symbol's meaning.
References
- Sutton, R. S. & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529–533.
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. *Nature*, 529(7587), 484–489.