Taxonomy of RL Methods
Model-free vs. model-based, value-based vs. policy-based, on-policy vs. off-policy — a map of the field.
The field of RL is broad, and algorithms differ along several axes. Before diving into specific methods, it helps to see the landscape from above. This chapter organizes the main families of approaches we cover in this handbook and explains the key distinctions between them.
Model-Free vs. Model-Based
A model-free method learns a policy or value function directly from experience, without trying to understand how the environment works. The agent does not need to know (or learn) the transition function $P(s' \mid s, a)$ or the reward function $R(s, a)$.
A model-based method first learns a model of the environment — an approximation of $P$ and $R$ — and then uses it to plan. Think of the difference between a chess player who evaluates positions by intuition (model-free) versus one who mentally simulates sequences of moves (model-based).
Model-based methods can be more sample-efficient because they reuse the learned model for planning. But if the model is inaccurate, planning with it can lead to poor decisions — a problem known as model bias.
Most of this handbook focuses on model-free methods, since they are simpler to implement and analyze. We cover model-based approaches in a dedicated section.
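Dyna-Q, covered later in the model-based section, is a useful bridge between the two families: it performs ordinary model-free Q-learning on real transitions while also recording them in a learned model and replaying simulated transitions for extra planning updates. A minimal tabular sketch on a hypothetical 5-state corridor (the environment, hyperparameters, and episode count are illustrative assumptions, not from the text):

```python
import random

# Toy corridor: states 0..4, actions 0 (left) / 1 (right); reward 1 on reaching state 4.
N, GOAL = 5, 4

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

Q = {(s, a): 0.0 for s in range(N) for a in range(2)}
model = {}  # learned model: (s, a) -> (r, s')
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = random.Random(0)

for episode in range(50):
    s = 0
    while s != GOAL:
        a = rng.randrange(2) if rng.random() < eps else max(range(2), key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Model-free part: a Q-learning update from the real transition.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
        model[(s, a)] = (r, s2)  # record the transition in the model
        # Model-based part: plan with 5 simulated transitions drawn from the model.
        for _ in range(5):
            (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, 0)], Q[(ps2, 1)]) - Q[(ps, pa)])
        s = s2

# The greedy policy should now move right toward the goal in every state.
greedy = [max(range(2), key=lambda x: Q[(s, x)]) for s in range(GOAL)]
print(greedy)
```

Because the planning loop reuses stored transitions, the agent needs far fewer real environment steps than plain Q-learning — the sample-efficiency gain described above — at the cost of trusting the recorded model.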
Value-Based vs. Policy-Based
Value-based methods learn a value function ($V$ or $Q$) and derive the policy from it — typically by acting greedily: $\pi(s) = \arg\max_a Q(s, a)$. Q-learning and DQN are value-based.
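Extracting the greedy policy from a learned Q-table is a one-line argmax. A minimal sketch with a made-up 3-state, 2-action table:

```python
import numpy as np

# Hypothetical learned Q-table: rows are states, columns are actions.
Q = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.0, 0.4]])

# Greedy policy: in each state, pick the action with the highest Q-value.
greedy_policy = Q.argmax(axis=1)
print(greedy_policy)  # [1 0 1]
```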
Policy-based methods parameterize the policy $\pi_\theta(a \mid s)$ directly — for example, as a neural network — and optimize its parameters $\theta$ by gradient ascent on the expected return. REINFORCE is a pure policy-based method.
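The gradient-ascent idea can be shown on the smallest possible case: a softmax policy over a 2-armed bandit, updated with the REINFORCE rule $\theta \leftarrow \theta + \alpha \, r \, \nabla_\theta \log \pi_\theta(a)$. The bandit, step size, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                   # logits of a softmax policy over 2 arms
true_rewards = np.array([0.2, 0.8])   # hypothetical bandit: arm 1 pays more
alpha = 0.1

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()  # softmax policy pi_theta
    a = rng.choice(2, p=probs)                   # sample an action from the policy
    r = true_rewards[a]                          # observe its reward
    # grad of log pi(a) w.r.t. logits is one_hot(a) - probs; scale by the return.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi

# After training, the policy should strongly prefer the better arm.
print(theta)
```

The update nudges the logits so that actions followed by larger rewards become more probable — no value function is ever learned.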
Actor-critic methods combine both: a policy (the "actor") selects actions, while a value function (the "critic") evaluates them. This reduces variance compared to pure policy methods and extends naturally to continuous action spaces. A2C, PPO, SAC, and TD3 all belong to this family.
On-Policy vs. Off-Policy
An on-policy method learns about the policy it is currently following. The data used for updates must come from the current policy, which means old data cannot be reused. SARSA and PPO are on-policy.
An off-policy method can learn about one policy (the target policy) while following a different one (the behavior policy). This enables experience replay — storing past transitions in a buffer and sampling from them repeatedly. Q-learning and DQN are off-policy.
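The replay buffer mentioned above is typically just a fixed-size FIFO store with uniform sampling. A minimal sketch (class name and transition layout are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions for off-policy learning."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions before an update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=10_000)
for t in range(100):
    buf.push(t, 0, 1.0, t + 1, False)  # dummy transitions for illustration
batch = buf.sample(32)
print(len(buf), len(batch))  # 100 32
```

Because Q-learning's target does not depend on how the stored action was chosen, each transition can be reused many times — the source of off-policy methods' sample efficiency.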
Think of it this way: on-policy is like learning by doing. Off-policy is like learning by watching someone else — or by reviewing your own past experiences. Off-policy methods are generally more sample-efficient, but they introduce additional challenges around stability and convergence.
Overview Table
| Family | Learns | Example Algorithms | Covered in |
|---|---|---|---|
| Value-based | $Q(s, a)$ or $V(s)$ | Q-Learning, DQN, Double DQN | Value-Based Methods |
| Policy gradient | $\pi_\theta(a \mid s)$ | REINFORCE, TRPO, PPO | Policy Gradient |
| Actor-critic | Both $\pi_\theta$ and $Q$ or $V$ | A2C, DDPG, TD3, SAC | Actor-Critic |
| Model-based | $P$, $R$ | Dyna-Q, MPC, MuZero | Model-Based RL |
The handbook follows this structure roughly in order: we start with value-based methods, move to policy gradients, then actor-critic, and finally model-based approaches. The last section covers topics that cut across these categories.
References
- Sutton, R. S. & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.