RL Handbook

A comprehensive guide to Reinforcement Learning

Abstract

This handbook is a comprehensive, up-to-date guide to reinforcement learning and sequential decision making. Starting from bandits and Markov decision processes, it progresses through value-based methods, policy gradients, actor-critic architectures, and model-based approaches. Advanced topics include imitation learning, offline RL, curiosity-driven exploration, and multi-agent systems. The material balances mathematical rigor with runnable code examples and is designed to serve as an open, continuously updated resource for students, researchers, and engineers entering or working in the field.

Chapter Contents

  1. 01 Introduction
    1. 01.1 Introduction
    2. 01.2 What is Reinforcement Learning?
    3. 01.3 Taxonomy of RL Methods
  2. 02 Value-Based
    1. 02.1 Multi-Armed Bandits
    2. 02.2 Markov Decision Processes
    3. 02.3 Dynamic Programming
    4. 02.4 Monte Carlo and Temporal-Difference Prediction
    5. 02.5 Sarsa and Q-Learning
    6. 02.6 Deep Q-Networks
    7. 02.7 DQN Improvements
  3. 03 On-Policy Policy-Based
    1. 03.1 Policy Gradient and REINFORCE
    2. 03.2 Actor-Critic, A2C and A3C
    3. 03.3 TRPO
    4. 03.4 PPO
  4. 04 Off-Policy Policy-Based
    1. 04.1 Off-Policy Policy-Based Framework
    2. 04.2 DDPG
    3. 04.3 TD3 and SAC
  5. 05 Model-Based (coming soon)
  6. 06 Advanced Topics (coming soon)

Author

Ruslan Ageev

RL research @ Tsinghua University | ML & AI

Acknowledgements

We thank all contributors who helped improve this handbook through feedback, corrections, and new material.

© 2026 Ruslan Ageev