History of Reinforcement Learning
From Thorndike's Law of Effect to RLHF — key milestones that shaped the field.
Reinforcement learning draws on ideas from psychology, control theory, operations research, and neuroscience. Its history is one of convergent discovery — similar ideas emerging independently in different fields before merging into a unified framework. Understanding this history helps us see why certain algorithms exist and how the field arrived at its current state.
Early Foundations (1900s–1950s)
The concept of learning from reward and punishment traces back to Edward Thorndike's Law of Effect (1911): behaviors followed by satisfying outcomes are more likely to recur. This idea — that actions are reinforced by their consequences — gave the field its name.
In the 1950s, the mathematical side took shape. Richard Bellman formulated the Bellman equation (1957) while solving optimal control problems for continuous systems. This equation — which expresses the value of a state as the immediate reward plus the discounted value of successor states — became the mathematical backbone of value-based RL. Bellman also coined the term dynamic programming for the class of algorithms that solve such recursive equations.
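The recursion Bellman described can be sketched in a few lines of value iteration. The 2-state, 2-action MDP below is hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical toy MDP: P[s, a, s'] is the transition probability,
# R[s, a] the immediate reward. Values chosen only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: a state's value is the best achievable immediate
    # reward plus the discounted value of its successor states.
    Q = R + gamma * (P @ V)        # shape (states, actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(np.round(V, 2))
```

Repeatedly applying the backup is exactly the dynamic-programming idea: the update is a contraction, so the estimates converge to the unique fixed point of the equation.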
Around the same time, researchers in operations research studied Markov Decision Processes (MDPs) as a formal framework for sequential decision-making under uncertainty. The MDP framework provided the language in which RL problems would eventually be stated.
The Slow Years (1960s–1970s)
Researchers in adaptive control and animal learning experimented with trial-and-error systems, but computational resources were insufficient for complex problems. The field progressed slowly. Notable work included Samuel's checkers-playing program (1959), which used a form of temporal-difference learning, and early connectionist approaches to reward-based learning.
The key ideas were present, but the computing power to realize them was not.
The Modern Era Begins (1980s)
The 1980s saw the convergence of ideas that established RL as a distinct field of machine learning.
In 1988, Richard Sutton introduced TD(λ), a family of temporal-difference learning algorithms that unified ideas from Monte Carlo estimation and dynamic programming. TD methods learn value estimates from incomplete episodes, updating predictions based on other predictions — a concept Sutton called bootstrapping.
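The simplest member of this family, TD(0), can be sketched on the classic random-walk prediction task (a hypothetical five-state chain, used here purely for illustration):

```python
import random

random.seed(0)
alpha, gamma = 0.1, 1.0
V = [0.5] * 5   # value estimates for the five non-terminal states

def step(s):
    """Random walk: move left or right with equal probability."""
    s_next = s + random.choice([-1, 1])
    if s_next == 5:
        return None, 1.0   # right terminal: reward 1
    if s_next == -1:
        return None, 0.0   # left terminal: reward 0
    return s_next, 0.0

for _ in range(5000):
    s = 2   # every episode starts in the middle
    while s is not None:
        s_next, r = step(s)
        # Bootstrapping: the target contains another estimate, V[s_next],
        # so the update happens mid-episode, not only at the end.
        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
        V[s] += alpha * (target - V[s])
        s = s_next

print([round(v, 2) for v in V])   # values approach 1/6, 2/6, ..., 5/6
```

A Monte Carlo method would wait for each episode to finish before updating; the TD update above adjusts V[s] after every single step.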
Around the same time, Chris Watkins proposed Q-learning (1989) — the first practical algorithm for learning optimal policies without a model of the environment. Q-learning is off-policy: it can learn the optimal action-value function regardless of the exploration strategy used to generate data. This property made it both theoretically elegant and practically useful.
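The tabular update at the heart of Q-learning fits in a few lines. The corridor environment below is hypothetical, chosen to make the off-policy property visible:

```python
import random

random.seed(1)
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = [[0.0, 0.0] for _ in range(5)]   # states 0..4, actions 0=left, 1=right

for _ in range(2000):
    s = 0
    while s != 4:                    # state 4 is terminal and pays reward 1
        # Behave epsilon-greedily: an exploratory policy...
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # ...but update toward max_a' Q(s', a'): the off-policy part.
        # The target ignores which action the behavior policy will take.
        target = r + (0.0 if s_next == 4 else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

greedy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(4)]
print(greedy)   # the greedy policy recovered from Q
```

Because the target uses the max over next actions rather than the action actually taken, the learned Q converges to the optimal action-value function even though the data was generated by an exploratory policy.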
These two contributions, together with the MDP framework, established the theoretical foundation on which modern RL is built.
Scaling Up (1990s)
The 1990s provided early evidence that RL could work on nontrivial problems.
Gerald Tesauro's TD-Gammon (1992) trained a neural network to play backgammon at world-class level using TD learning and self-play. The network learned evaluation functions from scratch, with no human expertise beyond the rules. This was a remarkable demonstration — and also a cautionary tale, as similar approaches initially failed to generalize to other games.
In 1998, Sutton and Barto published the first edition of Reinforcement Learning: An Introduction, which became the field's foundational textbook and remains essential reading today.
Policy gradient methods also matured during this decade. Williams' REINFORCE algorithm (1992) showed how to optimize stochastic policies directly by gradient ascent, opening the door to problems with continuous action spaces.
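The essence of REINFORCE, ascending reward-weighted gradients of the log-policy, can be sketched on a hypothetical two-armed bandit (the payout probabilities below are invented for illustration):

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]       # one preference parameter per arm
alpha = 0.1
payout = [0.2, 0.8]      # hypothetical win probabilities per arm

def softmax(t):
    z = [math.exp(x) for x in t]
    return [x / sum(z) for x in z]

for _ in range(3000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if random.random() < payout[a] else 0.0
    # REINFORCE update: theta += alpha * r * grad log pi(a).
    # For a softmax policy, grad log pi(a) is (indicator - probs).
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * r * grad_log

probs = softmax(theta)
print(round(probs[1], 2))   # probability assigned to the better arm
```

The same gradient estimator extends to sequential problems and to continuous action spaces by replacing the softmax with, for example, a Gaussian policy.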
The Deep RL Revolution (2013–2017)
The deep learning revolution reached RL in 2013. DeepMind's DQN (Mnih et al., 2013; published in Nature 2015) combined Q-learning with deep convolutional networks to play Atari games directly from raw pixels, achieving human-level performance across dozens of games. Three key innovations made this possible: experience replay for data efficiency, target networks for training stability, and the use of deep networks as function approximators for the action-value function.
This single result ignited the field of deep reinforcement learning and attracted enormous research attention.
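Of DQN's three innovations, experience replay is the easiest to sketch: transitions go into a ring buffer, and training batches are sampled uniformly, breaking the temporal correlation of consecutive experiences. The buffer below is a minimal illustration, not DeepMind's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform experience replay, as popularized by DQN."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling decorrelates the training batch.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(t, t % 4, 0.0, t + 1, False)   # dummy transitions
batch = buf.sample(8)
print(len(batch))
```

Each stored transition can be reused in many gradient updates, which is where the data efficiency comes from.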
In 2016, AlphaGo (Silver et al.) defeated the world champion in Go — a game with approximately 10^170 possible positions, long considered beyond the reach of AI. AlphaGo combined deep neural networks with Monte Carlo Tree Search (MCTS) and learned from both human expert games and self-play.
Its successor AlphaZero (Silver et al., 2017) went further: it learned Go, chess, and shogi from scratch through pure self-play, with no human knowledge beyond the rules. AlphaZero demonstrated that RL combined with search could discover superhuman strategies from first principles.
Concurrently, PPO (Schulman et al., 2017) emerged as a practical, stable policy gradient method. Its simplicity and reliability made it the default choice for many applications — a position it holds to this day.
Complex Environments and Real-World Impact (2017–2020)
RL demonstrated success in increasingly complex domains.
OpenAI Five (2019) defeated professional Dota 2 teams in a game with vastly more complexity than Go — partial observability, continuous actions, long time horizons, and coordination between five agents.
MuZero (Schrittwieser et al., 2020) mastered Atari, Go, chess, and shogi without even knowing the rules. It learned a model of the environment entirely from experience and used that model for planning — bridging the gap between model-free and model-based approaches.
In robotics, sim-to-real transfer became practical: agents trained in simulation could be deployed on physical robots with careful domain randomization. RL was used to train policies for dexterous manipulation, locomotion, and drone flight.
RL Meets Language Models (2020–Present)
The most consequential recent application of RL is Reinforcement Learning from Human Feedback (RLHF) — the technique used to align large language models with human preferences.
The key insight: instead of defining a reward function manually, train a reward model from human preference comparisons (which response is better?), then use RL — typically PPO — to optimize the language model against this learned reward. This approach, formalized by Christiano et al. (2017) and scaled by Ouyang et al. (2022) at OpenAI, became a standard component of the training pipeline for models like ChatGPT, Claude, and Gemini.
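One way to see the reward-model step concretely is the pairwise Bradley–Terry loss commonly used in this setting; the scalar scores below are hypothetical stand-ins for reward-model outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): the pairwise loss commonly
    used to train RLHF reward models. It pushes the score of the
    human-preferred response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin between chosen and rejected grows.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

Once trained, the reward model scores whole responses, and the policy (the language model) is optimized against those scores with PPO, typically with a KL penalty keeping it close to the original model.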
More recent work has explored alternatives to PPO for alignment, including Direct Preference Optimization (DPO), which bypasses the reward model entirely. We cover these methods in the chapter on RL for sequence generation and RLHF.
Applications Beyond Research
RL has moved well beyond academic benchmarks:
- Games and strategy. From Atari to Go to StarCraft II, RL agents have matched or exceeded human performance. These serve as benchmarks that push algorithmic development.
- Robotics. RL trains robots to walk, grasp objects, fly drones, and manipulate tools. Sim-to-real transfer addresses the cost and safety of real-world data collection.
- Large language models. RLHF is now a standard stage in training large language models, shaping their behavior through human preference data.
- Recommendation systems. Recommending content is a sequential decision problem. RL models long-term user engagement rather than optimizing for immediate clicks.
- Scientific discovery. RL has been used to optimize chemical reactions, design new materials, control nuclear fusion plasma in tokamaks, and manage data center energy consumption.
- Autonomous systems. Self-driving cars, traffic signal control, and supply chain optimization all involve sequential decision-making under uncertainty.
References
- Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
- Tesauro, G. (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 58–68.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- Silver, D. et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint arXiv:1712.01815.
- Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
- Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588(7839), 604–609.
- Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.
- Christiano, P. et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Terven, J. (2025). Deep Reinforcement Learning: A Chronological Overview and Methods. AI, 6(3), 46.