Mathematical Foundations of Reinforcement Learning is a free textbook by Shiyu Zhao (Westlake University) that teaches RL from a mathematical perspective. Unlike most RL books, which focus on algorithm procedures, this one explains why algorithms are designed the way they are and why they work. It is published by Springer and Tsinghua University Press (2025), and its companion materials include a GitHub repository with 10,000+ stars and 54+ bilingual video lectures totaling 2.1M+ views.
*Sources: GitHub Repository · Springer · Author Homepage · Amazon*
Why This Book Stands Out
The core differentiator is math-first pedagogy with controlled depth:
- Explains the "why" – not just how to run value iteration, but why Bellman equations guarantee convergence and why policy gradients work
- Gray-box design – deeper mathematical content is placed in shaded boxes that readers can selectively engage with based on their comfort level
- Unified examples – every concept and algorithm is illustrated in a single grid-world environment (sketched below), so readers build cumulative intuition instead of context-switching between toy problems
- Progressive structure – each chapter builds on the previous one in a coherent learning path
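For concreteness, here is a minimal Python sketch of such a grid world. It is a guess at the flavor of the book's running example, not its exact setup: the grid size, goal position, reward, and discount factor below are all illustrative.

```python
# Hypothetical 3x3 grid world in the spirit of the book's running example.
# Grid size, goal location, reward, and discount are illustrative choices.
GRID = 3                                   # states are (row, col) tuples
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL, GAMMA = (2, 2), 0.9

def step(state, action):
    """Deterministic transition: move if in bounds, otherwise stay put."""
    dr, dc = MOVES[action]
    nr, nc = state[0] + dr, state[1] + dc
    next_state = (nr, nc) if 0 <= nr < GRID and 0 <= nc < GRID else state
    reward = 1.0 if next_state == GOAL else 0.0   # sparse reward at the goal
    return next_state, reward
```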
Chapter Overview
The book is structured in two parts across 10 chapters:
Part 1 – Foundational Tools
| Chapter | Topic |
|---|---|
| 1 | Basic Concepts (MDPs, states, actions, rewards, policies) |
| 2 | Bellman Equation |
| 3 | Bellman Optimality Equation |
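In one standard form (the book's own notation differs in details), the two equations these chapters revolve around are:

```latex
\begin{align}
  v_\pi(s) &= \sum_{a}\pi(a\mid s)\sum_{s',\,r} p(s',r\mid s,a)\,
              \bigl[r+\gamma\, v_\pi(s')\bigr]
              && \text{(Bellman equation, Ch.~2)}\\
  v_*(s)   &= \max_{a}\sum_{s',\,r} p(s',r\mid s,a)\,
              \bigl[r+\gamma\, v_*(s')\bigr]
              && \text{(Bellman optimality equation, Ch.~3)}
\end{align}
```

The payoff of Part 1 is showing that the right-hand side of the second equation defines a γ-contraction, which is what makes the Part 2 algorithms converge.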
Part 2 – Algorithms
| Chapter | Topic |
|---|---|
| 4 | Value Iteration and Policy Iteration |
| 5 | Monte Carlo Methods |
| 6 | Stochastic Approximation |
| 7 | Temporal-Difference Methods |
| 8 | Value Function Approximation |
| 9 | Policy Gradient Methods |
| 10 | Actor-Critic Methods |
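As a taste of where Part 2 starts, here is a self-contained value-iteration sketch on the toy grid world from above; again, the specific numbers (γ = 0.9, tolerance 1e-8) are illustrative, not the book's.

```python
# Value iteration (Chapter 4) on the hypothetical 3x3 grid world: repeatedly
# apply the Bellman optimality operator until the sup-norm change is tiny.
# gamma < 1 makes the operator a contraction, so the loop converges.
GRID, GAMMA, GOAL = 3, 0.9, (2, 2)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]
states = [(r, c) for r in range(GRID) for c in range(GRID)]

def step(s, move):
    nr, nc = s[0] + move[0], s[1] + move[1]
    ns = (nr, nc) if 0 <= nr < GRID and 0 <= nc < GRID else s
    return ns, (1.0 if ns == GOAL else 0.0)

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        # Bellman optimality backup: best one-step lookahead over actions
        best = max(r + GAMMA * V[ns] for ns, r in (step(s, m) for m in MOVES))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-8:
        break
print(V[(0, 0)])   # optimal value of the top-left state
```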
Companion Resources
| Resource | Details |
|---|---|
| Video Lectures | 54+ segments on YouTube and Bilibili (Chinese + English) |
| Code | Community implementations in Python, MATLAB, R, and C++ |
| Slides | LaTeX/Beamer lecture slides (source available upon request) |
| Study Notes | Community-contributed supplementary materials |
How It Compares to Other RL Textbooks
| Book | Approach | Best For |
|---|---|---|
| Sutton & Barto (2018) | Intuition-first, broad coverage, foundational | First exposure to RL; conceptual understanding |
| Szepesvári (2010) | Dense, proof-heavy, ~100 pages | Convergence proofs, regret bounds, theory researchers |
| Zhao (2025) | Math-first but readable, controlled depth | Understanding why algorithms work; bridge between intuition and theory |
| Xiao (2022) | Unified math framework + Python code | Implementation-oriented learners who want both theory and code |
Sutton & Barto is the standard first read – it builds intuition through examples and pseudocode, with math in optional shaded boxes. Zhao's book is the natural second read: it takes concepts you already intuit and gives you the mathematical machinery to understand them rigorously. Where Sutton & Barto says "this works," Zhao shows you the proof of why it works – but without the density of Szepesvári's monograph.
The controlled depth is the key insight. Many students bounce off rigorous RL theory because the gap between Sutton & Barto and a measure-theoretic treatment is too large. Zhao fills that gap precisely.
How LearnAI Team Could Use This
Teaching RL Courses
- Graduate RL seminar – use as the primary textbook. The chapter structure maps cleanly to a semester: Part 1 in weeks 1-5, Part 2 in weeks 6-14
- Undergraduate AI course – assign specific chapters (1-4) alongside Sutton & Barto for students who want deeper understanding
- Math for ML course – the Bellman equation chapters are excellent standalone material on dynamic programming and fixed-point theory (see the sketch after this list)
- Flipped classroom – assign the bilingual video lectures as pre-class material, use class time for working through gray-box proofs together
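The fixed-point angle in the third bullet can be demonstrated in a dozen lines: on a made-up two-state MDP, iterating the Bellman expectation operator converges to the exact value function at rate γ, which is the Banach fixed-point argument in miniature. Every number below is hypothetical.

```python
# The Bellman expectation operator T_pi(V) = r_pi + gamma * P_pi @ V is a
# gamma-contraction in the sup-norm, so iterating it converges geometrically
# to v_pi. Hypothetical 2-state MDP under a fixed policy:
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],            # state-transition matrix under pi
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])             # expected one-step reward per state

v_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)  # closed-form v_pi

V = np.zeros(2)
for k in range(10):
    V = r_pi + gamma * P_pi @ V                 # one application of T_pi
    err = np.max(np.abs(V - v_exact))           # sup-norm distance to v_pi
    print(f"iter {k}: error = {err:.6f}")       # shrinks by ~gamma per step
```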
Self-Study Path for Team Members
Week 1-2: Chapters 1-3 (foundations + Bellman equations)
  → Watch corresponding video lectures
  → Run grid-world code examples
Week 3-4: Chapters 4-7 (value iteration, MC, stochastic approximation, TD)
  → Compare with Sutton & Barto chapters on the same topics
  → Note where mathematical insight changes your understanding
Week 5-6: Chapters 8-10 (function approximation, policy gradients, actor-critic)
  → Connect to modern deep RL (PPO, SAC) built on these foundations
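To make the weeks 3-4 material concrete, here is a minimal tabular TD(0) sketch on the same toy grid world, estimating the value of a uniformly random policy from sampled transitions. The step size, episode length, and episode count are illustrative.

```python
# Tabular TD(0) (Chapter 7 material) on the hypothetical grid world: learn
# v_pi for a uniform-random policy by bootstrapping from sampled transitions.
import random

GRID, GAMMA, GOAL, ALPHA = 3, 0.9, (2, 2), 0.1
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, move):
    nr, nc = s[0] + move[0], s[1] + move[1]
    ns = (nr, nc) if 0 <= nr < GRID and 0 <= nc < GRID else s
    return ns, (1.0 if ns == GOAL else 0.0)

V = {(r, c): 0.0 for r in range(GRID) for c in range(GRID)}
for episode in range(5000):
    s = (0, 0)
    for t in range(50):                        # truncated episodes
        ns, r = step(s, random.choice(MOVES))  # uniform-random policy
        # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(ns)
        V[s] += ALPHA * (r + GAMMA * V[ns] - V[s])
        s = ns
print(V[(0, 0)])   # sample-based estimate of the random policy's value
```

Unlike a Monte Carlo estimate, each update uses the current estimate V(ns) as a stand-in for the remaining return; that bootstrapping step is exactly what Chapter 6's stochastic approximation theory justifies.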
Research Applications
- Students working on RL-related projects get a rigorous reference for convergence properties and algorithm design rationale
- The mathematical framework helps when reading RL papers that assume familiarity with Bellman operators, contraction mappings, and stochastic approximation theory (a minimal sketch of the latter follows this list)
- Gray-box sections serve as a bridge to more advanced references (Bertsekas, Puterman) for students heading into theory research
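On that last point, the Robbins-Monro iteration underlying Chapter 6 fits in a few lines; the target value and step-size schedule below are made up for demonstration.

```python
# Robbins-Monro stochastic approximation: w_{k+1} = w_k + a_k (x_k - w_k)
# converges to E[x] when sum(a_k) = inf and sum(a_k^2) < inf. TD updates
# have this exact form, which is how the book derives their convergence.
import random

true_mean = 3.0
w = 0.0
for k in range(1, 100_001):
    x = random.gauss(true_mean, 1.0)   # noisy observation of the target
    w += (1.0 / k) * (x - w)           # step sizes a_k = 1/k
print(w)                               # approaches 3.0
```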
Real-World Use Cases
| Use Case | How This Book Helps |
|---|---|
| Robotics control | Understanding why policy gradient methods converge (or don't) in continuous action spaces |
| Game AI | Mathematical foundation for value iteration and Monte Carlo tree search variants |
| Recommendation systems | Bellman equation framework applies directly to sequential recommendation as an MDP |
| LLM alignment (RLHF) | Policy gradient and actor-critic chapters provide the mathematical foundation for PPO, the algorithm behind RLHF |
| Operations research | Dynamic programming chapters connect RL to classical optimization |
| Autonomous driving | Function approximation theory explains when and why deep RL generalizes (or fails to) |
The RLHF connection is particularly timely: anyone working with LLM fine-tuning benefits from understanding why PPO works, not just how to call `trl.PPOTrainer()`. Chapters 9-10 on policy gradients and actor-critic methods provide exactly that foundation.
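As an illustration rather than a reference implementation (this is not trl's internal code), here is PPO's clipped surrogate objective in plain NumPy; the log-probabilities and advantages are made-up stand-ins for quantities a real trainer computes.

```python
# PPO's clipped surrogate objective, the object Chapters 9-10 build theory for.
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize)."""
    ratio = np.exp(logp_new - logp_old)            # pi_new / pi_old per sample
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min() removes the incentive to push the ratio outside [1-eps, 1+eps],
    # keeping each update close to the data-collecting policy.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy numbers just to show the call shape:
logp_old = np.array([-1.2, -0.7, -2.1])
logp_new = np.array([-1.0, -0.9, -1.8])
adv = np.array([0.5, -0.3, 1.2])
print(ppo_clip_loss(logp_new, logp_old, adv))
```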
About the Author
Shiyu Zhao is an Associate Professor and Director of the Intelligent Unmanned Systems Laboratory at Westlake University in Hangzhou, China. He received his PhD in Electrical and Computer Engineering from the National University of Singapore in 2014. The book grew out of the graduate-level RL lecture notes he has developed since 2019.