Mathematical Foundations of Reinforcement Learning — Free Textbook by Shiyu Zhao

Mathematical Foundations of Reinforcement Learning is a free textbook by Shiyu Zhao (Westlake University) that teaches RL from a mathematical perspective. Unlike most RL books that focus on algorithm procedures, this one explains why algorithms are designed the way they are and why they work. Published by Springer and Tsinghua University Press (2025), with 10,000+ GitHub stars and 54+ bilingual video lectures totaling 2.1M+ views.

*Sources: GitHub repository, Springer, author homepage, Amazon*

Why This Book Stands Out

The core differentiator is math-first pedagogy with controlled depth:

  • Explains the “why” — not just how to run value iteration, but why Bellman equations guarantee convergence and why policy gradients work
  • Gray-box design — deeper mathematical content is placed in shaded boxes that readers can engage with selectively, based on their comfort level
  • Unified examples — every concept and algorithm is illustrated in a single grid-world environment, so readers build cumulative intuition instead of context-switching between toy problems
  • Progressive structure — each chapter builds on the previous one in a coherent learning path
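To make the unified-example idea concrete, here is a minimal grid-world MDP sketch in the spirit of the book's running environment. The layout, goal cell, and reward values below are illustrative assumptions, not the book's exact setup:

```python
# A minimal 3x3 grid-world MDP sketch: states are (row, col) cells,
# actions move deterministically, and the agent earns +1 at the goal.
# Layout and rewards are illustrative assumptions, not the book's exact environment.
GRID = 3
GOAL = (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < GRID and 0 <= c < GRID else state
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

print(step((2, 1), "right"))  # moving right from (2, 1) reaches the goal
```

Every algorithm in the book (value iteration, Monte Carlo, TD, policy gradients) can be run against one small environment like this, which is exactly what makes the cumulative-intuition approach work.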

Chapter Overview

The book is structured in two parts across 10 chapters:

Part 1 — Foundational Tools

| Chapter | Topic |
|---|---|
| 1 | Basic Concepts (MDPs, states, actions, rewards, policies) |
| 2 | Bellman Equation |
| 3 | Bellman Optimality Equation |
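For reference, the two central equations of Part 1, written in standard RL notation (not copied from the book):

```latex
% State-value Bellman equation for a policy \pi:
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]

% Bellman optimality equation:
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
```

The book's contribution is showing why these are well-posed (existence and uniqueness of solutions via fixed-point arguments) before any algorithm is introduced.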

Part 2 — Algorithms

| Chapter | Topic |
|---|---|
| 4 | Value Iteration and Policy Iteration |
| 5 | Monte Carlo Methods |
| 6 | Stochastic Approximation |
| 7 | Temporal-Difference Methods |
| 8 | Value Function Approximation |
| 9 | Policy Gradient Methods |
| 10 | Actor-Critic Methods |
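As a flavor of Chapter 4, here is a minimal value-iteration sketch: the Bellman optimality operator is iterated to its fixed point on a tiny MDP. The two-state MDP below is invented for illustration, not taken from the book:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (invented for illustration).
# P[a][s, s'] = transition probability; R[a][s] = expected reward.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # action 0
     np.array([[0.1, 0.9], [0.6, 0.4]])]   # action 1
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
gamma = 0.9

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality operator to its unique fixed point."""
    v = np.zeros(2)
    while True:
        q = np.stack([R[a] + gamma * P[a] @ v for a in range(2)])
        v_new = q.max(axis=0)          # greedy backup over actions
        if np.abs(v_new - v).max() < tol:
            return v_new, q.argmax(axis=0)  # optimal values and greedy policy
        v = v_new

v_star, policy = value_iteration()
print(v_star, policy)
```

The loop is guaranteed to terminate because the operator is a γ-contraction in the sup-norm, which is precisely the kind of "why it works" argument the book develops in Chapters 3-4.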

Companion Resources

| Resource | Details |
|---|---|
| Video Lectures | 54+ segments on YouTube and Bilibili (Chinese + English) |
| Code | Community implementations in Python, MATLAB, R, and C++ |
| Slides | LaTeX/Beamer lecture slides (source available upon request) |
| Study Notes | Community-contributed supplementary materials |

How It Compares to Other RL Textbooks

| Book | Approach | Best For |
|---|---|---|
| Sutton & Barto (2018) | Intuition-first, broad coverage, foundational | First exposure to RL; conceptual understanding |
| Szepesvári (2010) | Dense, proof-heavy, ~100 pages | Convergence proofs, regret bounds, theory researchers |
| Zhao (2025) | Math-first but readable, controlled depth | Understanding why algorithms work; bridging intuition and theory |
| Xiao (2022) | Unified math framework + Python code | Implementation-oriented learners who want both theory and code |

Sutton & Barto is the standard first read — it builds intuition through examples and pseudocode, with math in optional shaded boxes. Zhao’s book is the natural second read: it takes concepts you already intuit and gives you the mathematical machinery to understand them rigorously. Where Sutton & Barto says “this works,” Zhao shows you the proof of why it works — but without the density of Szepesvári’s monograph.

The controlled depth is the key insight. Many students bounce off rigorous RL theory because the gap between Sutton & Barto and a measure-theoretic treatment is too large. Zhao fills that gap precisely.

How LearnAI Team Could Use This

Teaching RL Courses

  • Graduate RL seminar — use as the primary textbook. The chapter structure maps cleanly to a semester: Part 1 in weeks 1-5, Part 2 in weeks 6-14
  • Undergraduate AI course — assign specific chapters (1-4) alongside Sutton & Barto for students who want deeper understanding
  • Math for ML course — the Bellman equation chapters are excellent standalone material on dynamic programming and fixed-point theory
  • Flipped classroom — assign the bilingual video lectures as pre-class material, use class time for working through gray-box proofs together

Self-Study Path for Team Members

Weeks 1-2: Chapters 1-3 (foundations + Bellman equations)
    → Watch the corresponding video lectures
    → Run the grid-world code examples
Weeks 3-4: Chapters 4-6 (value iteration, MC, TD)
    → Compare with the Sutton & Barto chapters on the same topics
    → Note where mathematical insight changes your understanding
Weeks 5-6: Chapters 7-10 (function approximation, policy gradients, actor-critic)
    → Connect to modern deep RL (PPO, SAC) built on these foundations

Research Applications

  • Students working on RL-related projects get a rigorous reference for convergence properties and algorithm design rationale
  • The mathematical framework helps when reading RL papers that assume familiarity with Bellman operators, contraction mappings, and stochastic approximation theory
  • Gray-box sections serve as a bridge to more advanced references (Bertsekas, Puterman) for students heading into theory research
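The contraction-mapping property that so many RL papers take for granted can even be checked numerically: applying the Bellman optimality operator to two arbitrary value vectors shrinks their sup-norm distance by at least a factor of γ. The random MDP below is invented purely for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# Random MDP (invented): rows of P are normalized into valid distributions.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nA, nS))

def bellman_opt(v):
    """Bellman optimality operator: (Tv)(s) = max_a [R(s,a) + gamma * sum_s' p * v(s')]."""
    return (R + gamma * P @ v).max(axis=0)

u, v = rng.random(nS), rng.random(nS)
lhs = np.abs(bellman_opt(u) - bellman_opt(v)).max()
rhs = gamma * np.abs(u - v).max()
print(lhs <= rhs + 1e-12)  # prints True: T is a gamma-contraction in sup-norm
```

This one-liner inequality is what underwrites the convergence of value iteration and the uniqueness of the optimal value function, which the book proves via the Banach fixed-point theorem.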

Real-World Use Cases

| Use Case | How This Book Helps |
|---|---|
| Robotics control | Understanding why policy gradient methods converge (or don’t) in continuous action spaces |
| Game AI | Mathematical foundation for value iteration and Monte Carlo tree search variants |
| Recommendation systems | The Bellman equation framework applies directly to sequential recommendation as an MDP |
| LLM alignment (RLHF) | Policy gradient and actor-critic chapters provide the mathematical foundation for PPO — the algorithm behind RLHF |
| Operations research | Dynamic programming chapters connect RL to classical optimization |
| Autonomous driving | Function approximation theory explains when and why deep RL generalizes (or fails to) |

The RLHF connection is particularly timely: anyone working with LLM fine-tuning benefits from understanding why PPO works, not just how to call trl.PPOTrainer(). Chapters 9-10 on policy gradients and actor-critic methods provide exactly that foundation.
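The policy-gradient principle that PPO builds on fits in a few lines: REINFORCE with a softmax policy on a two-armed bandit. This toy (arm payoffs, step size, and constant baseline are all my own choices, and it is not PPO itself) shows the core mechanism of pushing probability toward actions with above-baseline reward:

```python
import math
import random

random.seed(0)
# Two-armed bandit: arm 1 pays off more often. REINFORCE with a constant
# baseline should raise the softmax probability of the better arm.
MEANS = (0.2, 0.8)       # per-arm success probabilities (invented)
theta = [0.0, 0.0]       # policy logits
ALPHA, BASELINE = 0.1, 0.5

def probs():
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(2000):
    p = probs()
    a = 0 if random.random() < p[0] else 1          # sample an action
    r = 1.0 if random.random() < MEANS[a] else 0.0  # Bernoulli reward
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - p (softmax identity)
    for i in range(2):
        theta[i] += ALPHA * (r - BASELINE) * ((1.0 if i == a else 0.0) - p[i])

print(round(probs()[1], 2))  # probability of the better arm grows toward 1
```

PPO's clipped surrogate objective is a stabilized descendant of exactly this update, which is why Chapters 9-10 transfer so directly to RLHF work.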

About the Author

Shiyu Zhao is an Associate Professor and Director of the Intelligent Unmanned Systems Laboratory at Westlake University, Hangzhou, China. He received his PhD in Electrical and Computer Engineering from the National University of Singapore in 2014. The book originated from his graduate-level RL lecture notes, developed since 2019.