How to Train Your GPT by raiyanyahya is a 12-chapter interactive ML textbook that takes a Python developer from zero ML background to a fully operational 151M-parameter LLM, using nothing more than a Jupyter Notebook and a pip install. The resource distinguishes itself by pairing 3,900+ lines of heavily annotated, runnable code with kindergarten-style analogies for every major concept, so the gap between "I read about Transformers" and "I trained one" closes in a single sitting rather than a curriculum.
*Source: github.com/raiyanyahya/how-to-train-your-gpt | Surfaced via Weibo (爱可可-爱生活), May 2026*
The problem it actually solves
Most LLM learning resources split into two failure modes: theory-only textbooks that explain attention mathematically but never produce a working model, and API tutorials that call openai.chat.completions.create() and call it a day. The result is an enormous cohort of developers who can describe Transformers in a job interview but cannot debug a training run, tune a learning rate schedule, or explain why KV cache exists.
How to Train Your GPT closes that gap by making the model itself the deliverable. Every chapter produces runnable output: not a diagram, not a quiz, but actual model weights doing actual inference. By Chapter 12, the reader has built and run a 151M-parameter LLM trained on real data, with a custom pipeline they assembled line by line.
What's inside
The textbook spans 12 chapters, each a sequence of Jupyter Notebook cells that builds on the previous chapter. The progression is linear: you cannot skip Chapter 5 and understand Chapter 6, which is the correct pedagogy for a topic this layered.
| Chapter | Topic | What you build |
|---|---|---|
| 1 | ML fundamentals | First tensor ops, loss functions, gradient descent by hand |
| 2 | Tokenization | BPE with tiktoken; vocabulary construction |
| 3 | Embeddings | Token + positional embeddings; intuition via analogy |
| 4 | Attention mechanism | Scaled dot-product attention, masking, multi-head assembly |
| 5 | Transformer block | Full encoder/decoder block; residual connections, LayerNorm |
| 6 | LLaMA architecture | Swapping standard Transformer for LLaMA design choices (RoPE, SwiGLU, RMSNorm) |
| 7 | Training pipeline | AdamW optimizer, cosine warmup schedule, mixed precision (fp16/bf16) |
| 8 | Inference & sampling | Temperature, top-k, top-p (nucleus), beam search; weight reset |
| 9 | KV cache | Why it exists; implementing and measuring the speedup |
| 10 | Fine-tuning with LoRA | Low-rank adapter injection; training a task-specific delta |
| 11 | Efficiency extensions | Flash Attention, Mixture of Experts (MoE) fundamentals |
| 12 | End-to-end run | 151M-parameter model: full train → eval → inference pipeline |
Setup is intentionally minimal:
pip install torch tiktoken
jupyter notebook
No cloud GPU required for the early chapters. The 151M training run benefits from a GPU but the code runs (slowly) on CPU so learners without hardware access can still follow along.
Key technical concepts covered
The textbook's pedagogical signature is explaining each concept at two levels simultaneously: a "five-year-old" analogy that builds the mental model, then the actual implementation that reveals why the analogy holds. The Transformer's core mechanisms map cleanly to this treatment:
INPUT SEQUENCE
                    │
                    ▼
┌─────────────────────────────────────┐
│ Token Embeddings                    │
│ + Positional Embeddings (RoPE)      │ ← "give every word a GPS coordinate"
└───────────────────┬─────────────────┘
                    │
                    ▼
┌─────────────────────────────────────┐
│ Multi-Head Self-Attention           │
│ Q · K^T / √d_k → softmax → · V      │ ← "which words should talk to which"
│ KV Cache stores past K,V            │
└───────────────────┬─────────────────┘
                    │
                    ▼
┌─────────────────────────────────────┐
│ Feed-Forward (SwiGLU in LLaMA)      │ ← "process each word in private"
│ + RMSNorm + Residual Connection     │
└───────────────────┬─────────────────┘
                    │
                    ▼
           [Repeat × N layers]
                    │
                    ▼
┌─────────────────────────────────────┐
│ Language Model Head                 │
│ Linear → softmax → sample token     │ ← temperature / top-k / top-p here
└─────────────────────────────────────┘
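The attention box in the diagram can be made concrete with a small, dependency-free sketch. This is not the textbook's code (the notebooks use PyTorch); it is a pure-Python illustration of single-head scaled dot-product attention with a causal mask. The function names `softmax` and `causal_attention` and the toy vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of vectors, one per token position.
    Position i may only attend to positions 0..i.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Score the query against visible keys only; skipping j > i is the causal mask.
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out.append([
            sum(w * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out

# Toy inputs (hypothetical): 3 tokens, 2-dimensional heads.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = causal_attention(Q, K, V)
```

The first token can only see itself, so its output equals its own value vector; later tokens produce a blend of every value vector up to and including their own position.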
Training pipeline concepts:
| Concept | What the notebook teaches |
|---|---|
| AdamW optimizer | Why weight decay is separated from the momentum update; parameter group setup |
| Cosine warmup | Linear ramp + cosine decay schedule; why cold-starting at full LR causes loss spikes |
| Mixed precision | fp16/bf16 forward pass + fp32 master weights; gradient scaling to avoid underflow |
| KV cache | Caching past key/value matrices to avoid O(n²) recomputation at inference time |
| LoRA | Freezing base weights; injecting low-rank matrices A·B into attention projections |
| MoE | Routing tokens to sparse expert networks; compute-vs-capacity tradeoff |
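The LoRA row can be illustrated with a few lines of plain Python. This is a hedged sketch rather than the notebook's implementation: it assumes the common convention where the frozen weight `W` is `d_out x d_in`, the trainable factors are `A` (`r x d_in`) and `B` (`d_out x r`, initialized to zero), and the update is scaled by `alpha / r`. The helper names `matvec`, `lora_forward`, and `lora_param_count` are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # rank-r path; the full d_out x d_in update is never built
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_count(d_in, d_out, r):
    """Trainable parameters in the two low-rank factors."""
    return r * (d_in + d_out)

# Toy sizes and values (assumptions for illustration only).
d_in, d_out, r, alpha = 4, 3, 2, 4.0
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen base weight
A = [[0.5, -0.5, 0.25, 0.0], [0.0, 1.0, 0.0, -1.0]]               # r x d_in
B = [[0.0] * r for _ in range(d_out)]                             # d_out x r, zero init
x = [1.0, 2.0, 3.0, 4.0]

y_start = lora_forward(W, A, B, x, alpha, r)  # B is zero, so this equals W x
B[0][0] = 1.0                                 # stand-in for one training update to B
y_after = lora_forward(W, A, B, x, alpha, r)

# For a 768 x 768 projection at rank 8, compare trainable parameter counts.
full_params = 768 * 768
adapter_params = lora_param_count(768, 768, 8)
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and the adapter for the 768 x 768 example trains 12,288 parameters instead of 589,824.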
How LearnAI Team Could Use This
CS coursework at Monmouth University. This resource is a direct fit for Q's AI education research agenda and existing course portfolio:
- CS-310 (OO Design): The LLaMA architecture implementation is a clean case study in modular design: each Transformer component (attention head, FFN block, embedding layer) is a well-bounded class with a defined interface. Assign chapters 4–6 as a design-pattern reading; ask students to diagram the class hierarchy before they read the code, then compare.
- CS-336 (Program Analysis for Security): The attention mask and KV cache chapters are concrete examples of where an off-by-one or incorrect dtype coercion produces silent wrong answers rather than an exception. Use as a source for "find the latent bug" exercises; students practice reading annotated code for correctness properties.
- AI education research (Q's LAI project): The textbook's two-level explanation style (analogy first, implementation second) is a testable pedagogical intervention. Q could run a small study comparing student comprehension outcomes (quiz, code modification task) between this resource and a conventional textbook chapter on attention. The Jupyter format makes pre/post instrumentation straightforward.
- Self-study assignment: Assign Chapters 1–7 as a 2-week self-study module before a lab session that involves prompting or fine-tuning. Students who have built the training loop understand why changing a hyperparameter matters; students who haven't treat the model as a black box.
- Formal verification angle (Q's research): The training pipeline (optimizer state, gradient accumulation, mixed-precision invariants) is an underexplored domain for lightweight formal specifications. How to Train Your GPT's annotated code is readable enough to serve as a candidate artifact for a student verification project, e.g., specifying and checking the cosine warmup monotonicity property or the LoRA rank constraint.
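As a starting point for the verification idea in the last bullet, the cosine warmup property can be stated and checked by brute force before attempting a formal proof. The sketch below is an assumption-laden illustration, not the textbook's scheduler: the function name `lr_at` and every hyperparameter value (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are hypothetical.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        # Linear ramp: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in range(1000)]
warmup, decay = schedule[:100], schedule[100:]

# Property 1: the learning rate strictly increases during warmup.
rises = all(a < b for a, b in zip(warmup, warmup[1:]))
# Property 2: the learning rate never increases during cosine decay.
falls = all(a >= b for a, b in zip(decay, decay[1:]))
# Property 3: the schedule stays within [min_lr, max_lr] after warmup (small float tolerance).
bounded = all(3e-5 - 1e-12 <= lr <= 3e-4 + 1e-12 for lr in decay)
```

An exhaustive check like this only covers one configuration; a student project could generalize the three properties to arbitrary valid hyperparameters and discharge them with a property-based tester or a proof assistant.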
Real-World Use Cases
| Scenario | How to use |
|---|---|
| Junior ML engineer ramp-up | Assign chapters 1–7 as a 2-week onboarding track; by the end they can read and modify a real training script without hand-holding |
| "I've used GPT but don't know how it works" developer | Start at Chapter 3 (embeddings); the analogy style means no calculus background required for intuition-building |
| Fine-tuning project prep | Chapter 10 (LoRA) is a standalone reference; read it before applying LoRA to a domain-specific model so you understand what the adapter is actually doing |
| Debugging a training run | Chapters 7–9 (optimizer, sampling, KV cache) give enough mechanistic understanding to diagnose loss spikes, incoherent outputs, and inference slowdowns |
| Course lab material | Each chapter is a Jupyter Notebook; repackage individual chapters as graded lab exercises; the annotated code gives students enough scaffolding to modify without being lost |
| Interview preparation | Working through the full 12 chapters produces genuine "I built a 151M-parameter LLM" credentials that hold up under technical questioning |
Important things to know
- 3,900+ lines of annotated code is the actual value. The explanations are good, but what makes this resource different from a blog post is that every claim is demonstrated by runnable code in the same cell. If the annotation says "this prevents gradient underflow," the next cell shows the effect of removing it.
- The LLaMA architecture choice is deliberate and current. Most "build a GPT from scratch" tutorials implement the original 2017 Transformer. This textbook implements LLaMA design choices (RoPE, SwiGLU, RMSNorm), closer to what production models actually look like in 2025–2026.
- Chapter 12's 151M run requires real compute. Chapters 1–11 are CPU-viable (slowly). The full end-to-end training run benefits from a CUDA GPU. Google Colab (free tier) can handle it with patience; a local GPU is faster. Plan accordingly before assigning as coursework.
- The "five-year-old analogy" style is a feature, not a sign of shallowness. The textbook uses plain-language analogies to bootstrap intuition, then immediately drops into real PyTorch. The analogies are scaffolding, not a substitute for the math. Students who want the derivations will need a supplementary resource (e.g., Karpathy's nanoGPT or the original Attention Is All You Need paper).
- LoRA and MoE chapters are introductory. Chapters 10–11 build genuine understanding of what LoRA and MoE are doing, but they are not production fine-tuning guides. For deployment-grade LoRA workflows, follow up with resources like Hugging Face PEFT or Axolotl.
- No hosted version: everything runs locally. There is no Colab link in the README; you clone the repo and run the notebooks yourself. This is intentional (the install is two packages), but instructors who want to run it in a managed lab environment will need to set up their own Jupyter server or containerize it.
- Companion resources in this wiki:
  - Paper-Code Joint Analysis & Contract-Driven Skill Design: how to pair a paper (like "Attention Is All You Need") with code like this textbook
  - Autoresearch: 100 Autonomous ML Experiments Overnight: what to do with a working training pipeline once you have one
  - Claude Code 101, Anthropic's Official Onboarding Course: pairs well if you want to use Claude Code to extend or debug the notebooks
  - Anthropic Academy, 13 Free Claude Courses, 12-Week Roadmap: broader curriculum context for AI learning resources