Stanford CS336 — Language Modeling from Scratch: The Complete Free LLM Curriculum

Stanford CS336 is a free, 17-lecture course that teaches you to build a language model from scratch — from byte-pair encoding tokenizers to RLHF alignment. Taught by Tatsunori Hashimoto and Percy Liang, it covers the entire LLM stack: data collection, transformer architectures, GPU optimization with Triton kernels, scaling laws, inference, and alignment. The Spring 2025 archive is fully available on YouTube (17 lectures), and Spring 2026 recordings are being posted as the course progresses. All assignments are on GitHub, and students build a working LLM by the end. This is the course that separates “API Callers” from “Architects.”

*Sources: CS336 Official Site, YouTube Playlist (2025), Stanford Online, GitHub Assignments*

Why This Course Matters

Most AI courses teach you to use models. CS336 teaches you to build them. The difference matters:

| | API Caller | Architect (CS336 grad) |
|---|---|---|
| Understands | Prompting, fine-tuning APIs | How transformers work at the GPU kernel level |
| Can do | Call models, build apps | Train models, optimize inference, design architectures |
| Bottleneck | API rate limits and pricing | Compute and data |
| Replaceability | High (low-leverage, automatable) | Low (deep systems knowledge) |
| Career leverage | Lower (commoditized) | Higher (scarce expertise) |

Instructors

| Instructor | Role | Known For |
|---|---|---|
| Tatsunori Hashimoto | Stanford CS faculty | Language model evaluation, alignment, distribution shift |
| Percy Liang | Stanford CS faculty | HELM benchmark, CRFM, foundation model transparency |

Both are leading researchers at Stanford’s Center for Research on Foundation Models (CRFM).

Course Structure

  • Format: 17 core lectures + 2 guest lectures (Mon/Wed, 80 min each) + 5 implementation-heavy assignments
  • Duration: March 30 – June 3, 2026
  • Prerequisites: Python + PyTorch proficiency, linear algebra, probability, ML fundamentals
  • Units: 5 (Stanford credit)

Complete Lecture Schedule

Block 1: Foundations (Lectures 1-4)

| # | Date | Topic | Instructor | What You Learn |
|---|---|---|---|---|
| 1 | Mar 30 | Overview, Tokenization | Percy | BPE tokenizer, course roadmap |
| 2 | Apr 1 | PyTorch, Resource Accounting | Percy | GPU memory, FLOP counting, profiling |
| 3 | Apr 6 | Architectures, Hyperparameters | Tatsu | Transformer variants, design choices |
| 4 | Apr 8 | Attention Alternatives, MoE | Tatsu | Linear attention, sparse attention, Mixture of Experts |

Block 2: Systems (Lectures 5-8)

| # | Date | Topic | Instructor | What You Learn |
|---|---|---|---|---|
| 5 | Apr 13 | GPUs, TPUs | Tatsu | Hardware architecture, memory hierarchy |
| 6 | Apr 15 | Kernels, Triton, XLA | Percy | Writing custom GPU kernels in Triton |
| 7 | Apr 20 | Parallelism I | Percy | Data parallelism, pipeline parallelism |
| 8 | Apr 22 | Parallelism II | Tatsu | Tensor parallelism, expert parallelism |

Block 3: Scaling (Lectures 9-11)

| # | Date | Topic | Instructor | What You Learn |
|---|---|---|---|---|
| 9 | Apr 27 | Scaling Laws | Tatsu | Chinchilla scaling, compute-optimal training |
| 10 | Apr 29 | Inference | Percy | KV caching, speculative decoding, quantization |
| 11 | May 4 | Scaling Laws II | Tatsu | Beyond Chinchilla, over-training |

Block 4: Data (Lectures 12-14)

| # | Date | Topic | Instructor | What You Learn |
|---|---|---|---|---|
| 12 | May 6 | Evaluation | Percy | Benchmarks, contamination, HELM |
| 13 | May 11 | Data: Sources, Transformation, Filtering | Percy | Common Crawl, deduplication, quality filtering |
| 14 | May 13 | Data: Mixing, Rewriting, SFT | Percy | Data recipes, instruction tuning data |

Block 5: Alignment (Lectures 15-17)

| # | Date | Topic | Instructor | What You Learn |
|---|---|---|---|---|
| 15 | May 18 | Alignment: RLHF/DPO | Tatsu | Reward models, PPO, Direct Preference Optimization |
| 16 | May 20 | Alignment: RL Algorithms | Tatsu | GRPO, online RL, reasoning via RL |
| 17 | May 27 | Alignment: RL Systems | Percy | Distributed RL training infrastructure |

Guest Lectures: Daniel Selsam (Jun 1), Dan Fu (Jun 3)

The 5 Assignments

Each assignment is implementation-heavy — you write real code, not just answer questions.

Assignment 1: Basics (Due Apr 15)

Build the core components from scratch:

  • Byte-pair encoding (BPE) tokenizer (see the sketch after this list)
  • Transformer language model architecture
  • Cross-entropy loss function
  • Training loop with optimizer
  • Train a minimal working model
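
To make the tokenizer bullet concrete, here is a minimal sketch of one BPE training step on a toy word-frequency corpus: count symbol pairs, pick the most frequent, merge it everywhere. The corpus, the helper names (`most_frequent_pair`, `merge_pair`), and the word-level (rather than byte-level) representation are illustrative assumptions, not the assignment's actual interface.

```python
from collections import Counter

def most_frequent_pair(corpus):
    # corpus: dict mapping a word (tuple of symbols) to its frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with one merged symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# One training step: find the best pair, then merge it everywhere.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
pair = most_frequent_pair(corpus)
print(pair, merge_pair(corpus, pair))
```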

What you learn: The transformer isn’t magic — it’s matrix multiplications, attention masks, and positional encodings you can implement in PyTorch.
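As a minimal illustration of that point, the sketch below implements causal scaled dot-product attention in plain PyTorch. It assumes inputs already projected into queries, keys, and values, and omits multi-head projections, dropout, and positional encodings; it is a sketch, not the assignment's reference implementation.

```python
import math
import torch

def causal_self_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim), already projected.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Causal mask: position i may only attend to positions j <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 8, 16)  # toy batch: 4 heads, 8 tokens, head_dim 16
print(causal_self_attention(q, k, v).shape)  # torch.Size([1, 4, 8, 16])
```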

Assignment 2: Systems (Due Apr 29)

Optimize the transformer for real hardware:

  • Implement FlashAttention-2 in Triton (custom GPU kernels)
  • Profile GPU utilization and memory
  • Set up distributed training across multiple GPUs

What you learn: The difference between “runs on GPU” and “runs efficiently on GPU” is 10-100x. Kernel-level optimization is where Architects separate from API Callers.
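FlashAttention itself is too long to show here, but the Triton programming model it relies on looks like this: a JIT-compiled kernel that processes one block of elements per program instance. The vector-add kernel below is a minimal sketch of that model under those assumptions, not part of the actual assignment; names like `add_kernel` are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)
```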

Assignment 3: Scaling (Due May 6)

Predict before you train:

  • Analyze how components (architecture, data, compute) affect performance
  • Fit scaling law curves from experiments
  • Project the optimal model size for a given compute budget, using training runs queried from the course-provided API

What you learn: You don’t need to train every model to know which one will work. Scaling laws let you predict performance from small experiments.
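
As a sketch of the curve-fitting step, the snippet below fits a saturating power law L(N) ≈ E + A·N^(−α) to hypothetical (parameter count, loss) pairs and extrapolates it to a larger model; the data points and starting guesses are made up for illustration and are not from the assignment.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs from small-scale runs.
sizes = np.array([1e7, 3e7, 1e8, 3e8])
losses = np.array([4.2, 3.8, 3.4, 3.1])

def power_law(n, e, a, alpha):
    # L(N) ~ E + A * N^(-alpha): irreducible loss plus a power-law term.
    return e + a * np.power(n, -alpha)

params, _ = curve_fit(power_law, sizes, losses, p0=[2.0, 50.0, 0.2], maxfev=10_000)
e, a, alpha = params
print(f"Fitted alpha = {alpha:.3f}")
print(f"Extrapolated loss at 1B params: {power_law(1e9, e, a, alpha):.2f}")
```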

Assignment 4: Data (Due May 20)

Build a real pre-training data pipeline:

  • Process Common Crawl web data
  • Implement quality filtering heuristics
  • Deduplicate at document and paragraph level

What you learn: Data quality determines model quality. The best architecture trained on bad data loses to a mediocre architecture trained on clean data.
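
As a toy illustration of the deduplication step, the sketch below drops exact duplicate paragraphs by hashing normalized text. Real pipelines typically add fuzzy methods such as MinHash, which this example does not show; the function name and toy documents are assumptions for illustration.

```python
import hashlib

def dedup_paragraphs(documents):
    # Keep only the first occurrence of each normalized paragraph across the corpus.
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            normalized = " ".join(para.lower().split())
            key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if normalized and key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n\n".join(kept))
    return cleaned

docs = ["Hello world.\n\nSame boilerplate footer.", "Another page.\n\nSame boilerplate footer."]
print(dedup_paragraphs(docs))  # the footer survives only in the first document
```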

Assignment 5: Alignment & RL (Due Jun 3)

Make the model useful and safe:

  • Supervised fine-tuning (SFT) on instruction data
  • Reinforcement learning for reasoning tasks
  • Optional: DPO safety methods

What you learn: Pre-training gives the model capability. Alignment gives it behavior. RLHF/DPO is how you go from “predicts next token” to “follows instructions helpfully.”
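
To make the DPO part concrete, here is a minimal sketch of the DPO objective: given per-sequence log-probabilities of chosen and rejected responses under the policy and a frozen reference model, it maximizes the margin between their implicit rewards. The tensor shapes and the beta value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: prefer chosen over rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs (per-sequence log-probabilities).
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.0, -11.0]),
                torch.tensor([-11.0, -13.0, -12.0, -10.0]),
                torch.tensor([-10.5, -12.5, -9.5, -11.5]),
                torch.tensor([-10.5, -12.5, -11.5, -10.5]))
print(loss.item())
```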

Compute Resources

Students need GPU access. Sponsored and recommended options are listed below; prices are those on the CS336 site for a single B200 GPU (as of March 28, 2026):

| Provider | Cost/hour | Notes |
|---|---|---|
| Modal (sponsor) | $6.25 | $30 free monthly credit |
| Lambda Labs | $6.69 | |
| RunPod | $4.99 | |
| Nebius | $5.50 on-demand, $3.05 preemptible | Preemptible is the cheapest option |
| Together AI | $7.49 | 8-GPU minimum |

How to Self-Study This Course

If you’re not a Stanford student, here’s the path:

1. Watch lectures on YouTube (free)
        ↓
2. Clone assignment repos from GitHub
        ↓
3. Get GPU access (Modal free tier or RunPod)
        ↓
4. Work through assignments 1-5 in order
        ↓
5. Compare your solutions with community repos

Community solutions: YYZhang2025/Stanford-CS336 has notes and solutions for self-study reference.

What Makes This Course Unique

| Feature | CS336 | Most AI Courses |
|---|---|---|
| Build from scratch | BPE tokenizer, transformer, training loop, all your own code | Use pre-built libraries |
| GPU kernels | Write FlashAttention in Triton | Call PyTorch functions |
| Real data pipeline | Process Common Crawl | Use pre-cleaned datasets |
| Scaling laws | Predict performance before training | Train and hope |
| Full alignment | SFT + RLHF + DPO | Maybe fine-tuning |
| Instructors | Percy Liang (CRFM director) + Tatsu Hashimoto (alignment/eval) | Varies |

Academic Policies

  • Collaboration: Study groups OK, individual code required, list group members
  • AI Tools: LLMs allowed for conceptual questions, but not for direct problem-solving; IDE autocomplete discouraged
  • Late Days: 6 total, max 3 per assignment

Real-World Use Cases

  • ML engineers transitioning to LLM roles — CS336 gives the systems-level understanding that distinguishes LLM specialists from general ML practitioners.
  • Startup founders — Understanding the full stack (data → training → alignment) lets you make informed build-vs-buy decisions for model development.
  • Research scientists — The assignments provide hands-on experience with scaling laws, data pipelines, and alignment techniques used in frontier model development.
  • Self-taught developers — Free YouTube lectures + GitHub assignments make this accessible to anyone with PyTorch experience and $100-200 for GPU compute.

How LearnAI Team Could Use This

  • Graduate-level AI curriculum supplement — Assign CS336 lectures as required watching for an advanced AI course. Students work through assignments 1-3 as homework.
  • LLM architecture deep-dive — Use lectures 3-4 (architectures, attention alternatives, MoE) as foundational material for understanding why models like SubQ and DeepSeek V4 make the architectural choices they do.
  • Systems literacy module — Lectures 5-8 (GPUs, kernels, parallelism) fill the gap between “I can use PyTorch” and “I understand what the GPU is actually doing.”
  • Research methodology — Assignment 3 (scaling laws) teaches students to predict experimental results before running expensive experiments — a transferable research skill.