A single file. 200 lines. No dependencies. No PyTorch, no NumPy, no tensors: just pure Python scalars. MicroGPT by Andrej Karpathy implements a complete GPT-style language model that trains and generates text, exposing every mechanism that powers ChatGPT in code you can read in one sitting.
*Sources: Karpathy's blog post | SNES-GPT assembly port*
What MicroGPT Teaches
The 200 lines contain six complete components, the same six that exist in every production LLM:
1. Dataset: 32K names, one per line
2. Tokenizer: character-level, a-z plus a BOS token = 27 tokens (sketched below)
3. Autograd: a custom `Value` class with `backward()`
4. Architecture: embeddings, attention, MLP, RMSNorm
5. Training: cross-entropy loss + the Adam optimizer
6. Inference: autoregressive sampling with temperature
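The tokenizer needs nothing more than a character-to-id table. A minimal sketch (the exact id assignment is an assumption, not necessarily MicroGPT's):

```python
# Character-level tokenizer: 26 lowercase letters + a BOS marker = 27 tokens.
BOS = 0  # assumed id for the BOS token
stoi = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # "emma" -> [BOS, e, m, m, a, BOS]; BOS doubles as the end-of-name marker
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(ids):
    return "".join(itos[i] for i in ids if i != BOS)
```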
The Autograd Engine
Every operation records its local derivative. One loss.backward() call chains them all via the multivariable chain rule:
```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                  # scalar payload
        self.grad = 0                     # d(loss)/d(this value), filled in by backward()
        self._children = children         # Values this one was computed from
        self._local_grads = local_grads   # local derivative w.r.t. each child
```
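The rest of the class defines operator overloads that record local derivatives; `backward()` then walks the graph in reverse. A sketch of what that traversal typically looks like (micrograd-style; MicroGPT's exact implementation may differ):

```python
    def backward(self):
        # Topologically sort the graph so each node is visited before the nodes
        # it was computed from, then apply the chain rule node by node.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad  # chain rule: accumulate into children
```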
The Architecture (4,192 Parameters)
Same structure as GPT-2, miniaturized:
```
Input token
     ↓
Token Embedding (16-dim) + Position Embedding
     ↓
┌───────────────────────────────────┐
│ RMSNorm                           │
│ Multi-Head Attention (4 heads)    │←── residual connection
│ RMSNorm                           │
│ MLP (16 → 64 → 16)                │←── residual connection
└───────────────────────────────────┘
     ↓
Output projection → 27 logits → softmax → next token
```
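The RMSNorm blocks in the diagram are the simplest piece: scale the vector by the reciprocal of its root mean square, then by a per-channel weight. A minimal sketch (whether MicroGPT includes a learned gain, and the eps value, are assumptions):

```python
import math

def rmsnorm(x, weight, eps=1e-5):
    # Scale each component by 1/rms(x), then by its learned weight.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [w * v * inv for w, v in zip(weight, x)]
```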
Key insight: Attention is a token communication mechanism (tokens look at each other). MLP is computation (each token thinks independently).
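To make the "communication" half concrete, here is a minimal causal single-head attention written over plain Python lists (all names and shapes are illustrative, not MicroGPT's actual code):

```python
import math

def attention_head(xs, Wq, Wk, Wv):
    # xs: list of token vectors (lists of floats); Wq/Wk/Wv: weight matrices as lists of rows.
    def matvec(W, v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]
    qs = [matvec(Wq, x) for x in xs]   # what each token is looking for
    ks = [matvec(Wk, x) for x in xs]   # what each token offers
    vs = [matvec(Wv, x) for x in xs]   # what each token communicates
    d = len(qs[0])
    outs = []
    for t in range(len(xs)):
        # Causal mask: token t only looks at positions 0..t.
        scores = [sum(q * k for q, k in zip(qs[t], ks[s])) / math.sqrt(d) for s in range(t + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        w = [wi / total for wi in w]
        # Output for token t = attention-weighted average of earlier tokens' values.
        outs.append([sum(w[s] * vs[s][j] for s in range(t + 1)) for j in range(d)])
    return outs
```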
Training
```python
# Tokenize: "emma" -> [BOS, e, m, m, a, BOS]
# Forward: feed tokens through the model, building the KV cache as we go
# Loss: cross-entropy = -log(probability assigned to the correct next token)
loss_t = -probs[target_id].log()   # loss at one position
loss = (1 / n) * sum(losses)       # average over the n positions in the name
loss.backward()                    # one call computes ALL gradients
# Adam optimizer then updates the 4,192 parameters
```
Loss drops from ~3.3 (random guessing among 27 tokens: -ln(1/27) ≈ 3.30) to ~2.37 over 1,000 steps.
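For reference, one Adam step over scalar parameters can be sketched like this (hyperparameter values are common defaults, not necessarily the ones MicroGPT uses):

```python
import math

def adam_step(params, step, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    # params: Value objects; each carries its own running moments m and v.
    for p in params:
        m = beta1 * getattr(p, "m", 0.0) + (1 - beta1) * p.grad       # 1st moment (mean of grads)
        v = beta2 * getattr(p, "v", 0.0) + (1 - beta2) * p.grad ** 2  # 2nd moment (mean of squared grads)
        p.m, p.v = m, v
        m_hat = m / (1 - beta1 ** step)                               # bias correction for early steps
        v_hat = v / (1 - beta2 ** step)
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps)               # parameter update
        p.grad = 0                                                    # reset before the next backward pass
```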
Inference
```python
temperature = 0.5
token_id = BOS
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)        # forward pass for one token, reusing the KV cache
    probs = softmax([l / temperature for l in logits])  # lower temperature = sharper, safer samples
    token_id = random.choices(range(vocab_size), weights=probs)[0]  # sample the next token
```
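The softmax call is just exponentiate-and-normalize. A numerically stable helper over plain floats might look like this (a sketch; MicroGPT's own version may differ):

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating so exp() never overflows.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```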
Generates plausible names: "kamon," "karai," "vialan." These are not memorized from the training data but sampled from the learned distribution.
MicroGPT vs Production LLMs
| Aspect | MicroGPT | Production (GPT-4, Claude) |
|---|---|---|
| Data | 32K names | Trillions of tokens |
| Tokenizer | 27 characters | 100K+ subword vocabulary |
| Compute | Python scalars | GPU/TPU tensor operations |
| Parameters | 4,192 | 100B+ |
| Layers | 1 | 100+ |
| Training time | ~1 minute | Months on thousands of GPUs |
| Post-training | None | SFT + RLHF + constitutional AI |
The difference is scale, not mechanism. The same six components exist in both.
Community Example: SNES-GPT (MicroGPT on a Super Nintendo)
Vincent Abruzzo ported MicroGPT to 65816 assembly and runs it on actual SNES hardware, proving that a transformer is just math, executable even on a 3.58 MHz processor from 1990.
Source: github.com/vabruzzo/snes-gpt (55 stars)
How It Works
| Aspect | Detail |
|---|---|
| Processor | 3.58 MHz 65816 CPU |
| Arithmetic | Q8.8 fixed-point (8 integer bits, 8 fractional bits) |
| Multiplier | SNES hardware multiplier registers at $4202/$4203 |
| RAM | ~1KB WRAM for working buffers |
| Weights | 8KB stored directly in ROM cartridge |
| Parameters | 4,064 |
| Output | 20 generated names displayed via SNES graphics processor |
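The Q8.8 row is worth a quick illustration: a value x is stored as the 16-bit integer round(x · 256), so the product of two values is Q16.16 and must be shifted right by 8. A rough Python sketch (unsigned only; the real routine in math.asm must also handle sign):

```python
def to_q88(x):
    # Encode a real number as Q8.8: 8 integer bits, 8 fractional bits.
    return int(round(x * 256)) & 0xFFFF

def q88_mul(a, b):
    # 32-bit product of two Q8.8 values is Q16.16; shift right 8 to return to Q8.8.
    return ((a * b) >> 8) & 0xFFFF

assert q88_mul(to_q88(1.5), to_q88(2.0)) == to_q88(3.0)  # 0x0180 * 0x0200 -> 0x0300
```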
The Build Pipeline
```
make
# 1. Trains the model in Python (~500 steps)
# 2. Quantizes weights to Q8.8 fixed-point
# 3. Generates exp() and inverse-sqrt lookup tables (256 entries each)
# 4. Assembles the 65816 source files (ca65)
# 5. Links the final ROM (ld65) → build/snes_gpt.sfc
# 6. Run in the Snes9x emulator
```
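Step 3 matters because the 65816 has no floating-point unit, let alone exp() or sqrt: both functions become 256-entry tables baked into the ROM. A table generator might look roughly like this (input range, rounding, and encoding are assumptions, not snes-gpt's actual script):

```python
import math

def build_exp_table(entries=256, lo=-8.0, hi=0.0):
    # Hypothetical: map evenly spaced inputs in [lo, hi] to exp(x) encoded as Q8.8.
    table = []
    for i in range(entries):
        x = lo + (hi - lo) * i / (entries - 1)
        table.append(int(round(math.exp(x) * 256)) & 0xFFFF)
    return table
```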
Assembly Source Structure
```
src/
├── main.asm        - SNES init, ROM header
├── gpt.asm         - complete forward pass
├── math.asm        - Q8.8 arithmetic (multiply, divide, exp, rsqrt)
├── vector.asm      - linear algebra primitives
├── inference.asm   - generation loop + display
└── snes.inc        - hardware register definitions
```
Debugging War Stories
The assembly port surfaced bugs that would never appear in Python:
- Register width mismatch: `.a16` (16-bit accumulator) without `.i16` (16-bit index) caused 8-bit index instructions, misaligning the entire instruction stream
- PRNG failure: a byte-swap (`xba`) substituted for a left-shift produced a degenerate cycle in which all 20 generated names were identical
- Accumulator collision: the fixed-point multiply and the dot product shared zero-page variables, erasing partial sums mid-computation
The Philosophical Point
"The model doesn't learn explicit rules, it learns a probability distribution."
MicroGPT reveals that LLMs perform no magic: they're a big math function mapping input tokens to a probability distribution over the next token. Understanding these 200 lines gives you genuine insight into how ChatGPT works. The SNES port proves the point further: if a 1990 game console can run a transformer, the mechanism really is just arithmetic.
How LearnAI Team Could Use This
- Transformer fundamentals lesson: use MicroGPT as a readable from-scratch walkthrough before students move to PyTorch or larger model codebases.
- Autograd teaching module: have learners trace one forward and one backward pass through the custom `Value` class to understand how gradients flow.
- Architecture comparison lab: compare MicroGPT's tiny GPT-style stack against production LLM diagrams to separate mechanism from scale.
- Systems thinking exercise: use the SNES-GPT port to show how model execution depends on arithmetic, memory layout, and hardware constraints.
Real-World Use Cases
- AI education workshops: instructors can teach the full training loop without hiding the mechanics behind tensor libraries.
- Debugging intuition for LLM engineers: practitioners can inspect attention, loss, sampling, and optimizer behavior in a minimal codebase.
- Hardware-aware ML demos: the SNES port is a concrete example for lessons on quantization, fixed-point math, and constrained inference.
- From-scratch study groups: learners can reimplement or modify one component at a time (tokenizer, autograd, attention, optimizer, or sampler).
Links
- MicroGPT blog post: karpathy.github.io/2026/02/12/microgpt
- SNES-GPT: github.com/vabruzzo/snes-gpt
- Related: Karpathy's "End of Coding" talk