Karpathy's MicroGPT β€” A Complete GPT in 200 Lines of Pure Python

A single file. 200 lines. No dependencies. No PyTorch, no NumPy, no tensors β€” just pure Python scalars. MicroGPT by Andrej Karpathy implements a complete GPT-style language model that trains and generates text, exposing every mechanism that powers ChatGPT in code you can read in one sitting.

Source: Andrej Karpathy's MicroGPT blog post; the SNES-GPT assembly port is covered below as a community example.

What MicroGPT Teaches

The 200 lines contain six complete components β€” the same six that exist in every production LLM:

1. Dataset        β†’ 32K names, one per line
2. Tokenizer      → Character-level (a-z + BOS = 27 tokens; see the sketch below)
3. Autograd       β†’ Custom Value class with backward()
4. Architecture   β†’ Embeddings β†’ Attention β†’ MLP β†’ RMSNorm
5. Training       β†’ Cross-entropy loss + Adam optimizer
6. Inference      β†’ Autoregressive sampling with temperature
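
The character-level tokenizer (component 2) fits in a few lines. A minimal sketch along these lines; the exact token ids and helper names here are illustrative, not necessarily Karpathy's:

# Hypothetical character-level tokenizer: 26 letters + BOS = 27 tokens
chars = list("abcdefghijklmnopqrstuvwxyz")
BOS = 0                                              # beginning/end-of-name marker
stoi = {ch: i + 1 for i, ch in enumerate(chars)}     # 'a' -> 1 ... 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    # "emma" -> [0, 5, 13, 13, 1, 0]: BOS delimits both ends of a name
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(token_ids):
    return "".join(itos[t] for t in token_ids if t != BOS)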

The Autograd Engine

Every operation records its local derivative. One loss.backward() call chains them all via the multivariable chain rule:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                      # scalar computed in the forward pass
        self.grad = 0                         # d(loss)/d(this value), filled in by backward()
        self._children = children             # the Values this one was computed from
        self._local_grads = local_grads       # d(this)/d(child), one entry per child
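
Because the constructor stores, for every child, the local derivative of this node with respect to that child, backward() only has to walk the graph in reverse topological order and apply the chain rule. A minimal sketch consistent with that constructor, with one operator added and the constructor repeated so the snippet runs on its own; the real microGPT defines many more operators and its details may differ:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __mul__(self, other):
        # d(a*b)/da = b.data, d(a*b)/db = a.data
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order nodes so every Value appears after everything it depends on
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                       # d(loss)/d(loss) = 1
        for node in reversed(topo):           # walk from the loss back to the leaves
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad   # chain rule, accumulated per child

a, b = Value(2.0), Value(3.0)
loss = a * b
loss.backward()
print(a.grad, b.grad)   # 3.0 2.0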

The Architecture (4,192 Parameters)

Same structure as GPT-2, miniaturized:

Input token
    ↓
Token Embedding (16-dim) + Position Embedding
    ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  RMSNorm                        β”‚
β”‚  Multi-Head Attention (4 heads) │──→ residual connection
β”‚  RMSNorm                        β”‚
β”‚  MLP (16 β†’ 64 β†’ 16)            │──→ residual connection
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    ↓
Output projection β†’ 27 logits β†’ softmax β†’ next token
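
RMSNorm, which appears twice in the block above, is only a few lines. A minimal sketch on plain floats; the eps constant and the optional learned gain are conventional choices, not necessarily microGPT's exact ones:

import math

def rms_norm(x, gain=None, eps=1e-5):
    # Rescale x so its root-mean-square is ~1; no mean subtraction, unlike LayerNorm
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    out = [v * inv for v in x]
    return [g * v for g, v in zip(gain, out)] if gain is not None else out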

Key insight: Attention is a token communication mechanism (tokens look at each other). MLP is computation (each token thinks independently).
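
A stripped-down sketch of that communication step for a single query token, using plain floats and a single head; the real model works on Value objects, uses 4 heads over its 16-dim vectors, and reads keys and values from the KV cache built during the forward pass:

import math

def attend(query, keys, values):
    # query: vector for the current token; keys/values: one vector per earlier position
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # softmax turns scores into attention weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # the token's new representation is a weighted blend of past value vectors
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]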

Training

# Tokenize one name: "emma" -> [BOS, e, m, m, a, BOS]
# Forward: feed the tokens through the model one position at a time, building a KV cache
# Loss: cross-entropy = -log(probability assigned to the correct next token)
loss_t = -probs[target_id].log()   # per-position loss
loss = (1 / n) * sum(losses)       # mean over the n positions in the sequence
loss.backward()                    # one call computes ALL gradients
# Adam optimizer then updates all 4,192 parameters

Loss drops from ~3.3 (random guessing among 27 tokens) to ~2.37 over 1,000 steps.
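
The optimizer step itself is only a few lines per parameter. A minimal Adam sketch, assuming each parameter is a Value with .data and .grad, and using standard Adam defaults rather than microGPT's exact hyperparameters:

import math

def adam_step(params, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # params: list of Value objects; m, v: per-parameter moment estimates; t: 1-based step count
    for i, p in enumerate(params):
        g = p.grad
        m[i] = beta1 * m[i] + (1 - beta1) * g            # running mean of the gradient
        v[i] = beta2 * v[i] + (1 - beta2) * g * g        # running mean of the squared gradient
        m_hat = m[i] / (1 - beta1 ** t)                  # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps)  # parameter update
        p.grad = 0                                       # reset for the next step

Adam keeps two running averages per parameter, so even this tiny model carries roughly 2 × 4,192 extra optimizer scalars alongside the weights.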

Inference

import random

temperature = 0.5                     # < 1 sharpens the distribution toward likely tokens
token_id = BOS                        # start from the beginning-of-sequence token
for pos_id in range(block_size):
    logits = gpt(token_id, pos_id, keys, values)        # forward pass, reusing the KV cache
    probs = softmax([l / temperature for l in logits])  # 27 next-token probabilities
    token_id = random.choices(range(vocab_size), weights=probs)[0]  # sample the next token
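
The softmax helper isn't shown in the snippet; here is a minimal version, plus a quick illustration of how dividing the logits by temperature = 0.5 sharpens the distribution (the logits are made up, not from the trained model):

import math

def softmax(xs):
    m = max(xs)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # illustrative logits for three tokens
print(softmax(logits))                           # ~[0.66, 0.24, 0.10]
print(softmax([l / 0.5 for l in logits]))        # ~[0.86, 0.12, 0.02] -> sharper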

The model generates plausible names such as “kamon,” “karai,” and “vialan”: not memorized from the training data, but sampled from the learned distribution.

MicroGPT vs Production LLMs

Aspect          MicroGPT           Production (GPT-4, Claude)
Data            32K names          Trillions of tokens
Tokenizer       27 characters      100K+ subword vocabulary
Compute         Python scalars     GPU/TPU tensor operations
Parameters      4,192              100B+
Layers          1                  100+
Training time   ~1 minute          Months on thousands of GPUs
Post-training   None               SFT + RLHF + constitutional AI

The difference is scale, not mechanism. The same six components exist in both.

Community Example: SNES-GPT β€” MicroGPT on a Super Nintendo

Vincent Abruzzo ported MicroGPT to 65816 assembly and runs it on actual SNES hardware β€” proving that a transformer is just math, executable even on a 3.58 MHz processor from 1990.

Source: github.com/vabruzzo/snes-gpt (55 stars)

How It Works

Aspect        Detail
Processor     3.58 MHz 65816 CPU
Arithmetic    Q8.8 fixed-point (8 integer bits, 8 fractional bits)
Multiplier    SNES hardware multiplier at $4202/$4203
RAM           ~1KB WRAM for working buffers
Weights       8KB stored directly in the ROM cartridge
Parameters    4,064
Output        20 generated names displayed via the SNES graphics processor
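
Q8.8 just means each number is stored as a 16-bit integer equal to the real value times 256. A minimal Python sketch of the conversions and the multiply; the ROM computes the same product with the hardware multiplier, and signed/overflow handling is omitted here:

def to_q88(x):
    # 8 integer bits, 8 fractional bits -> store round(x * 256)
    return int(round(x * 256))

def from_q88(q):
    return q / 256.0

def q88_mul(a, b):
    # (a/256) * (b/256) = a*b / 65536; shifting the product right by 8 keeps the result in Q8.8
    return (a * b) >> 8

w, x = to_q88(1.5), to_q88(-0.25)
print(from_q88(q88_mul(w, x)))   # -0.375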

The Build Pipeline

make
# 1. Trains model in Python (~500 steps)
# 2. Quantizes weights to Q8.8 fixed-point
# 3. Generates exp() and inverse sqrt lookup tables (256 entries each)
# 4. Assembles 65816 source files (ca65)
# 5. Links final ROM (ld65) β†’ build/snes_gpt.sfc
# 6. Run in Snes9x emulator
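
Step 3 exists because the 65816 can't cheaply evaluate exp() or an inverse square root at run time, so both are baked into 256-entry tables at build time. A rough sketch of generating the exp table; the input range, spacing, and Q8.8 output format here are assumptions for illustration, not values taken from the repo:

import math

ENTRIES = 256
X_MIN, X_MAX = -8.0, 0.0          # assumed input range covered by the table

exp_table = []
for i in range(ENTRIES):
    x = X_MIN + (X_MAX - X_MIN) * i / (ENTRIES - 1)   # evenly spaced sample points
    exp_table.append(int(round(math.exp(x) * 256)))   # store exp(x) in Q8.8

# At run time the assembly maps an input to the nearest index and reads the table;
# a build script would emit these values as assembler data statements.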

Assembly Source Structure

src/
β”œβ”€β”€ main.asm       β€” SNES init, ROM header
β”œβ”€β”€ gpt.asm        β€” Complete forward pass
β”œβ”€β”€ math.asm       β€” Q8.8 arithmetic (multiply, divide, exp, rsqrt)
β”œβ”€β”€ vector.asm     β€” Linear algebra primitives
β”œβ”€β”€ inference.asm  β€” Generation loop + display
└── snes.inc       β€” Hardware register definitions

Debugging War Stories

The assembly port surfaced bugs that would never appear in Python:

  • Register width mismatch β€” .a16 (16-bit accumulator) without .i16 (16-bit index) caused 8-bit index instructions, misaligning the entire instruction stream
  • PRNG failure β€” byte-swap (xba) substituted for left-shift produced a degenerate cycle where all 20 names were identical
  • Accumulator collision β€” fixed-point multiply and dot product shared zero-page variables, erasing sums mid-computation

The Philosophical Point

β€œThe model doesn’t learn explicit rules, it learns a probability distribution.”

MicroGPT reveals that LLMs perform no magic β€” they’re a big math function mapping input tokens to a probability distribution over the next token. Understanding these 200 lines gives you genuine insight into how ChatGPT works. The SNES port proves the point further: if a 1990 game console can run a transformer, the mechanism really is just arithmetic.

How LearnAI Team Could Use This

  • Transformer fundamentals lesson β€” Use MicroGPT as a readable from-scratch walkthrough before students move to PyTorch or larger model codebases.
  • Autograd teaching module β€” Have learners trace one forward and backward pass through the custom Value class to understand how gradients flow.
  • Architecture comparison lab β€” Compare MicroGPT’s tiny GPT-style stack against production LLM diagrams to separate mechanism from scale.
  • Systems thinking exercise β€” Use the SNES-GPT port to show how model execution depends on arithmetic, memory layout, and hardware constraints.

Real-World Use Cases

  1. AI education workshops β€” Instructors can teach the full training loop without hiding mechanics behind tensor libraries.
  2. Debugging intuition for LLM engineers β€” Practitioners can inspect attention, loss, sampling, and optimizer behavior in a minimal codebase.
  3. Hardware-aware ML demos β€” The SNES port is a concrete example for lessons on quantization, fixed-point math, and constrained inference.
  4. From-scratch study groups β€” Learners can reimplement or modify one component at a time: tokenizer, autograd, attention, optimizer, or sampler.