MIT 6.5940 (formerly 6.S965): TinyML and Efficient Deep Learning Computing, taught by Song Han at MIT EECS, is the canonical graduate course on making deep learning fit on resource-constrained hardware: phones, microcontrollers, edge devices, and laptops. Song Han is the originator of Deep Compression (the pruning + quantization paper that defined the field) and the EIE accelerator, which first put weight sparsity into modern AI chips; his lab also produced MCUNet, AWQ, and SmoothQuant. The course runs ~26 lectures across four chapters: efficient inference (pruning / quantization / NAS / knowledge distillation / MCUNet), domain-specific optimization (LLMs / diffusion / vision transformers / GANs), efficient training (distributed + on-device), and advanced topics. The course is on hiatus for Fall 2025 due to Prof. Han's sabbatical; the previous-year recordings, slides, and labs remain freely available, and are among the most-cited free curricula for anyone who wants to actually deploy modern models off the cloud.
*Source: hanlab.mit.edu/courses/2024-fall-65940 (official course page) | efficientml.ai (open-lecture portal) | github.com/mit-han-lab (code + supplementary) | Song Han faculty page | YouTube: MIT HAN Lab*
Who Song Han is, and why it matters that this course is his
Song Han is an Associate Professor (with tenure) at MIT EECS who has spent over a decade producing the techniques the rest of the field now relies on for efficient inference:
| Contribution | What it gave the field |
|---|---|
| Deep Compression (ICLR 2016) | The original pruning + quantization + Huffman-coding pipeline; defined the model-compression vocabulary still used today |
| EIE (Efficient Inference Engine) | First accelerator architecture to exploit weight sparsity; influenced modern AI chip design |
| MCUNet (NeurIPS 2020) | Tiny deep learning on microcontrollers; brought neural networks to <512 KB devices |
| AWQ (Activation-aware Weight Quantization) | Now-standard technique for 4-bit LLM quantization |
| SmoothQuant | Post-training quantization that makes 8-bit weight and activation (W8A8) LLM inference practical |
When Han teaches the course, he's teaching his own field. The course pulls from his lab's actual production work, not surveys of other people's papers.
What the course covers (26 lectures, 4 chapters)
| Chapter | Lectures | Topics |
|---|---|---|
| I. Efficient Inference | 1-11 | Introduction & Deep Learning Basics (×2); Pruning & Sparsity (×2); Quantization (×2); Neural Architecture Search (×2); Knowledge Distillation (×1); MCUNet & TinyEngine (×2) |
| II. Domain-Specific Optimization | 12-18 | Transformers & LLMs; Efficient LLM Deployment; LLM Post-Training; Long-Context LLMs; Vision Transformers; GANs / Video / Point Clouds; Diffusion Models |
| III. Efficient Training | 19-21 | Distributed Training (×2); On-Device Training & Transfer Learning |
| IV. Advanced Topics + Projects | 22-26 | Course Summary & Quantum ML (×2); Final Project Presentations (×3) |
The lecture ratio is the clue to the course's identity: it spends about 40% of the semester (11 of 26 lectures) on inference-time optimization, about a quarter (7 lectures) on domain-specific techniques for LLMs, vision, and diffusion models, and only three lectures on training. This is intentional: Han's argument is that deployment, not training, is the bottleneck for real-world AI.
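The chapter shares can be checked directly from the lecture ranges in the outline above. A minimal Python sketch follows; the dictionary keys and the helper name `lecture_shares` are illustrative, and the ranges are copied from the outline in this entry rather than from the official syllabus:

```python
# Inclusive lecture ranges per chapter, as listed in the outline above.
chapters = {
    "I. Efficient Inference": (1, 11),
    "II. Domain-Specific Optimization": (12, 18),
    "III. Efficient Training": (19, 21),
    "IV. Advanced Topics + Projects": (22, 26),
}


def lecture_shares(ranges):
    """Return {chapter: (lecture_count, fraction_of_all_lectures)}."""
    counts = {name: last - first + 1 for name, (first, last) in ranges.items()}
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


shares = lecture_shares(chapters)
for name, (n, frac) in shares.items():
    print(f"{name}: {n} lectures ({frac:.0%})")
```

With these ranges, Chapter I accounts for 11 of 26 lectures (about 42%), Chapter II for 7 (about 27%), and Chapter III for 3 (about 12%).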
The five labs (each worth 15% of the grade)
| Lab | What you do | What you learn |
|---|---|---|
| Lab 1: Pruning | Implement magnitude pruning + iterative fine-tuning on a small CNN | Mechanical understanding of how sparsity actually changes a model |
| Lab 2: Quantization | Post-training quantization and quantization-aware training (PTQ + QAT) | Why 8-bit and 4-bit work in practice |
| Lab 3: Neural Architecture Search | Implement a small NAS loop | The compute / search-space trade-off |
| Lab 4: LLM Compression | Apply AWQ / SmoothQuant to a real LLM | Bridging "compression theory" to "LLM in deployment" |
| Lab 5: LLM Deployment on Laptop | Deploy Llama2-7B on your laptop | End-to-end deployment story: model → compression → runtime → device |
By the end, students have hands-on experience running a 7B-parameter LLM on consumer hardware, which most ML courses don't teach.
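To make Lab 1's technique concrete, here is a minimal sketch of one-shot magnitude pruning in plain Python. It is not the lab's code: the function name `magnitude_prune`, the flat weight list, and the example values are assumptions for illustration, and the lab itself works on PyTorch tensors.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of a flat weight list.

    Returns (pruned_weights, mask), where mask[i] is 1 for kept weights
    and 0 for pruned ones.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by |w| ascending; the first num_to_prune get removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical example weights, not taken from any real model.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.2, 0.1]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, -0.2, 0.0]
print(mask)    # [1, 0, 1, 0, 1, 0, 1, 0]
```

In iterative pruning with fine-tuning, a mask like this would typically be reapplied after every optimizer step so that pruned weights stay at zero while the remaining ones recover accuracy.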
Prerequisites
The course assumes:
- 6.191: Computation Structures (MIT's intro to computer architecture)
- 6.390: Intro to Machine Learning
This is not an entry-level course. It assumes you can read PyTorch, you know what a tensor is, and you have at least passing familiarity with how CPUs / GPUs / memory hierarchies work. If you're brand-new to ML, do Karpathy's Zero to Hero or AI Engineering from Scratch first.
How it positions against other Efficient AI / TinyML resources
| Resource | Strength | Where this course is different |
|---|---|---|
| Vendor-specific NVIDIA / Apple / Qualcomm tutorials | Hardware-specific | Vendor-neutral; teaches the underlying techniques |
| Generic Coursera / Udemy "Edge AI" courses | Broad introduction | Far deeper on the techniques; tied to working research code |
| Reading Deep Compression + AWQ papers directly | Authoritative | Same author, but with pedagogical scaffolding + labs you can actually run |
| Industry talks at NeurIPS workshops | Cutting edge | Curated and sequenced as a 4-month curriculum |
| LLM-deployment guides (e.g., HuggingFace docs) | Pragmatic | Adds the why (compression theory) under the how |
How to use this course if you can't enroll
The course is offered at MIT to MIT students for credit, but the materials are free and public. Here's how to actually use them:
- Watch the lecture recordings on the MIT HAN Lab YouTube channel. The 2023 / 2024 editions are complete and freely available.
- Download the slides from the per-lecture Dropbox links on the course site.
- Run the labs. They live on Google Colab and in the mit-han-lab GitHub org. All five are doable on a laptop (Lab 5 requires ~16 GB RAM for Llama2-7B).
- Read the linked papers. Each lecture cites 2-5 primary sources. The papers + the slides together are roughly a graduate-level reading list on efficient inference.
- Skip what you don't need. If you only care about LLM deployment, jump straight to Chapter II (lectures 12-13) + Lab 5.
Companion / derivative resources
The screenshot that surfaced this entry pointed to a Chinese-language Zhihu (知乎) column (θε·₯ε, a 13-lecture restructure of Han's material). That's one of several community ports; others include:
- classcentral.com indexes the YouTube lectures with searchable titles
- csdiy.wiki has a Chinese-language guide for "self-studying" MIT 6.5940
- Various Bilibili re-uploads with Chinese subtitles
If you read Chinese, the Zhihu / Bilibili community has done substantial work translating and re-explaining the lectures. If you read English, the MIT-direct YouTube + slides are the authoritative source.
How LearnAI Team Could Use This
- Default recommendation for any LearnAI member who needs to deploy a model below the cloud line: laptop / phone / browser / edge device. The labs are doable in a weekend if you have the prereqs.
- Reading list for CS-336 (Program Analysis for Security) if any student gravitates toward systems-level ML: pruning + quantization is a systems-and-program-analysis topic, not a pure ML topic.
- Faculty workshop on "where does the AI hype actually need to land?" This course is the answer in one curriculum: on real devices, which means the model gets cut down to fit. Use Chapter I as a 90-minute intro for colleagues skeptical that small models matter.
- Curriculum design model: Han's "inference-first, training-last" sequencing is a defensible alternative to the standard "training-first" ML pedagogy. Worth studying for any LearnAI course module that touches deployment.
- For research-KB integration: the per-lecture paper list is a structured reading list ready to drop into the Zotero / Obsidian KB pipeline.
Real-World Use Cases
| Scenario | How to use the course |
|---|---|
| Deploying an LLM on a laptop | Lab 5 walks through Llama2-7B end-to-end; substitute your model of choice |
| Compressing a model for an edge device | Chapter I + Lab 1 / Lab 2 give you the compression toolkit |
| Choosing a quantization scheme for production LLM serving | Lecture on AWQ + SmoothQuant; Han's own techniques explained by Han |
| Designing a hardware-aware ML system | Chapter I's MCUNet + TinyEngine lectures explicitly cover hardware/software co-design |
| Teaching graduate students about model compression | The slide deck is among the most polished free graduate ML materials available |
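The compression and quantization rows above rest on one basic operation: mapping floating-point weights onto a small integer grid and back. The sketch below shows plain symmetric round-to-nearest quantization with a single shared scale. It is deliberately not AWQ or SmoothQuant, which both add channel-wise scaling around this step; `quantize_symmetric` and `dequantize` are hypothetical helper names, and the weights are made-up values.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to nearest, then clamp to guard against floating-point overshoot.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical example weights, not taken from any real model.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.2, 0.1]
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: q={q} scale={scale:.5f} max_err={max_err:.5f}")
```

The round-trip error of each weight is bounded by half the scale, and for the same weight range the 4-bit scale is 127/7 (about 18 times) larger than the 8-bit one. That gap is part of why more careful schemes matter at 4 bits.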
Limitations and honest caveats
- Not currently being run live. Prof. Han is on sabbatical, so Fall 2025 doesn't run; the 2026 status isn't yet announced. The previous editions' materials remain available, but you don't get current-cohort interaction (Piazza, project feedback).
- Prereqs are real. Students new to ML or to systems will struggle. Do the prereqs first.
- Bias toward Han's own techniques. Deep Compression, EIE, MCUNet, AWQ, and SmoothQuant (all Han-lab work) get more coverage than competing techniques (e.g., GPTQ, ZeroQuant). Worth being aware of when comparing approaches.
- No formal license stated on the course page. Slides and recordings are freely available, but reuse permissions aren't explicit. For derivative teaching materials, ask the lab.
- Lab 5 requires substantial RAM (~16 GB for Llama2-7B with the compression pipeline). Older laptops will struggle.
- Heavily PyTorch. If your stack is JAX / TensorFlow, you'll need to translate. The techniques themselves are framework-neutral; the labs aren't.
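The RAM caveat is easier to reason about with a back-of-the-envelope estimate of weight storage at different precisions. The sketch below assumes a nominal 7 billion parameters and counts weights only (no KV cache, activations, or quantization metadata); `weight_memory_gb` is an illustrative helper, not part of any course material.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # nominal "7B"; the real parameter count differs slightly

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# FP32: 28.0 GB
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

This estimate does not reproduce the ~16 GB figure quoted above, which presumably also covers the compression pipeline and runtime overhead. It only shows how precision drives weight storage: FP16 weights alone come to about 14 GB, while 4-bit weights need about 3.5 GB.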
Important things to know
- The "efficiency stack" mental model the course establishes (compress the model, then schedule the compute, then pick the right runtime, then pick the right hardware) is the most reusable takeaway. Use it as the framework for any deployment decision you have to make later.
- AWQ is now a de facto standard for 4-bit LLM weight quantization. The lecture covering it explains why it works, not just how to call the library, which is invaluable when debugging.
- MCUNet is the unique offering in the course: almost no other free curriculum covers microcontroller-scale deep learning at this depth.
- The Llama2-7B lab is doable on a Mac with 16 GB RAM. Mid-2020s consumer hardware is enough.
- Companion deep-dives in this wiki:
  - Anthropic Academy: 13 Free AI Courses (sibling free-curriculum entry)
  - Codex Orange Book: 花叔's Bilingual Codex Reference (sibling free practitioner book)
  - LLM Architecture Gallery (visual reference for the model architectures this course teaches you to compress)
  - AI Engineering from Scratch: Karpathy's Curriculum (the "build the model" complement to Han's "shrink the model")
  - Building a Research KB: Zotero + Obsidian + Claude Code (where to put the per-lecture reading lists)