MIT 6.5940: Song Han's TinyML & Efficient Deep Learning Course

MIT 6.5940 (formerly 6.S965), TinyML and Efficient Deep Learning Computing, taught by Song Han at MIT EECS, is the canonical graduate course on making deep learning fit on resource-constrained hardware: phones, microcontrollers, edge devices, and laptops. Song Han is the originator of Deep Compression (the pruning + quantization paper that defined the field) and of the EIE accelerator, which first put weight sparsity into modern AI chips; his lab also produced MCUNet, AWQ, and SmoothQuant. The course runs ~26 lectures across four chapters: efficient inference (pruning / quantization / NAS / knowledge distillation / MCUNet), domain-specific optimization (LLMs / diffusion / vision transformers / GANs), efficient training (distributed + on-device), and advanced topics. The course is on hiatus for Fall 2025 due to Prof. Han's sabbatical; the previous-year recordings, slides, and labs remain freely available, and they are among the most-cited free curricula for anyone who wants to actually deploy modern models off the cloud.

*Source: hanlab.mit.edu/courses/2024-fall-65940 (official course page); efficientml.ai (open-lecture portal); github.com/mit-han-lab (code + supplementary); Song Han faculty page; YouTube: MIT HAN Lab*

Who Song Han is, and why it matters that this course is his

Song Han is an Associate Professor (with tenure) at MIT EECS who has spent over a decade producing the techniques the rest of the field now relies on for efficient inference:

| Contribution | What it gave the field |
|---|---|
| Deep Compression (ICLR 2016) | The original pruning + quantization + Huffman-coding pipeline; defined the model-compression vocabulary still used today |
| EIE: Efficient Inference Engine | First accelerator architecture to exploit weight sparsity; influenced modern AI chip design |
| MCUNet (NeurIPS 2020) | Tiny deep learning on microcontrollers; brought neural networks to <512 KB devices |
| AWQ (Activation-aware Weight Quantization) | Now-standard technique for 4-bit LLM quantization |
| SmoothQuant | Post-training quantization that makes 8-bit weight and activation (W8A8) LLM deployment practical |

When Han teaches the course, he's teaching his own field. The course pulls from his lab's actual production work, not surveys of other people's papers.

What the course covers (26 lectures, 4 chapters)

┌────────────────────────────────────────────────────────────────┐
│ Chapter I: Efficient Inference (Lectures 1-11)                 │
│  • Introduction & Deep Learning Basics    (×2)                 │
│  • Pruning & Sparsity                     (×2)                 │
│  • Quantization                           (×2)                 │
│  • Neural Architecture Search             (×2)                 │
│  • Knowledge Distillation                 (×1)                 │
│  • MCUNet & TinyEngine                    (×2)                 │
├────────────────────────────────────────────────────────────────┤
│ Chapter II: Domain-Specific Optimization (Lectures 12-18)      │
│  • Transformers & LLMs                                         │
│  • Efficient LLM Deployment                                    │
│  • LLM Post-Training                                           │
│  • Long-Context LLMs                                           │
│  • Vision Transformers                                         │
│  • GANs / Video / Point Clouds                                 │
│  • Diffusion Models                                            │
├────────────────────────────────────────────────────────────────┤
│ Chapter III: Efficient Training (Lectures 19-21)               │
│  • Distributed Training                   (×2)                 │
│  • On-Device Training & Transfer Learning                      │
├────────────────────────────────────────────────────────────────┤
│ Chapter IV: Advanced Topics + Projects (Lectures 22-26)        │
│  • Course Summary & Quantum ML            (×2)                 │
│  • Final Project Presentations            (×3)                 │
└────────────────────────────────────────────────────────────────┘

The lecture ratio is the clue to the course's identity: it spends nearly half the semester (11 of 26 lectures) on inference-time optimization, about a quarter (7 lectures) on LLM-, vision-, and diffusion-specific techniques, and only three lectures on training. This is intentional: Han's argument is that deployment, not training, is the bottleneck for real-world AI.

The five labs (each worth 15% of the grade)

| Lab | What you do | What you learn |
|---|---|---|
| Lab 1: Pruning | Implement magnitude pruning + iterative fine-tuning on a small CNN | Mechanical understanding of how sparsity actually changes a model |
| Lab 2: Quantization | Post-training and quantization-aware training (PTQ + QAT) | Why 8-bit and 4-bit work in practice |
| Lab 3: Neural Architecture Search | Implement a small NAS loop | The compute / search-space trade-off |
| Lab 4: LLM Compression | Apply AWQ / SmoothQuant to a real LLM | Bridging "compression theory" to "LLM in deployment" |
| Lab 5: LLM Deployment on Laptop | Deploy Llama2-7B on your laptop | End-to-end deployment story: model → compression → runtime → device |

By the end, students have hands-on experience running a 7B-parameter LLM on consumer hardware, which most ML courses don't teach.
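
As a rough illustration of what Lab 1 asks for, magnitude pruning can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the lab's actual code: the function name `magnitude_prune`, the flat list of weights, and the 50% sparsity target are all hypothetical, and the real lab operates on the PyTorch tensors of a CNN.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns (pruned_weights, mask), where mask[i] is 1 for kept weights
    and 0 for pruned ones. Hypothetical helper for illustration only.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")

    num_to_prune = int(round(sparsity * len(weights)))

    # Rank indices by |w|, smallest first; the first num_to_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])

    mask = [0 if i in pruned_indices else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


if __name__ == "__main__":
    # Assumed toy weights; a real layer would have thousands or millions.
    toy_weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2, -0.6, 0.02]
    pruned, mask = magnitude_prune(toy_weights, sparsity=0.5)
    print(pruned)
    print(mask)
```

In an iterative pruning schedule, a mask like this would typically be reapplied after every fine-tuning step so that pruned weights stay at zero while the surviving weights recover accuracy.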

Prerequisites

The course assumes:

  • 6.191: Computation Structures (MIT's intro to architecture)
  • 6.390: Intro to Machine Learning

This is not an entry-level course. It assumes you can read PyTorch, you know what a tensor is, and you have at least passing familiarity with how CPUs / GPUs / memory hierarchies work. If you're brand-new to ML, do Karpathy's Zero to Hero or AI Engineering from Scratch first.

How it positions against other Efficient AI / TinyML resources

| Resource | Strength | Where this course is different |
|---|---|---|
| Vendor-specific NVIDIA / Apple / Qualcomm tutorials | Hardware-specific | Vendor-neutral; teaches the underlying techniques |
| Generic Coursera / Udemy "Edge AI" courses | Broad introduction | Far deeper on the techniques; tied to working research code |
| Reading Deep Compression + AWQ papers directly | Authoritative | Same author, but with pedagogical scaffolding + labs you can actually run |
| Industry talks at NeurIPS workshops | Cutting edge | Curated and sequenced as a 4-month curriculum |
| LLM-deployment guides (e.g., HuggingFace docs) | Pragmatic | Adds the why (compression theory) under the how |

How to use this course if you can't enroll

The course is offered at MIT to MIT students for credit, but the materials are free and public. Here's how to actually use them:

  1. Watch the lecture recordings on the MIT HAN Lab YouTube channel. The 2023 / 2024 editions are complete and freely available.
  2. Download the slides from the per-lecture Dropbox links on the course site.
  3. Run the labs. They live on Google Colab and in the mit-han-lab GitHub org. All five are doable on a laptop (Lab 5 requires ~16 GB RAM for Llama2-7B).
  4. Read the linked papers. Each lecture cites 2-5 primary sources. The papers + the slides together are roughly a graduate-level reading list on efficient inference.
  5. Skip what you don't need. If you only care about LLM deployment, jump straight to Chapter II (lectures 12-13) + Lab 5.

Companion / derivative resources

The screenshot that surfaced this entry pointed to a Chinese-language Zhihu (知乎) column (蚁工厂, a 13-lecture restructure of Han's material). That's one of several community ports; others include:

  • classcentral.com indexes the YouTube lectures with searchable titles
  • csdiy.wiki has a Chinese-language guide for "self-studying" MIT 6.5940
  • Various Bilibili re-uploads with Chinese subtitles

If you read Chinese, the Zhihu (知乎) / Bilibili community has done substantial work translating and re-explaining the lectures. If you read English, the MIT-direct YouTube + slides are the authoritative source.

How LearnAI Team Could Use This

  • Default recommendation for any LearnAI member who needs to deploy a model below the cloud line: laptop / phone / browser / edge device. The labs are doable in a weekend if you have the prereqs.
  • Reading list for CS-336 (Program Analysis for Security) if any student gravitates toward systems-level ML: pruning + quantization is a systems-and-program-analysis topic, not a pure ML topic.
  • Faculty workshop on "where does the AI hype actually need to land?" This course is the answer in one curriculum: it gets cut. Use Chapter I as a 90-minute intro for colleagues skeptical that small models matter.
  • Curriculum design model: Han's "inference-first, training-last" sequencing is a defensible alternative to the standard "training-first" ML pedagogy. Worth studying for any LearnAI course module that touches deployment.
  • For research-KB integration: the per-lecture paper list is a structured reading list ready to drop into the Zotero / Obsidian KB pipeline.

Real-World Use Cases

| Scenario | How to use the course |
|---|---|
| Deploying an LLM on a laptop | Lab 5 walks through Llama2-7B end-to-end; substitute your model of choice |
| Compressing a model for an edge device | Chapter I + Lab 1 / Lab 2 give you the compression toolkit |
| Choosing a quantization scheme for production LLM serving | Lecture on AWQ + SmoothQuant; Han's own techniques explained by Han |
| Designing a hardware-aware ML system | Chapter I's MCUNet + TinyEngine lectures explicitly cover hardware/software co-design |
| Teaching graduate students about model compression | The slide deck is among the most polished free graduate ML materials available |
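
For the quantization scenarios in the table above, the basic trade-off between bit width and reconstruction error can be sketched with plain symmetric round-to-nearest quantization. This is a baseline under stated assumptions, not AWQ or SmoothQuant (both add scaling steps on top of a quantizer of this kind); the function names and the sample weights are hypothetical.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from integer levels."""
    return [qi * scale for qi in q]


def max_abs_error(values, num_bits):
    """Worst-case round-trip error for the given bit width."""
    q, scale = quantize_symmetric(values, num_bits)
    restored = dequantize(q, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


if __name__ == "__main__":
    # Assumed toy weights for illustration.
    toy_weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.2, -0.6, 0.02]
    for bits in (8, 4):
        print(bits, "bits -> max abs error", max_abs_error(toy_weights, bits))
```

With a single shared scale, the round-trip error per weight is bounded by half the scale, so dropping from 8 to 4 bits widens that bound by roughly a factor of 18 (127 levels versus 7). Closing that gap without retraining is the problem that activation-aware and smoothing-based methods target.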

Limitations and honest caveats

  • Not currently being run live. Prof. Han is on sabbatical, so Fall 2025 doesn't run; the 2026 status isn't yet announced. The previous editions' materials remain available, but you don't get current-cohort interaction (Piazza, project feedback).
  • Prereqs are real. Students new to ML or to systems will struggle. Do the prereqs first.
  • Bias toward Han's own techniques. Deep Compression, EIE, MCUNet, AWQ, SmoothQuant (all Han-lab work) get more coverage than competing techniques (e.g., GPTQ, ZeroQuant). Worth being aware of when comparing approaches.
  • No formal license stated on the course page: slides and recordings are freely available, but reuse permissions aren't explicit. For derivative teaching materials, ask the lab.
  • Lab 5 requires substantial RAM (~16 GB for Llama2-7B with the compression pipeline). Older laptops will struggle.
  • Heavily PyTorch. If your stack is JAX / TensorFlow, you'll need to translate. The techniques themselves are framework-neutral; the labs aren't.
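
The ~16 GB figure for Lab 5 follows from simple arithmetic on weight storage. The sketch below assumes a nominal 7 billion parameters for Llama2-7B and counts weights only; KV cache, activations, and runtime overhead are ignored, and the helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to hold the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    NOMINAL_PARAMS = 7e9  # assumed round figure for a "7B" model
    for label, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
        print(f"{label}: {weight_memory_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

At 16-bit precision the weights alone come to about 13 GiB, which leaves little headroom on a 16 GB machine; at 4 bits they come to about 3.3 GiB, which is why a compressed model fits comfortably on a laptop.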

Important things to know

  • The "efficiency stack" mental model the course establishes (compress the model, then schedule the compute, then pick the right runtime, then pick the right hardware) is the most reusable takeaway. Use it as the framework for any deployment decision you have to make later.
  • AWQ is now a de facto standard for 4-bit LLM weight quantization. The lecture covering it explains why it works, not just how to call the library, which is invaluable when debugging.
  • MCUNet is the unique offering in the course: almost no other free curriculum covers microcontroller-scale deep learning at this depth.
  • The Llama2-7B lab is doable on a Mac with 16 GB RAM. Mid-2020s consumer hardware is enough.
  • Companion deep-dives in this wiki: