SubQ — The First Subquadratic LLM: 12M Token Context at 5% of Opus Cost

SubQ is a new LLM from Miami startup Subquadratic that claims to be the first model built on a fully subquadratic attention architecture. Instead of comparing every token to every other token (quadratic cost), its Subquadratic Sparse Attention (SSA) mechanism selects only the positions that matter based on content. The company claims a 12-million-token context window, speedups of up to 52x over FlashAttention-2, and inference costs below 5% of Opus's. The benchmarks are strong but unverified: no weights, no independent evaluation, and significant community skepticism.

*Sources: Subquadratic (official) · VentureBeat analysis · The New Stack · Hacker News discussion*

The Core Idea: Why Most Attention Is Wasted

Standard transformer attention is quadratic: every token attends to every other token. For a 1M-token context, that’s 1 trillion attention computations — most of which produce near-zero weights.

SSA’s insight: if most attention weights are near zero, don’t compute them. For each query, SSA selects a small subset of positions based on content relevance (not fixed patterns), then computes exact attention only over those positions.

Standard Attention (Quadratic):
  Every token ↔ Every token = O(n²)
  1M tokens = 1,000,000,000,000 comparisons

SSA (Subquadratic):
  Every token → Select k relevant positions → Attend only those = O(n·k)
  1M tokens = comparisons grow linearly with sequence length (for fixed k)
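
To make the select-then-attend structure concrete, here is a minimal NumPy sketch. This is not SubQ's implementation (which is unpublished); the top-k routing below is an illustrative stand-in, and a naive full scoring pass is itself quadratic, so a real subquadratic system must also approximate the selection step (e.g., via clustering or hashing).

```python
import numpy as np

def sparse_topk_attention(Q, K, V, k=64):
    """Select-then-attend sketch: for each query, keep only the k
    highest-scoring key positions, then compute exact softmax
    attention over that subset.

    Caveat: the dense Q @ K.T scoring below is itself O(n^2); this
    sketch only illustrates the structure of content-dependent
    sparse attention, not a full subquadratic pipeline.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # (n, n) relevance scores
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]   # k positions per query

    out = np.zeros_like(V)
    for i in range(n):
        s = scores[i, topk[i]]
        w = np.exp(s - s.max())
        w /= w.sum()                                     # softmax over selected positions only
        out[i] = w @ V[topk[i]]                          # exact attention on the subset
    return out

# Toy usage: 1,024 tokens, 64-dim head, each query attends to 64 positions.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
print(sparse_topk_attention(Q, K, V).shape)  # (1024, 64)
```

The key design point is that attention over the selected subset is exact, which is what distinguishes this family from linear-attention and state-space approximations.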

Three Key Properties of SSA

Property What It Means Why It Matters
Linear scaling Compute grows with the number of selected positions, not the full sequence Context doubles → cost doubles (not quadruples)
Content-dependent routing The model decides where to look based on meaning, not position Token 3 or token 11,000,000 — if it’s relevant, SSA finds it
Precise retrieval Unlike recurrent models that compress into fixed state, SSA computes exact attention over selected positions Can retrieve from arbitrary positions without the lossy compression of state-space models

Benchmark Claims

| Benchmark | SubQ 1M-Preview | Opus 4.6 | GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| RULER 128K | 95-97% | 94.8% | n/a | n/a |
| SWE-Bench Verified | 81.8-82.4 | 80.8-81.4 | 80.6 | n/a |
| MRCR v2 | 65.9 (prod) / 83 (research) | 78% | 39% | 26.3% |

*Cells marked n/a were not reported in the source material.*

Efficiency Numbers

| Context Length | SSA vs. FlashAttention-2 |
|---|---|
| 128K tokens | 7.2x faster |
| 256K tokens | 13.2x faster |
| 512K tokens | 23x faster |
| 1M tokens | 52.2x faster |
| 12M tokens | ~1,000x compute reduction |

Cost comparison: on RULER 128K, SubQ reportedly costs ~$8 versus ~$2,600 for Opus, a difference of more than 300x.
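
The headline ratios are at least consistent with simple FLOP accounting, assuming each query attends to a roughly fixed budget of k selected positions. The k below is an illustrative guess; Subquadratic has not disclosed its selection budget.

```python
# Back-of-the-envelope FLOP accounting: dense attention scores ~n^2
# pairs, while select-then-attend scores ~n*k (ignoring the cost of
# the selection itself). The ratio n^2 / (n*k) = n/k grows with context.
k = 12_000  # assumed per-query selection budget -- illustrative, not disclosed
for n in (128_000, 1_000_000, 12_000_000):
    print(f"n={n:>10,}: theoretical compute ratio ~ {n / k:,.0f}x")
# n=   128,000: theoretical compute ratio ~ 11x
# n= 1,000,000: theoretical compute ratio ~ 83x
# n=12,000,000: theoretical compute ratio ~ 1,000x
```

Under that assumption, the theoretical ratio at 12M tokens lands near the claimed ~1,000x, while the measured 52.2x at 1M tokens sits below the theoretical ~83x, which is plausible once selection overhead and memory traffic are counted.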

The Team

| Person | Role | Background |
|---|---|---|
| Justin Dangel | CEO | Five-time founder (health tech, insurance tech) |
| Alex Whedon | CTO | Former Meta engineer; ex-Head of Gen AI at TribeAI |
| + 11 PhD researchers | Research | From Meta, Google, Oxford, Cambridge, ByteDance, Adobe, Microsoft |

The company raised a $29M seed round from investors including Justin Mateen (Tinder co-founder).

Why You Should Be Skeptical

This is where critical thinking matters. The claims are extraordinary, and several red flags exist:

| Concern | Detail |
|---|---|
| No weights released | Cannot independently verify architecture claims |
| No full technical report | Blog post and benchmark results, but no paper with methodology |
| Single-run benchmarks | Each model run only once due to "high inference cost"; no confidence intervals |
| Cherry-picked evals | Only long-context retrieval and coding benchmarks shown, the areas where SSA should have maximum advantage |
| Historical precedent | Magic.dev made similar 1,000x efficiency claims in 2024 with $500M raised; no public evidence of delivery |
| "AI Theranos" debate | AI commentator Dan McAteer: "SubQ is either the biggest breakthrough since the Transformer… or it's AI Theranos" |

Not all reactions are negative. AI researcher John Rysana pushed back: “This is just subquadratic attention done well, which is very meaningful for long context workloads — odds of it being BS are extremely low.”

Prominent AI engineer Will Depue initially suggested SubQ may be “a sparse attention finetune of Kimi or DeepSeek.” CTO Alex Whedon later confirmed SubQ uses weights from open-source models as a starting point — meaning the architecture innovation is the SSA attention layer, not the base weights.

Bottom line: The architectural idea (sparse, content-dependent attention) is sound and well-established in research. Whether this specific implementation delivers on the headline numbers requires independent verification that doesn’t exist yet.

What This Means for the Field

SubQ is part of a broader trend toward more efficient attention mechanisms. Multiple research threads are exploring alternatives to dense attention:

  • Linear attention variants (Mamba, RWKV, Kimi Linear) — State-space and linear recurrence approaches
  • Fixed-pattern sparse attention (Longformer, BigBird) — Predefined sparsity patterns
  • Content-dependent sparse attention (SSA, DeepSeek Sparse) — Dynamic selection based on meaning

Whether any of these approaches fully replaces dense attention in production remains to be seen — previous claims of similar magnitude (e.g., Magic.dev’s LTM-2-mini in 2024) have not yet been publicly validated. But the research direction is real and well-funded.
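
To make the distinction between the fixed-pattern and content-dependent families concrete, here is a simplified sketch of the two masking strategies. Both functions are illustrative only and do not reflect any production model's actual pattern.

```python
import numpy as np

def local_window_mask(n, w=4):
    """Fixed-pattern sparsity (Longformer/BigBird-style sliding window):
    each token attends to its w nearest neighbors, regardless of content."""
    i = np.arange(n)
    return np.abs(i[:, None] - i[None, :]) <= w

def content_topk_mask(scores, k=4):
    """Content-dependent sparsity (SSA/DeepSeek-style): each token
    attends to its k highest-scoring positions, wherever they are."""
    topk = np.argpartition(scores, -k, axis=1)[:, -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)
    return mask

# The difference in one sentence: with a fixed window, a relevant token
# far outside the window is invisible to the query; a content-dependent
# mask can still select it, which is what enables precise long-range retrieval.
```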

Products Available

| Product | What It Does | Status |
|---|---|---|
| SubQ API | 12M-token context window via API | Private beta |
| SubQ Code | Coding agent (CLI) | Private beta |
| SubQ Search | Deep research tool | Private beta |

Real-World Use Cases

  • Entire codebase analysis — Load millions of lines of code into context without truncation or RAG.
  • Legal document review — Process complete contract sets, case law, and regulatory filings in a single pass.
  • Long-running agent state — Agents that maintain months of interaction history without forgetting.
  • Scientific literature synthesis — Analyze dozens of full papers simultaneously for systematic reviews.

How LearnAI Team Could Use This

  • Critical evaluation exercise — Use SubQ as a case study in evaluating AI claims: what evidence would you need to believe these benchmarks? What’s the difference between marketing and peer-reviewed results?
  • Attention mechanism deep-dive — Compare standard attention, sparse attention, linear attention, and SSA in a lecture on transformer architecture evolution.
  • Cost-performance analysis — Teach students to evaluate AI tools not just on accuracy but on cost/token, availability, and verification status.