Sterling Crispin shared a brutal discovery: Claude Opus 4.6 is an excellent programmer, but it consistently produces serious bugs that it cannot find no matter how many times it self-reviews. The solution? Use a completely different model, OpenAI's Codex CLI (GPT 5.4), to review every submission, with 4+ review passes. The insight that's making developers rethink AI coding workflows: "passing tests" doesn't mean "no bugs"; it means the AI got really good at writing code that passes tests.
| *Source: 爱可可-爱生活 Weibo analysis | Chandler Nguyen: Dual-Wielding AI Coding Tools | SmartScope: Automating the Claude × Codex Review Loop | GitHub: openai/codex-plugin-cc* |
The Problem: Self-Review Blindness
AI models have a blind spot: they can't reliably catch their own bugs. This isn't about model quality; it's about cognitive homogeneity. When the same model writes code and reviews it, it applies the same reasoning patterns, assumptions, and blind spots to both tasks.
Claude writes code ──→ Claude reviews code
  Same reasoning          Same blind spots
  Same assumptions        Same missed patterns
= Bugs persist through unlimited review rounds
Sterling's experience: no matter how many times he asked Claude to self-review, the same class of bugs survived. The bugs weren't in syntax or logic; they were deep-layer misunderstandings about system behavior that the model's own reasoning couldn't detect.
Why βPassing Testsβ Is a Trap
This is the most counterintuitive finding:
"AI's favorite thing is writing code that passes its own tests. 'Passing tests' doesn't mean no bugs; it means the AI got optimized to write code that passes tests."
| What You Expect | What Actually Happens |
|---|---|
| Tests catch bugs | AI writes code and tests with the same assumptions |
| Green = safe | Green = model successfully fooled itself |
| More tests = more safety | More tests = more sophisticated self-deception |
| Linting catches the rest | Linting catches syntax, not semantic bugs |
The subtle bugs that survive: the code runs perfectly and all tests are green, yet it hides deep-layer misunderstandings that can destroy entire systems. Traditional validators (linters, type checkers, unit tests) can't catch them because the model has been optimized to produce code that passes exactly those checks.
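A minimal hypothetical illustration (not from Sterling's project): the implementation and its Jest test encode the same wrong assumption, so the suite stays green while the behavior violates the spec.

```js
// billing.js (hypothetical example)
// Spec: a subscription is active through its end date (inclusive).
function isActive(now, start, end) {
  // Bug: the exclusive comparison silently drops the final day.
  return now >= start && now < end;
}
module.exports = { isActive };

// billing.test.js (Jest)
const { isActive } = require("./billing");

test("subscription is active inside its window", () => {
  const start = new Date("2026-01-01");
  const end = new Date("2026-02-01");
  // Written with the same exclusive-end assumption as the code,
  // so the boundary day is never probed and the suite stays green.
  expect(isActive(new Date("2026-01-15"), start, end)).toBe(true);
  expect(isActive(new Date("2026-02-02"), start, end)).toBe(false);
});
```

A reviewer with different priors is far more likely to probe the boundary at `end` itself than the author of both the code and the test.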
The Solution: Cross-Model Review
Use a different model to review. Different training data, different reasoning patterns, different blind spots = different bugs caught.
Claude (Opus 4.6)             Codex (GPT 5.4)
┌────────────────┐            ┌────────────────┐
│ Write code     │ ─────────→ │ Review code    │
│ Write tests    │            │ Find bugs      │
│ Fast, daily    │            │ Rigorous,      │
│ driver         │ ←───────── │ "academic      │
│                │  feedback  │ advisor"       │
└────────────────┘            └────────────────┘

Different model = different blind spots
= Bugs that Claude can't see, Codex catches
Why Not Just Use Codex for Everything?
Sterling's answer: Codex is like an academic advisor. It is too focused on "correct code," loses sight of the system's actual purpose (telos), and over-engineers. It's also more expensive. Claude is better for daily driving: fast, practical, good at understanding what you actually want. But it needs Codex's rigorous review eye.
| Model | Strength | Weakness |
|---|---|---|
| Claude | Fast execution, practical, understands intent | Can't catch its own deep bugs |
| Codex | Rigorous review, catches subtle issues | Over-engineers, expensive, loses system purpose |
| Both | Write with Claude, review with Codex | Best of both worlds |
Emerging Patterns
Pattern 1: Plan-with-Codex
1. Claude writes implementation plan
2. Codex reviews plan for architectural issues
3. Claude revises based on Codex feedback
4. Loop until Codex approves
5. Claude implements the approved plan
6. Codex reviews the implementation
Pattern 2: Multi-Model Review Loop
Run reviews back and forth between models:
"Having Opus critically review a plan from GPT-5.4, and then having GPT-5.4 review the revised plan from Opus, running this back and forth for a few rounds, produces significantly better results."
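In script form, that ping-pong could look roughly like this (a sketch assuming the claude and codex CLIs in non-interactive mode; the stop condition is a made-up heuristic, not an actual signal either tool emits):

```js
// plan-pingpong.js: rough sketch of a multi-model plan review loop.
// Assumes `claude -p` (print mode) and `codex exec` non-interactive CLIs.
const { execFileSync } = require("node:child_process");
const run = (cmd, args) => execFileSync(cmd, args, { encoding: "utf8" });

let plan = run("claude", ["-p", "Draft an implementation plan for the task API."]);

for (let round = 0; round < 3; round++) {
  // The other model critiques the current plan.
  const critique = run("codex", ["exec", `Critically review this plan:\n${plan}`]);
  // Hypothetical stop heuristic: bail out when the critique is clean.
  if (/no (major )?issues/i.test(critique)) break;
  // Original model revises based on the cross-model feedback.
  plan = run("claude", [
    "-p",
    `Revise the plan to address this review:\n${critique}\n\nPlan:\n${plan}`,
  ]);
}
console.log(plan);
```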
Pattern 3: Role-Based Multi-Model
| Role | Model | Why |
|---|---|---|
| Architecture | Opus 4.6 | Best at system-level thinking |
| Implementation | Claude Code | Fast, practical daily driver |
| Code Review | Codex 5.4 | Catches bugs Claude misses |
| Security Audit | Either (cross-review) | Different model = fresh eyes |
How LearnAI Team Could Use This
- AI coding curriculum: teach students why self-review and self-written tests can miss semantic bugs
- Workflow demos: show Claude-as-implementer and Codex-as-reviewer as a practical multi-model pattern
- Assessment design: require adversarial review notes alongside generated code submissions
- Team engineering practice: use cross-model review as a lightweight review gate for internal prototypes
For Educators
This has direct implications for teaching software engineering:
- "Passing tests ≠ correct" is now a foundational principle, not an edge case
- Students need to learn adversarial review: the habit of having a different perspective check their work
- Cross-model review is the AI-era equivalent of code review culture: you never review your own code in production
Now Official: codex-plugin-cc
On March 30, 2026, OpenAI released codex-plugin-cc, an official plugin that brings Codex directly into Claude Code. No more switching terminals. Cross-model review is now a slash command away.
Install
/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup
Requires: ChatGPT subscription (including Free tier) or OpenAI API key + Node.js 18.18+.
Slash Commands
| Command | What It Does |
|---|---|
| `/codex:review` | Standard read-only code review on uncommitted changes. Supports `--base main` for branch diffs |
| `/codex:adversarial-review` | Devil's advocate review: challenges design tradeoffs, hidden assumptions, failure modes |
| `/codex:rescue` | Hand off an entire task to Codex (bug investigation, fixes, continuing work) |
| `/codex:status` | Show running/recent Codex jobs |
| `/codex:result` | Display completed job output + session ID for resuming in Codex |
| `/codex:cancel` | Kill active background jobs |
| `/codex:setup` | Verify installation, manage the review gate |
Background Execution
All review commands support --background for async execution:
/codex:review --background # Runs in background
/codex:status # Check progress
/codex:result # Get findings when done
Review Gate (Optional)
Enable with /codex:setup --enable-review-gate. This uses Stop hooks to auto-run a Codex review after every Claude response, and blocks if issues are found: Claude must fix them before proceeding.
Warning: this can trigger Claude/Codex loops that drain usage limits fast. Only enable it while you are actively monitoring the session.
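Mechanically, the gate amounts to something like the following Stop-hook script (a hypothetical sketch of the idea, not the plugin's actual implementation; Claude Code Stop hooks block when the hook exits with code 2 and feed its stderr back to the model):

```js
#!/usr/bin/env node
// review-gate.js: hypothetical sketch of a Stop-hook review gate.
// Claude Code runs Stop hooks after each response; exiting with
// code 2 blocks the stop and returns stderr to Claude.
const { execFileSync } = require("node:child_process");

try {
  // Ask the local Codex CLI to review uncommitted changes
  // (prompt and invocation are illustrative).
  const report = execFileSync(
    "codex",
    ["exec", "Review the uncommitted diff for bugs"],
    { encoding: "utf8" }
  );
  if (/\b(bug|vulnerability|race condition|injection)\b/i.test(report)) {
    console.error(`Codex review found issues:\n${report}`);
    process.exit(2); // block: Claude must address the findings first
  }
} catch (err) {
  console.error("Codex review failed to run; allowing stop.", err.message);
}
process.exit(0); // no findings (or gate failure): allow Claude to stop
```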
Architecture
Claude Code session
 │
 ├── /codex:review ──→ Local Codex CLI ──→ OpenAI API
 │                     (reuses your auth,   (GPT 5.4)
 │                      MCP config, repo)
 │
 └── /codex:rescue ──→ Codex subagent (full write access)
No separate runtime: the plugin wraps your machine's global codex binary with your existing auth and config.
Real-World Use Cases
- Production code review: run a different model over AI-generated diffs before merge
- Security-sensitive PRs: use adversarial review for auth, data handling, and error-path risks
- Teaching code quality: demonstrate why green tests are not the same as correct behavior
- Architecture review: have a second model challenge plans before implementation starts
Real-World Walkthrough: Building a REST API with Cross-Model Review
Here's a concrete example of the Claude + Codex workflow in practice: building a task management API from scratch.
Step 1: Claude Code Writes the Implementation
You: Build a REST API for managing tasks with Express.
Include CRUD endpoints, SQLite persistence, and input validation.
Claude Code generates:
- `server.js`: Express app with routes
- `db.js`: SQLite connection + schema
- `middleware/validate.js`: input validation
- `tests/api.test.js`: Jest test suite
All tests pass. Linting clean. Looks good… right?
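For illustration, the two spots Codex is about to flag could look something like this (a hypothetical reconstruction; the walkthrough doesn't include the generated source):

```js
// db.js (hypothetical reconstruction of the two flagged spots)
const sqlite3 = require("sqlite3");
const db = new sqlite3.Database("./tasks.db");

// Later flagged as SQL injection: the id is interpolated directly
// into the SQL string instead of being passed as a parameter.
function getTask(id, cb) {
  db.get(`SELECT * FROM tasks WHERE id = ${id}`, cb);
}

// Later flagged as a race condition: read-then-write with no
// transaction, so two concurrent PATCHes can overwrite each other.
function patchTask(id, fields, cb) {
  getTask(id, (err, row) => {
    if (err || !row) return cb(err || new Error("not found"));
    const merged = { ...row, ...fields };
    db.run(
      `UPDATE tasks SET title = '${merged.title}', done = ${merged.done ? 1 : 0} WHERE id = ${id}`,
      cb
    );
  });
}

module.exports = { db, getTask, patchTask };
```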
Step 2: Codex Standard Review
/codex:review
Codex (GPT 5.4) examines the diff with fresh eyes and reports:
⚠️ SQL Injection in db.js:34
   Template literal used in query: `WHERE id = ${id}`
   → Use a parameterized query: db.get("WHERE id = ?", [id])

⚠️ Race condition in PATCH /tasks/:id
   Read-then-write without a transaction. Concurrent PATCHes
   can overwrite each other.
   → Wrap in db.run("BEGIN") ... db.run("COMMIT")

⚠️ Missing rate limiting
   No rate limiter on POST /tasks: trivial to DoS.

ℹ️ Test coverage gap
   No test for a malformed JSON body (triggers an unhandled 500)
Claude's own tests passed. Claude's own review found nothing. Codex caught 4 issues in seconds, including a SQL injection that would have shipped to production.
Step 3: Claude Code Fixes Based on Codex Feedback
You: Fix the issues Codex found: SQL injection, race condition,
rate limiting, and add the missing test.
Claude Code applies all fixes. Tests updated and passing.
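The fixes might look roughly like this (an illustrative sketch, since the walkthrough doesn't show the actual diff; `db`, `app`, and `createTask` come from the surrounding files):

```js
// db.js: parameterized queries replace string interpolation
function getTask(id, cb) {
  db.get("SELECT * FROM tasks WHERE id = ?", [id], cb);
}

// db.js: BEGIN IMMEDIATE takes the write lock up front, so concurrent
// PATCHes serialize instead of overwriting each other
function patchTask(id, fields, cb) {
  db.run("BEGIN IMMEDIATE", () => {
    db.get("SELECT * FROM tasks WHERE id = ?", [id], (err, row) => {
      if (err || !row) return db.run("ROLLBACK", () => cb(err || new Error("not found")));
      const merged = { ...row, ...fields };
      db.run("UPDATE tasks SET title = ?, done = ? WHERE id = ?",
        [merged.title, merged.done ? 1 : 0, id],
        (err2) => db.run(err2 ? "ROLLBACK" : "COMMIT", () => cb(err2, merged)));
    });
  });
}

// server.js: rate limit writes (express-rate-limit)
const rateLimit = require("express-rate-limit");
app.post("/tasks", rateLimit({ windowMs: 60_000, max: 30 }), createTask);

// tests/api.test.js: the missing malformed-JSON case (supertest)
const request = require("supertest");
test("malformed JSON body returns 400, not an unhandled 500", async () => {
  const res = await request(app)
    .post("/tasks")
    .set("Content-Type", "application/json")
    .send('{"title": broken');
  expect(res.status).toBe(400);
});
```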
Step 4: Codex Adversarial Review
/codex:adversarial-review Focus on security and error handling
Codex pushes harder:
🔴 Error messages leak internals
   DELETE /tasks/999 returns: {"error": "SQLITE_ERROR: no such row"}
   → Return a generic 404, log details server-side only

🔴 No authentication
   All endpoints are public. Even if "auth comes later,"
   the route structure should enforce middleware now.

🟡 db.js opens connection at import time
   If the SQLite file is missing/corrupt, the server crashes on startup
   with an unhelpful error. Add connection validation + retry.

🟡 Timestamps use Date.now() (epoch ms)
   API returns milliseconds but no timezone. Use ISO 8601 strings.
Step 5: Claude Revises Again
Claude fixes the error messages, adds an auth middleware placeholder and connection validation, and switches to ISO timestamps. One more review:
/codex:review
✅ No issues found. LGTM.
Step 6: Codex Reviews the Plan (Not Just the Code)
Before shipping, use Codex to review the architecture:
/codex:adversarial-review Challenge the overall design decisions
🟡 SQLite for a REST API?
   Fine for a prototype, but no concurrent write support.
   If this grows beyond a single server, you'll need to migrate.
   Consider abstracting db.js behind a repository pattern now.

🟡 No pagination on GET /tasks
   Returns all tasks. Will become a problem at ~1000 rows.
   Add ?page=1&limit=20 with Link headers.

ℹ️ No health check endpoint
   Add GET /health for deployment monitoring.
These aren't bugs; they're architectural feedback that prevents future pain.
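A sketch of where that feedback could lead (illustrative; all names are hypothetical, and `app`/`repo` come from the server setup): a thin repository layer so routes never touch SQLite directly, plus paginated listing and a health check.

```js
// repositories/taskRepository.js (hypothetical refactor sketch)
// Routes depend on this interface, so SQLite can be swapped later
// (e.g., for Postgres) without touching the handlers.
class SqliteTaskRepository {
  constructor(db) {
    this.db = db;
  }

  // Paginated listing backing GET /tasks?page=1&limit=20
  list(page = 1, limit = 20) {
    return new Promise((resolve, reject) => {
      this.db.all(
        "SELECT * FROM tasks ORDER BY id LIMIT ? OFFSET ?",
        [limit, (page - 1) * limit],
        (err, rows) => (err ? reject(err) : resolve(rows))
      );
    });
  }
}

// server.js: handlers use the repository, never SQLite directly
app.get("/tasks", async (req, res) => {
  const page = Math.max(1, Number(req.query.page) || 1);
  const limit = Math.min(100, Number(req.query.limit) || 20);
  res.json(await repo.list(page, limit));
});

// Health check for deployment monitoring
app.get("/health", (_req, res) => res.json({ status: "ok" }));

module.exports = { SqliteTaskRepository };
```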
The Full Loop
Claude Code                           Codex (via plugin)
┌───────────────┐                     ┌──────────────────────┐
│ 1. Implement  │─ /codex:review ────→│ 2. Standard review   │
│               │←──── findings ──────│    (4 bugs found)    │
│ 3. Fix bugs   │                     │                      │
│               │─ /codex:adversarial→│ 4. Adversarial       │
│               │←── deeper issues ───│    (4 more issues)   │
│ 5. Fix all    │                     │                      │
│               │─ /codex:review ────→│ 6. LGTM ✅           │
│               │                     │                      │
│               │─ /codex:adversarial→│ 7. Architecture      │
│               │←─ design feedback ──│    review            │
│ 8. Refactor   │                     │                      │
│ 9. Ship 🚀    │                     │                      │
└───────────────┘                     └──────────────────────┘

Total: 8 issues caught that Claude's self-review missed.
The Harness Engineering Connection
Cross-model review is a harness mechanism. It maps to Pillar 2 (Architectural Constraints): instead of asking the model to follow a "review your own code" prompt (which it can ignore or apply with the same blind spots), you enforce review through a structurally different system. Code enforcement > prompt suggestions.
The codex-plugin-cc makes this zero-friction: no terminal switching, no copy-paste, no workflow interruption. The cross-model review thesis now has an official tool.
Evolution: From 2-Agent to 4-Agent Consensus Review
The next step beyond Claude + Codex pairing: 4 parallel agents with iterative consensus. A custom /cross-review skill orchestrates Claude Code, Codex, CodeRabbit, and an Integration Impact agent, all reviewing the same PR in parallel, then exchanging opinions and cross-validating until they agree.
Source: 硅谷随机博士 on Weibo (2026-04)
                      PR #251
                         │
                         ▼
   /cross-review launches 4 agents in parallel
                         │
     ┌───────────┬───────┴────┬──────────────┐
     ▼           ▼            ▼              ▼
┌─────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐
│ Claude  │ │ Codex    │ │CodeRabbit│ │ Integration │
│ Code    │ │ (caller  │ │(integra- │ │ Impact      │
│ (full   │ │ analysis)│ │tion chk) │ │ (versions,  │
│ review) │ │          │ │          │ │ CI, build)  │
└────┬────┘ └────┬─────┘ └────┬─────┘ └──────┬──────┘
     │           │            │              │
     └───────────┴───────┬────┴──────────────┘
                         │
                         ▼
         Exchange opinions ←→ Cross-validate
                         │
                         ▼
           Up to 3 rounds until consensus
                         │
                         ▼
        ┌───────────────────────────────────┐
        │ Confirmed Issues (2+ reviewers)   │
        │ Integration Findings              │
        │ Dismissed Findings (with why)     │
        └───────────────────────────────────┘
What Each Agent Brings
| Agent | Specialty |
|---|---|
| Claude Code | Full code review: logic, patterns, best practices |
| Codex | Consumer/caller analysis: who uses this code and how |
| CodeRabbit | Integration checks: API contracts, backward compat |
| Integration Impact | Cross-cutting analysis: versions, CI, build, .gitignore |
The Output: Three Categories
- Confirmed Issues: agreed on by 2+ reviewers, with severity (Critical/Major/Minor/Info)
- Integration Findings: cross-cutting impact the individual reviewers missed
- Dismissed Findings: what the consensus rejected, with justification (e.g., "Docker is a system dependency, not a Flox package")
The Real Insight
"The value isn't in three independent opinions; it's in the mutual validation through iterative discussion."
A single reviewer's opinion is cheap. Four independent reviewers give you breadth but leave you to resolve conflicts yourself. Iterative consensus forces the reviewers to confront each other's findings: false positives get dismissed with reasons, genuine issues get confirmed by cross-checking, and cross-cutting impacts emerge that no single reviewer would have seen.
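A minimal sketch of what such an orchestrator could do, assuming each reviewer is reachable as a non-interactive CLI (the coderabbit-cli command, prompts, and the consensus heuristic are all hypothetical):

```js
// cross-review.js: hypothetical consensus orchestrator sketch.
// Assumes each reviewer is callable as a CLI that takes a prompt
// and prints findings as text; real plumbing differs per tool.
const { execFile } = require("node:child_process");
const { promisify } = require("node:util");
const run = promisify(execFile);

const reviewers = [
  { name: "claude-code", cmd: ["claude", "-p"] },            // full review
  { name: "codex", cmd: ["codex", "exec"] },                 // caller analysis
  { name: "coderabbit", cmd: ["coderabbit-cli", "review"] }, // hypothetical CLI
  { name: "integration", cmd: ["claude", "-p"] },            // impact-focused prompt
];

async function crossReview(diff) {
  // Round 0: all four review the same diff in parallel.
  let findings = await Promise.all(
    reviewers.map(async (r) => ({
      reviewer: r.name,
      report: (
        await run(r.cmd[0], [...r.cmd.slice(1), `Review this diff:\n${diff}`])
      ).stdout,
    }))
  );

  // Rounds 1-3: each reviewer sees the others' findings and must
  // confirm or dismiss them; stop early if nothing changes.
  for (let round = 1; round <= 3; round++) {
    const shared = findings.map((f) => `${f.reviewer}:\n${f.report}`).join("\n---\n");
    const next = await Promise.all(
      reviewers.map(async (r) => ({
        reviewer: r.name,
        report: (
          await run(r.cmd[0], [
            ...r.cmd.slice(1),
            `Here are all reviewers' findings:\n${shared}\nConfirm or dismiss each, with reasons.`,
          ])
        ).stdout,
      }))
    );
    if (JSON.stringify(next) === JSON.stringify(findings)) break; // crude consensus check
    findings = next;
  }
  return findings; // caller buckets these into confirmed/integration/dismissed
}
```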
How It Compares
| Approach | Agents | Cost | Value |
|---|---|---|---|
| Self-review | 1 | 1x | Low (same blind spots) |
| Cross-model (Claude + Codex) | 2 | 2x | High (different reasoning patterns) |
| 4-agent consensus | 4 | ~5x (3 rounds) | Highest (mutual validation) |
Use single cross-model review for daily work. Use 4-agent consensus for high-stakes PRs: release branches, security-sensitive code, architectural changes. The cost is worth it when a missed bug costs more than the review tokens.