Cross-Model Code Review — Why Claude Can't Catch Its Own Bugs

Sterling Crispin shared a brutal discovery: Claude Opus 4.6 is an excellent programmer, but consistently produces serious bugs that it cannot find no matter how many times it self-reviews. The solution? Use a completely different model — OpenAI’s Codex CLI (GPT 5.4) — to review every submission, with 4+ review passes. The insight that’s making developers rethink AI coding workflows: “passing tests” doesn’t mean “no bugs” — it means the AI got really good at writing code that passes tests.

Sources: 爱可可-爱生活 (Weibo analysis) · Chandler Nguyen: Dual-Wielding AI Coding Tools · SmartScope: Automating the Claude × Codex Review Loop · GitHub: openai/codex-plugin-cc

The Problem: Self-Review Blindness

AI models have a blind spot: they can’t reliably catch their own bugs. This isn’t about model quality — it’s about cognitive homogeneity. When the same model writes code and reviews it, it applies the same reasoning patterns, assumptions, and blind spots to both tasks.

Claude writes code → Claude reviews code
  Same reasoning      Same blind spots
  Same assumptions    Same missed patterns
  = Bugs persist through unlimited review rounds

Sterling’s experience: no matter how many times he asked Claude to self-review, the same class of bugs survived. The bugs weren’t in syntax or logic — they were deep-layer misunderstandings about system behavior that the model’s own reasoning couldn’t detect.

Why “Passing Tests” Is a Trap

This is the most counterintuitive finding:

“AI’s favorite thing is writing code that passes its own tests. ‘Passing tests’ doesn’t mean no bugs — it means the AI got optimized to write code that passes tests.”

What You Expect            What Actually Happens
Tests catch bugs           AI writes code and tests with the same assumptions
Green = safe               Green = model successfully fooled itself
More tests = more safety   More tests = more sophisticated self-deception
Linting catches the rest   Linting catches syntax, not semantic bugs

The subtle bugs that survive: code runs perfectly, all tests green, but hides deep-layer misunderstandings that can destroy entire systems. Traditional validators (linters, type checkers, unit tests) can’t catch them because the model has been optimized to produce code that passes exactly those checks.
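
A toy illustration of the failure mode (hypothetical code, not from Sterling’s project): the function and its test encode the same wrong unit assumption, so the suite stays green while every real caller breaks.

// discount.js: the model assumed `price` is in dollars, but the rest of the
// (hypothetical) system passes prices around in cents.
function applyDiscount(price, percent) {
  return Math.round((price - price * (percent / 100)) * 100) / 100;
}

// discount.test.js: the test was written under the SAME dollar assumption,
// so it passes while callers that pass cents get silently wrong numbers.
const assert = require("node:assert");
assert.strictEqual(applyDiscount(19.99, 10), 17.99); // green, dollars in
// A real caller: applyDiscount(1999, 10) => 1799.1, which downstream code
// treats as cents. No linter, type checker, or self-written test flags it.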

The Solution: Cross-Model Review

Use a different model to review. Different training data, different reasoning patterns, different blind spots = different bugs caught.

Claude (Opus 4.6)          Codex (GPT 5.4)
┌──────────────┐           ┌──────────────┐
│ Write code   │──────────>│ Review code  │
│ Write tests  │           │ Find bugs    │
│ Fast, daily  │           │ Rigorous     │
│ driver       │<──────────│ "Academic    │
│              │  feedback │  advisor"    │
└──────────────┘           └──────────────┘

Different model = different blind spots
= Bugs that Claude can't see, Codex catches

Why Not Just Use Codex for Everything?

Sterling’s answer: Codex is like an academic advisor: so focused on “correct code” that it loses sight of the system’s actual purpose (telos). It over-engineers, and it’s more expensive. Claude is better for daily driving — fast, practical, good at understanding what you actually want. But it needs Codex’s rigorous review eye.

Model    Strength                                        Weakness
Claude   Fast execution, practical, understands intent   Can't catch its own deep bugs
Codex    Rigorous review, catches subtle issues          Over-engineers, expensive, loses sight of system purpose
Both     Write with Claude, review with Codex            Best of both worlds

Emerging Patterns

Pattern 1: Plan-with-Codex

1. Claude writes implementation plan
2. Codex reviews plan for architectural issues
3. Claude revises based on Codex feedback
4. Loop until Codex approves
5. Claude implements the approved plan
6. Codex reviews the implementation

Pattern 2: Multi-Model Review Loop

Run reviews back and forth between models:

“Having Opus critically review a plan from GPT-5.4, and then having GPT-5.4 review the revised plan from Opus — running this back and forth for a few rounds — produces significantly better results.”
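
Scripted, the loop can be as small as this sketch, assuming both CLIs are installed and accept a one-shot prompt (claude -p and codex exec); the flags, file names, and convergence check are illustrative, so verify them against your installed versions:

// review-loop.js: a minimal sketch of Pattern 2.
const { execFileSync } = require("node:child_process");
const { readFileSync, writeFileSync } = require("node:fs");

const run = (cmd, args) => execFileSync(cmd, args, { encoding: "utf8" });

let plan = readFileSync("PLAN.md", "utf8");
for (let round = 1; round <= 3; round++) {
  // GPT-5.x critiques the current plan...
  const critique = run("codex", ["exec", `Critically review this plan:\n\n${plan}`]);
  if (/no (major |blocking )?issues/i.test(critique)) break; // crude convergence check
  // ...then Opus revises the plan to address the critique.
  plan = run("claude", ["-p",
    `Revise this plan to address the review feedback.\n\nPlan:\n${plan}\n\nReview:\n${critique}`]);
}
writeFileSync("PLAN.md", plan);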

Pattern 3: Role-Based Multi-Model

Role             Model                   Why
Architecture     Opus 4.6                Best at system-level thinking
Implementation   Claude Code             Fast, practical daily driver
Code Review      Codex 5.4               Catches bugs Claude misses
Security Audit   Either (cross-review)   Different model = fresh eyes

How the LearnAI Team Could Use This

  • AI coding curriculum — teach students why self-review and self-written tests can miss semantic bugs
  • Workflow demos — show Claude-as-implementer and Codex-as-reviewer as a practical multi-model pattern
  • Assessment design — require adversarial review notes alongside generated code submissions
  • Team engineering practice — use cross-model review as a lightweight review gate for internal prototypes

For Educators

This has direct implications for teaching software engineering:

  • “Passing tests ≠ correct” is now a foundational principle, not an edge case
  • Students need to learn adversarial review — the habit of having a different perspective check their work
  • Cross-model review is the AI-era equivalent of code review culture — in production, code is never approved solely by its author

Now Official: codex-plugin-cc

On March 30, 2026, OpenAI released codex-plugin-cc — an official plugin that brings Codex directly into Claude Code. No more switching terminals. Cross-model review is now a slash command away.

Install

/plugin marketplace add openai/codex-plugin-cc
/plugin install codex@openai-codex
/reload-plugins
/codex:setup

Requires a ChatGPT subscription (the Free tier works) or an OpenAI API key, plus Node.js 18.18+.

Slash Commands

Command                     What It Does
/codex:review               Standard read-only code review on uncommitted changes; supports --base main for branch diffs
/codex:adversarial-review   Devil’s advocate review — challenges design tradeoffs, hidden assumptions, failure modes
/codex:rescue               Hand off an entire task to Codex (bug investigation, fixes, continuing work)
/codex:status               Show running/recent Codex jobs
/codex:result               Display completed job output + session ID for resuming in Codex
/codex:cancel               Kill active background jobs
/codex:setup                Verify installation, manage the review gate

Background Execution

All review commands support --background for async execution:

/codex:review --background      # Runs in background
/codex:status                   # Check progress
/codex:result                   # Get findings when done

Review Gate (Optional)

Enable with /codex:setup --enable-review-gate. Uses Stop hooks to auto-run a Codex review after every Claude response and blocks when issues are found: Claude must fix them before proceeding.

Warning: Can trigger Claude/Codex loops that drain usage limits fast. Only enable during active monitoring.

Architecture

Claude Code session
    │
    ├── /codex:review ──→ Local Codex CLI ──→ OpenAI API
    │                      (reuses your auth,    (GPT 5.4)
    │                       MCP config, repo)
    │
    └── /codex:rescue ──→ Codex subagent (full write access)

No separate runtime — wraps your machine’s global codex binary with existing auth and config.


Real-World Use Cases

  • Production code review — run a different model over AI-generated diffs before merge
  • Security-sensitive PRs — use adversarial review for auth, data handling, and error-path risks
  • Teaching code quality — demonstrate why green tests are not the same as correct behavior
  • Architecture review — have a second model challenge plans before implementation starts

Real-World Walkthrough: Building a REST API with Cross-Model Review

Here’s a concrete example of the Claude + Codex workflow in practice — building a task management API from scratch.

Step 1: Claude Code Writes the Implementation

You: Build a REST API for managing tasks with Express.
     Include CRUD endpoints, SQLite persistence, and input validation.

Claude Code generates:

  • server.js — Express app with routes
  • db.js — SQLite connection + schema
  • middleware/validate.js — Input validation
  • tests/api.test.js — Jest test suite

All tests pass. Linting clean. Looks good… right?
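
The generated db.js might contain something like this (a reconstructed sketch, not Claude’s actual output):

// db.js (sketch): SQLite connection plus a lookup helper. Every generated
// test calls getTask with a well-formed numeric id, so the suite stays green.
const sqlite3 = require("sqlite3");
const db = new sqlite3.Database("./tasks.db");

function getTask(id, cb) {
  // Passes every test, but `id` flows in straight from req.params,
  // which makes this the SQL injection Codex flags in Step 2.
  db.get(`SELECT * FROM tasks WHERE id = ${id}`, cb);
}

module.exports = { db, getTask };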

Step 2: Codex Standard Review

/codex:review

Codex (GPT 5.4) examines the diff with fresh eyes and reports:

⚠️  SQL Injection in db.js:34
    Template literal used in query: `WHERE id = ${id}`
    → Use parameterized query: db.get("WHERE id = ?", [id])

⚠️  Race condition in PATCH /tasks/:id
    Read-then-write without transaction. Concurrent PATCHes
    can overwrite each other.
    → Wrap in db.run("BEGIN") ... db.run("COMMIT")

⚠️  Missing rate limiting
    No rate limiter on POST /tasks — trivial to DoS.

ℹ️  Test coverage gap
    No test for malformed JSON body (triggers unhandled 500)

Claude’s own tests passed. Claude’s own review found nothing. Codex caught 4 issues in seconds — including a SQL injection that would have shipped to production.

Step 3: Claude Code Fixes Based on Codex Feedback

You: Fix the issues Codex found — SQL injection, race condition,
     rate limiting, and add the missing test.

Claude Code applies all fixes. Tests updated and passing.
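
A sketch of what those fixes can look like, continuing the hypothetical db.js and server.js from Step 1 (the transaction shape and rate-limit policy are illustrative):

// db.js: the parameterized query closes the injection hole.
function getTask(id, cb) {
  db.get("SELECT * FROM tasks WHERE id = ?", [id], cb);
}

// server.js: PATCH read-then-write wrapped in a transaction, per Codex's note.
// (Assumes the Express `app` and sqlite `db` from the surrounding sketches.)
app.patch("/tasks/:id", (req, res) => {
  db.serialize(() => {
    db.run("BEGIN IMMEDIATE"); // take the write lock before reading
    db.get("SELECT * FROM tasks WHERE id = ?", [req.params.id], (err, row) => {
      if (err || !row) {
        db.run("ROLLBACK");
        return res.sendStatus(err ? 500 : 404);
      }
      const next = { ...row, ...req.body }; // merge the patch onto the current row
      db.run("UPDATE tasks SET title = ?, done = ? WHERE id = ?",
        [next.title, next.done, req.params.id],
        (updateErr) => {
          db.run(updateErr ? "ROLLBACK" : "COMMIT");
          res.sendStatus(updateErr ? 500 : 200);
        });
    });
  });
});

// Rate limiting on the tasks routes; express-rate-limit is one plausible choice.
const rateLimit = require("express-rate-limit");
app.use("/tasks", rateLimit({ windowMs: 60_000, max: 60 })); // 60 requests/min per IP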

Step 4: Codex Adversarial Review

/codex:adversarial-review Focus on security and error handling

Codex pushes harder:

🔴  Error messages leak internals
    DELETE /tasks/999 returns: {"error": "SQLITE_ERROR: no such row"}
    → Return generic 404, log details server-side only

🔴  No authentication
    All endpoints are public. Even if "auth comes later,"
    the route structure should enforce middleware now.

🟡  db.js opens connection at import time
    If SQLite file is missing/corrupt, server crashes on startup
    with unhelpful error. Add connection validation + retry.

🟡  Timestamps use Date.now() (epoch ms)
    API returns milliseconds but no timezone. Use ISO 8601 strings.

Step 5: Claude Revises Again

Claude fixes the error messages, adds an auth middleware placeholder, connection validation, and ISO timestamps.
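
A sketch of the error-handling and timestamp fixes (the handler shape is illustrative):

// server.js: clients get generic errors; details stay in the server log.
app.delete("/tasks/:id", (req, res) => {
  db.run("DELETE FROM tasks WHERE id = ?", [req.params.id], function (err) {
    if (err) {
      console.error(err); // full driver error stays server-side
      return res.status(500).json({ error: "Internal server error" });
    }
    if (this.changes === 0) {
      // Before: {"error": "SQLITE_ERROR: no such row"}. Now: a generic 404.
      return res.status(404).json({ error: "Task not found" });
    }
    res.sendStatus(204);
  });
});

// Timestamps as ISO 8601 strings instead of epoch milliseconds.
const createdAt = new Date().toISOString(); // e.g. "2026-03-30T12:00:00.000Z"

One more review: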

/codex:review
✅  No issues found. LGTM.

Step 6: Codex Reviews the Plan (Not Just the Code)

Before shipping, use Codex to review the architecture:

/codex:adversarial-review Challenge the overall design decisions
🟡  SQLite for a REST API?
    Fine for prototype, but no concurrent write support.
    If this grows beyond single-server, you'll need to migrate.
    Consider abstracting db.js behind a repository pattern now.

🟡  No pagination on GET /tasks
    Returns all tasks. Will become a problem at ~1000 rows.
    Add ?page=1&limit=20 with Link headers.

ℹ️  No health check endpoint
    Add GET /health for deployment monitoring.

These aren’t bugs — they’re architectural feedback that prevents future pain.
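
Applied, the pagination and health-check notes might look like this sketch (the page-size cap and Link header shape are illustrative):

// server.js: paginated listing per Codex's ?page=1&limit=20 suggestion.
app.get("/tasks", (req, res) => {
  const page  = Math.max(parseInt(req.query.page, 10)  || 1, 1);
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, 100); // cap page size
  db.all("SELECT * FROM tasks ORDER BY id LIMIT ? OFFSET ?",
    [limit, (page - 1) * limit],
    (err, rows) => {
      if (err) return res.status(500).json({ error: "Internal server error" });
      res.set("Link", `</tasks?page=${page + 1}&limit=${limit}>; rel="next"`);
      res.json(rows);
    });
});

// Health check endpoint for deployment monitoring.
app.get("/health", (req, res) => res.json({ status: "ok" }));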

The Full Loop

Claude Code                               Codex (via plugin)
┌─────────────┐                           ┌─────────────────────┐
│ 1. Implement│─── /codex:review ───────→ │ 2. Standard review  │
│             │←── findings ──────────────│    (4 bugs found)   │
│ 3. Fix bugs │                           │                     │
│             │── /codex:adversarial ───→ │ 4. Adversarial      │
│             │←── deeper issues ─────────│    (4 more issues)  │
│ 5. Fix all  │                           │                     │
│             │─── /codex:review ───────→ │ 6. LGTM ✅          │
│             │                           │                     │
│             │── /codex:adversarial ───→ │ 7. Architecture     │
│             │←── design feedback ───────│    review           │
│ 8. Refactor │                           │                     │
│ 9. Ship 🚀  │                           │                     │
└─────────────┘                           └─────────────────────┘

Total: 8 issues caught that Claude's self-review missed.

The Harness Engineering Connection

Cross-model review is a harness mechanism. It maps to Pillar 2 (Architectural Constraints) — instead of asking the model to follow a “review your own code” prompt (which it can ignore or apply with the same blind spots), you enforce review through a structurally different system. Code enforcement > prompt suggestions.
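
One way to make that enforcement concrete outside the plugin is a pre-push gate that fails unless the cross-model reviewer approves. A sketch, assuming the Codex CLI’s non-interactive codex exec mode; the APPROVE reply convention is an illustrative protocol, not part of any tool:

// check-review.js: wire this into .git/hooks/pre-push so pushes are blocked
// until a different model signs off. Enforcement lives in code, not a prompt.
const { execFileSync } = require("node:child_process");

const diff = execFileSync("git", ["diff", "origin/main...HEAD"], { encoding: "utf8" });
const verdict = execFileSync("codex",
  ["exec", `Review this diff. Reply APPROVE or list blocking issues:\n\n${diff}`],
  { encoding: "utf8" });

if (!/\bAPPROVE\b/.test(verdict)) {
  console.error("Cross-model review found blocking issues:\n" + verdict);
  process.exit(1); // non-zero exit aborts the push
}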

The codex-plugin-cc makes this zero-friction — no terminal switching, no copy-paste, no workflow interruption. The cross-model review thesis now has an official tool.

Evolution: From 2-Agent to 4-Agent Consensus Review

The next step beyond Claude + Codex pairing: 4 parallel agents with iterative consensus. A custom /cross-review skill orchestrates Claude Code, Codex, CodeRabbit, and an Integration Impact agent — all reviewing the same PR in parallel, then exchanging opinions and cross-validating until they agree.

Source: 硅谷陈源博士 on Weibo (2026-04)

                     PR #251
                        │
                        ▼
  /cross-review launches 4 agents in parallel
                        │
      ┌───────────┬─────┴──────┬───────────────┐
      ▼           ▼            ▼               ▼
┌──────────┐ ┌─────────┐ ┌────────────┐ ┌──────────────┐
│ Claude   │ │ Codex   │ │ CodeRabbit │ │ Integration  │
│ Code     │ │ (caller │ │ (integra-  │ │ Impact       │
│ (full    │ │  analy- │ │  tion      │ │ (versions,   │
│  review) │ │  sis)   │ │  checks)   │ │  CI, build)  │
└─────┬────┘ └────┬────┘ └──────┬─────┘ └───────┬──────┘
      │           │             │               │
      └───────────┴─────┬──────┴───────────────┘
                        │
                        ▼
          Exchange opinions → Cross-validate
                        │
                        ▼
           Up to 3 rounds until consensus
                        │
                        ▼
          ┌─────────────────────────────────┐
          │ Confirmed Issues (2+ reviewers) │
          │ Integration Findings            │
          │ Dismissed Findings (with why)   │
          └─────────────────────────────────┘

What Each Agent Brings

Agent                Specialty
Claude Code          Full code review — logic, patterns, best practices
Codex                Consumer/caller analysis — who uses this code and how
CodeRabbit           Integration checks — API contracts, backward compatibility
Integration Impact   Cross-cutting analysis — versions, CI, build, .gitignore

The Output: Three Categories

  1. Confirmed Issues — agreed by 2+ reviewers, with severity (Critical/Major/Minor/Info)
  2. Integration Findings — cross-cutting impact the individual reviewers missed
  3. Dismissed Findings — what the consensus rejected, with justification (e.g., “Docker is a system dependency, not a Flox package”)
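
A sketch of the consensus bookkeeping (the real /cross-review skill is custom and unpublished; the data shapes are illustrative, and Integration Findings, which come from the Integration Impact agent’s own pass, are omitted):

// consensus.js: group per-reviewer findings into the output categories.
// A finding: { reviewer: "codex", issue: "...", verdict: "confirm" | "dismiss", reason?: "..." }
function categorize(findings) {
  const byIssue = new Map();
  for (const f of findings) {
    if (!byIssue.has(f.issue)) byIssue.set(f.issue, []);
    byIssue.get(f.issue).push(f);
  }
  const confirmed = [];
  const dismissed = [];
  for (const [issue, votes] of byIssue) {
    const confirms = votes.filter((v) => v.verdict === "confirm");
    if (confirms.length >= 2) {
      // Confirmed: 2+ reviewers independently agree.
      confirmed.push({ issue, reviewers: confirms.map((v) => v.reviewer) });
    } else {
      // Dismissed: keep the justification the consensus settled on.
      dismissed.push({ issue, why: votes.find((v) => v.reason)?.reason ?? "not corroborated" });
    }
  }
  return { confirmed, dismissed };
}

module.exports = { categorize };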

The Real Insight

“The value isn’t in three independent opinions — it’s in the mutual validation through iterative discussion.”

A single reviewer’s opinion is cheap. Four independent reviewers give you breadth but leave you to resolve conflicts yourself. Iterative consensus forces the reviewers to confront each other’s findings: false positives get dismissed with reasons, genuine issues get confirmed by cross-checking, and cross-cutting impacts emerge that no single reviewer would have seen.

How It Compares

Approach                       Agents   Cost             Value
Self-review                    1        1x               Low (same blind spots)
Cross-model (Claude + Codex)   2        2x               High (different reasoning patterns)
4-agent consensus              4        ~5x (3 rounds)   Highest (mutual validation)

Use single cross-model review for daily work. Use 4-agent consensus for high-stakes PRs — release branches, security-sensitive code, architectural changes. The cost is worth it when a missed bug costs more than the review tokens.