Codex + Claude Code for Research – A Practical Tutorial

The companion entry (Cross-Model Code Review) explains why cross-model review works and walks through a REST-API example. This entry is the research-work tutorial – how to use codex-plugin-cc as a daily driver for formal verification, type-systems research, program analysis, and paper writing, where "looks right" is not good enough and adversarial rigor pays for itself.

*Source: openai/codex-plugin-cc · Cross-Model Code Review setup entry · Codex CLI 0.121.0 docs*

The Core Mental Model

Claude (Opus 4.6) and Codex (GPT-5.4) have different training data, different reasoning patterns, and different blind spots. For research work – where correctness is non-negotiable and novelty has to be defensible – that asymmetry is the whole point:

Claude (the practitioner)          Codex (the adversary)
  - Broad synthesis                  - Rigorous, skeptical
  - Drafts, explains, refactors      - Hunts counterexamples
  - Fast iteration                   - Challenges framings
  - Wants to help you ship           - Wants to prove you wrong

You drive Claude. You summon Codex when "this has to actually be true."

Command Cheat-Sheet

Command                      When to use
/codex:review                Standard second-pair-of-eyes pass on uncommitted work – proofs, encodings, analyzers
/codex:adversarial-review    Devil's-advocate pass – challenges problem framing, hunts counterexamples
/codex:rescue                Hand the whole task to Codex (run a tool, investigate a bug, fix the proof)
/codex:review --base main    Review a feature branch's diff against main
/codex:review --background   Async review; check with /codex:status, fetch with /codex:result
/codex:cancel                Kill a runaway background job

Keep the review gate off. Manual invocation beats the auto-gate because the gate triggers on every Claude response and quickly burns usage in a Claude↔Codex loop.

The Research Loop

            ┌────────────────────────────────────────────────┐
            │                                                │
            ▼                                                │
  ┌─────────────────┐                                        │
  │  Claude drafts  │   proof / encoding / type rule /       │
  │  (practitioner) │   problem statement / paper section    │
  └────────┬────────┘                                        │
           │ uncommitted changes                             │
           ▼                                                 │
  ┌─────────────────┐     /codex:review                      │
  │  Codex reviews  │ or  /codex:adversarial-review          │
  │   (adversary)   │     /codex:rescue (executes tools)     │
  └────────┬────────┘                                        │
           │ findings: gaps, counter-examples, open admits   │
           ▼                                                 │
  ┌─────────────────┐                                        │
  │  Disagreement?  │ ── No ──► commit                       │
  └────────┬────────┘                                        │
           │ Yes                                             │
           ▼                                                 │
   slow down, resolve, repeat ───────────────────────────────┘

Rule of thumb: Cross-model disagreement is the signal. Whenever Claude says "done" and Codex says "no", that's exactly where your research contribution lives – either your proof has a real gap, or your framing is unclear enough that two strong models read it differently.


Pattern 1 – Sharpening a Research Problem Statement

You have a vague intuition for a new result. Claude helps you articulate it; Codex tries to demolish it.

Step 1 – Claude drafts

Open the repo containing your draft (e.g., ~/research/relcost-hol/problem.md) and ask Claude:

Draft a one-page problem statement for proving non-interference for
a higher-order relational cost language, targeting POPL 2027. Include:
motivation, informal statement, why existing techniques don't suffice.

Step 2 – Codex attacks

With the draft uncommitted, run:

/codex:adversarial-review Focus on novelty and related-work overlap.
Assume you are Reviewer 2. Challenge every contribution claim.

Typical Codex output:

🔴 Contribution overlap – RelCost (Çiçek et al., POPL 2017) already
   handles non-interference for a fragment of this language. Your
   delta (higher-order + effects) needs an explicit worked example
   showing where their system fails.

🟡 Motivation is abstract – the "security-critical" framing is
   generic. Name one concrete program where cost-preserving
   non-interference matters and existing tools cannot prove it.

🟡 "Novel type-and-effect system" – Barthe, Grégoire, Zanella-Béguelin
   (CSF 2013) and Rajani & Garg (POPL 2020) both have relational
   refinement systems with cost. Cite and differentiate.

Step 3 – Iterate

Feed Codex's findings back to Claude:

Revise the problem statement to: (a) include a concrete failing
example vs. RelCost, (b) cite Barthe/Rajani and draw the delta
explicitly, (c) replace "security-critical" with the concrete scenario.

Re-run /codex:adversarial-review. Two to three rounds typically converge on a framing that survives a hostile read.

Budget: 3–5 Codex calls per session. If you hit 10, you're rewriting the paper with Codex's voice – stop and think.


Pattern 2 – Reviewing a Proof Script (Coq / Rocq / Lean)

Proof assistants accept incomplete proofs silently if you use admit, Admitted, shelve, or sorry. Self-review frequently misses these. Codex does not.
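
To see why this bites, here is a minimal, self-contained Coq sketch (the lemma is illustrative, not from the tutorial's repo) that compiles cleanly despite a hole:

(* Compiles under coqc: `admit` discharges the open goal, and
   `Admitted.` registers the whole lemma as an axiom. *)
Lemma add_zero_r : forall n : nat, n + 0 = n.
Proof.
  induction n.
  - reflexivity.   (* base case: 0 + 0 = 0 *)
  - admit.         (* hidden gap: the successor case is never proved *)
Admitted.          (* accepted silently; downstream proofs now rest on it *)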

Step 1 – Claude writes the proof

Prove `preservation` for the STLC fragment in Stlc.v. Use induction
on the typing derivation; handle E_App and E_Abs explicitly.

Claude produces Lemma preservation : ... with a tactic script.

Step 2 – Codex reviews

/codex:review Focus on completeness of the proof: any admit,
Admitted, sorry, shelve, or uses of `eauto` / `firstorder`
that may silently discharge the wrong goal. Also verify the
induction principle matches the typing judgment structure.

Typical findings:

🔴 Stlc.v:87 – `admit.` in case E_App with applied function.
   The induction hypothesis is there; try `eapply IHHt1; eauto`
   after strengthening the context.

🟡 Stlc.v:104 – uses `eauto 7` as closer. This succeeds, but a
   weaker lemma (wrong quantifier) would also satisfy it. Replace
   with explicit `apply substitution_lemma; eauto.`

ℹ️  The induction is on the term, but the typing judgment is
   mutual with values. Consider `dependent induction Ht` for
   the well-typed case.

Step 3 – Claude patches, Codex re-verifies

Address all three findings. Remove admit, tighten the eauto, switch
to dependent induction in the second branch.

Re-run /codex:review. When it returns ✅ No issues found, commit.
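
A cheap mechanical complement before committing: Coq's Print Assumptions vernacular lists every axiom a theorem depends on, so an Admitted lemma anywhere in the dependency chain surfaces immediately (preservation here is the lemma from Step 1):

(* Ask Coq which assumptions the finished proof rests on. *)
Print Assumptions preservation.
(* "Closed under the global context" means no admits or axioms
   anywhere in its dependencies; anything listed is a hole. *)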


Pattern 3 – Probing a New Type Rule for Soundness

You've proposed a new typing rule in a draft. Before formalizing, you want to know if a well-typed-but-stuck term exists.

Step 1 – Claude drafts

In rules.tex (or a markdown inference-rule file):

Add a typing rule T-Select for non-blocking channel select, with
premises for each branch's guard type and result type. State it
in the same style as the existing T-Send / T-Recv rules.
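
For concreteness, a rule of roughly this shape is what such a prompt tends to produce – a hedged sketch in mathpartir notation (\inferrule, assuming \usepackage{mathpartir}), not the tutorial's actual rule. Note that nothing in it forces the guards to be mutually exclusive:

% Hedged sketch only -- premise layout and metavariables are
% illustrative, not the draft's actual T-Select.
\inferrule[T-Select]
  { \Gamma \vdash g_i : \mathsf{bool}
    \\ \Gamma \vdash e_i : T
    \quad (1 \le i \le n) }
  { \Gamma \vdash \mathsf{select}\;\{\, g_1 \Rightarrow e_1 \mid \dots \mid g_n \Rightarrow e_n \,\} : T }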

Step 2 – Codex hunts counterexamples

/codex:adversarial-review Construct a program that type-checks under
T-Select but gets stuck during evaluation. If you find one, show the
derivation and the stuck configuration. If you can't, explain
which premise of T-Select rules out each candidate attack.

Codex might reply:

🔴 Counterexample:
   select { true => !c1; true => !c2 } where c1, c2 : !unit
   Both guards are `true`, type-checks via T-Select with unit result.
   Evaluation: nondeterministic commit to one branch; if the scheduler
   is adversarial both sends are published to different receivers,
   violating linearity.

   Root cause: T-Select doesn't require branch guards to be disjoint,
   but your operational semantics picks one branch and consumes the
   prefix of both. Fix options:
   (a) Require mutually exclusive guards in T-Select.
   (b) Change the operational semantics to commit atomically.

Step 3 – Claude revises the rule or semantics

Feed the counterexample back. Either tighten the rule or change the operational semantics. This is the core research loop – Codex plays the role an anonymous reviewer would, months earlier.
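
If you take fix (a), the repair is a single extra premise – again a hedged sketch, with disjointness stated as a semantic side condition:

% Fix (a), sketched: guards must be pairwise exclusive.
\inferrule[T-Select']
  { \Gamma \vdash g_i : \mathsf{bool}
    \\ \Gamma \vdash e_i : T
    \\ \vDash \neg (g_i \wedge g_j) \quad (i \ne j) }
  { \Gamma \vdash \mathsf{select}\;\{\, g_1 \Rightarrow e_1 \mid \dots \mid g_n \Rightarrow e_n \,\} : T }

With pairwise-exclusive guards, the two-true-guards counterexample above no longer type-checks.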


Pattern 4 – PCSAT Encoding Sanity Check (Relational Verification)

This one is straight from Q's pipeline. You've encoded a relational property as CHC constraints in a .clp file for PCSAT. Did you faithfully capture the source program's semantics?

Step 1 – Claude writes the encoding

Encode the `merged_scheduler` function from scheduler.ml as CHC
constraints in scheduler.clp. Preserve the relational cost invariant
from Section 3 of the paper. Use the predicate naming scheme in
existing .clp files.

Step 2 – Codex executes and verifies

Unlike /codex:review (read-only), /codex:rescue can actually run tools. This is how you close the loop:

/codex:rescue Run pcsat on scheduler.clp and report the verdict.
Then, independently, check whether the encoding faithfully represents
scheduler.ml: for each predicate in the .clp, show which ML construct
it abstracts, and flag any case where the abstraction drops
information (e.g., overapproximating loops, non-linear assertions
reduced to linear ones).

Codex will run pcsat scheduler.clp, collect the result, read scheduler.ml, and produce a report:

Verdict: pcsat returned SAT (property violated)

Encoding audit:
  ✅ merge_step → sched_step predicate: faithful
  ✅ relational cost pair (c₁, c₂) → sched_cost predicate: faithful
  🔴 loop_invariant: your encoding drops the tuple-length constraint
     from line 47 of scheduler.ml. As encoded, PCSAT considers traces
     of any length – which is why it finds a "counterexample" at
     length 0 that doesn't exist in the real program.
  🟡 non-linear cost aggregation encoded as linear sum. This is a
     known PCSAT limitation, but the paper's Theorem 4.2 requires the
     quadratic bound – you need to justify the linearization.
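
Logically, the 🔴 finding and its repair look roughly like this as Horn clauses – a hedged sketch that reuses the predicate names from the audit, not PCSAT's concrete .clp syntax; the length variable n is illustrative:

% Before: nothing constrains the trace tuple, so a length-0 model exists.
\forall s\, s'\, c_1\, c_2.\;
  \mathit{sched\_step}(s, s') \land \mathit{sched\_cost}(c_1, c_2)
  \Rightarrow \mathit{loop\_invariant}(s', c_1, c_2)

% After: the tuple-length constraint from scheduler.ml is restored.
\forall s\, s'\, c_1\, c_2\, n.\;
  \mathit{sched\_step}(s, s') \land \mathit{sched\_cost}(c_1, c_2) \land |s'| = n
  \Rightarrow \mathit{loop\_invariant}(s', c_1, c_2, n)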

Step 3 – Claude tightens the encoding

Add the tuple-length invariant to the relevant predicate in
scheduler.clp. Document the linearization assumption and cite
Theorem 4.2 explicitly as a side condition.

Re-run /codex:rescue. When the verdict matches the paper claim and the audit is clean, commit.


Pattern 5 – Literature-Gap Hunt

Reviewer 2's favorite rejection reason is "you missed X". Catch it before submission.

Step 1 – Claude drafts

Draft the Related Work section for our paper on ownership types for
async Rust. Cover: linear types, session types, existing Rust borrow
checker work, and async-specific verification tools.

Step 2 – Codex hunts gaps

/codex:adversarial-review Read related-work.tex and contribution.tex.
For each contribution claim, find at least one recent (2022+)
published paper that overlaps. If the overlap is partial, say exactly
what's different. If no overlap, say so and why you're confident.

Typical output:

🔴 Contribution C2 ("first relational proof of cancel-safety") –
   Jung et al. (OOPSLA 2023, "RustBelt goes async") prove a related
   safety property for tokio primitives. Your delta is the
   *relational* formulation; state that explicitly in Related Work.

🟡 You cite Pony's reference capabilities but miss Verona (Parkinson
   et al.). Verona is more recent (2024) and has a formal ownership
   calculus that overlaps with your § 4.

✅ Session types coverage is complete (Honda, Gay, Hu et al. all
   cited with correct differentiation).

Step 3 – Claude folds in the fixes

Now you cite Jung et al. and Verona, and reframe C2 with the relational delta explicit. You can also chain into /lit-scout for a fuller literature pass:

Run /lit-scout on "relational ownership types async" to confirm
I'm not missing anything else.

Working Async on Long Tasks

Proofs and big encodings can take minutes. Don't block:

/codex:review --background
# keep writing in Claude Code
/codex:status          # check if done
/codex:result          # pull findings when ready

/codex:status shows all active/recent jobs. /codex:result shows the latest (or pass a session ID).

This pattern pairs well with background theorem provers – kick off Codex to audit one proof while you write the next.

When NOT to Use It

Situation                                  Why skip Codex
Exploratory prototyping                    Review tax > value; you're going to rewrite it anyway
Fixing a typo / rename                     No semantic content to review
Claude just wrote tests that are failing   Fix the code first; Codex can't help until the thing runs
You're stuck on understanding a library    Use /codex:rescue as reader, not reviewer – different use case
Mid-refactor, half-finished state          Commit the intermediate step or --base against a checkpoint; otherwise Codex's feedback is noise

Cost & Usage Discipline

  • Budget 3–5 Codex calls per research session. More than that usually means you’re letting Codex write the paper.
  • Prefer /codex:review over /codex:adversarial-review for routine checks; save adversarial for design decisions, problem statements, and pre-submission reads.
  • Keep the review gate off. It auto-invokes Codex after every Claude response and is a known loop hazard.
  • Use --background for anything over a few files or a big proof β€” don’t waste wall-clock time blocking.

Model Complementarity – The Disagreement Rule

Claude says "LGTM" + Codex says "LGTM"   → ship
Claude says "LGTM" + Codex raises issues → slow down, investigate
Claude raises issues + Codex says "LGTM" → usually Claude's caution is real; re-ask Codex with specifics
Both raise issues                        → fix; no judgment call needed

The most valuable case is Claude-LGTM / Codex-flags – that's where research bugs hide. Self-review missed it; a different reasoning style caught it. Treat these as signal, not noise.


How the LearnAI Team Could Use This

  • CS305 / formal methods courses: Students pair Claude (to draft proofs, explain tactics) with Codex (to hunt admits, verify induction principles). Makes adversarial review a habit β€” the same cognitive move PhD students need in their own work.
  • PCSAT pipeline: The relational-verification research workflow (merged scheduler, cell morphing, relational cost) is exactly Pattern 4. Replace β€œmanually audit the encoding” with a /codex:rescue call that runs PCSAT and diffs the abstraction against the source program.
  • Paper drafts: Every contribution claim gets an adversarial pass before internal review. Cuts the β€œReviewer 2 missed X” rejection path by a meaningful fraction.
  • Thesis chapters / grant proposals: Pattern 1 (problem-statement sharpening) catches weak framings before advisor/PM time is spent on them.
  • AI-ed research artifacts: When LAI ships tutorials or autograders that students will rely on, cross-model review catches the silent-failure bugs Claude’s self-tests miss.

Real-World Use Cases (Recap)

#  Pattern                          Codex command                             Research payoff
1  Problem-statement sharpening     /codex:adversarial-review                 Defensible novelty claim; fewer weak-framing rejections
2  Proof-script review (Coq/Rocq)   /codex:review                             No hidden admits, correct induction
3  Type-rule soundness probe        /codex:adversarial-review                 Counterexamples before formalization
4  PCSAT encoding sanity check      /codex:rescue                             Faithful CHC abstractions, verdicts that match paper claims
5  Literature-gap hunt              /codex:adversarial-review + /lit-scout    Fewer missed citations, sharper deltas

One-Screen Summary

Before Codex:
  Claude writes → Claude reviews → "LGTM" → hidden bug ships
After Codex:
  Claude writes → /codex:review → gaps surface → Claude patches
                → /codex:review → clean → commit
                → /codex:adversarial-review (before submission)

For research: correctness matters more than throughput.
The review tax is the work.

The setup entry (linked above) covers installation. This entry covers the muscle memory. Run it on a real proof this week and see which pattern fits your workflow.