Feynman AI Research Agent & Claude as Lab Partner β€” Promise and Pitfalls

Two stories from the same week paint the full picture of AI in research: Feynman, an open-source multi-agent system purpose-built for scientific investigation, and a physics professor’s brutally honest account of using Claude to reproduce theoretical predictions. Together they answer the question every researcher is asking: Can AI actually do science? The answer is yes β€” spectacularly fast, dangerously confident, and only safe under expert supervision.

*Sources: Feynman GitHub; 爱可可-爱生活 (2026-03-26); 哈勃观察员 (2026-03-26)*

Feynman: Open-Source AI Research Agent

Feynman is a multi-agent research system that coordinates specialized AI agents through natural language. You describe what you want to investigate; it dispatches the right agents, searches literature, and returns cited results.

User: /deepresearch "transformer attention mechanisms"
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Feynman Agent Orchestra            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Literature β”‚  Critical  β”‚    Experiment      β”‚
β”‚  Search    β”‚  Review    β”‚   Replication      β”‚
β”‚            β”‚            β”‚                    β”‚
β”‚ AlphaXiv   β”‚ Methodologyβ”‚ Code generation    β”‚
β”‚ parsing    β”‚ audit      β”‚ + execution        β”‚
β”‚ + ranking  β”‚ + gaps     β”‚ + validation       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
Cited report with precise references + linked code

Key Commands

Command                 | What It Does
----------------------- | --------------------------------------------------------
/deepresearch <topic>   | Full multi-agent deep dive: literature, synthesis, gaps
/lit <topic>            | Targeted literature search and summary
/audit <paper ID>       | Critical methodological review of a specific paper
/replicate <experiment> | Attempt to reproduce an experiment with code

Features at a Glance

Feature            | Details
------------------ | --------------------------------------------------------
Architecture       | Multi-agent, built on the Pi framework
Literature backend | AlphaXiv parsing with citation linking
Interfaces         | Web UI + CLI
Runtime            | Node.js
Security           | Docker container isolation for code execution
Citation policy    | All outputs include precise references; no uncited claims
Target users       | AI scientists, engineers, research teams
License            | Open source
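
The summary above doesn't expose Feynman's sandbox internals, but the pattern the Security row names (running generated code inside an isolated Docker container) looks roughly like the minimal sketch below. The image, flags, and run_untrusted helper are illustrative assumptions, not Feynman's actual code.

```python
import pathlib
import subprocess
import tempfile

def run_untrusted(code: str, timeout_s: int = 60) -> str:
    """Execute AI-generated Python in a throwaway, network-less container."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "snippet.py").write_text(code)
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",           # no exfiltration, no surprise downloads
         "--memory", "512m",            # cap resources
         "-v", f"{workdir}:/work:ro",   # mount the code read-only
         "python:3.12-slim", "python", "/work/snippet.py"],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout if result.returncode == 0 else result.stderr

print(run_untrusted("print(2 + 2)"))
```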

The design philosophy is right: every claim links back to a source, code links to literature, and experiments link to both. This is what research tooling should look like.

The Physics Case Study: Claude as Grad Student

A professor gave Claude a task any second-year grad student could handle: reproduce the Sudarsky shoulder predictions in particle physics. What followed was a two-week experiment that revealed both the ceiling and the floor of AI-assisted research.

What Worked β€” Spectacularly

Metric              | Result
------------------- | ---------------------------------------------------------------
Paper draft         | 20 pages in 3 days
Iterations          | 110+ standalone versions in 2 weeks
Token consumption   | 36 million tokens
CPU simulation time | 40+ hours of local computation
Self-organization   | Set its own plan, built structure, split work into 102 subtasks

The tasks that make human grad students miserable (writing Fortran interfaces, tuning Python plots, computing integrals) the AI did in seconds, without complaint and without fatigue. It self-organized its workflow: it set milestones, built the document structure, reasoned progressively through sub-problems, and divided the work into 102 orderly subtasks.
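
For scale, β€œcomputing integrals” here means the kind of evaluation that costs an afternoon by hand and seconds with generated code. A toy illustration follows; the integrand is invented and has nothing to do with the actual physics in the case study:

```python
import numpy as np
from scipy.integrate import quad

# Invented integrand, purely to show the class of task; not from the paper.
def integrand(x):
    return np.exp(-x**2) * np.log(1.0 + x**2)

value, abs_err = quad(integrand, 0.0, np.inf)  # semi-infinite definite integral
print(f"I = {value:.6f} (estimated error {abs_err:.1e})")
```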

What Went Wrong β€” Dangerously

Mid-experiment, Claude exposed AI’s fatal weakness in scientific work: it tried to wing it.

What the AI fabricated:
β”œβ”€β”€ Made-up coefficients (looked plausible, were wrong)
β”œβ”€β”€ Fabricated citation tables with real-sounding terminology
β”œβ”€β”€ Statistical errors buried in calculations
└── "Plausible bullshitting" β€” correct form, wrong substance

In theoretical physics, this kind of confident fabrication is catastrophic. A wrong coefficient doesn’t just give you a bad number β€” it can make your predictions diverge enormously from reality while still looking mathematically reasonable. The paper would have passed a casual review.

The Save

The professor caught it. He forced point-by-point verification β€” every coefficient checked against source material, every citation confirmed, every statistical method validated. Once caught, Claude corrected the key errors and completed the full recalculation in 5 minutes. A human grad student doing the same corrections? Roughly 2 weeks.

AI Research Workflow (what actually works):

  AI generates ──→ Human verifies ──→ AI corrects ──→ Human validates
  (fast, broad)    (slow, precise)    (fast, targeted)  (final check)
      β”‚                  β”‚                  β”‚                β”‚
   3 days            catches errors      5 minutes       publishable
   110 drafts        fabrications         fixed           result
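
Point-by-point verification needs no special tooling, but the boring half can be scripted. A minimal sketch, assuming you have transcribed the reference coefficients by hand; every name and value below is a placeholder, not the actual physics:

```python
import math

# Hand-transcribed values from the source paper (placeholders).
reference = {"c_1": 1.2500, "c_2": -0.3333, "c_3": 0.0072}
# What the model's draft reports (placeholders).
ai_output = {"c_1": 1.2500, "c_2": -0.3330, "c_3": 0.0072}

for name, ref in reference.items():
    got = ai_output.get(name)
    if got is None or not math.isclose(got, ref, rel_tol=1e-4):
        print(f"MISMATCH {name}: paper says {ref}, draft says {got}")
```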

Lessons Learned

AI as Tireless Research Assistant

The physics case study proves AI can compress weeks of tedious work into hours: literature synthesis, code generation, integral computation, plot generation, document structuring. These are real, valuable capabilities for any research team.

But: Verification Is Non-Negotiable

AI Strength                            | AI Weakness
-------------------------------------- | --------------------------------------------------------
Speed (seconds vs. weeks)              | Fabricates when uncertain
Never tired, never complains           | Cannot assess its own confidence
Handles 102 subtasks systematically    | β€œPlausible bullshitting”: correct form, wrong substance
Self-organizes complex workflows       | Won’t tell you when it’s guessing
Writes Fortran, Python, LaTeX fluently | Statistical errors look like real results

The β€œTaste” That Remains Human

The professor’s final answer to β€œwill physicists lose their jobs?” β€” No. Computing power and knowledge are becoming cheap as water. What remains uniquely human is taste: the judgment to choose which problems are worth pursuing among infinite paths. AI can explore any direction you point it toward, but it cannot tell you which direction matters.

The Reusable Methodology: From Executor to Commander

θΆ…ηΊ§ε³° extracted the professor’s approach into a replicable framework. The core shift: stop being an executor, become a commander. Don’t rely on AI’s memory β€” give it searchable structured documents.

*Source: θΆ…ηΊ§ε³° on Xiaohongshu (2026-04)*

Three Core Principles

Principle                    | What It Means
---------------------------- | --------------------------------------------------------------------------------------
Clear I/O per task           | Every subtask has explicit input and output definitions; don’t make the AI guess your intent
One-conversation granularity | Each subtask should be completable within a single conversation
Result persistence           | Save each result immediately as input for the next task; the AI reads documents, not memory
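
The three principles translate directly into a tiny harness. The sketch below is one way to encode them, not the professor's actual tooling; call_model is a stand-in for whatever LLM call you use and is assumed to return a {filename: text} mapping:

```python
import json
import pathlib
from dataclasses import dataclass, asdict

RESULTS = pathlib.Path("results")  # structured documents, not chat memory

@dataclass
class Subtask:
    name: str
    inputs: list[str]    # files from earlier tasks the prompt must reference
    outputs: list[str]   # files this task must produce
    prompt: str          # explicit instructions: one conversation's worth

def run(task: Subtask, call_model) -> None:
    # Clear I/O: the model sees exactly the named inputs, nothing implicit.
    context = {name: (RESULTS / name).read_text() for name in task.inputs}
    answer = call_model(task.prompt, context)  # assumed: {filename: text}
    # Result persistence: these outputs become the next subtask's inputs.
    for out in task.outputs:
        (RESULTS / out).write_text(answer[out])
    (RESULTS / f"{task.name}.meta.json").write_text(json.dumps(asdict(task)))
```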

The 7-Phase Execution Model

This is how the professor decomposed a year of work into 102 subtasks across 7 phases:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. Needs │──▢│ 2. Arch  │──▢│ 3. Module│──▢│ 4. Impl  β”‚
β”‚ Analysis β”‚   β”‚ Design   β”‚   β”‚ Decomposeβ”‚   β”‚ (one by  β”‚
β”‚          β”‚   β”‚          β”‚   β”‚          β”‚   β”‚  one)    β”‚
β”‚β€’ Define  β”‚   β”‚β€’ Problem β”‚   β”‚β€’ Break   β”‚   β”‚β€’ Priorityβ”‚
β”‚  problem β”‚   β”‚ structure│   β”‚  into    β”‚   β”‚  order   β”‚
β”‚β€’ Success β”‚   β”‚β€’ Tech    β”‚   β”‚  subtasksβ”‚   β”‚β€’ Verify  β”‚
β”‚  criteriaβ”‚   β”‚  roadmap β”‚   β”‚β€’ Clear   β”‚   β”‚  each    β”‚
β”‚β€’ Key     β”‚   β”‚β€’ Verify  β”‚   β”‚  I/O per β”‚   β”‚β€’ Record  β”‚
β”‚  limits  β”‚   β”‚  approachβ”‚   β”‚  subtask β”‚   β”‚  issues  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                                            β”‚
       β–Ό                                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 5. Integ │──▢│ 6. Opt   │──▢│ 7. Docs  β”‚
β”‚ Testing  β”‚   β”‚ Iterate  β”‚   β”‚          β”‚
β”‚          β”‚   β”‚          β”‚   β”‚β€’ Process β”‚
β”‚β€’ Combine β”‚   β”‚β€’ Test-   β”‚   β”‚  docs    β”‚
β”‚  modules β”‚   β”‚  based   β”‚   β”‚β€’ Final   β”‚
β”‚β€’ End-to- β”‚   β”‚  optimizeβ”‚   β”‚  paper   β”‚
β”‚  end testβ”‚   β”‚β€’ Adjust  β”‚   β”‚β€’ Publish β”‚
β”‚β€’ Fix     β”‚   β”‚  params  β”‚   β”‚  prep    β”‚
β”‚  issues  β”‚   β”‚          β”‚   β”‚          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Every phase ends with a CHECKPOINT β€” verify direction before continuing.

Key Techniques

  • Incremental verification β€” After each task, immediately check results. If problems found, adjust your prompts β€” don’t let AI self-correct (it will rationalize errors instead of fixing them)
  • Structured documents β€” All intermediate results saved as searchable files. AI doesn’t need to remember what happened before β€” just reads documents to continue working
  • Error isolation β€” One subtask failure doesn’t affect others. Easy to locate and fix issues without cascading damage
  • Checkpoint design β€” Every phase boundary has a verification gate. Ensures the project doesn’t drift off course

Three Tips for Anyone Replicating This

  1. Learn to decompose tasks β€” Break big problems into small ones. AI handles one deterministic task at a time.
  2. Learn to design prompts with clear I/O β€” Every prompt needs explicit input/output definitions. Don’t let AI guess your intent.
  3. Learn to verify results β€” AI output needs your judgment. This is not a burden β€” it’s your core competitive advantage.

β€œAI is ready to be your β€˜second-year grad student.’ Are you ready to be the β€˜commander’?”

Practical Guidance for Researchers

Using Feynman or Similar Research Agents

  1. Start with /lit to survey a field before committing to a direction
  2. Use /audit on key papers before building on their results
  3. Run /replicate in Docker isolation β€” never trust generated code on bare metal
  4. Cross-check citations β€” even citation-focused tools can hallucinate references (see the sketch after this list)
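
Point 4 is easy to partially automate when citations carry DOIs. A minimal sketch against the public Crossref API; the DOI below is only an example, and a successful lookup proves the reference exists, not that it supports the claim:

```python
import requests

def doi_exists(doi: str) -> bool:
    """True if Crossref has a record for this DOI."""
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.status_code == 200

for doi in ["10.1103/PhysRevD.98.030001"]:  # example; substitute the cited DOIs
    print(doi, "found" if doi_exists(doi) else "NOT FOUND: check by hand")
```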

Using Claude/LLMs for Research Computation

  1. Assign structured tasks β€” β€œreproduce this specific calculation” beats β€œexplore this topic” (a prompt sketch follows this list)
  2. Demand intermediate outputs β€” check every step, not just the final result
  3. Watch for confident fabrication β€” the more fluent the output, the more carefully you should verify
  4. Use it for drudge work β€” Fortran interfaces, plotting, integral computation, formatting
  5. Never skip human verification β€” especially for coefficients, citations, and statistical claims
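
One way to make points 1, 2, and 5 concrete in a single prompt. Everything named here (the equation number, file name, ranges) is a placeholder:

```python
# Hypothetical prompt skeleton for a structured computation task: explicit
# input, explicit output, and intermediate values a reviewer can check.
TASK_PROMPT = """\
Task: reproduce equation (12) of the attached paper numerically.
Input: coefficients.json (provided below).
Output: a table of predictions for x in [0, 10], step 0.5.
Rules:
- Show every intermediate quantity (integrals, normalizations) before the table.
- Cite the paper's equation or table for every coefficient you use.
- If a value is not in the provided material, write NOT PROVIDED; do not guess.
"""
```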

The Research Team of 2026

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         Modern Research Workflow          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                         β”‚
β”‚  Human researcher (taste + judgment)    β”‚
β”‚       β”‚                                 β”‚
β”‚       β”œβ”€β”€ Feynman: literature + review  β”‚
β”‚       β”œβ”€β”€ Claude: computation + drafts  β”‚
β”‚       β”œβ”€β”€ Docker: safe code execution   β”‚
β”‚       └── Point-by-point verification   β”‚
β”‚                                         β”‚
β”‚  Output: faster research, same rigor    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For academic faculty: this is the emerging workflow. AI handles volume; you supply direction and verification. The professor who caught Claude’s fabrications didn’t need less expertise β€” he needed more. AI research tools don’t replace domain knowledge; they make it more valuable than ever.

How LearnAI Team Could Use This

  • Research workflow training β€” Teach students how to decompose research projects into verifiable AI-assisted subtasks.
  • Literature review support β€” Use Feynman-style agents to build citation-grounded paper maps before seminars or projects.
  • Replication assignments β€” Have students audit or reproduce selected papers with Docker-isolated code execution.
  • Faculty research acceleration β€” Use agents for first-pass synthesis, code scaffolding, and draft generation while keeping expert verification mandatory.

Real-World Use Cases

  1. Academic labs β€” Speed up literature review, experiment planning, and reproducibility checks.
  2. Graduate research β€” Break thesis or paper work into small AI-assisted tasks with checkpoints.
  3. Industry R&D β€” Evaluate papers before deciding whether to implement or benchmark new methods.
  4. Scientific writing β€” Generate structured drafts while preserving human review for claims, citations, and statistics.