Two stories from the same week paint the full picture of AI in research: Feynman, an open-source multi-agent system purpose-built for scientific investigation, and a physics professor's brutally honest account of using Claude to reproduce theoretical predictions. Together they answer the question every researcher is asking: can AI actually do science? The answer is yes: spectacularly fast, dangerously confident, and safe only under expert supervision.
*Sources: Feynman GitHub | 爱可可-爱生活 (2026-03-26) | εεθ§ε―ε (2026-03-26)*
Feynman: Open-Source AI Research Agent
Feynman is a multi-agent research system that coordinates specialized AI agents through natural language. You describe what you want to investigate; it dispatches the right agents, searches literature, and returns cited results.
```
User: /deepresearch "transformer attention mechanisms"
        ↓
┌──────────────────────────────────────────────┐
│            Feynman Agent Orchestra           │
├─────────────┬─────────────┬──────────────────┤
│ Literature  │ Critical    │ Experiment       │
│ Search      │ Review      │ Replication      │
│             │             │                  │
│ AlphaXiv    │ Methodology │ Code generation  │
│ parsing     │ audit       │ + execution      │
│ + ranking   │ + gaps      │ + validation     │
└─────────────┴─────────────┴──────────────────┘
        ↓
Cited report with precise references + linked code
```
Key Commands
| Command | What It Does |
|---|---|
| `/deepresearch <topic>` | Full multi-agent deep dive: literature, synthesis, gaps |
| `/lit <topic>` | Targeted literature search and summary |
| `/audit <paper ID>` | Critical methodological review of a specific paper |
| `/replicate <experiment>` | Attempt to reproduce an experiment with code |
Features at a Glance
| Feature | Details |
|---|---|
| Architecture | Multi-agent on Pi framework |
| Literature backend | AlphaXiv parsing with citation linking |
| Interfaces | Web UI + CLI |
| Runtime | Node.js |
| Security | Docker container isolation for code execution |
| Citation policy | All outputs include precise references; no uncited claims |
| Target users | AI scientists, engineers, research teams |
| License | Open source |
The design philosophy is right: every claim links back to a source, code links to literature, and experiments link to both. This is what research tooling should look like.
The Physics Case Study: Claude as Grad Student
A professor gave Claude a task any second-year grad student could handle: reproduce the Sudarsky shoulder predictions in particle physics. What followed was a two-week experiment that revealed both the ceiling and the floor of AI-assisted research.
What Worked, Spectacularly
| Metric | Result |
|---|---|
| Paper draft | 20 pages in 3 days |
| Iterations | 110+ standalone versions in 2 weeks |
| Token consumption | 36 million tokens |
| CPU simulation time | 40+ hours of local computation |
| Self-organization | AI set its own plan, built structure, split into 102 subtasks |
The tasks that make human grad students miserable (writing Fortran interfaces, tuning Python plots, computing integrals) the AI did in seconds, without complaint and without fatigue. It also self-organized its workflow: it set milestones, built the document structure, progressively reasoned through sub-problems, and split the work into 102 orderly subtasks.
What Went Wrong, Dangerously
Mid-experiment, Claude exposed AI's fatal weakness in scientific work: it tried to wing it.
What the AI fabricated:

```
├── Made-up coefficients (looked plausible, were wrong)
├── Fabricated citation tables with real-sounding terminology
├── Statistical errors buried in calculations
└── "Plausible bullshitting": correct form, wrong substance
```
In theoretical physics, this kind of confident fabrication is catastrophic. A wrong coefficient doesn't just give you a bad number; it can make your predictions diverge enormously from reality while still looking mathematically reasonable. The paper would have passed a casual review.
The Save
The professor caught it. He forced point-by-point verification β every coefficient checked against source material, every citation confirmed, every statistical method validated. Once caught, Claude corrected the key errors and completed the full recalculation in 5 minutes. A human grad student doing the same corrections? Roughly 2 weeks.
AI Research Workflow (what actually works):

```
AI generates ──▶ Human verifies ──▶ AI corrects ──▶ Human validates
(fast, broad)    (slow, precise)    (fast, targeted)   (final check)
     ↓                ↓                  ↓                 ↓
  3 days         catches errors      5 minutes        publishable
 110 drafts      fabrications          fixed            result
```
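The loop in the diagram can be sketched as a small driver. Here `generate`, `verify`, and `correct` are placeholders for your own model calls and domain checks; none of these names come from Feynman or the professor's actual setup:

```python
def run_verified(generate, verify, correct, max_rounds=3):
    """Run an AI draft through human-defined checks until it passes."""
    draft = generate()                    # fast, broad AI generation
    for _ in range(max_rounds):
        problems = verify(draft)          # slow, precise human/automated checks
        if not problems:
            return draft                  # final validation passed
        draft = correct(draft, problems)  # fast, targeted AI correction
    raise RuntimeError("draft failed verification after retries")
```

The design point is that the human only writes `verify`: the slow, precise step stays under expert control while generation and correction stay cheap.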
Lessons Learned
AI as Tireless Research Assistant
The physics case study proves AI can compress weeks of tedious work into hours: literature synthesis, code generation, integral computation, plot generation, document structuring. These are real, valuable capabilities for any research team.
But: Verification Is Non-Negotiable
| AI Strength | AI Weakness |
|---|---|
| Speed (seconds vs. weeks) | Fabricates when uncertain |
| Never tired, never complains | Cannot assess its own confidence |
| Handles 102 subtasks systematically | "Plausible bullshitting": correct form, wrong substance |
| Self-organizes complex workflows | Won't tell you when it's guessing |
| Writes Fortran, Python, LaTeX fluently | Statistical errors look like real results |
The "Taste" That Remains Human
The professor's final answer to "will physicists lose their jobs?" is no. Computing power and knowledge are becoming as cheap as water. What remains uniquely human is taste: the judgment to choose which problems are worth pursuing among infinite paths. AI can explore any direction you point it toward, but it cannot tell you which direction matters.
The Reusable Methodology: From Executor to Commander
超级峰 extracted the professor's approach into a replicable framework. The core shift: stop being an executor and become a commander. Don't rely on the AI's memory; give it searchable, structured documents.
Source: 超级峰 on Xiaohongshu (2026-04)
Three Core Principles
| Principle | What It Means |
|---|---|
| Clear I/O per task | Every subtask has explicit input and output definitions; don't let the AI guess your intent |
| One-conversation granularity | Each subtask should be completable within a single conversation |
| Result persistence | After completing a task, immediately save its results as input for the next task; the AI reads documents, not memory |
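The three principles can be sketched as a tiny harness: each subtask declares its inputs and outputs explicitly, and persists its result to a file the next subtask reads instead of relying on conversation memory. The file layout and field names here are illustrative, not from the original write-up:

```python
import json
from pathlib import Path

def run_subtask(task_id: str, inputs: dict, work, outdir: Path) -> Path:
    """Execute one subtask and persist its result for the next one."""
    result = {
        "task": task_id,
        "inputs": inputs,          # explicit input definition
        "output": work(inputs),    # explicit output definition
    }
    path = outdir / f"{task_id}.json"
    path.write_text(json.dumps(result, indent=2))  # AI reads documents, not memory
    return path

def load_result(path: Path):
    """The next subtask loads the saved document instead of remembering."""
    return json.loads(path.read_text())["output"]
```

Because every subtask reads and writes files, any conversation can be restarted from the last saved document without losing state.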
The 7-Phase Execution Model
This is how the professor decomposed a year of work into 102 subtasks across 7 phases:
```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ 1. Needs │──▶│ 2. Arch  │──▶│ 3. Module│──▶│ 4. Impl  │
│ Analysis │   │ Design   │   │ Decompose│   │ (one by  │
│          │   │          │   │          │   │  one)    │
│• Define  │   │• Problem │   │• Break   │   │• Priority│
│  problem │   │ structure│   │  into    │   │  order   │
│• Success │   │• Tech    │   │  subtasks│   │• Verify  │
│  criteria│   │  roadmap │   │• Clear   │   │  each    │
│• Key     │   │• Verify  │   │  I/O per │   │• Record  │
│  limits  │   │  approach│   │  subtask │   │  issues  │
└──────────┘   └──────────┘   └──────────┘   └─────┬────┘
                                                   │
     ┌─────────────────────────────────────────────┘
     ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│ 5. Integ │──▶│ 6. Opt   │──▶│ 7. Docs  │
│ Testing  │   │ Iterate  │   │          │
│          │   │          │   │• Process │
│• Combine │   │• Test-   │   │  docs    │
│  modules │   │  based   │   │• Final   │
│• End-to- │   │  optimize│   │  paper   │
│  end test│   │• Adjust  │   │• Publish │
│• Fix     │   │  params  │   │  prep    │
│  issues  │   │          │   │          │
└──────────┘   └──────────┘   └──────────┘
```
Every phase ends with a CHECKPOINT: verify direction before continuing.
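The phase-plus-checkpoint structure can be sketched as a gated pipeline. The phase names follow the diagram; the `execute` and `checkpoint` callables are whatever work and expert check applies at each boundary (this code is an illustration, not part of Feynman or the professor's tooling):

```python
PHASES = [
    "needs analysis", "architecture design", "module decomposition",
    "implementation", "integration testing", "optimization", "documentation",
]

def run_phases(execute, checkpoint):
    """Run each phase, refusing to continue past a failed checkpoint."""
    completed = []
    for phase in PHASES:
        artifact = execute(phase)
        if not checkpoint(phase, artifact):  # verify direction before continuing
            raise RuntimeError(f"checkpoint failed after {phase!r}")
        completed.append(phase)
    return completed
```

Failing fast at a checkpoint is the point: a wrong direction caught at phase 2 costs one phase of work, not seven.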
Key Techniques
- Incremental verification: after each task, immediately check the results. If you find problems, adjust your prompts; don't let the AI self-correct (it will rationalize errors instead of fixing them)
- Structured documents: all intermediate results are saved as searchable files. The AI doesn't need to remember what happened before; it just reads the documents and continues working
- Error isolation: one subtask's failure doesn't affect the others, so issues are easy to locate and fix without cascading damage
- Checkpoint design: every phase boundary has a verification gate, which keeps the project from drifting off course
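Error isolation in particular is easy to sketch: run each named subtask independently, recording failures next to successes instead of letting one exception abort the batch. The task names here are made up for illustration:

```python
def run_isolated(tasks: dict):
    """Run each named task; collect results and failures separately."""
    results, failures = {}, {}
    for name, task in tasks.items():
        try:
            results[name] = task()
        except Exception as exc:      # one subtask failing doesn't stop others
            failures[name] = str(exc)
    return results, failures
```

The `failures` map is what makes problems easy to locate: you rerun only the subtasks listed there, with their inputs still sitting in the saved documents.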
Three Tips for Anyone Replicating This
- Learn to decompose tasks: break big problems into small ones. The AI handles one deterministic task at a time.
- Learn to design prompts with clear I/O: every prompt needs explicit input/output definitions. Don't let the AI guess your intent.
- Learn to verify results: AI output needs your judgment. This is not a burden; it's your core competitive advantage.
"AI is ready to be your 'second-year grad student.' Are you ready to be the 'commander'?"
Practical Guidance for Researchers
Using Feynman or Similar Research Agents
- Start with `/lit` to survey a field before committing to a direction
- Use `/audit` on key papers before building on their results
- Run `/replicate` in Docker isolation; never trust generated code on bare metal
- Cross-check citations: even citation-focused tools can hallucinate references
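Cross-checking citations can be as simple as diffing the IDs an agent emitted against a bibliography you trust. This is a minimal sketch; the arXiv-style IDs and the dict-keyed bibliography format are assumptions for illustration, not Feynman's actual output format:

```python
def suspicious_citations(cited_ids, bibliography):
    """Return cited IDs that do not appear in the verified bibliography."""
    known = set(bibliography)
    return sorted(cid for cid in cited_ids if cid not in known)

# Example: one real-looking but unverified ID gets flagged for manual review.
bib = {"1706.03762": "Attention Is All You Need"}
flagged = suspicious_citations(["1706.03762", "9999.99999"], bib)
```

Anything flagged still needs a human lookup; the check only tells you which references to distrust first.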
Using Claude/LLMs for Research Computation
- Assign structured tasks: "reproduce this specific calculation" beats "explore this topic"
- Demand intermediate outputs: check every step, not just the final result
- Watch for confident fabrication: the more fluent the output, the more carefully you should verify
- Use it for drudge work: Fortran interfaces, plotting, integral computation, formatting
- Never skip human verification, especially for coefficients, citations, and statistical claims
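For coefficients specifically, the point-by-point verification the professor forced can be partly mechanized: compare every AI-produced number against a reference value you computed or looked up yourself. The coefficient names and tolerance below are illustrative:

```python
import math

def check_coefficients(ai_values: dict, reference: dict, rel_tol=1e-6):
    """Return the names of coefficients that disagree with the reference."""
    bad = []
    for name, ref in reference.items():
        got = ai_values.get(name)
        # Missing or numerically divergent values both count as failures.
        if got is None or not math.isclose(got, ref, rel_tol=rel_tol):
            bad.append(name)
    return sorted(bad)
```

This only works if the reference values come from source material, not from the same model; checking AI output against AI output is how plausible fabrications survive.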
The Research Team of 2026
```
┌─────────────────────────────────────────┐
│        Modern Research Workflow         │
├─────────────────────────────────────────┤
│                                         │
│  Human researcher (taste + judgment)    │
│       │                                 │
│       ├── Feynman: literature + review  │
│       ├── Claude: computation + drafts  │
│       ├── Docker: safe code execution   │
│       └── Point-by-point verification   │
│                                         │
│  Output: faster research, same rigor    │
└─────────────────────────────────────────┘
```
For academic faculty, this is the emerging workflow. AI handles volume; you supply direction and verification. The professor who caught Claude's fabrications didn't need less expertise; he needed more. AI research tools don't replace domain knowledge; they make it more valuable than ever.
How LearnAI Team Could Use This
- Research workflow training: teach students how to decompose research projects into verifiable AI-assisted subtasks.
- Literature review support: use Feynman-style agents to build citation-grounded paper maps before seminars or projects.
- Replication assignments: have students audit or reproduce selected papers with Docker-isolated code execution.
- Faculty research acceleration: use agents for first-pass synthesis, code scaffolding, and draft generation while keeping expert verification mandatory.
Real-World Use Cases
- Academic labs: speed up literature review, experiment planning, and reproducibility checks.
- Graduate research: break thesis or paper work into small AI-assisted tasks with checkpoints.
- Industry R&D: evaluate papers before deciding whether to implement or benchmark new methods.
- Scientific writing: generate structured drafts while preserving human review for claims, citations, and statistics.