Can AI run a complete clinical research pipeline β from generating research ideas to running experiments to writing the paper? βTowards a Medical AI Scientistβ demonstrates a three-agent system that does exactly this across 19 clinical tasks and 6 data modalities. In the authorsβ benchmark, generated manuscripts scored near MICCAI level, and they report one system-generated manuscript was accepted at the non-archival ICAIS 2025 AI Scientist venue after peer review. The system outperformed GPT-5 in idea quality across the reported evaluation dimensions.
| *Source: arXiv 2603.28589 | Project Homepage* |
The Three-Agent Pipeline
βββββββββββββββββββββββ
β Idea Proposer β Generates research hypotheses from clinical
β β literature + real datasets
ββββββββββββ¬βββββββββββ
βΌ
βββββββββββββββββββββββ
β Experimental Executorβ Writes Python scripts, runs experiments,
β β handles errors, iterates until results
ββββββββββββ¬βββββββββββ
βΌ
βββββββββββββββββββββββ
β Manuscript Composer β Writes structured medical paper following
β β publication norms and ethics guidelines
βββββββββββββββββββββββ
Idea Proposer
Combines clinical domain knowledge with engineering reasoning. Unlike generic LLM brainstorming, it:
- Grounds ideas in real clinical literature
- Uses federated reasoning from both clinicians and engineers
- Generates testable hypotheses with clear experimental designs
- Enhances traceability and reduces AI βhallucinatedβ research directions
Experimental Executor
A general-purpose execution engine that:
- Connects to domain-specific medical tool chains
- Handles heterogeneous clinical data formats
- Self-debugs through iterative error correction
- Achieves high end-to-end execution success rates
Manuscript Composer
Follows structured medical writing conventions:
- Embeds ethics review mechanisms
- Outputs papers following structured medical writing conventions with ethics/data-use checks
- Directly produces initial drafts suitable for journal submission
Med-AI Bench: The Benchmark
| Dimension | Coverage |
|---|---|
| Clinical tasks | 19 tasks across clinical research |
| Data modalities | 6 types: medical images, video, EHR, physiological signals, text, multimodal |
| Cases | 171 benchmark cases |
| Evaluation | Double-blind human expert review + automated assessment |
Results
| Metric | Performance |
|---|---|
| Idea quality | Authors report it outperforms GPT-5 across 6 dimensions: novelty, maturity, ethicality, generalizability, utility, interpretability |
| Execution success | High end-to-end success rate with iterative debugging |
| Paper quality | In diabetic retinopathy evaluation, scored near MICCAI and above ISBI/BIBM baselines |
| Peer review | Authors report one manuscript accepted at non-archival ICAIS 2025 AI Scientist venue |
Three Modes for Different Research Needs
| Mode | What It Does | Best For |
|---|---|---|
| Paper-based Reproduction | Faithfully re-implements target papers or specified hypotheses with validation checks | Verifying results, learning methodology |
| Literature-inspired Innovation | Finds research gaps in existing literature and generates original approaches | Identifying novel directions |
| Task-driven Exploration | From a single user question, auto-discovers literature, plans experiments, runs them, writes up results | End-to-end research from scratch |
Why This Matters Beyond Medicine
The pattern β Propose β Execute β Write β generalizes to any empirical research domain:
- Computer science β Generate, implement, and write up ML experiments
- Social science β Design surveys, analyze data, draft papers
- Engineering β Propose designs, simulate, document results
The medical domain is just the hardest test case because it requires domain expertise, ethical compliance, reproducibility, and adherence to strict reporting standards.
Real-World Use Cases
- Clinical researchers β Accelerate the tedious parts of research (literature review, experiment scripting, first draft) while keeping human judgment for interpretation and ethics.
- Medical institutions β Screen research ideas at scale before committing wet-lab resources.
- Pharmaceutical R&D β Generate and evaluate hypotheses for drug repurposing or biomarker discovery.
- Research education β Demonstrate the full research pipeline to medical students with working examples.
How LearnAI Team Could Use This
- AI research ethics case study β When AI can write papers that pass peer review, what does authorship mean? What are the ethical boundaries?
- Automated science curriculum β Use as an example of the βAI scientistβ trend: where AI handles execution while humans guide strategy.
- Benchmark design exercise β Med-AI Bench is a model for how to evaluate AI research agents β students can design similar benchmarks for their own domains.
Links
- Paper: arXiv 2603.28589
- Homepage: Med-AI Scientist
- From: CUHK AIM Group