Towards a Medical AI Scientist: Full Auto Clinical Research from Idea to Paper

Can AI run a complete clinical research pipeline, from generating research ideas to running experiments to writing the paper? "Towards a Medical AI Scientist" presents a three-agent system designed to do this, evaluated across 19 clinical tasks and 6 data modalities. In the authors' benchmark, generated manuscripts scored near MICCAI level, and they report that one system-generated manuscript was accepted at the non-archival ICAIS 2025 AI Scientist venue after peer review. The authors also report that the system outperformed GPT-5 in idea quality across the evaluated dimensions.

*Source: arXiv 2603.28589 Project Homepage*

The Three-Agent Pipeline

┌──────────────────────┐
│   Idea Proposer      │  Generates research hypotheses from clinical
│                      │  literature + real datasets
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Experimental Executor│  Writes Python scripts, runs experiments,
│                      │  handles errors, iterates until results
└──────────┬───────────┘
           ▼
┌──────────────────────┐
│ Manuscript Composer  │  Writes structured medical paper following
│                      │  publication norms and ethics guidelines
└──────────────────────┘
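
The hand-off between the three stages can be pictured as a simple function chain. The sketch below is illustrative only and is not the authors' implementation: the names `Idea`, `Results`, `run_pipeline` and the three stub functions are hypothetical, and each stub stands in for an LLM-backed agent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Idea:
    hypothesis: str
    dataset: str


@dataclass
class Results:
    idea: Idea
    metrics: dict[str, float]


# Each stage is a plain callable so a real agent can be swapped in later.
Proposer = Callable[[str], Idea]
Executor = Callable[[Idea], Results]
Composer = Callable[[Results], str]


def run_pipeline(topic: str, propose: Proposer, execute: Executor, compose: Composer) -> str:
    """Chain the three stages: propose -> execute -> compose."""
    idea = propose(topic)
    results = execute(idea)
    return compose(results)


# Stub stages (hypothetical behaviour, placeholder values only).
def stub_propose(topic: str) -> Idea:
    return Idea(hypothesis=f"A baseline model is sufficient for {topic}",
                dataset="placeholder-dataset")


def stub_execute(idea: Idea) -> Results:
    return Results(idea=idea, metrics={"placeholder_metric": 0.0})


def stub_compose(results: Results) -> str:
    lines = [f"Hypothesis: {results.idea.hypothesis}",
             f"Dataset: {results.idea.dataset}"]
    lines += [f"{name}: {value}" for name, value in results.metrics.items()]
    return "\n".join(lines)


draft = run_pipeline("diabetic retinopathy grading",
                     stub_propose, stub_execute, stub_compose)
print(draft)
```

Keeping each stage behind a plain callable means any one agent can be replaced or tested in isolation without touching the other two.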

Idea Proposer

Combines clinical domain knowledge with engineering reasoning. Unlike generic LLM brainstorming, it:

  • Grounds ideas in real clinical literature
  • Uses federated reasoning from both clinicians and engineers
  • Generates testable hypotheses with clear experimental designs
  • Enhances traceability and reduces "hallucinated" research directions

Experimental Executor

A general-purpose execution engine that:

  • Connects to domain-specific medical tool chains
  • Handles heterogeneous clinical data formats
  • Self-debugs through iterative error correction
  • Achieves high end-to-end execution success rates
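
Self-debugging through iterative error correction can be sketched as a run-and-repair loop. This is a minimal illustration under stated assumptions, not the paper's executor: `run_script`, `execute_with_repair` and `toy_repair` are hypothetical names, and the `repair` callback stands in for an LLM call that rewrites the script from its traceback.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_script(source: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Write the script to a temporary file and run it in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "experiment.py"
        path.write_text(source)
        return subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout,
        )


def execute_with_repair(
    source: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Run the script; on failure, pass source and stderr to `repair` and retry.

    Returns (stdout, attempts_used), with stdout None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        proc = run_script(source)
        if proc.returncode == 0:
            return proc.stdout, attempt
        if attempt < max_attempts:
            source = repair(source, proc.stderr)
    return None, max_attempts


# Toy repair rule standing in for an LLM: define the missing variable.
def toy_repair(source: str, stderr: str) -> str:
    if "NameError" in stderr:
        return "total = 3\n" + source
    return source


output, attempts = execute_with_repair("print(total)", toy_repair)
print(output, attempts)
```

The attempt cap matters in practice: without it, a script the repair step cannot fix would loop indefinitely. A fuller version would also catch `subprocess.TimeoutExpired` and treat it as a failed attempt.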

Manuscript Composer

Drafts the paper according to medical publication norms:

  • Embeds ethics review mechanisms
  • Applies structured medical writing conventions, with ethics and data-use checks
  • Produces initial drafts intended to be suitable for journal submission
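
One simple way to picture a structural check on a draft is a pass that looks for required headings. The sketch below is an assumption-laden illustration, not the system's actual mechanism: the section list follows the common IMRaD layout, and the statement names `Ethics Statement` and `Data Availability` are hypothetical placeholders for whatever a target journal requires.

```python
# Hypothetical requirements; a real journal's checklist would replace these.
REQUIRED_SECTIONS = ("Abstract", "Introduction", "Methods", "Results", "Discussion")
REQUIRED_STATEMENTS = ("Ethics Statement", "Data Availability")


def check_draft(draft: str) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    # Treat every non-empty line as a candidate heading (case-insensitive).
    headings = {line.strip().lower() for line in draft.splitlines() if line.strip()}
    problems = []
    for name in REQUIRED_SECTIONS + REQUIRED_STATEMENTS:
        if name.lower() not in headings:
            problems.append(f"missing section: {name}")
    return problems


draft = "\n".join([
    "Abstract", "text",
    "Introduction", "text",
    "Methods", "text",
    "Results", "text",
    "Discussion", "text",
    "Ethics Statement", "text",
])
problems = check_draft(draft)
print(problems)
```

A check like this only confirms that the sections exist. Whether an ethics statement is adequate still needs human review.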

Med-AI Bench: The Benchmark

| Dimension | Coverage |
| --- | --- |
| Clinical tasks | 19 tasks across clinical research |
| Data modalities | 6 types: medical images, video, EHR, physiological signals, text, multimodal |
| Cases | 171 benchmark cases |
| Evaluation | Double-blind human expert review + automated assessment |

Results

| Metric | Performance |
| --- | --- |
| Idea quality | Authors report it outperforms GPT-5 across 6 dimensions: novelty, maturity, ethicality, generalizability, utility, interpretability |
| Execution success | High end-to-end success rate with iterative debugging |
| Paper quality | In diabetic retinopathy evaluation, scored near MICCAI and above ISBI/BIBM baselines |
| Peer review | Authors report one manuscript accepted at non-archival ICAIS 2025 AI Scientist venue |

Three Modes for Different Research Needs

| Mode | What It Does | Best For |
| --- | --- | --- |
| Paper-based Reproduction | Faithfully re-implements target papers or specified hypotheses with validation checks | Verifying results, learning methodology |
| Literature-inspired Innovation | Finds research gaps in existing literature and generates original approaches | Identifying novel directions |
| Task-driven Exploration | From a single user question, auto-discovers literature, plans experiments, runs them, writes up results | End-to-end research from scratch |
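
The three modes differ mainly in what the user supplies and which stages run. As a rough sketch (the `Mode`, `Request` and `plan_stages` names are hypothetical and the stage labels simply paraphrase the descriptions above, not the system's real interface), a dispatcher might look like this:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    REPRODUCTION = "paper-based reproduction"
    INNOVATION = "literature-inspired innovation"
    EXPLORATION = "task-driven exploration"


@dataclass
class Request:
    mode: Mode
    target_paper: Optional[str] = None   # needed for REPRODUCTION
    literature: tuple[str, ...] = ()     # needed for INNOVATION
    question: Optional[str] = None       # needed for EXPLORATION


def plan_stages(req: Request) -> list[str]:
    """Validate the inputs for the chosen mode and return its ordered stages."""
    if req.mode is Mode.REPRODUCTION:
        if not req.target_paper:
            raise ValueError("reproduction mode needs a target paper")
        return ["re-implement target", "run experiments", "run validation checks"]
    if req.mode is Mode.INNOVATION:
        if not req.literature:
            raise ValueError("innovation mode needs source literature")
        return ["find research gaps", "generate original approach"]
    if not req.question:
        raise ValueError("exploration mode needs a user question")
    return ["discover literature", "plan experiments", "run experiments", "write up results"]


stages = plan_stages(Request(Mode.EXPLORATION, question="Which features predict readmission?"))
print(stages)
```

Validating inputs up front keeps a missing paper or question from surfacing as a confusing failure several stages later.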

Why This Matters Beyond Medicine

The pattern (Propose → Execute → Write) could generalize to other empirical research domains:

  • Computer science: Generate, implement, and write up ML experiments
  • Social science: Design surveys, analyze data, draft papers
  • Engineering: Propose designs, simulate, document results

The medical domain is a particularly demanding test case because it requires domain expertise, ethical compliance, reproducibility, and adherence to strict reporting standards.

Real-World Use Cases

  • Clinical researchers: Accelerate the tedious parts of research (literature review, experiment scripting, first draft) while keeping human judgment for interpretation and ethics.
  • Medical institutions: Screen research ideas at scale before committing wet-lab resources.
  • Pharmaceutical R&D: Generate and evaluate hypotheses for drug repurposing or biomarker discovery.
  • Research education: Demonstrate the full research pipeline to medical students with working examples.

How LearnAI Team Could Use This

  • AI research ethics case study: When AI can write papers that pass peer review, what does authorship mean? What are the ethical boundaries?
  • Automated science curriculum: Use as an example of the "AI scientist" trend, where AI handles execution while humans guide strategy.
  • Benchmark design exercise: Med-AI Bench is a model for how to evaluate AI research agents; students can design similar benchmarks for their own domains.