Paper2Code: Turn ArXiv Papers into Citation-Anchored Code

Paper2Code is a Claude Code skill that transforms any ArXiv paper into a runnable, citation-anchored Python implementation. Every line of generated code traces back to the exact paper section it implements, and any detail the paper skips is explicitly flagged rather than silently invented.

*Sources: GitHub - PrathamLearnsToCode/paper2code | MCP Market Listing | Author’s X Post*

The Problem: Paper Reproduction is Painful

Anyone who’s tried to reproduce a research paper knows the pain: key hyperparameters are buried in appendices or omitted entirely. You spend hours “guessing” what the authors actually did. Traditional LLM code generation makes this worse by confidently filling in the gaps without telling you.

Paper2Code solves this with a core philosophy of honesty over completeness.

Three Core Mechanisms

| Mechanism | What It Does | Example |
|---|---|---|
| Citation Anchoring | Every code line references its paper section | # §3.2, Eq. 2: softmax(QK^T / √d_k) |
| Ambiguity Auditing | Classifies each detail as specified / partial / unspecified | [UNSPECIFIED] Paper omits epsilon for LayerNorm |
| Transparent Defaults | Uses reasonable defaults but marks them clearly | eps=1e-6 # [UNSPECIFIED] Alternatives: 1e-5, 1e-8 |

Citation Anchoring in Action

# §3.2: "We apply layer normalization before each sub-layer"
class TransformerBlock(nn.Module):
    def forward(self, x):
        # §3.2, Eq. 2: attention_weights = softmax(QK^T / sqrt(d_k))
        attn_out = self.attention(self.norm1(x))
        x = x + attn_out  # §3.2: residual connection
        return x
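
One way to spot-check citation anchoring is to scan generated source for `§` references. The sketch below is a hypothetical helper, not part of Paper2Code: the `find_anchors` name and the anchor pattern (a `§` followed by a section number, optionally `, Eq. N`) are assumptions inferred from the comment style shown above. It splits on the first `#` naively, so a `#` inside a string literal would confuse it.

```python
import re
from typing import List, Optional, Tuple

# Assumed anchor format: "§3.2", optionally followed by ", Eq. 2".
ANCHOR_RE = re.compile(r"§(\d+(?:\.\d+)*)(?:,\s*Eq\.\s*(\d+))?")


def find_anchors(source: str) -> List[Tuple[int, str, Optional[str]]]:
    """Return (line_number, section, equation) for each anchor found in a comment."""
    anchors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        _, sep, comment = line.partition("#")
        if not sep:
            continue  # this line has no comment
        for match in ANCHOR_RE.finditer(comment):
            anchors.append((lineno, match.group(1), match.group(2)))
    return anchors


# Illustrative input, modelled on the excerpt above.
SAMPLE = '''\
# §3.2: "We apply layer normalization before each sub-layer"
class TransformerBlock(nn.Module):
    def forward(self, x):
        # §3.2, Eq. 2: attention_weights = softmax(QK^T / sqrt(d_k))
        attn_out = self.attention(self.norm1(x))
        x = x + attn_out  # §3.2: residual connection
'''

if __name__ == "__main__":
    for lineno, section, equation in find_anchors(SAMPLE):
        suffix = f", Eq. {equation}" if equation else ""
        print(f"line {lineno}: §{section}{suffix}")
```

Lines that carry logic but no anchor (line 5 in the sample) are the ones worth reading against the paper by hand.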

Ambiguity Audit Labels

  • [SPECIFIED]: Paper defines this explicitly
  • [PARTIALLY_SPECIFIED]: Paper is ambiguous; quote and reasoning included
  • [UNSPECIFIED]: Paper omits this; code uses reasonable default with alternatives listed
  • [ASSUMPTION]: Inferred from context with explanation
  • [FROM_OFFICIAL_CODE]: Taken from authors’ reference implementation
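
To make the labels concrete, here is a minimal sketch of how an audit entry could be modelled and rendered as a flagged default. The `Label`, `AuditEntry`, and `needs_review` names are hypothetical and not taken from Paper2Code; only the five label strings and the `eps` example come from the text above. The `num_layers` entry is an illustrative value.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    SPECIFIED = "SPECIFIED"
    PARTIALLY_SPECIFIED = "PARTIALLY_SPECIFIED"
    UNSPECIFIED = "UNSPECIFIED"
    ASSUMPTION = "ASSUMPTION"
    FROM_OFFICIAL_CODE = "FROM_OFFICIAL_CODE"


@dataclass
class AuditEntry:
    name: str
    value: str  # kept as text so it renders exactly as written
    label: Label
    note: str = ""
    alternatives: List[str] = field(default_factory=list)

    def as_comment(self) -> str:
        """Render the entry as an assignment followed by a flagged comment."""
        comment = f"[{self.label.value}]"
        if self.note:
            comment += f" {self.note}"
        if self.alternatives:
            comment += " Alternatives: " + ", ".join(self.alternatives)
        return f"{self.name}={self.value}  # {comment}"


def needs_review(entries: List[AuditEntry]) -> List[AuditEntry]:
    """Entries not backed directly by the paper or the official code."""
    trusted = {Label.SPECIFIED, Label.FROM_OFFICIAL_CODE}
    return [entry for entry in entries if entry.label not in trusted]


eps = AuditEntry("eps", "1e-6", Label.UNSPECIFIED, alternatives=["1e-5", "1e-8"])
layers = AuditEntry("num_layers", "6", Label.SPECIFIED, note="illustrative value")

if __name__ == "__main__":
    print(eps.as_comment())
    print([entry.name for entry in needs_review([eps, layers])])
```

Filtering on the label gives a short list of the choices a reader should verify before trusting a reproduction.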

Installation & Usage

Install as a Claude Code skill via npx:

npx skills add PrathamLearnsToCode/paper2code/skills/paper2code

Then use with a simple slash command:

# Basic: just an ArXiv URL or ID
/paper2code https://arxiv.org/abs/1706.03762
/paper2code 1706.03762

# Specify framework
/paper2code https://arxiv.org/abs/2006.11239 --framework jax

# Full mode: includes training and data pipeline
/paper2code 2106.09685 --mode full

# Educational mode: extra comments, pedagogical notebook
/paper2code https://arxiv.org/abs/2010.11929 --mode educational
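
The command accepts either a full URL or a bare ID. How Paper2Code resolves these internally is not documented here; the sketch below only illustrates one way the two forms could be reduced to a single identifier. The `normalize_arxiv_ref` function is hypothetical, and it handles only new-style IDs (`YYMM.NNNNN`), not older ones such as `hep-th/9901001`.

```python
import re

# New-style ArXiv IDs: four digits, a dot, four or five digits, optional version suffix.
ARXIV_ID_RE = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


def normalize_arxiv_ref(ref: str) -> str:
    """Return the bare, version-less ArXiv ID for a URL or an ID string."""
    ref = ref.strip()
    # Drop an arxiv.org/abs/ or arxiv.org/pdf/ prefix if one is present.
    ref = re.sub(r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/", "", ref)
    # PDF links may carry a trailing ".pdf".
    ref = re.sub(r"\.pdf$", "", ref)
    match = ARXIV_ID_RE.match(ref)
    if match is None:
        raise ValueError(f"not a recognised ArXiv reference: {ref!r}")
    return match.group(1)


if __name__ == "__main__":
    for ref in ("https://arxiv.org/abs/1706.03762", "1706.03762", "2106.09685"):
        print(ref, "->", normalize_arxiv_ref(ref))
```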

Generated Project Structure

{paper_slug}/
├── README.md                  # Paper summary + quick-start
├── REPRODUCTION_NOTES.md      # Full ambiguity audit
├── requirements.txt           # Pinned dependencies
├── src/
│   ├── model.py              # Architecture  (§3.2 cited)
│   ├── loss.py               # Loss functions (Eq. refs)
│   ├── train.py              # Training loop  (§4.1 cited)
│   ├── data.py               # Dataset skeleton
│   ├── evaluate.py           # Metrics
│   └── utils.py              # Shared utilities
├── configs/
│   └── base.yaml             # All hyperparams (cited or flagged)
└── notebooks/
    └── walkthrough.ipynb     # CPU-runnable pedagogical notebook

The walkthrough.ipynb is especially useful: it maps “paper paragraph → corresponding code → shape check” in a closed loop, letting you verify each piece incrementally.
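
As an illustration of what one such loop might look like, the sketch below takes the scaled dot-product formula quoted earlier, softmax(QK^T / √d_k), applies the weights to V, and checks the resulting shapes. It is not taken from a generated notebook; the helper names and toy sizes are assumptions, and it uses plain Python lists so it runs on CPU with no dependencies.

```python
import math


def shape(matrix):
    """(rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Paper formula: softmax(QK^T / sqrt(d_k)) applied to V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy sizes chosen for illustration only.
seq_len, d_k, d_v = 3, 4, 2
q = [[0.1 * (i + j) for j in range(d_k)] for i in range(seq_len)]
k = [[0.2 * (i - j) for j in range(d_k)] for i in range(seq_len)]
v = [[float(i), float(i + 1)] for i in range(seq_len)]

out, weights = scaled_dot_product_attention(q, k, v)

# Shape check: one weight per (query, key) pair, one output row per query.
assert shape(weights) == (seq_len, seq_len)
assert shape(out) == (seq_len, d_v)
# Each row of attention weights is a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
```

Each assertion corresponds to a claim the paper makes implicitly, so a failure points at a misread equation and not at a training bug discovered hours later.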

Pipeline Under the Hood

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Paper Fetch │ ──▶ │   Parsing    │ ──▶ │  Ambiguity   │ ──▶ │    Code      │ ──▶ │ Walkthrough  │
│  (ArXiv URL) │     │  (sections,  │     │    Audit     │     │  Generation  │     │  Notebook    │
│              │     │  equations)  │     │              │     │              │     │              │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

What Paper2Code Won’t Do

  • Guarantee correctness: it faithfully implements what the paper says, even if the paper is wrong
  • Silently invent details: unspecified choices are always flagged
  • Download datasets: provides skeleton data loaders only
  • Reimplement standard components: if the paper says “standard transformer encoder,” it imports rather than rewrites

Who Should Use This

  • Researchers verifying whether a paper’s claims hold up in code
  • Algorithm engineers reproducing SOTA methods for their own projects
  • Students learning how papers translate into implementations
  • Reviewers checking if a paper’s described method is internally consistent

How the LearnAI Team Could Use This

  • Paper-to-code labs: have students generate implementations, then audit which details were specified versus inferred.
  • Research reproducibility demos: compare generated code against official repositories to teach implementation gaps.
  • Critical reading practice: use ambiguity labels to show where papers leave out operational details.
  • Course project scaffolding: help students bootstrap runnable baselines from assigned ArXiv papers.

Real-World Use Cases

  • Research engineers: quickly turn papers into inspectable prototype implementations.
  • ML teams: evaluate whether a new method is worth deeper reproduction work.
  • Peer reviewers: check whether a method description is complete enough to implement.
  • Graduate students: learn how equations, architecture descriptions, and hyperparameters map into code.

A separate academic project called PaperCoder (arXiv 2504.17192) also tackles paper-to-code generation using a multi-agent framework with planning, analysis, and generation stages. It achieves strong results on the PaperBench benchmark. While different from this Claude Code skill, both address the same fundamental reproducibility challenge.