Andrej Karpathy recently shifted his "token throughput" from manipulating code to manipulating knowledge. His approach: feed raw sources (articles, papers, repos, images) into a folder, and let an LLM incrementally compile a structured, interlinked markdown wiki with summaries, concept pages, cross-references, and health checks. His current research wiki: ~100 articles, ~400,000 words, longer than most PhD dissertations, built without typing a single word.
*Source: Karpathy, LLM Knowledge Bases (X post) | Karpathy, Farzapedia (X post) | llm-wiki gist | VentureBeat coverage*
Why Not RAG?
Traditional RAG (Retrieval-Augmented Generation) rediscovers information every time you ask a question. Karpathy's approach is different: the LLM incrementally builds and maintains a persistent wiki that compounds over time. The knowledge isn't just retrieved; it's organized, cross-linked, and maintained.
Traditional RAG:
```
Question → Search docs → Stuff context → Answer → (forgotten)
```
LLM Knowledge Base:
```
Sources → LLM compiles wiki → Wiki grows → Query wiki → Answer
              ↑                                           │
              └──── discoveries filed back ───────────────┘
```
The Three-Layer Architecture
```
┌──────────────────────────────────────────────────┐
│ Layer 3: Schema (CLAUDE.md)                      │
│   Rules, conventions, structure, workflows       │
│   YOU control this: the "editor-in-chief"        │
├──────────────────────────────────────────────────┤
│ Layer 2: The Wiki (wiki/)                        │
│   LLM-generated .md files:                       │
│   summaries, concept pages, entity pages,        │
│   cross-references, index.md, log.md             │
│   LLM OWNS this layer: creates & updates         │
├──────────────────────────────────────────────────┤
│ Layer 1: Raw Sources (raw/)                      │
│   Immutable: articles, papers, repos, datasets,  │
│   images, notes. LLM reads but never modifies    │
└──────────────────────────────────────────────────┘
```
Essential Files
| File | Purpose |
|---|---|
| index.md | Content catalog: every wiki page with summary + metadata, organized by category. Updated on each ingest. |
| log.md | Append-only chronological record of ingests, queries, and lint passes. Enables timeline tracking. |
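A minimal sketch of what entries in these two files could look like; the exact columns and log format are whatever your CLAUDE.md schema defines, and the page names below are invented:

```markdown
<!-- wiki/index.md: one row per page -->
| Page | Summary | Tags |
|---|---|---|
| papers/attention-survey.md | Survey of attention variants since 2017 | attention, surveys |

<!-- wiki/log.md: append-only, newest last -->
- 2025-01-15 INGEST raw/attention-survey.pdf → papers/attention-survey.md
- 2025-01-20 LINT flagged 1 stale claim in tools/agent-frameworks.md
```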
Four Core Operations
1. Ingest: Drop Sources, LLM Compiles
Drop new sources into raw/. The LLM reads them, writes summaries, updates the index, refreshes relevant pages across the wiki, and logs the entry.
```
raw/new-paper.pdf → LLM reads →
  ├── writes wiki/papers/paper-summary.md
  ├── updates wiki/index.md
  ├── cross-links to related concept pages
  └── appends to wiki/log.md
```
2. Query: Ask Against the Wiki
Once the wiki is large enough, ask complex questions. The LLM synthesizes answers with citations, and files valuable discoveries back as new wiki pages.
3. Lint: Health Checks
Periodically run LLM "health checks" (a sample report sketch follows this list) to find:
- Contradictions between pages
- Stale claims needing updates
- Orphan pages with no cross-references
- Missing data gaps worth filling
- Interesting connections for new article candidates
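A lint pass is just a prompt plus a report the LLM writes back into the wiki. A hypothetical report entry (all page names invented) might look like:

```markdown
## Lint pass: 2025-01-20
- CONTRADICTION: concepts/scaling-laws.md vs papers/new-results.md on data scaling
- STALE: tools/agent-frameworks.md cites a version two releases old
- ORPHAN: notes/misc-idea.md has no inbound links
- GAP: no page yet connects themes/evaluation.md to methods/benchmarking.md
```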
4. Maintain: Continuous Enhancement
The LLM handles the tedious bookkeeping humans hate: updating cross-references, noting contradictions, maintaining consistency. It "doesn't get bored, doesn't forget to update a cross-reference, and can touch 15 files in one pass."
The 5 Design Principles
Karpathy emphasized these when praising Farzapedia (a personal Wikipedia built from 2,500 diary entries):
| Principle | What It Means |
|---|---|
| 1. Explicit | The memory artifact is visible and navigable. You can see exactly what the AI knows and doesn't know. No hidden embeddings. |
| 2. Yours | Data lives on your local machine. Not in some AI provider's system. |
| 3. File-over-app | Memory is simple markdown files. Not locked in any app. Survives any tool. |
| 4. BYOAI | Plug in any AI: Claude, ChatGPT, Codex, OpenCode, local models. The wiki is the interface, not the model. |
| 5. Inspectable | You can audit, edit, and correct the knowledge. The AI is a librarian, not an oracle. |
"This approach puts *you* in full control. The data is yours. In universal formats. Explicit and inspectable. Use whatever AI you want over it."
Real-World Use Cases
Farzapedia: Personal Wiki
Developer Farza fed 2,500 entries from his diary, Apple Notes, and iMessage conversations into this workflow. Result: 400 detailed articles covering friends, startups, research areas, and favorite anime, forming a personal Wikipedia with backlinks and cross-references.
The key insight: this isn't just note organization. The LLM creates synthesized knowledge, connecting dots across sources that you'd never connect manually.
Buffett Letters Knowledge Base: Domain Wiki
A different flavor of the same pattern: Warren Buffett Shareholder Letters Knowledge Base organizes 70 years of Berkshire Hathaway shareholder letters into a structured, queryable knowledge base.
| Metric | Count |
|---|---|
| Shareholder letters | 98 |
| Core investment concepts | 49 |
| Company case studies | 61 |
| Key investor profiles | 7 |
| Original passages | 4,726+ |
The site structures content across four views: letters overview, core concept interpretations, company case studies, and key investor profiles. Each passage is cross-referenced by year, theme, and company, turning 70 years of scattered wisdom into a single queryable artifact.
Why this matters: This is what a domain-specific knowledge base looks like done well. Unlike Farzapedia (personal), this shows the pattern applied to a public domain corpus. The same architecture works for:
- Academic research papers (your field's key authors across decades)
- Legal case archives (precedents organized by topic)
- Historical documents (diaries, letters, memos)
- Technical documentation (API specs, RFCs, design docs)
The endgame: someone noted this could "distill a Buffett financial advisor" from the knowledge base, i.e., fine-tune an LLM on the structured corpus. This is exactly what Karpathy hinted at with synthetic data generation from your knowledge base.
TimYang's Practical Implementation
TimYang spent half a day implementing Karpathy's approach and shared practical findings:
Minimal Architecture (No Embeddings Needed)
```
wiki/
├── index.md            ← one line per doc: Path | Summary | Tags
└── document.summary.md ← LLM-generated summaries from source
```
Just two layers of plain markdown. No vector database, no embedding tools. At <1,000 files, pure text compilation works fine.
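For example, a single index.md row in the Path | Summary | Tags shape he describes (the entry itself is invented):

```markdown
raw/2024-retrieval-survey.pdf | Survey of retrieval methods for LLMs, 2020-2024 | retrieval, surveys
```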
Workflow
```
Data in: Index source doc → Write to index.md → Build summary
Search:  Query topic → LLM loads index.md → Gets file list
         → Reads summaries → Decides if needs full source → Answer
```
Results
- Imported papers from several research domains
- With fewer than 10 keyword hits, the LLM cites specific viewpoints from the sources
- Output is logically reorganized rather than simple grep results; usually 3-4 core points
- Surprise: output included content from files not yet loaded (the info existed in Claude Code's memory)
What He Dropped
Claude Code suggested adding Tag/Topic indexes; TimYang dropped them: too sparse at small scale, too complex to maintain. Keep it simple.
Obsidian Skills for Knowledge Base Building
A Bilibili guide recommends 9 Obsidian skills + 2 plugins for building this workflow:
Skills
| Skill | What It Does |
|---|---|
| obsidian-cli | CLI interface for vault operations |
| defuddle | Clean web page extraction to markdown |
| obsidian-bases | Database-like views of notes |
| obsidian-markdown | Obsidian-flavored markdown support |
| canvas | Visual canvas for mind maps |
| mermaid | Diagram generation |
| excalidraw | Whiteboard-style drawings |
| tutor | Learning and knowledge decomposition |
| scholar | Academic research version of tutor |
Plugins
- Claudian: Claude Code integrated as a sidebar in Obsidian
- Agent Client Plugin: connect more coding agents to Obsidian
The Bigger Vision
One commentator extended Karpathy's thinking to its logical conclusion:
- Synthetic data + finetuning: as the knowledge base grows, generate training data from it and finetune the LLM to truly memorize the content, not just reference it through context
- Wiki as product: AI-maintained personal knowledge systems could become a new product category, not just scripts
- Automated research: for every frontier question, an LLM team could build a complete temporary wiki, iterate and review, then output a polished report, far beyond a simple `.decode()` output
"You almost never need to manually write or edit the wiki; that's the LLM's job. Not just scripts; this has the potential to become a genuinely new product."
How to Start Building Yours
```bash
# 1. Create the structure
mkdir -p my-wiki/raw my-wiki/wiki
echo "# Index" > my-wiki/wiki/index.md
echo "# Log" > my-wiki/wiki/log.md

# 2. Add your schema (CLAUDE.md or similar)
#    Define: categories, page format, cross-link rules, lint schedule

# 3. Drop sources into raw/
cp paper.pdf article.md notes.txt my-wiki/raw/

# 4. Open in Obsidian + start Claude Code
cd my-wiki && claude
> "Ingest all sources in raw/, create wiki pages, update index"

# 5. Query
> "What are the key disagreements across my sources on [topic]?"

# 6. Lint
> "Health check: find contradictions, stale claims, missing links"
```
Tools
- Obsidian: frontend for browsing, graph view for connections
- Obsidian Web Clipper: convert web articles to markdown
- Marp: markdown-based presentations from wiki content
- Dataview plugin: dynamic tables from frontmatter
- Git: version control for the entire wiki (see the snippet below)
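For the Git item, one plausible habit is to commit after every ingest and lint pass so each wiki state stays recoverable; the commands below are standard Git, and the commit-message convention is just a suggestion:

```bash
cd my-wiki
git init                                # once, at setup
git add -A
git commit -m "ingest: new-paper.pdf"   # after each ingest or lint pass
git log --oneline -- wiki/index.md      # watch how the index evolved
```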
Case Study: LearnAI Wiki vs Karpathy's Approach
The LearnAI Doc wiki was built by processing 460+ screenshots from Weibo/social media into 96 structured wiki entries with cross-links, cover images, and Obsidian notes, using Claude Code's /mywiki pipeline. It's a working knowledge base, but comparing it to Karpathy's architecture reveals what's missing and how to level up.
What's Working
```
Screenshots (462) → Claude reads → Research → Wiki entry (96)
                                            → Obsidian note (27)
                                            → Cross-links → Push
```
- Strong ingest pipeline: screenshot → research → structured entry is fast and consistent
- Cross-linking: entries reference each other, building a web of knowledge
- Public wiki: shareable, browsable, useful for teaching
- Dual output: wiki for depth, Obsidian notes for personal quick reference
5 Gaps to Close
| Gap | Current State | Karpathyβs Approach | Fix |
|---|---|---|---|
| No query layer | Can browse entries, can't ask "synthesize everything I know about X" | LLM queries wiki, synthesizes across all entries | Build index.md + query workflow |
| No master index | 96 entries with no machine-readable catalog | index.md: every page with summary + tags in one file | Generate index.md from frontmatter |
| No lint/health checks | Entries written once, rarely updated | Periodic checks for contradictions, stale info, gaps | Monthly lint pass with Claude |
| Raw sources not archived | Screenshots in flat folder, no metadata | raw/ directory preserving originals with metadata | Organize raw/ by date + topic |
| Obsidian notes too thin | 27 notes (vs 96 entries); just pointers to wiki | Wiki IS the queryable knowledge base | Make Obsidian the primary KB, wiki the published view |
The Biggest Missing Piece: Cross-Entry Synthesis
96 entries about agent design, harness engineering, academic tools, Obsidian workflows, and more, but no way to ask: "What are all the agent design patterns I've collected across every entry?" Each entry is an island. Karpathy's approach makes the whole wiki queryable as one knowledge graph.
How to Upgrade: 4-Step Plan
Step 1: Generate index.md. Scan all _wiki/*.md frontmatter and create a master index with title, category, tags, and a one-line summary per entry. This gives the LLM a map of the entire knowledge base.
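A throwaway bash sketch of Step 1, assuming each entry's YAML frontmatter has one-line `title:` and `tags:` fields (those field names, and the _wiki/ path, are assumptions; a Claude Code prompt can do the same job):

```bash
#!/usr/bin/env bash
# Build a crude master index from frontmatter (assumed one-line fields).
{
  echo "# Index"
  echo "| Page | Title | Tags |"
  echo "|---|---|---|"
  for f in _wiki/*.md; do
    # first "title:" / "tags:" line in each file, value after the colon
    title=$(awk -F': *' '/^title:/ {print $2; exit}' "$f")
    tags=$(awk -F': *' '/^tags:/ {print $2; exit}' "$f")
    echo "| $f | $title | $tags |"
  done
} > index.md
```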
Step 2: Add query workflow. A new Claude Code skill: load index.md → find relevant entries → read them → synthesize a cross-entry answer with citations. Now you can ask "what do I know about X?" across all 96 entries.
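A sketch of what that skill's instructions could say, written as a Claude Code skill file (the path, frontmatter fields, and wording here are assumptions to adapt, not a published spec):

```markdown
<!-- .claude/skills/query-wiki/SKILL.md (hypothetical) -->
---
name: query-wiki
description: Answer questions by synthesizing across all wiki entries
---
1. Read index.md and shortlist entries relevant to the question.
2. Read the shortlisted entries in full.
3. Synthesize one answer, citing each entry used by its path.
4. If the synthesis is itself a useful page, propose adding it to the wiki.
```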
Step 3: Monthly lint. Run a health check: which entries reference tools that have changed? Which cross-links are missing? Where are there contradictions? Which topics have gaps worth filling?
Step 4: Obsidian as primary KB. Flip the relationship: the Obsidian vault becomes the rich, queryable knowledge base (like Karpathy's wiki/), and the Jekyll site becomes the published view, a curated subset rather than the source of truth.
For Researchers: LLM Knowledge Bases as Literature Review Infrastructure
This approach has particular potential for academic research workflows:
The Research Knowledge Base Pattern
```
┌──────────────────────────────────────────────────────────────┐
│                   Research Knowledge Base                    │
│                                                              │
│  raw/                        wiki/                           │
│  ├── papers/                 ├── index.md                    │
│  │   ├── paper1.pdf          ├── themes/                     │
│  │   ├── paper2.pdf          │   ├── theme-A.md              │
│  │   └── paper3.pdf          │   └── theme-B.md              │
│  ├── notes/                  ├── methods/                    │
│  │   ├── seminar-notes.md    │   ├── method-X.md             │
│  │   └── advisor-feedback.md │   └── method-Y.md             │
│  └── data/                   ├── gaps/                       │
│      ├── dataset1/           │   ├── gap-1.md                │
│      └── results.csv         │   └── gap-2.md                │
│                              ├── contradictions.md           │
│                              ├── open-questions.md           │
│                              └── log.md                      │
└──────────────────────────────────────────────────────────────┘
```
What This Enables for Research
| Capability | How It Works | Research Value |
|---|---|---|
| Living literature review | LLM maintains cross-linked summaries of every paper you read | Always up-to-date, never start from scratch |
| Gap detection | Lint passes identify what's missing across your sources | Research questions emerge from the gaps |
| Contradiction surfacing | LLM flags where papers disagree | Controversy = opportunity for contribution |
| Method comparison | Concept pages comparing methodologies across papers | Quick reference during experimental design |
| Synthesis queries | "What do all my sources say about X?" with citations | Literature review sections write themselves |
| Advisor-ready reports | Query wiki → structured summary for weekly meetings | No more scrambling before advisor meetings |
vs Traditional Tools
| Aspect | Zotero + Manual Notes | LLM Knowledge Base |
|---|---|---|
| Organization | Manual tagging, folders | Auto-categorized, cross-linked |
| Synthesis | You read and write summaries | LLM synthesizes, you review |
| Gap detection | You notice gaps while reading | LLM systematically identifies gaps |
| Querying | Search by keywords | Ask complex questions across all sources |
| Maintenance | Manual, often neglected | LLM runs health checks |
| Cost | Free (your time) | API costs (~$5-20/month for active research) |
| Risk | None | LLM may hallucinate; always verify citations |
How LearnAI Team Could Use This
A practical checklist for the LearnAI research team to build a personal or team knowledge base following Karpathy's approach:
Phase 1: Foundation (Week 1)
- Set up Obsidian vault: install Obsidian, create the vault at a shared/synced location
- Install essential skills: obsidian-cli, defuddle, obsidian-markdown, obsidian-bases
- Create directory structure: raw/ for source materials, wiki/ for compiled knowledge
- Write schema (CLAUDE.md): define categories relevant to your research area, page format rules, cross-link conventions
- Create index.md: start with an empty template (see the sketch after this list); it will grow automatically
- Create log.md: append-only record of what you ingest
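A possible starting template for the two bookkeeping files; the column names are placeholders to adapt to your schema:

```markdown
<!-- wiki/index.md -->
# Index
| Page | Summary | Tags |
|---|---|---|

<!-- wiki/log.md -->
# Log
<!-- append one line per ingest / query / lint, newest last -->
```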
Phase 2: Ingest Your Existing Knowledge (Weeks 2-3)
- Collect your sources: papers you've read, course notes, project docs, bookmarks
- Batch ingest papers: drop PDFs into raw/, have the LLM write summaries + concept pages
- Ingest course materials: lecture notes, slides, textbooks → wiki articles
- Import existing notes: anything in Notion, Google Docs, or Apple Notes → markdown in raw/
- Cross-link everything: the LLM updates index.md and adds backlinks between pages
Phase 3: Build Workflows (Week 4)
- Set up ingest workflow: define the process (new paper → raw/ → LLM compiles → wiki updated)
- Set up query workflow: how to ask questions across the wiki (load index → find entries → synthesize)
- Set up lint schedule: weekly or monthly health check for contradictions, gaps, stale info
- Connect to Obsidian graph view: visualize connections between concepts
- Install Claudian or Agent Client Plugin: Claude Code sidebar in Obsidian for live querying
Phase 4: Research-Specific Extensions (Ongoing)
- Literature review mode: for each new paper, the LLM adds to themes/ and methods/ and flags connections to existing knowledge
- Gap tracking: maintain a gaps/ directory with scored research opportunities (one possible format follows this list)
- Contradiction log: when papers disagree, document both sides with citations
- Weekly synthesis: generate a "what I learned this week" summary for advisor meetings
- Collaborate: share the wiki via Git, let team members ingest their own sources into shared categories
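For the gap-tracking item, one way to make "scored research opportunities" concrete is a small frontmatter block per gap; the fields and the 1-5 scale below are invented:

```markdown
<!-- wiki/gaps/gap-eval-metrics.md (hypothetical) -->
---
title: No agreed evaluation metric for X
score: 4   # 1-5, how promising as a research opportunity
sources: [papers/paper-a.md, papers/paper-b.md]
status: open
---
Both papers measure X differently and neither justifies the choice;
a systematic comparison could be a standalone contribution.
```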
Phase 5: Advanced (When Ready)
- Build query skill: a custom Claude Code skill that loads your index.md and answers research questions with citations
- Automate ingest: Zotero → auto-export new papers → raw/ → LLM processes overnight (a cron sketch follows this list)
- Synthetic data exploration: generate Q&A pairs from your wiki for fine-tuning experiments
- Publish curated subset: turn selected wiki pages into a public Jekyll/Hugo site (like LearnAI Doc)
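For the automated-ingest item, a cron-style sketch; the export path is an assumption about your Zotero setup, and `claude -p` (non-interactive print mode) should be verified against your installed CLI version:

```bash
#!/usr/bin/env bash
# nightly-ingest.sh, e.g. cron: 0 2 * * * ~/my-wiki/nightly-ingest.sh
set -euo pipefail
cd ~/my-wiki

# Copy anything Zotero exported since the last run (path is assumed)
rsync -a --ignore-existing ~/zotero-export/ raw/papers/

# Run Claude Code once, non-interactively, on a single prompt
claude -p "Ingest any files in raw/papers/ missing from wiki/index.md, following CLAUDE.md"

git add -A && git commit -m "nightly ingest $(date +%F)" || true
```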
Minimum Viable Knowledge Base (30 minutes)
If you just want to try it:
```bash
mkdir -p my-research-kb/raw my-research-kb/wiki
cd my-research-kb

# Create minimal schema
cat > CLAUDE.md << 'EOF'
# Research Knowledge Base

When ingesting a source:
1. Write a summary in wiki/{category}/{title}.md
2. Update wiki/index.md with: path | one-line summary | tags
3. Cross-link to related existing pages
4. Append to wiki/log.md with date and source

Categories: methods, theories, tools, papers, gaps
Format: markdown with YAML frontmatter (title, date, tags, sources)
EOF

# Drop a paper and start
cp ~/Downloads/interesting-paper.pdf raw/
claude
> "Ingest raw/interesting-paper.pdf into the wiki following CLAUDE.md rules"
```
Links
- Karpathy's X post: LLM Knowledge Bases
- Farzapedia post: Personal Wikipedia example
- llm-wiki gist: Full workflow + schema
- VentureBeat: Karpathy's architecture bypasses RAG
- Bilibili guide: 9 Obsidian Skills for Knowledge Base