How It Works
Pipeline overview
┌──────────────────────────────────────────────────────────────────┐
│ your-files/ │
│ *.py *.ts *.go *.md *.txt *.pdf *.docx *.png *.heic │
└──────────┬──────────────┬────────────┬──────────────┬────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ detect │ │ detect │ │ detect │ │ detect │
│ → code │ │ → document │ │ → paper │ │ → image │
└──────┬───────┘ └──────┬───────┘ └────┬─────┘ └──────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────┐ ┌──────────────┐
│ AST extract │ │ structural │ │ pypdf → │ │ hub nodes │
│ (tree-sitter│ │ (headings, │ │ structural│ │ (file refs) │
│ 18 languages│ │ links, defs)│ │ or hub │ │ │
└──────┬───────┘ └──────┬───────┘ └────┬─────┘ └──────┬───────┘
│ │ │ │
└────────────────┴──────┬───────┴───────────────┘
│
▼
┌─────────────────────┐
│ cross-reference │
│ code ↔ docs │
│ (mentions edges) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ build graph │
│ (NetworkX) │
└──────────┬──────────┘
│
▼
┌─────────────────────┐
│ cluster │
│ (Leiden/Louvain) │
└──────────┬──────────┘
│
▼
┌──────────────┴───────────────┐
│ export │
├──────────┬──────────┬────────┤
│graph.html│graph.json│report │
│ │wiki/ │vault/ │
└──────────┴──────────┴────────┘
Step 1 — Detect
Scans folder recursively. Classifies files by type. Respects .wikiignore.
detect(root)
│
├── skip: .git, node_modules, __pycache__, wiki-out/
├── skip: .env, *.pem, credentials (sensitive files)
│
├── .py .ts .go .rs .java ... → code
├── .md .txt .rst → document
├── .pdf → paper (or document if no academic signals)
├── .docx .xlsx → document (converted to markdown)
├── .png .jpg .heic .svg → image
│
└── output: {files: {code: [...], document: [...], paper: [...], image: [...]},
total_files, total_words}
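The detection pass above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the extension map, the skip lists, and the constant names are assumptions based on the tree above, and it leaves out .wikiignore handling, word counts, and the academic-signal check for PDFs.

```python
import os
from pathlib import Path

# Hypothetical lookup tables, taken from the tree above (the real lists are longer).
TYPE_BY_EXT = {
    ".py": "code", ".ts": "code", ".go": "code", ".rs": "code", ".java": "code",
    ".md": "document", ".txt": "document", ".rst": "document",
    ".docx": "document", ".xlsx": "document",
    ".pdf": "paper",
    ".png": "image", ".jpg": "image", ".heic": "image", ".svg": "image",
}
SKIP_DIRS = {".git", "node_modules", "__pycache__", "wiki-out"}
SENSITIVE_NAMES = {".env", "credentials"}
SENSITIVE_EXTS = {".pem"}


def detect(root):
    """Walk root and bucket files by type, skipping noisy and sensitive paths."""
    files = {"code": [], "document": [], "paper": [], "image": []}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            ext = path.suffix.lower()
            if name in SENSITIVE_NAMES or ext in SENSITIVE_EXTS:
                continue
            kind = TYPE_BY_EXT.get(ext)
            if kind:
                files[kind].append(str(path.relative_to(root)))
    total = sum(len(paths) for paths in files.values())
    return {"files": files, "total_files": total}
```

Pruning `dirnames` in place is what keeps the walk cheap on repositories with a large node_modules folder.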
Step 2 — Extract
Code extraction (AST)
source.py
│
├── tree-sitter parse → AST
│
├── walk AST nodes:
│ class Foo → node(id, label="Foo", file_type="code")
│ def bar() → node(id, label="bar()", file_type="code")
│ import X → edge(source→X, relation="imports")
│ class Foo(Base) → edge(Foo→Base, relation="inherits")
│
├── cross-file imports:
│ from mod import A → edge(A→A_in_mod, relation="imports_from")
│
└── output: {nodes: [...], edges: [...]}
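To make the walk above concrete, here is a small sketch for Python sources only. It swaps tree-sitter for the standard-library `ast` module so it runs without extra dependencies; the function name `extract_code` and the `module_name` id prefix are assumptions for illustration, and cross-file resolution is not attempted.

```python
import ast


def extract_code(source, module):
    """Return {nodes, edges} for classes, functions, imports and base classes."""
    nodes, edges = [], []
    for item in ast.walk(ast.parse(source)):
        if isinstance(item, ast.ClassDef):
            nodes.append({"id": f"{module}_{item.name}",
                          "label": item.name, "file_type": "code"})
            for base in item.bases:
                if isinstance(base, ast.Name):  # plain `class Foo(Base)` only
                    edges.append({"source": item.name, "target": base.id,
                                  "relation": "inherits"})
        elif isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
            nodes.append({"id": f"{module}_{item.name}",
                          "label": f"{item.name}()", "file_type": "code"})
        elif isinstance(item, ast.Import):
            for alias in item.names:
                edges.append({"source": module, "target": alias.name,
                              "relation": "imports"})
        elif isinstance(item, ast.ImportFrom):
            origin = item.module or "."  # `from . import x` has no module name
            for alias in item.names:
                edges.append({"source": module,
                              "target": f"{origin}.{alias.name}",
                              "relation": "imports_from"})
    return {"nodes": nodes, "edges": edges}
```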
Document extraction (structural)
document.md
│
├── # Heading 1 → section node (h1/h2 only)
├── ## Heading 2 → section node, edge(h1→h2, "contains")
│
├── - **Term**: desc → definition node, edge(doc→term, "defines")
│ (skips terms that contain special chars, exceed 40 chars, or sit inside code blocks)
│
├── [link](other.md) → edge(doc→other, "references")
│ (deduplicated, normalized paths)
│
└── cross-doc: same label in different files → edge("same_concept", INFERRED)
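The structural rules above map naturally onto a line-by-line scan. The sketch below is one possible reading, with assumed regexes and an assumed definition of "special chars" (anything other than word characters and spaces); the cross-doc `same_concept` pass is left out.

```python
import posixpath
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
HEADING = re.compile(r"^(#{1,2})\s+(.+?)\s*$")       # h1 and h2 only
DEFINITION = re.compile(r"^\s*-\s+\*\*(.+?)\*\*:")    # - **Term**: desc
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")         # [text](target)


def extract_doc(text, doc_id):
    nodes, edges, seen_links = [], [], set()
    current_h1, in_code = None, False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # toggle on every fence line
            continue
        if in_code:
            continue
        heading = HEADING.match(line)
        if heading:
            level, label = len(heading.group(1)), heading.group(2)
            nodes.append({"label": label, "kind": "section"})
            if level == 1:
                current_h1 = label
            elif current_h1:
                edges.append((current_h1, label, "contains"))
            continue
        definition = DEFINITION.match(line)
        if definition:
            term = definition.group(1)
            if len(term) <= 40 and re.fullmatch(r"[\w ]+", term):
                nodes.append({"label": term, "kind": "definition"})
                edges.append((doc_id, term, "defines"))
        for target in LINK.findall(line):
            target = posixpath.normpath(target)  # ./other.md -> other.md
            if target.endswith(".md") and target not in seen_links:
                seen_links.add(target)
                edges.append((doc_id, target, "references"))
    return {"nodes": nodes, "edges": edges}
```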
Image & scanned PDF extraction
image.heic / scanned.pdf
│
├── structural: hub node only (filename as label)
│ (no text extractable — pypdf returns empty for scanned pages)
│
└── agent mode (Step 2 in SKILL.md):
Claude reads file with vision → extracts entities → JSON
people, places, concepts, text (OCR), relationships
Step 3 — Cross-reference
Runs only when both code and docs exist. Links each doc to the code entities its text mentions.
README.md text: "The GraphStore class handles all persistence..."
│
▼
pattern match: "GraphStore" found
│
▼
edge(README → GraphStore, relation="mentions", INFERRED)
Code entities eligible for matching:
✓ class names (GraphStore, UserService)
✓ function names (detect, cluster)
✗ method stubs (.traverse(), .__init__())
✗ file-hub nodes (detect-files.py)
✗ labels < 3 chars
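A rough sketch of the matching step, under stated assumptions: each code node carries a hypothetical `kind` field ("class", "function", "method", "file_hub") that encodes the eligibility rules above, and matching is a whole-word regex search. The real matcher may differ.

```python
import re


def eligible(node):
    """Classes and functions with a name of at least 3 characters."""
    name = node["label"].rstrip("()")
    return node.get("kind") in {"class", "function"} and len(name) >= 3


def cross_reference(doc_id, doc_text, code_nodes):
    edges = []
    for node in code_nodes:
        if not eligible(node):
            continue
        name = node["label"].rstrip("()")
        # Whole-word match, so "detect" does not fire on "detection".
        if re.search(rf"\b{re.escape(name)}\b", doc_text):
            edges.append({"source": doc_id, "target": node["id"],
                          "relation": "mentions", "confidence": "INFERRED"})
    return edges
```

Whole-word matching is the main guard against false positives here; short or generic function names can still match ordinary English words, which is why the edges are marked INFERRED.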
Step 4 — Build graph
Merges all extraction results into a single NetworkX graph.
[code_result, doc_result, semantic_result]
│
├── deduplicate nodes by ID
├── merge edges (preserve _src/_tgt for direction display)
│
└── output: nx.Graph with node/edge attributes
node: {id, label, file_type, source_file, source_location}
edge: {relation, confidence, source_file, _src, _tgt}
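The merge rules can be illustrated without NetworkX. The sketch below uses plain dicts in place of nx.Graph so it runs with no dependencies; the "first occurrence wins" rule for duplicate node IDs and the `source`/`target` key names are assumptions.

```python
def build_graph(results):
    """Merge extraction results into one undirected graph (as plain dicts)."""
    nodes, edges = {}, {}
    for result in results:
        for node in result.get("nodes", []):
            nodes.setdefault(node["id"], node)  # assumed: first occurrence wins
        for edge in result.get("edges", []):
            src, tgt = edge["source"], edge["target"]
            # Undirected key: a->b and b->a with the same relation collapse.
            key = (frozenset((src, tgt)), edge["relation"])
            # _src/_tgt remember the original direction for display.
            edges.setdefault(key, {**edge, "_src": src, "_tgt": tgt})
    return {"nodes": nodes, "edges": list(edges.values())}
```

Because the graph is undirected, `_src` and `_tgt` are the only record of which way an edge such as "imports" originally pointed.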
Step 5 — Cluster
Graph (N nodes, E edges)
│
├── density check:
│ avg_degree ≤ 3 → resolution = 1.0 (broad, for docs)
│ nodes > 5000 → resolution = 1.0 (fewer communities)
│ otherwise → resolution = 1.5 (tight, for code)
│
├── Leiden (if graspologic installed) or Louvain (networkx builtin)
│
├── split oversized communities (> 15% of graph, min 10 nodes)
│
├── label each community from top-degree node names
│
└── score cohesion (internal edge density)
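The density check and the cohesion score are simple enough to write down directly. This sketch assumes average degree means 2E/N and that cohesion is the fraction of possible member pairs that are actually connected; both are plausible readings of the tree above, not confirmed definitions.

```python
def pick_resolution(n_nodes, n_edges):
    """Choose the clustering resolution from graph size and density."""
    avg_degree = 2 * n_edges / n_nodes if n_nodes else 0.0
    if avg_degree <= 3:
        return 1.0  # sparse graph (typical for docs): broad communities
    if n_nodes > 5000:
        return 1.0  # very large graph: fewer communities
    return 1.5      # dense code graph: tighter communities


def cohesion(members, edges):
    """Internal edge density: internal edges / possible member pairs."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges
                   if u != v and u in members and v in members)
    return internal / (n * (n - 1) / 2)
```

The returned resolution would then be passed to whichever community algorithm is available.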
Step 6 — Export
wiki-out/
│
├── graph.json ← node-link JSON, community assignments
│ queryable with llm-wiki query
│
├── graph.html ← vis.js interactive graph
│ nodes sized by degree, colored by community
│
├── WIKI_REPORT.md ← god nodes, surprising connections,
│ community summaries, suggested questions
│
├── wiki/ ← one .md per community + god node articles
│ index.md cross-links, bridge nodes
│ Community_0.md
│ GraphStore.md
│
└── vault/ ← one .md per node with [[wikilinks]]
.vault/graph.json community color config
GraphStore.md YAML frontmatter + inline tags
Settings.md
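As an illustration of the vault/ output, the sketch below renders one note per node. The frontmatter keys and the tag format are hypothetical; only the overall shape (YAML frontmatter, an inline tag, [[wikilinks]] to neighbors) comes from the listing above.

```python
def vault_note(node, neighbors, community):
    """Render one vault note: frontmatter, inline tag, wikilinks to neighbors."""
    lines = [
        "---",
        f"community: {community}",          # hypothetical frontmatter key
        f"file_type: {node['file_type']}",  # hypothetical frontmatter key
        "---",
        f"# {node['label']}",
        "",
        f"#community-{community}",          # hypothetical inline tag format
        "",
    ]
    lines += [f"- [[{name}]]" for name in sorted(neighbors)]
    return "\n".join(lines) + "\n"
```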
Agent mode flow (semantic extraction)
When running /wiki . in Claude Code, the skill adds a second pass:
Step 1: llm-wiki . ← structural (free)
│
▼
Step 2: check output
│
├── code-only graph, many edges → done, skip agent
│
├── DOCX/PDF/images with 0 edges → dispatch agents:
│
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ │ Agent 1 │ │ Agent 2 │ │ Agent 3 │
│ │ read DOCX │ │ read PDF │ │ read images│
│ │ extract │ │ (vision) │ │ (vision) │
│ │ entities │ │ extract │ │ OCR + desc │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ └────────────────┴────────────────┘
│ │
│ semantic.json
│ │
▼ ▼
Step 3: merge structural + semantic → rebuild graph → re-export
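The "check output" decision in Step 2 can be approximated as: find files of agent-eligible types whose nodes take part in no edge. The sketch below assumes nodes carry `id` and `source_file`, edges carry `source` and `target`, and that the eligible extensions are the ones named in the diagram; the skill's actual criteria may be richer.

```python
import os

# Assumed set of file types worth a semantic pass.
AGENT_TYPES = {".docx", ".pdf", ".png", ".jpg", ".heic"}


def files_needing_agents(nodes, edges):
    """Return agent-eligible source files whose nodes have zero edges."""
    connected = {end for edge in edges
                 for end in (edge["source"], edge["target"])}
    ids_by_file = {}
    for node in nodes:
        ids_by_file.setdefault(node["source_file"], []).append(node["id"])
    return sorted(
        path for path, ids in ids_by_file.items()
        if os.path.splitext(path)[1].lower() in AGENT_TYPES
        and not any(node_id in connected for node_id in ids)
    )
```

An empty result corresponds to the "done, skip agent" branch.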
Typed inheritance extraction
source.java
│
├── class UserRepository extends BaseEntity
│ implements Repository, Comparable
│
├── tree-sitter AST:
│ class_declaration
│ ├── superclass → BaseEntity ← extends
│ └── super_interfaces → type_list:
│ ├── Repository ← implements
│ └── Comparable ← implements
│
└── Graph edges:
UserRepository --extends--> BaseEntity
UserRepository --implements--> Repository
UserRepository --implements--> Comparable
Per-language dispatch in extract-inheritance.py:
| Language | Grammar field | Generates |
|---|---|---|
| Java | superclass + super_interfaces | extends + implements |
| Python | superclasses | extends (handles Generic[T] subscript) |
| TypeScript | class_heritage → extends_clause / implements_clause | extends + implements |
| Kotlin | delegation_specifiers → delegation_specifier | extends |
| C# | base_list (first=extends, rest=implements) | extends + implements |
| C++ | base_class_clause | extends |
| PHP | base_clause + class_interface_clause | extends + implements |
| Scala | extends_clause (first=extends, with T=implements) | extends + implements |
| Swift | inheritance_specifier | extends |
| Ruby | superclass field | extends |
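The dispatch pattern in the table can be sketched as a dict from language name to handler. Only a Python handler is shown, and it uses the standard-library `ast` module rather than tree-sitter; the names `python_bases`, `HANDLERS`, and `extract_inheritance` are illustrative, not the script's real API.

```python
import ast


def python_bases(source):
    """Yield (class, "extends", base) triples for the first class in source."""
    cls = ast.parse(source).body[0]
    triples = []
    for base in cls.bases:
        if isinstance(base, ast.Subscript):  # Generic[T] -> Generic
            base = base.value
        if isinstance(base, ast.Name):       # Base
            triples.append((cls.name, "extends", base.id))
        elif isinstance(base, ast.Attribute):  # abc.ABC -> ABC
            triples.append((cls.name, "extends", base.attr))
    return triples


# One entry per supported language; unsupported languages fall through.
HANDLERS = {"python": python_bases}


def extract_inheritance(language, source):
    handler = HANDLERS.get(language)
    return handler(source) if handler else []
```

Adding a language then means adding one handler and one dict entry, which keeps grammar-specific quirks (such as the subscript case) out of the shared code.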
Query example:
llm-wiki query neighbors Serializable
# → lists the neighbors of Serializable, including every class that implements it
Function signature extraction
source.py
│
├── def process(order: Order, user: User) -> Result:
│
├── tree-sitter AST:
│ function_definition
│ ├── name: process
│ ├── parameters: (order: Order, user: User)
│ └── return_type: Result
│
└── Node enrichment:
{
id: "orders_process",
label: "process()",
signature: "(order: Order, user: User) -> Result"
}
Truncation: signatures longer than 200 chars are cut and end with ...
Failures: handler exceptions are caught; the node is still created, just without a signature.
Debug mode: WIKI_DEBUG=1 llm-wiki . prints skipped extractions to stderr.
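Signature extraction, truncation, and the failure fallback can be sketched together. This version uses the standard-library `ast` module (Python 3.9+ for `ast.unparse`) in place of tree-sitter, and assumes the 200-char limit includes the trailing dots; `signature` and `safe_signature` are hypothetical names.

```python
import ast

MAX_LEN = 200  # assumed to include the trailing "..."


def signature(func_source):
    """Build "(params) -> return" for the first function in func_source."""
    func = ast.parse(func_source).body[0]
    sig = f"({ast.unparse(func.args)})"
    if func.returns is not None:
        sig += f" -> {ast.unparse(func.returns)}"
    if len(sig) > MAX_LEN:
        sig = sig[: MAX_LEN - 3] + "..."
    return sig


def safe_signature(func_source):
    """Return None instead of raising, so the node can still be created."""
    try:
        return signature(func_source)
    except (SyntaxError, IndexError, AttributeError):
        return None
```

A caller would attach the result to the node only when it is not None, which matches the "node still gets created" behavior described above.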
Doc comment extraction
AST nodes are automatically enriched with descriptions taken from their doc comments, which often carry the business logic:
source.java
│
├── /** Match YHC orders with delivery data. ← Javadoc
│ * Uses 3-month sliding window.
│ */
├── public class YhcOrderMatchingService { ← AST node
│
└── Result:
node.label = "YhcOrderMatchingService"
node.description = "Match YHC orders with delivery data. Uses 3-month sliding window."
Supported: /** */ (Java/JS/TS/PHP), // (Go), /// (Rust/C#/Swift), # (Ruby)
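For the Javadoc case, a regex is enough to show the idea. This is a simplified sketch: it only handles `/** */` blocks directly above a class declaration, drops `@tag` lines, and uses a hypothetical function name; the real extractor works on the AST and covers more comment styles.

```python
import re

# A /** ... */ block (with no nested "*/") followed by a class declaration.
JAVADOC = re.compile(
    r"/\*\*((?:(?!\*/).)*)\*/\s*"
    r"(?:(?:public|final|abstract)\s+)*class\s+(\w+)",
    re.DOTALL,
)


def javadoc_descriptions(source):
    """Map class name -> one-line description built from its Javadoc."""
    descriptions = {}
    for body, name in JAVADOC.findall(source):
        # Strip the leading " * " decoration from each comment line.
        lines = [re.sub(r"^\s*\*?\s?", "", line).strip()
                 for line in body.splitlines()]
        kept = [line for line in lines if line and not line.startswith("@")]
        descriptions[name] = " ".join(kept)
    return descriptions
```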
Living wiki cycle
After initial build, the wiki grows with every session:
┌──────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Monitor │───▶│ Rebuild │───▶│ Lint │ │
│ │ (watch) │ │ (cached) │ │ (health) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ▲ │ │
│ │ ▼ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Report │◀─────────────────│Write-back│ │
│ │ (stats) │ │(insights)│ │
│ └──────────┘ └──────────┘ │
│ │
└──────────────────────────────────────────────────┘
| Stage | Command or path | Purpose |
|---|---|---|
| Monitor | llm-wiki watch . | watch for changes (or check mtime) |
| Rebuild | llm-wiki . | SHA256 cache skips unchanged files |
| Lint | llm-wiki lint | flags orphans and tiny communities |
| Write-back | wiki-out/ingested/*.md | insights filed as markdown |
| Report | llm-wiki query stats | tracks growth over time |
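The cached rebuild relies on content hashes rather than timestamps. A minimal sketch of that idea follows; the JSON cache layout and the function names are assumptions, not the tool's actual cache format.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA256 of the file's bytes, as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, cache_path):
    """Return the paths whose content changed since the last call."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    changed = []
    for path in paths:
        digest = file_digest(path)
        if cache.get(str(path)) != digest:
            changed.append(str(path))
            cache[str(path)] = digest  # remember the new content hash
    cache_file.write_text(json.dumps(cache, indent=2))
    return changed
```

Only the files returned here would need re-extraction; everything else can reuse its previous result. Hashing content instead of comparing mtime avoids rebuilds after a checkout or copy that touches timestamps without changing bytes.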