RepoNova
RepoNova is an MCP server that builds a persistent knowledge graph of your codebase, enabling AI agents to query code structure, dependencies, and semantics through 11 specialized tools.
<p align="center"> <img src="https://img.shields.io/npm/v/reponova?style=flat-square&color=cb3837&logo=npm" alt="npm version" /> <img src="https://img.shields.io/npm/dm/reponova?style=flat-square&color=blue" alt="npm downloads" /> <img src="https://img.shields.io/node/v/reponova?style=flat-square&color=339933&logo=node.js&logoColor=white" alt="node version" /> <img src="https://img.shields.io/github/license/CristianoCiuti/reponova?style=flat-square&color=green" alt="license" /> <img src="https://img.shields.io/badge/MCP-compatible-8A2BE2?style=flat-square" alt="MCP compatible" /> </p>
<p align="center"> <img src="media/reponova-social.jpg" alt="RepoNova" width="600" /> </p>
<h1 align="center">RepoNova</h1>
<p align="center"> <strong>Turn your codebase into a knowledge graph. Query it with AI.</strong> </p>
<p align="center" style="font-style: italic;"> Knowledge graph builder & <a href="https://modelcontextprotocol.io/">MCP</a> server for AI code assistants.<br/> Extracts symbols, relationships, and semantics from your code, then exposes the entire structure<br/> as 11 graph tools that any MCP-compatible agent can use. </p>
⚠️ Alpha (Active Development): APIs, config format, and CLI may change between releases. Already usable in production workflows. Open an issue if something doesn't work.
Why RepoNova?
AI agents read files one at a time. They don't understand how your codebase fits together: which functions call what, which modules depend on which, where the architectural bottlenecks are.
RepoNova fixes that. It builds a persistent knowledge graph of your entire codebase (or multiple repos) and gives your AI agent 11 specialized tools to query it: search, impact analysis, shortest path, semantic similarity, community detection, and more.
One build. Persistent graph. Instant queries across sessions. No re-reading files. No burning tokens on context. The graph remembers everything.
What makes it different
- Zero external dependencies: no Python, no Docker, no database servers. Pure Node.js
- Multi-repo support: build one graph spanning multiple repositories
- Smart incremental builds: SHA256 file hashing, per-phase config change detection, selective subsystem regeneration
- Provider-based AI: optional local or remote AI providers for embeddings, summaries, and descriptions (local CPU/GPU or OpenAI-compatible APIs)
- 11 MCP tools: from text search to weighted Dijkstra, semantic similarity to structural queries
- Works with any MCP client: OpenCode, Cursor, Claude Code, VS Code Copilot
How it works
Your Codebase          reponova build                               AI Agent
-------------          --------------                               --------
Python ¹               1. tree-sitter AST parsing                   graph_search
Markdown / Docs  --->  2. Symbol + edge extraction           --->   graph_impact
Diagrams / SVG         3. Louvain communities                       graph_path
Multi-repo             4. Enrichment (summaries + descriptions)     graph_similar
                       5. TF-IDF / ONNX / API embeddings
                       6. HTML visualizations                       ... (11 tools)

¹ More languages coming soon; contributions welcome.
Install
npm install -g reponova
Or run directly without installing:
npx reponova
Requires Node.js >= 18.
Quick Start
1. Install into your editor
reponova install --target opencode
This registers the MCP server, installs hooks/skills, and writes the default reponova.yml config.
Supported editors: opencode, cursor, claude, vscode
2. Build the knowledge graph
reponova build
3. Use it
The MCP server starts automatically with your editor. Your AI agent now has access to all 11 graph tools.
You: "What would be the impact of refactoring the authenticate function?"
Agent: [calls graph_impact] → shows upstream/downstream blast radius across repos
MCP Tools
11 specialized tools exposed over MCP (stdio). Each tool is designed for a specific query pattern.
| Tool | Description |
|---|---|
| `graph_search` | Full-text search across nodes. Filter by type, repo. Expand results with BFS/DFS. |
| `graph_impact` | Blast radius analysis: find all upstream/downstream dependents of any symbol. |
| `graph_path` | Weighted shortest path (Dijkstra) between two symbols. Filter by edge type. |
| `graph_explain` | Full detail on a node: edges, community, centrality metrics, signature, docstring. |
| `graph_similar` | Semantic similarity search using vector embeddings (TF-IDF, ONNX, or remote provider). |
| `graph_context` | Smart context builder with token budget: combines search + vectors + graph expansion. |
| `graph_community` | List all nodes in a community, ranked by degree centrality. |
| `graph_hotspots` | God nodes / architectural bottlenecks: the most connected symbols in the graph. |
| `graph_outline` | Tree-sitter code outline: functions, classes, imports with signatures and line ranges. |
| `graph_docs` | Search documentation nodes (markdown, text, rst). |
| `graph_status` | Graph metadata: node/edge counts, repos, build timestamp, reponova version, build config. |
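To make the idea behind `graph_impact` concrete, here is a minimal sketch of a blast radius computed as a breadth-first traversal over dependency edges. It is not RepoNova's implementation: the `DepEdge` shape, the `blastRadius` function, and the sample symbol names are hypothetical, and the direction semantics (downstream = "who depends on it") are assumed from the workflow examples in this README.

```typescript
// Hypothetical edge shape: `source` depends on `target` (e.g. source calls target).
type DepEdge = { source: string; target: string };

// Breadth-first traversal from `start`.
// downstream: symbols that depend on `start` (edges walked backwards).
// upstream:   symbols that `start` depends on (edges walked forwards).
function blastRadius(
  edges: DepEdge[],
  start: string,
  direction: "upstream" | "downstream",
): string[] {
  const neighbors = (id: string): string[] =>
    edges
      .filter((e) => (direction === "downstream" ? e.target === id : e.source === id))
      .map((e) => (direction === "downstream" ? e.source : e.target));

  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  const result: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const neighbor of neighbors(current)) {
      if (!seen.has(neighbor)) {
        seen.add(neighbor);
        result.push(neighbor);
        queue.push(neighbor);
      }
    }
  }
  return result;
}

// Illustrative data only.
const sampleEdges: DepEdge[] = [
  { source: "api.handle_login", target: "auth.authenticate" },
  { source: "auth.authenticate", target: "db.find_user" },
  { source: "cli.login", target: "auth.authenticate" },
];
const dependents = blastRadius(sampleEdges, "auth.authenticate", "downstream");
```

With this sample data, `dependents` holds the two callers of `auth.authenticate`, while the upstream direction yields only `db.find_user`.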
Agentic Workflows
RepoNova is designed to be the structural memory layer for AI coding agents. Here's how to use it effectively in agentic workflows.
Recommended agent patterns
Before any refactoring:
1. graph_impact "TargetFunction" → understand blast radius
2. graph_path "ModuleA" "ModuleB" → see dependency chain
3. graph_community 5 → understand the module cluster
4. Make changes with full structural awareness
When exploring unfamiliar code:
1. graph_status → understand graph size and repos
2. graph_hotspots → identify architectural pillars
3. graph_search "authentication" → find entry points
4. graph_explain "Function:authenticate" → deep dive
When answering "where is X used?":
1. graph_search "X" → find the node
2. graph_impact "X" direction=downstream → who depends on it
3. graph_similar "X" → find semantically related code
Integration with editor skills
The reponova install command installs a skill file and a hook/rule that teaches your AI agent when and how to use each tool. The agent automatically reaches for graph tools when it needs structural information.
| Editor | MCP Config | Hook / Rule | Skill | Config |
|---|---|---|---|---|
| OpenCode | `.opencode/opencode.json` | `.opencode/plugins/reponova.js` | `.opencode/skills/reponova/SKILL.md` | `.opencode/reponova.yml` |
| Cursor | `.cursor/mcp.json` | `.cursor/rules/reponova.mdc` | (embedded in rule) | `.cursor/reponova.yml` |
| Claude Code | `claude mcp add` | `.claude/settings.json` | `.claude/skills/reponova/SKILL.md` | `.claude/reponova.yml` |
| VS Code | `.vscode/mcp.json` | `.github/copilot-instructions.md` | (embedded in instructions) | `.vscode/reponova.yml` |
Keeping the graph fresh
# Incremental rebuild: only processes changed files
reponova build
# Force rebuild: ignores all caches, reruns every phase
reponova build --force
Tip: Add `reponova build` to your CI pipeline or as a post-commit hook to keep the graph up to date.
How incremental builds work
RepoNova's incremental build goes beyond simple file-change detection. It minimizes redundant work at every stage of the pipeline:
| Layer | What it does | When it kicks in |
|---|---|---|
| File hashing | SHA256 per file: only changed or added files are re-parsed. Removed files are detected too. | Every incremental build |
| Config fingerprinting | Compares a hash of build-relevant config fields across builds. | When reponova.yml changes between builds |
| Selective subsystem regeneration | Only reruns the subsystems affected by config changes (e.g. changing embeddings.provider reruns embeddings but not parsing). | Config-only changes (no file changes) |
| Incremental embeddings | Tracks text content per node. Only re-embeds nodes whose text changed. | Every incremental build with embeddings enabled |
| Outline hashing | SHA256 per source file for outlines. Skips outline regeneration for unchanged files. | Every incremental build with outlines enabled |
| Stale artifact cleanup | Removes outdated artifacts when config changes invalidate them (e.g. deletes tfidf_idf.json after switching to a different embedding provider). | After config change detection |
| Per-phase skip | Each phase independently checks its cache and config fingerprint. If nothing relevant changed, the phase is skipped. | Every incremental build |
The build config fingerprint is stored in graph.json metadata. Each phase also stores its own config hash in .cache/ for per-phase change detection.
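The file hashing layer can be pictured with a short sketch: hash each file with SHA256, then compare the new hashes against those stored by the previous build. This is an illustration only, not RepoNova's code; `HashMap`, `diffFiles`, and the sample file contents are hypothetical names chosen for the example.

```typescript
import { createHash } from "node:crypto";

// SHA256 hex digest of a file's content.
function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical cache shape: file path -> SHA256 hex digest.
type HashMap = Record<string, string>;

// Classify files as added, changed, or removed relative to the previous build.
function diffFiles(previous: HashMap, current: HashMap) {
  const added: string[] = [];
  const changed: string[] = [];
  const removed: string[] = [];
  for (const path of Object.keys(current)) {
    if (!(path in previous)) added.push(path);
    else if (previous[path] !== current[path]) changed.push(path);
  }
  for (const path of Object.keys(previous)) {
    if (!(path in current)) removed.push(path);
  }
  return { added, changed, removed };
}

// Illustrative data only.
const previousHashes: HashMap = {
  "src/core.py": sha256("def run(): pass\n"),
  "src/old.py": sha256("x = 1\n"),
};
const currentHashes: HashMap = {
  "src/core.py": sha256("def run(): return 1\n"),
  "src/new.py": sha256("y = 2\n"),
};
const fileDiff = diffFiles(previousHashes, currentHashes);
```

Only the files listed in `added` and `changed` would need re-parsing, and the nodes of files in `removed` would be dropped.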
CLI Reference
reponova install
Set up editor integration. Creates MCP config, hook, skill, and reponova.yml.
reponova install --target <editor> [--graph <path>]
| Option | Required | Description |
|---|---|---|
| `--target` | Yes | Editor to configure. Values: opencode, cursor, claude, vscode |
| `--graph` | No | Path to the reponova-out/ directory. Default: ./reponova-out |
reponova build
Build (or rebuild) the knowledge graph.
reponova build [--config <path>] [--force] [--target <phase>] [--start-after <phase>] [--check <phase>]
| Option | Required | Description |
|---|---|---|
| `--config` | No | Path to reponova.yml. Default: auto-detected (see Config Resolution) |
| `--force` | No | Ignore all caches and rerun every phase. Default: false |
| `--target` | No | Run only this phase and its transitive dependencies. Useful for selective rebuilds without running the full pipeline. |
| `--start-after` | No | Run only phases downstream of this phase (requires previous build outputs). Conflicts with --target. |
| `--check` | No | Check if a phase needs to run. Exit 0 = up to date, exit 1 = needs run. Conflicts with --target, --start-after, --force. |
Target examples:
reponova build --target index        # file-detection → graph → communities → enrich → index
reponova build --target outlines     # file-detection → outlines
reponova build --target html         # file-detection → graph → communities → enrich → html
reponova build --target embeddings   # file-detection → graph → communities → enrich → embeddings
Start-after examples:
reponova build --start-after enrich # run only index, embeddings, html, report
reponova build --start-after communities # run only enrich + downstream
When --target is omitted, all 9 phases run in DAG order.
Build pipeline (9 DAG phases, 5 levels):
The pipeline executes as a directed acyclic graph. Phases within the same level run in parallel:
Level 0: file-detection
Level 1: graph, outlines (parallel)
Level 2: communities
Level 3: enrich
Level 4: search-index, embeddings, html, report (parallel)
| Phase | What it does |
|---|---|
| file-detection | Detect source files, documentation, and diagrams (centralized glob matching via picomatch) |
| graph | Diff files against previous build, parse changed files with tree-sitter WASM, extract symbols/calls/imports/inheritance, build directed graph with cross-file/cross-repo edges |
| outlines | Generate tree-sitter code outlines per file (SHA256 per-file hashing, unchanged files are skipped) |
| communities | Detect communities (Louvain algorithm) and write final graph.json with community assignments |
| enrich | Generate graph-enriched.json, community summaries, and node descriptions (algorithmic or LLM-enhanced via provider) |
| search-index | Generate SQLite search index (graph_search.db) |
| embeddings | Generate embeddings incrementally: only re-embed nodes whose text content changed (TF-IDF, ONNX, or remote provider). Clean up stale artifacts on config change. |
| html | Generate graph.html and graph_communities.html interactive visualizations |
| report | Generate report.md build report |
Each phase internally handles its own incremental logic: file diffing, config fingerprint comparison, cache invalidation, and stale artifact cleanup.
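The phase selection behind `--target` and `--start-after` can be sketched as two small graph queries. The dependency map below is inferred from the level listing and the target examples in this section, not taken from RepoNova's source; `phaseDeps`, `phasesForTarget`, and `phasesAfter` are hypothetical names. The sketch uses the phase name `search-index` from the level listing, whereas the target examples refer to it as `index`.

```typescript
// Assumed dependency map (phase -> direct dependencies), inferred from this README.
const phaseDeps: Record<string, string[]> = {
  "file-detection": [],
  graph: ["file-detection"],
  outlines: ["file-detection"],
  communities: ["graph"],
  enrich: ["communities"],
  "search-index": ["enrich"],
  embeddings: ["enrich"],
  html: ["enrich"],
  report: ["enrich"],
};

// The target plus its transitive dependencies, ordered so that every
// dependency comes before the phases that need it.
function phasesForTarget(target: string): string[] {
  const ordered: string[] = [];
  function visit(phase: string): void {
    if (ordered.indexOf(phase) !== -1) return;
    for (const dep of phaseDeps[phase] ?? []) visit(dep);
    ordered.push(phase);
  }
  visit(target);
  return ordered;
}

// Phases strictly downstream of `phase`: every other phase that
// transitively depends on it.
function phasesAfter(phase: string): string[] {
  return Object.keys(phaseDeps).filter(
    (candidate) =>
      candidate !== phase && phasesForTarget(candidate).indexOf(phase) !== -1,
  );
}

const htmlPlan = phasesForTarget("html");
const afterEnrich = phasesAfter("enrich");
```

Here `htmlPlan` reproduces the chain shown for `--target html`, and `afterEnrich` lists the four Level 4 phases.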
reponova mcp
Start the MCP server over stdio. Normally launched automatically by the editor.
reponova mcp [--graph <path>]
| Option | Required | Description |
|---|---|---|
| `--graph` | No | Path to reponova-out/ directory. Default: auto-detected |
reponova models
Manage local AI models (ONNX embeddings, LLM). See Models for details.
reponova models status # Show configured and cached models
reponova models download # Pre-download all models needed by config
reponova models remove <name> # Remove a specific cached model
reponova models clear # Remove all cached models
| Option | Required | Description |
|---|---|---|
| `--config` | No | Path to reponova.yml. Default: auto-detected |
| `--cache-dir` | No | Override model cache directory |
reponova check
Verify graph installation, build integrity, and report stats.
reponova check [--graph <path>]
| Option | Required | Description |
|---|---|---|
| `--graph` | No | Path to reponova-out/ directory. Default: auto-detected |
Checks performed:
- Graph file (`graph.json`) exists and is readable
- Build metadata presence (`build_config` fingerprint)
- Embedding artifacts consistency (TF-IDF IDF file, vector store)
- Warns if embedding provider in config doesn't match the built artifacts
- Search index (`graph_search.db`) existence
- Outlines directory existence
- tree-sitter WASM availability
reponova cache
Inspect and manage per-phase cache state.
reponova cache [--check <phase>] [--seal <phase>] [--invalidate <phase>] [--status]
| Option | Required | Description |
|---|---|---|
| `--check` | No | Check if a phase cache is fresh (exit 0 = fresh, exit 1 = stale) |
| `--seal` | No | Manually seal a phase cache (marks it as up-to-date) |
| `--invalidate` | No | Invalidate a phase cache (forces re-run on next build) |
| `--status` | No | Show cache status for all phases |
| `--config` | No | Path to reponova.yml. Default: auto-detected |
Only one operation at a time. Example:
reponova cache --status # Show freshness of all phases
reponova cache --seal enrich # Mark enrich as done (after manual enrichment)
reponova cache --invalidate html # Force HTML regeneration on next build
reponova enrich
Run the full intelligent enrichment pipeline (requires enrich.provider configured). This is the automated provider-driven mode. For IDE/agent-driven enrichment, use the subcommands below.
reponova enrich [--config <path>]
The command:
- Builds up to the `communities` phase (if needed)
- Runs all enrichment steps (metrics → descriptions → profiles → routing → restructure → apply → updated-profiles → finalize)
- Seals the enrich cache
reponova enrich:* (subcommands)
Step-by-step enrichment for IDE/agent workflows. Each subcommand corresponds to one stage of the pipeline.
reponova enrich:metrics # Step 0: Compute candidates and edge density
reponova enrich:prepare <step> # Prepare input batches for agent processing
reponova enrich:merge <step> # Merge agent output batches into final file
reponova enrich:apply # Apply routing + restructure decisions to graph
reponova enrich:finalize # Assemble final output files
Steps (for enrich:prepare and enrich:merge): descriptions, profiles, routing, restructure, updated-profiles
Typical IDE workflow:
reponova enrich:metrics # classify boundary nodes
reponova enrich:prepare descriptions # create input batches
# → agent reads .enrich/input/descriptions/, writes .enrich/output/descriptions/
reponova enrich:merge descriptions # merge into .enrich/descriptions.json
reponova enrich:prepare profiles # ...repeat for each step
# → agent processes → merge → prepare next → ...
reponova enrich:apply # apply routing + restructure to graph
reponova enrich:finalize # produce graph-enriched.json + final files
reponova cache --seal enrich # seal cache
reponova build --start-after enrich # run downstream phases
Supported Languages
Extraction (AST parsing + graph building)
| Language | Extensions | Parser | Node Types |
|---|---|---|---|
| Python | .py, .pyw | tree-sitter-python (WASM) | function, class, method, module, constant |
| Markdown | .md, .txt, .rst | Built-in | document, section |
| Diagrams | .puml, .plantuml, .svg, .png, .jpg, .jpeg, .gif | Built-in | diagram, component, interface, section |
Outline (tree-sitter code outline)
| Language | Extensions | Outline Support |
|---|---|---|
| Python | .py, .pyw | Full: functions, classes, methods, imports, signatures, decorators, docstrings |
Adding a new language: Create `src/extract/languages/<lang>.ts` implementing `LanguageExtractor`, register it in `registry.ts`, and add the `.wasm` grammar to `grammars/`. See Contributing > Adding Language Support for the full interface reference.

Note: Extraction and outline are separate systems with different registries and interfaces. Registering an extractor gives you graph building (symbols, edges, imports). For code outlines (`graph_outline`), you also need a `LanguageSupport` implementation in `src/outline/languages/`. See Contributing > Adding Outline Support.
Edge Types
Every edge in the graph has a type that describes the relationship:
| Edge Type | Description | Example |
|---|---|---|
| `calls` | Function/method invocation | process_data → validate_input |
| `imports` | Module-level import | api.py → models.py |
| `imports_from` | Named import of a specific symbol | api.py → UserModel |
| `extends` | Class inheritance | AdminUser → BaseUser |
| `contains` | Parent contains a child (module→symbol, class→method, document→section) | auth.py → login() |
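As a rough illustration of how typed edges might be modeled and filtered, the sketch below encodes the five edge types as a closed union and reuses the examples from the table. The `GraphEdge` field names and the `filterByType` helper are hypothetical; the actual schema stored in `graph.json` may differ.

```typescript
// The five edge types listed in the table, as a closed union.
type EdgeType = "calls" | "imports" | "imports_from" | "extends" | "contains";

// Hypothetical edge record; field names are illustrative only.
interface GraphEdge {
  source: string;
  target: string;
  type: EdgeType;
}

// The example column of the table, expressed as data.
const exampleEdges: GraphEdge[] = [
  { source: "process_data", target: "validate_input", type: "calls" },
  { source: "api.py", target: "models.py", type: "imports" },
  { source: "api.py", target: "UserModel", type: "imports_from" },
  { source: "AdminUser", target: "BaseUser", type: "extends" },
  { source: "auth.py", target: "login()", type: "contains" },
];

// Keep only edges whose type is allowed, e.g. to follow call chains only.
function filterByType(edges: GraphEdge[], allowed: EdgeType[]): GraphEdge[] {
  return edges.filter((edge) => allowed.indexOf(edge.type) !== -1);
}

const callEdges = filterByType(exampleEdges, ["calls"]);
const importEdges = filterByType(exampleEdges, ["imports", "imports_from"]);
```

Because `EdgeType` is a union of string literals, the compiler rejects any edge type that is not in the table.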
Configuration
Config Resolution
The config file is auto-detected from these locations (first match wins):
- Explicit `--config` argument
- `reponova.yml` in the project root
- `.opencode/reponova.yml`
- `.cursor/reponova.yml`
- `.claude/reponova.yml`
- `.vscode/reponova.yml`
All paths inside the config are relative to the config file's location. When placed inside an editor directory (e.g. .opencode/), use ../ to reference the project root.
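The "first match wins" lookup can be sketched as a simple ordered scan. This is an assumption-laden illustration rather than RepoNova's code: `CONFIG_CANDIDATES`, `resolveConfig`, and the injected `exists` predicate are hypothetical, and returning `undefined` when nothing is found is a choice made for the sketch (the README does not say what happens in that case).

```typescript
// Candidate locations in the order listed above, relative to the project root.
const CONFIG_CANDIDATES: string[] = [
  "reponova.yml",
  ".opencode/reponova.yml",
  ".cursor/reponova.yml",
  ".claude/reponova.yml",
  ".vscode/reponova.yml",
];

// `exists` is injected so the sketch does not touch the real file system.
function resolveConfig(
  explicit: string | undefined,
  exists: (candidate: string) => boolean,
): string | undefined {
  if (explicit !== undefined) return explicit; // an explicit --config comes first
  for (const candidate of CONFIG_CANDIDATES) {
    if (exists(candidate)) return candidate; // first match wins
  }
  return undefined; // assumption: no config found
}

// Simulated project where two editor configs exist.
const onDisk = new Set<string>([".cursor/reponova.yml", ".vscode/reponova.yml"]);
const chosen = resolveConfig(undefined, (candidate) => onDisk.has(candidate));
```

In this simulated project `chosen` is `.cursor/reponova.yml`, because it appears earlier in the list than `.vscode/reponova.yml`.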
Pattern Resolution
All glob patterns (patterns, exclude, docs.patterns, etc.) are matched against workspace-relative paths. How those paths look depends on the number of repos.
Single-repo
With one repo, file paths are relative to the repo root, with no prefix:
src/core.py
src/utils/helpers.py
tests/test_core.py
Patterns work as you'd expect:
repos:
  - name: my-project
    path: .
patterns: ["src/**/*.py"]   # matches src/core.py ✓
exclude: ["tests/**"]       # excludes tests/test_core.py ✓
Multi-repo
With multiple repos, each file path is prefixed with the repo name from the config:
api/src/routes.py      # ← "api" comes from repos[].name
api/src/handlers.py
core/src/models.py     # ← "core" comes from repos[].name
core/src/db.py
Patterns are tested against both forms (the full prefixed path and the repo-relative path), so the same pattern works in single-repo and multi-repo setups:
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
patterns: ["src/**/*.py"]   # matches api/src/routes.py, api/src/handlers.py, core/src/models.py, core/src/db.py ✓ (via repo-relative)
exclude: ["**/test_*.py"]   # works across all repos
Filtering a specific repo
Use the repo name as a path prefix to target one repo only:
exclude:
  - "api/src/generated/**"   # excludes only in the api repo
  - "**/migrations/**"       # excludes in all repos
This works because the full workspace path is always <repo-name>/<path>. The repo name is the name field from your repos config, not the directory name on disk.
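The two-form matching described above can be sketched as follows. RepoNova uses picomatch for glob matching; to keep this example self-contained it substitutes a deliberately tiny glob converter that only understands `**`, `*`, and literal characters, so treat `globToRegExp` and `matchesPattern` as hypothetical helpers and not as a picomatch replacement.

```typescript
// Minimal glob-to-RegExp converter for this sketch: `**/`, `**`, `*`, literals.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more leading directories
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob[i] === "*") {
      source += "[^/]*"; // anything within a single path segment
      i += 1;
    } else {
      source += glob[i].replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape literals
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// Multi-repo case: a file matches if the pattern matches either the prefixed
// workspace path (<repo-name>/<path>) or the repo-relative path.
function matchesPattern(
  pattern: string,
  repoName: string,
  repoRelativePath: string,
): boolean {
  const regex = globToRegExp(pattern);
  return regex.test(`${repoName}/${repoRelativePath}`) || regex.test(repoRelativePath);
}

const sharedPattern = matchesPattern("src/**/*.py", "api", "src/routes.py");
const apiOnly = matchesPattern("api/src/generated/**", "api", "src/generated/client.py");
const notInCore = matchesPattern("api/src/generated/**", "core", "src/generated/client.py");
```

`sharedPattern` matches through the repo-relative form, `apiOnly` matches through the prefixed form, and `notInCore` is false because neither form of the `core` path starts with `api/`.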
Full Config Reference
Every field, every valid value, every default.
# ------------------------------------------------------------------------------
# reponova.yml: Full Configuration Reference
# ------------------------------------------------------------------------------
# Where to write build output (graph.json, graph.html, graph_search.db, etc.)
# Type: string
# Default: "reponova-out"
output: ../reponova-out
# -- Repositories --------------------------------------------------------------
# List of repositories to include in the build.
# Each repo needs a unique name and a path (relative to this config file).
repos:
  - name: api-service          # string: unique identifier for this repo
    path: ../services/api      # string: path to repo root (relative to this file)
  - name: core-lib
    path: ../services/core
# -- Providers (optional AI backends) ------------------------------------------
# Define named providers here, then reference them from features below.
# Default (no provider) = algorithmic mode (TF-IDF embeddings, rule-based summaries).
# Type: Record<string, ProviderConfig>
# Default: {} (empty, fully algorithmic)
# providers:
#   my-openai:
#     type: openai                       # "openai" (remote), "llama-cpp" (local LLM), "onnx" (local embeddings)
#     base_url: https://api.openai.com/v1
#     model: text-embedding-3-small
#     api_key: ${OPENAI_API_KEY}         # env var reference (resolved at runtime)
#     timeout: 30                        # seconds (default: 30)
#   local-llm:
#     type: llama-cpp
#     model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"
#     context_size: 512
#   local-embeddings:
#     type: onnx
#     model: all-MiniLM-L6-v2
#   ollama:
#     type: openai                       # Ollama is OpenAI-compatible
#     base_url: http://localhost:11434/v1
#     model: nomic-embed-text
# -- Centralized Model Management ----------------------------------------------
# Shared settings for local AI models (ONNX embeddings + GGUF LLM weights).
# These apply to providers of type "onnx" and "llama-cpp".
models:
  # Directory to cache downloaded models (ONNX embeddings + LLM weights)
  # Type: string
  # Default: "~/.cache/reponova/models"
  cache_dir: ~/.cache/reponova/models
  # GPU acceleration backend for LLM inference
  # Values: "auto" | "cpu" | "cuda" | "metal" | "vulkan"
  #   - auto: auto-detect best available backend
  #   - cpu: force CPU inference (slower but always works)
  #   - cuda: NVIDIA GPU (requires CUDA drivers)
  #   - metal: Apple Silicon GPU (macOS only)
  #   - vulkan: Cross-platform GPU (AMD, Intel, NVIDIA)
  # Default: "auto"
  gpu: auto
  # Number of CPU threads for LLM inference
  # Type: number
  # Default: 0 (auto-detect based on available cores)
  threads: 0
  # Automatically download models on first use
  # Type: boolean
  # Default: true
  download_on_first_use: true
# -- Source Code File Filters --------------------------------------------------
# Shared by graph + outlines: a single file-detection phase produces
# the file list consumed by both.
# Glob patterns for source code files to include
# Type: string[]
# Default: [] (empty = auto-detect by file extension using registered extractors)
# Example: ["src/**/*.py", "lib/**/*.ts"]
patterns: []
# Glob patterns to exclude from source code detection
# Type: string[]
# Default: []
# Example: ["**/generated/**", "**/*.test.ts", "**/vendor/**"]
exclude: []
# Exclude common non-source directories from all file detection
# (source code, documentation and diagrams).
# When true, the following directories are skipped at any depth:
# node_modules, __pycache__, .git, .svn, .hg, venv, .venv, env, .env, .tox,
# site-packages, dist, build, .eggs, .mypy_cache, .pytest_cache, .ruff_cache,
# target, bin, obj
# Set to false if you need to index files inside these directories
# (e.g. vendored code in node_modules). You can still exclude specific
# directories via the `exclude` patterns above.
# Type: boolean
# Default: true
exclude_common: true
# Incremental builds: only re-process files whose SHA256 hash changed
# Type: boolean
# Default: true
incremental: true
# -- Documentation Extraction --------------------------------------------------
docs:
  # Enable/disable documentation extraction
  # Type: boolean
  # Default: true
  enabled: true
  # Glob patterns for documentation files (relative to repo root)
  # Type: string[]
  # Default: [] (empty = auto-detect by file extension: .md, .txt, .rst)
  # Example: ["docs/**/*.md", "**/*.rst"]
  patterns: []
  # Glob patterns to exclude from documentation extraction
  # Type: string[]
  # Default: []
  # Example: ["**/CHANGELOG.md", "**/node_modules/**"]
  exclude: []
  # Maximum file size in KB; files larger than this are skipped
  # Type: number
  # Default: 500
  max_file_size_kb: 500
# -- Diagram / Image Extraction ------------------------------------------------
images:
  # Enable/disable diagram extraction
  # Type: boolean
  # Default: true
  enabled: true
  # Glob patterns for diagram files (relative to repo root)
  # Type: string[]
  # Default: [] (empty = auto-detect by file extension: .puml, .plantuml, .svg, .png, .jpg, .jpeg, .gif)
  # Example: ["diagrams/**/*.puml", "**/*.svg"]
  patterns: []
  # Glob patterns to exclude
  # Type: string[]
  # Default: []
  # Example: ["**/node_modules/**"]
  exclude: []
  # Parse PlantUML files to extract components and relationships
  # Type: boolean
  # Default: true
  parse_puml: true
  # Extract text content from SVG files
  # Type: boolean
  # Default: true
  parse_svg_text: true
# -- Embeddings ----------------------------------------------------------------
# Vector representations for semantic search (graph_similar, graph_context)
# Default (no provider): TF-IDF (384-dim, fast, no download required)
# With provider: uses the named provider for embedding generation
embeddings:
  # Enable/disable embedding generation
  # Type: boolean
  # Default: true
  enabled: true
  # Reference a named provider from the `providers` section above
  # When omitted: uses built-in TF-IDF (384-dim, no download)
  # Type: string | undefined
  # Default: (none, algorithmic TF-IDF)
  # provider: my-openai
  # Batch size for embedding generation
  # Type: number
  # Default: 128
  batch_size: 128
# -- Enrich --------------------------------------------------------------------
# Unified enrichment: community summaries + node descriptions.
# Default (no provider): algorithmic mode (rule-based summaries and descriptions)
# With provider: enables intelligent multi-step LLM enrichment pipeline
enrich:
  # Enable/disable enrichment
  # Type: boolean
  # Default: true
  enabled: true
  # Degree percentile threshold for node descriptions
  # Type: number (0.0 to 1.0)
  # Default: 0.8
  # Meaning: the top (1 - threshold) x 100% of nodes by degree get descriptions.
  #   - 0.8 = top 20% of nodes
  #   - 0.5 = top 50% of nodes
  #   - 0.0 = all nodes (expensive!)
  #   - 1.0 = no nodes
  threshold: 0.8
  # Maximum number of communities to summarize
  # Type: integer (>= 0)
  # Default: 0 (no limit, summarize all communities)
  # Communities are sorted by size (largest first). When max_communities > 0,
  # only the top N largest communities are summarized.
  # Communities with fewer than 3 nodes are always excluded.
  max_communities: 0
  # Boundary ratio threshold for candidate classification (intelligent mode)
  # Nodes with external_edges / total_edges >= this value are candidates for rerouting
  # Type: number (0.0 to 1.0)
  # Default: 0.3
  candidate_threshold: 0.3
  # Token budget per description batch (intelligent mode)
  # Type: number
  # Default: 40000
  description_batch_tokens: 40000
  # Batch size for routing decisions (intelligent mode)
  # Type: number
  # Default: 30
  routing_batch_size: 30
  # LLM concurrency: max parallel LLM calls (intelligent mode)
  # Type: number (>= 1)
  # Default: 4
  concurrency: 4
  # Max retry depth for failed LLM calls (intelligent mode)
  # Type: number (>= 0)
  # Default: 3
  max_retry_depth: 3
  # Per-step max_tokens sent to the LLM provider (intelligent mode)
  # Controls the maximum output length for each enrichment step independently.
  # Type: object { descriptions, profiles, routing, restructure }
  # Default: { descriptions: 32768, profiles: 2048, routing: 8192, restructure: 4096 }
  # Note: descriptions output scales ~0.75× with description_batch_tokens input.
  # With default 40k input, expect ~30k output tokens.
  # Ensure your model context window fits input + output (e.g. 40k + 32k = 72k minimum).
  max_tokens:
    descriptions: 32768   # node description batches (scales with batch input)
    profiles: 2048        # community profiling (single object, bounded)
    routing: 8192         # routing decision batches (scales with routing_batch_size)
    restructure: 4096     # merge/split detection
  # Profile generation limits (intelligent mode)
  # Controls how many nodes/edges are included in the community profile prompt.
  # Lower values = cheaper prompts, less context for the LLM.
  # Type: object { max_nodes, max_edges }
  # Default: { max_nodes: 80, max_edges: 50 }
  # profile:
  #   max_nodes: 80   # max nodes listed in profile prompt per community
  #   max_edges: 50   # max edges listed in profile prompt per community
  # Maximum density pairs for restructure (intelligent mode)
  # Limits how many cross-community (communityA, communityB) pairs are sent
  # to the LLM for merge/split analysis. Pairs are ranked by edge density.
  # Type: number (>= 1)
  # Default: 20
  # restructure_max_pairs: 20
  # Provider name: references a provider defined in the top-level `providers` map
  # When omitted: uses algorithmic enrichment (rule-based summaries + descriptions)
  # The referenced provider must be type "openai" or "llama-cpp" (LLM-capable)
  # Type: string (optional)
  # provider: local-llm
# -- HTML Visualizations -------------------------------------------------------
# Generate interactive HTML visualizations (graph.html + graph_communities.html)
# Type: boolean
# Default: true
html: true
# Minimum node degree to include in HTML visualization
# Useful for large graphs: filters out leaf nodes to reduce clutter
# Type: integer (>= 1)
# Default: not set (include all nodes)
# html_min_degree: 3
# -- Outlines ------------------------------------------------------------------
# Tree-sitter code outlines: functions, classes, imports with signatures.
# Language is auto-detected from file extension (no need to specify it).
# File selection comes from top-level patterns / exclude / exclude_common.
outlines:
  # Enable/disable outline generation
  # Type: boolean
  # Default: true
  enabled: true
# -- Server --------------------------------------------------------------------
# MCP server options (reserved for future use)
# Type: object
# Default: {}
server: {}
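To make the `enrich.threshold` and `enrich.candidate_threshold` semantics from the reference above more tangible, here is a small sketch of both rules. It is only an interpretation of the config comments: `DegreeNode`, `nodesToDescribe`, and `isReroutingCandidate` are hypothetical names, and the rounding and zero-edge handling are assumptions, since the README does not specify them.

```typescript
// Hypothetical node shape for illustration.
type DegreeNode = { id: string; degree: number };

// threshold: the top (1 - threshold) share of nodes by degree gets descriptions.
// Rounding to the nearest whole node is an assumption of this sketch.
function nodesToDescribe(nodes: DegreeNode[], threshold: number): DegreeNode[] {
  const ranked = nodes.slice().sort((a, b) => b.degree - a.degree);
  const count = Math.round(ranked.length * (1 - threshold));
  return ranked.slice(0, count);
}

// candidate_threshold: a node is a rerouting candidate when
// external_edges / total_edges >= candidateThreshold.
// Nodes without edges are treated as non-candidates (assumption).
function isReroutingCandidate(
  externalEdges: number,
  totalEdges: number,
  candidateThreshold: number,
): boolean {
  return totalEdges > 0 && externalEdges / totalEdges >= candidateThreshold;
}

// Ten illustrative nodes with degrees 1..10.
const sampleNodes: DegreeNode[] = [];
for (let degree = 1; degree <= 10; degree++) {
  sampleNodes.push({ id: `node_${degree}`, degree });
}
const described = nodesToDescribe(sampleNodes, 0.8);
```

With ten nodes and the default threshold of 0.8, `described` contains the two highest-degree nodes; a threshold of 0.0 selects all ten and 1.0 selects none.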
Minimal Config
Most fields have sensible defaults. A minimal config for a single repo:
output: ../reponova-out
repos:
  - name: my-project
    path: ..
Multi-repo Config
output: ../reponova-out
repos:
  - name: api
    path: ../services/api
  - name: core
    path: ../services/core
  - name: shared
    path: ../libs/shared
Provider-based Config
For richer AI-enhanced enrichment or embeddings, define providers and reference them:
output: ../reponova-out
repos:
  - name: my-project
    path: ..
providers:
  local-llm:
    type: llama-cpp
    model: "hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M"   # ~350MB download
    context_size: 512
models:
  gpu: auto                       # auto-detect GPU, falls back to CPU
  download_on_first_use: true
enrich:
  enabled: true
  threshold: 0.5                  # describe top 50% nodes by degree
  provider: local-llm             # use local LLM for intelligent enrichment
When multiple features reference the same `llama-cpp` provider, RepoNova shares a single engine instance, so the model is not loaded into memory twice.
Using OpenAI-compatible APIs (including Ollama)
providers:
  openai-embed:
    type: openai
    base_url: https://api.openai.com/v1
    model: text-embedding-3-small
    api_key: ${OPENAI_API_KEY}
  ollama-llm:
    type: openai
    base_url: http://localhost:11434/v1
    model: llama3.2
embeddings:
  enabled: true
  provider: openai-embed
enrich:
  enabled: true
  provider: ollama-llm
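The config comments state that `${OPENAI_API_KEY}` is an env var reference resolved at runtime. A minimal sketch of such a substitution is shown below; `resolveEnvRefs` is a hypothetical helper, and both the accepted variable-name syntax and the choice to throw on a missing variable are assumptions rather than documented RepoNova behavior.

```typescript
// Replace every ${NAME} reference with the matching value from `env`.
// Throwing on a missing variable is an assumption made for this sketch.
function resolveEnvRefs(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        throw new Error(`Environment variable ${name} is not set`);
      }
      return resolved;
    },
  );
}

// Illustrative environment; a real caller would typically pass process.env.
const fakeEnv = { OPENAI_API_KEY: "sk-example" };
const apiKey = resolveEnvRefs("${OPENAI_API_KEY}", fakeEnv);
```

Keeping the key in the environment and only the reference in `reponova.yml` means the config file can be committed without exposing secrets.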
File Filtering Config
Control which source files are included in the graph:
output: ../reponova-out
repos:
  - name: my-project
    path: ..
patterns:                  # only include files matching these globs
  - "src/**/*.py"
  - "lib/**/*.ts"
exclude:                   # exclude files matching these globs
  - "**/test/**"
  - "**/tests/**"
  - "**/migrations/**"
  - "**/*.generated.ts"
When `patterns` is empty (default) for any subsystem (`docs`, `images`), RepoNova auto-detects files by extension using the corresponding registry. Source code and outlines share the top-level `patterns` / `exclude` / `exclude_common`. No configuration is needed for standard project layouts.

The configured output directory is automatically excluded from all file detection, so there is no need to add it to `exclude` patterns manually.

`exclude_common` (default: `true`) skips the following directories at any depth: `node_modules`, `__pycache__`, `.git`, `.svn`, `.hg`, `venv`, `.venv`, `env`, `.env`, `.tox`, `site-packages`, `dist`, `build`, `.eggs`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `target`, `bin`, `obj`. Set `exclude_common: false` to disable this behavior and use explicit `exclude` patterns instead.
Models & Providers
RepoNova supports three provider types for AI-enhanced features. By default (no providers configured), everything is algorithmic: no downloads, no API keys.
Provider Types
| Type | Purpose | Downloads | Requires |
|---|---|---|---|
| `onnx` | Local ONNX embeddings (sentence-transformers) | ~86 MB model | Nothing (bundled runtime) |
| `llama-cpp` | Local LLM (GGUF format) for summaries/descriptions | ~350 MB model | node-llama-cpp (optional peer dep) |
| `openai` | Remote OpenAI-compatible API (embeddings or LLM) | None | API key or local server (e.g. Ollama) |
ONNX Embeddings (local)
Sentence-transformer models for semantic similarity search (graph_similar, graph_context).
| Property | Value |
|---|---|
| Provider type | onnx |
| Config | providers.<name>.model (plain model name, e.g., all-MiniLM-L6-v2) |
| Source | huggingface.co/sentence-transformers/{model} |
| Cache path | {models.cache_dir}/{model-name}/ |
| Files downloaded | model.onnx, vocab.txt, tokenizer_config.json |
| Used when | embeddings.provider references an onnx provider |
Compatible models (384-dim output):
| Model | Size | Notes |
|---|---|---|
| all-MiniLM-L6-v2 | ~86 MB | Default. Good speed/quality balance |
| all-MiniLM-L12-v2 | ~130 MB | More accurate, slower |
| paraphrase-MiniLM-L6-v2 | ~86 MB | Optimized for paraphrase detection |
| multi-qa-MiniLM-L6-cos-v1 | ~86 MB | Optimized for Q&A |
Any model under the sentence-transformers/ org on HuggingFace that provides an ONNX export with BERT-compatible tokenizer (WordPiece) should work.
LLM / GGUF (local)
Local language models for enrichment (community summaries and node descriptions), powered by node-llama-cpp.
| Property | Value |
|---|---|
| Provider type | llama-cpp |
| Config | providers.<name>.model, given as an hf: URI (e.g., hf:Qwen/Qwen2.5-0.5B-Instruct-GGUF:Q4_K_M) |
| Format | hf:{user}/{repo}:{quantization} |
| Cache path | {models.cache_dir}/llm/ |
| Used when | enrich.provider references a llama-cpp provider |
| Dependency | node-llama-cpp (optional peer dependency) |
When multiple features reference the same llama-cpp provider, RepoNova shares a single engine instance, so the model is not loaded into memory twice.
Why different notations? ONNX embeddings use a direct HTTP fetch from a fixed HuggingFace org (`sentence-transformers/`), downloading specific files (model.onnx, vocab.txt). LLM models delegate entirely to node-llama-cpp's `resolveModelFile()`, which handles the `hf:` URI protocol, download, and caching. The two systems are technically incompatible, and the notation reflects this.
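The difference between the two notations can be illustrated with a small parser for the documented `hf:{user}/{repo}:{quantization}` shape. This is only a sketch: `parseHfUri` and `HfModelRef` are hypothetical names, and in RepoNova the real resolution is left to node-llama-cpp rather than done by hand.

```typescript
interface HfModelRef {
  user: string;
  repo: string;
  quantization: string;
}

// Hypothetical parser: returns null for anything that is not
// hf:{user}/{repo}:{quantization}, e.g. a plain ONNX model name.
function parseHfUri(uri: string): HfModelRef | null {
  if (!uri.startsWith("hf:")) return null;
  const rest = uri.slice(3);
  const slash = rest.indexOf("/");
  const colon = rest.lastIndexOf(":");
  // Require a non-empty user, repo, and quantization part.
  if (slash <= 0 || colon <= slash + 1 || colon === rest.length - 1) return null;
  return {
    user: rest.slice(0, slash),
    repo: rest.slice(slash + 1, colon),
    quantization: rest.slice(colon + 1),
  };
}
```

A plain name such as `all-MiniLM-L6-v2` yields `null` here, which mirrors the fact that ONNX providers take a bare model name instead of a URI.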
OpenAI-compatible (remote)
Any OpenAI-compatible API, including OpenAI itself, Azure OpenAI, Ollama, LM Studio, vLLM, etc.
| Property | Value |
|---|---|
| Provider type | openai |
| Config | providers.<name>.base_url, .model, .api_key, .timeout |
| Used for | Embeddings (embeddings.provider) or LLM enrichment (enrich.provider) |
| Retry policy | 3 retries with exponential backoff (1s/2s/4s) on HTTP 429 (embeddings only) |
| Timeout | Configurable per provider (default: 30s) |
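The retry policy in the table can be sketched as follows. This is an illustration under stated assumptions, not RepoNova's code: `withBackoff`, the `StatusResponse` shape, and the injectable `sleep` parameter are hypothetical, and only the 1s/2s/4s delays and the HTTP 429 trigger come from the table.

```typescript
// Hypothetical minimal response shape: only the HTTP status matters here.
type StatusResponse = { status: number };

// Retries a request up to delaysMs.length times while it keeps returning 429,
// waiting 1s, 2s, then 4s between attempts by default.
async function withBackoff<T extends StatusResponse>(
  request: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  delaysMs: readonly number[] = [1000, 2000, 4000],
): Promise<T> {
  let response = await request();
  for (const delay of delaysMs) {
    if (response.status !== 429) break; // success or a non-retryable error
    await sleep(delay);
    response = await request();
  }
  return response; // may still be a 429 once all retries are used up
}
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.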
Environment variable references (e.g., ${OPENAI_API_KEY}) are resolved at runtime.
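As a rough picture of what resolving such references involves, the sketch below substitutes `${NAME}` placeholders from an environment map. The function name `resolveEnvRefs` is hypothetical, and throwing on a missing variable is an assumption; the README does not say how RepoNova treats unset variables.

```typescript
// Hypothetical resolver: replaces every ${NAME} in a config value with env[NAME].
function resolveEnvRefs(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_whole: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        // Assumption: fail loudly rather than send an empty API key.
        throw new Error(`Environment variable ${name} is not set`);
      }
      return resolved;
    },
  );
}
```

In a Node.js process this would typically be called as `resolveEnvRefs(rawValue, process.env)`.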
Model Management CLI
reponova models status # Show configured and cached models
reponova models download # Pre-download all models needed by config
reponova models remove <name> # Remove a specific cached model
reponova models clear # Remove all cached models
Models are also downloaded automatically during reponova build when models.download_on_first_use: true (default). The CLI commands let you manage the cache independently of the build.
Build Output
After reponova build, the output directory contains:
reponova-out/
├── graph.json                     # Full graph: nodes, edges, community assignments, metadata
│                                  # metadata.build_config: config fingerprint for change detection
│                                  # nodes include: docstring, signature, bases (when available)
├── graph-enriched.json            # Enriched graph: summaries + descriptions merged into nodes/communities
├── graph-nodes.json               # Intermediate graph (pre-community detection, no Louvain assignments)
├── detected-files.json            # Detected file list (intermediate, consumed by graph + outlines)
├── graph.html                     # Interactive visualization (vis.js): click, search, filter
├── graph_communities.html         # Community-focused visualization with summary labels
├── graph_search.db                # SQLite search index (sql.js WASM) for structural queries
├── report.md                      # Build report: stats, hotspots, community breakdown
├── community_summaries.json       # Community summaries (algorithmic or provider-enhanced)
├── node_descriptions.json         # Descriptions for high-degree nodes
├── tfidf_idf.json                 # TF-IDF vocabulary weights (for query-time embedding)
├── vectors/                       # LanceDB vector store for semantic similarity search
│   ├── _meta.json                 # self-describing metadata (provider, dimensions, model)
│   └── (LanceDB internal files)   # fallback: vectors.json when @lancedb/lancedb unavailable
├── outlines/                      # Pre-computed code outlines per file
│   └── <repo>/<path>.outline.json
└── .cache/                        # Incremental build cache
    ├── hashes.json                # file path → SHA256 hex map (source code hashing)
    ├── outline-hashes.json        # file path → SHA256 map for outline generation
    ├── node-texts.json            # node id → text hash map for incremental embeddings
    ├── graph-nodes-hash.txt       # SHA256 of graph-nodes.json (skip community detection)
    ├── embeddings-config-hash.txt # config fingerprint for embeddings phase
    ├── embeddings-input-hash.txt  # input hash for embeddings (detect upstream changes)
    ├── enrich-input-hash.txt      # graph.json hash for enrich invalidation
    └── extractions/               # cached FileExtraction per file
        └── <hash>.json
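The `.cache/` entries suggest a hash-compare scheme for incremental builds. A minimal sketch of that idea follows, assuming a flat path → SHA-256 hex map like the one described for `hashes.json`; the function names are hypothetical and the real cache logic may differ.

```typescript
import { createHash } from "node:crypto";

// Assumed layout: relative file path → SHA-256 hex digest of its content.
type HashMap = Record<string, string>;

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Hypothetical helper: compares current file contents with the previous map
// and reports which paths are new or modified, plus the map to store next.
function findChangedFiles(
  files: Record<string, string>,
  previous: HashMap,
): { changed: string[]; next: HashMap } {
  const next: HashMap = {};
  const changed: string[] = [];
  for (const [path, content] of Object.entries(files)) {
    const hash = sha256Hex(content);
    next[path] = hash;
    if (previous[path] !== hash) changed.push(path);
  }
  return { changed, next };
}
```

Only the paths in `changed` would need re-extraction; everything else could be served from the cached results.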
Two storage engines serve different purposes:
- SQLite (`graph_search.db`): structural index for exact lookups, graph traversal, FTS. Used by `graph_search`, `graph_impact`, `graph_path`, `graph_explain`, and more.
- LanceDB (`vectors/`): vector index for semantic similarity. Used by `graph_similar` and `graph_context`. Falls back to brute-force cosine similarity (JSON) when `@lancedb/lancedb` is not installed.
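The brute-force fallback can be pictured as a linear scan that scores every stored vector against the query. The sketch below assumes vectors are available as an id → number array map; that layout and the names `cosineSimilarity` and `topKSimilar` are illustrative, not the documented `vectors.json` format.

```typescript
// Cosine similarity of two equal-length vectors; 0 when either has zero norm.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical brute-force search: score every vector, keep the k best.
function topKSimilar(
  query: readonly number[],
  vectors: Record<string, readonly number[]>,
  k: number,
): Array<{ id: string; score: number }> {
  return Object.entries(vectors)
    .map(([id, vector]) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k);
}
```

A scan like this costs time proportional to the number of nodes per query, which is the usual reason to prefer a vector index when one is available.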
Programmatic API
Use RepoNova as a library in your own Node.js tools.
Build API
Run the full build pipeline programmatically. This is useful for CI integrations, custom tooling, or workflows that register custom extractors/languages before building.
import { build } from "reponova";
const result = await build("./reponova.yml");
console.log(`Output: ${result.outputDir}`);
console.log(`Total processed: ${result.totalProcessed}`);
for (const [phase, r] of result.phases) {
  console.log(` ${phase}: ${r.skipped ? `skipped (${r.skipReason})` : `${r.processed} items`}`);
}
// Force rebuild: ignores all caches and reruns every phase
const forcedResult = await build("./reponova.yml", { force: true });
build() returns a BuildResult:
| Field | Type | Description |
|---|---|---|
| outputDir | string | Absolute path to the output directory |
| phases | Map<string, PhaseResult> | Per-phase results (processed count, skip status, skip reason) |
| totalProcessed | number | Total items processed across all phases |
If configPath is omitted, config is auto-detected from standard locations (see Config Resolution).
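For CI use, a `BuildResult` can be reduced to a simple verdict. The sketch below declares local copies of the documented shapes so it runs on its own; the `PhaseResult` field names are taken from the usage example above, their optionality is an assumption, and `activePhases` / `wasFullyCached` are hypothetical helpers rather than part of the RepoNova API.

```typescript
// Local copies of the documented shapes; field optionality is an assumption.
interface PhaseResult {
  processed: number;
  skipped: boolean;
  skipReason?: string;
}

interface BuildResult {
  outputDir: string;
  phases: Map<string, PhaseResult>;
  totalProcessed: number;
}

// Hypothetical helper: names of phases that actually did work in this build.
function activePhases(result: BuildResult): string[] {
  const active: string[] = [];
  result.phases.forEach((phase, name) => {
    if (!phase.skipped && phase.processed > 0) active.push(name);
  });
  return active;
}

// Hypothetical CI check: true when every phase was skipped or processed nothing.
function wasFullyCached(result: BuildResult): boolean {
  return activePhases(result).length === 0;
}
```

A pipeline could, for example, skip publishing artifacts when `wasFullyCached(result)` is true.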
Runtime Registration + Build
Register custom extractors or outline languages before calling build():
import {
  build,
  registerExtractor,
  registerOutlineLanguage,
} from "reponova";
import type { LanguageExtractor, LanguageSupport } from "reponova";
// 1. Register a custom extractor (graph building)
const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);
// 2. Register outline support (graph_outline)
const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);
// 3. Build β all registrations are picked up automatically
const result = await build("./reponova.yml");
Query API
After building, load and query the graph:
import {
  openDatabase,
  initializeSchema,
  populateDatabase,
  loadGraphData,
  searchNodes,
  analyzeImpact,
  findShortestPath,
  getNodeDetail,
} from "reponova";
// Load and index the graph
const graphData = loadGraphData("./reponova-out/graph.json");
const db = await openDatabase(":memory:");
initializeSchema(db);
populateDatabase(db, graphData);
// Search
const results = searchNodes(db, "authentication", { top_k: 5, type: "function" });
// Impact analysis
const impact = analyzeImpact(db, "Function:authenticate_user", { max_depth: 3 });
// Shortest path
const path = findShortestPath(db, graphData, "ModuleA", "ModuleB");
// Node detail
const detail = getNodeDetail(db, graphData, "Function:process_payment");
Advanced API
import {
  ContextBuilder,
  loadConfig,
} from "reponova";
// Smart context assembly (search + vectors + graph expansion)
// `db` and `graphData` are the values created in the Query API example above
const { config } = loadConfig("./reponova.yml");
const builder = new ContextBuilder(db, graphData, "./reponova-out");
await builder.initialize(config.embeddings);
const context = await builder.buildContext({
  query: "authentication flow",
  maxTokens: 4000,
});
FAQ
Do I need an API key?
No. By default, RepoNova is fully algorithmic: no models, no downloads, no API keys. If you configure an openai provider pointing to a remote service, you'll need an API key for that service. Local providers (onnx, llama-cpp) run entirely on your machine.
How big are the models?
| Model | Size | When downloaded |
|---|---|---|
| TF-IDF embeddings | None (computed in-process) | Never (default) |
| ONNX embeddings | ~86 MB (MiniLM-L6-v2) | When embeddings.provider references an onnx provider |
| LLM (Qwen 0.5B Q4_K_M) | ~350 MB | When a llama-cpp provider is configured and referenced |
How long does a build take?
Depends on codebase size. Rough benchmarks:
- Small project (500 files): ~5-10 seconds
- Medium project (5,000 files): ~30-60 seconds
- Large monorepo (20,000+ files): 2-5 minutes
- LLM-provider summaries add ~2-3 seconds per community
Can I use it without an editor?
Yes. Use the CLI (reponova build, reponova check) and the programmatic API. The MCP server is just one way to query the graph.
What about TypeScript / JavaScript extraction?
Tree-sitter grammars are ready. The extractor implementation is on the roadmap, and contributions are welcome.
Contributing
Contributions are welcome.
Adding Language Support (Extraction)
Add new programming language extractors via tree-sitter. An extractor teaches RepoNova how to parse a language's AST and extract symbols, imports, and references for graph building.
Steps
- Create `src/extract/languages/<lang>.ts` implementing the `LanguageExtractor` interface
- Register it in `src/extract/languages/registry.ts` (or at runtime via `registerExtractor()`)
- Add the tree-sitter WASM grammar to `grammars/` (e.g., `tree-sitter-javascript.wasm`)
LanguageExtractor Interface
interface LanguageExtractor {
  /** Language identifier; must match the tree-sitter grammar name (e.g., "javascript") */
  readonly languageId: string;
  /** File extensions this extractor handles (e.g., [".js", ".mjs", ".cjs"]) */
  readonly extensions: string[];
  /**
   * WASM grammar filename (e.g., "tree-sitter-javascript.wasm").
   * If provided: pipeline parses with tree-sitter and passes the SyntaxTree.
   * If omitted: extract() receives a null tree and must work from sourceCode directly.
   * (Markdown and diagram extractors omit it; no WASM needed.)
   */
  readonly wasmFile?: string;
  /**
   * Extract symbols, imports, and references from a single source file.
   * @param tree - Parsed tree-sitter AST (null if wasmFile not set)
   * @param sourceCode - Raw file content
   * @param filePath - Relative path (normalized, forward slashes)
   */
  extract(tree: SyntaxTree | null, sourceCode: string, filePath: string): FileExtraction;
  /**
   * Resolve an import module path to candidate file paths.
   * Example: "config.loader" → ["config/loader.py", "config/loader/__init__.py"]
   * Return empty array for external/third-party modules.
   */
  resolveImportPath(importModule: string, currentFilePath: string): string[];
}
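The `resolveImportPath` contract can be sketched for Python-style dotted modules as below. This is a simplified illustration, not the code in `python.ts`: the `localRoots` parameter (the set of top-level packages that live in the repo) is an assumption introduced here to decide what counts as external, and relative imports are deliberately left out.

```typescript
// Hypothetical resolver for dotted module names such as "config.loader".
function resolveDottedImport(
  importModule: string,
  localRoots: ReadonlySet<string>,
): string[] {
  // Relative imports (leading ".") are out of scope for this sketch.
  if (importModule === "" || importModule.startsWith(".")) return [];
  const parts = importModule.split(".");
  const root = parts[0] ?? "";
  // Assumption: anything whose top-level package is not local is third-party.
  if (!localRoots.has(root)) return [];
  const base = parts.join("/");
  // A module can be a single file or a package directory.
  return [`${base}.py`, `${base}/__init__.py`];
}
```

With `localRoots` containing `config`, the module `config.loader` yields the two candidates from the interface comment, while a third-party module yields an empty array.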
FileExtraction Return Type
interface FileExtraction {
  filePath: string;              // Relative path (forward slashes)
  language: string;              // Must match languageId
  symbols: SymbolNode[];         // Functions, classes, methods, variables
  imports: ImportDeclaration[];  // Import/export statements
  references: SymbolReference[]; // Calls, type annotations, inheritance refs
}
Key types your extractor produces:
| Type | Fields | Purpose |
|---|---|---|
| SymbolNode | name, qualifiedName, kind, signature?, decorators, docstring?, startLine, endLine, parent?, bases?, calls | A symbol defined in the file |
| ImportDeclaration | module, names, isWildcard, isExport?, line | An import/export statement |
| SymbolReference | name, fromSymbol, kind ("call" \| "type_annotation" \| "attribute_access" \| "inheritance"), line | A reference to another symbol |
| SymbolKind | "function" \| "class" \| "method" \| "variable" \| "constant" \| "interface" \| "enum" \| "module" \| "document" \| "section" | Symbol classification |
See src/extract/types.ts for full type definitions and JSDoc.
How tree-sitter Parsing Works
- If `wasmFile` is set, the pipeline loads `grammars/<wasmFile>`, parses the source, and passes a `SyntaxTree` to `extract()`
- If `wasmFile` is omitted, `extract()` receives `null` as the tree and must work from `sourceCode` directly
- WASM grammars are loaded from the `grammars/` directory relative to the package root
- `SyntaxTree` / `SyntaxNode` types match the web-tree-sitter WASM interface
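To show the no-`wasmFile` path end to end, here is a minimal sketch of an extractor that works from `sourceCode` alone. Everything specific in it is assumed for illustration: the `.ini` language, the trimmed local copies of `SymbolNode` and `FileExtraction` (field types are guesses based on the table above), and 1-based line numbers. See `src/extract/types.ts` for the real definitions.

```typescript
// Trimmed local copies of the documented types; field types are assumptions.
type SymbolKind = "variable" | "section"; // subset of the documented kinds

interface SymbolNode {
  name: string;
  qualifiedName: string;
  kind: SymbolKind;
  decorators: string[];
  startLine: number; // assumed 1-based
  endLine: number;
  parent?: string;
  calls: string[];
}

interface FileExtraction {
  filePath: string;
  language: string;
  symbols: SymbolNode[];
  imports: never[];
  references: never[];
}

// Hypothetical extractor for INI files: no wasmFile, so the tree is null.
const iniExtractor = {
  languageId: "ini",
  extensions: [".ini"],
  extract(_tree: null, sourceCode: string, filePath: string): FileExtraction {
    const symbols: SymbolNode[] = [];
    let section: string | undefined;
    const lines = sourceCode.split("\n");
    for (let i = 0; i < lines.length; i++) {
      const line = (lines[i] ?? "").trim();
      const lineNo = i + 1;
      if (line === "" || line.startsWith(";") || line.startsWith("#")) continue;
      if (line.startsWith("[") && line.endsWith("]")) {
        // "[server]" opens a section symbol.
        section = line.slice(1, -1).trim();
        symbols.push({
          name: section, qualifiedName: section, kind: "section",
          decorators: [], startLine: lineNo, endLine: lineNo, calls: [],
        });
        continue;
      }
      const eq = line.indexOf("=");
      if (eq <= 0) continue; // not a "key = value" line
      const key = line.slice(0, eq).trim();
      symbols.push({
        name: key,
        qualifiedName: section === undefined ? key : `${section}.${key}`,
        kind: "variable",
        decorators: [], startLine: lineNo, endLine: lineNo, calls: [],
        ...(section === undefined ? {} : { parent: section }),
      });
    }
    return { filePath, language: "ini", symbols, imports: [], references: [] };
  },
  // INI files have no imports, so nothing ever resolves.
  resolveImportPath(_importModule: string, _currentFilePath: string): string[] {
    return [];
  },
};
```

An object of this shape, typed against the real `LanguageExtractor`, is what would be passed to `registerExtractor()`.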
Runtime Registration
You can also register extractors at runtime via the public API (must be called before build):
import { registerExtractor } from "reponova";
import type { LanguageExtractor } from "reponova";
const myExtractor: LanguageExtractor = { /* ... */ };
registerExtractor(myExtractor);
Note: duplicate languageId or extensions silently overwrite the previous extractor.
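The overwrite behaviour is what you would get from a registry keyed by file extension, as in the sketch below. This is one plausible model, not the actual contents of `registry.ts`; `ExtractorRegistry` and `MinimalExtractor` are hypothetical names.

```typescript
// Only the fields relevant to registration are modelled here.
interface MinimalExtractor {
  languageId: string;
  extensions: string[];
}

// Hypothetical registry: the last extractor registered for an extension wins.
class ExtractorRegistry {
  private readonly byExtension = new Map<string, MinimalExtractor>();

  register(extractor: MinimalExtractor): void {
    for (const ext of extractor.extensions) {
      // Map.set replaces any earlier entry without warning.
      this.byExtension.set(ext, extractor);
    }
  }

  forExtension(ext: string): MinimalExtractor | undefined {
    return this.byExtension.get(ext);
  }
}
```

The practical consequence is that registration order matters: register your override after anything it is meant to replace.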
Reference Implementation
See src/extract/languages/python.ts for a full tree-sitter-based extractor, or src/extract/languages/markdown.ts for a non-tree-sitter (regex) extractor.
Adding Outline Support
Outlines (graph_outline) use a separate system from extraction. They have their own registry, interface, and implementations.
Steps
- Create `src/outline/languages/<lang>.ts` implementing the `LanguageSupport` interface
- Register it in `src/outline/languages/registry.ts` via `registerOutlineLanguage()`
- The same WASM grammar from `grammars/` is shared with the extraction system
LanguageSupport Interface
interface LanguageSupport {
  /** WASM grammar filename (e.g., "tree-sitter-python.wasm") */
  readonly wasmFile: string;
  /** Extract outline from tree-sitter AST (primary method) */
  treeSitterExtract(rootNode: SyntaxNode, filePath: string, lineCount: number): FileOutline;
  /** Extract outline from raw source via regex (fallback when WASM unavailable) */
  regexExtract(filePath: string, source: string, lineCount: number): FileOutline;
}
Runtime Registration
You can also register outline languages at runtime via the public API (must be called before build):
import { registerOutlineLanguage } from "reponova";
import type { LanguageSupport } from "reponova";
const myOutline: LanguageSupport = { /* ... */ };
registerOutlineLanguage("rust", ["rs"], myOutline);
Note: duplicate language names or extensions silently overwrite the previous registration.
See src/outline/languages/python.ts for the reference implementation.
License
MIT (CristianoCiuti/reponova)