CodeWalker
An MCP server that indexes Python codebase structures to help AI assistants discover and reuse existing functions instead of duplicating code. It enables real-time searching of function metadata, duplicate detection, and structural analysis across multiple projects.
README
CodeWalker
Walk your codebase before writing new code.
CodeWalker is an MCP server that gives Claude Code real-time access to your Python codebase structure, enabling AI-assisted development that reuses existing code instead of duplicating it.
The Problem: AI Code Duplication
What Happens Without CodeWalker
When Claude Code writes code, it can't see what already exists in your codebase. This causes a cascade of problems:
Day 1: You ask Claude to add CSV loading functionality
# Claude creates: src/data_loader.py
def load_csv_file(path):
return pd.read_csv(path)
Day 5: Different feature needs CSV loading
# Claude creates: src/importer.py (Claude has no memory of data_loader.py)
def load_csv_data(filepath):
df = pd.read_csv(filepath)
return df
Day 10: Another feature, another duplicate
# Claude creates: src/utils.py (Claude still doesn't know about the others)
def read_csv(file_path):
return pd.read_csv(file_path, low_memory=False) # Now with different behavior!
Result after 2 weeks:
- š“ 7 different CSV loading functions across your codebase
- š“ Inconsistent behavior (one uses
low_memory=False, others don't) - š“ Impossible to maintain (bug fixes need to be applied 7 times)
- š“ Unpredictable behavior (which implementation gets called depends on imports)
- š“ Code review nightmare (reviewing duplicate implementations wastes time)
The Cost of Code Duplication
This isn't just messy - it's expensive:
| Impact | Cost |
|---|---|
| Development Time | 30-40% wasted rewriting existing code |
| Bug Fixes | Same bug appears in multiple places, fixed multiple times |
| Code Reviews | Reviewers waste time on duplicate implementations |
| Onboarding | New developers confused by inconsistent patterns |
| Technical Debt | Duplicates diverge over time, creating maintenance burden |
| Testing | Same logic tested multiple times (or worse, inconsistently) |
Real Example: A codebase with 800 functions had 52.7% duplication rate - 422 functions were duplicates. That's thousands of wasted lines of code.
How CodeWalker Solves This
CodeWalker indexes your codebase and lets Claude search before writing:
With CodeWalker
Day 1: You ask Claude to add CSV loading
Claude (internal): Let me check if CSV loading already exists...
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader. Let me use it instead of creating a new one."
Result:
# Claude imports existing function
from src.data_loader import load_csv_file
data = load_csv_file(path)
Day 5, 10, 15...: Same pattern - Claude finds and reuses existing code
Result after 2 weeks:
- ā 1 canonical CSV loading function (not 7)
- ā Consistent behavior across entire codebase
- ā Easy to maintain (fix bugs once, fixed everywhere)
- ā Predictable behavior (one implementation = one behavior)
- ā Fast code reviews (reviewers see reuse, not duplication)
Why This Problem Exists
LLMs Lack Architectural Awareness
Claude Code (and all LLMs) have a fundamental limitation:
ā Can't see your codebase structure ā Can't search across files ā Can't remember what exists ā Can't detect duplicates
The technical reason: When Claude writes code, it only sees:
- The current file you're editing
- Recent conversation context
- Maybe a few related files you showed it
What Claude DOESN'T see:
- That
load_csv_file()already exists insrc/data_loader.py - That 3 other files have similar functions
- That your team has a canonical implementation
- Your codebase architecture and patterns
Result: Claude invents new implementations instead of reusing existing ones.
The "10 Developers, 0 Communication" Problem
Working with AI without CodeWalker is like having 10 developers who never talk to each other:
Developer 1 (Monday): Creates load_csv_file()
Developer 2 (Tuesday): Doesn't know about it, creates load_csv_data()
Developer 3 (Wednesday): Doesn't know about either, creates read_csv()
Developer 4 (Thursday): Creates import_csv()
... and so on
Each "developer" (AI session) works in isolation, creating duplicates because they can't see what others did.
CodeWalker fixes this by giving AI a "shared memory" of your entire codebase.
Real-World Impact
Case Study: Elisity Project
Before CodeWalker:
- 800 total functions
- 422 duplicates (52.7% duplication rate)
- 33 direct
pd.read_csv()calls (should use centralized loader) - 11 duplicate
print_summary()implementations - 3 duplicate
load_flow_data()functions with diverging behavior
With CodeWalker:
- Claude finds existing implementations before writing new code
- Duplication rate drops to near-zero for new code
- Codebase becomes more maintainable over time
Time Saved:
- Development: 30-40% less time rewriting existing code
- Code Review: Reviewers focus on new logic, not duplicate detection
- Bug Fixes: Fix once instead of hunting down 3-7 duplicates
How It Works
Architecture
āāāāāāāāāāāāāāāāāāāāāāā
ā Your Codebase ā
ā (Python files) ā
āāāāāāāāāāāā¬āāāāāāāāāāā
ā
ā AST Parser extracts
ā function metadata
ā¼
āāāāāāāāāāāāāāāāāāāāāāā
ā SQLite Index ā
ā (functions.db) ā
ā ā
ā ⢠Function names ā
ā ⢠Parameters ā
ā ⢠Locations ā
ā ⢠Docstrings ā
āāāāāāāāāāāā¬āāāāāāāāāāā
ā
ā Claude queries via
ā MCP protocol
ā¼
āāāāāāāāāāāāāāāāāāāāāāā
ā Claude Code ā
ā ā
ā "Does load_csv ā
ā already exist?" ā
ā ā
ā ā Yes! Use it ā
āāāāāāāāāāāāāāāāāāāāāāā
What Gets Indexed
For each function in your codebase:
- Name -
load_csv_file - Location -
src/data_loader.py:42 - Parameters -
(path, encoding='utf-8') - Docstring - First line for quick understanding
- Type - Regular function, async function, or class method
- Decorators -
@staticmethod,@cached, etc.
What's NOT stored: Function bodies, comments, string literals (only structural metadata).
Search Performance
- Parsing: ~100-200 files/second
- Indexing: ~1000 functions/second
- Search: Sub-millisecond SQLite queries
- Database size: ~1 KB per function
Example: 800 functions = ~800 KB database, indexed in < 5 seconds, searched in < 1ms.
Features
š Search Before Writing
Tool: search_functions(query, exact=False)
Find existing functions before Claude writes new code:
> search_functions("load csv")
Found 3 functions:
⢠load_csv_file(path, encoding='utf-8')
Location: src/data_loader.py:42
Docs: Load CSV file with proper encoding handling
⢠FlowDataLoader.load_flows(flow_path, site_label)
Location: modules/flow_loader.py:98
Docs: Load flow data from CSV with site labeling
⢠read_raw_csv(filepath)
Location: legacy/importer.py:156
Docs: Legacy CSV reader (deprecated)
Claude sees these results and chooses to import the canonical implementation instead of creating a new one.
š Detect Duplicates
Tool: find_duplicates()
Find functions with the same name in multiple files:
> find_duplicates()
ā ļø Found 3 function names with multiple implementations:
**load_flow_data** (3 implementations):
- cohesion_analyzer.py:253
- legacy/community_detector.py:440
- policy_group_clustering.py:497
**format_bytes** (2 implementations):
- utils.py:88
- helpers.py:124
š” Recommendation: Consolidate into single canonical implementations.
Use this to audit your codebase and identify consolidation opportunities.
šÆ Similar Signatures
Tool: find_similar_signatures(min_params=2)
Find functions with the same parameters (might be doing the same thing):
> find_similar_signatures(min_params=2)
Found 2 signature groups:
**Signature: (data, output_path)** - 4 functions:
⢠save_to_csv in exporter.py:67
⢠write_csv_file in writer.py:134
⢠export_data in utils.py:203
⢠save_results in analyzer.py:445
š” These functions likely do the same thing with different names.
Catches semantic duplicates - functions that do the same thing but have different names.
š Multi-Project Support
Work on multiple projects without reconfiguring:
# One-time setup
> register_project("project-a", "/Users/jose/Projects/project-a")
> register_project("project-b", "/Users/jose/Projects/project-b")
# Daily use - auto-detects from your current directory
cd ~/Projects/project-a
> search_functions("auth")
[Auto-detected: project-a]
Found 5 functions...
cd ~/Projects/project-b
> search_functions("auth")
[Auto-detected: project-b]
Found 3 functions...
Features:
- ā Register unlimited projects
- ā Auto-detection from working directory
- ā Isolated indexes (no cross-contamination)
- ā Zero configuration switching
š Codebase Statistics
Tool: get_index_stats()
Understand your codebase at a glance:
> get_index_stats()
š CodeWalker Statistics:
Total Functions: 800
Total Files: 60
Unique Names: 765
Methods: 423
Async Functions: 67
Avg Parameters: 2.3
Duplication Rate: 4.4% (35 duplicates)
Last Indexed: 2026-03-18 10:35:00
Track duplication rate over time to measure improvement.
Quick Start
1. Install
git clone https://github.com/[username]/codewalker.git
cd codewalker
pip install -r requirements.txt
2. Configure Claude Code
Add to ~/.config/claude-code/mcp.json:
{
"mcpServers": {
"codewalker": {
"command": "python3",
"args": ["/absolute/path/to/codewalker/src/server.py"]
}
}
}
3. Register Your Projects
Restart Claude Code, then:
> register_project("my-project", "/absolute/path/to/your/project")
š Registering project: my-project
š Path: /absolute/path/to/your/project
ā³ Indexing project...
Found 800 functions
ā
Indexing complete!
Total Functions: 800
Total Files: 60
Unique Names: 765
4. Start Using
CodeWalker now automatically prevents duplicate code:
You: "Add functionality to load CSV files"
Claude (internal):
> search_functions("load csv")
Found: load_csv_file() in src/data_loader.py
Claude: "I found an existing CSV loader at src/data_loader.py:42.
Let me use that instead of creating a new one:
from src.data_loader import load_csv_file
data = load_csv_file(path)
Available Tools
Project Management
register_project(name, path)- Add a project to CodeWalkerlist_projects()- View all registered projectsunregister_project(name)- Remove a projectget_current_project()- Show which project is detected
Function Search
search_functions(query, exact)- Find functions by namefind_duplicates()- Detect duplicate function namesfind_similar_signatures(min_params)- Find functions with similar parametersget_file_functions(file_path)- List all functions in a fileget_index_stats()- View codebase statisticsreindex_repository()- Rebuild index after major changes
Use Cases
1. Prevent Duplication During Development
Before every implementation:
You: "Add user authentication"
Claude: Let me check if auth code already exists...
> search_functions("auth")
Found: authenticate_user() in src/auth.py
Claude: "I found existing auth code. Let me use it..."
2. Onboard to New Codebases
Explore unfamiliar code:
> search_functions("export")
Found 12 functions with "export" in the name
> get_file_functions("src/exporter.py")
Lists all 8 functions in the file with signatures and docs
Quickly understand what exists before writing new code.
3. Refactoring and Cleanup
Find consolidation opportunities:
> find_duplicates()
Found 15 duplicate function names
> find_similar_signatures()
Found 8 signature groups (functions with same params)
Systematically eliminate duplication.
4. Code Review
Reviewers can verify reuse:
Reviewer: "Why didn't you use the existing loader?"
Developer: "Let me check..."
> search_functions("load")
Found 3 loaders I didn't know about!
Catch missed reuse opportunities during review.
Comparison: With vs Without CodeWalker
| Scenario | Without CodeWalker | With CodeWalker |
|---|---|---|
| Add CSV loading | Creates 7th duplicate load_csv() |
Finds and reuses existing load_csv_file() |
| Authentication needed | Creates new auth from scratch | Imports existing authenticate_user() |
| Format bytes | Creates 3rd format_bytes() |
Uses canonical implementation |
| Code review | "Why is this duplicated?" | "Good reuse of existing code" |
| Bug in duplicates | Fix bug in 7 different places | Fix once, fixed everywhere |
| Onboarding | "Which loader should I use?" | Clear: one canonical implementation |
| Duplication rate | 40-60% (typical for AI projects) | < 5% (with CodeWalker) |
Graph Theory Connection
CodeWalker treats your codebase as a graph:
- Vertices - Functions, classes, modules
- Edges - Imports, function calls, dependencies
- Walking - Traversing the graph to discover existing code
Graph concepts:
- Graph walk - Sequence of vertices (functions) and edges (calls)
- Traversal - Systematic exploration of the graph structure
- Random walks - Discovery algorithms (like PageRank)
- Tree walks - AST traversal (what the parser does)
This isn't just a metaphor - CodeWalker literally walks your Abstract Syntax Tree (AST) to build the function graph.
Roadmap & Future Development
CodeWalker v2.0.0 solves the core AI code duplication problem for Python projects. Future versions will add deeper analysis, broader language support, and smarter automation.
š„ High Priority
Why these matter: These features provide immediate value for existing users and are most frequently requested.
-
[ ] Incremental indexing - Currently, reindexing rebuilds the entire database. Incremental indexing would only update changed files, making reindexing 10-100x faster for large codebases. Impact: Seconds instead of minutes for 10k+ function codebases.
-
[ ] Near-duplicate detection - Functions like
load_csv,load_csv_data, andread_csv_fileare semantically duplicates but have different names. Levenshtein distance matching would catch these "near-duplicates" that current exact/partial matching misses. Impact: Catch 20-30% more duplicates. -
[ ] Cross-project search - Search across all registered projects simultaneously. Useful for teams with shared utilities across multiple repos or monorepo users who want to find reusable code anywhere. Impact: Prevent reinventing wheels across project boundaries.
-
[ ] Call graph analysis - Track what calls what to enable "blast radius" analysis ("what breaks if I change this function?") and identify unused code. Impact: Safer refactoring, dead code detection.
šÆ Medium Priority
Why these matter: These features enhance CodeWalker's intelligence and reduce manual effort.
-
[ ] Semantic similarity (ML-based) - Detect functions that do the same thing with completely different names and signatures using embedding-based similarity. Example:
save_to_csv(data, path)andexport_results(df, filename)might be doing the same thing. Impact: Catch duplicates current signature matching misses. -
[ ] Auto-reindexing on file changes - Watch filesystem and automatically reindex when Python files change. No more manual
reindex_repository()calls. Impact: Zero-maintenance index that's always current. -
[ ] Multi-language support - Extend beyond Python to JavaScript, TypeScript, Go, Rust, Java. Same duplication prevention for polyglot codebases. Impact: Unified duplication prevention across entire stack.
-
[ ] Blast radius visualization - Show dependency trees and impact analysis when considering changes. "If I modify function X, these 15 functions are affected." Impact: Confident refactoring.
š” Lower Priority
Why these matter: Nice-to-have features that improve developer experience but aren't critical to core functionality.
-
[ ] Web UI - Visual interface for browsing functions, viewing call graphs, and exploring codebase structure in a browser. Alternative to CLI-only workflow. Impact: Better onboarding experience, visual learners benefit.
-
[ ] VS Code extension - Native VS Code integration with inline suggestions ("ā ļø Similar function exists: use
load_csv_file()instead"). Impact: Proactive duplicate prevention during typing. -
[ ] Import suggestions - When Claude is about to write new code, automatically suggest existing imports. "You're about to write X, but Y already exists - import it?" Impact: Even less manual searching.
-
[ ] GitHub Action - CI/CD integration that fails PRs introducing duplicates above a threshold. Enforce duplication standards via automation. Impact: Prevent duplicates from ever being merged.
š Current Capabilities
What works today:
Language Support:
- ā Python (full support for functions, methods, async functions, decorators)
- š§ JavaScript, TypeScript, Go, Rust (on roadmap)
Analysis:
- ā Function names, signatures, locations, docstrings
- ā Parameter matching and signature comparison
- ā Duplicate detection (exact name matches)
- š§ Call graph analysis (planned)
- š§ Semantic similarity (planned)
- š§ Near-duplicate detection via Levenshtein distance (planned)
Indexing:
- ā Full repository indexing (~5 seconds for 800 functions)
- ā Manual reindexing on demand
- š§ Incremental updates (only changed files - planned)
- š§ Auto-reindexing on file changes (planned)
Search:
- ā Exact and partial name matching
- ā Parameter signature matching
- ā Multi-project support with auto-detection
- š§ Semantic search by behavior (planned)
- š§ Cross-project search (planned)
FAQ
Q: Does this work with other AI assistants?
Yes! CodeWalker uses the Model Context Protocol (MCP), which is an open standard. Any AI tool that supports MCP can use CodeWalker:
- Claude Code (tested)
- Claude Desktop (should work)
- Other MCP-compatible tools
Q: How much overhead does indexing add?
Very little:
- Initial indexing: ~5 seconds for 800 functions
- Reindexing: ~5 seconds (full rebuild)
- Search queries: < 1ms
- Memory: ~10 MB for typical projects
You barely notice it's there.
Q: What if my codebase is huge?
CodeWalker scales well:
- Tested on 800 functions / 60 files
- Should handle 10,000+ functions easily (SQLite scales)
- For massive codebases (100k+ functions), consider:
- Incremental indexing (planned feature)
- Multiple project registrations (already supported)
- Excluding test files or generated code
Q: Can I use this on proprietary code?
Yes! Everything is local:
- ā Index stored locally (~/.codewalker)
- ā No data sent to external services
- ā No network requests during search
- ā Your code never leaves your machine
CodeWalker is 100% private.
Q: How is this different from IDE autocomplete?
Complementary, not competing:
IDE autocomplete:
- Works in single file
- Shows available imports
- Type-aware suggestions
- Real-time as you type
CodeWalker:
- Works across entire codebase
- Searches by semantic intent ("load csv")
- Finds duplicates proactively
- Used by AI during code generation
Use both - IDE for writing, CodeWalker for AI-assisted development.
Q: What about private/internal functions?
CodeWalker indexes everything:
- Public functions: ā Indexed
- Private functions (
_private): ā Indexed - Internal functions (
__internal): ā Indexed
Why? Because you might want to reuse private functions too. Claude respects Python conventions (won't use _private from other modules without good reason), but knowing they exist prevents duplication.
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
Areas we need help:
- Multi-language support (JavaScript, TypeScript, Go)
- Incremental indexing
- Semantic similarity detection
- Performance optimization
License
MIT License - see LICENSE for details.
Free to use in personal and commercial projects.
Credits
Built to solve a real problem: Claude Code was creating duplicate implementations across a 60-file, 800-function codebase. CodeWalker eliminated the duplication.
Inspired by: Pharaoh (commercial tool for codebase intelligence)
Built with: Claude Sonnet 4.5 (dogfooding - using AI to build tools that improve AI)
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See guides in this repository
Summary
Problem: AI assistants can't see your codebase, causing massive code duplication.
Solution: CodeWalker indexes your codebase and lets AI search before writing.
Result: 40-60% reduction in duplicate code, faster development, cleaner codebase.
Get Started:
pip install -r requirements.txt
# Configure MCP (see Quick Start above)
> register_project("my-project", "/path/to/project")
> search_functions("whatever you're about to write")
Stop duplicating code. Start walking your codebase. š
Recommended Servers
playwright-mcp
A Model Context Protocol server that enables LLMs to interact with web pages through structured accessibility snapshots without requiring vision models or screenshots.
Audiense Insights MCP Server
Enables interaction with Audiense Insights accounts via the Model Context Protocol, facilitating the extraction and analysis of marketing insights and audience data including demographics, behavior, and influencer engagement.
Magic Component Platform (MCP)
An AI-powered tool that generates modern UI components from natural language descriptions, integrating with popular IDEs to streamline UI development workflow.
VeyraX MCP
Single MCP tool to connect all your favorite tools: Gmail, Calendar and 40 more.
Kagi MCP Server
An MCP server that integrates Kagi search capabilities with Claude AI, enabling Claude to perform real-time web searches when answering questions that require up-to-date information.
graphlit-mcp-server
The Model Context Protocol (MCP) Server enables integration between MCP clients and the Graphlit service. Ingest anything from Slack to Gmail to podcast feeds, in addition to web crawling, into a Graphlit project - and then retrieve relevant contents from the MCP client.
E2B
Using MCP to run code via e2b.
Neon Database
MCP server for interacting with Neon Management API and databases
Exa Search
A Model Context Protocol (MCP) server lets AI assistants like Claude use the Exa AI Search API for web searches. This setup allows AI models to get real-time web information in a safe and controlled way.
Qdrant Server
This repository is an example of how to create a MCP server for Qdrant, a vector search engine.