# CodeSeeker
**Four-layer hybrid search and knowledge graph for AI coding assistants.**
BM25 + vector embeddings + RAPTOR directory summaries + graph expansion — fused into a single MCP tool that gives Claude, Copilot, and Cursor a real understanding of your codebase.
Works with **Claude Code**, **GitHub Copilot** (VS Code 1.99+), **Cursor**, **Windsurf**, and **Claude Desktop**.
Zero configuration — indexes on first use, stays in sync automatically.
## The Problem
AI assistants are powerful editors, but they navigate code like a tourist:
- **Grep finds text** — not meaning. `"find authentication logic"` returns every file containing the word "auth"
- **File reads are isolated** — Claude sees a file but not its dependencies, callers, or the patterns your team established
- **No memory of your project** — every session starts from scratch
CodeSeeker fixes this. It indexes your codebase once and gives AI assistants a queryable knowledge graph they can use on every turn.
## How It Works
A 4-stage pipeline runs on every query:
```
Query: "find JWT refresh token logic"
        │
        ▼  Stage 1: Hybrid retrieval
┌─────────────────────────────────────────────────────┐
│ BM25 (exact symbols, camelCase tokenized)           │
│     +                                               │
│ Vector search (384-dim Xenova embeddings)           │
│     ↓                                               │
│ Reciprocal Rank Fusion: score = Σ 1/(60 + rank_i)   │
│ Top-30 results, including RAPTOR directory nodes    │
└─────────────────────────────────────────────────────┘
        │
        ▼  Stage 2: RAPTOR cascade (conditional)
┌─────────────────────────────────────────────────────┐
│ IF best directory-summary score ≥ 0.5:              │
│   → narrow results to that directory automatically  │
│ ELSE: all 30 results pass through unchanged         │
│ Effect: "what does auth/ do?" scopes to auth/       │
│         "jwt.ts decode function" bypasses this      │
└─────────────────────────────────────────────────────┘
        │
        ▼  Stage 3: Scoring and deduplication
┌─────────────────────────────────────────────────────┐
│ Dedup: keep highest-score chunk per file            │
│ Source files: +0.10 (definition sites matter)       │
│ Test files:   −0.15 (prevent test dominance)        │
│ Symbol boost: +0.20 (query token in filename)       │
│ Multi-chunk:  up to +0.30 (file has many hits)      │
└─────────────────────────────────────────────────────┘
        │
        ▼  Stage 4: Graph expansion
┌─────────────────────────────────────────────────────┐
│ Top-10 results → follow IMPORTS/CALLS/EXTENDS edges │
│ Structural neighbors scored at source × 0.7         │
│ Avg graph connectivity: 20.8 edges/node             │
└─────────────────────────────────────────────────────┘
        │
        ▼
auth/jwt.ts (0.94), auth/refresh.ts (0.89), ...
```
The knowledge graph is built from AST-parsed imports at index time. It's what powers `analyze dependencies`, dead-code detection, and graph expansion in every search.
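As a rough illustration of the graph-expansion stage, the sketch below follows outgoing edges from the top-10 hits and scores each neighbor at 0.7 × the source score, as stated in the diagram. The `Hit` and `Edge` shapes, the function name, and the rule that an existing hit keeps its score when it is already higher are assumptions made for illustration, not CodeSeeker's actual implementation.

```typescript
// Hypothetical shapes; CodeSeeker's real types may differ.
type EdgeKind = "IMPORTS" | "CALLS" | "EXTENDS";
interface Edge { from: string; to: string; kind: EdgeKind }
interface Hit { file: string; score: number }

const EXPAND_TOP_N = 10;    // from the diagram: only the top-10 results are expanded
const NEIGHBOR_DECAY = 0.7; // from the diagram: neighbors scored at source × 0.7

function expandWithGraph(hits: Hit[], edges: Edge[]): Hit[] {
  const ranked = [...hits].sort((a, b) => b.score - a.score);
  const best = new Map<string, number>();
  for (const h of ranked) {
    best.set(h.file, Math.max(best.get(h.file) ?? -Infinity, h.score));
  }

  for (const source of ranked.slice(0, EXPAND_TOP_N)) {
    for (const e of edges) {
      if (e.from !== source.file) continue; // assumption: follow outgoing edges only
      const neighborScore = source.score * NEIGHBOR_DECAY;
      // Assumption: a file that is already a direct hit keeps the higher score.
      if (neighborScore > (best.get(e.to) ?? -Infinity)) best.set(e.to, neighborScore);
    }
  }

  return Array.from(best, ([file, score]) => ({ file, score }))
    .sort((a, b) => b.score - a.score);
}
```

With a single hit `auth/jwt.ts` at 0.9 that imports `auth/keys.ts`, the neighbor enters the result list at roughly 0.63.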
## What Makes It Different
| Approach | Strengths | Limitations |
|----------|-----------|-------------|
| **Grep / ripgrep** | Fast, universal | No semantic understanding |
| **Vector search only** | Finds similar code | Misses structural relationships |
| **Serena** | Precise LSP symbol navigation, 30+ languages | No semantic search, no cross-file reasoning |
| **Codanna** | Fast symbol lookup, good call graphs | Semantic search needs JSDoc — undocumented code gets no embeddings; no BM25, no RAPTOR, Windows experimental |
| **CodeSeeker** | BM25 + embedding fusion + RAPTOR + graph + coding standards + multi-language AST | Requires initial indexing (30s–5min) |
**What LSP tools can't do:**
- *"Find code that handles errors like this"* → semantic pattern search
- *"What validation approach does this project use?"* → auto-detected coding standards
- *"Show me everything related to authentication"* → graph traversal across indirect dependencies
**What vector-only search misses:**
- Direct import/export chains
- Class inheritance hierarchies
- Which files actually depend on which
## Installation
### Recommended: npx (no install needed)
The standard way to configure any MCP server — no global install required:
```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```
Add this to your MCP config file ([see below](#advanced-installation-options) for per-client locations) and restart your editor.
### npm global install
```bash
npm install -g codeseeker
codeseeker install --vscode # or --cursor, --windsurf
```
### 🔌 Claude Code Plugin
For Claude Code CLI users — adds auto-sync hooks and slash commands:
```bash
/plugin install codeseeker@github:jghiringhelli/codeseeker#plugin
```
Slash commands: `/codeseeker:init`, `/codeseeker:reindex`
### ☁️ Devcontainers / GitHub Codespaces
```json
{
  "name": "My Project",
  "image": "mcr.microsoft.com/devcontainers/javascript-node:18",
  "postCreateCommand": "npm install -g codeseeker && codeseeker install --vscode"
}
```
### ✅ Verify
Ask your AI assistant: *"What CodeSeeker tools do you have?"*
You should see: `search`, `analyze`, `index` — CodeSeeker's three tools.
## Advanced Installation Options
<details>
<summary><b>📋 MCP Configuration by client</b></summary>
The MCP config JSON is the same for all clients — only the file location differs:
| Client | Config file |
|--------|------------|
| **VS Code** (Claude Code / Copilot) | `.vscode/mcp.json` in your project, or `~/.vscode/mcp.json` globally |
| **Cursor** | `.cursor/mcp.json` in your project |
| **Claude Desktop** | `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows) |
| **Windsurf** | `.windsurf/mcp.json` in your project |
```json
{
  "mcpServers": {
    "codeseeker": {
      "command": "npx",
      "args": ["-y", "codeseeker", "serve", "--mcp"]
    }
  }
}
```
</details>
<details>
<summary><b>🖥️ CLI Standalone Usage</b> (without AI assistant)</summary>
```bash
npm install -g codeseeker
cd your-project
codeseeker init
codeseeker -c "how does authentication work in this project?"
```
</details>
## What You Get
Once configured, Claude has access to these MCP tools (used automatically):
| Tool | Actions / Usage | What It Does |
|------|-----------------|-------------|
| `search` | `{query}` | Hybrid search: vector + BM25 text + path-match, fused with RRF; RAPTOR directory summaries surface for abstract queries |
| `search` | `{query, search_type: "graph"}` | Hybrid search **+ Graph RAG** — follows import/call/extends edges to surface structurally connected files |
| `search` | `{query, search_type: "vector"}` | Pure embedding cosine-similarity search (no BM25 or path scoring) |
| `search` | `{query, search_type: "fts"}` | Pure BM25 text search with camelCase tokenization and synonym expansion |
| `search` | `{query, read: true}` | Search + read file contents in one step |
| `search` | `{filepath}` | Read a file with its related code automatically included |
| `analyze` | `{action: "dependencies", filepath}` | Traverse the knowledge graph (imports, calls, extends) |
| `analyze` | `{action: "standards"}` | Your project's detected patterns (validation, error handling) |
| `analyze` | `{action: "duplicates"}` | Find duplicate/similar code blocks across your codebase |
| `analyze` | `{action: "dead_code"}` | Detect unused exports, functions, and classes |
| `index` | `{action: "init", path}` | Manually trigger indexing (rarely needed) |
| `index` | `{action: "sync", changes}` | Update index for specific files |
| `index` | `{action: "exclude", paths}` | Dynamically exclude/include files from the index |
| `index` | `{action: "status"}` | List indexed projects with file/chunk counts |
**You don't invoke these manually**—Claude uses them automatically when searching code or analyzing relationships.
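For readers who want to see what such a call looks like on the wire, here is a hedged sketch of the argument shapes from the table, wrapped in a standard MCP `tools/call` JSON-RPC request. The type names and the `buildToolCall` helper are hypothetical, and the real tool schema may accept more fields or other combinations than the table shows.

```typescript
// Argument shapes transcribed from the tool table; type names are hypothetical.
type SearchArgs =
  | { query: string; search_type?: "graph" | "vector" | "fts"; read?: boolean }
  | { filepath: string };

type AnalyzeArgs =
  | { action: "dependencies"; filepath: string }
  | { action: "standards" | "duplicates" | "dead_code" };

type IndexArgs =
  | { action: "init"; path: string }
  | { action: "sync"; changes: unknown } // the README does not specify the shape of `changes`
  // `project` and `reason` appear in the README's exclusion example.
  | { action: "exclude"; paths: string[]; project?: string; reason?: string }
  | { action: "status" };

interface ToolArgs {
  search: SearchArgs;
  analyze: AnalyzeArgs;
  index: IndexArgs;
}

// Builds an MCP `tools/call` JSON-RPC 2.0 request for one of the three tools.
function buildToolCall<K extends keyof ToolArgs>(id: number, name: K, args: ToolArgs[K]) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name, arguments: args },
  };
}

// Example: hybrid search with graph expansion.
const graphSearchRequest = buildToolCall(1, "search", {
  query: "JWT refresh token logic",
  search_type: "graph",
});
```

In normal use the AI client builds these requests itself; the sketch is only meant to make the table's argument column concrete.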
## How Indexing Works
**You don't need to manually index.** When Claude uses any CodeSeeker tool, the tool automatically checks if the project is indexed. If not, it indexes on first use.
```
User: "Find the authentication logic"
          │
          ▼
┌─────────────────────────────────────┐
│ Claude calls search({query: ...})   │
│         │                           │
│         ▼                           │
│ Project indexed? ──No──► Index now  │
│         │                 (auto)    │
│        Yes                   │      │
│         │◀───────────────────┘      │
│         ▼                           │
│ Return search results               │
└─────────────────────────────────────┘
```
First search on a new project takes 30 seconds to several minutes (depending on size). Subsequent searches are instant.
---
## Search Quality Research
<details>
<summary><b>📊 Component ablation study (v2.0.0)</b> — measured impact of each retrieval layer</summary>
### Setup
18 hand-labelled queries across two real-world codebases:
| Corpus | Language | Files | Queries | Query types |
|--------|----------|-------|---------|-------------|
| [Conclave](https://github.com/jghiringhelli/conclave) | TypeScript (pnpm monorepo) | 201 | 10 | Symbol lookup, cross-file chains, out-of-scope |
| [ImperialCommander2](https://github.com/jonwill8/ImperialCommander2) | C# / Unity | 199 | 8 | Class lookup, controller wiring, file I/O |
Each query has one or more `mustFind` targets (exact file basenames) and optional `mustNotFind` targets (scope leak check). Queries were run on a real index built from source — real Xenova embeddings, real graph, real RAPTOR L2 nodes — to reflect production conditions.
Metrics: **MRR** (Mean Reciprocal Rank), **P@1** and **P@3** (Precision at 1 and 3), **R@5** (Recall at 5), **F1@3**.
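For reference, the textbook definitions of three of these metrics can be sketched as follows. The `QueryRun` shape and function names are hypothetical, and `scripts/real-bench.js` may compute the numbers differently (for example in how it treats multiple `mustFind` targets).

```typescript
interface QueryRun {
  ranked: string[];   // result file basenames, best first
  mustFind: string[]; // labelled targets for this query
}

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// 1 / rank of the first relevant result, or 0 if none was returned.
function reciprocalRank(run: QueryRun): number {
  const idx = run.ranked.findIndex((f) => run.mustFind.includes(f));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Fraction of the top-k results that are relevant.
function precisionAtK(run: QueryRun, k: number): number {
  const hits = run.ranked.slice(0, k).filter((f) => run.mustFind.includes(f)).length;
  return hits / k;
}

// Fraction of the labelled targets that appear in the top-k results.
function recallAtK(run: QueryRun, k: number): number {
  if (run.mustFind.length === 0) return 0;
  const top = run.ranked.slice(0, k);
  return run.mustFind.filter((t) => top.includes(t)).length / run.mustFind.length;
}

function summarize(runs: QueryRun[]) {
  return {
    mrr: mean(runs.map((r) => reciprocalRank(r))),
    pAt1: mean(runs.map((r) => precisionAtK(r, 1))),
    rAt5: mean(runs.map((r) => recallAtK(r, 5))),
  };
}
```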
### Ablation results
| Configuration | MRR | P@1 | P@3 | R@5 | F1@3 | Notes |
|--------------|-----|-----|-----|-----|------|-------|
| **Hybrid baseline** (BM25 + embed + RAPTOR, no graph) | **75.2%** | 61.1% | 29.6% | 91.7% | 44.4% | Production default |
| + graph 1-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | ±0% ranking, adds structural neighbors |
| + graph 2-hop | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | Scope leaks on unrelated queries |
| No RAPTOR (graph 1-hop) | 74.9% | 61.1% | 29.6% | 91.7% | 44.4% | RAPTOR contributes +0.3% |
### What each layer actually does
**BM25 + embedding fusion (RRF)**
The workhorse. Handles ~94% of ranking quality on its own. BM25 catches exact symbol names and camelCase tokens; vector embeddings catch semantic similarity when names differ. Fused with Reciprocal Rank Fusion to combine both signals without manual weight tuning.
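A minimal sketch of Reciprocal Rank Fusion with the constant 60 quoted in the pipeline diagram. It assumes 1-based ranks and plain ranked lists of file names; the function name and input shape are illustrative rather than CodeSeeker's own.

```typescript
// Reciprocal Rank Fusion: score(id) = Σ over rankings of 1 / (k + rank).
// Assumption: ranks are 1-based; k = 60 as in the pipeline diagram.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Example: one BM25 ranking and one vector ranking for the same query.
const bm25Ranking = ["jwt.ts", "refresh.ts", "login.ts"];
const vectorRanking = ["refresh.ts", "session.ts", "jwt.ts"];
const fusedRanking = reciprocalRankFusion([bm25Ranking, vectorRanking]);
```

Because only ranks enter the formula, the BM25 and cosine scores never need to be normalized against each other, which is why no manual weight tuning is required.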
**RAPTOR (hierarchical directory summaries)**
Generates per-directory embedding nodes by mean-pooling all file embeddings in a folder. Acts as a post-filter: when a directory summary scores ≥ 0.5 against the query, results are narrowed to that directory's files. Measured contribution: **+0.3% MRR** on symbol queries. Fires conservatively — only when the directory is an obvious match. Its real value is on _abstract queries_ ("what does the payments module do?") which don't appear in this benchmark; for those queries it prevents broad scattering across the entire codebase.
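The mean-pooling and threshold behaviour described above might look roughly like this. The sketch assumes the directory score is the cosine similarity between the query embedding and the pooled directory embedding, and that directory membership can be tested by path prefix; both are assumptions, as are all names used here.

```typescript
interface DirectoryNode {
  path: string;               // e.g. "auth/"
  fileEmbeddings: number[][]; // one embedding per file in the directory
}

// Mean-pool file embeddings into a single directory embedding.
function meanPool(vectors: number[][]): number[] {
  const dim = vectors[0].length;
  const pooled: number[] = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) pooled[i] += v[i] / vectors.length;
  }
  return pooled;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Narrow hits to the best-matching directory only when it clears the threshold.
function narrowToBestDirectory(
  queryEmbedding: number[],
  directories: DirectoryNode[],
  hits: string[],
  threshold = 0.5, // from the README: directory summary must score ≥ 0.5
): string[] {
  let bestPath = "";
  let bestScore = -Infinity;
  for (const dir of directories) {
    const score = cosine(queryEmbedding, meanPool(dir.fileEmbeddings));
    if (score > bestScore) {
      bestScore = score;
      bestPath = dir.path;
    }
  }
  if (bestScore < threshold) return hits; // pass through unchanged
  return hits.filter((file) => file.startsWith(bestPath));
}
```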
**Knowledge graph (import/dependency edges)**
Average connectivity: 20.8 file→file edges per node across both TS and C# codebases. Measured ranking impact: **±0% MRR** for 1-hop expansion. The graph doesn't move MRR because the semantic layer already finds the right files — the graph's neighbors are usually already in the top-15. Its value is structural: the `analyze dependencies` action and explicit `graph` search type give Claude traversable import chains, inheritance hierarchies, and dependency paths that embeddings alone cannot provide.
**Type boost / penalty scoring**
Source files get +0.10 score boost; test files get −0.15 penalty; lock files and docs get −0.05 penalty. Without this, `integration.test.ts` would rank above `dag-engine.ts` for exact symbol queries because test files import and exercise every symbol in the source. The penalty corrects this without eliminating test files from results.
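The adjustment and per-file deduplication can be sketched as below. The boost and penalty values come from the text above; the file-classification regexes, the rule that anything else counts as a source file, and the function names are assumptions. The symbol and multi-chunk boosts from the pipeline diagram are left out for brevity.

```typescript
interface ScoredChunk { file: string; score: number }

// Hypothetical classification heuristics; the real rules are not given in this README.
const isTestFile = (path: string): boolean =>
  /\.(test|spec)\.[a-z]+$/i.test(path) || path.includes("__tests__/");
const isLockOrDoc = (path: string): boolean =>
  /(\.lock|-lock\.json|\.md)$/i.test(path);

function adjustScore(chunk: ScoredChunk): number {
  if (isTestFile(chunk.file)) return chunk.score - 0.15;  // test-file penalty
  if (isLockOrDoc(chunk.file)) return chunk.score - 0.05; // lock files and docs
  return chunk.score + 0.10;                              // assumed source file
}

// Keep the highest adjusted score per file, then sort best first.
function rankFiles(chunks: ScoredChunk[]): ScoredChunk[] {
  const bestPerFile = new Map<string, number>();
  for (const chunk of chunks) {
    const adjusted = adjustScore(chunk);
    const previous = bestPerFile.get(chunk.file);
    if (previous === undefined || adjusted > previous) bestPerFile.set(chunk.file, adjusted);
  }
  return Array.from(bestPerFile, ([file, score]) => ({ file, score }))
    .sort((a, b) => b.score - a.score);
}
```

With raw scores of 0.80 for `integration.test.ts` and 0.70 for `dag-engine.ts`, the adjusted scores are about 0.65 and 0.80, so the source file moves to the top while the test file stays in the results.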
**Monorepo directory exclusion fix**
The single highest-impact change in v1.12.0: removing `packages/` from the default exclusion list. For pnpm/yarn/lerna monorepos where all source lives under `packages/`, this exclusion was silently dropping all source files. Effect: **10% → 72% MRR** on the Conclave monorepo benchmark.
### Known limitations
| Query | Target | Issue | Root cause |
|-------|--------|-------|-----------|
| `cv-prompts` | `orchestrator.ts` | rank 97+ even with 2-hop graph | `prompt-builder.test.ts` outscores `prompt-builder.ts` semantically; source file never enters top-10, so we can't graph-walk from it to `orchestrator.ts`. Test-file dominance on cross-file queries. |
| `cv-exec-mode` | `types.ts` | rank 11–12 | `types.ts` is a pure type-export file; low keyword density. Found within R@5 (rank ≤ 15). |
### Benchmark script
Reproduce with:
```bash
npm run build
node scripts/real-bench.js
```
Requires `C:\workspace\claude\conclave` and `C:\workspace\ImperialCommander2` to be present locally (or update paths in `scripts/real-bench.js`).
</details>
## Auto-Detected Coding Standards
CodeSeeker analyzes your codebase and extracts patterns:
```json
{
  "validation": {
    "email": {
      "preferred": "z.string().email()",
      "usage_count": 12,
      "files": ["src/auth.ts", "src/user.ts"]
    }
  },
  "react-patterns": {
    "state": {
      "preferred": "useState<T>()",
      "usage_count": 45
    }
  }
}
```
Detected pattern categories:
- **validation**: Zod, Yup, Joi, validator.js, custom regex
- **error-handling**: API error responses, try-catch patterns, custom Error classes
- **logging**: Console, Winston, Bunyan, structured logging
- **testing**: Jest/Vitest setup, assertion patterns
- **react-patterns**: Hooks (useState, useEffect, useMemo, useCallback, useRef)
- **state-management**: Redux Toolkit, Zustand, React Context, TanStack Query
- **api-patterns**: Fetch, Axios, Express routes, Next.js API routes
When Claude writes new code, it follows your existing conventions instead of inventing new ones.
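One plausible way to arrive at a `preferred` entry like the one in the JSON above is to count occurrences per pattern and keep the most frequent. The README does not say how CodeSeeker actually selects the preferred pattern, so treat the most-used-wins rule, the `PatternUse` shape, and the function name as assumptions.

```typescript
interface PatternUse { pattern: string; file: string }
interface PatternSummary { preferred: string; usage_count: number; files: string[] }

// Assumption: the preferred pattern is simply the most frequently used one.
function summarizePatterns(uses: PatternUse[]): PatternSummary | undefined {
  const filesByPattern = new Map<string, string[]>();
  for (const use of uses) {
    const files = filesByPattern.get(use.pattern) ?? [];
    files.push(use.file);
    filesByPattern.set(use.pattern, files);
  }

  let best: PatternSummary | undefined;
  for (const [pattern, files] of Array.from(filesByPattern.entries())) {
    if (best === undefined || files.length > best.usage_count) {
      best = {
        preferred: pattern,
        usage_count: files.length,
        files: Array.from(new Set(files)).sort(), // unique file list
      };
    }
  }
  return best;
}
```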
## Managing Index Exclusions
If Claude notices files that shouldn't be indexed (like Unity's Library folder, build outputs, or generated files), it can dynamically exclude them:
```
// Exclude Unity Library folder and generated files
index({
  action: "exclude",
  project: "my-unity-game",
  paths: ["Library/**", "Temp/**", "*.generated.cs"],
  reason: "Unity build artifacts"
})
```
Exclusions are persisted in `.codeseeker/exclusions.json` and automatically respected during reindexing.
## Code Cleanup Tools
CodeSeeker helps you maintain a clean codebase by finding duplicate code and detecting dead code.
### Finding Duplicate Code
Ask Claude to find similar code blocks that could be consolidated:
```
"Find duplicate code in my project"
"Are there any similar functions that could be merged?"
"Show me copy-pasted code that should be refactored"
```
CodeSeeker uses vector similarity to find semantically similar code—not just exact matches. It detects:
- Copy-pasted functions with minor variations
- Similar validation logic across files
- Repeated patterns that could be extracted into utilities
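The idea behind similarity-based duplicate detection can be sketched as a pairwise cosine comparison over chunk embeddings. The 0.9 threshold, the `CodeBlock` shape, and the brute-force O(n²) loop are illustrative assumptions; the README does not describe CodeSeeker's actual threshold or search strategy.

```typescript
interface CodeBlock { id: string; embedding: number[] }
interface SimilarPair { a: string; b: string; similarity: number }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Compare every pair once and keep those at or above the (hypothetical) threshold.
function findSimilarPairs(blocks: CodeBlock[], threshold = 0.9): SimilarPair[] {
  const pairs: SimilarPair[] = [];
  for (let i = 0; i < blocks.length; i++) {
    for (let j = i + 1; j < blocks.length; j++) {
      const similarity = cosineSimilarity(blocks[i].embedding, blocks[j].embedding);
      if (similarity >= threshold) {
        pairs.push({ a: blocks[i].id, b: blocks[j].id, similarity });
      }
    }
  }
  return pairs.sort((x, y) => y.similarity - x.similarity);
}
```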
### Finding Dead Code
Ask Claude to identify unused code that can be safely removed:
```
"Find dead code in this project"
"What functions are never called?"
"Show me unused exports"
```
CodeSeeker analyzes the knowledge graph to find:
- Exported functions/classes that are never imported
- Internal functions with no callers
- Orphaned files with no incoming dependencies
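The orphaned-file case reduces to a simple graph query: any file with no incoming import edge that is not a known entry point. The sketch below assumes a flat edge list and an explicit entry-point list, both hypothetical; real dead-code analysis also has to account for dynamic imports and framework conventions.

```typescript
interface ImportEdge { from: string; to: string }

// Files that nothing imports and that are not declared entry points.
function findOrphanFiles(
  files: string[],
  edges: ImportEdge[],
  entryPoints: string[] = [], // hypothetical: e.g. the package's main file
): string[] {
  const imported = new Set(
    edges.filter((e) => e.from !== e.to).map((e) => e.to), // ignore self-references
  );
  const entries = new Set(entryPoints);
  return files.filter((file) => !imported.has(file) && !entries.has(file));
}
```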
**Example workflow:**
```
User: "Use CodeSeeker to clean up this project"
Claude: I'll analyze your codebase for cleanup opportunities.
Found 3 duplicate code blocks:
- validateEmail() in auth.ts and user.ts (92% similar)
- formatDate() appears in 4 files with minor variations
- Error handling pattern repeated in api/*.ts
Found 2 dead code files:
- src/utils/legacy-helper.ts (0 imports)
- src/services/unused-service.ts (exported but never imported)
Would you like me to:
1. Consolidate the duplicate validators into a shared utility?
2. Remove the dead code files?
```
## Language Support
| Language | Parser | Relationship Extraction |
|----------|--------|------------------------|
| TypeScript/JavaScript | Babel AST | Excellent |
| Python | Tree-sitter | Excellent |
| Java | Tree-sitter | Excellent |
| C# | Regex | Good |
| Go | Regex | Good |
| Rust, C/C++, Ruby, PHP | Regex | Basic |
Tree-sitter parsers install automatically when needed.
## Keeping the Index in Sync
### With Claude Code Plugin
The plugin installs **hooks** that automatically update the index:
| Event | What Happens |
|-------|--------------|
| Claude edits a file | Index updated automatically |
| Claude runs `git pull/checkout/merge` | Full reindex triggered |
| You run `/codeseeker:reindex` | Manual full reindex |
**You don't need to do anything**—the plugin handles sync automatically.
### With MCP Server Only (Cursor, Claude Desktop)
- **Claude-initiated changes**: Claude can call the `index({action: "sync"})` tool
- **Manual changes**: Not automatically detected—ask Claude to reindex periodically
### Sync Summary
| Setup | Claude Edits | Git Operations | Manual Edits |
|-------|--------------|----------------|--------------|
| **Plugin** (Claude Code) | Auto | Auto | Manual |
| **MCP** (Cursor, Desktop) | Ask Claude | Ask Claude | Ask Claude |
| **CLI** | Auto | Auto | Manual |
## When CodeSeeker Helps Most
**Good fit:**
- Large codebases (10K+ files) where Claude struggles to find relevant code
- Projects with established patterns you want Claude to follow
- Complex dependency chains across multiple files
- Teams wanting consistent AI-generated code
**Less useful:**
- Greenfield projects with little existing code
- Single-file scripts
- Projects where you're actively changing architecture
## Architecture
```
┌──────────────────────────────────────────────────────────┐
│                    Claude Code                           │
│                         │                                │
│                   MCP Protocol                           │
│                         │                                │
│  ┌──────────────────────▼──────────────────────────┐     │
│  │              CodeSeeker MCP Server              │     │
│  │  ┌─────────────┬─────────────┬────────────────┐ │     │
│  │  │   Vector    │  Knowledge  │     Coding     │ │     │
│  │  │   Search    │    Graph    │   Standards    │ │     │
│  │  │  (SQLite)   │  (SQLite)   │     (JSON)     │ │     │
│  │  └─────────────┴─────────────┴────────────────┘ │     │
│  └─────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
```
All data stored locally in `.codeseeker/`. No external services required.
For large teams (100K+ files, shared indexes), server mode supports PostgreSQL + Neo4j. See [Storage Documentation](docs/technical/storage.md).
For the complete technical internals — exact scoring formulas, MCP tool schema, graph edge types, RAPTOR threshold logic, pipeline stages, analysis confidence tiers — see the **[Technical Architecture Manual](docs/technical/architecture.md)**.
## Troubleshooting
### MCP server not connecting
1. Verify npm and npx work: `npx -y codeseeker --version`
2. Check MCP config file syntax (valid JSON, no trailing commas)
3. Restart your editor/Claude application completely
4. Check that Node.js is installed: `node --version` (v18 or later is required)
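For step 2, a quick structural check of the config text can catch the most common mistakes. This is a hypothetical helper, not part of CodeSeeker; it relies only on `JSON.parse`, which rejects trailing commas, and on the `mcpServers.codeseeker` layout shown in the installation section.

```typescript
// Returns a list of problems found in an MCP config file's text (empty if none).
function checkMcpConfig(text: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text); // strict JSON: trailing commas and comments fail here
  } catch (err) {
    return [`invalid JSON: ${err instanceof Error ? err.message : String(err)}`];
  }

  type ServerEntry = { command?: unknown; args?: unknown };
  const servers = (parsed as { mcpServers?: Record<string, ServerEntry> } | null)?.mcpServers;
  if (typeof servers !== "object" || servers === null) {
    return ['missing "mcpServers" object'];
  }

  const entry = servers["codeseeker"];
  if (!entry) return ['no "codeseeker" entry under "mcpServers"'];

  const problems: string[] = [];
  if (typeof entry.command !== "string") problems.push('"command" should be a string');
  if (!Array.isArray(entry.args)) problems.push('"args" should be an array');
  return problems;
}
```

Passing the contents of your config file to `checkMcpConfig` returns an empty array when the structure matches the example and a list of messages otherwise.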
### Indexing seems slow
First-time indexing of large projects (50K+ files) can take 5+ minutes. Subsequent uses are instant.
### Tools not appearing in Claude
1. Ask Claude: *"What CodeSeeker tools do you have?"*
2. If no tools appear, check that the MCP config file exists and is valid JSON
3. Restart your IDE completely (not just a window reload)
4. Check the Claude/Copilot MCP connection status in your IDE
### Still stuck?
Open an issue: [GitHub Issues](https://github.com/jghiringhelli/codeseeker/issues)
## Documentation
- [Integration Guide](docs/INTEGRATION.md) - How all components connect
- [Architecture](docs/technical/architecture.md) - Technical deep dive
- [CLI Commands](docs/install/cli_commands_manual.md) - Full command reference
## Supported Platforms
| Client | MCP Support | Config |
|--------|-------------|--------|
| **Claude Code** (VS Code) | ✅ | `.vscode/mcp.json` or plugin |
| **GitHub Copilot** (VS Code 1.99+) | ✅ | `.vscode/mcp.json` |
| **Cursor** | ✅ | `.cursor/mcp.json` |
| **Windsurf** | ✅ | `.windsurf/mcp.json` |
| **Claude Desktop** | ✅ | `claude_desktop_config.json` |
| **Visual Studio** | ✅ | `codeseeker install --vs` |
> Claude Code and GitHub Copilot share the same `.vscode/mcp.json` — configure once, works for both.
## Support
If CodeSeeker is useful to you, consider [sponsoring the project](https://github.com/sponsors/jghiringhelli).
## License
MIT License. See [LICENSE](LICENSE).
---
*CodeSeeker gives Claude the code understanding that grep and embeddings alone can't provide.*