Problem Statement
When an AI agent (or human) encounters an unfamiliar codebase, the first question is always: "What's in this repo?" Currently, there's no way to get a structural overview from probe without searching for something specific. The agent has to either:
- Guess search terms blindly
- Use the NPM listFiles tool (flat, one-level directory listing, no code structure)
- Run find or ls -R externally (no semantic information)
Other tools solve this: ABCoder's get_repo_structure / get_package_structure, Stakgraph's repo_map MCP tool, and Octocode's view command. Probe needs its own version that fits its zero-setup, instant, AST-aware philosophy.
Proposed Solution
A new probe map CLI command that returns a hierarchical view of a codebase with top-level symbol signatures, using the existing tree-sitter infrastructure. No indexing, no setup -- same instant behavior as probe search.
Existing Building Blocks
The codebase already has most of the pieces:
- extract_all_symbols_from_file() in src/extract/processor.rs:884 -- DEAD CODE that already:
  - Parses a file with tree-sitter
  - Iterates root-level children
  - Filters by is_acceptable_parent()
  - Calls get_symbol_signature() for each symbol
  - Returns Vec<SearchResult> with symbol_signature populated
  - Just needs to be exposed and extended
- get_symbol_signature() implemented for 8 languages in src/language/:
  - Rust (rust.rs:142): functions, structs, impls, traits, enums, consts, statics, types, macros
  - TypeScript (typescript.rs:106)
  - JavaScript (javascript.rs:97)
  - Python (python.rs:64)
  - Go (go.rs:151)
  - YAML (yaml.rs:191)
  - Markdown (markdown.rs:178)
  - HTML (html.rs:329)
- file_list_cache in src/search/file_list_cache.rs -- cached .gitignore-aware file listing with language filtering
- Token counting via tiktoken (src/search/search_tokens.rs:333) -- can enforce --max-tokens on map output
- ParentContext model in src/models.rs:25 -- already represents scope hierarchy
CLI Interface
# Basic: map the current directory
probe map .
# Map a specific subdirectory
probe map ./src/search
# Filter by language
probe map ./src --language rust
# Control depth: how many directory levels deep
probe map ./src --depth 2
# Control detail level
probe map ./src --detail signatures # default: symbol signatures
probe map ./src --detail files # files only, no symbols
probe map ./src --detail full # signatures + first doc comment line
# Token budget (critical for AI agents)
probe map ./src --max-tokens 4000
# Output formats (reuse existing infrastructure)
probe map ./src --format outline # default
probe map ./src --format json
probe map ./src --format xml
# Ignore patterns (reuse existing --ignore flag)
probe map ./src --ignore "test*" --ignore "*.generated.*"
# Exclude test files (reuse existing --allow-tests flag, inverted default)
probe map ./src # excludes test files by default
probe map ./src --allow-tests # includes test files
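The exact meaning of --depth is not pinned down above. A minimal sketch of one possible interpretation, assuming a file directly inside the mapped root counts as level 1; the helper names level_below and within_depth are hypothetical and not part of probe:

```rust
use std::path::Path;

/// Level of `file` relative to `root`: a file directly inside `root` is level 1.
/// Returns None when `file` is not located under `root`.
fn level_below(root: &Path, file: &Path) -> Option<usize> {
    let relative = file.strip_prefix(root).ok()?;
    Some(relative.components().count())
}

/// Hypothetical filter for `--depth`; `None` means no limit (assumed default).
fn within_depth(root: &Path, file: &Path, max_depth: Option<usize>) -> bool {
    match (level_below(root, file), max_depth) {
        (None, _) => false,              // outside the mapped root
        (Some(_), None) => true,         // unlimited depth
        (Some(level), Some(max)) => level <= max,
    }
}
```

Under this reading, probe map ./src --depth 1 would list src/cli.rs but not src/search/cache.rs. Whether directories beyond the limit are still shown as collapsed entries is left open here.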
Expected Output Formats
Outline Format (default) -- --detail signatures
src/
  search/
    search_runner.rs
      pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>
      pub fn search_with_structured_patterns(...) -> Result<HashMap<PathBuf, ...>>
      fn process_file_with_results(...) -> Result<Vec<SearchResult>>
    result_ranking.rs
      pub fn rank_search_results(results: &mut Vec<SearchResult>, ...) -> Result<()>
    search_output.rs
      pub fn format_and_print_search_results(results: &LimitedSearchResults, ...) -> Result<String>
      pub fn collect_parent_context_for_line(...) -> Vec<ParentContext>
      fn format_and_print_outline_results(...) -> Result<String>
    elastic_query.rs
      pub enum Expr
      pub fn parse_query(query: &str) -> Result<Expr>
    block_merging.rs
      pub fn merge_ranked_blocks(results: Vec<SearchResult>, ...) -> Vec<SearchResult>
    cache.rs
      pub struct SearchCache
      pub fn new(session_id: &str) -> Self
    filters.rs
      pub struct SearchFilters
    tokenization.rs
      pub fn tokenize(text: &str) -> Vec<String>
  language/
    language_trait.rs
      pub trait LanguageImpl
    factory.rs
      pub fn get_language_for_file(path: &Path) -> Option<Box<dyn LanguageImpl>>
    rust.rs
      pub struct RustLanguage
    parser_pool.rs
      pub struct ParserPool
    tree_cache.rs
      pub struct TreeCache
  models.rs
    pub struct SearchResult
    pub struct ParentContext
    pub struct CodeBlock
    pub struct LimitedSearchResults
    pub struct SearchLimits
  ranking.rs
    pub fn rank_documents(query: &Expr, ...) -> Vec<(usize, f64)>
    pub struct QueryTokenMap
  extract/
    mod.rs
      pub fn handle_extract(options: ExtractOptions) -> Result<()>
    processor.rs
      pub fn process_file_for_extraction(...) -> Result<Vec<SearchResult>>
    symbol_finder.rs
      pub fn find_all_symbols_in_file(...) -> Result<Vec<SymbolMatch>>
  cli.rs
    pub enum Commands
    pub struct SearchArgs
    pub struct ExtractArgs
Outline Format -- --detail files
src/
  search/
    search_runner.rs (2,145 lines)
    result_ranking.rs (487 lines)
    search_output.rs (2,680 lines)
    elastic_query.rs (356 lines)
    block_merging.rs (290 lines)
    cache.rs (185 lines)
    filters.rs (142 lines)
    tokenization.rs (210 lines)
  language/
    language_trait.rs (45 lines)
    factory.rs (120 lines)
    rust.rs (280 lines)
    parser_pool.rs (95 lines)
    tree_cache.rs (110 lines)
  models.rs (105 lines)
  ranking.rs (520 lines)
  extract/
    mod.rs (780 lines)
    processor.rs (930 lines)
    symbol_finder.rs (480 lines)
  cli.rs (460 lines)
Outline Format -- --detail full
src/
  search/
    search_runner.rs
      /// Main entry point for probe search. Orchestrates the full pipeline.
      pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>
      /// Search using structured patterns with SIMD acceleration.
      pub fn search_with_structured_patterns(...) -> Result<HashMap<PathBuf, ...>>
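For --detail full, each symbol needs the first line of its doc comment. A minimal text-based sketch for Rust-style /// comments follows. It assumes the source is already split into lines and that the symbol's 0-based line index is known. The helper first_doc_line is hypothetical, and a real implementation would more likely inspect the comment nodes preceding the symbol in the tree-sitter AST:

```rust
/// Returns the first line of the `///` doc comment block directly above
/// `symbol_line` (0-based index into `lines`), or None if there is none.
fn first_doc_line(lines: &[&str], symbol_line: usize) -> Option<String> {
    let mut first: Option<&str> = None;
    // Clamp so an out-of-range index cannot panic.
    let mut idx = symbol_line.min(lines.len());
    // Walk upward; the last `///` line seen is the top of the doc block.
    while idx > 0 {
        idx -= 1;
        let trimmed = lines[idx].trim_start();
        if let Some(rest) = trimmed.strip_prefix("///") {
            first = Some(rest.trim());
        } else if trimmed.starts_with("#[") {
            // Attributes such as #[inline] may sit between docs and the item.
            continue;
        } else {
            break;
        }
    }
    first.filter(|s| !s.is_empty()).map(str::to_string)
}
```

Other languages would need their own rule (docstrings in Python, JSDoc blocks in TypeScript), so this logic probably belongs next to each language's get_symbol_signature() rather than in shared code.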
JSON Format
{
  "root": "./src",
  "total_files": 42,
  "total_symbols": 187,
  "total_tokens": 3850,
  "tree": [
    {
      "path": "src/search",
      "type": "directory",
      "children": [
        {
          "path": "src/search/search_runner.rs",
          "type": "file",
          "lines": 2145,
          "language": "rust",
          "symbols": [
            {
              "name": "perform_probe",
              "signature": "pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>",
              "node_type": "function_item",
              "line": 225,
              "end_line": 450,
              "visibility": "public",
              "doc": "Main entry point for probe search. Orchestrates the full pipeline."
            }
          ]
        }
      ]
    }
  ]
}
Implementation Plan
Phase 1: Core probe map Command (Rust)
Step 1: New model types in src/models.rs
pub struct MapEntry {
    pub path: String,
    pub entry_type: MapEntryType,
    pub language: Option<String>,
    pub line_count: Option<usize>,
    pub symbols: Vec<SymbolInfo>,
    pub children: Vec<MapEntry>, // for directories
}

pub enum MapEntryType {
    Directory,
    File,
}

pub struct SymbolInfo {
    pub name: String,
    pub signature: String,
    pub node_type: String,
    pub start_line: usize,
    pub end_line: usize,
    pub visibility: Option<String>,  // pub, pub(crate), private, etc.
    pub doc_comment: Option<String>, // first line only
}

pub struct MapOptions {
    pub paths: Vec<String>,
    pub depth: Option<usize>,
    pub detail: MapDetail, // files, signatures, full
    pub language: Option<String>,
    pub max_tokens: Option<usize>,
    pub format: String, // outline, json, xml
    pub ignore_patterns: Vec<String>,
    pub allow_tests: bool,
}

pub enum MapDetail {
    Files,      // just file names + line counts
    Signatures, // + symbol signatures (default)
    Full,       // + doc comments
}
Step 2: New module src/map/
Create src/map/mod.rs:
- pub fn handle_map(options: MapOptions) -> Result<MapResult> -- main entry point
- Reuse file_list_cache for .gitignore-aware traversal
- For each file: call a revived extract_all_symbols_from_file() (currently dead code at processor.rs:884)
- Build directory tree from flat file list
- Apply --max-tokens budget using existing tiktoken infrastructure
Create src/map/output.rs:
- format_map_outline() -- indented text output
- format_map_json() -- structured JSON
- format_map_xml() -- XML output
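The "build directory tree from flat file list" step and the outline formatter can be sketched together. This is an illustration only: Node is a hypothetical, simplified stand-in for MapEntry, signatures are passed in as plain strings instead of coming from extract_all_symbols_from_file(), and entries are sorted alphabetically, which is an assumption rather than a stated requirement:

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the proposal's `MapEntry`.
#[derive(Debug, Default)]
struct Node {
    /// Children keyed by name; BTreeMap keeps the output sorted.
    children: BTreeMap<String, Node>,
    /// Symbol signatures; stays empty for directories.
    symbols: Vec<String>,
    is_file: bool,
}

/// Build a tree from a flat list of (relative path, signatures) pairs.
fn build_tree(files: &[(&str, Vec<&str>)]) -> Node {
    let mut root = Node::default();
    for (path, symbols) in files {
        let mut node = &mut root;
        // Descend one path component at a time, creating nodes as needed.
        for part in path.split('/').filter(|p| !p.is_empty()) {
            node = node.children.entry(part.to_string()).or_default();
        }
        node.is_file = true;
        node.symbols = symbols.iter().map(|s| s.to_string()).collect();
    }
    root
}

/// Render the tree as indented text; directories get a trailing slash.
fn format_outline(node: &Node, depth: usize, out: &mut String) {
    for (name, child) in &node.children {
        let indent = "  ".repeat(depth);
        if child.is_file {
            out.push_str(&format!("{indent}{name}\n"));
            for sig in &child.symbols {
                out.push_str(&format!("{indent}  {sig}\n"));
            }
        } else {
            out.push_str(&format!("{indent}{name}/\n"));
            format_outline(child, depth + 1, out);
        }
    }
}
```

Keeping the tree separate from the formatter means format_map_json() and format_map_xml() could walk the same structure.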
Step 3: Token-Budget-Aware Truncation
Critical for AI agents. When --max-tokens is set:
- Start with directory structure (cheapest)
- Add symbols for files in priority order:
  - Smaller files first (more likely to be focused modules)
  - Files closer to root first
  - Public symbols only if budget is tight
- When budget runs out, show remaining files as ... (N more files) with just the filename
- Return metadata: { total_files, shown_files, total_symbols, shown_symbols, tokens_used }
This ensures the agent always gets SOMETHING useful within its token budget, never an error or empty result.
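A minimal sketch of the truncation strategy, under several assumptions: FileSummary and BudgetedMap are hypothetical types, estimate_tokens is a rough 4-characters-per-token stand-in for the tiktoken counter, and "closer to root" is ranked ahead of "smaller" because the relative weight of the two criteria is not specified above. The "public symbols only" fallback is omitted for brevity:

```rust
/// Hypothetical per-file summary used only for this sketch.
#[derive(Debug, Clone)]
struct FileSummary {
    path: String,
    line_count: usize,
    signatures: Vec<String>,
}

/// Hypothetical result: which files keep their symbols and which do not.
#[derive(Debug)]
struct BudgetedMap {
    /// Files shown with their signatures.
    shown: Vec<FileSummary>,
    /// Files that only appear by name.
    names_only: Vec<String>,
    tokens_used: usize,
}

/// Rough estimate (about 4 characters per token); an assumption standing in
/// for the real tiktoken-based count.
fn estimate_tokens(text: &str) -> usize {
    (text.len() + 3) / 4
}

fn apply_budget(mut files: Vec<FileSummary>, max_tokens: usize) -> BudgetedMap {
    // Priority: shallower paths first, then smaller files.
    files.sort_by_key(|f| (f.path.matches('/').count(), f.line_count));

    // File names are always emitted, so charge for them up front.
    let mut tokens_used: usize = files.iter().map(|f| estimate_tokens(&f.path)).sum();
    let mut shown = Vec::new();
    let mut names_only = Vec::new();

    for file in files {
        let cost: usize = file.signatures.iter().map(|s| estimate_tokens(s)).sum();
        if tokens_used + cost <= max_tokens {
            tokens_used += cost;
            shown.push(file);
        } else {
            // Over budget: keep the file visible, drop its symbols.
            names_only.push(file.path);
        }
    }
    BudgetedMap { shown, names_only, tokens_used }
}
```

This sketch does not handle the case where the file names alone exceed the budget. In that situation a further fallback, such as collapsing whole directories into ... (N more files), would be needed to keep the guarantee stated above.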
Step 4: CLI registration in src/cli.rs
Add Map variant to the Commands enum:
/// Generate a structural overview of a codebase
Map {
    /// Paths to map
    #[arg(default_value = ".")]
    paths: Vec<String>,

    /// Maximum directory depth
    #[arg(long, short = 'd')]
    depth: Option<usize>,

    /// Detail level: files, signatures, full
    #[arg(long, default_value = "signatures")]
    detail: String,

    /// Filter by programming language
    #[arg(long, short = 'l')]
    language: Option<String>,

    /// Maximum output tokens
    #[arg(long)]
    max_tokens: Option<usize>,

    /// Output format
    #[arg(long, short = 'o', default_value = "outline")]
    format: String,

    /// Custom ignore patterns
    #[arg(long, short = 'i')]
    ignore: Vec<String>,

    /// Include test files
    #[arg(long)]
    allow_tests: bool,
}
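The CLI takes detail as a String while MapOptions expects a MapDetail, so a conversion is needed somewhere between the two. One way to do it, sketched here with only the standard library, is a FromStr implementation. The case-insensitive matching and the error message are assumptions, and clap's ValueEnum derive could serve the same purpose:

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapDetail {
    Files,
    Signatures,
    Full,
}

impl FromStr for MapDetail {
    type Err = String;

    /// Parse the value passed to `--detail`, ignoring ASCII case.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "files" => Ok(MapDetail::Files),
            "signatures" => Ok(MapDetail::Signatures),
            "full" => Ok(MapDetail::Full),
            other => Err(format!(
                "unknown detail level '{other}' (expected files, signatures, or full)"
            )),
        }
    }
}
```

With this in place, the command handler could call detail.parse::<MapDetail>() and report the error before any file is read.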
Phase 2: MCP Integration
Add map_code tool to the MCP server at npm/src/mcp/index.ts:
{
  name: "map_code",
  description: "Get a structural overview of a codebase with file tree and symbol signatures. Use this FIRST when exploring an unfamiliar codebase before searching.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Directory to map" },
      depth: { type: "number", description: "Max directory depth (default: unlimited)" },
      detail: { type: "string", enum: ["files", "signatures", "full"], default: "signatures" },
      language: { type: "string", description: "Filter by language" },
      maxTokens: { type: "number", description: "Token budget for output", default: 4000 },
    },
    required: ["path"]
  }
}
Phase 3: Agent Integration
Update ProbeAgent system prompt to use map_code as the first step when exploring a new codebase:
When exploring an unfamiliar codebase:
1. Use map_code to understand the overall structure
2. Use search_code to find specific code
3. Use extract_code to read specific files/symbols
Performance Considerations
- Lazy symbol extraction: Only parse files with tree-sitter when --detail signatures or --detail full is requested. For --detail files, just count lines.
- Parallel processing: Use rayon for file parsing (same as search pipeline).
- Cache reuse: The parser pool (ParserPool) and tree cache (TreeCache) are already designed for reuse across files.
- Early termination: Stop processing files once the --max-tokens budget is exhausted.
- File list cache: Reuse file_list_cache to avoid re-walking the directory on repeated calls.
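For the --detail files path, counting lines needs no parser at all. A minimal sketch, assuming a byte-oriented count is acceptable; count_lines is a hypothetical helper name:

```rust
use std::io::{self, BufRead};

/// Count lines by reading up to each b'\n'. Working on bytes means files
/// that are not valid UTF-8 are counted instead of causing an error.
/// A final line without a trailing newline still counts as a line.
fn count_lines<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut count = 0;
    let mut buf = Vec::new();
    loop {
        buf.clear();
        let bytes_read = reader.read_until(b'\n', &mut buf)?;
        if bytes_read == 0 {
            break; // end of input
        }
        count += 1;
    }
    Ok(count)
}
```

For a file on disk this would be called as count_lines(BufReader::new(File::open(path)?)), and because each file is independent the calls can run in parallel.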
Testing
Unit Tests (src/map/mod.rs)
#[cfg(test)]
mod tests {
    #[test]
    fn test_map_single_file() { /* map a single .rs file, verify symbols extracted */ }

    #[test]
    fn test_map_directory_tree() { /* map a directory, verify tree structure */ }

    #[test]
    fn test_map_depth_limit() { /* --depth 1 only shows one level */ }

    #[test]
    fn test_map_language_filter() { /* --language rust only shows .rs files */ }

    #[test]
    fn test_map_token_budget() { /* --max-tokens 500 truncates gracefully */ }

    #[test]
    fn test_map_detail_files() { /* --detail files shows no symbols */ }

    #[test]
    fn test_map_detail_signatures() { /* --detail signatures shows signatures */ }

    #[test]
    fn test_map_excludes_tests() { /* test files excluded by default */ }

    #[test]
    fn test_map_gitignore_respected() { /* .gitignore patterns honored */ }
}
CLI Tests (tests/cli_tests.rs)
use std::process::Command;

#[test]
fn test_map_command_basic() {
    // Assumes a `probe` binary is discoverable on PATH.
    let output = Command::new("probe")
        .args(["map", "./src", "--format", "json"])
        .output()
        .unwrap();
    let map: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap();
    assert!(map["total_files"].as_u64().unwrap() > 0);
    assert!(!map["tree"].as_array().unwrap().is_empty());
}

#[test]
fn test_map_command_max_tokens() {
    let output = Command::new("probe")
        .args(["map", "./src", "--max-tokens", "500", "--format", "json"])
        .output()
        .unwrap();
    let map: serde_json::Value = serde_json::from_slice(&output.stdout).unwrap();
    // The reported token count must stay within the requested budget.
    assert!(map["total_tokens"].as_u64().unwrap() <= 500);
}
Success Criteria
- probe map ./src returns useful output in <1 second for a medium codebase (~500 files)
- probe map . --max-tokens 4000 always fits within token budget
- Output is immediately useful for an LLM to understand repo structure
- No indexing or setup required
- Respects .gitignore and --ignore patterns
- Token budget truncation is graceful (never empty output, always shows at least file tree)
Competitive Context
This feature was identified by comparing probe with:
- ABCoder (CloudWeGo/ByteDance) -- get_repo_structure / get_package_structure MCP tools with hierarchical drill-down
- Stakgraph (Stakwork) -- repo_map MCP tool returning graph overview
- Octocode (Muvon) -- view command showing file signatures via glob patterns
- grepai -- no equivalent (relies on semantic search for discovery)
Probe's advantage: zero setup, instant results -- unlike ABCoder (requires batch parse) or Octocode (requires indexing). Same philosophy as probe search.