Feature: probe map -- Repository Structure Overview Command #501

@buger

Problem Statement

When an AI agent (or human) encounters an unfamiliar codebase, the first question is always: "What's in this repo?" Currently, there's no way to get a structural overview from probe without searching for something specific. The agent has to either:

  • Guess search terms blindly
  • Use the NPM listFiles tool (flat, one-level directory listing, no code structure)
  • Run find or ls -R externally (no semantic information)

Other tools solve this: ABCoder's get_repo_structure / get_package_structure, Stakgraph's repo_map MCP tool, and Octocode's view command. Probe needs its own version that fits its zero-setup, instant, AST-aware philosophy.

Proposed Solution

A new probe map CLI command that returns a hierarchical view of a codebase with top-level symbol signatures, using the existing tree-sitter infrastructure. No indexing, no setup -- same instant behavior as probe search.

Existing Building Blocks

The codebase already has most of the pieces:

  1. extract_all_symbols_from_file() in src/extract/processor.rs:884 -- currently dead code that only needs to be exposed and extended. It already:

    • Parses a file with tree-sitter
    • Iterates root-level children
    • Filters by is_acceptable_parent()
    • Calls get_symbol_signature() for each symbol
    • Returns Vec<SearchResult> with symbol_signature populated
  2. get_symbol_signature() implemented for 8 languages in src/language/:

    • Rust (rust.rs:142): functions, structs, impls, traits, enums, consts, statics, types, macros
    • TypeScript (typescript.rs:106)
    • JavaScript (javascript.rs:97)
    • Python (python.rs:64)
    • Go (go.rs:151)
    • YAML (yaml.rs:191)
    • Markdown (markdown.rs:178)
    • HTML (html.rs:329)
  3. file_list_cache in src/search/file_list_cache.rs -- cached .gitignore-aware file listing with language filtering

  4. Token counting via tiktoken (src/search/search_tokens.rs:333) -- can enforce --max-tokens on map output

  5. ParentContext model in src/models.rs:25 -- already represents scope hierarchy

CLI Interface

# Basic: map the current directory
probe map .

# Map a specific subdirectory
probe map ./src/search

# Filter by language
probe map ./src --language rust

# Control depth: how many directory levels deep
probe map ./src --depth 2

# Control detail level
probe map ./src --detail signatures    # default: symbol signatures
probe map ./src --detail files         # files only, no symbols
probe map ./src --detail full          # signatures + first doc comment line

# Token budget (critical for AI agents)
probe map ./src --max-tokens 4000

# Output formats (reuse existing infrastructure)
probe map ./src --format outline       # default
probe map ./src --format json
probe map ./src --format xml

# Ignore patterns (reuse existing --ignore flag)
probe map ./src --ignore "test*" --ignore "*.generated.*"

# Test files are excluded by default; the existing --allow-tests flag includes them
probe map ./src               # excludes test files by default
probe map ./src --allow-tests # includes test files

Expected Output Formats

Outline Format (default) -- --detail signatures

src/
  search/
    search_runner.rs
      pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>
      pub fn search_with_structured_patterns(...) -> Result<HashMap<PathBuf, ...>>
      fn process_file_with_results(...) -> Result<Vec<SearchResult>>
    result_ranking.rs
      pub fn rank_search_results(results: &mut Vec<SearchResult>, ...) -> Result<()>
    search_output.rs
      pub fn format_and_print_search_results(results: &LimitedSearchResults, ...) -> Result<String>
      pub fn collect_parent_context_for_line(...) -> Vec<ParentContext>
      fn format_and_print_outline_results(...) -> Result<String>
    elastic_query.rs
      pub enum Expr
      pub fn parse_query(query: &str) -> Result<Expr>
    block_merging.rs
      pub fn merge_ranked_blocks(results: Vec<SearchResult>, ...) -> Vec<SearchResult>
    cache.rs
      pub struct SearchCache
      pub fn new(session_id: &str) -> Self
    filters.rs
      pub struct SearchFilters
    tokenization.rs
      pub fn tokenize(text: &str) -> Vec<String>
  language/
    language_trait.rs
      pub trait LanguageImpl
    factory.rs
      pub fn get_language_for_file(path: &Path) -> Option<Box<dyn LanguageImpl>>
    rust.rs
      pub struct RustLanguage
    parser_pool.rs
      pub struct ParserPool
    tree_cache.rs
      pub struct TreeCache
  models.rs
    pub struct SearchResult
    pub struct ParentContext
    pub struct CodeBlock
    pub struct LimitedSearchResults
    pub struct SearchLimits
  ranking.rs
    pub fn rank_documents(query: &Expr, ...) -> Vec<(usize, f64)>
    pub struct QueryTokenMap
  extract/
    mod.rs
      pub fn handle_extract(options: ExtractOptions) -> Result<()>
    processor.rs
      pub fn process_file_for_extraction(...) -> Result<Vec<SearchResult>>
    symbol_finder.rs
      pub fn find_all_symbols_in_file(...) -> Result<Vec<SymbolMatch>>
  cli.rs
    pub enum Commands
    pub struct SearchArgs
    pub struct ExtractArgs

Outline Format -- --detail files

src/
  search/
    search_runner.rs (2,145 lines)
    result_ranking.rs (487 lines)
    search_output.rs (2,680 lines)
    elastic_query.rs (356 lines)
    block_merging.rs (290 lines)
    cache.rs (185 lines)
    filters.rs (142 lines)
    tokenization.rs (210 lines)
  language/
    language_trait.rs (45 lines)
    factory.rs (120 lines)
    rust.rs (280 lines)
    parser_pool.rs (95 lines)
    tree_cache.rs (110 lines)
  models.rs (105 lines)
  ranking.rs (520 lines)
  extract/
    mod.rs (780 lines)
    processor.rs (930 lines)
    symbol_finder.rs (480 lines)
  cli.rs (460 lines)

Outline Format -- --detail full

src/
  search/
    search_runner.rs
      /// Main entry point for probe search. Orchestrates the full pipeline.
      pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>
      /// Search using structured patterns with SIMD acceleration.
      pub fn search_with_structured_patterns(...) -> Result<HashMap<PathBuf, ...>>

JSON Format

{
  "root": "./src",
  "total_files": 42,
  "total_symbols": 187,
  "total_tokens": 3850,
  "tree": [
    {
      "path": "src/search",
      "type": "directory",
      "children": [
        {
          "path": "src/search/search_runner.rs",
          "type": "file",
          "lines": 2145,
          "language": "rust",
          "symbols": [
            {
              "name": "perform_probe",
              "signature": "pub fn perform_probe(options: &SearchOptions) -> Result<Vec<SearchResult>>",
              "node_type": "function_item",
              "line": 225,
              "end_line": 450,
              "visibility": "public",
              "doc": "Main entry point for probe search. Orchestrates the full pipeline."
            }
          ]
        }
      ]
    }
  ]
}

Implementation Plan

Phase 1: Core probe map Command (Rust)

Step 1: New model types in src/models.rs

/// A node in the map tree: either a directory or a file.
pub struct MapEntry {
    pub path: String,
    pub entry_type: MapEntryType,     // "type" in the JSON format
    pub language: Option<String>,     // files only
    pub line_count: Option<usize>,    // "lines" in the JSON format; files only
    pub symbols: Vec<SymbolInfo>,     // for files
    pub children: Vec<MapEntry>,      // for directories
}

pub enum MapEntryType {
    Directory,
    File,
}

/// One top-level symbol extracted from a file.
pub struct SymbolInfo {
    pub name: String,
    pub signature: String,
    pub node_type: String,            // tree-sitter node kind, e.g. "function_item"
    pub start_line: usize,            // "line" in the JSON format
    pub end_line: usize,
    pub visibility: Option<String>,   // pub, pub(crate), private, etc.
    pub doc_comment: Option<String>,  // first line only; "doc" in the JSON format
}

/// Options for a single probe map invocation, mirroring the CLI flags.
pub struct MapOptions {
    pub paths: Vec<String>,
    pub depth: Option<usize>,         // None means unlimited
    pub detail: MapDetail,            // files, signatures, full
    pub language: Option<String>,
    pub max_tokens: Option<usize>,
    pub format: String,               // outline, json, xml
    pub ignore_patterns: Vec<String>,
    pub allow_tests: bool,
}

pub enum MapDetail {
    Files,       // just file names + line counts
    Signatures,  // + symbol signatures (default)
    Full,        // + doc comments
}

Step 2: New module src/map/

Create src/map/mod.rs:

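  • pub fn handle_map(options: MapOptions) -> Result<MapResult> -- main entry point (MapResult is not among the Step 1 types; presumably it wraps the tree plus the totals shown in the JSON format)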
  • Reuse file_list_cache for .gitignore-aware traversal
  • For each file: call a revived extract_all_symbols_from_file() (currently dead code at processor.rs:884)
  • Build directory tree from flat file list
  • Apply --max-tokens budget using existing tiktoken infrastructure
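
The "build directory tree from flat file list" step could look roughly like the sketch below. It is only an illustration: Node is a hypothetical, trimmed-down stand-in for MapEntry, and the input is assumed to be slash-separated relative paths such as those file_list_cache might yield.

```rust
use std::collections::BTreeMap;

/// Hypothetical simplified node; the proposed MapEntry carries more fields.
#[derive(Debug, Default)]
struct Node {
    /// Child entries keyed by name; BTreeMap keeps the output order stable.
    children: BTreeMap<String, Node>,
    /// True when this node is a file rather than a directory.
    is_file: bool,
}

/// Insert each slash-separated relative path into a nested tree.
fn build_tree(paths: &[&str]) -> Node {
    let mut root = Node::default();
    for path in paths {
        let mut current = &mut root;
        // Skip empty segments and a leading "./".
        for part in path.split('/').filter(|p| !p.is_empty() && *p != ".") {
            current = current.children.entry(part.to_string()).or_default();
        }
        current.is_file = true;
    }
    root
}

fn main() {
    let tree = build_tree(&[
        "src/search/cache.rs",
        "src/search/filters.rs",
        "src/models.rs",
    ]);
    let src = &tree.children["src"];
    // Lists the direct children of src/ in sorted order.
    println!("{:?}", src.children.keys().collect::<Vec<_>>());
}
```

Keying children by name means a directory shared by many files is created once, and the sorted map gives deterministic output, which makes snapshot-style tests easier.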

Create src/map/output.rs:

  • format_map_outline() -- indented text output
  • format_map_json() -- structured JSON
  • format_map_xml() -- XML output
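
As a rough sketch of what format_map_outline() might do, the following renders a tree with two spaces of indentation per level, matching the outline examples above. Entry and its fields are hypothetical simplifications of MapEntry, and the function name and signature here are assumptions.

```rust
/// Hypothetical, trimmed-down stand-in for the proposed MapEntry.
struct Entry {
    name: String,
    is_dir: bool,
    signatures: Vec<String>,
    children: Vec<Entry>,
}

/// Render an entry and its descendants as an indented outline:
/// two spaces per nesting level, directories suffixed with '/',
/// symbol signatures one level below their file.
fn format_outline(entry: &Entry, indent: usize, out: &mut String) {
    let pad = "  ".repeat(indent);
    if entry.is_dir {
        out.push_str(&format!("{pad}{}/\n", entry.name));
    } else {
        out.push_str(&format!("{pad}{}\n", entry.name));
        for sig in &entry.signatures {
            out.push_str(&format!("{pad}  {sig}\n"));
        }
    }
    for child in &entry.children {
        format_outline(child, indent + 1, out);
    }
}

fn main() {
    let tree = Entry {
        name: "src".into(),
        is_dir: true,
        signatures: vec![],
        children: vec![Entry {
            name: "models.rs".into(),
            is_dir: false,
            signatures: vec!["pub struct SearchResult".into()],
            children: vec![],
        }],
    };
    let mut out = String::new();
    format_outline(&tree, 0, &mut out);
    print!("{out}");
}
```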

Step 3: Token-Budget-Aware Truncation

Critical for AI agents. When --max-tokens is set:

  1. Start with directory structure (cheapest)
  2. Add symbols for files in priority order:
    • Smaller files first (more likely to be focused modules)
    • Files closer to root first
    • Public symbols only if budget is tight
  3. When budget runs out, show remaining files as ... (N more files) with just the filename
  4. Return metadata: { total_files, shown_files, total_symbols, shown_symbols, tokens_used }

This ensures the agent always gets SOMETHING useful within its token budget, never an error or empty result.
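
The truncation steps above can be sketched as a greedy pass over files in priority order. Several things here are assumptions: FileSummary and Budgeted are hypothetical types, depth is ranked before file size (the issue does not say which criterion wins), and estimate_tokens counts whitespace-separated words as a stand-in for the tiktoken-based counter.

```rust
/// Hypothetical per-file summary used only for this sketch.
#[derive(Debug, Clone)]
struct FileSummary {
    path: String,
    line_count: usize,
    signatures: Vec<String>,
}

/// Stand-in token estimate: whitespace-separated words.
/// The real command would use the existing tiktoken-based counter.
fn estimate_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Result of applying the budget.
struct Budgeted {
    /// Files shown with their signatures, in priority order.
    shown: Vec<FileSummary>,
    /// Files that did not fit, rendered as "... (N more files)".
    omitted: usize,
    tokens_used: usize,
}

fn apply_budget(mut files: Vec<FileSummary>, max_tokens: usize) -> Budgeted {
    // Assumed priority: paths closer to the root first, then smaller files.
    files.sort_by_key(|f| (f.path.matches('/').count(), f.line_count));
    let mut shown = Vec::new();
    let mut tokens_used = 0;
    let mut omitted = 0;
    for file in files {
        let cost = estimate_tokens(&file.path)
            + file.signatures.iter().map(|s| estimate_tokens(s)).sum::<usize>();
        if tokens_used + cost <= max_tokens {
            tokens_used += cost;
            shown.push(file);
        } else {
            // Keep scanning: a later, cheaper file may still fit.
            omitted += 1;
        }
    }
    Budgeted { shown, omitted, tokens_used }
}

fn main() {
    let files = vec![FileSummary {
        path: "src/models.rs".into(),
        line_count: 105,
        signatures: vec!["pub struct SearchResult".into()],
    }];
    let result = apply_budget(files, 4000);
    println!(
        "shown={} omitted={} tokens_used={}",
        result.shown.len(),
        result.omitted,
        result.tokens_used
    );
}
```

Because a file is only added when its full cost fits, tokens_used never exceeds max_tokens in this sketch. The "public symbols only" fallback is left out for brevity.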

Step 4: CLI registration in src/cli.rs

Add Map variant to the Commands enum:

/// Generate a structural overview of a codebase
Map {
    /// Paths to map
    #[arg(default_value = ".")]
    paths: Vec<String>,

    /// Maximum directory depth
    #[arg(long, short = 'd')]
    depth: Option<usize>,

    /// Detail level: files, signatures, full
    #[arg(long, default_value = "signatures")]
    detail: String,

    /// Filter by programming language
    #[arg(long, short = 'l')]
    language: Option<String>,

    /// Maximum output tokens
    #[arg(long)]
    max_tokens: Option<usize>,

    /// Output format
    #[arg(long, short = 'o', default_value = "outline")]
    format: String,

    /// Custom ignore patterns
    #[arg(long, short = 'i')]
    ignore: Vec<String>,

    /// Include test files
    #[arg(long)]
    allow_tests: bool,
}
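
Since the CLI takes detail as a String while MapOptions holds a MapDetail, some conversion is needed between the two. One possible approach, shown here as a sketch rather than the intended implementation, is a FromStr impl. The case-insensitive matching and the error text are assumptions.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MapDetail {
    Files,
    Signatures,
    Full,
}

impl FromStr for MapDetail {
    type Err = String;

    /// Parse the value of --detail; matching is case-insensitive (an assumption).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "files" => Ok(MapDetail::Files),
            "signatures" => Ok(MapDetail::Signatures),
            "full" => Ok(MapDetail::Full),
            other => Err(format!(
                "unknown detail level '{other}' (expected files, signatures, or full)"
            )),
        }
    }
}

fn main() {
    let detail: MapDetail = "signatures".parse().expect("valid detail level");
    println!("{detail:?}");
}
```

An alternative would be to derive clap's ValueEnum on MapDetail, so invalid values are rejected at argument-parsing time. The same would apply to format.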

Phase 2: MCP Integration

Add map_code tool to the MCP server at npm/src/mcp/index.ts:

{
  name: "map_code",
  description: "Get a structural overview of a codebase with file tree and symbol signatures. Use this FIRST when exploring an unfamiliar codebase before searching.",
  inputSchema: {
    type: "object",
    properties: {
      path: { type: "string", description: "Directory to map" },
      depth: { type: "number", description: "Max directory depth (default: unlimited)" },
      detail: { type: "string", enum: ["files", "signatures", "full"], default: "signatures" },
      language: { type: "string", description: "Filter by language" },
      maxTokens: { type: "number", description: "Token budget for output", default: 4000 },
    },
    required: ["path"]
  }
}

Phase 3: Agent Integration

Update ProbeAgent system prompt to use map_code as the first step when exploring a new codebase:

When exploring an unfamiliar codebase:
1. Use map_code to understand the overall structure
2. Use search_code to find specific code
3. Use extract_code to read specific files/symbols

Performance Considerations

  • Lazy symbol extraction: Only parse files with tree-sitter when --detail signatures or --detail full is requested. For --detail files, just count lines.
  • Parallel processing: Use rayon for file parsing (same as search pipeline).
  • Cache reuse: The parser pool (ParserPool) and tree cache (TreeCache) are already designed for reuse across files.
  • Early termination: Stop processing files once --max-tokens budget is exhausted.
  • File list cache: Reuse file_list_cache to avoid re-walking the directory on repeated calls.
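
To illustrate the lazy-extraction point, here is a minimal sketch of the --detail files path: it counts lines without building a syntax tree and only signals that parsing is needed for the other detail levels. The helper names needs_parse and count_lines are hypothetical.

```rust
use std::io::{self, BufRead, BufReader, Read};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MapDetail {
    Files,
    Signatures,
    Full,
}

/// Only the signature-bearing detail levels require a tree-sitter parse.
fn needs_parse(detail: MapDetail) -> bool {
    !matches!(detail, MapDetail::Files)
}

/// Count lines by scanning for b'\n'. Works on non-UTF-8 input, and a
/// final line without a trailing newline is still counted.
fn count_lines<R: Read>(reader: R) -> io::Result<usize> {
    let mut reader = BufReader::new(reader);
    let mut buf = Vec::new();
    let mut lines = 0;
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        lines += 1;
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // An in-memory source stands in for a file opened with File::open.
    let source = "pub struct SearchResult;\npub struct ParentContext;\n";
    let detail = MapDetail::Files;
    let lines = count_lines(source.as_bytes())?;
    println!("lines={lines} parse_needed={}", needs_parse(detail));
    Ok(())
}
```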

Testing

Unit Tests (src/map/mod.rs)

#[cfg(test)]
mod tests {
    #[test]
    fn test_map_single_file() { /* map a single .rs file, verify symbols extracted */ }

    #[test]
    fn test_map_directory_tree() { /* map a directory, verify tree structure */ }

    #[test]
    fn test_map_depth_limit() { /* --depth 1 only shows one level */ }

    #[test]
    fn test_map_language_filter() { /* --language rust only shows .rs files */ }

    #[test]
    fn test_map_token_budget() { /* --max-tokens 500 truncates gracefully */ }

    #[test]
    fn test_map_detail_files() { /* --detail files shows no symbols */ }

    #[test]
    fn test_map_detail_signatures() { /* --detail signatures shows signatures */ }

    #[test]
    fn test_map_excludes_tests() { /* test files excluded by default */ }

    #[test]
    fn test_map_gitignore_respected() { /* .gitignore patterns honored */ }
}

CLI Tests (tests/cli_tests.rs)

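use std::process::Command;

// Assumes a `probe` binary is on PATH. In a Cargo integration test,
// env!("CARGO_BIN_EXE_probe") would point at the freshly built binary,
// provided the binary target is named `probe`.

#[test]
fn test_map_command_basic() {
    let output = Command::new("probe")
        .args(["map", "./src", "--format", "json"])
        .output()
        .expect("failed to run probe");
    assert!(output.status.success());
    let map: serde_json::Value =
        serde_json::from_slice(&output.stdout).expect("stdout is not valid JSON");
    assert!(map["total_files"].as_u64().unwrap() > 0);
    assert!(!map["tree"].as_array().unwrap().is_empty());
}

#[test]
fn test_map_command_max_tokens() {
    let output = Command::new("probe")
        .args(["map", "./src", "--max-tokens", "500", "--format", "json"])
        .output()
        .expect("failed to run probe");
    assert!(output.status.success());
    let map: serde_json::Value =
        serde_json::from_slice(&output.stdout).expect("stdout is not valid JSON");
    // The reported token count must stay within the requested budget.
    assert!(map["total_tokens"].as_u64().unwrap() <= 500);
}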

Success Criteria

  1. probe map ./src returns useful output in <1 second for a medium codebase (~500 files)
  2. probe map . --max-tokens 4000 always fits within token budget
  3. Output is immediately useful for an LLM to understand repo structure
  4. No indexing or setup required
  5. Respects .gitignore and --ignore patterns
  6. Token budget truncation is graceful (never empty output, always shows at least file tree)

Competitive Context

This feature was identified by comparing probe with:

  • ABCoder (CloudWeGo/ByteDance) -- get_repo_structure / get_package_structure MCP tools with hierarchical drill-down
  • Stakgraph (Stakwork) -- repo_map MCP tool returning graph overview
  • Octocode (Muvon) -- view command showing file signatures via glob patterns
  • grepai -- no equivalent (relies on semantic search for discovery)

Probe's advantage: zero setup, instant results -- unlike ABCoder (requires batch parse) or Octocode (requires indexing). Same philosophy as probe search.
