
CodeWiki A/B testing framework ready#10

Open
opencode-agent[bot] wants to merge 6 commits into opencode-dev from opencode/issue9-20251227175759

Conversation


opencode-agent bot commented Dec 27, 2025

I've successfully built the core A/B comparison infrastructure for CodeWiki documentation testing. Here's what was created:

Files Created

tests/ab_testing/utils.py (402 lines)

  • ComparisonMetrics dataclass with 7 metric fields
  • generate_report_for_version() - creates git worktree, runs codewiki CLI, restores state
  • calculate_metrics() - loads metadata.json and module_tree.json, computes all metrics
  • generate_comparison_report() - formats metrics as markdown with ✓/✗ indicators
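As an illustration, the file-count portion of these metrics can be computed directly from the two output directories; this is a sketch with hypothetical names, not the exact code in utils.py:

```python
from pathlib import Path

def count_markdown_files(doc_dir: Path) -> int:
    """Count generated .md files, recursing into module subdirectories."""
    return sum(1 for _ in doc_dir.rglob("*.md"))

def file_count_delta(baseline_dir: Path, current_dir: Path) -> int:
    """Positive when the current version produced more markdown files."""
    return count_markdown_files(current_dir) - count_markdown_files(baseline_dir)
```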

tests/ab_testing/test_ab_comparison.py (222 lines)

  • test_ab_comparison() - orchestrates full A/B workflow (baseline + current generation, metrics, assertions)
  • test_generate_report_invalid_version() - validates error handling
  • test_calculate_metrics_with_empty_dirs() - tests metric calculation
  • Pytest fixtures for temporary directory cleanup

tests/ab_testing/README.md (96 lines)

  • Complete documentation for running tests
  • Troubleshooting guide
  • Usage examples

tests/ab_testing/fixtures/.gitkeep - placeholder for test outputs

Features Implemented

  • Git worktree isolation to avoid modifying working directory
  • PYTHONPATH configuration to use current codewiki module across versions
  • Comparison metrics: functional correctness, file count delta, structure compatibility, documentation coverage, content length delta, markdown validity
  • Markdown report with Executive Summary, Functional Correctness, Structural Changes, Content Quality, and Recommendations sections
  • Environment variables for customizing baseline/current versions
  • Proper cleanup via pytest fixtures
  • .gitignore updated to allow A/B test files
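The worktree isolation and environment setup above can be sketched as follows; the function names are illustrative, not the ones in utils.py, and note that whichever directory PYTHONPATH points at decides which copy of the codewiki package the subprocess imports:

```python
import os
from pathlib import Path

def build_worktree_commands(repo_path: Path, worktree_dir: Path, version_tag: str):
    """git commands that create a detached worktree for a tag and remove it after."""
    add = ["git", "-C", str(repo_path), "worktree", "add", "--detach",
           str(worktree_dir), version_tag]
    remove = ["git", "-C", str(repo_path), "worktree", "remove", "--force",
              str(worktree_dir)]
    return add, remove

def build_generation_env(code_dir: Path) -> dict:
    """Subprocess environment with PYTHONPATH pinned to one codewiki checkout."""
    env = os.environ.copy()
    env["PYTHONPATH"] = str(code_dir)
    return env
```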

All unit tests pass, and the infrastructure is ready for use with LLM configuration.

Closes #9

New session - 2025-12-27T17:57:59.019Z
opencode session  |  github run

Summary by CodeRabbit

  • New Features

    • Added a CLI option to configure LLM cache size (range: 100–10,000).
    • Introduced an A/B testing framework for comparing documentation across versions with automated metrics and reporting.
  • Improvements

    • Smarter parallel worker allocation based on CPU and language groups.
    • Expanded configuration options and validations for file limits, token sizes, parallelism, and concurrency.
    • Performance metrics now captured and included in documentation outputs.
  • Chores

    • Centralized backend logging configuration.
    • .gitignore updated to allow A/B test files.



coderabbitai bot commented Dec 27, 2025

📝 Walkthrough

This PR adds A/B testing tooling and tests, expands CLI/configuration with cache_size and other analysis limits, introduces centralized backend logging, initializes an LLM cache from config, and surfaces performance metrics in documentation metadata.

Changes

  • A/B Testing Infrastructure (tests/ab_testing/__init__.py, tests/ab_testing/README.md, tests/ab_testing/test_ab_comparison.py, tests/ab_testing/utils.py): new test package and utilities to create git worktrees, generate reports for specific versions, compute comparison metrics (files, structure, content length, generation time, throughput), and render a Markdown comparison report; includes fixtures and unit tests.
  • CLI: cache-size option & persistence (codewiki/cli/commands/config.py, codewiki/cli/config_manager.py): adds a --cache-size CLI option, extends config_set and ConfigManager.save(...) to accept and validate cache_size, and persists it in the saved configuration.
  • Configuration model & AnalysisOptions (codewiki/cli/models/config.py, codewiki/cli/models/job.py): extends Configuration with default_output, max_files, max_entry_points, max_connectivity_files, max_tokens_per_module, max_tokens_per_leaf, enable_parallel_processing, concurrency_limit, and cache_size; adds validation ranges, reads/writes these fields to/from dicts, and adds cache_size to the AnalysisOptions dataclass.
  • LLM cache initialization (codewiki/src/be/llm_services.py): new function initialize_llm_cache(config: Config) that reads config.analysis_options.cache_size, sets the LLM cache max size, and logs success/failure.
  • Centralized backend logging (codewiki/src/be/logging_config.py, codewiki/src/be/agent_orchestrator.py, codewiki/src/be/cluster_modules.py): new logging_config module with configure_logging, get_logger, and ensure_logging_configured; replaces direct logging.getLogger calls with get_logger(__name__) and adds module-level logger usage.
  • Documentation & performance metrics (codewiki/src/be/documentation_generator.py): initializes the LLM cache in the constructor, captures performance metrics from the tracker, and extends create_documentation_metadata to accept an optional performance_metrics parameter that is included in metadata when provided.
  • Parallel analysis tuning (codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py): adds MAX_WORKERS=8, uses min(os.cpu_count(), language_groups_count, MAX_WORKERS) to compute the parallel worker count, and logs the chosen count.
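The worker computation described for call_graph_analyzer.py can be sketched as (the function name here is illustrative):

```python
import os

MAX_WORKERS = 8  # hard cap, as added in call_graph_analyzer

def compute_parallel_workers(language_groups_count: int) -> int:
    """Worker count bounded by CPU count, number of language groups, and the cap."""
    return min(os.cpu_count() or 1, language_groups_count, MAX_WORKERS)
```

Bounding by the number of language groups avoids spawning idle workers when there are fewer work items than CPUs.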
  • .gitignore: keeps tests/ ignored but adds a negation to allow tests/ab_testing/ to be included.
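The negation pattern is worth noting because git cannot re-include files whose parent directory is itself excluded; the exact lines in this PR are not shown, but a working shape is:

```gitignore
# ignore test outputs, but track the A/B testing package
tests/*
!tests/ab_testing/
```

Using `tests/*` (directory contents) rather than `tests/` (the directory itself) is what allows the `!tests/ab_testing/` negation to take effect.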

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Tester as CLI / Test Runner
    participant Git as Git (worktree)
    participant Engine as CodeWiki Engine
    participant Analyzer as AB Metrics Analyzer
    participant Reporter as Report Generator

    Tester->>Git: create worktree for baseline version
    Tester->>Engine: run documentation generation (baseline)
    Engine-->>Tester: baseline outputs (+metadata)
    Tester->>Git: create worktree for current version
    Tester->>Engine: run documentation generation (current)
    Engine-->>Tester: current outputs (+metadata)
    Tester->>Analyzer: load metadata (baseline, current)
    Analyzer->>Analyzer: compute deltas (files, structure, content, time, throughput)
    Analyzer-->>Reporter: ComparisonMetrics
    Reporter->>Tester: render Markdown comparison report
    Tester->>Tester: assert thresholds & save report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 I hopped through branches, logs in tow,

Cache sized, tests that clearly show,
Baseline vs current, metrics in queue,
Reports that sparkle, insights anew —
A rabbit cheers: code and docs grow!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Out of Scope Changes check: ⚠️ Warning. The PR introduces substantial out-of-scope changes unrelated to A/B testing: new configuration fields (default_output, max_files, max_entry_points, max_connectivity_files, max_tokens_per_module, max_tokens_per_leaf, enable_parallel_processing, concurrency_limit, cache_size), logging refactoring (logging_config.py), and LLM cache initialization, which appear disconnected from the stated A/B testing objective. Resolution: separate the A/B testing infrastructure changes from the configuration and logging infrastructure changes into distinct PRs to maintain clear scope and review focus.
  • Linked Issues check: ❓ Inconclusive. The PR implements A/B testing infrastructure and closes issue #9, but reviewer comments reveal critical bugs: a PYTHONPATH misconfiguration in tests/ab_testing/utils.py causes both baseline and current runs to use the same code, and CLI parameter propagation for parallel processing is not implemented. Resolution: fix PYTHONPATH in generate_report_for_version() to use the worktree directory instead of project_root, and implement CLI parameter propagation for enable_parallel_processing and concurrency_limit in config.py and config_manager.py.
✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'CodeWiki A/B testing framework ready' clearly and concisely summarizes the main change: completion of A/B testing infrastructure for CodeWiki documentation comparison.
  • Description check: ✅ Passed. The PR description provides comprehensive details: TLDR overview, files created with line counts, features implemented, and closure of issue #9, adequately covering the changes without strictly following the template format.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 96.67%, which is sufficient; the required threshold is 80.00%.

@e2720pjk (Owner) commented

@coderabbitai review this: A/B Test Plan Evaluation

Current Test Coverage Indicators

The current A/B testing framework (tests/ab_testing/) measures:

• ✅ Functional correctness
• ✅ Documentation structure compatibility
• ✅ Changes in document count and content length
• ✅ Documentation coverage
• ✅ Markdown validity
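For example, the markdown-validity metric can be as simple as a heuristic check; this is a sketch, not the exact implementation in utils.py:

```python
from pathlib import Path

FENCE = "`" * 3  # a triple-backtick code fence

def markdown_is_valid(path: Path) -> bool:
    """Heuristic check: file is readable UTF-8, non-empty, and code fences
    are balanced (an even number of fence markers)."""
    try:
        text = path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return False
    return bool(text.strip()) and text.count(FENCE) % 2 == 0
```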


Key Performance Changes from v0.1.0 → v0.1.1

Based on git diff analysis, the major performance improvements in v0.1.1 include:

  1. Parallel processing – CallGraphAnalyzer._analyze_parallel() and DocumentationGenerator.process_leaf_modules_parallel()
  2. Performance tracking – performance_metrics.py tracks total_time, avg_time_per_module, parallel_efficiency
  3. LLM caching – caching.py introduces LLMPromptCache to reduce redundant API calls
  4. Concurrency control – concurrency_limit parameter manages the number of concurrent modules
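As context for item 3, a prompt cache of this kind is typically a small LRU map keyed by prompt text; the following is an illustrative sketch, since the actual API of LLMPromptCache in caching.py is not shown in this PR:

```python
from collections import OrderedDict

class LLMPromptCache:
    """Minimal LRU cache mapping prompt text to a cached response (sketch)."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._store: OrderedDict[str, str] = OrderedDict()

    def get(self, prompt: str):
        """Return the cached response, refreshing recency; None on a miss."""
        if prompt in self._store:
            self._store.move_to_end(prompt)
            return self._store[prompt]
        return None

    def put(self, prompt: str, response: str) -> None:
        """Store a response, evicting the least recently used entry if full."""
        self._store[prompt] = response
        self._store.move_to_end(prompt)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)
```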

Evaluation Result: Incomplete

The current test plan cannot evaluate the following core performance improvements:

  1. Execution Time
     • ❌ Total documentation generation time
     • ❌ Average processing time per module
     • ❌ Speedup from parallel vs sequential processing
  2. API Efficiency
     • ❌ Change in API call count (caching effect)
     • ❌ Average API response time
     • ❌ Total API call duration
  3. Parallel Performance
     • ❌ Parallel efficiency
     • ❌ Performance under different concurrency_limit values
     • ❌ CPU/memory usage
  4. Reliability
     • ❌ Number of successful/failed modules
     • ❌ Retry count

Root Causes

  1. Performance tracker data in v0.1.1 is not saved to metadata.json
     • PerformanceMetrics.to_dict() contains all performance data
     • But create_documentation_metadata() does not call this method
  2. The A/B test ComparisonMetrics does not include performance indicators
     • utils.py loads metadata.json
     • But does not extract total_time, api_calls, parallel_efficiency, etc.
  3. Missing execution time measurement
     • generate_report_for_version() only sets a subprocess timeout
     • It does not record actual start and end times

Recommended Improvements

To fully evaluate performance differences between v0.1.0 and v0.1.1, the following changes are needed:

  1. Modify DocumentationGenerator – add performance data in create_documentation_metadata()
  2. Extend ComparisonMetrics – include performance_metrics field
  3. Record execution time – log start and end times in generate_report_for_version()
  4. Extract performance indicators – pull and compare metrics from metadata.json

Final Note

The current tests only verify functional compatibility, but cannot assess the actual impact of parallel processing, caching, and other core performance improvements.



coderabbitai bot commented Dec 27, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the A/B testing utils to check ComparisonMetrics structure
cat tests/ab_testing/utils.py

Length of output: 13976


🏁 Script executed:

#!/bin/bash
# Read performance_metrics.py to see what's being tracked
cat codewiki/performance_metrics.py

Length of output: 63


🏁 Script executed:

#!/bin/bash
# Read documentation_generator.py to check create_documentation_metadata
cat codewiki/documentation_generator.py

Length of output: 67


🏁 Script executed:

#!/bin/bash
# Check the A/B testing test file
cat tests/ab_testing/test_ab_comparison.py

Length of output: 7203


🏁 Script executed:

#!/bin/bash
# Look at a sample metadata.json structure if available
fd -t f "metadata.json" -x cat {} \; | head -100

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Find performance_metrics.py and documentation_generator.py files
fd -t f "performance_metrics.py"
fd -t f "documentation_generator.py"

Length of output: 81


🏁 Script executed:

#!/bin/bash
# Search for PerformanceMetrics class definition
rg -n "class PerformanceMetrics" -A 20

Length of output: 1566


🏁 Script executed:

#!/bin/bash
# Search for create_documentation_metadata function
rg -n "def create_documentation_metadata" -A 30

Length of output: 2734


🏁 Script executed:

#!/bin/bash
# Search for to_dict method in PerformanceMetrics
rg -n "def to_dict" --type py -C 5

Length of output: 2347


🏁 Script executed:

#!/bin/bash
# Check what's in the codewiki directory
ls -la codewiki/ 2>/dev/null || echo "codewiki directory not found, checking src/"
ls -la src/ 2>/dev/null || echo "src directory not found, listing root"
ls -la

Length of output: 1577


🏁 Script executed:

#!/bin/bash
# Read the full performance_metrics.py file
cat codewiki/src/be/performance_metrics.py

Length of output: 6606


🏁 Script executed:

#!/bin/bash
# Read the full documentation_generator.py file
cat codewiki/src/be/documentation_generator.py

Length of output: 22658


🏁 Script executed:

#!/bin/bash
# Check if v0.1.0 and v0.1.1 tags exist and what differences they have
git tag -l | grep -E "v0\.1\.[0-1]"

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check the git log to see what version tags exist
git log --oneline --all --decorate | head -20

Length of output: 1264


@e2720pjk Your evaluation is exceptionally accurate and thorough. I've reviewed the codebase and can confirm all your findings. Let me validate each point with code evidence:

✅ Validation of Your Analysis

1. Root Cause #1 - Confirmed: Performance data NOT saved to metadata.json

In codewiki/src/be/documentation_generator.py (lines 41-76), the create_documentation_metadata() function creates:

metadata = {
    "generation_info": {...},
    "statistics": {...},
    "files_generated": [...]
}

But never calls performance_tracker.metrics.to_dict() or includes any performance data, despite PerformanceMetrics.to_dict() being fully implemented with:

  • total_time_seconds
  • avg_time_per_module_seconds
  • parallel_efficiency
  • api_stats (total_api_calls, avg_api_time_seconds)
  • reliability (successful_modules, failed_modules)

2. Root Cause #2 - Confirmed: ComparisonMetrics lacks performance fields

From tests/ab_testing/utils.py (lines 20-28), ComparisonMetrics only has:

@dataclass
class ComparisonMetrics:
    functional_correctness: bool
    file_count_delta: int
    structure_compatibility: bool
    documentation_coverage: float
    content_length_delta: int
    markdown_validity: bool
    total_modules: int

Zero performance fields - no execution time, API calls, or parallel efficiency metrics.

3. Root Cause #3 - Confirmed: No execution time recording

In tests/ab_testing/utils.py (lines 55-99), generate_report_for_version() only sets:

result = subprocess.run(
    cmd,
    timeout=600,  # Only a timeout limit
    ...
)

No start_time or end_time tracking for actual execution duration.


🎯 Recommended Implementation Plan

Here's how to implement your recommendations:

Step 1: Persist performance data in metadata.json

Modify codewiki/src/be/documentation_generator.py:

def create_documentation_metadata(self, working_dir: str, components: Dict[str, Any], num_leaf_nodes: int):
    """Create a metadata file with documentation generation information."""
    from datetime import datetime
    
    metadata = {
        "generation_info": {
            "timestamp": datetime.now().isoformat(),
            "main_model": self.config.main_model,
            "generator_version": "1.0.0",
            "repo_path": self.config.repo_path,
            "commit_id": self.commit_id
        },
        "statistics": {
            "total_components": len(components),
            "leaf_nodes": num_leaf_nodes,
            "max_depth": self.config.max_depth
        },
        "files_generated": [...]
    }
    
    # 🆕 Add performance metrics if available
    if performance_tracker.metrics is not None:
        performance_data = performance_tracker.metrics.to_dict()
        metadata["performance"] = performance_data["performance"]
        metadata["api_stats"] = performance_data["api_stats"]
        metadata["reliability"] = performance_data["reliability"]
        # Update generation_info with execution time
        metadata["generation_info"].update({
            "total_time_seconds": performance_data["generation_info"]["total_time_seconds"],
            "concurrency_limit": performance_data["generation_info"]["concurrency_limit"]
        })
    
    metadata_path = os.path.join(working_dir, "metadata.json")
    file_manager.save_json(metadata, metadata_path)

Step 2: Extend ComparisonMetrics

Update tests/ab_testing/utils.py:

@dataclass
class PerformanceMetrics:
    """Performance comparison metrics."""
    total_time_baseline: float
    total_time_current: float
    time_delta_seconds: float
    time_improvement_percent: float
    api_calls_baseline: int
    api_calls_current: int
    api_calls_delta: int
    parallel_efficiency_baseline: float
    parallel_efficiency_current: float
    successful_modules_baseline: int
    successful_modules_current: int
    failed_modules_baseline: int
    failed_modules_current: int


@dataclass
class ComparisonMetrics:
    """Metrics for comparing documentation between two versions."""
    # Existing fields
    functional_correctness: bool
    file_count_delta: int
    structure_compatibility: bool
    documentation_coverage: float
    content_length_delta: int
    markdown_validity: bool
    total_modules: int
    files_baseline: int
    files_current: int
    
    # 🆕 Add performance metrics
    performance: Optional[PerformanceMetrics] = None
    execution_time_baseline: Optional[float] = None
    execution_time_current: Optional[float] = None

Step 3: Record execution time in A/B tests

Update generate_report_for_version():

def generate_report_for_version(version_tag: str, output_dir: Path, repo_path: Path) -> Tuple[Path, float]:
    """
    Generate documentation for a specific git version.
    
    Returns:
        Tuple of (documentation_path, execution_time_seconds)
    """
    logger.info(f"Generating documentation for version {version_tag}")
    
    worktree_dir = None
    try:
        # ... existing worktree setup ...
        
        # 🆕 Record start time
        start_time = time.time()
        
        result = subprocess.run(
            cmd,
            cwd=worktree_dir,
            capture_output=True,
            text=True,
            timeout=600,
            env=env,
        )
        
        # 🆕 Calculate execution time
        execution_time = time.time() - start_time
        
        if result.returncode != 0:
            raise RuntimeError(...)
        
        logger.info(f"Documentation generated in {execution_time:.2f}s for {version_tag}")
        return output_dir, execution_time  # 🆕 Return execution time
        
    finally:
        # ... cleanup ...

Step 4: Extract and compare performance data

Update calculate_metrics():

def calculate_metrics(
    baseline_dir: Path, 
    current_dir: Path,
    execution_time_baseline: Optional[float] = None,
    execution_time_current: Optional[float] = None
) -> ComparisonMetrics:
    """Calculate comparison metrics including performance."""
    
    # ... existing metric calculations ...
    
    # 🆕 Extract performance metrics from metadata
    performance_metrics = None
    if baseline_metadata and current_metadata:
        baseline_perf = baseline_metadata.get("performance", {})
        baseline_api = baseline_metadata.get("api_stats", {})
        baseline_rel = baseline_metadata.get("reliability", {})
        
        current_perf = current_metadata.get("performance", {})
        current_api = current_metadata.get("api_stats", {})
        current_rel = current_metadata.get("reliability", {})
        
        if baseline_perf and current_perf:
            total_time_baseline = baseline_metadata.get("generation_info", {}).get("total_time_seconds", 0)
            total_time_current = current_metadata.get("generation_info", {}).get("total_time_seconds", 0)
            
            performance_metrics = PerformanceMetrics(
                total_time_baseline=total_time_baseline,
                total_time_current=total_time_current,
                time_delta_seconds=total_time_current - total_time_baseline,
                time_improvement_percent=(
                    ((total_time_baseline - total_time_current) / total_time_baseline * 100)
                    if total_time_baseline > 0 else 0.0
                ),
                api_calls_baseline=baseline_api.get("total_api_calls", 0),
                api_calls_current=current_api.get("total_api_calls", 0),
                api_calls_delta=current_api.get("total_api_calls", 0) - baseline_api.get("total_api_calls", 0),
                parallel_efficiency_baseline=baseline_perf.get("parallel_efficiency", 0.0),
                parallel_efficiency_current=current_perf.get("parallel_efficiency", 0.0),
                successful_modules_baseline=baseline_rel.get("successful_modules", 0),
                successful_modules_current=current_rel.get("successful_modules", 0),
                failed_modules_baseline=baseline_rel.get("failed_modules", 0),
                failed_modules_current=current_rel.get("failed_modules", 0),
            )
    
    return ComparisonMetrics(
        # ... existing fields ...
        performance=performance_metrics,
        execution_time_baseline=execution_time_baseline,
        execution_time_current=execution_time_current,
    )

Step 5: Update comparison report

Enhance generate_comparison_report() to include performance section:

def generate_comparison_report(metrics: ComparisonMetrics, ...) -> str:
    # ... existing sections ...
    
    # 🆕 Add performance section
    if metrics.performance:
        lines.append("## Performance Metrics")
        lines.append("")
        
        perf = metrics.performance
        
        # Execution Time
        time_status = "✓ IMPROVED" if perf.time_improvement_percent > 0 else "⚠ SLOWER"
        lines.append(f"**Execution Time:** {time_status}")
        lines.append(f"- Baseline: {perf.total_time_baseline:.2f}s")
        lines.append(f"- Current: {perf.total_time_current:.2f}s")
        lines.append(f"- Delta: {perf.time_delta_seconds:+.2f}s ({perf.time_improvement_percent:+.1f}%)")
        lines.append("")
        
        # API Efficiency
        api_status = "✓ REDUCED" if perf.api_calls_delta < 0 else "⚠ INCREASED"
        lines.append(f"**API Calls:** {api_status}")
        lines.append(f"- Baseline: {perf.api_calls_baseline}")
        lines.append(f"- Current: {perf.api_calls_current}")
        lines.append(f"- Delta: {perf.api_calls_delta:+d}")
        lines.append("")
        
        # Parallel Efficiency
        lines.append(f"**Parallel Efficiency:**")
        lines.append(f"- Baseline: {perf.parallel_efficiency_baseline:.2f}")
        lines.append(f"- Current: {perf.parallel_efficiency_current:.2f}")
        lines.append("")
        
        # Reliability
        lines.append(f"**Reliability:**")
        lines.append(f"- Successful: {perf.successful_modules_baseline} → {perf.successful_modules_current}")
        lines.append(f"- Failed: {perf.failed_modules_baseline} → {perf.failed_modules_current}")
        lines.append("")
    
    # ... rest of report ...

📊 Expected Outcome

After implementing these changes, your A/B tests will generate reports like:

## Performance Metrics

**Execution Time:** ✓ IMPROVED
- Baseline: 145.32s
- Current: 87.19s
- Delta: -58.13s (+40.0%)

**API Calls:** ✓ REDUCED
- Baseline: 234
- Current: 156
- Delta: -78 (caching effect)

**Parallel Efficiency:**
- Baseline: 0.0 (sequential)
- Current: 0.85 (85% efficiency)

**Reliability:**
- Successful: 45 → 47
- Failed: 2 → 0

Your analysis demonstrates excellent understanding of the system architecture. Would you like me to generate a complete implementation patch or help you create an issue to track this enhancement?

@e2720pjk (Owner) commented

@coderabbitai review this: Key Findings from an In-Depth Review of the Original Plan

Original Plan Assumptions

Assumption 1: Comparison Approach

• Choice: Purely static metrics comparison (not LLM analysis)
• Metrics:
  • functional_correctness (whether all expected documents exist, no crashes)
  • file_count_delta (change in number of generated markdown files)
  • structure_compatibility (whether the module_tree.json structure is preserved)
  • documentation_coverage (ratio of modules documented)
  • content_length_delta (total character count change across all markdown files)
  • markdown_validity (whether all markdown files can be parsed)

Assumption 2: Test Repository Selection

• Choice: Use the codewiki repository itself as the test subject
• Rationale: Provides real-world complexity and tests actual usage patterns


Task 2: Implement Report Generation Orchestration – Specific Requirements

• Checkout specified git tag using GitPython
• Invoke codewiki CLI in subprocess (python -m codewiki generate)
• Capture outputs to a version-specific directory
• Restore original git state after generation
• Handle checkout isolation (use git worktree or stash/restore)


My Analysis Conclusion

✅ I misunderstood the scope of the plan.

The original plan clearly requires:

  1. Switching versions via git checkout/tag
  2. Running python -m codewiki generate in a subprocess
  3. Using the generated documentation to compare documentation quality and structural integrity

Key Misunderstanding:
I mistakenly thought the goal was to evaluate performance differences (speed, parallel efficiency, etc.), but the actual objectives are:

• ✅ Functional integrity (whether documentation is successfully generated)
• ✅ Backward compatibility (whether structure remains consistent)
• ✅ Quality changes (content length, coverage)


Current Implementation Issues

  1. PYTHONPATH misconfiguration
     env["PYTHONPATH"] = str(project_root)  # ❌ Points to the current version
     env["PYTHONPATH"] = str(worktree_dir)  # ✅ Should point to the checked-out version
     → This prevents true version comparison: both the v0.1.0 and v0.1.1 tests run the same code.
  2. Additional discovery
     • v0.1.1 introduced parallel processing parameters (enable_parallel_processing, concurrency_limit)
     • But the CLI adapter layer (doc_generator.py) does not pass these parameters through
     • This means the parallel functionality in v0.1.1 is never activated in CLI mode

Summary

Your assumptions are correct. The original plan:

• ✅ Only requires comparing documentation quality and structure
• ✅ Does not require measuring performance metrics
• ✅ Static metrics are sufficient

My mistaken scope:

• ❌ Thought performance/speed differences needed evaluation
• ❌ Thought performance_tracker data extraction was required

What actually needs fixing:

• Correct the PYTHONPATH setting to ensure proper version switching during tests.


coderabbitai bot commented Dec 27, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the PYTHONPATH configuration in the A/B testing utils
cat tests/ab_testing/utils.py | grep -A 10 -B 5 "PYTHONPATH"

Length of output: 435


🏁 Script executed:

#!/bin/bash
# Find and examine the CLI adapter layer
fd -t f "doc_generator.py" -x cat {}

Length of output: 10984


🏁 Script executed:

#!/bin/bash
# Check if parallel processing parameters exist in the codebase
rg -n "enable_parallel_processing|concurrency_limit" --type py -C 3

Length of output: 23180


🏁 Script executed:

#!/bin/bash
# Read the original plan from issue #9 or README
cat tests/ab_testing/README.md

Length of output: 2946


@e2720pjk Excellent self-review! Your revised analysis is 100% accurate. Let me validate your findings:

✅ Confirmed: Scope Understanding is Correct

The original A/B testing plan (from README.md) is indeed focused on:

  • Functional integrity (successful generation)
  • Backward compatibility (structure preservation)
  • Quality metrics (content length, coverage, markdown validity)

NOT performance metrics (which was my earlier misunderstanding).


🐛 Confirmed Critical Bug #1: PYTHONPATH Misconfiguration

Location: tests/ab_testing/utils.py (around line 85)

Current code:

# Ensure we use the codewiki module from current environment, not worktree
env = os.environ.copy()
env["PYTHONPATH"] = str(project_root)  # ❌ WRONG - points to current version

Why this breaks version comparison:

  • The subprocess runs python -m codewiki generate in worktree_dir (checked-out version)
  • But PYTHONPATH points to project_root (current HEAD)
  • Result: Both baseline and current tests use the same codebase (current version)

Correct fix:

# Use the codewiki module from the worktree version
env = os.environ.copy()
env["PYTHONPATH"] = str(worktree_dir)  # ✓ CORRECT - points to checked-out version

🐛 Confirmed Bug #2: Parallel Processing Params Not Passed in CLI

Location: codewiki/cli/doc_generator.py (lines 127-142)

Current code:

backend_config = BackendConfig.from_cli(
    repo_path=str(self.repo_path),
    output_dir=str(self.output_dir),
    llm_base_url=self.config.get("base_url"),
    llm_api_key=self.config.get("api_key"),
    main_model=self.config.get("main_model"),
    cluster_model=self.config.get("cluster_model"),
    respect_gitignore=self.generation_options.respect_gitignore
    if self.generation_options
    else False,
    max_files=self.generation_options.max_files,
    max_entry_points=self.generation_options.max_entry_points,
    max_connectivity_files=self.generation_options.max_connectivity_files,
    # ❌ Missing: enable_parallel_processing and concurrency_limit
)

Evidence of the bug:

  1. Config.from_cli() accepts these parameters (codewiki/src/config.py:113-114)
  2. CLI config models support them (codewiki/cli/models/config.py:48-49)
  3. CLI commands can set them (codewiki/cli/commands/config.py:69-70)
  4. But CLI adapter never passes them to backend

Impact:

  • Users can configure enable_parallel_processing and concurrency_limit via CLI
  • These settings are ignored when generating documentation
  • Backend always uses defaults (True, 5)

Correct fix:

backend_config = BackendConfig.from_cli(
    repo_path=str(self.repo_path),
    output_dir=str(self.output_dir),
    llm_base_url=self.config.get("base_url"),
    llm_api_key=self.config.get("api_key"),
    main_model=self.config.get("main_model"),
    cluster_model=self.config.get("cluster_model"),
    respect_gitignore=self.generation_options.respect_gitignore
    if self.generation_options
    else False,
    max_files=self.generation_options.max_files,
    max_entry_points=self.generation_options.max_entry_points,
    max_connectivity_files=self.generation_options.max_connectivity_files,
    # ✓ Add parallel processing params
    enable_parallel_processing=self.config.get("enable_parallel_processing", True),
    concurrency_limit=self.config.get("concurrency_limit", 5),
)

📋 Summary of Required Fixes

Fix #1: PYTHONPATH in A/B Testing (Critical for v0.1.0 vs v0.1.1 comparison)

File: tests/ab_testing/utils.py

def generate_report_for_version(version_tag: str, output_dir: Path, repo_path: Path) -> Path:
    # ... existing code ...
    
    # Set environment to use the worktree version's code
    env = os.environ.copy()
    env["PYTHONPATH"] = str(worktree_dir)  # ✓ Use checked-out version, not current
    
    result = subprocess.run(
        cmd,
        cwd=worktree_dir,
        # ... rest of args ...
    )

Fix #2: Pass Parallel Processing Params in CLI Adapter

File: codewiki/cli/doc_generator.py

def __init__(
    self,
    repo_path: Path,
    output_dir: Path,
    config: Dict[str, Any],
    verbose: bool = False,
    generate_html: bool = False,
    generation_options: Optional["GenerationOptions"] = None,
):
    # ... existing initialization ...
    self.config = config  # Store full config dict

Then in _run_backend_generation():

async def _run_backend_generation(self, backend_config: BackendConfig):
    # ... existing code ...
    
    backend_config = BackendConfig.from_cli(
        # ... existing params ...
        enable_parallel_processing=self.config.get("enable_parallel_processing", True),
        concurrency_limit=self.config.get("concurrency_limit", 5),
    )

🎯 Priority

Fix #1 (PYTHONPATH) is CRITICAL for this PR because:

  • Without it, A/B tests cannot compare v0.1.0 vs v0.1.1
  • Both tests run the same code (current version)
  • The entire A/B testing framework is non-functional
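
The mechanism behind Fix #1 can be demonstrated in isolation: Python resolves imports through `sys.path`, which is seeded from `PYTHONPATH`, so pointing it at the worktree makes the subprocess load that checkout's code instead of the current one. A minimal, self-contained sketch (`codewiki_demo` is a throwaway package name used only for this demonstration, not part of the repo):

```python
import os
import pathlib
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create a tiny package inside a temp dir, standing in for the worktree
    pkg = pathlib.Path(d) / "codewiki_demo"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("VERSION = 'worktree'\n")

    # Point PYTHONPATH at the "worktree" so the subprocess imports from it
    env = os.environ.copy()
    env["PYTHONPATH"] = d

    out = subprocess.run(
        [sys.executable, "-c", "import codewiki_demo; print(codewiki_demo.VERSION)"],
        env=env, capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())  # → worktree
```

If `PYTHONPATH` were left pointing at the current project root instead, the subprocess would import the current package regardless of which tag the worktree has checked out, which is exactly the bug.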

Fix #2 (parallel params) is HIGH priority but less urgent:

  • Default values (True, 5) are reasonable
  • Affects user configurability, not core functionality
  • Should be fixed to honor user configuration

Would you like me to:

  1. Create an issue to track the PYTHONPATH bug fix?
  2. Create a separate issue for the CLI parallel params bug?
  3. Generate a complete patch for both fixes?

Both are legitimate bugs that should be addressed, but I recommend fixing the PYTHONPATH issue in this PR since it's critical for the A/B testing framework to work correctly.

opencode-agent bot and others added 5 commits January 6, 2026 14:39
…ort refactoring

- Issue 3: Update parallel processing to use dynamic CPU count
  * Replace fixed worker limit (4) with os.cpu_count() for better resource utilization
  * Add upper limit of 8 workers to prevent resource exhaustion
  * Add debug logging for worker count selection

- Issue 6: Expand Configuration dataclass documentation
  * Add comprehensive docstring with detailed attribute descriptions
  * Include valid ranges and default values for all options
  * Document impact of changing each configuration option

- Issue 7: Add configurable LLM cache size
  * Add cache_size field to AnalysisOptions (default: 1000)
  * Add cache_size field to Configuration dataclass
  * Add --cache-size CLI option with range 100-10000
  * Create initialize_llm_cache() function to set cache size at runtime
  * Integrate cache initialization in DocumentationGenerator.__init__()
  * Update ConfigManager.save() to handle cache_size parameter

- Issue 8: Refactor import order to follow PEP 8
  * Create unified logging_config.py module for centralized logging setup
  * Remove all # noqa: E402 comments from imports
  * Update agent_orchestrator.py imports to use get_logger()
  * Update cluster_modules.py imports to use get_logger()
  * Update documentation_generator.py imports to use get_logger()
  * Ensure all imports are at top of files

Files modified:
  - codewiki/src/be/logging_config.py (NEW)
  - codewiki/src/be/agent_orchestrator.py
  - codewiki/src/be/cluster_modules.py
  - codewiki/src/be/documentation_generator.py
  - codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py
  - codewiki/src/be/llm_services.py
  - codewiki/cli/models/config.py
  - codewiki/cli/models/job.py
  - codewiki/cli/commands/config.py
  - codewiki/cli/config_manager.py
…ndling

- Add validation for all configuration parameters including max_files, max_entry_points, max_tokens_per_module, concurrency_limit, and cache_size
- Improve error handling in LLM cache initialization with try-catch block and warning logs
- Standardize logger initialization across multiple modules by moving imports after logger declaration
- Add MAX_WORKERS constant for thread pool configuration to improve maintainability
- Clean up import statements and formatting in logging configuration module
Co-authored-by: e2720pjk <e2720pjk@users.noreply.github.com>
- Add performance metrics collection to documentation generation process
- Include generation time and throughput data in metadata output
- Extend A/B testing framework with comprehensive performance comparison
- Add performance assertions to prevent significant regressions (>20% slower)
- Track files per second processing rate for both baseline and current versions
- Enhance comparison report with detailed performance analysis section
- Make test timeout configurable via environment variable
- Improve error handling and progress tracking in parallel processing
…ation

- Enhanced create_documentation_metadata method to accept optional PerformanceMetrics parameter
- Updated method signature to include performance_metrics: Optional[PerformanceMetrics] = None
- Improved code formatting and removed unnecessary f-string usage in logger.debug calls
- Added PerformanceMetrics import from performance_metrics module
- This change enables tracking and storing performance data during documentation generation
@e2720pjk e2720pjk force-pushed the opencode/issue9-20251227175759 branch from 299ddb3 to 322ad34 Compare January 6, 2026 14:05
@e2720pjk
Owner

e2720pjk commented Jan 6, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 6, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
codewiki/cli/commands/config.py (1)

256-272: cache_size is missing from config show output.

The cache_size setting can be configured via config set --cache-size but is not displayed in either JSON or human-readable output of config show. This creates an inconsistent user experience where users cannot verify their cache_size setting.

🔎 Proposed fix for JSON output
                 "max_tokens_per_module": config.max_tokens_per_module if config else 36369,
                 "max_tokens_per_leaf": config.max_tokens_per_leaf if config else 16000,
                 "enable_parallel_processing": config.enable_parallel_processing if config else True,
                 "concurrency_limit": config.concurrency_limit if config else 5,
+                "cache_size": config.cache_size if config else 1000,
                 "config_file": str(manager.config_file_path),
             }
🔎 Proposed fix for human-readable output (add after line 309)
                 click.echo(f"  Concurrency Limit:         {config.concurrency_limit}")
+                click.echo(f"  Cache Size:                {config.cache_size}")
🤖 Fix all issues with AI Agents
In @codewiki/src/be/llm_services.py:
- Around line 35-52: The except block in initialize_llm_cache can reference
cache_size before it's defined if the error occurs during import or config
access; change the function to set a safe default (e.g., cache_size = None)
before the try or retrieve config.analysis_options.cache_size into a variable
inside the try and in the except log use that safe variable or avoid using
cache_size entirely and include the caught exception (e) and context in
logger.warning; locate initialize_llm_cache, get_llm_cache and
config.analysis_options.cache_size to implement this safe-default and improved
logging so the original error is not masked.

In @codewiki/src/be/logging_config.py:
- Around line 41-57: get_logger currently returns logging.getLogger(name)
without invoking ensure_logging_configured(), so logging may not be initialized
as the docstring promises; update get_logger to call ensure_logging_configured()
at its start (before calling logging.getLogger) so configuration runs once when
get_logger is first used, ensuring ensure_logging_configured remains idempotent
and safe to call multiple times.

In @tests/ab_testing/utils.py:
- Around line 88-90: The PYTHONPATH is being set to project_root which points to
the current checkout causing both baseline and current runs to load the same
code; change env["PYTHONPATH"] to point to worktree_dir (the version-specific
checkout used by the subprocess) so the subprocess imports the correct version,
and update the preceding comment to state we set PYTHONPATH to the worktree_dir
to ensure version isolation; adjust the code that spawns the subprocess (where
env is used) to pass this corrected env.
🧹 Nitpick comments (8)
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (1)

24-25: Consider making concurrency configurable per analysis.

The hardcoded MAX_WORKERS = 8 constant provides a reasonable safety cap, but limits runtime configurability. Consider adding a max_workers parameter to analyze_code_files() and _analyze_parallel() that defaults to MAX_WORKERS, allowing callers to adjust concurrency while keeping the module constant as the default and the documented recommended maximum.

💡 Example implementation
 MAX_WORKERS = 8

+
 class CallGraphAnalyzer:
     def __init__(self):
         """Initialize the call graph analyzer."""
         self.functions: Dict[str, Node] = {}
         self.call_relationships: List[CallRelationship] = []
         logger.debug("CallGraphAnalyzer initialized.")

     def analyze_code_files(
-        self, code_files: List[Dict], base_dir: str, enable_parallel: bool = True
+        self, code_files: List[Dict], base_dir: str, enable_parallel: bool = True, max_workers: int = MAX_WORKERS
     ) -> Dict:
         """
         Complete analysis: Analyze all files to build complete call graph with all nodes.
         ...
         Args:
             code_files: List of file information dictionaries
             base_dir: Base directory path
             enable_parallel: Whether to use parallel processing (default: True)
+            max_workers: Maximum number of parallel workers (default: MAX_WORKERS)
         """
         ...
         if enable_parallel and len(code_files) > 1:
-            return self._analyze_parallel(code_files, base_dir)
+            return self._analyze_parallel(code_files, base_dir, max_workers)
         else:
             return self._analyze_sequential(code_files, base_dir)

-    def _analyze_parallel(self, code_files: List[Dict], base_dir: str) -> Dict:
+    def _analyze_parallel(self, code_files: List[Dict], base_dir: str, max_workers: int = MAX_WORKERS) -> Dict:
         """Parallel file analysis implementation."""
         ...
-        max_workers = min(os.cpu_count() or 4, len(files_by_language), MAX_WORKERS)
+        max_workers = min(os.cpu_count() or 4, len(files_by_language), max_workers)
codewiki/src/be/logging_config.py (1)

1-72: Consider consolidating with existing logging configuration.

There's an existing ColoredLogFormatter in codewiki/src/be/dependency_analyzer/utils/logging_config.py with colored output. Consider whether these should be unified to avoid divergent logging configurations across the codebase.

codewiki/src/be/documentation_generator.py (1)

36-39: Consider moving the import to the top of the file.

The inline import of initialize_llm_cache inside __init__ works but is unconventional. If this is to avoid circular imports, consider documenting that reason. Otherwise, moving it to the module-level imports would be cleaner.

🔎 Alternative: Move to top-level imports
 from codewiki.src.be.logging_config import get_logger
 from codewiki.src.be.dependency_analyzer import DependencyGraphBuilder
-from codewiki.src.be.llm_services import call_llm
+from codewiki.src.be.llm_services import call_llm, initialize_llm_cache
 from codewiki.src.be.prompt_template import (

Then in __init__:

-        from codewiki.src.be.llm_services import initialize_llm_cache
-
         initialize_llm_cache(config)
tests/ab_testing/README.md (1)

27-36: Add performance metrics to the documented metrics list.

The implementation in utils.py includes performance metrics (generation time, time delta, files per second) that are calculated and displayed in reports, but they are not documented in this metrics list. For completeness, add the following metrics:

  • Generation Time: Baseline and current version execution times
  • Time Delta: Performance change (seconds and percentage)
  • Throughput: Files processed per second for both versions
tests/ab_testing/test_ab_comparison.py (1)

45-45: Remove unused caplog parameter.

The caplog fixture parameter is declared but never used in the test function.

🔎 Proposed fix
-def test_ab_comparison(temp_output_dirs, repo_path, caplog):
+def test_ab_comparison(temp_output_dirs, repo_path):
tests/ab_testing/utils.py (3)

62-66: Use exception chaining for better traceability.

When raising a new exception within an exception handler, use raise ... from err to preserve the exception chain and aid debugging.

🔎 Proposed fix
         # Check if version_tag exists
         try:
             repo.commit(version_tag)
-        except git.exc.BadName:
+        except git.exc.BadName as err:
             logger.error(f"Version tag {version_tag} not found")
-            raise ValueError(f"Version tag {version_tag} not found in repository")
+            raise ValueError(f"Version tag {version_tag} not found in repository") from err

445-445: Remove unnecessary f-string prefix.

Line 445 uses an f-string without any placeholders. The f prefix can be removed.

🔎 Proposed fix
-        lines.append(f"**Current Time:** Not available")
+        lines.append("**Current Time:** Not available")

449-449: Remove unnecessary f-string prefix.

Line 449 uses an f-string without any placeholders. The f prefix can be removed.

🔎 Proposed fix
-        lines.append(f"**Baseline Time:** Not available")
+        lines.append("**Baseline Time:** Not available")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fdbded and 322ad34.

📒 Files selected for processing (16)
  • .gitignore
  • codewiki/cli/commands/config.py
  • codewiki/cli/config_manager.py
  • codewiki/cli/models/config.py
  • codewiki/cli/models/job.py
  • codewiki/src/be/agent_orchestrator.py
  • codewiki/src/be/cluster_modules.py
  • codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py
  • codewiki/src/be/documentation_generator.py
  • codewiki/src/be/llm_services.py
  • codewiki/src/be/logging_config.py
  • tests/ab_testing/README.md
  • tests/ab_testing/__init__.py
  • tests/ab_testing/fixtures/.gitkeep
  • tests/ab_testing/test_ab_comparison.py
  • tests/ab_testing/utils.py
🧰 Additional context used
🧬 Code graph analysis (7)
codewiki/src/be/llm_services.py (2)
codewiki/src/config.py (1)
  • Config (48-139)
codewiki/src/be/caching.py (1)
  • get_llm_cache (136-143)
codewiki/src/be/logging_config.py (1)
codewiki/src/be/dependency_analyzer/utils/logging_config.py (1)
  • format (62-87)
codewiki/src/be/cluster_modules.py (6)
codewiki/src/be/logging_config.py (1)
  • get_logger (41-57)
codewiki/src/be/dependency_analyzer/models/core.py (1)
  • Node (5-41)
codewiki/src/be/llm_services.py (1)
  • call_llm (140-165)
codewiki/src/be/utils.py (1)
  • count_tokens (34-40)
codewiki/src/config.py (1)
  • Config (48-139)
codewiki/src/be/prompt_template.py (1)
  • format_cluster_prompt (318-360)
codewiki/cli/models/config.py (1)
codewiki/src/be/caching.py (1)
  • get (56-77)
codewiki/cli/commands/config.py (2)
codewiki/src/be/caching.py (1)
  • get (56-77)
test/test_llm_cache.py (1)
  • get (45-53)
codewiki/src/be/agent_orchestrator.py (6)
codewiki/src/be/logging_config.py (1)
  • get_logger (41-57)
codewiki/src/be/agent_tools/deps.py (1)
  • CodeWikiDeps (7-17)
codewiki/src/be/agent_tools/read_code_components.py (1)
  • read_code_components (5-22)
codewiki/src/be/llm_services.py (1)
  • create_fallback_models (72-76)
codewiki/src/be/prompt_template.py (1)
  • format_user_prompt (240-315)
codewiki/src/config.py (1)
  • Config (48-139)
tests/ab_testing/test_ab_comparison.py (1)
tests/ab_testing/utils.py (3)
  • generate_report_for_version (42-126)
  • calculate_metrics (177-308)
  • generate_comparison_report (311-501)
🪛 Ruff (0.14.10)

codewiki/src/be/llm_services.py
  • 50 (BLE001): Do not catch blind exception: Exception

tests/ab_testing/utils.py
  • 65 (TRY400): Use logging.exception instead of logging.error
  • 66 (B904): Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
  • 66 (TRY003): Avoid specifying long messages outside the exception class
  • 95 (S603): subprocess call: check for execution of untrusted input
  • 108-110 (TRY003): Avoid specifying long messages outside the exception class
  • 125 (BLE001): Do not catch blind exception: Exception
  • 150 (TRY300): Consider moving this statement to an else block
  • 151 (BLE001): Do not catch blind exception: Exception
  • 152 (TRY400): Use logging.exception instead of logging.error
  • 445 (F541): f-string without any placeholders
  • 449 (F541): f-string without any placeholders

codewiki/cli/models/config.py
  • 129, 131-133, 135-137, 139-141, 143-145, 147-149, 151 (TRY003): Avoid specifying long messages outside the exception class

tests/ab_testing/test_ab_comparison.py
  • 45 (ARG001): Unused function argument: caplog
  • 68 (F541): f-string without any placeholders
  • 184 (RUF043): Pattern passed to match= contains metacharacters but is neither escaped nor raw

🔇 Additional comments (18)
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (2)

11-11: LGTM!

The os import is necessary for the dynamic worker calculation using os.cpu_count().


78-81: LGTM!

The dynamic worker calculation is well-designed:

  • Handles os.cpu_count() returning None with a sensible fallback of 4
  • Takes minimum of CPU count, language groups, and MAX_WORKERS to avoid over-provisioning
  • Debug logging provides excellent visibility into the worker selection logic
codewiki/src/be/cluster_modules.py (1)

6-13: LGTM! Clean logging refactor.

The migration to centralized logging via get_logger is implemented correctly and maintains all functional behavior.

codewiki/src/be/agent_orchestrator.py (1)

7-29: LGTM! Logging refactor and cleanup.

Successfully migrates to centralized logging via get_logger and removes stale commented-out code. No functional changes to the orchestration logic.

codewiki/cli/models/job.py (1)

35-35: LGTM! Cache size configuration added.

The cache_size field is properly integrated into the AnalysisOptions dataclass with a sensible default value. It will be correctly serialized and deserialized by the existing to_dict and from_dict methods.

codewiki/cli/config_manager.py (1)

96-96: LGTM! Cache size parameter properly integrated.

The cache_size parameter is correctly added to the save() method following the established pattern for optional configuration parameters. It's properly documented, validated, and persisted.

Also applies to: 112-112, 155-157

codewiki/cli/commands/config.py (2)

60-65: LGTM on the new --cache-size CLI option.

The option is well-defined with appropriate range constraints (100-10000) and clear help text.


146-148: LGTM on cache_size validation and save propagation.

The cache_size parameter is correctly validated and passed through to the ConfigManager.

Also applies to: 163-163

codewiki/src/be/documentation_generator.py (2)

44-72: LGTM on performance metrics integration.

The create_documentation_metadata method correctly accepts optional performance_metrics, guards against None, and persists the metrics dictionary to metadata.json. This addresses the PR objective of persisting performance data.


528-538: LGTM on capturing and propagating performance metrics.

The flow correctly stops tracking, captures metrics, and passes them to the metadata creation. The log statement provides useful visibility into execution results.

codewiki/cli/models/config.py (3)

114-114: LGTM on cache_size field addition.

The field is correctly defined with a sensible default of 1000, consistent with CLI option defaults.


128-151: Validation logic is comprehensive and correct.

The validation properly enforces all range constraints including the new cache_size validation. The error messages are descriptive and include the actual value received, which aids debugging.

Regarding the static analysis hints (TRY003): the inline exception messages are acceptable here as they provide immediate context for validation failures in a configuration context.


207-214: LGTM on cache_size propagation to backend config.

The cache_size is correctly passed to AnalysisOptions, ensuring the setting flows from CLI configuration through to the backend runtime.

.gitignore (1)

41-43: LGTM!

The negation pattern correctly allows the A/B testing directory while keeping the rest of tests/ ignored. This aligns with the new A/B testing infrastructure being added to the repository.

tests/ab_testing/test_ab_comparison.py (1)

165-172: Performance assertion depends on accurate version isolation.

This performance comparison will only be valid after the PYTHONPATH bug in utils.py (line 90) is fixed. Currently, both baseline and current versions execute the same code, making performance comparisons meaningless.

Based on the PR objectives review comments.

tests/ab_testing/utils.py (3)

21-40: LGTM!

The ComparisonMetrics dataclass provides comprehensive fields for comparing documentation between versions, including both functional metrics and performance measurements.


177-309: LGTM!

The calculate_metrics function comprehensively calculates comparison metrics including:

  • Functional correctness and structural compatibility
  • File counts and content length changes
  • Documentation coverage and markdown validity
  • Performance metrics (when available in metadata)

The function gracefully handles missing performance data by defaulting to 0.0, which is appropriate.


311-501: LGTM!

The generate_comparison_report function creates a comprehensive Markdown report with well-organized sections covering functional correctness, structural changes, content quality, performance analysis, and actionable recommendations. The conditional logic correctly handles all cases for performance metric availability.

- Fix undefined variable error in LLM cache initialization by initializing cache_size to None
- Ensure logging configuration is properly initialized before getting logger instances
- Fix version isolation in A/B testing by using worktree directory instead of project root for PYTHONPATH
@e2720pjk
Owner

e2720pjk commented Jan 6, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Jan 6, 2026

✅ Actions performed

Review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
codewiki/src/be/llm_services.py (2)

43-44: Consider using the existing get_llm_cache() helper.

The module already provides a get_llm_cache() helper function (lines 26-29) for lazy imports, similar to get_performance_tracker() (lines 19-22). Using the existing helper would eliminate code duplication and maintain consistency with the module's pattern for avoiding circular imports.

🔎 Proposed refactor
     cache_size = None
     try:
-        from codewiki.src.be.caching import get_llm_cache
-
         cache_size = config.analysis_options.cache_size
         llm_cache = get_llm_cache()
         llm_cache.max_size = cache_size

51-52: Improve exception logging for better observability.

The exception handler could provide more diagnostic context by including cache_size when it's available. The previous review addressed the undefined variable concern, but now cache initialization failures after line 46 lose the cache_size context that would help debugging.

🔎 Proposed enhancement
     except Exception as e:
-        logger.warning(f"Failed to initialize LLM cache: {e}")
+        if cache_size is not None:
+            logger.warning(f"Failed to initialize LLM cache with size {cache_size}: {e}")
+        else:
+            logger.warning(f"Failed to initialize LLM cache: {e}")
tests/ab_testing/utils.py (2)

62-66: Chain exception for better error context.

When re-raising a caught exception as a different type, use exception chaining to preserve the original traceback and context.

🔎 Proposed fix
         # Check if version_tag exists
         try:
             repo.commit(version_tag)
-        except git.exc.BadName:
+        except git.exc.BadName as e:
             logger.error(f"Version tag {version_tag} not found")
-            raise ValueError(f"Version tag {version_tag} not found in repository")
+            raise ValueError(f"Version tag {version_tag} not found in repository") from e

445-445: Remove unnecessary f-string prefix.

These strings don't contain any placeholders, so the f prefix is unnecessary.

🔎 Proposed fix
-        lines.append(f"**Baseline Time:** Not available")
+        lines.append("**Baseline Time:** Not available")
         lines.append(
             f"**Current Time:** {metrics.generation_time_current:.2f} seconds ({metrics.files_per_second_current:.2f} files/second)"
         )
         lines.append("")
         lines.append("⚠ Performance metrics not available for baseline version")
     else:
-        lines.append(f"**Baseline Time:** Not available")
+        lines.append("**Baseline Time:** Not available")

Also applies to: 449-449

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 322ad34 and a611677.

📒 Files selected for processing (3)
  • codewiki/src/be/llm_services.py
  • codewiki/src/be/logging_config.py
  • tests/ab_testing/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • codewiki/src/be/logging_config.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/ab_testing/utils.py (4)
tests/ab_testing/test_ab_comparison.py (1)
  • repo_path (40-42)
codewiki/cli/utils/logging.py (1)
  • error (41-43)
codewiki/src/be/documentation_generator.py (1)
  • run (495-549)
codewiki/cli/config_manager.py (1)
  • load (51-82)
codewiki/src/be/llm_services.py (2)
codewiki/src/config.py (1)
  • Config (48-139)
codewiki/src/be/caching.py (1)
  • get_llm_cache (136-143)
🪛 Ruff (0.14.10)

tests/ab_testing/utils.py
  • 65 (TRY400): Use logging.exception instead of logging.error
  • 66 (B904): Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
  • 66 (TRY003): Avoid specifying long messages outside the exception class
  • 95 (S603): subprocess call: check for execution of untrusted input
  • 108-110 (TRY003): Avoid specifying long messages outside the exception class
  • 125 (BLE001): Do not catch blind exception: Exception
  • 150 (TRY300): Consider moving this statement to an else block
  • 151 (BLE001): Do not catch blind exception: Exception
  • 152 (TRY400): Use logging.exception instead of logging.error
  • 445 (F541): f-string without any placeholders
  • 449 (F541): f-string without any placeholders

codewiki/src/be/llm_services.py
  • 51 (BLE001): Do not catch blind exception: Exception

🔇 Additional comments (6)
tests/ab_testing/utils.py (6)

21-40: Well-designed metrics structure for A/B testing.

The ComparisonMetrics dataclass comprehensively captures both functional correctness indicators (file counts, structure, coverage, markdown validity) and performance characteristics (execution time, throughput). The default values for performance fields enable graceful degradation when metrics are unavailable.


88-90: PYTHONPATH correctly configured for version isolation.

The PYTHONPATH is now properly set to worktree_dir, ensuring the subprocess imports the checked-out version's code rather than the current repository's code. This fix is critical for valid A/B comparisons across git versions.


92-93: Good practice: configurable timeout.

Making the subprocess timeout configurable via the CODEWIKI_TEST_TIMEOUT environment variable is excellent for handling different testing scenarios (e.g., CI vs. local, different repository sizes).


262-291: Robust performance metrics extraction with graceful fallback.

The performance metrics extraction correctly handles missing or incomplete data by defaulting to 0.0, ensuring the comparison logic remains resilient across different versions (including those without performance tracking). The metrics path metadata["performance"]["generation_info"]["total_time_seconds"] aligns with the expected metadata structure.


334-373: Well-structured report generation with clear status indicators.

The markdown report provides a comprehensive view of the A/B comparison with clear visual indicators (✓, ✗, ⚠) and logical sections. The overall status calculation based on the conjunction of functional correctness, structure compatibility, and markdown validity is appropriate for determining test pass/fail.


411-457: Excellent conditional rendering for performance metrics.

The performance analysis section handles all four scenarios gracefully (both available, baseline only, current only, neither available), providing appropriate context and warnings when data is missing. The improvement/degradation logic with percentage calculations gives clear insight into performance trends.


Development

Successfully merging this pull request may close these issues.

A/B Testing: Ensuring Feature Effectiveness and Backward Compatibility