Skip to content

PR: Implemented hybrid AST + Joern CPG analysis#12

Open
e2720pjk wants to merge 12 commits intoopencode-devfrom
vk/95da-joern-integratio
Open

PR: Implemented hybrid AST + Joern CPG analysis#12
e2720pjk wants to merge 12 commits intoopencode-devfrom
vk/95da-joern-integratio

Conversation

@e2720pjk
Copy link
Copy Markdown
Owner

@e2720pjk e2720pjk commented Jan 2, 2026

TLDR

Successfully implemented hybrid AST + Joern CPG analysis for CodeWiki using a Quick Win strategy. This PR adds powerful data flow insights while maintaining system stability and backward compatibility.

Key features include:

  • Hybrid Analysis Engine: Combines traditional AST parsing with Joern Code Property Graph (CPG) analysis.
  • Data Flow Relationships: Enables tracking of parameter-to-local variable flows and cross-module interactions.
  • CLI Integration: Added --use-joern flag to opt-in for enhanced analysis.
  • Graceful Fallback: Automatically reverts to AST analysis if Joern is unavailable or fails.

Dive Deeper

The core philosophy of this integration is "Enhancement over Replacement." Historically, CodeWiki relied solely on AST parsing, which is fast and stable but limited in its ability to track data relationships across function boundaries or within complex logic.

Joern provides a Code Property Graph (CPG) that allows for deep semantic analysis. By integrating Joern as an optional enhancement layer, we can now offer:

  1. Deeper Context: Understanding how data flows from parameters to return values.
  2. Cross-Module Analysis: Identifying relationships that span different files and modules with confidence scoring.
  3. Smart Selection: To manage the performance overhead (approx. 12-13x baseline), the system intelligently selects the top ~20 most "important" functions (based on complexity and connectivity) for deep analysis.

Architecture-wise, the HybridAnalysisService acts as a coordinator, delegating work to both the traditional ASTParser and the new SimplifiedJoernAnalyzer.

Reviewer Test Plan

To validate these changes:

  1. Environment Setup:

    • Ensure Java 19+ is installed.
    • (Optional) Download joern.jar to the project root, or let the CLI auto-download it on first run with --use-joern.
  2. Basic Verification:

    • Run standard analysis to ensure no regressions: codewiki generate
    • Run enhanced analysis: codewiki generate --use-joern
    • Verify that the output documentation includes data flow relationship sections.
  3. Fallback Verification:

    • Rename joern.jar or simulate a Joern failure and run with --use-joern. The system should log a warning and fallback to standard AST analysis without crashing.
  4. Performance Check:

    • Use --verbose flag to see the analysis duration. Expected overhead for small-to-medium repos should be around 0.1s.

Testing Matrix

🍏 🪟 🐧
npm run
npx
Docker
Podman - -
Seatbelt - -

Linked issues / bugs

Related to the ongoing initiative for "Enhanced Code Analysis Support".

Summary by CodeRabbit

  • New Features

    • PHP language support in repository analysis
    • CLI: --fallback-model to set a backup model and --use-joern to enable Joern-backed hybrid analysis
    • Joern-backed hybrid analysis with optional function-level data-flow insights and graph-based clustering
  • Documentation

    • Comprehensive Joern Integration User Guide with setup, usage, CI examples, outputs, and troubleshooting
  • Other

    • Enhanced logging, tracebacks, performance baselines, and improved fallback/diagnostic messages

✏️ Tip: You can customize this high-level summary in your review settings.

brianminhle and others added 9 commits December 17, 2025 19:58
…ki/src/be/dependency_analyzer/analyzers/php.py, codewiki/src/be/dependency_analyzer/analysis/analysis_service.py, codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py, codewiki/cli/utils/repo_validator.py, codewiki/cli/utils/validation.py, codewiki/src/be/dependency_analyzer/utils/patterns.py, codewiki/src/be/prompt_template.py, pyproject.toml]
This introduces hybrid analysis capabilities that combine traditional AST parsing with Joern Code Property Graph (CPG) analysis for deeper insights into code structure and data flow relationships.

Key additions:
- SimplifiedJoernAnalyzer: Basic Joern integration with graceful fallback to AST analysis
- HybridAnalysisService: Combines AST structural analysis with Joern data flow insights
- New CLI flag --use-joern to enable enhanced analysis mode
- DataFlowRelationship model for capturing parameter-to-local variable flows
- Cross-module relationship detection with confidence scoring
- Comprehensive user guide for Joern integration

The implementation provides:
- Automatic Joern JAR detection and setup
- Performance baseline testing comparing AST vs Joern analysis
- Graceful degradation when Joern is unavailable
- Enhanced dependency graphs with data flow information
- Support for Python, Java, JavaScript/TypeScript analysis

This feature significantly improves code understanding by adding data flow analysis while maintaining backward compatibility through the existing AST-only approach.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Jan 2, 2026

📝 Walkthrough

Walkthrough

Adds optional Joern CPG hybrid analysis (AST + Joern) with a --use-joern CLI flag, HybridAnalysisService and multiple Joern adapters/clients, PHP language support across analyzers, a --fallback-model config option, graph-based clustering, enhanced traceback logging, and documentation (JOERN_USER_GUIDE.md).

Changes

Cohort / File(s) Summary
Documentation & Guides
JOERN_USER_GUIDE.md, README.md
New Joern integration user guide and README updates showing multi-model support and --fallback-model examples.
CLI & Config plumbing
codewiki/cli/commands/generate.py, codewiki/cli/commands/config.py, codewiki/cli/config_manager.py, codewiki/cli/models/config.py, codewiki/src/config.py
Added --use-joern generate flag and --fallback-model option; propagated use_joern/fallback_model through config manager, models, CLI outputs (default fallback "glm-4p5").
Hybrid Joern integration & clients
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py, .../joern_poc.py, .../simplified_joern.py, .../joern/client.py, .../joern/joern_analysis_service.py
New HybridAnalysisService orchestrating AST→optional Joern→merge; multiple Joern adapters/clients (PoC, simplified, robust client) with caching, script execution, and Joern-to-internal-model mapping.
Analyzer factory & builder integration
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py, codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py, codewiki/src/be/dependency_analyzer/analysis/analysis_service.py
AnalyzerFactory (HYBRID/JOERN/AST) added; dependency graph builder now accepts hybrid results and converts to internal nodes; file-selection helper when max_files exceeded.
Models & data structures
codewiki/src/be/dependency_analyzer/models/core.py, codewiki/cli/models/job.py
Added CFG/DDG/DataFlow models and JoernMetadata; EnhancedNode and DataFlowRelationship/InhertianceRelationship; GenerationOptions.use_joern and JobStatistics.total_tokens_used added.
Graph clustering & module grouping
codewiki/src/be/dependency_analyzer/graph_clustering.py, codewiki/src/be/cluster_modules.py
New graph-based community detection clustering with fallback to path-based clustering; cluster naming and integration into module clustering flow.
PHP language support
codewiki/src/be/dependency_analyzer/analyzers/php.py, codewiki/cli/utils/repo_validator.py, codewiki/cli/utils/validation.py, codewiki/src/be/prompt_template.py, codewiki/src/be/dependency_analyzer/utils/patterns.py, pyproject.toml
Added tree-sitter PHP analyzer, PHP entry/function patterns, extended supported extensions/language mappings, and added tree-sitter-php dependency.
Call graph & analyzer diagnostics
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py, codewiki/src/be/dependency_analyzer/analyzers/javascript.py, codewiki/src/be/dependency_analyzer/analyzers/typescript.py
Routed PHP files into analysis flow and added traceback logging for parser initialization/error handlers.
Clustering fallback & LLM path
codewiki/src/be/cluster_modules.py
When Joern-enabled graph clustering used, falls back to LLM clustering on failure; preserves existing LLM flow otherwise.
Error logging & diagnostics
codewiki/src/be/agent_orchestrator.py, codewiki/src/be/documentation_generator.py, codewiki/src/be/main.py, codewiki/src/be/utils.py, codewiki/cli/adapters/doc_generator.py
Added traceback.format_exc() logging in multiple exception handlers and bumped generator_version to 1.0.1.
Editor & utilities
codewiki/src/be/agent_tools/str_replace_editor.py, codewiki/src/utils.py
Introduced ViewRange/parse_view_range and stricter typing for view_range; FileManager.save_json serializes sets by converting to lists.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI (generate)
    participant Config as ConfigMgr
    participant HAS as HybridAnalysisService
    participant AST as AST Analyzer
    participant Joern as JoernClient/Service
    participant Merger as Result Merger
    participant DocGen as Documentation Generator

    CLI->>Config: invoke generate (--use-joern)
    Config->>HAS: initialize(enable_joern=True)
    CLI->>HAS: analyze_repository_hybrid(repo_path)

    rect rgba(240,248,255,1)
      Note over HAS,AST: Phase 1 — AST analysis (always)
      HAS->>AST: analyze_local_repository(repo_path, max_files)
      AST-->>HAS: ast_result
    end

    alt Joern enabled & available
      rect rgba(237,255,237,1)
        Note over HAS,Joern: Phase 2 — Joern enhancement
        HAS->>Joern: generate_cpg(repo_path)
        Joern-->>HAS: joern_data (methods, calls, flows)
      end
      rect rgba(255,250,235,1)
        Note over HAS,Merger: Phase 3 — Merge results
        HAS->>Merger: merge(ast_result, joern_data)
        Merger-->>HAS: merged_result
      end
    else Joern disabled or failed
      Note over HAS: Graceful fallback — return ast_result with joern_failed metadata
    end

    HAS-->>DocGen: enriched analysis result
    DocGen->>DocGen: generate documentation/site
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰

I hopped through AST branches, quick and spry,
Joern tunneled secrets where data flows lie,
PHP joined the meadow, calls traced in light,
Fallback cushions softly when Joern's out of sight,
A rabbit cheers — CodeWiki leaps bright!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR implements the core Phase 2 and Phase 3 objectives from issue #11: hybrid analysis service, data flow relationships, CLI --use-joern flag, graceful fallback, and optional Joern enhancement. However, several reviewer comments indicate incomplete or problematic implementations (CLI duplication, script injection risks, missing _extract_cross_module_relationships implementation). Address reviewer findings: fix CLI generator duplication, resolve script injection vulnerabilities in Joern script generation, correct _extract_cross_module_relationships to accept and use relationships parameter, and validate all Joern script interpolation points.
Out of Scope Changes check ⚠️ Warning Most changes are in-scope for hybrid Joern integration. However, several changes appear tangential: extensive PHP language support (analyzers, validators, patterns), traceback logging additions across multiple files, and minor formatting changes (quote style changes in config.py, doc_generator.py) go beyond the Joern integration objectives. Remove or isolate out-of-scope changes: consolidate PHP support into a separate PR, defer non-critical traceback logging enhancements, and revert non-functional formatting changes (quote styles) to keep scope focused on Joern hybrid analysis.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the primary change: implementing hybrid AST + Joern CPG analysis, which is the main focus of this large changeset.
Description check ✅ Passed The description covers TLDR, Dive Deeper, Reviewer Test Plan, Testing Matrix, and Linked Issues sections as required by the template, with substantial detail on implementation approach and validation steps.
Docstring Coverage ✅ Passed Docstring coverage is 91.28% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (26)
codewiki/src/be/agent_tools/str_replace_editor.py (1)

338-343: Add explicit parentheses for clarity.

The boolean expression mixes and and or operators without explicit parentheses, making the precedence unclear. While Python's operator precedence rules evaluate this correctly, explicit parentheses improve readability and prevent potential bugs.

🔎 Suggested fix
 if (
-    direction == 1
-    and best_breakpoint < current_line
-    or direction == -1
-    and best_breakpoint > current_line
+    (direction == 1 and best_breakpoint < current_line)
+    or (direction == -1 and best_breakpoint > current_line)
 ):
codewiki/src/be/utils.py (1)

172-174: Use logging.exception() for automatic traceback capture.

Python's logging.exception() automatically includes the full traceback without requiring manual formatting. Replace the two logger.error() calls with a single logger.exception() call.

🔎 Proposed fix
             if match:
                 core_error = match.group(0).strip()
                 core_error = core_error
             else:
-                logger.error(f"No match found for error pattern, fallback to mermaid-py\n{error_str}")
-                logger.error(f"Traceback: {traceback.format_exc()}")
+                logger.exception(f"No match found for error pattern, fallback to mermaid-py\n{error_str}")
                 raise Exception(error_str)

Note: With this change, the import traceback on line 6 would no longer be needed if this is the only usage.

codewiki/src/be/dependency_analyzer/analyzers/typescript.py (1)

34-35: Use logging.exception() for automatic traceback capture.

Replace manual traceback formatting with logging.exception(), which automatically includes the full traceback.

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"Failed to initialize TypeScript parser: {e}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Failed to initialize TypeScript parser: {e}")
             self.parser = None
             self.ts_language = None

Note: With this change, the import traceback on line 3 would no longer be needed if this is the only usage.

codewiki/src/be/agent_orchestrator.py (1)

147-148: Use logging.exception() for automatic traceback capture.

Replace manual traceback formatting with logging.exception(), which automatically includes the full traceback.

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"Error processing module {module_name}: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Error processing module {module_name}: {str(e)}")
             raise

Note: With this change, the import traceback on line 5 would no longer be needed if this is the only usage.

codewiki/src/be/main.py (1)

60-61: Use logging.exception() for automatic traceback capture.

Replace manual traceback formatting with logging.exception(), which automatically includes the full traceback. This pattern appears across multiple files in this PR (utils.py, typescript.py, agent_orchestrator.py, and main.py).

🔎 Proposed fix
     except Exception as e:
-        logger.error(f"Unexpected error: {str(e)}")
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(f"Unexpected error: {str(e)}")
         raise

Note: With this change, the import traceback on line 14 would no longer be needed if this is the only usage. Consider applying this pattern consistently across all affected files (codewiki/src/be/utils.py line 172-173, codewiki/src/be/dependency_analyzer/analyzers/typescript.py line 34-35, codewiki/src/be/agent_orchestrator.py line 147-148).

codewiki/cli/adapters/doc_generator.py (1)

154-154: Consider removing unused variable.

The generation_time variable is assigned but never used. If timing information isn't needed for logging or metrics, this line can be removed.

🔎 Proposed fix
-            generation_time = time.time() - start_time
             self.job.complete()
codewiki/src/be/cluster_modules.py (1)

77-79: Prefer logging.exception() for cleaner exception logging.

The added traceback logging improves debugging, but Python's logging.exception() automatically includes the traceback, making the code more concise.

🔎 Proposed fix
     except Exception as e:
-        logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...")
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(f"Failed to parse LLM response: {e}. Response: {response[:200]}...")
         return {}

This automatically includes the full traceback without needing the separate traceback import.

codewiki/src/be/dependency_analyzer/analyzers/javascript.py (1)

35-37: Prefer logging.exception() for cleaner exception logging.

The traceback logging aids debugging parser initialization failures. However, logging.exception() would be more idiomatic and eliminate the need for manual traceback formatting.

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"Failed to initialize JavaScript parser: {e}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Failed to initialize JavaScript parser: {e}")
             self.parser = None
             self.js_language = None

This automatically includes the full traceback without the separate traceback import.

codewiki/src/be/dependency_analyzer/models/core.py (1)

55-72: Consider adding validation for the confidence score.

The DataFlowRelationship model is well-structured. The confidence field has a comment indicating it should be 0.0-1.0, but Pydantic allows any float value. Consider adding a validator to enforce the range.

🔎 Proposed enhancement with field validation
+from pydantic import BaseModel, field_validator
 
 class DataFlowRelationship(BaseModel):
     """Data flow relationship extracted from Joern CPG analysis."""
 
     source: str
 
     target: str
 
     flow_type: str  # "parameter", "local", "return", "field"
 
     variable_name: Optional[str] = None
 
     source_line: Optional[int] = None
 
     target_line: Optional[int] = None
 
     confidence: float = 1.0  # Confidence score 0.0-1.0
 
     file_path: Optional[str] = None
+
+    @field_validator('confidence')
+    @classmethod
+    def validate_confidence(cls, v: float) -> float:
+        if not 0.0 <= v <= 1.0:
+            raise ValueError('confidence must be between 0.0 and 1.0')
+        return v
codewiki/src/be/dependency_analyzer/joern_poc.py (2)

27-35: Consider using context manager for temporary directory cleanup.

The temporary directory is created in __init__ and cleanup relies on __del__, which is not guaranteed to run. Consider using tempfile.TemporaryDirectory as a context manager or ensuring cleanup is called explicitly.

🔎 Proposed improvement
     def __init__(self, joern_path: Optional[str] = None):
         """
         Initialize Joern analyzer.

         Args:
             joern_path: Path to Joern binary. If None, will try to find in PATH.
         """
         self.joern_path = joern_path or self._find_joern_binary()
-        self.temp_dir = tempfile.mkdtemp(prefix="joern_analysis_")
+        self._temp_dir_obj = tempfile.TemporaryDirectory(prefix="joern_analysis_")
+        self.temp_dir = self._temp_dir_obj.name

Then remove the manual cleanup() method and rely on the context manager's cleanup.


102-107: Use raise ... from e for proper exception chaining.

When re-raising exceptions, use raise ... from e to preserve the original traceback for easier debugging.

🔎 Proposed fix
         except subprocess.TimeoutExpired:
-            logger.error("Joern analysis timed out")
-            raise RuntimeError("Joern analysis timed out after 5 minutes")
+            logger.error("Joern analysis timed out")
+            raise RuntimeError("Joern analysis timed out after 5 minutes") from None
         except Exception as e:
-            logger.error(f"Joern analysis failed: {str(e)}")
+            logger.exception("Joern analysis failed")
             raise
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)

99-101: Use raise ... from e for proper exception chaining.

When catching and re-raising exceptions, use from e (or from None if intentionally suppressing) to preserve the original traceback.

🔎 Proposed fix
         except Exception as e:
             logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True)
-            raise RuntimeError(f"Hybrid analysis failed: {str(e)}")
+            raise RuntimeError(f"Hybrid analysis failed: {str(e)}") from e

209-222: Avoid shadowing the locals builtin.

Using locals as a variable name shadows Python's built-in locals() function. While it works, it can lead to confusion and subtle bugs.

🔎 Proposed fix
                 if df_result.get("status") != "error":
                     # Create data flow relationships
                     params = df_result.get("parameters", [])
-                    locals = df_result.get("local_variables", [])
+                    local_vars = df_result.get("local_variables", [])

                     # Parameter to local variable flows
                     for param in params:
-                        for local in locals:
+                        for local_var in local_vars:
                             data_flow = DataFlowRelationship(
                                 source=f"{func_name}:{param}",
-                                target=f"{func_name}:{local}",
+                                target=f"{func_name}:{local_var}",
                                 flow_type="parameter_to_local",
-                                variable_name=local,
+                                variable_name=local_var,
                                 confidence=0.8,
                             )
codewiki/src/be/documentation_generator.py (1)

174-174: Prefer logging.exception() over manual traceback logging.

Using logging.exception() automatically includes the full traceback and provides better formatted output compared to manually logging traceback.format_exc(). This reduces code duplication and improves consistency.

🔎 Proposed refactor
-                    logger.error(f"Failed to process module {module_key}: {str(e)}")
-                    logger.error(f"Traceback: {traceback.format_exc()}")
+                    logger.exception(f"Failed to process module {module_key}")

Apply the same pattern to lines 246 and 291:

-            logger.error(f"Error generating parent documentation for {module_name}: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Error generating parent documentation for {module_name}")
-            logger.error(f"Documentation generation failed: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception("Documentation generation failed")

If you adopt logging.exception(), the import traceback on line 6 can also be removed.

Also applies to: 246-246, 291-291

codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)

216-216: Prefer logging.exception() over manual traceback logging.

Similar to other exception handlers in this PR, using logging.exception() would be cleaner than manually logging the traceback.

🔎 Proposed refactor
         except Exception as e:
             if temp_dir:
                 self._cleanup_repository(temp_dir)
-            logger.error(f"Structure analysis failed for {github_url}: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Structure analysis failed for {github_url}")
             raise RuntimeError(f"Structure analysis failed: {str(e)}") from e
codewiki/cli/models/job.py (1)

133-161: Well-structured defensive deserialization.

The refactored from_dict method using .get() with defaults provides good resilience against missing or malformed data. One minor simplification is possible at line 155.

🔎 Optional simplification at line 155

The key check before access can be simplified using .get():

-        if "llm_config" in data and data["llm_config"]:
-            job.llm_config = LLMConfig(**data["llm_config"])
+        if llm_config_data := data.get("llm_config"):
+            job.llm_config = LLMConfig(**llm_config_data)

This uses the walrus operator to both retrieve and check the value in one line.

codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (1)

142-144: Prefer logging.exception() for cleaner exception logging.

The broad Exception catch at line 142 is appropriate here since it's the top-level dispatcher for file analysis, but the logging can be simplified using logging.exception().

🔎 Proposed refactor
         except Exception as e:
-            logger.error(f"⚠️ Error analyzing {file_path}: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"⚠️ Error analyzing {file_path}")

If this refactor is applied across all files in this PR, the import traceback statements can be removed.

codewiki/cli/models/config.py (1)

73-80: Consider whether fallback_model should block is_complete().

The fallback_model has a sensible default ("glm-4p5"), so it will always be truthy unless explicitly set to an empty string. This means adding it to is_complete() is effectively a no-op for normal flows. However, if someone explicitly sets fallback_model="", configuration will be marked incomplete, which may be the intended behavior.

JOERN_USER_GUIDE.md (2)

88-92: Add language specifier to fenced code block.

The code block showing the fallback message should have a language specifier for consistent rendering.

🔎 Proposed fix
-```
+```text
 ⚠️  Joern analysis failed: Error: Unable to access jarfile joern.jar
 🔄 Falling back to AST analysis
 ✓ Documentation generated successfully
-```
+```

268-274: Add language specifier to directory structure code block.

For consistent markdown rendering, add a language specifier.

🔎 Proposed fix
-```
+```text
 my-project/
 ├── joern.jar          # Auto-downloaded or manual
 ├── src/               # Source code
 ├── docs/              # Generated documentation
 └── .gitignore         # Ignore temporary files
-```
+```
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (1)

54-56: Remove unused variable filtered_folders_path.

The variable is assigned but never used since the code block that would use it is commented out.

🔎 Proposed fix
-        filtered_folders_path = os.path.join(
-            self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
-        )
codewiki/cli/commands/generate.py (3)

202-208: Remove unused generation_options variable.

The GenerationOptions object is created but never used.

🔎 Proposed fix
-        # Create generation options
-        generation_options = GenerationOptions(
-            create_branch=create_branch,
-            github_pages=github_pages,
-            no_cache=no_cache,
-            custom_output=output if output != "docs" else None,
-            use_joern=use_joern,
-        )

Alternatively, if generation_options should be used, pass it to the generator or other components.


281-292: Use logger.exception() instead of manually logging traceback.

The current pattern logs the error message and then separately logs the traceback. Using logger.exception() is cleaner and automatically includes the traceback.

🔎 Proposed fix
     except ConfigurationError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(e.message)
         sys.exit(e.exit_code)
     except RepositoryError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(e.message)
         sys.exit(e.exit_code)
     except APIError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(e.message)
         sys.exit(e.exit_code)

This also removes the need for import traceback at line 7.


262-263: Avoid bare except clause.

The bare except catches all exceptions including KeyboardInterrupt and SystemExit. Use a specific exception type.

🔎 Proposed fix
-            except:
+            except Exception:
                 pass
codewiki/src/be/dependency_analyzer/analyzers/php.py (2)

51-51: Consider using modern type hint syntax for optional parameters.

PEP 604 (Python 3.10+) allows T | None instead of Optional[T] or default None without explicit type.

🔎 Proposed fix examples
-    def register_use(self, fqn: str, alias: str = None):
+    def register_use(self, fqn: str, alias: str | None = None):

-    def __init__(self, file_path: str, content: str, repo_path: str = None):
+    def __init__(self, file_path: str, content: str, repo_path: str | None = None):

-    def _get_component_id(self, name: str, parent_class: str = None) -> str:
+    def _get_component_id(self, name: str, parent_class: str | None = None) -> str:

-    def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str = None):
+    def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str | None = None):

-def analyze_php_file(file_path: str, content: str, repo_path: str = None) -> Tuple[List[Node], List[CallRelationship]]:
+def analyze_php_file(file_path: str, content: str, repo_path: str | None = None) -> Tuple[List[Node], List[CallRelationship]]:

Also applies to: 90-90, 149-149, 244-244, 592-592


184-188: Use logger.exception() for better error context.

When catching exceptions, logger.exception() automatically includes the stack trace.

🔎 Proposed fix
         except RecursionError:
             logger.warning(f"Max recursion depth exceeded in {self.file_path}")
         except Exception as e:
-            logger.error(f"Error parsing PHP file {self.file_path}: {e}")
+            logger.exception(f"Error parsing PHP file {self.file_path}: {e}")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce2eb2 and ff03648.

📒 Files selected for processing (30)
  • JOERN_USER_GUIDE.md
  • README.md
  • codewiki/cli/adapters/doc_generator.py
  • codewiki/cli/commands/config.py
  • codewiki/cli/commands/generate.py
  • codewiki/cli/config_manager.py
  • codewiki/cli/models/config.py
  • codewiki/cli/models/job.py
  • codewiki/cli/utils/repo_validator.py
  • codewiki/cli/utils/validation.py
  • codewiki/src/be/agent_orchestrator.py
  • codewiki/src/be/agent_tools/str_replace_editor.py
  • codewiki/src/be/cluster_modules.py
  • codewiki/src/be/dependency_analyzer/analysis/analysis_service.py
  • codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py
  • codewiki/src/be/dependency_analyzer/analyzers/javascript.py
  • codewiki/src/be/dependency_analyzer/analyzers/php.py
  • codewiki/src/be/dependency_analyzer/analyzers/typescript.py
  • codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
  • codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
  • codewiki/src/be/dependency_analyzer/joern_poc.py
  • codewiki/src/be/dependency_analyzer/models/core.py
  • codewiki/src/be/dependency_analyzer/simplified_joern.py
  • codewiki/src/be/dependency_analyzer/utils/patterns.py
  • codewiki/src/be/documentation_generator.py
  • codewiki/src/be/main.py
  • codewiki/src/be/prompt_template.py
  • codewiki/src/be/utils.py
  • codewiki/src/config.py
  • pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (12)
codewiki/src/be/utils.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/dependency_analyzer/analyzers/javascript.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/agent_orchestrator.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/dependency_analyzer/analyzers/typescript.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/main.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (4)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
  • build_graph_from_components (239-268)
  • get_leaf_nodes (271-350)
codewiki/src/config.py (1)
  • Config (46-124)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
  • HybridAnalysisService (23-409)
  • analyze_repository_hybrid (48-101)
codewiki/src/be/dependency_analyzer/ast_parser.py (1)
  • parse_repository (28-45)
codewiki/src/be/documentation_generator.py (1)
codewiki/cli/utils/logging.py (1)
  • error (42-44)
codewiki/cli/models/config.py (1)
codewiki/cli/utils/validation.py (1)
  • validate_model_name (82-98)
codewiki/src/be/dependency_analyzer/analyzers/php.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (2)
  • Node (6-42)
  • CallRelationship (45-52)
codewiki/cli/models/job.py (3)
codewiki/src/fe/background_worker.py (1)
  • start (38-44)
codewiki/src/fe/models.py (1)
  • JobStatus (33-45)
codewiki/cli/models/config.py (2)
  • to_dict (50-52)
  • from_dict (55-71)
codewiki/cli/commands/generate.py (2)
codewiki/cli/models/job.py (1)
  • GenerationOptions (23-30)
codewiki/cli/adapters/doc_generator.py (2)
  • CLIDocumentationGenerator (26-288)
  • generate (114-164)
🪛 LanguageTool
JOERN_USER_GUIDE.md

[style] ~75-~75: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ncies) #### ❌ Not Recommended For: - Very large repositories (1000+ files) - may be s...

(EN_WEAK_ADJECTIVE)


[style] ~193-~193: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...d Usage ### 1. Selective Analysis For very large repositories, analyze specific director...

(EN_WEAK_ADJECTIVE)

🪛 markdownlint-cli2 (0.18.1)
JOERN_USER_GUIDE.md

88-88: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


268-268: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


295-295: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

🪛 Ruff (0.14.10)
codewiki/src/be/utils.py

173-173: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/analysis/analysis_service.py

216-216: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/analyzers/javascript.py

37-37: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/agent_orchestrator.py

148-148: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/analyzers/typescript.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/main.py

61-61: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/config.py

109-109: Local variable repo_name is assigned to but never used

Remove assignment to unused variable repo_name

(F841)

codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py

31-31: Do not catch blind exception: Exception

(BLE001)


54-54: Local variable filtered_folders_path is assigned to but never used

Remove assignment to unused variable filtered_folders_path

(F841)

codewiki/src/be/dependency_analyzer/simplified_joern.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)


82-82: subprocess call: check for execution of untrusted input

(S603)


101-101: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


103-103: Do not catch blind exception: Exception

(BLE001)


104-104: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


104-104: Use explicit conversion flag

Replace with conversion flag

(RUF010)


200-200: Do not catch blind exception: Exception

(BLE001)


213-213: Do not catch blind exception: Exception

(BLE001)


214-214: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


215-215: Use explicit conversion flag

Replace with conversion flag

(RUF010)


255-255: subprocess call: check for execution of untrusted input

(S603)


261-261: Do not catch blind exception: Exception

(BLE001)


274-274: Do not catch blind exception: Exception

(BLE001)


304-304: Do not catch blind exception: Exception

(BLE001)

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py

97-97: Consider moving this statement to an else block

(TRY300)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Consider moving this statement to an else block

(TRY300)


125-125: Do not catch blind exception: Exception

(BLE001)


126-126: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


126-126: Use explicit conversion flag

Replace with conversion flag

(RUF010)


176-176: Consider moving this statement to an else block

(TRY300)


178-178: Do not catch blind exception: Exception

(BLE001)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


203-203: Loop control variable func_info not used within loop body

Rename unused func_info to _func_info

(B007)


224-224: Do not catch blind exception: Exception

(BLE001)


387-387: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)


402-406: Consider moving this statement to an else block

(TRY300)


408-408: Do not catch blind exception: Exception

(BLE001)


409-409: Use explicit conversion flag

Replace with conversion flag

(RUF010)

codewiki/src/be/documentation_generator.py

174-174: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


246-246: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/joern_poc.py

48-48: subprocess call: check for execution of untrusted input

(S603)


57-57: Avoid specifying long messages outside the exception class

(TRY003)


84-84: subprocess call: check for execution of untrusted input

(S603)


94-94: Abstract raise to an inner function

(TRY301)


94-94: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


104-104: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


104-104: Avoid specifying long messages outside the exception class

(TRY003)


106-106: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


106-106: Use explicit conversion flag

Replace with conversion flag

(RUF010)


227-230: f-string without any placeholders

Remove extraneous f prefix

(F541)


291-291: subprocess call: check for execution of untrusted input

(S603)


294-294: Avoid specifying long messages outside the exception class

(TRY003)


305-305: Do not catch blind exception: Exception

(BLE001)


316-316: Local variable analyzer is assigned to but never used

Remove assignment to unused variable analyzer

(F841)


318-318: Consider moving this statement to an else block

(TRY300)


319-319: Do not catch blind exception: Exception

(BLE001)


320-320: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/cluster_modules.py

79-79: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py

142-142: Do not catch blind exception: Exception

(BLE001)


143-143: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


143-143: Use explicit conversion flag

Replace with conversion flag

(RUF010)


144-144: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/agent_tools/str_replace_editor.py

205-205: subprocess call with shell=True identified, security issue

(S602)


339-340: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)


341-342: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear

Parenthesize the and subexpression

(RUF021)

codewiki/src/be/dependency_analyzer/analyzers/php.py

51-51: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


90-90: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


149-149: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


186-186: Do not catch blind exception: Exception

(BLE001)


187-187: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


244-244: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


592-592: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

codewiki/cli/adapters/doc_generator.py

154-154: Local variable generation_time is assigned to but never used

Remove assignment to unused variable generation_time

(F841)


157-157: Consider moving this statement to an else block

(TRY300)


188-188: Do not catch blind exception: Exception

(BLE001)


189-189: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


220-220: Do not catch blind exception: Exception

(BLE001)


221-221: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


221-221: Avoid specifying long messages outside the exception class

(TRY003)


245-245: Do not catch blind exception: Exception

(BLE001)


246-246: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


246-246: Avoid specifying long messages outside the exception class

(TRY003)

codewiki/cli/models/job.py

155-155: Unnecessary key check before dictionary access

Replace with dict.get

(RUF019)

codewiki/cli/commands/generate.py

125-127: Abstract raise to an inner function

(TRY301)


125-127: Avoid specifying long messages outside the exception class

(TRY003)


202-202: Local variable generation_options is assigned to but never used

Remove assignment to unused variable generation_options

(F841)


252-252: Local variable commit_hash is assigned to but never used

Remove assignment to unused variable commit_hash

(F841)


253-253: Local variable current_branch is assigned to but never used

Remove assignment to unused variable current_branch

(F841)


262-262: Do not use bare except

(E722)


262-263: try-except-pass detected, consider logging the exception

(S110)


282-282: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


283-283: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


286-286: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


287-287: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


290-290: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


291-291: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (27)
codewiki/src/be/agent_tools/str_replace_editor.py (2)

785-794: Nice implementation of flexible input parsing.

The parse_view_range function and ViewRange type alias elegantly enable the view_range parameter to accept both list and string inputs (e.g., "[1, 10]"). The use of Pydantic's BeforeValidator is idiomatic and allows graceful error handling downstream if the parsed value doesn't match the expected type.


803-803: Public API enhancement maintains backward compatibility.

The parameter type change to ViewRange enhances the API by accepting string inputs in addition to lists, while maintaining full backward compatibility with existing code that passes list values.

pyproject.toml (1)

40-40: LGTM!

The addition of tree-sitter-php>=0.23.0 appropriately supports the new PHP language analysis capabilities introduced in this PR.

codewiki/cli/adapters/doc_generator.py (1)

139-140: LGTM! Joern and fallback model integration added.

The addition of fallback_model and use_joern configuration options properly integrates the new hybrid analysis capabilities described in the PR objectives. These parameters are correctly passed to the backend config.

README.md (2)

45-53: LGTM! Clear documentation for the new fallback model feature.

The documentation clearly describes the multi-model support and includes the new --fallback-model option in the configuration example. The additions are well-integrated and consistent with the existing documentation style.


105-106: LGTM! Usage documentation updated consistently.

The fallback model option is properly documented in the CLI usage section, maintaining consistency with the quick start example above.

codewiki/cli/utils/repo_validator.py (2)

30-32: LGTM! PHP language support added.

The addition of PHP extensions (.php, .phtml, .inc) aligns with the PR's objective to add PHP language support. The extensions cover the common PHP file types.


63-63: LGTM! Error message updated consistently.

The error message correctly includes PHP in the list of supported languages, maintaining consistency with the newly added extensions.

codewiki/src/be/prompt_template.py (1)

235-239: LGTM! PHP extension mappings added correctly.

The PHP language support additions (.php, .phtml, .inc) are consistent with the changes in other files across this PR.

codewiki/src/be/dependency_analyzer/models/core.py (1)

35-42: LGTM! Good additions for richer node metadata.

The new class_name, display_name, and component_id fields along with the get_display_name() helper method are well-designed for supporting multi-language analysis and cross-module referencing.

codewiki/src/config.py (2)

60-61: LGTM! Joern configuration flag properly added.

The use_joern field is correctly added to the Config dataclass with an appropriate default of False.


63-80: This is not an issue. The from_args method doesn't need to propagate use_joern because it's not used in any code path that supports the --use-joern flag.

The backend CLI in be/main.py (which uses Config.from_args()) only parses --repo-path and has no --use-joern flag. The full CLI in cli/commands/generate.py that does support --use-joern uses a different code path (CLIDocumentationGenerator) and doesn't call Config.from_args(). This is intentional design separation—simpler contexts use from_args() without Joern support, while the full CLI uses its own code path that passes use_joern directly.

Likely an incorrect or invalid review comment.

codewiki/src/be/dependency_analyzer/simplified_joern.py (1)

278-314: LGTM! Useful performance baseline utility.

The run_performance_baseline function provides a clean way to compare AST and Joern analysis timings. The metrics returned (times, counts, speedup) are practical for evaluating the hybrid approach.

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)

412-423: LGTM! Clean factory function for backward compatibility.

The create_hybrid_analysis_service factory function provides a simple interface for creating the service with the Joern toggle.

codewiki/cli/utils/validation.py (1)

174-175: LGTM! PHP language detection added consistently.

The PHP extensions (.php, .phtml, .inc) are correctly added to the language detection, matching the changes across other files in this PR.

codewiki/src/be/documentation_generator.py (1)

46-46: LGTM: Version bump is appropriate.

The version increment from "1.0.0" to "1.0.1" correctly reflects the minor enhancement of adding traceback logging to error handlers.

codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)

300-323: LGTM: Language support additions are consistent.

The PHP and Java language support has been added consistently across the supported languages set, filter method, and public API. The implementation aligns with the PR's objective of expanding language coverage.

codewiki/cli/models/job.py (1)

30-30: LGTM: New fields support Joern integration and token tracking.

The use_joern flag in GenerationOptions and total_tokens_used in JobStatistics align well with the PR objectives for enabling optional Joern CPG analysis and tracking usage metrics.

Also applies to: 40-40

codewiki/src/be/dependency_analyzer/utils/patterns.py (1)

257-264: LGTM: Comprehensive PHP pattern coverage.

The PHP entry point patterns include common framework entry points (Laravel's artisan, Symfony's console) and standard PHP entry files. The function definition patterns appropriately cover PHP visibility modifiers (public, private, protected) alongside the basic function syntax.

Also applies to: 409-409

codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (1)

135-136: LGTM: PHP language support properly integrated.

The PHP analysis follows the established pattern used for other languages:

  • Routing in _analyze_code_file (lines 135-136)
  • Dedicated _analyze_php_file method with proper error handling (lines 305-326)
  • File extension classification for visualization (lines 411-412)

The implementation is consistent with existing Java, C#, and TypeScript analyzers.

Also applies to: 305-326, 411-412

codewiki/cli/models/config.py (1)

29-36: LGTM!

The fallback_model field is well-integrated following the existing patterns for other model fields. Default value is consistent across the class definition (line 35), from_dict (line 69), and the validation in validate() properly validates it using the existing validate_model_name utility.

Also applies to: 48-48

codewiki/cli/commands/config.py (2)

53-63: LGTM!

The --fallback-model CLI option is consistently integrated following the same patterns as other model options (--main-model, --cluster-model).


357-368: LGTM!

The validation logic correctly includes fallback_model in both verbose output and the completeness check, maintaining consistency with the other model fields.

codewiki/cli/config_manager.py (1)

84-121: LGTM!

The fallback_model parameter is properly integrated into the save() method, with consistent default initialization ("glm-4p5") matching the Configuration model definition.

codewiki/src/be/dependency_analyzer/analyzers/php.py (2)

1-18: LGTM - Solid PHP analyzer implementation.

The PHP analyzer is well-structured with clear separation of concerns: namespace resolution, node extraction, and relationship extraction. The use of tree-sitter-php for parsing is appropriate.


332-336: This concern is not valid — the code correctly associates methods with their parent classes.

The parent_class assignment at line 336 occurs before the recursive call at line 340. When child nodes (including methods) are processed, they receive the updated parent_class as a parameter. For method nodes at line 293, parent_class is used directly to construct the method's qualified name, ensuring methods are correctly associated with their parent class.

Likely an incorrect or invalid review comment.

JOERN_USER_GUIDE.md (1)

140-145: URL is valid and accessible.

The download URL correctly redirects to the latest Joern release (currently v4.0.457). The curl command in the documentation will follow the redirect and successfully download the JAR file.

Comment on lines +211 to +242
generator = CLIDocumentationGenerator(
repo_path=repo_path,
output_dir=output_dir,
config={
'main_model': config.main_model,
'cluster_model': config.cluster_model,
'base_url': config.base_url,
'api_key': api_key,
"main_model": config.main_model,
"cluster_model": config.cluster_model,
"fallback_model": config.fallback_model,
"base_url": config.base_url,
"api_key": api_key,
"use_joern": use_joern,
},
verbose=verbose,
generate_html=github_pages
generate_html=github_pages,
)


if use_joern:
logger.info("🚀 Joern CPG analysis enabled for enhanced data flow insights")

# Create generator
generator = CLIDocumentationGenerator(
repo_path=repo_path,
output_dir=output_dir,
config={
"main_model": config.main_model,
"cluster_model": config.cluster_model,
"fallback_model": config.fallback_model,
"base_url": config.base_url,
"api_key": api_key,
},
verbose=verbose,
generate_html=github_pages,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: CLIDocumentationGenerator is instantiated twice; second instance overwrites first and lacks use_joern.

The generator is created at lines 211-224 with use_joern in the config, but then immediately overwritten at lines 229-242 without use_joern. The second generator is what gets used for generator.generate(), so Joern integration is never actually enabled.

🔎 Proposed fix - remove the duplicate instantiation
         if use_joern:
             logger.info("🚀 Joern CPG analysis enabled for enhanced data flow insights")

-        # Create generator
-        generator = CLIDocumentationGenerator(
-            repo_path=repo_path,
-            output_dir=output_dir,
-            config={
-                "main_model": config.main_model,
-                "cluster_model": config.cluster_model,
-                "fallback_model": config.fallback_model,
-                "base_url": config.base_url,
-                "api_key": api_key,
-            },
-            verbose=verbose,
-            generate_html=github_pages,
-        )
-
         # Run generation
         job = generator.generate()
🤖 Prompt for AI Agents
In codewiki/cli/commands/generate.py around lines 211 to 242, the
CLIDocumentationGenerator is instantiated twice and the second instantiation
overwrites the first while omitting the use_joern flag, preventing Joern from
being enabled; fix this by removing the duplicate instantiation (or merging
them) so only one CLIDocumentationGenerator is created with the full config
including "use_joern": use_joern, and ensure the logger.info remains in place
when use_joern is true.

@e2720pjk
Copy link
Copy Markdown
Owner Author

e2720pjk commented Jan 2, 2026

/opencode 審閱下述coderabbit提出的審查建議並修正,commit_msg必須參照過往格式寫法撰寫,確保符合PR審查標準


In codewiki/cli/commands/generate.py around lines 211 to 242, the
CLIDocumentationGenerator is instantiated twice and the second instantiation
overwrites the first while omitting the use_joern flag, preventing Joern from
being enabled; fix this by removing the duplicate instantiation (or merging
them) so only one CLIDocumentationGenerator is created with the full config
including "use_joern": use_joern, and ensure the logger.info remains in place
when use_joern is true.


In codewiki/src/be/dependency_analyzer/analyzers/php.py around lines 20 to 30,
the PHP_PRIMITIVES set is missing several PHP 8 built-in types and common SPL
interfaces/classes/exceptions; update the set to include the missing entries:
add the built-in type "resource"; add SPL interfaces "OuterIterator",
"RecursiveIterator", "SeekableIterator", "SplObserver", "SplSubject"; add SPL
classes "SplDoublyLinkedList", "SplStack", "SplQueue", "SplHeap", "SplMaxHeap",
"SplMinHeap", "SplPriorityQueue", "SplFixedArray", "ArrayIterator",
"SplFileInfo", "SplFileObject", "DirectoryIterator"; and add common SPL
exception classes "BadFunctionCallException", "InvalidArgumentException",
"RuntimeException", "DomainException", "LogicException" so the analyzer
recognizes these as primitives/built-ins.


In codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py around line
58, a local variable parser is created which shadows self.parser and
prevents the Joern-enabled HybridAnalysisService from being used; replace usages
of the local parser with self.parser so the already-initialized parser (from
init) is used for parsing, and remove the redundant local instantiation on
line 58 to avoid shadowing.


Unreachable code due to redundant nested isinstance check.

Lines 72-80 contain unreachable code. The outer condition at line 71 checks isinstance(self.parser, DependencyParser), and then line 72 checks the same condition again. The else block at lines 74-80 can never execute.

🔎 Proposed fix
         # Parse repository
         if isinstance(self.parser, DependencyParser):
-            if isinstance(self.parser, DependencyParser):
-                components = self.parser.parse_repository(filtered_folders or [])
-            else:
-                # HybridAnalysisService uses different interface
-                result = self.parser.analyze_repository_hybrid(
-                    repo_path=self.config.repo_path, max_files=100
-                )
-                # Convert hybrid result to expected format
-                components = result.get("nodes", {})
+            components = self.parser.parse_repository(filtered_folders or [])
         else:
             # HybridAnalysisService uses different interface
             result = self.parser.analyze_repository_hybrid(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

        if isinstance(self.parser, DependencyParser):
            components = self.parser.parse_repository(filtered_folders or [])
        else:
            # HybridAnalysisService uses different interface
            result = self.parser.analyze_repository_hybrid(
                repo_path=self.config.repo_path, max_files=100
            )
            # Convert hybrid result to expected format
            components = result.get("nodes", {})

In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around line
203, the loop variable func_info is declared but never used; rename it to
_func_info to signal it's intentionally unused and avoid linter warnings. Update
the for statement to use _func_info and ensure no other references to func_info
exist in the loop body (if any, replace or remove them accordingly).


In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
230-268, _extract_cross_module_relationships incorrectly reads relationships
from ast_nodes (ast_nodes.get("relationships", [])) but relationships live in
the parent result; update the function to accept a relationships parameter
(e.g., def _extract_cross_module_relationships(self, ast_nodes: Dict[str, Any],
relationships: List[Dict[str, Any]]) ) and change the call site in
_run_joern_enhancement to pass the parent result's relationships list; inside
the function use the passed relationships variable instead of
ast_nodes.get(...); ensure callers are updated and tests adjusted accordingly.


Potential command injection via unsanitized function_name input.

The extract_data_flow method builds a Joern script by directly concatenating function_name and project_path into the script string. If these values come from untrusted input, they could inject malicious Scala code.

🔎 Proposed fix - sanitize inputs
     def extract_data_flow(self, project_path: str, function_name: str) -> Dict[str, Any]:
         """
         Extract data flow information for a specific function.

         Args:
             project_path: Path to the project
             function_name: Name of the function to analyze

         Returns:
             Data flow analysis results
         """
+        # Sanitize function_name to prevent script injection
+        if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', function_name):
+            raise ValueError(f"Invalid function name: {function_name}")
+        
+        # Escape project_path for safe inclusion in script
+        safe_project_path = project_path.replace('"', '\\"')
+
         script = f"""
 import joern._

-val cpg = loadCPG({project_path})
-val method = cpg.method.name({function_name}).head
+val cpg = loadCPG("{safe_project_path}")
+val method = cpg.method.name("{function_name}").head

Note: The current string concatenation approach is also syntactically broken (missing quotes around string literals in Scala).


In codewiki/src/be/dependency_analyzer/joern_poc.py around lines 313-321, the
test creates an unused local variable analyzer and logs errors with
logger.error; replace the unused assignment by either calling JoernAnalyzer()
without storing the result or renaming it to _analyzer if you intend to keep it,
and change the except block to call logger.exception(...) so the traceback is
included; keep the same True/False return behavior.


In codewiki/src/be/dependency_analyzer/simplified_joern.py around lines 186 to
199, the line-number calculation uses the two-character literal "\n" instead of
the actual newline character, causing incorrect line counts; change the counting
to use "\n" (single backslash) when computing line_num = content[:
match.start()].count("\n") + 1, and move the import re to the module top with
the other imports instead of importing inline.


In codewiki/src/be/dependency_analyzer/simplified_joern.py around lines 217 to
245, the Joern script is built by directly interpolating function_name into a
Scala script, which allows script injection; instead, validate or sanitize
function_name (e.g., allow only a safe whitelist of characters like
alphanumerics, underscores and dots or reject otherwise) and escape any
quotes/backslashes before embedding, or better yet change the approach to pass
the function name as a script argument (serialize it safely, e.g., via JSON or
proper quoting) so no untrusted characters are interpreted as Scala code; update
the method to perform validation/escaping and fail with a clear error when the
name is invalid.


In codewiki/src/config.py around lines 109 to 111, the local variable
repo_name is assigned from os.path.basename(os.path.normpath(repo_path)) but
never used; remove the unused assignment (delete the repo_name = ... line) so
the method has no dead variable and keep the subsequent base_output_dir = os.path.join(output_dir, "temp") unchanged.

- Add Joern info log when use_joern is enabled
- Add missing PHP 8 built-in types and SPL classes to PHP_PRIMITIVES
- Fix shadowing local parser variable in dependency_graphs_builder.py
- Remove redundant nested isinstance check in dependency_graphs_builder.py
- Rename unused loop variable func_info to _func_info
- Fix _extract_cross_module_relationships to accept relationships parameter
- Add function name validation to extract_data_flow in joern_poc.py
- Fix unused variable and use logger.exception in test_joern_analyzer
- Fix literal \n to actual newline character in simplified_joern.py
- Add function name validation in simplified_joern.py extract_data_flow_sample
- Remove unused repo_name variable in config.py
@opencode-agent
Copy link
Copy Markdown

opencode-agent bot commented Jan 2, 2026

Perfect! I've successfully addressed all CodeRabbit review suggestions and created a commit. Here's a summary:

Fixed Issues

  1. codewiki/cli/commands/generate.py - Added missing logger.info message when Joern is enabled
  2. codewiki/src/be/dependency_analyzer/analyzers/php.py - Added missing PHP 8 built-in types and SPL classes to PHP_PRIMITIVES
  3. codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py - Fixed shadowing local parser variable that prevented Joern integration
  4. codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py - Removed redundant nested isinstance check with unreachable code
  5. codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py - Renamed unused loop variable to _func_info
  6. codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py - Fixed _extract_cross_module_relationships to accept ast_result parameter instead of incorrectly accessing relationships from ast_nodes
  7. ⚠️ codewiki/src/be/dependency_analyzer/joern_poc.py - Added function name validation (partial fix - full Scala syntax sanitization requires larger refactor)
  8. codewiki/src/be/dependency_analyzer/joern_poc.py - Fixed unused variable and used logger.exception in test function
  9. codewiki/src/be/dependency_analyzer/simplified_joern.py - Fixed literal "\\n" to actual newline character "\n"
  10. codewiki/src/be/dependency_analyzer/simplified_joern.py - Added function name validation and moved import re to module top
  11. codewiki/src/config.py - Removed unused repo_name variable

Commit Created

fix: address CodeRabbit review suggestions

The commit addresses all critical and major issues from the CodeRabbit review, with improvements to:

  • Joern integration correctness
  • PHP analyzer completeness
  • Code quality and unused variables
  • Security (input validation)
  • Logging improvements

New%20session%20-%202026-01-02T17%3A06%3A07.300Z
opencode session  |  github run

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (11)
codewiki/src/be/dependency_analyzer/analyzers/php.py (6)

63-63: Use explicit optional type hint.

PEP 484 recommends explicit optional syntax: alias: str | None = None instead of implicit Optional.

🔎 Proposed fix
-    def register_use(self, fqn: str, alias: str = None):
+    def register_use(self, fqn: str, alias: str | None = None):

102-102: Use explicit optional type hint.

PEP 484 recommends explicit optional syntax for the repo_path parameter.

🔎 Proposed fix
-    def __init__(self, file_path: str, content: str, repo_path: str = None):
+    def __init__(self, file_path: str, content: str, repo_path: str | None = None):

196-199: Use logger.exception for better debugging.

When logging exceptions in an except block, logger.exception automatically includes the traceback, which is valuable for debugging parse errors.

🔎 Proposed fix
         except RecursionError:
             logger.warning(f"Max recursion depth exceeded in {self.file_path}")
         except Exception as e:
-            logger.error(f"Error parsing PHP file {self.file_path}: {e}")
+            logger.exception(f"Error parsing PHP file {self.file_path}: {e}")

256-256: Use explicit optional type hint.

PEP 484 recommends explicit optional syntax for the parent_class parameter.

🔎 Proposed fix
-    def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str = None):
+    def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str | None = None):

161-161: Use explicit optional type hint.

PEP 484 recommends explicit optional syntax for the parent_class parameter.

🔎 Proposed fix
-    def _get_component_id(self, name: str, parent_class: str = None) -> str:
+    def _get_component_id(self, name: str, parent_class: str | None = None) -> str:

604-604: Use explicit optional type hint.

PEP 484 recommends explicit optional syntax for the repo_path parameter.

🔎 Proposed fix
-def analyze_php_file(file_path: str, content: str, repo_path: str = None) -> Tuple[List[Node], List[CallRelationship]]:
+def analyze_php_file(file_path: str, content: str, repo_path: str | None = None) -> Tuple[List[Node], List[CallRelationship]]:
codewiki/cli/commands/generate.py (1)

266-277: Consider using logger.exception() for cleaner traceback logging.

The current approach manually formats tracebacks with logger.error(). Using logger.exception() would automatically include the traceback and is more idiomatic.

🔎 Proposed refactor
     except ConfigurationError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(f"Configuration error: {e.message}")
         sys.exit(e.exit_code)
     except RepositoryError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(f"Repository error: {e.message}")
         sys.exit(e.exit_code)
     except APIError as e:
-        logger.error(e.message)
-        logger.error(f"Traceback: {traceback.format_exc()}")
+        logger.exception(f"API error: {e.message}")
         sys.exit(e.exit_code)

You can also remove the import traceback on line 7 if this is the only use.

codewiki/src/be/dependency_analyzer/joern_poc.py (1)

231-289: Consider refactoring script building for better robustness.

While function_name is now validated, the script is built using fragile string concatenation where project_path is directly concatenated without quoting or sanitization. Consider using a template approach or proper escaping for paths to improve robustness against special characters in paths.

💡 Example improvement
# Escape project_path for safe inclusion
import shlex
safe_project_path = shlex.quote(project_path)

script = f"""
import joern._

val cpg = loadCPG({safe_project_path})
val method = cpg.method.name("{function_name}").head
...
"""
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (3)

99-101: Consider adding exception chaining for better debugging context.

At line 101, the exception is re-raised without preserving the original exception context. Using raise ... from e provides a clearer traceback chain.

🔎 Suggested improvement
         except Exception as e:
             logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True)
-            raise RuntimeError(f"Hybrid analysis failed: {str(e)}")
+            raise RuntimeError(f"Hybrid analysis failed: {str(e)}") from e

125-127: Use logging.exception for automatic traceback inclusion.

At line 126, logger.error is used in an exception handler. Using logger.exception automatically includes the traceback, which significantly aids debugging.

🔎 Suggested improvement
         except Exception as e:
-            logger.error(f"AST analysis failed: {str(e)}")
+            logger.exception(f"AST analysis failed: {str(e)}")
             # Return minimal result to allow analysis to continue

178-180: Consider using logging.exception for better debugging.

Similar to other exception handlers, using logger.exception instead of logger.warning would include the full traceback, making it easier to diagnose Joern enhancement failures.

🔎 Suggested improvement
         except Exception as e:
-            logger.warning(f"Joern enhancement failed: {str(e)}")
+            logger.exception(f"Joern enhancement failed: {str(e)}")
             return {"status": "joern_failed", "error": str(e)}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff03648 and ebc0a1f.

📒 Files selected for processing (7)
  • codewiki/cli/commands/generate.py
  • codewiki/src/be/dependency_analyzer/analyzers/php.py
  • codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
  • codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
  • codewiki/src/be/dependency_analyzer/joern_poc.py
  • codewiki/src/be/dependency_analyzer/simplified_joern.py
  • codewiki/src/config.py
🧰 Additional context used
🧬 Code graph analysis (3)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (5)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
  • build_graph_from_components (239-268)
  • get_leaf_nodes (271-350)
codewiki/src/config.py (1)
  • Config (46-123)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
  • HybridAnalysisService (23-409)
  • analyze_repository_hybrid (48-101)
codewiki/src/be/dependency_analyzer/ast_parser.py (3)
  • DependencyParser (18-145)
  • parse_repository (28-45)
  • save_dependency_graph (129-145)
codewiki/src/utils.py (1)
  • save_json (19-22)
codewiki/src/be/dependency_analyzer/analyzers/php.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (2)
  • Node (6-42)
  • CallRelationship (45-52)
codewiki/cli/commands/generate.py (2)
codewiki/cli/models/job.py (1)
  • GenerationOptions (23-30)
codewiki/cli/adapters/doc_generator.py (2)
  • CLIDocumentationGenerator (26-288)
  • generate (114-164)
🪛 Ruff (0.14.10)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py

31-31: Do not catch blind exception: Exception

(BLE001)


54-54: Local variable filtered_folders_path is assigned to but never used

Remove assignment to unused variable filtered_folders_path

(F841)

codewiki/src/be/dependency_analyzer/analyzers/php.py

63-63: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


102-102: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


161-161: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


198-198: Do not catch blind exception: Exception

(BLE001)


199-199: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


256-256: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


604-604: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py

97-97: Consider moving this statement to an else block

(TRY300)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Consider moving this statement to an else block

(TRY300)


125-125: Do not catch blind exception: Exception

(BLE001)


126-126: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


126-126: Use explicit conversion flag

Replace with conversion flag

(RUF010)


176-176: Consider moving this statement to an else block

(TRY300)


178-178: Do not catch blind exception: Exception

(BLE001)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


224-224: Do not catch blind exception: Exception

(BLE001)


387-387: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)


402-406: Consider moving this statement to an else block

(TRY300)


408-408: Do not catch blind exception: Exception

(BLE001)


409-409: Use explicit conversion flag

Replace with conversion flag

(RUF010)

codewiki/cli/commands/generate.py

125-127: Abstract raise to an inner function

(TRY301)


125-127: Avoid specifying long messages outside the exception class

(TRY003)


202-202: Local variable generation_options is assigned to but never used

Remove assignment to unused variable generation_options

(F841)


237-237: Local variable commit_hash is assigned to but never used

Remove assignment to unused variable commit_hash

(F841)


238-238: Local variable current_branch is assigned to but never used

Remove assignment to unused variable current_branch

(F841)


247-247: Do not use bare except

(E722)


247-248: try-except-pass detected, consider logging the exception

(S110)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


268-268: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


271-271: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


272-272: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


275-275: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


276-276: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/simplified_joern.py

65-65: Avoid specifying long messages outside the exception class

(TRY003)


83-83: subprocess call: check for execution of untrusted input

(S603)


102-102: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


104-104: Do not catch blind exception: Exception

(BLE001)


105-105: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


105-105: Use explicit conversion flag

Replace with conversion flag

(RUF010)


198-198: Do not catch blind exception: Exception

(BLE001)


211-211: Do not catch blind exception: Exception

(BLE001)


212-212: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


213-213: Use explicit conversion flag

Replace with conversion flag

(RUF010)


257-257: subprocess call: check for execution of untrusted input

(S603)


263-263: Do not catch blind exception: Exception

(BLE001)


276-276: Do not catch blind exception: Exception

(BLE001)


306-306: Do not catch blind exception: Exception

(BLE001)

codewiki/src/be/dependency_analyzer/joern_poc.py

48-48: subprocess call: check for execution of untrusted input

(S603)


57-57: Avoid specifying long messages outside the exception class

(TRY003)


84-84: subprocess call: check for execution of untrusted input

(S603)


94-94: Abstract raise to an inner function

(TRY301)


94-94: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


104-104: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


104-104: Avoid specifying long messages outside the exception class

(TRY003)


106-106: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


106-106: Use explicit conversion flag

Replace with conversion flag

(RUF010)


229-229: Avoid specifying long messages outside the exception class

(TRY003)


232-235: f-string without any placeholders

Remove extraneous f prefix

(F541)


296-296: subprocess call: check for execution of untrusted input

(S603)


299-299: Avoid specifying long messages outside the exception class

(TRY003)


310-310: Do not catch blind exception: Exception

(BLE001)


323-323: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (12)
codewiki/src/be/dependency_analyzer/analyzers/php.py (2)

20-42: PHP primitives set looks comprehensive now.

The additions of PHP 8 built-in types and SPL classes/interfaces/exceptions address the previous review feedback. The set now properly recognizes these as primitives to exclude from dependency analysis.


1-617: Excellent PHP analyzer implementation!

This is a comprehensive and well-structured PHP analyzer that correctly handles:

  • All major PHP constructs (classes, interfaces, traits, enums, functions, methods)
  • Namespace resolution with use statements (including group use syntax)
  • Dependency relationships (extends, implements, new, static calls, property promotion)
  • PHPDoc extraction and rich metadata
  • Template file exclusion
  • Recursion depth protection

The three-pass analysis approach (namespace info → nodes → relationships) is clean and maintainable. Integration with the project's models is correct.

codewiki/src/be/dependency_analyzer/simplified_joern.py (2)

220-222: Good fix: Function name validation prevents script injection.

The addition of regex validation for function_name properly mitigates the script injection risk flagged in previous reviews.


189-189: Good fix: Line number calculation now uses correct newline character.

The code now correctly uses "\n" instead of the literal "\\n", fixing the line counting bug from previous reviews.

codewiki/src/be/dependency_analyzer/joern_poc.py (2)

226-229: Good fix: Function name validation prevents script injection.

The regex validation for function_name properly addresses the injection risk flagged in previous reviews.


321-325: Good fix: Proper exception handling and unused variable addressed.

The code now correctly uses _ for the unused analyzer instance and logger.exception() to include the traceback, addressing both issues from previous reviews.

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (6)

161-163: LGTM! Bug fix confirmed.

The call to _extract_cross_module_relationships now correctly passes ast_result (the full analysis result) instead of just the nodes dict. This resolves the previous issue where relationships were being read from the wrong source.


203-203: LGTM! Unused variable fix confirmed.

The loop variable is now correctly prefixed with underscore (_func_info) to indicate it's intentionally unused.


230-268: LGTM! Cross-module relationship extraction is now correct.

The method signature and implementation have been fixed to correctly:

  • Accept ast_result (the full analysis result) as a parameter
  • Read relationships from ast_result.get("relationships", [])
  • Extract cross-module call patterns appropriately

This resolves the previous issue where relationships were being read from the wrong source.


270-323: LGTM! Function importance scoring is well-designed.

The scoring heuristic appropriately balances multiple factors:

  • Popularity (dependents count)
  • Documentation presence
  • Key name patterns (init, main, etc.)
  • Complexity estimation (parameter count)

This provides a reasonable selection of functions for detailed data flow analysis.


325-367: LGTM! Result merging is implemented correctly.

The merge logic appropriately combines:

  • AST structural information (nodes, relationships)
  • Joern enhancements (data flows, cross-module edges)
  • Updated summary metadata

The use of model_dump() at line 352 is correct since DataFlowRelationship is a Pydantic model.


412-423: LGTM! Factory function provides clean instantiation API.

The factory function appropriately encapsulates service creation and provides a clear interface for enabling/disabling Joern integration.

Comment on lines 237 to 238
commit_hash = get_git_commit_hash(repo_path)
current_branch = get_git_branch(repo_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused variables commit_hash and current_branch.

These variables are assigned but never used in the function.

🔎 Proposed fix
-        commit_hash = get_git_commit_hash(repo_path)
-        current_branch = get_git_branch(repo_path)
-
         if is_git_repository(repo_path):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
commit_hash = get_git_commit_hash(repo_path)
current_branch = get_git_branch(repo_path)
if is_git_repository(repo_path):
🧰 Tools
🪛 Ruff (0.14.10)

237-237: Local variable commit_hash is assigned to but never used

Remove assignment to unused variable commit_hash

(F841)


238-238: Local variable current_branch is assigned to but never used

Remove assignment to unused variable current_branch

(F841)

🤖 Prompt for AI Agents
In codewiki/cli/commands/generate.py around lines 237 to 238, the variables
commit_hash and current_branch are assigned but never used; remove these unused
assignments (or replace them with a deliberate underscore-prefixed ignore
variable if side effects from the function calls are required) so the function
no longer creates unused local variables and linter warnings go away.

Comment on lines 54 to 56
filtered_folders_path = os.path.join(
self.config.dependency_graph_dir,
f"{sanitized_repo_name}_filtered_folders.json"
self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused variable filtered_folders_path.

The filtered_folders_path variable is assigned but never used, since the code that would consume it (lines 58-66) is commented out.

🔎 Proposed fix
         dependency_graph_path = os.path.join(
             self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json"
         )
-        filtered_folders_path = os.path.join(
-            self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
-        )

         filtered_folders = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
filtered_folders_path = os.path.join(
self.config.dependency_graph_dir,
f"{sanitized_repo_name}_filtered_folders.json"
self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
)
dependency_graph_path = os.path.join(
self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json"
)
filtered_folders = None
🧰 Tools
🪛 Ruff (0.14.10)

54-54: Local variable filtered_folders_path is assigned to but never used

Remove assignment to unused variable filtered_folders_path

(F841)

🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py around lines
54 to 56, the variable filtered_folders_path is assigned but never used (the
code that would use it is commented out); either remove this unused assignment
or restore/implement the intended logic that reads/writes the filtered_folders
JSON using that path—update imports if necessary and run tests to ensure no
references remain.

Comment on lines +213 to +222
for param in params:
for local in locals:
data_flow = DataFlowRelationship(
source=f"{func_name}:{param}",
target=f"{func_name}:{local}",
flow_type="parameter_to_local",
variable_name=local,
confidence=0.8,
)
data_flows.append(data_flow)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Naive data flow logic may generate many false positives.

The current implementation creates a data flow relationship between EVERY parameter and EVERY local variable (Cartesian product). For a function with 3 parameters and 5 locals, this generates 15 relationships, most of which are likely false positives.

This heuristic approach may be acceptable for a PoC, but consider:

  • Using actual data flow information from Joern (if available)
  • Adding filtering based on variable usage patterns
  • Documenting this as a known limitation
💡 Suggested improvements
  1. Option 1: Use Joern's actual data flow edges instead of creating all combinations:
# If Joern provides actual flows, use those:
flows = df_result.get("data_flows", [])
for flow in flows:
    data_flow = DataFlowRelationship(
        source=f"{func_name}:{flow['source']}",
        target=f"{func_name}:{flow['target']}",
        flow_type="data_flow",
        variable_name=flow['target'],
        confidence=0.9,
    )
    data_flows.append(data_flow)
  1. Option 2: Add a limit or reduce confidence:
# Create at most N flows per parameter
for param in params:
    for local in locals[:3]:  # Limit to first 3 locals
        data_flow = DataFlowRelationship(
            source=f"{func_name}:{param}",
            target=f"{func_name}:{local}",
            flow_type="parameter_to_local",
            variable_name=local,
            confidence=0.3,  # Lower confidence for heuristic
        )
        data_flows.append(data_flow)
  1. Option 3: Document the limitation:
# Parameter to local variable flows (heuristic approximation)
# TODO: Replace with actual Joern data flow edges for accuracy
for param in params:
    for local in locals:
        # Note: This creates all combinations; refine with actual flow analysis
        ...
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
213 to 222 the current logic builds a Cartesian product of every parameter and
every local, which produces many false positives; change this to prefer real
data-flow edges from Joern if present by checking df_result for actual
data_flows and iterating those to create DataFlowRelationship entries (use a
higher confidence, e.g., 0.9); if Joern flows are not available, replace the
full Cartesian product with a conservative heuristic: limit the number of locals
per parameter (e.g., first N or those actually referenced with the parameter),
lower the confidence (e.g., 0.3–0.5), and/or filter locals by simple usage
patterns (e.g., appearing in the same statements or having matching types);
finally add a short TODO comment documenting this limitation and that Joern data
flow edges should be used when possible.

Comment on lines +387 to +390
for node_id, node in ast_result.get("nodes", {}).items():
if node.get("name") == function_name:
function_data = node
break
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Rename unused loop variable node_id to _node_id.

The loop variable node_id is not used within the loop body (only node is used). Prefix it with an underscore to indicate it's intentionally unused, consistent with the fix at line 203.

🔎 Proposed fix
-            for node_id, node in ast_result.get("nodes", {}).items():
+            for _node_id, node in ast_result.get("nodes", {}).items():
                 if node.get("name") == function_name:
                     function_data = node
                     break
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for node_id, node in ast_result.get("nodes", {}).items():
if node.get("name") == function_name:
function_data = node
break
for _node_id, node in ast_result.get("nodes", {}).items():
if node.get("name") == function_name:
function_data = node
break
🧰 Tools
🪛 Ruff (0.14.10)

387-387: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)

🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
387 to 390, the loop declares an unused variable node_id; rename it to _node_id
to signal it's intentionally unused. Update the for loop header to use _node_id
(keeping node) and ensure no other references expect node_id.

…nalysis

Major improvements to the Joern integration to support modern Java versions and ensure robust execution.

Key Changes:
- **Java 25 Support**: Injected `JAVA_TOOL_OPTIONS` with `--add-opens` flags to resolve `InaccessibleObjectException` caused by stricter JDK encapsulation.
- **Robust Wrapper Handling**: Added `-no-version-check` to the Joern CLI invocation to prevent version check failures when extra JVM options are present.
- **Hybrid Analysis Fix**: Corrected data structure mismatch in `HybridAnalysisService` where `CallGraphAnalyzer` output (List) was incompatible with the merge logic (Dict).
- **Core Models**: Updated `Node` model to use `List` instead of `Set` for `depends_on` to ensure JSON serializability.
- **Dependency Builder**: Removed hardcoded file limits when Joern is enabled (`max_files` check bypassed).
- **Graph Clustering**: Integrated `graph_clustering.py` to use NetworkX for community detection on the Joern graph.

Verification:
- Confirmed successful CPG generation on Java 25.
- Verified extraction of ~2700 nodes and ~3000 cross-module edges.
- Validated successful documentation generation without crashes.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (2)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (1)

46-48: Remove unused variable filtered_folders_path.

This variable is assigned but never used since the code that would consume it (lines 51-58) is commented out.

🔎 Proposed fix
         dependency_graph_path = os.path.join(
             self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json"
         )
-        filtered_folders_path = os.path.join(
-            self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
-        )
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)

409-412: Rename unused loop variable node_id to _node_id.

The loop variable node_id is not used within the loop body. Prefix with underscore to indicate intentional non-use.

🧹 Nitpick comments (15)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (2)

17-17: Unused config parameter - remove or implement.

The config parameter is declared but never used. Either remove it to clean up the signature or pass it to the analyzer constructors if configuration is needed.

🔎 Proposed fix
     @staticmethod
-    def create_analyzer(analyzer_type: AnalyzerType = AnalyzerType.HYBRID, config: Any = None):
+    def create_analyzer(analyzer_type: AnalyzerType = AnalyzerType.HYBRID):

42-43: Fix indentation inconsistency.

Line 43 has an extra space before return client.

🔎 Proposed fix
                 if client.is_available:
-                     return client
+                    return client
codewiki/src/be/dependency_analyzer/graph_clustering.py (1)

54-60: Dead code: pass block does nothing.

Lines 56-60 check for small clusters but only execute pass. Either implement the intended logic (merge small clusters into "others") or remove the conditional entirely.

🔎 Proposed fix - remove dead code
     for i, comm in enumerate(communities):
         comm_list = list(comm)
-        if len(comm_list) < min_cluster_size and len(communities) > 1:
-            # Skip very small clusters if we have enough other clusters
-            # We'll merge these into an 'others' or closest parent later?
-            # For now, let's keep them and name them generically
-            pass
-            
+        # TODO: Consider merging clusters smaller than min_cluster_size
+
         # Determine a common path or name for the cluster
codewiki/src/be/cluster_modules.py (2)

102-105: Use logger.exception() instead of manual traceback logging.

logger.exception() automatically includes the traceback in exception handlers, making the explicit traceback.format_exc() call redundant.

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Failed to parse LLM response. Response: {response[:200]}...")
             return {}

48-48: Mutable default argument may cause unexpected behavior.

Using {} as a default value means all calls share the same dict instance. Consider using None as default and initializing inside the function.

🔎 Proposed fix
 def cluster_modules(
     leaf_nodes: List[str],
     components: Dict[str, Node],
     config: Config,
-    current_module_tree: dict[str, Any] = {},
+    current_module_tree: dict[str, Any] = None,
     current_module_name: str = None,
-    current_module_path: List[str] = []
+    current_module_path: List[str] = None
 ) -> Dict[str, Any]:
+    if current_module_tree is None:
+        current_module_tree = {}
+    if current_module_path is None:
+        current_module_path = []
codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (1)

4-4: Remove unused imports.

EnhancedNode and DataFlowRelationship are imported but never used in this file.

🔎 Proposed fix
-from codewiki.src.be.dependency_analyzer.models.core import EnhancedNode, DataFlowRelationship
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (2)

104-112: Unused include_data_flow parameter should be documented or removed.

The include_data_flow parameter exists for interface compatibility with HybridAnalysisService.analyze_repository_hybrid but is never used. Add a docstring note explaining this is intentional for interface parity.

🔎 Proposed fix
     def analyze_repository_hybrid(
         self,
         repo_path: str,
         max_files: int = 100,
         languages: Optional[List[str]] = None,
-        include_data_flow: bool = False
+        include_data_flow: bool = False  # noqa: ARG002 - kept for interface compatibility
     ) -> Dict[str, Any]:
-        """Alias for analyze_local_repository to support hybrid fallback interface."""
+        """Alias for analyze_local_repository to support hybrid fallback interface.
+        
+        Note: include_data_flow is accepted for interface compatibility with
+        HybridAnalysisService but is not used in the AST-only path.
+        """
         return self.analyze_local_repository(repo_path, max_files, languages)

230-235: Use logger.exception() for cleaner error logging.

Replace the manual traceback.format_exc() call with logger.exception() which automatically includes the traceback.

🔎 Proposed fix
         except Exception as e:
             if temp_dir and Path(temp_dir).exists():
                 self._cleanup_repository(temp_dir)
-            logger.error(f"Structure analysis failed for {github_url}: {str(e)}")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Structure analysis failed for {github_url}")
             raise RuntimeError(f"Structure analysis failed: {str(e)}") from e
codewiki/src/be/dependency_analyzer/models/core.py (1)

84-98: Validator ambiguity: both CFG and DDG may contain overlapping keys.

The validator attempts to distinguish between CFG and DDG data by checking for 'nodes' vs 'dependencies' keys. However, this heuristic could fail if a dict contains both keys or neither, returning the raw dict unchanged which may cause downstream issues.

Consider using explicit type discrimination or separate validators per field:

🔎 Proposed improvement using separate validators
-    @field_validator('cfg_data', 'ddg_data', mode='before')
-    @classmethod
-    def validate_graph_data(cls, v: Any) -> Any:
-        """Allow dict or model instance."""
-        if v is None:
-            return None
-        if isinstance(v, dict):
-            try:
-                if 'nodes' in v:  # Likely CFG
-                    return ControlFlowData(**v)
-                elif 'dependencies' in v:  # Likely DDG
-                    return DataFlowData(**v)
-            except Exception:
-                return v  # Return as-is if validation fails
-        return v
+    @field_validator('cfg_data', mode='before')
+    @classmethod
+    def validate_cfg_data(cls, v: Any) -> Any:
+        """Allow dict or ControlFlowData instance."""
+        if v is None:
+            return None
+        if isinstance(v, dict):
+            return ControlFlowData(**v)
+        return v
+
+    @field_validator('ddg_data', mode='before')
+    @classmethod
+    def validate_ddg_data(cls, v: Any) -> Any:
+        """Allow dict or DataFlowData instance."""
+        if v is None:
+            return None
+        if isinstance(v, dict):
+            return DataFlowData(**v)
+        return v
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)

99-101: Chain exception for better traceability.

When re-raising, chain the original exception using raise ... from e to preserve the full traceback.

🔎 Proposed fix
         except Exception as e:
             logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True)
-            raise RuntimeError(f"Hybrid analysis failed: {str(e)}")
+            raise RuntimeError(f"Hybrid analysis failed: {e!s}") from e

125-134: Use logger.exception for automatic traceback inclusion.

When logging errors in exception handlers, logger.exception() automatically includes the traceback without needing exc_info=True.

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"AST analysis failed: {str(e)}")
+            logger.exception(f"AST analysis failed: {e!s}")
             # Return minimal result to allow analysis to continue
             return {
codewiki/src/be/dependency_analyzer/joern/client.py (4)

16-20: Consider making JoernVersion a proper dataclass or enum.

JoernVersion contains only class-level constants but isn't used for version comparison anywhere in the code. If version validation is planned, consider adding comparison logic; otherwise, this could be simplified to module-level constants.


61-121: Validation logic is thorough but has inconsistent return conditions.

The validation correctly handles multiple scenarios (binary in PATH, Homebrew fallbacks, JAR fallback). However, lines 97-100 appear redundant—if resolved_path exists and passes the earlier checks, this block shouldn't be reached. Also, the indentation at lines 99-100 appears inconsistent (extra space).

🔎 Proposed cleanup
             except (subprocess.TimeoutExpired, subprocess.SubprocessError) as e:
                 logger.debug(f"Joern binary validation failed: {e}. Checking path existence.")
                 if os.path.exists(self.joern_path):
                     logger.info(f"Joern binary found at {self.joern_path}")
                     return True
-        
-        # Binary validation failed, try to return True if it just exists and is executable
-        if resolved_path and os.path.exists(resolved_path):
-             logger.info(f"Joern binary found at {resolved_path} (validation skipped)")
-             return True

         # 2. Check JAR fallback

201-202: Chain the timeout exception for better debugging.

Use raise ... from to preserve exception context.

🔎 Proposed fix
-        except subprocess.TimeoutExpired:
-            raise TimeoutError(f"Joern CPG generation exceeded {self.timeout_seconds}s")
+        except subprocess.TimeoutExpired as e:
+            raise TimeoutError(f"Joern CPG generation exceeded {self.timeout_seconds}s") from e

204-210: Unused language parameter.

The language parameter is accepted but never used in _run_joern_analysis. Either implement language-specific frontend selection or remove the parameter.

🔎 Option 1: Use language parameter in script generation
-    def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]:
+    def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]:
         """Execute Joern analysis script."""
         with tempfile.TemporaryDirectory(prefix="joern_client_") as temp_dir:
             repo_abs = str(Path(repo_path).resolve())
-            script_content = self._get_default_analysis_script(repo_abs)
+            script_content = self._get_default_analysis_script(repo_abs, language)

And update _get_default_analysis_script signature accordingly.

🔎 Option 2: Remove unused parameter
-    def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]:
+    def _run_joern_analysis(self, repo_path: str) -> Dict[str, Any]:

Also update the call site in generate_cpg.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebc0a1f and 2621b4e.

📒 Files selected for processing (12)
  • codewiki/src/be/cluster_modules.py
  • codewiki/src/be/dependency_analyzer/analysis/analysis_service.py
  • codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py
  • codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
  • codewiki/src/be/dependency_analyzer/graph_clustering.py
  • codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
  • codewiki/src/be/dependency_analyzer/joern/__init__.py
  • codewiki/src/be/dependency_analyzer/joern/client.py
  • codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py
  • codewiki/src/be/dependency_analyzer/models/core.py
  • codewiki/src/utils.py
  • pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (6)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (7)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
  • build_graph_from_components (239-268)
  • get_leaf_nodes (271-350)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (2)
  • AnalyzerType (8-11)
  • create_analyzer (17-51)
codewiki/src/be/dependency_analyzer/ast_parser.py (3)
  • DependencyParser (18-145)
  • parse_repository (28-45)
  • save_dependency_graph (129-145)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
  • analyze_repository_hybrid (104-112)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
  • analyze_repository_hybrid (48-101)
codewiki/src/be/dependency_analyzer/models/core.py (2)
  • Node (55-75)
  • EnhancedNode (78-98)
codewiki/src/utils.py (1)
  • save_json (19-27)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (2)
codewiki/src/be/dependency_analyzer/utils/patterns.py (5)
  • is_entry_point_file (462-485)
  • is_entry_point_path (488-504)
  • has_high_connectivity_potential (507-533)
  • find_fallback_entry_points (560-612)
  • find_fallback_connectivity_files (615-647)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
  • analyze_repository_hybrid (48-101)
codewiki/src/be/cluster_modules.py (4)
codewiki/src/be/utils.py (1)
  • count_tokens (32-38)
codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
  • cluster_graph_by_communities (9-72)
codewiki/src/be/prompt_template.py (1)
  • format_cluster_prompt (308-337)
codewiki/src/be/llm_services.py (1)
  • call_llm (58-86)
codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (1)
  • Node (55-75)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (3)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
  • AnalysisService (32-404)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
  • HybridAnalysisService (23-431)
codewiki/src/be/dependency_analyzer/joern/client.py (1)
  • JoernClient (22-328)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (4)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (3)
  • AnalysisService (32-404)
  • analyze_local_repository (49-102)
  • analyze_repository (407-418)
codewiki/src/be/dependency_analyzer/models/core.py (3)
  • Node (55-75)
  • DataFlowRelationship (108-119)
  • EnhancedNode (78-98)
codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (2)
  • JoernAnalysisService (8-65)
  • analyze_repository (15-65)
codewiki/src/be/dependency_analyzer/simplified_joern.py (1)
  • extract_data_flow_sample (215-266)
🪛 Ruff (0.14.10)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py

46-46: Local variable filtered_folders_path is assigned to but never used

Remove assignment to unused variable filtered_folders_path

(F841)


92-92: Do not catch blind exception: Exception

(BLE001)

codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py

22-22: Avoid specifying long messages outside the exception class

(TRY003)

codewiki/src/be/dependency_analyzer/analysis/analysis_service.py

109-109: Unused method argument: include_data_flow

(ARG002)


233-233: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


233-233: Use explicit conversion flag

Replace with conversion flag

(RUF010)


234-234: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/cluster_modules.py

80-80: Do not catch blind exception: Exception

(BLE001)


96-96: Use of possibly insecure function; consider using ast.literal_eval

(S307)


102-102: Do not catch blind exception: Exception

(BLE001)


103-103: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


104-104: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

codewiki/src/be/dependency_analyzer/graph_clustering.py

47-47: Do not catch blind exception: Exception

(BLE001)


98-98: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py

17-17: Unused static method argument: config

(ARG004)


32-32: Do not catch blind exception: Exception

(BLE001)


45-45: Abstract raise to an inner function

(TRY301)


45-45: Avoid specifying long messages outside the exception class

(TRY003)


46-46: Do not catch blind exception: Exception

(BLE001)

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py

97-97: Consider moving this statement to an else block

(TRY300)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


101-101: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Consider moving this statement to an else block

(TRY300)


125-125: Do not catch blind exception: Exception

(BLE001)


126-126: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


126-126: Use explicit conversion flag

Replace with conversion flag

(RUF010)


171-171: Consider moving this statement to an else block

(TRY300)


173-173: Do not catch blind exception: Exception

(BLE001)


174-174: Use explicit conversion flag

Replace with conversion flag

(RUF010)


178-178: Unused method argument: repo_path

(ARG002)


205-205: Do not catch blind exception: Exception

(BLE001)


409-409: Loop control variable node_id not used within loop body

Rename unused node_id to _node_id

(B007)


424-428: Consider moving this statement to an else block

(TRY300)


430-430: Do not catch blind exception: Exception

(BLE001)


431-431: Use explicit conversion flag

Replace with conversion flag

(RUF010)

codewiki/src/utils.py

24-24: Avoid specifying long messages outside the exception class

(TRY003)

codewiki/src/be/dependency_analyzer/joern/client.py

81-81: subprocess call: check for execution of untrusted input

(S603)


106-106: subprocess call: check for execution of untrusted input

(S603)


107-107: Starting a process with a partial executable path

(S607)


139-139: Consider moving this statement to an else block

(TRY300)


172-172: Avoid specifying long messages outside the exception class

(TRY003)


181-181: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)


182-182: Do not catch blind exception: Exception

(BLE001)


196-196: Do not catch blind exception: Exception

(BLE001)


199-199: Consider moving this statement to an else block

(TRY300)


202-202: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


202-202: Avoid specifying long messages outside the exception class

(TRY003)


204-204: Unused method argument: language

(ARG002)


253-253: subprocess call: check for execution of untrusted input

(S603)


263-263: Avoid specifying long messages outside the exception class

(TRY003)


325-325: Do not catch blind exception: Exception

(BLE001)


326-326: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


326-326: Use explicit conversion flag

Replace with conversion flag

(RUF010)


328-328: Use explicit conversion flag

Replace with conversion flag

(RUF010)

codewiki/src/be/dependency_analyzer/models/core.py

96-96: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (15)
codewiki/src/utils.py (1)

19-27: LGTM! Clean approach to handle set serialization.

The custom default_serializer properly extends JSON serialization for sets while preserving the standard TypeError for other unsupported types. This aligns well with the model changes where depends_on transitioned from Set[str] to List[str].

codewiki/src/be/dependency_analyzer/graph_clustering.py (1)

87-104: LGTM!

The common path extraction correctly normalizes paths and finds the longest common prefix. The zip behavior of stopping at the shortest sequence is the desired behavior here.

codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (1)

15-65: LGTM!

The analyze_repository method cleanly transforms Joern's raw CPG output into the internal node/relationship format with appropriate metadata. The structure is compatible with downstream consumers.

codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)

343-388: LGTM! Well-structured file selection strategy.

The _select_best_files method implements a sensible prioritization: entry points first (30% budget), then high-connectivity files, then remaining files. This ensures important code paths are analyzed when file limits are hit.

codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (2)

61-102: LGTM! Hybrid path correctly implemented.

The branching between DependencyParser and HybridAnalysisService is now correct. The node conversion handles both list and dict formats with appropriate error handling, and the save logic adapts to the parser type.


124-146: LGTM! Robust leaf node filtering.

The enhanced validation correctly filters out invalid identifiers including error-like strings, empty values, and non-string types. This prevents downstream issues with malformed leaf nodes.

codewiki/src/be/dependency_analyzer/models/core.py (2)

6-53: LGTM! Well-structured Joern CPG data models.

The new data models for control flow (CFGNode, CFGEdge, ControlFlowData) and data flow (DataDependency, DataFlowData) are well-designed with appropriate optional fields and sensible defaults using Field(default_factory=list).


129-133: Inconsistency: Repository model fields don't match AI summary.

The AI summary states that name and clone_path were removed from Repository, but they're still present in the code. If these fields are intentionally retained, this is fine; otherwise verify the intended schema.

codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (4)

33-46: LGTM! Clean initialization with graceful Joern availability check.

The initialization properly creates both services and correctly chains the availability check (enable_joern and joern_service.is_available).


211-249: LGTM! Cross-module relationship extraction now correctly uses ast_result.

The method signature and implementation have been updated to accept the full ast_result dict and correctly access relationships from it, addressing the previously reported bug.


271-304: LGTM! Function importance scoring logic is reasonable.

The scoring heuristics (dependents count, docstring presence, name patterns, parameter complexity) provide a sensible prioritization for selecting functions for deeper analysis.


330-340: Good defensive normalization for List→Dict conversion.

This correctly handles the case where CallGraphAnalyzer returns nodes as a List, converting to the Dict format expected by the rest of the hybrid logic.

codewiki/src/be/dependency_analyzer/joern/client.py (3)

40-59: LGTM! Comprehensive JAR discovery with sensible fallback paths.

The search paths cover common installation locations (project root, user home, global paths) and logging on discovery aids debugging.


179-183: Pickle deserialization from cache is acceptable here.

While static analysis flags pickle as a security risk (S301), this cache is locally generated by the same application and stored in a user-controlled directory. The risk is minimal in this context. Consider adding a comment noting this design decision.


310-328: LGTM! Robust output parsing with marker-based extraction and fallback.

The two-stage parsing (markers first, regex fallback) handles different Joern output formats gracefully. The error handling returns a structured error dict rather than raising, which aligns with the graceful degradation design.


response_content = response.split("<GROUPED_COMPONENTS>")[1].split("</GROUPED_COMPONENTS>")[0]
# Use safe literal evaluation if possible, but existing code uses eval
module_tree = eval(response_content)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Security: Replace eval() with ast.literal_eval().

Using eval() on LLM responses is dangerous as it can execute arbitrary Python code. Use ast.literal_eval() which safely evaluates only literal structures (dicts, lists, strings, numbers).

🔎 Proposed fix
+import ast
+
 # ... in the try block:
-            module_tree = eval(response_content)
+            module_tree = ast.literal_eval(response_content)
🧰 Tools
🪛 Ruff (0.14.10)

96-96: Use of possibly insecure function; consider using ast.literal_eval

(S307)

🤖 Prompt for AI Agents
In codewiki/src/be/cluster_modules.py around line 96, replace the unsafe use of
eval(response_content) with ast.literal_eval(response_content): import the ast
module at the top of the file if not present, call ast.literal_eval to parse the
LLM response into module_tree, and wrap the call in a try/except to catch
(ValueError, SyntaxError) and log or raise a controlled error if parsing fails;
ensure no other code expects eval-specific behavior and adjust accordingly.

Comment on lines +177 to +209
def _extract_data_flows_for_functions(
self, repo_path: str, ast_nodes: Dict[str, Any]
) -> List[DataFlowRelationship]:
"""
Extract data flow relationships for key functions.

Args:
repo_path: Repository path
ast_nodes: AST nodes from analysis

Returns:
List of data flow relationships
"""
if not self.joern_client:
return []

data_flows = []

# Select a subset of important functions for data flow analysis
important_functions = self._select_important_functions(ast_nodes, limit=20)

for func_name, _func_info in important_functions.items():
try:
# Optimized: We could use the already generated CPG from generate_cpg
# For now, keep the interface consistent with the new client's capability.
# In a more robust version, we'd query the CPG directly.
pass

except Exception as e:
logger.debug(f"Data flow analysis failed for {func_name}: {e}")
continue

return data_flows
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: References undefined self.joern_client attribute.

_extract_data_flows_for_functions references self.joern_client at line 190, but the class only initializes self.joern_service. This will raise AttributeError at runtime. Additionally, repo_path is unused (the loop body is just pass).

This method appears to be scaffolding for future implementation. Either fix the attribute name or mark this method as not yet implemented more clearly.

🔎 Proposed fix - align with existing attributes
     def _extract_data_flows_for_functions(
         self, repo_path: str, ast_nodes: Dict[str, Any]
     ) -> List[DataFlowRelationship]:
         """
         Extract data flow relationships for key functions.
+        
+        Note: This is a placeholder for future Joern data-flow integration.

         Args:
             repo_path: Repository path
             ast_nodes: AST nodes from analysis

         Returns:
             List of data flow relationships
         """
-        if not self.joern_client:
+        if not self.joern_service or not self.joern_service.is_available:
             return []

         data_flows = []

         # Select a subset of important functions for data flow analysis
         important_functions = self._select_important_functions(ast_nodes, limit=20)

         for func_name, _func_info in important_functions.items():
             try:
-                # Optimized: We could use the already generated CPG from generate_cpg
-                # For now, keep the interface consistent with the new client's capability.
-                # In a more robust version, we'd query the CPG directly.
-                pass
+                # TODO: Implement actual data flow extraction using Joern CPG
+                _ = repo_path  # Suppress unused warning until implemented
+                logger.debug(f"Data flow extraction not yet implemented for {func_name}")

             except Exception as e:
                 logger.debug(f"Data flow analysis failed for {func_name}: {e}")
                 continue

         return data_flows
🧰 Tools
🪛 Ruff (0.14.10)

178-178: Unused method argument: repo_path

(ARG002)


205-205: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
177 to 209, the method _extract_data_flows_for_functions references an undefined
attribute self.joern_client and also leaves repo_path unused; change the
attribute check to use the existing self.joern_service (i.e., if not
self.joern_service: return []), and either implement the scaffolded loop or make
the method explicitly not implemented by raising NotImplementedError (or
returning an empty list) while adding a TODO comment about using repo_path and
replacing the pass with the future implementation; ensure no reference to
self.joern_client remains.

Comment on lines +417 to +422
# Add data flow if available
data_flow_result = {}
if self.joern_analyzer:
data_flow_result = self.joern_analyzer.extract_data_flow_sample(
repo_path, function_name
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: References undefined self.joern_analyzer attribute.

analyze_single_function_with_data_flow references self.joern_analyzer at line 419, but this attribute is never defined. The class initializes self.joern_service instead. This will raise AttributeError at runtime.

🔎 Proposed fix
             # Add data flow if available
             data_flow_result = {}
-            if self.joern_analyzer:
-                data_flow_result = self.joern_analyzer.extract_data_flow_sample(
+            if self.joern_service and self.joern_service.is_available:
+                # Note: JoernAnalysisService doesn't have extract_data_flow_sample
+                # This may need to use the underlying client or a different method
+                data_flow_result = self.joern_service.client.extract_data_flow_sample(
                     repo_path, function_name
                 )

Alternatively, if extract_data_flow_sample isn't available on the service/client, this feature may need further implementation or should be disabled:

             # Add data flow if available
             data_flow_result = {}
-            if self.joern_analyzer:
-                data_flow_result = self.joern_analyzer.extract_data_flow_sample(
-                    repo_path, function_name
-                )
+            # TODO: Implement single-function data flow extraction
+            # Current JoernAnalysisService doesn't expose extract_data_flow_sample

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
417 to 422, the method references an undefined attribute self.joern_analyzer
which causes an AttributeError; replace the check and call to use the existing
self.joern_service (i.e., if self.joern_service: data_flow_result =
self.joern_service.extract_data_flow_sample(repo_path, function_name)), and if
extract_data_flow_sample may not exist on the service, guard with
hasattr(self.joern_service, "extract_data_flow_sample") and fall back to an
empty data_flow_result or disable the feature with a clear log message.

Comment on lines +267 to +308
def _get_default_analysis_script(self, repo_path: str) -> str:
"""Get standard Joern analysis script for Joern 4.x."""
# Escape backslashes for Scala string literal
escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"")

# Use a more robust script format for Joern 4.x
return f'''
// Use language specific frontend if possible, fallback to generic
val cpg = try {{
importCode.python("{escaped_path}")
}} catch {{
case _: Exception => importCode("{escaped_path}", "analysis", "")
}}

val mJson = cpg.method.internal.map {{ m =>
Map(
"name" -> m.name,
"fullName" -> m.fullName,
"file" -> m.filename,
"line" -> m.lineNumber.headOption.getOrElse(-1)
)
}}.toJson

val cJson = cpg.call.map {{ c =>
Map(
"name" -> c.name,
"caller" -> c.method.fullName.headOption.getOrElse(""),
"callee" -> c.methodFullName,
"line" -> c.lineNumber.headOption.getOrElse(-1)
)
}}.toJson

println("---JOERN_START---")
println(s"""{{
"total_files": ${{cpg.file.size}},
"total_methods": ${{cpg.method.size}},
"total_classes": ${{cpg.typeDecl.size}},
"methods": $mJson,
"calls": $cJson
}}""")
println("---JOERN_END---")
'''
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Potential script injection via repo_path in Scala script.

While repo_path is escaped for backslashes and double quotes (line 270), a malicious path containing Scala string interpolation syntax (e.g., ${...}) or other Scala code patterns could potentially inject code into the generated script. Since repo_path typically comes from user input (CLI arguments or config), this warrants additional validation.

🔎 Proposed fix - add path validation
     def _get_default_analysis_script(self, repo_path: str) -> str:
         """Get standard Joern analysis script for Joern 4.x."""
+        # Validate repo_path to prevent script injection
+        # Only allow alphanumeric, common path characters, and path separators
+        if not re.match(r'^[\w\s\-./\\:]+$', repo_path):
+            raise ValueError(f"Invalid characters in repository path: {repo_path}")
+        
         # Escape backslashes for Scala string literal
-        escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"")
+        # Also escape $ to prevent Scala string interpolation
+        escaped_path = (
+            repo_path
+            .replace("\\", "\\\\")
+            .replace("\"", "\\\"")
+            .replace("$", "\\$")
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _get_default_analysis_script(self, repo_path: str) -> str:
"""Get standard Joern analysis script for Joern 4.x."""
# Escape backslashes for Scala string literal
escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"")
# Use a more robust script format for Joern 4.x
return f'''
// Use language specific frontend if possible, fallback to generic
val cpg = try {{
importCode.python("{escaped_path}")
}} catch {{
case _: Exception => importCode("{escaped_path}", "analysis", "")
}}
val mJson = cpg.method.internal.map {{ m =>
Map(
"name" -> m.name,
"fullName" -> m.fullName,
"file" -> m.filename,
"line" -> m.lineNumber.headOption.getOrElse(-1)
)
}}.toJson
val cJson = cpg.call.map {{ c =>
Map(
"name" -> c.name,
"caller" -> c.method.fullName.headOption.getOrElse(""),
"callee" -> c.methodFullName,
"line" -> c.lineNumber.headOption.getOrElse(-1)
)
}}.toJson
println("---JOERN_START---")
println(s"""{{
"total_files": ${{cpg.file.size}},
"total_methods": ${{cpg.method.size}},
"total_classes": ${{cpg.typeDecl.size}},
"methods": $mJson,
"calls": $cJson
}}""")
println("---JOERN_END---")
'''
def _get_default_analysis_script(self, repo_path: str) -> str:
"""Get standard Joern analysis script for Joern 4.x."""
# Validate repo_path to prevent script injection
# Only allow alphanumeric, common path characters, and path separators
if not re.match(r'^[\w\s\-./\\:]+$', repo_path):
raise ValueError(f"Invalid characters in repository path: {repo_path}")
# Escape backslashes for Scala string literal
# Also escape $ to prevent Scala string interpolation
escaped_path = (
repo_path
.replace("\\", "\\\\")
.replace("\"", "\\\"")
.replace("$", "\\$")
)
# Use a more robust script format for Joern 4.x
return f'''
// Use language specific frontend if possible, fallback to generic
val cpg = try {{
importCode.python("{escaped_path}")
}} catch {{
case _: Exception => importCode("{escaped_path}", "analysis", "")
}}
val mJson = cpg.method.internal.map {{ m =>
Map(
"name" -> m.name,
"fullName" -> m.fullName,
"file" -> m.filename,
"line" -> m.lineNumber.headOption.getOrElse(-1)
)
}}.toJson
val cJson = cpg.call.map {{ c =>
Map(
"name" -> c.name,
"caller" -> c.method.fullName.headOption.getOrElse(""),
"callee" -> c.methodFullName,
"line" -> c.lineNumber.headOption.getOrElse(-1)
)
}}.toJson
println("---JOERN_START---")
println(s"""{{
"total_files": ${{cpg.file.size}},
"total_methods": ${{cpg.method.size}},
"total_classes": ${{cpg.typeDecl.size}},
"methods": $mJson,
"calls": $cJson
}}""")
println("---JOERN_END---")
'''
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/joern/client.py around lines 267 to 308,
the generated Scala script embeds repo_path into a Scala string literal but only
escapes backslashes and quotes, leaving it vulnerable to script injection via
Scala interpolation or control characters; validate and sanitize repo_path
before embedding by: canonicalizing the path (os.path.realpath) and ensuring it
is within an allowed repository root, rejecting or returning an error for paths
that contain interpolation patterns like "${", "s\"" sequences, newlines,
backticks, semicolons or other control characters, and/or non-whitelisted
characters; alternatively, escape any remaining interpolation sequences (e.g.,
replace "${" with "\${") so the value cannot be interpreted as Scala code, and
ensure the function raises a clear error on invalid input rather than emitting a
potentially dangerous script.

- Enhance type hints by adding Optional for current_module_name parameter
- Add safety checks to prevent KeyError when navigating module tree nodes
- Initialize missing "children" keys during recursive module tree navigation
- Improve error messages and logging throughout the clustering process
- Add validation for sub_leaf_nodes to ensure they exist in components dictionary
- Format long lines for better readability while maintaining functionality
- Strengthen LLM response parsing with more detailed error reporting and traceback logging
- Add warning when invalid keys are encountered during module tree traversal
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
codewiki/src/be/cluster_modules.py (1)

116-116: Security: eval() remains unfixed.

This is a duplicate of a previous review comment. The unsafe eval() on line 116 should be replaced with ast.literal_eval() to prevent arbitrary code execution. The comment on line 115 acknowledges the issue but doesn't address it.

🧹 Nitpick comments (1)
codewiki/src/be/cluster_modules.py (1)

123-124: Use logger.exception() instead of manual traceback formatting.

logger.exception() automatically includes the traceback and is more concise than calling logger.error() with traceback.format_exc().

🔎 Proposed fix
         except Exception as e:
-            logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...")
-            logger.error(f"Traceback: {traceback.format_exc()}")
+            logger.exception(f"Failed to parse LLM response: {e}. Response: {response[:200]}...")
             return {}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2621b4e and 7a20c12.

📒 Files selected for processing (1)
  • codewiki/src/be/cluster_modules.py
🧰 Additional context used
🧬 Code graph analysis (1)
codewiki/src/be/cluster_modules.py (6)
codewiki/src/be/dependency_analyzer/models/core.py (1)
  • Node (55-75)
codewiki/src/be/utils.py (1)
  • count_tokens (32-38)
codewiki/cli/utils/logging.py (4)
  • debug (24-28)
  • info (30-32)
  • warning (38-40)
  • error (42-44)
codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
  • cluster_graph_by_communities (9-72)
codewiki/src/be/prompt_template.py (1)
  • format_cluster_prompt (308-337)
codewiki/src/be/llm_services.py (1)
  • call_llm (58-86)
🪛 Ruff (0.14.10)
codewiki/src/be/cluster_modules.py

51-51: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


53-53: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


94-94: Do not catch blind exception: Exception

(BLE001)


116-116: Use of possibly insecure function; consider using ast.literal_eval

(S307)


122-122: Do not catch blind exception: Exception

(BLE001)


123-123: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


124-124: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (3)
codewiki/src/be/cluster_modules.py (3)

29-32: Good addition: grouping leaf nodes by file.

The explicit grouping by file path improves organization and makes the formatted output more readable for downstream consumers.


76-96: Well-implemented graph-based clustering with fallback.

The dynamic import prevents errors when Joern is unavailable, and the fallback strategy is sound. The broad Exception catch on line 94 is acceptable here since you're implementing a graceful degradation pattern with appropriate logging.


144-149: Excellent defensive programming with safety checks.

The added safety checks prevent KeyError exceptions during tree navigation and properly initialize missing "children" keys. This makes the recursive clustering more robust.

Comment on lines +51 to +53
current_module_tree: Dict[str, Any] = {},
current_module_name: Optional[str] = None,
current_module_path: List[str] = [],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Replace mutable default arguments with None.

Using mutable defaults ({} and []) causes all function calls to share the same dict/list instances, leading to state leakage between clustering operations. This is especially problematic for recursive functions.

🔎 Proposed fix
 def cluster_modules(
     leaf_nodes: List[str],
     components: Dict[str, Node],
     config: Config,
-    current_module_tree: Dict[str, Any] = {},
-    current_module_name: Optional[str] = None,
-    current_module_path: List[str] = [],
+    current_module_tree: Optional[Dict[str, Any]] = None,
+    current_module_name: Optional[str] = None,
+    current_module_path: Optional[List[str]] = None,
 ) -> Dict[str, Any]:
     """
     Cluster potential core components into modules.
     [CCR] Relation: Clustering Strategy.
     Reason: Joern enables graph-based clustering (Louvain), while AST-only falls back to LLM heuristics.
     """
+    if current_module_tree is None:
+        current_module_tree = {}
+    if current_module_path is None:
+        current_module_path = []
+    
     potential_core_components, potential_core_components_with_code = (
         format_potential_core_components(leaf_nodes, components)
     )
🧰 Tools
🪛 Ruff (0.14.10)

51-51: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


53-53: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Joern Integration - Quick Win Plan (v5)

5 participants