PR: Implemented hybrid AST + Joern CPG analysis#12
PR: Implemented hybrid AST + Joern CPG analysis#12e2720pjk wants to merge 12 commits intoopencode-devfrom
Conversation
…ki/src/be/dependency_analyzer/analyzers/php.py, codewiki/src/be/dependency_analyzer/analysis/analysis_service.py, codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py, codewiki/cli/utils/repo_validator.py, codewiki/cli/utils/validation.py, codewiki/src/be/dependency_analyzer/utils/patterns.py, codewiki/src/be/prompt_template.py, pyproject.toml]
add traceback when errors
Add php support
config fallback model
This introduces hybrid analysis capabilities that combine traditional AST parsing with Joern Code Property Graph (CPG) analysis for deeper insights into code structure and data flow relationships. Key additions: - SimplifiedJoernAnalyzer: Basic Joern integration with graceful fallback to AST analysis - HybridAnalysisService: Combines AST structural analysis with Joern data flow insights - New CLI flag --use-joern to enable enhanced analysis mode - DataFlowRelationship model for capturing parameter-to-local variable flows - Cross-module relationship detection with confidence scoring - Comprehensive user guide for Joern integration The implementation provides: - Automatic Joern JAR detection and setup - Performance baseline testing comparing AST vs Joern analysis - Graceful degradation when Joern is unavailable - Enhanced dependency graphs with data flow information - Support for Python, Java, JavaScript/TypeScript analysis This feature significantly improves code understanding by adding data flow analysis while maintaining backward compatibility through the existing AST-only approach.
📝 WalkthroughWalkthroughAdds optional Joern CPG hybrid analysis (AST + Joern) with a --use-joern CLI flag, HybridAnalysisService and multiple Joern adapters/clients, PHP language support across analyzers, a --fallback-model config option, graph-based clustering, enhanced traceback logging, and documentation (JOERN_USER_GUIDE.md). Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI (generate)
participant Config as ConfigMgr
participant HAS as HybridAnalysisService
participant AST as AST Analyzer
participant Joern as JoernClient/Service
participant Merger as Result Merger
participant DocGen as Documentation Generator
CLI->>Config: invoke generate (--use-joern)
Config->>HAS: initialize(enable_joern=True)
CLI->>HAS: analyze_repository_hybrid(repo_path)
rect rgba(240,248,255,1)
Note over HAS,AST: Phase 1 — AST analysis (always)
HAS->>AST: analyze_local_repository(repo_path, max_files)
AST-->>HAS: ast_result
end
alt Joern enabled & available
rect rgba(237,255,237,1)
Note over HAS,Joern: Phase 2 — Joern enhancement
HAS->>Joern: generate_cpg(repo_path)
Joern-->>HAS: joern_data (methods, calls, flows)
end
rect rgba(255,250,235,1)
Note over HAS,Merger: Phase 3 — Merge results
HAS->>Merger: merge(ast_result, joern_data)
Merger-->>HAS: merged_result
end
else Joern disabled or failed
Note over HAS: Graceful fallback — return ast_result with joern_failed metadata
end
HAS-->>DocGen: enriched analysis result
DocGen->>DocGen: generate documentation/site
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (26)
codewiki/src/be/agent_tools/str_replace_editor.py (1)
338-343: Add explicit parentheses for clarity.The boolean expression mixes
andandoroperators without explicit parentheses, making the precedence unclear. While Python's operator precedence rules evaluate this correctly, explicit parentheses improve readability and prevent potential bugs.🔎 Suggested fix
if ( - direction == 1 - and best_breakpoint < current_line - or direction == -1 - and best_breakpoint > current_line + (direction == 1 and best_breakpoint < current_line) + or (direction == -1 and best_breakpoint > current_line) ):codewiki/src/be/utils.py (1)
172-174: Uselogging.exception()for automatic traceback capture.Python's
logging.exception()automatically includes the full traceback without requiring manual formatting. Replace the twologger.error()calls with a singlelogger.exception()call.🔎 Proposed fix
if match: core_error = match.group(0).strip() core_error = core_error else: - logger.error(f"No match found for error pattern, fallback to mermaid-py\n{error_str}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"No match found for error pattern, fallback to mermaid-py\n{error_str}") raise Exception(error_str)Note: With this change, the
import tracebackon line 6 would no longer be needed if this is the only usage.codewiki/src/be/dependency_analyzer/analyzers/typescript.py (1)
34-35: Uselogging.exception()for automatic traceback capture.Replace manual traceback formatting with
logging.exception(), which automatically includes the full traceback.🔎 Proposed fix
except Exception as e: - logger.error(f"Failed to initialize TypeScript parser: {e}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to initialize TypeScript parser: {e}") self.parser = None self.ts_language = NoneNote: With this change, the
import tracebackon line 3 would no longer be needed if this is the only usage.codewiki/src/be/agent_orchestrator.py (1)
147-148: Uselogging.exception()for automatic traceback capture.Replace manual traceback formatting with
logging.exception(), which automatically includes the full traceback.🔎 Proposed fix
except Exception as e: - logger.error(f"Error processing module {module_name}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Error processing module {module_name}: {str(e)}") raiseNote: With this change, the
import tracebackon line 5 would no longer be needed if this is the only usage.codewiki/src/be/main.py (1)
60-61: Uselogging.exception()for automatic traceback capture.Replace manual traceback formatting with
logging.exception(), which automatically includes the full traceback. This pattern appears across multiple files in this PR (utils.py, typescript.py, agent_orchestrator.py, and main.py).🔎 Proposed fix
except Exception as e: - logger.error(f"Unexpected error: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Unexpected error: {str(e)}") raiseNote: With this change, the
import tracebackon line 14 would no longer be needed if this is the only usage. Consider applying this pattern consistently across all affected files (codewiki/src/be/utils.py line 172-173, codewiki/src/be/dependency_analyzer/analyzers/typescript.py line 34-35, codewiki/src/be/agent_orchestrator.py line 147-148).codewiki/cli/adapters/doc_generator.py (1)
154-154: Consider removing unused variable.The
generation_timevariable is assigned but never used. If timing information isn't needed for logging or metrics, this line can be removed.🔎 Proposed fix
- generation_time = time.time() - start_time self.job.complete()codewiki/src/be/cluster_modules.py (1)
77-79: Preferlogging.exception()for cleaner exception logging.The added traceback logging improves debugging, but Python's
logging.exception()automatically includes the traceback, making the code more concise.🔎 Proposed fix
except Exception as e: - logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to parse LLM response: {e}. Response: {response[:200]}...") return {}This automatically includes the full traceback without needing the separate
tracebackimport.codewiki/src/be/dependency_analyzer/analyzers/javascript.py (1)
35-37: Preferlogging.exception()for cleaner exception logging.The traceback logging aids debugging parser initialization failures. However,
logging.exception()would be more idiomatic and eliminate the need for manual traceback formatting.🔎 Proposed fix
except Exception as e: - logger.error(f"Failed to initialize JavaScript parser: {e}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to initialize JavaScript parser: {e}") self.parser = None self.js_language = NoneThis automatically includes the full traceback without the separate
tracebackimport.codewiki/src/be/dependency_analyzer/models/core.py (1)
55-72: Consider adding validation for the confidence score.The
DataFlowRelationshipmodel is well-structured. Theconfidencefield has a comment indicating it should be 0.0-1.0, but Pydantic allows any float value. Consider adding a validator to enforce the range.🔎 Proposed enhancement with field validation
+from pydantic import BaseModel, field_validator class DataFlowRelationship(BaseModel): """Data flow relationship extracted from Joern CPG analysis.""" source: str target: str flow_type: str # "parameter", "local", "return", "field" variable_name: Optional[str] = None source_line: Optional[int] = None target_line: Optional[int] = None confidence: float = 1.0 # Confidence score 0.0-1.0 file_path: Optional[str] = None + + @field_validator('confidence') + @classmethod + def validate_confidence(cls, v: float) -> float: + if not 0.0 <= v <= 1.0: + raise ValueError('confidence must be between 0.0 and 1.0') + return vcodewiki/src/be/dependency_analyzer/joern_poc.py (2)
27-35: Consider using context manager for temporary directory cleanup.The temporary directory is created in
__init__and cleanup relies on__del__, which is not guaranteed to run. Consider usingtempfile.TemporaryDirectoryas a context manager or ensuring cleanup is called explicitly.🔎 Proposed improvement
def __init__(self, joern_path: Optional[str] = None): """ Initialize Joern analyzer. Args: joern_path: Path to Joern binary. If None, will try to find in PATH. """ self.joern_path = joern_path or self._find_joern_binary() - self.temp_dir = tempfile.mkdtemp(prefix="joern_analysis_") + self._temp_dir_obj = tempfile.TemporaryDirectory(prefix="joern_analysis_") + self.temp_dir = self._temp_dir_obj.nameThen remove the manual
cleanup()method and rely on the context manager's cleanup.
102-107: Useraise ... from efor proper exception chaining.When re-raising exceptions, use
raise ... from eto preserve the original traceback for easier debugging.🔎 Proposed fix
except subprocess.TimeoutExpired: - logger.error("Joern analysis timed out") - raise RuntimeError("Joern analysis timed out after 5 minutes") + logger.error("Joern analysis timed out") + raise RuntimeError("Joern analysis timed out after 5 minutes") from None except Exception as e: - logger.error(f"Joern analysis failed: {str(e)}") + logger.exception("Joern analysis failed") raisecodewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
99-101: Useraise ... from efor proper exception chaining.When catching and re-raising exceptions, use
from e(orfrom Noneif intentionally suppressing) to preserve the original traceback.🔎 Proposed fix
except Exception as e: logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True) - raise RuntimeError(f"Hybrid analysis failed: {str(e)}") + raise RuntimeError(f"Hybrid analysis failed: {str(e)}") from e
209-222: Avoid shadowing thelocalsbuiltin.Using
localsas a variable name shadows Python's built-inlocals()function. While it works, it can lead to confusion and subtle bugs.🔎 Proposed fix
if df_result.get("status") != "error": # Create data flow relationships params = df_result.get("parameters", []) - locals = df_result.get("local_variables", []) + local_vars = df_result.get("local_variables", []) # Parameter to local variable flows for param in params: - for local in locals: + for local_var in local_vars: data_flow = DataFlowRelationship( source=f"{func_name}:{param}", - target=f"{func_name}:{local}", + target=f"{func_name}:{local_var}", flow_type="parameter_to_local", - variable_name=local, + variable_name=local_var, confidence=0.8, )codewiki/src/be/documentation_generator.py (1)
174-174: Preferlogging.exception()over manual traceback logging.Using
logging.exception()automatically includes the full traceback and provides better formatted output compared to manually loggingtraceback.format_exc(). This reduces code duplication and improves consistency.🔎 Proposed refactor
- logger.error(f"Failed to process module {module_key}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to process module {module_key}")Apply the same pattern to lines 246 and 291:
- logger.error(f"Error generating parent documentation for {module_name}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Error generating parent documentation for {module_name}")- logger.error(f"Documentation generation failed: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception("Documentation generation failed")If you adopt
logging.exception(), theimport tracebackon line 6 can also be removed.Also applies to: 246-246, 291-291
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
216-216: Preferlogging.exception()over manual traceback logging.Similar to other exception handlers in this PR, using
logging.exception()would be cleaner than manually logging the traceback.🔎 Proposed refactor
except Exception as e: if temp_dir: self._cleanup_repository(temp_dir) - logger.error(f"Structure analysis failed for {github_url}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Structure analysis failed for {github_url}") raise RuntimeError(f"Structure analysis failed: {str(e)}") from ecodewiki/cli/models/job.py (1)
133-161: Well-structured defensive deserialization.The refactored
from_dictmethod using.get()with defaults provides good resilience against missing or malformed data. One minor simplification is possible at line 155.🔎 Optional simplification at line 155
The key check before access can be simplified using
.get():- if "llm_config" in data and data["llm_config"]: - job.llm_config = LLMConfig(**data["llm_config"]) + if llm_config_data := data.get("llm_config"): + job.llm_config = LLMConfig(**llm_config_data)This uses the walrus operator to both retrieve and check the value in one line.
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (1)
142-144: Preferlogging.exception()for cleaner exception logging.The broad
Exceptioncatch at line 142 is appropriate here since it's the top-level dispatcher for file analysis, but the logging can be simplified usinglogging.exception().🔎 Proposed refactor
except Exception as e: - logger.error(f"⚠️ Error analyzing {file_path}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"⚠️ Error analyzing {file_path}")If this refactor is applied across all files in this PR, the
import tracebackstatements can be removed.codewiki/cli/models/config.py (1)
73-80: Consider whetherfallback_modelshould blockis_complete().The
fallback_modelhas a sensible default ("glm-4p5"), so it will always be truthy unless explicitly set to an empty string. This means adding it tois_complete()is effectively a no-op for normal flows. However, if someone explicitly setsfallback_model="", configuration will be marked incomplete, which may be the intended behavior.JOERN_USER_GUIDE.md (2)
88-92: Add language specifier to fenced code block.The code block showing the fallback message should have a language specifier for consistent rendering.
🔎 Proposed fix
-``` +```text ⚠️ Joern analysis failed: Error: Unable to access jarfile joern.jar 🔄 Falling back to AST analysis ✓ Documentation generated successfully -``` +```
268-274: Add language specifier to directory structure code block.For consistent markdown rendering, add a language specifier.
🔎 Proposed fix
-``` +```text my-project/ ├── joern.jar # Auto-downloaded or manual ├── src/ # Source code ├── docs/ # Generated documentation └── .gitignore # Ignore temporary files -``` +```codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (1)
54-56: Remove unused variablefiltered_folders_path.The variable is assigned but never used since the code block that would use it is commented out.
🔎 Proposed fix
- filtered_folders_path = os.path.join( - self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json" - )codewiki/cli/commands/generate.py (3)
202-208: Remove unusedgeneration_optionsvariable.The
GenerationOptionsobject is created but never used.🔎 Proposed fix
- # Create generation options - generation_options = GenerationOptions( - create_branch=create_branch, - github_pages=github_pages, - no_cache=no_cache, - custom_output=output if output != "docs" else None, - use_joern=use_joern, - )Alternatively, if
generation_optionsshould be used, pass it to the generator or other components.
281-292: Uselogger.exception()instead of manually logging traceback.The current pattern logs the error message and then separately logs the traceback. Using
logger.exception()is cleaner and automatically includes the traceback.🔎 Proposed fix
except ConfigurationError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(e.message) sys.exit(e.exit_code) except RepositoryError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(e.message) sys.exit(e.exit_code) except APIError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(e.message) sys.exit(e.exit_code)This also removes the need for
import tracebackat line 7.
262-263: Avoid bareexceptclause.The bare
exceptcatches all exceptions includingKeyboardInterruptandSystemExit. Use a specific exception type.🔎 Proposed fix
- except: + except Exception: passcodewiki/src/be/dependency_analyzer/analyzers/php.py (2)
51-51: Consider using modern type hint syntax for optional parameters.PEP 604 (Python 3.10+) allows
T | Noneinstead ofOptional[T]or defaultNonewithout explicit type.🔎 Proposed fix examples
- def register_use(self, fqn: str, alias: str = None): + def register_use(self, fqn: str, alias: str | None = None): - def __init__(self, file_path: str, content: str, repo_path: str = None): + def __init__(self, file_path: str, content: str, repo_path: str | None = None): - def _get_component_id(self, name: str, parent_class: str = None) -> str: + def _get_component_id(self, name: str, parent_class: str | None = None) -> str: - def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str = None): + def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str | None = None): -def analyze_php_file(file_path: str, content: str, repo_path: str = None) -> Tuple[List[Node], List[CallRelationship]]: +def analyze_php_file(file_path: str, content: str, repo_path: str | None = None) -> Tuple[List[Node], List[CallRelationship]]:Also applies to: 90-90, 149-149, 244-244, 592-592
184-188: Uselogger.exception()for better error context.When catching exceptions,
logger.exception()automatically includes the stack trace.🔎 Proposed fix
except RecursionError: logger.warning(f"Max recursion depth exceeded in {self.file_path}") except Exception as e: - logger.error(f"Error parsing PHP file {self.file_path}: {e}") + logger.exception(f"Error parsing PHP file {self.file_path}: {e}")
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
JOERN_USER_GUIDE.mdREADME.mdcodewiki/cli/adapters/doc_generator.pycodewiki/cli/commands/config.pycodewiki/cli/commands/generate.pycodewiki/cli/config_manager.pycodewiki/cli/models/config.pycodewiki/cli/models/job.pycodewiki/cli/utils/repo_validator.pycodewiki/cli/utils/validation.pycodewiki/src/be/agent_orchestrator.pycodewiki/src/be/agent_tools/str_replace_editor.pycodewiki/src/be/cluster_modules.pycodewiki/src/be/dependency_analyzer/analysis/analysis_service.pycodewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.pycodewiki/src/be/dependency_analyzer/analyzers/javascript.pycodewiki/src/be/dependency_analyzer/analyzers/php.pycodewiki/src/be/dependency_analyzer/analyzers/typescript.pycodewiki/src/be/dependency_analyzer/dependency_graphs_builder.pycodewiki/src/be/dependency_analyzer/hybrid_analysis_service.pycodewiki/src/be/dependency_analyzer/joern_poc.pycodewiki/src/be/dependency_analyzer/models/core.pycodewiki/src/be/dependency_analyzer/simplified_joern.pycodewiki/src/be/dependency_analyzer/utils/patterns.pycodewiki/src/be/documentation_generator.pycodewiki/src/be/main.pycodewiki/src/be/prompt_template.pycodewiki/src/be/utils.pycodewiki/src/config.pypyproject.toml
🧰 Additional context used
🧬 Code graph analysis (12)
codewiki/src/be/utils.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/dependency_analyzer/analyzers/javascript.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/agent_orchestrator.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/dependency_analyzer/analyzers/typescript.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/main.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (4)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
build_graph_from_components(239-268)get_leaf_nodes(271-350)codewiki/src/config.py (1)
Config(46-124)codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
HybridAnalysisService(23-409)analyze_repository_hybrid(48-101)codewiki/src/be/dependency_analyzer/ast_parser.py (1)
parse_repository(28-45)
codewiki/src/be/documentation_generator.py (1)
codewiki/cli/utils/logging.py (1)
error(42-44)
codewiki/cli/models/config.py (1)
codewiki/cli/utils/validation.py (1)
validate_model_name(82-98)
codewiki/src/be/dependency_analyzer/analyzers/php.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (2)
Node(6-42)CallRelationship(45-52)
codewiki/cli/models/job.py (3)
codewiki/src/fe/background_worker.py (1)
start(38-44)codewiki/src/fe/models.py (1)
JobStatus(33-45)codewiki/cli/models/config.py (2)
to_dict(50-52)from_dict(55-71)
codewiki/cli/commands/generate.py (2)
codewiki/cli/models/job.py (1)
GenerationOptions(23-30)codewiki/cli/adapters/doc_generator.py (2)
CLIDocumentationGenerator(26-288)generate(114-164)
🪛 LanguageTool
JOERN_USER_GUIDE.md
[style] ~75-~75: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ncies) #### ❌ Not Recommended For: - Very large repositories (1000+ files) - may be s...
(EN_WEAK_ADJECTIVE)
[style] ~193-~193: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...d Usage ### 1. Selective Analysis For very large repositories, analyze specific director...
(EN_WEAK_ADJECTIVE)
🪛 markdownlint-cli2 (0.18.1)
JOERN_USER_GUIDE.md
88-88: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
268-268: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
295-295: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🪛 Ruff (0.14.10)
codewiki/src/be/utils.py
173-173: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py
216-216: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/analyzers/javascript.py
37-37: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/agent_orchestrator.py
148-148: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/analyzers/typescript.py
35-35: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/main.py
61-61: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/config.py
109-109: Local variable repo_name is assigned to but never used
Remove assignment to unused variable repo_name
(F841)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
31-31: Do not catch blind exception: Exception
(BLE001)
54-54: Local variable filtered_folders_path is assigned to but never used
Remove assignment to unused variable filtered_folders_path
(F841)
codewiki/src/be/dependency_analyzer/simplified_joern.py
64-64: Avoid specifying long messages outside the exception class
(TRY003)
82-82: subprocess call: check for execution of untrusted input
(S603)
101-101: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
103-103: Do not catch blind exception: Exception
(BLE001)
104-104: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
104-104: Use explicit conversion flag
Replace with conversion flag
(RUF010)
200-200: Do not catch blind exception: Exception
(BLE001)
213-213: Do not catch blind exception: Exception
(BLE001)
214-214: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
215-215: Use explicit conversion flag
Replace with conversion flag
(RUF010)
255-255: subprocess call: check for execution of untrusted input
(S603)
261-261: Do not catch blind exception: Exception
(BLE001)
274-274: Do not catch blind exception: Exception
(BLE001)
304-304: Do not catch blind exception: Exception
(BLE001)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
97-97: Consider moving this statement to an else block
(TRY300)
100-100: Use explicit conversion flag
Replace with conversion flag
(RUF010)
101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
123-123: Consider moving this statement to an else block
(TRY300)
125-125: Do not catch blind exception: Exception
(BLE001)
126-126: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
126-126: Use explicit conversion flag
Replace with conversion flag
(RUF010)
176-176: Consider moving this statement to an else block
(TRY300)
178-178: Do not catch blind exception: Exception
(BLE001)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
203-203: Loop control variable func_info not used within loop body
Rename unused func_info to _func_info
(B007)
224-224: Do not catch blind exception: Exception
(BLE001)
387-387: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
402-406: Consider moving this statement to an else block
(TRY300)
408-408: Do not catch blind exception: Exception
(BLE001)
409-409: Use explicit conversion flag
Replace with conversion flag
(RUF010)
codewiki/src/be/documentation_generator.py
174-174: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
246-246: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/joern_poc.py
48-48: subprocess call: check for execution of untrusted input
(S603)
57-57: Avoid specifying long messages outside the exception class
(TRY003)
84-84: subprocess call: check for execution of untrusted input
(S603)
94-94: Abstract raise to an inner function
(TRY301)
94-94: Avoid specifying long messages outside the exception class
(TRY003)
103-103: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
104-104: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
104-104: Avoid specifying long messages outside the exception class
(TRY003)
106-106: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
106-106: Use explicit conversion flag
Replace with conversion flag
(RUF010)
227-230: f-string without any placeholders
Remove extraneous f prefix
(F541)
291-291: subprocess call: check for execution of untrusted input
(S603)
294-294: Avoid specifying long messages outside the exception class
(TRY003)
305-305: Do not catch blind exception: Exception
(BLE001)
316-316: Local variable analyzer is assigned to but never used
Remove assignment to unused variable analyzer
(F841)
318-318: Consider moving this statement to an else block
(TRY300)
319-319: Do not catch blind exception: Exception
(BLE001)
320-320: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/cluster_modules.py
79-79: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py
142-142: Do not catch blind exception: Exception
(BLE001)
143-143: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
143-143: Use explicit conversion flag
Replace with conversion flag
(RUF010)
144-144: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/agent_tools/str_replace_editor.py
205-205: subprocess call with shell=True identified, security issue
(S602)
339-340: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
341-342: Parenthesize a and b expressions when chaining and and or together, to make the precedence clear
Parenthesize the and subexpression
(RUF021)
codewiki/src/be/dependency_analyzer/analyzers/php.py
51-51: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
90-90: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
149-149: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
186-186: Do not catch blind exception: Exception
(BLE001)
187-187: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
244-244: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
592-592: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
codewiki/cli/adapters/doc_generator.py
154-154: Local variable generation_time is assigned to but never used
Remove assignment to unused variable generation_time
(F841)
157-157: Consider moving this statement to an else block
(TRY300)
188-188: Do not catch blind exception: Exception
(BLE001)
189-189: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
189-189: Avoid specifying long messages outside the exception class
(TRY003)
220-220: Do not catch blind exception: Exception
(BLE001)
221-221: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
221-221: Avoid specifying long messages outside the exception class
(TRY003)
245-245: Do not catch blind exception: Exception
(BLE001)
246-246: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
246-246: Avoid specifying long messages outside the exception class
(TRY003)
codewiki/cli/models/job.py
155-155: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
codewiki/cli/commands/generate.py
125-127: Abstract raise to an inner function
(TRY301)
125-127: Avoid specifying long messages outside the exception class
(TRY003)
202-202: Local variable generation_options is assigned to but never used
Remove assignment to unused variable generation_options
(F841)
252-252: Local variable commit_hash is assigned to but never used
Remove assignment to unused variable commit_hash
(F841)
253-253: Local variable current_branch is assigned to but never used
Remove assignment to unused variable current_branch
(F841)
262-262: Do not use bare except
(E722)
262-263: try-except-pass detected, consider logging the exception
(S110)
282-282: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
283-283: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
286-286: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
287-287: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
290-290: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
291-291: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (27)
codewiki/src/be/agent_tools/str_replace_editor.py (2)
785-794: Nice implementation of flexible input parsing.The
parse_view_rangefunction andViewRangetype alias elegantly enable theview_rangeparameter to accept both list and string inputs (e.g.,"[1, 10]"). The use of Pydantic'sBeforeValidatoris idiomatic and allows graceful error handling downstream if the parsed value doesn't match the expected type.
803-803: Public API enhancement maintains backward compatibility.The parameter type change to
ViewRangeenhances the API by accepting string inputs in addition to lists, while maintaining full backward compatibility with existing code that passes list values.pyproject.toml (1)
40-40: LGTM!The addition of
tree-sitter-php>=0.23.0appropriately supports the new PHP language analysis capabilities introduced in this PR.codewiki/cli/adapters/doc_generator.py (1)
139-140: LGTM! Joern and fallback model integration added.The addition of
fallback_modelanduse_joernconfiguration options properly integrates the new hybrid analysis capabilities described in the PR objectives. These parameters are correctly passed to the backend config.README.md (2)
45-53: LGTM! Clear documentation for the new fallback model feature.The documentation clearly describes the multi-model support and includes the new
--fallback-modeloption in the configuration example. The additions are well-integrated and consistent with the existing documentation style.
105-106: LGTM! Usage documentation updated consistently.The fallback model option is properly documented in the CLI usage section, maintaining consistency with the quick start example above.
codewiki/cli/utils/repo_validator.py (2)
30-32: LGTM! PHP language support added.The addition of PHP extensions (
.php,.phtml,.inc) aligns with the PR's objective to add PHP language support. The extensions cover the common PHP file types.
63-63: LGTM! Error message updated consistently.The error message correctly includes PHP in the list of supported languages, maintaining consistency with the newly added extensions.
codewiki/src/be/prompt_template.py (1)
235-239: LGTM! PHP extension mappings added correctly.The PHP language support additions (
.php,.phtml,.inc) are consistent with the changes in other files across this PR.codewiki/src/be/dependency_analyzer/models/core.py (1)
35-42: LGTM! Good additions for richer node metadata.The new
class_name,display_name, andcomponent_idfields along with theget_display_name()helper method are well-designed for supporting multi-language analysis and cross-module referencing.codewiki/src/config.py (2)
60-61: LGTM! Joern configuration flag properly added.The
use_joernfield is correctly added to the Config dataclass with an appropriate default ofFalse.
63-80: This is not an issue. Thefrom_argsmethod doesn't need to propagateuse_joernbecause it's not used in any code path that supports the--use-joernflag.The backend CLI in
be/main.py(which usesConfig.from_args()) only parses--repo-pathand has no--use-joernflag. The full CLI incli/commands/generate.pythat does support--use-joernuses a different code path (CLIDocumentationGenerator) and doesn't callConfig.from_args(). This is intentional design separation—simpler contexts usefrom_args()without Joern support, while the full CLI uses its own code path that passesuse_joerndirectly.Likely an incorrect or invalid review comment.
codewiki/src/be/dependency_analyzer/simplified_joern.py (1)
278-314: LGTM! Useful performance baseline utility.The
run_performance_baselinefunction provides a clean way to compare AST and Joern analysis timings. The metrics returned (times, counts, speedup) are practical for evaluating the hybrid approach.codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
412-423: LGTM! Clean factory function for backward compatibility.The
create_hybrid_analysis_servicefactory function provides a simple interface for creating the service with the Joern toggle.codewiki/cli/utils/validation.py (1)
174-175: LGTM! PHP language detection added consistently.The PHP extensions (
.php,.phtml,.inc) are correctly added to the language detection, matching the changes across other files in this PR.codewiki/src/be/documentation_generator.py (1)
46-46: LGTM: Version bump is appropriate.The version increment from "1.0.0" to "1.0.1" correctly reflects the minor enhancement of adding traceback logging to error handlers.
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
300-323: LGTM: Language support additions are consistent.The PHP and Java language support has been added consistently across the supported languages set, filter method, and public API. The implementation aligns with the PR's objective of expanding language coverage.
codewiki/cli/models/job.py (1)
30-30: LGTM: New fields support Joern integration and token tracking.The
use_joernflag inGenerationOptionsandtotal_tokens_usedinJobStatisticsalign well with the PR objectives for enabling optional Joern CPG analysis and tracking usage metrics.Also applies to: 40-40
codewiki/src/be/dependency_analyzer/utils/patterns.py (1)
257-264: LGTM: Comprehensive PHP pattern coverage.The PHP entry point patterns include common framework entry points (Laravel's
artisan, Symfony'sconsole) and standard PHP entry files. The function definition patterns appropriately cover PHP visibility modifiers (public, private, protected) alongside the basic function syntax.Also applies to: 409-409
codewiki/src/be/dependency_analyzer/analysis/call_graph_analyzer.py (1)
135-136: LGTM: PHP language support properly integrated.The PHP analysis follows the established pattern used for other languages:
- Routing in
_analyze_code_file(lines 135-136)- Dedicated
_analyze_php_filemethod with proper error handling (lines 305-326)- File extension classification for visualization (lines 411-412)
The implementation is consistent with existing Java, C#, and TypeScript analyzers.
Also applies to: 305-326, 411-412
codewiki/cli/models/config.py (1)
29-36: LGTM!The
fallback_modelfield is well-integrated following the existing patterns for other model fields. Default value is consistent across the class definition (line 35),from_dict(line 69), and the validation invalidate()properly validates it using the existingvalidate_model_nameutility.Also applies to: 48-48
codewiki/cli/commands/config.py (2)
53-63: LGTM!The
--fallback-modelCLI option is consistently integrated following the same patterns as other model options (--main-model,--cluster-model).
357-368: LGTM!The validation logic correctly includes
fallback_modelin both verbose output and the completeness check, maintaining consistency with the other model fields.codewiki/cli/config_manager.py (1)
84-121: LGTM!The
fallback_modelparameter is properly integrated into thesave()method, with consistent default initialization ("glm-4p5") matching theConfigurationmodel definition.codewiki/src/be/dependency_analyzer/analyzers/php.py (2)
1-18: LGTM - Solid PHP analyzer implementation.The PHP analyzer is well-structured with clear separation of concerns: namespace resolution, node extraction, and relationship extraction. The use of tree-sitter-php for parsing is appropriate.
332-336: This concern is not valid — the code correctly associates methods with their parent classes.The
parent_classassignment at line 336 occurs before the recursive call at line 340. When child nodes (including methods) are processed, they receive the updatedparent_classas a parameter. For method nodes at line 293,parent_classis used directly to construct the method's qualified name, ensuring methods are correctly associated with their parent class.Likely an incorrect or invalid review comment.
JOERN_USER_GUIDE.md (1)
140-145: URL is valid and accessible.The download URL correctly redirects to the latest Joern release (currently v4.0.457). The curl command in the documentation will follow the redirect and successfully download the JAR file.
codewiki/cli/commands/generate.py
Outdated
| generator = CLIDocumentationGenerator( | ||
| repo_path=repo_path, | ||
| output_dir=output_dir, | ||
| config={ | ||
| 'main_model': config.main_model, | ||
| 'cluster_model': config.cluster_model, | ||
| 'base_url': config.base_url, | ||
| 'api_key': api_key, | ||
| "main_model": config.main_model, | ||
| "cluster_model": config.cluster_model, | ||
| "fallback_model": config.fallback_model, | ||
| "base_url": config.base_url, | ||
| "api_key": api_key, | ||
| "use_joern": use_joern, | ||
| }, | ||
| verbose=verbose, | ||
| generate_html=github_pages | ||
| generate_html=github_pages, | ||
| ) | ||
|
|
||
|
|
||
| if use_joern: | ||
| logger.info("🚀 Joern CPG analysis enabled for enhanced data flow insights") | ||
|
|
||
| # Create generator | ||
| generator = CLIDocumentationGenerator( | ||
| repo_path=repo_path, | ||
| output_dir=output_dir, | ||
| config={ | ||
| "main_model": config.main_model, | ||
| "cluster_model": config.cluster_model, | ||
| "fallback_model": config.fallback_model, | ||
| "base_url": config.base_url, | ||
| "api_key": api_key, | ||
| }, | ||
| verbose=verbose, | ||
| generate_html=github_pages, | ||
| ) |
There was a problem hiding this comment.
Bug: CLIDocumentationGenerator is instantiated twice; second instance overwrites first and lacks use_joern.
The generator is created at lines 211-224 with use_joern in the config, but then immediately overwritten at lines 229-242 without use_joern. The second generator is what gets used for generator.generate(), so Joern integration is never actually enabled.
🔎 Proposed fix - remove the duplicate instantiation
if use_joern:
logger.info("🚀 Joern CPG analysis enabled for enhanced data flow insights")
- # Create generator
- generator = CLIDocumentationGenerator(
- repo_path=repo_path,
- output_dir=output_dir,
- config={
- "main_model": config.main_model,
- "cluster_model": config.cluster_model,
- "fallback_model": config.fallback_model,
- "base_url": config.base_url,
- "api_key": api_key,
- },
- verbose=verbose,
- generate_html=github_pages,
- )
-
# Run generation
job = generator.generate()🤖 Prompt for AI Agents
In codewiki/cli/commands/generate.py around lines 211 to 242, the
CLIDocumentationGenerator is instantiated twice and the second instantiation
overwrites the first while omitting the use_joern flag, preventing Joern from
being enabled; fix this by removing the duplicate instantiation (or merging
them) so only one CLIDocumentationGenerator is created with the full config
including "use_joern": use_joern, and ensure the logger.info remains in place
when use_joern is true.
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
Outdated
Show resolved
Hide resolved
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
Outdated
Show resolved
Hide resolved
|
/opencode 審閱下述coderabbit提出的審查建議並修正,commit_msg必須參照過往格式寫法撰寫,確保符合PR審查標準 In codewiki/cli/commands/generate.py around lines 211 to 242, the In codewiki/src/be/dependency_analyzer/analyzers/php.py around lines 20 to 30, In codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py around line Unreachable code due to redundant nested Lines 72-80 contain unreachable code. The outer condition at line 71 checks 🔎 Proposed fix # Parse repository
if isinstance(self.parser, DependencyParser):
- if isinstance(self.parser, DependencyParser):
- components = self.parser.parse_repository(filtered_folders or [])
- else:
- # HybridAnalysisService uses different interface
- result = self.parser.analyze_repository_hybrid(
- repo_path=self.config.repo_path, max_files=100
- )
- # Convert hybrid result to expected format
- components = result.get("nodes", {})
+ components = self.parser.parse_repository(filtered_folders or [])
else:
# HybridAnalysisService uses different interface
result = self.parser.analyze_repository_hybrid(📝 Committable suggestion
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around line In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines Potential command injection via unsanitized The 🔎 Proposed fix - sanitize inputs def extract_data_flow(self, project_path: str, function_name: str) -> Dict[str, Any]:
"""
Extract data flow information for a specific function.
Args:
project_path: Path to the project
function_name: Name of the function to analyze
Returns:
Data flow analysis results
"""
+ # Sanitize function_name to prevent script injection
+ if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', function_name):
+ raise ValueError(f"Invalid function name: {function_name}")
+
+ # Escape project_path for safe inclusion in script
+ safe_project_path = project_path.replace('"', '\\"')
+
script = f"""
import joern._
-val cpg = loadCPG({project_path})
-val method = cpg.method.name({function_name}).head
+val cpg = loadCPG("{safe_project_path}")
+val method = cpg.method.name("{function_name}").headNote: The current string concatenation approach is also syntactically broken (missing quotes around string literals in Scala). In codewiki/src/be/dependency_analyzer/joern_poc.py around lines 313-321, the In codewiki/src/be/dependency_analyzer/simplified_joern.py around lines 186 to In codewiki/src/be/dependency_analyzer/simplified_joern.py around lines 217 to In codewiki/src/config.py around lines 109 to 111, the local variable |
- Add Joern info log when use_joern is enabled - Add missing PHP 8 built-in types and SPL classes to PHP_PRIMITIVES - Fix shadowing local parser variable in dependency_graphs_builder.py - Remove redundant nested isinstance check in dependency_graphs_builder.py - Rename unused loop variable func_info to _func_info - Fix _extract_cross_module_relationships to accept relationships parameter - Add function name validation to extract_data_flow in joern_poc.py - Fix unused variable and use logger.exception in test_joern_analyzer - Fix literal \n to actual newline character in simplified_joern.py - Add function name validation in simplified_joern.py extract_data_flow_sample - Remove unused repo_name variable in config.py
|
Perfect! I've successfully addressed all CodeRabbit review suggestions and created a commit. Here's a summary: Fixed Issues
Commit CreatedThe commit addresses all critical and major issues from the CodeRabbit review, with improvements to:
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (11)
codewiki/src/be/dependency_analyzer/analyzers/php.py (6)
63-63: Use explicit optional type hint.PEP 484 recommends explicit optional syntax:
alias: str | None = Noneinstead of implicitOptional.🔎 Proposed fix
- def register_use(self, fqn: str, alias: str = None): + def register_use(self, fqn: str, alias: str | None = None):
102-102: Use explicit optional type hint.PEP 484 recommends explicit optional syntax for the
repo_pathparameter.🔎 Proposed fix
- def __init__(self, file_path: str, content: str, repo_path: str = None): + def __init__(self, file_path: str, content: str, repo_path: str | None = None):
196-199: Uselogger.exceptionfor better debugging.When logging exceptions in an except block,
logger.exceptionautomatically includes the traceback, which is valuable for debugging parse errors.🔎 Proposed fix
except RecursionError: logger.warning(f"Max recursion depth exceeded in {self.file_path}") except Exception as e: - logger.error(f"Error parsing PHP file {self.file_path}: {e}") + logger.exception(f"Error parsing PHP file {self.file_path}: {e}")
256-256: Use explicit optional type hint.PEP 484 recommends explicit optional syntax for the
parent_classparameter.🔎 Proposed fix
- def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str = None): + def _extract_nodes(self, node, lines: List[str], depth: int = 0, parent_class: str | None = None):
161-161: Use explicit optional type hint.PEP 484 recommends explicit optional syntax for the
parent_classparameter.🔎 Proposed fix
- def _get_component_id(self, name: str, parent_class: str = None) -> str: + def _get_component_id(self, name: str, parent_class: str | None = None) -> str:
604-604: Use explicit optional type hint.PEP 484 recommends explicit optional syntax for the
repo_pathparameter.🔎 Proposed fix
-def analyze_php_file(file_path: str, content: str, repo_path: str = None) -> Tuple[List[Node], List[CallRelationship]]: +def analyze_php_file(file_path: str, content: str, repo_path: str | None = None) -> Tuple[List[Node], List[CallRelationship]]:codewiki/cli/commands/generate.py (1)
266-277: Consider usinglogger.exception()for cleaner traceback logging.The current approach manually formats tracebacks with
logger.error(). Usinglogger.exception()would automatically include the traceback and is more idiomatic.🔎 Proposed refactor
except ConfigurationError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Configuration error: {e.message}") sys.exit(e.exit_code) except RepositoryError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Repository error: {e.message}") sys.exit(e.exit_code) except APIError as e: - logger.error(e.message) - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"API error: {e.message}") sys.exit(e.exit_code)You can also remove the
import tracebackon line 7 if this is the only use.codewiki/src/be/dependency_analyzer/joern_poc.py (1)
231-289: Consider refactoring script building for better robustness.While
function_nameis now validated, the script is built using fragile string concatenation whereproject_pathis directly concatenated without quoting or sanitization. Consider using a template approach or proper escaping for paths to improve robustness against special characters in paths.💡 Example improvement
# Escape project_path for safe inclusion import shlex safe_project_path = shlex.quote(project_path) script = f""" import joern._ val cpg = loadCPG({safe_project_path}) val method = cpg.method.name("{function_name}").head ... """codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (3)
99-101: Consider adding exception chaining for better debugging context.At line 101, the exception is re-raised without preserving the original exception context. Using
raise ... from eprovides a clearer traceback chain.🔎 Suggested improvement
except Exception as e: logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True) - raise RuntimeError(f"Hybrid analysis failed: {str(e)}") + raise RuntimeError(f"Hybrid analysis failed: {str(e)}") from e
125-127: Uselogging.exceptionfor automatic traceback inclusion.At line 126,
logger.erroris used in an exception handler. Usinglogger.exceptionautomatically includes the traceback, which significantly aids debugging.🔎 Suggested improvement
except Exception as e: - logger.error(f"AST analysis failed: {str(e)}") + logger.exception(f"AST analysis failed: {str(e)}") # Return minimal result to allow analysis to continue
178-180: Consider usinglogging.exceptionfor better debugging.Similar to other exception handlers, using
logger.exceptioninstead oflogger.warningwould include the full traceback, making it easier to diagnose Joern enhancement failures.🔎 Suggested improvement
except Exception as e: - logger.warning(f"Joern enhancement failed: {str(e)}") + logger.exception(f"Joern enhancement failed: {str(e)}") return {"status": "joern_failed", "error": str(e)}
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
codewiki/cli/commands/generate.pycodewiki/src/be/dependency_analyzer/analyzers/php.pycodewiki/src/be/dependency_analyzer/dependency_graphs_builder.pycodewiki/src/be/dependency_analyzer/hybrid_analysis_service.pycodewiki/src/be/dependency_analyzer/joern_poc.pycodewiki/src/be/dependency_analyzer/simplified_joern.pycodewiki/src/config.py
🧰 Additional context used
🧬 Code graph analysis (3)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (5)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
build_graph_from_components(239-268)get_leaf_nodes(271-350)codewiki/src/config.py (1)
Config(46-123)codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
HybridAnalysisService(23-409)analyze_repository_hybrid(48-101)codewiki/src/be/dependency_analyzer/ast_parser.py (3)
DependencyParser(18-145)parse_repository(28-45)save_dependency_graph(129-145)codewiki/src/utils.py (1)
save_json(19-22)
codewiki/src/be/dependency_analyzer/analyzers/php.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (2)
Node(6-42)CallRelationship(45-52)
codewiki/cli/commands/generate.py (2)
codewiki/cli/models/job.py (1)
GenerationOptions(23-30)codewiki/cli/adapters/doc_generator.py (2)
CLIDocumentationGenerator(26-288)generate(114-164)
🪛 Ruff (0.14.10)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
31-31: Do not catch blind exception: Exception
(BLE001)
54-54: Local variable filtered_folders_path is assigned to but never used
Remove assignment to unused variable filtered_folders_path
(F841)
codewiki/src/be/dependency_analyzer/analyzers/php.py
63-63: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
102-102: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
161-161: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
198-198: Do not catch blind exception: Exception
(BLE001)
199-199: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
256-256: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
604-604: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
97-97: Consider moving this statement to an else block
(TRY300)
100-100: Use explicit conversion flag
Replace with conversion flag
(RUF010)
101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
123-123: Consider moving this statement to an else block
(TRY300)
125-125: Do not catch blind exception: Exception
(BLE001)
126-126: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
126-126: Use explicit conversion flag
Replace with conversion flag
(RUF010)
176-176: Consider moving this statement to an else block
(TRY300)
178-178: Do not catch blind exception: Exception
(BLE001)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
224-224: Do not catch blind exception: Exception
(BLE001)
387-387: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
402-406: Consider moving this statement to an else block
(TRY300)
408-408: Do not catch blind exception: Exception
(BLE001)
409-409: Use explicit conversion flag
Replace with conversion flag
(RUF010)
codewiki/cli/commands/generate.py
125-127: Abstract raise to an inner function
(TRY301)
125-127: Avoid specifying long messages outside the exception class
(TRY003)
202-202: Local variable generation_options is assigned to but never used
Remove assignment to unused variable generation_options
(F841)
237-237: Local variable commit_hash is assigned to but never used
Remove assignment to unused variable commit_hash
(F841)
238-238: Local variable current_branch is assigned to but never used
Remove assignment to unused variable current_branch
(F841)
247-247: Do not use bare except
(E722)
247-248: try-except-pass detected, consider logging the exception
(S110)
267-267: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
268-268: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
271-271: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
272-272: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
275-275: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
276-276: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/simplified_joern.py
65-65: Avoid specifying long messages outside the exception class
(TRY003)
83-83: subprocess call: check for execution of untrusted input
(S603)
102-102: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
104-104: Do not catch blind exception: Exception
(BLE001)
105-105: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
105-105: Use explicit conversion flag
Replace with conversion flag
(RUF010)
198-198: Do not catch blind exception: Exception
(BLE001)
211-211: Do not catch blind exception: Exception
(BLE001)
212-212: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
213-213: Use explicit conversion flag
Replace with conversion flag
(RUF010)
257-257: subprocess call: check for execution of untrusted input
(S603)
263-263: Do not catch blind exception: Exception
(BLE001)
276-276: Do not catch blind exception: Exception
(BLE001)
306-306: Do not catch blind exception: Exception
(BLE001)
codewiki/src/be/dependency_analyzer/joern_poc.py
48-48: subprocess call: check for execution of untrusted input
(S603)
57-57: Avoid specifying long messages outside the exception class
(TRY003)
84-84: subprocess call: check for execution of untrusted input
(S603)
94-94: Abstract raise to an inner function
(TRY301)
94-94: Avoid specifying long messages outside the exception class
(TRY003)
103-103: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
104-104: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
104-104: Avoid specifying long messages outside the exception class
(TRY003)
106-106: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
106-106: Use explicit conversion flag
Replace with conversion flag
(RUF010)
229-229: Avoid specifying long messages outside the exception class
(TRY003)
232-235: f-string without any placeholders
Remove extraneous f prefix
(F541)
296-296: subprocess call: check for execution of untrusted input
(S603)
299-299: Avoid specifying long messages outside the exception class
(TRY003)
310-310: Do not catch blind exception: Exception
(BLE001)
323-323: Consider moving this statement to an else block
(TRY300)
🔇 Additional comments (12)
codewiki/src/be/dependency_analyzer/analyzers/php.py (2)
20-42: PHP primitives set looks comprehensive now.The additions of PHP 8 built-in types and SPL classes/interfaces/exceptions address the previous review feedback. The set now properly recognizes these as primitives to exclude from dependency analysis.
1-617: Excellent PHP analyzer implementation!This is a comprehensive and well-structured PHP analyzer that correctly handles:
- All major PHP constructs (classes, interfaces, traits, enums, functions, methods)
- Namespace resolution with use statements (including group use syntax)
- Dependency relationships (extends, implements, new, static calls, property promotion)
- PHPDoc extraction and rich metadata
- Template file exclusion
- Recursion depth protection
The three-pass analysis approach (namespace info → nodes → relationships) is clean and maintainable. Integration with the project's models is correct.
codewiki/src/be/dependency_analyzer/simplified_joern.py (2)
220-222: Good fix: Function name validation prevents script injection.The addition of regex validation for
function_nameproperly mitigates the script injection risk flagged in previous reviews.
189-189: Good fix: Line number calculation now uses correct newline character.The code now correctly uses
"\n"instead of the literal"\\n", fixing the line counting bug from previous reviews.codewiki/src/be/dependency_analyzer/joern_poc.py (2)
226-229: Good fix: Function name validation prevents script injection.The regex validation for
function_nameproperly addresses the injection risk flagged in previous reviews.
321-325: Good fix: Proper exception handling and unused variable addressed.The code now correctly uses
_for the unused analyzer instance andlogger.exception()to include the traceback, addressing both issues from previous reviews.codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (6)
161-163: LGTM! Bug fix confirmed.The call to
_extract_cross_module_relationshipsnow correctly passesast_result(the full analysis result) instead of just the nodes dict. This resolves the previous issue where relationships were being read from the wrong source.
203-203: LGTM! Unused variable fix confirmed.The loop variable is now correctly prefixed with underscore (
_func_info) to indicate it's intentionally unused.
230-268: LGTM! Cross-module relationship extraction is now correct.The method signature and implementation have been fixed to correctly:
- Accept
ast_result(the full analysis result) as a parameter- Read
relationshipsfromast_result.get("relationships", [])- Extract cross-module call patterns appropriately
This resolves the previous issue where relationships were being read from the wrong source.
270-323: LGTM! Function importance scoring is well-designed.The scoring heuristic appropriately balances multiple factors:
- Popularity (dependents count)
- Documentation presence
- Key name patterns (init, main, etc.)
- Complexity estimation (parameter count)
This provides a reasonable selection of functions for detailed data flow analysis.
325-367: LGTM! Result merging is implemented correctly.The merge logic appropriately combines:
- AST structural information (nodes, relationships)
- Joern enhancements (data flows, cross-module edges)
- Updated summary metadata
The use of
model_dump()at line 352 is correct sinceDataFlowRelationshipis a Pydantic model.
412-423: LGTM! Factory function provides clean instantiation API.The factory function appropriately encapsulates service creation and provides a clear interface for enabling/disabling Joern integration.
| commit_hash = get_git_commit_hash(repo_path) | ||
| current_branch = get_git_branch(repo_path) |
There was a problem hiding this comment.
Remove unused variables commit_hash and current_branch.
These variables are assigned but never used in the function.
🔎 Proposed fix
- commit_hash = get_git_commit_hash(repo_path)
- current_branch = get_git_branch(repo_path)
-
if is_git_repository(repo_path):📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| commit_hash = get_git_commit_hash(repo_path) | |
| current_branch = get_git_branch(repo_path) | |
| if is_git_repository(repo_path): |
🧰 Tools
🪛 Ruff (0.14.10)
237-237: Local variable commit_hash is assigned to but never used
Remove assignment to unused variable commit_hash
(F841)
238-238: Local variable current_branch is assigned to but never used
Remove assignment to unused variable current_branch
(F841)
🤖 Prompt for AI Agents
In codewiki/cli/commands/generate.py around lines 237 to 238, the variables
commit_hash and current_branch are assigned but never used; remove these unused
assignments (or replace them with a deliberate underscore-prefixed ignore
variable if side effects from the function calls are required) so the function
no longer creates unused local variables and linter warnings go away.
| filtered_folders_path = os.path.join( | ||
| self.config.dependency_graph_dir, | ||
| f"{sanitized_repo_name}_filtered_folders.json" | ||
| self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json" | ||
| ) |
There was a problem hiding this comment.
Remove unused variable filtered_folders_path.
The filtered_folders_path variable is assigned but never used, since the code that would consume it (lines 58-66) is commented out.
🔎 Proposed fix
dependency_graph_path = os.path.join(
self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json"
)
- filtered_folders_path = os.path.join(
- self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json"
- )
filtered_folders = None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| filtered_folders_path = os.path.join( | |
| self.config.dependency_graph_dir, | |
| f"{sanitized_repo_name}_filtered_folders.json" | |
| self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json" | |
| ) | |
| dependency_graph_path = os.path.join( | |
| self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json" | |
| ) | |
| filtered_folders = None |
🧰 Tools
🪛 Ruff (0.14.10)
54-54: Local variable filtered_folders_path is assigned to but never used
Remove assignment to unused variable filtered_folders_path
(F841)
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py around lines
54 to 56, the variable filtered_folders_path is assigned but never used (the
code that would use it is commented out); either remove this unused assignment
or restore/implement the intended logic that reads/writes the filtered_folders
JSON using that path—update imports if necessary and run tests to ensure no
references remain.
| for param in params: | ||
| for local in locals: | ||
| data_flow = DataFlowRelationship( | ||
| source=f"{func_name}:{param}", | ||
| target=f"{func_name}:{local}", | ||
| flow_type="parameter_to_local", | ||
| variable_name=local, | ||
| confidence=0.8, | ||
| ) | ||
| data_flows.append(data_flow) |
There was a problem hiding this comment.
Naive data flow logic may generate many false positives.
The current implementation creates a data flow relationship between EVERY parameter and EVERY local variable (Cartesian product). For a function with 3 parameters and 5 locals, this generates 15 relationships, most of which are likely false positives.
This heuristic approach may be acceptable for a PoC, but consider:
- Using actual data flow information from Joern (if available)
- Adding filtering based on variable usage patterns
- Documenting this as a known limitation
💡 Suggested improvements
- Option 1: Use Joern's actual data flow edges instead of creating all combinations:
# If Joern provides actual flows, use those:
flows = df_result.get("data_flows", [])
for flow in flows:
data_flow = DataFlowRelationship(
source=f"{func_name}:{flow['source']}",
target=f"{func_name}:{flow['target']}",
flow_type="data_flow",
variable_name=flow['target'],
confidence=0.9,
)
data_flows.append(data_flow)- Option 2: Add a limit or reduce confidence:
# Create at most N flows per parameter
for param in params:
for local in locals[:3]: # Limit to first 3 locals
data_flow = DataFlowRelationship(
source=f"{func_name}:{param}",
target=f"{func_name}:{local}",
flow_type="parameter_to_local",
variable_name=local,
confidence=0.3, # Lower confidence for heuristic
)
data_flows.append(data_flow)- Option 3: Document the limitation:
# Parameter to local variable flows (heuristic approximation)
# TODO: Replace with actual Joern data flow edges for accuracy
for param in params:
for local in locals:
# Note: This creates all combinations; refine with actual flow analysis
...🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
213 to 222 the current logic builds a Cartesian product of every parameter and
every local, which produces many false positives; change this to prefer real
data-flow edges from Joern if present by checking df_result for actual
data_flows and iterating those to create DataFlowRelationship entries (use a
higher confidence, e.g., 0.9); if Joern flows are not available, replace the
full Cartesian product with a conservative heuristic: limit the number of locals
per parameter (e.g., first N or those actually referenced with the parameter),
lower the confidence (e.g., 0.3–0.5), and/or filter locals by simple usage
patterns (e.g., appearing in the same statements or having matching types);
finally add a short TODO comment documenting this limitation and that Joern data
flow edges should be used when possible.
| for node_id, node in ast_result.get("nodes", {}).items(): | ||
| if node.get("name") == function_name: | ||
| function_data = node | ||
| break |
There was a problem hiding this comment.
Rename unused loop variable node_id to _node_id.
The loop variable node_id is not used within the loop body (only node is used). Prefix it with an underscore to indicate it's intentionally unused, consistent with the fix at line 203.
🔎 Proposed fix
- for node_id, node in ast_result.get("nodes", {}).items():
+ for _node_id, node in ast_result.get("nodes", {}).items():
if node.get("name") == function_name:
function_data = node
break📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for node_id, node in ast_result.get("nodes", {}).items(): | |
| if node.get("name") == function_name: | |
| function_data = node | |
| break | |
| for _node_id, node in ast_result.get("nodes", {}).items(): | |
| if node.get("name") == function_name: | |
| function_data = node | |
| break |
🧰 Tools
🪛 Ruff (0.14.10)
387-387: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
387 to 390, the loop declares an unused variable node_id; rename it to _node_id
to signal it's intentionally unused. Update the for loop header to use _node_id
(keeping node) and ensure no other references expect node_id.
…nalysis Major improvements to the Joern integration to support modern Java versions and ensure robust execution. Key Changes: - **Java 25 Support**: Injected `JAVA_TOOL_OPTIONS` with `--add-opens` flags to resolve `InaccessibleObjectException` caused by stricter JDK encapsulation. - **Robust Wrapper Handling**: Added `-no-version-check` to the Joern CLI invocation to prevent version check failures when extra JVM options are present. - **Hybrid Analysis Fix**: Corrected data structure mismatch in `HybridAnalysisService` where `CallGraphAnalyzer` output (List) was incompatible with the merge logic (Dict). - **Core Models**: Updated `Node` model to use `List` instead of `Set` for `depends_on` to ensure JSON serializability. - **Dependency Builder**: Removed hardcoded file limits when Joern is enabled (`max_files` check bypassed). - **Graph Clustering**: Integrated `graph_clustering.py` to use NetworkX for community detection on the Joern graph. Verification: - Confirmed successful CPG generation on Java 25. - Verified extraction of ~2700 nodes and ~3000 cross-module edges. - Validated successful documentation generation without crashes.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (2)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (1)
46-48: Remove unused variablefiltered_folders_path.This variable is assigned but never used since the code that would consume it (lines 51-58) is commented out.
🔎 Proposed fix
dependency_graph_path = os.path.join( self.config.dependency_graph_dir, f"{sanitized_repo_name}_dependency_graph.json" ) - filtered_folders_path = os.path.join( - self.config.dependency_graph_dir, f"{sanitized_repo_name}_filtered_folders.json" - )codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
409-412: Rename unused loop variablenode_idto_node_id.The loop variable
node_idis not used within the loop body. Prefix with underscore to indicate intentional non-use.
🧹 Nitpick comments (15)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (2)
17-17: Unusedconfigparameter - remove or implement.The
configparameter is declared but never used. Either remove it to clean up the signature or pass it to the analyzer constructors if configuration is needed.🔎 Proposed fix
@staticmethod - def create_analyzer(analyzer_type: AnalyzerType = AnalyzerType.HYBRID, config: Any = None): + def create_analyzer(analyzer_type: AnalyzerType = AnalyzerType.HYBRID):
42-43: Fix indentation inconsistency.Line 43 has an extra space before
return client.🔎 Proposed fix
if client.is_available: - return client + return clientcodewiki/src/be/dependency_analyzer/graph_clustering.py (1)
54-60: Dead code:passblock does nothing.Lines 56-60 check for small clusters but only execute
pass. Either implement the intended logic (merge small clusters into "others") or remove the conditional entirely.🔎 Proposed fix - remove dead code
for i, comm in enumerate(communities): comm_list = list(comm) - if len(comm_list) < min_cluster_size and len(communities) > 1: - # Skip very small clusters if we have enough other clusters - # We'll merge these into an 'others' or closest parent later? - # For now, let's keep them and name them generically - pass - + # TODO: Consider merging clusters smaller than min_cluster_size + # Determine a common path or name for the clustercodewiki/src/be/cluster_modules.py (2)
102-105: Uselogger.exception()instead of manual traceback logging.
logger.exception()automatically includes the traceback in exception handlers, making the explicittraceback.format_exc()call redundant.🔎 Proposed fix
except Exception as e: - logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to parse LLM response. Response: {response[:200]}...") return {}
48-48: Mutable default argument may cause unexpected behavior.Using
{}as a default value means all calls share the same dict instance. Consider usingNoneas default and initializing inside the function.🔎 Proposed fix
def cluster_modules( leaf_nodes: List[str], components: Dict[str, Node], config: Config, - current_module_tree: dict[str, Any] = {}, + current_module_tree: dict[str, Any] = None, current_module_name: str = None, - current_module_path: List[str] = [] + current_module_path: List[str] = None ) -> Dict[str, Any]: + if current_module_tree is None: + current_module_tree = {} + if current_module_path is None: + current_module_path = []codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (1)
4-4: Remove unused imports.
EnhancedNodeandDataFlowRelationshipare imported but never used in this file.🔎 Proposed fix
-from codewiki.src.be.dependency_analyzer.models.core import EnhancedNode, DataFlowRelationshipcodewiki/src/be/dependency_analyzer/analysis/analysis_service.py (2)
104-112: Unusedinclude_data_flowparameter should be documented or removed.The
include_data_flowparameter exists for interface compatibility withHybridAnalysisService.analyze_repository_hybridbut is never used. Add a docstring note explaining this is intentional for interface parity.🔎 Proposed fix
def analyze_repository_hybrid( self, repo_path: str, max_files: int = 100, languages: Optional[List[str]] = None, - include_data_flow: bool = False + include_data_flow: bool = False # noqa: ARG002 - kept for interface compatibility ) -> Dict[str, Any]: - """Alias for analyze_local_repository to support hybrid fallback interface.""" + """Alias for analyze_local_repository to support hybrid fallback interface. + + Note: include_data_flow is accepted for interface compatibility with + HybridAnalysisService but is not used in the AST-only path. + """ return self.analyze_local_repository(repo_path, max_files, languages)
230-235: Uselogger.exception()for cleaner error logging.Replace the manual
traceback.format_exc()call withlogger.exception()which automatically includes the traceback.🔎 Proposed fix
except Exception as e: if temp_dir and Path(temp_dir).exists(): self._cleanup_repository(temp_dir) - logger.error(f"Structure analysis failed for {github_url}: {str(e)}") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Structure analysis failed for {github_url}") raise RuntimeError(f"Structure analysis failed: {str(e)}") from ecodewiki/src/be/dependency_analyzer/models/core.py (1)
84-98: Validator ambiguity: both CFG and DDG may contain overlapping keys.The validator attempts to distinguish between CFG and DDG data by checking for
'nodes'vs'dependencies'keys. However, this heuristic could fail if a dict contains both keys or neither, returning the raw dict unchanged which may cause downstream issues.Consider using explicit type discrimination or separate validators per field:
🔎 Proposed improvement using separate validators
- @field_validator('cfg_data', 'ddg_data', mode='before') - @classmethod - def validate_graph_data(cls, v: Any) -> Any: - """Allow dict or model instance.""" - if v is None: - return None - if isinstance(v, dict): - try: - if 'nodes' in v: # Likely CFG - return ControlFlowData(**v) - elif 'dependencies' in v: # Likely DDG - return DataFlowData(**v) - except Exception: - return v # Return as-is if validation fails - return v + @field_validator('cfg_data', mode='before') + @classmethod + def validate_cfg_data(cls, v: Any) -> Any: + """Allow dict or ControlFlowData instance.""" + if v is None: + return None + if isinstance(v, dict): + return ControlFlowData(**v) + return v + + @field_validator('ddg_data', mode='before') + @classmethod + def validate_ddg_data(cls, v: Any) -> Any: + """Allow dict or DataFlowData instance.""" + if v is None: + return None + if isinstance(v, dict): + return DataFlowData(**v) + return vcodewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (2)
99-101: Chain exception for better traceability.When re-raising, chain the original exception using
raise ... from eto preserve the full traceback.🔎 Proposed fix
except Exception as e: logger.error(f"Hybrid analysis failed: {str(e)}", exc_info=True) - raise RuntimeError(f"Hybrid analysis failed: {str(e)}") + raise RuntimeError(f"Hybrid analysis failed: {e!s}") from e
125-134: Uselogger.exceptionfor automatic traceback inclusion.When logging errors in exception handlers,
logger.exception()automatically includes the traceback without needingexc_info=True.🔎 Proposed fix
except Exception as e: - logger.error(f"AST analysis failed: {str(e)}") + logger.exception(f"AST analysis failed: {e!s}") # Return minimal result to allow analysis to continue return {codewiki/src/be/dependency_analyzer/joern/client.py (4)
16-20: Consider making JoernVersion a proper dataclass or enum.
JoernVersioncontains only class-level constants but isn't used for version comparison anywhere in the code. If version validation is planned, consider adding comparison logic; otherwise, this could be simplified to module-level constants.
61-121: Validation logic is thorough but has inconsistent return conditions.The validation correctly handles multiple scenarios (binary in PATH, Homebrew fallbacks, JAR fallback). However, lines 97-100 appear redundant—if
resolved_pathexists and passes the earlier checks, this block shouldn't be reached. Also, the indentation at lines 99-100 appears inconsistent (extra space).🔎 Proposed cleanup
except (subprocess.TimeoutExpired, subprocess.SubprocessError) as e: logger.debug(f"Joern binary validation failed: {e}. Checking path existence.") if os.path.exists(self.joern_path): logger.info(f"Joern binary found at {self.joern_path}") return True - - # Binary validation failed, try to return True if it just exists and is executable - if resolved_path and os.path.exists(resolved_path): - logger.info(f"Joern binary found at {resolved_path} (validation skipped)") - return True # 2. Check JAR fallback
201-202: Chain the timeout exception for better debugging.Use
raise ... fromto preserve exception context.🔎 Proposed fix
- except subprocess.TimeoutExpired: - raise TimeoutError(f"Joern CPG generation exceeded {self.timeout_seconds}s") + except subprocess.TimeoutExpired as e: + raise TimeoutError(f"Joern CPG generation exceeded {self.timeout_seconds}s") from e
204-210: Unusedlanguageparameter.The
languageparameter is accepted but never used in_run_joern_analysis. Either implement language-specific frontend selection or remove the parameter.🔎 Option 1: Use language parameter in script generation
- def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]: + def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]: """Execute Joern analysis script.""" with tempfile.TemporaryDirectory(prefix="joern_client_") as temp_dir: repo_abs = str(Path(repo_path).resolve()) - script_content = self._get_default_analysis_script(repo_abs) + script_content = self._get_default_analysis_script(repo_abs, language)And update
_get_default_analysis_scriptsignature accordingly.🔎 Option 2: Remove unused parameter
- def _run_joern_analysis(self, repo_path: str, language: Optional[str] = None) -> Dict[str, Any]: + def _run_joern_analysis(self, repo_path: str) -> Dict[str, Any]:Also update the call site in
generate_cpg.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
codewiki/src/be/cluster_modules.pycodewiki/src/be/dependency_analyzer/analysis/analysis_service.pycodewiki/src/be/dependency_analyzer/analysis/analyzer_factory.pycodewiki/src/be/dependency_analyzer/dependency_graphs_builder.pycodewiki/src/be/dependency_analyzer/graph_clustering.pycodewiki/src/be/dependency_analyzer/hybrid_analysis_service.pycodewiki/src/be/dependency_analyzer/joern/__init__.pycodewiki/src/be/dependency_analyzer/joern/client.pycodewiki/src/be/dependency_analyzer/joern/joern_analysis_service.pycodewiki/src/be/dependency_analyzer/models/core.pycodewiki/src/utils.pypyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (6)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (7)
codewiki/src/be/dependency_analyzer/topo_sort.py (2)
build_graph_from_components(239-268)get_leaf_nodes(271-350)codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (2)
AnalyzerType(8-11)create_analyzer(17-51)codewiki/src/be/dependency_analyzer/ast_parser.py (3)
DependencyParser(18-145)parse_repository(28-45)save_dependency_graph(129-145)codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
analyze_repository_hybrid(104-112)codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
analyze_repository_hybrid(48-101)codewiki/src/be/dependency_analyzer/models/core.py (2)
Node(55-75)EnhancedNode(78-98)codewiki/src/utils.py (1)
save_json(19-27)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (2)
codewiki/src/be/dependency_analyzer/utils/patterns.py (5)
is_entry_point_file(462-485)is_entry_point_path(488-504)has_high_connectivity_potential(507-533)find_fallback_entry_points(560-612)find_fallback_connectivity_files(615-647)codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
analyze_repository_hybrid(48-101)
codewiki/src/be/cluster_modules.py (4)
codewiki/src/be/utils.py (1)
count_tokens(32-38)codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
cluster_graph_by_communities(9-72)codewiki/src/be/prompt_template.py (1)
format_cluster_prompt(308-337)codewiki/src/be/llm_services.py (1)
call_llm(58-86)
codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
codewiki/src/be/dependency_analyzer/models/core.py (1)
Node(55-75)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py (3)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
AnalysisService(32-404)codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (1)
HybridAnalysisService(23-431)codewiki/src/be/dependency_analyzer/joern/client.py (1)
JoernClient(22-328)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (4)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (3)
AnalysisService(32-404)analyze_local_repository(49-102)analyze_repository(407-418)codewiki/src/be/dependency_analyzer/models/core.py (3)
Node(55-75)DataFlowRelationship(108-119)EnhancedNode(78-98)codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (2)
JoernAnalysisService(8-65)analyze_repository(15-65)codewiki/src/be/dependency_analyzer/simplified_joern.py (1)
extract_data_flow_sample(215-266)
🪛 Ruff (0.14.10)
codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py
46-46: Local variable filtered_folders_path is assigned to but never used
Remove assignment to unused variable filtered_folders_path
(F841)
92-92: Do not catch blind exception: Exception
(BLE001)
codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py
22-22: Avoid specifying long messages outside the exception class
(TRY003)
codewiki/src/be/dependency_analyzer/analysis/analysis_service.py
109-109: Unused method argument: include_data_flow
(ARG002)
233-233: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
233-233: Use explicit conversion flag
Replace with conversion flag
(RUF010)
234-234: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/cluster_modules.py
80-80: Do not catch blind exception: Exception
(BLE001)
96-96: Use of possibly insecure function; consider using ast.literal_eval
(S307)
102-102: Do not catch blind exception: Exception
(BLE001)
103-103: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
104-104: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
codewiki/src/be/dependency_analyzer/graph_clustering.py
47-47: Do not catch blind exception: Exception
(BLE001)
98-98: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
codewiki/src/be/dependency_analyzer/analysis/analyzer_factory.py
17-17: Unused static method argument: config
(ARG004)
32-32: Do not catch blind exception: Exception
(BLE001)
45-45: Abstract raise to an inner function
(TRY301)
45-45: Avoid specifying long messages outside the exception class
(TRY003)
46-46: Do not catch blind exception: Exception
(BLE001)
codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py
97-97: Consider moving this statement to an else block
(TRY300)
100-100: Use explicit conversion flag
Replace with conversion flag
(RUF010)
101-101: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
123-123: Consider moving this statement to an else block
(TRY300)
125-125: Do not catch blind exception: Exception
(BLE001)
126-126: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
126-126: Use explicit conversion flag
Replace with conversion flag
(RUF010)
171-171: Consider moving this statement to an else block
(TRY300)
173-173: Do not catch blind exception: Exception
(BLE001)
174-174: Use explicit conversion flag
Replace with conversion flag
(RUF010)
178-178: Unused method argument: repo_path
(ARG002)
205-205: Do not catch blind exception: Exception
(BLE001)
409-409: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
424-428: Consider moving this statement to an else block
(TRY300)
430-430: Do not catch blind exception: Exception
(BLE001)
431-431: Use explicit conversion flag
Replace with conversion flag
(RUF010)
codewiki/src/utils.py
24-24: Avoid specifying long messages outside the exception class
(TRY003)
codewiki/src/be/dependency_analyzer/joern/client.py
81-81: subprocess call: check for execution of untrusted input
(S603)
106-106: subprocess call: check for execution of untrusted input
(S603)
107-107: Starting a process with a partial executable path
(S607)
139-139: Consider moving this statement to an else block
(TRY300)
172-172: Avoid specifying long messages outside the exception class
(TRY003)
181-181: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
182-182: Do not catch blind exception: Exception
(BLE001)
196-196: Do not catch blind exception: Exception
(BLE001)
199-199: Consider moving this statement to an else block
(TRY300)
202-202: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
202-202: Avoid specifying long messages outside the exception class
(TRY003)
204-204: Unused method argument: language
(ARG002)
253-253: subprocess call: check for execution of untrusted input
(S603)
263-263: Avoid specifying long messages outside the exception class
(TRY003)
325-325: Do not catch blind exception: Exception
(BLE001)
326-326: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
326-326: Use explicit conversion flag
Replace with conversion flag
(RUF010)
328-328: Use explicit conversion flag
Replace with conversion flag
(RUF010)
codewiki/src/be/dependency_analyzer/models/core.py
96-96: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (15)
codewiki/src/utils.py (1)
19-27: LGTM! Clean approach to handle set serialization.The custom
default_serializerproperly extends JSON serialization for sets while preserving the standardTypeErrorfor other unsupported types. This aligns well with the model changes wheredepends_ontransitioned fromSet[str]toList[str].codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
87-104: LGTM!The common path extraction correctly normalizes paths and finds the longest common prefix. The
zipbehavior of stopping at the shortest sequence is the desired behavior here.codewiki/src/be/dependency_analyzer/joern/joern_analysis_service.py (1)
15-65: LGTM!The
analyze_repositorymethod cleanly transforms Joern's raw CPG output into the internal node/relationship format with appropriate metadata. The structure is compatible with downstream consumers.codewiki/src/be/dependency_analyzer/analysis/analysis_service.py (1)
343-388: LGTM! Well-structured file selection strategy.The
_select_best_filesmethod implements a sensible prioritization: entry points first (30% budget), then high-connectivity files, then remaining files. This ensures important code paths are analyzed when file limits are hit.codewiki/src/be/dependency_analyzer/dependency_graphs_builder.py (2)
61-102: LGTM! Hybrid path correctly implemented.The branching between
DependencyParserandHybridAnalysisServiceis now correct. The node conversion handles both list and dict formats with appropriate error handling, and the save logic adapts to the parser type.
124-146: LGTM! Robust leaf node filtering.The enhanced validation correctly filters out invalid identifiers including error-like strings, empty values, and non-string types. This prevents downstream issues with malformed leaf nodes.
codewiki/src/be/dependency_analyzer/models/core.py (2)
6-53: LGTM! Well-structured Joern CPG data models.The new data models for control flow (CFGNode, CFGEdge, ControlFlowData) and data flow (DataDependency, DataFlowData) are well-designed with appropriate optional fields and sensible defaults using
Field(default_factory=list).
129-133: Inconsistency: Repository model fields don't match AI summary.The AI summary states that
nameandclone_pathwere removed from Repository, but they're still present in the code. If these fields are intentionally retained, this is fine; otherwise verify the intended schema.codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py (4)
33-46: LGTM! Clean initialization with graceful Joern availability check.The initialization properly creates both services and correctly chains the availability check (
enable_joern and joern_service.is_available).
211-249: LGTM! Cross-module relationship extraction now correctly usesast_result.The method signature and implementation have been updated to accept the full
ast_resultdict and correctly accessrelationshipsfrom it, addressing the previously reported bug.
271-304: LGTM! Function importance scoring logic is reasonable.The scoring heuristics (dependents count, docstring presence, name patterns, parameter complexity) provide a sensible prioritization for selecting functions for deeper analysis.
330-340: Good defensive normalization for List→Dict conversion.This correctly handles the case where
CallGraphAnalyzerreturns nodes as a List, converting to the Dict format expected by the rest of the hybrid logic.codewiki/src/be/dependency_analyzer/joern/client.py (3)
40-59: LGTM! Comprehensive JAR discovery with sensible fallback paths.The search paths cover common installation locations (project root, user home, global paths) and logging on discovery aids debugging.
179-183: Pickle deserialization from cache is acceptable here.While static analysis flags pickle as a security risk (S301), this cache is locally generated by the same application and stored in a user-controlled directory. The risk is minimal in this context. Consider adding a comment noting this design decision.
310-328: LGTM! Robust output parsing with marker-based extraction and fallback.The two-stage parsing (markers first, regex fallback) handles different Joern output formats gracefully. The error handling returns a structured error dict rather than raising, which aligns with the graceful degradation design.
|
|
||
| response_content = response.split("<GROUPED_COMPONENTS>")[1].split("</GROUPED_COMPONENTS>")[0] | ||
| # Use safe literal evaluation if possible, but existing code uses eval | ||
| module_tree = eval(response_content) |
There was a problem hiding this comment.
Security: Replace eval() with ast.literal_eval().
Using eval() on LLM responses is dangerous as it can execute arbitrary Python code. Use ast.literal_eval() which safely evaluates only literal structures (dicts, lists, strings, numbers).
🔎 Proposed fix
+import ast
+
# ... in the try block:
- module_tree = eval(response_content)
+ module_tree = ast.literal_eval(response_content)🧰 Tools
🪛 Ruff (0.14.10)
96-96: Use of possibly insecure function; consider using ast.literal_eval
(S307)
🤖 Prompt for AI Agents
In codewiki/src/be/cluster_modules.py around line 96, replace the unsafe use of
eval(response_content) with ast.literal_eval(response_content): import the ast
module at the top of the file if not present, call ast.literal_eval to parse the
LLM response into module_tree, and wrap the call in a try/except to catch
(ValueError, SyntaxError) and log or raise a controlled error if parsing fails;
ensure no other code expects eval-specific behavior and adjust accordingly.
| def _extract_data_flows_for_functions( | ||
| self, repo_path: str, ast_nodes: Dict[str, Any] | ||
| ) -> List[DataFlowRelationship]: | ||
| """ | ||
| Extract data flow relationships for key functions. | ||
|
|
||
| Args: | ||
| repo_path: Repository path | ||
| ast_nodes: AST nodes from analysis | ||
|
|
||
| Returns: | ||
| List of data flow relationships | ||
| """ | ||
| if not self.joern_client: | ||
| return [] | ||
|
|
||
| data_flows = [] | ||
|
|
||
| # Select a subset of important functions for data flow analysis | ||
| important_functions = self._select_important_functions(ast_nodes, limit=20) | ||
|
|
||
| for func_name, _func_info in important_functions.items(): | ||
| try: | ||
| # Optimized: We could use the already generated CPG from generate_cpg | ||
| # For now, keep the interface consistent with the new client's capability. | ||
| # In a more robust version, we'd query the CPG directly. | ||
| pass | ||
|
|
||
| except Exception as e: | ||
| logger.debug(f"Data flow analysis failed for {func_name}: {e}") | ||
| continue | ||
|
|
||
| return data_flows |
There was a problem hiding this comment.
Bug: References undefined self.joern_client attribute.
_extract_data_flows_for_functions references self.joern_client at line 190, but the class only initializes self.joern_service. This will raise AttributeError at runtime. Additionally, repo_path is unused (the loop body is just pass).
This method appears to be scaffolding for future implementation. Either fix the attribute name or mark this method as not yet implemented more clearly.
🔎 Proposed fix - align with existing attributes
def _extract_data_flows_for_functions(
self, repo_path: str, ast_nodes: Dict[str, Any]
) -> List[DataFlowRelationship]:
"""
Extract data flow relationships for key functions.
+
+ Note: This is a placeholder for future Joern data-flow integration.
Args:
repo_path: Repository path
ast_nodes: AST nodes from analysis
Returns:
List of data flow relationships
"""
- if not self.joern_client:
+ if not self.joern_service or not self.joern_service.is_available:
return []
data_flows = []
# Select a subset of important functions for data flow analysis
important_functions = self._select_important_functions(ast_nodes, limit=20)
for func_name, _func_info in important_functions.items():
try:
- # Optimized: We could use the already generated CPG from generate_cpg
- # For now, keep the interface consistent with the new client's capability.
- # In a more robust version, we'd query the CPG directly.
- pass
+ # TODO: Implement actual data flow extraction using Joern CPG
+ _ = repo_path # Suppress unused warning until implemented
+ logger.debug(f"Data flow extraction not yet implemented for {func_name}")
except Exception as e:
logger.debug(f"Data flow analysis failed for {func_name}: {e}")
continue
return data_flows🧰 Tools
🪛 Ruff (0.14.10)
178-178: Unused method argument: repo_path
(ARG002)
205-205: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
177 to 209, the method _extract_data_flows_for_functions references an undefined
attribute self.joern_client and also leaves repo_path unused; change the
attribute check to use the existing self.joern_service (i.e., if not
self.joern_service: return []), and either implement the scaffolded loop or make
the method explicitly not implemented by raising NotImplementedError (or
returning an empty list) while adding a TODO comment about using repo_path and
replacing the pass with the future implementation; ensure no reference to
self.joern_client remains.
| # Add data flow if available | ||
| data_flow_result = {} | ||
| if self.joern_analyzer: | ||
| data_flow_result = self.joern_analyzer.extract_data_flow_sample( | ||
| repo_path, function_name | ||
| ) |
There was a problem hiding this comment.
Bug: References undefined self.joern_analyzer attribute.
analyze_single_function_with_data_flow references self.joern_analyzer at line 419, but this attribute is never defined. The class initializes self.joern_service instead. This will raise AttributeError at runtime.
🔎 Proposed fix
# Add data flow if available
data_flow_result = {}
- if self.joern_analyzer:
- data_flow_result = self.joern_analyzer.extract_data_flow_sample(
+ if self.joern_service and self.joern_service.is_available:
+ # Note: JoernAnalysisService doesn't have extract_data_flow_sample
+ # This may need to use the underlying client or a different method
+ data_flow_result = self.joern_service.client.extract_data_flow_sample(
repo_path, function_name
)Alternatively, if extract_data_flow_sample isn't available on the service/client, this feature may need further implementation or should be disabled:
# Add data flow if available
data_flow_result = {}
- if self.joern_analyzer:
- data_flow_result = self.joern_analyzer.extract_data_flow_sample(
- repo_path, function_name
- )
+ # TODO: Implement single-function data flow extraction
+ # Current JoernAnalysisService doesn't expose extract_data_flow_sampleCommittable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/hybrid_analysis_service.py around lines
417 to 422, the method references an undefined attribute self.joern_analyzer
which causes an AttributeError; replace the check and call to use the existing
self.joern_service (i.e., if self.joern_service: data_flow_result =
self.joern_service.extract_data_flow_sample(repo_path, function_name)), and if
extract_data_flow_sample may not exist on the service, guard with
hasattr(self.joern_service, "extract_data_flow_sample") and fall back to an
empty data_flow_result or disable the feature with a clear log message.
| def _get_default_analysis_script(self, repo_path: str) -> str: | ||
| """Get standard Joern analysis script for Joern 4.x.""" | ||
| # Escape backslashes for Scala string literal | ||
| escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"") | ||
|
|
||
| # Use a more robust script format for Joern 4.x | ||
| return f''' | ||
| // Use language specific frontend if possible, fallback to generic | ||
| val cpg = try {{ | ||
| importCode.python("{escaped_path}") | ||
| }} catch {{ | ||
| case _: Exception => importCode("{escaped_path}", "analysis", "") | ||
| }} | ||
|
|
||
| val mJson = cpg.method.internal.map {{ m => | ||
| Map( | ||
| "name" -> m.name, | ||
| "fullName" -> m.fullName, | ||
| "file" -> m.filename, | ||
| "line" -> m.lineNumber.headOption.getOrElse(-1) | ||
| ) | ||
| }}.toJson | ||
|
|
||
| val cJson = cpg.call.map {{ c => | ||
| Map( | ||
| "name" -> c.name, | ||
| "caller" -> c.method.fullName.headOption.getOrElse(""), | ||
| "callee" -> c.methodFullName, | ||
| "line" -> c.lineNumber.headOption.getOrElse(-1) | ||
| ) | ||
| }}.toJson | ||
|
|
||
| println("---JOERN_START---") | ||
| println(s"""{{ | ||
| "total_files": ${{cpg.file.size}}, | ||
| "total_methods": ${{cpg.method.size}}, | ||
| "total_classes": ${{cpg.typeDecl.size}}, | ||
| "methods": $mJson, | ||
| "calls": $cJson | ||
| }}""") | ||
| println("---JOERN_END---") | ||
| ''' |
There was a problem hiding this comment.
Potential script injection via repo_path in Scala script.
While repo_path is escaped for backslashes and double quotes (line 270), a malicious path containing Scala string interpolation syntax (e.g., ${...}) or other Scala code patterns could potentially inject code into the generated script. Since repo_path typically comes from user input (CLI arguments or config), this warrants additional validation.
🔎 Proposed fix - add path validation
def _get_default_analysis_script(self, repo_path: str) -> str:
"""Get standard Joern analysis script for Joern 4.x."""
+ # Validate repo_path to prevent script injection
+ # Only allow alphanumeric, common path characters, and path separators
+ if not re.match(r'^[\w\s\-./\\:]+$', repo_path):
+ raise ValueError(f"Invalid characters in repository path: {repo_path}")
+
# Escape backslashes for Scala string literal
- escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"")
+ # Also escape $ to prevent Scala string interpolation
+ escaped_path = (
+ repo_path
+ .replace("\\", "\\\\")
+ .replace("\"", "\\\"")
+ .replace("$", "\\$")
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _get_default_analysis_script(self, repo_path: str) -> str: | |
| """Get standard Joern analysis script for Joern 4.x.""" | |
| # Escape backslashes for Scala string literal | |
| escaped_path = repo_path.replace("\\", "\\\\").replace("\"", "\\\"") | |
| # Use a more robust script format for Joern 4.x | |
| return f''' | |
| // Use language specific frontend if possible, fallback to generic | |
| val cpg = try {{ | |
| importCode.python("{escaped_path}") | |
| }} catch {{ | |
| case _: Exception => importCode("{escaped_path}", "analysis", "") | |
| }} | |
| val mJson = cpg.method.internal.map {{ m => | |
| Map( | |
| "name" -> m.name, | |
| "fullName" -> m.fullName, | |
| "file" -> m.filename, | |
| "line" -> m.lineNumber.headOption.getOrElse(-1) | |
| ) | |
| }}.toJson | |
| val cJson = cpg.call.map {{ c => | |
| Map( | |
| "name" -> c.name, | |
| "caller" -> c.method.fullName.headOption.getOrElse(""), | |
| "callee" -> c.methodFullName, | |
| "line" -> c.lineNumber.headOption.getOrElse(-1) | |
| ) | |
| }}.toJson | |
| println("---JOERN_START---") | |
| println(s"""{{ | |
| "total_files": ${{cpg.file.size}}, | |
| "total_methods": ${{cpg.method.size}}, | |
| "total_classes": ${{cpg.typeDecl.size}}, | |
| "methods": $mJson, | |
| "calls": $cJson | |
| }}""") | |
| println("---JOERN_END---") | |
| ''' | |
| def _get_default_analysis_script(self, repo_path: str) -> str: | |
| """Get standard Joern analysis script for Joern 4.x.""" | |
| # Validate repo_path to prevent script injection | |
| # Only allow alphanumeric, common path characters, and path separators | |
| if not re.match(r'^[\w\s\-./\\:]+$', repo_path): | |
| raise ValueError(f"Invalid characters in repository path: {repo_path}") | |
| # Escape backslashes for Scala string literal | |
| # Also escape $ to prevent Scala string interpolation | |
| escaped_path = ( | |
| repo_path | |
| .replace("\\", "\\\\") | |
| .replace("\"", "\\\"") | |
| .replace("$", "\\$") | |
| ) | |
| # Use a more robust script format for Joern 4.x | |
| return f''' | |
| // Use language specific frontend if possible, fallback to generic | |
| val cpg = try {{ | |
| importCode.python("{escaped_path}") | |
| }} catch {{ | |
| case _: Exception => importCode("{escaped_path}", "analysis", "") | |
| }} | |
| val mJson = cpg.method.internal.map {{ m => | |
| Map( | |
| "name" -> m.name, | |
| "fullName" -> m.fullName, | |
| "file" -> m.filename, | |
| "line" -> m.lineNumber.headOption.getOrElse(-1) | |
| ) | |
| }}.toJson | |
| val cJson = cpg.call.map {{ c => | |
| Map( | |
| "name" -> c.name, | |
| "caller" -> c.method.fullName.headOption.getOrElse(""), | |
| "callee" -> c.methodFullName, | |
| "line" -> c.lineNumber.headOption.getOrElse(-1) | |
| ) | |
| }}.toJson | |
| println("---JOERN_START---") | |
| println(s"""{{ | |
| "total_files": ${{cpg.file.size}}, | |
| "total_methods": ${{cpg.method.size}}, | |
| "total_classes": ${{cpg.typeDecl.size}}, | |
| "methods": $mJson, | |
| "calls": $cJson | |
| }}""") | |
| println("---JOERN_END---") | |
| ''' |
🤖 Prompt for AI Agents
In codewiki/src/be/dependency_analyzer/joern/client.py around lines 267 to 308,
the generated Scala script embeds repo_path into a Scala string literal but only
escapes backslashes and quotes, leaving it vulnerable to script injection via
Scala interpolation or control characters; validate and sanitize repo_path
before embedding by: canonicalizing the path (os.path.realpath) and ensuring it
is within an allowed repository root, rejecting or returning an error for paths
that contain interpolation patterns like "${", "s\"" sequences, newlines,
backticks, semicolons or other control characters, and/or non-whitelisted
characters; alternatively, escape any remaining interpolation sequences (e.g.,
replace "${" with "\${") so the value cannot be interpreted as Scala code, and
ensure the function raises a clear error on invalid input rather than emitting a
potentially dangerous script.
- Enhance type hints by adding Optional for current_module_name parameter - Add safety checks to prevent KeyError when navigating module tree nodes - Initialize missing "children" keys during recursive module tree navigation - Improve error messages and logging throughout the clustering process - Add validation for sub_leaf_nodes to ensure they exist in components dictionary - Format long lines for better readability while maintaining functionality - Strengthen LLM response parsing with more detailed error reporting and traceback logging - Add warning when invalid keys are encountered during module tree traversal
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
codewiki/src/be/cluster_modules.py (1)
116-116: Security: eval() remains unfixed.This is a duplicate of a previous review comment. The unsafe
eval()on line 116 should be replaced withast.literal_eval()to prevent arbitrary code execution. The comment on line 115 acknowledges the issue but doesn't address it.
🧹 Nitpick comments (1)
codewiki/src/be/cluster_modules.py (1)
123-124: Uselogger.exception()instead of manual traceback formatting.
logger.exception()automatically includes the traceback and is more concise than callinglogger.error()withtraceback.format_exc().🔎 Proposed fix
except Exception as e: - logger.error(f"Failed to parse LLM response: {e}. Response: {response[:200]}...") - logger.error(f"Traceback: {traceback.format_exc()}") + logger.exception(f"Failed to parse LLM response: {e}. Response: {response[:200]}...") return {}
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
codewiki/src/be/cluster_modules.py
🧰 Additional context used
🧬 Code graph analysis (1)
codewiki/src/be/cluster_modules.py (6)
codewiki/src/be/dependency_analyzer/models/core.py (1)
Node(55-75)codewiki/src/be/utils.py (1)
count_tokens(32-38)codewiki/cli/utils/logging.py (4)
debug(24-28)info(30-32)warning(38-40)error(42-44)codewiki/src/be/dependency_analyzer/graph_clustering.py (1)
cluster_graph_by_communities(9-72)codewiki/src/be/prompt_template.py (1)
format_cluster_prompt(308-337)codewiki/src/be/llm_services.py (1)
call_llm(58-86)
🪛 Ruff (0.14.10)
codewiki/src/be/cluster_modules.py
51-51: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
53-53: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
94-94: Do not catch blind exception: Exception
(BLE001)
116-116: Use of possibly insecure function; consider using ast.literal_eval
(S307)
122-122: Do not catch blind exception: Exception
(BLE001)
123-123: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (3)
codewiki/src/be/cluster_modules.py (3)
29-32: Good addition: grouping leaf nodes by file.The explicit grouping by file path improves organization and makes the formatted output more readable for downstream consumers.
76-96: Well-implemented graph-based clustering with fallback.The dynamic import prevents errors when Joern is unavailable, and the fallback strategy is sound. The broad
Exceptioncatch on line 94 is acceptable here since you're implementing a graceful degradation pattern with appropriate logging.
144-149: Excellent defensive programming with safety checks.The added safety checks prevent KeyError exceptions during tree navigation and properly initialize missing "children" keys. This makes the recursive clustering more robust.
| current_module_tree: Dict[str, Any] = {}, | ||
| current_module_name: Optional[str] = None, | ||
| current_module_path: List[str] = [], |
There was a problem hiding this comment.
Critical: Replace mutable default arguments with None.
Using mutable defaults ({} and []) causes all function calls to share the same dict/list instances, leading to state leakage between clustering operations. This is especially problematic for recursive functions.
🔎 Proposed fix
def cluster_modules(
leaf_nodes: List[str],
components: Dict[str, Node],
config: Config,
- current_module_tree: Dict[str, Any] = {},
- current_module_name: Optional[str] = None,
- current_module_path: List[str] = [],
+ current_module_tree: Optional[Dict[str, Any]] = None,
+ current_module_name: Optional[str] = None,
+ current_module_path: Optional[List[str]] = None,
) -> Dict[str, Any]:
"""
Cluster potential core components into modules.
[CCR] Relation: Clustering Strategy.
Reason: Joern enables graph-based clustering (Louvain), while AST-only falls back to LLM heuristics.
"""
+ if current_module_tree is None:
+ current_module_tree = {}
+ if current_module_path is None:
+ current_module_path = []
+
potential_core_components, potential_core_components_with_code = (
format_potential_core_components(leaf_nodes, components)
)🧰 Tools
🪛 Ruff (0.14.10)
51-51: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
53-53: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)

TLDR
Successfully implemented hybrid AST + Joern CPG analysis for CodeWiki using a Quick Win strategy. This PR adds powerful data flow insights while maintaining system stability and backward compatibility.
Key features include:
--use-joernflag to opt-in for enhanced analysis.Dive Deeper
The core philosophy of this integration is "Enhancement over Replacement." Historically, CodeWiki relied solely on AST parsing, which is fast and stable but limited in its ability to track data relationships across function boundaries or within complex logic.
Joern provides a Code Property Graph (CPG) that allows for deep semantic analysis. By integrating Joern as an optional enhancement layer, we can now offer:
Architecture-wise, the
HybridAnalysisServiceacts as a coordinator, delegating work to both the traditionalASTParserand the newSimplifiedJoernAnalyzer.Reviewer Test Plan
To validate these changes:
Environment Setup:
joern.jarto the project root, or let the CLI auto-download it on first run with--use-joern.Basic Verification:
codewiki generatecodewiki generate --use-joernFallback Verification:
joern.jaror simulate a Joern failure and run with--use-joern. The system should log a warning and fallback to standard AST analysis without crashing.Performance Check:
--verboseflag to see the analysis duration. Expected overhead for small-to-medium repos should be around 0.1s.Testing Matrix
Linked issues / bugs
Related to the ongoing initiative for "Enhanced Code Analysis Support".
Summary by CodeRabbit
New Features
Documentation
Other
✏️ Tip: You can customize this high-level summary in your review settings.