Skip to content

Feat/multimodal model support#878

Closed
suluyana wants to merge 5 commits intomodelscope:mainfrom
suluyana:feat/multimodal_model_support
Closed

Feat/multimodal model support#878
suluyana wants to merge 5 commits intomodelscope:mainfrom
suluyana:feat/multimodal_model_support

Conversation

@suluyana
Copy link
Collaborator

@suluyana suluyana commented Mar 5, 2026

No description provided.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant architectural and functional upgrades to the ms-agent platform. It integrates multimodal model support, allowing the system to process and understand image content, thereby expanding its capabilities beyond text-only interactions. A new FastAPI-based API server has been developed to provide robust configuration, project, session, and file management, along with real-time agent execution via WebSockets. This backend is complemented by a completely redesigned Next.js and Tailwind CSS frontend, offering a modern, responsive user experience for interacting with agents, monitoring system activity, and managing settings.

Highlights

  • Multimodal Model Support: The ms-agent framework now supports multimodal models, such as Alibaba Cloud's qwen3.5-plus, enabling image understanding, object recognition, and integrated visual-textual conversations. Comprehensive documentation and examples have been added to guide users on configuration and usage.
  • New FastAPI-based API Server: A complete FastAPI-based API server has been introduced, providing robust REST and WebSocket endpoints for managing configurations (LLM, MCP, EdgeOne Pages, Deep Research), projects, sessions, and file operations. This server mirrors and enhances functionalities previously handled by the webui/backend.
  • Real-time Agent Execution via WebSockets: Agent and workflow executions can now be initiated and monitored in real-time through dedicated WebSocket connections. A new AgentExecutor with WebSocketCallback ensures live broadcasting of events, status updates, tool calls, and results to connected clients.
  • Frontend Overhaul with Next.js and Tailwind CSS: The entire web user interface has been rewritten using Next.js 16 (App Router) and Tailwind CSS. This modern frontend offers a redesigned chat interface, a new system monitor page, and a settings page for LLM configuration and theme management, providing a more responsive and intuitive user experience.
  • Enhanced Configuration and File Management APIs: The new API server includes detailed endpoints for managing various system configurations and performing file operations. This includes listing, reading, and streaming files, alongside robust management of MCP servers, EdgeOne Pages deployments, and Deep Research settings.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • MULTIMODAL_SUPPORT.md
    • Added new documentation for multimodal model support in ms-agent, detailing configuration, usage, and message formats.
  • api/ENHANCEMENTS.md
    • Added a comprehensive document outlining new API features for configuration, project, and file management.
  • api/IMPLEMENTATION_SUMMARY.md
    • Added a summary detailing the completion of API implementation TODOs and architectural improvements.
  • api/README.md
    • Added a new README providing an overview, features, and usage instructions for the MS-Agent API server.
  • api/init.py
    • Added the initialization file for the API package, setting its version.
  • api/agent.py
    • Added API endpoints for starting, stopping, and checking the status of agent executions.
  • api/agent_executor.py
    • Added the core agent and workflow execution engine with WebSocket callback integration.
  • api/config.py
    • Added API endpoints and a manager for persistent configuration settings across various services.
  • api/files.py
    • Added API endpoints for listing, reading, and streaming files with intelligent path resolution.
  • api/main.py
    • Added the main FastAPI application entry point, integrating all API routes and middleware.
  • api/models.py
    • Added Pydantic models for API requests, responses, and WebSocket messages.
  • api/project.py
    • Added API endpoints and a discovery mechanism for managing project information and workflows.
  • api/run_server.py
    • Added a script to launch the FastAPI server.
  • api/session.py
    • Added API endpoints for managing user sessions, messages, and events.
  • api/session_manager.py
    • Added an in-memory manager for session data, messages, and progress events.
  • api/test_api.py
    • Added a basic test script to verify core API functionality.
  • api/test_new_endpoints.py
    • Added a comprehensive test script for the newly implemented API endpoints.
  • api/utils.py
    • Added utility functions for JSON handling, path resolution, error formatting, and sensitive data masking.
  • api/websocket.py
    • Added WebSocket endpoints for real-time agent and chat communication.
  • config/cfg_model_multimodal.yaml
    • Added a new YAML configuration file for multimodal models like Qwen3.5-plus.
  • examples/agent/test_llm_agent_multimodal.py
    • Added an example script demonstrating multimodal capabilities of LLMAgent in various modes.
  • ms_agent/agent/llm_agent.py
    • Updated log_output method to correctly handle and log multimodal content.
  • ms_agent/llm/openai_llm.py
    • Modified _format_input_message to properly process multimodal content in API requests.
  • webui/frontend/.gitignore
    • Added a new Git ignore file for the Next.js frontend.
  • webui/frontend/README.md
    • Added a new README detailing the Next.js and Tailwind CSS frontend.
  • webui/frontend/app/globals.css
    • Added global CSS styles, including Markdown and KaTeX rendering.
  • webui/frontend/app/layout.tsx
    • Added the root layout component for the Next.js application.
  • webui/frontend/app/monitor/page.tsx
    • Added a new page for monitoring system health and session activity.
  • webui/frontend/app/page.tsx
    • Added the main home/chat page with project selection and real-time interaction.
  • webui/frontend/app/settings/page.tsx
    • Added a new page for configuring LLM settings and theme preferences.
  • webui/frontend/components/Sidebar.tsx
    • Added a new collapsible sidebar navigation component.
  • webui/frontend/components/ThemeScript.tsx
    • Added a script to handle theme initialization and prevent FOUC.
  • webui/frontend/context/GlobalContext.tsx
    • Added a React Context for managing global application state.
  • webui/frontend/eslint.config.mjs
    • Added ESLint configuration for the Next.js project.
  • webui/frontend/index.html
    • Removed the old index.html file.
  • webui/frontend/next.config.ts
    • Added Next.js configuration, including static export options.
  • webui/frontend/package.json
    • Updated frontend dependencies and scripts for the Next.js environment.
  • webui/frontend/postcss.config.mjs
    • Added PostCSS configuration for Tailwind CSS.
  • webui/frontend/public/file.svg
    • Added a new SVG icon for files.
  • webui/frontend/public/globe.svg
    • Added a new SVG icon for global/web elements.
  • webui/frontend/public/next.svg
    • Added a new SVG asset.
  • webui/frontend/public/vercel.svg
    • Added a new SVG asset.
  • webui/frontend/public/window.svg
    • Added a new SVG icon for windows.
  • webui/frontend/tailwind.config.js
    • Added Tailwind CSS configuration file.
  • webui/frontend/tsconfig.json
    • Updated TypeScript configuration for Next.js.
  • webui/frontend/types/api.ts
    • Added new TypeScript type definitions for API requests and responses.
  • webui/frontend_backup_20260212_112822/package.json
    • Renamed package.json to a backup file.
  • webui/frontend_backup_20260212_112822/public/favicon.svg
    • Renamed favicon.svg to a backup file.
  • webui/frontend_backup_20260212_112822/src/App.tsx
    • Renamed App.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/ChatView.tsx
    • Renamed ChatView.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/ConversationView.tsx
    • Renamed ConversationView.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/FileProgress.tsx
    • Renamed FileProgress.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/Layout.tsx
    • Renamed Layout.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/LogViewer.tsx
    • Renamed LogViewer.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/MessageContent.tsx
    • Renamed MessageContent.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/SearchView.tsx
    • Renamed SearchView.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/SettingsDialog.tsx
    • Renamed SettingsDialog.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/components/WorkflowProgress.tsx
    • Renamed WorkflowProgress.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/context/SessionContext.tsx
    • Renamed SessionContext.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/context/ThemeContext.tsx
    • Renamed ThemeContext.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/src/main.tsx
    • Renamed main.tsx to a backup file.
  • webui/frontend_backup_20260212_112822/tsconfig.json
    • Renamed tsconfig.json to a backup file.
  • webui/frontend_backup_20260212_112822/tsconfig.node.json
    • Renamed tsconfig.node.json to a backup file.
  • webui/frontend_backup_20260212_112822/vite.config.ts
    • Renamed vite.config.ts to a backup file.
Activity
  • A new FastAPI-based API server has been implemented, consolidating and enhancing backend functionalities.
  • The frontend has undergone a complete rewrite using Next.js and Tailwind CSS, replacing the previous React/MUI stack.
  • Multimodal capabilities have been integrated into the ms-agent core, allowing for image processing and understanding.
  • Extensive documentation for multimodal support and the new API server has been added.
  • New test scripts were created to ensure the functionality of the new API endpoints.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces significant new features, including multimodal model support, a comprehensive FastAPI-based API server, and a completely new Next.js frontend. However, the new backend API implementation contains critical security vulnerabilities, specifically multiple path traversal vulnerabilities in the file management API and a remote code execution (RCE) vector via the MCP server configuration endpoint, exacerbated by a lack of authentication. Other issues include an insecure CORS configuration and architectural patterns that may limit scalability. Immediate remediation of these security and architectural concerns is required for a robust and secure application.

Comment on lines +46 to +139
def resolve_root_dir(root_dir: Optional[str], session_id: Optional[str] = None) -> str:
"""
Resolve optional root_dir to an absolute normalized path within allowed roots.
Default: output_dir
Supports:
- None/"" => output_dir
- "output", "projects", "projects/xxx"
- absolute path (must still be under allowed roots)
"""
if session_id:
session_root = get_session_root(session_id)
return str(session_root.resolve())

_, output_dir, projects_dir = get_allowed_roots()

if not root_dir or root_dir.strip() == '':
resolved = output_dir
else:
rd = root_dir.strip()

if os.path.isabs(rd):
resolved = rd
else:
# Allow explicit "output"/"projects"
if rd in ('output', 'output/'):
resolved = output_dir
elif rd in ('projects', 'projects/'):
resolved = projects_dir
else:
cand1 = os.path.join(output_dir, rd)
cand2 = os.path.join(projects_dir, rd)
# choose existing one if possible, otherwise default to cand1
resolved = cand1 if os.path.exists(cand1) else (
cand2 if os.path.exists(cand2) else cand1)

resolved = os.path.normpath(os.path.abspath(resolved))

# TODO: Security check: ensure `resolved` is within configured allowed roots.

return resolved


def resolve_file_path(root_dir_abs: str, file_path: str) -> str:
"""
Resolve file_path against root_dir_abs.
- if file_path starts with 'projects/', resolve from ms-agent base dir
- if file_path is absolute, use as-is
- if relative, join(root_dir_abs, file_path)
"""
root_dir_abs = os.path.normpath(os.path.abspath(root_dir_abs))

if os.path.isabs(file_path):
full_path = os.path.normpath(os.path.abspath(file_path))
elif file_path.startswith('projects/'):
# Special case: if path starts with 'projects/', resolve from base_dir
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
full_path = os.path.normpath(
os.path.abspath(os.path.join(base_dir, file_path)))
else:
# Try multiple locations
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

candidates = [
# First try with root_dir_abs (for session-based access)
os.path.join(root_dir_abs, file_path),
]

# Search in project output directories
projects_dir = os.path.join(base_dir, 'projects')
if os.path.exists(projects_dir):
try:
for project_name in os.listdir(projects_dir):
project_path = os.path.join(projects_dir, project_name)
if os.path.isdir(project_path):
candidates.append(
os.path.join(project_path, 'output', file_path))
except (OSError, PermissionError):
pass

# Find first existing file
full_path = None
for candidate in candidates:
candidate = os.path.normpath(candidate)
if os.path.exists(candidate) and os.path.isfile(candidate):
full_path = candidate
break

if not full_path:
# Default to first candidate if none found
full_path = os.path.normpath(candidates[0])

# TODO: Security check: ensure `full_path` is within configured allowed roots.

return full_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

The resolve_root_dir and resolve_file_path functions in api/files.py are critically vulnerable to path traversal. The current implementation allows absolute paths, enabling an attacker to access arbitrary files on the system. os.path.abspath is insufficient for prevention. To remediate, disallow absolute paths in user input and strictly validate that the final resolved path is within the intended base directory, for example, by using os.path.realpath with startswith or os.path.commonpath.

Comment on lines +439 to +456
@router.post("/mcp/servers")
async def add_mcp_server(server: MCPServerConfig):
"""Add a new MCP server"""
try:
success = config_manager.add_mcp_server(
server.name,
server.dict(exclude={'name'})
)
if not success:
raise HTTPException(status_code=500, detail="Failed to add server")

return APIResponse(
success=True,
message="MCP server added successfully"
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

The POST /api/v1/config/mcp/servers endpoint allows users to add arbitrary MCP server configurations. For servers of type stdio, this includes specifying a command and args that will be executed by the system when an agent is started. Since the API lacks authentication by default, an unauthenticated attacker can achieve Remote Code Execution (RCE) by configuring a malicious command (e.g., rm -rf / or a reverse shell).

To remediate this, you must:

  1. Implement strong authentication and authorization for all configuration endpoints.
  2. Restrict the command field to a strict allow-list of safe executables if possible.

Comment on lines +26 to +35
def get_session_root(session_id: str) -> Path:
"""Get the work directory for a session"""
if not session_id or not str(session_id).strip():
raise HTTPException(status_code=400, detail='session_id is required')

# Get the API root directory
api_root = Path(__file__).resolve().parent
work_dir = (api_root / 'work_dir' / str(session_id)).resolve()
work_dir.mkdir(parents=True, exist_ok=True)
return work_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The get_session_root function uses the user-supplied session_id to construct a file path without sanitization. An attacker could provide a session_id like ../../ to escape the work_dir and access or create directories elsewhere on the filesystem.

To remediate this, sanitize the session_id to ensure it only contains alphanumeric characters or use a UUID validation check.

router = APIRouter(prefix="/api/v1/agent", tags=["agent"])

# Store running agent executors
running_agents: Dict[str, AgentExecutor] = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The use of a global in-memory dictionary running_agents to store the state of active agent executors introduces a significant architectural limitation. This approach will not work correctly in a multi-worker production environment (e.g., when using Gunicorn with multiple workers), as each worker process will have its own separate memory space.

This will lead to issues such as:

  • Requests for the same session being routed to different workers that don't have the agent's state.
  • Inability to scale horizontally by adding more workers or nodes.
  • Loss of all running agent states if a worker process restarts.

To address this, consider using a shared state management solution like Redis or a database to store and coordinate the state of running agents across all workers.

Comment on lines +40 to +47
# CORS configuration
app.add_middleware(
CORSMiddleware,
allow_origins=['*'], # In production, specify actual origins
allow_credentials=True,
allow_methods=['*'],
allow_headers=['*'],
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The CORS configuration in api/main.py is insecure. Allowing all origins (*) while also enabling allow_credentials=True poses a security risk, as it permits any website to make authenticated requests to your API. This configuration may also be rejected by modern browsers. It is critical to restrict allow_origins to a specific list of trusted domains or use a regular expression for remediation before deployment.

    allow_origins=['https://your-frontend-domain.com'],  # TODO: Replace with your actual frontend domain(s) in production

Comment on lines +58 to +59
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a bare except Exception: pass is generally discouraged as it can silently swallow all exceptions, including unexpected ones like KeyboardInterrupt or SystemExit. This makes debugging very difficult if an error occurs while loading the mcp_servers.json file (e.g., due to a JSON syntax error or file permission issue).

It's better to either catch a more specific exception (like json.JSONDecodeError or IOError) or at least log the exception that was caught.

Suggested change
except Exception:
pass
except Exception as e:
# It's better to log this error to help with debugging configuration issues.
# Consider using the logging module here.
print(f"Warning: Failed to load or parse MCP server config from {self.mcp_file}: {e}")

Comment on lines +251 to +252
if not os.path.abspath(full_path).startswith(os.path.abspath(project['path'])):
raise HTTPException(status_code=403, detail="Access denied")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The security check to prevent path traversal is a good addition. However, using os.path.abspath is not fully secure against attacks involving symbolic links. A malicious user could create a symlink within the project directory that points to a sensitive location outside of it (e.g., symlink -> /).

To make this check more robust, you should use os.path.realpath to resolve any symbolic links in the path before comparing it with the base project path. This ensures you are always comparing canonical paths.

Suggested change
if not os.path.abspath(full_path).startswith(os.path.abspath(project['path'])):
raise HTTPException(status_code=403, detail="Access denied")
if not os.path.realpath(full_path).startswith(os.path.realpath(project['path'])):

@suluyana suluyana closed this Mar 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant