Conversation
Pull request overview
This PR hardens the Nginx deployment to reduce connection resets under high concurrency (e.g., errno 104, "Connection reset by peer" errors) and introduces local tooling to mock and benchmark inference endpoints.
Changes:
- Increased Nginx connection capacity and listen backlog; added upstream keepalive/retry tuning for model-proxy and job-server routes.
- Updated Kubernetes deployment and startup script to attempt raising somaxconn.
- Added a mock OpenAI-compatible inference server and an async benchmarking script.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/test_inference.py | Adds an aiohttp-based concurrent benchmarking tool for inference endpoints. |
| tools/mock_inference_server.py | Adds a FastAPI mock server for OpenAI-style chat completions (streaming + non-streaming). |
| src/pylon/deploy/pylon.yaml.template | Grants additional Linux capabilities to support runtime networking/sysctl tuning. |
| src/pylon/deploy/pylon-config/run.sh.template | Attempts to raise net.core.somaxconn to match configured Nginx backlog. |
| src/pylon/deploy/pylon-config/nginx.conf.template | Increases worker connection limits, adds backlog tuning, and introduces a keepalive upstream for model-proxy. |
| src/pylon/deploy/pylon-config/location.conf.template | Updates proxy settings to use upstream keepalive and retries on connection-level errors. |
```python
parser.add_argument(
    "--models",
    type=lambda s: s.split(","),
    default="gpt-3,gpt-4,gpt-5,gpt-6,gpt-7,gpt-8,gpt-9",
```
Argparse does in fact apply type= conversion to a string default at parse time, so this happens to work after parse_args(); but that is a subtle corner of argparse, and anything that reads the default directly (e.g., parser.get_default("models")) or a future change to the type callable would see a single string, in which case for m in models would iterate characters. Making the default an explicit list (e.g., default=[...]) is clearer and removes the reliance on that conversion.
Suggested change:
```diff
-    default="gpt-3,gpt-4,gpt-5,gpt-6,gpt-7,gpt-8,gpt-9",
+    default=["gpt-3", "gpt-4", "gpt-5", "gpt-6", "gpt-7", "gpt-8", "gpt-9"],
```
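One way to sanity-check the behavior is to run argparse's string-default handling directly (model names here are placeholders, not the PR's list):

```python
import argparse

# Per the argparse docs, a string default is itself passed through the
# type= callable at parse time, just like a command-line value would be.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--models",
    type=lambda s: s.split(","),
    default="gpt-3,gpt-4",
)
args = parser.parse_args([])
print(args.models)  # → ['gpt-3', 'gpt-4']
```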
```nginx
proxy_pass http://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
```
This hard-codes http://model_proxy_upstream even when MODEL_PROXY_URI is https://... (the upstream definition strips both schemes). That will break TLS-to-upstream and can cause backend connection failures or unintended plaintext traffic. Preserve the upstream scheme (e.g., select http vs https via templating / map, and add the required proxy_ssl_* directives when using HTTPS).
Suggested change:
```diff
-proxy_pass http://model_proxy_upstream/$1$is_args$args;
+proxy_pass {{MODEL_PROXY_UPSTREAM_SCHEME}}://model_proxy_upstream/$1$is_args$args;
 proxy_http_version 1.1;
 proxy_set_header Connection "";
 proxy_connect_timeout 60m;
 proxy_read_timeout 60m;
 proxy_send_timeout 60m;
+proxy_ssl_server_name on;
+proxy_ssl_name model_proxy_upstream;
```
```diff
 worker_processes 1;

 events {
-    worker_connections 1024;
+    worker_connections 65535;
 }
```
Raising worker_connections to 65535 likely won’t be effective unless the process RLIMIT_NOFILE (and/or worker_rlimit_nofile) is also increased to cover worker_processes * worker_connections plus overhead. Without that, Nginx will warn and cap usable connections. Consider setting worker_rlimit_nofile (and aligning the container/pod ulimit) alongside this change.
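A hedged sketch of what this would look like (the limit value is an assumption, not taken from the PR):

```nginx
# Sketch: align the per-worker FD limit with the connection count. A proxied
# connection consumes roughly two descriptors (client side + upstream side),
# hence the 2x headroom.
worker_processes 1;
worker_rlimit_nofile 131072;  # >= worker_processes * worker_connections * 2

events {
    worker_connections 65535;
}
```

The container/pod `ulimit -n` must be at least as large, or the kernel will still cap the process.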
```shell
# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
sysctl -w net.core.somaxconn=4096 2>/dev/null || true
```
The comment states this requires NET_ADMIN, but changing net.core.somaxconn commonly requires sysctl permissions that aren’t granted by NET_ADMIN (and in Kubernetes is typically handled via pod-level securityContext.sysctls). Because failures are fully silenced, operators may believe backlog tuning is active when it isn’t. Consider (a) configuring this via pod securityContext.sysctls instead of container capabilities, and/or (b) logging a one-line warning when the sysctl write fails so it’s observable.
Suggested change:
```diff
-# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
-sysctl -w net.core.somaxconn=4096 2>/dev/null || true
+# Note: Changing net.core.somaxconn may require sysctl permissions (e.g. pod
+# securityContext.sysctls in Kubernetes), not just NET_ADMIN. If this fails,
+# log a warning and continue with the OS default.
+sysctl -w net.core.somaxconn=4096 >/dev/null 2>&1 || echo "Warning: Failed to set net.core.somaxconn=4096; ensure pod securityContext.sysctls is configured if backlog tuning is required." >&2
```
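For comparison, the Kubernetes-native route mentioned in the comment would look roughly like this hypothetical pod spec fragment (note that net.core.somaxconn is a namespaced sysctl but is commonly classed as *unsafe*, so many clusters require it to be allowed via the kubelet's --allowed-unsafe-sysctls before such a pod will schedule):

```yaml
spec:
  securityContext:
    sysctls:
      - name: net.core.somaxconn  # applied per-pod; no NET_ADMIN needed
        value: "4096"
```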
```python
data = {
    "id": f"chatcmpl-{random_text(12)}",
    "object": "chat.completion.chunk",
    "model": model,
    "choices": [
```
For OpenAI-style streaming responses, clients typically expect the id to remain stable across all chunks of a single completion. Generating a new id per chunk can break client-side correlation/assembling logic. Prefer generating one id per request and reusing it in all chunks (and optionally include consistent created metadata as well).
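A minimal sketch of the fix, assuming a helper named `random_text` like the one in the diff (the function name and chunk shape here are illustrative, not copied from the PR):

```python
import random
import string
import time

def random_text(n: int) -> str:
    # Stand-in for the mock server's helper of the same (assumed) name.
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=n))

def stream_chunks(model: str, pieces: list[str]):
    """Yield OpenAI-style chunks that share one id/created per completion."""
    completion_id = f"chatcmpl-{random_text(12)}"  # generated once per request
    created = int(time.time())
    for piece in pieces:
        yield {
            "id": completion_id,  # stable across all chunks
            "object": "chat.completion.chunk",
            "created": created,   # stable across all chunks
            "model": model,
            "choices": [{"index": 0, "delta": {"content": piece}}],
        }

chunks = list(stream_chunks("gpt-4", ["Hel", "lo"]))
assert len({c["id"] for c in chunks}) == 1
```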
```python
import string
import time
import asyncio
from typing import List
```
Unused import (List). Removing it avoids lint noise and keeps dependencies tidy.
Suggested change:
```diff
-from typing import List
```
```python
async def run_for_model(model, args, file, headers):
    connector = aiohttp.TCPConnector(
        limit=args.concurrency,
        force_close=True
```
force_close=True disables connection reuse, which adds TCP/TLS handshake overhead and can skew results—especially relevant since this PR is tuning Nginx/upstream keepalive behavior. Consider defaulting to keepalive-enabled connections (force_close=False) and, if needed, expose a CLI flag to force-close for specific experiments.
Suggested change:
```diff
-        force_close=True
+        force_close=getattr(args, "force_close", False),
```
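The CLI flag mentioned above could be wired up as follows (a sketch; the flag name `--force-close` is an assumption):

```python
import argparse

# Keepalive stays the default; force-close becomes an explicit opt-in
# for experiments that want a fresh TCP/TLS handshake per request.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--force-close",
    action="store_true",  # defaults to False, so connections are reused
    help="Open a fresh TCP connection per request (disables keepalive).",
)

assert parser.parse_args([]).force_close is False
assert parser.parse_args(["--force-close"]).force_close is True
```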
```python
app = FastAPI()

# ====== 全局配置 ======  (Chinese: "Global configuration")
```
Let's remove the comments in Chinese.
This pull request introduces several improvements to the Nginx deployment and adds tooling for testing and mocking inference servers. The main themes are: enhancing Nginx's robustness and scalability under high concurrency, and providing tools for local testing and benchmarking of inference endpoints.
Nginx configuration and deployment improvements:
- Increased `worker_connections` from 1024 to 65535 in `nginx.conf.template` to support more simultaneous connections, improving scalability.
- Added `backlog=4096` to both HTTP and HTTPS `listen` directives and set the system's `net.core.somaxconn` to 4096 in `run.sh.template` to match, reducing connection drops under heavy load. [1] [2] [3]
- Added a `model_proxy_upstream` block with keepalive settings for efficient connection reuse and automatic stale-connection retry, mitigating connection resets under high concurrency.
- Updated the `/model-proxy/` and `/job-server/` locations: set HTTP/1.1, cleared the `Connection` header, and enabled retry on connection-level errors for POST endpoints, increasing reliability. [1] [2]
- Granted the container the `NET_ADMIN` capability, allowing it to set the listen backlog at runtime.

Tooling for local testing and benchmarking:
- Added `tools/mock_inference_server.py`, a FastAPI-based mock inference server that simulates OpenAI-compatible chat completions with configurable delays and token counts for local/dev testing.
- Added `tools/test_inference.py`, a concurrent benchmarking script to stress-test inference endpoints, reporting per-model and overall statistics, and logging failed requests.
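As a rough illustration of the upstream keepalive and retry shape described above (directive values and the backend address are assumptions, not copied from the PR):

```nginx
upstream model_proxy_upstream {
    server 127.0.0.1:8080;   # hypothetical backend address
    keepalive 64;            # pool of idle upstream connections per worker
}

server {
    listen 80 backlog=4096;

    location /model-proxy/ {
        proxy_pass http://model_proxy_upstream;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";  # don't forward "Connection: close"
        # Retry connection-level failures on the next attempt;
        # "non_idempotent" opts POST requests into retries as well.
        proxy_next_upstream error timeout non_idempotent;
        proxy_next_upstream_tries 2;
    }
}
```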