
Upgrade nginx configuration to reduce 104 error #176

Open

zhogu wants to merge 1 commit into dev from zhogu/fix-nginx

Conversation

zhogu (Contributor) commented Apr 1, 2026

This pull request introduces several improvements to the Nginx deployment and adds tooling for testing and mocking inference servers. The main themes are: hardening Nginx for robustness and scalability under high concurrency (reducing errno 104, "Connection reset by peer"), and providing tools for local testing and benchmarking of inference endpoints.

Nginx configuration and deployment improvements:

  • Increased worker_connections from 1024 to 65535 in nginx.conf.template to support more simultaneous connections, improving scalability.
  • Added backlog=4096 to both HTTP and HTTPS listen directives and set the system's net.core.somaxconn to 4096 in run.sh.template to match, reducing connection drops under heavy load.
  • Introduced a model_proxy_upstream block with keepalive settings for efficient connection reuse and automatic stale-connection retry, mitigating connection resets under high concurrency.
  • Updated proxy settings for /model-proxy/ and /job-server/ locations: set HTTP/1.1, cleared the Connection header, and enabled retry on connection-level errors for POST endpoints, increasing reliability.
  • Updated the Kubernetes deployment to grant the container NET_ADMIN capability, allowing it to set the listen backlog at runtime.
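Taken together, the keepalive upstream and backlog changes described above might look roughly like the following sketch. Directive values follow the PR description; the upstream server address is a placeholder for the templated MODEL_PROXY_URI host, not the actual template contents.

```nginx
# Sketch only: placeholder upstream address, values from the PR description.
upstream model_proxy_upstream {
    server model-proxy.internal:8080;
    keepalive 64;                       # idle upstream connections cached per worker
}

server {
    # backlog must not exceed net.core.somaxconn (raised to 4096 in run.sh.template)
    listen 80 backlog=4096;
    listen 443 ssl backlog=4096;

    location /model-proxy/ {
        proxy_pass http://model_proxy_upstream;
        proxy_http_version 1.1;         # required for upstream keepalive
        proxy_set_header Connection ""; # clear "close" so connections are reused
    }
}
```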

Tooling for local testing and benchmarking:

  • Added tools/mock_inference_server.py, a FastAPI-based mock inference server that simulates OpenAI-compatible chat completions with configurable delays and token counts for local/dev testing.
  • Added tools/test_inference.py, a concurrent benchmarking script to stress-test inference endpoints, reporting per-model and overall statistics, and logging failed requests.
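As a hypothetical sketch of the kind of per-model summary such a benchmark might report (the function name and metric choices here are mine, not taken from the script):

```python
import statistics

def summarize(latencies_by_model):
    """Reduce raw latency samples (seconds) to per-model summary stats."""
    out = {}
    for model, samples in latencies_by_model.items():
        samples = sorted(samples)
        out[model] = {
            "count": len(samples),
            "mean": statistics.fmean(samples),
            # Simple nearest-rank style percentiles over the sorted samples.
            "p50": samples[len(samples) // 2],
            "p99": samples[min(len(samples) - 1, int(len(samples) * 0.99))],
        }
    return out
```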


Copilot AI left a comment


Pull request overview

This PR hardens the Nginx deployment to reduce connection resets under high concurrency (e.g., 104 errors) and introduces local tooling to mock and benchmark inference endpoints.

Changes:

  • Increased Nginx connection capacity and listen backlog; added upstream keepalive/retry tuning for model-proxy and job-server routes.
  • Updated Kubernetes deployment and startup script to attempt raising somaxconn.
  • Added a mock OpenAI-compatible inference server and an async benchmarking script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

  • tools/test_inference.py: Adds an aiohttp-based concurrent benchmarking tool for inference endpoints.
  • tools/mock_inference_server.py: Adds a FastAPI mock server for OpenAI-style chat completions (streaming + non-streaming).
  • src/pylon/deploy/pylon.yaml.template: Grants additional Linux capabilities to support runtime networking/sysctl tuning.
  • src/pylon/deploy/pylon-config/run.sh.template: Attempts to raise net.core.somaxconn to match the configured Nginx backlog.
  • src/pylon/deploy/pylon-config/nginx.conf.template: Increases worker connection limits, adds backlog tuning, and introduces a keepalive upstream for model-proxy.
  • src/pylon/deploy/pylon-config/location.conf.template: Updates proxy settings to use upstream keepalive and retries on connection-level errors.


parser.add_argument(
"--models",
type=lambda s: s.split(","),
default="gpt-3,gpt-4,gpt-5,gpt-6,gpt-7,gpt-8,gpt-9",

Copilot AI Apr 1, 2026


Relying on a string default here is fragile: argparse parses a string default as if it came from the command line (so type= is applied and args.models does become a list when the flag is omitted), but that behavior is a subtlety that is easy to miss, and anything inspecting the default directly sees a raw comma-separated string. Make the default an explicit list (e.g., default=[...]) or set default=None and split after parsing.

Suggested change
default="gpt-3,gpt-4,gpt-5,gpt-6,gpt-7,gpt-8,gpt-9",
default=["gpt-3", "gpt-4", "gpt-5", "gpt-6", "gpt-7", "gpt-8", "gpt-9"],
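The explicit-list form behaves identically whether or not the flag is supplied, with no reliance on how argparse treats string defaults. A standalone sketch (shortened model list, not the PR's actual parser):

```python
import argparse

# An explicit list default makes args.models a list in every case.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--models",
    type=lambda s: s.split(","),       # applied to command-line values
    default=["gpt-3", "gpt-4"],        # already a list when flag is omitted
)

args = parser.parse_args([])                     # flag omitted
print(args.models)                               # ['gpt-3', 'gpt-4']
args = parser.parse_args(["--models", "a,b"])    # flag supplied
print(args.models)                               # ['a', 'b']
```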

Copilot uses AI. Check for mistakes.
Contributor

this suggestion is good

Comment on lines +77 to 82
proxy_pass http://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;

Copilot AI Apr 1, 2026


This hard-codes http://model_proxy_upstream even when MODEL_PROXY_URI is https://... (the upstream definition strips both schemes). That will break TLS-to-upstream and can cause backend connection failures or unintended plaintext traffic. Preserve the upstream scheme (e.g., select http vs https via templating / map, and add the required proxy_ssl_* directives when using HTTPS).

Suggested change
proxy_pass http://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
proxy_pass {{MODEL_PROXY_UPSTREAM_SCHEME}}://model_proxy_upstream/$1$is_args$args;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_connect_timeout 60m;
proxy_read_timeout 60m;
proxy_send_timeout 60m;
proxy_ssl_server_name on;
proxy_ssl_name model_proxy_upstream;

Comment on lines +18 to 22
worker_processes 1;

events {
-    worker_connections 1024;
+    worker_connections 65535;
}

Copilot AI Apr 1, 2026


Raising worker_connections to 65535 likely won’t be effective unless the process RLIMIT_NOFILE (and/or worker_rlimit_nofile) is also increased to cover worker_processes * worker_connections plus overhead. Without that, Nginx will warn and cap usable connections. Consider setting worker_rlimit_nofile (and aligning the container/pod ulimit) alongside this change.
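Aligning the descriptor limit might look like the following sketch (the rlimit value is illustrative, sized to cover worker_connections plus listen sockets, logs, and upstream connections; it is not taken from the PR):

```nginx
worker_processes 1;

# Give each worker enough file descriptors for its client connections,
# upstream connections, listen sockets, and log files, with headroom.
worker_rlimit_nofile 70000;

events {
    worker_connections 65535;
}
```

The container or pod ulimit (nofile) must also be at least this high, or the kernel will cap the limit regardless of what Nginx requests.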

Comment on lines +25 to +26
# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
sysctl -w net.core.somaxconn=4096 2>/dev/null || true

Copilot AI Apr 1, 2026


The comment states this requires NET_ADMIN, but changing net.core.somaxconn commonly requires sysctl permissions that aren’t granted by NET_ADMIN (and in Kubernetes is typically handled via pod-level securityContext.sysctls). Because failures are fully silenced, operators may believe backlog tuning is active when it isn’t. Consider (a) configuring this via pod securityContext.sysctls instead of container capabilities, and/or (b) logging a one-line warning when the sysctl write fails so it’s observable.

Suggested change
# Requires NET_ADMIN capability; if it fails, fall back to OS default silently.
sysctl -w net.core.somaxconn=4096 2>/dev/null || true
# Note: Changing net.core.somaxconn may require sysctl permissions (e.g. pod securityContext.sysctls in Kubernetes),
# not just NET_ADMIN. If this fails, log a warning and continue with the OS default.
sysctl -w net.core.somaxconn=4096 >/dev/null 2>&1 || echo "Warning: Failed to set net.core.somaxconn=4096; ensure pod securityContext.sysctls is configured if backlog tuning is required." >&2
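For option (a), the pod-level route might look like the sketch below. Names are placeholders; note that net.core.somaxconn is a namespaced sysctl but is treated as "unsafe" by default, so the kubelet must allow it (e.g., --allowed-unsafe-sysctls=net.core.somaxconn) before this pod will schedule.

```yaml
# Sketch of a pod spec fragment (names and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: pylon
spec:
  securityContext:
    sysctls:
      # Namespaced sysctl; requires the kubelet to allowlist it.
      - name: net.core.somaxconn
        value: "4096"
  containers:
    - name: nginx
      image: nginx:stable
```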

Comment on lines +88 to +92
data = {
"id": f"chatcmpl-{random_text(12)}",
"object": "chat.completion.chunk",
"model": model,
"choices": [

Copilot AI Apr 1, 2026


For OpenAI-style streaming responses, clients typically expect the id to remain stable across all chunks of a single completion. Generating a new id per chunk can break client-side correlation/assembling logic. Prefer generating one id per request and reusing it in all chunks (and optionally include consistent created metadata as well).
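One way to apply the fix, as a standalone sketch: random_text is assumed to match the mock server's helper of the same name, and the payload shape follows the excerpt above.

```python
import json
import random
import string
import time

def random_text(n):
    """Stand-in for the mock server's helper (assumed name): n random alphanumerics."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=n))

def make_chunks(model, tokens):
    """Build SSE chunk payloads sharing one completion id and created timestamp."""
    completion_id = f"chatcmpl-{random_text(12)}"  # generated once per request
    created = int(time.time())                     # likewise fixed for the stream
    chunks = []
    for token in tokens:
        data = {
            "id": completion_id,                   # same id in every chunk
            "object": "chat.completion.chunk",
            "created": created,
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": token}, "finish_reason": None}
            ],
        }
        chunks.append(f"data: {json.dumps(data)}\n\n")
    return chunks
```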

import string
import time
import asyncio
from typing import List

Copilot AI Apr 1, 2026


Unused import (List). Removing it avoids lint noise and keeps dependencies tidy.

Suggested change
from typing import List

async def run_for_model(model, args, file, headers):
connector = aiohttp.TCPConnector(
limit=args.concurrency,
force_close=True

Copilot AI Apr 1, 2026


force_close=True disables connection reuse, which adds TCP/TLS handshake overhead and can skew results—especially relevant since this PR is tuning Nginx/upstream keepalive behavior. Consider defaulting to keepalive-enabled connections (force_close=False) and, if needed, expose a CLI flag to force-close for specific experiments.

Suggested change
force_close=True
force_close=getattr(args, "force_close", False),
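The opt-in flag the comment suggests could be sketched as follows (flag name assumed; keepalive stays the default and force-close is reserved for specific experiments):

```python
import argparse

def build_parser():
    """Hypothetical CLI: keepalive by default, --force-close as opt-in."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--force-close",
        dest="force_close",
        action="store_true",  # defaults to False, i.e. connection reuse
        help="Open a fresh connection per request (adds TCP/TLS handshake cost).",
    )
    return parser

args = build_parser().parse_args([])
# aiohttp.TCPConnector(limit=..., force_close=args.force_close) would then
# reuse connections unless the flag is passed.
```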


app = FastAPI()

# ====== 全局配置 ======
Contributor


let's remove the comments in Chinese

Contributor


and other places

