In non-streaming mode, if model inference takes longer than about 2 minutes, the backend connection is interrupted and the request fails with a 500 error. The issue does not occur in streaming mode. Could you consider adding a periodic data exchange (e.g., keep-alive) mechanism to prevent long-lived connections from timing out?
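In case it helps, here is a minimal sketch of one possible approach, assuming a FastAPI backend and a hypothetical `run_inference()` coroutine (both are illustrative, not the project's actual code): wrap the non-streaming endpoint in a chunked response that emits whitespace heartbeat bytes while inference runs, then sends the final JSON. Since JSON parsers ignore leading whitespace, clients can still parse the response as a single JSON document.

```python
# A minimal sketch, assuming a FastAPI backend; run_inference() and the
# /generate route are hypothetical placeholders, not the real code.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

HEARTBEAT_INTERVAL = 15  # seconds; well below typical proxy/LB timeouts


async def run_inference(prompt: str) -> dict:
    """Placeholder for the real (long-running) model call."""
    await asyncio.sleep(180)  # simulate > 2 minutes of inference
    return {"output": f"result for {prompt!r}"}


@app.post("/generate")
async def generate(payload: dict):
    async def body():
        task = asyncio.create_task(run_inference(payload.get("prompt", "")))
        while not task.done():
            # Leading whitespace is legal before a JSON document, so these
            # heartbeat bytes keep the connection alive without changing
            # how the final (non-streaming) response is parsed.
            yield b"\n"
            await asyncio.wait({task}, timeout=HEARTBEAT_INTERVAL)
        # Inference finished; emit the actual JSON result as the last chunk.
        yield json.dumps(task.result()).encode()

    return StreamingResponse(body(), media_type="application/json")
```

Alternatives would be enabling TCP keep-alive on the server socket or raising the proxy's idle timeout, but an application-level heartbeat like the above tends to survive intermediate proxies more reliably.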
On a separate note: I am a developer from China, and I am here purely for fair and respectful technical collaboration.
While some business users from China may focus only on profit, many of us are genuine engineers who value open-source contribution and technical exchange. Please do not let the actions of a few create bias against Chinese developers.