test_generate_from_raw_with_format flaky due to missing MAX_NEW_TOKENS
Description
test/backends/test_openai_vllm.py::test_generate_from_raw_with_format fails intermittently with truncated JSON output.
The test doesn't set MAX_NEW_TOKENS, so it falls back to vLLM's SamplingParams default of 16 tokens. A valid JSON response (e.g. {"name": "get_sum", "value": 2}) requires ~20+ tokens, so generation stops before the closing brace and the output is not parseable JSON.
The chat-based equivalent (test_format in the same file) already sets MAX_NEW_TOKENS: 256 and passes reliably.
Error
pydantic_core._pydantic_core.ValidationError: 1 validation error for Answer
Invalid JSON: EOF while parsing an object at line 3 column 12
input_value='{ \n "name": "get_sum",\n "value": 2'
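The failure mode can be reproduced without a model. The sketch below is a minimal illustration using only the standard library's json module rather than pydantic: it takes the truncated input_value from the error above and compares it with the same text plus the missing closing brace. The helper name parses_as_json and the completed string are assumptions made for this illustration; they are not part of the test suite.

```python
import json

# Truncated model output, copied from the input_value in the error above.
truncated = '{ \n "name": "get_sum",\n "value": 2'

# Assumed intended output: the same text with the closing brace
# that was cut off when generation hit the token limit.
complete = truncated + "\n}"


def parses_as_json(text: str) -> bool:
    """Return True if text is a complete, valid JSON document (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(truncated))  # the cut-off output
print(parses_as_json(complete))   # the output with the brace restored
```

Only the final brace separates the two strings, which is consistent with the output being cut off a few tokens short of completion.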
Fix
Add model_options={ModelOption.MAX_NEW_TOKENS: 256} to the generate_from_raw call in the test, matching what test_format already does.
Environment
Observed on LSF cluster (CUDA, vLLM 0.13+).