test: fix flaky ollama tests, remove stale xfails, add diagnostic logging #598
Closed
planetf1 wants to merge 11 commits into
…puting#565)

# Misc PR

## Type of PR

- [ ] Bug Fix
- [ ] New Feature
- [ ] Documentation
- [x] Other

## Description

- [x] Link to Issue: Fixes generative-computing#565

Removed `--cov-report=term` from the `[tool.pytest.ini_options]` configuration in `pyproject.toml` so that test runs no longer dump large code coverage tables to the terminal. Test coverage is still generated and written to `htmlcov/` and `coverage.json`.

### Testing

- [ ] Tests added to the respective file if code was changed
- [ ] New code has 100% coverage if code was added
- [ ] Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)
Introduces `test_astream_mock.py` to test `ModelOutputThunk`'s async queue incremental streaming logic deterministically without relying on highly-variable LLM backends.
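A minimal sketch of the idea, assuming a producer that pushes chunks onto an `asyncio.Queue` and ends the stream with a `None` sentinel. The names `fake_backend`, `consume`, and `run_stream` are hypothetical and do not reflect `ModelOutputThunk`'s actual interface; the point is only that a scripted producer makes the chunk sequence deterministic.

```python
import asyncio


async def fake_backend(queue: asyncio.Queue, chunks: list[str]) -> None:
    """Hypothetical stand-in for an LLM backend: emits scripted chunks."""
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(0)  # yield control so the consumer can interleave
    await queue.put(None)  # sentinel marking the end of the stream


async def consume(queue: asyncio.Queue) -> list[str]:
    """Record the accumulated text after each chunk arrives."""
    snapshots: list[str] = []
    text = ""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        text += chunk
        snapshots.append(text)
    return snapshots


async def run_stream(chunks: list[str]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(fake_backend(queue, chunks))
    result = await consume(queue)
    await producer
    return result


snapshots = asyncio.run(run_stream(["Hel", "lo", "!"]))
```

Because the chunks are scripted, a test can assert on every intermediate snapshot rather than only on the final text.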
Pop exception from chunks list (like we do for the None sentinel) so _process doesn't receive it. Guard chat_response access in ollama post_processing with .get() for when no valid chunks arrived. Signed-off-by: 0xCUB3 <skula@mit.edu>
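The fix described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: `split_chunks` and `extract_chat_response` are hypothetical helpers, and the `"chat_response"` key stands in for whatever metadata the Ollama post-processing step reads.

```python
from typing import Any, Optional


def split_chunks(chunks: list[Any]) -> tuple[list[Any], Optional[Exception]]:
    """Pop a trailing None sentinel or exception so processing never sees it."""
    error: Optional[Exception] = None
    if chunks and chunks[-1] is None:
        chunks.pop()
    elif chunks and isinstance(chunks[-1], Exception):
        error = chunks.pop()
    return chunks, error


def extract_chat_response(meta: dict[str, Any]) -> Optional[Any]:
    """Use .get() so a stream with no valid chunks yields None, not KeyError."""
    return meta.get("chat_response")


valid, err = split_chunks(["a", "b", RuntimeError("connection dropped")])
```

The caller can then process `valid`, run its cleanup, and re-raise `err` afterwards if one was captured.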
Signed-off-by: 0xCUB3 <skula@mit.edu>
Signed-off-by: 0xCUB3 <skula@mit.edu>
… key exists Signed-off-by: 0xCUB3 <skula@mit.edu>
Unit tests that verify exceptions in the async queue are cleanly propagated without reaching _process, and that _post_process still runs for telemetry cleanup.
…ging
- Remove xfail from test_generate_from_raw_with_format (consistently passing)
- Remove xfail from test_multiple_async_funcs (watsonx litellm bug resolved)
- Add CONTEXT_WINDOW: 2048 and stronger assertions to generate_from_raw tests
- Add pytest.mark.timeout(150) to test_generate_from_raw
- Increase MAX_NEW_TOKENS to 2**10 in format tests
- Add FancyLogger warning when generate_from_raw catches an exception
- Mark researcher example as slow; add markers to query_clarification
- Update slow marker description in pyproject.toml
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Closing in favour of a clean branch rebased on upstream/main
Summary
- Remove `xfail` from `test_generate_from_raw_with_format`: consistently passing; xfail was masking real failures
- Remove `xfail` from `test_multiple_async_funcs` (watsonx/litellm bug resolved)
- Add `CONTEXT_WINDOW: 2048` to both `generate_from_raw` tests to reduce memory pressure on Ollama
- Add stronger assertions (`assert all(r.value for r in results)` with diagnostic message)
- Add `pytest.mark.timeout(150)` to `test_generate_from_raw` to bound worst-case flake
- … `test_generate_from_raw_with_format`
- Increase `MAX_NEW_TOKENS` from `2**8` to `2**10` in format tests (ollama/openai-ollama)
- Add `FancyLogger.warning` diagnostic when `generate_from_raw` catches an exception
- Mark `researcher.py` example as `slow`; add markers to `query_clarification.py`
- Update `slow` marker description to ">1 minute"

Notes
There is one remaining known issue with Ollama under sustained load: empty-body responses that are not exceptions and therefore not caught by the new logging. A separate issue will track that investigation.
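To illustrate why empty-body responses slip past exception logging yet are still caught by an assertion of the form `assert all(r.value for r in results)`, here is a rough sketch. It uses the standard `logging` module in place of `FancyLogger`, and `Result`, `safe_generate`, `check_results`, and `boom` are hypothetical names that are not part of the codebase.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("generate_from_raw_sketch")


@dataclass
class Result:
    value: Optional[str]


def safe_generate(call: Callable[[], str]) -> Result:
    """Wrap a backend call and log a warning if it raises (assumed behaviour)."""
    try:
        return Result(call())
    except Exception as exc:
        logger.warning("generate_from_raw caught %s: %s", type(exc).__name__, exc)
        return Result(None)


def check_results(results: list[Result]) -> None:
    """Fail on any empty value, whether from an exception or an empty body."""
    empty = [i for i, r in enumerate(results) if not r.value]
    assert all(r.value for r in results), f"empty results at indices {empty}"


def boom() -> str:
    raise ConnectionError("backend unavailable")


# One good response, one empty body (no exception), one raised exception.
results = [safe_generate(lambda: "ok"), safe_generate(lambda: ""), safe_generate(boom)]
```

Only the third call produces a warning, since the empty string at index 1 raises nothing. Calling `check_results(results)` flags both cases, which is why the assertion and the logging complement each other.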
Test plan
- `uv run pytest test/backends/test_ollama.py` passes without xfail noise
- `uv run pytest test/backends/test_litellm_watsonx.py` passes (or fails for a real reason)
- `test/` suite shows no regressions