fix: intrinsic tests and add some safeguards for future adapters changes #1078
Conversation
…nsics Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
…ests; small nits Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS
84f4af5 to
33401f4
Compare
|
@frreiss, this PR will require your review when you get a chance. Thank you. I think the main part is to make sure I added support for req-check and certainty correctly. I think I have previously mentioned the versioning checks to you. |
planetf1
left a comment
There was a problem hiding this comment.
Two tests need @pytest.mark.integration; stale last_validated_commit SHAs on the four new entries need verifying before merge.
| int(os.environ.get("CICD", 0)) == 1, | ||
| reason="Don't cause CICD pipelines to fail due to adapter version changes alone.", | ||
| ) | ||
| @pytest.mark.huggingface |
There was a problem hiding this comment.
huggingface is reserved for GPU/transformers tests and isn't in conftest's _NON_UNIT, so without a tier marker this test auto-becomes unit despite making live HF Hub API calls. integration is the right tier — real external boundary, no GPU needed.
| @pytest.mark.huggingface | |
| @pytest.mark.integration |
| the expected output | ||
| """ | ||
| cfg = yaml_json_combo_no_alora | ||
| _xfail_if_drifted(cfg) |
There was a problem hiding this comment.
_xfail_if_drifted makes a live HF Hub API call on first use per session. test_canned_input has no tier marker so auto-becomes unit. Add @pytest.mark.integration to the function.
|
|
||
| # Same cases as test_canned_input | ||
| cfg = yaml_json_combo_with_lora_model | ||
| _xfail_if_drifted(cfg) |
There was a problem hiding this comment.
Same as test_canned_input — _xfail_if_drifted makes a live HF Hub call but this test auto-becomes unit. Add @pytest.mark.integration.
| inputs_file=_INPUT_JSON_DIR / "requirement_check.json", | ||
| task="requirement-check", | ||
| repo_id="ibm-granite/granitelib-core-r1.0", | ||
| last_validated_commit="6b9a42d5e23364b3aca0ae334fbbea57c510623a", |
There was a problem hiding this comment.
The recorded SHA 6b9a42d5 is already behind current main on granitelib-core-r1.0 — verified against live HF Hub:
requirement-check/granite-4.1-3b/{lora,alora}→d0a2a96auncertainty/granite-4.1-3b/{lora,alora}→1e568b00
All four entries will immediately xfail on first run. Were the canned outputs generated against the current adapter? If so, update last_validated_commit to the current SHAs.
There was a problem hiding this comment.
Yes. I will update these when the PR is fully reviewed and before merging. There has been some turnover on these repos.
|
|
||
| # Explicitly don't check drift here. Ollama models don't have their own yaml combo | ||
| # that we can track. | ||
| # _xfail_if_drifted(cfg) |
There was a problem hiding this comment.
NIT: the prose comment above already explains why drift isn't checked here — dead code, can be removed.
There was a problem hiding this comment.
I disagree. I'd like to keep this comment; the yamls being ollama specific doesn't indicate that drift can't be detected with the ollama models.
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
d28c189 to
f27a5d9
Compare
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
f27a5d9 to
34ff589
Compare
frreiss
left a comment
There was a problem hiding this comment.
This approach is ok for now and will unblock CI, but longer term it would be a good idea to have a procedure in place to allow for independent development of Mellea the HF repositories that depend on it. See comments.
| def _adapter_subpath(cfg: YamlJsonCombo) -> str: | ||
| """Return the Hugging Face Hub subpath where ``cfg``'s adapter lives. | ||
|
|
||
| Mirrors the layout logic in | ||
| ``mellea.formatters.granite.intrinsics.util.obtain_lora()``. | ||
| """ | ||
| model_name = BASE_MODEL_TO_CANONICAL_NAME.get(cfg.base_model_id, cfg.base_model_id) | ||
| lora_str = "alora" if cfg.is_alora else "lora" | ||
| if cfg.repo_id in OLD_LAYOUT_REPOS: | ||
| return f"{cfg.task}/{lora_str}/{model_name}" | ||
| return f"{cfg.task}/{model_name}/{lora_str}" |
There was a problem hiding this comment.
Please factor this path generation functionality out of mellea.formatters.granite.intrinsics.util.obtain_lora() into a separate function, then call that function from this file. If these paths are computed in two places in the source tree, one of those places will get out of sync with the other.
There was a problem hiding this comment.
Will fix. Thank you!
| pytest.xfail( | ||
| f"Adapter at {cfg.repo_id}/{adapter_subpath} drifted from " | ||
| f"recorded {cfg.last_validated_commit[:8]} to {current[:8]}. The change " | ||
| f"may be nonfunctional (e.g. a README edit) — if this test still passes " | ||
| f"as XPASS, you may be able to simply bump the `last_validated_commit`. " | ||
| f"Otherwise refresh the canned outputs once the new adapter is verified." | ||
| ) |
There was a problem hiding this comment.
Checking for drift this way will get your CI unblocked, but it won't do much to solve the core problem.
At any point in time, there are many Mellea tests that xfail. Developers have trained themselves to ignore xfail messages. Unless we have some sort of automated audit process that checks the CI logs for this specific error message, the drift-checking code here will have no effect. People will upload changes to the HF repositories that break the tests in this file, and no one will notice.
I would recommend tagging point releases of the HF repositories and having Mellea pull a specific point release for each repo unless the user explicitly says to do otherwise. Then have test cases cover whatever whatever point release is currently the default. Every time we change Mellea's default point release for a HF repo, we can make whatever changes are necessary to make the tests pass again.
There was a problem hiding this comment.
Agreed. There is a proposal elsewhere for versioning intrinsics / adapters in Mellea. This PR was created before that and is intended to keep us working until that larger refactor. I still think this is a worthwhile effort to determine if adapters have breaking changes. We will see these failures in our nightly runs.
| reason="Don't cause CICD pipelines to fail due to adapter version changes alone.", | ||
| ) | ||
| @pytest.mark.integration | ||
| def test_adapter_versions_unchanged(): |
There was a problem hiding this comment.
Checking for HF repository drift this way will unblock your CI, but it won't solve the core problem.
Most Mellea developers don't run the entire test suite, instead relying on the CI server. People who push updates to the HF repositories generally don't run any Mellea tests.
Here's what will happen if this test case is left in as-is:
- Someone will push a change to a HF repository. This could be an innocuous change like updating a README file, or it could be a change that causes Mellea to crash.
- This test case will immediately break.
- No one will notice that this test case is broken.
- Days or weeks later, another developer who is in the middle of pushing a significant change to Mellea will attempt to run the entire test suite. This test case will fail.
- The developer, who is probably not a component owner for
formatters, will spend time figuring out what is going on. - If the developer is feeling charitable, he will move forward the pinned commit hashes higher up in this file, fix any problems that arise in the tests, and add those changes to his unrelated commit.
- If the developer is in a hurry, he will ignore the fact that this test is failing. If the original change was a breaking change, there will be an undetected regression in Mellea.
There was a problem hiding this comment.
Agreed. Our nightlies are currently checked every day by Mellea developers; but we should have better mechanisms in place. We will have a versioning proposal.
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
44969b4
Misc PR
Type of PR
Description
Link to Issue: Fixes update intrinsic tests #1029
Adds back the uncertainty and requirement-check tests
Adds last_validated_commit for adapters we test for so that we can catch future version changes in our nightlies
Testing
Attribution