fix: intrinsic tests and add some safeguards for future adapters changes by jakelorocco · Pull Request #1078 · generative-computing/mellea

jakelorocco · 2026-05-14T20:47:42Z

Misc PR

Type of PR

Bug Fix
New Feature
Documentation
Other

Description

Link to Issue: Fixes update intrinsic tests #1029
Adds back the uncertainty and requirement-check tests
Adds last_validated_commit to the adapters we test so that our nightlies can catch future version changes

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code was added
Ensure existing tests and github automation pass (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

AI coding assistants used

…nsics Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

github-actions · 2026-05-14T20:47:55Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

…ests; small nits Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

jakelorocco · 2026-05-18T18:33:01Z

@frreiss, this PR will require your review when you get a chance. Thank you. I think the main part is to make sure I added support for req-check and certainty correctly. I think I have previously mentioned the versioning checks to you.

planetf1

Two tests need @pytest.mark.integration; stale last_validated_commit SHAs on the four new entries need verifying before merge.

planetf1 · 2026-05-22T09:48:07Z

+    int(os.environ.get("CICD", 0)) == 1,
+    reason="Don't cause CICD pipelines to fail due to adapter version changes alone.",
+)
+@pytest.mark.huggingface


huggingface is reserved for GPU/transformers tests and isn't in conftest's _NON_UNIT, so without a tier marker this test auto-becomes unit despite making live HF Hub API calls. integration is the right tier — real external boundary, no GPU needed.

Suggested change

@pytest.mark.huggingface

@pytest.mark.integration
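The tiering behaviour described in this comment can be sketched as follows. This is a minimal illustration, assuming conftest assigns the `unit` tier to any test that carries no marker listed in `_NON_UNIT`. The contents of the set and the `resolve_tier` helper are hypothetical and not taken from the repository.

```python
# Hypothetical stand-in for conftest's _NON_UNIT; the real set may hold more tiers.
_NON_UNIT = frozenset({"integration"})


def resolve_tier(markers):
    """Return the tier a test lands in, given the names of its markers."""
    explicit = sorted(set(markers) & _NON_UNIT)
    if explicit:
        return explicit[0]
    # No tier marker present: the test is treated as a unit test.
    return "unit"


# `huggingface` alone is not a tier marker, so the test falls through to unit.
tier_before = resolve_tier({"huggingface"})
# With the suggested change the test is classified as integration.
tier_after = resolve_tier({"integration"})
```

Under these assumptions, swapping the marker is what moves the test out of the unit tier.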

planetf1 · 2026-05-22T09:48:08Z

    the expected output
    """
    cfg = yaml_json_combo_no_alora
+    _xfail_if_drifted(cfg)


_xfail_if_drifted makes a live HF Hub API call on first use per session. test_canned_input has no tier marker so auto-becomes unit. Add @pytest.mark.integration to the function.

planetf1 · 2026-05-22T09:48:08Z


    # Same cases as test_canned_input
    cfg = yaml_json_combo_with_lora_model
+    _xfail_if_drifted(cfg)


Same as test_canned_input — _xfail_if_drifted makes a live HF Hub call but this test auto-becomes unit. Add @pytest.mark.integration.
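A rough sketch of why only the first use per session reaches the network, assuming the lookup is memoized per adapter path. `current_commit`, `_fetch_current_commit`, and the returned SHA are placeholders for illustration; the real helper may cache differently.

```python
from functools import lru_cache

# Records each simulated network call so the caching effect is visible.
NETWORK_CALLS = []


def _fetch_current_commit(repo_id: str, subpath: str) -> str:
    """Hypothetical stand-in for the live HF Hub lookup."""
    NETWORK_CALLS.append((repo_id, subpath))
    return "d0a2a96a"  # placeholder value, not fetched from the Hub


@lru_cache(maxsize=None)
def current_commit(repo_id: str, subpath: str) -> str:
    """Return the latest commit for an adapter path, cached per process."""
    return _fetch_current_commit(repo_id, subpath)


repo = "ibm-granite/granitelib-core-r1.0"
path = "requirement-check/granite-4.1-3b/lora"
first = current_commit(repo, path)   # triggers the simulated lookup
second = current_commit(repo, path)  # served from the cache
```

Even with caching, the first call still crosses an external boundary, which is the reason given for the integration tier.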

planetf1 · 2026-05-22T09:48:08Z

+        inputs_file=_INPUT_JSON_DIR / "requirement_check.json",
+        task="requirement-check",
+        repo_id="ibm-granite/granitelib-core-r1.0",
+        last_validated_commit="6b9a42d5e23364b3aca0ae334fbbea57c510623a",


The recorded SHA 6b9a42d5 is already behind current main on granitelib-core-r1.0 — verified against live HF Hub:

requirement-check/granite-4.1-3b/{lora,alora} → d0a2a96a

uncertainty/granite-4.1-3b/{lora,alora} → 1e568b00

All four entries will immediately xfail on first run. Were the canned outputs generated against the current adapter? If so, update last_validated_commit to the current SHAs.

Yes. I will update these when the PR is fully reviewed and before merging. There has been some turnover on these repos.
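The comparison in this thread can be sketched as a plain dictionary diff. This is an illustration only: it assumes all four new entries recorded the same SHA (the thread shows it for one of them), uses the abbreviated SHAs quoted above, and takes the current commits as input rather than querying the Hub. `drifted_entries` is a hypothetical helper name.

```python
def drifted_entries(recorded: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return adapter subpaths whose current commit differs from the recorded one."""
    return sorted(path for path, sha in recorded.items() if current.get(path) != sha)


# Current SHAs as reported in the review comment (abbreviated to 8 characters).
CURRENT = {
    "requirement-check/granite-4.1-3b/lora": "d0a2a96a",
    "requirement-check/granite-4.1-3b/alora": "d0a2a96a",
    "uncertainty/granite-4.1-3b/lora": "1e568b00",
    "uncertainty/granite-4.1-3b/alora": "1e568b00",
}

# Assumption: every new entry was recorded with the stale SHA from the diff.
RECORDED = {path: "6b9a42d5" for path in CURRENT}

stale = drifted_entries(RECORDED, CURRENT)
```

With these inputs every entry is reported as drifted, matching the reviewer's observation that all four would xfail on first run.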

planetf1 · 2026-05-22T09:48:08Z

+
+    # Explicitly don't check drift here. Ollama models don't have their own yaml combo
+    # that we can track.
+    # _xfail_if_drifted(cfg)


NIT: the prose comment above already explains why drift isn't checked here — dead code, can be removed.

I disagree. I'd like to keep this comment; the yamls being ollama specific doesn't indicate that drift can't be detected with the ollama models.

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

frreiss

This approach is ok for now and will unblock CI, but longer term it would be a good idea to have a procedure in place to allow for independent development of Mellea and the HF repositories that depend on it. See comments.

frreiss · 2026-05-26T22:06:56Z

+def _adapter_subpath(cfg: YamlJsonCombo) -> str:
+    """Return the Hugging Face Hub subpath where ``cfg``'s adapter lives.
+
+    Mirrors the layout logic in
+    ``mellea.formatters.granite.intrinsics.util.obtain_lora()``.
+    """
+    model_name = BASE_MODEL_TO_CANONICAL_NAME.get(cfg.base_model_id, cfg.base_model_id)
+    lora_str = "alora" if cfg.is_alora else "lora"
+    if cfg.repo_id in OLD_LAYOUT_REPOS:
+        return f"{cfg.task}/{lora_str}/{model_name}"
+    return f"{cfg.task}/{model_name}/{lora_str}"


Please factor this path generation functionality out of mellea.formatters.granite.intrinsics.util.obtain_lora() into a separate function, then call that function from this file. If these paths are computed in two places in the source tree, one of those places will get out of sync with the other.

Will fix. Thank you!
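One possible shape for the shared helper requested here, shown as a sketch. The layout logic is copied from the diff above; the function name, its signature, and the idea of passing `old_layout` as a flag are assumptions, not the refactor that was eventually merged.

```python
def adapter_subpath(task: str, model_name: str, is_alora: bool, old_layout: bool) -> str:
    """Build the Hub subpath for an adapter.

    Intended as the single place where the layout is defined, so that the
    loader and the tests cannot drift apart.
    """
    lora_str = "alora" if is_alora else "lora"
    if old_layout:
        # Older repositories place the adapter type before the model name.
        return f"{task}/{lora_str}/{model_name}"
    return f"{task}/{model_name}/{lora_str}"


new_style = adapter_subpath("requirement-check", "granite-4.1-3b", False, False)
old_style = adapter_subpath("requirement-check", "granite-4.1-3b", False, True)
```

Both `obtain_lora()` and the test module could then call the same function instead of each computing the path on its own.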

frreiss · 2026-05-26T22:12:53Z

+    pytest.xfail(
+        f"Adapter at {cfg.repo_id}/{adapter_subpath} drifted from "
+        f"recorded {cfg.last_validated_commit[:8]} to {current[:8]}. The change "
+        f"may be nonfunctional (e.g. a README edit) — if this test still passes "
+        f"as XPASS, you may be able to simply bump the `last_validated_commit`. "
+        f"Otherwise refresh the canned outputs once the new adapter is verified."
+    )


Checking for drift this way will get your CI unblocked, but it won't do much to solve the core problem.

At any point in time, there are many Mellea tests that xfail. Developers have trained themselves to ignore xfail messages. Unless we have some sort of automated audit process that checks the CI logs for this specific error message, the drift-checking code here will have no effect. People will upload changes to the HF repositories that break the tests in this file, and no one will notice.

I would recommend tagging point releases of the HF repositories and having Mellea pull a specific point release for each repo unless the user explicitly says to do otherwise. Then have test cases cover whatever point release is currently the default. Every time we change Mellea's default point release for a HF repo, we can make whatever changes are necessary to make the tests pass again.

Agreed. There is a proposal elsewhere for versioning intrinsics / adapters in Mellea. This PR was created before that and is intended to keep us working until that larger refactor. I still think this is a worthwhile effort to determine if adapters have breaking changes. We will see these failures in our nightly runs.
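The pinning scheme recommended in this thread could look roughly like the sketch below. Everything here is hypothetical: the `DEFAULT_REVISIONS` table, the `v1.0.0` tag, the fallback to `main`, and the `resolve_revision` helper are illustrations, not part of Mellea or of the versioning proposal mentioned above.

```python
from typing import Optional

# Hypothetical table of default point releases, one per HF repository.
DEFAULT_REVISIONS = {
    "ibm-granite/granitelib-core-r1.0": "v1.0.0",  # made-up tag for illustration
}


def resolve_revision(repo_id: str, user_revision: Optional[str] = None) -> str:
    """Pick the revision to download: an explicit user choice wins, else the pinned default."""
    if user_revision is not None:
        return user_revision
    # Assumption: repositories without a pinned release fall back to the main branch.
    return DEFAULT_REVISIONS.get(repo_id, "main")


pinned = resolve_revision("ibm-granite/granitelib-core-r1.0")
overridden = resolve_revision("ibm-granite/granitelib-core-r1.0", "main")
```

The resolved value could then be passed as the `revision` argument when downloading from the Hub, and the canned test outputs would only need refreshing when an entry in the table changes.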

frreiss · 2026-05-26T22:25:23Z

+    reason="Don't cause CICD pipelines to fail due to adapter version changes alone.",
+)
+@pytest.mark.integration
+def test_adapter_versions_unchanged():


Checking for HF repository drift this way will unblock your CI, but it won't solve the core problem.

Most Mellea developers don't run the entire test suite, instead relying on the CI server. People who push updates to the HF repositories generally don't run any Mellea tests.

Here's what will happen if this test case is left in as-is:

Someone will push a change to a HF repository. This could be an innocuous change like updating a README file, or it could be a change that causes Mellea to crash.

This test case will immediately break.

No one will notice that this test case is broken.

Days or weeks later, another developer who is in the middle of pushing a significant change to Mellea will attempt to run the entire test suite. This test case will fail.

The developer, who is probably not a component owner for formatters, will spend time figuring out what is going on.

If the developer is feeling charitable, he will move forward the pinned commit hashes higher up in this file, fix any problems that arise in the tests, and add those changes to his unrelated commit.

If the developer is in a hurry, he will ignore the fact that this test is failing. If the original change was a breaking change, there will be an undetected regression in Mellea.

Agreed. Our nightlies are currently checked every day by Mellea developers, but we should have better mechanisms in place. We will have a versioning proposal.

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco added 2 commits May 14, 2026 13:36

test: add test cases back for requirement-check and uncertainty intri…

bbe50d3

…nsics Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

test: add test for adapter version changes

7d5b0aa

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

github-actions Bot added the bug Something isn't working label May 14, 2026

jakelorocco linked an issue May 14, 2026 that may be closed by this pull request

update intrinsic tests #1029

Closed

fix: don't mutate yaml cfg combos for ollama in intrinsic formatter t…

33401f4

…ests; small nits Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

jakelorocco force-pushed the test/fix-intrinsic-tests branch from 84f4af5 to 33401f4 Compare May 18, 2026 17:57

jakelorocco marked this pull request as ready for review May 18, 2026 18:32

jakelorocco requested a review from a team as a code owner May 18, 2026 18:32

jakelorocco requested a review from frreiss May 18, 2026 18:33

jakelorocco changed the title ~~fix: intrinsic tests and add some safeguards for future adapters changes~~ fix: intrinsic tests and add some safeguards for future adapters changes May 18, 2026

jakelorocco added bug Something isn't working and removed bug Something isn't working labels May 18, 2026

planetf1 reviewed May 22, 2026


fix: pr comments

bae9965

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco force-pushed the test/fix-intrinsic-tests branch from d28c189 to f27a5d9 Compare May 26, 2026 15:03

fix: updated adapter shas and python constant

34ff589

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco force-pushed the test/fix-intrinsic-tests branch from f27a5d9 to 34ff589 Compare May 26, 2026 15:37

frreiss approved these changes May 26, 2026


fix: refactor shared code paths

3ff7216

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco added this pull request to the merge queue May 27, 2026

Merged via the queue into generative-computing:main with commit 44969b4 May 27, 2026
9 checks passed

jakelorocco deleted the test/fix-intrinsic-tests branch May 27, 2026 19:58
