Skip to content

fix: intrinsic tests and add some safeguards for future adapters changes #1078

Merged
jakelorocco merged 6 commits into
generative-computing:mainfrom
jakelorocco:test/fix-intrinsic-tests
May 27, 2026
Merged

fix: intrinsic tests and add some safeguards for future adapters changes #1078
jakelorocco merged 6 commits into
generative-computing:mainfrom
jakelorocco:test/fix-intrinsic-tests

Conversation

@jakelorocco

@jakelorocco jakelorocco commented May 14, 2026

Copy link
Copy Markdown
Contributor

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

  • Link to Issue: Fixes update intrinsic tests #1029

  • Adds back the uncertainty and requirement-check tests

  • Adds last_validated_commit for adapters we test for so that we can catch future version changes in our nightlies

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code as added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

…nsics

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Assisted-by: CLAUDE:OPUS
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Assisted-by: CLAUDE:OPUS
@github-actions github-actions Bot added the bug Something isn't working label May 14, 2026
@github-actions

Copy link
Copy Markdown
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@jakelorocco jakelorocco linked an issue May 14, 2026 that may be closed by this pull request
…ests; small nits

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Assisted-by: CLAUDE:OPUS
@jakelorocco jakelorocco force-pushed the test/fix-intrinsic-tests branch from 84f4af5 to 33401f4 Compare May 18, 2026 17:57
@jakelorocco jakelorocco marked this pull request as ready for review May 18, 2026 18:32
@jakelorocco jakelorocco requested a review from a team as a code owner May 18, 2026 18:32
@jakelorocco

jakelorocco commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

@frreiss, this PR will require your review when you get a chance. Thank you. I think the main part is to make sure I added support for req-check and certainty correctly. I think I have previously mentioned the versioning checks to you.

@jakelorocco jakelorocco requested a review from frreiss May 18, 2026 18:33
@jakelorocco jakelorocco changed the title fix: intrinsic tests and add some safeguards for future adapters changes fix: intrinsic tests and add some safeguards for future adapters changes May 18, 2026
@jakelorocco jakelorocco added bug Something isn't working and removed bug Something isn't working labels May 18, 2026

@planetf1 planetf1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tests need @pytest.mark.integration; stale last_validated_commit SHAs on the four new entries need verifying before merge.

int(os.environ.get("CICD", 0)) == 1,
reason="Don't cause CICD pipelines to fail due to adapter version changes alone.",
)
@pytest.mark.huggingface

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huggingface is reserved for GPU/transformers tests and isn't in conftest's _NON_UNIT, so without a tier marker this test auto-becomes unit despite making live HF Hub API calls. integration is the right tier — real external boundary, no GPU needed.

Suggested change
@pytest.mark.huggingface
@pytest.mark.integration

the expected output
"""
cfg = yaml_json_combo_no_alora
_xfail_if_drifted(cfg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_xfail_if_drifted makes a live HF Hub API call on first use per session. test_canned_input has no tier marker so auto-becomes unit. Add @pytest.mark.integration to the function.


# Same cases as test_canned_input
cfg = yaml_json_combo_with_lora_model
_xfail_if_drifted(cfg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as test_canned_input_xfail_if_drifted makes a live HF Hub call but this test auto-becomes unit. Add @pytest.mark.integration.

inputs_file=_INPUT_JSON_DIR / "requirement_check.json",
task="requirement-check",
repo_id="ibm-granite/granitelib-core-r1.0",
last_validated_commit="6b9a42d5e23364b3aca0ae334fbbea57c510623a",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recorded SHA 6b9a42d5 is already behind current main on granitelib-core-r1.0 — verified against live HF Hub:

  • requirement-check/granite-4.1-3b/{lora,alora}d0a2a96a
  • uncertainty/granite-4.1-3b/{lora,alora}1e568b00

All four entries will immediately xfail on first run. Were the canned outputs generated against the current adapter? If so, update last_validated_commit to the current SHAs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will update these when the PR is fully reviewed and before merging. There has been some turnover on these repos.


# Explicitly don't check drift here. Ollama models don't have their own yaml combo
# that we can track.
# _xfail_if_drifted(cfg)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: the prose comment above already explains why drift isn't checked here — dead code, can be removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. I'd like to keep this comment; the yamls being ollama specific doesn't indicate that drift can't be detected with the ollama models.

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
@jakelorocco jakelorocco force-pushed the test/fix-intrinsic-tests branch from d28c189 to f27a5d9 Compare May 26, 2026 15:03
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
@jakelorocco jakelorocco force-pushed the test/fix-intrinsic-tests branch from f27a5d9 to 34ff589 Compare May 26, 2026 15:37

@frreiss frreiss left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is ok for now and will unblock CI, but longer term it would be a good idea to have a procedure in place to allow for independent development of Mellea the HF repositories that depend on it. See comments.

Comment on lines +427 to +437
def _adapter_subpath(cfg: YamlJsonCombo) -> str:
"""Return the Hugging Face Hub subpath where ``cfg``'s adapter lives.

Mirrors the layout logic in
``mellea.formatters.granite.intrinsics.util.obtain_lora()``.
"""
model_name = BASE_MODEL_TO_CANONICAL_NAME.get(cfg.base_model_id, cfg.base_model_id)
lora_str = "alora" if cfg.is_alora else "lora"
if cfg.repo_id in OLD_LAYOUT_REPOS:
return f"{cfg.task}/{lora_str}/{model_name}"
return f"{cfg.task}/{model_name}/{lora_str}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please factor this path generation functionality out of mellea.formatters.granite.intrinsics.util.obtain_lora() into a separate function, then call that function from this file. If these paths are computed in two places in the source tree, one of those places will get out of sync with the other.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. Thank you!

Comment on lines +496 to +502
pytest.xfail(
f"Adapter at {cfg.repo_id}/{adapter_subpath} drifted from "
f"recorded {cfg.last_validated_commit[:8]} to {current[:8]}. The change "
f"may be nonfunctional (e.g. a README edit) — if this test still passes "
f"as XPASS, you may be able to simply bump the `last_validated_commit`. "
f"Otherwise refresh the canned outputs once the new adapter is verified."
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for drift this way will get your CI unblocked, but it won't do much to solve the core problem.

At any point in time, there are many Mellea tests that xfail. Developers have trained themselves to ignore xfail messages. Unless we have some sort of automated audit process that checks the CI logs for this specific error message, the drift-checking code here will have no effect. People will upload changes to the HF repositories that break the tests in this file, and no one will notice.

I would recommend tagging point releases of the HF repositories and having Mellea pull a specific point release for each repo unless the user explicitly says to do otherwise. Then have test cases cover whatever whatever point release is currently the default. Every time we change Mellea's default point release for a HF repo, we can make whatever changes are necessary to make the tests pass again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. There is a proposal elsewhere for versioning intrinsics / adapters in Mellea. This PR was created before that and is intended to keep us working until that larger refactor. I still think this is a worthwhile effort to determine if adapters have breaking changes. We will see these failures in our nightly runs.

reason="Don't cause CICD pipelines to fail due to adapter version changes alone.",
)
@pytest.mark.integration
def test_adapter_versions_unchanged():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for HF repository drift this way will unblock your CI, but it won't solve the core problem.

Most Mellea developers don't run the entire test suite, instead relying on the CI server. People who push updates to the HF repositories generally don't run any Mellea tests.

Here's what will happen if this test case is left in as-is:

  • Someone will push a change to a HF repository. This could be an innocuous change like updating a README file, or it could be a change that causes Mellea to crash.
  • This test case will immediately break.
  • No one will notice that this test case is broken.
  • Days or weeks later, another developer who is in the middle of pushing a significant change to Mellea will attempt to run the entire test suite. This test case will fail.
  • The developer, who is probably not a component owner for formatters, will spend time figuring out what is going on.
  • If the developer is feeling charitable, he will move forward the pinned commit hashes higher up in this file, fix any problems that arise in the tests, and add those changes to his unrelated commit.
  • If the developer is in a hurry, he will ignore the fact that this test is failing. If the original change was a breaking change, there will be an undetected regression in Mellea.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Our nightlies are currently checked every day by Mellea developers; but we should have better mechanisms in place. We will have a versioning proposal.

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
@jakelorocco jakelorocco added this pull request to the merge queue May 27, 2026
Merged via the queue into generative-computing:main with commit 44969b4 May 27, 2026
9 checks passed
@jakelorocco jakelorocco deleted the test/fix-intrinsic-tests branch May 27, 2026 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

update intrinsic tests

3 participants