feat: consolidate to llguidance from xgrammar by jakelorocco · Pull Request #1077 · generative-computing/mellea

jakelorocco · 2026-05-14T18:45:43Z

Misc PR

Type of PR

Bug Fix
New Feature
Documentation
Other

Description

Link to Issue: Fixes consolidate between llguidance and xgrammar #1004

Moves the granite formatters onto llguidance. There wasn't a strong reason to keep xgrammar, since both libraries work. All intrinsic tests continue to pass.

Moved _GuidanceLogitsProcessor to util so that it could be imported by the HuggingFace backend and not live in two spots.

Also fixed one spot in the util function where model was being used even though it could've been None.

Testing

Tests added to the respective file if code was changed
New code has 100% coverage if code is added
Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Attribution

AI coding assistants used

… instead of xgrammar Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

github-actions · 2026-05-14T18:46:01Z

The PR description has been updated. Please fill out the template for your PR to be reviewed.

ajbozarth

On a quick read-through this looks good. Claude found a handful of small items, but nothing big:

ajbozarth

small nit, otherwise LGTM. Though it may be worth getting a pair of eyes that know the affected code a bit better to double check

akihikokuroda · 2026-05-22T00:02:25Z

There are some observations, but nothing major.

  1. Import style inconsistency (minor)
    - Line 131: import_optional is used inside the __call__ method but not at module level
    - Consider whether llguidance could be imported at module load time (if always available for granite formatters) or if lazy loading is intentional for performance
  2. Docstring incompleteness
    - _GuidanceLogitsProcessor.__init__ has a docstring but the class docstring says "Apply... to batch_scores" — it should clarify that it's a logits processor callback
    - Current: "A HuggingFace logits processor that enforces an llguidance grammar." (good enough, but could add a line about usage context)
  3. Error message ambiguity (line 305)
    - Old: Two separate error checks for missing tokenizer and model
    - New: One unified error mentioning both
    - The new error at line 333 requires both when constrained_decoding_prefix is used, but this is only enforced inside the nested conditional. Consider lifting this validation earlier or documenting the dependency more clearly.
  4. Type annotation edge case
    - Line 291: tokenizer: PreTrainedTokenizerBase | None = None but at line 311 it assumes tokenizer is not None without a guard when computing vocab_size. Guard added at line 312 (if ll_tokenizer is None) mitigates this, but the logic flow could be clearer.
  5. Testing coverage not shown
    - Diff doesn't include test updates. PR says tests pass, but verify:
      - Unit tests for _GuidanceLogitsProcessor in the new location
      - Integration tests with ll_tokenizer parameter
      - Edge case: what happens if both tokenizer and ll_tokenizer are None and constrained decoding is requested?

akihikokuroda

LGTM

planetf1

Reviewed with Claude Code

planetf1 · 2026-05-26T11:39:13Z

+                self._tokenizer,
+                self._model,
+                ll_tokenizer=self._llguidance_tokenizer,
            )


The util.py fallback path (line 313–321) has a good comment explaining why n_vocab matters — llguidance defaults to the tokeniser's reported size, which can be smaller than model.vocab_size on models with resized embeddings. But constructing _llguidance_tokenizer here without n_vocab means that guard is bypassed whenever the pre-built instance is passed through. Worth noting the old xgrammar path did compute max(tokenizer.vocab_size, len(tokenizer), model.vocab_size) explicitly, so this is a regression for that case.

Possible suggestion —

    n_vocab = max(self._tokenizer.vocab_size, len(self._tokenizer), self._model.vocab_size)
    self._llguidance_tokenizer = llguidance.hf.from_tokenizer(self._tokenizer, n_vocab=n_vocab)

Added the preemptive fix to this section of the code as well.
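
For readers following the vocabulary-size point in this thread, here is a minimal sketch of the max(...) computation from the suggestion. It uses stand-in classes so it runs without transformers or llguidance. FakeTokenizer, FakeModel, and resolve_n_vocab are hypothetical names and the sizes are made up; only the max(...) expression comes from the comment above.

```python
from dataclasses import dataclass


@dataclass
class FakeTokenizer:
    """Hypothetical stand-in for a HuggingFace tokenizer."""

    vocab_size: int    # base vocabulary, excluding added tokens
    added_tokens: int  # tokens added on top of the base vocabulary

    def __len__(self) -> int:
        # len(tokenizer) counts the base vocabulary plus added tokens.
        return self.vocab_size + self.added_tokens


@dataclass
class FakeModel:
    """Hypothetical stand-in for a model whose embeddings were resized."""

    vocab_size: int


def resolve_n_vocab(tokenizer: FakeTokenizer, model: FakeModel) -> int:
    # Take the largest of the three sizes so a token mask covers every logit column.
    return max(tokenizer.vocab_size, len(tokenizer), model.vocab_size)


tokenizer = FakeTokenizer(vocab_size=49152, added_tokens=3)
model = FakeModel(vocab_size=49160)  # embeddings padded past the tokenizer size
n_vocab = resolve_n_vocab(tokenizer, model)
print(n_vocab)
```

In this made-up case the model's size wins, which is the situation the comment describes: a mask sized from the tokenizer alone would be narrower than the scores tensor.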

planetf1 · 2026-05-26T11:42:27Z

+    ) -> torch.Tensor:
+        """Apply the grammar's allowed-token bitmask to ``batch_scores`` in place."""
+        with import_optional("llguidance"):
+            import llguidance


By the time __call__ fires, __init__ has already received a live llguidance.LLTokenizer, so llguidance is guaranteed importable — the import_optional guard here is effectively dead code and adds context-manager overhead on every generated token. Would be cleaner to move the runtime import into __init__ (or do it once at the top of the function without the wrapper).

Moved the optional import to the init function. I think it's still worth having a definitive check even though the caller will have almost certainly imported the necessary libraries if utilizing this class.
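
The pattern agreed on here (resolve the optional dependency once at construction, keep the per-token call free of import machinery) can be sketched as follows. This is an illustration only: OptionalDependencyProcessor is a hypothetical class, the standard-library json module stands in for llguidance, and importlib is used in place of the project's import_optional helper, whose exact behavior is not shown in this thread.

```python
import importlib


class OptionalDependencyProcessor:
    """Hypothetical callable that resolves an optional dependency once."""

    def __init__(self, module_name: str = "json") -> None:
        try:
            # Fail fast at construction instead of on the first generated token.
            self._module = importlib.import_module(module_name)
        except ImportError as exc:
            raise ImportError(
                f"{module_name!r} is required for this processor; install it first"
            ) from exc

    def __call__(self, payload: dict) -> str:
        # Hot path: reuse the cached module, no import or context manager here.
        return self._module.dumps(payload)


processor = OptionalDependencyProcessor()
encoded = processor({"a": 1})
print(encoded)
```

Keeping the check in the constructor preserves the definitive failure message mentioned in the reply while removing the per-call overhead raised in the review.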

planetf1 · 2026-05-26T11:43:44Z

-        model: HuggingFace model object. Only required if the request uses constrained
-            decoding.
+        tokenizer: HuggingFace tokenizer. Required for constrained decoding unless
+            ``ll_tokenizer`` is provided, and required when ``constrained_decoding_prefix``


Small doc nit: tokenizer is required unconditionally — apply_chat_template is called on it regardless of whether ll_tokenizer is provided (line 242, with # type: ignore[union-attr] suppressing mypy). The current wording implies it can be omitted in the non-constrained path, which would give a confusing AttributeError at a different call site.

Fixed. Requires the tokenizer now in the function signature.

planetf1 · 2026-05-26T11:45:14Z

    # on the right device.
-    input_tokens = input_tokens.to(model.device)  # type: ignore[union-attr]
+    if model is not None:
+        input_tokens = input_tokens.to(model.device)


This prevents the unconditional model.device crash, which is a real improvement. One thing to note: if model is None and the caller eventually hits generate_with_transformers, the tokens will be on CPU and the error there will be harder to diagnose. Might be worth pairing this with an explicit check (or at least a comment noting that model is required for generation even if optional here).

I will just make model unconditionally required (same as the tokenizer fix above). I think this was an oversight in the original code.

planetf1 · 2026-05-26T11:47:40Z

test_base_util.py currently only covers find_substring_in_text. The new code paths added here — ll_tokenizer parameter, _GuidanceLogitsProcessor, and the two new ValueError guards — aren't tested. At minimum, three mockable unit tests would be useful:

- tokenizer is None and ll_tokenizer is None → ValueError
- constrained_decoding_prefix set but tokenizer or model is None → ValueError
- Passing a pre-built ll_tokenizer skips llguidance.hf.from_tokenizer (verifies the performance contract of the new parameter)

Happy to treat this as a follow-up rather than a blocker — just worth tracking.
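
A rough shape for the mockable tests listed above, using only unittest.mock. The function resolve_ll_tokenizer is a hypothetical stand-in for the tokenizer-resolution logic in util.py (the real function name and signature are not shown in this thread), and the converter is injected as a parameter so the sketch runs without llguidance installed.

```python
from unittest import mock


def resolve_ll_tokenizer(tokenizer, ll_tokenizer=None, from_tokenizer=None):
    """Hypothetical stand-in for the util function's tokenizer resolution."""
    if tokenizer is None and ll_tokenizer is None:
        raise ValueError("either tokenizer or ll_tokenizer must be provided")
    if ll_tokenizer is not None:
        return ll_tokenizer  # pre-built instance: skip the conversion
    return from_tokenizer(tokenizer)


def check_both_none_raises() -> bool:
    try:
        resolve_ll_tokenizer(None, None)
    except ValueError:
        return True
    return False


def check_prebuilt_skips_conversion() -> bool:
    convert = mock.Mock(name="from_tokenizer")
    prebuilt = object()
    result = resolve_ll_tokenizer(mock.Mock(), prebuilt, from_tokenizer=convert)
    convert.assert_not_called()  # the performance contract of the parameter
    return result is prebuilt


def check_fallback_converts_once() -> bool:
    convert = mock.Mock(return_value="converted")
    tokenizer = mock.Mock()
    result = resolve_ll_tokenizer(tokenizer, from_tokenizer=convert)
    convert.assert_called_once_with(tokenizer)
    return result == "converted"
```

In the real test file the converter would more likely be patched with mock.patch at its import location rather than injected; the injection here is only to keep the example self-contained.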

planetf1 · 2026-05-26T11:51:18Z

+        i_batch, _ = batch_input_ids.shape
+        s_batch, _ = batch_scores.shape
+        assert i_batch == s_batch
+


assert can be stripped with python -O and gives a bare AssertionError when it fires. A RuntimeError with a message would be more helpful here:

    if i_batch != s_batch:
        raise RuntimeError(f"batch size mismatch: input_ids={i_batch}, scores={s_batch}")
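
As a self-contained illustration of the suggestion above, the check can be exercised with plain shape tuples standing in for the tensors. check_batch_sizes is a hypothetical helper name; in the PR the comparison lives inside the logits processor's __call__.

```python
def check_batch_sizes(input_ids_shape: tuple, scores_shape: tuple) -> int:
    """Return the shared batch size, or raise with both sizes in the message."""
    i_batch, _ = input_ids_shape
    s_batch, _ = scores_shape
    # An explicit raise survives `python -O`, which strips assert statements.
    if i_batch != s_batch:
        raise RuntimeError(
            f"batch size mismatch: input_ids={i_batch}, scores={s_batch}"
        )
    return i_batch


batch = check_batch_sizes((2, 7), (2, 49160))

try:
    check_batch_sizes((2, 7), (3, 49160))
except RuntimeError as exc:
    message = str(exc)
```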

planetf1 · 2026-05-26T11:51:49Z

Nit: huggingface.py:265 — the assertion message ends with "... wtf?" which surfaces in user-facing stack traces when it fires. Something like f"vocab size mismatch: llguidance={self._llguidance_tokenizer.vocab_size} != tokenizer={self._tokenizer._tokenizer.get_vocab_size(with_added_tokens=True)}" would give more actionable info.

planetf1 · 2026-05-26T11:52:26Z



+# Modified from VLLM v0.9.2 code base
+# https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/model_executor/guided_decoding/guidance_logits_processors.py


Nit: the VLLM attribution comment is easy to miss sitting above a class in a mixed utility file — might be cleaner inside the class docstring so it travels with the class if it ever moves again.

jakelorocco

local tests passed; will wait for full ci tests

jakelorocco · 2026-05-26T13:13:58Z

    # on the right device.
-    input_tokens = input_tokens.to(model.device)  # type: ignore[union-attr]
+    if model is not None:
+        input_tokens = input_tokens.to(model.device)


I will just make model unconditionally required (same as the tokenizer fix above). I think this was an oversight in the original code.

jakelorocco · 2026-05-26T13:14:49Z

-        model: HuggingFace model object. Only required if the request uses constrained
-            decoding.
+        tokenizer: HuggingFace tokenizer. Required for constrained decoding unless
+            ``ll_tokenizer`` is provided, and required when ``constrained_decoding_prefix``


Fixed. Requires the tokenizer now in the function signature.

jakelorocco · 2026-05-26T13:17:57Z

+                self._tokenizer,
+                self._model,
+                ll_tokenizer=self._llguidance_tokenizer,
            )


Added the preemptive fix to this section of the code as well.

jakelorocco · 2026-05-26T13:18:41Z

+    ) -> torch.Tensor:
+        """Apply the grammar's allowed-token bitmask to ``batch_scores`` in place."""
+        with import_optional("llguidance"):
+            import llguidance


Moved the optional import to the init function. I think it's still worth having a definitive check even though the caller will have almost certainly imported the necessary libraries if utilizing this class.

ajbozarth · 2026-05-26T17:17:48Z

-    tokenizer: PreTrainedTokenizerBase | None = None,
-    model: PreTrainedModel | None = None,
+    tokenizer: PreTrainedTokenizerBase,
+    model: PreTrainedModel,


Type signature says tokenizer: PreTrainedTokenizerBase and model: PreTrainedModel (required, non-None), but the body still has if tokenizer is None and ll_tokenizer is None: (line ~313) and if tokenizer is None or model is None: (line ~338). Either revert to | None = None to match the runtime contract (where ll_tokenizer can stand in for tokenizer), or drop the None checks. As written, mypy and runtime disagree.

Thanks for catching this after my most recent update. I will fix this in the next patch. @frreiss, when you review this PR, can you please comment here? This function originally listed these parameters as optional even though they were required in the implementation. I moved towards forcing them to be not None (and if you agree, will fix the None checks in the function body).

If I recall correctly, older versions of this function were able to get by with either a model or a tokenizer, depending on what features the chat completion request enabled. With the current proliferation of corner-case-covering code, I think that making those parameters optional is no longer practical.
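
The two consistent options discussed in this thread can be sketched side by side. Both functions are hypothetical and use plain object types in place of PreTrainedTokenizerBase and PreTrainedModel; the point is only that the None guard is live code under an optional signature and unnecessary under a required one.

```python
from typing import Optional


def run_optional(
    tokenizer: Optional[object] = None,
    ll_tokenizer: Optional[object] = None,
) -> object:
    """Option A: parameters stay optional, so the runtime guard is needed."""
    if tokenizer is None and ll_tokenizer is None:
        raise ValueError("either tokenizer or ll_tokenizer must be provided")
    return ll_tokenizer if ll_tokenizer is not None else tokenizer


def run_required(tokenizer: object, model: object) -> tuple:
    """Option B: parameters are required, so no None guard is written."""
    return (tokenizer, model)


chosen = run_optional(ll_tokenizer="prebuilt")
pair = run_required("tok", "model")
```

Mixing the two (a required signature with None checks in the body) is the mismatch flagged above: a type checker treats the guards as unreachable while the runtime still evaluates them.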

frreiss · 2026-05-26T18:40:08Z

@kndtran can you have a look at this please?

kndtran · 2026-05-26T21:10:55Z

Hi all, I'm testing this change with my intrinsics code that uses this mellea path. I'll come back with results ASAP as GPU availability seems to be limited today.

My (Claude's) original implementation for this is similar, using vLLM's implementation (logit masking) for transformers. The only issue I ran into was whitespace_flexible=False as the default. This produced garbage tokens in the query_clarification intrinsic. This does not affect the evaluation results as the clarification string quality is not measured (only the binary classification of CLEAR vs not CLEAR).

cc: @frreiss

jakelorocco · 2026-05-27T15:56:06Z

> Hi all, I'm testing this change with my intrinsics code that uses this mellea path. I'll come back with results ASAP as GPU availability seems to be limited today.
>
> My (Claude's) original implementation for this is similar, using vLLM's implementation (logit masking) for transformers. The only issue I ran into was whitespace_flexible=False as the default. This produced garbage tokens in the query_clarification intrinsic. This does not affect the evaluation results as the clarification string quality is not measured (only the binary classification of CLEAR vs not CLEAR).

@kndtran, my understanding is that whitespace_flexible defaults to True (https://github.com/guidance-ai/llguidance/blob/main/python/llguidance/_lib.pyi#L610):

    # defaults to true (r"[\x20\x0A\x0D\x09]+"); if false, no whitespace is allowed
    whitespace_flexible: Optional[bool]

I left that behavior as is for the intrinsic processing path, but can flip it if needed.

kndtran · 2026-05-27T17:06:48Z

@frreiss I messaged you with detailed results, but in short, this PR looks good to me and no issues with our intrinsics eval code that uses this path.

@jakelorocco The default for whitespace_flexible (=True) is good. No changes needed. This is the same as what vLLM does. At some point I was looking at this param, as vLLM sets it explicitly, to see if it helped constrained decoding for our intrinsics.

frreiss

LGTM

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco added 2 commits May 14, 2026 14:44

feat: move granite formatter hf generation function to use llguidance…

372d940

… instead of xgrammar Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

fix: add not None assertion to model in granite formatter util

c16cc6b

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

github-actions Bot added the enhancement label May 14, 2026

This was linked to issues May 14, 2026

consolidate between llguidance and xgrammar #1004

Closed

fix: unpin xgrammar when bug is fixed #990

Closed

jakelorocco marked this pull request as ready for review May 14, 2026 20:45

jakelorocco requested review from a team as code owners May 14, 2026 20:45

jakelorocco requested review from ajbozarth, akihikokuroda and frreiss May 14, 2026 20:45

ajbozarth reviewed May 14, 2026

ajbozarth reviewed May 18, 2026

Comment thread mellea/formatters/granite/base/util.py

ajbozarth approved these changes May 18, 2026

jakelorocco added the refactor label May 20, 2026

jakelorocco changed the title ~~feat: consolidate to llguidance from xgrammar~~ feat: consolidate to llguidance from xgrammar May 20, 2026

akihikokuroda approved these changes May 22, 2026

planetf1 reviewed May 26, 2026

jakelorocco force-pushed the fix/consolidate-llguidance branch from c9ba91f to 048003e Compare May 26, 2026 13:20

jakelorocco commented May 26, 2026

fix: pr comments

92269d9

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS

jakelorocco force-pushed the fix/consolidate-llguidance branch from 048003e to 92269d9 Compare May 26, 2026 13:57

ajbozarth reviewed May 26, 2026

frreiss approved these changes May 27, 2026

jakelorocco added 2 commits May 27, 2026 15:39

fix: pr comments about params no longer being optional

2e19382

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

fix: resolve merge conflicts

d6a5798

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>

jakelorocco enabled auto-merge May 27, 2026 20:07

jakelorocco added this pull request to the merge queue May 27, 2026

Merged via the queue into generative-computing:main with commit 5a8ec8f May 27, 2026
9 checks passed

jakelorocco deleted the fix/consolidate-llguidance branch May 27, 2026 20:46


