feat: consolidate to llguidance from xgrammar #1077

Merged
jakelorocco merged 5 commits into generative-computing:main from jakelorocco:fix/consolidate-llguidance
May 27, 2026

Conversation

@jakelorocco
Contributor

@jakelorocco jakelorocco commented May 14, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Moves granite formatters onto llguidance. There wasn't a strong reason to keep xgrammar and both work. All intrinsic tests continue to pass.

Moved _GuidanceLogitsProcessor to util so that it could be imported by the HuggingFace backend and not live in two spots.

Also fixed one spot in the util function where model was being used even though it could've been None.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used

… instead of xgrammar

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Assisted-by: CLAUDE:OPUS
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
@github-actions github-actions Bot added the enhancement New feature or request label May 14, 2026
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@jakelorocco jakelorocco marked this pull request as ready for review May 14, 2026 20:45
@jakelorocco jakelorocco requested review from a team as code owners May 14, 2026 20:45
Contributor

@ajbozarth ajbozarth left a comment

On a quick read-through this looks good, Claude found a handful of small items, but nothing big:

Comment thread mellea/formatters/granite/base/util.py Outdated
Comment thread mellea/formatters/granite/base/util.py Outdated
Comment thread mellea/formatters/granite/base/util.py
Comment thread mellea/formatters/granite/base/util.py Outdated
Comment thread mellea/formatters/granite/base/util.py Outdated
Comment thread mellea/formatters/granite/base/util.py Outdated
Comment thread mellea/formatters/granite/base/util.py
Contributor

@ajbozarth ajbozarth left a comment

small nit, otherwise LGTM. Though it may be worth getting a pair of eyes that know the affected code a bit better to double check

@jakelorocco jakelorocco changed the title feat: consolidate to llguidance from xgrammar feat: consolidate to llguidance from xgrammar May 20, 2026
@akihikokuroda
Member

There are some observations, but nothing major.

  1. Import style inconsistency (minor)
    - Line 131: import_optional is used inside the __call__ method but not at module level
    - Consider whether llguidance could be imported at module load time (if always available for granite formatters) or if lazy loading is intentional for performance
  2. Docstring incompleteness
    - _GuidanceLogitsProcessor.__init__ has a docstring but the class docstring says "Apply... to batch_scores"; it should clarify that it's a logits processor callback
    - Current: "A HuggingFace logits processor that enforces an llguidance grammar." (good enough, but could add a line about usage context)
  3. Error message ambiguity (line 305)
    - Old: Two separate error checks for missing tokenizer and model
    - New: One unified error mentioning both
    - The new error at line 333 requires both when constrained_decoding_prefix is used, but this is only enforced inside the nested conditional. Consider lifting this validation earlier or documenting the dependency more clearly.
  4. Type annotation edge case
    - Line 291: tokenizer: PreTrainedTokenizerBase | None = None but at line 311 it assumes tokenizer is not None without a guard when computing vocab_size. Guard added at line 312 (if ll_tokenizer is None) mitigates this, but the logic flow could be clearer.
  5. Testing coverage not shown
    - Diff doesn't include test updates. PR says tests pass, but verify:
      - Unit tests for _GuidanceLogitsProcessor in the new location
      - Integration tests with ll_tokenizer parameter
      - Edge case: what happens if both tokenizer and ll_tokenizer are None and constrained decoding is requested?

Member

@akihikokuroda akihikokuroda left a comment

LGTM

Contributor

@planetf1 planetf1 left a comment

Reviewed with Claude Code

self._tokenizer,
self._model,
ll_tokenizer=self._llguidance_tokenizer,
)
Contributor

The util.py fallback path (line 313–321) has a good comment explaining why n_vocab matters — llguidance defaults to the tokeniser's reported size, which can be smaller than model.vocab_size on models with resized embeddings. But constructing _llguidance_tokenizer here without n_vocab means that guard is bypassed whenever the pre-built instance is passed through. Worth noting the old xgrammar path did compute max(tokenizer.vocab_size, len(tokenizer), model.vocab_size) explicitly, so this is a regression for that case.

Possible suggestion —

n_vocab = max(self._tokenizer.vocab_size, len(self._tokenizer), self._model.vocab_size)
self._llguidance_tokenizer = llguidance.hf.from_tokenizer(self._tokenizer, n_vocab=n_vocab)

Contributor Author

Added the preemptive fix to this section of the code as well.
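
For readers following this thread, here is a minimal sketch of the vocabulary-size computation being discussed. It uses stand-in objects rather than real HuggingFace or llguidance classes: `FakeTokenizer`, `FakeModel`, and `effective_vocab_size` are hypothetical names, and the sizes are made up. Only the `max(...)` expression is taken from the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer; not a HuggingFace class."""

    vocab_size: int  # base vocabulary, excluding added tokens
    added_tokens: int  # tokens added after the base vocabulary was fixed

    def __len__(self) -> int:
        # Mirrors the idea that len(tokenizer) counts added tokens too.
        return self.vocab_size + self.added_tokens


@dataclass
class FakeModel:
    """Hypothetical stand-in for a model whose embeddings may be resized."""

    vocab_size: int


def effective_vocab_size(tokenizer: FakeTokenizer, model: FakeModel) -> int:
    # Take the largest of the three sizes so that a token mask built from
    # this number covers every column of the model's logits.
    return max(tokenizer.vocab_size, len(tokenizer), model.vocab_size)


tok = FakeTokenizer(vocab_size=49_152, added_tokens=3)
model = FakeModel(vocab_size=49_280)  # embeddings padded beyond the tokenizer
n_vocab = effective_vocab_size(tok, model)
print(n_vocab)  # prints 49280
```

The point of the sketch is only that the model's size can exceed both tokenizer-derived sizes, which is the case the reviewer flags as a regression when `n_vocab` is not passed.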

Comment thread mellea/formatters/granite/base/util.py Outdated
) -> torch.Tensor:
    """Apply the grammar's allowed-token bitmask to ``batch_scores`` in place."""
    with import_optional("llguidance"):
        import llguidance
Contributor

By the time __call__ fires, __init__ has already received a live llguidance.LLTokenizer, so llguidance is guaranteed importable — the import_optional guard here is effectively dead code and adds context-manager overhead on every generated token. Would be cleaner to move the runtime import into __init__ (or do it once at the top of the function without the wrapper).

Contributor Author

Moved the optional import to the init function. I think it's still worth having a definitive check even though the caller will have almost certainly imported the necessary libraries if utilizing this class.
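
The trade-off in this thread can be sketched roughly as follows. This is not the code from the PR: `import_optional_sketch` is a hypothetical stand-in for the project's `import_optional` helper, `ProcessorSketch` is an invented class, and the standard-library `json` module stands in for the optional dependency so the example runs anywhere.

```python
import importlib
from contextlib import contextmanager


@contextmanager
def import_optional_sketch(extra: str):
    """Hypothetical stand-in: re-raise ImportError with an install hint."""
    try:
        yield
    except ImportError as err:
        raise ImportError(
            f"missing optional dependency; install the '{extra}' extra"
        ) from err


class ProcessorSketch:
    def __init__(self, module_name: str) -> None:
        # Resolve the optional dependency once, at construction time, so a
        # missing package fails early with a clear message.
        with import_optional_sketch(module_name):
            self._mod = importlib.import_module(module_name)
        self.calls = 0

    def __call__(self, scores: list) -> list:
        # Hot path: runs once per generated token, so it does no import
        # work and enters no context manager.
        self.calls += 1
        return scores


proc = ProcessorSketch("json")  # "json" stands in for the real dependency
proc([0.1, 0.9])
```

Doing the check in `__init__` keeps the definitive guard the author wants while removing the per-token overhead the reviewer points out.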

Comment thread mellea/formatters/granite/base/util.py Outdated
model: HuggingFace model object. Only required if the request uses constrained
decoding.
tokenizer: HuggingFace tokenizer. Required for constrained decoding unless
``ll_tokenizer`` is provided, and required when ``constrained_decoding_prefix``
Contributor

Small doc nit: tokenizer is required unconditionally — apply_chat_template is called on it regardless of whether ll_tokenizer is provided (line 242, with # type: ignore[union-attr] suppressing mypy). The current wording implies it can be omitted in the non-constrained path, which would give a confusing AttributeError at a different call site.

Contributor Author

Fixed. Requires the tokenizer now in the function signature.

Comment thread mellea/formatters/granite/base/util.py Outdated
  # on the right device.
- input_tokens = input_tokens.to(model.device) # type: ignore[union-attr]
+ if model is not None:
+     input_tokens = input_tokens.to(model.device)
Contributor

This prevents the unconditional model.device crash, which is a real improvement. One thing to note: if model is None and the caller eventually hits generate_with_transformers, the tokens will be on CPU and the error there will be harder to diagnose. Might be worth pairing this with an explicit check (or at least a comment noting that model is required for generation even if optional here).

Contributor Author

I will just make model unconditionally required (same as the tokenizer fix above). I think this was an oversight in the original code.

@planetf1
Contributor

test_base_util.py currently only covers find_substring_in_text. The new code paths added here — ll_tokenizer parameter, _GuidanceLogitsProcessor, and the two new ValueError guards — aren't tested. At minimum, three mockable unit tests would be useful:

  1. tokenizer is None and ll_tokenizer is None → ValueError
  2. constrained_decoding_prefix set but tokenizer or model is None → ValueError
  3. Passing a pre-built ll_tokenizer skips llguidance.hf.from_tokenizer (verifies the performance contract of the new parameter)

Happy to treat this as a follow-up rather than a blocker — just worth tracking.
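
One way the three suggested tests could look, as a rough sketch. It does not import mellea or llguidance: `build_ll_tokenizer` is a hypothetical stand-in that reproduces only the two guards and the "pre-built instance skips construction" behaviour described above, and the `factory` argument is an invented injection point replacing `llguidance.hf.from_tokenizer`.

```python
from unittest import mock


def build_ll_tokenizer(tokenizer, model, ll_tokenizer=None, prefix=None, factory=None):
    """Hypothetical stand-in for the guarded code path under review."""
    if tokenizer is None and ll_tokenizer is None:
        raise ValueError("need either tokenizer or ll_tokenizer")
    if prefix is not None and (tokenizer is None or model is None):
        raise ValueError("constrained_decoding_prefix needs tokenizer and model")
    if ll_tokenizer is None:
        ll_tokenizer = factory(tokenizer)  # the expensive construction step
    return ll_tokenizer


def expect_value_error(**kwargs) -> bool:
    """Return True if the call raises ValueError, False otherwise."""
    try:
        build_ll_tokenizer(**kwargs)
    except ValueError:
        return True
    return False


# 1. Neither tokenizer nor ll_tokenizer supplied.
case_1 = expect_value_error(tokenizer=None, model=mock.Mock())

# 2. Prefix requested but the model is missing.
case_2 = expect_value_error(
    tokenizer=mock.Mock(), model=None, prefix="{", factory=mock.Mock()
)

# 3. A pre-built ll_tokenizer must bypass the factory entirely.
factory = mock.Mock()
prebuilt = mock.sentinel.ll_tokenizer
result = build_ll_tokenizer(
    mock.Mock(), mock.Mock(), ll_tokenizer=prebuilt, factory=factory
)
factory.assert_not_called()
```

In the real test file the same checks would presumably patch the actual construction call instead of injecting a factory. If tokenizer and model end up unconditionally required, cases 1 and 2 would no longer apply.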

i_batch, _ = batch_input_ids.shape
s_batch, _ = batch_scores.shape
assert i_batch == s_batch

Contributor

assert can be stripped with python -O and gives a bare AssertionError when it fires. A RuntimeError with a message would be more helpful here:

if i_batch != s_batch:
    raise RuntimeError(f"batch size mismatch: input_ids={i_batch}, scores={s_batch}")

Contributor Author

Changed.
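
To illustrate the reviewer's point about `python -O`, the sketch below compiles an `assert` with the built-in `compile(..., optimize=1)`, which drops assert statements the same way the `-O` flag does, and compares it with an explicit check. `check_batch_sizes` is a hypothetical helper name; only the error message is taken from the suggestion above.

```python
def check_batch_sizes(i_batch: int, s_batch: int) -> None:
    # Explicit check: survives -O and carries a readable message.
    if i_batch != s_batch:
        raise RuntimeError(
            f"batch size mismatch: input_ids={i_batch}, scores={s_batch}"
        )


source = "assert i_batch == s_batch\nchecked = True\n"

# optimize=1 compiles the snippet the way `python -O` would: the assert is
# removed, so the mismatch below goes unnoticed and execution continues.
namespace = {"i_batch": 2, "s_batch": 3}
exec(compile(source, "<sketch>", "exec", optimize=1), namespace)
assert_was_skipped = namespace.get("checked", False)

# The explicit check still fires for the same mismatched sizes.
try:
    check_batch_sizes(2, 3)
    message = ""
except RuntimeError as err:
    message = str(err)

print(assert_was_skipped)  # prints True
print(message)  # prints batch size mismatch: input_ids=2, scores=3
```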

@planetf1
Contributor

Nit: huggingface.py:265 — the assertion message ends with "... wtf?" which surfaces in user-facing stack traces when it fires. Something like f"vocab size mismatch: llguidance={self._llguidance_tokenizer.vocab_size} != tokenizer={self._tokenizer._tokenizer.get_vocab_size(with_added_tokens=True)}" would give more actionable info.

Comment thread mellea/formatters/granite/base/util.py Outdated


# Modified from VLLM v0.9.2 code base
# https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/model_executor/guided_decoding/guidance_logits_processors.py
Contributor

Nit: the VLLM attribution comment is easy to miss sitting above a class in a mixed utility file — might be cleaner inside the class docstring so it travels with the class if it ever moves again.

Contributor Author

Moved.

@jakelorocco jakelorocco force-pushed the fix/consolidate-llguidance branch from c9ba91f to 048003e Compare May 26, 2026 13:20
Contributor Author

@jakelorocco jakelorocco left a comment

local tests passed; will wait for full ci tests

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Assisted-by: CLAUDE:OPUS
@jakelorocco jakelorocco force-pushed the fix/consolidate-llguidance branch from 048003e to 92269d9 Compare May 26, 2026 13:57
- tokenizer: PreTrainedTokenizerBase | None = None,
- model: PreTrainedModel | None = None,
+ tokenizer: PreTrainedTokenizerBase,
+ model: PreTrainedModel,
Contributor

Type signature says tokenizer: PreTrainedTokenizerBase and model: PreTrainedModel (required, non-None), but the body still has if tokenizer is None and ll_tokenizer is None: (line ~313) and if tokenizer is None or model is None: (line ~338). Either revert to | None = None to match the runtime contract (where ll_tokenizer can stand in for tokenizer), or drop the None checks. As written, mypy and runtime disagree.

Contributor Author

Thanks for catching this after my most recent update. I will fix this in the next patch. @frreiss, when you review this PR, can you please comment here? This function originally listed these parameters as optional even though they were required in the implementation. I moved towards forcing them to be not None (and if you agree, will fix the None checks in the function body).

Collaborator

If I recall correctly, older versions of this function were able to get by with either a model or a tokenizer, depending on what features the chat completion request enabled. With the current proliferation of corner-case-covering code, I think that making those parameters optional is no longer practical.
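
A rough sketch of the contract this thread converges on, where the signature and the body agree: `tokenizer` and `model` are required, so only the genuinely optional `ll_tokenizer` is tested for `None`. All names here (`Tok`, `Model`, `resolve_ll_tokenizer`, the `build` callback) are hypothetical and do not come from the PR.

```python
from typing import Callable, Optional


class Tok:
    """Hypothetical stand-in for a tokenizer."""


class Model:
    """Hypothetical stand-in for a model."""

    device = "cpu"


def resolve_ll_tokenizer(
    tokenizer: Tok,
    model: Model,
    ll_tokenizer: Optional[object] = None,
    build: Callable[[Tok, Model], object] = lambda t, m: ("built", t, m),
) -> object:
    # tokenizer and model are required by the signature, so the body does
    # not re-check them for None; a type checker and the runtime agree.
    if ll_tokenizer is None:
        ll_tokenizer = build(tokenizer, model)
    return ll_tokenizer


built = resolve_ll_tokenizer(Tok(), Model())
reused = resolve_ll_tokenizer(Tok(), Model(), ll_tokenizer="prebuilt")
```

With this shape, passing `None` for `tokenizer` or `model` is reported by the type checker at the call site rather than by a runtime guard deep in the function.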

@frreiss
Collaborator

frreiss commented May 26, 2026

@kndtran can you have a look at this please?

@kndtran
Contributor

kndtran commented May 26, 2026

Hi all, I'm testing this change with my intrinsics code that uses this mellea path. I'll come back with results ASAP as GPU availability seems to be limited today.

My (Claude's) original implementation for this is similar, using the vLLM's implementation (logit masking) for transformers. The only issue I ran into was whitespace_flexible=False as the default. This produced garbage tokens in the query_clarification intrinsic. This does not affect the evaluation results as the clarification string quality is not measured (only the binary classification of CLEAR vs not CLEAR).

cc: @frreiss

@jakelorocco
Contributor Author

> Hi all, I'm testing this change with my intrinsics code that uses this mellea path. I'll come back with results ASAP as GPU availability seems to be limited today.
>
> My (Claude's) original implementation for this is similar, using the vLLM's implementation (logit masking) for transformers. The only issue I ran into was whitespace_flexible=False as the default. This produced garbage tokens in the query_clarification intrinsic. This does not affect the evaluation results as the clarification string quality is not measured (only the binary classification of CLEAR vs not CLEAR).

@kndtran, my understanding is that whitespace_flexible defaults to True (https://github.com/guidance-ai/llguidance/blob/main/python/llguidance/_lib.pyi#L610):

    # defaults to true (r"[\x20\x0A\x0D\x09]+"); if false, no whitespace is allowed
    whitespace_flexible: Optional[bool]

I left that behavior as is for the intrinsic processing path, but can flip it if needed.

@kndtran
Contributor

kndtran commented May 27, 2026

@frreiss I messaged you with detailed results, but in short, this PR looks good to me and no issues with our intrinsics eval code that uses this path.

@jakelorocco The default for whitespace_flexible (=True) is good. No changes needed. This is the same as what vLLM does. At some point I was looking at this param, as vLLM sets it explicitly, to see if it helped constrained decoding for our intrinsics.

Collaborator

@frreiss frreiss left a comment

LGTM

Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
@jakelorocco jakelorocco enabled auto-merge May 27, 2026 20:07
@jakelorocco jakelorocco added this pull request to the merge queue May 27, 2026
Merged via the queue into generative-computing:main with commit 5a8ec8f May 27, 2026
9 checks passed
@jakelorocco jakelorocco deleted the fix/consolidate-llguidance branch May 27, 2026 20:46

Labels

enhancement New feature or request refactor

Development

Successfully merging this pull request may close these issues.

consolidate between llguidance and xgrammar
fix: unpin xgrammar when bug is fixed

6 participants