feat: consolidate to llguidance from xgrammar #1077
Conversation
… instead of xgrammar Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
ajbozarth
left a comment
There was a problem hiding this comment.
On a quick read-through this looks good, Claude found a handful of small items, but nothing big:
ajbozarth
left a comment
There was a problem hiding this comment.
small nit, otherwise LGTM. Though it may be worth getting a pair of eyes that know the affected code a bit better to double check
|
There are some observations but any nothing major. |
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
| self._tokenizer, | ||
| self._model, | ||
| ll_tokenizer=self._llguidance_tokenizer, | ||
| ) |
There was a problem hiding this comment.
The util.py fallback path (line 313–321) has a good comment explaining why n_vocab matters — llguidance defaults to the tokeniser's reported size, which can be smaller than model.vocab_size on models with resized embeddings. But constructing _llguidance_tokenizer here without n_vocab means that guard is bypassed whenever the pre-built instance is passed through. Worth noting the old xgrammar path did compute max(tokenizer.vocab_size, len(tokenizer), model.vocab_size) explicitly, so this is a regression for that case.
Possible suggestion —
n_vocab = max(self._tokenizer.vocab_size, len(self._tokenizer), self._model.vocab_size)
self._llguidance_tokenizer = llguidance.hf.from_tokenizer(self._tokenizer, n_vocab=n_vocab)There was a problem hiding this comment.
Added the preemptive fix to this section of the code as well.
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
| ) -> torch.Tensor: | ||
| """Apply the grammar's allowed-token bitmask to ``batch_scores`` in place.""" | ||
| with import_optional("llguidance"): | ||
| import llguidance |
There was a problem hiding this comment.
By the time __call__ fires, __init__ has already received a live llguidance.LLTokenizer, so llguidance is guaranteed importable — the import_optional guard here is effectively dead code and adds context-manager overhead on every generated token. Would be cleaner to move the runtime import into __init__ (or do it once at the top of the function without the wrapper).
There was a problem hiding this comment.
Moved the optional import to the init function. I think it's still worth having a definitive check even though the caller will have almost certainly imported the necessary libraries if utilizing this class.
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
| model: HuggingFace model object. Only required if the request uses constrained | ||
| decoding. | ||
| tokenizer: HuggingFace tokenizer. Required for constrained decoding unless | ||
| ``ll_tokenizer`` is provided, and required when ``constrained_decoding_prefix`` |
There was a problem hiding this comment.
Small doc nit: tokenizer is required unconditionally — apply_chat_template is called on it regardless of whether ll_tokenizer is provided (line 242, with # type: ignore[union-attr] suppressing mypy). The current wording implies it can be omitted in the non-constrained path, which would give a confusing AttributeError at a different call site.
There was a problem hiding this comment.
Fixed. Requires the tokenizer now in the function signature.
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
| # on the right device. | ||
| input_tokens = input_tokens.to(model.device) # type: ignore[union-attr] | ||
| if model is not None: | ||
| input_tokens = input_tokens.to(model.device) |
There was a problem hiding this comment.
This prevents the unconditional model.device crash, which is a real improvement. One thing to note: if model is None and the caller eventually hits generate_with_transformers, the tokens will be on CPU and the error there will be harder to diagnose. Might be worth pairing this with an explicit check (or at least a comment noting that model is required for generation even if optional here).
There was a problem hiding this comment.
I will just make model unconditionally required (same as the tokenizer fix above). I think this was an oversight in the original code.
|
Happy to treat this as a follow-up rather than a blocker — just worth tracking. |
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
| i_batch, _ = batch_input_ids.shape | ||
| s_batch, _ = batch_scores.shape | ||
| assert i_batch == s_batch | ||
|
|
There was a problem hiding this comment.
assert can be stripped with python -O and gives a bare AssertionError when it fires. A RuntimeError with a message would be more helpful here:
if i_batch != s_batch:
raise RuntimeError(f"batch size mismatch: input_ids={i_batch}, scores={s_batch}")|
Nit: |
planetf1
left a comment
There was a problem hiding this comment.
Reviewed with Claude Code
|
|
||
|
|
||
| # Modified from VLLM v0.9.2 code base | ||
| # https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/model_executor/guided_decoding/guidance_logits_processors.py |
There was a problem hiding this comment.
Nit: the VLLM attribution comment is easy to miss sitting above a class in a mixed utility file — might be cleaner inside the class docstring so it travels with the class if it ever moves again.
c9ba91f to
048003e
Compare
jakelorocco
left a comment
There was a problem hiding this comment.
local tests passed; will wait for full ci tests
| # on the right device. | ||
| input_tokens = input_tokens.to(model.device) # type: ignore[union-attr] | ||
| if model is not None: | ||
| input_tokens = input_tokens.to(model.device) |
There was a problem hiding this comment.
I will just make model unconditionally required (same as the tokenizer fix above). I think this was an oversight in the original code.
| model: HuggingFace model object. Only required if the request uses constrained | ||
| decoding. | ||
| tokenizer: HuggingFace tokenizer. Required for constrained decoding unless | ||
| ``ll_tokenizer`` is provided, and required when ``constrained_decoding_prefix`` |
There was a problem hiding this comment.
Fixed. Requires the tokenizer now in the function signature.
|
|
||
|
|
||
| # Modified from VLLM v0.9.2 code base | ||
| # https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/model_executor/guided_decoding/guidance_logits_processors.py |
| self._tokenizer, | ||
| self._model, | ||
| ll_tokenizer=self._llguidance_tokenizer, | ||
| ) |
There was a problem hiding this comment.
Added the preemptive fix to this section of the code as well.
| ) -> torch.Tensor: | ||
| """Apply the grammar's allowed-token bitmask to ``batch_scores`` in place.""" | ||
| with import_optional("llguidance"): | ||
| import llguidance |
There was a problem hiding this comment.
Moved the optional import to the init function. I think it's still worth having a definitive check even though the caller will have almost certainly imported the necessary libraries if utilizing this class.
| i_batch, _ = batch_input_ids.shape | ||
| s_batch, _ = batch_scores.shape | ||
| assert i_batch == s_batch | ||
|
|
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com> Assisted-by: CLAUDE:OPUS
048003e to
92269d9
Compare
| tokenizer: PreTrainedTokenizerBase | None = None, | ||
| model: PreTrainedModel | None = None, | ||
| tokenizer: PreTrainedTokenizerBase, | ||
| model: PreTrainedModel, |
There was a problem hiding this comment.
Type signature says tokenizer: PreTrainedTokenizerBase and model: PreTrainedModel (required, non-None), but the body still has if tokenizer is None and ll_tokenizer is None: (line ~313) and if tokenizer is None or model is None: (line ~338). Either revert to | None = None to match the runtime contract (where ll_tokenizer can stand in for tokenizer), or drop the None checks. As written, mypy and runtime disagree.
There was a problem hiding this comment.
Thanks for catching this after my most recent update. I will fix this in the next patch. @frreiss, when you review this PR, can you please comment here? This function originally listed these parameters as optional even though they were required in the implementation. I moved towards forcing them to be not None (and if you agree, will fix the None checks in the function body).
There was a problem hiding this comment.
If I recall correctly, older versions of this function were able to get by with either a model or a tokenizer, depending on what features the chat completion request enabled. With the current proliferation of corner-case-covering code, I think that making those parameters optional is no longer practical.
|
@kndtran can you have a look at this please? |
|
Hi all, I'm testing this change with my intrinsics code that uses this My (Claude's) original implementation for this is similar, using the vLLM's implementation (logit masking) for transformers. The only issue I ran into was cc: @frreiss |
@kndtran, my understanding is that I left that behavior as is for the intrinsic processing path, but can flip it if needed. |
|
@frreiss I messaged you with detailed results, but in short, this PR looks good to me and no issues with our intrinsics eval code that uses this path. @jakelorocco The default for |
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
Signed-off-by: Jake LoRocco <jake.lorocco@ibm.com>
5a8ec8f
Misc PR
Type of PR
Description
Moves granite formatters onto llguidance. There wasn't a strong reason to keep xgrammar and both work. All intrinsic tests continue to pass.
Moved
_GuidanceLogitsProcessorto util so that it could be imported by the HuggingFace backend and not live in two spots.Also fixed one spot in the util function where
modelwas being used even though it could've beenNone.Testing
Attribution