Update tokenizer mappings to use TokenizersBackend for additional models #46091

Open
itazap wants to merge 15 commits into main from tokenizers_backend_update

@itazap

@itazap itazap commented May 20, 2026

Collaborator

see the description / report here: #45936

this PR contains just the auto changes from the PR above. it's for models that don't have their own Tokenizer class, so we don't have a test_tokenization_*.py for these models.

fixes: #45920, #46710, #45488, #46489

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@itazap itazap requested a review from ArthurZucker May 20, 2026 05:27
@itazap itazap force-pushed the tokenizers_backend_update branch from 68f0144 to 3d7abee Compare May 20, 2026 06:56
  pad_len = len(expected_input_ids_2) - len(expected_input_ids_1)

- expected_attention_mask = [[0] * pad_len + [1] * len(expected_input_ids_1), [1] * (len(expected_input_ids_2))]
+ expected_attention_mask = [[1] * len(expected_input_ids_1) + [0] * pad_len, [1] * (len(expected_input_ids_2))]

Collaborator Author

this test was changed a few months ago and I'm changing it back
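
The two `expected_attention_mask` lines in the snippet above differ only in where the padding zeros go: before the real tokens (left padding) or after them (right padding). A minimal sketch of both layouts, using hypothetical helper names that are not part of the PR:

```python
from typing import List


def right_padded_attention_masks(sequences: List[List[int]]) -> List[List[int]]:
    """Build attention masks with padding zeros placed after the real tokens."""
    max_len = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]


def left_padded_attention_masks(sequences: List[List[int]]) -> List[List[int]]:
    """Build attention masks with padding zeros placed before the real tokens."""
    max_len = max(len(seq) for seq in sequences)
    return [[0] * (max_len - len(seq)) + [1] * len(seq) for seq in sequences]


# Illustrative token ids only; real ids would come from a tokenizer.
expected_input_ids_1 = [5, 6]
expected_input_ids_2 = [7, 8, 9, 10]
batch = [expected_input_ids_1, expected_input_ids_2]

right_masks = right_padded_attention_masks(batch)
left_masks = left_padded_attention_masks(batch)
```

In both layouts the longest sequence gets an all-ones mask; only the shorter sequence changes.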

@itazap itazap force-pushed the tokenizers_backend_update branch from 05a70e5 to 2ba20bb Compare May 28, 2026 11:29
@itazap

itazap commented May 28, 2026

Collaborator Author

run-slow: aria, auto

@github-actions

Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/aria", "models/auto"]
quantizations: []

@github-actions

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
| --- | --- | --- |
| RUN | 685085ea | workflow commit (merge commit) |
| PR | d1b3b522 | branch commit (from PR) |
| main | bc8f70a9 | base commit (on main) |

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@ArthurZucker ArthurZucker left a comment

Collaborator

Nice! Well given how many used gpt2 I am not really surprised

if (
    tokenizer_auto_map is None
    and TokenizersBackend is not None
    and _config_name_or_path.startswith("deepseek-ai/deepseek-r1")

Collaborator

no no, let's support a new kind of entry, "deepseek-ai/deepseek-r1", not only this one.
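
One possible reading of this suggestion is a table of model-id prefixes instead of a single hard-coded `startswith` check. The sketch below is only an illustration under that assumption: the table name, the function name, the lowercasing step, and the mapped value are all hypothetical and are not taken from the PR.

```python
from typing import Dict, Optional

# Hypothetical table: lowercase model-id prefix -> tokenizer class name.
# The mapped value is a placeholder chosen for illustration.
MODEL_ID_PREFIX_TO_TOKENIZER: Dict[str, str] = {
    "deepseek-ai/deepseek-r1": "TokenizersBackend",
}


def tokenizer_for_model_id(model_id: str) -> Optional[str]:
    """Return the tokenizer name registered for the longest matching prefix, if any."""
    lowered = model_id.lower()  # assumption: prefixes are matched case-insensitively
    matches = [prefix for prefix in MODEL_ID_PREFIX_TO_TOKENIZER if lowered.startswith(prefix)]
    if not matches:
        return None
    # Prefer the most specific (longest) prefix when several entries match.
    return MODEL_ID_PREFIX_TO_TOKENIZER[max(matches, key=len)]
```

With a table like this, adding another family of checkpoints would mean adding one entry rather than another branch in the condition.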

Comment on lines +359 to +364
"rhymes-ai/Aria",
"Salesforce/blip2-flan-t5-xl",
"google/bigbird-pegasus-large-pubmed",
"microsoft/kosmos-2-patch14-224",
"allenai/OLMo-2-0425-1B",
"stabilityai/tiny-random-stablelm-2",

Collaborator

these are the ones we should have in PER_MODEL_ID_TOKENIZER_FIX no?

Collaborator Author

these are the ones that don't have a dedicated tokenizer class, so the mapping was updated to always map the model_types (aria, blip, bigbird_pegasus, etc.) to TokenizersBackend. so it's not just these checkpoints in theory! these are just the checkpoints that we use in transformers to test these model_types. lmk if that makes sense!
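
To illustrate the distinction made in this reply, here is a small sketch of a lookup keyed on `model_type` rather than on checkpoint id. The dictionary and function names are hypothetical, the configs are plain dicts standing in for real model configs, and the second checkpoint id is made up for the example.

```python
from typing import Dict, Optional

# Hypothetical mapping keyed by model_type, using the model types named in the reply.
MODEL_TYPE_TO_TOKENIZER: Dict[str, str] = {
    "aria": "TokenizersBackend",
    "blip": "TokenizersBackend",
    "bigbird_pegasus": "TokenizersBackend",
}


def tokenizer_class_for_config(config: Dict[str, str]) -> Optional[str]:
    """Resolve the tokenizer name from the config's model_type, ignoring the checkpoint id."""
    return MODEL_TYPE_TO_TOKENIZER.get(config.get("model_type", ""))


# Two different checkpoints of the same model_type resolve to the same class.
test_checkpoint = {"name_or_path": "rhymes-ai/Aria", "model_type": "aria"}
other_checkpoint = {"name_or_path": "someone/another-aria-finetune", "model_type": "aria"}
```

Because the key is the model type, any checkpoint of that type is covered, not only the ones listed for testing.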

@itazap itazap requested a review from ArthurZucker June 3, 2026 14:48
@itazap

itazap commented Jun 3, 2026

Collaborator Author

FYI these are all the model checkpoints we use in transformers that we don't directly test tokenization for. We should add a test somewhere that won't clog the CI to do a simple check; it would have saved us a lot of trouble!

self.assertEqual(
    tokenizer_tok(text, add_special_tokens=False)["input_ids"],
    tokenizer_auto(text, add_special_tokens=False)["input_ids"],
)

BAAI/Emu3-Chat-hf
BAAI/Emu3-Gen-hf
BridgeTower/bridgetower-base-itm-mlm
BridgeTower/bridgetower-large-itm-mlm-itc
ByteDance-Seed/Seed-OSS-36B-Base
EleutherAI/gpt-j-6B
EleutherAI/pythia-410m-deduped
EuroBERT/EuroBERT-210m
FacebookAI/roberta-large-mnli
FacebookAI/xlm-roberta-large
Google/gemma-3n-E4B-it
HuggingFaceM4/Idefics3-8B-Llama3
HuggingFaceM4/idefics-9b
HuggingFaceM4/idefics2-8b-base
HuggingFaceTB/SmolVLM2-256M-Video-Instruct
IDEA-Research/grounding-dino-tiny
Jiqing/tiny-random-tvp
KamilaMila/FastVLM-0.5B
LanguageBind/Video-LLaVA-7B-hf
LiquidAI/LFM2-8B-A1B
LiquidAI/LFM2-VL-1.6B
LiquidAI/LFM2-VL-450M
LiquidAI/LFM2.5-VL-1.6B
MiniMaxAI/MiniMax-M2
Mistralai/Ministral-8B-Instruct-2410
PaddlePaddle/PaddleOCR-VL
Qwen/Qwen1.5-MoE-A2.7B
Qwen/Qwen2-0.5B
Qwen/Qwen2-Audio-7B-Instruct
Qwen/Qwen2-VL-7B-Instruct
Qwen/Qwen2.5-Omni-7B
Qwen/Qwen3-0.6B-Base
Qwen/Qwen3-30B-A3B-Base
Qwen/Qwen3-VL-30B-A3B-Instruct
RUCAIBox/mvp
Rocketknight1/falcon-rw-1b
Sahil-Kabir/colqwen2.5-v0.2-hf
Salesforce/blip-image-captioning-base
Salesforce/blip-itm-base-coco
Salesforce/blip-vqa-base
Salesforce/blip2-opt-2.7b
Salesforce/instructblip-flan-t5-xl
Salesforce/instructblip-vicuna-7b
SmallDoge/Doge-20M
Stancld/longt5-tglobal-large-16384-pubmed-3k_steps
THUDM/GLM-4.1V-9B-Thinking
UsefulSensors/moonshine-base
UsefulSensors/moonshine-streaming-medium
UsefulSensors/moonshine-streaming-small
UsefulSensors/moonshine-streaming-tiny
UsefulSensors/moonshine-tiny
adept/fuyu-8b
adept/persimmon-8b-chat
albert/albert-base-v2
allenai/OLMo-1B-hf
allenai/OLMo-7B-Twin-2T-hf
allenai/OLMo-7B-hf
allenai/OLMoE-1B-7B-0924
allenai/dolma2-tokenizer
allenai/longformer-base-4096
andreasmadsen/efficient_mlm_m0.40
answerdotai/ModernBERT-base
baidu/ERNIE-4.5-0.3B-PT
baidu/ERNIE-4.5-21B-A3B-PT
bigcode/gpt_bigcode-santacoder
bigcode/tiny_starcoder_py
blab-jhu/test-32m-dec
bzantium/tiny-deepseek-v3
clip-italian/clip-italian
dandelin/vilt-b32-finetuned-nlvr2
dandelin/vilt-b32-finetuned-vqa
dandelin/vilt-b32-mlm
deepseek-ai/DeepSeek-V2-Lite
distil-whisper/distil-large-v3
facebook/bart-large
facebook/bart-large-cnn
facebook/bart-large-mnli
facebook/bart-large-xsum
facebook/chameleon-7b
facebook/data2vec-text-base
facebook/dpr-question_encoder-single-nq-base
facebook/dpr-reader-single-nq-base
facebook/musicgen-small
facebook/musicgen-stereo-small
facebook/nllb-moe-54b
facebook/pe-av-large
facebook/xlm-roberta-xl
facebook/xlm-roberta-xxl
facebook/xmod-base
gg-hf/recurrent-gemma-2b-hf
google/electra-small-discriminator
google/fnet-base
google/gemma-2-2b
google/gemma-2b
google/gemma-3-4b-it
google/mobilebert-uncased
google/paligemma-3b-pt-224
google/pegasus-x-base
google/pix2struct-textcaps-base
google/switch-base-8
google/t5gemma-2-270m-270m
google/umt5-small
hf-internal-testing/olmo-hybrid
hf-internal-testing/tiny-random-EuroBertForSequenceClassification
hf-internal-testing/tiny-random-EuroBertForTokenClassification
hf-internal-testing/tiny-random-Gemma3ForCausalLM
hf-internal-testing/tiny-random-ModernBertForSequenceClassification
hf-internal-testing/tiny-random-ModernBertForTokenClassification
ibm-granite/granite-4.0-h-tiny
ibm/PowerLM-3b
ibm/PowerMoE-3b
itazap/blt-1b-hf
jetmoe/jetmoe-8b
jinho8345/bros-base-uncased
kajuma/DiffLlama-0.3B-handcut
kmhf/hf-moshika
kssteven/ibert-roberta-base
liuhaotian/llava-v1.6-34b
lkhl/VideoLLaMA3-2B-Image-HF
llava-hf/LLaVA-NeXT-Video-7B-hf
llava-hf/bakLlava-v1-hf
llava-hf/llava-1.5-7b-hf
llava-hf/llava-onevision-qwen2-0.5b-ov-hf
llava-hf/llava-v1.6-mistral-7b-hf
llava-hf/vip-llava-7b-hf
meta-llama/Llama-4-Scout-17B-16E
meta-llama/Meta-Llama-3.1-8B-Instruct
microsoft/Phi-3-mini-4k-instruct
microsoft/Phi-3.5-MoE-instruct
microsoft/bitnet-b1.58-2B-4T
microsoft/git-base
microsoft/git-base-coco
microsoft/git-base-textvqa
microsoft/layoutlm-base-uncased
microsoft/phi-1
microsoft/phi-1_5
microsoft/phi-2
microsoft/phi-3-mini-128k-instruct
microsoft/phi-3-mini-4k-instruct
microsoft/xclip-base-patch32
mistralai/Ministral-8B-Instruct-2410
mistralai/Mistral-7B-v0.1
mistralai/Mistral-Small-4-119B-2603
naver-clova-ix/donut-base-finetuned-cord-v2
naver-clova-ix/donut-base-finetuned-docvqa
naver-clova-ix/donut-base-finetuned-rvlcdip
nomic-ai/nomic-embed-text-v1
nomic-ai/nomic-embed-text-v1.5
openai/privacy-filter
openai/whisper-large
openai/whisper-large-v2
openai/whisper-large-v3
openai/whisper-small
redmoe-ai-v1/dots.llm1.test
shanearora/2025-sep-a-base-model
shanearora/Flex-reddit-2x7B-1T
shanearora/OLMo2-7B-1124-hf
squeezebert/squeezebert-mnli
stabilityai/stablelm-3b-4e1t
state-spaces/mamba-1.4b-hf
state-spaces/mamba-130m-hf
state-spaces/mamba-2.8b-hf
state-spaces/mamba-790m-hf
stepfun-ai/GOT-OCR-2.0-hf
suno/bark
tencent/Hunyuan-A13B-Instruct
tencent/Youtu-LLM-2B-Base
thisisiron/Ovis2-2B-hf
tiiuae/falcon-11B
tiiuae/falcon-7b
tiiuae/falcon-mamba-7b
unc-nlp/lxmert-base-uncased
uw-madison/nystromformer-512
uw-madison/yoso-4096
westlake-repl/Evolla-10B-hf
ylacombe/musicgen-melody
ylacombe/musicgen-stereo-melody
zai-org/GLM-4.5
zai-org/GLM-4.5V
zai-org/GLM-4.7-Flash
zai-org/GLM-OCR
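
The simple check described at the top of this comment could be sketched roughly as follows. This is not the test added in the PR: the helper name is hypothetical, and the two stand-in tokenizers exist only so the sketch runs offline. In a real test, `tokenizer_tok` and `tokenizer_auto` would be loaded for each checkpoint in the list above.

```python
from typing import Callable, Dict, List

# Hypothetical type: any callable that behaves like a tokenizer and returns {"input_ids": [...]}.
TokenizerLike = Callable[..., Dict[str, List[int]]]


def find_id_mismatches(
    tokenizer_tok: TokenizerLike,
    tokenizer_auto: TokenizerLike,
    texts: List[str],
) -> List[str]:
    """Return the texts for which the two tokenizers disagree on input_ids."""
    mismatches = []
    for text in texts:
        ids_tok = tokenizer_tok(text, add_special_tokens=False)["input_ids"]
        ids_auto = tokenizer_auto(text, add_special_tokens=False)["input_ids"]
        if ids_tok != ids_auto:
            mismatches.append(text)
    return mismatches


# Stand-in tokenizers (assumptions for illustration, not real tokenizers).
def whitespace_tokenizer(text: str, add_special_tokens: bool = False) -> Dict[str, List[int]]:
    return {"input_ids": [len(word) for word in text.split()]}


def char_tokenizer(text: str, add_special_tokens: bool = False) -> Dict[str, List[int]]:
    return {"input_ids": [ord(ch) for ch in text]}
```

Collecting mismatching texts instead of failing on the first one makes it easier to see how widespread a regression is for a given checkpoint.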

@itazap itazap force-pushed the tokenizers_backend_update branch from 765e75b to be52825 Compare June 12, 2026 09:59
@itazap itazap force-pushed the tokenizers_backend_update branch 2 times, most recently from d551fe0 to 5e96632 Compare June 16, 2026 15:10
@itazap itazap force-pushed the tokenizers_backend_update branch from 5e96632 to 76ff958 Compare June 17, 2026 09:06
@github-actions

Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46091&sha=76ff95

@itazap

itazap commented Jun 17, 2026

Collaborator Author

@ArthurZucker ArthurZucker left a comment

Collaborator

Ty! 🤗 I would go as far as to add tests for the checkpoints in TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS! 🤗


@slow
@require_tokenizers
@parameterized.expand(TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS)

Collaborator Author

@ArthurZucker they are here!

@itazap itazap enabled auto-merge June 17, 2026 15:26
@github-actions

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, auto

Development

Successfully merging this pull request may close these issues.

AutoTokenizer produces wrong token IDs for OLMo2, HyperClovaX, DeepSeek-R1-Distill-Llama, Yi, and others (v5 regression)

3 participants