Update tokenizer mappings to use TokenizersBackend for additional models by itazap · Pull Request #46091 · huggingface/transformers

itazap · 2026-05-20T02:57:32Z

see the description / report here: #45936

This PR contains only the auto-mapping changes from the PR above. It covers models that don't have their own Tokenizer class, so there is no test_tokenization_*.py for them.

fixes: #45920, #46710, #45488, #46489

HuggingFaceDocBuilderDev · 2026-05-20T03:09:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

itazap · 2026-05-20T07:08:04Z

        pad_len = len(expected_input_ids_2) - len(expected_input_ids_1)

-        expected_attention_mask = [ [0] * pad_len + [1] * len(expected_input_ids_1), [1] * (len(expected_input_ids_2))]
+        expected_attention_mask = [ [1] * len(expected_input_ids_1) + [0] * pad_len, [1] * (len(expected_input_ids_2))]


This test was changed a few months ago and I'm changing it back.
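To make the direction of the change concrete, here is a minimal sketch of how the expected mask differs between right and left padding. The helper `expected_attention_masks` and the sample token ids are illustrative assumptions, not code from the test suite.

```python
def expected_attention_masks(batch, padding_side="right"):
    """Build 0/1 attention masks for a batch padded to its longest sequence."""
    max_len = max(len(ids) for ids in batch)
    masks = []
    for ids in batch:
        pad_len = max_len - len(ids)
        real, pad = [1] * len(ids), [0] * pad_len
        # Right padding puts the zeros after the real tokens, left padding before.
        masks.append(real + pad if padding_side == "right" else pad + real)
    return masks


# Hypothetical token ids: the first sequence is two tokens shorter than the second.
expected_input_ids_1 = [101, 7, 8]
expected_input_ids_2 = [101, 7, 8, 9, 10]
batch = [expected_input_ids_1, expected_input_ids_2]

right = expected_attention_masks(batch, padding_side="right")
left = expected_attention_masks(batch, padding_side="left")
```

With right padding the mask for the shorter sequence is `[1, 1, 1, 0, 0]`, which is the shape the restored assertion expects. Left padding gives `[0, 0, 1, 1, 1]`, which matches the line being removed.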

itazap · 2026-05-28T12:41:57Z

run-slow: aria, auto

github-actions · 2026-05-28T12:43:14Z

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/aria", "models/auto"]
quantizations: []

github-actions · 2026-05-28T12:53:10Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	685085ea	workflow commit (merge commit)
PR	d1b3b522	branch commit (from PR)
main	bc8f70a9	base commit (on `main`)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

ArthurZucker

Nice! Well given how many used gpt2 I am not really surprised

ArthurZucker · 2026-06-02T16:20:53Z

+        if (
+            tokenizer_auto_map is None
+            and TokenizersBackend is not None
+            and _config_name_or_path.startswith("deepseek-ai/deepseek-r1")


No, let's support a new kind of entry such as "deepseek-ai/deepseek-r1", not only this one.
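One way to read this suggestion is as a lookup table keyed by model id prefix, so further checkpoint families become new entries instead of extra `startswith` branches. The sketch below is an assumption about the shape of such a table. `PER_MODEL_ID_PREFIX_TOKENIZER` and `resolve_override` are hypothetical names, and the class is represented by its name as a string to keep the example self-contained.

```python
from typing import Optional

# Hypothetical table: lowercase model-id prefix -> tokenizer class name.
PER_MODEL_ID_PREFIX_TOKENIZER = {
    "deepseek-ai/deepseek-r1": "TokenizersBackend",
}


def resolve_override(name_or_path: str, tokenizer_auto_map: Optional[object] = None) -> Optional[str]:
    """Return an override class name for a checkpoint, or None if no entry applies."""
    # A checkpoint that ships its own auto_map keeps it (mirrors the `is None` guard in the diff).
    if tokenizer_auto_map is not None:
        return None
    lowered = name_or_path.lower()  # assumption: matching is case-insensitive
    for prefix, class_name in PER_MODEL_ID_PREFIX_TOKENIZER.items():
        if lowered.startswith(prefix):
            return class_name
    return None


override = resolve_override("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
```

Supporting another family would then mean adding one dictionary entry, with no change to the lookup logic.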

ArthurZucker · 2026-06-02T16:21:50Z

+        "rhymes-ai/Aria",
+        "Salesforce/blip2-flan-t5-xl",
+        "google/bigbird-pegasus-large-pubmed",
+        "microsoft/kosmos-2-patch14-224",
+        "allenai/OLMo-2-0425-1B",
+        "stabilityai/tiny-random-stablelm-2",


these are the ones we should have in PER_MODEL_ID_TOKENIZER_FIX no?

These are the ones that don't have a dedicated tokenizer class, so the mapping was updated to always map their model_types (aria, blip, bigbird_pegasus, etc.) to TokenizersBackend. So in theory it's not just these checkpoints! These are just the checkpoints we use in transformers to test these model_types. Lmk if that makes sense!
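The distinction can be sketched as follows: a mapping keyed by `model_type` applies to every checkpoint of that type, so the checkpoints quoted above serve only as test fixtures. The dictionary below is a hypothetical, simplified stand-in for the auto mapping, with class names as strings.

```python
# Hypothetical, simplified stand-in for a model_type -> tokenizer class mapping.
MODEL_TYPE_TO_TOKENIZER = {
    "aria": "TokenizersBackend",
    "blip": "TokenizersBackend",
    "bigbird_pegasus": "TokenizersBackend",
}


def tokenizer_class_for(model_type: str, default: str = "unknown") -> str:
    """Look up the tokenizer class by model_type; the checkpoint name plays no role."""
    return MODEL_TYPE_TO_TOKENIZER.get(model_type, default)


# Any checkpoint whose config reports model_type "aria" resolves the same way,
# whether it is "rhymes-ai/Aria" or some other (hypothetical) Aria fine-tune.
resolved = tokenizer_class_for("aria")
```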

itazap · 2026-06-03T15:08:29Z

FYI, these are all the model checkpoints we use in transformers that we don't directly test tokenization for. We should add a test somewhere that won't clog the CI to do a simple check; it would have saved us a lot of trouble!

self.assertEqual(
    tokenizer_tok(text, add_special_tokens=False)["input_ids"],
    tokenizer_auto(text, add_special_tokens=False)["input_ids"],
)

BAAI/Emu3-Chat-hf
BAAI/Emu3-Gen-hf
BridgeTower/bridgetower-base-itm-mlm
BridgeTower/bridgetower-large-itm-mlm-itc
ByteDance-Seed/Seed-OSS-36B-Base
EleutherAI/gpt-j-6B
EleutherAI/pythia-410m-deduped
EuroBERT/EuroBERT-210m
FacebookAI/roberta-large-mnli
FacebookAI/xlm-roberta-large
Google/gemma-3n-E4B-it
HuggingFaceM4/Idefics3-8B-Llama3
HuggingFaceM4/idefics-9b
HuggingFaceM4/idefics2-8b-base
HuggingFaceTB/SmolVLM2-256M-Video-Instruct
IDEA-Research/grounding-dino-tiny
Jiqing/tiny-random-tvp
KamilaMila/FastVLM-0.5B
LanguageBind/Video-LLaVA-7B-hf
LiquidAI/LFM2-8B-A1B
LiquidAI/LFM2-VL-1.6B
LiquidAI/LFM2-VL-450M
LiquidAI/LFM2.5-VL-1.6B
MiniMaxAI/MiniMax-M2
Mistralai/Ministral-8B-Instruct-2410
PaddlePaddle/PaddleOCR-VL
Qwen/Qwen1.5-MoE-A2.7B
Qwen/Qwen2-0.5B
Qwen/Qwen2-Audio-7B-Instruct
Qwen/Qwen2-VL-7B-Instruct
Qwen/Qwen2.5-Omni-7B
Qwen/Qwen3-0.6B-Base
Qwen/Qwen3-30B-A3B-Base
Qwen/Qwen3-VL-30B-A3B-Instruct
RUCAIBox/mvp
Rocketknight1/falcon-rw-1b
Sahil-Kabir/colqwen2.5-v0.2-hf
Salesforce/blip-image-captioning-base
Salesforce/blip-itm-base-coco
Salesforce/blip-vqa-base
Salesforce/blip2-opt-2.7b
Salesforce/instructblip-flan-t5-xl
Salesforce/instructblip-vicuna-7b
SmallDoge/Doge-20M
Stancld/longt5-tglobal-large-16384-pubmed-3k_steps
THUDM/GLM-4.1V-9B-Thinking
UsefulSensors/moonshine-base
UsefulSensors/moonshine-streaming-medium
UsefulSensors/moonshine-streaming-small
UsefulSensors/moonshine-streaming-tiny
UsefulSensors/moonshine-tiny
adept/fuyu-8b
adept/persimmon-8b-chat
albert/albert-base-v2
allenai/OLMo-1B-hf
allenai/OLMo-7B-Twin-2T-hf
allenai/OLMo-7B-hf
allenai/OLMoE-1B-7B-0924
allenai/dolma2-tokenizer
allenai/longformer-base-4096
andreasmadsen/efficient_mlm_m0.40
answerdotai/ModernBERT-base
baidu/ERNIE-4.5-0.3B-PT
baidu/ERNIE-4.5-21B-A3B-PT
bigcode/gpt_bigcode-santacoder
bigcode/tiny_starcoder_py
blab-jhu/test-32m-dec
bzantium/tiny-deepseek-v3
clip-italian/clip-italian
dandelin/vilt-b32-finetuned-nlvr2
dandelin/vilt-b32-finetuned-vqa
dandelin/vilt-b32-mlm
deepseek-ai/DeepSeek-V2-Lite
distil-whisper/distil-large-v3
facebook/bart-large
facebook/bart-large-cnn
facebook/bart-large-mnli
facebook/bart-large-xsum
facebook/chameleon-7b
facebook/data2vec-text-base
facebook/dpr-question_encoder-single-nq-base
facebook/dpr-reader-single-nq-base
facebook/musicgen-small
facebook/musicgen-stereo-small
facebook/nllb-moe-54b
facebook/pe-av-large
facebook/xlm-roberta-xl
facebook/xlm-roberta-xxl
facebook/xmod-base
gg-hf/recurrent-gemma-2b-hf
google/electra-small-discriminator
google/fnet-base
google/gemma-2-2b
google/gemma-2b
google/gemma-3-4b-it
google/mobilebert-uncased
google/paligemma-3b-pt-224
google/pegasus-x-base
google/pix2struct-textcaps-base
google/switch-base-8
google/t5gemma-2-270m-270m
google/umt5-small
hf-internal-testing/olmo-hybrid
hf-internal-testing/tiny-random-EuroBertForSequenceClassification
hf-internal-testing/tiny-random-EuroBertForTokenClassification
hf-internal-testing/tiny-random-Gemma3ForCausalLM
hf-internal-testing/tiny-random-ModernBertForSequenceClassification
hf-internal-testing/tiny-random-ModernBertForTokenClassification
ibm-granite/granite-4.0-h-tiny
ibm/PowerLM-3b
ibm/PowerMoE-3b
itazap/blt-1b-hf
jetmoe/jetmoe-8b
jinho8345/bros-base-uncased
kajuma/DiffLlama-0.3B-handcut
kmhf/hf-moshika
kssteven/ibert-roberta-base
liuhaotian/llava-v1.6-34b
lkhl/VideoLLaMA3-2B-Image-HF
llava-hf/LLaVA-NeXT-Video-7B-hf
llava-hf/bakLlava-v1-hf
llava-hf/llava-1.5-7b-hf
llava-hf/llava-onevision-qwen2-0.5b-ov-hf
llava-hf/llava-v1.6-mistral-7b-hf
llava-hf/vip-llava-7b-hf
meta-llama/Llama-4-Scout-17B-16E
meta-llama/Meta-Llama-3.1-8B-Instruct
microsoft/Phi-3-mini-4k-instruct
microsoft/Phi-3.5-MoE-instruct
microsoft/bitnet-b1.58-2B-4T
microsoft/git-base
microsoft/git-base-coco
microsoft/git-base-textvqa
microsoft/layoutlm-base-uncased
microsoft/phi-1
microsoft/phi-1_5
microsoft/phi-2
microsoft/phi-3-mini-128k-instruct
microsoft/phi-3-mini-4k-instruct
microsoft/xclip-base-patch32
mistralai/Ministral-8B-Instruct-2410
mistralai/Mistral-7B-v0.1
mistralai/Mistral-Small-4-119B-2603
naver-clova-ix/donut-base-finetuned-cord-v2
naver-clova-ix/donut-base-finetuned-docvqa
naver-clova-ix/donut-base-finetuned-rvlcdip
nomic-ai/nomic-embed-text-v1
nomic-ai/nomic-embed-text-v1.5
openai/privacy-filter
openai/whisper-large
openai/whisper-large-v2
openai/whisper-large-v3
openai/whisper-small
redmoe-ai-v1/dots.llm1.test
shanearora/2025-sep-a-base-model
shanearora/Flex-reddit-2x7B-1T
shanearora/OLMo2-7B-1124-hf
squeezebert/squeezebert-mnli
stabilityai/stablelm-3b-4e1t
state-spaces/mamba-1.4b-hf
state-spaces/mamba-130m-hf
state-spaces/mamba-2.8b-hf
state-spaces/mamba-790m-hf
stepfun-ai/GOT-OCR-2.0-hf
suno/bark
tencent/Hunyuan-A13B-Instruct
tencent/Youtu-LLM-2B-Base
thisisiron/Ovis2-2B-hf
tiiuae/falcon-11B
tiiuae/falcon-7b
tiiuae/falcon-mamba-7b
unc-nlp/lxmert-base-uncased
uw-madison/nystromformer-512
uw-madison/yoso-4096
westlake-repl/Evolla-10B-hf
ylacombe/musicgen-melody
ylacombe/musicgen-stereo-melody
zai-org/GLM-4.5
zai-org/GLM-4.5V
zai-org/GLM-4.7-Flash
zai-org/GLM-OCR
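
The simple check described at the top of this comment could be packaged as a helper that accepts any two tokenizer-like callables. This is only a sketch: `assert_same_input_ids` and `WhitespaceTokenizer` are hypothetical, and the stand-in tokenizer exists solely so the example runs without downloading a checkpoint.

```python
class WhitespaceTokenizer:
    """Toy stand-in for a real tokenizer: maps whitespace-split words to ids."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, text, add_special_tokens=False):
        # Assign ids in order of first appearance; special tokens are never added here.
        ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]
        return {"input_ids": ids}


def assert_same_input_ids(tokenizer_tok, tokenizer_auto, texts):
    """Raise AssertionError on the first text the two tokenizers encode differently."""
    for text in texts:
        ids_tok = tokenizer_tok(text, add_special_tokens=False)["input_ids"]
        ids_auto = tokenizer_auto(text, add_special_tokens=False)["input_ids"]
        assert ids_tok == ids_auto, f"mismatch for {text!r}: {ids_tok} != {ids_auto}"


# Two independent instances that see the same texts in the same order agree.
assert_same_input_ids(WhitespaceTokenizer(), WhitespaceTokenizer(), ["hello world", "hello again"])
```

In a real test the two arguments would be the tokenizer loaded through the auto mapping and the one loaded directly from `tokenizer.json`, which is the comparison the snippet above performs.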


github-actions · 2026-06-17T09:20:57Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46091&sha=76ff95

itazap · 2026-06-17T12:09:46Z

^ idk how the new PR CI works, because the pipeline is green: https://app.circleci.com/pipelines/github/huggingface/transformers?branch=tokenizers_backend_update

ArthurZucker

Ty! 🤗 I would go as far as to add tests for the checkpoints in TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS ! 🤗

itazap · 2026-06-17T15:15:02Z

+
+    @slow
+    @require_tokenizers
+    @parameterized.expand(TOKENIZERS_BACKEND_AUTO_MAPPING_CHECKPOINTS)


@ArthurZucker they are here!
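
For readers unfamiliar with the pattern in the diff, the sketch below shows the same idea using only the standard library: one check per checkpoint, gated so it is skipped in ordinary CI runs. It swaps `parameterized.expand` for `unittest`'s `subTest`, reads a `RUN_SLOW` environment variable as an assumed stand-in for `@slow`, and uses a hypothetical `load_pair` in place of real tokenizer loading.

```python
import os
import unittest

# Two entries borrowed from the diff quoted earlier in this thread; the real constant is longer.
CHECKPOINTS = ["rhymes-ai/Aria", "allenai/OLMo-2-0425-1B"]


def load_pair(checkpoint):
    """Hypothetical loader: returns two callables standing in for the two tokenizers."""

    def encode(text, add_special_tokens=False):
        return {"input_ids": [ord(char) for char in text]}

    return encode, encode


@unittest.skipUnless(os.environ.get("RUN_SLOW") == "1", "slow test; set RUN_SLOW=1 to run")
class TokenizersBackendMappingTest(unittest.TestCase):
    def test_auto_matches_backend(self):
        for checkpoint in CHECKPOINTS:
            # subTest reports each checkpoint separately, similar to a parameterized test.
            with self.subTest(checkpoint=checkpoint):
                tokenizer_tok, tokenizer_auto = load_pair(checkpoint)
                text = "Hello, world!"
                self.assertEqual(
                    tokenizer_tok(text, add_special_tokens=False)["input_ids"],
                    tokenizer_auto(text, add_special_tokens=False)["input_ids"],
                )
```

Gating on an environment variable keeps the per-checkpoint downloads out of the default test run while still letting a maintainer trigger them on demand.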

github-actions · 2026-06-17T15:33:48Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, auto

itazap requested a review from ArthurZucker May 20, 2026 05:27

itazap force-pushed the tokenizers_backend_update branch from 68f0144 to 3d7abee Compare May 20, 2026 06:56

itazap commented May 20, 2026

itazap force-pushed the tokenizers_backend_update branch from 05a70e5 to 2ba20bb Compare May 28, 2026 11:29

ArthurZucker reviewed Jun 2, 2026

itazap requested a review from ArthurZucker June 3, 2026 14:48

itazap force-pushed the tokenizers_backend_update branch from 765e75b to be52825 Compare June 12, 2026 09:59

itazap mentioned this pull request Jun 15, 2026

Fix Mistral models which contain both tokenizer.json and tekken.json #46622

Open

itazap force-pushed the tokenizers_backend_update branch 2 times, most recently from d551fe0 to 5e96632 Compare June 16, 2026 15:10

itazap and others added 14 commits June 17, 2026 11:06

Update tokenizer mappings to use TokenizersBackend for additional models

7525b0a

aria

181d0fb

split scripts

64c7f09

added test

5ed9b75

Remove model-status-space

b6d2ccf

fix test

8029942

fix test

76a9f9c

add MODEL_IDS_TO_TOKENIZERS_BACKEND list and test for models without …

a9bf262

…tokenizer_class Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

new test

151bf03

reorg tests

9a9d87a

ruff

7af81d2

ruff

29fcf4b

match checkpoint path

45aee66

fix test

76ff958

itazap force-pushed the tokenizers_backend_update branch from 5e96632 to 76ff958 Compare June 17, 2026 09:06

This was referenced Jun 17, 2026

LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family #45488

Open

Accuracy regression in DeepSeek-R1-Distill-Llama-8B after upgrading Transformers from 4.55 to 5.9 #46710

Open

ArthurZucker approved these changes Jun 17, 2026

itazap commented Jun 17, 2026

itazap enabled auto-merge June 17, 2026 15:26

Merge branch 'main' into tokenizers_backend_update

a94a072
