
Conversation

@michaelfeil (Contributor) commented Dec 10, 2025

What does this PR do?

Fixes # (issue)
Hides the startup latency of cloning tokenizers (~0.1s per clone). Since tokenization requests are queued anyway, deferring the clones has no functional effect. The tokenizer workers are also initialized before the backend; because the backend starts after them and takes far longer than 0.1s (tens of seconds), we get a net reduction of ~0.1s * num_tokenizers (60 - 200), which saves up to ~20s of cold-start time.
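
To make the idea concrete, here is a minimal sketch under assumptions (not the actual TEI worker code; `FakeTokenizer`, `start_workers`, and the per-worker channels are made up for illustration): each worker thread is spawned immediately and performs the expensive tokenizer clone itself, so startup no longer pays ~0.1s per worker before the backend can begin loading.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Stand-in for `tokenizers::Tokenizer`; assume `clone()` is expensive (~0.1s).
#[derive(Clone)]
struct FakeTokenizer;

impl FakeTokenizer {
    fn encode(&self, input: &str) -> Vec<u32> {
        input.bytes().map(u32::from).collect() // placeholder "tokenization"
    }
}

/// Spawn `num_workers` workers and return one request sender per worker.
fn start_workers(tokenizer: FakeTokenizer, num_workers: usize) -> Vec<mpsc::Sender<String>> {
    let shared = Arc::new(tokenizer); // cheap: only the Arc is cloned on the startup thread
    let mut senders = Vec::with_capacity(num_workers);

    for _ in 0..num_workers {
        let (tx, rx) = mpsc::channel::<String>();
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            // The expensive deep clone now happens on the worker thread, so it
            // overlaps with backend startup instead of delaying it.
            let local = (*shared).clone();
            while let Ok(input) = rx.recv() {
                let _ids = local.encode(&input);
                // ... send the encoding back to the router ...
            }
        });
        senders.push(tx);
    }
    senders
}

fn main() {
    // Requests are queued per worker, so the deferred clone is invisible to
    // callers beyond the first request handled by each worker.
    let senders = start_workers(FakeTokenizer, 4);
    senders[0].send("hello world".to_string()).unwrap();
}
```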

AFTER PR:

2025-12-10T02:20:09.338172Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "BAA*/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: Some("hf_m******************************wnj"), hostname: "michaelfeildns-dev-pod-h100-0", port: 3000, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-12-10T02:20:09.516212Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T02:20:09.516241Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T02:20:09.518213Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T02:20:09.519112Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T02:20:09.520179Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T02:20:09.521038Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T02:20:09.521956Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 5.747292ms
2025-12-10T02:20:09.543257Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T02:20:09.543277Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T02:20:09.543786Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 208 tokenization workers
2025-12-10T02:20:09.543895Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T02:20:09.543910Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T02:20:09.544861Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 950.751µs
2025-12-10T02:20:09.544879Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T02:20:09.546265Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 1.395499ms
2025-12-10T02:20:09.558937Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:249: Starting Bert model on Cpu
2025-12-10T02:20:09.741718Z  INFO text_embeddings_router: router/src/lib.rs:282: Warming up model
2025-12-10T02:20:10.050648Z  WARN text_embeddings_router: router/src/lib.rs:291: Backend does not support a batch size > 4
2025-12-10T02:20:10.050889Z  WARN text_embeddings_router: router/src/lib.rs:292: forcing `max_batch_requests=4`
2025-12-10T02:20:10.051131Z  WARN text_embeddings_router: router/src/lib.rs:341: Invalid hostname, defaulting to 0.0.0.0
2025-12-10T02:20:10.052194Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:3000
2025-12-10T02:20:10.052213Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready

BEFORE PR:

2025-12-10T03:58:19.968967Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T03:58:19.968997Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T03:58:19.975482Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T03:58:19.977307Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T03:58:19.979207Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T03:58:19.980129Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T03:58:19.981625Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 12.661192ms
2025-12-10T03:58:20.021686Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T03:58:20.021707Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T03:58:20.022883Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 512 tokenization workers
2025-12-10T03:58:22.529459Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T03:58:22.530967Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T03:58:22.533130Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 2.163728ms
2025-12-10T03:58:22.533163Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T03:58:22.536127Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 2.98543ms

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix" to "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" on Dec 10, 2025
@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" to "feat: startup time: add cloned tokenzier fix, saves ~1-20s cold start time" on Dec 10, 2025
@michaelfeil (Contributor, Author) commented:

@kozistr any ideas / comments?

@kozistr (Contributor) left a comment

it seems fine to me to move the expensive calls onto a background thread!

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

Looks good to me!

@michaelfeil (Contributor, Author) commented:

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

If you use tokio::spawn(), tokenization will block the Tokio runtime whenever all workers are busy; in the extreme case no inference is possible, because the tokenization tasks never yield back to the runtime. spawn_blocking would be better, but would have similar effects. Actually, whenever there is high load on non-tokenizer threads (i.e. when we want to do inference), it would most likely make sense to prioritize that.

I've seen TEI lag pretty badly with small BERT models on NVIDIA L4 instances with 4 vCPUs, because inference did not get priority.

IMO it makes a lot of sense to use std threads here, as these tasks run forever (until shutdown).
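
To illustrate the trade-off being discussed, here is a hypothetical sketch (not TEI's actual worker loop; `tokenization_worker` and the structure are made up): a long-lived, CPU-bound worker never yields to an async runtime, which is why a dedicated OS thread is the safer home for it.

```rust
use std::thread;

/// Stand-in for a tokenization worker: a long-lived, CPU-bound loop that never
/// yields to an async runtime (hypothetical, for illustration only).
fn tokenization_worker(id: usize) {
    loop {
        // receive a request, encode it with the tokenizer, send the result back ...
        let _ = id;
        break; // placeholder so this sketch terminates
    }
}

fn main() {
    // `tokio::spawn(async { tokenization_worker(0) })` would pin one runtime
    // worker thread per tokenizer, because the loop never `.await`s; with many
    // tokenization workers there may be no runtime threads left to poll the
    // inference futures.
    //
    // `tokio::task::spawn_blocking(|| tokenization_worker(0))` moves the loop
    // onto Tokio's blocking pool, but long-lived workers still occupy that
    // shared pool for the lifetime of the process.
    //
    // Dedicated OS threads (what this PR keeps) never touch the async runtime:
    let handles: Vec<_> = (0..4)
        .map(|id| thread::spawn(move || tokenization_worker(id)))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```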

@kozistr (Contributor) commented Dec 17, 2025

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

If you use tokio::spawn(), tokenization will block the Tokio runtime whenever all workers are busy; in the extreme case no inference is possible, because the tokenization tasks never yield back to the runtime. spawn_blocking would be better, but would have similar effects. Actually, whenever there is high load on non-tokenizer threads (i.e. when we want to do inference), it would most likely make sense to prioritize that.

I've seen TEI lag pretty badly with small BERT models on NVIDIA L4 instances with 4 vCPUs, because inference did not get priority.

IMO it makes a lot of sense to use std threads here, as these tasks run forever (until shutdown).

oh you're right, i missed that part. i agree it'd be better to use std threads here unless they're short-lived. thanks for catching this!

@alvarobartt (Member) left a comment

Hey @michaelfeil, thanks a lot for yet another great PR! I'll merge now; still not sure about the release date for 1.9.0, but I'll try to squeeze in as many PRs as possible before then 🤗

@alvarobartt alvarobartt merged commit 74ac6e1 into huggingface:main Dec 17, 2025