
Conversation

@michaelfeil (Contributor) commented Dec 10, 2025

What does this PR do?

Fixes # (issue)
Hides the startup latency of cloning tokenizers (~0.1s per clone). Since tokenization requests are queued anyway, deferring the clones has no functional effect. The tokenizer workers are also initialized before the backend; because the backend starts after them and takes far longer than 0.1s (tens of seconds), we get a net reduction of ~0.1s * num_tokenizers (60 - 200), which saves up to ~20s of cold-start time.
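
To make the idea concrete, here is a minimal sketch under assumptions (not the actual TEI worker code; `FakeTokenizer`, `start_workers`, and the per-worker channels are made up for illustration): each worker thread is spawned immediately and performs the expensive tokenizer clone itself, so startup no longer pays ~0.1s per worker before the backend can begin loading.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Stand-in for `tokenizers::Tokenizer`; assume `clone()` is expensive (~0.1s).
#[derive(Clone)]
struct FakeTokenizer;

impl FakeTokenizer {
    fn encode(&self, input: &str) -> Vec<u32> {
        input.bytes().map(u32::from).collect() // placeholder "tokenization"
    }
}

/// Spawn `num_workers` workers and return one request sender per worker.
fn start_workers(tokenizer: FakeTokenizer, num_workers: usize) -> Vec<mpsc::Sender<String>> {
    let shared = Arc::new(tokenizer); // cheap: only the Arc is cloned on the startup thread
    let mut senders = Vec::with_capacity(num_workers);

    for _ in 0..num_workers {
        let (tx, rx) = mpsc::channel::<String>();
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            // The expensive deep clone now happens on the worker thread, so it
            // overlaps with backend startup instead of delaying it.
            let local = (*shared).clone();
            while let Ok(input) = rx.recv() {
                let _ids = local.encode(&input);
                // ... send the encoding back to the router ...
            }
        });
        senders.push(tx);
    }
    senders
}

fn main() {
    // Requests are queued per worker, so the deferred clone is invisible to
    // callers beyond the first request handled by each worker.
    let senders = start_workers(FakeTokenizer, 4);
    senders[0].send("hello world".to_string()).unwrap();
}
```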

AFTER PR:

2025-12-10T02:20:09.338172Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "BAA*/***-*****-**-v1.5", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: Some("hf_m******************************wnj"), hostname: "michaelfeildns-dev-pod-h100-0", port: 3000, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-12-10T02:20:09.516212Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T02:20:09.516241Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T02:20:09.518213Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T02:20:09.519112Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T02:20:09.520179Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T02:20:09.521038Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T02:20:09.521956Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 5.747292ms
2025-12-10T02:20:09.543257Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T02:20:09.543277Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T02:20:09.543786Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 208 tokenization workers
2025-12-10T02:20:09.543895Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T02:20:09.543910Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T02:20:09.544861Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 950.751µs
2025-12-10T02:20:09.544879Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T02:20:09.546265Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 1.395499ms
2025-12-10T02:20:09.558937Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:249: Starting Bert model on Cpu
2025-12-10T02:20:09.741718Z  INFO text_embeddings_router: router/src/lib.rs:282: Warming up model
2025-12-10T02:20:10.050648Z  WARN text_embeddings_router: router/src/lib.rs:291: Backend does not support a batch size > 4
2025-12-10T02:20:10.050889Z  WARN text_embeddings_router: router/src/lib.rs:292: forcing `max_batch_requests=4`
2025-12-10T02:20:10.051131Z  WARN text_embeddings_router: router/src/lib.rs:341: Invalid hostname, defaulting to 0.0.0.0
2025-12-10T02:20:10.052194Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:3000
2025-12-10T02:20:10.052213Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready

BEFORE PR:

2025-12-10T03:58:19.968967Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:42: Starting download
2025-12-10T03:58:19.968997Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `1_Pooling/config.json`
2025-12-10T03:58:19.975482Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `sentence_bert_config.json`
2025-12-10T03:58:19.977307Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config_sentence_transformers.json`
2025-12-10T03:58:19.979207Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `config.json`
2025-12-10T03:58:19.980129Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:18: Downloading `tokenizer.json`
2025-12-10T03:58:19.981625Z  INFO download_artifacts: text_embeddings_core::download: core/src/download.rs:72: Model artifacts downloaded in 12.661192ms
2025-12-10T03:58:20.021686Z  WARN text_embeddings_router: router/src/lib.rs:205: The input sequences will be truncated to 16 tokens even if the model `max_input_length` is greater than the provided `--max-batch-tokens` (512 > 16), as `--auto-truncate` is enabled.
2025-12-10T03:58:20.021707Z  INFO text_embeddings_router: router/src/lib.rs:216: Maximum number of tokens per request: 16
2025-12-10T03:58:20.022883Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 512 tokenization workers
2025-12-10T03:58:22.529459Z  INFO text_embeddings_router: router/src/lib.rs:264: Starting model backend
2025-12-10T03:58:22.530967Z  INFO text_embeddings_backend: backends/src/lib.rs:586: Downloading `model.safetensors`
2025-12-10T03:58:22.533130Z  INFO text_embeddings_backend: backends/src/lib.rs:421: Model weights downloaded in 2.163728ms
2025-12-10T03:58:22.533163Z  INFO download_dense_modules: text_embeddings_backend: backends/src/lib.rs:685: Downloading `modules.json`
2025-12-10T03:58:22.536127Z  INFO text_embeddings_backend: backends/src/lib.rs:433: Dense modules downloaded in 2.98543ms

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix" to "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" on Dec 10, 2025
@michaelfeil changed the title from "feat: startup time: add cloned tokenzier fix, saves ~10-20s cold start time" to "feat: startup time: add cloned tokenzier fix, saves ~1-20s cold start time" on Dec 10, 2025
@michaelfeil (Contributor, Author) commented:

@kozistr any ideas / comments?

@kozistr (Contributor) left a comment

it seems fine to me to move the expensive calls onto a background thread!

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

Looks good to me!

@michaelfeil (Contributor, Author) commented:

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

If you use tokio::spawn(), tokenization will block the Tokio runtime whenever all workers are busy; in the extreme case no inference is possible, because the tokenization tasks never yield back to the runtime. spawn_blocking would be better, but would have similar effects. Actually, whenever there is high load on non-tokenizer threads (i.e. when we want to do inference), it would most likely make sense to prioritize that.

I've seen TEI lag pretty badly with small BERT models on NVIDIA L4 instances with 4 vCPUs, because inference did not get priority.

IMO it makes a lot of sense to use std threads here, as these tasks run forever (until shutdown).
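
To illustrate the trade-off being discussed, here is a hypothetical sketch (not TEI's actual worker loop; `tokenization_worker` and the structure are made up): a long-lived, CPU-bound worker never yields to an async runtime, which is why a dedicated OS thread is the safer home for it.

```rust
use std::thread;

/// Stand-in for a tokenization worker: a long-lived, CPU-bound loop that never
/// yields to an async runtime (hypothetical, for illustration only).
fn tokenization_worker(id: usize) {
    loop {
        // receive a request, encode it with the tokenizer, send the result back ...
        let _ = id;
        break; // placeholder so this sketch terminates
    }
}

fn main() {
    // `tokio::spawn(async { tokenization_worker(0) })` would pin one runtime
    // worker thread per tokenizer, because the loop never `.await`s; with many
    // tokenization workers there may be no runtime threads left to poll the
    // inference futures.
    //
    // `tokio::task::spawn_blocking(|| tokenization_worker(0))` moves the loop
    // onto Tokio's blocking pool, but long-lived workers still occupy that
    // shared pool for the lifetime of the process.
    //
    // Dedicated OS threads (what this PR keeps) never touch the async runtime:
    let handles: Vec<_> = (0..4)
        .map(|id| thread::spawn(move || tokenization_worker(id)))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```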

@kozistr (Contributor) commented Dec 17, 2025

one thing that came to mind is spawning the async tasks with the Tokio runtime, so that we can easily add graceful shutdown (if needed) and avoid OS thread oversubscription, which could be a minor issue in this case.

If you use tokio::spawn(), tokenization will block the Tokio runtime whenever all workers are busy; in the extreme case no inference is possible, because the tokenization tasks never yield back to the runtime. spawn_blocking would be better, but would have similar effects. Actually, whenever there is high load on non-tokenizer threads (i.e. when we want to do inference), it would most likely make sense to prioritize that.

I've seen TEI lag pretty badly with small BERT models on NVIDIA L4 instances with 4 vCPUs, because inference did not get priority.

IMO it makes a lot of sense to use std threads here, as these tasks run forever (until shutdown).

oh you're right, i missed that part. i agree it'd be better to use std threads here unless they're short-lived. thanks for catching this!

@alvarobartt (Member) left a comment

Hey @michaelfeil, thanks a lot for yet another great PR! I'll merge now; still not sure about the release date for 1.9.0, but I'll try to squeeze in as many PRs as possible before then 🤗

@alvarobartt alvarobartt merged commit 74ac6e1 into huggingface:main Dec 17, 2025