Unified embedding for Rust -- 6 cloud providers + local inference through one interface. Opinionated defaults backed by 8-model benchmark data.
If we build it, it must be great -- every default backed by data.
embedrs::local()?-- all-MiniLM-L6-v2 (23MB, free, no API key)embedrs::cloud(key)-- OpenAI text-embedding-3-small (best discrimination, cheapest cloud)- Both produce the same
EmbedResult-- write code once, switch backends in one line
Defaults chosen by 8-dimension benchmark across 8 models. See examples/embedding_models/ for the full reproducible experiment.
// cloud -- one key, done
let client = embedrs::cloud("sk-...");
let result = client.embed(vec!["hello world".into()]).await?;
println!("dimensions: {}", result.embeddings[0].len());// local -- zero config, free, 23MB model downloaded on first use
let client = embedrs::local()?;
let result = client.embed(vec!["hello world".into()]).await?;[dependencies]
embedrs = "0.3"
# enable local inference (adds ~23MB model download on first use)
embedrs = { version = "0.3", features = ["local"] }| Feature | Default | Description |
|---|---|---|
| (none) | yes | Core embedding client, all 6 cloud providers |
local |
no | Local inference via candle (all-MiniLM-L6-v2, 23MB) |
cost-tracking |
no | Estimated cost per request via tiktoken pricing data |
tracing |
no | Structured logging via the tracing crate |
[dependencies]
# cloud only
embedrs = "0.3"
# cloud + local inference
embedrs = { version = "0.3", features = ["local"] }
# with cost tracking
embedrs = { version = "0.3", features = ["cost-tracking"] }
# with tracing
embedrs = { version = "0.3", features = ["local", "tracing"] }8 dimensions, 184 unique texts. Full methodology and reproduction instructions in examples/embedding_models/ — run with cargo run --example embedding_models --features local --release.
| Metric | MiniLM-L6 | MiniLM-L12 | BGE-small | GTE-small | OpenAI | Gemini | Cohere | Voyage |
|---|---|---|---|---|---|---|---|---|
| Size | 23MB | 133MB | 133MB | 67MB | cloud | cloud | cloud | cloud |
| Spearman ρ | 0.81 | 0.84 | 0.71 | 0.75 | 0.91 | 0.94 | 0.91 | 0.89 |
| Discrimination | 0.52 | 0.52 | 0.29 | 0.14 | 0.58 | 0.30 | 0.46 | 0.45 |
| Retrieval | 100% | 100% | 89% | 100% | 100% | 89% | 100% | 89% |
| EN ρ | 0.92 | 0.94 | 0.92 | 0.90 | 0.91 | 0.91 | 0.89 | 0.88 |
| ZH ρ | 0.65 | 0.74 | 0.45 | 0.40 | 0.88 | 0.99 | 0.93 | 0.89 |
| JA ρ | 0.60 | 0.90 | 0.20 | 0.50 | 0.90 | 1.00 | 1.00 | 0.90 |
| Cross-lingual | 0.25 | 0.26 | 0.66 | 0.81 | 0.71 | 0.84 | 0.68 | 0.85 |
| Robustness | 0.89 | 0.90 | 0.94 | 0.97 | 0.88 | 0.94 | 0.89 | 0.95 |
| Cluster sep. | 8.73x | 4.38x | 1.29x | 1.09x | 2.55x | 1.11x | 1.41x | 1.30x |
| Cost | $0 | $0 | $0 | $0 | $0.02/1M | free tier | $0.10/1M | $0.06/1M |
- 23MB -- the only model small enough for app embedding (others are 67-133MB)
- Best clustering separation at 8.73x (2nd place is 4.38x)
- 100% retrieval accuracy, EN ρ=0.92
- 12-layer models are 3-6x larger with no meaningful quality improvement
- Known weakness: poor on Chinese/Japanese (ρ=0.60-0.65) and cross-lingual (0.25)
- Best discrimination gap at 0.58 (dissimilar texts avg cosine = 0.09, closest to zero)
- 100% retrieval accuracy, MRR=1.0
- Balanced multilingual: EN=0.91, ZH=0.88, JA=0.90 -- no weak language
- Cheapest cloud option at $0.02/1M tokens
- Gemini has higher ρ (0.94) but poor discrimination (0.30) and retrieval miss (89%)
- Cohere matches quality but costs 5x more ($0.10/1M tokens)
| Provider | Constructor | Default Model | Max Batch Size |
|---|---|---|---|
| OpenAI | Client::openai(key) |
text-embedding-3-small |
2048 |
| Cohere | Client::cohere(key) |
embed-v4.0 |
96 |
| Google Gemini | Client::gemini(key) |
gemini-embedding-001 |
100 |
| Voyage AI | Client::voyage(key) |
voyage-3-large |
128 |
| Jina AI | Client::jina(key) |
jina-embeddings-v3 |
2048 |
| Mistral | Client::mistral(key) |
mistral-embed |
512 |
| Local | Client::local(name)? |
all-MiniLM-L6-v2 |
256 |
Model is free-form — pass any current id with .embed(...).model("..."). Notable models as of 2026-06:
- Voyage —
voyage-3-large(default),voyage-4-large,voyage-4,voyage-4-lite,voyage-3.5,voyage-3.5-lite,voyage-code-3 - OpenAI —
text-embedding-3-small(default),text-embedding-3-large - Cohere —
embed-v4.0(default, multimodal) - Gemini —
gemini-embedding-001(default),gemini-embedding-2(multimodal, supportsoutput_dimensionality) - Jina —
jina-embeddings-v3(default).jina-embeddings-v4is released but its cloud-API response schema isn't yet verified single-vector compatible with this crate — try it via.model("jina-embeddings-v4")and please file an issue if you hit a deserialization error. - Mistral —
mistral-embed(default),codestral-embed-2505(specialized for code)
Each cloud provider also has a *_compatible constructor for proxies or API-compatible services:
// OpenAI-compatible (Azure, proxies, etc.)
let client = Client::openai_compatible("sk-...", "https://your-proxy.com/v1");
// Cohere-compatible
let client = Client::cohere_compatible("key", "https://proxy.example.com/v2");
// Gemini-compatible
let client = Client::gemini_compatible("key", "https://proxy.example.com/v1beta");
// Voyage-compatible
let client = Client::voyage_compatible("key", "https://proxy.example.com/v1");
// Mistral-compatible
let client = Client::mistral_compatible("key", "https://proxy.example.com/v1");
// Jina-compatible
let client = Client::jina_compatible("key", "https://proxy.example.com/v1");Embed thousands of texts concurrently. Texts are automatically chunked based on the provider's maximum batch size:
let client = embedrs::cloud("sk-...");
let texts: Vec<String> = (0..5000).map(|i| format!("document {i}")).collect();
let result = client.embed_batch(texts)
.concurrency(5) // max concurrent API requests (default: 5)
.chunk_size(512) // texts per request (default: provider max)
.model("text-embedding-3-large")
.await?;
println!("total embeddings: {}", result.embeddings.len());
println!("total tokens: {}", result.usage.total_tokens);use embedrs::{cosine_similarity, dot_product, euclidean_distance};
let a = vec![1.0, 0.0, 0.0];
let b = vec![0.0, 1.0, 0.0];
let cos = cosine_similarity(&a, &b); // 0.0 (orthogonal)
let dot = dot_product(&a, &b); // 0.0
let dist = euclidean_distance(&a, &b); // 1.414...Some providers use input type hints to optimize embeddings for specific use cases:
use embedrs::InputType;
// for indexing documents
let result = client.embed(docs)
.input_type(InputType::SearchDocument)
.await?;
// for search queries
let result = client.embed(queries)
.input_type(InputType::SearchQuery)
.await?;Available variants: SearchDocument, SearchQuery, Classification, Clustering.
Request reduced-dimension embeddings where the provider supports it:
let result = client.embed(vec!["hello".into()])
.model("text-embedding-3-large")
.dimensions(256)
.await?;
assert_eq!(result.embeddings[0].len(), 256);use std::time::Duration;
use embedrs::BackoffConfig;
let client = Client::openai("sk-...")
.with_retry_backoff(BackoffConfig::default()) // 500ms base, 30s cap, 3 retries
.with_timeout(Duration::from_secs(120)); // overall timeout (default: 60s)
// per-request override
let result = client.embed(vec!["hello".into()])
.retry_backoff(BackoffConfig {
base_delay: Duration::from_millis(200),
max_delay: Duration::from_secs(10),
jitter: true,
max_http_retries: 5,
})
.timeout(Duration::from_secs(30))
.await?;Without backoff configured, HTTP 429/503 errors fail immediately.
Set defaults once, override per-request:
let client = Client::openai("sk-...")
.with_model("text-embedding-3-large")
.with_dimensions(256)
.with_input_type(InputType::SearchDocument)
.with_retry_backoff(BackoffConfig::default())
.with_timeout(Duration::from_secs(120));
// all requests use the defaults above
let a = client.embed(vec!["doc 1".into()]).await?;
let b = client.embed(vec!["doc 2".into()]).await?;
// override for a specific request
let c = client.embed(vec!["query".into()])
.model("text-embedding-3-small")
.input_type(InputType::SearchQuery)
.await?;Chain fallback providers for automatic failover when the primary provider is unavailable:
let client = embedrs::Client::openai("sk-...")
.with_fallback(embedrs::Client::cohere("cohere-key"));
// if OpenAI fails, automatically tries Cohere
let result = client.embed(vec!["hello".into()]).await?;Multiple fallbacks are tried in order:
let client = embedrs::Client::openai("sk-...")
.with_fallback(embedrs::Client::cohere("cohere-key"))
.with_fallback(embedrs::Client::voyage("voyage-key"));Enable the cost-tracking feature to get estimated cost per request:
embedrs = { version = "0.3", features = ["cost-tracking"] }let result = client.embed(vec!["hello".into()]).await?;
if let Some(cost) = result.usage.cost {
println!("estimated cost: ${cost:.6}");
}Cost estimation uses tiktoken pricing data. Returns None for models without pricing information.
All fallible operations return embedrs::Result<T>. Match on Error variants for fine-grained control:
use embedrs::Error;
# async fn run(client: &embedrs::Client) {
match client.embed(vec!["hello".into()]).await {
Ok(result) => println!("got {} embeddings", result.embeddings.len()),
Err(Error::Api { status: 429, .. }) => eprintln!("rate limited"),
Err(Error::Api { status, message }) => eprintln!("API error {status}: {message}"),
Err(Error::Timeout(duration)) => eprintln!("timed out after {duration:?}"),
Err(Error::Http(e)) => eprintln!("network error: {e}"),
Err(e) => eprintln!("other error: {e}"),
}
# }| Aspect | embedrs | fastembed-rs | Raw reqwest |
|---|---|---|---|
| Cloud providers | 6 built-in (OpenAI, Cohere, Gemini, Voyage, Jina, Mistral) | None | Manual per provider |
| Local inference | candle-based, 23MB default model | ONNX Runtime, multiple models | N/A |
| Unified interface | Same EmbedResult for cloud and local |
Local only | N/A |
| Batch auto-chunking | Automatic by provider limits + concurrency | Manual | Manual |
| Provider fallback | Built-in .with_fallback() chain |
N/A | Manual |
| Data-driven defaults | 8-dimension benchmark across 8 models (examples/embedding_models/) |
No published benchmark | N/A |
| Backoff & timeout | Built-in exponential backoff on 429/503 | N/A | Manual |
fastembed-rs is a solid choice if you only need local inference with ONNX Runtime and don't need cloud providers. embedrs is designed for applications that need both cloud and local through a single API, with opinionated defaults and production features like fallback and backoff.
tiktoken · @goliapkg/tiktoken-wasm · instructors · chunkedrs · embedrs