MihaiValentin/lunr-languages
Lunr Languages — Multilingual Search for AI, RAG & Local-First Apps

Used by 18k+ projects • ~300k weekly downloads

Lunr Languages is an extension for Lunr.js that enables fast, multilingual full-text search across dozens of languages — in the browser or Node.js.

Originally built for classic search, it is now widely used as a lightweight retrieval layer in AI systems, including:

  • Retrieval-Augmented Generation (RAG)
  • Hybrid search (keyword + vector)
  • Local-first / edge AI apps
  • Static site search and documentation search

⭐ If this project saves you time or powers something important, consider starring it or supporting its maintenance.


Supported Languages

  • German
  • French
  • Spanish
  • Italian
  • Dutch
  • Danish
  • Portuguese
  • Finnish
  • Romanian
  • Hungarian
  • Russian
  • Norwegian
  • Swedish
  • Turkish
  • Japanese
  • Thai
  • Arabic
  • Chinese [1]
  • Vietnamese
  • Sanskrit
  • Kannada
  • Telugu
  • Hindi
  • Tamil
  • Korean
  • Armenian
  • Hebrew
  • Greek

Contribute a new language


[1] Chinese tokenization uses Intl.Segmenter with CJK bigrams by default, which works in modern browsers and Node.js without native dependencies. In Node.js, if @node-rs/jieba is installed, Lunr Languages uses it automatically for higher-quality Jieba segmentation. Browsers must support Intl.Segmenter; there is no frontend fallback.


Why Lunr Languages in an AI world?

Modern AI systems don’t replace search — they depend on it.

Before an LLM can generate an answer, it needs relevant context. That’s where Lunr Languages fits:

🔎 Fast and consistent lexical retrieval

Filter thousands of documents down to a small candidate set before embedding or reranking.

🌍 Multilingual support out of the box

Tokenization, stemming, and stopwords for 30+ languages — still a hard problem in AI pipelines.

⚡ Zero infrastructure

Runs entirely in the browser or Node.js. No vector DB required.

🔒 Privacy-friendly / offline-ready

Perfect for:

  • in-browser AI assistants
  • local knowledge bases
  • on-device search

Example: Hybrid Search (Keyword + AI)

User query
→ Lunr (keyword search, multilingual)
→ top 100–500 documents
→ embeddings / reranker
→ LLM generates answer

Lunr Languages improves recall and precision, especially for:

  • non-English content
  • inflected languages
  • mixed-language datasets

Installation

npm install lunr-languages

Usage

Basic example (German)

const lunr = require('lunr');

// lunr.stemmer.support is required by all language plugins
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.de')(lunr);

const idx = lunr(function () {
  // apply the German tokenizer, stemmer and stopword filter
  this.use(lunr.de);

  this.field('title', { boost: 10 });
  this.field('body');

  this.add({ title: 'Dokument', body: 'Beispieltext' });
});

Multi-language indexing

// every language passed to lunr.multiLanguage must be loaded first
require('lunr-languages/lunr.stemmer.support')(lunr);
require('lunr-languages/lunr.ru')(lunr);
require('lunr-languages/lunr.de')(lunr);
require('lunr-languages/lunr.multi')(lunr);

const idx = lunr(function () {
  // combines the English, Russian and German pipelines
  this.use(lunr.multiLanguage('en', 'ru', 'de'));

  this.field('title');
  this.field('body');
});

Chinese Tokenization

Chinese support is designed to work without mandatory native binaries:

  • In browsers, lunr.zh uses Intl.Segmenter plus CJK bigrams. If Intl.Segmenter is unavailable, tokenization logs an error and throws; there is no bundled browser fallback.
  • In Node.js, lunr.zh first tries to load @node-rs/jieba. If it is installed, it is used for higher-quality Chinese segmentation; if not, Lunr Languages logs an informational message and falls back to Intl.Segmenter plus CJK bigrams.
  • If neither @node-rs/jieba nor Intl.Segmenter is available in Node.js, Chinese tokenization logs an error and throws.

The Intl.Segmenter fallback avoids native package supply-chain risk and works well for lightweight search, but it is not identical to Jieba. Bigrams improve recall for common two-character search terms such as 车主 and 学姐, while Jieba generally provides better precision and ranking for serious Chinese search.

To opt into Jieba tokenization in Node.js:

npm install @node-rs/jieba
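
To see why bigrams help recall, here is a minimal sketch of the bigram idea in plain JavaScript. `cjkBigrams` is a hypothetical helper for illustration, not the library's internal code: a run of CJK characters is expanded into overlapping two-character tokens, so a two-character query like 车主 matches a document containing 车主服务.

```javascript
// Expand a run of CJK characters into overlapping two-character tokens.
function cjkBigrams(run) {
  const chars = Array.from(run); // code-point-safe split
  if (chars.length <= 1) return chars;
  const tokens = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    tokens.push(chars[i] + chars[i + 1]);
  }
  return tokens;
}

cjkBigrams('车主服务'); // → ['车主', '主服', '服务']
```

Because both the indexed text and the query are expanded the same way, the query bigram 车主 hits the first token of 车主服务 without any dictionary; Jieba's dictionary-based segmentation instead emits whole words, which is what gives it better precision.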

Where this fits in modern architectures

Lunr Languages is commonly used as:

  • Pre-filter for vector search
  • Fallback when embeddings fail
  • Client-side retrieval for AI apps
  • Static / documentation search

👉 In practice, hybrid search (keyword + vector) performs best.


How it works

To provide high-quality search across languages, Lunr Languages applies four text-processing steps:

  • Tokenization — language-aware splitting (including Japanese, Chinese, etc.)
  • Stemming — matches different word forms
  • Stopword filtering — removes noise
  • Trimming — normalizes tokens

These steps improve both classic search and AI retrieval pipelines.
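
As a toy illustration of how the four stages compose (hypothetical helpers for demonstration, not lunr-languages internals — the real plugins register Snowball stemmers and per-language stopword lists in Lunr's pipeline):

```javascript
// Tiny stand-ins for each stage, applied in pipeline order.
const STOPWORDS = new Set(['the', 'a', 'of']);

const tokenize = text => text.split(/\s+/);                      // split into tokens
const trim = token => token.toLowerCase().replace(/^\W+|\W+$/g, ''); // normalize
const notStopword = token => !STOPWORDS.has(token);              // drop noise words
const stem = token => token.replace(/(ing|s)$/, '');             // naive English stemmer

const tokens = tokenize('The cars, racing!')
  .map(trim)
  .filter(notStopword)
  .map(stem);
// → ['car', 'rac']
```

After stemming, 'cars' and 'racing' index under the same forms a query for 'car' or 'race'-like terms would reduce to, which is how different word forms end up matching each other.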


When to use Lunr Languages vs vector search

Use Lunr Languages when you need:

  • fast, deterministic keyword matching
  • multilingual normalization
  • offline / browser-based search
  • low-cost retrieval

Combine with embeddings for:

  • semantic similarity
  • fuzzy concept matching

Contributing

Want to add a new language?

See CONTRIBUTING.md


Support / Sponsorship

Maintained as an open-source project for over a decade.

If your company relies on this in production:

  • consider sponsoring
  • or contributing improvements

It helps keep the ecosystem stable.


Final note

Even in an AI-first world, retrieval is the bottleneck.

Lunr Languages ensures the right content reaches your models — fast, locally, and across languages.
