# feat: libscope/lite — embeddable semantic search library #451
## Description
### Problem Statement
libscope is currently only usable as a CLI tool or MCP server. There is no way to embed it as a library into another application. Concrete use case: a Bitbucket MCP server wants semantic search over repository files, Jira tickets, and Confluence pages — it should be able to call libscope.index(docs) and libscope.search(query) directly without spawning a subprocess or running an HTTP server.
Additionally, there is no support for indexing source code files. The current chunker is markdown-heading-aware, which works well for documentation but produces poor-quality chunks for code. A code-aware chunker splitting at function/class boundaries is needed.
### Goals
- New `LibScopeLite` class exported from `libscope/lite`
- Caller-controlled persistence: `:memory:` default, file path for cross-session reuse
- All existing embedding providers supported: local (xenova), openai, ollama
- `index(docs)` — accepts pre-parsed content OR raw input types (HTML, PDF, DOCX, etc.)
- `search(query)` — hybrid vector+FTS search, same engine as full libscope
- `getContext(question)` — RAG context retrieval without synthesis (for embedding in external LLMs)
- `ask(question)` — RAG with passthrough mode (returns context for an external LLM to synthesize)
- `askStream(question)` — streaming RAG responses
- `indexBatch(docs, { concurrency })` — batch indexing with concurrency control
- `rate(docId, score)` — record feedback signal
- Code indexing: tree-sitter AST chunking for TypeScript, JavaScript, Python, Go
- Input normalization: parser layer converts HTML/PDF/DOCX/plaintext → markdown before indexing
- Complete user docs and developer docs updates
### Non-Goals
- No CLI commands for lite mode (use full libscope for that)
- No MCP server entrypoint (future enhancement)
- No bundled connectors (Confluence, Notion, etc.) — caller supplies documents
- No topics, packs, webhooks, or registry in lite mode
## Public API Design
```typescript
// libscope/lite
import { LibScopeLite } from 'libscope/lite';
// --- Construction ---
const scope = new LibScopeLite({
provider: 'local' | 'openai' | 'ollama', // embedding provider
apiKey?: string, // for openai
ollamaUrl?: string, // for ollama
ollamaModel?: string,
dbPath?: string, // default: ':memory:'
llm?: { // for ask() — optional
provider: 'openai' | 'ollama' | 'anthropic' | 'passthrough',
apiKey?: string,
model?: string,
},
});
// Must call init() before use (async provider/DB setup)
await scope.init();
// --- Indexing ---
await scope.index({
title: string,
content: string, // pre-normalized text/markdown
url?: string,
sourceType?: string, // defaults to 'manual'
language?: string, // hint for code chunking
});
// Or provide raw content + type hint
await scope.indexRaw({
title: string,
content: Buffer | string, // raw HTML, PDF, DOCX, source code, etc.
contentType: 'html' | 'pdf' | 'docx' | 'markdown' | 'plaintext' | 'code',
language?: string, // for contentType: 'code'
url?: string,
sourceType?: string,
});
// Batch indexing with concurrency control
await scope.indexBatch(files.map(f => ({
title: f.path,
content: f.content,
contentType: 'code',
language: 'typescript',
})), { concurrency: 4 });
// --- Search ---
const results = await scope.search('how does auth work?', {
limit?: number, // default 10
sourceType?: string,
minRating?: number,
});
// --- RAG: Context Retrieval ---
const { context, sources } = await scope.getContext('what does auth depend on?', {
topK?: number, // chunks to retrieve, default 5
});
// Use context in your own LLM prompt
// --- RAG: Full Answer ---
const answer = await scope.ask('What does the deploy process look like?', {
topK?: number,
});
// returns { answer: string, sources: RagSource[], model: string }
// --- RAG: Streaming ---
for await (const event of scope.askStream('How does...?')) {
if (event.token) console.log(event.token);
if (event.done) console.log('Sources:', event.sources);
}
// --- Feedback ---
scope.rate(docId: string, score: 1 | -1): void;
// --- Lifecycle ---
scope.close(): void;
```
## Implementation Plan
### Phase 1: Core lite entrypoint (new files)
New file: `src/lite/index.ts` — the public export
New file: `src/lite/core.ts` — LibScopeLite class
New file: `src/lite/types.ts` — exported types
The entire implementation delegates to existing core functions from `src/core/`. LibScopeLite is a thin, dependency-injection-friendly wrapper around indexing, search, and RAG functions.
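The wrapper pattern can be sketched as follows. This is illustrative only: the real class would call `indexDocument()` and `searchDocuments()` from `src/core/` directly, but here they are injected as plain function parameters to keep the sketch self-contained.

```typescript
// Sketch of the thin, dependency-injection-friendly wrapper (illustrative;
// the real implementation delegates to existing functions in src/core/).
type IndexFn = (doc: { title: string; content: string }) => Promise<void>;
type SearchFn = (query: string, opts: { limit: number }) => Promise<string[]>;

class LibScopeLiteSketch {
  private initialized = false;

  constructor(private indexFn: IndexFn, private searchFn: SearchFn) {}

  async init(): Promise<void> {
    // Real implementation: open the SQLite DB and warm the embedding provider.
    this.initialized = true;
  }

  async index(doc: { title: string; content: string }): Promise<void> {
    if (!this.initialized) throw new Error('call init() before use');
    return this.indexFn(doc);
  }

  async search(query: string, opts: { limit?: number } = {}): Promise<string[]> {
    if (!this.initialized) throw new Error('call init() before use');
    return this.searchFn(query, { limit: opts.limit ?? 10 });
  }
}
```

The init-guard mirrors the documented contract that `init()` must be called before use, since provider and DB setup are async and cannot happen in a constructor.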
### Phase 2: Input normalization layer
New file: `src/lite/normalize.ts` — `normalizeInput(input: LiteRawIndexInput): string`
Dispatches to existing parsers by contentType:
- `'html'` → HtmlParser.parse(Buffer)
- `'pdf'` → PdfParser.parse(Buffer)
- `'docx'` → WordParser.parse(Buffer)
- `'markdown'` | `'plaintext'` → UTF-8 decode, pass through
- `'code'` → new CodeParser
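The dispatch above can be sketched like this. The parser calls are stand-ins for the existing `HtmlParser`/`PdfParser`/`WordParser` named in the plan, passed in as a plain map so the sketch is self-contained; only the markdown/plaintext passthrough branch is a committed behavior.

```typescript
// Sketch of the contentType dispatch in normalize.ts. The `parsers` map is a
// stand-in for the real parser classes (HtmlParser, PdfParser, WordParser, CodeParser).
type LiteContentType = 'html' | 'pdf' | 'docx' | 'markdown' | 'plaintext' | 'code';

interface LiteRawIndexInput {
  content: Buffer | string;
  contentType: LiteContentType;
  language?: string;
}

function normalizeInput(
  input: LiteRawIndexInput,
  parsers: Record<'html' | 'pdf' | 'docx' | 'code', (buf: Buffer) => string>,
): string {
  const asBuffer =
    typeof input.content === 'string' ? Buffer.from(input.content, 'utf-8') : input.content;
  switch (input.contentType) {
    case 'html':
    case 'pdf':
    case 'docx':
      return parsers[input.contentType](asBuffer); // delegate to existing parsers
    case 'markdown':
    case 'plaintext':
      return asBuffer.toString('utf-8'); // UTF-8 decode, pass through unchanged
    case 'code':
      return parsers.code(asBuffer); // CodeParser → annotated markdown
  }
}
```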
### Phase 3: Code-aware chunking (tree-sitter)
New file: `src/core/parsers/code.ts` — CodeParser implementing DocumentParser
Converts source code to an annotated markdown format that the existing `chunkContent()` function can split at heading boundaries. Tree-sitter is an optional peer dependency with graceful error handling.
New file: `src/core/chunk-code.ts` — `chunkCode(source: string, language: string): string`
Takes raw source code, returns annotated markdown with one heading per symbol.
Supported languages: TypeScript, JavaScript, Python, Go initially; Rust, Java, C++ in a future iteration.
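To illustrate the annotated-markdown contract, here is a deliberately simplified regex-based chunker. The real implementation would walk a tree-sitter AST rather than match lines, and the exact heading format is an assumption, but the output shape — one markdown heading per top-level symbol, original code preserved below it — is the point.

```typescript
// Simplified illustration of the chunkCode() contract: emit one markdown heading
// per top-level function/class so the existing chunkContent() can split at headings.
// The real implementation uses tree-sitter; this regex version is illustrative only.
function chunkCodeSketch(source: string, language: string): string {
  const symbolPattern = /^(?:export\s+)?(?:async\s+)?(?:function|class)\s+(\w+)/;
  const out: string[] = [];
  for (const line of source.split('\n')) {
    const m = line.match(symbolPattern);
    if (m) out.push(`## ${m[1]} (${language})`); // hypothetical heading format
    out.push(line); // keep the original source beneath its heading
  }
  return out.join('\n');
}
```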
### Phase 4: package.json exports update
Add to `package.json` exports field:
```json
"./lite": {
"import": "./dist/lite/index.js",
"require": "./dist/lite/index.js"
}
```
### Phase 5: Tests
- `tests/unit/lite.test.ts` — LibScopeLite init, index, search, rate
- `tests/unit/code-chunker.test.ts` — chunkCode() for TypeScript, Python, fallback
- `tests/integration/lite-embed.test.ts` — full index → search → ask workflow
### Phase 6: Documentation updates
- `docs/lite.md` — User guide: installation, quick start, API reference, code indexing guide
- `docs/code-indexing.md` — How code indexing works, supported languages, extending
- `src/lite/README.md` — Module overview, design decisions
- `CLAUDE.md` — Update project structure to include lite module
## Critical Files
| File | Action |
|---|---|
| `src/lite/index.ts` | NEW — public export |
| `src/lite/core.ts` | NEW — LibScopeLite class |
| `src/lite/types.ts` | NEW — exported types |
| `src/lite/normalize.ts` | NEW — input normalization dispatch |
| `src/core/parsers/code.ts` | NEW — CodeParser (tree-sitter → annotated markdown) |
| `src/core/chunk-code.ts` | NEW — chunkCode() helper |
| `src/core/indexing.ts` | REUSE — indexDocument() called directly |
| `src/core/search.ts` | REUSE — searchDocuments() called directly |
| `src/core/rag.ts` | REUSE — askQuestion(), askQuestionStream() |
| `package.json` | MODIFY — add `./lite` export |
| `docs/lite.md` | NEW — user docs |
| `tests/unit/lite.test.ts` | NEW |
| `tests/integration/lite-embed.test.ts` | NEW |
## Primary Use Case: Bitbucket MCP PR Enhancement
The first concrete embedding target is a Bitbucket MCP server that enhances PR review quality while reducing token usage:
- On repo connect: `indexBatch(repoFiles, { concurrency: 4 })` with persistent DB
- On PR review: `getContext("what does the changed auth module depend on?")` → returns top-k chunks
- MCP injects those chunks into its own LLM prompt — no second LLM call to libscope
- Instead of sending 50 files of context, the MCP sends 5 highly-relevant chunks
## Resolved Design Decisions
- Tree-sitter: optional peer dep — clear error message if not installed
- Separate getContext() method — for embedding into external LLMs (primary use case)
- askStream() included in v1 — streaming is useful for agent-to-agent communication
- indexBatch() included in v1 — required for the Bitbucket repo indexing use case
- Persistent SQLite is the practical default in embedding scenarios — although `dbPath` defaults to `:memory:`, callers should pass a project-local file path to get cross-session reuse
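The concurrency control behind `indexBatch(docs, { concurrency })` can be implemented with a simple worker-pool pattern. This sketch is illustrative, not the committed implementation: a fixed number of workers each pull the next unclaimed item until the list is drained.

```typescript
// Illustrative worker pool for indexBatch({ concurrency }): at most `concurrency`
// index operations are in flight at any moment.
async function indexBatchSketch<T>(
  items: T[],
  indexOne: (item: T) => Promise<void>,
  opts: { concurrency: number },
): Promise<void> {
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next item (safe: JS event loop is single-threaded)
      await indexOne(items[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(opts.concurrency, items.length) },
    () => worker(),
  );
  await Promise.all(workers);
}
```

Because items are claimed one at a time, a slow document (say, a large PDF) only stalls one worker instead of an entire fixed-size batch.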
## Verification
- `npm run typecheck` — no new errors
- `npm run lint` — no new errors
- `npm test` — all new tests pass including lite integration test
- Manual smoke test with local provider
- Code indexing smoke test with TypeScript file
- SonarCloud: check duplication density on new files