feat: libscope/lite — embeddable semantic search library #451

@RobertLD

Problem Statement

libscope is currently only usable as a CLI tool or MCP server. There is no way to embed it as a library into another application. Concrete use case: a Bitbucket MCP server wants semantic search over repository files, Jira tickets, and Confluence pages — it should be able to call libscope.index(docs) and libscope.search(query) directly without spawning a subprocess or running an HTTP server.

Additionally, there is no support for indexing source code files. The current chunker is markdown-heading-aware, which works well for documentation but produces poor-quality chunks for code. A code-aware chunker splitting at function/class boundaries is needed.

Goals

  • New LibScopeLite class exported from libscope/lite
  • Caller-controlled persistence: :memory: default, file path for cross-session reuse
  • All existing embedding providers supported: local (xenova), openai, ollama
  • index(doc) — accepts pre-normalized text/markdown; indexRaw(doc) accepts raw input types (HTML, PDF, DOCX, etc.)
  • search(query) — hybrid vector+FTS search, same engine as full libscope
  • getContext(question) — RAG context retrieval without synthesis (for embedding in external LLMs)
  • ask(question) — RAG with passthrough mode (returns context for external LLM to synthesize)
  • askStream(question) — streaming RAG responses
  • indexBatch(docs, { concurrency }) — batch indexing with concurrency control
  • rate(docId, score) — record feedback signal
  • Code indexing: tree-sitter AST chunking for TypeScript, JavaScript, Python, Go
  • Input normalization: parser layer converts HTML/PDF/DOCX/plaintext → markdown before indexing
  • Complete user docs and developer docs updates

Non-Goals

  • No CLI commands for lite mode (use full libscope for that)
  • No MCP server entrypoint (future enhancement)
  • No bundled connectors (Confluence, Notion, etc.) — caller supplies documents
  • No topics, packs, webhooks, or registry in lite mode

Public API Design

```typescript
// libscope/lite
import { LibScopeLite } from 'libscope/lite';

// --- Construction ---
const scope = new LibScopeLite({
  provider: 'local' | 'openai' | 'ollama', // embedding provider
  apiKey?: string, // for openai
  ollamaUrl?: string, // for ollama
  ollamaModel?: string,
  dbPath?: string, // default: ':memory:'
  llm?: { // for ask() — optional
    provider: 'openai' | 'ollama' | 'anthropic' | 'passthrough',
    apiKey?: string,
    model?: string,
  },
});

// Must call init() before use (async provider/DB setup)
await scope.init();

// --- Indexing ---
await scope.index({
  title: string,
  content: string, // pre-normalized text/markdown
  url?: string,
  sourceType?: string, // defaults to 'manual'
  language?: string, // hint for code chunking
});

// Or provide raw content + type hint
await scope.indexRaw({
  title: string,
  content: Buffer | string, // raw HTML, PDF, DOCX, source code, etc.
  contentType: 'html' | 'pdf' | 'docx' | 'markdown' | 'plaintext' | 'code',
  language?: string, // for contentType: 'code'
  url?: string,
  sourceType?: string,
});

// Batch indexing with concurrency control
await scope.indexBatch(files.map(f => ({
  title: f.path,
  content: f.content,
  contentType: 'code',
  language: 'typescript',
})), { concurrency: 4 });

// --- Search ---
const results = await scope.search('how does auth work?', {
  limit?: number, // default 10
  sourceType?: string,
  minRating?: number,
});

// --- RAG: Context Retrieval ---
const { context, sources } = await scope.getContext('what does auth depend on?', {
  topK?: number, // chunks to retrieve, default 5
});
// Use context in your own LLM prompt

// --- RAG: Full Answer ---
const answer = await scope.ask('What does the deploy process look like?', {
  topK?: number,
});
// returns { answer: string, sources: RagSource[], model: string }

// --- RAG: Streaming ---
for await (const event of scope.askStream('How does...?')) {
  if (event.token) console.log(event.token);
  if (event.done) console.log('Sources:', event.sources);
}

// --- Feedback ---
scope.rate(docId: string, score: 1 | -1): void;

// --- Lifecycle ---
scope.close(): void;
```
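The block above mixes values and type annotations at the call site, so it reads as a signature summary rather than compilable TypeScript. As a rough sketch of how the same shapes might be declared in `src/lite/types.ts`: the field names and defaults come from this issue, while the interface names (`LiteOptions`, `LiteLlmOptions`, `LiteSearchOptions`) and the two helper functions are assumptions made for illustration.

```typescript
// Hypothetical declarations: field names and defaults follow the API sketch
// in this issue; the interface and helper names are assumptions.
type EmbeddingProvider = 'local' | 'openai' | 'ollama';
type LlmProvider = 'openai' | 'ollama' | 'anthropic' | 'passthrough';

interface LiteLlmOptions {
  provider: LlmProvider;
  apiKey?: string;
  model?: string;
}

interface LiteOptions {
  provider: EmbeddingProvider;
  apiKey?: string;      // for openai
  ollamaUrl?: string;   // for ollama
  ollamaModel?: string;
  dbPath?: string;      // default ':memory:'
  llm?: LiteLlmOptions; // only needed for ask()
}

interface LiteSearchOptions {
  limit?: number;       // default 10
  sourceType?: string;
  minRating?: number;
}

// Apply the documented dbPath default without mutating the caller's object.
function withDefaults(options: LiteOptions): LiteOptions & { dbPath: string } {
  return { ...options, dbPath: options.dbPath ?? ':memory:' };
}

// Apply the documented search limit default.
function searchLimit(options: LiteSearchOptions = {}): number {
  return options.limit ?? 10;
}
```

Declaring the defaults in one place like this would keep the constructor and `search()` consistent with the comments in the API sketch.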

Implementation Plan

Phase 1: Core lite entrypoint (new files)

New file: `src/lite/index.ts` — the public export
New file: `src/lite/core.ts` — LibScopeLite class
New file: `src/lite/types.ts` — exported types

The entire implementation delegates to existing core functions from `src/core/`. LibScopeLite is a thin, dependency-injection-friendly wrapper around indexing, search, and RAG functions.

Phase 2: Input normalization layer

New file: `src/lite/normalize.ts` — normalizeInput(input: LiteRawIndexInput): string

Dispatches to existing parsers by contentType:

  • `'html'` → HtmlParser.parse(Buffer)
  • `'pdf'` → PdfParser.parse(Buffer)
  • `'docx'` → WordParser.parse(Buffer)
  • `'markdown'` | `'plaintext'` → UTF-8 decode, pass through
  • `'code'` → CodeParser (new; see Phase 3)
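The dispatch above could look roughly like the following. This is a sketch under assumptions: the real parser signatures in `src/core/` are not shown in this issue, so the parsers are injected through a hypothetical `ParserRegistry`, and `Uint8Array` stands in for `Buffer` (Node's `Buffer` is a `Uint8Array` subclass).

```typescript
// Sketch only: parser signatures are assumed, not taken from the libscope source.
type ContentType = 'html' | 'pdf' | 'docx' | 'markdown' | 'plaintext' | 'code';

interface LiteRawIndexInput {
  title: string;
  content: Uint8Array | string;
  contentType: ContentType;
  language?: string;
}

// Hypothetical parser shape: raw input in, markdown out.
type Parse = (content: Uint8Array | string, language?: string) => string;
type ParserRegistry = Record<'html' | 'pdf' | 'docx' | 'code', Parse>;

// Decode bytes as UTF-8; strings pass through unchanged.
function toText(content: Uint8Array | string): string {
  return typeof content === 'string' ? content : new TextDecoder('utf-8').decode(content);
}

function normalizeInput(input: LiteRawIndexInput, parsers: ParserRegistry): string {
  switch (input.contentType) {
    case 'markdown':
    case 'plaintext':
      return toText(input.content);
    case 'code':
      return parsers.code(input.content, input.language);
    case 'html':
    case 'pdf':
    case 'docx':
      return parsers[input.contentType](input.content);
    default: {
      // Compile-time exhaustiveness check for new content types.
      const unreachable: never = input.contentType;
      throw new Error(`Unsupported contentType: ${String(unreachable)}`);
    }
  }
}

// Stub parsers for illustration; the real ones would wrap HtmlParser, PdfParser, etc.
const stubParsers: ParserRegistry = {
  html: (c) => toText(c).replace(/<[^>]+>/g, ''),
  pdf: () => '(pdf text)',
  docx: () => '(docx text)',
  code: (c, lang) => `## ${lang ?? 'code'}\n\n${toText(c)}`,
};
```

Passing the parsers in as an argument is only a convenience for the sketch; the planned `normalizeInput(input)` takes a single argument and would import the parsers directly.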

Phase 3: Code-aware chunking (tree-sitter)

New file: `src/core/parsers/code.ts` — CodeParser implementing DocumentParser

Converts source code to an annotated markdown format that the existing `chunkContent()` function can split at heading boundaries. Tree-sitter is an optional peer dependency; when it is missing, the error is handled gracefully with a clear message.
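One common way to treat a package as an optional peer dependency is to load it lazily with a dynamic `import()` and translate a failure into an actionable error. The helper below is a minimal sketch of that pattern; `loadOptionalPeer` is a hypothetical name, and the npm package name passed to it (for example `'tree-sitter'`) is an assumption, since this issue does not say which binding will be used.

```typescript
// Hypothetical helper: lazily load an optional peer dependency and fail with
// an actionable message when it is not installed.
async function loadOptionalPeer(name: string, feature: string): Promise<unknown> {
  try {
    return await import(name);
  } catch {
    // Note: this also hides load failures other than "not installed"
    // (for example a broken native binding); a real implementation may
    // want to inspect the error before rewriting it.
    throw new Error(
      `${feature} requires the optional peer dependency "${name}". ` +
        `Install it with: npm install ${name}`,
    );
  }
}

// Possible call site inside CodeParser (package name is an assumption):
//   const treeSitter = await loadOptionalPeer('tree-sitter', 'Code indexing');
```

Loading at call time rather than at module top level means applications that never index code would not need the dependency installed.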

New file: `src/core/chunk-code.ts` — chunkCode(source: string, language: string): string

Takes raw source code, returns annotated markdown with headings per symbol.

Supported languages: TypeScript, JavaScript, Python, Go (v1); Rust, Java, C++ (future)
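To make "annotated markdown with headings per symbol" concrete, here is a deliberately simplified stand-in for `chunkCode()`. It replaces the tree-sitter AST pass with a line-anchored regex so the example stays self-contained, and it only recognizes top-level `function` and `class` declarations. The heading format and the fenced layout are assumptions, not the final output format.

```typescript
// Stand-in for the tree-sitter pass: a regex finds top-level declarations.
// Heading format and fence layout are assumptions for illustration.
const DECLARATION = /^(?:export\s+)?(?:async\s+)?(function|class)\s+([A-Za-z_$][\w$]*)/;

interface Section {
  heading: string;
  body: string[];
}

function hasContent(section: Section): boolean {
  return section.body.some((line) => line.trim() !== '');
}

function chunkCode(source: string, language: string): string {
  const sections: Section[] = [];
  let current: Section = { heading: '## (module preamble)', body: [] };

  for (const line of source.split('\n')) {
    const match = DECLARATION.exec(line);
    if (match) {
      // A new symbol starts: flush the previous section if it holds any code.
      if (hasContent(current)) sections.push(current);
      current = { heading: `## ${match[1]} ${match[2]}`, body: [] };
    }
    current.body.push(line);
  }
  if (hasContent(current)) sections.push(current);

  // Build the fence marker programmatically to keep this example readable.
  const fence = '`'.repeat(3);
  return sections
    .map((s) => `${s.heading}\n\n${fence}${language}\n${s.body.join('\n').trimEnd()}\n${fence}`)
    .join('\n\n');
}
```

Because each symbol becomes its own heading, a heading-aware chunker such as `chunkContent()` can then split the result at function and class boundaries without knowing anything about the source language.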

Phase 4: package.json exports update

Add to `package.json` exports field:
```json
"./lite": {
  "import": "./dist/lite/index.js",
  "require": "./dist/lite/index.js"
}
```

Phase 5: Tests

  • `tests/unit/lite.test.ts` — LibScopeLite init, index, search, rate
  • `tests/unit/code-chunker.test.ts` — chunkCode() for TypeScript, Python, fallback
  • `tests/integration/lite-embed.test.ts` — full index → search → ask workflow

Phase 6: Documentation updates

  • `docs/lite.md` — User guide: installation, quick start, API reference, code indexing guide
  • `docs/code-indexing.md` — How code indexing works, supported languages, extending
  • `src/lite/README.md` — Module overview, design decisions
  • `CLAUDE.md` — Update project structure to include lite module

Critical Files

| File | Action | Notes |
|---|---|---|
| `src/lite/index.ts` | NEW | public export |
| `src/lite/core.ts` | NEW | LibScopeLite class |
| `src/lite/types.ts` | NEW | exported types |
| `src/lite/normalize.ts` | NEW | input normalization dispatch |
| `src/core/parsers/code.ts` | NEW | CodeParser (tree-sitter → annotated markdown) |
| `src/core/chunk-code.ts` | NEW | chunkCode() helper |
| `src/core/indexing.ts` | REUSE | indexDocument() called directly |
| `src/core/search.ts` | REUSE | searchDocuments() called directly |
| `src/core/rag.ts` | REUSE | askQuestion(), askQuestionStream() |
| `package.json` | MODIFY | add `./lite` export |
| `docs/lite.md` | NEW | user docs |
| `tests/unit/lite.test.ts` | NEW | |
| `tests/integration/lite-embed.test.ts` | NEW | |

Primary Use Case: Bitbucket MCP PR Enhancement

The first concrete embedding target is a Bitbucket MCP server that enhances PR review quality while reducing token usage:

  1. On repo connect: `indexBatch(repoFiles, { concurrency: 4 })` with persistent DB
  2. On PR review: `getContext("what does the changed auth module depend on?")` → returns top-k chunks
  3. MCP injects those chunks into its own LLM prompt — no second LLM call to libscope
  4. Instead of sending 50 files of context, the MCP sends 5 highly-relevant chunks
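Step 3 of the flow above, where the MCP injects retrieved chunks into its own prompt, might be sketched as follows. The `{ context, sources }` shape comes from the `getContext()` example in this issue; the fields of `RagSource` and the prompt layout are assumptions.

```typescript
// Shapes assumed from the API sketch: getContext() resolves to { context, sources }.
// The RagSource fields below are hypothetical.
interface RagSource {
  title: string;
  url?: string;
}

interface ContextResult {
  context: string;
  sources: RagSource[];
}

// Assemble the MCP's own review prompt from the PR diff and the retrieved chunks.
function buildReviewPrompt(diff: string, retrieved: ContextResult): string {
  const sourceList = retrieved.sources
    .map((s, i) => `[${i + 1}] ${s.title}${s.url ? ` (${s.url})` : ''}`)
    .join('\n');

  return [
    'You are reviewing a pull request.',
    '## Relevant repository context',
    retrieved.context,
    '## Sources',
    sourceList,
    '## Diff',
    diff,
  ].join('\n\n');
}

// Possible call site in the MCP server:
//   const retrieved = await scope.getContext('what does the changed auth module depend on?');
//   const prompt = buildReviewPrompt(prDiff, retrieved);
```

The prompt size is then bounded by `topK` rather than by the number of files in the repository.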

Resolved Design Decisions

  1. Tree-sitter: optional peer dep — clear error message if not installed
  2. Separate getContext() method — for embedding into external LLMs (primary use case)
  3. askStream() included in v1 — streaming is useful for agent-to-agent communication
  4. indexBatch() included in v1 — required for the Bitbucket repo indexing use case
  5. Persistent SQLite is the practical choice for real workloads: `dbPath` still defaults to `:memory:`, so callers should pass a `dbPath` pointing to a project-local file
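For decision 4, the `{ concurrency }` option could be honored with a small worker-pool helper like the one below. This is a generic sketch, not the planned implementation: `mapWithConcurrency` is a hypothetical name, and how `indexBatch()` reports per-document failures is not specified in this issue.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `concurrency`
// calls in flight, preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  concurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const lanes = Math.max(1, Math.min(concurrency, items.length));
  await Promise.all(Array.from({ length: lanes }, () => lane()));
  return results;
}

// Possible use inside indexBatch (assumed shape):
//   await mapWithConcurrency(docs, options.concurrency, (doc) => this.indexRaw(doc));
```

In this sketch the first rejected call rejects the whole batch while other in-flight calls run to completion; collecting per-document errors instead would be a separate design choice.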

Verification

  1. `npm run typecheck` — no new errors
  2. `npm run lint` — no new errors
  3. `npm test` — all new tests pass including lite integration test
  4. Manual smoke test with local provider
  5. Code indexing smoke test with TypeScript file
  6. SonarCloud: check duplication density on new files

Labels: enhancement (New feature or request)