feat: libscope/lite — embeddable semantic search library (#451)#452
Introduces `LibScopeLite`, a lightweight embeddable class exported from `libscope/lite` for use in external applications (e.g. Bitbucket MCP) without spawning a subprocess or HTTP server.

New files:
- src/lite/index.ts: public entrypoint, exports LibScopeLite + types
- src/lite/core.ts: LibScopeLite class implementation
- src/lite/types.ts: LiteDoc, SearchResult, ContextOptions, etc.
- src/lite/normalize.ts: input normalization (HTML/PDF/DOCX/plaintext → markdown)
- src/lite/chunker-treesitter.ts: tree-sitter code chunker (TS/JS/Python/Go)
- tests/unit/lite.test.ts: 21 unit tests for LibScopeLite
- tests/unit/code-chunker.test.ts: 18 unit tests for tree-sitter chunker
- tests/integration/lite-embed.test.ts: 10 integration tests (index→search→getContext→rate)

API: index(), indexRaw(), indexBatch(), search(), getContext(), ask(), askStream(), rate(), close().

Tree-sitter is an optional peer dependency with graceful error messaging if not installed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Four issues found during pre-push validation:

1. MockEmbeddingProvider.hashToVector: hash could collapse to 0 for long content (e.g. 200+ chars), producing a NaN vector. sqlite-vec returns null distance for NaN vectors, causing DatabaseError in vectorSearch. Fixed: non-zero seed (5381) + djb2-style mixing + zero-mag guard.
2. LibScopeLite.core.ts: replaced inline sqlite-vec require with createDatabase() from db/connection.ts. This reuses the battle-tested extension loading path and avoids duplicating setup logic.
3. LibScopeLite.core.ts: added optional db injection to LiteOptions so tests and callers can supply a pre-configured Database instance.
4. Test lint fixes: floating promise in lite-embed, unused import and recursive ReturnType in code-chunker, unbound-method and async-without-await in lite.test.ts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
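The first fix above can be illustrated with a minimal sketch. This is not the PR's actual `hashToVector`; the function body, the default dimension count, and the per-dimension mixing constant are assumptions chosen for illustration. It only shows the three ingredients the commit names: a non-zero seed (5381), djb2-style mixing, and a zero-magnitude guard before normalization.

```typescript
// Hypothetical stand-in for MockEmbeddingProvider.hashToVector (not the PR's code).
function hashToVector(text: string, dimensions = 8): number[] {
  // djb2-style hash: start from a non-zero seed, then hash * 33 + charCode,
  // truncated to an unsigned 32-bit integer on every step.
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = (Math.imul(hash, 33) + text.charCodeAt(i)) >>> 0;
  }

  // Derive one component per dimension by re-mixing the hash with the index.
  // The multiplier is an arbitrary odd constant picked for this sketch.
  const vector: number[] = [];
  for (let d = 0; d < dimensions; d++) {
    const mixed = (Math.imul(hash ^ (d + 1), 2654435761) >>> 0) / 0xffffffff;
    vector.push(mixed * 2 - 1); // map [0, 1] to [-1, 1]
  }

  // Zero-magnitude guard: normalizing a zero vector divides 0 by 0 and yields NaN.
  const magnitude = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  if (magnitude === 0) {
    vector[0] = 1;
    return vector;
  }
  return vector.map((x) => x / magnitude);
}
```

Calling `hashToVector("a".repeat(500))` should return a finite, unit-length vector, which is the property the original bug violated for long inputs.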
…lete Resolves @typescript-eslint/unbound-method lint error — accessing a method as an unbound property before passing to vi.mocked() is flagged. Using vi.mocked(obj).method keeps the reference bound to its object. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- S7735 (core.ts:32): flip negated condition; use === undefined, not !==
- S7721 (lite.test.ts:222): move fakeStream generator to module scope
- S2933 (chunker-treesitter.ts:95): mark grammarCache as readonly
- S3776 (chunker-treesitter.ts:164): reduce extractChunks complexity 25→7 by extracting flushDeclaration() helper
- S3776 (chunker-treesitter.ts:232): reduce splitLargeNode complexity 22→4 by extracting accumulateNamedChildren() helper

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
vi.mocked(mockLlm).complete passes the method reference unbound. Switch to vi.mocked(mockLlm.complete) — accessing the property on the object before passing to vi.mocked avoids the ESLint rule. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Assigning vi.fn() to a variable before placing it in the LlmProvider object means we never reference completeSpy as mockLlm.complete — the ESLint rule only fires when a method is accessed from an object. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…xing

New VitePress pages:
- docs/guide/lite.md: full user guide covering constructor options, indexing, search, RAG (getContext/ask/askStream), code indexing, rate(), lifecycle, and the integration pattern for external MCP servers
- docs/guide/code-indexing.md: tree-sitter chunking guide covering installation, supported languages, node types, preamble accumulation, large-node splitting, caching, error handling, and a full directory-indexing example
- docs/reference/lite-api.md: complete TypeScript API reference for LibScopeLite and TreeSitterChunker with all types, options, and examples

Updated existing docs:
- docs/.vitepress/config.ts: add LibScope Lite, Code Indexing (sidebar Integrations) and LibScope Lite API (sidebar Reference)
- docs/guide/architecture.md: add lite/ to system layers diagram, module map, and LibScope Lite layer section
- docs/guide/programmatic-usage.md: tip callout pointing to libscope/lite
- CLAUDE.md: add src/lite/ to project structure
- README.md: add LibScopeLite and TreeSitterChunker to SDK section with examples

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
RobertLD
left a comment
Reviewed the libscope/lite implementation. Four critical bugs found — three in production code, one off-by-one in line number output.
    if (!doc) break;
    idx++;
    activeCount++;
    void this.index([doc]).finally(() => {
Bug: indexBatch silently discards errors.
void this.index([doc]).finally(...) drops any rejection from this.index(). The .finally() callback receives no argument, so it cannot tell whether the preceding promise succeeded or failed. The wrapping Promise<void> has no reject path, so if any document fails to embed or index, indexBatch still resolves successfully and the failure is lost.
The fix is to give the outer Promise constructor a reject callback and call it from a rejection handler, e.g. .then(runNext, (err) => reject(err)). A .finally() callback alone cannot do this because it never sees the error. At minimum the error should be surfaced:
    await new Promise<void>((resolve, reject) => {
      const runNext = (): void => {
        while (activeCount < concurrency && idx < docs.length) {
          const doc = docs[idx];
          if (!doc) break;
          idx++;
          activeCount++;
          void this.index([doc]).then(
            () => {
              activeCount--;
              if (idx >= docs.length && activeCount === 0) resolve();
              else runNext();
            },
            (err: unknown) => reject(err),
          );
        }
      };
      runNext();
    });

        });
        if (i === 0) firstId = result.id;
      }
      return firstId;
Bug: indexRaw returns only the first chunk's document ID, and uses "" (empty string) as a fragile default.
firstId is initialised to "" and only assigned when i === 0. If indexDocument throws on the first chunk, the exception propagates out of the for loop correctly; if it throws on a later chunk, a real ID has already been assigned, so that path is fine too. The fragile part is that "" doubles as a sentinel: if normalized.chunks has length > 1 and the loop body ever completes without entering if (i === 0) (impossible today, but easy to introduce), "" is returned as though it were a valid ID. More importantly, the API contract says indexRaw returns the document ID of the newly created document, yet when multiple chunks are indexed each one gets its own document ID. Only the first is returned; the others are unreachable by the caller. For multi-chunk files this is effectively data loss: the caller cannot rate, retrieve, or delete the documents for chunks 2–N.
Consider either (a) returning all IDs as string[], or (b) indexing all chunks as a single document with pre-split content.
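Option (a) could look roughly like the sketch below. The names `indexAllChunks` and `IndexChunk` are hypothetical and stand in for the loop inside indexRaw and its indexDocument call, whose real signatures are not shown here. The point is only that every ID is collected and there is no "" sentinel.

```typescript
// Hypothetical signature standing in for the PR's indexDocument call.
type IndexChunk = (chunk: string) => Promise<{ id: string }>;

// Index every chunk in order and return all resulting document IDs.
// An empty input yields an empty array rather than a placeholder ID.
async function indexAllChunks(
  chunks: string[],
  indexChunk: IndexChunk,
): Promise<string[]> {
  const ids: string[] = [];
  for (const chunk of chunks) {
    // A rejection here propagates to the caller; IDs collected so far are
    // not returned, so cleanup of partial results is left to the caller.
    const result = await indexChunk(chunk);
    ids.push(result.id);
  }
  return ids;
}
```

Returning `string[]` changes the public return type, so callers that expect a single string would need updating; option (b) avoids that at the cost of changing how chunks are stored.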
     * producing semantically meaningful chunks suitable for embedding.
     */
    export class TreeSitterChunker {
      private parserCache: TSParser | undefined;
Race condition: shared mutable parser state across concurrent chunk() calls.
parserCache holds a single TSParser instance. parser.setLanguage(grammar) on line 128 mutates the parser's active grammar. If chunk() is called concurrently for two different languages (e.g., "typescript" and "python"), the sequence can be:
- Coroutine A: getParser() → returns cached parser
- Coroutine B: getParser() → returns same cached parser
- Coroutine A: parser.setLanguage(typescriptGrammar)
- Coroutine B: parser.setLanguage(pythonGrammar) ← overwrites A's language
- Coroutine A: parser.parse(tsSource) ← parsed with Python grammar → wrong AST
This is realistic when indexBatch runs concurrency > 1 and the batch contains mixed-language files that go through indexRaw→normalizeRawInput→chunker.chunk().
The parser must either be cloned/re-created per chunk() call, or language be serialised (one active language at a time), or one parser instance be kept per language.
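The third option (one parser instance per language) might be sketched as follows. `ParserLike` and `ParserPool` are hypothetical names, and the interface is a deliberately minimal stand-in rather than the real tree-sitter Parser type. The sketch assumes that setLanguage is the only grammar-dependent state, so a parser configured once for a language never needs to be reconfigured.

```typescript
// Hypothetical minimal parser shape; the real tree-sitter Parser has a richer API.
interface ParserLike<G> {
  setLanguage(grammar: G): void;
}

// Keeps one parser per language so a setLanguage() call made for one language
// can never change the grammar used when parsing another.
class ParserPool<G, P extends ParserLike<G>> {
  private readonly parsers = new Map<string, P>();
  private readonly createParser: () => P;
  private readonly loadGrammar: (language: string) => G;

  constructor(createParser: () => P, loadGrammar: (language: string) => G) {
    this.createParser = createParser;
    this.loadGrammar = loadGrammar;
  }

  getParser(language: string): P {
    let parser = this.parsers.get(language);
    if (parser === undefined) {
      // Configure the grammar exactly once, when the parser is created.
      parser = this.createParser();
      parser.setLanguage(this.loadGrammar(language));
      this.parsers.set(language, parser);
    }
    return parser;
  }
}
```

With this shape, chunk() would ask the pool for the parser matching its language instead of calling setLanguage on a shared instance. The trade-off is one live parser per language used, which is small for the four supported languages.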
    chunks.push({
      content: preamble,
      startLine: preambleStartLine ?? startLine,
      endLine: child.startPosition.row,
Off-by-one: preamble endLine is 0-based while every other endLine in the same chunk array is 1-based.
    endLine: child.startPosition.row, // line 223, missing + 1

Every other place in this file converts row to a 1-based line number with row + 1 (lines 184, 194, 214, 244, 254, 265, 269, 280). Only the preamble chunk emitted inside the large-node path uses the raw 0-based row value. For example, if the preamble precedes a class starting at row 10 (1-based line 11), the preamble chunk reports endLine: 10 where the row + 1 convention used elsewhere would give 11, so any line-range display or navigation built on top of these chunks sees a range that is inconsistent with the other chunks.



Summary
- LibScopeLite, exported from libscope/lite, for embedding semantic search directly into external apps (e.g. Bitbucket MCP) without spawning a subprocess or HTTP server
- Tree-sitter code chunker (src/lite/chunker-treesitter.ts) supporting TypeScript, JavaScript, Python, and Go; tree-sitter is an optional peer dependency
- Input normalization (src/lite/normalize.ts) dispatching HTML/PDF/DOCX/plaintext to existing parsers

API

Test plan
- npm run format:check: clean
- npm run lint: 40 errors (baseline, no new errors)
- npm run typecheck: no new errors
- npm test: 1492/1492 passing
- npm run build: dist/lite/index.js present

Root cause note
MockEmbeddingProvider.hashToVector had a zero-hash collapse bug for long strings (200+ chars) that produced NaN vectors. sqlite-vec returns null distance for NaN vectors, causing DatabaseError in vectorSearch. Fixed with a non-zero seed and zero-magnitude guard.

Closes #451
🤖 Generated with Claude Code