
feat: libscope/lite — embeddable semantic search library (#451) #452

Merged
RobertLD merged 7 commits into main from feat/libscope-lite-451
Mar 19, 2026

Conversation

@RobertLD
Owner

Summary

  • Introduces LibScopeLite, exported from libscope/lite, for embedding semantic search directly into external apps (e.g. Bitbucket MCP) without spawning a subprocess or HTTP server
  • Adds tree-sitter code-aware chunker (src/lite/chunker-treesitter.ts) supporting TypeScript, JavaScript, Python, and Go — tree-sitter is an optional peer dependency
  • Adds input normalization layer (src/lite/normalize.ts) dispatching HTML/PDF/DOCX/plaintext to existing parsers
  • 49 new tests (18 unit/chunker, 21 unit/lite, 10 integration) — 1492 total passing

API

const lite = new LibScopeLite({ dbPath: ':memory:', provider })
await lite.indexBatch(repoFiles, { concurrency: 4 })
const context = await lite.getContext('How does auth work?')

Test plan

  • npm run format:check — clean
  • npm run lint — 40 errors (baseline, no new errors)
  • npm run typecheck — no new errors
  • npm test — 1492/1492 passing
  • npm run build — dist/lite/index.js present
  • SonarCloud quality gate (CI)

Root cause note

MockEmbeddingProvider.hashToVector had a zero-hash collapse bug for long strings (200+ chars) that produced NaN vectors. sqlite-vec returns null distance for NaN vectors, causing DatabaseError in vectorSearch. Fixed with a non-zero seed and zero-magnitude guard.
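
The shape of that fix can be sketched as follows. This is a minimal illustration, not the project's MockEmbeddingProvider: the function body, the default of 8 dimensions, and the per-dimension mixing step are assumptions; only the seed 5381, the djb2-style mixing, and the zero-magnitude guard come from the fix description.

```typescript
// Deterministic mock embedding: hash the text, expand the hash into `dims`
// components, then L2-normalise. Hypothetical sketch, not the library code.
function hashToVector(text: string, dims = 8): number[] {
  let h = 5381; // non-zero djb2 seed, so the hash does not start at 0
  for (let i = 0; i < text.length; i++) {
    // djb2-style mixing, kept in unsigned 32-bit range
    h = (Math.imul(h, 33) ^ text.charCodeAt(i)) >>> 0;
  }

  const v: number[] = [];
  for (let d = 0; d < dims; d++) {
    h = (Math.imul(h, 33) ^ (d + 1)) >>> 0; // derive one value per dimension
    v.push((h / 0xffffffff) * 2 - 1); // map to [-1, 1]
  }

  const mag = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (mag === 0) {
    // Defensive guard: dividing a zero vector by its magnitude gives NaN (0/0).
    const unit: number[] = new Array<number>(dims).fill(0);
    unit[0] = 1;
    return unit;
  }
  return v.map((x) => x / mag);
}
```

The guard matters because a single NaN component is enough for the vector store to return a null distance, which is the failure described above.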

Closes #451

🤖 Generated with Claude Code

RobertLD and others added 2 commits March 19, 2026 16:38
Introduces `LibScopeLite`, a lightweight embeddable class exported from
`libscope/lite` for use in external applications (e.g. Bitbucket MCP)
without spawning a subprocess or HTTP server.

New files:
- src/lite/index.ts — public entrypoint, exports LibScopeLite + types
- src/lite/core.ts — LibScopeLite class implementation
- src/lite/types.ts — LiteDoc, SearchResult, ContextOptions, etc.
- src/lite/normalize.ts — input normalization (HTML/PDF/DOCX/plaintext → markdown)
- src/lite/chunker-treesitter.ts — tree-sitter code chunker (TS/JS/Python/Go)
- tests/unit/lite.test.ts — 21 unit tests for LibScopeLite
- tests/unit/code-chunker.test.ts — 18 unit tests for tree-sitter chunker
- tests/integration/lite-embed.test.ts — 10 integration tests (index→search→getContext→rate)

API: index(), indexRaw(), indexBatch(), search(), getContext(), ask(),
askStream(), rate(), close(). Tree-sitter is an optional peer dependency
with graceful error messaging if not installed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Four issues found during pre-push validation:

1. MockEmbeddingProvider.hashToVector: hash could collapse to 0 for long
   content (e.g. 200+ chars), producing a NaN vector. sqlite-vec returns
   null distance for NaN vectors, causing DatabaseError in vectorSearch.
   Fixed: non-zero seed (5381) + djb2-style mixing + zero-mag guard.

2. LibScopeLite.core.ts: replaced inline sqlite-vec require with
   createDatabase() from db/connection.ts — reuses the battle-tested
   extension loading path and avoids duplicating setup logic.

3. LibScopeLite.core.ts: added optional db injection to LiteOptions so
   tests and callers can supply a pre-configured Database instance.

4. Test lint fixes: floating promise in lite-embed, unused import and
   recursive ReturnType in code-chunker, unbound-method and async-without-
   await in lite.test.ts.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@vercel

vercel bot commented Mar 19, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
libscope Ignored Ignored Preview Mar 19, 2026 5:29pm

RobertLD and others added 4 commits March 19, 2026 17:04
…lete

Resolves @typescript-eslint/unbound-method lint error — accessing a
method as an unbound property before passing to vi.mocked() is flagged.
Using vi.mocked(obj).method keeps the reference bound to its object.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- S7735 (core.ts:32): flip negated condition — use === undefined, not !==
- S7721 (lite.test.ts:222): move fakeStream generator to module scope
- S2933 (chunker-treesitter.ts:95): mark grammarCache as readonly
- S3776 (chunker-treesitter.ts:164): reduce extractChunks complexity 25→7
  by extracting flushDeclaration() helper
- S3776 (chunker-treesitter.ts:232): reduce splitLargeNode complexity 22→4
  by extracting accumulateNamedChildren() helper

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
vi.mocked(mockLlm).complete passes the method reference unbound.
Switch to vi.mocked(mockLlm.complete) — accessing the property on
the object before passing to vi.mocked avoids the ESLint rule.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Assigning vi.fn() to a variable before placing it in the LlmProvider
object means we never reference completeSpy as mockLlm.complete —
the ESLint rule only fires when a method is accessed from an object.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@RobertLD RobertLD requested a review from Copilot March 19, 2026 17:22
…xing

New VitePress pages:
- docs/guide/lite.md — full user guide: constructor options, indexing,
  search, RAG (getContext/ask/askStream), code indexing, rate(), lifecycle,
  integration pattern for external MCP servers
- docs/guide/code-indexing.md — tree-sitter chunking guide: installation,
  supported languages, node types, preamble accumulation, large-node
  splitting, caching, error handling, full directory-indexing example
- docs/reference/lite-api.md — complete TypeScript API reference for
  LibScopeLite and TreeSitterChunker with all types, options, and examples

Updated existing docs:
- docs/.vitepress/config.ts — add LibScope Lite, Code Indexing (sidebar
  Integrations) and LibScope Lite API (sidebar Reference)
- docs/guide/architecture.md — add lite/ to system layers diagram, module
  map, and LibScope Lite layer section
- docs/guide/programmatic-usage.md — tip callout pointing to libscope/lite
- CLAUDE.md — add src/lite/ to project structure
- README.md — add LibScopeLite and TreeSitterChunker to SDK section with
  examples

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Owner Author

@RobertLD RobertLD left a comment


Reviewed the libscope/lite implementation. Four critical bugs found: swallowed errors in indexBatch, lost document IDs in indexRaw, a shared-parser race in TreeSitterChunker, and an off-by-one in preamble line numbers.

if (!doc) break;
idx++;
activeCount++;
void this.index([doc]).finally(() => {

Bug: indexBatch silently discards errors.

void this.index([doc]).finally(...) drops any rejection from this.index(). The .finally() callback receives no argument — it cannot tell whether the preceding promise succeeded or failed. The Promise<void> wrapping this has no reject path, so if any document fails to embed/index, indexBatch still resolves successfully, silently losing the failure.

The fix is to accept a reject callback in the outer Promise executor and call it from a rejection handler, for example .then(runNext, (err) => reject(err)). It cannot be done inside .finally(), because that callback never sees the error. At minimum the error should be surfaced:

await new Promise<void>((resolve, reject) => {
  const runNext = (): void => {
    while (activeCount < concurrency && idx < docs.length) {
      const doc = docs[idx];
      if (!doc) break;
      idx++;
      activeCount++;
      void this.index([doc]).then(
        () => {
          activeCount--;
          if (idx >= docs.length && activeCount === 0) resolve();
          else runNext();
        },
        (err: unknown) => reject(err),
      );
    }
  };
  runNext();
});
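
For comparison, here is a self-contained sketch of the same bounded-concurrency pattern that also resolves on empty input and stops scheduling after the first failure. The names runBatch and IndexFn are hypothetical stand-ins for illustration; this is not the code in src/lite/core.ts.

```typescript
// Hypothetical stand-in for LibScopeLite.index(); not the library's signature.
type IndexFn<T> = (doc: T) => Promise<void>;

// Run `indexOne` over `docs` with at most `concurrency` calls in flight.
// Rejects with the first error and schedules no further work after it.
function runBatch<T>(
  docs: readonly T[],
  indexOne: IndexFn<T>,
  concurrency = 4,
): Promise<void> {
  const limit = Math.max(1, Math.floor(concurrency));
  return new Promise<void>((resolve, reject) => {
    let next = 0;
    let active = 0;
    let failed = false;

    const pump = (): void => {
      if (failed) return;
      // Covers both "nothing to do" (empty input) and "everything finished".
      if (next >= docs.length && active === 0) {
        resolve();
        return;
      }
      while (active < limit && next < docs.length) {
        const doc = docs[next++] as T;
        active++;
        // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
        void Promise.resolve()
          .then(() => indexOne(doc))
          .then(
            () => {
              active--;
              pump();
            },
            (err: unknown) => {
              failed = true;
              reject(err);
            },
          );
      }
    };

    pump();
  });
}
```

Rejecting on the first error is one possible policy. A caller that wants every document attempted could instead collect errors and reject with an AggregateError once the queue drains.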

});
if (i === 0) firstId = result.id;
}
return firstId;

Bug: indexRaw returns only the first chunk's document ID, and can return "" (empty string) as an ID.

firstId is initialised to "" and only assigned when i === 0. If indexDocument throws on any chunk, the exception propagates out of the loop, so that path is fine. The fragile part is that "" would be returned as if it were a valid ID if the loop ever completed without entering if (i === 0). That is impossible today, but nothing enforces it. More importantly, the API contract says indexRaw returns the document ID of the newly created document, but when multiple chunks are indexed they each get their own document ID. Only the first is returned; the others are silently unreachable by the caller. This is a data loss bug for multi-chunk files: the caller cannot rate, retrieve, or delete the documents for chunks 2–N.

Consider either (a) returning all IDs as string[], or (b) indexing all chunks as a single document with pre-split content.
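
Option (a) could look roughly like the sketch below. Chunk, IndexDocument, and indexAllChunks are hypothetical names for illustration, and throwing on an empty chunk list is an assumed policy; the real types live in src/lite/types.ts and may differ.

```typescript
// Hypothetical shapes for illustration only.
interface Chunk {
  content: string;
}
type IndexDocument = (chunk: Chunk) => Promise<{ id: string }>;

// Index every chunk and return every resulting ID, in chunk order.
async function indexAllChunks(
  chunks: readonly Chunk[],
  indexDocument: IndexDocument,
): Promise<string[]> {
  if (chunks.length === 0) {
    // Fail loudly instead of returning a placeholder such as "".
    throw new Error("nothing to index: input produced no chunks");
  }
  const ids: string[] = [];
  for (const chunk of chunks) {
    const { id } = await indexDocument(chunk);
    ids.push(id);
  }
  return ids;
}
```

Returning string[] changes the public signature, so callers that expect a single ID would need updating; option (b) avoids that at the cost of reworking how pre-split content is stored.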

* producing semantically meaningful chunks suitable for embedding.
*/
export class TreeSitterChunker {
private parserCache: TSParser | undefined;

Race condition: shared mutable parser state across concurrent chunk() calls.

parserCache holds a single TSParser instance. parser.setLanguage(grammar) on line 128 mutates the parser's active grammar. If chunk() is called concurrently for two different languages (e.g., "typescript" and "python"), the sequence can be:

  1. Coroutine A: getParser() → returns cached parser
  2. Coroutine B: getParser() → returns same cached parser
  3. Coroutine A: parser.setLanguage(typescriptGrammar)
  4. Coroutine B: parser.setLanguage(pythonGrammar) → overwrites A's language
  5. Coroutine A: parser.parse(tsSource) → parsed with Python grammar → wrong AST

This is realistic when indexBatch runs concurrency > 1 and the batch contains mixed-language files that go through indexRaw → normalizeRawInput → chunker.chunk().

The parser must be re-created per chunk() call, or access must be serialised so that only one language is active at a time, or one parser instance must be kept per language.
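
The per-language option can be sketched with a small cache keyed by language name. ParserLike and ParserPool are hypothetical names, and the interface is a minimal structural assumption rather than the tree-sitter typings; the only property relied on is that a parser has setLanguage() and parse().

```typescript
// Minimal structural type: an assumption, not the real tree-sitter typings.
interface ParserLike<G, Tree> {
  setLanguage(grammar: G): void;
  parse(source: string): Tree;
}

// One parser per language: setLanguage() runs once per instance, so
// concurrent callers for different languages never share mutable state.
class ParserPool<G, Tree> {
  private readonly parsers = new Map<string, ParserLike<G, Tree>>();
  private readonly createParser: () => ParserLike<G, Tree>;
  private readonly loadGrammar: (language: string) => G;

  constructor(
    createParser: () => ParserLike<G, Tree>,
    loadGrammar: (language: string) => G,
  ) {
    this.createParser = createParser;
    this.loadGrammar = loadGrammar;
  }

  get(language: string): ParserLike<G, Tree> {
    let parser = this.parsers.get(language);
    if (parser === undefined) {
      parser = this.createParser();
      parser.setLanguage(this.loadGrammar(language)); // set once, never changed
      this.parsers.set(language, parser);
    }
    return parser;
  }
}
```

This keeps the benefit of caching while trading a little memory for isolation. It assumes the grammar is already loaded synchronously; if grammar loading is asynchronous, the cache would need to store a promise per language instead.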

chunks.push({
content: preamble,
startLine: preambleStartLine ?? startLine,
endLine: child.startPosition.row,

Off-by-one: preamble endLine is 0-based while every other endLine in the same chunk array is 1-based.

endLine: child.startPosition.row,   // line 223 — missing +1

Every other place in this file converts row to a 1-based line number with row + 1 (lines 184, 194, 214, 244, 254, 265, 269, 280). Only the preamble chunk emitted inside the large-node path uses the raw 0-based row value. For example, if the preamble ends just before a class starting at row 10 (1-based line 11), the preamble chunk will report endLine: 10 instead of endLine: 11, making the range appear to end one line early and leaving line 11 unaccounted for in any line-range display or navigation built on top of these chunks.
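
Mixed 0-based and 1-based values are easier to avoid when every conversion goes through one helper. A small sketch, assuming tree-sitter's 0-based rows and inclusive 1-based line ranges; toLine and rangeBefore are hypothetical names, not functions from chunker-treesitter.ts.

```typescript
// tree-sitter rows are 0-based; displayed line numbers are 1-based.
const toLine = (row: number): number => row + 1;

// Inclusive 1-based range for text that starts at `startRow` and stops on
// the row immediately before the node at `nextNodeRow`.
function rangeBefore(
  startRow: number,
  nextNodeRow: number,
): { startLine: number; endLine: number } {
  return { startLine: toLine(startRow), endLine: toLine(nextNodeRow - 1) };
}
```

Under these assumptions, text that stops on the row before a node at row 10 ends on line 10, while the node itself starts on toLine(10), which is line 11. Which of the two the preamble's endLine is meant to hold is worth confirming against the other chunk ranges when applying the +1.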

@RobertLD RobertLD merged commit 022b958 into main Mar 19, 2026
10 checks passed
@RobertLD RobertLD deleted the feat/libscope-lite-451 branch March 19, 2026 18:03
@RobertLD RobertLD restored the feat/libscope-lite-451 branch March 19, 2026 19:08
@RobertLD RobertLD review requested due to automatic review settings March 23, 2026 22:45
