
Token estimation SPI for prompt assembly #1497

@jimador

Description

PromptContributor lets components inject content into LLM prompts, but nothing in the framework can estimate how many tokens any of it costs. There's no way for the prompt assembly pipeline to answer: "will this fit in the context window?"

Today this works because most applications have a small number of contributors with predictable output sizes. It breaks down when:

  • Multiple contributors compete for a finite context window
  • Contributors produce variable-length content (retrieval results, knowledge graph extracts, long conversation histories)
  • The application needs to prioritize which content survives when the window is full

WindowingConversationFormatter addresses conversation history with message-count windowing, but it's count-based, not token-based, and only covers one contributor.

What exists today

| Component | What it does | Token awareness |
|---|---|---|
| PromptContributor | Injects text at BEGINNING or END of prompt | None |
| WindowingConversationFormatter | Truncates older conversation messages | Count-based, not token-based |
| ConversationFormatter | Renders conversation into prompt text | None |
| ProcessContext.llmOperations | Executes LLM calls | Knows model limits internally; doesn't expose them |

Phase 1: TokenCounter SPI

A minimal functional interface for token estimation:

fun interface TokenCounter {
    fun countTokens(text: String): Int
}

Built-in implementations:

  • CharacterHeuristicTokenCounter — text.length / 4 (the "good enough" default). Zero dependencies, fast, surprisingly accurate for English text.
  • Optional: TiktokenTokenCounter adapter for model-specific precision (GPT-family). Could live in a separate module to avoid a hard dependency.

Design constraints:

  • Must never throw. Empty input returns 0 (the signature takes a non-null String, so null never reaches it).
  • Must be stateless and thread-safe.
  • Must be fast — called per-contributor, potentially hundreds of times during assembly.
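The heuristic default is small enough to sketch in full. A possible shape, assuming the Phase 1 SPI (repeated here so the snippet stands alone):

```kotlin
// Repeated from Phase 1 so the snippet compiles standalone.
fun interface TokenCounter {
    fun countTokens(text: String): Int
}

// Stateless object: thread-safe by construction, never throws, O(1) per call.
object CharacterHeuristicTokenCounter : TokenCounter {
    override fun countTokens(text: String): Int = text.length / 4
}
```

An object declaration makes statelessness and thread-safety structural rather than a convention to remember.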

Where it would live: embabel-agent-api (or embabel-agent-ai) alongside PromptContributor, since any SPI implementor might need it. The heuristic implementation could live in embabel-agent-core.

Phase 2: Token-budget-aware prompt assembly

Once the SPI exists, the assembly pipeline could optionally enforce a token budget:

interface PromptContributor {
    fun contribution(): String
    val promptContributionLocation: PromptContributionLocation
    val role: String

    // New: optional priority for budget enforcement (lower = dropped first)
    val priority: Int get() = 0
}
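Because priority has a default, existing implementations stay source-compatible. A hypothetical contributor opting into a higher priority; the enum and class names are illustrative, only the priority field comes from this proposal:

```kotlin
enum class PromptContributionLocation { BEGINNING, END }

// The Phase 2 interface from above, repeated so the example compiles standalone.
interface PromptContributor {
    fun contribution(): String
    val promptContributionLocation: PromptContributionLocation
    val role: String
    val priority: Int get() = 0   // existing implementations compile unchanged
}

// Hypothetical retrieval contributor that declares itself higher priority.
class RetrievalContributor(private val results: List<String>) : PromptContributor {
    override fun contribution() = results.joinToString("\n")
    override val promptContributionLocation = PromptContributionLocation.END
    override val role = "retrieval"
    override val priority = 10    // survives budget pressure longer than default-0 contributors
}
```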

When a tokenBudget is configured (via ProcessOptions or application properties), the assembly pipeline:

  1. Sorts contributors by priority (descending — highest priority assembled first)
  2. Accumulates token cost using the registered TokenCounter
  3. Drops lowest-priority contributors when the budget would be exceeded
  4. Never drops contributors marked as essential (a boolean flag, default false)

Contributors that exceed the budget on their own could optionally implement a truncate(maxTokens: Int): String method to produce a shortened version rather than being dropped entirely.
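Putting the four steps and the truncation escape hatch together, a minimal sketch of the budget pass, assuming the TokenCounter SPI. Contribution and Truncatable are simplified stand-ins for the real PromptContributor, and a real pipeline would also preserve BEGINNING/END placement:

```kotlin
fun interface TokenCounter { fun countTokens(text: String): Int }

// Simplified stand-in for PromptContributor with the proposed budget fields.
interface Contribution {
    fun contribution(): String
    val priority: Int        // lower = dropped first
    val essential: Boolean   // never dropped, even over budget
}

// Optional contract for contributors that can shorten themselves.
interface Truncatable {
    fun truncate(maxTokens: Int): String
}

fun enforceBudget(
    contributors: List<Contribution>,
    counter: TokenCounter,
    tokenBudget: Int,
): List<String> {
    var remaining = tokenBudget
    val kept = mutableListOf<String>()
    // Essential contributors first, then by descending priority.
    val ordered = contributors.sortedWith(
        compareByDescending<Contribution> { it.essential }.thenByDescending { it.priority }
    )
    for (c in ordered) {
        val text = c.contribution()
        val cost = counter.countTokens(text)
        when {
            c.essential || cost <= remaining -> { kept += text; remaining -= cost }
            c is Truncatable && remaining > 0 -> { kept += c.truncate(remaining); remaining = 0 }
            // else: dropped; lowest-priority contributors fall off first
        }
    }
    return kept
}
```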

Token-aware windowing for WindowingConversationFormatter:

  • Window by token count instead of message count when a TokenCounter is available
  • Fall back to message-count windowing when no TokenCounter is registered (backward compatible)
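The fallback logic might look like this; the function name and parameters are illustrative, not WindowingConversationFormatter's actual API:

```kotlin
fun interface TokenCounter { fun countTokens(text: String): Int }

fun windowMessages(
    messages: List<String>,
    counter: TokenCounter?,   // null when no TokenCounter is registered
    maxTokens: Int,
    maxMessages: Int,
): List<String> {
    // Backward-compatible fallback: message-count windowing.
    if (counter == null) return messages.takeLast(maxMessages)
    var remaining = maxTokens
    val window = ArrayDeque<String>()
    // Walk newest-to-oldest, keeping messages while the token budget lasts.
    for (msg in messages.asReversed()) {
        val cost = counter.countTokens(msg)
        if (cost > remaining) break
        window.addFirst(msg)
        remaining -= cost
    }
    return window.toList()
}
```

Walking newest-to-oldest keeps the most recent turns, matching what count-based windowing preserves today.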

Use cases

RAG with multiple knowledge sources — An agent queries three knowledge bases via PromptContributor implementations. One returns 2 results, another returns 50. Today the framework blindly concatenates everything. With priority and budget enforcement, the most relevant source gets budget priority, and overflow from less relevant sources is trimmed.

DICE proposition injection — Memory loads extracted propositions into the prompt via PromptContributor. Applications with large knowledge graphs need to know "how many propositions fit in N tokens?" and "which ones do I drop first?" The SPI would let DICE delegate estimation and focus on domain-specific priority/selection logic.

Long-horizon conversational agents — An agent running a multi-hour session accumulates conversation history, retrieved documents, and domain state. Without token budgeting, the prompt silently overflows. With it, the framework automatically windows conversation history and drops low-priority context to stay within limits.

Multi-agent orchestration — When multiple agents contribute context to a coordinator agent's prompt, each contribution competes for window space. Priority-based budgeting lets the orchestrator express "Agent A's context is more important than Agent B's" without hardcoding token limits per agent.

Open questions

  • Where does TokenCounter live? embabel-agent-api (alongside PromptContributor) seems right, since any SPI implementor might need it. The character heuristic implementation could live in embabel-agent-core.
  • Should priority be on PromptContributor or on a separate PromptBudget configuration? Putting priority on the contributor is simple but means the contributor decides its own importance. A separate budget configuration lets the application override priorities per-deployment.
  • Is essential sufficient, or do we need a richer "never drop" contract? A boolean flag covers it; a richer contract (e.g., "drop partially but not entirely") might be overengineering at this stage.
  • Should the framework auto-detect model context limits? If LlmOperations already knows the model, it could expose maxContextTokens and derive the budget automatically. This would be convenient but couples budget enforcement to the LLM provider abstraction.

I have a working implementation and am happy to take this on if it aligns with the project's direction.
