
Token estimation SPI for prompt assembly #1497

@jimador

Description

PromptContributor lets components inject content into LLM prompts, but nothing in the framework can estimate how many tokens any of it costs. There's no way for the prompt assembly pipeline to answer: "will this fit in the context window?"

Today this works because most applications have a small number of contributors with predictable output sizes. It breaks down when:

  • Multiple contributors compete for a finite context window
  • Contributors produce variable-length content (retrieval results, knowledge graph extracts, long conversation histories)
  • The application needs to prioritize which content survives when the window is full

WindowingConversationFormatter addresses conversation history with message-count windowing, but it's count-based, not token-based, and only covers one contributor.

What exists today

| Component | What it does | Token awareness |
|---|---|---|
| PromptContributor | Injects text at BEGINNING or END of prompt | None |
| WindowingConversationFormatter | Truncates older conversation messages | Count-based, not token-based |
| ConversationFormatter | Renders conversation into prompt text | None |
| ProcessContext.llmOperations | Executes LLM calls | Knows model limits internally; doesn't expose them |

Phase 1: TokenCounter SPI

A minimal functional interface for token estimation:

fun interface TokenCounter {
    fun countTokens(text: String): Int
}

Built-in implementations:

  • CharacterHeuristicTokenCounter — text.length / 4 (the "good enough" default). Zero dependencies, fast, surprisingly accurate for English text.
  • Optional: TiktokenTokenCounter adapter for model-specific precision (GPT-family). Could live in a separate module to avoid a hard dependency.

Design constraints:

  • Must never throw. Empty input returns 0 (the signature takes a non-null String, so null never reaches it).
  • Must be stateless and thread-safe.
  • Must be fast — called per-contributor, potentially hundreds of times during assembly.
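The heuristic default is small enough to sketch in full. A possible shape, assuming the Phase 1 SPI (repeated here so the snippet stands alone):

```kotlin
// Repeated from Phase 1 so the snippet compiles standalone.
fun interface TokenCounter {
    fun countTokens(text: String): Int
}

// Stateless object: thread-safe by construction, never throws, O(1) per call.
object CharacterHeuristicTokenCounter : TokenCounter {
    override fun countTokens(text: String): Int = text.length / 4
}
```

An object declaration makes statelessness and thread-safety structural rather than a convention to remember.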

Where it would live: embabel-agent-api (or embabel-agent-ai) alongside PromptContributor, since any SPI implementor might need it. The heuristic implementation could live in embabel-agent-core.

Phase 2: Token-budget-aware prompt assembly

Once the SPI exists, the assembly pipeline could optionally enforce a token budget:

interface PromptContributor {
    fun contribution(): String
    val promptContributionLocation: PromptContributionLocation
    val role: String

    // New: optional priority for budget enforcement (lower = dropped first)
    val priority: Int get() = 0
}
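Because priority has a default, existing implementations stay source-compatible. A hypothetical contributor opting into a higher priority; the enum and class names are illustrative, only the priority field comes from this proposal:

```kotlin
enum class PromptContributionLocation { BEGINNING, END }

// The Phase 2 interface from above, repeated so the example compiles standalone.
interface PromptContributor {
    fun contribution(): String
    val promptContributionLocation: PromptContributionLocation
    val role: String
    val priority: Int get() = 0   // existing implementations compile unchanged
}

// Hypothetical retrieval contributor that declares itself higher priority.
class RetrievalContributor(private val results: List<String>) : PromptContributor {
    override fun contribution() = results.joinToString("\n")
    override val promptContributionLocation = PromptContributionLocation.END
    override val role = "retrieval"
    override val priority = 10    // survives budget pressure longer than default-0 contributors
}
```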

When a tokenBudget is configured (via ProcessOptions or application properties), the assembly pipeline:

  1. Sorts contributors by priority (descending — highest priority assembled first)
  2. Accumulates token cost using the registered TokenCounter
  3. Drops lowest-priority contributors when the budget would be exceeded
  4. Never drops contributors marked as essential (a boolean flag, default false)

Contributors that exceed the budget on their own could optionally implement a truncate(maxTokens: Int): String method to produce a shortened version rather than being dropped entirely.
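Putting the four steps and the truncation escape hatch together, a minimal sketch of the budget pass, assuming the TokenCounter SPI. Contribution and Truncatable are simplified stand-ins for the real PromptContributor, and a real pipeline would also preserve BEGINNING/END placement:

```kotlin
fun interface TokenCounter { fun countTokens(text: String): Int }

// Simplified stand-in for PromptContributor with the proposed budget fields.
interface Contribution {
    fun contribution(): String
    val priority: Int        // lower = dropped first
    val essential: Boolean   // never dropped, even over budget
}

// Optional contract for contributors that can shorten themselves.
interface Truncatable {
    fun truncate(maxTokens: Int): String
}

fun enforceBudget(
    contributors: List<Contribution>,
    counter: TokenCounter,
    tokenBudget: Int,
): List<String> {
    var remaining = tokenBudget
    val kept = mutableListOf<String>()
    // Essential contributors first, then by descending priority.
    val ordered = contributors.sortedWith(
        compareByDescending<Contribution> { it.essential }.thenByDescending { it.priority }
    )
    for (c in ordered) {
        val text = c.contribution()
        val cost = counter.countTokens(text)
        when {
            c.essential || cost <= remaining -> { kept += text; remaining -= cost }
            c is Truncatable && remaining > 0 -> { kept += c.truncate(remaining); remaining = 0 }
            // else: dropped; lowest-priority contributors fall off first
        }
    }
    return kept
}
```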

Token-aware windowing for WindowingConversationFormatter:

  • Window by token count instead of message count when a TokenCounter is available
  • Fall back to message-count windowing when no TokenCounter is registered (backward compatible)
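The fallback logic might look like this; the function name and parameters are illustrative, not WindowingConversationFormatter's actual API:

```kotlin
fun interface TokenCounter { fun countTokens(text: String): Int }

fun windowMessages(
    messages: List<String>,
    counter: TokenCounter?,   // null when no TokenCounter is registered
    maxTokens: Int,
    maxMessages: Int,
): List<String> {
    // Backward-compatible fallback: message-count windowing.
    if (counter == null) return messages.takeLast(maxMessages)
    var remaining = maxTokens
    val window = ArrayDeque<String>()
    // Walk newest-to-oldest, keeping messages while the token budget lasts.
    for (msg in messages.asReversed()) {
        val cost = counter.countTokens(msg)
        if (cost > remaining) break
        window.addFirst(msg)
        remaining -= cost
    }
    return window.toList()
}
```

Walking newest-to-oldest keeps the most recent turns, matching what count-based windowing preserves today.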

Use cases

RAG with multiple knowledge sources — An agent queries three knowledge bases via PromptContributor implementations. One returns 2 results, another returns 50. Today the framework blindly concatenates everything. With priority and budget enforcement, the most relevant source gets budget priority, and overflow from less relevant sources is trimmed.

DICE proposition injection — Memory loads extracted propositions into the prompt via PromptContributor. Applications with large knowledge graphs need to know "how many propositions fit in N tokens?" and "which ones do I drop first?" The SPI would let DICE delegate estimation and focus on domain-specific priority/selection logic.

Long-horizon conversational agents — An agent running a multi-hour session accumulates conversation history, retrieved documents, and domain state. Without token budgeting, the prompt silently overflows. With it, the framework automatically windows conversation history and drops low-priority context to stay within limits.

Multi-agent orchestration — When multiple agents contribute context to a coordinator agent's prompt, each contribution competes for window space. Priority-based budgeting lets the orchestrator express "Agent A's context is more important than Agent B's" without hardcoding token limits per agent.

Open questions

  • Where does TokenCounter live? embabel-agent-api (alongside PromptContributor) seems right, since any SPI implementor might need it. The character heuristic implementation could live in embabel-agent-core.
  • Should priority be on PromptContributor or on a separate PromptBudget configuration? Putting priority on the contributor is simple but means the contributor decides its own importance. A separate budget configuration lets the application override priorities per-deployment.
  • Is essential sufficient, or do we need a richer "never drop" contract? A boolean flag covers it; a richer contract (e.g., "drop partially but not entirely") might be overengineering at this stage.
  • Should the framework auto-detect model context limits? If LlmOperations already knows the model, it could expose maxContextTokens and derive the budget automatically. This would be convenient but couples budget enforcement to the LLM provider abstraction.

I have a working implementation and am happy to take this on if it aligns with the project's direction.
