Git-native AI agents.
Prompts and datasets in your repo. Evals in CI. Traces in your OTEL backend.
AgentMark is an open-source platform for building reliable AI agents. Define prompts in Markdown, run them with the SDK you already use, evaluate against datasets locally or in CI, and trace every call with OpenTelemetry.
- Prompt management. Prompts are .prompt.mdx files with type-safe inputs, tools, structured outputs, conditionals, loops, and reusable components. They live in your repo, get reviewed in PRs, and roll back with git revert.
- Datasets. JSONL files in your repo. Each row is a line, so git diffs show exactly which test cases changed.
- Evaluations. Run prompts over datasets with built-in or custom evaluators. Use the CLI or call the SDK from your own pipelines. Block merges on regressions, the way tests do.
- Tracing. Every LLM call emits an OpenTelemetry span. Inspect traces in the local dev UI, or forward them to AgentMark Cloud (or any OTEL backend) for search, dashboards, and alerts in production.
- Type safety. Auto-generated TypeScript types from your prompts. JSON Schema validation in your editor.
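To make the dataset and evaluation bullets concrete, here is a minimal sketch of the idea: parse a JSONL dataset, run a prompt over each row, and score the outputs with a custom evaluator. The row shape (input, expected_output), the helper names, and the sample rows are assumptions for illustration only; they are not AgentMark's actual dataset schema or evaluator API, and the model call is replaced by a stand-in function so the sketch runs offline.

```typescript
// Hypothetical row shape; AgentMark's real dataset schema may differ.
interface DatasetRow {
  input: Record<string, unknown>;
  expected_output: string;
}

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): DatasetRow[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as DatasetRow);
}

// In this sketch a custom evaluator is a plain function from
// (output, expected) to a pass/fail result.
type Evaluator = (output: string, expected: string) => { passed: boolean };

const exactMatch: Evaluator = (output, expected) => ({
  passed: output.trim() === expected.trim(),
});

// Run a prompt function over every row and return the fraction that passed.
function passRate(
  rows: DatasetRow[],
  runPrompt: (input: Record<string, unknown>) => string,
  evaluate: Evaluator,
): number {
  if (rows.length === 0) return 1;
  const passed = rows.filter(
    (row) => evaluate(runPrompt(row.input), row.expected_output).passed,
  ).length;
  return passed / rows.length;
}

// Illustrative rows only; each line is one test case.
const sampleJsonl = [
  '{"input": {"customer_question": "How long does shipping take?"}, "expected_output": "3-5 business days"}',
  '{"input": {"customer_question": "What is the warranty?"}, "expected_output": "1 year"}',
].join("\n");

// Stand-in for a real model call so the sketch runs without an API key.
const fakePrompt = (input: Record<string, unknown>): string =>
  input["customer_question"] === "How long does shipping take?"
    ? "3-5 business days"
    : "unknown";

const rows = parseJsonl(sampleJsonl);
const rate = passRate(rows, fakePrompt, exactMatch);
console.log(`pass rate: ${rate}`);
```

In a CI job, a script like this could exit non-zero when the pass rate drops below a threshold you choose, which is one way to block a merge on a regression.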
Requires: Node.js 18 or newer.
# Scaffold a new project (interactive: picks your language)
npm create agentmark@latest my-agents
cd my-agents
# Start the dev server (API + trace UI + hot reload)
agentmark dev
# Run a single prompt
agentmark run-prompt agentmark/my-prompt.prompt.mdx
# Run an experiment against a dataset
agentmark run-experiment agentmark/my-prompt.prompt.mdx
About five minutes from npm create to a traced prompt running locally (assuming you have an LLM API key set up).
---
name: customer-support-agent
text_config:
  model_name: anthropic/claude-sonnet-4-20250514
  max_calls: 2
  tools:
    - search_knowledgebase
test_settings:
  props:
    customer_question: "How long does shipping take?"
input_schema:
  type: object
  properties:
    customer_question:
      type: string
  required: [customer_question]
---
<System>
You are a helpful customer service agent. Use the search_knowledgebase tool
when customers ask about shipping, warranty, or returns.
</System>
<User>{props.customer_question}</User>
The frontmatter declares which tools the prompt may call; the implementations live in your code, resolved where you call the model. See Tools and agents.
Run it:
agentmark run-prompt customer-support.prompt.mdx
The prompt is version-controlled, type-checked, and traced. The same file works with any SDK: the Vercel AI SDK, the raw OpenAI or Anthropic client, Pydantic AI, or your own bespoke client. AgentMark renders the prompt to a neutral { messages, ...config } shape; your SDK makes the call.
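As a rough illustration of what checking props against the input_schema in the frontmatter involves, the sketch below hand-rolls a check for the one case the example uses: an object with typed, required properties. The ObjectSchema type and the checkProps helper are hypothetical names invented here. This is not AgentMark's validator and covers only a small subset of JSON Schema.

```typescript
// Minimal subset of JSON Schema, just enough for the example frontmatter.
interface ObjectSchema {
  type: "object";
  properties: Record<string, { type: "string" | "number" | "boolean" }>;
  required?: string[];
}

// Mirrors the input_schema of the customer-support-agent prompt.
const inputSchema: ObjectSchema = {
  type: "object",
  properties: { customer_question: { type: "string" } },
  required: ["customer_question"],
};

// Hypothetical helper: returns a list of problems, empty when props are valid.
function checkProps(
  schema: ObjectSchema,
  props: Record<string, unknown>,
): string[] {
  const problems: string[] = [];
  for (const key of schema.required ?? []) {
    if (!(key in props)) problems.push(`missing required prop: ${key}`);
  }
  for (const [key, value] of Object.entries(props)) {
    const expected = schema.properties[key]?.type;
    if (expected === undefined) continue; // unknown props are ignored in this sketch
    if (typeof value !== expected) {
      problems.push(`${key} should be ${expected}, got ${typeof value}`);
    }
  }
  return problems;
}

const ok = checkProps(inputSchema, {
  customer_question: "How long does shipping take?",
});
const wrongType = checkProps(inputSchema, { customer_question: 42 });
const missing = checkProps(inputSchema, {});
console.log(ok, wrongType, missing);
```

A real project would more likely rely on a full JSON Schema validator or on generated TypeScript types; the point here is only the kind of mistake the schema lets tooling catch before a model is called.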
Most AI tooling treats the dashboard as the primary workspace. Prompts are rows in a database. Edits happen in a browser. Version history is whatever audit log the vendor decided to expose.
That's fine for prototyping. It stops working as soon as you do anything an engineering team would normally do with code. Branch off main to try a variant. Review a prompt change in a PR. Run evals in CI before a merge. Look up who changed the retrieval logic last quarter. Roll back when something breaks.
AgentMark treats prompts, datasets, and evals like the rest of your code. Prompts are MDX files. Datasets are JSONL. Evals are functions you import. Branches, PRs, git log, git revert: they all work the same way they do for anything else in your repo.
And when you decide to leave, your prompts are already in your repo and your traces are already in whatever OTEL backend you point them at. No export job, no vendor migration.
Want to try it on a team? Start free on AgentMark Cloud → | Read the docs →
| Feature | Description |
|---|---|
| Multimodal generation | Generate text, structured objects, images, and speech from a single prompt file. |
| Tools and agents | Declare tools by name in frontmatter; your code owns the implementations. Build agentic loops with max_calls. |
| Structured output | Type-safe JSON output via JSON Schema definitions. |
| Datasets and evals | Run prompts over JSONL datasets with built-in or custom evaluators. |
| Tracing | OpenTelemetry-native tracing for every LLM call, local and cloud. |
| Type safety | Auto-generated TypeScript types from your prompts. JSON Schema validation in your IDE. |
| Reusable components | Import and compose prompt fragments across files. |
| Conditionals and loops | Dynamic prompts with <If>, <ForEach>, props, and filter functions. |
| File attachments | Attach images and documents for vision and document tasks. |
| MCP servers | Call Model Context Protocol tools directly from prompts. |
| MCP server | Drive the full AgentMark API — traces, datasets, scores, deployments — from Claude Code, Cursor, or any MCP client. |
AgentMark doesn't call LLM APIs directly, and there are no SDK-specific adapters to install. Prompts render to a neutral { messages, ...config } shape that you hand to whatever SDK you already use — so you keep your existing client, retry logic, and auth:
import { createAgentMark } from "@agentmark-ai/prompt-core";
const agentmark = createAgentMark({ loader });
const prompt = await agentmark.loadTextPrompt("customer-support.prompt.mdx");
const { messages, ...config } = await prompt.format({ props });
// hand `messages` + `config` to your SDK of choice
See the bring-your-own-SDK guide for the full integration path, including the createExecutor builder that lets AgentMark Cloud and agentmark dev run prompts through your SDK.
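The hand-off step can be sketched without committing to any particular SDK. In the example below, the ChatMessage and RenderedPrompt types, the ModelCall signature, and the config key names (which simply echo the frontmatter) are assumptions for illustration; the real shapes come from @agentmark-ai/prompt-core and from whichever client you use. A stub client stands in for the network call.

```typescript
// Hypothetical types; the real shapes come from @agentmark-ai/prompt-core.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface RenderedPrompt {
  messages: ChatMessage[];
  [configKey: string]: unknown; // e.g. model name, tool names, call limits
}

// Any SDK call can be wrapped to fit this signature.
type ModelCall = (
  request: { messages: ChatMessage[] } & Record<string, unknown>,
) => Promise<string>;

// Split the rendered prompt and forward it to whichever client is injected.
async function runWithSdk(
  rendered: RenderedPrompt,
  call: ModelCall,
): Promise<string> {
  const { messages, ...config } = rendered;
  return call({ messages, ...config });
}

// Stub client: reports what it received instead of calling a model.
const echoClient: ModelCall = async (request) =>
  `model=${String(request["model_name"])} messages=${request.messages.length}`;

// Illustrative rendered prompt; the keys are assumed, not confirmed.
const rendered: RenderedPrompt = {
  messages: [
    { role: "system", content: "You are a helpful customer service agent." },
    { role: "user", content: "How long does shipping take?" },
  ],
  model_name: "anthropic/claude-sonnet-4-20250514",
  max_calls: 2,
};

const reply = runWithSdk(rendered, echoClient);
reply.then((text) => console.log(text));
```

Injecting the client as a function keeps retry logic, auth, and transport in your own code, which is the property the paragraph above describes.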
| Language | Status |
|---|---|
| TypeScript / JavaScript | Supported |
| Python | Supported |
| Others | Open an issue |
See the examples/ directory for complete, runnable projects:
- Hello World: the simplest possible prompt
- Structured Output: extract typed JSON with a schema
- Tool Use: an agent with tool calling
- Reusable Components: import and compose prompts
- Evaluations: test prompts against datasets
- Production Tracing: trace LLM calls with the SDK
| Package | Description |
|---|---|
| @agentmark-ai/cli | CLI for local development, prompt running, experiments, and building. |
| @agentmark-ai/sdk | SDK for tracing and cloud platform integration. |
| @agentmark-ai/prompt-core | Core prompt parsing and formatting engine. |
| @agentmark-ai/templatedx | MDX-based template engine with JSX components, conditionals, and loops. |
| @agentmark-ai/mcp-server | MCP server exposing the AgentMark API to Claude Code, Cursor, and other MCP clients. |
| @agentmark-ai/model-registry | Centralized LLM model metadata and pricing. |
| create-agentmark | Project scaffolding tool. |
Packages are versioned independently. The pairings below are what each release
line is tested against. Mixing versions outside them can fail at runtime, because
@agentmark-ai/sdk imports @agentmark-ai/prompt-core lazily: a mismatch
surfaces when runExperiment or the webhook runner first executes, not at
install time.
| @agentmark-ai/sdk | @agentmark-ai/prompt-core | @agentmark-ai/cli |
|---|---|---|
| 2.x | ≥1.0 | ≥0.21 |
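Because a mismatch only shows up at runtime, it can help to check installed versions up front, for example in a CI step. The sketch below encodes the single compatibility row above as a function. The helper names are hypothetical, the parsing handles plain major.minor.patch strings only, and reading the actual installed versions (for instance from each package's package.json) is left out.

```typescript
// Parse "major.minor.patch" into numbers; missing parts default to 0.
function parseVersion(version: string): [number, number, number] {
  const [major = 0, minor = 0, patch = 0] = version
    .split(".")
    .map((part) => Number.parseInt(part, 10));
  return [major, minor, patch];
}

// True when `version` is at least `minimum`
// (compares major, then minor, then patch).
function atLeast(version: string, minimum: string): boolean {
  const [aMajor, aMinor, aPatch] = parseVersion(version);
  const [bMajor, bMinor, bPatch] = parseVersion(minimum);
  if (aMajor !== bMajor) return aMajor > bMajor;
  if (aMinor !== bMinor) return aMinor > bMinor;
  return aPatch >= bPatch;
}

// Encodes the compatibility row: sdk 2.x, prompt-core >= 1.0, cli >= 0.21.
function isTestedPairing(sdk: string, promptCore: string, cli: string): boolean {
  const [sdkMajor] = parseVersion(sdk);
  return sdkMajor === 2 && atLeast(promptCore, "1.0.0") && atLeast(cli, "0.21.0");
}

// Example version strings are made up for illustration.
console.log(isTestedPairing("2.3.1", "1.0.0", "0.21.4"));
console.log(isTestedPairing("2.0.0", "0.9.5", "0.21.0"));
```

The first call falls inside the tested pairing and the second does not, because its prompt-core version is below 1.0.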
The loader packages (@agentmark-ai/loader-api, @agentmark-ai/loader-file) are
re-export shims for the @agentmark-ai/prompt-core/loader-api and
@agentmark-ai/prompt-core/loader-file subpaths. Prefer the prompt-core subpaths in new code.
AgentMark is open-core. The full development loop runs locally with no cloud dependency.
- Self-hosted (this repo, AGPL-3.0). CLI, SDK, prompt engine, local trace UI (agentmark dev), eval runner, MCP server. Ship to production using only what's in this repo, and forward traces to any OpenTelemetry backend.
- AgentMark Cloud (hosted, proprietary). The team layer on top: persistent trace storage, dashboards, collaborative prompt editing, annotations, alerts, and two-way Git sync. Free tier covers most small teams.
If you only need observability and you already have an OTEL backend, the self-hosted setup is enough. Cloud is for teams that want the dashboard, collaboration, and managed trace storage.
AgentMark Cloud adds the team layer:
- Persistent trace storage with search, filtering, and saved views
- Dashboards for cost, latency, and quality metrics
- Collaborative prompt editing with version history
- Annotations and human evaluation workflows
- Alerts for quality regressions, cost spikes, and latency
- Two-way Git sync. Edit prompts in the dashboard and the changes land as commits in your repo (and vice versa).
The free tier covers small teams. Try Cloud free →
We welcome contributions. See CONTRIBUTING.md.
- GitHub Issues: bugs and feature requests
- GitHub Discussions: questions, ideas, and help
- LinkedIn: product updates and team posts
- Docs: reference, guides, and tutorials