AgentMark

Git-native AI agents.
Prompts and datasets in your repo. Evals in CI. Traces in your OTEL backend.

Homepage · Docs · Cloud


AgentMark is an open-source platform for building reliable AI agents. Define prompts in Markdown, run them with the SDK you already use, evaluate against datasets locally or in CI, and trace every call with OpenTelemetry.

  • Prompt management. Prompts are .prompt.mdx files with type-safe inputs, tools, structured outputs, conditionals, loops, and reusable components. They live in your repo, get reviewed in PRs, and roll back with git revert.
  • Datasets. JSONL files in your repo. Each row is a line, so git diffs show exactly which test cases changed.
  • Evaluations. Run prompts over datasets with built-in or custom evaluators. Use the CLI or call the SDK from your own pipelines. Block merges on regressions, the way tests do.
  • Tracing. Every LLM call emits an OpenTelemetry span. Inspect traces in the local dev UI, or forward them to AgentMark Cloud (or any OTEL backend) for search, dashboards, and alerts in production.
  • Type safety. Auto-generated TypeScript types from your prompts. JSON Schema validation in your editor.
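
The Datasets bullet above describes JSONL files with one test case per line. As a minimal sketch of what reading such a file involves, the snippet below parses JSONL text into rows. The `DatasetRow` field names (`input`, `expected_output`) and the sample values are assumptions made for illustration, not AgentMark's documented dataset schema.

```typescript
// Hypothetical row shape: field names are illustrative, not AgentMark's schema.
interface DatasetRow {
  input: Record<string, unknown>;
  expected_output?: string;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseJsonl(text: string): DatasetRow[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      try {
        return JSON.parse(line) as DatasetRow;
      } catch (err) {
        // `index` counts non-empty rows, starting at zero.
        throw new Error(`Invalid JSON in dataset row ${index + 1}: ${(err as Error).message}`);
      }
    });
}

// Made-up sample data, two test cases on two lines.
const sample = [
  '{"input": {"customer_question": "How long does shipping take?"}, "expected_output": "3-5 business days"}',
  '{"input": {"customer_question": "What is the warranty period?"}, "expected_output": "12 months"}',
].join("\n");

const rows = parseJsonl(sample);
console.log(rows.length); // 2
```

Because each row is a single line, adding or editing a test case touches exactly one line in a git diff.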

Quick start

Requires: Node.js 18 or newer.

# Scaffold a new project (interactive: picks your language)
npm create agentmark@latest my-agents
cd my-agents

# Start the dev server (API + trace UI + hot reload)
agentmark dev

# Run a single prompt
agentmark run-prompt agentmark/my-prompt.prompt.mdx

# Run an experiment against a dataset
agentmark run-experiment agentmark/my-prompt.prompt.mdx

It takes about five minutes to go from npm create to a traced prompt running locally, assuming you already have an LLM API key set up.

What a prompt looks like

---
name: customer-support-agent
text_config:
  model_name: anthropic/claude-sonnet-4-20250514
  max_calls: 2
  tools:
    - search_knowledgebase
test_settings:
  props:
    customer_question: "How long does shipping take?"
input_schema:
  type: object
  properties:
    customer_question:
      type: string
  required: [customer_question]
---

<System>
You are a helpful customer service agent. Use the search_knowledgebase tool
when customers ask about shipping, warranty, or returns.
</System>

<User>{props.customer_question}</User>

The frontmatter declares which tools the prompt may call; the implementations live in your code, resolved where you call the model. See Tools and agents.
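
One way to picture the split between declared tool names and their implementations is a small registry keyed by name. This is a sketch under assumptions: `ToolImpl`, `toolRegistry`, and `resolveTools` are hypothetical names, and the real tool types come from whichever SDK you call the model with.

```typescript
// Hypothetical tool signature; real SDKs each define their own tool types.
type ToolImpl = (args: Record<string, unknown>) => Promise<string>;

// Implementations live in application code, keyed by the name used in frontmatter.
const toolRegistry = new Map<string, ToolImpl>([
  [
    "search_knowledgebase",
    async (args) => {
      const query = String(args["query"] ?? "");
      // Placeholder result; a real implementation would query a search index.
      return `No articles found for "${query}"`;
    },
  ],
]);

// Resolve declared names to implementations, failing fast on unknown names.
function resolveTools(declared: string[]): Map<string, ToolImpl> {
  const resolved = new Map<string, ToolImpl>();
  for (const toolName of declared) {
    const impl = toolRegistry.get(toolName);
    if (!impl) {
      throw new Error(`Prompt declares tool "${toolName}" but no implementation is registered`);
    }
    resolved.set(toolName, impl);
  }
  return resolved;
}

const tools = resolveTools(["search_knowledgebase"]);
console.log(tools.size); // 1
```

Failing at resolution time means a typo in the frontmatter shows up before the model is called rather than in the middle of an agent loop.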

Run it:

agentmark run-prompt customer-support.prompt.mdx

The prompt is version-controlled, type-checked, and traced. The same file works with any SDK — the Vercel AI SDK, the raw OpenAI or Anthropic client, Pydantic AI, or your own bespoke client. AgentMark renders the prompt to a neutral { messages, ...config } shape; your SDK makes the call.
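
To illustrate what "type-checked" can mean for the prompt above, here is a hand-written TypeScript equivalent of its input_schema plus a runtime guard. The names `CustomerSupportProps` and `isCustomerSupportProps` are hypothetical, and the types AgentMark actually generates may be named and shaped differently.

```typescript
// Hand-written equivalent of the input_schema above (an assumption, not generated output).
interface CustomerSupportProps {
  customer_question: string;
}

// Runtime guard mirroring `required: [customer_question]` and `type: string`.
function isCustomerSupportProps(value: unknown): value is CustomerSupportProps {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return typeof candidate["customer_question"] === "string";
}

const good: unknown = { customer_question: "How long does shipping take?" };
const bad: unknown = { customer_question: 42 };

console.log(isCustomerSupportProps(good)); // true
console.log(isCustomerSupportProps(bad)); // false
```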

Why git-native

Most AI tooling treats the dashboard as the primary workspace. Prompts are rows in a database. Edits happen in a browser. Version history is whatever audit log the vendor decided to expose.

That's fine for prototyping. It stops working as soon as you do anything an engineering team would normally do with code. Branch off main to try a variant. Review a prompt change in a PR. Run evals in CI before a merge. Look up who changed the retrieval logic last quarter. Roll back when something breaks.

AgentMark treats prompts, datasets, and evals like the rest of your code. Prompts are MDX files. Datasets are JSONL. Evals are functions you import. Branches, PRs, git log, git revert: they all work the same way they do for anything else in your repo.

And when you decide to leave, your prompts are already in your repo and your traces are already in whatever OTEL backend you point them at. No export job, no vendor migration.
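
To make "evals are functions" concrete, the sketch below defines a plain evaluator function and a pass-rate gate of the kind a CI job could enforce. Everything here is hypothetical: the `EvalCase` and `EvalResult` shapes, the `containsExpected` evaluator, the sample cases, and the 0.8 threshold are illustrative choices, not AgentMark's built-in evaluators or API.

```typescript
interface EvalCase {
  input: string;
  expected: string;
  actual: string;
}

interface EvalResult {
  passed: boolean;
  score: number;
}

// Hypothetical evaluator: case-insensitive substring match.
function containsExpected(c: EvalCase): EvalResult {
  const passed = c.actual.toLowerCase().includes(c.expected.toLowerCase());
  return { passed, score: passed ? 1 : 0 };
}

// Fraction of cases the evaluator marks as passed (0 for an empty dataset).
function passRate(cases: EvalCase[], evaluator: (c: EvalCase) => EvalResult): number {
  if (cases.length === 0) return 0;
  const passedCount = cases.filter((c) => evaluator(c).passed).length;
  return passedCount / cases.length;
}

// Made-up results for two test cases.
const evalCases: EvalCase[] = [
  { input: "How long does shipping take?", expected: "3-5 business days", actual: "Shipping takes 3-5 business days." },
  { input: "What is the warranty period?", expected: "12 months", actual: "I am not sure." },
];

const rate = passRate(evalCases, containsExpected);
const threshold = 0.8;
const gatePassed = rate >= threshold;

console.log(rate); // 0.5
console.log(gatePassed); // false
// In CI, exiting with a non-zero code when gatePassed is false is what blocks the merge.
```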

Want to try it on a team? Start free on AgentMark Cloud →  |  Read the docs →

Features

| Feature | Description |
| --- | --- |
| Multimodal generation | Generate text, structured objects, images, and speech from a single prompt file. |
| Tools and agents | Declare tools by name in frontmatter; your code owns the implementations. Build agentic loops with max_calls. |
| Structured output | Type-safe JSON output via JSON Schema definitions. |
| Datasets and evals | Run prompts over JSONL datasets with built-in or custom evaluators. |
| Tracing | OpenTelemetry-native tracing for every LLM call, local and cloud. |
| Type safety | Auto-generated TypeScript types from your prompts. JSON Schema validation in your IDE. |
| Reusable components | Import and compose prompt fragments across files. |
| Conditionals and loops | Dynamic prompts with <If>, <ForEach>, props, and filter functions. |
| File attachments | Attach images and documents for vision and document tasks. |
| MCP servers | Call Model Context Protocol tools directly from prompts. |
| MCP server | Drive the full AgentMark API (traces, datasets, scores, deployments) from Claude Code, Cursor, or any MCP client. |

Bring your own SDK

AgentMark doesn't call LLM APIs directly, and there are no SDK-specific adapters to install. Prompts render to a neutral { messages, ...config } shape that you hand to whatever SDK you already use — so you keep your existing client, retry logic, and auth:

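import { createAgentMark } from "@agentmark-ai/prompt-core";

// `loader` is the prompt loader you have configured elsewhere;
// `props` is an object matching the prompt's input_schema.
const agentmark = createAgentMark({ loader });
const prompt = await agentmark.loadTextPrompt("customer-support.prompt.mdx");
// Render the prompt to the SDK-neutral shape.
const { messages, ...config } = await prompt.format({ props });
// Hand `messages` + `config` to your SDK of choice.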

See the bring-your-own-SDK guide for the full integration path, including the createExecutor builder that lets AgentMark Cloud and agentmark dev run prompts through your SDK.
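
The hand-off itself can be sketched without any real SDK. In the snippet below, `RenderedPrompt`, `CallModel`, `runRendered`, and `echoClient` are hypothetical names, and the `model_name` key in the rendered config is an assumption borrowed from the frontmatter. The stub client only echoes the last message, so swap in your real client where it is used.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed shape of a rendered prompt: messages plus arbitrary config keys.
interface RenderedPrompt {
  messages: ChatMessage[];
  [configKey: string]: unknown;
}

// Stand-in for an SDK call (hypothetical signature).
type CallModel = (
  request: { messages: ChatMessage[] } & Record<string, unknown>
) => Promise<string>;

// Split the rendered prompt and pass both parts to whatever client is supplied.
async function runRendered(rendered: RenderedPrompt, callModel: CallModel): Promise<string> {
  const { messages, ...config } = rendered;
  return callModel({ messages, ...config });
}

// Stub client that echoes the last message instead of calling a model.
const echoClient: CallModel = async (request) => {
  const last = request.messages[request.messages.length - 1];
  return `echo: ${last ? last.content : ""}`;
};

const renderedExample: RenderedPrompt = {
  messages: [
    { role: "system", content: "You are a helpful customer service agent." },
    { role: "user", content: "How long does shipping take?" },
  ],
  model_name: "anthropic/claude-sonnet-4-20250514",
};

runRendered(renderedExample, echoClient).then((reply) => console.log(reply));
```

Keeping the client behind a single function type is what lets retry logic and auth stay in your own code.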

Language support

| Language | Status |
| --- | --- |
| TypeScript / JavaScript | Supported |
| Python | Supported |
| Others | Open an issue |

Examples

See the examples/ directory for complete, runnable projects.

Packages

| Package | Description |
| --- | --- |
| @agentmark-ai/cli | CLI for local development, prompt running, experiments, and building. |
| @agentmark-ai/sdk | SDK for tracing and cloud platform integration. |
| @agentmark-ai/prompt-core | Core prompt parsing and formatting engine. |
| @agentmark-ai/templatedx | MDX-based template engine with JSX components, conditionals, and loops. |
| @agentmark-ai/mcp-server | MCP server exposing the AgentMark API to Claude Code, Cursor, and other MCP clients. |
| @agentmark-ai/model-registry | Centralized LLM model metadata and pricing. |
| create-agentmark | Project scaffolding tool. |

Version compatibility

Packages are versioned independently. Mixing versions outside the tested pairings can fail at runtime: @agentmark-ai/sdk imports @agentmark-ai/prompt-core lazily, so a mismatch surfaces when runExperiment or the webhook runner first executes, not at install time. The pairings below are what each release line is tested against:

| @agentmark-ai/sdk | @agentmark-ai/prompt-core | @agentmark-ai/cli |
| --- | --- | --- |
| 2.x | ≥1.0 | ≥0.21 |
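
Since a mismatch only surfaces at runtime, a project could check its installed versions against the pairing above at startup. The helper below is a hypothetical sketch, not part of any AgentMark package: it hard-codes the single row shown here, ignores pre-release tags, and uses made-up version strings in its examples.

```typescript
// Extract major and minor from "major.minor.patch"; pre-release tags are ignored.
function parseVersion(version: string): [number, number] {
  const match = /^(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return [Number(match[1]), Number(match[2])];
}

// True when `version` is at or above minMajor.minMinor.
function atLeast(version: string, minMajor: number, minMinor: number): boolean {
  const [major, minor] = parseVersion(version);
  return major > minMajor || (major === minMajor && minor >= minMinor);
}

// Encodes the row above: sdk 2.x with prompt-core >= 1.0 and cli >= 0.21.
function isTestedPairing(sdk: string, promptCore: string, cli: string): boolean {
  const [sdkMajor] = parseVersion(sdk);
  return sdkMajor === 2 && atLeast(promptCore, 1, 0) && atLeast(cli, 0, 21);
}

console.log(isTestedPairing("2.1.0", "1.3.0", "0.21.4")); // true
console.log(isTestedPairing("2.1.0", "0.9.0", "0.21.4")); // false
```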

The loader packages (@agentmark-ai/loader-api, @agentmark-ai/loader-file) are re-export shims of @agentmark-ai/prompt-core/loader-api and @agentmark-ai/prompt-core/loader-file. Prefer the prompt-core subpaths in new code.

Self-host vs Cloud

AgentMark is open-core. The full development loop runs locally with no cloud dependency.

  • Self-hosted (this repo, AGPL-3.0). CLI, SDK, prompt engine, local trace UI (agentmark dev), eval runner, MCP server. Ship to production using only what's in this repo, and forward traces to any OpenTelemetry backend.
  • AgentMark Cloud (hosted, proprietary). The team layer on top: persistent trace storage, dashboards, collaborative prompt editing, annotations, alerts, and two-way Git sync. Free tier covers most small teams.

If you only need observability and you already have an OTEL backend, the self-hosted setup is enough. Cloud is for teams that want the dashboard, collaboration, and managed trace storage.

AgentMark Cloud

AgentMark Cloud adds the team layer:

  • Persistent trace storage with search, filtering, and saved views
  • Dashboards for cost, latency, and quality metrics
  • Collaborative prompt editing with version history
  • Annotations and human evaluation workflows
  • Alerts for quality regressions, cost spikes, and latency
  • Two-way Git sync. Edit prompts in the dashboard and the changes land as commits in your repo (and vice versa).

The free tier covers small teams. Try Cloud free →

Contributing

We welcome contributions. See CONTRIBUTING.md.

Community

License

GNU Affero General Public License v3.0 or later