AgentMark

Git-native AI agents.
Prompts and datasets in your repo. Evals in CI. Traces in your OTEL backend.

AgentMark is an open-source platform for building reliable AI agents. Define prompts in Markdown, run them with the SDK you already use, evaluate against datasets locally or in CI, and trace every call with OpenTelemetry.

Prompt management. Prompts are .prompt.mdx files with type-safe inputs, tools, structured outputs, conditionals, loops, and reusable components. They live in your repo, get reviewed in PRs, and roll back with git revert.
Datasets. JSONL files in your repo. Each row is a line, so git diffs show exactly which test cases changed.
Evaluations. Run prompts over datasets with built-in or custom evaluators. Use the CLI or call the SDK from your own pipelines. Block merges on regressions, the way tests do.
Tracing. Every LLM call emits an OpenTelemetry span. Inspect traces in the local dev UI, or forward them to AgentMark Cloud (or any OTEL backend) for search, dashboards, and alerts in production.
Type safety. Auto-generated TypeScript types from your prompts. JSON Schema validation in your editor.
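
As a rough illustration of how the dataset and evaluation pieces fit together, the sketch below parses JSONL rows, scores each output with a custom evaluator, and computes a pass rate that a CI job could gate on. The row fields (`input`, `expected_output`), the `exactMatch` evaluator, and the `runEval` helper are hypothetical names chosen for this example, not AgentMark's API, and `fakeGenerate` stands in for a real prompt run.

```typescript
// Hypothetical row shape: the field names are assumptions for this sketch.
interface DatasetRow {
  input: Record<string, string>;
  expected_output: string;
}

// An evaluator scores one output against the expected value.
type Evaluator = (output: string, expected: string) => boolean;

// Parse JSONL: one JSON object per non-empty line.
function parseJsonl(text: string): DatasetRow[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as DatasetRow);
}

// Case-insensitive exact match, a deliberately simple custom evaluator.
const exactMatch: Evaluator = (output, expected) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase();

// Run `generate` over every row and return the fraction of rows that pass.
function runEval(
  rows: DatasetRow[],
  generate: (input: Record<string, string>) => string,
  evaluate: Evaluator,
): number {
  if (rows.length === 0) return 0;
  const passed = rows.filter((row) =>
    evaluate(generate(row.input), row.expected_output),
  ).length;
  return passed / rows.length;
}

const dataset = [
  '{"input": {"customer_question": "How long does shipping take?"}, "expected_output": "3-5 business days"}',
  '{"input": {"customer_question": "What is the warranty?"}, "expected_output": "1 year"}',
].join("\n");

// Stand-in for a real prompt run: it only answers the shipping question correctly.
const fakeGenerate = (input: Record<string, string>): string =>
  (input["customer_question"] ?? "").includes("shipping")
    ? "3-5 Business Days"
    : "unknown";

const passRate = runEval(parseJsonl(dataset), fakeGenerate, exactMatch);
const meetsThreshold = passRate >= 0.9; // a CI job would fail the build when false
```

Because each dataset row is one line, adding or editing a test case changes exactly one line in the diff, and the threshold check is what turns a quality regression into a failed build.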

Quick start

Requires: Node.js 18 or newer.

# Scaffold a new project (interactive: picks your language)
npm create agentmark@latest my-agents
cd my-agents

# Start the dev server (API + trace UI + hot reload)
agentmark dev

# Run a single prompt
agentmark run-prompt agentmark/my-prompt.prompt.mdx

# Run an experiment against a dataset
agentmark run-experiment agentmark/my-prompt.prompt.mdx

It takes about five minutes to go from npm create to a traced prompt running locally, assuming you already have an LLM API key set up.

What a prompt looks like

---
name: customer-support-agent
text_config:
  model_name: anthropic/claude-sonnet-4-20250514
  max_calls: 2
  tools:
    - search_knowledgebase
test_settings:
  props:
    customer_question: "How long does shipping take?"
input_schema:
  type: object
  properties:
    customer_question:
      type: string
  required: [customer_question]
---

<System>
You are a helpful customer service agent. Use the search_knowledgebase tool
when customers ask about shipping, warranty, or returns.
</System>

<User>{props.customer_question}</User>

The frontmatter declares which tools the prompt may call; the implementations live in your code, resolved where you call the model. See Tools and agents.

Run it:

agentmark run-prompt customer-support.prompt.mdx

The prompt is version-controlled, type-checked, and traced. The same file works with any SDK — the Vercel AI SDK, the raw OpenAI or Anthropic client, Pydantic AI, or your own bespoke client. AgentMark renders the prompt to a neutral { messages, ...config } shape; your SDK makes the call.
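
To make the neutral shape more concrete, here is a minimal sketch of what a rendered result could look like for the prompt above. The `PromptMessage` and `RenderedPrompt` types, the `parseProps` and `renderSupportPrompt` functions, and the exact config keys are assumptions for illustration; AgentMark's actual output may carry different or additional fields.

```typescript
// Props mirror the prompt's input_schema: a single required string.
interface CustomerSupportProps {
  customer_question: string;
}

interface PromptMessage {
  role: "system" | "user";
  content: string;
}

// Assumed rendered shape: messages plus the model settings from frontmatter.
interface RenderedPrompt {
  messages: PromptMessage[];
  model_name: string;
  max_calls: number;
  tools: string[];
}

// Runtime check that untyped input satisfies the schema's required string field.
function parseProps(raw: unknown): CustomerSupportProps {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("props must be an object");
  }
  const question = (raw as Record<string, unknown>)["customer_question"];
  if (typeof question !== "string") {
    throw new Error("customer_question is required and must be a string");
  }
  return { customer_question: question };
}

// Hand-written stand-in for the render step, hard-coded to the prompt above.
function renderSupportPrompt(props: CustomerSupportProps): RenderedPrompt {
  return {
    messages: [
      {
        role: "system",
        content:
          "You are a helpful customer service agent. Use the search_knowledgebase tool " +
          "when customers ask about shipping, warranty, or returns.",
      },
      { role: "user", content: props.customer_question },
    ],
    model_name: "anthropic/claude-sonnet-4-20250514",
    max_calls: 2,
    tools: ["search_knowledgebase"],
  };
}

const rendered = renderSupportPrompt(
  parseProps({ customer_question: "How long does shipping take?" }),
);

// Split the result the same way the README describes: messages, then everything else.
const { messages: renderedMessages, ...renderedConfig } = rendered;
```

The point of the split is that `renderedMessages` is plain data and `renderedConfig` is plain settings, so neither depends on a particular client library.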

Why git-native

Most AI tooling treats the dashboard as the primary workspace. Prompts are rows in a database. Edits happen in a browser. Version history is whatever audit log the vendor decided to expose.

That's fine for prototyping. It stops working as soon as you do anything an engineering team would normally do with code. Branch off main to try a variant. Review a prompt change in a PR. Run evals in CI before a merge. Look up who changed the retrieval logic last quarter. Roll back when something breaks.

AgentMark treats prompts, datasets, and evals like the rest of your code. Prompts are MDX files. Datasets are JSONL. Evals are functions you import. Branches, PRs, git log, git revert: they all work the same way they do for anything else in your repo.

And when you decide to leave, your prompts are already in your repo and your traces are already in whatever OTEL backend you point them at. No export job, no vendor migration.

Want to try it on a team? Start free on AgentMark Cloud → | Read the docs →

Features

Feature	Description
Multimodal generation	Generate text, structured objects, images, and speech from a single prompt file.
Tools and agents	Declare tools by name in frontmatter; your code owns the implementations. Build agentic loops with `max_calls`.
Structured output	Type-safe JSON output via JSON Schema definitions.
Datasets and evals	Run prompts over JSONL datasets with built-in or custom evaluators.
Tracing	OpenTelemetry-native tracing for every LLM call, local and cloud.
Type safety	Auto-generated TypeScript types from your prompts. JSON Schema validation in your IDE.
Reusable components	Import and compose prompt fragments across files.
Conditionals and loops	Dynamic prompts with `<If>`, `<ForEach>`, props, and filter functions.
File attachments	Attach images and documents for vision and document tasks.
MCP servers	Call Model Context Protocol tools directly from prompts.
MCP server	Drive the full AgentMark API — traces, datasets, scores, deployments — from Claude Code, Cursor, or any MCP client.
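
The "Tools and agents" row above pairs two ideas: tool names resolve to implementations in your code, and `max_calls` caps how many model calls an agentic loop may make. The sketch below shows one way such a bounded loop could work. `ModelTurn`, `runAgentLoop`, and the scripted model are hypothetical, and the loop is synchronous only to keep the example short; real model and tool calls would be asynchronous.

```typescript
// A model turn either requests a tool call or returns a final answer.
type ModelTurn =
  | { kind: "tool_call"; tool: string; args: Record<string, string> }
  | { kind: "final"; text: string };

type ToolFn = (args: Record<string, string>) => string;

interface LoopResult {
  text: string | null; // null when the budget ran out before a final answer
  calls: number; // number of model calls actually made
}

function runAgentLoop(
  callModel: (transcript: string[]) => ModelTurn,
  toolset: Record<string, ToolFn>,
  maxCalls: number,
): LoopResult {
  const transcript: string[] = [];
  let calls = 0;
  while (calls < maxCalls) {
    const turn = callModel(transcript);
    calls += 1;
    if (turn.kind === "final") {
      return { text: turn.text, calls };
    }
    // Resolve the declared tool name to its implementation in application code.
    const tool = toolset[turn.tool];
    const result =
      tool === undefined ? `unknown tool: ${turn.tool}` : tool(turn.args);
    transcript.push(`${turn.tool} -> ${result}`);
  }
  return { text: null, calls };
}

// Scripted model: first asks for a search, then answers from the tool result.
const scriptedModel = (transcript: string[]): ModelTurn =>
  transcript.length === 0
    ? { kind: "tool_call", tool: "search_knowledgebase", args: { query: "shipping" } }
    : { kind: "final", text: `Per our knowledge base: ${transcript[0] ?? ""}` };

const demoTools: Record<string, ToolFn> = {
  search_knowledgebase: () => "3-5 business days",
};

const withinBudget = runAgentLoop(scriptedModel, demoTools, 2);
const overBudget = runAgentLoop(scriptedModel, demoTools, 1);
```

With a budget of 2 the scripted model gets one tool round-trip and then answers; with a budget of 1 the loop stops after the tool call and reports that no final answer was produced.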

Bring your own SDK

AgentMark doesn't call LLM APIs directly, and there are no SDK-specific adapters to install. Prompts render to a neutral { messages, ...config } shape that you hand to whatever SDK you already use — so you keep your existing client, retry logic, and auth:

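import { createAgentMark } from "@agentmark-ai/prompt-core";

// `loader` is a prompt loader you construct elsewhere; its setup is not shown here.
const agentmark = createAgentMark({ loader });
const prompt = await agentmark.loadTextPrompt("customer-support.prompt.mdx");

// Props must satisfy the prompt's input_schema.
const props = { customer_question: "How long does shipping take?" };
const { messages, ...config } = await prompt.format({ props });
// hand `messages` + `config` to your SDK of choice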

See the bring-your-own-SDK guide for the full integration path, including the createExecutor builder that lets AgentMark Cloud and agentmark dev run prompts through your SDK.
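
How the hand-off looks depends on your client. As a hedged sketch, the adapter below maps a rendered `{ messages, ...config }` object onto a generic chat-style request body. `ChatRequest`, `toChatRequest`, the `max_tokens` key, the fallback of 1024 tokens, and the reading of `model_name` as `provider/model` are all assumptions made for this example, not a documented contract; check your SDK's request type before adapting it.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Assumed rendered input: messages plus a couple of config keys.
interface RenderedInput {
  messages: ChatMessage[];
  model_name: string; // assumed to look like "provider/model"
  max_tokens?: number;
}

// Hypothetical request body for a chat-style client.
interface ChatRequest {
  provider: string;
  model: string;
  messages: ChatMessage[];
  maxTokens: number;
}

function toChatRequest(input: RenderedInput): ChatRequest {
  const { messages: chatMessages, model_name, max_tokens } = input;
  const slash = model_name.indexOf("/");
  if (slash <= 0 || slash === model_name.length - 1) {
    throw new Error(`Expected "provider/model", got "${model_name}"`);
  }
  return {
    provider: model_name.slice(0, slash),
    model: model_name.slice(slash + 1),
    messages: chatMessages,
    maxTokens: max_tokens ?? 1024, // arbitrary fallback for this sketch
  };
}

const chatRequest = toChatRequest({
  messages: [{ role: "user", content: "How long does shipping take?" }],
  model_name: "anthropic/claude-sonnet-4-20250514",
});
```

Keeping the mapping in one small function like this means retries, auth, and client construction stay in your existing code, and only the adapter changes if you switch SDKs.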

Language support

Language	Status
TypeScript / JavaScript	Supported
Python	Supported
Others	Open an issue

Examples

See the examples/ directory for complete, runnable projects:

Hello World: the simplest possible prompt
Structured Output: extract typed JSON with a schema
Tool Use: an agent with tool calling
Reusable Components: import and compose prompts
Evaluations: test prompts against datasets
Production Tracing: trace LLM calls with the SDK

Packages

Package	Description
`@agentmark-ai/cli`	CLI for local development, prompt running, experiments, and building.
`@agentmark-ai/sdk`	SDK for tracing and cloud platform integration.
`@agentmark-ai/prompt-core`	Core prompt parsing and formatting engine.
`@agentmark-ai/templatedx`	MDX-based template engine with JSX components, conditionals, and loops.
`@agentmark-ai/mcp-server`	MCP server exposing the AgentMark API to Claude Code, Cursor, and other MCP clients.
`@agentmark-ai/model-registry`	Centralized LLM model metadata and pricing.
`create-agentmark`	Project scaffolding tool.

Version compatibility

Packages are versioned independently, and each release line is tested only against the pairings below. Mixing versions outside them can fail at runtime rather than at install time: @agentmark-ai/sdk imports @agentmark-ai/prompt-core lazily, so a mismatch surfaces only when runExperiment or the webhook runner first executes.

`@agentmark-ai/sdk`	`@agentmark-ai/prompt-core`	`@agentmark-ai/cli`
2.x	≥1.0	≥0.21
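
Because a mismatch only shows up when the lazy import runs, a startup check can surface it earlier. The sketch below encodes the single row of the table above (sdk 2.x needs prompt-core ≥1.0 and cli ≥0.21). `parseVersion`, `atLeast`, and `checkCompat` are hypothetical helpers, not part of any AgentMark package, and the version strings would come from your own `package.json` or lockfile lookups. Pre-release suffixes are ignored for simplicity.

```typescript
type Version = [number, number, number];

// Parse the leading "major.minor.patch" of a version string.
function parseVersion(v: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(v);
  if (match === null) {
    throw new Error(`Not a semver string: ${v}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True when `v` is greater than or equal to `min`, compared field by field.
function atLeast(v: Version, min: Version): boolean {
  for (let i = 0; i < 3; i++) {
    const a = v[i] ?? 0;
    const b = min[i] ?? 0;
    if (a !== b) return a > b;
  }
  return true;
}

interface InstalledVersions {
  sdk: string;
  promptCore: string;
  cli: string;
}

// Returns a list of problems; an empty list means the pairing is a tested one.
function checkCompat(installed: InstalledVersions): string[] {
  const problems: string[] = [];
  const sdk = parseVersion(installed.sdk);
  if (sdk[0] !== 2) {
    problems.push(`no tested pairing known for sdk ${installed.sdk}`);
    return problems;
  }
  if (!atLeast(parseVersion(installed.promptCore), [1, 0, 0])) {
    problems.push(`sdk 2.x expects prompt-core >= 1.0, found ${installed.promptCore}`);
  }
  if (!atLeast(parseVersion(installed.cli), [0, 21, 0])) {
    problems.push(`sdk 2.x expects cli >= 0.21, found ${installed.cli}`);
  }
  return problems;
}

const compatProblems = checkCompat({ sdk: "2.3.1", promptCore: "1.0.0", cli: "0.20.9" });
```

Here `compatProblems` contains one entry, for the CLI version, which a startup script could log or turn into a hard failure.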

The loader packages (@agentmark-ai/loader-api, @agentmark-ai/loader-file) are re-export shims for the @agentmark-ai/prompt-core/loader-api and @agentmark-ai/prompt-core/loader-file subpaths. Prefer the prompt-core subpaths in new code.

Self-host vs Cloud

AgentMark is open-core. The full development loop runs locally with no cloud dependency.

Self-hosted (this repo, AGPL-3.0). CLI, SDK, prompt engine, local trace UI (agentmark dev), eval runner, MCP server. Ship to production using only what's in this repo, and forward traces to any OpenTelemetry backend.
AgentMark Cloud (hosted, proprietary). The team layer on top: persistent trace storage, dashboards, collaborative prompt editing, annotations, alerts, and two-way Git sync. Free tier covers most small teams.

If you only need observability and you already have an OTEL backend, the self-hosted setup is enough. Cloud is for teams that want the dashboard, collaboration, and managed trace storage.

AgentMark Cloud

AgentMark Cloud adds the team layer:

Persistent trace storage with search, filtering, and saved views
Dashboards for cost, latency, and quality metrics
Collaborative prompt editing with version history
Annotations and human evaluation workflows
Alerts for quality regressions, cost spikes, and latency
Two-way Git sync: prompts edited in the dashboard land as commits in your repo, and commits in your repo show up in the dashboard.

The free tier covers most small teams. Try Cloud free →

Contributing

We welcome contributions. See CONTRIBUTING.md.

Community

GitHub Issues: bugs and feature requests
GitHub Discussions: questions, ideas, and help
LinkedIn: product updates and team posts
Docs: reference, guides, and tutorials

License

GNU Affero General Public License v3.0 or later
