CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

The Ambient Code Platform is a Kubernetes-native AI automation platform that orchestrates intelligent agentic sessions through containerized microservices. The platform enables AI-powered automation for analysis, research, development, and content creation tasks via a modern web interface.

Note: This project was formerly known as "vTeam". Technical artifacts (image names, namespaces, API groups, routes) still use "vteam" for backward compatibility. The docs use ACP naming.

Amber Background Agent

The platform includes Amber, a background agent that automates common development tasks via GitHub Issues. Team members can trigger automated fixes, refactoring, and test additions without requiring direct access to Claude Code.

Quick Links:

Amber Quickstart - Get started in 5 minutes
Full Documentation - Complete automation guide
Amber Config - Automation policies

Common Workflows:

🤖 Auto-Fix (label: amber:auto-fix): Formatting, linting, trivial fixes
🔧 Refactoring (label: amber:refactor): Break large files, extract patterns
🧪 Test Coverage (label: amber:test-coverage): Add missing tests

Core Architecture

The system follows a Kubernetes-native pattern with Custom Resources, Operators, and Job execution:

Frontend (NextJS + Shadcn): Web UI for session management and monitoring
Backend API (Go + Gin): REST API managing Kubernetes Custom Resources with multi-tenant project isolation
Agentic Operator (Go): Kubernetes controller watching CRs and creating Jobs
Claude Code Runner (Python): Job pods executing Claude Code CLI with multi-agent collaboration

Agentic Session Flow

User Creates Session → Backend Creates CR → Operator Spawns Job →
Pod Runs Claude CLI → Results Stored in CR → UI Displays Progress

Memory System - Loadable Context

This repository uses a structured memory system to provide targeted, loadable context instead of relying solely on this comprehensive CLAUDE.md file.

Quick Reference

Load these files when working in specific areas:

Task Type	Context File	Architecture View	Pattern File
Backend API work	`.claude/context/backend-development.md`	`repomix-analysis/03-architecture-only.xml`	`.claude/patterns/k8s-client-usage.md`
Frontend UI work	`.claude/context/frontend-development.md`	`repomix-analysis/03-architecture-only.xml`	`.claude/patterns/react-query-usage.md`
Security review	`.claude/context/security-standards.md`	`repomix-analysis/03-architecture-only.xml`	`.claude/patterns/error-handling.md`
Architecture questions	-	`repomix-analysis/03-architecture-only.xml`	See ADRs below

Note: We use a single repomix architecture view (grade 8.8/10, 187K tokens) for all tasks. See .claude/repomix-guide.md for details.

Available Memory Files

1. Context Files (.claude/context/)

backend-development.md - Go backend, K8s integration, handler patterns
frontend-development.md - NextJS, Shadcn UI, React Query patterns
security-standards.md - Auth, RBAC, token handling, security patterns

2. Architectural Decision Records (docs/adr/)

Documents WHY decisions were made, not just WHAT
0001-kubernetes-native-architecture.md
0002-user-token-authentication.md
0003-multi-repo-support.md
0004-go-backend-python-runner.md
0005-nextjs-shadcn-react-query.md

3. Code Pattern Catalog (.claude/patterns/)

error-handling.md - Consistent error patterns (backend, operator, runner)
k8s-client-usage.md - When to use user token vs. service account
react-query-usage.md - Data fetching patterns (queries, mutations, caching)

4. Repomix Usage Guide (.claude/repomix-guide.md)

Guide for using the architecture view effectively
Why we use a single view approach (vs. 7 views)

5. Decision Log (docs/decisions.md)

Lightweight chronological record of major decisions
Links to ADRs, code, and context files

Example Usage

"Claude, load the architecture view (repomix-analysis/03-architecture-only.xml) and the
backend-development context file, then help me add a new endpoint for listing RFE workflows."

"Claude, load the architecture view and security-standards context file,
then review this PR for token handling issues."

"Claude, check ADR-0002 (User Token Authentication) and explain why we use user tokens
instead of service accounts for API operations."

Development Commands

Quick Start - Local Development

Single command setup with OpenShift Local (CRC):

# Prerequisites: brew install crc
# Get free Red Hat pull secret from console.redhat.com/openshift/create/local
make dev-start

# Access at https://vteam-frontend-vteam-dev.apps-crc.testing

Hot-reloading development:

# Terminal 1
DEV_MODE=true make dev-start

# Terminal 2 (separate terminal)
make dev-sync

Building Components

# Build all container images (default: docker, linux/amd64)
make build-all

# Build with podman
make build-all CONTAINER_ENGINE=podman

# Build for ARM64
make build-all PLATFORM=linux/arm64

# Build individual components
make build-frontend
make build-backend
make build-operator
make build-runner

# Push to registry
make push-all REGISTRY=quay.io/your-username

Deployment

# Deploy with default images from quay.io/ambient_code
make deploy

# Deploy to custom namespace
make deploy NAMESPACE=my-namespace

# Deploy with custom images
cd components/manifests
cp env.example .env
# Edit .env with ANTHROPIC_API_KEY and CONTAINER_REGISTRY
./deploy.sh

# Clean up deployment
make clean

Component Development

See component-specific documentation for detailed development commands:

Backend (components/backend/README.md): Go API development, testing, linting
Frontend (components/frontend/README.md): NextJS development, see also DESIGN_GUIDELINES.md
Operator (components/operator/README.md): Operator development, watch patterns
Claude Code Runner (components/runners/claude-code-runner/README.md): Python runner development

Common commands:

make build-all         # Build all components
make deploy            # Deploy to cluster
make test              # Run tests
make lint              # Lint code

Documentation

# Install documentation dependencies
pip install -r requirements-docs.txt

# Serve locally at http://127.0.0.1:8000
mkdocs serve

# Build static site
mkdocs build

# Deploy to GitHub Pages
mkdocs gh-deploy

# Markdown linting
markdownlint 'docs/**/*.md'   # Quote the glob so markdownlint expands ** itself

Local Development Helpers

# View logs
make dev-logs              # Both backend and frontend
make dev-logs-backend      # Backend only
make dev-logs-frontend     # Frontend only
make dev-logs-operator     # Operator only

# Operator management
make dev-restart-operator  # Restart operator deployment
make dev-operator-status   # Show operator status and events

# Cleanup
make dev-stop              # Stop processes, keep CRC running
make dev-stop-cluster      # Stop processes and shutdown CRC
make dev-clean             # Stop and delete OpenShift project

# Testing
make dev-test              # Run smoke tests
make dev-test-operator     # Test operator only

Key Architecture Patterns

Custom Resource Definitions (CRDs)

The platform defines three primary CRDs:

AgenticSession (agenticsessions.vteam.ambient-code): Represents an AI execution session
- Spec: prompt, repos (multi-repo support), interactive mode, timeout, model selection
- Status: phase, startTime, completionTime, results, error messages, per-repo push status
ProjectSettings (projectsettings.vteam.ambient-code): Project-scoped configuration
- Manages API keys, default models, timeout settings
- Namespace-isolated for multi-tenancy
RFEWorkflow (rfeworkflows.vteam.ambient-code): RFE (Request For Enhancement) workflows
- 7-step agent council process for engineering refinement
- Agent roles: PM, Architect, Staff Engineer, PO, Team Lead, Team Member, Delivery Owner

Multi-Repo Support

AgenticSessions support operating on multiple repositories simultaneously:

Each repo has required input (URL, branch) and optional output (fork/target) configuration
mainRepoIndex specifies which repo is the Claude working directory (default: 0)
Per-repo status tracking: pushed or abandoned
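
The mainRepoIndex default can be sketched as below. This is a minimal illustration, not the platform's actual types: the Repo struct, its field names, and the mainRepo helper are hypothetical, and a nil pointer stands in for "mainRepoIndex not set".

```go
package main

import "fmt"

// RepoStatus mirrors the two per-repo outcomes named above.
type RepoStatus string

const (
	RepoPushed    RepoStatus = "pushed"
	RepoAbandoned RepoStatus = "abandoned"
)

// Repo is a hypothetical, simplified shape for one entry in a session's repo list.
type Repo struct {
	URL    string     // required input
	Branch string     // required input
	Target string     // optional output (fork/target); empty when unset
	Status RepoStatus // set once the session finishes with this repo
}

// mainRepo returns the repo used as the Claude working directory.
// A nil mainRepoIndex means "not set" and falls back to index 0.
func mainRepo(repos []Repo, mainRepoIndex *int) (Repo, error) {
	idx := 0
	if mainRepoIndex != nil {
		idx = *mainRepoIndex
	}
	if idx < 0 || idx >= len(repos) {
		return Repo{}, fmt.Errorf("mainRepoIndex %d out of range for %d repos", idx, len(repos))
	}
	return repos[idx], nil
}
```

Validating the index up front keeps an out-of-range value from surfacing later as a panic inside a handler or the operator.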

Interactive vs Batch Mode

Batch Mode (default): Single prompt execution with timeout
Interactive Mode (interactive: true): Long-running chat sessions using inbox/outbox files

Backend API Structure

The Go backend (components/backend/) implements:

Project-scoped endpoints: /api/projects/:project/* for namespaced resources
Multi-tenant isolation: Each project maps to a Kubernetes namespace
WebSocket support: Real-time session updates via websocket_messaging.go
Git operations: Repository cloning, forking, PR creation via git.go
RBAC integration: OpenShift OAuth for authentication

Main handler logic in handlers.go (3906 lines) manages:

Project CRUD operations
AgenticSession lifecycle
ProjectSettings management
RFE workflow orchestration

Operator Reconciliation Loop

The Kubernetes operator (components/operator/) watches for:

AgenticSession creation/updates → spawns Jobs with runner pods
Job completion → updates CR status with results
Timeout handling and cleanup

Runner Execution

The Claude Code runner (components/runners/claude-code-runner/) provides:

Claude Code SDK integration (claude-code-sdk>=0.0.23)
Workspace synchronization via PVC proxy
Multi-agent collaboration capabilities
Anthropic API streaming (anthropic>=0.68.0)

Configuration Standards

Python

Virtual environments: Always use python -m venv venv or uv venv
Package manager: Prefer uv over pip
Formatting: black (double quotes)
Import sorting: isort with black profile
Linting: flake8 (ignore E203, W503)

Go

Formatting: go fmt ./... (enforced)
Linting: golangci-lint (install via make install-tools)
Testing: Table-driven tests with subtests
Error handling: Explicit error returns, no panic in production code

Container Images

Default registry: quay.io/ambient_code
Image tags: Component-specific (vteam_frontend, vteam_backend, vteam_operator, vteam_claude_runner)
Platform: Default linux/amd64, ARM64 supported via PLATFORM=linux/arm64
Build tool: Docker or Podman (CONTAINER_ENGINE=podman)

Git Workflow

Default branch: main
Feature branches: Required for development
Commit style: Conventional commits (squashed on merge)
Branch verification: Always check current branch before file modifications

Kubernetes/OpenShift

Default namespace: ambient-code (production), vteam-dev (local dev)
CRD group: vteam.ambient-code
API version: v1alpha1 (current)
RBAC: Namespace-scoped service accounts with minimal permissions

Langfuse Observability (LLM Tracing)

The platform includes optional Langfuse integration for LLM observability, tracking usage metrics while protecting user privacy.

Privacy-First Design

Default behavior: User messages and assistant responses are REDACTED in traces
Preserved data: Usage metrics (tokens, costs), metadata (model, turn count, timestamps)
Rationale: Track costs and usage patterns without exposing potentially sensitive user data

Configuration

Enable Langfuse (disabled by default):

# In ambient-admin-langfuse-secret
LANGFUSE_ENABLED=true
LANGFUSE_PUBLIC_KEY=<your-key>
LANGFUSE_SECRET_KEY=<your-secret>
LANGFUSE_HOST=http://langfuse-web.langfuse.svc.cluster.local:3000

Privacy Controls (optional - masking enabled by default):

# Masking is ENABLED BY DEFAULT (no environment variable needed)
# The runner defaults to LANGFUSE_MASK_MESSAGES=true if not set

# To explicitly set (optional):
LANGFUSE_MASK_MESSAGES=true

# To disable masking (dev/testing ONLY - exposes full message content):
LANGFUSE_MASK_MESSAGES=false

Deployment

Deploy Langfuse to your cluster:

# Deploy with default privacy-preserving settings
./e2e/scripts/deploy-langfuse.sh

# For OpenShift
./e2e/scripts/deploy-langfuse.sh --openshift

# For Kubernetes
./e2e/scripts/deploy-langfuse.sh --kubernetes

Implementation

Location: components/runners/claude-code-runner/observability.py
Masking function: _privacy_masking_function() - redacts content while preserving metrics
Test coverage: tests/test_privacy_masking.py - validates masking behavior

What Gets Logged

With Masking Enabled (Default):

✅ Token counts (input, output, cache read, cache creation)
✅ Cost calculations (USD per session)
✅ Model names and versions
✅ Turn counts and session durations
✅ Tool usage (names, execution status)
✅ Error states and completion status
❌ User prompts (redacted)
❌ Assistant responses (redacted)
❌ Tool outputs with long content (redacted)

With Masking Disabled (dev/testing only):

✅ All of the above
⚠️ Full user message content (potentially sensitive!)
⚠️ Full assistant response content
⚠️ Complete tool outputs

OpenTelemetry Support

Langfuse supports OpenTelemetry as of 2025:

Current implementation: Langfuse Python SDK (v3, OTel-based)
Alternative: Pure OpenTelemetry SDK → Langfuse OTLP endpoint (/api/public/otel)
Migration: Not recommended unless vendor neutrality is required
Benefit: Current SDK already uses OTel underneath

Backend and Operator Development Standards

IMPORTANT: When working on backend (components/backend/) or operator (components/operator/) code, you MUST follow these strict guidelines based on established patterns in the codebase.

Critical Rules (Never Violate)

User Token Authentication Required
- FORBIDDEN: Using backend service account for user-initiated API operations
- REQUIRED: Always use GetK8sClientsForRequest(c) to get user-scoped K8s clients
- REQUIRED: Return 401 Unauthorized if user token is missing or invalid
- Exception: Backend service account ONLY for CR writes and token minting (handlers/sessions.go:227, handlers/sessions.go:449)
Never Panic in Production Code
- FORBIDDEN: panic() in handlers, reconcilers, or any production path
- REQUIRED: Return explicit errors with context: return fmt.Errorf("failed to X: %w", err)
- REQUIRED: Log errors before returning: log.Printf("Operation failed: %v", err)
Token Security and Redaction
- FORBIDDEN: Logging tokens, API keys, or sensitive headers
- REQUIRED: Redact tokens in logs using custom formatters (server/server.go:22-34)
- REQUIRED: Use log.Printf("tokenLen=%d", len(token)) instead of logging token content
- Example: path = strings.Split(path, "?")[0] + "?token=[REDACTED]"
Type-Safe Unstructured Access
- FORBIDDEN: Direct type assertions without checking: obj.Object["spec"].(map[string]interface{})
- REQUIRED: Use unstructured.Nested* helpers with three-value returns
- Example: spec, found, err := unstructured.NestedMap(obj.Object, "spec")
- REQUIRED: Check found before using values; handle type mismatches gracefully
OwnerReferences for Resource Lifecycle
- REQUIRED: Set OwnerReferences on all child resources (Jobs, Secrets, PVCs, Services)
- REQUIRED: Use Controller: boolPtr(true) for primary owner
- FORBIDDEN: BlockOwnerDeletion (causes permission issues in multi-tenant environments)
- Pattern: (operator/internal/handlers/sessions.go:125-134, handlers/sessions.go:470-476)
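
To show why the three-value return in the Type-Safe Unstructured Access rule matters, here is a standard-library-only stand-in for the unstructured.Nested* helpers. The nestedMap function is hypothetical and simplified (for example, it does not deep-copy the result); real code should call the apimachinery helpers.

```go
package main

import "fmt"

// nestedMap walks obj along fields and returns (value, found, err).
// Missing fields report found=false with no error; a value of the wrong
// type reports an error instead of panicking like a bare type assertion.
func nestedMap(obj map[string]interface{}, fields ...string) (map[string]interface{}, bool, error) {
	cur := obj
	for i, f := range fields {
		val, ok := cur[f]
		if !ok {
			return nil, false, nil // missing is not an error
		}
		m, ok := val.(map[string]interface{})
		if !ok {
			return nil, false, fmt.Errorf("field %q at depth %d is %T, not a map", f, i, val)
		}
		cur = m
	}
	return cur, true, nil
}
```

Callers check err first, then found, and only then use the value, which is the same order the rule above requires for unstructured.NestedMap.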

Package Organization

Backend Structure (components/backend/):

backend/
├── handlers/          # HTTP handlers grouped by resource
│   ├── sessions.go    # AgenticSession CRUD + lifecycle
│   ├── projects.go    # Project management
│   ├── rfe.go         # RFE workflows
│   ├── helpers.go     # Shared utilities (StringPtr, etc.)
│   └── middleware.go  # Auth, validation, RBAC
├── types/             # Type definitions (no business logic)
│   ├── session.go
│   ├── project.go
│   └── common.go
├── server/            # Server setup, CORS, middleware
├── k8s/               # K8s resource templates
├── git/, github/      # External integrations
├── websocket/         # Real-time messaging
├── routes.go          # HTTP route registration
└── main.go            # Wiring, dependency injection

Operator Structure (components/operator/):

operator/
├── internal/
│   ├── config/        # K8s client init, config loading
│   ├── types/         # GVR definitions, resource helpers
│   ├── handlers/      # Watch handlers (sessions, namespaces, projectsettings)
│   └── services/      # Reusable services (PVC provisioning, etc.)
└── main.go            # Watch coordination

Rules:

Handlers contain HTTP/watch logic ONLY
Types are pure data structures
Business logic in separate service packages
No cyclic dependencies between packages

Kubernetes Client Patterns

User-Scoped Clients (for API operations):

// ALWAYS use for user-initiated operations (list, get, create, update, delete)
reqK8s, reqDyn := GetK8sClientsForRequest(c)
if reqK8s == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"})
    c.Abort()
    return
}
// Use reqDyn for CR operations in user's authorized namespaces
list, err := reqDyn.Resource(gvr).Namespace(project).List(ctx, v1.ListOptions{})

Backend Service Account Clients (limited use cases):

// ONLY use for:
// 1. Writing CRs after validation (handlers/sessions.go:417)
// 2. Minting tokens/secrets for runners (handlers/sessions.go:449)
// 3. Cross-namespace operations backend is authorized for
// Available as: DynamicClient, K8sClient (package-level in handlers/)
created, err := DynamicClient.Resource(gvr).Namespace(project).Create(ctx, obj, v1.CreateOptions{})

Never:

❌ Fall back to service account when user token is invalid
❌ Use service account for list/get operations on behalf of users
❌ Skip RBAC checks by using elevated permissions

Error Handling Patterns

Handler Errors:

// Pattern 1: Resource not found
if errors.IsNotFound(err) {
    c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"})
    return
}

// Pattern 2: Log + return error
if err != nil {
    log.Printf("Failed to create session %s in project %s: %v", name, project, err)
    c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to create session"})
    return
}

// Pattern 3: Non-fatal errors (continue operation)
if err := updateStatus(...); err != nil {
    log.Printf("Warning: status update failed: %v", err)
    // Continue - session was created successfully
}

Operator Errors:

// Pattern 1: Resource deleted during processing (non-fatal)
if errors.IsNotFound(err) {
    log.Printf("AgenticSession %s no longer exists, skipping", name)
    return nil  // Don't treat as error
}

// Pattern 2: Retriable errors in watch loop
if err != nil {
    log.Printf("Failed to create job: %v", err)
    updateAgenticSessionStatus(ns, name, map[string]interface{}{
        "phase": "Error",
        "message": fmt.Sprintf("Failed to create job: %v", err),
    })
    return fmt.Errorf("failed to create job: %w", err)
}

Never:

❌ Silent failures (always log errors)
❌ Generic error messages ("operation failed")
❌ Retrying indefinitely without backoff
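
One way to avoid retrying indefinitely without backoff is a bounded retry with capped exponential delay. The sketch below is illustrative only: baseDelay, maxDelay, backoffDelay, and retry are hypothetical names and values, not helpers from this codebase.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 30 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based),
// doubling each time and capping at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// retry calls op until it succeeds or maxAttempts is exhausted,
// sleeping with exponential backoff between failed attempts.
func retry(maxAttempts int, op func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			time.Sleep(backoffDelay(attempt))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}
```

The final error wraps the last failure with %w, so callers can still inspect the underlying cause with errors.Is or errors.As.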

Resource Management

OwnerReferences Pattern:

// Always set owner when creating child resources
ownerRef := v1.OwnerReference{
    APIVersion: obj.GetAPIVersion(),  // e.g., "vteam.ambient-code/v1alpha1"
    Kind:       obj.GetKind(),        // e.g., "AgenticSession"
    Name:       obj.GetName(),
    UID:        obj.GetUID(),
    Controller: boolPtr(true),        // Only one controller per resource
    // BlockOwnerDeletion: intentionally omitted (permission issues)
}

// Apply to child resources
job := &batchv1.Job{
    ObjectMeta: v1.ObjectMeta{
        Name: jobName,
        Namespace: namespace,
        OwnerReferences: []v1.OwnerReference{ownerRef},
    },
    // ...
}

Cleanup Patterns:

// Rely on OwnerReferences for automatic cleanup, but delete explicitly when needed
policy := v1.DeletePropagationBackground
err := K8sClient.BatchV1().Jobs(ns).Delete(ctx, jobName, v1.DeleteOptions{
    PropagationPolicy: &policy,
})
if err != nil && !errors.IsNotFound(err) {
    log.Printf("Failed to delete job: %v", err)
    return err
}

Security Patterns

Token Handling:

// Extract token from Authorization header
rawAuth := c.GetHeader("Authorization")
parts := strings.SplitN(rawAuth, " ", 2)
if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "invalid Authorization header"})
    return
}
token := strings.TrimSpace(parts[1])

// NEVER log the token itself
log.Printf("Processing request with token (len=%d)", len(token))
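
The header parsing above can be factored into a small helper so it can be unit-tested without Gin. This is a sketch under assumptions: bearerToken and errInvalidAuthHeader are hypothetical names, and rejecting an empty token is an added check that the snippet above does not perform.

```go
package main

import (
	"errors"
	"strings"
)

// errInvalidAuthHeader is returned for any header that is not "Bearer <token>".
var errInvalidAuthHeader = errors.New("invalid Authorization header")

// bearerToken extracts the token from an Authorization header value.
// The scheme is matched case-insensitively and surrounding whitespace
// around the token is trimmed.
func bearerToken(rawAuth string) (string, error) {
	parts := strings.SplitN(rawAuth, " ", 2)
	if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") {
		return "", errInvalidAuthHeader
	}
	token := strings.TrimSpace(parts[1])
	if token == "" {
		return "", errInvalidAuthHeader // "Bearer " with nothing after it
	}
	return token, nil
}
```

A handler would map the returned error to 401 Unauthorized and log only len(token), never the token itself.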

RBAC Enforcement:

// Always check permissions before operations
ssar := &authv1.SelfSubjectAccessReview{
    Spec: authv1.SelfSubjectAccessReviewSpec{
        ResourceAttributes: &authv1.ResourceAttributes{
            Group:     "vteam.ambient-code",
            Resource:  "agenticsessions",
            Verb:      "list",
            Namespace: project,
        },
    },
}
res, err := reqK8s.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, v1.CreateOptions{})
if err != nil || !res.Status.Allowed {
    c.JSON(http.StatusForbidden, gin.H{"error": "Unauthorized"})
    return
}

Container Security:

// Always set SecurityContext for Job pods
SecurityContext: &corev1.SecurityContext{
    AllowPrivilegeEscalation: boolPtr(false),
    ReadOnlyRootFilesystem:   boolPtr(false),  // false only because the runner writes temp files
    Capabilities: &corev1.Capabilities{
        Drop: []corev1.Capability{"ALL"},  // Drop all by default
    },
},

API Design Patterns

Project-Scoped Endpoints:

// Standard pattern: /api/projects/:projectName/resource
r.GET("/api/projects/:projectName/agentic-sessions", ValidateProjectContext(), ListSessions)
r.POST("/api/projects/:projectName/agentic-sessions", ValidateProjectContext(), CreateSession)
r.GET("/api/projects/:projectName/agentic-sessions/:sessionName", ValidateProjectContext(), GetSession)

// ValidateProjectContext middleware:
// 1. Extracts project from route param
// 2. Validates user has access via RBAC check
// 3. Sets project in context: c.Set("project", projectName)

Middleware Chain:

// Order matters: Recovery → Logging → CORS → Identity → Validation → Handler
r.Use(gin.Recovery())
r.Use(gin.LoggerWithFormatter(customRedactingFormatter))
r.Use(cors.New(corsConfig))
r.Use(forwardedIdentityMiddleware())  // Extracts X-Forwarded-User, etc.
r.Use(ValidateProjectContext())       // RBAC check
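
The customRedactingFormatter used in the chain is not shown in this file. As a rough idea of the path-redaction step such a formatter might apply, following the strings.Split example under Critical Rules, consider this sketch. The redactPath name is hypothetical, and the real implementation in server/server.go may differ.

```go
package main

import "strings"

// redactPath hides a token passed in the query string before the path is logged.
// Like the example in Critical Rules, it drops the whole query string rather
// than trying to preserve the other parameters.
func redactPath(path string) string {
	base, query, hasQuery := strings.Cut(path, "?")
	if !hasQuery || !strings.Contains(query, "token=") {
		return path // nothing sensitive to hide
	}
	return base + "?token=[REDACTED]"
}
```

Dropping the entire query string is deliberately conservative: it avoids leaking a token that appears alongside other parameters in an unexpected position.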

Response Patterns:

// Success with data
c.JSON(http.StatusOK, gin.H{"items": sessions})

// Success with created resource
c.JSON(http.StatusCreated, gin.H{"message": "Session created", "name": name, "uid": uid})

// Success with no content
c.Status(http.StatusNoContent)

// Errors with structured messages
c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid request"})

Operator Patterns

Watch Loop with Reconnection:

func WatchAgenticSessions() {
    gvr := types.GetAgenticSessionResource()

    for {  // Infinite loop with reconnection
        watcher, err := config.DynamicClient.Resource(gvr).Watch(ctx, v1.ListOptions{})
        if err != nil {
            log.Printf("Failed to create watcher: %v", err)
            time.Sleep(5 * time.Second)  // Backoff before retry
            continue
        }

        log.Println("Watching for events...")

        for event := range watcher.ResultChan() {
            switch event.Type {
            case watch.Added, watch.Modified:
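                // Checked assertion: watch.Error events carry a non-Unstructured object
                obj, ok := event.Object.(*unstructured.Unstructured)
                if !ok {
                    log.Printf("Unexpected object type %T, skipping", event.Object)
                    continue
                }
                if err := handleEvent(obj); err != nil {
                    log.Printf("Failed to handle event: %v", err)
                }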
            case watch.Deleted:
                // Handle cleanup
            }
        }

        log.Println("Watch channel closed, restarting...")
        watcher.Stop()
        time.Sleep(2 * time.Second)
    }
}

Reconciliation Pattern:

func handleEvent(obj *unstructured.Unstructured) error {
    name := obj.GetName()
    namespace := obj.GetNamespace()

    // 1. Verify resource still exists (avoid race conditions)
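    currentObj, err := getDynamicClient().Get(ctx, name, namespace)
    if errors.IsNotFound(err) {
        log.Printf("Resource %s no longer exists, skipping", name)
        return nil  // Not an error
    }
    if err != nil {
        return fmt.Errorf("failed to get resource %s: %w", name, err)
    }

    // 2. Get current phase/status
    status, _, err := unstructured.NestedMap(currentObj.Object, "status")
    if err != nil {
        return fmt.Errorf("failed to read status of %s: %w", name, err)
    }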
    phase := getPhaseOrDefault(status, "Pending")

    // 3. Only reconcile if in expected state
    if phase != "Pending" {
        return nil  // Already processed
    }

    // 4. Create resources idempotently (check existence first)
    if _, err := getResource(name); err == nil {
        log.Printf("Resource %s already exists", name)
        return nil
    }

    // 5. Create and update status
    createResource(...)
    updateStatus(namespace, name, map[string]interface{}{"phase": "Creating"})

    return nil
}
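
The getPhaseOrDefault helper called in step 2 is not defined in this file. A plausible minimal version, offered as an assumption about its behavior rather than the operator's actual code, is:

```go
package main

// getPhaseOrDefault returns status["phase"] when it is a non-empty string,
// and def otherwise (missing status, missing key, or wrong type).
func getPhaseOrDefault(status map[string]interface{}, def string) string {
	// Indexing a nil map is safe in Go and yields the zero value,
	// so a session with no status yet falls through to the default.
	phase, ok := status["phase"].(string)
	if !ok || phase == "" {
		return def
	}
	return phase
}
```

With this shape, a freshly created AgenticSession with no status is treated as "Pending", which is what lets step 3 reconcile it exactly once.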

Status Updates (use UpdateStatus subresource):

func updateAgenticSessionStatus(namespace, name string, updates map[string]interface{}) error {
    gvr := types.GetAgenticSessionResource()

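    obj, err := config.DynamicClient.Resource(gvr).Namespace(namespace).Get(ctx, name, v1.GetOptions{})
    if errors.IsNotFound(err) {
        log.Printf("Resource deleted, skipping status update")
        return nil  // Not an error
    }
    if err != nil {
        return fmt.Errorf("failed to get %s/%s: %w", namespace, name, err)
    }

    if obj.Object["status"] == nil {
        obj.Object["status"] = make(map[string]interface{})
    }

    // Checked assertion: a malformed status must not panic the operator
    status, ok := obj.Object["status"].(map[string]interface{})
    if !ok {
        return fmt.Errorf("status of %s/%s is %T, not a map", namespace, name, obj.Object["status"])
    }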
    for k, v := range updates {
        status[k] = v
    }

    // Use UpdateStatus subresource (requires /status permission)
    _, err = config.DynamicClient.Resource(gvr).Namespace(namespace).UpdateStatus(ctx, obj, v1.UpdateOptions{})
    if errors.IsNotFound(err) {
        return nil  // Resource deleted during update
    }
    return err
}

Goroutine Monitoring:

// Start background monitoring (operator/internal/handlers/sessions.go:477)
go monitorJob(jobName, sessionName, namespace)

// Monitoring loop checks both K8s Job status AND custom container status
func monitorJob(jobName, sessionName, namespace string) {
    for {
        time.Sleep(5 * time.Second)

        // 1. Check if parent resource still exists (exit if deleted)
        if _, err := getSession(namespace, sessionName); errors.IsNotFound(err) {
            log.Printf("Session deleted, stopping monitoring")
            return
        }

        // 2. Check Job status
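        job, err := K8sClient.BatchV1().Jobs(namespace).Get(ctx, jobName, v1.GetOptions{})
        if errors.IsNotFound(err) {
            return
        }
        if err != nil {
            log.Printf("Failed to get job %s: %v", jobName, err)
            continue  // Transient error: retry on the next tick
        }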

        // 3. Update status based on Job conditions
        if job.Status.Succeeded > 0 {
            updateStatus(namespace, sessionName, map[string]interface{}{
                "phase": "Completed",
                "completionTime": time.Now().Format(time.RFC3339),
            })
            cleanup(namespace, jobName)
            return
        }
    }
}
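
The monitoring loop above exits when the session or Job disappears. A complementary way to guard against goroutine leaks is to drive the loop from a context, so it also stops on cancellation or a deadline. The sketch below uses only the standard library; pollUntil and monitorWithTimeout are hypothetical names, and the check callback stands in for the Kubernetes lookups.

```go
package main

import (
	"context"
	"time"
)

// pollUntil calls check every interval until it reports done, returns an
// error, or ctx is cancelled. interval must be positive.
func pollUntil(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker when the loop exits

	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // cancelled or deadline exceeded
		case <-ticker.C:
			done, err := check()
			if err != nil {
				return err
			}
			if done {
				return nil
			}
		}
	}
}

// monitorWithTimeout bounds the whole monitoring loop with a deadline so a
// stuck Job cannot keep the goroutine alive forever.
func monitorWithTimeout(timeout, interval time.Duration, check func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return pollUntil(ctx, interval, check)
}
```

In the operator, check would report done when the session is deleted or the Job has finished, and the caller would treat a deadline error as the session timing out.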

Pre-Commit Checklist for Backend/Operator

Before committing backend or operator code, run these checks:

# Backend
cd components/backend
gofmt -l .                    # Check formatting (should output nothing)
go vet ./...                  # Detect suspicious constructs
golangci-lint run            # Run comprehensive linting

# Operator
cd components/operator
gofmt -l .
go vet ./...
golangci-lint run

Auto-format code:

gofmt -w components/backend components/operator

Note: GitHub Actions will automatically run these checks on your PR. Fix any issues locally before pushing.

Common Mistakes to Avoid

Backend:

❌ Using service account client for user operations (always use user token)
❌ Not checking if user-scoped client creation succeeded
❌ Logging full token values (use len(token) instead)
❌ Not validating project access in middleware
❌ Type assertions without checking: val := obj["key"].(string) (use val, ok := ...)
❌ Not setting OwnerReferences (causes resource leaks)
❌ Treating IsNotFound as fatal error during cleanup
❌ Exposing internal error details to API responses (use generic messages)

Operator:

❌ Not reconnecting watch on channel close
❌ Processing events without verifying resource still exists
❌ Updating status on main object instead of /status subresource
❌ Not checking current phase before reconciliation (causes duplicate resources)
❌ Creating resources without idempotency checks
❌ Goroutine leaks (not exiting monitor when resource deleted)
❌ Using panic() in watch/reconciliation loops
❌ Not setting SecurityContext on Job pods

Reference Files

Study these files to understand established patterns:

Backend:

components/backend/handlers/sessions.go - Complete session lifecycle, user/SA client usage
components/backend/handlers/middleware.go - Auth patterns, token extraction, RBAC
components/backend/handlers/helpers.go - Utility functions (StringPtr, BoolPtr)
components/backend/types/common.go - Type definitions
components/backend/server/server.go - Server setup, middleware chain, token redaction
components/backend/routes.go - HTTP route definitions and registration

Operator:

components/operator/internal/handlers/sessions.go - Watch loop, reconciliation, status updates
components/operator/internal/config/config.go - K8s client initialization
components/operator/internal/types/resources.go - GVR definitions
components/operator/internal/services/infrastructure.go - Reusable services

GitHub Actions CI/CD

Component Build Pipeline (`.github/workflows/components-build-deploy.yml`)

Change detection: Only builds modified components (frontend, backend, operator, claude-runner)
Multi-platform builds: linux/amd64 and linux/arm64
Registry: Pushes to quay.io/ambient_code on main branch
PR builds: Build-only, no push on pull requests

Automation Workflows

amber-issue-handler.yml: Amber background agent - automated fixes via GitHub issue labels (amber:auto-fix, amber:refactor, amber:test-coverage) or /amber execute command
amber-dependency-sync.yml: Daily sync of dependency versions to Amber agent knowledge base
claude.yml: Claude Code integration - responds to @claude mentions in issues/PRs
claude-code-review.yml: Automated code reviews on pull requests

Code Quality Workflows

go-lint.yml: Go code formatting, vetting, and linting (gofmt, go vet, golangci-lint)
frontend-lint.yml: Frontend code quality (ESLint, TypeScript checking, build validation)

Deployment & Testing Workflows

prod-release-deploy.yaml: Production releases with semver versioning and changelog generation
e2e.yml: End-to-end Cypress testing in kind cluster (see Testing Strategy section)
test-local-dev.yml: Local development environment validation

Utility Workflows

docs.yml: Deploy MkDocs documentation to GitHub Pages
dependabot-auto-merge.yml: Auto-approve and merge Dependabot dependency updates

Testing Strategy

E2E Tests (Cypress + Kind)

Purpose: Automated end-to-end testing of the complete vTeam stack in a Kubernetes environment.

Location: e2e/

Quick Start:

make e2e-test CONTAINER_ENGINE=podman  # Or docker

What Gets Tested:

✅ Full vTeam deployment in kind (Kubernetes in Docker)
✅ Frontend UI rendering and navigation
✅ Backend API connectivity
✅ Project creation workflow (main user journey)
✅ Authentication with ServiceAccount tokens
✅ Ingress routing
✅ All pods deploy and become ready

What Doesn't Get Tested:

❌ OAuth proxy flow (uses direct token auth for simplicity)
❌ Session pod execution (requires Anthropic API key)
❌ Multi-user scenarios

Test Suite (e2e/cypress/e2e/vteam.cy.ts):

UI loads with token authentication
Navigate to new project page
Create a new project
List created projects
Backend API cluster-info endpoint

CI Integration: Tests run automatically on all PRs via GitHub Actions (.github/workflows/e2e.yml)

Key Implementation Details:

Architecture: Frontend without oauth-proxy, direct token injection via environment variables
Authentication: Test user ServiceAccount with cluster-admin permissions
Token Handling: Frontend deployment includes OC_TOKEN, OC_USER, OC_EMAIL env vars
Podman Support: Auto-detects runtime, uses ports 8080/8443 for rootless Podman
Ingress: Standard nginx-ingress with path-based routing

Adding New Tests:

it('should test new feature', () => {
  cy.visit('/some-page')
  cy.contains('Expected Content').should('be.visible')
  cy.get('#button').click()
  // Auth header automatically injected via beforeEach interceptor
})

Debugging Tests:

cd e2e
source .env.test
CYPRESS_TEST_TOKEN="$TEST_TOKEN" CYPRESS_BASE_URL="http://vteam.local:8080" npm run test:headed

Documentation: See e2e/README.md and docs/testing/e2e-guide.md for a comprehensive testing guide.

Backend Tests (Go)

Unit tests (tests/unit/): Isolated component logic
Contract tests (tests/contract/): API contract validation
Integration tests (tests/integration/): Run against a real Kubernetes cluster
- Requires TEST_NAMESPACE environment variable
- Set CLEANUP_RESOURCES=true for automatic cleanup
- Permission tests validate RBAC boundaries

Frontend Tests (NextJS)

Jest for component testing (when configured)
Cypress for e2e testing (see E2E Tests section above)

Operator Tests (Go)

Controller reconciliation logic tests
CRD validation tests

Documentation Structure

The MkDocs site (mkdocs.yml) provides:

User Guide: Getting started, RFE creation, agent framework, configuration
Developer Guide: Setup, architecture, plugin development, API reference, testing
Labs: Hands-on exercises (basic → advanced → production)
- Basic: First RFE, agent interaction, workflow basics
- Advanced: Custom agents, workflow modification, integration testing
- Production: Jira integration, OpenShift deployment, scaling
Reference: Agent personas, API endpoints, configuration schema, glossary

Documentation Standards

Default to improving existing documentation rather than creating new files. When adding or updating documentation (standalone files like .md, design docs, guides):

Prefer inline updates: Improve existing markdown files or code comments
Colocate new docs: Where feasible, put documentation in the subdirectory that contains the relevant code (e.g., components/backend/README.md), not at the top level
Avoid top-level proliferation: Only create top-level docs for cross-cutting concerns (architecture, security, deployment)
Follow established patterns: See docs/amber-quickstart.md and components/backend/README.md for examples of well-organized documentation

Director Training Labs

Special lab track for leadership training located in docs/labs/director-training/:

Structured exercises for understanding the Ambient Code Platform (formerly vTeam) from a strategic perspective
Validation reports for tracking completion and understanding

Production Considerations

Security

API keys: Store in Kubernetes Secrets, managed via ProjectSettings CR
RBAC: Namespace-scoped isolation prevents cross-project access
OAuth integration: OpenShift OAuth for cluster-based authentication (see docs/OPENSHIFT_OAUTH.md)
Network policies: Component isolation and secure communication

Monitoring

Health endpoints: /health on backend API
Logs: Structured logging with OpenShift integration
Metrics: Prometheus-compatible (when configured)
Events: Kubernetes events for operator actions
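As an illustration of how the /health endpoint might be consumed, the sketch below polls it until the backend responds successfully. It is a sketch under stated assumptions: `waitForHealthy` and `FetchLike` are hypothetical names, the retry counts are arbitrary, and only the HTTP status is checked because the response body format is not described here.

```typescript
// Minimal shape of a fetch-style function, so the global fetch or a test
// stub can be passed in. Hypothetical type, for illustration only.
type FetchLike = (url: string) => Promise<{ ok: boolean }>

// Polls <baseUrl>/health until it returns a 2xx status or attempts run out.
// Returns true if the backend became healthy, false otherwise.
async function waitForHealthy(
  baseUrl: string,
  fetchFn: FetchLike,
  attempts = 5,
  delayMs = 1000,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchFn(`${baseUrl}/health`)
      if (res.ok) return true
    } catch (err) {
      // Network errors are treated as "not ready yet" and retried.
    }
    if (i < attempts - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs))
    }
  }
  return false
}
```

In a runtime that provides a global `fetch`, a call could look like `waitForHealthy('http://vteam.local:8080', fetch)`, using the base URL from the debugging example earlier in this file. Whether /health is exposed at that address is an assumption.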

Scaling

Horizontal Pod Autoscaling: Configure based on CPU/memory
Job concurrency: Operator manages concurrent session execution
Resource limits: Set appropriate requests/limits per component
Multi-tenancy: Project-based isolation with shared infrastructure

Frontend Development Standards

See components/frontend/DESIGN_GUIDELINES.md for complete frontend development patterns.

Critical Rules (Quick Reference)

Zero any Types - Use proper types, unknown, or generic constraints
Shadcn UI Components Only - Use @/components/ui/* components, no custom UI from scratch
React Query for ALL Data Operations - Use hooks from @/services/queries/*, no manual fetch()
Use type over interface - Always prefer type for type definitions
Colocate Single-Use Components - Keep page-specific components with their pages
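Two of the rules above, zero `any` types and `type` over `interface`, can be illustrated with a short sketch. The `Project` shape and the function names are hypothetical and only serve the example. The authoritative patterns are in components/frontend/DESIGN_GUIDELINES.md.

```typescript
// Hypothetical shape for illustration; declared with `type`, not `interface`.
type Project = {
  name: string
  displayName?: string
}

// Type guard: narrows an `unknown` value to Project without using `any`.
function isProject(value: unknown): value is Project {
  if (typeof value !== 'object' || value === null) return false
  const candidate = value as Record<string, unknown>
  return (
    typeof candidate.name === 'string' &&
    (candidate.displayName === undefined ||
      typeof candidate.displayName === 'string')
  )
}

// Parses untrusted JSON as `unknown` first, then validates it.
function parseProject(json: string): Project {
  const parsed: unknown = JSON.parse(json)
  if (!isProject(parsed)) {
    throw new Error('Payload is not a valid Project')
  }
  return parsed
}
```

Typing the parsed value as `unknown` forces a validation step before any property is read, which is the practical benefit of banning `any`.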

Pre-Commit Checklist for Frontend

Before committing frontend code:

Reference Files

components/frontend/DESIGN_GUIDELINES.md - Detailed patterns and examples
components/frontend/COMPONENT_PATTERNS.md - Architecture patterns
components/frontend/src/components/ui/ - Available Shadcn components
components/frontend/src/services/ - API service layer examples
