AegisOps -- Multimodal Incident Review System

AegisOps turns raw incident logs and monitoring screenshots into structured, decision-ready incident reports -- complete with severity classification, root-cause analysis, timeline reconstruction, and operator handoff artifacts. Every analysis claim is backed by a deterministic replay eval suite before the live model path is trusted.

Live Demo

Surface	Link
Cloudflare Pages	https://aegisops-ai-incident-doctor.pages.dev
Google AI Studio	Open in AI Studio
Demo video	Watch on YouTube

Architecture

flowchart LR
    subgraph Intake ["1 - Intake"]
        Logs["Raw Logs"]
        Screenshots["Monitoring Screenshots"]
    end

    subgraph Analysis ["2 - Multimodal Analysis"]
        API["Express API\n(server-side key handling)"]
        Gemini["Gemini"]
        OpenAI["OpenAI"]
        Ollama["Ollama (local)"]
        Demo["Demo mode\n(no keys)"]
    end

    subgraph Output ["3 - Structured Output"]
        Report["Incident Report\n(JSON schema)"]
        Postmortem["Postmortem Pack\nSeverity + RCA + Timeline"]
        FollowUp["Follow-Up Q&A\n(grounded on report)"]
        TTS["Audio Briefing\n(TTS)"]
    end

    subgraph Eval ["4 - Replay Evals"]
        Suite["4 scenarios\n32 rubric checks"]
        Buckets["Failure Buckets\nby category"]
    end

    subgraph Handoff ["5 - Operator Handoff"]
        Export["JSON / Markdown\nSlack / Jira"]
        Workspace["Google Workspace\nDocs / Slides / Sheets\nCalendar / Chat"]
        Persist["GCS Artifacts\nBigQuery Analytics"]
    end

    Logs --> API
    Screenshots --> API
    API --> Gemini
    API --> OpenAI
    API --> Ollama
    API --> Demo
    Gemini --> Report
    OpenAI --> Report
    Ollama --> Report
    Demo --> Report
    Report --> Postmortem
    Report --> FollowUp
    Report --> TTS
    Report --> Suite
    Suite --> Buckets
    Report --> Export
    Report --> Workspace
    Report --> Persist

Key design principle: API keys never reach the browser. The React frontend calls /api/analyze, /api/followup, /api/tts -- the Express API reads provider keys server-side and returns validated, schema-conformant JSON.

Quick Start

Prerequisites

Node.js >= 20 and npm
(Optional) A Gemini or OpenAI API key for live analysis
(Optional) Ollama for fully offline local analysis

Run locally

# 1. Clone and install
git clone https://github.com/KIM3310/AegisOps.git
cd AegisOps
npm install

# 2. Configure (optional -- runs in demo mode without keys)
cp .env.example .env
# Edit .env to add GEMINI_API_KEY or OPENAI_API_KEY

# 3. Start both API and UI
npm run dev
# UI:  http://127.0.0.1:3000
# API: http://127.0.0.1:8787

If no API key is set, the system runs in demo mode -- deterministic output, no external calls, and the full replay eval suite still runs.

Verify the build

npm run verify   # typecheck + test + replay evals + review smoke + build

Tech Stack

Layer	Technology	Purpose
Frontend	React 19, Vite 6, Lucide Icons	Incident input, report rendering, operator dashboard
Backend	Express, Node.js 20+	API routing, key management, payload validation
LLM Providers	Gemini (default), OpenAI, Ollama	Multimodal incident analysis + follow-up Q&A
Validation	Zod 4	Request/response schema enforcement
Logging	Pino	Structured JSON logging
Eval Framework	Custom replay harness	4 scenarios, 32 rubric checks, failure bucket aggregation
Auth	Bearer token + OIDC (RS256 JWT)	Operator access control with role-based gating
Persistence	GCS + BigQuery (optional)	Report artifacts + analytics rows
Monitoring	Prometheus + Datadog (optional)	HTTP metrics, analysis latency, provider usage
Cloud Infra	Terraform (Cloud Run), Cloudflare Pages	IaC deployment, static hosting
Containers	Docker, Kubernetes (HPA, Ingress)	Production container orchestration
CI/CD	GitHub Actions	Typecheck, test, replay proof, build, artifact upload
Language	TypeScript 5.8 (strict mode)	End-to-end type safety

Core API

Endpoint	Method	Description
`/api/analyze`	POST	Analyze logs + screenshots, return structured incident report
`/api/followup`	POST	Follow-up Q&A grounded on the generated report
`/api/tts`	POST	Text-to-speech audio briefing (Gemini TTS)
`/api/evals/replays`	GET	Replay eval suite results (4 scenarios / 32 checks)
`/api/live-sessions`	GET	Persisted incident session history
`/api/meta`	GET	Runtime modes, replay summary, operator checklist
`/api/healthz`	GET	Deployment mode, provider, limits
`/api/summary-pack`	GET	Reviewable trust surface with replay proof
`/api/schema/report`	GET	Incident report JSON schema contract
`/api/metrics`	GET	Prometheus-format metrics

Incident Replay Evals

The replay eval suite validates report quality against fixed scenarios with a structured rubric before the live model path is trusted.

npm run eval:replays

Suite: evals/incidentReplays.ts | Scoring: server/lib/replayEvals.ts

Rubric categories

Category	What it checks
`severity_match`	Severity classification matches the expected level
`title_keywords`	Title captures the dominant failure mode
`tag_coverage`	Operational tags cover the main systems involved
`timeline_coverage`	Timeline retains enough events for reconstruction
`root_cause_coverage`	Root causes name the failure mode, not just symptoms
`actionability`	Action items are concrete and operator-facing
`reasoning_trace`	Reasoning preserves Observations, Hypotheses, Decision Path
`confidence_range`	Confidence score stays within the rubric band

Current scenarios

Scenario	Description
`llm-latency-spike`	SLO breach, queue saturation, memory pressure, autoscaling recovery
`redis-oom-failover`	Redis master OOM, quorum loss, cache miss storm during failover
`payments-retry-storm`	5xx spike + retry fan-out and request queue growth
`search-warning-buildup`	Pre-outage warning with latency and queue buildup (no hard outage)

See docs/INCIDENT_REPLAY_EVALS.md for full documentation.

Deployment

Local development

npm install && npm run dev

Cloudflare Pages

npm run build && wrangler pages deploy dist/

Docker / Cloud Run

docker build -t aegisops .
docker run -e GEMINI_API_KEY=<key> -e HOST=0.0.0.0 -p 8787:8787 aegisops

Terraform (GCP Cloud Run)

cd infra/terraform
terraform init
terraform plan -var="project_id=<your-project>" -var="image=<your-image>"
terraform apply

Kubernetes

Pre-built manifests in infra/k8s/ include Deployment, Service, HPA, Ingress, and ConfigMap.

Optional GCP persistence

Set GOOGLE_APPLICATION_CREDENTIALS + GCP_PROJECT_ID to persist incident artifacts to GCS and analytics rows to BigQuery.

Project Structure

AegisOps/
  App.tsx                    # React app root
  types.ts                   # Shared TypeScript types
  constants.ts               # Sample presets, review lenses
  components/                # React UI components (26 files)
  hooks/                     # React hooks (app state, auth, storage)
  services/                  # Frontend service clients (Google APIs, Gemini, export)
  server/
    index.ts                 # Express API server (~700 lines, 15+ routes)
    lib/
      gemini.ts              # Gemini provider (analyze, follow-up, TTS)
      openai.ts              # OpenAI provider (analyze, follow-up)
      ollama.ts              # Ollama provider (analyze, follow-up)
      demo.ts                # Deterministic demo mode
      replayEvals.ts         # Replay eval scoring and bucket aggregation
      schemas.ts             # Zod request validation schemas
      operatorAccess.ts      # Bearer + OIDC operator auth
      gcp-persistence.ts     # GCS artifact upload, BigQuery analytics
      prometheus.ts          # Prometheus metrics
      datadog-adapter.ts     # Datadog integration
      aws-adapter.ts         # AWS S3/SQS/CloudWatch integration
      ...
  evals/
    incidentReplays.ts       # 4 replay scenarios with expected rubrics
  __tests__/                 # 29 test files (Vitest)
  scripts/                   # CLI tools (replay runner, smoke tests, load tests)
  infra/
    terraform/               # GCP Cloud Run IaC
    k8s/                     # Kubernetes manifests
  docs/                      # Architecture docs, evidence, SVGs
  samples/                   # Sample logs, screenshots, resource packs
  .github/workflows/         # CI pipeline (test, build, artifact upload)

Testing

npm test                    # Unit tests (Vitest, 29 test files)
npm run typecheck           # TypeScript strict mode check
npm run eval:replays        # Replay eval suite (4 scenarios / 32 checks)
npm run review:smoke        # Review surface smoke tests
npm run verify              # All of the above + build

Environment Variables

See .env.example for the full list. Key variables:

Variable	Required	Description
`LLM_PROVIDER`	No	`auto` (default), `gemini`, `openai`, `ollama`, `demo`
`GEMINI_API_KEY`	For live mode	Google Gemini API key
`OPENAI_API_KEY`	For OpenAI mode	OpenAI API key
`OLLAMA_BASE_URL`	For local mode	Ollama server URL (default: `http://127.0.0.1:11434`)
`GCP_PROJECT_ID`	For persistence	GCP project for GCS + BigQuery
`AEGISOPS_OPERATOR_TOKEN`	For auth	Static operator bearer token
`DD_API_KEY`	For monitoring	Datadog API key

License

MIT

Built by Doeon Kim

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
__tests__		__tests__
components		components
docs		docs
evals		evals
hooks		hooks
infra		infra
public		public
samples		samples
scripts		scripts
server		server
services		services
tools		tools
utils		utils
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
App.tsx		App.tsx
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
constants.ts		constants.ts
index.html		index.html
index.tsx		index.tsx
metadata.json		metadata.json
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
types.ts		types.ts
vite-env.d.ts		vite-env.d.ts
vite.config.ts		vite.config.ts
wrangler.toml		wrangler.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AegisOps -- Multimodal Incident Review System

Live Demo

Architecture

Quick Start

Prerequisites

Run locally

Verify the build

Tech Stack

Core API

Incident Replay Evals

Rubric categories

Current scenarios

Deployment

Local development

Cloudflare Pages

Docker / Cloud Run

Terraform (GCP Cloud Run)

Kubernetes

Optional GCP persistence

Project Structure

Testing

Environment Variables

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AegisOps -- Multimodal Incident Review System

Live Demo

Architecture

Quick Start

Prerequisites

Run locally

Verify the build

Tech Stack

Core API

Incident Replay Evals

Rubric categories

Current scenarios

Deployment

Local development

Cloudflare Pages

Docker / Cloud Run

Terraform (GCP Cloud Run)

Kubernetes

Optional GCP persistence

Project Structure

Testing

Environment Variables

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages