zman27/ai_ops


Josh Rzemien cf386e1aaa feat(ui): add operator UI server, stores, and insights

2026-02-23 18:49:53 -05:00


AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

OpenAI Codex SDK (@openai/codex-sdk)
Anthropic Claude Agent SDK (@anthropic-ai/claude-agent-sdk)
Schema-validated orchestration (AgentManifest)
DAG execution with topology-aware fan-out (parallel, hierarchical, retry-unrolled)
Project-scoped persistent context store
Typed domain events for edge-triggered routing
Resource provisioning (git worktrees + deterministic port ranges)
MCP configuration layer with handler policy hooks
Security middleware for shell/tool policy enforcement
Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)

Architecture Summary

SchemaDrivenExecutionEngine.runSession(...) is the single execution entrypoint.
PipelineExecutor owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
AgentManager is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
Session state is persisted under AGENT_STATE_ROOT.
Project state is persisted under AGENT_PROJECT_CONTEXT_PATH with schema-versioned JSON (schemaVersion) and domains:
- globalFlags
- artifactPointers
- taskQueue
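
As a rough illustration of the schema-versioned project store, the sketch below parses a persisted JSON document and rejects an unexpected schemaVersion. Only schemaVersion and the three domain names come from this README. The value types, the numeric version, and the parseProjectContext helper are assumptions made for illustration.

```typescript
// Hypothetical shape: value types are assumptions, only the key names are documented.
interface ProjectContext {
  schemaVersion: number;
  globalFlags: Record<string, unknown>;
  artifactPointers: Record<string, unknown>;
  taskQueue: unknown[];
}

// Parses persisted JSON, checks the schema version, and defaults missing domains.
function parseProjectContext(json: string, expectedVersion: number): ProjectContext {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("project context must be a JSON object");
  }
  const candidate = raw as Partial<ProjectContext>;
  if (candidate.schemaVersion !== expectedVersion) {
    throw new Error(`unsupported schemaVersion: ${String(candidate.schemaVersion)}`);
  }
  return {
    schemaVersion: expectedVersion,
    globalFlags: candidate.globalFlags ?? {},
    artifactPointers: candidate.artifactPointers ?? {},
    taskQueue: candidate.taskQueue ?? [],
  };
}
```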

Repository Layout

src/agents
- orchestration.ts: engine facade and runtime wiring
- pipeline.ts: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
- failure-policy.ts: hard/soft failure classification policy
- lifecycle-observer.ts: persistence/event lifecycle hooks for node attempts
- manifest.ts: schema parsing/validation for personas/topologies/edges
- manager.ts: recursive fan-out utility used by pipeline
- state-context.ts: persisted node handoffs + session state
- project-context.ts: project-scoped store
- domain-events.ts: typed domain event schema + bus
- runtime.ts: env-driven defaults/singletons
- provisioning.ts: resource provisioning and child suballocation helpers
src/mcp: MCP config types/conversion/handlers
src/security: shell AST parsing, rules engine, secure executor, and audit sinks
src/telemetry: runtime event schema, sink fan-out, file sink, and Discord webhook sink
src/ui: local operator UI server, API routes, run-control service, and graph/event aggregation
src/examples: provider entrypoints (codex.ts, claude.ts)
src/config.ts: centralized env parsing/validation/defaulting
tests: manager, manifest, pipeline/orchestration, state, provisioning, MCP

Setup

npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json

Run

npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."

Or via unified entrypoint:

npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."

Operator UI

Start the local UI server:

npm run ui

Then open:

http://127.0.0.1:4317 (default)

The UI provides:

graph visualizer with topology/retry rendering, edge trigger labels, node economics (duration/cost/tokens), and critical-path highlighting
node inspector with attempt metadata and injected ResolvedExecutionContext sandbox payload
live runtime event feed from AGENT_RUNTIME_EVENT_LOG_PATH with severity coloring (including security mirror events)
run trigger + kill switch backed by SchemaDrivenExecutionEngine.runSession(...)
- run mode selector: provider (real Codex/Claude execution) or mock (deterministic dry-run executor)
- provider selector: codex or claude
run history from AGENT_STATE_ROOT
forms for runtime Discord webhook settings, security policy, and manager/resource limits
manifest editor/validator/saver for schema "1" manifests

Provider mode notes:

provider=codex uses existing OpenAI/Codex auth settings (OPENAI_AUTH_MODE, CODEX_API_KEY, OPENAI_API_KEY).
provider=claude uses Claude auth resolution (CLAUDE_CODE_OAUTH_TOKEN preferred, otherwise ANTHROPIC_API_KEY, or existing Claude Code login state).

Manifest Semantics

AgentManifest (schema "1") validates:

supported topologies (sequential, parallel, hierarchical, retry-unrolled)
persona definitions, optional modelConstraint, and tool-clearance policy (validated by shared Zod schema)
relationship DAG, including rejection of unknown persona references
strict pipeline DAG
topology constraints (maxDepth, maxRetries)
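
One common way to enforce a strict DAG is Kahn's algorithm, sketched below. This is not taken from manifest.ts. The PipelineEdge shape and the topologicalOrder helper are hypothetical, and node IDs are assumed to be unique.

```typescript
// Hypothetical edge shape; the real manifest schema is not shown in this README.
interface PipelineEdge {
  from: string;
  to: string;
}

// Returns node IDs in topological order. Throws when an edge references an
// unknown node or when the graph contains a cycle (Kahn's algorithm).
function topologicalOrder(nodeIds: string[], edges: PipelineEdge[]): string[] {
  const indegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  for (const id of nodeIds) {
    indegree.set(id, 0);
    outgoing.set(id, []);
  }
  for (const { from, to } of edges) {
    if (!indegree.has(from) || !indegree.has(to)) {
      throw new Error(`edge references unknown node: ${from} -> ${to}`);
    }
    outgoing.get(from)!.push(to);
    indegree.set(to, indegree.get(to)! + 1);
  }
  // Start from nodes with no incoming edges and peel the graph layer by layer.
  const ready = nodeIds.filter((id) => indegree.get(id) === 0);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of outgoing.get(id)!) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  // Any node left unvisited sits on a cycle.
  if (order.length !== nodeIds.length) {
    throw new Error("pipeline graph contains a cycle");
  }
  return order;
}
```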

Pipeline edges can route via:

legacy status triggers (on: success, validation_fail, failure, always)
domain event triggers (event: typed domain events)
conditions (state_flag, history_has_event, file_exists, always)
history_has_event evaluates persisted domain event history (for example validation_failed)
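
The routing rules above can be sketched as a single predicate. The trigger and condition names come from the list above. The Edge, NodeOutcome, and RoutingState shapes and the edgeFires helper are assumptions for illustration, and file_exists is left out to keep the sketch free of filesystem access.

```typescript
type StatusTrigger = "success" | "validation_fail" | "failure" | "always";

// Hypothetical shapes: field names other than `on` and `event` are assumptions.
interface Edge {
  to: string;
  on?: StatusTrigger;
  event?: string;
  condition?:
    | { type: "state_flag"; flag: string }
    | { type: "history_has_event"; event: string }
    | { type: "always" };
}

interface NodeOutcome {
  status: Exclude<StatusTrigger, "always">;
  emittedEvents: string[];
}

interface RoutingState {
  flags: Record<string, boolean>;
  eventHistory: string[];
}

// An edge fires when its trigger matches the node outcome and its
// condition (if any) holds against persisted state.
function edgeFires(edge: Edge, outcome: NodeOutcome, state: RoutingState): boolean {
  const triggered =
    edge.event !== undefined
      ? outcome.emittedEvents.includes(edge.event)
      : edge.on === "always" || edge.on === outcome.status;
  if (!triggered) return false;

  const condition = edge.condition;
  if (condition === undefined || condition.type === "always") return true;
  if (condition.type === "state_flag") return state.flags[condition.flag] === true;
  return state.eventHistory.includes(condition.event);
}
```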

Domain Events

Domain events are typed and can trigger edges directly:

planning: requirements_defined, tasks_planned
execution: code_committed, task_blocked
validation: validation_passed, validation_failed
integration: branch_merged

Actors can emit events in ActorExecutionResult.events. Pipeline status also emits default validation/execution events.
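
A minimal sketch of how these event names could be typed and dispatched follows. The event names come from the list above. The DomainEvent fields beyond type, and the DomainEventBus API, are assumptions and may differ from domain-events.ts.

```typescript
// Event names are taken from this README; the grouping constant is illustrative.
const DOMAIN_EVENTS = {
  planning: ["requirements_defined", "tasks_planned"],
  execution: ["code_committed", "task_blocked"],
  validation: ["validation_passed", "validation_failed"],
  integration: ["branch_merged"],
} as const;

// Union of every event name across all domains.
type DomainEventType = (typeof DOMAIN_EVENTS)[keyof typeof DOMAIN_EVENTS][number];

// Hypothetical payload: sourceNodeId is an assumed field.
interface DomainEvent {
  type: DomainEventType;
  sourceNodeId: string;
}

type Listener = (event: DomainEvent) => void;

class DomainEventBus {
  private readonly listeners = new Map<DomainEventType, Listener[]>();
  readonly history: DomainEvent[] = [];

  on(type: DomainEventType, listener: Listener): void {
    const existing = this.listeners.get(type) ?? [];
    this.listeners.set(type, [...existing, listener]);
  }

  // Records the event, then notifies listeners registered for its type.
  publish(event: DomainEvent): void {
    this.history.push(event);
    for (const listener of this.listeners.get(event.type) ?? []) {
      listener(event);
    }
  }
}
```

Because DomainEventType is a closed union, a misspelled event name such as "validation_fail" is rejected at compile time rather than silently never matching an edge.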

Retry Matrix and Cancellation

validation_fail: routed through retry-unrolled execution (new child manager session)
hard failures: timeout, network, and 403-like failures are counted consecutively; after 2 consecutive hard failures the pipeline aborts immediately
AbortSignal is passed into every actor execution input
session closure aborts child recursive work
run summaries expose aggregate status: success requires that every executed terminal DAG node succeeded and that no failure occurred on the critical path
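
The hard-failure rule can be sketched as a small counter. The threshold of 2 comes from the text above. The classifyFailure heuristics, the error codes it checks, and the choice to reset the counter on a soft failure are assumptions, not the behavior of failure-policy.ts.

```typescript
type FailureKind = "hard" | "soft";

// Hypothetical classifier: the codes listed here are illustrative only.
function classifyFailure(error: { code?: string; status?: number }): FailureKind {
  const hardCodes = new Set(["ETIMEDOUT", "ECONNRESET", "ENOTFOUND"]);
  if (error.status === 403) return "hard";
  if (error.code !== undefined && hardCodes.has(error.code)) return "hard";
  return "soft";
}

class HardFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Records one attempt outcome and returns true when the pipeline should abort.
  // Assumption: anything other than a hard failure breaks the streak.
  record(outcome: "success" | FailureKind): boolean {
    this.consecutive = outcome === "hard" ? this.consecutive + 1 : 0;
    return this.consecutive >= this.threshold;
  }
}
```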

Runtime Events

The pipeline emits runtime lifecycle events (session.started, node.attempt.completed, domain.*, session.completed, session.failed).
Runtime events are fan-out only and never used for edge-routing decisions.
Default sink writes NDJSON to AGENT_RUNTIME_EVENT_LOG_PATH.
Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
Existing security command audit output (AGENT_SECURITY_AUDIT_LOG_PATH) remains in place and is also mirrored into runtime events.
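
The best-effort fan-out property can be illustrated as follows: every sink receives the event, and a failing sink neither blocks the others nor propagates an error to the caller. The Sink type and fanOut helper are hypothetical and are not the API of src/telemetry.

```typescript
interface RuntimeEventLike {
  type: string;
  message: string;
}

// A sink may be synchronous or asynchronous.
type Sink = (event: RuntimeEventLike) => void | Promise<void>;

// Delivers the event to every sink and never rejects. Returns the number of
// sinks that failed so callers can log it without changing control flow.
async function fanOut(event: RuntimeEventLike, sinks: Sink[]): Promise<number> {
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(sinks.map(async (sink) => sink(event)));
  return results.filter((result) => result.status === "rejected").length;
}
```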

Runtime Event Fields

Each runtime event is written as one NDJSON object with:

id, timestamp, type, severity
sessionId, nodeId, attempt
message
optional usage (tokenInput, tokenOutput, tokenTotal, toolCalls, durationMs, costUsd)
optional structured metadata
- node.attempt.completed metadata includes:
  - executionContext (resolved sandbox payload injected into executor)
  - topologyKind
  - retrySpawned
  - optional fromNodeId, subtasks, securityViolation
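
For consumers of the log, the fields above map naturally onto a TypeScript type. This is a sketch: the field names come from this README, while which of them are optional, the string timestamp, and the parseNdjson and countByType helpers are assumptions.

```typescript
type Severity = "info" | "warning" | "critical";

// Field names follow the list above; optionality is assumed, not documented.
interface RuntimeEvent {
  id: string;
  timestamp: string;
  type: string;
  severity: Severity;
  sessionId?: string;
  nodeId?: string;
  attempt?: number;
  message: string;
  usage?: {
    tokenInput?: number;
    tokenOutput?: number;
    tokenTotal?: number;
    toolCalls?: number;
    durationMs?: number;
    costUsd?: number;
  };
  metadata?: Record<string, unknown>;
}

// Parses NDJSON text: one JSON object per non-empty line.
function parseNdjson(text: string): RuntimeEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RuntimeEvent);
}

// Tallies events per type.
function countByType(events: RuntimeEvent[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const event of events) {
    counts[event.type] = (counts[event.type] ?? 0) + 1;
  }
  return counts;
}
```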

Runtime Event Setup

Add these variables in .env (or use defaults):

AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed

Notes:

File sink is always enabled and appends NDJSON.
Discord sink is enabled only when AGENT_RUNTIME_DISCORD_WEBHOOK_URL is set.
Discord notifications are sent when event severity is at or above AGENT_RUNTIME_DISCORD_MIN_SEVERITY.
Event types in AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES are always sent, regardless of severity.
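
The notification rules in these notes reduce to a small predicate, sketched below. The severity ordering info < warning < critical, the DiscordSinkConfig shape, and both helper names are assumptions for illustration.

```typescript
type Severity = "info" | "warning" | "critical";

// Assumed ordering: info < warning < critical.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Hypothetical config shape mirroring the three Discord variables.
interface DiscordSinkConfig {
  webhookUrl: string; // empty string means the sink is disabled
  minSeverity: Severity;
  alwaysNotifyTypes: string[];
}

// Splits the CSV value of AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES.
function parseAlwaysNotifyTypes(csv: string): string[] {
  return csv
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "");
}

function shouldNotifyDiscord(
  event: { type: string; severity: Severity },
  config: DiscordSinkConfig,
): boolean {
  if (config.webhookUrl === "") return false;
  if (config.alwaysNotifyTypes.includes(event.type)) return true;
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[config.minSeverity];
}
```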

Event Types Emitted by Runtime

Session lifecycle:
- session.started
- session.completed
- session.failed
Node/domain lifecycle:
- node.attempt.completed
- domain.<domain_event_type> (for example domain.validation_failed)
Security mirror events:
- security.shell.command_profiled
- security.shell.command_allowed
- security.shell.command_blocked
- security.tool.invocation_allowed
- security.tool.invocation_blocked

Analytics Quick Start

Inspect latest events:

tail -n 50 .ai_ops/events/runtime-events.ndjson

Count events by type:

jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c

Get only critical events:

jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson

Security Middleware

Shell command parsing uses async sh-syntax (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
Rules are validated with strict Zod schemas (src/security/schemas.ts) before execution.
SecurityRulesEngine enforces:
- binary allowlists
- cwd/worktree boundary checks
- path traversal blocking (../)
- protected path blocking (state root + project context path)
- unified tool allowlist/banlist checks for shell binaries and MCP tool lists
SecureCommandExecutor runs commands via child_process.spawn with:
- explicit env scrub/inject policy (no implicit full env inheritance)
- timeout enforcement
- optional uid/gid drop
- stdout/stderr streaming hooks for audit
Every actor execution input includes a pre-resolved executionContext (phase, modelConstraint, allowedTools, and immutable security constraints) generated by orchestration per node attempt.
Every actor execution input includes security helpers (rulesEngine, createCommandExecutor(...)) so executors can enforce shell/tool policy at the execution boundary.
Every actor execution input includes mcp helpers (resolvedConfig, resolveConfig(...), filterToolsForProvider(...), createClaudeCanUseTool()) so the tools exposed to provider adapters are filtered against executionContext.allowedTools before SDK calls.
For Claude-based executors, pass input.mcp.filterToolsForProvider(...) and input.mcp.createClaudeCanUseTool() into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
Pipeline behavior on SecurityViolationError is configurable:
- hard_abort (default)
- validation_fail (retry-unrolled remediation)
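
To make the rules-engine checks listed above concrete, here is a simplified sketch of an allowlist, cwd boundary, traversal, and protected-path evaluation. It is not the SecurityRulesEngine API. All names are hypothetical, paths are assumed to be absolute POSIX paths, and every argument is treated as a potential path, whereas the real middleware extracts commands and redirects from a parsed shell AST.

```typescript
interface CommandRequest {
  binary: string;
  args: string[];
  cwd: string;
}

interface SecurityPolicy {
  allowedBinaries: string[];
  worktreeRoot: string;
  protectedPaths: string[];
}

type Verdict = { allowed: true } | { allowed: false; reason: string };

// Collapses empty and "." segments. ".." is rejected before this is needed.
function normalizeAbsolute(p: string): string {
  return "/" + p.split("/").filter((s) => s !== "" && s !== ".").join("/");
}

// True when candidate equals root or lies beneath it.
function isWithin(root: string, candidate: string): boolean {
  const r = normalizeAbsolute(root);
  const c = normalizeAbsolute(candidate);
  const base = r.endsWith("/") ? r : `${r}/`;
  return c === r || c.startsWith(base);
}

function evaluateCommand(req: CommandRequest, policy: SecurityPolicy): Verdict {
  if (!policy.allowedBinaries.includes(req.binary)) {
    return { allowed: false, reason: `binary not allowlisted: ${req.binary}` };
  }
  if (req.cwd.split("/").includes("..") || !isWithin(policy.worktreeRoot, req.cwd)) {
    return { allowed: false, reason: "cwd outside worktree boundary" };
  }
  for (const arg of req.args) {
    if (arg.split("/").includes("..")) {
      return { allowed: false, reason: `path traversal in argument: ${arg}` };
    }
    const target = arg.startsWith("/") ? arg : `${req.cwd}/${arg}`;
    if (policy.protectedPaths.some((p) => isWithin(p, target))) {
      return { allowed: false, reason: `protected path: ${arg}` };
    }
  }
  return { allowed: true };
}
```

Returning a verdict instead of throwing keeps the decision separate from the reaction, so a caller could map a denial to either hard_abort or validation_fail.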

Environment Variables

Provider/Auth

CODEX_API_KEY
OPENAI_API_KEY
OPENAI_AUTH_MODE (auto, chatgpt, or api_key)
OPENAI_BASE_URL
CODEX_SKIP_GIT_CHECK
CLAUDE_CODE_OAUTH_TOKEN (preferred for Claude auth; takes precedence over ANTHROPIC_API_KEY)
ANTHROPIC_API_KEY (used when CLAUDE_CODE_OAUTH_TOKEN is unset)
CLAUDE_MODEL
CLAUDE_CODE_PATH
MCP_CONFIG_PATH

Agent Manager Limits

AGENT_MAX_CONCURRENT
AGENT_MAX_SESSION
AGENT_MAX_RECURSIVE_DEPTH

Orchestration / Context

AGENT_STATE_ROOT
AGENT_PROJECT_CONTEXT_PATH
AGENT_TOPOLOGY_MAX_DEPTH
AGENT_TOPOLOGY_MAX_RETRIES
AGENT_RELATIONSHIP_MAX_CHILDREN

Provisioning / Resource Controls

AGENT_WORKTREE_ROOT
AGENT_WORKTREE_BASE_REF
AGENT_PORT_BASE
AGENT_PORT_BLOCK_SIZE
AGENT_PORT_BLOCK_COUNT
AGENT_PORT_PRIMARY_OFFSET
AGENT_PORT_LOCK_DIR
AGENT_DISCOVERY_FILE_RELATIVE_PATH

Security Middleware

AGENT_SECURITY_VIOLATION_MODE (hard_abort or validation_fail)
AGENT_SECURITY_ALLOWED_BINARIES
AGENT_SECURITY_COMMAND_TIMEOUT_MS
AGENT_SECURITY_AUDIT_LOG_PATH
AGENT_SECURITY_ENV_INHERIT
AGENT_SECURITY_ENV_SCRUB
AGENT_SECURITY_DROP_UID
AGENT_SECURITY_DROP_GID

Runtime Events / Telemetry

AGENT_RUNTIME_EVENT_LOG_PATH
AGENT_RUNTIME_DISCORD_WEBHOOK_URL
AGENT_RUNTIME_DISCORD_MIN_SEVERITY (info, warning, or critical)
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES (CSV event types such as session.started,session.completed,session.failed)

Operator UI

AGENT_UI_HOST (default 127.0.0.1)
AGENT_UI_PORT (default 4317)

Runtime-Injected (Do Not Configure In .env)

AGENT_REPO_ROOT
AGENT_WORKTREE_PATH
AGENT_WORKTREE_BASE_REF
AGENT_PORT_RANGE_START
AGENT_PORT_RANGE_END
AGENT_PORT_PRIMARY
AGENT_DISCOVERY_FILE

Defaults are documented in .env.example.

Auth behavior notes:

OpenAI/Codex:
- OPENAI_AUTH_MODE=auto (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (codex login / ChatGPT plan auth).
- OPENAI_AUTH_MODE=chatgpt always omits API key injection so Codex uses ChatGPT subscription auth/session.
Claude:
- If CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.
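
The Claude precedence described above can be sketched as a pure function over the environment. The ClaudeAuth type and resolveClaudeAuth helper are hypothetical, and treating an empty or whitespace-only value as unset is an assumption.

```typescript
type ClaudeAuth =
  | { kind: "oauth_token"; token: string }
  | { kind: "api_key"; apiKey: string }
  | { kind: "existing_login" };

// Precedence: CLAUDE_CODE_OAUTH_TOKEN, then ANTHROPIC_API_KEY, then fall back
// to whatever login state Claude Code already has (no auth options passed).
function resolveClaudeAuth(env: Record<string, string | undefined>): ClaudeAuth {
  const oauthToken = env["CLAUDE_CODE_OAUTH_TOKEN"]?.trim();
  if (oauthToken) return { kind: "oauth_token", token: oauthToken };

  const apiKey = env["ANTHROPIC_API_KEY"]?.trim();
  if (apiKey) return { kind: "api_key", apiKey };

  return { kind: "existing_login" };
}
```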

Quality Gate

npm run verify

Equivalent:

npm run check
npm run check:tests
npm run test
npm run build

Notes

Recursive execution APIs on AgentManager are internal runtime plumbing; use SchemaDrivenExecutionEngine.runSession(...) as the public orchestration entrypoint.

MCP Migration Note

Shared MCP server configs no longer accept the legacy http_headers alias.
Use headers instead.
