# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- DAG execution with topology-aware fan-out (`parallel`, `hierarchical`, `retry-unrolled`)
- Project-scoped persistent context store
- Typed domain events for edge-triggered routing
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler policy hooks
- Security middleware for shell/tool policy enforcement
- Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)

## Architecture Summary

- `SchemaDrivenExecutionEngine.runSession(...)` is the single execution entrypoint.
- `PipelineExecutor` owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
- Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
- `AgentManager` is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
- Session state is persisted under `AGENT_STATE_ROOT`.
- Session lifecycle is explicit (`POST /api/sessions`, `POST /api/sessions/:id/run`, `POST /api/sessions/:id/close`) and each session is bound to a target project path.
- Session project context is persisted as schema-versioned JSON (`schemaVersion`) with domains:
  - `globalFlags`
  - `artifactPointers`
  - `taskQueue`
    - each task record stores `taskId`, status, and optional `worktreePath` for task-scoped workspace ownership
    - conflict-aware statuses are supported (`conflict`, `resolving_conflict`)

## Deep Dives

- Session walkthrough with concrete artifacts from a successful provider run: `docs/session-walkthrough.md`
- Orchestration engine internals: `docs/orchestration-engine.md`
- Runtime event model and sinks: `docs/runtime-events.md`

## Repository Layout

- `src/agents`
  - `orchestration.ts`: engine facade and runtime wiring
  - `pipeline.ts`: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
  - `failure-policy.ts`: hard/soft failure classification policy
  - `lifecycle-observer.ts`: persistence/event lifecycle hooks for node attempts
  - `manifest.ts`: schema parsing/validation for personas/topologies/edges
  - `manager.ts`: recursive fan-out utility used by pipeline
  - `state-context.ts`: persisted node handoffs + session state
  - `project-context.ts`: project-scoped store
  - `domain-events.ts`: typed domain event schema + bus
  - `runtime.ts`: env-driven defaults/singletons
  - `provisioning.ts`: resource provisioning and child suballocation helpers
- `src/mcp`: MCP config types/conversion/handlers
- `src/security`: shell AST parsing, rules engine, secure executor, and audit sinks
- `src/telemetry`: runtime event schema, sink fan-out, file sink, and Discord webhook sink
- `src/ui`: local operator UI server, API routes, run-control service, and graph/event aggregation
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`)
- `src/config.ts`: centralized env parsing/validation/defaulting
- `tests`: manager, manifest, pipeline/orchestration, state, provisioning, MCP

## Setup

```bash
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

## Run

```bash
npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."
```

Or via unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Operator UI

Start the local UI server:

```bash
npm run ui
```

Then open:

- `http://127.0.0.1:4317` (default)

The UI provides:

- graph visualizer with topology/retry rendering, edge trigger labels, node economics (duration/cost/tokens), and critical-path highlighting
- node inspector with attempt metadata and injected `ResolvedExecutionContext` sandbox payload
- live runtime event feed from `AGENT_RUNTIME_EVENT_LOG_PATH` with severity coloring (including security mirror events)
- Claude trace feed from `CLAUDE_OBSERVABILITY_LOG_PATH` (query lifecycle, SDK message types/subtypes, and errors)
- run trigger + kill switch backed by `SchemaDrivenExecutionEngine.runSession(...)`
- run mode selector: `provider` (real Codex/Claude execution) or `mock` (deterministic dry-run executor)
- provider selector: `codex` or `claude`
- run history from `AGENT_STATE_ROOT`
- forms for runtime Discord webhook settings, security policy, and manager/resource limits
- hover help on form labels with short intent guidance for each field
- manifest editor/validator/saver for schema `"1"` manifests

Provider mode notes:

- `provider=codex` uses existing OpenAI/Codex auth settings (`OPENAI_AUTH_MODE`, `CODEX_API_KEY`, `OPENAI_API_KEY`).
- `provider=claude` uses Claude auth resolution (`CLAUDE_CODE_OAUTH_TOKEN` preferred, otherwise `ANTHROPIC_API_KEY`, or existing Claude Code login state).
- `CLAUDE_MODEL` should be a Claude model id/alias recognized by Claude Code (for example `claude-sonnet-4-6`); `anthropic/...` prefixes are normalized automatically.
- Claude provider runs can emit Claude SDK/CLI internals to stdout and/or NDJSON with `CLAUDE_OBSERVABILITY_*` settings.
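The Claude auth precedence and model-prefix normalization described in the provider mode notes can be sketched roughly as follows. This is an illustration only, not the runtime's actual code: `resolveClaudeAuth`, `normalizeClaudeModel`, and the `ClaudeAuth` shape are hypothetical names, and "normalized" is assumed to mean stripping a leading `anthropic/` prefix. Only the environment variable names come from this README.

```typescript
// Hypothetical types and helpers; only the env variable names are taken from the README.
type ClaudeAuth =
  | { kind: "oauth_token"; token: string }
  | { kind: "api_key"; apiKey: string }
  | { kind: "existing_login" };

type Env = Record<string, string | undefined>;

// Documented precedence: OAuth token first, then API key,
// otherwise fall back to existing Claude Code login state.
function resolveClaudeAuth(env: Env): ClaudeAuth {
  const token = env.CLAUDE_CODE_OAUTH_TOKEN?.trim();
  if (token) return { kind: "oauth_token", token };

  const apiKey = env.ANTHROPIC_API_KEY?.trim();
  if (apiKey) return { kind: "api_key", apiKey };

  return { kind: "existing_login" };
}

// Assumption: normalization strips a leading "anthropic/" prefix and nothing else.
function normalizeClaudeModel(model: string): string {
  const prefix = "anthropic/";
  const trimmed = model.trim();
  return trimmed.startsWith(prefix) ? trimmed.slice(prefix.length) : trimmed;
}
```

In a Node process, a caller might pass `process.env` to `resolveClaudeAuth` and omit explicit auth options whenever the result is `existing_login`.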
## Manifest Semantics

`AgentManifest` (schema `"1"`) validates:

- supported topologies (`sequential`, `parallel`, `hierarchical`, `retry-unrolled`)
- persona definitions, optional `modelConstraint`, and tool-clearance policy (validated by shared Zod schema)
- relationship DAG and unknown persona references
- strict pipeline DAG
- topology constraints (`maxDepth`, `maxRetries`)

Pipeline edges can route via:

- legacy status triggers (`on`: `success`, `validation_fail`, `failure`, `always`)
- domain event triggers (`event`: typed domain events)
- conditions (`state_flag`, `history_has_event`, `file_exists`, `always`)
  - `history_has_event` evaluates persisted domain event history (for example `validation_failed`)

## Domain Events

Domain events are typed and can trigger edges directly:

- planning: `requirements_defined`, `tasks_planned`
- execution: `code_committed`, `task_ready_for_review`, `task_blocked`
- validation: `validation_passed`, `validation_failed`
- integration: `branch_merged`, `merge_conflict_detected`, `merge_conflict_resolved`, `merge_conflict_unresolved`, `merge_retry_started`

Actors can emit events in `ActorExecutionResult.events`. Pipeline status also emits default validation/execution events.

## Retry Matrix and Cancellation

- `validation_fail`: routed through retry-unrolled execution (new child manager session)
- hard failures: timeout/network/403-like failures tracked sequentially; at 2 consecutive hard failures the pipeline aborts fast
- `AbortSignal` is passed into every actor execution input
- session closure aborts child recursive work
- run summaries expose aggregate `status`: success requires successful terminal executed DAG nodes and no critical-path failure

## Runtime Events

- The pipeline emits runtime lifecycle events (`session.started`, `node.attempt.completed`, `domain.*`, `session.completed`, `session.failed`).
- Runtime events are fan-out only and never used for edge-routing decisions.
- Default sink writes NDJSON to `AGENT_RUNTIME_EVENT_LOG_PATH`.
- Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
- Existing security command audit output (`AGENT_SECURITY_AUDIT_LOG_PATH`) remains in place and is also mirrored into runtime events.

### Runtime Event Fields

Each runtime event is written as one NDJSON object with:

- `id`, `timestamp`, `type`, `severity`
- `sessionId`, `nodeId`, `attempt`
- `message`
- optional `usage` (`tokenInput`, `tokenOutput`, `tokenTotal`, `toolCalls`, `durationMs`, `costUsd`)
- optional structured `metadata`
- `node.attempt.completed` metadata includes:
  - `executionContext` (resolved sandbox payload injected into executor)
  - `topologyKind`
  - `retrySpawned`
  - optional `fromNodeId`, `subtasks`, `securityViolation`

### Runtime Event Setup

Add these variables in `.env` (or use defaults):

```bash
AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed
```

Notes:

- File sink is always enabled and appends NDJSON.
- Discord sink is enabled only when `AGENT_RUNTIME_DISCORD_WEBHOOK_URL` is set.
- Discord notifications are sent when event severity is at or above `AGENT_RUNTIME_DISCORD_MIN_SEVERITY`.
- Event types in `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` are always sent, regardless of severity.
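The Discord notification rules in the notes above amount to a small predicate. The sketch below is a minimal illustration, not the sink's actual implementation: `shouldNotifyDiscord`, `parseAlwaysNotifyTypes`, and `DiscordSinkConfig` are hypothetical names, and the ordering `info < warning < critical` is an assumption inferred from the documented severity values.

```typescript
type Severity = "info" | "warning" | "critical";

// Assumed ordering of the documented severity values.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Hypothetical config shape mirroring the AGENT_RUNTIME_DISCORD_* variables.
interface DiscordSinkConfig {
  webhookUrl?: string;
  minSeverity: Severity;
  alwaysNotifyTypes: string[];
}

interface RuntimeEventLike {
  type: string;
  severity: Severity;
}

// Parse the CSV form used by AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES.
function parseAlwaysNotifyTypes(csv: string | undefined): string[] {
  return (csv ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

function shouldNotifyDiscord(event: RuntimeEventLike, config: DiscordSinkConfig): boolean {
  // The sink is disabled unless a webhook URL is configured.
  if (!config.webhookUrl) return false;
  // Always-notify types bypass the severity threshold.
  if (config.alwaysNotifyTypes.includes(event.type)) return true;
  // Otherwise notify only at or above the minimum severity.
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[config.minSeverity];
}
```

With the default settings shown earlier, an `info`-level `session.started` event would still be posted because its type is in the always-notify list, while an `info`-level `node.attempt.completed` event would not.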
### Event Types Emitted by Runtime

- Session lifecycle:
  - `session.started`
  - `session.completed`
  - `session.failed`
- Node/domain lifecycle:
  - `node.attempt.completed`
  - `domain.*` (for example `domain.validation_failed`)
- Security mirror events:
  - `security.shell.command_profiled`
  - `security.shell.command_allowed`
  - `security.shell.command_blocked`
  - `security.tool.invocation_allowed`
  - `security.tool.invocation_blocked`

## Claude Observability

- `CLAUDE_OBSERVABILITY_MODE=stdout` prints structured Claude query internals (tool progress, system events, stderr, result lifecycle) to stdout as JSON lines prefixed with `[claude-trace]`.
- `CLAUDE_OBSERVABILITY_MODE=file` appends the same records to `CLAUDE_OBSERVABILITY_LOG_PATH`.
- `CLAUDE_OBSERVABILITY_MODE=both` enables both outputs.
- Output samples high-frequency `tool_progress` events to avoid log flooding while retaining suppression counters.
- `assistant` and `user` message records are retained so turn flow is inspectable end-to-end.
- `CLAUDE_OBSERVABILITY_VERBOSITY=summary` stores compact metadata; `full` stores redacted full SDK message payloads.
- `CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL=true` emits sampled partial assistant stream events from the SDK.
- `CLAUDE_OBSERVABILITY_DEBUG=true` enables Claude SDK debug mode.
- `CLAUDE_OBSERVABILITY_DEBUG_LOG_PATH` writes Claude SDK debug output to a file (also enables debug mode).
- In UI/provider mode, `CLAUDE_OBSERVABILITY_LOG_PATH` resolves relative to the repo workspace root.
- UI API: `GET /api/claude-trace?limit=&sessionId=` reads filtered Claude trace records.

Example:

```bash
CLAUDE_OBSERVABILITY_MODE=both
CLAUDE_OBSERVABILITY_VERBOSITY=summary
CLAUDE_OBSERVABILITY_LOG_PATH=.ai_ops/events/claude-trace.ndjson
CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL=false
CLAUDE_OBSERVABILITY_DEBUG=false
```

### Analytics Quick Start

Inspect latest events:

```bash
tail -n 50 .ai_ops/events/runtime-events.ndjson
```

Count events by type:

```bash
jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c
```

Get only critical events:

```bash
jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson
```

## Security Middleware

- Shell command parsing uses async `sh-syntax` (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
- Rules are validated with strict Zod schemas (`src/security/schemas.ts`) before execution.
- `SecurityRulesEngine` enforces:
  - binary allowlists
  - cwd/worktree boundary checks
  - path traversal blocking (`../`)
  - protected path blocking (state root + project context path)
  - unified tool allowlist/banlist checks for shell binaries and MCP tool lists
- `SecureCommandExecutor` runs commands via `child_process.spawn` with:
  - explicit env scrub/inject policy (no implicit full env inheritance)
  - timeout enforcement
  - optional uid/gid drop
  - stdout/stderr streaming hooks for audit
- Every actor execution input now includes a pre-resolved `executionContext` (`phase`, `modelConstraint`, `allowedTools`, and immutable security constraints) generated by orchestration per node attempt.
- Every actor execution input now includes `security` helpers (`rulesEngine`, `createCommandExecutor(...)`) so executors can enforce shell/tool policy at the execution boundary.
- Every actor execution input now includes `mcp` helpers (`resolvedConfig`, `resolveConfig(...)`, `filterToolsForProvider(...)`, `createClaudeCanUseTool()`) so provider adapters are filtered against `executionContext.allowedTools` before SDK calls.
- For Claude-based executors, pass `input.mcp.filterToolsForProvider(...)` and `input.mcp.createClaudeCanUseTool()` into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
- Claude `canUseTool` permission checks normalize provider casing (`Bash` vs `bash`) before enforcing persona allowlists.
- Pipeline behavior on `SecurityViolationError` is configurable:
  - `hard_abort` (default)
  - `validation_fail` (retry-unrolled remediation)

## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_AUTH_MODE` (`auto`, `chatgpt`, or `api_key`)
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `CLAUDE_CODE_OAUTH_TOKEN` (preferred for Claude auth; takes precedence over `ANTHROPIC_API_KEY`)
- `ANTHROPIC_API_KEY` (used when `CLAUDE_CODE_OAUTH_TOKEN` is unset)
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `CLAUDE_OBSERVABILITY_MODE` (`off`, `stdout`, `file`, or `both`)
- `CLAUDE_OBSERVABILITY_VERBOSITY` (`summary` or `full`)
- `CLAUDE_OBSERVABILITY_LOG_PATH`
- `CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL` (`true` or `false`)
- `CLAUDE_OBSERVABILITY_DEBUG` (`true` or `false`)
- `CLAUDE_OBSERVABILITY_DEBUG_LOG_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration / Context

- `AGENT_STATE_ROOT`
- `AGENT_PROJECT_CONTEXT_PATH`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`
- `AGENT_MERGE_CONFLICT_MAX_ATTEMPTS`

### Provisioning / Resource Controls

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_WORKTREE_TARGET_PATH` (optional relative path; enables sparse checkout and sets session working directory to that subfolder)
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`

### Security Middleware

- `AGENT_SECURITY_VIOLATION_MODE` (`hard_abort` or `validation_fail`)
- `AGENT_SECURITY_ALLOWED_BINARIES`
- `AGENT_SECURITY_COMMAND_TIMEOUT_MS`
- `AGENT_SECURITY_AUDIT_LOG_PATH`
- `AGENT_SECURITY_ENV_INHERIT`
- `AGENT_SECURITY_ENV_SCRUB`
- `AGENT_SECURITY_DROP_UID`
- `AGENT_SECURITY_DROP_GID`

### Runtime Events / Telemetry

- `AGENT_RUNTIME_EVENT_LOG_PATH`
- `AGENT_RUNTIME_DISCORD_WEBHOOK_URL`
- `AGENT_RUNTIME_DISCORD_MIN_SEVERITY` (`info`, `warning`, or `critical`)
- `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` (CSV event types such as `session.started,session.completed,session.failed`)

### Operator UI

- `AGENT_UI_HOST` (default `127.0.0.1`)
- `AGENT_UI_PORT` (default `4317`)

### Runtime-Injected (Do Not Configure In `.env`)

- `AGENT_REPO_ROOT`
- `AGENT_WORKTREE_PATH`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_RANGE_START`
- `AGENT_PORT_RANGE_END`
- `AGENT_PORT_PRIMARY`
- `AGENT_DISCOVERY_FILE`

Defaults are documented in `.env.example`.

Auth behavior notes:

- OpenAI/Codex:
  - `OPENAI_AUTH_MODE=auto` (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (`codex login` / ChatGPT plan auth).
  - `OPENAI_AUTH_MODE=chatgpt` always omits API key injection so Codex uses ChatGPT subscription auth/session.
- Claude:
  - If `CLAUDE_CODE_OAUTH_TOKEN` and `ANTHROPIC_API_KEY` are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.

## Quality Gate

```bash
npm run verify
```

Equivalent:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Notes

- Recursive execution APIs on `AgentManager` are internal runtime plumbing; use `SchemaDrivenExecutionEngine.runSession(...)` as the public orchestration entrypoint.

## MCP Migration Note

- Shared MCP server configs no longer accept the legacy `http_headers` alias.
- Use `headers` instead.
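For configs that still carry the legacy alias, the rename described in the migration note can be done by hand or with a small one-off script. The helper below is a hedged sketch: `migrateLegacyHeaders` is a hypothetical name, and the server entry is treated as a plain object because the exact `mcp.config.json` schema is not shown here.

```typescript
// Hypothetical representation of one shared MCP server entry.
type ServerConfig = Record<string, unknown>;

// Rename `http_headers` to `headers`, leaving every other key untouched.
function migrateLegacyHeaders(server: ServerConfig): ServerConfig {
  if (!("http_headers" in server)) return server;

  const { http_headers: legacy, ...rest } = server;
  // If `headers` is already present, keep it and drop the legacy alias.
  return "headers" in rest ? rest : { ...rest, headers: legacy };
}
```

Applying this to each server entry before saving the file should leave already-migrated configs unchanged.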