# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- DAG execution with topology-aware fan-out (`parallel`, `hierarchical`, `retry-unrolled`)
- Project-scoped persistent context store
- Typed domain events for edge-triggered routing
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler policy hooks
- Security middleware for shell/tool policy enforcement
- Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)

## Architecture Summary

- `SchemaDrivenExecutionEngine.runSession(...)` is the single execution entrypoint.
- `PipelineExecutor` owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
- Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
- `AgentManager` is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
- Session state is persisted under `AGENT_STATE_ROOT`.
- Session lifecycle is explicit (`POST /api/sessions`, `POST /api/sessions/:id/run`, `POST /api/sessions/:id/close`) and each session is bound to a target project path.
- Session project context is persisted as schema-versioned JSON (`schemaVersion`) with domains:
  - `globalFlags`
  - `artifactPointers`
  - `taskQueue`
    - each task record stores `taskId`, status, and optional `worktreePath` for task-scoped workspace ownership
    - conflict-aware statuses are supported (`conflict`, `resolving_conflict`)

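As a rough illustration of the persisted project context described above, the sketch below models the three domains as TypeScript types. Only `schemaVersion`, `globalFlags`, `artifactPointers`, `taskQueue`, `taskId`, `worktreePath`, and the `conflict`/`resolving_conflict` statuses come from this README; the other status values, the value types, and the helper `tasksInConflict` are assumptions made for the example and may not match the real schema.

```typescript
// Sketch only: field names come from the README, value types are assumed.
type TaskStatus =
  | "pending" // assumed status
  | "in_progress" // assumed status
  | "done" // assumed status
  | "conflict"
  | "resolving_conflict";

interface TaskRecord {
  taskId: string;
  status: TaskStatus;
  worktreePath?: string; // optional task-scoped workspace
}

interface ProjectContext {
  schemaVersion: number; // assumed to be numeric
  globalFlags: Record<string, boolean>; // assumed value type
  artifactPointers: Record<string, string>; // assumed value type
  taskQueue: TaskRecord[];
}

// Hypothetical helper: list the tasks currently in a conflict-aware status.
function tasksInConflict(context: ProjectContext): string[] {
  return context.taskQueue
    .filter((task) => task.status === "conflict" || task.status === "resolving_conflict")
    .map((task) => task.taskId);
}

const exampleContext: ProjectContext = {
  schemaVersion: 1,
  globalFlags: { requirementsLocked: true },
  artifactPointers: { plan: "docs/plan.md" },
  taskQueue: [
    { taskId: "task-1", status: "done" },
    { taskId: "task-2", status: "conflict", worktreePath: "worktrees/task-2" },
  ],
};

console.log(tasksInConflict(exampleContext));
```
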
## Deep Dives

- Session walkthrough with concrete artifacts from a successful provider run: `docs/session-walkthrough.md`
- Orchestration engine internals: `docs/orchestration-engine.md`
- Runtime event model and sinks: `docs/runtime-events.md`

## Repository Layout

- `src/agents`
  - `orchestration.ts`: engine facade and runtime wiring
  - `pipeline.ts`: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
  - `failure-policy.ts`: hard/soft failure classification policy
  - `lifecycle-observer.ts`: persistence/event lifecycle hooks for node attempts
  - `manifest.ts`: schema parsing/validation for personas/topologies/edges
  - `manager.ts`: recursive fan-out utility used by pipeline
  - `state-context.ts`: persisted node handoffs + session state
  - `project-context.ts`: project-scoped store
  - `domain-events.ts`: typed domain event schema + bus
  - `runtime.ts`: env-driven defaults/singletons
  - `provisioning.ts`: resource provisioning and child suballocation helpers
- `src/mcp`: MCP config types/conversion/handlers
- `src/security`: shell AST parsing, rules engine, secure executor, and audit sinks
- `src/telemetry`: runtime event schema, sink fan-out, file sink, and Discord webhook sink
- `src/ui`: local operator UI server, API routes, run-control service, and graph/event aggregation
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`)
- `src/config.ts`: centralized env parsing/validation/defaulting
- `tests`: manager, manifest, pipeline/orchestration, state, provisioning, MCP

## Setup

```bash
npm install
npm --prefix ui install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

## Run

```bash
npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."
```

Or via unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Operator UI

Start the local UI server:

```bash
npm run ui
```

This script builds the React frontend from `ui/` before serving.

Then open:

- `http://127.0.0.1:4317` (default)

The UI provides:

- graph visualizer with topology/retry rendering, edge trigger labels, node economics (duration/cost/tokens), and critical-path highlighting
- node inspector with attempt metadata and injected `ResolvedExecutionContext` sandbox payload
- live runtime event feed from `AGENT_RUNTIME_EVENT_LOG_PATH` with severity coloring (including security mirror events)
- Claude trace feed from `CLAUDE_OBSERVABILITY_LOG_PATH` (query lifecycle, SDK message types/subtypes, and errors)
- run trigger + kill switch backed by `SchemaDrivenExecutionEngine.runSession(...)`
- run mode selector: `provider` (real Codex/Claude execution) or `mock` (deterministic dry-run executor)
- provider selector: `codex` or `claude`
- run history from `AGENT_STATE_ROOT`
- forms for runtime Discord webhook settings, security policy, and manager/resource limits
- hover help on form labels with short intent guidance for each field
- manifest editor/validator/saver for schema `"1"` manifests

Provider mode notes:

- `provider=codex` uses existing OpenAI/Codex auth settings (`OPENAI_AUTH_MODE`, `CODEX_API_KEY`, `OPENAI_API_KEY`).
- `provider=claude` uses Claude auth resolution (`CLAUDE_CODE_OAUTH_TOKEN` preferred, otherwise `ANTHROPIC_API_KEY`, or existing Claude Code login state).
- `CLAUDE_MODEL` should be a Claude model id/alias recognized by Claude Code (for example `claude-sonnet-4-6`); `anthropic/...` prefixes are normalized automatically.
- `CLAUDE_MAX_TURNS` controls the per-query Claude turn budget (default `2`).
- Claude provider runs can emit Claude SDK/CLI internals to stdout and/or NDJSON with `CLAUDE_OBSERVABILITY_*` settings.
- UI session-mode provider runs execute directly in orchestration-assigned task/base worktrees; provider adapters do not allocate additional nested worktrees.

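The `CLAUDE_MODEL` and `CLAUDE_MAX_TURNS` notes above can be pictured with a small sketch. It assumes that "normalized" means stripping a leading `anthropic/` prefix and that an invalid turn budget is rejected rather than clamped; neither detail is confirmed by this README, and the function names `normalizeClaudeModel` and `parseMaxTurns` are hypothetical.

```typescript
// Assumption: normalization strips a leading "anthropic/" provider prefix.
function normalizeClaudeModel(model: string): string {
  const trimmed = model.trim();
  const prefix = "anthropic/";
  return trimmed.startsWith(prefix) ? trimmed.slice(prefix.length) : trimmed;
}

// CLAUDE_MAX_TURNS is documented as an integer >= 1 with a default of 2.
// Assumption: invalid values raise an error instead of falling back.
function parseMaxTurns(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") {
    return 2;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 1) {
    throw new Error(`CLAUDE_MAX_TURNS must be an integer >= 1, got "${raw}"`);
  }
  return value;
}

console.log(normalizeClaudeModel("anthropic/claude-sonnet-4-6"));
console.log(parseMaxTurns(undefined));
```
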
## Manifest Semantics

`AgentManifest` (schema `"1"`) validates:

- supported topologies (`sequential`, `parallel`, `hierarchical`, `retry-unrolled`)
- persona definitions, optional `modelConstraint`, and tool-clearance policy (validated by shared Zod schema)
- relationship DAG and unknown persona references
- strict pipeline DAG
- topology constraints (`maxDepth`, `maxRetries`)

Pipeline edges can route via:

- legacy status triggers (`on`: `success`, `validation_fail`, `failure`, `always`)
- domain event triggers (`event`: typed domain events)
- conditions (`state_flag`, `history_has_event`, `file_exists`, `always`)
  - `history_has_event` evaluates persisted domain event history (for example `validation_failed`)

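To make the three routing mechanisms above concrete, here is a minimal sketch of how an edge might be evaluated. The trigger and condition names (`on`, `event`, `state_flag`, `history_has_event`, `file_exists`, `always`) come from this README; the edge and snapshot shapes, the `when` field, and the functions `conditionHolds` and `edgeFires` are hypothetical and are not taken from the actual `AgentManifest` schema.

```typescript
type StatusTrigger = "success" | "validation_fail" | "failure" | "always";

// Hypothetical condition shapes; only the `type` names come from the README.
type EdgeCondition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface PipelineEdge {
  from: string;
  to: string;
  on?: StatusTrigger; // legacy status trigger
  event?: string; // domain event trigger
  when?: EdgeCondition[]; // assumed field name for conditions
}

// Assumed view of the state available when routing out of a node.
interface RoutingSnapshot {
  nodeStatus: "success" | "validation_fail" | "failure";
  emittedEvents: string[];
  flags: Record<string, boolean>;
  eventHistory: string[];
  existingFiles: Set<string>;
}

function conditionHolds(condition: EdgeCondition, snapshot: RoutingSnapshot): boolean {
  switch (condition.type) {
    case "always":
      return true;
    case "state_flag":
      return (snapshot.flags[condition.key] ?? false) === condition.equals;
    case "history_has_event":
      return snapshot.eventHistory.includes(condition.event);
    case "file_exists":
      return snapshot.existingFiles.has(condition.path);
  }
}

// An edge fires when its trigger matches and every condition holds.
function edgeFires(edge: PipelineEdge, snapshot: RoutingSnapshot): boolean {
  const triggered =
    edge.event !== undefined
      ? snapshot.emittedEvents.includes(edge.event)
      : edge.on === "always" || edge.on === snapshot.nodeStatus;
  return triggered && (edge.when ?? []).every((c) => conditionHolds(c, snapshot));
}
```

In this sketch an edge that declares `event` ignores `on`; whether the real runtime lets both coexist on one edge is not stated here.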
## Domain Events

Domain events are typed and can trigger edges directly:

- planning: `requirements_defined`, `tasks_planned`
- execution: `code_committed`, `task_ready_for_review`, `task_blocked`
- validation: `validation_passed`, `validation_failed`
- integration: `branch_merged`, `merge_conflict_detected`, `merge_conflict_resolved`, `merge_conflict_unresolved`, `merge_retry_started`

Actors can emit events in `ActorExecutionResult.events`. Pipeline status also emits default validation/execution events.

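One way to express the event catalogue above as a TypeScript union is sketched below. The event names are copied from the list, and the `domain.` prefix matches the `domain.<domain_event_type>` runtime event naming used elsewhere in this README; the constant and function names are hypothetical, and the real schema in `domain-events.ts` may be structured differently.

```typescript
// Event names grouped by domain, copied from the README list.
const DOMAIN_EVENT_TYPES = {
  planning: ["requirements_defined", "tasks_planned"],
  execution: ["code_committed", "task_ready_for_review", "task_blocked"],
  validation: ["validation_passed", "validation_failed"],
  integration: [
    "branch_merged",
    "merge_conflict_detected",
    "merge_conflict_resolved",
    "merge_conflict_unresolved",
    "merge_retry_started",
  ],
} as const;

type DomainEventGroups = typeof DOMAIN_EVENT_TYPES;
type DomainEventType = DomainEventGroups[keyof DomainEventGroups][number];

// Runtime guard for untyped input such as an actor's emitted event name.
function isDomainEventType(value: string): value is DomainEventType {
  return Object.values(DOMAIN_EVENT_TYPES).some((group) =>
    (group as readonly string[]).includes(value),
  );
}

// Maps a domain event to its mirrored runtime event type.
function toRuntimeEventType(type: DomainEventType): string {
  return `domain.${type}`;
}

console.log(toRuntimeEventType("validation_failed"));
```
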
## Retry Matrix and Cancellation

- `validation_fail`: routed through retry-unrolled execution (new child manager session)
- hard failures: timeout/network/403-like failures tracked sequentially; at 2 consecutive hard failures the pipeline aborts fast
- `AbortSignal` is passed into every actor execution input
- session closure aborts child recursive work
- run summaries expose aggregate `status`: success requires successful terminal executed DAG nodes and no critical-path failure

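The fail-fast rule above can be sketched as a counter wired to an `AbortController`. This is an illustration under assumptions, not the pipeline's code: the class name `HardFailureTracker` and the outcome labels are hypothetical, and the sketch assumes that any non-hard outcome resets the consecutive count.

```typescript
type AttemptOutcome = "success" | "validation_fail" | "hard_failure";

class HardFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;
  private readonly controller: AbortController;

  constructor(threshold = 2) {
    this.threshold = threshold;
    this.controller = new AbortController();
  }

  // Signal that would be handed to every actor execution input.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Assumption: any non-hard outcome resets the consecutive counter.
  record(outcome: AttemptOutcome): void {
    this.consecutive = outcome === "hard_failure" ? this.consecutive + 1 : 0;
    if (this.consecutive >= this.threshold) {
      this.controller.abort();
    }
  }
}

const tracker = new HardFailureTracker();
tracker.record("hard_failure");
tracker.record("success"); // resets the streak
tracker.record("hard_failure");
console.log(tracker.signal.aborted); // still running after one hard failure
tracker.record("hard_failure");
console.log(tracker.signal.aborted); // aborted after two in a row
```
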
## Runtime Events

- The pipeline emits runtime lifecycle events (`session.started`, `node.attempt.completed`, `domain.*`, `session.completed`, `session.failed`).
- Runtime events are fan-out only and never used for edge-routing decisions.
- Default sink writes NDJSON to `AGENT_RUNTIME_EVENT_LOG_PATH`.
- Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
- Existing security command audit output (`AGENT_SECURITY_AUDIT_LOG_PATH`) remains in place and is also mirrored into runtime events.

### Runtime Event Fields

Each runtime event is written as one NDJSON object with:

- `id`, `timestamp`, `type`, `severity`
- `sessionId`, `nodeId`, `attempt`
- `message`
- optional `usage` (`tokenInput`, `tokenOutput`, `tokenTotal`, `toolCalls`, `durationMs`, `costUsd`)
- optional structured `metadata`
- `node.attempt.completed` metadata includes:
  - `executionContext` (resolved sandbox payload injected into executor)
  - `topologyKind`
  - `retrySpawned`
  - optional `fromNodeId`, `subtasks`, `securityViolation`

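For readers consuming the log from TypeScript, the field list above could be modelled roughly as follows. The field names are taken from the list; the value types, the optionality of `sessionId`/`nodeId`/`attempt`, and the helpers `parseNdjson` and `totalCostBySession` are assumptions for illustration only.

```typescript
interface RuntimeEventUsage {
  tokenInput?: number;
  tokenOutput?: number;
  tokenTotal?: number;
  toolCalls?: number;
  durationMs?: number;
  costUsd?: number;
}

interface RuntimeEvent {
  id: string;
  timestamp: string; // assumed to be an ISO 8601 string
  type: string;
  severity: string;
  sessionId?: string; // assumed optional
  nodeId?: string; // assumed optional for session-level events
  attempt?: number; // assumed optional for session-level events
  message: string;
  usage?: RuntimeEventUsage;
  metadata?: Record<string, unknown>;
}

// Parses NDJSON text: one JSON object per non-empty line.
function parseNdjson(text: string): RuntimeEvent[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as RuntimeEvent);
}

// Sums `usage.costUsd` per session, treating missing usage as zero cost.
function totalCostBySession(events: RuntimeEvent[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const event of events) {
    if (event.sessionId === undefined) continue;
    const cost = event.usage?.costUsd ?? 0;
    totals.set(event.sessionId, (totals.get(event.sessionId) ?? 0) + cost);
  }
  return totals;
}
```

The sketch performs no schema validation; a production reader would likely validate each line before trusting it.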
### Runtime Event Setup

Add these variables in `.env` (or use defaults):

```bash
AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed
```

Notes:

- File sink is always enabled and appends NDJSON.
- Discord sink is enabled only when `AGENT_RUNTIME_DISCORD_WEBHOOK_URL` is set.
- Discord notifications are sent when event severity is at or above `AGENT_RUNTIME_DISCORD_MIN_SEVERITY`.
- Event types in `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` are always sent, regardless of severity.

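The notification rules in the notes above reduce to a small predicate. The sketch below assumes the severity order `info` < `warning` < `critical`; the interface and function names are hypothetical and do not describe the actual Discord sink.

```typescript
type Severity = "info" | "warning" | "critical";

// Assumed ordering: info < warning < critical.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

interface DiscordSinkConfig {
  webhookUrl?: string; // AGENT_RUNTIME_DISCORD_WEBHOOK_URL
  minSeverity: Severity; // AGENT_RUNTIME_DISCORD_MIN_SEVERITY
  alwaysNotifyTypes: string[]; // AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES
}

// Splits a CSV env value into trimmed, non-empty entries.
function parseCsv(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

function shouldNotifyDiscord(
  event: { type: string; severity: Severity },
  config: DiscordSinkConfig,
): boolean {
  if (!config.webhookUrl) return false; // sink disabled without a webhook URL
  if (config.alwaysNotifyTypes.includes(event.type)) return true;
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[config.minSeverity];
}

const sinkConfig: DiscordSinkConfig = {
  webhookUrl: "https://example.invalid/webhook", // placeholder URL
  minSeverity: "critical",
  alwaysNotifyTypes: parseCsv("session.started,session.completed,session.failed"),
};

console.log(shouldNotifyDiscord({ type: "session.started", severity: "info" }, sinkConfig));
console.log(shouldNotifyDiscord({ type: "node.attempt.completed", severity: "warning" }, sinkConfig));
```
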
### Event Types Emitted by Runtime

- Session lifecycle:
  - `session.started`
  - `session.completed`
  - `session.failed`
- Node/domain lifecycle:
  - `node.attempt.completed`
  - `domain.<domain_event_type>` (for example `domain.validation_failed`)
- Security mirror events:
  - `security.shell.command_profiled`
  - `security.shell.command_allowed`
  - `security.shell.command_blocked`
  - `security.tool.invocation_allowed`
  - `security.tool.invocation_blocked`

## Claude Observability

- `CLAUDE_OBSERVABILITY_MODE=stdout` prints structured Claude query internals (tool progress, system events, stderr, result lifecycle) to stdout as JSON lines prefixed with `[claude-trace]`.
- `CLAUDE_OBSERVABILITY_MODE=file` appends the same records to `CLAUDE_OBSERVABILITY_LOG_PATH`.
- `CLAUDE_OBSERVABILITY_MODE=both` enables both outputs.
- Output samples high-frequency `tool_progress` events to avoid log flooding while retaining suppression counters.
- `assistant` and `user` message records are retained so turn flow is inspectable end-to-end.
- `CLAUDE_OBSERVABILITY_VERBOSITY=summary` stores compact metadata; `full` stores redacted full SDK message payloads.
- `CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL=true` enables and emits sampled partial assistant stream events from the SDK.
- `CLAUDE_OBSERVABILITY_DEBUG=true` enables Claude SDK debug mode.
- `CLAUDE_OBSERVABILITY_DEBUG_LOG_PATH` writes Claude SDK debug output to a file (also enables debug mode).
- In UI/provider mode, `CLAUDE_OBSERVABILITY_LOG_PATH` resolves relative to the repo workspace root.
- UI API: `GET /api/claude-trace?limit=<n>&sessionId=<id>` reads filtered Claude trace records.

Example:

```bash
CLAUDE_OBSERVABILITY_MODE=both
CLAUDE_OBSERVABILITY_VERBOSITY=summary
CLAUDE_OBSERVABILITY_LOG_PATH=.ai_ops/events/claude-trace.ndjson
CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL=false
CLAUDE_OBSERVABILITY_DEBUG=false
```

### Analytics Quick Start

Inspect latest events:

```bash
tail -n 50 .ai_ops/events/runtime-events.ndjson
```

Count events by type:

```bash
jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c
```

Get only critical events:

```bash
jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson
```

## Security Middleware

- Shell command parsing uses async `sh-syntax` (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
- Rules are validated with strict Zod schemas (`src/security/schemas.ts`) before execution.
- `SecurityRulesEngine` enforces:
  - binary allowlists
  - cwd/worktree boundary checks
  - path traversal blocking (`../`)
  - protected path blocking (state root + project context path)
  - unified tool allowlist/banlist checks for shell binaries and MCP tool lists
- `SecureCommandExecutor` runs commands via `child_process.spawn` with:
  - explicit env scrub/inject policy (no implicit full env inheritance)
  - timeout enforcement
  - optional uid/gid drop
  - stdout/stderr streaming hooks for audit
- Every actor execution input includes a pre-resolved `executionContext` (`phase`, `modelConstraint`, `allowedTools`, and immutable security constraints) generated by orchestration per node attempt.
- Every actor execution input includes `security` helpers (`rulesEngine`, `createCommandExecutor(...)`) so executors can enforce shell/tool policy at the execution boundary.
- Every actor execution input includes `mcp` helpers (`resolvedConfig`, `resolveConfig(...)`, `filterToolsForProvider(...)`, `createClaudeCanUseTool()`) so provider adapters can filter tools against `executionContext.allowedTools` before SDK calls.
- For Claude-based executors, pass `input.mcp.filterToolsForProvider(...)` and `input.mcp.createClaudeCanUseTool()` into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
- Claude `canUseTool` permission checks normalize provider casing (`Bash` vs `bash`) before enforcing persona allowlists.
- Pipeline behavior on `SecurityViolationError` is configurable:
  - `hard_abort` (default)
  - `validation_fail` (retry-unrolled remediation)
  - `dangerous_warn_only` (logs violations and continues execution; high risk)

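A few of the checks listed above (binary allowlist, `../` blocking, and case-insensitive tool matching) are simple enough to sketch. The code below is a simplified illustration with hypothetical names; it works on already-split arguments and does not reproduce the `sh-syntax` AST parsing, boundary checks, or the real `SecurityRulesEngine` API.

```typescript
// Hypothetical policy shape for this sketch.
interface ShellPolicy {
  allowedBinaries: string[];
  allowedTools: string[];
}

type PolicyDecision = { allowed: true } | { allowed: false; reason: string };

// Normalizes provider casing so "Bash" and "bash" compare equal.
function normalizeToolName(name: string): string {
  return name.trim().toLowerCase();
}

function isToolAllowed(tool: string, policy: ShellPolicy): boolean {
  const allowed = new Set(policy.allowedTools.map((name) => normalizeToolName(name)));
  return allowed.has(normalizeToolName(tool));
}

// Flags any argument containing a ".." path segment.
function hasPathTraversal(argument: string): boolean {
  return argument.split(/[\\/]/).includes("..");
}

function checkCommand(binary: string, args: string[], policy: ShellPolicy): PolicyDecision {
  if (!policy.allowedBinaries.includes(binary)) {
    return { allowed: false, reason: `binary not allowlisted: ${binary}` };
  }
  const offending = args.find((argument) => hasPathTraversal(argument));
  if (offending !== undefined) {
    return { allowed: false, reason: `path traversal in argument: ${offending}` };
  }
  return { allowed: true };
}

const policy: ShellPolicy = { allowedBinaries: ["git", "cat"], allowedTools: ["bash", "read"] };
console.log(checkCommand("git", ["status"], policy));
console.log(checkCommand("cat", ["../secrets.txt"], policy));
console.log(isToolAllowed("Bash", policy));
```
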
## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_AUTH_MODE` (`auto`, `chatgpt`, or `api_key`)
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `CLAUDE_CODE_OAUTH_TOKEN` (preferred for Claude auth; takes precedence over `ANTHROPIC_API_KEY`)
- `ANTHROPIC_API_KEY` (used when `CLAUDE_CODE_OAUTH_TOKEN` is unset)
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `CLAUDE_MAX_TURNS` (integer >= 1, defaults to `2`)
- `CLAUDE_OBSERVABILITY_MODE` (`off`, `stdout`, `file`, or `both`)
- `CLAUDE_OBSERVABILITY_VERBOSITY` (`summary` or `full`)
- `CLAUDE_OBSERVABILITY_LOG_PATH`
- `CLAUDE_OBSERVABILITY_INCLUDE_PARTIAL` (`true` or `false`)
- `CLAUDE_OBSERVABILITY_DEBUG` (`true` or `false`)
- `CLAUDE_OBSERVABILITY_DEBUG_LOG_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration / Context

- `AGENT_STATE_ROOT`
- `AGENT_PROJECT_CONTEXT_PATH`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`
- `AGENT_MERGE_CONFLICT_MAX_ATTEMPTS`

### Provisioning / Resource Controls

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_WORKTREE_TARGET_PATH` (optional relative path; enables sparse checkout and sets session working directory to that subfolder)
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`

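This README does not spell out how the `AGENT_PORT_*` settings combine, so the following is only one plausible reading: each session claims a block index, and its range is derived arithmetically from the base, block size, and primary offset. The formula, the interface names, and `allocatePortBlock` are all assumptions; consult `provisioning.ts` for the actual behavior.

```typescript
// Hypothetical config mirroring the AGENT_PORT_* variables.
interface PortConfig {
  portBase: number; // AGENT_PORT_BASE
  blockSize: number; // AGENT_PORT_BLOCK_SIZE
  blockCount: number; // AGENT_PORT_BLOCK_COUNT
  primaryOffset: number; // AGENT_PORT_PRIMARY_OFFSET
}

interface PortAllocation {
  rangeStart: number;
  rangeEnd: number; // inclusive
  primary: number;
}

// Assumed formula: block N covers [base + N * size, base + (N + 1) * size - 1].
function allocatePortBlock(config: PortConfig, blockIndex: number): PortAllocation {
  if (!Number.isInteger(blockIndex) || blockIndex < 0 || blockIndex >= config.blockCount) {
    throw new RangeError(`block index ${blockIndex} is outside 0..${config.blockCount - 1}`);
  }
  if (config.primaryOffset < 0 || config.primaryOffset >= config.blockSize) {
    throw new RangeError("primary offset must fall inside the block");
  }
  const rangeStart = config.portBase + blockIndex * config.blockSize;
  return {
    rangeStart,
    rangeEnd: rangeStart + config.blockSize - 1,
    primary: rangeStart + config.primaryOffset,
  };
}

// Example values are illustrative, not the project's defaults.
const portConfig: PortConfig = { portBase: 41000, blockSize: 10, blockCount: 5, primaryOffset: 0 };
console.log(allocatePortBlock(portConfig, 2));
```

Because the result depends only on the configuration and the block index, the same index always yields the same range, which is one way to read "deterministic port ranges".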
### Security Middleware

- `AGENT_SECURITY_VIOLATION_MODE` (`hard_abort`, `validation_fail`, or `dangerous_warn_only`)
- `AGENT_SECURITY_ALLOWED_BINARIES`
- `AGENT_SECURITY_COMMAND_TIMEOUT_MS`
- `AGENT_SECURITY_AUDIT_LOG_PATH`
- `AGENT_SECURITY_ENV_INHERIT`
- `AGENT_SECURITY_ENV_SCRUB`
- `AGENT_SECURITY_DROP_UID`
- `AGENT_SECURITY_DROP_GID`

### Runtime Events / Telemetry

- `AGENT_RUNTIME_EVENT_LOG_PATH`
- `AGENT_RUNTIME_DISCORD_WEBHOOK_URL`
- `AGENT_RUNTIME_DISCORD_MIN_SEVERITY` (`info`, `warning`, or `critical`)
- `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` (CSV event types such as `session.started,session.completed,session.failed`)

### Operator UI

- `AGENT_UI_HOST` (default `127.0.0.1`)
- `AGENT_UI_PORT` (default `4317`)

### Runtime-Injected (Do Not Configure In `.env`)

- `AGENT_REPO_ROOT`
- `AGENT_WORKTREE_PATH`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_RANGE_START`
- `AGENT_PORT_RANGE_END`
- `AGENT_PORT_PRIMARY`
- `AGENT_DISCOVERY_FILE`

Defaults are documented in `.env.example`.

Auth behavior notes:

- OpenAI/Codex:
  - `OPENAI_AUTH_MODE=auto` (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (`codex login` / ChatGPT plan auth).
  - `OPENAI_AUTH_MODE=chatgpt` always omits API key injection so Codex uses ChatGPT subscription auth/session.
- Claude:
  - If `CLAUDE_CODE_OAUTH_TOKEN` and `ANTHROPIC_API_KEY` are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.

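The precedence rules in the auth notes above can be summarized in code. The sketch below follows the documented order for Claude and the `auto`/`chatgpt` modes for OpenAI/Codex; the function names, return shapes, the preference for `CODEX_API_KEY` over `OPENAI_API_KEY`, and the error thrown in `api_key` mode are assumptions not stated in this README.

```typescript
type Env = Record<string, string | undefined>;

type ClaudeAuth =
  | { kind: "oauth_token"; token: string }
  | { kind: "api_key"; apiKey: string }
  | { kind: "existing_login" }; // no runtime auth options injected

// Returns a trimmed, non-empty env value or undefined.
function readEnv(env: Env, name: string): string | undefined {
  const value = env[name]?.trim();
  return value ? value : undefined;
}

// CLAUDE_CODE_OAUTH_TOKEN wins, then ANTHROPIC_API_KEY, else existing login state.
function resolveClaudeAuth(env: Env): ClaudeAuth {
  const token = readEnv(env, "CLAUDE_CODE_OAUTH_TOKEN");
  if (token !== undefined) return { kind: "oauth_token", token };
  const apiKey = readEnv(env, "ANTHROPIC_API_KEY");
  if (apiKey !== undefined) return { kind: "api_key", apiKey };
  return { kind: "existing_login" };
}

// Returns the API key to inject, or undefined to rely on Codex CLI login.
function resolveOpenAiApiKey(env: Env): string | undefined {
  const mode = readEnv(env, "OPENAI_AUTH_MODE") ?? "auto";
  if (mode === "chatgpt") return undefined; // never inject a key
  // Assumption: CODEX_API_KEY is preferred over OPENAI_API_KEY.
  const key = readEnv(env, "CODEX_API_KEY") ?? readEnv(env, "OPENAI_API_KEY");
  if (mode === "api_key" && key === undefined) {
    // Assumption: api_key mode requires a configured key.
    throw new Error("OPENAI_AUTH_MODE=api_key requires CODEX_API_KEY or OPENAI_API_KEY");
  }
  return key;
}

console.log(resolveClaudeAuth({ ANTHROPIC_API_KEY: "sk-example" }));
console.log(resolveOpenAiApiKey({ OPENAI_AUTH_MODE: "chatgpt", OPENAI_API_KEY: "sk-example" }));
```
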
## Quality Gate

```bash
npm run verify
```

Equivalent:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Notes

- Recursive execution APIs on `AgentManager` are internal runtime plumbing; use `SchemaDrivenExecutionEngine.runSession(...)` as the public orchestration entrypoint.

## MCP Migration Note

- Shared MCP server configs no longer accept the legacy `http_headers` alias.
- Use `headers` instead.