# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- DAG execution with topology-aware fan-out (`parallel`, `hierarchical`, `retry-unrolled`)
- Project-scoped persistent context store
- Typed domain events for edge-triggered routing
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler policy hooks
- Security middleware for shell/tool policy enforcement
- Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)

## Architecture Summary

- `SchemaDrivenExecutionEngine.runSession(...)` is the single execution entrypoint.
- `PipelineExecutor` owns runtime control flow and topology dispatch; it delegates failure classification and persistence/event side effects to dedicated policies.
- Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
- `AgentManager` is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
- Session state is persisted under `AGENT_STATE_ROOT`.
- Project state is persisted under `AGENT_PROJECT_CONTEXT_PATH` with schema-versioned JSON (`schemaVersion`) and domains:
  - `globalFlags`
  - `artifactPointers`
  - `taskQueue`
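
As a rough illustration of what a schema-versioned project store could look like, the sketch below parses a JSON document with the three domains listed above. Only `schemaVersion` and the domain names come from this README; the value types inside each domain, the version number `1`, and the `parseProjectContext` helper are assumptions made for the example.

```typescript
// Hypothetical shape: only `schemaVersion` and the three domain names are documented.
interface ProjectContext {
  schemaVersion: number;
  globalFlags: Record<string, boolean>;
  artifactPointers: Record<string, string>;
  taskQueue: Array<{ id: string; status: string }>;
}

// Assumed version number, for illustration only.
const CURRENT_SCHEMA_VERSION = 1;

// Parse persisted JSON, reject unknown schema versions, and default missing domains.
function parseProjectContext(raw: string): ProjectContext {
  const data = JSON.parse(raw) as Partial<ProjectContext>;
  if (data.schemaVersion !== CURRENT_SCHEMA_VERSION) {
    throw new Error(`Unsupported schemaVersion: ${String(data.schemaVersion)}`);
  }
  return {
    schemaVersion: CURRENT_SCHEMA_VERSION,
    globalFlags: data.globalFlags ?? {},
    artifactPointers: data.artifactPointers ?? {},
    taskQueue: data.taskQueue ?? [],
  };
}
```

Rejecting an unexpected `schemaVersion` up front keeps a newer or older file from being silently reinterpreted; the real store may migrate instead of throwing.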

## Repository Layout

- `src/agents`
  - `orchestration.ts`: engine facade and runtime wiring
  - `pipeline.ts`: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
  - `failure-policy.ts`: hard/soft failure classification policy
  - `lifecycle-observer.ts`: persistence/event lifecycle hooks for node attempts
  - `manifest.ts`: schema parsing/validation for personas/topologies/edges
  - `manager.ts`: recursive fan-out utility used by pipeline
  - `state-context.ts`: persisted node handoffs + session state
  - `project-context.ts`: project-scoped store
  - `domain-events.ts`: typed domain event schema + bus
  - `runtime.ts`: env-driven defaults/singletons
  - `provisioning.ts`: resource provisioning and child suballocation helpers
- `src/mcp`: MCP config types/conversion/handlers
- `src/security`: shell AST parsing, rules engine, secure executor, and audit sinks
- `src/telemetry`: runtime event schema, sink fan-out, file sink, and Discord webhook sink
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`)
- `src/config.ts`: centralized env parsing/validation/defaulting
- `tests`: manager, manifest, pipeline/orchestration, state, provisioning, MCP

## Setup

```bash
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

## Run

```bash
npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."
```

Or via the unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Manifest Semantics

`AgentManifest` (schema `"1"`) validates:

- supported topologies (`sequential`, `parallel`, `hierarchical`, `retry-unrolled`)
- persona definitions, optional `modelConstraint`, and tool-clearance policy (validated by shared Zod schema)
- relationship DAG and unknown persona references
- strict pipeline DAG
- topology constraints (`maxDepth`, `maxRetries`)

Pipeline edges can route via:

- legacy status triggers (`on`: `success`, `validation_fail`, `failure`, `always`)
- domain event triggers (`event`: typed domain events)
- conditions (`state_flag`, `history_has_event`, `file_exists`, `always`)
  - `history_has_event` evaluates the persisted domain event history (for example `validation_failed`)
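
The routing rules above can be sketched as a small predicate. This is an illustration under stated assumptions, not the manifest schema: the `on` and `event` keys, the status values, and the condition names come from this README, while the `Edge` and `RoutingContext` shapes, the `when` field, and the `edgeFires` helper are hypothetical.

```typescript
type StatusTrigger = "success" | "validation_fail" | "failure" | "always";

// Condition names are from the README; their payload fields are assumed.
type Condition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface Edge {
  from: string;
  to: string;
  on?: StatusTrigger; // legacy status trigger
  event?: string; // domain event trigger
  when?: Condition[]; // hypothetical field name
}

interface RoutingContext {
  status: Exclude<StatusTrigger, "always">;
  emittedEvents: string[]; // events emitted by the node attempt
  flags: Record<string, boolean>;
  eventHistory: string[]; // persisted domain event history
  fileExists: (path: string) => boolean; // injected so the sketch stays pure
}

function conditionHolds(c: Condition, ctx: RoutingContext): boolean {
  switch (c.type) {
    case "always":
      return true;
    case "state_flag":
      return ctx.flags[c.key] === c.equals;
    case "history_has_event":
      return ctx.eventHistory.includes(c.event);
    case "file_exists":
      return ctx.fileExists(c.path);
  }
}

// An edge fires when its trigger matches and every condition holds.
function edgeFires(edge: Edge, ctx: RoutingContext): boolean {
  const triggered =
    edge.event !== undefined
      ? ctx.emittedEvents.includes(edge.event)
      : edge.on === "always" || edge.on === ctx.status;
  return triggered && (edge.when ?? []).every((c) => conditionHolds(c, ctx));
}
```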

## Domain Events

Domain events are typed and can trigger edges directly:

- planning: `requirements_defined`, `tasks_planned`
- execution: `code_committed`, `task_blocked`
- validation: `validation_passed`, `validation_failed`
- integration: `branch_merged`

Actors can emit events in `ActorExecutionResult.events`. The pipeline also emits default validation/execution events derived from node status.

## Retry Matrix and Cancellation

- `validation_fail`: routed through retry-unrolled execution (a new child manager session)
- hard failures: timeout, network, and 403-like failures are counted consecutively; after 2 consecutive hard failures the pipeline aborts immediately
- `AbortSignal` is passed into every actor execution input
- session closure aborts recursive child work
- run summaries expose an aggregate `status`: success requires that the executed terminal DAG nodes succeeded and that no critical-path failure occurred
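
A minimal sketch of the consecutive-hard-failure rule, assuming that any non-hard outcome resets the counter. The `HardFailureGuard` class and the `AttemptOutcome` labels are hypothetical names for this example; only the threshold of 2 and the use of `AbortSignal` come from this README.

```typescript
type AttemptOutcome = "success" | "validation_fail" | "hard_failure";

class HardFailureGuard {
  private consecutive = 0;
  private readonly threshold: number;
  private readonly controller = new AbortController();

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Signal that would be handed to each actor execution input.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Count hard failures in a row; assumed: any other outcome resets the streak.
  record(outcome: AttemptOutcome): void {
    this.consecutive = outcome === "hard_failure" ? this.consecutive + 1 : 0;
    if (this.consecutive >= this.threshold) {
      this.controller.abort();
    }
  }
}
```

Actors that honor `signal` can stop in-flight work as soon as the guard trips, which is what makes the abort fast rather than waiting for the next scheduling step.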

## Runtime Events

- The pipeline emits runtime lifecycle events (`session.started`, `node.attempt.completed`, `domain.*`, `session.completed`, `session.failed`).
- Runtime events are fan-out only and never used for edge-routing decisions.
- Default sink writes NDJSON to `AGENT_RUNTIME_EVENT_LOG_PATH`.
- Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
- The existing security command audit log (`AGENT_SECURITY_AUDIT_LOG_PATH`) remains in place; its entries are also mirrored into runtime events.

### Runtime Event Fields

Each runtime event is written as one NDJSON object with:

- `id`, `timestamp`, `type`, `severity`
- `sessionId`, `nodeId`, `attempt`
- `message`
- optional `usage` (`tokenInput`, `tokenOutput`, `tokenTotal`, `toolCalls`, `durationMs`, `costUsd`)
- optional structured `metadata`
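
For consumers that want types, the fields above might map to something like the following. The field names are taken from the list; which fields are optional beyond `usage` and `metadata`, the timestamp being a string, and the `toNdjsonLine` helper are assumptions.

```typescript
type Severity = "info" | "warning" | "critical";

interface RuntimeEventUsage {
  tokenInput?: number;
  tokenOutput?: number;
  tokenTotal?: number;
  toolCalls?: number;
  durationMs?: number;
  costUsd?: number;
}

interface RuntimeEvent {
  id: string;
  timestamp: string; // assumed to be an ISO-8601 string
  type: string;
  severity: Severity;
  sessionId: string;
  nodeId?: string; // assumed absent on session-level events
  attempt?: number; // assumed absent on session-level events
  message: string;
  usage?: RuntimeEventUsage;
  metadata?: Record<string, unknown>;
}

// NDJSON: exactly one JSON object per line, terminated by a newline.
function toNdjsonLine(event: RuntimeEvent): string {
  return JSON.stringify(event) + "\n";
}
```

`JSON.stringify` escapes newlines inside string values, so a multi-line `message` still serializes to a single line.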

### Runtime Event Setup

Add these variables in `.env` (or use defaults):

```bash
AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed
```

Notes:

- File sink is always enabled and appends NDJSON.
- Discord sink is enabled only when `AGENT_RUNTIME_DISCORD_WEBHOOK_URL` is set.
- Discord notifications are sent when event severity is at or above `AGENT_RUNTIME_DISCORD_MIN_SEVERITY`.
- Event types in `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` are always sent, regardless of severity.
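
The notification rules in these notes reduce to a short predicate. The sketch below assumes the severity ordering `info` < `warning` < `critical`; `DiscordPolicy`, `parseCsv`, and `shouldNotifyDiscord` are hypothetical names, not the runtime's API.

```typescript
type Severity = "info" | "warning" | "critical";

// Assumed ordering, lowest to highest.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

interface DiscordPolicy {
  webhookUrl?: string;
  minSeverity: Severity;
  alwaysNotifyTypes: string[];
}

// Split a CSV env value such as AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES.
function parseCsv(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

function shouldNotifyDiscord(
  event: { type: string; severity: Severity },
  policy: DiscordPolicy,
): boolean {
  if (!policy.webhookUrl) return false; // sink disabled without a webhook URL
  if (policy.alwaysNotifyTypes.includes(event.type)) return true; // bypasses severity
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[policy.minSeverity];
}
```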

### Event Types Emitted by Runtime

- Session lifecycle:
  - `session.started`
  - `session.completed`
  - `session.failed`
- Node/domain lifecycle:
  - `node.attempt.completed`
  - `domain.<domain_event_type>` (for example `domain.validation_failed`)
- Security mirror events:
  - `security.shell.command_profiled`
  - `security.shell.command_allowed`
  - `security.shell.command_blocked`
  - `security.tool.invocation_allowed`
  - `security.tool.invocation_blocked`

### Analytics Quick Start

Inspect latest events:

```bash
tail -n 50 .ai_ops/events/runtime-events.ndjson
```

Count events by type:

```bash
jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c
```

Get only critical events:

```bash
jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson
```
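
For aggregations that are awkward in `jq`, the log can also be processed in TypeScript. The sketch below sums `usage.costUsd` per `sessionId` from NDJSON text; `costBySession` is a hypothetical helper, and it assumes events without a numeric cost should simply be skipped.

```typescript
interface UsageEvent {
  sessionId?: string;
  usage?: { costUsd?: number };
}

// Sum usage.costUsd per session from the raw NDJSON file contents.
function costBySession(ndjson: string): Map<string, number> {
  const totals = new Map<string, number>();
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines and the trailing newline
    const event = JSON.parse(line) as UsageEvent;
    const cost = event.usage?.costUsd;
    if (event.sessionId === undefined || typeof cost !== "number") continue;
    totals.set(event.sessionId, (totals.get(event.sessionId) ?? 0) + cost);
  }
  return totals;
}
```

Pass it the contents of the file at `AGENT_RUNTIME_EVENT_LOG_PATH`, for example as read with `fs.readFileSync(path, "utf8")`.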

## Security Middleware

- Shell command parsing uses async `sh-syntax` (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
- Rules are validated with strict Zod schemas (`src/security/schemas.ts`) before execution.
- `SecurityRulesEngine` enforces:
  - binary allowlists
  - cwd/worktree boundary checks
  - path traversal blocking (`../`)
  - protected path blocking (state root + project context path)
  - unified tool allowlist/banlist checks for shell binaries and MCP tool lists
- `SecureCommandExecutor` runs commands via `child_process.spawn` with:
  - explicit env scrub/inject policy (no implicit full env inheritance)
  - timeout enforcement
  - optional uid/gid drop
  - stdout/stderr streaming hooks for audit
- Every actor execution input includes a pre-resolved `executionContext` (`phase`, `modelConstraint`, `allowedTools`, and immutable security constraints) generated by orchestration per node attempt.
- Every actor execution input includes `security` helpers (`rulesEngine`, `createCommandExecutor(...)`) so executors can enforce shell/tool policy at the execution boundary.
- Every actor execution input includes `mcp` helpers (`resolvedConfig`, `resolveConfig(...)`, `filterToolsForProvider(...)`, `createClaudeCanUseTool()`) so provider adapters can filter tools against `executionContext.allowedTools` before SDK calls.
- For Claude-based executors, pass `input.mcp.filterToolsForProvider(...)` and `input.mcp.createClaudeCanUseTool()` into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
- Pipeline behavior on `SecurityViolationError` is configurable:
  - `hard_abort` (default)
  - `validation_fail` (retry-unrolled remediation)
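
To make the policy flow concrete, here is a simplified sketch of two of the checks and of how the configured violation mode could steer the pipeline. The `SecurityViolationError` class below is a local stand-in for the repository's own class, and `checkBinary`, `checkPathArg`, `classifyFailure`, and the outcome labels are hypothetical; the real `SecurityRulesEngine` works on a parsed shell AST rather than raw strings.

```typescript
type ViolationMode = "hard_abort" | "validation_fail";

// Stand-in for the repository's error type.
class SecurityViolationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SecurityViolationError";
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, SecurityViolationError.prototype);
  }
}

// Binary allowlist check.
function checkBinary(binary: string, allowed: ReadonlySet<string>): void {
  if (!allowed.has(binary)) {
    throw new SecurityViolationError(`binary not allowlisted: ${binary}`);
  }
}

// Path traversal check: reject any argument containing a `..` segment.
function checkPathArg(arg: string): void {
  if (arg.split(/[\\/]/).includes("..")) {
    throw new SecurityViolationError(`path traversal blocked: ${arg}`);
  }
}

// Map an error to a pipeline outcome according to the configured mode.
function classifyFailure(
  error: unknown,
  mode: ViolationMode,
): "abort" | "validation_fail" | "failure" {
  if (error instanceof SecurityViolationError) {
    return mode === "hard_abort" ? "abort" : "validation_fail";
  }
  return "failure";
}
```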

## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_AUTH_MODE` (`auto`, `chatgpt`, or `api_key`)
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `CLAUDE_CODE_OAUTH_TOKEN` (preferred for Claude auth; takes precedence over `ANTHROPIC_API_KEY`)
- `ANTHROPIC_API_KEY` (used when `CLAUDE_CODE_OAUTH_TOKEN` is unset)
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration / Context

- `AGENT_STATE_ROOT`
- `AGENT_PROJECT_CONTEXT_PATH`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`

### Provisioning / Resource Controls

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`
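
One plausible reading of how the port variables combine into a deterministic range is sketched below. The arithmetic, the inclusive range end, and the `portBlock` helper are assumptions; how a block index is actually claimed (for example through `AGENT_PORT_LOCK_DIR`) is not covered here, and the numbers are illustrative rather than defaults.

```typescript
interface PortConfig {
  base: number; // AGENT_PORT_BASE
  blockSize: number; // AGENT_PORT_BLOCK_SIZE
  blockCount: number; // AGENT_PORT_BLOCK_COUNT
  primaryOffset: number; // AGENT_PORT_PRIMARY_OFFSET
}

interface PortAllocation {
  rangeStart: number;
  rangeEnd: number; // assumed inclusive
  primary: number;
}

// Same config and block index always yield the same range.
function portBlock(config: PortConfig, blockIndex: number): PortAllocation {
  if (!Number.isInteger(blockIndex) || blockIndex < 0 || blockIndex >= config.blockCount) {
    throw new RangeError(`block index out of range: ${blockIndex}`);
  }
  if (config.primaryOffset < 0 || config.primaryOffset >= config.blockSize) {
    throw new RangeError(`primary offset outside block: ${config.primaryOffset}`);
  }
  const rangeStart = config.base + blockIndex * config.blockSize;
  return {
    rangeStart,
    rangeEnd: rangeStart + config.blockSize - 1,
    primary: rangeStart + config.primaryOffset,
  };
}
```

Under this reading, the three returned values would correspond to the runtime-injected `AGENT_PORT_RANGE_START`, `AGENT_PORT_RANGE_END`, and `AGENT_PORT_PRIMARY`.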

### Security Middleware

- `AGENT_SECURITY_VIOLATION_MODE` (`hard_abort` or `validation_fail`)
- `AGENT_SECURITY_ALLOWED_BINARIES`
- `AGENT_SECURITY_COMMAND_TIMEOUT_MS`
- `AGENT_SECURITY_AUDIT_LOG_PATH`
- `AGENT_SECURITY_ENV_INHERIT`
- `AGENT_SECURITY_ENV_SCRUB`
- `AGENT_SECURITY_DROP_UID`
- `AGENT_SECURITY_DROP_GID`

### Runtime Events / Telemetry

- `AGENT_RUNTIME_EVENT_LOG_PATH`
- `AGENT_RUNTIME_DISCORD_WEBHOOK_URL`
- `AGENT_RUNTIME_DISCORD_MIN_SEVERITY` (`info`, `warning`, or `critical`)
- `AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES` (CSV event types such as `session.started,session.completed,session.failed`)

### Runtime-Injected (Do Not Configure In `.env`)

- `AGENT_REPO_ROOT`
- `AGENT_WORKTREE_PATH`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_RANGE_START`
- `AGENT_PORT_RANGE_END`
- `AGENT_PORT_PRIMARY`
- `AGENT_DISCOVERY_FILE`

Defaults are documented in `.env.example`.

Auth behavior notes:

- OpenAI/Codex:
  - `OPENAI_AUTH_MODE=auto` (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (`codex login` / ChatGPT plan auth).
  - `OPENAI_AUTH_MODE=chatgpt` always omits API key injection so Codex uses ChatGPT subscription auth/session.
- Claude:
  - If `CLAUDE_CODE_OAUTH_TOKEN` and `ANTHROPIC_API_KEY` are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.
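
The auth rules above could be expressed as two small resolvers. This is a sketch, not the code in `src/config.ts`: the function names and return shapes are hypothetical, the precedence of `CODEX_API_KEY` over `OPENAI_API_KEY` is an assumption, and `api_key` mode is treated like the key lookup in `auto` because its exact behavior is not described here.

```typescript
type Env = Record<string, string | undefined>;

// Returns the API key to inject into Codex, or undefined to rely on CLI login.
function resolveCodexApiKey(env: Env): string | undefined {
  const mode = env.OPENAI_AUTH_MODE ?? "auto";
  if (mode === "chatgpt") return undefined; // never inject a key in this mode
  // Assumed precedence between the two key variables.
  return env.CODEX_API_KEY || env.OPENAI_API_KEY || undefined;
}

// OAuth token wins over the API key; an empty result defers to Claude Code login state.
function resolveClaudeAuth(env: Env): { oauthToken?: string; apiKey?: string } {
  if (env.CLAUDE_CODE_OAUTH_TOKEN) return { oauthToken: env.CLAUDE_CODE_OAUTH_TOKEN };
  if (env.ANTHROPIC_API_KEY) return { apiKey: env.ANTHROPIC_API_KEY };
  return {};
}
```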

## Quality Gate

```bash
npm run verify
```

Equivalent:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Notes

- Recursive execution APIs on `AgentManager` are internal runtime plumbing; use `SchemaDrivenExecutionEngine.runSession(...)` as the public orchestration entrypoint.

## MCP Migration Note

- Shared MCP server configs no longer accept the legacy `http_headers` alias.
- Use `headers` instead.
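
If existing configs still use the old key, a one-off migration along these lines may help. `migrateServerEntry` is a hypothetical helper that operates on a single server entry; where server entries live inside `mcp.config.json` is not assumed here.

```typescript
type ServerEntry = Record<string, unknown>;

// Rename the legacy `http_headers` key to `headers` on one server entry.
function migrateServerEntry(server: ServerEntry): ServerEntry {
  if (!("http_headers" in server)) return server; // nothing to migrate
  const { http_headers, ...rest } = server;
  if ("headers" in rest) {
    // Refuse to guess which value should win.
    throw new Error("server entry defines both `headers` and `http_headers`");
  }
  return { ...rest, headers: http_headers };
}
```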