
# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK integration (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK integration (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- Stateless node handoffs via persisted state/context payloads
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler-based policy hooks

## Current Status

- Provider entrypoints (`codex`, `claude`) run with session limits and resource provisioning.
- Schema-driven orchestration is implemented as reusable modules under `src/agents`.
- Recursive `AgentManager.runRecursiveAgent(...)` supports fanout/fan-in orchestration with abort propagation.
- Tool clearance (allowlist/banlist) is modeled in the schema, but it is not yet enforced at tool execution boundaries.

## Repository Layout

- `src/agents`:
  - `manager.ts`: queue-based concurrency limits + recursive fanout/fan-in orchestration.
  - `runtime.ts`: env-driven runtime singletons and defaults.
  - `manifest.ts`: `AgentManifest` schema parsing + validation (strict DAG).
  - `persona-registry.ts`: prompt templating + persona behavior events.
  - `pipeline.ts`: actor-oriented DAG runner with retries and state-dependent routing.
  - `state-context.ts`: persisted state + stateless handoff reconstruction.
  - `provisioning.ts`: extensible resource orchestration + child suballocation support.
  - `orchestration.ts`: `SchemaDrivenExecutionEngine` facade.
- `src/mcp`: MCP config types, conversions, and handler resolution.
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`).
- `tests`: unit coverage for manager, manifest, pipeline/orchestration, state context, MCP, and provisioning behavior.
- `docs/orchestration-engine.md`: design notes for the orchestration architecture.

## Prerequisites

- Node.js 18+
- npm

## Setup

```bash
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

Fill in any values you need in `.env`.

## Run

Run Codex example:

```bash
npm run codex -- "Summarize what this repository does."
```

Run Claude example:

```bash
npm run claude -- "Summarize what this repository does."
```

Run via unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Schema-Driven Orchestration

The orchestration engine is exposed as library modules (not yet wired into `src/index.ts` by default).

Core pieces:

- `parseAgentManifest(...)` validates the full orchestration schema.
- `PersonaRegistry` injects runtime context into templated system prompts.
- `PipelineExecutor` executes a strict DAG of actor nodes.
- `FileSystemStateContextManager` enforces stateless handoffs.
- `SchemaDrivenExecutionEngine` composes all of the above with env-driven limits.

### AgentManifest Overview

`AgentManifest` (schema version `"1"`) includes:

- `topologies`: any of `hierarchical`, `retry-unrolled`, `sequential`
- `personas`: identity, prompt template, tool clearance metadata
- `relationships`: parent-child persona edges and constraints
- `pipeline`: strict DAG with entry node, nodes, and edges
- `topologyConstraints`: max depth and retry ceilings
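To illustrate what a strict-DAG check involves, here is a minimal sketch using Kahn's algorithm. It assumes edges carry `from`/`to` fields and uses a hypothetical `validateDag` helper; neither is taken from the real schema, and `parseAgentManifest(...)` may validate differently.

```typescript
// Hypothetical shapes: the `from`/`to` field names are assumptions, not the real schema.
interface SketchNode { id: string }
interface SketchDagEdge { from: string; to: string }
interface SketchPipeline {
  entryNodeId: string;
  nodes: SketchNode[];
  edges: SketchDagEdge[];
}

// Returns a list of problems; an empty list means the pipeline is a valid DAG.
function validateDag(pipeline: SketchPipeline): string[] {
  const errors: string[] = [];
  const ids = new Set<string>(pipeline.nodes.map((n) => n.id));
  if (!ids.has(pipeline.entryNodeId)) {
    errors.push(`unknown entry node: ${pipeline.entryNodeId}`);
  }

  const inDegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  ids.forEach((id) => {
    inDegree.set(id, 0);
    outgoing.set(id, []);
  });
  for (const edge of pipeline.edges) {
    if (!ids.has(edge.from) || !ids.has(edge.to)) {
      errors.push(`edge references unknown node: ${edge.from} -> ${edge.to}`);
      continue;
    }
    outgoing.get(edge.from)!.push(edge.to);
    inDegree.set(edge.to, inDegree.get(edge.to)! + 1);
  }

  // Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
  const queue = Array.from(ids).filter((id) => inDegree.get(id) === 0);
  let visited = 0;
  while (queue.length > 0) {
    const id = queue.shift()!;
    visited += 1;
    for (const next of outgoing.get(id)!) {
      const remaining = inDegree.get(next)! - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) queue.push(next);
    }
  }
  // Any node never reached with in-degree 0 sits on a cycle.
  if (visited !== ids.size) errors.push("pipeline contains a cycle");
  return errors;
}
```

An empty result means every edge points at a known node and the graph can be topologically ordered.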

Edge routing supports:

- Event gates: `success`, `validation_fail`, `failure`, `always`, `onTaskComplete`, `onValidationFail`
- Conditions:
  - `state_flag`
  - `history_has_event`
  - `file_exists`
  - `always`
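As a rough sketch of how gates and conditions might combine, the snippet below selects the edges to follow after a node emits an event. The field names (`on`, `when`, `key`, `equals`, `event`, `path`) and the helper names are assumptions for illustration only; the real routing lives in `pipeline.ts` and may differ, including in how the `onTaskComplete` and `onValidationFail` gates are interpreted.

```typescript
// Hypothetical shapes for illustration; the real manifest field names may differ.
type SketchCondition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface SketchRouteEdge {
  to: string;
  on: string; // event gate, e.g. "success" or "always"
  when?: SketchCondition; // treated as "always" when omitted (an assumption)
}

interface SketchRoutingContext {
  flags: Record<string, boolean>;
  history: string[];
  fileExists: (path: string) => boolean; // injected so the check is easy to test
}

function conditionHolds(condition: SketchCondition, ctx: SketchRoutingContext): boolean {
  switch (condition.type) {
    case "always":
      return true;
    case "state_flag":
      // A missing flag is treated as false (an assumption).
      return (ctx.flags[condition.key] ?? false) === condition.equals;
    case "history_has_event":
      return ctx.history.indexOf(condition.event) !== -1;
    case "file_exists":
      return ctx.fileExists(condition.path);
  }
}

// Returns the target node IDs of every edge whose gate matches the emitted
// event and whose condition holds against the current session state.
function selectNextNodes(
  edges: SketchRouteEdge[],
  emitted: string,
  ctx: SketchRoutingContext,
): string[] {
  return edges
    .filter((edge) => edge.on === "always" || edge.on === emitted)
    .filter((edge) => edge.when === undefined || conditionHolds(edge.when, ctx))
    .map((edge) => edge.to);
}
```

Injecting `fileExists` rather than calling the filesystem directly keeps the routing decision a pure function of session state, which makes it straightforward to unit test.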

Example manifest:

```json
{
  "schemaVersion": "1",
  "topologies": ["hierarchical", "retry-unrolled", "sequential"],
  "personas": [
    {
      "id": "coder",
      "displayName": "Coder",
      "systemPromptTemplate": "Implement ticket {{ticket}} in repo {{repo}}",
      "toolClearance": {
        "allowlist": ["read_file", "write_file"],
        "banlist": ["rm"]
      }
    }
  ],
  "relationships": [],
  "pipeline": {
    "entryNodeId": "coder-1",
    "nodes": [
      {
        "id": "coder-1",
        "actorId": "coder_actor",
        "personaId": "coder",
        "constraints": { "maxRetries": 1 }
      }
    ],
    "edges": []
  },
  "topologyConstraints": {
    "maxDepth": 4,
    "maxRetries": 2
  }
}
```
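The `{{ticket}}` and `{{repo}}` placeholders in `systemPromptTemplate` are filled from the runtime context. A minimal sketch of that substitution follows; `renderPromptTemplate` is a hypothetical helper, not the `PersonaRegistry` API, and failing on a missing key is a design assumption.

```typescript
// Replaces {{name}} placeholders with values from the runtime context.
// Throws on a missing key so that a typo in a template fails loudly.
function renderPromptTemplate(template: string, context: Record<string, string>): string {
  return template.replace(
    /\{\{\s*([A-Za-z0-9_]+)\s*\}\}/g,
    (_match: string, key: string) => {
      const value = context[key];
      if (value === undefined) {
        throw new Error(`missing runtime context value for "${key}"`);
      }
      return value;
    },
  );
}

const rendered = renderPromptTemplate("Implement ticket {{ticket}} in repo {{repo}}", {
  ticket: "AIOPS-123",
  repo: "ai_ops",
});
console.log(rendered); // "Implement ticket AIOPS-123 in repo ai_ops"
```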

### Minimal Engine Usage

```ts
import { SchemaDrivenExecutionEngine } from "./src/agents/orchestration.js";

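// `manifest` is assumed to be a validated AgentManifest,
// for example the result of parseAgentManifest(...).
const engine = new SchemaDrivenExecutionEngine({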
  manifest,
  actorExecutors: {
    coder_actor: async ({ prompt, context, toolClearance }) => {
      // execute actor logic here
      return {
        status: "success",
        payload: {
          summary: "done"
        },
        stateFlags: {
          implemented: true
        }
      };
    }
  },
  settings: {
    workspaceRoot: process.cwd(),
    runtimeContext: {
      repo: "ai_ops",
      ticket: "AIOPS-123"
    }
  }
});

const result = await engine.runSession({
  sessionId: "session-1",
  initialPayload: {
    task: "Implement feature"
  }
});

console.log(result.records);
```

## Stateless Handoffs and Context

The engine does not depend on conversational memory between nodes.

- Node inputs are written as handoff payloads to storage.
- Each node execution reads a fresh context snapshot from disk.
- Session state persists:
  - flags
  - metadata
  - history events

Default state root is controlled by `AGENT_STATE_ROOT`.
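The handoff pattern can be sketched as a write followed by a fresh read from disk. The directory layout and the `writeHandoff`/`readHandoff` helpers below are assumptions for illustration; `FileSystemStateContextManager` may organize its files differently.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical on-disk layout: <stateRoot>/<sessionId>/handoffs/<nodeId>.json
function handoffPath(stateRoot: string, sessionId: string, nodeId: string): string {
  return join(stateRoot, sessionId, "handoffs", `${nodeId}.json`);
}

// Persists the input payload for a node before that node runs.
function writeHandoff(stateRoot: string, sessionId: string, nodeId: string, payload: unknown): void {
  const file = handoffPath(stateRoot, sessionId, nodeId);
  mkdirSync(dirname(file), { recursive: true });
  writeFileSync(file, JSON.stringify(payload, null, 2), "utf8");
}

// Reads the payload back from disk; nothing is carried over in memory.
function readHandoff(stateRoot: string, sessionId: string, nodeId: string): unknown {
  return JSON.parse(readFileSync(handoffPath(stateRoot, sessionId, nodeId), "utf8"));
}

// Usage: a throwaway directory stands in for AGENT_STATE_ROOT.
const demoStateRoot = mkdtempSync(join(tmpdir(), "agent-state-"));
writeHandoff(demoStateRoot, "session-1", "coder-1", { task: "Implement feature" });
const restoredHandoff = readHandoff(demoStateRoot, "session-1", "coder-1");
console.log(restoredHandoff);
```

Because each node rebuilds its input from the file, a node can be retried or resumed in a new process without access to the previous node's in-memory state.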

## Recursive Orchestration Contract

`AgentManager.runRecursiveAgent(...)` uses a strict two-phase fanout/fan-in model:

- Phase 1 (planner): agent execution returns either a terminal result or a fanout plan (`intents[]` + `aggregate(...)`).
- The parent's concurrency token is released before children are scheduled, which avoids deadlock even when `AGENT_MAX_CONCURRENT=1`.
- Children run in isolated sessions with deterministic IDs (`<parent>_child_<index>`), each with its own `AbortSignal`.
- Phase 2 (aggregator): once all children complete, the aggregate phase runs as a fresh invocation.

Optional child middleware hooks (`allocateForChild`, `releaseForChild`) let callers integrate provisioning/suballocation without coupling `AgentManager` to filesystem or git operations.
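The two-phase contract can be sketched with a counting semaphore standing in for the manager's concurrency tokens. Everything below (`Semaphore`, `withToken`, `runRecursive`, the `PlanResult` shape) is a hypothetical illustration rather than the `AgentManager` API, and it omits abort propagation and depth limits for brevity.

```typescript
// Minimal counting semaphore standing in for the manager's concurrency tokens.
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];
  constructor(permits: number) {
    this.available = permits;
  }
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the token directly to the next waiter
    else this.available += 1;
  }
}

type PlanResult<T> =
  | { kind: "terminal"; value: T }
  | { kind: "fanout"; intents: string[]; aggregate: (childResults: T[]) => T };

type Planner<T> = (sessionId: string, intent: string) => Promise<PlanResult<T>>;

// Runs `work` while holding one token, releasing it even if `work` throws.
async function withToken<R>(tokens: Semaphore, work: () => Promise<R> | R): Promise<R> {
  await tokens.acquire();
  try {
    return await work();
  } finally {
    tokens.release();
  }
}

async function runRecursive<T>(
  tokens: Semaphore,
  planner: Planner<T>,
  sessionId: string,
  intent: string,
): Promise<T> {
  // Phase 1: the token is held only while the planner runs.
  const plan = await withToken(tokens, () => planner(sessionId, intent));
  if (plan.kind === "terminal") return plan.value;

  // The parent holds no token here, so children can run even with one permit.
  const childResults = await Promise.all(
    plan.intents.map((childIntent, index) =>
      runRecursive(tokens, planner, `${sessionId}_child_${index}`, childIntent),
    ),
  );

  // Phase 2: aggregation is a fresh invocation with its own token.
  return withToken(tokens, () => plan.aggregate(childResults));
}

// Demo planner: "split" fans out into two children, anything else is terminal.
const demoPlanner: Planner<string> = async (sessionId, intent) => {
  if (intent === "split") {
    return { kind: "fanout", intents: ["a", "b"], aggregate: (parts) => parts.join("+") };
  }
  return { kind: "terminal", value: `${sessionId}:${intent}` };
};
```

With a single permit, `runRecursive(new Semaphore(1), demoPlanner, "root", "split")` still completes, because the parent never holds a token while waiting on its children.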

## Resource Provisioning

The provisioning layer separates:

- Hard constraints: actual resource allocation, enforced before the run starts.
- Soft constraints: injected env vars, prompt sections, metadata, and a discovery snapshot.

Built-in providers:

- `git-worktree`
- `port-range`

Runtime injection includes:

- Working directory override
- Injected env vars such as `AGENT_WORKTREE_PATH`, `AGENT_PORT_RANGE_START`, `AGENT_PORT_RANGE_END`, `AGENT_PORT_PRIMARY`
- Discovery file path via `AGENT_DISCOVERY_FILE`

### Hierarchical Suballocation

Parent sessions can suballocate resources for child sessions using:

- `ResourceProvisioningOrchestrator.provisionChildSession(...)`
- `buildChildResourceRequests(...)`

Behavior:

- Child worktrees are placed under a deterministic parent-scoped root.
- Child port blocks are deterministically carved from the parent's assigned range.
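One way to carve child port blocks deterministically is to split the parent's range into equal blocks indexed by child position. The sketch below assumes inclusive bounds and equal-sized blocks; `carveChildPortRange` is a hypothetical helper, and `buildChildResourceRequests(...)` may use a different scheme.

```typescript
interface PortRange {
  start: number; // inclusive
  end: number; // inclusive
}

// Splits the parent's range into `childCount` equal blocks and returns the
// block for `childIndex`. The same inputs always yield the same block.
function carveChildPortRange(parent: PortRange, childIndex: number, childCount: number): PortRange {
  if (!Number.isInteger(childCount) || childCount < 1) {
    throw new RangeError("childCount must be a positive integer");
  }
  if (!Number.isInteger(childIndex) || childIndex < 0 || childIndex >= childCount) {
    throw new RangeError("childIndex must be in [0, childCount)");
  }
  const total = parent.end - parent.start + 1;
  const blockSize = Math.floor(total / childCount);
  if (blockSize < 1) {
    throw new RangeError("parent range is too small for the requested child count");
  }
  const start = parent.start + childIndex * blockSize;
  return { start, end: start + blockSize - 1 };
}

// A parent holding 4000-4099 split across 4 children gives 25 ports each.
const childRange = carveChildPortRange({ start: 4000, end: 4099 }, 2, 4);
console.log(childRange); // { start: 4050, end: 4074 }
```

Because the block depends only on the parent range and the child index, sibling blocks never overlap and no coordination between children is needed.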

## MCP Configuration

Use `mcp.config.json` to configure shared and provider-specific MCP servers.

- `MCP_CONFIG_PATH` controls config location (default `./mcp.config.json`).
- Shared server definitions are in `servers`.
- Provider overrides:
  - `codex.mcp_servers`
  - `claude.mcpServers`
- Handlers:
  - built-in `context7`
  - built-in `claude-task-master`
  - built-in `generic`
  - custom handlers via `registerMcpHandler(...)`

See `mcp.config.example.json` for a complete template.
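As a rough illustration of how shared definitions and provider overrides might combine, the sketch below merges `servers` with the provider-specific map. The server fields (`command`, `args`), the `resolveServers` helper, and the rule that provider entries win on a name clash are all assumptions; the actual resolution logic lives in `src/mcp` and may differ.

```typescript
// Hypothetical server shape; the real config types are defined in src/mcp.
interface SketchServer {
  command: string;
  args?: string[];
}

interface SketchMcpConfig {
  servers?: Record<string, SketchServer>;
  codex?: { mcp_servers?: Record<string, SketchServer> };
  claude?: { mcpServers?: Record<string, SketchServer> };
}

// Merges shared servers with provider-specific ones.
// Assumption: a provider entry replaces a shared entry with the same name.
function resolveServers(
  config: SketchMcpConfig,
  provider: "codex" | "claude",
): Record<string, SketchServer> {
  const overrides =
    provider === "codex" ? config.codex?.mcp_servers : config.claude?.mcpServers;
  return { ...(config.servers ?? {}), ...(overrides ?? {}) };
}
```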

## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `ANTHROPIC_API_KEY`
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration Limits

- `AGENT_STATE_ROOT`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`

### Provisioning

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`

Defaults are documented in `.env.example`.
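Numeric limits such as `AGENT_MAX_CONCURRENT` arrive as strings, so they need parsing and validation before use. The sketch below shows one way to do that; `readIntEnv` is a hypothetical helper, and the fallback values shown are placeholders, not the defaults from `.env.example`.

```typescript
// Reads an integer setting from an env-like record, falling back when unset.
// Rejects non-integers and values below `min` instead of silently clamping.
function readIntEnv(
  env: Record<string, string | undefined>,
  key: string,
  fallback: number,
  min: number = 1,
): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${key} must be an integer >= ${min}, got "${raw}"`);
  }
  return parsed;
}

// Usage with a literal record; in the runtime this would be process.env.
// The fallbacks (4 and 3) are placeholder values for illustration.
const sketchEnv = { AGENT_MAX_CONCURRENT: "2" };
const sketchLimits = {
  maxConcurrent: readIntEnv(sketchEnv, "AGENT_MAX_CONCURRENT", 4),
  maxRecursiveDepth: readIntEnv(sketchEnv, "AGENT_MAX_RECURSIVE_DEPTH", 3),
};
console.log(sketchLimits); // { maxConcurrent: 2, maxRecursiveDepth: 3 }
```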

## Quality Gate

Run the full pre-PR gate:

```bash
npm run verify
```

Equivalent individual commands:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Build and Start

```bash
npm run build
npm run start -- codex "Hello from built JS"
```

## Known Limitations

- Tool clearance allowlist/banlist is currently metadata; enforcement is not yet wired into an execution sandbox.
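Until enforcement is wired in, callers that dispatch tools themselves could apply a check along these lines. This is only a sketch: `isToolAllowed` is a hypothetical helper, and the precedence rules (banlist wins, tools absent from the allowlist are denied) are assumptions rather than documented engine behavior.

```typescript
interface SketchToolClearance {
  allowlist: string[];
  banlist: string[];
}

// Assumed precedence: a banned tool is always denied; otherwise the tool
// must appear in the allowlist to be permitted.
function isToolAllowed(clearance: SketchToolClearance, tool: string): boolean {
  if (clearance.banlist.indexOf(tool) !== -1) return false;
  return clearance.allowlist.indexOf(tool) !== -1;
}

const coderClearance: SketchToolClearance = {
  allowlist: ["read_file", "write_file"],
  banlist: ["rm"],
};
console.log(isToolAllowed(coderClearance, "read_file")); // true
console.log(isToolAllowed(coderClearance, "rm")); // false
```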

## References

- `docs/orchestration-engine.md`
- OpenAI Codex SDK docs: https://developers.openai.com/codex/sdk/
- Codex MCP config docs: https://developers.openai.com/codex/config#model-context-protocol-mcp_servers
- Claude Agent SDK docs: https://platform.claude.com/docs/en/agent-sdk/overview