# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK integration (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK integration (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- Stateless node handoffs via persisted state/context payloads
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler-based policy hooks

## Current Status

- Provider entrypoints (`codex`, `claude`) run with session limits and resource provisioning.
- Schema-driven orchestration is implemented as reusable modules under `src/agents`.
- Recursive `AgentManager.runRecursiveAgent(...)` supports fanout/fan-in orchestration with abort propagation.
- Tool clearance allowlist/banlist is modeled, but hard security enforcement is still a TODO at tool execution boundaries.

## Repository Layout

- `src/agents`:
  - `manager.ts`: queue-based concurrency limits + recursive fanout/fan-in orchestration.
  - `runtime.ts`: env-driven runtime singletons and defaults.
  - `manifest.ts`: `AgentManifest` schema parsing + validation (strict DAG).
  - `persona-registry.ts`: prompt templating + persona behavior events.
  - `pipeline.ts`: actor-oriented DAG runner with retries and state-dependent routing.
  - `state-context.ts`: persisted state + stateless handoff reconstruction.
  - `provisioning.ts`: extensible resource orchestration + child suballocation support.
  - `orchestration.ts`: `SchemaDrivenExecutionEngine` facade.
- `src/mcp`: MCP config types, conversions, and handler resolution.
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`).
- `tests`: unit coverage for manager, manifest, pipeline/orchestration, state context, MCP, and provisioning behavior.
- `docs/orchestration-engine.md`: design notes for the orchestration architecture.

## Prerequisites

- Node.js 18+
- npm

## Setup

```bash
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

Fill in any values you need in `.env`.

## Run

Run Codex example:

```bash
npm run codex -- "Summarize what this repository does."
```

Run Claude example:

```bash
npm run claude -- "Summarize what this repository does."
```

Run via unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Schema-Driven Orchestration

The orchestration engine is exposed as library modules (not yet wired into `src/index.ts` by default).

Core pieces:

- `parseAgentManifest(...)` validates the full orchestration schema.
- `PersonaRegistry` injects runtime context into templated system prompts.
- `PipelineExecutor` executes a strict DAG of actor nodes.
- `FileSystemStateContextManager` enforces stateless handoffs.
- `SchemaDrivenExecutionEngine` composes all of the above with env-driven limits.
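
As a rough illustration of what injecting runtime context into a templated prompt involves, the sketch below substitutes `{{name}}` placeholders from a context map. `renderTemplate` is a hypothetical helper written for this README, not the `PersonaRegistry` API, and failing on unknown keys is an assumption about the desired behavior.

```ts
// Hypothetical helper for illustration; not part of PersonaRegistry.
type RuntimeContext = Record<string, string | number | boolean>;

function renderTemplate(template: string, context: RuntimeContext): string {
  // Replace each {{key}} with the matching context value.
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (_match: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(context, key)) {
      // Assumption: fail loudly rather than leave a placeholder in the prompt.
      throw new Error(`Missing runtime context value for "${key}"`);
    }
    return String(context[key]);
  });
}

const rendered = renderTemplate("Implement ticket {{ticket}} in repo {{repo}}", {
  repo: "ai_ops",
  ticket: "AIOPS-123",
});
console.log(rendered); // Implement ticket AIOPS-123 in repo ai_ops
```

Throwing on a missing key keeps a half-rendered prompt from ever reaching a provider; a more lenient variant could leave unknown placeholders untouched.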

### AgentManifest Overview

`AgentManifest` (schema version `"1"`) includes:

- `topologies`: any of `hierarchical`, `retry-unrolled`, `sequential`
- `personas`: identity, prompt template, tool clearance metadata
- `relationships`: parent-child persona edges and constraints
- `pipeline`: strict DAG with entry node, nodes, and edges
- `topologyConstraints`: max depth and retry ceilings
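
To make "strict DAG" concrete, here is a minimal sketch of an acyclicity check using Kahn's algorithm. The `from`/`to` edge field names, the `reviewer-1` node, and the `isAcyclic` helper are assumptions for illustration; the actual validation lives in `manifest.ts` and may differ.

```ts
// Illustrative types; the field names `from` and `to` are assumptions.
interface SketchNode { id: string }
interface SketchEdge { from: string; to: string }

function isAcyclic(nodes: SketchNode[], edges: SketchEdge[]): boolean {
  const inDegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  for (const node of nodes) {
    inDegree.set(node.id, 0);
    outgoing.set(node.id, []);
  }
  for (const edge of edges) {
    // Edges that reference unknown nodes make the graph invalid.
    if (!inDegree.has(edge.from) || !inDegree.has(edge.to)) return false;
    outgoing.get(edge.from)!.push(edge.to);
    inDegree.set(edge.to, inDegree.get(edge.to)! + 1);
  }
  // Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
  const ready = nodes.filter((n) => inDegree.get(n.id) === 0).map((n) => n.id);
  let visited = 0;
  while (ready.length > 0) {
    const id = ready.pop()!;
    visited += 1;
    for (const next of outgoing.get(id)!) {
      const remaining = inDegree.get(next)! - 1;
      inDegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  // Nodes on a cycle never reach in-degree zero, so they are never visited.
  return visited === nodes.length;
}

const dagNodes: SketchNode[] = [{ id: "coder-1" }, { id: "reviewer-1" }];
const forwardOnly = isAcyclic(dagNodes, [{ from: "coder-1", to: "reviewer-1" }]);
const withBackEdge = isAcyclic(dagNodes, [
  { from: "coder-1", to: "reviewer-1" },
  { from: "reviewer-1", to: "coder-1" },
]);
console.log(forwardOnly, withBackEdge); // true false
```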

Edge routing supports:

- Event gates: `success`, `validation_fail`, `failure`, `always`, `onTaskComplete`, `onValidationFail`
- Conditions:
  - `state_flag`
  - `history_has_event`
  - `file_exists`
  - `always`
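
The condition kinds above can be pictured as a small discriminated union evaluated against session state. The field names (`key`, `equals`, `event`, `path`) and the `SessionSnapshot` shape are assumptions made for this sketch, not the manifest schema.

```ts
// Assumed shapes, for illustration only.
type EdgeCondition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface SessionSnapshot {
  flags: Record<string, boolean>;
  history: string[];
}

function conditionHolds(
  condition: EdgeCondition,
  snapshot: SessionSnapshot,
  fileExists: (path: string) => boolean,
): boolean {
  switch (condition.type) {
    case "always":
      return true;
    case "state_flag":
      // An unset flag is treated as false in this sketch.
      return (snapshot.flags[condition.key] ?? false) === condition.equals;
    case "history_has_event":
      return snapshot.history.includes(condition.event);
    case "file_exists":
      // The filesystem check is injected so the function stays pure and testable.
      return fileExists(condition.path);
  }
}

const routingSnapshot: SessionSnapshot = { flags: { implemented: true }, history: ["success"] };
const noFiles = (_path: string): boolean => false;
const flagMatches = conditionHolds(
  { type: "state_flag", key: "implemented", equals: true },
  routingSnapshot,
  noFiles,
);
const sawFailure = conditionHolds({ type: "history_has_event", event: "failure" }, routingSnapshot, noFiles);
console.log(flagMatches, sawFailure); // true false
```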

Example manifest:

```json
{
  "schemaVersion": "1",
  "topologies": ["hierarchical", "retry-unrolled", "sequential"],
  "personas": [
    {
      "id": "coder",
      "displayName": "Coder",
      "systemPromptTemplate": "Implement ticket {{ticket}} in repo {{repo}}",
      "toolClearance": {
        "allowlist": ["read_file", "write_file"],
        "banlist": ["rm"]
      }
    }
  ],
  "relationships": [],
  "pipeline": {
    "entryNodeId": "coder-1",
    "nodes": [
      {
        "id": "coder-1",
        "actorId": "coder_actor",
        "personaId": "coder",
        "constraints": { "maxRetries": 1 }
      }
    ],
    "edges": []
  },
  "topologyConstraints": {
    "maxDepth": 4,
    "maxRetries": 2
  }
}
```

### Minimal Engine Usage

```ts
import { SchemaDrivenExecutionEngine } from "./src/agents/orchestration.js";

// `manifest` is an AgentManifest value, for example the manifest shown above
// after it has been validated with parseAgentManifest(...).
const engine = new SchemaDrivenExecutionEngine({
  manifest,
  actorExecutors: {
    // Keyed by the `actorId` used in the pipeline nodes.
    coder_actor: async ({ prompt, context, toolClearance }) => {
      // execute actor logic here
      return {
        status: "success",
        payload: {
          summary: "done"
        },
        stateFlags: {
          implemented: true
        }
      };
    }
  },
  settings: {
    workspaceRoot: process.cwd(),
    // Values available to `{{...}}` placeholders in persona prompt templates.
    runtimeContext: {
      repo: "ai_ops",
      ticket: "AIOPS-123"
    }
  }
});

const result = await engine.runSession({
  sessionId: "session-1",
  initialPayload: {
    task: "Implement feature"
  }
});

console.log(result.records);
```

## Stateless Handoffs and Context

The engine does not depend on conversational memory between nodes.

- Node inputs are written as handoff payloads to storage.
- Each node execution reads a fresh context snapshot from disk.
- Session state persists:
  - flags
  - metadata
  - history events

Default state root is controlled by `AGENT_STATE_ROOT`.
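
The handoff pattern can be sketched with plain file I/O: the upstream node's output is serialized to disk, and the downstream node rebuilds its input from that file alone. `writeHandoff`, `readHandoff`, the `reviewer-1` node, and the `<stateRoot>/<sessionId>/handoffs/<nodeId>.json` layout are assumptions for illustration, not the `FileSystemStateContextManager` API.

```ts
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical payload shape and file layout, for illustration only.
interface HandoffPayload {
  fromNodeId: string;
  payload: Record<string, unknown>;
}

function handoffPath(stateRoot: string, sessionId: string, nodeId: string): string {
  return join(stateRoot, sessionId, "handoffs", `${nodeId}.json`);
}

function writeHandoff(stateRoot: string, sessionId: string, nodeId: string, handoff: HandoffPayload): void {
  mkdirSync(join(stateRoot, sessionId, "handoffs"), { recursive: true });
  writeFileSync(handoffPath(stateRoot, sessionId, nodeId), JSON.stringify(handoff, null, 2), "utf8");
}

function readHandoff(stateRoot: string, sessionId: string, nodeId: string): HandoffPayload {
  // Always re-read from disk so no in-memory state leaks between nodes.
  return JSON.parse(readFileSync(handoffPath(stateRoot, sessionId, nodeId), "utf8")) as HandoffPayload;
}

// A throwaway temp directory stands in for AGENT_STATE_ROOT in this demo.
const demoStateRoot = mkdtempSync(join(tmpdir(), "agent-state-"));
writeHandoff(demoStateRoot, "session-1", "reviewer-1", {
  fromNodeId: "coder-1",
  payload: { summary: "done" },
});
const restored = readHandoff(demoStateRoot, "session-1", "reviewer-1");
console.log(restored.fromNodeId, restored.payload.summary); // coder-1 done
```

Because the reader only ever sees what was serialized, anything a node needs downstream has to be placed in the payload or the persisted session state explicitly.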

## Recursive Orchestration Contract

`AgentManager.runRecursiveAgent(...)` uses a strict two-phase fanout/fan-in model:

- Phase 1 (planner): agent execution returns either a terminal result or a fanout plan (`intents[]` + `aggregate(...)`).
- Parent tokens are released before children are scheduled, avoiding deadlocks even when `AGENT_MAX_CONCURRENT=1`.
- Children run in isolated deterministic session IDs (`<parent>_child_<index>`), each with its own `AbortSignal`.
- Phase 2 (aggregator): once all children complete, the aggregate phase runs as a fresh invocation.

Optional child middleware hooks (`allocateForChild`, `releaseForChild`) let callers integrate provisioning/suballocation without coupling `AgentManager` to filesystem or git operations.
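
The sketch below illustrates why releasing the parent's token before scheduling children avoids deadlock with a single concurrency slot. `Semaphore`, `PlanResult`, `runRecursive`, and `withToken` are simplified stand-ins written for this README, not the `AgentManager` implementation; abort signals and the middleware hooks are left out.

```ts
// Minimal counting semaphore standing in for the manager's concurrency tokens.
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];
  constructor(tokens: number) {
    this.available = tokens;
  }
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the token directly to the next waiter
    else this.available += 1;
  }
}

// Assumed shapes: a planner returns either a terminal result or a fanout plan.
type PlanResult =
  | { kind: "done"; value: string }
  | { kind: "fanout"; intents: string[]; aggregate: (children: string[]) => string };
type Planner = (sessionId: string, task: string) => Promise<PlanResult>;

async function withToken<T>(sem: Semaphore, work: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await work();
  } finally {
    sem.release();
  }
}

async function runRecursive(sem: Semaphore, plan: Planner, sessionId: string, task: string): Promise<string> {
  // Phase 1: the token is held only while the planner runs.
  const result = await withToken(sem, () => plan(sessionId, task));
  if (result.kind === "done") return result.value;
  const { intents, aggregate } = result;
  // The parent holds no token here, so children can be scheduled freely.
  const children = await Promise.all(
    intents.map((intent, index) => runRecursive(sem, plan, `${sessionId}_child_${index}`, intent)),
  );
  // Phase 2: aggregation is a separate token-holding step.
  return withToken(sem, async () => aggregate(children));
}

const demoPlanner: Planner = async (_sessionId, task) =>
  task === "root"
    ? { kind: "fanout", intents: ["a", "b"], aggregate: (parts) => parts.join("+") }
    : { kind: "done", value: task.toUpperCase() };

// With a single token, the run still completes.
const fanoutDemo = runRecursive(new Semaphore(1), demoPlanner, "session-1", "root");
fanoutDemo.then((value) => console.log(value)); // A+B
```

If the parent kept its token while awaiting `Promise.all`, the children could never acquire one when only a single token exists, and the run would hang.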

## Resource Provisioning

The provisioning layer separates:

- Hard constraints: actual resource allocation enforced before the run.
- Soft constraints: injected env vars, prompt sections, metadata, and a discovery snapshot.

Built-in providers:

- `git-worktree`
- `port-range`

Runtime injection includes:

- Working directory override
- Injected env vars such as `AGENT_WORKTREE_PATH`, `AGENT_PORT_RANGE_START`, `AGENT_PORT_RANGE_END`, `AGENT_PORT_PRIMARY`
- Discovery file path via `AGENT_DISCOVERY_FILE`
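
From inside an agent process, the injected port variables could be consumed along these lines. `readInjectedPorts` is a hypothetical helper, the example values are placeholders rather than project defaults, and treating the range as inclusive with the primary port inside it is an assumption.

```ts
interface InjectedPorts {
  start: number;
  end: number;
  primary: number;
}

// Takes `env` as a parameter so it can be tested without touching process.env.
function readInjectedPorts(env: Record<string, string | undefined>): InjectedPorts {
  const read = (name: string): number => {
    const raw = env[name];
    const value = Number(raw);
    if (raw === undefined || raw.trim() === "" || !Number.isInteger(value)) {
      throw new Error(`${name} is missing or not an integer: ${String(raw)}`);
    }
    return value;
  };
  const ports: InjectedPorts = {
    start: read("AGENT_PORT_RANGE_START"),
    end: read("AGENT_PORT_RANGE_END"),
    primary: read("AGENT_PORT_PRIMARY"),
  };
  // Assumption: the primary port lies inside the inclusive [start, end] range.
  if (ports.primary < ports.start || ports.primary > ports.end) {
    throw new Error("AGENT_PORT_PRIMARY is outside the assigned range");
  }
  return ports;
}

// Placeholder values; in a real agent process, pass process.env instead.
const injected = readInjectedPorts({
  AGENT_PORT_RANGE_START: "41000",
  AGENT_PORT_RANGE_END: "41009",
  AGENT_PORT_PRIMARY: "41000",
});
console.log(injected.primary); // 41000
```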

### Hierarchical Suballocation

Parent sessions can suballocate resources for child sessions using:

- `ResourceProvisioningOrchestrator.provisionChildSession(...)`
- `buildChildResourceRequests(...)`

Behavior:

- Child worktrees are placed under a deterministic parent-scoped root.
- Child port blocks are deterministically carved from the parent's assigned range.
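
One way to carve child port blocks deterministically is to split the parent range into equal, index-addressed slices. The sketch below shows that idea only; `carveChildPorts`, the inclusive bounds, and the equal-split policy are assumptions, and the real provider may use a different scheme.

```ts
interface PortRange {
  start: number; // inclusive (assumption)
  end: number; // inclusive (assumption)
}

function carveChildPorts(parent: PortRange, childIndex: number, childCount: number): PortRange {
  if (childCount < 1) throw new Error("childCount must be at least 1");
  if (childIndex < 0 || childIndex >= childCount) throw new Error("childIndex out of bounds");
  const total = parent.end - parent.start + 1;
  const blockSize = Math.floor(total / childCount);
  if (blockSize < 1) throw new Error("Parent range is too small to split");
  // The same (parent, index, count) always yields the same block.
  const start = parent.start + childIndex * blockSize;
  return { start, end: start + blockSize - 1 };
}

const parentRange: PortRange = { start: 41000, end: 41009 };
const firstChild = carveChildPorts(parentRange, 0, 2);
const secondChild = carveChildPorts(parentRange, 1, 2);
console.log(firstChild, secondChild); // { start: 41000, end: 41004 } { start: 41005, end: 41009 }
```

Because each block depends only on the parent range and the child index, sibling sessions cannot overlap and a restarted child lands on the same ports.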

## MCP Configuration

Use `mcp.config.json` to configure shared and provider-specific MCP servers.

- `MCP_CONFIG_PATH` controls config location (default `./mcp.config.json`).
- Shared server definitions are in `servers`.
- Provider overrides:
  - `codex.mcp_servers`
  - `claude.mcpServers`
- Handlers:
  - built-in `context7`
  - built-in `claude-task-master`
  - built-in `generic`
  - custom handlers via `registerMcpHandler(...)`

See `mcp.config.example.json` for a complete template.
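
The relationship between shared and provider-specific servers can be sketched as a simple merge. Everything below is illustrative: `resolveServers` is a hypothetical function, the rule that provider entries win on a name collision is an assumption, and `example-docs-server` is a placeholder package name. The real resolution logic lives in `src/mcp`.

```ts
type ServerDef = Record<string, unknown>;

// Only the three keys named in this README are modeled here.
interface McpConfigSketch {
  servers?: Record<string, ServerDef>;
  codex?: { mcp_servers?: Record<string, ServerDef> };
  claude?: { mcpServers?: Record<string, ServerDef> };
}

function resolveServers(config: McpConfigSketch, provider: "codex" | "claude"): Record<string, ServerDef> {
  const overrides = provider === "codex" ? config.codex?.mcp_servers : config.claude?.mcpServers;
  // Assumption: provider-specific entries replace shared ones with the same name.
  return { ...(config.servers ?? {}), ...(overrides ?? {}) };
}

const sampleConfig: McpConfigSketch = {
  servers: { docs: { command: "npx", args: ["example-docs-server"] } },
  claude: { mcpServers: { docs: { command: "npx", args: ["example-docs-server", "--verbose"] } } },
};
const codexServers = resolveServers(sampleConfig, "codex");
const claudeServers = resolveServers(sampleConfig, "claude");
console.log(codexServers.docs.args, claudeServers.docs.args);
```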

## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `ANTHROPIC_API_KEY`
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration Limits

- `AGENT_STATE_ROOT`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`

### Provisioning

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`

Defaults are documented in `.env.example`.

## Quality Gate

Run the full pre-PR gate:

```bash
npm run verify
```

Equivalent individual commands:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Build and Start

```bash
npm run build
npm run start -- codex "Hello from built JS"
```

## Known Limitations

- Tool clearance allowlist/banlist is currently metadata; enforcement is not yet wired into an execution sandbox.
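
Until enforcement exists, an actor executor could apply its own check against the `toolClearance` metadata it receives. The sketch below is one possible policy, not the project's: `isToolPermitted` is hypothetical, and both the banlist taking precedence and unlisted tools being denied are assumptions.

```ts
interface ToolClearance {
  allowlist: string[];
  banlist: string[];
}

function isToolPermitted(clearance: ToolClearance, toolName: string): boolean {
  // Assumption: the banlist always wins, even if a tool is also allowlisted.
  if (clearance.banlist.includes(toolName)) return false;
  // Assumption: anything not explicitly allowlisted is denied.
  return clearance.allowlist.includes(toolName);
}

// Values taken from the example manifest's `coder` persona.
const coderClearance: ToolClearance = { allowlist: ["read_file", "write_file"], banlist: ["rm"] };
console.log(isToolPermitted(coderClearance, "write_file")); // true
console.log(isToolPermitted(coderClearance, "rm")); // false
console.log(isToolPermitted(coderClearance, "curl")); // false
```

A check like this is advisory only; it does not replace sandboxing at the tool execution boundary.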

## References

- `docs/orchestration-engine.md`
- OpenAI Codex SDK docs: https://developers.openai.com/codex/sdk/
- Codex MCP config docs: https://developers.openai.com/codex/config#model-context-protocol-mcp_servers
- Claude Agent SDK docs: https://platform.claude.com/docs/en/agent-sdk/overview