# AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

- OpenAI Codex SDK (`@openai/codex-sdk`)
- Anthropic Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`)
- Schema-validated orchestration (`AgentManifest`)
- DAG execution with topology-aware fan-out (`parallel`, `hierarchical`, `retry-unrolled`)
- Project-scoped persistent context store
- Typed domain events for edge-triggered routing
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler policy hooks

## Architecture Summary

- `SchemaDrivenExecutionEngine.runSession(...)` is the single execution entrypoint.
- `PipelineExecutor` owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
- `AgentManager` is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
- Session state is persisted under `AGENT_STATE_ROOT`.
- Project state is persisted under `AGENT_PROJECT_CONTEXT_PATH` with schema-versioned JSON (`schemaVersion`) and domains:
  - `globalFlags`
  - `artifactPointers`
  - `taskQueue`
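As a rough illustration of the project state described above, the sketch below models a schema-versioned JSON document with the three domains. Only the names `schemaVersion`, `globalFlags`, `artifactPointers`, and `taskQueue` come from this README; the field types, the supported version value, and the `parseProjectContext` helper are assumptions made for illustration. The actual shapes live in `src/agents/project-context.ts`.

```typescript
// Hypothetical shape of the project-scoped context file.
// Field types are assumed; only the field names appear in this README.
interface ProjectContextSketch {
  schemaVersion: number;
  globalFlags: Record<string, boolean>;
  artifactPointers: Record<string, string>;
  taskQueue: Array<{ id: string; status: string }>;
}

// Assumed version number, used only to show the version gate.
const SUPPORTED_SCHEMA_VERSION: number = 1;

// Parse raw JSON, reject unexpected schema versions, and default missing domains.
function parseProjectContext(raw: string): ProjectContextSketch {
  const data = JSON.parse(raw) as Partial<ProjectContextSketch>;
  if (data.schemaVersion !== SUPPORTED_SCHEMA_VERSION) {
    throw new Error(`Unsupported schemaVersion: ${String(data.schemaVersion)}`);
  }
  return {
    schemaVersion: SUPPORTED_SCHEMA_VERSION,
    globalFlags: data.globalFlags ?? {},
    artifactPointers: data.artifactPointers ?? {},
    taskQueue: data.taskQueue ?? [],
  };
}
```

Gating on `schemaVersion` before reading any domain is one way to keep older or newer files from being silently misread; a real store might migrate instead of throwing.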

## Repository Layout

- `src/agents`
  - `orchestration.ts`: engine facade and runtime wiring
  - `pipeline.ts`: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
  - `failure-policy.ts`: hard/soft failure classification policy
  - `lifecycle-observer.ts`: persistence/event lifecycle hooks for node attempts
  - `manifest.ts`: schema parsing/validation for personas/topologies/edges
  - `manager.ts`: recursive fan-out utility used by pipeline
  - `state-context.ts`: persisted node handoffs + session state
  - `project-context.ts`: project-scoped store
  - `domain-events.ts`: typed domain event schema + bus
  - `runtime.ts`: env-driven defaults/singletons
  - `provisioning.ts`: resource provisioning and child suballocation helpers
- `src/mcp`: MCP config types/conversion/handlers
- `src/examples`: provider entrypoints (`codex.ts`, `claude.ts`)
- `src/config.ts`: centralized env parsing/validation/defaulting
- `tests`: manager, manifest, pipeline/orchestration, state, provisioning, MCP

## Setup

```bash
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
```

## Run

```bash
npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."
```

Or via the unified entrypoint:

```bash
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
```

## Manifest Semantics

`AgentManifest` (schema `"1"`) validates:

- supported topologies (`sequential`, `parallel`, `hierarchical`, `retry-unrolled`)
- persona definitions and tool-clearance metadata
- relationship DAG and unknown persona references
- strict pipeline DAG
- topology constraints (`maxDepth`, `maxRetries`)

Pipeline edges can route via:

- legacy status triggers (`on`: `success`, `validation_fail`, `failure`, `always`, ...)
- domain event triggers (`event`: typed domain events)
- conditions (`state_flag`, `history_has_event`, `file_exists`, `always`)
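The routing options above can be sketched roughly as follows. This is not the manifest schema itself: the `PipelineEdgeSketch` and `RoutingContextSketch` shapes, the `when` field name, and the rule that a missing flag counts as `false` are all assumptions for illustration. The `file_exists` condition is left out to keep the example free of filesystem access.

```typescript
// Hypothetical edge and context shapes. Only the trigger names (`on`, `event`)
// and the condition names come from this README; everything else is assumed.
type NodeStatusSketch = "success" | "validation_fail" | "failure";

type EdgeConditionSketch =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string };

interface PipelineEdgeSketch {
  from: string;
  to: string;
  on?: NodeStatusSketch | "always"; // legacy status trigger
  event?: string; // domain event trigger
  when?: EdgeConditionSketch[]; // every listed condition must hold
}

interface RoutingContextSketch {
  status: NodeStatusSketch; // status of the node that just finished
  emittedEvents: string[]; // domain events emitted by that node
  flags: Record<string, boolean>;
  history: string[]; // event types seen earlier in the session
}

function conditionHolds(cond: EdgeConditionSketch, ctx: RoutingContextSketch): boolean {
  switch (cond.type) {
    case "always":
      return true;
    case "state_flag":
      // A missing flag is treated as false (an assumption of this sketch).
      return (ctx.flags[cond.key] ?? false) === cond.equals;
    case "history_has_event":
      return ctx.history.includes(cond.event);
  }
}

// An edge fires when its status or event trigger matches and all conditions hold.
function edgeFires(edge: PipelineEdgeSketch, ctx: RoutingContextSketch): boolean {
  const statusMatch = edge.on === "always" || edge.on === ctx.status;
  const eventMatch = edge.event !== undefined && ctx.emittedEvents.includes(edge.event);
  if (!statusMatch && !eventMatch) return false;
  return (edge.when ?? []).every((cond) => conditionHolds(cond, ctx));
}

// Return the ids of the nodes to run next after `from` finishes.
function nextNodes(edges: PipelineEdgeSketch[], from: string, ctx: RoutingContextSketch): string[] {
  return edges.filter((e) => e.from === from && edgeFires(e, ctx)).map((e) => e.to);
}
```

In this sketch an edge with neither `on` nor `event` never fires; whether the real schema allows such edges is not stated here.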

## Domain Events

Domain events are typed and can trigger edges directly:

- planning: `requirements_defined`, `tasks_planned`
- execution: `code_committed`, `task_blocked`
- validation: `validation_passed`, `validation_failed`
- integration: `branch_merged`

Actors can emit events in `ActorExecutionResult.events`. Pipeline status also emits default validation/execution events.
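To make the typed-event idea concrete, here is a minimal sketch of an event union and an in-memory bus. The event names are the ones listed above; the `DomainEventSketch` fields (`source`, `payload`) and the `DomainEventBusSketch` class are hypothetical and may differ from what `src/agents/domain-events.ts` actually exposes.

```typescript
// Event names taken from the list above.
type DomainEventTypeSketch =
  | "requirements_defined"
  | "tasks_planned"
  | "code_committed"
  | "task_blocked"
  | "validation_passed"
  | "validation_failed"
  | "branch_merged";

interface DomainEventSketch {
  type: DomainEventTypeSketch;
  source: string; // hypothetical: id of the node that emitted the event
  payload?: Record<string, unknown>;
}

type DomainEventHandlerSketch = (event: DomainEventSketch) => void;

class DomainEventBusSketch {
  private readonly handlers = new Map<DomainEventTypeSketch, Set<DomainEventHandlerSketch>>();

  // Register a handler and return a function that removes it again.
  subscribe(type: DomainEventTypeSketch, handler: DomainEventHandlerSketch): () => void {
    const set = this.handlers.get(type) ?? new Set<DomainEventHandlerSketch>();
    set.add(handler);
    this.handlers.set(type, set);
    return () => {
      set.delete(handler);
    };
  }

  // Deliver the event synchronously; returns how many handlers received it.
  publish(event: DomainEventSketch): number {
    const set = this.handlers.get(event.type);
    if (set === undefined) return 0;
    set.forEach((handler) => handler(event));
    return set.size;
  }
}
```

Because `type` is a string-literal union, a misspelled event name such as `"validation_fail"` is rejected at compile time rather than silently never matching an edge.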

## Retry Matrix and Cancellation

- `validation_fail`: routed through retry-unrolled execution (new child manager session)
- hard failures: timeout, network, and 403-like failures are tracked sequentially; after 2 consecutive hard failures the pipeline aborts fast
- `AbortSignal` is passed into every actor execution input
- session closure aborts child recursive work
- run summaries expose an aggregate `status`: success requires that every executed terminal DAG node succeeded and that no critical-path failure occurred
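The hard-failure rule above could look something like the sketch below: classify each failure, count consecutive hard ones, and trigger an `AbortController` at the threshold. The error shape, the specific error codes, and the choice to reset the streak on a soft failure or a success are assumptions; the real policy lives in `src/agents/failure-policy.ts` and may classify differently.

```typescript
type FailureKindSketch = "hard" | "soft";

// Hypothetical error shape used only by this sketch.
interface ActorErrorSketch {
  code?: string; // e.g. a Node.js system error code
  status?: number; // e.g. an HTTP status from a provider call
}

// Assumed set of codes treated as timeout/network failures.
const HARD_ERROR_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ENOTFOUND"]);

function classifyFailureSketch(error: ActorErrorSketch): FailureKindSketch {
  if (error.status === 403) return "hard";
  if (error.code !== undefined && HARD_ERROR_CODES.has(error.code)) return "hard";
  return "soft";
}

class HardFailureTrackerSketch {
  private consecutiveHard = 0;
  private readonly controller: AbortController;
  private readonly threshold: number;

  constructor(controller: AbortController, threshold = 2) {
    this.controller = controller;
    this.threshold = threshold;
  }

  // A successful attempt breaks the streak.
  recordSuccess(): void {
    this.consecutiveHard = 0;
  }

  // Classify the failure; abort once the streak of hard failures hits the threshold.
  recordFailure(error: ActorErrorSketch): FailureKindSketch {
    const kind = classifyFailureSketch(error);
    if (kind === "soft") {
      this.consecutiveHard = 0; // assumption: soft failures also break the streak
      return kind;
    }
    this.consecutiveHard += 1;
    if (this.consecutiveHard >= this.threshold) {
      this.controller.abort();
    }
    return kind;
  }
}
```

Actors that receive `controller.signal` in their input can then stop early by checking `signal.aborted` or listening for its `abort` event.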

## Environment Variables

### Provider/Auth

- `CODEX_API_KEY`
- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
- `CODEX_SKIP_GIT_CHECK`
- `ANTHROPIC_API_KEY`
- `CLAUDE_MODEL`
- `CLAUDE_CODE_PATH`
- `MCP_CONFIG_PATH`

### Agent Manager Limits

- `AGENT_MAX_CONCURRENT`
- `AGENT_MAX_SESSION`
- `AGENT_MAX_RECURSIVE_DEPTH`

### Orchestration / Context

- `AGENT_STATE_ROOT`
- `AGENT_PROJECT_CONTEXT_PATH`
- `AGENT_TOPOLOGY_MAX_DEPTH`
- `AGENT_TOPOLOGY_MAX_RETRIES`
- `AGENT_RELATIONSHIP_MAX_CHILDREN`

### Provisioning / Resource Controls

- `AGENT_WORKTREE_ROOT`
- `AGENT_WORKTREE_BASE_REF`
- `AGENT_PORT_BASE`
- `AGENT_PORT_BLOCK_SIZE`
- `AGENT_PORT_BLOCK_COUNT`
- `AGENT_PORT_PRIMARY_OFFSET`
- `AGENT_PORT_LOCK_DIR`
- `AGENT_DISCOVERY_FILE_RELATIVE_PATH`

Defaults are documented in `.env.example`.
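One plausible reading of the port variables is that each session maps to a fixed-size block of ports. The sketch below shows that idea only; the hashing scheme, the `portBlockFor` helper, and the way the four settings combine are assumptions, not the documented behavior of `provisioning.ts`. It also ignores whatever locking `AGENT_PORT_LOCK_DIR` provides, so two sessions can land on the same block here.

```typescript
// Hypothetical config mirroring the AGENT_PORT_* variables.
interface PortConfigSketch {
  portBase: number; // AGENT_PORT_BASE
  blockSize: number; // AGENT_PORT_BLOCK_SIZE
  blockCount: number; // AGENT_PORT_BLOCK_COUNT
  primaryOffset: number; // AGENT_PORT_PRIMARY_OFFSET
}

interface PortBlockSketch {
  index: number;
  start: number;
  end: number; // inclusive
  primary: number;
}

// FNV-1a-style 32-bit hash over UTF-16 code units (assumed scheme, for determinism only).
function hashString32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Map a session id to the same port block on every call.
function portBlockFor(sessionId: string, cfg: PortConfigSketch): PortBlockSketch {
  if (cfg.blockCount <= 0 || cfg.blockSize <= 0) {
    throw new Error("blockCount and blockSize must be positive");
  }
  if (cfg.primaryOffset < 0 || cfg.primaryOffset >= cfg.blockSize) {
    throw new Error("primaryOffset must fall inside the block");
  }
  const index = hashString32(sessionId) % cfg.blockCount;
  const start = cfg.portBase + index * cfg.blockSize;
  return {
    index,
    start,
    end: start + cfg.blockSize - 1,
    primary: start + cfg.primaryOffset,
  };
}
```

The property illustrated is determinism: the same session id and config always yield the same block, which keeps port assignments reproducible across restarts.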

## Quality Gate

```bash
npm run verify
```

Equivalent:

```bash
npm run check
npm run check:tests
npm run test
npm run build
```

## Notes

- Tool clearance allowlist/banlist is currently metadata only; hard enforcement must happen at the tool execution boundary.
- `AgentManager.runRecursiveAgent(...)` remains available for low-level testing, but pipeline execution should use `SchemaDrivenExecutionEngine.runSession(...)`.