ai_ops/docs/orchestration-engine.md

# Schema-Driven Orchestration Engine

## Why this exists

The orchestration runtime introduces explicit schema validation and deterministic execution rules for multi-agent pipelines. The design favors predictable behavior over implicit conversational memory.

## Main components

- `AgentManifest` schema (`src/agents/manifest.ts`): validates personas, relationships, topology constraints, and a strict DAG pipeline.
- Persona registry (`src/agents/persona-registry.ts`): renders templated prompts with runtime context and routes behavioral events.
- Stateful storage for stateless execution (`src/agents/state-context.ts`): each node execution reads its payload and state from storage, so it always starts from fresh context.
- DAG pipeline runner (`src/agents/pipeline.ts`): executes topology blocks, emits typed domain events, evaluates route conditions, and enforces retry/depth/failure limits.
- Project context store (`src/agents/project-context.ts`): project-scoped global flags, artifact pointers, and task queue persisted across sessions.
- Orchestration facade (`src/agents/orchestration.ts`): wires manifest + registry + pipeline + state manager + project context with env-driven limits.
- Hierarchical resource suballocation (`src/agents/provisioning.ts`): builds child `git-worktree` and child `port-range` requests from parent allocation data.
  - Optional `AGENT_WORKTREE_TARGET_PATH` enables sparse-checkout for a subdirectory and sets per-session working directory to that target path.
- Recursive manager runtime (`src/agents/manager.ts`): utility invoked by the pipeline engine for fan-out/retry-unrolled execution.
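
The "strict DAG pipeline" check performed by the manifest schema can be pictured as a cycle test over the pipeline edges. The sketch below uses Kahn's algorithm; the `PipelineEdge` shape and the `isValidDag` name are hypothetical and not taken from `src/agents/manifest.ts`.

```typescript
// Hypothetical edge shape; the real manifest schema may differ.
interface PipelineEdge {
  from: string;
  to: string;
}

// Returns true when node ids are unique, every edge references a known node,
// and the graph has no cycle (Kahn's algorithm).
function isValidDag(nodeIds: string[], edges: PipelineEdge[]): boolean {
  if (new Set(nodeIds).size !== nodeIds.length) return false; // duplicate ids

  const indegree = new Map<string, number>();
  const outgoing = new Map<string, string[]>();
  for (const id of nodeIds) {
    indegree.set(id, 0);
    outgoing.set(id, []);
  }
  for (const edge of edges) {
    if (!indegree.has(edge.from) || !indegree.has(edge.to)) return false;
    outgoing.get(edge.from)!.push(edge.to);
    indegree.set(edge.to, indegree.get(edge.to)! + 1);
  }

  // Repeatedly remove nodes with no unmet dependencies.
  const queue = nodeIds.filter((id) => indegree.get(id) === 0);
  let visited = 0;
  while (queue.length > 0) {
    const id = queue.shift()!;
    visited++;
    for (const next of outgoing.get(id)!) {
      const remaining = indegree.get(next)! - 1;
      indegree.set(next, remaining);
      if (remaining === 0) queue.push(next);
    }
  }
  // Any node left unvisited sits on a cycle.
  return visited === nodeIds.length;
}
```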

## Constraint model

- Relationship constraints: per-edge limits (`maxDepth`, `maxChildren`) and process-level cap (`AGENT_RELATIONSHIP_MAX_CHILDREN`).
- Pipeline constraints: per-node retry limits, retry-unrolled topology, and process-level cap (`AGENT_TOPOLOGY_MAX_RETRIES`).
- Topology constraints: maximum depth and retry counts come from the manifest, subject to the env caps.
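
One plausible reading of "manifest + env caps" is that the process-level value acts as an upper bound on whatever the manifest requests. The sketch below assumes that `min` semantics; `parseEnvCap` and `resolveLimit` are hypothetical helpers, not names from the codebase.

```typescript
// Hypothetical helper: parse an env-provided cap, ignoring unset or invalid values.
function parseEnvCap(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isInteger(n) && n >= 0 ? n : undefined;
}

// Assumed semantics: the process-level cap is an upper bound on the manifest value.
function resolveLimit(manifestValue: number, envCap: number | undefined): number {
  return envCap === undefined ? manifestValue : Math.min(manifestValue, envCap);
}

// Example: a node asks for 5 retries while AGENT_TOPOLOGY_MAX_RETRIES is "3".
const effectiveRetries = resolveLimit(5, parseEnvCap("3"));
```

Under these assumptions the example resolves to 3 retries, and an unset or malformed env value leaves the manifest value untouched.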

## Stateless handoffs

Node payloads are persisted under the state root. Nodes do not inherit in-memory conversational context from previous node runs. Instead, fresh context is reconstructed from the handoff and persisted state on each execution. Sessions load project context from `AGENT_PROJECT_CONTEXT_PATH` at initialization, and orchestration writes project updates on each node completion.
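
The handoff pattern can be illustrated with a small store that serializes every payload on write and deserializes on read, so no object reference survives between node runs. This is a sketch only: `HandoffStore`, `InMemoryStore`, `persistHandoff`, and `loadFreshContext` are hypothetical names, and an in-memory map stands in for the on-disk state root.

```typescript
// Hypothetical storage abstraction; the real module persists under the state root.
interface HandoffStore {
  write(key: string, json: string): void;
  read(key: string): string | undefined;
}

// In-memory stand-in for the on-disk store (assumption for the example).
class InMemoryStore implements HandoffStore {
  private readonly data = new Map<string, string>();
  write(key: string, json: string): void {
    this.data.set(key, json);
  }
  read(key: string): string | undefined {
    return this.data.get(key);
  }
}

interface Handoff {
  sessionId: string;
  nodeId: string;
  payload: Record<string, unknown>;
}

// Serialize the handoff so later mutations of the in-memory object are not visible.
function persistHandoff(store: HandoffStore, handoff: Handoff): void {
  store.write(`${handoff.sessionId}/${handoff.nodeId}`, JSON.stringify(handoff));
}

// Each node execution rebuilds its context from storage instead of shared memory.
function loadFreshContext(store: HandoffStore, sessionId: string, nodeId: string): Handoff {
  const raw = store.read(`${sessionId}/${nodeId}`);
  if (raw === undefined) throw new Error(`no handoff for ${sessionId}/${nodeId}`);
  return JSON.parse(raw) as Handoff;
}
```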

## Resolved execution contract

Before each actor invocation, orchestration resolves an immutable `ResolvedExecutionContext` and injects it into the executor input:

- `phase`: current pipeline node id
- `modelConstraint`: persona-level model policy (or runtime fallback)
- `allowedTools`: flat resolved tool list for that node attempt
- `security`: hard runtime constraints (`dropUid`, `dropGid`, `worktreePath`, violation handling mode)

This keeps orchestration policy resolution separate from executor enforcement. Executors do not need to parse manifests or MCP registry internals.
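
A possible shape for this contract is sketched below. The four top-level fields come from the list above; the field types, the `violationMode` property name, the `ResolveInput` shape, and the `resolveExecutionContext` function are assumptions made for illustration.

```typescript
type ViolationMode = "hard_abort" | "validation_fail";

interface ResolvedExecutionContext {
  readonly phase: string;
  readonly modelConstraint: string;
  readonly allowedTools: readonly string[];
  readonly security: {
    readonly dropUid: number;
    readonly dropGid: number;
    readonly worktreePath: string;
    readonly violationMode: ViolationMode; // hypothetical property name
  };
}

// Hypothetical resolver input; the real orchestration derives these from the manifest.
interface ResolveInput {
  nodeId: string;
  personaModel?: string;
  fallbackModel: string;
  tools: string[];
  security: { dropUid: number; dropGid: number; worktreePath: string; violationMode: ViolationMode };
}

function resolveExecutionContext(input: ResolveInput): ResolvedExecutionContext {
  // Copy and freeze each level so the executor cannot mutate resolved policy.
  return Object.freeze({
    phase: input.nodeId,
    modelConstraint: input.personaModel ?? input.fallbackModel,
    allowedTools: Object.freeze(input.tools.slice()),
    security: Object.freeze({ ...input.security }),
  });
}
```

Freezing copies rather than the caller's objects means later changes to the resolver input do not leak into a context that has already been handed to an executor.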

## Execution topology model

- Pipeline graph execution is DAG-based with ready-node frontiers.
- Nodes tagged with topology blocks `parallel`/`hierarchical` are dispatched concurrently (`Promise.all`) through `AgentManager`.
- Validation failures are retried through the retry-unrolled topology, and each retry runs as a new manager child session.
- Sequential hard failures (timeout/network/403-like) trigger fail-fast abort.
- `AbortSignal` is passed through actor execution input for immediate cancellation propagation.
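
The frontier model can be sketched as a loop that repeatedly collects every node whose dependencies are complete and runs that frontier with `Promise.all`. This is a simplification: it treats every frontier as a parallel block, skips `AgentManager` and retries entirely, and uses hypothetical names (`DagNode`, `Actor`, `runDag`).

```typescript
// Hypothetical node shape: an id plus the ids it depends on.
interface DagNode {
  id: string;
  dependsOn: string[];
}

type Actor = (nodeId: string, signal: AbortSignal) => Promise<void>;

// Runs the graph frontier by frontier and returns the frontiers in execution order.
async function runDag(nodes: DagNode[], actor: Actor): Promise<string[][]> {
  const controller = new AbortController();
  const done = new Set<string>();
  const pending = new Map<string, DagNode>();
  for (const node of nodes) pending.set(node.id, node);
  const frontiers: string[][] = [];

  while (pending.size > 0) {
    // Ready nodes are those whose dependencies have all completed.
    const ready = Array.from(pending.values()).filter((n) =>
      n.dependsOn.every((dep) => done.has(dep)),
    );
    if (ready.length === 0) throw new Error("cycle or missing dependency");
    frontiers.push(ready.map((n) => n.id));

    try {
      await Promise.all(ready.map((n) => actor(n.id, controller.signal)));
    } catch (err) {
      // Fail fast: signal cancellation to in-flight actors, then propagate.
      controller.abort();
      throw err;
    }
    for (const n of ready) {
      done.add(n.id);
      pending.delete(n.id);
    }
  }
  return frontiers;
}
```

Because every actor in a run shares one signal, a hard failure in any node lets its siblings observe `signal.aborted` and stop early.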

## Domain events

- Domain event schema is strongly typed (`src/agents/domain-events.ts`).
- Standard event domains:
  - planning: `requirements_defined`, `tasks_planned`
  - execution: `code_committed`, `task_blocked`
  - validation: `validation_passed`, `validation_failed`
  - integration: `branch_merged`, `merge_conflict_detected`, `merge_conflict_resolved`, `merge_conflict_unresolved`, `merge_retry_started`
- Pipeline edges can trigger on domain events (`edge.event`) in addition to legacy status triggers (`edge.on`).
- `history_has_event` route conditions evaluate persisted domain event history entries (`validation_failed`, `task_blocked`, etc.).
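
As an illustration, the event names listed above can be modeled as a string-literal union, with `history_has_event` reduced to a membership test over persisted history. The event names and the `edge.event` / `edge.on` fields come from this document; the `DomainEvent`, `HistoryHasEventCondition`, and `Edge` shapes and both functions are assumptions, not the types in `src/agents/domain-events.ts`.

```typescript
type DomainEventType =
  | "requirements_defined" | "tasks_planned"
  | "code_committed" | "task_blocked"
  | "validation_passed" | "validation_failed"
  | "branch_merged" | "merge_conflict_detected" | "merge_conflict_resolved"
  | "merge_conflict_unresolved" | "merge_retry_started";

// Hypothetical persisted history entry.
interface DomainEvent {
  type: DomainEventType;
  nodeId: string;
}

// Hypothetical condition shape for a route.
interface HistoryHasEventCondition {
  kind: "history_has_event";
  event: DomainEventType;
}

function historyHasEvent(cond: HistoryHasEventCondition, history: DomainEvent[]): boolean {
  return history.some((entry) => entry.type === cond.event);
}

// Hypothetical edge: `event` is the domain-event trigger, `on` the legacy status trigger.
interface Edge {
  from: string;
  to: string;
  event?: DomainEventType;
  on?: string;
}

// An edge fires when either its domain-event trigger or its status trigger matches.
function edgeFires(edge: Edge, emitted: DomainEventType[], status: string): boolean {
  const byEvent = edge.event !== undefined && emitted.indexOf(edge.event) !== -1;
  const byStatus = edge.on !== undefined && edge.on === status;
  return byEvent || byStatus;
}
```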

## Merge conflict orchestration

- Task merge/close merge operations return structured outcomes (`success`, `conflict`, `fatal_error`) instead of throwing for conflicts.
- Task state supports conflict workflows (`conflict`, `resolving_conflict`) and conflict metadata is persisted under `task.metadata.mergeConflict`.
- Conflict retries are bounded by `AGENT_MERGE_CONFLICT_MAX_ATTEMPTS`; exhaustion emits `merge_conflict_unresolved` and the session continues without crashing.
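
The bounded-retry behavior might look like the loop below, where a merge attempt returns a tagged outcome rather than throwing on conflict. The outcome kinds and event names come from this document; the `MergeOutcome` field names, the `mergeWithRetries` function, and the exact order in which events are emitted are assumptions.

```typescript
type MergeOutcome =
  | { kind: "success" }
  | { kind: "conflict"; files: string[] }
  | { kind: "fatal_error"; message: string };

// maxAttempts stands in for AGENT_MERGE_CONFLICT_MAX_ATTEMPTS.
function mergeWithRetries(
  attemptMerge: (attempt: number) => MergeOutcome,
  maxAttempts: number,
  emit: (event: string) => void,
): MergeOutcome {
  let last: MergeOutcome = { kind: "conflict", files: [] };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (attempt > 1) emit("merge_retry_started");
    last = attemptMerge(attempt);
    if (last.kind === "success") {
      emit("branch_merged");
      return last;
    }
    // Fatal errors are returned to the caller without further retries.
    if (last.kind === "fatal_error") return last;
    emit("merge_conflict_detected");
  }
  // Attempts exhausted: report the unresolved conflict instead of throwing.
  emit("merge_conflict_unresolved");
  return last;
}
```

Returning the last outcome on exhaustion, rather than throwing, is what lets the session continue after `merge_conflict_unresolved`.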

## Security note

Security enforcement now lives in `src/security`:

- `bash-parser` AST parsing for shell command tokenization (`Command`/`Word` nodes).
- Zod-validated shell/tool policy schemas.
- `SecurityRulesEngine` for binary allowlists, path traversal checks, worktree boundaries, and tool clearance checks.
- `SecureCommandExecutor` for controlled `child_process` execution with timeout + explicit env policy.
- `ResolvedExecutionContext.allowedTools` is used to filter provider-exposed tools before SDK invocation, including Claude-specific tool gating where shared `enabled_tools` is ignored.
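
The tool-filtering step in the last bullet amounts to an allowlist intersection applied before the provider SDK is called. A minimal sketch, assuming a hypothetical `ProviderTool` shape and `filterExposedTools` helper:

```typescript
// Hypothetical description of a tool as exposed by a provider.
interface ProviderTool {
  name: string;
  description: string;
}

// Keep only tools whose names appear in the resolved allowlist.
function filterExposedTools(
  providerTools: ProviderTool[],
  allowedTools: readonly string[],
): ProviderTool[] {
  const allowed = new Set(allowedTools);
  return providerTools.filter((tool) => allowed.has(tool.name));
}
```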

`PipelineExecutor` handles `SecurityViolationError` according to a configurable policy:

- `hard_abort` (default): immediate pipeline termination.
- `validation_fail`: maps to retry-unrolled remediation.
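
The two policies can be pictured as a small classification step in the executor's error handling. `SecurityViolationError` and the policy names come from this document; the error class body, the `Disposition` type, and `classifyError` are illustrative assumptions rather than the actual `PipelineExecutor` code.

```typescript
// Stand-in for the real error class (assumption for the example).
class SecurityViolationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "SecurityViolationError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, SecurityViolationError.prototype);
  }
}

type ViolationPolicy = "hard_abort" | "validation_fail";

// Hypothetical result describing what the executor should do next.
type Disposition =
  | { action: "abort_pipeline"; reason: string }
  | { action: "retry_as_validation_failure"; reason: string }
  | { action: "rethrow" };

function classifyError(err: unknown, policy: ViolationPolicy = "hard_abort"): Disposition {
  // Non-security errors are left to the executor's normal failure handling.
  if (!(err instanceof SecurityViolationError)) return { action: "rethrow" };
  return policy === "hard_abort"
    ? { action: "abort_pipeline", reason: err.message }
    : { action: "retry_as_validation_failure", reason: err.message };
}
```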