AI Ops: Schema-Driven Multi-Agent Orchestration Runtime
TypeScript runtime for deterministic multi-agent execution with:
- OpenAI Codex SDK (@openai/codex-sdk)
- Anthropic Claude Agent SDK (@anthropic-ai/claude-agent-sdk)
- Schema-validated orchestration (AgentManifest)
- DAG execution with topology-aware fan-out (parallel, hierarchical, retry-unrolled)
- Project-scoped persistent context store
- Typed domain events for edge-triggered routing
- Resource provisioning (git worktrees + deterministic port ranges)
- MCP configuration layer with handler policy hooks
- Security middleware for shell/tool policy enforcement
- Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)
Architecture Summary
- SchemaDrivenExecutionEngine.runSession(...) is the single execution entrypoint.
- PipelineExecutor owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
- Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
- AgentManager is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
- Session state is persisted under AGENT_STATE_ROOT.
- Project state is persisted under AGENT_PROJECT_CONTEXT_PATH with schema-versioned JSON (schemaVersion) and domains:
  - globalFlags
  - artifactPointers
  - taskQueue
Repository Layout
- src/agents
  - orchestration.ts: engine facade and runtime wiring
  - pipeline.ts: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
  - failure-policy.ts: hard/soft failure classification policy
  - lifecycle-observer.ts: persistence/event lifecycle hooks for node attempts
  - manifest.ts: schema parsing/validation for personas/topologies/edges
  - manager.ts: recursive fan-out utility used by pipeline
  - state-context.ts: persisted node handoffs + session state
  - project-context.ts: project-scoped store
  - domain-events.ts: typed domain event schema + bus
  - runtime.ts: env-driven defaults/singletons
  - provisioning.ts: resource provisioning and child suballocation helpers
- src/mcp: MCP config types/conversion/handlers
- src/security: shell AST parsing, rules engine, secure executor, and audit sinks
- src/telemetry: runtime event schema, sink fan-out, file sink, and Discord webhook sink
- src/ui: local operator UI server, API routes, run-control service, and graph/event aggregation
- src/examples: provider entrypoints (codex.ts, claude.ts)
- src/config.ts: centralized env parsing/validation/defaulting
- tests: manager, manifest, pipeline/orchestration, state, provisioning, MCP
Setup
npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json
Run
npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."
Or via unified entrypoint:
npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."
Operator UI
Start the local UI server:
npm run ui
Then open:
http://127.0.0.1:4317 (default)
The UI provides:
- graph visualizer with topology/retry rendering, edge trigger labels, node economics (duration/cost/tokens), and critical-path highlighting
- node inspector with attempt metadata and injected ResolvedExecutionContext sandbox payload
- live runtime event feed from AGENT_RUNTIME_EVENT_LOG_PATH with severity coloring (including security mirror events)
- run trigger + kill switch backed by SchemaDrivenExecutionEngine.runSession(...)
  - run mode selector: provider (real Codex/Claude execution) or mock (deterministic dry-run executor)
  - provider selector: codex or claude
- run history from AGENT_STATE_ROOT
- forms for runtime Discord webhook settings, security policy, and manager/resource limits
- manifest editor/validator/saver for schema "1" manifests
Provider mode notes:
- provider=codex uses existing OpenAI/Codex auth settings (OPENAI_AUTH_MODE, CODEX_API_KEY, OPENAI_API_KEY).
- provider=claude uses Claude auth resolution (CLAUDE_CODE_OAUTH_TOKEN preferred, otherwise ANTHROPIC_API_KEY, or existing Claude Code login state).
Manifest Semantics
AgentManifest (schema "1") validates:
- supported topologies (sequential, parallel, hierarchical, retry-unrolled)
- persona definitions, optional modelConstraint, and tool-clearance policy (validated by shared Zod schema)
- relationship DAG and unknown persona references
- strict pipeline DAG
- topology constraints (maxDepth, maxRetries)
Pipeline edges can route via:
- legacy status triggers (on: success, validation_fail, failure, always)
- domain event triggers (event: typed domain events)
- conditions (state_flag, history_has_event, file_exists, always)
- history_has_event evaluates persisted domain event history (for example validation_failed)
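The trigger and condition kinds listed above can be sketched as a small evaluator. This is an illustration only, not the actual AgentManifest schema: the Edge, Condition, and RoutingInput shapes, the when field name, and the rule that every condition must hold are all assumptions made for the example.

```typescript
// Hypothetical shapes; the real manifest schema may name and nest these differently.
type NodeStatus = "success" | "validation_fail" | "failure";
type StatusTrigger = NodeStatus | "always";

type Condition =
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string }
  | { type: "always" };

interface Edge {
  from: string;
  to: string;
  on?: StatusTrigger; // legacy status trigger
  event?: string;     // domain event trigger
  when?: Condition[]; // assumed field name for conditions
}

interface RoutingInput {
  status: NodeStatus;              // status of the node that just finished
  emittedEvents: string[];         // domain events emitted by this attempt
  flags: Record<string, boolean>;  // stands in for persisted state flags
  eventHistory: string[];          // stands in for persisted domain event history
  fileExists: (path: string) => boolean;
}

function conditionHolds(condition: Condition, input: RoutingInput): boolean {
  switch (condition.type) {
    case "state_flag":
      return input.flags[condition.key] === condition.equals;
    case "history_has_event":
      return input.eventHistory.includes(condition.event);
    case "file_exists":
      return input.fileExists(condition.path);
    case "always":
      return true;
  }
}

// An edge fires when its trigger matches and all of its conditions hold.
function edgeFires(edge: Edge, input: RoutingInput): boolean {
  const triggered =
    edge.event !== undefined
      ? input.emittedEvents.includes(edge.event)
      : edge.on === "always" || edge.on === input.status;
  return triggered && (edge.when ?? []).every((c) => conditionHolds(c, input));
}
```

Passing fileExists as a function keeps the sketch free of filesystem access, so it can be exercised in tests without touching disk.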
Domain Events
Domain events are typed and can trigger edges directly:
- planning: requirements_defined, tasks_planned
- execution: code_committed, task_blocked
- validation: validation_passed, validation_failed
- integration: branch_merged
Actors can emit events in ActorExecutionResult.events. Pipeline status also emits default validation/execution events.
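As a rough illustration of what "typed" can mean here, the event names listed above can be modelled as a string-literal union with a runtime guard. The event names come from this README; the DOMAIN_EVENTS constant, the guard, and the ActorResultSketch shape (including its payload field) are hypothetical and may not match src/agents/domain-events.ts.

```typescript
// Event names grouped by domain, as listed in this section.
const DOMAIN_EVENTS = {
  planning: ["requirements_defined", "tasks_planned"],
  execution: ["code_committed", "task_blocked"],
  validation: ["validation_passed", "validation_failed"],
  integration: ["branch_merged"],
} as const;

// Union of every event name across all domains.
type DomainEventType = (typeof DOMAIN_EVENTS)[keyof typeof DOMAIN_EVENTS][number];

// Runtime guard for values arriving from untyped sources (JSON, env, logs).
function isDomainEventType(value: string): value is DomainEventType {
  return Object.values(DOMAIN_EVENTS).some((group) =>
    (group as readonly string[]).includes(value),
  );
}

// Hypothetical result shape: only the `events` field is mentioned in this README.
interface ActorResultSketch {
  status: "success" | "validation_fail" | "failure";
  events?: Array<{ type: DomainEventType; payload?: unknown }>;
}

const exampleResult: ActorResultSketch = {
  status: "success",
  events: [{ type: "code_committed", payload: { branch: "feature/example" } }],
};
```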
Retry Matrix and Cancellation
- validation_fail: routed through retry-unrolled execution (new child manager session)
- hard failures: timeout/network/403-like failures tracked sequentially; at 2 consecutive hard failures the pipeline aborts fast
- AbortSignal is passed into every actor execution input
- session closure aborts child recursive work
- run summaries expose aggregate status: success requires successful terminal executed DAG nodes and no critical-path failure
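The fail-fast rule for hard failures can be pictured as a counter wired to an AbortController. This is a minimal sketch, not the pipeline's failure policy: the HardFailureTracker name is hypothetical, and it assumes that any non-hard outcome resets the consecutive count.

```typescript
type AttemptOutcome = "success" | "validation_fail" | "hard_failure";

// Hypothetical helper: aborts a shared signal after N consecutive hard failures.
class HardFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;
  private readonly controller = new AbortController();

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Signal that would be handed to every actor execution input.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // Records one attempt outcome and returns true once the run is aborted.
  record(outcome: AttemptOutcome): boolean {
    // Assumption: success and validation_fail both break the streak.
    this.consecutive = outcome === "hard_failure" ? this.consecutive + 1 : 0;
    if (this.consecutive >= this.threshold) {
      this.controller.abort();
    }
    return this.controller.signal.aborted;
  }
}
```

An actor that receives tracker.signal can check signal.aborted between steps or pass the signal on to APIs that accept one, such as fetch.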
Runtime Events
- The pipeline emits runtime lifecycle events (session.started, node.attempt.completed, domain.*, session.completed, session.failed).
- Runtime events are fan-out only and never used for edge-routing decisions.
- Default sink writes NDJSON to AGENT_RUNTIME_EVENT_LOG_PATH.
- Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
- Existing security command audit output (AGENT_SECURITY_AUDIT_LOG_PATH) remains in place and is also mirrored into runtime events.
Runtime Event Fields
Each runtime event is written as one NDJSON object with:
- id, timestamp, type, severity
- sessionId, nodeId, attempt
- message
- optional usage (tokenInput, tokenOutput, tokenTotal, toolCalls, durationMs, costUsd)
- optional structured metadata
node.attempt.completed metadata includes:
- executionContext (resolved sandbox payload injected into executor)
- topologyKind
- retrySpawned
- optional fromNodeId, subtasks, securityViolation
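The field list above maps naturally onto a TypeScript interface. The sketch below is an approximation rather than the schema in src/telemetry: field names are taken from this section, while the value types, the ISO-8601 timestamp, and the optionality of nodeId and attempt on session-level events are assumptions.

```typescript
type Severity = "info" | "warning" | "critical";

interface RuntimeEventUsage {
  tokenInput?: number;
  tokenOutput?: number;
  tokenTotal?: number;
  toolCalls?: number;
  durationMs?: number;
  costUsd?: number;
}

interface RuntimeEvent {
  id: string;
  timestamp: string; // assumed to be an ISO-8601 string
  type: string;      // e.g. "session.started", "node.attempt.completed"
  severity: Severity;
  sessionId: string;
  nodeId?: string;   // assumed absent on session-level events
  attempt?: number;  // assumed absent on session-level events
  message: string;
  usage?: RuntimeEventUsage;
  metadata?: Record<string, unknown>;
}

// NDJSON: exactly one JSON object per line, terminated by a newline.
function toNdjsonLine(event: RuntimeEvent): string {
  return `${JSON.stringify(event)}\n`;
}

const exampleEvent: RuntimeEvent = {
  id: "evt-1",
  timestamp: "2025-01-01T00:00:00.000Z",
  type: "node.attempt.completed",
  severity: "info",
  sessionId: "session-1",
  nodeId: "planner",
  attempt: 1,
  message: "Node attempt completed",
  usage: { tokenInput: 120, tokenOutput: 80, tokenTotal: 200, durationMs: 1500 },
  metadata: { topologyKind: "sequential", retrySpawned: false },
};
```

Because JSON.stringify escapes newlines inside string values, each serialized event stays on one physical line, which is what lets tail and jq process the log line by line.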
Runtime Event Setup
Add these variables in .env (or use defaults):
AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed
Notes:
- File sink is always enabled and appends NDJSON.
- Discord sink is enabled only when AGENT_RUNTIME_DISCORD_WEBHOOK_URL is set.
- Discord notifications are sent when event severity is at or above AGENT_RUNTIME_DISCORD_MIN_SEVERITY.
- Event types in AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES are always sent, regardless of severity.
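The three Discord rules above reduce to a short predicate. The following is a sketch under stated assumptions: the function and config names are hypothetical, severities are ordered info < warning < critical, and an empty webhook URL counts as unset.

```typescript
type Severity = "info" | "warning" | "critical";

// Assumed ordering of severities from least to most severe.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

interface DiscordSinkConfig {
  webhookUrl?: string;
  minSeverity: Severity;
  alwaysNotifyTypes: string[];
}

// Splits a CSV env value such as "session.started,session.failed" into trimmed entries.
function parseCsv(value: string | undefined): string[] {
  return (value ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry !== "");
}

function shouldNotifyDiscord(
  event: { type: string; severity: Severity },
  config: DiscordSinkConfig,
): boolean {
  // Rule 1: the sink is disabled unless a webhook URL is configured.
  if (!config.webhookUrl) return false;
  // Rule 3: always-notify types bypass the severity threshold.
  if (config.alwaysNotifyTypes.includes(event.type)) return true;
  // Rule 2: otherwise the event must meet the minimum severity.
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[config.minSeverity];
}
```

With the values shown in the .env snippet (minimum severity critical plus the three session types), an info-level session.started event is still posted, while an info-level node.attempt.completed event is not.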
Event Types Emitted by Runtime
- Session lifecycle:
  - session.started
  - session.completed
  - session.failed
- Node/domain lifecycle:
  - node.attempt.completed
  - domain.<domain_event_type> (for example domain.validation_failed)
- Security mirror events:
  - security.shell.command_profiled
  - security.shell.command_allowed
  - security.shell.command_blocked
  - security.tool.invocation_allowed
  - security.tool.invocation_blocked
Analytics Quick Start
Inspect latest events:
tail -n 50 .ai_ops/events/runtime-events.ndjson
Count events by type:
jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c
Get only critical events:
jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson
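The same two queries can be run from TypeScript when jq is not available. This sketch works on the log contents as a string and assumes only that each non-empty line is a JSON object with type and severity fields; the helper names are illustrative.

```typescript
type LoggedEvent = Record<string, unknown>;

// Parses NDJSON text, skipping blank lines (for example a trailing newline).
function parseNdjson(text: string): LoggedEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LoggedEvent);
}

// Equivalent of: jq -r '.type' ... | sort | uniq -c
function countEventsByType(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const event of parseNdjson(text)) {
    const rawType = event["type"];
    const type = typeof rawType === "string" ? rawType : "(unknown)";
    counts.set(type, (counts.get(type) ?? 0) + 1);
  }
  return counts;
}

// Equivalent of: jq -c 'select(.severity=="critical")' ...
function selectBySeverity(text: string, severity: string): LoggedEvent[] {
  return parseNdjson(text).filter((event) => event["severity"] === severity);
}
```

In a Node script the text argument would typically come from reading the file at AGENT_RUNTIME_EVENT_LOG_PATH, for example with readFileSync(path, "utf8") from node:fs.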
Security Middleware
- Shell command parsing uses async sh-syntax (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
- Rules are validated with strict Zod schemas (src/security/schemas.ts) before execution.
- SecurityRulesEngine enforces:
  - binary allowlists
  - cwd/worktree boundary checks
  - path traversal blocking (../)
  - protected path blocking (state root + project context path)
  - unified tool allowlist/banlist checks for shell binaries and MCP tool lists
- SecureCommandExecutor runs commands via child_process.spawn with:
  - explicit env scrub/inject policy (no implicit full env inheritance)
  - timeout enforcement
  - optional uid/gid drop
  - stdout/stderr streaming hooks for audit
- Every actor execution input now includes a pre-resolved executionContext (phase, modelConstraint, allowedTools, and immutable security constraints) generated by orchestration per node attempt.
- Every actor execution input now includes security helpers (rulesEngine, createCommandExecutor(...)) so executors can enforce shell/tool policy at the execution boundary.
- Every actor execution input now includes mcp helpers (resolvedConfig, resolveConfig(...), filterToolsForProvider(...), createClaudeCanUseTool()) so provider adapters are filtered against executionContext.allowedTools before SDK calls.
- For Claude-based executors, pass input.mcp.filterToolsForProvider(...) and input.mcp.createClaudeCanUseTool() into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
- Pipeline behavior on SecurityViolationError is configurable:
  - hard_abort (default)
  - validation_fail (retry-unrolled remediation)
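To make the rule categories concrete, here is a deliberately simplified check covering three of them: binary allowlist, path traversal, and protected paths. It is not the SecurityRulesEngine API. The real middleware works on a parsed shell AST, whereas this sketch takes an already-split binary and argument list, and every name in it is hypothetical.

```typescript
interface ShellPolicySketch {
  allowedBinaries: string[];
  protectedPaths: string[]; // e.g. the state root and project context path
}

type Verdict = { allowed: true } | { allowed: false; reason: string };

function checkShellInvocation(
  binary: string,
  args: string[],
  policy: ShellPolicySketch,
): Verdict {
  // Binary allowlist: anything not explicitly listed is rejected (fail closed).
  if (!policy.allowedBinaries.includes(binary)) {
    return { allowed: false, reason: `binary not allowed: ${binary}` };
  }
  for (const arg of args) {
    // Path traversal: reject any argument containing a ".." path segment.
    if (arg.split("/").includes("..")) {
      return { allowed: false, reason: `path traversal in argument: ${arg}` };
    }
    // Protected paths: reject the path itself and anything beneath it.
    const hitsProtected = policy.protectedPaths.some(
      (protectedPath) => arg === protectedPath || arg.startsWith(`${protectedPath}/`),
    );
    if (hitsProtected) {
      return { allowed: false, reason: `protected path: ${arg}` };
    }
  }
  return { allowed: true };
}
```

A string-level check like this is easy to bypass with quoting, variable expansion, or redirects, which is presumably why the runtime parses commands into an AST first; treat the sketch as a description of the rule categories only.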
Environment Variables
Provider/Auth
- CODEX_API_KEY
- OPENAI_API_KEY
- OPENAI_AUTH_MODE (auto, chatgpt, or api_key)
- OPENAI_BASE_URL
- CODEX_SKIP_GIT_CHECK
- CLAUDE_CODE_OAUTH_TOKEN (preferred for Claude auth; takes precedence over ANTHROPIC_API_KEY)
- ANTHROPIC_API_KEY (used when CLAUDE_CODE_OAUTH_TOKEN is unset)
- CLAUDE_MODEL
- CLAUDE_CODE_PATH
- MCP_CONFIG_PATH
Agent Manager Limits
- AGENT_MAX_CONCURRENT
- AGENT_MAX_SESSION
- AGENT_MAX_RECURSIVE_DEPTH
Orchestration / Context
- AGENT_STATE_ROOT
- AGENT_PROJECT_CONTEXT_PATH
- AGENT_TOPOLOGY_MAX_DEPTH
- AGENT_TOPOLOGY_MAX_RETRIES
- AGENT_RELATIONSHIP_MAX_CHILDREN
Provisioning / Resource Controls
- AGENT_WORKTREE_ROOT
- AGENT_WORKTREE_BASE_REF
- AGENT_PORT_BASE
- AGENT_PORT_BLOCK_SIZE
- AGENT_PORT_BLOCK_COUNT
- AGENT_PORT_PRIMARY_OFFSET
- AGENT_PORT_LOCK_DIR
- AGENT_DISCOVERY_FILE_RELATIVE_PATH
Security Middleware
- AGENT_SECURITY_VIOLATION_MODE (hard_abort or validation_fail)
- AGENT_SECURITY_ALLOWED_BINARIES
- AGENT_SECURITY_COMMAND_TIMEOUT_MS
- AGENT_SECURITY_AUDIT_LOG_PATH
- AGENT_SECURITY_ENV_INHERIT
- AGENT_SECURITY_ENV_SCRUB
- AGENT_SECURITY_DROP_UID
- AGENT_SECURITY_DROP_GID
Runtime Events / Telemetry
- AGENT_RUNTIME_EVENT_LOG_PATH
- AGENT_RUNTIME_DISCORD_WEBHOOK_URL
- AGENT_RUNTIME_DISCORD_MIN_SEVERITY (info, warning, or critical)
- AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES (CSV event types such as session.started,session.completed,session.failed)
Operator UI
- AGENT_UI_HOST (default 127.0.0.1)
- AGENT_UI_PORT (default 4317)
Runtime-Injected (Do Not Configure In .env)
- AGENT_REPO_ROOT
- AGENT_WORKTREE_PATH
- AGENT_WORKTREE_BASE_REF
- AGENT_PORT_RANGE_START
- AGENT_PORT_RANGE_END
- AGENT_PORT_PRIMARY
- AGENT_DISCOVERY_FILE
Defaults are documented in .env.example.
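One plausible reading of how the port settings relate to the runtime-injected port variables is block arithmetic, sketched below. This is an assumption, not a description of src/agents/provisioning.ts: the formula, the function name, and the example numbers are all hypothetical, and how a block index is claimed (presumably involving AGENT_PORT_LOCK_DIR) is not shown.

```typescript
interface PortConfigSketch {
  base: number;          // AGENT_PORT_BASE
  blockSize: number;     // AGENT_PORT_BLOCK_SIZE
  blockCount: number;    // AGENT_PORT_BLOCK_COUNT
  primaryOffset: number; // AGENT_PORT_PRIMARY_OFFSET
}

interface PortAllocationSketch {
  rangeStart: number; // would map to AGENT_PORT_RANGE_START
  rangeEnd: number;   // would map to AGENT_PORT_RANGE_END (inclusive here, by assumption)
  primary: number;    // would map to AGENT_PORT_PRIMARY
}

// Assumed formula: block i covers [base + i * size, base + (i + 1) * size - 1].
function allocatePortBlock(
  blockIndex: number,
  config: PortConfigSketch,
): PortAllocationSketch {
  if (!Number.isInteger(blockIndex) || blockIndex < 0 || blockIndex >= config.blockCount) {
    throw new RangeError(`block index out of range: ${blockIndex}`);
  }
  if (config.primaryOffset < 0 || config.primaryOffset >= config.blockSize) {
    throw new RangeError(`primary offset must fall inside the block: ${config.primaryOffset}`);
  }
  const rangeStart = config.base + blockIndex * config.blockSize;
  return {
    rangeStart,
    rangeEnd: rangeStart + config.blockSize - 1,
    primary: rangeStart + config.primaryOffset,
  };
}
```

Under this reading the same block index always yields the same ports, which is one way a port range can be deterministic; consult .env.example and the provisioning module for the actual defaults and semantics.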
Auth behavior notes:
- OpenAI/Codex:
  - OPENAI_AUTH_MODE=auto (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (codex login / ChatGPT plan auth).
  - OPENAI_AUTH_MODE=chatgpt always omits API key injection so Codex uses ChatGPT subscription auth/session.
- Claude:
  - If CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.
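The precedence rules above can be written as two small resolver functions. This is a sketch of the documented behavior, not the code in src/config.ts. The function and type names are hypothetical, empty strings are treated as unset, CODEX_API_KEY is assumed to win over OPENAI_API_KEY when both are present, and api_key mode is treated like auto because the notes above do not describe it.

```typescript
type Env = Record<string, string | undefined>;

// Treats undefined and blank values as unset (assumption).
const isSet = (value: string | undefined): value is string =>
  value !== undefined && value.trim() !== "";

type ClaudeAuth =
  | { kind: "oauth_token"; token: string }
  | { kind: "api_key"; apiKey: string }
  | { kind: "existing_login" }; // no auth options passed; the SDK uses Claude Code login state

function resolveClaudeAuth(env: Env): ClaudeAuth {
  const oauthToken = env["CLAUDE_CODE_OAUTH_TOKEN"];
  if (isSet(oauthToken)) return { kind: "oauth_token", token: oauthToken };
  const apiKey = env["ANTHROPIC_API_KEY"];
  if (isSet(apiKey)) return { kind: "api_key", apiKey };
  return { kind: "existing_login" };
}

// Returns the API key to inject for Codex, or undefined to rely on CLI/ChatGPT login.
function resolveCodexApiKey(env: Env): string | undefined {
  const rawMode = env["OPENAI_AUTH_MODE"];
  const mode = isSet(rawMode) ? rawMode : "auto";
  // chatgpt mode never injects a key, even if one is configured.
  if (mode === "chatgpt") return undefined;
  // auto mode prefers a configured key; key ordering here is an assumption.
  return [env["CODEX_API_KEY"], env["OPENAI_API_KEY"]].find(isSet);
}
```

Calling resolveClaudeAuth(process.env) in a Node process would apply the same precedence to the live environment.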
Quality Gate
npm run verify
Equivalent:
npm run check
npm run check:tests
npm run test
npm run build
Notes
- Recursive execution APIs on AgentManager are internal runtime plumbing; use SchemaDrivenExecutionEngine.runSession(...) as the public orchestration entrypoint.
MCP Migration Note
- Shared MCP server configs no longer accept the legacy http_headers alias.
- Use headers instead.
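For configs that still carry the old key, a one-off migration can rename it. The helper below is a hypothetical sketch that operates on a single server entry as a plain object; it does not assume anything about the overall layout of mcp.config.json, and it keeps an existing headers value if both keys are present.

```typescript
type ServerEntry = Record<string, unknown>;

// Renames the legacy `http_headers` key to `headers` on one server entry.
function migrateLegacyHeaders(server: ServerEntry): ServerEntry {
  if (!("http_headers" in server)) return server;
  const { http_headers: legacyHeaders, ...rest } = server;
  // Prefer an explicit `headers` value if the entry already has one.
  return { ...rest, headers: rest["headers"] ?? legacyHeaders };
}

const migrated = migrateLegacyHeaders({
  url: "https://example.invalid/mcp",
  http_headers: { Authorization: "Bearer <token>" },
});
```

The function returns a new object rather than mutating its input, so the original config can be kept for comparison before the migrated version is written back.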