AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

  • OpenAI Codex SDK (@openai/codex-sdk)
  • Anthropic Claude Agent SDK (@anthropic-ai/claude-agent-sdk)
  • Schema-validated orchestration (AgentManifest)
  • DAG execution with topology-aware fan-out (parallel, hierarchical, retry-unrolled)
  • Project-scoped persistent context store
  • Typed domain events for edge-triggered routing
  • Resource provisioning (git worktrees + deterministic port ranges)
  • MCP configuration layer with handler policy hooks
  • Security middleware for shell/tool policy enforcement
  • Runtime event fan-out (NDJSON analytics log + optional Discord webhook notifications)

Architecture Summary

  • SchemaDrivenExecutionEngine.runSession(...) is the single execution entrypoint.
  • PipelineExecutor owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
  • Runtime events are emitted as best-effort side-channel telemetry and do not affect orchestration control flow.
  • AgentManager is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
  • Session state is persisted under AGENT_STATE_ROOT.
  • Session lifecycle is explicit (POST /api/sessions, POST /api/sessions/:id/run, POST /api/sessions/:id/close) and each session is bound to a target project path.
  • Session project context is persisted as schema-versioned JSON (schemaVersion) with domains:
    • globalFlags
    • artifactPointers
    • taskQueue
      • each task record stores taskId, status, and optional worktreePath for task-scoped workspace ownership
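As a rough illustration of the persisted shape described above, the following TypeScript sketch models the three domains. Only schemaVersion, globalFlags, artifactPointers, taskQueue, taskId, status, and worktreePath come from this README; the value types, the status literals, and the tasksOwningWorktree helper are assumptions made for the example.

```typescript
// Hypothetical model of the persisted project context.
// Value types and status literals are assumptions, not the real schema.
type TaskStatus = "pending" | "in_progress" | "done" | "failed"; // assumed values

interface TaskRecord {
  taskId: string;
  status: TaskStatus;
  worktreePath?: string; // present when the task owns a task-scoped workspace
}

interface ProjectContext {
  schemaVersion: number; // assumed to be numeric
  globalFlags: Record<string, boolean>; // assumed flag value type
  artifactPointers: Record<string, string>; // assumed name -> location mapping
  taskQueue: TaskRecord[];
}

// Hypothetical helper: find the tasks that claim a given worktree.
function tasksOwningWorktree(ctx: ProjectContext, worktreePath: string): TaskRecord[] {
  return ctx.taskQueue.filter((task) => task.worktreePath === worktreePath);
}

const exampleContext: ProjectContext = {
  schemaVersion: 1,
  globalFlags: { requirementsApproved: true },
  artifactPointers: { plan: "docs/plan.md" },
  taskQueue: [
    { taskId: "task-1", status: "in_progress", worktreePath: "/tmp/worktrees/task-1" },
    { taskId: "task-2", status: "pending" },
  ],
};
```

Keeping worktreePath optional on the task record mirrors the README's description of task-scoped workspace ownership: a task without a worktree simply omits the field.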

Deep Dives

  • Session walkthrough with concrete artifacts from a successful provider run: docs/session-walkthrough.md
  • Orchestration engine internals: docs/orchestration-engine.md
  • Runtime event model and sinks: docs/runtime-events.md

Repository Layout

  • src/agents
    • orchestration.ts: engine facade and runtime wiring
    • pipeline.ts: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
    • failure-policy.ts: hard/soft failure classification policy
    • lifecycle-observer.ts: persistence/event lifecycle hooks for node attempts
    • manifest.ts: schema parsing/validation for personas/topologies/edges
    • manager.ts: recursive fan-out utility used by pipeline
    • state-context.ts: persisted node handoffs + session state
    • project-context.ts: project-scoped store
    • domain-events.ts: typed domain event schema + bus
    • runtime.ts: env-driven defaults/singletons
    • provisioning.ts: resource provisioning and child suballocation helpers
  • src/mcp: MCP config types/conversion/handlers
  • src/security: shell AST parsing, rules engine, secure executor, and audit sinks
  • src/telemetry: runtime event schema, sink fan-out, file sink, and Discord webhook sink
  • src/ui: local operator UI server, API routes, run-control service, and graph/event aggregation
  • src/examples: provider entrypoints (codex.ts, claude.ts)
  • src/config.ts: centralized env parsing/validation/defaulting
  • tests: manager, manifest, pipeline/orchestration, state, provisioning, MCP

Setup

npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json

Run

npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."

Or via unified entrypoint:

npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."

Operator UI

Start the local UI server:

npm run ui

Then open:

  • http://127.0.0.1:4317 (default)

The UI provides:

  • graph visualizer with topology/retry rendering, edge trigger labels, node economics (duration/cost/tokens), and critical-path highlighting
  • node inspector with attempt metadata and injected ResolvedExecutionContext sandbox payload
  • live runtime event feed from AGENT_RUNTIME_EVENT_LOG_PATH with severity coloring (including security mirror events)
  • run trigger + kill switch backed by SchemaDrivenExecutionEngine.runSession(...)
    • run mode selector: provider (real Codex/Claude execution) or mock (deterministic dry-run executor)
    • provider selector: codex or claude
  • run history from AGENT_STATE_ROOT
  • forms for runtime Discord webhook settings, security policy, and manager/resource limits
  • hover help on form labels with short intent guidance for each field
  • manifest editor/validator/saver for schema "1" manifests

Provider mode notes:

  • provider=codex uses existing OpenAI/Codex auth settings (OPENAI_AUTH_MODE, CODEX_API_KEY, OPENAI_API_KEY).
  • provider=claude uses Claude auth resolution (CLAUDE_CODE_OAUTH_TOKEN preferred, otherwise ANTHROPIC_API_KEY, or existing Claude Code login state).
  • CLAUDE_MODEL should be a Claude model id/alias recognized by Claude Code (for example claude-sonnet-4-6); anthropic/... prefixes are normalized automatically.

Manifest Semantics

AgentManifest (schema "1") validates:

  • supported topologies (sequential, parallel, hierarchical, retry-unrolled)
  • persona definitions, optional modelConstraint, and tool-clearance policy (validated by shared Zod schema)
  • the relationship DAG, rejecting unknown persona references
  • strict pipeline DAG
  • topology constraints (maxDepth, maxRetries)
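The DAG checks listed above can be sketched with Kahn's algorithm. This is an illustrative stand-in, not the validator in manifest.ts: the GraphEdge shape, the validateDag name, and the error strings are all hypothetical, and node ids are assumed to be unique.

```typescript
// Hypothetical DAG validation sketch (Kahn's algorithm).
// Not the real manifest validator; names and messages are assumptions.
interface GraphEdge {
  from: string;
  to: string;
}

function validateDag(nodeIds: string[], edges: GraphEdge[]): string[] {
  const errors: string[] = [];
  const indegree: Record<string, number> = Object.create(null);
  const outgoing: Record<string, string[]> = Object.create(null);
  for (const id of nodeIds) {
    indegree[id] = 0;
    outgoing[id] = [];
  }

  for (const edge of edges) {
    // Reject edges that point at nodes the manifest never declared.
    if (indegree[edge.from] === undefined || indegree[edge.to] === undefined) {
      errors.push(`edge ${edge.from} -> ${edge.to} references an unknown node`);
      continue;
    }
    outgoing[edge.from].push(edge.to);
    indegree[edge.to] += 1;
  }

  // Repeatedly remove nodes with no remaining incoming edges.
  const ready = nodeIds.filter((id) => indegree[id] === 0);
  let visited = 0;
  while (ready.length > 0) {
    const id = ready.pop() as string;
    visited += 1;
    for (const next of outgoing[id]) {
      indegree[next] -= 1;
      if (indegree[next] === 0) ready.push(next);
    }
  }

  // Any node never removed sits on (or behind) a cycle.
  if (visited < nodeIds.length) errors.push("graph contains a cycle");
  return errors;
}
```

Returning a list of errors rather than throwing on the first one lets a manifest editor report every problem in a single validation pass.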

Pipeline edges can route via:

  • legacy status triggers (on: success, validation_fail, failure, always)
  • domain event triggers (event: typed domain events)
  • conditions (state_flag, history_has_event, file_exists, always)
    • history_has_event evaluates persisted domain event history (for example validation_failed)
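To make the routing options above concrete, here is a minimal sketch of how an edge might be evaluated. The trigger and condition names (on, event, state_flag, history_has_event, file_exists, always) come from this README; the field names when, key, equals, and path, the RoutingInput shape, and the rule that all conditions must hold are assumptions for illustration.

```typescript
// Hypothetical edge evaluation sketch; field names other than `on` and
// `event` are assumptions, not the manifest schema.
type StatusTrigger = "success" | "validation_fail" | "failure" | "always";

type Condition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface PipelineEdge {
  from: string;
  to: string;
  on?: StatusTrigger; // legacy status trigger
  event?: string; // domain event trigger
  when?: Condition[]; // assumed: every condition must hold
}

interface RoutingInput {
  status: "success" | "validation_fail" | "failure";
  emittedEvents: string[]; // events emitted by the node that just ran
  flags: Record<string, boolean>;
  eventHistory: string[]; // persisted domain event history
  fileExists: (path: string) => boolean; // injected so the sketch stays pure
}

function conditionHolds(condition: Condition, input: RoutingInput): boolean {
  switch (condition.type) {
    case "always":
      return true;
    case "state_flag":
      return Boolean(input.flags[condition.key]) === condition.equals;
    case "history_has_event":
      return input.eventHistory.indexOf(condition.event) !== -1;
    case "file_exists":
      return input.fileExists(condition.path);
  }
}

function edgeFires(edge: PipelineEdge, input: RoutingInput): boolean {
  const triggered =
    edge.event !== undefined
      ? input.emittedEvents.indexOf(edge.event) !== -1
      : edge.on === "always" || edge.on === input.status;
  return triggered && (edge.when ?? []).every((c) => conditionHolds(c, input));
}
```

For example, an edge with on set to validation_fail and a history_has_event condition for validation_failed would fire only when the node's status matches and the persisted history already contains that event.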

Domain Events

Domain events are typed and can trigger edges directly:

  • planning: requirements_defined, tasks_planned
  • execution: code_committed, task_ready_for_review, task_blocked
  • validation: validation_passed, validation_failed
  • integration: branch_merged

Actors can emit events in ActorExecutionResult.events. Pipeline status also emits default validation/execution events.
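One way to express the typed event names listed above is a const table with derived union types. The event names and domains are taken from this README, and the domain.<type> runtime event naming comes from its runtime event list; the DOMAIN_EVENTS table and the helper functions are hypothetical and are not the schema in domain-events.ts.

```typescript
// Hypothetical typing sketch for the domain events listed in this README.
const DOMAIN_EVENTS = {
  planning: ["requirements_defined", "tasks_planned"],
  execution: ["code_committed", "task_ready_for_review", "task_blocked"],
  validation: ["validation_passed", "validation_failed"],
  integration: ["branch_merged"],
} as const;

type Domain = keyof typeof DOMAIN_EVENTS;
type DomainEventType = (typeof DOMAIN_EVENTS)[Domain][number];

// Look up which domain an event type belongs to.
function domainOf(type: DomainEventType): Domain {
  for (const domain of Object.keys(DOMAIN_EVENTS) as Domain[]) {
    const names: readonly string[] = DOMAIN_EVENTS[domain];
    if (names.indexOf(type) !== -1) return domain;
  }
  throw new Error(`unknown domain event: ${type}`);
}

// Type guard for untrusted input such as a manifest edge's event field.
function isDomainEventType(value: string): value is DomainEventType {
  return (Object.keys(DOMAIN_EVENTS) as Domain[]).some((domain) => {
    const names: readonly string[] = DOMAIN_EVENTS[domain];
    return names.indexOf(value) !== -1;
  });
}

// Runtime events mirror domain events under a "domain." prefix.
function runtimeEventTypeFor(type: DomainEventType): string {
  return `domain.${type}`;
}
```

Deriving the union from a single table keeps the list of event names and their domain grouping in one place, so adding an event updates both the type and the lookup.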

Retry Matrix and Cancellation

  • validation_fail: routed through retry-unrolled execution (new child manager session)
  • hard failures: timeouts, network errors, and 403-like failures are counted consecutively; after 2 consecutive hard failures the pipeline aborts immediately
  • AbortSignal is passed into every actor execution input
  • session closure aborts child recursive work
  • run summaries expose an aggregate status: success requires that every executed terminal DAG node succeeded and that no critical-path failure occurred
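The consecutive hard-failure rule above can be sketched as a small counter. The threshold of 2 comes from this README; the HardFailureTracker class, the outcome labels, and the assumption that any non-hard outcome resets the counter are illustrative and may differ from failure-policy.ts.

```typescript
// Hypothetical sketch of the "abort after 2 consecutive hard failures" rule.
type AttemptOutcome = "success" | "validation_fail" | "hard_failure";

class HardFailureTracker {
  private consecutive = 0;

  constructor(private readonly threshold: number = 2) {}

  // Record one attempt and report whether the pipeline should keep going.
  record(outcome: AttemptOutcome): "continue" | "abort" {
    // Assumption: any outcome other than a hard failure resets the streak.
    this.consecutive = outcome === "hard_failure" ? this.consecutive + 1 : 0;
    return this.consecutive >= this.threshold ? "abort" : "continue";
  }
}

const tracker = new HardFailureTracker();
const decisions = [
  tracker.record("hard_failure"), // streak 1
  tracker.record("success"), // streak resets to 0
  tracker.record("hard_failure"), // streak 1
  tracker.record("hard_failure"), // streak 2, threshold reached
];
```

In this sketch a single transient timeout never aborts the run on its own; only an unbroken streak does, which is one plausible reading of "consecutive".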

Runtime Events

  • The pipeline emits runtime lifecycle events (session.started, node.attempt.completed, domain.*, session.completed, session.failed).
  • Runtime events are fan-out only and never used for edge-routing decisions.
  • Default sink writes NDJSON to AGENT_RUNTIME_EVENT_LOG_PATH.
  • Optional Discord sink posts high-visibility lifecycle/error events through webhook configuration.
  • Existing security command audit output (AGENT_SECURITY_AUDIT_LOG_PATH) remains in place and is also mirrored into runtime events.

Runtime Event Fields

Each runtime event is written as one NDJSON object with:

  • id, timestamp, type, severity
  • sessionId, nodeId, attempt
  • message
  • optional usage (tokenInput, tokenOutput, tokenTotal, toolCalls, durationMs, costUsd)
  • optional structured metadata
    • node.attempt.completed metadata includes:
      • executionContext (resolved sandbox payload injected into executor)
      • topologyKind
      • retrySpawned
      • optional fromNodeId, subtasks, securityViolation
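The field list above maps naturally onto a TypeScript interface. The field names are from this README and the severity values are the ones accepted by AGENT_RUNTIME_DISCORD_MIN_SEVERITY; which fields are optional, the string timestamp, and the two helper functions are assumptions for the sketch.

```typescript
// Hypothetical typing of one NDJSON runtime event line.
type Severity = "info" | "warning" | "critical";

interface RuntimeEventUsage {
  tokenInput?: number;
  tokenOutput?: number;
  tokenTotal?: number;
  toolCalls?: number;
  durationMs?: number;
  costUsd?: number;
}

interface RuntimeEvent {
  id: string;
  timestamp: string; // assumed to be an ISO 8601 string
  type: string;
  severity: Severity;
  sessionId: string;
  nodeId?: string; // assumed optional for session-level events
  attempt?: number; // assumed optional for session-level events
  message: string;
  usage?: RuntimeEventUsage;
  metadata?: Record<string, unknown>;
}

// NDJSON: exactly one JSON object per line, newline-terminated.
function toNdjsonLine(event: RuntimeEvent): string {
  return JSON.stringify(event) + "\n";
}

// Parse a whole NDJSON document, skipping blank lines.
function parseNdjson(text: string): RuntimeEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RuntimeEvent);
}

const sampleEvent: RuntimeEvent = {
  id: "evt-1",
  timestamp: "2025-01-01T00:00:00.000Z",
  type: "node.attempt.completed",
  severity: "info",
  sessionId: "session-1",
  nodeId: "planner",
  attempt: 1,
  message: "attempt finished",
  usage: { tokenTotal: 1200, durationMs: 850 },
  metadata: { topologyKind: "sequential", retrySpawned: false },
};
```

Because JSON.stringify never emits a raw newline inside an object, each event stays on one line and the log remains safe to process with line-oriented tools.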

Runtime Event Setup

Add these variables in .env (or use defaults):

AGENT_RUNTIME_EVENT_LOG_PATH=.ai_ops/events/runtime-events.ndjson
AGENT_RUNTIME_DISCORD_WEBHOOK_URL=
AGENT_RUNTIME_DISCORD_MIN_SEVERITY=critical
AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES=session.started,session.completed,session.failed

Notes:

  • File sink is always enabled and appends NDJSON.
  • Discord sink is enabled only when AGENT_RUNTIME_DISCORD_WEBHOOK_URL is set.
  • Discord notifications are sent when event severity is at or above AGENT_RUNTIME_DISCORD_MIN_SEVERITY.
  • Event types in AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES are always sent, regardless of severity.
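The notification rules in the notes above reduce to a short decision function. The rules themselves (sink disabled without a webhook URL, severity threshold, always-notify list) are from this README; the DiscordPolicy shape, the function names, and the info < warning < critical ordering are assumptions for the sketch.

```typescript
// Hypothetical sketch of the Discord notification decision.
const SEVERITY_RANK = { info: 0, warning: 1, critical: 2 } as const; // assumed ordering
type Severity = keyof typeof SEVERITY_RANK;

interface DiscordPolicy {
  webhookUrl?: string;
  minSeverity: Severity;
  alwaysNotifyTypes: string[];
}

// Parse a CSV value such as "session.started,session.completed".
function parseAlwaysNotifyTypes(csv: string | undefined): string[] {
  return (csv ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
}

function shouldNotifyDiscord(
  event: { type: string; severity: Severity },
  policy: DiscordPolicy,
): boolean {
  // The sink is enabled only when a webhook URL is configured.
  if (!policy.webhookUrl) return false;
  // Always-notify types bypass the severity threshold.
  if (policy.alwaysNotifyTypes.indexOf(event.type) !== -1) return true;
  return SEVERITY_RANK[event.severity] >= SEVERITY_RANK[policy.minSeverity];
}
```

With the defaults shown in this section (minimum severity critical plus the three session lifecycle types), an info-level session.started event is sent while an info-level node.attempt.completed event is not.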

Event Types Emitted by Runtime

  • Session lifecycle:
    • session.started
    • session.completed
    • session.failed
  • Node/domain lifecycle:
    • node.attempt.completed
    • domain.<domain_event_type> (for example domain.validation_failed)
  • Security mirror events:
    • security.shell.command_profiled
    • security.shell.command_allowed
    • security.shell.command_blocked
    • security.tool.invocation_allowed
    • security.tool.invocation_blocked

Analytics Quick Start

Inspect latest events:

tail -n 50 .ai_ops/events/runtime-events.ndjson

Count events by type:

jq -r '.type' .ai_ops/events/runtime-events.ndjson | sort | uniq -c

Get only critical events:

jq -c 'select(.severity=="critical")' .ai_ops/events/runtime-events.ndjson

Security Middleware

  • Shell command parsing uses async sh-syntax (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
  • Rules are validated with strict Zod schemas (src/security/schemas.ts) before execution.
  • SecurityRulesEngine enforces:
    • binary allowlists
    • cwd/worktree boundary checks
    • path traversal blocking (../)
    • protected path blocking (state root + project context path)
    • unified tool allowlist/banlist checks for shell binaries and MCP tool lists
  • SecureCommandExecutor runs commands via child_process.spawn with:
    • explicit env scrub/inject policy (no implicit full env inheritance)
    • timeout enforcement
    • optional uid/gid drop
    • stdout/stderr streaming hooks for audit
  • Every actor execution input includes a pre-resolved executionContext (phase, modelConstraint, allowedTools, and immutable security constraints) that orchestration generates for each node attempt.
  • Every actor execution input includes security helpers (rulesEngine, createCommandExecutor(...)) so executors can enforce shell/tool policy at the execution boundary.
  • Every actor execution input includes mcp helpers (resolvedConfig, resolveConfig(...), filterToolsForProvider(...), createClaudeCanUseTool()) so provider adapters can filter tools against executionContext.allowedTools before SDK calls.
  • For Claude-based executors, pass input.mcp.filterToolsForProvider(...) and input.mcp.createClaudeCanUseTool() into the SDK call path so unauthorized tools are never exposed and runtime bypass attempts trigger security violations.
  • Pipeline behavior on SecurityViolationError is configurable:
    • hard_abort (default)
    • validation_fail (retry-unrolled remediation)
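The "no implicit full env inheritance" policy above can be illustrated with a pure function that builds a child environment from explicit lists. This is not SecureCommandExecutor's implementation: the EnvPolicy shape, the buildChildEnv name, and the precedence (scrub beats inherit, inject is applied last) are assumptions made for the example.

```typescript
// Hypothetical env scrub/inject sketch; precedence rules are assumptions.
type Env = Record<string, string | undefined>;

interface EnvPolicy {
  inherit: string[]; // names explicitly allowed through from the parent
  scrub: string[]; // names removed even if listed in inherit
  inject: Record<string, string>; // values set explicitly for the child
}

function buildChildEnv(parent: Env, policy: EnvPolicy): Record<string, string> {
  // Start from an empty object: nothing is inherited implicitly.
  const child: Record<string, string> = {};
  for (const name of policy.inherit) {
    const value = parent[name];
    const scrubbed = policy.scrub.indexOf(name) !== -1;
    if (value !== undefined && !scrubbed) child[name] = value;
  }
  // Assumption: injected values are deliberate and therefore applied last.
  for (const name of Object.keys(policy.inject)) {
    child[name] = policy.inject[name];
  }
  return child;
}

const childEnv = buildChildEnv(
  { PATH: "/usr/bin", HOME: "/home/agent", OPENAI_API_KEY: "secret" },
  {
    inherit: ["PATH", "OPENAI_API_KEY"],
    scrub: ["OPENAI_API_KEY"],
    inject: { AGENT_WORKTREE_PATH: "/tmp/worktrees/task-1" },
  },
);
```

A result built this way could be passed as the env option of child_process.spawn, so the child process sees only the listed names rather than the parent's full environment.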

Environment Variables

Provider/Auth

  • CODEX_API_KEY
  • OPENAI_API_KEY
  • OPENAI_AUTH_MODE (auto, chatgpt, or api_key)
  • OPENAI_BASE_URL
  • CODEX_SKIP_GIT_CHECK
  • CLAUDE_CODE_OAUTH_TOKEN (preferred for Claude auth; takes precedence over ANTHROPIC_API_KEY)
  • ANTHROPIC_API_KEY (used when CLAUDE_CODE_OAUTH_TOKEN is unset)
  • CLAUDE_MODEL
  • CLAUDE_CODE_PATH
  • MCP_CONFIG_PATH

Agent Manager Limits

  • AGENT_MAX_CONCURRENT
  • AGENT_MAX_SESSION
  • AGENT_MAX_RECURSIVE_DEPTH

Orchestration / Context

  • AGENT_STATE_ROOT
  • AGENT_PROJECT_CONTEXT_PATH
  • AGENT_TOPOLOGY_MAX_DEPTH
  • AGENT_TOPOLOGY_MAX_RETRIES
  • AGENT_RELATIONSHIP_MAX_CHILDREN

Provisioning / Resource Controls

  • AGENT_WORKTREE_ROOT
  • AGENT_WORKTREE_BASE_REF
  • AGENT_WORKTREE_TARGET_PATH (optional relative path; enables sparse checkout and sets session working directory to that subfolder)
  • AGENT_PORT_BASE
  • AGENT_PORT_BLOCK_SIZE
  • AGENT_PORT_BLOCK_COUNT
  • AGENT_PORT_PRIMARY_OFFSET
  • AGENT_PORT_LOCK_DIR
  • AGENT_DISCOVERY_FILE_RELATIVE_PATH

Security Middleware

  • AGENT_SECURITY_VIOLATION_MODE (hard_abort or validation_fail)
  • AGENT_SECURITY_ALLOWED_BINARIES
  • AGENT_SECURITY_COMMAND_TIMEOUT_MS
  • AGENT_SECURITY_AUDIT_LOG_PATH
  • AGENT_SECURITY_ENV_INHERIT
  • AGENT_SECURITY_ENV_SCRUB
  • AGENT_SECURITY_DROP_UID
  • AGENT_SECURITY_DROP_GID

Runtime Events / Telemetry

  • AGENT_RUNTIME_EVENT_LOG_PATH
  • AGENT_RUNTIME_DISCORD_WEBHOOK_URL
  • AGENT_RUNTIME_DISCORD_MIN_SEVERITY (info, warning, or critical)
  • AGENT_RUNTIME_DISCORD_ALWAYS_NOTIFY_TYPES (CSV event types such as session.started,session.completed,session.failed)

Operator UI

  • AGENT_UI_HOST (default 127.0.0.1)
  • AGENT_UI_PORT (default 4317)

Runtime-Injected (Do Not Configure In .env)

  • AGENT_REPO_ROOT
  • AGENT_WORKTREE_PATH
  • AGENT_WORKTREE_BASE_REF
  • AGENT_PORT_RANGE_START
  • AGENT_PORT_RANGE_END
  • AGENT_PORT_PRIMARY
  • AGENT_DISCOVERY_FILE
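To show how the configurable port variables could relate to the runtime-injected ones, the sketch below derives a range from a block index. The README does not describe the actual allocation or locking algorithm, so the arithmetic, the inclusive range end, the allocatePortBlock name, and the sample numbers are all assumptions.

```typescript
// Hypothetical port block arithmetic; the real provisioning logic may differ.
interface PortConfig {
  base: number; // AGENT_PORT_BASE
  blockSize: number; // AGENT_PORT_BLOCK_SIZE
  blockCount: number; // AGENT_PORT_BLOCK_COUNT
  primaryOffset: number; // AGENT_PORT_PRIMARY_OFFSET
}

interface PortAllocation {
  rangeStart: number; // would map to AGENT_PORT_RANGE_START
  rangeEnd: number; // would map to AGENT_PORT_RANGE_END (assumed inclusive)
  primary: number; // would map to AGENT_PORT_PRIMARY
}

function allocatePortBlock(config: PortConfig, blockIndex: number): PortAllocation {
  const isWholeNumber = Math.floor(blockIndex) === blockIndex;
  if (!isWholeNumber || blockIndex < 0 || blockIndex >= config.blockCount) {
    throw new Error(`block index ${blockIndex} is outside 0..${config.blockCount - 1}`);
  }
  if (config.primaryOffset < 0 || config.primaryOffset >= config.blockSize) {
    throw new Error("primary offset must fall inside the block");
  }
  const rangeStart = config.base + blockIndex * config.blockSize;
  return {
    rangeStart,
    rangeEnd: rangeStart + config.blockSize - 1,
    primary: rangeStart + config.primaryOffset,
  };
}

// Sample values are illustrative, not the project's defaults.
const allocation = allocatePortBlock(
  { base: 20000, blockSize: 10, blockCount: 50, primaryOffset: 0 },
  3,
);
```

Because the range depends only on the configuration and the block index, two sessions holding different indices can never overlap, which is the property "deterministic port ranges" suggests.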

Defaults are documented in .env.example.

Auth behavior notes:

  • OpenAI/Codex:
    • OPENAI_AUTH_MODE=auto (default) prefers API keys when configured, and otherwise relies on existing Codex CLI login (codex login / ChatGPT plan auth).
    • OPENAI_AUTH_MODE=chatgpt always omits API key injection so Codex uses ChatGPT subscription auth/session.
  • Claude:
    • If CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY are both unset, runtime auth options are omitted and Claude Agent SDK can use existing Claude Code login state.
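The auth precedence described above can be sketched as two pure resolver functions over an environment map. The Claude precedence and the auto/chatgpt behavior follow this README; the result types, the function names, the choice to check CODEX_API_KEY before OPENAI_API_KEY, and the error thrown for api_key mode without a key are assumptions.

```typescript
// Hypothetical auth resolution sketch; not the logic in src/config.ts.
type Env = Record<string, string | undefined>;

type ClaudeAuth =
  | { kind: "oauth_token"; token: string }
  | { kind: "api_key"; apiKey: string }
  | { kind: "existing_login" };

function resolveClaudeAuth(env: Env): ClaudeAuth {
  const token = env["CLAUDE_CODE_OAUTH_TOKEN"];
  if (token) return { kind: "oauth_token", token };
  const apiKey = env["ANTHROPIC_API_KEY"];
  if (apiKey) return { kind: "api_key", apiKey };
  // Neither is set: omit auth options and rely on Claude Code login state.
  return { kind: "existing_login" };
}

type OpenAiAuth = { kind: "api_key"; apiKey: string } | { kind: "existing_login" };

function resolveOpenAiAuth(env: Env): OpenAiAuth {
  const mode = env["OPENAI_AUTH_MODE"] ?? "auto";
  // chatgpt mode never injects an API key.
  if (mode === "chatgpt") return { kind: "existing_login" };
  // Assumption: CODEX_API_KEY is checked before OPENAI_API_KEY.
  const apiKey = env["CODEX_API_KEY"] || env["OPENAI_API_KEY"];
  if (apiKey) return { kind: "api_key", apiKey };
  // Assumption: api_key mode without any key is treated as a config error.
  if (mode === "api_key") {
    throw new Error("OPENAI_AUTH_MODE=api_key but no API key is configured");
  }
  return { kind: "existing_login" };
}
```

Taking the environment as a parameter instead of reading process.env directly keeps the resolvers easy to test with different combinations of variables.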

Quality Gate

npm run verify

Equivalent:

npm run check
npm run check:tests
npm run test
npm run build

Notes

  • Recursive execution APIs on AgentManager are internal runtime plumbing; use SchemaDrivenExecutionEngine.runSession(...) as the public orchestration entrypoint.

MCP Migration Note

  • Shared MCP server configs no longer accept the legacy http_headers alias.
  • Use headers instead.