AI Ops: Schema-Driven Multi-Agent Orchestration Runtime

TypeScript runtime for deterministic multi-agent execution with:

  • OpenAI Codex SDK (@openai/codex-sdk)
  • Anthropic Claude Agent SDK (@anthropic-ai/claude-agent-sdk)
  • Schema-validated orchestration (AgentManifest)
  • DAG execution with topology-aware fan-out (parallel, hierarchical, retry-unrolled)
  • Project-scoped persistent context store
  • Typed domain events for edge-triggered routing
  • Resource provisioning (git worktrees + deterministic port ranges)
  • MCP configuration layer with handler policy hooks
  • Security middleware for shell/tool policy enforcement

Architecture Summary

  • SchemaDrivenExecutionEngine.runSession(...) is the single execution entrypoint.
  • PipelineExecutor owns runtime control flow and topology dispatch while delegating failure classification and persistence/event side-effects to dedicated policies.
  • AgentManager is an internal utility used by the pipeline when fan-out/retry-unrolled behavior is required.
  • Session state is persisted under AGENT_STATE_ROOT.
  • Project state is persisted under AGENT_PROJECT_CONTEXT_PATH with schema-versioned JSON (schemaVersion) and domains:
    • globalFlags
    • artifactPointers
    • taskQueue
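
As a rough illustration of the project state layout above, the file could be modeled and loaded as shown here. Only schemaVersion, globalFlags, artifactPointers, and taskQueue come from this README; the value types, the ProjectTask shape, the version number, and the parseProjectContext helper are assumptions made for the sketch.

```typescript
// Hypothetical task shape; the README names the taskQueue domain but not its fields.
interface ProjectTask {
  id: string;
  status: "pending" | "in_progress" | "done";
}

// Only the four top-level keys are taken from the README; value types are assumed.
interface ProjectContext {
  schemaVersion: number;
  globalFlags: Record<string, boolean>;
  artifactPointers: Record<string, string>;
  taskQueue: ProjectTask[];
}

// Assumed version number, for illustration only.
const SUPPORTED_SCHEMA_VERSION = 1;

// Parse the persisted JSON, reject unknown schema versions, default missing domains.
function parseProjectContext(raw: string): ProjectContext {
  const data = JSON.parse(raw) as Partial<ProjectContext>;
  if (data.schemaVersion !== SUPPORTED_SCHEMA_VERSION) {
    throw new Error(`Unsupported schemaVersion: ${String(data.schemaVersion)}`);
  }
  return {
    schemaVersion: SUPPORTED_SCHEMA_VERSION,
    globalFlags: data.globalFlags ?? {},
    artifactPointers: data.artifactPointers ?? {},
    taskQueue: data.taskQueue ?? [],
  };
}
```

Rejecting an unexpected schemaVersion up front keeps a stale or newer file from being silently misread.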

Repository Layout

  • src/agents
    • orchestration.ts: engine facade and runtime wiring
    • pipeline.ts: DAG runner, retry matrix, aggregate session status, abort propagation, domain-event routing
    • failure-policy.ts: hard/soft failure classification policy
    • lifecycle-observer.ts: persistence/event lifecycle hooks for node attempts
    • manifest.ts: schema parsing/validation for personas/topologies/edges
    • manager.ts: recursive fan-out utility used by pipeline
    • state-context.ts: persisted node handoffs + session state
    • project-context.ts: project-scoped store
    • domain-events.ts: typed domain event schema + bus
    • runtime.ts: env-driven defaults/singletons
    • provisioning.ts: resource provisioning and child suballocation helpers
  • src/mcp: MCP config types/conversion/handlers
  • src/security: shell AST parsing, rules engine, secure executor, and audit sinks
  • src/examples: provider entrypoints (codex.ts, claude.ts)
  • src/config.ts: centralized env parsing/validation/defaulting
  • tests: manager, manifest, pipeline/orchestration, state, provisioning, MCP

Setup

npm install
cp .env.example .env
cp mcp.config.example.json mcp.config.json

Run

npm run codex -- "Summarize this repository."
npm run claude -- "Summarize this repository."

Or via unified entrypoint:

npm run dev -- codex "List potential improvements."
npm run dev -- claude "List potential improvements."

Manifest Semantics

AgentManifest (schema "1") validates:

  • supported topologies (sequential, parallel, hierarchical, retry-unrolled)
  • persona definitions and tool-clearance policy (validated by a shared Zod schema)
  • relationship DAG and unknown persona references
  • strict pipeline DAG
  • topology constraints (maxDepth, maxRetries)

Pipeline edges can route via:

  • legacy status triggers (on: success, validation_fail, failure, always)
  • domain event triggers (event: typed domain events)
  • conditions (state_flag, history_has_event, file_exists, always)
    • history_has_event evaluates persisted domain event history (for example validation_failed)
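
The routing rules above can be sketched as a small predicate. The trigger and condition names come from this README; the field names (from, to, when, key, equals, path), the RoutingContext shape, and the rule that an edge with neither on nor event always triggers are assumptions, not the actual manifest schema.

```typescript
type StatusTrigger = "success" | "validation_fail" | "failure" | "always";

// Condition kinds are from the README; their payload fields are assumed.
type EdgeCondition =
  | { type: "always" }
  | { type: "state_flag"; key: string; equals: boolean }
  | { type: "history_has_event"; event: string }
  | { type: "file_exists"; path: string };

interface PipelineEdge {
  from: string;
  to: string;
  on?: StatusTrigger;      // legacy status trigger
  event?: string;          // domain event trigger
  when?: EdgeCondition[];  // all conditions must hold
}

// Hypothetical snapshot of what the router knows after a node finishes.
interface RoutingContext {
  status: Exclude<StatusTrigger, "always">;
  emittedEvents: string[];
  flags: Record<string, boolean>;
  eventHistory: string[];
  fileExists: (path: string) => boolean;
}

function conditionHolds(cond: EdgeCondition, ctx: RoutingContext): boolean {
  switch (cond.type) {
    case "always":
      return true;
    case "state_flag":
      return (ctx.flags[cond.key] ?? false) === cond.equals;
    case "history_has_event":
      return ctx.eventHistory.includes(cond.event);
    case "file_exists":
      return ctx.fileExists(cond.path);
  }
}

// An edge fires when its trigger matches and every condition holds.
function edgeFires(edge: PipelineEdge, ctx: RoutingContext): boolean {
  const triggered =
    edge.event !== undefined
      ? ctx.emittedEvents.includes(edge.event)
      : edge.on === undefined || edge.on === "always" || edge.on === ctx.status;
  return triggered && (edge.when ?? []).every((c) => conditionHolds(c, ctx));
}
```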

Domain Events

Domain events are typed and can trigger edges directly:

  • planning: requirements_defined, tasks_planned
  • execution: code_committed, task_blocked
  • validation: validation_passed, validation_failed
  • integration: branch_merged

Actors can emit events in ActorExecutionResult.events. Pipeline status also emits default validation/execution events.
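
A minimal sketch of how the event catalogue above could be typed and dispatched. The event names and their grouping are taken from this README; the DomainEvent fields, the DomainEventBus class, and its methods are hypothetical stand-ins rather than the API in domain-events.ts.

```typescript
// Event names grouped by domain, as listed in the README.
const DOMAIN_EVENTS = {
  planning: ["requirements_defined", "tasks_planned"],
  execution: ["code_committed", "task_blocked"],
  validation: ["validation_passed", "validation_failed"],
  integration: ["branch_merged"],
} as const;

type DomainEventDomain = keyof typeof DOMAIN_EVENTS;
type DomainEventType = (typeof DOMAIN_EVENTS)[DomainEventDomain][number];

// Assumed event envelope; only the type names are documented.
interface DomainEvent {
  type: DomainEventType;
  sourceNodeId: string;
  payload?: Record<string, unknown>;
}

type Listener = (event: DomainEvent) => void;

// Hypothetical in-memory bus that also keeps a history for later lookups.
class DomainEventBus {
  private readonly listeners = new Map<DomainEventType, Listener[]>();
  private readonly history: DomainEvent[] = [];

  on(type: DomainEventType, listener: Listener): void {
    const existing = this.listeners.get(type) ?? [];
    this.listeners.set(type, [...existing, listener]);
  }

  publish(event: DomainEvent): void {
    this.history.push(event);
    for (const listener of this.listeners.get(event.type) ?? []) {
      listener(event);
    }
  }

  hasSeen(type: DomainEventType): boolean {
    return this.history.some((e) => e.type === type);
  }
}
```

Deriving DomainEventType from one constant means a misspelled event name fails at compile time instead of silently never matching an edge.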

Retry Matrix and Cancellation

  • validation_fail: routed through retry-unrolled execution (new child manager session)
  • hard failures: timeout, network, and 403-like failures are tracked in sequence; after two consecutive hard failures the pipeline aborts immediately
  • AbortSignal is passed into every actor execution input
  • session closure aborts child recursive work
  • run summaries expose aggregate status: success requires that the terminal DAG nodes that executed succeeded and that no failure occurred on the critical path
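
The hard-failure rule above can be illustrated with a small counter wired to an AbortController. This is a sketch under assumptions: the error codes treated as hard, the rule that any non-hard outcome resets the streak, and the HardFailureTracker and classifyFailure names are illustrative and not taken from failure-policy.ts.

```typescript
type FailureKind = "hard" | "soft";

// Assumed heuristic: 403 responses and common timeout/network codes count as hard.
function classifyFailure(error: { code?: string; status?: number }): FailureKind {
  const hardCodes = new Set(["ETIMEDOUT", "ECONNRESET", "ENOTFOUND"]);
  if (error.status === 403) return "hard";
  if (error.code !== undefined && hardCodes.has(error.code)) return "hard";
  return "soft";
}

// Hypothetical tracker: aborts once `threshold` hard failures occur back to back.
class HardFailureTracker {
  private consecutive = 0;
  private readonly threshold: number;
  private readonly controller = new AbortController();

  constructor(threshold = 2) {
    this.threshold = threshold;
  }

  // Signal that actors can observe to stop in-flight work.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  record(outcome: "success" | FailureKind): void {
    // Assumption: a success or soft failure resets the streak.
    this.consecutive = outcome === "hard" ? this.consecutive + 1 : 0;
    if (this.consecutive >= this.threshold) {
      this.controller.abort();
    }
  }
}
```

Passing tracker.signal into each actor lets one abort call stop sibling and child work.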

Security Middleware

  • Shell command parsing uses async sh-syntax (WASM-backed mvdan/sh parser) with fail-closed command/redirect extraction.
  • Rules are validated with strict Zod schemas (src/security/schemas.ts) before execution.
  • SecurityRulesEngine enforces:
    • binary allowlists
    • cwd/worktree boundary checks
    • path traversal blocking (../)
    • protected path blocking (state root + project context path)
    • unified tool allowlist/banlist checks for shell binaries and MCP tool lists
  • SecureCommandExecutor runs commands via child_process.spawn with:
    • explicit env scrub/inject policy (no implicit full env inheritance)
    • timeout enforcement
    • optional uid/gid drop
    • stdout/stderr streaming hooks for audit
  • Every actor execution input now includes security helpers (rulesEngine, createCommandExecutor(...)) so executors can enforce shell/tool policy at the execution boundary.
  • Pipeline behavior on SecurityViolationError is configurable:
    • hard_abort (default)
    • validation_fail (retry-unrolled remediation)
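
To make the rule categories above concrete, here is a simplified check over an already-parsed command. It is a sketch only: the real engine parses shell input with sh-syntax and validates rules with Zod, while this example assumes POSIX absolute paths, treats every non-flag argument as a path, and uses hypothetical names (ParsedCommand, RulesConfig, checkCommand).

```typescript
interface ParsedCommand {
  binary: string;
  args: string[];
}

interface RulesConfig {
  allowedBinaries: string[];
  worktreeRoot: string;     // absolute POSIX path
  protectedPaths: string[]; // absolute POSIX paths, e.g. state root
}

type CheckResult = { ok: true } | { ok: false; reason: string };

// Resolve a path against cwd and collapse "." and ".." segments.
function normalize(path: string, cwd: string): string {
  const absolute = path.startsWith("/") ? path : `${cwd}/${path}`;
  const out: string[] = [];
  for (const seg of absolute.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return "/" + out.join("/");
}

function isInside(child: string, parent: string): boolean {
  const base = parent.endsWith("/") ? parent : `${parent}/`;
  return child === parent || child.startsWith(base);
}

// Returns a denial on the first violated rule.
function checkCommand(cmd: ParsedCommand, cwd: string, rules: RulesConfig): CheckResult {
  if (!rules.allowedBinaries.includes(cmd.binary)) {
    return { ok: false, reason: `binary not allowed: ${cmd.binary}` };
  }
  if (!isInside(normalize(cwd, "/"), rules.worktreeRoot)) {
    return { ok: false, reason: `cwd outside worktree: ${cwd}` };
  }
  for (const arg of cmd.args) {
    if (arg.startsWith("-")) continue; // flags are not treated as paths in this sketch
    if (arg.split("/").includes("..")) {
      return { ok: false, reason: `path traversal: ${arg}` };
    }
    const resolved = normalize(arg, cwd);
    if (rules.protectedPaths.some((p) => isInside(resolved, p))) {
      return { ok: false, reason: `protected path: ${arg}` };
    }
  }
  return { ok: true };
}
```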

Environment Variables

Provider/Auth

  • CODEX_API_KEY
  • OPENAI_API_KEY
  • OPENAI_BASE_URL
  • CODEX_SKIP_GIT_CHECK
  • CLAUDE_CODE_OAUTH_TOKEN (preferred for Claude auth)
  • ANTHROPIC_API_KEY
  • CLAUDE_MODEL
  • CLAUDE_CODE_PATH
  • MCP_CONFIG_PATH

Agent Manager Limits

  • AGENT_MAX_CONCURRENT
  • AGENT_MAX_SESSION
  • AGENT_MAX_RECURSIVE_DEPTH

Orchestration / Context

  • AGENT_STATE_ROOT
  • AGENT_PROJECT_CONTEXT_PATH
  • AGENT_TOPOLOGY_MAX_DEPTH
  • AGENT_TOPOLOGY_MAX_RETRIES
  • AGENT_RELATIONSHIP_MAX_CHILDREN

Provisioning / Resource Controls

  • AGENT_WORKTREE_ROOT
  • AGENT_WORKTREE_BASE_REF
  • AGENT_PORT_BASE
  • AGENT_PORT_BLOCK_SIZE
  • AGENT_PORT_BLOCK_COUNT
  • AGENT_PORT_PRIMARY_OFFSET
  • AGENT_PORT_LOCK_DIR
  • AGENT_DISCOVERY_FILE_RELATIVE_PATH
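
The AGENT_PORT_* variables above suggest a block-based port layout. The arithmetic below is one plausible reading, not the documented algorithm: it assumes blocks are contiguous from the base port and that the primary port sits at a fixed offset inside each block. The PortConfig, PortBlock, and portBlock names are hypothetical, and the sketch does not model the lock directory.

```typescript
// Mirrors AGENT_PORT_BASE, AGENT_PORT_BLOCK_SIZE, AGENT_PORT_BLOCK_COUNT,
// and AGENT_PORT_PRIMARY_OFFSET; the semantics are assumed.
interface PortConfig {
  base: number;
  blockSize: number;
  blockCount: number;
  primaryOffset: number;
}

interface PortBlock {
  index: number;
  start: number;   // first port in the block
  end: number;     // last port in the block (inclusive)
  primary: number; // port assumed to host the main service
}

// Compute the port range for a given block index.
function portBlock(cfg: PortConfig, index: number): PortBlock {
  if (!Number.isInteger(index) || index < 0 || index >= cfg.blockCount) {
    throw new RangeError(`block index out of range: ${index}`);
  }
  if (cfg.primaryOffset < 0 || cfg.primaryOffset >= cfg.blockSize) {
    throw new RangeError(`primary offset outside block: ${cfg.primaryOffset}`);
  }
  const start = cfg.base + index * cfg.blockSize;
  const end = start + cfg.blockSize - 1;
  if (end > 65535) {
    throw new RangeError(`block exceeds valid port range: ${end}`);
  }
  return { index, start, end, primary: start + cfg.primaryOffset };
}
```

Because each range depends only on the configuration and the block index, the same inputs always yield the same ports, which is the sense in which the allocation is deterministic.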

Security Middleware

  • AGENT_SECURITY_VIOLATION_MODE
  • AGENT_SECURITY_ALLOWED_BINARIES
  • AGENT_SECURITY_COMMAND_TIMEOUT_MS
  • AGENT_SECURITY_AUDIT_LOG_PATH
  • AGENT_SECURITY_ENV_INHERIT
  • AGENT_SECURITY_ENV_SCRUB
  • AGENT_SECURITY_DROP_UID
  • AGENT_SECURITY_DROP_GID

Defaults are documented in .env.example.

Quality Gate

npm run verify

Equivalent:

npm run check
npm run check:tests
npm run test
npm run build

Notes

  • Recursive execution APIs on AgentManager are internal runtime plumbing; use SchemaDrivenExecutionEngine.runSession(...) as the public orchestration entrypoint.

MCP Migration Note

  • Shared MCP server configs no longer accept the legacy http_headers alias.
  • Use headers instead.
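
For existing configs, the rename can be applied mechanically. The helper below is a hypothetical one-off migration sketch, not part of src/mcp; it assumes each server entry is a plain object and refuses to guess when both keys are present.

```typescript
type ServerConfig = Record<string, unknown>;

// Move the legacy http_headers value to headers; leave other entries untouched.
function migrateHeaders(server: ServerConfig): ServerConfig {
  if (!("http_headers" in server)) return server;
  const { http_headers: legacy, ...rest } = server;
  if ("headers" in rest) {
    throw new Error("both headers and http_headers are set; resolve manually");
  }
  return { ...rest, headers: legacy };
}

// Example entry with placeholder values.
const migrated = migrateHeaders({
  url: "https://example.test/mcp",
  http_headers: { Authorization: "Bearer <token>" },
});
```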