Include full contents of all nested repositories

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-02-27 16:25:02 +01:00
parent 14ff8fd54c
commit 2401ed446f
7271 changed files with 1310112 additions and 6 deletions

View File

@@ -0,0 +1,800 @@
---
summary: "Integrate ACP coding agents via a first-class ACP control plane in core and plugin-backed runtimes (acpx first)"
owner: "onutc"
status: "draft"
last_updated: "2026-02-25"
title: "ACP Thread Bound Agents"
---
# ACP Thread Bound Agents
## Overview
This plan defines how OpenClaw should support ACP coding agents in thread-capable channels (Discord first) with production-level lifecycle and recovery.
Related document:
- [Unified Runtime Streaming Refactor Plan](/experiments/plans/acp-unified-streaming-refactor)
Target user experience:
- a user spawns or focuses an ACP session into a thread
- user messages in that thread route to the bound ACP session
- agent output streams back to the same thread persona
- session can be persistent or one shot with explicit cleanup controls
## Decision summary
Long term recommendation is a hybrid architecture:
- OpenClaw core owns ACP control plane concerns
- session identity and metadata
- thread binding and routing decisions
- delivery invariants and duplicate suppression
- lifecycle cleanup and recovery semantics
- ACP runtime backend is pluggable
- first backend is an acpx-backed plugin service
- runtime does ACP transport, queueing, cancel, reconnect
OpenClaw should not reimplement ACP transport internals in core.
OpenClaw should not rely on a pure plugin-only interception path for routing.
## North-star architecture (holy grail)
Treat ACP as a first-class control plane in OpenClaw, with pluggable runtime adapters.
Non-negotiable invariants:
- every ACP thread binding references a valid ACP session record
- every ACP session has explicit lifecycle state (`creating`, `idle`, `running`, `cancelling`, `closed`, `error`)
- every ACP run has explicit run state (`queued`, `running`, `completed`, `failed`, `cancelled`)
- spawn, bind, and initial enqueue are atomic
- command retries are idempotent (no duplicate runs or duplicate Discord outputs)
- bound-thread channel output is a projection of ACP run events, never ad-hoc side effects
Long-term ownership model:
- `AcpSessionManager` is the single ACP writer and orchestrator
- manager lives in gateway process first; can be moved to a dedicated sidecar later behind the same interface
- per ACP session key, manager owns one in-memory actor (serialized command execution)
- adapters (`acpx`, future backends) are transport/runtime implementations only
Long-term persistence model:
- move ACP control-plane state to a dedicated SQLite store (WAL mode) under OpenClaw state dir
- keep `SessionEntry.acp` as compatibility projection during migration, not source-of-truth
- store ACP events append-only to support replay, crash recovery, and deterministic delivery
### Delivery strategy (bridge to holy-grail)
- short-term bridge
- keep current thread binding mechanics and existing ACP config surface
- fix metadata-gap bugs and route ACP turns through a single core ACP branch
- add idempotency keys and fail-closed routing checks immediately
- long-term cutover
- move ACP source-of-truth to control-plane DB + actors
- make bound-thread delivery purely event-projection based
- remove legacy fallback behavior that depends on opportunistic session-entry metadata
## Why not pure plugin only
Current plugin hooks are not sufficient for end to end ACP session routing without core changes.
- inbound routing from thread binding resolves to a session key in core dispatch first
- message hooks are fire-and-forget and cannot short-circuit the main reply path
- plugin commands are good for control operations but not for replacing core per-turn dispatch flow
Result:
- ACP runtime can be pluginized
- ACP routing branch must exist in core
## Existing foundation to reuse
Already implemented and should remain canonical:
- thread binding target supports `subagent` and `acp`
- inbound thread routing override resolves by binding before normal dispatch
- outbound thread identity via webhook in reply delivery
- `/focus` and `/unfocus` flow with ACP target compatibility
- persistent binding store with restore on startup
- unbind lifecycle on archive, delete, unfocus, reset, and delete
This plan extends that foundation rather than replacing it.
## Architecture
### Boundary model
Core (must be in OpenClaw core):
- ACP session-mode dispatch branch in the reply pipeline
- delivery arbitration to avoid parent plus thread duplication
- ACP control-plane persistence (with `SessionEntry.acp` compatibility projection during migration)
- lifecycle unbind and runtime detach semantics tied to session reset/delete
Plugin backend (acpx implementation):
- ACP runtime worker supervision
- acpx process invocation and event parsing
- ACP command handlers (`/acp ...`) and operator UX
- backend-specific config defaults and diagnostics
### Runtime ownership model
- one gateway process owns ACP orchestration state
- ACP execution runs in supervised child processes via acpx backend
- process strategy is long lived per active ACP session key, not per message
This avoids startup cost on every prompt and keeps cancel and reconnect semantics reliable.
### Core runtime contract
Add a core ACP runtime contract so routing code does not depend on CLI details and can switch backends without changing dispatch logic:
```ts
export type AcpRuntimePromptMode = "prompt" | "steer";
export type AcpRuntimeHandle = {
sessionKey: string;
backend: string;
runtimeSessionName: string;
};
export type AcpRuntimeEvent =
| { type: "text_delta"; stream: "output" | "thought"; text: string }
| { type: "tool_call"; name: string; argumentsText: string }
| { type: "done"; usage?: Record<string, number> }
| { type: "error"; code: string; message: string; retryable?: boolean };
export interface AcpRuntime {
ensureSession(input: {
sessionKey: string;
agent: string;
mode: "persistent" | "oneshot";
cwd?: string;
env?: Record<string, string>;
idempotencyKey: string;
}): Promise<AcpRuntimeHandle>;
submit(input: {
handle: AcpRuntimeHandle;
text: string;
mode: AcpRuntimePromptMode;
idempotencyKey: string;
}): Promise<{ runtimeRunId: string }>;
stream(input: {
handle: AcpRuntimeHandle;
runtimeRunId: string;
onEvent: (event: AcpRuntimeEvent) => Promise<void> | void;
signal?: AbortSignal;
}): Promise<void>;
cancel(input: {
handle: AcpRuntimeHandle;
runtimeRunId?: string;
reason?: string;
idempotencyKey: string;
}): Promise<void>;
close(input: { handle: AcpRuntimeHandle; reason: string; idempotencyKey: string }): Promise<void>;
health?(): Promise<{ ok: boolean; details?: string }>;
}
```
Implementation detail:
- first backend: `AcpxRuntime` shipped as a plugin service
- core resolves runtime via registry and fails with explicit operator error when no ACP runtime backend is available
### Control-plane data model and persistence
Long-term source-of-truth is a dedicated ACP SQLite database (WAL mode), for transactional updates and crash-safe recovery:
- `acp_sessions`
- `session_key` (pk), `backend`, `agent`, `mode`, `cwd`, `state`, `created_at`, `updated_at`, `last_error`
- `acp_runs`
- `run_id` (pk), `session_key` (fk), `state`, `requester_message_id`, `idempotency_key`, `started_at`, `ended_at`, `error_code`, `error_message`
- `acp_bindings`
- `binding_key` (pk), `thread_id`, `channel_id`, `account_id`, `session_key` (fk), `expires_at`, `bound_at`
- `acp_events`
- `event_id` (pk), `run_id` (fk), `seq`, `kind`, `payload_json`, `created_at`
- `acp_delivery_checkpoint`
- `run_id` (pk/fk), `last_event_seq`, `last_discord_message_id`, `updated_at`
- `acp_idempotency`
- `scope`, `idempotency_key`, `result_json`, `created_at`, unique `(scope, idempotency_key)`
```ts
export type AcpSessionMeta = {
backend: string;
agent: string;
runtimeSessionName: string;
mode: "persistent" | "oneshot";
cwd?: string;
state: "idle" | "running" | "error";
lastActivityAt: number;
lastError?: string;
};
```
Storage rules:
- keep `SessionEntry.acp` as a compatibility projection during migration
- process ids and sockets stay in memory only
- durable lifecycle and run status live in ACP DB, not generic session JSON
- if runtime owner dies, gateway rehydrates from ACP DB and resumes from checkpoints
### Routing and delivery
Inbound:
- keep current thread binding lookup as first routing step
- if bound target is ACP session, route to ACP runtime branch instead of `getReplyFromConfig`
- explicit `/acp steer` command uses `mode: "steer"`
Outbound:
- ACP event stream is normalized to OpenClaw reply chunks
- delivery target is resolved through existing bound destination path
- when a bound thread is active for that session turn, parent channel completion is suppressed
Streaming policy:
- stream partial output with coalescing window
- configurable min interval and max chunk bytes to stay under Discord rate limits
- final message always emitted on completion or failure
### State machines and transaction boundaries
Session state machine:
- `creating -> idle -> running -> idle`
- `running -> cancelling -> idle | error`
- `idle -> closed`
- `error -> idle | closed`
Run state machine:
- `queued -> running -> completed`
- `running -> failed | cancelled`
- `queued -> cancelled`
Required transaction boundaries:
- spawn transaction
- create ACP session row
- create/update ACP thread binding row
- enqueue initial run row
- close transaction
- mark session closed
- delete/expire binding rows
- write final close event
- cancel transaction
- mark target run cancelling/cancelled with idempotency key
No partial success is allowed across these boundaries.
### Per-session actor model
`AcpSessionManager` runs one actor per ACP session key:
- actor mailbox serializes `submit`, `cancel`, `close`, and `stream` side effects
- actor owns runtime handle hydration and runtime adapter process lifecycle for that session
- actor writes run events in-order (`seq`) before any Discord delivery
- actor updates delivery checkpoints after successful outbound send
This removes cross-turn races and prevents duplicate or out-of-order thread output.
### Idempotency and delivery projection
All external ACP actions must carry idempotency keys:
- spawn idempotency key
- prompt/steer idempotency key
- cancel idempotency key
- close idempotency key
Delivery rules:
- Discord messages are derived from `acp_events` plus `acp_delivery_checkpoint`
- retries resume from checkpoint without re-sending already delivered chunks
- final reply emission is exactly-once per run from projection logic
### Recovery and self-healing
On gateway start:
- load non-terminal ACP sessions (`creating`, `idle`, `running`, `cancelling`, `error`)
- recreate actors lazily on first inbound event or eagerly under configured cap
- reconcile any `running` runs missing heartbeats and mark `failed` or recover via adapter
On inbound Discord thread message:
- if binding exists but ACP session is missing, fail closed with explicit stale-binding message
- optionally auto-unbind stale binding after operator-safe validation
- never silently route stale ACP bindings to normal LLM path
### Lifecycle and safety
Supported operations:
- cancel current run: `/acp cancel`
- unbind thread: `/unfocus`
- close ACP session: `/acp close`
- auto close idle sessions by effective TTL
TTL policy:
- effective TTL is minimum of
- global/session TTL
- Discord thread binding TTL
- ACP runtime owner TTL
Safety controls:
- allowlist ACP agents by name
- restrict workspace roots for ACP sessions
- env allowlist passthrough
- max concurrent ACP sessions per account and globally
- bounded restart backoff for runtime crashes
## Config surface
Core keys:
- `acp.enabled`
- `acp.dispatch.enabled` (independent ACP routing kill switch)
- `acp.backend` (default `acpx`)
- `acp.defaultAgent`
- `acp.allowedAgents[]`
- `acp.maxConcurrentSessions`
- `acp.stream.coalesceIdleMs`
- `acp.stream.maxChunkChars`
- `acp.runtime.ttlMinutes`
- `acp.controlPlane.store` (`sqlite` default)
- `acp.controlPlane.storePath`
- `acp.controlPlane.recovery.eagerActors`
- `acp.controlPlane.recovery.reconcileRunningAfterMs`
- `acp.controlPlane.checkpoint.flushEveryEvents`
- `acp.controlPlane.checkpoint.flushEveryMs`
- `acp.idempotency.ttlHours`
- `channels.discord.threadBindings.spawnAcpSessions`
Plugin/backend keys (acpx plugin section):
- backend command/path overrides
- backend env allowlist
- backend per-agent presets
- backend startup/stop timeouts
- backend max inflight runs per session
## Implementation specification
### Control-plane modules (new)
Add dedicated ACP control-plane modules in core:
- `src/acp/control-plane/manager.ts`
- owns ACP actors, lifecycle transitions, command serialization
- `src/acp/control-plane/store.ts`
- SQLite schema management, transactions, query helpers
- `src/acp/control-plane/events.ts`
- typed ACP event definitions and serialization
- `src/acp/control-plane/checkpoint.ts`
- durable delivery checkpoints and replay cursors
- `src/acp/control-plane/idempotency.ts`
- idempotency key reservation and response replay
- `src/acp/control-plane/recovery.ts`
- boot-time reconciliation and actor rehydrate plan
Compatibility bridge modules:
- `src/acp/runtime/session-meta.ts`
- remains temporarily for projection into `SessionEntry.acp`
- must stop being source-of-truth after migration cutover
### Required invariants (must enforce in code)
- ACP session creation and thread bind are atomic (single transaction)
- there is at most one active run per ACP session actor at a time
- event `seq` is strictly increasing per run
- delivery checkpoint never advances past last committed event
- idempotency replay returns previous success payload for duplicate command keys
- stale/missing ACP metadata cannot route into normal non-ACP reply path
### Core touchpoints
Core files to change:
- `src/auto-reply/reply/dispatch-from-config.ts`
- ACP branch calls `AcpSessionManager.submit` and event-projection delivery
- remove direct ACP fallback that bypasses control-plane invariants
- `src/auto-reply/reply/inbound-context.ts` (or nearest normalized context boundary)
- expose normalized routing keys and idempotency seeds for ACP control plane
- `src/config/sessions/types.ts`
- keep `SessionEntry.acp` as projection-only compatibility field
- `src/gateway/server-methods/sessions.ts`
- reset/delete/archive must call ACP manager close/unbind transaction path
- `src/infra/outbound/bound-delivery-router.ts`
- enforce fail-closed destination behavior for ACP bound session turns
- `src/discord/monitor/thread-bindings.ts`
- add ACP stale-binding validation helpers wired to control-plane lookups
- `src/auto-reply/reply/commands-acp.ts`
- route spawn/cancel/close/steer through ACP manager APIs
- `src/agents/acp-spawn.ts`
- stop ad-hoc metadata writes; call ACP manager spawn transaction
- `src/plugin-sdk/**` and plugin runtime bridge
- expose ACP backend registration and health semantics cleanly
Core files explicitly not replaced:
- `src/discord/monitor/message-handler.preflight.ts`
- keep thread binding override behavior as the canonical session-key resolver
### ACP runtime registry API
Add a core registry module:
- `src/acp/runtime/registry.ts`
Required API:
```ts
export type AcpRuntimeBackend = {
id: string;
runtime: AcpRuntime;
healthy?: () => boolean;
};
export function registerAcpRuntimeBackend(backend: AcpRuntimeBackend): void;
export function unregisterAcpRuntimeBackend(id: string): void;
export function getAcpRuntimeBackend(id?: string): AcpRuntimeBackend | null;
export function requireAcpRuntimeBackend(id?: string): AcpRuntimeBackend;
```
Behavior:
- `requireAcpRuntimeBackend` throws a typed ACP backend missing error when unavailable
- plugin service registers backend on `start` and unregisters on `stop`
- runtime lookups are read-only and process-local
### acpx runtime plugin contract (implementation detail)
For the first production backend (`extensions/acpx`), OpenClaw and acpx are
connected with a strict command contract:
- backend id: `acpx`
- plugin service id: `acpx-runtime`
- runtime handle encoding: `runtimeSessionName = acpx:v1:<base64url(json)>`
- encoded payload fields:
- `name` (acpx named session; uses OpenClaw `sessionKey`)
- `agent` (acpx agent command)
- `cwd` (session workspace root)
- `mode` (`persistent | oneshot`)
Command mapping:
- ensure session:
- `acpx --format json --json-strict --cwd <cwd> <agent> sessions ensure --name <name>`
- prompt turn:
- `acpx --format json --json-strict --cwd <cwd> <agent> prompt --session <name> --file -`
- cancel:
- `acpx --format json --json-strict --cwd <cwd> <agent> cancel --session <name>`
- close:
- `acpx --format json --json-strict --cwd <cwd> <agent> sessions close <name>`
Streaming:
- OpenClaw consumes ndjson events from `acpx --format json --json-strict`
- `text` => `text_delta/output`
- `thought` => `text_delta/thought`
- `tool_call` => `tool_call`
- `done` => `done`
- `error` => `error`
### Session schema patch
Patch `SessionEntry` in `src/config/sessions/types.ts`:
```ts
type SessionAcpMeta = {
backend: string;
agent: string;
runtimeSessionName: string;
mode: "persistent" | "oneshot";
cwd?: string;
state: "idle" | "running" | "error";
lastActivityAt: number;
lastError?: string;
};
```
Persisted field:
- `SessionEntry.acp?: SessionAcpMeta`
Migration rules:
- phase A: dual-write (`acp` projection + ACP SQLite source-of-truth)
- phase B: read-primary from ACP SQLite, fallback-read from legacy `SessionEntry.acp`
- phase C: migration command backfills missing ACP rows from valid legacy entries
- phase D: remove fallback-read and keep projection optional for UX only
- legacy fields (`cliSessionIds`, `claudeCliSessionId`) remain untouched
### Error contract
Add stable ACP error codes and user-facing messages:
- `ACP_BACKEND_MISSING`
- message: `ACP runtime backend is not configured. Install and enable the acpx runtime plugin.`
- `ACP_BACKEND_UNAVAILABLE`
- message: `ACP runtime backend is currently unavailable. Try again in a moment.`
- `ACP_SESSION_INIT_FAILED`
- message: `Could not initialize ACP session runtime.`
- `ACP_TURN_FAILED`
- message: `ACP turn failed before completion.`
Rules:
- return actionable user-safe message in-thread
- log detailed backend/system error only in runtime logs
- never silently fall back to normal LLM path when ACP routing was explicitly selected
### Duplicate delivery arbitration
Single routing rule for ACP bound turns:
- if an active thread binding exists for the target ACP session and requester context, deliver only to that bound thread
- do not also send to parent channel for the same turn
- if bound destination selection is ambiguous, fail closed with explicit error (no implicit parent fallback)
- if no active binding exists, use normal session destination behavior
### Observability and operational readiness
Required metrics:
- ACP spawn success/failure count by backend and error code
- ACP run latency percentiles (queue wait, runtime turn time, delivery projection time)
- ACP actor restart count and restart reason
- stale-binding detection count
- idempotency replay hit rate
- Discord delivery retry and rate-limit counters
Required logs:
- structured logs keyed by `sessionKey`, `runId`, `backend`, `threadId`, `idempotencyKey`
- explicit state transition logs for session and run state machines
- adapter command logs with redaction-safe arguments and exit summary
Required diagnostics:
- `/acp sessions` includes state, active run, last error, and binding status
- `/acp doctor` (or equivalent) validates backend registration, store health, and stale bindings
### Config precedence and effective values
ACP enablement precedence:
- account override: `channels.discord.accounts.<id>.threadBindings.spawnAcpSessions`
- channel override: `channels.discord.threadBindings.spawnAcpSessions`
- global ACP gate: `acp.enabled`
- dispatch gate: `acp.dispatch.enabled`
- backend availability: registered backend for `acp.backend`
Auto-enable behavior:
- when ACP is configured (`acp.enabled=true`, `acp.dispatch.enabled=true`, or
`acp.backend=acpx`), plugin auto-enable marks `plugins.entries.acpx.enabled=true`
unless denylisted or explicitly disabled
TTL effective value:
- `min(session ttl, discord thread binding ttl, acp runtime ttl)`
### Test map
Unit tests:
- `src/acp/runtime/registry.test.ts` (new)
- `src/auto-reply/reply/dispatch-from-config.acp.test.ts` (new)
- `src/infra/outbound/bound-delivery-router.test.ts` (extend ACP fail-closed cases)
- `src/config/sessions/types.test.ts` or nearest session-store tests (ACP metadata persistence)
Integration tests:
- `src/discord/monitor/reply-delivery.test.ts` (bound ACP delivery target behavior)
- `src/discord/monitor/message-handler.preflight*.test.ts` (bound ACP session-key routing continuity)
- acpx plugin runtime tests in backend package (service register/start/stop + event normalization)
Gateway e2e tests:
- `src/gateway/server.sessions.gateway-server-sessions-a.e2e.test.ts` (extend ACP reset/delete lifecycle coverage)
- ACP thread turn roundtrip e2e for spawn, message, stream, cancel, unfocus, restart recovery
### Rollout guard
Add independent ACP dispatch kill switch:
- `acp.dispatch.enabled` default `false` for first release
- when disabled:
- ACP spawn/focus control commands may still bind sessions
- ACP dispatch path does not activate
- user receives explicit message that ACP dispatch is disabled by policy
- after canary validation, default can be flipped to `true` in a later release
## Command and UX plan
### New commands
- `/acp spawn <agent-id> [--mode persistent|oneshot] [--thread auto|here|off]`
- `/acp cancel [session]`
- `/acp steer <instruction>`
- `/acp close [session]`
- `/acp sessions`
### Existing command compatibility
- `/focus <sessionKey>` continues to support ACP targets
- `/unfocus` keeps current semantics
- `/session idle` and `/session max-age` replace the old TTL override
## Phased rollout
### Phase 0 ADR and schema freeze
- ship ADR for ACP control-plane ownership and adapter boundaries
- freeze DB schema (`acp_sessions`, `acp_runs`, `acp_bindings`, `acp_events`, `acp_delivery_checkpoint`, `acp_idempotency`)
- define stable ACP error codes, event contract, and state-transition guards
### Phase 1 Control-plane foundation in core
- implement `AcpSessionManager` and per-session actor runtime
- implement ACP SQLite store and transaction helpers
- implement idempotency store and replay helpers
- implement event append + delivery checkpoint modules
- wire spawn/cancel/close APIs to manager with transactional guarantees
### Phase 2 Core routing and lifecycle integration
- route thread-bound ACP turns from dispatch pipeline into ACP manager
- enforce fail-closed routing when ACP binding/session invariants fail
- integrate reset/delete/archive/unfocus lifecycle with ACP close/unbind transactions
- add stale-binding detection and optional auto-unbind policy
### Phase 3 acpx backend adapter/plugin
- implement `acpx` adapter against runtime contract (`ensureSession`, `submit`, `stream`, `cancel`, `close`)
- add backend health checks and startup/teardown registration
- normalize acpx ndjson events into ACP runtime events
- enforce backend timeouts, process supervision, and restart/backoff policy
### Phase 4 Delivery projection and channel UX (Discord first)
- implement event-driven channel projection with checkpoint resume (Discord first)
- coalesce streaming chunks with rate-limit aware flush policy
- guarantee exactly-once final completion message per run
- ship `/acp spawn`, `/acp cancel`, `/acp steer`, `/acp close`, `/acp sessions`
### Phase 5 Migration and cutover
- introduce dual-write to `SessionEntry.acp` projection plus ACP SQLite source-of-truth
- add migration utility for legacy ACP metadata rows
- flip read path to ACP SQLite primary
- remove legacy fallback routing that depends on missing `SessionEntry.acp`
### Phase 6 Hardening, SLOs, and scale limits
- enforce concurrency limits (global/account/session), queue policies, and timeout budgets
- add full telemetry, dashboards, and alert thresholds
- chaos-test crash recovery and duplicate-delivery suppression
- publish runbook for backend outage, DB corruption, and stale-binding remediation
### Full implementation checklist
- core control-plane modules and tests
- DB migrations and rollback plan
- ACP manager API integration across dispatch and commands
- adapter registration interface in plugin runtime bridge
- acpx adapter implementation and tests
- thread-capable channel delivery projection logic with checkpoint replay (Discord first)
- lifecycle hooks for reset/delete/archive/unfocus
- stale-binding detector and operator-facing diagnostics
- config validation and precedence tests for all new ACP keys
- operational docs and troubleshooting runbook
## Test plan
Unit tests:
- ACP DB transaction boundaries (spawn/bind/enqueue atomicity, cancel, close)
- ACP state-machine transition guards for sessions and runs
- idempotency reservation/replay semantics across all ACP commands
- per-session actor serialization and queue ordering
- acpx event parser and chunk coalescer
- runtime supervisor restart and backoff policy
- config precedence and effective TTL calculation
- core ACP routing branch selection and fail-closed behavior when backend/session is invalid
Integration tests:
- fake ACP adapter process for deterministic streaming and cancel behavior
- ACP manager + dispatch integration with transactional persistence
- thread-bound inbound routing to ACP session key
- thread-bound outbound delivery suppresses parent channel duplication
- checkpoint replay recovers after delivery failure and resumes from last event
- plugin service registration and teardown of ACP runtime backend
Gateway e2e tests:
- spawn ACP with thread, exchange multi-turn prompts, unfocus
- gateway restart with persisted ACP DB and bindings, then continue same session
- concurrent ACP sessions in multiple threads have no cross-talk
- duplicate command retries (same idempotency key) do not create duplicate runs or replies
- stale-binding scenario yields explicit error and optional auto-clean behavior
## Risks and mitigations
- Duplicate deliveries during transition
- Mitigation: single destination resolver and idempotent event checkpoint
- Runtime process churn under load
- Mitigation: long lived per session owners + concurrency caps + backoff
- Plugin absent or misconfigured
- Mitigation: explicit operator-facing error and fail-closed ACP routing (no implicit fallback to normal session path)
- Config confusion between subagent and ACP gates
- Mitigation: explicit ACP keys and command feedback that includes effective policy source
- Control-plane store corruption or migration bugs
- Mitigation: WAL mode, backup/restore hooks, migration smoke tests, and read-only fallback diagnostics
- Actor deadlocks or mailbox starvation
- Mitigation: watchdog timers, actor health probes, and bounded mailbox depth with rejection telemetry
## Acceptance checklist
- ACP session spawn can create or bind a thread in a supported channel adapter (currently Discord)
- all thread messages route to bound ACP session only
- ACP outputs appear in the same thread identity with streaming or batches
- no duplicate output in parent channel for bound turns
- spawn+bind+initial enqueue are atomic in persistent store
- ACP command retries are idempotent and do not duplicate runs or outputs
- cancel, close, unfocus, archive, reset, and delete perform deterministic cleanup
- crash restart preserves mapping and resumes multi turn continuity
- concurrent thread bound ACP sessions work independently
- ACP backend missing state produces clear actionable error
- stale bindings are detected and surfaced explicitly (with optional safe auto-clean)
- control-plane metrics and diagnostics are available for operators
- new unit, integration, and e2e coverage passes
## Addendum: targeted refactors for current implementation (status)
These are non-blocking follow-ups to keep the ACP path maintainable after the current feature set lands.
### 1) Centralize ACP dispatch policy evaluation (completed)
- implemented via shared ACP policy helpers in `src/acp/policy.ts`
- dispatch, ACP command lifecycle handlers, and ACP spawn path now consume shared policy logic
### 2) Split ACP command handler by subcommand domain (completed)
- `src/auto-reply/reply/commands-acp.ts` is now a thin router
- subcommand behavior is split into:
- `src/auto-reply/reply/commands-acp/lifecycle.ts`
- `src/auto-reply/reply/commands-acp/runtime-options.ts`
- `src/auto-reply/reply/commands-acp/diagnostics.ts`
- shared helpers in `src/auto-reply/reply/commands-acp/shared.ts`
### 3) Split ACP session manager by responsibility (completed)
- manager is split into:
- `src/acp/control-plane/manager.ts` (public facade + singleton)
- `src/acp/control-plane/manager.core.ts` (manager implementation)
- `src/acp/control-plane/manager.types.ts` (manager types/deps)
- `src/acp/control-plane/manager.utils.ts` (normalization + helper functions)
### 4) Optional acpx runtime adapter cleanup
- `extensions/acpx/src/runtime.ts` can be split into:
- process execution/supervision
- ndjson event parsing/normalization
- runtime API surface (`submit`, `cancel`, `close`, etc.)
- improves testability and makes backend behavior easier to audit

View File

@@ -0,0 +1,96 @@
---
summary: "Holy grail refactor plan for one unified runtime streaming pipeline across main, subagent, and ACP"
owner: "onutc"
status: "draft"
last_updated: "2026-02-25"
title: "Unified Runtime Streaming Refactor Plan"
---
# Unified Runtime Streaming Refactor Plan
## Objective
Deliver one shared streaming pipeline for `main`, `subagent`, and `acp` so all runtimes get identical coalescing, chunking, delivery ordering, and crash recovery behavior.
## Why this exists
- Current behavior is split across multiple runtime-specific shaping paths.
- Formatting/coalescing bugs can be fixed in one path but remain in others.
- Delivery consistency, duplicate suppression, and recovery semantics are harder to reason about.
## Target architecture
Single pipeline, runtime-specific adapters:
1. Runtime adapters emit canonical events only.
2. Shared stream assembler coalesces and finalizes text/tool/status events.
3. Shared channel projector applies channel-specific chunking/formatting once.
4. Shared delivery ledger enforces idempotent send/replay semantics.
5. Outbound channel adapter executes sends and records delivery checkpoints.
Canonical event contract:
- `turn_started`
- `text_delta`
- `block_final`
- `tool_started`
- `tool_finished`
- `status`
- `turn_completed`
- `turn_failed`
- `turn_cancelled`
## Workstreams
### 1) Canonical streaming contract
- Define strict event schema + validation in core.
- Add adapter contract tests to guarantee each runtime emits compatible events.
- Reject malformed runtime events early and surface structured diagnostics.
### 2) Shared stream processor
- Replace runtime-specific coalescer/projector logic with one processor.
- Processor owns text delta buffering, idle flush, max-chunk splitting, and completion flush.
- Move ACP/main/subagent config resolution into one helper to prevent drift.
### 3) Shared channel projection
- Keep channel adapters dumb: accept finalized blocks and send.
- Move Discord-specific chunking quirks to channel projector only.
- Keep pipeline channel-agnostic before projection.
### 4) Delivery ledger + replay
- Add per-turn/per-chunk delivery IDs.
- Record checkpoints before and after physical send.
- On restart, replay pending chunks idempotently and avoid duplicates.
### 5) Migration and cutover
- Phase 1: shadow mode (new pipeline computes output but old path sends; compare).
- Phase 2: runtime-by-runtime cutover (`acp`, then `subagent`, then `main` or reverse by risk).
- Phase 3: delete legacy runtime-specific streaming code.
## Non-goals
- No changes to ACP policy/permissions model in this refactor.
- No channel-specific feature expansion outside projection compatibility fixes.
- No transport/backend redesign (acpx plugin contract remains as-is unless needed for event parity).
## Risks and mitigations
- Risk: behavioral regressions in existing main/subagent paths.
Mitigation: shadow mode diffing + adapter contract tests + channel e2e tests.
- Risk: duplicate sends during crash recovery.
Mitigation: durable delivery IDs + idempotent replay in delivery adapter.
- Risk: runtime adapters diverge again.
Mitigation: required shared contract test suite for all adapters.
## Acceptance criteria
- All runtimes pass shared streaming contract tests.
- Discord ACP/main/subagent produce equivalent spacing/chunking behavior for tiny deltas.
- Crash/restart replay sends no duplicate chunk for the same delivery ID.
- Legacy ACP projector/coalescer path is removed.
- Streaming config resolution is shared and runtime-independent.

View File

@@ -0,0 +1,232 @@
---
summary: "Plan: isolate browser act:evaluate from Playwright queue using CDP, with end-to-end deadlines and safer ref resolution"
read_when:
- Working on browser `act:evaluate` timeout, abort, or queue blocking issues
- Planning CDP based isolation for evaluate execution
owner: "openclaw"
status: "draft"
last_updated: "2026-02-10"
title: "Browser Evaluate CDP Refactor"
---
# Browser Evaluate CDP Refactor Plan
## Context
`act:evaluate` executes user provided JavaScript in the page. Today it runs via Playwright
(`page.evaluate` or `locator.evaluate`). Playwright serializes CDP commands per page, so a
stuck or long running evaluate can block the page command queue and make every later action
on that tab look "stuck".
PR #13498 adds a pragmatic safety net (bounded evaluate, abort propagation, and best-effort
recovery). This document describes a larger refactor that makes `act:evaluate` inherently
isolated from Playwright so a stuck evaluate cannot wedge normal Playwright operations.
## Goals
- `act:evaluate` cannot permanently block later browser actions on the same tab.
- Timeouts are single source of truth end to end so a caller can rely on a budget.
- Abort and timeout are treated the same way across HTTP and in-process dispatch.
- Element targeting for evaluate is supported without switching everything off Playwright.
- Maintain backward compatibility for existing callers and payloads.
## Non-goals
- Replace all browser actions (click, type, wait, etc.) with CDP implementations.
- Remove the existing safety net introduced in PR #13498 (it remains a useful fallback).
- Introduce new unsafe capabilities beyond the existing `browser.evaluateEnabled` gate.
- Add process isolation (worker process/thread) for evaluate. If we still see hard to recover
stuck states after this refactor, that is a follow-up idea.
## Current Architecture (Why It Gets Stuck)
At a high level:
- Callers send `act:evaluate` to the browser control service.
- The route handler calls into Playwright to execute the JavaScript.
- Playwright serializes page commands, so an evaluate that never finishes blocks the queue.
- A stuck queue means later click/type/wait operations on the tab can appear to hang.
## Proposed Architecture
### 1. Deadline Propagation
Introduce a single budget concept and derive everything from it:
- Caller sets `timeoutMs` (or a deadline in the future).
- The outer request timeout, route handler logic, and the execution budget inside the page
all use the same budget, with small headroom where needed for serialization overhead.
- Abort is propagated as an `AbortSignal` everywhere so cancellation is consistent.
Implementation direction:
- Add a small helper (for example `createBudget({ timeoutMs, signal })`) that returns:
- `signal`: the linked AbortSignal
- `deadlineAtMs`: absolute deadline
- `remainingMs()`: remaining budget for child operations
- Use this helper in:
- `src/browser/client-fetch.ts` (HTTP and in-process dispatch)
- `src/node-host/runner.ts` (proxy path)
- browser action implementations (Playwright and CDP)
### 2. Separate Evaluate Engine (CDP Path)
Add a CDP based evaluate implementation that does not share Playwright's per page command
queue. The key property is that the evaluate transport is a separate WebSocket connection
and a separate CDP session attached to the target.
Implementation direction:
- New module, for example `src/browser/cdp-evaluate.ts`, that:
- Connects to the configured CDP endpoint (browser level socket).
- Uses `Target.attachToTarget({ targetId, flatten: true })` to get a `sessionId`.
- Runs either:
- `Runtime.evaluate` for page level evaluate, or
- `DOM.resolveNode` plus `Runtime.callFunctionOn` for element evaluate.
- On timeout or abort:
- Sends `Runtime.terminateExecution` best-effort for the session.
- Closes the WebSocket and returns a clear error.
Notes:
- This still executes JavaScript in the page, so termination can have side effects. The win
is that it does not wedge the Playwright queue, and it is cancelable at the transport
layer by killing the CDP session.
### 3. Ref Story (Element Targeting Without A Full Rewrite)
The hard part is element targeting. CDP needs a DOM handle or `backendDOMNodeId`, while
today most browser actions use Playwright locators based on refs from snapshots.
Recommended approach: keep existing refs, but attach an optional CDP resolvable id.
#### 3.1 Extend Stored Ref Info
Extend the stored role ref metadata to optionally include a CDP id:
- Today: `{ role, name, nth }`
- Proposed: `{ role, name, nth, backendDOMNodeId?: number }`
This keeps all existing Playwright based actions working and allows CDP evaluate to accept
the same `ref` value when the `backendDOMNodeId` is available.
#### 3.2 Populate backendDOMNodeId At Snapshot Time
When producing a role snapshot:
1. Generate the existing role ref map as today (role, name, nth).
2. Fetch the AX tree via CDP (`Accessibility.getFullAXTree`) and compute a parallel map of
`(role, name, nth) -> backendDOMNodeId` using the same duplicate handling rules.
3. Merge the id back into the stored ref info for the current tab.
If mapping fails for a ref, leave `backendDOMNodeId` undefined. This makes the feature
best-effort and safe to roll out.
#### 3.3 Evaluate Behavior With Ref
In `act:evaluate`:
- If `ref` is present and has `backendDOMNodeId`, run element evaluate via CDP.
- If `ref` is present but has no `backendDOMNodeId`, fall back to the Playwright path (with
the safety net).
Optional escape hatch:
- Extend the request shape to accept `backendDOMNodeId` directly for advanced callers (and
for debugging), while keeping `ref` as the primary interface.
### 4. Keep A Last Resort Recovery Path
Even with CDP evaluate, there are other ways to wedge a tab or a connection. Keep the
existing recovery mechanisms (terminate execution + disconnect Playwright) as a last resort
for:
- legacy callers
- environments where CDP attach is blocked
- unexpected Playwright edge cases
## Implementation Plan (Single Iteration)
### Deliverables
- A CDP based evaluate engine that runs outside the Playwright per-page command queue.
- A single end-to-end timeout/abort budget used consistently by callers and handlers.
- Ref metadata that can optionally carry `backendDOMNodeId` for element evaluate.
- `act:evaluate` prefers the CDP engine when possible and falls back to Playwright when not.
- Tests that prove a stuck evaluate does not wedge later actions.
- Logs/metrics that make failures and fallbacks visible.
### Implementation Checklist
1. Add a shared "budget" helper to link `timeoutMs` + upstream `AbortSignal` into:
- a single `AbortSignal`
- an absolute deadline
- a `remainingMs()` helper for downstream operations
2. Update all caller paths to use that helper so `timeoutMs` means the same thing everywhere:
- `src/browser/client-fetch.ts` (HTTP and in-process dispatch)
- `src/node-host/runner.ts` (node proxy path)
- CLI wrappers that call `/act` (add `--timeout-ms` to `browser evaluate`)
3. Implement `src/browser/cdp-evaluate.ts`:
- connect to the browser-level CDP socket
- `Target.attachToTarget` to get a `sessionId`
- run `Runtime.evaluate` for page evaluate
- run `DOM.resolveNode` + `Runtime.callFunctionOn` for element evaluate
- on timeout/abort: best-effort `Runtime.terminateExecution` then close the socket
4. Extend stored role ref metadata to optionally include `backendDOMNodeId`:
- keep existing `{ role, name, nth }` behavior for Playwright actions
- add `backendDOMNodeId?: number` for CDP element targeting
5. Populate `backendDOMNodeId` during snapshot creation (best-effort):
- fetch AX tree via CDP (`Accessibility.getFullAXTree`)
- compute `(role, name, nth) -> backendDOMNodeId` and merge into the stored ref map
- if mapping is ambiguous or missing, leave the id undefined
6. Update `act:evaluate` routing:
- if no `ref`: always use CDP evaluate
- if `ref` resolves to a `backendDOMNodeId`: use CDP element evaluate
- otherwise: fall back to Playwright evaluate (still bounded and abortable)
7. Keep the existing "last resort" recovery path as a fallback, not the default path.
8. Add tests:
- stuck evaluate times out within budget and the next click/type succeeds
- abort cancels evaluate (client disconnect or timeout) and unblocks subsequent actions
- mapping failures cleanly fall back to Playwright
9. Add observability:
- evaluate duration and timeout counters
- terminateExecution usage
- fallback rate (CDP -> Playwright) and reasons
### Acceptance Criteria
- A deliberately hung `act:evaluate` returns within the caller budget and does not wedge the
tab for later actions.
- `timeoutMs` behaves consistently across CLI, agent tool, node proxy, and in-process calls.
- If `ref` can be mapped to `backendDOMNodeId`, element evaluate uses CDP; otherwise the
fallback path is still bounded and recoverable.
## Testing Plan
- Unit tests:
- `(role, name, nth)` matching logic between role refs and AX tree nodes.
- Budget helper behavior (headroom, remaining time math).
- Integration tests:
- CDP evaluate timeout returns within budget and does not block the next action.
- Abort cancels evaluate and triggers termination best-effort.
- Contract tests:
- Ensure `BrowserActRequest` and `BrowserActResponse` remain compatible.
## Risks And Mitigations
- Mapping is imperfect:
- Mitigation: best-effort mapping, fallback to Playwright evaluate, and add debug tooling.
- `Runtime.terminateExecution` has side effects:
- Mitigation: only use on timeout/abort and document the behavior in errors.
- Extra overhead:
- Mitigation: only fetch AX tree when snapshots are requested, cache per target, and keep
CDP session short lived.
- Extension relay limitations:
- Mitigation: use browser level attach APIs when per page sockets are not available, and
keep the current Playwright path as fallback.
## Open Questions
- Should the new engine be configurable as `playwright`, `cdp`, or `auto`?
- Do we want to expose a new "nodeRef" format for advanced users, or keep `ref` only?
- How should frame snapshots and selector scoped snapshots participate in AX mapping?

View File

@@ -0,0 +1,126 @@
---
summary: "Plan: Add OpenResponses /v1/responses endpoint and deprecate chat completions cleanly"
read_when:
- Designing or implementing `/v1/responses` gateway support
- Planning migration from Chat Completions compatibility
owner: "openclaw"
status: "draft"
last_updated: "2026-01-19"
title: "OpenResponses Gateway Plan"
---
# OpenResponses Gateway Integration Plan
## Context
OpenClaw Gateway currently exposes a minimal OpenAI-compatible Chat Completions endpoint at
`/v1/chat/completions` (see [OpenAI Chat Completions](/gateway/openai-http-api)).
Open Responses is an open inference standard based on the OpenAI Responses API. It is designed
for agentic workflows and uses item-based inputs plus semantic streaming events. The OpenResponses
spec defines `/v1/responses`, not `/v1/chat/completions`.
## Goals
- Add a `/v1/responses` endpoint that adheres to OpenResponses semantics.
- Keep Chat Completions as a compatibility layer that is easy to disable and eventually remove.
- Standardize validation and parsing with isolated, reusable schemas.
## Non-goals
- Full OpenResponses feature parity in the first pass (images, files, hosted tools).
- Replacing internal agent execution logic or tool orchestration.
- Changing the existing `/v1/chat/completions` behavior during the first phase.
## Research Summary
Sources: OpenResponses OpenAPI, OpenResponses specification site, and the Hugging Face blog post.
Key points extracted:
- `POST /v1/responses` accepts `CreateResponseBody` fields like `model`, `input` (string or
`ItemParam[]`), `instructions`, `tools`, `tool_choice`, `stream`, `max_output_tokens`, and
`max_tool_calls`.
- `ItemParam` is a discriminated union of:
- `message` items with roles `system`, `developer`, `user`, `assistant`
- `function_call` and `function_call_output`
- `reasoning`
- `item_reference`
- Successful responses return a `ResponseResource` with `object: "response"`, `status`, and
`output` items.
- Streaming uses semantic events such as:
- `response.created`, `response.in_progress`, `response.completed`, `response.failed`
- `response.output_item.added`, `response.output_item.done`
- `response.content_part.added`, `response.content_part.done`
- `response.output_text.delta`, `response.output_text.done`
- The spec requires:
- `Content-Type: text/event-stream`
- `event:` must match the JSON `type` field
- terminal event must be literal `[DONE]`
- Reasoning items may expose `content`, `encrypted_content`, and `summary`.
- HF examples include `OpenResponses-Version: latest` in requests (optional header).
## Proposed Architecture
- Add `src/gateway/open-responses.schema.ts` containing Zod schemas only (no gateway imports).
- Add `src/gateway/openresponses-http.ts` (or `open-responses-http.ts`) for `/v1/responses`.
- Keep `src/gateway/openai-http.ts` intact as a legacy compatibility adapter.
- Add config `gateway.http.endpoints.responses.enabled` (default `false`).
- Keep `gateway.http.endpoints.chatCompletions.enabled` independent; allow both endpoints to be
toggled separately.
- Emit a startup warning when Chat Completions is enabled to signal legacy status.
## Deprecation Path for Chat Completions
- Maintain strict module boundaries: no shared schema types between responses and chat completions.
- Make Chat Completions opt-in by config so it can be disabled without code changes.
- Update docs to label Chat Completions as legacy once `/v1/responses` is stable.
- Optional future step: map Chat Completions requests to the Responses handler for a simpler
removal path.
## Phase 1 Support Subset
- Accept `input` as string or `ItemParam[]` with message roles and `function_call_output`.
- Extract system and developer messages into `extraSystemPrompt`.
- Use the most recent `user` or `function_call_output` as the current message for agent runs.
- Reject unsupported content parts (image/file) with `invalid_request_error`.
- Return a single assistant message with `output_text` content.
- Return `usage` with zeroed values until token accounting is wired.
## Validation Strategy (No SDK)
- Implement Zod schemas for the supported subset of:
- `CreateResponseBody`
- `ItemParam` + message content part unions
- `ResponseResource`
- Streaming event shapes used by the gateway
- Keep schemas in a single, isolated module to avoid drift and allow future codegen.
## Streaming Implementation (Phase 1)
- SSE lines with both `event:` and `data:`.
- Required sequence (minimum viable):
- `response.created`
- `response.output_item.added`
- `response.content_part.added`
- `response.output_text.delta` (repeat as needed)
- `response.output_text.done`
- `response.content_part.done`
- `response.completed`
- `[DONE]`
## Tests and Verification Plan
- Add e2e coverage for `/v1/responses`:
- Auth required
- Non-stream response shape
- Stream event ordering and `[DONE]`
- Session routing with headers and `user`
- Keep `src/gateway/openai-http.test.ts` unchanged.
- Manual: curl to `/v1/responses` with `stream: true` and verify event ordering and terminal
`[DONE]`.
## Doc Updates (Follow-up)
- Add a new docs page for `/v1/responses` usage and examples.
- Update `/gateway/openai-http-api` with a legacy note and pointer to `/v1/responses`.

View File

@@ -0,0 +1,195 @@
---
summary: "Production plan for reliable interactive process supervision (PTY + non-PTY) with explicit ownership, unified lifecycle, and deterministic cleanup"
read_when:
- Working on exec/process lifecycle ownership and cleanup
- Debugging PTY and non-PTY supervision behavior
owner: "openclaw"
status: "in-progress"
last_updated: "2026-02-15"
title: "PTY and Process Supervision Plan"
---
# PTY and Process Supervision Plan
## 1. Problem and goal
We need one reliable lifecycle for long-running command execution across:
- `exec` foreground runs
- `exec` background runs
- `process` follow up actions (`poll`, `log`, `send-keys`, `paste`, `submit`, `kill`, `remove`)
- CLI agent runner subprocesses
The goal is not just to support PTY. The goal is predictable ownership, cancellation, timeout, and cleanup with no unsafe process matching heuristics.
## 2. Scope and boundaries
- Keep implementation internal in `src/process/supervisor`.
- Do not create a new package for this.
- Keep current behavior compatibility where practical.
- Do not broaden scope to terminal replay or tmux style session persistence.
## 3. Implemented in this branch
### Supervisor baseline already present
- Supervisor module is in place under `src/process/supervisor/*`.
- Exec runtime and CLI runner are already routed through supervisor spawn and wait.
- Registry finalization is idempotent.
### This pass completed
1. Explicit PTY command contract
- `SpawnInput` is now a discriminated union in `src/process/supervisor/types.ts`.
- PTY runs require `ptyCommand` instead of reusing generic `argv`.
- Supervisor no longer rebuilds PTY command strings from argv joins in `src/process/supervisor/supervisor.ts`.
- Exec runtime now passes `ptyCommand` directly in `src/agents/bash-tools.exec-runtime.ts`.
2. Process layer type decoupling
- Supervisor types no longer import `SessionStdin` from agents.
- Process local stdin contract lives in `src/process/supervisor/types.ts` (`ManagedRunStdin`).
- Adapters now depend only on process level types:
- `src/process/supervisor/adapters/child.ts`
- `src/process/supervisor/adapters/pty.ts`
3. Process tool lifecycle ownership improvement
- `src/agents/bash-tools.process.ts` now requests cancellation through supervisor first.
- `process kill/remove` now use process-tree fallback termination when supervisor lookup misses.
- `remove` keeps deterministic remove behavior by dropping running session entries immediately after termination is requested.
4. Single source watchdog defaults
- Added shared defaults in `src/agents/cli-watchdog-defaults.ts`.
- `src/agents/cli-backends.ts` consumes the shared defaults.
- `src/agents/cli-runner/reliability.ts` consumes the same shared defaults.
5. Dead helper cleanup
- Removed unused `killSession` helper path from `src/agents/bash-tools.shared.ts`.
6. Direct supervisor path tests added
- Added `src/agents/bash-tools.process.supervisor.test.ts` to cover kill and remove routing through supervisor cancellation.
7. Reliability gap fixes completed
- `src/agents/bash-tools.process.ts` now falls back to real OS-level process termination when supervisor lookup misses.
- `src/process/supervisor/adapters/child.ts` now uses process-tree termination semantics for default cancel/timeout kill paths.
- Added shared process-tree utility in `src/process/kill-tree.ts`.
8. PTY contract edge-case coverage added
- Added `src/process/supervisor/supervisor.pty-command.test.ts` for verbatim PTY command forwarding and empty-command rejection.
- Added `src/process/supervisor/adapters/child.test.ts` for process-tree kill behavior in child adapter cancellation.
## 4. Remaining gaps and decisions
### Reliability status
The two required reliability gaps for this pass are now closed:
- `process kill/remove` now has a real OS termination fallback when supervisor lookup misses.
- child cancel/timeout now uses process-tree kill semantics for default kill path.
- Regression tests were added for both behaviors.
### Durability and startup reconciliation
Restart behavior is now explicitly defined as in-memory lifecycle only.
- `reconcileOrphans()` remains a no-op in `src/process/supervisor/supervisor.ts` by design.
- Active runs are not recovered after process restart.
- This boundary is intentional for this implementation pass to avoid partial persistence risks.
### Maintainability follow-ups
1. `runExecProcess` in `src/agents/bash-tools.exec-runtime.ts` still handles multiple responsibilities and can be split into focused helpers in a follow-up.
## 5. Implementation plan
The implementation pass for required reliability and contract items is complete.
Completed:
- `process kill/remove` fallback real termination
- process-tree cancellation for child adapter default kill path
- regression tests for fallback kill and child adapter kill path
- PTY command edge-case tests under explicit `ptyCommand`
- explicit in-memory restart boundary with `reconcileOrphans()` no-op by design
Optional follow-up:
- split `runExecProcess` into focused helpers with no behavior drift
## 6. File map
### Process supervisor
- `src/process/supervisor/types.ts` updated with discriminated spawn input and process local stdin contract.
- `src/process/supervisor/supervisor.ts` updated to use explicit `ptyCommand`.
- `src/process/supervisor/adapters/child.ts` and `src/process/supervisor/adapters/pty.ts` decoupled from agent types.
- `src/process/supervisor/registry.ts` idempotent finalize unchanged and retained.
### Exec and process integration
- `src/agents/bash-tools.exec-runtime.ts` updated to pass PTY command explicitly and keep fallback path.
- `src/agents/bash-tools.process.ts` updated to cancel via supervisor with real process-tree fallback termination.
- `src/agents/bash-tools.shared.ts` removed direct kill helper path.
### CLI reliability
- `src/agents/cli-watchdog-defaults.ts` added as shared baseline.
- `src/agents/cli-backends.ts` and `src/agents/cli-runner/reliability.ts` now consume same defaults.
## 7. Validation run in this pass
Unit tests:
- `pnpm vitest src/process/supervisor/registry.test.ts`
- `pnpm vitest src/process/supervisor/supervisor.test.ts`
- `pnpm vitest src/process/supervisor/supervisor.pty-command.test.ts`
- `pnpm vitest src/process/supervisor/adapters/child.test.ts`
- `pnpm vitest src/agents/cli-backends.test.ts`
- `pnpm vitest src/agents/bash-tools.exec.pty-cleanup.test.ts`
- `pnpm vitest src/agents/bash-tools.process.poll-timeout.test.ts`
- `pnpm vitest src/agents/bash-tools.process.supervisor.test.ts`
- `pnpm vitest src/process/exec.test.ts`
E2E targets:
- `pnpm vitest src/agents/cli-runner.test.ts`
- `pnpm vitest run src/agents/bash-tools.exec.pty-fallback.test.ts src/agents/bash-tools.exec.background-abort.test.ts src/agents/bash-tools.process.send-keys.test.ts`
Typecheck note:
- Use `pnpm build` (and `pnpm check` for full lint/docs gate) in this repo. Older notes that mention `pnpm tsgo` are obsolete.
## 8. Operational guarantees preserved
- Exec env hardening behavior is unchanged.
- Approval and allowlist flow is unchanged.
- Output sanitization and output caps are unchanged.
- PTY adapter still guarantees wait settlement on forced kill and listener disposal.
## 9. Definition of done
1. Supervisor is lifecycle owner for managed runs.
2. PTY spawn uses explicit command contract with no argv reconstruction.
3. Process layer has no type dependency on agent layer for supervisor stdin contracts.
4. Watchdog defaults are single source.
5. Targeted unit and e2e tests remain green.
6. Restart durability boundary is explicitly documented or fully implemented.
## 10. Summary
The branch now has a coherent and safer supervision shape:
- explicit PTY contract
- cleaner process layering
- supervisor driven cancellation path for process operations
- real fallback termination when supervisor lookup misses
- process-tree cancellation for child-run default kill paths
- unified watchdog defaults
- explicit in-memory restart boundary (no orphan reconciliation across restart in this pass)

View File

@@ -0,0 +1,226 @@
---
summary: "Channel agnostic session binding architecture and iteration 1 delivery scope"
read_when:
- Refactoring channel-agnostic session routing and bindings
- Investigating duplicate, stale, or missing session delivery across channels
owner: "onutc"
status: "in-progress"
last_updated: "2026-02-21"
title: "Session Binding Channel Agnostic Plan"
---
# Session Binding Channel Agnostic Plan
## Overview
This document defines the long term channel agnostic session binding model and the concrete scope for the next implementation iteration.
Goal:
- make subagent bound session routing a core capability
- keep channel specific behavior in adapters
- avoid regressions in normal Discord behavior
## Why this exists
Current behavior mixes:
- completion content policy
- destination routing policy
- Discord specific details
This caused edge cases such as:
- duplicate main and thread delivery under concurrent runs
- stale token usage on reused binding managers
- missing activity accounting for webhook sends
## Iteration 1 scope
This iteration is intentionally limited.
### 1. Add channel agnostic core interfaces
Add core types and service interfaces for bindings and routing.
Proposed core types:
```ts
export type BindingTargetKind = "subagent" | "session";
export type BindingStatus = "active" | "ending" | "ended";
export type ConversationRef = {
channel: string;
accountId: string;
conversationId: string;
parentConversationId?: string;
};
export type SessionBindingRecord = {
bindingId: string;
targetSessionKey: string;
targetKind: BindingTargetKind;
conversation: ConversationRef;
status: BindingStatus;
boundAt: number;
expiresAt?: number;
metadata?: Record<string, unknown>;
};
```
Core service contract:
```ts
export interface SessionBindingService {
bind(input: {
targetSessionKey: string;
targetKind: BindingTargetKind;
conversation: ConversationRef;
metadata?: Record<string, unknown>;
ttlMs?: number;
}): Promise<SessionBindingRecord>;
listBySession(targetSessionKey: string): SessionBindingRecord[];
resolveByConversation(ref: ConversationRef): SessionBindingRecord | null;
touch(bindingId: string, at?: number): void;
unbind(input: {
bindingId?: string;
targetSessionKey?: string;
reason: string;
}): Promise<SessionBindingRecord[]>;
}
```
### 2. Add one core delivery router for subagent completions
Add a single destination resolution path for completion events.
Router contract:
```ts
export interface BoundDeliveryRouter {
resolveDestination(input: {
eventKind: "task_completion";
targetSessionKey: string;
requester?: ConversationRef;
failClosed: boolean;
}): {
binding: SessionBindingRecord | null;
mode: "bound" | "fallback";
reason: string;
};
}
```
For this iteration:
- only `task_completion` is routed through this new path
- existing paths for other event kinds remain as-is
### 3. Keep Discord as adapter
Discord remains the first adapter implementation.
Adapter responsibilities:
- create/reuse thread conversations
- send bound messages via webhook or channel send
- validate thread state (archived/deleted)
- map adapter metadata (webhook identity, thread ids)
### 4. Fix currently known correctness issues
Required in this iteration:
- refresh token usage when reusing existing thread binding manager
- record outbound activity for webhook based Discord sends
- stop implicit main channel fallback when a bound thread destination is selected for session mode completion
### 5. Preserve current runtime safety defaults
No behavior change for users with thread bound spawn disabled.
Defaults stay:
- `channels.discord.threadBindings.spawnSubagentSessions = false`
Result:
- normal Discord users stay on current behavior
- new core path affects only bound session completion routing where enabled
## Not in iteration 1
Explicitly deferred:
- ACP binding targets (`targetKind: "acp"`)
- new channel adapters beyond Discord
- global replacement of all delivery paths (`spawn_ack`, future `subagent_message`)
- protocol level changes
- store migration/versioning redesign for all binding persistence
Notes on ACP:
- interface design keeps room for ACP
- ACP implementation is not started in this iteration
## Routing invariants
These invariants are mandatory for iteration 1.
- destination selection and content generation are separate steps
- if session mode completion resolves to an active bound destination, delivery must target that destination
- no hidden reroute from bound destination to main channel
- fallback behavior must be explicit and observable
## Compatibility and rollout
Compatibility target:
- no regression for users with thread bound spawning off
- no change to non-Discord channels in this iteration
Rollout:
1. Land interfaces and router behind current feature gates.
2. Route Discord completion mode bound deliveries through router.
3. Keep legacy path for non-bound flows.
4. Verify with targeted tests and canary runtime logs.
## Tests required in iteration 1
Unit and integration coverage required:
- manager token rotation uses latest token after manager reuse
- webhook sends update channel activity timestamps
- two active bound sessions in same requester channel do not duplicate to main channel
- completion for bound session mode run resolves to thread destination only
- disabled spawn flag keeps legacy behavior unchanged
## Proposed implementation files
Core:
- `src/infra/outbound/session-binding-service.ts` (new)
- `src/infra/outbound/bound-delivery-router.ts` (new)
- `src/agents/subagent-announce.ts` (completion destination resolution integration)
Discord adapter and runtime:
- `src/discord/monitor/thread-bindings.manager.ts`
- `src/discord/monitor/reply-delivery.ts`
- `src/discord/send.outbound.ts`
Tests:
- `src/discord/monitor/provider*.test.ts`
- `src/discord/monitor/reply-delivery.test.ts`
- `src/agents/subagent-announce.format.test.ts`
## Done criteria for iteration 1
- core interfaces exist and are wired for completion routing
- correctness fixes above are merged with tests
- no main and thread duplicate completion delivery in session mode bound runs
- no behavior change for disabled bound spawn deployments
- ACP remains explicitly deferred