# LetsBe Biz — Dispatcher Routing Logic
**Version:** 1.0
**Date:** February 26, 2026
**Authors:** Matt (Founder), Claude (Architecture)
**Status:** Engineering Spec — Ready for Implementation
**Companion docs:** Technical Architecture v1.2, Pricing Model v2.2, SOUL.md Content Spec v1.0
**Decision refs:** Foundation Document Decisions #33, #35, #41
---
## 1. Purpose
This document specifies two routing systems that are central to LetsBe Biz:
1. **Agent Routing (Dispatcher)** — How user messages are routed to the correct AI agent
2. **Model Routing** — How AI requests are routed to the optimal LLM model based on task complexity, user settings, and cost constraints
Both routing systems live in the Safety Wrapper extension and operate transparently — users interact with "their AI team," not with routing logic.
---
## 2. Agent Routing (Dispatcher Logic)
### 2.1 Architecture
The Dispatcher is an OpenClaw agent configured with `agentToAgent` communication enabled. It uses the `messaging` tool profile and serves as the default entry point for all user messages.
```
User Message
     ↓
Dispatcher Agent (SOUL.md: routing rules)
├── Simple / cross-domain → Handle directly
├── Infrastructure → delegate to IT Admin
├── Content / analytics → delegate to Marketing
├── Scheduling / comms → delegate to Secretary
├── CRM / pipeline → delegate to Sales
└── Multi-domain → coordinate across agents
```
### 2.2 Routing Decision Matrix
The Dispatcher routes based on **intent classification**. OpenClaw's native agent routing handles this through the Dispatcher's SOUL.md instructions — no separate classification model is needed.
| Signal | Routes To | Examples |
|--------|-----------|---------|
| Infrastructure keywords | IT Admin | "restart", "container", "backup", "disk", "server", "install", "update", "nginx", "Docker", "SSL", "certificate", "Keycloak", "Portainer" |
| Content/analytics keywords | Marketing | "blog", "post", "newsletter", "campaign", "analytics", "traffic", "subscribers", "Ghost", "Listmonk", "Umami", "SEO" |
| Scheduling/comms keywords | Secretary | "calendar", "meeting", "schedule", "email", "respond", "follow up", "Chatwoot", "Cal.com", "appointment", "reminder" |
| CRM/sales keywords | Sales | "lead", "opportunity", "pipeline", "CRM", "deal", "prospect", "follow-up", "Odoo", "quote", "proposal" |
| System questions | Dispatcher (self) | "what can you do", "how does this work", "what tools do I have", "help", "status", "summary" |
| Multi-domain | Dispatcher coordinates | "morning briefing", "give me a weekly summary", "how's business", "prepare for my meeting with [client]" |
### 2.3 Delegation Protocol
When the Dispatcher delegates to a specialist agent, it uses OpenClaw's native agent-to-agent messaging:
```
1. Dispatcher receives user message
2. Dispatcher identifies the target agent
3. Dispatcher sends structured delegation message:
   {
     "to": "it-admin",
     "context": "User requests: 'Why is Nextcloud slow?'",
     "expectation": "Diagnose and report. If action needed, get user approval."
   }
4. Target agent receives message, executes task
5. Target agent returns result to Dispatcher
6. Dispatcher formats and presents result to user
```
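The structured delegation message in step 3 can be sketched as a small helper. The field names mirror the example above; `buildDelegation` itself is illustrative (the actual transport is OpenClaw's native agent-to-agent messaging, not shown here):

```typescript
// Sketch of the delegation payload from step 3 above.
// Field names follow the example; everything else is illustrative.
interface DelegationMessage {
  to: string;          // target agent id, e.g. "it-admin"
  context: string;     // the user's request, quoted for the specialist
  expectation: string; // what the Dispatcher expects back
}

function buildDelegation(
  targetAgent: string,
  userMessage: string,
  expectation: string
): DelegationMessage {
  return {
    to: targetAgent,
    context: `User requests: '${userMessage}'`,
    expectation,
  };
}

const msg = buildDelegation(
  "it-admin",
  "Why is Nextcloud slow?",
  "Diagnose and report. If action needed, get user approval."
);
```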
### 2.4 Multi-Agent Coordination
For tasks spanning multiple agents, the Dispatcher acts as coordinator:
**Example: "Prepare for my call with Acme Corp tomorrow"**
1. Dispatcher identifies subtasks:
- Secretary: Pull calendar details, recent email threads with Acme
- Sales: Pull CRM record, pipeline status, last interaction
- Marketing: Check if Acme visited the website recently (Umami)
2. Dispatcher delegates each subtask in parallel (or sequential if dependencies exist)
3. Dispatcher compiles results into a unified briefing
4. Dispatcher presents the briefing to the user
### 2.5 Fallback Behavior
| Scenario | Behavior |
|----------|----------|
| Target agent unavailable (crashed/restarting) | Dispatcher notifies user, suggests IT Admin investigate |
| Ambiguous request | Dispatcher makes best judgment, routes, tells user who's handling it |
| User explicitly names an agent | Route directly ("Tell the IT Admin to restart Ghost") |
| Request is outside all agent capabilities | Dispatcher explains honestly what's possible and what isn't |
| Agent returns an error | Dispatcher reports the error to the user and suggests next steps |
---
## 3. Model Routing
### 3.1 Architecture
Model routing determines which LLM processes each agent turn. The Safety Wrapper's `before_prompt_build` hook (or the outbound secrets proxy) controls which model endpoint the request is sent to.
```
Agent Turn
     ↓
Safety Wrapper: Model Router
├── Check user's model setting (Basic / Balanced / Complex / Specific Model)
├── Check if premium model → verify credit card on file
└── Check token pool → enough tokens remaining?
     ↓
Route to OpenRouter endpoint
├── Primary model → attempt
├── If rate limited → try auth profile rotation (same model, different key)
└── If still failing → fall back to next model in chain
     ↓
Response → Token metering → Return to agent
```
### 3.2 Model Presets (Basic Settings)
Users who don't want to think about models pick a preset. Each preset maps to a prioritized model chain.
| Preset | Primary Model | Fallback 1 | Fallback 2 | Blended Cost/1M | Use Case |
|--------|--------------|------------|------------|-----------------|----------|
| **Basic Tasks** | GPT 5 Nano | Gemini 3 Flash Preview | DeepSeek V3.2 | $0.20–$1.58 | Quick lookups, formatting, simple drafts |
| **Balanced** (default) | DeepSeek V3.2 | MiniMax M2.5 | GPT 5 Nano | $0.20–$0.70 | Daily operations, routine agent work |
| **Complex Tasks** | GLM 5 | MiniMax M2.5 | DeepSeek V3.2 | $0.33–$1.68 | Analysis, multi-step reasoning, reports |
**Preset assignment logic:**
```
function resolveModel(agentId, taskContext) {
  // 1. Agent-specific model override wins
  if (agentConfig[agentId].model) return agentConfig[agentId].model;

  // 2. Otherwise use the user's global preset setting
  const preset = tenantConfig.modelPreset; // "basic" | "balanced" | "complex"

  // 3. Return the primary model for that preset
  // (taskContext is reserved for the future per-task override — see §3.3)
  return PRESETS[preset].primary;
}
```
### 3.3 Advanced Model Selection
Users with a credit card on file can select specific models per agent or per task:
| Configuration Level | Scope | Example |
|--------------------|-------|---------|
| Global preset | All agents, all tasks | "Use Balanced for everything" |
| Per-agent override | All tasks for one agent | "IT Admin uses Complex, everything else uses Balanced" |
| Per-task override (future) | Single task/conversation | "Use Claude Sonnet for this analysis" |
**Schema (Safety Wrapper config):**
```json
{
"model_routing": {
"default_preset": "balanced",
"agent_overrides": {
"it-admin": { "preset": "complex" },
"marketing": { "model": "claude-sonnet-4.6" }
},
"premium_enabled": true,
"credit_card_on_file": true
}
}
```
### 3.4 Included vs. Premium Model Routing
| Model Category | Token Pool | Billing | Credit Card Required |
|---------------|------------|---------|---------------------|
| **Included** (DeepSeek V3.2, GPT 5 Nano, GPT 5.2 Mini, MiniMax M2.5, Gemini Flash, GLM 5) | Draws from monthly allocation | Subscription covers it | No |
| **Premium** (GPT 5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3.1 Pro) | Separate — does NOT draw from pool | Per-token metered to credit card | **Yes** |
**Routing decision tree:**
```
Is the selected model Premium?
├── No → Check token pool
│ ├── Tokens remaining → Route to model
│ └── Pool exhausted → Apply overage markup, notify user, route to model
└── Yes → Check credit card
├── Card on file → Route to model, meter tokens
└── No card → Reject, prompt user to add card in settings
```
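The decision tree above can be sketched as a single function, assuming the pool balance and card status have already been resolved (the function and result type names are illustrative, not part of the Safety Wrapper API):

```typescript
// Sketch of the premium/included routing decision tree above.
// Pool state and billing side effects are simplified stand-ins;
// only the branch logic comes from the spec.
type RouteResult =
  | { action: "route"; overage: boolean; metered: boolean }
  | { action: "reject"; reason: string };

function routeModelRequest(
  isPremium: boolean,
  tokensRemaining: number,
  cardOnFile: boolean
): RouteResult {
  if (isPremium) {
    // Premium: never draws from the pool; requires a card for metering.
    return cardOnFile
      ? { action: "route", overage: false, metered: true }
      : { action: "reject", reason: "Add a credit card in settings to use premium models." };
  }
  // Included: draws from the pool; an exhausted pool switches to overage.
  return tokensRemaining > 0
    ? { action: "route", overage: false, metered: false }
    : { action: "route", overage: true, metered: false };
}
```

Note that an exhausted pool still routes — it changes billing mode rather than blocking the user, per the tree above.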
### 3.5 Token Pool Management
Each subscription tier includes a monthly token allocation:
| Tier | Monthly Tokens | Founding Member Tokens |
|------|---------------|----------------------|
| Lite (€29) | ~8M | ~16M |
| Build (€45) | ~15M | ~30M |
| Scale (€75) | ~25M | ~50M |
| Enterprise (€109) | ~40M | ~80M |
**Pool tracking implementation:**
```
On every LLM response:
1. Safety Wrapper captures token counts (input, output, cache_read, cache_write)
2. Calculates cost: tokens × model_rate × (1 + openrouter_fee)
3. Converts to "standard tokens" (normalized to DeepSeek V3.2 equivalent)
4. Decrements from monthly pool
5. Reports to Hub via usage endpoint
When pool is exhausted:
1. Safety Wrapper detects pool < 0
2. Switches to overage billing mode
3. Applies overage markup (35% for cheap models, 25% mid, 20% top included)
4. Notifies user: "Your included tokens are used up. Continuing at overage rates."
5. User can top up, upgrade tier, or wait for next billing cycle
```
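Steps 1–4 of the per-response accounting can be sketched as follows (step 5, reporting to the Hub, is omitted). The 5% OpenRouter fee and the DeepSeek V3.2 baseline rate are illustrative assumptions pulled from the surrounding tables, not authoritative billing constants:

```typescript
// Sketch of per-response pool accounting (steps 1–4 above).
// ASSUMPTIONS: baseline rate from the §4.2 table; fee is illustrative.
const STANDARD_RATE_PER_M = 0.274; // DeepSeek V3.2 input $/1M (the "standard token" baseline)
const OPENROUTER_FEE = 0.05;       // assumed 5% platform fee, for illustration only

function meterResponse(
  pool: { remaining: number },
  tokens: number,
  modelRatePerM: number
): number {
  // Step 2: cost = tokens × model_rate × (1 + openrouter_fee)
  const cost = (tokens / 1_000_000) * modelRatePerM * (1 + OPENROUTER_FEE);
  // Step 3: normalize to DeepSeek V3.2-equivalent "standard tokens"
  const standardTokens = (cost / STANDARD_RATE_PER_M) * 1_000_000;
  // Step 4: decrement the monthly pool
  pool.remaining -= standardTokens;
  return standardTokens;
}
```

A response billed at exactly the baseline rate decrements the pool by slightly more than its raw token count, because the platform fee is folded in before normalization.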
### 3.6 Fallback Chain Logic
When the primary model fails, the router attempts fallbacks before giving up.
**Failure types and responses:**
| Failure | First Response | Second Response | Third Response |
|---------|---------------|-----------------|----------------|
| Rate limited (429) | Rotate auth profile (different OpenRouter key) | Wait 5s, retry same model | Fall back to next model in chain |
| Model unavailable (503) | Fall back to next model in chain immediately | Continue down chain | Return error to agent |
| Context too long | Truncate and retry | Fall back to model with larger context (Gemini Flash: 1M) | Return error suggesting context compaction |
| Timeout (>60s) | Retry once | Fall back to faster model | Return timeout error |
| Auth error (401/403) | Rotate auth profile | Retry with Hub-synced key | Return auth error, notify admin |
**Auth profile rotation:** OpenClaw natively supports multiple auth profiles per model provider. Before falling back to a different model, the router first tries rotating to a different API key for the same model. This handles per-key rate limits.
```json
{
"providers": {
"openrouter": {
"auth_profiles": [
{ "id": "primary", "key": "SECRET_REF(openrouter_key_1)" },
{ "id": "secondary", "key": "SECRET_REF(openrouter_key_2)" },
{ "id": "tertiary", "key": "SECRET_REF(openrouter_key_3)" }
],
"rotation_strategy": "round-robin-on-failure"
}
}
}
```
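The rotation rule reduces to a nested loop: exhaust every key for the current model before moving down the chain. This sketch collapses the per-failure-type handling in the table above into the simplest retry shape; `callModel` is a hypothetical stand-in for the actual OpenRouter request and throws on any failure:

```typescript
// Sketch of "rotate keys before falling back" (the rule stated above).
// Real behavior differs per failure type (e.g. a 503 skips rotation);
// this shows only the ordering: keys first, then chain.
function attemptWithRotation(
  model: string,
  authProfiles: string[],
  fallbacks: string[],
  callModel: (model: string, profile: string) => string
): string {
  for (const candidate of [model, ...fallbacks]) {
    // Exhaust every auth profile for this model before falling back.
    for (const profile of authProfiles) {
      try {
        return callModel(candidate, profile);
      } catch {
        // rate limited / failed with this key — rotate to the next
      }
    }
  }
  throw new Error("All models and auth profiles exhausted");
}
```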
### 3.7 Fallback Chain Definitions
| Starting Model | Fallback 1 | Fallback 2 | Fallback 3 (emergency) |
|---------------|------------|------------|----------------------|
| DeepSeek V3.2 | MiniMax M2.5 | GPT 5 Nano | — (return error) |
| GPT 5 Nano | DeepSeek V3.2 | — | — |
| GLM 5 | MiniMax M2.5 | DeepSeek V3.2 | GPT 5 Nano |
| MiniMax M2.5 | DeepSeek V3.2 | GPT 5 Nano | — |
| Gemini Flash | DeepSeek V3.2 | GPT 5 Nano | — |
| GPT 5.2 (premium) | GLM 5 (included) | DeepSeek V3.2 | — |
| Claude Sonnet 4.6 (premium) | GPT 5.2 (premium) | GLM 5 | DeepSeek V3.2 |
| Claude Opus 4.6 (premium) | Claude Sonnet 4.6 | GPT 5.2 | GLM 5 |
**Cross-category fallback rule:** Premium models can fall back to included models, but included models never fall "up" to premium models (that would charge the user unexpectedly).
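The cross-category rule is straightforward to enforce as a chain validator at config-load time. The premium list mirrors §3.4; `validateFallbackChain` is an illustrative name, not an existing function:

```typescript
// Sketch of the cross-category rule: included models never fall "up"
// to premium. Premium model ids mirror §3.4 / §7.1.
const PREMIUM = new Set([
  "openai/gpt-5.2",
  "google/gemini-3.1-pro",
  "anthropic/claude-sonnet-4.6",
  "anthropic/claude-opus-4.6",
]);

function validateFallbackChain(start: string, chain: string[]): boolean {
  // A premium starting model may fall back anywhere;
  // an included model may only fall back to other included models.
  if (PREMIUM.has(start)) return true;
  return chain.every((m) => !PREMIUM.has(m));
}
```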
---
## 4. Prompt Caching Strategy
### 4.1 Cache Architecture
OpenClaw's prompt caching with `cacheRetention: "long"` (1-hour TTL) is the primary cost optimization. The SOUL.md is the cacheable prefix.
```
┌───────────────────────────────────────────────────┐
│ Cached Prefix (1-hour TTL)                        │
│ ┌───────────────────┐ ┌─────────────────────────┐ │
│ │ SOUL.md (~3K tok) │ │ Tool Registry (~3K tok) │ │
│ └───────────────────┘ └─────────────────────────┘ │
└───────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────┐
│ Dynamic Content (not cached)                      │
│ ┌──────────────┐ ┌────────────────────────┐       │
│ │ Conversation │ │ Tool results / context │       │
│ └──────────────┘ └────────────────────────┘       │
└───────────────────────────────────────────────────┘
```
### 4.2 Cache Savings by Model
| Model | Standard Input/1M | Cache Read/1M | Savings % | Monthly Savings (20M tokens, 60% cache hit) |
|-------|-------------------|---------------|-----------|---------------------------------------------|
| DeepSeek V3.2 | $0.274 | $0.211 | 23% | $0.76 |
| GPT 5 Nano | $0.053 | $0.005 | 91% | $0.58 |
| Gemini Flash | $0.528 | $0.053 | 90% | $5.70 |
| GLM 5 | $1.002 | $0.211 | 79% | $9.49 |
| Claude Sonnet 4.6 | $3.165 | $0.317 | 90% | $34.18 |
### 4.3 Heartbeat Keep-Warm
To maximize cache hit rates, the Safety Wrapper sends a heartbeat every 55 minutes (just under the 1-hour TTL). This keeps the SOUL.md prefix in cache without a real user interaction.
**Config:**
```json
{
"heartbeat": {
"every": "55m"
}
}
```
**Cost of keep-warm:** One minimal prompt per agent every 55 minutes = ~5K tokens × 24 turns/day × 5 agents = ~600K tokens/day. At DeepSeek V3.2 cache read rates: ~$0.13/day ($3.80/month). This is offset by the cache savings on real interactions.
**Decision: Only keep-warm agents that were active in the last 24 hours.** No point warming cache for agents the user hasn't talked to. The Safety Wrapper tracks `lastActiveAt` per agent and only sends heartbeats for recently active agents.
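That filter can be sketched directly from the decision: the 24-hour threshold matches `active_agent_threshold_hours` in §7.1, and the function name is illustrative:

```typescript
// Sketch of the "only warm recently active agents" filter.
// Timestamps are epoch milliseconds; threshold matches §7.1 config.
const ACTIVE_THRESHOLD_MS = 24 * 60 * 60 * 1000;

function agentsToKeepWarm(
  lastActiveAt: Record<string, number>,
  now: number
): string[] {
  return Object.entries(lastActiveAt)
    .filter(([, ts]) => now - ts <= ACTIVE_THRESHOLD_MS)
    .map(([agent]) => agent);
}
```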
### 4.4 Cache Invalidation
Cache is invalidated when:
| Event | Action | Impact |
|-------|--------|--------|
| SOUL.md content changes | Cache miss on next turn, re-cache | One-time cost (~6K tokens at full input rate) |
| Tool registry changes (new tool installed) | Cache miss on next turn, re-cache | One-time cost |
| Model changed (user switches preset) | New model has fresh cache | Cache builds from scratch for new model |
| 1-hour TTL expires without heartbeat | Cache expires naturally | Re-cache on next interaction |
---
## 5. Load Balancing & Rate Limiting
### 5.1 Per-Tenant Rate Limits
Each tenant VPS has built-in rate limiting to prevent runaway token consumption:
| Limit | Default | Configurable | Purpose |
|-------|---------|-------------|---------|
| Max concurrent agent turns | 1 | No (OpenClaw default) | Prevent race conditions |
| Max tool calls per turn | 50 | Yes (Safety Wrapper) | Prevent infinite loops |
| Max tokens per single turn | 100K | Yes (Safety Wrapper) | Prevent context explosion |
| Max turns per hour per agent | 60 | Yes (Safety Wrapper) | Prevent runaway automation |
| Max API calls to Hub per minute | 10 | No | Prevent Hub overload |
| OpenRouter requests per minute | Per OpenRouter's limits | No | External rate limit |
### 5.2 Loop Detection
OpenClaw's native loop detection (`tools.loopDetection.enabled: true`) prevents agents from calling the same tool repeatedly without making progress. The Safety Wrapper adds a secondary check:
```
If agent calls the same tool with the same parameters 3 times in 60 seconds:
→ Block the call
→ Log warning
→ Notify user: "The AI appears to be stuck in a loop. I've paused it."
```
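This secondary check can be sketched as a sliding-window counter keyed on tool name plus serialized parameters. The threshold and window match `loop_detection_threshold` and `loop_detection_window_seconds` in §7.1; the class name and method are illustrative:

```typescript
// Sketch of the Safety Wrapper's secondary loop check: block the
// 3rd identical tool call inside a 60-second window.
class LoopDetector {
  private seen = new Map<string, number[]>();
  constructor(
    private threshold = 3,        // matches loop_detection_threshold
    private windowMs = 60_000     // matches loop_detection_window_seconds
  ) {}

  // Returns true if this call should be blocked.
  shouldBlock(tool: string, params: string, now: number): boolean {
    const key = `${tool}:${params}`;
    // Drop timestamps that have aged out of the window, then record this call.
    const times = (this.seen.get(key) ?? []).filter((t) => now - t < this.windowMs);
    times.push(now);
    this.seen.set(key, times);
    return times.length >= this.threshold;
  }
}
```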
### 5.3 Token Budget Alerts
The Safety Wrapper monitors token consumption and alerts proactively:
| Pool Usage | Action |
|-----------|--------|
| 50% consumed | No action (tracked internally) |
| 75% consumed | Notify user: "You've used 75% of your monthly tokens" |
| 90% consumed | Notify user with upgrade suggestion |
| 100% consumed | Switch to overage mode, notify user with cost estimate |
| 150% of pool (overage) | Strong warning, suggest reviewing which agents are most active |
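The alert ladder above reduces to a single threshold function. The messages paraphrase the table; the exact wording and function name are illustrative:

```typescript
// Sketch of the proactive alert ladder. Returns the message for the
// highest threshold crossed, or null below 75% (tracked internally only).
function budgetAlert(used: number, pool: number): string | null {
  const pct = used / pool;
  if (pct >= 1.5) return "Strong warning: review your most active agents.";
  if (pct >= 1.0) return "Pool exhausted: switching to overage mode.";
  if (pct >= 0.9) return "90% of monthly tokens used. Consider upgrading.";
  if (pct >= 0.75) return "You've used 75% of your monthly tokens.";
  return null;
}
```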
---
## 6. Monitoring & Observability
### 6.1 Metrics Collected
The Safety Wrapper collects and reports these metrics to the Hub via heartbeat:
| Metric | Granularity | Used For |
|--------|------------|----------|
| Tokens per agent per model per hour | Hourly buckets | Billing, usage dashboards |
| Model selection frequency | Per request | Optimize default presets |
| Fallback trigger count | Per hour | Monitor model reliability |
| Cache hit rate | Per agent per hour | Cost optimization tracking |
| Agent routing decisions | Per request | Dispatcher accuracy tracking |
| Tool call count per agent | Per hour | Identify heavy automation |
| Approval queue latency | Per request | UX optimization |
| Error rate per model | Per hour | Model health monitoring |
### 6.2 Dashboard Views (Hub)
**Admin dashboard (staff):**
- Global token usage heatmap (all tenants)
- Model usage distribution pie chart
- Fallback frequency by model (alerts if >5% in any hour)
- Revenue per model (included vs. premium vs. overage)
**Customer dashboard:**
- "Your AI Team This Month" — token usage by agent, visualized as a bar chart
- Model usage breakdown (which models are being used)
- Pool status gauge (% remaining)
- Cost breakdown (included vs. overage vs. premium)
- "Most Active Agent" — who's doing the most work
---
## 7. Configuration Reference
### 7.1 Model Routing Config (`model-routing.json`)
```json
{
"presets": {
"basic": {
"name": "Basic Tasks",
"models": ["openai/gpt-5-nano", "google/gemini-3-flash-preview", "deepseek/deepseek-v3.2"],
"description": "Quick lookups, simple drafts, data entry"
},
"balanced": {
"name": "Balanced",
"models": ["deepseek/deepseek-v3.2", "minimax/minimax-m2.5", "openai/gpt-5-nano"],
"description": "Day-to-day operations, routine tasks"
},
"complex": {
"name": "Complex Tasks",
"models": ["zhipu/glm-5", "minimax/minimax-m2.5", "deepseek/deepseek-v3.2"],
"description": "Analysis, multi-step reasoning, reports"
}
},
"premium_models": [
"openai/gpt-5.2",
"google/gemini-3.1-pro",
"anthropic/claude-sonnet-4.6",
"anthropic/claude-opus-4.6"
],
"included_models": [
"deepseek/deepseek-v3.2",
"openai/gpt-5-nano",
"openai/gpt-5.2-mini",
"minimax/minimax-m2.5",
"google/gemini-3-flash-preview",
"zhipu/glm-5"
],
"fallback_chains": {
"deepseek/deepseek-v3.2": ["minimax/minimax-m2.5", "openai/gpt-5-nano"],
"openai/gpt-5-nano": ["deepseek/deepseek-v3.2"],
"zhipu/glm-5": ["minimax/minimax-m2.5", "deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
"minimax/minimax-m2.5": ["deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
"google/gemini-3-flash-preview": ["deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
"openai/gpt-5.2": ["zhipu/glm-5", "deepseek/deepseek-v3.2"],
"anthropic/claude-sonnet-4.6": ["openai/gpt-5.2", "zhipu/glm-5", "deepseek/deepseek-v3.2"],
"anthropic/claude-opus-4.6": ["anthropic/claude-sonnet-4.6", "openai/gpt-5.2", "zhipu/glm-5"]
},
"overage_markup": {
"cheap": { "threshold_max": 0.50, "markup": 0.35 },
"mid": { "threshold_max": 1.20, "markup": 0.25 },
"top": { "threshold_max": 999, "markup": 0.20 }
},
"premium_markup": {
"default": 0.10,
"overrides": {
"anthropic/claude-opus-4.6": 0.08
}
},
"rate_limits": {
"max_tool_calls_per_turn": 50,
"max_tokens_per_turn": 100000,
"max_turns_per_hour_per_agent": 60,
"loop_detection_threshold": 3,
"loop_detection_window_seconds": 60
},
"caching": {
"retention": "long",
"heartbeat_interval": "55m",
"warmup_only_active_agents": true,
"active_agent_threshold_hours": 24
}
}
```
### 7.2 OpenRouter Provider Config (`openclaw.json` excerpt)
```json
{
"providers": {
"openrouter": {
"base_url": "http://127.0.0.1:8100/v1",
"auth_profiles": [
{ "id": "primary", "key": "SECRET_REF(openrouter_key_1)" },
{ "id": "secondary", "key": "SECRET_REF(openrouter_key_2)" }
],
"rotation_strategy": "round-robin-on-failure",
"timeout_ms": 60000,
"retry": {
"max_attempts": 3,
"backoff_ms": [1000, 5000, 15000]
}
}
},
"model": {
"primary": "deepseek/deepseek-v3.2",
"fallback": ["minimax/minimax-m2.5", "openai/gpt-5-nano"]
}
}
```
Note: `base_url` points to the local secrets proxy (`127.0.0.1:8100`) which handles credential injection and outbound redaction before forwarding to OpenRouter's actual API.
---
## 8. Implementation Priorities
| Priority | Component | Effort | Dependencies |
|----------|-----------|--------|-------------|
| P0 | Basic preset routing (3 presets → model selection) | 1 week | Safety Wrapper skeleton |
| P0 | Fallback chain with auth rotation | 1 week | OpenRouter integration |
| P0 | Token metering and pool tracking | 2 weeks | Hub billing endpoints |
| P1 | Agent routing (Dispatcher SOUL.md) | 1 week | SOUL.md templates |
| P1 | Prompt caching with heartbeat keep-warm | 3 days | OpenClaw caching config |
| P1 | Loop detection (Safety Wrapper layer) | 3 days | Safety Wrapper hooks |
| P2 | Per-agent model overrides | 3 days | Hub agent config UI |
| P2 | Premium model gating (credit card check) | 1 week | Hub billing + Stripe |
| P2 | Token budget alerts | 3 days | Hub notification system |
| P3 | Multi-agent coordination (parallel delegation) | 2 weeks | Agent-to-agent messaging |
| P3 | Per-task model override (future) | 1 week | Conversation context detection |
| P3 | Customer usage dashboard | 1 week | Hub frontend |
---
## 9. Changelog
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-02-26 | Initial spec. Agent routing via Dispatcher. Model presets (Basic/Balanced/Complex). Fallback chains with auth rotation. Token pool management. Prompt caching strategy. Rate limiting. Configuration reference. Implementation priorities. |