
LetsBe Biz — Dispatcher Routing Logic

Version: 1.0
Date: February 26, 2026
Authors: Matt (Founder), Claude (Architecture)
Status: Engineering Spec — Ready for Implementation
Companion docs: Technical Architecture v1.2, Pricing Model v2.2, SOUL.md Content Spec v1.0
Decision refs: Foundation Document Decisions #33, #35, #41


1. Purpose

This document specifies two routing systems that are central to LetsBe Biz:

  1. Agent Routing (Dispatcher) — How user messages are routed to the correct AI agent
  2. Model Routing — How AI requests are routed to the optimal LLM model based on task complexity, user settings, and cost constraints

Both routing systems live in the Safety Wrapper extension and operate transparently — users interact with "their AI team," not with routing logic.


2. Agent Routing (Dispatcher Logic)

2.1 Architecture

The Dispatcher is an OpenClaw agent configured with agentToAgent communication enabled. It uses the messaging tool profile and serves as the default entry point for all user messages.

User Message
    │
    ▼
Dispatcher Agent (SOUL.md: routing rules)
    │
    ├── Simple / cross-domain → Handle directly
    │
    ├── Infrastructure → delegate to IT Admin
    ├── Content / analytics → delegate to Marketing
    ├── Scheduling / comms → delegate to Secretary
    ├── CRM / pipeline → delegate to Sales
    │
    └── Multi-domain → coordinate across agents

2.2 Routing Decision Matrix

The Dispatcher routes based on intent classification. OpenClaw's native agent routing handles this through the Dispatcher's SOUL.md instructions — no separate classification model is needed.

| Signal | Routes To | Examples |
|---|---|---|
| Infrastructure keywords | IT Admin | "restart", "container", "backup", "disk", "server", "install", "update", "nginx", "Docker", "SSL", "certificate", "Keycloak", "Portainer" |
| Content/analytics keywords | Marketing | "blog", "post", "newsletter", "campaign", "analytics", "traffic", "subscribers", "Ghost", "Listmonk", "Umami", "SEO" |
| Scheduling/comms keywords | Secretary | "calendar", "meeting", "schedule", "email", "respond", "follow up", "Chatwoot", "Cal.com", "appointment", "reminder" |
| CRM/sales keywords | Sales | "lead", "opportunity", "pipeline", "CRM", "deal", "prospect", "follow-up", "Odoo", "quote", "proposal" |
| System questions | Dispatcher (self) | "what can you do", "how does this work", "what tools do I have", "help", "status", "summary" |
| Multi-domain | Dispatcher coordinates | "morning briefing", "give me a weekly summary", "how's business", "prepare for my meeting with [client]" |

2.3 Delegation Protocol

When the Dispatcher delegates to a specialist agent, it uses OpenClaw's native agent-to-agent messaging:

1. Dispatcher receives user message
2. Dispatcher identifies the target agent
3. Dispatcher sends structured delegation message:
   {
     "to": "it-admin",
     "context": "User requests: 'Why is Nextcloud slow?'",
     "expectation": "Diagnose and report. If action needed, get user approval."
   }
4. Target agent receives message, executes task
5. Target agent returns result to Dispatcher
6. Dispatcher formats and presents result to user
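The round-trip above can be sketched as a single helper. This is a minimal illustration, not OpenClaw's actual API: `sendToAgent` is a hypothetical wrapper around the native agent-to-agent messaging, and the `[agent]` prefix stands in for whatever formatting the Dispatcher applies in step 6.

```javascript
// Sketch of the delegation round-trip (steps 3-6), assuming a hypothetical
// sendToAgent() wrapper around OpenClaw's agent-to-agent messaging.
async function delegate(targetAgent, userMessage, expectation, sendToAgent) {
  // Step 3: build the structured delegation message
  const delegation = {
    to: targetAgent,
    context: `User requests: '${userMessage}'`,
    expectation,
  };
  // Steps 4-5: the specialist executes and returns its result
  const result = await sendToAgent(delegation);
  // Step 6: the Dispatcher formats the result before presenting it
  return `[${targetAgent}] ${result}`;
}
```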

2.4 Multi-Agent Coordination

For tasks spanning multiple agents, the Dispatcher acts as coordinator:

Example: "Prepare for my call with Acme Corp tomorrow"

  1. Dispatcher identifies subtasks:
    • Secretary: Pull calendar details, recent email threads with Acme
    • Sales: Pull CRM record, pipeline status, last interaction
    • Marketing: Check if Acme visited the website recently (Umami)
  2. Dispatcher delegates each subtask in parallel (or sequential if dependencies exist)
  3. Dispatcher compiles results into a unified briefing
  4. Dispatcher presents the briefing to the user
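When subtasks have no dependencies, the fan-out in step 2 is a straightforward parallel dispatch. The sketch below assumes a hypothetical `delegateTo(agent, task)` messaging wrapper; the briefing format is illustrative.

```javascript
// Sketch of parallel delegation (step 2) and compilation (step 3).
// delegateTo() is a hypothetical wrapper around agent-to-agent messaging.
async function compileBriefing(subtasks, delegateTo) {
  // subtasks: [{ agent: "secretary", task: "Pull calendar details" }, ...]
  const results = await Promise.all(
    subtasks.map(({ agent, task }) => delegateTo(agent, task))
  );
  // One briefing section per agent, in subtask order
  return subtasks
    .map(({ agent }, i) => `## ${agent}\n${results[i]}`)
    .join("\n\n");
}
```

Sequential execution (for dependent subtasks) would replace `Promise.all` with an ordered `for...of` loop, feeding earlier results into later delegations.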

2.5 Fallback Behavior

| Scenario | Behavior |
|---|---|
| Target agent unavailable (crashed/restarting) | Dispatcher notifies user, suggests IT Admin investigate |
| Ambiguous request | Dispatcher makes best judgment, routes, tells user who's handling it |
| User explicitly names an agent | Route directly ("Tell the IT Admin to restart Ghost") |
| Request is outside all agent capabilities | Dispatcher explains honestly what's possible and what isn't |
| Agent returns an error | Dispatcher reports the error to the user and suggests next steps |

3. Model Routing

3.1 Architecture

Model routing determines which LLM processes each agent turn. The Safety Wrapper's before_prompt_build hook (or the outbound secrets proxy) controls which model endpoint the request is sent to.

Agent Turn
    │
    ▼
Safety Wrapper: Model Router
    │
    ├── Check user's model setting (Basic / Balanced / Complex / Specific Model)
    ├── Check if premium model → verify credit card on file
    ├── Check token pool → enough tokens remaining?
    │
    ▼
Route to OpenRouter endpoint
    │
    ├── Primary model → attempt
    ├── If rate limited → try auth profile rotation (same model, different key)
    ├── If still failing → fallback to next model in chain
    │
    ▼
Response → Token metering → Return to agent

3.2 Model Presets (Basic Settings)

Users who don't want to think about models pick a preset. Each preset maps to a prioritized model chain.

| Preset | Primary Model | Fallback 1 | Fallback 2 | Blended Cost/1M | Use Case |
|---|---|---|---|---|---|
| Basic Tasks | GPT 5 Nano | Gemini 3 Flash Preview | DeepSeek V3.2 | $0.20–1.58 | Quick lookups, formatting, simple drafts |
| Balanced (default) | DeepSeek V3.2 | MiniMax M2.5 | GPT 5 Nano | $0.20–0.70 | Daily operations, routine agent work |
| Complex Tasks | GLM 5 | MiniMax M2.5 | DeepSeek V3.2 | $0.33–1.68 | Analysis, multi-step reasoning, reports |

Preset assignment logic:

function resolveModel(agentId, taskContext) {
  // 1. Check for an agent-specific model override (optional chaining guards
  //    against agents that have no config entry at all)
  const agentModel = agentConfig[agentId]?.model;
  if (agentModel) return agentModel;

  // 2. Check the user's global preset setting
  const preset = tenantConfig.modelPreset; // "basic" | "balanced" | "complex"

  // 3. Return the primary model for that preset
  return PRESETS[preset].primary;
}

3.3 Advanced Model Selection

Users with a credit card on file can select specific models per agent or per task:

| Configuration Level | Scope | Example |
|---|---|---|
| Global preset | All agents, all tasks | "Use Balanced for everything" |
| Per-agent override | All tasks for one agent | "IT Admin uses Complex, everything else uses Balanced" |
| Per-task override (future) | Single task/conversation | "Use Claude Sonnet for this analysis" |

Schema (Safety Wrapper config):

{
  "model_routing": {
    "default_preset": "balanced",
    "agent_overrides": {
      "it-admin": { "preset": "complex" },
      "marketing": { "model": "claude-sonnet-4.6" }
    },
    "premium_enabled": true,
    "credit_card_on_file": true
  }
}
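Resolving a model against this schema follows a simple precedence: an explicit per-agent `model` pin wins, then a per-agent `preset`, then the global `default_preset`. The sketch below encodes that order; `PRESET_PRIMARY` reflects the primary model of each preset chain from Section 3.2, and the exact precedence is an assumption consistent with that section.

```javascript
// Primary model per preset (first entry of each chain in Section 3.2)
const PRESET_PRIMARY = {
  basic: "openai/gpt-5-nano",
  balanced: "deepseek/deepseek-v3.2",
  complex: "zhipu/glm-5",
};

// Sketch of override resolution against the model_routing schema above
function resolveAgentModel(routing, agentId) {
  const override = routing.agent_overrides?.[agentId];
  if (override?.model) return override.model;                   // exact model pin
  if (override?.preset) return PRESET_PRIMARY[override.preset]; // per-agent preset
  return PRESET_PRIMARY[routing.default_preset];                // global default
}
```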

3.4 Included vs. Premium Model Routing

| Model Category | Token Pool | Billing | Credit Card Required |
|---|---|---|---|
| Included (DeepSeek V3.2, GPT 5 Nano, GPT 5.2 Mini, MiniMax M2.5, Gemini Flash, GLM 5) | Draws from monthly allocation | Subscription covers it | No |
| Premium (GPT 5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3.1 Pro) | Separate — does NOT draw from pool | Per-token metered to credit card | Yes |

Routing decision tree:

Is the selected model Premium?
├── No → Check token pool
│   ├── Tokens remaining → Route to model
│   └── Pool exhausted → Apply overage markup, notify user, route to model
│
└── Yes → Check credit card
    ├── Card on file → Route to model, meter tokens
    └── No card → Reject, prompt user to add card in settings
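The decision tree above reduces to a small pure function. This is a simplified sketch: the return shapes, `state` fields, and notification handling are illustrative, not the Safety Wrapper's actual interface.

```javascript
// Sketch of the premium/pool routing decision (Section 3.4)
function routeBilling(model, state) {
  const isPremium = state.premiumModels.includes(model);
  if (isPremium) {
    if (!state.creditCardOnFile) {
      return { allow: false, reason: "add-card" }; // prompt user in settings
    }
    return { allow: true, billing: "metered" };    // per-token to card
  }
  if (state.tokensRemaining > 0) {
    return { allow: true, billing: "pool" };       // draw from allocation
  }
  return { allow: true, billing: "overage" };      // markup applies, user notified
}
```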

3.5 Token Pool Management

Each subscription tier includes a monthly token allocation:

| Tier | Monthly Tokens | Founding Member Tokens |
|---|---|---|
| Lite (€29) | ~8M | ~16M |
| Build (€45) | ~15M | ~30M |
| Scale (€75) | ~25M | ~50M |
| Enterprise (€109) | ~40M | ~80M |

Pool tracking implementation:

On every LLM response:
  1. Safety Wrapper captures token counts (input, output, cache_read, cache_write)
  2. Calculates cost: tokens × model_rate × (1 + openrouter_fee)
  3. Converts to "standard tokens" (normalized to DeepSeek V3.2 equivalent)
  4. Decrements from monthly pool
  5. Reports to Hub via usage endpoint

When pool is exhausted:
  1. Safety Wrapper detects pool < 0
  2. Switches to overage billing mode
  3. Applies overage markup (35% for cheap models, 25% mid, 20% top included)
  4. Notifies user: "Your included tokens are used up. Continuing at overage rates."
  5. User can top up, upgrade tier, or wait for next billing cycle
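Steps 2–4 of pool tracking can be sketched as follows. The 5.5% OpenRouter fee is an illustrative assumption (it is a parameter here), and the normalization baseline is DeepSeek V3.2's standard input rate from Section 4.2: a model twice as expensive decrements twice as many "standard tokens" per raw token.

```javascript
// DeepSeek V3.2 standard input rate, $/1M tokens (Section 4.2 baseline)
const BASELINE_RATE = 0.274;

// Sketch of steps 2-4: cost calculation, normalization, pool decrement.
// openrouterFee of 5.5% is an assumed placeholder value.
function meterTurn(pool, tokens, modelRate, openrouterFee = 0.055) {
  const cost = (tokens / 1e6) * modelRate * (1 + openrouterFee); // step 2: USD
  const standardTokens = tokens * (modelRate / BASELINE_RATE);   // step 3: normalize
  return { cost, remaining: pool - standardTokens };             // step 4: decrement
}
```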

3.6 Fallback Chain Logic

When the primary model fails, the router attempts fallbacks before giving up.

Failure types and responses:

| Failure | First Response | Second Response | Third Response |
|---|---|---|---|
| Rate limited (429) | Rotate auth profile (different OpenRouter key) | Wait 5s, retry same model | Fall to next model in chain |
| Model unavailable (503) | Fall to next model in chain immediately | Continue down chain | Return error to agent |
| Context too long | Truncate and retry | Fall to model with larger context (Gemini Flash: 1M) | Return error suggesting context compaction |
| Timeout (>60s) | Retry once | Fall to faster model | Return timeout error |
| Auth error (401/403) | Rotate auth profile | Retry with Hub-synced key | Return auth error, notify admin |

Auth profile rotation: OpenClaw natively supports multiple auth profiles per model provider. Before falling back to a different model, the router first tries rotating to a different API key for the same model. This handles per-key rate limits.

{
  "providers": {
    "openrouter": {
      "auth_profiles": [
        { "id": "primary", "key": "SECRET_REF(openrouter_key_1)" },
        { "id": "secondary", "key": "SECRET_REF(openrouter_key_2)" },
        { "id": "tertiary", "key": "SECRET_REF(openrouter_key_3)" }
      ],
      "rotation_strategy": "round-robin-on-failure"
    }
  }
}
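The retry ladder (rotate keys on the same model first, then descend the fallback chain) can be sketched as two nested loops. This is a simplified illustration: `callModel` is a hypothetical request function that throws with an `err.code` status, and only 429/401 trigger key rotation here; other errors skip straight to the next model, per the failure table above.

```javascript
// Sketch: try every auth profile on the current model before falling back.
// callModel(model, profile) is a hypothetical request function.
async function attemptWithRotation(model, chain, profiles, callModel) {
  for (const candidate of [model, ...chain]) {
    for (const profile of profiles) {
      try {
        return await callModel(candidate, profile); // first success wins
      } catch (err) {
        // 429/401: per-key problem, rotate to the next key on this model;
        // anything else (e.g. 503): abandon this model, descend the chain
        if (err.code !== 429 && err.code !== 401) break;
      }
    }
  }
  throw new Error("All models and auth profiles exhausted");
}
```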

3.7 Fallback Chain Definitions

| Starting Model | Fallback 1 | Fallback 2 | Fallback 3 (emergency) |
|---|---|---|---|
| DeepSeek V3.2 | MiniMax M2.5 | GPT 5 Nano | — (return error) |
| GPT 5 Nano | DeepSeek V3.2 | — | — |
| GLM 5 | MiniMax M2.5 | DeepSeek V3.2 | GPT 5 Nano |
| MiniMax M2.5 | DeepSeek V3.2 | GPT 5 Nano | — |
| Gemini Flash | DeepSeek V3.2 | GPT 5 Nano | — |
| GPT 5.2 (premium) | GLM 5 (included) | DeepSeek V3.2 | — |
| Claude Sonnet 4.6 (premium) | GPT 5.2 (premium) | GLM 5 | DeepSeek V3.2 |
| Claude Opus 4.6 (premium) | Claude Sonnet 4.6 | GPT 5.2 | GLM 5 |

Cross-category fallback rule: Premium models can fall back to included models, but included models never fall "up" to premium models (that would charge the user unexpectedly).
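This rule is easy to enforce as a config-time check over the `fallback_chains` map (Section 7.1). A minimal sketch:

```javascript
// Sketch of a config-time validator for the cross-category rule:
// no chain starting at an included model may contain a premium model.
function validateChains(chains, premiumModels) {
  const violations = [];
  for (const [start, chain] of Object.entries(chains)) {
    if (premiumModels.includes(start)) continue; // premium may fall anywhere
    for (const m of chain) {
      if (premiumModels.includes(m)) violations.push(`${start} -> ${m}`);
    }
  }
  return violations;
}
```

Running this at Safety Wrapper startup (and in CI against `model-routing.json`) would catch a misconfigured chain before it can charge a user unexpectedly.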


4. Prompt Caching Strategy

4.1 Cache Architecture

OpenClaw's prompt caching with cacheRetention: "long" (1-hour TTL) is the primary cost optimization. The SOUL.md is the cacheable prefix.

┌─────────────────────────────────────────────────┐
│ Cached Prefix (1-hour TTL)                       │
│ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ SOUL.md (~3K tok) │ │ Tool Registry (~3K tok) │ │
│ └──────────────────┘ └────────────────────────┘ │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Dynamic Content (not cached)                     │
│ ┌────────────────┐ ┌──────────────────────────┐ │
│ │ Conversation    │ │ Tool results / context   │ │
│ └────────────────┘ └──────────────────────────┘ │
└─────────────────────────────────────────────────┘

4.2 Cache Savings by Model

| Model | Standard Input/1M | Cache Read/1M | Savings % | Monthly Savings (20M tokens, 60% cache hit) |
|---|---|---|---|---|
| DeepSeek V3.2 | $0.274 | $0.211 | 23% | $0.76 |
| GPT 5 Nano | $0.053 | $0.005 | 91% | $0.58 |
| Gemini Flash | $0.528 | $0.053 | 90% | $5.70 |
| GLM 5 | $1.002 | $0.211 | 79% | $9.49 |
| Claude Sonnet 4.6 | $3.165 | $0.317 | 90% | $34.18 |

4.3 Heartbeat Keep-Warm

To maximize cache hit rates, the Safety Wrapper sends a heartbeat every 55 minutes (just under the 1-hour TTL). This keeps the SOUL.md prefix in cache without a real user interaction.

Config:

{
  "heartbeat": {
    "every": "55m"
  }
}

Cost of keep-warm: One minimal prompt per agent every 55 minutes = ~5K tokens × 24 turns/day × 5 agents = ~600K tokens/day. At DeepSeek V3.2 cache read rates: ~$0.13/day ($3.80/month). This is offset by the cache savings on real interactions.

Decision: Only keep-warm agents that were active in the last 24 hours. No point warming cache for agents the user hasn't talked to. The Safety Wrapper tracks lastActiveAt per agent and only sends heartbeats for recently active agents.
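The activity filter is a one-liner over the tracked timestamps. In this sketch, `lastActiveAt` is assumed to be an epoch-milliseconds value per agent; the field and list shapes are illustrative.

```javascript
// Sketch of the keep-warm filter: only agents active within the threshold
// window (default 24h) receive heartbeats.
function agentsToWarm(agents, now, thresholdHours = 24) {
  const cutoff = now - thresholdHours * 3600 * 1000;
  return agents
    .filter((a) => a.lastActiveAt >= cutoff) // active recently enough
    .map((a) => a.id);
}
```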

4.4 Cache Invalidation

Cache is invalidated when:

| Event | Action | Impact |
|---|---|---|
| SOUL.md content changes | Cache miss on next turn, re-cache | One-time cost (~6K tokens at full input rate) |
| Tool registry changes (new tool installed) | Cache miss on next turn, re-cache | One-time cost |
| Model changed (user switches preset) | New model has fresh cache | Cache builds from scratch for new model |
| 1-hour TTL expires without heartbeat | Cache expires naturally | Re-cache on next interaction |

5. Load Balancing & Rate Limiting

5.1 Per-Tenant Rate Limits

Each tenant VPS has built-in rate limiting to prevent runaway token consumption:

| Limit | Default | Configurable | Purpose |
|---|---|---|---|
| Max concurrent agent turns | 1 | No (OpenClaw default) | Prevent race conditions |
| Max tool calls per turn | 50 | Yes (Safety Wrapper) | Prevent infinite loops |
| Max tokens per single turn | 100K | Yes (Safety Wrapper) | Prevent context explosion |
| Max turns per hour per agent | 60 | Yes (Safety Wrapper) | Prevent runaway automation |
| Max API calls to Hub per minute | 10 | No | Prevent Hub overload |
| OpenRouter requests per minute | Per OpenRouter's limits | No | External rate limit |

5.2 Loop Detection

OpenClaw's native loop detection (tools.loopDetection.enabled: true) prevents agents from calling the same tool repeatedly without making progress. The Safety Wrapper adds a secondary check:

If agent calls the same tool with the same parameters 3 times in 60 seconds:
  → Block the call
  → Log warning
  → Notify user: "The AI appears to be stuck in a loop. I've paused it."
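A minimal sketch of that sliding-window check, assuming the Safety Wrapper can intercept tool calls before execution (the guard interface is illustrative):

```javascript
// Sketch of the secondary loop check: track (tool, params) call signatures
// in a sliding 60s window; the third identical call is blocked.
function makeLoopGuard(threshold = 3, windowMs = 60_000) {
  const calls = [];
  return function allow(tool, params, now) {
    const sig = tool + ":" + JSON.stringify(params);
    // Evict entries older than the window
    while (calls.length && now - calls[0].at > windowMs) calls.shift();
    const repeats = calls.filter((c) => c.sig === sig).length;
    calls.push({ sig, at: now });
    return repeats + 1 < threshold; // false => block, log, notify user
  };
}
```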

5.3 Token Budget Alerts

The Safety Wrapper monitors token consumption and alerts proactively:

| Pool Usage | Action |
|---|---|
| 50% consumed | No action (tracked internally) |
| 75% consumed | Notify user: "You've used 75% of your monthly tokens" |
| 90% consumed | Notify user with upgrade suggestion |
| 100% consumed | Switch to overage mode, notify user with cost estimate |
| 150% of pool (overage) | Strong warning, suggest reviewing which agents are most active |
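The alert tiers reduce to a threshold ladder over the consumed fraction of the allocation. The tier names below are illustrative labels, and the sketch returns only the highest tier crossed; the real implementation would also need to fire each tier once per billing cycle.

```javascript
// Sketch mapping pool consumption (fraction of monthly allocation)
// to the alert tiers in the table above; highest threshold wins.
function budgetAlert(consumedFraction) {
  if (consumedFraction >= 1.5) return "strong-warning";   // 150%: review agents
  if (consumedFraction >= 1.0) return "overage-mode";     // 100%: overage + estimate
  if (consumedFraction >= 0.9) return "suggest-upgrade";  // 90%
  if (consumedFraction >= 0.75) return "notify-75";       // 75%
  return null; // at or below 50%: tracked internally, no user-facing action
}
```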

6. Monitoring & Observability

6.1 Metrics Collected

The Safety Wrapper collects and reports these metrics to the Hub via heartbeat:

| Metric | Granularity | Used For |
|---|---|---|
| Tokens per agent per model per hour | Hourly buckets | Billing, usage dashboards |
| Model selection frequency | Per request | Optimize default presets |
| Fallback trigger count | Per hour | Monitor model reliability |
| Cache hit rate | Per agent per hour | Cost optimization tracking |
| Agent routing decisions | Per request | Dispatcher accuracy tracking |
| Tool call count per agent | Per hour | Identify heavy automation |
| Approval queue latency | Per request | UX optimization |
| Error rate per model | Per hour | Model health monitoring |

6.2 Dashboard Views (Hub)

Admin dashboard (staff):

  • Global token usage heatmap (all tenants)
  • Model usage distribution pie chart
  • Fallback frequency by model (alerts if >5% in any hour)
  • Revenue per model (included vs. premium vs. overage)

Customer dashboard:

  • "Your AI Team This Month" — token usage by agent, visualized as a bar chart
  • Model usage breakdown (which models are being used)
  • Pool status gauge (% remaining)
  • Cost breakdown (included vs. overage vs. premium)
  • "Most Active Agent" — who's doing the most work

7. Configuration Reference

7.1 Model Routing Config (model-routing.json)

{
  "presets": {
    "basic": {
      "name": "Basic Tasks",
      "models": ["openai/gpt-5-nano", "google/gemini-3-flash-preview", "deepseek/deepseek-v3.2"],
      "description": "Quick lookups, simple drafts, data entry"
    },
    "balanced": {
      "name": "Balanced",
      "models": ["deepseek/deepseek-v3.2", "minimax/minimax-m2.5", "openai/gpt-5-nano"],
      "description": "Day-to-day operations, routine tasks"
    },
    "complex": {
      "name": "Complex Tasks",
      "models": ["zhipu/glm-5", "minimax/minimax-m2.5", "deepseek/deepseek-v3.2"],
      "description": "Analysis, multi-step reasoning, reports"
    }
  },
  "premium_models": [
    "openai/gpt-5.2",
    "google/gemini-3.1-pro",
    "anthropic/claude-sonnet-4.6",
    "anthropic/claude-opus-4.6"
  ],
  "included_models": [
    "deepseek/deepseek-v3.2",
    "openai/gpt-5-nano",
    "openai/gpt-5.2-mini",
    "minimax/minimax-m2.5",
    "google/gemini-3-flash-preview",
    "zhipu/glm-5"
  ],
  "fallback_chains": {
    "deepseek/deepseek-v3.2": ["minimax/minimax-m2.5", "openai/gpt-5-nano"],
    "openai/gpt-5-nano": ["deepseek/deepseek-v3.2"],
    "zhipu/glm-5": ["minimax/minimax-m2.5", "deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
    "minimax/minimax-m2.5": ["deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
    "google/gemini-3-flash-preview": ["deepseek/deepseek-v3.2", "openai/gpt-5-nano"],
    "openai/gpt-5.2": ["zhipu/glm-5", "deepseek/deepseek-v3.2"],
    "anthropic/claude-sonnet-4.6": ["openai/gpt-5.2", "zhipu/glm-5", "deepseek/deepseek-v3.2"],
    "anthropic/claude-opus-4.6": ["anthropic/claude-sonnet-4.6", "openai/gpt-5.2", "zhipu/glm-5"]
  },
  "overage_markup": {
    "cheap": { "threshold_max": 0.50, "markup": 0.35 },
    "mid": { "threshold_max": 1.20, "markup": 0.25 },
    "top": { "threshold_max": 999, "markup": 0.20 }
  },
  "premium_markup": {
    "default": 0.10,
    "overrides": {
      "anthropic/claude-opus-4.6": 0.08
    }
  },
  "rate_limits": {
    "max_tool_calls_per_turn": 50,
    "max_tokens_per_turn": 100000,
    "max_turns_per_hour_per_agent": 60,
    "loop_detection_threshold": 3,
    "loop_detection_window_seconds": 60
  },
  "caching": {
    "retention": "long",
    "heartbeat_interval": "55m",
    "warmup_only_active_agents": true,
    "active_agent_threshold_hours": 24
  }
}

7.2 OpenRouter Provider Config (openclaw.json excerpt)

{
  "providers": {
    "openrouter": {
      "base_url": "http://127.0.0.1:8100/v1",
      "auth_profiles": [
        { "id": "primary", "key": "SECRET_REF(openrouter_key_1)" },
        { "id": "secondary", "key": "SECRET_REF(openrouter_key_2)" }
      ],
      "rotation_strategy": "round-robin-on-failure",
      "timeout_ms": 60000,
      "retry": {
        "max_attempts": 3,
        "backoff_ms": [1000, 5000, 15000]
      }
    }
  },
  "model": {
    "primary": "deepseek/deepseek-v3.2",
    "fallback": ["minimax/minimax-m2.5", "openai/gpt-5-nano"]
  }
}

Note: base_url points to the local secrets proxy (127.0.0.1:8100) which handles credential injection and outbound redaction before forwarding to OpenRouter's actual API.


8. Implementation Priorities

| Priority | Component | Effort | Dependencies |
|---|---|---|---|
| P0 | Basic preset routing (3 presets → model selection) | 1 week | Safety Wrapper skeleton |
| P0 | Fallback chain with auth rotation | 1 week | OpenRouter integration |
| P0 | Token metering and pool tracking | 2 weeks | Hub billing endpoints |
| P1 | Agent routing (Dispatcher SOUL.md) | 1 week | SOUL.md templates |
| P1 | Prompt caching with heartbeat keep-warm | 3 days | OpenClaw caching config |
| P1 | Loop detection (Safety Wrapper layer) | 3 days | Safety Wrapper hooks |
| P2 | Per-agent model overrides | 3 days | Hub agent config UI |
| P2 | Premium model gating (credit card check) | 1 week | Hub billing + Stripe |
| P2 | Token budget alerts | 3 days | Hub notification system |
| P3 | Multi-agent coordination (parallel delegation) | 2 weeks | Agent-to-agent messaging |
| P3 | Per-task model override (future) | 1 week | Conversation context detection |
| P3 | Customer usage dashboard | 1 week | Hub frontend |

9. Changelog

| Version | Date | Changes |
|---|---|---|
| 1.0 | 2026-02-26 | Initial spec. Agent routing via Dispatcher. Model presets (Basic/Balanced/Complex). Fallback chains with auth rotation. Token pool management. Prompt caching strategy. Rate limiting. Configuration reference. Implementation priorities. |