
LetsBe Biz — System Architecture

Date: February 27, 2026
Team: Claude Opus 4.6 Architecture Team
Document: 01 of 09
Status: Proposal (competing with independent team)

1. Architecture Philosophy
2. High-Level System Overview
3. Tenant Server Architecture
4. Central Platform Architecture
5. Four-Layer Security Model
6. AI Autonomy Levels
7. Data Flow Diagrams
8. Inter-Agent Communication
9. Memory Architecture
10. Network Security
11. Scalability & Performance
12. Disaster Recovery & Backup
13. Error Handling & Resilience

1. Architecture Philosophy

1.1 Non-Negotiable Principles

Principle 1 — Secrets Never Leave the Server

All credential redaction happens locally on the tenant VPS before any data reaches an LLM provider. This is enforced at the transport layer through a dedicated Secrets Proxy process — not by trusting the AI to behave, not by configuration, not by policy. The enforcement point is a separate process that sits between OpenClaw and the internet. Traffic that hasn't passed through the Secrets Proxy physically cannot reach an LLM. This is the single most important architectural invariant.

Principle 2 — Per-Tenant Physical Isolation

One customer = one VPS. No multi-tenancy, no shared containers, no shared databases. Each tenant's data, credentials, agent state, and conversation history live on dedicated hardware. This is permanent for v1. It eliminates entire categories of security vulnerabilities (cross-tenant data leaks, noisy neighbor performance issues, shared-secret compromise) at the cost of higher per-customer infrastructure spend.

Principle 3 — Defense in Depth (Four Independent Security Layers)

Security is not one wall — it's four independent layers, each enforced by different mechanisms, each unable to expand access granted by layers above. A failure in any single layer does not compromise the system because the remaining three layers still enforce their restrictions independently:

Layer	Mechanism	Enforced By	Bypassable By AI?
1. Sandbox	Container isolation	Docker / OS kernel	No
2. Tool Policy	Per-agent allow/deny arrays	OpenClaw config (loaded at startup)	No
3. Command Gating	5-tier classification + autonomy levels	Safety Wrapper (separate process)	No
4. Secrets Redaction	4-layer redaction pipeline	Secrets Proxy (separate process)	No

Principle 4 — OpenClaw Stays Vanilla

OpenClaw is treated as an upstream dependency, never a fork. All LetsBe-specific logic (secrets redaction, command gating, Hub communication, tool adapters, billing metering) lives in a Safety Wrapper process that runs alongside OpenClaw. This means:

Upstream security patches apply cleanly
New OpenClaw features are available without merge conflicts
Our competitive IP is cleanly separated from the upstream codebase
Pin to a tested release tag; upgrade monthly after staging verification

Principle 5 — Graceful Degradation

Every component has a failure mode that preserves the user's experience:

Hub goes down → agents continue working from cached config; approvals queue locally
OpenRouter goes down → model failover chains try alternatives; agents pause gracefully
Single tool goes down → agent reports it, other tools continue
Safety Wrapper restarts → agents pause briefly (~2-5s), auto-resume
Secrets Proxy restarts → LLM calls fail temporarily, auto-resume
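
The "approvals queue locally" behavior could be sketched roughly as follows. `ApprovalOutbox`, `ApprovalRequest`, and the synchronous `send` callback are hypothetical names used only for illustration; the real Hub call would be an asynchronous HTTPS request, and persistence across restarts is not shown.

```typescript
// Hypothetical shape of a queued approval request.
interface ApprovalRequest {
  id: string;
  description: string;
}

// Holds approval requests while the Hub is unreachable and delivers them in order later.
class ApprovalOutbox {
  private pending: ApprovalRequest[] = [];
  private send: (req: ApprovalRequest) => boolean;

  // `send` returns true when the Hub accepted the request (assumed, simplified to sync).
  constructor(send: (req: ApprovalRequest) => boolean) {
    this.send = send;
  }

  // Queue the request, then try to deliver everything that is waiting.
  submit(req: ApprovalRequest): void {
    this.pending.push(req);
    this.flush();
  }

  // Sends queued requests oldest-first and stops at the first failure.
  flush(): number {
    let sent = 0;
    while (this.pending.length > 0) {
      const next = this.pending[0];
      if (next === undefined || !this.send(next)) break;
      this.pending.shift();
      sent++;
    }
    return sent;
  }

  size(): number {
    return this.pending.length;
  }
}
```

In this sketch, calling `flush()` from the heartbeat loop would drain the queue once the Hub is reachable again, preserving request order.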

1.2 Key Divergence from Technical Architecture v1.2

The Technical Architecture v1.2 proposes the Safety Wrapper as an in-process OpenClaw extension running inside the Gateway process, with only a thin Secrets Proxy as a separate process. After deep research into OpenClaw's plugin system, we propose a fundamentally different approach.

Our proposal: Safety Wrapper as a SEPARATE process (localhost:8200)

Three findings drive this decision:

1. Hook Gap (GitHub Discussion #20575): OpenClaw's before_tool_call and after_tool_call hooks are NOT bridged to external plugins. The internal hook system fires events via emitEvent() but never calls triggerInternalHook() for external plugin consumers. This means an in-process extension CANNOT reliably intercept tool calls, which is the exact mechanism the v1.2 architecture depends on for command classification and secrets injection.
2. CVE-2026-25253 (CVSS 8.8): Cross-site WebSocket hijacking vulnerability in OpenClaw, patched 2026-01-29. An in-process extension shares the vulnerability surface with the host process. A separate process has an independent attack surface, so compromising OpenClaw doesn't automatically compromise the Safety Wrapper.
3. Synchronous hook limitation: The tool_result_persist hook is synchronous and cannot return Promises. This limits what an in-process extension can do for async operations like Hub API calls, approval requests, and token reporting.

Impact on architecture:

Safety Wrapper runs as a separate Node.js process on localhost:8200
OpenClaw is configured to route tool calls through the Safety Wrapper's HTTP API
Secrets Proxy remains as a separate thin process on localhost:8100
Total: 3 core processes (OpenClaw + Safety Wrapper + Secrets Proxy), plus nginx and the tool containers
RAM overhead increases by ~64MB (from ~576MB to ~640MB) — acceptable on all tiers

1.3 Why These Principles Matter for the Business

Privacy-first architecture is the competitive moat. SMBs increasingly distrust cloud-only AI solutions — stories of training data leaks, terms-of-service changes, and API key compromises make headlines weekly. LetsBe's "secrets never leave your server" guarantee is verifiable (the Secrets Proxy is inspectable) and defensible (transport-layer enforcement can't be bypassed by prompt injection). This positions LetsBe uniquely against competitors who run AI in multi-tenant cloud environments.

2. High-Level System Overview

2.1 Two-Domain Architecture

The platform operates across two distinct trust domains connected by HTTPS:

┌─────────────────────────────────────────────────────────────────────┐
│                        CENTRAL PLATFORM                             │
│                    (LetsBe infrastructure)                           │
│                                                                     │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐    │
│  │     Hub      │   │  Provisioner │   │      Website         │    │
│  │  (Next.js)   │   │  (Bash/SSH)  │   │    (Next.js SSG)     │    │
│  │              │   │              │   │                      │    │
│  │ Admin Portal │   │ 10-step VPS  │   │ Marketing + AI       │    │
│  │ Customer API │   │ setup via    │   │ onboarding chat +    │    │
│  │ Billing      │   │ Docker       │   │ Stripe checkout      │    │
│  │ Tenant Comms │   │              │   │                      │    │
│  └──────┬───────┘   └──────┬───────┘   └──────────────────────┘    │
│         │                  │                                        │
│         │   PostgreSQL     │                                        │
│         └──────┬───────────┘                                        │
│                │                                                    │
└────────────────┼────────────────────────────────────────────────────┘
                 │
                 │  HTTPS (heartbeat, config sync, approvals, usage)
                 │  SSH (provisioning only — one-shot, no persistent connection)
                 │
┌────────────────┼────────────────────────────────────────────────────┐
│                │           TENANT SERVER                            │
│                │      (Customer's isolated VPS)                     │
│                │                                                    │
│  ┌─────────────▼──────────┐                                        │
│  │    Safety Wrapper       │◄────── Hub API Key auth               │
│  │    (localhost:8200)     │                                        │
│  │                         │                                        │
│  │  Command Classification │        ┌──────────────────┐           │
│  │  Secrets Registry (SQLite)│      │  Secrets Proxy   │           │
│  │  Tool Execution Proxy   │───────►│  (localhost:8100) │           │
│  │  Hub Communication      │        │                  │           │
│  │  Token Metering         │        │  4-layer redact  │──► LLM   │
│  │  Audit Logger           │        │  <10ms overhead  │  (OpenRouter)
│  └────────────┬────────────┘        └──────────────────┘           │
│               │                                                     │
│  ┌────────────▼────────────┐                                        │
│  │      OpenClaw           │                                        │
│  │   (Gateway:18789)       │                                        │
│  │                         │                                        │
│  │  Agent Runtime          │     ┌──────────────────────────────┐  │
│  │  Session Management     │     │     Tool Stacks (Docker)     │  │
│  │  Prompt Caching         │     │                              │  │
│  │  Browser (Playwright)   │     │  Ghost    Cal.com   Nextcloud│  │
│  │  Channels (WA/TG)      │     │  Chatwoot Odoo      NocoDB   │  │
│  │  Cron / Webhooks        │     │  Listmonk Umami    Keycloak  │  │
│  └─────────────────────────┘     │  ... 20+ more containers    │  │
│                                   └──────────────────────────────┘  │
│  ┌─────────────────────────┐                                        │
│  │   nginx (80/443)        │  Only external-facing process          │
│  └─────────────────────────┘                                        │
└─────────────────────────────────────────────────────────────────────┘

2.2 Trust Boundaries

                    UNTRUSTED                │           TRUSTED (on-VPS)
                                             │
    External LLM Providers ◄─────────────────┤◄── Secrets Proxy (redacts ALL secrets)
    (via OpenRouter:                          │         ▲
     Anthropic, Google,                       │         │ outbound LLM traffic only
     DeepSeek, OpenAI, etc.)                  │         │
                                             │    Safety Wrapper (classifies commands)
    Internet Users ─────────► nginx ──────►  │         │
                              (TLS)          │         ▼
                                             │    OpenClaw (agent runtime)
    Mobile App ◄─────► Hub ◄────────────────►│         │
    (WebSocket)        (relay)               │         ▼
                                             │    Tool Containers
    Messaging Channels ◄────────────────────►│    (Ghost, Nextcloud, Cal.com, etc.)
    (WhatsApp, Telegram)                      │

Key boundaries:

LLMs are UNTRUSTED — all outbound traffic is sanitized by Secrets Proxy
The Internet is UNTRUSTED — only nginx port 80/443 and SSH 22022 are exposed
Hub communication is AUTHENTICATED — Bearer token over HTTPS
Inter-process communication is LOCAL — localhost only, no network exposure

2.3 Network Boundary

Central → Tenant: SSH (provisioning, one-shot), HTTPS (API calls to Safety Wrapper if needed)
Tenant → Central: HTTPS (heartbeat, config sync, approval requests, usage reporting)
Tenant → Internet: Only through Secrets Proxy (LLM calls) and nginx (tool web UIs)
No persistent connections: Heartbeat is periodic HTTP POST, not WebSocket

3. Tenant Server Architecture

3.1 Process Map

Every tenant VPS runs the following processes:

Process	Port	Protocol	RAM Budget	Restartable	Purpose
OpenClaw Gateway	18789	HTTP+WS	~384MB (includes Chromium ~200MB)	Yes (Docker restart)	AI agent runtime, session management, browser tool
Safety Wrapper	8200	HTTP	~128MB	Yes (Docker restart)	Command gating, secrets registry, Hub comms, metering
Secrets Proxy	8100	HTTP	~64MB	Yes (Docker restart)	Outbound LLM traffic redaction (4-layer pipeline)
nginx	80, 443	HTTP/S	~32MB	Yes (systemd)	Reverse proxy, TLS termination, tool routing
Tool containers	3001-3099	Various	~128-512MB each	Yes (Docker restart)	Ghost, Nextcloud, Cal.com, etc. (28+)
Monitoring	—	—	~32MB	Yes	Netdata or lightweight metrics agent

Total LetsBe overhead: ~640MB (OpenClaw 384MB + Safety Wrapper 128MB + Secrets Proxy 64MB + nginx 32MB + monitoring 32MB)

3.2 Memory Budget per Tier

Tier	Total RAM	LetsBe Overhead	Available for Tools	Max Practical Tools	Chromium?
Lite (8GB)	8,192MB	640MB	~7,552MB	8-12 (constrained)	Yes, but consider browser-less mode
Build (16GB)	16,384MB	640MB	~15,744MB	15-20 (comfortable)	Yes
Scale (32GB)	32,768MB	640MB	~32,128MB	25-30 (full stack)	Yes
Enterprise (64GB)	65,536MB	640MB	~64,896MB	30+ with headroom	Yes

Lite tier note: With ~7.4GB for tools, the Lite tier is the tightest. Each tool averages 256-512MB, so a Freelancer bundle (7 tools) at ~2.5GB fits comfortably, but the budget becomes constrained at 8-12 tools. The Lite tier is hidden at launch until real-world memory profiling confirms it is viable. If more memory is needed, OpenClaw supports running without the browser tool (browser-less mode), which saves the ~200MB used by Chromium.
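
The budget arithmetic in Sections 3.1 and 3.2 can be reproduced with a short sketch. The figures come from the tables above; the names (`OVERHEAD_MB`, `availableForToolsMb`, `maxToolsAt`) are illustrative, and `maxToolsAt` is only a ceiling for an assumed average footprint, not the "max practical tools" figure, which leaves headroom.

```typescript
// Fixed per-process RAM budgets in MB, from the process map in Section 3.1.
const OVERHEAD_MB: Array<{ process: string; mb: number }> = [
  { process: "OpenClaw Gateway", mb: 384 },
  { process: "Safety Wrapper", mb: 128 },
  { process: "Secrets Proxy", mb: 64 },
  { process: "nginx", mb: 32 },
  { process: "Monitoring", mb: 32 },
];

// Total RAM per tier in MB, from the table in Section 3.2.
const TIER_RAM_MB = {
  lite: 8192,
  build: 16384,
  scale: 32768,
  enterprise: 65536,
} as const;

type TierName = keyof typeof TIER_RAM_MB;

// Sum of all fixed budgets.
function totalOverheadMb(): number {
  return OVERHEAD_MB.reduce((sum, p) => sum + p.mb, 0);
}

// RAM left for tool containers on a given tier.
function availableForToolsMb(tier: TierName): number {
  return TIER_RAM_MB[tier] - totalOverheadMb();
}

// Upper bound on tool count for an assumed average footprint per tool.
function maxToolsAt(tier: TierName, avgToolMb: number): number {
  return Math.floor(availableForToolsMb(tier) / avgToolMb);
}
```

For example, `availableForToolsMb("lite")` gives 7552, matching the Lite row of the table.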

3.3 OpenClaw Configuration

OpenClaw (v2026.2.6-3) is configured via ~/.openclaw/openclaw.json (JSON5 format with environment variable substitution).

Critical configuration decisions:

{
  // Route ALL LLM calls through the Secrets Proxy → OpenRouter
  "model": {
    "primary": "${SW_PROXY_MODEL}",  // e.g., "anthropic/claude-sonnet-4-6"
    "apiUrl": "http://localhost:8100/v1",  // Secrets Proxy intercepts
    "apiKey": "${OPENROUTER_API_KEY_ENCRYPTED}",  // Resolved by Secrets Proxy
    "fallbacks": ["${SW_FALLBACK_1}", "${SW_FALLBACK_2}"],
    "contextTokens": 200000
  },

  // Prompt caching — massive cost saver
  "cacheRetention": "long",          // 1 hour (SOUL.md cached 80-99% cheaper)
  "heartbeat": { "every": "55m" },   // Keep-warm to prevent cache eviction

  // Security hardening
  "security": {
    "elevated": { "enable": false },  // DISABLED — Safety Wrapper handles all elevation
    "rateLimit": {
      "maxAttempts": 10,
      "windowSeconds": 60,
      "lockoutSeconds": 300,
      "exemptLoopback": true
    }
  },

  // Tool safety
  "tools": {
    "loopDetection": { "enabled": true },  // Prevent runaway tool calls
    "exec": {
      "security": "allowlist",  // Only allowlisted binaries
      "timeout": 1800
    }
  },

  // Logging with redaction
  "logging": {
    "level": "info",
    "redactSensitive": "tools"  // Extra protection — redact tool output in logs
  },

  // Agent definitions
  "agents": {
    "list": [
      // Dispatcher, IT Admin, Marketing, Secretary, Sales
      // (see Section 8 for full configurations)
    ]
  },

  // Channel support (configured per-tenant)
  "channels": {
    "whatsapp": { "enabled": "${WHATSAPP_ENABLED}" },
    "telegram": { "enabled": "${TELEGRAM_ENABLED}" }
  }
}

3.4 Safety Wrapper Architecture (localhost:8200)

The Safety Wrapper is the core IP — where all LetsBe-specific logic lives.

┌────────────────────────────────────────────────────────────────┐
│                     SAFETY WRAPPER (localhost:8200)              │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Command          │  │ Secrets          │  │ Token        │  │
│  │ Classification   │  │ Registry         │  │ Metering     │  │
│  │ Engine           │  │ (Encrypted       │  │ Engine       │  │
│  │                  │  │  SQLite)         │  │              │  │
│  │ 5-tier classify  │  │ ChaCha20-Poly1305│  │ Per-agent    │  │
│  │ Autonomy gating  │  │ via sqleet       │  │ per-model    │  │
│  │ Ext. comms gate  │  │ WAL mode         │  │ hourly agg   │  │
│  └────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘  │
│           │                     │                    │           │
│  ┌────────▼─────────────────────▼────────────────────▼────────┐ │
│  │              Tool Execution Proxy                           │ │
│  │                                                             │ │
│  │  Intercepts ALL tool calls from OpenClaw                    │ │
│  │  1. Classify command (green/yellow/yellow_ext/red/crit_red) │ │
│  │  2. Check autonomy level + external comms gate              │ │
│  │  3. If gated → push approval to Hub, wait for response      │ │
│  │  4. If allowed → resolve SECRET_REFs from registry          │ │
│  │  5. Execute tool call (shell, Docker, API, browser)         │ │
│  │  6. Scrub secrets from response                             │ │
│  │  7. Log to audit trail                                      │ │
│  │  8. Report token usage to metering engine                   │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Hub              │  │ Audit            │  │ Config       │  │
│  │ Communication    │  │ Logger           │  │ Manager      │  │
│  │ Client           │  │                  │  │              │  │
│  │                  │  │ Append-only      │  │ Hot-reload   │  │
│  │ Registration     │  │ SQLite           │  │ autonomy lvl │  │
│  │ Heartbeat (60s)  │  │ Every tool call  │  │ ext comms    │  │
│  │ Config sync      │  │ Every approval   │  │ agent config │  │
│  │ Approval routing │  │ Every secret use │  │              │  │
│  │ Usage reporting  │  │                  │  │              │  │
│  └──────────────────┘  └──────────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────────────┘

Technology stack:

Node.js 22+ (same runtime as OpenClaw — one ecosystem)
TypeScript (strict mode)
No web framework (raw node:http for minimal overhead and attack surface)
better-sqlite3-multiple-ciphers for encrypted SQLite (secrets registry + audit log + usage buckets)
Key derivation: scrypt from provisioner-generated seed
Cipher: ChaCha20-Poly1305 via sqleet (modern AEAD, ~2x faster than AES-256-CBC on ARM)
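
As a minimal sketch of the key-derivation step, the following uses Node's built-in `crypto.scryptSync`. The function name `deriveDbKey`, the salt handling, and the cost parameters (Node's defaults, written out explicitly) are assumptions for illustration; the proposal only specifies "scrypt from provisioner-generated seed".

```typescript
import { scryptSync } from "node:crypto";

// Derive a 32-byte database key from the provisioner-generated seed.
// Assumption: the salt is stored alongside the database and is not secret.
// Assumption: N/r/p are illustrative cost parameters, not tuned values.
function deriveDbKey(seed: string, salt: Buffer): Buffer {
  return scryptSync(seed, salt, 32, { N: 16384, r: 8, p: 1 });
}
```

The derived key would then be handed to the SQLite cipher layer when the registry database is opened. Derivation is deterministic for a given seed and salt, so the key itself never needs to be written to disk.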

3.5 Secrets Proxy Architecture (localhost:8100)

The thinnest possible process — its only job is intercepting outbound LLM traffic and scrubbing secrets.

┌─────────────────────────────────────────────────────────┐
│             SECRETS PROXY (localhost:8100)                │
│                                                           │
│  Inbound (from OpenClaw via Safety Wrapper config)        │
│  ──────────────────────────────────────────────────       │
│  POST /v1/chat/completions                                │
│  POST /v1/completions                                     │
│  POST /v1/embeddings                                      │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │         4-LAYER REDACTION PIPELINE                   │ │
│  │                                                       │ │
│  │  Layer 1: Aho-Corasick Registry Substitution          │ │
│  │  ─────────────────────────────────────────            │ │
│  │  All 50+ known secrets from encrypted registry        │ │
│  │  loaded into Aho-Corasick automaton at startup         │ │
│  │  O(n) in text length regardless of pattern count       │ │
│  │  Deterministic replacements: value → [SECRET_REF:name] │ │
│  │                                                       │ │
│  │  Layer 2: Regex Pattern Safety Net                    │ │
│  │  ─────────────────────────────────────────            │ │
│  │  7 patterns catch secrets the registry might miss:    │ │
│  │  • -----BEGIN.*PRIVATE KEY-----                       │ │
│  │  • eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+ (JWT)       │ │
│  │  • \$2[aby]?\$[0-9]+\$ (bcrypt)                      │ │
│  │  • ://[^:]+:[^@]+@ (connection strings)              │ │
│  │  • (PASSWORD|SECRET|KEY|TOKEN)=.+ (env patterns)      │ │
│  │  • High-entropy base64 (length > 32)                  │ │
│  │  • Hex strings 32+ chars matching known key patterns  │ │
│  │                                                       │ │
│  │  Layer 3: Shannon Entropy Filter                      │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Threshold: 4.5 bits/char, minimum length: 16 chars   │ │
│  │  H(X) = -Σ p(x) log2(p(x))                          │ │
│  │  English text: ~3.5-4.0 bits/char                     │ │
│  │  Random secrets: ~5.0-6.0 bits/char                   │ │
│  │  Catches: API keys, random passwords, base64 tokens   │ │
│  │  Excludes: common words, UUIDs (known format)         │ │
│  │                                                       │ │
│  │  Layer 4: Context-Aware JSON Key Scanning             │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Scans JSON structures for sensitive keys:            │ │
│  │  password, secret, token, key, credential,            │ │
│  │  api_key, apiKey, auth, authorization, bearer,        │ │
│  │  private_key, access_token, refresh_token             │ │
│  │  Redacts the VALUE (not the key) in matched pairs     │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  Outbound → OpenRouter (HTTPS)                            │
│  Performance target: <10ms added latency per LLM call     │
│                                                           │
│  Control interface: Unix socket (Safety Wrapper only)     │
│  • Credential sync (on rotation/add/remove)               │
│  • Pattern updates                                        │
│  • Health check                                           │
└─────────────────────────────────────────────────────────┘
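
To illustrate Layer 3, here is a minimal sketch of the entropy check using the formula, threshold, and minimum length from the diagram. It computes a per-string empirical entropy from character counts; `shannonEntropy` and `looksLikeSecret` are hypothetical names, and tokenization and the UUID exclusion are left out.

```typescript
// Empirical Shannon entropy in bits per character:
// H(X) = -sum over x of p(x) * log2(p(x)), with p(x) estimated from character counts.
function shannonEntropy(text: string): number {
  if (text.length === 0) return 0;
  const counts = new Map<string, number>();
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  let entropy = 0;
  counts.forEach((n) => {
    const p = n / text.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Values taken from the Layer 3 description above.
const ENTROPY_THRESHOLD = 4.5; // bits per character
const MIN_LENGTH = 16; // characters

// Flags a token when it is long enough and its entropy exceeds the threshold.
function looksLikeSecret(token: string): boolean {
  return token.length >= MIN_LENGTH && shannonEntropy(token) > ENTROPY_THRESHOLD;
}
```

One property of this estimate is worth keeping in mind when tuning: a string of n characters can score at most log2(n) bits/char, and a pure hex string at most 4.0. With a 4.5 threshold, only tokens with at least 23 distinct characters are flagged, so hex keys and short tokens would have to be caught by Layers 1 and 2.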

3.6 Container Layout

Container	Image	Network	Ports	Resources
`letsbe-openclaw`	Custom (OpenClaw + CLI binaries + config)	host	18789 (loopback)	~384MB
`letsbe-safety-wrapper`	LetsBe custom (Node.js)	host	8200 (loopback)	~128MB
`letsbe-secrets-proxy`	LetsBe custom (Node.js, minimal)	host	8100 (loopback)	~64MB
nginx	nginx:alpine	host	80, 443	~32MB
Tool stacks (28+)	Various (Ghost, Nextcloud, etc.)	isolated per-tool	127.0.0.1:30XX	Variable

Network access pattern: OpenClaw container uses --network host to reach tool containers via 127.0.0.1:30XX (e.g., 3023 for Nextcloud, 3037 for NocoDB). Each tool keeps its own isolated Docker network, and the AI accesses the tools through the host loopback interface. No single Docker network is shared across the tool stacks.

4. Central Platform Architecture

4.1 Hub (letsbe-hub)

The most mature component (~15K LOC, 244 source files, 80+ existing endpoints, 22+ Prisma models).

Current capabilities (KEEP):

Staff admin dashboard with RBAC (4 roles, 20 permissions, 2FA)
Customer management (CRUD, subscriptions)
Order lifecycle (8-state automation state machine)
Netcup SCP API integration (full OAuth2 Device Flow)
Portainer integration (container management)
DNS verification workflow
Docker-based provisioning with SSE log streaming
Stripe checkout + webhook integration
Enterprise client management + monitoring
Email notifications, credential encryption, system settings

New capabilities (BUILD):

Customer-facing portal API (~14 endpoints) — dashboard, agents, approvals, usage, billing
Tenant communication API (~7 endpoints) — registration, heartbeat, config sync, approvals, usage
Billing + token metering (~7 endpoints) — Stripe Billing Meters, overage, founding member multiplier
Agent management API (~5 endpoints) — CRUD for agent configs, deploy to tenant
Command approval queue (~3 endpoints) — pending, approve, deny
WebSocket relay for mobile app ↔ tenant server communication

New Prisma models: TokenUsageBucket, BillingPeriod, FoundingMember, AgentConfig, CommandApproval + ServerConnection updates (see 02-COMPONENT-BREAKDOWN for full schemas)

4.2 Provisioner (letsbe-ansible-runner → letsbe-provisioner)

One-shot Bash container (~4,477 LOC) that provisions a fresh VPS via SSH.

Existing 10-step pipeline (KEEP):

1. System packages
2. Docker CE installation
3. Disable conflicting services
4. nginx + fallback config
5. UFW firewall (ports 80, 443, 22022)
6. Optional admin user + SSH key
7. SSH hardening (port 22022, key-only auth, fail2ban)
8. Unattended security updates
9. Deploy tool stacks via docker-compose
10. Deploy LetsBe agents + bootstrap ← UPDATE THIS STEP

Step 10 changes:

Deploy OpenClaw + Safety Wrapper + Secrets Proxy (replacing orchestrator + sysadmin agent)
Generate Safety Wrapper config (secrets registry seed, agent configs, Hub credentials, autonomy defaults)
Generate OpenClaw config (model routing through Secrets Proxy, agent definitions, caching, loop detection)
Run Playwright initial-setup scenarios via OpenClaw native browser (7 scenarios — Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma; n8n removed)
CRITICAL FIX: Clean up config.json after provisioning (currently contains root password in plaintext)

The provisioner currently has zero tests; container-based integration tests are part of this proposal (see 07-TESTING-STRATEGY).

4.3 Website (Separate Next.js App)

A separate Next.js application in the monorepo, sharing the @letsbe/db Prisma package. Not part of the Hub — different concerns (marketing + onboarding vs. admin + operations).

Key features:

Marketing pages (SSG for performance)
AI-powered onboarding chat (Gemini Flash for business classification, ~$0.001 per prospect)
Tool recommendation engine with live resource calculator
Stripe checkout flow
SSE provisioning status page
Shares Prisma schema via monorepo package — no data duplication

4.4 Mobile App (Expo Bare Workflow, SDK 52+)

Why Expo over alternatives:

EAS Build: Eliminates iOS code signing complexity — CI builds without Mac hardware
EAS Update: OTA updates without App Store review — critical for rapid iteration
expo-notifications: Action buttons on push notifications (Approve/Deny) for command gating
expo-local-authentication: Biometric auth (Face ID, Touch ID, Android fingerprint)
expo-secure-store: Secure token storage (iOS Keychain, Android Keystore)

Architecture: Mobile ↔ Hub (WebSocket relay) ↔ Tenant Server. The Hub acts as a relay, so the tenant server is never directly exposed to the internet. The connection uses JWT auth, a reconnection strategy, and offline message queuing.

5. Four-Layer Security Model

5.1 Layer 1 — Sandbox (Where Code Runs)

OpenClaw's native sandbox controls the execution environment:

Mode	Description	LetsBe Default
`off`	No containerization	Default — Safety Wrapper handles gating
`non-main`	Only non-default agents sandboxed	For untrusted custom agents
`all`	Every agent sandboxed	Maximum isolation (performance cost)

Default agents (Dispatcher, IT Admin, Marketing, Secretary, Sales) run with sandbox off because the Safety Wrapper provides command-level gating that's more granular than container isolation. Custom user-created agents can be sandboxed per-agent.

5.2 Layer 2 — Tool Policy (What Tools Are Visible)

OpenClaw's native agents.list[].tools.allow/deny arrays control which tools each agent can see. Deny wins over allow. Cascading restriction model:

Tool profiles (tools.profile — coding, minimal, messaging, full)
Global policies (tools.allow/tools.deny)
Agent-specific policies (agents.list[].tools.allow/deny)

Example — Marketing Agent:

{
  "id": "marketing",
  "tools": {
    "profile": "minimal",
    "allow": ["ghost_api", "listmonk_api", "umami_api", "file_read", "browser", "nextcloud_api", "web_search", "web_fetch"],
    "deny": ["shell", "docker", "env_update"]
  }
}

Marketing can see Ghost/Listmonk/Umami but CANNOT see shell/docker/env_update — those tools don't even appear in its context.
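
The "deny wins over allow" rule and the cascading behavior can be sketched as below. This is a simplified model, not OpenClaw's actual resolution code: `ToolPolicy`, `permits`, and `isToolVisible` are hypothetical names, tool profiles are omitted, and an absent allow list is assumed to permit everything at that level.

```typescript
// One level of policy (global or agent-specific).
interface ToolPolicy {
  allow?: string[];
  deny?: string[];
}

// Within a level, deny wins over allow.
// Assumption: a missing allow list places no restriction at that level.
function permits(policy: ToolPolicy, tool: string): boolean {
  if (policy.deny !== undefined && policy.deny.indexOf(tool) !== -1) return false;
  if (policy.allow !== undefined && policy.allow.indexOf(tool) === -1) return false;
  return true;
}

// A tool is visible only if every level permits it, so a lower level
// can restrict access further but never expand it.
function isToolVisible(tool: string, levels: ToolPolicy[]): boolean {
  return levels.every((level) => permits(level, tool));
}
```

With the Marketing policy above as the agent-level entry, `ghost_api` would be visible while `shell` would not, and a tool listed in both arrays would stay hidden.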

5.3 Layer 3 — Command Gating (What Operations Require Approval)

Even if an agent can see a tool (Layer 2 allows it), the Safety Wrapper may gate specific operations on that tool based on command classification and the agent's effective autonomy level.

Five-tier classification:

Tier	Color	Description	Examples
1	GREEN	Non-destructive reads	`file_read`, `container_stats`, `container_logs`, `query_select`, `umami_read`, `uptime_check`
2	YELLOW	Modifying operations	`container_restart`, `file_write`, `env_update`, `nginx_reload`, `chatwoot_assign`, `calcom_create`
3	YELLOW_EXTERNAL	External-facing communications	`ghost_publish`, `listmonk_send`, `poste_send`, `chatwoot_reply_external`, `social_post`, `documenso_send`
4	RED	Destructive operations	`file_delete`, `container_remove`, `volume_delete`, `user_revoke`, `db_drop_table`, `backup_delete`
5	CRITICAL_RED	Irreversible infrastructure	`db_drop_database`, `firewall_modify`, `ssh_config_modify`, `backup_wipe_all`, `ssl_revoke`
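
A classification lookup over the table above might be sketched like this. Only a handful of the listed operations are included, `classify` is a hypothetical name, and the fail-closed default for unknown operations is an assumption rather than something this section specifies.

```typescript
type Tier = "GREEN" | "YELLOW" | "YELLOW_EXTERNAL" | "RED" | "CRITICAL_RED";

// A few operations from the table above; a real registry would be far larger.
const OPERATION_TIERS = new Map<string, Tier>([
  ["file_read", "GREEN"],
  ["container_logs", "GREEN"],
  ["container_restart", "YELLOW"],
  ["file_write", "YELLOW"],
  ["ghost_publish", "YELLOW_EXTERNAL"],
  ["listmonk_send", "YELLOW_EXTERNAL"],
  ["file_delete", "RED"],
  ["db_drop_table", "RED"],
  ["db_drop_database", "CRITICAL_RED"],
  ["firewall_modify", "CRITICAL_RED"],
]);

// Assumption: unknown operations fail closed to the most restrictive tier.
function classify(operation: string): Tier {
  return OPERATION_TIERS.get(operation) ?? "CRITICAL_RED";
}
```

Failing closed means a newly added tool operation is always gated until someone classifies it explicitly.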

Autonomy level × classification gating matrix:

Command Tier	Training Wheels (L1)	Trusted Assistant (L2)	Full Autonomy (L3)
GREEN	Auto-execute	Auto-execute	Auto-execute
YELLOW	Gate → approval	Auto-execute	Auto-execute
YELLOW_EXTERNAL	Gate → approval	Gate → approval (unless unlocked)	Gate → approval (unless unlocked)
RED	Gate → approval	Gate → approval	Auto-execute
CRITICAL_RED	Gate → approval	Gate → approval	Gate → approval
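
The matrix reduces to a single threshold per tier, which the following sketch encodes. `AUTO_FROM_LEVEL` and `decide` are illustrative names; the `externalUnlocked` flag stands in for the external communications gate described in Section 5.5.

```typescript
type Tier = "GREEN" | "YELLOW" | "YELLOW_EXTERNAL" | "RED" | "CRITICAL_RED";
type AutonomyLevel = 1 | 2 | 3;
type Decision = "auto_execute" | "gate";

// Lowest autonomy level at which a tier auto-executes; null means it never does.
const AUTO_FROM_LEVEL: Record<Tier, AutonomyLevel | null> = {
  GREEN: 1,
  YELLOW: 2,
  YELLOW_EXTERNAL: 2, // only consulted once unlocked; YELLOW rules then apply
  RED: 3,
  CRITICAL_RED: null,
};

// Returns the gating decision for one command at one autonomy level.
function decide(tier: Tier, level: AutonomyLevel, externalUnlocked = false): Decision {
  // Locked external comms are gated regardless of autonomy level.
  if (tier === "YELLOW_EXTERNAL" && !externalUnlocked) return "gate";
  const threshold = AUTO_FROM_LEVEL[tier];
  return threshold !== null && level >= threshold ? "auto_execute" : "gate";
}
```

Keeping the matrix as data rather than nested conditionals makes it straightforward to test every cell against the table.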

5.4 Layer 4 — Secrets Redaction (Always On)

Regardless of sandbox mode, tool permissions, or autonomy level, ALL outbound LLM traffic is redacted via the Secrets Proxy's 4-layer pipeline (see Section 3.5). This layer cannot be disabled. It runs at every autonomy level. The AI never sees raw credentials.
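
As one concrete piece of that pipeline, Layer 4 (context-aware JSON key scanning) might look like the sketch below. The key list comes from Section 3.5; case-insensitive exact matching, the `[REDACTED]` marker, and the function names are assumptions made for illustration.

```typescript
// Sensitive key names from the Layer 4 description, lowercased for comparison.
const SENSITIVE_KEYS = [
  "password", "secret", "token", "key", "credential", "api_key", "apikey",
  "auth", "authorization", "bearer", "private_key", "access_token", "refresh_token",
];

// Assumption: keys match case-insensitively and exactly (no substring matching).
function isSensitiveKey(key: string): boolean {
  return SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1;
}

// Walks a parsed JSON value and redacts the VALUE of every sensitive key.
// Keys are kept so the LLM still sees the structure of the payload.
function redactJson(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redactJson(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    Object.keys(source).forEach((key) => {
      out[key] = isSensitiveKey(key) ? "[REDACTED]" : redactJson(source[key]);
    });
    return out;
  }
  return value;
}
```

The function returns a new structure and leaves its input untouched, so the original payload can still be used locally where the real values are needed.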

5.5 External Communications Gate

Independent of autonomy levels. A separate mechanism that gates all YELLOW_EXTERNAL operations by default for every agent. Users explicitly unlock autonomous external sending per-agent per-tool via the mobile app or web portal.

Resolution logic:

1. Command classified as YELLOW_EXTERNAL
2. Check external_comms_gate.unlocks[agentId][toolName]
3. If "autonomous" → follow normal autonomy level gating (YELLOW rules apply)
4. If "gated" or not set → always gate, regardless of autonomy level
5. Present approval: "Marketing Agent wants to publish: 'Top 10 Tips...' to your blog. [Approve] [Edit] [Deny]"
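
The unlock lookup in this resolution logic can be sketched as follows. The nested `unlocks[agentId][toolName]` shape follows the path named above; the type names and `mustGateExternal` are hypothetical.

```typescript
type UnlockState = "autonomous" | "gated";

// Mirrors external_comms_gate.unlocks[agentId][toolName].
type Unlocks = Record<string, Record<string, UnlockState>>;

// Returns true when a YELLOW_EXTERNAL command must be gated regardless of
// autonomy level. Anything other than an explicit "autonomous" entry gates.
function mustGateExternal(unlocks: Unlocks, agentId: string, toolName: string): boolean {
  const state = unlocks[agentId]?.[toolName];
  return state !== "autonomous";
}
```

When this returns false, the command is not automatically allowed: it falls through to the normal autonomy-level check under YELLOW rules.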

6. AI Autonomy Levels

6.1 Level Definitions

Level	Name	Default For	Auto-Execute	Requires Approval
1	Training Wheels	New customers	GREEN only	YELLOW + RED + CRITICAL_RED
2	Trusted Assistant	Default	GREEN + YELLOW	RED + CRITICAL_RED
3	Full Autonomy	Power users	GREEN + YELLOW + RED	CRITICAL_RED only

6.2 Per-Agent Override

Each agent can have its own autonomy level independent of the tenant default:

Agent	Tenant Default	Agent Override	Effective
IT Admin	Level 2	Level 3	3 — full autonomy for infrastructure
Marketing	Level 2	—	2 — default
Secretary	Level 2	Level 1	1 — extra cautious with communications
Sales	Level 2	—	2 — default
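
The "Effective" column is simply the agent override when one exists, otherwise the tenant default. A minimal sketch, with `AutonomyConfig` and `effectiveLevel` as hypothetical names:

```typescript
type AutonomyLevel = 1 | 2 | 3;

interface AutonomyConfig {
  tenantDefault: AutonomyLevel;
  // Per-agent overrides, keyed by agent id; agents without an entry use the default.
  overrides: Partial<Record<string, AutonomyLevel>>;
}

// An agent override, when present, replaces the tenant default in either direction.
function effectiveLevel(config: AutonomyConfig, agentId: string): AutonomyLevel {
  return config.overrides[agentId] ?? config.tenantDefault;
}
```

With a tenant default of 2 and overrides of 3 for IT Admin and 1 for Secretary, this reproduces the table above.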

6.3 Transition Criteria

Moving between levels is manual — triggered by the customer in the mobile app or web portal, synced to the Safety Wrapper via Hub heartbeat. There is no automatic promotion. The customer builds trust at their own pace.

Invariants across ALL levels:

Secrets are always redacted (Layer 4)
Audit trail is always logged
External comms are gated by default until explicitly unlocked
CRITICAL_RED always requires approval
The AI never sees raw credentials

7. Data Flow Diagrams

7.1 Message Processing Flow

User (mobile app)
  │
  ▼
Hub (WebSocket relay)
  │
  ▼
OpenClaw Gateway (port 18789)
  │
  ├─► Dispatcher Agent (intent classification)
  │     │
  │     ▼
  │   Route to specialist agent (Marketing, IT, Secretary, Sales)
  │     │
  │     ▼
  │   Agent decides on tool call(s)
  │     │
  ▼     ▼
Safety Wrapper (port 8200)
  │
  ├─ 1. Classify command (GREEN/YELLOW/YELLOW_EXT/RED/CRITICAL_RED)
  ├─ 2. Check agent's effective autonomy level
  ├─ 3. Check external comms gate (if YELLOW_EXT)
  │
  ├─ IF ALLOWED:
  │   ├─ 4. Resolve SECRET_REFs from encrypted registry
  │   ├─ 5. Execute tool call (shell/Docker/API/browser)
  │   ├─ 6. Scrub secrets from response
  │   ├─ 7. Log to audit trail
  │   └─ 8. Return result to OpenClaw → Agent → User
  │
  └─ IF GATED:
      ├─ 4. Create approval request with human-readable description
      ├─ 5. POST to Hub /api/v1/tenant/approval-request
      ├─ 6. Hub pushes to mobile app via WebSocket
      ├─ 7. Mobile shows push notification: "[Approve] [Deny]"
      ├─ 8. User taps Approve → Hub relays to Safety Wrapper
      └─ 9. Safety Wrapper resumes execution from step 4 of ALLOWED path

7.2 Secrets Injection Flow

Agent decides to call NocoDB API
  │
  ▼
OpenClaw sends tool call to Safety Wrapper:
  exec("curl http://127.0.0.1:3037/api/v2/tables -H 'xc-token: SECRET_REF(nocodb_api_token)'")
  │
  ▼
Safety Wrapper intercepts:
  1. Classify: GREEN (read-only query) → auto-execute
  2. Resolve SECRET_REF: look up "nocodb_api_token" in encrypted SQLite
  3. Substitute: SECRET_REF(nocodb_api_token) → "xc_abc123def456..."
  4. Execute curl with real token
  │
  ▼
Tool responds:
  { "tables": [...] }   ← response may contain secrets in error messages
  │
  ▼
Safety Wrapper scrubs response:
  Run through mini redaction pipeline (registry match + regex)
  │
  ▼
Secrets Proxy intercepts agent's next LLM call:
  Full 4-layer redaction on all outbound text
  │
  ▼
LLM receives: clean data, no secrets
  Agent sees: [SECRET_REF:nocodb_api_token] (never the real value)
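
A minimal sketch of the resolve and scrub steps, assuming the registry has already been decrypted into an in-memory dictionary. The function names are hypothetical, and only the registry-match half of the redaction pipeline is shown; the regex layer is omitted.

```python
import re

# Matches the SECRET_REF(name) syntax used in the tool call above.
SECRET_REF = re.compile(r"SECRET_REF\(([A-Za-z0-9_]+)\)")


def resolve_refs(command: str, registry: dict[str, str]) -> str:
    """Substitute every SECRET_REF(name) with its real value.

    Raises KeyError if a referenced secret is not in the registry.
    """
    return SECRET_REF.sub(lambda m: registry[m.group(1)], command)


def scrub(text: str, registry: dict[str, str]) -> str:
    """Replace any registry value found in text with its placeholder."""
    # Longest values first, so a secret that contains another is replaced whole.
    for name, value in sorted(registry.items(), key=lambda kv: -len(kv[1])):
        if value:  # an empty value would match everywhere
            text = text.replace(value, f"[SECRET_REF:{name}]")
    return text
```

With a registry of {"nocodb_api_token": "xc_abc123def456"}, the curl command above resolves to the real token just before execution, and an error message that echoes the token comes back with [SECRET_REF:nocodb_api_token] in its place.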

7.3 Token Metering Flow

Every LLM call:
  Agent → OpenClaw → Secrets Proxy → OpenRouter → LLM Provider
                                                       │
  OpenRouter response includes:                        │
    usage: { input_tokens, output_tokens,               │
             cache_read_tokens, cache_write_tokens }    │
                                                       ▼
  Safety Wrapper captures (via response headers or proxy inspection):
    { agent_id, model, input_tokens, output_tokens,
      cached_tokens, timestamp, request_id }
                │
                ▼
  Local SQLite (token_usage table):
    INSERT per-call record
                │
                ▼
  Hourly aggregation job:
    GROUP BY agent_id, model, strftime('%Y-%m-%d %H:00', timestamp)
    → TokenUsageBucket records
                │
                ▼
  Heartbeat (every 60s) or dedicated POST:
    Safety Wrapper → Hub /api/v1/tenant/usage
    Payload: array of unsent TokenUsageBucket records
                │
                ▼
  Hub processes:
    1. Store in PostgreSQL TokenUsageBucket table
    2. Update BillingPeriod.tokensUsed
    3. Check pool exhaustion → trigger overage if needed
    4. Report to Stripe Billing Meter (hourly batch)
                │
                ▼
  Stripe calculates overage on next invoice
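
The hourly aggregation step could look roughly like this against the local SQLite store. The schema is an assumption built from the captured fields listed above, with timestamps stored as UTC text. The bucket key includes the date so that the same hour on different days is not merged.

```python
import sqlite3

# HYPOTHETICAL schema: column names follow the captured fields in the flow above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS token_usage (
    request_id    TEXT PRIMARY KEY,
    agent_id      TEXT NOT NULL,
    model         TEXT NOT NULL,
    input_tokens  INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cached_tokens INTEGER NOT NULL,
    timestamp     TEXT NOT NULL
)
"""


def aggregate_hourly(conn: sqlite3.Connection) -> list[tuple]:
    """Return (agent_id, model, hour, input, output, cached) rows, one per hour bucket."""
    return conn.execute(
        """
        SELECT agent_id, model,
               strftime('%Y-%m-%d %H:00', timestamp) AS hour,
               SUM(input_tokens), SUM(output_tokens), SUM(cached_tokens)
        FROM token_usage
        GROUP BY agent_id, model, hour
        ORDER BY agent_id, model, hour
        """
    ).fetchall()
```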

7.4 Provisioning Flow

1. Customer completes Stripe checkout on Website
2. Stripe webhook → Hub creates User + Subscription + Order (PAYMENT_CONFIRMED)
3. Automation state machine: PAYMENT_CONFIRMED → AWAITING_SERVER
4. Hub assigns Netcup server from pre-provisioned pool (EU or US region)
5. State: AWAITING_SERVER → SERVER_READY
6. Hub creates DNS records (A records for all tool subdomains)
7. State: SERVER_READY → DNS_PENDING → DNS_READY
8. Hub spawns Provisioner Docker container with job config
9. Provisioner:
   a. SSH into VPS (port 22022)
   b. Steps 1-8: system setup, Docker, nginx, firewall, SSH hardening
   c. Step 9: Deploy 28+ tool stacks via docker-compose
   d. Step 10: Deploy OpenClaw + Safety Wrapper + Secrets Proxy
      - Generate 50+ credentials via env_setup.sh
      - Generate Safety Wrapper config (secrets registry seed, agent configs)
      - Generate OpenClaw config (model routing, agent definitions, caching)
      - Start all three processes
      - Run Playwright initial-setup scenarios via OpenClaw browser
      - Generate SSL certs via Let's Encrypt
10. Safety Wrapper registers with Hub, receives API key
11. State: PROVISIONING → FULFILLED
12. Customer receives welcome email with dashboard URL + app download links
13. Heartbeat loop begins (Safety Wrapper → Hub, every 60 seconds)
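
The order states in this flow form a linear state machine, which can be sketched as a transition table. The state names are taken from the steps above. The DNS_READY to PROVISIONING edge is not stated explicitly and is assumed here; the function name is hypothetical.

```python
# Allowed transitions for the automation state machine.
TRANSITIONS = {
    "PAYMENT_CONFIRMED": {"AWAITING_SERVER"},
    "AWAITING_SERVER": {"SERVER_READY"},
    "SERVER_READY": {"DNS_PENDING"},
    "DNS_PENDING": {"DNS_READY"},
    "DNS_READY": {"PROVISIONING"},  # ASSUMED: implied by steps 8 to 11
    "PROVISIONING": {"FULFILLED"},
    "FULFILLED": set(),
}


def advance(current: str, target: str) -> str:
    """Return target if the transition is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Rejecting anything outside the table means a replayed or out-of-order event cannot skip a step such as DNS setup.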

8. Inter-Agent Communication

8.1 Dispatcher Hub Pattern

The Dispatcher is a first-class default agent — the user's primary point of contact. Every tenant gets one. It has three responsibilities:

Intent routing: Classifies user messages and delegates to specialist agents
Workflow decomposition: Breaks multi-domain requests into ordered steps across agents
Morning briefing: Aggregates overnight activity from all agents into a unified summary

The Dispatcher has NO direct tool access (no shell, no docker, no file operations). It works exclusively through agent-to-agent delegation. This keeps it lightweight and prevents scope creep.

8.2 Agent-to-Agent Communication

Agents communicate through OpenClaw's native agentToAgent tool, which is enabled for all agents:

{
  "tools": {
    "agentToAgent": {
      "enabled": true,
      "allow": ["dispatcher", "it-admin", "marketing", "secretary", "sales"]
    }
  }
}

Communication patterns:

Dispatcher → Specialist: "Handle this user request" (primary pattern)
Specialist → Specialist: "What's the current Ghost version?" (peer queries)
Specialist → Dispatcher: "Task complete, here's the result" (reporting)

Safety controls:

Maximum dispatch depth: 5 levels (prevents A→B→A→B→... loops)
Rate limiting: each agent is capped at a maximum number of inter-agent dispatches per minute
Full audit trail: every dispatch logged with source, target, task, result
User visibility: all agent activity visible in mobile app's Activity feed
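
The depth limit and per-agent rate limit could be enforced with a small guard like the one below. This is a sketch, not OpenClaw's implementation: the class name is hypothetical, the per-minute cap is a parameter because the document gives no number, and depth is assumed to count from 1 for the first dispatch.

```python
import time
from collections import defaultdict, deque

MAX_DEPTH = 5  # from the safety controls above


class DispatchGuard:
    def __init__(self, max_per_minute: int, clock=time.monotonic):
        self.max_per_minute = max_per_minute  # HYPOTHETICAL: no value given in the document
        self.clock = clock
        self.sent = defaultdict(deque)  # source agent -> timestamps of recent dispatches

    def check(self, source: str, depth: int) -> bool:
        """Return True and record the dispatch if it is within both limits."""
        if depth > MAX_DEPTH:
            return False  # breaks A -> B -> A -> B loops
        now = self.clock()
        window = self.sent[source]
        while window and now - window[0] >= 60:
            window.popleft()  # drop dispatches older than one minute
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True
```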

8.3 Shared Memory

Each agent has its own workspace, but all agents get extraPaths pointing to /opt/letsbe/shared-memory/. When one agent writes to the shared directory, others discover it via memory_search. This enables cross-agent knowledge sharing without breaking workspace isolation.

9. Memory Architecture

9.1 OpenClaw Native Memory

Layer	Location	Purpose	Loaded When
Daily logs	`memory/YYYY-MM-DD.md`	Session context	Today + yesterday
Long-term	`MEMORY.md`	Curated durable knowledge	Private sessions
Transcripts	Session JSONL	Full conversation recall	Via `memory_search`

9.2 Memory Search

Hybrid retrieval combining:

Vector search (cosine similarity via sqlite-vec): Semantic matching
BM25 keyword search (SQLite FTS5): Exact token matching
MMR re-ranking (lambda 0.7): Balances relevance with diversity
Temporal decay (30-day half-life): Boosts recent memories
Local embeddings (ggml-org/embeddinggemma-300m-qat-q8_0-GGUF, ~0.6GB)
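
Two of these components have simple closed forms. The sketch below uses the standard MMR formulation, score = lambda * sim(doc, query) - (1 - lambda) * max sim(doc, selected), and an exponential decay with a 30-day half-life. It is plain Python for illustration; how OpenClaw combines these scores internally is not specified here, and the function names are hypothetical.

```python
import math


def decay(age_days: float, half_life_days: float = 30.0) -> float:
    """Weight that halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query, docs, k: int, lam: float = 0.7) -> list:
    """Pick k ids from docs, a list of (id, vector), trading relevance against redundancy."""
    remaining = list(docs)
    selected = []
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]
```

With lambda at 0.7 relevance still dominates, but a near-duplicate of an already selected memory loses to a slightly less similar one that adds new information.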

9.3 Token Efficiency Strategy

Strategy	Impact
Tool registry (structured JSON, ~2.5K tokens) vs. verbose skills	~80% reduction in tool context
On-demand cheat sheets vs. always-loaded skills	Only pay for tools used in session
Compact SOUL.md (~600-800 tokens per agent)	~50% reduction in identity context
`cacheRetention: "long"` (1 hour)	80-99% cheaper on repeated SOUL.md calls
Context pruning (`cache-ttl`, 1h default)	Auto-removes stale tool outputs
Session compaction	Keeps long conversations from blowing up costs

Base context cost per agent: master skill (~700 tokens) + tool registry (~2,500 tokens) = ~3,200 tokens — regardless of how many tools are installed. Compare to 30 individual skills at ~750 tokens each = ~22,500 tokens always in context.

10. Network Security

10.1 Firewall Rules

# UFW configuration (set during provisioning step 5)
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp      # HTTP (nginx → redirect to HTTPS)
ufw allow 443/tcp     # HTTPS (nginx → tool web UIs + Hub API)
ufw allow 22022/tcp   # SSH (hardened port, key-only auth)
ufw enable

NOT exposed:

Port 18789 (OpenClaw) — loopback only
Port 8200 (Safety Wrapper) — loopback only
Port 8100 (Secrets Proxy) — loopback only
Ports 3001-3099 (tool containers) — loopback only, accessed via nginx

10.2 TLS

All tool web UIs served via nginx with Let's Encrypt certificates
Auto-renewal via certbot cron
Strict Transport Security headers
OCSP stapling enabled

10.3 Inter-Process Authentication

From → To	Auth Method
OpenClaw → Safety Wrapper	Shared secret token (generated at provisioning)
Safety Wrapper → Secrets Proxy	Unix socket (no network, filesystem permissions)
Safety Wrapper → Hub	Bearer token (Hub API key, received at registration)
Safety Wrapper → Hub (registration)	One-time registration token, exchanged for the Hub API key
Mobile → Hub	JWT (NextAuth session)
Hub → Tenant via nginx	Not needed — Safety Wrapper initiates all Hub communication

10.4 SSRF Protection

OpenClaw's browser tool has configurable URL allowlists. LetsBe restricts browser navigation to:

127.0.0.1:* (localhost tool UIs)
Tool-specific external URLs (if configured)
Blocks: metadata endpoints (169.254.169.254), internal networks, file:// URIs
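
As an illustration of these rules, a navigation check might look like the sketch below. It is not OpenClaw's allowlist implementation: the function name and the extra_hosts parameter are hypothetical, and it does not defend against DNS rebinding because allowed hostnames are not resolved.

```python
import ipaddress
from urllib.parse import urlsplit

LOOPBACK = ipaddress.ip_address("127.0.0.1")


def is_allowed(url: str, extra_hosts: frozenset = frozenset()) -> bool:
    """Allow http(s) to 127.0.0.1 on any port, plus explicitly configured hosts."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # blocks file:// and other schemes
    host = parts.hostname
    if host is None:
        return False
    if host in extra_hosts:
        return True  # tool-specific external URL
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname that is not on the allowlist
    # Any other IP literal (metadata endpoint, internal ranges) is rejected.
    return ip == LOOPBACK
```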

11. Scalability & Performance

11.1 Horizontal Scaling

Each tenant is an independent VPS — horizontal scaling means adding more VPS instances. No shared state between tenants. The Hub handles N tenants, scaling its own PostgreSQL and server capacity as needed.

11.2 Vertical Scaling

Tier upgrades: Lite → Build → Scale → Enterprise. The provisioner can migrate tool stacks to a larger VPS. OpenClaw and Safety Wrapper configs don't change — only resource limits increase.

11.3 Performance Targets

Metric	Target	Measured At
Secrets redaction latency	<10ms per LLM call	Secrets Proxy
Command classification latency	<5ms per tool call	Safety Wrapper
Approval round-trip (auto-execute)	<50ms	Safety Wrapper
Approval round-trip (with mobile)	<30 seconds typical	Safety Wrapper → Hub → Mobile → Hub → Safety Wrapper
Agent response time	2-15 seconds (model-dependent)	End-to-end
Heartbeat interval	60 seconds	Safety Wrapper → Hub
Config sync latency	<60 seconds (next heartbeat)	Hub → Safety Wrapper

12. Disaster Recovery & Backup

12.1 Application-Level Backups (Existing)

The Provisioner deploys backups.sh (~473 lines):

18 PostgreSQL databases + 2 MySQL + 1 MongoDB
Daily 2:00 AM cron job
Rotation: 7 daily local + 4 weekly remote (via rclone)
Output: backup-status.json with per-database status

12.2 Backup Monitoring (NEW)

An OpenClaw cron job at 6:00 AM reads backup-status.json and checks:

Was the backup updated today?
Are all databases listed?
Were there any failures?

The result is reported to the Hub via the Safety Wrapper's /tenant/backup-status endpoint.
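
The three checks could be expressed as below. The layout of backup-status.json is not shown in this document, so the "date" and "databases" keys, the "ok" status value, and the function name are all assumptions made for illustration.

```python
from datetime import date


def check_backup(status: dict, expected: set[str], today: date) -> list[str]:
    """Return a list of problems; an empty list means the backup looks healthy."""
    problems = []
    # 1. Was the backup updated today? (ASSUMED key: "date", ISO format)
    if status.get("date") != today.isoformat():
        problems.append("backup-status.json was not updated today")
    results = status.get("databases", {})  # ASSUMED shape: {name: "ok" | other}
    # 2. Are all expected databases listed?
    for name in sorted(expected - set(results)):
        problems.append(f"missing database: {name}")
    # 3. Did any listed database fail?
    for name, state in sorted(results.items()):
        if state != "ok":
            problems.append(f"backup failed: {name}")
    return problems
```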

12.3 VPS Snapshots

Daily Netcup VPS snapshots via SCP API:

Triggered by Hub cron job
3 snapshots retained (rolling)
Staggered across tenants to avoid API rate limits
Free to create and store

12.4 Recovery Procedures

Scenario	Recovery
Single tool database corruption	Restore from application-level dump
OpenClaw/Safety Wrapper state loss	Restore from VPS snapshot
Full VPS failure	Restore from snapshot to new VPS, re-provision
Hub database loss	Separate Hub backup strategy (not tenant concern)

13. Error Handling & Resilience

13.1 Severity-Based Alerting

Severity	Examples	Auto-Recovery	Alert
Soft	OpenClaw crash, Secrets Proxy restart, tool adapter timeout	Auto-restart immediately	Push notification after 3 failures in 1 hour
Medium	Tool API unreachable, OpenRouter timeout, Hub communication failure	Retry with backoff (30s → 1m → 5m)	Push notification after 3 consecutive failures
Hard	Auth token rejected, secrets registry corrupted, disk full, SSL expired	Stop affected component, do NOT auto-restart	Immediate push to customer + Hub alert to staff
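
The Medium row's retry schedule can be sketched as a small wrapper. The delays (30s, 1m, 5m) and the alert threshold of three consecutive failures come from the table. The function name, the injected sleep and alert callables, and the choice to make one final attempt after the last delay are assumptions.

```python
import time

BACKOFF_SECONDS = (30, 60, 300)  # 30s -> 1m -> 5m


def call_with_backoff(op, sleep=time.sleep, alert=print):
    """Run op, retrying after each backoff delay; alert on the 3rd consecutive failure."""
    failures = 0
    last_error = None
    for delay in (0, *BACKOFF_SECONDS):
        if delay:
            sleep(delay)
        try:
            return op()
        except Exception as exc:
            failures += 1
            last_error = exc
            if failures == 3:
                alert(f"3 consecutive failures: {exc}")
    raise last_error
```

Passing sleep and alert as parameters keeps the wrapper testable without real waiting or real push notifications.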

13.2 Model Failover

OpenClaw supports native failover chains:

{
  "model": {
    "primary": "anthropic/claude-sonnet-4-6",
    "fallbacks": ["anthropic/claude-haiku-4-5", "google/gemini-2.0-flash"]
  }
}

OpenClaw rotates auth profiles before it falls back to a different model: if the primary fails because of an API key issue, the next auth profile is tried first.
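
To make the ordering concrete, here is a sketch of the behavior described above. OpenClaw handles this natively, so none of this is its code: AuthError, the invoke callable, and the auth_profiles list are hypothetical stand-ins. Auth failures rotate to the next profile on the same model; any other failure moves on to the next model in the chain.

```python
class AuthError(Exception):
    """HYPOTHETICAL marker for a rejected API key."""


def call_with_failover(config: dict, auth_profiles: list, invoke):
    """Try primary then fallbacks; within a model, rotate auth profiles on AuthError."""
    models = [config["model"]["primary"], *config["model"]["fallbacks"]]
    errors = []
    for model in models:
        for profile in auth_profiles:
            try:
                return invoke(model, profile)
            except AuthError as exc:
                errors.append((model, profile, exc))  # rotate to the next profile
            except Exception as exc:
                errors.append((model, profile, exc))
                break  # non-auth failure: fall back to the next model
    raise RuntimeError(f"all models failed: {errors}")
```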

13.3 Graceful Degradation

Component Down	User Experience
Single tool	Agent says "I can't reach X right now. I'll try again shortly."
Secrets Proxy	Agents pause (can't make LLM calls). Resume on restart (~2-5s).
Safety Wrapper	Tool calls blocked. Agents can still respond from cached context. Resume on restart.
OpenClaw	All agents offline. Auto-restart. User sees "Your AI team is restarting."
Hub	Agents continue locally (cached config). Heartbeats queue. Approvals delayed.
OpenRouter	Model failover chain. If all fail, agent reports temporary issue.
Mobile app	Customer portal (web) available as fallback.

End of System Architecture Document
