

LetsBe Biz — System Architecture

Date: February 27, 2026
Team: Claude Opus 4.6 Architecture Team
Document: 01 of 09
Status: Proposal (competing with independent team)


Table of Contents

  1. Architecture Philosophy
  2. High-Level System Overview
  3. Tenant Server Architecture
  4. Central Platform Architecture
  5. Four-Layer Security Model
  6. AI Autonomy Levels
  7. Data Flow Diagrams
  8. Inter-Agent Communication
  9. Memory Architecture
  10. Network Security
  11. Scalability & Performance
  12. Disaster Recovery & Backup
  13. Error Handling & Resilience

1. Architecture Philosophy

1.1 Non-Negotiable Principles

Principle 1 — Secrets Never Leave the Server

All credential redaction happens locally on the tenant VPS before any data reaches an LLM provider. This is enforced at the transport layer through a dedicated Secrets Proxy process — not by trusting the AI to behave, not by configuration, not by policy. The enforcement point is a separate process that sits between OpenClaw and the internet. Traffic that hasn't passed through the Secrets Proxy physically cannot reach an LLM. This is the single most important architectural invariant.

Principle 2 — Per-Tenant Physical Isolation

One customer = one VPS. No multi-tenancy, no shared containers, no shared databases. Each tenant's data, credentials, agent state, and conversation history live on that customer's own VPS. This is permanent for v1. It eliminates entire categories of security vulnerabilities (cross-tenant data leaks, noisy-neighbor performance issues, shared-secret compromise) at the cost of higher per-customer infrastructure spend.

Principle 3 — Defense in Depth (Four Independent Security Layers)

Security is not one wall — it's four independent layers, each enforced by different mechanisms, each unable to expand access granted by layers above. A failure in any single layer does not compromise the system because the remaining three layers still enforce their restrictions independently:

| Layer | Mechanism | Enforced By | Bypassable By AI? |
| --- | --- | --- | --- |
| 1. Sandbox | Container isolation | Docker / OS kernel | No |
| 2. Tool Policy | Per-agent allow/deny arrays | OpenClaw config (loaded at startup) | No |
| 3. Command Gating | 5-tier classification + autonomy levels | Safety Wrapper (separate process) | No |
| 4. Secrets Redaction | 4-layer redaction pipeline | Secrets Proxy (separate process) | No |

Principle 4 — OpenClaw Stays Vanilla

OpenClaw is treated as an upstream dependency, never a fork. All LetsBe-specific logic (secrets redaction, command gating, Hub communication, tool adapters, billing metering) lives in a Safety Wrapper process that runs alongside OpenClaw. This means:

  • Upstream security patches apply cleanly
  • New OpenClaw features are available without merge conflicts
  • Our competitive IP is cleanly separated from the upstream codebase
  • We pin to a tested release tag and upgrade monthly after staging verification

Principle 5 — Graceful Degradation

Every component has a failure mode that preserves the user's experience:

  • Hub goes down → agents continue working from cached config; approvals queue locally
  • OpenRouter goes down → model failover chains try alternatives; agents pause gracefully
  • Single tool goes down → agent reports it, other tools continue
  • Safety Wrapper restarts → agents pause briefly (~2-5s), auto-resume
  • Secrets Proxy restarts → LLM calls fail temporarily, auto-resume
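
The model failover behaviour in the list above can be illustrated with a small sketch. This is not OpenClaw's fallback implementation, which this document does not describe; `callWithFailover` and `ModelCall` are hypothetical names, and the only thing taken from the text is the pattern of trying alternatives in order and failing cleanly when none respond.

```typescript
// Hypothetical signature for "send this request to one model".
type ModelCall<T> = (model: string) => Promise<T>;

// Try each model in order and return the first successful result.
// If every model fails, throw one error that lists all failures so the
// caller can pause the agent instead of retrying forever.
async function callWithFailover<T>(models: string[], call: ModelCall<T>): Promise<T> {
  const errors: string[] = [];
  for (const model of models) {
    try {
      return await call(model);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${model}: ${reason}`);
    }
  }
  throw new Error(`All models failed (${errors.join("; ")})`);
}
```

A caller would pass the primary model followed by its fallbacks and treat the final error as the signal to pause gracefully.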

1.2 Key Divergence from Technical Architecture v1.2

The Technical Architecture v1.2 proposes the Safety Wrapper as an in-process OpenClaw extension running inside the Gateway process, with only a thin Secrets Proxy as a separate process. After deep research into OpenClaw's plugin system, we propose a fundamentally different approach.

Our proposal: Safety Wrapper as a SEPARATE process (localhost:8200)

Three findings drive this decision:

  1. Hook Gap (GitHub Discussion #20575): OpenClaw's before_tool_call and after_tool_call hooks are NOT bridged to external plugins. The internal hook system fires events via emitEvent() but never calls triggerInternalHook() for external plugin consumers. This means an in-process extension CANNOT reliably intercept tool calls — the exact mechanism the v1.2 architecture depends on for command classification and secrets injection.

  2. CVE-2026-25253 (CVSS 8.8): Cross-site WebSocket hijacking vulnerability in OpenClaw, patched 2026-01-29. An in-process extension shares the vulnerability surface with the host process. A separate process has an independent attack surface — compromising OpenClaw doesn't automatically compromise the Safety Wrapper.

  3. Synchronous hook limitation: tool_result_persist hook is synchronous — it cannot return Promises. This limits what an in-process extension can do for async operations like Hub API calls, approval requests, and token reporting.

Impact on architecture:

  • Safety Wrapper runs as a separate Node.js process on localhost:8200
  • OpenClaw is configured to route tool calls through the Safety Wrapper's HTTP API
  • Secrets Proxy remains as a separate thin process on localhost:8100
  • Total: 3 LetsBe processes (OpenClaw + Safety Wrapper + Secrets Proxy) + nginx + tool containers
  • RAM overhead increases by ~64MB (from ~576MB to ~640MB) — acceptable on all tiers

1.3 Why These Principles Matter for the Business

Privacy-first architecture is the competitive moat. SMBs increasingly distrust cloud-only AI solutions, and stories of training data leaks, terms-of-service changes, and API key compromises appear in the news regularly. LetsBe's "secrets never leave your server" guarantee is verifiable (the Secrets Proxy is inspectable) and defensible (transport-layer enforcement can't be bypassed by prompt injection). This positions LetsBe uniquely against competitors who run AI in multi-tenant cloud environments.


2. High-Level System Overview

2.1 Two-Domain Architecture

The platform operates across two distinct trust domains connected by HTTPS:

┌─────────────────────────────────────────────────────────────────────┐
│                        CENTRAL PLATFORM                             │
│                    (LetsBe infrastructure)                           │
│                                                                     │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐    │
│  │     Hub      │   │  Provisioner │   │      Website         │    │
│  │  (Next.js)   │   │  (Bash/SSH)  │   │    (Next.js SSG)     │    │
│  │              │   │              │   │                      │    │
│  │ Admin Portal │   │ 10-step VPS  │   │ Marketing + AI       │    │
│  │ Customer API │   │ setup via    │   │ onboarding chat +    │    │
│  │ Billing      │   │ Docker       │   │ Stripe checkout      │    │
│  │ Tenant Comms │   │              │   │                      │    │
│  └──────┬───────┘   └──────┬───────┘   └──────────────────────┘    │
│         │                  │                                        │
│         │   PostgreSQL     │                                        │
│         └──────┬───────────┘                                        │
│                │                                                    │
└────────────────┼────────────────────────────────────────────────────┘
                 │
                 │  HTTPS (heartbeat, config sync, approvals, usage)
                 │  SSH (provisioning only — one-shot, no persistent connection)
                 │
┌────────────────┼────────────────────────────────────────────────────┐
│                │           TENANT SERVER                            │
│                │      (Customer's isolated VPS)                     │
│                │                                                    │
│  ┌─────────────▼──────────┐                                        │
│  │    Safety Wrapper       │◄────── Hub API Key auth               │
│  │    (localhost:8200)     │                                        │
│  │                         │                                        │
│  │  Command Classification │        ┌──────────────────┐           │
│  │  Secrets Registry (SQLite)│      │  Secrets Proxy   │           │
│  │  Tool Execution Proxy   │───────►│  (localhost:8100) │           │
│  │  Hub Communication      │        │                  │           │
│  │  Token Metering         │        │  4-layer redact  │──► LLM   │
│  │  Audit Logger           │        │  <10ms overhead  │  (OpenRouter)
│  └────────────┬────────────┘        └──────────────────┘           │
│               │                                                     │
│  ┌────────────▼────────────┐                                        │
│  │      OpenClaw           │                                        │
│  │   (Gateway:18789)       │                                        │
│  │                         │                                        │
│  │  Agent Runtime          │     ┌──────────────────────────────┐  │
│  │  Session Management     │     │     Tool Stacks (Docker)     │  │
│  │  Prompt Caching         │     │                              │  │
│  │  Browser (Playwright)   │     │  Ghost    Cal.com   Nextcloud│  │
│  │  Channels (WA/TG)      │     │  Chatwoot Odoo      NocoDB   │  │
│  │  Cron / Webhooks        │     │  Listmonk Umami    Keycloak  │  │
│  └─────────────────────────┘     │  ... 20+ more containers    │  │
│                                   └──────────────────────────────┘  │
│  ┌─────────────────────────┐                                        │
│  │   nginx (80/443)        │  Only external-facing process          │
│  └─────────────────────────┘                                        │
└─────────────────────────────────────────────────────────────────────┘

2.2 Trust Boundaries

                    UNTRUSTED                │           TRUSTED (on-VPS)
                                             │
    External LLM Providers ◄─────────────────┤◄── Secrets Proxy (redacts ALL secrets)
    (via OpenRouter:                          │         ▲
     Anthropic, Google,                       │         │ outbound LLM traffic only
     DeepSeek, OpenAI, etc.)                  │         │
                                             │    Safety Wrapper (classifies commands)
    Internet Users ─────────► nginx ──────►  │         │
                              (TLS)          │         ▼
                                             │    OpenClaw (agent runtime)
    Mobile App ◄─────► Hub ◄────────────────►│         │
    (WebSocket)        (relay)               │         ▼
                                             │    Tool Containers
    Messaging Channels ◄────────────────────►│    (Ghost, Nextcloud, Cal.com, etc.)
    (WhatsApp, Telegram)                      │

Key boundaries:

  • LLMs are UNTRUSTED — all outbound traffic is sanitized by Secrets Proxy
  • The Internet is UNTRUSTED — only nginx port 80/443 and SSH 22022 are exposed
  • Hub communication is AUTHENTICATED — Bearer token over HTTPS
  • Inter-process communication is LOCAL — localhost only, no network exposure

2.3 Network Boundary

  • Central → Tenant: SSH (provisioning, one-shot), HTTPS (API calls to Safety Wrapper if needed)
  • Tenant → Central: HTTPS (heartbeat, config sync, approval requests, usage reporting)
  • Tenant → Internet: Only through Secrets Proxy (LLM calls) and nginx (tool web UIs)
  • No persistent connections: Heartbeat is periodic HTTP POST, not WebSocket

3. Tenant Server Architecture

3.1 Process Map

Every tenant VPS runs the following processes:

| Process | Port | Protocol | RAM Budget | Restartable | Purpose |
| --- | --- | --- | --- | --- | --- |
| OpenClaw Gateway | 18789 | HTTP+WS | ~384MB (includes Chromium ~200MB) | Yes (Docker restart) | AI agent runtime, session management, browser tool |
| Safety Wrapper | 8200 | HTTP | ~128MB | Yes (Docker restart) | Command gating, secrets registry, Hub comms, metering |
| Secrets Proxy | 8100 | HTTP | ~64MB | Yes (Docker restart) | Outbound LLM traffic redaction (4-layer pipeline) |
| nginx | 80, 443 | HTTP/S | ~32MB | Yes (systemd) | Reverse proxy, TLS termination, tool routing |
| Tool containers | 3001-3099 | Various | ~128-512MB each | Yes (Docker restart) | Ghost, Nextcloud, Cal.com, etc. (28+) |
| Monitoring | | | ~32MB | Yes | Netdata or lightweight metrics agent |

Total LetsBe overhead: ~640MB (OpenClaw 384MB + Safety Wrapper 128MB + Secrets Proxy 64MB + nginx 32MB + monitoring 32MB)

3.2 Memory Budget per Tier

| Tier | Total RAM | LetsBe Overhead | Available for Tools | Max Practical Tools | Chromium? |
| --- | --- | --- | --- | --- | --- |
| Lite (8GB) | 8,192MB | 640MB | ~7,552MB | 8-12 (constrained) | Yes, but consider browser-less mode |
| Build (16GB) | 16,384MB | 640MB | ~15,744MB | 15-20 (comfortable) | Yes |
| Scale (32GB) | 32,768MB | 640MB | ~32,128MB | 25-30 (full stack) | Yes |
| Enterprise (64GB) | 65,536MB | 640MB | ~64,896MB | 30+ with headroom | Yes |

Lite tier note: With ~7.5GB for tools, the Lite tier is tight. Each tool averages 256-512MB. A Freelancer bundle (7 tools) at ~2.5GB fits comfortably. The Lite tier is hidden at launch until real-world memory profiling confirms it's viable. If browser-less mode is needed (saves ~200MB from Chromium), OpenClaw supports running without the browser tool.
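
The budget figures above follow from simple arithmetic, sketched here so the numbers can be re-derived if a process budget changes. The per-process values come from the process map in Section 3.1; the names `PROCESS_BUDGETS_MB` and `availableForToolsMb` are illustrative only.

```typescript
// Per-process RAM budgets in MB, copied from the Section 3.1 process map.
const PROCESS_BUDGETS_MB = [
  { name: "OpenClaw Gateway", mb: 384 },
  { name: "Safety Wrapper", mb: 128 },
  { name: "Secrets Proxy", mb: 64 },
  { name: "nginx", mb: 32 },
  { name: "Monitoring", mb: 32 },
];

// Total LetsBe overhead: 384 + 128 + 64 + 32 + 32 = 640 MB.
const totalOverheadMb = PROCESS_BUDGETS_MB.reduce((sum, p) => sum + p.mb, 0);

// RAM left for tool containers on a VPS with the given total memory.
function availableForToolsMb(totalRamGb: number): number {
  return totalRamGb * 1024 - totalOverheadMb;
}

console.log(totalOverheadMb);         // 640
console.log(availableForToolsMb(8));  // 7552 (Lite)
console.log(availableForToolsMb(16)); // 15744 (Build)
```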

3.3 OpenClaw Configuration

OpenClaw (v2026.2.6-3) is configured via ~/.openclaw/openclaw.json (JSON5 format with environment variable substitution).

Critical configuration decisions:

{
  // Route ALL LLM calls through the Secrets Proxy → OpenRouter
  "model": {
    "primary": "${SW_PROXY_MODEL}",  // e.g., "anthropic/claude-sonnet-4-6"
    "apiUrl": "http://localhost:8100/v1",  // Secrets Proxy intercepts
    "apiKey": "${OPENROUTER_API_KEY_ENCRYPTED}",  // Resolved by Secrets Proxy
    "fallbacks": ["${SW_FALLBACK_1}", "${SW_FALLBACK_2}"],
    "contextTokens": 200000
  },

  // Prompt caching — massive cost saver
  "cacheRetention": "long",          // 1 hour (SOUL.md cached 80-99% cheaper)
  "heartbeat": { "every": "55m" },   // Keep-warm to prevent cache eviction

  // Security hardening
  "security": {
    "elevated": { "enable": false },  // DISABLED — Safety Wrapper handles all elevation
    "rateLimit": {
      "maxAttempts": 10,
      "windowSeconds": 60,
      "lockoutSeconds": 300,
      "exemptLoopback": true
    }
  },

  // Tool safety
  "tools": {
    "loopDetection": { "enabled": true },  // Prevent runaway tool calls
    "exec": {
      "security": "allowlist",  // Only allowlisted binaries
      "timeout": 1800
    }
  },

  // Logging with redaction
  "logging": {
    "level": "info",
    "redactSensitive": "tools"  // Extra protection — redact tool output in logs
  },

  // Agent definitions
  "agents": {
    "list": [
      // Dispatcher, IT Admin, Marketing, Secretary, Sales
      // (see Section 8 for full configurations)
    ]
  },

  // Channel support (configured per-tenant)
  "channels": {
    "whatsapp": { "enabled": "${WHATSAPP_ENABLED}" },
    "telegram": { "enabled": "${TELEGRAM_ENABLED}" }
  }
}

3.4 Safety Wrapper Architecture (localhost:8200)

The Safety Wrapper is the core IP — where all LetsBe-specific logic lives.

┌────────────────────────────────────────────────────────────────┐
│                     SAFETY WRAPPER (localhost:8200)              │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Command          │  │ Secrets          │  │ Token        │  │
│  │ Classification   │  │ Registry         │  │ Metering     │  │
│  │ Engine           │  │ (Encrypted       │  │ Engine       │  │
│  │                  │  │  SQLite)         │  │              │  │
│  │ 5-tier classify  │  │ ChaCha20-Poly1305│  │ Per-agent    │  │
│  │ Autonomy gating  │  │ via sqleet       │  │ per-model    │  │
│  │ Ext. comms gate  │  │ WAL mode         │  │ hourly agg   │  │
│  └────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘  │
│           │                     │                    │           │
│  ┌────────▼─────────────────────▼────────────────────▼────────┐ │
│  │              Tool Execution Proxy                           │ │
│  │                                                             │ │
│  │  Intercepts ALL tool calls from OpenClaw                    │ │
│  │  1. Classify command (green/yellow/yellow_ext/red/crit_red) │ │
│  │  2. Check autonomy level + external comms gate              │ │
│  │  3. If gated → push approval to Hub, wait for response      │ │
│  │  4. If allowed → resolve SECRET_REFs from registry          │ │
│  │  5. Execute tool call (shell, Docker, API, browser)         │ │
│  │  6. Scrub secrets from response                             │ │
│  │  7. Log to audit trail                                      │ │
│  │  8. Report token usage to metering engine                   │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Hub              │  │ Audit            │  │ Config       │  │
│  │ Communication    │  │ Logger           │  │ Manager      │  │
│  │ Client           │  │                  │  │              │  │
│  │                  │  │ Append-only      │  │ Hot-reload   │  │
│  │ Registration     │  │ SQLite           │  │ autonomy lvl │  │
│  │ Heartbeat (60s)  │  │ Every tool call  │  │ ext comms    │  │
│  │ Config sync      │  │ Every approval   │  │ agent config │  │
│  │ Approval routing │  │ Every secret use │  │              │  │
│  │ Usage reporting  │  │                  │  │              │  │
│  └──────────────────┘  └──────────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────────────┘

Technology stack:

  • Node.js 22+ (same runtime as OpenClaw — one ecosystem)
  • TypeScript (strict mode)
  • No web framework (raw node:http for minimal overhead and attack surface)
  • better-sqlite3-multiple-ciphers for encrypted SQLite (secrets registry + audit log + usage buckets)
  • Key derivation: scrypt from provisioner-generated seed
  • Cipher: ChaCha20-Poly1305 via sqleet (modern AEAD, ~2x faster than AES-256-CBC on ARM)
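
As a rough illustration of the key-derivation bullet, the sketch below derives a 32-byte key from a seed with Node's built-in scrypt. The seed format, the salt handling, and the use of Node's default scrypt cost parameters are assumptions; the document only states that scrypt is applied to a provisioner-generated seed. How the key is handed to the SQLite cipher is not shown.

```typescript
import { randomBytes, scryptSync } from "node:crypto";

// Derive a 32-byte database key from the provisioner-generated seed.
// Assumption: a per-tenant salt is generated once and stored next to the
// database, because the same salt is needed to re-derive the key on restart.
function deriveRegistryKey(seed: string, salt: Buffer): Buffer {
  return scryptSync(seed, salt, 32);
}

// Example with a placeholder seed; a real seed would come from provisioning.
const exampleSalt = randomBytes(16);
const exampleKey = deriveRegistryKey("example-seed-from-provisioner", exampleSalt);
console.log(exampleKey.length); // 32
```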

3.5 Secrets Proxy Architecture (localhost:8100)

The thinnest possible process — its only job is intercepting outbound LLM traffic and scrubbing secrets.

┌─────────────────────────────────────────────────────────┐
│             SECRETS PROXY (localhost:8100)                │
│                                                           │
│  Inbound (from OpenClaw via Safety Wrapper config)        │
│  ──────────────────────────────────────────────────       │
│  POST /v1/chat/completions                                │
│  POST /v1/completions                                     │
│  POST /v1/embeddings                                      │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │         4-LAYER REDACTION PIPELINE                   │ │
│  │                                                       │ │
│  │  Layer 1: Aho-Corasick Registry Substitution          │ │
│  │  ─────────────────────────────────────────            │ │
│  │  All 50+ known secrets from encrypted registry        │ │
│  │  loaded into Aho-Corasick automaton at startup         │ │
│  │  O(n) in text length regardless of pattern count       │ │
│  │  Deterministic replacements: value → [SECRET_REF:name] │ │
│  │                                                       │ │
│  │  Layer 2: Regex Pattern Safety Net                    │ │
│  │  ─────────────────────────────────────────            │ │
│  │  7 patterns catch secrets the registry might miss:    │ │
│  │  • -----BEGIN.*PRIVATE KEY-----                       │ │
│  │  • eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+ (JWT)       │ │
│  │  • \$2[aby]?\$[0-9]+\$ (bcrypt)                      │ │
│  │  • ://[^:]+:[^@]+@ (connection strings)              │ │
│  │  • (PASSWORD|SECRET|KEY|TOKEN)=.+ (env patterns)      │ │
│  │  • High-entropy base64 (length > 32)                  │ │
│  │  • Hex strings 32+ chars matching known key patterns  │ │
│  │                                                       │ │
│  │  Layer 3: Shannon Entropy Filter                      │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Threshold: 4.5 bits/char, minimum length: 16 chars   │ │
│  │  H(X) = -Σ p(x) log2(p(x))                          │ │
│  │  English text: ~3.5-4.0 bits/char                     │ │
│  │  Random secrets: ~5.0-6.0 bits/char                   │ │
│  │  Catches: API keys, random passwords, hex tokens      │ │
│  │  Excludes: common words, UUIDs (known format)         │ │
│  │                                                       │ │
│  │  Layer 4: Context-Aware JSON Key Scanning             │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Scans JSON structures for sensitive keys:            │ │
│  │  password, secret, token, key, credential,            │ │
│  │  api_key, apiKey, auth, authorization, bearer,        │ │
│  │  private_key, access_token, refresh_token             │ │
│  │  Redacts the VALUE (not the key) in matched pairs     │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  Outbound → OpenRouter (HTTPS)                            │
│  Performance target: <10ms added latency per LLM call     │
│                                                           │
│  Control interface: Unix socket (Safety Wrapper only)     │
│  • Credential sync (on rotation/add/remove)               │
│  • Pattern updates                                        │
│  • Health check                                           │
└─────────────────────────────────────────────────────────┘
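
Layer 3 of the pipeline above can be sketched directly from its formula. The function names are illustrative, and treating the threshold as inclusive is an assumption. One property worth checking during tuning: when entropy is computed from a string's own character frequencies, a string with n distinct characters cannot exceed log2(n) bits/char, so a 16-character token tops out at 4.0 and a hex-only token at 4.0 regardless of length. Tokens of that kind would fall below a 4.5 threshold and would have to be caught by Layers 1, 2, or 4.

```typescript
// Shannon entropy in bits per character, H(X) = -sum p(x) * log2(p(x)),
// using the character frequencies of the input string itself.
function shannonEntropy(text: string): number {
  const chars = Array.from(text);
  if (chars.length === 0) return 0;
  const counts = new Map<string, number>();
  chars.forEach((ch) => counts.set(ch, (counts.get(ch) ?? 0) + 1));
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

const ENTROPY_THRESHOLD = 4.5; // bits/char, from Layer 3
const MIN_TOKEN_LENGTH = 16;   // chars, from Layer 3

// Flags a token as a likely secret. Inclusive comparison is an assumption.
function looksLikeSecret(token: string): boolean {
  return token.length >= MIN_TOKEN_LENGTH && shannonEntropy(token) >= ENTROPY_THRESHOLD;
}
```

The UUID and common-word exclusions listed for Layer 3 are not modelled here.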

3.6 Container Layout

| Container | Image | Network | Ports | Resources |
| --- | --- | --- | --- | --- |
| letsbe-openclaw | Custom (OpenClaw + CLI binaries + config) | host | 18789 (loopback) | ~384MB |
| letsbe-safety-wrapper | LetsBe custom (Node.js) | host | 8200 (loopback) | ~128MB |
| letsbe-secrets-proxy | LetsBe custom (Node.js, minimal) | host | 8100 (loopback) | ~64MB |
| nginx | nginx:alpine | host | 80, 443 | ~32MB |
| Tool stacks (28+) | Various (Ghost, Nextcloud, etc.) | isolated per-tool | 127.0.0.1:30XX | Variable |

Network access pattern: OpenClaw container uses --network host to reach tool containers via 127.0.0.1:30XX (e.g., 3023 for Nextcloud, 3037 for NocoDB). Each tool keeps its own isolated Docker network, and the AI accesses them through the host loopback interface. No Docker network is shared across the 28+ tool stacks.


4. Central Platform Architecture

4.1 Hub (letsbe-hub)

The most mature component (~15K LOC, 244 source files, 80+ existing endpoints, 22+ Prisma models).

Current capabilities (KEEP):

  • Staff admin dashboard with RBAC (4 roles, 20 permissions, 2FA)
  • Customer management (CRUD, subscriptions)
  • Order lifecycle (8-state automation state machine)
  • Netcup SCP API integration (full OAuth2 Device Flow)
  • Portainer integration (container management)
  • DNS verification workflow
  • Docker-based provisioning with SSE log streaming
  • Stripe checkout + webhook integration
  • Enterprise client management + monitoring
  • Email notifications, credential encryption, system settings

New capabilities (BUILD):

  • Customer-facing portal API (~14 endpoints) — dashboard, agents, approvals, usage, billing
  • Tenant communication API (~7 endpoints) — registration, heartbeat, config sync, approvals, usage
  • Billing + token metering (~7 endpoints) — Stripe Billing Meters, overage, founding member multiplier
  • Agent management API (~5 endpoints) — CRUD for agent configs, deploy to tenant
  • Command approval queue (~3 endpoints) — pending, approve, deny
  • WebSocket relay for mobile app ↔ tenant server communication

New Prisma models: TokenUsageBucket, BillingPeriod, FoundingMember, AgentConfig, CommandApproval + ServerConnection updates (see 02-COMPONENT-BREAKDOWN for full schemas)

4.2 Provisioner (letsbe-ansible-runner → letsbe-provisioner)

One-shot Bash container (~4,477 LOC) that provisions a fresh VPS via SSH.

Existing 10-step pipeline (KEEP):

  1. System packages
  2. Docker CE installation
  3. Disable conflicting services
  4. nginx + fallback config
  5. UFW firewall (ports 80, 443, 22022)
  6. Optional admin user + SSH key
  7. SSH hardening (port 22022, key-only auth, fail2ban)
  8. Unattended security updates
  9. Deploy tool stacks via docker-compose
  10. Deploy LetsBe agents + bootstrap ← UPDATE THIS STEP

Step 10 changes:

  • Deploy OpenClaw + Safety Wrapper + Secrets Proxy (replacing orchestrator + sysadmin agent)
  • Generate Safety Wrapper config (secrets registry seed, agent configs, Hub credentials, autonomy defaults)
  • Generate OpenClaw config (model routing through Secrets Proxy, agent definitions, caching, loop detection)
  • Run Playwright initial-setup scenarios via OpenClaw native browser (7 scenarios — Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma; n8n removed)
  • CRITICAL FIX: Clean up config.json after provisioning (currently contains root password in plaintext)

The provisioner currently has zero tests; container-based integration tests are part of this proposal (see 07-TESTING-STRATEGY)

4.3 Website (Separate Next.js App)

A separate Next.js application in the monorepo, sharing the @letsbe/db Prisma package. Not part of the Hub — different concerns (marketing + onboarding vs. admin + operations).

Key features:

  • Marketing pages (SSG for performance)
  • AI-powered onboarding chat (Gemini Flash for business classification, ~$0.001 per prospect)
  • Tool recommendation engine with live resource calculator
  • Stripe checkout flow
  • SSE provisioning status page
  • Shares Prisma schema via monorepo package — no data duplication

4.4 Mobile App (Expo Bare Workflow, SDK 52+)

Why Expo over alternatives:

  • EAS Build: Eliminates iOS code signing complexity — CI builds without Mac hardware
  • EAS Update: OTA updates without App Store review — critical for rapid iteration
  • expo-notifications: Action buttons on push notifications (Approve/Deny) for command gating
  • expo-local-authentication: Biometric auth (Face ID, Touch ID, Android fingerprint)
  • expo-secure-store: Secure token storage (iOS Keychain, Android Keystore)

Architecture: Mobile ↔ Hub (WebSocket relay) ↔ Tenant Server. The Hub acts as a relay, so the agent runtime on the tenant server is never directly exposed to the internet. The relay design covers JWT auth, a reconnection strategy, and offline message queuing.


5. Four-Layer Security Model

5.1 Layer 1 — Sandbox (Where Code Runs)

OpenClaw's native sandbox controls the execution environment:

| Mode | Description | LetsBe Default |
| --- | --- | --- |
| off | No containerization | Default (Safety Wrapper handles gating) |
| non-main | Only non-default agents sandboxed | For untrusted custom agents |
| all | Every agent sandboxed | Maximum isolation (performance cost) |

Default agents (Dispatcher, IT Admin, Marketing, Secretary, Sales) run with sandbox off because the Safety Wrapper provides command-level gating that's more granular than container isolation. Custom user-created agents can be sandboxed per-agent.

5.2 Layer 2 — Tool Policy (What Tools Are Visible)

OpenClaw's native agents.list[].tools.allow/deny arrays control which tools each agent can see. Deny wins over allow. Cascading restriction model:

  1. Tool profiles (tools.profile — coding, minimal, messaging, full)
  2. Global policies (tools.allow/tools.deny)
  3. Agent-specific policies (agents.list[].tools.allow/deny)

Example — Marketing Agent:

{
  "id": "marketing",
  "tools": {
    "profile": "minimal",
    "allow": ["ghost_api", "listmonk_api", "umami_api", "file_read", "browser", "nextcloud_api", "web_search", "web_fetch"],
    "deny": ["shell", "docker", "env_update"]
  }
}

Marketing can see Ghost/Listmonk/Umami but CANNOT see shell/docker/env_update — those tools don't even appear in its context.
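
The "deny wins over allow" rule can be shown with a minimal check against the Marketing Agent example. This is a sketch of the rule as stated here, not OpenClaw's resolution code: it ignores tool profiles and global policies, and `isToolVisible` is a hypothetical helper.

```typescript
interface AgentToolPolicy {
  allow: string[];
  deny: string[];
}

// Agent-level policy from the Marketing Agent example above.
const MARKETING_POLICY: AgentToolPolicy = {
  allow: ["ghost_api", "listmonk_api", "umami_api", "file_read", "browser",
          "nextcloud_api", "web_search", "web_fetch"],
  deny: ["shell", "docker", "env_update"],
};

// A tool is visible only if it is allowed and not denied.
// The deny list is checked first, so deny wins when a tool is in both lists.
function isToolVisible(tool: string, policy: AgentToolPolicy): boolean {
  if (policy.deny.includes(tool)) return false;
  return policy.allow.includes(tool);
}

console.log(isToolVisible("ghost_api", MARKETING_POLICY)); // true
console.log(isToolVisible("shell", MARKETING_POLICY));     // false
```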

5.3 Layer 3 — Command Gating (What Operations Require Approval)

Even if an agent can see a tool (Layer 2 allows it), the Safety Wrapper may gate specific operations on that tool based on command classification and the agent's effective autonomy level.

Five-tier classification:

| Tier | Color | Description | Examples |
| --- | --- | --- | --- |
| 1 | GREEN | Non-destructive reads | file_read, container_stats, container_logs, query_select, umami_read, uptime_check |
| 2 | YELLOW | Modifying operations | container_restart, file_write, env_update, nginx_reload, chatwoot_assign, calcom_create |
| 3 | YELLOW_EXTERNAL | External-facing communications | ghost_publish, listmonk_send, poste_send, chatwoot_reply_external, social_post, documenso_send |
| 4 | RED | Destructive operations | file_delete, container_remove, volume_delete, user_revoke, db_drop_table, backup_delete |
| 5 | CRITICAL_RED | Irreversible infrastructure | db_drop_database, firewall_modify, ssh_config_modify, backup_wipe_all, ssl_revoke |

Autonomy level × classification gating matrix:

| Command Tier | Training Wheels (L1) | Trusted Assistant (L2) | Full Autonomy (L3) |
| --- | --- | --- | --- |
| GREEN | Auto-execute | Auto-execute | Auto-execute |
| YELLOW | Gate → approval | Auto-execute | Auto-execute |
| YELLOW_EXTERNAL | Gate → approval | Gate → approval (unless unlocked) | Gate → approval (unless unlocked) |
| RED | Gate → approval | Gate → approval | Auto-execute |
| CRITICAL_RED | Gate → approval | Gate → approval | Gate → approval |
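
The matrix maps naturally onto a lookup table. The sketch below encodes it as data, with the YELLOW_EXTERNAL row in its default (not unlocked) state; the type and function names are illustrative rather than taken from the Safety Wrapper.

```typescript
type Tier = "GREEN" | "YELLOW" | "YELLOW_EXTERNAL" | "RED" | "CRITICAL_RED";
type AutonomyLevel = 1 | 2 | 3;
type Decision = "auto-execute" | "gate";

// One row per command tier, one column per autonomy level.
const GATING_MATRIX: Record<Tier, Record<AutonomyLevel, Decision>> = {
  GREEN:           { 1: "auto-execute", 2: "auto-execute", 3: "auto-execute" },
  YELLOW:          { 1: "gate",         2: "auto-execute", 3: "auto-execute" },
  YELLOW_EXTERNAL: { 1: "gate",         2: "gate",         3: "gate" }, // before any unlock
  RED:             { 1: "gate",         2: "gate",         3: "auto-execute" },
  CRITICAL_RED:    { 1: "gate",         2: "gate",         3: "gate" },
};

// Look up whether a classified command runs immediately or waits for approval.
function decide(tier: Tier, level: AutonomyLevel): Decision {
  return GATING_MATRIX[tier][level];
}

console.log(decide("YELLOW", 1)); // "gate"
console.log(decide("RED", 3));    // "auto-execute"
```

Keeping the matrix as data rather than branching logic makes it straightforward to audit against the table.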

5.4 Layer 4 — Secrets Redaction (Always On)

Regardless of sandbox mode, tool permissions, or autonomy level, ALL outbound LLM traffic is redacted via the Secrets Proxy's 4-layer pipeline (see Section 3.5). This layer cannot be disabled. It runs at every autonomy level. The AI never sees raw credentials.
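
To make one stage of that pipeline concrete, here is a minimal sketch of Layer 4 (context-aware JSON key scanning) using the key names listed in Section 3.5. Exact, case-insensitive key matching and the `[REDACTED]` placeholder are assumptions; the document does not specify either.

```typescript
// Sensitive key names from Section 3.5, lowercased for case-insensitive matching.
const SENSITIVE_KEYS = new Set([
  "password", "secret", "token", "key", "credential", "api_key", "apikey",
  "auth", "authorization", "bearer", "private_key", "access_token", "refresh_token",
]);

// Walk a parsed JSON value and replace the VALUE of any sensitive key.
// Keys are left intact so the structure stays readable to the model.
function redactSensitiveValues(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSensitiveValues(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEYS.has(k.toLowerCase()) ? "[REDACTED]" : redactSensitiveValues(v);
    }
    return out;
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}
```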

5.5 External Communications Gate

The external communications gate is a separate mechanism from autonomy levels. It gates all YELLOW_EXTERNAL operations by default for every agent, whatever that agent's autonomy level. Users explicitly unlock autonomous external sending per-agent per-tool via the mobile app or web portal.

Resolution logic:

  1. Command classified as YELLOW_EXTERNAL
  2. Check external_comms_gate.unlocks[agentId][toolName]
  3. If "autonomous" → follow normal autonomy level gating (YELLOW rules apply)
  4. If "gated" or not set → always gate, regardless of autonomy level
  5. Present approval: "Marketing Agent wants to publish: 'Top 10 Tips...' to your blog. [Approve] [Edit] [Deny]"
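
The resolution steps above can be sketched as a single function. The shape of `ExternalCommsGate` is inferred from the lookup path `external_comms_gate.unlocks[agentId][toolName]` and is an assumption, as are the function names.

```typescript
type AutonomyLevel = 1 | 2 | 3;
type Decision = "auto-execute" | "gate";
type UnlockState = "autonomous" | "gated";

// Assumed shape: unlocks keyed by agent id, then by tool name.
interface ExternalCommsGate {
  unlocks: Record<string, Record<string, UnlockState>>;
}

// YELLOW rules from the gating matrix: gated at L1, auto-execute at L2 and L3.
function yellowRule(level: AutonomyLevel): Decision {
  return level >= 2 ? "auto-execute" : "gate";
}

// Resolve a YELLOW_EXTERNAL command for one agent and tool.
function resolveExternal(
  gate: ExternalCommsGate,
  agentId: string,
  toolName: string,
  level: AutonomyLevel,
): Decision {
  const state = gate.unlocks[agentId]?.[toolName];
  if (state === "autonomous") return yellowRule(level); // unlocked: YELLOW rules apply
  return "gate"; // "gated" or not set: always gate, regardless of level
}
```

For example, with `{ unlocks: { marketing: { ghost_publish: "autonomous" } } }`, a Level 2 Marketing agent publishes without approval, while the same call for `listmonk_send` is still gated.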

6. AI Autonomy Levels

6.1 Level Definitions

| Level | Name | Default For | Auto-Execute | Requires Approval |
| --- | --- | --- | --- | --- |
| 1 | Training Wheels | New customers | GREEN only | YELLOW + RED + CRITICAL_RED |
| 2 | Trusted Assistant | Default | GREEN + YELLOW | RED + CRITICAL_RED |
| 3 | Full Autonomy | Power users | GREEN + YELLOW + RED | CRITICAL_RED only |

6.2 Per-Agent Override

Each agent can have its own autonomy level independent of the tenant default:

| Agent | Tenant Default (L2) | Agent Override | Effective |
| --- | --- | --- | --- |
| IT Admin | Level 2 | Level 3 | 3 (full autonomy for infrastructure) |
| Marketing | Level 2 | | 2 (default) |
| Secretary | Level 2 | Level 1 | 1 (extra cautious with communications) |
| Sales | Level 2 | | 2 (default) |
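
The override rule in the table reduces to "use the agent's override if one is set, otherwise the tenant default". A minimal sketch, with `effectiveLevel` as a hypothetical helper name:

```typescript
type AutonomyLevel = 1 | 2 | 3;

// An agent-level override, when present, replaces the tenant default.
function effectiveLevel(tenantDefault: AutonomyLevel, override?: AutonomyLevel): AutonomyLevel {
  return override ?? tenantDefault;
}

console.log(effectiveLevel(2, 3)); // 3 (IT Admin)
console.log(effectiveLevel(2));    // 2 (Marketing, Sales)
console.log(effectiveLevel(2, 1)); // 1 (Secretary)
```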

6.3 Transition Criteria

Moving between levels is manual — triggered by the customer in the mobile app or web portal, synced to the Safety Wrapper via Hub heartbeat. There is no automatic promotion. The customer builds trust at their own pace.

Invariants across ALL levels:

  • Secrets are always redacted (Layer 4)
  • Audit trail is always logged
  • External comms are gated by default until explicitly unlocked
  • CRITICAL_RED always requires approval
  • The AI never sees raw credentials

7. Data Flow Diagrams

7.1 Message Processing Flow

User (mobile app)
  │
  ▼
Hub (WebSocket relay)
  │
  ▼
OpenClaw Gateway (port 18789)
  │
  ├─► Dispatcher Agent (intent classification)
  │     │
  │     ▼
  │   Route to specialist agent (Marketing, IT, Secretary, Sales)
  │     │
  │     ▼
  │   Agent decides on tool call(s)
  │     │
  ▼     ▼
Safety Wrapper (port 8200)
  │
  ├─ 1. Classify command (GREEN/YELLOW/YELLOW_EXT/RED/CRITICAL_RED)
  ├─ 2. Check agent's effective autonomy level
  ├─ 3. Check external comms gate (if YELLOW_EXT)
  │
  ├─ IF ALLOWED:
  │   ├─ 4. Resolve SECRET_REFs from encrypted registry
  │   ├─ 5. Execute tool call (shell/Docker/API/browser)
  │   ├─ 6. Scrub secrets from response
  │   ├─ 7. Log to audit trail
  │   └─ 8. Return result to OpenClaw → Agent → User
  │
  └─ IF GATED:
      ├─ 4. Create approval request with human-readable description
      ├─ 5. POST to Hub /api/v1/tenant/approval-request
      ├─ 6. Hub pushes to mobile app via WebSocket
      ├─ 7. Mobile shows push notification: "[Approve] [Deny]"
      ├─ 8. User taps Approve → Hub relays to Safety Wrapper
      └─ 9. Safety Wrapper resumes execution from step 4 of ALLOWED path
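
The GATED branch implies that the Safety Wrapper parks a command, waits for a decision, and then re-enters the ALLOWED path. The sketch below illustrates that park-and-resume shape only. The class, its method names, and the in-memory `pending` store are assumptions; the real component would persist requests and call the Hub rather than return dictionaries.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SafetyWrapperSketch:
    # Stand-in for steps 4-6 of the ALLOWED path (resolve, execute, scrub).
    execute: Callable[[str], str]
    pending: Dict[str, str] = field(default_factory=dict)
    audit: List[Tuple[str, str]] = field(default_factory=list)

    def handle(self, command: str, gated: bool) -> dict:
        if not gated:
            return self._run(command)
        # GATED path: park the command until the user decides.
        approval_id = str(uuid.uuid4())
        self.pending[approval_id] = command
        self.audit.append(("approval_requested", command))
        return {"status": "pending", "approval_id": approval_id}

    def resolve(self, approval_id: str, approved: bool) -> dict:
        # pop() raises KeyError for unknown or already-resolved requests,
        # so an approval can never be replayed.
        command = self.pending.pop(approval_id)
        if not approved:
            self.audit.append(("denied", command))
            return {"status": "denied"}
        return self._run(command)  # resume at step 4 of the ALLOWED path

    def _run(self, command: str) -> dict:
        result = self.execute(command)
        self.audit.append(("executed", command))
        return {"status": "done", "result": result}
```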

7.2 Secrets Injection Flow

Agent decides to call NocoDB API
  │
  ▼
OpenClaw sends tool call to Safety Wrapper:
  exec("curl http://127.0.0.1:3037/api/v2/tables -H 'xc-token: SECRET_REF(nocodb_api_token)'")
  │
  ▼
Safety Wrapper intercepts:
  1. Classify: GREEN (read-only query) → auto-execute
  2. Resolve SECRET_REF: look up "nocodb_api_token" in encrypted SQLite
  3. Substitute: SECRET_REF(nocodb_api_token) → "xc_abc123def456..."
  4. Execute curl with real token
  │
  ▼
Tool responds:
  { "tables": [...] }   ← response may contain secrets in error messages
  │
  ▼
Safety Wrapper scrubs response:
  Run through mini redaction pipeline (registry match + regex)
  │
  ▼
Secrets Proxy intercepts agent's next LLM call:
  Full 4-layer redaction on all outbound text
  │
  ▼
LLM receives: clean data, no secrets
  Agent sees: [SECRET_REF:nocodb_api_token] (never the real value)
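
The two Safety Wrapper steps in this flow, substituting references before execution and scrubbing values afterwards, can be sketched as follows. This is illustrative only: a plain dictionary stands in for the encrypted SQLite registry, the function names are hypothetical, and only the registry-match part of the redaction pipeline is shown (no pattern-based layer).

```python
import re
from typing import Dict

SECRET_REF = re.compile(r"SECRET_REF\((\w+)\)")


def resolve_refs(command: str, registry: Dict[str, str]) -> str:
    """Replace each SECRET_REF(name) with its value; unknown names raise KeyError."""
    return SECRET_REF.sub(lambda m: registry[m.group(1)], command)


def scrub(text: str, registry: Dict[str, str]) -> str:
    """Replace any literal secret value in tool output with a placeholder."""
    # Longest values first, so a secret containing another one is fully masked.
    for name, value in sorted(registry.items(), key=lambda kv: -len(kv[1])):
        if value:
            text = text.replace(value, f"[SECRET_REF:{name}]")
    return text
```

Raising on an unknown reference, rather than passing the literal `SECRET_REF(...)` text to the shell, avoids executing a command with a placeholder where a credential was expected.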

7.3 Token Metering Flow

Every LLM call:
  Agent → OpenClaw → Secrets Proxy → OpenRouter → LLM Provider
                                                       │
  OpenRouter response includes:                        │
    usage: { input_tokens, output_tokens,               │
             cache_read_tokens, cache_write_tokens }    │
                                                       ▼
  Safety Wrapper captures (via response headers or proxy inspection):
    { agent_id, model, input_tokens, output_tokens,
      cached_tokens, timestamp, request_id }
                │
                ▼
  Local SQLite (token_usage table):
    INSERT per-call record
                │
                ▼
  Hourly aggregation job:
    GROUP BY agent_id, model, strftime('%Y-%m-%dT%H', timestamp)
    → TokenUsageBucket records
                │
                ▼
  Heartbeat (every 60s) or dedicated POST:
    Safety Wrapper → Hub /api/v1/tenant/usage
    Payload: array of unsent TokenUsageBucket records
                │
                ▼
  Hub processes:
    1. Store in PostgreSQL TokenUsageBucket table
    2. Update BillingPeriod.tokensUsed
    3. Check pool exhaustion → trigger overage if needed
    4. Report to Stripe Billing Meter (hourly batch)
                │
                ▼
  Stripe calculates overage on next invoice
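
The per-call insert and hourly aggregation steps can be tried against an in-memory SQLite database. The column set follows the captured fields listed in the flow; the table definition, the sample rows, and the ISO 8601 text timestamps are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE token_usage (
        request_id    TEXT PRIMARY KEY,
        agent_id      TEXT NOT NULL,
        model         TEXT NOT NULL,
        input_tokens  INTEGER NOT NULL,
        output_tokens INTEGER NOT NULL,
        cached_tokens INTEGER NOT NULL,
        timestamp     TEXT NOT NULL  -- ISO 8601, UTC (assumed format)
    )
""")

# Made-up sample rows, for illustration only.
rows = [
    ("r1", "marketing", "anthropic/claude-sonnet-4-6", 1200, 300, 0, "2026-02-27T09:05:00"),
    ("r2", "marketing", "anthropic/claude-sonnet-4-6", 800, 200, 600, "2026-02-27T09:40:00"),
    ("r3", "marketing", "anthropic/claude-sonnet-4-6", 500, 100, 0, "2026-02-27T10:02:00"),
]
conn.executemany("INSERT INTO token_usage VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# strftime truncates each timestamp to the start of its hour, so buckets
# from the same clock hour on different days stay distinct.
buckets = conn.execute("""
    SELECT agent_id, model,
           strftime('%Y-%m-%dT%H:00:00', timestamp) AS hour_start,
           SUM(input_tokens), SUM(output_tokens), SUM(cached_tokens), COUNT(*)
    FROM token_usage
    GROUP BY agent_id, model, hour_start
    ORDER BY hour_start
""").fetchall()
```

Each resulting tuple corresponds to one TokenUsageBucket record that would be queued for the next upload to the Hub.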

7.4 Provisioning Flow

1. Customer completes Stripe checkout on Website
2. Stripe webhook → Hub creates User + Subscription + Order (PAYMENT_CONFIRMED)
3. Automation state machine: PAYMENT_CONFIRMED → AWAITING_SERVER
4. Hub assigns Netcup server from pre-provisioned pool (EU or US region)
5. State: AWAITING_SERVER → SERVER_READY
6. Hub creates DNS records (A records for all tool subdomains)
7. State: SERVER_READY → DNS_PENDING → DNS_READY
8. Hub spawns Provisioner Docker container with job config
9. Provisioner:
   a. SSH into VPS (port 22022)
   b. Steps 1-8: system setup, Docker, nginx, firewall, SSH hardening
   c. Step 9: Deploy 28+ tool stacks via docker-compose
   d. Step 10: Deploy OpenClaw + Safety Wrapper + Secrets Proxy
      - Generate 50+ credentials via env_setup.sh
      - Generate Safety Wrapper config (secrets registry seed, agent configs)
      - Generate OpenClaw config (model routing, agent definitions, caching)
      - Start all three processes
      - Run Playwright initial-setup scenarios via OpenClaw browser
      - Generate SSL certs via Let's Encrypt
10. Safety Wrapper registers with Hub, receives API key
11. State: PROVISIONING → FULFILLED
12. Customer receives welcome email with dashboard URL + app download links
13. Heartbeat loop begins (Safety Wrapper → Hub, every 60 seconds)
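
The order states named in this flow form a linear state machine. A sketch of the transition table follows; the `DNS_READY → PROVISIONING` edge is an assumption (the flow shows the states on either side but not the transition itself), failure states are not modelled because the flow does not name any, and `advance` is a hypothetical helper.

```python
from enum import Enum


class OrderState(Enum):
    PAYMENT_CONFIRMED = "PAYMENT_CONFIRMED"
    AWAITING_SERVER = "AWAITING_SERVER"
    SERVER_READY = "SERVER_READY"
    DNS_PENDING = "DNS_PENDING"
    DNS_READY = "DNS_READY"
    PROVISIONING = "PROVISIONING"
    FULFILLED = "FULFILLED"


S = OrderState
TRANSITIONS = {
    S.PAYMENT_CONFIRMED: {S.AWAITING_SERVER},
    S.AWAITING_SERVER: {S.SERVER_READY},
    S.SERVER_READY: {S.DNS_PENDING},
    S.DNS_PENDING: {S.DNS_READY},
    S.DNS_READY: {S.PROVISIONING},  # assumed: not stated explicitly in the flow
    S.PROVISIONING: {S.FULFILLED},
    S.FULFILLED: set(),
}


def advance(current: OrderState, target: OrderState) -> OrderState:
    """Move to `target` only if the transition table allows it."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```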

8. Inter-Agent Communication

8.1 Dispatcher Hub Pattern

The Dispatcher is a first-class default agent — the user's primary point of contact. Every tenant gets one. It has three responsibilities:

  1. Intent routing: Classifies user messages and delegates to specialist agents
  2. Workflow decomposition: Breaks multi-domain requests into ordered steps across agents
  3. Morning briefing: Aggregates overnight activity from all agents into a unified summary

The Dispatcher has NO direct tool access (no shell, no docker, no file operations). It works exclusively through agent-to-agent delegation. This keeps it lightweight and prevents scope creep.

8.2 Agent-to-Agent Communication

Agents communicate through OpenClaw's native agentToAgent tool, which is enabled for all agents:

{
  "tools": {
    "agentToAgent": {
      "enabled": true,
      "allow": ["dispatcher", "it-admin", "marketing", "secretary", "sales"]
    }
  }
}

Communication patterns:

  • Dispatcher → Specialist: "Handle this user request" (primary pattern)
  • Specialist → Specialist: "What's the current Ghost version?" (peer queries)
  • Specialist → Dispatcher: "Task complete, here's the result" (reporting)

Safety controls:

  • Maximum dispatch depth: 5 levels (prevents A→B→A→B→... loops)
  • Rate limiting: max inter-agent dispatches per minute per agent
  • Full audit trail: every dispatch logged with source, target, task, result
  • User visibility: all agent activity visible in mobile app's Activity feed
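
The depth and rate limits can be enforced by a small guard that is consulted before every dispatch. This is a sketch, not OpenClaw's mechanism: `DispatchGuard` is a hypothetical class, and the per-minute figure is a placeholder because no number is given here.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_DISPATCH_DEPTH = 5          # from the safety controls above
MAX_DISPATCHES_PER_MINUTE = 20  # hypothetical value; no figure is specified


class DispatchGuard:
    def __init__(self) -> None:
        # agent id -> timestamps of that agent's recent dispatches
        self._recent = defaultdict(deque)

    def check(self, source: str, depth: int, now: Optional[float] = None) -> None:
        """Raise RuntimeError if this dispatch exceeds the depth or rate limit."""
        now = time.monotonic() if now is None else now
        if depth > MAX_DISPATCH_DEPTH:
            raise RuntimeError(f"dispatch depth {depth} exceeds {MAX_DISPATCH_DEPTH}")
        window = self._recent[source]
        while window and now - window[0] >= 60.0:
            window.popleft()  # drop entries older than one minute
        if len(window) >= MAX_DISPATCHES_PER_MINUTE:
            raise RuntimeError(f"{source} exceeded {MAX_DISPATCHES_PER_MINUTE} dispatches per minute")
        window.append(now)
```

Carrying the depth counter in the dispatch itself (incremented on each hop) is what stops an A→B→A→B chain, independent of how quickly it runs.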

8.3 Shared Memory

Each agent has its own workspace, but all agents get extraPaths pointing to /opt/letsbe/shared-memory/. When one agent writes to the shared directory, others discover it via memory_search. This enables cross-agent knowledge sharing without breaking workspace isolation.


9. Memory Architecture

9.1 OpenClaw Native Memory

| Layer | Location | Purpose | Loaded When |
|---|---|---|---|
| Daily logs | memory/YYYY-MM-DD.md | Session context | Today + yesterday |
| Long-term | MEMORY.md | Curated durable knowledge | Private sessions |
| Transcripts | Session JSONL | Full conversation recall | Via memory_search |

Hybrid retrieval combining:

  • Vector search (cosine similarity via sqlite-vec): Semantic matching
  • BM25 keyword search (SQLite FTS5): Exact token matching
  • MMR re-ranking (lambda 0.7): Balances relevance with diversity
  • Temporal decay (30-day half-life): Boosts recent memories
  • Local embeddings (ggml-org/embeddinggemma-300m-qat-q8_0-GGUF, ~0.6GB)
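
Two of the retrieval stages listed above have simple closed forms: temporal decay with a 30-day half-life, and MMR re-ranking with lambda 0.7. The sketch below shows both in isolation. How OpenClaw fuses vector and BM25 scores, and where in the pipeline decay is applied, is not described here, so `relevance` is assumed to be an already-combined score and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decayed(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve a relevance score for every half-life that has elapsed."""
    return score * 0.5 ** (age_days / half_life_days)


def mmr(vectors: Dict[str, Sequence[float]], relevance: Dict[str, float],
        k: int, lam: float = 0.7) -> List[str]:
    """Greedy Maximal Marginal Relevance: trade relevance against redundancy."""
    selected: List[str] = []
    remaining = sorted(vectors)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < k:
        def mmr_score(cid: str) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max((cosine(vectors[cid], vectors[s]) for s in selected), default=0.0)
            return lam * relevance[cid] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lambda at 0.7, a near-duplicate of an already selected memory needs a substantially higher relevance score than an unrelated memory to be picked next.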

9.3 Token Efficiency Strategy

| Strategy | Impact |
|---|---|
| Tool registry (structured JSON, ~2.5K tokens) vs. verbose skills | ~80% reduction in tool context |
| On-demand cheat sheets vs. always-loaded skills | Only pay for tools used in session |
| Compact SOUL.md (~600-800 tokens per agent) | ~50% reduction in identity context |
| cacheRetention: "long" (1 hour) | 80-99% cheaper on repeated SOUL.md calls |
| Context pruning (cache-ttl, 1h default) | Auto-removes stale tool outputs |
| Session compaction | Keeps long conversations from blowing up costs |

Base context cost per agent: master skill (~700 tokens) + tool registry (~2,500 tokens) = ~3,200 tokens — regardless of how many tools are installed. Compare to 30 individual skills at ~750 tokens each = ~22,500 tokens always in context.


10. Network Security

10.1 Firewall Rules

# UFW configuration (set during provisioning step 5)
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp      # HTTP (nginx → redirect to HTTPS)
ufw allow 443/tcp     # HTTPS (nginx → tool web UIs + Hub API)
ufw allow 22022/tcp   # SSH (hardened port, key-only auth)
ufw --force enable    # --force skips the interactive confirmation prompt during unattended provisioning

NOT exposed:

  • Port 18789 (OpenClaw) — loopback only
  • Port 8200 (Safety Wrapper) — loopback only
  • Port 8100 (Secrets Proxy) — loopback only
  • Ports 3001-3099 (tool containers) — loopback only, accessed via nginx

10.2 TLS

  • All tool web UIs served via nginx with Let's Encrypt certificates
  • Auto-renewal via certbot cron
  • Strict Transport Security headers
  • OCSP stapling enabled

10.3 Inter-Process Authentication

| From → To | Auth Method |
|---|---|
| OpenClaw → Safety Wrapper | Shared secret token (generated at provisioning) |
| Safety Wrapper → Secrets Proxy | Unix socket (no network, filesystem permissions) |
| Safety Wrapper → Hub | Bearer token (Hub API key, received at registration) |
| Hub → Safety Wrapper | Registration token → Hub API key exchange |
| Mobile → Hub | JWT (NextAuth session) |
| Hub → Tenant via nginx | Not needed: the Safety Wrapper initiates all Hub communication |
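
For the shared-secret row, the check on the Safety Wrapper side can be as small as a constant-time comparison. How the token is actually transported is not specified here, so the `Authorization: Bearer` header and both function names below are assumptions.

```python
import hmac
import secrets
from typing import Optional


def generate_token() -> str:
    """Create a random token at provisioning time (32 bytes of entropy)."""
    return secrets.token_urlsafe(32)


def is_authorized(header: Optional[str], expected: str) -> bool:
    """Validate an assumed 'Authorization: Bearer <token>' header value."""
    prefix = "Bearer "
    if not header or not header.startswith(prefix):
        return False
    presented = header[len(prefix):]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode(), expected.encode())
```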

10.4 SSRF Protection

OpenClaw's browser tool has configurable URL allowlists. LetsBe restricts browser navigation to:

  • 127.0.0.1:* (localhost tool UIs)
  • Tool-specific external URLs (if configured)
  • Blocks: metadata endpoints (169.254.169.254), internal networks, file:// URIs
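
The rules above amount to an allowlist, which blocks metadata endpoints and internal networks by default rather than by enumeration. A sketch of such a check follows. It is not OpenClaw's allowlist implementation; `is_navigation_allowed` and `extra_hosts` are hypothetical, and the check does not cover a hostname that later resolves to an internal address.

```python
import ipaddress
from typing import FrozenSet
from urllib.parse import urlsplit

LOOPBACK = ipaddress.ip_address("127.0.0.1")


def is_navigation_allowed(url: str, extra_hosts: FrozenSet[str] = frozenset()) -> bool:
    """Allow 127.0.0.1 on any port plus explicitly configured external hosts."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # rejects file:// and every other scheme
    host = parts.hostname
    if host is None:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, treat as a hostname
    if ip is not None:
        # IP literals are only allowed for loopback; this rejects
        # 169.254.169.254 and private ranges such as 10.0.0.0/8.
        return ip == LOOPBACK
    return host in extra_hosts
```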

11. Scalability & Performance

11.1 Horizontal Scaling

Each tenant is an independent VPS — horizontal scaling means adding more VPS instances. No shared state between tenants. The Hub handles N tenants, scaling its own PostgreSQL and server capacity as needed.

11.2 Vertical Scaling

Tier upgrades: Lite → Build → Scale → Enterprise. The provisioner can migrate tool stacks to a larger VPS. OpenClaw and Safety Wrapper configs don't change — only resource limits increase.

11.3 Performance Targets

| Metric | Target | Measured At |
|---|---|---|
| Secrets redaction latency | <10ms per LLM call | Secrets Proxy |
| Command classification latency | <5ms per tool call | Safety Wrapper |
| Approval round-trip (auto-execute) | <50ms | Safety Wrapper |
| Approval round-trip (with mobile) | <30 seconds typical | Safety Wrapper → Hub → Mobile → Hub → SW |
| Agent response time | 2-15 seconds (model-dependent) | End-to-end |
| Heartbeat interval | 60 seconds | Safety Wrapper → Hub |
| Config sync latency | <60 seconds (next heartbeat) | Hub → Safety Wrapper |

12. Disaster Recovery & Backup

12.1 Application-Level Backups (Existing)

The Provisioner deploys backups.sh (~473 lines):

  • 18 PostgreSQL databases + 2 MySQL + 1 MongoDB
  • Daily 2:00 AM cron job
  • Rotation: 7 daily local + 4 weekly remote (via rclone)
  • Output: backup-status.json with per-database status

12.2 Backup Monitoring (NEW)

OpenClaw cron job at 6:00 AM reads backup-status.json:

  • Was backup updated today?
  • All databases listed?
  • Any failures?
  • Reports to Hub via Safety Wrapper's /tenant/backup-status endpoint
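
The three checks above can be sketched as a single function. The layout of backup-status.json is not shown in this document, so the `last_run` and `databases` keys, the `"ok"` status value, and the function name are all assumptions made for illustration.

```python
from datetime import date, datetime
from typing import List, Set


def check_backup_status(status: dict, expected_dbs: Set[str], today: date) -> List[str]:
    """Return a list of problems; an empty list means the backup looks healthy."""
    problems: List[str] = []
    # Assumed schema: {"last_run": ISO timestamp, "databases": {name: status}}
    last_run = datetime.fromisoformat(status["last_run"]).date()
    if last_run != today:
        problems.append(f"stale backup: last run {last_run.isoformat()}")
    reported = status.get("databases", {})
    for missing in sorted(expected_dbs - set(reported)):
        problems.append(f"missing database: {missing}")
    for name, result in sorted(reported.items()):
        if result != "ok":
            problems.append(f"failed: {name} ({result})")
    return problems
```

Comparing against an expected database list catches the case where a dump silently stops being attempted, which a failure count alone would miss.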

12.3 VPS Snapshots

Daily Netcup VPS snapshots via SCP API:

  • Triggered by Hub cron job
  • 3 snapshots retained (rolling)
  • Staggered across tenants to avoid API rate limits
  • Free to create and store

12.4 Recovery Procedures

| Scenario | Recovery |
|---|---|
| Single tool database corruption | Restore from application-level dump |
| OpenClaw/Safety Wrapper state loss | Restore from VPS snapshot |
| Full VPS failure | Restore from snapshot to new VPS, re-provision |
| Hub database loss | Separate Hub backup strategy (not tenant concern) |

13. Error Handling & Resilience

13.1 Severity-Based Alerting

| Severity | Examples | Auto-Recovery | Alert |
|---|---|---|---|
| Soft | OpenClaw crash, Secrets Proxy restart, tool adapter timeout | Auto-restart immediately | Push notification after 3 failures in 1 hour |
| Medium | Tool API unreachable, OpenRouter timeout, Hub communication failure | Retry with backoff (30s → 1m → 5m) | Push notification after 3 consecutive failures |
| Hard | Auth token rejected, secrets registry corrupted, disk full, SSL expired | Stop affected component, do NOT auto-restart | Immediate push to customer + Hub alert to staff |
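
The Medium-severity policy (retry with 30s → 1m → 5m backoff, notify after 3 consecutive failures) can be sketched as a retry loop. The attempt cap, the `notify` callback, and the function name are assumptions; what happens after the 5-minute step is not stated, so this sketch simply keeps using the last delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

BACKOFF_SECONDS = [30, 60, 300]       # 30s → 1m → 5m
ALERT_AFTER_CONSECUTIVE_FAILURES = 3


def retry_with_backoff(operation: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep,
                       notify: Callable[[str], None] = print,
                       max_attempts: int = 6) -> T:
    """Retry `operation`, alerting once when the failure threshold is reached."""
    failures = 0
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            failures += 1
            if failures == ALERT_AFTER_CONSECUTIVE_FAILURES:
                notify(f"{failures} consecutive failures: {exc}")
            if attempt == max_attempts - 1:
                raise  # hypothetical cap reached; surface the last error
            # Stay on the longest delay once the schedule is exhausted.
            sleep(BACKOFF_SECONDS[min(failures - 1, len(BACKOFF_SECONDS) - 1)])
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `sleep` and `notify` keeps the policy testable without real waiting or real push notifications.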

13.2 Model Failover

OpenClaw native failover chains:

{
  "model": {
    "primary": "anthropic/claude-sonnet-4-6",
    "fallbacks": ["anthropic/claude-haiku-4-5", "google/gemini-2.0-flash"]
  }
}

Auth profile rotation happens before model fallback: if the primary model fails because of an API key issue, OpenClaw rotates through its auth profiles first and only then falls back to a different model.
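
The ordering described here, auth profiles first and models second, corresponds to a nested loop. The sketch below illustrates that ordering only and is not OpenClaw's implementation; the exception types, the `call` signature, and the profile names used in any example are hypothetical.

```python
from typing import Callable, Sequence


class AuthError(Exception):
    """The provider rejected the credentials of the current auth profile."""


class ModelUnavailable(Exception):
    """The model itself failed (timeout, overload, outage)."""


def complete_with_failover(prompt: str,
                           models: Sequence[str],
                           auth_profiles: Sequence[str],
                           call: Callable[[str, str, str], str]) -> str:
    """Try every auth profile for a model before moving to the next model."""
    for model in models:
        for profile in auth_profiles:
            try:
                return call(model, profile, prompt)
            except AuthError:
                continue  # rotate to the next auth profile, same model
            except ModelUnavailable:
                break     # other profiles will not help; try the next model
    raise RuntimeError("all models and auth profiles exhausted")
```

Called with `models` set to the primary followed by the configured fallbacks, this reproduces the chain from the configuration above.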

13.3 Graceful Degradation

| Component Down | User Experience |
|---|---|
| Single tool | Agent says "I can't reach X right now. I'll try again shortly." |
| Secrets Proxy | Agents pause (can't make LLM calls). Resume on restart (~2-5s). |
| Safety Wrapper | Tool calls blocked. Agents can still respond from cached context. Resume on restart. |
| OpenClaw | All agents offline. Auto-restart. User sees "Your AI team is restarting." |
| Hub | Agents continue locally (cached config). Heartbeats queue. Approvals delayed. |
| OpenRouter | Model failover chain. If all fail, agent reports temporary issue. |
| Mobile app | Customer portal (web) available as fallback. |

End of System Architecture Document