LetsBeBiz-Redesign/docs/architecture-proposal/claude/01-SYSTEM-ARCHITECTURE.md

# LetsBe Biz — System Architecture

**Date:** February 27, 2026
**Team:** Claude Opus 4.6 Architecture Team
**Document:** 01 of 09
**Status:** Proposal — Competing with independent team

---

## Table of Contents

1. [Architecture Philosophy](#1-architecture-philosophy)
2. [High-Level System Overview](#2-high-level-system-overview)
3. [Tenant Server Architecture](#3-tenant-server-architecture)
4. [Central Platform Architecture](#4-central-platform-architecture)
5. [Four-Layer Security Model](#5-four-layer-security-model)
6. [AI Autonomy Levels](#6-ai-autonomy-levels)
7. [Data Flow Diagrams](#7-data-flow-diagrams)
8. [Inter-Agent Communication](#8-inter-agent-communication)
9. [Memory Architecture](#9-memory-architecture)
10. [Network Security](#10-network-security)
11. [Scalability & Performance](#11-scalability--performance)
12. [Disaster Recovery & Backup](#12-disaster-recovery--backup)
13. [Error Handling & Resilience](#13-error-handling--resilience)

---

## 1. Architecture Philosophy

### 1.1 Non-Negotiable Principles

**Principle 1 — Secrets Never Leave the Server**

All credential redaction happens locally on the tenant VPS before any data reaches an LLM provider. This is enforced at the transport layer through a dedicated Secrets Proxy process — not by trusting the AI to behave, not by configuration, not by policy. The enforcement point is a separate process that sits between OpenClaw and the internet. Traffic that hasn't passed through the Secrets Proxy physically cannot reach an LLM. This is the single most important architectural invariant.

**Principle 2 — Per-Tenant Physical Isolation**

One customer = one VPS. No multi-tenancy, no shared containers, no shared databases. Each tenant's data, credentials, agent state, and conversation history live on that tenant's own VPS. This is permanent for v1. It eliminates entire categories of security vulnerabilities (cross-tenant data leaks, noisy-neighbor performance issues, shared-secret compromise) at the cost of higher per-customer infrastructure spend.

**Principle 3 — Defense in Depth (Four Independent Security Layers)**

Security is not one wall — it's four independent layers, each enforced by different mechanisms, each unable to expand access granted by layers above. A failure in any single layer does not compromise the system because the remaining three layers still enforce their restrictions independently:

| Layer | Mechanism | Enforced By | Bypassable By AI? |
|-------|-----------|-------------|-------------------|
| 1. Sandbox | Container isolation | Docker / OS kernel | No |
| 2. Tool Policy | Per-agent allow/deny arrays | OpenClaw config (loaded at startup) | No |
| 3. Command Gating | 5-tier classification + autonomy levels | Safety Wrapper (separate process) | No |
| 4. Secrets Redaction | 4-layer redaction pipeline | Secrets Proxy (separate process) | No |

**Principle 4 — OpenClaw Stays Vanilla**

OpenClaw is treated as an upstream dependency, never a fork. All LetsBe-specific logic (secrets redaction, command gating, Hub communication, tool adapters, billing metering) lives in a Safety Wrapper process that runs alongside OpenClaw. This means:
- Upstream security patches apply cleanly
- New OpenClaw features are available without merge conflicts
- Our competitive IP is cleanly separated from the upstream codebase
- OpenClaw is pinned to a tested release tag and upgraded monthly after staging verification

**Principle 5 — Graceful Degradation**

Every component has a failure mode that preserves the user's experience:
- Hub goes down → agents continue working from cached config; approvals queue locally
- OpenRouter goes down → model failover chains try alternatives; agents pause gracefully
- Single tool goes down → agent reports it, other tools continue
- Safety Wrapper restarts → agents pause briefly (~2-5s), auto-resume
- Secrets Proxy restarts → LLM calls fail temporarily, auto-resume

### 1.2 Key Divergence from Technical Architecture v1.2

The Technical Architecture v1.2 proposes the Safety Wrapper as an **in-process OpenClaw extension** running inside the Gateway process, with only a thin Secrets Proxy as a separate process. After deep research into OpenClaw's plugin system, we propose a fundamentally different approach.

**Our proposal: Safety Wrapper as a SEPARATE process (localhost:8200)**

Three findings drive this decision:

1. **Hook Gap (GitHub Discussion #20575):** OpenClaw's `before_tool_call` and `after_tool_call` hooks are NOT bridged to external plugins. The internal hook system fires events via `emitEvent()` but never calls `triggerInternalHook()` for external plugin consumers. This means an in-process extension CANNOT reliably intercept tool calls — the exact mechanism the v1.2 architecture depends on for command classification and secrets injection.

2. **CVE-2026-25253 (CVSS 8.8):** Cross-site WebSocket hijacking vulnerability in OpenClaw, patched 2026-01-29. An in-process extension shares the vulnerability surface with the host process. A separate process has an independent attack surface — compromising OpenClaw doesn't automatically compromise the Safety Wrapper.

3. **Synchronous hook limitation:** `tool_result_persist` hook is synchronous — it cannot return Promises. This limits what an in-process extension can do for async operations like Hub API calls, approval requests, and token reporting.

**Impact on architecture:**
- Safety Wrapper runs as a separate Node.js process on `localhost:8200`
- OpenClaw is configured to route tool calls through the Safety Wrapper's HTTP API
- Secrets Proxy remains as a separate thin process on `localhost:8100`
- Total: 3 LetsBe processes (OpenClaw + Safety Wrapper + Secrets Proxy) + nginx + tool containers
- RAM overhead increases by ~64MB (from ~576MB to ~640MB) — acceptable on all tiers

### 1.3 Why These Principles Matter for the Business

Privacy-first architecture is the competitive moat. SMBs increasingly distrust cloud-only AI solutions, given recurring reports of training data leaks, terms-of-service changes, and API key compromises. LetsBe's "secrets never leave your server" guarantee is verifiable (the Secrets Proxy is inspectable) and defensible (transport-layer enforcement can't be bypassed by prompt injection). This positions LetsBe uniquely against competitors who run AI in multi-tenant cloud environments.

---

## 2. High-Level System Overview

### 2.1 Two-Domain Architecture

The platform operates across two distinct trust domains connected by HTTPS:

```
┌─────────────────────────────────────────────────────────────────────┐
│                        CENTRAL PLATFORM                             │
│                    (LetsBe infrastructure)                           │
│                                                                     │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐    │
│  │     Hub      │   │  Provisioner │   │      Website         │    │
│  │  (Next.js)   │   │  (Bash/SSH)  │   │    (Next.js SSG)     │    │
│  │              │   │              │   │                      │    │
│  │ Admin Portal │   │ 10-step VPS  │   │ Marketing + AI       │    │
│  │ Customer API │   │ setup via    │   │ onboarding chat +    │    │
│  │ Billing      │   │ Docker       │   │ Stripe checkout      │    │
│  │ Tenant Comms │   │              │   │                      │    │
│  └──────┬───────┘   └──────┬───────┘   └──────────────────────┘    │
│         │                  │                                        │
│         │   PostgreSQL     │                                        │
│         └──────┬───────────┘                                        │
│                │                                                    │
└────────────────┼────────────────────────────────────────────────────┘
                 │
                 │  HTTPS (heartbeat, config sync, approvals, usage)
                 │  SSH (provisioning only — one-shot, no persistent connection)
                 │
┌────────────────┼────────────────────────────────────────────────────┐
│                │           TENANT SERVER                            │
│                │      (Customer's isolated VPS)                     │
│                │                                                    │
│  ┌─────────────▼──────────┐                                        │
│  │    Safety Wrapper       │◄────── Hub API Key auth               │
│  │    (localhost:8200)     │                                        │
│  │                         │                                        │
│  │  Command Classification │        ┌──────────────────┐           │
│  │  Secrets Registry (SQLite)│      │  Secrets Proxy   │           │
│  │  Tool Execution Proxy   │───────►│  (localhost:8100) │           │
│  │  Hub Communication      │        │                  │           │
│  │  Token Metering         │        │  4-layer redact  │──► LLM   │
│  │  Audit Logger           │        │  <10ms overhead  │  (OpenRouter)
│  └────────────┬────────────┘        └──────────────────┘           │
│               │                                                     │
│  ┌────────────▼────────────┐                                        │
│  │      OpenClaw           │                                        │
│  │   (Gateway:18789)       │                                        │
│  │                         │                                        │
│  │  Agent Runtime          │     ┌──────────────────────────────┐  │
│  │  Session Management     │     │     Tool Stacks (Docker)     │  │
│  │  Prompt Caching         │     │                              │  │
│  │  Browser (Playwright)   │     │  Ghost    Cal.com   Nextcloud│  │
│  │  Channels (WA/TG)      │     │  Chatwoot Odoo      NocoDB   │  │
│  │  Cron / Webhooks        │     │  Listmonk Umami    Keycloak  │  │
│  └─────────────────────────┘     │  ... 20+ more containers    │  │
│                                   └──────────────────────────────┘  │
│  ┌─────────────────────────┐                                        │
│  │   nginx (80/443)        │  Only external-facing process          │
│  └─────────────────────────┘                                        │
└─────────────────────────────────────────────────────────────────────┘
```

### 2.2 Trust Boundaries

```
                    UNTRUSTED                │           TRUSTED (on-VPS)
                                             │
    External LLM Providers ◄─────────────────┤◄── Secrets Proxy (redacts ALL secrets)
    (via OpenRouter:                          │         ▲
     Anthropic, Google,                       │         │ outbound LLM traffic only
     DeepSeek, OpenAI, etc.)                  │         │
                                             │    Safety Wrapper (classifies commands)
    Internet Users ─────────► nginx ──────►  │         │
                              (TLS)          │         ▼
                                             │    OpenClaw (agent runtime)
    Mobile App ◄─────► Hub ◄────────────────►│         │
    (WebSocket)        (relay)               │         ▼
                                             │    Tool Containers
    Messaging Channels ◄────────────────────►│    (Ghost, Nextcloud, Cal.com, etc.)
    (WhatsApp, Telegram)                      │
```

**Key boundaries:**
- LLMs are UNTRUSTED — all outbound traffic is sanitized by Secrets Proxy
- The Internet is UNTRUSTED — only nginx port 80/443 and SSH 22022 are exposed
- Hub communication is AUTHENTICATED — Bearer token over HTTPS
- Inter-process communication is LOCAL — localhost only, no network exposure

### 2.3 Network Boundary

- **Central → Tenant:** SSH (provisioning, one-shot), HTTPS (API calls to Safety Wrapper if needed)
- **Tenant → Central:** HTTPS (heartbeat, config sync, approval requests, usage reporting)
- **Tenant → Internet:** Only through Secrets Proxy (LLM calls) and nginx (tool web UIs)
- **No persistent connections:** Heartbeat is periodic HTTP POST, not WebSocket

---

## 3. Tenant Server Architecture

### 3.1 Process Map

Every tenant VPS runs the following processes:

| Process | Port | Protocol | RAM Budget | Restartable | Purpose |
|---------|------|----------|------------|-------------|---------|
| **OpenClaw Gateway** | 18789 | HTTP+WS | ~384MB (includes Chromium ~200MB) | Yes (Docker restart) | AI agent runtime, session management, browser tool |
| **Safety Wrapper** | 8200 | HTTP | ~128MB | Yes (Docker restart) | Command gating, secrets registry, Hub comms, metering |
| **Secrets Proxy** | 8100 | HTTP | ~64MB | Yes (Docker restart) | Outbound LLM traffic redaction (4-layer pipeline) |
| **nginx** | 80, 443 | HTTP/S | ~32MB | Yes (systemd) | Reverse proxy, TLS termination, tool routing |
| **Tool containers** | 3001-3099 | Various | ~128-512MB each | Yes (Docker restart) | Ghost, Nextcloud, Cal.com, etc. (28+) |
| **Monitoring** | — | — | ~32MB | Yes | Netdata or lightweight metrics agent |

**Total LetsBe overhead: ~640MB** (OpenClaw 384MB + Safety Wrapper 128MB + Secrets Proxy 64MB + nginx 32MB + monitoring 32MB)

### 3.2 Memory Budget per Tier

| Tier | Total RAM | LetsBe Overhead | Available for Tools | Max Practical Tools | Chromium? |
|------|-----------|-----------------|--------------------|--------------------|-----------|
| Lite (8GB) | 8,192MB | 640MB | ~7,552MB | 8-12 (constrained) | Yes, but consider browser-less mode |
| Build (16GB) | 16,384MB | 640MB | ~15,744MB | 15-20 (comfortable) | Yes |
| Scale (32GB) | 32,768MB | 640MB | ~32,128MB | 25-30 (full stack) | Yes |
| Enterprise (64GB) | 65,536MB | 640MB | ~64,896MB | 30+ with headroom | Yes |

**Lite tier note:** With ~7.5GB for tools, the Lite tier is tight. Each tool averages 256-512MB. A Freelancer bundle (7 tools) at ~2.5GB fits comfortably. The Lite tier is hidden at launch until real-world memory profiling confirms it's viable. If browser-less mode is needed (saves ~200MB from Chromium), OpenClaw supports running without the browser tool.
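
The arithmetic behind the tier table can be checked with a few lines of TypeScript. This is a minimal sketch using the figures from this section; the function and constant names are illustrative and not part of any existing LetsBe codebase.

```typescript
// Figures come from the tier table in this section; all names are illustrative.
const LETSBE_OVERHEAD_MB = 640;
const CHROMIUM_MB = 200; // approximate saving in browser-less mode

type Tier = "lite" | "build" | "scale" | "enterprise";

const TIER_RAM_MB: Record<Tier, number> = {
  lite: 8192,
  build: 16384,
  scale: 32768,
  enterprise: 65536,
};

/** RAM left for tool containers once the fixed LetsBe overhead is subtracted. */
function availableForToolsMb(tier: Tier, browserless: boolean = false): number {
  const overhead = browserless ? LETSBE_OVERHEAD_MB - CHROMIUM_MB : LETSBE_OVERHEAD_MB;
  return TIER_RAM_MB[tier] - overhead;
}

/** True if `toolCount` tools at `avgToolMb` each fit in the tier's tool budget. */
function bundleFits(tier: Tier, toolCount: number, avgToolMb: number): boolean {
  return toolCount * avgToolMb <= availableForToolsMb(tier);
}
```

For example, `availableForToolsMb("lite")` evaluates to 7552, and a 7-tool bundle fits even at the 512MB upper average (7 × 512 = 3,584MB). This check ignores OS and Docker daemon overhead, which real-world profiling would need to account for.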

### 3.3 OpenClaw Configuration

OpenClaw (v2026.2.6-3) is configured via `~/.openclaw/openclaw.json` (JSON5 format with environment variable substitution).

**Critical configuration decisions:**

```json5
{
  // Route ALL LLM calls through the Secrets Proxy (localhost:8100) → OpenRouter
  "model": {
    "primary": "${SW_PROXY_MODEL}",  // e.g., "anthropic/claude-sonnet-4-6"
    "apiUrl": "http://localhost:8100/v1",  // Secrets Proxy intercepts
    "apiKey": "${OPENROUTER_API_KEY_ENCRYPTED}",  // Resolved by Secrets Proxy
    "fallbacks": ["${SW_FALLBACK_1}", "${SW_FALLBACK_2}"],
    "contextTokens": 200000
  },

  // Prompt caching — massive cost saver
  "cacheRetention": "long",          // 1 hour (SOUL.md cached 80-99% cheaper)
  "heartbeat": { "every": "55m" },   // Keep-warm to prevent cache eviction

  // Security hardening
  "security": {
    "elevated": { "enable": false },  // DISABLED — Safety Wrapper handles all elevation
    "rateLimit": {
      "maxAttempts": 10,
      "windowSeconds": 60,
      "lockoutSeconds": 300,
      "exemptLoopback": true
    }
  },

  // Tool safety
  "tools": {
    "loopDetection": { "enabled": true },  // Prevent runaway tool calls
    "exec": {
      "security": "allowlist",  // Only allowlisted binaries
      "timeout": 1800
    }
  },

  // Logging with redaction
  "logging": {
    "level": "info",
    "redactSensitive": "tools"  // Extra protection — redact tool output in logs
  },

  // Agent definitions
  "agents": {
    "list": [
      // Dispatcher, IT Admin, Marketing, Secretary, Sales
      // (see Section 8 for full configurations)
    ]
  },

  // Channel support (configured per-tenant)
  "channels": {
    "whatsapp": { "enabled": "${WHATSAPP_ENABLED}" },
    "telegram": { "enabled": "${TELEGRAM_ENABLED}" }
  }
}
```
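
The `${...}` placeholders in the config above are resolved from the environment. As a rough illustration of how that substitution could be validated on the provisioner side, here is a minimal sketch. The helper `substituteEnv` is hypothetical, and the choice to fail hard on unset variables is an assumption; OpenClaw's own resolver may behave differently.

```typescript
/**
 * Replace ${NAME} placeholders with values from `env`.
 * Throws on unset variables so a misprovisioned tenant fails at startup
 * instead of running with an empty model name or API key.
 * (Hypothetical helper, not OpenClaw's actual resolver.)
 */
function substituteEnv(text: string, env: Record<string, string | undefined>): string {
  return text.replace(/\$\{([A-Z0-9_]+)\}/g, (_match: string, name: string): string => {
    const value = env[name];
    if (value === undefined) {
      throw new Error("Missing environment variable: " + name);
    }
    return value;
  });
}
```

Running the generated config through a check like this before starting OpenClaw would surface a missing `SW_PROXY_MODEL` or `SW_FALLBACK_1` during provisioning instead of at the first LLM call.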

### 3.4 Safety Wrapper Architecture (localhost:8200)

The Safety Wrapper is the core IP — where all LetsBe-specific logic lives.

```
┌────────────────────────────────────────────────────────────────┐
│                     SAFETY WRAPPER (localhost:8200)              │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Command          │  │ Secrets          │  │ Token        │  │
│  │ Classification   │  │ Registry         │  │ Metering     │  │
│  │ Engine           │  │ (Encrypted       │  │ Engine       │  │
│  │                  │  │  SQLite)         │  │              │  │
│  │ 5-tier classify  │  │ ChaCha20-Poly1305│  │ Per-agent    │  │
│  │ Autonomy gating  │  │ via sqleet       │  │ per-model    │  │
│  │ Ext. comms gate  │  │ WAL mode         │  │ hourly agg   │  │
│  └────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘  │
│           │                     │                    │           │
│  ┌────────▼─────────────────────▼────────────────────▼────────┐ │
│  │              Tool Execution Proxy                           │ │
│  │                                                             │ │
│  │  Intercepts ALL tool calls from OpenClaw                    │ │
│  │  1. Classify command (green/yellow/yellow_ext/red/crit_red) │ │
│  │  2. Check autonomy level + external comms gate              │ │
│  │  3. If gated → push approval to Hub, wait for response      │ │
│  │  4. If allowed → resolve SECRET_REFs from registry          │ │
│  │  5. Execute tool call (shell, Docker, API, browser)         │ │
│  │  6. Scrub secrets from response                             │ │
│  │  7. Log to audit trail                                      │ │
│  │  8. Report token usage to metering engine                   │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │ Hub              │  │ Audit            │  │ Config       │  │
│  │ Communication    │  │ Logger           │  │ Manager      │  │
│  │ Client           │  │                  │  │              │  │
│  │                  │  │ Append-only      │  │ Hot-reload   │  │
│  │ Registration     │  │ SQLite           │  │ autonomy lvl │  │
│  │ Heartbeat (60s)  │  │ Every tool call  │  │ ext comms    │  │
│  │ Config sync      │  │ Every approval   │  │ agent config │  │
│  │ Approval routing │  │ Every secret use │  │              │  │
│  │ Usage reporting  │  │                  │  │              │  │
│  └──────────────────┘  └──────────────────┘  └──────────────┘  │
└────────────────────────────────────────────────────────────────┘
```

**Technology stack:**
- Node.js 22+ (same runtime as OpenClaw — one ecosystem)
- TypeScript (strict mode)
- No web framework (raw `node:http` for minimal overhead and attack surface)
- `better-sqlite3-multiple-ciphers` for encrypted SQLite (secrets registry + audit log + usage buckets)
- Key derivation: scrypt from provisioner-generated seed
- Cipher: ChaCha20-Poly1305 via sqleet (modern AEAD, ~2x faster than AES-256-CBC on ARM)
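
The key-derivation step can be sketched with Node's built-in `crypto.scryptSync`. The salt source and the cost parameters below are illustrative assumptions, not settled design; the function name `deriveRegistryKey` is hypothetical.

```typescript
import { scryptSync } from "node:crypto";

const KEY_LENGTH_BYTES = 32; // 256-bit key, the size ChaCha20-Poly1305 uses

/**
 * Derive the registry encryption key from the provisioner-generated seed.
 * The per-tenant salt and the N/r/p cost parameters are assumptions for
 * illustration; production values would be chosen during hardening.
 */
function deriveRegistryKey(seed: string, salt: string): Buffer {
  return scryptSync(seed, salt, KEY_LENGTH_BYTES, { N: 16384, r: 8, p: 1 });
}
```

Derivation is deterministic for a given seed and salt, so the Safety Wrapper could re-derive the key at startup without persisting it to disk.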

### 3.5 Secrets Proxy Architecture (localhost:8100)

The thinnest possible process — its only job is intercepting outbound LLM traffic and scrubbing secrets.

```
┌─────────────────────────────────────────────────────────┐
│             SECRETS PROXY (localhost:8100)                │
│                                                           │
│  Inbound (from OpenClaw via Safety Wrapper config)        │
│  ──────────────────────────────────────────────────       │
│  POST /v1/chat/completions                                │
│  POST /v1/completions                                     │
│  POST /v1/embeddings                                      │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐ │
│  │         4-LAYER REDACTION PIPELINE                   │ │
│  │                                                       │ │
│  │  Layer 1: Aho-Corasick Registry Substitution          │ │
│  │  ─────────────────────────────────────────            │ │
│  │  All 50+ known secrets from encrypted registry        │ │
│  │  loaded into Aho-Corasick automaton at startup         │ │
│  │  O(n) in text length regardless of pattern count       │ │
│  │  Deterministic replacements: value → [SECRET_REF:name] │ │
│  │                                                       │ │
│  │  Layer 2: Regex Pattern Safety Net                    │ │
│  │  ─────────────────────────────────────────            │ │
│  │  7 patterns catch secrets the registry might miss:    │ │
│  │  • -----BEGIN.*PRIVATE KEY-----                       │ │
│  │  • eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+ (JWT)       │ │
│  │  • \$2[aby]?\$[0-9]+\$ (bcrypt)                      │ │
│  │  • ://[^:]+:[^@]+@ (connection strings)              │ │
│  │  • (PASSWORD|SECRET|KEY|TOKEN)=.+ (env patterns)      │ │
│  │  • High-entropy base64 (length > 32)                  │ │
│  │  • Hex strings 32+ chars matching known key patterns  │ │
│  │                                                       │ │
│  │  Layer 3: Shannon Entropy Filter                      │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Threshold: 4.5 bits/char, minimum length: 16 chars   │ │
│  │  H(X) = -Σ p(x) log2(p(x))                          │ │
│  │  English text: ~3.5-4.0 bits/char                     │ │
│  │  Random secrets: ~5.0-6.0 bits/char                   │ │
│  │  Catches: API keys, random passwords, base64 tokens   │ │
│  │  Excludes: common words, UUIDs (known format)         │ │
│  │                                                       │ │
│  │  Layer 4: Context-Aware JSON Key Scanning             │ │
│  │  ─────────────────────────────────────────            │ │
│  │  Scans JSON structures for sensitive keys:            │ │
│  │  password, secret, token, key, credential,            │ │
│  │  api_key, apiKey, auth, authorization, bearer,        │ │
│  │  private_key, access_token, refresh_token             │ │
│  │  Redacts the VALUE (not the key) in matched pairs     │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                           │
│  Outbound → OpenRouter (HTTPS)                            │
│  Performance target: <10ms added latency per LLM call     │
│                                                           │
│  Control interface: Unix socket (Safety Wrapper only)     │
│  • Credential sync (on rotation/add/remove)               │
│  • Pattern updates                                        │
│  • Health check                                           │
└─────────────────────────────────────────────────────────┘
```
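
Layer 3 of the pipeline above can be sketched directly from its formula. This is a minimal illustration using the stated threshold and minimum length; the function names are hypothetical, and the UUID exclusion is not implemented here.

```typescript
const ENTROPY_THRESHOLD_BITS = 4.5; // from the Layer 3 spec
const MIN_TOKEN_LENGTH = 16;        // from the Layer 3 spec

/** Empirical Shannon entropy of a string, in bits per character. */
function shannonEntropy(text: string): number {
  const chars = Array.from(text);
  const counts = new Map<string, number>();
  for (const ch of chars) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

/** True if the token is long enough and its entropy exceeds the threshold. */
function exceedsEntropyThreshold(token: string): boolean {
  return token.length >= MIN_TOKEN_LENGTH && shannonEntropy(token) > ENTROPY_THRESHOLD_BITS;
}
```

One property worth validating during tuning: the empirical entropy of a string of length n is at most log2(n), and at most log2(k) for an alphabet of k symbols. Under this formula a token needs at least 23 distinct characters to exceed 4.5 bits/char, so 16-character tokens and pure hex strings (capped at 4.0 bits/char) would not be flagged by this layer and would rely on the registry and regex layers instead.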

### 3.6 Container Layout

| Container | Image | Network | Ports | Resources |
|-----------|-------|---------|-------|-----------|
| `letsbe-openclaw` | Custom (OpenClaw + CLI binaries + config) | host | 18789 (loopback) | ~384MB |
| `letsbe-safety-wrapper` | LetsBe custom (Node.js) | host | 8200 (loopback) | ~128MB |
| `letsbe-secrets-proxy` | LetsBe custom (Node.js, minimal) | host | 8100 (loopback) | ~64MB |
| nginx | nginx:alpine | host | 80, 443 | ~32MB |
| Tool stacks (28+) | Various (Ghost, Nextcloud, etc.) | isolated per-tool | 127.0.0.1:30XX | Variable |

**Network access pattern:** The OpenClaw container uses `--network host` to reach tool containers via `127.0.0.1:30XX` (e.g., 3023 for Nextcloud, 3037 for NocoDB). Each tool keeps its own isolated Docker network, and the AI reaches tools through the host loopback interface. No single Docker network is shared across the 28+ tool stacks.

---

## 4. Central Platform Architecture

### 4.1 Hub (letsbe-hub)

The most mature component (~15K LOC, 244 source files, 80+ existing endpoints, 22+ Prisma models).

**Current capabilities (KEEP):**
- Staff admin dashboard with RBAC (4 roles, 20 permissions, 2FA)
- Customer management (CRUD, subscriptions)
- Order lifecycle (8-state automation state machine)
- Netcup SCP API integration (full OAuth2 Device Flow)
- Portainer integration (container management)
- DNS verification workflow
- Docker-based provisioning with SSE log streaming
- Stripe checkout + webhook integration
- Enterprise client management + monitoring
- Email notifications, credential encryption, system settings

**New capabilities (BUILD):**
- Customer-facing portal API (~14 endpoints) — dashboard, agents, approvals, usage, billing
- Tenant communication API (~7 endpoints) — registration, heartbeat, config sync, approvals, usage
- Billing + token metering (~7 endpoints) — Stripe Billing Meters, overage, founding member multiplier
- Agent management API (~5 endpoints) — CRUD for agent configs, deploy to tenant
- Command approval queue (~3 endpoints) — pending, approve, deny
- WebSocket relay for mobile app ↔ tenant server communication

**New Prisma models:** TokenUsageBucket, BillingPeriod, FoundingMember, AgentConfig, CommandApproval + ServerConnection updates (see 02-COMPONENT-BREAKDOWN for full schemas)

### 4.2 Provisioner (letsbe-ansible-runner → letsbe-provisioner)

One-shot Bash container (~4,477 LOC) that provisions a fresh VPS via SSH.

**Existing 10-step pipeline (KEEP):**
1. System packages
2. Docker CE installation
3. Disable conflicting services
4. nginx + fallback config
5. UFW firewall (ports 80, 443, 22022)
6. Optional admin user + SSH key
7. SSH hardening (port 22022, key-only auth, fail2ban)
8. Unattended security updates
9. Deploy tool stacks via docker-compose
10. **Deploy LetsBe agents + bootstrap** ← UPDATE THIS STEP

**Step 10 changes:**
- Deploy OpenClaw + Safety Wrapper + Secrets Proxy (replacing orchestrator + sysadmin agent)
- Generate Safety Wrapper config (secrets registry seed, agent configs, Hub credentials, autonomy defaults)
- Generate OpenClaw config (model routing through Secrets Proxy, agent definitions, caching, loop detection)
- Run Playwright initial-setup scenarios via OpenClaw native browser (7 scenarios — Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma; n8n removed)
- **CRITICAL FIX:** Clean up config.json after provisioning (currently contains root password in plaintext)

**Zero tests:** the provisioner currently has no test coverage. Container-based integration tests are part of this proposal (see 07-TESTING-STRATEGY).

### 4.3 Website (Separate Next.js App)

A separate Next.js application in the monorepo, sharing the `@letsbe/db` Prisma package. Not part of the Hub — different concerns (marketing + onboarding vs. admin + operations).

**Key features:**
- Marketing pages (SSG for performance)
- AI-powered onboarding chat (Gemini Flash for business classification, ~$0.001 per prospect)
- Tool recommendation engine with live resource calculator
- Stripe checkout flow
- SSE provisioning status page
- Shares Prisma schema via monorepo package — no data duplication

### 4.4 Mobile App (Expo Bare Workflow, SDK 52+)

**Why Expo over alternatives:**
- **EAS Build:** Eliminates iOS code signing complexity — CI builds without Mac hardware
- **EAS Update:** OTA updates without App Store review — critical for rapid iteration
- **expo-notifications:** Action buttons on push notifications (Approve/Deny) for command gating
- **expo-local-authentication:** Biometric auth (Face ID, Touch ID, Android fingerprint)
- **expo-secure-store:** Secure token storage (iOS Keychain, Android Keystore)

**Architecture:** Mobile ↔ Hub (WebSocket relay) ↔ Tenant Server. The Hub acts as a relay, so the tenant server is never directly exposed to the internet. The connection uses JWT auth, a reconnection strategy, and offline message queuing.

---

## 5. Four-Layer Security Model

### 5.1 Layer 1 — Sandbox (Where Code Runs)

OpenClaw's native sandbox controls the execution environment:

| Mode | Description | LetsBe Default |
|------|-------------|---------------|
| `off` | No containerization | **Default** — Safety Wrapper handles gating |
| `non-main` | Only non-default agents sandboxed | For untrusted custom agents |
| `all` | Every agent sandboxed | Maximum isolation (performance cost) |

Default agents (Dispatcher, IT Admin, Marketing, Secretary, Sales) run with sandbox `off` because the Safety Wrapper provides command-level gating that's more granular than container isolation. Custom user-created agents can be sandboxed per-agent.

### 5.2 Layer 2 — Tool Policy (What Tools Are Visible)

OpenClaw's native `agents.list[].tools.allow/deny` arrays control which tools each agent can see. Deny wins over allow. Cascading restriction model:

1. Tool profiles (`tools.profile` — coding, minimal, messaging, full)
2. Global policies (`tools.allow`/`tools.deny`)
3. Agent-specific policies (`agents.list[].tools.allow/deny`)

**Example — Marketing Agent:**
```json
{
  "id": "marketing",
  "tools": {
    "profile": "minimal",
    "allow": ["ghost_api", "listmonk_api", "umami_api", "file_read", "browser", "nextcloud_api", "web_search", "web_fetch"],
    "deny": ["shell", "docker", "env_update"]
  }
}
```

Marketing can see Ghost/Listmonk/Umami but CANNOT see shell/docker/env_update — those tools don't even appear in its context.
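
The "deny wins, each level can only restrict" semantics can be modeled in a few lines. This sketch is an interpretation of the cascade described above, not OpenClaw's implementation: it ignores tool profiles and assumes that a level without an `allow` list imposes no allow-restriction.

```typescript
interface ToolPolicy {
  allow?: string[];
  deny?: string[];
}

/**
 * Return the tools an agent can see after applying policies in cascade order
 * (for example global first, then agent-specific). A tool must pass every level.
 */
function visibleTools(available: string[], policies: ToolPolicy[]): string[] {
  return available.filter((tool) =>
    policies.every((policy) => {
      if (policy.deny !== undefined && policy.deny.includes(tool)) {
        return false; // deny always wins over allow
      }
      // Assumption: no allow list at this level means "no additional restriction".
      return policy.allow === undefined || policy.allow.includes(tool);
    }),
  );
}
```

With a global policy of `{ deny: ["env_update"] }` and an agent policy that allows `ghost_api`, `file_read`, and `shell` but denies `shell` and `docker`, only `ghost_api` and `file_read` remain visible.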

### 5.3 Layer 3 — Command Gating (What Operations Require Approval)

Even if an agent can see a tool (Layer 2 allows it), the Safety Wrapper may gate specific operations on that tool based on command classification and the agent's effective autonomy level.

**Five-tier classification:**

| Tier | Color | Description | Examples |
|------|-------|-------------|---------|
| 1 | **GREEN** | Non-destructive reads | `file_read`, `container_stats`, `container_logs`, `query_select`, `umami_read`, `uptime_check` |
| 2 | **YELLOW** | Modifying operations | `container_restart`, `file_write`, `env_update`, `nginx_reload`, `chatwoot_assign`, `calcom_create` |
| 3 | **YELLOW_EXTERNAL** | External-facing communications | `ghost_publish`, `listmonk_send`, `poste_send`, `chatwoot_reply_external`, `social_post`, `documenso_send` |
| 4 | **RED** | Destructive operations | `file_delete`, `container_remove`, `volume_delete`, `user_revoke`, `db_drop_table`, `backup_delete` |
| 5 | **CRITICAL_RED** | Irreversible infrastructure | `db_drop_database`, `firewall_modify`, `ssh_config_modify`, `backup_wipe_all`, `ssl_revoke` |

**Autonomy level × classification gating matrix:**

| Command Tier | Training Wheels (L1) | Trusted Assistant (L2) | Full Autonomy (L3) |
|-------------|---------------------|----------------------|-------------------|
| GREEN | Auto-execute | Auto-execute | Auto-execute |
| YELLOW | **Gate → approval** | Auto-execute | Auto-execute |
| YELLOW_EXTERNAL | **Gate → approval** | **Gate → approval** *(unless unlocked)* | **Gate → approval** *(unless unlocked)* |
| RED | **Gate → approval** | **Gate → approval** | Auto-execute |
| CRITICAL_RED | **Gate → approval** | **Gate → approval** | **Gate → approval** |
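
The matrix above reduces to a small pure function. The following is a sketch of that lookup; the type and function names are illustrative, and `externalUnlocked` stands in for the per-agent, per-tool unlock described in Section 5.5.

```typescript
type CommandTier = "GREEN" | "YELLOW" | "YELLOW_EXTERNAL" | "RED" | "CRITICAL_RED";
type AutonomyLevel = 1 | 2 | 3;
type Decision = "auto_execute" | "require_approval";

/** Gating decision for a classified command at a given effective autonomy level. */
function decide(tier: CommandTier, level: AutonomyLevel, externalUnlocked: boolean = false): Decision {
  switch (tier) {
    case "GREEN":
      return "auto_execute";
    case "YELLOW":
      return level >= 2 ? "auto_execute" : "require_approval";
    case "YELLOW_EXTERNAL":
      // Gated unless the user unlocked this agent/tool pair; once unlocked,
      // YELLOW rules apply, so Level 1 still requires approval.
      return externalUnlocked && level >= 2 ? "auto_execute" : "require_approval";
    case "RED":
      return level === 3 ? "auto_execute" : "require_approval";
    case "CRITICAL_RED":
      return "require_approval";
  }
}
```

Keeping the decision a pure function of tier, level, and unlock state makes the full 5 × 3 matrix straightforward to cover exhaustively in unit tests.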

### 5.4 Layer 4 — Secrets Redaction (Always On)

Regardless of sandbox mode, tool permissions, or autonomy level, ALL outbound LLM traffic is redacted via the Secrets Proxy's 4-layer pipeline (see Section 3.5). This layer cannot be disabled. It runs at every autonomy level. The AI never sees raw credentials.
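
As an illustration of Layer 4 of that pipeline (context-aware JSON key scanning), here is a minimal sketch that redacts values under sensitive keys. The key list comes from Section 3.5; the exact, case-insensitive key match and the `[REDACTED]` placeholder are assumptions made for this example.

```typescript
// Key names from the Layer 4 description, lowercased for case-insensitive matching.
const SENSITIVE_KEYS = new Set<string>([
  "password", "secret", "token", "key", "credential", "api_key", "apikey",
  "auth", "authorization", "bearer", "private_key", "access_token", "refresh_token",
]);

type Json = string | number | boolean | null | Json[] | { [k: string]: Json };

/** Return a copy of `node` with the VALUE of every sensitive key replaced. */
function redactSensitiveValues(node: Json): Json {
  if (Array.isArray(node)) {
    return node.map((item) => redactSensitiveValues(item));
  }
  if (node !== null && typeof node === "object") {
    const out: { [k: string]: Json } = {};
    for (const k of Object.keys(node)) {
      // The key is kept so the LLM still sees the structure; only the value is hidden.
      out[k] = SENSITIVE_KEYS.has(k.toLowerCase()) ? "[REDACTED]" : redactSensitiveValues(node[k]);
    }
    return out;
  }
  return node;
}
```

The function returns a new structure and leaves its input untouched, which keeps the original payload available for local audit logging.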

### 5.5 External Communications Gate

The external communications gate is independent of autonomy levels. It is a separate mechanism that gates all YELLOW_EXTERNAL operations by default for every agent. Users explicitly unlock autonomous external sending per agent and per tool via the mobile app or web portal.

**Resolution logic:**
1. Command classified as YELLOW_EXTERNAL
2. Check `external_comms_gate.unlocks[agentId][toolName]`
3. If `"autonomous"` → follow normal autonomy level gating (YELLOW rules apply)
4. If `"gated"` or not set → always gate, regardless of autonomy level
5. Present approval: "Marketing Agent wants to publish: 'Top 10 Tips...' to your blog. [Approve] [Edit] [Deny]"
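
The resolution logic above might look like the following. This is a sketch only: the shape of the `unlocks` map mirrors the `external_comms_gate.unlocks[agentId][toolName]` lookup in step 2, while the type and function names are hypothetical.

```typescript
type GateSetting = "autonomous" | "gated";
type AutonomyLevel = 1 | 2 | 3;
type ExternalDecision = "auto_execute" | "require_approval";

// unlocks[agentId][toolName]; a missing entry is treated as "gated".
type ExternalCommsUnlocks = {
  [agentId: string]: { [toolName: string]: GateSetting | undefined } | undefined;
};

/** Resolve a YELLOW_EXTERNAL command for one agent/tool pair. */
function resolveExternalComms(
  unlocks: ExternalCommsUnlocks,
  agentId: string,
  toolName: string,
  level: AutonomyLevel,
): ExternalDecision {
  const agentUnlocks = unlocks[agentId];
  const setting = agentUnlocks === undefined ? undefined : agentUnlocks[toolName];
  if (setting !== "autonomous") {
    return "require_approval"; // "gated" or not set: always gate
  }
  // Unlocked: fall back to YELLOW rules (auto-execute at Level 2 and above).
  return level >= 2 ? "auto_execute" : "require_approval";
}
```

Defaulting to `require_approval` on any missing entry means a config sync failure or a newly added tool can never silently enable autonomous external sending.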

---

## 6. AI Autonomy Levels

### 6.1 Level Definitions

| Level | Name | Default For | Auto-Execute | Requires Approval |
|-------|------|------------|-------------|-------------------|
| 1 | Training Wheels | New customers | GREEN only | YELLOW + RED + CRITICAL_RED |
| 2 | Trusted Assistant | **Default** | GREEN + YELLOW | RED + CRITICAL_RED |
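| 3 | Full Autonomy | Power users | GREEN + YELLOW + RED | CRITICAL_RED only |

YELLOW_EXTERNAL operations are not governed by this table alone: they stay gated at every level until unlocked through the External Communications Gate (Section 5.5).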

### 6.2 Per-Agent Override

Each agent can have its own autonomy level independent of the tenant default:

| Agent | Tenant Default | Agent Override | Effective Level |
|-------|-------------------|----------------|-----------|
| IT Admin | Level 2 | Level 3 | 3 — full autonomy for infrastructure |
| Marketing | Level 2 | — | 2 — default |
| Secretary | Level 2 | Level 1 | 1 — extra cautious with communications |
| Sales | Level 2 | — | 2 — default |

### 6.3 Transition Criteria

Moving between levels is manual — triggered by the customer in the mobile app or web portal, synced to the Safety Wrapper via Hub heartbeat. There is no automatic promotion. The customer builds trust at their own pace.

**Invariants across ALL levels:**
- Secrets are always redacted (Layer 4)
- Audit trail is always logged
- External comms are gated by default until explicitly unlocked
- CRITICAL_RED always requires approval
- The AI never sees raw credentials

---

## 7. Data Flow Diagrams

### 7.1 Message Processing Flow

```
User (mobile app)
  │
  ▼
Hub (WebSocket relay)
  │
  ▼
OpenClaw Gateway (port 18789)
  │
  ├─► Dispatcher Agent (intent classification)
  │     │
  │     ▼
  │   Route to specialist agent (Marketing, IT, Secretary, Sales)
  │     │
  │     ▼
  │   Agent decides on tool call(s)
  │     │
  ▼     ▼
Safety Wrapper (port 8200)
  │
  ├─ 1. Classify command (GREEN/YELLOW/YELLOW_EXT/RED/CRITICAL_RED)
  ├─ 2. Check agent's effective autonomy level
  ├─ 3. Check external comms gate (if YELLOW_EXT)
  │
  ├─ IF ALLOWED:
  │   ├─ 4. Resolve SECRET_REFs from encrypted registry
  │   ├─ 5. Execute tool call (shell/Docker/API/browser)
  │   ├─ 6. Scrub secrets from response
  │   ├─ 7. Log to audit trail
  │   └─ 8. Return result to OpenClaw → Agent → User
  │
  └─ IF GATED:
      ├─ 4. Create approval request with human-readable description
      ├─ 5. POST to Hub /api/v1/tenant/approval-request
      ├─ 6. Hub pushes to mobile app via WebSocket
      ├─ 7. Mobile shows push notification: "[Approve] [Deny]"
      ├─ 8. User taps Approve → Hub relays to Safety Wrapper
      └─ 9. Safety Wrapper resumes execution from step 4 of ALLOWED path
```

### 7.2 Secrets Injection Flow

```
Agent decides to call NocoDB API
  │
  ▼
OpenClaw sends tool call to Safety Wrapper:
  exec("curl http://127.0.0.1:3037/api/v2/tables -H 'xc-token: SECRET_REF(nocodb_api_token)'")
  │
  ▼
Safety Wrapper intercepts:
  1. Classify: GREEN (read-only query) → auto-execute
  2. Resolve SECRET_REF: look up "nocodb_api_token" in encrypted SQLite
  3. Substitute: SECRET_REF(nocodb_api_token) → "xc_abc123def456..."
  4. Execute curl with real token
  │
  ▼
Tool responds:
  { "tables": [...] }   ← response may contain secrets in error messages
  │
  ▼
Safety Wrapper scrubs response:
  Run through mini redaction pipeline (registry match + regex)
  │
  ▼
Secrets Proxy intercepts agent's next LLM call:
  Full 4-layer redaction on all outbound text
  │
  ▼
LLM receives: clean data, no secrets
  Agent sees: [SECRET_REF:nocodb_api_token] (never the real value)
```
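
The substitute-then-scrub steps can be sketched as follows. This is an illustration only: a plain dictionary stands in for the encrypted SQLite registry, only the registry-match part of the scrub is shown (the regex layer is omitted), and the function names are hypothetical. The `[SECRET_REF:name]` placeholder format is taken from the diagram above.

```python
import re

# Matches references of the form SECRET_REF(name), as in the flow above.
SECRET_REF = re.compile(r"SECRET_REF\(([A-Za-z0-9_]+)\)")


def resolve_refs(command, registry):
    """Replace every SECRET_REF(name) with the real value before execution."""
    def _substitute(match):
        name = match.group(1)
        if name not in registry:
            raise KeyError(f"unknown secret reference: {name}")
        return registry[name]

    return SECRET_REF.sub(_substitute, command)


def scrub(text, registry):
    """Mask any registry value that appears in tool output."""
    # Longest values first, so a secret containing a shorter one is masked whole.
    by_length = sorted(registry.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, value in by_length:
        if value:
            text = text.replace(value, f"[SECRET_REF:{name}]")
    return text
```

The real token exists only between `resolve_refs` and process execution; anything returned to the agent passes through `scrub` first.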

### 7.3 Token Metering Flow

```
Every LLM call:
  Agent → OpenClaw → Secrets Proxy → OpenRouter → LLM Provider
                                                       │
  OpenRouter response includes:                        │
    usage: { input_tokens, output_tokens,               │
             cache_read_tokens, cache_write_tokens }    │
                                                       ▼
  Safety Wrapper captures (via response headers or proxy inspection):
    { agent_id, model, input_tokens, output_tokens,
      cached_tokens, timestamp, request_id }
                │
                ▼
  Local SQLite (token_usage table):
    INSERT per-call record
                │
                ▼
  Hourly aggregation job:
    GROUP BY agent_id, model, HOUR(timestamp)
    → TokenUsageBucket records
                │
                ▼
  Heartbeat (every 60s) or dedicated POST:
    Safety Wrapper → Hub /api/v1/tenant/usage
    Payload: array of unsent TokenUsageBucket records
                │
                ▼
  Hub processes:
    1. Store in PostgreSQL TokenUsageBucket table
    2. Update BillingPeriod.tokensUsed
    3. Check pool exhaustion → trigger overage if needed
    4. Report to Stripe Billing Meter (hourly batch)
                │
                ▼
  Stripe calculates overage on next invoice
```
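
The `HOUR(timestamp)` grouping in the diagram is shorthand; SQLite has no `HOUR()` function, so one way to express the hourly bucket is with `strftime`. The sketch below assumes timestamps are stored as `YYYY-MM-DD HH:MM:SS` text and uses the column names from the diagram; the exact schema and the function names are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS token_usage (
    agent_id TEXT, model TEXT,
    input_tokens INTEGER, output_tokens INTEGER, cached_tokens INTEGER,
    timestamp TEXT, request_id TEXT
)
"""

# Truncate each timestamp to the start of its hour and sum per bucket.
AGGREGATE = """
SELECT agent_id, model,
       strftime('%Y-%m-%d %H:00:00', timestamp) AS hour_start,
       SUM(input_tokens), SUM(output_tokens), SUM(cached_tokens), COUNT(*)
FROM token_usage
GROUP BY agent_id, model, hour_start
ORDER BY hour_start, agent_id, model
"""


def create_schema(conn):
    conn.execute(SCHEMA)


def aggregate_hourly(conn):
    """Return one tuple per (agent, model, hour) bucket."""
    return conn.execute(AGGREGATE).fetchall()
```

Each returned tuple corresponds to one `TokenUsageBucket` record that the heartbeat can ship to the Hub.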

### 7.4 Provisioning Flow

```
1. Customer completes Stripe checkout on Website
2. Stripe webhook → Hub creates User + Subscription + Order (PAYMENT_CONFIRMED)
3. Automation state machine: PAYMENT_CONFIRMED → AWAITING_SERVER
4. Hub assigns Netcup server from pre-provisioned pool (EU or US region)
5. State: AWAITING_SERVER → SERVER_READY
6. Hub creates DNS records (A records for all tool subdomains)
7. State: SERVER_READY → DNS_PENDING → DNS_READY
8. Hub spawns Provisioner Docker container with job config; state: DNS_READY → PROVISIONING
9. Provisioner:
   a. SSH into VPS (port 22022)
   b. Steps 1-8: system setup, Docker, nginx, firewall, SSH hardening
   c. Step 9: Deploy 28+ tool stacks via docker-compose
   d. Step 10: Deploy OpenClaw + Safety Wrapper + Secrets Proxy
      - Generate 50+ credentials via env_setup.sh
      - Generate Safety Wrapper config (secrets registry seed, agent configs)
      - Generate OpenClaw config (model routing, agent definitions, caching)
      - Start all three processes
      - Run Playwright initial-setup scenarios via OpenClaw browser
      - Generate SSL certs via Let's Encrypt
10. Safety Wrapper registers with Hub, receives API key
11. State: PROVISIONING → FULFILLED
12. Customer receives welcome email with dashboard URL + app download links
13. Heartbeat loop begins (Safety Wrapper → Hub, every 60 seconds)
```
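
The order states named in the flow form a linear state machine. The sketch below encodes the happy path only. The `DNS_READY → PROVISIONING` edge is an assumption inferred from the step order, and failure or retry states are not covered because the flow does not name any.

```python
# Allowed forward transitions, taken from the provisioning flow above.
TRANSITIONS = {
    "PAYMENT_CONFIRMED": {"AWAITING_SERVER"},
    "AWAITING_SERVER": {"SERVER_READY"},
    "SERVER_READY": {"DNS_PENDING"},
    "DNS_PENDING": {"DNS_READY"},
    "DNS_READY": {"PROVISIONING"},  # assumed edge, implied by the step order
    "PROVISIONING": {"FULFILLED"},
    "FULFILLED": set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rejecting unknown edges keeps an order from skipping steps, for example jumping from `SERVER_READY` straight to `FULFILLED`.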

---

## 8. Inter-Agent Communication

### 8.1 Dispatcher Hub Pattern

The Dispatcher is a first-class default agent — the user's primary point of contact. Every tenant gets one. It has three responsibilities:

1. **Intent routing:** Classifies user messages and delegates to specialist agents
2. **Workflow decomposition:** Breaks multi-domain requests into ordered steps across agents
3. **Morning briefing:** Aggregates overnight activity from all agents into a unified summary

The Dispatcher has NO direct tool access (no shell, no docker, no file operations). It works exclusively through agent-to-agent delegation. This keeps it lightweight and prevents scope creep.

### 8.2 Agent-to-Agent Communication

Agents communicate through OpenClaw's native `agentToAgent` tool, which is enabled for all agents:

```json5
{
  "tools": {
    "agentToAgent": {
      "enabled": true,
      "allow": ["dispatcher", "it-admin", "marketing", "secretary", "sales"]
    }
  }
}
```

**Communication patterns:**
- **Dispatcher → Specialist:** "Handle this user request" (primary pattern)
- **Specialist → Specialist:** "What's the current Ghost version?" (peer queries)
- **Specialist → Dispatcher:** "Task complete, here's the result" (reporting)

**Safety controls:**
- Maximum dispatch depth: 5 levels (prevents A→B→A→B→... loops)
- Rate limiting: a per-agent cap on inter-agent dispatches per minute
- Full audit trail: every dispatch logged with source, target, task, result
- User visibility: all agent activity visible in mobile app's Activity feed
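
The depth and rate controls could be enforced with a small guard like the one below. It is a sketch under stated assumptions: depth is counted from 1 for the first dispatch in a chain, the rate limit uses a 60-second sliding window, and the value of `MAX_PER_MINUTE` is a placeholder because this document does not specify one.

```python
import time
from collections import defaultdict, deque

MAX_DEPTH = 5        # from the safety controls above
MAX_PER_MINUTE = 20  # placeholder value, not specified in this document


class DispatchGuard:
    def __init__(self, max_depth=MAX_DEPTH, max_per_minute=MAX_PER_MINUTE,
                 clock=time.monotonic):
        self.max_depth = max_depth
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._recent = defaultdict(deque)  # agent ID -> dispatch timestamps

    def check(self, source, depth):
        """Return True if `source` may dispatch at this chain depth."""
        if depth > self.max_depth:
            return False
        now = self.clock()
        window = self._recent[source]
        # Drop dispatches that fell out of the 60-second window.
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True
```

The clock is injectable so the window logic can be tested without sleeping.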

### 8.3 Shared Memory

Each agent has its own workspace, but all agents get `extraPaths` pointing to `/opt/letsbe/shared-memory/`. When one agent writes to the shared directory, others discover it via `memory_search`. This enables cross-agent knowledge sharing without breaking workspace isolation.

---

## 9. Memory Architecture

### 9.1 OpenClaw Native Memory

| Layer | Location | Purpose | Loaded When |
|-------|----------|---------|-------------|
| Daily logs | `memory/YYYY-MM-DD.md` | Session context | Today + yesterday |
| Long-term | `MEMORY.md` | Curated durable knowledge | Private sessions |
| Transcripts | Session JSONL | Full conversation recall | Via `memory_search` |

### 9.2 Memory Search

Hybrid retrieval combining:
- **Vector search** (cosine similarity via sqlite-vec): Semantic matching
- **BM25 keyword search** (SQLite FTS5): Exact token matching
- **MMR re-ranking** (lambda 0.7): Balances relevance with diversity
- **Temporal decay** (30-day half-life): Boosts recent memories
- **Local embeddings** (`ggml-org/embeddinggemma-300m-qat-q8_0-GGUF`, ~0.6GB)
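
Two of the stages above have compact definitions. A 30-day half-life means a score is multiplied by 0.5 for every 30 days of age, and MMR with lambda 0.7 repeatedly picks the candidate maximizing 0.7 × relevance − 0.3 × (highest similarity to anything already selected). The sketch below illustrates both in plain Python; it is not OpenClaw's implementation, and how the vector and BM25 scores are fused beforehand is left out.

```python
import math


def decayed(score, age_days, half_life_days=30.0):
    """Apply exponential temporal decay to a relevance score."""
    return score * 0.5 ** (age_days / half_life_days)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(candidates, k, lam=0.7):
    """Select k items from (id, relevance, embedding) tuples using MMR."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(cand):
            # Redundancy is the similarity to the closest already-selected item.
            redundancy = max((cosine(cand[2], s[2]) for s in selected), default=0.0)
            return lam * cand[1] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [cand[0] for cand in selected]
```

With a near-duplicate pair and one distinct memory, MMR prefers the distinct memory as the second result even when its raw relevance is lower.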

### 9.3 Token Efficiency Strategy

| Strategy | Impact |
|----------|--------|
| Tool registry (structured JSON, ~2.5K tokens) vs. verbose skills | ~80% reduction in tool context |
| On-demand cheat sheets vs. always-loaded skills | Only pay for tools used in session |
| Compact SOUL.md (~600-800 tokens per agent) | ~50% reduction in identity context |
| `cacheRetention: "long"` (1 hour) | 80-99% cheaper on repeated SOUL.md calls |
| Context pruning (`cache-ttl`, 1h default) | Auto-removes stale tool outputs |
| Session compaction | Keeps long conversations from blowing up costs |

**Base context cost per agent:** master skill (~700 tokens) + tool registry (~2,500 tokens) = **~3,200 tokens**, regardless of how many tools are installed. By comparison, 30 individual skills at ~750 tokens each would keep ~22,500 tokens in context at all times.

---

## 10. Network Security

### 10.1 Firewall Rules

```bash
# UFW configuration (set during provisioning step 5)
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp      # HTTP (nginx → redirect to HTTPS)
ufw allow 443/tcp     # HTTPS (nginx → tool web UIs + Hub API)
ufw allow 22022/tcp   # SSH (hardened port, key-only auth)
ufw --force enable    # --force skips the interactive confirmation prompt
```

**NOT exposed:**
- Port 18789 (OpenClaw) — loopback only
- Port 8200 (Safety Wrapper) — loopback only
- Port 8100 (Secrets Proxy) — loopback only
- Ports 3001-3099 (tool containers) — loopback only, accessed via nginx

### 10.2 TLS

- All tool web UIs served via nginx with Let's Encrypt certificates
- Auto-renewal via certbot cron
- HTTP Strict Transport Security (HSTS) headers
- OCSP stapling enabled

### 10.3 Inter-Process Authentication

| From → To | Auth Method |
|-----------|-------------|
| OpenClaw → Safety Wrapper | Shared secret token (generated at provisioning) |
| Safety Wrapper → Secrets Proxy | Unix socket (no network, filesystem permissions) |
| Safety Wrapper → Hub | Bearer token (Hub API key, received at registration) |
| Safety Wrapper → Hub (registration) | Registration token, exchanged for the Hub API key |
| Mobile → Hub | JWT (NextAuth session) |
| Hub → Tenant via nginx | Not needed — Safety Wrapper initiates all Hub communication |

### 10.4 SSRF Protection

OpenClaw's browser tool supports configurable URL allowlists. LetsBe restricts browser navigation as follows:
- **Allowed:** `127.0.0.1:*` (localhost tool UIs)
- **Allowed:** tool-specific external URLs (if configured)
- **Blocked:** metadata endpoints (169.254.169.254), internal networks, `file://` URIs
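
A default-deny check is one way to express these rules: anything that is not loopback or an explicitly configured host is refused, which covers metadata endpoints and internal ranges without listing them. The sketch below is illustrative and does not use OpenClaw's own allowlist configuration; the function name and the `extra_hosts` parameter are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

LOOPBACK = ipaddress.ip_address("127.0.0.1")


def is_navigation_allowed(url, extra_hosts=frozenset()):
    """Return True only for localhost tool UIs and configured external hosts."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # rejects file:// and any other scheme
    host = parts.hostname
    if host is None:
        return False
    if host in extra_hosts:
        return True  # tool-specific external host, explicitly configured
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname that was not explicitly configured
    # Only 127.0.0.1 passes; 169.254.169.254 and private ranges do not.
    return ip == LOOPBACK
```

This check looks only at the URL string. A configured hostname that resolves to an internal address would still pass, so a production check would also need to validate the resolved IP.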

---

## 11. Scalability & Performance

### 11.1 Horizontal Scaling

Each tenant is an independent VPS — horizontal scaling means adding more VPS instances. No shared state between tenants. The Hub handles N tenants, scaling its own PostgreSQL and server capacity as needed.

### 11.2 Vertical Scaling

Tier upgrades: Lite → Build → Scale → Enterprise. The provisioner can migrate tool stacks to a larger VPS. OpenClaw and Safety Wrapper configs don't change — only resource limits increase.

### 11.3 Performance Targets

| Metric | Target | Measured At |
|--------|--------|------------|
| Secrets redaction latency | <10ms per LLM call | Secrets Proxy |
| Command classification latency | <5ms per tool call | Safety Wrapper |
| Gating decision (auto-execute path) | <50ms | Safety Wrapper |
| Approval round-trip (with mobile) | <30 seconds typical | Safety Wrapper → Hub → Mobile → Hub → Safety Wrapper |
| Agent response time | 2-15 seconds (model-dependent) | End-to-end |
| Heartbeat interval | 60 seconds | Safety Wrapper → Hub |
| Config sync latency | <60 seconds (next heartbeat) | Hub → Safety Wrapper |

---

## 12. Disaster Recovery & Backup

### 12.1 Application-Level Backups (Existing)

The Provisioner deploys `backups.sh` (~473 lines):
- 18 PostgreSQL databases + 2 MySQL + 1 MongoDB
- Daily 2:00 AM cron job
- Rotation: 7 daily local + 4 weekly remote (via rclone)
- Output: `backup-status.json` with per-database status

### 12.2 Backup Monitoring (NEW)

An OpenClaw cron job at 6:00 AM reads `backup-status.json` and checks:
- Was the backup updated today?
- Are all databases listed?
- Did any database fail?

The result is reported to the Hub via the Safety Wrapper's `/tenant/backup-status` endpoint.
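
The three checks can be sketched as a single function. The layout of `backup-status.json` is not given in this document, so the schema used here (a `completed_at` ISO timestamp plus a `databases` map with a per-database `status` field) is purely hypothetical, as is the function name.

```python
import json
from datetime import datetime


def check_backup_status(raw, expected, today):
    """Return a list of problems found in a backup-status.json payload."""
    status = json.loads(raw)
    problems = []

    # 1. Was the backup updated today?
    completed = datetime.fromisoformat(status["completed_at"]).date()
    if completed != today:
        problems.append(f"stale backup: last run {completed.isoformat()}")

    # 2. Are all expected databases listed?
    databases = status.get("databases", {})
    for name in sorted(set(expected) - set(databases)):
        problems.append(f"missing database: {name}")

    # 3. Did any listed database fail?
    for name, entry in sorted(databases.items()):
        if entry.get("status") != "ok":
            problems.append(f"failed database: {name}")

    return problems
```

An empty list means the backup is healthy; a non-empty list is what the monitoring job would forward to the Hub.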

### 12.3 VPS Snapshots

Daily Netcup VPS snapshots via SCP API:
- Triggered by Hub cron job
- 3 snapshots retained (rolling)
- Staggered across tenants to avoid API rate limits
- Free to create and store

### 12.4 Recovery Procedures

| Scenario | Recovery |
|----------|----------|
| Single tool database corruption | Restore from application-level dump |
| OpenClaw/Safety Wrapper state loss | Restore from VPS snapshot |
| Full VPS failure | Restore from snapshot to new VPS, re-provision |
| Hub database loss | Separate Hub backup strategy (not tenant concern) |

---

## 13. Error Handling & Resilience

### 13.1 Severity-Based Alerting

| Severity | Examples | Auto-Recovery | Alert |
|----------|----------|---------------|-------|
| **Soft** | OpenClaw crash, Secrets Proxy restart, tool adapter timeout | Auto-restart immediately | Push notification after 3 failures in 1 hour |
| **Medium** | Tool API unreachable, OpenRouter timeout, Hub communication failure | Retry with backoff (30s → 1m → 5m) | Push notification after 3 consecutive failures |
| **Hard** | Auth token rejected, secrets registry corrupted, disk full, SSL expired | Stop affected component, do NOT auto-restart | Immediate push to customer + Hub alert to staff |
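
The Medium-severity policy (retry at 30s, 1m, 5m, and notify after 3 consecutive failures) can be sketched as below. Two details are assumptions because the table does not state them: the delay stays at 5 minutes after the third retry, and retries stop after `max_attempts`. The `sleep` and `notify` callables are injected so the sketch makes no claim about how alerts are actually delivered.

```python
from itertools import chain, repeat

BACKOFF_SECONDS = (30, 60, 300)  # 30s → 1m → 5m, from the table above


def backoff_delays():
    """Yield 30, 60, 300, then keep yielding 300 (assumed steady state)."""
    return chain(BACKOFF_SECONDS, repeat(BACKOFF_SECONDS[-1]))


def retry_with_backoff(operation, sleep, notify, alert_after=3, max_attempts=6):
    """Run `operation`, retrying on ConnectionError with backoff."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == alert_after:
                notify(f"{attempt} consecutive failures")
            if attempt == max_attempts:
                raise  # give up and let the caller escalate
            sleep(next(delays))
```

In production `sleep` would be `time.sleep` and `notify` would trigger the push notification; in tests both can be simple list appenders.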

### 13.2 Model Failover

OpenClaw native failover chains:
```json
{
  "model": {
    "primary": "anthropic/claude-sonnet-4-6",
    "fallbacks": ["anthropic/claude-haiku-4-5", "google/gemini-2.0-flash"]
  }
}
```

Auth profile rotation takes precedence over model fallback: if the primary model fails because of an API key issue, OpenClaw rotates auth profiles before falling back to a different model.

### 13.3 Graceful Degradation

| Component Down | User Experience |
|---------------|----------------|
| Single tool | Agent says "I can't reach X right now. I'll try again shortly." |
| Secrets Proxy | Agents pause (can't make LLM calls). Resume on restart (~2-5s). |
| Safety Wrapper | Tool calls blocked. Agents can still respond from cached context. Resume on restart. |
| OpenClaw | All agents offline. Auto-restart. User sees "Your AI team is restarting." |
| Hub | Agents continue locally (cached config). Heartbeats queue. Approvals delayed. |
| OpenRouter | Model failover chain. If all fail, agent reports temporary issue. |
| Mobile app | Customer portal (web) available as fallback. |

---

*End of System Architecture Document*