
LetsBe Biz — Deployment Strategy

Date: February 27, 2026
Team: Claude Opus 4.6 Architecture Team
Document: 03 of 09
Status: Proposal — Competing with independent team


Table of Contents

  1. Deployment Topology
  2. Central Platform Deployment
  3. Tenant Server Deployment
  4. Container Strategy
  5. Resource Budgets
  6. Provider Strategy
  7. Update & Rollout Strategy
  8. Disaster Recovery
  9. Monitoring & Alerting
  10. SSL & Domain Management

1. Deployment Topology

                    ┌─────────────────────────────────────┐
                    │         CENTRAL PLATFORM             │
                    │                                      │
                    │  ┌──────────┐  ┌──────────────────┐  │
                    │  │   Hub    │  │   PostgreSQL 16   │  │
                    │  │  (Next.js│  │   (hub database)  │  │
                    │  │  port    │  └──────────────────┘  │
                    │  │  3847)   │                        │
                    │  └──────────┘  ┌──────────────────┐  │
                    │                │  Website (Vercel  │  │
                    │  ┌──────────┐  │  or self-hosted)  │  │
                    │  │ Gitea CI │  └──────────────────┘  │
                    │  └──────────┘                        │
                    └──────────┬──────────────────────────┘
                               │ HTTPS
              ┌────────────────┼────────────────┐
              │                │                │
    ┌─────────▼──────┐ ┌──────▼────────┐ ┌─────▼────────────┐
    │ Tenant VPS #1  │ │ Tenant VPS #2 │ │ Tenant VPS #N    │
    │ (customer-a)   │ │ (customer-b)  │ │ (customer-n)     │
    │                │ │               │ │                  │
    │ OpenClaw       │ │ OpenClaw      │ │ OpenClaw         │
    │ Safety Wrapper │ │ Safety Wrapper│ │ Safety Wrapper   │
    │ Secrets Proxy  │ │ Secrets Proxy │ │ Secrets Proxy    │
    │ nginx          │ │ nginx         │ │ nginx            │
    │ 25+ tool       │ │ 25+ tool      │ │ 25+ tool         │
    │ containers     │ │ containers    │ │ containers       │
    └────────────────┘ └───────────────┘ └──────────────────┘

1.1 Key Topology Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Hub hosting | Dedicated Netcup RS G12 (EU) + mirror (US) | Low latency to tenants, cost-effective |
| Website hosting | Vercel (CDN) or static export on Hub server | CDN for global reach, simple deployment |
| Tenant isolation | One VPS per customer, no shared infrastructure | Privacy guarantee, blast radius containment |
| Region support | EU (Nuremberg) + US (Manassas) | Customer-selectable, same RS G12 hardware |
| Provider strategy | Netcup primary (contracts) + Hetzner overflow (hourly) | Cost optimization + burst capacity |

2. Central Platform Deployment

2.1 Hub Server

# deploy/hub/docker-compose.yml
version: '3.8'
services:
  db:
    image: postgres:16-alpine
    container_name: letsbe-hub-db
    restart: unless-stopped
    volumes:
      - hub-db-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: letsbe_hub
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  hub:
    image: code.letsbe.solutions/letsbe/hub:${HUB_VERSION}
    container_name: letsbe-hub
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "127.0.0.1:3847:3000"
    volumes:
      - hub-jobs:/app/jobs
      - hub-logs:/app/logs
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@db:5432/letsbe_hub
      NEXTAUTH_URL: ${HUB_URL}
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
      STRIPE_WEBHOOK_SECRET: ${STRIPE_WEBHOOK_SECRET}
      # ... (see existing config)

  # Provisioner runner (spawned on demand by Hub)
  # Not a persistent service — Hub spawns Docker containers per job

volumes:
  hub-db-data:
  hub-jobs:
  hub-logs:

2.2 Hub nginx Configuration

# deploy/hub/nginx/hub.conf
# NOTE: limit_req_zone is only valid in the http context. Place these in an
# http-level include (e.g. /etc/nginx/conf.d/ratelimit.conf), not inside server {}:
limit_req_zone $binary_remote_addr zone=public_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=tenant_api:10m rate=30r/s;

server {
    listen 443 ssl http2;
    server_name hub.letsbe.biz;

    ssl_certificate     /etc/letsencrypt/live/hub.letsbe.biz/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hub.letsbe.biz/privkey.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    # Public API rate limiting
    location /api/v1/public/ {
        limit_req zone=public_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Tenant API (Safety Wrapper calls) rate limiting
    location /api/v1/tenant/ {
        limit_req zone=tenant_api burst=50 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # SSE for provisioning logs and chat relay
    location /api/v1/admin/orders/ {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
    }

    # WebSocket for real-time chat relay
    location /api/v1/customer/ws {
        proxy_pass http://127.0.0.1:3847;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Default
    location / {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

2.3 Hub Database Backup

#!/bin/bash
# deploy/hub/backup.sh — runs daily at 3:00 AM
set -euo pipefail

BACKUP_DIR="/opt/letsbe/hub-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# PostgreSQL dump (pipefail ensures a failed pg_dump is not masked by gzip)
docker exec letsbe-hub-db pg_dump -U "${DB_USER}" letsbe_hub \
  | gzip > "${BACKUP_DIR}/hub_${DATE}.sql.gz"

# Rotate: keep 14 daily, 8 weekly, 3 monthly
find "${BACKUP_DIR}" -name "hub_*.sql.gz" -mtime +14 -delete
# Weekly: kept by separate cron moving to weekly/
# Monthly: kept by separate cron moving to monthly/

# Upload to off-site storage (S3/Backblaze)
rclone copy "${BACKUP_DIR}/hub_${DATE}.sql.gz" remote:letsbe-hub-backups/daily/

3. Tenant Server Deployment

3.1 Provisioning Flow

Hub receives order (status: PAYMENT_CONFIRMED)
  │
  ▼
Automation worker: PAYMENT_CONFIRMED → AWAITING_SERVER
  │
  ▼
Assign Netcup server from pre-provisioned pool
  (or spin up Hetzner Cloud if pool empty)
  │
  ▼
AWAITING_SERVER → SERVER_READY
  │
  ▼
Create DNS records via Cloudflare API (NEW — was manual)
  │
  ▼
SERVER_READY → DNS_PENDING → DNS_READY
  │
  ▼
Spawn Provisioner Docker container with job config
  │
  ▼
Provisioner SSHs into VPS, runs 10-step pipeline:
  Step 1-8:  System setup, Docker, nginx, firewall, SSH hardening
  Step 9:    Deploy tool stacks (28+ Docker Compose stacks)
  Step 10:   Deploy LetsBe AI stack (OpenClaw + Safety Wrapper + Secrets Proxy)
  │
  ▼
Safety Wrapper registers with Hub → receives API key
  │
  ▼
PROVISIONING → FULFILLED
  │
  ▼
Customer receives welcome email + app download links
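The status transitions above form a linear state machine. A minimal sketch (the type and function names are illustrative assumptions, not the Hub codebase; the transition order is read off the diagram):

```typescript
// Hypothetical sketch of the order state machine implied by the flow above.
type OrderStatus =
  | "PAYMENT_CONFIRMED" | "AWAITING_SERVER" | "SERVER_READY"
  | "DNS_PENDING" | "DNS_READY" | "PROVISIONING" | "FULFILLED";

const NEXT: Record<OrderStatus, OrderStatus | null> = {
  PAYMENT_CONFIRMED: "AWAITING_SERVER",
  AWAITING_SERVER: "SERVER_READY",
  SERVER_READY: "DNS_PENDING",
  DNS_PENDING: "DNS_READY",
  DNS_READY: "PROVISIONING",
  PROVISIONING: "FULFILLED",
  FULFILLED: null, // terminal state
};

// The automation worker may only advance one step at a time; illegal
// jumps (or advancing a terminal order) throw.
function advance(status: OrderStatus): OrderStatus {
  const next = NEXT[status];
  if (next === null) throw new Error(`Order already terminal: ${status}`);
  return next;
}
```

Encoding the transitions as a map makes it cheap to reject out-of-order updates from retried automation jobs.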

3.2 Pre-Provisioned Server Pool

To minimize customer wait time (target: <20 minutes from payment to AI ready):

| Region | Pool Size | Server Tier | Status |
| --- | --- | --- | --- |
| EU (Nuremberg) | 3-5 servers | Build (RS 2000 G12) | Freshly installed Debian 12, Docker pre-installed |
| US (Manassas) | 2-3 servers | Build (RS 2000 G12) | Same |

Pool is replenished automatically when it drops below minimum. Netcup servers are on 12-month contracts — pre-provisioning is a cost commitment.
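The replenishment check can be sketched as follows (the minimums come from the table above; the function name is an assumption):

```typescript
// Sketch: how many servers to order when the pool dips below its minimum.
// Minimum pool sizes per region, from the pool table above.
const POOL_MIN: Record<"eu" | "us", number> = { eu: 3, us: 2 };

function replenishCount(region: "eu" | "us", currentPool: number): number {
  const min = POOL_MIN[region];
  // Order only the shortfall; contracted servers are a cost commitment.
  return currentPool < min ? min - currentPool : 0;
}
```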

3.3 Tenant Container Layout

Tenant VPS (e.g., Build tier: 8c/16GB/512GB NVMe)
│
├── nginx (port 80, 443)                         ~64MB
├── letsbe-openclaw (port 18789, host network)    ~384MB + Chromium
├── letsbe-safety-wrapper (port 8200)             ~128MB
├── letsbe-secrets-proxy (port 8100)              ~64MB
│
├── TOOL STACKS (Docker Compose per tool):
│   ├── nextcloud + postgres (port 3023)          ~768MB
│   ├── chatwoot + postgres + redis (port 3019)   ~1024MB
│   ├── ghost + mysql (port 3025)                 ~384MB
│   ├── calcom + postgres (port 3044)             ~384MB
│   ├── stalwart-mail (port 3011)                 ~256MB
│   ├── odoo + postgres (port 3035)               ~1280MB
│   ├── keycloak + postgres (port 3043)           ~512MB
│   ├── listmonk + postgres (port 3026)           ~256MB
│   ├── nocodb (port 3037)                        ~256MB
│   ├── umami + postgres (port 3029)              ~256MB
│   ├── uptime-kuma (port 3033)                   ~128MB
│   ├── portainer (port 9443)                     ~128MB
│   ├── activepieces (port 3040)                  ~384MB
│   ├── ... (remaining tools)
│   └── certbot                                   ~16MB
│
└── TOTAL: varies by tier and selected tools

4. Container Strategy

4.1 Image Registry

All custom images hosted on Gitea Container Registry:

code.letsbe.solutions/letsbe/hub:latest
code.letsbe.solutions/letsbe/openclaw:latest
code.letsbe.solutions/letsbe/safety-wrapper:latest
code.letsbe.solutions/letsbe/secrets-proxy:latest
code.letsbe.solutions/letsbe/provisioner:latest
code.letsbe.solutions/letsbe/demo:latest

4.2 Image Build Strategy

# packages/safety-wrapper/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8200
CMD ["node", "dist/server.js"]

# packages/secrets-proxy/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8100
CMD ["node", "dist/server.js"]

4.3 OpenClaw Custom Image

# packages/openclaw-image/Dockerfile
FROM openclaw/openclaw:2026.2.6-3

# Install CLI binaries for tool access
RUN apk add --no-cache curl jq

# Install gog (Google CLI) and himalaya (IMAP CLI)
COPY bin/gog /usr/local/bin/gog
COPY bin/himalaya /usr/local/bin/himalaya
RUN chmod +x /usr/local/bin/gog /usr/local/bin/himalaya

# Pre-create directory structure (chown so the non-root user can write)
RUN mkdir -p /home/openclaw/.openclaw/agents \
             /home/openclaw/.openclaw/skills \
             /home/openclaw/.openclaw/references \
             /home/openclaw/.openclaw/data \
             /home/openclaw/.openclaw/shared-memory \
    && chown -R openclaw:openclaw /home/openclaw/.openclaw

USER openclaw

4.4 Container Restart Policies

| Container | Restart Policy | Rationale |
| --- | --- | --- |
| All LetsBe containers | unless-stopped | Auto-recover from crashes; manual stop stays stopped |
| Tool containers | unless-stopped | Same — tools should self-heal |
| nginx | unless-stopped | Critical path — must auto-restart |

5. Resource Budgets

5.1 Per-Tier Budget

| Component | Lite (8GB) | Build (16GB) | Scale (32GB) | Enterprise (64GB) |
| --- | --- | --- | --- | --- |
| LetsBe overhead | 640MB | 640MB | 640MB | 640MB |
| Tool headroom | 7,360MB | 15,360MB | 31,360MB | 63,360MB |
| Recommended tools | 5-8 | 10-15 | 15-25 | 25-30+ |
| CPU cores | 4 | 8 | 12 | 16 |
| NVMe storage | 256GB | 512GB | 1TB | 2TB |

5.2 LetsBe Overhead Breakdown

| Process | RAM | CPU | Notes |
| --- | --- | --- | --- |
| OpenClaw Gateway | ~256MB | 1.0 core | Node.js 22 + agent state |
| Chromium (browser tool) | ~128MB | 0.5 core | Managed by OpenClaw, shared across agents |
| Safety Wrapper | ~128MB | 0.5 core | Tool execution + Hub communication |
| Secrets Proxy | ~64MB | 0.25 core | Lightweight HTTP proxy |
| nginx | ~64MB | 0.25 core | Reverse proxy for all tool subdomains |
| Total | ~640MB | ~2.5 cores | |

5.3 Tool Resource Registry

Used by the resource calculator in the website and by the IT Agent for dynamic tool installation:

{
  "nextcloud": { "ram_mb": 512, "disk_gb": 10, "requires_db": "postgres" },
  "chatwoot": { "ram_mb": 768, "disk_gb": 5, "requires_db": "postgres", "requires_redis": true },
  "ghost": { "ram_mb": 256, "disk_gb": 3, "requires_db": "mysql" },
  "odoo": { "ram_mb": 1024, "disk_gb": 10, "requires_db": "postgres" },
  "calcom": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "stalwart": { "ram_mb": 256, "disk_gb": 5 },
  "keycloak": { "ram_mb": 512, "disk_gb": 2, "requires_db": "postgres" },
  "listmonk": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "nocodb": { "ram_mb": 256, "disk_gb": 2 },
  "umami": { "ram_mb": 192, "disk_gb": 1, "requires_db": "postgres" },
  "uptime_kuma": { "ram_mb": 128, "disk_gb": 1 },
  "portainer": { "ram_mb": 128, "disk_gb": 1 },
  "activepieces": { "ram_mb": 384, "disk_gb": 3, "requires_db": "postgres" }
}
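The resource calculator over this registry reduces to a budget check. A sketch (registry subset inlined; the 640 MB platform overhead is from section 5.2; `fitsOnTier` is an illustrative name, not the website's actual API):

```typescript
// Illustrative fit check of a tool selection against a tier's RAM budget.
interface ToolSpec { ram_mb: number; disk_gb: number; requires_db?: string }

// Subset of the registry above, inlined for the example.
const REGISTRY: Record<string, ToolSpec> = {
  nextcloud: { ram_mb: 512, disk_gb: 10, requires_db: "postgres" },
  ghost: { ram_mb: 256, disk_gb: 3, requires_db: "mysql" },
  odoo: { ram_mb: 1024, disk_gb: 10, requires_db: "postgres" },
  uptime_kuma: { ram_mb: 128, disk_gb: 1 },
};

const LETSBE_OVERHEAD_MB = 640; // OpenClaw + wrapper + proxy + nginx (5.2)

function fitsOnTier(tools: string[], tierRamMb: number): boolean {
  const toolRam = tools.reduce((sum, t) => {
    const spec = REGISTRY[t];
    if (!spec) throw new Error(`Unknown tool: ${t}`);
    return sum + spec.ram_mb;
  }, 0);
  return LETSBE_OVERHEAD_MB + toolRam <= tierRamMb;
}
```

The IT Agent can run the same check before installing a tool dynamically, rejecting installs that would exceed the tier's headroom.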

6. Provider Strategy

6.1 Primary: Netcup RS G12

| Plan | Specs | Monthly | Contract | Use Case |
| --- | --- | --- | --- | --- |
| RS 1000 G12 | 4c/8GB/256GB | ~€8.50 | 12-month | Lite tier |
| RS 2000 G12 | 8c/16GB/512GB | ~€14.50 | 12-month | Build tier (default) |
| RS 4000 G12 | 12c/32GB/1TB | ~€26.00 | 12-month | Scale tier |
| RS 8000 G12 | 16c/64GB/2TB | ~€48.00 | 12-month | Enterprise tier |

Both EU (Nuremberg) and US (Manassas) datacenters available.

Pre-provisioned pool: 5 Build-tier servers in EU, 3 in US. Replenished weekly.

6.2 Overflow: Hetzner Cloud

For burst capacity when Netcup pool is depleted:

| Type | Specs | Hourly | Monthly Cap | Notes |
| --- | --- | --- | --- | --- |
| CPX21 | 3c/4GB/80GB | €0.0113 | ~€8.24 | Lite equivalent |
| CPX31 | 4c/8GB/160GB | €0.0214 | ~€15.59 | Build equivalent |
| CPX41 | 8c/16GB/240GB | €0.0399 | ~€29.09 | Scale equivalent |
| CPX51 | 16c/32GB/360GB | €0.0798 | ~€58.15 | Enterprise equivalent |

Trigger: the Netcup pool for a given tier + region is empty AND the order is in AUTO mode.
Migration: the customer is migrated to a Netcup RS when the next contract cycle opens (monthly check).

6.3 Provider Abstraction

The Provisioner is provider-agnostic — it only needs SSH access to a Debian 12 VPS. Provider-specific logic lives in the Hub:

interface ServerProvider {
  name: 'netcup' | 'hetzner';
  allocateServer(tier: ServerTier, region: Region): Promise<ServerAllocation>;
  deallocateServer(serverId: string): Promise<void>;
  getServerStatus(serverId: string): Promise<ServerStatus>;
  createSnapshot(serverId: string): Promise<SnapshotResult>;
}
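The overflow trigger above can be expressed as a small decision function (name and return values are illustrative, not the Hub's actual code):

```typescript
// Sketch: which provider serves a new order, given the current pool state.
type AllocationTarget = "netcup" | "hetzner" | "wait_for_pool";

function chooseProvider(netcupPoolSize: number, autoMode: boolean): AllocationTarget {
  // Pre-provisioned Netcup servers are the fastest and cheapest path.
  if (netcupPoolSize > 0) return "netcup";
  // Pool empty: burst to hourly Hetzner only when the order runs in AUTO mode;
  // otherwise hold the order until the pool is replenished.
  return autoMode ? "hetzner" : "wait_for_pool";
}
```

The Hub would then call `allocateServer` on whichever `ServerProvider` implementation the decision selects.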

7. Update & Rollout Strategy

7.1 Central Platform Updates

| Component | Deployment | Rollback |
| --- | --- | --- |
| Hub | Docker image pull + restart | Previous image tag |
| Website | Vercel deploy (instant) or Docker pull | Previous deployment |
| Hub Database | Prisma migrate deploy (forward-only) | Reverse migration script |

7.2 Tenant Server Updates

Tenant updates are pushed from the Hub, NOT pulled by tenants:

1. Hub builds new Safety Wrapper / Secrets Proxy image
2. Hub creates update task for each tenant
3. Safety Wrapper receives update command via heartbeat
4. Safety Wrapper downloads new image (from Gitea registry)
5. Safety Wrapper performs rolling restart:
   a. Pull new image
   b. Stop old container
   c. Start new container
   d. Health check
   e. Report success/failure to Hub
6. If health check fails: rollback to previous image
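Steps 5a-5e and the rollback rule can be sketched as a command plan plus a post-update decision (container name, image reference, and health URL are illustrative, not the real update agent):

```typescript
// Sketch: the rolling-restart steps above as an ordered command plan.
function rollingUpdatePlan(container: string, image: string, tag: string): string[] {
  return [
    `docker pull ${image}:${tag}`,                        // a. pull new image
    `docker stop ${container} && docker rm ${container}`, // b. stop old container
    `docker run -d --name ${container} ${image}:${tag}`,  // c. start new container
    `curl -fsS http://127.0.0.1:8200/health`,             // d. health check
  ];
}

// e. report result to the Hub; a failed health check triggers a rollback
// to the previously running image tag.
function postUpdateAction(healthOk: boolean): "report_success" | "rollback_previous_tag" {
  return healthOk ? "report_success" : "rollback_previous_tag";
}
```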

7.3 OpenClaw Updates

OpenClaw is pinned to a tested release tag. Update cadence:

  1. Monthly review of upstream changelog
  2. Test new release on staging VPS (dedicated test tenant)
  3. If no issues after 48 hours: roll out to 5% of tenants (canary)
  4. Monitor for 24 hours
  5. Roll out to remaining tenants
  6. Rollback available: previous Docker image tag

7.4 Canary Deployment

Stage 1: Staging VPS (internal testing)        — 48 hours
Stage 2: 5% of tenants (canary group)          — 24 hours
Stage 3: 25% of tenants                        — 12 hours
Stage 4: 100% of tenants                       — complete

Canary selection: newest tenants first (less established, lower blast radius).
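The newest-first canary selection can be sketched as a sort-and-slice (the `Tenant` shape and function name are assumptions):

```typescript
// Sketch: pick the canary group for a rollout stage, newest tenants first.
interface Tenant { id: string; createdAt: Date }

function canaryGroup(tenants: Tenant[], fraction: number): Tenant[] {
  // Newest first: less established tenants mean a lower blast radius.
  const sorted = [...tenants].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  // Always include at least one tenant so small fleets still get a canary.
  const count = Math.max(1, Math.ceil(sorted.length * fraction));
  return sorted.slice(0, count);
}
```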


8. Disaster Recovery

8.1 Three-Tier Backup Strategy

| Tier | What | How | Frequency | Retention |
| --- | --- | --- | --- | --- |
| 1. Application | Tool databases (18 PG + 2 MySQL + 1 Mongo) | backups.sh (existing) | Daily 2:00 AM | 7 daily + 4 weekly |
| 2. VPS Snapshot | Full VPS image | Netcup SCP API | Daily (staggered) | 3 rolling |
| 3. Hub Database | Central PostgreSQL | pg_dump + rclone | Daily 3:00 AM | 14 daily + 8 weekly + 3 monthly |

8.2 Recovery Scenarios

| Scenario | Recovery Method | RTO | RPO |
| --- | --- | --- | --- |
| Single tool database corrupted | Restore from application backup | 15 minutes | 24 hours |
| VPS disk failure | Restore from Netcup snapshot | 30 minutes | 24 hours |
| VPS completely lost | Re-provision from scratch + restore snapshot | 2 hours | 24 hours |
| Hub database corrupted | Restore from pg_dump backup | 30 minutes | 24 hours |
| Hub server lost | Re-deploy on new server + restore DB | 2 hours | 24 hours |
| Regional outage | Failover to other region (manual) | 4 hours | 24 hours |

8.3 Backup Monitoring

The Safety Wrapper's cron job reads backup-status.json daily at 6:00 AM:

{
  "last_run": "2026-02-27T02:15:00Z",
  "duration_seconds": 342,
  "databases": {
    "chatwoot": { "status": "success", "size_mb": 45 },
    "ghost": { "status": "success", "size_mb": 12 },
    "nextcloud": { "status": "failed", "error": "connection refused" }
  },
  "remote_sync": { "status": "success", "uploaded_mb": 230 }
}

Alerts:

  • Medium severity: Any database backup failed
  • Hard severity: All backups failed, or backup-status.json is stale (>48 hours)

9. Monitoring & Alerting

9.1 Tenant Health Monitoring

The Hub monitors all tenants via Safety Wrapper heartbeats:

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Heartbeat freshness | Safety Wrapper heartbeat | >3 missed intervals (3 min) |
| Disk usage | Heartbeat payload | >85% |
| Memory usage | Heartbeat payload | >90% |
| Token pool usage | Billing period | 80%, 90%, 100% |
| Backup status | Backup report | Any failure |
| Container health | Portainer integration | Crash/OOM events |
| SSL cert expiry | Cert check cron | <14 days |
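The heartbeat-freshness threshold can be sketched as follows (a 60-second heartbeat interval is assumed from the "3 missed intervals (3 min)" wording; the function name is illustrative):

```typescript
// Sketch: decide whether a tenant's heartbeat is stale enough to alert.
const HEARTBEAT_INTERVAL_S = 60; // assumed: 3 missed intervals ~= 3 minutes
const MISSED_LIMIT = 3;          // alert threshold from the table above

function heartbeatAlert(lastSeen: Date, now: Date): boolean {
  const missed =
    (now.getTime() - lastSeen.getTime()) / 1000 / HEARTBEAT_INTERVAL_S;
  return missed > MISSED_LIMIT;
}
```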

9.2 Alert Routing

| Severity | Customer Notification | Staff Notification |
| --- | --- | --- |
| Soft | None (auto-recovers) | Dashboard indicator |
| Medium | Push notification (after 3 failures) | Email + dashboard |
| Hard | Push notification (immediate) | Email + Slack/webhook + dashboard |

9.3 Hub Self-Monitoring

- PostgreSQL connection pool usage
- API response times (p50, p95, p99)
- Failed provisioning jobs
- Stripe webhook processing latency
- Cron job execution status
- Disk space on Hub server

10. SSL & Domain Management

10.1 Tenant SSL

Each tenant gets wildcard SSL via Let's Encrypt + certbot:

# Provisioner Step 4 (existing)
# Wildcard certificates require a DNS-01 challenge; the nginx plugin cannot
# issue them. Use the Cloudflare DNS plugin (matches the DNS automation in 10.3):
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.${DOMAIN}" -d "${DOMAIN}" \
  --non-interactive --agree-tos -m "ssl@letsbe.biz"

Auto-renewal via cron (certbot default: every 12 hours, renews when <30 days to expiry).

10.2 Subdomain Layout

Each tool gets a subdomain on the customer's domain:

files.example.com      → Nextcloud
chat.example.com       → Chatwoot
blog.example.com       → Ghost
cal.example.com        → Cal.com
mail.example.com       → Stalwart Mail
erp.example.com        → Odoo
wiki.example.com       → BookStack (if installed)
...
status.example.com     → Uptime Kuma
portainer.example.com  → Portainer (admin only)

10.3 DNS Automation

New capability — auto-create DNS records at provisioning time:

// Hub: src/lib/services/dns-automation-service.ts

interface DnsAutomationService {
  createRecords(params: {
    domain: string;
    ip: string;
    tools: string[];
    provider: 'cloudflare';
    zone_id: string;
  }): Promise<{ records_created: number; errors: string[] }>;
}

// Creates A records for:
// 1. Root domain → VPS IP
// 2. Wildcard *.domain → VPS IP (covers all tool subdomains)
// Or individual A records per tool subdomain if wildcard not supported
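A hedged sketch of `createRecords` against the Cloudflare v4 API (the `apiToken` parameter, helper name, and trimmed error handling are illustrative; the real service would live in `dns-automation-service.ts` as specified above):

```typescript
// Helper: root + wildcard covers every tool subdomain with one pair of A records.
function recordNames(domain: string): string[] {
  return [domain, `*.${domain}`];
}

// Sketch: create the A records via Cloudflare's DNS records endpoint.
async function createRecords(params: {
  domain: string;
  ip: string;
  zone_id: string;
  apiToken: string; // assumption: token passed in, not read from env
}): Promise<{ records_created: number; errors: string[] }> {
  let created = 0;
  const errors: string[] = [];
  for (const name of recordNames(params.domain)) {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/zones/${params.zone_id}/dns_records`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${params.apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          type: "A",
          name,
          content: params.ip,
          ttl: 300,
          proxied: false, // tenant nginx terminates TLS itself
        }),
      },
    );
    if (res.ok) created += 1;
    else errors.push(`${name}: HTTP ${res.status}`);
  }
  return { records_created: created, errors };
}
```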

End of Document — 03 Deployment Strategy