
LetsBe Biz — Deployment Strategy

Date: February 27, 2026
Team: Claude Opus 4.6 Architecture Team
Document: 03 of 09
Status: Proposal — Competing with independent team


Table of Contents

  1. Deployment Topology
  2. Central Platform Deployment
  3. Tenant Server Deployment
  4. Container Strategy
  5. Resource Budgets
  6. Provider Strategy
  7. Update & Rollout Strategy
  8. Disaster Recovery
  9. Monitoring & Alerting
  10. SSL & Domain Management

1. Deployment Topology

                    ┌─────────────────────────────────────┐
                    │         CENTRAL PLATFORM             │
                    │                                      │
                    │  ┌──────────┐  ┌──────────────────┐  │
                    │  │   Hub    │  │   PostgreSQL 16   │  │
                    │  │  (Next.js│  │   (hub database)  │  │
                    │  │  port    │  └──────────────────┘  │
                    │  │  3847)   │                        │
                    │  └──────────┘  ┌──────────────────┐  │
                    │                │  Website (Vercel  │  │
                    │  ┌──────────┐  │  or self-hosted)  │  │
                    │  │ Gitea CI │  └──────────────────┘  │
                    │  └──────────┘                        │
                    └──────────┬──────────────────────────┘
                               │ HTTPS
              ┌────────────────┼────────────────┐
              │                │                │
    ┌─────────▼──────┐ ┌──────▼────────┐ ┌─────▼────────────┐
    │ Tenant VPS #1  │ │ Tenant VPS #2 │ │ Tenant VPS #N    │
    │ (customer-a)   │ │ (customer-b)  │ │ (customer-n)     │
    │                │ │               │ │                  │
    │ OpenClaw       │ │ OpenClaw      │ │ OpenClaw         │
    │ Safety Wrapper │ │ Safety Wrapper│ │ Safety Wrapper   │
    │ Secrets Proxy  │ │ Secrets Proxy │ │ Secrets Proxy    │
    │ nginx          │ │ nginx         │ │ nginx            │
    │ 25+ tool       │ │ 25+ tool      │ │ 25+ tool         │
    │ containers     │ │ containers    │ │ containers       │
    └────────────────┘ └───────────────┘ └──────────────────┘

1.1 Key Topology Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Hub hosting | Dedicated Netcup RS G12 (EU) + mirror (US) | Low latency to tenants, cost-effective |
| Website hosting | Vercel (CDN) or static export on Hub server | CDN for global reach, simple deployment |
| Tenant isolation | One VPS per customer, no shared infrastructure | Privacy guarantee, blast radius containment |
| Region support | EU (Nuremberg) + US (Manassas) | Customer-selectable, same RS G12 hardware |
| Provider strategy | Netcup primary (contracts) + Hetzner overflow (hourly) | Cost optimization + burst capacity |

2. Central Platform Deployment

2.1 Hub Server

# deploy/hub/docker-compose.yml
version: '3.8'
services:
  db:
    image: postgres:16-alpine
    container_name: letsbe-hub-db
    restart: unless-stopped
    volumes:
      - hub-db-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: letsbe_hub
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  hub:
    image: code.letsbe.solutions/letsbe/hub:${HUB_VERSION}
    container_name: letsbe-hub
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "127.0.0.1:3847:3000"
    volumes:
      - hub-jobs:/app/jobs
      - hub-logs:/app/logs
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@db:5432/letsbe_hub
      NEXTAUTH_URL: ${HUB_URL}
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
      STRIPE_WEBHOOK_SECRET: ${STRIPE_WEBHOOK_SECRET}
      # ... (see existing config)

  # Provisioner runner (spawned on demand by Hub)
  # Not a persistent service — Hub spawns Docker containers per job

volumes:
  hub-db-data:
  hub-jobs:
  hub-logs:

2.2 Hub nginx Configuration

# deploy/hub/nginx/hub.conf
# NOTE: limit_req_zone is only valid in the http context. Place these in an
# http-level include (e.g. /etc/nginx/conf.d/ratelimit.conf), not inside server {}:
limit_req_zone $binary_remote_addr zone=public_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=tenant_api:10m rate=30r/s;

server {
    listen 443 ssl http2;
    server_name hub.letsbe.biz;

    ssl_certificate     /etc/letsencrypt/live/hub.letsbe.biz/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hub.letsbe.biz/privkey.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    # Public API rate limiting
    location /api/v1/public/ {
        limit_req zone=public_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Tenant API (Safety Wrapper calls) rate limiting
    location /api/v1/tenant/ {
        limit_req zone=tenant_api burst=50 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # SSE for provisioning logs and chat relay
    location /api/v1/admin/orders/ {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
    }

    # WebSocket for real-time chat relay
    location /api/v1/customer/ws {
        proxy_pass http://127.0.0.1:3847;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Default
    location / {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

2.3 Hub Database Backup

#!/bin/bash
# deploy/hub/backup.sh — runs daily at 3:00 AM
set -euo pipefail

BACKUP_DIR="/opt/letsbe/hub-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# PostgreSQL dump (pipefail ensures a failed pg_dump is not masked by gzip)
docker exec letsbe-hub-db pg_dump -U "${DB_USER}" letsbe_hub \
  | gzip > "${BACKUP_DIR}/hub_${DATE}.sql.gz"

# Rotate: keep 14 daily, 8 weekly, 3 monthly
find "${BACKUP_DIR}" -name "hub_*.sql.gz" -mtime +14 -delete
# Weekly: kept by separate cron moving to weekly/
# Monthly: kept by separate cron moving to monthly/

# Upload to off-site storage (S3/Backblaze)
rclone copy "${BACKUP_DIR}/hub_${DATE}.sql.gz" remote:letsbe-hub-backups/daily/

3. Tenant Server Deployment

3.1 Provisioning Flow

Hub receives order (status: PAYMENT_CONFIRMED)
  │
  ▼
Automation worker: PAYMENT_CONFIRMED → AWAITING_SERVER
  │
  ▼
Assign Netcup server from pre-provisioned pool
  (or spin up Hetzner Cloud if pool empty)
  │
  ▼
AWAITING_SERVER → SERVER_READY
  │
  ▼
Create DNS records via Cloudflare API (NEW — was manual)
  │
  ▼
SERVER_READY → DNS_PENDING → DNS_READY
  │
  ▼
Spawn Provisioner Docker container with job config
  │
  ▼
Provisioner SSHs into VPS, runs 10-step pipeline:
  Step 1-8:  System setup, Docker, nginx, firewall, SSH hardening
  Step 9:    Deploy tool stacks (28+ Docker Compose stacks)
  Step 10:   Deploy LetsBe AI stack (OpenClaw + Safety Wrapper + Secrets Proxy)
  │
  ▼
Safety Wrapper registers with Hub → receives API key
  │
  ▼
PROVISIONING → FULFILLED
  │
  ▼
Customer receives welcome email + app download links
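The status transitions above form a linear state machine. A minimal sketch (the type and function names are illustrative assumptions, not the Hub codebase; the transition order is read off the diagram):

```typescript
// Hypothetical sketch of the order state machine implied by the flow above.
type OrderStatus =
  | "PAYMENT_CONFIRMED" | "AWAITING_SERVER" | "SERVER_READY"
  | "DNS_PENDING" | "DNS_READY" | "PROVISIONING" | "FULFILLED";

const NEXT: Record<OrderStatus, OrderStatus | null> = {
  PAYMENT_CONFIRMED: "AWAITING_SERVER",
  AWAITING_SERVER: "SERVER_READY",
  SERVER_READY: "DNS_PENDING",
  DNS_PENDING: "DNS_READY",
  DNS_READY: "PROVISIONING",
  PROVISIONING: "FULFILLED",
  FULFILLED: null, // terminal state
};

// The automation worker may only advance one step at a time; illegal
// jumps (or advancing a terminal order) throw.
function advance(status: OrderStatus): OrderStatus {
  const next = NEXT[status];
  if (next === null) throw new Error(`Order already terminal: ${status}`);
  return next;
}
```

Encoding the transitions as a map makes it cheap to reject out-of-order updates from retried automation jobs.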

3.2 Pre-Provisioned Server Pool

To minimize customer wait time (target: <20 minutes from payment to AI ready):

| Region | Pool Size | Server Tier | Status |
| --- | --- | --- | --- |
| EU (Nuremberg) | 3-5 servers | Build (RS 2000 G12) | Freshly installed Debian 12, Docker pre-installed |
| US (Manassas) | 2-3 servers | Build (RS 2000 G12) | Same |

Pool is replenished automatically when it drops below minimum. Netcup servers are on 12-month contracts — pre-provisioning is a cost commitment.
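The replenishment check can be sketched as follows (the minimums come from the table above; the function name is an assumption):

```typescript
// Sketch: how many servers to order when the pool dips below its minimum.
// Minimum pool sizes per region, from the pool table above.
const POOL_MIN: Record<"eu" | "us", number> = { eu: 3, us: 2 };

function replenishCount(region: "eu" | "us", currentPool: number): number {
  const min = POOL_MIN[region];
  // Order only the shortfall; contracted servers are a cost commitment.
  return currentPool < min ? min - currentPool : 0;
}
```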

3.3 Tenant Container Layout

Tenant VPS (e.g., Build tier: 8c/16GB/512GB NVMe)
│
├── nginx (port 80, 443)                         ~64MB
├── letsbe-openclaw (port 18789, host network)    ~384MB + Chromium
├── letsbe-safety-wrapper (port 8200)             ~128MB
├── letsbe-secrets-proxy (port 8100)              ~64MB
│
├── TOOL STACKS (Docker Compose per tool):
│   ├── nextcloud + postgres (port 3023)          ~768MB
│   ├── chatwoot + postgres + redis (port 3019)   ~1024MB
│   ├── ghost + mysql (port 3025)                 ~384MB
│   ├── calcom + postgres (port 3044)             ~384MB
│   ├── stalwart-mail (port 3011)                 ~256MB
│   ├── odoo + postgres (port 3035)               ~1280MB
│   ├── keycloak + postgres (port 3043)           ~512MB
│   ├── listmonk + postgres (port 3026)           ~256MB
│   ├── nocodb (port 3037)                        ~256MB
│   ├── umami + postgres (port 3029)              ~256MB
│   ├── uptime-kuma (port 3033)                   ~128MB
│   ├── portainer (port 9443)                     ~128MB
│   ├── activepieces (port 3040)                  ~384MB
│   ├── ... (remaining tools)
│   └── certbot                                   ~16MB
│
└── TOTAL: varies by tier and selected tools

4. Container Strategy

4.1 Image Registry

All custom images hosted on Gitea Container Registry:

code.letsbe.solutions/letsbe/hub:latest
code.letsbe.solutions/letsbe/openclaw:latest
code.letsbe.solutions/letsbe/safety-wrapper:latest
code.letsbe.solutions/letsbe/secrets-proxy:latest
code.letsbe.solutions/letsbe/provisioner:latest
code.letsbe.solutions/letsbe/demo:latest

4.2 Image Build Strategy

# packages/safety-wrapper/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8200
CMD ["node", "dist/server.js"]

# packages/secrets-proxy/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build

FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8100
CMD ["node", "dist/server.js"]

4.3 OpenClaw Custom Image

# packages/openclaw-image/Dockerfile
FROM openclaw/openclaw:2026.2.6-3

# Install CLI binaries for tool access
RUN apk add --no-cache curl jq

# Install gog (Google CLI) and himalaya (IMAP CLI)
COPY bin/gog /usr/local/bin/gog
COPY bin/himalaya /usr/local/bin/himalaya
RUN chmod +x /usr/local/bin/gog /usr/local/bin/himalaya

# Pre-create directory structure (chown so the non-root user can write)
RUN mkdir -p /home/openclaw/.openclaw/agents \
             /home/openclaw/.openclaw/skills \
             /home/openclaw/.openclaw/references \
             /home/openclaw/.openclaw/data \
             /home/openclaw/.openclaw/shared-memory \
    && chown -R openclaw:openclaw /home/openclaw/.openclaw

USER openclaw

4.4 Container Restart Policies

| Container | Restart Policy | Rationale |
| --- | --- | --- |
| All LetsBe containers | unless-stopped | Auto-recover from crashes; manual stop stays stopped |
| Tool containers | unless-stopped | Same — tools should self-heal |
| nginx | unless-stopped | Critical path — must auto-restart |

5. Resource Budgets

5.1 Per-Tier Budget

| Component | Lite (8GB) | Build (16GB) | Scale (32GB) | Enterprise (64GB) |
| --- | --- | --- | --- | --- |
| LetsBe overhead | 640MB | 640MB | 640MB | 640MB |
| Tool headroom | 7,360MB | 15,360MB | 31,360MB | 63,360MB |
| Recommended tools | 5-8 | 10-15 | 15-25 | 25-30+ |
| CPU cores | 4 | 8 | 12 | 16 |
| NVMe storage | 256GB | 512GB | 1TB | 2TB |

5.2 LetsBe Overhead Breakdown

| Process | RAM | CPU | Notes |
| --- | --- | --- | --- |
| OpenClaw Gateway | ~256MB | 1.0 core | Node.js 22 + agent state |
| Chromium (browser tool) | ~128MB | 0.5 core | Managed by OpenClaw, shared across agents |
| Safety Wrapper | ~128MB | 0.5 core | Tool execution + Hub communication |
| Secrets Proxy | ~64MB | 0.25 core | Lightweight HTTP proxy |
| nginx | ~64MB | 0.25 core | Reverse proxy for all tool subdomains |
| Total | ~640MB | ~2.5 cores | |

5.3 Tool Resource Registry

Used by the resource calculator in the website and by the IT Agent for dynamic tool installation:

{
  "nextcloud": { "ram_mb": 512, "disk_gb": 10, "requires_db": "postgres" },
  "chatwoot": { "ram_mb": 768, "disk_gb": 5, "requires_db": "postgres", "requires_redis": true },
  "ghost": { "ram_mb": 256, "disk_gb": 3, "requires_db": "mysql" },
  "odoo": { "ram_mb": 1024, "disk_gb": 10, "requires_db": "postgres" },
  "calcom": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "stalwart": { "ram_mb": 256, "disk_gb": 5 },
  "keycloak": { "ram_mb": 512, "disk_gb": 2, "requires_db": "postgres" },
  "listmonk": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "nocodb": { "ram_mb": 256, "disk_gb": 2 },
  "umami": { "ram_mb": 192, "disk_gb": 1, "requires_db": "postgres" },
  "uptime_kuma": { "ram_mb": 128, "disk_gb": 1 },
  "portainer": { "ram_mb": 128, "disk_gb": 1 },
  "activepieces": { "ram_mb": 384, "disk_gb": 3, "requires_db": "postgres" }
}
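The resource calculator over this registry reduces to a budget check. A sketch (registry subset inlined; the 640 MB platform overhead is from section 5.2; `fitsOnTier` is an illustrative name, not the website's actual API):

```typescript
// Illustrative fit check of a tool selection against a tier's RAM budget.
interface ToolSpec { ram_mb: number; disk_gb: number; requires_db?: string }

// Subset of the registry above, inlined for the example.
const REGISTRY: Record<string, ToolSpec> = {
  nextcloud: { ram_mb: 512, disk_gb: 10, requires_db: "postgres" },
  ghost: { ram_mb: 256, disk_gb: 3, requires_db: "mysql" },
  odoo: { ram_mb: 1024, disk_gb: 10, requires_db: "postgres" },
  uptime_kuma: { ram_mb: 128, disk_gb: 1 },
};

const LETSBE_OVERHEAD_MB = 640; // OpenClaw + wrapper + proxy + nginx (5.2)

function fitsOnTier(tools: string[], tierRamMb: number): boolean {
  const toolRam = tools.reduce((sum, t) => {
    const spec = REGISTRY[t];
    if (!spec) throw new Error(`Unknown tool: ${t}`);
    return sum + spec.ram_mb;
  }, 0);
  return LETSBE_OVERHEAD_MB + toolRam <= tierRamMb;
}
```

The IT Agent can run the same check before installing a tool dynamically, rejecting installs that would exceed the tier's headroom.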

6. Provider Strategy

6.1 Primary: Netcup RS G12

| Plan | Specs | Monthly | Contract | Use Case |
| --- | --- | --- | --- | --- |
| RS 1000 G12 | 4c/8GB/256GB | ~€8.50 | 12-month | Lite tier |
| RS 2000 G12 | 8c/16GB/512GB | ~€14.50 | 12-month | Build tier (default) |
| RS 4000 G12 | 12c/32GB/1TB | ~€26.00 | 12-month | Scale tier |
| RS 8000 G12 | 16c/64GB/2TB | ~€48.00 | 12-month | Enterprise tier |

Both EU (Nuremberg) and US (Manassas) datacenters available.

Pre-provisioned pool: 5 Build-tier servers in EU, 3 in US. Replenished weekly.

6.2 Overflow: Hetzner Cloud

For burst capacity when Netcup pool is depleted:

| Type | Specs | Hourly | Monthly Cap | Notes |
| --- | --- | --- | --- | --- |
| CPX21 | 3c/4GB/80GB | €0.0113 | ~€8.24 | Lite equivalent |
| CPX31 | 4c/8GB/160GB | €0.0214 | ~€15.59 | Build equivalent |
| CPX41 | 8c/16GB/240GB | €0.0399 | ~€29.09 | Scale equivalent |
| CPX51 | 16c/32GB/360GB | €0.0798 | ~€58.15 | Enterprise equivalent |

Trigger: the Netcup pool for a given tier + region is empty AND the order is in AUTO mode.
Migration: the customer is migrated to a Netcup RS when the next contract cycle opens (monthly check).

6.3 Provider Abstraction

The Provisioner is provider-agnostic — it only needs SSH access to a Debian 12 VPS. Provider-specific logic lives in the Hub:

interface ServerProvider {
  name: 'netcup' | 'hetzner';
  allocateServer(tier: ServerTier, region: Region): Promise<ServerAllocation>;
  deallocateServer(serverId: string): Promise<void>;
  getServerStatus(serverId: string): Promise<ServerStatus>;
  createSnapshot(serverId: string): Promise<SnapshotResult>;
}
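The overflow trigger above can be expressed as a small decision function (name and return values are illustrative, not the Hub's actual code):

```typescript
// Sketch: which provider serves a new order, given the current pool state.
type AllocationTarget = "netcup" | "hetzner" | "wait_for_pool";

function chooseProvider(netcupPoolSize: number, autoMode: boolean): AllocationTarget {
  // Pre-provisioned Netcup servers are the fastest and cheapest path.
  if (netcupPoolSize > 0) return "netcup";
  // Pool empty: burst to hourly Hetzner only when the order runs in AUTO mode;
  // otherwise hold the order until the pool is replenished.
  return autoMode ? "hetzner" : "wait_for_pool";
}
```

The Hub would then call `allocateServer` on whichever `ServerProvider` implementation the decision selects.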

7. Update & Rollout Strategy

7.1 Central Platform Updates

| Component | Deployment | Rollback |
| --- | --- | --- |
| Hub | Docker image pull + restart | Previous image tag |
| Website | Vercel deploy (instant) or Docker pull | Previous deployment |
| Hub Database | Prisma migrate deploy (forward-only) | Reverse migration script |

7.2 Tenant Server Updates

Tenant updates are pushed from the Hub, NOT pulled by tenants:

1. Hub builds new Safety Wrapper / Secrets Proxy image
2. Hub creates update task for each tenant
3. Safety Wrapper receives update command via heartbeat
4. Safety Wrapper downloads new image (from Gitea registry)
5. Safety Wrapper performs rolling restart:
   a. Pull new image
   b. Stop old container
   c. Start new container
   d. Health check
   e. Report success/failure to Hub
6. If health check fails: rollback to previous image
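Steps 5a-5e and the rollback rule can be sketched as a command plan plus a post-update decision (container name, image reference, and health URL are illustrative, not the real update agent):

```typescript
// Sketch: the rolling-restart steps above as an ordered command plan.
function rollingUpdatePlan(container: string, image: string, tag: string): string[] {
  return [
    `docker pull ${image}:${tag}`,                        // a. pull new image
    `docker stop ${container} && docker rm ${container}`, // b. stop old container
    `docker run -d --name ${container} ${image}:${tag}`,  // c. start new container
    `curl -fsS http://127.0.0.1:8200/health`,             // d. health check
  ];
}

// e. report result to the Hub; a failed health check triggers a rollback
// to the previously running image tag.
function postUpdateAction(healthOk: boolean): "report_success" | "rollback_previous_tag" {
  return healthOk ? "report_success" : "rollback_previous_tag";
}
```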

7.3 OpenClaw Updates

OpenClaw is pinned to a tested release tag. Update cadence:

  1. Monthly review of upstream changelog
  2. Test new release on staging VPS (dedicated test tenant)
  3. If no issues after 48 hours: roll out to 5% of tenants (canary)
  4. Monitor for 24 hours
  5. Roll out to remaining tenants
  6. Rollback available: previous Docker image tag

7.4 Canary Deployment

Stage 1: Staging VPS (internal testing)        — 48 hours
Stage 2: 5% of tenants (canary group)          — 24 hours
Stage 3: 25% of tenants                        — 12 hours
Stage 4: 100% of tenants                       — complete

Canary selection: newest tenants first (less established, lower blast radius).
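The newest-first canary selection can be sketched as a sort-and-slice (the `Tenant` shape and function name are assumptions):

```typescript
// Sketch: pick the canary group for a rollout stage, newest tenants first.
interface Tenant { id: string; createdAt: Date }

function canaryGroup(tenants: Tenant[], fraction: number): Tenant[] {
  // Newest first: less established tenants mean a lower blast radius.
  const sorted = [...tenants].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  // Always include at least one tenant so small fleets still get a canary.
  const count = Math.max(1, Math.ceil(sorted.length * fraction));
  return sorted.slice(0, count);
}
```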


8. Disaster Recovery

8.1 Three-Tier Backup Strategy

| Tier | What | How | Frequency | Retention |
| --- | --- | --- | --- | --- |
| 1. Application | Tool databases (18 PG + 2 MySQL + 1 Mongo) | backups.sh (existing) | Daily 2:00 AM | 7 daily + 4 weekly |
| 2. VPS Snapshot | Full VPS image | Netcup SCP API | Daily (staggered) | 3 rolling |
| 3. Hub Database | Central PostgreSQL | pg_dump + rclone | Daily 3:00 AM | 14 daily + 8 weekly + 3 monthly |

8.2 Recovery Scenarios

| Scenario | Recovery Method | RTO | RPO |
| --- | --- | --- | --- |
| Single tool database corrupted | Restore from application backup | 15 minutes | 24 hours |
| VPS disk failure | Restore from Netcup snapshot | 30 minutes | 24 hours |
| VPS completely lost | Re-provision from scratch + restore snapshot | 2 hours | 24 hours |
| Hub database corrupted | Restore from pg_dump backup | 30 minutes | 24 hours |
| Hub server lost | Re-deploy on new server + restore DB | 2 hours | 24 hours |
| Regional outage | Failover to other region (manual) | 4 hours | 24 hours |

8.3 Backup Monitoring

The Safety Wrapper's cron job reads backup-status.json daily at 6:00 AM:

{
  "last_run": "2026-02-27T02:15:00Z",
  "duration_seconds": 342,
  "databases": {
    "chatwoot": { "status": "success", "size_mb": 45 },
    "ghost": { "status": "success", "size_mb": 12 },
    "nextcloud": { "status": "failed", "error": "connection refused" }
  },
  "remote_sync": { "status": "success", "uploaded_mb": 230 }
}

Alerts:

  • Medium severity: Any database backup failed
  • Hard severity: All backups failed, or backup-status.json is stale (>48 hours)

9. Monitoring & Alerting

9.1 Tenant Health Monitoring

The Hub monitors all tenants via Safety Wrapper heartbeats:

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Heartbeat freshness | Safety Wrapper heartbeat | >3 missed intervals (3 min) |
| Disk usage | Heartbeat payload | >85% |
| Memory usage | Heartbeat payload | >90% |
| Token pool usage | Billing period | 80%, 90%, 100% |
| Backup status | Backup report | Any failure |
| Container health | Portainer integration | Crash/OOM events |
| SSL cert expiry | Cert check cron | <14 days |
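The heartbeat-freshness threshold can be sketched as follows (a 60-second heartbeat interval is assumed from the "3 missed intervals (3 min)" wording; the function name is illustrative):

```typescript
// Sketch: decide whether a tenant's heartbeat is stale enough to alert.
const HEARTBEAT_INTERVAL_S = 60; // assumed: 3 missed intervals ~= 3 minutes
const MISSED_LIMIT = 3;          // alert threshold from the table above

function heartbeatAlert(lastSeen: Date, now: Date): boolean {
  const missed =
    (now.getTime() - lastSeen.getTime()) / 1000 / HEARTBEAT_INTERVAL_S;
  return missed > MISSED_LIMIT;
}
```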

9.2 Alert Routing

| Severity | Customer Notification | Staff Notification |
| --- | --- | --- |
| Soft | None (auto-recovers) | Dashboard indicator |
| Medium | Push notification (after 3 failures) | Email + dashboard |
| Hard | Push notification (immediate) | Email + Slack/webhook + dashboard |

9.3 Hub Self-Monitoring

- PostgreSQL connection pool usage
- API response times (p50, p95, p99)
- Failed provisioning jobs
- Stripe webhook processing latency
- Cron job execution status
- Disk space on Hub server

10. SSL & Domain Management

10.1 Tenant SSL

Each tenant gets wildcard SSL via Let's Encrypt + certbot:

# Provisioner Step 4 (existing)
# Wildcard certificates require a DNS-01 challenge; the nginx plugin cannot
# issue them. Use the Cloudflare DNS plugin (matches the DNS automation in 10.3):
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.${DOMAIN}" -d "${DOMAIN}" \
  --non-interactive --agree-tos -m "ssl@letsbe.biz"

Auto-renewal via cron (certbot default: every 12 hours, renews when <30 days to expiry).

10.2 Subdomain Layout

Each tool gets a subdomain on the customer's domain:

files.example.com      → Nextcloud
chat.example.com       → Chatwoot
blog.example.com       → Ghost
cal.example.com        → Cal.com
mail.example.com       → Stalwart Mail
erp.example.com        → Odoo
wiki.example.com       → BookStack (if installed)
...
status.example.com     → Uptime Kuma
portainer.example.com  → Portainer (admin only)

10.3 DNS Automation

New capability — auto-create DNS records at provisioning time:

// Hub: src/lib/services/dns-automation-service.ts

interface DnsAutomationService {
  createRecords(params: {
    domain: string;
    ip: string;
    tools: string[];
    provider: 'cloudflare';
    zone_id: string;
  }): Promise<{ records_created: number; errors: string[] }>;
}

// Creates A records for:
// 1. Root domain → VPS IP
// 2. Wildcard *.domain → VPS IP (covers all tool subdomains)
// Or individual A records per tool subdomain if wildcard not supported
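A hedged sketch of `createRecords` against the Cloudflare v4 API (the `apiToken` parameter, helper name, and trimmed error handling are illustrative; the real service would live in `dns-automation-service.ts` as specified above):

```typescript
// Helper: root + wildcard covers every tool subdomain with one pair of A records.
function recordNames(domain: string): string[] {
  return [domain, `*.${domain}`];
}

// Sketch: create the A records via Cloudflare's DNS records endpoint.
async function createRecords(params: {
  domain: string;
  ip: string;
  zone_id: string;
  apiToken: string; // assumption: token passed in, not read from env
}): Promise<{ records_created: number; errors: string[] }> {
  let created = 0;
  const errors: string[] = [];
  for (const name of recordNames(params.domain)) {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/zones/${params.zone_id}/dns_records`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${params.apiToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          type: "A",
          name,
          content: params.ip,
          ttl: 300,
          proxied: false, // tenant nginx terminates TLS itself
        }),
      },
    );
    if (res.ok) created += 1;
    else errors.push(`${name}: HTTP ${res.status}`);
  }
  return { records_created: created, errors };
}
```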

End of Document — 03 Deployment Strategy