# LetsBe Biz — Deployment Strategy
**Date:** February 27, 2026
**Team:** Claude Opus 4.6 Architecture Team
**Document:** 03 of 09
**Status:** Proposal — Competing with independent team
---
## Table of Contents
1. [Deployment Topology](#1-deployment-topology)
2. [Central Platform Deployment](#2-central-platform-deployment)
3. [Tenant Server Deployment](#3-tenant-server-deployment)
4. [Container Strategy](#4-container-strategy)
5. [Resource Budgets](#5-resource-budgets)
6. [Provider Strategy](#6-provider-strategy)
7. [Update & Rollout Strategy](#7-update--rollout-strategy)
8. [Disaster Recovery](#8-disaster-recovery)
9. [Monitoring & Alerting](#9-monitoring--alerting)
10. [SSL & Domain Management](#10-ssl--domain-management)
---
## 1. Deployment Topology
```
┌─────────────────────────────────────────┐
│             CENTRAL PLATFORM            │
│                                         │
│  ┌───────────┐   ┌───────────────────┐  │
│  │ Hub       │   │ PostgreSQL 16     │  │
│  │ (Next.js, │   │ (hub database)    │  │
│  │ port 3847)│   └───────────────────┘  │
│  └───────────┘                          │
│                  ┌───────────────────┐  │
│  ┌───────────┐   │ Website (Vercel   │  │
│  │ Gitea CI  │   │ or self-hosted)   │  │
│  └───────────┘   └───────────────────┘  │
└─────────────────────────┬───────────────┘
                          │ HTTPS
          ┌───────────────┼────────────────┐
          │               │                │
┌─────────▼──────┐ ┌──────▼────────┐ ┌─────▼────────────┐
│ Tenant VPS #1  │ │ Tenant VPS #2 │ │ Tenant VPS #N    │
│ (customer-a)   │ │ (customer-b)  │ │ (customer-n)     │
│                │ │               │ │                  │
│ OpenClaw       │ │ OpenClaw      │ │ OpenClaw         │
│ Safety Wrapper │ │ Safety Wrapper│ │ Safety Wrapper   │
│ Secrets Proxy  │ │ Secrets Proxy │ │ Secrets Proxy    │
│ nginx          │ │ nginx         │ │ nginx            │
│ 25+ tool       │ │ 25+ tool      │ │ 25+ tool         │
│ containers     │ │ containers    │ │ containers       │
└────────────────┘ └───────────────┘ └──────────────────┘
```
### 1.1 Key Topology Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Hub hosting | Dedicated Netcup RS G12 (EU) + mirror (US) | Low latency to tenants, cost-effective |
| Website hosting | Vercel (CDN) or static export on Hub server | CDN for global reach, simple deployment |
| Tenant isolation | One VPS per customer, no shared infrastructure | Privacy guarantee, blast radius containment |
| Region support | EU (Nuremberg) + US (Manassas) | Customer-selectable, same RS G12 hardware |
| Provider strategy | Netcup primary (contracts) + Hetzner overflow (hourly) | Cost optimization + burst capacity |
---
## 2. Central Platform Deployment
### 2.1 Hub Server
```yaml
# deploy/hub/docker-compose.yml
version: '3.8'

services:
  db:
    image: postgres:16-alpine
    container_name: letsbe-hub-db
    restart: unless-stopped
    volumes:
      - hub-db-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: letsbe_hub
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  hub:
    image: code.letsbe.solutions/letsbe/hub:${HUB_VERSION}
    container_name: letsbe-hub
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "127.0.0.1:3847:3000"
    volumes:
      - hub-jobs:/app/jobs
      - hub-logs:/app/logs
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@db:5432/letsbe_hub
      NEXTAUTH_URL: ${HUB_URL}
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
      STRIPE_WEBHOOK_SECRET: ${STRIPE_WEBHOOK_SECRET}
      # ... (see existing config)

# Provisioner runner (spawned on demand by Hub)
# Not a persistent service — Hub spawns Docker containers per job

volumes:
  hub-db-data:
  hub-jobs:
  hub-logs:
```
### 2.2 Hub nginx Configuration
```nginx
# deploy/hub/nginx/hub.conf

# Rate-limit zones must be declared at the http context level (e.g. via an
# include from nginx.conf) — nginx rejects limit_req_zone inside a server block.
limit_req_zone $binary_remote_addr zone=public_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=tenant_api:10m rate=30r/s;

server {
    listen 443 ssl http2;
    server_name hub.letsbe.biz;

    ssl_certificate     /etc/letsencrypt/live/hub.letsbe.biz/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hub.letsbe.biz/privkey.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    # Public API rate limiting
    location /api/v1/public/ {
        limit_req zone=public_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Tenant API (Safety Wrapper calls) rate limiting
    location /api/v1/tenant/ {
        limit_req zone=tenant_api burst=50 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # SSE for provisioning logs and chat relay
    location /api/v1/admin/orders/ {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
    }

    # WebSocket for real-time chat relay
    location /api/v1/customer/ws {
        proxy_pass http://127.0.0.1:3847;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Default
    location / {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```
### 2.3 Hub Database Backup
```bash
#!/bin/bash
# deploy/hub/backup.sh — runs daily at 3:00 AM
set -euo pipefail

BACKUP_DIR="/opt/letsbe/hub-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# PostgreSQL dump
docker exec letsbe-hub-db pg_dump -U "${DB_USER}" letsbe_hub \
  | gzip > "${BACKUP_DIR}/hub_${DATE}.sql.gz"

# Rotate: keep 14 daily, 8 weekly, 3 monthly
find "${BACKUP_DIR}" -name "hub_*.sql.gz" -mtime +14 -delete
# Weekly: kept by separate cron moving to weekly/
# Monthly: kept by separate cron moving to monthly/

# Upload to off-site storage (S3/Backblaze)
rclone copy "${BACKUP_DIR}/hub_${DATE}.sql.gz" remote:letsbe-hub-backups/daily/
```
---
## 3. Tenant Server Deployment
### 3.1 Provisioning Flow
```
Hub receives order (status: PAYMENT_CONFIRMED)
Automation worker: PAYMENT_CONFIRMED → AWAITING_SERVER
Assign Netcup server from pre-provisioned pool
(or spin up Hetzner Cloud if pool empty)
AWAITING_SERVER → SERVER_READY
Create DNS records via Cloudflare API (NEW — was manual)
SERVER_READY → DNS_PENDING → DNS_READY
Spawn Provisioner Docker container with job config
Provisioner SSHs into VPS, runs 10-step pipeline:
Step 1-8: System setup, Docker, nginx, firewall, SSH hardening
Step 9: Deploy tool stacks (28+ Docker Compose stacks)
Step 10: Deploy LetsBe AI stack (OpenClaw + Safety Wrapper + Secrets Proxy)
Safety Wrapper registers with Hub → receives API key
PROVISIONING → FULFILLED
Customer receives welcome email + app download links
```
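The status transitions in the flow above form a strict linear pipeline. The sketch below encodes them as a small state machine; the status names come from the flow itself, but the transition map and `advance` helper are illustrative, not the Hub's actual implementation.

```typescript
// Illustrative sketch of the order status pipeline from the flow above.
type OrderStatus =
  | "PAYMENT_CONFIRMED" | "AWAITING_SERVER" | "SERVER_READY"
  | "DNS_PENDING" | "DNS_READY" | "PROVISIONING" | "FULFILLED";

// Each status has exactly one legal successor; FULFILLED is terminal.
const NEXT: Record<OrderStatus, OrderStatus | null> = {
  PAYMENT_CONFIRMED: "AWAITING_SERVER",
  AWAITING_SERVER: "SERVER_READY",
  SERVER_READY: "DNS_PENDING",
  DNS_PENDING: "DNS_READY",
  DNS_READY: "PROVISIONING",
  PROVISIONING: "FULFILLED",
  FULFILLED: null,
};

// Advance an order one step, rejecting transitions out of a terminal state.
function advance(status: OrderStatus): OrderStatus {
  const next = NEXT[status];
  if (next === null) throw new Error(`${status} is terminal`);
  return next;
}
```

Keeping the pipeline as data rather than scattered `if` chains makes it easy for the automation worker to validate that a stored status is never skipped or moved backwards.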
### 3.2 Pre-Provisioned Server Pool
To minimize customer wait time (target: <20 minutes from payment to AI ready):
| Region | Pool Size | Server Tier | Status |
|--------|----------|-------------|--------|
| EU (Nuremberg) | 3-5 servers | Build (RS 2000 G12) | Freshly installed Debian 12, Docker pre-installed |
| US (Manassas) | 2-3 servers | Build (RS 2000 G12) | Same |
The pool is replenished automatically when it drops below the minimum. Because Netcup servers are on 12-month contracts, pre-provisioning is a cost commitment.
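The replenishment rule can be sketched as a pure function, assuming the pool minimums and targets from the table above (EU: min 3, target 5; US: min 2, target 3). The function name and constants are illustrative.

```typescript
// Pool-replenishment sketch using the table's assumed minimum/target sizes.
type Region = "eu" | "us";

const POOL_MIN: Record<Region, number> = { eu: 3, us: 2 };
const POOL_TARGET: Record<Region, number> = { eu: 5, us: 3 };

// Number of Build-tier servers to order so the pool returns to its target.
// Returns 0 while the pool is at or above the minimum.
function serversToOrder(region: Region, currentPoolSize: number): number {
  if (currentPoolSize >= POOL_MIN[region]) return 0;
  return POOL_TARGET[region] - currentPoolSize;
}
```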
### 3.3 Tenant Container Layout
```
Tenant VPS (e.g., Build tier: 8c/16GB/512GB NVMe)
├── nginx (port 80, 443) ~64MB
├── letsbe-openclaw (port 18789, host network) ~384MB + Chromium
├── letsbe-safety-wrapper (port 8200) ~128MB
├── letsbe-secrets-proxy (port 8100) ~64MB
├── TOOL STACKS (Docker Compose per tool):
│ ├── nextcloud + postgres (port 3023) ~768MB
│ ├── chatwoot + postgres + redis (port 3019) ~1024MB
│ ├── ghost + mysql (port 3025) ~384MB
│ ├── calcom + postgres (port 3044) ~384MB
│ ├── stalwart-mail (port 3011) ~256MB
│ ├── odoo + postgres (port 3035) ~1280MB
│ ├── keycloak + postgres (port 3043) ~512MB
│ ├── listmonk + postgres (port 3026) ~256MB
│ ├── nocodb (port 3037) ~256MB
│ ├── umami + postgres (port 3029) ~256MB
│ ├── uptime-kuma (port 3033) ~128MB
│ ├── portainer (port 9443) ~128MB
│ ├── activepieces (port 3040) ~384MB
│ ├── ... (remaining tools)
│ └── certbot ~16MB
└── TOTAL: varies by tier and selected tools
```
---
## 4. Container Strategy
### 4.1 Image Registry
All custom images hosted on Gitea Container Registry:
```
code.letsbe.solutions/letsbe/hub:latest
code.letsbe.solutions/letsbe/openclaw:latest
code.letsbe.solutions/letsbe/safety-wrapper:latest
code.letsbe.solutions/letsbe/secrets-proxy:latest
code.letsbe.solutions/letsbe/provisioner:latest
code.letsbe.solutions/letsbe/demo:latest
```
### 4.2 Image Build Strategy
```dockerfile
# packages/safety-wrapper/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build
FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8200
CMD ["node", "dist/server.js"]
```
```dockerfile
# packages/secrets-proxy/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --production=false
COPY . .
RUN npm run build
FROM node:22-alpine AS runner
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8100
CMD ["node", "dist/server.js"]
```
### 4.3 OpenClaw Custom Image
```dockerfile
# packages/openclaw-image/Dockerfile
FROM openclaw/openclaw:2026.2.6-3
# Install CLI binaries for tool access
RUN apk add --no-cache curl jq
# Install gog (Google CLI) and himalaya (IMAP CLI)
COPY bin/gog /usr/local/bin/gog
COPY bin/himalaya /usr/local/bin/himalaya
RUN chmod +x /usr/local/bin/gog /usr/local/bin/himalaya
# Pre-create directory structure
RUN mkdir -p /home/openclaw/.openclaw/agents \
/home/openclaw/.openclaw/skills \
/home/openclaw/.openclaw/references \
/home/openclaw/.openclaw/data \
/home/openclaw/.openclaw/shared-memory
USER openclaw
```
### 4.4 Container Restart Policies
| Container | Restart Policy | Rationale |
|-----------|---------------|-----------|
| All LetsBe containers | `unless-stopped` | Auto-recover from crashes; manual stop stays stopped |
| Tool containers | `unless-stopped` | Same tools should self-heal |
| nginx | `unless-stopped` | Critical path must auto-restart |
---
## 5. Resource Budgets
### 5.1 Per-Tier Budget
| Component | Lite (8GB) | Build (16GB) | Scale (32GB) | Enterprise (64GB) |
|-----------|-----------|-------------|-------------|------------------|
| LetsBe overhead | 640MB | 640MB | 640MB | 640MB |
| Tool headroom | 7,360MB | 15,360MB | 31,360MB | 63,360MB |
| Recommended tools | 5-8 | 10-15 | 15-25 | 25-30+ |
| CPU cores | 4 | 8 | 12 | 16 |
| NVMe storage | 256GB | 512GB | 1TB | 2TB |
### 5.2 LetsBe Overhead Breakdown
| Process | RAM | CPU | Notes |
|---------|-----|-----|-------|
| OpenClaw Gateway | ~256MB | 1.0 core | Node.js 22 + agent state |
| Chromium (browser tool) | ~128MB | 0.5 core | Managed by OpenClaw, shared across agents |
| Safety Wrapper | ~128MB | 0.5 core | Tool execution + Hub communication |
| Secrets Proxy | ~64MB | 0.25 core | Lightweight HTTP proxy |
| nginx | ~64MB | 0.25 core | Reverse proxy for all tool subdomains |
| **Total** | **~640MB** | **~2.5 cores** | |
### 5.3 Tool Resource Registry
Used by the resource calculator in the website and by the IT Agent for dynamic tool installation:
```json
{
"nextcloud": { "ram_mb": 512, "disk_gb": 10, "requires_db": "postgres" },
"chatwoot": { "ram_mb": 768, "disk_gb": 5, "requires_db": "postgres", "requires_redis": true },
"ghost": { "ram_mb": 256, "disk_gb": 3, "requires_db": "mysql" },
"odoo": { "ram_mb": 1024, "disk_gb": 10, "requires_db": "postgres" },
"calcom": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
"stalwart": { "ram_mb": 256, "disk_gb": 5 },
"keycloak": { "ram_mb": 512, "disk_gb": 2, "requires_db": "postgres" },
"listmonk": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
"nocodb": { "ram_mb": 256, "disk_gb": 2 },
"umami": { "ram_mb": 192, "disk_gb": 1, "requires_db": "postgres" },
"uptime_kuma": { "ram_mb": 128, "disk_gb": 1 },
"portainer": { "ram_mb": 128, "disk_gb": 1 },
"activepieces": { "ram_mb": 384, "disk_gb": 3, "requires_db": "postgres" }
}
```
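The resource calculator mentioned above can be as simple as summing registry RAM entries and comparing against a tier's tool headroom from Section 5.1. The sketch below uses a trimmed copy of the registry; `requiredRamMb` and `fitsTier` are illustrative names, not the website's actual API.

```typescript
// Sketch of the website's resource calculator against the tool registry.
interface ToolSpec { ram_mb: number; disk_gb: number }

// Trimmed excerpt of the registry above.
const REGISTRY: Record<string, ToolSpec> = {
  nextcloud:   { ram_mb: 512, disk_gb: 10 },
  chatwoot:    { ram_mb: 768, disk_gb: 5 },
  ghost:       { ram_mb: 256, disk_gb: 3 },
  uptime_kuma: { ram_mb: 128, disk_gb: 1 },
};

// Tool headroom per tier: total RAM minus the ~640MB LetsBe overhead (§5.1).
const HEADROOM_MB: Record<string, number> = {
  lite: 7360, build: 15360, scale: 31360, enterprise: 63360,
};

function requiredRamMb(tools: string[]): number {
  return tools.reduce((sum, t) => sum + REGISTRY[t].ram_mb, 0);
}

function fitsTier(tools: string[], tier: keyof typeof HEADROOM_MB): boolean {
  return requiredRamMb(tools) <= HEADROOM_MB[tier];
}
```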
---
## 6. Provider Strategy
### 6.1 Primary: Netcup RS G12
| Plan | Specs | Monthly | Contract | Use Case |
|------|-------|---------|----------|----------|
| RS 1000 G12 | 4c/8GB/256GB | ~€8.50 | 12-month | Lite tier |
| RS 2000 G12 | 8c/16GB/512GB | ~€14.50 | 12-month | Build tier (default) |
| RS 4000 G12 | 12c/32GB/1TB | ~€26.00 | 12-month | Scale tier |
| RS 8000 G12 | 16c/64GB/2TB | ~€48.00 | 12-month | Enterprise tier |
**Both EU (Nuremberg) and US (Manassas) datacenters available.**
Pre-provisioned pool: 5 Build-tier servers in EU, 3 in US. Replenished weekly.
### 6.2 Overflow: Hetzner Cloud
For burst capacity when Netcup pool is depleted:
| Type | Specs | Hourly (€) | Monthly Cap | Notes |
|------|-------|--------|-------------|-------|
| CPX21 | 3c/4GB/80GB | 0.0113 | ~€8.24 | Lite equivalent |
| CPX31 | 4c/8GB/160GB | 0.0214 | ~€15.59 | Build equivalent |
| CPX41 | 8c/16GB/240GB | 0.0399 | ~€29.09 | Scale equivalent |
| CPX51 | 16c/32GB/360GB | 0.0798 | ~€58.15 | Enterprise equivalent |
**Trigger:** When Netcup pool for a tier + region is empty AND order in AUTO mode.
**Migration:** Customer migrated to Netcup RS when next contract cycle opens (monthly check).
### 6.3 Provider Abstraction
The Provisioner is provider-agnostic: it only needs SSH access to a Debian 12 VPS. Provider-specific logic lives in the Hub:
```typescript
interface ServerProvider {
name: 'netcup' | 'hetzner';
allocateServer(tier: ServerTier, region: Region): Promise<ServerAllocation>;
deallocateServer(serverId: string): Promise<void>;
getServerStatus(serverId: string): Promise<ServerStatus>;
createSnapshot(serverId: string): Promise<SnapshotResult>;
}
```
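The Netcup-first, Hetzner-overflow policy from Section 6.2 can be expressed as a small pure function layered on top of that interface. The `Pool` lookup and function names below are illustrative; the real Hub would consult its server-pool table.

```typescript
// Sketch of the provider-selection policy: prefer the pre-provisioned
// Netcup pool, fall back to Hetzner only when the pool is empty AND the
// order is in AUTO mode; otherwise wait for a human.
type Provider = "netcup" | "hetzner";
type Tier = "lite" | "build" | "scale" | "enterprise";
type Region = "eu" | "us";

interface Pool { available(tier: Tier, region: Region): number }

function chooseProvider(
  pool: Pool, tier: Tier, region: Region, autoMode: boolean,
): Provider | "wait" {
  if (pool.available(tier, region) > 0) return "netcup";
  return autoMode ? "hetzner" : "wait";
}
```

Injecting the pool lookup keeps the policy itself trivially testable, independent of the Netcup or Hetzner APIs.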
---
## 7. Update & Rollout Strategy
### 7.1 Central Platform Updates
| Component | Deployment | Rollback |
|-----------|-----------|----------|
| Hub | Docker image pull + restart | Previous image tag |
| Website | Vercel deploy (instant) or Docker pull | Previous deployment |
| Hub Database | Prisma migrate deploy (forward-only) | Reverse migration script |
### 7.2 Tenant Server Updates
Tenant updates are pushed from the Hub, NOT pulled by tenants:
```
1. Hub builds new Safety Wrapper / Secrets Proxy image
2. Hub creates update task for each tenant
3. Safety Wrapper receives update command via heartbeat
4. Safety Wrapper downloads new image (from Gitea registry)
5. Safety Wrapper performs rolling restart:
a. Pull new image
b. Stop old container
c. Start new container
d. Health check
e. Report success/failure to Hub
6. If health check fails: rollback to previous image
```
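Steps 5 and 6 above (rolling restart with rollback on a failed health check) can be sketched as follows. The container operations are injected as plain callbacks so the control flow is testable in isolation; the real Safety Wrapper would drive Docker directly, and these names are illustrative.

```typescript
// Sketch of the rolling-restart step: pull, swap, health-check, and roll
// back to the previous image when the health check fails.
interface ContainerOps {
  pull(image: string): void;
  swap(image: string): void;   // stop the running container, start `image`
  healthy(): boolean;          // post-start health check
}

function rollingUpdate(
  ops: ContainerOps, newImage: string, previousImage: string,
): "updated" | "rolled_back" {
  ops.pull(newImage);
  ops.swap(newImage);
  if (ops.healthy()) return "updated";
  ops.swap(previousImage); // health check failed: restore previous image
  return "rolled_back";
}
```

Whatever this function returns is what gets reported back to the Hub in step 5e.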
### 7.3 OpenClaw Updates
OpenClaw is pinned to a tested release tag. Update cadence:
1. Monthly review of upstream changelog
2. Test new release on staging VPS (dedicated test tenant)
3. If no issues after 48 hours: roll out to 10% of tenants (canary)
4. Monitor for 24 hours
5. Roll out to remaining tenants
6. Rollback available: previous Docker image tag
### 7.4 Canary Deployment
```
Stage 1: Staging VPS (internal testing) — 48 hours
Stage 2: 5% of tenants (canary group) — 24 hours
Stage 3: 25% of tenants — 12 hours
Stage 4: 100% of tenants — complete
```
Canary selection: newest tenants first (less established, lower blast radius).
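The "newest tenants first" selection can be sketched as a sort-and-slice over tenant creation times. The `Tenant` shape and function name are illustrative.

```typescript
// Sketch of canary group selection: order tenants newest-first (least
// established, lowest blast radius) and take the stage's percentage.
interface Tenant { id: string; createdAt: number } // epoch milliseconds

function canaryGroup(tenants: Tenant[], percent: number): Tenant[] {
  const count = Math.ceil((percent / 100) * tenants.length);
  return [...tenants]
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, count);
}
```

`Math.ceil` guarantees at least one tenant is selected for any non-zero stage, even with a small fleet.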
---
## 8. Disaster Recovery
### 8.1 Three-Tier Backup Strategy
| Tier | What | How | Frequency | Retention |
|------|------|-----|-----------|-----------|
| 1. Application | Tool databases (18 PG + 2 MySQL + 1 Mongo) | `backups.sh` (existing) | Daily 2:00 AM | 7 daily + 4 weekly |
| 2. VPS Snapshot | Full VPS image | Netcup SCP API | Daily (staggered) | 3 rolling |
| 3. Hub Database | Central PostgreSQL | `pg_dump` + rclone | Daily 3:00 AM | 14 daily + 8 weekly + 3 monthly |
### 8.2 Recovery Scenarios
| Scenario | Recovery Method | RTO | RPO |
|----------|----------------|-----|-----|
| Single tool database corrupted | Restore from application backup | 15 minutes | 24 hours |
| VPS disk failure | Restore from Netcup snapshot | 30 minutes | 24 hours |
| VPS completely lost | Re-provision from scratch + restore snapshot | 2 hours | 24 hours |
| Hub database corrupted | Restore from pg_dump backup | 30 minutes | 24 hours |
| Hub server lost | Re-deploy on new server + restore DB | 2 hours | 24 hours |
| Regional outage | Failover to other region (manual) | 4 hours | 24 hours |
### 8.3 Backup Monitoring
The Safety Wrapper's cron job reads `backup-status.json` daily at 6:00 AM:
```json
{
"last_run": "2026-02-27T02:15:00Z",
"duration_seconds": 342,
"databases": {
"chatwoot": { "status": "success", "size_mb": 45 },
"ghost": { "status": "success", "size_mb": 12 },
"nextcloud": { "status": "failed", "error": "connection refused" }
},
"remote_sync": { "status": "success", "uploaded_mb": 230 }
}
```
Alerts:
- **Medium severity:** Any database backup failed
- **Hard severity:** All backups failed, or `backup-status.json` is stale (>48 hours)
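Those two alert rules applied to `backup-status.json` can be sketched as a single classifier. The severity names come from the alert list above; the function itself is illustrative.

```typescript
// Sketch of backup alert classification: hard if the report is stale
// (>48h) or every database failed, medium if any single database failed.
interface BackupStatus {
  last_run: string; // ISO-8601 timestamp
  databases: Record<string, { status: "success" | "failed" }>;
}

const STALE_MS = 48 * 60 * 60 * 1000;

function backupSeverity(
  report: BackupStatus, now: number,
): "none" | "medium" | "hard" {
  if (now - Date.parse(report.last_run) > STALE_MS) return "hard";
  const statuses = Object.values(report.databases).map((d) => d.status);
  if (statuses.every((s) => s === "failed")) return "hard";
  if (statuses.some((s) => s === "failed")) return "medium";
  return "none";
}
```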
---
## 9. Monitoring & Alerting
### 9.1 Tenant Health Monitoring
The Hub monitors all tenants via Safety Wrapper heartbeats:
| Metric | Source | Alert Threshold |
|--------|--------|----------------|
| Heartbeat freshness | Safety Wrapper heartbeat | >3 missed intervals (3 min) |
| Disk usage | Heartbeat payload | >85% |
| Memory usage | Heartbeat payload | >90% |
| Token pool usage | Billing period | 80%, 90%, 100% |
| Backup status | Backup report | Any failure |
| Container health | Portainer integration | Crash/OOM events |
| SSL cert expiry | Cert check cron | <14 days |
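The heartbeat-derived thresholds in the table can be evaluated with a simple check per payload. This sketch assumes a 60-second heartbeat interval (so "3 missed intervals" means a heartbeat older than 3 minutes); the payload shape and alert names are illustrative.

```typescript
// Sketch of per-heartbeat threshold evaluation from the table above.
interface Heartbeat {
  ageSeconds: number; // time since last heartbeat was received
  diskPct: number;    // disk usage reported in the payload
  memPct: number;     // memory usage reported in the payload
}

function tenantAlerts(hb: Heartbeat): string[] {
  const alerts: string[] = [];
  if (hb.ageSeconds > 3 * 60) alerts.push("heartbeat_stale"); // 3 missed intervals
  if (hb.diskPct > 85) alerts.push("disk_high");
  if (hb.memPct > 90) alerts.push("memory_high");
  return alerts;
}
```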
### 9.2 Alert Routing
| Severity | Customer Notification | Staff Notification |
|----------|----------------------|-------------------|
| Soft | None (auto-recovers) | Dashboard indicator |
| Medium | Push notification (after 3 failures) | Email + dashboard |
| Hard | Push notification (immediate) | Email + Slack/webhook + dashboard |
### 9.3 Hub Self-Monitoring
```
- PostgreSQL connection pool usage
- API response times (p50, p95, p99)
- Failed provisioning jobs
- Stripe webhook processing latency
- Cron job execution status
- Disk space on Hub server
```
---
## 10. SSL & Domain Management
### 10.1 Tenant SSL
Each tenant gets wildcard SSL via Let's Encrypt + certbot:
```bash
# Provisioner Step 4 (existing)
# Note: wildcard certificates require the DNS-01 challenge; the --nginx
# (HTTP-01) plugin cannot issue them, so the Cloudflare DNS plugin is used.
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.${DOMAIN}" -d "${DOMAIN}" \
  --non-interactive --agree-tos -m "ssl@letsbe.biz"
```
Auto-renewal via cron (certbot default: every 12 hours, renews when <30 days to expiry).
### 10.2 Subdomain Layout
Each tool gets a subdomain on the customer's domain:
```
files.example.com → Nextcloud
chat.example.com → Chatwoot
blog.example.com → Ghost
cal.example.com → Cal.com
mail.example.com → Stalwart Mail
erp.example.com → Odoo
wiki.example.com → BookStack (if installed)
...
status.example.com → Uptime Kuma
portainer.example.com → Portainer (admin only)
```
### 10.3 DNS Automation
New capability: auto-create DNS records at provisioning time:
```typescript
// Hub: src/lib/services/dns-automation-service.ts
interface DnsAutomationService {
createRecords(params: {
domain: string;
ip: string;
tools: string[];
provider: 'cloudflare';
zone_id: string;
}): Promise<{ records_created: number; errors: string[] }>;
}
// Creates A records for:
// 1. Root domain → VPS IP
// 2. Wildcard *.domain → VPS IP (covers all tool subdomains)
// Or individual A records per tool subdomain if wildcard not supported
```
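The record set that service would submit can be built with a pure helper: a root A record, plus either a wildcard or one A record per tool subdomain. The record shape mirrors Cloudflare's A-record fields (`type`/`name`/`content`), but the builder itself is an illustrative sketch, not the service's actual code.

```typescript
// Sketch of the DNS record set: root A record plus wildcard, or per-tool
// A records when wildcard DNS is not supported for the zone.
interface ARecord { type: "A"; name: string; content: string }

function buildRecords(
  domain: string, ip: string, tools: string[], wildcardOk: boolean,
): ARecord[] {
  const records: ARecord[] = [{ type: "A", name: domain, content: ip }];
  if (wildcardOk) {
    records.push({ type: "A", name: `*.${domain}`, content: ip });
  } else {
    for (const sub of tools) {
      records.push({ type: "A", name: `${sub}.${domain}`, content: ip });
    }
  }
  return records;
}
```

Separating record construction from the API call makes `errors: string[]` reporting straightforward: each built record maps to one create request against the zone.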
---
*End of Document — 03 Deployment Strategy*