# LetsBe Biz — Deployment Strategy

**Date:** February 27, 2026
**Team:** Claude Opus 4.6 Architecture Team
**Document:** 03 of 09
**Status:** Proposal — Competing with independent team

---

## Table of Contents

1. [Deployment Topology](#1-deployment-topology)
2. [Central Platform Deployment](#2-central-platform-deployment)
3. [Tenant Server Deployment](#3-tenant-server-deployment)
4. [Container Strategy](#4-container-strategy)
5. [Resource Budgets](#5-resource-budgets)
6. [Provider Strategy](#6-provider-strategy)
7. [Update & Rollout Strategy](#7-update--rollout-strategy)
8. [Disaster Recovery](#8-disaster-recovery)
9. [Monitoring & Alerting](#9-monitoring--alerting)
10. [SSL & Domain Management](#10-ssl--domain-management)

---

## 1. Deployment Topology

```
                ┌─────────────────────────────────────┐
                │          CENTRAL PLATFORM           │
                │                                     │
                │ ┌──────────┐  ┌──────────────────┐  │
                │ │   Hub    │  │  PostgreSQL 16   │  │
                │ │ (Next.js │  │  (hub database)  │  │
                │ │   port   │  └──────────────────┘  │
                │ │  3847)   │                        │
                │ └──────────┘  ┌──────────────────┐  │
                │               │ Website (Vercel  │  │
                │ ┌──────────┐  │ or self-hosted)  │  │
                │ │ Gitea CI │  └──────────────────┘  │
                │ └──────────┘                        │
                └──────────┬──────────────────────────┘
                           │ HTTPS
          ┌────────────────┼────────────────┐
          │                │                │
┌─────────▼──────┐ ┌───────▼───────┐ ┌──────▼───────────┐
│ Tenant VPS #1  │ │ Tenant VPS #2 │ │ Tenant VPS #N    │
│ (customer-a)   │ │ (customer-b)  │ │ (customer-n)     │
│                │ │               │ │                  │
│ OpenClaw       │ │ OpenClaw      │ │ OpenClaw         │
│ Safety Wrapper │ │ Safety Wrapper│ │ Safety Wrapper   │
│ Secrets Proxy  │ │ Secrets Proxy │ │ Secrets Proxy    │
│ nginx          │ │ nginx         │ │ nginx            │
│ 25+ tool       │ │ 25+ tool      │ │ 25+ tool         │
│ containers     │ │ containers    │ │ containers       │
└────────────────┘ └───────────────┘ └──────────────────┘
```

### 1.1 Key Topology Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Hub hosting | Dedicated Netcup RS G12 (EU) + mirror (US) | Low latency to tenants, cost-effective |
| Website hosting | Vercel (CDN) or static export on Hub server | CDN for global reach, simple deployment |
| Tenant isolation | One VPS per customer, no shared infrastructure | Privacy guarantee, blast radius containment |
| Region support | EU (Nuremberg) + US (Manassas) | Customer-selectable, same RS G12 hardware |
| Provider strategy | Netcup primary (contracts) + Hetzner overflow (hourly) | Cost optimization + burst capacity |

---

## 2. Central Platform Deployment

### 2.1 Hub Server

```yaml
# deploy/hub/docker-compose.yml
version: '3.8'
services:
  db:
    image: postgres:16-alpine
    container_name: letsbe-hub-db
    restart: unless-stopped
    volumes:
      - hub-db-data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: letsbe_hub
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  hub:
    image: code.letsbe.solutions/letsbe/hub:${HUB_VERSION}
    container_name: letsbe-hub
    restart: unless-stopped
    depends_on:
      db:
        condition: service_healthy
    ports:
      - "127.0.0.1:3847:3000"
    volumes:
      - hub-jobs:/app/jobs
      - hub-logs:/app/logs
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@db:5432/letsbe_hub
      NEXTAUTH_URL: ${HUB_URL}
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
      STRIPE_WEBHOOK_SECRET: ${STRIPE_WEBHOOK_SECRET}
      # ... (see existing config)

  # Provisioner runner (spawned on demand by Hub)
  # Not a persistent service — Hub spawns Docker containers per job

volumes:
  hub-db-data:
  hub-jobs:
  hub-logs:
```

### 2.2 Hub nginx Configuration

```nginx
# deploy/hub/nginx/hub.conf
# Assumes this file is included from the http {} block (e.g. conf.d/):
# limit_req_zone is only valid in the http context, not inside server {}.

# Rate-limit zones, keyed by client IP
limit_req_zone $binary_remote_addr zone=public_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=tenant_api:10m rate=30r/s;

server {
    listen 443 ssl http2;
    server_name hub.letsbe.biz;

    ssl_certificate /etc/letsencrypt/live/hub.letsbe.biz/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/hub.letsbe.biz/privkey.pem;

    # Security headers
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    # Public API rate limiting
    location /api/v1/public/ {
        limit_req zone=public_api burst=20 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Tenant API (Safety Wrapper calls) rate limiting
    location /api/v1/tenant/ {
        limit_req zone=tenant_api burst=50 nodelay;
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # SSE for provisioning logs and chat relay
    location /api/v1/admin/orders/ {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        chunked_transfer_encoding off;
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 3600s;
    }

    # WebSocket for real-time chat relay
    location /api/v1/customer/ws {
        proxy_pass http://127.0.0.1:3847;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Default
    location / {
        proxy_pass http://127.0.0.1:3847;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

### 2.3 Hub Database Backup

```bash
#!/bin/bash
# deploy/hub/backup.sh — runs daily at 3:00 AM
set -euo pipefail

BACKUP_DIR="/opt/letsbe/hub-backups"
DATE=$(date +%Y%m%d_%H%M%S)
# DB_USER is not defined in this script; fail early if the environment lacks it
: "${DB_USER:?DB_USER must be set in the environment}"

mkdir -p "${BACKUP_DIR}"

# PostgreSQL dump (with pipefail, a failed pg_dump aborts instead of leaving an empty archive)
docker exec letsbe-hub-db pg_dump -U "${DB_USER}" letsbe_hub \
  | gzip > "${BACKUP_DIR}/hub_${DATE}.sql.gz"

# Rotate: keep 14 daily, 8 weekly, 3 monthly
# -maxdepth 1 keeps find out of weekly/ and monthly/ if they live under BACKUP_DIR
find "${BACKUP_DIR}" -maxdepth 1 -name "hub_*.sql.gz" -mtime +14 -delete
# Weekly: kept by separate cron moving to weekly/
# Monthly: kept by separate cron moving to monthly/

# Upload to off-site storage (S3/Backblaze)
rclone copy "${BACKUP_DIR}/hub_${DATE}.sql.gz" remote:letsbe-hub-backups/daily/
```

---

## 3. Tenant Server Deployment

### 3.1 Provisioning Flow

```
Hub receives order (status: PAYMENT_CONFIRMED)
    │
    ▼
Automation worker: PAYMENT_CONFIRMED → AWAITING_SERVER
    │
    ▼
Assign Netcup server from pre-provisioned pool
(or spin up Hetzner Cloud if pool empty)
    │
    ▼
AWAITING_SERVER → SERVER_READY
    │
    ▼
Create DNS records via Cloudflare API (NEW — was manual)
    │
    ▼
SERVER_READY → DNS_PENDING → DNS_READY
    │
    ▼
Spawn Provisioner Docker container with job config
    │
    ▼
Provisioner SSHs into VPS, runs 10-step pipeline:
    Step 1-8: System setup, Docker, nginx, firewall, SSH hardening
    Step 9: Deploy tool stacks (28+ Docker Compose stacks)
    Step 10: Deploy LetsBe AI stack (OpenClaw + Safety Wrapper + Secrets Proxy)
    │
    ▼
Safety Wrapper registers with Hub → receives API key
    │
    ▼
PROVISIONING → FULFILLED
    │
    ▼
Customer receives welcome email + app download links
```

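The status changes in the flow above can be read as a small forward-only state machine. The following is a minimal sketch of that reading, not the Hub's actual implementation: the names `NEXT_STATUS`, `canTransition`, `advance`, and `fullPath` are illustrative, and the `DNS_READY → PROVISIONING` step is inferred from the diagram (the Provisioner is spawned once DNS is ready) rather than stated in it.

```typescript
// Order statuses named in the provisioning flow.
type OrderStatus =
  | "PAYMENT_CONFIRMED"
  | "AWAITING_SERVER"
  | "SERVER_READY"
  | "DNS_PENDING"
  | "DNS_READY"
  | "PROVISIONING"
  | "FULFILLED";

// Allowed forward transitions. DNS_READY -> PROVISIONING is an inference
// from the diagram, not an explicitly documented transition.
const NEXT_STATUS: Record<OrderStatus, OrderStatus | null> = {
  PAYMENT_CONFIRMED: "AWAITING_SERVER",
  AWAITING_SERVER: "SERVER_READY",
  SERVER_READY: "DNS_PENDING",
  DNS_PENDING: "DNS_READY",
  DNS_READY: "PROVISIONING",
  PROVISIONING: "FULFILLED",
  FULFILLED: null,
};

// True only for the single documented next step.
function canTransition(from: OrderStatus, to: OrderStatus): boolean {
  return NEXT_STATUS[from] === to;
}

// Move an order one step forward, refusing to leave the terminal status.
function advance(current: OrderStatus): OrderStatus {
  const next = NEXT_STATUS[current];
  if (next === null) {
    throw new Error(`Order is already in terminal status ${current}`);
  }
  return next;
}

// Walk an order from a starting status to FULFILLED, collecting each status.
function fullPath(start: OrderStatus): OrderStatus[] {
  const path: OrderStatus[] = [start];
  let status = start;
  while (NEXT_STATUS[status] !== null) {
    status = advance(status);
    path.push(status);
  }
  return path;
}
```

Keeping the transitions in one table means a skipped step (for example `SERVER_READY` straight to `FULFILLED`) is rejected by `canTransition` rather than silently accepted.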
### 3.2 Pre-Provisioned Server Pool

To minimize customer wait time (target: <20 minutes from payment to AI ready), each region keeps a pool of pre-provisioned servers:

| Region | Pool Size | Server Tier | Status |
|--------|----------|-------------|--------|
| EU (Nuremberg) | 3-5 servers | Build (RS 2000 G12) | Freshly installed Debian 12, Docker pre-installed |
| US (Manassas) | 2-3 servers | Build (RS 2000 G12) | Same |

The pool is replenished automatically when it drops below the minimum size. Netcup servers are on 12-month contracts, so every pooled server is a cost commitment.

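As an illustration of the replenishment rule, the sketch below computes how many servers to order per region. The pool bounds come from the table above; the policy of refilling up to the upper bound once the pool falls below the lower bound is an assumption, as are the names `POOL_TARGETS`, `serversToOrder`, and `replenishmentPlan`.

```typescript
type Region = "EU" | "US";

interface PoolTarget {
  region: Region;
  min: number; // replenish when the pool drops below this
  max: number; // assumed refill target
}

// Bounds taken from the pool table (3-5 in EU, 2-3 in US).
const POOL_TARGETS: PoolTarget[] = [
  { region: "EU", min: 3, max: 5 },
  { region: "US", min: 2, max: 3 },
];

// Number of servers to order for one region. Refilling to `max` is an
// assumed policy; the document only says the pool is refilled below minimum.
function serversToOrder(target: PoolTarget, available: number): number {
  if (available >= target.min) {
    return 0;
  }
  return Math.max(0, target.max - available);
}

// Build an order plan for all regions from the current pool sizes.
function replenishmentPlan(available: Record<Region, number>): Record<Region, number> {
  const plan: Record<Region, number> = { EU: 0, US: 0 };
  for (const target of POOL_TARGETS) {
    plan[target.region] = serversToOrder(target, available[target.region]);
  }
  return plan;
}
```

With two servers left in each region, `replenishmentPlan({ EU: 2, US: 2 })` orders three servers for the EU pool and none for the US pool, because only the EU pool is below its minimum.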
### 3.3 Tenant Container Layout

```
Tenant VPS (e.g., Build tier: 8c/16GB/512GB NVMe)
│
├── nginx (port 80, 443)                            ~64MB
├── letsbe-openclaw (port 18789, host network)      ~384MB incl. Chromium
├── letsbe-safety-wrapper (port 8200)               ~128MB
├── letsbe-secrets-proxy (port 8100)                ~64MB
│
├── TOOL STACKS (Docker Compose per tool):
│   ├── nextcloud + postgres (port 3023)            ~768MB
│   ├── chatwoot + postgres + redis (port 3019)     ~1024MB
│   ├── ghost + mysql (port 3025)                   ~384MB
│   ├── calcom + postgres (port 3044)               ~384MB
│   ├── stalwart-mail (port 3011)                   ~256MB
│   ├── odoo + postgres (port 3035)                 ~1280MB
│   ├── keycloak + postgres (port 3043)             ~512MB
│   ├── listmonk + postgres (port 3026)             ~256MB
│   ├── nocodb (port 3037)                          ~256MB
│   ├── umami + postgres (port 3029)                ~256MB
│   ├── uptime-kuma (port 3033)                     ~128MB
│   ├── portainer (port 9443)                       ~128MB
│   ├── activepieces (port 3040)                    ~384MB
│   ├── ... (remaining tools)
│   └── certbot                                     ~16MB
│
└── TOTAL: varies by tier and selected tools
```

---

## 4. Container Strategy

### 4.1 Image Registry

All custom images are hosted on the Gitea Container Registry:

```
code.letsbe.solutions/letsbe/hub:latest
code.letsbe.solutions/letsbe/openclaw:latest
code.letsbe.solutions/letsbe/safety-wrapper:latest
code.letsbe.solutions/letsbe/secrets-proxy:latest
code.letsbe.solutions/letsbe/provisioner:latest
code.letsbe.solutions/letsbe/demo:latest
```

### 4.2 Image Build Strategy

```dockerfile
# packages/safety-wrapper/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
# Dev dependencies are needed for the build step
RUN npm ci --include=dev
COPY . .
RUN npm run build
# Drop dev dependencies so they are not copied into the runtime image
RUN npm prune --omit=dev

FROM node:22-alpine AS runner
# Create the group first, then add the user to it
RUN addgroup -g 1001 -S letsbe && adduser -S -u 1001 -G letsbe letsbe
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8200
CMD ["node", "dist/server.js"]
```

```dockerfile
# packages/secrets-proxy/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json ./
# Dev dependencies are needed for the build step
RUN npm ci --include=dev
COPY . .
RUN npm run build
# Drop dev dependencies so they are not copied into the runtime image
RUN npm prune --omit=dev

FROM node:22-alpine AS runner
# Create the group first, then add the user to it
RUN addgroup -g 1001 -S letsbe && adduser -S -u 1001 -G letsbe letsbe
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./
USER letsbe
EXPOSE 8100
CMD ["node", "dist/server.js"]
```

### 4.3 OpenClaw Custom Image

```dockerfile
# packages/openclaw-image/Dockerfile
FROM openclaw/openclaw:2026.2.6-3

# Install CLI binaries for tool access
RUN apk add --no-cache curl jq

# Install gog (Google CLI) and himalaya (IMAP CLI)
COPY bin/gog /usr/local/bin/gog
COPY bin/himalaya /usr/local/bin/himalaya
RUN chmod +x /usr/local/bin/gog /usr/local/bin/himalaya

# Pre-create directory structure and hand it to the runtime user
# (assumes the build steps above run as root, so the tree would otherwise be root-owned)
RUN mkdir -p /home/openclaw/.openclaw/agents \
    /home/openclaw/.openclaw/skills \
    /home/openclaw/.openclaw/references \
    /home/openclaw/.openclaw/data \
    /home/openclaw/.openclaw/shared-memory \
    && chown -R openclaw /home/openclaw/.openclaw

USER openclaw
```

### 4.4 Container Restart Policies

| Container | Restart Policy | Rationale |
|-----------|---------------|-----------|
| All LetsBe containers | `unless-stopped` | Auto-recover from crashes; manual stop stays stopped |
| Tool containers | `unless-stopped` | Same — tools should self-heal |
| nginx | `unless-stopped` | Critical path — must auto-restart |

---

## 5. Resource Budgets

### 5.1 Per-Tier Budget

| Component | Lite (8GB) | Build (16GB) | Scale (32GB) | Enterprise (64GB) |
|-----------|-----------|-------------|-------------|------------------|
| LetsBe overhead | 640MB | 640MB | 640MB | 640MB |
| Tool headroom | 7,360MB | 15,360MB | 31,360MB | 63,360MB |
| Recommended tools | 5-8 | 10-15 | 15-25 | 25-30+ |
| CPU cores | 4 | 8 | 12 | 16 |
| NVMe storage | 256GB | 512GB | 1TB | 2TB |

### 5.2 LetsBe Overhead Breakdown

| Process | RAM | CPU | Notes |
|---------|-----|-----|-------|
| OpenClaw Gateway | ~256MB | 1.0 core | Node.js 22 + agent state |
| Chromium (browser tool) | ~128MB | 0.5 core | Managed by OpenClaw, shared across agents |
| Safety Wrapper | ~128MB | 0.5 core | Tool execution + Hub communication |
| Secrets Proxy | ~64MB | 0.25 core | Lightweight HTTP proxy |
| nginx | ~64MB | 0.25 core | Reverse proxy for all tool subdomains |
| **Total** | **~640MB** | **~2.5 cores** | |

### 5.3 Tool Resource Registry

Used by the resource calculator in the website and by the IT Agent for dynamic tool installation:

```json
{
  "nextcloud": { "ram_mb": 512, "disk_gb": 10, "requires_db": "postgres" },
  "chatwoot": { "ram_mb": 768, "disk_gb": 5, "requires_db": "postgres", "requires_redis": true },
  "ghost": { "ram_mb": 256, "disk_gb": 3, "requires_db": "mysql" },
  "odoo": { "ram_mb": 1024, "disk_gb": 10, "requires_db": "postgres" },
  "calcom": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "stalwart": { "ram_mb": 256, "disk_gb": 5 },
  "keycloak": { "ram_mb": 512, "disk_gb": 2, "requires_db": "postgres" },
  "listmonk": { "ram_mb": 256, "disk_gb": 2, "requires_db": "postgres" },
  "nocodb": { "ram_mb": 256, "disk_gb": 2 },
  "umami": { "ram_mb": 192, "disk_gb": 1, "requires_db": "postgres" },
  "uptime_kuma": { "ram_mb": 128, "disk_gb": 1 },
  "portainer": { "ram_mb": 128, "disk_gb": 1 },
  "activepieces": { "ram_mb": 384, "disk_gb": 3, "requires_db": "postgres" }
}
```

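A resource calculator built on this registry could look roughly like the sketch below. It sums the registry values for a tool selection and compares the RAM total against the tool headroom from Section 5.1. The function name `checkFit` and the result shape are hypothetical, only a subset of the registry is copied in, and the sketch counts registry values only: whether `ram_mb` already includes the memory of a required database container is not stated in the registry, so treat the result as a lower bound.

```typescript
interface ToolSpec {
  ram_mb: number;
  disk_gb: number;
}

// Subset of the registry, values copied verbatim.
const REGISTRY: Record<string, ToolSpec> = {
  nextcloud: { ram_mb: 512, disk_gb: 10 },
  chatwoot: { ram_mb: 768, disk_gb: 5 },
  ghost: { ram_mb: 256, disk_gb: 3 },
  odoo: { ram_mb: 1024, disk_gb: 10 },
};

// Tool headroom per tier in MB, from the per-tier budget table.
const TOOL_HEADROOM_MB = {
  lite: 7360,
  build: 15360,
  scale: 31360,
  enterprise: 63360,
} as const;

type Tier = keyof typeof TOOL_HEADROOM_MB;

interface FitResult {
  ramNeededMb: number;
  diskNeededGb: number;
  unknownTools: string[];
  fits: boolean;
}

// Sum RAM and disk for the selected tools and check the RAM total against
// the tier headroom. Tools missing from the registry make the check fail.
function checkFit(tier: Tier, tools: string[]): FitResult {
  let ramNeededMb = 0;
  let diskNeededGb = 0;
  const unknownTools: string[] = [];
  for (const tool of tools) {
    const spec = REGISTRY[tool];
    if (spec === undefined) {
      unknownTools.push(tool);
      continue;
    }
    ramNeededMb += spec.ram_mb;
    diskNeededGb += spec.disk_gb;
  }
  const fits = unknownTools.length === 0 && ramNeededMb <= TOOL_HEADROOM_MB[tier];
  return { ramNeededMb, diskNeededGb, unknownTools, fits };
}
```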
---

## 6. Provider Strategy

### 6.1 Primary: Netcup RS G12

| Plan | Specs | Monthly | Contract | Use Case |
|------|-------|---------|----------|----------|
| RS 1000 G12 | 4c/8GB/256GB | ~€8.50 | 12-month | Lite tier |
| RS 2000 G12 | 8c/16GB/512GB | ~€14.50 | 12-month | Build tier (default) |
| RS 4000 G12 | 12c/32GB/1TB | ~€26.00 | 12-month | Scale tier |
| RS 8000 G12 | 16c/64GB/2TB | ~€48.00 | 12-month | Enterprise tier |

**Both EU (Nuremberg) and US (Manassas) datacenters available.**

Pre-provisioned pool: 5 Build-tier servers in EU, 3 in US. Replenished weekly.

### 6.2 Overflow: Hetzner Cloud

For burst capacity when the Netcup pool is depleted:

| Type | Specs | Hourly | Monthly Cap | Notes |
|------|-------|--------|-------------|-------|
| CPX21 | 3c/4GB/80GB | €0.0113 | ~€8.24 | Lite equivalent |
| CPX31 | 4c/8GB/160GB | €0.0214 | ~€15.59 | Build equivalent |
| CPX41 | 8c/16GB/240GB | €0.0399 | ~€29.09 | Scale equivalent |
| CPX51 | 16c/32GB/360GB | €0.0798 | ~€58.15 | Enterprise equivalent |

**Trigger:** The Netcup pool for the requested tier and region is empty AND the order is in AUTO mode.

**Migration:** The customer is migrated to a Netcup RS server when the next contract cycle opens (checked monthly).

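The overflow trigger and the hourly billing cap can be expressed in a few lines. This is a sketch under assumptions: `MANUAL` is a placeholder name for whatever the non-AUTO order mode is called, returning `null` for "no automatic allocation" is an illustrative choice, and `chooseProvider` and `hetznerCost` are hypothetical helpers rather than existing Hub functions.

```typescript
type ProviderName = "netcup" | "hetzner";
// "MANUAL" is an assumed name for the non-AUTO mode.
type ProvisioningMode = "AUTO" | "MANUAL";

interface AllocationDecision {
  provider: ProviderName | null; // null: nothing is allocated automatically
  reason: string;
}

// Apply the trigger rule: Netcup first, Hetzner only when the pool for the
// requested tier and region is empty AND the order is in AUTO mode.
function chooseProvider(netcupPoolSize: number, mode: ProvisioningMode): AllocationDecision {
  if (netcupPoolSize > 0) {
    return { provider: "netcup", reason: "pool server available" };
  }
  if (mode === "AUTO") {
    return { provider: "hetzner", reason: "pool empty, overflow to hourly capacity" };
  }
  return { provider: null, reason: "pool empty and order not in AUTO mode" };
}

// Cost of an overflow server for one billing month: hourly rate times
// hours used, never more than the monthly cap from the table.
function hetznerCost(hoursUsed: number, hourlyEur: number, monthlyCapEur: number): number {
  return Math.min(hoursUsed * hourlyEur, monthlyCapEur);
}
```

For a CPX31 kept for 100 hours before migration, `hetznerCost(100, 0.0214, 15.59)` stays well below the cap, which is why short-lived overflow servers are cheap relative to a 12-month contract.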
### 6.3 Provider Abstraction

The Provisioner is provider-agnostic — it only needs SSH access to a Debian 12 VPS. Provider-specific logic lives in the Hub:

```typescript
interface ServerProvider {
  name: 'netcup' | 'hetzner';
  allocateServer(tier: ServerTier, region: Region): Promise<ServerAllocation>;
  deallocateServer(serverId: string): Promise<void>;
  getServerStatus(serverId: string): Promise<ServerStatus>;
  createSnapshot(serverId: string): Promise<SnapshotResult>;
}
```

---

## 7. Update & Rollout Strategy

### 7.1 Central Platform Updates

| Component | Deployment | Rollback |
|-----------|-----------|----------|
| Hub | Docker image pull + restart | Previous image tag |
| Website | Vercel deploy (instant) or Docker pull | Previous deployment |
| Hub Database | `prisma migrate deploy` (forward-only) | Reverse migration script |

### 7.2 Tenant Server Updates

Tenant updates are initiated by the Hub, NOT by the tenants themselves. The update command reaches each tenant through the Safety Wrapper heartbeat:

```
1. Hub builds new Safety Wrapper / Secrets Proxy image
2. Hub creates update task for each tenant
3. Safety Wrapper receives update command via heartbeat
4. Safety Wrapper performs the update:
   a. Pull new image (from Gitea registry)
   b. Stop old container
   c. Start new container
   d. Health check
   e. Report success/failure to Hub
5. If health check fails: rollback to previous image
```

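The update-and-rollback sequence above can be sketched as a single function. The `ContainerRuntime` interface is a hypothetical abstraction introduced for this example (it is not a Docker or Safety Wrapper API), and reporting to the Hub is left to the caller, which receives the outcome as the return value.

```typescript
// Hypothetical abstraction over the container runtime, for illustration only.
interface ContainerRuntime {
  pull(image: string): Promise<void>;
  stop(name: string): Promise<void>;
  start(name: string, image: string): Promise<void>;
  isHealthy(name: string): Promise<boolean>;
}

interface UpdateResult {
  success: boolean;
  runningImage: string; // image that is running when the function returns
}

async function applyUpdate(
  runtime: ContainerRuntime,
  containerName: string,
  currentImage: string,
  newImage: string,
): Promise<UpdateResult> {
  // a. Pull first, so the container is only stopped once the image is local.
  await runtime.pull(newImage);
  // b. + c. Replace the running container.
  await runtime.stop(containerName);
  await runtime.start(containerName, newImage);
  // d. Health check decides between success and rollback.
  if (await runtime.isHealthy(containerName)) {
    return { success: true, runningImage: newImage };
  }
  // Rollback: restart the previous image, which is still present locally.
  await runtime.stop(containerName);
  await runtime.start(containerName, currentImage);
  // e. The caller reports this result to the Hub.
  return { success: false, runningImage: currentImage };
}
```

Pulling before stopping keeps the downtime window limited to the stop/start pair, and a failed pull throws before anything has been touched.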
### 7.3 OpenClaw Updates

OpenClaw is pinned to a tested release tag. Update cadence:

1. Monthly review of upstream changelog
2. Test new release on staging VPS (dedicated test tenant)
3. If no issues after 48 hours: start the canary rollout (stage sizes and durations in Section 7.4)
4. Monitor each canary stage before widening the rollout
5. Roll out to remaining tenants
6. Rollback available: previous Docker image tag

### 7.4 Canary Deployment

Canary selection: newest tenants first (less established, lower blast radius). Stages:

```
Stage 1: Staging VPS (internal testing) — 48 hours
Stage 2: 5% of tenants (canary group) — 24 hours
Stage 3: 25% of tenants — 12 hours
Stage 4: 100% of tenants — complete
```

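One way to turn the stages and the newest-first rule into concrete rollout waves is sketched below. It treats the percentages as cumulative shares of the fleet (5%, then up to 25%, then everyone) and rounds up so that a small fleet still gets a non-empty first wave; both choices are assumptions, as are the `Tenant` shape and the `canaryWaves` name. Stage 1 (the staging VPS) is not a tenant wave and is left out.

```typescript
interface Tenant {
  id: string;
  createdAt: Date;
}

// Cumulative percentage of tenants covered after stages 2, 3 and 4.
const STAGE_PERCENTAGES = [5, 25, 100];

// Split tenants into rollout waves, newest tenants first.
function canaryWaves(tenants: Tenant[]): Tenant[][] {
  const ordered = [...tenants].sort(
    (a, b) => b.createdAt.getTime() - a.createdAt.getTime(),
  );
  const waves: Tenant[][] = [];
  let alreadyCovered = 0;
  for (const percent of STAGE_PERCENTAGES) {
    // Rounding up is an assumed policy, not something the stage list specifies.
    const coverUpTo = Math.min(
      ordered.length,
      Math.ceil((ordered.length * percent) / 100),
    );
    waves.push(ordered.slice(alreadyCovered, coverUpTo));
    alreadyCovered = Math.max(alreadyCovered, coverUpTo);
  }
  return waves;
}
```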
---

## 8. Disaster Recovery

### 8.1 Three-Tier Backup Strategy

| Tier | What | How | Frequency | Retention |
|------|------|-----|-----------|-----------|
| 1. Application | Tool databases (18 PG + 2 MySQL + 1 Mongo) | `backups.sh` (existing) | Daily 2:00 AM | 7 daily + 4 weekly |
| 2. VPS Snapshot | Full VPS image | Netcup SCP API | Daily (staggered) | 3 rolling |
| 3. Hub Database | Central PostgreSQL | `pg_dump` + rclone | Daily 3:00 AM | 14 daily + 8 weekly + 3 monthly |

### 8.2 Recovery Scenarios

| Scenario | Recovery Method | RTO | RPO |
|----------|----------------|-----|-----|
| Single tool database corrupted | Restore from application backup | 15 minutes | 24 hours |
| VPS disk failure | Restore from Netcup snapshot | 30 minutes | 24 hours |
| VPS completely lost | Re-provision from scratch + restore snapshot | 2 hours | 24 hours |
| Hub database corrupted | Restore from pg_dump backup | 30 minutes | 24 hours |
| Hub server lost | Re-deploy on new server + restore DB | 2 hours | 24 hours |
| Regional outage | Failover to other region (manual) | 4 hours | 24 hours |

### 8.3 Backup Monitoring

The Safety Wrapper's cron job reads `backup-status.json` daily at 6:00 AM:

```json
{
  "last_run": "2026-02-27T02:15:00Z",
  "duration_seconds": 342,
  "databases": {
    "chatwoot": { "status": "success", "size_mb": 45 },
    "ghost": { "status": "success", "size_mb": 12 },
    "nextcloud": { "status": "failed", "error": "connection refused" }
  },
  "remote_sync": { "status": "success", "uploaded_mb": 230 }
}
```

Alerts:
- **Medium severity:** Any database backup failed
- **Hard severity:** All backups failed, or `backup-status.json` is stale (>48 hours)

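The two alert rules above map directly onto a small classifier. The sketch below takes the parsed `backup-status.json` and the current time and returns a severity; the function name `classifyBackup`, the `"none"` level, and the decision to treat an unparseable `last_run` as stale are assumptions. The `remote_sync` field is ignored here because the alert rules do not mention it.

```typescript
interface DatabaseResult {
  status: "success" | "failed";
  size_mb?: number;
  error?: string;
}

interface BackupStatus {
  last_run: string; // ISO 8601 timestamp
  databases: Record<string, DatabaseResult>;
}

type Severity = "none" | "medium" | "hard";

const STALE_AFTER_MS = 48 * 60 * 60 * 1000; // 48 hours

function classifyBackup(status: BackupStatus, now: Date): Severity {
  const ageMs = now.getTime() - new Date(status.last_run).getTime();
  // An unparseable timestamp yields NaN; treating that as stale is an assumption.
  if (Number.isNaN(ageMs) || ageMs > STALE_AFTER_MS) {
    return "hard";
  }
  const results = Object.values(status.databases);
  const failed = results.filter((r) => r.status === "failed").length;
  if (results.length > 0 && failed === results.length) {
    return "hard"; // every database backup failed
  }
  if (failed > 0) {
    return "medium"; // at least one database backup failed
  }
  return "none";
}
```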
---

## 9. Monitoring & Alerting

### 9.1 Tenant Health Monitoring

The Hub monitors all tenants via Safety Wrapper heartbeats:

| Metric | Source | Alert Threshold |
|--------|--------|----------------|
| Heartbeat freshness | Safety Wrapper heartbeat | >3 missed intervals (3 min) |
| Disk usage | Heartbeat payload | >85% |
| Memory usage | Heartbeat payload | >90% |
| Token pool usage | Billing period | 80%, 90%, 100% |
| Backup status | Backup report | Any failure |
| Container health | Portainer integration | Crash/OOM events |
| SSL cert expiry | Cert check cron | <14 days |

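A Hub-side check for the first three rows of the table might look like the sketch below. The heartbeat field names and alert labels are hypothetical, and the 60-second interval is inferred from "3 missed intervals (3 min)"; the check fires once more than three minutes have passed since the last heartbeat.

```typescript
// Hypothetical shape of the data the Hub keeps from the last heartbeat.
interface Heartbeat {
  receivedAt: Date;
  diskUsedPct: number;
  memUsedPct: number;
}

// Inferred from the table: 3 intervals correspond to 3 minutes.
const HEARTBEAT_INTERVAL_MS = 60 * 1000;
const MAX_MISSED_INTERVALS = 3;

// Return the list of threshold alerts that apply to one tenant.
function evaluateTenant(last: Heartbeat, now: Date): string[] {
  const alerts: string[] = [];
  const silenceMs = now.getTime() - last.receivedAt.getTime();
  if (silenceMs > MAX_MISSED_INTERVALS * HEARTBEAT_INTERVAL_MS) {
    alerts.push("heartbeat_stale");
  }
  if (last.diskUsedPct > 85) {
    alerts.push("disk_high");
  }
  if (last.memUsedPct > 90) {
    alerts.push("memory_high");
  }
  return alerts;
}
```

Disk and memory values come from the last payload received, so for a tenant that has gone silent they describe the state before the silence began, not the current state.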
### 9.2 Alert Routing

| Severity | Customer Notification | Staff Notification |
|----------|----------------------|-------------------|
| Soft | None (auto-recovers) | Dashboard indicator |
| Medium | Push notification (after 3 failures) | Email + dashboard |
| Hard | Push notification (immediate) | Email + Slack/webhook + dashboard |

### 9.3 Hub Self-Monitoring

The Hub tracks the following about itself:

- PostgreSQL connection pool usage
- API response times (p50, p95, p99)
- Failed provisioning jobs
- Stripe webhook processing latency
- Cron job execution status
- Disk space on Hub server

---

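## 10. SSL & Domain Management

### 10.1 Tenant SSL

Each tenant gets a wildcard SSL certificate via Let's Encrypt + certbot. Let's Encrypt issues wildcard certificates only through the DNS-01 challenge, so the request has to go through a DNS plugin; the nginx plugin (HTTP-01) cannot validate `*.${DOMAIN}`:

```bash
# Provisioner Step 4 (existing)
# Uses the certbot-dns-cloudflare plugin, since DNS is managed in Cloudflare (Section 10.3).
# The credentials file path is an assumption; the file must hold a Cloudflare API token.
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.${DOMAIN}" -d "${DOMAIN}" \
  --non-interactive --agree-tos -m "ssl@letsbe.biz"
```

Auto-renewal runs on certbot's default schedule: a check every 12 hours, with renewal once a certificate has fewer than 30 days to expiry.
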
### 10.2 Subdomain Layout

Each tool gets a subdomain on the customer's domain:

```
files.example.com      → Nextcloud
chat.example.com       → Chatwoot
blog.example.com       → Ghost
cal.example.com        → Cal.com
mail.example.com       → Stalwart Mail
erp.example.com        → Odoo
wiki.example.com       → BookStack (if installed)
...
status.example.com     → Uptime Kuma
portainer.example.com  → Portainer (admin only)
```

### 10.3 DNS Automation

New capability — auto-create DNS records at provisioning time:

```typescript
// Hub: src/lib/services/dns-automation-service.ts

interface DnsAutomationService {
  createRecords(params: {
    domain: string;
    ip: string;
    tools: string[];
    provider: 'cloudflare';
    zone_id: string;
  }): Promise<{ records_created: number; errors: string[] }>;
}

// Creates A records for:
// 1. Root domain → VPS IP
// 2. Wildcard *.domain → VPS IP (covers all tool subdomains)
// Or individual A records per tool subdomain if wildcard not supported
```

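The record list described in the comments above can be built by a pure function before anything is sent to the DNS provider. The sketch below shows that step only and makes no Cloudflare API calls. `buildRecords`, the `DnsRecord` shape, and the `wildcardSupported` flag are hypothetical, and the subdomain labels are a subset taken from Section 10.2, keyed by the tool names used in the resource registry.

```typescript
interface DnsRecord {
  type: "A";
  name: string;
  content: string; // target IPv4 address
}

// Tool name to subdomain label (subset of the layout in Section 10.2).
const SUBDOMAINS: Record<string, string> = {
  nextcloud: "files",
  chatwoot: "chat",
  ghost: "blog",
  calcom: "cal",
  stalwart: "mail",
  odoo: "erp",
};

// Build the A records for one tenant: root plus wildcard, or root plus one
// record per tool when a wildcard record cannot be used.
function buildRecords(
  domain: string,
  ip: string,
  tools: string[],
  wildcardSupported: boolean,
): { records: DnsRecord[]; errors: string[] } {
  const records: DnsRecord[] = [{ type: "A", name: domain, content: ip }];
  const errors: string[] = [];
  if (wildcardSupported) {
    records.push({ type: "A", name: `*.${domain}`, content: ip });
    return { records, errors };
  }
  for (const tool of tools) {
    const label = SUBDOMAINS[tool];
    if (label === undefined) {
      // Unknown tools are reported instead of guessed.
      errors.push(`no subdomain mapping for ${tool}`);
      continue;
    }
    records.push({ type: "A", name: `${label}.${domain}`, content: ip });
  }
  return { records, errors };
}
```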
---

*End of Document — 03 Deployment Strategy*