# LetsBe Biz — Infrastructure Runbook

**Version:** 1.0
**Date:** February 26, 2026
**Authors:** Matt (Founder), Claude (Architecture)
**Status:** Engineering Spec — Ready for Implementation
**Companion docs:** Technical Architecture v1.2, Tool Catalog v2.2, Security & GDPR Framework v1.1
**Decision refs:** Foundation Document Decisions #18, #27

---

## 1. Purpose

This runbook is the operational reference for provisioning, managing, monitoring, and maintaining LetsBe Biz infrastructure. It covers the full lifecycle: from ordering a VPS through Netcup to deprovisioning a customer's server at account termination.

**Target audience:** Matt (operations), future engineering team, and the IT Admin AI agent (for self-referencing operational procedures).

---

## 2. Infrastructure Overview

### 2.1 Hosting Provider: Netcup

| Item | Detail |
|------|--------|
| **Provider** | Netcup GmbH (Karlsruhe, Germany) |
| **Product line** | VPS (Virtual Private Server) |
| **EU data center** | Netcup Nürnberg/Karlsruhe, Germany |
| **NA data center** | Netcup Manassas, Virginia, USA |
| **API** | SCP (Server Control Panel) REST API with OAuth2 Device Flow |
| **Hub integration** | Full — server ordering, power actions, metrics, snapshots, rescue mode via `netcupService.ts` |

### 2.2 Server Tiers

| Tier | vCPUs | RAM | Disk | Recommended Tools | Monthly Cost (est.) |
|------|-------|-----|------|-------------------|---------------------|
| Lite (€29) | 4 | 8 GB | 160 GB SSD | 5–8 tools | ~€8–12 |
| Build (€45) | 8 | 16 GB | 320 GB SSD | 10–15 tools | ~€14–18 |
| Scale (€75) | 12 | 32 GB | 640 GB SSD | 15–25 tools | ~€22–28 |
| Enterprise (€109) | 16 | 64 GB | 1.2 TB SSD | 28+ tools | ~€35–45 |

### 2.3 Network Architecture

```
Internet
    │
    ▼
Netcup VPS (public IP)
    │
    ├── Port 80 (HTTP → 301 redirect to HTTPS)
    ├── Port 443 (HTTPS → nginx reverse proxy)
    ├── Port 22022 (SSH — hardened, key-only)
    │
    ▼
nginx (Alpine container)
    │
    ├── *.{{domain}} → Route by subdomain to tool containers
    │   ├── files.{{domain}} → 127.0.0.1:3023 (Nextcloud)
    │   ├── crm.{{domain}} → 127.0.0.1:3025 (Odoo)
    │   ├── chat.{{domain}} → 127.0.0.1:3026 (Chatwoot)
    │   ├── blog.{{domain}} → 127.0.0.1:3029 (Ghost)
    │   ├── mail.{{domain}} → 127.0.0.1:3031 (Stalwart Mail)
    │   ├── ... (33 nginx configs total)
    │   └── status.{{domain}} → 127.0.0.1:3008 (Uptime Kuma)
    │
    └── Internal only (not exposed via nginx):
        ├── 127.0.0.1:18789 (OpenClaw Gateway)
        ├── 127.0.0.1:8100 (Secrets Proxy)
        └── Various internal tool ports
```

---

## 3. Provisioning Pipeline

### 3.1 End-to-End Flow

```
Customer signs up → Stripe payment → Hub creates Order
    │
    ▼
Hub Automation Worker (state machine)
    │
    ├── PAYMENT_CONFIRMED → order VPS from Netcup (if AUTO mode)
    ├── AWAITING_SERVER → poll Netcup until VPS is ready
    ├── SERVER_READY → wait for DNS records
    ├── DNS_PENDING → verify A records for all subdomains
    ├── DNS_READY → trigger provisioning
    ├── PROVISIONING → spawn Docker provisioner container
    │   │
    │   ▼
    │   letsbe-provisioner (10-step pipeline via SSH)
    │   ├── Step 1: System packages (apt update, essentials)
    │   ├── Step 2: Docker CE installation
    │   ├── Step 3: Disable conflicting services
    │   ├── Step 4: nginx + fallback config
    │   ├── Step 5: UFW firewall (80, 443, 22022, plus mail ports 25/587/993)
    │   ├── Step 6: Admin user + SSH key (optional)
    │   ├── Step 7: SSH hardening (port 22022, key-only)
    │   ├── Step 8: Unattended security updates
    │   ├── Step 9: Deploy tool stacks (docker-compose)
    │   └── Step 10: Deploy OpenClaw + Safety Wrapper + bootstrap
    │
    ├── FULFILLED → server is live, customer notified
    └── FAILED → retry logic (1min / 5min / 15min backoff, max 3 attempts)
```
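The FAILED branch above names a fixed backoff schedule. A minimal sketch of that schedule follows, assuming "max 3 attempts" means up to three retries after the first failure; `retry_delay_seconds`, `run_with_retries`, and the `RETRY_SLEEP` override are hypothetical names for illustration and are not taken from the Hub codebase.

```bash
#!/usr/bin/env bash
# Sketch of the FAILED-state retry schedule: 1 min, 5 min, 15 min, then give up.
# All names here are hypothetical; the Hub worker may implement this differently.

# Print the delay (in seconds) to wait before retry number $1.
# Returns non-zero once the three scheduled retries are used up.
retry_delay_seconds() {
  case "$1" in
    1) echo 60 ;;
    2) echo 300 ;;
    3) echo 900 ;;
    *) return 1 ;;
  esac
}

# Run a command; on failure wait for the scheduled delay and try again.
# RETRY_SLEEP can be set to "true" to skip the actual waiting in dry runs.
run_with_retries() {
  local attempt=1 delay
  until "$@"; do
    delay="$(retry_delay_seconds "$attempt")" || return 1
    "${RETRY_SLEEP:-sleep}" "$delay"
    attempt=$((attempt + 1))
  done
}
```

A call such as `run_with_retries ./provision.sh` (script name hypothetical) would run the command once and retry up to three times before reporting failure.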

### 3.2 Provisioner Detail (setup.sh)

**Location:** `letsbe-provisioner/scripts/setup.sh` (~832 lines)

#### Step 1: System Packages

```bash
apt-get update && apt-get upgrade -y
apt-get install -y curl wget gnupg2 ca-certificates lsb-release apt-transport-https \
    software-properties-common unzip jq htop iotop net-tools dnsutils certbot \
    python3-certbot-nginx fail2ban rclone
```

#### Step 2: Docker CE

```bash
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list
apt-get update && apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
systemctl enable --now docker
```

#### Step 3: Disable Conflicting Services

```bash
systemctl stop apache2 2>/dev/null || true
systemctl disable apache2 2>/dev/null || true
systemctl stop postfix 2>/dev/null || true
systemctl disable postfix 2>/dev/null || true
```

#### Step 4: nginx

Deploy nginx Alpine container with initial fallback config. SSL certificates provisioned via certbot after DNS is verified.

#### Step 5: UFW Firewall

```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp      # HTTP
ufw allow 443/tcp     # HTTPS
ufw allow 22022/tcp   # SSH (hardened port)
ufw allow 25/tcp      # SMTP (Stalwart Mail)
ufw allow 587/tcp     # SMTP submission
ufw allow 993/tcp     # IMAPS
ufw --force enable
```

#### Step 6: Admin User

```bash
useradd -m -s /bin/bash -G docker letsbe-admin
mkdir -p /home/letsbe-admin/.ssh
echo "{{admin_ssh_public_key}}" > /home/letsbe-admin/.ssh/authorized_keys
chmod 700 /home/letsbe-admin/.ssh
chmod 600 /home/letsbe-admin/.ssh/authorized_keys
chown -R letsbe-admin:letsbe-admin /home/letsbe-admin/.ssh
```

#### Step 7: SSH Hardening

```bash
# /etc/ssh/sshd_config modifications:
Port 22022
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
LoginGraceTime 30
AllowUsers letsbe-admin
```

#### Step 8: Unattended Security Updates

```bash
apt-get install -y unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
# Configure /etc/apt/apt.conf.d/50unattended-upgrades for security-only updates
```

#### Step 9: Deploy Tool Stacks

For each tool selected by the customer:

```bash
# 1. Generate credentials (env_setup.sh)
# 50+ secrets: database passwords, admin tokens, API keys, JWT secrets
# Written to /opt/letsbe/env/credentials.env and per-tool .env files

# 2. Deploy Docker Compose stacks
for stack in {{selected_tools}}; do
    cd "/opt/letsbe/stacks/$stack" || exit 1
    docker compose up -d
done

# 3. Deploy nginx configs per tool
for conf in {{selected_nginx_configs}}; do
    cp "/opt/letsbe/nginx/sites/$conf" /etc/nginx/sites-enabled/
done
nginx -t && nginx -s reload

# 4. Request SSL certificates
# NOTE: Let's Encrypt issues wildcard certificates only through the DNS-01
# challenge, which the --nginx (HTTP-01) plugin cannot perform. Either pass
# each subdomain with its own -d flag or use a certbot DNS plugin.
certbot --nginx -d "*.{{domain}}" --non-interactive --agree-tos -m "ssl@{{domain}}"
```

#### Step 10: Deploy OpenClaw + Safety Wrapper + Bootstrap

```bash
# 1. Deploy OpenClaw container with Safety Wrapper extension pre-installed
cd /opt/letsbe/stacks/openclaw
docker compose up -d

# 2. Deploy Secrets Proxy
cd /opt/letsbe/stacks/secrets-proxy
docker compose up -d

# 3. Seed secrets registry from credentials.env
docker exec letsbe-openclaw /opt/letsbe/scripts/seed-secrets.sh

# 4. Generate tool-registry.json from deployed tools
docker exec letsbe-openclaw /opt/letsbe/scripts/generate-tool-registry.sh

# 5. Deploy SOUL.md files for each agent
# (generated from templates with tenant variables substituted)

# 6. Run initial setup browser automations
# (Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma)

# 7. Register with Hub
docker exec letsbe-openclaw /opt/letsbe/scripts/hub-register.sh

# 8. Clean up config.json (CRITICAL: remove plaintext passwords)
rm -f /opt/letsbe/config.json
```

### 3.3 Credential Generation (env_setup.sh)

**Location:** `letsbe-provisioner/scripts/env_setup.sh` (~678 lines)

Generates 50+ unique credentials per tenant:

| Category | Count | Examples |
|----------|-------|---------|
| Database passwords | 18 | PostgreSQL passwords for each tool with a DB |
| Admin passwords | 12 | Nextcloud admin, Keycloak admin, Odoo admin, etc. |
| API tokens | 10 | NocoDB API token, Ghost admin API key, etc. |
| JWT secrets | 5 | Chatwoot, Cal.com, OpenClaw, etc. |
| Encryption keys | 3 | Safety Wrapper registry key, backup encryption key |
| SSH keys | 2 | Admin key pair, Hub communication key |
| SMTP credentials | 2 | Stalwart Mail admin, relay credentials |

**Generation method:** `openssl rand -base64 32` for passwords, `openssl rand -hex 32` for tokens, `ssh-keygen -t ed25519` for SSH keys.

**Template rendering:** All `{{ variable }}` placeholders in Docker Compose files and nginx configs are substituted with generated values.
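
The substitution step can be sketched as follows. This is an illustration only, not the code in `env_setup.sh`: `render_template` is a hypothetical helper, and it assumes GNU `sed` and single-line values. The escaping matters because base64 passwords may contain `/`, `+`, and `=`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace every {{ name }} or {{name}} placeholder for one
# variable in a file. Not taken from env_setup.sh; assumes GNU sed (-i).

render_template() {
  local file="$1" name="$2" value="$3" escaped
  # Escape the characters that are special on the replacement side of s|||:
  # backslash, ampersand, and the chosen delimiter "|".
  escaped="$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')"
  sed -i -e "s|{{ *${name} *}}|${escaped}|g" "$file"
}

# Example (not executed here):
#   render_template docker-compose.yml db_password "$(openssl rand -base64 32)"
```

Using `|` as the delimiter avoids clashing with the `/` characters that base64 output can contain.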

### 3.4 Post-Provisioning Verification

After step 10 completes, the provisioner runs health checks:

```bash
# 1. Verify all containers are running
# (-a includes stopped containers; plain `docker ps` lists only running ones)
docker ps -a --format "{{.Names}}: {{.Status}}" | grep -v ": Up" && exit 1

# 2. Verify nginx is serving
curl -sf https://{{domain}} > /dev/null || exit 1

# 3. Verify each tool's health endpoint
for tool in {{health_check_urls}}; do
    curl -sf "$tool" > /dev/null || echo "WARNING: $tool not responding"
done

# 4. Verify Safety Wrapper registered with Hub
curl -sf http://127.0.0.1:8100/health || exit 1

# 5. Verify OpenClaw is responsive
curl -sf http://127.0.0.1:18789/health || exit 1

# 6. Report success to Hub
curl -sf -X PATCH "{{hub_url}}/api/v1/jobs/{{job_id}}" \
    -H "Authorization: Bearer {{runner_token}}" \
    -H "Content-Type: application/json" \
    -d '{"status": "COMPLETED"}'
```
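The first check can be made more informative by naming the containers that are down. The sketch below is an assumption about how that might look, not part of the provisioner; `not_running` is a hypothetical helper that reads the `name: status` lines produced by `docker ps -a --format '{{.Names}}: {{.Status}}'`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name: status" lines on stdin and print the names
# of containers that are not up, or that Docker reports as "(unhealthy)".

not_running() {
  awk -F': ' '$2 !~ /^Up/ || $0 ~ /\(unhealthy\)/ { print $1 }'
}

# Example (not executed here):
#   failed="$(docker ps -a --format '{{.Names}}: {{.Status}}' | not_running)"
#   [ -z "$failed" ] || { echo "Not running: $failed"; exit 1; }
```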

---

## 4. Backup System

### 4.1 Backup Architecture

**Location:** `letsbe-provisioner/scripts/backups.sh` (~473 lines)
**Schedule:** Daily via cron at 02:00 server local time
**Retention:** 7 daily backups + 4 weekly backups (rolling)

### 4.2 What Gets Backed Up

| Component | Method | Target |
|-----------|--------|--------|
| PostgreSQL databases (18) | `pg_dump --format=custom` | `/opt/letsbe/backups/daily/` |
| MySQL databases (2) | `mysqldump --single-transaction` | `/opt/letsbe/backups/daily/` |
| MongoDB databases (1) | `mongodump --archive` | `/opt/letsbe/backups/daily/` |
| Nextcloud files | rsync snapshot | `/opt/letsbe/backups/daily/nextcloud/` |
| Docker volumes (critical) | `docker run --volumes-from` tar | `/opt/letsbe/backups/daily/volumes/` |
| nginx configs | tar archive | `/opt/letsbe/backups/daily/nginx/` |
| OpenClaw state | tar of `~/.openclaw/` | `/opt/letsbe/backups/daily/openclaw/` |
| Safety Wrapper state | SQLite backup API | `/opt/letsbe/backups/daily/safety-wrapper/` |
| Credentials | Encrypted tar | `/opt/letsbe/backups/daily/credentials.enc` |

### 4.3 Remote Backup

After local backup completes, `rclone` syncs to a remote destination:

```bash
rclone sync /opt/letsbe/backups/ remote:backups/{{tenant_id}}/ \
    --transfers 4 \
    --checkers 8 \
    --fast-list \
    --log-file /var/log/letsbe/rclone.log
```

Remote destination options (configured per tenant):
- Netcup S3 (default)
- Customer-provided S3 bucket
- Customer-provided rclone remote

### 4.4 Backup Status Reporting

After each backup run, `backups.sh` writes a `backup-status.json`:

```json
{
  "timestamp": "2026-02-26T02:15:00Z",
  "status": "success",
  "duration_seconds": 847,
  "databases_backed_up": 21,
  "files_backed_up": true,
  "remote_sync": "success",
  "total_size_gb": 4.2,
  "errors": []
}
```

The Safety Wrapper monitors this file (Decision #27) and reports status to the Hub via heartbeat.
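
A check against this file might look like the sketch below. It is an illustration rather than the Safety Wrapper's actual logic: `backup_is_healthy` is a hypothetical helper, the 26-hour freshness window is an assumed tolerance for a daily 02:00 run, and it relies on `jq` (installed in Step 1) and GNU `date`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if backup-status.json reports success for
# both the local run and the remote sync, and the run is recent enough.
# MAX_AGE_HOURS defaults to 26, which is an assumption, not a documented value.

backup_is_healthy() {
  local file="$1" now="${2:-$(date -u +%s)}" max_age_hours="${MAX_AGE_HOURS:-26}"
  local status remote ts ts_epoch
  status="$(jq -r '.status' "$file")" || return 2
  remote="$(jq -r '.remote_sync' "$file")" || return 2
  ts="$(jq -r '.timestamp' "$file")" || return 2
  ts_epoch="$(date -u -d "$ts" +%s)" || return 2   # GNU date parses ISO 8601
  [ "$status" = "success" ] || return 1
  [ "$remote" = "success" ] || return 1
  [ $(( now - ts_epoch )) -le $(( max_age_hours * 3600 )) ]
}

# Example (not executed here):
#   backup_is_healthy /opt/letsbe/backups/backup-status.json || echo "backup alert"
```

Checking the timestamp as well as the status field catches the case where the cron job silently stops running and the last "success" file goes stale.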

### 4.5 Backup Rotation

```bash
# Daily: remove backups older than 7 days
# (-mindepth 1 stops find from matching, and deleting, the daily/ directory itself)
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;

# Weekly: copy Sunday's backup to weekly/, remove copies older than 28 days
if [ "$(date +%u)" = "7" ]; then
    mkdir -p /opt/letsbe/backups/weekly
    cp -a /opt/letsbe/backups/daily/ "/opt/letsbe/backups/weekly/$(date +%Y-%m-%d)/"
fi
find /opt/letsbe/backups/weekly/ -mindepth 1 -maxdepth 1 -mtime +28 -exec rm -rf {} \;
```
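Age-based pruning deletes everything if backups stop running for a while. A count-based alternative is sketched below under one loud assumption: each backup lives in its own directory named `YYYY-MM-DD`, which is not something `backups.sh` is documented to guarantee. `prune_keep_last` is a hypothetical helper and uses GNU `head`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only the newest $2 date-named backup directories
# under $1. Assumes directory names are YYYY-MM-DD, so lexical order is date order.

prune_keep_last() {
  local dir="$1" keep="$2"
  # List immediate date-named subdirectories oldest first, drop the newest
  # $keep from the list (GNU head -n -N), and delete whatever remains.
  find "$dir" -mindepth 1 -maxdepth 1 -type d -name '20??-??-??' | sort | head -n "-${keep}" |
    while IFS= read -r old; do
      rm -rf -- "$old"
    done
}

# Example (not executed here):
#   prune_keep_last /opt/letsbe/backups/weekly 4
```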

---

## 5. Restore Procedures

### 5.1 Per-Tool Restore

**Location:** `letsbe-provisioner/scripts/restore.sh` (~512 lines)

```bash
# Restore a specific tool's database from a daily backup
./restore.sh --tool nextcloud --date 2026-02-25

# Steps:
# 1. Stop the tool container
# 2. Restore database from backup
# 3. Restore files (if applicable)
# 4. Start the tool container
# 5. Verify health check
# 6. Report to Hub
```

### 5.2 Full Server Restore

For complete server recovery (e.g., VPS failure):

```
1. Order new VPS from Netcup (same region, same tier)
2. Run provisioner with --restore flag
   - Steps 1-8: Standard server setup
   - Step 9: Deploy tool stacks (empty)
   - Step 10: Deploy OpenClaw + Safety Wrapper
3. Restore from remote backup:
   rclone sync remote:backups/{{tenant_id}}/latest/ /opt/letsbe/backups/daily/
4. Run restore.sh --all
   - Restores all 21 databases
   - Restores all file volumes
   - Restores OpenClaw state
   - Restores Safety Wrapper secrets registry
   - Restores credentials
5. Verify all tools are healthy
6. Update DNS if IP changed
7. Hub updates server connection record
```

### 5.3 Point-in-Time Recovery

For accidental data deletion by a user:

```
1. Identify the backup date that contains the needed data
2. Restore the specific tool to a temporary container:
   ./restore.sh --tool odoo --date 2026-02-23 --target temp
3. Extract the needed data from the temp container
4. Import the data into the production tool
5. Remove the temp container
```

---

## 6. Monitoring

### 6.1 Uptime Kuma (On-Tenant)

Each tenant VPS runs Uptime Kuma monitoring all local services:

| Monitor | Type | Interval | Alert Threshold |
|---------|------|----------|-----------------|
| nginx | HTTP(S) | 60s | 3 failures |
| Each tool container | HTTP | 120s | 3 failures |
| OpenClaw Gateway | HTTP (port 18789) | 60s | 2 failures |
| Secrets Proxy | HTTP (port 8100) | 60s | 2 failures |
| SSL certificate expiry | Certificate | Daily | 14 days before expiry |
| Disk usage | Push | 300s | >85% |

### 6.2 Hub-Level Monitoring

The Hub monitors all tenant servers centrally:

| Metric | Source | Check Interval | Alert |
|--------|--------|---------------|-------|
| Heartbeat received | Safety Wrapper | Expected every 5 min | Missing >15 min |
| Token usage rate | Safety Wrapper heartbeat | Every heartbeat | >90% pool consumed |
| Backup status | Safety Wrapper (reads backup-status.json) | Daily | Any backup failure |
| Container health | Portainer API (via Hub) | Every 10 min | Container crash/OOM |
| VPS metrics | Netcup SCP API | Every 15 min | CPU >90% sustained, disk >90% |
| OpenClaw version | Safety Wrapper heartbeat | Every heartbeat | Version mismatch with expected |
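
The heartbeat row reduces to a comparison of timestamps. The sketch below illustrates it with a hypothetical `heartbeat_state` helper; the 5-minute and 15-minute values come from the table, while the intermediate `late` state is an assumption added for illustration and is not part of the Hub's documented behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify a tenant by the age of its last heartbeat.
# $1 = epoch seconds of the last heartbeat, $2 = current epoch (defaults to now).

heartbeat_state() {
  local last_seen="$1" now="${2:-$(date +%s)}"
  local age=$(( now - last_seen ))
  if [ "$age" -gt $(( 15 * 60 )) ]; then
    echo "alert"   # missing for more than 15 minutes: raise the Hub alert
  elif [ "$age" -gt $(( 5 * 60 )) ]; then
    echo "late"    # at least one expected 5-minute heartbeat was missed
  else
    echo "ok"
  fi
}
```

For example, `heartbeat_state 1000 2000` classifies a heartbeat that is 1000 seconds old, which is past the 15-minute alert threshold.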

### 6.3 GlitchTip (Error Tracking)

GlitchTip runs on each tenant and captures application errors from:
- OpenClaw (Node.js errors, unhandled rejections)
- Safety Wrapper (hook errors, tool execution failures)
- Tool containers that support Sentry-compatible error reporting

### 6.4 Diun (Container Update Notifications)

Diun monitors all Docker images for new releases:

```yaml
# /opt/letsbe/stacks/diun/docker-compose.yml
watch:
  schedule: "0 6 * * *"  # Check daily at 06:00
notif:
  webhook:
    endpoint: "http://127.0.0.1:8100/webhooks/diun"  # Safety Wrapper
    method: POST
```

The Safety Wrapper receives update notifications and:
1. Logs the available update
2. Reports to Hub via heartbeat
3. Does NOT auto-update (updates require IT Admin agent or manual action)

---

## 7. Maintenance Procedures

### 7.1 Tool Updates

Tool container updates are initiated by the IT Admin agent or manually:

```bash
# 1. Tag the current image so a rollback target exists, then pull the new image
cd /opt/letsbe/stacks/{{tool}}
docker tag {{tool}}:latest {{tool}}:previous
docker compose pull

# 2. Backup the tool's database
/opt/letsbe/scripts/backups.sh --tool {{tool}}

# 3. Recreate the containers with the new image
docker compose up -d --force-recreate

# 4. Verify health check
curl -sf http://127.0.0.1:{{port}}/health

# 5. If health check fails, rollback:
docker compose down
docker tag {{tool}}:previous {{tool}}:latest
docker compose up -d
```
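A single `curl` right after recreation can fail simply because the tool is still starting. One way to allow for that is to poll before deciding to roll back; the sketch below assumes a hypothetical `wait_until` helper and is not part of the provisioner scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and succeed as soon as the command succeeds.

wait_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example (not executed here): give a recreated tool up to a minute to come up.
#   wait_until 12 5 curl -sf "http://127.0.0.1:${PORT}/health" || echo "rollback needed"
```

The attempt count and delay in the example are placeholders; slow-starting tools may need a longer window.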

### 7.2 OpenClaw Updates

OpenClaw is pinned to a tested release tag. Update procedure:

```bash
# 1. Check upstream changelog for breaking changes
# 2. Test in staging VPS first

# 3. On tenant VPS:
cd /opt/letsbe/stacks/openclaw

# 4. Backup OpenClaw state
tar czf /opt/letsbe/backups/openclaw-pre-update.tar.gz ~/.openclaw/

# 5. Update image tag in docker-compose.yml
sed -i 's/openclaw:v2026.2.1/openclaw:v2026.3.0/' docker-compose.yml

# 6. Pull and recreate
docker compose pull && docker compose up -d --force-recreate

# 7. Verify
curl -sf http://127.0.0.1:18789/health
docker exec letsbe-openclaw openclaw --version

# 8. If verification fails, rollback (restore state before restarting):
docker compose down
sed -i 's/openclaw:v2026.3.0/openclaw:v2026.2.1/' docker-compose.yml
tar xzf /opt/letsbe/backups/openclaw-pre-update.tar.gz -C /
docker compose up -d
```

**Update cadence:** Monthly review of upstream changelog. Update only for security fixes or features we need. Never update on Fridays.

### 7.3 SSL Certificate Renewal

Let's Encrypt certificates auto-renew via certbot cron. Manual renewal if needed:

```bash
# --force-renewal reissues even when the certificate is not yet due;
# use it sparingly, since each issuance counts against Let's Encrypt rate limits.
certbot renew --nginx --force-renewal
nginx -t && nginx -s reload
```

### 7.4 Credential Rotation

The IT Admin agent can rotate credentials for any tool:

```bash
# 1. Generate new credential
NEW_PASS=$(openssl rand -base64 32)

# 2. Update the tool's .env file
# (base64 output can contain "/", so use "|" as the sed delimiter)
sed -i "s|^DB_PASSWORD=.*|DB_PASSWORD=$NEW_PASS|" /opt/letsbe/stacks/{{tool}}/.env

# 3. Update the database user's password
docker exec {{tool}}-db psql -c "ALTER USER {{user}} PASSWORD '$NEW_PASS';"

# 4. Recreate the tool container
# (`docker compose restart` does not re-read .env; `up -d` applies the change)
docker compose -f /opt/letsbe/stacks/{{tool}}/docker-compose.yml up -d

# 5. Update the secrets registry
# (Safety Wrapper detects .env change and updates registry automatically)

# 6. Verify tool health
curl -sf http://127.0.0.1:{{port}}/health
```
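Editing `.env` files with `sed` is fragile whenever the value contains characters that are special to `sed`. A sketch of an alternative that never passes the value through a regex follows; `set_env_var` is a hypothetical helper, and the `chmod 600` reflects an assumption that these files should be readable only by their owner.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set KEY=value in a .env file without sed substitution,
# so characters such as / + = & in the value cannot break the edit.
# Note: the updated key is moved to the end of the file.

set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  # Copy every line except the existing KEY= line (grep exits 1 if none remain).
  grep -v -e "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  chmod 600 "$tmp"
  # Renaming within the same directory swaps the file in a single step.
  mv -- "$tmp" "$file"
}

# Example (not executed here):
#   set_env_var /opt/letsbe/stacks/odoo/.env DB_PASSWORD "$(openssl rand -base64 32)"
```

Because the file is replaced rather than edited in place, it ends up owned by the user running the helper; adjust ownership afterwards if the stack expects something else.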

### 7.5 Disk Space Management

When disk usage exceeds 85%:

```bash
# 1. Check disk usage by directory
du -sh /opt/letsbe/stacks/* | sort -rh | head -20
du -sh /opt/letsbe/backups/* | sort -rh

# 2. Clean Docker resources
docker system prune -f      # Remove stopped containers, unused networks
docker image prune -a -f    # Remove unused images
docker volume prune -f      # Remove unused volumes (CAREFUL: verify first)

# 3. Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
# Count container log lines from the last 30 days (reports volume only; deletes nothing)
docker container ls -a --format "{{.Names}}" | xargs -I {} docker logs {} --since 720h 2>/dev/null | wc -l

# 4. Clean old backups (if rotation isn't catching them)
# (-mindepth 1 stops find from matching the daily/ directory itself)
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;

# 5. If still above 85%, recommend tier upgrade to user
```
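The 85% trigger itself can be scripted. The sketch below uses hypothetical helpers, `disk_usage_percent` and `disk_over_threshold`, and relies only on the POSIX output format of `df -P`; it is an illustration, not the check Uptime Kuma performs.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the 85% disk threshold used in this runbook.

# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_usage_percent() {
  # df -P prints a header plus one line per filesystem; column 5 is "Capacity".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage of $1 is strictly above the threshold $2 (default 85).
disk_over_threshold() {
  local path="$1" threshold="${2:-85}" used
  used="$(disk_usage_percent "$path")"
  [ -n "$used" ] || return 2
  [ "$used" -gt "$threshold" ]
}

# Example (not executed here):
#   disk_over_threshold /opt/letsbe && echo "Disk above 85%: start cleanup"
```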

---

## 8. Deprovisioning

### 8.1 Customer Cancellation Flow

```
Customer requests cancellation
    │
    ▼
Hub: 48-hour cooling-off period
    │   (Customer can withdraw the cancellation)
    ▼
Hub: 30-day data export window begins
    │   Customer can:
    │   - Download files via Nextcloud
    │   - Export CRM data via Odoo
    │   - Export email via IMAP
    │   - SSH into server for full access
    │   - Request a full backup via Hub
    ▼
Hub: After 30 days → trigger deprovisioning
    │
    ├── Revoke Safety Wrapper Hub API key
    ├── Stop all containers
    ├── Delete remote backups (rclone purge)
    ├── Request VPS deletion via Netcup API
    │   └── Netcup wipes disk and destroys VPS
    ├── Delete all Netcup snapshots
    ├── Remove DNS records
    └── Hub: soft-delete account data, retain billing records (7 years per HGB §257)
```

### 8.2 Emergency Server Isolation

If a tenant VPS is compromised or abusing the platform:

```bash
# 1. Revoke Hub API key immediately (Hub admin panel)
# 2. SSH into server (port 22022):
ssh -p 22022 letsbe-admin@{{server_ip}}

# 3. Stop the AI runtime
docker stop letsbe-openclaw letsbe-secrets-proxy

# 4. Block outbound traffic
# ufw evaluates rules in order, so an allow added after a blanket deny never
# matches; changing the default policy avoids that. The current SSH session
# survives because ufw's built-in rules accept ESTABLISHED/RELATED packets.
# Note: traffic from Docker containers is forwarded rather than output, so
# stop any container that must not reach the network.
ufw default deny outgoing

# 5. Take a forensic snapshot via Netcup API
# 6. Assess and decide: remediate or deprovision
```

---

## 9. Disaster Recovery

### 9.1 Scenarios

| Scenario | RTO | RPO | Procedure |
|----------|-----|-----|-----------|
| Single container crash | <5 min | 0 (no data loss) | Auto-restart via Docker restart policy |
| Multiple container failure | <30 min | 0 | IT Admin agent investigates, restarts services |
| VPS disk corruption | 2–4 hours | 24 hours (last backup) | New VPS + restore from remote backup |
| VPS total loss | 2–4 hours | 24 hours | New VPS (same region) + restore |
| Netcup data center outage | 4–8 hours | 24 hours | New VPS in alternate region + restore |
| Hub outage | <1 hour | 0 (tenant VPS operates independently) | Hub restart/failover |
| OpenRouter outage | <5 min | 0 | Model fallback chain engages automatically |

### 9.2 Tenant VPS Operates Independently

A key architectural property: **tenant VPS continues operating even if the Hub is down.** The Safety Wrapper operates with its local config, the AI agents continue serving the user, and tools continue running. The Hub is needed only for:
- Billing and subscription management
- Config updates (new agents, autonomy changes)
- Approval queue (if approvals are routed through Hub instead of local)
- Monitoring dashboards

### 9.3 Recovery Testing

**Monthly:** Restore a random tool's database from backup on a staging VPS to verify backup integrity.

**Quarterly:** Full server restore drill — order a new VPS, run complete restore from remote backup, verify all tools and agents are functional.

---

## 10. Security Operations

### 10.1 SSH Access Audit

```bash
# The OpenSSH unit is named "ssh" on Ubuntu/Debian ("sshd" on RHEL-family systems)

# Review successful SSH logins
journalctl -u ssh --since "7 days ago" | grep "Accepted"

# Review failed SSH attempts
journalctl -u ssh --since "7 days ago" | grep "Failed"

# Check fail2ban status
fail2ban-client status sshd
```

### 10.2 Container Security

```bash
# Check for containers running as root (should be minimal)
# An empty user= value means the image default, which is usually root
docker ps --format "{{.Names}}" | xargs -I {} docker inspect {} --format "{{.Name}} user={{.Config.User}}"

# Check for containers with excessive privileges
docker ps --format "{{.Names}}" | xargs -I {} docker inspect {} --format "{{.Name}} privileged={{.HostConfig.Privileged}}"

# Verify network isolation
docker network ls
docker network inspect bridge
```

### 10.3 Vulnerability Scanning

```bash
# Scan Docker images for known vulnerabilities (using Trivy)
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL {{image_name}}

# Scan all running containers (assumes the trivy binary is installed on the host)
docker ps --format "{{.Image}}" | sort -u | while read -r img; do
    trivy image --severity HIGH,CRITICAL "$img"
done
```

### 10.4 Incident Response Checklist

```
[ ] 1. Contain: Isolate affected VPS (Section 8.2)
[ ] 2. Assess: Determine scope (which data, which users affected)
[ ] 3. Preserve: Take forensic snapshot before changes
[ ] 4. Notify: Hub alerts → Matt → customer (within timelines per GDPR Art. 33/34)
[ ] 5. Remediate: Fix the vulnerability, rotate compromised credentials
[ ] 6. Restore: From clean backup if data was corrupted
[ ] 7. Verify: Full health check on all services
[ ] 8. Document: Post-mortem with root cause, timeline, actions taken
[ ] 9. Improve: Update runbook/monitoring to prevent recurrence
```

---

## 11. Common Operations Quick Reference

| Task | Command / Procedure |
|------|---------------------|
| Check all containers | `docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"` |
| Restart a tool | `cd /opt/letsbe/stacks/{{tool}} && docker compose restart` |
| View tool logs | `docker logs --tail 100 -f {{container_name}}` |
| Check disk usage | `df -h /opt/letsbe` |
| Check RAM usage | `free -h` |
| Run manual backup | `/opt/letsbe/scripts/backups.sh` |
| Restore a tool | `/opt/letsbe/scripts/restore.sh --tool {{tool}} --date YYYY-MM-DD` |
| Check SSL expiry | `certbot certificates` |
| Renew SSL | `certbot renew --nginx` |
| Check Safety Wrapper | `curl http://127.0.0.1:8100/health` |
| Check OpenClaw | `curl http://127.0.0.1:18789/health` |
| View backup status | `cat /opt/letsbe/backups/backup-status.json \| jq` |
| Check firewall | `ufw status verbose` |
| Check fail2ban | `fail2ban-client status sshd` |

---

## 12. Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-02-26 | Initial runbook. Covers: Netcup provisioning, 10-step pipeline, credential generation, backup/restore, monitoring stack, maintenance procedures, deprovisioning, disaster recovery, security operations, quick reference. |