# LetsBe Biz — Infrastructure Runbook

**Version:** 1.0
**Date:** February 26, 2026
**Authors:** Matt (Founder), Claude (Architecture)
**Status:** Engineering Spec, Ready for Implementation
**Companion docs:** Technical Architecture v1.2, Tool Catalog v2.2, Security & GDPR Framework v1.1
**Decision refs:** Foundation Document Decisions #18, #27

---

## 1. Purpose

This runbook is the operational reference for provisioning, managing, monitoring, and maintaining LetsBe Biz infrastructure. It covers the full lifecycle, from ordering a VPS through Netcup to deprovisioning a customer's server at account termination.

**Target audience:** Matt (operations), the future engineering team, and the IT Admin AI agent (for self-referencing operational procedures).

---

## 2. Infrastructure Overview

### 2.1 Hosting Provider: Netcup

| Item | Detail |
|------|--------|
| **Provider** | Netcup GmbH (Karlsruhe, Germany) |
| **Product line** | VPS (Virtual Private Server) |
| **EU data center** | Netcup Nürnberg/Karlsruhe, Germany |
| **NA data center** | Netcup Manassas, Virginia, USA |
| **API** | SCP (Server Control Panel) REST API with OAuth2 Device Flow |
| **Hub integration** | Full: server ordering, power actions, metrics, snapshots, rescue mode via `netcupService.ts` |

### 2.2 Server Tiers

| Tier | vCPUs | RAM | Disk | Recommended Tools | Monthly Cost (est.) |
|------|-------|-----|------|-------------------|---------------------|
| Lite (€29) | 4 | 8 GB | 160 GB SSD | 5–8 tools | ~€8–12 |
| Build (€45) | 8 | 16 GB | 320 GB SSD | 10–15 tools | ~€14–18 |
| Scale (€75) | 12 | 32 GB | 640 GB SSD | 15–25 tools | ~€22–28 |
| Enterprise (€109) | 16 | 64 GB | 1.2 TB SSD | 28+ tools | ~€35–45 |

### 2.3 Network Architecture

```
Internet
    │
    ▼
Netcup VPS (public IP)
    │
    ├── Port 80 (HTTP → 301 redirect to HTTPS)
    ├── Port 443 (HTTPS → nginx reverse proxy)
    ├── Port 22022 (SSH, hardened, key-only)
    │
    ▼
nginx (Alpine container)
    │
    ├── *.{{domain}} → Route by subdomain to tool containers
    │   ├── files.{{domain}} → 127.0.0.1:3023 (Nextcloud)
    │   ├── crm.{{domain}} → 127.0.0.1:3025 (Odoo)
    │   ├── chat.{{domain}} → 127.0.0.1:3026 (Chatwoot)
    │   ├── blog.{{domain}} → 127.0.0.1:3029 (Ghost)
    │   ├── mail.{{domain}} → 127.0.0.1:3031 (Stalwart Mail)
    │   ├── ... (33 nginx configs total)
    │   └── status.{{domain}} → 127.0.0.1:3008 (Uptime Kuma)
    │
    └── Internal only (not exposed via nginx):
        ├── 127.0.0.1:18789 (OpenClaw Gateway)
        ├── 127.0.0.1:8100 (Secrets Proxy)
        └── Various internal tool ports
```

---

## 3. Provisioning Pipeline

### 3.1 End-to-End Flow

```
Customer signs up → Stripe payment → Hub creates Order
    │
    ▼
Hub Automation Worker (state machine)
    │
    ├── PAYMENT_CONFIRMED → order VPS from Netcup (if AUTO mode)
    ├── AWAITING_SERVER → poll Netcup until VPS is ready
    ├── SERVER_READY → wait for DNS records
    ├── DNS_PENDING → verify A records for all subdomains
    ├── DNS_READY → trigger provisioning
    ├── PROVISIONING → spawn Docker provisioner container
    │       │
    │       ▼
    │   letsbe-provisioner (10-step pipeline via SSH)
    │       ├── Step 1: System packages (apt update, essentials)
    │       ├── Step 2: Docker CE installation
    │       ├── Step 3: Disable conflicting services
    │       ├── Step 4: nginx + fallback config
    │       ├── Step 5: UFW firewall (80, 443, 22022; mail: 25, 587, 993)
    │       ├── Step 6: Admin user + SSH key (optional)
    │       ├── Step 7: SSH hardening (port 22022, key-only)
    │       ├── Step 8: Unattended security updates
    │       ├── Step 9: Deploy tool stacks (docker-compose)
    │       └── Step 10: Deploy OpenClaw + Safety Wrapper + bootstrap
    │
    ├── FULFILLED → server is live, customer notified
    └── FAILED → retry logic (1min / 5min / 15min backoff, max 3 attempts)
```
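The state progression and retry schedule in the diagram can be sketched as two small lookup functions. This is an illustration only: the function names `next_state` and `retry_delay_seconds` are hypothetical, and the strictly linear ordering of states is an assumption read off the diagram, not the Hub's actual implementation.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; only the state names and delays come from the diagram.

# Print the state that follows the given state when its step succeeds.
next_state() {
  case "$1" in
    PAYMENT_CONFIRMED) echo "AWAITING_SERVER" ;;
    AWAITING_SERVER)   echo "SERVER_READY" ;;
    SERVER_READY)      echo "DNS_PENDING" ;;
    DNS_PENDING)       echo "DNS_READY" ;;
    DNS_READY)         echo "PROVISIONING" ;;
    PROVISIONING)      echo "FULFILLED" ;;
    *)                 return 1 ;;  # FULFILLED, FAILED, or unknown: no successor
  esac
}

# Print the wait in seconds before retry attempt N (1-based).
# Fails once the three documented attempts are used up.
retry_delay_seconds() {
  case "$1" in
    1) echo 60 ;;
    2) echo 300 ;;
    3) echo 900 ;;
    *) return 1 ;;
  esac
}
```

For example, `next_state DNS_READY` prints `PROVISIONING`, and a fourth call to `retry_delay_seconds` fails, which a worker could treat as the signal to leave the order in `FAILED`.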

### 3.2 Provisioner Detail (setup.sh)

**Location:** `letsbe-provisioner/scripts/setup.sh` (~832 lines)

#### Step 1: System Packages

```bash
apt-get update && apt-get upgrade -y
apt-get install -y curl wget gnupg2 ca-certificates lsb-release apt-transport-https \
  software-properties-common unzip jq htop iotop net-tools dnsutils certbot \
  python3-certbot-nginx fail2ban rclone
```

#### Step 2: Docker CE

```bash
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list
apt-get update && apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
systemctl enable --now docker
```

#### Step 3: Disable Conflicting Services

```bash
systemctl stop apache2 2>/dev/null || true
systemctl disable apache2 2>/dev/null || true
systemctl stop postfix 2>/dev/null || true
systemctl disable postfix 2>/dev/null || true
```

#### Step 4: nginx

Deploy the nginx Alpine container with an initial fallback config. SSL certificates are provisioned via certbot after DNS is verified.

#### Step 5: UFW Firewall

```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp      # HTTP
ufw allow 443/tcp     # HTTPS
ufw allow 22022/tcp   # SSH (hardened port)
ufw allow 25/tcp      # SMTP (Stalwart Mail)
ufw allow 587/tcp     # SMTP submission
ufw allow 993/tcp     # IMAPS
ufw --force enable
```

#### Step 6: Admin User

```bash
useradd -m -s /bin/bash -G docker letsbe-admin
mkdir -p /home/letsbe-admin/.ssh
echo "{{admin_ssh_public_key}}" > /home/letsbe-admin/.ssh/authorized_keys
chmod 700 /home/letsbe-admin/.ssh
chmod 600 /home/letsbe-admin/.ssh/authorized_keys
chown -R letsbe-admin:letsbe-admin /home/letsbe-admin/.ssh
```

#### Step 7: SSH Hardening

```bash
# /etc/ssh/sshd_config modifications:
Port 22022
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
LoginGraceTime 30
AllowUsers letsbe-admin
```
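The settings above are listed as target values rather than commands. One way to apply them idempotently is a small helper that rewrites an existing (possibly commented-out) directive or appends it when absent. This is a sketch under assumptions: `set_sshd_option` is a hypothetical name, not a function known to exist in `setup.sh`, and it relies on GNU `sed -i`.

```bash
#!/usr/bin/env bash
# set_sshd_option FILE KEY VALUE (hypothetical helper)
# Rewrites every line that sets KEY, commented out or not, to "KEY VALUE".
# If no such line exists, appends one. Running it twice changes nothing.
set_sshd_option() {
  local file="$1" key="$2" value="$3"
  if grep -qE "^[#[:space:]]*${key}[[:space:]]" "$file"; then
    sed -i -E "s|^[#[:space:]]*${key}[[:space:]].*|${key} ${value}|" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}
```

A typical call would be `set_sshd_option /etc/ssh/sshd_config Port 22022`. Running `sshd -t` afterwards checks the file for syntax errors before the service is restarted, which matters here because a broken config combined with `PasswordAuthentication no` can lock the admin out.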

#### Step 8: Unattended Security Updates

```bash
apt-get install -y unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades
# Configure /etc/apt/apt.conf.d/50unattended-upgrades for security-only updates
```

#### Step 9: Deploy Tool Stacks

For each tool selected by the customer:

```bash
# 1. Generate credentials (env_setup.sh)
#    50+ secrets: database passwords, admin tokens, API keys, JWT secrets
#    Written to /opt/letsbe/env/credentials.env and per-tool .env files

# 2. Deploy Docker Compose stacks
for stack in {{selected_tools}}; do
  cd "/opt/letsbe/stacks/$stack"
  docker compose up -d
done

# 3. Deploy nginx configs per tool
for conf in {{selected_nginx_configs}}; do
  cp "/opt/letsbe/nginx/sites/$conf" /etc/nginx/sites-enabled/
done
nginx -t && nginx -s reload

# 4. Request SSL certificates
#    NOTE: Let's Encrypt issues wildcard certificates only through the DNS-01
#    challenge, which the --nginx plugin does not perform. Either list each
#    subdomain with its own -d flag or use a DNS-based certbot plugin.
certbot --nginx -d "*.{{domain}}" --non-interactive --agree-tos -m "ssl@{{domain}}"
```

#### Step 10: Deploy OpenClaw + Safety Wrapper + Bootstrap

```bash
# 1. Deploy OpenClaw container with Safety Wrapper extension pre-installed
cd /opt/letsbe/stacks/openclaw
docker compose up -d

# 2. Deploy Secrets Proxy
cd /opt/letsbe/stacks/secrets-proxy
docker compose up -d

# 3. Seed secrets registry from credentials.env
docker exec letsbe-openclaw /opt/letsbe/scripts/seed-secrets.sh

# 4. Generate tool-registry.json from deployed tools
docker exec letsbe-openclaw /opt/letsbe/scripts/generate-tool-registry.sh

# 5. Deploy SOUL.md files for each agent
#    (generated from templates with tenant variables substituted)

# 6. Run initial setup browser automations
#    (Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma)

# 7. Register with Hub
docker exec letsbe-openclaw /opt/letsbe/scripts/hub-register.sh

# 8. Clean up config.json (CRITICAL: remove plaintext passwords)
rm -f /opt/letsbe/config.json
```

### 3.3 Credential Generation (env_setup.sh)

**Location:** `letsbe-provisioner/scripts/env_setup.sh` (~678 lines)

Generates 50+ unique credentials per tenant:

| Category | Count | Examples |
|----------|-------|---------|
| Database passwords | 18 | PostgreSQL passwords for each tool with a DB |
| Admin passwords | 12 | Nextcloud admin, Keycloak admin, Odoo admin, etc. |
| API tokens | 10 | NocoDB API token, Ghost admin API key, etc. |
| JWT secrets | 5 | Chatwoot, Cal.com, OpenClaw, etc. |
| Encryption keys | 3 | Safety Wrapper registry key, backup encryption key |
| SSH keys | 2 | Admin key pair, Hub communication key |
| SMTP credentials | 2 | Stalwart Mail admin, relay credentials |

**Generation method:** `openssl rand -base64 32` for passwords, `openssl rand -hex 32` for tokens, `ssh-keygen -t ed25519` for SSH keys.

**Template rendering:** All `{{ variable }}` placeholders in Docker Compose files and nginx configs are substituted with generated values.
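A minimal sketch of the generation and rendering steps described above. The function names (`gen_password`, `gen_token`, `render_var`) and the placeholder `db_password` are hypothetical; only the `openssl` invocations and the `{{ variable }}` syntax come from this section. The real `env_setup.sh` may render templates differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating Section 3.3.

gen_password() { openssl rand -base64 32; }  # 44 chars from A-Z a-z 0-9 + / =
gen_token()    { openssl rand -hex 32; }     # 64 lowercase hex chars

# render_var KEY VALUE < template > rendered
# Replaces "{{ KEY }}" and "{{KEY}}" on every input line using bash pattern
# substitution, so a "/" in VALUE needs no escaping (unlike a sed s/// call).
# Assumes VALUE contains no "&", which holds for base64 and hex output.
render_var() {
  local key="$1" value="$2" line
  while IFS= read -r line || [ -n "$line" ]; do
    line=${line//"{{ $key }}"/$value}
    line=${line//"{{$key}}"/$value}
    printf '%s\n' "$line"
  done
}
```

Usage might look like `render_var db_password "$(gen_password)" < docker-compose.yml.tpl > docker-compose.yml`, where the `.tpl` file name is illustrative only.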

### 3.4 Post-Provisioning Verification

After step 10 completes, the provisioner runs health checks:

```bash
# 1. Verify all containers are running
#    (-a includes exited containers, which plain `docker ps` would hide;
#    the braces here are Docker's Go-template syntax, not provisioner placeholders)
docker ps -a --format "{{.Names}}: {{.Status}}" | grep -v ": Up" && exit 1

# 2. Verify nginx is serving
curl -sf https://{{domain}} > /dev/null || exit 1

# 3. Verify each tool's health endpoint
for tool in {{health_check_urls}}; do
  curl -sf "$tool" > /dev/null || echo "WARNING: $tool not responding"
done

# 4. Verify the Safety Wrapper / Secrets Proxy is responding (port 8100)
curl -sf http://127.0.0.1:8100/health || exit 1

# 5. Verify OpenClaw is responsive
curl -sf http://127.0.0.1:18789/health || exit 1

# 6. Report success to Hub
curl -sf -X PATCH "{{hub_url}}/api/v1/jobs/{{job_id}}" \
  -H "Authorization: Bearer {{runner_token}}" \
  -H "Content-Type: application/json" \
  -d '{"status": "COMPLETED"}'
```
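Containers often need a few seconds after `docker compose up -d` before their health endpoints answer, so a single `curl` can fail on a healthy server. A generic retry wrapper is one way to absorb that. This is a sketch: `retry` is a hypothetical helper, not something confirmed to exist in the provisioner.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Sleeps DELAY_SECONDS between attempts, but not after the last one.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): allow OpenClaw up to about a minute to come up.
# retry 12 5 curl -sf http://127.0.0.1:18789/health > /dev/null || exit 1
```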

---

## 4. Backup System

### 4.1 Backup Architecture

**Location:** `letsbe-provisioner/scripts/backups.sh` (~473 lines)
**Schedule:** Daily via cron at 02:00 server local time
**Retention:** 7 daily backups + 4 weekly backups (rolling)

### 4.2 What Gets Backed Up

| Component | Method | Target |
|-----------|--------|--------|
| PostgreSQL databases (18) | `pg_dump --format=custom` | `/opt/letsbe/backups/daily/` |
| MySQL databases (2) | `mysqldump --single-transaction` | `/opt/letsbe/backups/daily/` |
| MongoDB databases (1) | `mongodump --archive` | `/opt/letsbe/backups/daily/` |
| Nextcloud files | rsync snapshot | `/opt/letsbe/backups/daily/nextcloud/` |
| Docker volumes (critical) | `docker run --volumes-from` tar | `/opt/letsbe/backups/daily/volumes/` |
| nginx configs | tar archive | `/opt/letsbe/backups/daily/nginx/` |
| OpenClaw state | tar of `~/.openclaw/` | `/opt/letsbe/backups/daily/openclaw/` |
| Safety Wrapper state | SQLite backup API | `/opt/letsbe/backups/daily/safety-wrapper/` |
| Credentials | Encrypted tar | `/opt/letsbe/backups/daily/credentials.enc` |

### 4.3 Remote Backup

After local backup completes, `rclone` syncs to a remote destination:

```bash
rclone sync /opt/letsbe/backups/ remote:backups/{{tenant_id}}/ \
  --transfers 4 \
  --checkers 8 \
  --fast-list \
  --log-file /var/log/letsbe/rclone.log
```

Remote destination options (configured per tenant):

- Netcup S3 (default)
- Customer-provided S3 bucket
- Customer-provided rclone remote

### 4.4 Backup Status Reporting

After each backup run, `backups.sh` writes a `backup-status.json`:

```json
{
  "timestamp": "2026-02-26T02:15:00Z",
  "status": "success",
  "duration_seconds": 847,
  "databases_backed_up": 21,
  "files_backed_up": true,
  "remote_sync": "success",
  "total_size_gb": 4.2,
  "errors": []
}
```
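A status file in this shape can be checked from a script with `jq`, which Step 1 of the provisioner installs. The sketch below is an assumption about how a consumer might read the file: `backup_is_healthy` is a hypothetical name, and the rule that all three fields must be clean is illustrative rather than the Safety Wrapper's actual logic.

```bash
#!/usr/bin/env bash
# backup_is_healthy FILE  (hypothetical helper)
# Succeeds only when the last run and the remote sync both report "success"
# and the errors array is empty. A missing or malformed file counts as failure.
backup_is_healthy() {
  jq -e '.status == "success"
         and .remote_sync == "success"
         and (.errors | length == 0)' "$1" > /dev/null 2>&1
}

# Example (not executed here):
# backup_is_healthy /opt/letsbe/backups/backup-status.json || echo "backup needs attention"
```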

The Safety Wrapper monitors this file (Decision #27) and reports status to the Hub via heartbeat.

### 4.5 Backup Rotation

```bash
# Daily: keep last 7
# -mindepth 1 keeps find from ever matching (and deleting) daily/ itself
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;

# Weekly: copy Sunday's backup to weekly/, keep last 4
if [ "$(date +%u)" = "7" ]; then
  cp -a /opt/letsbe/backups/daily/ "/opt/letsbe/backups/weekly/$(date +%Y-%m-%d)/"
fi
find /opt/letsbe/backups/weekly/ -mindepth 1 -maxdepth 1 -mtime +28 -exec rm -rf {} \;
```
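Because rotation deletes data, it can help to preview what a rule would remove before running it. The wrapper below is a sketch with a hypothetical name (`prune_older_than`); it limits matches to direct children of the directory so the directory itself is never a candidate, and it offers a dry-run mode. It relies on GNU `find`.

```bash
#!/usr/bin/env bash
# prune_older_than DIR DAYS [dry-run]  (hypothetical helper)
# Deletes direct children of DIR whose mtime is more than DAYS days old.
# With "dry-run" as the third argument it only prints the candidates.
prune_older_than() {
  local dir="$1" days="$2" mode="${3:-delete}"
  if [ "$mode" = "dry-run" ]; then
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -print
  else
    find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -exec rm -rf {} +
  fi
}

# Example (not executed here):
# prune_older_than /opt/letsbe/backups/daily 7 dry-run
```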

---

## 5. Restore Procedures

### 5.1 Per-Tool Restore

**Location:** `letsbe-provisioner/scripts/restore.sh` (~512 lines)

```bash
# Restore a specific tool's database from a daily backup
./restore.sh --tool nextcloud --date 2026-02-25

# Steps:
# 1. Stop the tool container
# 2. Restore database from backup
# 3. Restore files (if applicable)
# 4. Start the tool container
# 5. Verify health check
# 6. Report to Hub
```

### 5.2 Full Server Restore

For complete server recovery (e.g., VPS failure):

```
1. Order new VPS from Netcup (same region, same tier)
2. Run provisioner with --restore flag
   - Steps 1-8: Standard server setup
   - Step 9: Deploy tool stacks (empty)
   - Step 10: Deploy OpenClaw + Safety Wrapper
3. Restore from remote backup (Section 4.3 syncs the local backups/ tree,
   so the remote holds daily/ and weekly/ subdirectories):
   rclone sync remote:backups/{{tenant_id}}/daily/ /opt/letsbe/backups/daily/
4. Run restore.sh --all
   - Restores all 21 databases
   - Restores all file volumes
   - Restores OpenClaw state
   - Restores Safety Wrapper secrets registry
   - Restores credentials
5. Verify all tools are healthy
6. Update DNS if IP changed
7. Hub updates server connection record
```

### 5.3 Point-in-Time Recovery

For accidental data deletion by a user:

```
1. Identify the backup date that contains the needed data
2. Restore the specific tool to a temporary container:
   ./restore.sh --tool odoo --date 2026-02-23 --target temp
3. Extract the needed data from the temp container
4. Import the data into the production tool
5. Remove the temp container
```

---

## 6. Monitoring

### 6.1 Uptime Kuma (On-Tenant)

Each tenant VPS runs Uptime Kuma monitoring all local services:

| Monitor | Type | Interval | Alert Threshold |
|---------|------|----------|-----------------|
| nginx | HTTP(S) | 60s | 3 failures |
| Each tool container | HTTP | 120s | 3 failures |
| OpenClaw Gateway | HTTP (port 18789) | 60s | 2 failures |
| Secrets Proxy | HTTP (port 8100) | 60s | 2 failures |
| SSL certificate expiry | Certificate | Daily | 14 days before expiry |
| Disk usage | Push | 300s | >85% |

### 6.2 Hub-Level Monitoring

The Hub monitors all tenant servers centrally:

| Metric | Source | Check Interval | Alert |
|--------|--------|---------------|-------|
| Heartbeat received | Safety Wrapper | Expected every 5 min | Missing >15 min |
| Token usage rate | Safety Wrapper heartbeat | Every heartbeat | >90% pool consumed |
| Backup status | Safety Wrapper (reads backup-status.json) | Daily | Any backup failure |
| Container health | Portainer API (via Hub) | Every 10 min | Container crash/OOM |
| VPS metrics | Netcup SCP API | Every 15 min | CPU >90% sustained, disk >90% |
| OpenClaw version | Safety Wrapper heartbeat | Every heartbeat | Version mismatch with expected |

### 6.3 GlitchTip (Error Tracking)

GlitchTip runs on each tenant and captures application errors from:

- OpenClaw (Node.js errors, unhandled rejections)
- Safety Wrapper (hook errors, tool execution failures)
- Tool containers that support Sentry-compatible error reporting

### 6.4 Diun (Container Update Notifications)

Diun monitors all Docker images for new releases:

```yaml
# /opt/letsbe/stacks/diun/docker-compose.yml
watch:
  schedule: "0 6 * * *"   # Check daily at 06:00
notif:
  webhook:
    endpoint: "http://127.0.0.1:8100/webhooks/diun"   # Safety Wrapper
    method: POST
```

The Safety Wrapper receives update notifications and:

1. Logs the available update
2. Reports to Hub via heartbeat
3. Does NOT auto-update (updates require IT Admin agent or manual action)

---

## 7. Maintenance Procedures

### 7.1 Tool Updates

Tool container updates are initiated by the IT Admin agent or manually:

```bash
# 1. Tag the running image so step 5 has something to roll back to,
#    then pull the new image
cd /opt/letsbe/stacks/{{tool}}
docker tag {{tool}}:latest {{tool}}:previous
docker compose pull

# 2. Back up the tool's database
/opt/letsbe/scripts/backups.sh --tool {{tool}}

# 3. Recreate the containers on the new image
docker compose up -d --force-recreate

# 4. Verify health check
curl -sf http://127.0.0.1:{{port}}/health

# 5. If health check fails, rollback:
docker compose down
docker tag {{tool}}:previous {{tool}}:latest
docker compose up -d
```

### 7.2 OpenClaw Updates

OpenClaw is pinned to a tested release tag. Update procedure:

```bash
# 1. Check upstream changelog for breaking changes
# 2. Test in staging VPS first

# 3. On tenant VPS:
cd /opt/letsbe/stacks/openclaw

# 4. Backup OpenClaw state
tar czf /opt/letsbe/backups/openclaw-pre-update.tar.gz ~/.openclaw/

# 5. Update image tag in docker-compose.yml
sed -i 's/openclaw:v2026.2.1/openclaw:v2026.3.0/' docker-compose.yml

# 6. Pull and recreate
docker compose pull && docker compose up -d --force-recreate

# 7. Verify
curl -sf http://127.0.0.1:18789/health
docker exec letsbe-openclaw openclaw --version

# 8. If verification fails, rollback:
docker compose down
sed -i 's/openclaw:v2026.3.0/openclaw:v2026.2.1/' docker-compose.yml
docker compose up -d
tar xzf /opt/letsbe/backups/openclaw-pre-update.tar.gz -C /
```

**Update cadence:** Monthly review of upstream changelog. Update only for security fixes or features we need. Never update on Fridays.
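
### 7.3 SSL Certificate Renewal

Let's Encrypt certificates auto-renew via certbot cron. To renew manually:

```bash
# --force-renewal reissues every certificate regardless of expiry. Use it
# sparingly: Let's Encrypt rate-limits issuance. Without the flag, only
# certificates that are close to expiry are renewed.
certbot renew --nginx --force-renewal
nginx -t && nginx -s reload
```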

### 7.4 Credential Rotation

The IT Admin agent can rotate credentials for any tool:

```bash
# 1. Generate new credential
NEW_PASS=$(openssl rand -base64 32)

# 2. Update the tool's .env file
#    ('|' is the sed delimiter because base64 output can contain '/',
#    which would break an s/.../.../ expression)
sed -i "s|^DB_PASSWORD=.*|DB_PASSWORD=$NEW_PASS|" /opt/letsbe/stacks/{{tool}}/.env

# 3. Update the database user's password
docker exec {{tool}}-db psql -c "ALTER USER {{user}} PASSWORD '$NEW_PASS';"

# 4. Recreate the tool container
#    (`docker compose restart` reuses the old environment and would not
#    pick up the changed .env file)
docker compose -f /opt/letsbe/stacks/{{tool}}/docker-compose.yml up -d --force-recreate

# 5. Update the secrets registry
#    (Safety Wrapper detects .env change and updates registry automatically)

# 6. Verify tool health
curl -sf http://127.0.0.1:{{port}}/health
```
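Editing `.env` files with `sed` depends on the new value not colliding with the chosen delimiter or with `&`. An alternative that avoids pattern substitution entirely is to filter out the old assignment and append the new one. This is a sketch: `set_env_var` is a hypothetical helper, and `chmod --reference` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# set_env_var FILE KEY VALUE  (hypothetical helper)
# Replaces any existing KEY=... line in FILE with KEY=VALUE, or adds it.
# VALUE is written verbatim, so characters such as / + & = are safe.
# The new assignment ends up on the last line of the file.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp "${file}.XXXXXX")" || return 1
  # Copy everything except old assignments; grep exits 1 if nothing is left.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Keep the original permissions; fall back to owner-only access.
  chmod --reference="$file" "$tmp" 2>/dev/null || chmod 600 "$tmp"
  mv "$tmp" "$file"
}
```

With this helper, step 2 could read `set_env_var /opt/letsbe/stacks/{{tool}}/.env DB_PASSWORD "$NEW_PASS"`. Writing to a temporary file and renaming it means a crash mid-write never leaves a half-written `.env` behind.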

### 7.5 Disk Space Management

When disk usage exceeds 85%:

```bash
# 1. Check disk usage by directory
du -sh /opt/letsbe/stacks/* | sort -rh | head -20
du -sh /opt/letsbe/backups/* | sort -rh

# 2. Clean Docker resources
docker system prune -f     # Remove stopped containers, unused networks
docker image prune -a -f   # Remove unused images
docker volume prune -f     # Remove unused volumes (CAREFUL: verify first)

# 3. Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
# Count container log lines from the last 30 days.
# This only measures log volume; it deletes nothing.
docker container ls -a --format "{{.Names}}" | xargs -I {} docker logs {} --since 720h 2>/dev/null | wc -l

# 4. Clean old backups (if rotation isn't catching them)
#    -mindepth 1 keeps find from matching daily/ itself
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;

# 5. If still above 85%, recommend tier upgrade to user
```
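The 85% trigger can be checked from a script before starting the cleanup. The helpers below are a sketch with hypothetical names (`disk_usage_percent`, `over_threshold`); they read the POSIX-format output of `df -P`, whose fifth column is the capacity percentage.

```bash
#!/usr/bin/env bash
# disk_usage_percent PATH  (hypothetical helper)
# Prints the integer use% of the filesystem that holds PATH.
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# over_threshold PERCENT LIMIT  (hypothetical helper)
# Succeeds when PERCENT is strictly above LIMIT.
over_threshold() {
  [ "$1" -gt "$2" ]
}

# Example (not executed here):
# usage=$(disk_usage_percent /opt/letsbe)
# over_threshold "$usage" 85 && echo "disk at ${usage}%: start cleanup"
```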

---

## 8. Deprovisioning

### 8.1 Customer Cancellation Flow

```
Customer requests cancellation
    │
    ▼
Hub: 48-hour cooling-off period
    │   (Customer can withdraw the cancellation request)
    ▼
Hub: 30-day data export window begins
    │   Customer can:
    │   - Download files via Nextcloud
    │   - Export CRM data via Odoo
    │   - Export email via IMAP
    │   - SSH into server for full access
    │   - Request a full backup via Hub
    ▼
Hub: After 30 days → trigger deprovisioning
    │
    ├── Revoke Safety Wrapper Hub API key
    ├── Stop all containers
    ├── Delete remote backups (rclone purge)
    ├── Request VPS deletion via Netcup API
    │   └── Netcup wipes disk and destroys VPS
    ├── Delete all Netcup snapshots
    ├── Remove DNS records
    └── Hub: soft-delete account data, retain billing records (7 years per HGB §257)
```
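The timeline above implies an earliest deprovisioning date for any cancellation request. The sketch below computes it, assuming the 30-day export window starts only after the 48-hour cooling-off period ends (so 32 days in total), that dates are handled in UTC, and that GNU `date` is available. `deprovision_date` is a hypothetical name, not a Hub function.

```bash
#!/usr/bin/env bash
# deprovision_date REQUEST_DATE  (hypothetical helper)
# REQUEST_DATE is YYYY-MM-DD, interpreted as midnight UTC.
# Prints the earliest date on which deprovisioning may be triggered:
# 2 days of cooling-off plus 30 days of export window.
deprovision_date() {
  local start
  start="$(date -u -d "${1}T00:00:00Z" +%s)" || return 1
  date -u -d "@$(( start + (2 + 30) * 86400 ))" +%F
}
```

For a request dated 2026-02-26 this prints `2026-03-30`. Working in epoch seconds sidesteps month-length and year-boundary arithmetic.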

### 8.2 Emergency Server Isolation

If a tenant VPS is compromised or abusing the platform:

```bash
# 1. Revoke Hub API key immediately (Hub admin panel)
# 2. SSH into server (port 22022):
ssh -p 22022 letsbe-admin@{{server_ip}}

# 3. Stop the AI runtime
docker stop letsbe-openclaw letsbe-secrets-proxy

# 4. Block outbound traffic (except SSH)
#    UFW evaluates rules in the order they were added, so a blanket
#    "deny out" rule followed by an allow rule would never reach the allow.
#    Changing the default policy blocks new outbound connections, while
#    replies on the established inbound SSH session (port 22022) still pass.
#    Container traffic is forwarded through Docker's own iptables rules and
#    may not be covered; stop remaining containers if they must be cut off.
ufw default deny outgoing

# 5. Take a forensic snapshot via Netcup API
# 6. Assess and decide: remediate or deprovision
```

---

## 9. Disaster Recovery

### 9.1 Scenarios

| Scenario | RTO | RPO | Procedure |
|----------|-----|-----|-----------|
| Single container crash | <5 min | 0 (no data loss) | Auto-restart via Docker restart policy |
| Multiple container failure | <30 min | 0 | IT Admin agent investigates, restarts services |
| VPS disk corruption | 2–4 hours | 24 hours (last backup) | New VPS + restore from remote backup |
| VPS total loss | 2–4 hours | 24 hours | New VPS (same region) + restore |
| Netcup data center outage | 4–8 hours | 24 hours | New VPS in alternate region + restore |
| Hub outage | <1 hour | 0 (tenant VPS operates independently) | Hub restart/failover |
| OpenRouter outage | <5 min | 0 | Model fallback chain engages automatically |

### 9.2 Tenant VPS Operates Independently

A key architectural property: **a tenant VPS continues operating even if the Hub is down.** The Safety Wrapper operates with its local config, the AI agents continue serving the user, and tools continue running. The Hub is needed only for:

- Billing and subscription management
- Config updates (new agents, autonomy changes)
- Approval queue (if approvals are routed through Hub instead of local)
- Monitoring dashboards

### 9.3 Recovery Testing

**Monthly:** Restore a random tool's database from backup on a staging VPS to verify backup integrity.

**Quarterly:** Full server restore drill: order a new VPS, run a complete restore from remote backup, and verify that all tools and agents are functional.

---

## 10. Security Operations

### 10.1 SSH Access Audit

```bash
# The OpenSSH unit is named "ssh" on Ubuntu/Debian ("sshd" on RHEL-family systems)

# Review successful SSH logins
journalctl -u ssh --since "7 days ago" | grep "Accepted"

# Review failed SSH attempts
journalctl -u ssh --since "7 days ago" | grep "Failed"

# Check fail2ban status
fail2ban-client status sshd
```

### 10.2 Container Security

```bash
# Check for containers running as root (should be minimal)
# An empty user field means the image default, which is usually root
docker ps -q | xargs -r docker inspect --format "{{.Name}}: user={{.Config.User}}"

# Check for containers with excessive privileges
docker ps -q | xargs -r docker inspect --format "{{.Name}}: privileged={{.HostConfig.Privileged}}"

# Verify network isolation
docker network ls
docker network inspect bridge
```

### 10.3 Vulnerability Scanning

```bash
# Scan Docker images for known vulnerabilities (using Trivy)
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  aquasec/trivy image --severity HIGH,CRITICAL {{image_name}}

# Scan all running containers
# (uses the same containerized Trivy, so no host install is required)
docker ps --format "{{.Image}}" | sort -u | while read -r img; do
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL "$img"
done
```

### 10.4 Incident Response Checklist

```
[ ] 1. Contain: Isolate affected VPS (Section 8.2)
[ ] 2. Assess: Determine scope (which data, which users affected)
[ ] 3. Preserve: Take forensic snapshot before changes
[ ] 4. Notify: Hub alerts → Matt → customer (within timelines per GDPR Art. 33/34)
[ ] 5. Remediate: Fix the vulnerability, rotate compromised credentials
[ ] 6. Restore: From clean backup if data was corrupted
[ ] 7. Verify: Full health check on all services
[ ] 8. Document: Post-mortem with root cause, timeline, actions taken
[ ] 9. Improve: Update runbook/monitoring to prevent recurrence
```

---

## 11. Common Operations Quick Reference

| Task | Command / Procedure |
|------|---------------------|
| Check all containers | `docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"` |
| Restart a tool | `cd /opt/letsbe/stacks/{{tool}} && docker compose restart` |
| View tool logs | `docker logs --tail 100 -f {{container_name}}` |
| Check disk usage | `df -h /opt/letsbe` |
| Check RAM usage | `free -h` |
| Run manual backup | `/opt/letsbe/scripts/backups.sh` |
| Restore a tool | `/opt/letsbe/scripts/restore.sh --tool {{tool}} --date YYYY-MM-DD` |
| Check SSL expiry | `certbot certificates` |
| Renew SSL | `certbot renew --nginx` |
| Check Safety Wrapper | `curl http://127.0.0.1:8100/health` |
| Check OpenClaw | `curl http://127.0.0.1:18789/health` |
| View backup status | `cat /opt/letsbe/backups/backup-status.json \| jq` |
| Check firewall | `ufw status verbose` |
| Check fail2ban | `fail2ban-client status sshd` |

---

## 12. Changelog

| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-02-26 | Initial runbook. Covers: Netcup provisioning, 10-step pipeline, credential generation, backup/restore, monitoring stack, maintenance procedures, deprovisioning, disaster recovery, security operations, quick reference. |