# LetsBe Biz — Infrastructure Runbook
**Version:** 1.0
**Date:** February 26, 2026
**Authors:** Matt (Founder), Claude (Architecture)
**Status:** Engineering Spec — Ready for Implementation
**Companion docs:** Technical Architecture v1.2, Tool Catalog v2.2, Security & GDPR Framework v1.1
**Decision refs:** Foundation Document Decisions #18, #27

---
## 1. Purpose
This runbook is the operational reference for provisioning, managing, monitoring, and maintaining LetsBe Biz infrastructure. It covers the full lifecycle: from ordering a VPS through Netcup to deprovisioning a customer's server at account termination.
**Target audience:** Matt (operations), future engineering team, and the IT Admin AI agent (for self-referencing operational procedures).

---
## 2. Infrastructure Overview
### 2.1 Hosting Provider: Netcup
| Item | Detail |
|------|--------|
| **Provider** | Netcup GmbH (Karlsruhe, Germany) |
| **Product line** | VPS (Virtual Private Server) |
| **EU data center** | Netcup Nürnberg/Karlsruhe, Germany |
| **NA data center** | Netcup Manassas, Virginia, USA |
| **API** | SCP (Server Control Panel) REST API with OAuth2 Device Flow |
| **Hub integration** | Full — server ordering, power actions, metrics, snapshots, rescue mode via `netcupService.ts` |
### 2.2 Server Tiers
| Tier | vCPUs | RAM | Disk | Recommended Tools | Monthly Cost (est.) |
|------|-------|-----|------|-------------------|---------------------|
| Lite (€29) | 4 | 8 GB | 160 GB SSD | 5–8 tools | ~€8–12 |
| Build (€45) | 8 | 16 GB | 320 GB SSD | 10–15 tools | ~€14–18 |
| Scale (€75) | 12 | 32 GB | 640 GB SSD | 15–25 tools | ~€22–28 |
| Enterprise (€109) | 16 | 64 GB | 1.2 TB SSD | 28+ tools | ~€35–45 |
### 2.3 Network Architecture
```
Internet
    │
    ▼
Netcup VPS (public IP)
├── Port 80 (HTTP → 301 redirect to HTTPS)
├── Port 443 (HTTPS → nginx reverse proxy)
└── Port 22022 (SSH — hardened, key-only)
    │
    ▼
nginx (Alpine container)
├── *.{{domain}} → Route by subdomain to tool containers
│   ├── files.{{domain}} → 127.0.0.1:3023 (Nextcloud)
│   ├── crm.{{domain}} → 127.0.0.1:3025 (Odoo)
│   ├── chat.{{domain}} → 127.0.0.1:3026 (Chatwoot)
│   ├── blog.{{domain}} → 127.0.0.1:3029 (Ghost)
│   ├── mail.{{domain}} → 127.0.0.1:3031 (Stalwart Mail)
│   ├── ... (33 nginx configs total)
│   └── status.{{domain}} → 127.0.0.1:3008 (Uptime Kuma)
└── Internal only (not exposed via nginx):
    ├── 127.0.0.1:18789 (OpenClaw Gateway)
    ├── 127.0.0.1:8100 (Secrets Proxy)
    └── Various internal tool ports
```
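
Each subdomain route in the diagram maps to one nginx server block that proxies to a loopback port. The sketch below shows how such a block could be generated, for example with `render_vhost files example.com 3023 > files.conf`. The function name `render_vhost` and the specific proxy headers are illustrative assumptions, not the contents of the 33 shipped configs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a plain-HTTP nginx server block for one tool.
# certbot --nginx can later add the TLS directives to a block like this.
render_vhost() {
  local subdomain=$1 domain=$2 port=$3
  cat <<EOF
server {
    listen 80;
    server_name ${subdomain}.${domain};

    location / {
        proxy_pass http://127.0.0.1:${port};
        proxy_set_header Host \$host;
        proxy_set_header X-Real-IP \$remote_addr;
        proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto \$scheme;
    }
}
EOF
}
```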
---
## 3. Provisioning Pipeline
### 3.1 End-to-End Flow
```
Customer signs up → Stripe payment → Hub creates Order
    │
    ▼
Hub Automation Worker (state machine)
├── PAYMENT_CONFIRMED → order VPS from Netcup (if AUTO mode)
├── AWAITING_SERVER → poll Netcup until VPS is ready
├── SERVER_READY → wait for DNS records
├── DNS_PENDING → verify A records for all subdomains
├── DNS_READY → trigger provisioning
├── PROVISIONING → spawn Docker provisioner container
│ │
│ ▼
│ letsbe-provisioner (10-step pipeline via SSH)
│ ├── Step 1: System packages (apt update, essentials)
│ ├── Step 2: Docker CE installation
│ ├── Step 3: Disable conflicting services
│ ├── Step 4: nginx + fallback config
│ ├── Step 5: UFW firewall (80, 443, 22022, plus mail ports 25/587/993)
│ ├── Step 6: Admin user + SSH key (optional)
│ ├── Step 7: SSH hardening (port 22022, key-only)
│ ├── Step 8: Unattended security updates
│ ├── Step 9: Deploy tool stacks (docker-compose)
│ └── Step 10: Deploy OpenClaw + Safety Wrapper + bootstrap
├── FULFILLED → server is live, customer notified
└── FAILED → retry logic (1min / 5min / 15min backoff, max 3 attempts)
```
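
The retry schedule in the `FAILED` state can be expressed as a small lookup. This is a sketch only: the Hub worker's real implementation is not shown in this runbook, the name `retry_delay_seconds` is hypothetical, and it assumes "max 3 attempts" means three retries with one delay each.

```bash
#!/usr/bin/env bash
# Delay before retry N (1-based), per the FAILED state: 1 min, 5 min, 15 min.
# Prints the delay in seconds; returns non-zero once the 3 attempts are used up.
retry_delay_seconds() {
  local attempt=$1
  local delays=(60 300 900)
  if (( attempt < 1 || attempt > ${#delays[@]} )); then
    return 1
  fi
  echo "${delays[attempt - 1]}"
}
```

A caller would sleep for `$(retry_delay_seconds "$n")` seconds before retry `n` and mark the order as permanently failed when the function returns non-zero.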
### 3.2 Provisioner Detail (setup.sh)
**Location:** `letsbe-provisioner/scripts/setup.sh` (~832 lines)
#### Step 1: System Packages
```bash
apt-get update && apt-get upgrade -y
apt-get install -y curl wget gnupg2 ca-certificates lsb-release apt-transport-https \
software-properties-common unzip jq htop iotop net-tools dnsutils certbot \
python3-certbot-nginx fail2ban rclone
```
#### Step 2: Docker CE
```bash
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" > /etc/apt/sources.list.d/docker.list
apt-get update && apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
systemctl enable --now docker
```
#### Step 3: Disable Conflicting Services
```bash
systemctl stop apache2 2>/dev/null || true
systemctl disable apache2 2>/dev/null || true
systemctl stop postfix 2>/dev/null || true
systemctl disable postfix 2>/dev/null || true
```
#### Step 4: nginx
Deploy the nginx Alpine container with an initial fallback config. SSL certificates are provisioned via certbot after DNS is verified.
#### Step 5: UFW Firewall
```bash
ufw default deny incoming
ufw default allow outgoing
ufw allow 80/tcp # HTTP
ufw allow 443/tcp # HTTPS
ufw allow 22022/tcp # SSH (hardened port)
ufw allow 25/tcp # SMTP (Stalwart Mail)
ufw allow 587/tcp # SMTP submission
ufw allow 993/tcp # IMAPS
ufw --force enable
```
#### Step 6: Admin User
```bash
useradd -m -s /bin/bash -G docker letsbe-admin
mkdir -p /home/letsbe-admin/.ssh
echo "{{admin_ssh_public_key}}" > /home/letsbe-admin/.ssh/authorized_keys
chmod 700 /home/letsbe-admin/.ssh
chmod 600 /home/letsbe-admin/.ssh/authorized_keys
chown -R letsbe-admin:letsbe-admin /home/letsbe-admin/.ssh
```
#### Step 7: SSH Hardening
```bash
# /etc/ssh/sshd_config modifications:
Port 22022
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
LoginGraceTime 30
AllowUsers letsbe-admin
```
#### Step 8: Unattended Security Updates
```bash
apt-get install -y unattended-upgrades
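# Preseed the answer so the reconfigure step never prompts over a non-interactive SSH session
echo "unattended-upgrades unattended-upgrades/enable_auto_updates boolean true" | debconf-set-selections
dpkg-reconfigure -f noninteractive unattended-upgrades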
# Configure /etc/apt/apt.conf.d/50unattended-upgrades for security-only updates
```
#### Step 9: Deploy Tool Stacks
For each tool selected by the customer:
```bash
# 1. Generate credentials (env_setup.sh)
# 50+ secrets: database passwords, admin tokens, API keys, JWT secrets
# Written to /opt/letsbe/env/credentials.env and per-tool .env files
# 2. Deploy Docker Compose stacks
for stack in {{selected_tools}}; do
cd /opt/letsbe/stacks/$stack
docker compose up -d
done
# 3. Deploy nginx configs per tool
for conf in {{selected_nginx_configs}}; do
cp /opt/letsbe/nginx/sites/$conf /etc/nginx/sites-enabled/
done
nginx -t && nginx -s reload
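# 4. Request SSL certificates (one -d per deployed subdomain; the nginx plugin
#    validates over HTTP-01, which cannot issue a wildcard "*.{{domain}}" certificate)
certbot --nginx -d "files.{{domain}}" -d "crm.{{domain}}" --non-interactive --agree-tos -m "ssl@{{domain}}"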
```
#### Step 10: Deploy OpenClaw + Safety Wrapper + Bootstrap
```bash
# 1. Deploy OpenClaw container with Safety Wrapper extension pre-installed
cd /opt/letsbe/stacks/openclaw
docker compose up -d
# 2. Deploy Secrets Proxy
cd /opt/letsbe/stacks/secrets-proxy
docker compose up -d
# 3. Seed secrets registry from credentials.env
docker exec letsbe-openclaw /opt/letsbe/scripts/seed-secrets.sh
# 4. Generate tool-registry.json from deployed tools
docker exec letsbe-openclaw /opt/letsbe/scripts/generate-tool-registry.sh
# 5. Deploy SOUL.md files for each agent
# (generated from templates with tenant variables substituted)
# 6. Run initial setup browser automations
# (Cal.com, Chatwoot, Keycloak, Nextcloud, Stalwart Mail, Umami, Uptime Kuma)
# 7. Register with Hub
docker exec letsbe-openclaw /opt/letsbe/scripts/hub-register.sh
# 8. Clean up config.json (CRITICAL: remove plaintext passwords)
rm -f /opt/letsbe/config.json
```
### 3.3 Credential Generation (env_setup.sh)
**Location:** `letsbe-provisioner/scripts/env_setup.sh` (~678 lines)
Generates 50+ unique credentials per tenant:
| Category | Count | Examples |
|----------|-------|---------|
| Database passwords | 18 | PostgreSQL passwords for each tool with a DB |
| Admin passwords | 12 | Nextcloud admin, Keycloak admin, Odoo admin, etc. |
| API tokens | 10 | NocoDB API token, Ghost admin API key, etc. |
| JWT secrets | 5 | Chatwoot, Cal.com, OpenClaw, etc. |
| Encryption keys | 3 | Safety Wrapper registry key, backup encryption key |
| SSH keys | 2 | Admin key pair, Hub communication key |
| SMTP credentials | 2 | Stalwart Mail admin, relay credentials |
**Generation method:** `openssl rand -base64 32` for passwords, `openssl rand -hex 32` for tokens, `ssh-keygen -t ed25519` for SSH keys.
**Template rendering:** All `{{ variable }}` placeholders in Docker Compose files and nginx configs are substituted with generated values.
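
A minimal sketch of the generation and substitution steps follows. `env_setup.sh` itself is not reproduced in this runbook, so the helper names (`gen_password`, `gen_token`, `render_var`) are hypothetical. The point it illustrates is that base64 output can contain `/`, `+`, and `=`, so the substituted value has to be escaped before it reaches `sed`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; env_setup.sh itself is not shown in this runbook.

# One secret per call, using the commands named above.
gen_password() { openssl rand -base64 32; }
gen_token()    { openssl rand -hex 32; }

# Replace every "{{ name }}" (with or without inner spaces) read from stdin.
render_var() {
  local name=$1 value=$2 escaped
  # Escape characters that are special in a sed replacement: \ & and the | delimiter.
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  sed -e "s|{{ *${name} *}}|${escaped}|g"
}
```

For example, `render_var db_password "$(gen_password)" < docker-compose.yml.tpl` would emit the rendered file on stdout (the `.tpl` file name is an assumption).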
### 3.4 Post-Provisioning Verification
After step 10 completes, the provisioner runs health checks:
```bash
# 1. Verify all containers are running (-a includes exited containers, which plain `docker ps` hides)
docker ps -a --format "{{.Names}}: {{.Status}}" | grep -v ": Up" && exit 1
# 2. Verify nginx is serving
curl -sf https://{{domain}} > /dev/null || exit 1
# 3. Verify each tool's health endpoint
for tool in {{health_check_urls}}; do
curl -sf "$tool" > /dev/null || echo "WARNING: $tool not responding"
done
# 4. Verify Safety Wrapper registered with Hub
curl -sf http://127.0.0.1:8100/health || exit 1
# 5. Verify OpenClaw is responsive
curl -sf http://127.0.0.1:18789/health || exit 1
# 6. Report success to Hub
curl -sf -X PATCH "{{hub_url}}/api/v1/jobs/{{job_id}}" \
  -H "Authorization: Bearer {{runner_token}}" \
  -H "Content-Type: application/json" \
  -d '{"status": "COMPLETED"}'
```
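
Freshly started containers may need some time before their health endpoints respond, so a single `curl` can fail spuriously. One way to handle this, sketched under the assumption that a bounded retry is acceptable, is a small wrapper such as the hypothetical `retry_until` below. A check like `retry_until 10 6 curl -sf http://127.0.0.1:18789/health` would then wait up to about a minute before failing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```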
---
## 4. Backup System
### 4.1 Backup Architecture
**Location:** `letsbe-provisioner/scripts/backups.sh` (~473 lines)
**Schedule:** Daily via cron at 02:00 server local time
**Retention:** 7 daily backups + 4 weekly backups (rolling)
### 4.2 What Gets Backed Up
| Component | Method | Target |
|-----------|--------|--------|
| PostgreSQL databases (18) | `pg_dump --format=custom` | `/opt/letsbe/backups/daily/` |
| MySQL databases (2) | `mysqldump --single-transaction` | `/opt/letsbe/backups/daily/` |
| MongoDB databases (1) | `mongodump --archive` | `/opt/letsbe/backups/daily/` |
| Nextcloud files | rsync snapshot | `/opt/letsbe/backups/daily/nextcloud/` |
| Docker volumes (critical) | `docker run --volumes-from` tar | `/opt/letsbe/backups/daily/volumes/` |
| nginx configs | tar archive | `/opt/letsbe/backups/daily/nginx/` |
| OpenClaw state | tar of `~/.openclaw/` | `/opt/letsbe/backups/daily/openclaw/` |
| Safety Wrapper state | SQLite backup API | `/opt/letsbe/backups/daily/safety-wrapper/` |
| Credentials | Encrypted tar | `/opt/letsbe/backups/daily/credentials.enc` |
### 4.3 Remote Backup
After local backup completes, `rclone` syncs to a remote destination:
```bash
rclone sync /opt/letsbe/backups/ remote:backups/{{tenant_id}}/ \
--transfers 4 \
--checkers 8 \
--fast-list \
--log-file /var/log/letsbe/rclone.log
```
Remote destination options (configured per tenant):
- Netcup S3 (default)
- Customer-provided S3 bucket
- Customer-provided rclone remote
### 4.4 Backup Status Reporting
After each backup run, `backups.sh` writes a `backup-status.json`:
```json
{
"timestamp": "2026-02-26T02:15:00Z",
"status": "success",
"duration_seconds": 847,
"databases_backed_up": 21,
"files_backed_up": true,
"remote_sync": "success",
"total_size_gb": 4.2,
"errors": []
}
```
The Safety Wrapper monitors this file (Decision #27) and reports status to the Hub via heartbeat.
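
As an illustration of how a script could emit a document of this shape, the sketch below prints the JSON with a heredoc. The function name `backup_status_json` and its argument order are assumptions; the real `backups.sh` may build the file differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a status document in the shape shown above.
# Arguments are assumed to be pre-validated (numbers for numeric fields,
# a plain word for status and remote_sync), so no JSON escaping is done here.
backup_status_json() {
  local status=$1 duration=$2 db_count=$3 files_ok=$4 remote=$5 size_gb=$6
  cat <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "status": "${status}",
  "duration_seconds": ${duration},
  "databases_backed_up": ${db_count},
  "files_backed_up": ${files_ok},
  "remote_sync": "${remote}",
  "total_size_gb": ${size_gb},
  "errors": []
}
EOF
}
```

Writing the output to a temporary file and then renaming it over `backup-status.json` would keep a reader from ever seeing a half-written file.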
### 4.5 Backup Rotation
```bash
# Daily: keep last 7 (-mindepth 1 stops find from matching the daily/ directory itself)
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;
# Weekly: copy Sunday's backup to weekly/, keep last 4
if [ "$(date +%u)" = "7" ]; then
  cp -a /opt/letsbe/backups/daily/ "/opt/letsbe/backups/weekly/$(date +%Y-%m-%d)/"
fi
find /opt/letsbe/backups/weekly/ -mindepth 1 -maxdepth 1 -mtime +28 -exec rm -rf {} \;
```
---
## 5. Restore Procedures
### 5.1 Per-Tool Restore
**Location:** `letsbe-provisioner/scripts/restore.sh` (~512 lines)
```bash
# Restore a specific tool's database from a daily backup
./restore.sh --tool nextcloud --date 2026-02-25
# Steps:
# 1. Stop the tool container
# 2. Restore database from backup
# 3. Restore files (if applicable)
# 4. Start the tool container
# 5. Verify health check
# 6. Report to Hub
```
### 5.2 Full Server Restore
For complete server recovery (e.g., VPS failure):
```
1. Order new VPS from Netcup (same region, same tier)
2. Run provisioner with --restore flag
- Steps 1-8: Standard server setup
- Step 9: Deploy tool stacks (empty)
- Step 10: Deploy OpenClaw + Safety Wrapper
3. Restore from remote backup:
rclone copy remote:backups/{{tenant_id}}/daily/ /opt/letsbe/backups/daily/
4. Run restore.sh --all
- Restores all 21 databases
- Restores all file volumes
- Restores OpenClaw state
- Restores Safety Wrapper secrets registry
- Restores credentials
5. Verify all tools are healthy
6. Update DNS if IP changed
7. Hub updates server connection record
```
### 5.3 Point-in-Time Recovery
For accidental data deletion by a user:
```
1. Identify the backup date that contains the needed data
2. Restore the specific tool to a temporary container:
./restore.sh --tool odoo --date 2026-02-23 --target temp
3. Extract the needed data from the temp container
4. Import the data into the production tool
5. Remove the temp container
```
---
## 6. Monitoring
### 6.1 Uptime Kuma (On-Tenant)
Each tenant VPS runs Uptime Kuma monitoring all local services:
| Monitor | Type | Interval | Alert Threshold |
|---------|------|----------|-----------------|
| nginx | HTTP(S) | 60s | 3 failures |
| Each tool container | HTTP | 120s | 3 failures |
| OpenClaw Gateway | HTTP (port 18789) | 60s | 2 failures |
| Secrets Proxy | HTTP (port 8100) | 60s | 2 failures |
| SSL certificate expiry | Certificate | Daily | 14 days before expiry |
| Disk usage | Push | 300s | >85% |
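
The disk usage row is a Push monitor, so something on the host has to report in. A minimal sketch of such a reporter is shown below. `PUSH_URL` is an assumption: it stands for the push URL that Uptime Kuma displays for the monitor, without a query string. The function names are hypothetical, and `df --output` requires GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical push script for the "Disk usage" monitor.

# Print the used-space percentage (digits only) for a mount point. Needs GNU df.
disk_used_percent() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Map a percentage to the monitor status: "down" above the threshold, else "up".
disk_status() {
  local used=$1 threshold=${2:-85}
  if (( used > threshold )); then echo down; else echo up; fi
}

# Report the current state; intended to run from cron every 5 minutes.
push_disk_status() {
  local used
  used=$(disk_used_percent /opt/letsbe)
  [[ $used =~ ^[0-9]+$ ]] || return 1
  curl -fsS -G "$PUSH_URL" \
    --data-urlencode "status=$(disk_status "$used")" \
    --data-urlencode "msg=disk ${used}%" > /dev/null
}
```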
### 6.2 Hub-Level Monitoring
The Hub monitors all tenant servers centrally:
| Metric | Source | Check Interval | Alert |
|--------|--------|---------------|-------|
| Heartbeat received | Safety Wrapper | Expected every 5 min | Missing >15 min |
| Token usage rate | Safety Wrapper heartbeat | Every heartbeat | >90% pool consumed |
| Backup status | Safety Wrapper (reads backup-status.json) | Daily | Any backup failure |
| Container health | Portainer API (via Hub) | Every 10 min | Container crash/OOM |
| VPS metrics | Netcup SCP API | Every 15 min | CPU >90% sustained, disk >90% |
| OpenClaw version | Safety Wrapper heartbeat | Every heartbeat | Version mismatch with expected |
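
The heartbeat row reduces to a comparison of timestamps. The sketch below classifies a tenant from the age of its last heartbeat. The Hub's actual check is not shown in this runbook; the function name and the intermediate `late` state are assumptions added for illustration.

```bash
#!/usr/bin/env bash
# Classify a tenant by the age of its last heartbeat (epoch seconds).
# Thresholds follow the table: expected every 5 min, alert when missing >15 min.
heartbeat_state() {
  local last=$1 now=$2
  local age=$(( now - last ))
  if (( age > 15 * 60 )); then
    echo missing      # raise an alert
  elif (( age > 5 * 60 )); then
    echo late         # overdue, but still inside the alert window
  else
    echo ok
  fi
}
```

A caller could run `heartbeat_state "$last_seen" "$(date +%s)"` for each tenant and alert only on `missing`.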
### 6.3 GlitchTip (Error Tracking)
GlitchTip runs on each tenant and captures application errors from:
- OpenClaw (Node.js errors, unhandled rejections)
- Safety Wrapper (hook errors, tool execution failures)
- Tool containers that support Sentry-compatible error reporting
### 6.4 Diun (Container Update Notifications)
Diun monitors all Docker images for new releases:
```yaml
# /opt/letsbe/stacks/diun/docker-compose.yml
watch:
schedule: "0 6 * * *" # Check daily at 06:00
notif:
webhook:
endpoint: "http://127.0.0.1:8100/webhooks/diun" # Safety Wrapper
method: POST
```
The Safety Wrapper receives update notifications and:
1. Logs the available update
2. Reports to Hub via heartbeat
3. Does NOT auto-update (updates require IT Admin agent or manual action)
---
## 7. Maintenance Procedures
### 7.1 Tool Updates
Tool container updates are initiated by the IT Admin agent or manually:
```bash
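# 1. Tag the running image as the rollback target, then pull the new image
#    (assumes the image is named {{tool}}:latest, as in the rollback step below)
cd /opt/letsbe/stacks/{{tool}}
docker tag {{tool}}:latest {{tool}}:previous
docker compose pull
# 2. Backup the tool's database
/opt/letsbe/scripts/backups.sh --tool {{tool}}
# 3. Recreate the containers on the new image
docker compose up -d --force-recreate
# 4. Verify health check
curl -sf http://127.0.0.1:{{port}}/health
# 5. If health check fails, rollback:
docker compose down
docker tag {{tool}}:previous {{tool}}:latest
docker compose up -d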
```
### 7.2 OpenClaw Updates
OpenClaw is pinned to a tested release tag. Update procedure:
```bash
# 1. Check upstream changelog for breaking changes
# 2. Test in staging VPS first
# 3. On tenant VPS:
cd /opt/letsbe/stacks/openclaw
# 4. Backup OpenClaw state
tar czf /opt/letsbe/backups/openclaw-pre-update.tar.gz ~/.openclaw/
# 5. Update image tag in docker-compose.yml
sed -i 's/openclaw:v2026.2.1/openclaw:v2026.3.0/' docker-compose.yml
# 6. Pull and recreate
docker compose pull && docker compose up -d --force-recreate
# 7. Verify
curl -sf http://127.0.0.1:18789/health
docker exec letsbe-openclaw openclaw --version
# 8. If verification fails, rollback:
docker compose down
sed -i 's/openclaw:v2026.3.0/openclaw:v2026.2.1/' docker-compose.yml
docker compose up -d
tar xzf /opt/letsbe/backups/openclaw-pre-update.tar.gz -C /
```
**Update cadence:** Monthly review of upstream changelog. Update only for security fixes or features we need. Never update on Fridays.
### 7.3 SSL Certificate Renewal
Let's Encrypt certificates auto-renew via certbot cron. Manual renewal if needed:
```bash
certbot renew --nginx --force-renewal
nginx -t && nginx -s reload
```
### 7.4 Credential Rotation
The IT Admin agent can rotate credentials for any tool:
```bash
# 1. Generate new credential
NEW_PASS=$(openssl rand -base64 32)
# 2. Update the tool's .env file
# ("|" as the sed delimiter: base64 output can contain "/" but never "|" or "&")
sed -i "s|^DB_PASSWORD=.*|DB_PASSWORD=$NEW_PASS|" /opt/letsbe/stacks/{{tool}}/.env
# 3. Update the database user's password
docker exec {{tool}}-db psql -c "ALTER USER {{user}} PASSWORD '$NEW_PASS';"
# 4. Restart the tool container
docker compose -f /opt/letsbe/stacks/{{tool}}/docker-compose.yml restart
# 5. Update the secrets registry
# (Safety Wrapper detects .env change and updates registry automatically)
# 6. Verify tool health
curl -sf http://127.0.0.1:{{port}}/health
```
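
Editing a `.env` file in place with `sed` is sensitive to special characters in the new value. An alternative, sketched here with a hypothetical helper `set_env_var`, passes the key and value to `awk` through the environment so that no escaping is needed, and replaces the file atomically.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set KEY=VALUE in a .env file.
# Replaces the existing assignment in place, or appends one if none exists.
set_env_var() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  # KEY and VALUE travel via the environment, so awk treats them as literal text.
  KEY=$key VALUE=$value awk '
    index($0, ENVIRON["KEY"] "=") == 1 {
      print ENVIRON["KEY"] "=" ENVIRON["VALUE"]; found = 1; next
    }
    { print }
    END { if (!found) print ENVIRON["KEY"] "=" ENVIRON["VALUE"] }
  ' "$file" > "$tmp" && chmod 600 "$tmp" && mv "$tmp" "$file"
}
```

For example, `set_env_var /opt/letsbe/stacks/{{tool}}/.env DB_PASSWORD "$NEW_PASS"` would take the place of the `sed` call in step 2.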
### 7.5 Disk Space Management
When disk usage exceeds 85%:
```bash
# 1. Check disk usage by directory
du -sh /opt/letsbe/stacks/* | sort -rh | head -20
du -sh /opt/letsbe/backups/* | sort -rh
# 2. Clean Docker resources
docker system prune -f # Remove stopped containers, unused networks
docker image prune -a -f # Remove unused images
docker volume prune -f # Remove unused volumes (CAREFUL: verify first)
# 3. Clean old logs
find /var/log -name "*.gz" -mtime +30 -delete
# Count container log lines from the last 30 days (diagnostic only; deletes nothing)
docker container ls -a --format "{{.Names}}" | xargs -I {} docker logs {} --since 720h 2>/dev/null | wc -l
# 4. Clean old backups (if rotation isn't catching them)
find /opt/letsbe/backups/daily/ -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;
# 5. If still above 85%, recommend tier upgrade to user
```
---
## 8. Deprovisioning
### 8.1 Customer Cancellation Flow
```
Customer requests cancellation
    │
    ▼
Hub: 48-hour cooling-off period
    │ (Customer can cancel the cancellation)
    ▼
Hub: 30-day data export window begins
    │ Customer can:
    │ - Download files via Nextcloud
    │ - Export CRM data via Odoo
    │ - Export email via IMAP
    │ - SSH into server for full access
    │ - Request a full backup via Hub
    ▼
Hub: After 30 days → trigger deprovisioning
├── Revoke Safety Wrapper Hub API key
├── Stop all containers
├── Delete remote backups (rclone purge)
├── Request VPS deletion via Netcup API
│ └── Netcup wipes disk and destroys VPS
├── Delete all Netcup snapshots
├── Remove DNS records
└── Hub: soft-delete account data, retain billing records (7 years per HGB §257)
```
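
The timeline above implies a fixed offset between the cancellation request and the earliest deprovisioning date. A minimal sketch of that calculation follows. It assumes the 30-day export window starts when the 48-hour cooling-off period ends, the function name is hypothetical, and it requires GNU `date`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: earliest deprovisioning date for a cancellation request.
# Assumes the 30-day export window starts when the 48-hour cooling-off ends.
# Requires GNU date (-d).
deprovision_date() {
  local requested_at=$1 start end
  start=$(date -u -d "$requested_at" +%s) || return 1
  # 2 days of cooling-off plus 30 days of export window, in seconds.
  end=$(( start + (2 + 30) * 86400 ))
  date -u -d "@${end}" +%Y-%m-%d
}
```

Working in UTC epoch seconds keeps the result independent of the server's local time zone and daylight saving changes.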
### 8.2 Emergency Server Isolation
If a tenant VPS is compromised or abusing the platform:
```bash
# 1. Revoke Hub API key immediately (Hub admin panel)
# 2. SSH into server (port 22022):
ssh -p 22022 letsbe-admin@{{server_ip}}
# 3. Stop the AI runtime
docker stop letsbe-openclaw letsbe-secrets-proxy
# 4. Block new outbound connections. The established inbound SSH session keeps
#    working because UFW's built-in rules accept RELATED,ESTABLISHED traffic.
ufw default deny outgoing
# 5. Take a forensic snapshot via Netcup API
# 6. Assess and decide: remediate or deprovision
```
---
## 9. Disaster Recovery
### 9.1 Scenarios
| Scenario | RTO | RPO | Procedure |
|----------|-----|-----|-----------|
| Single container crash | <5 min | 0 (no data loss) | Auto-restart via Docker restart policy |
| Multiple container failure | <30 min | 0 | IT Admin agent investigates, restarts services |
| VPS disk corruption | 24 hours | 24 hours (last backup) | New VPS + restore from remote backup |
| VPS total loss | 24 hours | 24 hours | New VPS (same region) + restore |
| Netcup data center outage | 48 hours | 24 hours | New VPS in alternate region + restore |
| Hub outage | <1 hour | 0 (tenant VPS operates independently) | Hub restart/failover |
| OpenRouter outage | <5 min | 0 | Model fallback chain engages automatically |
### 9.2 Tenant VPS Operates Independently
A key architectural property: **tenant VPS continues operating even if the Hub is down.** The Safety Wrapper operates with its local config, the AI agents continue serving the user, and tools continue running. The Hub is needed only for:
- Billing and subscription management
- Config updates (new agents, autonomy changes)
- Approval queue (if approvals are routed through Hub instead of local)
- Monitoring dashboards
### 9.3 Recovery Testing
**Monthly:** Restore a random tool's database from backup on a staging VPS to verify backup integrity.
**Quarterly:** Full server restore drill: order a new VPS, run a complete restore from remote backup, and verify that all tools and agents are functional.

---
## 10. Security Operations
### 10.1 SSH Access Audit
```bash
# Review successful SSH logins
journalctl -u sshd --since "7 days ago" | grep "Accepted"
# Review failed SSH attempts
journalctl -u sshd --since "7 days ago" | grep "Failed"
# Check fail2ban status
fail2ban-client status sshd
```
### 10.2 Container Security
```bash
# Check which user each container runs as (empty means root; should be minimal)
docker ps -q | xargs -r docker inspect --format "{{.Name}}: user={{.Config.User}}"
# Check for containers with excessive privileges
docker ps -q | xargs -r docker inspect --format "{{.Name}}: privileged={{.HostConfig.Privileged}}"
# Verify network isolation
docker network ls
docker network inspect bridge
```
### 10.3 Vulnerability Scanning
```bash
# Scan Docker images for known vulnerabilities (using Trivy)
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
aquasec/trivy image --severity HIGH,CRITICAL {{image_name}}
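# Scan the image of every running container (same containerized Trivy as above)
docker ps --format "{{.Image}}" | sort -u | while read -r img; do
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/trivy image --severity HIGH,CRITICAL "$img"
done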
```
### 10.4 Incident Response Checklist
```
[ ] 1. Contain: Isolate affected VPS (Section 8.2)
[ ] 2. Assess: Determine scope (which data, which users affected)
[ ] 3. Preserve: Take forensic snapshot before changes
[ ] 4. Notify: Hub alerts → Matt → customer (within timelines per GDPR Art. 33/34)
[ ] 5. Remediate: Fix the vulnerability, rotate compromised credentials
[ ] 6. Restore: From clean backup if data was corrupted
[ ] 7. Verify: Full health check on all services
[ ] 8. Document: Post-mortem with root cause, timeline, actions taken
[ ] 9. Improve: Update runbook/monitoring to prevent recurrence
```
---
## 11. Common Operations Quick Reference
| Task | Command / Procedure |
|------|---------------------|
| Check all containers | `docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"` |
| Restart a tool | `cd /opt/letsbe/stacks/{{tool}} && docker compose restart` |
| View tool logs | `docker logs --tail 100 -f {{container_name}}` |
| Check disk usage | `df -h /opt/letsbe` |
| Check RAM usage | `free -h` |
| Run manual backup | `/opt/letsbe/scripts/backups.sh` |
| Restore a tool | `/opt/letsbe/scripts/restore.sh --tool {{tool}} --date YYYY-MM-DD` |
| Check SSL expiry | `certbot certificates` |
| Renew SSL | `certbot renew --nginx` |
| Check Safety Wrapper | `curl http://127.0.0.1:8100/health` |
| Check OpenClaw | `curl http://127.0.0.1:18789/health` |
| View backup status | `cat /opt/letsbe/backups/backup-status.json \| jq` |
| Check firewall | `ufw status verbose` |
| Check fail2ban | `fail2ban-client status sshd` |
---
## 12. Changelog
| Version | Date | Changes |
|---------|------|---------|
| 1.0 | 2026-02-26 | Initial runbook. Covers: Netcup provisioning, 10-step pipeline, credential generation, backup/restore, monitoring stack, maintenance procedures, deprovisioning, disaster recovery, security operations, quick reference. |