Matt Ciaccio 6eb0d3dc92 docs(ops): backup/restore + email deliverability runbooks

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 20:10:30 +02:00

Backup and restore runbook

This runbook documents what gets backed up, how often, where it lands, and the exact commands to restore the system from a cold start. The goal is that any operator who has the off-site backup credentials can bring the CRM back up on a clean host without help.

Scope of a "full backup"

The CRM has three stateful surfaces. All three must be captured for a restore to be useful.

Surface	Holds	Risk if missing
PostgreSQL (`port_nimara_crm`)	Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc.	Total data loss — site is unrecoverable.
MinIO bucket (`MINIO_BUCKET`, default `crm-files`)	Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments.	Files reachable by row references in Postgres become 404s.
`.env` + secrets	DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`).	OCR API keys re-resolve from `system_settings` (encrypted at rest), but without the original `ENCRYPTION_KEY` they're unreadable.

The Redis instance is not backed up. It only holds queue state, rate-limit counters, and Socket.IO presence — all reconstructable. Stop the workers during a restore so the queue starts clean.

Backup schedule

Defaults are tuned for a single-port deployment with O(10k) clients. Increase the backup frequency (the schedule on the producing host) as scale demands.

Job	Frequency	Retention	Where
`pg_dump` (custom format, gzipped)	Hourly	7 days hourly + 30 days daily	`${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz`
MinIO mirror	Hourly (incremental)	30 days versions	`${BACKUP_S3_BUCKET}/minio/`
`.env` snapshot (encrypted)	On change (manual)	Forever	Password manager / secrets vault — never the same bucket as data

The hourly cadence is the right answer for this workload — invoices and contracts cluster around business hours, and an hour of lost work is the worst-case data loss window most clients will tolerate. Promote to 15-min WAL streaming if a customer demands tighter RPO.
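
The pg_dump retention in the table (7 days hourly + 30 days daily) can be read as a keep-or-prune rule per dump. The sketch below is one possible reading, not the shipped pruning logic: `should_keep` is a hypothetical helper, and treating the 00:00 UTC dump as the "daily" copy is an assumption, since the runbook does not say which hour is retained.

```shell
# should_keep AGE_HOURS HOUR_OF_DAY
# Returns 0 (keep) or 1 (prune) for a dump that is AGE_HOURS old and was
# taken at HOUR_OF_DAY (UTC, as it appears in the object name).
should_keep() {
  age_hours=$1
  hour=$2

  # ASSUMPTION: the dump taken at hour 00 is the one kept as the daily copy.
  case $hour in
    0|00) is_daily_slot=1 ;;
    *)    is_daily_slot=0 ;;
  esac

  if [ "$age_hours" -le $((7 * 24)) ]; then
    return 0   # inside the 7-day hourly window: keep every dump
  fi
  if [ "$age_hours" -le $((30 * 24)) ] && [ "$is_daily_slot" -eq 1 ]; then
    return 0   # inside the 30-day window: keep only the daily slot
  fi
  return 1     # older than 30 days, or a non-daily dump past day 7
}
```

For example, `should_keep 169 14` returns 1 (an eight-day-old 14:00 dump is pruned), while `should_keep 169 00` returns 0.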

Required environment variables

The scripts below read these. Store them in a CI secret store, not the host's bash profile.

# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm

# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files

# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>

# Optional: encrypts dumps at rest to this GPG recipient's public key. Limits
# the blast radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
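
Since every script depends on these variables, a preflight check that names whatever is missing makes a misconfigured scheduler fail at the first line instead of halfway through an upload. This is a minimal sketch; `missing_vars` is a hypothetical helper and is not known to exist in scripts/backup/.

```shell
# missing_vars NAME...
# Prints the name of every listed variable that is unset or empty, one per
# line, and returns non-zero if any were missing.
missing_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup in POSIX sh: expands to value=${NAME:-}
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Possible use at the top of a backup script:
# missing_vars DATABASE_URL BACKUP_S3_ENDPOINT BACKUP_S3_BUCKET \
#   BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY >&2 || exit 1
```

BACKUP_GPG_RECIPIENT is left out of the example call because the runbook marks it optional.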

Provisioning the backup destination

1. Create a dedicated S3-compatible bucket in a different account from the live infra. AWS S3, Backblaze B2, or a separately-credentialed MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup write key still can't permanently delete history.
3. Generate IAM credentials scoped to s3:PutObject, s3:GetObject, s3:ListBucket on this bucket only. Inject them as BACKUP_S3_* above. Do not reuse the live MINIO_* keys.
4. Set a 90-day lifecycle rule that transitions objects older than 30 days to cold storage and deletes them at 90 days. Past 90 days it's cheaper to restart from a snapshot taken outside the system.
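
The credential scope described above could look like the policy printed by the sketch below. It uses AWS IAM policy syntax; Backblaze B2 and MinIO accept similar documents, but their exact dialects are not verified here. `backup_policy_json` is a hypothetical helper, not part of the repository.

```shell
# backup_policy_json BUCKET
# Prints an IAM policy document granting only the three actions the runbook
# names: ListBucket on the bucket itself, Put/GetObject on its objects.
backup_policy_json() {
  bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

# Example: backup_policy_json "$BACKUP_S3_BUCKET" > backup-policy.json
```

The policy deliberately grants no s3:DeleteObject, which lines up with the goal that a stolen backup key cannot erase history.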

The scripts

Three scripts in scripts/backup/:

- pg-backup.sh: runs pg_dump, gzips, optionally GPG-encrypts, uploads
- minio-mirror.sh: mc mirror of the live bucket → backup bucket
- restore.sh: interactive restore (DB + MinIO) given a snapshot path

Make them executable and wire them into cron / GitHub Actions / your scheduler of choice. Sample crontab on the worker host:

# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1

# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1

# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
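
The commit summary describes pg-backup.sh as failing loud. Because the crontab appends stdout and stderr to a log file, one way to get that property is to wrap each stage so a failure leaves a timestamped line naming the stage. This is a sketch of the idea only; `run_step` is a hypothetical helper and the real script may do this differently.

```shell
# run_step LABEL COMMAND [ARGS...]
# Runs COMMAND; on failure writes a timestamped line naming the step to
# stderr and returns the command's exit status.
run_step() {
  label=$1
  shift
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    printf '%s BACKUP FAILED at step "%s" (exit %s)\n' \
      "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$label" "$status" >&2
  fi
  return "$status"
}

# Possible use inside a backup script (the /tmp path is a placeholder):
# run_step dump pg_dump --format=custom --file=/tmp/pncrm.dump "$DATABASE_URL" &&
#   run_step compress gzip -f /tmp/pncrm.dump || exit 1
```

Chaining the stages with `&&` stops at the first failure, so a broken dump is never compressed or uploaded.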

Restoring from cold

These steps have been rehearsed against the dev environment; expect them to take 15–30 minutes for a typical port. The drill (last cron line above) ensures the runbook stays correct — if the drill fails, the real restore will too.

0. Stop everything that writes

docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.

1. Restore PostgreSQL

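# mc addresses remote storage as <alias>/<bucket>/<path>. The alias name
# "backup" is an assumption; register it once on the restore host.
mc alias set backup "$BACKUP_S3_ENDPOINT" "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY"

# Find the dump you want. Prefer the most recent successful hour. On a
# replacement host, substitute the original host's name for $(hostname).
mc ls "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"

# Pull it.
mc cp "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/

# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side. The object
# then carries a .gpg suffix, so SNAPSHOT above must end in .dump.gz.gpg.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz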

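# Drop & recreate the database. DROP DATABASE cannot run from a session that
# is connected to the target, so connect to the maintenance database
# "postgres" instead. Assumes DATABASE_URL has no query string and that the
# crm role owns the database and has CREATEDB.
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
# pg_restore sequences data and constraints itself, so FKs such as the
# RESTRICT one from gdpr_exports.requested_by to user need no manual ordering.
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"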
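
Because the DROP DATABASE step is irreversible, it may be worth checking the downloaded dump before running it. The sketch below only confirms that the file is non-empty and passes gzip's integrity test; `verify_dump` is a hypothetical helper, not part of restore.sh.

```shell
# verify_dump FILE
# Returns 0 only if FILE exists, is non-empty, and passes gzip's integrity
# check. It does not prove the archive inside is a valid pg_dump.
verify_dump() {
  file=$1
  if [ ! -s "$file" ]; then
    echo "verify_dump: $file is missing or empty" >&2
    return 1
  fi
  # Read via stdin so the check does not depend on the file's suffix.
  if ! gzip -t < "$file" 2>/dev/null; then
    echo "verify_dump: $file failed the gzip integrity check" >&2
    return 1
  fi
  return 0
}

# Possible use before dropping anything:
# verify_dump /tmp/14.dump.gz || exit 1
# Deeper check, if pg_restore is installed (reads the archive's table of
# contents without connecting to a database):
# gunzip -c /tmp/14.dump.gz | pg_restore --list >/dev/null
```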

2. Restore MinIO

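# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots. "backup" and "live" are mc
# aliases for the two endpoints ("backup" is an assumed name; check with
# `mc alias list`).
mc mirror --overwrite \
  "backup/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"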

3. Restore secrets

The .env file is not in object storage. Pull it from the password manager / secrets vault. Verify ENCRYPTION_KEY matches the value used when the database was last running — if it doesn't, rows in system_settings (OCR API keys, etc.) decrypt to garbage and the OCR "Test connection" button will return an opaque error. There is no recovery path; the keys must be re-entered through the admin UI.
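
The runbook does not prescribe how to verify that ENCRYPTION_KEY matches. One option is to record a short SHA-256 fingerprint of the key while the system is healthy and compare it after pulling .env from the vault. This is a sketch under that assumption: `key_fingerprint` and `same_key` are hypothetical helpers, and storing the fingerprint next to the .env snapshot is a suggestion, not an existing practice.

```shell
# key_fingerprint KEY
# Prints the first 16 hex characters of the SHA-256 digest of KEY.
key_fingerprint() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -c1-16
  else
    # Fallback for hosts that ship shasum instead of sha256sum (e.g. macOS).
    printf '%s' "$1" | shasum -a 256 | cut -c1-16
  fi
}

# same_key KEY EXPECTED_FINGERPRINT
# Returns 0 when KEY hashes to EXPECTED_FINGERPRINT.
same_key() {
  [ "$(key_fingerprint "$1")" = "$2" ]
}

# Possible use after restoring .env (EXPECTED_FP is a placeholder for the
# fingerprint recorded earlier):
# same_key "$ENCRYPTION_KEY" "$EXPECTED_FP" || echo "ENCRYPTION_KEY mismatch" >&2
```

A mismatch caught here is cheaper to diagnose than the opaque error from the OCR "Test connection" button.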

4. Bring services back up

docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker

5. Verify

Work through the smoke checklist, in order:

- DB up: psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;' matches the producer-side count from the snapshot's hour.
- MinIO up: open any client with attachments in the CRM and click a receipt thumbnail; verify the signed URL serves the file.
- Documenso webhooks: re-trigger one in the Documenso admin and confirm audit_logs records the receipt.
- Email: send a portal invite to a real address and confirm it arrives.
- Realtime: open two browser windows, edit a client in one, and watch the other update via Socket.IO.
- AI usage ledger: SELECT count(*) FROM ai_usage_ledger; returns a non-zero count if AI was being used. Old rows survive the restore; the budget gates still reset at the period boundary at month rollover.

Drill schedule

The weekly drill (cron line above) runs restore.sh --drill against a throwaway database and a sandbox MinIO bucket. It must produce zero diff between the restored row counts and the live row counts (modulo the hour-or-so the drill takes to run).
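
The row-count comparison at the heart of the drill can be sketched as follows. It assumes each side has been reduced to a text file of `table count` lines; how restore.sh actually collects and compares counts is not shown in this runbook, and `count_diff` is a hypothetical helper. The comparison is strict, so a real drill would need some tolerance for rows written while it runs.

```shell
# count_diff EXPECTED_FILE ACTUAL_FILE
# Each file holds "table count" lines. Prints every table whose count differs
# or that appears in only one file; returns non-zero if anything was printed.
count_diff() {
  awk '
    FILENAME == ARGV[1] { expected[$1] = $2; next }
    {
      seen[$1] = 1
      if (!($1 in expected)) {
        print $1 ": not in expected counts"; bad = 1
      } else if (expected[$1] != $2) {
        print $1 ": expected " expected[$1] ", got " $2; bad = 1
      }
    }
    END {
      for (t in expected)
        if (!(t in seen)) { print t ": missing from restore"; bad = 1 }
      exit bad + 0
    }
  ' "$1" "$2"
}

# Possible use (file names are placeholders):
# count_diff live-counts.txt restored-counts.txt || echo "drill failed" >&2
```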

Failure modes the drill catches before they bite production:

- Schemas that fall outside the dump. We use pg_dump's default, which captures every non-system schema, not just public; if the dump is ever narrowed with --schema=public, a future developer adding a tenant_X schema will silently lose it.
- MinIO bucket-policy changes that block the backup-side s3:GetObject on certain prefixes.
- GPG key or passphrase rotation that wasn't propagated to the restore host.
- Version skew between the restore host's pg_restore and the producer-side pg_dump; an older pg_restore may refuse an archive written by a newer pg_dump.
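
The version-skew case can also be caught before a restore starts by comparing major versions. The sketch below parses the `--version` banner of each tool and assumes PostgreSQL 10 or later, where the first number is the major version; `pg_major` and `restore_can_read` are hypothetical helpers, and the producer would have to record its pg_dump banner somewhere the restore host can read.

```shell
# pg_major VERSION_LINE
# pg_major "pg_dump (PostgreSQL) 16.2"  ->  prints 16
pg_major() {
  printf '%s\n' "$1" | sed -n 's/^.*(PostgreSQL) \([0-9][0-9]*\).*$/\1/p'
}

# restore_can_read DUMP_VERSION_LINE RESTORE_VERSION_LINE
# Returns 0 when pg_restore's major version is at least pg_dump's, and
# non-zero when it is older or either line cannot be parsed.
restore_can_read() {
  dump_major=$(pg_major "$1")
  restore_major=$(pg_major "$2")
  [ -n "$dump_major" ] && [ -n "$restore_major" ] &&
    [ "$restore_major" -ge "$dump_major" ]
}

# Possible use (PRODUCER_PG_DUMP_VERSION is a placeholder for the recorded banner):
# restore_can_read "$PRODUCER_PG_DUMP_VERSION" "$(pg_restore --version)" || exit 1
```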
