pn-new-crm/docs/runbooks/backup-and-restore.md
Matt Ciaccio 6eb0d3dc92 docs(ops): backup/restore + email deliverability runbooks
Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:10:30 +02:00

# Backup and restore runbook
This runbook documents what gets backed up, how often, where it lands, and
the exact commands to restore the system from a cold start. The goal is
that any operator who has the off-site backup credentials can bring the
CRM back up on a clean host without help.
## Scope of a "full backup"
The CRM has three stateful surfaces. All three must be captured for a
restore to be useful.
| Surface | Holds | Risk if missing |
| ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| **PostgreSQL** (`port_nimara_crm`) | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable. |
| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments. | Files reachable by row references in Postgres become 404s. |
| **`.env` + secrets** | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`). | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |
The Redis instance is not backed up. It only holds queue state, rate-limit
counters, and Socket.IO presence — all reconstructable. Stop the workers
during a restore so the queue starts clean.
## Backup schedule
Defaults are tuned for a single-port deployment with O(10k) clients. Tighten
the cadence on the producing side as scale demands.
| Job | Frequency | Retention | Where |
| ---------------------------------- | -------------------- | ----------------------------- | -------------------------------------------------------------------- |
| `pg_dump` (custom format, gzipped) | Hourly | 7 days hourly + 30 days daily | `${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz` |
| MinIO mirror | Hourly (incremental) | 30 days of versions | `${BACKUP_S3_BUCKET}/minio/` |
| `.env` snapshot (encrypted) | On change (manual) | Forever | Password manager / secrets vault — **never the same bucket as data** |
The hourly cadence is the right answer for this workload — invoices and
contracts cluster around business hours, and an hour of lost work is the
worst-case data loss window most clients will tolerate. Promote to 15-min
WAL streaming if a customer demands tighter RPO.
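The 7-day-hourly / 30-day-daily split implies a pruning pass on the
destination bucket. The shipped scripts own that logic; what follows is only
a sketch of the idea, assuming the `backup` `mc` alias set up in the restore
section, the `pg/<host>/<UTC date>/<hour>.dump.gz` key layout, and the
convention that the 00-hour dump doubles as the daily:
```bash
#!/usr/bin/env bash
# Retention-pass sketch (not one of the shipped scripts).
set -euo pipefail

now=$(date -u +%s)
prefix="backup/${BACKUP_S3_BUCKET}/pg/$(hostname)"

# `mc ls --recursive` prints keys relative to the prefix in its last column;
# this assumes no whitespace in key names.
mc ls --recursive "$prefix/" | awk '{print $NF}' | while read -r key; do
  day=${key%%/*}                        # <UTC date>
  hour=${key#*/}; hour=${hour%%.*}      # <hour>, tolerates .dump.gz(.gpg)
  age_days=$(( (now - $(date -u -d "$day" +%s)) / 86400 ))
  # Older than 30 days: drop. Between 7 and 30 days: keep only the 00-hour daily.
  if (( age_days > 30 )) || { (( age_days > 7 )) && [[ $hour != 00 ]]; }; then
    mc rm "$prefix/$key"
  fi
done
```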
## Required environment variables
The scripts below read these. Store them in a CI secret store, not the
host's bash profile.
```
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm
# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files
# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>
# Optional: GPG-encrypts dumps to this recipient at rest. Narrows the blast
# radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
```
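Cron runs jobs with a near-empty environment, so the scripts have to load
these themselves rather than inherit them. One hedged pattern (the
`/etc/pncrm/backup.env` path is a placeholder, not something the scripts
mandate):
```bash
# Top of each backup script: load the variables the secret store
# materializes on the host. The path is an assumption.
set -a                    # auto-export everything the file defines
. /etc/pncrm/backup.env
set +a
```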
## Provisioning the backup destination
1. Create a dedicated S3-compatible bucket in a **different account** from
the live infra. AWS S3, Backblaze B2, or a separately-credentialed
MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup
write key still can't permanently delete history.
3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`,
`s3:ListBucket` on this bucket only. Inject them as
`BACKUP_S3_*` above. Do not reuse the live `MINIO_*` keys.
4. Set a 90-day lifecycle rule that transitions objects older than 30
days to cold storage and deletes them at 90 days. Past 90 days it's
cheaper to restart from a snapshot taken outside the system.
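For steps 2 and 4, this is roughly what the AWS CLI invocations could look
like, assuming the S3 destination from the example values above (Backblaze B2
and MinIO have their own equivalents, e.g. `mc ilm` for MinIO):
```bash
# Step 2: versioning, so a stolen write key can't silently rewrite history.
aws s3api put-bucket-versioning \
  --bucket portnimara-backups-prod \
  --versioning-configuration Status=Enabled

# Step 4: 30-day transition to cold storage, 90-day expiry.
aws s3api put-bucket-lifecycle-configuration \
  --bucket portnimara-backups-prod \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "pncrm-backup-retention",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 90}
    }]
  }'
```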
## The scripts
Three scripts in `scripts/backup/`:
- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
(core pipeline sketched below)
- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
- `restore.sh` — interactive restore (DB + MinIO) given a snapshot path
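The dump script's core is a short fail-loud pipeline. The sketch below
follows this runbook's conventions (the `backup` alias from the restore
section, the key layout from the schedule table); the shipped script remains
the authority:
```bash
#!/usr/bin/env bash
# Sketch of pg-backup.sh's core, not the shipped script.
set -euo pipefail          # "fails loud": any stage dying kills the run

stamp_day=$(date -u +%F)   # <UTC date>
stamp_hour=$(date -u +%H)  # <hour>
out="/tmp/${stamp_hour}.dump.gz"

# Custom-format dump, gzipped. Custom format keeps pg_restore flexible
# (parallel restore, selective table restore).
pg_dump --format=custom "$DATABASE_URL" | gzip > "$out"

# Optional at-rest encryption for the backup bucket.
if [[ -n "${BACKUP_GPG_RECIPIENT:-}" ]]; then
  gpg --encrypt --recipient "$BACKUP_GPG_RECIPIENT" "$out"   # writes ${out}.gpg
  rm "$out"
  out="${out}.gpg"
fi

mc cp "$out" "backup/${BACKUP_S3_BUCKET}/pg/$(hostname)/${stamp_day}/"
```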
Make them executable and wire them into cron / GitHub Actions / your
scheduler of choice. Sample crontab on the worker host:
```cron
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1
# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1
# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
```
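If a dump ever takes longer than an hour, consecutive cron fires can overlap.
A `flock` wrapper serializes them; the lock path below is an assumption, and
`-n` skips the overlapping run instead of queueing it, which is the right
call for hourly dumps:
```cron
7 * * * * flock -n /run/lock/pncrm-pg-backup /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1
```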
## Restoring from cold
These steps have been rehearsed against the dev environment; expect them
to take 15–30 minutes for a typical port. **The drill (last cron line
above) ensures the runbook stays correct — if the drill fails, the
real restore will too.**
### 0. Stop everything that writes
```bash
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
```
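Steps 1 and 2 address both object stores through `mc` aliases. The alias
names (`backup` for the off-site bucket, `live` for the CRM's MinIO) are this
runbook's convention; set them once on the restore host:
```bash
mc alias set backup "$BACKUP_S3_ENDPOINT" "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY"
mc alias set live   "https://$MINIO_ENDPOINT" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"
```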
### 1. Restore PostgreSQL
```bash
# Find the dump you want. Prefer the most recent successful hour.
mc ls "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"
# Pull it.
mc cp "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/
# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz
# Drop & recreate the database from the maintenance DB; Postgres refuses to
# drop the database you're connected to. FK ordering (e.g. the 'restrict' FK
# on gdpr_exports.requested_by) is pg_restore's job; no manual ordering needed.
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"
```
### 2. Restore MinIO
```bash
# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots.
mc mirror --overwrite \
  "backup/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"
```
### 3. Restore secrets
The `.env` file is **not** in object storage. Pull it from the password
manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
when the database was last running — if it doesn't, rows in
`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button will return an opaque error. There is no
recovery path; the keys must be re-entered through the admin UI.
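A cheap way to catch a mismatch early is to record a fingerprint of the key
alongside the vault entry and compare it before starting services (the
fingerprint convention is an assumption of this runbook, nothing the scripts
enforce):
```bash
# Compare against the fingerprint recorded next to the key in the vault.
echo -n "$ENCRYPTION_KEY" | sha256sum | cut -c1-16
```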
### 4. Bring services back up
```bash
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
```
### 5. Verify
Work through the smoke checklist, in order:
1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
matches the producer-side count from the snapshot's hour (a scripted
version of this check follows the list).
2. **MinIO up** — open any client with attachments in the CRM, click a
receipt thumbnail; verify the signed URL serves the file.
3. **Documenso webhooks** — re-trigger one in the Documenso admin and
confirm `audit_logs` records the receipt.
4. **Email** — send a portal invite to a real address.
5. **Realtime** — open two browser windows, edit a client in one, watch
the other update via Socket.IO.
6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;` returns a
non-zero count if AI was being used. Old rows survive the restore; the
budget gates simply reset at the next month rollover, as they always do.
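Roughly what a scripted version of check 1 (and the row-count diff inside
`restore.sh --drill`) might look like; a sketch, not the shipped code:
```bash
# Emit "<table> <rowcount>" for every ordinary table in public, sorted so
# two runs (producer-side vs. restored) can be compared with plain diff.
psql "$DATABASE_URL" -At -c \
  "SELECT relname FROM pg_class c
   JOIN pg_namespace n ON n.oid = c.relnamespace
   WHERE n.nspname = 'public' AND c.relkind = 'r'
   ORDER BY relname" |
while read -r tbl; do
  printf '%s %s\n' "$tbl" \
    "$(psql "$DATABASE_URL" -At -c "SELECT count(*) FROM \"$tbl\"")"
done
```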
## Drill schedule
The weekly drill (cron line above) runs `restore.sh --drill` against a
throwaway database and a sandbox MinIO bucket. It must produce zero diff
between the restored row counts and the live row counts (modulo the
hour-or-so the drill takes to run).
Failure modes the drill catches before they bite production:
- Tables in a new schema silently dropping out of the dump. The scripts
use `pg_dump`'s default scope, which captures every non-system schema,
but a future developer narrowing it to `--schema=public` would silently
lose a new `tenant_X` schema.
- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
on certain prefixes.
- GPG passphrase rotation that wasn't propagated to the restore host.
- A `pg_restore` version skew with the producer-side `pg_dump`.
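The last failure mode can also be caught before the weekly drill fires. A
hedged guard for the top of `restore.sh`, assuming a pinned producer major
version:
```bash
# Refuse to run with a pg_restore older than the producer's pg_dump major;
# an older pg_restore cannot read dumps from a newer pg_dump.
producer_major=16   # assumption: pin this to the producer's actual major
restore_major=$(pg_restore --version | grep -oE '[0-9]+' | head -1)
if (( restore_major < producer_major )); then
  echo "pg_restore major $restore_major < producer major $producer_major" >&2
  exit 1
fi
```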