docs(ops): backup/restore + email deliverability runbooks

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md: covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB + hourly
  MinIO mirror, 7-day hourly + 30-day daily retention), cold-restore
  procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh: pg_dump → gzip → optional GPG → mc upload,
  fails loud
- scripts/backup/minio-mirror.sh: incremental mc mirror, no --remove flag
  so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh: interactive prod restore + --drill mode that
  runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md: what the CRM sends, DNS records
  (SPF/DKIM/DMARC/MX), per-port override implications, diagnosis flow
  ("didn't arrive" → 4-step checklist that includes the EMAIL_REDIRECT_TO
  check), provider migration plan, realapi suite as the end-to-end probe

Tests still 778/778 vitest, tsc/lint clean. These phases are docs + shell
scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:10:30 +02:00
parent a3305a94f3
commit 6eb0d3dc92
5 changed files with 620 additions and 0 deletions
--- a/docs/runbooks/backup-and-restore.md
+++ b/docs/runbooks/backup-and-restore.md
@@ -0,0 +1,199 @@
+# Backup and restore runbook
+
+This runbook documents what gets backed up, how often, where it lands, and
+the exact commands to restore the system from a cold start. The goal is
+that any operator who has the off-site backup credentials can bring the
+CRM back up on a clean host without help.
+
+## Scope of a "full backup"
+
+The CRM has three stateful surfaces. All three must be captured for a
+restore to be useful.
+
+| Surface                                                | Holds                                                                                                                                                              | Risk if missing                                                                                                                       |
+| ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
+| **PostgreSQL** (`port_nimara_crm`)                     | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable.                                                                                              |
+| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments.                                                                                      | Files reachable by row references in Postgres become 404s.                                                                            |
+| **`.env` + secrets**                                   | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`).                                                                  | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |
+
+The Redis instance is not backed up. It only holds queue state, rate-limit
+counters, and Socket.IO presence — all reconstructable. Stop the workers
+during a restore so the queue starts clean.
+
+## Backup schedule
+
+Defaults are tuned for a single-port deployment with O(10k) clients. Raise
+the frequency on the producing side as scale demands.
+
+| Job                                | Frequency            | Retention                     | Where                                                                |
+| ---------------------------------- | -------------------- | ----------------------------- | -------------------------------------------------------------------- |
+| `pg_dump` (custom format, gzipped) | Hourly               | 7 days hourly + 30 days daily | `${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz`            |
+| MinIO mirror                       | Hourly (incremental) | 30 days versions              | `${BACKUP_S3_BUCKET}/minio/`                                         |
+| `.env` snapshot (encrypted)        | On change (manual)   | Forever                       | Password manager / secrets vault — **never the same bucket as data** |
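+
+The key layout in the first row can be sketched as a small helper. This
+is an illustration only: `snapshot_key` is a hypothetical name, not a
+function taken from `pg-backup.sh`, and it assumes GNU `date`.
+
+```bash
+# Build the object key for a dump taken at a given Unix timestamp.
+# Usage: snapshot_key <host> <epoch-seconds>
+snapshot_key() {
+  local host=$1 epoch=$2
+  # -u forces UTC so keys sort the same regardless of host timezone.
+  printf 'pg/%s/%s/%s.dump.gz\n' \
+    "$host" \
+    "$(date -u -d "@$epoch" +%Y-%m-%d)" \
+    "$(date -u -d "@$epoch" +%H)"
+}
+```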
+
+The hourly cadence is the right answer for this workload — invoices and
+contracts cluster around business hours, and an hour of lost work is the
+worst-case data loss window most clients will tolerate. Promote to 15-min
+WAL streaming if a customer demands tighter RPO.
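+
+As a rough sketch of the retention rule in the table (7 days of hourly
+dumps, then 30 days of dailies), a pruning job could decide per dump
+like this. `keep_snapshot` is a hypothetical helper, and treating the
+00:00 UTC dump as "the daily" is an assumption, not something the
+scripts are known to do.
+
+```bash
+# Exit 0 if a dump should be kept, 1 if it may be pruned.
+# Usage: keep_snapshot <age-in-hours> <hour-of-day, 00..23>
+keep_snapshot() {
+  local age_hours=$1 hour=$2
+  # Every hourly dump is kept for the first 7 days.
+  if (( age_hours < 7 * 24 )); then return 0; fi
+  # After that only the 00:00 dump survives, up to 30 days.
+  # 10# forces base 10 so "08" and "09" are not parsed as octal.
+  if (( age_hours < 30 * 24 && 10#$hour == 0 )); then return 0; fi
+  return 1
+}
+```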
+
+## Required environment variables
+
+The scripts below read these. Store them in a CI secret store, not the
+host's bash profile.
+
+```
+# Source (the running CRM database)
+DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm
+
+# MinIO (source bucket — the live one)
+MINIO_ENDPOINT=minio.letsbe.solutions
+MINIO_PORT=443
+MINIO_USE_SSL=true
+MINIO_ACCESS_KEY=<live key>
+MINIO_SECRET_KEY=<live secret>
+MINIO_BUCKET=crm-files
+
+# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
+# with no IAM overlap with the live keys)
+BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
+BACKUP_S3_REGION=eu-west-1
+BACKUP_S3_BUCKET=portnimara-backups-prod
+BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
+BACKUP_S3_SECRET_KEY=<...>
+
+# Optional: encrypts dumps at rest to this GPG recipient's public key.
+# Limits the blast radius if the backup bucket itself is compromised.
+BACKUP_GPG_RECIPIENT=ops@portnimara.com
+```
+
+## Provisioning the backup destination
+
+1. Create a dedicated S3-compatible bucket in a **different account** from
+   the live infra. AWS S3, Backblaze B2, or a separately-credentialed
+   MinIO instance all work.
+2. Apply object-lock or versioning so an attacker who steals the backup
+   write key still can't permanently delete history.
+3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`,
+   `s3:ListBucket` on this bucket only. Inject them as
+   `BACKUP_S3_*` above. Do not reuse the live `MINIO_*` keys.
+4. Set a 90-day lifecycle rule that transitions objects older than 30
+   days to cold storage and deletes them at 90 days. Past 90 days it's
+   cheaper to restart from a snapshot taken outside the system.
+
+## The scripts
+
+Three scripts in `scripts/backup/`:
+
+- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
+- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
+- `restore.sh` — interactive restore (DB + MinIO) given a snapshot path
+
+Make them executable and wire them into cron / GitHub Actions / your
+scheduler of choice. Sample crontab on the worker host:
+
+```cron
+# Hourly DB dump at minute 7
+7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1
+
+# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
+17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1
+
+# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
+0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
+```
+
+## Restoring from cold
+
+These steps have been rehearsed against the dev environment; expect them
+to take 15–30 minutes for a typical port. **The drill (last cron line
+above) ensures the runbook stays correct — if the drill fails, the
+real restore will too.**
+
+### 0. Stop everything that writes
+
+```bash
+docker compose -f docker-compose.prod.yml stop web worker scheduler
+# Leave postgres + minio + redis up; we'll point them at restored data.
+```
+
+### 1. Restore PostgreSQL
+
+```bash
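+# Assumes an mc alias named `backup` for the backup endpoint, e.g.:
+#   mc alias set backup "$BACKUP_S3_ENDPOINT" "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY"
+# Find the dump you want. Prefer the most recent successful hour.
+mc ls "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
+SNAPSHOT="2026-04-28/14.dump.gz"
+
+# Pull it.
+mc cp "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/
+
+# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side.
+gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz
+
+# Drop & recreate the database. A session cannot drop the database it is
+# connected to, so run the DDL against the `postgres` maintenance DB
+# (assumes the role in DATABASE_URL may connect to it).
+# pg_restore orders data and constraints itself, so foreign keys such as
+# the RESTRICT one from gdpr_exports.requested_by to user need no manual
+# ordering.
+ADMIN_URL="${DATABASE_URL%/*}/postgres"
+psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
+psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
+gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
+  --dbname "$DATABASE_URL"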
+```
+
+### 2. Restore MinIO
+
+```bash
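+# Sync the backup bucket back over the live one. --overwrite handles
+# files that were modified between snapshots. Assumes mc aliases named
+# `backup` (backup endpoint) and `live` (live MinIO); see `mc alias set`.
+mc mirror --overwrite \
+  "backup/$BACKUP_S3_BUCKET/minio/" \
+  "live/$MINIO_BUCKET/"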
+```
+
+### 3. Restore secrets
+
+The `.env` file is **not** in object storage. Pull it from the password
+manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
+when the database was last running — if it doesn't, rows in
+`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
+"Test connection" button will return an opaque error. There is no
+recovery path; the keys must be re-entered through the admin UI.
+
+### 4. Bring services back up
+
+```bash
+docker compose -f docker-compose.prod.yml up -d
+# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
+docker compose -f docker-compose.prod.yml logs -f worker
+```
+
+### 5. Verify
+
+Work through the smoke checklist, in order:
+
+1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
+   matches the producer-side count from the snapshot's hour.
+2. **MinIO up** — open any client with attachments in the CRM, click a
+   receipt thumbnail; verify the signed URL serves the file.
+3. **Documenso webhooks** — re-trigger one in the Documenso admin and
+   confirm `audit_logs` records the receipt.
+4. **Email** — send a portal invite to a real address.
+5. **Realtime** — open two browser windows, edit a client in one, watch
+   the other update via Socket.IO.
+6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;`
+   returns a non-zero count if AI was in use. Old rows survive, but the
+   budget gates reset alongside the period boundary at month rollover.
+
+## Drill schedule
+
+The weekly drill (cron line above) runs `restore.sh --drill` against a
+throwaway database and a sandbox MinIO bucket. It must produce zero diff
+between the restored row counts and the live row counts (modulo rows
+written to the live database after the snapshot was taken).
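+
+A minimal sketch of the row-count comparison, assuming each side has
+been exported to a file of `<table> <count>` lines (for example from a
+`SELECT count(*)` per table). `diff_row_counts` is a hypothetical
+helper; how `restore.sh --drill` actually diffs counts may differ.
+
+```bash
+# Print one line per table whose count differs; print nothing on a match.
+# Usage: diff_row_counts <live-counts-file> <restored-counts-file>
+diff_row_counts() {
+  awk '
+    NR == FNR { live[$1] = $2; next }   # first file: remember live counts
+    {
+      seen[$1] = 1
+      if (!($1 in live)) {
+        printf "%s live=missing restored=%s\n", $1, $2
+      } else if (live[$1] != $2) {
+        printf "%s live=%s restored=%s\n", $1, live[$1], $2
+      }
+    }
+    END {
+      for (t in live) {
+        if (!(t in seen)) {
+          printf "%s live=%s restored=missing\n", t, live[t]
+        }
+      }
+    }
+  ' "$1" "$2"
+}
+```
+
+An empty result is the "zero diff" the drill expects. The sketch assumes
+the live counts file is non-empty.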
+
+Failure modes the drill catches before they bite production:
+
+- Tables that live outside what `pg_dump` is told to capture. We use the
+  default (no `--schema` flag), which dumps every non-system schema, but
+  if the script is ever narrowed to `--schema=public`, a future
+  `tenant_X` schema will silently drop out of the backups.
+- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
+  on certain prefixes.
+- GPG key rotation that wasn't propagated to the restore host.
+- A `pg_restore` version skew with the producer-side `pg_dump`.
--- a/docs/runbooks/email-deliverability.md
+++ b/docs/runbooks/email-deliverability.md
@@ -0,0 +1,186 @@
+# Email deliverability runbook
+
+The CRM sends transactional email through four different surfaces. Each
+has a different failure mode when it lands in spam. This runbook covers
+how to diagnose, fix, and verify each path.
+
+## What email the CRM sends
+
+| Surface                                   | Trigger                                                                                | Template                                                          | Default `from`                                        |
+| ----------------------------------------- | -------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ----------------------------------------------------- |
+| Portal activation / password-reset        | Admin invites a client to the portal                                                   | `src/lib/email/templates/portal-auth.ts`                          | per-port `email_settings.from_address` or `SMTP_FROM` |
+| Inquiry confirmation + sales notification | Public website POSTs to `/api/public/interests` or `/api/public/residential-inquiries` | `inquiry-client-confirmation.ts`, `inquiry-sales-notification.ts` | same                                                  |
+| GDPR export ready                         | Staff requests an export with `emailToClient=true`                                     | inline in `gdpr-export.service.ts`                                | same                                                  |
+| Documenso reminders                       | Cadence job fires for an unsigned signer                                               | `documenso/reminders/*`                                           | same                                                  |
+
+Documenso _itself_ sends signing requests with its own `from` address —
+those don't flow through this codebase. SPF/DKIM for the Documenso
+sender is the Documenso operator's problem, not yours.
+
+## DNS records
+
+For every domain that appears in a `from:` header you must publish:
+
+### 1. SPF
+
+A single TXT record at the apex authorizing whichever provider is
+sending. Multiple SPF records on the same name **break SPF entirely** —
+combine into one.
+
+```
+v=spf1 include:_spf.google.com include:amazonses.com -all
+```
+
+The `-all` (hardfail) is correct for transactional mail. Switch to `~all`
+(softfail) only as a temporary diagnostic when migrating providers.
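+
+To check the "one SPF record" rule from a shell, something like the
+following works on the output of `dig +short TXT <from-domain>`.
+`count_spf_records` is a hypothetical helper written for this runbook.
+
+```bash
+# Count SPF records in TXT output read from stdin (one record per line,
+# quoted the way `dig +short` prints them). Anything other than 1 is a
+# misconfiguration.
+count_spf_records() {
+  # grep -c exits non-zero on zero matches; `|| true` stops callers
+  # running under `set -e` from aborting.
+  grep -cE '^"?v=spf1( |"|$)' || true
+}
+
+# Example: dig +short TXT example.com | count_spf_records
+```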
+
+### 2. DKIM
+
+Each provider publishes its own selector. Common shapes:
+
+- Google Workspace: `google._domainkey` → 2048-bit RSA pubkey (rotate every 12 months).
+- Amazon SES: `xxxx._domainkey`, `yyyy._domainkey`, `zzzz._domainkey` (three CNAMEs SES gives you).
+- Postmark / Resend / Mailgun: one CNAME per selector.
+
+Verify alignment — the `d=` value in the DKIM signature must match the
+`From:` domain (relaxed alignment is fine, strict is overkill).
+
+### 3. DMARC
+
+Start at `p=none` while you build deliverability data, then upgrade to
+`p=quarantine` (the stage shown here) once the reports are clean.
+
+```
+_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"
+```
+
+`rua` (aggregate reports) is the diagnostic feed. Set it before the
+first send so the first report has data; receivers typically send
+aggregate reports daily, not weekly.
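+
+A quick way to read back the published policy, assuming the record is
+fetched with `dig +short TXT _dmarc.<from-domain>`. `dmarc_policy` is a
+hypothetical helper.
+
+```bash
+# Extract the p= tag from a DMARC TXT record read from stdin.
+# Prints e.g. "none", "quarantine" or "reject"; prints nothing if absent.
+dmarc_policy() {
+  # Require a space, semicolon or quote before "p=" so that the "sp="
+  # (subdomain policy) tag is never matched by mistake.
+  sed -n 's/.*[ ;"]p=\([a-z]*\).*/\1/p'
+}
+
+# Example: dig +short TXT _dmarc.example.com | dmarc_policy
+```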
+
+### 4. MX (only if you also receive)
+
+The CRM's IMAP probe (`scripts/dev-imap-probe.ts`) and the inbound thread
+sync rely on a real mailbox. Whoever runs that mailbox publishes the MX
+records — typically Google Workspace or a dedicated provider. Don't add
+an MX pointing at the CRM host; it doesn't accept inbound SMTP.
+
+## Per-port overrides
+
+Each port can override `from_address`, `from_name`, and SMTP creds via
+the admin email-settings page. When set, `getPortEmailConfig()` returns
+those values and `sendEmail()` uses them in preference to the global
+`SMTP_*` env. **The override domain still needs SPF / DKIM / DMARC** on
+its own DNS — without them, every send from that port lands in spam.
+
+When a customer reports "our portal invite didn't arrive":
+
+1. Pull the port's email settings from the admin UI. Check `from_address`.
+2. Run `dig TXT <from-domain>` and `dig TXT _dmarc.<from-domain>`.
+   Confirm SPF includes the SMTP provider's domain and DMARC exists.
+3. Send a probe through `mail-tester.com`: send a test message to the
+   one-off address it displays, then open the score breakdown.
+4. Score < 8/10 → fix whatever's flagged before doing anything else in
+   this runbook.
+
+## Diagnosing a "didn't arrive" report
+
+Order matters — go top-down, stop when one of these is the answer.
+
+### Step 1: Was the send attempted?
+
+```bash
+# Tail the worker logs for the recipient address.
+docker compose logs worker | grep '<recipient>'
+```
+
+You'll see one of three patterns:
+
+- **Nothing**: The job didn't run. Check that BullMQ actually queued it.
+  Run `redis-cli LLEN bull:email:wait` (BullMQ's wait list for the
+  `email` queue): if non-zero, the worker is dead.
+  `docker compose logs scheduler | tail` to see why.
+- **`Email sent`** with a message-id: The provider accepted it. Move to
+  Step 2.
+- **`SendError`**: Provider rejected. The error string says why
+  (auth, rate limit, blocked recipient).
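+
+The three patterns can be told apart mechanically. The sketch below
+assumes the literal strings `Email sent` and `SendError` appear on the
+same log line as the recipient address; `classify_send` is a
+hypothetical helper.
+
+```bash
+# Classify worker log output (stdin) for one recipient.
+# Prints: not-attempted | rejected | accepted | unknown
+classify_send() {
+  local recipient=$1 lines
+  # -F treats the address as a fixed string, so "." is not a wildcard.
+  lines=$(grep -F -- "$recipient" || true)
+  if [ -z "$lines" ]; then
+    echo "not-attempted"
+  elif grep -q 'SendError' <<<"$lines"; then
+    echo "rejected"
+  elif grep -q 'Email sent' <<<"$lines"; then
+    echo "accepted"
+  else
+    echo "unknown"
+  fi
+}
+
+# Example: docker compose logs worker | classify_send 'client@example.com'
+```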
+
+### Step 2: Is `EMAIL_REDIRECT_TO` set?
+
+In dev/test we set `EMAIL_REDIRECT_TO=ops@portnimara.com` so seeded fake
+clients don't get real email. **It must be unset in production.**
+
+```bash
+# On the production host:
+docker exec pncrm-web printenv EMAIL_REDIRECT_TO
+# Should print nothing.
+```
+
+If it's set, every email is going to the redirect target with the
+original recipient prefixed in the subject — the customer never sees it.
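+
+A deploy-time guard can make this check automatic. This is a sketch;
+`assert_no_email_redirect` is a hypothetical name and nothing in the
+repo is known to call it.
+
+```bash
+# Return 1 when EMAIL_REDIRECT_TO is set to a non-empty value.
+assert_no_email_redirect() {
+  if [ -n "${EMAIL_REDIRECT_TO:-}" ]; then
+    echo "EMAIL_REDIRECT_TO is set to '$EMAIL_REDIRECT_TO'; unset it in production" >&2
+    return 1
+  fi
+}
+
+# Example: run inside the container before starting the web process.
+# assert_no_email_redirect || exit 1
+```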
+
+### Step 3: Did it land but get filtered?
+
+Ask the recipient to check:
+
+- Spam / Junk folder
+- Gmail "Promotions" tab
+- Outlook "Other" folder (vs Focused)
+- The Quarantine console if they're on M365 with anti-spam enabled
+
+If found in a spam folder: the email arrived; the recipient's filter
+classified it. SPF/DKIM/DMARC alignment is suspect — re-run the
+mail-tester probe from above.
+
+### Step 4: Was the recipient on a suppression list?
+
+Some providers (SES, Postmark) maintain a suppression list: once an
+address hard-bounces, future sends to it are dropped silently.
+
+```bash
+# SES example:
+aws sesv2 list-suppressed-destinations --region eu-west-1
+```
+
+If the recipient is suppressed, remove them from the list and re-send. The
+CRM doesn't track suppression locally; that's the provider's job.
+
+## When migrating SMTP providers
+
+1. Add the new provider's DKIM CNAMEs alongside the old ones.
+2. Add the new provider's `include:` to the existing SPF record.
+3. Wait 48 hours for DNS to propagate and DMARC reports to confirm both
+   providers align.
+4. Switch `SMTP_*` env to the new provider on a single staging host.
+5. Send through the staging host for a week. Watch DMARC reports.
+6. Cut production over.
+7. Wait two weeks before removing the old provider's DNS — undelivered
+   bounce reports keep arriving for a while.
+
+## Testing a deliverability fix
+
+There's no automated test for "did this email reach the inbox" — that's a
+property of the recipient's filter, which we don't control. The closest
+proxy is the realapi suite:
+
+```bash
+pnpm exec playwright test --project=realapi
+```
+
+It runs `tests/e2e/realapi/portal-imap-activation.spec.ts` which sends a
+real portal-invite email through SMTP, then polls the configured IMAP
+mailbox for the activation link. If it appears within 30 seconds, the
+SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out, work
+backwards through this runbook.
+
+The realapi suite needs `SMTP_*` and `IMAP_*` env vars — see the
+"Optional dev/test-only env vars" block in `CLAUDE.md`.
+
+## Bounce handling
+
+The CRM doesn't currently process bounces. If you start seeing volume:
+
+- Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook
+  URL) to POST bounce events to a new `/api/webhooks/email-bounce` route.
+- Persist the bounced address into an `email_suppressions` table.
+- Have `sendEmail()` consult that table before each send.
+
+That work isn't in scope yet; this runbook just flags it as the next
+deliverability gap.