Matt Ciaccio 6eb0d3dc92 docs(ops): backup/restore + email deliverability runbooks

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 20:10:30 +02:00

Email deliverability runbook

The CRM sends transactional email through four different surfaces. Each has a different failure mode when it lands in spam. This runbook covers how to diagnose, fix, and verify each path.

What email the CRM sends

| Surface | Trigger | Template | Default `from` |
| --- | --- | --- | --- |
| Portal activation / password-reset | Admin invites a client to the portal | `src/lib/email/templates/portal-auth.ts` | per-port `email_settings.from_address` or `SMTP_FROM` |
| Inquiry confirmation + sales notification | Public website POSTs to `/api/public/interests` or `/api/public/residential-inquiries` | `inquiry-client-confirmation.ts`, `inquiry-sales-notification.ts` | same |
| GDPR export ready | Staff requests an export with `emailToClient=true` | inline in `gdpr-export.service.ts` | same |
| Documenso reminders | Cadence job fires for an unsigned signer | `documenso/reminders/*` | same |

Documenso itself sends signing requests with its own from address — those don't flow through this codebase. SPF/DKIM for the Documenso sender is the Documenso operator's problem, not yours.

DNS records

For every domain that appears in a from: header you must publish:

1. SPF

A single TXT record at the sending domain (the apex, unless mail is sent from a subdomain) authorizing every provider that sends for it. Multiple SPF records on the same name break SPF entirely, so combine them into one.

v=spf1 include:_spf.google.com include:amazonses.com -all

The -all (hardfail) is correct for transactional mail. Switch to ~all (softfail) only as a temporary diagnostic when migrating providers.
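
The SPF rules above (exactly one record, an include: for each provider, a trailing -all) can be checked mechanically once you have the TXT strings from dig. The following is a minimal TypeScript sketch; checkSpf and its inputs are illustrative names and are not part of the CRM codebase.

```typescript
// Result of inspecting the TXT records published at one domain name.
interface SpfCheck {
  ok: boolean;
  problems: string[];
}

// txtRecords: the TXT strings found at the sending domain (e.g. from `dig TXT`).
// requiredIncludes: provider SPF domains you expect to see (assumed input).
function checkSpf(txtRecords: string[], requiredIncludes: string[]): SpfCheck {
  const problems: string[] = [];
  const spf = txtRecords.filter((r) => /^v=spf1(\s|$)/i.test(r.trim()));

  if (spf.length === 0) {
    return { ok: false, problems: ["no SPF record published"] };
  }
  if (spf.length > 1) {
    problems.push(`${spf.length} SPF records found; combine them into one`);
  }

  // Inspect the first record term by term.
  const record = spf[0] ?? "";
  const terms = record.trim().split(/\s+/).map((t) => t.toLowerCase());
  for (const domain of requiredIncludes) {
    if (terms.indexOf(`include:${domain.toLowerCase()}`) === -1) {
      problems.push(`missing include:${domain}`);
    }
  }
  if (terms[terms.length - 1] !== "-all") {
    problems.push("record does not end in -all (hardfail)");
  }
  return { ok: problems.length === 0, problems };
}
```

Passing the example record above with `["_spf.google.com", "amazonses.com"]` as the required includes should come back with no problems; two separate `v=spf1` records at the same name are reported as an error.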

2. DKIM

Each provider publishes its own selector. Common shapes:

- Google Workspace: google._domainkey → 2048-bit RSA pubkey (rotate every 12 months).
- Amazon SES: xxxx._domainkey, yyyy._domainkey, zzzz._domainkey (three CNAMEs SES gives you).
- Postmark / Resend / Mailgun: one CNAME per selector.

Verify alignment: the d= value in the DKIM signature must align with the From: domain. Relaxed alignment, where the two only need to share the same organizational domain, is fine; strict alignment (exact match) is overkill.
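
To make the relaxed/strict distinction concrete, here is a small TypeScript sketch of an alignment check. It is illustrative only: naiveOrgDomain takes the last two labels as the organizational domain, which is an assumption that breaks for suffixes such as co.uk (real DMARC evaluators consult the Public Suffix List).

```typescript
type AlignmentMode = "relaxed" | "strict";

// Lowercase and drop a trailing dot so "Example.com." equals "example.com".
function normalizeDomain(domain: string): string {
  return domain.trim().toLowerCase().replace(/\.$/, "");
}

// ASSUMPTION: organizational domain = last two labels. Wrong for multi-label
// public suffixes (co.uk, com.au); use a Public Suffix List in real checks.
function naiveOrgDomain(domain: string): string {
  return normalizeDomain(domain).split(".").slice(-2).join(".");
}

// dkimD: the d= value from the DKIM-Signature header.
// fromDomain: the domain part of the From: header address.
function dkimAligned(
  dkimD: string,
  fromDomain: string,
  mode: AlignmentMode = "relaxed",
): boolean {
  if (mode === "strict") {
    return normalizeDomain(dkimD) === normalizeDomain(fromDomain);
  }
  return naiveOrgDomain(dkimD) === naiveOrgDomain(fromDomain);
}
```

A signature with d=mail.portnimara.com aligns with a From: of portnimara.com in relaxed mode but not in strict mode, while a signature signed only by the provider's own domain aligns in neither.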

3. DMARC

Start at p=none while you build deliverability data, then upgrade to a stricter policy. The record below shows the upgraded p=quarantine state.

_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"

rua (aggregate reports) is the diagnostic feed. Set it before the first send so the first report has data; most large receivers send aggregate reports daily.
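
A DMARC record is a semicolon-separated list of tag=value pairs, so it is easy to lint before publishing. The sketch below is a hedged TypeScript example; parseDmarc and dmarcProblems are hypothetical helper names, and the checks cover only the points this runbook cares about (version, policy, rua), not the full specification.

```typescript
// Split "v=DMARC1; p=quarantine; ..." into a tag -> value map.
function parseDmarc(record: string): Map<string, string> {
  const tags = new Map<string, string>();
  for (const part of record.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip empty or malformed segments
    tags.set(part.slice(0, eq).trim().toLowerCase(), part.slice(eq + 1).trim());
  }
  return tags;
}

// Returns human-readable problems; an empty array means the basics are in place.
function dmarcProblems(record: string): string[] {
  const tags = parseDmarc(record);
  const problems: string[] = [];
  const validPolicies = ["none", "quarantine", "reject"];

  if (tags.get("v") !== "DMARC1") {
    problems.push("record must declare v=DMARC1");
  }
  const policy = tags.get("p");
  if (policy === undefined) {
    problems.push("missing p= policy");
  } else if (validPolicies.indexOf(policy) === -1) {
    problems.push(`unknown policy p=${policy}`);
  }
  if (!tags.has("rua")) {
    problems.push("no rua= address, so no aggregate reports will arrive");
  }
  return problems;
}
```

Running dmarcProblems on the record above returns an empty list; dropping the rua tag produces a single warning.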

4. MX (only if you also receive)

The CRM's IMAP probe (scripts/dev-imap-probe.ts) and the inbound thread sync rely on a real mailbox. Whoever runs that mailbox publishes the MX records, typically Google Workspace or a dedicated provider. Don't add an MX pointing at the CRM host; it doesn't accept inbound SMTP.

Per-port overrides

Each port can override from_address, from_name, and SMTP creds via the admin email-settings page. When set, getPortEmailConfig() returns those values and sendEmail() uses them in preference to the global SMTP_* env. The override domain still needs SPF / DKIM / DMARC on its own DNS — without them, every send from that port lands in spam.
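
The fallback described above can be pictured with the following TypeScript sketch. It is not the real getPortEmailConfig() or sendEmail(): the EmailConfig shape, the field-by-field fallback, and both function names are assumptions made for illustration. The second helper shows which DNS names need SPF and DMARC records once an override changes the from: domain.

```typescript
// Hypothetical config shape; the real return type of getPortEmailConfig() may differ.
interface EmailConfig {
  fromAddress: string;
  fromName: string;
  smtpHost: string;
}

// Prefer each per-port value when present, otherwise fall back to the global env.
function resolveEmailConfig(
  portOverride: Partial<EmailConfig> | null,
  globalConfig: EmailConfig,
): EmailConfig {
  return {
    fromAddress: portOverride?.fromAddress ?? globalConfig.fromAddress,
    fromName: portOverride?.fromName ?? globalConfig.fromName,
    smtpHost: portOverride?.smtpHost ?? globalConfig.smtpHost,
  };
}

// DNS names whose TXT records must exist for the resolved from: domain.
function dnsNamesToVerify(config: EmailConfig): string[] {
  const at = config.fromAddress.lastIndexOf("@");
  const domain = config.fromAddress.slice(at + 1).toLowerCase();
  return [domain, `_dmarc.${domain}`];
}
```

The point of the second function is that the names to check follow the override, not the global SMTP_FROM, which is the usual reason a single port's mail lands in spam while others are fine.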

When a customer reports "our portal invite didn't arrive":

1. Pull the port's email settings from the admin UI. Check from_address.
2. Run dig TXT <from-domain> and dig TXT _dmarc.<from-domain>. Confirm SPF includes the SMTP provider's domain and DMARC exists.
3. Send a probe through mail-tester.com: send a test email from that port to the one-time address the site generates, then open the score breakdown.
4. If the score is below 8/10, fix whatever's flagged before doing anything else in this runbook.

Diagnosing a "didn't arrive" report

Order matters — go top-down, stop when one of these is the answer.

Step 1: Was the send attempted?

# Tail the worker logs for the recipient address.
docker compose logs worker | grep '<recipient>'

You'll see one of three patterns:

- Nothing: The job didn't run. Check that BullMQ actually queued it with redis-cli LLEN bull:email:wait (BullMQ keeps waiting jobs in the wait list). If the count is non-zero and not draining, the worker is dead; run docker compose logs scheduler | tail to see why. If it is zero, the job was never enqueued.
- Email sent with a message-id: The provider accepted it. Move to Step 2.
- SendError: Provider rejected. The error string says why (auth, rate limit, blocked recipient).
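
The three patterns above amount to a small decision function over the grep output. This TypeScript sketch is illustrative only: it assumes the worker logs contain the literal strings "Email sent" and "SendError" on the same line as the recipient address, which you should confirm against the real log format.

```typescript
type SendStatus = "not-attempted" | "accepted" | "rejected" | "unknown";

// logLines: worker log lines, oldest first (e.g. `docker compose logs worker`).
// recipient: the address the customer says never received the email.
function classifySendLogs(logLines: string[], recipient: string): SendStatus {
  const needle = recipient.toLowerCase();
  const hits = logLines.filter((l) => l.toLowerCase().indexOf(needle) !== -1);
  if (hits.length === 0) return "not-attempted";

  // Walk backwards so the most recent attempt decides the outcome.
  for (let i = hits.length - 1; i >= 0; i--) {
    const line = hits[i] ?? "";
    if (line.indexOf("SendError") !== -1) return "rejected";
    if (line.indexOf("Email sent") !== -1) return "accepted";
  }
  // The address was logged, but not in a shape this sketch recognizes.
  return "unknown";
}
```

A result of "accepted" means continue to Step 2; "not-attempted" points at the queue or the worker; "rejected" means read the provider's error string.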

Step 2: Is `EMAIL_REDIRECT_TO` set?

In dev/test we set EMAIL_REDIRECT_TO=ops@portnimara.com so seeded fake clients don't get real email. It must be unset in production.

# On the production host:
docker exec pncrm-web printenv EMAIL_REDIRECT_TO
# Should print nothing.

If it's set, every email is going to the redirect target with the original recipient prefixed in the subject — the customer never sees it.
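
For readers unfamiliar with the redirect behavior, here is a hedged TypeScript sketch of what such a rewrite typically looks like, plus a guard that flags a redirect left on in production. The function names, the OutgoingEmail shape, and the exact subject prefix format are assumptions; the CRM's real implementation may differ.

```typescript
// Minimal message shape for illustration; the real type has more fields.
interface OutgoingEmail {
  to: string;
  subject: string;
}

// When redirectTo is set, send to it instead and keep the original recipient
// visible in the subject. The "[to: ...]" format is an assumed convention.
function applyRedirect(
  email: OutgoingEmail,
  redirectTo: string | undefined,
): OutgoingEmail {
  if (!redirectTo) return email;
  return {
    ...email,
    to: redirectTo,
    subject: `[to: ${email.to}] ${email.subject}`,
  };
}

// True when the environment would silently swallow customer email.
function redirectLeaksIntoProduction(
  nodeEnv: string | undefined,
  redirectTo: string | undefined,
): boolean {
  return nodeEnv === "production" && Boolean(redirectTo);
}
```

A startup check built on the second function would turn this failure mode from a silent one into a loud one.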

Step 3: Did it land but get filtered?

Ask the recipient to check:

- Spam / Junk folder
- Gmail "Promotions" tab
- Outlook "Other" folder (vs Focused)
- The Quarantine console if they're on M365 with anti-spam enabled

If found in a spam folder: the email arrived; the recipient's filter classified it. SPF/DKIM/DMARC alignment is suspect — re-run the mail-tester probe from above.

Step 4: Was the recipient on a suppression list?

Some providers (SES, Postmark) maintain a suppression list: once an address hard-bounces or files a spam complaint, future sends to it are dropped silently.

# SES example (the suppression-list commands live under sesv2, not ses):
aws sesv2 list-suppressed-destinations --region eu-west-1

If the recipient is suppressed, remove them and ask them to retry. The CRM doesn't track suppression locally; that's the provider's job.

When migrating SMTP providers

1. Add the new provider's DKIM CNAMEs alongside the old ones.
2. Add the new provider's include: to the existing SPF record.
3. Wait 48 hours for DNS to propagate and DMARC reports to confirm both providers align.
4. Switch SMTP_* env to the new provider on a single staging host.
5. Send through the staging host for a week. Watch DMARC reports.
6. Cut production over.
7. Wait two weeks before removing the old provider's DNS; late bounce reports for mail sent through the old provider keep arriving for a while.
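
Adding the new provider's include: to the existing SPF record is the step most often done wrong (usually by publishing a second record). A minimal TypeScript sketch of merging into the one existing record follows; addSpfInclude is a hypothetical helper, not something that exists in the repo.

```typescript
// Returns the SPF record with include:<providerDomain> added exactly once,
// placed before the terminal "all" mechanism so it is still evaluated.
function addSpfInclude(record: string, providerDomain: string): string {
  const terms = record.trim().split(/\s+/);
  const include = `include:${providerDomain}`;

  // Already present: return the record unchanged (whitespace normalized).
  if (terms.some((t) => t.toLowerCase() === include.toLowerCase())) {
    return terms.join(" ");
  }

  // Mechanisms after "all" are never reached, so insert in front of it.
  const allIndex = terms.findIndex((t) => /^[-~?+]?all$/i.test(t));
  const insertAt = allIndex === -1 ? terms.length : allIndex;
  terms.splice(insertAt, 0, include);
  return terms.join(" ");
}
```

For example, merging amazonses.com into `v=spf1 include:_spf.google.com -all` yields `v=spf1 include:_spf.google.com include:amazonses.com -all`. Keep in mind that SPF evaluation is limited to 10 DNS-querying mechanisms, so prune the old provider's include: once the migration is finished.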

Testing a deliverability fix

There's no automated test for "did this email reach the inbox" — that's a property of the recipient's filter, which we don't control. The closest proxy is the realapi suite:

pnpm exec playwright test --project=realapi

It runs tests/e2e/realapi/portal-imap-activation.spec.ts which sends a real portal-invite email through SMTP, then polls the configured IMAP mailbox for the activation link. If it appears within 30 seconds, the SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out, work backwards through this runbook.

The realapi suite needs SMTP_* and IMAP_* env vars — see the "Optional dev/test-only env vars" block in CLAUDE.md.

Bounce handling

The CRM doesn't currently process bounces. If you start seeing volume:

1. Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook URL) to POST bounce events to a new /api/webhooks/email-bounce route.
2. Persist the bounced address into an email_suppressions table.
3. Have sendEmail() consult that table before each send.

That work isn't in scope yet; this runbook just flags it as the next deliverability gap.
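
As a starting point for whoever picks that work up, the three steps could look roughly like the TypeScript sketch below. Everything here is hypothetical: SuppressionList is an in-memory stand-in for the proposed email_suppressions table, the BounceEvent shape is invented, and suppressing only on hard bounces is an assumption rather than a decision the runbook has made.

```typescript
// Invented event shape; real provider webhooks carry far more detail.
interface BounceEvent {
  address: string;
  kind: "hard" | "soft";
}

// In-memory stand-in for the proposed email_suppressions table.
class SuppressionList {
  private readonly addresses = new Set<string>();

  // ASSUMPTION: only hard bounces suppress; soft bounces are treated as transient.
  recordBounce(event: BounceEvent): void {
    if (event.kind === "hard") {
      this.addresses.add(event.address.trim().toLowerCase());
    }
  }

  isSuppressed(address: string): boolean {
    return this.addresses.has(address.trim().toLowerCase());
  }
}

// Wraps a send function so suppressed recipients are skipped.
// Returns true when the send was attempted, false when it was suppressed.
function sendUnlessSuppressed(
  list: SuppressionList,
  to: string,
  send: (to: string) => void,
): boolean {
  if (list.isSuppressed(to)) return false;
  send(to);
  return true;
}
```

Returning a boolean rather than throwing keeps the caller free to log the skip, which avoids recreating the silent-drop behavior described in Step 4.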
