pn-new-crm/docs/runbooks/email-deliverability.md
Matt Ciaccio 6eb0d3dc92 docs(ops): backup/restore + email deliverability runbooks
Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 20:10:30 +02:00


Email deliverability runbook

The CRM sends transactional email through four different surfaces. Each has a different failure mode when it lands in spam. This runbook covers how to diagnose, fix, and verify each path.

What email the CRM sends

| Surface | Trigger | Template | Default from |
| --- | --- | --- | --- |
| Portal activation / password-reset | Admin invites a client to the portal | src/lib/email/templates/portal-auth.ts | per-port email_settings.from_address or SMTP_FROM |
| Inquiry confirmation + sales notification | Public website POSTs to /api/public/interests or /api/public/residential-inquiries | inquiry-client-confirmation.ts, inquiry-sales-notification.ts | same |
| GDPR export ready | Staff requests an export with emailToClient=true | inline in gdpr-export.service.ts | same |
| Documenso reminders | Cadence job fires for an unsigned signer | documenso/reminders/* | same |

Documenso itself sends signing requests with its own from address — those don't flow through this codebase. SPF/DKIM for the Documenso sender is the Documenso operator's problem, not yours.

DNS records

For every domain that appears in a from: header you must publish:

1. SPF

A single TXT record at the apex authorizing whichever provider is sending. Multiple SPF records on the same name break SPF entirely — combine into one.

v=spf1 include:_spf.google.com include:amazonses.com -all

The -all (hardfail) is correct for transactional mail. Switch to ~all (softfail) only as a temporary diagnostic when migrating providers.
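The SPF rules above (exactly one record, the sending provider included, hardfail at the end) can be checked mechanically. The following is a minimal sketch, not part of the CRM: checkSpf is a hypothetical helper, and its input is assumed to have the string[][] shape that Node's dns.promises.resolveTxt() returns.

```typescript
// Hypothetical helper, not part of the CRM codebase.
// txtRecords: one string[] of chunks per TXT record, as returned by
// Node's dns.promises.resolveTxt(domain).
type SpfCheck = { ok: boolean; problems: string[] };

function checkSpf(txtRecords: string[][], requiredInclude: string): SpfCheck {
  // Long TXT records arrive split into chunks; join them before inspecting.
  const spfRecords = txtRecords
    .map((chunks) => chunks.join(""))
    .filter((txt) => /^v=spf1(\s|$)/i.test(txt));

  const problems: string[] = [];
  if (spfRecords.length === 0) {
    problems.push("no SPF record published");
  } else if (spfRecords.length > 1) {
    problems.push("multiple SPF records on the same name; combine into one");
  } else {
    const terms = spfRecords[0].toLowerCase().split(/\s+/);
    const wanted = "include:" + requiredInclude.toLowerCase();
    if (terms.indexOf(wanted) === -1) {
      problems.push("missing " + wanted);
    }
    const last = terms[terms.length - 1];
    if (last !== "-all") {
      problems.push("record does not end in -all (found " + last + ")");
    }
  }
  return { ok: problems.length === 0, problems: problems };
}
```

Feeding it the result of resolveTxt for the from: domain gives a quick yes/no plus the list of things to fix; it does not evaluate nested includes or the SPF DNS-lookup limit.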

2. DKIM

Each provider publishes its own selector. Common shapes:

  • Google Workspace: google._domainkey → 2048-bit RSA pubkey (rotate every 12 months).
  • Amazon SES: xxxx._domainkey, yyyy._domainkey, zzzz._domainkey (three CNAMEs SES gives you).
  • Postmark / Resend / Mailgun: one CNAME per selector.

Verify alignment: the d= value in the DKIM signature must align with the From: domain. Relaxed alignment (same organizational domain, so a subdomain in d= passes) is fine; strict alignment (exact match) is overkill.
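To make the alignment rule concrete, here is a rough sketch of the comparison a receiver performs. It is an illustration only: dkimAligned and naiveOrgDomain are hypothetical names, and the organizational domain is approximated as the last two labels, which is wrong for suffixes such as co.uk (real verifiers consult the Public Suffix List).

```typescript
// ASSUMPTION: organizational domain = last two labels. This is a
// simplification; it gives wrong answers for multi-label public suffixes.
function naiveOrgDomain(domain: string): string {
  const labels = domain.toLowerCase().replace(/\.$/, "").split(".");
  return labels.slice(-2).join(".");
}

// mode "r" = relaxed (adkim=r), mode "s" = strict (adkim=s).
function dkimAligned(dkimD: string, fromDomain: string, mode: "r" | "s"): boolean {
  const d = dkimD.toLowerCase().replace(/\.$/, "");
  const from = fromDomain.toLowerCase().replace(/\.$/, "");
  if (mode === "s") {
    // Strict: the signing domain must equal the From: domain exactly.
    return d === from;
  }
  // Relaxed: both only need to share an organizational domain.
  return naiveOrgDomain(d) === naiveOrgDomain(from);
}
```

For example, a signature with d=mail.portnimara.com aligns with a From: of portnimara.com in relaxed mode but not in strict mode, while a signature whose d= is the provider's own domain aligns in neither.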

3. DMARC

Start at p=none while you build deliverability data, then upgrade to p=quarantine as in the record below.

_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"

rua (aggregate reports) is the diagnostic feed. Set it before the first send so the first aggregate report (most receivers send them daily) has data.
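A DMARC record is a semicolon-separated list of tag=value pairs, so it is easy to sanity-check before or after publishing. The sketch below is illustrative only; parseDmarc and dmarcWarnings are hypothetical helpers, and the warnings simply mirror the advice in this section.

```typescript
type DmarcTags = { [tag: string]: string };

// Split "v=DMARC1; p=quarantine; ..." into a tag -> value map.
function parseDmarc(record: string): DmarcTags {
  const tags: DmarcTags = {};
  const parts = record.split(";");
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i].trim();
    const eq = part.indexOf("=");
    if (part === "" || eq === -1) continue;
    tags[part.slice(0, eq).trim().toLowerCase()] = part.slice(eq + 1).trim();
  }
  return tags;
}

// Flag the conditions this runbook cares about.
function dmarcWarnings(record: string): string[] {
  const tags = parseDmarc(record);
  const warnings: string[] = [];
  if (tags["v"] !== "DMARC1") warnings.push("record must start with v=DMARC1");
  if (!tags["p"]) {
    warnings.push("missing p= policy");
  } else if (tags["p"] === "none") {
    warnings.push("p=none only monitors; upgrade once reports look clean");
  }
  if (!tags["rua"]) warnings.push("no rua= address, so no aggregate reports will arrive");
  return warnings;
}
```

Run against the record shown above, dmarcWarnings returns an empty list; a bare "v=DMARC1; p=none" yields two warnings (monitor-only policy, no rua).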

4. MX (only if you also receive)

The CRM's IMAP probe (scripts/dev-imap-probe.ts) and the inbound thread sync rely on a real mailbox. Whoever runs that mailbox publishes the MX records, typically Google Workspace or a dedicated provider. Don't add an MX pointing at the CRM host; it doesn't accept inbound SMTP.

Per-port overrides

Each port can override from_address, from_name, and SMTP creds via the admin email-settings page. When set, getPortEmailConfig() returns those values and sendEmail() uses them in preference to the global SMTP_* env. The override domain still needs SPF / DKIM / DMARC on its own DNS; without them, sends from that port are likely to be spam-foldered or rejected.
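The precedence described above can be pictured as follows. This is a sketch under stated assumptions, not the actual getPortEmailConfig() implementation: the type shapes and the helpers resolveFromAddress and fromDomain are hypothetical, and only the from_address and SMTP_FROM names come from this runbook.

```typescript
// Hypothetical shapes; the real email_settings row and env loader may differ.
type PortEmailSettings = { from_address?: string | null; from_name?: string | null };
type GlobalEnv = { SMTP_FROM: string };

// A per-port from_address wins; otherwise fall back to the global SMTP_FROM.
function resolveFromAddress(port: PortEmailSettings | null, env: GlobalEnv): string {
  const override = port && port.from_address ? port.from_address.trim() : "";
  return override !== "" ? override : env.SMTP_FROM;
}

// Extract the domain that needs SPF / DKIM / DMARC.
// Accepts both "user@domain" and "Name <user@domain>".
function fromDomain(address: string): string {
  const match = /<([^>]+)>/.exec(address);
  const bare = match ? match[1] : address;
  return bare.slice(bare.lastIndexOf("@") + 1).trim().toLowerCase();
}
```

The useful part for an operator is the second function: whatever address wins, its domain is the one whose DNS records have to be in order.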

When a customer reports "our portal invite didn't arrive":

  1. Pull the port's email settings from the admin UI. Check from_address.
  2. Run dig TXT <from-domain> and dig TXT _dmarc.<from-domain>. Confirm SPF includes the SMTP provider's domain and DMARC exists.
  3. Send a probe through mail-tester.com: copy the one-time address it shows, send a test email from the port to that address, then open the score breakdown.
  4. Score < 8/10 → fix whatever's flagged before doing anything else in this runbook.

Diagnosing a "didn't arrive" report

Order matters — go top-down, stop when one of these is the answer.

Step 1: Was the send attempted?

# Tail the worker logs for the recipient address.
docker compose logs worker | grep '<recipient>'

You'll see one of three patterns:

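  • Nothing: The job didn't run. Check that BullMQ actually queued it with redis-cli LLEN bull:email:wait (BullMQ keeps waiting jobs in the wait list). If the count is non-zero, jobs are queued but nothing is consuming them, so the worker is down or stuck; run docker compose logs scheduler | tail to see why. If it is zero, the job was never enqueued.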
  • Email sent with a message-id: The provider accepted it. Move to Step 2.
  • SendError: Provider rejected. The error string says why (auth, rate limit, blocked recipient).
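The three patterns above amount to a small classification over the grep output. A minimal sketch follows, assuming the log lines contain the recipient address plus either the text "Email sent" or "SendError" as described; the exact log format is an assumption, and classifySend is a hypothetical helper.

```typescript
type SendOutcome = "not-attempted" | "accepted" | "rejected";

// logLines: worker log output, one entry per line.
// Returns the outcome of the most recent line mentioning the recipient.
function classifySend(logLines: string[], recipient: string): SendOutcome {
  const needle = recipient.toLowerCase();
  let outcome: SendOutcome = "not-attempted";
  for (let i = 0; i < logLines.length; i++) {
    const line = logLines[i];
    if (line.toLowerCase().indexOf(needle) === -1) continue;
    if (line.indexOf("SendError") !== -1) {
      outcome = "rejected"; // provider refused; the error string says why
    } else if (line.indexOf("Email sent") !== -1) {
      outcome = "accepted"; // provider took it; continue to Step 2
    }
  }
  return outcome;
}
```

"not-attempted" maps to the queue check, "accepted" means move on to Step 2, and "rejected" means read the provider's error string.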

Step 2: Is EMAIL_REDIRECT_TO set?

In dev/test we set EMAIL_REDIRECT_TO=ops@portnimara.com so seeded fake clients don't get real email. It must be unset in production.

# On the production host:
docker exec pncrm-web printenv EMAIL_REDIRECT_TO
# Should print nothing.

If it's set, every email is going to the redirect target with the original recipient prefixed in the subject — the customer never sees it.
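The redirect behaviour can be pictured like this. It is a sketch of the idea only: applyRedirect is a hypothetical name and the "[to: …]" subject prefix format is an assumption, since this runbook only says the original recipient is prefixed in the subject.

```typescript
type OutgoingEmail = { to: string; subject: string };

// When a redirect target is configured, every message goes there instead,
// and the intended recipient survives only in the subject line.
function applyRedirect(email: OutgoingEmail, redirectTo: string | undefined): OutgoingEmail {
  if (!redirectTo || redirectTo.trim() === "") {
    return email; // production path: deliver to the real recipient
  }
  return {
    to: redirectTo.trim(),
    // ASSUMPTION: the prefix format is illustrative, not the CRM's actual one.
    subject: "[to: " + email.to + "] " + email.subject
  };
}
```

This is why a stray EMAIL_REDIRECT_TO in production looks like a deliverability problem from the customer's side: the send succeeds, just to the wrong mailbox.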

Step 3: Did it land but get filtered?

Ask the recipient to check:

  • Spam / Junk folder
  • Gmail "Promotions" tab
  • Outlook "Other" folder (vs Focused)
  • The Quarantine console if they're on M365 with anti-spam enabled

If found in a spam folder: the email arrived; the recipient's filter classified it. SPF/DKIM/DMARC alignment is suspect — re-run the mail-tester probe from above.

Step 4: Was the recipient on a suppression list?

Some providers (SES, Postmark) maintain a suppression list: once an address hard-bounces, future sends to it are dropped silently.

# SES example:
aws sesv2 list-suppressed-destinations --region eu-west-1

If the recipient is suppressed, remove them from the provider's list and resend. The CRM doesn't track suppression locally; that's the provider's job.

When migrating SMTP providers

  1. Add the new provider's DKIM CNAMEs alongside the old ones.
  2. Add the new provider's include: to the existing SPF record.
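  3. Wait 48 hours for DNS to propagate, then confirm with dig that both providers' records resolve. DMARC reports can only confirm that the new provider aligns once it starts sending (steps 4 and 5).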
  4. Switch SMTP_* env to the new provider on a single staging host.
  5. Send through the staging host for a week. Watch DMARC reports.
  6. Cut production over.
  7. Wait two weeks before removing the old provider's DNS; delayed deliveries and bounces for mail it signed keep arriving for a while.
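Step 2 of the plan above (adding the new provider to the existing SPF record rather than publishing a second one) is easy to get wrong by hand. A minimal sketch of the merge follows, assuming a single well-formed record; addSpfInclude is a hypothetical helper.

```typescript
// Insert include:<provider> into an existing SPF record, just before the
// terminal "all" mechanism. Returns the record unchanged if already present.
function addSpfInclude(record: string, include: string): string {
  const terms = record.trim().split(/\s+/);
  if (terms[0].toLowerCase() !== "v=spf1") {
    throw new Error("not an SPF record: " + record);
  }
  const mechanism = "include:" + include;
  let allIndex = terms.length; // append if there is no "all" mechanism
  for (let i = 1; i < terms.length; i++) {
    if (terms[i].toLowerCase() === mechanism.toLowerCase()) {
      return terms.join(" "); // already authorized; nothing to do
    }
    if (/^[-~+?]?all$/i.test(terms[i])) {
      allIndex = i;
      break;
    }
  }
  terms.splice(allIndex, 0, mechanism);
  return terms.join(" ");
}
```

Keeping the merge idempotent means it can be re-run safely while the migration is in progress. It does not count DNS lookups, so a record that already has many includes still needs a manual check against the SPF lookup limit.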

Testing a deliverability fix

There's no automated test for "did this email reach the inbox" — that's a property of the recipient's filter, which we don't control. The closest proxy is the realapi suite:

pnpm exec playwright test --project=realapi

It runs tests/e2e/realapi/portal-imap-activation.spec.ts which sends a real portal-invite email through SMTP, then polls the configured IMAP mailbox for the activation link. If it appears within 30 seconds, the SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out, work backwards through this runbook.

The realapi suite needs SMTP_* and IMAP_* env vars — see the "Optional dev/test-only env vars" block in CLAUDE.md.

Bounce handling

The CRM doesn't currently process bounces. If you start seeing volume:

  • Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook URL) to POST bounce events to a new /api/webhooks/email-bounce route.
  • Persist the bounced address into an email_suppressions table.
  • Have sendEmail() consult that table before each send.

That work isn't in scope yet; this runbook just flags it as the next deliverability gap.
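For orientation, the three pieces listed above could fit together roughly as follows. This is a sketch of the proposed design, not existing code: the in-memory store stands in for the email_suppressions table, and recordBounce, isSuppressed, and filterRecipients are hypothetical names.

```typescript
// Hypothetical in-memory stand-in for the proposed email_suppressions table.
type Suppression = { reason: string; at: string };
type SuppressionStore = { [address: string]: Suppression };

function normalizeAddress(address: string): string {
  return address.trim().toLowerCase();
}

// What the bounce webhook route would do with each event.
function recordBounce(store: SuppressionStore, address: string, reason: string, at: string): void {
  store[normalizeAddress(address)] = { reason: reason, at: at };
}

function isSuppressed(store: SuppressionStore, address: string): boolean {
  return Object.prototype.hasOwnProperty.call(store, normalizeAddress(address));
}

// What sendEmail() would consult before each send.
function filterRecipients(
  store: SuppressionStore,
  recipients: string[]
): { send: string[]; skipped: string[] } {
  const send: string[] = [];
  const skipped: string[] = [];
  for (let i = 0; i < recipients.length; i++) {
    if (isSuppressed(store, recipients[i])) skipped.push(recipients[i]);
    else send.push(recipients[i]);
  }
  return { send: send, skipped: skipped };
}
```

Normalizing addresses on both write and read keeps a bounce recorded for Client@Example.com from being missed when the next send targets client@example.com.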