Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md: covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh: pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh: incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh: interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md: what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist that includes the
  EMAIL_REDIRECT_TO check), provider migration plan, realapi suite as
  the end-to-end probe

Tests still 778/778 vitest, tsc/lint clean; these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Email deliverability runbook

The CRM sends transactional email through four different surfaces. Each
has a different failure mode when it lands in spam. This runbook covers
how to diagnose, fix, and verify each path.

## What email the CRM sends

| Surface                                   | Trigger                                                                                | Template                                                          | Default `from`                                        |
| ----------------------------------------- | -------------------------------------------------------------------------------------- | ----------------------------------------------------------------- | ----------------------------------------------------- |
| Portal activation / password-reset        | Admin invites a client to the portal                                                   | `src/lib/email/templates/portal-auth.ts`                          | per-port `email_settings.from_address` or `SMTP_FROM` |
| Inquiry confirmation + sales notification | Public website POSTs to `/api/public/interests` or `/api/public/residential-inquiries` | `inquiry-client-confirmation.ts`, `inquiry-sales-notification.ts` | same                                                  |
| GDPR export ready                         | Staff requests an export with `emailToClient=true`                                     | inline in `gdpr-export.service.ts`                                | same                                                  |
| Documenso reminders                       | Cadence job fires for an unsigned signer                                               | `documenso/reminders/*`                                           | same                                                  |

Documenso _itself_ sends signing requests with its own `from` address;
those don't flow through this codebase. SPF/DKIM for the Documenso
sender is the Documenso operator's problem, not yours.

## DNS records

For every domain that appears in a `From:` header you must publish:

### 1. SPF

A single TXT record at the apex authorizing whichever provider is
sending. Multiple SPF records on the same name **break SPF entirely**
(receivers evaluate them as `permerror`), so combine them into one.

```
v=spf1 include:_spf.google.com include:amazonses.com -all
```

The `-all` (hardfail) is correct for transactional mail. Switch to `~all`
(softfail) only as a temporary diagnostic when migrating providers.

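The single-record rule above can be checked mechanically. The following is a minimal sketch, assuming you already have the TXT strings for the domain (for example from `dig TXT`); `checkSpf` and `SpfCheck` are hypothetical names, not part of the CRM codebase.

```typescript
// Result of checking the TXT records published on one name.
type SpfCheck =
  | { ok: true; record: string }
  | { ok: false; reason: string };

// txtRecords: TXT strings returned for the From: domain.
// providerInclude: the include target your sending provider documents (assumed input).
function checkSpf(txtRecords: string[], providerInclude: string): SpfCheck {
  // Only records that start with the "v=spf1" version tag count as SPF.
  const spf = txtRecords
    .map((r) => r.trim())
    .filter((r) => /^v=spf1(\s|$)/i.test(r));
  if (spf.length === 0) return { ok: false, reason: "no SPF record" };
  if (spf.length > 1) return { ok: false, reason: "multiple SPF records (permerror)" };

  const record = spf[0] as string;
  const terms = record.toLowerCase().split(/\s+/);
  const wanted = `include:${providerInclude.toLowerCase()}`;
  if (terms.indexOf(wanted) === -1) {
    return { ok: false, reason: `missing include:${providerInclude}` };
  }
  return { ok: true, record };
}
```

Calling `checkSpf(records, "amazonses.com")` on a name that carries two `v=spf1` records returns the "multiple SPF records" failure, which is the case most likely to slip through a manual look at `dig` output.
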
### 2. DKIM

Each provider publishes its own selector. Common shapes:

- Google Workspace: `google._domainkey` → 2048-bit RSA pubkey (rotate every 12 months).
- Amazon SES: `xxxx._domainkey`, `yyyy._domainkey`, `zzzz._domainkey` (three CNAMEs SES gives you).
- Postmark / Resend / Mailgun: one CNAME per selector.

Verify alignment: the `d=` value in the DKIM signature must match the
`From:` domain. Relaxed alignment (same organizational domain, so a
subdomain in `d=` still passes) is fine; strict is overkill.

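To make the relaxed versus strict distinction concrete, here is a small sketch. The helper names are hypothetical, and the organizational-domain guess is deliberately naive (last two labels); a real check needs a Public Suffix List lookup.

```typescript
type AlignmentMode = "relaxed" | "strict";

// Naive organizational-domain guess: the last two labels. This is an
// assumption for illustration only; it is wrong for suffixes such as
// co.uk, where a Public Suffix List lookup is needed.
function naiveOrgDomain(domain: string): string {
  return domain.toLowerCase().split(".").slice(-2).join(".");
}

// fromDomain: domain of the From: header.
// dkimDomain: the d= tag of the DKIM signature.
function isDkimAligned(
  fromDomain: string,
  dkimDomain: string,
  mode: AlignmentMode = "relaxed",
): boolean {
  const from = fromDomain.toLowerCase();
  const d = dkimDomain.toLowerCase();
  if (mode === "strict") return from === d; // exact match only
  return naiveOrgDomain(from) === naiveOrgDomain(d); // same organizational domain
}
```

With this sketch, a signature with `d=mail.portnimara.com` aligns with a `From:` of `portnimara.com` in relaxed mode but not in strict mode, while a provider-owned `d=` such as `amazonses.com` aligns in neither.
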
### 3. DMARC

Start at `p=none` while you build deliverability data, then upgrade to
`p=quarantine` (shown below) once the reports are clean.

```
_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"
```

`rua` (aggregate reports) is the diagnostic feed. Set it before the
first send so the first aggregate report (receivers typically send them
daily) has data.

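A DMARC record is a semicolon-separated list of `tag=value` pairs, so it is easy to sanity-check in code. This is an illustrative sketch with hypothetical helper names; it only flags the points this runbook cares about and is not a full validator.

```typescript
// Parse "v=DMARC1; p=quarantine; rua=..." into a tag -> value map.
function parseDmarc(record: string): Record<string, string> {
  const tags: Record<string, string> = {};
  for (const part of record.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip empty or malformed segments
    tags[part.slice(0, eq).trim().toLowerCase()] = part.slice(eq + 1).trim();
  }
  return tags;
}

// Returns human-readable warnings; an empty array means none were found.
function dmarcWarnings(record: string): string[] {
  const tags = parseDmarc(record);
  const warnings: string[] = [];
  if (tags["v"] !== "DMARC1") warnings.push("not a DMARC record");
  if (!tags["rua"]) warnings.push("no rua: aggregate reports will not arrive");
  if (!tags["p"]) {
    warnings.push("no policy (p=) tag");
  } else if (tags["p"] === "none") {
    warnings.push("p=none: monitoring only");
  }
  return warnings;
}
```

Feed it the quoted TXT value only (without the `_dmarc 14400 IN TXT` zone-file prefix).
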
### 4. MX (only if you also receive)

The CRM's IMAP probe (`scripts/dev-imap-probe.ts`) and the inbound thread
sync rely on a real mailbox. Whoever runs that mailbox publishes the MX
records, typically Google Workspace or a dedicated provider. Don't add
an MX pointing at the CRM host; it doesn't accept inbound SMTP.

## Per-port overrides

Each port can override `from_address`, `from_name`, and SMTP creds via
the admin email-settings page. When set, `getPortEmailConfig()` returns
those values and `sendEmail()` uses them in preference to the global
`SMTP_*` env. **The override domain still needs SPF / DKIM / DMARC** on
its own DNS. Without them, sends from that port are likely to land in
spam or be rejected outright.

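The precedence described above can be pictured as a simple merge. This is a sketch under stated assumptions: the `EmailConfig` shape, the field names, and the field-by-field fallback are all hypothetical, and the real `getPortEmailConfig()` may behave differently.

```typescript
// Hypothetical shapes; the real email_settings row and env parsing may differ.
interface EmailConfig {
  fromAddress: string;
  fromName?: string;
  smtpHost: string;
}
type PortOverride = Partial<EmailConfig>;

// Per-port values win; anything the port leaves unset falls back to the
// global config (field-by-field fallback is an assumption of this sketch).
function resolveEmailConfig(globalConfig: EmailConfig, override?: PortOverride): EmailConfig {
  const merged: EmailConfig = { ...globalConfig };
  if (!override) return merged;
  if (override.fromAddress) merged.fromAddress = override.fromAddress;
  if (override.fromName) merged.fromName = override.fromName;
  if (override.smtpHost) merged.smtpHost = override.smtpHost;
  return merged;
}

// The domain whose SPF / DKIM / DMARC records have to exist.
function fromDomain(config: EmailConfig): string {
  const at = config.fromAddress.lastIndexOf("@");
  return config.fromAddress.slice(at + 1).toLowerCase();
}
```

The point of `fromDomain` is the operational consequence: as soon as a port overrides `from_address`, the domain you need to run DNS checks against changes with it.
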
When a customer reports "our portal invite didn't arrive":

1. Pull the port's email settings from the admin UI. Check `from_address`.
2. Run `dig TXT <from-domain>` and `dig TXT _dmarc.<from-domain>`.
   Confirm SPF includes the SMTP provider's domain and DMARC exists.
3. Send a probe through `mail-tester.com`: send a test email from the
   port to the one-time address the site gives you, then open the score
   breakdown.
4. Score < 8/10 → fix whatever's flagged before doing anything else in
   this runbook.

## Diagnosing a "didn't arrive" report

Order matters: go top-down, and stop when one of these is the answer.

### Step 1: Was the send attempted?

```bash
# Search the worker logs for the recipient address (-F matches it literally).
docker compose logs worker | grep -F '<recipient>'
```

You'll see one of three patterns:

- **Nothing**: The job didn't run. Check that BullMQ actually queued it.
  `redis-cli LLEN bull:email:wait` (BullMQ keeps waiting jobs under the
  `wait` key): if non-zero, jobs are queued but nothing is consuming
  them, so the worker is dead or stalled.
  `docker compose logs scheduler | tail` to see why.
- **`Email sent`** with a message-id: The provider accepted it. Move to
  Step 2.
- **`SendError`**: Provider rejected. The error string says why
  (auth, rate limit, blocked recipient).

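The three log patterns above map onto a three-way classification. The sketch below assumes only the two markers named in this runbook (`Email sent` and `SendError`); `classifySend` is a hypothetical helper, and the rest of the log line format is not assumed.

```typescript
type SendOutcome = "not-attempted" | "accepted" | "rejected";

// logLines: worker log output already filtered to the recipient address,
// oldest first. Only the "Email sent" and "SendError" markers are assumed.
function classifySend(logLines: string[]): SendOutcome {
  let outcome: SendOutcome = "not-attempted";
  for (const line of logLines) {
    if (line.includes("SendError")) {
      outcome = "rejected";
    } else if (line.includes("Email sent")) {
      outcome = "accepted";
    }
  }
  return outcome; // the most recent matching line decides
}
```

Letting the latest line win means a retry that succeeded after an earlier `SendError` is reported as accepted, which matches how you would read the log by hand.
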
### Step 2: Is `EMAIL_REDIRECT_TO` set?

In dev/test we set `EMAIL_REDIRECT_TO=ops@portnimara.com` so seeded fake
clients don't get real email. **It must be unset in production.**

```bash
# On the production host:
docker exec pncrm-web printenv EMAIL_REDIRECT_TO
# Should print nothing.
```

If the worker runs in a separate container, check it the same way.

If it's set, every email is going to the redirect target with the
original recipient prefixed in the subject, so the customer never sees it.

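For readers who have not seen the redirect in action, this sketch shows the kind of rewrite it performs. It is an illustration only: `applyRedirect` is a hypothetical function, and the `[to: ...]` subject prefix is an assumed format, not necessarily what the CRM emits.

```typescript
interface OutgoingEmail {
  to: string;
  subject: string;
}

// redirectTo mirrors EMAIL_REDIRECT_TO; undefined or empty means "deliver normally".
// The "[to: ...]" subject prefix is an assumed format, not the CRM's actual one.
function applyRedirect(email: OutgoingEmail, redirectTo?: string): OutgoingEmail {
  if (!redirectTo) return email;
  return {
    to: redirectTo,
    subject: `[to: ${email.to}] ${email.subject}`,
  };
}
```

The practical takeaway is that a redirected message is still logged as sent in Step 1, so a clean log does not rule this step out.
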
### Step 3: Did it land but get filtered?

Ask the recipient to check:

- Spam / Junk folder
- Gmail "Promotions" tab
- Outlook "Other" folder (vs Focused)
- The Quarantine console if they're on M365 with anti-spam enabled

If found in a spam folder: the email arrived; the recipient's filter
classified it. SPF/DKIM/DMARC alignment is suspect, so re-run the
mail-tester probe from above.

### Step 4: Was the recipient on a suppression list?

Some providers (SES, Postmark) maintain a suppression list: once an
address hard-bounces or files a complaint, future sends to it are
dropped silently.

```bash
# SES example (the account-level suppression list is part of the SESv2 API):
aws sesv2 list-suppressed-destinations --region eu-west-1
```

If the recipient is suppressed, remove them from the list (for SES,
`aws sesv2 delete-suppressed-destination --email-address <recipient>`)
and re-send, or ask the customer to retry. The CRM doesn't track
suppression locally; that's the provider's job.

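The SES listing can be long, so it helps to search the JSON rather than read it. A minimal sketch, assuming one page of `list-suppressed-destinations` output has been captured as a string; `findSuppression` is a hypothetical helper, and only the response fields it reads are declared.

```typescript
// Fields of the SESv2 ListSuppressedDestinations response used here.
interface SuppressedSummary {
  EmailAddress: string;
  Reason: string; // "BOUNCE" or "COMPLAINT"
}
interface SuppressionPage {
  SuppressedDestinationSummaries?: SuppressedSummary[];
  NextToken?: string; // present when more pages remain
}

// Returns the suppression entry for the recipient, or undefined if this
// page does not contain one.
function findSuppression(pageJson: string, recipient: string): SuppressedSummary | undefined {
  const page = JSON.parse(pageJson) as SuppressionPage;
  const wanted = recipient.trim().toLowerCase();
  return (page.SuppressedDestinationSummaries ?? []).find(
    (s) => s.EmailAddress.toLowerCase() === wanted,
  );
}
```

This looks at a single page only; if the output carries a `NextToken`, fetch the remaining pages before concluding the address is not suppressed.
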
## When migrating SMTP providers

1. Add the new provider's DKIM CNAMEs alongside the old ones.
2. Add the new provider's `include:` to the existing SPF record (keep
   it a single record, and stay within SPF's limit of 10 DNS lookups).
3. Wait 48 hours for DNS to propagate and DMARC reports to confirm both
   providers align.
4. Switch `SMTP_*` env to the new provider on a single staging host.
5. Send through the staging host for a week. Watch DMARC reports.
6. Cut production over.
7. Wait two weeks before removing the old provider's DNS: delayed
   bounces and DMARC reports for mail it already sent keep arriving for
   a while.

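Step 2 of the list above is where hand edits tend to go wrong (a second record, or an `include:` placed after `-all`). This sketch shows one way to merge the new provider into the existing record; `addSpfInclude` is a hypothetical helper and does not count DNS lookups.

```typescript
// Insert include:<provider> before the trailing "all" mechanism of an
// existing SPF record. Returns the record unchanged (whitespace
// normalized) if the include is already present.
function addSpfInclude(record: string, providerInclude: string): string {
  const terms = record.trim().split(/\s+/);
  const include = `include:${providerInclude}`;
  if (terms.some((t) => t.toLowerCase() === include.toLowerCase())) {
    return terms.join(" ");
  }
  // Mechanisms after "all" are never evaluated, so the include must go before it.
  const allIndex = terms.findIndex((t) => /^[-~?+]?all$/i.test(t));
  if (allIndex === -1) {
    terms.push(include);
  } else {
    terms.splice(allIndex, 0, include);
  }
  return terms.join(" ");
}
```

Because the function is idempotent, running it twice with the same provider leaves the record unchanged, which makes it safe to use in a scripted DNS update.
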
## Testing a deliverability fix

There's no automated test for "did this email reach the inbox"; that's a
property of the recipient's filter, which we don't control. The closest
proxy is the realapi suite:

```bash
pnpm exec playwright test --project=realapi
```

It runs `tests/e2e/realapi/portal-imap-activation.spec.ts`, which sends a
real portal-invite email through SMTP, then polls the configured IMAP
mailbox for the activation link. If it appears within 30 seconds, the
SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out, work
backwards through this runbook.

The realapi suite needs `SMTP_*` and `IMAP_*` env vars; see the
"Optional dev/test-only env vars" block in `CLAUDE.md`.

## Bounce handling

The CRM doesn't currently process bounces. If you start seeing volume:

- Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook
  URL) to POST bounce events to a new `/api/webhooks/email-bounce` route.
- Persist the bounced address into an `email_suppressions` table.
- Have `sendEmail()` consult that table before each send.

That work isn't in scope yet; this runbook just flags it as the next
deliverability gap.

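As a starting point for that future work, the three bullets above could be shaped roughly like this. Everything here is hypothetical: the class, the function names, and the in-memory set standing in for the `email_suppressions` table; nothing like it exists in the codebase today.

```typescript
// In-memory stand-in for the proposed email_suppressions table.
class SuppressionStore {
  private readonly addresses = new Set<string>();

  // Would be called by the (hypothetical) bounce webhook handler.
  recordBounce(address: string): void {
    this.addresses.add(address.trim().toLowerCase());
  }

  isSuppressed(address: string): boolean {
    return this.addresses.has(address.trim().toLowerCase());
  }
}

// Guard a send. deliver() is an injected stand-in for the real transport,
// so this sketch stays independent of how sendEmail() is implemented.
function sendUnlessSuppressed(
  store: SuppressionStore,
  to: string,
  deliver: (to: string) => void,
): boolean {
  if (store.isSuppressed(to)) return false; // skipped, nothing sent
  deliver(to);
  return true;
}
```

Returning a boolean instead of throwing lets the caller log a skipped send explicitly, so a suppressed recipient shows up in Step 1 of the diagnosis flow rather than looking like a job that never ran.
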