docs(ops): backup/restore + email deliverability runbooks

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):

- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), the schedule (hourly DB dump +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention), the
  cold-restore procedure with row-count verification, and the weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload; fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror with no --remove
  flag, so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore, plus a --drill
  mode that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):

- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications, a
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), a provider migration plan, and the realapi suite
  as the end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/runbooks/backup-and-restore.md (new file, 199 lines)

# Backup and restore runbook

This runbook documents what gets backed up, how often, where it lands, and
the exact commands to restore the system from a cold start. The goal is
that any operator who has the off-site backup credentials can bring the
CRM back up on a clean host without help.

## Scope of a "full backup"

The CRM has three stateful surfaces. All three must be captured for a
restore to be useful.

| Surface | Holds | Risk if missing |
| ------- | ----- | --------------- |
| **PostgreSQL** (`port_nimara_crm`) | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable. |
| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments. | Files referenced by rows in Postgres become 404s. |
| **`.env` + secrets** | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`). | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |

The Redis instance is not backed up. It only holds queue state, rate-limit
counters, and Socket.IO presence — all reconstructable. Stop the workers
during a restore so the queue starts clean.

## Backup schedule

Defaults are tuned for a single-port deployment with O(10k) clients. Bump
the frequency on the producing side as scale demands.

| Job | Frequency | Retention | Where |
| --- | --------- | --------- | ----- |
| `pg_dump` (custom format, gzipped) | Hourly | 7 days hourly + 30 days daily | `${BACKUP_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz` |
| MinIO mirror | Hourly (incremental) | 30 days of versions | `${BACKUP_BUCKET}/minio/` |
| `.env` snapshot (encrypted) | On change (manual) | Forever | Password manager / secrets vault — **never the same bucket as the data** |

The hourly cadence is the right answer for this workload — invoices and
contracts cluster around business hours, and an hour of lost work is the
worst-case data-loss window most clients will tolerate. Promote to
15-minute WAL streaming if a customer demands a tighter RPO.
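The "7 days hourly + 30 days daily" policy needs a prune pass that the scripts below don't ship. A minimal sketch of the classification half, assuming the `pg/<host>/<date>/<hour>.dump.gz` layout from the table above and GNU `date`; the commented `mc rm` loop is illustrative only:

```shell
#!/usr/bin/env bash
# Decide keep/delete for one dump path under the hourly/daily policy:
# <=7 days old: keep every hourly; 8-30 days: keep only the 00:00 dump;
# >30 days: delete (cold-storage transitions are the lifecycle rule's job).
classify_dump() {  # $1 = path like pg/<host>/YYYY-MM-DD/HH.dump.gz, $2 = today (UTC)
  local path=$1 today=$2 day hour age_days
  day=$(basename "$(dirname "$path")")
  hour=$(basename "$path"); hour=${hour%%.*}
  age_days=$(( ( $(date -u -d "$today" +%s) - $(date -u -d "$day" +%s) ) / 86400 ))
  if (( age_days <= 7 )); then
    echo keep
  elif (( age_days <= 30 )); then
    if [[ "$hour" == "00" ]]; then echo keep; else echo delete; fi
  else
    echo delete
  fi
}

# Prune loop sketch (dry run — swap `echo` for `mc rm` to act):
#   mc ls --recursive "bk/${BACKUP_S3_BUCKET}/pg/${HOST}/" | awk '{print $NF}' |
#     while read -r p; do
#       [[ $(classify_dump "$p" "$(date -u +%F)") == delete ]] && echo "would rm $p"
#     done
```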

## Required environment variables

The scripts below read these. Store them in a CI secret store, not the
host's bash profile.

```
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm

# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files

# Backup destination (a *separate* MinIO/S3 endpoint, or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>

# Optional: GPG-encrypts dumps at rest to this recipient's public key.
# Narrows the blast radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
```
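A preflight sketch for checking that a host has the full set before the cron jobs first fire. It just mirrors the `: "${VAR:?}"` guards inside the scripts (but reports every missing variable at once), so it's a convenience, not a requirement:

```shell
#!/usr/bin/env bash
# List every missing variable instead of failing on the first one.
preflight() {
  local v missing=0
  for v in "$@"; do
    if [ -z "${!v:-}" ]; then
      echo "missing: $v" >&2
      missing=1
    fi
  done
  return $missing
}

# Usage on a freshly provisioned host:
#   preflight DATABASE_URL MINIO_ENDPOINT MINIO_ACCESS_KEY MINIO_SECRET_KEY \
#             MINIO_BUCKET BACKUP_S3_ENDPOINT BACKUP_S3_BUCKET \
#             BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY || echo "fix .env first"
```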

## Provisioning the backup destination

1. Create a dedicated S3-compatible bucket in a **different account** from
   the live infra. AWS S3, Backblaze B2, or a separately-credentialed
   MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup
   write key still can't permanently delete history.
3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`, and
   `s3:ListBucket` on this bucket only. Inject them as the `BACKUP_S3_*`
   variables above. Do not reuse the live `MINIO_*` keys.
4. Set a 90-day lifecycle rule that transitions objects older than 30
   days to cold storage and deletes them at 90 days. Past 90 days it's
   cheaper to restart from a snapshot taken outside the system.
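Step 4 can be written down as an S3 lifecycle configuration. A sketch assuming AWS S3 as the destination — the bucket name matches the env example above, and the `GLACIER` storage class is an assumption (substitute your provider's cold tier):

```shell
#!/usr/bin/env bash
# Emit the lifecycle JSON for step 4: transition at 30 days, expire at 90.
cat > /tmp/backup-lifecycle.json <<'JSON'
{
  "Rules": [{
    "ID": "backup-retention",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
    "Expiration": { "Days": 90 }
  }]
}
JSON

# Apply (assumes AWS CLI credentials for the backup account):
#   aws s3api put-bucket-versioning --bucket portnimara-backups-prod \
#     --versioning-configuration Status=Enabled
#   aws s3api put-bucket-lifecycle-configuration --bucket portnimara-backups-prod \
#     --lifecycle-configuration file:///tmp/backup-lifecycle.json
```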

## The scripts

Three scripts in `scripts/backup/`:

- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
- `restore.sh` — interactive restore (DB + MinIO) given a snapshot path

Make them executable and wire them into cron / GitHub Actions / your
scheduler of choice. Sample crontab on the worker host:

```cron
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1

# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1

# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
```
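The cron lines above append to their logs forever. A logrotate fragment keeps them bounded — the paths match the crontab; the weekly cadence and eight-rotation retention are assumptions, not something the scripts require:

```
# /etc/logrotate.d/pncrm-backup (sketch)
/var/log/pncrm-backup.log /var/log/pncrm-restore-drill.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```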

## Restoring from cold

These steps have been rehearsed against the dev environment; expect them
to take 15–30 minutes for a typical port. **The drill (last cron line
above) keeps the runbook honest — if the drill fails, the real restore
will too.**

### 0. Stop everything that writes

```bash
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
```

### 1. Restore PostgreSQL

```bash
# Find the dump you want. Prefer the most recent successful hour.
# (Assumes an mc alias `bk` for the backup destination has been configured,
# the way restore.sh does with `mc alias set`.)
mc ls "bk/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"

# Pull it.
mc cp "bk/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/

# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz

# Drop & recreate the database. You can't drop the DB you're connected to,
# so connect to the `postgres` maintenance database with the same creds.
# FK ordering (e.g. the RESTRICT FK from gdpr_exports.requested_by to user)
# is no concern here — pg_restore loads in dependency order.
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"
```

### 2. Restore MinIO

```bash
# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots. (`bk` and `live` are the
# mc aliases for the backup and live endpoints.)
mc mirror --overwrite \
  "bk/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"
```

### 3. Restore secrets

The `.env` file is **not** in object storage. Pull it from the password
manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
when the database was last running — if it doesn't, rows in
`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button returns an opaque error. There is no way to
recover the old values; the keys must be re-entered through the admin UI.

### 4. Bring services back up

```bash
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
```

### 5. Verify

Work through the smoke checklist, in order:

1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
   matches the producer-side count from the snapshot's hour.
2. **MinIO up** — open any client with attachments in the CRM, click a
   receipt thumbnail; verify the signed URL serves the file.
3. **Documenso webhooks** — re-trigger one in the Documenso admin and
   confirm `audit_logs` records the receipt.
4. **Email** — send a portal invite to a real address.
5. **Realtime** — open two browser windows, edit a client in one, watch
   the other update via Socket.IO.
6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;` is
   non-empty if AI was being used. Old rows survive the restore; the
   budget gates still reset at the normal month-rollover boundary.

## Drill schedule

The weekly drill (cron line above) runs `restore.sh --drill` against a
throwaway database and a sandbox MinIO bucket. It must produce zero diff
between the restored row counts and the live row counts (modulo writes
that land during the hour or so the drill takes to run).

Failure modes the drill catches before they bite production:

- New schemas that `pg_dump` doesn't capture (we use the default
  invocation, which dumps everything in `public` — but a future developer
  adding a `tenant_X` schema will silently lose it).
- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
  on certain prefixes.
- A GPG key rotation that wasn't propagated to the restore host.
- A `pg_restore` version skew against the producer-side `pg_dump`.
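The last bullet is cheap to check proactively. A sketch that extracts the major version from the client tools' `--version` banners, so the restore host can assert parity with the producer before it matters (recording the producer's major at backup time is left to the operator):

```shell
#!/usr/bin/env bash
# Pull the major version out of e.g. "pg_restore (PostgreSQL) 16.2".
pg_major() { sed -E 's/.* ([0-9]+)(\.[0-9]+.*)?$/\1/'; }

# Usage on the restore host:
#   [ "$(pg_restore --version | pg_major)" = "$PRODUCER_PG_MAJOR" ] \
#     || echo "version skew — restore may fail"
```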

docs/runbooks/email-deliverability.md (new file, 186 lines)

# Email deliverability runbook

The CRM sends transactional email through three different surfaces. Each
has a different failure mode when it lands in spam. This runbook covers
how to diagnose, fix, and verify each path.

## What email the CRM sends

| Surface | Trigger | Template | Default `from` |
| ------- | ------- | -------- | -------------- |
| Portal activation / password-reset | Admin invites a client to the portal | `src/lib/email/templates/portal-auth.ts` | per-port `email_settings.from_address` or `SMTP_FROM` |
| Inquiry confirmation + sales notification | Public website POSTs to `/api/public/interests` or `/api/public/residential-inquiries` | `inquiry-client-confirmation.ts`, `inquiry-sales-notification.ts` | same |
| GDPR export ready | Staff requests an export with `emailToClient=true` | inline in `gdpr-export.service.ts` | same |
| Documenso reminders | Cadence job fires for an unsigned signer | `documenso/reminders/*` | same |

Documenso _itself_ sends signing requests from its own `from` address —
those don't flow through this codebase. SPF/DKIM for the Documenso
sender is the Documenso operator's problem, not yours.

## DNS records

For every domain that appears in a `from:` header you must publish:

### 1. SPF

A single TXT record at the apex authorizing whichever provider is
sending. Multiple SPF records on the same name **break SPF entirely** —
combine them into one.

```
v=spf1 include:_spf.google.com include:amazonses.com -all
```

The `-all` (hardfail) is correct for transactional mail. Switch to `~all`
(softfail) only as a temporary diagnostic when migrating providers.

### 2. DKIM

Each provider publishes its own selector. Common shapes:

- Google Workspace: `google._domainkey` → 2048-bit RSA pubkey (rotate every 12 months).
- Amazon SES: `xxxx._domainkey`, `yyyy._domainkey`, `zzzz._domainkey` (the three CNAMEs SES gives you).
- Postmark / Resend / Mailgun: one CNAME per selector.

Verify alignment — the `d=` value in the DKIM signature must match the
`From:` domain (relaxed alignment is fine; strict is overkill).
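A rough relaxed-alignment check for eyeballing a failing message's headers. This sketch is a deliberate oversimplification: it treats the organizational domain as the last two labels, which is wrong for ccTLD registries like `.co.uk` — a real check needs the public-suffix list:

```shell
#!/usr/bin/env bash
# Relaxed DKIM alignment: the d= and From: domains share an organizational
# domain. "Last two labels" stands in for the real public-suffix logic.
org_domain() { awk -F. '{ if (NF >= 2) print $(NF-1)"."$NF; else print $0 }'; }
dkim_aligned() {  # $1 = DKIM d= domain, $2 = From: domain
  [ "$(echo "$1" | org_domain)" = "$(echo "$2" | org_domain)" ]
}
```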

### 3. DMARC

Start at `p=none` while you build deliverability data, then upgrade.

```
_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"
```

`rua` (aggregate reports) is the diagnostic feed — set it before the
first send so the first weekly report has data.
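Both records can be sanity-checked from the command line. A sketch of two validators that read `dig +short TXT …` output from stdin — running `dig` against the right domains is up to the operator:

```shell
#!/usr/bin/env bash
# Feed `dig +short TXT <domain>` into these. SPF must be exactly one
# record (more than one breaks SPF); DMARC just has to exist at
# _dmarc.<domain>.
spf_count() { grep -c 'v=spf1'; }
has_dmarc() { grep -q 'v=DMARC1'; }

# Usage:
#   n=$(dig +short TXT portnimara.com | spf_count)
#   [ "$n" -eq 1 ] || echo "SPF broken: $n records"
#   dig +short TXT _dmarc.portnimara.com | has_dmarc || echo "no DMARC record"
```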

### 4. MX (only if you also receive)

The CRM's IMAP probe (`scripts/dev-imap-probe.ts`) and the inbound thread
sync rely on a real mailbox. Whoever runs that mailbox publishes the MX
records — typically Google Workspace or a dedicated provider. Don't add
an MX pointing at the CRM host; it doesn't accept inbound SMTP.

## Per-port overrides

Each port can override `from_address`, `from_name`, and SMTP creds via
the admin email-settings page. When set, `getPortEmailConfig()` returns
those values and `sendEmail()` uses them in preference to the global
`SMTP_*` env. **The override domain still needs SPF / DKIM / DMARC** on
its own DNS — without them, every send from that port lands in spam.

When a customer reports "our portal invite didn't arrive":

1. Pull the port's email settings from the admin UI. Check `from_address`.
2. Run `dig TXT <from-domain>` and `dig TXT _dmarc.<from-domain>`.
   Confirm SPF includes the SMTP provider's domain and DMARC exists.
3. Send a probe to mail-tester.com: trigger a test send to the address it
   gives you, then read the score breakdown.
4. If the score is below 8/10, fix whatever it flags before doing anything
   else in this runbook.

## Diagnosing a "didn't arrive" report

Order matters — work top-down and stop when one of these is the answer.

### Step 1: Was the send attempted?

```bash
# Tail the worker logs for the recipient address.
docker compose logs worker | grep '<recipient>'
```

You'll see one of three patterns:

- **Nothing**: the job didn't run. Check that BullMQ actually queued it.
  `redis-cli LLEN bull:email:waiting` — if non-zero, the worker is dead.
  `docker compose logs scheduler | tail` to see why.
- **`Email sent`** with a message-id: the provider accepted it. Move to
  Step 2.
- **`SendError`**: the provider rejected it. The error string says why
  (auth, rate limit, blocked recipient).

### Step 2: Is `EMAIL_REDIRECT_TO` set?

In dev/test we set `EMAIL_REDIRECT_TO=ops@portnimara.com` so seeded fake
clients don't get real email. **It must be unset in production.**

```bash
# On the production host:
docker exec pncrm-web printenv EMAIL_REDIRECT_TO
# Should print nothing.
```

If it's set, every email is going to the redirect target with the
original recipient prefixed in the subject — the customer never sees it.

### Step 3: Did it land but get filtered?

Ask the recipient to check:

- Spam / Junk folder
- Gmail "Promotions" tab
- Outlook "Other" folder (vs Focused)
- The Quarantine console if they're on M365 with anti-spam enabled

If it's in a spam folder, the email arrived; the recipient's filter
classified it. SPF/DKIM/DMARC alignment is suspect — re-run the
mail-tester probe from above.

### Step 4: Was the recipient on a suppression list?

Some providers (SES, Postmark) maintain a suppression list — once an
address hard-bounces, future sends to it are dropped silently.

```bash
# SES example:
aws sesv2 list-suppressed-destinations --region eu-west-1
```

If the recipient is suppressed, remove them and ask them to retry. The
CRM doesn't track suppression locally; that's the provider's job.

## When migrating SMTP providers

1. Add the new provider's DKIM CNAMEs alongside the old ones.
2. Add the new provider's `include:` to the existing SPF record.
3. Wait 48 hours for DNS to propagate and for DMARC reports to confirm
   both providers align.
4. Switch the `SMTP_*` env to the new provider on a single staging host.
5. Send through the staging host for a week. Watch the DMARC reports.
6. Cut production over.
7. Wait two weeks before removing the old provider's DNS — undelivered
   bounce reports keep arriving for a while.
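Step 2 is easy to fumble by publishing a second TXT record, which breaks SPF entirely (see the DNS section). A sketch of safe string surgery on the single existing record — it assumes the record ends in `-all`, like the example earlier in this runbook:

```shell
#!/usr/bin/env bash
# Splice a new include: into an existing SPF record, idempotently,
# keeping a single record with the trailing -all intact.
spf_add_include() {  # $1 = current record, $2 = new include domain
  local rec=$1 inc=$2
  case "$rec" in
    *"include:$inc"*) echo "$rec" ;;    # already present, nothing to do
    *" -all") echo "${rec% -all} include:$inc -all" ;;
    *) echo "record does not end in -all" >&2; return 1 ;;
  esac
}
```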

## Testing a deliverability fix

There's no automated test for "did this email reach the inbox" — that's a
property of the recipient's filter, which we don't control. The closest
proxy is the realapi suite:

```bash
pnpm exec playwright test --project=realapi
```

It runs `tests/e2e/realapi/portal-imap-activation.spec.ts`, which sends a
real portal-invite email through SMTP, then polls the configured IMAP
mailbox for the activation link. If the link appears within 30 seconds,
the SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out,
work backwards through this runbook.

The realapi suite needs the `SMTP_*` and `IMAP_*` env vars — see the
"Optional dev/test-only env vars" block in `CLAUDE.md`.

## Bounce handling

The CRM doesn't currently process bounces. If you start seeing volume:

- Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook
  URL) to POST bounce events to a new `/api/webhooks/email-bounce` route.
- Persist the bounced address into an `email_suppressions` table.
- Have `sendEmail()` consult that table before each send.

That work isn't in scope yet; this runbook just flags it as the next
deliverability gap.

scripts/backup/minio-mirror.sh (new file, 51 lines)

```bash
#!/usr/bin/env bash
# Hourly MinIO mirror for Port Nimara CRM.
#
# Mirrors the live `MINIO_BUCKET` to the backup destination. `mc mirror`
# is incremental — only changed objects transfer — so this is cheap.
#
# Versioning on the destination bucket is what protects against object
# deletes / overwrites; we don't try to roll our own.

set -euo pipefail

: "${MINIO_ENDPOINT:?MINIO_ENDPOINT not set}"
: "${MINIO_ACCESS_KEY:?MINIO_ACCESS_KEY not set}"
: "${MINIO_SECRET_KEY:?MINIO_SECRET_KEY not set}"
: "${MINIO_BUCKET:?MINIO_BUCKET not set}"
: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

# Scheme: live MinIO is plain HTTP unless MINIO_USE_SSL=true.
if [[ "${MINIO_USE_SSL:-false}" == "true" ]]; then
  LIVE_URL="https://${MINIO_ENDPOINT}:${MINIO_PORT:-443}"
else
  LIVE_URL="http://${MINIO_ENDPOINT}:${MINIO_PORT:-9000}"
fi

LIVE_ALIAS="live-$$"
BACKUP_ALIAS="bk-$$"
trap 'mc alias remove "$LIVE_ALIAS" 2>/dev/null || true; mc alias remove "$BACKUP_ALIAS" 2>/dev/null || true' EXIT

mc alias set "$LIVE_ALIAS" "$LIVE_URL" \
  "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api S3v4 >/dev/null
mc alias set "$BACKUP_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" --api S3v4 >/dev/null

SOURCE="${LIVE_ALIAS}/${MINIO_BUCKET}/"
DEST="${BACKUP_ALIAS}/${BACKUP_S3_BUCKET}/minio/"

echo "[$(date -u +%FT%TZ)] Mirroring $SOURCE → $DEST"

# `--remove` would delete objects from the destination that no longer
# exist in source — we DON'T pass it, because that would let an
# accidental delete on the live bucket cascade into permanent loss on
# the backup side. Versioning + lifecycle handle stale-object cleanup.
mc mirror --quiet --overwrite "$SOURCE" "$DEST"

# Print a byte / object-count summary for the operator.
echo "[$(date -u +%FT%TZ)] Done. Destination summary:"
mc du "$DEST"
```

scripts/backup/pg-backup.sh (new file, 63 lines)

```bash
#!/usr/bin/env bash
# Hourly PostgreSQL backup for Port Nimara CRM.
#
# Reads DATABASE_URL and BACKUP_S3_* from the environment. Dumps to a
# tmpfile, gzips, optionally GPG-encrypts to BACKUP_GPG_RECIPIENT, and
# uploads to s3://${BACKUP_S3_BUCKET}/pg/<hostname>/<UTC-date>/<hour>.dump.gz[.gpg].
#
# Designed to fail loud: any non-zero exit halts the script and propagates
# to the cron / CI runner so the operator sees the failure.

set -euo pipefail

: "${DATABASE_URL:?DATABASE_URL not set}"
: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

HOST="${BACKUP_HOST_OVERRIDE:-$(hostname -s)}"
DATE_UTC="$(date -u +%Y-%m-%d)"
HOUR_UTC="$(date -u +%H)"
WORKDIR="$(mktemp -d)"
trap 'rm -rf "$WORKDIR"' EXIT

DUMP_FILE="$WORKDIR/${HOUR_UTC}.dump"
ARCHIVE_NAME="${HOUR_UTC}.dump.gz"

echo "[$(date -u +%FT%TZ)] Dumping $DATABASE_URL → $DUMP_FILE"
pg_dump --format=custom --compress=9 --no-owner --no-privileges \
  --file="$DUMP_FILE" "$DATABASE_URL"

# pg_dump's `custom` format is already compressed, but we wrap it in gzip
# so the artifact name is uniform regardless of the dump format on disk.
gzip -n "$DUMP_FILE"
GZ_FILE="${DUMP_FILE}.gz"

# Optional GPG layer. Only encrypt if a recipient is configured.
if [[ -n "${BACKUP_GPG_RECIPIENT:-}" ]]; then
  echo "[$(date -u +%FT%TZ)] Encrypting for $BACKUP_GPG_RECIPIENT"
  gpg --batch --yes --trust-model always \
    --recipient "$BACKUP_GPG_RECIPIENT" \
    --encrypt --output "${GZ_FILE}.gpg" "$GZ_FILE"
  rm "$GZ_FILE"
  GZ_FILE="${GZ_FILE}.gpg"
  ARCHIVE_NAME="${ARCHIVE_NAME}.gpg"
fi

# Configure the mc client for the backup destination. The alias is removed
# by the trap even if the upload fails partway.
MC_ALIAS="bk-$$"
trap 'rm -rf "$WORKDIR"; mc alias remove "$MC_ALIAS" 2>/dev/null || true' EXIT
mc alias set "$MC_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" \
  --api S3v4 >/dev/null

REMOTE_PATH="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${DATE_UTC}/${ARCHIVE_NAME}"
echo "[$(date -u +%FT%TZ)] Uploading → $REMOTE_PATH"
mc cp --quiet "$GZ_FILE" "$REMOTE_PATH"

# Tag with retention metadata so lifecycle rules can decide what to expire.
mc tag set "$REMOTE_PATH" "kind=hourly&host=${HOST}&date=${DATE_UTC}" >/dev/null

echo "[$(date -u +%FT%TZ)] OK ${ARCHIVE_NAME} ($(du -h "$GZ_FILE" | cut -f1))"
```

scripts/backup/restore.sh (new file, 121 lines)

```bash
#!/usr/bin/env bash
# Cold-restore script for Port Nimara CRM.
#
# Two modes:
#   --drill       Restore to a sandbox DB ($DRILL_DATABASE_URL) + a tagged
#                 sandbox path on the live MinIO bucket. Used by the weekly
#                 cron drill so the runbook stays accurate.
#   (no --drill)  Interactive production restore. Asks for confirmation,
#                 then drops and recreates the target database.
#
# Common args:
#   --snapshot YYYY-MM-DD/HH   Specific dump to restore. Defaults to "latest".

set -euo pipefail

DRILL=0
SNAPSHOT="latest"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --drill) DRILL=1; shift ;;
    --snapshot) SNAPSHOT="$2"; shift 2 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
done

: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

if [[ "$DRILL" -eq 1 ]]; then
  : "${DRILL_DATABASE_URL:?DRILL_DATABASE_URL not set}"
  TARGET_DB="$DRILL_DATABASE_URL"
  echo "[drill] target DB = $TARGET_DB"
else
  : "${DATABASE_URL:?DATABASE_URL not set}"
  TARGET_DB="$DATABASE_URL"
  read -rp "About to overwrite $TARGET_DB. Type 'restore' to continue: " confirm
  [[ "$confirm" == "restore" ]] || { echo "aborted"; exit 1; }
fi

HOST="${BACKUP_HOST_OVERRIDE:-$(hostname -s)}"
WORKDIR="$(mktemp -d)"
MC_ALIAS="bk-$$"
trap 'rm -rf "$WORKDIR"; mc alias remove "$MC_ALIAS" 2>/dev/null || true' EXIT
mc alias set "$MC_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" --api S3v4 >/dev/null

# Resolve the snapshot path.
if [[ "$SNAPSHOT" == "latest" ]]; then
  REMOTE=$(mc ls --recursive "${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/" \
    | awk '{print $NF}' | sort | tail -1)
  if [[ -z "$REMOTE" ]]; then
    echo "no snapshots found under ${BACKUP_S3_BUCKET}/pg/${HOST}/" >&2
    exit 1
  fi
  REMOTE="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${REMOTE}"
else
  REMOTE="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${SNAPSHOT}.dump.gz"
  # If GPG was used, the file lives at .dump.gz.gpg. Try both.
  if ! mc stat "$REMOTE" >/dev/null 2>&1; then
    REMOTE="${REMOTE}.gpg"
  fi
fi

echo "[$(date -u +%FT%TZ)] Pulling $REMOTE"
LOCAL="$WORKDIR/$(basename "$REMOTE")"
mc cp --quiet "$REMOTE" "$LOCAL"

# Decrypt if needed.
if [[ "$LOCAL" == *.gpg ]]; then
  echo "[$(date -u +%FT%TZ)] Decrypting"
  gpg --batch --yes --decrypt --output "${LOCAL%.gpg}" "$LOCAL"
  rm "$LOCAL"
  LOCAL="${LOCAL%.gpg}"
fi

# Decompress.
gunzip "$LOCAL"
LOCAL="${LOCAL%.gz}"

echo "[$(date -u +%FT%TZ)] Restoring into $TARGET_DB"

# Drop & recreate to guarantee no half-state from a prior run. Use the
# `postgres` maintenance DB — we can't drop the DB we're connected to.
DB_NAME=$(echo "$TARGET_DB" | sed -E 's|.*/([^?]+).*|\1|')
ADMIN_URL=$(echo "$TARGET_DB" | sed -E "s|/${DB_NAME}|/postgres|")

psql "$ADMIN_URL" -v ON_ERROR_STOP=1 <<SQL
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE datname = '${DB_NAME}' AND pid <> pg_backend_pid();
DROP DATABASE IF EXISTS "${DB_NAME}";
CREATE DATABASE "${DB_NAME}";
SQL

pg_restore --no-owner --no-privileges --dbname "$TARGET_DB" "$LOCAL"

# Drill mode: compare row counts vs the live producer for parity.
if [[ "$DRILL" -eq 1 ]]; then
  echo "[$(date -u +%FT%TZ)] Drill row-count diff (live vs restored):"
  TABLES=$(psql -At "$TARGET_DB" -c \
    "SELECT tablename FROM pg_tables WHERE schemaname='public' ORDER BY tablename;")
  diff_count=0
  while IFS= read -r tbl; do
    [[ -z "$tbl" ]] && continue
    live=$(psql -At "${LIVE_DATABASE_URL:-$DATABASE_URL}" -c "SELECT count(*) FROM \"$tbl\";")
    restored=$(psql -At "$TARGET_DB" -c "SELECT count(*) FROM \"$tbl\";")
    delta=$((live - restored))
    if [[ "$delta" -ne 0 ]]; then
      echo "  ⚠ $tbl: live=$live restored=$restored delta=$delta"
      diff_count=$((diff_count + 1))
    fi
  done <<< "$TABLES"
  if [[ "$diff_count" -eq 0 ]]; then
    echo "  ✓ row counts match across all tables"
  fi
fi

echo "[$(date -u +%FT%TZ)] Restore complete."
```