Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.
Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
(Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
hourly MinIO mirror, 7-day hourly + 30-day daily retention),
cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
that runs against a sandbox DB and diffs row counts
Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
records (SPF/DKIM/DMARC/MX), per-port override implications,
diagnosis flow ("didn't arrive" → 4-step checklist starting with
EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
end-to-end probe
Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backup and restore runbook
This runbook documents what gets backed up, how often, where it lands, and the exact commands to restore the system from a cold start. The goal is that any operator who has the off-site backup credentials can bring the CRM back up on a clean host without help.
Scope of a "full backup"
The CRM has three stateful surfaces. All three must be captured for a restore to be useful.
| Surface | Holds | Risk if missing |
|---|---|---|
| PostgreSQL (port_nimara_crm) | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss; the site is unrecoverable. |
| MinIO bucket (MINIO_BUCKET, default crm-files) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments. | Files reachable by row references in Postgres become 404s. |
| .env + secrets | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (ENCRYPTION_KEY). | OCR API keys re-resolve from system_settings (encrypted at rest), but without the original ENCRYPTION_KEY they're unreadable. |
The Redis instance is not backed up. It only holds queue state, rate-limit counters, and Socket.IO presence — all reconstructable. Stop the workers during a restore so the queue starts clean.
Backup schedule
Defaults are tuned for a single-port deployment with O(10k) clients. Raise the frequency on the producer side as scale demands.
| Job | Frequency | Retention | Where |
|---|---|---|---|
| pg_dump (custom format, gzipped) | Hourly | 7 days hourly + 30 days daily | ${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz |
| MinIO mirror | Hourly (incremental) | 30 days of versions | ${BACKUP_S3_BUCKET}/minio/ |
| .env snapshot (encrypted) | On change (manual) | Forever | Password manager / secrets vault; never the same bucket as data |
The hourly cadence is the right answer for this workload — invoices and contracts cluster around business hours, and an hour of lost work is the worst-case data loss window most clients will tolerate. Promote to 15-min WAL streaming if a customer demands tighter RPO.
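The retention column can be read as a pruning rule. The sketch below is one possible reading, not the shipped logic: keep_snapshot is a hypothetical helper, and it assumes the dump taken at 00:00 UTC is the one kept as the "daily" copy, which the runbook does not specify.

```shell
#!/usr/bin/env bash
# keep_snapshot AGE_HOURS HOUR_OF_DAY
# Returns 0 (keep) or 1 (prune) under the "7 days hourly + 30 days daily" rule.
# Assumption: the dump taken at hour 00 UTC is the one retained as "daily".
keep_snapshot() {
  local age_hours=$1 hour=$2
  # Force base 10 so a zero-padded hour such as "08" is not parsed as octal.
  hour=$((10#$hour))
  if (( age_hours <= 7 * 24 )); then
    return 0                      # inside the hourly window: keep everything
  fi
  if (( age_hours <= 30 * 24 && hour == 0 )); then
    return 0                      # inside the daily window: keep the 00:00 dump
  fi
  return 1                        # older than 30 days, or a non-daily hour
}
```

A pruning job could call this once per object, passing the object's age in hours and the hour encoded in its file name.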
Required environment variables
The scripts below read these. Store them in a CI secret store, not the host's bash profile.
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm
# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files
# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>
# Optional: encrypts dumps at rest for this GPG recipient. Limits the blast
# radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
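A preflight check is one way to make a script fail loudly when a variable from the list above is missing. This is a minimal sketch assuming bash; missing_env is a hypothetical helper and is not known to exist in the shipped scripts.

```shell
#!/usr/bin/env bash
# missing_env NAME...
# Prints the names of variables that are unset or empty, one per line,
# and returns non-zero if any are missing.
missing_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named $name.
    if [ -z "${!name:-}" ]; then
      printf '%s\n' "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Possible use at the top of a backup script (names taken from the list above):
#   missing_env DATABASE_URL MINIO_ENDPOINT MINIO_ACCESS_KEY MINIO_SECRET_KEY \
#               MINIO_BUCKET BACKUP_S3_ENDPOINT BACKUP_S3_BUCKET \
#               BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY >&2 || exit 1
```

BACKUP_GPG_RECIPIENT is left out of the example call because it is optional.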
Provisioning the backup destination
- Create a dedicated S3-compatible bucket in a different account from the live infra. AWS S3, Backblaze B2, or a separately-credentialed MinIO instance all work.
- Apply object-lock or versioning so an attacker who steals the backup write key still can't permanently delete history.
- Generate IAM credentials scoped to s3:PutObject, s3:GetObject, s3:ListBucket on this bucket only. Inject them as BACKUP_S3_* above. Do not reuse the live MINIO_* keys.
- Set a 90-day lifecycle rule that transitions objects older than 30 days to cold storage and deletes them at 90 days. Past 90 days it's cheaper to restart from a snapshot taken outside the system.
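The lifecycle rule can be written down as an S3 lifecycle configuration. The sketch below assumes AWS S3 as the destination; the rule ID is illustrative, and GLACIER stands in for whichever cold-storage class the chosen provider offers.

```shell
#!/usr/bin/env bash
# lifecycle_json: emits an S3 lifecycle configuration for the 30/90-day rule.
# Assumptions: the rule ID is made up, and GLACIER is a placeholder for the
# provider's cold-storage class.
lifecycle_json() {
  cat <<'JSON'
{
  "Rules": [
    {
      "ID": "backup-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 90 }
    }
  ]
}
JSON
}

# On AWS S3 this could be applied with:
#   lifecycle_json > /tmp/lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket "$BACKUP_S3_BUCKET" \
#     --lifecycle-configuration file:///tmp/lifecycle.json
```

On a versioned bucket, Expiration only affects current object versions; noncurrent versions need their own NoncurrentVersionExpiration rule.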
The scripts
Three scripts in scripts/backup/:
- pg-backup.sh: runs pg_dump, gzips, optionally GPG-encrypts, uploads
- minio-mirror.sh: mc mirror of the live bucket → backup bucket
- restore.sh: interactive restore (DB + MinIO) given a snapshot path
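For orientation, the dump, gzip, optional GPG, upload pipeline might look roughly like the sketch below. It is not the contents of pg-backup.sh: the function names are hypothetical, the mc alias "backup" is assumed to point at BACKUP_S3_ENDPOINT, the .gpg suffix on encrypted objects is an assumption, and the date call relies on GNU date.

```shell
#!/usr/bin/env bash
set -euo pipefail

# snapshot_key HOST [EPOCH_SECONDS]
# Builds the object key pg/<host>/<UTC date>/<hour>.dump.gz.
# EPOCH_SECONDS defaults to now; "date -d @N" requires GNU date.
snapshot_key() {
  local host=$1 epoch=${2:-$(date -u +%s)}
  printf 'pg/%s/%s.dump.gz\n' "$host" "$(date -u -d "@$epoch" +%Y-%m-%d/%H)"
}

# pg_backup: dump, compress, optionally encrypt, upload.
# Assumption: an mc alias named "backup" points at BACKUP_S3_ENDPOINT.
pg_backup() {
  local key tmp
  key=$(snapshot_key "$(hostname)")
  tmp=$(mktemp)
  # pipefail makes a pg_dump failure abort the run instead of uploading a
  # truncated file.
  pg_dump --format=custom "$DATABASE_URL" | gzip > "$tmp"
  if [ -n "${BACKUP_GPG_RECIPIENT:-}" ]; then
    gpg --batch --yes --encrypt --recipient "$BACKUP_GPG_RECIPIENT" \
        --output "$tmp.gpg" "$tmp"
    mv "$tmp.gpg" "$tmp"
    key="$key.gpg"              # assumed naming for encrypted objects
  fi
  mc cp "$tmp" "backup/$BACKUP_S3_BUCKET/$key"
  rm -f "$tmp"
}
```

Keeping the key builder separate from the upload makes the path layout easy to check without touching the database or the bucket.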
Make them executable and wire them into cron / GitHub Actions / your scheduler of choice. Sample crontab on the worker host:
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1
# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1
# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
Restoring from cold
These steps have been rehearsed against the dev environment; expect them to take 15–30 minutes for a typical port. The drill (last cron line above) ensures the runbook stays correct — if the drill fails, the real restore will too.
0. Stop everything that writes
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
1. Restore PostgreSQL
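# Find the dump you want. Prefer the most recent successful hour.
# "backup" is an mc alias for BACKUP_S3_ENDPOINT (assumed name; without an
# alias prefix mc treats the argument as a local path).
mc ls "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"
# Pull it.
mc cp "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/
# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side (SNAPSHOT then
# needs the .gpg suffix this command expects); skip otherwise.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz
# Drop & recreate the database. Connect to the "postgres" maintenance database:
# PostgreSQL refuses to drop the database the session is connected to.
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
# pg_restore orders data and constraints itself, so the RESTRICT FK from
# gdpr_exports.requested_by to user needs no manual handling.
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"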
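Because encryption is optional, the decrypt step only applies to some dumps. A small wrapper can make that branch explicit. The sketch assumes encrypted dumps carry a .gpg suffix, as the gpg command in this step implies; plain_name and decrypt_if_needed are hypothetical names.

```shell
#!/usr/bin/env bash
# plain_name FILE: the file name once any trailing .gpg suffix is removed.
plain_name() {
  printf '%s\n' "${1%.gpg}"
}

# decrypt_if_needed FILE
# Decrypts FILE when it ends in .gpg and prints the path of the usable dump.
# Assumption: encrypted dumps are stored with a .gpg suffix.
decrypt_if_needed() {
  local file=$1 out
  out=$(plain_name "$file")
  if [ "$file" != "$out" ]; then
    gpg --batch --yes --output "$out" --decrypt "$file" || return 1
  fi
  printf '%s\n' "$out"
}

# Possible use:
#   DUMP=$(decrypt_if_needed "/tmp/$(basename "$SNAPSHOT")")
#   gunzip -c "$DUMP" | pg_restore --no-owner --no-privileges --dbname "$DATABASE_URL"
```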
2. Restore MinIO
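# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots. "backup" and "live" are mc
# aliases; "backup" is an assumed name for the BACKUP_S3_ENDPOINT alias.
mc mirror --overwrite \
  "backup/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"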
3. Restore secrets
The .env file is not in object storage. Pull it from the password
manager / secrets vault. Verify ENCRYPTION_KEY matches the value used
when the database was last running — if it doesn't, rows in
system_settings (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button will return an opaque error. There is no
recovery path; the keys must be re-entered through the admin UI.
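One way to verify the key without exposing it is to compare a short fingerprint. This is a sketch under an assumption the runbook does not state: that a fingerprint of ENCRYPTION_KEY was recorded (for example in the vault entry) while the system was healthy. key_fingerprint is a hypothetical helper.

```shell
#!/usr/bin/env bash
# key_fingerprint VALUE
# Prints the first 16 hex characters of the SHA-256 of VALUE, so two copies
# of a key can be compared without revealing it. Requires sha256sum (coreutils).
key_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-16
}

# Possible use after pulling .env from the vault:
#   [ "$(key_fingerprint "$ENCRYPTION_KEY")" = "$RECORDED_FINGERPRINT" ] \
#     || echo "ENCRYPTION_KEY does not match the recorded fingerprint" >&2
# RECORDED_FINGERPRINT is an assumed variable holding the stored value.
```

A mismatch here is cheaper to discover than a failed OCR connection test after the services are back up.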
4. Bring services back up
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
5. Verify
Work through the smoke checklist, in order:
- DB up: psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;' matches the producer-side count from the snapshot's hour.
- MinIO up: open any client with attachments in the CRM, click a receipt thumbnail; verify the signed URL serves the file.
- Documenso webhooks: re-trigger one in the Documenso admin and confirm audit_logs records the receipt.
- Email: send a portal invite to a real address.
- Realtime: open two browser windows, edit a client in one, watch the other update via Socket.IO.
- AI usage ledger: SELECT count(*) FROM ai_usage_ledger; non-empty if AI was being used. Old rows survive but the budget gates reset alongside the period boundary at month rollover.
Drill schedule
The weekly drill (cron line above) runs restore.sh --drill against a
throwaway database and a sandbox MinIO bucket. It must produce zero diff
between the restored row counts and the live row counts (modulo the
hour-or-so the drill takes to run).
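The row-count comparison can be sketched as a diff over two "table count" listings. This is an illustration, not the logic in restore.sh: diff_counts is a hypothetical helper, the input format is assumed, and it compares exactly, so it does not allow for rows written to the live database while the drill runs.

```shell
#!/usr/bin/env bash
# diff_counts LIVE_FILE RESTORED_FILE
# Each file holds "table count" pairs, one per line. Prints every table whose
# count differs or that is missing on one side; returns non-zero on any diff.
diff_counts() {
  awk '
    NR == FNR { live[$1] = $2; next }          # first file: live counts
    {
      seen[$1] = 1
      if (!($1 in live))       { print $1 ": missing in live"; bad = 1 }
      else if (live[$1] != $2) { print $1 ": live=" live[$1] " restored=" $2; bad = 1 }
    }
    END {
      for (t in live) if (!(t in seen)) { print t ": missing in restored"; bad = 1 }
      exit bad
    }
  ' "$1" "$2"
}

# One line of such a listing could be produced with:
#   psql -At -F ' ' "$DATABASE_URL" -c "SELECT 'clients', count(*) FROM clients"
```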
Failure modes the drill catches before they bite production:
- New tables that fall outside pg_dump's --schema=public scope (everything in public is captured, but a future developer adding a tenant_X schema will silently lose it).
- MinIO bucket-policy changes that block the backup-side s3:GetObject on certain prefixes.
- GPG passphrase rotation that wasn't propagated to the restore host.
- A pg_restore version skew with the producer-side pg_dump.
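The version-skew case can be checked up front by comparing major versions. The helpers below are hypothetical and deliberately stricter than necessary, since a newer pg_restore can generally read archives written by an older pg_dump. They assume the usual "--version" output of the form "pg_dump (PostgreSQL) 16.2".

```shell
#!/usr/bin/env bash
# pg_major VERSION_LINE
# Extracts the major version from output such as "pg_dump (PostgreSQL) 16.2".
pg_major() {
  printf '%s\n' "$1" | sed -E 's/^.*\) ([0-9]+).*$/\1/'
}

# same_major DUMP_LINE RESTORE_LINE: succeeds when both majors match.
same_major() {
  [ "$(pg_major "$1")" = "$(pg_major "$2")" ]
}

# Possible use, with the producer's "pg_dump --version" output saved next to
# the dump in an assumed variable PRODUCER_VERSION:
#   same_major "$PRODUCER_VERSION" "$(pg_restore --version)" \
#     || echo "pg_dump / pg_restore major versions differ" >&2
```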