# Backup and restore runbook

This runbook documents what gets backed up, how often, where it lands, and
the exact commands to restore the system from a cold start. The goal is
that any operator who has the off-site backup credentials can bring the
CRM back up on a clean host without help.

## Scope of a "full backup"

The CRM has three stateful surfaces. All three must be captured for a
restore to be useful.

| Surface | Holds | Risk if missing |
| --- | --- | --- |
| **PostgreSQL** (`port_nimara_crm`) | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable. |
| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments. | Files referenced by rows in Postgres become 404s. |
| **`.env` + secrets** | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`). | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |

The Redis instance is not backed up. It only holds queue state, rate-limit
counters, and Socket.IO presence — all reconstructable. Stop the workers
during a restore so the queue starts clean.

## Backup schedule

Defaults are tuned for a single-port deployment with O(10k) clients. Bump
the frequencies on the producing side as scale demands.

| Job | Frequency | Retention | Where |
| --- | --- | --- | --- |
| `pg_dump` (custom format, gzipped) | Hourly | 7 days hourly + 30 days daily | `${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz` |
| MinIO mirror | Hourly (incremental) | 30 days of versions | `${BACKUP_S3_BUCKET}/minio/` |
| `.env` snapshot (encrypted) | On change (manual) | Forever | Password manager / secrets vault — **never the same bucket as the data** |

The hourly cadence is the right answer for this workload — invoices and
contracts cluster around business hours, and an hour of lost work is the
worst-case data loss window most clients will tolerate. Promote to
15-minute WAL streaming if a customer demands a tighter RPO.

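The "7 days hourly + 30 days daily" retention rule can be expressed as a small prune predicate. A sketch under one assumption not stated above: the 00:00 UTC dump is the one promoted to a daily (`pg-backup.sh` may pick a different hour).

```shell
# keep_dump AGE_DAYS UTC_HOUR -> exit 0 to keep the dump, non-zero to prune it.
keep_dump() {
  local age_days=$1 hour=$2
  (( age_days <= 7 )) && return 0                    # keep every hourly for 7 days
  (( age_days <= 30 && 10#$hour == 0 )) && return 0  # then only the 00:00 daily for 30
  return 1                                           # everything older is pruned
}
```

For example, `keep_dump 10 00` keeps the dump while `keep_dump 10 14` prunes it. The `10#` prefix forces base-10 so hours like `08` aren't parsed as octal.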
## Required environment variables

The scripts below read these. Store them in a CI secret store, not the
host's bash profile.

```
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm

# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files

# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>

# Optional: GPG-encrypts dumps at rest to this recipient's key. Limits the
# blast radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
```

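A preflight check along these lines catches a missing variable before a backup run silently no-ops. The variable list mirrors the block above; `check_backup_env` itself is a hypothetical helper, not part of the shipped scripts.

```shell
# Variables every backup script depends on; extend as new scripts appear.
REQUIRED_BACKUP_VARS=(
  DATABASE_URL
  MINIO_ENDPOINT MINIO_ACCESS_KEY MINIO_SECRET_KEY MINIO_BUCKET
  BACKUP_S3_ENDPOINT BACKUP_S3_BUCKET BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY
)

# Prints each missing variable and returns non-zero if any are unset or empty.
check_backup_env() {
  local var missing=0
  for var in "${REQUIRED_BACKUP_VARS[@]}"; do
    if [[ -z ${!var:-} ]]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Call it at the top of each script so a half-configured host fails loudly instead of uploading nothing.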
## Provisioning the backup destination

1. Create a dedicated S3-compatible bucket in a **different account** from
   the live infra. AWS S3, Backblaze B2, or a separately-credentialed
   MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup
   write key still can't permanently delete history.
3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`, and
   `s3:ListBucket` on this bucket only. Inject them as the `BACKUP_S3_*`
   variables above. Do not reuse the live `MINIO_*` keys.
4. Set a lifecycle rule that transitions objects older than 30 days to
   cold storage and deletes them at 90 days. Past 90 days it's cheaper to
   restart from a snapshot taken outside the system.

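Step 3's scoping translates to a policy document along these lines. This is an illustrative AWS-style policy, not one pulled from the live account — adjust the bucket ARN to match yours; MinIO accepts the same policy syntax.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::portnimara-backups-prod/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::portnimara-backups-prod"
    }
  ]
}
```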
## The scripts

Three scripts in `scripts/backup/`:

- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
- `restore.sh` — interactive restore (DB + MinIO) given a snapshot path

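For orientation, the core of `pg-backup.sh` amounts to a sketch like the one below. The key layout matches the schedule table; the `backup` `mc` alias and the exact pipeline flags are assumptions — the script in `scripts/backup/` is authoritative.

```shell
# Build the object key for one hourly dump: pg/<host>/<UTC date>/<hour>.dump.gz
backup_key() {
  local host=$1 epoch=$2  # epoch seconds, passed in so tests and drills can pin time
  printf 'pg/%s/%s/%s.dump.gz' "$host" \
    "$(date -u -d "@$epoch" +%F)" "$(date -u -d "@$epoch" +%H)"
}

# Dump -> gzip -> (optional GPG) -> stream straight into the backup bucket.
run_backup() {
  local key
  key=$(backup_key "$(hostname)" "$(date +%s)")
  pg_dump --format=custom "$DATABASE_URL" | gzip |
    if [[ -n ${BACKUP_GPG_RECIPIENT:-} ]]; then
      gpg --encrypt --recipient "$BACKUP_GPG_RECIPIENT"
    else
      cat
    fi | mc pipe "backup/$BACKUP_S3_BUCKET/$key${BACKUP_GPG_RECIPIENT:+.gpg}"
}
```

Streaming through `mc pipe` avoids a temp file; the `.gpg` suffix is appended only when encryption is on, which is what the restore step later keys off.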
Make them executable and wire them into cron / GitHub Actions / your
scheduler of choice. Sample crontab on the worker host:

```cron
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1

# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1

# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
```
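Cron offers no overlap protection: a dump that takes longer than an hour collides with the next run. A common guard is `flock(1)`; a wrapper like the following (hypothetical — not shipped in `scripts/backup/`) makes each entry skip instead of stack:

```shell
# Run a command under an exclusive non-blocking lock; skip (and log) if the
# previous run still holds it.
run_exclusive() {
  local lock=$1; shift
  flock -n "$lock" "$@" || echo "skipped: $lock still held" >&2
}
```

Usage in the crontab: `run_exclusive /var/lock/pncrm-pg-backup.lock /opt/pncrm/scripts/backup/pg-backup.sh`.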

## Restoring from cold

These steps have been rehearsed against the dev environment; expect them
to take 15–30 minutes for a typical port. **The drill (last cron line
above) ensures the runbook stays correct — if the drill fails, the
real restore will too.**

### 0. Stop everything that writes

```bash
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
```

### 1. Restore PostgreSQL

```bash
# Find the dump you want. Prefer the most recent successful hour.
# (Assumes an `mc alias` named "backup" pointing at $BACKUP_S3_ENDPOINT.)
mc ls "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"

# Pull it.
mc cp "backup/$BACKUP_S3_BUCKET/pg/$(hostname)/$SNAPSHOT" /tmp/

# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz

# Drop & recreate the database. Connect to the maintenance DB ("postgres"),
# not the target — Postgres refuses to drop the database you're connected to.
# The 'restrict' FK from gdpr_exports.requested_by to user means tables must
# come back in the right order — pg_restore handles the ordering.
MAINT_URL="${DATABASE_URL%/*}/postgres"
psql "$MAINT_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$MAINT_URL" -c 'CREATE DATABASE port_nimara_crm;'
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"
```

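One useful property of the `<UTC date>/<hour>` key layout: zero-padded keys sort lexicographically in chronological order, so "the most recent successful hour" is simply the maximum key. A sketch of a helper `restore.sh` could use (`latest_key` is hypothetical):

```shell
# Newest snapshot among the given keys; relies on the zero-padded
# YYYY-MM-DD/HH layout sorting the same as chronological order.
latest_key() {
  printf '%s\n' "$@" | sort | tail -n 1
}
```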
### 2. Restore MinIO

```bash
# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots. (Assumes `mc alias`es
# "backup" and "live" for the two endpoints.)
mc mirror --overwrite \
  "backup/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"
```

### 3. Restore secrets

The `.env` file is **not** in object storage. Pull it from the password
manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
when the database was last running — if it doesn't, rows in
`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button will return an opaque error. There is no
recovery path; the keys must be re-entered through the admin UI.

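One way to make the `ENCRYPTION_KEY` check concrete: record a short fingerprint of the key on the producer (safe to store beside the backups — a truncated hash reveals nothing useful about the key) and compare it against the vault copy before starting services. `key_fingerprint` is a hypothetical helper, not part of the CRM.

```shell
# 16-hex-char fingerprint of a secret; compare producer vs restored value.
key_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-16
}
```

If the fingerprints differ, stop here and hunt down the correct key before anything writes to the restored database.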
### 4. Bring services back up

```bash
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
```

### 5. Verify

Work through the smoke checklist, in order:

1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
   matches the producer-side count from the snapshot's hour.
2. **MinIO up** — open any client with attachments in the CRM, click a
   receipt thumbnail, and verify the signed URL serves the file.
3. **Documenso webhooks** — re-trigger one in the Documenso admin and
   confirm `audit_logs` records the receipt.
4. **Email** — send a portal invite to a real address.
5. **Realtime** — open two browser windows, edit a client in one, and watch
   the other update via Socket.IO.
6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;` should be
   non-empty if AI was in use. Old rows survive the restore, but the budget
   gates reset at the next month rollover.

## Drill schedule

The weekly drill (cron line above) runs `restore.sh --drill` against a
throwaway database and a sandbox MinIO bucket. It must produce zero diff
between the restored row counts and the live row counts (modulo the
hour or so the drill takes to run).

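The zero-diff requirement can be checked mechanically. A sketch of what the end of `restore.sh --drill` might look like — the table list and function names here are illustrative, not taken from the script:

```shell
# Compare row counts for key tables between the live and drill databases.
DRILL_TABLES=(clients yachts reservations invoices audit_logs)

# -A (unaligned) and -t (tuples only) make psql emit just the number.
count_rows() { psql "$1" -Atc "SELECT count(*) FROM $2"; }

drill_diff() {
  local live=$1 drill=$2 t a b rc=0
  for t in "${DRILL_TABLES[@]}"; do
    a=$(count_rows "$live" "$t")
    b=$(count_rows "$drill" "$t")
    [[ $a == "$b" ]] || { echo "MISMATCH $t: live=$a drill=$b"; rc=1; }
  done
  return "$rc"
}
```

A non-zero exit from `drill_diff` is what should page someone — a drill that "succeeds" with mismatched counts is worse than one that fails outright.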
Failure modes the drill catches before they bite production:

- Schema drift — `pg_dump`'s default captures every schema, but if a future
  version of `pg-backup.sh` pins `--schema=public`, tables in a new
  `tenant_X` schema would be silently dropped from dumps.
- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
  on certain prefixes.
- GPG passphrase rotation that wasn't propagated to the restore host.
- Version skew between the restore host's `pg_restore` and the
  producer-side `pg_dump`.
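The last failure mode is cheap to catch before a restore even starts: compare major versions of the producer's `pg_dump` (assumed here to be recorded alongside the backup) and the local `pg_restore`. A newer `pg_restore` can generally read older custom-format dumps, but not the reverse. A sketch:

```shell
# Extract the major version from a "--version" banner, e.g.
# "pg_restore (PostgreSQL) 16.2" -> "16".
pg_major() {
  sed -E 's/.* ([0-9]+)(\.[0-9]+)*$/\1/' <<<"$1"
}

# Refuse to restore with a pg_restore older than the producing pg_dump.
check_pg_versions() {
  local producer=$1 restorer=$2
  (( $(pg_major "$restorer") >= $(pg_major "$producer") ))
}
```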