mobile-ux-polish/docs/runbooks/backup-and-restore.md

# Backup and restore runbook

This runbook documents what gets backed up, how often, where it lands, and
the exact commands to restore the system from a cold start. The goal is
that any operator who has the off-site backup credentials can bring the
CRM back up on a clean host without help.

## Scope of a "full backup"

The CRM has three stateful surfaces. All three must be captured for a
restore to be useful.

| Surface                                                | Holds                                                                                                                                                              | Risk if missing                                                                                                                       |
| ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| **PostgreSQL** (`port_nimara_crm`)                     | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable.                                                                                              |
| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments.                                                                                      | Files reachable by row references in Postgres become 404s.                                                                            |
| **`.env` + secrets**                                   | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`).                                                                  | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |

The Redis instance is not backed up. It only holds queue state, rate-limit
counters, and Socket.IO presence — all reconstructable. Stop the workers
during a restore so the queue starts clean.

## Backup schedule

Defaults are tuned for a single-port deployment with O(10k) clients. Increase
the backup frequency on the producing side as scale demands.

| Job                                | Frequency            | Retention                     | Where                                                                |
| ---------------------------------- | -------------------- | ----------------------------- | -------------------------------------------------------------------- |
| `pg_dump` (custom format, gzipped) | Hourly               | 7 days hourly + 30 days daily | `${BACKUP_S3_BUCKET}/pg/<host>/<UTC date>/<hour>.dump.gz`            |
| MinIO mirror                       | Hourly (incremental) | 30 days versions              | `${BACKUP_S3_BUCKET}/minio/`                                         |
| `.env` snapshot (encrypted)        | On change (manual)   | Forever                       | Password manager / secrets vault — **never the same bucket as data** |

Hourly is a reasonable default for this workload: invoices and contracts
cluster around business hours, and an hour of lost work is the worst-case
data-loss window (RPO) most clients will tolerate. If a customer demands a
tighter RPO, move to continuous WAL archiving, for example with a 15-minute
`archive_timeout`.
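
The retention column can be read as a keep-or-prune rule per snapshot. The sketch below is one possible reading, not the logic of the actual scripts: it assumes "7 days hourly + 30 days daily" means every hourly dump is kept for 7 days, and that only one dump per day is kept until day 30. The choice of the 00:00 UTC dump as the "daily" one and the function name `keep_snapshot` are both assumptions made for illustration.

```bash
# keep_snapshot AGE_HOURS HOUR_OF_DAY
# Exit status 0 = keep, 1 = eligible for pruning.
# Assumption: the 00:00 UTC dump is the one promoted to "daily".
keep_snapshot() {
  age_hours="$1"
  hour="$2"
  # Inside the 7-day hourly window: keep every dump.
  if [ "$age_hours" -le $((7 * 24)) ]; then
    return 0
  fi
  # Inside the 30-day daily window: keep only the midnight dump.
  if [ "$age_hours" -le $((30 * 24)) ] && [ "$hour" -eq 0 ]; then
    return 0
  fi
  return 1
}
```

A pruning job could call `keep_snapshot 200 14 || echo "prune"` for each object it lists. In practice a bucket lifecycle rule may make a hand-written pruner unnecessary.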

## Required environment variables

The scripts below read these. Store them in a CI secret store, not the
host's bash profile.

```
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<pw>@<host>:<port>/port_nimara_crm

# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<live key>
MINIO_SECRET_KEY=<live secret>
MINIO_BUCKET=crm-files

# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<dedicated read+write key for this bucket only>
BACKUP_S3_SECRET_KEY=<...>

# Optional: GPG-encrypts dumps to this recipient's public key before upload.
# Limits the blast radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
```
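
A backup that starts with a missing variable tends to fail halfway through, so a preflight check is cheap insurance. The sketch below is hypothetical and not taken from the scripts in this repository: `require_env` and `REQUIRED_BACKUP_VARS` are illustrative names, and the list only covers variables shown in the block above.

```bash
# require_env NAME...
# Prints each unset or empty variable to stderr; returns 1 if any are missing.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# BACKUP_GPG_RECIPIENT is optional, so it is deliberately left out.
REQUIRED_BACKUP_VARS=(
  DATABASE_URL
  MINIO_ENDPOINT MINIO_ACCESS_KEY MINIO_SECRET_KEY MINIO_BUCKET
  BACKUP_S3_ENDPOINT BACKUP_S3_BUCKET BACKUP_S3_ACCESS_KEY BACKUP_S3_SECRET_KEY
)
```

A script would then start with `require_env "${REQUIRED_BACKUP_VARS[@]}" || exit 1`.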

## Provisioning the backup destination

1. Create a dedicated S3-compatible bucket in a **different account** from
   the live infra. AWS S3, Backblaze B2, or a separately-credentialed
   MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup
   write key still can't permanently delete history.
3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`,
   `s3:ListBucket` on this bucket only. Inject them as
   `BACKUP_S3_*` above. Do not reuse the live `MINIO_*` keys.
4. Set a 90-day lifecycle rule that transitions objects older than 30
   days to cold storage and deletes them at 90 days. Past 90 days it's
   cheaper to restart from a snapshot taken outside the system.
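
On AWS S3, step 4 corresponds to a bucket lifecycle configuration. The sketch below assumes AWS as the provider and `GLACIER` as the cold-storage class. The rule ID and the function name `lifecycle_json` are made up for illustration. Backblaze B2 and MinIO have their own lifecycle mechanisms and would need a different document.

```bash
# lifecycle_json prints an S3 lifecycle configuration:
# transition to cold storage at 30 days, delete at 90 days.
# Assumptions: AWS S3, GLACIER as the cold tier, rule applies to the whole bucket.
lifecycle_json() {
  cat <<'JSON'
{
  "Rules": [
    {
      "ID": "backup-90-day-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 90 }
    }
  ]
}
JSON
}

# Applying it needs the AWS CLI and credentials allowed to manage lifecycle
# rules (the scoped backup key from step 3 is not):
#   lifecycle_json > /tmp/lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket "$BACKUP_S3_BUCKET" \
#     --lifecycle-configuration file:///tmp/lifecycle.json
```

If versioning from step 2 is enabled, expiring the current version only adds a delete marker. A `NoncurrentVersionExpiration` rule would also be needed to reclaim the space.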

## The scripts

Three scripts in `scripts/backup/`:

- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
- `restore.sh` — interactive restore (DB + MinIO) given a snapshot path

Make them executable and wire them into cron / GitHub Actions / your
scheduler of choice. Sample crontab on the worker host:

```cron
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1

# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1

# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
```
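
For orientation, here is a minimal sketch of the dump, compress, optionally encrypt, upload pipeline that `pg-backup.sh` is described as running. It is not the contents of the real script. The function names, the temp-file handling, and the `mc` alias named `backup` (assumed to already point at `BACKUP_S3_ENDPOINT`) are all assumptions.

```bash
# snapshot_key HOST EPOCH_SECONDS
# Builds the object key layout pg/<host>/<UTC date>/<hour>.dump.gz.
# Uses GNU date (-d @epoch); BSD/macOS date would need -r instead.
snapshot_key() {
  printf 'pg/%s/%s.dump.gz\n' "$1" "$(date -u -d "@$2" +%Y-%m-%d/%H)"
}

# pg_backup: dump, compress, optionally encrypt, upload.
# The body runs in a subshell so pipefail and the trap stay local to it.
pg_backup() (
  set -o pipefail   # a pg_dump failure must not be masked by gzip succeeding
  key="$(snapshot_key "$(hostname)" "$(date +%s)")" || exit 1
  tmp="$(mktemp)" || exit 1
  trap 'rm -f "$tmp" "$tmp.gpg"' EXIT

  pg_dump --format=custom "$DATABASE_URL" | gzip > "$tmp" || exit 1

  if [ -n "${BACKUP_GPG_RECIPIENT:-}" ]; then
    gpg --batch --yes --encrypt --recipient "$BACKUP_GPG_RECIPIENT" \
      --output "$tmp.gpg" "$tmp" || exit 1
    mv "$tmp.gpg" "$tmp"
    key="$key.gpg"
  fi

  # The exit status of the upload becomes the status of the whole function.
  mc cp "$tmp" "backup/$BACKUP_S3_BUCKET/$key"
)
```

Every step either succeeds or ends the subshell with a non-zero status, so a cron wrapper can alert on the exit code instead of parsing the log.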

## Restoring from cold

These steps have been rehearsed against the dev environment; expect them
to take 15–30 minutes for a typical port. **The drill (last cron line
above) ensures the runbook stays correct — if the drill fails, the
real restore will too.**

### 0. Stop everything that writes

```bash
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
```

### 1. Restore PostgreSQL

```bash
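# mc paths need an alias prefix; register one for the backup endpoint
# (the alias name "backup" is arbitrary).
mc alias set backup "$BACKUP_S3_ENDPOINT" "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY"

# Dumps are keyed by the producer's hostname, which a clean host won't share.
SRC_HOST="<producer hostname>"

# Find the dump you want. Prefer the most recent successful hour.
mc ls --recursive "backup/$BACKUP_S3_BUCKET/pg/$SRC_HOST/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"

# Pull it.
mc cp "backup/$BACKUP_S3_BUCKET/pg/$SRC_HOST/$SNAPSHOT" /tmp/

# Decrypt only if BACKUP_GPG_RECIPIENT was set on the producer side (the
# downloaded file is then expected to end in .gpg; adjust SNAPSHOT to match).
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz

# Drop & recreate the database. DROP DATABASE fails from a session connected
# to the target, so run both statements against the "postgres" maintenance DB.
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'

# pg_restore orders objects by dependency and adds foreign keys (such as the
# RESTRICT FK from gdpr_exports.requested_by to user) after the data is loaded.
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"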
```
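
PostgreSQL will not drop a database that the current session is connected to, so the `DROP`/`CREATE` pair needs a connection to a different database on the same server, conventionally `postgres`. The helper below is a hypothetical sketch (`admin_url` is not part of the scripts). It rewrites a connection URL to point at that maintenance database and keeps any query string such as `?sslmode=require`.

```bash
# admin_url URL
# Replaces the database name in a postgresql:// URL with "postgres",
# preserving an optional ?query suffix.
admin_url() {
  url="$1"
  query=""
  case "$url" in
    *\?*)
      query="?${url#*\?}"   # everything after the first "?"
      url="${url%%\?*}"     # everything before it
      ;;
  esac
  # ${url%/*} strips the final path segment, i.e. the database name.
  printf '%s/postgres%s\n' "${url%/*}" "$query"
}
```

Usage would look like `psql "$(admin_url "$DATABASE_URL")" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'`. The role in the URL still needs permission to drop and create databases.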

### 2. Restore MinIO

```bash
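# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots. Both sides are mc aliases:
# "backup" for BACKUP_S3_ENDPOINT and "live" for the production MinIO
# (create them with `mc alias set` if this host doesn't have them yet).
mc mirror --overwrite \
  "backup/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"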
```

### 3. Restore secrets

The `.env` file is **not** in object storage. Pull it from the password
manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
when the database was last running — if it doesn't, rows in
`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button will return an opaque error. There is no
recovery path; the keys must be re-entered through the admin UI.
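
One way to make the "does the key match" check concrete is to compare a short fingerprint of the key rather than eyeballing the key itself. This is a sketch under an assumption the runbook does not state: that a fingerprint was recorded next to the `.env` entry in the vault while production was healthy. `key_fingerprint` is a hypothetical helper.

```bash
# key_fingerprint VALUE
# Prints the first 12 hex characters of the SHA-256 of VALUE.
# printf is a shell builtin, so the key is not exposed as a process argument.
key_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}
```

On the restore host, `key_fingerprint "$ENCRYPTION_KEY"` should print the same 12 characters that were recorded earlier. A mismatch means the `.env` snapshot is stale.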

### 4. Bring services back up

```bash
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
```

### 5. Verify

Work through the smoke checklist, in order:

1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
   matches the producer-side count from the snapshot's hour.
2. **MinIO up** — open any client with attachments in the CRM, click a
   receipt thumbnail; verify the signed URL serves the file.
3. **Documenso webhooks** — re-trigger one in the Documenso admin and
   confirm `audit_logs` records the receipt.
4. **Email** — send a portal invite to a real address.
5. **Realtime** — open two browser windows, edit a client in one, watch
   the other update via Socket.IO.
6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;`
   returns a non-zero count if AI was in use. Old rows survive; the budget
   gates still reset at the period boundary on month rollover.

## Drill schedule

The weekly drill (cron line above) runs `restore.sh --drill` against a
throwaway database and a sandbox MinIO bucket. It must produce zero diff
between the restored row counts and the live row counts (modulo the
hour-or-so the drill takes to run).
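
The row-count comparison can be sketched as a small diff over two text files. This is not the implementation inside `restore.sh`. The file format (one `table count` pair per line) and the name `diff_counts` are assumptions for illustration.

```bash
# diff_counts LIVE_FILE RESTORED_FILE
# Each file holds "table count" lines. Prints every table whose count differs
# or that exists on only one side; exits non-zero if anything differs.
# Caveat: the NR == FNR idiom assumes LIVE_FILE is not empty.
diff_counts() {
  awk '
    NR == FNR { live[$1] = $2; next }   # first file: live counts
    { restored[$1] = $2 }               # second file: restored counts
    END {
      for (t in live) {
        if (!(t in restored)) { print t ": missing from restore"; bad = 1 }
        else if (live[t] != restored[t]) {
          print t ": live=" live[t] " restored=" restored[t]; bad = 1
        }
      }
      for (t in restored)
        if (!(t in live)) { print t ": missing from live"; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1" "$2"
}

# One way to produce a line of input (exact count, one query per table):
#   psql "$DATABASE_URL" -At -F ' ' -c "SELECT 'clients', count(*) FROM clients"
```

Since the live database keeps taking writes, a real drill would either compare against counts captured at dump time or tolerate a small positive drift on busy tables.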

Failure modes the drill catches before they bite production:

- Tables that live in a schema the dump doesn't cover. We use the `pg_dump`
  default (no `--schema` flag), which captures every non-system schema; if
  someone later narrows the dump to `--schema=public`, a future `tenant_X`
  schema will be silently left out.
- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
  on certain prefixes.
- GPG key (or key passphrase) rotation that wasn't propagated to the restore host.
- A `pg_restore` version skew with the producer-side `pg_dump`.
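
The version-skew case is easy to check up front. The sketch below applies a rule of thumb, not a guarantee from PostgreSQL: a `pg_restore` whose major version is at least that of the producing `pg_dump` can be expected to read the archive, while an older one may reject it. `pg_major` and `restore_can_read` are hypothetical names, and the parsing assumes the usual `--version` output of PostgreSQL 10 or later.

```bash
# pg_major VERSION_LINE
# Extracts the major version from e.g. "pg_dump (PostgreSQL) 16.2".
pg_major() {
  printf '%s\n' "$1" | sed -E 's/^[^)]*\) ([0-9]+).*/\1/'
}

# restore_can_read DUMP_VERSION_LINE RESTORE_VERSION_LINE
# Returns 0 when pg_restore's major version is >= pg_dump's.
restore_can_read() {
  [ "$(pg_major "$2")" -ge "$(pg_major "$1")" ]
}
```

For this to work, the producer would need to record its `pg_dump --version` output alongside each dump, which the restore host then compares against `pg_restore --version`.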

docs(ops): backup/restore + email deliverability runbooks

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB +
  hourly MinIO mirror, 7-day hourly + 30-day daily retention),
  cold-restore procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications,
  diagnosis flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 20:10:30 +02:00