From 6eb0d3dc92844bcda31ecbbbd46a330bc9ac7bb7 Mon Sep 17 00:00:00 2001
From: Matt Ciaccio
Date: Tue, 28 Apr 2026 20:10:30 +0200
Subject: [PATCH] docs(ops): backup/restore + email deliverability runbooks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two new runbooks under docs/runbooks/ plus the automation scripts the
backup runbook references. Both are written so an operator who has only
the off-site backup credentials and the runbook can recover the system
unaided.

Backup/restore (Phase 4a):
- docs/runbooks/backup-and-restore.md — covers what gets backed up
  (Postgres / MinIO / .env+ENCRYPTION_KEY), schedule (hourly DB + hourly
  MinIO mirror, 7-day hourly + 30-day daily retention), cold-restore
  procedure with row-count verification, weekly drill
- scripts/backup/pg-backup.sh — pg_dump → gzip → optional GPG → mc
  upload, fails loud
- scripts/backup/minio-mirror.sh — incremental mc mirror, no --remove
  flag so accidental deletes on the live bucket can't cascade
- scripts/backup/restore.sh — interactive prod restore + --drill mode
  that runs against a sandbox DB and diffs row counts

Email deliverability (Phase 4b):
- docs/runbooks/email-deliverability.md — what the CRM sends, DNS
  records (SPF/DKIM/DMARC/MX), per-port override implications, diagnosis
  flow ("didn't arrive" → 4-step checklist starting with
  EMAIL_REDIRECT_TO), provider migration plan, realapi suite as the
  end-to-end probe

Tests still 778/778 vitest, tsc/lint clean — these phases are docs +
shell scripts, no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/runbooks/backup-and-restore.md   | 199 ++++++++++++++++++++++++++
 docs/runbooks/email-deliverability.md | 186 ++++++++++++++++++++++++
 scripts/backup/minio-mirror.sh        |  51 +++++++
 scripts/backup/pg-backup.sh           |  63 ++++++++
 scripts/backup/restore.sh             | 121 ++++++++++++++++
 5 files changed, 620 insertions(+)
 create mode 100644 docs/runbooks/backup-and-restore.md
 create mode 100644 docs/runbooks/email-deliverability.md
 create mode 100644 scripts/backup/minio-mirror.sh
 create mode 100644 scripts/backup/pg-backup.sh
 create mode 100644 scripts/backup/restore.sh

diff --git a/docs/runbooks/backup-and-restore.md b/docs/runbooks/backup-and-restore.md
new file mode 100644
index 0000000..8e04000
--- /dev/null
+++ b/docs/runbooks/backup-and-restore.md
@@ -0,0 +1,199 @@
# Backup and restore runbook

This runbook documents what gets backed up, how often, where it lands, and
the exact commands to restore the system from a cold start. The goal is
that any operator who has the off-site backup credentials can bring the
CRM back up on a clean host without help.

## Scope of a "full backup"

The CRM has three stateful surfaces. All three must be captured for a
restore to be useful.

| Surface | Holds | Risk if missing |
| --- | --- | --- |
| **PostgreSQL** (`port_nimara_crm`) | Every relational record: clients, yachts, companies, interests, reservations, invoices, audit log, GDPR exports, AI usage ledger, Documenso webhook receipts, etc. | Total data loss — site is unrecoverable. |
| **MinIO bucket** (`MINIO_BUCKET`, default `crm-files`) | Receipts, signed contracts, EOI PDFs, GDPR export ZIPs, document attachments. | Files reachable by row references in Postgres become 404s. |
| **`.env` + secrets** | DB password, MinIO keys, Documenso webhook secret, SMTP creds, encryption key (`ENCRYPTION_KEY`). | OCR API keys re-resolve from `system_settings` (encrypted at rest), but **without the original `ENCRYPTION_KEY` they're unreadable**. |

The Redis instance is not backed up. It only holds queue state, rate-limit
counters, and Socket.IO presence — all reconstructable. Stop the workers
during a restore so the queue starts clean.

## Backup schedule

Defaults are tuned for a single-port deployment with O(10k) clients. Bump
on the producing side as scale demands.

| Job | Frequency | Retention | Where |
| --- | --- | --- | --- |
| `pg_dump` (custom format, gzipped) | Hourly | 7 days hourly + 30 days daily | `${BACKUP_S3_BUCKET}/pg/<host>/<date>/<hour>.dump.gz` |
| MinIO mirror | Hourly (incremental) | 30 days of versions | `${BACKUP_S3_BUCKET}/minio/` |
| `.env` snapshot (encrypted) | On change (manual) | Forever | Password manager / secrets vault — **never the same bucket as data** |

The hourly cadence is the right answer for this workload — invoices and
contracts cluster around business hours, and an hour of lost work is the
worst-case data loss window most clients will tolerate. Promote to 15-min
WAL streaming if a customer demands tighter RPO.

## Required environment variables

The scripts below read these. Store them in a CI secret store, not the
host's bash profile.

```
# Source (the running CRM database)
DATABASE_URL=postgresql://crm:<password>@<host>:<port>/port_nimara_crm

# MinIO (source bucket — the live one)
MINIO_ENDPOINT=minio.letsbe.solutions
MINIO_PORT=443
MINIO_USE_SSL=true
MINIO_ACCESS_KEY=<...>
MINIO_SECRET_KEY=<...>
MINIO_BUCKET=crm-files

# Backup destination (a *separate* MinIO/S3 endpoint or a different bucket
# with no IAM overlap with the live keys)
BACKUP_S3_ENDPOINT=https://s3.eu-west-1.amazonaws.com
BACKUP_S3_REGION=eu-west-1
BACKUP_S3_BUCKET=portnimara-backups-prod
BACKUP_S3_ACCESS_KEY=<...>
BACKUP_S3_SECRET_KEY=<...>

# Optional: GPG-encrypts dumps to this recipient's public key. Narrows the
# blast radius if the backup bucket itself is compromised.
BACKUP_GPG_RECIPIENT=ops@portnimara.com
```

## Provisioning the backup destination

1. Create a dedicated S3-compatible bucket in a **different account** from
   the live infra. AWS S3, Backblaze B2, or a separately-credentialed
   MinIO instance all work.
2. Apply object-lock or versioning so an attacker who steals the backup
   write key still can't permanently delete history.
3. Generate IAM credentials scoped to `s3:PutObject`, `s3:GetObject`,
   `s3:ListBucket` on this bucket only. Inject them as
   `BACKUP_S3_*` above. Do not reuse the live `MINIO_*` keys.
4. Set a 90-day lifecycle rule that transitions objects older than 30
   days to cold storage and deletes them at 90 days (see the sketch after
   this list). Past 90 days it's cheaper to restart from a snapshot taken
   outside the system.
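A sketch of steps 2 and 4 for an AWS S3 destination, using stock
`aws s3api` calls. The bucket name comes from the env block above; the
rule ID and GLACIER storage class are illustrative, and a Backblaze or
MinIO destination would use `mc version enable` / `mc ilm` instead:

```bash
# Run once with admin credentials (not the scoped BACKUP_S3_* keys).

# Step 2: enable versioning, so deletes/overwrites keep prior versions.
aws s3api put-bucket-versioning \
  --bucket "$BACKUP_S3_BUCKET" \
  --versioning-configuration Status=Enabled

# Step 4: 30-day transition to cold storage, 90-day expiry.
cat > /tmp/backup-lifecycle.json <<'JSON'
{
  "Rules": [
    {
      "ID": "cold-then-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [{ "Days": 30, "StorageClass": "GLACIER" }],
      "Expiration": { "Days": 90 },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
    }
  ]
}
JSON
aws s3api put-bucket-lifecycle-configuration \
  --bucket "$BACKUP_S3_BUCKET" \
  --lifecycle-configuration file:///tmp/backup-lifecycle.json
```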
## The scripts

Three scripts in `scripts/backup/`:

- `pg-backup.sh` — runs `pg_dump`, gzips, optionally GPG-encrypts, uploads
- `minio-mirror.sh` — `mc mirror` of the live bucket → backup bucket
- `restore.sh` — interactive DB restore given a snapshot path (MinIO comes
  back separately via `mc mirror`, step 2 below)

Make them executable and wire them into cron / GitHub Actions / your
scheduler of choice. Sample crontab on the worker host:

```cron
# Hourly DB dump at minute 7
7 * * * * /opt/pncrm/scripts/backup/pg-backup.sh >> /var/log/pncrm-backup.log 2>&1

# Hourly MinIO mirror at minute 17 (offset so the two don't fight for I/O)
17 * * * * /opt/pncrm/scripts/backup/minio-mirror.sh >> /var/log/pncrm-backup.log 2>&1

# Weekly restore drill (smoke-test to a throwaway DB on Sunday at 03:00)
0 3 * * 0 /opt/pncrm/scripts/backup/restore.sh --drill >> /var/log/pncrm-restore-drill.log 2>&1
```

## Restoring from cold

These steps have been rehearsed against the dev environment; expect them
to take 15–30 minutes for a typical port. **The drill (last cron line
above) ensures the runbook stays correct — if the drill fails, the
real restore will too.**

### 0. Stop everything that writes

```bash
docker compose -f docker-compose.prod.yml stop web worker scheduler
# Leave postgres + minio + redis up; we'll point them at restored data.
```

### 1. Restore PostgreSQL

```bash
# One-time: register an mc alias for the backup destination.
mc alias set bk "$BACKUP_S3_ENDPOINT" "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY"

# Find the dump you want. Prefer the most recent successful hour.
mc ls --recursive "bk/$BACKUP_S3_BUCKET/pg/$(hostname -s)/" | tail
SNAPSHOT="2026-04-28/14.dump.gz"

# Pull it.
mc cp "bk/$BACKUP_S3_BUCKET/pg/$(hostname -s)/$SNAPSHOT" /tmp/

# Decrypt if BACKUP_GPG_RECIPIENT was set on the producer side.
gpg --decrypt /tmp/14.dump.gz.gpg > /tmp/14.dump.gz

# Drop & recreate the database. The 'restrict' FK from gdpr_exports.requested_by
# to `user` means restore order matters — pg_restore handles that. You can't
# drop the database you're connected to, so issue the drop from the
# maintenance DB:
ADMIN_URL="${DATABASE_URL%/*}/postgres"
psql "$ADMIN_URL" -c 'DROP DATABASE IF EXISTS port_nimara_crm WITH (FORCE);'
psql "$ADMIN_URL" -c 'CREATE DATABASE port_nimara_crm;'
gunzip -c /tmp/14.dump.gz | pg_restore --no-owner --no-privileges \
  --dbname "$DATABASE_URL"
```

### 2. Restore MinIO

```bash
# Needs a 'live' alias for the production MinIO:
mc alias set live "https://$MINIO_ENDPOINT:$MINIO_PORT" \
  "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY"

# Sync the backup bucket back over the live one. --overwrite handles
# files that were modified between snapshots.
mc mirror --overwrite \
  "bk/$BACKUP_S3_BUCKET/minio/" \
  "live/$MINIO_BUCKET/"
```

### 3. Restore secrets

The `.env` file is **not** in object storage. Pull it from the password
manager / secrets vault. Verify `ENCRYPTION_KEY` matches the value used
when the database was last running — if it doesn't, rows in
`system_settings` (OCR API keys, etc.) decrypt to garbage and the OCR
"Test connection" button will return an opaque error. There is no
recovery path; the keys must be re-entered through the admin UI.

### 4. Bring services back up

```bash
docker compose -f docker-compose.prod.yml up -d
# Watch the worker logs; expect a flurry of socket reconnections, then quiet.
docker compose -f docker-compose.prod.yml logs -f worker
```

### 5. Verify

Work through the smoke checklist, in order (the scriptable subset is
sketched after the list):

1. **DB up** — `psql "$DATABASE_URL" -c 'SELECT count(*) FROM clients;'`
   matches the producer-side count from the snapshot's hour.
2. **MinIO up** — open any client with attachments in the CRM, click a
   receipt thumbnail; verify the signed URL serves the file.
3. **Documenso webhooks** — re-trigger one in the Documenso admin and
   confirm `audit_logs` records the receipt.
4. **Email** — send a portal invite to a real address.
5. **Realtime** — open two browser windows, edit a client in one, watch
   the other update via Socket.IO.
6. **AI usage ledger** — `SELECT count(*) FROM ai_usage_ledger;` is
   non-zero if AI was being used. Old rows survive; the budget gates
   simply reset at the next month-rollover period boundary.
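The scriptable checks (items 1 and 6) can run as one loop. A minimal
sketch; the table list here is illustrative:

```bash
# Print restored row counts for eyeballing against pre-incident numbers.
for tbl in clients yachts invoices audit_logs ai_usage_ledger; do
  printf '%-18s %s\n' "$tbl" \
    "$(psql -At "$DATABASE_URL" -c "SELECT count(*) FROM \"$tbl\";")"
done
```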
## Drill schedule

The weekly drill (cron line above) runs `restore.sh --drill` against a
throwaway database. It must produce zero diff between the restored row
counts and the live row counts (modulo rows written since the snapshot
was taken).

Failure modes the drill catches before they bite production:

- A schema filter creeping into the `pg_dump` invocation (the scripts use
  the default, which dumps every non-system schema — but a future
  `--schema=public` flag would silently drop a new `tenant_X` schema).
- MinIO bucket-policy changes that block the backup-side `s3:GetObject`
  on certain prefixes.
- GPG passphrase rotation that wasn't propagated to the restore host.
- A `pg_restore` version skew with the producer-side `pg_dump`.
diff --git a/docs/runbooks/email-deliverability.md b/docs/runbooks/email-deliverability.md
new file mode 100644
index 0000000..debabec
--- /dev/null
+++ b/docs/runbooks/email-deliverability.md
@@ -0,0 +1,186 @@
# Email deliverability runbook

The CRM sends transactional email through four different surfaces. Each
has a different failure mode when it lands in spam. This runbook covers
how to diagnose, fix, and verify each path.

## What email the CRM sends

| Surface | Trigger | Template | Default `from` |
| --- | --- | --- | --- |
| Portal activation / password-reset | Admin invites a client to the portal | `src/lib/email/templates/portal-auth.ts` | per-port `email_settings.from_address` or `SMTP_FROM` |
| Inquiry confirmation + sales notification | Public website POSTs to `/api/public/interests` or `/api/public/residential-inquiries` | `inquiry-client-confirmation.ts`, `inquiry-sales-notification.ts` | same |
| GDPR export ready | Staff requests an export with `emailToClient=true` | inline in `gdpr-export.service.ts` | same |
| Documenso reminders | Cadence job fires for an unsigned signer | `documenso/reminders/*` | same |

Documenso _itself_ sends signing requests with its own `from` address —
those don't flow through this codebase. SPF/DKIM for the Documenso
sender is the Documenso operator's problem, not yours.

## DNS records

For every domain that appears in a `from:` header you must publish:

### 1. SPF

A single TXT record at the apex authorizing whichever provider is
sending. Multiple SPF records on the same name **break SPF entirely** —
combine them into one.

```
v=spf1 include:_spf.google.com include:amazonses.com -all
```

The `-all` (hardfail) is correct for transactional mail. Switch to `~all`
(softfail) only as a temporary diagnostic when migrating providers.

### 2. DKIM

Each provider publishes its own selector. Common shapes (a spot-check
sketch follows the list):

- Google Workspace: `google._domainkey` → 2048-bit RSA pubkey (rotate every 12 months).
- Amazon SES: `xxxx._domainkey`, `yyyy._domainkey`, `zzzz._domainkey` (three CNAMEs SES gives you).
- Postmark / Resend / Mailgun: one CNAME per selector.
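To spot-check what's actually published, query each record from a shell.
Domain and selector here are illustrative; use the `from:` domain and the
selector your provider issued:

```bash
# SPF: exactly one TXT record starting with v=spf1 should come back.
dig +short TXT portnimara.com | grep 'v=spf1'

# DKIM: the selector TXT (or the CNAME it points at) must carry the
# p=<base64 public key> material. 'google' is the Workspace selector.
dig +short TXT google._domainkey.portnimara.com

# DMARC: one TXT record at the _dmarc label.
dig +short TXT _dmarc.portnimara.com
```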
Verify alignment — the `d=` value in the DKIM signature must match the
`From:` domain (relaxed alignment is fine, strict is overkill).

### 3. DMARC

Start at `p=none` while you build deliverability data, then upgrade to a
record like this one:

```
_dmarc 14400 IN TXT "v=DMARC1; p=quarantine; rua=mailto:dmarc@portnimara.com; ruf=mailto:dmarc@portnimara.com; fo=1; adkim=r; aspf=r; pct=100"
```

`rua` (aggregate reports) is the diagnostic feed — set it before the
first send so the first weekly report has data.

### 4. MX (only if you also receive)

The CRM's IMAP probe (`scripts/dev-imap-probe.ts`) and the inbound thread
sync rely on a real mailbox. Whoever runs that mailbox publishes the MX
records — typically Google Workspace or a dedicated provider. Don't add
an MX pointing at the CRM host; it doesn't accept SMTP IN.

## Per-port overrides

Each port can override `from_address`, `from_name`, and SMTP creds via
the admin email-settings page. When set, `getPortEmailConfig()` returns
those values and `sendEmail()` uses them in preference to the global
`SMTP_*` env. **The override domain still needs SPF / DKIM / DMARC** on
its own DNS — without them, every send from that port lands in spam.

When a customer reports "our portal invite didn't arrive":

1. Pull the port's email settings from the admin UI. Check `from_address`.
2. Run `dig TXT <from-domain>` and `dig TXT _dmarc.<from-domain>`.
   Confirm SPF includes the SMTP provider's domain and DMARC exists.
3. Send a probe through `mail-tester.com`: use the address it generates
   as the recipient of a test send, then read the score breakdown.
4. Score < 8/10 → fix whatever's flagged before doing anything else in
   this runbook.

## Diagnosing a "didn't arrive" report

Order matters — go top-down, stop when one of these is the answer.

### Step 1: Was the send attempted?

```bash
# Tail the worker logs for the recipient address.
docker compose logs worker | grep '<recipient-address>'
```

You'll see one of three patterns:

- **Nothing**: The job didn't run. Check that BullMQ actually queued it.
  `redis-cli LLEN bull:email:wait` — if non-zero, the worker is dead.
  `docker compose logs scheduler | tail` to see why.
- **`Email sent`** with a message-id: The provider accepted it. Move to
  Step 2.
- **`SendError`**: Provider rejected. The error string says why
  (auth, rate limit, blocked recipient).

### Step 2: Is `EMAIL_REDIRECT_TO` set?

In dev/test we set `EMAIL_REDIRECT_TO=ops@portnimara.com` so seeded fake
clients don't get real email. **It must be unset in production.**

```bash
# On the production host:
docker exec pncrm-web printenv EMAIL_REDIRECT_TO
# Should print nothing.
```

If it's set, every email is going to the redirect target with the
original recipient prefixed in the subject — the customer never sees it.

### Step 3: Did it land but get filtered?

Ask the recipient to check:

- Spam / Junk folder
- Gmail "Promotions" tab
- Outlook "Other" folder (vs Focused)
- The Quarantine console if they're on M365 with anti-spam enabled

If found in a spam folder: the email arrived; the recipient's filter
classified it. SPF/DKIM/DMARC alignment is suspect — re-run the
mail-tester probe from above.

### Step 4: Was the recipient on a suppression list?

Some providers (SES, Postmark) maintain a suppression list — once a send
to an address hard-bounces, future sends to it are dropped silently.

```bash
# SES example (the account-level suppression list is an SESv2 API):
aws sesv2 list-suppressed-destinations --region eu-west-1
```

If the recipient is suppressed, remove them (see the sketch below) and
ask them to retry. The CRM doesn't track suppression locally; that's the
provider's job.
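For SES, the check-and-clear cycle is two SESv2 calls. A sketch, assuming
the AWS CLI is authenticated against the sending account:

```bash
ADDR="client@example.com"   # the reported recipient

# Is the address suppressed, and why (BOUNCE vs COMPLAINT)?
aws sesv2 get-suppressed-destination \
  --email-address "$ADDR" --region eu-west-1

# Once the underlying cause is resolved, clear the entry so the next
# send is attempted again.
aws sesv2 delete-suppressed-destination \
  --email-address "$ADDR" --region eu-west-1
```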
## When migrating SMTP providers

1. Add the new provider's DKIM CNAMEs alongside the old ones.
2. Add the new provider's `include:` to the existing SPF record.
3. Wait 48 hours for DNS to propagate and DMARC reports to confirm both
   providers align.
4. Switch `SMTP_*` env to the new provider on a single staging host.
5. Send through the staging host for a week. Watch DMARC reports.
6. Cut production over.
7. Wait two weeks before removing the old provider's DNS — bounce
   reports for mail sent through the old provider keep arriving for a
   while.

## Testing a deliverability fix

There's no automated test for "did this email reach the inbox" — that's a
property of the recipient's filter, which we don't control. The closest
proxy is the realapi suite:

```bash
pnpm exec playwright test --project=realapi
```

It runs `tests/e2e/realapi/portal-imap-activation.spec.ts` which sends a
real portal-invite email through SMTP, then polls the configured IMAP
mailbox for the activation link. If it appears within 30 seconds, the
SMTP→DKIM→DMARC chain is alive end-to-end. If the test times out, work
backwards through this runbook.

The realapi suite needs `SMTP_*` and `IMAP_*` env vars — see the
"Optional dev/test-only env vars" block in `CLAUDE.md`.

## Bounce handling

The CRM doesn't currently process bounces. If you start seeing volume:

- Set up the provider's webhook (SES → SNS → Lambda; Postmark → webhook
  URL) to POST bounce events to a new `/api/webhooks/email-bounce` route.
- Persist the bounced address into an `email_suppressions` table.
- Have `sendEmail()` consult that table before each send.

That work isn't in scope yet; this runbook just flags it as the next
deliverability gap.
diff --git a/scripts/backup/minio-mirror.sh b/scripts/backup/minio-mirror.sh
new file mode 100644
index 0000000..f384786
--- /dev/null
+++ b/scripts/backup/minio-mirror.sh
@@ -0,0 +1,51 @@
#!/usr/bin/env bash
# Hourly MinIO mirror for Port Nimara CRM.
#
# Mirrors the live `MINIO_BUCKET` to the backup destination. `mc mirror`
# is incremental — only changed objects transfer — so this is cheap.
#
# Versioning on the destination bucket is what protects against object
# deletes / overwrites; we don't try to roll our own.

set -euo pipefail

: "${MINIO_ENDPOINT:?MINIO_ENDPOINT not set}"
: "${MINIO_ACCESS_KEY:?MINIO_ACCESS_KEY not set}"
: "${MINIO_SECRET_KEY:?MINIO_SECRET_KEY not set}"
: "${MINIO_BUCKET:?MINIO_BUCKET not set}"
: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

# Default scheme: live MinIO is plain HTTP unless MINIO_USE_SSL=true.
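# Illustrative derivations (ports fall back to 443 / 9000 below):
#   MINIO_USE_SSL=true,  MINIO_ENDPOINT=minio.letsbe.solutions
#     -> https://minio.letsbe.solutions:443
#   MINIO_USE_SSL unset, MINIO_ENDPOINT=10.0.0.5
#     -> http://10.0.0.5:9000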
if [[ "${MINIO_USE_SSL:-false}" == "true" ]]; then
  LIVE_URL="https://${MINIO_ENDPOINT}:${MINIO_PORT:-443}"
else
  LIVE_URL="http://${MINIO_ENDPOINT}:${MINIO_PORT:-9000}"
fi

LIVE_ALIAS="live-$$"
BACKUP_ALIAS="bk-$$"
trap 'mc alias remove "$LIVE_ALIAS" 2>/dev/null || true; mc alias remove "$BACKUP_ALIAS" 2>/dev/null || true' EXIT

mc alias set "$LIVE_ALIAS" "$LIVE_URL" \
  "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" --api S3v4 >/dev/null
mc alias set "$BACKUP_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" --api S3v4 >/dev/null

SOURCE="${LIVE_ALIAS}/${MINIO_BUCKET}/"
DEST="${BACKUP_ALIAS}/${BACKUP_S3_BUCKET}/minio/"

echo "[$(date -u +%FT%TZ)] Mirroring $SOURCE → $DEST"

# `--remove` would delete objects from the destination that no longer
# exist in source — we DON'T pass it, because that would let an
# accidental delete on the live bucket cascade into permanent loss on
# the backup side. Versioning + lifecycle handle stale-object cleanup.
mc mirror --quiet --overwrite "$SOURCE" "$DEST"

# Print byte / count summary for the operator.
echo "[$(date -u +%FT%TZ)] Done. Destination summary:"
mc du "$DEST"
diff --git a/scripts/backup/pg-backup.sh b/scripts/backup/pg-backup.sh
new file mode 100644
index 0000000..fdf13f7
--- /dev/null
+++ b/scripts/backup/pg-backup.sh
@@ -0,0 +1,63 @@
#!/usr/bin/env bash
# Hourly PostgreSQL backup for Port Nimara CRM.
#
# Reads DATABASE_URL and BACKUP_S3_* from the environment. Dumps to a
# tmpfile, gzips, optionally GPG-encrypts to BACKUP_GPG_RECIPIENT, and
# uploads to s3://${BACKUP_S3_BUCKET}/pg/<host>/<date>/<hour>.dump.gz[.gpg].
#
# Designed to fail loud: any non-zero exit halts the script and propagates
# to the cron / CI runner so the operator sees the failure.

set -euo pipefail

: "${DATABASE_URL:?DATABASE_URL not set}"
: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

HOST="${BACKUP_HOST_OVERRIDE:-$(hostname -s)}"
DATE_UTC="$(date -u +%Y-%m-%d)"
HOUR_UTC="$(date -u +%H)"
WORKDIR="$(mktemp -d)"
trap 'rm -rf "$WORKDIR"' EXIT

DUMP_FILE="$WORKDIR/${HOUR_UTC}.dump"
ARCHIVE_NAME="${HOUR_UTC}.dump.gz"

# Echo only the DB name; the full URL contains the password.
echo "[$(date -u +%FT%TZ)] Dumping ${DATABASE_URL##*/} → $DUMP_FILE"
pg_dump --format=custom --compress=0 --no-owner --no-privileges \
  --file="$DUMP_FILE" "$DATABASE_URL"

# pg_dump's custom format would otherwise compress internally; we disable
# that (--compress=0) and let the gzip wrapper do it once, so the file
# looks the same regardless of the dump format on disk.
gzip -n "$DUMP_FILE"
GZ_FILE="${DUMP_FILE}.gz"

# Optional GPG layer. Only encrypt if the recipient is configured.
if [[ -n "${BACKUP_GPG_RECIPIENT:-}" ]]; then
  echo "[$(date -u +%FT%TZ)] Encrypting for $BACKUP_GPG_RECIPIENT"
  gpg --batch --yes --trust-model always \
    --recipient "$BACKUP_GPG_RECIPIENT" \
    --encrypt --output "${GZ_FILE}.gpg" "$GZ_FILE"
  rm "$GZ_FILE"
  GZ_FILE="${GZ_FILE}.gpg"
  ARCHIVE_NAME="${ARCHIVE_NAME}.gpg"
fi

# Configure mc client for the backup destination.
MC_ALIAS="bk-$$"
mc alias set "$MC_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" \
  --api S3v4 >/dev/null

REMOTE_PATH="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${DATE_UTC}/${ARCHIVE_NAME}"
echo "[$(date -u +%FT%TZ)] Uploading → $REMOTE_PATH"
mc cp --quiet "$GZ_FILE" "$REMOTE_PATH"

# Tag with retention metadata so lifecycle rules can decide what to expire.
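# (Assumption: the 7-day-hourly / 30-day-daily retention tiers from the
#  runbook are enforced by tag-filtered lifecycle rules on the destination
#  bucket; nothing in this script creates those rules.)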
mc tag set "$REMOTE_PATH" "kind=hourly&host=${HOST}&date=${DATE_UTC}" >/dev/null

mc alias remove "$MC_ALIAS" >/dev/null

echo "[$(date -u +%FT%TZ)] OK ${ARCHIVE_NAME} ($(du -h "$GZ_FILE" | cut -f1))"
diff --git a/scripts/backup/restore.sh b/scripts/backup/restore.sh
new file mode 100644
index 0000000..0c0c80b
--- /dev/null
+++ b/scripts/backup/restore.sh
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
# Cold-restore script for Port Nimara CRM.
#
# Two modes:
#   --drill       Restore to a sandbox DB ($DRILL_DATABASE_URL) and diff
#                 row counts against the live DB. Used by the weekly cron
#                 drill so the runbook stays accurate.
#   (no --drill)  Interactive production restore. Prompts for explicit
#                 confirmation, then terminates connections, drops, and
#                 recreates the target database. DB only — MinIO comes
#                 back separately via `mc mirror` (see the runbook).
#
# Common args:
#   --snapshot YYYY-MM-DD/HH   Specific dump to restore. Defaults to "latest".

set -euo pipefail

DRILL=0
SNAPSHOT="latest"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --drill) DRILL=1; shift ;;
    --snapshot) SNAPSHOT="$2"; shift 2 ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
done

: "${BACKUP_S3_BUCKET:?BACKUP_S3_BUCKET not set}"
: "${BACKUP_S3_ENDPOINT:?BACKUP_S3_ENDPOINT not set}"
: "${BACKUP_S3_ACCESS_KEY:?BACKUP_S3_ACCESS_KEY not set}"
: "${BACKUP_S3_SECRET_KEY:?BACKUP_S3_SECRET_KEY not set}"

if [[ "$DRILL" -eq 1 ]]; then
  : "${DRILL_DATABASE_URL:?DRILL_DATABASE_URL not set}"
  TARGET_DB="$DRILL_DATABASE_URL"
  echo "[drill] target DB = $TARGET_DB"
else
  : "${DATABASE_URL:?DATABASE_URL not set}"
  TARGET_DB="$DATABASE_URL"
  read -rp "About to overwrite $TARGET_DB. Type 'restore' to continue: " confirm
  [[ "$confirm" == "restore" ]] || { echo "aborted"; exit 1; }
fi

HOST="${BACKUP_HOST_OVERRIDE:-$(hostname -s)}"
WORKDIR="$(mktemp -d)"
trap 'rm -rf "$WORKDIR"' EXIT

MC_ALIAS="bk-$$"
mc alias set "$MC_ALIAS" "$BACKUP_S3_ENDPOINT" \
  "$BACKUP_S3_ACCESS_KEY" "$BACKUP_S3_SECRET_KEY" --api S3v4 >/dev/null
trap 'rm -rf "$WORKDIR"; mc alias remove "$MC_ALIAS" 2>/dev/null || true' EXIT

# Resolve the snapshot path.
if [[ "$SNAPSHOT" == "latest" ]]; then
  REMOTE=$(mc ls --recursive "${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/" \
    | awk '{print $NF}' | sort | tail -1)
  if [[ -z "$REMOTE" ]]; then
    echo "no snapshots found under ${BACKUP_S3_BUCKET}/pg/${HOST}/" >&2
    exit 1
  fi
  REMOTE="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${REMOTE}"
else
  REMOTE="${MC_ALIAS}/${BACKUP_S3_BUCKET}/pg/${HOST}/${SNAPSHOT}.dump.gz"
  # If GPG was used, the file lives at .dump.gz.gpg. Try both.
  if ! mc stat "$REMOTE" >/dev/null 2>&1; then
    REMOTE="${REMOTE}.gpg"
  fi
fi

echo "[$(date -u +%FT%TZ)] Pulling $REMOTE"
LOCAL="$WORKDIR/$(basename "$REMOTE")"
mc cp --quiet "$REMOTE" "$LOCAL"

# Decrypt if needed.
if [[ "$LOCAL" == *.gpg ]]; then
  echo "[$(date -u +%FT%TZ)] Decrypting"
  gpg --batch --yes --decrypt --output "${LOCAL%.gpg}" "$LOCAL"
  rm "$LOCAL"
  LOCAL="${LOCAL%.gpg}"
fi

# Decompress.
gunzip "$LOCAL"
LOCAL="${LOCAL%.gz}"

echo "[$(date -u +%FT%TZ)] Restoring into $TARGET_DB"

# Drop & recreate to guarantee no half-state from a prior run.
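# Illustrative: TARGET_DB = postgresql://crm:pw@db:5432/port_nimara_crm
#   -> DB_NAME   = port_nimara_crm
#   -> ADMIN_URL = postgresql://crm:pw@db:5432/postgres
# The DROP below must run on a connection to a *different* database.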
DB_NAME=$(echo "$TARGET_DB" | sed -E 's|.*/([^?]+).*|\1|')
ADMIN_URL=$(echo "$TARGET_DB" | sed -E "s|/${DB_NAME}|/postgres|")

psql "$ADMIN_URL" -v ON_ERROR_STOP=1 <<SQL
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE datname = '${DB_NAME}' AND pid <> pg_backend_pid();
DROP DATABASE IF EXISTS "${DB_NAME}";
CREATE DATABASE "${DB_NAME}";
SQL

pg_restore --no-owner --no-privileges --dbname "$TARGET_DB" "$LOCAL"

# Drill mode: compare row counts vs the live producer for parity.
if [[ "$DRILL" -eq 1 ]]; then
  # Under `set -u`, expanding an unset DATABASE_URL inside the fallback
  # would abort the script, so resolve the comparison URL explicitly.
  LIVE_DB="${LIVE_DATABASE_URL:-${DATABASE_URL:-}}"
  if [[ -z "$LIVE_DB" ]]; then
    echo "drill: set LIVE_DATABASE_URL (or DATABASE_URL) for the row-count diff" >&2
    exit 2
  fi
  echo "[$(date -u +%FT%TZ)] Drill row-count diff (live vs restored):"
  TABLES=$(psql -At "$TARGET_DB" -c \
    "SELECT tablename FROM pg_tables WHERE schemaname='public' ORDER BY tablename;")
  diff_count=0
  while IFS= read -r tbl; do
    [[ -z "$tbl" ]] && continue
    live=$(psql -At "$LIVE_DB" -c "SELECT count(*) FROM \"$tbl\";")
    restored=$(psql -At "$TARGET_DB" -c "SELECT count(*) FROM \"$tbl\";")
    delta=$((live - restored))
    if [[ "$delta" -ne 0 ]]; then
      echo "  ⚠ $tbl: live=$live restored=$restored delta=$delta"
      diff_count=$((diff_count + 1))
    fi
  done <<< "$TABLES"
  if [[ "$diff_count" -eq 0 ]]; then
    echo "  ✓ row counts match across all tables"
  fi
fi

echo "[$(date -u +%FT%TZ)] Restore complete."
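# Usage (illustrative invocations):
#   ./restore.sh                             # interactive prod restore, latest snapshot
#   ./restore.sh --snapshot 2026-04-28/14    # restore a specific hour
#   ./restore.sh --drill                     # weekly cron drill against $DRILL_DATABASE_URL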