fix(ops): /health DB+Redis checks, validated env.REDIS_URL across workers, error_events 90d retention

Three audit-pass-#3 findings, all in the "wakes you at 3am" category.

- /api/public/health now runs DB SELECT 1 + Redis PING in parallel and
  returns 503 + a degraded payload when either fails. Anonymous probes
  (no X-Intake-Secret) still get a flat {status:'ok'} so generic uptime
  monitors keep working; authenticated probes see the dep results.
- All worker entrypoints (ai, bulk, documents, email, export, import,
  maintenance, notifications, reports, webhooks) and src/lib/redis.ts
  now use env.REDIS_URL (Zod-validated at boot) instead of
  process.env.REDIS_URL!. Previously a missing REDIS_URL let the app
  start without complaint and fail only at first job pickup.
- maintenance worker gains an `error-events-retention` case that
  delete()s rows older than 90 days from error_events. scheduler.ts
  registers it at 06:00 daily. Fulfills the contract from migration
  0040, which declared the table "pruned at 90 days" but had no
  implementation.
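
For reviewers, a minimal sketch of the /health behaviour described in the
first bullet. It is not the code in this commit: the type names, the
`runProbes` and `buildHealthResponse` helpers, and the payload field names
are hypothetical. It also assumes anonymous callers always receive the flat
200 body; the message above does not say whether they see the 503 as well.

```typescript
// Hypothetical shapes; the real handler and payload fields are not shown here.
type DepName = 'db' | 'redis';
type DepResult = { ok: boolean; error?: string };
type Probe = () => Promise<unknown>;

interface HealthResponse {
  status: number;
  body: { status: 'ok' | 'degraded'; deps?: Record<DepName, DepResult> };
}

// Run both probes concurrently. Promise.allSettled never rejects, so one
// failing dependency cannot hide the result of the other.
async function runProbes(
  probes: Record<DepName, Probe>,
): Promise<Record<DepName, DepResult>> {
  // The async wrapper turns a synchronous throw into a rejection.
  const settle = async (probe: Probe): Promise<unknown> => probe();
  const [db, redis] = await Promise.allSettled([
    settle(probes.db),
    settle(probes.redis),
  ]);
  const toResult = (r: PromiseSettledResult<unknown>): DepResult =>
    r.status === 'fulfilled'
      ? { ok: true }
      : { ok: false, error: String(r.reason) };
  return { db: toResult(db), redis: toResult(redis) };
}

// Map probe results to an HTTP status and body.
// Assumption: unauthenticated callers always get the flat 200 payload.
function buildHealthResponse(
  deps: Record<DepName, DepResult>,
  authenticated: boolean,
): HealthResponse {
  if (!authenticated) {
    return { status: 200, body: { status: 'ok' } };
  }
  const allOk = deps.db.ok && deps.redis.ok;
  return {
    status: allOk ? 200 : 503,
    body: { status: allOk ? 'ok' : 'degraded', deps },
  };
}
```

In a real handler the `db` probe would issue SELECT 1 and the `redis` probe
would issue PING; both are injected here so the sketch runs without either
service.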

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matt Ciaccio
2026-05-06 14:59:07 +02:00
parent 64f0e0a1b8
commit f93de75bb5
13 changed files with 110 additions and 41 deletions

@@ -57,6 +57,8 @@ export async function registerRecurringJobs(): Promise<void> {
{ queue: 'maintenance', name: 'gdpr-export-cleanup', pattern: '0 4 * * *' },
// Phase 3b: AI usage ledger retention (90-day rolling window)
{ queue: 'maintenance', name: 'ai-usage-retention', pattern: '0 5 * * *' },
// Migration 0040 contract: error_events older than 90 days get pruned.
{ queue: 'maintenance', name: 'error-events-retention', pattern: '0 6 * * *' },
];
for (const job of recurring) {