feat(errors): platform-wide request ids + error codes + admin inspector

End-to-end error-handling overhaul. A user hitting any failure now sees a
plain-text message + stable error code + reference id. A super admin can
paste the id into /admin/errors/<id> for the full request shape, sanitized
body, error stack, and a heuristic likely-cause hint.

REQUEST CONTEXT (AsyncLocalStorage)
- src/lib/request-context.ts mints a per-request frame carrying requestId +
  portId + userId + method + path + start timestamp.
- withAuth wraps every authenticated handler in runWithRequestContext and
  accepts an upstream X-Request-Id header (validated shape) or generates a
  fresh UUID. The id ALWAYS leaves on the X-Request-Id response header,
  including early-return 401/403/4xx paths.
- Pino logger reads from the same context via mixin: every log line emitted
  during the request automatically carries the ids with no per-call
  threading.

ERROR CODE REGISTRY
- src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status
  + plain-text user-facing message (no jargon, written for the rep on the
  phone with a customer).
- New CodedError class wraps a registered code + optional internalMessage
  (admin-only, never sent to client).
- Existing AppError subclasses got plain-text default rewrites so legacy
  throw sites improve immediately without migration.
- High-impact services migrated to specific codes: expenses
  (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths
  (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY /
  PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender
  (INTEREST_PORT_MISMATCH).

ERROR ENVELOPE
- errorResponse always sets X-Request-Id header + requestId field.
- 5xx responses include a "Quote error ID …" friendly line.
- 4xx kept clean (validation, permission, not-found don't pollute the
  inspector; they're already in audit log).
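The REQUEST CONTEXT mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/lib/request-context.ts: the frame's field types, the UUID-shaped validation rule, and the `resolveRequestId` and `logMixin` helper names are assumptions. `runWithRequestContext` and `getRequestId` are names taken from the commit, with guessed signatures.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Hypothetical frame shape: the commit lists these fields but not their types.
interface RequestContext {
  requestId: string;
  method: string;
  path: string;
  startedAt: number;
  portId?: string;
  userId?: string;
}

const storage = new AsyncLocalStorage<RequestContext>();

// Assumed validation rule: only UUID-shaped upstream ids are accepted.
const UUID_SHAPE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Reuse a well-formed upstream X-Request-Id, otherwise mint a fresh UUID.
export function resolveRequestId(upstream: string | null | undefined): string {
  return upstream && UUID_SHAPE.test(upstream) ? upstream : randomUUID();
}

// Everything called (sync or async) from `fn` sees `ctx` via getStore().
export function runWithRequestContext<T>(ctx: RequestContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Returns undefined when called outside any request frame.
export function getRequestId(): string | undefined {
  return storage.getStore()?.requestId;
}

// What a logger mixin could return: fields merged into every log line.
export function logMixin(): Record<string, unknown> {
  const ctx = storage.getStore();
  return ctx ? { requestId: ctx.requestId, portId: ctx.portId, userId: ctx.userId } : {};
}
```

Because the store travels with the async call chain, a handler several awaits deep can call `getRequestId()` without the id being passed as an argument, which is what lets `errorResponse` attach it with no per-call threading.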

PERSISTENCE (error_events table, migration 0040)
- One row per 5xx, keyed on requestId, with method/path/status/error
  name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap;
  password/token/secret/etc keys redacted)/duration/IP/UA/metadata.
- captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the
  classifier can recognize FK / unique / NOT NULL / schema-drift
  violations.
- Failure to persist is logged-not-thrown.

LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts)
- 4-pass heuristic (first match wins):
  1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703
     schema drift, 53300 connection limit, …)
  2. Error class name (AbortError, TimeoutError, FetchError, ZodError)
  3. Stack-path patterns (/lib/storage/, /lib/email/, documenso,
     openai|claude, /queue/workers/)
  4. Free-text message keywords (econnrefused, rate limit, timeout,
     unauthorized|invalid api key)
- Returns { label, hint, subsystem } for the inspector badge.

CLIENT SIDE
- apiFetch throws structured ApiError with message + code + requestId +
  details + retryAfter.
- toastError() helper renders the standard 3-line toast: plain message /
  Error code: X / Reference ID: Y [Copy ID].

ADMIN INSPECTOR
- /<port>/admin/errors lists captured 5xx with status badge + path +
  likely-culprit badge + truncated message + reference id. Filter by
  status code; auto-refresh via TanStack Query.
- /<port>/admin/errors/<requestId> deep-dive: request shape, full error
  name+message+stack, sanitized body excerpt, raw metadata,
  registered-code lookup (so admin can compare to what user saw),
  likely-culprit hint with subsystem tag.
- /<port>/admin/errors/codes is the in-app code reference page: every
  registered code grouped by domain prefix, searchable, with HTTP status +
  user message inline. Linked from inspector header so admins can flip to
  it while triaging.
- Permission: admin.view_audit_log. Super admins see all ports; regular
  admins port-scoped.
- system-monitoring dashboard now surfaces error_events alongside
  permission_denied audit + queue failed jobs (RecentError gains
  source: 'request' variant).

DOCS
- docs/error-handling.md walks through coded errors, plain-text message
  guidelines, client toasting, admin inspector usage, persistence rules,
  classifier internals, pruning, and the legacy → CodedError migration
  path.

MIGRATION SAFETY
- Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal
  order against an empty DB. 0040 references ports(id) which exists from
  0000. 0035/0038 don't deadlock under sequential psql -f. Removed
  redundant idx_ds_sent_by from 0038 (created in 0037).

Tests: 1168/1168 vitest passing. tsc clean.
- security-error-responses tests updated for plain-text messages + new
  optional response keys (code/requestId/message).
- berth-pdf-versions tests assert stable error codes via
  toMatchObject({ code }) rather than message regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
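The 4-pass, first-match-wins classifier described in the commit message could look something like the sketch below. It is an illustration under assumptions, not the code in src/lib/error-classifier.ts: the `ErrorFacts` input shape, the table names, and every label, hint, and subsystem string are invented for the example, and only a subset of the listed patterns is included. The SQLSTATE meanings are the standard PostgreSQL ones.

```typescript
interface Culprit { label: string; hint: string; subsystem: string }

// Hypothetical input: the facts a captured error event might carry.
interface ErrorFacts { sqlState?: string; name?: string; stack?: string; message?: string }

const db = (label: string, hint: string): Culprit => ({ label, hint, subsystem: 'database' });

// Pass 1: Postgres SQLSTATE codes.
const BY_SQLSTATE = new Map<string, Culprit>([
  ['23503', db('Foreign key violation', 'A referenced row is missing or still in use.')],
  ['23505', db('Unique violation', 'A row with the same key already exists.')],
  ['42703', db('Undefined column', 'Schema drift: code expects a column the DB lacks.')],
  ['53300', db('Too many connections', 'The connection limit was reached.')],
]);

// Pass 2: error class names.
const BY_NAME = new Map<string, Culprit>([
  ['TimeoutError', { label: 'Timeout', hint: 'An upstream call took too long.', subsystem: 'network' }],
  ['ZodError', { label: 'Schema mismatch', hint: 'Data failed validation.', subsystem: 'validation' }],
]);

// Pass 3: stack-path patterns.
const BY_STACK: Array<[RegExp, Culprit]> = [
  [/\/lib\/storage\//, { label: 'Storage', hint: 'Failure inside the storage layer.', subsystem: 'storage' }],
  [/\/lib\/email\//, { label: 'Email', hint: 'Failure while sending email.', subsystem: 'email' }],
];

// Pass 4: free-text message keywords (case-insensitive).
const BY_MESSAGE: Array<[RegExp, Culprit]> = [
  [/econnrefused/i, { label: 'Connection refused', hint: 'A dependency is down.', subsystem: 'network' }],
  [/rate limit/i, { label: 'Rate limited', hint: 'A provider is throttling us.', subsystem: 'external-api' }],
];

function firstPattern(table: Array<[RegExp, Culprit]>, text?: string): Culprit | undefined {
  if (!text) return undefined;
  return table.find(([re]) => re.test(text))?.[1];
}

// Passes run in priority order; the first one that yields a result wins.
const PASSES: Array<(f: ErrorFacts) => Culprit | undefined> = [
  (f) => (f.sqlState ? BY_SQLSTATE.get(f.sqlState) : undefined),
  (f) => (f.name ? BY_NAME.get(f.name) : undefined),
  (f) => firstPattern(BY_STACK, f.stack),
  (f) => firstPattern(BY_MESSAGE, f.message),
];

export function classify(facts: ErrorFacts): Culprit | undefined {
  for (const pass of PASSES) {
    const hit = pass(facts);
    if (hit) return hit;
  }
  return undefined;
}
```

Ordering the passes from most to least specific means a precise signal such as a SQLSTATE is never overridden by a vague keyword like "timeout" appearing in the same error's message.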
2026-05-05 14:12:59 +02:00
parent c4a41d5f5b
commit 4723994bdc
26 changed files with 2027 additions and 169 deletions
--- a/src/lib/errors.ts
+++ b/src/lib/errors.ts
@@ -2,6 +2,9 @@ import { NextResponse } from 'next/server';
 import { ZodError } from 'zod';

 import { logger } from '@/lib/logger';
+import { getRequestId } from '@/lib/request-context';
+import { captureErrorEvent } from '@/lib/services/error-events.service';
+import { ERROR_CODES, type ErrorCode } from '@/lib/error-codes';

 export class AppError extends Error {
  constructor(
@@ -14,20 +17,63 @@ export class AppError extends Error {
  }
 }

+/**
+ * Throw site for any registered error code. Consolidates the
+ * status + plain-text message + stable code into one constructor.
+ *
+ *   throw new CodedError('EXPENSES_RECEIPT_REQUIRED');
+ *
+ * Pass `details` for structured payload (e.g. zod validation issues),
+ * or `internalMessage` for an admin-only string that lands in the
+ * error_events row but is NEVER returned to the user (the user gets
+ * the plain-text message from the registry).
+ */
+export class CodedError extends AppError {
+  /** Optional structured details surfaced to the client. */
+  public details?: unknown;
+  /** Optional verbose message for admin logs only — never sent to client. */
+  public internalMessage?: string;
+
+  constructor(code: ErrorCode, opts: { details?: unknown; internalMessage?: string } = {}) {
+    const def = ERROR_CODES[code];
+    super(def.status, def.userMessage, code);
+    this.name = 'CodedError';
+    this.details = opts.details;
+    this.internalMessage = opts.internalMessage;
+  }
+}
+
+/**
+ * Backwards-compat shims: these existing subclasses are still used in
+ * lots of places; new throw sites should prefer `CodedError` so the
+ * code surfaces in the registry.
+ *
+ * Messages have been rewritten to plain language (no internal jargon)
+ * so the user-facing toast reads naturally even before a service is
+ * migrated to a specific CodedError code.
+ */
 export class NotFoundError extends AppError {
  constructor(entity: string) {
-    super(404, `${entity} not found`, 'NOT_FOUND');
+    // Plain-text version of "X not found" — the registered code stays
+    // generic until callers migrate to specific codes per entity.
+    super(
+      404,
+      `We couldn't find that ${entity.toLowerCase()}. It may have been removed.`,
+      'NOT_FOUND',
+    );
  }
 }

 export class ForbiddenError extends AppError {
-  constructor(message = 'Insufficient permissions') {
+  constructor(
+    message = "You don't have permission to do that. Ask an admin if you think you should.",
+  ) {
    super(403, message, 'FORBIDDEN');
  }
 }

 export class UnauthorizedError extends AppError {
-  constructor(message = 'Unauthorized') {
+  constructor(message = 'Please sign in to continue.') {
    super(401, message, 'UNAUTHORIZED');
  }
 }
@@ -49,44 +95,84 @@ export class ConflictError extends AppError {

 export class RateLimitError extends AppError {
  constructor(public retryAfter: number) {
-    super(429, 'Too many requests', 'RATE_LIMITED');
+    super(
+      429,
+      "You've done that a lot in a short time. Please wait a moment and try again.",
+      'RATE_LIMITED',
+    );
  }
 }

 /**
 * Converts any thrown value into a sanitised NextResponse.
- * Never leaks stack traces, internal paths, or database error details to the client.
+ *
+ * Always attaches the active `X-Request-Id` to:
+ *   - the response header (so a curl/dev-tools user can see it)
+ *   - the JSON body (so a UI toast can surface "Error ID: …")
+ *
+ * For unhandled (5xx) errors, also persists a row to `error_events`
+ * so a super admin can paste the request id into the inspector and
+ * pull the full stack + body excerpt + log lines.
+ *
+ * Never leaks stack traces, internal paths, or DB error details to
+ * the client — that data goes to pino + the error_events row only.
 */
 export function errorResponse(error: unknown): NextResponse {
+  const requestId = getRequestId();
+  const headers = requestId ? { 'X-Request-Id': requestId } : undefined;
+
  if (error instanceof AppError) {
    const body: Record<string, unknown> = {
      error: error.message,
      code: error.code,
    };
+    if (requestId) body.requestId = requestId;
    if (error instanceof ValidationError && error.details) {
      body.details = error.details;
    }
+    if (error instanceof CodedError && error.details !== undefined) {
+      body.details = error.details;
+    }
    if (error instanceof RateLimitError) {
      body.retryAfter = error.retryAfter;
    }
-    return NextResponse.json(body, { status: error.statusCode });
+    // 4xx errors are user-action mistakes (validation, not-found,
+    // permission). They DON'T go to error_events; that table is for
+    // platform faults the super admin needs to triage. Only AppErrors
+    // with a 5xx status are persisted here. For a CodedError, the
+    // admin-only internalMessage is stored in the row's metadata and
+    // is never included in the response body.
+    if (error.statusCode >= 500) {
+      void captureErrorEvent({
+        statusCode: error.statusCode,
+        error,
+        metadata: error instanceof CodedError ? { internalMessage: error.internalMessage } : {},
+      });
+    }
+    return NextResponse.json(body, { status: error.statusCode, headers });
  }

  if (error instanceof ZodError) {
-    return NextResponse.json(
-      {
-        error: 'Validation failed',
-        code: 'VALIDATION_ERROR',
-        details: error.errors.map((e) => ({
-          field: e.path.join('.'),
-          message: e.message,
-        })),
-      },
-      { status: 400 },
-    );
+    const body: Record<string, unknown> = {
+      error: 'Validation failed',
+      code: 'VALIDATION_ERROR',
+      details: error.errors.map((e) => ({
+        field: e.path.join('.'),
+        message: e.message,
+      })),
+    };
+    if (requestId) body.requestId = requestId;
+    return NextResponse.json(body, { status: 400, headers });
  }

-  // Log full details server-side; never send them to the client.
+  // Unhandled — full details to pino + persist to error_events.
  logger.error({ err: error }, 'Unhandled error');
-  return NextResponse.json({ error: 'Internal server error' }, { status: 500 });
+  void captureErrorEvent({ statusCode: 500, error });
+
+  const body: Record<string, unknown> = { error: 'Internal server error', code: 'INTERNAL' };
+  if (requestId) {
+    body.requestId = requestId;
+    body.message = `Something went wrong on our end. Quote error ID ${requestId} when reporting this.`;
+  }
+  return NextResponse.json(body, { status: 500, headers });
 }
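On the client, the envelope produced by `errorResponse` could be consumed along the lines below. This is a hedged sketch only: `ErrorEnvelope`, this `ApiError` shape, and `toastLines` are hypothetical stand-ins for the `apiFetch` / `toastError()` pieces named in the commit message, whose real code is not part of this diff. Preferring the friendly `message` field over `error` on 5xx responses is also an assumption.

```typescript
// Field names mirror the JSON keys errorResponse writes; all but `error` are optional.
interface ErrorEnvelope {
  error: string;
  code?: string;
  requestId?: string;
  message?: string;
  details?: unknown;
  retryAfter?: number;
}

// Hypothetical structured error carrying everything the toast needs.
class ApiError extends Error {
  readonly status: number;
  readonly code: string | undefined;
  readonly requestId: string | undefined;
  readonly details: unknown;
  readonly retryAfter: number | undefined;

  constructor(status: number, envelope: ErrorEnvelope) {
    // Assumption: prefer the friendly 5xx `message` when the server sent one.
    super(envelope.message ?? envelope.error);
    this.name = 'ApiError';
    this.status = status;
    this.code = envelope.code;
    this.requestId = envelope.requestId;
    this.details = envelope.details;
    this.retryAfter = envelope.retryAfter;
  }
}

// Builds the toast text: plain message, then code and reference id when present.
function toastLines(err: ApiError): string[] {
  const lines = [err.message];
  if (err.code) lines.push(`Error code: ${err.code}`);
  if (err.requestId) lines.push(`Reference ID: ${err.requestId}`);
  return lines;
}
```

Keeping the code and reference id as separate fields, rather than parsing them out of the message string, lets a UI offer a copy button for the id and lets tests assert on `code` instead of message wording.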