Matt Ciaccio 4723994bdc feat(errors): platform-wide request ids + error codes + admin inspector

End-to-end error-handling overhaul. A user hitting any failure now sees
a plain-text message + stable error code + reference id. A super admin
can paste the id into /admin/errors/<id> for the full request shape,
sanitized body, error stack, and a heuristic likely-cause hint.

REQUEST CONTEXT (AsyncLocalStorage)
- src/lib/request-context.ts mints a per-request frame carrying
  requestId + portId + userId + method + path + start timestamp.
- withAuth wraps every authenticated handler in runWithRequestContext
  and accepts an upstream X-Request-Id header (validated shape) or
  generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id
  response header, including early-return 401/403/4xx paths.
- Pino logger reads from the same context via mixin — every log
  line emitted during the request automatically carries the ids
  with no per-call threading.

ERROR CODE REGISTRY
- src/lib/error-codes.ts defines stable DOMAIN_REASON codes with
  HTTP status + plain-text user-facing message (no jargon, written
  for the rep on the phone with a customer).
- New CodedError class wraps a registered code + optional
  internalMessage (admin-only — never sent to client).
- Existing AppError subclasses got plain-text default rewrites so
  legacy throw sites improve immediately without migration.
- High-impact services migrated to specific codes:
  expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths
  (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY /
  PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender
  (INTEREST_PORT_MISMATCH).

ERROR ENVELOPE
- errorResponse always sets X-Request-Id header + requestId field.
- 5xx responses include a "Quote error ID …" friendly line.
- 4xx kept clean (validation, permission, not-found don't pollute
  the inspector — they're already in audit log).

PERSISTENCE (error_events table, migration 0040)
- One row per 5xx, keyed on requestId, with method/path/status/error
  name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap;
  password/token/secret/etc keys redacted)/duration/IP/UA/metadata.
- captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code
  so the classifier can recognize FK / unique / NOT NULL / schema-
  drift violations.
- Failure to persist is logged-not-thrown.

LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts)
- 4-pass heuristic (first match wins):
  1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique,
     42703 schema drift, 53300 connection limit, …)
  2. Error class name (AbortError, TimeoutError, FetchError,
     ZodError)
  3. Stack-path patterns (/lib/storage/, /lib/email/, documenso,
     openai|claude, /queue/workers/)
  4. Free-text message keywords (econnrefused, rate limit, timeout,
     unauthorized|invalid api key)
- Returns { label, hint, subsystem } for the inspector badge.

CLIENT SIDE
- apiFetch throws structured ApiError with message + code + requestId
  + details + retryAfter.
- toastError() helper renders the standard 3-line toast:
  plain message / Error code: X / Reference ID: Y [Copy ID].

ADMIN INSPECTOR
- /<port>/admin/errors lists captured 5xx with status badge + path +
  likely-culprit badge + truncated message + reference id. Filter by
  status code; auto-refresh via TanStack Query.
- /<port>/admin/errors/<requestId> deep-dive: request shape, full
  error name+message+stack, sanitized body excerpt, raw metadata,
  registered-code lookup (so admin can compare to what user saw),
  likely-culprit hint with subsystem tag.
- /<port>/admin/errors/codes is the in-app code reference page —
  every registered code grouped by domain prefix, searchable, with
  HTTP status + user message inline. Linked from inspector header
  so admins can flip to it while triaging.
- Permission: admin.view_audit_log. Super admins see all ports;
  regular admins port-scoped.
- system-monitoring dashboard now surfaces error_events alongside
  permission_denied audit + queue failed jobs (RecentError gains
  source: 'request' variant).

DOCS
- docs/error-handling.md walks through coded errors, plain-text
  message guidelines, client toasting, admin inspector usage,
  persistence rules, classifier internals, pruning, and the
  legacy → CodedError migration path.

MIGRATION SAFETY
- Audit confirmed all 41 migrations (0000-0040) apply cleanly in
  journal order against an empty DB. 0040 references ports(id)
  which exists from 0000. 0035/0038 don't deadlock under sequential
  psql -f. Removed redundant idx_ds_sent_by from 0038 (created in
  0037).

Tests: 1168/1168 vitest passing. tsc clean.
- security-error-responses tests updated for plain-text messages
  + new optional response keys (code/requestId/message).
- berth-pdf-versions tests assert stable error codes via
  toMatchObject({ code }) rather than message regex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-05 14:12:59 +02:00

Error handling

Overview

Every authenticated request runs inside an AsyncLocalStorage frame that carries a requestId (UUID) plus the resolved portId / userId / HTTP method / path / start time. The id surfaces:

as X-Request-Id on every response header (success or failure)
inside every pino log line emitted during the request
in the JSON error body returned to the client (requestId field)
as the primary key of the error_events row written when a 5xx fires

A user who hits a failure can copy the Reference ID from the toast, and a super admin can paste it into /<port>/admin/errors/<requestId> to see the full request context, sanitized body, error stack, and a heuristic "likely culprit" hint.
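
As a rough illustration of the request frame described above, here is a minimal sketch built on Node's AsyncLocalStorage. Only runWithRequestContext is a name taken from this project; resolveRequestId, getRequestContext, the RequestContext field names, and the UUID shape check are assumptions made for the example and may not match src/lib/request-context.ts.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Assumed frame shape, mirroring the fields listed in the overview.
interface RequestContext {
  requestId: string;
  portId?: string;
  userId?: string;
  method: string;
  path: string;
  startedAt: number;
}

const storage = new AsyncLocalStorage<RequestContext>();

// Assumption: an upstream X-Request-Id is only trusted when it looks like a UUID.
const UUID_SHAPE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper: reuse a well-formed upstream id, otherwise mint a new one.
function resolveRequestId(upstream: string | null | undefined): string {
  return upstream && UUID_SHAPE.test(upstream) ? upstream : randomUUID();
}

// Runs fn inside a frame; anything called from fn (sync or async) can read it.
function runWithRequestContext<T>(ctx: RequestContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Hypothetical accessor; returns undefined outside of a request frame.
function getRequestContext(): RequestContext | undefined {
  return storage.getStore();
}
```

A logger or error handler can call getRequestContext() at any depth of the call stack, which is what removes the need to pass the id through every function signature.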

Throwing errors from a service

Use CodedError with a registered code:

import { CodedError } from '@/lib/errors';

if (!hasReceipts && !ack) {
  throw new CodedError('EXPENSES_RECEIPT_REQUIRED');
}

The code drives:

the HTTP status (defined in src/lib/error-codes.ts)
the plain-text user-facing message (no jargon — written for the rep on the phone with a customer)
the stable identifier the user can quote to support

For more verbose internal context — admin-only — use internalMessage:

throw new CodedError('CROSS_PORT_LINK_REJECTED', {
  internalMessage: `interest ${a.id} (port ${a.portId}) ↔ berth ${b.id} (port ${b.portId})`,
});

The internalMessage lands in the error_events row and the admin inspector but never reaches the client.
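
To make the split between client-visible and admin-only fields concrete, here is a minimal sketch of how a coded error might be shaped. The registry entries (statuses and wording) are placeholders invented for the example, and toClientBody is a hypothetical helper; the real definitions live in src/lib/error-codes.ts and src/lib/errors and may differ.

```typescript
// Placeholder registry for illustration only; statuses and messages are assumed.
const ERROR_CODES = {
  EXPENSES_RECEIPT_REQUIRED: { status: 400, userMessage: 'Please attach a receipt first.' },
  CROSS_PORT_LINK_REJECTED: { status: 409, userMessage: 'Those records belong to different ports.' },
} as const;

type ErrorCode = keyof typeof ERROR_CODES;

class CodedError extends Error {
  readonly code: ErrorCode;
  readonly status: number;
  // Admin-only detail; deliberately kept out of the client body below.
  readonly internalMessage: string | undefined;

  constructor(code: ErrorCode, opts: { internalMessage?: string } = {}) {
    super(ERROR_CODES[code].userMessage); // user-facing text comes from the registry
    this.name = 'CodedError';
    this.code = code;
    this.status = ERROR_CODES[code].status;
    this.internalMessage = opts.internalMessage;
  }
}

// Hypothetical serializer: copies only the fields that are safe for the client.
function toClientBody(err: CodedError, requestId: string) {
  return { message: err.message, code: err.code, requestId };
}
```

Building the client body from an explicit allow-list, rather than spreading the error object, is one way to ensure internalMessage cannot leak by accident.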

Adding a new error code

Open src/lib/error-codes.ts.

Add an entry to the ERROR_CODES map. Convention: DOMAIN_REASON in SCREAMING_SNAKE_CASE.

FOO_INVALID_BAR: {
  status: 400,
  userMessage: 'That bar value is no good. Please try another.',
},

Use it: throw new CodedError('FOO_INVALID_BAR').
The code, status, and message are now contractually stable — never rename a code once it has shipped. Documentation, UI, and external integrations may pin to it.
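
Because codes become a contract once shipped, it can help to check the naming convention mechanically. The following is a sketch of such a guard, suitable for a unit test; findRegistryProblems, ErrorCodeEntry, and CODE_SHAPE are hypothetical names, and the rule that every status is 4xx or 5xx is an assumption rather than something the registry is known to enforce.

```typescript
interface ErrorCodeEntry {
  status: number;
  userMessage: string;
}

// DOMAIN_REASON in SCREAMING_SNAKE_CASE: at least two underscore-separated parts.
const CODE_SHAPE = /^[A-Z][A-Z0-9]*(_[A-Z0-9]+)+$/;

// Returns one human-readable problem per violation; an empty array means "all good".
function findRegistryProblems(registry: Record<string, ErrorCodeEntry>): string[] {
  const problems: string[] = [];
  for (const [code, entry] of Object.entries(registry)) {
    if (!CODE_SHAPE.test(code)) {
      problems.push(`${code}: not DOMAIN_REASON in SCREAMING_SNAKE_CASE`);
    }
    if (!Number.isInteger(entry.status) || entry.status < 400 || entry.status > 599) {
      problems.push(`${code}: status ${entry.status} is not a 4xx/5xx`);
    }
    if (entry.userMessage.trim() === '') {
      problems.push(`${code}: empty userMessage`);
    }
  }
  return problems;
}
```

A test could then assert that findRegistryProblems(ERROR_CODES) returns an empty array, so a malformed code fails CI before it ships.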

Plain-text message guidelines

User-facing messages should:

Avoid internal jargon (no "constraint violation", "FK", "row lock").
Be written for a rep on the phone with a customer.
Include the suggested next action when natural ("Ask an admin if you think you should").
Not include any technical detail that doesn't help the user — the request id + error code carry that.

Verbose technical detail belongs in internalMessage (admin-only).

Client side

In a useMutation, render errors with the shared helper:

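import { useMutation } from '@tanstack/react-query';
import { toastError } from '@/lib/api/toast-error';
// apiFetch is the project's fetch wrapper; import it from wherever the codebase exports it.

const mutation = useMutation({
  // Replace the empty body with the real request payload.
  mutationFn: () => apiFetch('/api/v1/foo', { method: 'POST', body: {} }),
  onSuccess: () => {
    // Invalidate queries, close the dialog, etc.
  },
  // Renders the plain message, error code, and reference id.
  onError: (err) => toastError(err),
});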

The toast renders three lines:

{plain-text message}

Error code: EXPENSES_RECEIPT_REQUIRED
Reference ID: 8f3c-ab12-…   [Copy ID]

The "Copy ID" action puts the request id on the clipboard so the user can paste it into a support ticket.
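
The three-line layout can be sketched as a pure function over the error object. This is not the project's toastError; toastLines and ApiErrorLike are hypothetical names, and the fallback wording for unrecognized errors is an assumption.

```typescript
// Assumed subset of the structured ApiError fields that the toast needs.
interface ApiErrorLike {
  message: string;
  code?: unknown;
  requestId?: unknown;
}

function isApiErrorLike(err: unknown): err is ApiErrorLike {
  return (
    typeof err === 'object' &&
    err !== null &&
    typeof (err as { message?: unknown }).message === 'string'
  );
}

// Builds the text lines of the toast; the "Copy ID" button is UI and is left out here.
function toastLines(err: unknown): string[] {
  if (!isApiErrorLike(err)) {
    return ['Something went wrong. Please try again.']; // assumed fallback text
  }
  const lines = [err.message];
  if (typeof err.code === 'string') lines.push(`Error code: ${err.code}`);
  if (typeof err.requestId === 'string') lines.push(`Reference ID: ${err.requestId}`);
  return lines;
}
```

Accepting unknown matters because onError can receive anything that was thrown, including network failures that never produced a structured body.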

Admin inspector

/<port>/admin/errors lists captured 5xx errors:

Status badge + method + path
"Likely culprit" badge (heuristic — Postgres SQLSTATE, error name, stack-path patterns, message keywords)
Truncated error name + message
Timestamp + reference id

Click any row for /<port>/admin/errors/<requestId> which shows:

Request shape (method / path / when / duration / port / user / IP / UA)
Likely culprit + plain-English hint + subsystem tag
Full error name, message, stack head (first 4 KB)
Sanitized request body excerpt (max 1 KB; sensitive keys redacted)
Raw metadata (Postgres SQLSTATE codes, internalMessage, etc.)

Permission: admin.view_audit_log. Super admins see every port's errors; regular admins are scoped to their active port.
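
The visibility rule can be illustrated with a small in-memory filter. The Viewer and ErrorEventRow shapes and the visibleErrorEvents name are assumptions for this sketch; in practice the same rule would more likely be applied as a WHERE clause in the query.

```typescript
interface Viewer {
  isSuperAdmin: boolean;
  activePortId: string;
  permissions: ReadonlySet<string>;
}

interface ErrorEventRow {
  requestId: string;
  portId: string | null;
}

// Applies the permission check first, then the port scoping rule.
function visibleErrorEvents(viewer: Viewer, rows: ErrorEventRow[]): ErrorEventRow[] {
  if (!viewer.permissions.has('admin.view_audit_log')) return [];
  if (viewer.isSuperAdmin) return rows; // every port's errors
  return rows.filter((row) => row.portId === viewer.activePortId);
}
```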

What gets persisted

Status	error_events row?	Toast shows code?
4xx	No	Yes
5xx	Yes	Yes

4xx errors are user-action mistakes (validation, not-found, permission denied). They're visible in the audit log but not in the error inspector; the error_events table is reserved for platform faults.

5xx errors are written to the error_events table by captureErrorEvent (called from inside errorResponse), which:

Reads the request context from ALS.
Sanitizes + truncates the body (1 KB cap, sensitive keys redacted).
Pulls the Postgres code / severity / cause.code if the underlying error came from the Postgres driver.
Truncates the stack to 4 KB.
Inserts one row keyed on requestId with ON CONFLICT DO NOTHING.

Failure to persist NEVER throws — the user is already getting an error response; we don't want a logging-pipeline failure to mask it.
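
The sanitize, truncate, and never-throw steps above can be sketched as follows. This is an illustration rather than the real captureErrorEvent: the function names, the '[REDACTED]' marker, and the exact sensitive-key pattern are assumptions, and the cap counts UTF-16 code units rather than bytes.

```typescript
// Assumed pattern; the source only says password/token/secret/etc keys are redacted.
const SENSITIVE_KEY = /password|token|secret/i;
const BODY_CAP = 1024; // 1 KB excerpt

// Recursively replaces values under sensitive keys. Assumes JSON-like input (no cycles).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        SENSITIVE_KEY.test(key) ? '[REDACTED]' : redact(inner),
      ]),
    );
  }
  return value;
}

// Serializes the redacted body and truncates it to the cap.
function sanitizeBody(body: unknown): string {
  const json: string | undefined = JSON.stringify(redact(body));
  return (json ?? '').slice(0, BODY_CAP);
}

// Logged-not-thrown: a persistence failure must not mask the original error response.
async function captureSafely(
  persist: () => Promise<void>,
  log: (err: unknown) => void,
): Promise<void> {
  try {
    await persist();
  } catch (err) {
    log(err);
  }
}
```

Redacting before truncating matters: cutting first could leave a secret value in the excerpt while removing the key that would have identified it.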

Likely-culprit classifier

src/lib/error-classifier.ts runs four passes against an error_events row, first match wins:

Postgres SQLSTATE (from metadata.code): 23502 NOT NULL, 23503 FK, 23505 unique, 23514 CHECK, 42703 schema drift, 42P01 missing table, 40001 serialization, 53300 connection limit, …
Error class name: AbortError, TimeoutError, FetchError, ZodError.
Stack path: /lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/.
Message free-text: econnrefused, rate limit, timeout, unauthorized|invalid api key.

Returns null when nothing matches; the inspector renders "Uncategorized" in that case. Adding a new heuristic is a one-line edit to the relevant array.
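
A minimal sketch of the four-pass, first-match-wins structure is shown below. The table contents are a small illustrative subset, the label/hint wording and subsystem names are invented for the example, and the ErrorEventLike field names are assumptions; the real tables live in src/lib/error-classifier.ts.

```typescript
interface ErrorEventLike {
  errorName?: string;
  message?: string;
  stack?: string;
  metadata?: { code?: string };
}

interface Culprit {
  label: string;
  hint: string;
  subsystem: string;
}

type Rule = [pattern: RegExp, culprit: Culprit];

// Pass 1: Postgres SQLSTATE (illustrative subset).
const BY_SQLSTATE = new Map<string, Culprit>([
  ['23503', { label: 'Foreign key violation', hint: 'A referenced row is missing.', subsystem: 'database' }],
  ['23505', { label: 'Unique violation', hint: 'A duplicate value hit a unique index.', subsystem: 'database' }],
]);

// Pass 2: error class name.
const BY_NAME = new Map<string, Culprit>([
  ['ZodError', { label: 'Schema validation', hint: 'Payload did not match the schema.', subsystem: 'validation' }],
  ['TimeoutError', { label: 'Timeout', hint: 'An operation took too long.', subsystem: 'network' }],
]);

// Pass 3: stack-path patterns.
const STACK_RULES: Rule[] = [
  [/\/lib\/storage\//, { label: 'Storage', hint: 'Failure inside the storage layer.', subsystem: 'storage' }],
  [/\/lib\/email\//, { label: 'Email', hint: 'Failure inside the email layer.', subsystem: 'email' }],
];

// Pass 4: free-text message keywords.
const MESSAGE_RULES: Rule[] = [
  [/econnrefused/i, { label: 'Connection refused', hint: 'A dependency is unreachable.', subsystem: 'network' }],
  [/rate limit/i, { label: 'Rate limited', hint: 'An upstream API throttled us.', subsystem: 'external-api' }],
];

function firstMatch(rules: Rule[], text: string | undefined): Culprit | null {
  if (!text) return null;
  for (const [pattern, culprit] of rules) {
    if (pattern.test(text)) return culprit;
  }
  return null;
}

// Passes run in order; the first one that yields a culprit wins.
function classify(ev: ErrorEventLike): Culprit | null {
  return (
    BY_SQLSTATE.get(ev.metadata?.code ?? '') ??
    BY_NAME.get(ev.errorName ?? '') ??
    firstMatch(STACK_RULES, ev.stack) ??
    firstMatch(MESSAGE_RULES, ev.message)
  );
}
```

Keeping each pass as a plain data table is what makes a new heuristic a one-line addition: only the relevant map or array changes, not the control flow.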

Pruning

error_events rows are dropped after 90 days by the maintenance worker (TODO: confirm the worker has the deletion path; if not, add a periodic job that runs DELETE FROM error_events WHERE created_at < now() - interval '90 days').

Migration path for legacy throws

Existing NotFoundError / ForbiddenError / ConflictError / ValidationError / RateLimitError still work — the user-facing messages on these classes have been rewritten to plain-text defaults.

Migration to CodedError happens opportunistically: when touching a service to fix something else, swap the throw site for a registered code.

A follow-up audit pass should walk git grep "throw new ValidationError" and migrate the user-impactful ones to specific codes.
