Files
pn-new-crm/src/lib/services/system-monitoring.service.ts

416 lines
13 KiB
TypeScript
Raw Normal View History

import { db } from '@/lib/db';
feat(errors): platform-wide request ids + error codes + admin inspector End-to-end error-handling overhaul. A user hitting any failure now sees a plain-text message + stable error code + reference id. A super admin can paste the id into /admin/errors/<id> for the full request shape, sanitized body, error stack, and a heuristic likely-cause hint. REQUEST CONTEXT (AsyncLocalStorage) - src/lib/request-context.ts mints a per-request frame carrying requestId + portId + userId + method + path + start timestamp. - withAuth wraps every authenticated handler in runWithRequestContext and accepts an upstream X-Request-Id header (validated shape) or generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id response header, including early-return 401/403/4xx paths. - Pino logger reads from the same context via mixin — every log line emitted during the request automatically carries the ids with no per-call threading. ERROR CODE REGISTRY - src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status + plain-text user-facing message (no jargon, written for the rep on the phone with a customer). - New CodedError class wraps a registered code + optional internalMessage (admin-only — never sent to client). - Existing AppError subclasses got plain-text default rewrites so legacy throw sites improve immediately without migration. - High-impact services migrated to specific codes: expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY / PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender (INTEREST_PORT_MISMATCH). ERROR ENVELOPE - errorResponse always sets X-Request-Id header + requestId field. - 5xx responses include a "Quote error ID …" friendly line. - 4xx kept clean (validation, permission, not-found don't pollute the inspector — they're already in audit log). PERSISTENCE (error_events table, migration 0040) - One row per 5xx, keyed on requestId, with method/path/status/error name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap; password/token/secret/etc keys redacted)/duration/IP/UA/metadata. - captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the classifier can recognize FK / unique / NOT NULL / schema- drift violations. - Failure to persist is logged-not-thrown. LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts) - 4-pass heuristic (first match wins): 1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703 schema drift, 53300 connection limit, …) 2. Error class name (AbortError, TimeoutError, FetchError, ZodError) 3. Stack-path patterns (/lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/) 4. Free-text message keywords (econnrefused, rate limit, timeout, unauthorized|invalid api key) - Returns { label, hint, subsystem } for the inspector badge. CLIENT SIDE - apiFetch throws structured ApiError with message + code + requestId + details + retryAfter. - toastError() helper renders the standard 3-line toast: plain message / Error code: X / Reference ID: Y [Copy ID]. ADMIN INSPECTOR - /<port>/admin/errors lists captured 5xx with status badge + path + likely-culprit badge + truncated message + reference id. Filter by status code; auto-refresh via TanStack Query. - /<port>/admin/errors/<requestId> deep-dive: request shape, full error name+message+stack, sanitized body excerpt, raw metadata, registered-code lookup (so admin can compare to what user saw), likely-culprit hint with subsystem tag. - /<port>/admin/errors/codes is the in-app code reference page — every registered code grouped by domain prefix, searchable, with HTTP status + user message inline. Linked from inspector header so admins can flip to it while triaging. - Permission: admin.view_audit_log. Super admins see all ports; regular admins port-scoped. - system-monitoring dashboard now surfaces error_events alongside permission_denied audit + queue failed jobs (RecentError gains source: 'request' variant). DOCS - docs/error-handling.md walks through coded errors, plain-text message guidelines, client toasting, admin inspector usage, persistence rules, classifier internals, pruning, and the legacy → CodedError migration path. MIGRATION SAFETY - Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal order against an empty DB. 0040 references ports(id) which exists from 0000. 0035/0038 don't deadlock under sequential psql -f. Removed redundant idx_ds_sent_by from 0038 (created in 0037). Tests: 1168/1168 vitest passing. tsc clean. - security-error-responses tests updated for plain-text messages + new optional response keys (code/requestId/message). - berth-pdf-versions tests assert stable error codes via toMatchObject({ code }) rather than message regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:12:59 +02:00
import { auditLogs, errorEvents } from '@/lib/db/schema';
import { redis } from '@/lib/redis';
import { minioClient } from '@/lib/minio/index';
import { getQueue, QUEUE_CONFIGS, type QueueName } from '@/lib/queue';
import { createAuditLog } from '@/lib/audit';
import { env } from '@/lib/env';
import { sql, desc, eq } from 'drizzle-orm';
import { NotFoundError } from '@/lib/errors';
import { logger } from '@/lib/logger';
// ─── Types ────────────────────────────────────────────────────────────────────
export interface ServiceStatus {
name: string;
status: 'healthy' | 'degraded' | 'down';
responseTimeMs: number;
details?: string;
}
export interface HealthStatus {
overall: 'healthy' | 'degraded' | 'down';
services: ServiceStatus[];
checkedAt: Date;
}
export interface QueueStatus {
name: string;
waiting: number;
active: number;
completed: number;
failed: number;
delayed: number;
}
export interface QueueJobSummary {
id: string;
name: string;
data: unknown;
status: string;
timestamp: number | undefined;
processedOn: number | undefined;
finishedOn: number | undefined;
failedReason: string | undefined;
}
export interface PaginatedQueueJobs {
jobs: QueueJobSummary[];
total: number;
page: number;
limit: number;
}
export interface ConnectionStatus {
totalConnections: number;
}
export interface RecentError {
id: string;
feat(errors): platform-wide request ids + error codes + admin inspector End-to-end error-handling overhaul. A user hitting any failure now sees a plain-text message + stable error code + reference id. A super admin can paste the id into /admin/errors/<id> for the full request shape, sanitized body, error stack, and a heuristic likely-cause hint. REQUEST CONTEXT (AsyncLocalStorage) - src/lib/request-context.ts mints a per-request frame carrying requestId + portId + userId + method + path + start timestamp. - withAuth wraps every authenticated handler in runWithRequestContext and accepts an upstream X-Request-Id header (validated shape) or generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id response header, including early-return 401/403/4xx paths. - Pino logger reads from the same context via mixin — every log line emitted during the request automatically carries the ids with no per-call threading. ERROR CODE REGISTRY - src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status + plain-text user-facing message (no jargon, written for the rep on the phone with a customer). - New CodedError class wraps a registered code + optional internalMessage (admin-only — never sent to client). - Existing AppError subclasses got plain-text default rewrites so legacy throw sites improve immediately without migration. - High-impact services migrated to specific codes: expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY / PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender (INTEREST_PORT_MISMATCH). ERROR ENVELOPE - errorResponse always sets X-Request-Id header + requestId field. - 5xx responses include a "Quote error ID …" friendly line. - 4xx kept clean (validation, permission, not-found don't pollute the inspector — they're already in audit log). PERSISTENCE (error_events table, migration 0040) - One row per 5xx, keyed on requestId, with method/path/status/error name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap; password/token/secret/etc keys redacted)/duration/IP/UA/metadata. - captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the classifier can recognize FK / unique / NOT NULL / schema- drift violations. - Failure to persist is logged-not-thrown. LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts) - 4-pass heuristic (first match wins): 1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703 schema drift, 53300 connection limit, …) 2. Error class name (AbortError, TimeoutError, FetchError, ZodError) 3. Stack-path patterns (/lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/) 4. Free-text message keywords (econnrefused, rate limit, timeout, unauthorized|invalid api key) - Returns { label, hint, subsystem } for the inspector badge. CLIENT SIDE - apiFetch throws structured ApiError with message + code + requestId + details + retryAfter. - toastError() helper renders the standard 3-line toast: plain message / Error code: X / Reference ID: Y [Copy ID]. ADMIN INSPECTOR - /<port>/admin/errors lists captured 5xx with status badge + path + likely-culprit badge + truncated message + reference id. Filter by status code; auto-refresh via TanStack Query. - /<port>/admin/errors/<requestId> deep-dive: request shape, full error name+message+stack, sanitized body excerpt, raw metadata, registered-code lookup (so admin can compare to what user saw), likely-culprit hint with subsystem tag. - /<port>/admin/errors/codes is the in-app code reference page — every registered code grouped by domain prefix, searchable, with HTTP status + user message inline. Linked from inspector header so admins can flip to it while triaging. - Permission: admin.view_audit_log. Super admins see all ports; regular admins port-scoped. - system-monitoring dashboard now surfaces error_events alongside permission_denied audit + queue failed jobs (RecentError gains source: 'request' variant). DOCS - docs/error-handling.md walks through coded errors, plain-text message guidelines, client toasting, admin inspector usage, persistence rules, classifier internals, pruning, and the legacy → CodedError migration path. MIGRATION SAFETY - Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal order against an empty DB. 0040 references ports(id) which exists from 0000. 0035/0038 don't deadlock under sequential psql -f. Removed redundant idx_ds_sent_by from 0038 (created in 0037). Tests: 1168/1168 vitest passing. tsc clean. - security-error-responses tests updated for plain-text messages + new optional response keys (code/requestId/message). - berth-pdf-versions tests assert stable error codes via toMatchObject({ code }) rather than message regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:12:59 +02:00
source: 'audit' | 'queue' | 'request';
message: string;
timestamp: Date;
metadata?: Record<string, unknown>;
feat(errors): platform-wide request ids + error codes + admin inspector End-to-end error-handling overhaul. A user hitting any failure now sees a plain-text message + stable error code + reference id. A super admin can paste the id into /admin/errors/<id> for the full request shape, sanitized body, error stack, and a heuristic likely-cause hint. REQUEST CONTEXT (AsyncLocalStorage) - src/lib/request-context.ts mints a per-request frame carrying requestId + portId + userId + method + path + start timestamp. - withAuth wraps every authenticated handler in runWithRequestContext and accepts an upstream X-Request-Id header (validated shape) or generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id response header, including early-return 401/403/4xx paths. - Pino logger reads from the same context via mixin — every log line emitted during the request automatically carries the ids with no per-call threading. ERROR CODE REGISTRY - src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status + plain-text user-facing message (no jargon, written for the rep on the phone with a customer). - New CodedError class wraps a registered code + optional internalMessage (admin-only — never sent to client). - Existing AppError subclasses got plain-text default rewrites so legacy throw sites improve immediately without migration. - High-impact services migrated to specific codes: expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY / PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender (INTEREST_PORT_MISMATCH). ERROR ENVELOPE - errorResponse always sets X-Request-Id header + requestId field. - 5xx responses include a "Quote error ID …" friendly line. - 4xx kept clean (validation, permission, not-found don't pollute the inspector — they're already in audit log). PERSISTENCE (error_events table, migration 0040) - One row per 5xx, keyed on requestId, with method/path/status/error name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap; password/token/secret/etc keys redacted)/duration/IP/UA/metadata. - captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the classifier can recognize FK / unique / NOT NULL / schema- drift violations. - Failure to persist is logged-not-thrown. LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts) - 4-pass heuristic (first match wins): 1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703 schema drift, 53300 connection limit, …) 2. Error class name (AbortError, TimeoutError, FetchError, ZodError) 3. Stack-path patterns (/lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/) 4. Free-text message keywords (econnrefused, rate limit, timeout, unauthorized|invalid api key) - Returns { label, hint, subsystem } for the inspector badge. CLIENT SIDE - apiFetch throws structured ApiError with message + code + requestId + details + retryAfter. - toastError() helper renders the standard 3-line toast: plain message / Error code: X / Reference ID: Y [Copy ID]. ADMIN INSPECTOR - /<port>/admin/errors lists captured 5xx with status badge + path + likely-culprit badge + truncated message + reference id. Filter by status code; auto-refresh via TanStack Query. - /<port>/admin/errors/<requestId> deep-dive: request shape, full error name+message+stack, sanitized body excerpt, raw metadata, registered-code lookup (so admin can compare to what user saw), likely-culprit hint with subsystem tag. - /<port>/admin/errors/codes is the in-app code reference page — every registered code grouped by domain prefix, searchable, with HTTP status + user message inline. Linked from inspector header so admins can flip to it while triaging. - Permission: admin.view_audit_log. Super admins see all ports; regular admins port-scoped. - system-monitoring dashboard now surfaces error_events alongside permission_denied audit + queue failed jobs (RecentError gains source: 'request' variant). DOCS - docs/error-handling.md walks through coded errors, plain-text message guidelines, client toasting, admin inspector usage, persistence rules, classifier internals, pruning, and the legacy → CodedError migration path. MIGRATION SAFETY - Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal order against an empty DB. 0040 references ports(id) which exists from 0000. 0035/0038 don't deadlock under sequential psql -f. Removed redundant idx_ds_sent_by from 0038 (created in 0037). Tests: 1168/1168 vitest passing. tsc clean. - security-error-responses tests updated for plain-text messages + new optional response keys (code/requestId/message). - berth-pdf-versions tests assert stable error codes via toMatchObject({ code }) rather than message regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:12:59 +02:00
/** Set for `source: 'request'` rows so the UI can deep-link to
* /admin/errors/<requestId>. */
requestId?: string;
/** Set for `source: 'request'` rows. */
statusCode?: number;
/** Set for `source: 'request'` rows. */
errorCode?: string | null;
}
// ─── Timeout helper ───────────────────────────────────────────────────────────
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
return Promise.race([
promise,
new Promise<T>((_, reject) =>
setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms),
),
]);
}
// ─── healthCheck ──────────────────────────────────────────────────────────────
export async function healthCheck(): Promise<HealthStatus> {
const checks = await Promise.allSettled([
checkPostgres(),
checkRedis(),
checkMinio(),
checkDocumenso(),
]);
const services: ServiceStatus[] = checks.map((result) => {
if (result.status === 'fulfilled') return result.value;
// Should not happen since each checker catches internally
return {
name: 'unknown',
status: 'down' as const,
responseTimeMs: 0,
details: String(result.reason),
};
});
const hasDown = services.some((s) => s.status === 'down');
const hasDegraded = services.some((s) => s.status === 'degraded');
const overall = hasDown ? 'down' : hasDegraded ? 'degraded' : 'healthy';
return { overall, services, checkedAt: new Date() };
}
async function checkPostgres(): Promise<ServiceStatus> {
const start = Date.now();
try {
await withTimeout(db.execute(sql`SELECT 1`), 5000);
return { name: 'PostgreSQL', status: 'healthy', responseTimeMs: Date.now() - start };
} catch (err) {
return {
name: 'PostgreSQL',
status: 'down',
responseTimeMs: Date.now() - start,
details: err instanceof Error ? err.message : 'Unknown error',
};
}
}
async function checkRedis(): Promise<ServiceStatus> {
const start = Date.now();
try {
const result = await withTimeout(redis.ping(), 5000);
const status = result === 'PONG' ? 'healthy' : 'degraded';
return { name: 'Redis', status, responseTimeMs: Date.now() - start };
} catch (err) {
return {
name: 'Redis',
status: 'down',
responseTimeMs: Date.now() - start,
details: err instanceof Error ? err.message : 'Unknown error',
};
}
}
async function checkMinio(): Promise<ServiceStatus> {
const start = Date.now();
try {
await withTimeout(minioClient.bucketExists(env.MINIO_BUCKET), 5000);
return { name: 'MinIO', status: 'healthy', responseTimeMs: Date.now() - start };
} catch (err) {
return {
name: 'MinIO',
status: 'down',
responseTimeMs: Date.now() - start,
details: err instanceof Error ? err.message : 'Unknown error',
};
}
}
async function checkDocumenso(): Promise<ServiceStatus> {
const start = Date.now();
try {
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 5000);
try {
const res = await fetch(`${env.DOCUMENSO_API_URL}/api/v1/health`, {
signal: controller.signal,
method: 'GET',
});
clearTimeout(timer);
const status = res.ok ? 'healthy' : 'degraded';
return { name: 'Documenso', status, responseTimeMs: Date.now() - start };
} finally {
clearTimeout(timer);
}
} catch (err) {
return {
name: 'Documenso',
status: 'down',
responseTimeMs: Date.now() - start,
details: err instanceof Error ? err.message : 'Unreachable',
};
}
}
// ─── getQueueDashboard ────────────────────────────────────────────────────────
export async function getQueueDashboard(): Promise<QueueStatus[]> {
const queueNames = Object.keys(QUEUE_CONFIGS) as QueueName[];
const results = await Promise.allSettled(
queueNames.map(async (name) => {
const queue = getQueue(name);
const counts = await queue.getJobCounts(
'waiting',
'active',
'completed',
'failed',
'delayed',
);
return {
name,
waiting: counts.waiting ?? 0,
active: counts.active ?? 0,
completed: counts.completed ?? 0,
failed: counts.failed ?? 0,
delayed: counts.delayed ?? 0,
} satisfies QueueStatus;
}),
);
return results.map((r, i) => {
if (r.status === 'fulfilled') return r.value;
const name = queueNames[i] ?? 'unknown';
logger.warn({ queue: name, err: r.reason }, 'Failed to get queue counts');
return {
name,
waiting: 0,
active: 0,
completed: 0,
failed: 0,
delayed: 0,
};
});
}
// ─── getQueueJobs ─────────────────────────────────────────────────────────────
type JobStatus = 'waiting' | 'active' | 'completed' | 'failed' | 'delayed';
export async function getQueueJobs(
queueName: QueueName,
status: JobStatus = 'failed',
page = 1,
limit = 20,
): Promise<PaginatedQueueJobs> {
const queue = getQueue(queueName);
const start = (page - 1) * limit;
const end = start + limit - 1;
const jobs = await queue.getJobs([status], start, end);
const counts = await queue.getJobCounts(status);
const total = counts[status] ?? 0;
const summaries: QueueJobSummary[] = jobs.map((job) => {
// Truncate job data to prevent huge payloads
let truncatedData: unknown;
try {
const dataStr = JSON.stringify(job.data);
truncatedData =
dataStr.length > 500 ? JSON.parse(dataStr.slice(0, 500) + '...(truncated)') : job.data;
} catch {
truncatedData = '[unparseable]';
}
return {
id: job.id ?? '',
name: job.name,
data: truncatedData,
status,
timestamp: job.timestamp,
processedOn: job.processedOn ?? undefined,
finishedOn: job.finishedOn ?? undefined,
failedReason: job.failedReason ?? undefined,
};
});
return { jobs: summaries, total, page, limit };
}
// ─── retryJob ─────────────────────────────────────────────────────────────────
export async function retryJob(queueName: QueueName, jobId: string, userId: string): Promise<void> {
const queue = getQueue(queueName);
const job = await queue.getJob(jobId);
if (!job) throw new NotFoundError('queue job');
await job.retry();
void createAuditLog({
userId,
portId: null,
action: 'update',
entityType: 'queue_job',
entityId: jobId,
metadata: { queueName, jobName: job.name, action: 'retry' },
ipAddress: 'system',
userAgent: 'system',
});
}
// ─── deleteJob ────────────────────────────────────────────────────────────────
export async function deleteJob(
queueName: QueueName,
jobId: string,
userId: string,
): Promise<void> {
const queue = getQueue(queueName);
const job = await queue.getJob(jobId);
if (!job) throw new NotFoundError('queue job');
await job.remove();
void createAuditLog({
userId,
portId: null,
action: 'delete',
entityType: 'queue_job',
entityId: jobId,
metadata: { queueName, jobName: job.name, action: 'delete' },
ipAddress: 'system',
userAgent: 'system',
});
}
// ─── getActiveConnections ─────────────────────────────────────────────────────
export async function getActiveConnections(): Promise<ConnectionStatus> {
try {
const { getIO } = await import('@/lib/socket/server');
const io = getIO();
const sockets = await io.fetchSockets();
return { totalConnections: sockets.length };
} catch {
return { totalConnections: 0 };
}
}
// ─── getRecentErrors ──────────────────────────────────────────────────────────
export async function getRecentErrors(limit = 20): Promise<RecentError[]> {
// Fetch permission-denied audit log entries
const auditErrors = await db
.select({
id: auditLogs.id,
action: auditLogs.action,
entityType: auditLogs.entityType,
entityId: auditLogs.entityId,
metadata: auditLogs.metadata,
createdAt: auditLogs.createdAt,
})
.from(auditLogs)
.where(eq(auditLogs.action, 'permission_denied'))
.orderBy(desc(auditLogs.createdAt))
.limit(limit);
const auditResults: RecentError[] = auditErrors.map((row) => ({
id: row.id,
source: 'audit' as const,
message: `Permission denied on ${row.entityType}`,
timestamp: row.createdAt,
metadata: (row.metadata as Record<string, unknown>) ?? {},
}));
// Fetch failed jobs from all queues (sample - top 5 per queue)
const queueNames = Object.keys(QUEUE_CONFIGS) as QueueName[];
const failedJobResults = await Promise.allSettled(
queueNames.map(async (name) => {
const queue = getQueue(name);
const jobs = await queue.getJobs(['failed'], 0, 4);
return jobs.map(
(job): RecentError => ({
id: `${name}:${job.id ?? ''}`,
source: 'queue',
message: `Queue job failed: ${job.name} in ${name}`,
timestamp: job.finishedOn ? new Date(job.finishedOn) : new Date(job.timestamp),
metadata: { queueName: name, failedReason: job.failedReason },
}),
);
}),
);
const queueErrors: RecentError[] = failedJobResults
.filter((r): r is PromiseFulfilledResult<RecentError[]> => r.status === 'fulfilled')
.flatMap((r) => r.value);
feat(errors): platform-wide request ids + error codes + admin inspector End-to-end error-handling overhaul. A user hitting any failure now sees a plain-text message + stable error code + reference id. A super admin can paste the id into /admin/errors/<id> for the full request shape, sanitized body, error stack, and a heuristic likely-cause hint. REQUEST CONTEXT (AsyncLocalStorage) - src/lib/request-context.ts mints a per-request frame carrying requestId + portId + userId + method + path + start timestamp. - withAuth wraps every authenticated handler in runWithRequestContext and accepts an upstream X-Request-Id header (validated shape) or generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id response header, including early-return 401/403/4xx paths. - Pino logger reads from the same context via mixin — every log line emitted during the request automatically carries the ids with no per-call threading. ERROR CODE REGISTRY - src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status + plain-text user-facing message (no jargon, written for the rep on the phone with a customer). - New CodedError class wraps a registered code + optional internalMessage (admin-only — never sent to client). - Existing AppError subclasses got plain-text default rewrites so legacy throw sites improve immediately without migration. - High-impact services migrated to specific codes: expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY / PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender (INTEREST_PORT_MISMATCH). ERROR ENVELOPE - errorResponse always sets X-Request-Id header + requestId field. - 5xx responses include a "Quote error ID …" friendly line. - 4xx kept clean (validation, permission, not-found don't pollute the inspector — they're already in audit log). PERSISTENCE (error_events table, migration 0040) - One row per 5xx, keyed on requestId, with method/path/status/error name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap; password/token/secret/etc keys redacted)/duration/IP/UA/metadata. - captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the classifier can recognize FK / unique / NOT NULL / schema- drift violations. - Failure to persist is logged-not-thrown. LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts) - 4-pass heuristic (first match wins): 1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703 schema drift, 53300 connection limit, …) 2. Error class name (AbortError, TimeoutError, FetchError, ZodError) 3. Stack-path patterns (/lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/) 4. Free-text message keywords (econnrefused, rate limit, timeout, unauthorized|invalid api key) - Returns { label, hint, subsystem } for the inspector badge. CLIENT SIDE - apiFetch throws structured ApiError with message + code + requestId + details + retryAfter. - toastError() helper renders the standard 3-line toast: plain message / Error code: X / Reference ID: Y [Copy ID]. ADMIN INSPECTOR - /<port>/admin/errors lists captured 5xx with status badge + path + likely-culprit badge + truncated message + reference id. Filter by status code; auto-refresh via TanStack Query. - /<port>/admin/errors/<requestId> deep-dive: request shape, full error name+message+stack, sanitized body excerpt, raw metadata, registered-code lookup (so admin can compare to what user saw), likely-culprit hint with subsystem tag. - /<port>/admin/errors/codes is the in-app code reference page — every registered code grouped by domain prefix, searchable, with HTTP status + user message inline. Linked from inspector header so admins can flip to it while triaging. - Permission: admin.view_audit_log. Super admins see all ports; regular admins port-scoped. - system-monitoring dashboard now surfaces error_events alongside permission_denied audit + queue failed jobs (RecentError gains source: 'request' variant). DOCS - docs/error-handling.md walks through coded errors, plain-text message guidelines, client toasting, admin inspector usage, persistence rules, classifier internals, pruning, and the legacy → CodedError migration path. MIGRATION SAFETY - Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal order against an empty DB. 0040 references ports(id) which exists from 0000. 0035/0038 don't deadlock under sequential psql -f. Removed redundant idx_ds_sent_by from 0038 (created in 0037). Tests: 1168/1168 vitest passing. tsc clean. - security-error-responses tests updated for plain-text messages + new optional response keys (code/requestId/message). - berth-pdf-versions tests assert stable error codes via toMatchObject({ code }) rather than message regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:12:59 +02:00
// Captured 5xx requests from the per-request error_events table —
// this is the deepest source: full stack head + body excerpt + path.
// The dedicated /admin/errors page paginates this; here we surface
// the most recent for the dashboard.
const requestErrorRows = await db
.select({
requestId: errorEvents.requestId,
statusCode: errorEvents.statusCode,
method: errorEvents.method,
path: errorEvents.path,
errorName: errorEvents.errorName,
errorMessage: errorEvents.errorMessage,
metadata: errorEvents.metadata,
createdAt: errorEvents.createdAt,
})
.from(errorEvents)
.orderBy(desc(errorEvents.createdAt))
.limit(limit);
const requestErrors: RecentError[] = requestErrorRows.map((row) => {
const meta = (row.metadata as Record<string, unknown>) ?? {};
return {
id: row.requestId,
source: 'request' as const,
message:
`${row.method} ${row.path}${row.statusCode} ${row.errorMessage ?? row.errorName ?? ''}`.trim(),
timestamp: row.createdAt,
metadata: meta,
requestId: row.requestId,
statusCode: row.statusCode,
errorCode: typeof meta.code === 'string' ? meta.code : null,
};
});
// Merge and sort combined list by timestamp descending
feat(errors): platform-wide request ids + error codes + admin inspector End-to-end error-handling overhaul. A user hitting any failure now sees a plain-text message + stable error code + reference id. A super admin can paste the id into /admin/errors/<id> for the full request shape, sanitized body, error stack, and a heuristic likely-cause hint. REQUEST CONTEXT (AsyncLocalStorage) - src/lib/request-context.ts mints a per-request frame carrying requestId + portId + userId + method + path + start timestamp. - withAuth wraps every authenticated handler in runWithRequestContext and accepts an upstream X-Request-Id header (validated shape) or generates a fresh UUID. The id ALWAYS leaves on the X-Request-Id response header, including early-return 401/403/4xx paths. - Pino logger reads from the same context via mixin — every log line emitted during the request automatically carries the ids with no per-call threading. ERROR CODE REGISTRY - src/lib/error-codes.ts defines stable DOMAIN_REASON codes with HTTP status + plain-text user-facing message (no jargon, written for the rep on the phone with a customer). - New CodedError class wraps a registered code + optional internalMessage (admin-only — never sent to client). - Existing AppError subclasses got plain-text default rewrites so legacy throw sites improve immediately without migration. - High-impact services migrated to specific codes: expenses (RECEIPT_REQUIRED, INVOICE_LINKED), interest-berths (CROSS_PORT_LINK_REJECTED), berth-pdf (PDF_MAGIC_BYTE / PDF_EMPTY / PDF_TOO_LARGE / VERSION_ALREADY_CURRENT), recommender (INTEREST_PORT_MISMATCH). ERROR ENVELOPE - errorResponse always sets X-Request-Id header + requestId field. - 5xx responses include a "Quote error ID …" friendly line. - 4xx kept clean (validation, permission, not-found don't pollute the inspector — they're already in audit log). PERSISTENCE (error_events table, migration 0040) - One row per 5xx, keyed on requestId, with method/path/status/error name+message/stack head (4KB cap)/sanitized body excerpt (1KB cap; password/token/secret/etc keys redacted)/duration/IP/UA/metadata. - captureErrorEvent extracts Postgres SQLSTATE/severity/cause.code so the classifier can recognize FK / unique / NOT NULL / schema- drift violations. - Failure to persist is logged-not-thrown. LIKELY-CULPRIT CLASSIFIER (src/lib/error-classifier.ts) - 4-pass heuristic (first match wins): 1. Postgres SQLSTATE → human reason (23503 FK, 23505 unique, 42703 schema drift, 53300 connection limit, …) 2. Error class name (AbortError, TimeoutError, FetchError, ZodError) 3. Stack-path patterns (/lib/storage/, /lib/email/, documenso, openai|claude, /queue/workers/) 4. Free-text message keywords (econnrefused, rate limit, timeout, unauthorized|invalid api key) - Returns { label, hint, subsystem } for the inspector badge. CLIENT SIDE - apiFetch throws structured ApiError with message + code + requestId + details + retryAfter. - toastError() helper renders the standard 3-line toast: plain message / Error code: X / Reference ID: Y [Copy ID]. ADMIN INSPECTOR - /<port>/admin/errors lists captured 5xx with status badge + path + likely-culprit badge + truncated message + reference id. Filter by status code; auto-refresh via TanStack Query. - /<port>/admin/errors/<requestId> deep-dive: request shape, full error name+message+stack, sanitized body excerpt, raw metadata, registered-code lookup (so admin can compare to what user saw), likely-culprit hint with subsystem tag. - /<port>/admin/errors/codes is the in-app code reference page — every registered code grouped by domain prefix, searchable, with HTTP status + user message inline. Linked from inspector header so admins can flip to it while triaging. - Permission: admin.view_audit_log. Super admins see all ports; regular admins port-scoped. - system-monitoring dashboard now surfaces error_events alongside permission_denied audit + queue failed jobs (RecentError gains source: 'request' variant). DOCS - docs/error-handling.md walks through coded errors, plain-text message guidelines, client toasting, admin inspector usage, persistence rules, classifier internals, pruning, and the legacy → CodedError migration path. MIGRATION SAFETY - Audit confirmed all 41 migrations (0000-0040) apply cleanly in journal order against an empty DB. 0040 references ports(id) which exists from 0000. 0035/0038 don't deadlock under sequential psql -f. Removed redundant idx_ds_sent_by from 0038 (created in 0037). Tests: 1168/1168 vitest passing. tsc clean. - security-error-responses tests updated for plain-text messages + new optional response keys (code/requestId/message). - berth-pdf-versions tests assert stable error codes via toMatchObject({ code }) rather than message regex. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:12:59 +02:00
const combined = [...auditResults, ...queueErrors, ...requestErrors].sort(
(a, b) => b.timestamp.getTime() - a.timestamp.getTime(),
);
return combined.slice(0, limit);
}