feat(expenses): streaming expense-PDF export + receipt-less expense flag + audit-3 fixes

Replaces the legacy text-only expense PDF (was just dumping rows into a single pdfme text field — no images, no pagination) with a proper streaming export modelled on the legacy Nuxt client-portal but re-architected for memory safety. The legacy implementation OOM'd on hundreds of receipts because it: - buffered every receipt image into memory simultaneously - accumulated PDF chunks into an array, concat'd at end - base64-encoded the whole PDF into a JSON response (3x peak memory) - had no image downscaling The new design: - `streamExpensePdf()` (src/lib/services/expense-pdf.service.ts): pdfkit pipes bytes directly to the HTTP response (no Buffer accumulation). Receipts are processed serially so peak heap is one image at a time. Sharp downscales any receipt > 500 KB or > 1500 px to JPEG q80 — typical 8 MB phone photo collapses to ~250 KB. For a 500-receipt export, peak RSS stays under ~100 MB; legacy needed >2 GB for the same input. - Pages: cover summary box (count, totals, currency equiv, optional processing fee), grouped expense table (groupBy=none|payer|category| date), one-page-per-receipt with header (establishment, amount, date, payer, category, file name) and full-bleed image. - Storage backend abstraction — receipts stream from `getStorageBackend().get(storageKey)`, works on MinIO/S3/filesystem. - Route: POST /api/v1/expenses/export/pdf streams binary application/pdf with cache-control:no-store. Validator caps expenseIds at 1000 to prevent runaway loops. Receipt-less expense flow (per user request): - Schema: 0033 migration adds `expenses.no_receipt_acknowledged` boolean (default false). - Validator: createExpenseSchema requires either receiptFileIds OR noReceiptAcknowledged=true; the .refine() error message tells the rep exactly what to do. updateExpenseSchema is partial and skips the rule (existing rows can be edited without re-acknowledging). - PDF: receiptless expenses get an inline red "(no receipt)" tag in the establishment cell + a red footer warning in the summary box showing the count and at-risk amount. - The legacy parent-company reimbursement queue may refuse to pay receiptless expenses, so the warning is load-bearing for ops. Audit-3 fixes piggy-backed: - 🔴 Tesseract OCR runtime now races a 30s timeout (CPU-bomb DoS protection — a crafted PDF rasterizing to high-res noise could pin the worker indefinitely). - 🟠 brochures.service.ts:listBrochures dropped a wasted query (the legacy single-brochure fast-path was discarding its result on the multi-brochure branch). - 🟠 berth-pdf.service.ts:listBerthPdfVersions now Promise.all's the presignDownload calls instead of awaiting each in a for-loop — 20-version berths went from 20× round-trip to 1×. - 🟡 public berths route no longer logs the full `row` object on enum drift (was dumping price + amenity columns into ops logs). - 🟡 dropped the dead `void sql` import from public berths route. Tests still 1163/1163. tsc clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:38:32 +02:00
parent a3e002852b
commit 014bbe1923
15 changed files with 12966 additions and 93 deletions
--- a/src/lib/services/berth-pdf-parser.ts
+++ b/src/lib/services/berth-pdf-parser.ts
@@ -190,19 +190,42 @@ export interface OcrAdapter {
  recognize(buffer: Buffer): Promise<{ text: string; confidence: number }>;
 }

+/** Hard cap on Tesseract OCR runtime. A crafted PDF rasterizing to
+ *  high-resolution noise can pin the process indefinitely (CPU bomb).
+ *  30 seconds covers the legitimate single-page-spec case by a wide
+ *  margin while bounding the worst-case worker hold-time. The AI
+ *  fallback tier handles cases where OCR couldn't finish. */
+const OCR_TIMEOUT_MS = 30_000;
+
 /** Default adapter — dynamically imports tesseract.js so the WASM bundle isn't
 *  pulled into client builds. */
 async function defaultOcrAdapter(): Promise<OcrAdapter> {
  return {
    recognize: async (buffer: Buffer) => {
      const tesseract = await import('tesseract.js');
-      // Tesseract handles PDF inputs by rasterizing the first page; for our
-      // single-page spec sheets that's sufficient.
-      const result = await tesseract.recognize(buffer, 'eng');
-      return {
-        text: result.data.text ?? '',
-        confidence: typeof result.data.confidence === 'number' ? result.data.confidence : 0,
-      };
+      // Race the OCR against a timeout so a runaway recognition can't
+      // hold the worker forever. The race-loser pattern doesn't
+      // actually cancel Tesseract (no AbortController support), but it
+      // does free the awaiter so the caller can fall through to AI.
+      let timeoutHandle: NodeJS.Timeout | undefined;
+      const timeout = new Promise<{ text: string; confidence: number }>((_, reject) => {
+        timeoutHandle = setTimeout(
+          () => reject(new Error(`Tesseract OCR exceeded ${OCR_TIMEOUT_MS}ms timeout`)),
+          OCR_TIMEOUT_MS,
+        );
+      });
+      try {
+        const result = await Promise.race([
+          tesseract.recognize(buffer, 'eng').then((r) => ({
+            text: r.data.text ?? '',
+            confidence: typeof r.data.confidence === 'number' ? r.data.confidence : 0,
+          })),
+          timeout,
+        ]);
+        return result;
+      } finally {
+        if (timeoutHandle) clearTimeout(timeoutHandle);
+      }
    },
  };
 }