feat(expenses): streaming expense-PDF export + receipt-less expense flag + audit-3 fixes
Replaces the legacy text-only expense PDF (it just dumped rows into a
single pdfme text field: no images, no pagination) with a proper
streaming export modelled on the legacy Nuxt client-portal but
re-architected for memory safety. The Nuxt implementation OOM'd on
exports with hundreds of receipts because it:
- buffered every receipt image into memory simultaneously
- accumulated PDF chunks into an array, concat'd at end
- base64-encoded the whole PDF into a JSON response (3x peak memory)
- had no image downscaling
The new design:
- `streamExpensePdf()` (src/lib/services/expense-pdf.service.ts):
pdfkit pipes bytes directly to the HTTP response (no Buffer
accumulation). Receipts are processed serially, so peak heap holds one
image at a time. Sharp downscales any receipt > 500 KB or > 1500 px
to JPEG q80; a typical 8 MB phone photo collapses to ~250 KB. For a
500-receipt export, peak RSS stays under ~100 MB; the legacy export
needed >2 GB for the same input.
- Pages: cover summary box (count, totals, currency equiv, optional
processing fee), grouped expense table (groupBy=none|payer|category|
date), one-page-per-receipt with header (establishment, amount,
date, payer, category, file name) and full-bleed image.
- Storage backend abstraction — receipts stream from
`getStorageBackend().get(storageKey)`, works on MinIO/S3/filesystem.
- Route: POST /api/v1/expenses/export/pdf streams binary
application/pdf with cache-control:no-store. Validator caps
expenseIds at 1000 to prevent runaway loops.
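The downscale rule above can be sketched as two pure functions. This is a minimal illustration, not the code in expense-pdf.service.ts: the names `needsDownscale` and `targetSize` are hypothetical, and reading "500 KB" as 500 * 1024 bytes and "1500 px" as the longest edge are assumptions.

```typescript
// Thresholds come from the commit message. Treating KB as 1024 bytes and
// "px" as the longest edge is an assumption made for this sketch.
const MAX_BYTES = 500 * 1024;
const MAX_EDGE_PX = 1500;

interface ImageInfo {
  byteLength: number;
  width: number;
  height: number;
}

/** True when the receipt exceeds either threshold and should be re-encoded. */
function needsDownscale(img: ImageInfo): boolean {
  return img.byteLength > MAX_BYTES || Math.max(img.width, img.height) > MAX_EDGE_PX;
}

/** Dimensions that fit inside MAX_EDGE_PX, keeping aspect ratio and never upscaling. */
function targetSize(img: ImageInfo): { width: number; height: number } {
  const scale = Math.min(1, MAX_EDGE_PX / Math.max(img.width, img.height));
  return {
    width: Math.round(img.width * scale),
    height: Math.round(img.height * scale),
  };
}
```

Under these assumptions a 4000x3000 photo maps to 1500x1125, while an 800x600 image under the byte limit is passed through untouched.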
Receipt-less expense flow (per user request):
- Schema: 0033 migration adds `expenses.no_receipt_acknowledged`
boolean (default false).
- Validator: createExpenseSchema requires either receiptFileIds OR
noReceiptAcknowledged=true; the .refine() error message tells the
rep exactly what to do. updateExpenseSchema is partial and skips
the rule (existing rows can be edited without re-acknowledging).
- PDF: receiptless expenses get an inline red "(no receipt)" tag in
the establishment cell + a red footer warning in the summary box
showing the count and at-risk amount.
- The legacy parent-company reimbursement queue may refuse to pay
receiptless expenses, so the warning is load-bearing for ops.
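The either/or rule on create can be sketched as a plain predicate. This is not the actual createExpenseSchema: in the real schema the check sits inside zod's `.refine()`, whereas the sketch below uses plain TypeScript so it runs without dependencies. The function name, the message text, and treating an empty `receiptFileIds` array as "no receipt" are all assumptions.

```typescript
interface CreateExpenseInput {
  receiptFileIds?: string[];
  noReceiptAcknowledged?: boolean;
}

// Hypothetical wording; the real .refine() message is not shown in this commit.
const NO_RECEIPT_MESSAGE =
  'Attach at least one receipt, or confirm that this expense has no receipt.';

/** Returns null when the input satisfies the rule, otherwise the error message. */
function checkReceiptRule(input: CreateExpenseInput): string | null {
  // Assumption: an empty array counts the same as a missing field.
  const hasReceipt = (input.receiptFileIds?.length ?? 0) > 0;
  const acknowledged = input.noReceiptAcknowledged === true;
  return hasReceipt || acknowledged ? null : NO_RECEIPT_MESSAGE;
}
```

An update path that skips the rule, as the commit describes for updateExpenseSchema, would simply not call this check.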
Audit-3 fixes piggy-backed:
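- 🔴 Tesseract OCR runtime now races a 30s timeout (CPU-bomb DoS
protection: a crafted PDF rasterizing to high-res noise could
pin the worker indefinitely). The race releases the caller so it can
fall through to AI; it does not cancel the Tesseract job itself.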
- 🟠 brochures.service.ts:listBrochures dropped a wasted query (the
legacy single-brochure fast-path was discarding its result on the
multi-brochure branch).
- 🟠 berth-pdf.service.ts:listBerthPdfVersions now Promise.all's the
presignDownload calls instead of awaiting each in a for-loop, so a
20-version berth waits for roughly one round-trip instead of 20
sequential ones.
- 🟡 public berths route no longer logs the full `row` object on
enum drift (was dumping price + amenity columns into ops logs).
- 🟡 dropped the dead `void sql` import from public berths route.
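The presign change in listBerthPdfVersions follows a common pattern, sketched here in isolation. The `Presign` type is a hypothetical stand-in; the real presignDownload signature is not shown in this commit.

```typescript
// Hypothetical stand-in for the storage client's presignDownload call.
type Presign = (storageKey: string) => Promise<string>;

/** Before: each call waits for the previous one, so latency adds up. */
async function presignSequential(keys: string[], presign: Presign): Promise<string[]> {
  const urls: string[] = [];
  for (const key of keys) {
    urls.push(await presign(key));
  }
  return urls;
}

/** After: all calls start at once; results keep the input order. */
async function presignConcurrent(keys: string[], presign: Presign): Promise<string[]> {
  return Promise.all(keys.map((key) => presign(key)));
}
```

One trade-off to keep in mind: `Promise.all` rejects as soon as any single call fails, so one bad key fails the whole listing, which matches what the sequential loop did on its first error.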
Tests still 1163/1163. tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -190,19 +190,42 @@ export interface OcrAdapter {
   recognize(buffer: Buffer): Promise<{ text: string; confidence: number }>;
 }

+/** Hard cap on Tesseract OCR runtime. A crafted PDF rasterizing to
+ * high-resolution noise can pin the process indefinitely (CPU bomb).
+ * 30 seconds covers the legitimate single-page-spec case by a wide
+ * margin while bounding the worst-case worker hold-time. The AI
+ * fallback tier handles cases where OCR couldn't finish. */
+const OCR_TIMEOUT_MS = 30_000;
+
 /** Default adapter - dynamically imports tesseract.js so the WASM bundle isn't
  * pulled into client builds. */
 async function defaultOcrAdapter(): Promise<OcrAdapter> {
   return {
     recognize: async (buffer: Buffer) => {
       const tesseract = await import('tesseract.js');
-      // Tesseract handles PDF inputs by rasterizing the first page; for our
-      // single-page spec sheets that's sufficient.
-      const result = await tesseract.recognize(buffer, 'eng');
-      return {
-        text: result.data.text ?? '',
-        confidence: typeof result.data.confidence === 'number' ? result.data.confidence : 0,
-      };
+      // Race the OCR against a timeout so a runaway recognition can't
+      // hold the worker forever. The race-loser pattern doesn't
+      // actually cancel Tesseract (no AbortController support), but it
+      // does free the awaiter so the caller can fall through to AI.
+      let timeoutHandle: NodeJS.Timeout | undefined;
+      const timeout = new Promise<{ text: string; confidence: number }>((_, reject) => {
+        timeoutHandle = setTimeout(
+          () => reject(new Error(`Tesseract OCR exceeded ${OCR_TIMEOUT_MS}ms timeout`)),
+          OCR_TIMEOUT_MS,
+        );
+      });
+      try {
+        const result = await Promise.race([
+          tesseract.recognize(buffer, 'eng').then((r) => ({
+            text: r.data.text ?? '',
+            confidence: typeof r.data.confidence === 'number' ? r.data.confidence : 0,
+          })),
+          timeout,
+        ]);
+        return result;
+      } finally {
+        if (timeoutHandle) clearTimeout(timeoutHandle);
+      }
     },
   };
 }
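The race-against-a-timer pattern in this hunk can be pulled out into a small generic helper. This is a sketch of the same technique, not code from the commit; the name `withTimeout` and its parameters are hypothetical.

```typescript
/**
 * Rejects if `work` has not settled within `ms` milliseconds.
 * The underlying work is not cancelled; only the awaiting caller is released.
 */
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    handle = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms timeout`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a fast result does not leave it pending.
    if (handle !== undefined) clearTimeout(handle);
  }
}
```

A caller would wrap the slow operation, for example `withTimeout(runOcr(buffer), 30_000, 'Tesseract OCR')`, where `runOcr` is a placeholder for the recognition call. Because the losing promise keeps running, CPU use is only bounded if the work is also stopped by other means, such as terminating the worker that runs it.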