Matt e8a852856e feat(berth-parser): unpdf for tier-2 PDF text extraction
Phase 1 / commit 13 of 14 — replaces a quietly-broken tesseract.js
pathway with unpdf for tier-2 of the berth-PDF parser.

The previous code did:
  const tesseract = await import('tesseract.js');
  await tesseract.recognize(buffer, 'eng');   // ← buffer is a PDF

tesseract.recognize() expects an image, not a PDF. Given the PDFs we get
from the AcroForm-stripped berth-spec sheets, the call would have failed
at runtime (either an "unsupported format" error or silently empty
text). Tier-2 was dark code.

unpdf (a serverless-friendly pdfjs wrapper) extracts text directly from
the PDF stream. It works on text PDFs (real text streams) and returns
empty text on scanned/raster PDFs, which legitimately fall through to
the AI tier where they belong.
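
A minimal sketch of that fall-through decision, under stated assumptions:
runTier2, TextExtractor and Tier2Result are hypothetical names for
illustration, not the parser's real ones, and the extractor is injected
so the snippet runs without unpdf installed.

```typescript
// Hypothetical types; the commit does not show the parser's real signatures.
type TextExtractor = (pdf: Uint8Array) => Promise<string>;

type Tier2Result =
  | { kind: 'text'; text: string }
  | { kind: 'fallthrough'; reason: string };

// Runs the injected extractor and decides whether tier-2 produced usable text.
// Whitespace-only output is treated as empty, i.e. no text layer was found.
async function runTier2(
  pdf: Uint8Array,
  extract: TextExtractor,
): Promise<Tier2Result> {
  const text = (await extract(pdf)).trim();
  if (text.length === 0) {
    return {
      kind: 'fallthrough',
      reason: 'no text layer (likely a scanned/raster PDF)',
    };
  }
  return { kind: 'text', text };
}

// With unpdf, the extractor would presumably wrap something like:
//   const doc = await getDocumentProxy(pdf);
//   const { text } = await extractText(doc, { mergePages: true });
```

Injecting the extractor keeps the empty-text rule testable on its own; the
caller decides what "fall through to the AI tier" means.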

The OcrAdapter interface shape is preserved so:
  - Existing unit tests that stub the adapter still work
  - parseAnyBerthPdf(buffer, { adapter }) override still works
  - The 30-second timeout race + warning collection still works
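
For illustration, the adapter seam and the timeout race could look roughly
like the sketch below. The OcrAdapter shape, extractWithTimeout and
Tier2Outcome are assumptions made for this example; the real interface is
not shown in this commit.

```typescript
// Hypothetical adapter shape; the real OcrAdapter may differ.
interface OcrAdapter {
  extractText(pdf: Uint8Array): Promise<string>;
}

interface Tier2Outcome {
  text: string;
  warnings: string[];
}

// Races the adapter against a timer. On timeout or adapter failure the error
// message is collected as a warning and empty text is returned, so the caller
// can fall through to the next tier instead of throwing.
async function extractWithTimeout(
  pdf: Uint8Array,
  adapter: OcrAdapter,
  timeoutMs = 30_000,
): Promise<Tier2Outcome> {
  const warnings: string[] = [];
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`tier-2 timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  try {
    const text = await Promise.race([adapter.extractText(pdf), timeout]);
    return { text, warnings };
  } catch (err) {
    warnings.push(err instanceof Error ? err.message : String(err));
    return { text: '', warnings };
  } finally {
    // Always clear the timer so a fast adapter does not leave it pending.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Because the function only depends on the interface, a unit test can pass a
stub such as { extractText: async () => 'Berth A12' } and never load a PDF
library.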

tesseract.js stays as a dep — scan-shell.tsx (receipt scanner) still
uses it for on-device image OCR, which is its intended use case.

1298/1298 vitest green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 21:13:10 +02:00