e8a852856e966f1872242e243be15bd4f71edc75
Phase 1 / commit 13 of 14 — replaces a quietly-broken tesseract.js
pathway with unpdf for tier-2 of the berth-PDF parser.
The previous code did:
const tesseract = await import('tesseract.js');
await tesseract.recognize(buffer, 'eng'); // ← buffer is a PDF
tesseract.recognize() expects an image, not a PDF. The PDFs we get from
the AcroForm-stripped berth-spec sheets would have failed at runtime
(either an "unsupported format" error or silently empty text). Tier-2
was dark code.
unpdf (serverless-friendly pdfjs wrapper) extracts text directly from
the PDF stream. Works on text-PDFs (real text streams), returns empty
on scanned/raster PDFs — those legitimately fall through to the AI
tier where they belong.
The OcrAdapter interface shape is preserved so:
- Existing unit tests that stub the adapter still work
- parseAnyBerthPdf(buffer, { adapter }) override still works
- The 30-second timeout race + warning collection still works
tesseract.js stays as a dep — scan-shell.tsx (receipt scanner) still
uses it for on-device image OCR, which is its intended use case.
1298/1298 vitest green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
TypeScript
98.7%
HTML
1%
CSS
0.1%
Shell
0.1%