# Reliability audit — 2026-05-06 (focused, post-batch deltas)
Scope: NEW services from the recent archive/restore/hard-delete/external-EOI batches.
Out of scope (already covered in `docs/audit-comprehensive-2026-05-06.md`):
worker imports, rate limits, hard-delete error message UX, smart-restore
dead reversal applier, bulk hard-delete redis loop, audit log spam.
---
## Critical
### C1. Bulk archive enqueues zero post-commit side effects
- **File:** `src/app/api/v1/clients/bulk/route.ts:68-134`
- **Scenario:** When the bulk wizard archives 100 clients with high-stakes
reasons, `archiveClientWithDecisions` returns `externalCleanups` and
`releasedBerths` arrays per-client, but `runBulk` discards the return
value. Documenso envelopes that the wizard marked `void_documenso`
never get queued, and "next-in-line" notifications never fire. The
database is left in `documents.status='cancelled'` with the live
Documenso envelope still out for signature — the signer can complete
a legally-binding envelope that the CRM thinks is voided.
- **Fix:** Make the per-row callback return the result, then loop over
`results` after `runBulk` to enqueue Documenso voids and fire
next-in-line notifications (mirroring the single-client route).
Defaulting `documentDecisions` to `'leave'` (line 113-116) hides the
symptom for the bulk wizard but isn't enough — the single-client
service can still surface this if the bulk path is ever generalized.
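
A rough sketch of the missing post-commit loop, assuming `runBulk` can hand back each callback's return value; the result/cleanup shapes and the `enqueueDocumensoVoid` / `notifyNextInLine` helpers are placeholders for whatever the single-client route already does:

```ts
// Sketch only: collect the per-client results instead of discarding them,
// then run the same post-commit side effects as the single-client route.
const results = await runBulk(clientIds, (clientId) =>
  archiveClientWithDecisions(clientId, decisions, actor),
);

for (const result of results) {
  if (!result) continue; // failed rows carry no cleanup work
  for (const cleanup of result.externalCleanups) {
    if (cleanup.kind === 'void_documenso') {
      await enqueueDocumensoVoid(cleanup.documentId); // assumed helper
    }
  }
  for (const berth of result.releasedBerths) {
    await notifyNextInLine(berth.berthId); // assumed helper
  }
}
```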
---
## High
### H1. Restore wizard silently drops every released berth
- **File:** `src/lib/services/client-restore.service.ts:359-372`
- **Scenario:** `applyReversal` for `berth_released` is a no-op with a
comment saying "v1 leaves the berth available". But the dossier (line
122-129) classifies these as `autoReversible` and the UI tells the
operator "still available — re-attaching to the restored client". The
wizard increments `autoReversed` and the audit log records a
successful auto-reverse — but nothing actually happens. Operator
thinks restore re-linked their berth; it didn't.
- **Fix:** Either (a) actually re-link by persisting the original
`interestId` in the `berth_released` decision detail (it's already
there, line 211) and re-inserting an `interestBerths` row + flipping
the berth status back to `under_offer`, or (b) reclassify these as
`reversibleWithPrompt` with copy that says "berth left available —
re-add via the interest detail page".
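
If option (a) is taken, the `berth_released` branch of `applyReversal` could look roughly like this (a sketch; the decision-detail shape and status value come from the finding above, the rest is assumed):

```ts
// Sketch: inside applyReversal's switch, re-link the berth instead of a no-op.
case 'berth_released': {
  // interestId/berthId were persisted in the decision detail at archive time.
  const { interestId, berthId } = decision.detail;
  await tx.insert(interestBerths).values({ interestId, berthId });
  await tx
    .update(berths)
    .set({ status: 'under_offer' })
    .where(eq(berths.id, berthId));
  break;
}
```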
### H2. Smart-archive berth status update has TOCTOU race
- **File:** `src/lib/services/client-archive.service.ts:191-207`
- **Scenario:** Berth row is read via `dossier.berths` (read outside the
tx) and modified inside the tx without a `for update` lock on
`berths`. Two concurrent flows — e.g. operator A archives client X
while operator B sells berth A1 to client Y — can race: A reads
`berth.status === 'sold' → false`, B's tx commits sold, A's tx then
flips it back to `available`. The "still under offer" subselect
doesn't catch this because berth.status is the source of truth, not
interest_berths.
- **Fix:** Add `tx.select(...).from(berths).where(eq(berths.id, d.berthId)).for('update')`
before the status flip and re-check `status !== 'sold'` against the
locked row.
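
Roughly, inside the archive transaction (Drizzle's `.for('update')` emits `select … for update`; column names as in the finding):

```ts
// Sketch: lock the berth row, then decide off the locked value rather than
// the dossier snapshot that was read before the transaction started.
const [locked] = await tx
  .select({ status: berths.status })
  .from(berths)
  .where(eq(berths.id, d.berthId))
  .for('update');

if (locked && locked.status !== 'sold') {
  await tx
    .update(berths)
    .set({ status: 'available' })
    .where(eq(berths.id, d.berthId));
}
```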
### H3. Bulk archive can pick the wrong interest for berth release
- **File:** `src/app/api/v1/clients/bulk/route.ts:95-103`
- **Scenario:** When a client has multiple interests linked to the same
berth, the bulk wizard picks `dossier.interests.find((i) =>
i.primaryBerthMooring === b.mooringNumber)` and falls back to
`dossier.interests[0]?.interestId ?? ''`. The fallback to the
first-interest-or-empty-string can hand `archiveClientWithDecisions`
an `interestId` that was never linked to that berth — so the
`delete from interest_berths where berthId=… and interestId=…`
matches zero rows and the link is silently retained. Worse: an empty
string `''` reaches the delete, which still matches zero rows but
leaves the berth status check believing the link was removed.
- **Fix:** Build the berth→interest map from `interestBerthRows` (the
authoritative join) rather than guessing by `primaryBerthMooring`,
and skip berths with no resolvable interest rather than emitting an
empty-string interestId.
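
Sketch of the map-based resolution, assuming `interestBerthRows` exposes `berthId` and `interestId` (the decision shape is illustrative):

```ts
// Sketch: resolve interests from the authoritative join rows and skip
// berths with no link instead of emitting an empty-string interestId.
const interestByBerth = new Map(
  interestBerthRows.map((row) => [row.berthId, row.interestId]),
);

const berthDecisions = dossier.berths.flatMap((b) => {
  const interestId = interestByBerth.get(b.berthId);
  if (!interestId) return []; // no resolvable link: leave it alone
  return [{ berthId: b.berthId, interestId, action: 'release' as const }];
});
```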
### H4. External EOI runs four DB writes outside a transaction
- **File:** `src/lib/services/external-eoi.service.ts:67-155`
- **Scenario:** `getStorageBackend().put()`, `files.insert`,
`documents.insert`, `documentEvents.insert`, and the interests
update happen as five independent operations. If any one fails after
the storage upload, you're left with an orphan PDF in S3/MinIO and
partial DB state. If the documents insert fails after the file
insert, the file row points to a storage key with no document
referencing it — and the interest never advances.
- **Fix:** Wrap files/documents/documentEvents/interests in a single
`db.transaction`. Storage upload stays outside (S3 isn't
transactional) but on tx failure, schedule a cleanup job that deletes
the orphan storage object, or accept the orphan and add a janitor.
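
A sketch of that shape, assuming the storage backend exposes a `delete` for the cleanup path and `put(key, buffer)` as the upload signature (columns abbreviated):

```ts
import { eq } from 'drizzle-orm';

// Sketch: upload first, then make the four DB writes atomic; best-effort
// cleanup of the orphan object if the transaction fails.
const storage = getStorageBackend();
await storage.put(storageKey, pdfBuffer); // signature assumed

try {
  await db.transaction(async (tx) => {
    const [file] = await tx.insert(files).values({ storageKey /* … */ }).returning();
    const [doc] = await tx
      .insert(documents)
      .values({ fileId: file.id /* … */ })
      .returning();
    await tx.insert(documentEvents).values({ documentId: doc.id /* … */ });
    await tx
      .update(interests)
      .set({ pipelineStage: 'eoi_received' /* assumed value */ })
      .where(eq(interests.id, interestId));
  });
} catch (err) {
  await storage.delete(storageKey).catch(() => undefined); // assumed method
  throw err;
}
```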
### H5. Bulk wizard double-submit reports already-archived clients as failures
- **File:** `src/app/api/v1/clients/bulk/route.ts:68-120` +
`src/lib/services/client-archive.service.ts:165-173`
- **Scenario:** The single-client `archiveClientWithDecisions` locks
the row and throws `ConflictError('Client is already archived')` on
re-entry — good. But `runBulk` swallows the error string and returns
it as `{ok:false, error:"Client is already archived"}` for that
client. If the bulk wizard double-submits (network retry, double
click), partial successes from the first request now look like
per-client failures in the response, confusing the operator. There's
no idempotency key on the bulk submit.
- **Fix:** Treat `ConflictError('already archived')` as success in the
bulk per-row handler (the desired end state is reached). Or add an
idempotency-key header on the bulk endpoint that short-circuits a
duplicate request with the cached response.
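
Sketch of the first option in the per-row callback (the `{ ok, … }` row shape is illustrative):

```ts
// Sketch: an already-archived client is the desired end state, so a retry
// (double click, network replay) should not be reported as a failure.
try {
  const result = await archiveClientWithDecisions(clientId, decisions, actor);
  return { ok: true, result };
} catch (err) {
  if (err instanceof ConflictError && /already archived/i.test(err.message)) {
    return { ok: true, alreadyArchived: true };
  }
  throw err; // genuine failures still surface through runBulk
}
```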
---
## Medium
### M1. Hard-delete `clientMergeLog.surviving_client_id` deletes audit history
- **File:** `src/lib/services/client-hard-delete.service.ts:209`
- **Scenario:** The comment says "merged records remain in the log
because mergedClientId has no FK", but the delete is wider than
needed: it removes every merge-log row where this client was the
survivor. If client X (being deleted) previously absorbed clients
A/B/C, the audit trail of those merges is lost on X's deletion. The
rows that do remain (`mergedClientId = X`) are now inconsistent:
they reference a client that no longer exists.
- **Fix:** Either preserve the survivor rows by setting
`surviving_client_id = NULL` (requires column nullable) or keep the
current behavior but document it more visibly. At minimum, log the
deleted merge-log row count so operators can investigate gaps.
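
If the current behavior is kept, logging the count is cheap (sketch; assumes the Drizzle table object is `clientMergeLog` with a `survivingClientId` column):

```ts
import { eq } from 'drizzle-orm';

// Sketch: surface how many survivor-side merge-log rows the hard delete drops.
const deleted = await tx
  .delete(clientMergeLog)
  .where(eq(clientMergeLog.survivingClientId, clientId))
  .returning({ id: clientMergeLog.id });

logger.info(
  { clientId, deletedMergeLogRows: deleted.length },
  'hard-delete removed survivor-side merge-log rows',
);
```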
### M2. Documenso void worker has no max-retry guard for non-404 errors
- **File:** `src/lib/queue/workers/documents.ts:19-37`
- **Scenario:** `voidDocument` throws `CodedError` on non-404 failures
(auth error, network blip, Documenso 500). BullMQ retries with
backoff, but there's no per-job idempotency check: each retry hits
the same envelope, and voidDocument's 404 short-circuit only kicks
in if an earlier attempt actually voided it before the API call
returned an error. A persistent 401/403 keeps retrying until BullMQ
exhausts its attempts, and the documents row stays `cancelled` in
the CRM with the envelope still live in
Documenso. The DLQ is mentioned in the comment but the worker
doesn't surface a DLQ alert hook.
- **Fix:** On exhaustion, write back to `documents` (e.g.
`cancellation_failed=true`) and emit an admin notification so the
envelope can be voided manually.
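
One way to hook exhaustion with BullMQ's `failed` event (the `cancellationFailed` column and the notification helper/shape are assumptions):

```ts
// Sketch: once BullMQ has used up its attempts, persist the failure and
// alert an admin instead of leaving the envelope silently live.
worker.on('failed', async (job, err) => {
  if (!job) return;
  if (job.attemptsMade < (job.opts.attempts ?? 1)) return; // still retrying

  await db
    .update(documents)
    .set({ cancellationFailed: true }) // assumed column
    .where(eq(documents.id, job.data.documentId));

  await createNotification({
    type: 'documenso_void_failed', // assumed notification type
    documentId: job.data.documentId,
    message: err.message,
  });
});
```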
### M3. Next-in-line notification fan-out unhandled rejection
- **File:** `src/lib/services/next-in-line-notify.service.ts:75-87`
- **Scenario:** Each `void createNotification(...)` is a fire-and-forget
promise with no `.catch` handler. If `notifications.service`
dispatches to a DB that's transiently down, the unhandled rejection
will surface in the Node process with no recipient context (the
closure-captured `userId` is in scope but pino won't include it
unless explicitly logged). Process-level handlers will log it but
individual recipients silently lose their notification.
- **Fix:** `.catch((err) => logger.warn({err, userId, berthId:
input.berthId}, 'next-in-line notification failed'))`.
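
In context, roughly (the surrounding loop variables are assumed):

```ts
// Sketch: keep the fire-and-forget fan-out, but record failures with enough
// context to identify the recipient who lost their notification.
for (const userId of recipientIds) {
  void createNotification({ userId, berthId: input.berthId /* … */ }).catch((err) =>
    logger.warn(
      { err, userId, berthId: input.berthId },
      'next-in-line notification failed',
    ),
  );
}
```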
### M4. Restore service uses `any` for transaction type
- **File:** `src/lib/services/client-restore.service.ts:354-355`
- **Scenario:** `applyReversal(tx: any, ...)` defeats Drizzle's type
safety. A future schema rename (e.g. `yachts.status` enum change)
won't fail at compile time inside this function. Combined with the
documented v1 no-op for `berth_released`, the function looks
innocuous but carries the most risk.
- **Fix:** Use the proper Drizzle tx type — `Parameters<Parameters<typeof
db.transaction>[0]>[0]` or a named type alias from
`@/lib/db/types.ts` if one exists.
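
A minimal alias sketch (the import path and `ArchiveDecision` type are placeholders):

```ts
import { db } from '@/lib/db'; // assumed path

// Sketch: derive the transaction type from db.transaction's callback so
// schema changes fail at compile time inside applyReversal.
type Tx = Parameters<Parameters<typeof db.transaction>[0]>[0];

async function applyReversal(tx: Tx, decision: ArchiveDecision): Promise<void> {
  // …typed tx.update(...) / tx.insert(...) calls as before
}
```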
### M5. interests.changeInterestStage milestones write outside tx
- **File:** `src/lib/services/interests.service.ts:630-648`
- **Scenario:** The override path (and normal path) writes
`pipelineStage` in one update and milestone fields
(`dateEoiSent`, `dateContractSigned`, etc.) in a second update. If
the process crashes between the two, the stage advances but the
milestone is never recorded. Funnel/conversion math then under-
counts that interest. In practice this is rare, but the audit log
fires before the milestone update succeeds, so the audit trail
claims a complete transition that's actually half-applied.
- **Fix:** Combine both into a single update statement, computing the
milestone fields in JS and merging them into the `set({...})` clause.
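
Sketch of the merged update (the stage values and the stage-to-milestone mapping are illustrative):

```ts
import { eq } from 'drizzle-orm';

// Sketch: compute the milestone column for the new stage in JS and write it
// in the same statement as the stage change, so both land or neither does.
const milestonePatch: Partial<typeof interests.$inferInsert> =
  newStage === 'eoi_sent'
    ? { dateEoiSent: new Date() }
    : newStage === 'contract_signed'
      ? { dateContractSigned: new Date() }
      : {};

await db
  .update(interests)
  .set({ pipelineStage: newStage, ...milestonePatch })
  .where(eq(interests.id, interestId));
```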
---
## Low
### L1. Smart-archive coalesces invoice notes via SQL string concat
- **File:** `src/lib/services/client-archive.service.ts:288-291`
- **Scenario:** `notes: sql\`coalesce(${invoices.notes}, '') || ${...}\`` embeds `new Date().toISOString()` and the action label inside a
parameterized string. The values are bound, so it's not an injection
risk, but the `\n[archive ...]` marker is appended unconditionally —
re-running the archive on a not-yet-committed client would double
the marker. Combined with H5 (no idempotency on bulk), a retry could
bloat invoice notes with duplicate markers.
- **Fix:** Append only when the marker isn't already present, or rely
on the `clients.archivedAt is null` precheck (which already guards
re-entry) and accept the duplicate as theoretically impossible.
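
If deduplication is wanted, the append can be made conditional in the same statement (sketch; `actionLabel` and the where clause stand in for whatever the service actually builds):

```ts
// Sketch: only append the archive marker when a marker carrying this action
// label is not already present in the notes.
const marker = `\n[archive ${new Date().toISOString()}] ${actionLabel}`;

await tx
  .update(invoices)
  .set({
    notes: sql`case
      when coalesce(${invoices.notes}, '') like ${'%' + actionLabel + '%'}
        then ${invoices.notes}
      else coalesce(${invoices.notes}, '') || ${marker}
    end`,
  })
  .where(eq(invoices.clientId, clientId));
```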
### L2. Hard-delete `requestHardDeleteCode` reveals client existence pre-archive
- **File:** `src/lib/services/client-hard-delete.service.ts:77-85`
- **Scenario:** A user without `admin.permanently_delete_clients`
shouldn't reach this service, so this is theoretical, but the
ConflictError "Client must be archived" leaks the existence of an
unarchived client to anyone who can reach the route. The audit doc
flagged hard-delete error messages already (out of scope), but this
specific error path isn't covered there.
- **Fix:** Same as the audit-doc finding for the symmetric path —
return a generic `NotFoundError` instead of distinguishing
"not found" from "not archived" externally; log the distinction
internally only.
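
Sketch of the collapsed error path (assumes pino-style structured logging):

```ts
// Sketch: collapse "not found" and "not archived yet" into one external
// error; keep the real reason in the internal log only.
if (!client || client.archivedAt === null) {
  logger.info(
    { clientId, reason: client ? 'not_archived' : 'not_found' },
    'hard-delete code request rejected',
  );
  throw new NotFoundError('Client not found');
}
```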