docs: comprehensive audits + Documenso build plan + admin UX backlog
Six audit documents capture the 2026-05-06 review pass (comprehensive, frontend, missing-features, permissions, reliability) along with the Documenso integration audit + locked build plan that drove the bulk of subsequent feature work.

Adds `docs/admin-ux-backlog.md` as a living tracker for the autonomous push — every item marked DONE or REMAINING with file pointers and scope estimates so future sessions can pick up where this one stopped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/audit-reliability-2026-05-06.md (new file, 220 lines)
@@ -0,0 +1,220 @@
# Reliability audit — 2026-05-06 (focused, post-batch deltas)

Scope: NEW services from the recent archive/restore/hard-delete/external-EOI batches.

Out of scope (already covered in `docs/audit-comprehensive-2026-05-06.md`): worker imports, rate limits, hard-delete error message UX, smart-restore dead reversal applier, bulk hard-delete redis loop, audit log spam.

---
## Critical

### C1. Bulk archive enqueues zero post-commit side effects

- **File:** `src/app/api/v1/clients/bulk/route.ts:68-134`
- **Scenario:** When the bulk wizard archives 100 clients with high-stakes reasons, `archiveClientWithDecisions` returns `externalCleanups` and `releasedBerths` arrays per-client, but `runBulk` discards the return value. Documenso envelopes that the wizard marked `void_documenso` never get queued, and "next-in-line" notifications never fire. The database is left in `documents.status='cancelled'` with the live Documenso envelope still out for signature — the signer can complete a legally-binding envelope that the CRM thinks is voided.
- **Fix:** Make the per-row callback return the result, then loop over `results` after `runBulk` to enqueue Documenso voids and fire next-in-line notifications (mirroring the single-client route), as sketched below. Defaulting `documentDecisions` to `'leave'` (line 113-116) hides the symptom for the bulk wizard but isn't enough — the single-client service can still surface this if the bulk path is ever generalized.

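A minimal sketch of the post-`runBulk` fan-out. The shape of `runBulk`'s results, the `documentsQueue` job name, and the `notifyNextInLine` helper are assumptions for illustration, not the actual exports:

```ts
// Sketch only — assumes runBulk now collects each row's archive result instead of discarding it.
const results = await runBulk(clientIds, async (clientId) =>
  archiveClientWithDecisions({ clientId, decisions, actorId }),
);

for (const row of results) {
  if (!row.ok) continue;
  // Queue Documenso voids for every cleanup the archive flagged (queue/job names hypothetical).
  for (const cleanup of row.value.externalCleanups) {
    if (cleanup.kind === 'void_documenso') {
      await documentsQueue.add('void-documenso', { documentId: cleanup.documentId });
    }
  }
  // Fire next-in-line notifications for each berth the archive released.
  for (const berth of row.value.releasedBerths) {
    await notifyNextInLine({ berthId: berth.berthId, excludeClientId: row.value.clientId });
  }
}
```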
---
## High

### H1. Restore wizard silently drops every released berth

- **File:** `src/lib/services/client-restore.service.ts:359-372`
- **Scenario:** `applyReversal` for `berth_released` is a no-op with a comment saying "v1 leaves the berth available". But the dossier (line 122-129) classifies these as `autoReversible` and the UI tells the operator "still available — re-attaching to the restored client". The wizard increments `autoReversed` and the audit log records a successful auto-reverse — but nothing actually happens. Operator thinks restore re-linked their berth; it didn't.
- **Fix:** Either (a) actually re-link by persisting the original `interestId` in the `berth_released` decision detail (it's already there, line 211) and re-inserting an `interestBerths` row + flipping the berth status back to `under_offer`, or (b) reclassify these as `reversibleWithPrompt` with copy that says "berth left available — re-add via the interest detail page". Option (a) is sketched below.

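A rough sketch of option (a). It assumes the decision detail carries `interestId` and `berthId`, that the Drizzle tables are named `interestBerths` and `berths` as in the finding, and that `Tx` is the alias proposed in M4 below:

```ts
import { and, eq } from 'drizzle-orm';

// Hypothetical helper for the 'berth_released' reversal — sketch, not the current no-op.
async function reverseBerthRelease(tx: Tx, detail: { interestId: string; berthId: string }) {
  // Re-create the interest↔berth link that the archive removed.
  await tx.insert(interestBerths).values(detail).onConflictDoNothing();

  // Flip the berth back to under_offer only if nobody sold it in the meantime.
  await tx
    .update(berths)
    .set({ status: 'under_offer' })
    .where(and(eq(berths.id, detail.berthId), eq(berths.status, 'available')));
}
```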
### H2. Smart-archive berth status update has TOCTOU race

- **File:** `src/lib/services/client-archive.service.ts:191-207`
- **Scenario:** Berth row is read via `dossier.berths` (read outside the tx) and modified inside the tx without a `for update` lock on `berths`. Two concurrent flows — e.g. operator A archives client X while operator B sells berth A1 to client Y — can race: A reads `berth.status === 'sold' → false`, B's tx commits sold, A's tx then flips it back to `available`. The "still under offer" subselect doesn't catch this because berth.status is the source of truth, not interest_berths.
- **Fix:** Add `tx.select(...).from(berths).where(eq(berths.id, d.berthId)).for('update')` before the status flip and re-check `status !== 'sold'` against the locked row, as in the sketch below.

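A sketch of the locked re-check using the Drizzle call named in the fix; the surrounding transaction and the `d.berthId` variable are assumed from the finding:

```ts
import { eq } from 'drizzle-orm';

// Lock the berth row for the duration of the transaction, then re-check against the locked state.
const [lockedBerth] = await tx
  .select()
  .from(berths)
  .where(eq(berths.id, d.berthId))
  .for('update');

if (lockedBerth && lockedBerth.status !== 'sold') {
  await tx
    .update(berths)
    .set({ status: 'available' })
    .where(eq(berths.id, d.berthId));
}
```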
### H3. Bulk archive can pick the wrong interest for berth release

- **File:** `src/app/api/v1/clients/bulk/route.ts:95-103`
- **Scenario:** When a client has multiple interests linked to the same berth, the bulk wizard picks `dossier.interests.find((i) => i.primaryBerthMooring === b.mooringNumber)` and falls back to `dossier.interests[0]?.interestId ?? ''`. The fallback to the first-interest-or-empty-string can hand `archiveClientWithDecisions` an `interestId` that was never linked to that berth — so the `delete from interest_berths where berthId=… and interestId=…` matches zero rows and the link is silently retained. Worse: an empty string `''` reaches the delete, which still matches zero rows but leaves the berth status check believing the link was removed.
- **Fix:** Build the berth→interest map from `interestBerthRows` (the authoritative join) rather than guessing by `primaryBerthMooring`, and skip berths with no resolvable interest rather than emitting an empty-string interestId. A sketch follows.

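A sketch of deriving the mapping from the join table instead of the mooring-number guess. It assumes `interestBerthRows` is reachable from the dossier and exposes `berthId`/`interestId`; everything else is illustrative:

```ts
// Authoritative mapping: which interest actually holds each berth for this client.
const interestByBerth = new Map<string, string>(
  dossier.interestBerthRows.map((row) => [row.berthId, row.interestId]),
);

const releaseDecisions = dossier.berths.flatMap((b) => {
  const interestId = interestByBerth.get(b.berthId);
  // Skip berths with no resolvable interest instead of emitting an empty-string id.
  if (!interestId) return [];
  return [{ kind: 'berth_released' as const, berthId: b.berthId, interestId }];
});
```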
### H4. External EOI runs four writes outside a transaction

- **File:** `src/lib/services/external-eoi.service.ts:67-155`
- **Scenario:** `getStorageBackend().put()`, `files.insert`, `documents.insert`, `documentEvents.insert`, and the interests update happen as five independent operations. If any one fails after the storage upload, you're left with an orphan PDF in S3/MinIO and partial DB state. If the documents insert fails after the file insert, the file row points to a storage key with no document referencing it — and the interest never advances.
- **Fix:** Wrap files/documents/documentEvents/interests in a single `db.transaction`. Storage upload stays outside (S3 isn't transactional) but on tx failure, schedule a cleanup job that deletes the orphan storage object, or accept the orphan and add a janitor. A sketch follows.

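A sketch of the transactional wrapping with best-effort compensation. Apart from `db.transaction` and the table names already cited, the identifiers here (the `put`/`delete` signatures on the storage backend, the metadata objects, the stage value) are assumptions:

```ts
import { eq } from 'drizzle-orm';

// Upload first — S3/MinIO isn't transactional, so it can only be compensated, not rolled back.
const storageKey = await getStorageBackend().put(pdfBuffer, { contentType: 'application/pdf' });

try {
  await db.transaction(async (tx) => {
    const [file] = await tx.insert(files).values({ storageKey, ...fileMeta }).returning();
    const [doc] = await tx.insert(documents).values({ fileId: file.id, ...docMeta }).returning();
    await tx.insert(documentEvents).values({ documentId: doc.id, type: 'external_eoi_received' });
    await tx.update(interests).set({ pipelineStage: 'eoi_received' }).where(eq(interests.id, interestId));
  });
} catch (err) {
  // Best-effort compensation (hypothetical delete method): remove the orphan object, or enqueue a janitor job.
  await getStorageBackend().delete(storageKey).catch(() => {});
  throw err;
}
```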
### H5. Bulk wizard double-submit re-archives the same client and surfaces spurious errors

- **File:** `src/app/api/v1/clients/bulk/route.ts:68-120` + `src/lib/services/client-archive.service.ts:165-173`
- **Scenario:** The single-client `archiveClientWithDecisions` locks the row and throws `ConflictError('Client is already archived')` on re-entry — good. But `runBulk` swallows the error string and returns it as `{ok:false, error:"Client is already archived"}` for that client. If the bulk wizard double-submits (network retry, double click), partial successes from the first request now look like per-client failures in the response, confusing the operator. There's no idempotency key on the bulk submit.
- **Fix:** Treat `ConflictError('already archived')` as success in the bulk per-row handler (the desired end state is reached). Or add an idempotency-key header on the bulk endpoint that short-circuits a duplicate request with the cached response. The first option is sketched below.

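A sketch of the first option. The result shape, the `ConflictError` import, and the exact message check are illustrative (matching on a stable error code would be sturdier than matching the message string):

```ts
// Per-row callback passed to runBulk — sketch.
async function archiveRow(clientId: string) {
  try {
    const value = await archiveClientWithDecisions({ clientId, decisions, actorId });
    return { ok: true as const, value };
  } catch (err) {
    // Double-submit / retry: the desired end state already holds, so report success, not failure.
    if (err instanceof ConflictError && err.message === 'Client is already archived') {
      return { ok: true as const, value: null, alreadyArchived: true };
    }
    return { ok: false as const, error: err instanceof Error ? err.message : String(err) };
  }
}
```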
---
## Medium

### M1. Hard-delete `clientMergeLog.surviving_client_id` deletes audit history

- **File:** `src/lib/services/client-hard-delete.service.ts:209`
- **Scenario:** The comment says "merged records remain in the log because mergedClientId has no FK", but the delete is wider than needed: it removes every merge-log row where this client was the survivor. If client X (being deleted) previously absorbed clients A/B/C, the audit trail of those merges is lost on X's deletion. The rows that survive the delete (`mergedClientId = X`) are now inconsistent — they reference a client that no longer exists.
- **Fix:** Either preserve the survivor rows by setting `surviving_client_id = NULL` (requires column nullable) or keep the current behavior but document it more visibly. At minimum, log the deleted merge-log row count so operators can investigate gaps (sketched below).

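A sketch of the minimum-bar option, logging how many survivor-side rows get dropped. The Drizzle property name `survivingClientId` and the in-scope `logger`/`clientId` are assumptions; the nullable-column alternative would replace the delete with an update setting the column to null:

```ts
import { eq } from 'drizzle-orm';

// Delete survivor-side merge-log rows, but record how much audit history was dropped.
const deleted = await tx
  .delete(clientMergeLog)
  .where(eq(clientMergeLog.survivingClientId, clientId))
  .returning({ id: clientMergeLog.id });

logger.warn(
  { clientId, deletedMergeLogRows: deleted.length },
  'hard-delete removed survivor-side merge-log history',
);
```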
### M2. Documenso void worker has no max-retry guard for non-404 errors

- **File:** `src/lib/queue/workers/documents.ts:19-37`
- **Scenario:** `voidDocument` throws `CodedError` on non-404 failures (auth error, network blip, Documenso 500). BullMQ retries with backoff, but there's no per-job idempotency check — the second retry hits the same envelope, and `voidDocument`'s 404 short-circuit only helps if Documenso actually voided the envelope despite the earlier call returning an error. A persistent 401 / 403 keeps retrying until BullMQ exhausts attempts, and the documents row stays `cancelled` in the CRM with the envelope still live in Documenso. The DLQ is mentioned in the comment but the worker doesn't surface a DLQ alert hook.
- **Fix:** On exhaustion, write back to `documents` (e.g. `cancellation_failed=true`) and emit an admin notification so the envelope can be voided manually; a sketch follows.

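A sketch of an exhaustion hook on the BullMQ worker. The `cancellationFailed` column, the `createNotification` helper, and the worker wiring (`processDocumentJob`, `connection`) are assumptions; only the `failed` event, `attemptsMade`, and `opts.attempts` are standard BullMQ:

```ts
import { Worker } from 'bullmq';
import { eq } from 'drizzle-orm';

const worker = new Worker('documents', processDocumentJob, { connection });

// Runs on every failed attempt; only act once retries are exhausted.
worker.on('failed', async (job, err) => {
  if (!job || job.attemptsMade < (job.opts.attempts ?? 1)) return;
  await db
    .update(documents)
    .set({ cancellationFailed: true })
    .where(eq(documents.id, job.data.documentId));
  await createNotification({
    kind: 'documenso_void_failed',
    documentId: job.data.documentId,
    message: `Documenso void exhausted retries: ${err.message}`,
  });
});
```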
### M3. Next-in-line notification fan-out unhandled rejection

- **File:** `src/lib/services/next-in-line-notify.service.ts:75-87`
- **Scenario:** Each `void createNotification(...)` is a fire-and-forget promise with no `.catch` handler. If `notifications.service` dispatches to a DB that's transiently down, the unhandled rejection will surface in the Node process with no recipient context (the closure-captured `userId` is in the stack, but pino won't include it unless explicitly logged). Process-level handlers will log it, but individual recipients silently lose their notification.
- **Fix:** `.catch((err) => logger.warn({err, userId, berthId: input.berthId}, 'next-in-line notification failed'))`.

### M4. Restore service uses `any` for transaction type

- **File:** `src/lib/services/client-restore.service.ts:354-355`
- **Scenario:** `applyReversal(tx: any, ...)` defeats Drizzle's type safety. A future schema rename (e.g. `yachts.status` enum change) won't fail at compile time inside this function. Combined with the documented v1 no-op for `berth_released`, the function looks innocuous but carries the most risk.
- **Fix:** Use the proper Drizzle tx type — `Parameters<Parameters<typeof db.transaction>[0]>[0]` or a named type alias from `@/lib/db/types.ts` if one exists. A sketch follows.

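A small sketch of the named alias. The `db` import path and the `ReversalDecision` parameter type are assumptions:

```ts
import { db } from '@/lib/db';

// Derive the transaction handle type from db.transaction itself, so schema changes stay type-checked.
export type Tx = Parameters<Parameters<typeof db.transaction>[0]>[0];

async function applyReversal(tx: Tx, decision: ReversalDecision): Promise<void> {
  // ...existing switch on decision.type, now fully typed against the Drizzle schema
}
```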
### M5. interests.changeInterestStage milestones write outside tx

- **File:** `src/lib/services/interests.service.ts:630-648`
- **Scenario:** The override path (and normal path) writes `pipelineStage` in one update and milestone fields (`dateEoiSent`, `dateContractSigned`, etc.) in a second update. If the process crashes between the two, the stage advances but the milestone is never recorded. Funnel/conversion math then undercounts that interest. In practice this is rare, but the audit log fires before the milestone update succeeds, so the audit trail claims a complete transition that's actually half-applied.
- **Fix:** Combine both into a single update statement, computing the milestone fields in JS and merging them into the `set({...})` clause, as sketched below.

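A sketch of collapsing the two writes into one. The `milestoneColumnForStage` helper is hypothetical, and `newStage`/`interestId` are assumed to be in scope:

```ts
import { eq } from 'drizzle-orm';

// Map the target stage to its milestone column, then issue a single atomic update.
const milestoneColumn = milestoneColumnForStage(newStage); // e.g. 'dateEoiSent' | 'dateContractSigned' | null

await tx
  .update(interests)
  .set({
    pipelineStage: newStage,
    ...(milestoneColumn ? { [milestoneColumn]: new Date() } : {}),
  })
  .where(eq(interests.id, interestId));
```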
---
## Low

### L1. Smart-archive coalesces invoice notes via SQL string concat

- **File:** `src/lib/services/client-archive.service.ts:288-291`
- **Scenario:** `notes: sql\`coalesce(${invoices.notes}, '') || ${...}\`` embeds `new Date().toISOString()` and the action label inside a parameterized string. The values are bound, so it's not an injection risk, but the `\n[archive ...]` marker is appended unconditionally — re-running the archive on a not-yet-committed client would double the marker. Combined with H5 (no idempotency on bulk), a retry could bloat invoice notes with duplicate markers.
- **Fix:** Append only when the marker isn't already present (sketched below), or rely on the `clients.archivedAt is null` precheck (which already guards re-entry) and accept the duplicate as theoretically impossible.

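A sketch of the guarded append, reusing the coalesce-based update from the finding. The `like '%[archive%'` guard, the marker text, and the in-scope `actionLabel`/`clientId`/`tx` are illustrative:

```ts
import { eq, sql } from 'drizzle-orm';

const marker = `\n[archive ${new Date().toISOString()}] ${actionLabel}`;

// Only append when no archive marker is present yet, so a retried archive can't stack duplicates.
await tx
  .update(invoices)
  .set({
    notes: sql`case
      when coalesce(${invoices.notes}, '') like '%[archive%' then ${invoices.notes}
      else coalesce(${invoices.notes}, '') || ${marker}
    end`,
  })
  .where(eq(invoices.clientId, clientId));
```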
### L2. Hard-delete `requestHardDeleteCode` reveals client existence pre-archive

- **File:** `src/lib/services/client-hard-delete.service.ts:77-85`
- **Scenario:** A user without `admin.permanently_delete_clients` shouldn't reach this service, so this is theoretical, but the ConflictError "Client must be archived" leaks the existence of an unarchived client to anyone who can reach the route. The audit doc flagged hard-delete error messages already (out of scope), but this specific error path isn't covered there.
- **Fix:** Same as the audit-doc finding for the symmetric path — return a generic `NotFoundError` instead of distinguishing "not found" from "not archived" externally; log the distinction internally only.