# Reliability audit — 2026-05-06 (focused, post-batch deltas)
Scope: NEW services from the recent archive/restore/hard-delete/external-EOI batches.
Out of scope (already covered in `docs/audit-comprehensive-2026-05-06.md`):
worker imports, rate limits, hard-delete error message UX, smart-restore
dead reversal applier, bulk hard-delete redis loop, audit log spam.
---
## Critical
### C1. Bulk archive enqueues zero post-commit side effects
- **File:** `src/app/api/v1/clients/bulk/route.ts:68-134`
- **Scenario:** When the bulk wizard archives 100 clients with high-stakes
reasons, `archiveClientWithDecisions` returns `externalCleanups` and
`releasedBerths` arrays per-client, but `runBulk` discards the return
value. Documenso envelopes that the wizard marked `void_documenso`
never get queued, and "next-in-line" notifications never fire. The
database is left in `documents.status='cancelled'` with the live
Documenso envelope still out for signature — the signer can complete
a legally-binding envelope that the CRM thinks is voided.
- **Fix:** Make the per-row callback return the result, then loop over
`results` after `runBulk` to enqueue Documenso voids and fire
next-in-line notifications (mirroring the single-client route).
Defaulting `documentDecisions` to `'leave'` (line 113-116) hides the
symptom for the bulk wizard but isn't enough — the single-client
service can still surface this if the bulk path is ever generalized.
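
A rough sketch of the missing post-commit loop, assuming `runBulk` can hand back each callback's return value; the result/cleanup shapes and the `enqueueDocumensoVoid` / `notifyNextInLine` helpers are placeholders for whatever the single-client route already does:

```ts
// Sketch only: collect the per-client results instead of discarding them,
// then run the same post-commit side effects as the single-client route.
const results = await runBulk(clientIds, (clientId) =>
  archiveClientWithDecisions(clientId, decisions, actor),
);

for (const result of results) {
  if (!result) continue; // failed rows carry no cleanup work
  for (const cleanup of result.externalCleanups) {
    if (cleanup.kind === 'void_documenso') {
      await enqueueDocumensoVoid(cleanup.documentId); // assumed helper
    }
  }
  for (const berth of result.releasedBerths) {
    await notifyNextInLine(berth.berthId); // assumed helper
  }
}
```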
---
## High
### H1. Restore wizard silently drops every released berth
- **File:** `src/lib/services/client-restore.service.ts:359-372`
- **Scenario:** `applyReversal` for `berth_released` is a no-op with a
comment saying "v1 leaves the berth available". But the dossier (line
122-129) classifies these as `autoReversible` and the UI tells the
operator "still available — re-attaching to the restored client". The
wizard increments `autoReversed` and the audit log records a
successful auto-reverse — but nothing actually happens. Operator
thinks restore re-linked their berth; it didn't.
- **Fix:** Either (a) actually re-link by persisting the original
`interestId` in the `berth_released` decision detail (it's already
there, line 211) and re-inserting an `interestBerths` row + flipping
the berth status back to `under_offer`, or (b) reclassify these as
`reversibleWithPrompt` with copy that says "berth left available —
re-add via the interest detail page".
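
If option (a) is taken, the `berth_released` branch of `applyReversal` could look roughly like this (a sketch; the decision-detail shape and status value come from the finding above, the rest is assumed):

```ts
// Sketch: inside applyReversal's switch, re-link the berth instead of a no-op.
case 'berth_released': {
  // interestId/berthId were persisted in the decision detail at archive time.
  const { interestId, berthId } = decision.detail;
  await tx.insert(interestBerths).values({ interestId, berthId });
  await tx
    .update(berths)
    .set({ status: 'under_offer' })
    .where(eq(berths.id, berthId));
  break;
}
```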
### H2. Smart-archive berth status update has TOCTOU race
- **File:** `src/lib/services/client-archive.service.ts:191-207`
- **Scenario:** Berth row is read via `dossier.berths` (read outside the
tx) and modified inside the tx without a `for update` lock on
`berths`. Two concurrent flows — e.g. operator A archives client X
while operator B sells berth A1 to client Y — can race: A reads
`berth.status === 'sold' → false`, B's tx commits sold, A's tx then
flips it back to `available`. The "still under offer" subselect
doesn't catch this because berth.status is the source of truth, not
interest_berths.
- **Fix:** Add `tx.select(...).from(berths).where(eq(berths.id, d.berthId)).for('update')`
before the status flip and re-check `status !== 'sold'` against the
locked row.
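
Roughly, inside the archive transaction (Drizzle's `.for('update')` emits `select … for update`; column names as in the finding):

```ts
// Sketch: lock the berth row, then decide off the locked value rather than
// the dossier snapshot that was read before the transaction started.
const [locked] = await tx
  .select({ status: berths.status })
  .from(berths)
  .where(eq(berths.id, d.berthId))
  .for('update');

if (locked && locked.status !== 'sold') {
  await tx
    .update(berths)
    .set({ status: 'available' })
    .where(eq(berths.id, d.berthId));
}
```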
### H3. Bulk archive can pick the wrong interest for berth release
- **File:** `src/app/api/v1/clients/bulk/route.ts:95-103`
- **Scenario:** When a client has multiple interests linked to the same
berth, the bulk wizard picks `dossier.interests.find((i) =>
i.primaryBerthMooring === b.mooringNumber)` and falls back to
`dossier.interests[0]?.interestId ?? ''`. The fallback to the
first-interest-or-empty-string can hand `archiveClientWithDecisions`
an `interestId` that was never linked to that berth — so the
`delete from interest_berths where berthId=… and interestId=…`
matches zero rows and the link is silently retained. Worse: an empty
string `''` reaches the delete, which still matches zero rows but
leaves the berth status check believing the link was removed.
- **Fix:** Build the berth→interest map from `interestBerthRows` (the
authoritative join) rather than guessing by `primaryBerthMooring`,
and skip berths with no resolvable interest rather than emitting an
empty-string interestId.
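
Sketch of the map-based resolution, assuming `interestBerthRows` exposes `berthId` and `interestId` (the decision shape is illustrative):

```ts
// Sketch: resolve interests from the authoritative join rows and skip
// berths with no link instead of emitting an empty-string interestId.
const interestByBerth = new Map(
  interestBerthRows.map((row) => [row.berthId, row.interestId]),
);

const berthDecisions = dossier.berths.flatMap((b) => {
  const interestId = interestByBerth.get(b.berthId);
  if (!interestId) return []; // no resolvable link: leave it alone
  return [{ berthId: b.berthId, interestId, action: 'release' as const }];
});
```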
### H4. External EOI runs four DB writes outside a transaction
- **File:** `src/lib/services/external-eoi.service.ts:67-155`
- **Scenario:** `getStorageBackend().put()`, `files.insert`,
`documents.insert`, `documentEvents.insert`, and the interests
update happen as five independent operations. If any one fails after
the storage upload, you're left with an orphan PDF in S3/MinIO and
partial DB state. If the documents insert fails after the file
insert, the file row points to a storage key with no document
referencing it — and the interest never advances.
- **Fix:** Wrap files/documents/documentEvents/interests in a single
`db.transaction`. Storage upload stays outside (S3 isn't
transactional) but on tx failure, schedule a cleanup job that deletes
the orphan storage object, or accept the orphan and add a janitor.
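
A sketch of that shape, assuming the storage backend exposes a `delete` for the cleanup path and `put(key, buffer)` as the upload signature (columns abbreviated):

```ts
import { eq } from 'drizzle-orm';

// Sketch: upload first, then make the four DB writes atomic; best-effort
// cleanup of the orphan object if the transaction fails.
const storage = getStorageBackend();
await storage.put(storageKey, pdfBuffer); // signature assumed

try {
  await db.transaction(async (tx) => {
    const [file] = await tx.insert(files).values({ storageKey /* … */ }).returning();
    const [doc] = await tx
      .insert(documents)
      .values({ fileId: file.id /* … */ })
      .returning();
    await tx.insert(documentEvents).values({ documentId: doc.id /* … */ });
    await tx
      .update(interests)
      .set({ pipelineStage: 'eoi_received' /* assumed value */ })
      .where(eq(interests.id, interestId));
  });
} catch (err) {
  await storage.delete(storageKey).catch(() => undefined); // assumed method
  throw err;
}
```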
### H5. Bulk wizard double-submit reports already-archived clients as failures
- **File:** `src/app/api/v1/clients/bulk/route.ts:68-120` +
`src/lib/services/client-archive.service.ts:165-173`
- **Scenario:** The single-client `archiveClientWithDecisions` locks
the row and throws `ConflictError('Client is already archived')` on
re-entry — good. But `runBulk` swallows the error string and returns
it as `{ok:false, error:"Client is already archived"}` for that
client. If the bulk wizard double-submits (network retry, double
click), partial successes from the first request now look like
per-client failures in the response, confusing the operator. There's
no idempotency key on the bulk submit.
- **Fix:** Treat `ConflictError('already archived')` as success in the
bulk per-row handler (the desired end state is reached). Or add an
idempotency-key header on the bulk endpoint that short-circuits a
duplicate request with the cached response.
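
Sketch of the first option in the per-row callback (the `{ ok, … }` row shape is illustrative):

```ts
// Sketch: an already-archived client is the desired end state, so a retry
// (double click, network replay) should not be reported as a failure.
try {
  const result = await archiveClientWithDecisions(clientId, decisions, actor);
  return { ok: true, result };
} catch (err) {
  if (err instanceof ConflictError && /already archived/i.test(err.message)) {
    return { ok: true, alreadyArchived: true };
  }
  throw err; // genuine failures still surface through runBulk
}
```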
---
## Medium
### M1. Hard-delete `clientMergeLog.surviving_client_id` deletes audit history
- **File:** `src/lib/services/client-hard-delete.service.ts:209`
- **Scenario:** The comment says "merged records remain in the log
because mergedClientId has no FK", but the delete is wider than
needed: it removes every merge-log row where this client was the
survivor. If client X (being deleted) previously absorbed clients
A/B/C, the audit trail of those merges is lost on X's deletion. The
rows that do remain (`mergedClientId = X`) are now inconsistent:
they reference a client that no longer exists.
- **Fix:** Either preserve the survivor rows by setting
`surviving_client_id = NULL` (requires column nullable) or keep the
current behavior but document it more visibly. At minimum, log the
deleted merge-log row count so operators can investigate gaps.
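
If the current behavior is kept, logging the count is cheap (sketch; assumes the Drizzle table object is `clientMergeLog` with a `survivingClientId` column):

```ts
import { eq } from 'drizzle-orm';

// Sketch: surface how many survivor-side merge-log rows the hard delete drops.
const deleted = await tx
  .delete(clientMergeLog)
  .where(eq(clientMergeLog.survivingClientId, clientId))
  .returning({ id: clientMergeLog.id });

logger.info(
  { clientId, deletedMergeLogRows: deleted.length },
  'hard-delete removed survivor-side merge-log rows',
);
```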
### M2. Documenso void worker has no max-retry guard for non-404 errors
- **File:** `src/lib/queue/workers/documents.ts:19-37`
- **Scenario:** `voidDocument` throws `CodedError` on non-404 failures
(auth error, network blip, Documenso 500). BullMQ retries with
backoff, but there's no per-job idempotency check: each retry hits
the same envelope, and voidDocument's 404 short-circuit only kicks
in if an earlier attempt actually voided it before the API call
returned an error. A persistent 401/403 keeps retrying until BullMQ
exhausts its attempts, and the documents row stays `cancelled` in
the CRM with the envelope still live in
Documenso. The DLQ is mentioned in the comment but the worker
doesn't surface a DLQ alert hook.
- **Fix:** On exhaustion, write back to `documents` (e.g.
`cancellation_failed=true`) and emit an admin notification so the
envelope can be voided manually.
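
One way to hook exhaustion with BullMQ's `failed` event (the `cancellationFailed` column and the notification helper/shape are assumptions):

```ts
// Sketch: once BullMQ has used up its attempts, persist the failure and
// alert an admin instead of leaving the envelope silently live.
worker.on('failed', async (job, err) => {
  if (!job) return;
  if (job.attemptsMade < (job.opts.attempts ?? 1)) return; // still retrying

  await db
    .update(documents)
    .set({ cancellationFailed: true }) // assumed column
    .where(eq(documents.id, job.data.documentId));

  await createNotification({
    type: 'documenso_void_failed', // assumed notification type
    documentId: job.data.documentId,
    message: err.message,
  });
});
```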
### M3. Next-in-line notification fan-out unhandled rejection
- **File:** `src/lib/services/next-in-line-notify.service.ts:75-87`
- **Scenario:** Each `void createNotification(...)` is a fire-and-forget
promise with no `.catch` handler. If `notifications.service`
dispatches to a DB that's transiently down, the unhandled rejection
will surface in the Node process with no recipient context (the
closure-captured `userId` is in scope but pino won't include it
unless explicitly logged). Process-level handlers will log it but
individual recipients silently lose their notification.
- **Fix:** `.catch((err) => logger.warn({err, userId, berthId:
input.berthId}, 'next-in-line notification failed'))`.
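
In context, roughly (the surrounding loop variables are assumed):

```ts
// Sketch: keep the fire-and-forget fan-out, but record failures with enough
// context to identify the recipient who lost their notification.
for (const userId of recipientIds) {
  void createNotification({ userId, berthId: input.berthId /* … */ }).catch((err) =>
    logger.warn(
      { err, userId, berthId: input.berthId },
      'next-in-line notification failed',
    ),
  );
}
```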
### M4. Restore service uses `any` for transaction type
- **File:** `src/lib/services/client-restore.service.ts:354-355`
- **Scenario:** `applyReversal(tx: any, ...)` defeats Drizzle's type
safety. A future schema rename (e.g. `yachts.status` enum change)
won't fail at compile time inside this function. Combined with the
documented v1 no-op for `berth_released`, the function looks
innocuous but carries the most risk.
- **Fix:** Use the proper Drizzle tx type — `Parameters<Parameters<typeof
db.transaction>[0]>[0]` or a named type alias from
`@/lib/db/types.ts` if one exists.
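
A minimal alias sketch (the import path and `ArchiveDecision` type are placeholders):

```ts
import { db } from '@/lib/db'; // assumed path

// Sketch: derive the transaction type from db.transaction's callback so
// schema changes fail at compile time inside applyReversal.
type Tx = Parameters<Parameters<typeof db.transaction>[0]>[0];

async function applyReversal(tx: Tx, decision: ArchiveDecision): Promise<void> {
  // …typed tx.update(...) / tx.insert(...) calls as before
}
```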
### M5. interests.changeInterestStage milestones write outside tx
- **File:** `src/lib/services/interests.service.ts:630-648`
- **Scenario:** The override path (and normal path) writes
`pipelineStage` in one update and milestone fields
(`dateEoiSent`, `dateContractSigned`, etc.) in a second update. If
the process crashes between the two, the stage advances but the
milestone is never recorded. Funnel/conversion math then under-
counts that interest. In practice this is rare, but the audit log
fires before the milestone update succeeds, so the audit trail
claims a complete transition that's actually half-applied.
- **Fix:** Combine both into a single update statement, computing the
milestone fields in JS and merging them into the `set({...})` clause.
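
Sketch of the merged update (the stage values and the stage-to-milestone mapping are illustrative):

```ts
import { eq } from 'drizzle-orm';

// Sketch: compute the milestone column for the new stage in JS and write it
// in the same statement as the stage change, so both land or neither does.
const milestonePatch: Partial<typeof interests.$inferInsert> =
  newStage === 'eoi_sent'
    ? { dateEoiSent: new Date() }
    : newStage === 'contract_signed'
      ? { dateContractSigned: new Date() }
      : {};

await db
  .update(interests)
  .set({ pipelineStage: newStage, ...milestonePatch })
  .where(eq(interests.id, interestId));
```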
---
## Low
### L1. Smart-archive coalesces invoice notes via SQL string concat
- **File:** `src/lib/services/client-archive.service.ts:288-291`
- **Scenario:** `notes: sql\`coalesce(${invoices.notes}, '') || ${...}\`` embeds `new Date().toISOString()` and the action label inside a
parameterized string. The values are bound, so it's not an injection
risk, but the `\n[archive ...]` marker is appended unconditionally —
re-running the archive on a not-yet-committed client would double
the marker. Combined with H5 (no idempotency on bulk), a retry could
bloat invoice notes with duplicate markers.
- **Fix:** Append only when the marker isn't already present, or rely
on the `clients.archivedAt is null` precheck (which already guards
re-entry) and accept the duplicate as theoretically impossible.
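
If deduplication is wanted, the append can be made conditional in the same statement (sketch; `actionLabel` and the where clause stand in for whatever the service actually builds):

```ts
// Sketch: only append the archive marker when a marker carrying this action
// label is not already present in the notes.
const marker = `\n[archive ${new Date().toISOString()}] ${actionLabel}`;

await tx
  .update(invoices)
  .set({
    notes: sql`case
      when coalesce(${invoices.notes}, '') like ${'%' + actionLabel + '%'}
        then ${invoices.notes}
      else coalesce(${invoices.notes}, '') || ${marker}
    end`,
  })
  .where(eq(invoices.clientId, clientId));
```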
### L2. Hard-delete `requestHardDeleteCode` reveals client existence pre-archive
- **File:** `src/lib/services/client-hard-delete.service.ts:77-85`
- **Scenario:** A user without `admin.permanently_delete_clients`
shouldn't reach this service, so this is theoretical, but the
ConflictError "Client must be archived" leaks the existence of an
unarchived client to anyone who can reach the route. The audit doc
flagged hard-delete error messages already (out of scope), but this
specific error path isn't covered there.
- **Fix:** Same as the audit-doc finding for the symmetric path —
return a generic `NotFoundError` instead of distinguishing
"not found" from "not archived" externally; log the distinction
internally only.
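
Sketch of the collapsed error path (assumes pino-style structured logging):

```ts
// Sketch: collapse "not found" and "not archived yet" into one external
// error; keep the real reason in the internal log only.
if (!client || client.archivedAt === null) {
  logger.info(
    { clientId, reason: client ? 'not_archived' : 'not_found' },
    'hard-delete code request rejected',
  );
  throw new NotFoundError('Client not found');
}
```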