Reliability audit — 2026-05-06 (focused, post-batch deltas)

Scope: NEW services from the recent archive/restore/hard-delete/external-EOI batches. Out of scope (already covered in docs/audit-comprehensive-2026-05-06.md): worker imports, rate limits, hard-delete error message UX, smart-restore dead reversal applier, bulk hard-delete redis loop, audit log spam.


Critical

C1. Bulk archive enqueues zero post-commit side effects

  • File: src/app/api/v1/clients/bulk/route.ts:68-134
  • Scenario: When the bulk wizard archives 100 clients with high-stakes reasons, archiveClientWithDecisions returns externalCleanups and releasedBerths arrays per-client, but runBulk discards the return value. Documenso envelopes that the wizard marked void_documenso never get queued, and "next-in-line" notifications never fire. The database is left in documents.status='cancelled' with the live Documenso envelope still out for signature — the signer can complete a legally-binding envelope that the CRM thinks is voided.
  • Fix: Make the per-row callback return the result, then loop over the results after runBulk to enqueue Documenso voids and fire next-in-line notifications (mirroring the single-client route). Defaulting documentDecisions to 'leave' (lines 113-116) hides the symptom for the bulk wizard but isn't enough — the single-client service can still surface this if the bulk path is ever generalized.
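
A minimal sketch of that shape inside the bulk route handler. runBulk, decisionsFor, documentsQueue, notifyNextInLine and the result fields are assumptions about what's in scope, not verified exports:

```ts
// Sketch only — names and shapes below are illustrative for this codebase.
const results = await runBulk(clientIds, async (clientId) => {
  // Return the result instead of discarding it so the side effects survive the loop.
  return archiveClientWithDecisions(clientId, decisionsFor(clientId));
});

// Post-commit side effects, mirroring the single-client route.
for (const row of results) {
  if (!row.ok || !row.value) continue;
  for (const cleanup of row.value.externalCleanups) {
    await documentsQueue.add('void-documenso', { documentId: cleanup.documentId });
  }
  for (const berth of row.value.releasedBerths) {
    await notifyNextInLine({ berthId: berth.berthId });
  }
}
```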

High

H1. Restore wizard silently drops every released berth

  • File: src/lib/services/client-restore.service.ts:359-372
  • Scenario: applyReversal for berth_released is a no-op with a comment saying "v1 leaves the berth available". But the dossier (lines 122-129) classifies these as autoReversible and the UI tells the operator "still available — re-attaching to the restored client". The wizard increments autoReversed and the audit log records a successful auto-reverse — but nothing actually happens. The operator thinks the restore re-linked their berth; it didn't.
  • Fix: Either (a) actually re-link by persisting the original interestId in the berth_released decision detail (it's already there, line 211) and re-inserting an interestBerths row + flipping the berth status back to under_offer, or (b) reclassify these as reversibleWithPrompt with copy that says "berth left available — re-add via the interest detail page".
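
A sketch of option (a), assuming the decision detail still carries interestId and berthId and that interestBerths / berths are the Drizzle tables involved (status value and conflict guard are illustrative):

```ts
import { and, eq } from 'drizzle-orm';

// Sketch of applyReversal for 'berth_released'; table and column names are
// assumptions about this schema, not verified exports.
async function reverseBerthRelease(
  tx: Tx, // Drizzle transaction type (see M4)
  detail: { berthId: string; interestId: string },
) {
  // Re-create the interest↔berth link that the archive removed.
  await tx
    .insert(interestBerths)
    .values({ interestId: detail.interestId, berthId: detail.berthId })
    .onConflictDoNothing();

  // Only pull the berth back to under_offer if nobody sold it in the meantime.
  await tx
    .update(berths)
    .set({ status: 'under_offer' })
    .where(and(eq(berths.id, detail.berthId), eq(berths.status, 'available')));
}
```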

H2. Smart-archive berth status update has TOCTOU race

  • File: src/lib/services/client-archive.service.ts:191-207
  • Scenario: Berth row is read via dossier.berths (read outside the tx) and modified inside the tx without a for update lock on berths. Two concurrent flows — e.g. operator A archives client X while operator B sells berth A1 to client Y — can race: A reads berth.status === 'sold' → false, B's tx commits sold, A's tx then flips it back to available. The "still under offer" subselect doesn't catch this because berth.status is the source of truth, not interest_berths.
  • Fix: Add tx.select(...).from(berths).where(eq(berths.id, d.berthId)).for('update') before the status flip and re-check status !== 'sold' against the locked row.
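
A sketch of the lock-then-recheck, assuming the Postgres flavor of Drizzle (which supports .for('update')) and the berths columns named elsewhere in this doc:

```ts
import { eq } from 'drizzle-orm';

// Inside the archive transaction, for each berth decision `d`:
const [lockedBerth] = await tx
  .select({ id: berths.id, status: berths.status })
  .from(berths)
  .where(eq(berths.id, d.berthId))
  .for('update'); // a concurrent sale now blocks until this tx commits (or vice versa)

// Re-check against the locked row, not the stale dossier snapshot.
if (lockedBerth && lockedBerth.status !== 'sold') {
  await tx.update(berths).set({ status: 'available' }).where(eq(berths.id, d.berthId));
}
```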

H3. Bulk archive can pick the wrong interest for berth release

  • File: src/app/api/v1/clients/bulk/route.ts:95-103
  • Scenario: When a client has multiple interests linked to the same berth, the bulk wizard picks dossier.interests.find((i) => i.primaryBerthMooring === b.mooringNumber) and falls back to dossier.interests[0]?.interestId ?? ''. The fallback to the first-interest-or-empty-string can hand archiveClientWithDecisions an interestId that was never linked to that berth — so the delete from interest_berths where berthId=… and interestId=… matches zero rows and the link is silently retained. Worse: an empty string '' reaches the delete, which still matches zero rows but leaves the berth status check believing the link was removed.
  • Fix: Build the berth→interest map from interestBerthRows (the authoritative join) rather than guessing by primaryBerthMooring, and skip berths with no resolvable interest rather than emitting an empty-string interestId.
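
A sketch of that mapping, assuming interestBerthRows is already loaded with the dossier and exposes berthId/interestId pairs (names illustrative):

```ts
// Build the berth → interest lookup from the authoritative join-table rows.
const interestByBerth = new Map<string, string>();
for (const row of interestBerthRows) {
  if (!interestByBerth.has(row.berthId)) {
    interestByBerth.set(row.berthId, row.interestId);
  }
}

// Skip berths with no resolvable interest instead of emitting ''.
const berthDecisions = dossier.berths.flatMap((b) => {
  const interestId = interestByBerth.get(b.berthId);
  return interestId ? [{ berthId: b.berthId, interestId }] : [];
});
```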

H4. External EOI runs four writes outside a transaction

  • File: src/lib/services/external-eoi.service.ts:67-155
  • Scenario: getStorageBackend().put(), files.insert, documents.insert, documentEvents.insert, and the interests update happen as five independent operations. If any one fails after the storage upload, you're left with an orphan PDF in S3/MinIO and partial DB state. If the documents insert fails after the file insert, the file row points to a storage key with no document referencing it — and the interest never advances.
  • Fix: Wrap files/documents/documentEvents/interests in a single db.transaction. Storage upload stays outside (S3 isn't transactional) but on tx failure, schedule a cleanup job that deletes the orphan storage object, or accept the orphan and add a janitor.
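
A sketch of the transactional shape. The column names, stage value, storage-key scheme, and the delete method on the storage backend are all illustrative — the real backend may need a different cleanup hook:

```ts
import { eq } from 'drizzle-orm';

const storageKey = `external-eoi/${interestId}/${fileName}`; // illustrative key scheme
await getStorageBackend().put(storageKey, pdfBuffer); // S3/MinIO — not transactional

try {
  await db.transaction(async (tx) => {
    const [file] = await tx.insert(files).values({ storageKey /* ... */ }).returning();
    const [doc] = await tx.insert(documents).values({ fileId: file.id /* ... */ }).returning();
    await tx.insert(documentEvents).values({ documentId: doc.id, type: 'external_eoi_received' });
    await tx
      .update(interests)
      .set({ pipelineStage: 'eoi_signed' }) // stage value illustrative
      .where(eq(interests.id, interestId));
  });
} catch (err) {
  // Best-effort cleanup of the orphan object; a janitor job is the fallback.
  await getStorageBackend().delete(storageKey).catch(() => {});
  throw err;
}
```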

H5. Bulk wizard double-submit turns already-archived clients into spurious per-client errors

  • File: src/app/api/v1/clients/bulk/route.ts:68-120 + src/lib/services/client-archive.service.ts:165-173
  • Scenario: The single-client archiveClientWithDecisions locks the row and throws ConflictError('Client is already archived') on re-entry — good. But runBulk swallows the error string and returns it as {ok:false, error:"Client is already archived"} for that client. If the bulk wizard double-submits (network retry, double click), partial successes from the first request now look like per-client failures in the response, confusing the operator. There's no idempotency key on the bulk submit.
  • Fix: Treat ConflictError('already archived') as success in the bulk per-row handler (the desired end state is reached). Or add an idempotency-key header on the bulk endpoint that short-circuits a duplicate request with the cached response.
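
A sketch of the first option, assuming ConflictError is the class the single-client service throws and that the per-row handler returns the {ok, …} shape described above:

```ts
// Per-row callback inside the bulk route.
try {
  const result = await archiveClientWithDecisions(clientId, decisions);
  return { ok: true as const, value: result };
} catch (err) {
  if (err instanceof ConflictError && /already archived/i.test(err.message)) {
    // The desired end state already holds — report success, not a failure.
    return { ok: true as const, value: null, alreadyArchived: true };
  }
  return { ok: false as const, error: err instanceof Error ? err.message : 'Unknown error' };
}
```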

Medium

M1. Hard-delete of clientMergeLog rows by surviving_client_id erases merge audit history

  • File: src/lib/services/client-hard-delete.service.ts:209
  • Scenario: The comment says "merged records remain in the log because mergedClientId has no FK", but the delete is wider than needed: it removes every merge-log row where this client was the survivor. If client X (being deleted) previously absorbed clients A/B/C, the audit trail of those merges is lost on X's deletion. The rows that do remain (mergedClientId = X) are now inconsistent — they reference a merged client that no longer exists.
  • Fix: Either preserve the survivor rows by setting surviving_client_id = NULL (requires column nullable) or keep the current behavior but document it more visibly. At minimum, log the deleted merge-log row count so operators can investigate gaps.
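
A minimal sketch of the "at minimum" option — count and log what was removed. The camelCase column names on clientMergeLog are guesses at the schema:

```ts
import { eq } from 'drizzle-orm';

const deletedMergeRows = await tx
  .delete(clientMergeLog)
  .where(eq(clientMergeLog.survivingClientId, clientId))
  .returning({ id: clientMergeLog.id });

if (deletedMergeRows.length > 0) {
  logger.warn(
    { clientId, deletedMergeLogRows: deletedMergeRows.length },
    'hard-delete removed merge-log rows where this client was the survivor',
  );
}
```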

M2. Documenso void worker has no max-retry guard for non-404 errors

  • File: src/lib/queue/workers/documents.ts:19-37
  • Scenario: voidDocument throws CodedError on non-404 failures (auth error, network blip, Documenso 500). BullMQ retries with backoff, but there's no per-job idempotency check — each retry hits the same envelope, and voidDocument's 404 short-circuit only helps if Documenso actually voided the envelope on an earlier attempt before the API call returned an error. A persistent 401/403 keeps retrying until BullMQ exhausts its attempts, and the documents row stays cancelled in the CRM with the envelope still live in Documenso. The DLQ is mentioned in the comment but the worker doesn't surface a DLQ alert hook.
  • Fix: On exhaustion, write back to documents (e.g. cancellation_failed=true) and emit an admin notification so the envelope can be voided manually.
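
One way to surface exhaustion via BullMQ's failed event, assuming the existing worker instance is in scope; the cancellationFailed column and notifyAdmins helper are illustrative names, not existing exports:

```ts
import { eq } from 'drizzle-orm';

worker.on('failed', async (job, err) => {
  if (!job) return;
  const attemptsAllowed = job.opts.attempts ?? 1;
  if (job.attemptsMade < attemptsAllowed) return; // retries remain — let BullMQ back off

  const { documentId } = job.data as { documentId: string };
  // Mark the row so the UI shows "void failed" instead of a silently-live envelope.
  await db.update(documents).set({ cancellationFailed: true }).where(eq(documents.id, documentId));
  await notifyAdmins({
    type: 'documenso_void_failed',
    documentId,
    message: `Documenso void exhausted retries: ${err.message}`,
  });
});
```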

M3. Next-in-line notification fan-out unhandled rejection

  • File: src/lib/services/next-in-line-notify.service.ts:75-87
  • Scenario: Each createNotification(...) call is dispatched with the void operator — a fire-and-forget promise with no .catch handler. If notifications.service dispatches to a DB that's transiently down, the failure surfaces as an unhandled rejection in the Node process with no recipient context (the closure-captured userId is in scope, but pino won't include it unless it's explicitly logged). Process-level handlers will log the rejection, but individual recipients silently lose their notification.
  • Fix: Chain .catch((err) => logger.warn({ err, userId, berthId: input.berthId }, 'next-in-line notification failed')) onto each dispatch so failures are logged with recipient context.

M4. Restore service uses any for transaction type

  • File: src/lib/services/client-restore.service.ts:354-355
  • Scenario: applyReversal(tx: any, ...) defeats Drizzle's type safety. A future schema rename (e.g. yachts.status enum change) won't fail at compile time inside this function. Combined with the documented v1 no-op for berth_released, the function looks innocuous but carries the most risk.
  • Fix: Use the proper Drizzle tx type — Parameters<Parameters<typeof db.transaction>[0]>[0] or a named type alias from @/lib/db/types.ts if one exists.
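
A sketch of the alias; the import path and alias name are suggestions (a named type may already exist under src/lib/db/):

```ts
import { db } from '@/lib/db';

// The transaction object Drizzle passes to db.transaction's callback.
export type Tx = Parameters<Parameters<typeof db.transaction>[0]>[0];

// applyReversal(tx: Tx, ...) now fails to compile if a schema rename breaks
// any query built inside the function.
```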

M5. interests.changeInterestStage milestones write outside tx

  • File: src/lib/services/interests.service.ts:630-648
  • Scenario: The override path (and normal path) writes pipelineStage in one update and milestone fields (dateEoiSent, dateContractSigned, etc.) in a second update. If the process crashes between the two, the stage advances but the milestone is never recorded, so funnel/conversion math under-counts that interest. In practice this is rare, but the audit log fires before the milestone update succeeds, so the audit trail claims a complete transition that's actually half-applied.
  • Fix: Combine both into a single update statement, computing the milestone fields in JS and merging them into the set({...}) clause.
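
A sketch of the merged write; the milestone-column mapping is illustrative and the value type (Date vs string) depends on the column mode in the schema:

```ts
import { eq } from 'drizzle-orm';

// Compute milestone fields in JS, then write stage + milestones in one statement.
const milestoneFields: Partial<typeof interests.$inferInsert> = {};
if (nextStage === 'eoi_sent') milestoneFields.dateEoiSent = new Date();
if (nextStage === 'contract_signed') milestoneFields.dateContractSigned = new Date();

await tx
  .update(interests)
  .set({ pipelineStage: nextStage, ...milestoneFields })
  .where(eq(interests.id, interestId));
```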

Low

L1. Smart-archive coalesces invoice notes via SQL string concat

  • File: src/lib/services/client-archive.service.ts:288-291
  • Scenario: notes: sql`coalesce(${invoices.notes}, '') || ${...}` embeds new Date().toISOString() and the action label inside a parameterized string. The values are bound, so it's not an injection risk, but the \n[archive ...] marker is appended unconditionally — re-running the archive on a not-yet-committed client would double the marker. Combined with H5 (no idempotency on bulk), a retry could bloat invoice notes with duplicate markers.
  • Fix: Append only when the marker isn't already present, or rely on the clients.archivedAt is null precheck (which already guards re-entry) and accept the duplicate as theoretically impossible.
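
A sketch of the guarded append, assuming the marker's stable prefix (without the timestamp) can be checked with a bound LIKE parameter; the prefix format is illustrative:

```ts
import { eq, sql } from 'drizzle-orm';

const stablePrefix = `[archive:${action}]`; // keep the timestamp out of the dedup check
const marker = `\n${stablePrefix} ${new Date().toISOString()}`;

await tx
  .update(invoices)
  .set({
    notes: sql`case
      when coalesce(${invoices.notes}, '') like ${`%${stablePrefix}%`} then ${invoices.notes}
      else coalesce(${invoices.notes}, '') || ${marker}
    end`,
  })
  .where(eq(invoices.clientId, clientId));
```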

L2. Hard-delete requestHardDeleteCode reveals client existence pre-archive

  • File: src/lib/services/client-hard-delete.service.ts:77-85
  • Scenario: A user without admin.permanently_delete_clients shouldn't reach this service, so this is theoretical, but the ConflictError "Client must be archived" leaks the existence of an unarchived client to anyone who can reach the route. The audit doc flagged hard-delete error messages already (out of scope), but this specific error path isn't covered there.
  • Fix: Same as the audit-doc finding for the symmetric path — return a generic NotFoundError instead of distinguishing "not found" from "not archived" externally; log the distinction internally only.
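
A sketch of the symmetric fix, assuming NotFoundError is the error class the route maps to a 404 (logger field names are illustrative):

```ts
// requestHardDeleteCode — keep the archived/unarchived distinction internal.
if (!client) {
  throw new NotFoundError('Client not found');
}
if (!client.archivedAt) {
  logger.info({ clientId }, 'hard-delete code requested for an unarchived client');
  throw new NotFoundError('Client not found'); // same external response either way
}
```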