docs: comprehensive audits + Documenso build plan + admin UX backlog
Six audit documents capture the 2026-05-06 review pass (comprehensive, frontend, missing-features, permissions, reliability) along with the Documenso integration audit + locked build plan that drove the bulk of subsequent feature work.

Adds `docs/admin-ux-backlog.md` as a living tracker for the autonomous push — every item marked DONE or REMAINING with file pointers and scope estimates so future sessions can pick up where this one stopped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/audit-reliability-2026-05-06.md (new file, 220 lines)
@@ -0,0 +1,220 @@
# Reliability audit — 2026-05-06 (focused, post-batch deltas)

Scope: NEW services from the recent archive/restore/hard-delete/external-EOI batches.

Out of scope (already covered in `docs/audit-comprehensive-2026-05-06.md`): worker imports, rate limits, hard-delete error message UX, smart-restore dead reversal applier, bulk hard-delete redis loop, audit log spam.

---
## Critical

### C1. Bulk archive enqueues zero post-commit side effects

- **File:** `src/app/api/v1/clients/bulk/route.ts:68-134`
- **Scenario:** When the bulk wizard archives 100 clients with high-stakes reasons, `archiveClientWithDecisions` returns `externalCleanups` and `releasedBerths` arrays per-client, but `runBulk` discards the return value. Documenso envelopes that the wizard marked `void_documenso` never get queued, and "next-in-line" notifications never fire. The database is left in `documents.status='cancelled'` with the live Documenso envelope still out for signature — the signer can complete a legally-binding envelope that the CRM thinks is voided.
- **Fix:** Make the per-row callback return the result, then loop over `results` after `runBulk` to enqueue Documenso voids and fire next-in-line notifications (mirroring the single-client route), as sketched below. Defaulting `documentDecisions` to `'leave'` (line 113-116) hides the symptom for the bulk wizard but isn't enough — the single-client service can still surface this if the bulk path is ever generalized.

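A minimal sketch of the post-`runBulk` fan-out. The shape of `runBulk`'s results, the `documentsQueue` job name, and the `notifyNextInLine` helper are assumptions for illustration, not the actual exports:

```ts
// Sketch only — assumes runBulk now collects each row's archive result instead of discarding it.
const results = await runBulk(clientIds, async (clientId) =>
  archiveClientWithDecisions({ clientId, decisions, actorId }),
);

for (const row of results) {
  if (!row.ok) continue;
  // Queue Documenso voids for every cleanup the archive flagged (queue/job names hypothetical).
  for (const cleanup of row.value.externalCleanups) {
    if (cleanup.kind === 'void_documenso') {
      await documentsQueue.add('void-documenso', { documentId: cleanup.documentId });
    }
  }
  // Fire next-in-line notifications for each berth the archive released.
  for (const berth of row.value.releasedBerths) {
    await notifyNextInLine({ berthId: berth.berthId, excludeClientId: row.value.clientId });
  }
}
```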
---
## High

### H1. Restore wizard silently drops every released berth

- **File:** `src/lib/services/client-restore.service.ts:359-372`
- **Scenario:** `applyReversal` for `berth_released` is a no-op with a comment saying "v1 leaves the berth available". But the dossier (line 122-129) classifies these as `autoReversible` and the UI tells the operator "still available — re-attaching to the restored client". The wizard increments `autoReversed` and the audit log records a successful auto-reverse — but nothing actually happens. Operator thinks restore re-linked their berth; it didn't.
- **Fix:** Either (a) actually re-link by persisting the original `interestId` in the `berth_released` decision detail (it's already there, line 211) and re-inserting an `interestBerths` row + flipping the berth status back to `under_offer`, or (b) reclassify these as `reversibleWithPrompt` with copy that says "berth left available — re-add via the interest detail page". Option (a) is sketched below.

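A rough sketch of option (a). It assumes the decision detail carries `interestId` and `berthId`, that the Drizzle tables are named `interestBerths` and `berths` as in the finding, and that `Tx` is the alias proposed in M4 below:

```ts
import { and, eq } from 'drizzle-orm';

// Hypothetical helper for the 'berth_released' reversal — sketch, not the current no-op.
async function reverseBerthRelease(tx: Tx, detail: { interestId: string; berthId: string }) {
  // Re-create the interest↔berth link that the archive removed.
  await tx.insert(interestBerths).values(detail).onConflictDoNothing();

  // Flip the berth back to under_offer only if nobody sold it in the meantime.
  await tx
    .update(berths)
    .set({ status: 'under_offer' })
    .where(and(eq(berths.id, detail.berthId), eq(berths.status, 'available')));
}
```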
### H2. Smart-archive berth status update has TOCTOU race

- **File:** `src/lib/services/client-archive.service.ts:191-207`
- **Scenario:** Berth row is read via `dossier.berths` (read outside the tx) and modified inside the tx without a `for update` lock on `berths`. Two concurrent flows — e.g. operator A archives client X while operator B sells berth A1 to client Y — can race: A reads `berth.status === 'sold' → false`, B's tx commits sold, A's tx then flips it back to `available`. The "still under offer" subselect doesn't catch this because berth.status is the source of truth, not interest_berths.
- **Fix:** Add `tx.select(...).from(berths).where(eq(berths.id, d.berthId)).for('update')` before the status flip and re-check `status !== 'sold'` against the locked row, as in the sketch below.

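A sketch of the locked re-check using the Drizzle call named in the fix; the surrounding transaction and the `d.berthId` variable are assumed from the finding:

```ts
import { eq } from 'drizzle-orm';

// Lock the berth row for the duration of the transaction, then re-check against the locked state.
const [lockedBerth] = await tx
  .select()
  .from(berths)
  .where(eq(berths.id, d.berthId))
  .for('update');

if (lockedBerth && lockedBerth.status !== 'sold') {
  await tx
    .update(berths)
    .set({ status: 'available' })
    .where(eq(berths.id, d.berthId));
}
```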
### H3. Bulk archive can pick the wrong interest for berth release

- **File:** `src/app/api/v1/clients/bulk/route.ts:95-103`
- **Scenario:** When a client has multiple interests linked to the same berth, the bulk wizard picks `dossier.interests.find((i) => i.primaryBerthMooring === b.mooringNumber)` and falls back to `dossier.interests[0]?.interestId ?? ''`. The fallback to the first-interest-or-empty-string can hand `archiveClientWithDecisions` an `interestId` that was never linked to that berth — so the `delete from interest_berths where berthId=… and interestId=…` matches zero rows and the link is silently retained. Worse: an empty string `''` reaches the delete, which still matches zero rows but leaves the berth status check believing the link was removed.
- **Fix:** Build the berth→interest map from `interestBerthRows` (the authoritative join) rather than guessing by `primaryBerthMooring`, and skip berths with no resolvable interest rather than emitting an empty-string interestId. A sketch follows.

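A sketch of deriving the mapping from the join table instead of the mooring-number guess. It assumes `interestBerthRows` is reachable from the dossier and exposes `berthId`/`interestId`; everything else is illustrative:

```ts
// Authoritative mapping: which interest actually holds each berth for this client.
const interestByBerth = new Map<string, string>(
  dossier.interestBerthRows.map((row) => [row.berthId, row.interestId]),
);

const releaseDecisions = dossier.berths.flatMap((b) => {
  const interestId = interestByBerth.get(b.berthId);
  // Skip berths with no resolvable interest instead of emitting an empty-string id.
  if (!interestId) return [];
  return [{ kind: 'berth_released' as const, berthId: b.berthId, interestId }];
});
```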
### H4. External EOI runs four writes outside a transaction

- **File:** `src/lib/services/external-eoi.service.ts:67-155`
- **Scenario:** `getStorageBackend().put()`, `files.insert`, `documents.insert`, `documentEvents.insert`, and the interests update happen as five independent operations. If any one fails after the storage upload, you're left with an orphan PDF in S3/MinIO and partial DB state. If the documents insert fails after the file insert, the file row points to a storage key with no document referencing it — and the interest never advances.
- **Fix:** Wrap files/documents/documentEvents/interests in a single `db.transaction`. Storage upload stays outside (S3 isn't transactional) but on tx failure, schedule a cleanup job that deletes the orphan storage object, or accept the orphan and add a janitor. A sketch follows.

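A sketch of the transactional wrapping with best-effort compensation. Apart from `db.transaction` and the table names already cited, the identifiers here (the `put`/`delete` signatures on the storage backend, the metadata objects, the stage value) are assumptions:

```ts
import { eq } from 'drizzle-orm';

// Upload first — S3/MinIO isn't transactional, so it can only be compensated, not rolled back.
const storageKey = await getStorageBackend().put(pdfBuffer, { contentType: 'application/pdf' });

try {
  await db.transaction(async (tx) => {
    const [file] = await tx.insert(files).values({ storageKey, ...fileMeta }).returning();
    const [doc] = await tx.insert(documents).values({ fileId: file.id, ...docMeta }).returning();
    await tx.insert(documentEvents).values({ documentId: doc.id, type: 'external_eoi_received' });
    await tx.update(interests).set({ pipelineStage: 'eoi_received' }).where(eq(interests.id, interestId));
  });
} catch (err) {
  // Best-effort compensation (hypothetical delete method): remove the orphan object, or enqueue a janitor job.
  await getStorageBackend().delete(storageKey).catch(() => {});
  throw err;
}
```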
### H5. Bulk wizard double-submit re-archives the same client and surfaces spurious errors

- **File:** `src/app/api/v1/clients/bulk/route.ts:68-120` + `src/lib/services/client-archive.service.ts:165-173`
- **Scenario:** The single-client `archiveClientWithDecisions` locks the row and throws `ConflictError('Client is already archived')` on re-entry — good. But `runBulk` swallows the error string and returns it as `{ok:false, error:"Client is already archived"}` for that client. If the bulk wizard double-submits (network retry, double click), partial successes from the first request now look like per-client failures in the response, confusing the operator. There's no idempotency key on the bulk submit.
- **Fix:** Treat `ConflictError('already archived')` as success in the bulk per-row handler (the desired end state is reached). Or add an idempotency-key header on the bulk endpoint that short-circuits a duplicate request with the cached response. The first option is sketched below.

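A sketch of the first option. The result shape, the `ConflictError` import, and the exact message check are illustrative (matching on a stable error code would be sturdier than matching the message string):

```ts
// Per-row callback passed to runBulk — sketch.
async function archiveRow(clientId: string) {
  try {
    const value = await archiveClientWithDecisions({ clientId, decisions, actorId });
    return { ok: true as const, value };
  } catch (err) {
    // Double-submit / retry: the desired end state already holds, so report success, not failure.
    if (err instanceof ConflictError && err.message === 'Client is already archived') {
      return { ok: true as const, value: null, alreadyArchived: true };
    }
    return { ok: false as const, error: err instanceof Error ? err.message : String(err) };
  }
}
```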
---
## Medium

### M1. Hard-delete `clientMergeLog.surviving_client_id` deletes audit history

- **File:** `src/lib/services/client-hard-delete.service.ts:209`
- **Scenario:** The comment says "merged records remain in the log because mergedClientId has no FK", but the delete is wider than needed: it removes every merge-log row where this client was the survivor. If client X (being deleted) previously absorbed clients A/B/C, the audit trail of those merges is lost on X's deletion. The rows that survive the delete (`mergedClientId = X`) are now inconsistent — they reference a client that no longer exists.
- **Fix:** Either preserve the survivor rows by setting `surviving_client_id = NULL` (requires column nullable) or keep the current behavior but document it more visibly. At minimum, log the deleted merge-log row count so operators can investigate gaps (sketched below).

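A sketch of the minimum-bar option, logging how many survivor-side rows get dropped. The Drizzle property name `survivingClientId` and the in-scope `logger`/`clientId` are assumptions; the nullable-column alternative would replace the delete with an update setting the column to null:

```ts
import { eq } from 'drizzle-orm';

// Delete survivor-side merge-log rows, but record how much audit history was dropped.
const deleted = await tx
  .delete(clientMergeLog)
  .where(eq(clientMergeLog.survivingClientId, clientId))
  .returning({ id: clientMergeLog.id });

logger.warn(
  { clientId, deletedMergeLogRows: deleted.length },
  'hard-delete removed survivor-side merge-log history',
);
```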
### M2. Documenso void worker has no max-retry guard for non-404 errors

- **File:** `src/lib/queue/workers/documents.ts:19-37`
- **Scenario:** `voidDocument` throws `CodedError` on non-404 failures (auth error, network blip, Documenso 500). BullMQ retries with backoff, but there's no per-job idempotency check — the second retry hits the same envelope, and `voidDocument`'s 404 short-circuit only helps if Documenso actually voided the envelope despite the earlier call returning an error. A persistent 401 / 403 keeps retrying until BullMQ exhausts attempts, and the documents row stays `cancelled` in the CRM with the envelope still live in Documenso. The DLQ is mentioned in the comment but the worker doesn't surface a DLQ alert hook.
- **Fix:** On exhaustion, write back to `documents` (e.g. `cancellation_failed=true`) and emit an admin notification so the envelope can be voided manually; a sketch follows.

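A sketch of an exhaustion hook on the BullMQ worker. The `cancellationFailed` column, the `createNotification` helper, and the worker wiring (`processDocumentJob`, `connection`) are assumptions; only the `failed` event, `attemptsMade`, and `opts.attempts` are standard BullMQ:

```ts
import { Worker } from 'bullmq';
import { eq } from 'drizzle-orm';

const worker = new Worker('documents', processDocumentJob, { connection });

// Runs on every failed attempt; only act once retries are exhausted.
worker.on('failed', async (job, err) => {
  if (!job || job.attemptsMade < (job.opts.attempts ?? 1)) return;
  await db
    .update(documents)
    .set({ cancellationFailed: true })
    .where(eq(documents.id, job.data.documentId));
  await createNotification({
    kind: 'documenso_void_failed',
    documentId: job.data.documentId,
    message: `Documenso void exhausted retries: ${err.message}`,
  });
});
```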
### M3. Next-in-line notification fan-out unhandled rejection

- **File:** `src/lib/services/next-in-line-notify.service.ts:75-87`
- **Scenario:** Each `void createNotification(...)` is a fire-and-forget promise with no `.catch` handler. If `notifications.service` dispatches to a DB that's transiently down, the unhandled rejection will surface in the Node process with no recipient context (the closure-captured `userId` is in the stack, but pino won't include it unless explicitly logged). Process-level handlers will log it, but individual recipients silently lose their notification.
- **Fix:** `.catch((err) => logger.warn({err, userId, berthId: input.berthId}, 'next-in-line notification failed'))`.

### M4. Restore service uses `any` for transaction type

- **File:** `src/lib/services/client-restore.service.ts:354-355`
- **Scenario:** `applyReversal(tx: any, ...)` defeats Drizzle's type safety. A future schema rename (e.g. `yachts.status` enum change) won't fail at compile time inside this function. Combined with the documented v1 no-op for `berth_released`, the function looks innocuous but carries the most risk.
- **Fix:** Use the proper Drizzle tx type — `Parameters<Parameters<typeof db.transaction>[0]>[0]` or a named type alias from `@/lib/db/types.ts` if one exists. A sketch follows.

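A small sketch of the named alias. The `db` import path and the `ReversalDecision` parameter type are assumptions:

```ts
import { db } from '@/lib/db';

// Derive the transaction handle type from db.transaction itself, so schema changes stay type-checked.
export type Tx = Parameters<Parameters<typeof db.transaction>[0]>[0];

async function applyReversal(tx: Tx, decision: ReversalDecision): Promise<void> {
  // ...existing switch on decision.type, now fully typed against the Drizzle schema
}
```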
### M5. interests.changeInterestStage milestones write outside tx

- **File:** `src/lib/services/interests.service.ts:630-648`
- **Scenario:** The override path (and normal path) writes `pipelineStage` in one update and milestone fields (`dateEoiSent`, `dateContractSigned`, etc.) in a second update. If the process crashes between the two, the stage advances but the milestone is never recorded. Funnel/conversion math then undercounts that interest. In practice this is rare, but the audit log fires before the milestone update succeeds, so the audit trail claims a complete transition that's actually half-applied.
- **Fix:** Combine both into a single update statement, computing the milestone fields in JS and merging them into the `set({...})` clause, as sketched below.

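A sketch of collapsing the two writes into one. The `milestoneColumnForStage` helper is hypothetical, and `newStage`/`interestId` are assumed to be in scope:

```ts
import { eq } from 'drizzle-orm';

// Map the target stage to its milestone column, then issue a single atomic update.
const milestoneColumn = milestoneColumnForStage(newStage); // e.g. 'dateEoiSent' | 'dateContractSigned' | null

await tx
  .update(interests)
  .set({
    pipelineStage: newStage,
    ...(milestoneColumn ? { [milestoneColumn]: new Date() } : {}),
  })
  .where(eq(interests.id, interestId));
```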
---
## Low

### L1. Smart-archive coalesces invoice notes via SQL string concat

- **File:** `src/lib/services/client-archive.service.ts:288-291`
- **Scenario:** `notes: sql\`coalesce(${invoices.notes}, '') || ${...}\`` embeds `new Date().toISOString()` and the action label inside a parameterized string. The values are bound, so it's not an injection risk, but the `\n[archive ...]` marker is appended unconditionally — re-running the archive on a not-yet-committed client would double the marker. Combined with H5 (no idempotency on bulk), a retry could bloat invoice notes with duplicate markers.
- **Fix:** Append only when the marker isn't already present (sketched below), or rely on the `clients.archivedAt is null` precheck (which already guards re-entry) and accept the duplicate as theoretically impossible.

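A sketch of the guarded append, reusing the coalesce-based update from the finding. The `like '%[archive%'` guard, the marker text, and the in-scope `actionLabel`/`clientId`/`tx` are illustrative:

```ts
import { eq, sql } from 'drizzle-orm';

const marker = `\n[archive ${new Date().toISOString()}] ${actionLabel}`;

// Only append when no archive marker is present yet, so a retried archive can't stack duplicates.
await tx
  .update(invoices)
  .set({
    notes: sql`case
      when coalesce(${invoices.notes}, '') like '%[archive%' then ${invoices.notes}
      else coalesce(${invoices.notes}, '') || ${marker}
    end`,
  })
  .where(eq(invoices.clientId, clientId));
```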
### L2. Hard-delete `requestHardDeleteCode` reveals client existence pre-archive

- **File:** `src/lib/services/client-hard-delete.service.ts:77-85`
- **Scenario:** A user without `admin.permanently_delete_clients` shouldn't reach this service, so this is theoretical, but the ConflictError "Client must be archived" leaks the existence of an unarchived client to anyone who can reach the route. The audit doc flagged hard-delete error messages already (out of scope), but this specific error path isn't covered there.
- **Fix:** Same as the audit-doc finding for the symmetric path — return a generic `NotFoundError` instead of distinguishing "not found" from "not archived" externally; log the distinction internally only.