From 36b92eb82779f002b369a1fe3f777de880546e13 Mon Sep 17 00:00:00 2001 From: Matt Ciaccio Date: Sun, 3 May 2026 14:10:08 +0200 Subject: [PATCH] docs(spec): client deduplication and NocoDB migration design MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures the audit findings from a 2026-05-03 read-only NocoDB review plus the algorithm and migration plan for porting the legacy data into the new client / interest / contacts / addresses model. Highlights: - 252 NocoDB Interests rows ≈ ~190–200 unique humans (~20–25% dup rate). Six duplicate patterns documented from real data, including "same person, multiple yachts" — exactly the case the new client/interest split is designed to handle. - Reuses the battle-tested `client-portal/server/utils/duplicate- detection.ts` algorithm (blocking + weighted rules) with additions: metaphone for non-English surnames, compounded confidence when multiple rules match, negative evidence for split-signal cases. - Three runtime surfaces (at-create suggestion, interest-level same-berth guard, background scoring + admin review queue) plus a one-shot migration script with --dry-run / --apply / --rollback. - Configurable thresholds via per-port system_settings so the merge policy can be tuned (defaults to "always confirm" — never auto-merges out of the box). - Reversible: every merge writes a clientMergeLog row with the loser's full pre-state JSON, enabling 7-day undo without engineering. Implementation decomposes into three plans (P1 library / P2 runtime / P3 migration) sequenced after the mobile branch lands. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../2026-05-03-dedup-and-migration-design.md | 564 ++++++++++++++++++ 1 file changed, 564 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md diff --git a/docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md b/docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md new file mode 100644 index 0000000..ff6666f --- /dev/null +++ b/docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md @@ -0,0 +1,564 @@ +# Client Deduplication and NocoDB Migration Design + +**Status**: Design draft 2026-05-03 — pending approval. +**Plan decomposition**: Three implementation plans stack from this design — (P1) normalization + dedup core library; (P2) admin settings + at-create + interest-level guards (runtime); (P3) NocoDB migration script + review queue UI. P1 unblocks P2 and P3. +**Branch base**: stacks on `feat/mobile-foundation` once it merges to `main`. +**Out of scope**: live merge of two clients across ports (cross-tenant), automated AI-judged matches, profile-photo / face-match dedup, web-of-trust referrer relationships. + +--- + +## 1. Background + +### 1.1 Why this exists + +The legacy CRM lives in a NocoDB base whose `Interests` table conflates _the human_ with _the deal_. A row contains `Full Name`, `Email Address`, `Phone Number`, `Address`, `Place of Residence` _and_ the sales-pipeline state for one specific berth. A single human pursuing two berths becomes two rows with semi-duplicated personal data. A 2026-05-03 read-only audit confirmed: + +- **252 Interests rows** in NocoDB, against an estimated ~190–200 unique humans (~20–25% duplication rate). +- **35 Residential Interests rows** in a parallel residential pathway with the same conflation. +- **64 Website Interest Submissions + 47 Website Contact Form Submissions + 1 EOI Supplemental Form** as inbound capture surfaces. +- **No Clients table.** The conflated structure is structural, not accidental. 
+ +The new CRM (`src/lib/db/schema/clients.ts`) splits this into `clients` (people) ↔ `interests` (deals), with `clientContacts` (multi-channel), `clientAddresses` (multi-address), and a pre-existing `clientMergeLog` table that anticipates merge with undo. The design has been ready; what's missing is (a) a normalization + matching library, (b) the at-create / at-import surfaces that use it, and (c) the migration of the existing 252+35 records. + +### 1.2 Real duplicate patterns observed in the live data + +Sampled 200 of the 252 NocoDB Interests rows. Confirmed duplicate clusters fall into six patterns: + +| Pattern | Example rows | Signature | +| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | +| **A. Pure double-submit** | Deepak Ramchandani #624/#625; John Lynch #716/#725 | All fields identical; created same day | +| **B. Phone format variance** | Howard Wiarda #236/#536 (`574-274-0548` vs `+15742740548`); Christophe Zasso #701/#702 (`0651381036` vs `0033651381036`) | Same email, normalize-equal phone | +| **C. Name capitalization** | Nicolas Ruiz #681/#682/#683; Jean-Charles Miege/MIEGE #37/#163; John Farmer/FARMER #35/#161 | Same email or empty; surname case differs | +| **D. Name shortening** | Chris vs Christopher Allen #700/#534; Emma c vs Emma Cauchefer #661/#673 | Same email + phone; given-name truncated | +| **E. Resubmit with typo** | Christopher Camazou #649/#650 (phone last 4 digits typo); Gianfranco Di Constanzo/Costanzo #585/#336 (surname typo, **different yacht** — should be ONE client + TWO interests) | Score-on-everything-else high, one field has small-edit-distance noise | +| **F. 
Hard cases** | Etiennette Clamouze #188/#717 (same name, different country phone + email); Bruno Joyerot #18 with email belonging to Bruce Hearn #19 (couple sharing contact) | Cannot resolve without a human | + +This dataset will be the fixture for the dedup library's tests — every pattern above must be either auto-detected or flagged for review, and the false-positive bar must be high enough that Pattern F doesn't get force-merged. + +### 1.3 Dirty data inventory + +The migration normalizer must survive these real values from production: + +**Phone fields**: `+1-264-235-8840\r` (with carriage return), `'+1.214.603.4235` (apostrophe + dots), `0677580750/0690511494` (two numbers in one field), `00447956657022` (00 prefix), `+447000000000` (placeholder all-zeros), `+4901637039672` (impossible — stripped 0 + country prefix), various unprefixed local formats, dashed US numbers without country code. + +**Email fields**: mixed case rampant (`Arthur@laser-align.com` vs `arthur@laser-align.com`); ALL-CAPS local parts; trailing whitespace. + +**Name fields**: ALL-CAPS surnames mixed with title-case given names; embedded `\n` and `\r`; double spaces; lowercase-only entries; slash-with-company variants (`Daniel Wainstein / 7 Knots, LLC`, `Bruno Joyerot / SAS TIKI`); placeholder `Mr DADER`, `TBC`. + +**Place of Residence (free text)**: `Saint barthelemy`, `St Barth`, `Saint-Barthélemy` (same place, three forms); `anguilla`, `United States `, `USA`, `Kansas City` (city without country), `Sag Harbor Y` (typo). + +### 1.4 Existing battle-tested algorithm + +`client-portal/server/utils/duplicate-detection.ts` already implements blocking + weighted-rules dedup against this same NocoDB. It runs in production today. We **port it forward** (don't reinvent), then add: soundex/metaphone for surname matching, compounded-confidence when multiple rules match, and negative evidence (same email + different country phone reduces confidence). 
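To make that fixture corpus concrete, here is one possible encoding of the §1.2 clusters as test data. The `ClusterFixture` shape, the field names, and the `expected` column are illustrative assumptions, not the final fixture format:

```ts
// Illustrative fixture shape for the §1.2 duplicate clusters.
// Row ids and pattern letters come from the audit; the type is hypothetical.
type Outcome = 'auto_merge' | 'flag_for_review' | 'keep_separate';

interface ClusterFixture {
  pattern: 'A' | 'B' | 'C' | 'D' | 'E' | 'F';
  description: string;
  nocodbRowIds: number[];
  expected: Outcome;
}

const clusterFixtures: ClusterFixture[] = [
  { pattern: 'A', description: 'Deepak Ramchandani pure double-submit', nocodbRowIds: [624, 625], expected: 'auto_merge' },
  { pattern: 'B', description: 'Howard Wiarda phone format variance', nocodbRowIds: [236, 536], expected: 'auto_merge' },
  { pattern: 'E', description: 'Di Constanzo/Costanzo surname typo, different yacht', nocodbRowIds: [585, 336], expected: 'auto_merge' },
  { pattern: 'F', description: 'Etiennette Clamouze same name, different contact', nocodbRowIds: [188, 717], expected: 'flag_for_review' },
];

// The false-positive bar in executable form: no Pattern F cluster
// may ever be encoded as auto_merge.
const guardHolds = clusterFixtures
  .filter((f) => f.pattern === 'F')
  .every((f) => f.expected !== 'auto_merge');
```

The one invariant worth stating up front: Pattern F rows always expect `flag_for_review`, never `auto_merge`.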
+
+### 1.5 Why the website is no longer the source of new dirty data
+
+The website forms (`website/components/pn/specific/website/{berths-item,register,form}/form.vue`) use a phone-input component with a country picker (`prefer-countries: ['US', 'GB', 'DE', 'FR']`) and `[(value) => !!value || 'Phone number is required']` validation. Output is E.164-shaped. The 252 dirty rows are legacy — pre-form-redesign submissions, sales-rep manual entries, and external CSV imports. Future inbound is clean.
+
+---
+
+## 2. Approach
+
+Three artifacts, layered:
+
+1. **A pure-logic normalization + matching library** at `src/lib/dedup/`. JSX-free, vitest-native (proven pattern: `realtime-invalidation-core.ts`). Tested against fixtures drawn from the duplicate clusters in §1.2 and the dirty-data inventory in §1.3.
+2. **Three runtime surfaces** that use the library: at-create suggestion in client/interest forms; interest-level same-berth guard; admin review queue powered by a nightly background scoring job.
+3. **A one-shot migration script** that pulls NocoDB → normalizes → dedupes → writes the new schema → produces a CSV report with an auto-merge log plus a flagged-for-review pile.
+
+**Configurability via admin settings** (`system_settings` per port) so the team can tune sensitivity without code changes. Defaults err on the safe side — a flagged review is cheaper than a wrongly merged record.
+
+**Reversibility**: every merge writes a `client_merge_log` row containing the loser's full pre-state JSON. A 7-day undo window lets a wrong merge be reversed without engineering involvement. After 7 days the snapshot is purged for GDPR; merges become permanent.
+
+---
+
+## 3. Normalization library
+
+Lives at `src/lib/dedup/normalize.ts`. Pure functions, no DB, vitest-tested. Used by the dedup algorithm AND by all create paths, so what gets stored is already normalized.
+ +### 3.1 `normalizeName(raw: string)` + +```ts +export function normalizeName(raw: string): { + display: string; // human-readable, kept for UI + normalized: string; // for matching + surnameToken?: string; // for surname-based blocking +}; +``` + +- Trim leading/trailing whitespace +- Replace `\r`, `\n`, tabs with single space +- Collapse consecutive whitespace to single space +- Smart title-case: keep particles (`van`, `de`, `del`, `O'`, `di`, `le`, `da`) lowercase except as first token +- `display` preserves user's intent (slash-with-company stays intact) +- `normalized` is `display.toLowerCase()` for comparison +- `surnameToken` is the last non-particle token for blocking + +### 3.2 `normalizeEmail(raw: string)` + +```ts +export function normalizeEmail(raw: string): string | null; +``` + +- Trim + lowercase +- Validate via `zod.email()` schema +- Returns `null` for empty / invalid (caller decides what to do) +- **Does NOT strip plus-aliases** (`user+tag@domain.com`) — both intentional (real distinct addresses) and malicious-prevention apply. Compare by full localpart. + +### 3.3 `normalizePhone(raw: string, defaultCountry: string)` + +```ts +export function normalizePhone( + raw: string, + defaultCountry: string, +): { + e164: string | null; // canonical, e.g. '+15742740548' + country: string | null; // ISO-3166-1 alpha-2 + display: string | null; // user-facing pretty + flagged?: 'multi_number' | 'placeholder' | 'unparseable'; +} | null; +``` + +Pipeline: + +1. Strip `\r`, `\n`, tabs, single quotes, dots, dashes, parens, spaces +2. If contains `/` or `;` or `,` → flag `multi_number`, take first segment +3. If matches `+\d{2}0+$` (e.g., `+447000000000`) → flag `placeholder`, return null +4. If starts with `00` → replace with `+` +5. If starts with `+` → parse as E.164 +6. Else if `defaultCountry` provided → parse against that country +7. 
Else return null (caller's problem)
+
+Backed by `libphonenumber-js` (already in deps via `tests/integration/factories.ts`; if it turns out not to be, we add it). The hostile cases above all need explicit handling — a naïve regex won't survive them.
+
+### 3.4 `resolveCountry(text: string)`
+
+```ts
+export function resolveCountry(text: string): {
+  iso: string | null; // ISO-3166-1 alpha-2
+  confidence: 'exact' | 'fuzzy' | 'city' | null;
+};
+```
+
+Reuses `src/lib/i18n/countries.ts`. Pipeline (order: exact → city → fuzzy):
+
+1. Lowercase + strip diacritics
+2. Exact match against country names (any locale we ship)
+3. City fallback — small in-package mapping for high-frequency cities seen in legacy data (`Sag Harbor → US`, `Kansas City → US`, `St Barth → BL`, etc.)
+4. Fuzzy match (Levenshtein ≤ 2 against canonical English names)
+
+The mapping is opinionated and small (~30 entries covering the actual values seen in the 252-row dataset). Anything that fails to resolve returns `null` and lands in the migration's flagged pile.
+
+---
+
+## 4. Dedup algorithm
+
+Lives at `src/lib/dedup/find-matches.ts`. Pure function. Vitest-tested against the §1.2 cluster fixtures.
+
+### 4.1 Public API
+
+```ts
+export interface MatchCandidate {
+  id: string;
+  fullName: string | null;
+  emails: string[]; // already normalized
+  phonesE164: string[]; // already normalized E.164
+  countryIso: string | null;
+}
+
+export interface MatchResult {
+  candidate: MatchCandidate;
+  score: number; // 0–100
+  reasons: string[]; // human-readable, e.g. ["email match", "phone match"]
+  confidence: 'high' | 'medium' | 'low';
+}
+
+export function findClientMatches(
+  input: MatchCandidate,
+  pool: MatchCandidate[],
+  thresholds: DedupThresholds,
+): MatchResult[];
+```
+
+### 4.2 Scoring rules (compound)
+
+Each rule produces a score addition. **Compounding**: when two strong rules match (e.g., email AND phone), the result is ~95+ rather than max(60, 50). Negative evidence subtracts.
| Rule | Score | Notes |
| --- | --- | --- |
| Exact email match (case-insensitive, normalized) | +60 | One match suffices |
| Exact phone E.164 match (≥ 8 significant digits) | +50 | Excludes placeholder all-zeros |
| Exact normalized full-name match | +20 | Many "John Smith"s exist |
| Surname soundex match + given-name fuzzy match (Lev ≤ 1) | +15 | Catches `Constanzo/Costanzo`, `Christophe/Christopher` |
| Same address (normalized fuzzy ≥ 0.8) | +10 | Bonus signal |
| **Negative**: Same email but different country code on phone | −15 | Suggests spouse / coworker / shared inbox |
| **Negative**: Same name but DIFFERENT email AND DIFFERENT phone | −20 | Two distinct people with the same name |
+
+### 4.3 Confidence tiers (post-compound)
+
+- **score ≥ 90 — `high`** — email AND phone match, or email + name + address. Block-create suggests "Use existing." Auto-link on public-form submit by default.
+- **score 50–89 — `medium`** — single strong signal (email or phone alone), or email + same name + different country (the Etiennette case). Soft-warn but allow.
+- **score < 50 — `low`** — weak signals only. Don't surface in UI; only relevant in the background-job review queue.
+
+### 4.4 Blocking strategy
+
+To avoid an O(N) scan over the pool of N existing clients, build three lookup maps once per scan:
+
+- `byEmail: Map<string, MatchCandidate[]>` — keyed by normalized email
+- `byPhoneE164: Map<string, MatchCandidate[]>` — keyed by E.164
+- `bySurnameToken: Map<string, MatchCandidate[]>` — keyed by `normalizeName(...).surnameToken`
+
+For an incoming `MatchCandidate`, the candidate set to compare is the union of pool entries reachable through any of its emails/phones/surname token. Typically 0–5 candidates per query, regardless of N.
+
+### 4.5 Performance budget
+
+For migration: 252 rows compared pairwise once (252·251/2 ≈ 31.6k comparisons even without blocking; far fewer with it). A few seconds either way.
+ +For runtime at-create: incoming candidate against existing pool of N clients per port. Expected pool size at maturity: 1k–10k. With blocking: <10 comparisons, <1ms target. No DB query needed beyond the initial pool fetch (which itself uses the indexed columns). + +For background nightly job: full pairwise within port, blocked. 10k clients → ~50k pairwise checks per port → <30s. Fine for a nightly cron. + +--- + +## 5. Configurable thresholds (admin settings) + +New rows in `system_settings` per port. Default values err safe (more confirmation, less auto-action). + +| Key | Default | Effect | +| ------------------------------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `dedup_block_create_threshold` | `90` | Score above which the client-create form interrupts: "Use existing client?" | +| `dedup_soft_warn_threshold` | `50` | Score above which a soft-warn panel surfaces below the form | +| `dedup_review_queue_threshold` | `40` | Background job lands pairs ≥ this score in `/admin/duplicates` | +| `dedup_public_form_auto_link` | `true` | When a public-form submission scores ≥ block-threshold against existing client, attach the new interest to that client without prompting. **Safe**: no merge, just attaching a deal. | +| `dedup_auto_merge_threshold` | `null` (disabled) | If non-null, merges happen automatically at this threshold without human confirmation. Recommend leaving null until the team is comfortable; `95` is a reasonable cautious value. | +| `dedup_undo_window_days` | `7` | How long the loser's pre-state JSON is retained for merge-undo. After this, the snapshot is purged (GDPR) and merges are permanent. | + +Each setting is a row in `system_settings`. UI surface in `/[portSlug]/admin/dedup` (a new admin page) with an "Advanced" toggle to expose the thresholds and brief explanations. 
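A sketch of how the runtime could resolve these rows into a typed object. The key names mirror the table above; the `resolveThresholds` helper, its input shape, and the parsing rules are assumptions:

```ts
// Hypothetical resolver: raw system_settings rows (string values per key)
// into a typed thresholds object, falling back to the safe defaults above.
export interface DedupThresholds {
  blockCreate: number;
  softWarn: number;
  reviewQueue: number;
  publicFormAutoLink: boolean;
  autoMerge: number | null; // null = auto-merge disabled (the default)
  undoWindowDays: number;
}

const DEFAULTS: DedupThresholds = {
  blockCreate: 90,
  softWarn: 50,
  reviewQueue: 40,
  publicFormAutoLink: true,
  autoMerge: null,
  undoWindowDays: 7,
};

export function resolveThresholds(
  rows: Record<string, string | undefined>,
): DedupThresholds {
  const num = (key: string, fallback: number): number => {
    const raw = rows[key];
    const parsed = raw ? Number(raw) : NaN;
    return Number.isFinite(parsed) ? parsed : fallback;
  };
  // Absent, 'null', or unparsable auto-merge threshold stays disabled.
  const autoMergeRaw = rows['dedup_auto_merge_threshold'];
  const autoMergeParsed = autoMergeRaw ? Number(autoMergeRaw) : NaN;
  return {
    blockCreate: num('dedup_block_create_threshold', DEFAULTS.blockCreate),
    softWarn: num('dedup_soft_warn_threshold', DEFAULTS.softWarn),
    reviewQueue: num('dedup_review_queue_threshold', DEFAULTS.reviewQueue),
    publicFormAutoLink: rows['dedup_public_form_auto_link'] !== 'false',
    autoMerge: Number.isFinite(autoMergeParsed) ? autoMergeParsed : null,
    undoWindowDays: num('dedup_undo_window_days', DEFAULTS.undoWindowDays),
  };
}
```

An absent or unparsable `dedup_auto_merge_threshold` resolves to `null`, preserving the never-auto-merge default out of the box.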
+ +If the sales team complains the safer mode is too click-heavy, an admin flips `dedup_auto_merge_threshold` to `95` without any code change. + +--- + +## 6. Merge service contract + +### 6.1 Data flow + +`mergeClients(winnerId, loserId, fieldChoices, ctx)` does, in a single transaction: + +1. **Snapshot loser** — full row + all attached `clientContacts`, `clientAddresses`, `clientNotes`, `clientTags`, plus a count of dependent rows about to be moved (interests, yacht-memberships, etc.). Stored as `mergeDetails` JSONB in `clientMergeLog`. +2. **Reattach** — every row pointing at `loserId` updates to point at `winnerId`: + - `interests.clientId` + - `clientContacts.clientId` — with conflict handling: if winner already has the same email, keep winner's; flag the duplicate for the user + - `clientAddresses.clientId` — same conflict handling + - `clientNotes.clientId` — preserve `authorId` + `createdAt` (never overwrite) + - `clientTags.clientId` + - `clientYachtMembership.clientId` (or whatever the table is called) + - `auditLogs.entityId` — annotate, don't move (audit truth) +3. **Apply fieldChoices** — for each field where the user picked the loser's value, copy that into the winner row. +4. **Soft-archive loser** — `loser.archivedAt = now()`, `loser.mergedIntoClientId = winnerId`. Row stays in DB so the merge is reversible. +5. **Write `clientMergeLog`** — `{ winnerId, loserId, mergedBy, mergedAt, mergeDetails: , fieldChoices }`. +6. **Audit log** — top-level `auditLogs` row: `{ action: 'merge', entityType: 'client', entityId: winnerId, metadata: { loserId, score, reasons } }`. + +### 6.2 Schema additions (migration) + +`clients` table gets a new column: + +```ts +mergedIntoClientId: text('merged_into_client_id').references(() => clients.id), +``` + +The existing `clientMergeLog` table is reused. 
Add an index for the undo-window query. A partial index with `WHERE created_at > NOW() - INTERVAL '7 days'` is not possible here: PostgreSQL only allows immutable functions in index predicates, and `NOW()` is not immutable. The 7-day cutoff therefore lives in the query; the plain composite index keeps it fast:
+
+```sql
+CREATE INDEX idx_cml_recent ON client_merge_log (port_id, created_at DESC);
+```
+
+A daily maintenance job (using the existing `maintenance-cleanup.test.ts` infrastructure) purges `mergeDetails` JSONB older than the `dedup_undo_window_days` setting.
+
+### 6.3 Undo
+
+`unmergeClients(mergeLogId, ctx)`:
+
+1. Within the undo window, look up the snapshot
+2. Restore loser: clear `archivedAt`, `mergedIntoClientId`
+3. Restore loser's contacts/addresses/notes/tags from snapshot
+4. Detach reattached rows: `interests` etc. that were touching `winnerId` and originally belonged to the loser go back. The snapshot stores the original `(rowType, rowId)` list explicitly so this is deterministic.
+5. Mark log row `undoneAt = now()`, `undoneBy = userId`
+
+After 7 days the snapshot is gone and unmerge returns `410 Gone`.
+
+### 6.4 Concurrency
+
+Both merge and unmerge wrap in a single transaction with `SELECT … FOR UPDATE` on `clients.id` of both winner and loser. A second merge attempt against the same loser sees `mergedIntoClientId` already set and refuses (clear error: "Already merged into …").
+
+---
+
+## 7.
Runtime surfaces + +### 7.1 Layer 1 — At-create suggestion + +In `ClientForm` (and the public `register` form once that hits the new system): + +- Debounced 300ms after email or phone field changes +- Calls `findClientMatches` against current port's clients +- Renders top-1 match if score ≥ `dedup_soft_warn_threshold`: + ``` + ┌─────────────────────────────────────┐ + │ This looks like an existing client │ + │ ML Marcus Laurent │ + │ marcus@… +33 6 12 34 56 78 │ + │ 2 interests · last 9d ago │ + │ [ Use this client ] [ Create new ] │ + └─────────────────────────────────────┘ + ``` +- "Use this client" → form switches to "create new interest under existing client" mode (preserves whatever other fields the user typed) +- "Create new" → audit-log `dedup_override` with the candidate's id and reasons (so we have data on false positives) + +### 7.2 Layer 2 — Interest-level same-berth guard + +Cheap one-liner in `createInterest` service: + +- Check `(clientId, berthId)` against existing non-archived interests +- If hit, throw `BerthDuplicateError` with the existing interest details +- UI catches and prompts: "Update existing or create separate?" + +This is NOT the same as client-level dedup. Same client legitimately can pursue the same berth a second time after it falls through. But the prompt-before-create catches the accidental double-submit case. 
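The guard itself is small enough to sketch in full. The `ExistingInterest` shape and the error class fields are illustrative; in the real service the check would be an indexed query inside `createInterest`:

```ts
// Sketch of the §7.2 same-berth guard. Shapes are illustrative.
export class BerthDuplicateError extends Error {
  constructor(public readonly existingInterestId: string) {
    super(`Active interest ${existingInterestId} already exists for this client + berth`);
    this.name = 'BerthDuplicateError';
  }
}

interface ExistingInterest {
  id: string;
  clientId: string;
  berthId: string;
  archivedAt: Date | null; // archived interests don't block re-creation
}

export function assertNoSameBerthInterest(
  existing: ExistingInterest[],
  clientId: string,
  berthId: string,
): void {
  const hit = existing.find(
    (i) => i.clientId === clientId && i.berthId === berthId && i.archivedAt === null,
  );
  if (hit) throw new BerthDuplicateError(hit.id);
}
```

An archived interest never blocks, which is what lets the same client legitimately pursue the same berth again after a deal falls through.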
+ +### 7.3 Layer 3 — Background scoring + review queue + +- A nightly cron (using existing BullMQ infrastructure — search for `scheduled-tasks` in repo) runs `findClientMatches` over each port's full client pool +- Pairs scoring ≥ `dedup_review_queue_threshold` land in a `client_merge_candidates` table: + ```ts + export const clientMergeCandidates = pgTable('client_merge_candidates', { + id: text('id').primaryKey()..., + portId: text('port_id').notNull()..., + clientAId: text('client_a_id').notNull()..., + clientBId: text('client_b_id').notNull()..., + score: integer('score').notNull(), + reasons: jsonb('reasons').notNull(), + status: text('status').notNull().default('pending'), // pending | dismissed | merged + createdAt: timestamp('created_at')..., + resolvedAt: timestamp('resolved_at'), + resolvedBy: text('resolved_by'), + }) + ``` +- `/[portSlug]/admin/duplicates` lists pending candidates sorted by score desc, with `[Review →]` opening a side-by-side merge dialog +- Dismissing a candidate marks it `status=dismissed` so the job doesn't re-surface the same pair tomorrow (a future score increase re-creates it). + +--- + +## 8. NocoDB → new system field mapping + +This is the explicit mapping the migration script applies. One NocoDB Interest row produces multiple new rows. 
+ +### 8.1 Top-level transform + +``` +NocoDB Interests row + ─→ 0–1 client (deduped against existing pool) + ─→ 0–1 client_address + ─→ 0–2 client_contacts (email, phone) + ─→ exactly 1 interest + ─→ 0–1 yacht (when Yacht Name present and not "TBC"/"Na"/empty placeholders) + ─→ 0–1 document (when documensoID present) +``` + +### 8.2 Field map + +| NocoDB field | Target | Transform | +| ----------------------------------------------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------- | +| `Full Name` | `clients.fullName` | `normalizeName().display` | +| `Email Address` | `clientContacts(channel='email', value=...)` | `normalizeEmail()` | +| `Phone Number` | `clientContacts(channel='phone', valueE164=..., valueCountry=...)` | `normalizePhone(raw, defaultCountry)` | +| `Address` | `clientAddresses.streetAddress` (LongText preserved) | trim | +| `Place of Residence` | `clientAddresses.countryIso` AND `clients.nationalityIso` | `resolveCountry()` | +| `Contact Method Preferred` | `clients.preferredContactMethod` | lowercase, mapped: Email→email, Phone→phone | +| `Source` | `clients.source` | mapped: portal→website, Form→website, External→manual; null → manual | +| `Date Added` | `interests.createdAt` (fallback to NocoDB `Created At` then now) | parse: try `DD-MM-YYYY`, then `YYYY-MM-DD`, then ISO | +| `Sales Process Level` | `interests.pipelineStage` | see §8.3 | +| `Lead Category` | `interests.leadCategory` | General→general_interest, Friends and Family→general_interest with tag | +| `Berth` (FK) | `interests.berthId` | resolve via `Berths` table by `Mooring Number` | +| `Berth Size Desired` | `interests.notes` (appended) | preserve | +| `Yacht Name`, `Length`, `Width`, `Depth` | `yachts.name`, `lengthM`, `widthM`, `draughtM` | skip if name in {`TBC`, `Na`, ``, null}; ft→m via `\* 0.3048` | +| `EOI Status` | `interests.eoiStatus` | Awaiting Further 
Details→pending; Waiting for Signatures→sent; Signed→signed | +| `Deposit 10% Status` | `interests.depositStatus` | Pending→pending; Received→received | +| `Contract Status` | `interests.contractStatus` | Pending→pending; 40% Received→partial; Complete→complete | +| `EOI Time Sent` | `interests.dateEoiSent` | parse | +| `clientSignTime` / `developerSignTime` / `all_signed_notified_at` | `interests.dateEoiSigned` (use latest) | parse | +| `Time LOI Sent` | `interests.dateContractSent` | parse | +| `Internal Notes` + `Extra Comments` | `clientNotes` (one row, system author) | concatenate with section markers | +| `documensoID` | `documents.documensoId` (when present, type='eoi') | preserve | +| `Signature Link Client/CC/Developer`, `EmbeddedSignature*` | `documents.signers[]` | one row per non-null signer | +| `reminder_enabled`, `last_reminder_sent`, etc. | `interests.reminderEnabled`, `interests.reminderLastFired` | parse, default true | + +### 8.3 Sales-stage mapping (8 → 9) + +| NocoDB | New (PIPELINE_STAGES) | +| ------------------------------- | ------------------------------------------------------------------------ | +| General Qualified Interest | `open` | +| Specific Qualified Interest | `details_sent` | +| EOI and NDA Sent | `eoi_sent` | +| Signed EOI and NDA | `eoi_signed` | +| Made Reservation | `deposit_10pct` | +| Contract Negotiation | `contract_sent` | +| Contract Negotiations Finalized | `contract_sent` (with audit-note: legacy "negotiations finalized") | +| Contract Signed | `contract_signed` (or `completed` when deposit + contract both complete) | + +### 8.4 Other tables + +- **Residential Interests** (35 rows) — same shape as Interests but maps to `residentialClients` + `residentialInterests`. Smaller and cleaner. Same dedup runs within this pool independently. +- **Website - Interest Submissions** (64 rows) — these are **inbound capture, not yet a client**. 
Treat as if each row is a fresh public-form submission today: run dedup against the migrated client pool. Auto-link if `dedup_public_form_auto_link` setting allows. +- **Website - Contact Form Submissions** (47 rows) — sparse data (just name + email + interest type). Skip migration; export as CSV for manual triage. Not the source of truth for any deal. +- **Website - Berth EOI Details Supplements** (1 row) — single record, preserved as a one-off attached to the matching Interest. +- **Newsletter Sending** (69 rows) — out of scope; that's a marketing surface, not CRM. +- **Interests Backup, Interests copy** — historical artifacts. Skipped by default. A `--include-backups` flag attaches them as audit-note entries on the corresponding live Interest if the user wants the history. + +--- + +## 9. Migration script + +Located at `scripts/migrate-from-nocodb.ts`. Idempotent: safe to re-run. Three main flags: + +``` +$ pnpm tsx scripts/migrate-from-nocodb.ts --dry-run [--port-slug X] + Pulls everything, transforms, runs dedup, writes CSV report to .migration//. No DB writes. + +$ pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration// + Reads the report, performs the writes the dry-run promised. Refuses if the source data has changed since the report was generated (hash mismatch). + +$ pnpm tsx scripts/migrate-from-nocodb.ts --rollback --apply-id + Reads the apply log, undoes the writes (only valid within the undo window). +``` + +Reuses the `client-portal/server/utils/nocodb.ts` adapter for the NocoDB API client (no need to rebuild). Writes to the new system via Drizzle (re-using the existing services like `createClient`, `createInterest`, etc., so all the same validation runs). 
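The hash-mismatch refusal needs a stable fingerprint of the source rows, independent of the order NocoDB returns them in. A sketch (the `sourceHash` name and the `Id` field convention are assumptions):

```ts
import { createHash } from 'node:crypto';

// Stable fingerprint for the dry-run → apply handshake: rows sorted by Id,
// keys sorted per row, sha256 over the canonical JSON. Any edit to any row
// between dry-run and apply changes the hash, so --apply can refuse.
export function sourceHash(
  rows: Array<{ Id: number } & Record<string, unknown>>,
): string {
  const canonicalRow = (row: Record<string, unknown>) =>
    JSON.stringify(Object.keys(row).sort().map((key) => [key, row[key]]));
  const canonical = [...rows]
    .sort((a, b) => a.Id - b.Id)
    .map(canonicalRow)
    .join('\n');
  return createHash('sha256').update(canonical).digest('hex');
}
```

Both the dry-run report and the `--apply` run would record this value; a mismatch is the "source data has changed" signal.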
+ +### 9.1 Dry-run report format + +`.migration//report.csv`: + +```csv +op,reason,nocodb_row_id,target_table,target_value,confidence,manual_review_required +create_client,new,624,clients.fullName,Deepak Ramchandani,N/A,false +create_contact,new,624,clientContacts.email,dannyrams8888@gmail.com,N/A,false +create_contact,new,624,clientContacts.phone,+17215868888,N/A,false +create_interest,new,624,interests.berthId,a1b2c3...,N/A,false +auto_link,score=98 (email+phone),625,clients.id,,high,false +flag_for_review,score=72 (same name diff country),188,client.id,,medium,true +country_unresolved,fallback to AI (port country),198,clientAddresses.countryIso,AI,low,true +phone_unparseable,placeholder all-zeros,641,clientContacts.phone,,N/A,true +``` + +Plus `.migration//summary.md`: + +``` +# Migration Dry-Run — 2026-05-03 14:23 UTC + +NocoDB: 252 Interests + 35 Residences + 64 Website Submissions +Outcome: 198 clients, 287 interests (incl. residences), 91 yachts, 412 contacts + +Auto-linked (high confidence, no human action needed): + - Nicolas Ruiz: rows 681,682,683 → 1 client + 3 interests + - John Lynch: rows 716,725 → 1 client + 2 interests + - Deepak Ramchandani: rows 624,625 → 1 client + 2 interests + - [12 more] + +Flagged for manual review (medium confidence): + - Etiennette Clamouze (rows 188,717): same name, different country phone + email + - Bruno Joyerot #18 + Bruce Hearn #19: shared household contact + - [4 more] + +Country resolution failed for 7 rows. All defaulted to port country (AI). Review: + - Row 239: "Sag Harbor Y" → AI (likely US) + - [6 more] + +Phone parsing failed for 3 rows. All flagged, no contact created: + - Row 178: empty + - Row 641: placeholder "+447000000000" + - Row 175: empty + +Run `--apply` to commit these changes. +``` + +### 9.2 Apply phase + +`--apply` reads the report, re-fetches the source rows (via NocoDB MCP / API), recomputes the hash, fails fast if NocoDB changed since dry-run. 
Then performs the writes within a single PostgreSQL transaction per port (commit at end). On any error mid-transaction, full rollback. + +After successful apply, an `apply_id` is generated and an audit-log row written. The `apply_id` is the handle used for `--rollback`. + +### 9.3 Idempotency + +The script tracks NocoDB row IDs in a `migration_source_links` table: + +```ts +export const migrationSourceLinks = pgTable('migration_source_links', { + id: text('id').primaryKey()..., + sourceSystem: text('source_system').notNull(), // 'nocodb_interests' | 'nocodb_residences' | … + sourceId: text('source_id').notNull(), // NocoDB row id as string + targetEntityType: text('target_entity_type').notNull(), // client | interest | yacht | … + targetEntityId: text('target_entity_id').notNull(), + appliedAt: timestamp('applied_at')..., + appliedBy: text('applied_by'), +}, (table) => [ + uniqueIndex('idx_msl_source').on(table.sourceSystem, table.sourceId, table.targetEntityType), +]); +``` + +Re-running `--apply` against the same report skips rows already in this table. Useful for partial-failure resumption. + +--- + +## 10. Test plan + +### 10.1 Library-level (vitest unit) + +- `tests/unit/dedup/normalize.test.ts` — every dirty-data pattern from §1.3 has a fixture asserting the expected normalized output. +- `tests/unit/dedup/find-matches.test.ts` — every duplicate cluster from §1.2 has a fixture asserting score + confidence tier. Hard cases (Pattern F) assert "medium" not "high" — false-positive guard. + +### 10.2 Service-level (vitest integration) + +- `tests/integration/dedup/client-merge.test.ts` — merge service exercised: full reattach, clientMergeLog written, undo within window restores, undo after window returns 410, concurrent merge of same loser fails the second. +- `tests/integration/dedup/at-create-suggestion.test.ts` — `findClientMatches` against a seeded pool returns expected matches + reasons. 
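The undo-window boundary those tests assert against is easy to get off by one, so it is worth isolating as a pure helper (hypothetical name; the merge service and the tests would share it):

```ts
// Pure window check for unmergeClients: undo is allowed strictly within
// `undoWindowDays` of the merge; at or past the boundary the snapshot is
// purged and the endpoint answers 410 Gone.
export function isWithinUndoWindow(
  mergedAt: Date,
  now: Date,
  undoWindowDays: number,
): boolean {
  const windowMs = undoWindowDays * 24 * 60 * 60 * 1000;
  return now.getTime() - mergedAt.getTime() < windowMs;
}
```

Strictly-less-than at the boundary means a merge made exactly `undoWindowDays` ago is already outside the window, matching the purge-then-410 behavior.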
+ +### 10.3 Migration script (vitest integration with NocoDB mock) + +- `tests/integration/dedup/migration-dry-run.test.ts` — feed the script a fixture NocoDB dump (the 252 rows, frozen as a JSON snapshot in fixtures), assert the resulting CSV matches a golden file. Catch any future regression in the transform pipeline. +- `tests/integration/dedup/migration-apply.test.ts` — apply the dry-run output to a clean test DB, assert all expected rows exist, assert idempotency (re-apply is a no-op). + +### 10.4 E2E (Playwright) + +- `tests/e2e/smoke/30-dedup-create.spec.ts` — type into ClientForm with an email matching seeded client; assert suggestion card appears; click "Use this client"; assert form switches to interest-create mode. +- `tests/e2e/smoke/31-admin-duplicates.spec.ts` — admin views review queue, opens a candidate, side-by-side merge UI works, merge succeeds, undo within window works. + +--- + +## 11. Rollback plan + +Three layers of safety, ordered by reversibility: + +1. **Per-merge undo** — admin clicks Undo on a wrongly-merged pair, system rolls back from `clientMergeLog` snapshot. 7-day window. No engineering needed. +2. **Migration `--rollback` flag** — entire migration apply is reversed via the `apply_id` and `migration_source_links` table. Useful in the first 24h after `--apply`. Engineering-supervised. +3. **DB restore from backup** — the existing `docs/ops/backup-runbook.md` covers this. Last resort if both above are blocked. + +Pre-migration, take a hot backup of the new DB (`pg_dump`). Pre-merge in production (before any human-facing surface ships), the `dedup_auto_merge_threshold` defaults to `null` so no automatic merges happen — every merge is human-confirmed. + +--- + +## 12. Open items + +- **Soundex vs metaphone** — Soundex is simpler but English-leaning. Metaphone handles non-English surnames better (the dataset has French, German, Italian, Slavic names). 
Default to metaphone via the `natural` package; revisit if it adds significant install size. +- **Cross-port dedup** — not in scope. Each port's clients are deduped within that port. A future "shared address book" feature would need its own design. +- **Profile photo / face match** — out of scope. +- **AI-assisted match resolution** — out of scope. The Layer-3 review queue is human-only. + +--- + +## Implementation sequence + +P1 (this design's library) → P2 (runtime surfaces) → P3 (migration). Each is a separate plan / PR. + +**P1 deliverables**: `src/lib/dedup/{normalize,find-matches}.ts` + tests. No UI changes. No DB changes (except indexed lookups added to existing `clientContacts`). ~1.5 days. + +**P2 deliverables**: at-create suggestion in `ClientForm` + interest-level guard in `createInterest` service + admin settings UI for thresholds + `clientMergeCandidates` table + nightly job + admin review queue page + merge service + side-by-side merge UI. ~5–7 days. + +**P3 deliverables**: `scripts/migrate-from-nocodb.ts` + `migration_source_links` table + dry-run + apply + rollback. CSV report format frozen against fixture. ~3 days, including fixture creation from the live NocoDB snapshot. + +Total: ~10–12 engineering days from approval. Can be split across three PRs landing independently — each is testable in isolation and the runtime surfaces (P2) work even without P3 being run.