1 Commits

Author SHA1 Message Date
Matt Ciaccio
36b92eb827 docs(spec): client deduplication and NocoDB migration design
Captures the audit findings from a 2026-05-03 read-only NocoDB review
plus the algorithm and migration plan for porting the legacy data
into the new client / interest / contacts / addresses model.

Highlights:
- 252 NocoDB Interests rows ≈ ~190–200 unique humans (~20–25% dup
  rate). Six duplicate patterns documented from real data, including
  "same person, multiple yachts" — exactly the case the new
  client/interest split is designed to handle.
- Reuses the battle-tested `client-portal/server/utils/duplicate-
  detection.ts` algorithm (blocking + weighted rules) with additions:
  metaphone for non-English surnames, compounded confidence when
  multiple rules match, negative evidence for split-signal cases.
- Three runtime surfaces (at-create suggestion, interest-level
  same-berth guard, background scoring + admin review queue) plus a
  one-shot migration script with --dry-run / --apply / --rollback.
- Configurable thresholds via per-port system_settings so the merge
  policy can be tuned (defaults to "always confirm" — never
  auto-merges out of the box).
- Reversible: every merge writes a clientMergeLog row with the
  loser's full pre-state JSON, enabling 7-day undo without engineering.

Implementation decomposes into three plans (P1 library / P2 runtime /
P3 migration) sequenced after the mobile branch lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:10:08 +02:00

View File

@@ -0,0 +1,564 @@
# Client Deduplication and NocoDB Migration Design
**Status**: Design draft 2026-05-03 — pending approval.
**Plan decomposition**: Three implementation plans stack from this design — (P1) normalization + dedup core library; (P2) admin settings + at-create + interest-level guards (runtime); (P3) NocoDB migration script + review queue UI. P1 unblocks P2 and P3.
**Branch base**: stacks on `feat/mobile-foundation` once it merges to `main`.
**Out of scope**: live merge of two clients across ports (cross-tenant), automated AI-judged matches, profile-photo / face-match dedup, web-of-trust referrer relationships.
---
## 1. Background
### 1.1 Why this exists
The legacy CRM lives in a NocoDB base whose `Interests` table conflates _the human_ with _the deal_. A row contains `Full Name`, `Email Address`, `Phone Number`, `Address`, `Place of Residence` _and_ the sales-pipeline state for one specific berth. A single human pursuing two berths becomes two rows with semi-duplicated personal data. A 2026-05-03 read-only audit confirmed:
- **252 Interests rows** in NocoDB, against an estimated ~190200 unique humans (~2025% duplication rate).
- **35 Residential Interests rows** in a parallel residential pathway with the same conflation.
- **64 Website Interest Submissions + 47 Website Contact Form Submissions + 1 EOI Supplemental Form** as inbound capture surfaces.
- **No Clients table.** The conflated structure is structural, not accidental.
The new CRM (`src/lib/db/schema/clients.ts`) splits this into `clients` (people) ↔ `interests` (deals), with `clientContacts` (multi-channel), `clientAddresses` (multi-address), and a pre-existing `clientMergeLog` table that anticipates merge with undo. The design has been ready; what's missing is (a) a normalization + matching library, (b) the at-create / at-import surfaces that use it, and (c) the migration of the existing 252+35 records.
### 1.2 Real duplicate patterns observed in the live data
Sampled 200 of the 252 NocoDB Interests rows. Confirmed duplicate clusters fall into six patterns:
| Pattern | Example rows | Signature |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| **A. Pure double-submit** | Deepak Ramchandani #624/#625; John Lynch #716/#725 | All fields identical; created same day |
| **B. Phone format variance** | Howard Wiarda #236/#536 (`574-274-0548` vs `+15742740548`); Christophe Zasso #701/#702 (`0651381036` vs `0033651381036`) | Same email, normalize-equal phone |
| **C. Name capitalization** | Nicolas Ruiz #681/#682/#683; Jean-Charles Miege/MIEGE #37/#163; John Farmer/FARMER #35/#161 | Same email or empty; surname case differs |
| **D. Name shortening** | Chris vs Christopher Allen #700/#534; Emma c vs Emma Cauchefer #661/#673 | Same email + phone; given-name truncated |
| **E. Resubmit with typo** | Christopher Camazou #649/#650 (phone last 4 digits typo); Gianfranco Di Constanzo/Costanzo #585/#336 (surname typo, **different yacht** — should be ONE client + TWO interests) | Score-on-everything-else high, one field has small-edit-distance noise |
| **F. Hard cases** | Etiennette Clamouze #188/#717 (same name, different country phone + email); Bruno Joyerot #18 with email belonging to Bruce Hearn #19 (couple sharing contact) | Cannot resolve without a human |
This dataset will be the fixture for the dedup library's tests — every pattern above must be either auto-detected or flagged for review, and the false-positive bar must be high enough that Pattern F doesn't get force-merged.
### 1.3 Dirty data inventory
The migration normalizer must survive these real values from production:
**Phone fields**: `+1-264-235-8840\r` (with carriage return), `'+1.214.603.4235` (apostrophe + dots), `0677580750/0690511494` (two numbers in one field), `00447956657022` (00 prefix), `+447000000000` (placeholder all-zeros), `+4901637039672` (impossible — stripped 0 + country prefix), various unprefixed local formats, dashed US numbers without country code.
**Email fields**: mixed case rampant (`Arthur@laser-align.com` vs `arthur@laser-align.com`); ALL-CAPS local parts; trailing whitespace.
**Name fields**: ALL-CAPS surnames mixed with title-case given names; embedded `\n` and `\r`; double spaces; lowercase-only entries; slash-with-company variants (`Daniel Wainstein / 7 Knots, LLC`, `Bruno Joyerot / SAS TIKI`); placeholder `Mr DADER`, `TBC`.
**Place of Residence (free text)**: `Saint barthelemy`, `St Barth`, `Saint-Barthélemy` (same place, three forms); `anguilla`, `United States `, `USA`, `Kansas City` (city without country), `Sag Harbor Y` (typo).
### 1.4 Existing battle-tested algorithm
`client-portal/server/utils/duplicate-detection.ts` already implements blocking + weighted-rules dedup against this same NocoDB. It runs in production today. We **port it forward** (don't reinvent), then add: soundex/metaphone for surname matching, compounded-confidence when multiple rules match, and negative evidence (same email + different country phone reduces confidence).
### 1.5 Why the website is no longer the source of new dirty data
The website forms (`website/components/pn/specific/website/{berths-item,register,form}/form.vue`) use `<v-phone-input>` with a country picker (`prefer-countries: ['US', 'GB', 'DE', 'FR']`) and `[(value) => !!value || 'Phone number is required']` validation. Output is E.164-shaped. The 252 dirty rows are legacy — pre-form-redesign submissions, sales-rep manual entries, and external CSV imports. Future inbound is clean.
---
## 2. Approach
Three artifacts, layered:
1. **A pure-logic normalization + matching library** at `src/lib/dedup/`. JSX-free, vitest-native (proven pattern: `realtime-invalidation-core.ts`). Tested against the dirty-data fixture corpus drawn from §1.2.
2. **Three runtime surfaces** that use the library: at-create suggestion in client/interest forms; interest-level same-berth guard; admin review queue powered by a nightly background scoring job.
3. **A one-shot migration script** that pulls NocoDB → normalizes → dedupes → writes new schema → produces a CSV report with auto-merge log + flagged-for-review pile.
**Configurability via admin settings** (`system_settings` per port) so the team can tune sensitivity without code changes. Defaults err on the safe side — a flagged review is cheaper than a wrongly-merged record.
**Reversibility**: every merge writes a `client_merge_log` row containing the loser's full pre-state JSON. A 7-day undo window lets a wrong merge be reversed without engineering involvement. After 7 days the snapshot is purged for GDPR; merges become permanent.
---
## 3. Normalization library
Lives at `src/lib/dedup/normalize.ts`. Pure functions, no DB, vitest-tested. Used by the dedup algorithm AND by all create-paths so what gets stored is already normalized.
### 3.1 `normalizeName(raw: string)`
```ts
export function normalizeName(raw: string): {
display: string; // human-readable, kept for UI
normalized: string; // for matching
surnameToken?: string; // for surname-based blocking
};
```
- Trim leading/trailing whitespace
- Replace `\r`, `\n`, tabs with single space
- Collapse consecutive whitespace to single space
- Smart title-case: keep particles (`van`, `de`, `del`, `O'`, `di`, `le`, `da`) lowercase except as first token
- `display` preserves user's intent (slash-with-company stays intact)
- `normalized` is `display.toLowerCase()` for comparison
- `surnameToken` is the last non-particle token for blocking
### 3.2 `normalizeEmail(raw: string)`
```ts
export function normalizeEmail(raw: string): string | null;
```
- Trim + lowercase
- Validate via `zod.email()` schema
- Returns `null` for empty / invalid (caller decides what to do)
- **Does NOT strip plus-aliases** (`user+tag@domain.com`) — both intentional (real distinct addresses) and malicious-prevention apply. Compare by full localpart.
### 3.3 `normalizePhone(raw: string, defaultCountry: string)`
```ts
export function normalizePhone(
raw: string,
defaultCountry: string,
): {
e164: string | null; // canonical, e.g. '+15742740548'
country: string | null; // ISO-3166-1 alpha-2
display: string | null; // user-facing pretty
flagged?: 'multi_number' | 'placeholder' | 'unparseable';
} | null;
```
Pipeline:
1. Strip `\r`, `\n`, tabs, single quotes, dots, dashes, parens, spaces
2. If contains `/` or `;` or `,` → flag `multi_number`, take first segment
3. If matches `+\d{2}0+$` (e.g., `+447000000000`) → flag `placeholder`, return null
4. If starts with `00` → replace with `+`
5. If starts with `+` → parse as E.164
6. Else if `defaultCountry` provided → parse against that country
7. Else return null (caller's problem)
Backed by `libphonenumber-js` (already in deps via `tests/integration/factories.ts` usage if not, will add). The hostile cases above all need explicit handling — naïve regex won't survive.
### 3.4 `resolveCountry(text: string)`
```ts
export function resolveCountry(text: string): {
iso: string | null; // ISO-3166-1 alpha-2
confidence: 'exact' | 'fuzzy' | 'city' | null;
};
```
Reuses `src/lib/i18n/countries.ts`. Pipeline:
1. Lowercase + strip diacritics
2. Exact match against country names (any locale we ship)
3. Fuzzy match (Levenshtein ≤ 2 against canonical English names)
4. City fallback — small in-package mapping for high-frequency cities seen in legacy data (`Sag Harbor → US`, `Kansas City → US`, `St Barth → BL`, etc.). Order: exact → city → fuzzy.
The mapping is opinionated and small (~30 entries covering the actual values seen in the 252-row dataset). Anything that fails to resolve returns `null` and lands in the migration's flagged pile.
---
## 4. Dedup algorithm
Lives at `src/lib/dedup/find-matches.ts`. Pure function. Vitest-tested against the §1.2 cluster fixtures.
### 4.1 Public API
```ts
export interface MatchCandidate {
id: string;
fullName: string | null;
emails: string[]; // already normalized
phonesE164: string[]; // already normalized E.164
countryIso: string | null;
}
export interface MatchResult {
candidate: MatchCandidate;
score: number; // 0100
reasons: string[]; // human-readable, e.g. ["email match", "phone match"]
confidence: 'high' | 'medium' | 'low';
}
export function findClientMatches(
input: MatchCandidate,
pool: MatchCandidate[],
thresholds: DedupThresholds,
): MatchResult[];
```
### 4.2 Scoring rules (compound)
Each rule produces a score addition. **Compounding**: when two strong rules match (e.g., email AND phone), the result is ~95+ rather than max(50, 50). Negative evidence subtracts.
| Rule | Score | Notes |
| --------------------------------------------------------------- | ----- | ------------------------------------------------------ |
| Exact email match (case-insensitive, normalized) | +60 | One match suffices |
| Exact phone E.164 match (≥ 8 significant digits) | +50 | Excludes placeholder all-zeros |
| Exact normalized full-name match | +20 | Many "John Smith"s exist |
| Surname soundex match + given-name fuzzy match (Lev ≤ 1) | +15 | Catches `Constanzo/Costanzo`, `Christophe/Christopher` |
| Same address (normalized fuzzy ≥ 0.8) | +10 | Bonus signal |
| **Negative**: Same email but different country code on phone | 15 | Suggests spouse / coworker / shared inbox |
| **Negative**: Same name but DIFFERENT email AND DIFFERENT phone | 20 | Two distinct people with the same name |
### 4.3 Confidence tiers (post-compound)
- **score ≥ 90 — `high`** — email AND phone match, or email + name + address. Block-create suggest "Use existing." Auto-link on public-form submit by default.
- **score 5089 — `medium`** — single strong signal (email or phone alone), or email + same-name + different country (Etiennette case). Soft-warn but allow.
- **score < 50 — `low`** — weak signals only. Don't surface in UI; only relevant in background-job review queue.
### 4.4 Blocking strategy
For O(n) scan over a pool of N existing clients, build three lookup maps once per scan:
- `byEmail: Map<string, MatchCandidate[]>` — keyed by normalized email
- `byPhoneE164: Map<string, MatchCandidate[]>` — keyed by E.164
- `bySurnameToken: Map<string, MatchCandidate[]>` — keyed by `normalizeName(...).surnameToken`
For an incoming `MatchCandidate`, the candidate set to compare is the union of pool entries reachable through any of its emails/phones/surname-token. Typically 05 candidates per query, regardless of N.
### 4.5 Performance budget
For migration: 252 rows compared pairwise once. ~30k comparisons after blocking — a few seconds.
For runtime at-create: incoming candidate against existing pool of N clients per port. Expected pool size at maturity: 1k10k. With blocking: <10 comparisons, <1ms target. No DB query needed beyond the initial pool fetch (which itself uses the indexed columns).
For background nightly job: full pairwise within port, blocked. 10k clients → ~50k pairwise checks per port → <30s. Fine for a nightly cron.
---
## 5. Configurable thresholds (admin settings)
New rows in `system_settings` per port. Default values err safe (more confirmation, less auto-action).
| Key | Default | Effect |
| ------------------------------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `dedup_block_create_threshold` | `90` | Score above which the client-create form interrupts: "Use existing client?" |
| `dedup_soft_warn_threshold` | `50` | Score above which a soft-warn panel surfaces below the form |
| `dedup_review_queue_threshold` | `40` | Background job lands pairs ≥ this score in `/admin/duplicates` |
| `dedup_public_form_auto_link` | `true` | When a public-form submission scores ≥ block-threshold against existing client, attach the new interest to that client without prompting. **Safe**: no merge, just attaching a deal. |
| `dedup_auto_merge_threshold` | `null` (disabled) | If non-null, merges happen automatically at this threshold without human confirmation. Recommend leaving null until the team is comfortable; `95` is a reasonable cautious value. |
| `dedup_undo_window_days` | `7` | How long the loser's pre-state JSON is retained for merge-undo. After this, the snapshot is purged (GDPR) and merges are permanent. |
Each setting is a row in `system_settings`. UI surface in `/[portSlug]/admin/dedup` (a new admin page) with an "Advanced" toggle to expose the thresholds and brief explanations.
If the sales team complains the safer mode is too click-heavy, an admin flips `dedup_auto_merge_threshold` to `95` without any code change.
---
## 6. Merge service contract
### 6.1 Data flow
`mergeClients(winnerId, loserId, fieldChoices, ctx)` does, in a single transaction:
1. **Snapshot loser** — full row + all attached `clientContacts`, `clientAddresses`, `clientNotes`, `clientTags`, plus a count of dependent rows about to be moved (interests, yacht-memberships, etc.). Stored as `mergeDetails` JSONB in `clientMergeLog`.
2. **Reattach** — every row pointing at `loserId` updates to point at `winnerId`:
- `interests.clientId`
- `clientContacts.clientId` — with conflict handling: if winner already has the same email, keep winner's; flag the duplicate for the user
- `clientAddresses.clientId` — same conflict handling
- `clientNotes.clientId` — preserve `authorId` + `createdAt` (never overwrite)
- `clientTags.clientId`
- `clientYachtMembership.clientId` (or whatever the table is called)
- `auditLogs.entityId` — annotate, don't move (audit truth)
3. **Apply fieldChoices** — for each field where the user picked the loser's value, copy that into the winner row.
4. **Soft-archive loser**`loser.archivedAt = now()`, `loser.mergedIntoClientId = winnerId`. Row stays in DB so the merge is reversible.
5. **Write `clientMergeLog`**`{ winnerId, loserId, mergedBy, mergedAt, mergeDetails: <snapshot>, fieldChoices }`.
6. **Audit log** — top-level `auditLogs` row: `{ action: 'merge', entityType: 'client', entityId: winnerId, metadata: { loserId, score, reasons } }`.
### 6.2 Schema additions (migration)
`clients` table gets a new column:
```ts
mergedIntoClientId: text('merged_into_client_id').references(() => clients.id),
```
The existing `clientMergeLog` table is reused. Add a partial index for the undo-window query:
```sql
CREATE INDEX idx_cml_recent ON client_merge_log (port_id, created_at DESC) WHERE created_at > NOW() - INTERVAL '7 days';
```
A daily maintenance job (using the existing `maintenance-cleanup.test.ts` infrastructure) purges `mergeDetails` JSONB older than `dedup_undo_window_days` setting.
### 6.3 Undo
`unmergeClients(mergeLogId, ctx)`:
1. Within the undo window, look up the snapshot
2. Restore loser: clear `archivedAt`, `mergedIntoClientId`
3. Restore loser's contacts/addresses/notes/tags from snapshot
4. Detach reattached rows: `interests` etc. that were touching `winnerId` and originally belonged to loser go back. The snapshot stores the original `(rowType, rowId)` list explicitly so this is deterministic.
5. Mark log row `undoneAt = now()`, `undoneBy = userId`
After 7 days the snapshot is gone and unmerge returns `410 Gone`.
### 6.4 Concurrency
Both merge and unmerge wrap in a single transaction with `SELECT … FOR UPDATE` on `clients.id` of both winner and loser. A second merge attempt against the same loser sees `mergedIntoClientId` already set and refuses (clear error: "Already merged into …").
---
## 7. Runtime surfaces
### 7.1 Layer 1 — At-create suggestion
In `ClientForm` (and the public `register` form once that hits the new system):
- Debounced 300ms after email or phone field changes
- Calls `findClientMatches` against current port's clients
- Renders top-1 match if score ≥ `dedup_soft_warn_threshold`:
```
┌─────────────────────────────────────┐
│ This looks like an existing client │
│ ML Marcus Laurent │
│ marcus@… +33 6 12 34 56 78 │
│ 2 interests · last 9d ago │
│ [ Use this client ] [ Create new ] │
└─────────────────────────────────────┘
```
- "Use this client" → form switches to "create new interest under existing client" mode (preserves whatever other fields the user typed)
- "Create new" → audit-log `dedup_override` with the candidate's id and reasons (so we have data on false positives)
### 7.2 Layer 2 — Interest-level same-berth guard
Cheap one-liner in `createInterest` service:
- Check `(clientId, berthId)` against existing non-archived interests
- If hit, throw `BerthDuplicateError` with the existing interest details
- UI catches and prompts: "Update existing or create separate?"
This is NOT the same as client-level dedup. Same client legitimately can pursue the same berth a second time after it falls through. But the prompt-before-create catches the accidental double-submit case.
### 7.3 Layer 3 — Background scoring + review queue
- A nightly cron (using existing BullMQ infrastructure — search for `scheduled-tasks` in repo) runs `findClientMatches` over each port's full client pool
- Pairs scoring ≥ `dedup_review_queue_threshold` land in a `client_merge_candidates` table:
```ts
export const clientMergeCandidates = pgTable('client_merge_candidates', {
id: text('id').primaryKey()...,
portId: text('port_id').notNull()...,
clientAId: text('client_a_id').notNull()...,
clientBId: text('client_b_id').notNull()...,
score: integer('score').notNull(),
reasons: jsonb('reasons').notNull(),
status: text('status').notNull().default('pending'), // pending | dismissed | merged
createdAt: timestamp('created_at')...,
resolvedAt: timestamp('resolved_at'),
resolvedBy: text('resolved_by'),
})
```
- `/[portSlug]/admin/duplicates` lists pending candidates sorted by score desc, with `[Review →]` opening a side-by-side merge dialog
- Dismissing a candidate marks it `status=dismissed` so the job doesn't re-surface the same pair tomorrow (a future score increase re-creates it).
---
## 8. NocoDB → new system field mapping
This is the explicit mapping the migration script applies. One NocoDB Interest row produces multiple new rows.
### 8.1 Top-level transform
```
NocoDB Interests row
─→ 01 client (deduped against existing pool)
─→ 01 client_address
─→ 02 client_contacts (email, phone)
─→ exactly 1 interest
─→ 01 yacht (when Yacht Name present and not "TBC"/"Na"/empty placeholders)
─→ 01 document (when documensoID present)
```
### 8.2 Field map
| NocoDB field | Target | Transform |
| ----------------------------------------------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| `Full Name` | `clients.fullName` | `normalizeName().display` |
| `Email Address` | `clientContacts(channel='email', value=...)` | `normalizeEmail()` |
| `Phone Number` | `clientContacts(channel='phone', valueE164=..., valueCountry=...)` | `normalizePhone(raw, defaultCountry)` |
| `Address` | `clientAddresses.streetAddress` (LongText preserved) | trim |
| `Place of Residence` | `clientAddresses.countryIso` AND `clients.nationalityIso` | `resolveCountry()` |
| `Contact Method Preferred` | `clients.preferredContactMethod` | lowercase, mapped: Email→email, Phone→phone |
| `Source` | `clients.source` | mapped: portal→website, Form→website, External→manual; null → manual |
| `Date Added` | `interests.createdAt` (fallback to NocoDB `Created At` then now) | parse: try `DD-MM-YYYY`, then `YYYY-MM-DD`, then ISO |
| `Sales Process Level` | `interests.pipelineStage` | see §8.3 |
| `Lead Category` | `interests.leadCategory` | General→general_interest, Friends and Family→general_interest with tag |
| `Berth` (FK) | `interests.berthId` | resolve via `Berths` table by `Mooring Number` |
| `Berth Size Desired` | `interests.notes` (appended) | preserve |
| `Yacht Name`, `Length`, `Width`, `Depth` | `yachts.name`, `lengthM`, `widthM`, `draughtM` | skip if name in {`TBC`, `Na`, ``, null}; ft→m via `\* 0.3048` |
| `EOI Status` | `interests.eoiStatus` | Awaiting Further Details→pending; Waiting for Signatures→sent; Signed→signed |
| `Deposit 10% Status` | `interests.depositStatus` | Pending→pending; Received→received |
| `Contract Status` | `interests.contractStatus` | Pending→pending; 40% Received→partial; Complete→complete |
| `EOI Time Sent` | `interests.dateEoiSent` | parse |
| `clientSignTime` / `developerSignTime` / `all_signed_notified_at` | `interests.dateEoiSigned` (use latest) | parse |
| `Time LOI Sent` | `interests.dateContractSent` | parse |
| `Internal Notes` + `Extra Comments` | `clientNotes` (one row, system author) | concatenate with section markers |
| `documensoID` | `documents.documensoId` (when present, type='eoi') | preserve |
| `Signature Link Client/CC/Developer`, `EmbeddedSignature*` | `documents.signers[]` | one row per non-null signer |
| `reminder_enabled`, `last_reminder_sent`, etc. | `interests.reminderEnabled`, `interests.reminderLastFired` | parse, default true |
### 8.3 Sales-stage mapping (8 → 9)
| NocoDB | New (PIPELINE_STAGES) |
| ------------------------------- | ------------------------------------------------------------------------ |
| General Qualified Interest | `open` |
| Specific Qualified Interest | `details_sent` |
| EOI and NDA Sent | `eoi_sent` |
| Signed EOI and NDA | `eoi_signed` |
| Made Reservation | `deposit_10pct` |
| Contract Negotiation | `contract_sent` |
| Contract Negotiations Finalized | `contract_sent` (with audit-note: legacy "negotiations finalized") |
| Contract Signed | `contract_signed` (or `completed` when deposit + contract both complete) |
### 8.4 Other tables
- **Residential Interests** (35 rows) — same shape as Interests but maps to `residentialClients` + `residentialInterests`. Smaller and cleaner. Same dedup runs within this pool independently.
- **Website - Interest Submissions** (64 rows) — these are **inbound capture, not yet a client**. Treat as if each row is a fresh public-form submission today: run dedup against the migrated client pool. Auto-link if `dedup_public_form_auto_link` setting allows.
- **Website - Contact Form Submissions** (47 rows) — sparse data (just name + email + interest type). Skip migration; export as CSV for manual triage. Not the source of truth for any deal.
- **Website - Berth EOI Details Supplements** (1 row) — single record, preserved as a one-off attached to the matching Interest.
- **Newsletter Sending** (69 rows) — out of scope; that's a marketing surface, not CRM.
- **Interests Backup, Interests copy** — historical artifacts. Skipped by default. A `--include-backups` flag attaches them as audit-note entries on the corresponding live Interest if the user wants the history.
---
## 9. Migration script
Located at `scripts/migrate-from-nocodb.ts`. Idempotent: safe to re-run. Three main flags:
```
$ pnpm tsx scripts/migrate-from-nocodb.ts --dry-run [--port-slug X]
Pulls everything, transforms, runs dedup, writes CSV report to .migration/<timestamp>/. No DB writes.
$ pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<timestamp>/
Reads the report, performs the writes the dry-run promised. Refuses if the source data has changed since the report was generated (hash mismatch).
$ pnpm tsx scripts/migrate-from-nocodb.ts --rollback --apply-id <id>
Reads the apply log, undoes the writes (only valid within the undo window).
```
Reuses the `client-portal/server/utils/nocodb.ts` adapter for the NocoDB API client (no need to rebuild). Writes to the new system via Drizzle (re-using the existing services like `createClient`, `createInterest`, etc., so all the same validation runs).
### 9.1 Dry-run report format
`.migration/<timestamp>/report.csv`:
```csv
op,reason,nocodb_row_id,target_table,target_value,confidence,manual_review_required
create_client,new,624,clients.fullName,Deepak Ramchandani,N/A,false
create_contact,new,624,clientContacts.email,dannyrams8888@gmail.com,N/A,false
create_contact,new,624,clientContacts.phone,+17215868888,N/A,false
create_interest,new,624,interests.berthId,a1b2c3...,N/A,false
auto_link,score=98 (email+phone),625,clients.id,<existing client UUID from row 624>,high,false
flag_for_review,score=72 (same name diff country),188,client.id,<existing client UUID from row 717>,medium,true
country_unresolved,fallback to AI (port country),198,clientAddresses.countryIso,AI,low,true
phone_unparseable,placeholder all-zeros,641,clientContacts.phone,<skipped>,N/A,true
```
Plus `.migration/<timestamp>/summary.md`:
```
# Migration Dry-Run — 2026-05-03 14:23 UTC
NocoDB: 252 Interests + 35 Residences + 64 Website Submissions
Outcome: 198 clients, 287 interests (incl. residences), 91 yachts, 412 contacts
Auto-linked (high confidence, no human action needed):
- Nicolas Ruiz: rows 681,682,683 → 1 client + 3 interests
- John Lynch: rows 716,725 → 1 client + 2 interests
- Deepak Ramchandani: rows 624,625 → 1 client + 2 interests
- [12 more]
Flagged for manual review (medium confidence):
- Etiennette Clamouze (rows 188,717): same name, different country phone + email
- Bruno Joyerot #18 + Bruce Hearn #19: shared household contact
- [4 more]
Country resolution failed for 7 rows. All defaulted to port country (AI). Review:
- Row 239: "Sag Harbor Y" → AI (likely US)
- [6 more]
Phone parsing failed for 3 rows. All flagged, no contact created:
- Row 178: empty
- Row 641: placeholder "+447000000000"
- Row 175: empty
Run `--apply` to commit these changes.
```
### 9.2 Apply phase
`--apply` reads the report, re-fetches the source rows (via NocoDB MCP / API), recomputes the hash, fails fast if NocoDB changed since dry-run. Then performs the writes within a single PostgreSQL transaction per port (commit at end). On any error mid-transaction, full rollback.
After successful apply, an `apply_id` is generated and an audit-log row written. The `apply_id` is the handle used for `--rollback`.
### 9.3 Idempotency
The script tracks NocoDB row IDs in a `migration_source_links` table:
```ts
export const migrationSourceLinks = pgTable('migration_source_links', {
id: text('id').primaryKey()...,
sourceSystem: text('source_system').notNull(), // 'nocodb_interests' | 'nocodb_residences' | …
sourceId: text('source_id').notNull(), // NocoDB row id as string
targetEntityType: text('target_entity_type').notNull(), // client | interest | yacht | …
targetEntityId: text('target_entity_id').notNull(),
appliedAt: timestamp('applied_at')...,
appliedBy: text('applied_by'),
}, (table) => [
uniqueIndex('idx_msl_source').on(table.sourceSystem, table.sourceId, table.targetEntityType),
]);
```
Re-running `--apply` against the same report skips rows already in this table. Useful for partial-failure resumption.
---
## 10. Test plan
### 10.1 Library-level (vitest unit)
- `tests/unit/dedup/normalize.test.ts` — every dirty-data pattern from §1.3 has a fixture asserting the expected normalized output.
- `tests/unit/dedup/find-matches.test.ts` — every duplicate cluster from §1.2 has a fixture asserting score + confidence tier. Hard cases (Pattern F) assert "medium" not "high" — false-positive guard.
### 10.2 Service-level (vitest integration)
- `tests/integration/dedup/client-merge.test.ts` — merge service exercised: full reattach, clientMergeLog written, undo within window restores, undo after window returns 410, concurrent merge of same loser fails the second.
- `tests/integration/dedup/at-create-suggestion.test.ts` — `findClientMatches` against a seeded pool returns expected matches + reasons.
### 10.3 Migration script (vitest integration with NocoDB mock)
- `tests/integration/dedup/migration-dry-run.test.ts` — feed the script a fixture NocoDB dump (the 252 rows, frozen as a JSON snapshot in fixtures), assert the resulting CSV matches a golden file. Catch any future regression in the transform pipeline.
- `tests/integration/dedup/migration-apply.test.ts` — apply the dry-run output to a clean test DB, assert all expected rows exist, assert idempotency (re-apply is a no-op).
### 10.4 E2E (Playwright)
- `tests/e2e/smoke/30-dedup-create.spec.ts` — type into ClientForm with an email matching seeded client; assert suggestion card appears; click "Use this client"; assert form switches to interest-create mode.
- `tests/e2e/smoke/31-admin-duplicates.spec.ts` — admin views review queue, opens a candidate, side-by-side merge UI works, merge succeeds, undo within window works.
---
## 11. Rollback plan
Three layers of safety, ordered by reversibility:
1. **Per-merge undo** — admin clicks Undo on a wrongly-merged pair, system rolls back from `clientMergeLog` snapshot. 7-day window. No engineering needed.
2. **Migration `--rollback` flag** — entire migration apply is reversed via the `apply_id` and `migration_source_links` table. Useful in the first 24h after `--apply`. Engineering-supervised.
3. **DB restore from backup** — the existing `docs/ops/backup-runbook.md` covers this. Last resort if both above are blocked.
Pre-migration, take a hot backup of the new DB (`pg_dump`). Pre-merge in production (before any human-facing surface ships), the `dedup_auto_merge_threshold` defaults to `null` so no automatic merges happen — every merge is human-confirmed.
---
## 12. Open items
- **Soundex vs metaphone** — Soundex is simpler but English-leaning. Metaphone handles non-English surnames better (the dataset has French, German, Italian, Slavic names). Default to metaphone via the `natural` package; revisit if it adds significant install size.
- **Cross-port dedup** — not in scope. Each port's clients are deduped within that port. A future "shared address book" feature would need its own design.
- **Profile photo / face match** — out of scope.
- **AI-assisted match resolution** — out of scope. The Layer-3 review queue is human-only.
---
## Implementation sequence
P1 (this design's library) → P2 (runtime surfaces) → P3 (migration). Each is a separate plan / PR.
**P1 deliverables**: `src/lib/dedup/{normalize,find-matches}.ts` + tests. No UI changes. No DB changes (except indexed lookups added to existing `clientContacts`). ~1.5 days.
**P2 deliverables**: at-create suggestion in `ClientForm` + interest-level guard in `createInterest` service + admin settings UI for thresholds + `clientMergeCandidates` table + nightly job + admin review queue page + merge service + side-by-side merge UI. ~57 days.
**P3 deliverables**: `scripts/migrate-from-nocodb.ts` + `migration_source_links` table + dry-run + apply + rollback. CSV report format frozen against fixture. ~3 days, including fixture creation from the live NocoDB snapshot.
Total: ~1012 engineering days from approval. Can be split across three PRs landing independently — each is testable in isolation and the runtime surfaces (P2) work even without P3 being run.