Compare commits
1 Commits
94f049c8b8
...
docs/dedup
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
36b92eb827 |
564
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md
Normal file
564
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md
Normal file
@@ -0,0 +1,564 @@
|
|||||||
|
# Client Deduplication and NocoDB Migration Design
|
||||||
|
|
||||||
|
**Status**: Design draft 2026-05-03 — pending approval.
|
||||||
|
**Plan decomposition**: Three implementation plans stack from this design — (P1) normalization + dedup core library; (P2) admin settings + at-create + interest-level guards (runtime); (P3) NocoDB migration script + review queue UI. P1 unblocks P2 and P3.
|
||||||
|
**Branch base**: stacks on `feat/mobile-foundation` once it merges to `main`.
|
||||||
|
**Out of scope**: live merge of two clients across ports (cross-tenant), automated AI-judged matches, profile-photo / face-match dedup, web-of-trust referrer relationships.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Background
|
||||||
|
|
||||||
|
### 1.1 Why this exists
|
||||||
|
|
||||||
|
The legacy CRM lives in a NocoDB base whose `Interests` table conflates _the human_ with _the deal_. A row contains `Full Name`, `Email Address`, `Phone Number`, `Address`, `Place of Residence` _and_ the sales-pipeline state for one specific berth. A single human pursuing two berths becomes two rows with semi-duplicated personal data. A 2026-05-03 read-only audit confirmed:
|
||||||
|
|
||||||
|
- **252 Interests rows** in NocoDB, against an estimated ~190–200 unique humans (~20–25% duplication rate).
|
||||||
|
- **35 Residential Interests rows** in a parallel residential pathway with the same conflation.
|
||||||
|
- **64 Website Interest Submissions + 47 Website Contact Form Submissions + 1 EOI Supplemental Form** as inbound capture surfaces.
|
||||||
|
- **No Clients table.** The conflated structure is structural, not accidental.
|
||||||
|
|
||||||
|
The new CRM (`src/lib/db/schema/clients.ts`) splits this into `clients` (people) ↔ `interests` (deals), with `clientContacts` (multi-channel), `clientAddresses` (multi-address), and a pre-existing `clientMergeLog` table that anticipates merge with undo. The design has been ready; what's missing is (a) a normalization + matching library, (b) the at-create / at-import surfaces that use it, and (c) the migration of the existing 252+35 records.
|
||||||
|
|
||||||
|
### 1.2 Real duplicate patterns observed in the live data
|
||||||
|
|
||||||
|
Sampled 200 of the 252 NocoDB Interests rows. Confirmed duplicate clusters fall into six patterns:
|
||||||
|
|
||||||
|
| Pattern | Example rows | Signature |
|
||||||
|
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
|
||||||
|
| **A. Pure double-submit** | Deepak Ramchandani #624/#625; John Lynch #716/#725 | All fields identical; created same day |
|
||||||
|
| **B. Phone format variance** | Howard Wiarda #236/#536 (`574-274-0548` vs `+15742740548`); Christophe Zasso #701/#702 (`0651381036` vs `0033651381036`) | Same email, normalize-equal phone |
|
||||||
|
| **C. Name capitalization** | Nicolas Ruiz #681/#682/#683; Jean-Charles Miege/MIEGE #37/#163; John Farmer/FARMER #35/#161 | Same email or empty; surname case differs |
|
||||||
|
| **D. Name shortening** | Chris vs Christopher Allen #700/#534; Emma c vs Emma Cauchefer #661/#673 | Same email + phone; given-name truncated |
|
||||||
|
| **E. Resubmit with typo** | Christopher Camazou #649/#650 (phone last 4 digits typo); Gianfranco Di Constanzo/Costanzo #585/#336 (surname typo, **different yacht** — should be ONE client + TWO interests) | Score-on-everything-else high, one field has small-edit-distance noise |
|
||||||
|
| **F. Hard cases** | Etiennette Clamouze #188/#717 (same name, different country phone + email); Bruno Joyerot #18 with email belonging to Bruce Hearn #19 (couple sharing contact) | Cannot resolve without a human |
|
||||||
|
|
||||||
|
This dataset will be the fixture for the dedup library's tests — every pattern above must be either auto-detected or flagged for review, and the false-positive bar must be high enough that Pattern F doesn't get force-merged.
|
||||||
|
|
||||||
|
### 1.3 Dirty data inventory
|
||||||
|
|
||||||
|
The migration normalizer must survive these real values from production:
|
||||||
|
|
||||||
|
**Phone fields**: `+1-264-235-8840\r` (with carriage return), `'+1.214.603.4235` (apostrophe + dots), `0677580750/0690511494` (two numbers in one field), `00447956657022` (00 prefix), `+447000000000` (placeholder all-zeros), `+4901637039672` (impossible — stripped 0 + country prefix), various unprefixed local formats, dashed US numbers without country code.
|
||||||
|
|
||||||
|
**Email fields**: mixed case rampant (`Arthur@laser-align.com` vs `arthur@laser-align.com`); ALL-CAPS local parts; trailing whitespace.
|
||||||
|
|
||||||
|
**Name fields**: ALL-CAPS surnames mixed with title-case given names; embedded `\n` and `\r`; double spaces; lowercase-only entries; slash-with-company variants (`Daniel Wainstein / 7 Knots, LLC`, `Bruno Joyerot / SAS TIKI`); placeholder `Mr DADER`, `TBC`.
|
||||||
|
|
||||||
|
**Place of Residence (free text)**: `Saint barthelemy`, `St Barth`, `Saint-Barthélemy` (same place, three forms); `anguilla`, `United States `, `USA`, `Kansas City` (city without country), `Sag Harbor Y` (typo).
|
||||||
|
|
||||||
|
### 1.4 Existing battle-tested algorithm
|
||||||
|
|
||||||
|
`client-portal/server/utils/duplicate-detection.ts` already implements blocking + weighted-rules dedup against this same NocoDB. It runs in production today. We **port it forward** (don't reinvent), then add: soundex/metaphone for surname matching, compounded-confidence when multiple rules match, and negative evidence (same email + different country phone reduces confidence).
|
||||||
|
|
||||||
|
### 1.5 Why the website is no longer the source of new dirty data
|
||||||
|
|
||||||
|
The website forms (`website/components/pn/specific/website/{berths-item,register,form}/form.vue`) use `<v-phone-input>` with a country picker (`prefer-countries: ['US', 'GB', 'DE', 'FR']`) and `[(value) => !!value || 'Phone number is required']` validation. Output is E.164-shaped. The 252 dirty rows are legacy — pre-form-redesign submissions, sales-rep manual entries, and external CSV imports. Future inbound is clean.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Approach
|
||||||
|
|
||||||
|
Three artifacts, layered:
|
||||||
|
|
||||||
|
1. **A pure-logic normalization + matching library** at `src/lib/dedup/`. JSX-free, vitest-native (proven pattern: `realtime-invalidation-core.ts`). Tested against the dirty-data fixture corpus drawn from §1.2.
|
||||||
|
2. **Three runtime surfaces** that use the library: at-create suggestion in client/interest forms; interest-level same-berth guard; admin review queue powered by a nightly background scoring job.
|
||||||
|
3. **A one-shot migration script** that pulls NocoDB → normalizes → dedupes → writes new schema → produces a CSV report with auto-merge log + flagged-for-review pile.
|
||||||
|
|
||||||
|
**Configurability via admin settings** (`system_settings` per port) so the team can tune sensitivity without code changes. Defaults err on the safe side — a flagged review is cheaper than a wrongly-merged record.
|
||||||
|
|
||||||
|
**Reversibility**: every merge writes a `client_merge_log` row containing the loser's full pre-state JSON. A 7-day undo window lets a wrong merge be reversed without engineering involvement. After 7 days the snapshot is purged for GDPR; merges become permanent.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Normalization library
|
||||||
|
|
||||||
|
Lives at `src/lib/dedup/normalize.ts`. Pure functions, no DB, vitest-tested. Used by the dedup algorithm AND by all create-paths so what gets stored is already normalized.
|
||||||
|
|
||||||
|
### 3.1 `normalizeName(raw: string)`
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export function normalizeName(raw: string): {
|
||||||
|
display: string; // human-readable, kept for UI
|
||||||
|
normalized: string; // for matching
|
||||||
|
surnameToken?: string; // for surname-based blocking
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
- Trim leading/trailing whitespace
|
||||||
|
- Replace `\r`, `\n`, tabs with single space
|
||||||
|
- Collapse consecutive whitespace to single space
|
||||||
|
- Smart title-case: keep particles (`van`, `de`, `del`, `O'`, `di`, `le`, `da`) lowercase except as first token
|
||||||
|
- `display` preserves user's intent (slash-with-company stays intact)
|
||||||
|
- `normalized` is `display.toLowerCase()` for comparison
|
||||||
|
- `surnameToken` is the last non-particle token for blocking
|
||||||
|
|
||||||
|
### 3.2 `normalizeEmail(raw: string)`
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export function normalizeEmail(raw: string): string | null;
|
||||||
|
```
|
||||||
|
|
||||||
|
- Trim + lowercase
|
||||||
|
- Validate via `zod.email()` schema
|
||||||
|
- Returns `null` for empty / invalid (caller decides what to do)
|
||||||
|
- **Does NOT strip plus-aliases** (`user+tag@domain.com`) — both intentional (real distinct addresses) and malicious-prevention apply. Compare by full localpart.
|
||||||
|
|
||||||
|
### 3.3 `normalizePhone(raw: string, defaultCountry: string)`
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export function normalizePhone(
|
||||||
|
raw: string,
|
||||||
|
defaultCountry: string,
|
||||||
|
): {
|
||||||
|
e164: string | null; // canonical, e.g. '+15742740548'
|
||||||
|
country: string | null; // ISO-3166-1 alpha-2
|
||||||
|
display: string | null; // user-facing pretty
|
||||||
|
flagged?: 'multi_number' | 'placeholder' | 'unparseable';
|
||||||
|
} | null;
|
||||||
|
```
|
||||||
|
|
||||||
|
Pipeline:
|
||||||
|
|
||||||
|
1. Strip `\r`, `\n`, tabs, single quotes, dots, dashes, parens, spaces
|
||||||
|
2. If contains `/` or `;` or `,` → flag `multi_number`, take first segment
|
||||||
|
3. If matches `+\d{2}0+$` (e.g., `+447000000000`) → flag `placeholder`, return null
|
||||||
|
4. If starts with `00` → replace with `+`
|
||||||
|
5. If starts with `+` → parse as E.164
|
||||||
|
6. Else if `defaultCountry` provided → parse against that country
|
||||||
|
7. Else return null (caller's problem)
|
||||||
|
|
||||||
|
Backed by `libphonenumber-js` (already in deps via `tests/integration/factories.ts` usage if not, will add). The hostile cases above all need explicit handling — naïve regex won't survive.
|
||||||
|
|
||||||
|
### 3.4 `resolveCountry(text: string)`
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export function resolveCountry(text: string): {
|
||||||
|
iso: string | null; // ISO-3166-1 alpha-2
|
||||||
|
confidence: 'exact' | 'fuzzy' | 'city' | null;
|
||||||
|
};
|
||||||
|
```
|
||||||
|
|
||||||
|
Reuses `src/lib/i18n/countries.ts`. Pipeline:
|
||||||
|
|
||||||
|
1. Lowercase + strip diacritics
|
||||||
|
2. Exact match against country names (any locale we ship)
|
||||||
|
3. Fuzzy match (Levenshtein ≤ 2 against canonical English names)
|
||||||
|
4. City fallback — small in-package mapping for high-frequency cities seen in legacy data (`Sag Harbor → US`, `Kansas City → US`, `St Barth → BL`, etc.). Order: exact → city → fuzzy.
|
||||||
|
|
||||||
|
The mapping is opinionated and small (~30 entries covering the actual values seen in the 252-row dataset). Anything that fails to resolve returns `null` and lands in the migration's flagged pile.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Dedup algorithm
|
||||||
|
|
||||||
|
Lives at `src/lib/dedup/find-matches.ts`. Pure function. Vitest-tested against the §1.2 cluster fixtures.
|
||||||
|
|
||||||
|
### 4.1 Public API
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export interface MatchCandidate {
|
||||||
|
id: string;
|
||||||
|
fullName: string | null;
|
||||||
|
emails: string[]; // already normalized
|
||||||
|
phonesE164: string[]; // already normalized E.164
|
||||||
|
countryIso: string | null;
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface MatchResult {
|
||||||
|
candidate: MatchCandidate;
|
||||||
|
score: number; // 0–100
|
||||||
|
reasons: string[]; // human-readable, e.g. ["email match", "phone match"]
|
||||||
|
confidence: 'high' | 'medium' | 'low';
|
||||||
|
}
|
||||||
|
|
||||||
|
export function findClientMatches(
|
||||||
|
input: MatchCandidate,
|
||||||
|
pool: MatchCandidate[],
|
||||||
|
thresholds: DedupThresholds,
|
||||||
|
): MatchResult[];
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.2 Scoring rules (compound)
|
||||||
|
|
||||||
|
Each rule produces a score addition. **Compounding**: when two strong rules match (e.g., email AND phone), the result is ~95+ rather than max(50, 50). Negative evidence subtracts.
|
||||||
|
|
||||||
|
| Rule | Score | Notes |
|
||||||
|
| --------------------------------------------------------------- | ----- | ------------------------------------------------------ |
|
||||||
|
| Exact email match (case-insensitive, normalized) | +60 | One match suffices |
|
||||||
|
| Exact phone E.164 match (≥ 8 significant digits) | +50 | Excludes placeholder all-zeros |
|
||||||
|
| Exact normalized full-name match | +20 | Many "John Smith"s exist |
|
||||||
|
| Surname soundex match + given-name fuzzy match (Lev ≤ 1) | +15 | Catches `Constanzo/Costanzo`, `Christophe/Christopher` |
|
||||||
|
| Same address (normalized fuzzy ≥ 0.8) | +10 | Bonus signal |
|
||||||
|
| **Negative**: Same email but different country code on phone | −15 | Suggests spouse / coworker / shared inbox |
|
||||||
|
| **Negative**: Same name but DIFFERENT email AND DIFFERENT phone | −20 | Two distinct people with the same name |
|
||||||
|
|
||||||
|
### 4.3 Confidence tiers (post-compound)
|
||||||
|
|
||||||
|
- **score ≥ 90 — `high`** — email AND phone match, or email + name + address. Block-create suggest "Use existing." Auto-link on public-form submit by default.
|
||||||
|
- **score 50–89 — `medium`** — single strong signal (email or phone alone), or email + same-name + different country (Etiennette case). Soft-warn but allow.
|
||||||
|
- **score < 50 — `low`** — weak signals only. Don't surface in UI; only relevant in background-job review queue.
|
||||||
|
|
||||||
|
### 4.4 Blocking strategy
|
||||||
|
|
||||||
|
For O(n) scan over a pool of N existing clients, build three lookup maps once per scan:
|
||||||
|
|
||||||
|
- `byEmail: Map<string, MatchCandidate[]>` — keyed by normalized email
|
||||||
|
- `byPhoneE164: Map<string, MatchCandidate[]>` — keyed by E.164
|
||||||
|
- `bySurnameToken: Map<string, MatchCandidate[]>` — keyed by `normalizeName(...).surnameToken`
|
||||||
|
|
||||||
|
For an incoming `MatchCandidate`, the candidate set to compare is the union of pool entries reachable through any of its emails/phones/surname-token. Typically 0–5 candidates per query, regardless of N.
|
||||||
|
|
||||||
|
### 4.5 Performance budget
|
||||||
|
|
||||||
|
For migration: 252 rows compared pairwise once. ~30k comparisons after blocking — a few seconds.
|
||||||
|
|
||||||
|
For runtime at-create: incoming candidate against existing pool of N clients per port. Expected pool size at maturity: 1k–10k. With blocking: <10 comparisons, <1ms target. No DB query needed beyond the initial pool fetch (which itself uses the indexed columns).
|
||||||
|
|
||||||
|
For background nightly job: full pairwise within port, blocked. 10k clients → ~50k pairwise checks per port → <30s. Fine for a nightly cron.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Configurable thresholds (admin settings)
|
||||||
|
|
||||||
|
New rows in `system_settings` per port. Default values err safe (more confirmation, less auto-action).
|
||||||
|
|
||||||
|
| Key | Default | Effect |
|
||||||
|
| ------------------------------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
|
| `dedup_block_create_threshold` | `90` | Score above which the client-create form interrupts: "Use existing client?" |
|
||||||
|
| `dedup_soft_warn_threshold` | `50` | Score above which a soft-warn panel surfaces below the form |
|
||||||
|
| `dedup_review_queue_threshold` | `40` | Background job lands pairs ≥ this score in `/admin/duplicates` |
|
||||||
|
| `dedup_public_form_auto_link` | `true` | When a public-form submission scores ≥ block-threshold against existing client, attach the new interest to that client without prompting. **Safe**: no merge, just attaching a deal. |
|
||||||
|
| `dedup_auto_merge_threshold` | `null` (disabled) | If non-null, merges happen automatically at this threshold without human confirmation. Recommend leaving null until the team is comfortable; `95` is a reasonable cautious value. |
|
||||||
|
| `dedup_undo_window_days` | `7` | How long the loser's pre-state JSON is retained for merge-undo. After this, the snapshot is purged (GDPR) and merges are permanent. |
|
||||||
|
|
||||||
|
Each setting is a row in `system_settings`. UI surface in `/[portSlug]/admin/dedup` (a new admin page) with an "Advanced" toggle to expose the thresholds and brief explanations.
|
||||||
|
|
||||||
|
If the sales team complains the safer mode is too click-heavy, an admin flips `dedup_auto_merge_threshold` to `95` without any code change.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Merge service contract
|
||||||
|
|
||||||
|
### 6.1 Data flow
|
||||||
|
|
||||||
|
`mergeClients(winnerId, loserId, fieldChoices, ctx)` does, in a single transaction:
|
||||||
|
|
||||||
|
1. **Snapshot loser** — full row + all attached `clientContacts`, `clientAddresses`, `clientNotes`, `clientTags`, plus a count of dependent rows about to be moved (interests, yacht-memberships, etc.). Stored as `mergeDetails` JSONB in `clientMergeLog`.
|
||||||
|
2. **Reattach** — every row pointing at `loserId` updates to point at `winnerId`:
|
||||||
|
- `interests.clientId`
|
||||||
|
- `clientContacts.clientId` — with conflict handling: if winner already has the same email, keep winner's; flag the duplicate for the user
|
||||||
|
- `clientAddresses.clientId` — same conflict handling
|
||||||
|
- `clientNotes.clientId` — preserve `authorId` + `createdAt` (never overwrite)
|
||||||
|
- `clientTags.clientId`
|
||||||
|
- `clientYachtMembership.clientId` (or whatever the table is called)
|
||||||
|
- `auditLogs.entityId` — annotate, don't move (audit truth)
|
||||||
|
3. **Apply fieldChoices** — for each field where the user picked the loser's value, copy that into the winner row.
|
||||||
|
4. **Soft-archive loser** — `loser.archivedAt = now()`, `loser.mergedIntoClientId = winnerId`. Row stays in DB so the merge is reversible.
|
||||||
|
5. **Write `clientMergeLog`** — `{ winnerId, loserId, mergedBy, mergedAt, mergeDetails: <snapshot>, fieldChoices }`.
|
||||||
|
6. **Audit log** — top-level `auditLogs` row: `{ action: 'merge', entityType: 'client', entityId: winnerId, metadata: { loserId, score, reasons } }`.
|
||||||
|
|
||||||
|
### 6.2 Schema additions (migration)
|
||||||
|
|
||||||
|
`clients` table gets a new column:
|
||||||
|
|
||||||
|
```ts
|
||||||
|
mergedIntoClientId: text('merged_into_client_id').references(() => clients.id),
|
||||||
|
```
|
||||||
|
|
||||||
|
The existing `clientMergeLog` table is reused. Add a partial index for the undo-window query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
CREATE INDEX idx_cml_recent ON client_merge_log (port_id, created_at DESC) WHERE created_at > NOW() - INTERVAL '7 days';
|
||||||
|
```
|
||||||
|
|
||||||
|
A daily maintenance job (using the existing `maintenance-cleanup.test.ts` infrastructure) purges `mergeDetails` JSONB older than `dedup_undo_window_days` setting.
|
||||||
|
|
||||||
|
### 6.3 Undo
|
||||||
|
|
||||||
|
`unmergeClients(mergeLogId, ctx)`:
|
||||||
|
|
||||||
|
1. Within the undo window, look up the snapshot
|
||||||
|
2. Restore loser: clear `archivedAt`, `mergedIntoClientId`
|
||||||
|
3. Restore loser's contacts/addresses/notes/tags from snapshot
|
||||||
|
4. Detach reattached rows: `interests` etc. that were touching `winnerId` and originally belonged to loser go back. The snapshot stores the original `(rowType, rowId)` list explicitly so this is deterministic.
|
||||||
|
5. Mark log row `undoneAt = now()`, `undoneBy = userId`
|
||||||
|
|
||||||
|
After 7 days the snapshot is gone and unmerge returns `410 Gone`.
|
||||||
|
|
||||||
|
### 6.4 Concurrency
|
||||||
|
|
||||||
|
Both merge and unmerge wrap in a single transaction with `SELECT … FOR UPDATE` on `clients.id` of both winner and loser. A second merge attempt against the same loser sees `mergedIntoClientId` already set and refuses (clear error: "Already merged into …").
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Runtime surfaces
|
||||||
|
|
||||||
|
### 7.1 Layer 1 — At-create suggestion
|
||||||
|
|
||||||
|
In `ClientForm` (and the public `register` form once that hits the new system):
|
||||||
|
|
||||||
|
- Debounced 300ms after email or phone field changes
|
||||||
|
- Calls `findClientMatches` against current port's clients
|
||||||
|
- Renders top-1 match if score ≥ `dedup_soft_warn_threshold`:
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────┐
|
||||||
|
│ This looks like an existing client │
|
||||||
|
│ ML Marcus Laurent │
|
||||||
|
│ marcus@… +33 6 12 34 56 78 │
|
||||||
|
│ 2 interests · last 9d ago │
|
||||||
|
│ [ Use this client ] [ Create new ] │
|
||||||
|
└─────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
- "Use this client" → form switches to "create new interest under existing client" mode (preserves whatever other fields the user typed)
|
||||||
|
- "Create new" → audit-log `dedup_override` with the candidate's id and reasons (so we have data on false positives)
|
||||||
|
|
||||||
|
### 7.2 Layer 2 — Interest-level same-berth guard
|
||||||
|
|
||||||
|
Cheap one-liner in `createInterest` service:
|
||||||
|
|
||||||
|
- Check `(clientId, berthId)` against existing non-archived interests
|
||||||
|
- If hit, throw `BerthDuplicateError` with the existing interest details
|
||||||
|
- UI catches and prompts: "Update existing or create separate?"
|
||||||
|
|
||||||
|
This is NOT the same as client-level dedup. Same client legitimately can pursue the same berth a second time after it falls through. But the prompt-before-create catches the accidental double-submit case.
|
||||||
|
|
||||||
|
### 7.3 Layer 3 — Background scoring + review queue
|
||||||
|
|
||||||
|
- A nightly cron (using existing BullMQ infrastructure — search for `scheduled-tasks` in repo) runs `findClientMatches` over each port's full client pool
|
||||||
|
- Pairs scoring ≥ `dedup_review_queue_threshold` land in a `client_merge_candidates` table:
|
||||||
|
```ts
|
||||||
|
export const clientMergeCandidates = pgTable('client_merge_candidates', {
|
||||||
|
id: text('id').primaryKey()...,
|
||||||
|
portId: text('port_id').notNull()...,
|
||||||
|
clientAId: text('client_a_id').notNull()...,
|
||||||
|
clientBId: text('client_b_id').notNull()...,
|
||||||
|
score: integer('score').notNull(),
|
||||||
|
reasons: jsonb('reasons').notNull(),
|
||||||
|
status: text('status').notNull().default('pending'), // pending | dismissed | merged
|
||||||
|
createdAt: timestamp('created_at')...,
|
||||||
|
resolvedAt: timestamp('resolved_at'),
|
||||||
|
resolvedBy: text('resolved_by'),
|
||||||
|
})
|
||||||
|
```
|
||||||
|
- `/[portSlug]/admin/duplicates` lists pending candidates sorted by score desc, with `[Review →]` opening a side-by-side merge dialog
|
||||||
|
- Dismissing a candidate marks it `status=dismissed` so the job doesn't re-surface the same pair tomorrow (a future score increase re-creates it).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. NocoDB → new system field mapping
|
||||||
|
|
||||||
|
This is the explicit mapping the migration script applies. One NocoDB Interest row produces multiple new rows.
|
||||||
|
|
||||||
|
### 8.1 Top-level transform
|
||||||
|
|
||||||
|
```
|
||||||
|
NocoDB Interests row
|
||||||
|
─→ 0–1 client (deduped against existing pool)
|
||||||
|
─→ 0–1 client_address
|
||||||
|
─→ 0–2 client_contacts (email, phone)
|
||||||
|
─→ exactly 1 interest
|
||||||
|
─→ 0–1 yacht (when Yacht Name present and not "TBC"/"Na"/empty placeholders)
|
||||||
|
─→ 0–1 document (when documensoID present)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 8.2 Field map
|
||||||
|
|
||||||
|
| NocoDB field | Target | Transform |
|
||||||
|
| ----------------------------------------------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
|
||||||
|
| `Full Name` | `clients.fullName` | `normalizeName().display` |
|
||||||
|
| `Email Address` | `clientContacts(channel='email', value=...)` | `normalizeEmail()` |
|
||||||
|
| `Phone Number` | `clientContacts(channel='phone', valueE164=..., valueCountry=...)` | `normalizePhone(raw, defaultCountry)` |
|
||||||
|
| `Address` | `clientAddresses.streetAddress` (LongText preserved) | trim |
|
||||||
|
| `Place of Residence` | `clientAddresses.countryIso` AND `clients.nationalityIso` | `resolveCountry()` |
|
||||||
|
| `Contact Method Preferred` | `clients.preferredContactMethod` | lowercase, mapped: Email→email, Phone→phone |
|
||||||
|
| `Source` | `clients.source` | mapped: portal→website, Form→website, External→manual; null → manual |
|
||||||
|
| `Date Added` | `interests.createdAt` (fallback to NocoDB `Created At` then now) | parse: try `DD-MM-YYYY`, then `YYYY-MM-DD`, then ISO |
|
||||||
|
| `Sales Process Level` | `interests.pipelineStage` | see §8.3 |
|
||||||
|
| `Lead Category` | `interests.leadCategory` | General→general_interest, Friends and Family→general_interest with tag |
|
||||||
|
| `Berth` (FK) | `interests.berthId` | resolve via `Berths` table by `Mooring Number` |
|
||||||
|
| `Berth Size Desired` | `interests.notes` (appended) | preserve |
|
||||||
|
| `Yacht Name`, `Length`, `Width`, `Depth` | `yachts.name`, `lengthM`, `widthM`, `draughtM` | skip if name in {`TBC`, `Na`, ``, null}; ft→m via `\* 0.3048` |
|
||||||
|
| `EOI Status` | `interests.eoiStatus` | Awaiting Further Details→pending; Waiting for Signatures→sent; Signed→signed |
|
||||||
|
| `Deposit 10% Status` | `interests.depositStatus` | Pending→pending; Received→received |
|
||||||
|
| `Contract Status` | `interests.contractStatus` | Pending→pending; 40% Received→partial; Complete→complete |
|
||||||
|
| `EOI Time Sent` | `interests.dateEoiSent` | parse |
|
||||||
|
| `clientSignTime` / `developerSignTime` / `all_signed_notified_at` | `interests.dateEoiSigned` (use latest) | parse |
|
||||||
|
| `Time LOI Sent` | `interests.dateContractSent` | parse |
|
||||||
|
| `Internal Notes` + `Extra Comments` | `clientNotes` (one row, system author) | concatenate with section markers |
|
||||||
|
| `documensoID` | `documents.documensoId` (when present, type='eoi') | preserve |
|
||||||
|
| `Signature Link Client/CC/Developer`, `EmbeddedSignature*` | `documents.signers[]` | one row per non-null signer |
|
||||||
|
| `reminder_enabled`, `last_reminder_sent`, etc. | `interests.reminderEnabled`, `interests.reminderLastFired` | parse, default true |
|
||||||
|
|
||||||
|
### 8.3 Sales-stage mapping (8 → 9)
|
||||||
|
|
||||||
|
| NocoDB | New (PIPELINE_STAGES) |
|
||||||
|
| ------------------------------- | ------------------------------------------------------------------------ |
|
||||||
|
| General Qualified Interest | `open` |
|
||||||
|
| Specific Qualified Interest | `details_sent` |
|
||||||
|
| EOI and NDA Sent | `eoi_sent` |
|
||||||
|
| Signed EOI and NDA | `eoi_signed` |
|
||||||
|
| Made Reservation | `deposit_10pct` |
|
||||||
|
| Contract Negotiation | `contract_sent` |
|
||||||
|
| Contract Negotiations Finalized | `contract_sent` (with audit-note: legacy "negotiations finalized") |
|
||||||
|
| Contract Signed | `contract_signed` (or `completed` when deposit + contract both complete) |
|
||||||
|
|
||||||
|
### 8.4 Other tables
|
||||||
|
|
||||||
|
- **Residential Interests** (35 rows) — same shape as Interests but maps to `residentialClients` + `residentialInterests`. Smaller and cleaner. Same dedup runs within this pool independently.
|
||||||
|
- **Website - Interest Submissions** (64 rows) — these are **inbound capture, not yet a client**. Treat as if each row is a fresh public-form submission today: run dedup against the migrated client pool. Auto-link if `dedup_public_form_auto_link` setting allows.
|
||||||
|
- **Website - Contact Form Submissions** (47 rows) — sparse data (just name + email + interest type). Skip migration; export as CSV for manual triage. Not the source of truth for any deal.
|
||||||
|
- **Website - Berth EOI Details Supplements** (1 row) — single record, preserved as a one-off attached to the matching Interest.
|
||||||
|
- **Newsletter Sending** (69 rows) — out of scope; that's a marketing surface, not CRM.
|
||||||
|
- **Interests Backup, Interests copy** — historical artifacts. Skipped by default. A `--include-backups` flag attaches them as audit-note entries on the corresponding live Interest if the user wants the history.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Migration script
|
||||||
|
|
||||||
|
Located at `scripts/migrate-from-nocodb.ts`. Idempotent: safe to re-run. Three main flags:
|
||||||
|
|
||||||
|
```
|
||||||
|
$ pnpm tsx scripts/migrate-from-nocodb.ts --dry-run [--port-slug X]
|
||||||
|
Pulls everything, transforms, runs dedup, writes CSV report to .migration/<timestamp>/. No DB writes.
|
||||||
|
|
||||||
|
$ pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<timestamp>/
|
||||||
|
Reads the report, performs the writes the dry-run promised. Refuses if the source data has changed since the report was generated (hash mismatch).
|
||||||
|
|
||||||
|
$ pnpm tsx scripts/migrate-from-nocodb.ts --rollback --apply-id <id>
|
||||||
|
Reads the apply log, undoes the writes (only valid within the undo window).
|
||||||
|
```
|
||||||
|
|
||||||
|
Reuses the `client-portal/server/utils/nocodb.ts` adapter for the NocoDB API client (no need to rebuild). Writes to the new system via Drizzle (re-using the existing services like `createClient`, `createInterest`, etc., so all the same validation runs).
|
||||||
|
|
||||||
|
### 9.1 Dry-run report format
|
||||||
|
|
||||||
|
`.migration/<timestamp>/report.csv`:
|
||||||
|
|
||||||
|
```csv
|
||||||
|
op,reason,nocodb_row_id,target_table,target_value,confidence,manual_review_required
|
||||||
|
create_client,new,624,clients.fullName,Deepak Ramchandani,N/A,false
|
||||||
|
create_contact,new,624,clientContacts.email,dannyrams8888@gmail.com,N/A,false
|
||||||
|
create_contact,new,624,clientContacts.phone,+17215868888,N/A,false
|
||||||
|
create_interest,new,624,interests.berthId,a1b2c3...,N/A,false
|
||||||
|
auto_link,score=98 (email+phone),625,clients.id,<existing client UUID from row 624>,high,false
|
||||||
|
flag_for_review,score=72 (same name diff country),188,client.id,<existing client UUID from row 717>,medium,true
|
||||||
|
country_unresolved,fallback to AI (port country),198,clientAddresses.countryIso,AI,low,true
|
||||||
|
phone_unparseable,placeholder all-zeros,641,clientContacts.phone,<skipped>,N/A,true
|
||||||
|
```
|
||||||
|
|
||||||
|
Plus `.migration/<timestamp>/summary.md`:
|
||||||
|
|
||||||
|
```
|
||||||
|
# Migration Dry-Run — 2026-05-03 14:23 UTC
|
||||||
|
|
||||||
|
NocoDB: 252 Interests + 35 Residences + 64 Website Submissions
|
||||||
|
Outcome: 198 clients, 287 interests (incl. residences), 91 yachts, 412 contacts
|
||||||
|
|
||||||
|
Auto-linked (high confidence, no human action needed):
|
||||||
|
- Nicolas Ruiz: rows 681,682,683 → 1 client + 3 interests
|
||||||
|
- John Lynch: rows 716,725 → 1 client + 2 interests
|
||||||
|
- Deepak Ramchandani: rows 624,625 → 1 client + 2 interests
|
||||||
|
- [12 more]
|
||||||
|
|
||||||
|
Flagged for manual review (medium confidence):
|
||||||
|
- Etiennette Clamouze (rows 188,717): same name, different country phone + email
|
||||||
|
- Bruno Joyerot #18 + Bruce Hearn #19: shared household contact
|
||||||
|
- [4 more]
|
||||||
|
|
||||||
|
Country resolution failed for 7 rows. All defaulted to port country (AI). Review:
|
||||||
|
- Row 239: "Sag Harbor Y" → AI (likely US)
|
||||||
|
- [6 more]
|
||||||
|
|
||||||
|
Phone parsing failed for 3 rows. All flagged, no contact created:
|
||||||
|
- Row 178: empty
|
||||||
|
- Row 641: placeholder "+447000000000"
|
||||||
|
- Row 175: empty
|
||||||
|
|
||||||
|
Run `--apply` to commit these changes.
|
||||||
|
```
|
||||||
|
|
||||||
|
### 9.2 Apply phase
|
||||||
|
|
||||||
|
`--apply` reads the report, re-fetches the source rows (via NocoDB MCP / API), recomputes the hash, fails fast if NocoDB changed since dry-run. Then performs the writes within a single PostgreSQL transaction per port (commit at end). On any error mid-transaction, full rollback.
|
||||||
|
|
||||||
|
After successful apply, an `apply_id` is generated and an audit-log row written. The `apply_id` is the handle used for `--rollback`.
|
||||||
|
|
||||||
|
### 9.3 Idempotency
|
||||||
|
|
||||||
|
The script tracks NocoDB row IDs in a `migration_source_links` table:
|
||||||
|
|
||||||
|
```ts
|
||||||
|
export const migrationSourceLinks = pgTable('migration_source_links', {
|
||||||
|
id: text('id').primaryKey()...,
|
||||||
|
sourceSystem: text('source_system').notNull(), // 'nocodb_interests' | 'nocodb_residences' | …
|
||||||
|
sourceId: text('source_id').notNull(), // NocoDB row id as string
|
||||||
|
targetEntityType: text('target_entity_type').notNull(), // client | interest | yacht | …
|
||||||
|
targetEntityId: text('target_entity_id').notNull(),
|
||||||
|
appliedAt: timestamp('applied_at')...,
|
||||||
|
appliedBy: text('applied_by'),
|
||||||
|
}, (table) => [
|
||||||
|
uniqueIndex('idx_msl_source').on(table.sourceSystem, table.sourceId, table.targetEntityType),
|
||||||
|
]);
|
||||||
|
```
|
||||||
|
|
||||||
|
Re-running `--apply` against the same report skips rows already in this table. Useful for partial-failure resumption.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 10. Test plan
|
||||||
|
|
||||||
|
### 10.1 Library-level (vitest unit)
|
||||||
|
|
||||||
|
- `tests/unit/dedup/normalize.test.ts` — every dirty-data pattern from §1.3 has a fixture asserting the expected normalized output.
|
||||||
|
- `tests/unit/dedup/find-matches.test.ts` — every duplicate cluster from §1.2 has a fixture asserting score + confidence tier. Hard cases (Pattern F) assert "medium" not "high" — false-positive guard.
|
||||||
|
|
||||||
|
### 10.2 Service-level (vitest integration)
|
||||||
|
|
||||||
|
- `tests/integration/dedup/client-merge.test.ts` — merge service exercised: full reattach, clientMergeLog written, undo within window restores, undo after window returns 410, concurrent merge of same loser fails the second.
|
||||||
|
- `tests/integration/dedup/at-create-suggestion.test.ts` — `findClientMatches` against a seeded pool returns expected matches + reasons.
|
||||||
|
|
||||||
|
### 10.3 Migration script (vitest integration with NocoDB mock)
|
||||||
|
|
||||||
|
- `tests/integration/dedup/migration-dry-run.test.ts` — feed the script a fixture NocoDB dump (the 252 rows, frozen as a JSON snapshot in fixtures), assert the resulting CSV matches a golden file. Catch any future regression in the transform pipeline.
|
||||||
|
- `tests/integration/dedup/migration-apply.test.ts` — apply the dry-run output to a clean test DB, assert all expected rows exist, assert idempotency (re-apply is a no-op).
|
||||||
|
|
||||||
|
### 10.4 E2E (Playwright)
|
||||||
|
|
||||||
|
- `tests/e2e/smoke/30-dedup-create.spec.ts` — type into ClientForm with an email matching seeded client; assert suggestion card appears; click "Use this client"; assert form switches to interest-create mode.
|
||||||
|
- `tests/e2e/smoke/31-admin-duplicates.spec.ts` — admin views review queue, opens a candidate, side-by-side merge UI works, merge succeeds, undo within window works.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 11. Rollback plan
|
||||||
|
|
||||||
|
Three layers of safety, ordered by reversibility:
|
||||||
|
|
||||||
|
1. **Per-merge undo** — admin clicks Undo on a wrongly-merged pair, system rolls back from `clientMergeLog` snapshot. 7-day window. No engineering needed.
|
||||||
|
2. **Migration `--rollback` flag** — entire migration apply is reversed via the `apply_id` and `migration_source_links` table. Useful in the first 24h after `--apply`. Engineering-supervised.
|
||||||
|
3. **DB restore from backup** — the existing `docs/ops/backup-runbook.md` covers this. Last resort if both above are blocked.
|
||||||
|
|
||||||
|
Pre-migration, take a hot backup of the new DB (`pg_dump`). Pre-merge in production (before any human-facing surface ships), the `dedup_auto_merge_threshold` defaults to `null` so no automatic merges happen — every merge is human-confirmed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 12. Open items
|
||||||
|
|
||||||
|
- **Soundex vs metaphone** — Soundex is simpler but English-leaning. Metaphone handles non-English surnames better (the dataset has French, German, Italian, Slavic names). Default to metaphone via the `natural` package; revisit if it adds significant install size.
|
||||||
|
- **Cross-port dedup** — not in scope. Each port's clients are deduped within that port. A future "shared address book" feature would need its own design.
|
||||||
|
- **Profile photo / face match** — out of scope.
|
||||||
|
- **AI-assisted match resolution** — out of scope. The Layer-3 review queue is human-only.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation sequence
|
||||||
|
|
||||||
|
P1 (this design's library) → P2 (runtime surfaces) → P3 (migration). Each is a separate plan / PR.
|
||||||
|
|
||||||
|
**P1 deliverables**: `src/lib/dedup/{normalize,find-matches}.ts` + tests. No UI changes. No DB changes (except indexed lookups added to existing `clientContacts`). ~1.5 days.
|
||||||
|
|
||||||
|
**P2 deliverables**: at-create suggestion in `ClientForm` + interest-level guard in `createInterest` service + admin settings UI for thresholds + `clientMergeCandidates` table + nightly job + admin review queue page + merge service + side-by-side merge UI. ~5–7 days.
|
||||||
|
|
||||||
|
**P3 deliverables**: `scripts/migrate-from-nocodb.ts` + `migration_source_links` table + dry-run + apply + rollback. CSV report format frozen against fixture. ~3 days, including fixture creation from the live NocoDB snapshot.
|
||||||
|
|
||||||
|
Total: ~10–12 engineering days from approval. Can be split across three PRs landing independently — each is testable in isolation and the runtime surfaces (P2) work even without P3 being run.
|
||||||
Reference in New Issue
Block a user