Captures the audit findings from a 2026-05-03 read-only NocoDB review plus the algorithm and migration plan for porting the legacy data into the new client / interest / contacts / addresses model. Highlights: - 252 NocoDB Interests rows ≈ ~190–200 unique humans (~20–25% dup rate). Six duplicate patterns documented from real data, including "same person, multiple yachts" — exactly the case the new client/interest split is designed to handle. - Reuses the battle-tested `client-portal/server/utils/duplicate- detection.ts` algorithm (blocking + weighted rules) with additions: metaphone for non-English surnames, compounded confidence when multiple rules match, negative evidence for split-signal cases. - Three runtime surfaces (at-create suggestion, interest-level same-berth guard, background scoring + admin review queue) plus a one-shot migration script with --dry-run / --apply / --rollback. - Configurable thresholds via per-port system_settings so the merge policy can be tuned (defaults to "always confirm" — never auto-merges out of the box). - Reversible: every merge writes a clientMergeLog row with the loser's full pre-state JSON, enabling 7-day undo without engineering. Implementation decomposes into three plans (P1 library / P2 runtime / P3 migration) sequenced after the mobile branch lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
38 KiB
Client Deduplication and NocoDB Migration Design
Status: Design draft 2026-05-03 — pending approval.
Plan decomposition: Three implementation plans stack from this design — (P1) normalization + dedup core library; (P2) admin settings + at-create + interest-level guards (runtime); (P3) NocoDB migration script + review queue UI. P1 unblocks P2 and P3.
Branch base: stacks on feat/mobile-foundation once it merges to main.
Out of scope: live merge of two clients across ports (cross-tenant), automated AI-judged matches, profile-photo / face-match dedup, web-of-trust referrer relationships.
1. Background
1.1 Why this exists
The legacy CRM lives in a NocoDB base whose Interests table conflates the human with the deal. A row contains Full Name, Email Address, Phone Number, Address, Place of Residence and the sales-pipeline state for one specific berth. A single human pursuing two berths becomes two rows with semi-duplicated personal data. A 2026-05-03 read-only audit confirmed:
- 252 Interests rows in NocoDB, against an estimated ~190–200 unique humans (~20–25% duplication rate).
- 35 Residential Interests rows in a parallel residential pathway with the same conflation.
- 64 Website Interest Submissions + 47 Website Contact Form Submissions + 1 EOI Supplemental Form as inbound capture surfaces.
- No Clients table. The conflated structure is structural, not accidental.
The new CRM (src/lib/db/schema/clients.ts) splits this into clients (people) ↔ interests (deals), with clientContacts (multi-channel), clientAddresses (multi-address), and a pre-existing clientMergeLog table that anticipates merge with undo. The design has been ready; what's missing is (a) a normalization + matching library, (b) the at-create / at-import surfaces that use it, and (c) the migration of the existing 252+35 records.
1.2 Real duplicate patterns observed in the live data
Sampled 200 of the 252 NocoDB Interests rows. Confirmed duplicate clusters fall into six patterns:
| Pattern | Example rows | Signature |
|---|---|---|
| A. Pure double-submit | Deepak Ramchandani #624/#625; John Lynch #716/#725 | All fields identical; created same day |
| B. Phone format variance | Howard Wiarda #236/#536 (574-274-0548 vs +15742740548); Christophe Zasso #701/#702 (0651381036 vs 0033651381036) |
Same email, normalize-equal phone |
| C. Name capitalization | Nicolas Ruiz #681/#682/#683; Jean-Charles Miege/MIEGE #37/#163; John Farmer/FARMER #35/#161 | Same email or empty; surname case differs |
| D. Name shortening | Chris vs Christopher Allen #700/#534; Emma c vs Emma Cauchefer #661/#673 | Same email + phone; given-name truncated |
| E. Resubmit with typo | Christopher Camazou #649/#650 (phone last 4 digits typo); Gianfranco Di Constanzo/Costanzo #585/#336 (surname typo, different yacht — should be ONE client + TWO interests) | Score-on-everything-else high, one field has small-edit-distance noise |
| F. Hard cases | Etiennette Clamouze #188/#717 (same name, different country phone + email); Bruno Joyerot #18 with email belonging to Bruce Hearn #19 (couple sharing contact) | Cannot resolve without a human |
This dataset will be the fixture for the dedup library's tests — every pattern above must be either auto-detected or flagged for review, and the false-positive bar must be high enough that Pattern F doesn't get force-merged.
1.3 Dirty data inventory
The migration normalizer must survive these real values from production:
Phone fields: +1-264-235-8840\r (with carriage return), '+1.214.603.4235 (apostrophe + dots), 0677580750/0690511494 (two numbers in one field), 00447956657022 (00 prefix), +447000000000 (placeholder all-zeros), +4901637039672 (impossible — stripped 0 + country prefix), various unprefixed local formats, dashed US numbers without country code.
Email fields: mixed case rampant (Arthur@laser-align.com vs arthur@laser-align.com); ALL-CAPS local parts; trailing whitespace.
Name fields: ALL-CAPS surnames mixed with title-case given names; embedded \n and \r; double spaces; lowercase-only entries; slash-with-company variants (Daniel Wainstein / 7 Knots, LLC, Bruno Joyerot / SAS TIKI); placeholder Mr DADER, TBC.
Place of Residence (free text): Saint barthelemy, St Barth, Saint-Barthélemy (same place, three forms); anguilla, United States , USA, Kansas City (city without country), Sag Harbor Y (typo).
1.4 Existing battle-tested algorithm
client-portal/server/utils/duplicate-detection.ts already implements blocking + weighted-rules dedup against this same NocoDB. It runs in production today. We port it forward (don't reinvent), then add: soundex/metaphone for surname matching, compounded-confidence when multiple rules match, and negative evidence (same email + different country phone reduces confidence).
1.5 Why the website is no longer the source of new dirty data
The website forms (website/components/pn/specific/website/{berths-item,register,form}/form.vue) use <v-phone-input> with a country picker (prefer-countries: ['US', 'GB', 'DE', 'FR']) and [(value) => !!value || 'Phone number is required'] validation. Output is E.164-shaped. The 252 dirty rows are legacy — pre-form-redesign submissions, sales-rep manual entries, and external CSV imports. Future inbound is clean.
2. Approach
Three artifacts, layered:
- A pure-logic normalization + matching library at
src/lib/dedup/. JSX-free, vitest-native (proven pattern:realtime-invalidation-core.ts). Tested against the dirty-data fixture corpus drawn from §1.2. - Three runtime surfaces that use the library: at-create suggestion in client/interest forms; interest-level same-berth guard; admin review queue powered by a nightly background scoring job.
- A one-shot migration script that pulls NocoDB → normalizes → dedupes → writes new schema → produces a CSV report with auto-merge log + flagged-for-review pile.
Configurability via admin settings (system_settings per port) so the team can tune sensitivity without code changes. Defaults err on the safe side — a flagged review is cheaper than a wrongly-merged record.
Reversibility: every merge writes a client_merge_log row containing the loser's full pre-state JSON. A 7-day undo window lets a wrong merge be reversed without engineering involvement. After 7 days the snapshot is purged for GDPR; merges become permanent.
3. Normalization library
Lives at src/lib/dedup/normalize.ts. Pure functions, no DB, vitest-tested. Used by the dedup algorithm AND by all create-paths so what gets stored is already normalized.
3.1 normalizeName(raw: string)
export function normalizeName(raw: string): {
display: string; // human-readable, kept for UI
normalized: string; // for matching
surnameToken?: string; // for surname-based blocking
};
- Trim leading/trailing whitespace
- Replace
\r,\n, tabs with single space - Collapse consecutive whitespace to single space
- Smart title-case: keep particles (
van,de,del,O',di,le,da) lowercase except as first token displaypreserves user's intent (slash-with-company stays intact)normalizedisdisplay.toLowerCase()for comparisonsurnameTokenis the last non-particle token for blocking
3.2 normalizeEmail(raw: string)
export function normalizeEmail(raw: string): string | null;
- Trim + lowercase
- Validate via
zod.email()schema - Returns
nullfor empty / invalid (caller decides what to do) - Does NOT strip plus-aliases (
user+tag@domain.com) — both intentional (real distinct addresses) and malicious-prevention apply. Compare by full localpart.
3.3 normalizePhone(raw: string, defaultCountry: string)
export function normalizePhone(
raw: string,
defaultCountry: string,
): {
e164: string | null; // canonical, e.g. '+15742740548'
country: string | null; // ISO-3166-1 alpha-2
display: string | null; // user-facing pretty
flagged?: 'multi_number' | 'placeholder' | 'unparseable';
} | null;
Pipeline:
- Strip
\r,\n, tabs, single quotes, dots, dashes, parens, spaces - If contains
/or;or,→ flagmulti_number, take first segment - If matches
+\d{2}0+$(e.g.,+447000000000) → flagplaceholder, return null - If starts with
00→ replace with+ - If starts with
+→ parse as E.164 - Else if
defaultCountryprovided → parse against that country - Else return null (caller's problem)
Backed by libphonenumber-js (already in deps via tests/integration/factories.ts usage if not, will add). The hostile cases above all need explicit handling — naïve regex won't survive.
3.4 resolveCountry(text: string)
export function resolveCountry(text: string): {
iso: string | null; // ISO-3166-1 alpha-2
confidence: 'exact' | 'fuzzy' | 'city' | null;
};
Reuses src/lib/i18n/countries.ts. Pipeline:
- Lowercase + strip diacritics
- Exact match against country names (any locale we ship)
- Fuzzy match (Levenshtein ≤ 2 against canonical English names)
- City fallback — small in-package mapping for high-frequency cities seen in legacy data (
Sag Harbor → US,Kansas City → US,St Barth → BL, etc.). Order: exact → city → fuzzy.
The mapping is opinionated and small (~30 entries covering the actual values seen in the 252-row dataset). Anything that fails to resolve returns null and lands in the migration's flagged pile.
4. Dedup algorithm
Lives at src/lib/dedup/find-matches.ts. Pure function. Vitest-tested against the §1.2 cluster fixtures.
4.1 Public API
export interface MatchCandidate {
id: string;
fullName: string | null;
emails: string[]; // already normalized
phonesE164: string[]; // already normalized E.164
countryIso: string | null;
}
export interface MatchResult {
candidate: MatchCandidate;
score: number; // 0–100
reasons: string[]; // human-readable, e.g. ["email match", "phone match"]
confidence: 'high' | 'medium' | 'low';
}
export function findClientMatches(
input: MatchCandidate,
pool: MatchCandidate[],
thresholds: DedupThresholds,
): MatchResult[];
4.2 Scoring rules (compound)
Each rule produces a score addition. Compounding: when two strong rules match (e.g., email AND phone), the result is ~95+ rather than max(50, 50). Negative evidence subtracts.
| Rule | Score | Notes |
|---|---|---|
| Exact email match (case-insensitive, normalized) | +60 | One match suffices |
| Exact phone E.164 match (≥ 8 significant digits) | +50 | Excludes placeholder all-zeros |
| Exact normalized full-name match | +20 | Many "John Smith"s exist |
| Surname soundex match + given-name fuzzy match (Lev ≤ 1) | +15 | Catches Constanzo/Costanzo, Christophe/Christopher |
| Same address (normalized fuzzy ≥ 0.8) | +10 | Bonus signal |
| Negative: Same email but different country code on phone | −15 | Suggests spouse / coworker / shared inbox |
| Negative: Same name but DIFFERENT email AND DIFFERENT phone | −20 | Two distinct people with the same name |
4.3 Confidence tiers (post-compound)
- score ≥ 90 —
high— email AND phone match, or email + name + address. Block-create suggest "Use existing." Auto-link on public-form submit by default. - score 50–89 —
medium— single strong signal (email or phone alone), or email + same-name + different country (Etiennette case). Soft-warn but allow. - score < 50 —
low— weak signals only. Don't surface in UI; only relevant in background-job review queue.
4.4 Blocking strategy
For O(n) scan over a pool of N existing clients, build three lookup maps once per scan:
byEmail: Map<string, MatchCandidate[]>— keyed by normalized emailbyPhoneE164: Map<string, MatchCandidate[]>— keyed by E.164bySurnameToken: Map<string, MatchCandidate[]>— keyed bynormalizeName(...).surnameToken
For an incoming MatchCandidate, the candidate set to compare is the union of pool entries reachable through any of its emails/phones/surname-token. Typically 0–5 candidates per query, regardless of N.
4.5 Performance budget
For migration: 252 rows compared pairwise once. ~30k comparisons after blocking — a few seconds.
For runtime at-create: incoming candidate against existing pool of N clients per port. Expected pool size at maturity: 1k–10k. With blocking: <10 comparisons, <1ms target. No DB query needed beyond the initial pool fetch (which itself uses the indexed columns).
For background nightly job: full pairwise within port, blocked. 10k clients → ~50k pairwise checks per port → <30s. Fine for a nightly cron.
5. Configurable thresholds (admin settings)
New rows in system_settings per port. Default values err safe (more confirmation, less auto-action).
| Key | Default | Effect |
|---|---|---|
dedup_block_create_threshold |
90 |
Score above which the client-create form interrupts: "Use existing client?" |
dedup_soft_warn_threshold |
50 |
Score above which a soft-warn panel surfaces below the form |
dedup_review_queue_threshold |
40 |
Background job lands pairs ≥ this score in /admin/duplicates |
dedup_public_form_auto_link |
true |
When a public-form submission scores ≥ block-threshold against existing client, attach the new interest to that client without prompting. Safe: no merge, just attaching a deal. |
dedup_auto_merge_threshold |
null (disabled) |
If non-null, merges happen automatically at this threshold without human confirmation. Recommend leaving null until the team is comfortable; 95 is a reasonable cautious value. |
dedup_undo_window_days |
7 |
How long the loser's pre-state JSON is retained for merge-undo. After this, the snapshot is purged (GDPR) and merges are permanent. |
Each setting is a row in system_settings. UI surface in /[portSlug]/admin/dedup (a new admin page) with an "Advanced" toggle to expose the thresholds and brief explanations.
If the sales team complains the safer mode is too click-heavy, an admin flips dedup_auto_merge_threshold to 95 without any code change.
6. Merge service contract
6.1 Data flow
mergeClients(winnerId, loserId, fieldChoices, ctx) does, in a single transaction:
- Snapshot loser — full row + all attached
clientContacts,clientAddresses,clientNotes,clientTags, plus a count of dependent rows about to be moved (interests, yacht-memberships, etc.). Stored asmergeDetailsJSONB inclientMergeLog. - Reattach — every row pointing at
loserIdupdates to point atwinnerId:interests.clientIdclientContacts.clientId— with conflict handling: if winner already has the same email, keep winner's; flag the duplicate for the userclientAddresses.clientId— same conflict handlingclientNotes.clientId— preserveauthorId+createdAt(never overwrite)clientTags.clientIdclientYachtMembership.clientId(or whatever the table is called)auditLogs.entityId— annotate, don't move (audit truth)
- Apply fieldChoices — for each field where the user picked the loser's value, copy that into the winner row.
- Soft-archive loser —
loser.archivedAt = now(),loser.mergedIntoClientId = winnerId. Row stays in DB so the merge is reversible. - Write
clientMergeLog—{ winnerId, loserId, mergedBy, mergedAt, mergeDetails: <snapshot>, fieldChoices }. - Audit log — top-level
auditLogsrow:{ action: 'merge', entityType: 'client', entityId: winnerId, metadata: { loserId, score, reasons } }.
6.2 Schema additions (migration)
clients table gets a new column:
mergedIntoClientId: text('merged_into_client_id').references(() => clients.id),
The existing clientMergeLog table is reused. Add a partial index for the undo-window query:
CREATE INDEX idx_cml_recent ON client_merge_log (port_id, created_at DESC) WHERE created_at > NOW() - INTERVAL '7 days';
A daily maintenance job (using the existing maintenance-cleanup.test.ts infrastructure) purges mergeDetails JSONB older than dedup_undo_window_days setting.
6.3 Undo
unmergeClients(mergeLogId, ctx):
- Within the undo window, look up the snapshot
- Restore loser: clear
archivedAt,mergedIntoClientId - Restore loser's contacts/addresses/notes/tags from snapshot
- Detach reattached rows:
interestsetc. that were touchingwinnerIdand originally belonged to loser go back. The snapshot stores the original(rowType, rowId)list explicitly so this is deterministic. - Mark log row
undoneAt = now(),undoneBy = userId
After 7 days the snapshot is gone and unmerge returns 410 Gone.
6.4 Concurrency
Both merge and unmerge wrap in a single transaction with SELECT … FOR UPDATE on clients.id of both winner and loser. A second merge attempt against the same loser sees mergedIntoClientId already set and refuses (clear error: "Already merged into …").
7. Runtime surfaces
7.1 Layer 1 — At-create suggestion
In ClientForm (and the public register form once that hits the new system):
- Debounced 300ms after email or phone field changes
- Calls
findClientMatchesagainst current port's clients - Renders top-1 match if score ≥
dedup_soft_warn_threshold:┌─────────────────────────────────────┐ │ This looks like an existing client │ │ ML Marcus Laurent │ │ marcus@… +33 6 12 34 56 78 │ │ 2 interests · last 9d ago │ │ [ Use this client ] [ Create new ] │ └─────────────────────────────────────┘ - "Use this client" → form switches to "create new interest under existing client" mode (preserves whatever other fields the user typed)
- "Create new" → audit-log
dedup_overridewith the candidate's id and reasons (so we have data on false positives)
7.2 Layer 2 — Interest-level same-berth guard
Cheap one-liner in createInterest service:
- Check
(clientId, berthId)against existing non-archived interests - If hit, throw
BerthDuplicateErrorwith the existing interest details - UI catches and prompts: "Update existing or create separate?"
This is NOT the same as client-level dedup. Same client legitimately can pursue the same berth a second time after it falls through. But the prompt-before-create catches the accidental double-submit case.
7.3 Layer 3 — Background scoring + review queue
- A nightly cron (using existing BullMQ infrastructure — search for
scheduled-tasksin repo) runsfindClientMatchesover each port's full client pool - Pairs scoring ≥
dedup_review_queue_thresholdland in aclient_merge_candidatestable:export const clientMergeCandidates = pgTable('client_merge_candidates', { id: text('id').primaryKey()..., portId: text('port_id').notNull()..., clientAId: text('client_a_id').notNull()..., clientBId: text('client_b_id').notNull()..., score: integer('score').notNull(), reasons: jsonb('reasons').notNull(), status: text('status').notNull().default('pending'), // pending | dismissed | merged createdAt: timestamp('created_at')..., resolvedAt: timestamp('resolved_at'), resolvedBy: text('resolved_by'), }) /[portSlug]/admin/duplicateslists pending candidates sorted by score desc, with[Review →]opening a side-by-side merge dialog- Dismissing a candidate marks it
status=dismissedso the job doesn't re-surface the same pair tomorrow (a future score increase re-creates it).
8. NocoDB → new system field mapping
This is the explicit mapping the migration script applies. One NocoDB Interest row produces multiple new rows.
8.1 Top-level transform
NocoDB Interests row
─→ 0–1 client (deduped against existing pool)
─→ 0–1 client_address
─→ 0–2 client_contacts (email, phone)
─→ exactly 1 interest
─→ 0–1 yacht (when Yacht Name present and not "TBC"/"Na"/empty placeholders)
─→ 0–1 document (when documensoID present)
8.2 Field map
| NocoDB field | Target | Transform |
|---|---|---|
Full Name |
clients.fullName |
normalizeName().display |
Email Address |
clientContacts(channel='email', value=...) |
normalizeEmail() |
Phone Number |
clientContacts(channel='phone', valueE164=..., valueCountry=...) |
normalizePhone(raw, defaultCountry) |
Address |
clientAddresses.streetAddress (LongText preserved) |
trim |
Place of Residence |
clientAddresses.countryIso AND clients.nationalityIso |
resolveCountry() |
Contact Method Preferred |
clients.preferredContactMethod |
lowercase, mapped: Email→email, Phone→phone |
Source |
clients.source |
mapped: portal→website, Form→website, External→manual; null → manual |
Date Added |
interests.createdAt (fallback to NocoDB Created At then now) |
parse: try DD-MM-YYYY, then YYYY-MM-DD, then ISO |
Sales Process Level |
interests.pipelineStage |
see §8.3 |
Lead Category |
interests.leadCategory |
General→general_interest, Friends and Family→general_interest with tag |
Berth (FK) |
interests.berthId |
resolve via Berths table by Mooring Number |
Berth Size Desired |
interests.notes (appended) |
preserve |
Yacht Name, Length, Width, Depth |
yachts.name, lengthM, widthM, draughtM |
skip if name in {TBC, Na, ``, null}; ft→m via \* 0.3048 |
EOI Status |
interests.eoiStatus |
Awaiting Further Details→pending; Waiting for Signatures→sent; Signed→signed |
Deposit 10% Status |
interests.depositStatus |
Pending→pending; Received→received |
Contract Status |
interests.contractStatus |
Pending→pending; 40% Received→partial; Complete→complete |
EOI Time Sent |
interests.dateEoiSent |
parse |
clientSignTime / developerSignTime / all_signed_notified_at |
interests.dateEoiSigned (use latest) |
parse |
Time LOI Sent |
interests.dateContractSent |
parse |
Internal Notes + Extra Comments |
clientNotes (one row, system author) |
concatenate with section markers |
documensoID |
documents.documensoId (when present, type='eoi') |
preserve |
Signature Link Client/CC/Developer, EmbeddedSignature* |
documents.signers[] |
one row per non-null signer |
reminder_enabled, last_reminder_sent, etc. |
interests.reminderEnabled, interests.reminderLastFired |
parse, default true |
8.3 Sales-stage mapping (8 → 9)
| NocoDB | New (PIPELINE_STAGES) |
|---|---|
| General Qualified Interest | open |
| Specific Qualified Interest | details_sent |
| EOI and NDA Sent | eoi_sent |
| Signed EOI and NDA | eoi_signed |
| Made Reservation | deposit_10pct |
| Contract Negotiation | contract_sent |
| Contract Negotiations Finalized | contract_sent (with audit-note: legacy "negotiations finalized") |
| Contract Signed | contract_signed (or completed when deposit + contract both complete) |
8.4 Other tables
- Residential Interests (35 rows) — same shape as Interests but maps to
residentialClients+residentialInterests. Smaller and cleaner. Same dedup runs within this pool independently. - Website - Interest Submissions (64 rows) — these are inbound capture, not yet a client. Treat as if each row is a fresh public-form submission today: run dedup against the migrated client pool. Auto-link if
dedup_public_form_auto_linksetting allows. - Website - Contact Form Submissions (47 rows) — sparse data (just name + email + interest type). Skip migration; export as CSV for manual triage. Not the source of truth for any deal.
- Website - Berth EOI Details Supplements (1 row) — single record, preserved as a one-off attached to the matching Interest.
- Newsletter Sending (69 rows) — out of scope; that's a marketing surface, not CRM.
- Interests Backup, Interests copy — historical artifacts. Skipped by default. A
--include-backupsflag attaches them as audit-note entries on the corresponding live Interest if the user wants the history.
9. Migration script
Located at scripts/migrate-from-nocodb.ts. Idempotent: safe to re-run. Three main flags:
$ pnpm tsx scripts/migrate-from-nocodb.ts --dry-run [--port-slug X]
Pulls everything, transforms, runs dedup, writes CSV report to .migration/<timestamp>/. No DB writes.
$ pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<timestamp>/
Reads the report, performs the writes the dry-run promised. Refuses if the source data has changed since the report was generated (hash mismatch).
$ pnpm tsx scripts/migrate-from-nocodb.ts --rollback --apply-id <id>
Reads the apply log, undoes the writes (only valid within the undo window).
Reuses the client-portal/server/utils/nocodb.ts adapter for the NocoDB API client (no need to rebuild). Writes to the new system via Drizzle (re-using the existing services like createClient, createInterest, etc., so all the same validation runs).
9.1 Dry-run report format
.migration/<timestamp>/report.csv:
op,reason,nocodb_row_id,target_table,target_value,confidence,manual_review_required
create_client,new,624,clients.fullName,Deepak Ramchandani,N/A,false
create_contact,new,624,clientContacts.email,dannyrams8888@gmail.com,N/A,false
create_contact,new,624,clientContacts.phone,+17215868888,N/A,false
create_interest,new,624,interests.berthId,a1b2c3...,N/A,false
auto_link,score=98 (email+phone),625,clients.id,<existing client UUID from row 624>,high,false
flag_for_review,score=72 (same name diff country),188,client.id,<existing client UUID from row 717>,medium,true
country_unresolved,fallback to AI (port country),198,clientAddresses.countryIso,AI,low,true
phone_unparseable,placeholder all-zeros,641,clientContacts.phone,<skipped>,N/A,true
Plus .migration/<timestamp>/summary.md:
# Migration Dry-Run — 2026-05-03 14:23 UTC
NocoDB: 252 Interests + 35 Residences + 64 Website Submissions
Outcome: 198 clients, 287 interests (incl. residences), 91 yachts, 412 contacts
Auto-linked (high confidence, no human action needed):
- Nicolas Ruiz: rows 681,682,683 → 1 client + 3 interests
- John Lynch: rows 716,725 → 1 client + 2 interests
- Deepak Ramchandani: rows 624,625 → 1 client + 2 interests
- [12 more]
Flagged for manual review (medium confidence):
- Etiennette Clamouze (rows 188,717): same name, different country phone + email
- Bruno Joyerot #18 + Bruce Hearn #19: shared household contact
- [4 more]
Country resolution failed for 7 rows. All defaulted to port country (AI). Review:
- Row 239: "Sag Harbor Y" → AI (likely US)
- [6 more]
Phone parsing failed for 3 rows. All flagged, no contact created:
- Row 178: empty
- Row 641: placeholder "+447000000000"
- Row 175: empty
Run `--apply` to commit these changes.
9.2 Apply phase
--apply reads the report, re-fetches the source rows (via NocoDB MCP / API), recomputes the hash, fails fast if NocoDB changed since dry-run. Then performs the writes within a single PostgreSQL transaction per port (commit at end). On any error mid-transaction, full rollback.
After successful apply, an apply_id is generated and an audit-log row written. The apply_id is the handle used for --rollback.
9.3 Idempotency
The script tracks NocoDB row IDs in a migration_source_links table:
export const migrationSourceLinks = pgTable('migration_source_links', {
id: text('id').primaryKey()...,
sourceSystem: text('source_system').notNull(), // 'nocodb_interests' | 'nocodb_residences' | …
sourceId: text('source_id').notNull(), // NocoDB row id as string
targetEntityType: text('target_entity_type').notNull(), // client | interest | yacht | …
targetEntityId: text('target_entity_id').notNull(),
appliedAt: timestamp('applied_at')...,
appliedBy: text('applied_by'),
}, (table) => [
uniqueIndex('idx_msl_source').on(table.sourceSystem, table.sourceId, table.targetEntityType),
]);
Re-running --apply against the same report skips rows already in this table. Useful for partial-failure resumption.
10. Test plan
10.1 Library-level (vitest unit)
tests/unit/dedup/normalize.test.ts— every dirty-data pattern from §1.3 has a fixture asserting the expected normalized output.tests/unit/dedup/find-matches.test.ts— every duplicate cluster from §1.2 has a fixture asserting score + confidence tier. Hard cases (Pattern F) assert "medium" not "high" — false-positive guard.
10.2 Service-level (vitest integration)
tests/integration/dedup/client-merge.test.ts— merge service exercised: full reattach, clientMergeLog written, undo within window restores, undo after window returns 410, concurrent merge of same loser fails the second.tests/integration/dedup/at-create-suggestion.test.ts—findClientMatchesagainst a seeded pool returns expected matches + reasons.
10.3 Migration script (vitest integration with NocoDB mock)
tests/integration/dedup/migration-dry-run.test.ts— feed the script a fixture NocoDB dump (the 252 rows, frozen as a JSON snapshot in fixtures), assert the resulting CSV matches a golden file. Catch any future regression in the transform pipeline.tests/integration/dedup/migration-apply.test.ts— apply the dry-run output to a clean test DB, assert all expected rows exist, assert idempotency (re-apply is a no-op).
10.4 E2E (Playwright)
tests/e2e/smoke/30-dedup-create.spec.ts— type into ClientForm with an email matching seeded client; assert suggestion card appears; click "Use this client"; assert form switches to interest-create mode.tests/e2e/smoke/31-admin-duplicates.spec.ts— admin views review queue, opens a candidate, side-by-side merge UI works, merge succeeds, undo within window works.
11. Rollback plan
Three layers of safety, ordered by reversibility:
- Per-merge undo — admin clicks Undo on a wrongly-merged pair, system rolls back from
clientMergeLogsnapshot. 7-day window. No engineering needed. - Migration
--rollbackflag — entire migration apply is reversed via theapply_idandmigration_source_linkstable. Useful in the first 24h after--apply. Engineering-supervised. - DB restore from backup — the existing
docs/ops/backup-runbook.mdcovers this. Last resort if both above are blocked.
Pre-migration, take a hot backup of the new DB (pg_dump). Pre-merge in production (before any human-facing surface ships), the dedup_auto_merge_threshold defaults to null so no automatic merges happen — every merge is human-confirmed.
12. Open items
- Soundex vs metaphone — Soundex is simpler but English-leaning. Metaphone handles non-English surnames better (the dataset has French, German, Italian, Slavic names). Default to metaphone via the
naturalpackage; revisit if it adds significant install size. - Cross-port dedup — not in scope. Each port's clients are deduped within that port. A future "shared address book" feature would need its own design.
- Profile photo / face match — out of scope.
- AI-assisted match resolution — out of scope. The Layer-3 review queue is human-only.
Implementation sequence
P1 (this design's library) → P2 (runtime surfaces) → P3 (migration). Each is a separate plan / PR.
P1 deliverables: src/lib/dedup/{normalize,find-matches}.ts + tests. No UI changes. No DB changes (except indexed lookups added to existing clientContacts). ~1.5 days.
P2 deliverables: at-create suggestion in ClientForm + interest-level guard in createInterest service + admin settings UI for thresholds + clientMergeCandidates table + nightly job + admin review queue page + merge service + side-by-side merge UI. ~5–7 days.
P3 deliverables: scripts/migrate-from-nocodb.ts + migration_source_links table + dry-run + apply + rollback. CSV report format frozen against fixture. ~3 days, including fixture creation from the live NocoDB snapshot.
Total: ~10–12 engineering days from approval. Can be split across three PRs landing independently — each is testable in isolation and the runtime surfaces (P2) work even without P3 being run.