pn-new-crm/docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md
Matt Ciaccio 36b92eb827 docs(spec): client deduplication and NocoDB migration design
Captures the audit findings from a 2026-05-03 read-only NocoDB review
plus the algorithm and migration plan for porting the legacy data
into the new client / interest / contacts / addresses model.

Highlights:
- 252 NocoDB Interests rows ≈ ~190–200 unique humans (~20–25% dup
  rate). Six duplicate patterns documented from real data, including
  "same person, multiple yachts" — exactly the case the new
  client/interest split is designed to handle.
- Reuses the battle-tested `client-portal/server/utils/duplicate-
  detection.ts` algorithm (blocking + weighted rules) with additions:
  metaphone for non-English surnames, compounded confidence when
  multiple rules match, negative evidence for split-signal cases.
- Three runtime surfaces (at-create suggestion, interest-level
  same-berth guard, background scoring + admin review queue) plus a
  one-shot migration script with --dry-run / --apply / --rollback.
- Configurable thresholds via per-port system_settings so the merge
  policy can be tuned (defaults to "always confirm" — never
  auto-merges out of the box).
- Reversible: every merge writes a clientMergeLog row with the
  loser's full pre-state JSON, enabling 7-day undo without engineering.

Implementation decomposes into three plans (P1 library / P2 runtime /
P3 migration) sequenced after the mobile branch lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:10:08 +02:00


Client Deduplication and NocoDB Migration Design

Status: Design draft 2026-05-03 — pending approval.

Plan decomposition: three implementation plans stack from this design — (P1) normalization + dedup core library; (P2) admin settings + at-create + interest-level guards (runtime); (P3) NocoDB migration script + review queue UI. P1 unblocks P2 and P3.

Branch base: stacks on feat/mobile-foundation once it merges to main.

Out of scope: live merge of two clients across ports (cross-tenant), automated AI-judged matches, profile-photo / face-match dedup, web-of-trust referrer relationships.


1. Background

1.1 Why this exists

The legacy CRM lives in a NocoDB base whose Interests table conflates the human with the deal. A row contains Full Name, Email Address, Phone Number, Address, Place of Residence and the sales-pipeline state for one specific berth. A single human pursuing two berths becomes two rows with semi-duplicated personal data. A 2026-05-03 read-only audit confirmed:

  • 252 Interests rows in NocoDB, against an estimated ~190–200 unique humans (~20–25% duplication rate).
  • 35 Residential Interests rows in a parallel residential pathway with the same conflation.
  • 64 Website Interest Submissions + 47 Website Contact Form Submissions + 1 EOI Supplemental Form as inbound capture surfaces.
  • No Clients table. The conflated structure is structural, not accidental.

The new CRM (src/lib/db/schema/clients.ts) splits this into clients (people) ↔ interests (deals), with clientContacts (multi-channel), clientAddresses (multi-address), and a pre-existing clientMergeLog table that anticipates merge with undo. The design has been ready; what's missing is (a) a normalization + matching library, (b) the at-create / at-import surfaces that use it, and (c) the migration of the existing 252+35 records.

1.2 Real duplicate patterns observed in the live data

Sampled 200 of the 252 NocoDB Interests rows. Confirmed duplicate clusters fall into six patterns:

| Pattern | Example rows | Signature |
|---|---|---|
| A. Pure double-submit | Deepak Ramchandani #624/#625; John Lynch #716/#725 | All fields identical; created same day |
| B. Phone format variance | Howard Wiarda #236/#536 (574-274-0548 vs +15742740548); Christophe Zasso #701/#702 (0651381036 vs 0033651381036) | Same email, normalize-equal phone |
| C. Name capitalization | Nicolas Ruiz #681/#682/#683; Jean-Charles Miege/MIEGE #37/#163; John Farmer/FARMER #35/#161 | Same email or empty; surname case differs |
| D. Name shortening | Chris vs Christopher Allen #700/#534; Emma c vs Emma Cauchefer #661/#673 | Same email + phone; given name truncated |
| E. Resubmit with typo | Christopher Camazou #649/#650 (typo in last 4 phone digits); Gianfranco Di Constanzo/Costanzo #585/#336 (surname typo, different yacht — should be ONE client + TWO interests) | High score on everything else; one field carries small-edit-distance noise |
| F. Hard cases | Etiennette Clamouze #188/#717 (same name, different-country phone + email); Bruno Joyerot #18 with email belonging to Bruce Hearn #19 (couple sharing contact) | Cannot resolve without a human |

This dataset will be the fixture for the dedup library's tests — every pattern above must be either auto-detected or flagged for review, and the false-positive bar must be high enough that Pattern F doesn't get force-merged.

1.3 Dirty data inventory

The migration normalizer must survive these real values from production:

Phone fields: +1-264-235-8840\r (with carriage return), '+1.214.603.4235 (apostrophe + dots), 0677580750/0690511494 (two numbers in one field), 00447956657022 (00 prefix), +447000000000 (placeholder all-zeros), +4901637039672 (invalid — the trunk 0 was kept after the +49 country prefix), various unprefixed local formats, dashed US numbers without country code.

Email fields: mixed case rampant (Arthur@laser-align.com vs arthur@laser-align.com); ALL-CAPS local parts; trailing whitespace.

Name fields: ALL-CAPS surnames mixed with title-case given names; embedded \n and \r; double spaces; lowercase-only entries; slash-with-company variants (Daniel Wainstein / 7 Knots, LLC, Bruno Joyerot / SAS TIKI); placeholder Mr DADER, TBC.

Place of Residence (free text): Saint barthelemy, St Barth, Saint-Barthélemy (same place, three forms); anguilla, United States , USA, Kansas City (city without country), Sag Harbor Y (typo).

1.4 Existing battle-tested algorithm

client-portal/server/utils/duplicate-detection.ts already implements blocking + weighted-rules dedup against this same NocoDB. It runs in production today. We port it forward (don't reinvent), then add: soundex/metaphone for surname matching, compounded-confidence when multiple rules match, and negative evidence (same email + different country phone reduces confidence).

1.5 Why the website is no longer the source of new dirty data

The website forms (website/components/pn/specific/website/{berths-item,register,form}/form.vue) use <v-phone-input> with a country picker (prefer-countries: ['US', 'GB', 'DE', 'FR']) and [(value) => !!value || 'Phone number is required'] validation. Output is E.164-shaped. The 252 dirty rows are legacy — pre-form-redesign submissions, sales-rep manual entries, and external CSV imports. Future inbound is clean.


2. Approach

Three artifacts, layered:

  1. A pure-logic normalization + matching library at src/lib/dedup/. JSX-free, vitest-native (proven pattern: realtime-invalidation-core.ts). Tested against the dirty-data fixture corpus drawn from §1.2.
  2. Three runtime surfaces that use the library: at-create suggestion in client/interest forms; interest-level same-berth guard; admin review queue powered by a nightly background scoring job.
  3. A one-shot migration script that pulls NocoDB → normalizes → dedupes → writes new schema → produces a CSV report with auto-merge log + flagged-for-review pile.

Configurability via admin settings (system_settings per port) so the team can tune sensitivity without code changes. Defaults err on the safe side — a flagged review is cheaper than a wrongly-merged record.

Reversibility: every merge writes a client_merge_log row containing the loser's full pre-state JSON. A 7-day undo window lets a wrong merge be reversed without engineering involvement. After 7 days the snapshot is purged for GDPR; merges become permanent.


3. Normalization library

Lives at src/lib/dedup/normalize.ts. Pure functions, no DB, vitest-tested. Used by the dedup algorithm AND by all create-paths so what gets stored is already normalized.

3.1 normalizeName(raw: string)

export function normalizeName(raw: string): {
  display: string; // human-readable, kept for UI
  normalized: string; // for matching
  surnameToken?: string; // for surname-based blocking
};
  • Trim leading/trailing whitespace
  • Replace \r, \n, tabs with single space
  • Collapse consecutive whitespace to single space
  • Smart title-case: keep particles (van, de, del, O', di, le, da) lowercase except as first token
  • display preserves user's intent (slash-with-company stays intact)
  • normalized is display.toLowerCase() for comparison
  • surnameToken is the last non-particle token for blocking
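
A minimal sketch of how those steps could compose (a hypothetical implementation: the particle list is a subset of the one above, and hyphenated given names are not special-cased):

```typescript
// Sketch of normalizeName per §3.1. Illustrative only; the real library may
// tune the particle list and casing rules.
const PARTICLES = new Set(['van', 'de', 'del', 'di', 'le', 'da']);

export function normalizeName(raw: string): {
  display: string;
  normalized: string;
  surnameToken?: string;
} {
  // Replace control whitespace with spaces, collapse runs, trim.
  const cleaned = raw.replace(/[\r\n\t]+/g, ' ').replace(/ {2,}/g, ' ').trim();

  // Smart title-case: particles stay lowercase except as the first token.
  const tokens = cleaned.split(' ').map((tok, i) => {
    const lower = tok.toLowerCase();
    if (i > 0 && PARTICLES.has(lower)) return lower;
    return lower.charAt(0).toUpperCase() + lower.slice(1);
  });

  const display = tokens.join(' ');
  // Surname token for blocking: last non-particle token.
  const surnameToken = [...tokens]
    .reverse()
    .find((t) => !PARTICLES.has(t.toLowerCase()))
    ?.toLowerCase();

  return { display, normalized: display.toLowerCase(), surnameToken };
}
```

For example, `normalizeName('john  FARMER\n')` yields display `John Farmer` with surname token `farmer`.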

3.2 normalizeEmail(raw: string)

export function normalizeEmail(raw: string): string | null;
  • Trim + lowercase
  • Validate via zod.email() schema
  • Returns null for empty / invalid (caller decides what to do)
  • Does NOT strip plus-aliases (user+tag@domain.com) — a plus-alias can be a genuinely distinct address, and collapsing aliases invites false merges. Compare by the full local part.
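
A sketch of that contract. The design specifies zod's email schema; a simple structural regex stands in for it here, so treat the validation itself as illustrative:

```typescript
// Sketch of normalizeEmail per §3.2. The regex is a stand-in for the zod
// email schema named in the design.
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

export function normalizeEmail(raw: string): string | null {
  const trimmed = raw.trim().toLowerCase();
  if (!EMAIL_RE.test(trimmed)) return null;
  // Plus-aliases are deliberately NOT stripped — compare full local parts.
  return trimmed;
}
```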

3.3 normalizePhone(raw: string, defaultCountry: string)

export function normalizePhone(
  raw: string,
  defaultCountry: string,
): {
  e164: string | null; // canonical, e.g. '+15742740548'
  country: string | null; // ISO-3166-1 alpha-2
  display: string | null; // user-facing pretty
  flagged?: 'multi_number' | 'placeholder' | 'unparseable';
} | null;

Pipeline:

  1. Strip \r, \n, tabs, single quotes, dots, dashes, parens, spaces
  2. If contains / or ; or , → flag multi_number, take first segment
  3. If all digits after the country code are zeros, e.g. matches ^\+\d{2,3}0{6,}$ (+447000000000) → flag placeholder, return null
  4. If starts with 00 → replace with +
  5. If starts with + → parse as E.164
  6. Else if defaultCountry provided → parse against that country
  7. Else return null (caller's problem)

Backed by libphonenumber-js (already in deps via its use in tests/integration/factories.ts; if not, we add it). The hostile cases above all need explicit handling — a naïve regex won't survive them.
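
The pre-parse cleanup (steps 1–4) can be sketched as pure string hygiene, leaving the actual E.164 parsing (steps 5–7) to libphonenumber-js. Function name and return shape here are illustrative, not the library's final API:

```typescript
// Sketch of §3.3 steps 1–4: string hygiene before libphonenumber-js parsing.
export function precleanPhone(raw: string): {
  cleaned: string | null;
  flagged?: 'multi_number' | 'placeholder';
} {
  // 1. Strip control chars, quotes, dots, dashes, parens, spaces.
  let s = raw.replace(/[\r\n\t'.\-() ]/g, '');
  let flagged: 'multi_number' | undefined;

  // 2. Two numbers in one field: keep the first segment, flag it.
  if (/[/;,]/.test(s)) {
    flagged = 'multi_number';
    s = s.split(/[/;,]/)[0];
  }

  // 3. Placeholder all-zeros after the country code, e.g. +447000000000.
  if (/^\+\d{2,3}0{6,}$/.test(s)) {
    return { cleaned: null, flagged: 'placeholder' };
  }

  // 4. International 00 prefix becomes +.
  if (s.startsWith('00')) s = '+' + s.slice(2);

  return { cleaned: s, flagged };
}
```

This survives the §1.3 inventory: `'+1.214.603.4235` cleans to `+12146034235`, the double-number field is flagged and truncated, and the all-zeros placeholder returns null.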

3.4 resolveCountry(text: string)

export function resolveCountry(text: string): {
  iso: string | null; // ISO-3166-1 alpha-2
  confidence: 'exact' | 'fuzzy' | 'city' | null;
};

Reuses src/lib/i18n/countries.ts. Pipeline:

  1. Lowercase + strip diacritics
  2. Exact match against country names (any locale we ship)
  3. City fallback — small in-package mapping for high-frequency cities seen in legacy data (Sag Harbor → US, Kansas City → US, St Barth → BL, etc.)
  4. Fuzzy match (Levenshtein ≤ 2 against canonical English names)

Resolution order is exact → city → fuzzy, so a known city wins over a fuzzy country-name guess.

The mapping is opinionated and small (~30 entries covering the actual values seen in the 252-row dataset). Anything that fails to resolve returns null and lands in the migration's flagged pile.
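
A sketch under those rules, with stub country and city tables standing in for src/lib/i18n/countries.ts and the real ~30-entry city map:

```typescript
// Sketch of resolveCountry per §3.4. The tables below are illustrative stubs.
const COUNTRY_NAMES: Record<string, string> = {
  'united states': 'US', usa: 'US', anguilla: 'AI', 'saint barthelemy': 'BL',
};
const CITY_MAP: Record<string, string> = {
  'sag harbor': 'US', 'kansas city': 'US', 'st barth': 'BL',
};

export function resolveCountry(text: string): {
  iso: string | null;
  confidence: 'exact' | 'fuzzy' | 'city' | null;
} {
  // Lowercase, strip combining diacritics, normalize separators.
  const key = text
    .normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    .toLowerCase().replace(/[-_]+/g, ' ').trim();

  if (COUNTRY_NAMES[key]) return { iso: COUNTRY_NAMES[key], confidence: 'exact' };
  if (CITY_MAP[key]) return { iso: CITY_MAP[key], confidence: 'city' };

  // Fuzzy: Levenshtein ≤ 2 against canonical names.
  for (const [name, iso] of Object.entries(COUNTRY_NAMES)) {
    if (levenshtein(key, name) <= 2) return { iso, confidence: 'fuzzy' };
  }
  return { iso: null, confidence: null };
}

// Classic O(len²) dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return dp[a.length][b.length];
}
```

With these stubs, all three spellings of Saint-Barthélemy from §1.3 resolve to BL, while "Sag Harbor Y" correctly falls through to null.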


4. Dedup algorithm

Lives at src/lib/dedup/find-matches.ts. Pure function. Vitest-tested against the §1.2 cluster fixtures.

4.1 Public API

export interface MatchCandidate {
  id: string;
  fullName: string | null;
  emails: string[]; // already normalized
  phonesE164: string[]; // already normalized E.164
  countryIso: string | null;
}

export interface MatchResult {
  candidate: MatchCandidate;
  score: number; // 0–100
  reasons: string[]; // human-readable, e.g. ["email match", "phone match"]
  confidence: 'high' | 'medium' | 'low';
}

export function findClientMatches(
  input: MatchCandidate,
  pool: MatchCandidate[],
  thresholds: DedupThresholds,
): MatchResult[];
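
DedupThresholds is referenced above but not defined in this section. A plausible shape, mirroring the §5 settings keys and defaults (an assumption; the final interface belongs to the library):

```typescript
// Hypothetical DedupThresholds shape, mirroring the §5 system_settings keys.
export interface DedupThresholds {
  blockCreate: number;      // dedup_block_create_threshold
  softWarn: number;         // dedup_soft_warn_threshold
  reviewQueue: number;      // dedup_review_queue_threshold
  autoMerge: number | null; // dedup_auto_merge_threshold; null = disabled
}

// §5 defaults: err safe, never auto-merge out of the box.
export const DEFAULT_THRESHOLDS: DedupThresholds = {
  blockCreate: 90,
  softWarn: 50,
  reviewQueue: 40,
  autoMerge: null,
};
```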

4.2 Scoring rules (compound)

Each rule produces a score addition. Compounding: when two strong rules match (e.g., email AND phone), the result is ~95+ rather than max(50, 50). Negative evidence subtracts.

| Rule | Score | Notes |
|---|---|---|
| Exact email match (case-insensitive, normalized) | +60 | One match suffices |
| Exact phone E.164 match (≥ 8 significant digits) | +50 | Excludes placeholder all-zeros |
| Exact normalized full-name match | +20 | Many "John Smith"s exist |
| Surname soundex match + given-name fuzzy match (Lev ≤ 1) | +15 | Catches Constanzo/Costanzo, Christophe/Christopher |
| Same address (normalized fuzzy ≥ 0.8) | +10 | Bonus signal |
| Negative: same email but different country code on phone | −15 | Suggests spouse / coworker / shared inbox |
| Negative: same name but DIFFERENT email AND DIFFERENT phone | −20 | Two distinct people with the same name |
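
One way to get the compounding behavior is plain addition of rule deltas clamped to [0, 100], which makes email + phone land at 100 rather than max(60, 50). A sketch, assuming rule detection has already produced the hit list:

```typescript
// Sketch of the §4.2 compounding step: sum rule deltas, clamp to [0, 100].
type RuleHit = { rule: string; delta: number };

export function compoundScore(hits: RuleHit[]): {
  score: number;
  reasons: string[];
} {
  const raw = hits.reduce((sum, h) => sum + h.delta, 0);
  return {
    score: Math.max(0, Math.min(100, raw)),
    reasons: hits.map((h) => h.rule),
  };
}
```

With the table's weights, email (+60) plus the different-country-phone penalty (−15) compounds to 45, just under the default soft-warn threshold, which is exactly the behavior the Etiennette case needs.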

4.3 Confidence tiers (post-compound)

  • score ≥ 90 — high — email AND phone match, or email + name + address. Block-create suggest "Use existing." Auto-link on public-form submit by default.
  • score 50–89 — medium — single strong signal (email or phone alone), or email + same-name + different country (Etiennette case). Soft-warn but allow.
  • score < 50 — low — weak signals only. Don't surface in UI; only relevant in background-job review queue.
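
Mapped to code, the tiers are a two-threshold cut over the §5 settings (defaults shown; a sketch, not the library's final API):

```typescript
// Sketch of §4.3 tiering, parameterized by the §5 thresholds.
export function confidenceTier(
  score: number,
  blockCreate = 90,
  softWarn = 50,
): 'high' | 'medium' | 'low' {
  if (score >= blockCreate) return 'high';
  if (score >= softWarn) return 'medium';
  return 'low';
}
```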

4.4 Blocking strategy

To avoid scanning all N existing clients for every query, build three lookup maps once per scan:

  • byEmail: Map<string, MatchCandidate[]> — keyed by normalized email
  • byPhoneE164: Map<string, MatchCandidate[]> — keyed by E.164
  • bySurnameToken: Map<string, MatchCandidate[]> — keyed by normalizeName(...).surnameToken

For an incoming MatchCandidate, the candidate set to compare is the union of pool entries reachable through any of its emails/phones/surname-token. Typically 0–5 candidates per query, regardless of N.
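
A sketch of the two halves, map-building and union lookup, using a trimmed-down candidate shape (the surname token is precomputed here for brevity):

```typescript
// Sketch of the §4.4 blocking maps and candidate-set union.
interface Candidate {
  id: string;
  emails: string[];
  phonesE164: string[];
  surnameToken?: string;
}

export function buildBlocks(pool: Candidate[]) {
  const byEmail = new Map<string, Candidate[]>();
  const byPhone = new Map<string, Candidate[]>();
  const bySurname = new Map<string, Candidate[]>();
  const push = (m: Map<string, Candidate[]>, k: string, c: Candidate) =>
    m.set(k, [...(m.get(k) ?? []), c]);

  for (const c of pool) {
    c.emails.forEach((e) => push(byEmail, e, c));
    c.phonesE164.forEach((p) => push(byPhone, p, c));
    if (c.surnameToken) push(bySurname, c.surnameToken, c);
  }
  return { byEmail, byPhone, bySurname };
}

// Candidate set for an incoming record: union across all three maps.
export function candidateSet(
  input: Candidate,
  blocks: ReturnType<typeof buildBlocks>,
): Candidate[] {
  const seen = new Map<string, Candidate>();
  const collect = (cs: Candidate[] = []) => cs.forEach((c) => seen.set(c.id, c));
  input.emails.forEach((e) => collect(blocks.byEmail.get(e)));
  input.phonesE164.forEach((p) => collect(blocks.byPhone.get(p)));
  if (input.surnameToken) collect(blocks.bySurname.get(input.surnameToken));
  seen.delete(input.id); // never match a record against itself
  return [...seen.values()];
}
```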

4.5 Performance budget

For migration: 252 rows compared pairwise once — ~32k naive comparisons, far fewer after blocking. A few seconds either way.

For runtime at-create: incoming candidate against existing pool of N clients per port. Expected pool size at maturity: 1k–10k. With blocking: <10 comparisons, <1ms target. No DB query needed beyond the initial pool fetch (which itself uses the indexed columns).

For background nightly job: full pairwise within port, blocked. 10k clients → ~50k pairwise checks per port → <30s. Fine for a nightly cron.


5. Configurable thresholds (admin settings)

New rows in system_settings per port. Default values err safe (more confirmation, less auto-action).

| Key | Default | Effect |
|---|---|---|
| dedup_block_create_threshold | 90 | Score above which the client-create form interrupts: "Use existing client?" |
| dedup_soft_warn_threshold | 50 | Score above which a soft-warn panel surfaces below the form |
| dedup_review_queue_threshold | 40 | Background job lands pairs ≥ this score in /admin/duplicates |
| dedup_public_form_auto_link | true | When a public-form submission scores ≥ block-threshold against an existing client, attach the new interest to that client without prompting. Safe: no merge, just attaching a deal. |
| dedup_auto_merge_threshold | null (disabled) | If non-null, merges at or above this score happen automatically without human confirmation. Recommend leaving null until the team is comfortable; 95 is a reasonable cautious value. |
| dedup_undo_window_days | 7 | How long the loser's pre-state JSON is retained for merge-undo. After this, the snapshot is purged (GDPR) and merges are permanent. |

Each setting is a row in system_settings. UI surface in /[portSlug]/admin/dedup (a new admin page) with an "Advanced" toggle to expose the thresholds and brief explanations.

If the sales team complains the safer mode is too click-heavy, an admin flips dedup_auto_merge_threshold to 95 without any code change.


6. Merge service contract

6.1 Data flow

mergeClients(winnerId, loserId, fieldChoices, ctx) does, in a single transaction:

  1. Snapshot loser — full row + all attached clientContacts, clientAddresses, clientNotes, clientTags, plus a count of dependent rows about to be moved (interests, yacht-memberships, etc.). Stored as mergeDetails JSONB in clientMergeLog.
  2. Reattach — every row pointing at loserId updates to point at winnerId:
    • interests.clientId
    • clientContacts.clientId — with conflict handling: if winner already has the same email, keep winner's; flag the duplicate for the user
    • clientAddresses.clientId — same conflict handling
    • clientNotes.clientId — preserve authorId + createdAt (never overwrite)
    • clientTags.clientId
    • clientYachtMembership.clientId (or whatever the table is called)
    • auditLogs.entityId — annotate, don't move (audit truth)
  3. Apply fieldChoices — for each field where the user picked the loser's value, copy that into the winner row.
  4. Soft-archive loser — loser.archivedAt = now(), loser.mergedIntoClientId = winnerId. Row stays in DB so the merge is reversible.
  5. Write clientMergeLog{ winnerId, loserId, mergedBy, mergedAt, mergeDetails: <snapshot>, fieldChoices }.
  6. Audit log — top-level auditLogs row: { action: 'merge', entityType: 'client', entityId: winnerId, metadata: { loserId, score, reasons } }.

6.2 Schema additions (migration)

clients table gets a new column:

mergedIntoClientId: text('merged_into_client_id').references(() => clients.id),

The existing clientMergeLog table is reused. Add an index for the undo-window query. (Postgres rejects NOW() in a partial-index predicate — predicate functions must be IMMUTABLE — so the 7-day filter stays in the query itself and this plain index serves it.)

CREATE INDEX idx_cml_recent ON client_merge_log (port_id, created_at DESC);

A daily maintenance job (using the existing maintenance-cleanup.test.ts infrastructure) purges mergeDetails JSONB older than dedup_undo_window_days setting.

6.3 Undo

unmergeClients(mergeLogId, ctx):

  1. Within the undo window, look up the snapshot
  2. Restore loser: clear archivedAt, mergedIntoClientId
  3. Restore loser's contacts/addresses/notes/tags from snapshot
  4. Detach reattached rows: interests etc. that were touching winnerId and originally belonged to loser go back. The snapshot stores the original (rowType, rowId) list explicitly so this is deterministic.
  5. Mark log row undoneAt = now(), undoneBy = userId

After 7 days the snapshot is gone and unmerge returns 410 Gone.

6.4 Concurrency

Both merge and unmerge wrap in a single transaction with SELECT … FOR UPDATE on clients.id of both winner and loser. A second merge attempt against the same loser sees mergedIntoClientId already set and refuses (clear error: "Already merged into …").


7. Runtime surfaces

7.1 Layer 1 — At-create suggestion

In ClientForm (and the public register form once that hits the new system):

  • Debounced 300ms after email or phone field changes
  • Calls findClientMatches against current port's clients
  • Renders top-1 match if score ≥ dedup_soft_warn_threshold:
    ┌─────────────────────────────────────┐
    │ This looks like an existing client   │
    │  ML  Marcus Laurent                  │
    │      marcus@…   +33 6 12 34 56 78    │
    │      2 interests · last 9d ago        │
    │  [ Use this client ]   [ Create new ] │
    └─────────────────────────────────────┘
    
  • "Use this client" → form switches to "create new interest under existing client" mode (preserves whatever other fields the user typed)
  • "Create new" → audit-log dedup_override with the candidate's id and reasons (so we have data on false positives)

7.2 Layer 2 — Interest-level same-berth guard

Cheap one-liner in createInterest service:

  • Check (clientId, berthId) against existing non-archived interests
  • If hit, throw BerthDuplicateError with the existing interest details
  • UI catches and prompts: "Update existing or create separate?"

This is NOT the same as client-level dedup. The same client can legitimately pursue the same berth a second time after a deal falls through. But the prompt-before-create catches the accidental double-submit case.
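
The guard reduces to a one-line membership check. Sketched here as a pure function over already-fetched interests (the real service would do this as an indexed query on (clientId, berthId)):

```typescript
// Sketch of the §7.2 same-berth guard as a pure check.
interface ExistingInterest {
  id: string;
  clientId: string;
  berthId: string;
  archivedAt: Date | null;
}

export function findSameBerthInterest(
  clientId: string,
  berthId: string,
  existing: ExistingInterest[],
): ExistingInterest | undefined {
  // Only non-archived interests block; archived deals may legitimately recur.
  return existing.find(
    (i) => i.clientId === clientId && i.berthId === berthId && i.archivedAt === null,
  );
}
```

The service throws BerthDuplicateError when this returns a hit, carrying the existing interest's details for the UI prompt.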

7.3 Layer 3 — Background scoring + review queue

  • A nightly cron (using existing BullMQ infrastructure — search for scheduled-tasks in repo) runs findClientMatches over each port's full client pool
  • Pairs scoring ≥ dedup_review_queue_threshold land in a client_merge_candidates table:
    export const clientMergeCandidates = pgTable('client_merge_candidates', {
      id: text('id').primaryKey()...,
      portId: text('port_id').notNull()...,
      clientAId: text('client_a_id').notNull()...,
      clientBId: text('client_b_id').notNull()...,
      score: integer('score').notNull(),
      reasons: jsonb('reasons').notNull(),
      status: text('status').notNull().default('pending'), // pending | dismissed | merged
      createdAt: timestamp('created_at')...,
      resolvedAt: timestamp('resolved_at'),
      resolvedBy: text('resolved_by'),
    })
    
  • /[portSlug]/admin/duplicates lists pending candidates sorted by score desc, with [Review →] opening a side-by-side merge dialog
  • Dismissing a candidate marks it status=dismissed so the job doesn't re-surface the same pair tomorrow (a future score increase re-creates it).

8. NocoDB → new system field mapping

This is the explicit mapping the migration script applies. One NocoDB Interest row produces multiple new rows.

8.1 Top-level transform

NocoDB Interests row
  ─→ 0–1 client (deduped against existing pool)
  ─→ 0–1 client_address
  ─→ 0–2 client_contacts (email, phone)
  ─→ exactly 1 interest
  ─→ 0–1 yacht (when Yacht Name present and not "TBC"/"Na"/empty placeholders)
  ─→ 0–1 document (when documensoID present)

8.2 Field map

| NocoDB field | Target | Transform |
|---|---|---|
| Full Name | clients.fullName | normalizeName().display |
| Email Address | clientContacts(channel='email', value=...) | normalizeEmail() |
| Phone Number | clientContacts(channel='phone', valueE164=..., valueCountry=...) | normalizePhone(raw, defaultCountry) |
| Address | clientAddresses.streetAddress (LongText preserved) | trim |
| Place of Residence | clientAddresses.countryIso AND clients.nationalityIso | resolveCountry() |
| Contact Method Preferred | clients.preferredContactMethod | lowercase, mapped: Email→email, Phone→phone |
| Source | clients.source | mapped: portal→website, Form→website, External→manual; null→manual |
| Date Added | interests.createdAt (fallback to NocoDB Created At, then now) | parse: try DD-MM-YYYY, then YYYY-MM-DD, then ISO |
| Sales Process Level | interests.pipelineStage | see §8.3 |
| Lead Category | interests.leadCategory | General→general_interest; Friends and Family→general_interest with tag |
| Berth (FK) | interests.berthId | resolve via Berths table by Mooring Number |
| Berth Size Desired | interests.notes (appended) | preserve |
| Yacht Name, Length, Width, Depth | yachts.name, lengthM, widthM, draughtM | skip if name in {TBC, Na, '', null}; ft→m via × 0.3048 |
| EOI Status | interests.eoiStatus | Awaiting Further Details→pending; Waiting for Signatures→sent; Signed→signed |
| Deposit 10% Status | interests.depositStatus | Pending→pending; Received→received |
| Contract Status | interests.contractStatus | Pending→pending; 40% Received→partial; Complete→complete |
| EOI Time Sent | interests.dateEoiSent | parse |
| clientSignTime / developerSignTime / all_signed_notified_at | interests.dateEoiSigned (use latest) | parse |
| Time LOI Sent | interests.dateContractSent | parse |
| Internal Notes + Extra Comments | clientNotes (one row, system author) | concatenate with section markers |
| documensoID | documents.documensoId (when present, type='eoi') | preserve |
| Signature Link Client/CC/Developer, EmbeddedSignature* | documents.signers[] | one row per non-null signer |
| reminder_enabled, last_reminder_sent, etc. | interests.reminderEnabled, interests.reminderLastFired | parse, default true |
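
The Date Added transform ("try DD-MM-YYYY, then YYYY-MM-DD, then ISO") is worth pinning down, since day-first and year-first formats are easy to confuse. A sketch, with null signalling "fall back to NocoDB Created At, then now":

```typescript
// Sketch of the legacy date parser from the §8.2 field map.
export function parseLegacyDate(raw: string): Date | null {
  // 1. Day-first: DD-MM-YYYY.
  const dmy = raw.match(/^(\d{2})-(\d{2})-(\d{4})$/);
  if (dmy) return new Date(Date.UTC(+dmy[3], +dmy[2] - 1, +dmy[1]));

  // 2. Year-first: YYYY-MM-DD.
  const ymd = raw.match(/^(\d{4})-(\d{2})-(\d{2})$/);
  if (ymd) return new Date(Date.UTC(+ymd[1], +ymd[2] - 1, +ymd[3]));

  // 3. Anything else: let the ISO parser try; null means "use fallback".
  const iso = new Date(raw);
  return Number.isNaN(iso.getTime()) ? null : iso;
}
```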

8.3 Sales-stage mapping (8 → 9)

| NocoDB | New (PIPELINE_STAGES) |
|---|---|
| General Qualified Interest | open |
| Specific Qualified Interest | details_sent |
| EOI and NDA Sent | eoi_sent |
| Signed EOI and NDA | eoi_signed |
| Made Reservation | deposit_10pct |
| Contract Negotiation | contract_sent |
| Contract Negotiations Finalized | contract_sent (with audit note: legacy "negotiations finalized") |
| Contract Signed | contract_signed (or completed when deposit + contract both complete) |
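
As code, the mapping is a straight lookup table (a sketch; the real stage constants live in PIPELINE_STAGES in the new schema):

```typescript
// Sketch of the §8.3 stage lookup. Values assume the PIPELINE_STAGES names
// used throughout this design.
export const STAGE_MAP: Record<string, string> = {
  'General Qualified Interest': 'open',
  'Specific Qualified Interest': 'details_sent',
  'EOI and NDA Sent': 'eoi_sent',
  'Signed EOI and NDA': 'eoi_signed',
  'Made Reservation': 'deposit_10pct',
  'Contract Negotiation': 'contract_sent',
  'Contract Negotiations Finalized': 'contract_sent', // + audit note: legacy name
  'Contract Signed': 'contract_signed', // or 'completed' when deposit + contract both complete
};
```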

8.4 Other tables

  • Residential Interests (35 rows) — same shape as Interests but maps to residentialClients + residentialInterests. Smaller and cleaner. Same dedup runs within this pool independently.
  • Website - Interest Submissions (64 rows) — these are inbound capture, not yet a client. Treat as if each row is a fresh public-form submission today: run dedup against the migrated client pool. Auto-link if dedup_public_form_auto_link setting allows.
  • Website - Contact Form Submissions (47 rows) — sparse data (just name + email + interest type). Skip migration; export as CSV for manual triage. Not the source of truth for any deal.
  • Website - Berth EOI Details Supplements (1 row) — single record, preserved as a one-off attached to the matching Interest.
  • Newsletter Sending (69 rows) — out of scope; that's a marketing surface, not CRM.
  • Interests Backup, Interests copy — historical artifacts. Skipped by default. A --include-backups flag attaches them as audit-note entries on the corresponding live Interest if the user wants the history.

9. Migration script

Located at scripts/migrate-from-nocodb.ts. Idempotent: safe to re-run. Three main flags:

$ pnpm tsx scripts/migrate-from-nocodb.ts --dry-run [--port-slug X]
  Pulls everything, transforms, runs dedup, writes CSV report to .migration/<timestamp>/. No DB writes.

$ pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<timestamp>/
  Reads the report, performs the writes the dry-run promised. Refuses if the source data has changed since the report was generated (hash mismatch).

$ pnpm tsx scripts/migrate-from-nocodb.ts --rollback --apply-id <id>
  Reads the apply log, undoes the writes (only valid within the undo window).

Reuses the client-portal/server/utils/nocodb.ts adapter for the NocoDB API client (no need to rebuild). Writes to the new system via Drizzle (re-using the existing services like createClient, createInterest, etc., so all the same validation runs).

9.1 Dry-run report format

.migration/<timestamp>/report.csv:

op,reason,nocodb_row_id,target_table,target_value,confidence,manual_review_required
create_client,new,624,clients.fullName,Deepak Ramchandani,N/A,false
create_contact,new,624,clientContacts.email,dannyrams8888@gmail.com,N/A,false
create_contact,new,624,clientContacts.phone,+17215868888,N/A,false
create_interest,new,624,interests.berthId,a1b2c3...,N/A,false
auto_link,score=98 (email+phone),625,clients.id,<existing client UUID from row 624>,high,false
flag_for_review,score=72 (same name diff country),188,client.id,<existing client UUID from row 717>,medium,true
country_unresolved,fallback to AI (port country),198,clientAddresses.countryIso,AI,low,true
phone_unparseable,placeholder all-zeros,641,clientContacts.phone,<skipped>,N/A,true

Plus .migration/<timestamp>/summary.md:

# Migration Dry-Run — 2026-05-03 14:23 UTC

NocoDB:    252 Interests + 35 Residences + 64 Website Submissions
Outcome:   198 clients, 287 interests (incl. residences), 91 yachts, 412 contacts

Auto-linked (high confidence, no human action needed):
  - Nicolas Ruiz: rows 681,682,683 → 1 client + 3 interests
  - John Lynch:   rows 716,725 → 1 client + 2 interests
  - Deepak Ramchandani: rows 624,625 → 1 client + 2 interests
  - [12 more]

Flagged for manual review (medium confidence):
  - Etiennette Clamouze (rows 188,717): same name, different country phone + email
  - Bruno Joyerot #18 + Bruce Hearn #19: shared household contact
  - [4 more]

Country resolution failed for 7 rows. All defaulted to port country (AI). Review:
  - Row 239: "Sag Harbor Y" → AI (likely US)
  - [6 more]

Phone parsing failed for 3 rows. All flagged, no contact created:
  - Row 178: empty
  - Row 641: placeholder "+447000000000"
  - Row 175: empty

Run `--apply` to commit these changes.

9.2 Apply phase

--apply reads the report, re-fetches the source rows (via NocoDB MCP / API), recomputes the hash, fails fast if NocoDB changed since dry-run. Then performs the writes within a single PostgreSQL transaction per port (commit at end). On any error mid-transaction, full rollback.

After successful apply, an apply_id is generated and an audit-log row written. The apply_id is the handle used for --rollback.

9.3 Idempotency

The script tracks NocoDB row IDs in a migration_source_links table:

export const migrationSourceLinks = pgTable('migration_source_links', {
  id: text('id').primaryKey()...,
  sourceSystem: text('source_system').notNull(), // 'nocodb_interests' | 'nocodb_residences' | …
  sourceId: text('source_id').notNull(),         // NocoDB row id as string
  targetEntityType: text('target_entity_type').notNull(), // client | interest | yacht | …
  targetEntityId: text('target_entity_id').notNull(),
  appliedAt: timestamp('applied_at')...,
  appliedBy: text('applied_by'),
}, (table) => [
  uniqueIndex('idx_msl_source').on(table.sourceSystem, table.sourceId, table.targetEntityType),
]);

Re-running --apply against the same report skips rows already in this table. Useful for partial-failure resumption.


10. Test plan

10.1 Library-level (vitest unit)

  • tests/unit/dedup/normalize.test.ts — every dirty-data pattern from §1.3 has a fixture asserting the expected normalized output.
  • tests/unit/dedup/find-matches.test.ts — every duplicate cluster from §1.2 has a fixture asserting score + confidence tier. Hard cases (Pattern F) assert "medium" not "high" — false-positive guard.

10.2 Service-level (vitest integration)

  • tests/integration/dedup/client-merge.test.ts — merge service exercised: full reattach, clientMergeLog written, undo within window restores, undo after window returns 410, concurrent merge of same loser fails the second.
  • tests/integration/dedup/at-create-suggestion.test.ts — findClientMatches against a seeded pool returns expected matches + reasons.

10.3 Migration script (vitest integration with NocoDB mock)

  • tests/integration/dedup/migration-dry-run.test.ts — feed the script a fixture NocoDB dump (the 252 rows, frozen as a JSON snapshot in fixtures), assert the resulting CSV matches a golden file. Catch any future regression in the transform pipeline.
  • tests/integration/dedup/migration-apply.test.ts — apply the dry-run output to a clean test DB, assert all expected rows exist, assert idempotency (re-apply is a no-op).

10.4 E2E (Playwright)

  • tests/e2e/smoke/30-dedup-create.spec.ts — type into ClientForm with an email matching seeded client; assert suggestion card appears; click "Use this client"; assert form switches to interest-create mode.
  • tests/e2e/smoke/31-admin-duplicates.spec.ts — admin views review queue, opens a candidate, side-by-side merge UI works, merge succeeds, undo within window works.

11. Rollback plan

Three layers of safety, ordered by reversibility:

  1. Per-merge undo — admin clicks Undo on a wrongly-merged pair, system rolls back from clientMergeLog snapshot. 7-day window. No engineering needed.
  2. Migration --rollback flag — entire migration apply is reversed via the apply_id and migration_source_links table. Useful in the first 24h after --apply. Engineering-supervised.
  3. DB restore from backup — the existing docs/ops/backup-runbook.md covers this. Last resort if both above are blocked.

Pre-migration, take a hot backup of the new DB (pg_dump). Pre-merge in production (before any human-facing surface ships), the dedup_auto_merge_threshold defaults to null so no automatic merges happen — every merge is human-confirmed.


12. Open items

  • Soundex vs metaphone — Soundex is simpler but English-leaning. Metaphone handles non-English surnames better (the dataset has French, German, Italian, Slavic names). Default to metaphone via the natural package; revisit if it adds significant install size.
  • Cross-port dedup — not in scope. Each port's clients are deduped within that port. A future "shared address book" feature would need its own design.
  • Profile photo / face match — out of scope.
  • AI-assisted match resolution — out of scope. The Layer-3 review queue is human-only.

Implementation sequence

P1 (this design's library) → P2 (runtime surfaces) → P3 (migration). Each is a separate plan / PR.

P1 deliverables: src/lib/dedup/{normalize,find-matches}.ts + tests. No UI changes. No DB changes (except indexed lookups added to existing clientContacts). ~1.5 days.

P2 deliverables: at-create suggestion in ClientForm + interest-level guard in createInterest service + admin settings UI for thresholds + clientMergeCandidates table + nightly job + admin review queue page + merge service + side-by-side merge UI. ~5–7 days.

P3 deliverables: scripts/migrate-from-nocodb.ts + migration_source_links table + dry-run + apply + rollback. CSV report format frozen against fixture. ~3 days, including fixture creation from the live NocoDB snapshot.

Total: ~10–12 engineering days from approval. Can be split across three PRs landing independently — each is testable in isolation and the runtime surfaces (P2) work even without P3 being run.