feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
/**
|
|
|
|
|
* Normalization helpers for the dedup pipeline.
|
|
|
|
|
*
|
|
|
|
|
* Pure functions (no DB, no React). Used by both the runtime at-create
|
|
|
|
|
* surfaces and the one-shot NocoDB migration script. Every transform
|
|
|
|
|
* here has a fixture in `tests/unit/dedup/normalize.test.ts` drawn from
|
|
|
|
|
* real dirty values observed in the legacy NocoDB Interests table.
|
|
|
|
|
*
|
|
|
|
|
* Design reference: docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md §3.
|
|
|
|
|
*/
|
|
|
|
|
|
|
|
|
|
import { z } from 'zod';
|
|
|
|
|
|
|
|
|
|
import { ALL_COUNTRY_CODES, getCountryName, type CountryCode } from '@/lib/i18n/countries';
|
feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.
Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring
job for the /admin/duplicates review queue. Status enum: pending /
dismissed / merged. Unique-(portId, clientAId, clientBId) so the
same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
rows (NocoDB Interest #624 → new client UUID) so re-running --apply
against the same dry-run report skips already-imported entities.
Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
Library
=======
src/lib/dedup/nocodb-source.ts
Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
auto-paginates until isLastPage, captures the table IDs from the
2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
parallel into one in-memory object the transform layer consumes.
src/lib/dedup/migration-transform.ts
Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
- normalizes name / email / phone / country via the dedup library
- parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
- maps the 8-stage `Sales Process Level` enum to the new 9-stage
pipelineStage
- filters yacht-name placeholders ('TBC', 'Na', etc.)
- merges Internal Notes + Extra Comments + Berth Size Desired into
a single notes blob
Then runs `findClientMatches` pairwise (with blocking) and
union-finds clusters of rows whose score crosses the auto-link
threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
Each cluster's "lead" row is picked by completeness score with
recency tie-break.
src/lib/dedup/migration-report.ts
Writes three artifacts to .migration/<timestamp>/:
- report.csv — one row per planned op, RFC-4180 escaped
- summary.md — human-skimmable overview
- plan.json — full structured plan for the --apply phase
CSV cells with comma / quote / newline are quoted; internal quotes
are doubled. No external CSV dep.
src/lib/dedup/phone-parse.ts
Script-safe wrapper around libphonenumber-js's `core` entry that
loads `metadata.min.json` directly. The default `index.cjs.js`
bundled by libphonenumber hits a metadata-shape interop bug under
Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
The dedup `normalizePhone` and `find-matches` both use this wrapper
now so the same code path runs in vitest, Next.js, and the migration
CLI without surprises.
src/lib/dedup/normalize.ts
Tightened country resolution: added Caribbean short-form aliases
('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
US locations seen in the NocoDB dump (Boston, Tampa, Fort
Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
to drop the `isValid()` strict check — the libphonenumber min build
rejects many real NANP-territory numbers, and dedup only needs a
canonical E.164 to compare.
CLI
===
scripts/migrate-from-nocodb.ts
pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
→ Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
runs the transform, writes report. No DB writes.
pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
→ Stubbed; exits with `not yet implemented` and a pointer to the
design doc. Apply phase ships in a follow-up.
Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
Fixture-based regression. A frozen 12-row NocoDB snapshot covers
every duplicate pattern in the design (§1.2). The test asserts:
- 12 input rows → 7 unique clients (cluster math is right)
- Patterns A / B / C / E auto-link
- Pattern F (Etiennette Clamouze) does NOT auto-link
- Every interest preserved as its own row even when clients merge
- 8-stage → 9-stage enum mapping is correct per spec
- Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
client) — the design's signature win
- Output is deterministic (run twice, identical)
Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged
Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
|
|
|
import { parsePhoneScriptSafe as parsePhone } from './phone-parse';
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
|
|
|
|
|
// ─── Names ──────────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Tokens that should stay lowercase mid-name. Covers the common Romance,
|
|
|
|
|
* Germanic, and Iberian particles seen in client records. The first token
|
|
|
|
|
* of a name is always title-cased even if it appears in this set.
|
|
|
|
|
*/
|
|
|
|
|
const PARTICLES: ReadonlySet<string> = new Set([
|
|
|
|
|
'van',
|
|
|
|
|
'von',
|
|
|
|
|
'de',
|
|
|
|
|
'del',
|
|
|
|
|
'da',
|
|
|
|
|
'das',
|
|
|
|
|
'do',
|
|
|
|
|
'dos',
|
|
|
|
|
'di',
|
|
|
|
|
'le',
|
|
|
|
|
'la',
|
|
|
|
|
'el',
|
|
|
|
|
'al',
|
|
|
|
|
'der',
|
|
|
|
|
'den',
|
|
|
|
|
'des',
|
|
|
|
|
'du',
|
|
|
|
|
'dalla',
|
|
|
|
|
'della',
|
|
|
|
|
'st',
|
|
|
|
|
'st.',
|
|
|
|
|
'y',
|
|
|
|
|
]);
|
|
|
|
|
|
|
|
|
|
export interface NormalizedName {
|
|
|
|
|
/** Human-readable form preserved for UI display. Trims, collapses
|
2026-05-04 22:56:18 +02:00
|
|
|
* whitespace, fixes case, but never destroys the user's intent -
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
* slash-with-company structure ("Daniel Wainstein / 7 Knots, LLC")
|
|
|
|
|
* is left intact. */
|
|
|
|
|
display: string;
|
|
|
|
|
/** Lowercased form for matching. */
|
|
|
|
|
normalized: string;
|
|
|
|
|
/** Last non-particle token, lowercased. Used as a blocking key by the
|
|
|
|
|
* dedup algorithm so we only compare candidates with similar surnames. */
|
|
|
|
|
surnameToken?: string;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Normalize a free-text full name. Trims and collapses whitespace,
|
|
|
|
|
* replaces \r/\n/\t with single spaces, intelligently title-cases
|
|
|
|
|
* ALL-CAPS surnames while keeping particles (van / de / dalla / etc.)
|
|
|
|
|
* lowercase mid-name, and preserves Irish O' surnames as O'Brien.
|
|
|
|
|
*
|
|
|
|
|
* If the input contains a `/` (slash-with-company structure like
|
|
|
|
|
* "Daniel Wainstein / 7 Knots, LLC"), the trailing company text is
|
2026-05-04 22:56:18 +02:00
|
|
|
* preserved verbatim - it's signal, not noise.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
*/
|
|
|
|
|
export function normalizeName(raw: string | null | undefined): NormalizedName {
|
|
|
|
|
const safe = (raw ?? '').toString();
|
|
|
|
|
// Replace \r, \n, \t with single spaces, then collapse runs of whitespace.
|
|
|
|
|
const cleaned = safe
|
|
|
|
|
.replace(/[\r\n\t]/g, ' ')
|
|
|
|
|
.replace(/\s+/g, ' ')
|
|
|
|
|
.trim();
|
|
|
|
|
|
|
|
|
|
if (!cleaned) {
|
|
|
|
|
return { display: '', normalized: '', surnameToken: undefined };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// Slash-with-company: title-case the part before the slash, leave the
|
|
|
|
|
// company segment untouched (it's typically already a brand we shouldn't
|
|
|
|
|
// mangle: "SAS TIKI", "7 Knots, LLC").
|
|
|
|
|
const slashIdx = cleaned.indexOf('/');
|
|
|
|
|
let displayCore: string;
|
|
|
|
|
if (slashIdx !== -1) {
|
|
|
|
|
const personPart = cleaned.slice(0, slashIdx).trim();
|
|
|
|
|
const companyPart = cleaned.slice(slashIdx + 1).trim();
|
|
|
|
|
displayCore = `${titleCaseTokens(personPart)} / ${companyPart}`;
|
|
|
|
|
} else {
|
|
|
|
|
displayCore = titleCaseTokens(cleaned);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
const display = displayCore;
|
|
|
|
|
const normalized = display.toLowerCase();
|
|
|
|
|
const surnameToken = computeSurnameToken(slashIdx !== -1 ? cleaned.slice(0, slashIdx) : cleaned);
|
|
|
|
|
|
|
|
|
|
return { display, normalized, surnameToken };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function titleCaseTokens(s: string): string {
|
|
|
|
|
const tokens = s.split(' ').filter(Boolean);
|
|
|
|
|
if (tokens.length === 0) return '';
|
|
|
|
|
return tokens.map((tok, idx) => titleCaseOneToken(tok, idx === 0)).join(' ');
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function titleCaseOneToken(token: string, isFirst: boolean): string {
|
|
|
|
|
if (!token) return '';
|
|
|
|
|
const lower = token.toLowerCase();
|
|
|
|
|
if (!isFirst && PARTICLES.has(lower)) return lower;
|
2026-05-04 22:56:18 +02:00
|
|
|
// O'Brien / D'Angelo / l'Estrange - capitalize the segment after each
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
// apostrophe so a lowercased input round-trips to readable Irish caps.
|
|
|
|
|
if (lower.includes("'")) {
|
|
|
|
|
return lower
|
|
|
|
|
.split("'")
|
|
|
|
|
.map((part) => (part.length > 0 ? part[0]!.toUpperCase() + part.slice(1) : part))
|
|
|
|
|
.join("'");
|
|
|
|
|
}
|
|
|
|
|
return lower[0]!.toUpperCase() + lower.slice(1);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
function computeSurnameToken(personPart: string): string | undefined {
|
|
|
|
|
const cleaned = personPart
|
|
|
|
|
.replace(/[\r\n\t]/g, ' ')
|
|
|
|
|
.replace(/\s+/g, ' ')
|
|
|
|
|
.trim();
|
|
|
|
|
if (!cleaned) return undefined;
|
|
|
|
|
const tokens = cleaned.split(' ').map((t) => t.toLowerCase());
|
|
|
|
|
// Walk from the right past particles to find the last "real" surname token.
|
|
|
|
|
for (let i = tokens.length - 1; i >= 0; i -= 1) {
|
|
|
|
|
const tok = tokens[i]!;
|
|
|
|
|
if (!PARTICLES.has(tok)) return tok;
|
|
|
|
|
}
|
|
|
|
|
// All tokens are particles? Fall back to the last token verbatim.
|
|
|
|
|
return tokens[tokens.length - 1];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// ─── Emails ─────────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
const emailSchema = z.string().email();
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Normalize a free-text email. Trims + lowercases. Returns null for empty
|
2026-05-04 22:56:18 +02:00
|
|
|
* or malformed input - caller decides whether to flag, store, or drop.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
*
|
|
|
|
|
* Plus-aliases (`user+tag@domain.com`) are NOT stripped: they're real
|
|
|
|
|
* distinct addresses, and stripping them would auto-merge legitimately
|
|
|
|
|
* separate accounts.
|
|
|
|
|
*/
|
|
|
|
|
export function normalizeEmail(raw: string | null | undefined): string | null {
|
|
|
|
|
if (raw == null) return null;
|
|
|
|
|
const trimmed = raw.toString().trim().toLowerCase();
|
|
|
|
|
if (!trimmed) return null;
|
|
|
|
|
const result = emailSchema.safeParse(trimmed);
|
|
|
|
|
return result.success ? trimmed : null;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// ─── Phones ─────────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
export type PhoneFlag = 'multi_number' | 'placeholder' | 'unparseable';
|
|
|
|
|
|
|
|
|
|
export interface NormalizedPhone {
|
|
|
|
|
/** Canonical E.164 form, e.g. '+15742740548'. Null when unparseable
|
|
|
|
|
* or flagged as placeholder. */
|
|
|
|
|
e164: string | null;
|
|
|
|
|
/** ISO-3166-1 alpha-2 of the country the number was parsed against. */
|
|
|
|
|
country: CountryCode | null;
|
|
|
|
|
/** Display-friendly international format. Useful for migration reports. */
|
|
|
|
|
display: string | null;
|
|
|
|
|
/** Set when the input had a quirk worth surfacing in the migration
|
|
|
|
|
* report or runtime audit log. Absent on clean parses. */
|
|
|
|
|
flagged?: PhoneFlag;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Normalize a raw user-entered phone string for comparison + storage.
|
|
|
|
|
*
|
|
|
|
|
* Pipeline:
|
|
|
|
|
* 1. strip leading apostrophe (spreadsheet copy-paste artifact)
|
|
|
|
|
* 2. strip \r / \n / \t (real values seen in NocoDB had carriage returns)
|
|
|
|
|
* 3. detect multi-number fields ("+33611111111;+33622222222",
|
2026-05-04 22:56:18 +02:00
|
|
|
* "0677580750/0690511494") - flag and take first segment
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
* 4. strip whitespace, dots, dashes, parens, single quotes
|
|
|
|
|
* 5. convert leading "00" → "+" (international dialling code)
|
2026-05-04 22:56:18 +02:00
|
|
|
* 6. detect placeholder fakes (8+ consecutive zeros) - flag, return null e164
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
* 7. parse via libphonenumber-js
|
|
|
|
|
* 8. on parse failure or invalid number → flag 'unparseable'
|
|
|
|
|
*
|
|
|
|
|
* Returns null for empty inputs (cheaper to short-circuit than to wrap).
|
|
|
|
|
*/
|
|
|
|
|
export function normalizePhone(
|
|
|
|
|
raw: string | null | undefined,
|
|
|
|
|
defaultCountry?: CountryCode,
|
|
|
|
|
): NormalizedPhone | null {
|
|
|
|
|
if (raw == null) return null;
|
|
|
|
|
let cleaned = raw.toString().trim();
|
|
|
|
|
if (!cleaned) return null;
|
|
|
|
|
|
|
|
|
|
// 1. Spreadsheet apostrophe prefix.
|
|
|
|
|
if (cleaned.startsWith("'")) cleaned = cleaned.slice(1);
|
|
|
|
|
|
|
|
|
|
// 2. Strip carriage returns / newlines / tabs.
|
|
|
|
|
cleaned = cleaned.replace(/[\r\n\t]/g, '');
|
|
|
|
|
|
2026-05-04 22:56:18 +02:00
|
|
|
// 3. Multi-number detection - split on /, ;, , (in that order of priority).
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
let flagged: PhoneFlag | undefined;
|
|
|
|
|
if (/[/;,]/.test(cleaned)) {
|
|
|
|
|
flagged = 'multi_number';
|
|
|
|
|
cleaned = cleaned.split(/[/;,]/)[0]!.trim();
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// 4. Strip whitespace, dots, dashes, parens. Keep + for E.164 prefix.
|
|
|
|
|
cleaned = cleaned.replace(/[\s.\-()]/g, '');
|
|
|
|
|
if (!cleaned) return { e164: null, country: null, display: null, flagged: 'unparseable' };
|
|
|
|
|
|
|
|
|
|
// 5. 00 international prefix → +.
|
|
|
|
|
if (cleaned.startsWith('00')) {
|
|
|
|
|
cleaned = '+' + cleaned.slice(2);
|
|
|
|
|
}
|
|
|
|
|
|
2026-05-04 22:56:18 +02:00
|
|
|
// 6. Placeholder fakes - runs of 8+ consecutive zeros, e.g. +447000000000.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
if (/0{8,}/.test(cleaned)) {
|
|
|
|
|
return { e164: null, country: null, display: null, flagged: 'placeholder' };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// 7. Parse via the existing i18n helper (libphonenumber-js under the hood).
|
|
|
|
|
const parsed = parsePhone(cleaned, defaultCountry);
|
feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.
Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring
job for the /admin/duplicates review queue. Status enum: pending /
dismissed / merged. Unique-(portId, clientAId, clientBId) so the
same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
rows (NocoDB Interest #624 → new client UUID) so re-running --apply
against the same dry-run report skips already-imported entities.
Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
Library
=======
src/lib/dedup/nocodb-source.ts
Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
auto-paginates until isLastPage, captures the table IDs from the
2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
parallel into one in-memory object the transform layer consumes.
src/lib/dedup/migration-transform.ts
Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
- normalizes name / email / phone / country via the dedup library
- parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
- maps the 8-stage `Sales Process Level` enum to the new 9-stage
pipelineStage
- filters yacht-name placeholders ('TBC', 'Na', etc.)
- merges Internal Notes + Extra Comments + Berth Size Desired into
a single notes blob
Then runs `findClientMatches` pairwise (with blocking) and
union-finds clusters of rows whose score crosses the auto-link
threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
Each cluster's "lead" row is picked by completeness score with
recency tie-break.
src/lib/dedup/migration-report.ts
Writes three artifacts to .migration/<timestamp>/:
- report.csv — one row per planned op, RFC-4180 escaped
- summary.md — human-skimmable overview
- plan.json — full structured plan for the --apply phase
CSV cells with comma / quote / newline are quoted; internal quotes
are doubled. No external CSV dep.
src/lib/dedup/phone-parse.ts
Script-safe wrapper around libphonenumber-js's `core` entry that
loads `metadata.min.json` directly. The default `index.cjs.js`
bundled by libphonenumber hits a metadata-shape interop bug under
Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
The dedup `normalizePhone` and `find-matches` both use this wrapper
now so the same code path runs in vitest, Next.js, and the migration
CLI without surprises.
src/lib/dedup/normalize.ts
Tightened country resolution: added Caribbean short-form aliases
('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
US locations seen in the NocoDB dump (Boston, Tampa, Fort
Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
to drop the `isValid()` strict check — the libphonenumber min build
rejects many real NANP-territory numbers, and dedup only needs a
canonical E.164 to compare.
CLI
===
scripts/migrate-from-nocodb.ts
pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
→ Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
runs the transform, writes report. No DB writes.
pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
→ Stubbed; exits with `not yet implemented` and a pointer to the
design doc. Apply phase ships in a follow-up.
Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
Fixture-based regression. A frozen 12-row NocoDB snapshot covers
every duplicate pattern in the design (§1.2). The test asserts:
- 12 input rows → 7 unique clients (cluster math is right)
- Patterns A / B / C / E auto-link
- Pattern F (Etiennette Clamouze) does NOT auto-link
- Every interest preserved as its own row even when clients merge
- 8-stage → 9-stage enum mapping is correct per spec
- Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
client) — the design's signature win
- Output is deterministic (run twice, identical)
Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged
Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
|
|
|
if (!parsed.e164) {
|
2026-05-04 22:56:18 +02:00
|
|
|
// Couldn't even produce a canonical form - genuinely garbage.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
return { e164: null, country: null, display: null, flagged: 'unparseable' };
|
|
|
|
|
}
|
|
|
|
|
|
feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.
Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring
job for the /admin/duplicates review queue. Status enum: pending /
dismissed / merged. Unique-(portId, clientAId, clientBId) so the
same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
rows (NocoDB Interest #624 → new client UUID) so re-running --apply
against the same dry-run report skips already-imported entities.
Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
Library
=======
src/lib/dedup/nocodb-source.ts
Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
auto-paginates until isLastPage, captures the table IDs from the
2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
parallel into one in-memory object the transform layer consumes.
src/lib/dedup/migration-transform.ts
Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
- normalizes name / email / phone / country via the dedup library
- parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
- maps the 8-stage `Sales Process Level` enum to the new 9-stage
pipelineStage
- filters yacht-name placeholders ('TBC', 'Na', etc.)
- merges Internal Notes + Extra Comments + Berth Size Desired into
a single notes blob
Then runs `findClientMatches` pairwise (with blocking) and
union-finds clusters of rows whose score crosses the auto-link
threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
Each cluster's "lead" row is picked by completeness score with
recency tie-break.
src/lib/dedup/migration-report.ts
Writes three artifacts to .migration/<timestamp>/:
- report.csv — one row per planned op, RFC-4180 escaped
- summary.md — human-skimmable overview
- plan.json — full structured plan for the --apply phase
CSV cells with comma / quote / newline are quoted; internal quotes
are doubled. No external CSV dep.
src/lib/dedup/phone-parse.ts
Script-safe wrapper around libphonenumber-js's `core` entry that
loads `metadata.min.json` directly. The default `index.cjs.js`
bundled by libphonenumber hits a metadata-shape interop bug under
Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
The dedup `normalizePhone` and `find-matches` both use this wrapper
now so the same code path runs in vitest, Next.js, and the migration
CLI without surprises.
src/lib/dedup/normalize.ts
Tightened country resolution: added Caribbean short-form aliases
('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
US locations seen in the NocoDB dump (Boston, Tampa, Fort
Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
to drop the `isValid()` strict check — the libphonenumber min build
rejects many real NANP-territory numbers, and dedup only needs a
canonical E.164 to compare.
CLI
===
scripts/migrate-from-nocodb.ts
pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
→ Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
runs the transform, writes report. No DB writes.
pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
→ Stubbed; exits with `not yet implemented` and a pointer to the
design doc. Apply phase ships in a follow-up.
Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
Fixture-based regression. A frozen 12-row NocoDB snapshot covers
every duplicate pattern in the design (§1.2). The test asserts:
- 12 input rows → 7 unique clients (cluster math is right)
- Patterns A / B / C / E auto-link
- Pattern F (Etiennette Clamouze) does NOT auto-link
- Every interest preserved as its own row even when clients merge
- 8-stage → 9-stage enum mapping is correct per spec
- Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
client) — the design's signature win
- Output is deterministic (run twice, identical)
Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged
Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
|
|
|
// Note: we deliberately don't gate on `parsed.isValid`. The
|
|
|
|
|
// libphonenumber-js `min` build returns isValid=false for many real
|
|
|
|
|
// numbers (NANP territories share +1; some country metadata is
|
|
|
|
|
// truncated). For dedup we only need a canonical E.164 string to
|
|
|
|
|
// compare; strict validity is the form layer's problem, not ours.
|
|
|
|
|
// If a string-only test (e.g. \"abc-not-a-phone\") gets here, parse
|
|
|
|
|
// returns null e164 anyway and the branch above handles it.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
return {
|
|
|
|
|
e164: parsed.e164,
|
|
|
|
|
country: parsed.country,
|
|
|
|
|
display: parsed.international,
|
|
|
|
|
flagged,
|
|
|
|
|
};
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// ─── Countries ──────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Aliases for canonical country names that don't match
|
|
|
|
|
* `Intl.DisplayNames(en)` output verbatim. Keys are pre-normalized
|
|
|
|
|
* (lowercase, diacritic-free, hyphens/dots → spaces, collapsed whitespace).
|
|
|
|
|
*
|
2026-05-04 22:56:18 +02:00
|
|
|
* Kept opinionated and small - only entries we've actually seen in legacy
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
* data. Adding a new alias is cheap; trying to be exhaustive isn't.
|
|
|
|
|
*/
|
|
|
|
|
const COUNTRY_ALIASES: Record<string, CountryCode> = {
|
|
|
|
|
// Generic abbreviations
|
|
|
|
|
usa: 'US',
|
|
|
|
|
us: 'US',
|
|
|
|
|
uk: 'GB',
|
|
|
|
|
// Saint-Barthélemy variants seen in production
|
|
|
|
|
'saint barthelemy': 'BL',
|
|
|
|
|
'saint barth': 'BL',
|
|
|
|
|
'st barth': 'BL',
|
|
|
|
|
'st barths': 'BL',
|
|
|
|
|
'st barthelemy': 'BL',
|
feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.
Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring
job for the /admin/duplicates review queue. Status enum: pending /
dismissed / merged. Unique-(portId, clientAId, clientBId) so the
same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
rows (NocoDB Interest #624 → new client UUID) so re-running --apply
against the same dry-run report skips already-imported entities.
Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
Library
=======
src/lib/dedup/nocodb-source.ts
Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
auto-paginates until isLastPage, captures the table IDs from the
2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
parallel into one in-memory object the transform layer consumes.
src/lib/dedup/migration-transform.ts
Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
- normalizes name / email / phone / country via the dedup library
- parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
- maps the 8-stage `Sales Process Level` enum to the new 9-stage
pipelineStage
- filters yacht-name placeholders ('TBC', 'Na', etc.)
- merges Internal Notes + Extra Comments + Berth Size Desired into
a single notes blob
Then runs `findClientMatches` pairwise (with blocking) and
union-finds clusters of rows whose score crosses the auto-link
threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
Each cluster's "lead" row is picked by completeness score with
recency tie-break.
src/lib/dedup/migration-report.ts
Writes three artifacts to .migration/<timestamp>/:
- report.csv — one row per planned op, RFC-4180 escaped
- summary.md — human-skimmable overview
- plan.json — full structured plan for the --apply phase
CSV cells with comma / quote / newline are quoted; internal quotes
are doubled. No external CSV dep.
src/lib/dedup/phone-parse.ts
Script-safe wrapper around libphonenumber-js's `core` entry that
loads `metadata.min.json` directly. The default `index.cjs.js`
bundled by libphonenumber hits a metadata-shape interop bug under
Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
The dedup `normalizePhone` and `find-matches` both use this wrapper
now so the same code path runs in vitest, Next.js, and the migration
CLI without surprises.
src/lib/dedup/normalize.ts
Tightened country resolution: added Caribbean short-form aliases
('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
US locations seen in the NocoDB dump (Boston, Tampa, Fort
Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
to drop the `isValid()` strict check — the libphonenumber min build
rejects many real NANP-territory numbers, and dedup only needs a
canonical E.164 to compare.
CLI
===
scripts/migrate-from-nocodb.ts
pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
→ Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
runs the transform, writes report. No DB writes.
pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
→ Stubbed; exits with `not yet implemented` and a pointer to the
design doc. Apply phase ships in a follow-up.
Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
Fixture-based regression. A frozen 12-row NocoDB snapshot covers
every duplicate pattern in the design (§1.2). The test asserts:
- 12 input rows → 7 unique clients (cluster math is right)
- Patterns A / B / C / E auto-link
- Pattern F (Etiennette Clamouze) does NOT auto-link
- Every interest preserved as its own row even when clients merge
- 8-stage → 9-stage enum mapping is correct per spec
- Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
client) — the design's signature win
- Output is deterministic (run twice, identical)
Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged
Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
|
|
|
// Caribbean short-forms whose canonical Intl names are awkward
|
|
|
|
|
// ("Antigua and Barbuda", "Saint Vincent and the Grenadines", etc.).
|
|
|
|
|
antigua: 'AG',
|
|
|
|
|
barbuda: 'AG',
|
|
|
|
|
'st kitts': 'KN',
|
|
|
|
|
'saint kitts': 'KN',
|
|
|
|
|
nevis: 'KN',
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
};
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* High-frequency cities → country, used as a last-resort fallback when
|
|
|
|
|
* exact / alias / fuzzy country matching all miss. Keys are normalized.
|
|
|
|
|
*
|
|
|
|
|
* Order matters: an entry's key is also matched as a substring of the
|
|
|
|
|
* input ("Sag Harbor Y" contains "sag harbor"), so the most specific
|
|
|
|
|
* city appears first to avoid a wrong partial hit.
|
|
|
|
|
*/
|
|
|
|
|
const CITY_TO_COUNTRY: Record<string, CountryCode> = {
|
|
|
|
|
'kansas city': 'US',
|
|
|
|
|
'sag harbor': 'US',
|
|
|
|
|
'new york': 'US',
|
feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.
Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring
job for the /admin/duplicates review queue. Status enum: pending /
dismissed / merged. Unique-(portId, clientAId, clientBId) so the
same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
rows (NocoDB Interest #624 → new client UUID) so re-running --apply
against the same dry-run report skips already-imported entities.
Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
Library
=======
src/lib/dedup/nocodb-source.ts
Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
auto-paginates until isLastPage, captures the table IDs from the
2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
parallel into one in-memory object the transform layer consumes.
src/lib/dedup/migration-transform.ts
Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
- normalizes name / email / phone / country via the dedup library
- parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
- maps the 8-stage `Sales Process Level` enum to the new 9-stage
pipelineStage
- filters yacht-name placeholders ('TBC', 'Na', etc.)
- merges Internal Notes + Extra Comments + Berth Size Desired into
a single notes blob
Then runs `findClientMatches` pairwise (with blocking) and
union-finds clusters of rows whose score crosses the auto-link
threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
Each cluster's "lead" row is picked by completeness score with
recency tie-break.
src/lib/dedup/migration-report.ts
Writes three artifacts to .migration/<timestamp>/:
- report.csv — one row per planned op, RFC-4180 escaped
- summary.md — human-skimmable overview
- plan.json — full structured plan for the --apply phase
CSV cells with comma / quote / newline are quoted; internal quotes
are doubled. No external CSV dep.
src/lib/dedup/phone-parse.ts
Script-safe wrapper around libphonenumber-js's `core` entry that
loads `metadata.min.json` directly. The default `index.cjs.js`
bundled by libphonenumber hits a metadata-shape interop bug under
Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
The dedup `normalizePhone` and `find-matches` both use this wrapper
now so the same code path runs in vitest, Next.js, and the migration
CLI without surprises.
src/lib/dedup/normalize.ts
Tightened country resolution: added Caribbean short-form aliases
('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
US locations seen in the NocoDB dump (Boston, Tampa, Fort
Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
to drop the `isValid()` strict check — the libphonenumber min build
rejects many real NANP-territory numbers, and dedup only needs a
canonical E.164 to compare.
CLI
===
scripts/migrate-from-nocodb.ts
pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
→ Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
runs the transform, writes report. No DB writes.
pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
→ Stubbed; exits with `not yet implemented` and a pointer to the
design doc. Apply phase ships in a follow-up.
Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
Fixture-based regression. A frozen 12-row NocoDB snapshot covers
every duplicate pattern in the design (§1.2). The test asserts:
- 12 input rows → 7 unique clients (cluster math is right)
- Patterns A / B / C / E auto-link
- Pattern F (Etiennette Clamouze) does NOT auto-link
- Every interest preserved as its own row even when clients merge
- 8-stage → 9-stage enum mapping is correct per spec
- Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
client) — the design's signature win
- Output is deterministic (run twice, identical)
Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged
Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
|
|
|
// Cities that came out unresolved from the 2026-05-03 NocoDB dry-run.
|
|
|
|
|
// Using lowercase (post-normalize keys).
|
|
|
|
|
boston: 'US',
|
|
|
|
|
tampa: 'US',
|
|
|
|
|
'fort lauderdale': 'US',
|
|
|
|
|
'port jefferson': 'US',
|
|
|
|
|
nantucket: 'US',
|
|
|
|
|
// US state abbreviations that often appear standalone or as suffix:
|
|
|
|
|
' fl': 'US',
|
|
|
|
|
' ma': 'US',
|
|
|
|
|
' ny': 'US',
|
|
|
|
|
' tx': 'US',
|
|
|
|
|
' ca': 'US',
|
|
|
|
|
// International
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
london: 'GB',
|
|
|
|
|
paris: 'FR',
|
|
|
|
|
};
|
|
|
|
|
|
|
|
|
|
export type CountryConfidence = 'exact' | 'fuzzy' | 'city';
|
|
|
|
|
|
|
|
|
|
export interface ResolvedCountry {
|
|
|
|
|
iso: CountryCode | null;
|
|
|
|
|
confidence: CountryConfidence | null;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Map free-text country / region input to an ISO-3166-1 alpha-2 code.
|
|
|
|
|
*
|
|
|
|
|
* Lookup order: alias → exact (vs. all locale country names) → city →
|
|
|
|
|
* fuzzy (Levenshtein ≤ 2). Anything beyond fuzzy returns null and the
|
|
|
|
|
* migration script flags the row for human review.
|
|
|
|
|
*/
|
|
|
|
|
export function resolveCountry(text: string | null | undefined): ResolvedCountry {
|
|
|
|
|
if (text == null) return { iso: null, confidence: null };
|
|
|
|
|
const normalized = normalizeForLookup(text.toString());
|
|
|
|
|
if (!normalized) return { iso: null, confidence: null };
|
|
|
|
|
|
2026-05-04 22:56:18 +02:00
|
|
|
// 1. Aliases - covers USA / UK / St Barth and friends.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
const alias = COUNTRY_ALIASES[normalized];
|
|
|
|
|
if (alias) return { iso: alias, confidence: 'exact' };
|
|
|
|
|
|
|
|
|
|
// 2. Exact match against Intl-derived country names. We compare against
|
|
|
|
|
// diacritic-stripped + lowercased canonical names so 'United States'
|
|
|
|
|
// and 'united states' both resolve.
|
|
|
|
|
for (const code of ALL_COUNTRY_CODES) {
|
|
|
|
|
const cleanName = normalizeForLookup(getCountryName(code, 'en'));
|
|
|
|
|
if (cleanName === normalized) return { iso: code, confidence: 'exact' };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// 3. City → country fallback, exact or substring.
|
|
|
|
|
const cityExact = CITY_TO_COUNTRY[normalized];
|
|
|
|
|
if (cityExact) return { iso: cityExact, confidence: 'city' };
|
|
|
|
|
for (const [city, iso] of Object.entries(CITY_TO_COUNTRY)) {
|
|
|
|
|
if (normalized.includes(city)) return { iso, confidence: 'city' };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// 4. Fuzzy fallback (Levenshtein ≤ 2). Skipped for short inputs because
|
|
|
|
|
// a 4-char string like "Mars" sits within distance 2 of multiple
|
2026-05-04 22:56:18 +02:00
|
|
|
// short country names (Mali, Laos, Iran, …) - false-positive city.
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
if (normalized.length >= 6) {
|
|
|
|
|
let bestCode: CountryCode | null = null;
|
|
|
|
|
let bestDistance = Number.POSITIVE_INFINITY;
|
|
|
|
|
for (const code of ALL_COUNTRY_CODES) {
|
|
|
|
|
const cleanName = normalizeForLookup(getCountryName(code, 'en'));
|
|
|
|
|
const d = levenshtein(cleanName, normalized);
|
|
|
|
|
if (d < bestDistance) {
|
|
|
|
|
bestDistance = d;
|
|
|
|
|
bestCode = code;
|
|
|
|
|
if (d === 0) break;
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
if (bestDistance <= 2 && bestCode) {
|
|
|
|
|
return { iso: bestCode, confidence: 'fuzzy' };
|
|
|
|
|
}
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return { iso: null, confidence: null };
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/** Lowercase + strip diacritics + replace hyphens/dots with spaces +
|
|
|
|
|
* collapse whitespace. Used by both the input and the canonical-name
|
|
|
|
|
* side of the country comparison so they meet on the same shape. */
|
|
|
|
|
function normalizeForLookup(s: string): string {
|
|
|
|
|
return s
|
|
|
|
|
.normalize('NFD')
|
|
|
|
|
.replace(/[̀-ͯ]/g, '')
|
|
|
|
|
.toLowerCase()
|
|
|
|
|
.replace(/[-.]/g, ' ')
|
|
|
|
|
.replace(/\s+/g, ' ')
|
|
|
|
|
.trim();
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
// ─── Levenshtein ────────────────────────────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Standard iterative Levenshtein. Used by the country fuzzy match and by
|
|
|
|
|
* the dedup algorithm's name-similarity rule. Allocates O(n*m) so callers
|
2026-05-04 22:56:18 +02:00
|
|
|
* shouldn't run it against pathological inputs - the dedup blocking
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
* strategy keeps comparison sets small.
|
|
|
|
|
*
|
|
|
|
|
* Exported so the find-matches module can reuse the same implementation
|
|
|
|
|
* without relying on an external dep.
|
|
|
|
|
*/
|
|
|
|
|
export function levenshtein(a: string, b: string): number {
|
|
|
|
|
if (a === b) return 0;
|
|
|
|
|
if (!a) return b.length;
|
|
|
|
|
if (!b) return a.length;
|
|
|
|
|
|
|
|
|
|
const m = a.length;
|
|
|
|
|
const n = b.length;
|
2026-05-04 22:56:18 +02:00
|
|
|
// Two rolling rows is enough - keeps memory at O(n) instead of O(n*m).
|
feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.
src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
title-cases ALL-CAPS surnames while keeping particles (van / de /
dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
LLC") seen in production. Returns a surnameToken (lowercased last
non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
carriage returns, dots/dashes/parens, converts 00 prefix to +) then
delegates to libphonenumber-js. Detects multi-number fields ("a/b",
"a;b") and placeholder fakes (8+ consecutive zeros, e.g.
+447000000000). Flags every quirk so the migration report and runtime
audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
(Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
by find-matches.
src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
/ phone / surname-token), gathers the comparison set via union, and
scores each candidate via the rule set in design §4.2:
Email match +60
Phone E.164 match +50 (≥ 8 digits, excludes placeholder zeros)
Name exact match +20
Surname + given fuzzy +15 (Levenshtein ≤ 1)
Negative: shared email but different phone country −15
Negative: name match but no shared contact −20
Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
'low') is derived from configurable thresholds passed in by the
caller — defaults are highScore=90, mediumScore=50.
tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.
tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool
Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00
|
|
|
let prev = new Array<number>(n + 1);
|
|
|
|
|
let curr = new Array<number>(n + 1);
|
|
|
|
|
for (let j = 0; j <= n; j += 1) prev[j] = j;
|
|
|
|
|
|
|
|
|
|
for (let i = 1; i <= m; i += 1) {
|
|
|
|
|
curr[0] = i;
|
|
|
|
|
for (let j = 1; j <= n; j += 1) {
|
|
|
|
|
const cost = a[i - 1] === b[j - 1] ? 0 : 1;
|
|
|
|
|
curr[j] = Math.min(curr[j - 1]! + 1, prev[j]! + 1, prev[j - 1]! + cost);
|
|
|
|
|
}
|
|
|
|
|
[prev, curr] = [curr, prev];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return prev[n]!;
|
|
|
|
|
}
|