Commit Graph

3 Commits

Author SHA1 Message Date
Matt Ciaccio
d62822c284 fix(migration): NocoDB import safety + dedup helpers + lead-source backfill
migration-apply: residential client + interest inserts now wrap in
db.transaction so a partial failure can't leave an orphan client
row without its interest (or vice versa).

migration-transform: buildPlannedDocument returns null when there
are no signers so the apply pass doesn't try to send a Documenso
envelope without recipients. mapDocumentStatus gets an explicit
"Awaiting Further Details" branch that no longer auto-promotes via
stale sign-time fields. parseFlexibleDate handles ISO and DD-MM-YYYY
inputs uniformly.

backfill-legacy-lead-source: chunk UPDATE WHERE clause now
isNull(source) on top of the inArray match, so a re-run can't
overwrite a more accurate source written between batches.

Adds 235 lines of vitest coverage on migration-transform.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:56:18 +02:00
Matt Ciaccio
18e5c124b0 feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.

Schema additions
================

- `client_merge_candidates` — pairs flagged by the background scoring
  job for the /admin/duplicates review queue. Status enum: pending /
  dismissed / merged. Unique-(portId, clientAId, clientBId) so the
  same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
  rows (NocoDB Interest #624 → new client UUID) so re-running --apply
  against the same dry-run report skips already-imported entities.

Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.

Library
=======

src/lib/dedup/nocodb-source.ts
  Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
  auto-paginates until isLastPage, captures the table IDs from the
  2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
  parallel into one in-memory object the transform layer consumes.

src/lib/dedup/migration-transform.ts
  Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
    - normalizes name / email / phone / country via the dedup library
    - parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
    - maps the 8-stage `Sales Process Level` enum to the new 9-stage
      pipelineStage
    - filters yacht-name placeholders ('TBC', 'Na', etc.)
    - merges Internal Notes + Extra Comments + Berth Size Desired into
      a single notes blob
  Then runs `findClientMatches` pairwise (with blocking) and
  union-finds clusters of rows whose score crosses the auto-link
  threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
  Each cluster's "lead" row is picked by completeness score with
  recency tie-break.

src/lib/dedup/migration-report.ts
  Writes three artifacts to .migration/<timestamp>/:
    - report.csv  — one row per planned op, RFC-4180 escaped
    - summary.md  — human-skimmable overview
    - plan.json   — full structured plan for the --apply phase
  CSV cells with comma / quote / newline are quoted; internal quotes
  are doubled. No external CSV dep.

src/lib/dedup/phone-parse.ts
  Script-safe wrapper around libphonenumber-js's `core` entry that
  loads `metadata.min.json` directly. The default `index.cjs.js`
  bundled by libphonenumber hits a metadata-shape interop bug under
  Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
  The dedup `normalizePhone` and `find-matches` both use this wrapper
  now so the same code path runs in vitest, Next.js, and the migration
  CLI without surprises.

src/lib/dedup/normalize.ts
  Tightened country resolution: added Caribbean short-form aliases
  ('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
  US locations seen in the NocoDB dump (Boston, Tampa, Fort
  Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
  to drop the `isValid()` strict check — the libphonenumber min build
  rejects many real NANP-territory numbers, and dedup only needs a
  canonical E.164 to compare.

CLI
===

scripts/migrate-from-nocodb.ts
  pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
    → Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
       runs the transform, writes report. No DB writes.
  pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
    → Stubbed; exits with `not yet implemented` and a pointer to the
       design doc. Apply phase ships in a follow-up.

Tests
=====

tests/unit/dedup/migration-transform.test.ts (7 cases)
  Fixture-based regression. A frozen 12-row NocoDB snapshot covers
  every duplicate pattern in the design (§1.2). The test asserts:
    - 12 input rows → 7 unique clients (cluster math is right)
    - Patterns A / B / C / E auto-link
    - Pattern F (Etiennette Clamouze) does NOT auto-link
    - Every interest preserved as its own row even when clients merge
    - 8-stage → 9-stage enum mapping is correct per spec
    - Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
      client) — the design's signature win
    - Output is deterministic (run twice, identical)

Validation against real data
============================

Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
  - 237 clients (15 merged into 13 clusters)
  - 252 interests (one per source row)
  - 406 contacts, 52 addresses
  - 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
  - 3 pairs flagged for review (Camazou, Zasso, one new)
  - 1 phone placeholder flagged

Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
Matt Ciaccio
8b077e1999 feat(dedup): normalization + match-finding library (P1)
The pure-logic spine of the client deduplication system spec'd in
docs/superpowers/specs/2026-05-03-dedup-and-migration-design.md.
Two modules, JSX-free, vitest-tested against fixtures drawn directly
from real dirty values observed in the legacy NocoDB Interests audit.

src/lib/dedup/normalize.ts
- normalizeName: trims whitespace, replaces \r/\n/\t, intelligently
  title-cases ALL-CAPS surnames while keeping particles (van / de /
  dalla / etc.) lowercase mid-name. Preserves Irish O' surnames and
  the "slash-with-company" structure ("Daniel Wainstein / 7 Knots,
  LLC") seen in production. Returns a surnameToken (lowercased last
  non-particle token) for use as a dedup blocking key.
- normalizeEmail: trim + lowercase + zod email validation. Plus-aliases
  preserved; null on invalid.
- normalizePhone: pre-cleans the input (strips spreadsheet apostrophes,
  carriage returns, dots/dashes/parens, converts 00 prefix to +) then
  delegates to libphonenumber-js. Detects multi-number fields ("a/b",
  "a;b") and placeholder fakes (8+ consecutive zeros, e.g.
  +447000000000). Flags every quirk so the migration report and runtime
  audit log can surface it.
- resolveCountry: maps free-text country/region input to ISO-3166-1
  alpha-2 via alias → exact (vs. Intl-derived names) → city → fuzzy
  (Levenshtein ≤ 2). Fuzzy is gated by length so 4-char inputs ("Mars")
  don't false-positive against short country names.
- levenshtein: standard iterative implementation, exported for reuse
  by find-matches.

src/lib/dedup/find-matches.ts
- findClientMatches: builds three blocking indexes off the pool (email
  / phone / surname-token), gathers the comparison set via union, and
  scores each candidate via the rule set in design §4.2:
    Email match            +60
    Phone E.164 match      +50  (≥ 8 digits, excludes placeholder zeros)
    Name exact match       +20
    Surname + given fuzzy  +15  (Levenshtein ≤ 1)
    Negative: shared email but different phone country  −15
    Negative: name match but no shared contact          −20
  Score is clamped to [0,100]. Confidence tier ('high' / 'medium' /
  'low') is derived from configurable thresholds passed in by the
  caller — defaults are highScore=90, mediumScore=50.

tests/unit/dedup/normalize.test.ts (38 cases)
Every dirty-data pattern from design §1.3 has a fixture: carriage
returns in names, ALL-CAPS surnames, lowercase entries, particles,
slash-with-company, plus-aliases, capitalized email localparts,
spreadsheet-apostrophe phones, multi-number phones, placeholder
phones, 00-prefix phones, French/UK local-format phones,
Saint-Barthélemy diacritic variants, Kansas City fallback.

tests/unit/dedup/find-matches.test.ts (12 cases)
Each duplicate cluster from design §1.2 has a test:
- Pattern A (Deepak Ramchandani — pure double-submit) → high
- Pattern B (Howard Wiarda — phone format variance) → high
- Pattern C (Nicolas Ruiz — name capitalization) → high
- Pattern D (Chris/Christopher Allen — name shortening) → high
- Pattern E (Christopher Camazou — typo on resubmit) → high or medium
- Pattern E (Constanzo/Costanzo — surname typo, multi-yacht) → high
- Pattern F (Etiennette Clamouze — same name, different country) →
  must NOT auto-merge
- Pattern F (Bruno+Bruce — shared household contact) → no match
- Negative evidence (same email, different phone country) → medium
- Blocking (no shared keys → 0 matches)
- Sort order (high before low)
- Empty pool

Total: 50 new tests, all green. Zero changes to runtime behavior or
schema; unblocks P2 (runtime surfaces) and P3 (NocoDB migration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:28:59 +02:00