pn-new-crm/src/lib/dedup/migration-report.ts
Matt Ciaccio 18e5c124b0 feat(dedup): NocoDB migration script + tables (P3 dry-run)
Lands the one-shot migration pipeline from the legacy NocoDB Interests
base into the new client/interest schema. Dry-run mode is fully
operational: pulls the live snapshot, runs the dedup library, and
writes a CSV + Markdown report under .migration/<timestamp>/. The
--apply phase is stubbed for a follow-up PR per the design's P3
implementation sequence.

Schema additions
================

- `client_merge_candidates` — pairs flagged by the background scoring
  job for the /admin/duplicates review queue. Status enum: pending /
  dismissed / merged. Unique on (portId, clientAId, clientBId) so the
  same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system
  rows (NocoDB Interest #624 → new client UUID) so re-running --apply
  against the same dry-run report skips already-imported entities.

Both tables ship with the migration `0020_unusual_azazel.sql` —
already applied to the local dev DB during this commit's preparation.
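
To illustrate the idempotency ledger idea, here is a minimal in-memory sketch. It is not the real `migration_source_links` schema: the field names, the `SourceLinkLedger` class, and `importOnce` are all hypothetical, and the real table lives in the database rather than in a `Map`.

```typescript
// Hypothetical in-memory stand-in for the `migration_source_links` ledger.
// Field names are illustrative assumptions, not the real column names.
interface SourceLink {
  sourceSystem: string; // e.g. 'nocodb'
  sourceTable: string; // e.g. 'Interests'
  sourceId: string; // e.g. '624'
  targetId: string; // id of the row created in the new schema
}

class SourceLinkLedger {
  private readonly links = new Map<string, SourceLink>();

  // JSON-encode the tuple so separators inside ids cannot collide.
  private static key(system: string, table: string, id: string): string {
    return JSON.stringify([system, table, id]);
  }

  /** Returns the recorded target id, or runs `create` once and records it. */
  importOnce(system: string, table: string, id: string, create: () => string): string {
    const key = SourceLinkLedger.key(system, table, id);
    const existing = this.links.get(key);
    if (existing) return existing.targetId; // already imported: skip
    const targetId = create();
    this.links.set(key, { sourceSystem: system, sourceTable: table, sourceId: id, targetId });
    return targetId;
  }
}
```

Calling `importOnce` twice for the same source row runs `create` only the first time, which is the property a re-run of `--apply` would rely on.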

Library
=======

src/lib/dedup/nocodb-source.ts
  Read-only adapter for the legacy NocoDB v2 API. xc-token auth,
  auto-paginates until isLastPage, captures the table IDs from the
  2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in
  parallel into one in-memory object the transform layer consumes.
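
  The auto-pagination loop can be sketched roughly as follows. This is an
  assumption-laden illustration, not the adapter's code: `fetchAllRows` and
  `FetchPage` are hypothetical names, and the page shape keeps only the
  `list` and `pageInfo.isLastPage` fields. The page fetcher is injected so
  the loop can run without a network.

```typescript
// One page of results; only the fields the loop needs.
interface Page<T> {
  list: T[];
  pageInfo: { isLastPage: boolean };
}

// Hypothetical fetcher signature; a real one would send the xc-token header.
type FetchPage<T> = (offset: number, limit: number) => Promise<Page<T>>;

async function fetchAllRows<T>(fetchPage: FetchPage<T>, limit = 100): Promise<T[]> {
  const rows: T[] = [];
  let offset = 0;
  for (;;) {
    const page = await fetchPage(offset, limit);
    rows.push(...page.list);
    // Stop on the last page. The empty-list check guards against a
    // server that never sets the flag.
    if (page.pageInfo.isLastPage || page.list.length === 0) break;
    offset += page.list.length;
  }
  return rows;
}
```

  Pulling several tables in parallel would then amount to wrapping one such
  call per table in `Promise.all`.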

src/lib/dedup/migration-transform.ts
  Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
    - normalizes name / email / phone / country via the dedup library
    - parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
    - maps the 8-stage `Sales Process Level` enum to the new 9-stage
      pipelineStage
    - filters yacht-name placeholders ('TBC', 'Na', etc.)
    - merges Internal Notes + Extra Comments + Berth Size Desired into
      a single notes blob
  Then runs `findClientMatches` pairwise (with blocking) and
  union-finds clusters of rows whose score crosses the auto-link
  threshold (90). Lower-scoring pairs (50–89) become 'needs review'.
  Each cluster's "lead" row is picked by completeness score with
  recency tie-break.
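
  The clustering step described above can be sketched as a small
  union-find over scored pairs. The `ScoredPair` shape and `clusterPairs`
  are hypothetical; the real `findClientMatches` output and the lead-row
  selection (completeness, then recency) are not reproduced here. Only
  the two thresholds come from this commit message.

```typescript
// Hypothetical pair shape; the real match output may differ.
interface ScoredPair {
  a: number; // source row id
  b: number; // source row id
  score: number; // 0..100
}

const AUTO_LINK_THRESHOLD = 90; // scores at or above this are merged
const REVIEW_THRESHOLD = 50; // 50..89 becomes 'needs review'

function clusterPairs(
  rowIds: number[],
  pairs: ScoredPair[],
): { clusters: number[][]; needsReview: ScoredPair[] } {
  const parent = new Map<number, number>();
  for (const id of rowIds) parent.set(id, id);
  const parentOf = (x: number): number => parent.get(x) ?? x;

  const find = (x: number): number => {
    let root = x;
    while (parentOf(root) !== root) root = parentOf(root);
    // Path compression: point every node on the walk at the root.
    let cur = x;
    while (cur !== root) {
      const next = parentOf(cur);
      parent.set(cur, root);
      cur = next;
    }
    return root;
  };

  const needsReview: ScoredPair[] = [];
  for (const p of pairs) {
    if (p.score >= AUTO_LINK_THRESHOLD) {
      const ra = find(p.a);
      const rb = find(p.b);
      // The smaller id becomes the root so output is deterministic.
      if (ra !== rb) parent.set(Math.max(ra, rb), Math.min(ra, rb));
    } else if (p.score >= REVIEW_THRESHOLD) {
      needsReview.push(p);
    }
  }

  const byRoot = new Map<number, number[]>();
  for (const id of rowIds) {
    const root = find(id);
    byRoot.set(root, [...(byRoot.get(root) ?? []), id]);
  }
  return { clusters: [...byRoot.values()], needsReview };
}
```

  Because merging is transitive, rows 1 and 3 end up in one cluster when
  1-2 and 2-3 both score 90 or more, even if 1-3 was never compared.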

src/lib/dedup/migration-report.ts
  Writes three artifacts to .migration/<timestamp>/:
    - report.csv  — one row per planned op, RFC-4180 escaped
    - summary.md  — human-skimmable overview
    - plan.json   — full structured plan for the --apply phase
  CSV cells with comma / quote / newline are quoted; internal quotes
  are doubled. No external CSV dep.
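
  A short worked example of the quoting rule. `csvEscape` below repeats
  the helper from this file; the sample cells are made up for
  illustration.

```typescript
// Same rule as `csvEscape` in migration-report.ts: quote a cell only when it
// contains a comma, quote, CR, or LF, and double any internal quotes.
function csvEscape(s: string): string {
  if (/[",\n\r]/.test(s)) {
    return `"${s.replace(/"/g, '""')}"`;
  }
  return s;
}

// Made-up sample cells: plain, with a comma, with quotes.
const sampleCells = ['create_client', 'Smith, John', 'said "hi"'];
const sampleLine = sampleCells.map(csvEscape).join(',');

console.log(sampleLine);
// create_client,"Smith, John","said ""hi"""
```

  Plain cells pass through untouched, which keeps the report easy to diff
  across runs.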

src/lib/dedup/phone-parse.ts
  Script-safe wrapper around libphonenumber-js's `core` entry that
  loads `metadata.min.json` directly. The default `index.cjs.js`
  bundled by libphonenumber hits a metadata-shape interop bug under
  Node 25 + tsx (`{ default }` wrapping); core+JSON sidesteps it.
  The dedup `normalizePhone` and `find-matches` both use this wrapper
  now so the same code path runs in vitest, Next.js, and the migration
  CLI without surprises.

src/lib/dedup/normalize.ts
  Tightened country resolution: added Caribbean short-form aliases
  ('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the
  US locations seen in the NocoDB dump (Boston, Tampa, Fort
  Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing
  to drop the `isValid()` strict check — the libphonenumber min build
  rejects many real NANP-territory numbers, and dedup only needs a
  canonical E.164 to compare.
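
  A rough sketch of the alias and city lookup, containing only the entries
  named above (the real maps in `normalize.ts` are presumably larger).
  `resolveCountryIso` and its key normalization (lowercase, strip dots,
  collapse whitespace) are assumptions made for this example.

```typescript
// Only the aliases spelled out in this commit message.
const COUNTRY_ALIASES = new Map<string, string>([
  ['antigua', 'AG'],
  ['st kitts', 'KN'],
]);

// US locations named in this commit message.
const US_CITIES = new Set<string>([
  'boston',
  'tampa',
  'fort lauderdale',
  'port jefferson',
  'nantucket',
]);

// Hypothetical resolver: returns an ISO 3166-1 alpha-2 code or null.
function resolveCountryIso(raw: string): string | null {
  // Assumed normalization: lowercase, drop dots, collapse whitespace.
  const key = raw.trim().toLowerCase().replace(/\./g, '').replace(/\s+/g, ' ');
  const alias = COUNTRY_ALIASES.get(key);
  if (alias) return alias;
  if (US_CITIES.has(key)) return 'US';
  return null; // unresolved: left for the caller to flag
}
```

  Returning `null` rather than guessing lets the report mark the address
  for manual review.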

CLI
===

scripts/migrate-from-nocodb.ts
  pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
    → Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars),
       runs the transform, writes report. No DB writes.
  pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
    → Stubbed; exits with `not yet implemented` and a pointer to the
       design doc. Apply phase ships in a follow-up.
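
  One way the two modes could be told apart, shown as a dependency-free
  sketch. `parseMode` and the `Mode` type are hypothetical and are not
  taken from the script; only the flag names come from the usage lines
  above.

```typescript
type Mode = { kind: 'dry-run' } | { kind: 'apply'; reportDir: string };

// Hypothetical parser for the flags shown above.
function parseMode(argv: string[]): Mode {
  const dryRun = argv.includes('--dry-run');
  const apply = argv.includes('--apply');
  if (dryRun === apply) {
    throw new Error('Pass exactly one of --dry-run or --apply');
  }
  if (dryRun) return { kind: 'dry-run' };

  // --apply needs the directory of an earlier dry-run report.
  const i = argv.indexOf('--report');
  const reportDir: string | undefined = i >= 0 ? argv[i + 1] : undefined;
  if (!reportDir || reportDir.startsWith('--')) {
    throw new Error('--apply requires --report <dir>');
  }
  return { kind: 'apply', reportDir };
}
```

  Rejecting a bare `--apply` up front keeps the destructive path from
  running without a reviewed report.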

Tests
=====

tests/unit/dedup/migration-transform.test.ts (7 cases)
  Fixture-based regression. A frozen 12-row NocoDB snapshot covers
  every duplicate pattern in the design (§1.2). The test asserts:
    - 12 input rows → 7 unique clients (cluster math is right)
    - Patterns A / B / C / E auto-link
    - Pattern F (Etiennette Clamouze) does NOT auto-link
    - Every interest preserved as its own row even when clients merge
    - 8-stage → 9-stage enum mapping is correct per spec
    - Multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one
      client) — the design's signature win
    - Output is deterministic (run twice, identical)

Validation against real data
============================

Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the
live NocoDB. Result on 252 Interests rows:
  - 237 clients (15 duplicate rows merged into 13 clusters)
  - 252 interests (one per source row)
  - 406 contacts, 52 addresses
  - 13 auto-linked clusters (every confirmed cluster from §1.2 audit)
  - 3 pairs flagged for review (Camazou, Zasso, one new)
  - 1 phone placeholder flagged

Total dedup test count: 57 (50 from P1 + 7 fixture tests).
Lint: clean. Tsc: clean for new files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00

/**
 * Migration report writer — turns a `MigrationPlan` (from
 * `migration-transform.ts`) into a CSV + a human-readable Markdown
 * summary on disk under `.migration/<timestamp>/`.
 *
 * The CSV format is intentionally machine-friendly (one row per
 * planned operation) so it can be diffed across runs and inspected
 * by hand. The summary is designed for "open this in your editor and
 * eyeball it for 5 minutes before --apply."
 */
import { promises as fs } from 'node:fs';
import path from 'node:path';
import type { MigrationPlan } from './migration-transform';
// ─── Output directory ───────────────────────────────────────────────────────
export interface ReportPaths {
  rootDir: string;
  csvPath: string;
  summaryPath: string;
  planJsonPath: string;
}
/** Resolve report paths relative to the worktree root. The timestamped
 * directory is created lazily by `writeReport`. */
export function resolveReportPaths(
  rootDir: string,
  timestamp: string = new Date().toISOString().replace(/[:.]/g, '-'),
): ReportPaths {
  const dir = path.join(rootDir, '.migration', timestamp);
  return {
    rootDir: dir,
    csvPath: path.join(dir, 'report.csv'),
    summaryPath: path.join(dir, 'summary.md'),
    planJsonPath: path.join(dir, 'plan.json'),
  };
}
// ─── CSV row shape ──────────────────────────────────────────────────────────
interface CsvRow {
  // One of: create_client / create_contact / create_address / create_interest /
  // auto_link / needs_review / flag
  op: string;
  reason: string;
  source_id: string;
  target_table: string;
  target_value: string;
  confidence: string;
  manual_review: 'true' | 'false';
}
// Trivial CSV escape: quote any cell that contains comma / quote / newline,
// double up internal quotes per RFC 4180. No need for a dependency.
function csvEscape(s: string): string {
  if (/[",\n\r]/.test(s)) {
    return `"${s.replace(/"/g, '""')}"`;
  }
  return s;
}
function rowToCsvLine(r: CsvRow): string {
  return [
    r.op,
    r.reason,
    r.source_id,
    r.target_table,
    r.target_value,
    r.confidence,
    r.manual_review,
  ]
    .map(csvEscape)
    .join(',');
}
// ─── Build CSV ──────────────────────────────────────────────────────────────
export function buildCsv(plan: MigrationPlan): string {
  const lines: string[] = [];
  lines.push(
    [
      'op',
      'reason',
      'source_id',
      'target_table',
      'target_value',
      'confidence',
      'manual_review',
    ].join(','),
  );
  for (const client of plan.clients) {
    lines.push(
      rowToCsvLine({
        op: 'create_client',
        reason: client.sourceIds.length > 1 ? 'auto-merged cluster' : 'new',
        source_id: client.sourceIds.join('|'),
        target_table: 'clients.fullName',
        target_value: client.fullName,
        confidence: 'N/A',
        manual_review: 'false',
      }),
    );
    for (const c of client.contacts) {
      lines.push(
        rowToCsvLine({
          op: 'create_contact',
          reason: c.flagged ?? 'new',
          source_id: client.sourceIds.join('|'),
          target_table: `clientContacts.${c.channel}`,
          target_value: c.value,
          confidence: 'N/A',
          manual_review: c.flagged ? 'true' : 'false',
        }),
      );
    }
    for (const a of client.addresses) {
      lines.push(
        rowToCsvLine({
          op: 'create_address',
          reason: 'address text present',
          source_id: client.sourceIds.join('|'),
          target_table: 'clientAddresses.countryIso',
          target_value: a.countryIso ?? '(unresolved)',
          confidence: a.countryConfidence ?? 'fallback',
          manual_review: a.countryConfidence === 'fallback' || !a.countryIso ? 'true' : 'false',
        }),
      );
    }
  }
  for (const interest of plan.interests) {
    lines.push(
      rowToCsvLine({
        op: 'create_interest',
        reason: `pipelineStage=${interest.pipelineStage}`,
        source_id: String(interest.sourceId),
        target_table: 'interests',
        target_value: `${interest.berthMooringNumber ?? '(no berth)'} / ${interest.yachtName ?? '(no yacht)'}`,
        confidence: 'N/A',
        manual_review: 'false',
      }),
    );
  }
  for (const link of plan.autoLinks) {
    lines.push(
      rowToCsvLine({
        op: 'auto_link',
        reason: link.reasons.join(' + '),
        source_id: `${link.leadSourceId}<-${link.mergedSourceIds.join(',')}`,
        target_table: 'clients',
        target_value: '(merged into lead)',
        confidence: `score=${link.score}`,
        manual_review: 'false',
      }),
    );
  }
  for (const pair of plan.needsReview) {
    lines.push(
      rowToCsvLine({
        op: 'needs_review',
        reason: pair.reasons.join(' + '),
        source_id: `${pair.aSourceId}<->${pair.bSourceId}`,
        target_table: 'clients',
        target_value: '(human review required)',
        confidence: `score=${pair.score}`,
        manual_review: 'true',
      }),
    );
  }
  for (const flag of plan.flags) {
    lines.push(
      rowToCsvLine({
        op: 'flag',
        reason: flag.reason,
        source_id: String(flag.sourceId),
        target_table: flag.sourceTable,
        target_value: JSON.stringify(flag.details ?? {}),
        confidence: 'N/A',
        manual_review: 'true',
      }),
    );
  }
  return lines.join('\n') + '\n';
}
// ─── Build summary markdown ─────────────────────────────────────────────────
export function buildSummary(plan: MigrationPlan, generatedAt: string): string {
  const s = plan.stats;
  const lines: string[] = [];
  lines.push(`# Migration Dry-Run — ${generatedAt}`);
  lines.push('');
  lines.push('## Input');
  lines.push(`- ${s.inputInterestRows} NocoDB Interests`);
  lines.push(`- ${s.inputResidentialRows} NocoDB Residential Interests`);
  lines.push('');
  lines.push('## Outcome');
  lines.push(`- ${s.outputClients} clients`);
  lines.push(`- ${s.outputInterests} interests (one per source row, linked to deduped client)`);
  lines.push(`- ${s.outputContacts} client_contacts`);
  lines.push(`- ${s.outputAddresses} client_addresses`);
  lines.push('');
  lines.push('## Auto-linked clusters');
  if (plan.autoLinks.length === 0) {
    lines.push('_None — every input row maps to a unique client._');
  } else {
    for (const link of plan.autoLinks) {
      const merged = link.mergedSourceIds.length;
      lines.push(
        `- Lead row \`${link.leadSourceId}\` ← merged ${merged} other row${merged === 1 ? '' : 's'} (\`${link.mergedSourceIds.join(', ')}\`) — score ${link.score} via ${link.reasons.join(' + ')}`,
      );
    }
  }
  lines.push('');
  lines.push('## Pairs flagged for human review');
  if (plan.needsReview.length === 0) {
    lines.push('_None._');
  } else {
    for (const pair of plan.needsReview) {
      // Separate the two ids; back-to-back code spans do not render as two spans.
      lines.push(
        `- Rows \`${pair.aSourceId}\` ↔ \`${pair.bSourceId}\` — score ${pair.score} (${pair.reasons.join(' + ')})`,
      );
    }
  }
  lines.push('');
  lines.push('## Data quality flags');
  if (plan.flags.length === 0) {
    lines.push('_No quality issues._');
  } else {
    // Tally flags by reason, most frequent first.
    const byReason = new Map<string, number>();
    for (const f of plan.flags) {
      byReason.set(f.reason, (byReason.get(f.reason) ?? 0) + 1);
    }
    for (const [reason, count] of [...byReason].sort((a, b) => b[1] - a[1])) {
      lines.push(`- **${count}× ${reason}**`);
    }
    lines.push('');
    lines.push('### Detail');
    // Cap the detail list at 30 entries; the CSV carries the full list.
    for (const f of plan.flags.slice(0, 30)) {
      lines.push(
        `- \`${f.sourceTable}#${f.sourceId}\`: ${f.reason}${f.details ? ` \`${JSON.stringify(f.details)}\`` : ''}`,
      );
    }
    if (plan.flags.length > 30) {
      lines.push(`- _… and ${plan.flags.length - 30} more (see report.csv for full list)_`);
    }
  }
  lines.push('');
  lines.push('## Next step');
  lines.push('');
  lines.push('Eyeball the auto-linked + flagged-for-review pairs above.');
  lines.push('When satisfied, re-run the script with `--apply --report .migration/<this-dir>/`.');
  lines.push('Apply will refuse to run if the source NocoDB has changed since this dry-run.');
  return lines.join('\n') + '\n';
}
// ─── Write to disk ──────────────────────────────────────────────────────────
export async function writeReport(
  paths: ReportPaths,
  plan: MigrationPlan,
  generatedAt: string,
): Promise<void> {
  await fs.mkdir(paths.rootDir, { recursive: true });
  await fs.writeFile(paths.csvPath, buildCsv(plan), 'utf-8');
  await fs.writeFile(paths.summaryPath, buildSummary(plan, generatedAt), 'utf-8');
  await fs.writeFile(paths.planJsonPath, JSON.stringify(plan, null, 2), 'utf-8');
}