pn-new-crm/src/lib/dedup/migration-report.ts


feat(dedup): NocoDB migration script + tables (P3 dry-run)

Lands the one-shot migration pipeline from the legacy NocoDB Interests base into the new client/interest schema. Dry-run mode is fully operational: it pulls the live snapshot, runs the dedup library, and writes a CSV + Markdown report under .migration/<timestamp>/. The --apply phase is stubbed for a follow-up PR per the design's P3 implementation sequence.

Schema additions
================
- `client_merge_candidates` — pairs flagged by the background scoring job for the /admin/duplicates review queue. Status enum: pending / dismissed / merged. Unique (portId, clientAId, clientBId) so the same pair can't surface twice. Empty until P2 lands the cron.
- `migration_source_links` — idempotency ledger. Maps source-system rows (NocoDB Interest #624 → new client UUID) so re-running --apply against the same dry-run report skips already-imported entities.

Both tables ship with the migration `0020_unusual_azazel.sql` — already applied to the local dev DB during this commit's preparation.

Library
=======
src/lib/dedup/nocodb-source.ts
  Read-only adapter for the legacy NocoDB v2 API. xc-token auth, auto-paginates until isLastPage, captures the table IDs from the 2026-05-03 audit. `fetchSnapshot()` pulls every relevant table in parallel into one in-memory object the transform layer consumes.

src/lib/dedup/migration-transform.ts
  Pure function: NocoDB snapshot in, MigrationPlan out. Per row:
  - normalizes name / email / phone / country via the dedup library
  - parses the legacy DD-MM-YYYY / DD/MM/YYYY / ISO date formats
  - maps the 8-stage `Sales Process Level` enum to the new 9-stage pipelineStage
  - filters yacht-name placeholders ('TBC', 'Na', etc.)
  - merges Internal Notes + Extra Comments + Berth Size Desired into a single notes blob
  Then runs `findClientMatches` pairwise (with blocking) and union-finds clusters of rows whose score crosses the auto-link threshold (90). Lower-scoring pairs (50–89) become 'needs review'. Each cluster's "lead" row is picked by completeness score with a recency tie-break.

src/lib/dedup/migration-report.ts
  Writes three artifacts to .migration/<timestamp>/:
  - report.csv — one row per planned op, RFC-4180 escaped
  - summary.md — human-skimmable overview
  - plan.json — full structured plan for the --apply phase
  CSV cells containing a comma / quote / newline are quoted; internal quotes are doubled. No external CSV dep.

src/lib/dedup/phone-parse.ts
  Script-safe wrapper around libphonenumber-js's `core` entry that loads `metadata.min.json` directly. The default `index.cjs.js` bundled by libphonenumber hits a metadata-shape interop bug under Node 25 + tsx (`{ default }` wrapping); core + JSON sidesteps it. The dedup `normalizePhone` and `find-matches` both use this wrapper now, so the same code path runs in vitest, Next.js, and the migration CLI without surprises.

src/lib/dedup/normalize.ts
  Tightened country resolution: added Caribbean short-form aliases ('antigua' → AG, 'st kitts' → KN, etc.) and a city map covering the US locations seen in the NocoDB dump (Boston, Tampa, Fort Lauderdale, Port Jefferson, Nantucket). Also relaxed phone parsing to drop the `isValid()` strict check — the libphonenumber min build rejects many real NANP-territory numbers, and dedup only needs a canonical E.164 to compare.

CLI
===
scripts/migrate-from-nocodb.ts

pnpm tsx scripts/migrate-from-nocodb.ts --dry-run
  → Pulls the live NocoDB base (NOCODB_URL + NOCODB_TOKEN env vars), runs the transform, writes the report. No DB writes.

pnpm tsx scripts/migrate-from-nocodb.ts --apply --report .migration/<dir>/
  → Stubbed; exits with `not yet implemented` and a pointer to the design doc. The apply phase ships in a follow-up.

Tests
=====
tests/unit/dedup/migration-transform.test.ts (7 cases)
  Fixture-based regression. A frozen 12-row NocoDB snapshot covers every duplicate pattern in the design (§1.2). The test asserts:
  - 12 input rows → 7 unique clients (cluster math is right)
  - patterns A / B / C / E auto-link
  - pattern F (Etiennette Clamouze) does NOT auto-link
  - every interest is preserved as its own row even when clients merge
  - the 8-stage → 9-stage enum mapping is correct per spec
  - multi-yacht merge (Constanzo CALYPSO + Costanzo GEMINI under one client) — the design's signature win
  - output is deterministic (run twice, identical)

Validation against real data
============================
Ran `pnpm tsx scripts/migrate-from-nocodb.ts --dry-run` against the live NocoDB. Result on 252 Interests rows:
- 237 clients (15 merged into 13 clusters)
- 252 interests (one per source row)
- 406 contacts, 52 addresses
- 13 auto-linked clusters (every confirmed cluster from the §1.2 audit)
- 3 pairs flagged for review (Camazou, Zasso, one new)
- 1 phone placeholder flagged

Total dedup test count: 57 (50 from P1 + 7 fixture tests). Lint: clean. Tsc: clean for new files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:50:01 +02:00
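The clustering step described in the commit message above (pairwise scoring, then union-find over every pair at or above the auto-link threshold of 90) can be sketched as follows. This is a simplified illustration; `ScoredPair` and `clusterRows` are hypothetical names, not the actual `migration-transform.ts` internals:

```typescript
// Union-find clustering over scored pairs: any two rows whose pairwise
// dedup score reaches the auto-link threshold end up in the same cluster.
const AUTO_LINK_THRESHOLD = 90;

interface ScoredPair {
  a: number; // source row id
  b: number; // source row id
  score: number; // dedup similarity score, 0-100
}

function clusterRows(rowIds: number[], pairs: ScoredPair[]): number[][] {
  // Each row starts as its own singleton cluster.
  const parent = new Map<number, number>(
    rowIds.map((id) => [id, id] as [number, number]),
  );

  // Find with path compression.
  function find(x: number): number {
    let root = x;
    while (parent.get(root)! !== root) root = parent.get(root)!;
    while (parent.get(x)! !== x) {
      const next = parent.get(x)!;
      parent.set(x, root);
      x = next;
    }
    return root;
  }

  // Union only the pairs that cross the auto-link threshold.
  for (const p of pairs) {
    if (p.score >= AUTO_LINK_THRESHOLD) parent.set(find(p.a), find(p.b));
  }

  // Group rows by their root to materialize the clusters.
  const clusters = new Map<number, number[]>();
  for (const id of rowIds) {
    const root = find(id);
    const members = clusters.get(root) ?? [];
    members.push(id);
    clusters.set(root, members);
  }
  return [...clusters.values()];
}
```

Note that rows linked only transitively (A~B at 95, B~C at 92) still land in one cluster; that is the reason for union-find rather than plain pairwise thresholding.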
/**
 * Migration report writer: turns a `MigrationPlan` (from
 * `migration-transform.ts`) into a CSV + a human-readable Markdown
 * summary on disk under `.migration/<timestamp>/`.
 *
 * The CSV format is intentionally machine-friendly (one row per
 * planned operation) so it can be diffed across runs and inspected
 * by hand. The summary is designed for "open this in your editor and
 * eyeball it for 5 minutes before --apply."
 */
import { promises as fs } from 'node:fs';
import path from 'node:path';
import type { MigrationPlan } from './migration-transform';
// ─── Output directory ───────────────────────────────────────────────────────
export interface ReportPaths {
  rootDir: string;
  csvPath: string;
  summaryPath: string;
  planJsonPath: string;
}
/** Resolve report paths relative to the worktree root. The timestamped
 * directory is created lazily by `writeReport`. */
export function resolveReportPaths(
  rootDir: string,
  timestamp: string = new Date().toISOString().replace(/[:.]/g, '-'),
): ReportPaths {
  const dir = path.join(rootDir, '.migration', timestamp);
  return {
    rootDir: dir,
    csvPath: path.join(dir, 'report.csv'),
    summaryPath: path.join(dir, 'summary.md'),
    planJsonPath: path.join(dir, 'plan.json'),
  };
}
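// For example (hypothetical root and timestamp, POSIX paths):
//   resolveReportPaths('/repo', '2026-05-03T12-00-00-000Z')
//   → { rootDir:  '/repo/.migration/2026-05-03T12-00-00-000Z',
//       csvPath:  '/repo/.migration/2026-05-03T12-00-00-000Z/report.csv', … }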
// ─── CSV row shape ──────────────────────────────────────────────────────────
interface CsvRow {
  op: string; // create_client / create_contact / create_address / create_interest / auto_link / needs_review / flag
  reason: string;
  source_id: string;
  target_table: string;
  target_value: string;
  confidence: string;
  manual_review: 'true' | 'false';
}
// Trivial CSV escape: quote any cell that contains comma / quote / newline,
// double up internal quotes per RFC 4180. No need for a dependency.
function csvEscape(s: string): string {
  if (/[",\n\r]/.test(s)) {
    return `"${s.replace(/"/g, '""')}"`;
  }
  return s;
}
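// For example (illustrative inputs):
//   csvEscape('plain')  → 'plain'       (no special characters, left unquoted)
//   csvEscape('a,"b"')  → '"a,""b"""'   (quoted, inner quotes doubled per RFC 4180)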
function rowToCsvLine(r: CsvRow): string {
  return [
    r.op,
    r.reason,
    r.source_id,
    r.target_table,
    r.target_value,
    r.confidence,
    r.manual_review,
  ]
    .map(csvEscape)
    .join(',');
}
// ─── Build CSV ──────────────────────────────────────────────────────────────
export function buildCsv(plan: MigrationPlan): string {
  const lines: string[] = [];
  lines.push(
    [
      'op',
      'reason',
      'source_id',
      'target_table',
      'target_value',
      'confidence',
      'manual_review',
    ].join(','),
  );
  for (const client of plan.clients) {
    lines.push(
      rowToCsvLine({
        op: 'create_client',
        reason: client.sourceIds.length > 1 ? 'auto-merged cluster' : 'new',
        source_id: client.sourceIds.join('|'),
        target_table: 'clients.fullName',
        target_value: client.fullName,
        confidence: 'N/A',
        manual_review: 'false',
      }),
    );
    for (const c of client.contacts) {
      lines.push(
        rowToCsvLine({
          op: 'create_contact',
          reason: c.flagged ?? 'new',
          source_id: client.sourceIds.join('|'),
          target_table: `clientContacts.${c.channel}`,
          target_value: c.value,
          confidence: 'N/A',
          manual_review: c.flagged ? 'true' : 'false',
        }),
      );
    }
    for (const a of client.addresses) {
      lines.push(
        rowToCsvLine({
          op: 'create_address',
          reason: 'address text present',
          source_id: client.sourceIds.join('|'),
          target_table: 'clientAddresses.countryIso',
          target_value: a.countryIso ?? '(unresolved)',
          confidence: a.countryConfidence ?? 'fallback',
          manual_review: a.countryConfidence === 'fallback' || !a.countryIso ? 'true' : 'false',
        }),
      );
    }
  }
  for (const interest of plan.interests) {
    lines.push(
      rowToCsvLine({
        op: 'create_interest',
        reason: `pipelineStage=${interest.pipelineStage}`,
        source_id: String(interest.sourceId),
        target_table: 'interests',
        target_value: `${interest.berthMooringNumber ?? '(no berth)'} / ${interest.yachtName ?? '(no yacht)'}`,
        confidence: 'N/A',
        manual_review: 'false',
      }),
    );
  }
  for (const link of plan.autoLinks) {
    lines.push(
      rowToCsvLine({
        op: 'auto_link',
        reason: link.reasons.join(' + '),
        source_id: `${link.leadSourceId}<-${link.mergedSourceIds.join(',')}`,
        target_table: 'clients',
        target_value: '(merged into lead)',
        confidence: `score=${link.score}`,
        manual_review: 'false',
      }),
    );
  }
  for (const pair of plan.needsReview) {
    lines.push(
      rowToCsvLine({
        op: 'needs_review',
        reason: pair.reasons.join(' + '),
        source_id: `${pair.aSourceId}<->${pair.bSourceId}`,
        target_table: 'clients',
        target_value: '(human review required)',
        confidence: `score=${pair.score}`,
        manual_review: 'true',
      }),
    );
  }
  for (const flag of plan.flags) {
    lines.push(
      rowToCsvLine({
        op: 'flag',
        reason: flag.reason,
        source_id: String(flag.sourceId),
        target_table: flag.sourceTable,
        target_value: JSON.stringify(flag.details ?? {}),
        confidence: 'N/A',
        manual_review: 'true',
      }),
    );
  }
  return lines.join('\n') + '\n';
}
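// The emitted CSV starts with the header row; the sample rows below are
// illustrative only (real reasons, ids, and scores come from the plan):
//   op,reason,source_id,target_table,target_value,confidence,manual_review
//   create_client,new,624,clients.fullName,Jane Doe,N/A,false
//   needs_review,name similarity,101<->205,clients,(human review required),score=72,true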
// ─── Build summary markdown ─────────────────────────────────────────────────
export function buildSummary(plan: MigrationPlan, generatedAt: string): string {
  const s = plan.stats;
  const lines: string[] = [];
  lines.push(`# Migration Dry-Run — ${generatedAt}`);
  lines.push('');
  lines.push('## Input');
  lines.push(`- ${s.inputInterestRows} NocoDB Interests`);
  lines.push(`- ${s.inputResidentialRows} NocoDB Residential Interests`);
  lines.push('');
  lines.push('## Outcome');
  lines.push(`- ${s.outputClients} clients`);
  lines.push(`- ${s.outputInterests} interests (one per source row, linked to deduped client)`);
  lines.push(`- ${s.outputContacts} client_contacts`);
  lines.push(`- ${s.outputAddresses} client_addresses`);
  lines.push('');
  lines.push('## Auto-linked clusters');
  if (plan.autoLinks.length === 0) {
    lines.push('_None — every input row maps to a unique client._');
  } else {
    for (const link of plan.autoLinks) {
      const merged = link.mergedSourceIds.length;
      lines.push(
        `- Lead row \`${link.leadSourceId}\` ← merged ${merged} other row${merged === 1 ? '' : 's'} (\`${link.mergedSourceIds.join(', ')}\`) — score ${link.score} via ${link.reasons.join(' + ')}`,
      );
    }
  }
  lines.push('');
  lines.push('## Pairs flagged for human review');
  if (plan.needsReview.length === 0) {
    lines.push('_None._');
  } else {
    for (const pair of plan.needsReview) {
      lines.push(
        `- Rows \`${pair.aSourceId}\` ↔ \`${pair.bSourceId}\` — score ${pair.score} (${pair.reasons.join(' + ')})`,
      );
    }
  }
  lines.push('');
  lines.push('## Data quality flags');
  if (plan.flags.length === 0) {
    lines.push('_No quality issues._');
  } else {
    const byReason = new Map<string, number>();
    for (const f of plan.flags) {
      byReason.set(f.reason, (byReason.get(f.reason) ?? 0) + 1);
    }
    for (const [reason, count] of [...byReason].sort((a, b) => b[1] - a[1])) {
      lines.push(`- **${count}× ${reason}**`);
    }
    lines.push('');
    lines.push('### Detail');
    for (const f of plan.flags.slice(0, 30)) {
      lines.push(
        `- \`${f.sourceTable}#${f.sourceId}\`: ${f.reason}${f.details ? ` \`${JSON.stringify(f.details)}\`` : ''}`,
      );
    }
    if (plan.flags.length > 30) {
      lines.push(`- _… and ${plan.flags.length - 30} more (see report.csv for full list)_`);
    }
  }
  lines.push('');
  lines.push('## Next step');
  lines.push('');
  lines.push('Eyeball the auto-linked + flagged-for-review pairs above.');
  lines.push('When satisfied, re-run the script with `--apply --report .migration/<this-dir>/`.');
  lines.push('Apply will refuse to run if the source NocoDB has changed since this dry-run.');
  return lines.join('\n') + '\n';
}
// ─── Write to disk ──────────────────────────────────────────────────────────
export async function writeReport(
  paths: ReportPaths,
  plan: MigrationPlan,
  generatedAt: string,
): Promise<void> {
  await fs.mkdir(paths.rootDir, { recursive: true });
  await fs.writeFile(paths.csvPath, buildCsv(plan), 'utf-8');
  await fs.writeFile(paths.summaryPath, buildSummary(plan, generatedAt), 'utf-8');
  await fs.writeFile(paths.planJsonPath, JSON.stringify(plan, null, 2), 'utf-8');
}
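// Typical call site (sketch; the real wiring lives in scripts/migrate-from-nocodb.ts):
//   const paths = resolveReportPaths(process.cwd());
//   await writeReport(paths, plan, new Date().toISOString());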