# Observability & Release Gates

## Overview

This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Every phase of the implementation must pass its corresponding release gate before proceeding.

---

## Competition Lifecycle Metrics

### Submission Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `projects_submitted_total` | Counter | Total projects submitted, by category and round |
| `projects_submitted_late` | Counter | Projects submitted after deadline (FLAG policy) |
| `submission_window_utilization` | Gauge | % of deadline elapsed vs submissions received |
| `file_upload_duration_seconds` | Histogram | Time to upload a submission file |
| `file_upload_size_bytes` | Histogram | Size distribution of uploaded files |

### Filtering Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `eligibility_pass_rate` | Gauge | % of projects passing AI filtering |
| `eligibility_manual_review_count` | Counter | Projects flagged for manual review |
| `eligibility_override_count` | Counter | Admin overrides of AI decisions |
| `ai_filtering_duration_seconds` | Histogram | Time for AI to process one project |

### Assignment Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `assignment_coverage_percent` | Gauge | % of projects assigned to at least one judge |
| `assignment_unassigned_queue_size` | Gauge | Projects in unassigned queue |
| `assignment_cap_utilization` | Gauge | Average % of cap used per judge |
| `assignment_exception_count` | Counter | Over-cap manual assignments |
| `assignment_coi_skip_count` | Counter | Assignments skipped due to COI |
| `assignment_duration_seconds` | Histogram | Time to run assignment algorithm |

### Evaluation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `evaluations_completed_total` | Counter | Scores submitted, by jury group |
| `evaluation_completion_rate` | Gauge | % of assigned evaluations completed |
| `ai_shortlist_generated` | Counter | AI shortlists generated per round |
| `admin_shortlist_override_count` | Counter | Admin overrides of AI shortlist |

### Deliberation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `deliberation_session_count` | Counter | Sessions created, by mode |
| `deliberation_duration_seconds` | Histogram | Time from VOTING to LOCKED |
| `deliberation_runoff_count` | Counter | Runoff rounds triggered |
| `deliberation_admin_override_count` | Counter | Admin overrides of deliberation result |
| `result_lock_count` | Counter | Results locked |
| `result_unlock_count` | Counter | Results unlocked (should be rare) |

### Live Finals Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `live_voting_concurrent_users` | Gauge | Concurrent jury + audience users during voting |
| `live_vote_submission_duration_ms` | Histogram | Time to submit a live vote |
| `audience_vote_total` | Counter | Total audience votes submitted |
| `stage_manager_cursor_advances` | Counter | Project cursor advances by admin |

### Mentoring Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `mentor_messages_sent` | Counter | Messages sent in mentor workspaces |
| `mentor_files_uploaded` | Counter | Files uploaded to mentor workspaces |
| `mentor_file_promotions` | Counter | Files promoted to official submissions |
| `mentor_assignment_coverage` | Gauge | % of mentoring-requesting teams with assigned mentors |

---

## Reliability Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `api_request_duration_seconds` | Histogram | tRPC procedure latency, by router and procedure |
| `api_error_rate` | Gauge | % of requests returning errors, by router |
| `job_success_total` | Counter | Background job completions |
| `job_failure_total` | Counter | Background job failures |
| `email_send_total` | Counter | Emails sent (reminders, invitations, notifications) |
| `email_send_failure_total` | Counter | Failed email sends |
| `db_query_duration_seconds` | Histogram | Prisma query latency |

---

## Audit Event Metrics

All admin override actions emit `DecisionAuditLog` records. These metrics track audit completeness and quality:

| Metric | Type | Description |
|--------|------|-------------|
| `audit_events_total` | Counter | Total audit log entries, by `actionType` |
| `audit_events_with_diff` | Counter | Audit entries that include both `beforeState` and `afterState` in `detailsJson` |
| `audit_coverage_percent` | Gauge | % of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record |
| `audit_missing_diff_count` | Counter | Override actions where `detailsJson` is missing before/after state (should be 0) |
| `audit_correlation_coverage` | Gauge | % of audit entries that include full correlation context (`competitionId`, `roundId`, `projectId`) |

### Audit Completeness Validation

During each phase gate review, run an automated check that verifies:

1. **Every admin override has an audit record**: query `DecisionAuditLog` for each override type and verify that the count matches the operation count
2. **Before/after state present**: for all OVERRIDE-type actions, `detailsJson` must contain a `{ before: {...}, after: {...} }` structure
3. **Correlation completeness**: every audit entry references at minimum `userId`, `competitionId`, and `timestamp`
4. **No orphaned operations**: cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against the audit log

### Audit Event Correlation

For tracing admin actions across the system, audit log entries should include these correlation fields when available:

```typescript
type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string            // deliberation session
  assignmentIntentId?: string   // when overriding an intent
}
```

This enables queries like "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."

---

## Structured Logging

All log entries must include correlation IDs for tracing:

```typescript
type LogContext = {
  // Always present
  requestId: string
  userId?: string
  userRole?: string

  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType

  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string  // deliberation session

  // Operation context
  operation: string  // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}
```

### Key Log Events

| Event | Level | When |
|-------|-------|------|
| `competition.created` | INFO | New competition created |
| `round.transitioned` | INFO | Round status changed |
| `assignment.completed` | INFO | Assignment algorithm finished |
| `assignment.unassigned` | WARN | Projects remain unassigned |
| `assignment.exception` | WARN | Over-cap assignment made |
| `evaluation.submitted` | INFO | Judge submitted a score |
| `deliberation.voteSubmitted` | INFO | Juror submitted deliberation vote |
| `deliberation.tieDetected` | WARN | Tie detected in deliberation |
| `deliberation.adminOverride` | WARN | Admin overrode deliberation result |
| `resultLock.locked` | INFO | Result locked |
| `resultLock.unlocked` | WARN | Result unlocked (should be rare) |
| `eligibility.override` | WARN | Admin overrode AI eligibility decision |
| `file.promoted` | INFO | Mentor file promoted to submission |
| `liveVoting.windowClosed` | INFO | Live voting window closed |

---

## Alert Definitions

| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| **Unresolved manual queue** | `assignment_unassigned_queue_size > 0` for > 24h | WARNING | Notify admin to manually assign |
| **Quorum mismatch** | Deliberation participants < required quorum | CRITICAL | Block voting, notify admin |
| **Result lock failure** | ResultLock creation fails | CRITICAL | Retry, notify super-admin |
| **Unlock by non-super-admin** | ResultUnlock attempted by non-super-admin | CRITICAL | Block, log security event |
| **AI filtering timeout** | `ai_filtering_duration_seconds > 60` | WARNING | Check OpenAI API, retry |
| **Evaluation completion low** | `evaluation_completion_rate < 50%` with < 24h remaining | WARNING | Send reminder emails |
| **Live voting overload** | `live_voting_concurrent_users > 200` | WARNING | Monitor for performance degradation |
| **Email delivery failure spike** | `email_send_failure_total` increases > 10 in 1h | WARNING | Check SMTP connection |
| **API error rate spike** | `api_error_rate > 5%` for any router | CRITICAL | Investigate, potential rollback |

---

## Release Gates

Each release gate must be passed before the corresponding implementation phase can proceed to production.
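
One way to enforce this rule in a deployment script is a lookup from phase number to required gate. The sketch below is illustrative only: the phase-to-gate table mirrors the **When** line of each gate in this section, while `GateId`, `GateStatus`, `GATE_FOR_PHASE`, and `canPromotePhase` are hypothetical names rather than existing project code. It also assumes that phases with no gate listed here (4 and 7) are not blocked by this check.

```typescript
// Hypothetical sketch: these names are illustrative, not existing project APIs.
type GateId = 'A' | 'B' | 'C' | 'D' | 'E' | 'F'

// Gate required before each phase, taken from the "When" line of each gate.
const GATE_FOR_PHASE: Partial<Record<number, GateId>> = {
  1: 'A', // Schema + Backfill Readiness
  2: 'B', // Policy Engine Correctness
  3: 'D', // Invite/Onboarding Stability Under Load
  5: 'C', // End-to-End Monaco Flow Simulation
  6: 'E', // Live Finals + Deliberation Integrity
  8: 'F', // Operational Readiness
}

// Which gates have been passed so far; a missing key means "not passed".
type GateStatus = Partial<Record<GateId, boolean>>

// Returns true when the phase has no gate (an assumption) or its gate has passed.
function canPromotePhase(phase: number, status: GateStatus): boolean {
  const gate = GATE_FOR_PHASE[phase]
  if (gate === undefined) return true
  return status[gate] === true
}
```

For example, `canPromotePhase(3, { C: true })` returns `false`, because Phase 3 depends on Gate D rather than Gate C.
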
### Gate A: Schema + Backfill Readiness

**When**: Before Phase 1 code reaches production

**Criteria**:
- [ ] All Prisma migrations apply cleanly on staging database
- [ ] Backfill script runs successfully on staging data
- [ ] No existing table modified (new tables only)
- [ ] Rollback tested: drop new tables, verify existing system works
- [ ] Migration takes < 5 minutes on production-sized dataset

---

### Gate B: Policy Engine Correctness

**When**: Before Phase 2 code reaches production

**Criteria**:
- [ ] 5-layer policy precedence returns correct values for all combinations
- [ ] Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
- [ ] Category bias applied correctly in assignment
- [ ] COI check prevents assignment to declared conflicts
- [ ] Edge cases: null overrides, conflicting settings, zero cap
- [ ] All policy unit tests pass (100% coverage on resolution logic)

---

### Gate C: End-to-End Monaco Flow Simulation

**When**: Before Phase 5 code reaches production

**Criteria**:
- [ ] Full 8-round flow completes on staging with realistic data
- [ ] All round transitions work correctly
- [ ] Multi-round document visibility correct for all roles
- [ ] AI shortlist generated at end of evaluation rounds
- [ ] Assignment algorithm handles 500+ projects across 30 judges
- [ ] Result: finalists selected, no data inconsistencies

---

### Gate D: Invite/Onboarding Stability Under Load

**When**: Before Phase 3 code reaches production

**Criteria**:
- [ ] 50 concurrent invite acceptances process correctly
- [ ] JuryGroupMember records created for all accepted invites
- [ ] Onboarding self-service values saved correctly
- [ ] AssignmentIntent records created for pre-assigned judges
- [ ] No race conditions in concurrent onboarding

---

### Gate E: Live Finals + Deliberation Integrity

**When**: Before Phase 6 code reaches production

**Criteria**:
- [ ] Stage manager cursor advances correctly for all Jury 3 members
- [ ] Live voting window enforcement (no votes outside window)
- [ ] 50 concurrent audience votes processed without conflicts
- [ ] Deliberation SINGLE_WINNER_VOTE tallying correct
- [ ] Deliberation FULL_RANKING Borda count correct
- [ ] Tie-breaking: RUNOFF creates new vote round
- [ ] Tie-breaking: ADMIN_DECIDES records decision
- [ ] ResultLock creates immutable snapshot
- [ ] ResultUnlock blocked for non-super-admin
- [ ] Audience vote totals correctly shown to Jury 3

---

### Gate F: Operational Readiness

**When**: Before Phase 8 (cutover) begins

**Criteria**:
- [ ] All metrics listed above are being collected and visible in dashboard
- [ ] All alerts listed above are configured and tested
- [ ] Structured logging includes all correlation IDs
- [ ] Runbook exists for common operational scenarios
- [ ] On-call team briefed on new system components
- [ ] Backup and restore tested for new tables
- [ ] Performance baseline established (API latency, DB query times)

---

## Audit Event Metrics

Track audit completeness and integrity across the redesigned system:

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Audit coverage (admin overrides with audit record) | 100% | Any override without audit record |
| Before/after state completeness | 100% of override actions | Missing `before` or `after` in details JSON |
| Assignment intent fulfillment rate | > 80% | < 50% intents HONORED per round |
| Result lock integrity | 0 unauthorized unlocks | Any unlock without super-admin role |
| File replacement provenance chain | 100% linked | Any replacement without `replacedById` |

**Audit Correlation Tracking:**
- Every admin override must include `correlationId` linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
- Round transition events must correlate with the `DecisionAuditLog` entries for that round
- Result lock/unlock events must reference the `DeliberationSession.id` they apply to

---

## Sign-Off Criteria

Final sign-off requires:

1. **All 6 release gates passed** (A through F)
2. **72-hour burn-in period** with zero critical errors
3. **All test matrices passed** (see [11-testing-and-qa.md](./11-testing-and-qa.md))
4. **Documentation updated** (CLAUDE.md, API docs, admin guide)
5. **Architecture owner sign-off** on schema finality
6. **Product owner sign-off** on feature completeness
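
The six criteria above could be checked by a single function in a release script. This is a minimal sketch under stated assumptions: `SignOffInput`, its field names, and `signOffBlockers` are hypothetical, and the burn-in period is modeled simply as elapsed hours plus a count of critical errors observed during it.

```typescript
// Hypothetical sketch: types and field names are illustrative assumptions.
type GateId = 'A' | 'B' | 'C' | 'D' | 'E' | 'F'

const ALL_GATES: GateId[] = ['A', 'B', 'C', 'D', 'E', 'F']
const REQUIRED_BURN_IN_HOURS = 72

type SignOffInput = {
  gatesPassed: GateId[]
  burnInHours: number
  criticalErrorsDuringBurnIn: number
  testMatricesPassed: boolean
  documentationUpdated: boolean
  architectureOwnerApproved: boolean
  productOwnerApproved: boolean
}

// Returns the unmet criteria; an empty array means sign-off can proceed.
function signOffBlockers(input: SignOffInput): string[] {
  const blockers: string[] = []

  // Criterion 1: every gate A through F must have passed.
  const missing = ALL_GATES.filter((g) => input.gatesPassed.indexOf(g) === -1)
  if (missing.length > 0) blockers.push(`gates not passed: ${missing.join(', ')}`)

  // Criterion 2: 72-hour burn-in with zero critical errors.
  if (input.burnInHours < REQUIRED_BURN_IN_HOURS) blockers.push('burn-in shorter than 72 hours')
  if (input.criticalErrorsDuringBurnIn > 0) blockers.push('critical errors during burn-in')

  // Criteria 3 to 6: test matrices, documentation, and the two owner sign-offs.
  if (!input.testMatricesPassed) blockers.push('test matrices not passed')
  if (!input.documentationUpdated) blockers.push('documentation not updated')
  if (!input.architectureOwnerApproved) blockers.push('architecture owner sign-off missing')
  if (!input.productOwnerApproved) blockers.push('product owner sign-off missing')

  return blockers
}
```

Returning the list of blockers rather than a boolean lets the script report exactly which criteria are outstanding.
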