Observability & Release Gates

Overview

This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Every phase of the implementation must pass its corresponding release gate before proceeding.


Competition Lifecycle Metrics

Submission Metrics

| Metric | Type | Description |
| --- | --- | --- |
| projects_submitted_total | Counter | Total projects submitted, by category and round |
| projects_submitted_late | Counter | Projects submitted after deadline (FLAG policy) |
| submission_window_utilization | Gauge | % of deadline elapsed vs submissions received |
| file_upload_duration_seconds | Histogram | Time to upload a submission file |
| file_upload_size_bytes | Histogram | Size distribution of uploaded files |

Filtering Metrics

| Metric | Type | Description |
| --- | --- | --- |
| eligibility_pass_rate | Gauge | % of projects passing AI filtering |
| eligibility_manual_review_count | Counter | Projects flagged for manual review |
| eligibility_override_count | Counter | Admin overrides of AI decisions |
| ai_filtering_duration_seconds | Histogram | Time for AI to process one project |

Assignment Metrics

| Metric | Type | Description |
| --- | --- | --- |
| assignment_coverage_percent | Gauge | % of projects assigned to at least one judge |
| assignment_unassigned_queue_size | Gauge | Projects in unassigned queue |
| assignment_cap_utilization | Gauge | Average % of cap used per judge |
| assignment_exception_count | Counter | Over-cap manual assignments |
| assignment_coi_skip_count | Counter | Assignments skipped due to COI |
| assignment_duration_seconds | Histogram | Time to run assignment algorithm |
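
As an illustration of how the three assignment gauges could be derived, here is a minimal sketch. The input shapes (`JudgeLoad`, a map of judge counts per project) are hypothetical and not taken from the actual schema, and skipping judges with a cap of 0 when averaging is an assumption made here to avoid dividing by zero.

```typescript
// Hypothetical input shape for illustration; not the actual Prisma model.
type JudgeLoad = { judgeId: string; assigned: number; cap: number }

// assignment_coverage_percent: share of projects assigned to at least one judge.
function assignmentCoveragePercent(judgeCountByProject: Record<string, number>): number {
  const projectIds = Object.keys(judgeCountByProject)
  if (projectIds.length === 0) return 0
  const covered = projectIds.filter((id) => (judgeCountByProject[id] ?? 0) > 0).length
  return (covered / projectIds.length) * 100
}

// assignment_unassigned_queue_size: projects with no judge at all.
function unassignedQueueSize(judgeCountByProject: Record<string, number>): number {
  return Object.keys(judgeCountByProject).filter((id) => (judgeCountByProject[id] ?? 0) === 0).length
}

// assignment_cap_utilization: mean of per-judge (assigned / cap), as a percent.
// Assumption: judges with a cap of 0 are skipped so the average stays defined.
function assignmentCapUtilization(loads: JudgeLoad[]): number {
  const capped = loads.filter((l) => l.cap > 0)
  if (capped.length === 0) return 0
  const total = capped.reduce((sum, l) => sum + (l.assigned / l.cap) * 100, 0)
  return total / capped.length
}
```

With four projects of which one has no judge, coverage is 75 and the unassigned queue size is 1.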

Evaluation Metrics

| Metric | Type | Description |
| --- | --- | --- |
| evaluations_completed_total | Counter | Scores submitted, by jury group |
| evaluation_completion_rate | Gauge | % of assigned evaluations completed |
| ai_shortlist_generated | Counter | AI shortlists generated per round |
| admin_shortlist_override_count | Counter | Admin overrides of AI shortlist |

Deliberation Metrics

| Metric | Type | Description |
| --- | --- | --- |
| deliberation_session_count | Counter | Sessions created, by mode |
| deliberation_duration_seconds | Histogram | Time from VOTING to LOCKED |
| deliberation_runoff_count | Counter | Runoff rounds triggered |
| deliberation_admin_override_count | Counter | Admin overrides of deliberation result |
| result_lock_count | Counter | Results locked |
| result_unlock_count | Counter | Results unlocked (should be rare) |

Live Finals Metrics

| Metric | Type | Description |
| --- | --- | --- |
| live_voting_concurrent_users | Gauge | Concurrent jury + audience users during voting |
| live_vote_submission_duration_ms | Histogram | Time to submit a live vote |
| audience_vote_total | Counter | Total audience votes submitted |
| stage_manager_cursor_advances | Counter | Project cursor advances by admin |

Mentoring Metrics

| Metric | Type | Description |
| --- | --- | --- |
| mentor_messages_sent | Counter | Messages sent in mentor workspaces |
| mentor_files_uploaded | Counter | Files uploaded to mentor workspaces |
| mentor_file_promotions | Counter | Files promoted to official submissions |
| mentor_assignment_coverage | Gauge | % of mentoring-requesting teams with assigned mentors |

Reliability Metrics

| Metric | Type | Description |
| --- | --- | --- |
| api_request_duration_seconds | Histogram | tRPC procedure latency, by router and procedure |
| api_error_rate | Gauge | % of requests returning errors, by router |
| job_success_total | Counter | Background job completions |
| job_failure_total | Counter | Background job failures |
| email_send_total | Counter | Emails sent (reminders, invitations, notifications) |
| email_send_failure_total | Counter | Failed email sends |
| db_query_duration_seconds | Histogram | Prisma query latency |

Audit Event Metrics

All admin override actions emit DecisionAuditLog records. These metrics track audit completeness and quality:

| Metric | Type | Description |
| --- | --- | --- |
| audit_events_total | Counter | Total audit log entries, by actionType |
| audit_events_with_diff | Counter | Audit entries that include both beforeState and afterState in detailsJson |
| audit_coverage_percent | Gauge | % of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record |
| audit_missing_diff_count | Counter | Override actions where detailsJson is missing before/after state (should be 0) |
| audit_correlation_coverage | Gauge | % of audit entries that include full correlation context (competitionId, roundId, projectId) |

Audit Completeness Validation

During each phase gate review, run an automated check that verifies:

  1. Every admin override has an audit record: query DecisionAuditLog for each override type and verify that the count matches the operation count
  2. Before/after state is present: for all OVERRIDE-type actions, detailsJson must contain a { before: {...}, after: {...} } structure
  3. Correlation is complete: every audit entry references at minimum userId, competitionId, and timestamp
  4. No orphaned operations: cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against the audit log
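
Checks 1 to 3 above could be automated along the following lines. This is a minimal sketch over in-memory records, not the project's actual validation script: the `AuditEntry` shape and the `ELIGIBILITY_OVERRIDE` action name used in the usage note are hypothetical, and a real check would load the rows from DecisionAuditLog first.

```typescript
// Hypothetical in-memory shape; the real DecisionAuditLog model may differ.
type AuditEntry = {
  actionType: string
  userId?: string
  competitionId?: string
  timestamp?: string
  detailsJson: unknown
}

function isObject(value: unknown): boolean {
  return typeof value === 'object' && value !== null
}

// Check 2: detailsJson carries both a `before` and an `after` object.
function hasBeforeAfter(details: unknown): boolean {
  if (!isObject(details)) return false
  const d = details as Record<string, unknown>
  return isObject(d['before']) && isObject(d['after'])
}

// Check 3: the minimum correlation fields are present.
function hasMinimumCorrelation(entry: AuditEntry): boolean {
  return Boolean(entry.userId && entry.competitionId && entry.timestamp)
}

// Returns human-readable findings; an empty array means all checks passed.
function validateAuditEntries(entries: AuditEntry[], expectedOverrideCount: number): string[] {
  const findings: string[] = []
  // Check 1: the audit record count matches the operation count.
  if (entries.length !== expectedOverrideCount) {
    findings.push(`expected ${expectedOverrideCount} audit records, found ${entries.length}`)
  }
  entries.forEach((entry, i) => {
    if (!hasBeforeAfter(entry.detailsJson)) {
      findings.push(`entry ${i} (${entry.actionType}) is missing before/after state`)
    }
    if (!hasMinimumCorrelation(entry)) {
      findings.push(`entry ${i} (${entry.actionType}) is missing userId, competitionId, or timestamp`)
    }
  })
  return findings
}
```

A gate review could fail whenever `validateAuditEntries(entries, operationCount)` returns a non-empty list, and attach the findings to the review notes.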

Audit Event Correlation

For tracing admin actions across the system, audit log entries should include these correlation fields when available:

type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string      // deliberation session
  assignmentIntentId?: string  // when overriding an intent
}

This enables queries like: "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."
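
The first query above can be sketched as an in-memory filter. In production this would more likely be a database query; the `CorrelatedEntry` shape, the `findByCorrelation` helper, and the `RoundType` placeholder are assumptions introduced only for this illustration.

```typescript
type RoundType = string // placeholder; the real enum is defined in the project schema

type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string
  assignmentIntentId?: string
}

// Hypothetical entry shape for illustration.
type CorrelatedEntry = { action: string; createdAt: Date; correlation: AuditCorrelation }

// Returns the entries whose correlation matches every field set in `filter`,
// sorted oldest first. Fields left out of the filter are not constrained.
function findByCorrelation(
  entries: CorrelatedEntry[],
  filter: Partial<AuditCorrelation>
): CorrelatedEntry[] {
  const keys = Object.keys(filter) as Array<keyof AuditCorrelation>
  return entries
    .filter((e) => keys.every((k) => filter[k] === undefined || e.correlation[k] === filter[k]))
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime())
}
```

Calling `findByCorrelation(entries, { competitionId, roundId })` covers the per-round query, and passing only `projectId` covers the per-project one.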


Structured Logging

All log entries must include correlation IDs for tracing:

type LogContext = {
  // Request context (requestId is always present)
  requestId: string
  userId?: string
  userRole?: string

  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType

  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string  // deliberation session

  // Operation context
  operation: string   // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}
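
One way to emit entries in this shape is a single JSON object per line. The sketch below is an assumption about the output format, not the project's logger: `formatLogLine`, `LogLevel`, and the `RoundType` placeholder are hypothetical names, and the `LogContext` type is repeated only so the example is self-contained.

```typescript
type RoundType = string // placeholder; the real enum is defined in the project schema

type LogContext = {
  requestId: string
  userId?: string
  userRole?: string
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string
  operation: string
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}

type LogLevel = 'INFO' | 'WARN' | 'ERROR'

// Serializes one log entry as a single JSON line. Optional context fields
// that are undefined are dropped by JSON.stringify, so they never appear
// as empty keys in the output.
function formatLogLine(
  level: LogLevel,
  event: string,
  ctx: LogContext,
  now: Date = new Date()
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...ctx })
}
```

The `now` parameter exists only so the timestamp can be fixed in tests; callers would normally omit it.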

Key Log Events

| Event | Level | When |
| --- | --- | --- |
| competition.created | INFO | New competition created |
| round.transitioned | INFO | Round status changed |
| assignment.completed | INFO | Assignment algorithm finished |
| assignment.unassigned | WARN | Projects remain unassigned |
| assignment.exception | WARN | Over-cap assignment made |
| evaluation.submitted | INFO | Judge submitted a score |
| deliberation.voteSubmitted | INFO | Juror submitted deliberation vote |
| deliberation.tieDetected | WARN | Tie detected in deliberation |
| deliberation.adminOverride | WARN | Admin overrode deliberation result |
| resultLock.locked | INFO | Result locked |
| resultLock.unlocked | WARN | Result unlocked (should be rare) |
| eligibility.override | WARN | Admin overrode AI eligibility decision |
| file.promoted | INFO | Mentor file promoted to submission |
| liveVoting.windowClosed | INFO | Live voting window closed |

Alert Definitions

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Unresolved manual queue | assignment_unassigned_queue_size > 0 for > 24h | WARNING | Notify admin to manually assign |
| Quorum mismatch | Deliberation participants < required quorum | CRITICAL | Block voting, notify admin |
| Result lock failure | ResultLock creation fails | CRITICAL | Retry, notify super-admin |
| Unlock by non-super-admin | ResultUnlock attempted by non-super-admin | CRITICAL | Block, log security event |
| AI filtering timeout | ai_filtering_duration_seconds > 60 | WARNING | Check OpenAI API, retry |
| Evaluation completion low | evaluation_completion_rate < 50% with < 24h remaining | WARNING | Send reminder emails |
| Live voting overload | live_voting_concurrent_users > 200 | WARNING | Monitor for performance degradation |
| Email delivery failure spike | email_send_failure_total increases > 10 in 1h | WARNING | Check SMTP connection |
| API error rate spike | api_error_rate > 5% for any router | CRITICAL | Investigate, potential rollback |
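
Four of the threshold-based alerts above can be expressed as plain predicates over a snapshot of current metric values. This is a sketch under assumptions: the `MetricsSnapshot` field names are hypothetical, and in practice these rules would more likely be configured in whatever alerting tool the team adopts rather than in application code.

```typescript
type Severity = 'WARNING' | 'CRITICAL'
type FiredAlert = { alert: string; severity: Severity }

// Hypothetical snapshot of current metric values; field names are illustrative.
type MetricsSnapshot = {
  unassignedQueueSize: number
  hoursQueueNonEmpty: number
  evaluationCompletionRate: number // percent, 0 to 100
  hoursUntilEvaluationDeadline: number
  liveVotingConcurrentUsers: number
  apiErrorRateByRouter: Record<string, number> // percent per router
}

// Applies the thresholds from the alert table and returns the alerts that fire.
function evaluateAlerts(m: MetricsSnapshot): FiredAlert[] {
  const fired: FiredAlert[] = []
  if (m.unassignedQueueSize > 0 && m.hoursQueueNonEmpty > 24) {
    fired.push({ alert: 'Unresolved manual queue', severity: 'WARNING' })
  }
  if (m.evaluationCompletionRate < 50 && m.hoursUntilEvaluationDeadline < 24) {
    fired.push({ alert: 'Evaluation completion low', severity: 'WARNING' })
  }
  if (m.liveVotingConcurrentUsers > 200) {
    fired.push({ alert: 'Live voting overload', severity: 'WARNING' })
  }
  const routersOverLimit = Object.keys(m.apiErrorRateByRouter).filter(
    (router) => (m.apiErrorRateByRouter[router] ?? 0) > 5
  )
  if (routersOverLimit.length > 0) {
    fired.push({ alert: 'API error rate spike', severity: 'CRITICAL' })
  }
  return fired
}
```

Keeping the rules as pure functions makes it straightforward to test each alert at Gate F without a live metrics backend.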

Release Gates

Each release gate must be passed before the corresponding implementation phase can proceed to production.

Gate A: Schema + Backfill Readiness

When: Before Phase 1 code reaches production

Criteria:

  • All Prisma migrations apply cleanly on staging database
  • Backfill script runs successfully on staging data
  • No existing table modified (new tables only)
  • Rollback tested: drop new tables, verify existing system works
  • Migration takes < 5 minutes on production-sized dataset

Gate B: Policy Engine Correctness

When: Before Phase 2 code reaches production

Criteria:

  • 5-layer policy precedence returns correct values for all combinations
  • Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
  • Category bias applied correctly in assignment
  • COI check prevents assignment to declared conflicts
  • Edge cases covered: null overrides, conflicting settings, zero cap
  • All policy unit tests pass (100% coverage on resolution logic)
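
The cap-mode and precedence criteria above lend themselves to small pure functions that unit tests can exercise directly. The sketch below is illustrative only: `resolvePolicy`, `canAssign`, and the `softBuffer` parameter are hypothetical names, and the assumption that layers are passed highest precedence first with null or undefined meaning "no override" is not confirmed by this document.

```typescript
type CapMode = 'HARD' | 'SOFT' | 'NONE'

// Resolves a setting from layers ordered highest precedence first (assumed).
// null or undefined means "no override at this layer". An explicit check is
// used instead of `||` so that a legitimate value of 0 (zero cap) still wins.
function resolvePolicy<T>(layers: Array<T | null | undefined>, fallback: T): T {
  for (const value of layers) {
    if (value !== null && value !== undefined) return value
  }
  return fallback
}

// Decides whether a judge can take one more assignment.
// HARD stops at the cap, SOFT allows `softBuffer` extra, NONE is unlimited.
function canAssign(mode: CapMode, currentCount: number, cap: number, softBuffer: number): boolean {
  if (mode === 'NONE') return true
  if (mode === 'SOFT') return currentCount < cap + softBuffer
  return currentCount < cap
}
```

For example, `resolvePolicy([null, undefined, 0, 15], 20)` resolves to 0 rather than falling through to 15, which is the zero-cap edge case listed in the criteria.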

Gate C: End-to-End Monaco Flow Simulation

When: Before Phase 5 code reaches production

Criteria:

  • Full 8-round flow completes on staging with realistic data
  • All round transitions work correctly
  • Multi-round document visibility correct for all roles
  • AI shortlist generated at end of evaluation rounds
  • Assignment algorithm handles 500+ projects across 30 judges
  • Result: finalists selected, no data inconsistencies

Gate D: Invite/Onboarding Stability Under Load

When: Before Phase 3 code reaches production

Criteria:

  • 50 concurrent invite acceptances process correctly
  • JuryGroupMember records created for all accepted invites
  • Onboarding self-service values saved correctly
  • AssignmentIntent records created for pre-assigned judges
  • No race conditions in concurrent onboarding

Gate E: Live Finals + Deliberation Integrity

When: Before Phase 6 code reaches production

Criteria:

  • Stage manager cursor advances correctly for all Jury 3 members
  • Live voting window enforcement (no votes outside window)
  • 50 concurrent audience votes processed without conflicts
  • Deliberation SINGLE_WINNER_VOTE tallying correct
  • Deliberation FULL_RANKING Borda count correct
  • Tie-breaking: RUNOFF creates new vote round
  • Tie-breaking: ADMIN_DECIDES records decision
  • ResultLock creates immutable snapshot
  • ResultUnlock blocked for non-super-admin
  • Audience vote totals correctly shown to Jury 3

Gate F: Operational Readiness

When: Before Phase 8 (cutover) begins

Criteria:

  • All metrics listed above are being collected and visible in dashboard
  • All alerts listed above are configured and tested
  • Structured logging includes all correlation IDs
  • Runbook exists for common operational scenarios
  • On-call team briefed on new system components
  • Backup and restore tested for new tables
  • Performance baseline established (API latency, DB query times)

Audit Targets and Alert Thresholds

Track audit completeness and integrity across the redesigned system against the following targets and alert thresholds:

| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Audit coverage (admin overrides with audit record) | 100% | Any override without audit record |
| Before/after state completeness | 100% of override actions | Missing before or after in details JSON |
| Assignment intent fulfillment rate | > 80% | < 50% intents HONORED per round |
| Result lock integrity | 0 unauthorized unlocks | Any unlock without super-admin role |
| File replacement provenance chain | 100% linked | Any replacement without replacedById |

Audit Correlation Tracking:

  • Every admin override must include correlationId linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
  • Round transition events must correlate with the DecisionAuditLog entries for that round
  • Result lock/unlock events must reference the DeliberationSession.id they apply to

Sign-Off Criteria

Final sign-off requires:

  1. All 6 release gates passed (A through F)
  2. 72-hour burn-in period with zero critical errors
  3. All test matrices passed (see 11-testing-and-qa.md)
  4. Documentation updated (CLAUDE.md, API docs, admin guide)
  5. Architecture owner sign-off on schema finality
  6. Product owner sign-off on feature completeness