Observability & Release Gates

Overview

This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Every phase of the implementation must pass its corresponding release gate before proceeding.

Competition Lifecycle Metrics

Submission Metrics

Metric	Type	Description
`projects_submitted_total`	Counter	Total projects submitted, by category and round
`projects_submitted_late`	Counter	Projects submitted after deadline (FLAG policy)
`submission_window_utilization`	Gauge	% of deadline elapsed vs submissions received
`file_upload_duration_seconds`	Histogram	Time to upload a submission file
`file_upload_size_bytes`	Histogram	Size distribution of uploaded files

Filtering Metrics

Metric	Type	Description
`eligibility_pass_rate`	Gauge	% of projects passing AI filtering
`eligibility_manual_review_count`	Counter	Projects flagged for manual review
`eligibility_override_count`	Counter	Admin overrides of AI decisions
`ai_filtering_duration_seconds`	Histogram	Time for AI to process one project

Assignment Metrics

Metric	Type	Description
`assignment_coverage_percent`	Gauge	% of projects assigned to at least one judge
`assignment_unassigned_queue_size`	Gauge	Projects in unassigned queue
`assignment_cap_utilization`	Gauge	Average % of cap used per judge
`assignment_exception_count`	Counter	Over-cap manual assignments
`assignment_coi_skip_count`	Counter	Assignments skipped due to COI
`assignment_duration_seconds`	Histogram	Time to run assignment algorithm
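
As an illustration of how the gauge values in this table could be derived, here is a minimal sketch. The `JudgeLoad` shape and all function names are hypothetical; the real assignment schema is not shown in this document.

```typescript
// Hypothetical input shape; the real assignment schema is not shown here.
type JudgeLoad = { judgeId: string; assigned: number; cap: number }

// assignment_coverage_percent: % of projects assigned to at least one judge.
// Assumption: an empty project list is reported as 100 (nothing left to cover).
function assignmentCoveragePercent(
  projectIds: string[],
  assignedProjectIds: Set<string>,
): number {
  if (projectIds.length === 0) return 100
  const covered = projectIds.filter((id) => assignedProjectIds.has(id)).length
  return (covered * 100) / projectIds.length
}

// assignment_unassigned_queue_size: projects that have no judge yet.
function unassignedQueueSize(
  projectIds: string[],
  assignedProjectIds: Set<string>,
): number {
  return projectIds.filter((id) => !assignedProjectIds.has(id)).length
}

// assignment_cap_utilization: average % of cap used per judge.
// Judges without a positive cap are skipped to avoid dividing by zero.
function assignmentCapUtilization(judges: JudgeLoad[]): number {
  const capped = judges.filter((j) => j.cap > 0)
  if (capped.length === 0) return 0
  const totalPercent = capped.reduce((sum, j) => sum + (j.assigned * 100) / j.cap, 0)
  return totalPercent / capped.length
}
```

For example, four projects with three assigned give a coverage of 75 and an unassigned queue size of 1.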

Evaluation Metrics

Metric	Type	Description
`evaluations_completed_total`	Counter	Scores submitted, by jury group
`evaluation_completion_rate`	Gauge	% of assigned evaluations completed
`ai_shortlist_generated`	Counter	AI shortlists generated per round
`admin_shortlist_override_count`	Counter	Admin overrides of AI shortlist

Deliberation Metrics

Metric	Type	Description
`deliberation_session_count`	Counter	Sessions created, by mode
`deliberation_duration_seconds`	Histogram	Time from VOTING to LOCKED
`deliberation_runoff_count`	Counter	Runoff rounds triggered
`deliberation_admin_override_count`	Counter	Admin overrides of deliberation result
`result_lock_count`	Counter	Results locked
`result_unlock_count`	Counter	Results unlocked (should be rare)

Live Finals Metrics

Metric	Type	Description
`live_voting_concurrent_users`	Gauge	Concurrent jury + audience users during voting
`live_vote_submission_duration_ms`	Histogram	Time to submit a live vote
`audience_vote_total`	Counter	Total audience votes submitted
`stage_manager_cursor_advances`	Counter	Project cursor advances by admin

Mentoring Metrics

Metric	Type	Description
`mentor_messages_sent`	Counter	Messages sent in mentor workspaces
`mentor_files_uploaded`	Counter	Files uploaded to mentor workspaces
`mentor_file_promotions`	Counter	Files promoted to official submissions
`mentor_assignment_coverage`	Gauge	% of mentoring-requesting teams with assigned mentors

Reliability Metrics

Metric	Type	Description
`api_request_duration_seconds`	Histogram	tRPC procedure latency, by router and procedure
`api_error_rate`	Gauge	% of requests returning errors, by router
`job_success_total`	Counter	Background job completions
`job_failure_total`	Counter	Background job failures
`email_send_total`	Counter	Emails sent (reminders, invitations, notifications)
`email_send_failure_total`	Counter	Failed email sends
`db_query_duration_seconds`	Histogram	Prisma query latency

Audit Event Metrics

All admin override actions emit DecisionAuditLog records. These metrics track audit completeness and quality:

Metric	Type	Description
`audit_events_total`	Counter	Total audit log entries, by `actionType`
`audit_events_with_diff`	Counter	Audit entries that include both `beforeState` and `afterState` in `detailsJson`
`audit_coverage_percent`	Gauge	% of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record
`audit_missing_diff_count`	Counter	Override actions where `detailsJson` is missing before/after state (should be 0)
`audit_correlation_coverage`	Gauge	% of audit entries that include full correlation context (`competitionId`, `roundId`, `projectId`)

Audit Completeness Validation

During each phase gate review, run an automated check that verifies:

Every admin override has an audit record — query DecisionAuditLog for each override type and verify count matches the operation count
Before/after state present — for all OVERRIDE-type actions, detailsJson must contain { before: {...}, after: {...} } structure
Correlation completeness — every audit entry references at minimum userId, competitionId, and timestamp
No orphaned operations — cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against audit log
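
The checks above could be automated along these lines. This is a sketch only: the `AuditEntry` shape is a hypothetical reduction of a DecisionAuditLog row, and recognizing overrides by the substring "OVERRIDE" in `actionType` is an assumption, not something this document specifies.

```typescript
// Hypothetical shape of a DecisionAuditLog row; only fields named in this document are used.
type AuditEntry = {
  actionType: string
  userId?: string
  competitionId?: string
  timestamp?: string
  detailsJson?: { before?: unknown; after?: unknown }
}

type AuditIssue = { index: number; problem: 'missing-correlation' | 'missing-diff' }

// Assumption: override actions are recognizable by "OVERRIDE" in actionType.
const isOverride = (e: AuditEntry): boolean => e.actionType.includes('OVERRIDE')

// Flags entries that lack the minimum correlation fields, and override
// entries whose detailsJson does not carry both before and after state.
function validateAuditEntries(entries: AuditEntry[]): AuditIssue[] {
  const issues: AuditIssue[] = []
  entries.forEach((e, index) => {
    if (!e.userId || !e.competitionId || !e.timestamp) {
      issues.push({ index, problem: 'missing-correlation' })
    }
    const d = e.detailsJson
    if (isOverride(e) && (d?.before === undefined || d?.after === undefined)) {
      issues.push({ index, problem: 'missing-diff' })
    }
  })
  return issues
}

// Compares the number of override operations with the number of audit
// records per override type and returns the types whose counts differ.
function overrideTypesWithCountMismatch(
  operationCounts: Record<string, number>,
  auditCounts: Record<string, number>,
): string[] {
  return Object.keys(operationCounts).filter(
    (type) => operationCounts[type] !== (auditCounts[type] ?? 0),
  )
}
```

A gate review would expect both functions to return empty arrays.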

Audit Event Correlation

For tracing admin actions across the system, audit log entries should include these correlation fields when available:

type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string      // deliberation session
  assignmentIntentId?: string  // when overriding an intent
}

This enables queries like: "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."
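
To make such queries concrete, here is a small in-memory sketch. The `AuditRecord` wrapper, the `findAuditRecords` helper, and the `RoundType` placeholder are hypothetical; in the real system this would more likely be a database query.

```typescript
type RoundType = string // placeholder; the real type is defined elsewhere

type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string
  assignmentIntentId?: string
}

// Hypothetical record shape: an action plus its correlation fields.
type AuditRecord = {
  actionType: string
  timestamp: string // ISO 8601 in UTC, so string order equals time order
  correlation: AuditCorrelation
}

// Returns records whose correlation matches every field given in `filter`,
// sorted by time (oldest first).
function findAuditRecords(
  records: AuditRecord[],
  filter: Partial<AuditCorrelation>,
): AuditRecord[] {
  const keys = Object.keys(filter) as (keyof AuditCorrelation)[]
  return records
    .filter((r) => keys.every((k) => r.correlation[k] === filter[k]))
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
}
```

Passing `{ competitionId, roundId }` corresponds to the first example query, and `{ projectId }` alone to the second.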

Structured Logging

All log entries must include correlation IDs for tracing:

type LogContext = {
  // Request context (requestId is always present)
  requestId: string
  userId?: string
  userRole?: string

  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType

  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string  // deliberation session

  // Operation context
  operation: string   // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}

Key Log Events

Event	Level	When
`competition.created`	INFO	New competition created
`round.transitioned`	INFO	Round status changed
`assignment.completed`	INFO	Assignment algorithm finished
`assignment.unassigned`	WARN	Projects remain unassigned
`assignment.exception`	WARN	Over-cap assignment made
`evaluation.submitted`	INFO	Judge submitted a score
`deliberation.voteSubmitted`	INFO	Juror submitted deliberation vote
`deliberation.tieDetected`	WARN	Tie detected in deliberation
`deliberation.adminOverride`	WARN	Admin overrode deliberation result
`resultLock.locked`	INFO	Result locked
`resultLock.unlocked`	WARN	Result unlocked (should be rare)
`eligibility.override`	WARN	Admin overrode AI eligibility decision
`file.promoted`	INFO	Mentor file promoted to submission
`liveVoting.windowClosed`	INFO	Live voting window closed
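
A minimal sketch of emitting one of these events as a single JSON log line follows. The `formatLogEvent` helper, the `EVENT_LEVELS` map, and the default of INFO for unlisted events are assumptions for illustration; `RoundType` is a placeholder for the type defined elsewhere.

```typescript
type RoundType = string // placeholder; the real type is defined elsewhere

type LogContext = {
  requestId: string
  userId?: string
  userRole?: string
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string
  operation: string
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}

type LogLevel = 'INFO' | 'WARN'

// Levels copied from the Key Log Events table (subset shown).
const EVENT_LEVELS: Record<string, LogLevel> = {
  'round.transitioned': 'INFO',
  'assignment.unassigned': 'WARN',
  'deliberation.tieDetected': 'WARN',
  'resultLock.unlocked': 'WARN',
  'file.promoted': 'INFO',
}

// Builds one JSON log line. Events missing from the map default to INFO
// (an assumption). Undefined optional fields are omitted by JSON.stringify.
function formatLogEvent(event: string, ctx: LogContext, now: Date = new Date()): string {
  const level: LogLevel = EVENT_LEVELS[event] ?? 'INFO'
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...ctx })
}
```

Keeping the context flat in the output makes fields such as `competitionId` and `roundId` directly filterable in most log tooling.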

Alert Definitions

Alert	Condition	Severity	Action
Unresolved manual queue	`assignment_unassigned_queue_size > 0` for > 24h	WARNING	Notify admin to manually assign
Quorum mismatch	Deliberation participants < required quorum	CRITICAL	Block voting, notify admin
Result lock failure	ResultLock creation fails	CRITICAL	Retry, notify super-admin
Unlock by non-super-admin	ResultUnlock attempted by non-super-admin	CRITICAL	Block, log security event
AI filtering timeout	`ai_filtering_duration_seconds > 60`	WARNING	Check OpenAI API, retry
Evaluation completion low	`evaluation_completion_rate < 50%` with < 24h remaining	WARNING	Send reminder emails
Live voting overload	`live_voting_concurrent_users > 200`	WARNING	Monitor for performance degradation
Email delivery failure spike	`email_send_failure_total` increases by more than 10 in 1h	WARNING	Check SMTP connection
API error rate spike	`api_error_rate > 5%` for any router	CRITICAL	Investigate, potential rollback
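
The metric-based conditions in this table can be expressed as a simple evaluation over a snapshot of current values. This is a sketch under assumptions: the `MetricsSnapshot` field names are illustrative, and event-driven alerts (quorum mismatch, lock failure, unauthorized unlock) are left out because they are triggered by operations rather than thresholds.

```typescript
type Severity = 'WARNING' | 'CRITICAL'
type FiredAlert = { name: string; severity: Severity }

// Hypothetical snapshot of current metric values; field names are illustrative.
type MetricsSnapshot = {
  unassignedQueueSize: number
  unassignedQueueAgeHours: number
  aiFilteringDurationSeconds: number
  evaluationCompletionRatePercent: number
  hoursUntilEvaluationDeadline: number
  liveVotingConcurrentUsers: number
  emailSendFailuresLastHour: number
  apiErrorRatePercentByRouter: Record<string, number>
}

// Applies the thresholds from the alert table and returns every alert that fires.
function evaluateAlerts(m: MetricsSnapshot): FiredAlert[] {
  const fired: FiredAlert[] = []
  const warn = (name: string): void => {
    fired.push({ name, severity: 'WARNING' })
  }
  if (m.unassignedQueueSize > 0 && m.unassignedQueueAgeHours > 24) warn('Unresolved manual queue')
  if (m.aiFilteringDurationSeconds > 60) warn('AI filtering timeout')
  if (m.evaluationCompletionRatePercent < 50 && m.hoursUntilEvaluationDeadline < 24) {
    warn('Evaluation completion low')
  }
  if (m.liveVotingConcurrentUsers > 200) warn('Live voting overload')
  if (m.emailSendFailuresLastHour > 10) warn('Email delivery failure spike')
  for (const [router, rate] of Object.entries(m.apiErrorRatePercentByRouter)) {
    if (rate > 5) fired.push({ name: `API error rate spike (${router})`, severity: 'CRITICAL' })
  }
  return fired
}
```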

Release Gates

Each release gate must be passed before the corresponding implementation phase can proceed to production.

Gate A: Schema + Backfill Readiness

When: Before Phase 1 code reaches production

Criteria:

All Prisma migrations apply cleanly on staging database
Backfill script runs successfully on staging data
No existing table modified (new tables only)
Rollback tested: drop new tables, verify existing system works
Migration takes < 5 minutes on production-sized dataset

Gate B: Policy Engine Correctness

When: Before Phase 2 code reaches production

Criteria:

5-layer policy precedence returns correct values for all combinations
Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
Category bias applied correctly in assignment
COI check prevents assignment to declared conflicts
Edge cases: null overrides, conflicting settings, zero cap
All policy unit tests pass (100% coverage on resolution logic)
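
The cap mode and precedence criteria can be illustrated with a short sketch. It is not the policy engine itself: `softBuffer` is a hypothetical parameter for the SOFT allowance, and `resolvePolicy` assumes that layers are ordered from most to least specific and that null or undefined means "no override at this layer". The five layers themselves are not named in this document, so none are named here.

```typescript
type CapMode = 'HARD' | 'SOFT' | 'NONE'

// Returns true if a judge who already holds `currentCount` assignments may take one more.
// `softBuffer` is a hypothetical parameter for the extra allowance in SOFT mode.
function canAssign(
  mode: CapMode,
  currentCount: number,
  cap: number,
  softBuffer = 0,
): boolean {
  switch (mode) {
    case 'NONE':
      return true // unlimited
    case 'HARD':
      return currentCount < cap // stops exactly at cap
    case 'SOFT':
      return currentCount < cap + softBuffer // cap plus buffer
  }
}

// Precedence sketch: the first layer with a concrete value wins.
// A value of 0 is a real setting (zero cap), not a missing one.
function resolvePolicy<T>(layers: Array<T | null | undefined>, fallback: T): T {
  for (const value of layers) {
    if (value !== null && value !== undefined) return value
  }
  return fallback
}
```

The zero-cap edge case matters here: testing for null or undefined explicitly, rather than for falsiness, keeps a cap of 0 from being skipped as if it were unset.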

Gate C: End-to-End Monaco Flow Simulation

When: Before Phase 5 code reaches production

Criteria:

Full 8-round flow completes on staging with realistic data
All round transitions work correctly
Multi-round document visibility correct for all roles
AI shortlist generated at end of evaluation rounds
Assignment algorithm handles 500+ projects across 30 judges
Result: finalists selected, no data inconsistencies

Gate D: Invite/Onboarding Stability Under Load

When: Before Phase 3 code reaches production

Criteria:

50 concurrent invite acceptances process correctly
JuryGroupMember records created for all accepted invites
Onboarding self-service values saved correctly
AssignmentIntent records created for pre-assigned judges
No race conditions in concurrent onboarding

Gate E: Live Finals + Deliberation Integrity

When: Before Phase 6 code reaches production

Criteria:

Stage manager cursor advances correctly for all Jury 3 members
Live voting window enforcement (no votes outside window)
50 concurrent audience votes processed without conflicts
Deliberation SINGLE_WINNER_VOTE tallying correct
Deliberation FULL_RANKING Borda count correct
Tie-breaking: RUNOFF creates new vote round
Tie-breaking: ADMIN_DECIDES records decision
ResultLock creates immutable snapshot
ResultUnlock blocked for non-super-admin
Audience vote totals correctly shown to Jury 3
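
For the tallying criteria, a reference implementation to test against might look like the sketch below. The Borda variant used here (with n candidates on a ballot, first place earns n - 1 points and last place earns 0) is an assumption; this document does not state which variant the system uses. Function names are hypothetical.

```typescript
// FULL_RANKING: Borda count. Each ballot lists candidates from most to least preferred.
// Assumed variant: position i (0-based) on a ballot of n candidates earns n - 1 - i points.
function bordaCount(ballots: string[][]): Map<string, number> {
  const scores = new Map<string, number>()
  for (const ballot of ballots) {
    ballot.forEach((candidate, i) => {
      const points = ballot.length - 1 - i
      scores.set(candidate, (scores.get(candidate) ?? 0) + points)
    })
  }
  return scores
}

// SINGLE_WINNER_VOTE: one vote per juror, counted per candidate.
function tallySingleWinner(votes: string[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1)
  return counts
}

// Returns every candidate sharing the top score, sorted by name.
// More than one entry means a tie that RUNOFF or ADMIN_DECIDES must resolve.
function topCandidates(scores: Map<string, number>): string[] {
  if (scores.size === 0) return []
  const best = Math.max(...Array.from(scores.values()))
  return Array.from(scores.entries())
    .filter(([, score]) => score === best)
    .map(([candidate]) => candidate)
    .sort()
}
```

With ballots `[A, B, C]`, `[A, C, B]`, and `[B, A, C]`, candidate A scores 5, B scores 3, and C scores 1, so `topCandidates` returns only A.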

Gate F: Operational Readiness

When: Before Phase 8 (cutover) begins

Criteria:

All metrics listed above are being collected and visible in dashboard
All alerts listed above are configured and tested
Structured logging includes all correlation IDs
Runbook exists for common operational scenarios
On-call team briefed on new system components
Backup and restore tested for new tables
Performance baseline established (API latency, DB query times)

Audit Event Metrics

Track audit completeness and integrity across the redesigned system:

Metric	Target	Alert Threshold
Audit coverage (admin overrides with audit record)	100%	Any override without audit record
Before/after state completeness	100% of override actions	Missing `before` or `after` in details JSON
Assignment intent fulfillment rate	> 80%	< 50% intents HONORED per round
Result lock integrity	0 unauthorized unlocks	Any unlock without super-admin role
File replacement provenance chain	100% linked	Any replacement without `replacedById`
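
As one example of turning a row of this table into a check, the intent fulfillment rate could be classified as follows. This is a sketch: the function name, the status labels, and treating a round with no intents as 100% are assumptions.

```typescript
type FulfillmentStatus = 'on-target' | 'below-target' | 'alert'

// Classifies the share of HONORED assignment intents in a round:
// above 80% meets the target, below 50% raises the alert, anything between is below target.
// Assumption: a round with no intents counts as 100%.
function intentFulfillment(
  honored: number,
  total: number,
): { ratePercent: number; status: FulfillmentStatus } {
  const ratePercent = total === 0 ? 100 : (honored * 100) / total
  const status: FulfillmentStatus =
    ratePercent > 80 ? 'on-target' : ratePercent < 50 ? 'alert' : 'below-target'
  return { ratePercent, status }
}
```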

Audit Correlation Tracking:

Every admin override must include a correlationId linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
Round transition events must correlate with the DecisionAuditLog entries for that round
Result lock/unlock events must reference the DeliberationSession.id they apply to

Sign-Off Criteria

Final sign-off requires:

All 6 release gates passed (A through F)
72-hour burn-in period with zero critical errors
All test matrices passed (see 11-testing-and-qa.md)
Documentation updated (CLAUDE.md, API docs, admin guide)
Architecture owner sign-off on schema finality
Product owner sign-off on feature completeness
