Observability & Release Gates
Overview
This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Each implementation phase must pass its corresponding release gate before it proceeds to production.
Competition Lifecycle Metrics
Submission Metrics
| Metric | Type | Description |
|---|---|---|
| projects_submitted_total | Counter | Total projects submitted, by category and round |
| projects_submitted_late | Counter | Projects submitted after deadline (FLAG policy) |
| submission_window_utilization | Gauge | % of deadline elapsed vs submissions received |
| file_upload_duration_seconds | Histogram | Time to upload a submission file |
| file_upload_size_bytes | Histogram | Size distribution of uploaded files |
Filtering Metrics
| Metric | Type | Description |
|---|---|---|
| eligibility_pass_rate | Gauge | % of projects passing AI filtering |
| eligibility_manual_review_count | Counter | Projects flagged for manual review |
| eligibility_override_count | Counter | Admin overrides of AI decisions |
| ai_filtering_duration_seconds | Histogram | Time for AI to process one project |
Assignment Metrics
| Metric | Type | Description |
|---|---|---|
| assignment_coverage_percent | Gauge | % of projects assigned to at least one judge |
| assignment_unassigned_queue_size | Gauge | Projects in unassigned queue |
| assignment_cap_utilization | Gauge | Average % of cap used per judge |
| assignment_exception_count | Counter | Over-cap manual assignments |
| assignment_coi_skip_count | Counter | Assignments skipped due to COI |
| assignment_duration_seconds | Histogram | Time to run assignment algorithm |
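The three assignment gauges can be derived from plain assignment data. The following is a minimal sketch, assuming hypothetical in-memory shapes: `JudgeLoad` and `computeAssignmentGauges` are illustrative names, not part of the system's API, and skipping judges without a positive cap is an assumption.

```typescript
// Hypothetical shape for illustration; the real data model may differ.
type JudgeLoad = { judgeId: string; cap: number; assigned: number };

type AssignmentGauges = {
  coveragePercent: number; // assignment_coverage_percent
  unassignedQueueSize: number; // assignment_unassigned_queue_size
  capUtilization: number; // assignment_cap_utilization
};

function computeAssignmentGauges(
  projectIds: string[],
  assignedProjectIds: Set<string>,
  judges: JudgeLoad[],
): AssignmentGauges {
  // A project counts as covered when at least one judge is assigned to it.
  const covered = projectIds.filter((id) => assignedProjectIds.has(id)).length;
  const coveragePercent =
    projectIds.length === 0 ? 100 : (covered * 100) / projectIds.length;

  // Average of per-judge utilization. Judges without a positive cap are
  // skipped (assumption: they correspond to uncapped judges).
  const capped = judges.filter((j) => j.cap > 0);
  const capUtilization =
    capped.length === 0
      ? 0
      : (capped.reduce((sum, j) => sum + j.assigned / j.cap, 0) * 100) /
        capped.length;

  return {
    coveragePercent,
    unassignedQueueSize: projectIds.length - covered,
    capUtilization,
  };
}
```

Computing the gauges from one function keeps the unassigned queue size and the coverage percentage consistent with each other, since both come from the same count.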
Evaluation Metrics
| Metric | Type | Description |
|---|---|---|
| evaluations_completed_total | Counter | Scores submitted, by jury group |
| evaluation_completion_rate | Gauge | % of assigned evaluations completed |
| ai_shortlist_generated | Counter | AI shortlists generated per round |
| admin_shortlist_override_count | Counter | Admin overrides of AI shortlist |
Deliberation Metrics
| Metric | Type | Description |
|---|---|---|
| deliberation_session_count | Counter | Sessions created, by mode |
| deliberation_duration_seconds | Histogram | Time from VOTING to LOCKED |
| deliberation_runoff_count | Counter | Runoff rounds triggered |
| deliberation_admin_override_count | Counter | Admin overrides of deliberation result |
| result_lock_count | Counter | Results locked |
| result_unlock_count | Counter | Results unlocked (should be rare) |
Live Finals Metrics
| Metric | Type | Description |
|---|---|---|
| live_voting_concurrent_users | Gauge | Concurrent jury + audience users during voting |
| live_vote_submission_duration_ms | Histogram | Time to submit a live vote |
| audience_vote_total | Counter | Total audience votes submitted |
| stage_manager_cursor_advances | Counter | Project cursor advances by admin |
Mentoring Metrics
| Metric | Type | Description |
|---|---|---|
| mentor_messages_sent | Counter | Messages sent in mentor workspaces |
| mentor_files_uploaded | Counter | Files uploaded to mentor workspaces |
| mentor_file_promotions | Counter | Files promoted to official submissions |
| mentor_assignment_coverage | Gauge | % of mentoring-requesting teams with assigned mentors |
Reliability Metrics
| Metric | Type | Description |
|---|---|---|
| api_request_duration_seconds | Histogram | tRPC procedure latency, by router and procedure |
| api_error_rate | Gauge | % of requests returning errors, by router |
| job_success_total | Counter | Background job completions |
| job_failure_total | Counter | Background job failures |
| email_send_total | Counter | Emails sent (reminders, invitations, notifications) |
| email_send_failure_total | Counter | Failed email sends |
| db_query_duration_seconds | Histogram | Prisma query latency |
Audit Event Metrics
All admin override actions emit DecisionAuditLog records. These metrics track audit completeness and quality:
| Metric | Type | Description |
|---|---|---|
| audit_events_total | Counter | Total audit log entries, by actionType |
| audit_events_with_diff | Counter | Audit entries that include both beforeState and afterState in detailsJson |
| audit_coverage_percent | Gauge | % of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record |
| audit_missing_diff_count | Counter | Override actions where detailsJson is missing before/after state (should be 0) |
| audit_correlation_coverage | Gauge | % of audit entries that include full correlation context (competitionId, roundId, projectId) |
Audit Completeness Validation
During each phase gate review, run an automated check that verifies:
- Every admin override has an audit record: query DecisionAuditLog for each override type and verify that the count matches the operation count
- Before/after state present: for all OVERRIDE-type actions, detailsJson must contain a { before: {...}, after: {...} } structure
- Correlation completeness: every audit entry references at minimum userId, competitionId, and timestamp
- No orphaned operations: cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against the audit log
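The checks above could be automated along these lines. This is a sketch under stated assumptions: the `operationId` link between an override and its audit record, and the names `OverrideOperation`, `AuditRecord`, and `checkAuditCompleteness`, are hypothetical and only illustrate the cross-referencing logic on in-memory data.

```typescript
// Hypothetical record shapes. Only detailsJson, userId, competitionId, and
// timestamp come from this document; operationId is an assumed join key.
type OverrideOperation = { operationId: string; overrideType: string };

type AuditRecord = {
  operationId: string;
  userId?: string;
  competitionId?: string;
  timestamp?: string;
  detailsJson?: { before?: unknown; after?: unknown };
};

type AuditCheckResult = {
  orphanedOperations: string[]; // overrides with no audit record
  missingDiff: string[]; // audit record lacks before or after state
  missingCorrelation: string[]; // audit record lacks a minimum field
};

function checkAuditCompleteness(
  ops: OverrideOperation[],
  audits: AuditRecord[],
): AuditCheckResult {
  const byOperation = new Map<string, AuditRecord>();
  for (const audit of audits) byOperation.set(audit.operationId, audit);

  const result: AuditCheckResult = {
    orphanedOperations: [],
    missingDiff: [],
    missingCorrelation: [],
  };

  for (const op of ops) {
    const audit = byOperation.get(op.operationId);
    if (!audit) {
      result.orphanedOperations.push(op.operationId);
      continue;
    }
    const details = audit.detailsJson;
    if (!details || details.before === undefined || details.after === undefined) {
      result.missingDiff.push(op.operationId);
    }
    if (!audit.userId || !audit.competitionId || !audit.timestamp) {
      result.missingCorrelation.push(op.operationId);
    }
  }
  return result;
}
```

A gate review would pass only when all three arrays are empty.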
Audit Event Correlation
For tracing admin actions across the system, audit log entries should include these correlation fields when available:
type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string // deliberation session
  assignmentIntentId?: string // when overriding an intent
}
This enables queries like: "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."
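As an illustration of such a query, the sketch below filters in-memory audit entries by correlation fields and sorts them by time. The `AuditEntry` shape, the ISO 8601 `timestamp` format, and the `filterAuditEntries` helper are assumptions for this example; a real implementation would presumably run the equivalent query against the database.

```typescript
// Hypothetical entry shape: correlation fields plus an id and timestamp.
type AuditEntry = {
  id: string;
  timestamp: string; // assumed to be an ISO 8601 string
  competitionId: string;
  roundId?: string;
  projectId?: string;
};

type AuditFilter = {
  competitionId?: string;
  roundId?: string;
  projectId?: string;
};

// Returns entries matching every field set on the filter, oldest first.
function filterAuditEntries(entries: AuditEntry[], filter: AuditFilter): AuditEntry[] {
  return entries
    .filter(
      (e) =>
        (filter.competitionId === undefined || e.competitionId === filter.competitionId) &&
        (filter.roundId === undefined || e.roundId === filter.roundId) &&
        (filter.projectId === undefined || e.projectId === filter.projectId),
    )
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}
```

Passing `{ competitionId, roundId }` answers the first example query, and passing only `{ projectId }` answers the second.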
Structured Logging
All log entries must include correlation IDs for tracing:
type LogContext = {
  // Request context (requestId is always present; user fields when authenticated)
  requestId: string
  userId?: string
  userRole?: string
  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType
  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string // deliberation session
  // Operation context
  operation: string // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}
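One way to emit entries with this context is to serialize each event as a single JSON line. The sketch below is an assumption about the output format, not the system's actual logger: `MinimalLogContext` is a trimmed-down stand-in for `LogContext`, and `formatLogEntry` and the `timestamp`, `level`, and `event` keys are hypothetical.

```typescript
// Trimmed-down stand-in for LogContext, kept small for the example.
type MinimalLogContext = {
  requestId: string;
  userId?: string;
  competitionId?: string;
  roundId?: string;
  operation: string;
  result: "success" | "failure" | "skipped";
  details?: Record<string, unknown>;
};

type LogLevel = "INFO" | "WARN" | "ERROR";

// Serializes one log event as a JSON line so that correlation IDs become
// top-level, queryable fields. The clock is injectable for testing.
function formatLogEntry(
  level: LogLevel,
  event: string,
  ctx: MinimalLogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...ctx });
}

// Example: a WARN entry for an over-cap assignment.
console.log(
  formatLogEntry("WARN", "assignment.exception", {
    requestId: "req-123",
    competitionId: "comp-1",
    operation: "assignment.run",
    result: "success",
  }),
);
```

Flattening the context into the top level of the entry, rather than nesting it, keeps filters such as "all entries for a given competitionId" simple in most log aggregators.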
Key Log Events
| Event | Level | When |
|---|---|---|
| competition.created | INFO | New competition created |
| round.transitioned | INFO | Round status changed |
| assignment.completed | INFO | Assignment algorithm finished |
| assignment.unassigned | WARN | Projects remain unassigned |
| assignment.exception | WARN | Over-cap assignment made |
| evaluation.submitted | INFO | Judge submitted a score |
| deliberation.voteSubmitted | INFO | Juror submitted deliberation vote |
| deliberation.tieDetected | WARN | Tie detected in deliberation |
| deliberation.adminOverride | WARN | Admin overrode deliberation result |
| resultLock.locked | INFO | Result locked |
| resultLock.unlocked | WARN | Result unlocked (should be rare) |
| eligibility.override | WARN | Admin overrode AI eligibility decision |
| file.promoted | INFO | Mentor file promoted to submission |
| liveVoting.windowClosed | INFO | Live voting window closed |
Alert Definitions
| Alert | Condition | Severity | Action |
|---|---|---|---|
| Unresolved manual queue | assignment_unassigned_queue_size > 0 for > 24h | WARNING | Notify admin to manually assign |
| Quorum mismatch | Deliberation participants < required quorum | CRITICAL | Block voting, notify admin |
| Result lock failure | ResultLock creation fails | CRITICAL | Retry, notify super-admin |
| Unlock by non-super-admin | ResultUnlock attempted by non-super-admin | CRITICAL | Block, log security event |
| AI filtering timeout | ai_filtering_duration_seconds > 60 | WARNING | Check OpenAI API, retry |
| Evaluation completion low | evaluation_completion_rate < 50% with < 24h remaining | WARNING | Send reminder emails |
| Live voting overload | live_voting_concurrent_users > 200 | WARNING | Monitor for performance degradation |
| Email delivery failure spike | email_send_failure_total increases > 10 in 1h | WARNING | Check SMTP connection |
| API error rate spike | api_error_rate > 5% for any router | CRITICAL | Investigate, potential rollback |
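The threshold-based alerts in this table can be expressed as pure functions over a metrics snapshot, which makes them easy to test before wiring them into an alerting tool. The sketch below covers five of the nine alerts. The `MetricSnapshot` shape is hypothetical, and `unassignedForHours` and `hoursUntilRoundDeadline` are assumed inputs that a real alerting system would derive from time-series data.

```typescript
type Severity = "WARNING" | "CRITICAL";
type Alert = { name: string; severity: Severity };

// Hypothetical snapshot of current metric values.
type MetricSnapshot = {
  unassignedQueueSize: number;
  unassignedForHours: number; // assumed: how long the queue has been non-empty
  aiFilteringDurationSeconds: number;
  evaluationCompletionRate: number; // percent
  hoursUntilRoundDeadline: number; // assumed input
  liveVotingConcurrentUsers: number;
  apiErrorRates: { router: string; ratePercent: number }[];
};

function evaluateAlerts(m: MetricSnapshot): Alert[] {
  const alerts: Alert[] = [];
  if (m.unassignedQueueSize > 0 && m.unassignedForHours > 24) {
    alerts.push({ name: "Unresolved manual queue", severity: "WARNING" });
  }
  if (m.aiFilteringDurationSeconds > 60) {
    alerts.push({ name: "AI filtering timeout", severity: "WARNING" });
  }
  if (m.evaluationCompletionRate < 50 && m.hoursUntilRoundDeadline < 24) {
    alerts.push({ name: "Evaluation completion low", severity: "WARNING" });
  }
  if (m.liveVotingConcurrentUsers > 200) {
    alerts.push({ name: "Live voting overload", severity: "WARNING" });
  }
  // The API error rate alert fires per router, not on the global average.
  for (const { router, ratePercent } of m.apiErrorRates) {
    if (ratePercent > 5) {
      alerts.push({ name: `API error rate spike (${router})`, severity: "CRITICAL" });
    }
  }
  return alerts;
}
```

The event-driven alerts (quorum mismatch, result lock failure, unlock by non-super-admin) are better raised at the point where the operation is blocked or fails, so they are not modeled here.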
Release Gates
Each release gate must be passed before the corresponding implementation phase can proceed to production.
Gate A: Schema + Backfill Readiness
When: Before Phase 1 code reaches production
Criteria:
- All Prisma migrations apply cleanly on staging database
- Backfill script runs successfully on staging data
- No existing table modified (new tables only)
- Rollback tested: drop new tables, verify existing system works
- Migration takes < 5 minutes on production-sized dataset
Gate B: Policy Engine Correctness
When: Before Phase 2 code reaches production
Criteria:
- 5-layer policy precedence returns correct values for all combinations
- Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
- Category bias applied correctly in assignment
- COI check prevents assignment to declared conflicts
- Edge cases: null overrides, conflicting settings, zero cap
- All policy unit tests pass (100% coverage on resolution logic)
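Two of the criteria above lend themselves to small, directly testable functions. The sketch below is illustrative only: it assumes that precedence layers are passed most-specific first and that the first non-null value wins, and the `softBuffer` field and the function names are hypothetical. The five actual layers are not modeled.

```typescript
type CapMode = "HARD" | "SOFT" | "NONE";

// softBuffer is an assumed field name for the extra SOFT-mode allowance.
type CapPolicy = { mode: CapMode; cap: number; softBuffer: number };

// Returns the first layer that is set. Uses explicit null/undefined checks
// so that a legitimate value of 0 (zero cap) is not mistaken for "unset".
function resolveSetting<T>(layers: Array<T | null | undefined>, systemDefault: T): T {
  for (const value of layers) {
    if (value !== null && value !== undefined) return value;
  }
  return systemDefault;
}

// Decides whether a judge with currentAssignments may take one more project.
function canAssign(policy: CapPolicy, currentAssignments: number): boolean {
  switch (policy.mode) {
    case "HARD":
      return currentAssignments < policy.cap;
    case "SOFT":
      return currentAssignments < policy.cap + policy.softBuffer;
    case "NONE":
      return true;
  }
}

// Example: a judge-level override of 0 beats a group-level cap of 20.
const resolvedCap = resolveSetting<number>([null, 0, 20], 15);
console.log(canAssign({ mode: "HARD", cap: resolvedCap, softBuffer: 0 }, 0));
```

The zero-cap case is the one most likely to break a resolver written with a truthiness check (`value || fallback`), which is why it belongs in the unit tests.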
Gate C: End-to-End Monaco Flow Simulation
When: Before Phase 5 code reaches production
Criteria:
- Full 8-round flow completes on staging with realistic data
- All round transitions work correctly
- Multi-round document visibility correct for all roles
- AI shortlist generated at end of evaluation rounds
- Assignment algorithm handles 500+ projects across 30 judges
- Result: finalists selected, no data inconsistencies
Gate D: Invite/Onboarding Stability Under Load
When: Before Phase 3 code reaches production
Criteria:
- 50 concurrent invite acceptances process correctly
- JuryGroupMember records created for all accepted invites
- Onboarding self-service values saved correctly
- AssignmentIntent records created for pre-assigned judges
- No race conditions in concurrent onboarding
Gate E: Live Finals + Deliberation Integrity
When: Before Phase 6 code reaches production
Criteria:
- Stage manager cursor advances correctly for all Jury 3 members
- Live voting window enforcement (no votes outside window)
- 50 concurrent audience votes processed without conflicts
- Deliberation SINGLE_WINNER_VOTE tallying correct
- Deliberation FULL_RANKING Borda count correct
- Tie-breaking: RUNOFF creates new vote round
- Tie-breaking: ADMIN_DECIDES records decision
- ResultLock creates immutable snapshot
- ResultUnlock blocked for non-super-admin
- Audience vote totals correctly shown to Jury 3
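For the FULL_RANKING criterion, a reference tally is useful to compare against the production implementation. The sketch below uses the classic Borda scheme (n-1 points for first place down to 0 for last), which is an assumption: this document does not specify the point scheme. The function names are hypothetical.

```typescript
// A ranking lists project IDs from most preferred to least preferred.
type Ranking = string[];

// Classic Borda count: with n candidates, position i (0-based) earns n-1-i points.
function bordaCount(rankings: Ranking[], candidates: string[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const c of candidates) scores.set(c, 0);
  const n = candidates.length;
  for (const ranking of rankings) {
    ranking.forEach((projectId, position) => {
      scores.set(projectId, (scores.get(projectId) ?? 0) + (n - 1 - position));
    });
  }
  return scores;
}

// Returns every candidate sharing the top score, sorted by ID.
// More than one entry means a tie that needs RUNOFF or ADMIN_DECIDES.
function findWinners(scores: Map<string, number>): string[] {
  let best = -Infinity;
  scores.forEach((score) => {
    if (score > best) best = score;
  });
  const winners: string[] = [];
  scores.forEach((score, projectId) => {
    if (score === best) winners.push(projectId);
  });
  return winners.sort();
}
```

For three jurors ranking projects A, B, C as [A, B, C], [A, C, B], and [B, A, C], the scores are A = 5, B = 3, C = 1, so A wins outright. Two jurors ranking [A, B] and [B, A] produce a tie at 1 point each.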
Gate F: Operational Readiness
When: Before Phase 8 (cutover) begins
Criteria:
- All metrics listed above are being collected and visible in dashboard
- All alerts listed above are configured and tested
- Structured logging includes all correlation IDs
- Runbook exists for common operational scenarios
- On-call team briefed on new system components
- Backup and restore tested for new tables
- Performance baseline established (API latency, DB query times)
Audit Integrity Targets
Track audit completeness and integrity across the redesigned system:
| Metric | Target | Alert Threshold |
|---|---|---|
| Audit coverage (admin overrides with audit record) | 100% | Any override without audit record |
| Before/after state completeness | 100% of override actions | Missing before or after in details JSON |
| Assignment intent fulfillment rate | > 80% | < 50% intents HONORED per round |
| Result lock integrity | 0 unauthorized unlocks | Any unlock without super-admin role |
| File replacement provenance chain | 100% linked | Any replacement without replacedById |
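As a worked example of one row, the assignment intent fulfillment rate can be computed per round and compared against both thresholds. In this sketch only the HONORED status comes from this document. The other status values, the `intentFulfillment` name, and treating a round with no intents as fully fulfilled are assumptions.

```typescript
// Only HONORED appears in this document; the other values are assumed.
type IntentStatus = "HONORED" | "OVERRIDDEN" | "PENDING";

type FulfillmentReport = {
  ratePercent: number;
  meetsTarget: boolean; // target: > 80%
  shouldAlert: boolean; // alert threshold: < 50%
};

function intentFulfillment(statuses: IntentStatus[]): FulfillmentReport {
  const honored = statuses.filter((s) => s === "HONORED").length;
  // Assumption: a round with no intents counts as fully fulfilled.
  const ratePercent = statuses.length === 0 ? 100 : (honored * 100) / statuses.length;
  return {
    ratePercent,
    meetsTarget: ratePercent > 80,
    shouldAlert: ratePercent < 50,
  };
}
```

A rate between 50% and 80% neither meets the target nor triggers the alert, so it may be worth surfacing on a dashboard even though nobody is paged.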
Audit Correlation Tracking:
- Every admin override must include a correlationId linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
- Round transition events must correlate with the DecisionAuditLog entries for that round
- Result lock/unlock events must reference the DeliberationSession.id they apply to
Sign-Off Criteria
Final sign-off requires:
- All 6 release gates passed (A through F)
- 72-hour burn-in period with zero critical errors
- All test matrices passed (see 11-testing-and-qa.md)
- Documentation updated (CLAUDE.md, API docs, admin guide)
- Architecture owner sign-off on schema finality
- Product owner sign-off on feature completeness