# Observability & Release Gates

## Overview

This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Every phase of the implementation must pass its corresponding release gate before proceeding.

---

## Competition Lifecycle Metrics

### Submission Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `projects_submitted_total` | Counter | Total projects submitted, by category and round |
| `projects_submitted_late` | Counter | Projects submitted after deadline (FLAG policy) |
| `submission_window_utilization` | Gauge | % of deadline elapsed vs submissions received |
| `file_upload_duration_seconds` | Histogram | Time to upload a submission file |
| `file_upload_size_bytes` | Histogram | Size distribution of uploaded files |

### Filtering Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `eligibility_pass_rate` | Gauge | % of projects passing AI filtering |
| `eligibility_manual_review_count` | Counter | Projects flagged for manual review |
| `eligibility_override_count` | Counter | Admin overrides of AI decisions |
| `ai_filtering_duration_seconds` | Histogram | Time for AI to process one project |

### Assignment Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `assignment_coverage_percent` | Gauge | % of projects assigned to at least one judge |
| `assignment_unassigned_queue_size` | Gauge | Projects in unassigned queue |
| `assignment_cap_utilization` | Gauge | Average % of cap used per judge |
| `assignment_exception_count` | Counter | Over-cap manual assignments |
| `assignment_coi_skip_count` | Counter | Assignments skipped due to COI |
| `assignment_duration_seconds` | Histogram | Time to run assignment algorithm |
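
As a rough illustration of how the two percentage gauges above could be derived, here is a minimal sketch. The `JudgeLoad` shape and both function names are hypothetical, and treating an empty project list as 100% covered is an assumption rather than documented behavior.

```typescript
// Hypothetical input shape; the real schema may differ.
type JudgeLoad = { judgeId: string; assigned: number; cap: number }

// assignment_coverage_percent: % of projects assigned to at least one judge.
function assignmentCoveragePercent(
  projectIds: string[],
  assignedProjectIds: Set<string>,
): number {
  if (projectIds.length === 0) return 100 // assumption: nothing to cover counts as full coverage
  const covered = projectIds.filter((id) => assignedProjectIds.has(id)).length
  return (covered / projectIds.length) * 100
}

// assignment_cap_utilization: average % of cap used per judge.
// Judges without a positive cap are skipped to avoid dividing by zero.
function assignmentCapUtilization(judges: JudgeLoad[]): number {
  const capped = judges.filter((j) => j.cap > 0)
  if (capped.length === 0) return 0
  const total = capped.reduce((sum, j) => sum + (j.assigned / j.cap) * 100, 0)
  return total / capped.length
}

const coverage = assignmentCoveragePercent(
  ['p1', 'p2', 'p3', 'p4'],
  new Set<string>(['p1', 'p2', 'p3']),
)
const utilization = assignmentCapUtilization([
  { judgeId: 'j1', assigned: 5, cap: 10 },
  { judgeId: 'j2', assigned: 10, cap: 10 },
])
console.log(coverage, utilization) // 75 75
```
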

### Evaluation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `evaluations_completed_total` | Counter | Scores submitted, by jury group |
| `evaluation_completion_rate` | Gauge | % of assigned evaluations completed |
| `ai_shortlist_generated` | Counter | AI shortlists generated per round |
| `admin_shortlist_override_count` | Counter | Admin overrides of AI shortlist |

### Deliberation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `deliberation_session_count` | Counter | Sessions created, by mode |
| `deliberation_duration_seconds` | Histogram | Time from VOTING to LOCKED |
| `deliberation_runoff_count` | Counter | Runoff rounds triggered |
| `deliberation_admin_override_count` | Counter | Admin overrides of deliberation result |
| `result_lock_count` | Counter | Results locked |
| `result_unlock_count` | Counter | Results unlocked (should be rare) |

### Live Finals Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `live_voting_concurrent_users` | Gauge | Concurrent jury + audience users during voting |
| `live_vote_submission_duration_ms` | Histogram | Time to submit a live vote |
| `audience_vote_total` | Counter | Total audience votes submitted |
| `stage_manager_cursor_advances` | Counter | Project cursor advances by admin |

### Mentoring Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `mentor_messages_sent` | Counter | Messages sent in mentor workspaces |
| `mentor_files_uploaded` | Counter | Files uploaded to mentor workspaces |
| `mentor_file_promotions` | Counter | Files promoted to official submissions |
| `mentor_assignment_coverage` | Gauge | % of mentoring-requesting teams with assigned mentors |

---

## Reliability Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `api_request_duration_seconds` | Histogram | tRPC procedure latency, by router and procedure |
| `api_error_rate` | Gauge | % of requests returning errors, by router |
| `job_success_total` | Counter | Background job completions |
| `job_failure_total` | Counter | Background job failures |
| `email_send_total` | Counter | Emails sent (reminders, invitations, notifications) |
| `email_send_failure_total` | Counter | Failed email sends |
| `db_query_duration_seconds` | Histogram | Prisma query latency |

---

## Audit Event Metrics

All admin override actions emit `DecisionAuditLog` records. These metrics track audit completeness and quality:

| Metric | Type | Description |
|--------|------|-------------|
| `audit_events_total` | Counter | Total audit log entries, by `actionType` |
| `audit_events_with_diff` | Counter | Audit entries that include both `beforeState` and `afterState` in `detailsJson` |
| `audit_coverage_percent` | Gauge | % of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record |
| `audit_missing_diff_count` | Counter | Override actions where `detailsJson` is missing before/after state (should be 0) |
| `audit_correlation_coverage` | Gauge | % of audit entries that include full correlation context (`competitionId`, `roundId`, `projectId`) |

### Audit Completeness Validation

During each phase gate review, run an automated check that verifies:

1. **Every admin override has an audit record**: query `DecisionAuditLog` for each override type and verify that the count matches the operation count
2. **Before/after state present**: for all OVERRIDE-type actions, `detailsJson` must contain a `{ before: {...}, after: {...} }` structure
3. **Correlation completeness**: every audit entry references at minimum `userId`, `competitionId`, and `timestamp`
4. **No orphaned operations**: cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against the audit log
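
The four checks above could be automated along the following lines. This is a minimal in-memory sketch, not the project's actual validation job: the `AuditEntry` and `OverrideOperation` shapes, the `operationId` link, and the function name are all assumptions made for illustration, and the entries passed in are assumed to be OVERRIDE-type already.

```typescript
// Hypothetical, simplified shapes for illustration only.
type AuditEntry = {
  actionType: string
  userId?: string
  competitionId?: string
  timestamp?: string
  operationId?: string // assumed link back to the overriding operation
  detailsJson?: { before?: unknown; after?: unknown }
}

type OverrideOperation = { id: string; kind: string }

type AuditReport = {
  missingAudit: string[] // operations with no audit record (checks 1 and 4)
  missingDiff: number // entries lacking before or after state (check 2)
  missingCorrelation: number // entries lacking userId, competitionId, or timestamp (check 3)
}

function checkAuditCompleteness(
  ops: OverrideOperation[],
  entries: AuditEntry[],
): AuditReport {
  // Collect the operation IDs that have at least one audit entry.
  const audited = new Set<string>(
    entries.map((e) => e.operationId).filter((id): id is string => !!id),
  )
  const missingAudit = ops.filter((op) => !audited.has(op.id)).map((op) => op.id)
  const missingDiff = entries.filter(
    (e) => e.detailsJson?.before === undefined || e.detailsJson?.after === undefined,
  ).length
  const missingCorrelation = entries.filter(
    (e) => !e.userId || !e.competitionId || !e.timestamp,
  ).length
  return { missingAudit, missingDiff, missingCorrelation }
}

const report = checkAuditCompleteness(
  [
    { id: 'op-1', kind: 'eligibility_override' },
    { id: 'op-2', kind: 'result_unlock' },
  ],
  [
    {
      actionType: 'OVERRIDE',
      userId: 'u1',
      competitionId: 'c1',
      timestamp: '2025-01-01T00:00:00Z',
      operationId: 'op-1',
      detailsJson: { before: { status: 'REJECTED' }, after: { status: 'ELIGIBLE' } },
    },
  ],
)
console.log(report.missingAudit) // [ 'op-2' ]
```

A gate review would treat any non-empty `missingAudit` list or non-zero count as a failure.
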

### Audit Event Correlation

For tracing admin actions across the system, audit log entries should include these correlation fields when available:

```typescript
type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string // deliberation session
  assignmentIntentId?: string // when overriding an intent
}
```
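
To make the use of these fields concrete, the sketch below filters a list of audit entries by correlation fields and sorts the result by time. It works on an in-memory array with a hypothetical `CorrelatedAuditEntry` shape and a hypothetical `createdAt` field; a real implementation would presumably query `DecisionAuditLog` instead.

```typescript
// Hypothetical in-memory shape for illustration only.
type CorrelatedAuditEntry = {
  actionType: string
  createdAt: Date // assumed timestamp field
  competitionId: string
  roundId?: string
  projectId?: string
}

type CorrelationFilter = {
  competitionId?: string
  roundId?: string
  projectId?: string
}

// Returns entries matching every provided correlation field, oldest first.
function findAuditEntries(
  entries: CorrelatedAuditEntry[],
  filter: CorrelationFilter,
): CorrelatedAuditEntry[] {
  return entries
    .filter(
      (e) =>
        (filter.competitionId === undefined || e.competitionId === filter.competitionId) &&
        (filter.roundId === undefined || e.roundId === filter.roundId) &&
        (filter.projectId === undefined || e.projectId === filter.projectId),
    )
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime())
}

const sampleEntries: CorrelatedAuditEntry[] = [
  { actionType: 'shortlist.override', createdAt: new Date(2000), competitionId: 'c1', roundId: 'r3' },
  { actionType: 'eligibility.override', createdAt: new Date(1000), competitionId: 'c1', roundId: 'r3', projectId: 'p9' },
  { actionType: 'assignment.exception', createdAt: new Date(1500), competitionId: 'c1', roundId: 'r2', projectId: 'p9' },
  { actionType: 'result.unlock', createdAt: new Date(500), competitionId: 'c2', roundId: 'r3' },
]

// All overrides for competition c1, round r3, sorted by time.
const roundThree = findAuditEntries(sampleEntries, { competitionId: 'c1', roundId: 'r3' })
// All actions on project p9 across rounds.
const projectNine = findAuditEntries(sampleEntries, { projectId: 'p9' })
console.log(roundThree.map((e) => e.actionType), projectNine.length)
```
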

These correlation fields enable queries like: "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."

---

## Structured Logging

All log entries must include correlation IDs for tracing:

```typescript
type LogContext = {
  // Request context (requestId is always present)
  requestId: string
  userId?: string
  userRole?: string

  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType

  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string // deliberation session

  // Operation context
  operation: string // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}
```
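
One way such a context might be emitted is as a single JSON line per event, so that log tooling can index the correlation IDs. The helper below is only a sketch: `formatLogEvent`, `LogLevel`, and the trimmed-down `SketchLogContext` are hypothetical names, and the output field layout is an assumption.

```typescript
// Trimmed-down context for the sketch; field names follow the LogContext type above.
type SketchLogContext = {
  requestId: string
  userId?: string
  competitionId?: string
  roundId?: string
  operation: string
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}

type LogLevel = 'INFO' | 'WARN' | 'ERROR'

// Serializes one event as a single JSON line with timestamp, level, event name, and context.
function formatLogEvent(
  level: LogLevel,
  event: string,
  ctx: SketchLogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...ctx })
}

const line = formatLogEvent(
  'WARN',
  'assignment.unassigned',
  {
    requestId: 'req-123',
    competitionId: 'c1',
    roundId: 'r2',
    operation: 'assignment.run',
    result: 'success',
    details: { unassignedCount: 4 },
  },
  new Date('2025-01-01T00:00:00.000Z'),
)
console.log(line)
```

Passing `now` explicitly keeps the helper deterministic in tests; callers would normally omit it.
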

### Key Log Events

| Event | Level | When |
|-------|-------|------|
| `competition.created` | INFO | New competition created |
| `round.transitioned` | INFO | Round status changed |
| `assignment.completed` | INFO | Assignment algorithm finished |
| `assignment.unassigned` | WARN | Projects remain unassigned |
| `assignment.exception` | WARN | Over-cap assignment made |
| `evaluation.submitted` | INFO | Judge submitted a score |
| `deliberation.voteSubmitted` | INFO | Juror submitted deliberation vote |
| `deliberation.tieDetected` | WARN | Tie detected in deliberation |
| `deliberation.adminOverride` | WARN | Admin overrode deliberation result |
| `resultLock.locked` | INFO | Result locked |
| `resultLock.unlocked` | WARN | Result unlocked (should be rare) |
| `eligibility.override` | WARN | Admin overrode AI eligibility decision |
| `file.promoted` | INFO | Mentor file promoted to submission |
| `liveVoting.windowClosed` | INFO | Live voting window closed |

---

## Alert Definitions

| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| **Unresolved manual queue** | `assignment_unassigned_queue_size > 0` for > 24h | WARNING | Notify admin to manually assign |
| **Quorum mismatch** | Deliberation participants < required quorum | CRITICAL | Block voting, notify admin |
| **Result lock failure** | ResultLock creation fails | CRITICAL | Retry, notify super-admin |
| **Unlock by non-super-admin** | ResultUnlock attempted by non-super-admin | CRITICAL | Block, log security event |
| **AI filtering timeout** | `ai_filtering_duration_seconds > 60` | WARNING | Check OpenAI API, retry |
| **Evaluation completion low** | `evaluation_completion_rate < 50%` with < 24h remaining | WARNING | Send reminder emails |
| **Live voting overload** | `live_voting_concurrent_users > 200` | WARNING | Monitor for performance degradation |
| **Email delivery failure spike** | `email_send_failure_total` increases > 10 in 1h | WARNING | Check SMTP connection |
| **API error rate spike** | `api_error_rate > 5%` for any router | CRITICAL | Investigate, potential rollback |
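
Two of the conditions above are time-based: a gauge staying above a threshold for a period ("> 0 for > 24h") and a counter increasing by more than a given amount within a window ("increases > 10 in 1h"). The sketch below shows one way to evaluate both over raw samples. The `Sample` shape and function names are hypothetical, and a real deployment would more likely express these as rules in its monitoring system.

```typescript
// One observation of a metric; `at` is epoch milliseconds (assumed representation).
type Sample = { at: number; value: number }

const HOUR_MS = 60 * 60 * 1000

// True when the value has stayed above `threshold` continuously for longer than `windowMs`.
function sustainedAbove(
  samples: Sample[],
  threshold: number,
  windowMs: number,
  now: number,
): boolean {
  const sorted = samples.filter((s) => s.at <= now).sort((a, b) => a.at - b.at)
  let heldSince: number | null = null
  for (const s of sorted) {
    if (s.value > threshold) {
      if (heldSince === null) heldSince = s.at // start of the current streak
    } else {
      heldSince = null // streak broken
    }
  }
  return heldSince !== null && now - heldSince > windowMs
}

// Increase of a monotonically increasing counter over the trailing window.
// Counter resets are not handled in this sketch.
function counterIncrease(samples: Sample[], windowMs: number, now: number): number {
  const inWindow = samples
    .filter((s) => s.at >= now - windowMs && s.at <= now)
    .sort((a, b) => a.at - b.at)
  if (inWindow.length < 2) return 0
  return inWindow[inWindow.length - 1].value - inWindow[0].value
}

const now = 100 * HOUR_MS
const queueSamples: Sample[] = [
  { at: now - 30 * HOUR_MS, value: 3 },
  { at: now - 12 * HOUR_MS, value: 2 },
  { at: now - 1 * HOUR_MS, value: 1 },
]
const emailFailureSamples: Sample[] = [
  { at: now - 50 * 60 * 1000, value: 100 },
  { at: now - 10 * 60 * 1000, value: 115 },
]

const queueAlert = sustainedAbove(queueSamples, 0, 24 * HOUR_MS, now)
const emailAlert = counterIncrease(emailFailureSamples, HOUR_MS, now) > 10
console.log(queueAlert, emailAlert) // true true
```
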

---

## Release Gates

Each release gate must be passed before the corresponding implementation phase can proceed to production.

### Gate A: Schema + Backfill Readiness

**When**: Before Phase 1 code reaches production

**Criteria**:
- [ ] All Prisma migrations apply cleanly on staging database
- [ ] Backfill script runs successfully on staging data
- [ ] No existing table modified (new tables only)
- [ ] Rollback tested: drop new tables, verify existing system works
- [ ] Migration takes < 5 minutes on production-sized dataset

---

### Gate B: Policy Engine Correctness

**When**: Before Phase 2 code reaches production

**Criteria**:
- [ ] 5-layer policy precedence returns correct values for all combinations
- [ ] Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
- [ ] Category bias applied correctly in assignment
- [ ] COI check prevents assignment to declared conflicts
- [ ] Edge cases: null overrides, conflicting settings, zero cap
- [ ] All policy unit tests pass (100% coverage on resolution logic)
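
The cap mode criterion lends itself to a small table-driven test. The sketch below encodes one plausible reading of the three modes; the `JudgeCapPolicy` shape, the `softBuffer` field, and the exact SOFT semantics (cap plus a fixed buffer) are assumptions, not the policy engine's confirmed behavior.

```typescript
type CapMode = 'HARD' | 'SOFT' | 'NONE'

// Hypothetical resolved policy for one judge.
type JudgeCapPolicy = {
  mode: CapMode
  cap: number // maximum under HARD; soft target under SOFT; ignored under NONE
  softBuffer: number // assumed: extra assignments tolerated above cap under SOFT
}

// Whether one more assignment may be given to a judge who already has `currentAssignments`.
function canAssign(policy: JudgeCapPolicy, currentAssignments: number): boolean {
  if (policy.mode === 'NONE') return true // unlimited
  const limit = policy.mode === 'SOFT' ? policy.cap + policy.softBuffer : policy.cap
  return currentAssignments < limit
}

const hard: JudgeCapPolicy = { mode: 'HARD', cap: 5, softBuffer: 2 }
const soft: JudgeCapPolicy = { mode: 'SOFT', cap: 5, softBuffer: 2 }
const none: JudgeCapPolicy = { mode: 'NONE', cap: 5, softBuffer: 2 }
const zeroCap: JudgeCapPolicy = { mode: 'HARD', cap: 0, softBuffer: 0 }

console.log(canAssign(hard, 5)) // false: HARD stops at cap
console.log(canAssign(soft, 6)) // true: within the buffer
console.log(canAssign(soft, 7)) // false: buffer exhausted
console.log(canAssign(none, 500)) // true: no limit
console.log(canAssign(zeroCap, 0)) // false: zero cap edge case
```
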

---

### Gate C: End-to-End Monaco Flow Simulation

**When**: Before Phase 5 code reaches production

**Criteria**:
- [ ] Full 8-round flow completes on staging with realistic data
- [ ] All round transitions work correctly
- [ ] Multi-round document visibility correct for all roles
- [ ] AI shortlist generated at end of evaluation rounds
- [ ] Assignment algorithm handles 500+ projects across 30 judges
- [ ] Result: finalists selected, no data inconsistencies

---

### Gate D: Invite/Onboarding Stability Under Load

**When**: Before Phase 3 code reaches production

**Criteria**:
- [ ] 50 concurrent invite acceptances process correctly
- [ ] JuryGroupMember records created for all accepted invites
- [ ] Onboarding self-service values saved correctly
- [ ] AssignmentIntent records created for pre-assigned judges
- [ ] No race conditions in concurrent onboarding

---

### Gate E: Live Finals + Deliberation Integrity

**When**: Before Phase 6 code reaches production

**Criteria**:
- [ ] Stage manager cursor advances correctly for all Jury 3 members
- [ ] Live voting window enforcement (no votes outside window)
- [ ] 50 concurrent audience votes processed without conflicts
- [ ] Deliberation SINGLE_WINNER_VOTE tallying correct
- [ ] Deliberation FULL_RANKING Borda count correct
- [ ] Tie-breaking: RUNOFF creates new vote round
- [ ] Tie-breaking: ADMIN_DECIDES records decision
- [ ] ResultLock creates immutable snapshot
- [ ] ResultUnlock blocked for non-super-admin
- [ ] Audience vote totals correctly shown to Jury 3
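
For the FULL_RANKING criterion, a reference tally is useful to test against. The sketch below uses the common Borda variant in which a ballot over n candidates gives n - 1 points to first place down to 0 for last; whether the system uses this exact scoring is an assumption, and the function names are hypothetical. More than one leader signals a tie that the RUNOFF or ADMIN_DECIDES path would then resolve.

```typescript
// Each ballot lists project IDs from most to least preferred.
type Ballot = string[]

// Borda tally: position 0 earns n - 1 points, the last position earns 0.
function bordaTally(ballots: Ballot[], candidates: string[]): Record<string, number> {
  const scores: Record<string, number> = {}
  for (const c of candidates) scores[c] = 0
  const n = candidates.length
  for (const ballot of ballots) {
    ballot.forEach((projectId, position) => {
      // Ignore IDs that are not registered candidates.
      if (!Object.prototype.hasOwnProperty.call(scores, projectId)) return
      scores[projectId] += n - 1 - position
    })
  }
  return scores
}

// All candidates sharing the top score; more than one entry means a tie.
function leaders(scores: Record<string, number>): string[] {
  const ids = Object.keys(scores)
  const max = Math.max(...ids.map((id) => scores[id]))
  return ids.filter((id) => scores[id] === max)
}

const candidates = ['A', 'B', 'C']
const clearResult = bordaTally(
  [
    ['A', 'B', 'C'],
    ['B', 'A', 'C'],
    ['C', 'A', 'B'],
  ],
  candidates,
)
const tiedResult = bordaTally(
  [
    ['A', 'B', 'C'],
    ['B', 'A', 'C'],
  ],
  candidates,
)
console.log(clearResult, leaders(clearResult)) // { A: 4, B: 3, C: 2 } [ 'A' ]
console.log(tiedResult, leaders(tiedResult)) // { A: 3, B: 3, C: 0 } [ 'A', 'B' ]
```
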

---

### Gate F: Operational Readiness

**When**: Before Phase 8 (cutover) begins

**Criteria**:
- [ ] All metrics listed above are being collected and visible in dashboard
- [ ] All alerts listed above are configured and tested
- [ ] Structured logging includes all correlation IDs
- [ ] Runbook exists for common operational scenarios
- [ ] On-call team briefed on new system components
- [ ] Backup and restore tested for new tables
- [ ] Performance baseline established (API latency, DB query times)

---

## Audit Event Metrics

Track audit completeness and integrity across the redesigned system:

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Audit coverage (admin overrides with audit record) | 100% | Any override without audit record |
| Before/after state completeness | 100% of override actions | Missing `before` or `after` in `detailsJson` |
| Assignment intent fulfillment rate | > 80% | < 50% intents HONORED per round |
| Result lock integrity | 0 unauthorized unlocks | Any unlock without super-admin role |
| File replacement provenance chain | 100% linked | Any replacement without `replacedById` |

**Audit Correlation Tracking:**
- Every admin override must include a `correlationId` linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
- Round transition events must correlate with the `DecisionAuditLog` entries for that round
- Result lock/unlock events must reference the `DeliberationSession.id` they apply to

---

## Sign-Off Criteria

Final sign-off requires:

1. **All 6 release gates passed** (A through F)
2. **72-hour burn-in period** with zero critical errors
3. **All test matrices passed** (see [11-testing-and-qa.md](./11-testing-and-qa.md))
4. **Documentation updated** (CLAUDE.md, API docs, admin guide)
5. **Architecture owner sign-off** on schema finality
6. **Product owner sign-off** on feature completeness