# Observability & Release Gates
## Overview
This document defines the metrics, logging, alerting, and release gate criteria for the redesigned competition system. Every phase of the implementation must pass its corresponding release gate before proceeding.
---
## Competition Lifecycle Metrics
### Submission Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `projects_submitted_total` | Counter | Total projects submitted, by category and round |
| `projects_submitted_late` | Counter | Projects submitted after deadline (FLAG policy) |
| `submission_window_utilization` | Gauge | Share of the submission window elapsed compared with submissions received so far |
| `file_upload_duration_seconds` | Histogram | Time to upload a submission file |
| `file_upload_size_bytes` | Histogram | Size distribution of uploaded files |
### Filtering Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `eligibility_pass_rate` | Gauge | % of projects passing AI filtering |
| `eligibility_manual_review_count` | Counter | Projects flagged for manual review |
| `eligibility_override_count` | Counter | Admin overrides of AI decisions |
| `ai_filtering_duration_seconds` | Histogram | Time for AI to process one project |
### Assignment Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `assignment_coverage_percent` | Gauge | % of projects assigned to at least one judge |
| `assignment_unassigned_queue_size` | Gauge | Projects in unassigned queue |
| `assignment_cap_utilization` | Gauge | Average % of cap used per judge |
| `assignment_exception_count` | Counter | Over-cap manual assignments |
| `assignment_coi_skip_count` | Counter | Assignments skipped due to COI |
| `assignment_duration_seconds` | Histogram | Time to run assignment algorithm |
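
As an illustration of how two of these gauges could be derived, here is a minimal sketch. The input shapes (`JudgeLoad`, the assignment pairs) and the function names are assumptions made for this example, not the project's actual models, and treating an empty project list as 100% covered is a choice of this sketch.

```typescript
// Hypothetical input shape; the real Prisma models may differ.
type JudgeLoad = { judgeId: string; assigned: number; cap: number | null }

// assignment_coverage_percent: % of projects with at least one judge assigned.
function assignmentCoveragePercent(
  projectIds: string[],
  assignments: { projectId: string; judgeId: string }[],
): number {
  if (projectIds.length === 0) return 100 // nothing to cover (assumption of this sketch)
  const covered = new Set(assignments.map((a) => a.projectId))
  const hits = projectIds.filter((id) => covered.has(id)).length
  return (hits / projectIds.length) * 100
}

// assignment_cap_utilization: average % of cap used per judge.
// Judges without a positive cap are skipped so they do not distort the average.
function assignmentCapUtilization(loads: JudgeLoad[]): number {
  const capped = loads.filter((l) => l.cap !== null && l.cap > 0)
  if (capped.length === 0) return 0
  const total = capped.reduce((sum, l) => sum + (l.assigned / (l.cap as number)) * 100, 0)
  return total / capped.length
}
```

Keeping these as pure functions makes the gauge values easy to verify in unit tests before they are wired into whichever metrics backend is chosen.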
### Evaluation Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `evaluations_completed_total` | Counter | Scores submitted, by jury group |
| `evaluation_completion_rate` | Gauge | % of assigned evaluations completed |
| `ai_shortlist_generated` | Counter | AI shortlists generated per round |
| `admin_shortlist_override_count` | Counter | Admin overrides of AI shortlist |
### Deliberation Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `deliberation_session_count` | Counter | Sessions created, by mode |
| `deliberation_duration_seconds` | Histogram | Time from VOTING to LOCKED |
| `deliberation_runoff_count` | Counter | Runoff rounds triggered |
| `deliberation_admin_override_count` | Counter | Admin overrides of deliberation result |
| `result_lock_count` | Counter | Results locked |
| `result_unlock_count` | Counter | Results unlocked (should be rare) |
### Live Finals Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `live_voting_concurrent_users` | Gauge | Concurrent jury + audience users during voting |
| `live_vote_submission_duration_ms` | Histogram | Time to submit a live vote |
| `audience_vote_total` | Counter | Total audience votes submitted |
| `stage_manager_cursor_advances` | Counter | Project cursor advances by admin |
### Mentoring Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `mentor_messages_sent` | Counter | Messages sent in mentor workspaces |
| `mentor_files_uploaded` | Counter | Files uploaded to mentor workspaces |
| `mentor_file_promotions` | Counter | Files promoted to official submissions |
| `mentor_assignment_coverage` | Gauge | % of mentoring-requesting teams with assigned mentors |
---
## Reliability Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `api_request_duration_seconds` | Histogram | tRPC procedure latency, by router and procedure |
| `api_error_rate` | Gauge | % of requests returning errors, by router |
| `job_success_total` | Counter | Background job completions |
| `job_failure_total` | Counter | Background job failures |
| `email_send_total` | Counter | Emails sent (reminders, invitations, notifications) |
| `email_send_failure_total` | Counter | Failed email sends |
| `db_query_duration_seconds` | Histogram | Prisma query latency |
---
## Audit Event Metrics
All admin override actions emit `DecisionAuditLog` records. These metrics track audit completeness and quality:
| Metric | Type | Description |
|--------|------|-------------|
| `audit_events_total` | Counter | Total audit log entries, by `actionType` |
| `audit_events_with_diff` | Counter | Audit entries that include both `beforeState` and `afterState` in `detailsJson` |
| `audit_coverage_percent` | Gauge | % of override actions (eligibility, assignment, shortlist, deliberation, lock/unlock, promotion, cap change) that have a corresponding audit record |
| `audit_missing_diff_count` | Counter | Override actions where `detailsJson` is missing before/after state (should be 0) |
| `audit_correlation_coverage` | Gauge | % of audit entries that include full correlation context (`competitionId`, `roundId`, `projectId`) |
### Audit Completeness Validation
During each phase gate review, run an automated check that verifies:
1. **Every admin override has an audit record** — query `DecisionAuditLog` for each override type and verify count matches the operation count
2. **Before/after state present** — for all OVERRIDE-type actions, `detailsJson` must contain `{ before: {...}, after: {...} }` structure
3. **Correlation completeness** — every audit entry references at minimum `userId`, `competitionId`, and `timestamp`
4. **No orphaned operations** — cross-reference assignment exceptions, result unlocks, eligibility overrides, and file promotions against audit log
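
The four checks above could be automated along the following lines. This is a sketch only: `AuditEntry`, `AuditReport`, `checkAuditCompleteness`, and the `operationId` link between an operation and its audit record are hypothetical names introduced for illustration, and every entry passed in is assumed to be an OVERRIDE-type action.

```typescript
// Hypothetical shapes; only detailsJson, actionType, userId, competitionId
// and timestamp come from the text above.
type AuditEntry = {
  id: string
  actionType: string
  operationId: string // assumed link back to the audited operation
  userId?: string
  competitionId?: string
  timestamp?: string
  detailsJson?: { before?: unknown; after?: unknown }
}

type AuditReport = {
  orphanedOperations: string[] // override operations with no audit entry
  missingDiff: string[] // entry ids lacking before or after state
  missingCorrelation: string[] // entry ids lacking userId, competitionId or timestamp
}

function checkAuditCompleteness(
  overrideOperationIds: string[],
  entries: AuditEntry[],
): AuditReport {
  const audited = new Set(entries.map((e) => e.operationId))
  return {
    orphanedOperations: overrideOperationIds.filter((id) => !audited.has(id)),
    missingDiff: entries
      .filter((e) => e.detailsJson?.before === undefined || e.detailsJson?.after === undefined)
      .map((e) => e.id),
    missingCorrelation: entries
      .filter((e) => !e.userId || !e.competitionId || !e.timestamp)
      .map((e) => e.id),
  }
}
```

A gate review would pass only when all three arrays in the report are empty.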
### Audit Event Correlation
For tracing admin actions across the system, audit log entries should include these correlation fields when available:
```typescript
type AuditCorrelation = {
  competitionId: string
  roundId?: string
  roundType?: RoundType
  juryGroupId?: string
  projectId?: string
  sessionId?: string           // deliberation session
  assignmentIntentId?: string  // when overriding an intent
}
```
This enables queries like: "Show all admin overrides for Competition X, Round 3, sorted by time" or "Show all actions taken on Project Y across all rounds."
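
A query of that kind might look like the sketch below, written against an in-memory list for clarity. The `CorrelatedEntry` wrapper and `findAuditEntries` are hypothetical, `roundType` is left out for brevity, and timestamps are assumed to be ISO 8601 UTC strings so that string order matches time order.

```typescript
// Mirrors the AuditCorrelation fields (minus roundType).
type Correlation = {
  competitionId: string
  roundId?: string
  juryGroupId?: string
  projectId?: string
  sessionId?: string
  assignmentIntentId?: string
}

// Hypothetical wrapper around one audit record.
type CorrelatedEntry = {
  id: string
  actionType: string
  timestamp: string // ISO 8601 UTC (assumption)
  correlation: Correlation
}

// Returns entries matching every field set in `filter`, oldest first.
function findAuditEntries(
  entries: CorrelatedEntry[],
  filter: Partial<Correlation>,
): CorrelatedEntry[] {
  const keys = Object.keys(filter) as (keyof Correlation)[]
  return entries
    .filter((e) => keys.every((k) => filter[k] === undefined || e.correlation[k] === filter[k]))
    .sort((a, b) => a.timestamp.localeCompare(b.timestamp))
}
```

For example, `findAuditEntries(entries, { competitionId: 'X', roundId: 'round-3' })` covers the first query, and `findAuditEntries(entries, { projectId: 'Y' })` covers the second.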
---
## Structured Logging
All log entries must include correlation IDs for tracing:
```typescript
type LogContext = {
  // Request context (requestId is always present)
  requestId: string
  userId?: string
  userRole?: string
  // Competition context (when available)
  programId?: string
  competitionId?: string
  roundId?: string
  roundType?: RoundType
  // Entity context (when relevant)
  juryGroupId?: string
  projectId?: string
  awardId?: string
  sessionId?: string // deliberation session
  // Operation context
  operation: string // e.g., "assignment.run", "deliberation.submitVote"
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}
```
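
One way to emit entries in this shape is a small formatter that writes each event as a single JSON line. The sketch below uses a trimmed copy of `LogContext` so that it stands alone; `LogFields`, `formatLogEntry`, `withContext`, the `ERROR` level, and the `timestamp`/`level`/`event` envelope fields are assumptions for illustration rather than part of the design.

```typescript
type LogLevel = 'INFO' | 'WARN' | 'ERROR' // ERROR is an assumed extra level

// Trimmed copy of LogContext for a self-contained example.
type LogFields = {
  requestId: string
  userId?: string
  competitionId?: string
  roundId?: string
  projectId?: string
  operation: string
  result: 'success' | 'failure' | 'skipped'
  details?: Record<string, unknown>
}

// Serializes one event as a JSON line. Fields that are undefined are
// omitted by JSON.stringify, so optional context never appears as null.
function formatLogEntry(
  level: LogLevel,
  event: string,
  ctx: LogFields,
  now: Date = new Date(),
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...ctx })
}

// Derives a child context so nested operations inherit requestId and
// competition fields while overriding operation-specific ones.
function withContext(base: LogFields, extra: Partial<LogFields>): LogFields {
  return { ...base, ...extra }
}
```

Passing `now` explicitly keeps the formatter deterministic in tests; production callers can rely on the default.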
### Key Log Events
| Event | Level | When |
|-------|-------|------|
| `competition.created` | INFO | New competition created |
| `round.transitioned` | INFO | Round status changed |
| `assignment.completed` | INFO | Assignment algorithm finished |
| `assignment.unassigned` | WARN | Projects remain unassigned |
| `assignment.exception` | WARN | Over-cap assignment made |
| `evaluation.submitted` | INFO | Judge submitted a score |
| `deliberation.voteSubmitted` | INFO | Juror submitted deliberation vote |
| `deliberation.tieDetected` | WARN | Tie detected in deliberation |
| `deliberation.adminOverride` | WARN | Admin overrode deliberation result |
| `resultLock.locked` | INFO | Result locked |
| `resultLock.unlocked` | WARN | Result unlocked (should be rare) |
| `eligibility.override` | WARN | Admin overrode AI eligibility decision |
| `file.promoted` | INFO | Mentor file promoted to submission |
| `liveVoting.windowClosed` | INFO | Live voting window closed |
---
## Alert Definitions
| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| **Unresolved manual queue** | `assignment_unassigned_queue_size > 0` for > 24h | WARNING | Notify admin to manually assign |
| **Quorum mismatch** | Deliberation participants < required quorum | CRITICAL | Block voting, notify admin |
| **Result lock failure** | ResultLock creation fails | CRITICAL | Retry, notify super-admin |
| **Unlock by non-super-admin** | ResultUnlock attempted by non-super-admin | CRITICAL | Block, log security event |
| **AI filtering timeout** | `ai_filtering_duration_seconds > 60` | WARNING | Check OpenAI API, retry |
| **Evaluation completion low** | `evaluation_completion_rate < 50%` with < 24h remaining | WARNING | Send reminder emails |
| **Live voting overload** | `live_voting_concurrent_users > 200` | WARNING | Monitor for performance degradation |
| **Email delivery failure spike** | `email_send_failure_total` increases by more than 10 within 1h | WARNING | Check SMTP connection |
| **API error rate spike** | `api_error_rate > 5%` for any router | CRITICAL | Investigate, potential rollback |
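
The metric-based conditions in this table can be expressed as a pure evaluation function, sketched below for four of the alerts. The `Snapshot` shape and `evaluateAlerts` are hypothetical; a real deployment would more likely encode these rules in its monitoring system's own rule language.

```typescript
type Severity = 'WARNING' | 'CRITICAL'
type Alert = { name: string; severity: Severity }

// Hypothetical point-in-time view of the relevant metrics.
type Snapshot = {
  unassignedQueueSize: number
  unassignedSinceHours: number // how long the queue has been non-empty
  evaluationCompletionRate: number // percent
  hoursUntilDeadline: number
  emailFailuresLastHour: number
  apiErrorRateByRouter: Record<string, number> // percent, keyed by router name
}

function evaluateAlerts(s: Snapshot): Alert[] {
  const alerts: Alert[] = []
  if (s.unassignedQueueSize > 0 && s.unassignedSinceHours > 24) {
    alerts.push({ name: 'Unresolved manual queue', severity: 'WARNING' })
  }
  if (s.evaluationCompletionRate < 50 && s.hoursUntilDeadline < 24) {
    alerts.push({ name: 'Evaluation completion low', severity: 'WARNING' })
  }
  if (s.emailFailuresLastHour > 10) {
    alerts.push({ name: 'Email delivery failure spike', severity: 'WARNING' })
  }
  // The API error alert fires per router, matching "for any router".
  for (const router of Object.keys(s.apiErrorRateByRouter)) {
    if (s.apiErrorRateByRouter[router] > 5) {
      alerts.push({ name: `API error rate spike (${router})`, severity: 'CRITICAL' })
    }
  }
  return alerts
}
```

Thresholds are compared strictly, as written in the table, so an error rate of exactly 5% does not fire.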
---
## Release Gates
Each release gate must be passed before the corresponding implementation phase can proceed to production.
### Gate A: Schema + Backfill Readiness
**When**: Before Phase 1 code reaches production
**Criteria**:
- [ ] All Prisma migrations apply cleanly on staging database
- [ ] Backfill script runs successfully on staging data
- [ ] No existing table modified (new tables only)
- [ ] Rollback tested: drop new tables, verify existing system works
- [ ] Migration takes < 5 minutes on production-sized dataset
---
### Gate B: Policy Engine Correctness
**When**: Before Phase 2 code reaches production
**Criteria**:
- [ ] 5-layer policy precedence returns correct values for all combinations
- [ ] Cap mode behavior correct: HARD stops at cap, SOFT allows buffer, NONE unlimited
- [ ] Category bias applied correctly in assignment
- [ ] COI check prevents assignment to declared conflicts
- [ ] Edge cases: null overrides, conflicting settings, zero cap
- [ ] All policy unit tests pass (100% coverage on resolution logic)
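
To make the cap mode criterion concrete, the sketch below shows one plausible reading of the three modes, including the zero-cap edge case. `CapPolicy`, `softBuffer`, and `canAssign` are assumed names; the actual policy engine may resolve and represent these values differently.

```typescript
type CapMode = 'HARD' | 'SOFT' | 'NONE'

// `softBuffer` is an assumed name for the extra assignments SOFT mode tolerates.
type CapPolicy = { mode: CapMode; cap: number; softBuffer: number }

// True when a judge who already holds `currentLoad` assignments
// may take one more under the given policy.
function canAssign(policy: CapPolicy, currentLoad: number): boolean {
  switch (policy.mode) {
    case 'NONE':
      return true // unlimited
    case 'HARD':
      return currentLoad < policy.cap // stops exactly at the cap
    case 'SOFT':
      return currentLoad < policy.cap + policy.softBuffer // cap plus buffer
  }
}
```

Under this reading, a HARD policy with a cap of zero never allows an assignment, which is the kind of edge case the gate asks the unit tests to pin down.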
---
### Gate C: End-to-End Monaco Flow Simulation
**When**: Before Phase 5 code reaches production
**Criteria**:
- [ ] Full 8-round flow completes on staging with realistic data
- [ ] All round transitions work correctly
- [ ] Multi-round document visibility correct for all roles
- [ ] AI shortlist generated at end of evaluation rounds
- [ ] Assignment algorithm handles 500+ projects across 30 judges
- [ ] Result: finalists selected, no data inconsistencies
---
### Gate D: Invite/Onboarding Stability Under Load
**When**: Before Phase 3 code reaches production
**Criteria**:
- [ ] 50 concurrent invite acceptances process correctly
- [ ] JuryGroupMember records created for all accepted invites
- [ ] Onboarding self-service values saved correctly
- [ ] AssignmentIntent records created for pre-assigned judges
- [ ] No race conditions in concurrent onboarding
---
### Gate E: Live Finals + Deliberation Integrity
**When**: Before Phase 6 code reaches production
**Criteria**:
- [ ] Stage manager cursor advances correctly for all Jury 3 members
- [ ] Live voting window enforcement (no votes outside window)
- [ ] 50 concurrent audience votes processed without conflicts
- [ ] Deliberation SINGLE_WINNER_VOTE tallying correct
- [ ] Deliberation FULL_RANKING Borda count correct
- [ ] Tie-breaking: RUNOFF creates new vote round
- [ ] Tie-breaking: ADMIN_DECIDES records decision
- [ ] ResultLock creates immutable snapshot
- [ ] ResultUnlock blocked for non-super-admin
- [ ] Audience vote totals correctly shown to Jury 3
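
For the FULL_RANKING criterion, a reference tally is useful to test against. The sketch below implements the standard Borda count (with n candidates, first place earns n-1 points and last place earns 0); whether the system uses this exact point scheme is an assumption, and `Ballot`, `bordaScores`, and `topScorers` are illustrative names.

```typescript
// Each ballot lists project ids from most to least preferred.
type Ballot = string[]

// Standard Borda count: position i (0-based) on a ballot earns n - 1 - i points.
function bordaScores(candidates: string[], ballots: Ballot[]): Map<string, number> {
  const scores = new Map<string, number>()
  for (const c of candidates) scores.set(c, 0)
  const n = candidates.length
  for (const ballot of ballots) {
    ballot.forEach((projectId, index) => {
      if (!scores.has(projectId)) return // ignore ids that are not candidates
      scores.set(projectId, (scores.get(projectId) ?? 0) + (n - 1 - index))
    })
  }
  return scores
}

// Returns every candidate sharing the highest score.
// More than one id means a tie that the tie-breaking policy must resolve.
function topScorers(scores: Map<string, number>): string[] {
  const values = Array.from(scores.values())
  if (values.length === 0) return []
  const max = Math.max(...values)
  return Array.from(scores.entries())
    .filter(([, score]) => score === max)
    .map(([id]) => id)
}
```

With ballots `[A, B, C]`, `[A, C, B]`, and `[B, A, C]`, the scores are A = 5, B = 3, C = 1, so A wins outright; two ballots `[A, B]` and `[B, A]` produce a tie that would hand off to RUNOFF or ADMIN_DECIDES.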
---
### Gate F: Operational Readiness
**When**: Before Phase 8 (cutover) begins
**Criteria**:
- [ ] All metrics listed above are being collected and visible in dashboard
- [ ] All alerts listed above are configured and tested
- [ ] Structured logging includes all correlation IDs
- [ ] Runbook exists for common operational scenarios
- [ ] On-call team briefed on new system components
- [ ] Backup and restore tested for new tables
- [ ] Performance baseline established (API latency, DB query times)
---
## Audit Event Targets
Targets and alert thresholds for audit completeness and integrity across the redesigned system:
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| Audit coverage (admin overrides with audit record) | 100% | Any override without audit record |
| Before/after state completeness | 100% of override actions | Missing `before` or `after` in details JSON |
| Assignment intent fulfillment rate | > 80% | < 50% intents HONORED per round |
| Result lock integrity | 0 unauthorized unlocks | Any unlock without super-admin role |
| File replacement provenance chain | 100% linked | Any replacement without `replacedById` |
**Audit Correlation Tracking:**
- Every admin override must include `correlationId` linking the audit entry to the triggering action (e.g., which assignment algorithm run triggered an intent override)
- Round transition events must correlate with the `DecisionAuditLog` entries for that round
- Result lock/unlock events must reference the `DeliberationSession.id` they apply to
---
## Sign-Off Criteria
Final sign-off requires:
1. **All 6 release gates passed** (A through F)
2. **72-hour burn-in period** with zero critical errors
3. **All test matrices passed** (see [11-testing-and-qa.md](./11-testing-and-qa.md))
4. **Documentation updated** (CLAUDE.md, API docs, admin guide)
5. **Architecture owner sign-off** on schema finality
6. **Product owner sign-off** on feature completeness