# 06. Risk Assessment
## 1. Risk Scoring Method
- Probability: 1 (low) to 5 (high)
- Impact: 1 (low) to 5 (high)
- Risk score = Probability × Impact (range 1 to 25)
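
The scoring rule above can be sketched in a few lines of Python. This is an illustration only: the `Risk` class and its field names are hypothetical and not part of the proposal; only the 1 to 5 ranges and the product formula come from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """Hypothetical record for one row of the risk table."""
    risk_id: str
    probability: int  # 1 (low) to 5 (high)
    impact: int       # 1 (low) to 5 (high)

    def __post_init__(self):
        # Reject values outside the 1-5 scale defined in this section.
        for name in ("probability", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def score(self) -> int:
        # Risk score = Probability x Impact
        return self.probability * self.impact


r1 = Risk("R1", probability=3, impact=5)
print(r1.score)  # 15
```

Validating the inputs at construction time keeps a mistyped rating from silently producing a score outside the 1 to 25 range during weekly re-scoring.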
## 2. Top Risks
| ID | Risk | Prob | Impact | Score | Mitigation | Contingency Trigger |
|---|---|---:|---:|---:|---|---|
| R1 | Secret exfiltration via unredacted outbound payload | 3 | 5 | 15 | Multi-layer redaction tests, egress deny-by-default policy, seeded canary secrets | Any unredacted canary secret seen outside the tenant |
| R2 | Command gating bypass due to misclassification | 3 | 5 | 15 | Deterministic policy engine, contract tests per class, human-readable reason logging | Red/Critical command executes without approval in tests |
| R3 | OpenClaw upstream changes break plugin behavior | 3 | 4 | 12 | Pin stable tags, adapter compatibility suite, staged upgrade canaries | Hook contract test fails against new tag |
| R4 | Provisioner regressions reduce provisioning success | 4 | 4 | 16 | Idempotent checkpoints, replay tests, synthetic VPS CI | First-attempt success < 90% |
| R5 | Billing usage mismatch vs provider costs | 3 | 4 | 12 | Dual-entry usage checks, nightly reconciliation jobs, alert thresholds | >1% sustained variance for 24h |
| R6 | Mobile approval notifications delayed or dropped | 3 | 3 | 9 | Push retries, in-app queue fallback, email fallback | p95 approval notification latency > 30s |
| R7 | Performance overhead exceeds Lite-tier budget | 2 | 4 | 8 | Memory profiling budget gates, disable non-essential plugins, tune browser lifecycle | LetsBe overhead > 800MB sustained |
| R8 | Tool API churn breaks adapters | 4 | 3 | 12 | Adapter integration tests against pinned versions, fallback to browser playbook | Adapter failure rate > 5% |
| R9 | Security debt from low-quality AI-generated code | 4 | 4 | 16 | Mandatory senior review on security modules, lint rules, banned-pattern checks | Critical static-analysis finding unresolved >48h |
| R10 | Legal/compliance drift (license/source disclosure pages) | 2 | 4 | 8 | Automated license manifest publishing, pre-release legal checklist | Missing OSS disclosure page at RC freeze |
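
Several of the contingency triggers in the table are plain numeric thresholds, so they can be checked mechanically. The sketch below is one possible shape, not a prescribed design: the thresholds are copied from the table, while the metric names and the `fired_triggers` helper are hypothetical. It also ignores the "sustained" qualifiers (for example 24h for R5), which would need windowed metrics.

```python
import operator

# risk id -> (hypothetical metric name, comparison, threshold from the table)
TRIGGERS = {
    "R4": ("provisioning_first_attempt_success_pct", operator.lt, 90.0),
    "R5": ("billing_variance_pct", operator.gt, 1.0),
    "R6": ("approval_notify_p95_seconds", operator.gt, 30.0),
    "R7": ("letsbe_overhead_mb", operator.gt, 800.0),
    "R8": ("adapter_failure_rate_pct", operator.gt, 5.0),
}


def fired_triggers(metrics):
    """Return the ids of risks whose contingency trigger is currently met."""
    fired = []
    for risk_id, (metric, compare, threshold) in TRIGGERS.items():
        value = metrics.get(metric)
        # A missing metric is skipped here; a real system would likely alert on it.
        if value is not None and compare(value, threshold):
            fired.append(risk_id)
    return fired


sample = {
    "provisioning_first_attempt_success_pct": 88.0,
    "letsbe_overhead_mb": 640.0,
}
print(fired_triggers(sample))  # ['R4']
```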
## 3. Risk Register By Domain
### 3.1 Security Risks
- Redaction misses non-standard secret formats.
- External comms gate incorrectly tied to autonomy level.
- Local logs/transcripts persist raw secret material.
- Local execution adapters allow shell metacharacter bypass.
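
To illustrate the shell metacharacter risk in the last item, the sketch below shows one common defence: parse the command into an argument vector, check the binary against an allowlist, and execute without a shell so that `;`, `|`, and `$(...)` stay literal arguments. The allowlist contents and the function names are assumptions for illustration; the proposal does not specify how local execution adapters are implemented.

```python
import shlex
import subprocess
from typing import List

ALLOWED_BINARIES = {"ls", "cat", "systemctl"}  # hypothetical allowlist


def build_argv(command_line: str) -> List[str]:
    """Split a command line into argv and enforce the binary allowlist."""
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    if argv[0] not in ALLOWED_BINARIES:
        raise PermissionError(f"binary not allowed: {argv[0]!r}")
    return argv


def run_gated(command_line: str) -> subprocess.CompletedProcess:
    """Run an allowlisted command with no shell interpretation."""
    argv = build_argv(command_line)
    # Passing a list with shell=False means no shell ever parses the arguments.
    return subprocess.run(argv, shell=False, capture_output=True, text=True, timeout=10)


# The ';' stays glued to '/tmp' as a literal argument and no second command is chained.
print(build_argv("ls /tmp; rm -rf /"))  # ['ls', '/tmp;', 'rm', '-rf', '/']
```

An allowlist on the binary alone does not constrain arguments, so a real adapter would likely also validate arguments per command.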
### 3.2 Delivery Risks
- Too much simultaneous change across Hub + provisioner + tenant runtime.
- Underestimated migration effort from deprecated orchestrator/sysadmin behaviors.
- Browser automation migration complexity for setup scripts.
### 3.3 Operational Risks
- Dual-region Hub operations increase DB and deploy complexity.
- Insufficient on-call runbooks for approval outages and provisioning failures.
- Canary rollout without automated rollback criteria.
## 4. Mitigation Program
### 4.1 Pre-Launch Controls
- Security invariants are encoded as executable tests (not checklist-only).
- Every release candidate must pass redaction canary probes.
- Dry-run provisioning must pass on both Netcup and Hetzner targets.
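
As an example of a security invariant written as an executable test, a redaction canary probe could look roughly like the following. Everything named here is hypothetical: the canary values, the `leaked_canaries` helper, and the stand-in redactor. The real redaction layers would be plugged in where `known_value_redactor` is used.

```python
from typing import Callable, List

# Hypothetical seeded canary secrets; real ones would be generated per tenant.
CANARY_SECRETS = ["cnry_sk_7f3a9c21", "cnry_dbpass_41x88"]


def leaked_canaries(
    redact: Callable[[str], str],
    template: str = "token={secret} sent to provider",
) -> List[str]:
    """Push each canary through the redactor and report any that survive."""
    leaked = []
    for secret in CANARY_SECRETS:
        outbound = redact(template.format(secret=secret))
        if secret in outbound:
            leaked.append(secret)
    return leaked


def known_value_redactor(payload: str) -> str:
    """Stand-in redactor: replaces registered secret values verbatim."""
    for secret in CANARY_SECRETS:
        payload = payload.replace(secret, "[REDACTED]")
    return payload


# Release gate: the probe must report no surviving canaries.
assert leaked_canaries(known_value_redactor) == []
```

Passing the redactor in as a callable lets the same probe run against each redaction layer and against the full pipeline.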
### 4.2 Runtime Controls
- Alert on heartbeat freshness degradation.
- Alert on approval queue lag and expiration spikes.
- Alert on sudden drop in cache-read ratio (cost anomaly indicator).
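
Two of these alerts reduce to simple comparisons, sketched below. The thresholds (5 minutes, a 0.20 absolute drop) and all function names are assumptions chosen for illustration; this document does not define them, and it does not define how the cache-read ratio is computed, so the sketch assumes cache reads divided by total reads.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_MAX_AGE = timedelta(minutes=5)  # assumed freshness budget
CACHE_RATIO_MAX_DROP = 0.20               # assumed absolute drop that triggers an alert


def heartbeat_is_stale(last_seen, now, max_age=HEARTBEAT_MAX_AGE):
    """True when the most recent heartbeat is older than the freshness budget."""
    return now - last_seen > max_age


def cache_read_ratio(cache_reads, total_reads):
    """Fraction of reads served from cache; 0.0 when there is no traffic."""
    if total_reads == 0:
        return 0.0
    return cache_reads / total_reads


def cache_ratio_dropped(baseline, current, max_drop=CACHE_RATIO_MAX_DROP):
    """True when the ratio fell by more than max_drop relative to the baseline."""
    return baseline - current > max_drop


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(heartbeat_is_stale(now - timedelta(minutes=9), now))  # True
print(cache_ratio_dropped(baseline=0.70, current=cache_read_ratio(45, 100)))  # True
```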
### 4.3 Governance Controls
- Security design review required for changes in Safety Wrapper, redaction, or secrets flows.
- Migration freeze on deprecated paths after Phase 0.
- Weekly risk review with updated probability/impact re-scoring.
## 5. Launch Go/No-Go Risk Gates
Launch is blocked if any of the following conditions is true:
- unresolved severity-1 security defect
- redaction tests fail for any supported secret class
- command gating matrix not fully passing
- usage reconciliation error >1% over 72h canary
- provisioning first-attempt success below 85% in final week
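
The gates above can be evaluated as a single check that returns the list of blocking conditions. This is a minimal sketch: the `LaunchMetrics` fields and the `launch_blockers` function are hypothetical names, and only the conditions and the 1% and 85% thresholds come from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchMetrics:
    """Hypothetical snapshot of the inputs to the go/no-go decision."""
    open_sev1_security_defects: int
    failing_redaction_secret_classes: int
    gating_matrix_failures: int
    reconciliation_error_pct_72h: float
    provisioning_first_attempt_success_pct: float


def launch_blockers(m: LaunchMetrics) -> List[str]:
    """Return every gate that currently blocks launch; empty means go."""
    blockers = []
    if m.open_sev1_security_defects > 0:
        blockers.append("unresolved severity-1 security defect")
    if m.failing_redaction_secret_classes > 0:
        blockers.append("redaction tests fail for a supported secret class")
    if m.gating_matrix_failures > 0:
        blockers.append("command gating matrix not fully passing")
    if m.reconciliation_error_pct_72h > 1.0:
        blockers.append("usage reconciliation error >1% over 72h canary")
    if m.provisioning_first_attempt_success_pct < 85.0:
        blockers.append("provisioning first-attempt success below 85%")
    return blockers


snapshot = LaunchMetrics(0, 0, 0, 0.4, 83.5)
print(launch_blockers(snapshot))  # ['provisioning first-attempt success below 85%']
```

Returning all blockers rather than stopping at the first one gives the launch review a complete picture in a single pass.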