# 07. Testing Strategy Proposal

## 1. Testing Principles

- Security-critical behavior is verified with invariant tests, not only unit coverage.
- Contract-first testing between Hub, Safety Wrapper, Mobile, Website, and Provisioner.
- Fast feedback in CI, deep verification in staging and nightly runs.
- AI-generated code receives stricter review and mutation testing on critical paths.

## 2. Test Pyramid By Component

| Layer | Hub | Safety Wrapper | Provisioner | Mobile/Website |
|---|---|---|---|---|
| Unit | services, validators, policy logic | classifier, redactor, secret resolver, adapters | parser/utils/template render | UI logic, state stores, hooks |
| Integration | Prisma + API handlers + auth | plugin hooks vs OpenClaw test harness | SSH runner against disposable VM | API integration against mock Hub |
| End-to-end | full order/provision/approval/billing flow | tenant command execution path | full 10-step provisioning with checkpoints | chat/approval/onboarding user journeys |
| Security | authz, rate-limit, session hardening | secret exfil tests, gating bypass tests | credential leakage scans | token storage, deep link auth |
| Performance | API p95 and DB load | per-turn latency overhead, memory usage | provisioning duration and retry cost | startup latency, push receipt latency |

## 3. Mandatory Security Invariant Suite

The following automated tests are required before each release:

1. **Secrets Never Leave Server Test**
   - Seed known secrets in vault and files.
   - Trigger prompts/tool outputs containing these values.
   - Assert outbound payloads and persisted logs contain only placeholders.

2. **Command Classification Matrix Test**
   - Execute fixtures for each command class (Green/Yellow/Yellow+External/Red/Critical).
   - Validate behavior across autonomy levels 1-3.

3. **External Comms Independence Test**
   - At autonomy level 3, an external action remains blocked while the comms gate is locked.
   - Unlock only the targeted tool; validate that all others remain blocked.

4. **Approval Expiry Test**
   - An approval request expires after 24 hours.
   - A late approval cannot be replayed.

5. **SECRET_REF Boundary Test**
   - Secrets cannot be requested directly by raw name or value.
   - Only valid references in allowlisted tool operations resolve.
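
The first invariant above can be sketched as a small self-contained test. This is an illustration only, not the project's redactor: `SEEDED_SECRETS`, `redact`, `find_leaks`, and the `[REDACTED:NAME]` placeholder format are hypothetical stand-ins, and a real suite would seed the actual vault and capture real outbound payloads and log files.

```python
from __future__ import annotations

# Hypothetical seeded secrets; a real suite would write these to the test vault and files.
SEEDED_SECRETS = {
    "DB_PASSWORD": "s3cr3t-db-pass",
    "API_KEY": "sk-test-1234567890",
}


def redact(text: str, secrets: dict[str, str]) -> str:
    """Stand-in redactor: replace each raw secret value with a named placeholder."""
    # Longest values first, so a secret containing another secret is replaced whole.
    for name, value in sorted(secrets.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"[REDACTED:{name}]")
    return text


def find_leaks(payloads: list[str], secrets: dict[str, str]) -> list[tuple[int, str]]:
    """Return (payload index, secret name) for every raw secret value still present."""
    return [
        (index, name)
        for index, payload in enumerate(payloads)
        for name, value in secrets.items()
        if value in payload
    ]


def test_secrets_never_leave_server() -> None:
    # Simulated tool output that contains both seeded values.
    tool_output = "connect with password s3cr3t-db-pass and key sk-test-1234567890"
    outbound = [redact(tool_output, SEEDED_SECRETS)]
    log_lines = [redact(f"tool returned: {tool_output}", SEEDED_SECRETS)]

    # The invariant: no raw value in anything sent out or persisted.
    assert find_leaks(outbound + log_lines, SEEDED_SECRETS) == []
    assert "[REDACTED:DB_PASSWORD]" in outbound[0]
```

Keeping the leak scanner separate from the redactor means the assertion does not depend on the code under test, so a redactor bug cannot hide itself.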

## 4. Provisioning Test Strategy

### 4.1 Fast Checks

- ShellCheck + static checks for bash scripts.
- Template substitution tests (all placeholders resolved, none leaked).
- Stack inventory policy tests (no banned tools like n8n).
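
As a rough sketch of the template substitution check, the snippet below renders a template and reports any placeholder that survives into the output. The `{{NAME}}` placeholder syntax and the `render` and `unresolved_placeholders` helpers are assumptions made for illustration; the Provisioner's real template format may differ.

```python
from __future__ import annotations

import re

# Assumed placeholder syntax: {{NAME}} with upper-case names. Adjust to the real templates.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Substitute known placeholders; unknown ones are left in place so they can be detected."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def unresolved_placeholders(rendered: str) -> list[str]:
    """Names of placeholders that leaked into the rendered output."""
    return sorted(set(PLACEHOLDER.findall(rendered)))


def test_all_placeholders_resolved() -> None:
    template = "server_name {{DOMAIN}};\nproxy_pass http://127.0.0.1:{{PORT}};"
    rendered = render(template, {"DOMAIN": "tenant.example.test", "PORT": "8080"})
    assert unresolved_placeholders(rendered) == []
```

Leaving unknown placeholders untouched during rendering, instead of substituting an empty string, is what makes the leak visible to the test.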

### 4.2 Disposable VPS E2E

Nightly automated runs:

- create a disposable Debian VPS
- run full provisioning
- run smoke checks on selected tool endpoints
- verify tenant registration + heartbeat + approvals
- tear down the VPS and collect artifacts
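
The nightly sequence above could be orchestrated along these lines. This is a sketch under stated assumptions: `run_nightly_e2e` and its injected callables (`create_vps`, `provision`, `teardown`, and the named checks) are hypothetical, and the point it illustrates is only that teardown runs even when provisioning or a check raises.

```python
from __future__ import annotations

from typing import Callable


def run_nightly_e2e(
    create_vps: Callable[[], str],
    provision: Callable[[str], None],
    checks: list[tuple[str, Callable[[str], bool]]],
    teardown: Callable[[str], None],
) -> dict[str, bool]:
    """Run provisioning and checks against a disposable VPS, always tearing it down."""
    results: dict[str, bool] = {}
    vps = create_vps()
    try:
        provision(vps)
        # Smoke checks, tenant registration, heartbeat, approvals, and so on.
        for name, check in checks:
            results[name] = bool(check(vps))
    finally:
        # Runs on success and on failure, so no VPS is left behind.
        teardown(vps)
    return results
```

In practice the `teardown` callable would also collect logs and other artifacts before destroying the machine.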

## 5. Contract Testing

- OpenAPI specs for Hub APIs and tenant APIs.
- Consumer-driven contract tests for:
  - Safety Wrapper against Hub tenant endpoints
  - Mobile app against customer endpoints
  - Website onboarding against public endpoints
- A contract break blocks the merge.
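
To make the consumer-driven idea concrete, here is a minimal sketch in which the consumer declares the fields it relies on and the provider's response is checked against that declaration. The endpoint name, the field names, and `contract_violations` are hypothetical; a real setup would more likely derive these checks from the OpenAPI specs or a dedicated contract-testing tool.

```python
from __future__ import annotations

# Hypothetical contract published by the Safety Wrapper (the consumer).
CONSUMER_CONTRACT = {
    "consumer": "safety-wrapper",
    "endpoint": "GET /tenant/heartbeat",  # illustrative endpoint, not from the spec
    "response_fields": {"tenant_id": str, "status": str, "interval_seconds": int},
}


def contract_violations(contract: dict, response: dict) -> list[str]:
    """List every way the provider response breaks what the consumer depends on."""
    problems: list[str] = []
    for field, expected in contract["response_fields"].items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    # Extra fields are tolerated: the provider may add data the consumer ignores.
    return problems
```

A CI job would fail, and so block the merge, whenever this list is non-empty for any registered consumer contract.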

## 6. Data And Billing Validation

- Synthetic token event generator with known totals.
- Reconcile tenant usage buckets against Hub aggregated totals.
- Reconcile Hub totals against Stripe meter/invoice preview.
- Fail the build if variance exceeds the threshold.
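
The reconciliation chain can be sketched as three pairwise comparisons, each measured as relative variance. The proposal does not state a threshold, so the 0.1% default below is purely an assumption, as are the `reconcile` and `relative_variance` helpers; the Stripe total would come from a meter or invoice preview lookup that is not shown here.

```python
from __future__ import annotations

ASSUMED_THRESHOLD = 0.001  # 0.1%; placeholder value, the real threshold is not specified


def relative_variance(expected: float, actual: float) -> float:
    """Absolute difference as a fraction of the expected total."""
    if expected == 0:
        return 0.0 if actual == 0 else float("inf")
    return abs(actual - expected) / expected


def reconcile(
    generated_total: int,
    tenant_buckets: list[int],
    hub_total: int,
    stripe_total: int,
    threshold: float = ASSUMED_THRESHOLD,
) -> list[str]:
    """Return the names of the reconciliation stages whose variance exceeds the threshold."""
    tenant_total = sum(tenant_buckets)
    comparisons = [
        ("generator->tenant", generated_total, tenant_total),
        ("tenant->hub", tenant_total, hub_total),
        ("hub->stripe", hub_total, stripe_total),
    ]
    return [
        name
        for name, expected, actual in comparisons
        if relative_variance(expected, actual) > threshold
    ]
```

Reporting the failing stage by name, not just a pass/fail flag, shows where in the pipeline the tokens went missing.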

## 7. Quality Gates (CI)

- Unit + integration tests must pass.
- Security invariants must pass.
- Critical package diff review for Safety Wrapper and Provisioner.
- Minimum thresholds:
  - security-critical modules: >=90% branch coverage
  - overall backend: >=75% branch coverage
- Mutation testing on classifier and redactor modules.
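
The coverage thresholds above lend themselves to a simple gate script. The sketch below assumes branch coverage has already been aggregated into two percentages; the group keys and the `coverage_gate` function are hypothetical, and only the 90% and 75% minimums come from this proposal.

```python
from __future__ import annotations

# Minimum branch coverage per group, in percent (values from the thresholds listed above).
THRESHOLDS = {"security_critical": 90.0, "backend_overall": 75.0}


def coverage_gate(measured: dict[str, float]) -> list[str]:
    """Return one failure message per group that is below its minimum or missing."""
    failures: list[str] = []
    for group, minimum in THRESHOLDS.items():
        value = measured.get(group)
        if value is None:
            # A missing report is treated as a failure, not as a pass.
            failures.append(f"{group}: no coverage reported")
        elif value < minimum:
            failures.append(f"{group}: {value:.1f}% < {minimum:.1f}%")
    return failures
```

A CI step could print these messages and exit non-zero when the list is non-empty.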

## 8. Human Review Workflow (Anti-AI-Slop)

Required for security-critical PRs:

- one reviewer validates threat model assumptions
- one reviewer validates test completeness and failure cases
- checklist includes: error paths, rollback behavior, idempotency, logging hygiene

No direct auto-merge for changes in:

- redaction engine
- command classifier
- secret storage/resolution
- provisioning credential handling
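
One way to enforce the no-auto-merge rule is a path check in the merge bot. The sketch below is illustrative only: the directory patterns are invented placeholders for wherever the four protected areas actually live, and `may_auto_merge` is a hypothetical helper.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Hypothetical repository paths for the four protected areas; replace with the real layout.
PROTECTED_PATTERNS = [
    "safety-wrapper/redaction/*",
    "safety-wrapper/classifier/*",
    "safety-wrapper/secrets/*",
    "provisioner/credentials/*",
]


def protected_files(changed: list[str]) -> list[str]:
    """Changed files that fall under a protected area (fnmatch `*` also matches `/`)."""
    return [
        path
        for path in changed
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATTERNS)
    ]


def may_auto_merge(changed: list[str]) -> bool:
    """Auto-merge is allowed only when no protected file is touched."""
    return not protected_files(changed)
```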

## 9. Launch Validation Checklist

Before founding-member launch:

- 7-day staging soak with no sev-1/2 defects
- two successful rollback drills (control plane and tenant plane)
- production canary with live approval + billing reconciliation
- first-hour templates executed successfully on staging tenants