4.2 KiB

Raw Blame History

07. Testing Strategy Proposal

1. Testing Principles

Security-critical behavior is verified with invariant tests, not only unit coverage.
Contract-first testing between Hub, Safety Wrapper, Mobile, Website, and Provisioner.
Fast feedback in CI, deep verification in staging and nightly runs.
AI-generated code receives stricter review and mutation testing on critical paths.

2. Test Pyramid By Component

Layer	Hub	Safety Wrapper	Provisioner	Mobile/Website
Unit	services, validators, policy logic	classifier, redactor, secret resolver, adapters	parser/utils/template render	UI logic, state stores, hooks
Integration	Prisma + API handlers + auth	plugin hooks vs OpenClaw test harness	SSH runner against disposable VM	API integration against mock Hub
End-to-end	full order/provision/approval/billing flow	tenant command execution path	full 10-step provisioning with checkpoints	chat/approval/onboarding user journeys
Security	authz, rate-limit, session hardening	secret exfil tests, gating bypass tests	credential leakage scans	token storage, deep link auth
Performance	API p95 and DB load	per-turn latency overhead, memory usage	provisioning duration and retry cost	startup latency, push receipt latency

3. Mandatory Security Invariant Suite

The following automated tests are required before each release:

Secrets Never Leave Server Test

Seed known secrets in vault and files.
Trigger prompts/tool outputs containing these values.
Assert outbound payloads and persisted logs contain only placeholders.

Command Classification Matrix Test

Execute fixtures for each command class (Green/Yellow/Yellow+External/Red/Critical).
Validate behavior across autonomy levels 1-3.

External Comms Independence Test

At autonomy level 3, external action remains blocked when comms gate locked.
Unlock only targeted tool; validate others remain blocked.

Approval Expiry Test

Approval request expires at 24h.
Late approval cannot be replayed.

SECRET_REF Boundary Test

Secrets cannot be requested directly by raw name/value.
Only valid references in allowlisted tool operations resolve.

4. Provisioning Test Strategy

4.1 Fast Checks

Shellcheck + static checks for bash scripts.
Template substitution tests (all placeholders resolved, none leaked).
Stack inventory policy tests (no banned tools like n8n).

4.2 Disposable VPS E2E

Nightly automated runs:

create disposable Debian VPS
run full provisioning
run smoke checks on selected tool endpoints
verify tenant registration + heartbeat + approvals
tear down VPS and collect artifacts

5. Contract Testing

OpenAPI specs for Hub APIs and tenant APIs.
Consumer-driven contract tests for:
- Safety Wrapper against Hub tenant endpoints
- Mobile app against customer endpoints
- Website onboarding against public endpoints
Contract break blocks merge.

6. Data And Billing Validation

Synthetic token event generator with known totals.
Reconcile tenant usage buckets against Hub aggregated totals.
Reconcile Hub totals against Stripe meter/invoice preview.
Fail build if variance exceeds threshold.

7. Quality Gates (CI)

Unit + integration tests must pass.
Security invariants must pass.
Critical package diff review for Safety Wrapper and Provisioner.
Minimum thresholds:
- security-critical modules: >=90% branch coverage
- overall backend: >=75% branch coverage
Mutation testing on classifier and redactor modules.

8. Human Review Workflow (Anti-AI-Slop)

Required for security-critical PRs:

one reviewer validates threat model assumptions
one reviewer validates test completeness and failure cases
checklist includes: error paths, rollback behavior, idempotency, logging hygiene

No direct auto-merge for changes in:

redaction engine
command classifier
secret storage/resolution
provisioning credential handling

9. Launch Validation Checklist

Before founding-member launch:

7-day staging soak with no sev-1/2 defects
two successful rollback drills (control plane and tenant plane)
production canary with live approval + billing reconciliation
first-hour templates executed successfully on staging tenants