
# 07. Testing Strategy Proposal

## 1. Testing Principles

- Security-critical behavior is verified with invariant tests, not only unit coverage.
- Contract-first testing between Hub, Safety Wrapper, Mobile, Website, and Provisioner.
- Fast feedback in CI, deep verification in staging and nightly runs.
- AI-generated code receives stricter review and mutation testing on critical paths.

## 2. Test Pyramid By Component

| Layer | Hub | Safety Wrapper | Provisioner | Mobile/Website |
|---|---|---|---|---|
| Unit | services, validators, policy logic | classifier, redactor, secret resolver, adapters | parser/utils/template render | UI logic, state stores, hooks |
| Integration | Prisma + API handlers + auth | plugin hooks against OpenClaw test harness | SSH runner against disposable VM | API integration against mock Hub |
| End-to-end | full order/provision/approval/billing flow | tenant command execution path | full 10-step provisioning with checkpoints | chat/approval/onboarding user journeys |
| Security | authz, rate-limit, session hardening | secret exfil tests, gating bypass tests | credential leakage scans | token storage, deep link auth |
| Performance | API p95 and DB load | per-turn latency overhead, memory usage | provisioning duration and retry cost | startup latency, push receipt latency |

## 3. Mandatory Security Invariant Suite

The following automated tests are required before each release:

1. **Secrets Never Leave Server Test**
- Seed known secrets in vault and files.
- Trigger prompts/tool outputs containing these values.
- Assert outbound payloads and persisted logs contain only placeholders.
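
The three steps above can be sketched as a small invariant check. This is a minimal illustration, not the project's redactor: the secret names, the `[REDACTED:<name>]` placeholder format, and the `redact` and `leaked_secrets` helpers are all assumptions made for the example.

```python
# Hypothetical seeded secrets: name -> raw value planted in the vault and in files.
SEEDED_SECRETS = {
    "db_password": "s3cr3t-db-pass-91x",
    "billing_api_key": "fake-billing-key-0000",
}


def redact(text, secrets):
    """Stand-in redactor: swap each raw secret value for a named placeholder."""
    for name, value in secrets.items():
        text = text.replace(value, f"[REDACTED:{name}]")
    return text


def leaked_secrets(payload, secrets):
    """Return the names of seeded secrets whose raw value still appears in payload."""
    return [name for name, value in secrets.items() if value in payload]


# A tool output that deliberately contains both seeded values.
tool_output = "login ok: password=s3cr3t-db-pass-91x key=fake-billing-key-0000"

# Both the outbound payload and the persisted log must pass through redaction.
outbound_payload = redact(tool_output, SEEDED_SECRETS)
persisted_log = redact(f"tool returned: {tool_output}", SEEDED_SECRETS)
```

In a real suite, `outbound_payload` and `persisted_log` would be captured from the system under test rather than produced by the stand-in `redact`; the invariant is that `leaked_secrets` returns an empty list for both.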

2. **Command Classification Matrix Test**
- Execute fixtures for each command class (Green/Yellow/Yellow+External/Red/Critical).
- Validate behavior across autonomy levels 1-3.
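
A matrix test of this kind might iterate over every class and level and compare the classifier's decision with an explicit expectation table. The sketch below is illustrative only: the decisions in `EXPECTED` and the thresholds in `AUTO_RUN_FROM` are hypothetical placeholders, not the proposal's actual policy, and `decide` stands in for the real classifier.

```python
from itertools import product

CLASSES = ["green", "yellow", "yellow_external", "red", "critical"]
LEVELS = [1, 2, 3]

# HYPOTHETICAL policy: lowest autonomy level at which a class runs without approval.
# None means the class never runs unattended in this sketch.
AUTO_RUN_FROM = {"green": 1, "yellow": 2, "yellow_external": None, "red": None, "critical": None}


def decide(command_class, level):
    """Stand-in for the classifier under test."""
    threshold = AUTO_RUN_FROM[command_class]
    if threshold is not None and level >= threshold:
        return "allow"
    return "require_approval"


# HYPOTHETICAL expectation table, written out by hand so it is independent of decide().
EXPECTED = {
    ("green", 1): "allow", ("green", 2): "allow", ("green", 3): "allow",
    ("yellow", 1): "require_approval", ("yellow", 2): "allow", ("yellow", 3): "allow",
}
for cls in ("yellow_external", "red", "critical"):
    for lvl in LEVELS:
        EXPECTED[(cls, lvl)] = "require_approval"


def run_matrix(decide_fn, expected):
    """Return (class, level, got, want) for every mismatching or missing cell."""
    failures = []
    for cell in product(CLASSES, LEVELS):
        want = expected.get(cell)
        got = decide_fn(*cell)
        if want is None or got != want:
            failures.append((*cell, got, want))
    return failures
```

Iterating over the full product of classes and levels means a newly added class or level fails the test until its expected behavior is written down.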

3. **External Comms Independence Test**
- At autonomy level 3, external actions remain blocked while the comms gate is locked.
- Unlocking the gate for one targeted tool leaves all other tools blocked.
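
The independence property can be expressed as a gate whose decision does not depend on the autonomy level at all. This is a minimal sketch: `CommsGate`, `may_send_external`, and the tool names used with them are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CommsGate:
    """Hypothetical per-tool gate for external communications; locked by default."""
    unlocked_tools: set = field(default_factory=set)

    def unlock(self, tool):
        self.unlocked_tools.add(tool)

    def allows(self, tool):
        return tool in self.unlocked_tools


def may_send_external(tool, autonomy_level, gate):
    """Decide whether an external action may run.

    The autonomy level is accepted but deliberately ignored: in this sketch
    only the comms gate decides, which is the property under test.
    """
    return gate.allows(tool)
```

A test would then assert that a level-3 request is refused while the gate is locked, that unlocking `"email"` permits only `"email"`, and that a second tool such as `"sms"` stays blocked.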

4. **Approval Expiry Test**
- An approval request expires after 24 hours.
- A late approval is rejected and cannot be replayed.
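
One way to make expiry testable is to pass the clock in explicitly, so the test can move time forward without waiting. The `ApprovalRequest` class below is a hypothetical stand-in; treating an approval as single-use is an additional assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

APPROVAL_TTL = timedelta(hours=24)


class ApprovalRequest:
    """Hypothetical approval record with a fixed 24-hour lifetime."""

    def __init__(self, created_at):
        self.created_at = created_at
        self.consumed = False

    def approve(self, now):
        """Accept only a first approval that arrives before the request expires."""
        expired = now - self.created_at >= APPROVAL_TTL
        if self.consumed or expired:
            return False
        self.consumed = True
        return True


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
```

An approval at 23h59m succeeds once, a repeat of the same approval is refused, and an approval arriving at exactly 24h is refused without marking the request consumed.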

5. **SECRET_REF Boundary Test**
- Secrets cannot be requested directly by raw name/value.
- Only valid references in allowlisted tool operations resolve.
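
The boundary can be illustrated with a resolver that accepts only prefixed references and checks an allowlist of (tool operation, secret name) pairs. The `SECRET_REF:` prefix syntax, the operation names, and every identifier below are assumptions made for the example.

```python
# Hypothetical vault contents and allowlist; names are illustrative only.
VAULT = {"db_password": "s3cr3t-db-pass-91x"}
ALLOWLIST = {("db.query", "db_password")}  # (tool operation, secret name)
REF_PREFIX = "SECRET_REF:"  # assumed reference syntax


class SecretAccessDenied(Exception):
    pass


def resolve(tool_operation, ref):
    """Resolve a reference to its value, or raise SecretAccessDenied."""
    if not ref.startswith(REF_PREFIX):
        # Raw secret names and raw values are never accepted.
        raise SecretAccessDenied("only SECRET_REF references are accepted")
    name = ref[len(REF_PREFIX):]
    if name not in VAULT or (tool_operation, name) not in ALLOWLIST:
        raise SecretAccessDenied("reference is unknown or not allowlisted for this operation")
    return VAULT[name]


def is_denied(tool_operation, ref):
    """Test helper: True when resolve() refuses the request."""
    try:
        resolve(tool_operation, ref)
    except SecretAccessDenied:
        return True
    return False
```

The negative cases matter as much as the positive one: a raw name, a raw value, a valid reference from a non-allowlisted operation, and an unknown reference should all be denied.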

## 4. Provisioning Test Strategy

### 4.1 Fast Checks

- Shellcheck + static checks for bash scripts.
- Template substitution tests (all placeholders resolved, none leaked).
- Stack inventory policy tests (no banned tools like n8n).
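
The template and inventory checks are cheap enough to run on every commit. The sketch below assumes a `{{NAME}}` placeholder syntax, which the proposal does not specify, and uses made-up template text and tool names other than n8n.

```python
import re

# Assumed placeholder syntax: {{UPPER_SNAKE_CASE}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")

BANNED_TOOLS = {"n8n"}


def render(template, values):
    """Substitute known placeholders; leave unknown ones in place so they can be detected."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)


def unresolved(rendered):
    """Return the names of placeholders that leaked into the rendered output."""
    return PLACEHOLDER.findall(rendered)


def banned_in_stack(stack):
    """Return banned tools present in a stack inventory, sorted for stable output."""
    return sorted(set(stack) & BANNED_TOOLS)


template = "server_name {{DOMAIN}}; admin {{ADMIN_EMAIL}};"
partial = render(template, {"DOMAIN": "example.test"})
complete = render(template, {"DOMAIN": "example.test", "ADMIN_EMAIL": "ops@example.test"})
```

Here `unresolved(partial)` reports `ADMIN_EMAIL` as leaked, while `unresolved(complete)` is empty.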

### 4.2 Disposable VPS E2E

Nightly automated runs:

- create disposable Debian VPS
- run full provisioning
- run smoke checks on selected tool endpoints
- verify tenant registration + heartbeat + approvals
- tear down VPS and collect artifacts
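
A driver for the nightly run should guarantee teardown even when a step fails, otherwise failed runs leave billable VMs behind. The following is a minimal sketch of that control flow; `run_nightly`, the step names, and the fake callables are hypothetical, and real steps would call the cloud provider and the Provisioner.

```python
def run_nightly(create_vps, steps, teardown):
    """Run steps in order against a fresh VPS, stop at the first failure, always tear down."""
    report = {"passed": [], "failed": None, "torn_down": False}
    vps = create_vps()
    try:
        for name, step in steps:
            try:
                step(vps)
            except Exception as exc:
                # Record the failing step as an artifact and skip the remaining steps.
                report["failed"] = (name, str(exc))
                break
            report["passed"].append(name)
    finally:
        teardown(vps)
        report["torn_down"] = True
    return report


def _ok(vps):
    pass


def _fail(vps):
    raise RuntimeError("heartbeat not received")


destroyed = []
report = run_nightly(
    create_vps=lambda: "vps-test-1",
    steps=[
        ("provision", _ok),
        ("smoke_checks", _ok),
        ("registration_heartbeat_approvals", _fail),
    ],
    teardown=destroyed.append,
)
```

With the simulated heartbeat failure, the report lists the two passed steps and the failing step, and the fake VPS is still destroyed.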

## 5. Contract Testing

- OpenAPI specs for Hub APIs and tenant APIs.
- Consumer-driven contract tests for:
  - Safety Wrapper against Hub tenant endpoints
  - Mobile app against customer endpoints
  - Website onboarding against public endpoints
- Contract break blocks merge.
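
At its core, a consumer-driven contract check asks whether every field a consumer relies on still exists in the provider's schema with the same type. The sketch below shows that comparison on plain dictionaries; the field names and the flattened schema shape are hypothetical, and a real setup would derive both sides from the OpenAPI specs.

```python
def contract_breaks(consumer_expects, provider_schema):
    """List the ways a provider schema fails a consumer's expectations.

    Extra provider fields are allowed; missing fields and changed types are breaks.
    """
    breaks = []
    for field_name, field_type in consumer_expects.items():
        if field_name not in provider_schema:
            breaks.append(f"missing field: {field_name}")
        elif provider_schema[field_name] != field_type:
            actual = provider_schema[field_name]
            breaks.append(f"type changed: {field_name} ({field_type} -> {actual})")
    return breaks


# Hypothetical expectation of the Safety Wrapper for a Hub tenant endpoint.
WRAPPER_EXPECTS = {"tenant_id": "string", "status": "string"}

provider_ok = {"tenant_id": "string", "status": "string", "created_at": "string"}
provider_broken = {"tenant_id": "integer"}
```

A merge gate would fail whenever `contract_breaks` returns a non-empty list for any registered consumer.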

## 6. Data And Billing Validation

- Synthetic token event generator with known totals.
- Reconcile tenant usage buckets against Hub aggregated totals.
- Reconcile Hub totals against Stripe meter/invoice preview.
- Fail the build if the variance at any reconciliation step exceeds the configured threshold.
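
The reconciliation chain can be checked pairwise, from the generator's known total through tenant buckets and Hub totals to the Stripe figure. This is a sketch under assumptions: the 0.1% default threshold is a placeholder rather than a value from this proposal, and the Stripe total is passed in as a plain number instead of being fetched.

```python
def reconcile(known_total, tenant_buckets, hub_total, stripe_total, max_variance=0.001):
    """Return the names of reconciliation steps whose relative variance exceeds the threshold."""
    tenant_total = sum(tenant_buckets)
    checks = {
        "generator_vs_tenant": (known_total, tenant_total),
        "tenant_vs_hub": (tenant_total, hub_total),
        "hub_vs_stripe": (hub_total, stripe_total),
    }
    failures = []
    for name, (want, got) in checks.items():
        if want == 0:
            # Avoid dividing by zero: any non-zero value against zero is a full mismatch.
            relative = 0.0 if got == 0 else 1.0
        else:
            relative = abs(got - want) / want
        if relative > max_variance:
            failures.append(name)
    return failures
```

For example, a Hub total of 1000 tokens against a Stripe total of 990 is a 1% variance and fails the `hub_vs_stripe` step, while a 50-token gap on 100000 tokens (0.05%) passes.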

## 7. Quality Gates (CI)

- Unit + integration tests must pass.
- Security invariants must pass.
- Diffs touching the Safety Wrapper and Provisioner packages receive a dedicated critical-package review.
- Minimum thresholds:
  - security-critical modules: >=90% branch coverage
  - overall backend: >=75% branch coverage
- Mutation testing on classifier and redactor modules.
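
The coverage thresholds above translate directly into a gate that CI can evaluate after the test run. The scope keys and the shape of the `measured` input are assumptions of this sketch; a real gate would read them from the coverage tool's report.

```python
# Minimum branch coverage per scope, in percent, taken from the thresholds above.
THRESHOLDS = {"security_critical": 90.0, "backend_overall": 75.0}


def coverage_gate(measured):
    """Return one message per scope whose branch coverage is below its minimum.

    A scope missing from the measurements counts as 0% so it cannot pass silently.
    """
    failures = []
    for scope, minimum in THRESHOLDS.items():
        actual = measured.get(scope, 0.0)
        if actual < minimum:
            failures.append(f"{scope}: {actual:.1f}% < {minimum:.1f}%")
    return failures
```

The gate passes at exactly the threshold (the requirement is `>=`) and fails with a readable message otherwise.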

## 8. Human Review Workflow (Anti-AI-Slop)

Required for security-critical PRs:

- one reviewer validates threat model assumptions
- one reviewer validates test completeness and failure cases
- checklist includes: error paths, rollback behavior, idempotency, logging hygiene

No direct auto-merge for changes in:

- redaction engine
- command classifier
- secret storage/resolution
- provisioning credential handling
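
The no-auto-merge rule can be enforced mechanically by matching changed files against a list of protected paths. The glob patterns below are hypothetical, since the proposal does not name the repository layout; only the four protected areas come from the list above.

```python
from fnmatch import fnmatch

# HYPOTHETICAL repository paths for the four protected areas.
PROTECTED_GLOBS = [
    "safety-wrapper/redaction/*",
    "safety-wrapper/classifier/*",
    "safety-wrapper/secrets/*",
    "provisioner/credentials/*",
]


def protected_files(changed_paths):
    """Return the changed files that fall under a protected area."""
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in PROTECTED_GLOBS)
    ]


def auto_merge_allowed(changed_paths):
    """Auto-merge is allowed only when no protected file is touched."""
    return not protected_files(changed_paths)
```

Because `fnmatch` lets `*` match path separators, nested files such as `provisioner/credentials/ssh/keys.sh` are covered by the same patterns.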

## 9. Launch Validation Checklist

Before founding-member launch:

- 7-day staging soak with no sev-1/2 defects
- two successful rollback drills (control plane and tenant plane)
- production canary with live approval + billing reconciliation
- first-hour templates executed successfully on staging tenants

Initial commit: LetsBe Biz project with openclaw source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>. 2026-02-27 16:24:23 +01:00
