LetsBeBiz-Redesign/docs/architecture-proposal/gpt/07-testing-strategy.md

4.2 KiB

07. Testing Strategy Proposal

1. Testing Principles

  • Security-critical behavior is verified with invariant tests, not only unit coverage.
  • Contract-first testing between Hub, Safety Wrapper, Mobile, Website, and Provisioner.
  • Fast feedback in CI, deep verification in staging and nightly runs.
  • AI-generated code receives stricter review and mutation testing on critical paths.

2. Test Pyramid By Component

Layer Hub Safety Wrapper Provisioner Mobile/Website
Unit services, validators, policy logic classifier, redactor, secret resolver, adapters parser/utils/template render UI logic, state stores, hooks
Integration Prisma + API handlers + auth plugin hooks vs OpenClaw test harness SSH runner against disposable VM API integration against mock Hub
End-to-end full order/provision/approval/billing flow tenant command execution path full 10-step provisioning with checkpoints chat/approval/onboarding user journeys
Security authz, rate-limit, session hardening secret exfil tests, gating bypass tests credential leakage scans token storage, deep link auth
Performance API p95 and DB load per-turn latency overhead, memory usage provisioning duration and retry cost startup latency, push receipt latency

3. Mandatory Security Invariant Suite

The following automated tests are required before each release:

  1. Secrets Never Leave Server Test
  • Seed known secrets in vault and files.
  • Trigger prompts/tool outputs containing these values.
  • Assert outbound payloads and persisted logs contain only placeholders.
  1. Command Classification Matrix Test
  • Execute fixtures for each command class (Green/Yellow/Yellow+External/Red/Critical).
  • Validate behavior across autonomy levels 1-3.
  1. External Comms Independence Test
  • At autonomy level 3, external action remains blocked when comms gate locked.
  • Unlock only targeted tool; validate others remain blocked.
  1. Approval Expiry Test
  • Approval request expires at 24h.
  • Late approval cannot be replayed.
  1. SECRET_REF Boundary Test
  • Secrets cannot be requested directly by raw name/value.
  • Only valid references in allowlisted tool operations resolve.

4. Provisioning Test Strategy

4.1 Fast Checks

  • Shellcheck + static checks for bash scripts.
  • Template substitution tests (all placeholders resolved, none leaked).
  • Stack inventory policy tests (no banned tools like n8n).

4.2 Disposable VPS E2E

Nightly automated runs:

  • create disposable Debian VPS
  • run full provisioning
  • run smoke checks on selected tool endpoints
  • verify tenant registration + heartbeat + approvals
  • tear down VPS and collect artifacts

5. Contract Testing

  • OpenAPI specs for Hub APIs and tenant APIs.
  • Consumer-driven contract tests for:
    • Safety Wrapper against Hub tenant endpoints
    • Mobile app against customer endpoints
    • Website onboarding against public endpoints
  • Contract break blocks merge.

6. Data And Billing Validation

  • Synthetic token event generator with known totals.
  • Reconcile tenant usage buckets against Hub aggregated totals.
  • Reconcile Hub totals against Stripe meter/invoice preview.
  • Fail build if variance exceeds threshold.

7. Quality Gates (CI)

  • Unit + integration tests must pass.
  • Security invariants must pass.
  • Critical package diff review for Safety Wrapper and Provisioner.
  • Minimum thresholds:
    • security-critical modules: >=90% branch coverage
    • overall backend: >=75% branch coverage
  • Mutation testing on classifier and redactor modules.

8. Human Review Workflow (Anti-AI-Slop)

Required for security-critical PRs:

  • one reviewer validates threat model assumptions
  • one reviewer validates test completeness and failure cases
  • checklist includes: error paths, rollback behavior, idempotency, logging hygiene

No direct auto-merge for changes in:

  • redaction engine
  • command classifier
  • secret storage/resolution
  • provisioning credential handling

9. Launch Validation Checklist

Before founding-member launch:

  • 7-day staging soak with no sev-1/2 defects
  • two successful rollback drills (control plane and tenant plane)
  • production canary with live approval + billing reconciliation
  • first-hour templates executed successfully on staging tenants