03. Deployment Strategy

1. Goals

  • Ship to founding members in ~12 weeks without compromising security invariants.
  • Maintain one-VPS-per-customer isolation.
  • Keep OpenClaw upstream-pinned and independently upgradeable.
  • Make tenant rollout reversible with fast rollback paths.

2. Environment Topology

2.1 Control Plane Environments

| Environment | Purpose | Data |
|---|---|---|
| dev | Rapid feature iteration | Synthetic/local data |
| staging | Release-candidate validation, e2e, load, security checks | Sanitized fixtures |
| prod-eu | EU customers (default EU routing) | Real customer data |
| prod-us | NA customers (default NA routing) | Real customer data |

Control plane services (Hub, worker, and notifications) are deployed per region. Each region has its own database, and each tenant is served only by the control plane of its home region.
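The region affinity could be expressed as a simple lookup. This is a sketch only: the environment names come from the table above, while the region codes (`EU`, `NA`) and the function name `control_plane_for` are assumptions made for illustration.

```python
# Hypothetical mapping from a customer's home region to the production
# control plane that owns its data. Region codes are assumptions.
CONTROL_PLANE_BY_REGION = {"EU": "prod-eu", "NA": "prod-us"}


def control_plane_for(region: str) -> str:
    """Return the production control plane environment for a region code."""
    try:
        return CONTROL_PLANE_BY_REGION[region.upper()]
    except KeyError:
        # Fail closed: never guess where real customer data should live.
        raise ValueError(f"no control plane configured for region {region!r}") from None
```

Failing on an unknown region, rather than falling back to a default, keeps customer data from landing in the wrong regional database by accident.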

2.2 Tenant Environments

  • sandbox tenants: internal QA and interactive demo pool.
  • canary tenants: first real-production update recipients.
  • general tenants: full customer fleet.

3. Deployment Units

3.1 Control Plane Units

  • hub-web-api container (Next.js standalone runtime)
  • hub-worker container (automation + billing jobs)
  • notifications container (push/email delivery)
  • postgres (managed or self-hosted HA)

3.2 Tenant Units (Per Customer VPS)

  • openclaw container (upstream image/tag pinned)
  • safety-wrapper plugin package mounted into OpenClaw extension dir
  • egress-proxy service (localhost-only)
  • tool containers and nginx from provisioner
  • local SQLite data stores for secrets/approvals/metering

4. Provisioning Deployment Plan

4.1 Provisioner Mode

Continue with existing one-shot SSH provisioner flow, retooled to:

  • deploy OpenClaw + Safety components
  • remove legacy orchestrator/sysadmin deployment
  • strip deprecated stacks and n8n references
  • write secrets into encrypted vault only (no plaintext long-lived config)

4.2 Immutable Artifact Inputs

Provisioning uses pinned artifacts only:

  • OpenClaw release tag (stable channel pin)
  • Safety Wrapper image/package digest
  • Tool stack compose templates with hash
  • policy bundle version + checksum
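A minimal sketch of how the provisioner might check inputs against their pins before deploying anything. The `PinnedArtifact` type, the manifest shape, and the use of SHA-256 are assumptions; the proposal only states that artifacts are pinned by tag, digest, hash, or checksum.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PinnedArtifact:
    """Hypothetical manifest entry: an artifact name and its expected digest."""
    name: str
    sha256: str  # hex digest recorded when the release was cut


def verify_artifact(artifact: PinnedArtifact, payload: bytes) -> bool:
    """Return True only if the payload hashes to the pinned digest."""
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, artifact.sha256.lower())


def failed_artifacts(manifest: Iterable[PinnedArtifact],
                     payloads: Dict[str, bytes]) -> List[str]:
    """Return names of artifacts that are missing or do not match their pin."""
    failures = []
    for artifact in manifest:
        payload = payloads.get(artifact.name)
        if payload is None or not verify_artifact(artifact, payload):
            failures.append(artifact.name)
    return failures
```

Provisioning would proceed only when `failed_artifacts(...)` returns an empty list.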

5. Secrets And Credential Deployment

  • Registration token is one-time and short-lived.
  • Tenant API key is returned once at registration; only its hash is stored in the Hub DB.
  • Provisioner writes bootstrap secrets to a tmpfs file that is consumed once, then shredded.
  • Existing plaintext job config path (jobs/<id>/config.json) is replaced by an encrypted payload with ephemeral decrypt-on-run.
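The registration rules above can be sketched as follows. This is an in-memory illustration, not the Hub's implementation: the class name `RegistrationStore`, the 15-minute token lifetime, and the choice of unsalted SHA-256 (reasonable only because the keys are long random values) are all assumptions.

```python
import hashlib
import hmac
import secrets
import time
from typing import Dict, Optional

TOKEN_TTL_SECONDS = 15 * 60  # assumed lifetime; the proposal only says "short-lived"


def _sha256(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class RegistrationStore:
    """Hypothetical stand-in for the Hub DB tables holding tokens and key hashes."""

    def __init__(self) -> None:
        self._token_expiry: Dict[str, float] = {}  # token hash -> expiry timestamp
        self._api_key_hash: Dict[str, str] = {}    # tenant id -> API key hash

    def issue_token(self, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._token_expiry[_sha256(token)] = now + TOKEN_TTL_SECONDS
        return token

    def register(self, tenant_id: str, token: str,
                 now: Optional[float] = None) -> Optional[str]:
        """Exchange a valid token for an API key; return None if rejected."""
        now = time.time() if now is None else now
        # pop() removes the token on first use, which makes it one-time.
        expiry = self._token_expiry.pop(_sha256(token), None)
        if expiry is None or now > expiry:
            return None
        api_key = secrets.token_urlsafe(32)
        self._api_key_hash[tenant_id] = _sha256(api_key)  # plaintext is never stored
        return api_key

    def authenticate(self, tenant_id: str, api_key: str) -> bool:
        stored = self._api_key_hash.get(tenant_id)
        return stored is not None and hmac.compare_digest(stored, _sha256(api_key))
```

The plaintext API key exists only in the return value of `register`, so a Hub database leak exposes hashes rather than usable credentials.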

6. Release Strategy

6.1 Control Plane

  • Trunk-based merges behind feature flags.
  • Deploy via Gitea Actions with staged promotions (dev -> staging -> prod).
  • DB migrations run in expand/contract pattern.
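A minimal sketch of the expand/contract pattern, using an in-memory SQLite database only so the example runs anywhere. The `tenants` table and the `region` and `region_code` columns are hypothetical; the Hub's real schema and migration tooling are not specified in this document.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add the new column as nullable so already-deployed code keeps working."""
    conn.execute("ALTER TABLE tenants ADD COLUMN region_code TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    """Copy data into the new column; safe to re-run because it skips filled rows."""
    conn.execute(
        "UPDATE tenants SET region_code = UPPER(region) "
        "WHERE region_code IS NULL"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("INSERT INTO tenants (region) VALUES ('eu'), ('na')")

expand(conn)
# A not-yet-upgraded writer that only knows the old column still succeeds.
conn.execute("INSERT INTO tenants (region) VALUES ('eu')")
backfill(conn)

# Contract (dropping the old `region` column) ships in a later release,
# once no deployed code reads or writes it.
rows = conn.execute("SELECT region_code FROM tenants ORDER BY id").fetchall()
```

Because the expand step is additive, the previous container digest remains deployable against the expanded schema, which is what makes a fast control plane rollback possible.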

6.2 Tenant Plane

Tenant updates split into independent channels:

  • policy-only: classification/autonomy/tool policy updates (no binary change)
  • wrapper patch: Safety Wrapper version bump
  • openclaw bump: upstream release bump (separate tracked campaign)

Rollout:

  1. Internal sandbox tenants
  2. 5% canary customer tenants
  3. 25% of customer tenants
  4. 100% of customer tenants
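One way to pick which tenants fall into each percentage stage is stable hash bucketing, sketched below. The function names and the use of SHA-256 over the tenant ID are assumptions; the proposal does not say how cohorts are chosen.

```python
import hashlib


def tenant_bucket(tenant_id: str) -> int:
    """Map a tenant ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_in_stage(tenant_id: str, percent: int, is_sandbox: bool = False) -> bool:
    """Return True if the tenant should receive the update at this rollout percentage.

    Sandbox tenants are always included, so a stage with percent=0 reaches
    internal sandbox tenants only.
    """
    if is_sandbox:
        return True
    return tenant_bucket(tenant_id) < percent
```

Because the bucket depends only on the tenant ID, every tenant in the 5% canary stage is also in the 25% stage, so widening a rollout never skips a tenant that was already updated.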

Auto-stop criteria:

  • redaction test failure
  • approval-routing failure >1%
  • tenant heartbeat drop >3%
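The auto-stop criteria could be evaluated per rollout stage roughly as follows. The `RolloutMetrics` fields are hypothetical, and two interpretations are assumed: any single redaction test failure halts the rollout, and the two percentages are measured as fractions over the tenants in the current stage.

```python
from dataclasses import dataclass
from typing import List

APPROVAL_ROUTING_FAILURE_LIMIT = 0.01  # >1% halts the rollout
HEARTBEAT_DROP_LIMIT = 0.03            # >3% halts the rollout


@dataclass(frozen=True)
class RolloutMetrics:
    """Hypothetical per-stage metrics collected after an update is applied."""
    redaction_tests_failed: int
    approval_routing_failure_rate: float  # fraction in 0.0..1.0
    heartbeat_drop_rate: float            # fraction in 0.0..1.0


def auto_stop_reasons(metrics: RolloutMetrics) -> List[str]:
    """Return the triggered stop criteria; an empty list means continue."""
    reasons = []
    if metrics.redaction_tests_failed > 0:
        reasons.append("redaction test failure")
    if metrics.approval_routing_failure_rate > APPROVAL_ROUTING_FAILURE_LIMIT:
        reasons.append("approval-routing failure >1%")
    if metrics.heartbeat_drop_rate > HEARTBEAT_DROP_LIMIT:
        reasons.append("tenant heartbeat drop >3%")
    return reasons
```

Returning the reasons rather than a bare boolean lets the rollout controller record why a campaign was halted.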

7. Rollback Strategy

7.1 Control Plane Rollback

  • Keep last two container digests deployable.
  • Migration rollback policy: roll back only reversible migrations; otherwise fix forward with a hotfix.

7.2 Tenant Rollback

  • Policy rollback via previous signed policy bundle.
  • Wrapper rollback to previous plugin package.
  • OpenClaw rollback to previous pinned stable tag after compatibility check.

8. Observability And SLOs

8.1 Required Telemetry

  • tenant heartbeat latency and freshness
  • approval queue latency (request -> decision)
  • redaction pipeline counters (matches by layer)
  • token usage ingest lag
  • provisioning success/failure per step

8.2 Launch SLO Targets

  • Hub API availability: 99.9%
  • Tenant heartbeat freshness: 99% of tenants have a heartbeat less than 2 minutes old
  • Approval propagation: p95 < 5 seconds (Hub to mobile push)
  • First-attempt provisioning success: >= 90%
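As a worked example, two of these targets can be computed from raw samples as shown below. The function names are illustrative, heartbeat freshness is assumed to mean the share of tenants whose latest heartbeat is under 2 minutes old, and p95 uses the nearest-rank method, which is one of several common percentile definitions.

```python
from typing import Sequence


def fraction_fresh(heartbeat_ages_s: Sequence[float], max_age_s: float = 120.0) -> float:
    """Share of tenants whose most recent heartbeat is younger than max_age_s."""
    if not heartbeat_ages_s:
        return 0.0
    fresh = sum(1 for age in heartbeat_ages_s if age < max_age_s)
    return fresh / len(heartbeat_ages_s)


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Four tenants, one of them stale: 3 of 4 are fresh, which misses the 99% target.
freshness = fraction_fresh([10, 30, 119, 500])
# Twenty approval propagation latencies in seconds: the 19th smallest is the p95.
latency_p95 = p95([0.5] * 18 + [4.0, 9.0])
```

Here `freshness` is 0.75 and `latency_p95` is 4.0, so this sample would meet the approval propagation target but not the heartbeat freshness target.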

9. Dual-Provider Strategy (Netcup + Hetzner)

  • Primary capacity pool on Netcup (EU/US).
  • Overflow path on Hetzner with same provisioner scripts and hardened baseline.
  • Provider adapter abstraction lives in Hub server-provisioning module; provisioner remains Debian-focused and provider-agnostic.
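The provider adapter abstraction might look like the sketch below. Every name here (`ProviderAdapter`, `ServerHandle`, `allocate_server`, `StaticAdapter`) is hypothetical, and the adapters are in-memory fakes; no real Netcup or Hetzner API calls are shown.

```python
from dataclasses import dataclass
from typing import FrozenSet, Protocol, Sequence


@dataclass(frozen=True)
class ServerHandle:
    """What the provisioner needs to start its SSH flow."""
    provider: str
    ip_address: str


class ProviderAdapter(Protocol):
    name: str

    def has_capacity(self, region: str) -> bool: ...

    def create_server(self, region: str) -> ServerHandle: ...


class CapacityError(RuntimeError):
    """Raised when no provider can host a new tenant in the region."""


def allocate_server(adapters: Sequence[ProviderAdapter], region: str) -> ServerHandle:
    """Try adapters in priority order: primary pool first, then overflow."""
    for adapter in adapters:
        if adapter.has_capacity(region):
            return adapter.create_server(region)
    raise CapacityError(f"no provider has capacity in region {region!r}")


@dataclass(frozen=True)
class StaticAdapter:
    """In-memory fake used for illustration and tests only."""
    name: str
    regions_with_capacity: FrozenSet[str]
    ip_address: str

    def has_capacity(self, region: str) -> bool:
        return region in self.regions_with_capacity

    def create_server(self, region: str) -> ServerHandle:
        return ServerHandle(provider=self.name, ip_address=self.ip_address)


# Addresses are from the reserved documentation range 192.0.2.0/24.
netcup = StaticAdapter("netcup", frozenset({"EU"}), "192.0.2.10")
hetzner = StaticAdapter("hetzner", frozenset({"EU", "US"}), "192.0.2.20")
```

With `[netcup, hetzner]` as the priority order, `allocate_server` returns a Netcup server for `EU` and overflows to Hetzner for `US`. The provisioner only ever sees a `ServerHandle`, which keeps it provider-agnostic.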

10. Cutover Plan From Current State

  1. Freeze legacy orchestrator/sysadmin deployment paths.
  2. Land prerequisite cleanup release (n8n/deprecated removal + credential leak fix).
  3. Enable new tenant register/heartbeat APIs in Hub.
  4. Provision first new-architecture internal tenant.
  5. Execute parallel-run window (old and new provisioning flows side-by-side for internal only).
  6. Flip default provisioning to new flow for production orders.