LetsBeBiz-Redesign/docs/architecture-proposal/gpt/03-deployment-strategy.md

146 lines
4.7 KiB
Markdown
Raw Normal View History

# 03. Deployment Strategy
## 1. Goals
- Ship to founding members in ~12 weeks without compromising security invariants.
- Maintain one-VPS-per-customer isolation.
- Keep OpenClaw upstream-pinned and independently upgradeable.
- Make tenant rollout reversible with fast rollback paths.
## 2. Environment Topology
## 2.1 Control Plane Environments
| Environment | Purpose | Data |
|---|---|---|
| `dev` | Rapid feature iteration | Synthetic/local data |
| `staging` | Release-candidate validation, e2e, load, security checks | Sanitized fixtures |
| `prod-eu` | EU customers (default EU routing) | Real customer data |
| `prod-us` | NA customers (default NA routing) | Real customer data |
Control plane services (Hub + worker + notifications) are region-deployed with independent DBs and clear region affinity.
## 2.2 Tenant Environments
- `sandbox tenants`: internal QA and interactive demo pool.
- `canary tenants`: first real-production update recipients.
- `general tenants`: full customer fleet.
## 3. Deployment Units
## 3.1 Control Plane Units
- `hub-web-api` container (Next.js standalone runtime)
- `hub-worker` container (automation + billing jobs)
- `notifications` container (push/email delivery)
- `postgres` (managed or self-hosted HA)
## 3.2 Tenant Units (Per Customer VPS)
- `openclaw` container (upstream image/tag pinned)
- `safety-wrapper` plugin package mounted into OpenClaw extension dir
- `egress-proxy` service (localhost-only)
- tool containers and nginx from provisioner
- local SQLite data stores for secrets/approvals/metering
## 4. Provisioning Deployment Plan
## 4.1 Provisioner Mode
Continue with existing one-shot SSH provisioner flow, retooled to:
- deploy OpenClaw + Safety components
- remove legacy orchestrator/sysadmin deployment
- strip deprecated stacks and n8n references
- write secrets into encrypted vault only (no plaintext long-lived config)
## 4.2 Immutable Artifact Inputs
Provisioning uses pinned artifacts only:
- OpenClaw release tag (`stable` channel pin)
- Safety Wrapper image/package digest
- Tool stack compose templates with hash
- policy bundle version + checksum
## 5. Secrets And Credential Deployment
- Registration token is one-time and short-lived.
- Tenant API key returned at registration; only hash stored in Hub DB.
- Provisioner writes bootstrap secrets to tmpfs file, consumed once, then shredded.
- Existing plaintext job config path (`jobs/<id>/config.json`) replaced by encrypted payload + ephemeral decrypt-on-run.
## 6. Release Strategy
## 6.1 Control Plane
- Trunk-based merges behind feature flags.
- Deploy via Gitea Actions with staged promotions (`dev -> staging -> prod`).
- DB migrations run in expand/contract pattern.
## 6.2 Tenant Plane
Tenant updates split into independent channels:
- `policy-only`: classification/autonomy/tool policy updates (no binary change)
- `wrapper patch`: Safety Wrapper version bump
- `openclaw bump`: upstream release bump (separate tracked campaign)
Rollout:
1. Internal sandbox tenants
2. 5% canary customer tenants
3. 25%
4. 100%
Auto-stop criteria:
- redaction test failure
- approval-routing failure >1%
- tenant heartbeat drop >3%
## 7. Rollback Strategy
## 7.1 Control Plane Rollback
- Keep last two container digests deployable.
- Migration rollback policy: only for reversible migrations; otherwise hotfix-forward.
## 7.2 Tenant Rollback
- Policy rollback via previous signed policy bundle.
- Wrapper rollback to previous plugin package.
- OpenClaw rollback to previous pinned stable tag after compatibility check.
## 8. Observability And SLOs
## 8.1 Required Telemetry
- tenant heartbeat latency and freshness
- approval queue latency (request -> decision)
- redaction pipeline counters (matches by layer)
- token usage ingest lag
- provisioning success/failure per step
## 8.2 Launch SLO Targets
- Hub API availability: 99.9%
- Tenant heartbeat freshness: 99% under 2 minutes
- Approval propagation: p95 < 5 seconds (Hub to mobile push)
- Provisioning success first-attempt: >= 90%
## 9. Dual-Provider Strategy (Netcup + Hetzner)
- Primary capacity pool on Netcup (EU/US).
- Overflow path on Hetzner with same provisioner scripts and hardened baseline.
- Provider adapter abstraction lives in Hub `server-provisioning` module; provisioner remains Debian-focused and provider-agnostic.
## 10. Cutover Plan From Current State
1. Freeze legacy orchestrator/sysadmin deployment paths.
2. Land prerequisite cleanup release (n8n/deprecated removal + credential leak fix).
3. Enable new tenant register/heartbeat APIs in Hub.
4. Provision first new-architecture internal tenant.
5. Execute parallel-run window (old and new provisioning flows side-by-side for internal only).
6. Flip default provisioning to new flow for production orders.