146 lines
4.7 KiB
Markdown
146 lines
4.7 KiB
Markdown
# 03. Deployment Strategy
|
|
|
|
## 1. Goals
|
|
|
|
- Ship to founding members in ~12 weeks without compromising security invariants.
|
|
- Maintain one-VPS-per-customer isolation.
|
|
- Keep OpenClaw upstream-pinned and independently upgradeable.
|
|
- Make tenant rollout reversible with fast rollback paths.
|
|
|
|
## 2. Environment Topology
|
|
|
|
## 2.1 Control Plane Environments
|
|
|
|
| Environment | Purpose | Data |
|
|
|---|---|---|
|
|
| `dev` | Rapid feature iteration | Synthetic/local data |
|
|
| `staging` | Release-candidate validation, e2e, load, security checks | Sanitized fixtures |
|
|
| `prod-eu` | EU customers (default EU routing) | Real customer data |
|
|
| `prod-us` | NA customers (default NA routing) | Real customer data |
|
|
|
|
Control plane services (Hub + worker + notifications) are region-deployed with independent DBs and clear region affinity.
|
|
|
|
## 2.2 Tenant Environments
|
|
|
|
- `sandbox tenants`: internal QA and interactive demo pool.
|
|
- `canary tenants`: first real-production update recipients.
|
|
- `general tenants`: full customer fleet.
|
|
|
|
## 3. Deployment Units
|
|
|
|
## 3.1 Control Plane Units
|
|
|
|
- `hub-web-api` container (Next.js standalone runtime)
|
|
- `hub-worker` container (automation + billing jobs)
|
|
- `notifications` container (push/email delivery)
|
|
- `postgres` (managed or self-hosted HA)
|
|
|
|
## 3.2 Tenant Units (Per Customer VPS)
|
|
|
|
- `openclaw` container (upstream image/tag pinned)
|
|
- `safety-wrapper` plugin package mounted into OpenClaw extension dir
|
|
- `egress-proxy` service (localhost-only)
|
|
- tool containers and nginx from provisioner
|
|
- local SQLite data stores for secrets/approvals/metering
|
|
|
|
## 4. Provisioning Deployment Plan
|
|
|
|
## 4.1 Provisioner Mode
|
|
|
|
Continue with existing one-shot SSH provisioner flow, retooled to:
|
|
|
|
- deploy OpenClaw + Safety components
|
|
- remove legacy orchestrator/sysadmin deployment
|
|
- strip deprecated stacks and n8n references
|
|
- write secrets into encrypted vault only (no plaintext long-lived config)
|
|
|
|
## 4.2 Immutable Artifact Inputs
|
|
|
|
Provisioning uses pinned artifacts only:
|
|
|
|
- OpenClaw release tag (`stable` channel pin)
|
|
- Safety Wrapper image/package digest
|
|
- Tool stack compose templates with hash
|
|
- policy bundle version + checksum
|
|
|
|
## 5. Secrets And Credential Deployment
|
|
|
|
- Registration token is one-time and short-lived.
|
|
- Tenant API key returned at registration; only hash stored in Hub DB.
|
|
- Provisioner writes bootstrap secrets to tmpfs file, consumed once, then shredded.
|
|
- Existing plaintext job config path (`jobs/<id>/config.json`) replaced by encrypted payload + ephemeral decrypt-on-run.
|
|
|
|
## 6. Release Strategy
|
|
|
|
## 6.1 Control Plane
|
|
|
|
- Trunk-based merges behind feature flags.
|
|
- Deploy via Gitea Actions with staged promotions (`dev -> staging -> prod`).
|
|
- DB migrations run in expand/contract pattern.
|
|
|
|
## 6.2 Tenant Plane
|
|
|
|
Tenant updates split into independent channels:
|
|
|
|
- `policy-only`: classification/autonomy/tool policy updates (no binary change)
|
|
- `wrapper patch`: Safety Wrapper version bump
|
|
- `openclaw bump`: upstream release bump (separate tracked campaign)
|
|
|
|
Rollout:
|
|
|
|
1. Internal sandbox tenants
|
|
2. 5% canary customer tenants
|
|
3. 25%
|
|
4. 100%
|
|
|
|
Auto-stop criteria:
|
|
|
|
- redaction test failure
|
|
- approval-routing failure >1%
|
|
- tenant heartbeat drop >3%
|
|
|
|
## 7. Rollback Strategy
|
|
|
|
## 7.1 Control Plane Rollback
|
|
|
|
- Keep last two container digests deployable.
|
|
- Migration rollback policy: only for reversible migrations; otherwise hotfix-forward.
|
|
|
|
## 7.2 Tenant Rollback
|
|
|
|
- Policy rollback via previous signed policy bundle.
|
|
- Wrapper rollback to previous plugin package.
|
|
- OpenClaw rollback to previous pinned stable tag after compatibility check.
|
|
|
|
## 8. Observability And SLOs
|
|
|
|
## 8.1 Required Telemetry
|
|
|
|
- tenant heartbeat latency and freshness
|
|
- approval queue latency (request -> decision)
|
|
- redaction pipeline counters (matches by layer)
|
|
- token usage ingest lag
|
|
- provisioning success/failure per step
|
|
|
|
## 8.2 Launch SLO Targets
|
|
|
|
- Hub API availability: 99.9%
|
|
- Tenant heartbeat freshness: 99% under 2 minutes
|
|
- Approval propagation: p95 < 5 seconds (Hub to mobile push)
|
|
- Provisioning success first-attempt: >= 90%
|
|
|
|
## 9. Dual-Provider Strategy (Netcup + Hetzner)
|
|
|
|
- Primary capacity pool on Netcup (EU/US).
|
|
- Overflow path on Hetzner with same provisioner scripts and hardened baseline.
|
|
- Provider adapter abstraction lives in Hub `server-provisioning` module; provisioner remains Debian-focused and provider-agnostic.
|
|
|
|
## 10. Cutover Plan From Current State
|
|
|
|
1. Freeze legacy orchestrator/sysadmin deployment paths.
|
|
2. Land prerequisite cleanup release (n8n/deprecated removal + credential leak fix).
|
|
3. Enable new tenant register/heartbeat APIs in Hub.
|
|
4. Provision first new-architecture internal tenant.
|
|
5. Execute parallel-run window (old and new provisioning flows side-by-side for internal only).
|
|
6. Flip default provisioning to new flow for production orders.
|