# LetsBe Biz — CI/CD Strategy
**Date:** February 27, 2026
**Team:** Claude Opus 4.6 Architecture Team
**Document:** 08 of 09
**Status:** Proposal (competing with an independent team)

---
## Table of Contents
1. [CI/CD Overview](#1-cicd-overview)
2. [Gitea Actions Pipelines](#2-gitea-actions-pipelines)
3. [Branch Strategy](#3-branch-strategy)
4. [Build & Publish](#4-build--publish)
5. [Deployment Workflows](#5-deployment-workflows)
6. [Rollback Procedures](#6-rollback-procedures)
7. [Secret Management in CI](#7-secret-management-in-ci)
8. [Quality Gates in CI](#8-quality-gates-in-ci)
9. [Monitoring & Alerting](#9-monitoring--alerting)
---
## 1. CI/CD Overview
### Platform: Gitea Actions
Gitea Actions is the CI/CD platform (Architecture Brief §9.1). Its workflow YAML syntax is largely compatible with GitHub Actions, so the pipelines below could be migrated with limited changes if that is ever needed.
### Pipeline Architecture
```
    Developer pushes code
     ┌──────────────────┐
     │  Gitea Actions   │
     │  Trigger: push   │
     │                  │
     │ 1. Lint          │
     │ 2. Type Check    │
     │ 3. Unit Tests    │
     │ 4. Build         │
     │ 5. Security Scan │
     └────────┬─────────┘
         ┌────┴────┐
         │ Branch? │
         └────┬────┘
         ┌────┼────────────┐
         │    │            │
      feature develop     main
         │    │            │
         │    ▼            ▼
         │  Build Docker  Build Docker
         │  Push :dev     Push :latest
         │    │            │
         │    ▼            ▼
         │  Deploy to     Deploy to
         │  staging       production
         │                 │
         │                 ▼
         │           Canary rollout
         │          (tenant servers)
         └─► PR required to merge
```
### Environments
| Environment | Branch | Trigger | Purpose |
|-------------|--------|---------|---------|
| **Local** | Any | Manual | Developer testing |
| **CI** | Any push | Automatic | Lint, test, type check |
| **Staging** | `develop` | Automatic on merge | Integration testing, dogfooding |
| **Production** | `main` | Manual approval | Live customers |
---
## 2. Gitea Actions Pipelines
### 2.1 Monorepo CI Pipeline (All Packages)
```yaml
# .gitea/workflows/ci.yml
name: CI

on:
  push:
    # fix/** and hotfix/** are included so every branch type in §3 gets CI on push
    branches: [main, develop, 'feature/**', 'fix/**', 'hotfix/**']
  pull_request:
    branches: [main, develop]

env:
  NODE_VERSION: '22'

jobs:
  lint-and-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      - name: Install dependencies
        run: npm ci
      - name: Lint
        run: npx turbo run lint
      - name: Type check
        run: npx turbo run typecheck

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    strategy:
      matrix:
        package:
          - safety-wrapper
          - secrets-proxy
          - hub
          - shared-types
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      - name: Install dependencies
        run: npm ci
      - name: Run tests for ${{ matrix.package }}
        run: npx turbo run test --filter=${{ matrix.package }}

  security-scan:
    runs-on: ubuntu-latest
    needs: lint-and-typecheck
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so the git scan is not limited to the latest commit
      - name: Check for secrets in code
        run: |
          npx @trufflesecurity/trufflehog git file://. --only-verified --fail
      - name: Dependency audit
        run: npm audit --audit-level=high
```
### 2.2 Safety Wrapper Pipeline
```yaml
# .gitea/workflows/safety-wrapper.yml
name: Safety Wrapper

on:
  push:
    paths:
      - 'packages/safety-wrapper/**'
      - 'packages/shared-types/**'
    branches: [main, develop]

jobs:
  p0-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm ci
      - name: P0 Secrets Redaction Tests
        run: npx turbo run test:p0 --filter=secrets-proxy
      - name: P0 Command Classification Tests
        run: npx turbo run test:p0 --filter=safety-wrapper
      - name: P1 Autonomy Tests
        run: npx turbo run test:p1 --filter=safety-wrapper

  build-image:
    runs-on: ubuntu-latest
    needs: p0-tests
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
    steps:
      - uses: actions/checkout@v4
      - name: Set tag
        id: tag
        run: |
          if [ "${{ github.ref }}" = "refs/heads/main" ]; then
            echo "tag=latest" >> "$GITHUB_OUTPUT"
          else
            echo "tag=dev" >> "$GITHUB_OUTPUT"
          fi
      - name: Build Safety Wrapper image
        run: |
          docker build \
            -f packages/safety-wrapper/Dockerfile \
            -t code.letsbe.solutions/letsbe/safety-wrapper:${{ steps.tag.outputs.tag }} \
            -t code.letsbe.solutions/letsbe/safety-wrapper:${{ github.sha }} \
            .
      - name: Push to registry
        # Credentials are passed as environment variables, per the rules in §7
        env:
          REGISTRY_USER: ${{ secrets.REGISTRY_USER }}
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
        run: |
          echo "$REGISTRY_PASSWORD" | docker login code.letsbe.solutions -u "$REGISTRY_USER" --password-stdin
          docker push code.letsbe.solutions/letsbe/safety-wrapper:${{ steps.tag.outputs.tag }}
          docker push code.letsbe.solutions/letsbe/safety-wrapper:${{ github.sha }}
```
### 2.3 Secrets Proxy Pipeline
```yaml
# .gitea/workflows/secrets-proxy.yml
name: Secrets Proxy

on:
  push:
    paths:
      - 'packages/secrets-proxy/**'
      - 'packages/shared-types/**'
    branches: [main, develop]

jobs:
  p0-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - name: P0 Redaction Tests (must pass 100%)
        run: npx turbo run test:p0 --filter=secrets-proxy
      - name: Performance Benchmark
        run: npx turbo run test:benchmark --filter=secrets-proxy

  build-image:
    runs-on: ubuntu-latest
    needs: p0-tests
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
    steps:
      - uses: actions/checkout@v4
      - name: Build Secrets Proxy image
        # The SHA tag provides the immutable reference described in §4 (Image Tagging)
        run: |
          docker build \
            -f packages/secrets-proxy/Dockerfile \
            -t code.letsbe.solutions/letsbe/secrets-proxy:${{ github.ref == 'refs/heads/main' && 'latest' || 'dev' }} \
            -t code.letsbe.solutions/letsbe/secrets-proxy:${{ github.sha }} \
            .
      - name: Push to registry
        # Credentials are passed as environment variables, per the rules in §7
        env:
          REGISTRY_USER: ${{ secrets.REGISTRY_USER }}
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
        run: |
          echo "$REGISTRY_PASSWORD" | docker login code.letsbe.solutions -u "$REGISTRY_USER" --password-stdin
          docker push --all-tags code.letsbe.solutions/letsbe/secrets-proxy
```
### 2.4 Hub Pipeline
```yaml
# .gitea/workflows/hub.yml
name: Hub

on:
  push:
    paths:
      - 'packages/hub/**'
      - 'packages/shared-prisma/**'
    branches: [main, develop]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16-alpine
        env:
          POSTGRES_DB: hub_test
          POSTGRES_USER: hub
          POSTGRES_PASSWORD: testpass
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - name: Run Prisma migrations
        run: npx turbo run db:push --filter=hub
        env:
          DATABASE_URL: postgresql://hub:testpass@localhost:5432/hub_test
      - name: Run tests
        run: npx turbo run test --filter=hub
        env:
          DATABASE_URL: postgresql://hub:testpass@localhost:5432/hub_test

  build-image:
    runs-on: ubuntu-latest
    needs: test
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
    steps:
      - uses: actions/checkout@v4
      - name: Build Hub image
        # The SHA tag provides the immutable reference described in §4 (Image Tagging)
        run: |
          docker build \
            -f packages/hub/Dockerfile \
            -t code.letsbe.solutions/letsbe/hub:${{ github.ref == 'refs/heads/main' && 'latest' || 'dev' }} \
            -t code.letsbe.solutions/letsbe/hub:${{ github.sha }} \
            .
      - name: Push to registry
        # Credentials are passed as environment variables, per the rules in §7
        env:
          REGISTRY_USER: ${{ secrets.REGISTRY_USER }}
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
        run: |
          echo "$REGISTRY_PASSWORD" | docker login code.letsbe.solutions -u "$REGISTRY_USER" --password-stdin
          docker push --all-tags code.letsbe.solutions/letsbe/hub
```
### 2.5 Integration Test Pipeline
```yaml
# .gitea/workflows/integration.yml
name: Integration Tests

on:
  push:
    branches: [develop]
  workflow_dispatch:

jobs:
  integration:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - name: Start integration stack
        run: docker compose -f test/docker-compose.integration.yml up -d --wait
        timeout-minutes: 5
      - name: Wait for services
        run: |
          # Exit 0 on the first healthy response; fail the step if none arrives
          for i in $(seq 1 30); do
            curl -sf http://localhost:8200/health && exit 0
            sleep 2
          done
          echo "Service on :8200 did not become healthy" >&2
          exit 1
      - name: Run integration tests
        run: npx turbo run test:integration
      - name: Collect logs on failure
        if: failure()
        run: docker compose -f test/docker-compose.integration.yml logs > integration-logs.txt
      - name: Upload logs
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: integration-logs
          path: integration-logs.txt
      - name: Teardown
        if: always()
        run: docker compose -f test/docker-compose.integration.yml down -v
```
---
## 3. Branch Strategy
### Git Flow (Simplified)
```
main ─────────────────────────────────────────────────►
  │                                        ▲
  │                                        │ (merge via PR, requires approval)
  │                                        │
develop ──┬───────────┬───────────┬────────┤
          │           │           │
 feature/sw-skeleton  │  feature/hub-billing
                      │
            feature/secrets-proxy

hotfix/critical-fix ──────────────────────► main (bypasses develop for critical fixes; approval still required)
```
### Branch Rules
| Branch | Protection | Merge Requirements |
|--------|-----------|-------------------|
| `main` | Protected; no direct pushes | PR from `develop` or `hotfix/*`; 1 approval; all CI checks pass; security scan pass |
| `develop` | Protected; no direct pushes | PR from feature branch; all CI checks pass |
| `feature/*` | Unprotected | Free to push; PR to develop when ready |
| `hotfix/*` | Unprotected | Can merge to both `main` and `develop`; 1 approval required |
### Naming Conventions
```
feature/sw-command-classification # Safety Wrapper feature
feature/hub-tenant-api # Hub feature
feature/mobile-chat-view # Mobile app feature
feature/prov-step10-rewrite # Provisioner feature
fix/secrets-proxy-jwt-detection # Bug fix
hotfix/redaction-bypass-cve # Critical security fix
```
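The naming convention could be checked automatically, for example in a pre-push hook. The following is a minimal sketch, not part of the proposal: the function name `is_valid_branch` is hypothetical, and the rule it enforces (a `feature/`, `fix/`, or `hotfix/` prefix followed by a lowercase, hyphen-separated slug) is an assumption inferred from the examples above.

```bash
# Hypothetical helper: succeed only if the branch name follows the convention.
# Assumed rule: <feature|fix|hotfix>/<lowercase-kebab-slug>
is_valid_branch() {
  local slug
  case "$1" in
    feature/*|fix/*|hotfix/*) ;;   # known prefix, keep checking
    *) return 1 ;;                 # unknown prefix (or no prefix at all)
  esac
  slug="${1#*/}"                   # everything after the first slash
  case "$slug" in
    # reject: empty slug, characters outside [a-z0-9-],
    # leading or trailing hyphen, doubled hyphen
    ""|*[!a-z0-9-]*|-*|*-|*--*) return 1 ;;
  esac
  return 0
}
```

For example, `is_valid_branch "feature/hub-tenant-api"` succeeds, while `is_valid_branch "Feature/Hub_API"` fails.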
### Release Tagging
```
v0.1.0 # First internal milestone (M1)
v0.2.0 # M2
v0.3.0 # M3
v1.0.0 # Founding member launch (M4)
v1.0.1 # First patch
v1.1.0 # First feature update post-launch
```
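
The tag sequence above follows semantic-versioning increments. As an illustration only, the next tag can be derived mechanically from the current one. The helper name `next_version` is hypothetical, and the sketch assumes tags always have the form `vMAJOR.MINOR.PATCH`.

```bash
# Hypothetical helper: print the next release tag.
# Usage: next_version <current-tag> <major|minor|patch>
next_version() {
  local version major rest minor patch
  version="${1#v}"          # strip the leading "v"
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"      # text after the first dot
  minor="${rest%%.*}"
  patch="${rest#*.}"
  case "$2" in
    major) echo "v$((major + 1)).0.0" ;;
    minor) echo "v${major}.$((minor + 1)).0" ;;
    patch) echo "v${major}.${minor}.$((patch + 1))" ;;
    *)     return 1 ;;      # unknown increment type
  esac
}
```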
---
## 4. Build & Publish
### Docker Image Strategy
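| Image | Registry Path | Dockerfile Directory | Size Target |
|-------|--------------|----------------------|-------------|
| `letsbe/safety-wrapper` | `code.letsbe.solutions/letsbe/safety-wrapper` | `packages/safety-wrapper/` | <150MB |
| `letsbe/secrets-proxy` | `code.letsbe.solutions/letsbe/secrets-proxy` | `packages/secrets-proxy/` | <100MB |
| `letsbe/hub` | `code.letsbe.solutions/letsbe/hub` | `packages/hub/` | <500MB |
| `letsbe/ansible-runner` | `code.letsbe.solutions/letsbe/ansible-runner` | `packages/provisioner/` | Existing |

The pipelines in §2 build each image from the repository root (`docker build -f packages/<name>/Dockerfile .`), so the Docker build context is `.` rather than the package directory. This lets a Dockerfile copy the root `package-lock.json` and the shared packages.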
### Multi-Stage Dockerfile Pattern
```dockerfile
# packages/safety-wrapper/Dockerfile
# Stage 1: Dependencies
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
COPY packages/safety-wrapper/package.json ./packages/safety-wrapper/
COPY packages/shared-types/package.json ./packages/shared-types/
RUN npm ci --workspace=packages/safety-wrapper --workspace=packages/shared-types
# Stage 2: Build
FROM node:22-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY packages/safety-wrapper/ ./packages/safety-wrapper/
COPY packages/shared-types/ ./packages/shared-types/
COPY turbo.json package.json ./
RUN npx turbo run build --filter=safety-wrapper
# Stage 3: Production
FROM node:22-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S letsbe && adduser -S letsbe -u 1001
COPY --from=builder /app/packages/safety-wrapper/dist ./dist
COPY --from=builder /app/packages/safety-wrapper/package.json ./
COPY --from=deps /app/node_modules ./node_modules
USER letsbe
EXPOSE 8200
CMD ["node", "dist/index.js"]
```
### Image Tagging
| Tag | When | Purpose |
|-----|------|---------|
| `:dev` | On merge to `develop` | Staging deployment |
| `:latest` | On merge to `main` | Production deployment |
| `:<git-sha>` | On every build | Immutable reference for debugging |
| `:v1.0.0` | On release tag | Version-pinned deployment |
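
The tagging rules in this table can be expressed as one function of the git ref and commit SHA. The sketch below is illustrative only: `image_tags_for_ref` is a hypothetical name, and treating every other ref (feature branches, for example) as "publish nothing" is an assumption based on the pipeline diagram in §1.

```bash
# Hypothetical helper: print the image tags to publish for a build.
# Usage: image_tags_for_ref <git-ref> <git-sha>
image_tags_for_ref() {
  local ref="$1" sha="$2"
  case "$ref" in
    refs/heads/main)    echo "latest $sha" ;;               # production
    refs/heads/develop) echo "dev $sha" ;;                  # staging
    refs/tags/v*)       echo "${ref#refs/tags/} $sha" ;;    # release tag
    *)                  return 1 ;;                         # nothing to publish
  esac
}
```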
---
## 5. Deployment Workflows
### 5.1 Central Platform (Hub) Deployment
```yaml
# .gitea/workflows/deploy-hub.yml
name: Deploy Hub

on:
  push:
    branches: [main]
    paths: ['packages/hub/**', 'packages/shared-prisma/**']

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        # Assumes the runner already has PRODUCTION_SSH_KEY loaded for the deploy user
        run: |
          ssh -o StrictHostKeyChecking=no deploy@hub.letsbe.biz << 'EOF'
          set -e
          cd /opt/letsbe/hub
          docker compose pull hub
          docker compose up -d hub
          # Wait for health check; abort if the Hub never becomes healthy
          healthy=0
          for i in $(seq 1 30); do
            if curl -sf http://localhost:3847/api/health; then healthy=1; break; fi
            sleep 2
          done
          [ "$healthy" -eq 1 ] || { echo "Hub failed health check" >&2; exit 1; }
          # Run migrations
          docker compose exec hub npx prisma migrate deploy
          EOF
```
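
The health-check loops in these workflows share one pattern: probe, sleep, retry, and give up after a fixed number of attempts. A loop written as `cmd && break || sleep 2` ends with the exit status of the final `sleep`, so it reports success even when every probe failed. The following is a minimal sketch of a reusable alternative that returns a failure status on timeout; the name `retry_until_ok` is hypothetical and not part of the proposal.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                      # command succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"                # no sleep after the final attempt
    fi
    i=$((i + 1))
  done
  return 1                          # every attempt failed
}
```

With this helper, the Hub check could be written as `retry_until_ok 30 2 curl -sf http://localhost:3847/api/health`, and a timeout would fail the deploy step.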
### 5.2 Tenant Server Update Pipeline
Tenant servers are updated via the Hub push mechanism (see 03-DEPLOYMENT-STRATEGY §7).
```yaml
# .gitea/workflows/tenant-update.yml
name: Tenant Server Update

on:
  workflow_dispatch:
    inputs:
      component:
        description: 'Component to update'
        required: true
        type: choice
        options: [safety-wrapper, secrets-proxy, openclaw]
      strategy:
        description: 'Rollout strategy'
        required: true
        type: choice
        options: [staging-only, canary-5pct, canary-25pct, full-rollout]

jobs:
  prepare:
    runs-on: ubuntu-latest
    steps:
      - name: Verify image exists
        run: |
          docker manifest inspect code.letsbe.solutions/letsbe/${{ inputs.component }}:latest

  rollout:
    runs-on: ubuntu-latest
    needs: prepare
    steps:
      - name: Trigger Hub rollout API
        # The token is passed as an environment variable, per the rules in §7
        env:
          HUB_ADMIN_TOKEN: ${{ secrets.HUB_ADMIN_TOKEN }}
        run: |
          # --fail makes the step fail on HTTP 4xx/5xx responses
          curl --fail -X POST https://hub.letsbe.biz/api/v1/admin/rollout \
            -H "Authorization: Bearer $HUB_ADMIN_TOKEN" \
            -H "Content-Type: application/json" \
            -d '{
              "component": "${{ inputs.component }}",
              "tag": "latest",
              "strategy": "${{ inputs.strategy }}"
            }'
```
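
To make the rollout strategies concrete, the sketch below maps each strategy name to the number of production tenants it would touch. This is an assumption-laden illustration: the actual selection logic lives in the Hub rollout API and is not specified here, `rollout_count` is a hypothetical name, and rounding up (so that a canary always includes at least one tenant when any exist) is an assumed policy.

```bash
# Hypothetical helper: number of production tenants targeted by a strategy.
# Usage: rollout_count <strategy> <total-tenants>
rollout_count() {
  local strategy="$1" total="$2" pct
  case "$strategy" in
    staging-only) pct=0 ;;      # no production tenants
    canary-5pct)  pct=5 ;;
    canary-25pct) pct=25 ;;
    full-rollout) pct=100 ;;
    *) echo "unknown strategy: $strategy" >&2; return 1 ;;
  esac
  # Integer ceiling of total * pct / 100
  echo $(( (total * pct + 99) / 100 ))
}
```

For example, `rollout_count canary-5pct 10` prints `1` and `rollout_count canary-25pct 10` prints `3`.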
### 5.3 Staging Deployment (Automatic)
```yaml
# .gitea/workflows/deploy-staging.yml
name: Deploy Staging

on:
  push:
    branches: [develop]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy Hub to staging
        run: |
          ssh deploy@staging.letsbe.biz << 'EOF'
          set -e  # stop at the first failing command
          cd /opt/letsbe/hub
          docker compose pull
          docker compose up -d
          docker compose exec hub npx prisma migrate deploy
          EOF
      - name: Deploy tenant stack to staging VPS
        run: |
          ssh deploy@staging-tenant.letsbe.biz << 'EOF'
          set -e  # stop at the first failing command
          cd /opt/letsbe
          docker compose -f docker-compose.letsbe.yml pull
          docker compose -f docker-compose.letsbe.yml up -d
          EOF
      - name: Run smoke tests
        run: |
          curl -sf https://staging.letsbe.biz/api/health
          curl -sf https://staging-tenant.letsbe.biz:8200/health
          curl -sf https://staging-tenant.letsbe.biz:8100/health
```
---
## 6. Rollback Procedures
### 6.1 Hub Rollback
```bash
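# Rollback Hub to previous version
# Assumes the compose file references the image tag as ${HUB_TAG} and that
# <previous-sha> is the last known-good image tag (see Image Tagging in §4).
ssh deploy@hub.letsbe.biz << 'EOF'
set -e
cd /opt/letsbe/hub
# Pull and deploy the pinned previous image.
# Re-pulling :latest would fetch the newest build, not the previous one.
HUB_TAG=<previous-sha> docker compose pull hub
HUB_TAG=<previous-sha> docker compose up -d hub
# Verify health; fail if the Hub never becomes healthy
healthy=0
for i in $(seq 1 30); do
  if curl -sf http://localhost:3847/api/health; then healthy=1; break; fi
  sleep 2
done
[ "$healthy" -eq 1 ] || { echo "Hub unhealthy after rollback" >&2; exit 1; }
# Note: Prisma migrations are forward-only.
# prisma migrate resolve only updates the migration history table; it does not
# undo schema changes. Revert a schema change with a new forward migration.
EOF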
```
### 6.2 Tenant Component Rollback
```bash
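# Rollback Safety Wrapper on a specific tenant
# Assumes docker-compose.letsbe.yml references the image tag as ${SAFETY_WRAPPER_TAG}
ssh deploy@tenant-server << 'EOF'
set -e
cd /opt/letsbe
# Roll back to pinned SHA. docker compose has no -e flag, so the variable
# is set in the command's environment instead.
SAFETY_WRAPPER_TAG=<previous-sha> docker compose -f docker-compose.letsbe.yml \
  up -d safety-wrapper
# Verify health
curl -sf http://127.0.0.1:8200/health
EOF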
```
### 6.3 Rollback Decision Matrix
| Symptom | Action | Automatic? |
|---------|--------|-----------|
| Health check fails after deploy | Rollback to previous image | Intended to be automatic (the deploy script must redeploy the previous tag; a Docker restart policy only restarts the same image) |
| P0 tests fail in CI | Block merge; no deployment | Yes (CI gate) |
| Secrets redaction miss detected | EMERGENCY: rollback all tenants immediately | Manual (requires admin trigger) |
| Hub API errors >5% | Rollback Hub to previous version | Manual (monitoring alert) |
| Billing discrepancy | Investigate first; rollback billing code if confirmed | Manual |
### 6.4 Emergency Rollback Checklist
For critical security issues (e.g., redaction bypass):
1. **STOP** all tenant updates immediately (disable Hub rollout API)
2. **ROLLBACK** all affected components to last known-good version
3. **VERIFY** rollback successful (health checks, P0 tests)
4. **INVESTIGATE** root cause
5. **FIX** and add test case for the specific failure
6. **AUDIT** all tenants for potential exposure during the window
7. **NOTIFY** affected customers if secrets were potentially exposed
8. **POST-MORTEM** within 24 hours
---
## 7. Secret Management in CI
### Gitea Secrets Configuration
| Secret | Scope | Purpose |
|--------|-------|---------|
| `REGISTRY_USER` | Organization | Docker registry login |
| `REGISTRY_PASSWORD` | Organization | Docker registry password |
| `HUB_ADMIN_TOKEN` | Repository | Hub API authentication for deployments |
| `STAGING_SSH_KEY` | Repository | SSH key for staging deployment |
| `PRODUCTION_SSH_KEY` | Repository | SSH key for production deployment |
| `STRIPE_TEST_KEY` | Repository | Stripe test mode for integration tests |
### Rules
1. **Never** put secrets in workflow YAML files
2. **Never** echo secrets in CI logs (use `::add-mask::`)
3. **Never** pass secrets as command-line arguments (use environment variables)
4. SSH keys: use deploy keys with minimal permissions (read-only for CI, write for deploy)
5. Rotate all CI secrets quarterly
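
Rule 2 can be illustrated with a small helper that registers a value for log masking before it is used. This is a sketch under assumptions: `mask_secret` is a hypothetical name, and it relies on the runner honoring the `::add-mask::` workflow command referenced above. Multi-line values (an SSH key, for example) are masked line by line.

```bash
# Hypothetical helper: emit one ::add-mask:: command per non-empty line,
# so every line of a multi-line secret is masked in later log output.
# Usage: mask_secret "$SOME_SECRET"
mask_secret() {
  local line
  printf '%s\n' "$1" | while IFS= read -r line; do
    if [ -n "$line" ]; then
      printf '::add-mask::%s\n' "$line"
    fi
  done
}
```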
---
## 8. Quality Gates in CI
### Gate Configuration
```yaml
# In each pipeline, quality gates are enforced as job dependencies:
jobs:
  # Gate 1: Code quality
  lint:
    # Must pass before tests run
    ...
  typecheck:
    # Must pass before tests run
    ...

  # Gate 2: Correctness
  unit-tests:
    needs: [lint, typecheck]
    # Must pass before build
    ...

  # Gate 3: Security
  security-scan:
    needs: [lint]
    # Must pass before deploy
    ...

  # Gate 4: Build
  build:
    needs: [unit-tests, security-scan]
    # Must succeed before deploy
    ...

  # Gate 5: Deploy (only on protected branches)
  deploy:
    needs: [build]
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/develop'
    ...
```
### PR Merge Requirements
| Requirement | Enforcement |
|-------------|------------|
| All CI checks pass | Gitea branch protection rule |
| At least 1 approval | Gitea branch protection rule |
| No unresolved review comments | Convention (not enforced by Gitea) |
| P0 tests pass if security code changed | CI pipeline condition |
| No secrets detected in diff | trufflehog scan |
---
## 9. Monitoring & Alerting
### CI Pipeline Monitoring
| Metric | Alert Threshold | Action |
|--------|----------------|--------|
| Build duration | >15 min | Investigate; optimize caching |
| Test suite duration | >10 min | Investigate; parallelize tests |
| Failed builds on `develop` | >3 consecutive | Freeze merges; investigate |
| Failed deploys | Any | Automatic rollback; notify team |
| Security scan findings | Any critical | Block merge; assign to Security Lead |
### Deployment Monitoring
| Metric | Alert Threshold | Action |
|--------|----------------|--------|
| Hub health after deploy | Unhealthy for >60s | Automatic rollback |
| Tenant health after update | Unhealthy for >120s | Rollback specific tenant; pause rollout |
| Error rate post-deploy | >5% increase | Alert team; investigate |
| Latency post-deploy | >2× baseline | Alert team; investigate |
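
The two health thresholds in this table reduce to a simple decision rule. The sketch below encodes only those two rows; the function name `post_deploy_action` and its output strings are hypothetical, and how unhealthy time is measured is left to the monitoring system.

```bash
# Hypothetical helper: decide the action after a deploy.
# Usage: post_deploy_action <hub|tenant> <seconds-unhealthy>
post_deploy_action() {
  local target="$1" unhealthy_secs="$2" limit
  case "$target" in
    hub)    limit=60 ;;     # Hub: unhealthy for >60s
    tenant) limit=120 ;;    # Tenant: unhealthy for >120s
    *) echo "unknown target: $target" >&2; return 2 ;;
  esac
  if [ "$unhealthy_secs" -gt "$limit" ]; then
    echo "rollback"
  else
    echo "ok"
  fi
}
```

For a tenant, a `rollback` result would also pause the rollout, as the table specifies.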
### Notification Channels
| Event | Channel |
|-------|---------|
| CI failure on `main` | Team chat (immediate) |
| Security scan finding | Team chat + email to Security Lead |
| Deployment success | Team chat (informational) |
| Deployment failure | Team chat + email to on-call |
| Emergency rollback | Team chat + phone call to on-call |
---
*End of Document — 08 CI/CD Strategy*