# Kalei System Architecture Plan

Version: 1.0
Date: 2026-02-10
Status: Proposed canonical architecture for implementation

## 1. Purpose and Scope

This document consolidates the existing Kalei docs into one implementation-ready system architecture plan.

In scope:
- Core features: Mirror, Kaleidoscope (Turn), Lens, Gallery, Spectrum analytics, subscriptions (all ship in v1).
- Mobile-first architecture (iOS/Android via Expo) with optional web support.
- Production operations for safety, privacy, reliability, and cost control.

Out of scope:
- Pixel-level UI specs and brand copy details.
- Provider contract/legal details.
- Full threat model artifacts (to be produced separately).

## 2. Inputs Reviewed

- `docs/app-blueprint.md`
- `docs/kalei-infrastructure-plan.md`
- `docs/kalei-ai-model-comparison.md`
- `docs/kalei-mirror-feature.md`
- `docs/kalei-spectrum-phase2.md`
- `docs/kalei-complete-design.md`
- `docs/kalei-brand-guidelines.md`

## 3. Architecture Drivers

### 3.1 Product drivers

- Core loop quality: Mirror fragment detection and Turn reframes must feel high quality and emotionally calibrated.
- Daily habit loop: low friction, fast response, strong retention mechanics.
- Over time: longitudinal Spectrum insights from accumulated usage data.

### 3.2 Non-functional drivers

- Safety first: crisis language must bypass reframing and trigger the support flow.
- Privacy first: personal reflective writing is highly sensitive.
- Cost discipline: launch target under ~EUR 30/month fixed infrastructure.
- Operability: architecture must be maintainable by a small team.
- Gradual scale: support ~50 DAU at launch and scale to ~10k DAU without a full rewrite.

## 4. Canonical Decisions

This plan resolves conflicting guidance across current docs.

| Topic | Decision | Rationale |
|---|---|---|
| Backend platform | Self-hosted API-first modular monolith on Node.js (Fastify preferred) | Matches budget constraints and keeps full control of safety, rate limits, and AI routing. |
| Data layer | PostgreSQL 16 + Redis | Postgres for source-of-truth relational + analytics tables; Redis for counters, rate limits, caching, idempotency. |
| Auth | JWT auth service in API + refresh token rotation + social login (Apple/Google) | Aligns with self-hosted stack while preserving mobile auth UX. |
| Mobile | React Native + Expo (local/native builds) | Fastest path for iOS/Android while keeping build pipeline under direct control. |
| AI integration | AI Gateway abstraction via OpenRouter with provider pinning | Single API, automatic failover, no vendor lock-in, and deterministic routing to non-Chinese providers for data privacy. |
| AI default | DeepSeek V3.2 via OpenRouter, hosted on DeepInfra/Fireworks (US/EU infrastructure) | 85–90% cheaper than Claude Haiku with comparable emotional intelligence benchmarks. Provider pinning ensures no data flows through Chinese servers. |
| AI fallback | Claude Haiku 4.5 via OpenRouter (automatic failover on provider outage) | Highest-quality safety net activated transparently when primary provider is unavailable. |
| Billing | Self-hosted entitlement authority (direct App Store + Google Play server APIs) | Keeps billing logic in-house and avoids closed SaaS dependency in core authorization path. |
| Analytics/monitoring | PostHog self-hosted + GlitchTip + centralized app logs + cost telemetry | Open-source-first observability stack with lower vendor lock-in. |

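
The AI default and fallback rows can be pictured as an ordered list of request payloads that the AI Gateway tries in turn. The sketch below is one possible shape, not the confirmed implementation: the model slugs, the provider names, and the `buildAttempts` helper are assumptions for illustration, and the `provider.order` / `provider.allow_fallbacks` fields should be checked against the current OpenRouter provider-routing documentation before use.

```typescript
// Sketch only. Model slugs and provider names are placeholders (assumptions);
// verify them against the OpenRouter model catalogue before use.
type ChatMessage = { role: "system" | "user"; content: string };

interface RouteRequest {
  model: string;
  messages: ChatMessage[];
  // Provider-routing preferences; omitted when no pinning is wanted.
  provider?: { order: string[]; allow_fallbacks: boolean };
}

const PRIMARY_MODEL = "deepseek/deepseek-v3.2"; // assumed slug
const FALLBACK_MODEL = "anthropic/claude-haiku-4.5"; // assumed slug
const PINNED_PROVIDERS = ["DeepInfra", "Fireworks"]; // assumed provider names

// Returns the payloads in the order the gateway should try them.
function buildAttempts(messages: ChatMessage[]): RouteRequest[] {
  return [
    {
      model: PRIMARY_MODEL,
      messages,
      // Pin the primary model to the listed hosts and refuse any other host.
      provider: { order: PINNED_PROVIDERS, allow_fallbacks: false },
    },
    // Fallback lane: a different model with no provider pinning.
    { model: FALLBACK_MODEL, messages },
  ];
}

const attempts = buildAttempts([
  { role: "system", content: "You reframe thoughts gently." },
  { role: "user", content: "I always mess things up." },
]);
```

Keeping the attempt list as plain data makes the routing policy easy to log and to unit-test separately from the HTTP call itself.
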
## 5. System Context

```mermaid
flowchart LR
  user[User] --> app[Expo App]
  app --> edge[Edge Proxy]
  edge --> api[Kalei API]
  api --> db[(PostgreSQL)]
  api --> redis[(Redis)]
  api --> ai[AI Providers]
  api --> billing[Store Entitlements]
  api --> push[Push Gateway]
  api --> obs[Observability]
  app --> analytics[Product Analytics]
```

## 6. Container Architecture

```mermaid
flowchart TB
  subgraph Client
    turn[Turn Screen]
    mirror[Mirror Screen]
    lens[Lens Screen]
    spectrum_ui[Spectrum Dashboard]
    profile_ui[Gallery and Profile]
  end

  subgraph Platform
    gateway[API Gateway and Auth]
    turn_service[Turn Service]
    mirror_service[Mirror Service]
    lens_service[Lens Service]
    spectrum_service[Spectrum Service]
    safety_service[Safety Service]
    entitlement_service[Entitlement Service]
    jobs[Job Scheduler and Workers]
    ai_gateway[AI Gateway]
    cost_guard[Usage Meter and Cost Guard]
  end

  subgraph Data
    postgres[(PostgreSQL)]
    redis[(Redis)]
    object_storage[(Object Storage)]
  end

  subgraph External
    ai_provider[DeepSeek V3.2 via OpenRouter + DeepInfra/Fireworks + Claude Haiku fallback]
    store_billing[App Store and Play Billing APIs]
    push_provider[APNs and FCM]
    glitchtip[GlitchTip]
    posthog[PostHog self-hosted]
  end

  turn --> gateway
  mirror --> gateway
  lens --> gateway
  spectrum_ui --> gateway
  profile_ui --> gateway

  gateway --> turn_service
  gateway --> mirror_service
  gateway --> lens_service
  gateway --> spectrum_service
  gateway --> entitlement_service

  mirror_service --> safety_service
  turn_service --> safety_service
  lens_service --> safety_service
  spectrum_service --> safety_service

  turn_service --> ai_gateway
  mirror_service --> ai_gateway
  lens_service --> ai_gateway
  spectrum_service --> ai_gateway
  ai_gateway --> ai_provider

  turn_service --> cost_guard
  mirror_service --> cost_guard
  lens_service --> cost_guard
  spectrum_service --> cost_guard

  turn_service --> postgres
  mirror_service --> postgres
  lens_service --> postgres
  spectrum_service --> postgres
  entitlement_service --> postgres
  jobs --> postgres

  turn_service --> redis
  mirror_service --> redis
  lens_service --> redis
  spectrum_service --> redis
  cost_guard --> redis
  jobs --> redis

  entitlement_service --> store_billing
  jobs --> push_provider
  gateway --> glitchtip
  gateway --> posthog
  spectrum_service --> object_storage
```

## 7. Domain and Service Boundaries

### 7.1 Runtime modules

- `auth`: sign-up/sign-in, token issuance/rotation, device session management.
- `entitlements`: direct App Store + Google Play sync, plan gating (`free`, `prism`, `prism_plus`).
- `mirror`: session lifecycle, message ingestion, fragment detection, inline reframe, reflection.
- `turn`: structured reframing workflow and saved patterns.
- `lens`: goals, actions, daily focus generation, check-ins.
- `spectrum`: analytics feature store, weekly/monthly aggregation, insight generation.
- `safety`: crisis detection, escalation, crisis response policy.
- `ai_gateway`: prompt templates, OpenRouter API integration with provider pinning (DeepInfra/Fireworks primary, Claude Haiku fallback), retries/timeouts, structured output validation.
- `usage_cost`: token telemetry, per-user budgets, global spend controls.
- `notifications`: push scheduling, reminders, weekly summaries.

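
As a rough illustration of the plan gating that the `entitlements` module performs, the sketch below checks a daily cap per plan and feature. The three plan names come from this plan; every cap value, the `Feature` union, and the `isAllowed` helper are hypothetical placeholders, since the actual quotas are not specified here.

```typescript
type Plan = "free" | "prism" | "prism_plus";
type Feature = "turn" | "mirror" | "lens" | "spectrum";

// Hypothetical daily caps for illustration only.
// null means uncapped; 0 means the feature is not included in the plan.
const DAILY_CAPS: Record<Plan, Record<Feature, number | null>> = {
  free: { turn: 3, mirror: 2, lens: 1, spectrum: 0 },
  prism: { turn: 20, mirror: 10, lens: 5, spectrum: 0 },
  prism_plus: { turn: null, mirror: null, lens: null, spectrum: null },
};

// True when the user may make one more request for this feature today.
function isAllowed(plan: Plan, feature: Feature, usedToday: number): boolean {
  const cap = DAILY_CAPS[plan][feature];
  return cap === null || usedToday < cap;
}
```

In the sequences in section 9, `usedToday` would come from the Redis counters that the Entitlement Service reads before the safety precheck runs.
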
### 7.2 Why modular monolith first

- Lowest operational overhead at launch.
- Strong transaction boundaries in one codebase.
- Easy extraction path later for `spectrum` workers or `ai_gateway` if load increases.

## 8. Core Data Architecture

### 8.1 Data domains

- Identity: users, profiles, auth_sessions, refresh_tokens.
- Product interactions: turns, mirror_sessions, mirror_messages, mirror_fragments, lens_goals, lens_actions.
- Analytics: spectrum_session_analysis, spectrum_turn_analysis, spectrum_weekly, spectrum_monthly.
- Commerce: subscriptions, entitlement_snapshots, billing_events.
- Safety and operations: safety_events, ai_usage_events, request_logs, audit_events.

### 8.2 Entity relationship view

```mermaid
flowchart LR
  users[USERS] --> profiles[PROFILES]
  users --> auth_sessions[AUTH_SESSIONS]
  users --> refresh_tokens[REFRESH_TOKENS]
  users --> turns[TURNS]
  users --> mirror_sessions[MIRROR_SESSIONS]
  mirror_sessions --> mirror_messages[MIRROR_MESSAGES]
  mirror_messages --> mirror_fragments[MIRROR_FRAGMENTS]
  users --> lens_goals[LENS_GOALS]
  lens_goals --> lens_actions[LENS_ACTIONS]
  users --> spectrum_session[SPECTRUM_SESSION_ANALYSIS]
  users --> spectrum_turn[SPECTRUM_TURN_ANALYSIS]
  users --> spectrum_weekly[SPECTRUM_WEEKLY]
  users --> spectrum_monthly[SPECTRUM_MONTHLY]
  users --> subscriptions[SUBSCRIPTIONS]
  users --> entitlement[ENTITLEMENT_SNAPSHOTS]
  users --> safety_events[SAFETY_EVENTS]
  users --> ai_usage[AI_USAGE_EVENTS]
```

### 8.3 Storage policy

- Raw reflective content remains in transactional tables, encrypted at rest.
- Spectrum dashboard reads aggregated tables only by default.
- Per-session exclusion flags let users exclude individual entries from analytics.
- The hard-delete workflow removes raw content and derived analytics for the requested time window.

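
To make the exclusion flag and the delete window concrete, here is a minimal in-memory sketch. The `SessionRow` shape, its field names, and both helpers are assumptions for illustration; in the real system these would be SQL predicates on the `mirror_sessions` and Spectrum tables rather than array filters.

```typescript
// Hypothetical row shape; the real column names are not defined in this plan.
interface SessionRow {
  id: string;
  createdAt: Date;
  excludedFromAnalytics: boolean;
}

// Sessions the Spectrum pipeline is allowed to read.
function eligibleForSpectrum(rows: SessionRow[]): SessionRow[] {
  return rows.filter((row) => !row.excludedFromAnalytics);
}

// IDs whose raw content and derived analytics fall inside a delete window.
// The window is half-open: `from` is included, `to` is excluded.
function idsInDeleteWindow(rows: SessionRow[], from: Date, to: Date): string[] {
  return rows
    .filter((row) => row.createdAt >= from && row.createdAt < to)
    .map((row) => row.id);
}
```
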
## 9. Key Runtime Sequences

### 9.1 Mirror message processing with safety gate

```mermaid
sequenceDiagram
  participant App as Mobile App
  participant API as Kalei API
  participant Safety as Safety Service
  participant Ent as Entitlement Service
  participant AI as AI Gateway
  participant Model as AI Provider
  participant DB as PostgreSQL
  participant Redis as Redis

  App->>API: POST /mirror/messages
  API->>Ent: Check plan/quota
  Ent->>Redis: Read counters
  Ent-->>API: Allowed
  API->>Safety: Crisis precheck
  alt Crisis detected
    Safety->>DB: Insert safety_event
    API-->>App: Crisis resources response
  else Not crisis
    API->>AI: Detect fragments prompt
    AI->>Model: Inference request
    Model-->>AI: Fragments with confidence
    AI-->>API: Validated structured result
    API->>DB: Save message + fragments
    API->>Redis: Increment usage counters
    API-->>App: Highlight payload
  end
```

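
The "Validated structured result" step in the sequence above implies that the AI Gateway rejects malformed model output before anything is saved. The sketch below shows one way to do that for fragment detection. The `Fragment` shape (character offsets plus a confidence score) and the `parseFragments` helper are assumptions; the plan only states that fragments come back with a confidence value.

```typescript
// Hypothetical fragment shape: character offsets into the message plus confidence.
interface Fragment {
  start: number;
  end: number;
  confidence: number;
}

// Parses raw model output and keeps only well-formed fragments.
// Returns an empty list when the output is not a JSON array.
function parseFragments(raw: string, messageLength: number): Fragment[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(data)) return [];

  const valid: Fragment[] = [];
  for (const item of data) {
    if (typeof item !== "object" || item === null) continue;
    const fields = item as Record<string, unknown>;
    const start = fields["start"];
    const end = fields["end"];
    const confidence = fields["confidence"];
    if (typeof start !== "number" || typeof end !== "number" || typeof confidence !== "number") continue;
    if (!Number.isInteger(start) || !Number.isInteger(end)) continue;
    // Offsets must describe a non-empty span inside the message.
    if (start < 0 || end <= start || end > messageLength) continue;
    if (confidence < 0 || confidence > 1) continue;
    valid.push({ start, end, confidence });
  }
  return valid;
}
```

Dropping invalid items instead of failing the whole request keeps the Mirror response usable when the model returns a partially malformed list; whether that is the right policy is a product decision this sketch does not settle.
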
### 9.2 Turn (Kaleidoscope) request

```mermaid
sequenceDiagram
  participant App as Mobile App
  participant API as Kalei API
  participant Ent as Entitlement Service
  participant Safety as Safety Service
  participant AI as AI Gateway
  participant Model as AI Provider
  participant DB as PostgreSQL
  participant Cost as Cost Guard

  App->>API: POST /turns
  API->>Ent: Validate tier + daily cap
  API->>Safety: Crisis precheck
  alt Crisis detected
    API-->>App: Crisis resources response
  else Safe
    API->>AI: Generate 3 reframes + micro-action
    AI->>Model: Inference stream
    Model-->>AI: Structured reframes
    AI-->>API: Response + token usage
    API->>Cost: Record token usage + budget check
    API->>DB: Save turn + metadata
    API-->>App: Stream final turn result
  end
```

### 9.3 Weekly Spectrum aggregation (background)

```mermaid
sequenceDiagram
  participant Cron as Scheduler
  participant Worker as Spectrum Worker
  participant DB as PostgreSQL
  participant AI as AI Gateway
  participant Model as Batch Provider
  participant Push as Notification Service

  Cron->>Worker: Trigger weekly job
  Worker->>DB: Load eligible users + raw events
  Worker->>DB: Compute vectors and weekly aggregates
  Worker->>AI: Generate insight narratives from aggregates
  AI->>Model: Batch request
  Model-->>AI: Insight text
  AI-->>Worker: Validated summaries
  Worker->>DB: Upsert spectrum_weekly and monthly deltas
  Worker->>Push: Enqueue spectrum updated notifications
```

## 10. API Surface (v1)

### 10.1 Auth and profile

- `POST /auth/register`
- `POST /auth/login`
- `POST /auth/refresh`
- `POST /auth/logout`
- `GET /me`
- `PATCH /me/profile`

### 10.2 Mirror

- `POST /mirror/sessions`
- `POST /mirror/messages`
- `POST /mirror/fragments/{id}/reframe`
- `POST /mirror/sessions/{id}/close`
- `GET /mirror/sessions`
- `DELETE /mirror/sessions/{id}`

### 10.3 Turn

- `POST /turns`
- `GET /turns`
- `GET /turns/{id}`
- `POST /turns/{id}/save`

### 10.4 Lens

- `POST /lens/goals`
- `GET /lens/goals`
- `POST /lens/goals/{id}/actions`
- `POST /lens/actions/{id}/complete`
- `GET /lens/affirmation/today`

### 10.5 Spectrum

- `GET /spectrum/weekly`
- `GET /spectrum/monthly`
- `POST /spectrum/reset`
- `POST /spectrum/exclusions`

### 10.6 Billing and entitlements

- `POST /billing/webhooks/apple`
- `POST /billing/webhooks/google`
- `GET /billing/entitlements`

## 11. Security, Safety, and Compliance Architecture

### 11.1 Security controls

- TLS everywhere (edge proxy to API origin and service egress).
- JWT access tokens (short TTL) + rotating refresh tokens.
- Password hashing with Argon2id (preferred) or bcrypt with strong cost factor.
- Row ownership checks enforced in API and optionally DB RLS for defense in depth.
- Secrets in environment vault; never in client bundle.
- Audit logging for auth events, entitlement changes, deletes, and safety events.

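
Rotating refresh tokens can be sketched as follows: each refresh call consumes the presented token and issues a new one. The sketch also revokes the whole token family when a consumed token is presented again. That reuse detection is a common companion to rotation but is an assumption here, not something this plan specifies. The `RefreshTokenStore` class, its in-memory map, and the counter-based token generator are illustrative only; a real store would live in the `refresh_tokens` table, generate tokens with a CSPRNG, and persist only token hashes.

```typescript
interface RefreshRecord {
  userId: string;
  familyId: string; // all tokens descended from one login share a family
  used: boolean;
  revoked: boolean;
}

class RefreshTokenStore {
  private records = new Map<string, RefreshRecord>();
  private counter = 0;

  // Placeholder generator for the sketch; NOT suitable for production.
  private nextToken(): string {
    this.counter += 1;
    return "rt_" + this.counter;
  }

  // Issues a token; a new family is started unless one is passed in.
  issue(userId: string, familyId?: string): string {
    const token = this.nextToken();
    this.records.set(token, {
      userId,
      familyId: familyId ?? "fam_" + token,
      used: false,
      revoked: false,
    });
    return token;
  }

  // Consumes `token` and returns its replacement, or null if it is invalid.
  rotate(token: string): string | null {
    const record = this.records.get(token);
    if (!record || record.revoked) return null;
    if (record.used) {
      // A consumed token was replayed: revoke every token in the family.
      const family = record.familyId;
      this.records.forEach((r) => {
        if (r.familyId === family) r.revoked = true;
      });
      return null;
    }
    record.used = true;
    return this.issue(record.userId, record.familyId);
  }
}
```

With this shape, a stolen refresh token is useful at most once, and its replay logs out both the attacker and the legitimate device, which then has to sign in again.
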
### 11.2 Data protection

- Encryption at rest for disk volumes and database backups.
- Column-level encryption for highly sensitive text fields (Mirror message content).
- Data minimization for analytics: Spectrum reads vectors and aggregates by default.
- User rights flows: export, per-item delete, account delete, Spectrum reset.

### 11.3 Safety architecture

- Multi-stage crisis filter:
  1. Deterministic keyword and pattern pass.
  2. Low-latency model confirmation where needed.
  3. Hardcoded crisis response templates and hotline resources.
- Crisis-level content is never reframed.
- Safety events are logged and monitored for false-positive/false-negative tuning.

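
The three stages above can be sketched as a small gate function. Everything named here is an assumption for illustration: the pattern lists are tiny placeholders that would need clinical review, `confirmWithModel` stands in for the low-latency model call, and the template text is a stub with no real hotline data.

```typescript
type Precheck = "crisis" | "needs_model_check" | "clear";
type GateResult = { kind: "crisis"; body: string } | { kind: "proceed" };

// Hypothetical patterns; a real list needs expert review and localisation.
const CRISIS_PATTERNS: RegExp[] = [/\bkill myself\b/i, /\bend my life\b/i, /\bsuicid(e|al)\b/i];
const AMBIGUOUS_PATTERNS: RegExp[] = [/\bhopeless\b/i, /\bcan'?t go on\b/i];

// Stub template; regional hotline resources would be filled in by policy.
const CRISIS_TEMPLATE =
  "It sounds like you are going through something very painful. " +
  "If you are in immediate danger, please contact local emergency services.";

// Stage 1: deterministic keyword and pattern pass.
function precheck(text: string): Precheck {
  if (CRISIS_PATTERNS.some((p) => p.test(text))) return "crisis";
  if (AMBIGUOUS_PATTERNS.some((p) => p.test(text))) return "needs_model_check";
  return "clear";
}

// Stages 2 and 3: confirm ambiguous text with a model, then answer crisis
// content with the hardcoded template instead of a reframe.
function safetyGate(text: string, confirmWithModel: (t: string) => boolean): GateResult {
  const verdict = precheck(text);
  const isCrisis =
    verdict === "crisis" || (verdict === "needs_model_check" && confirmWithModel(text));
  return isCrisis ? { kind: "crisis", body: CRISIS_TEMPLATE } : { kind: "proceed" };
}
```

Note that the model is only consulted for ambiguous text, so clear crisis language never depends on provider availability.
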
## 12. Reliability and Performance

### 12.1 Initial SLO targets

- API availability: 99.5% monthly at launch, 99.9% target at scale.
- Turn and Mirror response latency:
  - p50 < 1.8s
  - p95 < 3.5s
- Weekly Spectrum jobs completed within 2 hours of scheduled run.

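
A latency SLO of this form can be checked against a window of request durations. The sketch below uses the nearest-rank percentile definition, which is an assumption; a metrics backend may interpolate differently, so its numbers can differ slightly on small samples. The helper names are illustrative.

```typescript
// Nearest-rank percentile over a non-empty sample of durations.
function percentile(samplesMs: number[], p: number): number {
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Targets from section 12.1: p50 < 1.8s and p95 < 3.5s.
function meetsLatencySlo(samplesMs: number[]): boolean {
  return percentile(samplesMs, 50) < 1800 && percentile(samplesMs, 95) < 3500;
}
```
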
### 12.2 Resilience patterns

- Idempotency keys on write endpoints.
- AI provider timeout + retry policy with circuit breaker.
- Graceful degradation hierarchy when budget/latency pressure occurs:
  1. Degrade Lens generation first (template fallback).
  2. Keep Turn and Mirror available.
  3. Pause non-critical Spectrum generation if needed.
- Dead-letter queue for failed async jobs.

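
The circuit breaker mentioned above can be sketched as a small state machine in front of the AI provider call. The threshold and cooldown values are left to the caller because this plan does not fix them, and the `CircuitBreaker` class itself is an illustrative assumption, not an existing Kalei component. The clock is injected so the behaviour can be tested without waiting.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // False while the breaker is open; after the cooldown it lets probes through.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Opens on a failed probe or once consecutive failures reach the threshold.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

While the breaker is open, the gateway can skip straight to the fallback model or a degraded response instead of spending the latency budget on a provider that is known to be failing.
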
## 13. Observability and FinOps

### 13.1 Telemetry

- Structured logs with request ID, user ID hash, feature, model, token usage, cost.
- Metrics:
  - request rate/error rate/latency by endpoint
  - AI token usage and cost by feature
  - quota denials and safety escalations
- Tracing across API -> AI Gateway -> provider call.

### 13.2 Cost controls

- Global monthly AI spend cap and alert thresholds (50%, 80%, 95%).
- Per-user daily token budget in Redis.
- Feature-level cost envelope with OpenRouter provider routing:
  - All features: DeepSeek V3.2 via DeepInfra/Fireworks (US/EU, $0.26/$0.38 per MTok)
  - Automatic failover: Claude Haiku 4.5 on provider outage ($1.00/$5.00 per MTok)
  - Future: introduce tiered model routing at 5,000+ DAU when usage data justifies complexity
- Prompt caching for stable system prompts (DeepInfra ~20% cache hit discount).

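
The per-token prices and alert thresholds above translate directly into arithmetic. The sketch below assumes the price pairs are input/output USD per million tokens (the usual convention, though the plan does not spell it out) and that the monthly cap is supplied by configuration; the helper names are illustrative.

```typescript
// USD per million tokens, read as input/output from the list above (assumed).
interface Price {
  inputPerMTok: number;
  outputPerMTok: number;
}

const DEEPSEEK_V32: Price = { inputPerMTok: 0.26, outputPerMTok: 0.38 };
const CLAUDE_HAIKU_45: Price = { inputPerMTok: 1.0, outputPerMTok: 5.0 };

// Cost of a single request in USD.
function requestCostUsd(price: Price, inputTokens: number, outputTokens: number): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1e6;
}

// Alert thresholds as fractions of the monthly cap (50%, 80%, 95%).
const ALERT_THRESHOLDS = [0.5, 0.8, 0.95];

// Thresholds newly crossed when month-to-date spend moves from `before` to `after`.
function crossedThresholds(before: number, after: number, monthlyCap: number): number[] {
  return ALERT_THRESHOLDS.filter((t) => before < t * monthlyCap && after >= t * monthlyCap);
}
```

For a request with 1,000 input and 500 output tokens, this gives about $0.00045 on the primary lane and $0.0035 on the fallback lane, roughly 87% cheaper for that token mix. The exact saving depends on the input/output ratio of each feature.
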
## 14. Deployment Topology and Scaling Path

### 14.1 Launch deployment (single-node)

```mermaid
flowchart LR
  EDGE[Caddy or Nginx Edge] --> NX[Nginx]
  NX --> API[API + Workers]
  API --> PG[(PostgreSQL)]
  API --> R[(Redis)]
  API --> AIP[AI Providers]
```

### 14.2 Scaling evolution

```mermaid
flowchart LR
  launch[Launch single VPS API DB Redis] --> traction[Traction split DB keep API monolith]
  traction --> growth[Growth separate workers and scale API]
  growth --> scale[Scale optional service extraction]
```

### 14.3 Trigger-based scaling

- Move DB off app node when p95 query latency > 120ms sustained or storage > 70%.
- Add API replica when CPU > 70% sustained at peak and p95 latency breaches SLO.
- Split workers when Spectrum jobs impact interactive endpoints.

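
These triggers can be encoded as a pure function over already-aggregated metrics, which makes them easy to wire into an alert rule. The `NodeMetrics` shape and field names are hypothetical, "sustained" is assumed to be handled upstream by the metrics window, and the 3.5s figure is the p95 latency target from section 12.1.

```typescript
// Hypothetical metric snapshot; values are assumed to be sustained averages.
interface NodeMetrics {
  p95QueryLatencyMs: number;
  storageUsedRatio: number; // 0..1
  peakCpuRatio: number; // 0..1
  p95ApiLatencyMs: number;
  spectrumJobsImpactInteractive: boolean;
}

type ScalingAction = "move_db_off_app_node" | "add_api_replica" | "split_workers";

const P95_API_SLO_MS = 3500; // p95 target from section 12.1

function scalingActions(m: NodeMetrics): ScalingAction[] {
  const actions: ScalingAction[] = [];
  if (m.p95QueryLatencyMs > 120 || m.storageUsedRatio > 0.7) {
    actions.push("move_db_off_app_node");
  }
  // Both conditions must hold: high CPU alone does not justify a replica.
  if (m.peakCpuRatio > 0.7 && m.p95ApiLatencyMs >= P95_API_SLO_MS) {
    actions.push("add_api_replica");
  }
  if (m.spectrumJobsImpactInteractive) {
    actions.push("split_workers");
  }
  return actions;
}
```
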
## 15. Delivery Plan

All features ship in a single unified v1 release. The build is a continuous 12-week effort:

### 15.1 Weeks 1–4: Platform Foundation

- API skeleton, auth, profile, entitlements integration.
- Postgres schema v1 and migrations.
- Mirror + Turn endpoints with safety pre-check.
- Usage metering and rate limiting.

### 15.2 Weeks 5–8: Core Experience

- Lens flows, Rehearsal, Ritual, Evidence Wall, and Gallery history.
- Push notifications and daily reminders.
- Full observability, alerting, and incident runbooks.
- Beta load testing and security hardening.

### 15.3 Weeks 9–12: Spectrum & Launch Readiness

- Spectrum: vector extraction pipeline, aggregated tables, weekly batch jobs, dashboard endpoints.
- Data exclusion controls and reset workflow.
- Cost optimization pass on AI routing.
- Final QA, store submission, beta launch.

## 16. Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Reframe quality variance by provider/model | Core UX degradation | Keep AI Gateway abstraction + blind quality harness + model canary rollout. |
| Safety false negatives | High trust and user harm risk | Defense-in-depth crisis filter + explicit no-reframe crisis policy + monitoring and review loop. |
| AI cost spikes | Margin compression | Hard spend caps, per-feature budgets, degradation order, model fallback lanes. |
| Single-node bottlenecks | Latency and availability issues | Trigger-based scaling plan and early instrumentation. |
| Sensitive data handling errors | Compliance and trust risk | Encryption, strict retention controls, deletion workflows, audit logs. |

## 17. Decision Log and Open Items

### 17.1 Decided in this plan

- Self-hosted API + Postgres + Redis is the canonical launch architecture.
- AI provider routing is built in from day one.
- Safety is an explicit service and gate on all AI-facing paths.
- Spectrum runs asynchronously over aggregated data.

### 17.2 Resolved: AI Provider Strategy (February 2026)

- **Decided:** DeepSeek V3.2 via OpenRouter, pinned to non-Chinese providers (DeepInfra/Fireworks). Single model for all features at launch. Claude Haiku 4.5 as automatic fallback.
- **Rationale:** 85–90% cost reduction vs Claude Haiku. Nature 2025 study confirms comparable emotional intelligence scores. Non-Chinese hosting avoids data sovereignty concerns. Single-model approach minimizes complexity for a solo founder.
- **Revisit at:** 600+ DAU (evaluate self-hosting), 5,000+ DAU (evaluate tiered model routing).

### 17.3 Remaining open decisions

- Exact hosting target for DB scaling at traction stage (dedicated VPS vs managed Postgres).
- Regional crisis resource strategy (US-first or multi-region at launch).

---

If approved, this document should become the architecture source of truth and supersede conflicting details in older planning docs.