# Kalei — AI Model Selection: Unbiased Analysis
## The Question
Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?

---
## What Kalei Actually Needs From Its AI
| Task | Quality Bar | Frequency | Latency Tolerance |
|------|------------|-----------|-------------------|
| **Mirror** — detect emotional fragments in freeform writing | High empathy + precision | 2-7x/week per user | 2-3s acceptable |
| **Kaleidoscope** — generate 3 perspective reframes | Highest — this IS the product | 3-10x/day per user | 2-3s acceptable |
| **Lens** — daily affirmation generation | Medium — structured output | 1x/day per user | 5s acceptable |
| **Crisis Detection** — flag self-harm/distress signals | Critical safety — zero false negatives | Every interaction | <1s preferred |
| **Spectrum** — weekly/monthly pattern analysis | High analytical depth | 1x/week batch | Minutes acceptable |

The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.

---
## Venice.ai API — What You Get
Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:
### Best Venice Models for Kalei
| Model | Input/MTok | Output/MTok | Cache Read | Context | Privacy | Notes |
|-------|-----------|------------|------------|---------|---------|-------|
| **DeepSeek V3.2** | $0.40 | $1.00 | $0.20 | 164K | Private | Strongest general model on Venice |
| **Qwen3 235B A22B** | $0.15 | $0.75 | — | 131K | Private | Best price-to-quality ratio |
| **Llama 3.3 70B** | $0.70 | $2.80 | — | 131K | Private | Meta's flagship open model |
| **Gemma 3 27B** | $0.12 | $0.20 | — | 203K | Private | Ultra-cheap, Google's open model |
| **Venice Small (Qwen3 4B)** | $0.05 | $0.15 | — | 33K | Private | Affirmation-tier only |
### Venice Advantages
- **Privacy-first architecture** — no data retention, critical for mental health
- **OpenAI-compatible API** — trivial to swap in/out, same SDK
- **Prompt caching** on select models (DeepSeek V3.2 confirmed)
- **You already pay for Pro** — $10 free API credit to test
- **No minimum commitment** — pure pay-per-use
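
Because the API is OpenAI-compatible, swapping Venice in or out is a configuration change rather than a rewrite. A minimal sketch of what that looks like — the base URLs and model slugs below are illustrative assumptions, not confirmed values:

```python
# Build an OpenAI-compatible chat payload for whichever provider is active.
# Base URLs and model slugs are placeholders -- check each provider's docs.
PROVIDERS = {
    "venice": {
        "base_url": "https://api.venice.ai/api/v1",  # assumed endpoint
        "model": "deepseek-v3.2",                    # hypothetical slug
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "model": "deepseek/deepseek-chat",
    },
}

def build_chat_request(provider: str, system: str, user: str) -> dict:
    """Assemble the URL and JSON body for a chat completion call."""
    cfg = PROVIDERS[provider]
    return {
        "url": cfg["base_url"] + "/chat/completions",
        "body": {
            "model": cfg["model"],
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        },
    }
```

Switching providers then means calling `build_chat_request("openrouter", ...)` instead of `"venice"` — the rest of the call path is unchanged.
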
### Venice Limitations
- **No batch API** — can't get 50% off for Spectrum overnight processing
- **"Uncensored" default posture** — Venice optimizes for no guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
- **No equivalent to Anthropic's constitutional AI** — crisis detection safety net is entirely on us
- **Smaller infrastructure** — less battle-tested at scale than Anthropic/OpenAI
- **Rate limits not publicly documented** — could be a problem at scale
---
## Head-to-Head: Venice Models vs Claude Haiku 4.5
### Cost Per User Per Month
Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.

| Model (via) | Free User/mo | Prism User/mo | vs Claude Haiku |
|-------------|-------------|--------------|-----------------|
| **Claude Haiku 4.5** (Anthropic) | $0.31 | $0.63 | baseline |
| **DeepSeek V3.2** (Venice) | ~$0.07 | ~$0.15 | **78% cheaper** |
| **Qwen3 235B** (Venice) | ~$0.05 | ~$0.10 | **84% cheaper** |
| **Llama 3.3 70B** (Venice) | ~$0.16 | ~$0.33 | **48% cheaper** |
| **Gemma 3 27B** (Venice) | ~$0.02 | ~$0.04 | **94% cheaper** |

The cost difference is massive. At 200 DAU (traction), monthly AI cost drops from ~$50 to ~$10-15.
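
The per-user figures above can be roughly reproduced from the usage model. The token counts per request here are illustrative assumptions (actual prompt sizes will shift the numbers), but the relative ranking holds:

```python
# Estimate per-user monthly AI cost from $/MTok prices and request volume.
# Token counts per request (800 in / 400 out) are rough assumptions.
def monthly_cost(in_price: float, out_price: float, requests: int,
                 in_tok: int = 800, out_tok: int = 400) -> float:
    """in_price/out_price in $ per million tokens; requests per month."""
    return requests * (in_tok * in_price + out_tok * out_price) / 1_000_000

# Free user: 3 Kaleidoscope turns/day + 2 Mirror/week + daily Lens
free_requests = 3 * 30 + 2 * 4 + 1 * 30  # 128 requests/month

for name, inp, outp in [
    ("DeepSeek V3.2", 0.40, 1.00),
    ("Qwen3 235B",    0.15, 0.75),
    ("Gemma 3 27B",   0.12, 0.20),
]:
    print(f"{name}: ${monthly_cost(inp, outp, free_requests):.3f}/user/mo")
```
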
### Quality Comparison for Emotional Tasks
This is the critical question. Here's what the research and benchmarks tell us:

**Emotional Intelligence (EI) Benchmarks:**
- A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
- GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
- Claude models are specifically noted for "endless empathy" — excellent for therapeutic contexts but with dependency risk
- A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice

**Model-Specific Emotional Qualities:**

| Model | Empathy Quality | Tone Consistency | Creative Reframing | Safety/Guardrails |
|-------|----------------|-----------------|-------------------|-------------------|
| Claude Haiku 4.5 | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| DeepSeek V3.2 | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Qwen3 235B | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Llama 3.3 70B | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Gemma 3 27B | ★★☆☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ |

**Key findings:**
- DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing" — problematic for daily therapeutic interactions
- Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions" — actually quite good for our use case
- Llama 3.3 is solid but unremarkable for emotional tasks
- Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
- Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box
---
## The Final Decision (Updated February 2026)
After evaluating all options including Venice, Claude-first, and various hybrid strategies, the decision is:
### ★ Chosen: DeepSeek V3.2 via OpenRouter + Non-Chinese Providers
**Primary:** DeepSeek V3.2 routed through DeepInfra/Fireworks (US/EU infrastructure) via OpenRouter
**Fallback:** Claude Haiku 4.5 via OpenRouter (automatic failover on provider outage)
**Single model for all features** — no tiering until 5,000+ DAU justifies the complexity

| | DeepInfra (via OpenRouter) | Claude Haiku 4.5 (fallback) |
|---|---|---|
| Input (cache miss) | $0.26/M | $1.00/M |
| Input (cache hit) | $0.216/M | $0.10/M |
| Output | $0.38/M | $5.00/M |

**Monthly AI cost at 200 DAU: ~$8** (vs $50 with Claude Haiku, vs $12-18 Venice hybrid)
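
With OpenRouter, the provider pinning and the Claude fallback live in the request body itself. The field names below follow OpenRouter's request options as I understand them, and the model slugs are assumptions — verify both against current OpenRouter docs before relying on them:

```python
# One request body carries the whole routing policy: model fallback order
# plus which upstream providers are allowed to serve the request.
payload = {
    # Tried in order if the primary errors out or is unavailable
    "models": [
        "deepseek/deepseek-chat",      # DeepSeek V3.2 (assumed slug)
        "anthropic/claude-haiku-4.5",  # Claude Haiku fallback (assumed slug)
    ],
    # Pin to US/EU hosts; never route to providers outside this list
    "provider": {
        "order": ["DeepInfra", "Fireworks"],
        "allow_fallbacks": False,
    },
    "messages": [{"role": "user", "content": "..."}],
}
```

Failover then requires no application code: if DeepInfra and Fireworks both fail, OpenRouter moves down the `models` list to Claude Haiku.
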
### Why This Beats All Other Options
1. **Data privacy solved** — DeepInfra/Fireworks host on US/EU infrastructure. No data through Chinese servers. Critical for a mental wellness app.
2. **85-90% cheaper than Claude** — per-user AI cost drops from ~$0.31 to ~$0.034/month (free users).
3. **Automatic failover** — OpenRouter routes to Claude Haiku if DeepInfra goes down. No code changes, no downtime.
4. **No vendor lock-in** — one API key, switch models/providers via config. OpenRouter's API is OpenAI-compatible.
5. **Single model simplicity** — one prompt set to tune, one quality bar to maintain. Solo founder can manage this.
6. **Emotional intelligence validated** — Nature 2025 study shows DeepSeek V3 scores comparably to Claude on standardized EI tests (81% avg vs 56% human avg).
### Why Not Venice, Groq, or Direct DeepSeek
- **Venice:** No batch API, "uncensored" default posture requires extra safety work, rate limits undocumented, smaller infrastructure.
- **Groq:** Great speed but limited model selection. Useful as a future tier for structured generation at 5,000+ DAU.
- **DeepSeek Direct API:** Cheapest option ($0.028 cache hits) but routes all data through Chinese servers. Non-starter for mental health data.
- **Tiered hybrid (Option D):** Saves ~$30-50/month over single-model approach but adds 4 separate prompt configs, routing logic, and quality benchmarks. Not worth the complexity at current scale.
### Safety Layer (Non-Negotiable Regardless of Provider)
```
User input → Keyword crisis detector (local, instant)
→ If flagged: hardcoded crisis response (no LLM needed)
→ If clear: send to OpenRouter with safety-focused system prompt
→ Post-process: scan output for harmful patterns before showing to user
```
We build our own safety layer regardless of provider. This gives us MORE control than relying on any provider's built-in guardrails.

---
## Final Cost Model (OpenRouter + DeepInfra)
| Stage | DAU | AI Cost/mo | Total Infra/mo | Break-even Subscribers |
|-------|-----|-----------|----------------|----------------------|
| Launch (0-500 users) | ~50 | ~$2 | ~$16 | **3 Prism @ $4.99** |
| Traction (500-2K) | ~200 | ~$8 | ~$53 | **11 Prism** |
| Growth (2K-10K) | ~1K | ~$40 | ~$216 | **43 Prism** |

Compared to the Claude-first plan: launch drops from $26/mo to $16, growth from $425 to $216, and AI falls from 60% of total spend to 19%.

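
The break-even column follows directly from the infra totals, assuming the gross $4.99 Prism price (app-store fees ignored) and rounding to the nearest subscriber:

```python
# Break-even Prism subscribers = monthly infra cost / $4.99, rounded.
PRISM_PRICE = 4.99

def break_even(infra_monthly: float) -> int:
    return round(infra_monthly / PRISM_PRICE)

for stage, infra in [("Launch", 16), ("Traction", 53), ("Growth", 216)]:
    print(f"{stage}: {break_even(infra)} Prism subscribers")
```
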
---
## Scaling Roadmap
1. **Launch → 600 DAU:** Single model (DeepSeek V3.2 via OpenRouter/DeepInfra). Focus on prompt quality.
2. **600+ DAU:** Evaluate self-hosted Qwen3-30B-A3B on GPU ($245/month fixed) — cheaper than API at this volume, full data control.
3. **5,000+ DAU:** Introduce tiered model routing if usage data shows certain features benefit from specialized models.
4. **Build the safety layer regardless** — multi-stage crisis filter is a day-one requirement, not a provider feature.