kalei/docs/technical/kalei-ai-model-comparison.md
# Kalei — AI Model Selection: Unbiased Analysis
## The Question
Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?
---
## What Kalei Actually Needs From Its AI
| Task | Quality Bar | Frequency | Latency Tolerance |
|------|------------|-----------|-------------------|
| **Mirror** — detect emotional fragments in freeform writing | High empathy + precision | 2-7x/week per user | 2-3s acceptable |
| **Kaleidoscope** — generate 3 perspective reframes | Highest — this IS the product | 3-10x/day per user | 2-3s acceptable |
| **Lens** — daily affirmation generation | Medium — structured output | 1x/day per user | 5s acceptable |
| **Crisis Detection** — flag self-harm/distress signals | Critical safety — zero false negatives | Every interaction | <1s preferred |
| **Spectrum** — weekly/monthly pattern analysis | High analytical depth | 1x/week batch | Minutes acceptable |
The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.
---
## Venice.ai API — What You Get
Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:
### Best Venice Models for Kalei
| Model | Input/MTok | Output/MTok | Cache Read | Context | Privacy | Notes |
|-------|-----------|------------|------------|---------|---------|-------|
| **DeepSeek V3.2** | $0.40 | $1.00 | $0.20 | 164K | Private | Strongest general model on Venice |
| **Qwen3 235B A22B** | $0.15 | $0.75 | | 131K | Private | Best price-to-quality ratio |
| **Llama 3.3 70B** | $0.70 | $2.80 | | 131K | Private | Meta's flagship open model |
| **Gemma 3 27B** | $0.12 | $0.20 | | 203K | Private | Ultra-cheap, Google's open model |
| **Venice Small (Qwen3 4B)** | $0.05 | $0.15 | | 33K | Private | Affirmation-tier only |
### Venice Advantages
- **Privacy-first architecture:** no data retention, critical for mental health
- **OpenAI-compatible API:** trivial to swap in/out, same SDK
- **Prompt caching** on select models (DeepSeek V3.2 confirmed)
- **You already pay for Pro:** $10 free API credit to test
- **No minimum commitment:** pure pay-per-use
### Venice Limitations
- **No batch API:** can't get 50% off for Spectrum overnight processing
- **"Uncensored" default posture:** Venice optimizes for minimal guardrails, which is the *opposite* of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
- **No equivalent to Anthropic's constitutional AI:** the crisis-detection safety net is entirely on us
- **Smaller infrastructure:** less battle-tested at scale than Anthropic/OpenAI
- **Rate limits not publicly documented:** could be a problem at scale
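Because the API is OpenAI-compatible, swapping Venice in or out is a configuration change, not a rewrite. A minimal sketch of what that means in practice (the base URL and model slug shown are illustrative, not confirmed endpoints):

```python
import json

def build_chat_request(base_url: str, api_key: str, model: str, messages: list) -> tuple:
    """Build an OpenAI-style chat-completions request for any compatible provider.

    Only the base URL, API key, and model slug differ between providers; the
    request shape stays identical.
    """
    url = f"{base_url}/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

# Switching providers is a one-line config change (values are placeholders):
venice_request = build_chat_request(
    "https://api.venice.ai/api/v1", "VENICE_KEY",
    "deepseek-v3.2", [{"role": "user", "content": "hi"}],
)
```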
---
## Head-to-Head: Venice Models vs Claude Haiku 4.5
### Cost Per User Per Month
Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.
| Model (via) | Free User/mo | Prism User/mo | vs Claude Haiku |
|-------------|-------------|--------------|-----------------|
| **Claude Haiku 4.5** (Anthropic) | $0.31 | $0.63 | baseline |
| **DeepSeek V3.2** (Venice) | ~$0.07 | ~$0.15 | **78% cheaper** |
| **Qwen3 235B** (Venice) | ~$0.05 | ~$0.10 | **84% cheaper** |
| **Llama 3.3 70B** (Venice) | ~$0.16 | ~$0.33 | **48% cheaper** |
| **Gemma 3 27B** (Venice) | ~$0.02 | ~$0.04 | **94% cheaper** |
The cost difference is massive. At 200 DAU (traction), monthly AI cost drops from ~$50 to ~$10-15.
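The per-user figures can be sanity-checked with a small calculator. The usage frequencies come from our established model above; the per-call token counts are rough assumptions for illustration only:

```python
def monthly_cost_usd(calls_per_month: float, in_tokens: int, out_tokens: int,
                     in_price_per_mtok: float, out_price_per_mtok: float) -> float:
    """Monthly cost of one feature for one user at given per-MTok prices."""
    return calls_per_month * (in_tokens * in_price_per_mtok +
                              out_tokens * out_price_per_mtok) / 1_000_000

# Free-user model: 3 Turns/day, 2 Mirror/week, daily Lens.
# Token counts per call are assumptions, not measured values.
# Pricing: Qwen3 235B on Venice ($0.15 in / $0.75 out per MTok).
turns  = monthly_cost_usd(3 * 30, 800, 400, 0.15, 0.75)
mirror = monthly_cost_usd(2 * 4.33, 1200, 300, 0.15, 0.75)
lens   = monthly_cost_usd(30, 400, 150, 0.15, 0.75)
total  = turns + mirror + lens  # lands in the same ballpark as the table's ~$0.05
```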
### Quality Comparison for Emotional Tasks
This is the critical question. Here's what the research and benchmarks tell us:
**Emotional Intelligence (EI) Benchmarks:**
- A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
- GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
- Claude models are specifically noted for "endless empathy": excellent for therapeutic contexts, but with dependency risk
- A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice
**Model-Specific Emotional Qualities:**
| Model | Empathy Quality | Tone Consistency | Creative Reframing | Safety/Guardrails |
|-------|----------------|-----------------|-------------------|-------------------|
| Claude Haiku 4.5 | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| DeepSeek V3.2 | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Qwen3 235B | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Llama 3.3 70B | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Gemma 3 27B | ★★☆☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ |
**Key findings:**
- DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing", which is problematic for daily therapeutic interactions
- Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions", which is actually quite good for our use case
- Llama 3.3 is solid but unremarkable for emotional tasks
- Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
- Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box
---
## The Final Decision (Updated February 2026)
After evaluating all options including Venice, Claude-first, and various hybrid strategies, the decision is:
### ★ Chosen: DeepSeek V3.2 via OpenRouter + Non-Chinese Providers
**Primary:** DeepSeek V3.2 routed through DeepInfra/Fireworks (US/EU infrastructure) via OpenRouter
**Fallback:** Claude Haiku 4.5 via OpenRouter (automatic failover on provider outage)
**Single model for all features:** no tiering until 5,000+ DAU justifies the complexity
| Pricing | DeepSeek V3.2 via DeepInfra (OpenRouter) | Claude Haiku 4.5 (fallback) |
|---|---|---|
| Input (cache miss) | $0.26/M | $1.00/M |
| Input (cache hit) | $0.216/M | $0.10/M |
| Output | $0.38/M | $5.00/M |
**Monthly AI cost at 200 DAU: ~$8** (vs $50 with Claude Haiku, vs $12-18 Venice hybrid)
### Why This Beats All Other Options
1. **Data privacy solved:** DeepInfra/Fireworks host on US/EU infrastructure. No data through Chinese servers. Critical for a mental wellness app.
2. **85-90% cheaper than Claude:** per-user AI cost drops from $0.33 to ~$0.034/month (free users).
3. **Automatic failover:** OpenRouter routes to Claude Haiku if DeepInfra goes down. No code changes, no downtime.
4. **No vendor lock-in:** one API key; switch models/providers via config. OpenRouter's API is OpenAI-compatible.
5. **Single-model simplicity:** one prompt set to tune, one quality bar to maintain. A solo founder can manage this.
6. **Emotional intelligence validated:** the Nature 2025 study shows DeepSeek V3 scores comparably to Claude on standardized EI tests (81% avg vs 56% human avg).
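The failover and provider-pinning described above can be expressed directly in the request body. A sketch of what that payload might look like (model slugs and provider names are illustrative; check OpenRouter's current model list and routing docs for the exact identifiers):

```python
import json

# Sketch of an OpenRouter request with a fallback model and provider pinning.
# Slugs/provider names below are assumptions for illustration.
payload = {
    "model": "deepseek/deepseek-chat",            # primary: DeepSeek V3.2
    "models": ["anthropic/claude-haiku-4.5"],     # fallback on provider failure
    "provider": {"order": ["DeepInfra", "Fireworks"]},  # keep data on US/EU hosts
    "messages": [{"role": "user", "content": "..."}],
}
body = json.dumps(payload)
```

The point is that failover lives in configuration, not application code: swapping the fallback model or provider order never touches the request path.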
### Why Not Venice, Groq, or Direct DeepSeek
- **Venice:** No batch API, "uncensored" default posture requires extra safety work, rate limits undocumented, smaller infrastructure.
- **Groq:** Great speed but limited model selection. Useful as a future tier for structured generation at 5,000+ DAU.
- **DeepSeek Direct API:** Cheapest option ($0.028 cache hits) but routes all data through Chinese servers. Non-starter for mental health data.
- **Tiered hybrid (Option D):** Saves ~$30-50/month over single-model approach but adds 4 separate prompt configs, routing logic, and quality benchmarks. Not worth the complexity at current scale.
### Safety Layer (Non-Negotiable Regardless of Provider)
```
User input → Keyword crisis detector (local, instant)
→ If flagged: hardcoded crisis response (no LLM needed)
→ If clear: send to OpenRouter with safety-focused system prompt
→ Post-process: scan output for harmful patterns before showing to user
```
We build our own safety layer regardless of provider. This gives us *more* control than relying on any provider's built-in guardrails.
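A minimal sketch of the first and last stages of that pipeline. The keyword list here is purely illustrative; a production filter would be far more comprehensive and clinically reviewed:

```python
import re

# Illustrative patterns only; the real list must be clinically reviewed.
CRISIS_PATTERNS = re.compile(
    r"\b(suicide|suicidal|kill myself|end my life|self[- ]harm|hurt myself)\b",
    re.IGNORECASE,
)

CRISIS_RESPONSE = (
    "It sounds like you're carrying something really heavy right now. "
    "Please reach out to a crisis line such as 988 (US) — you deserve support."
)

def handle_input(text: str):
    """Stage 1: local keyword check, instant, no LLM call.

    Returns (flagged, hardcoded_response). If flagged, the hardcoded crisis
    response is shown and the text never reaches the model.
    """
    if CRISIS_PATTERNS.search(text):
        return True, CRISIS_RESPONSE
    return False, None  # clear: safe to forward to OpenRouter

def output_is_safe(text: str) -> bool:
    """Stage 3: post-process model output before showing it to the user."""
    return CRISIS_PATTERNS.search(text) is None
```

Because stage 1 is a local regex, it is effectively free and always-on, which is exactly why it can run on every interaction regardless of which provider sits behind it.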
---
## Final Cost Model (OpenRouter + DeepInfra)
| Stage | DAU | AI Cost/mo | Total Infra/mo | Break-even Subscribers |
|-------|-----|-----------|----------------|----------------------|
| Launch (0-500 users) | ~50 | ~$2 | ~$16 | **3 Prism @ $4.99** |
| Traction (500-2K) | ~200 | ~$8 | ~$53 | **11 Prism** |
| Growth (2K-10K) | ~1K | ~$40 | ~$216 | **43 Prism** |
Compare to Claude-first: launch was $26/mo, now $16; growth was $425/mo, now $216. AI went from 60% of total spend to 19%.
---
## Scaling Roadmap
1. **Launch → 600 DAU:** Single model (DeepSeek V3.2 via OpenRouter/DeepInfra). Focus on prompt quality.
2. **600+ DAU:** evaluate self-hosted Qwen3-30B-A3B on a GPU ($245/month fixed): cheaper than the API at this volume, with full data control.
3. **5,000+ DAU:** Introduce tiered model routing if usage data shows certain features benefit from specialized models.
4. **Build the safety layer regardless:** the multi-stage crisis filter is a day-one requirement, not a provider feature.
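Step 3's tiered routing can stay dormant in config until the data justifies it. A hypothetical sketch (feature names and model slugs are placeholders, not decided values):

```python
# Hypothetical tier-routing config: every feature uses the single default model
# until usage data at 5,000+ DAU justifies specializing a route.
DEFAULT_MODEL = "deepseek/deepseek-chat"

MODEL_ROUTES: dict[str, str] = {
    # "lens": "qwen/qwen3-4b",  # e.g. route cheap structured output later
}

def model_for(feature: str) -> str:
    """Resolve which model slug serves a given feature."""
    return MODEL_ROUTES.get(feature, DEFAULT_MODEL)
```

Until a route is added, every feature resolves to the default, so enabling tiering later is a config edit rather than new routing code.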