kalei/docs/technical/kalei-ai-model-comparison.md


# Kalei — AI Model Selection: Unbiased Analysis
## The Question
Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?
---
## What Kalei Actually Needs From Its AI
| Task | Quality Bar | Frequency | Latency Tolerance |
|------|------------|-----------|-------------------|
| **Mirror** — detect emotional fragments in freeform writing | High empathy + precision | 2-7x/week per user | 2-3s acceptable |
| **Kaleidoscope** — generate 3 perspective reframes | Highest — this IS the product | 3-10x/day per user | 2-3s acceptable |
| **Lens** — daily affirmation generation | Medium — structured output | 1x/day per user | 5s acceptable |
| **Crisis Detection** — flag self-harm/distress signals | Critical safety — zero false negatives | Every interaction | <1s preferred |
| **Spectrum** weekly/monthly pattern analysis | High analytical depth | 1x/week batch | Minutes acceptable |
The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.
---
## Venice.ai API — What You Get
Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:
### Best Venice Models for Kalei
| Model | Input/MTok | Output/MTok | Cache Read | Context | Privacy | Notes |
|-------|-----------|------------|------------|---------|---------|-------|
| **DeepSeek V3.2** | $0.40 | $1.00 | $0.20 | 164K | Private | Strongest general model on Venice |
| **Qwen3 235B A22B** | $0.15 | $0.75 | | 131K | Private | Best price-to-quality ratio |
| **Llama 3.3 70B** | $0.70 | $2.80 | | 131K | Private | Meta's flagship open model |
| **Gemma 3 27B** | $0.12 | $0.20 | | 203K | Private | Ultra-cheap, Google's open model |
| **Venice Small (Qwen3 4B)** | $0.05 | $0.15 | | 33K | Private | Affirmation-tier only |
### Venice Advantages
- **Privacy-first architecture**: no data retention, critical for mental health data
- **OpenAI-compatible API**: trivial to swap in or out, same SDK
- **Prompt caching** on select models (DeepSeek V3.2 confirmed)
- **You already pay for Pro**: $10 of free API credit to test with
- **No minimum commitment**: pure pay-per-use
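The "OpenAI-compatible, swap via base URL" point can be sketched as a small provider registry. The base URLs and the `LLMConfig` helper are assumptions for illustration; verify the actual endpoint against Venice's API docs before relying on it.

```python
# Sketch: provider-agnostic client config for OpenAI-compatible APIs.
# Base URLs below are assumptions to confirm against each provider's docs.
from dataclasses import dataclass

PROVIDERS = {
    "venice": "https://api.venice.ai/api/v1",
    "openrouter": "https://openrouter.ai/api/v1",
}

@dataclass
class LLMConfig:
    base_url: str
    model: str

def config_for(provider: str, model: str) -> LLMConfig:
    """Swapping providers is a base-URL change; the client SDK stays the same."""
    return LLMConfig(base_url=PROVIDERS[provider], model=model)

# With the official openai SDK, usage would look like:
#   client = OpenAI(base_url=cfg.base_url, api_key=os.environ["LLM_API_KEY"])
#   client.chat.completions.create(model=cfg.model, messages=[...])
```

Keeping the provider choice in one config object is what makes the later "switching cost is near zero" claim true in practice.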
### Venice Limitations
- **No batch API**: no 50% discount for Spectrum overnight processing
- **"Uncensored" default posture**: Venice optimizes for minimal guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice's system prompts and provide our own safety layer
- **No equivalent to Anthropic's constitutional AI**: the crisis-detection safety net is entirely on us
- **Smaller infrastructure**: less battle-tested at scale than Anthropic or OpenAI
- **Rate limits not publicly documented**: could be a problem at scale
---
## Head-to-Head: Venice Models vs Claude Haiku 4.5
### Cost Per User Per Month
Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.
| Model (via) | Free User/mo | Prism User/mo | vs Claude Haiku |
|-------------|-------------|--------------|-----------------|
| **Claude Haiku 4.5** (Anthropic) | $0.31 | $0.63 | baseline |
| **DeepSeek V3.2** (Venice) | ~$0.07 | ~$0.15 | **78% cheaper** |
| **Qwen3 235B** (Venice) | ~$0.05 | ~$0.10 | **84% cheaper** |
| **Llama 3.3 70B** (Venice) | ~$0.16 | ~$0.33 | **48% cheaper** |
| **Gemma 3 27B** (Venice) | ~$0.02 | ~$0.04 | **94% cheaper** |
The cost difference is massive. At 200 DAU (Phase 2), monthly AI cost drops from ~$50 to ~$10-15.
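The per-user numbers above can be reproduced with simple arithmetic. The token counts per interaction below are illustrative assumptions (not measured values), and the call counts follow the free-user usage model stated above; prices are from the Venice table.

```python
# Sketch of the $/user/month math. Token volumes are assumptions.
PRICES = {  # $ per million tokens: (input, output)
    "qwen3-235b": (0.15, 0.75),
    "deepseek-v3.2": (0.40, 1.00),
}
TOKENS_PER_CALL = (800, 400)  # (input, output) per interaction — illustrative

def monthly_cost(model: str, calls_per_month: int) -> float:
    """Rough $/user/month: calls x per-call token volume x per-MTok price."""
    in_price, out_price = PRICES[model]
    in_tok, out_tok = TOKENS_PER_CALL
    return calls_per_month * (in_tok * in_price + out_tok * out_price) / 1e6

# Free user: 3 Turns/day (~90/mo) + 2 Mirror/week (~8/mo) + daily Lens (30/mo)
free_calls = 90 + 8 + 30
print(f"Qwen3 235B free user: ${monthly_cost('qwen3-235b', free_calls):.3f}/mo")
```

Under these assumptions the Qwen3 free-user cost lands near the ~$0.05/month figure in the table; swapping in real measured token counts is a one-line change.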
### Quality Comparison for Emotional Tasks
This is the critical question. Here's what the research and benchmarks tell us:
**Emotional Intelligence (EI) Benchmarks:**
- A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
- GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
- Claude models are specifically noted for "endless empathy": excellent for therapeutic contexts, but with dependency risk
- A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice
**Model-Specific Emotional Qualities:**
| Model | Empathy Quality | Tone Consistency | Creative Reframing | Safety/Guardrails |
|-------|----------------|-----------------|-------------------|-------------------|
| Claude Haiku 4.5 | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| DeepSeek V3.2 | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Qwen3 235B | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Llama 3.3 70B | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Gemma 3 27B | ★★☆☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ |
**Key findings:**
- DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing": problematic for daily therapeutic interactions
- Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions": actually quite good for our use case
- Llama 3.3 is solid but unremarkable for emotional tasks
- Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
- Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box
---
## The Honest Recommendation
### Option A: Venice-First (Lowest Cost)
**Primary: Qwen3 235B via Venice** for all features
- Monthly cost at 200 DAU: ~$10-15
- Pros: 84% cheaper, privacy-first, you already have the account
- Cons: No batch API (Spectrum costs more), no built-in safety net, requires extensive prompt engineering for emotional quality, crisis detection entirely self-built
- Risk: If reframe quality feels "off" or generic, the core product fails
### Option B: Claude-First (Current Plan)
**Primary: Claude Haiku 4.5 via Anthropic**
- Monthly cost at 200 DAU: ~$50
- Pros: Best-in-class empathy and safety, prompt caching, batch API (50% off Spectrum), constitutional AI for crisis detection
- Cons: 4-6x more expensive, Anthropic lock-in
- Risk: Higher burn rate, but product quality is higher
### Option C: Hybrid (Recommended) ★
**Split by task criticality:**
| Task | Model | Via | Why |
|------|-------|-----|-----|
| **Kaleidoscope reframes** | Qwen3 235B | Venice | Core product, needs quality BUT Qwen3 handles tone consistency well. Test extensively. |
| **Mirror fragments** | Qwen3 235B | Venice | Structured detection task, Qwen3 is precise enough |
| **Lens affirmations** | Venice Small (Qwen3 4B) | Venice | Simple generation, doesn't need a big model |
| **Crisis detection** | Application-layer keywords + Qwen3 235B | Venice + custom code | Keyword matching first, LLM confirmation second |
| **Spectrum batch** | DeepSeek V3.2 | Venice | Analytical task, DeepSeek excels at structured analysis |
**Estimated monthly cost at 200 DAU: ~$12-18** (vs $50 with Claude, vs $10 all-Qwen3)
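The routing table above reduces to a small task-to-model map. The model identifiers are placeholders to confirm against Venice's model list; the point is that routing lives in config, so any single task can later be upgraded (e.g. Kaleidoscope to Claude) without touching the rest.

```python
# Sketch of task->model routing for the hybrid plan.
# Model names are placeholder identifiers, not confirmed Venice IDs.
ROUTES = {
    "kaleidoscope": "qwen3-235b",    # core product; test quality extensively
    "mirror":       "qwen3-235b",    # structured fragment detection
    "lens":         "qwen3-4b",      # Venice Small: simple affirmations
    "crisis":       "qwen3-235b",    # LLM confirmation after keyword pass
    "spectrum":     "deepseek-v3.2", # analytical batch work
}

def model_for(task: str) -> str:
    """Route each feature to the cheapest model that meets its quality bar."""
    return ROUTES[task]
```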
### Why Hybrid via Venice Wins
1. **You already pay for Pro**: the $10 credit lets you prototype immediately
2. **OpenAI-compatible API**: if Venice quality disappoints, swapping to Anthropic/Groq/OpenRouter is a one-line base-URL change
3. **Privacy alignment**: Venice's no-data-retention policy is exactly what mental health data demands
4. **Cost headroom**: at $12-18/mo AI cost, you could serve 200 DAU and still be profitable with just 3-4 Prism subscribers
5. **Qwen3 235B is genuinely good**: it's not a compromise model, and it scores competitively on emotional tasks
### The Critical Caveat: Safety Layer
Venice's "uncensored" philosophy means we MUST build our own safety layer:
```
User input → Keyword crisis detector (local, instant)
→ If flagged: hardcoded crisis response (no LLM needed)
→ If clear: send to Venice API with our safety-focused system prompt
→ Post-process: scan output for harmful patterns before showing to user
```
This adds development time but gives us MORE control than relying on any provider's built-in guardrails.
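The pipeline above can be sketched in a few functions. The keyword list, blocklist, and response strings here are illustrative placeholders only, not clinical content; a real crisis lexicon and response would need expert review.

```python
# Minimal sketch of the three-step safety pipeline. All word lists and
# response text are placeholders, NOT a vetted clinical resource.
CRISIS_KEYWORDS = {"suicide", "kill myself", "end it all", "self-harm"}
CRISIS_RESPONSE = ("It sounds like you're carrying something heavy. "
                   "Please reach out to a crisis line or someone you trust.")
OUTPUT_BLOCKLIST = {"you should just"}  # placeholder harmful-pattern scan
SAFE_FALLBACK = "I want to respond to this carefully. Can you tell me more?"

def screen_input(text: str):
    """Step 1: local keyword detector — instant, no LLM call."""
    lowered = text.lower()
    if any(kw in lowered for kw in CRISIS_KEYWORDS):
        return CRISIS_RESPONSE  # hardcoded crisis response, skip the LLM
    return None

def screen_output(reply: str) -> bool:
    """Step 3: scan model output for harmful patterns before display."""
    lowered = reply.lower()
    return not any(p in lowered for p in OUTPUT_BLOCKLIST)

def respond(text: str, call_llm) -> str:
    crisis = screen_input(text)
    if crisis is not None:
        return crisis
    reply = call_llm(text)  # Step 2: Venice API with our safety system prompt
    return reply if screen_output(reply) else SAFE_FALLBACK
```

Because step 1 runs locally, the crisis path has zero LLM latency and zero dependence on any provider's guardrails.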
---
## Revised Cost Model with Venice
| Phase | DAU | AI Cost/mo | Total Infra/mo | Break-even Subscribers |
|-------|-----|-----------|----------------|----------------------|
| Phase 1 (0-500 users) | ~50 | ~$4 | ~$17 | **4 Prism @ $4.99** |
| Phase 2 (500-2K) | ~200 | ~$15 | ~$40 | **8 Prism** |
| Phase 3 (2K-10K) | ~1K | ~$60 | ~$110 | **22 Prism** |
Compare to Claude-first: Phase 1 was $26/mo, now $17. Phase 2 was $90-100, now $40. That's significant runway extension.
---
## Action Plan
1. **Immediately**: Use your Venice Pro $10 credit to test Qwen3 235B with Kalei's actual system prompts
2. **Build a test harness**: Send 50 real emotional writing samples through both Qwen3 (Venice) and Claude Haiku, blind-rate the outputs
3. **If Qwen3 passes**: Go Venice-first, save 60-80% on AI costs
4. **If Qwen3 disappoints on reframes specifically**: Use Claude Haiku for Kaleidoscope only, Venice for everything else
5. **Build the safety layer regardless**: don't rely on any provider's guardrails for a mental health app
The API is OpenAI-compatible, so the switching cost is near zero. Start cheap, validate quality, upgrade only where needed.
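The blind-rating step of the test harness can be sketched as follows. The generator callables and IDs are illustrative; the key property is that model labels are shuffled and hidden behind an answer key so raters can't tell which output came from which provider.

```python
# Sketch of the blind A/B harness from step 2 of the action plan.
import random

def make_blind_pairs(samples, gen_a, gen_b, seed=0):
    """For each sample, return shuffled outputs plus a hidden answer key."""
    rng = random.Random(seed)  # fixed seed so the key is reproducible
    pairs, key = [], {}
    for i, text in enumerate(samples):
        outputs = [("A", gen_a(text)), ("B", gen_b(text))]
        rng.shuffle(outputs)          # hide which slot is which model
        key[i] = [label for label, _ in outputs]
        pairs.append((i, [out for _, out in outputs]))
    return pairs, key
```

Raters score each pair on empathy and reframe quality, then the key is unsealed to tally Qwen3 vs Claude wins per task.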