# Kalei — AI Model Selection: Unbiased Analysis

## The Question

Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?

---

## What Kalei Actually Needs From Its AI

| Task | Quality Bar | Frequency | Latency Tolerance |
|------|------------|-----------|-------------------|
| **Mirror** — detect emotional fragments in freeform writing | High empathy + precision | 2-7x/week per user | 2-3s acceptable |
| **Kaleidoscope** — generate 3 perspective reframes | Highest — this IS the product | 3-10x/day per user | 2-3s acceptable |
| **Lens** — daily affirmation generation | Medium — structured output | 1x/day per user | 5s acceptable |
| **Crisis Detection** — flag self-harm/distress signals | Critical safety — zero false negatives | Every interaction | <1s preferred |
| **Spectrum** — weekly/monthly pattern analysis | High analytical depth | 1x/week batch | Minutes acceptable |

The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.

---

## Venice.ai API — What You Get

Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:

### Best Venice Models for Kalei

| Model | Input/MTok | Output/MTok | Cache Read | Context | Privacy | Notes |
|-------|-----------|------------|------------|---------|---------|-------|
| **DeepSeek V3.2** | $0.40 | $1.00 | $0.20 | 164K | Private | Strongest general model on Venice |
| **Qwen3 235B A22B** | $0.15 | $0.75 | — | 131K | Private | Best price-to-quality ratio |
| **Llama 3.3 70B** | $0.70 | $2.80 | — | 131K | Private | Meta's flagship open model |
| **Gemma 3 27B** | $0.12 | $0.20 | — | 203K | Private | Ultra-cheap, Google's open model |
| **Venice Small (Qwen3 4B)** | $0.05 | $0.15 | — | 33K | Private | Affirmation-tier only |

### Venice Advantages

- **Privacy-first architecture** — no data retention, critical for mental health
- **OpenAI-compatible API** — trivial to swap in/out, same SDK
- **Prompt caching** on select models (DeepSeek V3.2 confirmed)
- **You already pay for Pro** — $10 free API credit to test
- **No minimum commitment** — pure pay-per-use
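
Because the API is OpenAI-compatible, the provider decision can be isolated to a single config lookup. A minimal sketch, with the caveat that the base URLs and model IDs below are assumptions to verify against each provider's documentation, not confirmed values:

```python
# Provider choice reduced to one lookup; everything downstream of
# this config uses the same SDK and the same call shape.
PROVIDERS = {
    "venice": {
        "base_url": "https://api.venice.ai/api/v1",   # assumed endpoint
        "model": "qwen3-235b",                        # assumed model ID
    },
    "anthropic": {
        "base_url": "https://api.anthropic.com/v1/",  # assumed compat endpoint
        "model": "claude-haiku-4-5",                  # assumed model ID
    },
}

def client_config(provider: str) -> dict:
    """Return the kwargs that select a provider; swapping providers
    is a different base_url/model pair, nothing else changes."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "default_model": cfg["model"]}
```

With the `openai` Python SDK this would feed `OpenAI(base_url=cfg["base_url"], api_key=...)`; the rest of the code never mentions the provider.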

### Venice Limitations

- **No batch API** — can't get 50% off for Spectrum overnight processing
- **"Uncensored" default posture** — Venice optimizes for no guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
- **No equivalent to Anthropic's constitutional AI** — crisis detection safety net is entirely on us
- **Smaller infrastructure** — less battle-tested at scale than Anthropic/OpenAI
- **Rate limits not publicly documented** — could be a problem at scale

---

## Head-to-Head: Venice Models vs Claude Haiku 4.5

### Cost Per User Per Month

Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.

| Model (via) | Free User/mo | Prism User/mo | vs Claude Haiku |
|-------------|-------------|--------------|-----------------|
| **Claude Haiku 4.5** (Anthropic) | $0.31 | $0.63 | baseline |
| **DeepSeek V3.2** (Venice) | ~$0.07 | ~$0.15 | **78% cheaper** |
| **Qwen3 235B** (Venice) | ~$0.05 | ~$0.10 | **84% cheaper** |
| **Llama 3.3 70B** (Venice) | ~$0.16 | ~$0.33 | **48% cheaper** |
| **Gemma 3 27B** (Venice) | ~$0.02 | ~$0.04 | **94% cheaper** |

The cost difference is massive. At 200 DAU (Phase 2), monthly AI cost drops from ~$50 to ~$10-15.
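
The per-user figures are just price-per-token times monthly token volume. A back-of-envelope sketch: prices come from the tables above (the Claude Haiku rate is an assumed list price), while the tokens-per-call figures are hypothetical assumptions chosen to illustrate the calculation, not measured values.

```python
PRICES = {  # model: (input $/MTok, output $/MTok)
    "claude-haiku-4.5": (1.00, 5.00),  # assumed Anthropic list price
    "deepseek-v3.2": (0.40, 1.00),
    "qwen3-235b": (0.15, 0.75),
    "llama-3.3-70b": (0.70, 2.80),
    "gemma-3-27b": (0.12, 0.20),
}

# Free-user month: 3 Turns/day, 2 Mirror/week, daily Lens.
# Each entry: (calls/month, input tokens/call, output tokens/call).
# The token counts are hypothetical assumptions.
FREE_USER_MONTH = [
    (90, 800, 400),   # Turns: 3/day * 30 days
    (9, 1200, 300),   # Mirror: ~2/week
    (30, 500, 150),   # Lens: daily
]

def monthly_cost(model: str, usage=FREE_USER_MONTH) -> float:
    """Dollars per user per month under the usage model above."""
    p_in, p_out = PRICES[model]
    tok_in = sum(n * i for n, i, _ in usage)
    tok_out = sum(n * o for n, _, o in usage)
    return (tok_in * p_in + tok_out * p_out) / 1_000_000
```

Under these assumptions, `monthly_cost("claude-haiku-4.5")` lands around $0.31 and `monthly_cost("qwen3-235b")` around $0.05, matching the scale of the table.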

### Quality Comparison for Emotional Tasks

This is the critical question. Here's what the research and benchmarks tell us:

**Emotional Intelligence (EI) Benchmarks:**

- A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
- GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
- Claude models are specifically noted for "endless empathy" — excellent for therapeutic contexts but with dependency risk
- A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice

**Model-Specific Emotional Qualities:**

| Model | Empathy Quality | Tone Consistency | Creative Reframing | Safety/Guardrails |
|-------|----------------|-----------------|-------------------|-------------------|
| Claude Haiku 4.5 | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| DeepSeek V3.2 | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Qwen3 235B | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Llama 3.3 70B | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Gemma 3 27B | ★★☆☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ |

**Key findings:**

- DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing" — problematic for daily therapeutic interactions
- Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions" — actually quite good for our use case
- Llama 3.3 is solid but unremarkable for emotional tasks
- Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
- Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box

---

## The Honest Recommendation

### Option A: Venice-First (Lowest Cost)

**Primary: Qwen3 235B via Venice** for all features

- Monthly cost at 200 DAU: ~$10-15
- Pros: 84% cheaper, privacy-first, you already have the account
- Cons: No batch API (Spectrum costs more), no built-in safety net, requires extensive prompt engineering for emotional quality, crisis detection entirely self-built
- Risk: If reframe quality feels "off" or generic, the core product fails

### Option B: Claude-First (Current Plan)

**Primary: Claude Haiku 4.5 via Anthropic**

- Monthly cost at 200 DAU: ~$50
- Pros: Best-in-class empathy and safety, prompt caching, batch API (50% off Spectrum), constitutional AI for crisis detection
- Cons: 4-6x more expensive, Anthropic lock-in
- Risk: Higher burn rate, but product quality is higher

### Option C: Hybrid (Recommended) ★
|
||
|
|
**Split by task criticality:**
|
||
|
|
|
||
|
|
| Task | Model | Via | Why |
|
||
|
|
|------|-------|-----|-----|
|
||
|
|
| **Kaleidoscope reframes** | Qwen3 235B | Venice | Core product, needs quality BUT Qwen3 handles tone consistency well. Test extensively. |
|
||
|
|
| **Mirror fragments** | Qwen3 235B | Venice | Structured detection task, Qwen3 is precise enough |
|
||
|
|
| **Lens affirmations** | Venice Small (Qwen3 4B) | Venice | Simple generation, doesn't need a big model |
|
||
|
|
| **Crisis detection** | Application-layer keywords + Qwen3 235B | Venice + custom code | Keyword matching first, LLM confirmation second |
|
||
|
|
| **Spectrum batch** | DeepSeek V3.2 | Venice | Analytical task, DeepSeek excels at structured analysis |
|
||
|
|
|
||
|
|
**Estimated monthly cost at 200 DAU: ~$12-18** (vs $50 with Claude, vs $10 all-Qwen3)
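
The split above fits in a single routing table in code, which also makes later swaps cheap. A minimal sketch — the model IDs are placeholders to check against Venice's actual model list:

```python
# Task-to-model routing for the hybrid plan. Model IDs are assumed
# placeholders, not confirmed Venice identifiers.
ROUTES = {
    "kaleidoscope": ("qwen3-235b", "venice"),
    "mirror":       ("qwen3-235b", "venice"),
    "lens":         ("qwen3-4b", "venice"),       # Venice Small
    "crisis":       ("qwen3-235b", "venice"),     # LLM pass after keyword check
    "spectrum":     ("deepseek-v3.2", "venice"),
}

def route(task: str) -> tuple:
    """Resolve a Kalei task to (model, provider); unknown tasks fall
    back to the strongest general model instead of failing silently."""
    return ROUTES.get(task, ("qwen3-235b", "venice"))
```

If reframe quality disappoints, pointing `"kaleidoscope"` at Claude Haiku via Anthropic is a one-line change to `ROUTES`.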

### Why Hybrid via Venice Wins

1. **You already pay for Pro** — the $10 credit lets you prototype immediately
2. **OpenAI-compatible API** — if Venice quality disappoints, swapping to Anthropic/Groq/OpenRouter is a 1-line base URL change
3. **Privacy alignment** — Venice's no-data-retention policy is actually perfect for mental health data
4. **Cost headroom** — at $12-18/mo AI cost, you could serve 200 DAU and still be profitable with just 3-4 Prism subscribers
5. **Qwen3 235B is genuinely good** — it's not a compromise model, it scores competitively on emotional tasks

### The Critical Caveat: Safety Layer

Venice's "uncensored" philosophy means we MUST build our own safety layer:

```
User input → Keyword crisis detector (local, instant)
  → If flagged: hardcoded crisis response (no LLM needed)
  → If clear: send to Venice API with our safety-focused system prompt
  → Post-process: scan output for harmful patterns before showing to user
```

This adds development time but gives us MORE control than relying on any provider's built-in guardrails.
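
The pipeline above can be sketched in a few lines. Everything here is illustrative: the pattern list and the canned response are placeholders — a real keyword list needs clinical review and far broader coverage, and `call_llm` stands in for the Venice API call.

```python
import re

# Illustrative crisis patterns only — NOT a production list.
CRISIS_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\bkill myself\b", r"\bsuicid\w*\b", r"\bself[- ]harm\b"]
]

CRISIS_RESPONSE = "hardcoded crisis resources go here"  # placeholder text

def keyword_flagged(text: str) -> bool:
    """Local, instant first pass — no LLM call, no network round trip."""
    return any(p.search(text) for p in CRISIS_PATTERNS)

def handle_input(text: str, call_llm) -> str:
    """Route input through the safety layer before and after the model."""
    if keyword_flagged(text):
        return CRISIS_RESPONSE        # never routed to the model
    reply = call_llm(text)            # Venice call with our system prompt
    if keyword_flagged(reply):        # post-process the output too
        return CRISIS_RESPONSE
    return reply
```

The same regex pass runs on the model's output, so a harmful completion is caught even if the input looked benign.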

---

## Revised Cost Model with Venice

| Phase | DAU | AI Cost/mo | Total Infra/mo | Break-even Subscribers |
|-------|-----|-----------|----------------|----------------------|
| Phase 1 (0-500 users) | ~50 | ~$4 | ~$17 | **4 Prism @ $4.99** |
| Phase 2 (500-2K) | ~200 | ~$15 | ~$40 | **8 Prism** |
| Phase 3 (2K-10K) | ~1K | ~$60 | ~$110 | **22 Prism** |

Compare to Claude-first: Phase 1 was $26/mo, now $17. Phase 2 was $90-100, now $40. That's significant runway extension.

---

## Action Plan

1. **Immediately**: Use your Venice Pro $10 credit to test Qwen3 235B with Kalei's actual system prompts
2. **Build a test harness**: Send 50 real emotional writing samples through both Qwen3 (Venice) and Claude Haiku, blind-rate the outputs
3. **If Qwen3 passes**: Go Venice-first, save 60-80% on AI costs
4. **If Qwen3 disappoints on reframes specifically**: Use Claude Haiku for Kaleidoscope only, Venice for everything else
5. **Build the safety layer regardless** — don't rely on any provider's guardrails for a mental health app
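
The blind comparison in step 2 can be sketched as: generate both models' outputs for each sample, shuffle which side is which, and keep the answer key separate so raters never see model names. `generate` is a hypothetical stand-in for the actual API calls.

```python
import random

def build_blind_pairs(samples, generate, seed=0):
    """Return (pairs, key): pairs[i] = (side_a, side_b) with randomized
    sides; key[i] names the model behind side_a, kept away from raters."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    pairs, key = [], []
    for text in samples:
        qwen = generate("qwen3-235b", text)
        haiku = generate("claude-haiku-4.5", text)
        if rng.random() < 0.5:
            pairs.append((qwen, haiku))
            key.append("qwen3-235b")
        else:
            pairs.append((haiku, qwen))
            key.append("claude-haiku-4.5")
    return pairs, key
```

Raters score each pair's sides on empathy and tone; only afterward is `key` used to tally which model won.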

The API is OpenAI-compatible, so the switching cost is near zero. Start cheap, validate quality, upgrade only where needed.