# Kalei — AI Model Selection: Unbiased Analysis

## The Question

Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?

---

## What Kalei Actually Needs From Its AI

| Task | Quality Bar | Frequency | Latency Tolerance |
|------|------------|-----------|-------------------|
| **Mirror** — detect emotional fragments in freeform writing | High empathy + precision | 2-7x/week per user | 2-3s acceptable |
| **Kaleidoscope** — generate 3 perspective reframes | Highest — this IS the product | 3-10x/day per user | 2-3s acceptable |
| **Lens** — daily affirmation generation | Medium — structured output | 1x/day per user | 5s acceptable |
| **Crisis Detection** — flag self-harm/distress signals | Critical safety — zero false negatives | Every interaction | <1s preferred |
| **Spectrum** — weekly/monthly pattern analysis | High analytical depth | 1x/week batch | Minutes acceptable |

The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.

---

## Venice.ai API — What You Get

Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:

### Best Venice Models for Kalei

| Model | Input/MTok | Output/MTok | Cache Read | Context | Privacy | Notes |
|-------|-----------|------------|------------|---------|---------|-------|
| **DeepSeek V3.2** | $0.40 | $1.00 | $0.20 | 164K | Private | Strongest general model on Venice |
| **Qwen3 235B A22B** | $0.15 | $0.75 | — | 131K | Private | Best price-to-quality ratio |
| **Llama 3.3 70B** | $0.70 | $2.80 | — | 131K | Private | Meta's flagship open model |
| **Gemma 3 27B** | $0.12 | $0.20 | — | 203K | Private | Ultra-cheap, Google's open model |
| **Venice Small (Qwen3 4B)** | $0.05 | $0.15 | — | 33K | Private | Affirmation-tier only |

### Venice Advantages

- **Privacy-first architecture** — no data retention, critical for mental health
- **OpenAI-compatible API** — trivial to swap in/out, same SDK
- **Prompt caching** on select models (DeepSeek V3.2 confirmed)
- **You already pay for Pro** — $10 free API credit to test
- **No minimum commitment** — pure pay-per-use
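
Because the API is OpenAI-compatible, the provider decision can be isolated to a single config lookup. A minimal sketch, with the caveat that the base URLs and model IDs below are assumptions to verify against each provider's documentation, not confirmed values:

```python
# Provider choice reduced to one lookup; everything downstream of
# this config uses the same SDK and the same call shape.
PROVIDERS = {
    "venice": {
        "base_url": "https://api.venice.ai/api/v1",   # assumed endpoint
        "model": "qwen3-235b",                        # assumed model ID
    },
    "anthropic": {
        "base_url": "https://api.anthropic.com/v1/",  # assumed compat endpoint
        "model": "claude-haiku-4-5",                  # assumed model ID
    },
}

def client_config(provider: str) -> dict:
    """Return the kwargs that select a provider; swapping providers
    is a different base_url/model pair, nothing else changes."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "default_model": cfg["model"]}
```

With the `openai` Python SDK this would feed `OpenAI(base_url=cfg["base_url"], api_key=...)`; the rest of the code never mentions the provider.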

### Venice Limitations

- **No batch API** — can't get 50% off for Spectrum overnight processing
- **"Uncensored" default posture** — Venice optimizes for no guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
- **No equivalent to Anthropic's constitutional AI** — crisis detection safety net is entirely on us
- **Smaller infrastructure** — less battle-tested at scale than Anthropic/OpenAI
- **Rate limits not publicly documented** — could be a problem at scale

---

## Head-to-Head: Venice Models vs Claude Haiku 4.5

### Cost Per User Per Month

Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.

| Model (via) | Free User/mo | Prism User/mo | vs Claude Haiku |
|-------------|-------------|--------------|-----------------|
| **Claude Haiku 4.5** (Anthropic) | $0.31 | $0.63 | baseline |
| **DeepSeek V3.2** (Venice) | ~$0.07 | ~$0.15 | **78% cheaper** |
| **Qwen3 235B** (Venice) | ~$0.05 | ~$0.10 | **84% cheaper** |
| **Llama 3.3 70B** (Venice) | ~$0.16 | ~$0.33 | **48% cheaper** |
| **Gemma 3 27B** (Venice) | ~$0.02 | ~$0.04 | **94% cheaper** |

The cost difference is massive. At 200 DAU (Phase 2), monthly AI cost drops from ~$50 to ~$10-15.
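
The per-user figures are just price-per-token times monthly token volume. A back-of-envelope sketch: prices come from the tables above (the Claude Haiku rate is an assumed list price), while the tokens-per-call figures are hypothetical assumptions chosen to illustrate the calculation, not measured values.

```python
PRICES = {  # model: (input $/MTok, output $/MTok)
    "claude-haiku-4.5": (1.00, 5.00),  # assumed Anthropic list price
    "deepseek-v3.2": (0.40, 1.00),
    "qwen3-235b": (0.15, 0.75),
    "llama-3.3-70b": (0.70, 2.80),
    "gemma-3-27b": (0.12, 0.20),
}

# Free-user month: 3 Turns/day, 2 Mirror/week, daily Lens.
# Each entry: (calls/month, input tokens/call, output tokens/call).
# The token counts are hypothetical assumptions.
FREE_USER_MONTH = [
    (90, 800, 400),   # Turns: 3/day * 30 days
    (9, 1200, 300),   # Mirror: ~2/week
    (30, 500, 150),   # Lens: daily
]

def monthly_cost(model: str, usage=FREE_USER_MONTH) -> float:
    """Dollars per user per month under the usage model above."""
    p_in, p_out = PRICES[model]
    tok_in = sum(n * i for n, i, _ in usage)
    tok_out = sum(n * o for n, _, o in usage)
    return (tok_in * p_in + tok_out * p_out) / 1_000_000
```

Under these assumptions, `monthly_cost("claude-haiku-4.5")` lands around $0.31 and `monthly_cost("qwen3-235b")` around $0.05, matching the scale of the table.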

### Quality Comparison for Emotional Tasks

This is the critical question. Here's what the research and benchmarks tell us:

**Emotional Intelligence (EI) Benchmarks:**

- A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
- GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
- Claude models are specifically noted for "endless empathy" — excellent for therapeutic contexts but with dependency risk
- A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice

**Model-Specific Emotional Qualities:**

| Model | Empathy Quality | Tone Consistency | Creative Reframing | Safety/Guardrails |
|-------|----------------|-----------------|-------------------|-------------------|
| Claude Haiku 4.5 | ★★★★☆ | ★★★★★ | ★★★★☆ | ★★★★★ |
| DeepSeek V3.2 | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Qwen3 235B | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ |
| Llama 3.3 70B | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ | ★★★☆☆ |
| Gemma 3 27B | ★★☆☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★☆☆ |

**Key findings:**

- DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing" — problematic for daily therapeutic interactions
- Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions" — actually quite good for our use case
- Llama 3.3 is solid but unremarkable for emotional tasks
- Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
- Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box

---

## The Honest Recommendation

### Option A: Venice-First (Lowest Cost)

**Primary: Qwen3 235B via Venice** for all features

- Monthly cost at 200 DAU: ~$10-15
- Pros: 84% cheaper, privacy-first, you already have the account
- Cons: No batch API (Spectrum costs more), no built-in safety net, requires extensive prompt engineering for emotional quality, crisis detection entirely self-built
- Risk: If reframe quality feels "off" or generic, the core product fails

### Option B: Claude-First (Current Plan)

**Primary: Claude Haiku 4.5 via Anthropic**

- Monthly cost at 200 DAU: ~$50
- Pros: Best-in-class empathy and safety, prompt caching, batch API (50% off Spectrum), constitutional AI for crisis detection
- Cons: 4-6x more expensive, Anthropic lock-in
- Risk: Higher burn rate, but product quality is higher

### Option C: Hybrid (Recommended) ★
|
||
|
|
**Split by task criticality:**
|
||
|
|
|
||
|
|
| Task | Model | Via | Why |
|
||
|
|
|------|-------|-----|-----|
|
||
|
|
| **Kaleidoscope reframes** | Qwen3 235B | Venice | Core product, needs quality BUT Qwen3 handles tone consistency well. Test extensively. |
|
||
|
|
| **Mirror fragments** | Qwen3 235B | Venice | Structured detection task, Qwen3 is precise enough |
|
||
|
|
| **Lens affirmations** | Venice Small (Qwen3 4B) | Venice | Simple generation, doesn't need a big model |
|
||
|
|
| **Crisis detection** | Application-layer keywords + Qwen3 235B | Venice + custom code | Keyword matching first, LLM confirmation second |
|
||
|
|
| **Spectrum batch** | DeepSeek V3.2 | Venice | Analytical task, DeepSeek excels at structured analysis |
|
||
|
|
|
||
|
|
**Estimated monthly cost at 200 DAU: ~$12-18** (vs $50 with Claude, vs $10 all-Qwen3)
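
The split above fits in a single routing table in code, which also makes later swaps cheap. A minimal sketch — the model IDs are placeholders to check against Venice's actual model list:

```python
# Task-to-model routing for the hybrid plan. Model IDs are assumed
# placeholders, not confirmed Venice identifiers.
ROUTES = {
    "kaleidoscope": ("qwen3-235b", "venice"),
    "mirror":       ("qwen3-235b", "venice"),
    "lens":         ("qwen3-4b", "venice"),       # Venice Small
    "crisis":       ("qwen3-235b", "venice"),     # LLM pass after keyword check
    "spectrum":     ("deepseek-v3.2", "venice"),
}

def route(task: str) -> tuple:
    """Resolve a Kalei task to (model, provider); unknown tasks fall
    back to the strongest general model instead of failing silently."""
    return ROUTES.get(task, ("qwen3-235b", "venice"))
```

If reframe quality disappoints, pointing `"kaleidoscope"` at Claude Haiku via Anthropic is a one-line change to `ROUTES`.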

### Why Hybrid via Venice Wins

1. **You already pay for Pro** — the $10 credit lets you prototype immediately
2. **OpenAI-compatible API** — if Venice quality disappoints, swapping to Anthropic/Groq/OpenRouter is a 1-line base URL change
3. **Privacy alignment** — Venice's no-data-retention policy is actually perfect for mental health data
4. **Cost headroom** — at $12-18/mo AI cost, you could serve 200 DAU and still be profitable with just 3-4 Prism subscribers
5. **Qwen3 235B is genuinely good** — it's not a compromise model, it scores competitively on emotional tasks

### The Critical Caveat: Safety Layer

Venice's "uncensored" philosophy means we MUST build our own safety layer:

```
User input → Keyword crisis detector (local, instant)
  → If flagged: hardcoded crisis response (no LLM needed)
  → If clear: send to Venice API with our safety-focused system prompt
  → Post-process: scan output for harmful patterns before showing to user
```

This adds development time but gives us MORE control than relying on any provider's built-in guardrails.
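
The pipeline above can be sketched in a few lines. Everything here is illustrative: the pattern list and the canned response are placeholders — a real keyword list needs clinical review and far broader coverage, and `call_llm` stands in for the Venice API call.

```python
import re

# Illustrative crisis patterns only — NOT a production list.
CRISIS_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\bkill myself\b", r"\bsuicid\w*\b", r"\bself[- ]harm\b"]
]

CRISIS_RESPONSE = "hardcoded crisis resources go here"  # placeholder text

def keyword_flagged(text: str) -> bool:
    """Local, instant first pass — no LLM call, no network round trip."""
    return any(p.search(text) for p in CRISIS_PATTERNS)

def handle_input(text: str, call_llm) -> str:
    """Route input through the safety layer before and after the model."""
    if keyword_flagged(text):
        return CRISIS_RESPONSE        # never routed to the model
    reply = call_llm(text)            # Venice call with our system prompt
    if keyword_flagged(reply):        # post-process the output too
        return CRISIS_RESPONSE
    return reply
```

The same regex pass runs on the model's output, so a harmful completion is caught even if the input looked benign.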

---

## Revised Cost Model with Venice

| Phase | DAU | AI Cost/mo | Total Infra/mo | Break-even Subscribers |
|-------|-----|-----------|----------------|----------------------|
| Phase 1 (0-500 users) | ~50 | ~$4 | ~$17 | **4 Prism @ $4.99** |
| Phase 2 (500-2K) | ~200 | ~$15 | ~$40 | **8 Prism** |
| Phase 3 (2K-10K) | ~1K | ~$60 | ~$110 | **22 Prism** |

Compare to Claude-first: Phase 1 was $26/mo, now $17. Phase 2 was $90-100, now $40. That's significant runway extension.

---

## Action Plan

1. **Immediately**: Use your Venice Pro $10 credit to test Qwen3 235B with Kalei's actual system prompts
2. **Build a test harness**: Send 50 real emotional writing samples through both Qwen3 (Venice) and Claude Haiku, blind-rate the outputs
3. **If Qwen3 passes**: Go Venice-first, save 60-80% on AI costs
4. **If Qwen3 disappoints on reframes specifically**: Use Claude Haiku for Kaleidoscope only, Venice for everything else
5. **Build the safety layer regardless** — don't rely on any provider's guardrails for a mental health app
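
The blind comparison in step 2 can be sketched as: generate both models' outputs for each sample, shuffle which side is which, and keep the answer key separate so raters never see model names. `generate` is a hypothetical stand-in for the actual API calls.

```python
import random

def build_blind_pairs(samples, generate, seed=0):
    """Return (pairs, key): pairs[i] = (side_a, side_b) with randomized
    sides; key[i] names the model behind side_a, kept away from raters."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    pairs, key = [], []
    for text in samples:
        qwen = generate("qwen3-235b", text)
        haiku = generate("claude-haiku-4.5", text)
        if rng.random() < 0.5:
            pairs.append((qwen, haiku))
            key.append("qwen3-235b")
        else:
            pairs.append((haiku, qwen))
            key.append("claude-haiku-4.5")
    return pairs, key
```

Raters score each pair's sides on empathy and tone; only afterward is `key` used to tally which model won.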

The API is OpenAI-compatible, so the switching cost is near zero. Start cheap, validate quality, upgrade only where needed.