kalei/docs/kalei-ai-model-comparison.md

9.3 KiB

Kalei — AI Model Selection: Unbiased Analysis

The Question

Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?


What Kalei Actually Needs From Its AI

Task Quality Bar Frequency Latency Tolerance
Mirror — detect emotional fragments in freeform writing High empathy + precision 2-7x/week per user 2-3s acceptable
Kaleidoscope — generate 3 perspective reframes Highest — this IS the product 3-10x/day per user 2-3s acceptable
Lens — daily affirmation generation Medium — structured output 1x/day per user 5s acceptable
Crisis Detection — flag self-harm/distress signals Critical safety — zero false negatives Every interaction <1s preferred
Spectrum — weekly/monthly pattern analysis High analytical depth 1x/week batch Minutes acceptable

The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.


Venice.ai API — What You Get

Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:

Best Venice Models for Kalei

Model Input/MTok Output/MTok Cache Read Context Privacy Notes
DeepSeek V3.2 $0.40 $1.00 $0.20 164K Private Strongest general model on Venice
Qwen3 235B A22B $0.15 $0.75 131K Private Best price-to-quality ratio
Llama 3.3 70B $0.70 $2.80 131K Private Meta's flagship open model
Gemma 3 27B $0.12 $0.20 203K Private Ultra-cheap, Google's open model
Venice Small (Qwen3 4B) $0.05 $0.15 33K Private Affirmation-tier only

Venice Advantages

  • Privacy-first architecture — no data retention, critical for mental health
  • OpenAI-compatible API — trivial to swap in/out, same SDK
  • Prompt caching on select models (DeepSeek V3.2 confirmed)
  • You already pay for Pro — $10 free API credit to test
  • No minimum commitment — pure pay-per-use

Venice Limitations

  • No batch API — can't get 50% off for Spectrum overnight processing
  • "Uncensored" default posture — Venice optimizes for no guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
  • No equivalent to Anthropic's constitutional AI — crisis detection safety net is entirely on us
  • Smaller infrastructure — less battle-tested at scale than Anthropic/OpenAI
  • Rate limits not publicly documented — could be a problem at scale

Head-to-Head: Venice Models vs Claude Haiku 4.5

Cost Per User Per Month

Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.

Model (via) Free User/mo Prism User/mo vs Claude Haiku
Claude Haiku 4.5 (Anthropic) $0.31 $0.63 baseline
DeepSeek V3.2 (Venice) ~$0.07 ~$0.15 78% cheaper
Qwen3 235B (Venice) ~$0.05 ~$0.10 84% cheaper
Llama 3.3 70B (Venice) ~$0.16 ~$0.33 48% cheaper
Gemma 3 27B (Venice) ~$0.02 ~$0.04 94% cheaper

The cost difference is massive. At 200 DAU (Phase 2), monthly AI cost drops from ~$50 to ~$10-15.

Quality Comparison for Emotional Tasks

This is the critical question. Here's what the research and benchmarks tell us:

Emotional Intelligence (EI) Benchmarks:

  • A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
  • GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
  • Claude models are specifically noted for "endless empathy" — excellent for therapeutic contexts but with dependency risk
  • A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice

Model-Specific Emotional Qualities:

Model Empathy Quality Tone Consistency Creative Reframing Safety/Guardrails
Claude Haiku 4.5 ★★★★☆ ★★★★★ ★★★★☆ ★★★★★
DeepSeek V3.2 ★★★☆☆ ★★★★☆ ★★★☆☆ ★★☆☆☆
Qwen3 235B ★★★★☆ ★★★★☆ ★★★☆☆ ★★☆☆☆
Llama 3.3 70B ★★★☆☆ ★★★☆☆ ★★★☆☆ ★★★☆☆
Gemma 3 27B ★★☆☆☆ ★★★☆☆ ★★☆☆☆ ★★★☆☆

Key findings:

  • DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing" — problematic for daily therapeutic interactions
  • Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions" — actually quite good for our use case
  • Llama 3.3 is solid but unremarkable for emotional tasks
  • Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
  • Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box

The Honest Recommendation

Option A: Venice-First (Lowest Cost)

Primary: Qwen3 235B via Venice for all features

  • Monthly cost at 200 DAU: ~$10-15
  • Pros: 84% cheaper, privacy-first, you already have the account
  • Cons: No batch API (Spectrum costs more), no built-in safety net, requires extensive prompt engineering for emotional quality, crisis detection entirely self-built
  • Risk: If reframe quality feels "off" or generic, the core product fails

Option B: Claude-First (Current Plan)

Primary: Claude Haiku 4.5 via Anthropic

  • Monthly cost at 200 DAU: ~$50
  • Pros: Best-in-class empathy and safety, prompt caching, batch API (50% off Spectrum), constitutional AI for crisis detection
  • Cons: 4-6x more expensive, Anthropic lock-in
  • Risk: Higher burn rate, but product quality is higher

Split by task criticality:

Task Model Via Why
Kaleidoscope reframes Qwen3 235B Venice Core product, needs quality BUT Qwen3 handles tone consistency well. Test extensively.
Mirror fragments Qwen3 235B Venice Structured detection task, Qwen3 is precise enough
Lens affirmations Venice Small (Qwen3 4B) Venice Simple generation, doesn't need a big model
Crisis detection Application-layer keywords + Qwen3 235B Venice + custom code Keyword matching first, LLM confirmation second
Spectrum batch DeepSeek V3.2 Venice Analytical task, DeepSeek excels at structured analysis

Estimated monthly cost at 200 DAU: ~$12-18 (vs $50 with Claude, vs $10 all-Qwen3)

Why Hybrid via Venice Wins

  1. You already pay for Pro — the $10 credit lets you prototype immediately
  2. OpenAI-compatible API — if Venice quality disappoints, swapping to Anthropic/Groq/OpenRouter is a 1-line base URL change
  3. Privacy alignment — Venice's no-data-retention policy is actually perfect for mental health data
  4. Cost headroom — at $12-18/mo AI cost, you could serve 200 DAU and still be profitable with just 3-4 Prism subscribers
  5. Qwen3 235B is genuinely good — it's not a compromise model, it scores competitively on emotional tasks

The Critical Caveat: Safety Layer

Venice's "uncensored" philosophy means we MUST build our own safety layer:

User input → Keyword crisis detector (local, instant)
           → If flagged: hardcoded crisis response (no LLM needed)
           → If clear: send to Venice API with our safety-focused system prompt
           → Post-process: scan output for harmful patterns before showing to user

This adds development time but gives us MORE control than relying on any provider's built-in guardrails.


Revised Cost Model with Venice

Phase DAU AI Cost/mo Total Infra/mo Break-even Subscribers
Phase 1 (0-500 users) ~50 ~$4 ~$17 4 Prism @ $4.99
Phase 2 (500-2K) ~200 ~$15 ~$40 8 Prism
Phase 3 (2K-10K) ~1K ~$60 ~$110 22 Prism

Compare to Claude-first: Phase 1 was $26/mo, now $17. Phase 2 was $90-100, now $40. That's significant runway extension.


Action Plan

  1. Immediately: Use your Venice Pro $10 credit to test Qwen3 235B with Kalei's actual system prompts
  2. Build a test harness: Send 50 real emotional writing samples through both Qwen3 (Venice) and Claude Haiku, blind-rate the outputs
  3. If Qwen3 passes: Go Venice-first, save 60-80% on AI costs
  4. If Qwen3 disappoints on reframes specifically: Use Claude Haiku for Kaleidoscope only, Venice for everything else
  5. Build the safety layer regardless — don't rely on any provider's guardrails for a mental health app

The API is OpenAI-compatible, so the switching cost is near zero. Start cheap, validate quality, upgrade only where needed.