9.3 KiB

Raw Blame History

Kalei — AI Model Selection: Unbiased Analysis

The Question

Which AI model should power a mental wellness app that needs to detect emotional fragments, generate empathetic perspective reframes, produce personalized affirmations, detect crisis signals, and analyze behavioral patterns over time?

What Kalei Actually Needs From Its AI

Task	Quality Bar	Frequency	Latency Tolerance
Mirror — detect emotional fragments in freeform writing	High empathy + precision	2-7x/week per user	2-3s acceptable
Kaleidoscope — generate 3 perspective reframes	Highest — this IS the product	3-10x/day per user	2-3s acceptable
Lens — daily affirmation generation	Medium — structured output	1x/day per user	5s acceptable
Crisis Detection — flag self-harm/distress signals	Critical safety — zero false negatives	Every interaction	<1s preferred
Spectrum — weekly/monthly pattern analysis	High analytical depth	1x/week batch	Minutes acceptable

The Kaleidoscope reframes are the core product experience. If they feel generic, robotic, or tone-deaf, users churn. This is the task where model quality matters most.

Venice.ai API — What You Get

Since you already have Venice Pro ($10 one-time API credit), here are the relevant models and their pricing:

Best Venice Models for Kalei

Model	Input/MTok	Output/MTok	Cache Read	Context	Privacy	Notes
DeepSeek V3.2	$0.40	$1.00	$0.20	164K	Private	Strongest general model on Venice
Qwen3 235B A22B	$0.15	$0.75	—	131K	Private	Best price-to-quality ratio
Llama 3.3 70B	$0.70	$2.80	—	131K	Private	Meta's flagship open model
Gemma 3 27B	$0.12	$0.20	—	203K	Private	Ultra-cheap, Google's open model
Venice Small (Qwen3 4B)	$0.05	$0.15	—	33K	Private	Affirmation-tier only

Venice Advantages

Privacy-first architecture — no data retention, critical for mental health
OpenAI-compatible API — trivial to swap in/out, same SDK
Prompt caching on select models (DeepSeek V3.2 confirmed)
You already pay for Pro — $10 free API credit to test
No minimum commitment — pure pay-per-use

Venice Limitations

No batch API — can't get 50% off for Spectrum overnight processing
"Uncensored" default posture — Venice optimizes for no guardrails, which is the OPPOSITE of what a mental health app needs. We must disable Venice system prompts and provide our own safety layer
No equivalent to Anthropic's constitutional AI — crisis detection safety net is entirely on us
Smaller infrastructure — less battle-tested at scale than Anthropic/OpenAI
Rate limits not publicly documented — could be a problem at scale

Head-to-Head: Venice Models vs Claude Haiku 4.5

Cost Per User Per Month

Calculated using our established usage model: Free user = 3 Turns/day, 2 Mirror/week, daily Lens.

Model (via)	Free User/mo	Prism User/mo	vs Claude Haiku
Claude Haiku 4.5 (Anthropic)	$0.31	$0.63	baseline
DeepSeek V3.2 (Venice)	~$0.07	~$0.15	78% cheaper
Qwen3 235B (Venice)	~$0.05	~$0.10	84% cheaper
Llama 3.3 70B (Venice)	~$0.16	~$0.33	48% cheaper
Gemma 3 27B (Venice)	~$0.02	~$0.04	94% cheaper

The cost difference is massive. At 200 DAU (Phase 2), monthly AI cost drops from ~$50 to ~$10-15.

Quality Comparison for Emotional Tasks

This is the critical question. Here's what the research and benchmarks tell us:

Emotional Intelligence (EI) Benchmarks:

A 2025 Nature study tested LLMs on 5 standard EI tests. GPT-4, Claude 3.5 Haiku, and DeepSeek V3 all outperformed humans (81% avg vs 56% human avg)
GPT-4 scored highest with a Z-score of 4.26 on the LEAS emotional awareness scale
Claude models are specifically noted for "endless empathy" — excellent for therapeutic contexts but with dependency risk
A blinded study found AI-generated psychological advice was rated MORE empathetic than human expert advice

Model-Specific Emotional Qualities:

Model	Empathy Quality	Tone Consistency	Creative Reframing	Safety/Guardrails
Claude Haiku 4.5	★★★★☆	★★★★★	★★★★☆	★★★★★
DeepSeek V3.2	★★★☆☆	★★★★☆	★★★☆☆	★★☆☆☆
Qwen3 235B	★★★★☆	★★★★☆	★★★☆☆	★★☆☆☆
Llama 3.3 70B	★★★☆☆	★★★☆☆	★★★☆☆	★★★☆☆
Gemma 3 27B	★★☆☆☆	★★★☆☆	★★☆☆☆	★★★☆☆

Key findings:

DeepSeek V3.2 is described as "slightly more mechanical in tone" with "repetition in phrasing" — problematic for daily therapeutic interactions
Qwen3 is praised for "coherent extended conversations" and "tone consistency over long interactions" — actually quite good for our use case
Llama 3.3 is solid but unremarkable for emotional tasks
Gemma 3 27B is too small for the nuance we need in Mirror and Kaleidoscope
Claude's constitutional AI training makes crisis detection significantly more reliable out-of-the-box

The Honest Recommendation

Option A: Venice-First (Lowest Cost)

Primary: Qwen3 235B via Venice for all features

Monthly cost at 200 DAU: ~$10-15
Pros: 84% cheaper, privacy-first, you already have the account
Cons: No batch API (Spectrum costs more), no built-in safety net, requires extensive prompt engineering for emotional quality, crisis detection entirely self-built
Risk: If reframe quality feels "off" or generic, the core product fails

Option B: Claude-First (Current Plan)

Primary: Claude Haiku 4.5 via Anthropic

Monthly cost at 200 DAU: ~$50
Pros: Best-in-class empathy and safety, prompt caching, batch API (50% off Spectrum), constitutional AI for crisis detection
Cons: 4-6x more expensive, Anthropic lock-in
Risk: Higher burn rate, but product quality is higher

Option C: Hybrid (Recommended) ★

Split by task criticality:

Task	Model	Via	Why
Kaleidoscope reframes	Qwen3 235B	Venice	Core product, needs quality BUT Qwen3 handles tone consistency well. Test extensively.
Mirror fragments	Qwen3 235B	Venice	Structured detection task, Qwen3 is precise enough
Lens affirmations	Venice Small (Qwen3 4B)	Venice	Simple generation, doesn't need a big model
Crisis detection	Application-layer keywords + Qwen3 235B	Venice + custom code	Keyword matching first, LLM confirmation second
Spectrum batch	DeepSeek V3.2	Venice	Analytical task, DeepSeek excels at structured analysis

Estimated monthly cost at 200 DAU: ~$12-18 (vs $50 with Claude, vs $10 all-Qwen3)

Why Hybrid via Venice Wins

You already pay for Pro — the $10 credit lets you prototype immediately
OpenAI-compatible API — if Venice quality disappoints, swapping to Anthropic/Groq/OpenRouter is a 1-line base URL change
Privacy alignment — Venice's no-data-retention policy is actually perfect for mental health data
Cost headroom — at $12-18/mo AI cost, you could serve 200 DAU and still be profitable with just 3-4 Prism subscribers
Qwen3 235B is genuinely good — it's not a compromise model, it scores competitively on emotional tasks

The Critical Caveat: Safety Layer

Venice's "uncensored" philosophy means we MUST build our own safety layer:

User input → Keyword crisis detector (local, instant)
           → If flagged: hardcoded crisis response (no LLM needed)
           → If clear: send to Venice API with our safety-focused system prompt
           → Post-process: scan output for harmful patterns before showing to user

This adds development time but gives us MORE control than relying on any provider's built-in guardrails.

Revised Cost Model with Venice

Phase	DAU	AI Cost/mo	Total Infra/mo	Break-even Subscribers
Phase 1 (0-500 users)	~50	~$4	~$17	4 Prism @ $4.99
Phase 2 (500-2K)	~200	~$15	~$40	8 Prism
Phase 3 (2K-10K)	~1K	~$60	~$110	22 Prism

Compare to Claude-first: Phase 1 was $26/mo, now $17. Phase 2 was $90-100, now $40. That's significant runway extension.

Action Plan

Immediately: Use your Venice Pro $10 credit to test Qwen3 235B with Kalei's actual system prompts
Build a test harness: Send 50 real emotional writing samples through both Qwen3 (Venice) and Claude Haiku, blind-rate the outputs
If Qwen3 passes: Go Venice-first, save 60-80% on AI costs
If Qwen3 disappoints on reframes specifically: Use Claude Haiku for Kaleidoscope only, Venice for everything else
Build the safety layer regardless — don't rely on any provider's guardrails for a mental health app

The API is OpenAI-compatible, so the switching cost is near zero. Start cheap, validate quality, upgrade only where needed.

9.3 KiB Raw Blame History