# AI Error Handling Guide
## Error Types
The AI system classifies errors into these categories:

| Error Type | Cause | User Message | Retryable |
|------------|-------|--------------|-----------|
| `rate_limit` | Too many requests | "Rate limit exceeded. Wait a few minutes." | Yes |
| `quota_exceeded` | Billing limit | "API quota exceeded. Check billing." | No |
| `model_not_found` | Invalid model | "Model not available. Check settings." | No |
| `invalid_api_key` | Bad API key | "Invalid API key. Check settings." | No |
| `context_length` | Prompt too large | "Request too large. Try fewer items." | Yes* |
| `parse_error` | AI returned invalid JSON | "Response parse error. Flagged for review." | Yes |
| `timeout` | Request took too long | "Request timed out. Try again." | Yes |
| `network_error` | Connection issue | "Network error. Check connection." | Yes |
| `content_filter` | Content blocked | "Content filtered. Check input data." | No |
| `server_error` | OpenAI server issue | "Server error. Try again later." | Yes |
*Context length errors can be retried with smaller batches.
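This classification maps onto a small set of shared types. A minimal sketch of what those shapes might look like (the `AIErrorType` and `ClassifiedAIError` names are assumptions for illustration; the actual definitions live in `@/server/services/ai-errors`):
```typescript
// Hypothetical shapes mirroring the table above; the real exports in
// @/server/services/ai-errors may differ.
type AIErrorType =
  | 'rate_limit'
  | 'quota_exceeded'
  | 'model_not_found'
  | 'invalid_api_key'
  | 'context_length'
  | 'parse_error'
  | 'timeout'
  | 'network_error'
  | 'content_filter'
  | 'server_error'

interface ClassifiedAIError {
  type: AIErrorType
  message: string    // user-facing message from the table
  retryable: boolean // whether the caller may retry
}
```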
## Error Classification
```typescript
import { classifyAIError, shouldRetry, getRetryDelay } from '@/server/services/ai-errors'

try {
  const response = await openai.chat.completions.create(params)
} catch (error) {
  const classified = classifyAIError(error)
  console.error(`AI Error: ${classified.type} - ${classified.message}`)
  if (shouldRetry(classified.type)) {
    const delay = getRetryDelay(classified.type)
    // Wait `delay` ms, then retry
  } else {
    // Fall back to algorithm
  }
}
```
## Graceful Degradation
When an AI call fails, the platform degrades gracefully instead of blocking the workflow:
### AI Assignment
1. Logs the error
2. Falls back to algorithmic assignment (see the sketch after this list):
- Matches by expertise tag overlap
- Balances workload across jurors
- Respects constraints (max assignments)
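As a rough illustration of that fallback, here is a minimal sketch of expertise-overlap scoring with a workload tiebreaker; the `Juror` and `Project` shapes are assumptions, not the platform's actual types:
```typescript
// Illustrative fallback scoring; field names are assumptions.
interface Juror {
  id: string
  expertiseTags: string[]
  assignedCount: number
  maxAssignments: number
}

interface Project {
  id: string
  tags: string[]
}

function pickJuror(project: Project, jurors: Juror[]): Juror | undefined {
  return jurors
    // Respect constraints: skip jurors already at capacity.
    .filter(j => j.assignedCount < j.maxAssignments)
    // Score by expertise tag overlap; break ties by lightest workload.
    .sort((a, b) => {
      const overlap = (j: Juror) =>
        j.expertiseTags.filter(t => project.tags.includes(t)).length
      return overlap(b) - overlap(a) || a.assignedCount - b.assignedCount
    })[0]
}
```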
### AI Filtering
1. Logs the error
2. Flags all projects for manual review (see the sketch below)
3. Returns error message to admin
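A minimal sketch of that behavior, reusing the filtering response shape documented under Error Response Format below (the function name is illustrative):
```typescript
// Illustrative fallback: when the AI call fails, every project is
// flagged for manual review instead of being silently filtered.
function filterFallback(projectIds: string[], message: string) {
  return projectIds.map(projectId => ({
    projectId,
    meetsCriteria: false,
    confidence: 0,
    reasoning: `AI error: ${message}`,
    flagForReview: true,
  }))
}
```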
### Award Eligibility
1. Logs the error
2. Returns all projects as "needs manual review"
3. Admin can apply deterministic rules instead (example below)
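As an example of what deterministic rules could look like (the rule set and fields here are hypothetical, not the platform's actual criteria):
```typescript
// Hypothetical deterministic rules an admin might apply when the AI
// check is unavailable; every predicate must pass.
interface EligibilityInput {
  teamSize: number
  submittedAt: Date
}

type EligibilityRule = (p: EligibilityInput) => boolean

const rules: EligibilityRule[] = [
  p => p.teamSize <= 4,                                       // example: team size cap
  p => p.submittedAt <= new Date('2025-06-01T00:00:00Z'),     // example: deadline
]

const isEligible = (p: EligibilityInput) => rules.every(rule => rule(p))
```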
### Mentor Matching
1. Logs the error
2. Falls back to keyword-based matching
3. Uses availability scoring (see the sketch below)
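A rough sketch of that scoring, assuming illustrative `Mentor` fields:
```typescript
// Keyword fallback sketch: overlap between project keywords and a
// mentor's skills, weighted by open availability slots. Field names
// are assumptions for illustration.
interface Mentor {
  id: string
  skills: string[]
  availableSlots: number
}

function mentorScore(projectKeywords: string[], mentor: Mentor): number {
  const overlap = mentor.skills.filter(s =>
    projectKeywords.includes(s.toLowerCase())
  ).length
  // Weight keyword overlap by availability so busy mentors rank lower.
  return overlap * Math.max(mentor.availableSlots, 0)
}
```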
## Retry Strategy
| Error Type | Retry Count | Delay |
|------------|-------------|-------|
| `rate_limit` | 3 | Exponential (1s, 2s, 4s) |
| `timeout` | 2 | Fixed 5s |
| `network_error` | 3 | Exponential (1s, 2s, 4s) |
| `server_error` | 3 | Exponential (2s, 4s, 8s) |
| `parse_error` | 1 | None |
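A minimal sketch of how this table could be encoded; the real `getRetryDelay` in `@/server/services/ai-errors` may differ, and `RETRY_POLICY` is an assumed name:
```typescript
// Base delays per the table above; exponential strategies double the
// base on each attempt (e.g. 1s, 2s, 4s), fixed strategies do not.
const RETRY_POLICY: Record<string, { retries: number; baseMs: number; exponential: boolean }> = {
  rate_limit:    { retries: 3, baseMs: 1000, exponential: true },
  timeout:       { retries: 2, baseMs: 5000, exponential: false },
  network_error: { retries: 3, baseMs: 1000, exponential: true },
  server_error:  { retries: 3, baseMs: 2000, exponential: true },
  parse_error:   { retries: 1, baseMs: 0,    exponential: false },
}

function retryDelayMs(type: string, attempt: number): number {
  const policy = RETRY_POLICY[type]
  if (!policy) return 0
  return policy.exponential ? policy.baseMs * 2 ** (attempt - 1) : policy.baseMs
}
```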
## Monitoring
### Error Logging
All AI errors are logged to:
1. Console (development)
2. `AIUsageLog` table with `status: 'ERROR'`
3. `AuditLog` for security-relevant failures
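Based on the column names used in the queries below, a log entry is shaped roughly like this (an assumption inferred from the SQL, not the actual schema):
```typescript
// Hypothetical entry shape mirroring the ai_usage_log columns queried
// below (created_at, action, model, status, error_message).
interface AIUsageLogEntry {
  action: string
  model: string
  status: 'OK' | 'ERROR'
  errorMessage?: string
  createdAt: Date
}

function buildErrorLogEntry(action: string, model: string, message: string): AIUsageLogEntry {
  return { action, model, status: 'ERROR', errorMessage: message, createdAt: new Date() }
}
```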
### Checking Errors
```sql
-- Recent AI errors
SELECT
  created_at,
  action,
  model,
  error_message
FROM ai_usage_log
WHERE status = 'ERROR'
ORDER BY created_at DESC
LIMIT 20;

-- Error rate by action
SELECT
  action,
  COUNT(*) FILTER (WHERE status = 'ERROR') AS errors,
  COUNT(*) AS total,
  ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'ERROR') / COUNT(*), 2) AS error_rate
FROM ai_usage_log
GROUP BY action;
```
## Troubleshooting
### High Error Rate
1. Check OpenAI status page for outages
2. Verify API key is valid and not rate-limited
3. Review error messages in logs
4. Consider switching to a different model
### Consistent Parse Errors
1. The AI model may be returning malformed JSON
2. Try a more capable model (e.g. gpt-4o instead of gpt-3.5-turbo)
3. Check if prompts are being truncated
4. Review recent responses in logs
### All Requests Failing
1. Test connection in Settings → AI
2. Verify API key hasn't been revoked
3. Check billing status in OpenAI dashboard
4. Review network connectivity
### Slow Responses
1. Consider using gpt-4o-mini for speed
2. Reduce batch sizes
3. Check for rate limiting (429 errors)
4. Monitor OpenAI latency
## Error Response Format
When errors occur, services return structured responses:
```typescript
// AI Assignment error response
{
  success: false,
  suggestions: [],
  error: "Rate limit exceeded. Wait a few minutes and try again.",
  fallbackUsed: true,
}

// AI Filtering error response
{
  projectId: "...",
  meetsCriteria: false,
  confidence: 0,
  reasoning: "AI error: Rate limit exceeded",
  flagForReview: true,
}
```
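Expressed as types, those responses might look like the following (hypothetical interface names; the services may export their own):
```typescript
// Hypothetical interfaces matching the literals above.
interface AssignmentErrorResponse {
  success: false
  suggestions: unknown[] // empty when the call fails
  error: string
  fallbackUsed: boolean
}

interface FilterErrorResult {
  projectId: string
  meetsCriteria: boolean
  confidence: number
  reasoning: string
  flagForReview: boolean
}
```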
## Implementing Custom Error Handling
```typescript
import {
  classifyAIError,
  shouldRetry,
  getRetryDelay,
  getUserFriendlyMessage,
  logAIError,
} from '@/server/services/ai-errors'

async function callAIWithRetry<T>(
  operation: () => Promise<T>,
  serviceName: string,
  maxRetries: number = 3
): Promise<T> {
  let lastError: Error | null = null
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = error as Error
      const classified = classifyAIError(error)
      logAIError(serviceName, 'operation', classified)
      // Give up with a user-friendly message when the error is not
      // retryable or retries are exhausted.
      if (!shouldRetry(classified.type) || attempt === maxRetries) {
        throw new Error(getUserFriendlyMessage(classified.type))
      }
      // Exponential backoff: double the base delay on each attempt,
      // matching the retry strategy table above.
      const delay = getRetryDelay(classified.type) * 2 ** (attempt - 1)
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
  throw lastError ?? new Error('AI call failed')
}
```
## See Also
- [AI System Architecture](./ai-system.md)
- [AI Configuration Guide](./ai-configuration.md)
- [AI Services Reference](./ai-services.md)