curl -X POST https://app.mixus.ai/api/eval/create-task-agent \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taskName": "Calculate Sales Commission",
    "taskDescription": "Calculate 15% commission on a $50,000 sale",
    "checkpoints": [{
      "stage": "calculation",
      "description": "Perform calculation",
      "verificationQuestion": "Is the result $7,500?"
    }],
    "testMode": "with-verification",
    "assignedReviewer": "you@example.com"
  }'
```
### Step 3: track progress
```bash
# Use executionId from previous response
curl https://app.mixus.ai/api/eval/status/EXECUTION_ID \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY"
```
---
## Checkpoint modes
### Manual checkpoints
You define exactly where verification happens:
```json
{
  "checkpoints": [
    {
      "stage": "research",
      "description": "Research competitors",
      "verificationQuestion": "Is the research comprehensive?"
    },
    {
      "stage": "analysis",
      "description": "Analyze pricing",
      "verificationQuestion": "Is the analysis accurate?"
    }
  ],
  "testMode": "with-verification"
}
```
### AI-detected checkpoints
Let AI decide where verification is needed:
```json
{
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "taskDescription": "Research top 3 AI companies, analyze pricing, calculate our advantage, and email summary to team@example.com"
}
```
The AI will automatically add verification points before:
- Sending emails/messages
- Making purchases or transactions
- Deleting or modifying data
- Other high-impact actions
### No verification (baseline)
Run without human verification for speed comparisons:
```json
{
  "testMode": "without-verification",
  "taskDescription": "Simple calculation task"
}
```
---
## Complete API reference
### Create eval task
**Endpoint:** `POST /api/eval/create-task-agent`
**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
Content-Type: application/json
```
**Request Parameters:**
| Field | Type | Required | Description |
|---|---|---|---|
| `taskName` | string | Yes | Name of the evaluation task |
| `taskDescription` | string | Yes | Detailed description of what to do |
| `checkpoints` | array | No* | Manual verification points |
| `autoDetectCheckpoints` | boolean | No* | Let AI detect checkpoints |
| `testMode` | string | Yes | `with-verification` or `without-verification` |
| `assignedReviewer` | string | Yes | Email of person who will verify |
| `webhookUrl` | string | No | URL to receive completion webhooks |
| `externalId` | string | No | Your tracking ID |
_*Either `checkpoints` OR `autoDetectCheckpoints` (not both)_
**Checkpoint Object:**
```typescript
{
  stage: string                 // Identifier for this checkpoint
  description: string           // What happens at this step
  verificationQuestion: string  // Question for reviewer
}
```
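The parameter table and checkpoint object above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the mixus API: the function name `validate_eval_request` is hypothetical, and it assumes that all three checkpoint fields are mandatory, which the reference does not state explicitly.

```python
# Hypothetical client-side check; field names come from the tables above.
REQUIRED_FIELDS = ("taskName", "taskDescription", "testMode", "assignedReviewer")
TEST_MODES = ("with-verification", "without-verification")
# Assumption: every checkpoint needs all three fields.
CHECKPOINT_FIELDS = ("stage", "description", "verificationQuestion")


def validate_eval_request(payload):
    """Return a list of problems found in a create-task payload (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not payload.get(field):
            problems.append(f"missing required field: {field}")

    mode = payload.get("testMode")
    if mode and mode not in TEST_MODES:
        problems.append(f"unknown testMode: {mode}")

    # checkpoints and autoDetectCheckpoints are mutually exclusive
    has_manual = bool(payload.get("checkpoints"))
    has_auto = bool(payload.get("autoDetectCheckpoints"))
    if has_manual and has_auto:
        problems.append("use checkpoints or autoDetectCheckpoints, not both")

    for index, checkpoint in enumerate(payload.get("checkpoints") or []):
        for field in CHECKPOINT_FIELDS:
            if not checkpoint.get(field):
                problems.append(f"checkpoints[{index}] is missing {field}")
    return problems
```

Running this before `POST /api/eval/create-task-agent` may catch the most common causes of a 400 response without spending a request.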
**Response:**
```json
{
  "success": true,
  "executionId": "string",
  "chatId": "string",
  "chatUrl": "string",
  "taskName": "string",
  "testMode": "string",
  "checkpointCount": number,
  "checkpointDetectionMethod": "manual" | "ai-detected" | "none",
  "message": "string"
}
```
### Get execution status
**Endpoint:** `GET /api/eval/status/{executionId}`
**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
```
**Response:**
```json
{
  "success": true,
  "execution": {
    "id": "string",
    "status": "running" | "completed" | "failed" | "waiting_verification",
    "testMode": "string",
    "progress": {
      "totalSteps": number,
      "completedSteps": number,
      "currentStepIndex": number,
      "percentComplete": number
    },
    "checkpoints": {
      "total": number,
      "completed": number,
      "details": [
        {
          "checkpointNumber": number,
          "stepId": "string",
          "description": "string",
          "status": "string",
          "wasVerified": boolean,
          "evalMetadata": {
            "tapNumber": number,
            "verificationDuration": number,
            "verificationResponse": {
              "action": "approved" | "rejected",
              "hint": "string",
              "verifiedAt": "ISO date"
            }
          }
        }
      ]
    },
    "metrics": {
      "tapsUsed": number,
      "expectedTaps": number,
      "durationSeconds": number,
      "model": "string"
    },
    "startedAt": "ISO date",
    "completedAt": "ISO date" | null,
    "isComplete": boolean,
    "isFailed": boolean,
    "isRunning": boolean,
    "error": "string" | null,
    "chatUrl": "string",
    "chatId": "string"
  }
}
```
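The status response is fairly deep, so a small helper that condenses it can make logs easier to read. This is a sketch only: `summarize_execution` is a hypothetical name, and it assumes that a checkpoint with a falsy `wasVerified` is still awaiting review, which the schema above does not spell out.

```python
def summarize_execution(execution):
    """Condense a status `execution` object into a one-line summary string."""
    progress = execution.get("progress", {})
    checkpoints = execution.get("checkpoints", {})

    # Assumption: wasVerified == False means the checkpoint is still pending.
    pending = [
        detail["checkpointNumber"]
        for detail in checkpoints.get("details", [])
        if not detail.get("wasVerified")
    ]
    return (
        f"{execution['status']}: "
        f"{progress.get('completedSteps', 0)}/{progress.get('totalSteps', 0)} steps, "
        f"{checkpoints.get('completed', 0)}/{checkpoints.get('total', 0)} checkpoints, "
        f"unverified checkpoints: {pending}"
    )
```

Pass it the `execution` field of the parsed response, for example `summarize_execution(response.json()["execution"])`.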
---
## Common use cases
### 1. Run a benchmark suite
```python
import requests
import time

API_KEY = "mxs_eval_YOUR_KEY"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Create multiple eval tasks
tasks = [
    "Calculate 15% of $50,000",
    "Research top 3 AI companies and summarize",
    "Draft email to team about Q4 goals"
]

executions = []
for task in tasks:
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers=HEADERS,
        json={
            "taskName": f"Benchmark: {task[:30]}",
            "taskDescription": task,
            "autoDetectCheckpoints": True,
            "testMode": "with-verification",
            "assignedReviewer": "you@example.com"
        }
    )
    response.raise_for_status()  # Stop early on a 4xx/5xx response
    executions.append(response.json())

# Track progress
for execution in executions:
    exec_id = execution["executionId"]
    while True:
        status = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers=HEADERS
        ).json()
        if status["execution"]["isComplete"]:
            print(f"✅ {execution['taskName']}: Done!")
            break
        elif status["execution"]["isFailed"]:
            print(f"❌ {execution['taskName']}: Failed")
            break
        time.sleep(10)  # Poll every 10 seconds
```
### 2. Continuous integration testing
```javascript
// ci-eval-test.js
const axios = require('axios');

const API_KEY = process.env.MIXUS_API_KEY;
const headers = {
  'Authorization': `Bearer ${API_KEY}`,
  'Content-Type': 'application/json'
};

async function runEvalTest(taskName, description) {
  // Create eval
  const { data } = await axios.post(
    'https://app.mixus.ai/api/eval/create-task-agent',
    {
      taskName,
      taskDescription: description,
      autoDetectCheckpoints: true,
      testMode: 'without-verification', // No human verification for CI
      assignedReviewer: 'ci@example.com'
    },
    { headers }
  );

  // Poll for completion
  while (true) {
    const status = await axios.get(
      `https://app.mixus.ai/api/eval/status/${data.executionId}`,
      { headers }
    );
    const execution = status.data.execution;
    if (execution.isComplete) {
      return execution.metrics;
    }
    if (execution.isFailed) {
      // Stop polling on failure instead of looping forever
      throw new Error(`Eval failed: ${execution.error}`);
    }
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}

// Run in CI pipeline
(async () => {
  try {
    const metrics = await runEvalTest(
      "CI Test: Data Processing",
      "Process customer data and generate report"
    );
    console.log('Duration:', metrics.durationSeconds, 'seconds');
    process.exit(0);
  } catch (err) {
    console.error(err.message);
    process.exit(1); // Non-zero exit code fails the CI job
  }
})();
```
### 3. Webhook notifications
Get notified when an eval completes:
```json
{
  "taskName": "Long Running Task",
  "taskDescription": "Complex multi-step evaluation",
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "assignedReviewer": "you@example.com",
  "webhookUrl": "https://your-app.com/webhooks/eval-complete",
  "externalId": "your-tracking-id-123"
}
```
**Webhook Payload:**
```json
{
  "event": "checkpoints_completed",
  "executionId": "string",
  "externalId": "your-tracking-id-123",
  "testMode": "with-verification",
  "finalStatus": "completed",
  "metrics": {
    "tapsUsed": 2,
    "durationSeconds": 450
  }
}
```
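On the receiving side, the payload above can be parsed into a typed record before it is handed to the rest of your system. The sketch below is illustrative: `EvalWebhookEvent` and `parse_eval_webhook` are hypothetical names, and it assumes that `externalId` and `metrics` may be absent, since the reference only shows one example payload.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalWebhookEvent:
    event: str
    execution_id: str
    external_id: Optional[str]
    final_status: str
    taps_used: Optional[int]
    duration_seconds: Optional[float]


def parse_eval_webhook(body):
    """Parse a raw webhook request body (str or bytes) into an EvalWebhookEvent."""
    data = json.loads(body)
    # Assumption: metrics and externalId are optional in the payload.
    metrics = data.get("metrics") or {}
    return EvalWebhookEvent(
        event=data["event"],
        execution_id=data["executionId"],
        external_id=data.get("externalId"),
        final_status=data["finalStatus"],
        taps_used=metrics.get("tapsUsed"),
        duration_seconds=metrics.get("durationSeconds"),
    )
```

The `external_id` field is what lets you match the notification to the record you created when submitting the task.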
---
## Error handling
The eval API uses standard HTTP status codes. For complete error documentation, see [Error codes reference](./eval-api-error-codes).
### Quick Reference
| Code | Meaning | Common Cause | Solution |
|------|---------|--------------|----------|
| 200 | Success | Request valid | Proceed with executionId |
| 400 | Bad Request | Missing/invalid fields | Check required parameters |
| 401 | Unauthorized | Invalid API key | Regenerate key |
| 404 | Not Found | Reviewer doesn't exist | Invite user to mixus |
| 500 | Server Error | Processing failed | Retry or contact support |
### Example error responses
**400 - Missing Fields:**
```json
{
  "error": "Missing required fields",
  "required": ["taskName", "taskDescription", "testMode", "assignedReviewer"],
  "hint": "Ensure all required fields are provided"
}
```
**401 - Unauthorized:**
```json
{
  "error": "Unauthorized"
}
```
**404 - Reviewer Not Found:**
```json
{
  "error": "Reviewer not found",
  "assignedReviewer": "user@example.com",
  "hint": "Provide email or username of mixus user",
  "suggestion": "Try one of these valid reviewers: teammate@company.com"
}
```
### Retry logic
```javascript
const axios = require('axios');

async function apiCallWithRetry(url, options, maxRetries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await axios(url, options);
    } catch (error) {
      const status = error.response?.status;
      // Retry only when rate limited (429) or on a server error (5xx)
      if (status !== 429 && !(status >= 500)) {
        // Client error - don't retry
        throw error;
      }
      lastError = error;
      if (attempt < maxRetries) {
        // Linear backoff: wait 1s, 2s, 3s, ...
        await new Promise(r => setTimeout(r, attempt * 1000));
      }
    }
  }
  // All retries exhausted: surface the last error instead of returning undefined
  throw lastError;
}
```
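For Python clients, a rough counterpart to the JavaScript helper above might look like the following. It is a sketch under stated assumptions: `call_with_retry` and `ApiError` are hypothetical names, and the request is passed in as a `send` callable returning `(status_code, body)` so the retry policy stays independent of any HTTP library. The backoff is linear to mirror the JavaScript version.

```python
import time


class ApiError(Exception):
    """Raised when a request fails with a status that should not be retried."""

    def __init__(self, status):
        super().__init__(f"request failed with status {status}")
        self.status = status


def call_with_retry(send, max_retries=3, sleep=time.sleep):
    """Call send() until it succeeds, retrying on 429 and 5xx responses.

    send is a zero-argument callable returning (status_code, body).
    """
    status = None
    for attempt in range(1, max_retries + 1):
        status, body = send()
        if status < 400:
            return status, body
        if status != 429 and status < 500:
            # Client error: retrying will not help
            raise ApiError(status)
        if attempt < max_retries:
            sleep(attempt)  # Linear backoff: 1s, 2s, 3s, ...
    # All retries exhausted
    raise ApiError(status)
```

With `requests`, `send` could be a small lambda that performs the call and returns `(response.status_code, response.json())`.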
---
## Advanced features
### Polling best practices
```python
import requests
import time


def wait_for_completion(execution_id, api_key, timeout=3600):
    """
    Poll execution status until completion or timeout.

    Args:
        execution_id: Execution ID from create response
        api_key: Your mixus API key
        timeout: Max wait time in seconds (default: 1 hour)

    Returns:
        Final execution status
    """
    start_time = time.time()
    poll_interval = 10  # Start with 10 seconds

    while True:
        # Check timeout
        if time.time() - start_time > timeout:
            raise TimeoutError("Execution did not complete in time")

        # Get status
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{execution_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()  # Surface 4xx/5xx instead of a KeyError below
        execution = response.json()['execution']

        # Check if done
        if execution['isComplete'] or execution['isFailed']:
            return execution

        # Adaptive polling (slow down over time)
        if time.time() - start_time > 300:  # After 5 minutes
            poll_interval = 30  # Poll every 30 seconds
        time.sleep(poll_interval)
```
### Batch processing
```python
from concurrent.futures import ThreadPoolExecutor
import requests

API_KEY = "mxs_eval_YOUR_KEY"


def create_eval_batch(tasks, api_key, max_concurrent=5):
    """
    Create multiple eval tasks concurrently.
    """
    def create_single(task):
        response = requests.post(
            "https://app.mixus.ai/api/eval/create-task-agent",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json"
            },
            json=task
        )
        return response.json()

    # executor.map preserves the order of the input tasks
    with ThreadPoolExecutor(max_workers=max_concurrent) as executor:
        results = list(executor.map(create_single, tasks))
    return results


# Usage
tasks = [
    {
        "taskName": "Test 1",
        "taskDescription": "...",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification",
        "assignedReviewer": "reviewer@example.com"
    },
    # ... more tasks
]
results = create_eval_batch(tasks, API_KEY)
```
---
## Metrics and analytics
### What gets tracked
Every eval execution tracks:
- **Taps Used:** Number of human verifications
- **Duration:** Total time from start to completion
- **Model:** Which AI model was used
- **Checkpoint Details:** Time to verification, approval/rejection
- **Step Progress:** Completed vs total steps
### Example: calculate success rate
```python
import requests


def calculate_eval_success_rate(execution_ids, api_key):
    """
    Calculate success rate across multiple evals.
    Executions that are still running are not counted.
    """
    successful = 0
    failed = 0

    for exec_id in execution_ids:
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        exec_data = response.json()['execution']

        if exec_data['isComplete'] and not exec_data['error']:
            successful += 1
        elif exec_data['isFailed']:
            failed += 1

    total = successful + failed
    success_rate = (successful / total * 100) if total > 0 else 0
    return {
        'success_rate': success_rate,
        'successful': successful,
        'failed': failed,
        'total': total
    }
```
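The tracked metrics also make it possible to compare runs with and without human verification. The sketch below groups already-fetched `execution` objects by `testMode` and averages `durationSeconds` and `tapsUsed`. The function name `summarize_by_test_mode` is hypothetical, and skipping executions that are not complete is an assumption about what a fair comparison needs.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_test_mode(executions):
    """Average duration and taps per testMode over completed executions."""
    grouped = defaultdict(list)
    for execution in executions:
        # Assumption: only completed runs carry meaningful final metrics.
        if execution.get("isComplete"):
            grouped[execution["testMode"]].append(execution["metrics"])

    return {
        mode: {
            "runs": len(metrics),
            "avg_duration_seconds": mean(m["durationSeconds"] for m in metrics),
            "avg_taps_used": mean(m["tapsUsed"] for m in metrics),
        }
        for mode, metrics in grouped.items()
    }
```

The difference between the two `avg_duration_seconds` values gives a rough estimate of the time added by human review.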
---
## Tips and best practices
### Optimization
1. **Use webhooks** instead of polling for long-running tasks
2. **Batch create** multiple evals concurrently (max 10)
3. **Cache status** responses to reduce API calls
4. **Use without-verification mode** for speed benchmarks
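One way to cache status responses, as suggested in the list above, is a small time-to-live cache in front of the status call. This is a sketch, not a mixus feature: `StatusCache` is a hypothetical helper, the 5-second default is an arbitrary choice, and the actual HTTP call is injected as `fetch` so the cache itself has no network dependency.

```python
import time


class StatusCache:
    """Cache status lookups per execution ID for a short time-to-live."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch      # callable: execution_id -> status dict
        self._ttl = ttl_seconds  # how long a cached response stays fresh
        self._clock = clock      # injectable clock, handy for testing
        self._entries = {}       # execution_id -> (fetched_at, status)

    def get(self, execution_id):
        now = self._clock()
        cached = self._entries.get(execution_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        status = self._fetch(execution_id)
        self._entries[execution_id] = (now, status)
        return status
```

Several parts of an application can then share one `StatusCache` instance and ask for the same execution without each triggering a request.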
### Debugging
1. **Check the chat URL** to see the agent's work in real time
2. **Monitor logs** in the mixus dashboard
3. **Use externalId** to track executions in your system
4. **Test with simple tasks** first
### Performance
- **Avg execution time:** 30s - 5min depending on complexity
- **Checkpoint verification:** Human response time (varies)
- **API response time:** < 500ms
- **Status endpoint:** < 100ms
---
## Comparison with traditional evaluation
### Traditional AI evaluation
- Tests only final outcomes
- Discovers mistakes after damage is done
- No insight into execution process
- Binary pass/fail results
### mixus human-in-the-loop evaluation
- Provides step-by-step verification
- Catches mistakes before execution
- Full visibility into agent reasoning
- Provides hints to guide agents
- Builds trust through transparency
- Measures both accuracy and safety
---
## Next steps
- [API keys setup](/agents/api-keys) - Generate your first API key
- [API overview](./overview) - General API documentation
- [Quick start](./quickstart) - Get started in 5 minutes
- [Code examples](./examples) - More examples in multiple languages
---
## Support
Questions? Reach out:
- **Email:** support@mixus.ai
- **Documentation:** https://docs.mixus.ai
- **Community:** https://community.mixus.ai