# Human-in-the-loop evaluation system
The mixus evaluation system combines automated AI evaluation with human-in-the-loop step verification. This approach ensures AI agents perform reliably in real-world scenarios by verifying critical steps before they execute.

## What is the eval API?
The Evaluation API allows you to test and measure your AI agents' performance programmatically, with optional human verification at critical checkpoints. Unlike traditional evaluation systems that only measure outcomes, mixus evaluates the entire execution process, giving you confidence that agents work correctly step by step.

## Why human-in-the-loop evaluation matters
Traditional AI evaluation measures final outcomes but misses critical failures during execution:

- Sending wrong emails before catching the mistake
- Making purchases with incorrect amounts
- Deleting data that shouldn’t be removed
## How it works
### 📤 Step 1: create eval task (your code)

`POST /api/eval/create-task-agent`

The request includes:
- Task description
- Verification checkpoints (manual or AI-detected)
### 🤖 Step 2: mixus creates agent
- Converts task to agent execution plan
- Sets up verification points
- Initializes tracking metrics
### ⚡ Step 3: agent executes task
- Runs through steps
- Uses tools (web search, integrations, etc.)
- Pauses at checkpoints for verification
### ⭐ Step 4: human verification
- Review agent’s work BEFORE execution
- Approve, reject, or provide hints
- Agent adjusts based on feedback
- Prevents mistakes before they happen
### 📊 Step 5: get results (your code)

`GET /api/eval/status/{executionId}`

The response includes:
- Execution status
- Checkpoint completion
- Performance metrics
- Human verification insights
## Before you begin
To use the evaluation API, you need:

- A mixus account
- An API key with `eval:create` and `eval:read` permissions
- A team member who can review and verify agent steps
## Quick start
### Step 1: generate API key
Visit app.mixus.ai/integrations/api-keys and generate a key with `eval:create` and `eval:read` permissions.
### Step 2: create your first eval
```bash
curl -X POST https://app.mixus.ai/api/eval/create-task-agent \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taskName": "Calculate Sales Commission",
    "taskDescription": "Calculate 15% commission on a $50,000 sale",
    "checkpoints": [{
      "stage": "calculation",
      "description": "Perform calculation",
      "verificationQuestion": "Is the result $7,500?"
    }],
    "testMode": "with-verification",
    "assignedReviewer": "you@example.com"
  }'
```
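A successful response returns the identifiers you'll need for tracking. The values below are illustrative only; the full schema is documented in the API reference further down:

```json
{
  "success": true,
  "executionId": "exec_abc123",
  "chatId": "chat_xyz789",
  "chatUrl": "https://app.mixus.ai/chat/chat_xyz789",
  "taskName": "Calculate Sales Commission",
  "testMode": "with-verification",
  "checkpointCount": 1,
  "checkpointDetectionMethod": "manual",
  "message": "Eval task created"
}
```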
### Step 3: track progress
```bash
# Use the executionId from the previous response
curl https://app.mixus.ai/api/eval/status/EXECUTION_ID \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY"
```
---
## Checkpoint modes
### Manual checkpoints
You define exactly where verification happens:
```json
{
  "checkpoints": [
    {
      "stage": "research",
      "description": "Research competitors",
      "verificationQuestion": "Is the research comprehensive?"
    },
    {
      "stage": "analysis",
      "description": "Analyze pricing",
      "verificationQuestion": "Is the analysis accurate?"
    }
  ],
  "testMode": "with-verification"
}
```
### AI-detected checkpoints
Let AI decide where verification is needed:
```json
{
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "taskDescription": "Research top 3 AI companies, analyze pricing, calculate our advantage, and email summary to team@example.com"
}
```
The AI will automatically add verification points before:
- Sending emails/messages
- Making purchases or transactions
- Deleting or modifying data
- Other high-impact actions
### No verification (baseline)
Run without human verification for speed comparisons:
```json
{
  "testMode": "without-verification",
  "taskDescription": "Simple calculation task"
}
```
---
## Complete API reference
### Create eval task
**Endpoint:** `POST /api/eval/create-task-agent`
**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
Content-Type: application/json
```
**Request Parameters:**
| Field | Type | Required | Description |
|---|---|---|---|
| `taskName` | string | Yes | Name of the evaluation task |
| `taskDescription` | string | Yes | Detailed description of what to do |
| `checkpoints` | array | No* | Manual verification points |
| `autoDetectCheckpoints` | boolean | No* | Let AI detect checkpoints |
| `testMode` | string | Yes | `with-verification` or `without-verification` |
| `assignedReviewer` | string | Yes | Email of person who will verify |
| `webhookUrl` | string | No | URL to receive completion webhooks |
| `externalId` | string | No | Your tracking ID |
_*Either `checkpoints` OR `autoDetectCheckpoints` (not both)_
**Checkpoint Object:**
```typescript
{
  stage: string                 // Identifier for this checkpoint
  description: string           // What happens at this step
  verificationQuestion: string  // Question for the reviewer
}
```
**Response:**
```json
{
  "success": true,
  "executionId": "string",
  "chatId": "string",
  "chatUrl": "string",
  "taskName": "string",
  "testMode": "string",
  "checkpointCount": number,
  "checkpointDetectionMethod": "manual" | "ai-detected" | "none",
  "message": "string"
}
```
### Get execution status
**Endpoint:** `GET /api/eval/status/{executionId}`
**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
```
**Response:**
```json
{
  "success": true,
  "execution": {
    "id": "string",
    "status": "running" | "completed" | "failed" | "waiting_verification",
    "testMode": "string",
    "progress": {
      "totalSteps": number,
      "completedSteps": number,
      "currentStepIndex": number,
      "percentComplete": number
    },
    "checkpoints": {
      "total": number,
      "completed": number,
      "details": [
        {
          "checkpointNumber": number,
          "stepId": "string",
          "description": "string",
          "status": "string",
          "wasVerified": boolean,
          "evalMetadata": {
            "tapNumber": number,
            "verificationDuration": number,
            "verificationResponse": {
              "action": "approved" | "rejected",
              "hint": "string",
              "verifiedAt": "ISO date"
            }
          }
        }
      ]
    },
    "metrics": {
      "tapsUsed": number,
      "expectedTaps": number,
      "durationSeconds": number,
      "model": "string"
    },
    "startedAt": "ISO date",
    "completedAt": "ISO date" | null,
    "isComplete": boolean,
    "isFailed": boolean,
    "isRunning": boolean,
    "error": "string" | null,
    "chatUrl": "string",
    "chatId": "string"
  }
}
```
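As a sketch, a client can summarize this payload with a few field lookups (field names as documented above; `status_json` is assumed to be the parsed response body):

```python
def summarize_status(status_json):
    """Print a one-line summary of an execution status response."""
    execution = status_json['execution']
    progress = execution['progress']
    checkpoints = execution['checkpoints']
    print(
        f"{execution['id']}: {execution['status']} "
        f"({progress['percentComplete']}% complete, "
        f"{checkpoints['completed']}/{checkpoints['total']} checkpoints verified)"
    )
```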
---
## Common use cases
### 1. Run Benchmark Suite
```python
import requests
import time

API_KEY = "mxs_eval_YOUR_KEY"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Create multiple eval tasks
tasks = [
    "Calculate 15% of $50,000",
    "Research top 3 AI companies and summarize",
    "Draft email to team about Q4 goals"
]

executions = []
for task in tasks:
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers=HEADERS,
        json={
            "taskName": f"Benchmark: {task[:30]}",
            "taskDescription": task,
            "autoDetectCheckpoints": True,
            "testMode": "with-verification",
            "assignedReviewer": "you@example.com"
        }
    )
    executions.append(response.json())

# Track progress
for execution in executions:
    exec_id = execution['executionId']
    while True:
        status = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers=HEADERS
        ).json()
        if status['execution']['isComplete']:
            print(f"✅ {execution['taskName']}: Done!")
            break
        elif status['execution']['isFailed']:
            print(f"❌ {execution['taskName']}: Failed")
            break
        time.sleep(10)  # Poll every 10 seconds
```
### 2. Continuous Integration Testing
```javascript
// ci-eval-test.js
const axios = require('axios');

const API_KEY = process.env.MIXUS_API_KEY;
const headers = {
  'Authorization': `Bearer ${API_KEY}`,
  'Content-Type': 'application/json'
};

async function runEvalTest(taskName, description) {
  // Create eval
  const { data } = await axios.post(
    'https://app.mixus.ai/api/eval/create-task-agent',
    {
      taskName,
      taskDescription: description,
      autoDetectCheckpoints: true,
      testMode: 'without-verification', // No human verification for CI
      assignedReviewer: 'ci@example.com'
    },
    { headers }
  );

  // Poll for completion
  while (true) {
    const status = await axios.get(
      `https://app.mixus.ai/api/eval/status/${data.executionId}`,
      { headers }
    );
    if (status.data.execution.isComplete) {
      return status.data.execution.metrics;
    }
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}

// Run in CI pipeline
(async () => {
  const metrics = await runEvalTest(
    "CI Test: Data Processing",
    "Process customer data and generate report"
  );
  console.log('Duration:', metrics.durationSeconds, 'seconds');
  process.exit(0);
})();
```
### 3. Webhook Notifications
Get notified when eval completes:
```json
{
  "taskName": "Long Running Task",
  "taskDescription": "Complex multi-step evaluation",
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "assignedReviewer": "you@example.com",
  "webhookUrl": "https://your-app.com/webhooks/eval-complete",
  "externalId": "your-tracking-id-123"
}
```
**Webhook Payload:**
```json
{
  "event": "checkpoints_completed",
  "executionId": "string",
  "externalId": "your-tracking-id-123",
  "testMode": "with-verification",
  "finalStatus": "completed",
  "metrics": {
    "tapsUsed": 2,
    "durationSeconds": 450
  }
}
```
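A minimal receiver sketch, assuming a Flask app serving `https://your-app.com` (the route path matches the `webhookUrl` above; any signature verification mixus may require is not covered here):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/eval-complete", methods=["POST"])
def eval_complete():
    # Field names follow the webhook payload documented above
    payload = request.get_json()
    if payload.get("event") == "checkpoints_completed":
        print(
            f"Eval {payload['executionId']} (external ID {payload.get('externalId')}) "
            f"finished with status {payload['finalStatus']} "
            f"in {payload['metrics']['durationSeconds']}s"
        )
    return "", 204
```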
---
## Error handling
The eval API uses standard HTTP status codes. For complete error documentation, see [Error codes reference](./eval-api-error-codes).
### Quick Reference
| Code | Meaning | Common Cause | Solution |
|------|---------|--------------|----------|
| 200 | Success | Request valid | Proceed with executionId |
| 400 | Bad Request | Missing/invalid fields | Check required parameters |
| 401 | Unauthorized | Invalid API key | Regenerate key |
| 404 | Not Found | Reviewer doesn't exist | Invite user to mixus |
| 500 | Server Error | Processing failed | Retry or contact support |
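As a sketch, these codes map naturally onto exception handling in Python (the error body fields follow the examples below; `create_eval` is a hypothetical helper, not part of any mixus SDK):

```python
import requests

def create_eval(payload, api_key):
    """Create an eval task, raising a descriptive error on failure."""
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload
    )
    if response.status_code == 200:
        return response.json()
    if response.status_code == 400:
        raise ValueError(f"Bad request: {response.json().get('error')}")
    if response.status_code == 401:
        raise PermissionError("Invalid API key - regenerate it")
    if response.status_code == 404:
        raise LookupError(f"Reviewer not found: {response.json().get('suggestion')}")
    response.raise_for_status()  # 5xx: retry or contact support
```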
### Example Error Responses
**400 - Missing Fields:**
```json
{
  "error": "Missing required fields",
  "required": ["taskName", "taskDescription", "testMode", "assignedReviewer"],
  "hint": "Ensure all required fields are provided"
}
```
**401 - Unauthorized:**
```json
{
  "error": "Unauthorized"
}
```
**404 - Reviewer Not Found:**
```json
{
  "error": "Reviewer not found",
  "assignedReviewer": "user@example.com",
  "hint": "Provide email or username of mixus user",
  "suggestion": "Try one of these valid reviewers: teammate@company.com"
}
```
### Retry Logic
```javascript
const axios = require('axios');

async function apiCallWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await axios(url, options);
    } catch (error) {
      if (error.response?.status === 429) {
        // Rate limited - wait and retry with linear backoff
        await new Promise(r => setTimeout(r, attempt * 1000));
      } else if (error.response?.status >= 500) {
        // Server error - retry
        await new Promise(r => setTimeout(r, attempt * 1000));
      } else {
        // Client error - don't retry
        throw error;
      }
    }
  }
  throw new Error(`Request failed after ${maxRetries} retries`);
}
```
---
## Advanced features
### Polling best practices
```python
import requests
import time

def wait_for_completion(execution_id, api_key, timeout=3600):
    """
    Poll execution status until completion or timeout.

    Args:
        execution_id: Execution ID from the create response
        api_key: Your mixus API key
        timeout: Max wait time in seconds (default: 1 hour)

    Returns:
        Final execution status
    """
    start_time = time.time()
    poll_interval = 10  # Start with 10 seconds

    while True:
        # Check timeout
        if time.time() - start_time > timeout:
            raise TimeoutError("Execution did not complete in time")

        # Get status
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{execution_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        execution = response.json()['execution']

        # Check if done
        if execution['isComplete'] or execution['isFailed']:
            return execution

        # Adaptive polling (slow down over time)
        if time.time() - start_time > 300:  # After 5 minutes
            poll_interval = 30  # Poll every 30 seconds

        time.sleep(poll_interval)
```
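A typical call, using the `executionId` returned by the create endpoint (fields as documented in the status schema above):

```python
execution = wait_for_completion("EXECUTION_ID", "mxs_eval_YOUR_KEY")
print(execution['status'], execution['metrics']['durationSeconds'])
```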
### Batch processing
```python
from concurrent.futures import ThreadPoolExecutor
import requests

API_KEY = "mxs_eval_YOUR_KEY"

def create_eval_batch(tasks, api_key, max_concurrent=5):
    """
    Create multiple eval tasks concurrently.
    """
    def create_single(task):
        response = requests.post(
            "https://app.mixus.ai/api/eval/create-task-agent",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json"
            },
            json=task
        )
        return response.json()

    with ThreadPoolExecutor(max_workers=max_concurrent) as executor:
        results = list(executor.map(create_single, tasks))
    return results

# Usage
tasks = [
    {
        "taskName": "Test 1",
        "taskDescription": "...",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification",
        "assignedReviewer": "reviewer@example.com"
    },
    # ... more tasks
]
results = create_eval_batch(tasks, API_KEY)
```
---
## Metrics and analytics
### What gets tracked
Every eval execution tracks:
- **Taps Used:** Number of human verifications
- **Duration:** Total time from start to completion
- **Model:** Which AI model was used
- **Checkpoint Details:** Time to verification, approval/rejection
- **Step Progress:** Completed vs total steps
### Example: Calculate Success Rate
```python
import requests

def calculate_eval_success_rate(execution_ids, api_key):
    """
    Calculate success rate across multiple evals.
    """
    successful = 0
    failed = 0

    for exec_id in execution_ids:
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        exec_data = response.json()['execution']

        if exec_data['isComplete'] and not exec_data['error']:
            successful += 1
        elif exec_data['isFailed']:
            failed += 1

    total = successful + failed
    success_rate = (successful / total * 100) if total > 0 else 0

    return {
        'success_rate': success_rate,
        'successful': successful,
        'failed': failed,
        'total': total
    }
```
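For example, with the executionIds collected from a batch run (illustrative IDs):

```python
stats = calculate_eval_success_rate(["exec_1", "exec_2"], "mxs_eval_YOUR_KEY")
print(f"{stats['successful']}/{stats['total']} succeeded ({stats['success_rate']:.0f}%)")
```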
---
## Integration examples
### TheAgentCompany (TAC) integration
```python
# tac_mixus_integration.py
"""
Integration between TheAgentCompany benchmark and Mixus Eval API.
"""
import requests
import json
class MixusEvalClient:
def __init__(self, api_key):
self.api_key = api_key
self.base_url = "https://app.mixus.ai/api/eval"
self.headers = {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json"
}
def create_tac_eval(self, task_id, task_data):
"""
Create eval from TAC task.
"""
payload = {
"taskName": f"TAC-{task_id}",
"taskDescription": task_data['instruction'],
"autoDetectCheckpoints": True,
"testMode": "with-verification",
"assignedReviewer": task_data['reviewer'],
"webhookUrl": "https://tac-server.com/webhook/eval-complete",
"externalId": task_id
}
response = requests.post(
f"{self.base_url}/create-task-agent",
headers=self.headers,
json=payload
)
return response.json()
def get_status(self, execution_id):
"""
Get execution status.
"""
response = requests.get(
f"{self.base_url}/status/{execution_id}",
headers=self.headers
)
return response.json()
# Usage
client = MixusEvalClient("mxs_eval_YOUR_KEY")
# Run TAC task
result = client.create_tac_eval(
task_id="tac_001",
task_data={
"instruction": "Review Q3 sales data and draft summary email",
"reviewer": "manager@company.com"
}
)
print(f"Execution started: {result['chatUrl']}")
```http
---
## Tips and best practices
### Optimization
1. **Use webhooks** instead of polling for long-running tasks
2. **Batch create** multiple evals concurrently (max 10)
3. **Cache status** responses to reduce API calls
4. **Use without-verification mode** for speed benchmarks
### Debugging
1. **Check chat URL** to see agent's work in real-time
2. **Monitor logs** in mixus dashboard
3. **Use externalId** to track executions in your system
4. **Test with simple tasks** first
### Performance
- **Avg execution time:** 30s - 5min depending on complexity
- **Checkpoint verification:** Human response time (varies)
- **API response time:** < 500ms
- **Status endpoint:** < 100ms
---
## Comparison with traditional evaluation
### Traditional AI evaluation
- Tests only final outcomes
- Discovers mistakes after damage is done
- No insight into execution process
- Binary pass/fail results
### mixus human-in-the-loop evaluation
- Provides step-by-step verification
- Catches mistakes before execution
- Full visibility into agent reasoning
- Provides hints to guide agents
- Builds trust through transparency
- Measures both accuracy and safety
---
## Next steps
- [API keys setup](/agents/api-keys) - Generate your first API key
- [API overview](./overview) - General API documentation
- [Quick start](./quickstart) - Get started in 5 minutes
- [Code examples](./examples) - More examples in multiple languages
---
## Support
Questions? Reach out:
- **Email:** support@mixus.ai
- **Documentation:** https://docs.mixus.ai
- **Community:** https://community.mixus.ai