# Human-in-the-loop evaluation system

The mixus evaluation system combines automated AI evaluation with human-in-the-loop step verification. This approach ensures AI agents perform reliably in real-world scenarios by verifying critical steps before they execute.

## What is the eval API?

The Evaluation API allows you to test and measure your AI agents’ performance programmatically with optional human verification at critical checkpoints. Unlike traditional evaluation systems that only measure outcomes, mixus evaluates the entire execution process - giving you confidence that agents work correctly step-by-step.

## Why human-in-the-loop evaluation matters

Traditional AI evaluation measures final outcomes, but misses critical failures during execution:

- Sending wrong emails before catching the mistake
- Making purchases with incorrect amounts
- Deleting data that shouldn’t be removed

mixus provides an alternative approach: verify steps before they execute, not after. This prevents costly mistakes and builds trust in autonomous AI agents.

## How it works

### 📤 Step 1: create eval task (your code)

`POST /api/eval/create-task-agent`

- Task description
- Verification checkpoints (manual or AI-detected)

### 🤖 Step 2: mixus creates agent

- Converts task to agent execution plan
- Sets up verification points
- Initializes tracking metrics

### Step 3: agent executes task

- Runs through steps
- Uses tools (web search, integrations, etc.)
- Pauses at checkpoints for verification

### Step 4: human verification

- Review agent’s work BEFORE execution
- Approve, reject, or provide hints
- Agent adjusts based on feedback
- Prevents mistakes before they happen

### 📊 Step 5: get results (your code)

`GET /api/eval/status/{executionId}`

- Execution status
- Checkpoint completion
- Performance metrics
- Human verification insights
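
Putting the flow together in code: a minimal sketch in Python using `requests` (the task values and reviewer email are placeholders, and error handling is omitted):

```python
import time
import requests

API_KEY = "mxs_eval_YOUR_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Step 1: create the eval task (steps 2-4 then happen inside mixus)
created = requests.post(
    "https://app.mixus.ai/api/eval/create-task-agent",
    headers=HEADERS,
    json={
        "taskName": "Demo task",  # placeholder
        "taskDescription": "Calculate 15% commission on a $50,000 sale",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification",
        "assignedReviewer": "you@example.com",  # placeholder reviewer
    },
).json()

# Step 5: poll for results
while True:
    status = requests.get(
        f"https://app.mixus.ai/api/eval/status/{created['executionId']}",
        headers=HEADERS,
    ).json()
    execution = status["execution"]
    if execution["isComplete"] or execution["isFailed"]:
        print("Final status:", execution["status"])
        break
    time.sleep(10)  # poll every 10 seconds
```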

## Before you begin

To use the evaluation API, you need:

- A mixus account
- An API key with `eval:create` and `eval:read` permissions
- A team member who can review and verify agent steps

## Quick start

### Step 1: generate API key

Visit app.mixus.ai/integrations/api-keys and generate a key with `eval:create` and `eval:read` permissions.

### Step 2: create your first eval

```bash
curl -X POST https://app.mixus.ai/api/eval/create-task-agent \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "taskName": "Calculate Sales Commission",
    "taskDescription": "Calculate 15% commission on a $50,000 sale",
    "checkpoints": [{
      "stage": "calculation",
      "description": "Perform calculation",
      "verificationQuestion": "Is the result $7,500?"
    }],
    "testMode": "with-verification",
    "assignedReviewer": "you@example.com"
  }'
```

### Step 3: track progress

```bash
# Use executionId from previous response
curl https://app.mixus.ai/api/eval/status/EXECUTION_ID \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY"
```

---

## Checkpoint modes

### Manual checkpoints

You define exactly where verification happens:

```json
{
  "checkpoints": [
    {
      "stage": "research",
      "description": "Research competitors",
      "verificationQuestion": "Is the research comprehensive?"
    },
    {
      "stage": "analysis",
      "description": "Analyze pricing",
      "verificationQuestion": "Is the analysis accurate?"
    }
  ],
  "testMode": "with-verification"
}
```

### AI-detected checkpoints

Let AI decide where verification is needed:

```json
{
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "taskDescription": "Research top 3 AI companies, analyze pricing, calculate our advantage, and email summary to team@example.com"
}
```

The AI will automatically add verification points before:
- Sending emails/messages
- Making purchases or transactions
- Deleting or modifying data
- Other high-impact actions

### No verification (baseline)

Run without human verification for speed comparisons:

```json
{
  "testMode": "without-verification",
  "taskDescription": "Simple calculation task"
}
```

---

## Complete API reference

### Create eval task

**Endpoint:** `POST /api/eval/create-task-agent`

**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
Content-Type: application/json
```

**Request Parameters:**

| Field | Type | Required | Description |
|---|---|---|---|
| `taskName` | string | Yes | Name of the evaluation task |
| `taskDescription` | string | Yes | Detailed description of what to do |
| `checkpoints` | array | No* | Manual verification points |
| `autoDetectCheckpoints` | boolean | No* | Let AI detect checkpoints |
| `testMode` | string | Yes | `with-verification` or `without-verification` |
| `assignedReviewer` | string | Yes | Email of person who will verify |
| `webhookUrl` | string | No | URL to receive completion webhooks |
| `externalId` | string | No | Your tracking ID |

_*Either `checkpoints` OR `autoDetectCheckpoints` (not both)_

**Checkpoint Object:**
```typescript
{
  stage: string              // Identifier for this checkpoint
  description: string        // What happens at this step
  verificationQuestion: string  // Question for reviewer
}
```

**Response:**
```json
{
  "success": true,
  "executionId": "string",
  "chatId": "string",
  "chatUrl": "string",
  "taskName": "string",
  "testMode": "string",
  "checkpointCount": number,
  "checkpointDetectionMethod": "manual" | "ai-detected" | "none",
  "message": "string"
}
```

### Get execution status

**Endpoint:** `GET /api/eval/status/{executionId}`

**Headers:**
```http
Authorization: Bearer mxs_eval_YOUR_KEY
```

**Response:**
```json
{
  "success": true,
  "execution": {
    "id": "string",
    "status": "running" | "completed" | "failed" | "waiting_verification",
    "testMode": "string",
    "progress": {
      "totalSteps": number,
      "completedSteps": number,
      "currentStepIndex": number,
      "percentComplete": number
    },
    "checkpoints": {
      "total": number,
      "completed": number,
      "details": [
        {
          "checkpointNumber": number,
          "stepId": "string",
          "description": "string",
          "status": "string",
          "wasVerified": boolean,
          "evalMetadata": {
            "tapNumber": number,
            "verificationDuration": number,
            "verificationResponse": {
              "action": "approved" | "rejected",
              "hint": "string",
              "verifiedAt": "ISO date"
            }
          }
        }
      ]
    },
    "metrics": {
      "tapsUsed": number,
      "expectedTaps": number,
      "durationSeconds": number,
      "model": "string"
    },
    "startedAt": "ISO date",
    "completedAt": "ISO date" | null,
    "isComplete": boolean,
    "isFailed": boolean,
    "isRunning": boolean,
    "error": "string" | null,
    "chatUrl": "string",
    "chatId": "string"
  }
}
```
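
When `status` is `waiting_verification`, the agent is paused at a checkpoint until the assigned reviewer responds. A minimal sketch of detecting that state and surfacing the `chatUrl` to the reviewer (how you actually notify them is left as a placeholder):

```python
import requests

def pending_verification_url(execution_id, api_key):
    """Return the chat URL if the execution is paused at a checkpoint, else None."""
    response = requests.get(
        f"https://app.mixus.ai/api/eval/status/{execution_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    execution = response.json()["execution"]
    if execution["status"] == "waiting_verification":
        # Placeholder: send this URL to the reviewer via email, Slack, etc.
        return execution["chatUrl"]
    return None
```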

---

## Common use cases

### 1. Run benchmark suite

```python
import requests
import time

API_KEY = "mxs_eval_YOUR_KEY"
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Create multiple eval tasks
tasks = [
    "Calculate 15% of $50,000",
    "Research top 3 AI companies and summarize",
    "Draft email to team about Q4 goals"
]

executions = []
for task in tasks:
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers=HEADERS,
        json={
            "taskName": f"Benchmark: {task[:30]}",
            "taskDescription": task,
            "autoDetectCheckpoints": True,
            "testMode": "with-verification",
            "assignedReviewer": "you@example.com"
        }
    )
    executions.append(response.json())

# Track progress
for execution in executions:
    exec_id = execution['executionId']

    while True:
        status = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers=HEADERS
        ).json()

        if status['execution']['isComplete']:
            print(f"✅ {execution['taskName']}: Done!")
            break
        elif status['execution']['isFailed']:
            print(f"❌ {execution['taskName']}: Failed")
            break

        time.sleep(10)  # Poll every 10 seconds
```

### 2. Continuous integration testing

```javascript
// ci-eval-test.js
const axios = require('axios');

const API_KEY = process.env.MIXUS_API_KEY;
const headers = {
  'Authorization': `Bearer ${API_KEY}`,
  'Content-Type': 'application/json'
};

async function runEvalTest(taskName, description) {
  // Create eval
  const { data } = await axios.post(
    'https://app.mixus.ai/api/eval/create-task-agent',
    {
      taskName,
      taskDescription: description,
      autoDetectCheckpoints: true,
      testMode: 'without-verification', // No human verification for CI
      assignedReviewer: 'ci@example.com'
    },
    { headers }
  );
  
  // Poll for completion
  while (true) {
    const status = await axios.get(
      `https://app.mixus.ai/api/eval/status/${data.executionId}`,
      { headers }
    );
    
    if (status.data.execution.isComplete) {
      return status.data.execution.metrics;
    }
    
    await new Promise(resolve => setTimeout(resolve, 5000));
  }
}

// Run in CI pipeline
(async () => {
  const metrics = await runEvalTest(
    "CI Test: Data Processing",
    "Process customer data and generate report"
  );
  
  console.log('Duration:', metrics.durationSeconds, 'seconds');
  process.exit(0);
})();
```

### 3. Webhook notifications

Get notified when eval completes:

```json
{
  "taskName": "Long Running Task",
  "taskDescription": "Complex multi-step evaluation",
  "autoDetectCheckpoints": true,
  "testMode": "with-verification",
  "assignedReviewer": "you@example.com",
  "webhookUrl": "https://your-app.com/webhooks/eval-complete",
  "externalId": "your-tracking-id-123"
}
```

**Webhook Payload:**
```json
{
  "event": "checkpoints_completed",
  "executionId": "string",
  "externalId": "your-tracking-id-123",
  "testMode": "with-verification",
  "finalStatus": "completed",
  "metrics": {
    "tapsUsed": 2,
    "durationSeconds": 450
  }
}
```
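
On the receiving side, any endpoint that accepts a JSON POST will do. Here is a minimal sketch using Flask; the field names follow the payload above, the route path matches the example `webhookUrl`, and any request-signature verification mixus may offer is not shown:

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/eval-complete", methods=["POST"])
def eval_complete():
    payload = request.get_json()
    # Correlate with your own records via the externalId you supplied
    external_id = payload.get("externalId")
    if payload.get("finalStatus") == "completed":
        metrics = payload.get("metrics", {})
        print(f"Eval {external_id} completed in {metrics.get('durationSeconds')}s")
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```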

---

## Error handling

The eval API uses standard HTTP status codes. For complete error documentation, see [Error codes reference](./eval-api-error-codes).

### Quick reference

| Code | Meaning | Common Cause | Solution |
|------|---------|--------------|----------|
| 200 | Success | Request valid | Proceed with executionId |
| 400 | Bad Request | Missing/invalid fields | Check required parameters |
| 401 | Unauthorized | Invalid API key | Regenerate key |
| 404 | Not Found | Reviewer doesn't exist | Invite user to mixus |
| 500 | Server Error | Processing failed | Retry or contact support |

### Example error responses

**400 - Missing Fields:**
```json
{
  "error": "Missing required fields",
  "required": ["taskName", "taskDescription", "testMode", "assignedReviewer"],
  "hint": "Ensure all required fields are provided"
}
```

**401 - Unauthorized:**
```json
{
  "error": "Unauthorized"
}
```

**404 - Reviewer Not Found:**
```json
{
  "error": "Reviewer not found",
  "assignedReviewer": "user@example.com",
  "hint": "Provide email or username of mixus user",
  "suggestion": "Try one of these valid reviewers: teammate@company.com"
}
```

### Retry logic

```javascript
const axios = require('axios');

async function apiCallWithRetry(url, options, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await axios(url, options);
    } catch (error) {
      const status = error.response?.status;
      if (status === 429 || status >= 500) {
        // Rate limit or server error - back off before retrying
        await new Promise(r => setTimeout(r, attempt * 1000));
      } else {
        // Client error - don't retry
        throw error;
      }
    }
  }
  throw new Error(`Request failed after ${maxRetries} retries`);
}
```

---

## Advanced features

### Polling best practices

```python
import requests
import time

def wait_for_completion(execution_id, api_key, timeout=3600):
    """
    Poll execution status until completion or timeout.
    
    Args:
        execution_id: Execution ID from create response
        api_key: Your mixus API key
        timeout: Max wait time in seconds (default: 1 hour)
    
    Returns:
        Final execution status
    """
    start_time = time.time()
    poll_interval = 10  # Start with 10 seconds
    
    while True:
        # Check timeout
        if time.time() - start_time > timeout:
            raise TimeoutError("Execution did not complete in time")
        
        # Get status
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{execution_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        
        execution = response.json()['execution']
        
        # Check if done
        if execution['isComplete'] or execution['isFailed']:
            return execution
        
        # Adaptive polling (slow down over time)
        if time.time() - start_time > 300:  # After 5 minutes
            poll_interval = 30  # Poll every 30 seconds
        
        time.sleep(poll_interval)
```

### Batch processing

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def create_eval_batch(tasks, api_key, max_concurrent=5):
    """
    Create multiple eval tasks concurrently.
    """
    def create_single(task):
        response = requests.post(
            "https://app.mixus.ai/api/eval/create-task-agent",
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json"
            },
            json=task
        )
        return response.json()
    
    with ThreadPoolExecutor(max_workers=max_concurrent) as executor:
        results = list(executor.map(create_single, tasks))
    
    return results

# Usage
tasks = [
    {
        "taskName": "Test 1",
        "taskDescription": "...",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification",
        "assignedReviewer": "reviewer@example.com"
    },
    # ... more tasks
]

results = create_eval_batch(tasks, API_KEY)
```

---

## Metrics and analytics

### What gets tracked

Every eval execution tracks:

- **Taps Used:** Number of human verifications
- **Duration:** Total time from start to completion
- **Model:** Which AI model was used
- **Checkpoint Details:** Time to verification, approval/rejection
- **Step Progress:** Completed vs total steps

### Example: calculate success rate

```python
import requests

def calculate_eval_success_rate(execution_ids, api_key):
    """
    Calculate success rate across multiple evals.
    """
    successful = 0
    failed = 0
    
    for exec_id in execution_ids:
        response = requests.get(
            f"https://app.mixus.ai/api/eval/status/{exec_id}",
            headers={"Authorization": f"Bearer {api_key}"}
        )
        
        exec_data = response.json()['execution']
        
        if exec_data['isComplete'] and not exec_data['error']:
            successful += 1
        elif exec_data['isFailed']:
            failed += 1
    
    total = successful + failed
    success_rate = (successful / total * 100) if total > 0 else 0
    
    return {
        'success_rate': success_rate,
        'successful': successful,
        'failed': failed,
        'total': total
    }
```

---

## Integration examples

### TheAgentCompany (TAC) integration

```python
# tac_mixus_integration.py
"""
Integration between TheAgentCompany benchmark and Mixus Eval API.
"""

import requests

class MixusEvalClient:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://app.mixus.ai/api/eval"
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }
    
    def create_tac_eval(self, task_id, task_data):
        """
        Create eval from TAC task.
        """
        payload = {
            "taskName": f"TAC-{task_id}",
            "taskDescription": task_data['instruction'],
            "autoDetectCheckpoints": True,
            "testMode": "with-verification",
            "assignedReviewer": task_data['reviewer'],
            "webhookUrl": "https://tac-server.com/webhook/eval-complete",
            "externalId": task_id
        }
        
        response = requests.post(
            f"{self.base_url}/create-task-agent",
            headers=self.headers,
            json=payload
        )
        
        return response.json()
    
    def get_status(self, execution_id):
        """
        Get execution status.
        """
        response = requests.get(
            f"{self.base_url}/status/{execution_id}",
            headers=self.headers
        )
        
        return response.json()

# Usage
client = MixusEvalClient("mxs_eval_YOUR_KEY")

# Run TAC task
result = client.create_tac_eval(
    task_id="tac_001",
    task_data={
        "instruction": "Review Q3 sales data and draft summary email",
        "reviewer": "manager@company.com"
    }
)

print(f"Execution started: {result['chatUrl']}")
```

---

## Tips and best practices

### Optimization

1. **Use webhooks** instead of polling for long-running tasks
2. **Batch create** multiple evals concurrently (max 10)
3. **Cache status** responses to reduce API calls (see the sketch after this list)
4. **Use without-verification mode** for speed benchmarks
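
For the caching tip, a short-TTL cache in front of the status endpoint keeps repeated lookups for the same execution from turning into extra API calls. A minimal in-process sketch (the 10-second TTL is an arbitrary choice, not a documented limit):

```python
import time
import requests

_status_cache = {}  # execution_id -> (fetched_at, response_json)
CACHE_TTL = 10  # seconds; arbitrary, tune to your polling needs

def get_status_cached(execution_id, api_key):
    """Fetch execution status, reusing any response younger than CACHE_TTL."""
    cached = _status_cache.get(execution_id)
    if cached and time.time() - cached[0] < CACHE_TTL:
        return cached[1]
    data = requests.get(
        f"https://app.mixus.ai/api/eval/status/{execution_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    ).json()
    _status_cache[execution_id] = (time.time(), data)
    return data
```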

### Debugging

1. **Check chat URL** to see agent's work in real-time
2. **Monitor logs** in mixus dashboard
3. **Use externalId** to track executions in your system
4. **Test with simple tasks** first

### Performance

- **Avg execution time:** 30 seconds to 5 minutes, depending on complexity
- **Checkpoint verification:** Human response time (varies)
- **API response time:** < 500ms
- **Status endpoint:** < 100ms

---


## Comparison with traditional evaluation

### Traditional AI evaluation

- Tests only final outcomes
- Discovers mistakes after damage is done
- No insight into execution process
- Binary pass/fail results

### mixus human-in-the-loop evaluation

- Provides step-by-step verification
- Catches mistakes before execution
- Full visibility into agent reasoning
- Provides hints to guide agents
- Builds trust through transparency
- Measures both accuracy and safety

---

## Next steps

- [API keys setup](/agents/api-keys) - Generate your first API key
- [API overview](./overview) - General API documentation  
- [Quick start](./quickstart) - Get started in 5 minutes
- [Code examples](./examples) - More examples in multiple languages

---

## Support

Questions? Reach out:
- **Email:** support@mixus.ai
- **Documentation:** https://docs.mixus.ai
- **Community:** https://community.mixus.ai
