
What is the evaluation system?

The mixus evaluation system lets you test and measure AI agent performance, with optional human verification at critical checkpoints, so you can confirm that an agent is working correctly step by step before it executes actions.

How it works

The evaluation system accepts tasks via two methods:

Email Submission

Send tasks to agent@mixus.com with a subject line starting with “Eval:”
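As a rough sketch, you could automate the email route with standard Python. Only the recipient address and the “Eval:” subject prefix come from the requirement above; the sender, SMTP server, credentials, and body format are placeholders to replace with your own:

  import smtplib
  from email.message import EmailMessage

  # Build the evaluation request. Only the recipient address and the
  # "Eval:" subject prefix are documented; the body format is illustrative.
  msg = EmailMessage()
  msg["From"] = "you@example.com"          # placeholder sender
  msg["To"] = "agent@mixus.com"
  msg["Subject"] = "Eval: Reconcile March invoices against the billing export"
  msg.set_content(
      "Task: Reconcile March invoices against the billing export.\n"
      "Expected outcome: a list of mismatched invoice IDs."
  )

  # Send through your own outgoing mail server (host and credentials are placeholders).
  with smtplib.SMTP("smtp.example.com", 587) as smtp:
      smtp.starttls()
      smtp.login("you@example.com", "app-password")
      smtp.send_message(msg)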

API Submission

POST to /api/eval/create-task-agent for programmatic access
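A minimal programmatic submission might look like the sketch below. The endpoint path is the one named above; the base URL, authentication header, and payload field names are assumptions you would adapt to your account:

  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential

  # Payload field names are illustrative, not a documented schema.
  payload = {
      "task": "Reconcile March invoices against the billing export",
      "expected_outcome": "A list of mismatched invoice IDs",
  }

  response = requests.post(
      f"{BASE_URL}/api/eval/create-task-agent",
      json=payload,
      headers={"Authorization": f"Bearer {API_KEY}"},
      timeout=30,
  )
  response.raise_for_status()
  print(response.json())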

Process flow

  1. Task Submission - Submit via email or API
  2. Agent Creation - System converts task to executable agent
  3. Execution - Agent runs through steps using AI and tools
  4. Verification - Human reviews and approves critical steps
  5. Results - Metrics and outcomes returned

Test modes

With Verification (HITL)

Agent pauses at checkpoints for human approval. Ideal for:
  • Testing agent reliability
  • High-stakes operations
  • Learning agent behavior
Checkpoint Detection:
  • Auto-detect - AI determines optimal verification points
  • Manual - You specify exact checkpoints

Without Verification (Baseline)

Agent runs autonomously without human oversight. Ideal for:
  • Speed benchmarks
  • Comparing runs with and without human oversight
  • Low-risk tasks
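
To make the difference between the two modes concrete, the fragments below sketch how a submission payload might vary. Every field name here is a hypothetical illustration rather than the documented schema:

  # Hypothetical payload fragments; treat all field names as placeholders.

  with_verification_auto = {
      "test_mode": "hitl",            # pause at checkpoints for human approval
      "checkpoints": "auto",          # let the AI pick verification points
      "reviewer": "reviewer@example.com",
  }

  with_verification_manual = {
      "test_mode": "hitl",
      "checkpoints": ["draft_email", "send_email"],  # you name the exact steps
      "reviewer": "reviewer@example.com",
  }

  baseline = {
      "test_mode": "baseline",        # run autonomously, no human oversight
  }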

Key capabilities

AI Checkpoint Detection

Automatically identifies where verification is needed based on task risk

Email Integration

Submit tasks and receive notifications via email

API Access

Programmatic integration for automation and batch processing

Metrics Tracking

Detailed metrics on performance, cost, and human verification

Webhook Callbacks

Real-time notifications when evaluations complete
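
If you register a callback URL, a small HTTP handler like the sketch below could receive completion notifications. The payload fields it reads (task_id, status) are assumptions, since the callback schema isn't described on this page:

  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  class EvalWebhookHandler(BaseHTTPRequestHandler):
      # Receives evaluation-complete callbacks; the fields read below are assumptions.
      def do_POST(self):
          length = int(self.headers.get("Content-Length", 0))
          event = json.loads(self.rfile.read(length) or b"{}")
          print("Evaluation finished:", event.get("task_id"), event.get("status"))
          self.send_response(200)
          self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("0.0.0.0", 8080), EvalWebhookHandler).serve_forever()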

Dashboard View

Visual tracking of all evaluations at app.mixus.ai/eval

Use cases

Evaluation & Testing

Test AI agents systematically:
  • Run benchmark suites
  • Compare different approaches
  • Measure reliability improvements
  • Validate agent behavior

Quality Assurance

Ensure agents work correctly:
  • Verify calculations before execution
  • Review communications before sending
  • Approve transactions before processing
  • Validate data changes before committing

Benchmarking

Compare performance:
  • With vs without human oversight
  • Different models or configurations
  • Cost vs accuracy tradeoffs
  • Speed vs reliability metrics

Integration Testing

Test external integrations:
  • TheAgentCompany (TAC) benchmarks
  • Custom evaluation frameworks
  • CI/CD pipelines
  • Automated testing workflows
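
For CI or batch runs, a small driver script can submit a suite of tasks in a loop and collect the returned IDs. As in the earlier sketch, the base URL, authentication, and payload field names are assumptions:

  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential

  # A tiny benchmark suite; task text and field names are illustrative.
  suite = [
      {"task": "Summarize the Q1 sales report", "test_mode": "baseline"},
      {"task": "Summarize the Q1 sales report", "test_mode": "hitl"},
  ]

  task_ids = []
  for payload in suite:
      r = requests.post(
          f"{BASE_URL}/api/eval/create-task-agent",
          json=payload,
          headers={"Authorization": f"Bearer {API_KEY}"},
          timeout=30,
      )
      r.raise_for_status()
      task_ids.append(r.json().get("task_id"))  # "task_id" key is an assumption

  print("Submitted evaluations:", task_ids)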

Why human-in-the-loop?

Traditional AI evaluation measures final outcomes but misses critical failures during execution:
Without Verification:
  • Sending wrong emails before catching mistakes
  • Making purchases with incorrect amounts
  • Deleting data that shouldn’t be removed
With Verification:
  • Review agent work BEFORE execution
  • Approve or reject critical steps
  • Provide hints to guide agents
  • Prevent costly mistakes

Getting started

  1. Choose submission method - Email for quick tests, API for automation
  2. Prepare your task - Write clear task descriptions with expected outcomes
  3. Submit evaluation - Send via email or API with test mode and reviewer
  4. Verify checkpoints - Review and approve agent work at critical steps
  5. Track results - View metrics in the dashboard or via API (see the sketch below)
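
If you prefer the API over the dashboard for step 5, a polling loop along these lines could work. Note that the status endpoint and response fields shown are assumptions; only the creation endpoint is named on this page:

  import time
  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential
  TASK_ID = "task_123"                # ID returned when the evaluation was submitted

  # NOTE: this status path and the response fields are illustrative assumptions.
  while True:
      r = requests.get(
          f"{BASE_URL}/api/eval/tasks/{TASK_ID}",
          headers={"Authorization": f"Bearer {API_KEY}"},
          timeout=30,
      )
      r.raise_for_status()
      result = r.json()
      if result.get("status") in ("completed", "failed"):
          print("Metrics:", result.get("metrics"))
          break
      time.sleep(30)  # poll every 30 seconds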

Support

Questions about the evaluation system?
I