What is the evaluation system?
The mixus evaluation system enables you to test and measure AI agent performance, with optional human verification at critical checkpoints. This approach lets you confirm that agents work correctly step by step before actions are executed.

How it works
The evaluation system accepts tasks via two methods:

Email Submission
Send tasks to agent@mixus.com with a subject line starting with “Eval:”.

API Submission
POST to /api/eval/create-task-agent for programmatic access.
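For programmatic submission, a request might look like the sketch below. The endpoint path comes from this page; the base URL, authentication scheme, and payload field names (task, testMode, reviewerEmail) are assumptions, so check the API Reference for the exact contract.

```typescript
// Hedged sketch of an API submission (Node 18+, built-in fetch).
// Base URL, auth header, and field names are illustrative assumptions.
const API_KEY = process.env.MIXUS_API_KEY; // created at app.mixus.ai/integrations/api-keys

async function submitEvalTask(): Promise<unknown> {
  const response = await fetch("https://app.mixus.ai/api/eval/create-task-agent", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`, // auth scheme assumed
    },
    body: JSON.stringify({
      task: "Summarize the attached report and email the summary to the team",
      testMode: "with_verification",         // assumed value; see Test modes below
      reviewerEmail: "reviewer@example.com", // human who approves checkpoints
    }),
  });
  if (!response.ok) {
    throw new Error(`Submission failed with status ${response.status}`);
  }
  return response.json(); // assumed to include an evaluation/task ID
}

submitEvalTask().then((result) => console.log(result));
```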
Process flow
- Task Submission - Submit via email or API
- Agent Creation - System converts task to executable agent
- Execution - Agent runs through steps using AI and tools
- Verification - Human reviews and approves critical steps
- Results - Metrics and outcomes returned
Test modes
With Verification (HITL)
Agent pauses at checkpoints for human approval. Ideal for:
- Testing agent reliability
- High-stakes operations
- Learning agent behavior

Checkpoints can be set in two ways:
- Auto-detect - AI determines optimal verification points
- Manual - You specify exact checkpoints
Without Verification (Baseline)
Agent runs autonomously without human oversight. Ideal for:
- Speed benchmarks
- Comparing runs with and without human oversight
- Low-risk tasks
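The request bodies below sketch how these two modes might be expressed when submitting via the API. The field names and accepted values (testMode, checkpointMode, checkpoints) are illustrative assumptions, not taken from this page.

```typescript
// Illustrative request bodies for the two test modes.
// Field names and values are assumptions; consult the API Reference.

// With verification (HITL), auto-detected checkpoints
const hitlAutoDetect = {
  task: "Issue a $42.10 refund for the customer's most recent order",
  testMode: "with_verification",
  checkpointMode: "auto",                 // AI picks verification points
  reviewerEmail: "reviewer@example.com",
};

// With verification (HITL), manually specified checkpoints
const hitlManual = {
  task: "Issue a $42.10 refund for the customer's most recent order",
  testMode: "with_verification",
  checkpointMode: "manual",
  checkpoints: ["before processing the refund"], // assumed shape
  reviewerEmail: "reviewer@example.com",
};

// Without verification (baseline), the agent runs end to end on its own
const baseline = {
  task: "Issue a $42.10 refund for the customer's most recent order",
  testMode: "without_verification",
};

console.log({ hitlAutoDetect, hitlManual, baseline });
```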
Key capabilities
AI Checkpoint Detection
Automatically identifies where verification is needed based on task risk
Email Integration
Submit tasks and receive notifications via email
API Access
Programmatic integration for automation and batch processing
Metrics Tracking
Detailed metrics on performance, cost, and human verification
Webhook Callbacks
Real-time notifications when evaluations complete
Dashboard View
Visual tracking of all evaluations at app.mixus.ai/eval
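For the webhook callbacks mentioned above, a receiving endpoint might look like the sketch below. The callback payload shape (taskId, status, metrics) and delivery details are assumptions; the real schema is in the API Reference.

```typescript
// Minimal webhook receiver sketch using Node's built-in http module.
// The payload fields (taskId, status, metrics) are assumed, not documented here.
import { createServer } from "node:http";

const server = createServer((req, res) => {
  if (req.method !== "POST" || req.url !== "/mixus/eval-callback") {
    res.writeHead(404).end();
    return;
  }
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    const event = JSON.parse(body); // assumed JSON body
    console.log(`Evaluation ${event.taskId} finished with status: ${event.status}`);
    console.log("Metrics:", event.metrics); // e.g. performance, cost, verification counts
    res.writeHead(200).end("ok");
  });
});

server.listen(3000, () => console.log("Listening for evaluation callbacks on :3000"));
```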
Use cases
Evaluation & Testing
Test AI agents systematically:
- Run benchmark suites
- Compare different approaches
- Measure reliability improvements
- Validate agent behavior
Quality Assurance
Ensure agents work correctly:
- Verify calculations before execution
- Review communications before sending
- Approve transactions before processing
- Validate data changes before committing
Benchmarking
Compare performance:
- With vs without human oversight
- Different models or configurations
- Cost vs accuracy tradeoffs
- Speed vs reliability metrics
Integration Testing
Test external integrations:
- TheAgentCompany (TAC) benchmarks
- Custom evaluation frameworks
- CI/CD pipelines
- Automated testing workflows
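As a sketch of how a CI step might submit a benchmark suite in batch, the snippet below reuses the assumed endpoint and payload fields from the API example above; the task list, baseline mode, and error handling are illustrative.

```typescript
// Hedged sketch of a batch submission for a CI job (Node 18+, built-in fetch).
// Endpoint path is from this page; payload fields and auth scheme are assumed.
const API_KEY = process.env.MIXUS_API_KEY;

async function submitTask(task: string): Promise<unknown> {
  const res = await fetch("https://app.mixus.ai/api/eval/create-task-agent", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ task, testMode: "without_verification" }), // baseline for speed
  });
  if (!res.ok) throw new Error(`Submission failed for "${task}": ${res.status}`);
  return res.json();
}

const benchmarkSuite = [
  "Extract the totals from the attached invoice and compare them to the ledger",
  "Draft a status update email from the latest sprint notes",
  "Reconcile the Q3 expense report against the submitted receipts",
];

// Submit the whole suite in parallel and fail the CI step if any submission errors.
Promise.all(benchmarkSuite.map(submitTask))
  .then((results) => console.log(`Submitted ${results.length} benchmark tasks`))
  .catch((err) => {
    console.error(err);
    process.exit(1);
  });
```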
Why human-in-the-loop?
Traditional AI evaluation measures final outcomes but misses critical failures during execution.

Without Verification:
- Sending wrong emails before catching mistakes
- Making purchases with incorrect amounts
- Deleting data that shouldn’t be removed
With Verification:
- Review agent work BEFORE execution
- Approve or reject critical steps
- Provide hints to guide agents
- Prevent costly mistakes
Getting started
1. Choose submission method - Email for quick tests, API for automation
2. Prepare your task - Write clear task descriptions with expected outcomes
3. Submit evaluation - Send via email or API with test mode and reviewer
4. Verify checkpoints - Review and approve agent work at critical steps
5. Track results - View metrics in the dashboard or via API
Next steps
Quick Start
Submit your first evaluation task
Task Preparation
Learn how to write effective tasks
API Reference
Complete API documentation
Examples
Ready-to-use example tasks
Support
Questions about the evaluation system?
- Email: support@mixus.ai
- Dashboard: app.mixus.ai/eval
- API Keys: app.mixus.ai/integrations/api-keys