
What is the evaluation system?

The mixus evaluation system lets you test and measure AI agent performance, with optional human verification at critical checkpoints, so you can confirm that an agent is working correctly step by step before it executes actions.

How it works

The evaluation system accepts tasks via two methods:

Email Submission

Send tasks to agent@mixus.com with a subject line starting with “Eval:”
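As a rough sketch, you could automate the email route with standard Python. Only the recipient address and the “Eval:” subject prefix come from the requirement above; the sender, SMTP server, credentials, and body format are placeholders to replace with your own:

  import smtplib
  from email.message import EmailMessage

  # Build the evaluation request. Only the recipient address and the
  # "Eval:" subject prefix are documented; the body format is illustrative.
  msg = EmailMessage()
  msg["From"] = "you@example.com"          # placeholder sender
  msg["To"] = "agent@mixus.com"
  msg["Subject"] = "Eval: Reconcile March invoices against the billing export"
  msg.set_content(
      "Task: Reconcile March invoices against the billing export.\n"
      "Expected outcome: a list of mismatched invoice IDs."
  )

  # Send through your own outgoing mail server (host and credentials are placeholders).
  with smtplib.SMTP("smtp.example.com", 587) as smtp:
      smtp.starttls()
      smtp.login("you@example.com", "app-password")
      smtp.send_message(msg)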

API Submission

POST to /api/eval/create-task-agent for programmatic access
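A minimal programmatic submission might look like the sketch below. The endpoint path is the one named above; the base URL, authentication header, and payload field names are assumptions you would adapt to your account:

  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential

  # Payload field names are illustrative, not a documented schema.
  payload = {
      "task": "Reconcile March invoices against the billing export",
      "expected_outcome": "A list of mismatched invoice IDs",
  }

  response = requests.post(
      f"{BASE_URL}/api/eval/create-task-agent",
      json=payload,
      headers={"Authorization": f"Bearer {API_KEY}"},
      timeout=30,
  )
  response.raise_for_status()
  print(response.json())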

Process flow

  1. Task Submission - Submit via email or API
  2. Agent Creation - System converts task to executable agent
  3. Execution - Agent runs through steps using AI and tools
  4. Verification - Human reviews and approves critical steps
  5. Results - Metrics and outcomes returned

Test modes

With Verification (HITL)

Agent pauses at checkpoints for human approval. Ideal for:
  • Testing agent reliability
  • High-stakes operations
  • Learning agent behavior
Checkpoint Detection:
  • Auto-detect - AI determines optimal verification points
  • Manual - You specify exact checkpoints

Without Verification (Baseline)

Agent runs autonomously without human oversight. Ideal for:
  • Speed benchmarks
  • Comparing runs with and without human oversight
  • Low-risk tasks
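
To make the difference between the two modes concrete, the fragments below sketch how a submission payload might vary. Every field name here is a hypothetical illustration rather than the documented schema:

  # Hypothetical payload fragments; treat all field names as placeholders.

  with_verification_auto = {
      "test_mode": "hitl",            # pause at checkpoints for human approval
      "checkpoints": "auto",          # let the AI pick verification points
      "reviewer": "reviewer@example.com",
  }

  with_verification_manual = {
      "test_mode": "hitl",
      "checkpoints": ["draft_email", "send_email"],  # you name the exact steps
      "reviewer": "reviewer@example.com",
  }

  baseline = {
      "test_mode": "baseline",        # run autonomously, no human oversight
  }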

Key capabilities

AI Checkpoint Detection

Automatically identifies where verification is needed based on task risk

Email Integration

Submit tasks and receive notifications via email

API Access

Programmatic integration for automation and batch processing

Metrics Tracking

Detailed metrics on performance, cost, and human verification

Webhook Callbacks

Real-time notifications when evaluations complete
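
If you register a callback URL, a small HTTP handler like the sketch below could receive completion notifications. The payload fields it reads (task_id, status) are assumptions, since the callback schema isn't described on this page:

  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  class EvalWebhookHandler(BaseHTTPRequestHandler):
      # Receives evaluation-complete callbacks; the fields read below are assumptions.
      def do_POST(self):
          length = int(self.headers.get("Content-Length", 0))
          event = json.loads(self.rfile.read(length) or b"{}")
          print("Evaluation finished:", event.get("task_id"), event.get("status"))
          self.send_response(200)
          self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("0.0.0.0", 8080), EvalWebhookHandler).serve_forever()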

Dashboard View

Visual tracking of all evaluations at app.mixus.ai/eval

Use cases

Evaluation & Testing

Test AI agents systematically:
  • Run benchmark suites
  • Compare different approaches
  • Measure reliability improvements
  • Validate agent behavior

Quality Assurance

Ensure agents work correctly:
  • Verify calculations before execution
  • Review communications before sending
  • Approve transactions before processing
  • Validate data changes before committing

Benchmarking

Compare performance:
  • With vs without human oversight
  • Different models or configurations
  • Cost vs accuracy tradeoffs
  • Speed vs reliability metrics

Integration Testing

Test external integrations:
  • TheAgentCompany (TAC) benchmarks
  • Custom evaluation frameworks
  • CI/CD pipelines
  • Automated testing workflows
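
For CI or batch runs, a small driver script can submit a suite of tasks in a loop and collect the returned IDs. As in the earlier sketch, the base URL, authentication, and payload field names are assumptions:

  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential

  # A tiny benchmark suite; task text and field names are illustrative.
  suite = [
      {"task": "Summarize the Q1 sales report", "test_mode": "baseline"},
      {"task": "Summarize the Q1 sales report", "test_mode": "hitl"},
  ]

  task_ids = []
  for payload in suite:
      r = requests.post(
          f"{BASE_URL}/api/eval/create-task-agent",
          json=payload,
          headers={"Authorization": f"Bearer {API_KEY}"},
          timeout=30,
      )
      r.raise_for_status()
      task_ids.append(r.json().get("task_id"))  # "task_id" key is an assumption

  print("Submitted evaluations:", task_ids)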

Why human-in-the-loop?

Traditional AI evaluation measures final outcomes but misses critical failures during execution:
Without Verification:
  • Sending wrong emails before catching mistakes
  • Making purchases with incorrect amounts
  • Deleting data that shouldn’t be removed
With Verification:
  • Review agent work BEFORE execution
  • Approve or reject critical steps
  • Provide hints to guide agents
  • Prevent costly mistakes

Getting started

  1. Choose submission method - Email for quick tests, API for automation
  2. Prepare your task - Write clear task descriptions with expected outcomes
  3. Submit evaluation - Send via email or API with test mode and reviewer
  4. Verify checkpoints - Review and approve agent work at critical steps
  5. Track results - View metrics in the dashboard or via API (see the sketch below)
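
If you prefer the API over the dashboard for step 5, a polling loop along these lines could work. Note that the status endpoint and response fields shown are assumptions; only the creation endpoint is named on this page:

  import time
  import requests

  BASE_URL = "https://app.mixus.ai"   # assumed base URL
  API_KEY = "YOUR_API_KEY"            # placeholder credential
  TASK_ID = "task_123"                # ID returned when the evaluation was submitted

  # NOTE: this status path and the response fields are illustrative assumptions.
  while True:
      r = requests.get(
          f"{BASE_URL}/api/eval/tasks/{TASK_ID}",
          headers={"Authorization": f"Bearer {API_KEY}"},
          timeout=30,
      )
      r.raise_for_status()
      result = r.json()
      if result.get("status") in ("completed", "failed"):
          print("Metrics:", result.get("metrics"))
          break
      time.sleep(30)  # poll every 30 seconds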

Support

Questions about the evaluation system?
I