Writing effective tasks
Start with clear objectives
Define success criteria upfront:
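For example, the task text itself can state what a correct result looks like. A minimal sketch using requests; the endpoint, payload fields, and URL are assumptions for illustration, not the documented mixus API:

```python
import requests

API_KEY = "YOUR_API_KEY"  # assumes a key with the eval:create scope

# Hypothetical payload: the task states the objective and the success criteria
# the reviewer will check against.
task = {
    "task": (
        "Compare Q3 vs Q2 revenue from the attached spreadsheet. "
        "Success criteria: report the percentage change, cite the exact cells used, "
        "and flag any missing months."
    ),
    "reviewer": "analyst@example.com",
}

# Illustrative endpoint only; adjust to the real API.
resp = requests.post(
    "https://api.example.com/v1/evals",
    json=task,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
print(resp.json())
```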
Use progressive complexity
Start simple, then increase complexity:
1. Week 1: Simple tasks - Basic calculations, single-step operations
2. Week 2: Medium tasks - Research, multi-step workflows
3. Week 3: Complex tasks - Multiple tools, external integrations
4. Week 4: Production use - Full automation, batch processing
Checkpoint strategies
When to use auto-detect
Use auto-detect when:
- You’re new to the system
- Task involves external actions (emails, purchases)
- You don’t know optimal verification points
- Task complexity is high
When to use manual checkpoints
Use manual checkpoints when:
- Testing specific decision points
- You know exactly where verification is needed
- Task has well-defined stages
- Regulatory requirements dictate specific checks
Optimal checkpoint placement
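A common placement is just before external or irreversible actions (emails, purchases) and after a major research step. A sketch of manual placement; the checkpoints field and its shape are assumptions, not documented API:

```python
# Hypothetical manual checkpoint configuration: one checkpoint after the
# research phase, one immediately before the external (irreversible) action.
task = {
    "task": "Research the top 3 CRM vendors, then email a summary to the sales team",
    "checkpoints": [
        {"after": "research complete", "reason": "verify sources before drafting"},
        {"before": "send email", "reason": "approve the external communication"},
    ],
    "reviewer": "ops@example.com",
}
```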
Verification workflow tips
Responding to checkpoints
- approve - Agent continues to next step
- reject - Agent stops execution
- hint: [text] - Agent adjusts approach with your guidance
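The same three options can be sent programmatically. A hypothetical sketch with requests; the endpoint path, identifier, and field names are assumptions:

```python
import requests

API_KEY = "YOUR_API_KEY"
checkpoint_id = "chk_123"  # hypothetical checkpoint identifier

# One of: approve, reject, or hint (with guidance text). Field names are illustrative.
payload = {"action": "hint", "hint": "Use the 2024 pricing page, not the 2023 one"}

requests.post(
    f"https://api.example.com/v1/checkpoints/{checkpoint_id}/respond",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
```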
Using hints effectively
When to use hints
- Agent is close but needs adjustment
- You want to guide without rejecting
- Teaching agent better approaches
- Minor corrections needed
Good hint examples
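Good hints are specific and actionable. Illustrative examples (hypothetical wording):
- hint: Use the company's 2024 annual report, not the 2023 one
- hint: Format the summary as five bullet points, each with a source link
- hint: Exclude weekends from the date calculation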
Bad hint examples
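Bad hints give the agent nothing concrete to adjust. Illustrative examples (hypothetical wording):
- hint: wrong, try again
- hint: make it better
- hint: no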
Testing strategies
Baseline comparisons
Always run tasks both ways, with and without human verification, to measure the impact of oversight:
- Success rate
- Execution time
- Cost per task
- Error rate
- Quality scores
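A sketch of the comparison on the metrics above, assuming results from both runs are available; the field names and numbers are placeholders, not a documented schema:

```python
# Placeholder metrics for a baseline (no verification) run and a verified run.
baseline = {"success_rate": 0.82, "execution_time_s": 140, "cost_usd": 0.31, "error_rate": 0.12}
verified = {"success_rate": 0.95, "execution_time_s": 310, "cost_usd": 0.34, "error_rate": 0.03}

for metric in baseline:
    delta = verified[metric] - baseline[metric]
    print(f"{metric}: baseline={baseline[metric]} verified={verified[metric]} delta={delta:+.2f}")
```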
A/B testing checkpoints
Test different checkpoint strategies:
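One way to set this up is to run the same task once per strategy and compare the outcomes. A sketch, assuming a hypothetical helper that submits a task with a given checkpoint strategy and returns its metrics (not part of a documented SDK):

```python
def run_eval(task: str, checkpoints) -> dict:
    """Hypothetical helper: submit one evaluation with the given checkpoint
    strategy and return its metrics. Replace the body with a real API call."""
    return {"success": True, "cost_usd": 0.0, "duration_s": 0.0}  # stub

strategies = {
    "baseline": None,                       # no checkpoints
    "auto": "auto",                         # auto-detected checkpoints
    "manual": [{"before": "send email"}],   # one hand-placed checkpoint
}

results = {name: run_eval("Draft and send the weekly status email", cp)
           for name, cp in strategies.items()}
print(results)
```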
Performance optimization
Batch processing
Submit multiple tasks efficiently:
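A sketch of batch submission that sends tasks concurrently instead of one at a time; the endpoint and payload shape are assumptions:

```python
import concurrent.futures
import requests

API_KEY = "YOUR_API_KEY"
TASKS = [
    "Summarize support ticket #1042",
    "Summarize support ticket #1043",
    "Summarize support ticket #1044",
]

def submit(task: str) -> dict:
    # Illustrative endpoint; adjust to the real API.
    resp = requests.post(
        "https://api.example.com/v1/evals",
        json={"task": task},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    return resp.json()

# Submit in parallel so you are not waiting on each round-trip.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(submit, TASKS))
```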
Webhook vs polling
Use Webhooks
When: Long-running tasks (>5 min)
Benefits:
- No polling overhead
- Real-time notifications
- Lower API usage
Use Polling
When: Short tasks (<5 min)
Benefits:
- Simpler setup
- No server needed
- Direct control
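For short tasks, a simple polling loop is enough. A sketch; the status endpoint, identifier, and field names are assumptions:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"
eval_id = "eval_123"  # hypothetical evaluation identifier

# Poll the status endpoint until the task finishes (suited to tasks under ~5 min).
while True:
    result = requests.get(
        f"https://api.example.com/v1/evals/{eval_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    ).json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(10)  # back off between polls to keep API usage low
print(result)
```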
Optimizing task descriptions
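The pattern is the same as in the pitfalls section below: name the exact entities, sources, and output you want. For example:
- Vague: “Research AI companies”
- Precise: “Research OpenAI, Anthropic, Google DeepMind pricing”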
Cost optimization
Understanding costs
Evaluation costs include:
- AI model usage - Based on tokens processed
- Tool usage - External API calls (web search, integrations)
- Verification overhead - Human review time (no additional cost)
Reducing costs
1. Use baseline mode for simple tasks
Skip verification for low-risk tasks; this saves verification overhead and completes faster.
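A minimal sketch, assuming the evaluation request accepts a mode field (hypothetical field name) that disables checkpoints entirely:

```python
# Hypothetical payload: "baseline" mode runs the task with no human checkpoints.
task = {
    "task": "Convert this CSV of 50 prices from USD to EUR at a fixed rate of 0.92",
    "mode": "baseline",  # no verification checkpoints, lowest cost
}
```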
2. Optimize checkpoint count
Fewer checkpoints = lower cost:
- Auto-detect typically finds 1-3 checkpoints
- Manual: Only checkpoint truly critical steps
- Baseline: No checkpoints = lowest cost
3. Batch similar tasks
Group related tasks to reuse context:
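For example, related tasks batched together so they can reuse shared context (illustrative task list):

```python
# Tasks that all draw on the same Q3 report, grouped into one batch.
batch = [
    "Summarize section 1 of the Q3 report",
    "Summarize section 2 of the Q3 report",
    "Summarize section 3 of the Q3 report",
]
```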
4. Use precise task descriptions
Reduce unnecessary research and tool calls:
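An illustrative before/after: the precise version names the sources and output, so the agent skips open-ended research and extra tool calls:

```python
# Imprecise: forces broad research and many tool calls.
vague = "Find out how our competitors price their products"

# Precise: names the competitors, the data wanted, and the output format.
precise = (
    "List the published monthly price of the entry-level plan for "
    "Competitor A and Competitor B, with a link to each pricing page"
)
```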
Quality assurance
Validation strategies
1. Define expected outcomes - Specify what “correct” looks like before running
2. Run with verification first - Verify agent behavior manually before automation
3. Compare with baseline - Measure improvement from human oversight
4. Iterate on prompts - Refine task descriptions based on results
Tracking quality metrics
Monitor these metrics over time:
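A minimal sketch that aggregates exported results into the metrics listed under Baseline comparisons; the result fields are assumptions about the export format:

```python
# One dict per completed evaluation, e.g. loaded from an exported results file.
results = [
    {"success": True, "duration_s": 120, "cost_usd": 0.21},
    {"success": False, "duration_s": 95, "cost_usd": 0.18},
]

n = len(results)
success_rate = sum(r["success"] for r in results) / n
avg_duration = sum(r["duration_s"] for r in results) / n
avg_cost = sum(r["cost_usd"] for r in results) / n
print(f"success_rate={success_rate:.0%} avg_duration={avg_duration:.0f}s avg_cost=${avg_cost:.2f}")
```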
Team collaboration
Assigning reviewers
Match reviewers to task expertise:
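One simple routing approach, assuming the reviewer is set per evaluation; the mapping and categories are illustrative:

```python
# Map task categories to the reviewer with the matching expertise.
REVIEWERS = {
    "finance": "cfo@example.com",
    "customer_email": "support-lead@example.com",
    "research": "analyst@example.com",
}

def reviewer_for(category: str) -> str:
    # Fall back to a default reviewer for uncategorized tasks.
    return REVIEWERS.get(category, "ops@example.com")
```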
Review response times
Set expectations for checkpoint responses:
- Urgent: < 5 minutes (financial transactions, customer communications)
- Normal: < 1 hour (research, analysis, reports)
- Low Priority: < 24 hours (baseline tests, experiments)
Sharing results
Export evaluation results for team review:
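A sketch that fetches results and writes a CSV the team can open; the endpoint and result fields are assumptions:

```python
import csv
import requests

API_KEY = "YOUR_API_KEY"

# Hypothetical endpoint returning a list of completed evaluation results.
rows = requests.get(
    "https://api.example.com/v1/evals?status=completed",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
).json()

with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "task", "success", "cost_usd"])
    writer.writeheader()
    for r in rows:
        writer.writerow({k: r.get(k) for k in writer.fieldnames})
```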
Common pitfalls
❌ Vague task descriptions
Problem: “Research AI companies”
Solution: “Research OpenAI, Anthropic, Google DeepMind pricing”
❌ Too many checkpoints
Problem: 8 checkpoints for a simple calculation
Solution: 1-2 checkpoints, or use auto-detect
❌ No baseline comparison
Problem: Only testing with verification
Solution: Run the same task without verification for comparison
❌ Ignoring metrics
Problem: Not tracking success/failure rates
Solution: Monitor the dashboard and export metrics regularly
❌ Not using hints
Problem: Rejecting when a small adjustment would work
Solution: Use hint: to guide the agent to the correct approach
Advanced patterns
Chained evaluations
Run sequential tasks where the output of one feeds into the next:
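A sketch, assuming a hypothetical helper that runs one evaluation to completion and returns its output (not a documented SDK call):

```python
def run_eval(task: str) -> str:
    """Hypothetical helper: submit a task, wait for completion, return its output."""
    return f"[output of: {task}]"  # stub: replace with real submit + poll calls

# Step 1 research feeds step 2 drafting, which feeds step 3 summarizing.
companies = run_eval("List the top 5 project management tools with pricing")
draft = run_eval(f"Write a one-page comparison of these tools: {companies}")
summary = run_eval(f"Condense this comparison into 3 bullet points: {draft}")
```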
Conditional workflows
Run different tasks based on results:
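A sketch of branching on a previous result; the helper and result fields are illustrative:

```python
def run_eval(task: str) -> dict:
    """Hypothetical helper returning the finished evaluation, e.g. {"success": bool, "output": str}."""
    return {"success": True, "output": "stub"}  # replace with a real API call

audit = run_eval("Audit the March expense report for policy violations")

# Branch: escalate violations, otherwise file a routine confirmation.
if audit["success"] and "violation" in audit["output"].lower():
    run_eval("Draft an escalation email to finance summarizing the violations found")
else:
    run_eval("File a short note confirming the March report passed the audit")
```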
Template-based tasks
Create reusable task templates:
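Templates can be plain format strings reused across runs. A minimal sketch with illustrative placeholders:

```python
# Reusable task template: fill in the variable parts for each run.
WEEKLY_REPORT = (
    "Summarize {metric} for the week of {week_start}, "
    "compare it with the previous week, and flag changes larger than {threshold}%"
)

task = WEEKLY_REPORT.format(metric="sign-ups", week_start="2024-07-01", threshold=10)
```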
Troubleshooting
Task fails immediately
1. Check the task description - Ensure it’s clear and actionable
2. Verify the reviewer exists - The reviewer must be a registered mixus user
3. Check API key permissions - The key needs the eval:create scope
4. Review the error message - Look in the response or dashboard for details
Checkpoint not triggering
Possible causes:
- Auto-detect didn’t identify the action as high-risk
- Task description too vague
- Action type not in checkpoint criteria
Slow verification responses
Tips for faster reviews:
- Use mobile notifications
- Set up Slack alerts
- Assign backup reviewers
- Use webhooks for real-time notifications

