Writing effective tasks
Start with clear objectives
Define success criteria upfront:
Calculate 15% commission on $50,000. Expected result: $7,500
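For example, the success criterion can travel with the task itself (a sketch using the request fields shown throughout this guide), so reviewers know exactly what to check:

task = {
    "taskName": "Commission Calculation",
    "taskDescription": "Calculate 15% commission on $50,000. Expected result: $7,500",
    "testMode": "with-verification",
    "autoDetectCheckpoints": True,
}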
Use progressive complexity
Start simple, then ramp up difficulty:
Week 1 - Simple tasks: basic calculations, single-step operations
Week 2 - Medium tasks: research, multi-step workflows
Week 3 - Complex tasks: multiple tools, external integrations
Week 4 - Production use: full automation, batch processing
Checkpoint strategies
When to use auto-detect
Use auto-detect when:
You’re new to the system
Task involves external actions (emails, purchases)
You don’t know optimal verification points
Task complexity is high
Example:
{
  "autoDetectCheckpoints": true,
  "testMode": "with-verification"
}
When to use manual checkpoints
Use manual checkpoints when:
Testing specific decision points
You know exactly where verification is needed
Task has well-defined stages
Regulatory requirements dictate specific checks
Example:
{
  "checkpoints": [
    { "stage": "calculation", "description": "Verify math" },
    { "stage": "submission", "description": "Approve sending" }
  ],
  "testMode": "with-verification"
}
Optimal checkpoint placement
2-4 checkpoints is ideal for most tasks
Too few (0-1): May miss critical verification points
Just right (2-4): Balance oversight with efficiency
Too many (5+): Slows down evaluation, verification fatigue
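As a sketch, here is a three-checkpoint configuration for a research-calculate-send workflow, reusing the stage/description fields from the manual example above (the stage names are illustrative, not a fixed vocabulary):

task = {
    "taskName": "Commission Report",
    "taskDescription": "Research rates, calculate commissions, email the report",
    "checkpoints": [
        {"stage": "research", "description": "Confirm data sources"},
        {"stage": "calculation", "description": "Verify math"},
        {"stage": "submission", "description": "Approve sending the email"},
    ],
    "testMode": "with-verification",
}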
Verification workflow tips
Responding to checkpoints
approve - Agent continues to the next step
reject - Agent stops execution
hint: [text] - Agent adjusts approach with your guidance
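Checkpoint responses are normally given in the dashboard. If you respond programmatically, the sketch below shows the general shape of such a call; the endpoint and payload fields are assumptions for illustration, not a documented API:

import requests

# Hypothetical endpoint and fields - check the API reference for the real ones
response = requests.post(
    "https://app.mixus.ai/api/eval/checkpoint-response",  # assumed URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "executionId": execution_id,   # from task submission
        "action": "hint",              # or "approve" / "reject"
        "hint": "Round to 2 decimal places",
    },
)
print(response.status_code)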
Using hints effectively
Use hints when:
Agent is close but needs adjustment
You want to guide without rejecting
Teaching agent better approaches
Minor corrections needed
Good hints:
hint: Use CoinMarketCap instead of CoinGecko for more accurate prices
hint: Include all team members in the CC field
hint: Round to 2 decimal places
hint: Use the Alternative Simplified Credit method, not regular method
Vague hints to avoid:
hint: Do it better
hint: Wrong
hint: Not that
Be specific about what to change!
Testing strategies
Baseline comparisons
Always run tasks both ways to measure the impact of human oversight:
# Run 1: With verification
{
  "taskName": "Task A - With Human",
  "testMode": "with-verification",
  "autoDetectCheckpoints": true
}

# Run 2: Without verification (baseline)
{
  "taskName": "Task A - Baseline",
  "testMode": "without-verification"
}
Metrics to compare:
Success rate
Execution time
Cost per task
Error rate
Quality scores
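A minimal sketch for comparing the two runs, assuming a get_execution_results helper like the one used in the chained-evaluation example later in this guide (the field names below are illustrative, not a documented schema):

def compare_runs(baseline_id, verified_id):
    # get_execution_results is assumed; metric field names are illustrative
    base = get_execution_results(baseline_id)
    verified = get_execution_results(verified_id)
    for metric in ("success", "duration_seconds", "cost"):
        print(f"{metric}: baseline={base.get(metric)}, "
              f"with-verification={verified.get(metric)}")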
A/B testing checkpoints
Test different checkpoint strategies:
# Version A: Auto-detect
{
  "taskName": "Test - Auto Checkpoints",
  "autoDetectCheckpoints": true
}

# Version B: Manual
{
  "taskName": "Test - Manual Checkpoints",
  "checkpoints": [/* specific points */]
}

# Version C: No checkpoints
{
  "taskName": "Test - No Checkpoints",
  "testMode": "without-verification"
}
Compare which approach gives the best results for your use case.
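A sketch that submits all three variants in one pass, reusing the submit_task helper defined in the batch-processing example below (executionId is the field used elsewhere in this guide):

variants = [
    {"taskName": "Test - Auto Checkpoints",
     "testMode": "with-verification",
     "autoDetectCheckpoints": True},
    {"taskName": "Test - Manual Checkpoints",
     "testMode": "with-verification",
     "checkpoints": [{"stage": "calculation", "description": "Verify math"}]},
    {"taskName": "Test - No Checkpoints",
     "testMode": "without-verification"},
]
execution_ids = [submit_task(v)["executionId"] for v in variants]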
Batch processing
Submit multiple tasks efficiently:
import concurrent.futures
import requests

def submit_task(task_data):
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=task_data,
    )
    return response.json()

# Submit 10 tasks in parallel
tasks = [create_task_data(i) for i in range(10)]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(submit_task, tasks))
Rate limits: Max 10 concurrent evaluations per organization
Webhook vs polling
Use webhooks when: long-running tasks (>5 min). Benefits:
No polling overhead
Real-time notifications
Lower API usage
Use polling when: short tasks (<5 min). Benefits:
Simpler setup
No server needed
Direct control
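A minimal polling sketch, which also defines the wait_for_completion helper used in the chained-evaluation example below. The status endpoint and response fields here are assumptions for illustration; check the API reference for the real ones:

import time
import requests

def wait_for_completion(execution_id, interval=10, timeout=600):
    """Poll until the execution finishes. Endpoint and fields are assumed."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(
            f"https://app.mixus.ai/api/eval/executions/{execution_id}",  # assumed URL
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        status = r.json().get("status")  # assumed field
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Execution {execution_id} did not finish in {timeout}s")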
Optimizing task descriptions
Shorter is faster: agent processing time increases with description length. Be concise but clear.
Before (slow):
Research all available information about the top three artificial intelligence agent platforms currently available in the market, including detailed pricing information for all tiers and plans, comprehensive lists of features and capabilities, target market analysis, and competitive positioning...
After (fast):
Research the top 3 AI agent platforms: pricing tiers, key features, target market.
Cost optimization
Understanding costs
Evaluation costs include:
AI model usage - Based on tokens processed
Tool usage - External API calls (web search, integrations)
Verification overhead - Human review time (no additional cost)
Reducing costs
1. Use baseline mode for simple tasks
Skip verification for low-risk tasks: { "testMode": "without-verification" }
Saves verification overhead, completes faster.
2. Optimize checkpoint count
Fewer checkpoints = lower cost:
Auto-detect typically finds 1-3 checkpoints
Manual: Only checkpoint truly critical steps
Baseline: No checkpoints = lowest cost
3. Batch similar tasks
Group related tasks to reuse context:
[
  { "taskName": "Calc 1", "description": "15% of $50k" },
  { "taskName": "Calc 2", "description": "15% of $75k" },
  { "taskName": "Calc 3", "description": "15% of $100k" }
]
4. Use precise task descriptions
Reduce unnecessary research and tool calls:
// More expensive
{ "description": "Find crypto prices and calculate" }
// Less expensive
{ "description": "Calculate: 0.5 BTC × $42,000 + 10 ETH × $2,200" }
Quality assurance
Validation strategies
Define expected outcomes
Specify what “correct” looks like before running
Run with verification first
Verify agent behavior manually before automation
Compare with baseline
Measure improvement from human oversight
Iterate on prompts
Refine task descriptions based on results
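As a sketch of the first two strategies, define the expected outcome in code and check the agent's output against it (get_execution_results and its output field are assumptions, as in this guide's other examples):

# Define the expected outcome before running, then validate the result
expected = 7500.0  # 15% commission on $50,000
result = get_execution_results(execution_id)  # helper assumed
actual = float(result["output"])              # "output" field assumed
assert abs(actual - expected) < 0.01, f"Expected {expected}, got {actual}"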
Tracking quality metrics
Monitor these metrics over time:
metrics = {
    "success_rate": 0.95,    # % of successful completions
    "approval_rate": 0.92,   # % of checkpoints approved
    "rejection_rate": 0.05,  # % of checkpoints rejected
    "hint_rate": 0.03,       # % of checkpoints needing hints
    "avg_duration": 180,     # Average seconds to complete
    "avg_cost": 2.50,        # Average cost per task
}
Target metrics:
Success rate: >90%
Approval rate: >85%
Rejection rate: <10%
Hint rate: <15%
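A small sketch that flags any metric falling outside these targets, using the metrics dict above:

# Targets: (threshold, higher_is_better)
targets = {
    "success_rate": (0.90, True),
    "approval_rate": (0.85, True),
    "rejection_rate": (0.10, False),
    "hint_rate": (0.15, False),
}
for name, (threshold, higher_is_better) in targets.items():
    value = metrics[name]
    ok = value > threshold if higher_is_better else value < threshold
    if not ok:
        print(f"WARNING: {name}={value} misses target "
              f"({'>' if higher_is_better else '<'}{threshold})")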
Team collaboration
Assigning reviewers
Match reviewers to task expertise:
{
  "financial_tasks": "cfo@company.com",
  "marketing_tasks": "cmo@company.com",
  "technical_tasks": "cto@company.com",
  "legal_tasks": "legal@company.com"
}
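A sketch of routing by task type when building payloads; the reviewerEmail field is hypothetical, since this guide does not document how a reviewer is attached to a task:

REVIEWERS = {
    "financial": "cfo@company.com",
    "marketing": "cmo@company.com",
    "technical": "cto@company.com",
    "legal": "legal@company.com",
}

def build_task(task_type, name, description):
    return {
        "taskName": name,
        "taskDescription": description,
        "testMode": "with-verification",
        "reviewerEmail": REVIEWERS[task_type],  # field name is hypothetical
    }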
Review response times
Set expectations for checkpoint responses:
Urgent (< 5 minutes): financial transactions, customer communications
Normal (< 1 hour): research, analysis, reports
Low priority (< 24 hours): baseline tests, experiments
Sharing results
Export evaluation results for team review:
# Get all evaluation results
curl https://app.mixus.ai/api/eval/results \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY" \
  > team_eval_results.json
# Share with team
Common pitfalls
❌ Vague task descriptions
Problem: “Research AI companies”
Solution: “Research OpenAI, Anthropic, Google DeepMind pricing”
❌ Too many checkpoints
Problem: 8 checkpoints for a simple calculation
Solution: 1-2 checkpoints, or use auto-detect
❌ No baseline comparison
Problem: Only testing with verification
Solution: Run the same task without verification for comparison
❌ Ignoring metrics
Problem: Not tracking success/failure rates
Solution: Monitor the dashboard, export metrics regularly
❌ Rejecting instead of hinting
Problem: Rejecting when a small adjustment would work
Solution: Use hint: to guide the agent to the correct approach
Advanced patterns
Chained evaluations
Run sequential tasks where output of one feeds into next:
# Task 1: Research
result1 = submit_task({
    "taskName": "Research Phase",
    "taskDescription": "Research competitor pricing"
})
execution_id_1 = result1["executionId"]
wait_for_completion(execution_id_1)

# Get Task 1 results
results_1 = get_execution_results(execution_id_1)

# Task 2: Analysis (using Task 1 results)
result2 = submit_task({
    "taskName": "Analysis Phase",
    "taskDescription": f"Analyze pricing data: {results_1['output']}"
})
Conditional workflows
Run different tasks based on results:
result = submit_task({
    "taskName": "Check Inventory",
    "taskDescription": "Check if Product X has inventory > 100"
})
wait_for_completion(result["executionId"])
inventory_result = get_results(result["executionId"])

if inventory_result["inventory"] > 100:
    # High inventory - run discount campaign task
    submit_task({
        "taskName": "Discount Campaign",
        "taskDescription": "Create 20% discount campaign for Product X"
    })
else:
    # Low inventory - run restock task
    submit_task({
        "taskName": "Restock Alert",
        "taskDescription": "Send restock alert to procurement team"
    })
Template-based tasks
Create reusable task templates:
TASK_TEMPLATES = {
    "commission_calc": {
        "taskName": "Commission Calculation",
        "taskDescription": "Calculate {rate}% commission on ${amount}",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification"
    },
    "competitor_research": {
        "taskName": "Competitor Research",
        "taskDescription": "Research {company} pricing and features",
        "autoDetectCheckpoints": True,
        "testMode": "with-verification"
    }
}

# Use template
task = TASK_TEMPLATES["commission_calc"].copy()
task["taskDescription"] = task["taskDescription"].format(
    rate=15,
    amount=50000
)
submit_task(task)
Troubleshooting
Task creation fails
Check task description
Ensure it’s clear and actionable
Verify reviewer exists
Reviewer must be a registered mixus user
Check API key permissions
Needs the eval:create scope
Review error message
Look in the response or dashboard for details
Checkpoint not triggering
Possible causes:
Auto-detect didn’t identify action as high-risk
Task description too vague
Action type not in checkpoint criteria
Solution: Use manual checkpoints to force verification at specific points.
Slow verification responses
Tips for faster reviews:
Use mobile notifications
Set up Slack alerts
Assign backup reviewers
Use webhooks for real-time notifications
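A minimal webhook receiver sketch using Flask; the event type and payload fields shown are assumptions, since this guide does not specify the webhook schema:

from flask import Flask, request

app = Flask(__name__)

@app.route("/mixus-webhook", methods=["POST"])
def handle_event():
    event = request.get_json()
    # Event type and field names below are assumed for illustration
    if event.get("type") == "checkpoint.pending":
        notify_reviewer(event.get("executionId"), event.get("description"))
    return "", 204

def notify_reviewer(execution_id, description):
    # Stub: push to Slack, email, or mobile here
    print(f"Checkpoint pending on {execution_id}: {description}")

if __name__ == "__main__":
    app.run(port=8080)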