
Writing effective tasks

Start with clear objectives

Define success criteria upfront:
Calculate 15% commission on $50,000. Expected result: $7,500
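For example, the success criterion can go straight into the task payload. A minimal Python sketch (the field names follow the examples later in this guide; the task name is made up):
# Success criterion stated directly in the task description
task = {
    "taskName": "Commission Check",
    "taskDescription": "Calculate 15% commission on $50,000. Expected result: $7,500",
    "testMode": "with-verification",
    "autoDetectCheckpoints": True,
}
# Submit it to the create-task-agent endpoint (see the batch-processing example below)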

Use progressive complexity

Start simple, then increase complexity:
  1. Week 1: Simple tasks - Basic calculations, single-step operations
  2. Week 2: Medium tasks - Research, multi-step workflows
  3. Week 3: Complex tasks - Multiple tools, external integrations
  4. Week 4: Production use - Full automation, batch processing

Checkpoint strategies

When to use auto-detect

Use auto-detect when:
  • You’re new to the system
  • Task involves external actions (emails, purchases)
  • You don’t know optimal verification points
  • Task complexity is high
Example:
{
  "autoDetectCheckpoints": true,
  "testMode": "with-verification"
}

When to use manual checkpoints

Use manual checkpoints when:
  • Testing specific decision points
  • You know exactly where verification is needed
  • Task has well-defined stages
  • Regulatory requirements dictate specific checks
Example:
{
  "checkpoints": [
    {"stage": "calculation", "description": "Verify math"},
    {"stage": "submission", "description": "Approve sending"}
  ],
  "testMode": "with-verification"
}

Optimal checkpoint placement

2-4 checkpoints are ideal for most tasks
  • Too few (0-1): May miss critical verification points
  • Just right (2-4): Balance oversight with efficiency
  • Too many (5+): Slows down evaluation, verification fatigue

Verification workflow tips

Responding to checkpoints

  • approve - Agent continues to the next step
  • reject - Agent stops execution
  • hint: [text] - Agent adjusts its approach with your guidance

Using hints effectively

Use hints when:
  • The agent is close but needs adjustment
  • You want to guide without rejecting
  • You’re teaching the agent better approaches
  • Minor corrections are needed

Good hints are specific and actionable:
hint: Use CoinMarketCap instead of CoinGecko for more accurate prices
hint: Include all team members in the CC field
hint: Round to 2 decimal places
hint: Use the Alternative Simplified Credit method, not the regular method

Vague hints give the agent nothing to act on:
hint: Do it better
hint: Wrong
hint: Not that

Be specific about what to change!

Testing strategies

Baseline comparisons

Always run tasks both ways to measure the impact of human verification:
# Run 1: With verification
{
  "taskName": "Task A - With Human",
  "testMode": "with-verification",
  "autoDetectCheckpoints": true
}

# Run 2: Without verification (baseline)
{
  "taskName": "Task A - Baseline",
  "testMode": "without-verification"
}
Metrics to compare:
  • Success rate
  • Execution time
  • Cost per task
  • Error rate
  • Quality scores
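A minimal Python sketch of the two-run setup, assuming the submit_task helper defined in the batch-processing example below:
# Two runs of the same task: one with human verification, one baseline
base_description = "Calculate 15% commission on $50,000. Expected result: $7,500"

with_human = {
    "taskName": "Task A - With Human",
    "taskDescription": base_description,
    "testMode": "with-verification",
    "autoDetectCheckpoints": True,
}
baseline = {
    "taskName": "Task A - Baseline",
    "taskDescription": base_description,
    "testMode": "without-verification",
}

# Submit both, then compare the metrics above across the two executions
execution_ids = [submit_task(run)["executionId"] for run in (with_human, baseline)]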

A/B testing checkpoints

Test different checkpoint strategies:
# Version A: Auto-detect
{
  "taskName": "Test - Auto Checkpoints",
  "autoDetectCheckpoints": true
}

# Version B: Manual
{
  "taskName": "Test - Manual Checkpoints",
  "checkpoints": [/* specific points */]
}

# Version C: No checkpoints
{
  "taskName": "Test - No Checkpoints",
  "testMode": "without-verification"
}
Compare which approach gives the best results for your use case.

Performance optimization

Batch processing

Submit multiple tasks efficiently:
import concurrent.futures
import os
import requests

# Example: read the evaluation API key from an environment variable
API_KEY = os.environ["MIXUS_EVAL_API_KEY"]

def create_task_data(i):
    # Minimal example payload; adjust the fields for your own tasks
    return {
        "taskName": f"Batch task {i}",
        "taskDescription": f"Calculate 15% commission on ${(i + 1) * 10000}",
        "testMode": "without-verification",
    }

def submit_task(task_data):
    response = requests.post(
        "https://app.mixus.ai/api/eval/create-task-agent",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=task_data,
    )
    response.raise_for_status()
    return response.json()

# Submit 10 tasks in parallel
tasks = [create_task_data(i) for i in range(10)]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(submit_task, tasks))
Rate limits: Max 10 concurrent evaluations per organization

Webhook vs polling

Use Webhooks

When: Long-running tasks (>5 min)
Benefits:
  • No polling overhead
  • Real-time notifications
  • Lower API usage

Use Polling

When: Short tasks (<5 min)
Benefits:
  • Simpler setup
  • No server needed
  • Direct control
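For polling, here is a minimal sketch of a wait_for_completion helper (used in the chained-evaluation example later). The query parameter, status field, and its values are assumptions about the results endpoint, so check the API reference for the actual route and response shape:
import os
import time
import requests

API_KEY = os.environ["MIXUS_EVAL_API_KEY"]  # example: key from an environment variable

def fetch_results(execution_id):
    # Assumption: the results endpoint can be filtered by execution ID
    response = requests.get(
        "https://app.mixus.ai/api/eval/results",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"executionId": execution_id},
    )
    response.raise_for_status()
    return response.json()

def wait_for_completion(execution_id, poll_interval=15, timeout=1800):
    # Poll until the evaluation reaches a terminal state or the timeout is hit
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = fetch_results(execution_id)
        if result.get("status") in ("completed", "failed"):  # assumed field and values
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"Evaluation {execution_id} did not finish within {timeout}s")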

Optimizing task descriptions

Shorter is faster. Agent processing time increases with description length, so be concise but clear.
Too verbose:
Research all available information about the top three artificial intelligence 
agent platforms currently available in the market, including detailed pricing 
information for all tiers and plans, comprehensive lists of features and 
capabilities, target market analysis, and competitive positioning...
Tighter: Research pricing, features, and market positioning for the top three AI agent platforms.

Cost optimization

Understanding costs

Evaluation costs include:
  • AI model usage - Based on tokens processed
  • Tool usage - External API calls (web search, integrations)
  • Verification overhead - Human review time (no additional cost)

Reducing costs

Skip verification for low-risk tasks:
{"testMode": "without-verification"}
Saves verification overhead, completes faster.
Fewer checkpoints = lower cost:
  • Auto-detect typically finds 1-3 checkpoints
  • Manual: Only checkpoint truly critical steps
  • Baseline: No checkpoints = lowest cost
Group related tasks to reuse context:
[
  {"taskName": "Calc 1", "description": "15% of $50k"},
  {"taskName": "Calc 2", "description": "15% of $75k"},
  {"taskName": "Calc 3", "description": "15% of $100k"}
]
Reduce unnecessary research and tool calls:
// More expensive
{"description": "Find crypto prices and calculate"}

// Less expensive
{"description": "Calculate: 0.5 BTC × $42,000 + 10 ETH × $2,200"}

Quality assurance

Validation strategies

  1. Define expected outcomes - Specify what “correct” looks like before running
  2. Run with verification first - Verify agent behavior manually before automation
  3. Compare with baseline - Measure improvement from human oversight
  4. Iterate on prompts - Refine task descriptions based on results

Tracking quality metrics

Monitor these metrics over time:
metrics = {
    "success_rate": 0.95,      # % of successful completions
    "approval_rate": 0.92,     # % of checkpoints approved
    "rejection_rate": 0.05,    # % of checkpoints rejected
    "hint_rate": 0.03,         # % of checkpoints needing hints
    "avg_duration": 180,       # Average seconds to complete
    "avg_cost": 2.50,          # Average cost per task
}
Target metrics:
  • Success rate: >90%
  • Approval rate: >85%
  • Rejection rate: <10%
  • Hint rate: <15%
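A small Python sketch that flags metrics falling outside these targets (a local check, not part of the evaluation API; the metric names mirror the dict above):
# Floors for metrics that should stay high, ceilings for ones that should stay low
TARGET_FLOORS = {"success_rate": 0.90, "approval_rate": 0.85}
TARGET_CEILINGS = {"rejection_rate": 0.10, "hint_rate": 0.15}

def metric_issues(metrics):
    issues = []
    for name, floor in TARGET_FLOORS.items():
        value = metrics.get(name, 0.0)
        if value < floor:
            issues.append(f"{name} below target: {value:.2f} < {floor:.2f}")
    for name, ceiling in TARGET_CEILINGS.items():
        value = metrics.get(name, 0.0)
        if value > ceiling:
            issues.append(f"{name} above limit: {value:.2f} > {ceiling:.2f}")
    return issues

print(metric_issues(metrics))  # using the metrics dict from the example above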

Team collaboration

Assigning reviewers

Match reviewers to task expertise:
{
  "financial_tasks": "finance-lead@example.com",
  "marketing_tasks": "marketing-lead@example.com",
  "technical_tasks": "engineering-lead@example.com",
  "legal_tasks": "legal-lead@example.com"
}
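A small routing sketch in Python; the "reviewer" field name on the task payload is an assumption for illustration, so check the API reference for the actual parameter:
REVIEWERS = {
    "financial_tasks": "finance-lead@example.com",
    "marketing_tasks": "marketing-lead@example.com",
    "technical_tasks": "engineering-lead@example.com",
    "legal_tasks": "legal-lead@example.com",
}

def assign_reviewer(task, category):
    # Return a copy of the task payload with the matching reviewer attached.
    # "reviewer" is a hypothetical field name used for illustration only.
    updated = dict(task)
    updated["reviewer"] = REVIEWERS[category]
    return updated

task = assign_reviewer(
    {"taskName": "Commission Audit", "taskDescription": "Calculate 15% commission on $50,000"},
    "financial_tasks",
)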

Review response times

Set expectations for checkpoint responses:

  • Urgent: < 5 minutes - Financial transactions, customer communications
  • Normal: < 1 hour - Research, analysis, reports
  • Low priority: < 24 hours - Baseline tests, experiments

Sharing results

Export evaluation results for team review:
# Get all evaluation results
curl https://app.mixus.ai/api/eval/results \
  -H "Authorization: Bearer mxs_eval_YOUR_KEY" \
  > team_eval_results.json

# Share with team

Common pitfalls

Avoid these mistakes

Problem: “Research AI companies”
Solution: “Research OpenAI, Anthropic, Google DeepMind pricing”

Problem: 8 checkpoints for a simple calculation
Solution: 1-2 checkpoints, or use auto-detect

Problem: Only testing with verification
Solution: Run the same task without verification for comparison

Problem: Not tracking success/failure rates
Solution: Monitor the dashboard and export metrics regularly

Problem: Rejecting when a small adjustment would work
Solution: Use hint: to guide the agent to the correct approach

Advanced patterns

Chained evaluations

Run sequential tasks where output of one feeds into next:
# Task 1: Research
result1 = submit_task({
    "taskName": "Research Phase",
    "taskDescription": "Research competitor pricing"
})

execution_id_1 = result1["executionId"]
wait_for_completion(execution_id_1)

# Get Task 1 results
results_1 = get_execution_results(execution_id_1)

# Task 2: Analysis (using Task 1 results)
result2 = submit_task({
    "taskName": "Analysis Phase",
    "taskDescription": f"Analyze pricing data: {results_1['output']}"
})

Conditional workflows

Run different tasks based on results:
result = submit_task({
    "taskName": "Check Inventory",
    "taskDescription": "Check if Product X has inventory > 100"
})

wait_for_completion(result["executionId"])
inventory_result = get_execution_results(result["executionId"])

if inventory_result["inventory"] > 100:
    # High inventory - run discount campaign task
    submit_task({
        "taskName": "Discount Campaign",
        "taskDescription": "Create 20% discount campaign for Product X"
    })
else:
    # Low inventory - run restock task
    submit_task({
        "taskName": "Restock Alert",
        "taskDescription": "Send restock alert to procurement team"
    })

Template-based tasks

Create reusable task templates:
TASK_TEMPLATES = {
    "commission_calc": {
        "taskName": "Commission Calculation",
        "taskDescription": "Calculate {rate}% commission on ${amount}",
        "autoDetectCheckpoints": true,
        "testMode": "with-verification"
    },
    "competitor_research": {
        "taskName": "Competitor Research",
        "taskDescription": "Research {company} pricing and features",
        "autoDetectCheckpoints": true,
        "testMode": "with-verification"
    }
}

# Use template
task = TASK_TEMPLATES["commission_calc"].copy()
task["taskDescription"] = task["taskDescription"].format(
    rate=15,
    amount=50000
)
submit_task(task)

Troubleshooting

Task fails immediately

  1. Check the task description - Ensure it’s clear and actionable
  2. Verify the reviewer exists - The reviewer must be a registered mixus user
  3. Check API key permissions - The key needs the eval:create scope
  4. Review the error message - Look in the response or dashboard for details

Checkpoint not triggering

Possible causes:
  • Auto-detect didn’t identify action as high-risk
  • Task description too vague
  • Action type not in checkpoint criteria
Solution: Use manual checkpoints to force verification at specific points.

Slow verification responses

Tips for faster reviews:
  • Use mobile notifications
  • Set up Slack alerts
  • Assign backup reviewers
  • Use webhooks for real-time notifications
