# Execution Metrics
The execution_metrics evaluator validates performance characteristics of agent execution. Use it to ensure agents operate within acceptable bounds.
## Configuration

```yaml
evaluators:
  - name: performance
    type: execution_metrics
    max_tool_calls: 10
    max_llm_calls: 5
    max_tokens: 5000
    max_cost_usd: 0.10
    max_duration_ms: 30000
```
## Properties

| Property | Type | Description |
|---|---|---|
| max_tool_calls | integer | Maximum number of tool invocations |
| max_llm_calls | integer | Maximum number of LLM API calls |
| max_tokens | integer | Maximum total tokens (input + output) |
| max_input_tokens | integer | Maximum input tokens |
| max_output_tokens | integer | Maximum output tokens |
| max_cost_usd | number | Maximum cost in USD |
| max_duration_ms | integer | Maximum execution time in milliseconds |
| target_exploration_ratio | number | Target ratio of exploration to exploitation |
| exploration_tolerance | number | Acceptable deviation from the target ratio |
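The two exploration properties are not covered by the examples below. A minimal sketch, assuming the evaluator computes an observed exploration ratio from the trace and passes when it falls within the tolerance band around the target:

```yaml
evaluators:
  - name: exploration_balance
    type: execution_metrics
    # Assumed semantics: a target of 0.3 with tolerance 0.1
    # accepts observed ratios between 0.2 and 0.4.
    target_exploration_ratio: 0.3
    exploration_tolerance: 0.1
```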
## Examples
### Basic Performance Bounds

```yaml
name: efficient-agent-eval
version: "1.0"

execution:
  evaluators:
    - name: performance
      type: execution_metrics
      max_tool_calls: 5
      max_duration_ms: 10000

evalcases:
  - id: quick-lookup
    expected_outcome: Agent finds answer efficiently
    input: "What is the capital of France?"
```
### Cost Control

```yaml
execution:
  evaluators:
    - name: budget
      type: execution_metrics
      max_cost_usd: 0.05
      max_tokens: 2000
```
### Comprehensive Metrics

```yaml
execution:
  evaluators:
    - name: full_metrics
      type: execution_metrics
      max_tool_calls: 10
      max_llm_calls: 5
      max_tokens: 10000
      max_input_tokens: 8000
      max_output_tokens: 2000
      max_cost_usd: 0.25
      max_duration_ms: 60000
```
### Tiered Performance

Different bounds for different eval cases:
```yaml
evalcases:
  - id: simple-query
    expected_outcome: Quick response
    input: "Hello!"
    execution:
      evaluators:
        - name: fast
          type: execution_metrics
          max_duration_ms: 2000
          max_tool_calls: 0

  - id: complex-research
    expected_outcome: Thorough research
    input: "Compare Python vs JavaScript for web development"
    execution:
      evaluators:
        - name: research_bounds
          type: execution_metrics
          max_duration_ms: 60000
          max_tool_calls: 20
          max_cost_usd: 0.50
```
## Scoring

Execution metrics use threshold-based scoring:
| Metric vs Limit | Score |
|---|---|
| Within limit | 1.0 |
| Exceeds limit | 0.0 |
When multiple metrics are specified, all must pass for a score of 1.0.
### Scoring Example

```yaml
evaluators:
  - name: perf
    type: execution_metrics
    max_tool_calls: 10     # Actual: 8 ✓
    max_duration_ms: 5000  # Actual: 3000 ✓
    max_cost_usd: 0.10     # Actual: 0.12 ✗
```

Result:
- Individual: tool_calls ✓, duration ✓, cost ✗
- Overall score: 0.0 (any failure = fail)
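A failing run like this surfaces the violated limit as a miss. A sketch of the corresponding result, assuming the format shown in Result Format below (the exact miss wording may differ):

```json
{
  "name": "perf",
  "type": "execution_metrics",
  "score": 0.0,
  "hits": [
    "Tool calls (8) within limit (10)",
    "Duration (3000ms) within limit (5000ms)"
  ],
  "misses": [
    "Cost ($0.12) exceeds limit ($0.10)"
  ]
}
```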
## Combining with Other Evaluators

### Quality + Performance
```yaml
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
      weight: 3.0

    - name: performance
      type: execution_metrics
      max_duration_ms: 10000
      weight: 1.0
```
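Assuming the weights feed a weighted average (see Verdicts for the aggregation details), a quality score of 0.9 and a performance score of 1.0 would combine to (3.0 × 0.9 + 1.0 × 1.0) / 4.0 ≈ 0.93.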
### In Composite

```yaml
execution:
  evaluators:
    - name: production_ready
      type: composite
      evaluators:
        - name: correctness
          type: llm_judge
          prompt: ./prompts/correct.md

        - name: efficiency
          type: execution_metrics
          max_tool_calls: 10
          max_cost_usd: 0.10

      aggregator:
        type: all_or_nothing
        threshold: 0.8
```

### With Tool Trajectory
```yaml
execution:
  evaluators:
    - name: workflow
      type: tool_trajectory
      mode: any_order
      minimums:
        search: 1

    - name: bounded_workflow
      type: execution_metrics
      max_tool_calls: 5  # Don't over-search
```

## Trace Summary
Execution metrics are calculated from trace data:
{ "traceSummary": { "eventCount": 5, "toolNames": ["search", "read", "read", "summarize"], "toolCallCount": 4, "llmCallCount": 2, "inputTokens": 1500, "outputTokens": 500, "totalTokens": 2000, "costUsd": 0.03, "durationMs": 4500 }}Use Cases
## Use Cases

### API Cost Control

```yaml
# Limit spend per evaluation
- name: budget_guard
  type: execution_metrics
  max_cost_usd: 0.02
  max_tokens: 1000
```
### Latency Requirements

```yaml
# Real-time response requirement
- name: latency_sla
  type: execution_metrics
  max_duration_ms: 3000
```
### Prevent Infinite Loops

```yaml
# Stop runaway agents
- name: loop_guard
  type: execution_metrics
  max_tool_calls: 50
  max_llm_calls: 20
```
### Resource-Constrained Environments

```yaml
# Edge deployment constraints
- name: edge_limits
  type: execution_metrics
  max_tokens: 500
  max_duration_ms: 1000
```
## Best Practices

### 1. Set Realistic Bounds

Base limits on observed behavior:
```yaml
# After profiling typical runs
max_tool_calls: 15       # P95 was 12
max_duration_ms: 45000   # P95 was 38000
```
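For example, with a P95 duration of 38000 ms, a 45000 ms limit leaves roughly 18% headroom, enough to absorb slow-but-legitimate runs without masking real regressions.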
### 2. Different Bounds for Different Tasks

```yaml
evalcases:
  - id: simple
    execution:
      evaluators:
        - name: tight
          type: execution_metrics
          max_duration_ms: 5000

  - id: complex
    execution:
      evaluators:
        - name: relaxed
          type: execution_metrics
          max_duration_ms: 60000
```
### 3. Combine with Quality Metrics

Don’t optimize only for speed:
```yaml
evaluators:
  - name: quality
    weight: 3.0  # More important

  - name: performance
    weight: 1.0  # Still matters
```
### 4. Monitor Trends

Track metrics over time to catch regressions:
```yaml
metadata:
  baseline_duration_ms: 5000
  baseline_cost_usd: 0.02
```
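For example, a run that reports 6500 ms against the 5000 ms baseline is a 30% regression worth investigating, even if it still passes the hard limit.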
### 5. Account for Variance

Add buffer for natural variation:
```yaml
# If average is 10s, allow 15s
max_duration_ms: 15000
```
## Result Format

```json
{
  "name": "performance",
  "type": "execution_metrics",
  "score": 1.0,
  "hits": [
    "Tool calls (8) within limit (10)",
    "Duration (4500ms) within limit (10000ms)",
    "Cost ($0.03) within limit ($0.10)"
  ],
  "misses": [],
  "details": {
    "tool_calls": 8,
    "llm_calls": 2,
    "total_tokens": 2000,
    "cost_usd": 0.03,
    "duration_ms": 4500
  }
}
```
## Next Steps

- Tool Trajectory - Validate tool patterns
- Composite - Combine evaluators
- Verdicts - Score aggregation