Execution Metrics

The `execution_metrics` evaluator validates performance characteristics of agent execution. Use it to ensure agents operate within acceptable bounds.

```yaml
evaluators:
  - name: performance
    type: execution_metrics
    max_tool_calls: 10
    max_llm_calls: 5
    max_tokens: 5000
    max_cost_usd: 0.10
    max_duration_ms: 30000
```

| Property | Type | Description |
| --- | --- | --- |
| `max_tool_calls` | integer | Maximum number of tool invocations |
| `max_llm_calls` | integer | Maximum number of LLM API calls |
| `max_tokens` | integer | Maximum total tokens (input + output) |
| `max_input_tokens` | integer | Maximum input tokens |
| `max_output_tokens` | integer | Maximum output tokens |
| `max_cost_usd` | number | Maximum cost in USD |
| `max_duration_ms` | integer | Maximum execution time in milliseconds |
| `target_exploration_ratio` | number | Target ratio of exploration vs exploitation |
| `exploration_tolerance` | number | Acceptable deviation from target ratio |
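
The exploration properties do not appear in the examples below; here is a minimal sketch of how they might be set. The target and tolerance values (0.3 and 0.1) are illustrative assumptions, not recommendations:

```yaml
evaluators:
  - name: exploration_balance
    type: execution_metrics
    # Illustrative values: aim for roughly 30% exploratory behavior,
    # accepting anything between 20% and 40%.
    target_exploration_ratio: 0.3
    exploration_tolerance: 0.1
```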

Evaluators can be attached at the suite level and to individual eval cases:

```yaml
name: efficient-agent-eval
version: "1.0"

execution:
  evaluators:
    - name: performance
      type: execution_metrics
      max_tool_calls: 5
      max_duration_ms: 10000

evalcases:
  - id: quick-lookup
    expected_outcome: Agent finds answer efficiently
    input: "What is the capital of France?"
    execution:
      evaluators:
        - name: budget
          type: execution_metrics
          max_cost_usd: 0.05
          max_tokens: 2000
```

All limits can be combined in a single evaluator:

```yaml
execution:
  evaluators:
    - name: full_metrics
      type: execution_metrics
      max_tool_calls: 10
      max_llm_calls: 5
      max_tokens: 10000
      max_input_tokens: 8000
      max_output_tokens: 2000
      max_cost_usd: 0.25
      max_duration_ms: 60000
```

Different bounds for different eval cases:

```yaml
evalcases:
  - id: simple-query
    expected_outcome: Quick response
    input: "Hello!"
    execution:
      evaluators:
        - name: fast
          type: execution_metrics
          max_duration_ms: 2000
          max_tool_calls: 0

  - id: complex-research
    expected_outcome: Thorough research
    input: "Compare Python vs JavaScript for web development"
    execution:
      evaluators:
        - name: research_bounds
          type: execution_metrics
          max_duration_ms: 60000
          max_tool_calls: 20
          max_cost_usd: 0.50
```

Execution metrics use threshold-based scoring:

| Metric vs limit | Score |
| --- | --- |
| Within limit | 1.0 |
| Exceeds limit | 0.0 |

When multiple metrics are specified, all of them must pass for a score of 1.0.

```yaml
evaluators:
  - name: perf
    type: execution_metrics
    max_tool_calls: 10      # Actual: 8 ✓
    max_duration_ms: 5000   # Actual: 3000 ✓
    max_cost_usd: 0.10      # Actual: 0.12 ✗
```

Result:

- Individual: tool_calls ✓, duration ✓, cost ✗
- Overall score: 0.0 (any failure = fail)

Combine execution metrics with other evaluators and weight them accordingly:

```yaml
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
      weight: 3.0
    - name: performance
      type: execution_metrics
      max_duration_ms: 10000
      weight: 1.0
```

Use a composite evaluator to require both correctness and efficiency:

```yaml
execution:
  evaluators:
    - name: production_ready
      type: composite
      evaluators:
        - name: correctness
          type: llm_judge
          prompt: ./prompts/correct.md
        - name: efficiency
          type: execution_metrics
          max_tool_calls: 10
          max_cost_usd: 0.10
      aggregator:
        type: all_or_nothing
        threshold: 0.8
```

Pair execution metrics with a trajectory evaluator to bound how the work gets done:

```yaml
execution:
  evaluators:
    - name: workflow
      type: tool_trajectory
      mode: any_order
      minimums:
        search: 1
    - name: bounded_workflow
      type: execution_metrics
      max_tool_calls: 5 # Don't over-search
```

Execution metrics are calculated from trace data:

```json
{
  "traceSummary": {
    "eventCount": 5,
    "toolNames": ["search", "read", "read", "summarize"],
    "toolCallCount": 4,
    "llmCallCount": 2,
    "inputTokens": 1500,
    "outputTokens": 500,
    "totalTokens": 2000,
    "costUsd": 0.03,
    "durationMs": 4500
  }
}
```

Typical usage patterns:

```yaml
# Limit spend per evaluation
- name: budget_guard
  type: execution_metrics
  max_cost_usd: 0.02
  max_tokens: 1000

# Real-time response requirement
- name: latency_sla
  type: execution_metrics
  max_duration_ms: 3000

# Stop runaway agents
- name: loop_guard
  type: execution_metrics
  max_tool_calls: 50
  max_llm_calls: 20

# Edge deployment constraints
- name: edge_limits
  type: execution_metrics
  max_tokens: 500
  max_duration_ms: 1000
```

Base limits on observed behavior:

```yaml
# After profiling typical runs
max_tool_calls: 15      # P95 was 12
max_duration_ms: 45000  # P95 was 38000
```

Match limits to case complexity:

```yaml
evalcases:
  - id: simple
    execution:
      evaluators:
        - name: tight
          type: execution_metrics
          max_duration_ms: 5000

  - id: complex
    execution:
      evaluators:
        - name: relaxed
          type: execution_metrics
          max_duration_ms: 60000
```

Don’t optimize only for speed:

```yaml
evaluators:
  - name: quality
    weight: 3.0 # More important
  - name: performance
    weight: 1.0 # Still matters
```

Track metrics over time to catch regressions:

```yaml
metadata:
  baseline_duration_ms: 5000
  baseline_cost_usd: 0.02
```
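
As a sketch of how recorded baselines can drive limits, the case below keeps the baselines as metadata and sets limits with roughly 50% headroom over them. Where the metadata block attaches (suite vs. case level) is an assumption here; the key names come from the snippet above:

```yaml
evalcases:
  - id: regression-check
    # Assumed placement: baselines recorded as case metadata
    metadata:
      baseline_duration_ms: 5000
      baseline_cost_usd: 0.02
    execution:
      evaluators:
        - name: regression_guard
          type: execution_metrics
          # Limits allow ~50% headroom over the recorded baselines
          max_duration_ms: 7500
          max_cost_usd: 0.03
```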

Add buffer for natural variation:

```yaml
# If average is 10s, allow 15s
max_duration_ms: 15000
```

Example evaluator output:

```json
{
  "name": "performance",
  "type": "execution_metrics",
  "score": 1.0,
  "hits": [
    "Tool calls (8) within limit (10)",
    "Duration (4500ms) within limit (10000ms)",
    "Cost ($0.03) within limit ($0.10)"
  ],
  "misses": [],
  "details": {
    "tool_calls": 8,
    "llm_calls": 2,
    "total_tokens": 2000,
    "cost_usd": 0.03,
    "duration_ms": 4500
  }
}
```