# Tool Trajectory
The `tool_trajectory` evaluator validates how agents use tools during task execution. Use it for agentic workflows where the journey matters, not just the destination.
## Configuration

```yaml
evaluators:
  - name: workflow
    type: tool_trajectory
    mode: in_order
    expected:
      - tool: search
      - tool: analyze
      - tool: respond
```

## Properties
| Property | Type | Required | Description |
|---|---|---|---|
| `mode` | string | No | Matching mode (default: `any_order`) |
| `expected` | ExpectedToolCall[] | No | Expected tool calls |
| `minimums` | object | No | Minimum call counts per tool |
Supported `mode` values:

| Mode | Description |
|---|---|
| `any_order` | Tools must be called; order doesn't matter |
| `in_order` | Tools must be called in the specified sequence |
| `exact` | Tools must match exactly (order and count) |
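
For reference, a minimal sketch that puts all three properties on one evaluator. The sections below use `expected` and `minimums` in separate evaluators, so treat combining them in a single evaluator as an assumption:

```yaml
evaluators:
  - name: trajectory_check        # hypothetical evaluator name
    type: tool_trajectory
    mode: any_order               # any_order | in_order | exact
    expected:
      - tool: search              # tool that must be called at least once
    minimums:
      search: 2                   # minimum call count (combining with `expected` is an assumption)
```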
## Expected Tool Call Schema

```yaml
expected:
  - tool: string              # Tool name (required)
    args: object | "any"      # Expected arguments
    max_duration_ms: number   # Maximum execution time
```

## Examples
### Any Order Mode

Tools must be called at least once, in any order:

```yaml
evaluators:
  - name: required_tools
    type: tool_trajectory
    mode: any_order
    expected:
      - tool: search
      - tool: read_document
      - tool: summarize
```

Passes:
- search → read_document → summarize ✓
- read_document → search → summarize ✓
- search → search → read_document → summarize ✓
Fails:
- search → summarize (missing read_document) ✗
### In Order Mode

Tools must be called in the specified sequence:

```yaml
evaluators:
  - name: workflow_order
    type: tool_trajectory
    mode: in_order
    expected:
      - tool: fetch_data
      - tool: process
      - tool: save
```

Passes:
- fetch_data → process → save ✓
- fetch_data → log → process → save (extra tools ok) ✓
Fails:
- process → fetch_data → save (wrong order) ✗
- fetch_data → save (missing process) ✗
### Exact Mode

Tools must match exactly in order and count:

```yaml
evaluators:
  - name: exact_workflow
    type: tool_trajectory
    mode: exact
    expected:
      - tool: init
      - tool: process
      - tool: cleanup
```

Passes:
- init → process → cleanup ✓
Fails:
- init → process → process → cleanup (extra call) ✗
- init → log → process → cleanup (unexpected tool) ✗
### Minimum Counts

Require a minimum number of calls per tool:

```yaml
evaluators:
  - name: thorough_research
    type: tool_trajectory
    mode: any_order
    minimums:
      search: 2           # At least 2 searches
      read_document: 3    # Read at least 3 documents
```

### Argument Validation
Check tool arguments:

```yaml
evaluators:
  - name: correct_args
    type: tool_trajectory
    mode: in_order
    expected:
      - tool: search
        args:
          query: "Python programming"
      - tool: filter
        args:
          category: "tutorial"
          limit: 10
      - tool: summarize
        args: any   # Skip argument validation
```

### Duration Limits
Set a maximum execution time per tool:

```yaml
evaluators:
  - name: efficient_workflow
    type: tool_trajectory
    mode: any_order
    expected:
      - tool: search
        max_duration_ms: 2000
      - tool: process
        max_duration_ms: 5000
```

## Complete Example
```yaml
name: research-workflow-eval
version: "1.0"
description: Evaluates research agent workflow

execution:
  evaluators:
    # Check workflow follows expected pattern
    - name: workflow_pattern
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: search
        - tool: read_document
        - tool: synthesize

    # Ensure thorough research
    - name: research_depth
      type: tool_trajectory
      mode: any_order
      minimums:
        search: 1
        read_document: 2

    # Check efficiency
    - name: efficiency
      type: execution_metrics
      max_tool_calls: 10
      max_duration_ms: 30000

evalcases:
  - id: research-python
    expected_outcome: |
      Agent searches for Python information, reads multiple
      sources, and synthesizes a comprehensive answer.
    input:
      - role: system
        content: |
          You are a research assistant with tools:
          - search(query): Search for documents
          - read_document(id): Read a document
          - synthesize(content): Create summary
      - role: user
        content: "Research the history of Python programming language"
```

## Trace Summary Access
Tool trajectory evaluators have access to trace data:

```json
{
  "traceSummary": {
    "eventCount": 5,
    "toolNames": ["search", "read_document", "read_document", "synthesize"],
    "costUsd": 0.02,
    "durationMs": 4500
  }
}
```

## Combining with Other Evaluators
### With LLM Judge

```yaml
execution:
  evaluators:
    - name: tool_usage
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: search
        - tool: analyze

    - name: output_quality
      type: llm_judge
      prompt: |
        Evaluate the quality of the research output.
        Did the agent use search and analysis effectively?

        Output: {{candidate_answer}}
        Tools used: {{output_messages}}
```

### In Composite
```yaml
execution:
  evaluators:
    - name: agent_eval
      type: composite
      evaluators:
        - name: correct_tools
          type: tool_trajectory
          mode: any_order
          minimums:
            search: 1

        - name: correct_output
          type: llm_judge
          prompt: ./prompts/quality.md

      aggregator:
        type: weighted_average
        weights:
          correct_tools: 0.4
          correct_output: 0.6
```

## Scoring
Tool trajectory scoring:
| Outcome | Score |
|---|---|
| All expected tools called (correct order if applicable) | 1.0 |
| Missing some expected tools | Proportional |
| Wrong order (in_order mode) | 0.0 |
| Exceeds duration limits | Penalty |
Proportional scoring:
```
Score = tools_matched / tools_expected
```
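
As an illustration of the formula above (not framework output): if an `any_order` evaluator expects three tools and the agent calls only two of them, the proportional score works out to:

```
Score = 2 / 3 ≈ 0.67
```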
## Best Practices

### 1. Use any_order for Flexible Workflows

```yaml
mode: any_order   # Agent can find own path
minimums:
  required_tool: 1
```

### 2. Use in_order for Critical Sequences
```yaml
mode: in_order
expected:
  - tool: validate_input   # Must happen first
  - tool: process
  - tool: save             # Must happen last
```

### 3. Combine with Execution Metrics
```yaml
evaluators:
  - name: tools
    type: tool_trajectory
    mode: any_order
    minimums:
      search: 1

  - name: performance
    type: execution_metrics
    max_tool_calls: 10   # Prevent over-calling
```

### 4. Document Tool Expectations
```yaml
evalcases:
  - id: research
    expected_outcome: |
      Agent should:
      1. Search for relevant information
      2. Read at least 2 documents
      3. Synthesize findings

      Tool sequence: search → read(2+) → synthesize
```

## Next Steps
- Execution Metrics - Performance bounds
- Composite - Combine evaluators
- Patterns - Testing patterns