# Evaluators Reference
Evaluators are components that assess agent outputs. AgentEvals supports seven core evaluator types that can be combined to create sophisticated evaluation pipelines.
## Evaluator Types

| Type | Description | Use Case |
|---|---|---|
| code_judge | Execute custom scripts | Deterministic checks, format validation |
| llm_judge | LLM-based evaluation | Semantic quality, subjective criteria |
| rubric | Structured criteria | Multi-dimensional scoring |
| composite | Combine evaluators | Complex evaluation pipelines |
| tool_trajectory | Validate tool usage | Agentic behavior validation |
| field_accuracy | Check data fields | Structured output validation |
| execution_metrics | Performance bounds | Latency, cost, token limits |
## Common Configuration

All evaluators share these configuration options:

```yaml
evaluators:
  - name: string    # Required: Unique name
    type: string    # Required: Evaluator type
    weight: number  # Optional: Scoring weight (default: 1.0)
    config: object  # Optional: Type-specific configuration
```
## Evaluation Flow

```
Input → Agent → Output → Evaluators → Scores → Verdict
                             ↓
               [code_judge, llm_judge, ...]
                             ↓
                    Weighted aggregation
                             ↓
                 pass | borderline | fail
```
## Quick Reference

### code_judge
Execute a script that returns score, hits, and misses.
```yaml
- name: syntax_check
  type: code_judge
  script: ["python", "./judges/syntax.py"]
  cwd: ./judges
  weight: 1.0
```
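A minimal sketch of what such a script might look like, assuming the harness passes the agent's output to the script as JSON on stdin and reads a JSON object with `score`, `hits`, and `misses` from stdout (this I/O contract is an assumption for illustration, not part of this reference):

```python
# syntax.py - hypothetical code_judge script.
# Assumed contract: JSON payload on stdin, JSON verdict on stdout.
import ast
import json
import sys

payload = json.load(sys.stdin)       # assumed shape: {"output": "<agent text>", ...}
code = payload.get("output", "")

hits, misses = [], []
try:
    ast.parse(code)                  # deterministic check: does the output parse as Python?
    hits.append("output parses as valid Python")
except SyntaxError as exc:
    misses.append(f"syntax error: {exc}")

json.dump(
    {"score": 1.0 if not misses else 0.0, "hits": hits, "misses": misses},
    sys.stdout,
)
```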
### llm_judge

Use an LLM to evaluate based on a prompt template.
```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/quality.md
  target: judge_model
  weight: 2.0
```
### rubric

Structured evaluation criteria with optional weights and score ranges.
```yaml
- name: criteria
  type: rubric
  rubrics:
    - id: accuracy
      expected_outcome: Answer is factually correct
      weight: 3.0
      required: true
    - id: clarity
      expected_outcome: Explanation is clear
      weight: 1.0
```
### composite

Combine multiple evaluators with an aggregation strategy.
```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
  aggregator:
    type: safety_gate
    required: [safety]
```
### tool_trajectory

Validate the sequence and pattern of tool calls.
```yaml
- name: workflow
  type: tool_trajectory
  mode: any_order
  minimums:
    search: 1
    analyze: 1
  expected:
    - tool: respond
```
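To make `any_order` matching with `minimums` concrete, here is a rough Python sketch of the counting logic (the function and call format are hypothetical; AgentEvals performs this check internally):

```python
from collections import Counter

def meets_minimums(tool_calls: list[str], minimums: dict[str, int]) -> bool:
    """True if each tool appears at least the required number of times,
    in any order (mirrors mode: any_order with minimums)."""
    counts = Counter(tool_calls)
    return all(counts[tool] >= n for tool, n in minimums.items())

# Example: this trajectory satisfies search >= 1 and analyze >= 1.
print(meets_minimums(["search", "analyze", "respond"], {"search": 1, "analyze": 1}))  # True
```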
### field_accuracy

Compare structured output fields against expected values.
```yaml
- name: extraction
  type: field_accuracy
  fields:
    - path: invoice.total
      match: numeric_tolerance
      tolerance: 0.01
    - path: invoice.vendor
      match: exact
  aggregation: weighted_average
```
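A sketch of how the two match modes above behave, for illustration only (not AgentEvals source):

```python
def field_matches(actual, expected, match: str, tolerance: float = 0.0) -> bool:
    """Illustrative comparison for the exact and numeric_tolerance match modes."""
    if match == "exact":
        return actual == expected
    if match == "numeric_tolerance":
        # Passes when the absolute difference is within the configured tolerance.
        return abs(float(actual) - float(expected)) <= tolerance
    raise ValueError(f"unknown match mode: {match}")

print(field_matches(100.004, 100.00, "numeric_tolerance", tolerance=0.01))  # True
print(field_matches("Acme Corp", "Acme Corp", "exact"))                     # True
```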
### execution_metrics

Set performance thresholds.
```yaml
- name: performance
  type: execution_metrics
  max_tool_calls: 10
  max_tokens: 5000
  max_duration_ms: 30000
  max_cost_usd: 0.10
```
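Conceptually, each recorded metric is compared against its configured ceiling, roughly as in this sketch (the metric names in the dictionary are illustrative assumptions):

```python
def within_bounds(metrics: dict, limits: dict) -> bool:
    """Illustrative check: every recorded metric must stay at or under its limit."""
    observed = {
        "max_tool_calls": metrics.get("tool_calls", 0),
        "max_tokens": metrics.get("tokens", 0),
        "max_duration_ms": metrics.get("duration_ms", 0),
        "max_cost_usd": metrics.get("cost_usd", 0.0),
    }
    return all(value <= limits[key] for key, value in observed.items() if key in limits)

print(within_bounds(
    {"tool_calls": 4, "tokens": 3200, "duration_ms": 12000, "cost_usd": 0.04},
    {"max_tool_calls": 10, "max_tokens": 5000, "max_duration_ms": 30000, "max_cost_usd": 0.10},
))  # True
```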
## Weights and Aggregation

Evaluator scores are combined using weighted averaging:

```
Final Score = Σ(evaluator_score × weight) / Σ(weights)
```

Example:
```yaml
evaluators:
  - name: correctness
    type: llm_judge
    weight: 3.0  # Most important
  - name: format
    type: code_judge
    weight: 1.0  # Less important
```

If correctness scores 0.9 and format scores 0.7:
```
Final Score = (0.9 × 3.0 + 0.7 × 1.0) / (3.0 + 1.0) = 0.85
```
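The same calculation as a short Python sketch, reproducing the numbers above (illustrative only):

```python
def weighted_score(results: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs: Σ(score × weight) / Σ(weights)."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# correctness: 0.9 at weight 3.0, format: 0.7 at weight 1.0
print(round(weighted_score([(0.9, 3.0), (0.7, 1.0)]), 2))  # 0.85
```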
## Required Evaluators

Rubrics with `required: true` override the weighted score:
```yaml
rubrics:
  - id: safety
    expected_outcome: No harmful content
    required: true  # Fail verdict if missed, regardless of score
```
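In sketch form, the override behaves roughly like this (a simplified illustration of the semantics described above, not AgentEvals source; the threshold is a placeholder and the borderline band is omitted for brevity):

```python
def verdict(weighted: float, required_missed: bool, pass_threshold: float = 0.8) -> str:
    """A missed required rubric forces a fail, regardless of the weighted score."""
    if required_missed:
        return "fail"
    return "pass" if weighted >= pass_threshold else "fail"

print(verdict(0.95, required_missed=True))   # "fail" despite the high score
print(verdict(0.85, required_missed=False))  # "pass"
```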
Section titled “Next Steps”- Code Judge - Custom script evaluation
- LLM Judge - Semantic evaluation
- Composite - Build evaluation pipelines