# Evaluators Reference
Evaluators are components that assess agent outputs. AgentEvals supports twelve core evaluator types that can be combined to create sophisticated evaluation pipelines.
## Evaluator Types

| Type | Description | Use Case |
|---|---|---|
| code_judge | Execute custom scripts | Deterministic checks, format validation |
| llm_judge | LLM-based evaluation | Semantic quality, subjective criteria |
| rubric | Structured criteria | Multi-dimensional scoring |
| composite | Combine evaluators | Complex evaluation pipelines |
| tool_trajectory | Validate tool usage | Agentic behavior validation |
| field_accuracy | Check data fields | Structured output validation |
| execution_metrics | Performance bounds | Latency, cost, token limits |
| skill_trigger | Detect skill invocation | Claude Code skill activation |
| contains | Substring check | Quick output validation |
| regex | Pattern matching | Format validation |
| is_json | JSON validation | API response checks |
| equals | Exact match | Deterministic outputs |
## Common Configuration

All evaluators share these configuration options:
```yaml
evaluators:
  - name: string    # Required: Unique name
    type: string    # Required: Evaluator type
    weight: number  # Optional: Scoring weight (default: 1.0)
    config: object  # Optional: Type-specific configuration
```
## Evaluation Flow

```
Input → Agent → Output → Evaluators → Scores → Verdict
                             ↓
              [code_judge, llm_judge, ...]
                             ↓
                   Weighted aggregation
                             ↓
                pass | borderline | fail
```
## Quick Reference
### code_judge

Execute a script that returns `score`, `hits`, and `misses`.
```yaml
- name: syntax_check
  type: code_judge
  script: ["python", "./judges/syntax.py"]
  cwd: ./judges
  weight: 1.0
```
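A judge script could look like the sketch below. This is hypothetical: the stdin-to-JSON-on-stdout contract and the check logic are assumptions based on the description above, not AgentEvals' documented interface.

```python
#!/usr/bin/env python3
"""Hypothetical judge script. The stdin-in/JSON-out contract shown here
is an assumption, not AgentEvals' documented interface."""
import json
import sys


def judge(output: str) -> dict:
    # Record a "hit" for each satisfied check and a "miss" otherwise.
    hits, misses = [], []
    if output.strip():
        hits.append("output is non-empty")
    else:
        misses.append("output is empty")
    if "def " in output:
        hits.append("contains a function definition")
    else:
        misses.append("no function definition found")
    score = len(hits) / (len(hits) + len(misses))
    return {"score": score, "hits": hits, "misses": misses}


if __name__ == "__main__":
    text = sys.stdin.read() if not sys.stdin.isatty() else ""
    print(json.dumps(judge(text)))
```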
### llm_judge

Use an LLM to evaluate based on a prompt template.
```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/quality.md
  target: judge_model
  weight: 2.0
```
### rubric

Structured evaluation criteria with optional weights and score ranges.
```yaml
- name: criteria
  type: rubric
  rubrics:
    - id: accuracy
      outcome: Answer is factually correct
      weight: 3.0
      required: true
    - id: clarity
      outcome: Explanation is clear
      weight: 1.0
```
### composite

Combine multiple evaluators with an aggregation strategy.
```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
  aggregator:
    type: safety_gate
    required: [safety]
```
### tool_trajectory

Validate the sequence and pattern of tool calls.
```yaml
- name: workflow
  type: tool_trajectory
  mode: any_order
  minimums:
    search: 1
    analyze: 1
  expected:
    - tool: respond
```
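In `any_order` mode, minimum counts can be checked with order-insensitive counting. The sketch below is illustrative only; the function name and call-list shape are assumptions, not AgentEvals internals.

```python
# Illustrative sketch of any_order minimum checking; the function name and
# call-list shape are assumptions, not AgentEvals internals.
from collections import Counter


def meets_minimums(tool_calls: list[str], minimums: dict[str, int]) -> bool:
    # Count how often each tool was called, ignoring order entirely.
    counts = Counter(tool_calls)
    return all(counts[tool] >= n for tool, n in minimums.items())
```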
### field_accuracy

Compare structured output fields against expected values.
```yaml
- name: extraction
  type: field_accuracy
  fields:
    - path: invoice.total
      match: numeric_tolerance
      tolerance: 0.01
    - path: invoice.vendor
      match: exact
  aggregation: weighted_average
```
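The two match modes above could work roughly as follows. This is a hypothetical sketch; the helper names and the dotted-path resolution are assumptions, not the evaluator's actual implementation.

```python
# Hypothetical sketch of field comparison; helper names are illustrative.
def get_path(data: dict, path: str):
    """Resolve a dotted path like "invoice.total" against nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def field_matches(actual, expected, match: str = "exact",
                  tolerance: float = 0.0) -> bool:
    if match == "numeric_tolerance":
        # Numbers match when within the configured absolute tolerance.
        return abs(float(actual) - float(expected)) <= tolerance
    return actual == expected  # exact match
```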
### execution_metrics

Set performance thresholds.
```yaml
- name: performance
  type: execution_metrics
  max_tool_calls: 10
  max_tokens: 5000
  max_duration_ms: 30000
  max_cost_usd: 0.10
```
### skill_trigger

Detect whether the agent invoked a specific Claude Code skill as its first tool call.
```yaml
- type: skill-trigger
  skill: commit
  should_trigger: true
```
### contains

Check if output contains a substring.
```yaml
- type: contains
  value: "DENIED"
  required: true
```

### regex

Check if output matches a regular expression.
```yaml
- type: regex
  value: "Good (morning|afternoon|evening)"
```
### is_json

Check if output is valid JSON.
```yaml
- type: is_json
  required: true
```
### equals

Check if output exactly matches a value (both sides trimmed).
```yaml
- type: equals
  value: "42"
```
## Weights and Aggregation

Evaluator scores are combined using weighted averaging:
```
Final Score = Σ(evaluator_score × weight) / Σ(weights)
```

Example:
```yaml
evaluators:
  - name: correctness
    type: llm_judge
    weight: 3.0  # Most important
  - name: format
    type: code_judge
    weight: 1.0  # Less important
```

If correctness scores 0.9 and format scores 0.7:
```
Final Score = (0.9 × 3.0 + 0.7 × 1.0) / (3.0 + 1.0) = 0.85
```
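The same calculation as a minimal Python sketch, for illustration only:

```python
# Minimal sketch of the weighted-average aggregation described above.
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(s * weights[name] for name, s in scores.items()) / total_weight
```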
## Required Gates

Any evaluator can be marked `required`. When a required evaluator scores below the threshold, the verdict is forced to `fail` regardless of the aggregate score.
| Value | Behavior |
|---|---|
| `required: true` | Must score >= 0.8 (default threshold) |
| `required: 0.6` | Must score >= custom threshold (0-1) |
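The gate rule can be sketched as follows. This is a simplified, hypothetical model: it assumes a per-evaluator result dict shape and omits the `borderline` verdict band entirely.

```python
# Simplified, hypothetical sketch of required-gate enforcement; the real
# verdict logic also produces "borderline", which is omitted here.
DEFAULT_THRESHOLD = 0.8


def verdict(aggregate: float, results: list[dict]) -> str:
    for r in results:
        required = r.get("required", False)
        if required:
            # required: true uses the default threshold; a number overrides it.
            threshold = required if isinstance(required, float) else DEFAULT_THRESHOLD
            if r["score"] < threshold:
                return "fail"  # a failing required evaluator forces the verdict
    return "pass" if aggregate >= DEFAULT_THRESHOLD else "fail"
```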
```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # Must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # Must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
```

## Next Steps
- Code Judge - Custom script evaluation
- LLM Judge - Semantic evaluation
- Composite - Build evaluation pipelines