# Evaluators Reference

Evaluators are components that assess agent outputs. AgentEvals supports seven core evaluator types that can be combined to create sophisticated evaluation pipelines.

| Type | Description | Use Case |
| --- | --- | --- |
| `code_judge` | Execute custom scripts | Deterministic checks, format validation |
| `llm_judge` | LLM-based evaluation | Semantic quality, subjective criteria |
| `rubric` | Structured criteria | Multi-dimensional scoring |
| `composite` | Combine evaluators | Complex evaluation pipelines |
| `tool_trajectory` | Validate tool usage | Agentic behavior validation |
| `field_accuracy` | Check data fields | Structured output validation |
| `execution_metrics` | Performance bounds | Latency, cost, token limits |

All evaluators share these configuration options:

```yaml
evaluators:
  - name: string    # Required: unique name
    type: string    # Required: evaluator type
    weight: number  # Optional: scoring weight (default: 1.0)
    config: object  # Optional: type-specific configuration
```

Evaluation pipeline:

```
Input → Agent → Output → Evaluators → Scores → Verdict
                         [code_judge,  weighted      pass | borderline | fail
                          llm_judge,   aggregation
                          ...]
```

## code_judge

Execute a script that returns `score`, `hits`, and `misses`.

```yaml
- name: syntax_check
  type: code_judge
  script: ["python", "./judges/syntax.py"]
  cwd: ./judges
  weight: 1.0
```
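
What such a script looks like depends on your harness. Here is a minimal sketch, assuming the agent output arrives as JSON on stdin and the result is written as JSON to stdout; both the I/O protocol and the payload keys are assumptions, not a documented contract:

```python
#!/usr/bin/env python3
"""judges/syntax.py -- illustrative judge script (sketch only)."""
import json
import sys


def main() -> None:
    # Assumption: the harness passes the agent's output as JSON on stdin.
    payload = json.load(sys.stdin)
    output = payload.get("output", "")

    hits, misses = [], []
    if output.strip():
        hits.append("output is non-empty")
    else:
        misses.append("output is empty")

    total = len(hits) + len(misses)
    score = len(hits) / total if total else 0.0

    # The evaluator expects score, hits, and misses back.
    json.dump({"score": score, "hits": hits, "misses": misses}, sys.stdout)


if __name__ == "__main__":
    main()
```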

## llm_judge

Use an LLM to evaluate the output against a prompt template.

```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/quality.md
  target: judge_model
  weight: 2.0
```
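
The prompt file itself is plain Markdown. Below is an illustrative sketch; the `{{input}}` and `{{output}}` placeholders are assumptions about the template syntax, not a documented contract:

```markdown
Evaluate the agent's answer for semantic quality.

Question: {{input}}
Answer: {{output}}

Score from 0.0 to 1.0 and list specific strengths and weaknesses.
```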

## rubric

Define structured evaluation criteria with optional weights and score ranges.

```yaml
- name: criteria
  type: rubric
  rubrics:
    - id: accuracy
      expected_outcome: Answer is factually correct
      weight: 3.0
      required: true
    - id: clarity
      expected_outcome: Explanation is clear
      weight: 1.0
```

## composite

Combine multiple evaluators with an aggregation strategy.

```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
  aggregator:
    type: safety_gate
    required: [safety]
```
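
The exact semantics of `safety_gate` are not spelled out here. A plausible sketch, assuming evaluators listed under `required` must clear a threshold before the remaining scores are averaged; the threshold value and return convention are assumptions:

```python
def safety_gate(scores: dict[str, float], required: list[str],
                threshold: float = 0.5) -> float:
    """Sketch of a safety-gate aggregation (threshold is an assumption).

    If any required evaluator falls below the threshold, the composite
    fails outright; otherwise scores combine as a plain average.
    """
    if any(scores[name] < threshold for name in required):
        return 0.0
    return sum(scores.values()) / len(scores)


safety_gate({"safety": 0.9, "quality": 0.8}, required=["safety"])  # -> 0.85
```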

## tool_trajectory

Validate the sequence and pattern of tool calls.

```yaml
- name: workflow
  type: tool_trajectory
  mode: any_order
  minimums:
    search: 1
    analyze: 1
  expected:
    - tool: respond
```
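
Conceptually, an `any_order` check needs only the call counts and the presence of each expected tool. The sketch below is illustrative, not the library's API:

```python
from collections import Counter


def check_any_order(calls: list[str], minimums: dict[str, int],
                    expected: list[str]) -> bool:
    # `calls` is the ordered list of tool names the agent invoked.
    counts = Counter(calls)
    meets_minimums = all(counts[tool] >= n for tool, n in minimums.items())
    has_expected = all(tool in counts for tool in expected)
    return meets_minimums and has_expected


check_any_order(["search", "analyze", "respond"],
                {"search": 1, "analyze": 1}, ["respond"])  # -> True
```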

## field_accuracy

Compare structured output fields against expected values.

```yaml
- name: extraction
  type: field_accuracy
  fields:
    - path: invoice.total
      match: numeric_tolerance
      tolerance: 0.01
    - path: invoice.vendor
      match: exact
  aggregation: weighted_average
```
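
Conceptually, each field is resolved by its dotted path and compared with the named matcher. A sketch, assuming `numeric_tolerance` means an absolute difference within an inclusive bound:

```python
def get_path(obj: dict, path: str):
    """Resolve a dotted path like "invoice.total" in nested output."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def numeric_tolerance(actual: float, expected: float, tolerance: float) -> bool:
    # Assumption: the bound is absolute and inclusive, not relative.
    return abs(actual - expected) <= tolerance


output = {"invoice": {"total": 99.994, "vendor": "Acme"}}
numeric_tolerance(get_path(output, "invoice.total"), 100.00, 0.01)  # -> True
```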

## execution_metrics

Set performance thresholds.

```yaml
- name: performance
  type: execution_metrics
  max_tool_calls: 10
  max_tokens: 5000
  max_duration_ms: 30000
  max_cost_usd: 0.10
```
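
The check itself reduces to a bounds comparison. A sketch, assuming a run passes only when every recorded metric is at or under its limit; the metric names here drop the `max_` prefix, and the pass/fail semantics are assumptions:

```python
def within_bounds(metrics: dict[str, float], limits: dict[str, float]) -> bool:
    # Assumption: every limit is an inclusive upper bound on the matching metric.
    return all(metrics.get(key, 0) <= bound for key, bound in limits.items())


within_bounds(
    {"tool_calls": 7, "tokens": 4200, "duration_ms": 1800, "cost_usd": 0.03},
    {"tool_calls": 10, "tokens": 5000, "duration_ms": 30000, "cost_usd": 0.10},
)  # -> True
```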

## Scoring

Evaluator scores are combined using weighted averaging:

Final Score = Σ(evaluator_score × weight) / Σ(weights)

Example:

```yaml
evaluators:
  - name: correctness
    type: llm_judge
    weight: 3.0  # Most important
  - name: format
    type: code_judge
    weight: 1.0  # Less important
```

If correctness scores 0.9 and format scores 0.7:

Final Score = (0.9 × 3.0 + 0.7 × 1.0) / (3.0 + 1.0) = 0.85
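
In code, the aggregation is a plain weighted mean. A minimal sketch that reproduces the arithmetic above:

```python
def final_score(results: list[tuple[float, float]]) -> float:
    """Weighted average over (score, weight) pairs."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight


final_score([(0.9, 3.0), (0.7, 1.0)])  # -> 0.85
```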

Rubrics with `required: true` override the weighted score:

```yaml
rubrics:
  - id: safety
    expected_outcome: No harmful content
    required: true  # Fail verdict if missed, regardless of score
```
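
A sketch of how that override might sit on top of the weighted score; the 0.7 pass threshold and result shape are assumptions, and the `borderline` verdict is omitted for brevity:

```python
def verdict(weighted_score: float, rubric_results: list[dict],
            threshold: float = 0.7) -> str:
    # A missed required rubric forces a fail regardless of the score.
    if any(r.get("required") and not r.get("hit") for r in rubric_results):
        return "fail"
    return "pass" if weighted_score >= threshold else "fail"


verdict(0.92, [{"id": "safety", "required": True, "hit": False}])  # -> "fail"
```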