# Evaluators Reference

Evaluators are components that assess agent outputs. AgentEvals supports seven core evaluator types that can be combined to create sophisticated evaluation pipelines.

| Type | Description | Use Case |
| --- | --- | --- |
| `code_judge` | Execute custom scripts | Deterministic checks, format validation |
| `llm_judge` | LLM-based evaluation | Semantic quality, subjective criteria |
| `rubric` | Structured criteria | Multi-dimensional scoring |
| `composite` | Combine evaluators | Complex evaluation pipelines |
| `tool_trajectory` | Validate tool usage | Agentic behavior validation |
| `field_accuracy` | Check data fields | Structured output validation |
| `execution_metrics` | Performance bounds | Latency, cost, token limits |

All evaluators share these configuration options:

```yaml
evaluators:
  - name: string    # Required: unique name
    type: string    # Required: evaluator type
    weight: number  # Optional: scoring weight (default: 1.0)
    config: object  # Optional: type-specific configuration
```

Evaluation pipeline:

```
Input → Agent → Output → Evaluators → Scores → Verdict
                         [code_judge,  weighted      pass | borderline | fail
                          llm_judge,   aggregation
                          ...]
```

## code_judge

Execute a script that returns `score`, `hits`, and `misses`.

```yaml
- name: syntax_check
  type: code_judge
  script: ["python", "./judges/syntax.py"]
  cwd: ./judges
  weight: 1.0
```
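
What such a script looks like depends on your harness. Here is a minimal sketch, assuming the agent output arrives as JSON on stdin and the result is written as JSON to stdout; both the I/O protocol and the payload keys are assumptions, not a documented contract:

```python
#!/usr/bin/env python3
"""judges/syntax.py -- illustrative judge script (sketch only)."""
import json
import sys


def main() -> None:
    # Assumption: the harness passes the agent's output as JSON on stdin.
    payload = json.load(sys.stdin)
    output = payload.get("output", "")

    hits, misses = [], []
    if output.strip():
        hits.append("output is non-empty")
    else:
        misses.append("output is empty")

    total = len(hits) + len(misses)
    score = len(hits) / total if total else 0.0

    # The evaluator expects score, hits, and misses back.
    json.dump({"score": score, "hits": hits, "misses": misses}, sys.stdout)


if __name__ == "__main__":
    main()
```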

## llm_judge

Use an LLM to evaluate the output against a prompt template.

```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/quality.md
  target: judge_model
  weight: 2.0
```
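
The prompt file itself is plain Markdown. Below is an illustrative sketch; the `{{input}}` and `{{output}}` placeholders are assumptions about the template syntax, not a documented contract:

```markdown
Evaluate the agent's answer for semantic quality.

Question: {{input}}
Answer: {{output}}

Score from 0.0 to 1.0 and list specific strengths and weaknesses.
```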

## rubric

Define structured evaluation criteria with optional weights and score ranges.

```yaml
- name: criteria
  type: rubric
  rubrics:
    - id: accuracy
      expected_outcome: Answer is factually correct
      weight: 3.0
      required: true
    - id: clarity
      expected_outcome: Explanation is clear
      weight: 1.0
```

## composite

Combine multiple evaluators with an aggregation strategy.

```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
  aggregator:
    type: safety_gate
    required: [safety]
```
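
The exact semantics of `safety_gate` are not spelled out here. A plausible sketch, assuming evaluators listed under `required` must clear a threshold before the remaining scores are averaged; the threshold value and return convention are assumptions:

```python
def safety_gate(scores: dict[str, float], required: list[str],
                threshold: float = 0.5) -> float:
    """Sketch of a safety-gate aggregation (threshold is an assumption).

    If any required evaluator falls below the threshold, the composite
    fails outright; otherwise scores combine as a plain average.
    """
    if any(scores[name] < threshold for name in required):
        return 0.0
    return sum(scores.values()) / len(scores)


safety_gate({"safety": 0.9, "quality": 0.8}, required=["safety"])  # -> 0.85
```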

## tool_trajectory

Validate the sequence and pattern of tool calls.

```yaml
- name: workflow
  type: tool_trajectory
  mode: any_order
  minimums:
    search: 1
    analyze: 1
  expected:
    - tool: respond
```
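
Conceptually, an `any_order` check needs only the call counts and the presence of each expected tool. The sketch below is illustrative, not the library's API:

```python
from collections import Counter


def check_any_order(calls: list[str], minimums: dict[str, int],
                    expected: list[str]) -> bool:
    # `calls` is the ordered list of tool names the agent invoked.
    counts = Counter(calls)
    meets_minimums = all(counts[tool] >= n for tool, n in minimums.items())
    has_expected = all(tool in counts for tool in expected)
    return meets_minimums and has_expected


check_any_order(["search", "analyze", "respond"],
                {"search": 1, "analyze": 1}, ["respond"])  # -> True
```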

## field_accuracy

Compare structured output fields against expected values.

```yaml
- name: extraction
  type: field_accuracy
  fields:
    - path: invoice.total
      match: numeric_tolerance
      tolerance: 0.01
    - path: invoice.vendor
      match: exact
  aggregation: weighted_average
```
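
Conceptually, each field is resolved by its dotted path and compared with the named matcher. A sketch, assuming `numeric_tolerance` means an absolute difference within an inclusive bound:

```python
def get_path(obj: dict, path: str):
    """Resolve a dotted path like "invoice.total" in nested output."""
    for key in path.split("."):
        obj = obj[key]
    return obj


def numeric_tolerance(actual: float, expected: float, tolerance: float) -> bool:
    # Assumption: the bound is absolute and inclusive, not relative.
    return abs(actual - expected) <= tolerance


output = {"invoice": {"total": 99.994, "vendor": "Acme"}}
numeric_tolerance(get_path(output, "invoice.total"), 100.00, 0.01)  # -> True
```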

## execution_metrics

Set performance thresholds.

```yaml
- name: performance
  type: execution_metrics
  max_tool_calls: 10
  max_tokens: 5000
  max_duration_ms: 30000
  max_cost_usd: 0.10
```
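
The check itself reduces to a bounds comparison. A sketch, assuming a run passes only when every recorded metric is at or under its limit; the metric names here drop the `max_` prefix, and the pass/fail semantics are assumptions:

```python
def within_bounds(metrics: dict[str, float], limits: dict[str, float]) -> bool:
    # Assumption: every limit is an inclusive upper bound on the matching metric.
    return all(metrics.get(key, 0) <= bound for key, bound in limits.items())


within_bounds(
    {"tool_calls": 7, "tokens": 4200, "duration_ms": 1800, "cost_usd": 0.03},
    {"tool_calls": 10, "tokens": 5000, "duration_ms": 30000, "cost_usd": 0.10},
)  # -> True
```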

## Scoring

Evaluator scores are combined using weighted averaging:

Final Score = Σ(evaluator_score × weight) / Σ(weights)

Example:

```yaml
evaluators:
  - name: correctness
    type: llm_judge
    weight: 3.0  # Most important
  - name: format
    type: code_judge
    weight: 1.0  # Less important
```

If correctness scores 0.9 and format scores 0.7:

Final Score = (0.9 × 3.0 + 0.7 × 1.0) / (3.0 + 1.0) = 0.85
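
In code, the aggregation is a plain weighted mean. A minimal sketch that reproduces the arithmetic above:

```python
def final_score(results: list[tuple[float, float]]) -> float:
    """Weighted average over (score, weight) pairs."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight


final_score([(0.9, 3.0), (0.7, 1.0)])  # -> 0.85
```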

Rubrics with `required: true` override the weighted score:

```yaml
rubrics:
  - id: safety
    expected_outcome: No harmful content
    required: true  # Fail verdict if missed, regardless of score
```
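
A sketch of how that override might sit on top of the weighted score; the 0.7 pass threshold and result shape are assumptions, and the `borderline` verdict is omitted for brevity:

```python
def verdict(weighted_score: float, rubric_results: list[dict],
            threshold: float = 0.7) -> str:
    # A missed required rubric forces a fail regardless of the score.
    if any(r.get("required") and not r.get("hit") for r in rubric_results):
        return "fail"
    return "pass" if weighted_score >= threshold else "fail"


verdict(0.92, [{"id": "safety", "required": True, "hit": False}])  # -> "fail"
```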