# Evaluators Reference
Evaluators are components that assess agent outputs. AgentEvals supports twelve core evaluator types that can be combined to create sophisticated evaluation pipelines.
## Evaluator Types

| Type | Description | Use Case |
|---|---|---|
| code_judge | Execute custom scripts | Deterministic checks, format validation |
| llm_judge | LLM-based evaluation | Semantic quality, subjective criteria |
| rubric | Structured criteria | Multi-dimensional scoring |
| composite | Combine evaluators | Complex evaluation pipelines |
| tool_trajectory | Validate tool usage | Agentic behavior validation |
| field_accuracy | Check data fields | Structured output validation |
| execution_metrics | Performance bounds | Latency, cost, token limits |
| skill_trigger | Detect skill invocation | Claude Code skill activation |
| contains | Substring check | Quick output validation |
| regex | Pattern matching | Format validation |
| is_json | JSON validation | API response checks |
| equals | Exact match | Deterministic outputs |
## Common Configuration

All evaluators share these configuration options:
```yaml
evaluators:
  - name: string    # Required: Unique name
    type: string    # Required: Evaluator type
    weight: number  # Optional: Scoring weight (default: 1.0)
    config: object  # Optional: Type-specific configuration
```
## Evaluation Flow

```
Input → Agent → Output → Evaluators → Scores → Verdict
                             ↓
              [code_judge, llm_judge, ...]
                             ↓
                   Weighted aggregation
                             ↓
                pass | borderline | fail
```
## Quick Reference
### code_judge

Execute a script that returns `score`, `hits`, and `misses`.
```yaml
- name: syntax_check
  type: code_judge
  script: ["python", "./judges/syntax.py"]
  cwd: ./judges
  weight: 1.0
```
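A judge script could look like the sketch below. This is hypothetical: the stdin-to-JSON-on-stdout contract and the check logic are assumptions based on the description above, not AgentEvals' documented interface.

```python
#!/usr/bin/env python3
"""Hypothetical judge script. The stdin-in/JSON-out contract shown here
is an assumption, not AgentEvals' documented interface."""
import json
import sys


def judge(output: str) -> dict:
    # Record a "hit" for each satisfied check and a "miss" otherwise.
    hits, misses = [], []
    if output.strip():
        hits.append("output is non-empty")
    else:
        misses.append("output is empty")
    if "def " in output:
        hits.append("contains a function definition")
    else:
        misses.append("no function definition found")
    score = len(hits) / (len(hits) + len(misses))
    return {"score": score, "hits": hits, "misses": misses}


if __name__ == "__main__":
    text = sys.stdin.read() if not sys.stdin.isatty() else ""
    print(json.dumps(judge(text)))
```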
### llm_judge

Use an LLM to evaluate based on a prompt template.
```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/quality.md
  target: judge_model
  weight: 2.0
```
### rubric

Structured evaluation criteria with optional weights and score ranges.
```yaml
- name: criteria
  type: rubric
  rubrics:
    - id: accuracy
      outcome: Answer is factually correct
      weight: 3.0
      required: true
    - id: clarity
      outcome: Explanation is clear
      weight: 1.0
```
### composite

Combine multiple evaluators with an aggregation strategy.
```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./prompts/safety.md
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
  aggregator:
    type: safety_gate
    required: [safety]
```
### tool_trajectory

Validate the sequence and pattern of tool calls.
```yaml
- name: workflow
  type: tool_trajectory
  mode: any_order
  minimums:
    search: 1
    analyze: 1
  expected:
    - tool: respond
```
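In `any_order` mode, minimum counts can be checked with order-insensitive counting. The sketch below is illustrative only; the function name and call-list shape are assumptions, not AgentEvals internals.

```python
# Illustrative sketch of any_order minimum checking; the function name and
# call-list shape are assumptions, not AgentEvals internals.
from collections import Counter


def meets_minimums(tool_calls: list[str], minimums: dict[str, int]) -> bool:
    # Count how often each tool was called, ignoring order entirely.
    counts = Counter(tool_calls)
    return all(counts[tool] >= n for tool, n in minimums.items())
```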
### field_accuracy

Compare structured output fields against expected values.
```yaml
- name: extraction
  type: field_accuracy
  fields:
    - path: invoice.total
      match: numeric_tolerance
      tolerance: 0.01
    - path: invoice.vendor
      match: exact
  aggregation: weighted_average
```
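The two match modes above could work roughly as follows. This is a hypothetical sketch; the helper names and the dotted-path resolution are assumptions, not the evaluator's actual implementation.

```python
# Hypothetical sketch of field comparison; helper names are illustrative.
def get_path(data: dict, path: str):
    """Resolve a dotted path like "invoice.total" against nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def field_matches(actual, expected, match: str = "exact",
                  tolerance: float = 0.0) -> bool:
    if match == "numeric_tolerance":
        # Numbers match when within the configured absolute tolerance.
        return abs(float(actual) - float(expected)) <= tolerance
    return actual == expected  # exact match
```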
### execution_metrics

Set performance thresholds.
```yaml
- name: performance
  type: execution_metrics
  max_tool_calls: 10
  max_tokens: 5000
  max_duration_ms: 30000
  max_cost_usd: 0.10
```
### skill_trigger

Detect whether the agent invoked a specific Claude Code skill as its first tool call.
```yaml
- type: skill-trigger
  skill: commit
  should_trigger: true
```
### contains

Check if output contains a substring.
```yaml
- type: contains
  value: "DENIED"
  required: true
```

### regex

Check if output matches a regular expression.
```yaml
- type: regex
  value: "Good (morning|afternoon|evening)"
```
### is_json

Check if output is valid JSON.
```yaml
- type: is_json
  required: true
```
### equals

Check if output exactly matches a value (both sides trimmed).
```yaml
- type: equals
  value: "42"
```
## Weights and Aggregation

Evaluator scores are combined using weighted averaging:
```
Final Score = Σ(evaluator_score × weight) / Σ(weights)
```

Example:
```yaml
evaluators:
  - name: correctness
    type: llm_judge
    weight: 3.0  # Most important
  - name: format
    type: code_judge
    weight: 1.0  # Less important
```

If correctness scores 0.9 and format scores 0.7:
```
Final Score = (0.9 × 3.0 + 0.7 × 1.0) / (3.0 + 1.0) = 0.85
```
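The same calculation as a minimal Python sketch, for illustration only:

```python
# Minimal sketch of the weighted-average aggregation described above.
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(s * weights[name] for name, s in scores.items()) / total_weight
```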
## Required Gates

Any evaluator can be marked `required`. When a required evaluator scores below the threshold, the verdict is forced to `fail` regardless of the aggregate score.
| Value | Behavior |
|---|---|
| `required: true` | Must score >= 0.8 (default threshold) |
| `required: 0.6` | Must score >= custom threshold (0-1) |
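The gate rule can be sketched as follows. This is a simplified, hypothetical model: it assumes a per-evaluator result dict shape and omits the `borderline` verdict band entirely.

```python
# Simplified, hypothetical sketch of required-gate enforcement; the real
# verdict logic also produces "borderline", which is omitted here.
DEFAULT_THRESHOLD = 0.8


def verdict(aggregate: float, results: list[dict]) -> str:
    for r in results:
        required = r.get("required", False)
        if required:
            # required: true uses the default threshold; a number overrides it.
            threshold = required if isinstance(required, float) else DEFAULT_THRESHOLD
            if r["score"] < threshold:
                return "fail"  # a failing required evaluator forces the verdict
    return "pass" if aggregate >= DEFAULT_THRESHOLD else "fail"
```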
```yaml
assert:
  - type: contains
    value: "DENIED"
    required: true  # Must pass (>= 0.8)
  - type: rubrics
    required: 0.6   # Must score at least 0.6
    criteria:
      - id: quality
        outcome: Response is well-structured
```

## Next Steps
- Code Judge - Custom script evaluation
- LLM Judge - Semantic evaluation
- Composite - Build evaluation pipelines