
Agent Evals

An industry-standard evaluation specification for AI agents.

Declarative Format

Define evaluations in simple YAML files. No complex code required.

7 Evaluator Types

Code judges, LLM judges, rubrics, composites, tool trajectory, and more.

Flexible Organization

Centralized or skill-based patterns - your choice.

Industry Standard

Based on production implementations. Built for adoption.

EVAL.yaml

```yaml
name: code-review
version: "1.0"
execution:
evaluators:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
evalcases:
  - id: detect-bug
    expected_outcome: Identifies the loop condition bug
    input:
      - role: user
        content: "Review this code..."
    rubrics:
      - Identifies the bug
      - Provides correct fix
```
| Type | Description |
| --- | --- |
| `code_judge` | Execute custom scripts for deterministic checks |
| `llm_judge` | LLM-based semantic evaluation |
| `rubric` | Structured criteria with weights |
| `composite` | Combine multiple evaluators |
| `tool_trajectory` | Validate agent tool usage |
| `field_accuracy` | Check structured data fields |
| `execution_metrics` | Latency, cost, and token limits |
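
To show how other evaluator types slot into the same EVAL.yaml structure, here is a minimal sketch combining a `code_judge`, a weighted `rubric`, and a `composite`. The field names `script`, `criteria`, `weight`, and the nested `evaluators` list are illustrative assumptions, not confirmed keys from the specification.

```yaml
# Illustrative sketch only: `script`, `criteria`, `weight`, and the nested
# `evaluators` list are assumed field names, not confirmed spec keys.
evaluators:
  - name: style-check
    type: code_judge
    script: ./judges/style_check.py      # hypothetical deterministic check script
  - name: review-rubric
    type: rubric
    criteria:
      - description: Identifies the bug
        weight: 0.6                      # assumed weighting syntax
      - description: Provides correct fix
        weight: 0.4
  - name: overall
    type: composite
    evaluators: [style-check, review-rubric]   # assumed composition syntax
```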

AgentV is the canonical implementation of the AgentEvals standard, providing CLI tools for running evaluations.


AgentEvals is an open specification. Contributions welcome on GitHub.