
Agent Evals

An industry-standard evaluation specification for AI agents.

Declarative Format

Define evaluations in simple YAML files. No complex code required.

7 Evaluator Types

Code judges, LLM judges, rubrics, composites, tool trajectory, and more.

Flexible Organization

Centralized or skill-based patterns - your choice.

Industry Standard

Based on production implementations. Built for adoption.

EVAL.yaml

```yaml
name: code-review
version: "1.0"
execution:
evaluators:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
evalcases:
  - id: detect-bug
    expected_outcome: Identifies the loop condition bug
    input:
      - role: user
        content: "Review this code..."
    rubrics:
      - Identifies the bug
      - Provides correct fix
```
| Type | Description |
| --- | --- |
| `code_judge` | Execute custom scripts for deterministic checks |
| `llm_judge` | LLM-based semantic evaluation |
| `rubric` | Structured criteria with weights |
| `composite` | Combine multiple evaluators |
| `tool_trajectory` | Validate agent tool usage |
| `field_accuracy` | Check structured data fields |
| `execution_metrics` | Latency, cost, and token limits |
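
To show how other evaluator types slot into the same EVAL.yaml structure, here is a minimal sketch combining a `code_judge`, a weighted `rubric`, and a `composite`. The field names `script`, `criteria`, `weight`, and the nested `evaluators` list are illustrative assumptions, not confirmed keys from the specification.

```yaml
# Illustrative sketch only: `script`, `criteria`, `weight`, and the nested
# `evaluators` list are assumed field names, not confirmed spec keys.
evaluators:
  - name: style-check
    type: code_judge
    script: ./judges/style_check.py      # hypothetical deterministic check script
  - name: review-rubric
    type: rubric
    criteria:
      - description: Identifies the bug
        weight: 0.6                      # assumed weighting syntax
      - description: Provides correct fix
        weight: 0.4
  - name: overall
    type: composite
    evaluators: [style-check, review-rubric]   # assumed composition syntax
```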

AgentV is the canonical implementation of the AgentEvals standard, providing CLI tools for running evaluations.


AgentEvals is an open specification. Contributions welcome on GitHub.