# Specification Overview
The AgentEvals specification defines a declarative format for evaluating AI agent capabilities. This section provides a complete reference for the specification.
## Design Philosophy

AgentEvals is designed with these goals:
- Declarative - Define what to evaluate, not how to run it
- Portable - Works across different agent implementations
- Composable - Build complex evaluations from simple parts
- Extensible - Add custom evaluators as needed
- AI-First - Readable by both humans and AI agents
## Specification Components

| Component | Location | Description |
|---|---|---|
| EVAL Format | EVAL.yaml | Main evaluation file structure |
| Evalcase Schema | Within EVAL.yaml | Individual test case definition |
| Evaluators | Referenced from EVAL.yaml | Assessment components |
| Verdicts | Results | Pass/borderline/fail determination |
| Organization | Directory | File organization patterns |
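
To make the organization row concrete, a minimal evaluation directory might be laid out as follows. This is a sketch assuming the file references used in the example in the next section, not a mandated structure:

```
my-evaluation/
├── EVAL.yaml        # main evaluation file
└── prompts/
    └── quality.md   # evaluator prompt referenced from EVAL.yaml
```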
## File Format

AgentEvals uses YAML for evaluation definitions:
```yaml
name: my-evaluation
version: "1.0"
description: Evaluates agent capabilities

metadata:
  author: my-org
  tags: [example, demo]

execution:
  target: default
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md

evalcases:
  - id: test-case-1
    expected_outcome: Agent accomplishes the task
    input:
      - role: user
        content: "Perform this task"
```
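
As an illustration of the verdicts component, a result record for the evalcase above might look like the sketch below. The field names here are assumptions for illustration; only the pass/borderline/fail vocabulary comes from the components table above:

```yaml
# Hypothetical verdict record -- field names are illustrative, not normative
evalcase: test-case-1
evaluator: quality
verdict: pass          # one of: pass | borderline | fail
reasoning: The agent completed the task as described in expected_outcome.
```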
## Version

The current specification version is 1.0.
The specification follows semantic versioning:
- Patch (1.0.x) - Clarifications, typo fixes
- Minor (1.x.0) - Additive features, new evaluator types
- Major (x.0.0) - Breaking changes
## Canonical Implementation

AgentV is the canonical implementation of the AgentEvals standard. It provides:
- CLI for running evaluations
- Provider integrations
- Result tracking
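
For orientation only, invoking the CLI against an evaluation file might look like the sketch below. The command name and argument are assumptions, so consult the AgentV documentation for the actual interface:

```sh
# Hypothetical invocation -- not verified against AgentV's real CLI
agentv run ./EVAL.yaml
```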
## Next Steps

- EVAL Format - Detailed file format
- Evalcase Schema - Test case structure
- Evaluators Reference - Evaluator type overview