Specification Overview

The AgentEvals specification defines a declarative format for evaluating AI agent capabilities. This section provides a complete reference for the specification.

AgentEvals is designed with these goals:

  1. Declarative - Define what to evaluate, not how to run it
  2. Portable - Works across different agent implementations
  3. Composable - Build complex evaluations from simple parts
  4. Extensible - Add custom evaluators as needed
  5. AI-First - Readable by both humans and AI agents
| Component | File | Description |
| --- | --- | --- |
| EVAL Format | EVAL.yaml | Main evaluation file structure |
| Evalcase Schema | Within EVAL.yaml | Individual test case definition |
| Evaluators | Referenced | Assessment components |
| Verdicts | Results | Pass/borderline/fail determination |
| Organization | Directory | File organization patterns |
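
For example, a single evaluation might live in its own directory, with the EVAL.yaml alongside any prompt files it references. This layout is illustrative only; the Organization section defines the actual patterns:

```
my-evaluation/
├── EVAL.yaml          # evaluation definition
└── prompts/
    └── quality.md     # prompt referenced by an llm_judge evaluator
```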

AgentEvals uses YAML for evaluation definitions:

```yaml
# EVAL.yaml
name: my-evaluation
version: "1.0"
description: Evaluates agent capabilities
metadata:
  author: my-org
  tags: [example, demo]
execution:
  target: default
evaluators:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
evalcases:
  - id: test-case-1
    expected_outcome: Agent accomplishes the task
    input:
      - role: user
        content: "Perform this task"
```
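
Because evaluations are plain YAML, they are easy to load and inspect programmatically in any language. The sketch below is illustrative only, assuming PyYAML is available; the field checks mirror the example above and are not the official schema validation.

```python
# Illustrative only: load an EVAL.yaml and sanity-check a few top-level fields.
# The field names mirror the example above; this is not the official validator.
import yaml  # PyYAML

def load_eval(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as f:
        doc = yaml.safe_load(f)
    # Minimal checks based on the fields shown in the example.
    for field in ("name", "version", "evalcases"):
        if field not in doc:
            raise ValueError(f"missing required field: {field}")
    return doc

evaluation = load_eval("EVAL.yaml")
print(evaluation["name"], "defines", len(evaluation["evalcases"]), "evalcase(s)")
```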

Current specification version: 1.0

The specification follows semantic versioning:

  • Patch (1.0.x) - Clarifications, typo fixes
  • Minor (1.x.0) - Additive features, new evaluator types
  • Major (x.0.0) - Breaking changes
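
Under these rules, a consumer that supports a given spec version can generally accept any document with the same major version and an equal or lower minor version. A minimal sketch of that check follows; it is an assumption about how consumers might apply the policy, not part of the specification itself.

```python
# Illustrative compatibility check based on the versioning policy above.
# Not defined by the specification; consumers may apply stricter rules.
def is_compatible(doc_version: str, supported: str = "1.0") -> bool:
    doc_major, doc_minor = (int(p) for p in doc_version.split(".")[:2])
    sup_major, sup_minor = (int(p) for p in supported.split(".")[:2])
    # Same major version required; a newer minor version may add features
    # the reader does not understand, so require doc minor <= supported minor.
    return doc_major == sup_major and doc_minor <= sup_minor

assert is_compatible("1.0")
assert not is_compatible("2.0")
```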

AgentV is the canonical implementation of the AgentEvals standard. It provides:

  • CLI for running evaluations
  • Provider integrations
  • Result tracking