Specification Overview

The AgentEvals specification defines a declarative format for evaluating AI agent capabilities. This section provides a complete reference for the specification.

AgentEvals is designed with these goals:

  1. Declarative - Define what to evaluate, not how to run it
  2. Portable - Works across different agent implementations
  3. Composable - Build complex evaluations from simple parts
  4. Extensible - Add custom evaluators as needed
  5. AI-First - Readable by both humans and AI agents
| Component | File | Description |
| --- | --- | --- |
| EVAL Format | EVAL.yaml | Main evaluation file structure |
| Evalcase Schema | Within EVAL.yaml | Individual test case definition |
| Evaluators | Referenced | Assessment components |
| Verdicts | Results | Pass/borderline/fail determination |
| Organization | Directory | File organization patterns |
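
For example, a single evaluation might live in its own directory, with the EVAL.yaml alongside any prompt files it references. This layout is illustrative only; the Organization section defines the actual patterns:

```
my-evaluation/
├── EVAL.yaml          # evaluation definition
└── prompts/
    └── quality.md     # prompt referenced by an llm_judge evaluator
```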

AgentEvals uses YAML for evaluation definitions:

```yaml
# EVAL.yaml
name: my-evaluation
version: "1.0"
description: Evaluates agent capabilities
metadata:
  author: my-org
  tags: [example, demo]
execution:
  target: default
evaluators:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
evalcases:
  - id: test-case-1
    expected_outcome: Agent accomplishes the task
    input:
      - role: user
        content: "Perform this task"
```
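
Because evaluations are plain YAML, they are easy to load and inspect programmatically in any language. The sketch below is illustrative only, assuming PyYAML is available; the field checks mirror the example above and are not the official schema validation.

```python
# Illustrative only: load an EVAL.yaml and sanity-check a few top-level fields.
# The field names mirror the example above; this is not the official validator.
import yaml  # PyYAML

def load_eval(path: str) -> dict:
    with open(path, "r", encoding="utf-8") as f:
        doc = yaml.safe_load(f)
    # Minimal checks based on the fields shown in the example.
    for field in ("name", "version", "evalcases"):
        if field not in doc:
            raise ValueError(f"missing required field: {field}")
    return doc

evaluation = load_eval("EVAL.yaml")
print(evaluation["name"], "defines", len(evaluation["evalcases"]), "evalcase(s)")
```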

Current specification version: 1.0

The specification follows semantic versioning:

  • Patch (1.0.x) - Clarifications, typo fixes
  • Minor (1.x.0) - Additive features, new evaluator types
  • Major (x.0.0) - Breaking changes
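
Under these rules, a consumer that supports a given spec version can generally accept any document with the same major version and an equal or lower minor version. A minimal sketch of that check follows; it is an assumption about how consumers might apply the policy, not part of the specification itself.

```python
# Illustrative compatibility check based on the versioning policy above.
# Not defined by the specification; consumers may apply stricter rules.
def is_compatible(doc_version: str, supported: str = "1.0") -> bool:
    doc_major, doc_minor = (int(p) for p in doc_version.split(".")[:2])
    sup_major, sup_minor = (int(p) for p in supported.split(".")[:2])
    # Same major version required; a newer minor version may add features
    # the reader does not understand, so require doc minor <= supported minor.
    return doc_major == sup_major and doc_minor <= sup_minor

assert is_compatible("1.0")
assert not is_compatible("2.0")
```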

AgentV is the canonical implementation of the AgentEvals standard. It provides:

  • CLI for running evaluations
  • Provider integrations
  • Result tracking