Introduction

AgentEvals is an open specification for evaluating AI agent capabilities. It provides a declarative YAML-based format for defining evaluation suites that can measure agent performance across diverse tasks.

AI agents are becoming increasingly capable, but measuring their performance consistently remains challenging:

  • No standard format - Every team invents their own evaluation approach
  • Inconsistent metrics - Hard to compare results across different agents
  • Fragmented tooling - Tools don’t interoperate
  • Manual effort - Writing evaluation code from scratch is tedious

AgentEvals provides:

  • Declarative YAML format - Simple, readable evaluation definitions
  • Standard evaluator types - Code judges, LLM judges, rubrics, and more
  • Flexible organization - Centralized or skill-based patterns
  • Composability - Build complex evaluations from primitives

At the core of the specification is the evaluation suite, the primary specification file that defines an evaluation:

name: my-agent-eval
version: "1.0"
description: Evaluates my agent capabilities
evalcases:
  - id: basic-task
    expected_outcome: Agent completes the task correctly
    input:
      - role: user
        content: "Perform this task..."

Evalcases are the individual test cases within an evaluation suite. Each evalcase defines the following, illustrated in the sketch after this list:

  • Input - What to send to the agent
  • Expected outcome - Natural language description of success
  • Evaluators - How to measure success (optional, can inherit from suite)
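
For illustration, a hypothetical evalcase that brings its own evaluator might look like the sketch below. The evaluators block and its fields are assumptions chosen for readability, not the normative schema:

evalcases:
  - id: weather-lookup
    expected_outcome: Agent reports the current weather for the requested city
    input:
      - role: user
        content: "What is the weather in Paris right now?"
    # Hypothetical per-case evaluator; the field names below are assumptions.
    evaluators:
      - type: llm_judge
        prompt: Did the agent provide a current weather report for the requested city?

Cases that omit evaluators can inherit whichever evaluators the suite defines, which keeps simple suites terse.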

Evaluators are the components that assess agent outputs. The specification defines these built-in types; a composition sketch follows the table:

Type                 Purpose
code_judge           Execute custom scripts
llm_judge            LLM-based semantic evaluation
rubric               Structured criteria with weights
composite            Combine multiple evaluators
tool_trajectory      Validate tool usage patterns
field_accuracy       Check structured data fields
execution_metrics    Measure latency, cost, tokens
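
As a sketch of how these types compose, a composite evaluator might weight a code_judge against an llm_judge. The nesting key and member fields shown here (script, prompt, weight) are assumptions for illustration only:

evaluators:
  - type: composite
    # Hypothetical composite combining two member evaluators; schema is assumed.
    evaluators:
      - type: code_judge
        script: ./checks/validate_output.py   # custom check script (assumed path)
        weight: 0.4
      - type: llm_judge
        prompt: Does the response satisfy the expected outcome?
        weight: 0.6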

Evaluation results are normalized to three verdicts:

  • pass - Score >= 0.8
  • borderline - 0.6 <= Score < 0.8
  • fail - Score < 0.6
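
For example, a combined score of 0.72 falls in the borderline band. A hypothetical results excerpt (the report format here is assumed, not part of the spec) could record the mapping like this:

results:
  - evalcase: basic-task
    score: 0.93          # >= 0.8
    verdict: pass
  - evalcase: weather-lookup
    score: 0.72          # 0.6 <= score < 0.8
    verdict: borderline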

AgentEvals follows agent-native architecture principles:

  1. Parity - Evals should test what agents actually do in production
  2. Granularity - Prefer atomic evaluators that compose
  3. Composability - Build sophisticated evals from simple primitives
  4. Emergent Capability - Support testing novel, unanticipated behaviors
  5. Improvement Over Time - Track progress as agents evolve

AgentV is the canonical implementation of the AgentEvals standard. It provides:

  • CLI for running evaluations
  • Provider integrations (Azure, Anthropic, etc.)
  • Result tracking and reporting

Other tools can adopt the AgentEvals specification to ensure interoperability.