# EVAL Format

The EVAL.yaml file is the primary specification file for defining agent evaluations.

```yaml
# Required fields
name: string                 # Unique identifier
tests: Test[]                # Array of tests

# Optional fields
version: string              # Spec version (default: "1.0")
description: string          # Human-readable description
metadata: object             # Custom key-value pairs
execution: ExecutionConfig   # Default execution settings
```
Example:

````yaml
name: code-review
version: "1.0"
description: |
  Evaluates code review capabilities including bug detection,
  style suggestions, and security analysis.
metadata:
  author: example-org
  license: Apache-2.0
  tags: [coding, review, security]
  skill: code-review
execution:
  target: default
  timeout_seconds: 300
  assert:
    - name: correctness
      type: llm_judge
      prompt: ./prompts/correctness.md
      weight: 2.0
    - name: format_check
      type: code_judge
      script: ["python", "./judges/format.py"]
      weight: 1.0
tests:
  - id: detect-off-by-one
    description: Detect classic off-by-one loop error
    criteria: |
      Identifies the loop condition bug where i < 0 should be i < items.length
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content: |
          Review this JavaScript function:
          ```javascript
          function getTotal(items) {
            let sum = 0;
            for (let i = 0; i < 0; i++) {
              sum += items[i].value;
            }
            return sum;
          }
          ```
    expected_output:
      - role: assistant
        content: |
          Bug detected: Loop condition `i < 0` is always false.
    rubrics:
      - Identifies the loop never executes
      - Provides correct fix
      - Explains the issue clearly
    assert:
      - name: bug_check
        type: code_judge
        script: ["python", "./judges/bug_check.py"]
````
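The `weight` values above control how much each judge contributes to an overall score. As an illustration only (the spec's actual aggregation rule is not shown on this page), a weighted average of judge scores looks like:

```python
def weighted_score(results):
    """Combine per-judge scores in [0, 1] using their weights.

    `results` is a list of (score, weight) pairs, e.g. the example's
    correctness judge (weight 2.0) and format_check judge (weight 1.0).
    """
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in results) / total_weight

# Hypothetical results: correctness scored 0.9, format_check scored 0.6
print(round(weighted_score([(0.9, 2.0), (0.6, 1.0)]), 2))  # 0.8
```

With weight 2.0, a correctness failure moves the total twice as far as a format failure would.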

## name

Unique identifier for the evaluation suite.

  • Type: string
  • Constraints: 1-64 characters, lowercase, alphanumeric with hyphens
  • Pattern: `^[a-z][a-z0-9-]*[a-z0-9]$`

```yaml
name: code-review
name: document-extraction
name: rag-accuracy
```
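The pattern can be checked mechanically; a small Python sketch (note the pattern itself implies a minimum of two characters):

```python
import re

NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*[a-z0-9]$")

def is_valid_name(name):
    """Check a suite name against the documented pattern and length limit."""
    return len(name) <= 64 and NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("code-review"))   # True
print(is_valid_name("Code_Review"))   # False: uppercase and underscore
print(is_valid_name("-review"))       # False: must start with a letter
```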

## version

Specification version this file conforms to.

  • Type: string
  • Default: "1.0"
  • Format: Semantic version

```yaml
version: "1.0"
```

## description

Human-readable description of what this evaluation suite covers.

  • Type: string
  • Max length: 2048 characters

```yaml
description: |
  Evaluates code review capabilities including:
  - Bug detection
  - Style suggestions
  - Security analysis
```

## metadata

Custom key-value pairs for organization and discovery.

  • Type: object
  • Reserved keys: author, license, tags, skill

```yaml
metadata:
  author: my-organization
  license: Apache-2.0
  tags: [coding, review]
  skill: code-review          # Links to AgentSkills
  custom_field: custom_value  # Any additional data
```

## execution

Default execution settings for all tests.

  • Type: ExecutionConfig

```yaml
execution:
  target: default        # Target provider name
  timeout_seconds: 300   # Max execution time
  assert:                # Default evaluators
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
```

See Evaluators for evaluator configuration.

## tests

Array of tests. Each element is either an inline test object or a string file path to import.

  • Type: (Test | string)[] | string
  • Min items: 1

Inline tests:

```yaml
tests:
  - id: greeting
    criteria: Agent responds with a greeting
    input: "Hello!"
```

File references:

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
```

External file (inverted sidecar):

```yaml
tests: ./cases.jsonl
```

When tests is a string path, it references an external YAML or JSONL file. The metadata and config stay in the main EVAL.yaml while test data lives separately.

Mixed (inline + file references):

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
  - id: quick-check
    criteria: Agent responds within constraints
    input: "Hello!"
```

File references resolve relative to the EVAL.yaml directory. Each referenced file contains an array of test objects.
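Putting the shapes together, a loader might normalize the tests field like this (a hypothetical sketch; `normalize_tests` is not part of the spec):

```python
def normalize_tests(tests):
    """Normalize the `tests` field into (kind, value) pairs.

    Accepts a single file-path string, or a list mixing inline test
    objects with file-path strings.
    """
    if isinstance(tests, str):  # tests: ./cases.jsonl
        return [("file", tests)]
    pairs = []
    for entry in tests:
        if isinstance(entry, str):  # - ./security.yaml
            pairs.append(("file", entry))
        else:  # inline test object
            pairs.append(("inline", entry))
    return pairs

print(normalize_tests("./cases.jsonl"))
# [('file', './cases.jsonl')]
print(normalize_tests(["./security.yaml", {"id": "quick-check"}]))
# [('file', './security.yaml'), ('inline', {'id': 'quick-check'})]
```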

See Test Schema for the full schema.

## Path resolution

Relative paths in EVAL.yaml are resolved from the file’s directory:

```yaml
# If EVAL.yaml is at /project/evals/code-review/EVAL.yaml
assert:
  - name: check
    type: code_judge
    script: ["python", "./judges/check.py"]
    # Resolves to: /project/evals/code-review/judges/check.py
tests:
  - id: example
    input:
      - role: user
        content:
          - type: file
            value: ./fixtures/sample.js
            # Resolves to: /project/evals/code-review/fixtures/sample.js
```

Absolute paths are resolved from the repository root:

```yaml
content:
  - type: file
    value: /shared/prompts/system.md
    # Resolves to: /project/shared/prompts/system.md
```
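Under these two rules, resolution reduces to choosing the base directory; a sketch (the repo root and eval directory values are assumptions for illustration):

```python
from pathlib import PurePosixPath

def resolve_path(value, eval_dir, repo_root):
    """Resolve a path from EVAL.yaml: absolute paths are rooted at the
    repository root, relative paths at the EVAL.yaml directory."""
    if value.startswith("/"):
        return PurePosixPath(repo_root) / value.lstrip("/")
    return PurePosixPath(eval_dir) / value

eval_dir = "/project/evals/code-review"  # directory containing EVAL.yaml
repo_root = "/project"

print(resolve_path("./judges/check.py", eval_dir, repo_root))
# /project/evals/code-review/judges/check.py
print(resolve_path("/shared/prompts/system.md", eval_dir, repo_root))
# /project/shared/prompts/system.md
```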

## JSONL datasets

For large evaluations, use JSONL format with one test per line:

dataset.jsonl:

```jsonl
{"id": "test-1", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
{"id": "test-2", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
```

dataset.yaml (sidecar for shared config):

```yaml
name: large-eval
assert:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
```
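Each JSONL line is an independent JSON document, so loading reduces to line-by-line parsing; a minimal sketch:

```python
import json

def load_jsonl(text):
    """Parse JSONL test data: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

data = (
    '{"id": "test-1", "criteria": "...", "input": [{"role": "user", "content": "..."}]}\n'
    '{"id": "test-2", "criteria": "...", "input": [{"role": "user", "content": "..."}]}\n'
)
tests = load_jsonl(data)
print([t["id"] for t in tests])  # ['test-1', 'test-2']
```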

## Validation

Use JSON Schema to validate EVAL.yaml files:

```sh
# Using the agentv CLI
agentv validate ./EVAL.yaml

# Using ajv (JSON Schema)
npx ajv validate -s eval.schema.json -d EVAL.yaml
```
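Before reaching for the full schema, the two required fields can be checked in a few lines (a sketch only; the real schema enforces much more, such as the name pattern and test shapes):

```python
def check_required(doc):
    """Return error messages for missing required EVAL.yaml fields."""
    errors = []
    if not isinstance(doc.get("name"), str) or not doc["name"]:
        errors.append("name: required non-empty string")
    if not doc.get("tests"):
        errors.append("tests: required, with at least one entry")
    return errors

print(check_required({"name": "code-review", "tests": [{"id": "t1"}]}))  # []
print(check_required({"version": "1.0"}))
# ['name: required non-empty string', 'tests: required, with at least one entry']
```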