# EVAL Format
The EVAL.yaml file is the primary specification file for defining agent evaluations.
## File Structure

```yaml
# Required fields
name: string               # Unique identifier
tests: Test[]              # Array of tests

# Optional fields
version: string            # Spec version (default: "1.0")
description: string        # Human-readable description
metadata: object           # Custom key-value pairs
execution: ExecutionConfig # Default execution settings
```
## Complete Example

````yaml
name: code-review
version: "1.0"
description: |
  Evaluates code review capabilities including bug detection,
  style suggestions, and security analysis.

metadata:
  author: example-org
  license: Apache-2.0
  tags: [coding, review, security]
  skill: code-review

execution:
  target: default
  timeout_seconds: 300
  assert:
    - name: correctness
      type: llm_judge
      prompt: ./prompts/correctness.md
      weight: 2.0
    - name: format_check
      type: code_judge
      script: ["python", "./judges/format.py"]
      weight: 1.0

tests:
  - id: detect-off-by-one
    description: Detect classic off-by-one loop error
    criteria: |
      Identifies the loop condition bug where i < 0 should be i < items.length
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content: |
          Review this JavaScript function:

          ```javascript
          function getTotal(items) {
            let sum = 0;
            for (let i = 0; i < 0; i++) {
              sum += items[i].value;
            }
            return sum;
          }
          ```
    expected_output:
      - role: assistant
        content: |
          Bug detected: Loop condition `i < 0` is always false.
    rubrics:
      - Identifies the loop never executes
      - Provides correct fix
      - Explains the issue clearly
    assert:
      - name: bug_check
        type: code_judge
        script: ["python", "./judges/bug_check.py"]
````
## Field Reference

### name (required)

Unique identifier for the evaluation suite.

- Type: `string`
- Constraints: 1-64 characters, lowercase, alphanumeric with hyphens
- Pattern: `^[a-z][a-z0-9-]*[a-z0-9]$`

```yaml
name: code-review
name: document-extraction
name: rag-accuracy
```
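The constraints above can be checked mechanically. A minimal sketch in Python (`is_valid_name` is an illustrative helper, not part of any published tooling):

```python
import re

# Pattern from the spec: starts with a lowercase letter, contains
# lowercase alphanumerics and hyphens, does not end with a hyphen.
NAME_RE = re.compile(r"^[a-z][a-z0-9-]*[a-z0-9]$")

def is_valid_name(name: str) -> bool:
    """Check a suite name against the spec's pattern and length limit."""
    return len(name) <= 64 and bool(NAME_RE.fullmatch(name))

print(is_valid_name("code-review"))   # → True
print(is_valid_name("Code-Review"))   # → False (uppercase)
print(is_valid_name("rag-"))          # → False (trailing hyphen)
```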
### version

Specification version this file conforms to.

- Type: `string`
- Default: `"1.0"`
- Format: Semantic version

```yaml
version: "1.0"
```
### description

Human-readable description of what this evaluation suite covers.

- Type: `string`
- Max length: 2048 characters

```yaml
description: |
  Evaluates code review capabilities including:
  - Bug detection
  - Style suggestions
  - Security analysis
```
### metadata

Custom key-value pairs for organization and discovery.

- Type: `object`
- Reserved keys: `author`, `license`, `tags`, `skill`

```yaml
metadata:
  author: my-organization
  license: Apache-2.0
  tags: [coding, review]
  skill: code-review          # Links to AgentSkills
  custom_field: custom_value  # Any additional data
```
### execution

Default execution settings for all tests.

- Type: `ExecutionConfig`

```yaml
execution:
  target: default        # Target provider name
  timeout_seconds: 300   # Max execution time
  assert:                # Default evaluators
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
```

See Evaluators for evaluator configuration.
### tests (required)

Array of tests. Each element is either an inline test object or a string file path to import.

- Type: `(Test | string)[] | string`
- Min items: 1

Inline tests:

```yaml
tests:
  - id: greeting
    criteria: Agent responds with a greeting
    input: "Hello!"
```

File references:

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
```

External file (inverted sidecar):

```yaml
tests: ./cases.jsonl
```

When `tests` is a string path, it references an external YAML or JSONL file. The metadata and config stay in the main EVAL.yaml while test data lives separately.

Mixed (inline + file references):

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
  - id: quick-check
    criteria: Agent responds within constraints
    input: "Hello!"
```

File references resolve relative to the EVAL.yaml directory. Each referenced file contains an array of test objects.

See Test Schema for full schema.
## File Resolution

Relative paths in EVAL.yaml are resolved from the file's directory:

```yaml
# If EVAL.yaml is at /project/evals/code-review/EVAL.yaml

assert:
  - name: check
    type: code_judge
    script: ["python", "./judges/check.py"]
    # Resolves to: /project/evals/code-review/judges/check.py

tests:
  - id: example
    input:
      - role: user
        content:
          - type: file
            value: ./fixtures/sample.js
            # Resolves to: /project/evals/code-review/fixtures/sample.js
```

Absolute paths are resolved from the repository root:

```yaml
content:
  - type: file
    value: /shared/prompts/system.md
    # Resolves to: /project/shared/prompts/system.md
```
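The two rules fit in a few lines. A minimal sketch assuming POSIX-style paths (`resolve_eval_path` is a hypothetical helper, not part of the spec):

```python
from pathlib import PurePosixPath

def resolve_eval_path(value: str, eval_dir: str, repo_root: str) -> str:
    """Resolve a path the way EVAL.yaml does: a leading "/" means
    repo-root-relative; anything else is relative to the directory
    containing EVAL.yaml."""
    if value.startswith("/"):
        return str(PurePosixPath(repo_root) / value.lstrip("/"))
    return str(PurePosixPath(eval_dir) / value)

print(resolve_eval_path("./judges/check.py",
                        "/project/evals/code-review", "/project"))
# → /project/evals/code-review/judges/check.py
```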
## JSONL Format

For large evaluations, use JSONL format with one test per line:

`dataset.jsonl`:

```jsonl
{"id": "test-1", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
{"id": "test-2", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
```

`dataset.yaml` (sidecar for shared config):

```yaml
name: large-eval

assert:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
```
Section titled “Validation”Use JSON Schema to validate EVAL.yaml files:
# Using agent v CLIagentv validate ./EVAL.yaml
# Using jsonschemanpx ajv validate -s eval.schema.json -d EVAL.yamlNext Steps
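Short of full schema validation, the two required fields can be checked with a stdlib-only sketch (`check_required` is illustrative, not a published API):

```python
def check_required(doc: dict) -> list:
    """Minimal structural check for a parsed EVAL document: verifies
    the two required fields, `name` and `tests`. Not a substitute for
    JSON Schema validation."""
    errors = []
    if not isinstance(doc.get("name"), str) or not doc.get("name"):
        errors.append("name: required non-empty string")
    tests = doc.get("tests")
    # `tests` may be an array of tests/paths or a single file path
    if not (isinstance(tests, str) or (isinstance(tests, list) and tests)):
        errors.append("tests: required non-empty array or file path")
    return errors

print(check_required({"name": "code-review", "tests": [{"id": "t1"}]}))
# → []
```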
## Next Steps

- Test Schema - Individual test structure
- Evaluators - Evaluator configuration
- Organization - File organization patterns