# EVAL Format
The EVAL.yaml file is the primary specification file for defining agent evaluations.
## File Structure

```yaml
# Required fields
name: string                  # Unique identifier
evalcases: Evalcase[]         # Array of test cases

# Optional fields
version: string               # Spec version (default: "1.0")
description: string           # Human-readable description
metadata: object              # Custom key-value pairs
execution: ExecutionConfig    # Default execution settings
```

## Complete Example
````yaml
name: code-review
version: "1.0"
description: |
  Evaluates code review capabilities including bug detection,
  style suggestions, and security analysis.

metadata:
  author: example-org
  license: Apache-2.0
  tags: [coding, review, security]
  skill: code-review

execution:
  target: default
  timeout_seconds: 300
  evaluators:
    - name: correctness
      type: llm_judge
      prompt: ./prompts/correctness.md
      weight: 2.0
    - name: format_check
      type: code_judge
      script: ["python", "./judges/format.py"]
      weight: 1.0

evalcases:
  - id: detect-off-by-one
    description: Detect classic off-by-one loop error
    expected_outcome: |
      Identifies the loop condition bug where i < 0 should be i < items.length
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content: |
          Review this JavaScript function:

          ```javascript
          function getTotal(items) {
            let sum = 0;
            for (let i = 0; i < 0; i++) {
              sum += items[i].value;
            }
            return sum;
          }
          ```
    expected_output:
      - role: assistant
        content: |
          Bug detected: Loop condition `i < 0` is always false.
    rubrics:
      - Identifies the loop never executes
      - Provides correct fix
      - Explains the issue clearly
    execution:
      evaluators:
        - name: bug_check
          type: code_judge
          script: ["python", "./judges/bug_check.py"]
````
## Field Reference
### name (required)
Unique identifier for the evaluation suite.
- **Type:** `string`
- **Constraints:** 1-64 characters, lowercase, alphanumeric with hyphens
- **Pattern:** `^[a-z][a-z0-9-]*[a-z0-9]$`

```yaml
name: code-review
name: document-extraction
name: rag-accuracy
```
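If you generate suite names programmatically, the pattern can be checked before writing the file. A minimal sketch in Python, using the pattern verbatim from the constraint above:

```python
import re

# Pattern from the constraint above: lowercase, alphanumeric with hyphens,
# starting with a letter and ending with a letter or digit.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*[a-z0-9]$")

for candidate in ["code-review", "RAG-Accuracy", "-bad-start", "rag-accuracy"]:
    print(candidate, bool(NAME_PATTERN.fullmatch(candidate)))
```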
### version

Specification version this file conforms to.
- **Type:** `string`
- **Default:** `"1.0"`
- **Format:** Semantic version
version: "1.0"description
Human-readable description of what this evaluation suite covers.
- **Type:** `string`
- **Max length:** 2048 characters
```yaml
description: |
  Evaluates code review capabilities including:
  - Bug detection
  - Style suggestions
  - Security analysis
```

### metadata
Custom key-value pairs for organization and discovery.
- **Type:** `object`
- **Reserved keys:** `author`, `license`, `tags`, `skill`
```yaml
metadata:
  author: my-organization
  license: Apache-2.0
  tags: [coding, review]
  skill: code-review            # Links to AgentSkills
  custom_field: custom_value    # Any additional data
```

### execution
Default execution settings for all evalcases.
- **Type:** `ExecutionConfig`
```yaml
execution:
  target: default          # Target provider name
  timeout_seconds: 300     # Max execution time
  evaluators:              # Default evaluators
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
```

See Evaluators for evaluator configuration.
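A `code_judge` evaluator runs the configured script as a command. Its input/output contract is documented on the Evaluators page; the sketch below is purely illustrative and assumes, only for the sake of the example, that the harness passes the evalcase and the agent's output as JSON on stdin and reads a JSON verdict from stdout.

```python
#!/usr/bin/env python3
"""Hypothetical code_judge script (e.g. ./judges/bug_check.py).

Illustrative only: the stdin/stdout contract assumed here is not defined
on this page; see the Evaluators page for the actual interface.
"""
import json
import sys

payload = json.load(sys.stdin)            # assumed shape: {"evalcase": ..., "output": "..."}
output_text = payload.get("output", "")

# Toy check tied to the Complete Example above: did the review mention
# the broken loop condition?
passed = "i < 0" in output_text

json.dump({"score": 1.0 if passed else 0.0, "passed": passed}, sys.stdout)
```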
### evalcases (required)
Array of evaluation cases.
- **Type:** `Evalcase[]`
- **Min items:** 1
See Evalcase Schema for the full schema.
## File Resolution

Relative paths in EVAL.yaml are resolved from the file's directory:
```yaml
# If EVAL.yaml is at /project/evals/code-review/EVAL.yaml

execution:
  evaluators:
    - name: check
      type: code_judge
      script: ["python", "./judges/check.py"]
      # Resolves to: /project/evals/code-review/judges/check.py

evalcases:
  - id: example
    input:
      - role: user
        content:
          - type: file
            value: ./fixtures/sample.js
            # Resolves to: /project/evals/code-review/fixtures/sample.js
```

Absolute paths are resolved from the repository root:
```yaml
content:
  - type: file
    value: /shared/prompts/system.md
    # Resolves to: /project/shared/prompts/system.md
```
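The rule is easy to restate in code. The sketch below is not the loader's implementation, just an illustration of the resolution described above; it assumes the repository root is already known to the caller:

```python
from pathlib import Path

def resolve_reference(value: str, eval_yaml_path: Path, repo_root: Path) -> Path:
    """Relative paths resolve against the EVAL.yaml directory,
    absolute paths against the repository root."""
    if value.startswith("/"):
        return repo_root / value.lstrip("/")
    return eval_yaml_path.parent / value

eval_yaml = Path("/project/evals/code-review/EVAL.yaml")
repo_root = Path("/project")

print(resolve_reference("./judges/check.py", eval_yaml, repo_root))
# -> /project/evals/code-review/judges/check.py
print(resolve_reference("/shared/prompts/system.md", eval_yaml, repo_root))
# -> /project/shared/prompts/system.md
```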
## JSONL Format

For large evaluations, use JSONL format with one evalcase per line:
`dataset.jsonl`:
{"id": "case-1", "expected_outcome": "...", "input": [{"role": "user", "content": "..."}]}{"id": "case-2", "expected_outcome": "...", "input": [{"role": "user", "content": "..."}]}dataset.yaml (sidecar for shared config):
```yaml
name: large-eval
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
```
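Large datasets are usually generated rather than written by hand. For example, a short script can emit one evalcase per line (field names follow the examples above):

```python
import json

cases = [
    {"id": "case-1", "expected_outcome": "...",
     "input": [{"role": "user", "content": "..."}]},
    {"id": "case-2", "expected_outcome": "...",
     "input": [{"role": "user", "content": "..."}]},
]

# One JSON object per line keeps the file line-oriented and streamable.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
```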
## Validation

Use JSON Schema to validate EVAL.yaml files:
```bash
# Using the AgentV CLI
agentv validate ./EVAL.yaml

# Using ajv (JSON Schema validator)
npx ajv validate -s eval.schema.json -d EVAL.yaml
```
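Validation can also be done programmatically. A minimal sketch, assuming `eval.schema.json` is available locally and the PyYAML and jsonschema packages are installed:

```python
import json

import yaml                      # PyYAML
from jsonschema import validate  # jsonschema package

with open("eval.schema.json", encoding="utf-8") as f:
    schema = json.load(f)

with open("EVAL.yaml", encoding="utf-8") as f:
    document = yaml.safe_load(f)

# Raises jsonschema.exceptions.ValidationError if the file does not conform.
validate(instance=document, schema=schema)
print("EVAL.yaml is valid")
```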
## Next Steps

- Evalcase Schema - Individual case structure
- Evaluators - Evaluator configuration
- Organization - File organization patterns