# EVAL Format

The EVAL.yaml file is the primary specification file for defining agent evaluations.

```yaml
# Required fields
name: string                 # Unique identifier
tests: Test[]                # Array of tests

# Optional fields
version: string              # Spec version (default: "1.0")
description: string          # Human-readable description
metadata: object             # Custom key-value pairs
execution: ExecutionConfig   # Default execution settings
```
Example:

````yaml
name: code-review
version: "1.0"
description: |
  Evaluates code review capabilities including bug detection,
  style suggestions, and security analysis.
metadata:
  author: example-org
  license: Apache-2.0
  tags: [coding, review, security]
  skill: code-review
execution:
  target: default
  timeout_seconds: 300
  assert:
    - name: correctness
      type: llm_judge
      prompt: ./prompts/correctness.md
      weight: 2.0
    - name: format_check
      type: code_judge
      script: ["python", "./judges/format.py"]
      weight: 1.0
tests:
  - id: detect-off-by-one
    description: Detect classic off-by-one loop error
    criteria: |
      Identifies the loop condition bug where i < 0 should be i < items.length
    input:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content: |
          Review this JavaScript function:
          ```javascript
          function getTotal(items) {
            let sum = 0;
            for (let i = 0; i < 0; i++) {
              sum += items[i].value;
            }
            return sum;
          }
          ```
    expected_output:
      - role: assistant
        content: |
          Bug detected: Loop condition `i < 0` is always false.
    rubrics:
      - Identifies the loop never executes
      - Provides correct fix
      - Explains the issue clearly
    assert:
      - name: bug_check
        type: code_judge
        script: ["python", "./judges/bug_check.py"]
````
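The `weight` values above control how much each judge contributes to an overall score. As an illustration only (the spec's actual aggregation rule is not shown on this page), a weighted average of judge scores looks like:

```python
def weighted_score(results):
    """Combine per-judge scores in [0, 1] using their weights.

    `results` is a list of (score, weight) pairs, e.g. the example's
    correctness judge (weight 2.0) and format_check judge (weight 1.0).
    """
    total_weight = sum(weight for _, weight in results)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in results) / total_weight

# Hypothetical results: correctness scored 0.9, format_check scored 0.6
print(round(weighted_score([(0.9, 2.0), (0.6, 1.0)]), 2))  # 0.8
```

With weight 2.0, a correctness failure moves the total twice as far as a format failure would.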

## name

Unique identifier for the evaluation suite.

  • Type: string
  • Constraints: 1-64 characters, lowercase, alphanumeric with hyphens
  • Pattern: `^[a-z][a-z0-9-]*[a-z0-9]$`

```yaml
name: code-review
name: document-extraction
name: rag-accuracy
```
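The pattern can be checked mechanically; a small Python sketch (note the pattern itself implies a minimum of two characters):

```python
import re

NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*[a-z0-9]$")

def is_valid_name(name):
    """Check a suite name against the documented pattern and length limit."""
    return len(name) <= 64 and NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("code-review"))   # True
print(is_valid_name("Code_Review"))   # False: uppercase and underscore
print(is_valid_name("-review"))       # False: must start with a letter
```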

## version

Specification version this file conforms to.

  • Type: string
  • Default: "1.0"
  • Format: Semantic version

```yaml
version: "1.0"
```

## description

Human-readable description of what this evaluation suite covers.

  • Type: string
  • Max length: 2048 characters

```yaml
description: |
  Evaluates code review capabilities including:
  - Bug detection
  - Style suggestions
  - Security analysis
```

## metadata

Custom key-value pairs for organization and discovery.

  • Type: object
  • Reserved keys: author, license, tags, skill

```yaml
metadata:
  author: my-organization
  license: Apache-2.0
  tags: [coding, review]
  skill: code-review          # Links to AgentSkills
  custom_field: custom_value  # Any additional data
```

## execution

Default execution settings for all tests.

  • Type: ExecutionConfig

```yaml
execution:
  target: default        # Target provider name
  timeout_seconds: 300   # Max execution time
  assert:                # Default evaluators
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality.md
```

See Evaluators for evaluator configuration.

## tests

Array of tests. Each element is either an inline test object or a string file path to import.

  • Type: (Test | string)[] | string
  • Min items: 1

Inline tests:

```yaml
tests:
  - id: greeting
    criteria: Agent responds with a greeting
    input: "Hello!"
```

File references:

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
```

External file (inverted sidecar):

```yaml
tests: ./cases.jsonl
```

When tests is a string path, it references an external YAML or JSONL file. The metadata and config stay in the main EVAL.yaml while test data lives separately.

Mixed (inline + file references):

```yaml
tests:
  - ./security.yaml
  - ./style.yaml
  - id: quick-check
    criteria: Agent responds within constraints
    input: "Hello!"
```

File references resolve relative to the EVAL.yaml directory. Each referenced file contains an array of test objects.
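Putting the shapes together, a loader might normalize the tests field like this (a hypothetical sketch; `normalize_tests` is not part of the spec):

```python
def normalize_tests(tests):
    """Normalize the `tests` field into (kind, value) pairs.

    Accepts a single file-path string, or a list mixing inline test
    objects with file-path strings.
    """
    if isinstance(tests, str):  # tests: ./cases.jsonl
        return [("file", tests)]
    pairs = []
    for entry in tests:
        if isinstance(entry, str):  # - ./security.yaml
            pairs.append(("file", entry))
        else:  # inline test object
            pairs.append(("inline", entry))
    return pairs

print(normalize_tests("./cases.jsonl"))
# [('file', './cases.jsonl')]
print(normalize_tests(["./security.yaml", {"id": "quick-check"}]))
# [('file', './security.yaml'), ('inline', {'id': 'quick-check'})]
```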

See Test Schema for the full schema.

## Path resolution

Relative paths in EVAL.yaml are resolved from the file’s directory:

```yaml
# If EVAL.yaml is at /project/evals/code-review/EVAL.yaml
assert:
  - name: check
    type: code_judge
    script: ["python", "./judges/check.py"]
    # Resolves to: /project/evals/code-review/judges/check.py
tests:
  - id: example
    input:
      - role: user
        content:
          - type: file
            value: ./fixtures/sample.js
            # Resolves to: /project/evals/code-review/fixtures/sample.js
```

Absolute paths are resolved from the repository root:

```yaml
content:
  - type: file
    value: /shared/prompts/system.md
    # Resolves to: /project/shared/prompts/system.md
```
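Under these two rules, resolution reduces to choosing the base directory; a sketch (the repo root and eval directory values are assumptions for illustration):

```python
from pathlib import PurePosixPath

def resolve_path(value, eval_dir, repo_root):
    """Resolve a path from EVAL.yaml: absolute paths are rooted at the
    repository root, relative paths at the EVAL.yaml directory."""
    if value.startswith("/"):
        return PurePosixPath(repo_root) / value.lstrip("/")
    return PurePosixPath(eval_dir) / value

eval_dir = "/project/evals/code-review"  # directory containing EVAL.yaml
repo_root = "/project"

print(resolve_path("./judges/check.py", eval_dir, repo_root))
# /project/evals/code-review/judges/check.py
print(resolve_path("/shared/prompts/system.md", eval_dir, repo_root))
# /project/shared/prompts/system.md
```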

## JSONL datasets

For large evaluations, use JSONL format with one test per line:

dataset.jsonl:

```jsonl
{"id": "test-1", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
{"id": "test-2", "criteria": "...", "input": [{"role": "user", "content": "..."}]}
```

dataset.yaml (sidecar for shared config):

```yaml
name: large-eval
assert:
  - name: quality
    type: llm_judge
    prompt: ./prompts/quality.md
```
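Each JSONL line is an independent JSON document, so loading reduces to line-by-line parsing; a minimal sketch:

```python
import json

def load_jsonl(text):
    """Parse JSONL test data: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

data = (
    '{"id": "test-1", "criteria": "...", "input": [{"role": "user", "content": "..."}]}\n'
    '{"id": "test-2", "criteria": "...", "input": [{"role": "user", "content": "..."}]}\n'
)
tests = load_jsonl(data)
print([t["id"] for t in tests])  # ['test-1', 'test-2']
```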

## Validation

Use JSON Schema to validate EVAL.yaml files:

```sh
# Using the agentv CLI
agentv validate ./EVAL.yaml

# Using ajv (JSON Schema)
npx ajv validate -s eval.schema.json -d EVAL.yaml
```
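Before reaching for the full schema, the two required fields can be checked in a few lines (a sketch only; the real schema enforces much more, such as the name pattern and test shapes):

```python
def check_required(doc):
    """Return error messages for missing required EVAL.yaml fields."""
    errors = []
    if not isinstance(doc.get("name"), str) or not doc["name"]:
        errors.append("name: required non-empty string")
    if not doc.get("tests"):
        errors.append("tests: required, with at least one entry")
    return errors

print(check_required({"name": "code-review", "tests": [{"id": "t1"}]}))  # []
print(check_required({"version": "1.0"}))
# ['name: required non-empty string', 'tests: required, with at least one entry']
```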