# Test Schema
A test is a single test case within an evaluation suite. Each test defines an input, success criteria, and, optionally, how the result is evaluated.
## Schema

```yaml
# Required fields
id: string                    # Unique identifier within file
criteria: string              # Natural language success criteria

# Input (required)
input: string | Message[]     # Input to the agent

# Expected output (optional)
expected_output: string | object | Message[]

# Evaluation (optional)
rubrics: (string | Rubric)[]  # Inline evaluation criteria
assert: Assertion[]           # Deterministic and LLM assertions
execution: ExecutionConfig    # Per-test execution override

# Metadata (optional)
description: string           # Human-readable description
conversation_id: string       # Groups related multi-turn tests
note: string                  # Test-specific context
metadata: object              # Custom key-value pairs
```
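Taken together, the required fields mean the smallest valid test needs only an `id`, `criteria`, and `input`. A minimal sketch (identifiers and values are illustrative):

```yaml
tests:
  - id: minimal-example                          # illustrative id
    criteria: Agent returns a non-empty response
    input: "Ping"
```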
## Field Reference

### id (required)

Unique identifier within the EVAL.yaml file.

- Type: `string`
- Constraints: Must be unique within the file

```yaml
tests:
  - id: greeting-response
  - id: error-handling
  - id: edge-case-empty-input
```
### criteria (required)

Natural language description of what constitutes success.

- Type: `string`
- Purpose: Used by LLM judges and for documentation

```yaml
criteria: |
  Agent correctly identifies the off-by-one bug in the loop
  condition and provides an accurate fix with clear explanation.
```

### input (required)

The input to send to the agent.
Shorthand form (input):

```yaml
input: "Hello, my name is Alice!"
```

Multi-turn:

```yaml
input:
  - role: system
    content: "You are a helpful assistant."
  - role: user
    content: "Hello, my name is Alice!"
```

With file references:

```yaml
input:
  - role: user
    content:
      - type: text
        value: "Review this code:"
      - type: file
        value: ./fixtures/buggy.js
```
### Message Schema

```yaml
role: string                    # "system" | "user" | "assistant" | "tool"
content: string | ContentBlock[]

# For tool messages
tool_call_id: string            # References a tool call
name: string                    # Tool name (for tool messages)

# For assistant messages with tool calls
tool_calls:
  - id: string
    type: "function"
    function:
      name: string
      arguments: string         # JSON string
```
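The tool-message fields can be combined into a multi-turn input like the following sketch; the `get_weather` tool, its arguments, and the call id are invented for illustration:

```yaml
input:
  - role: user
    content: "What is the weather in Paris?"
  - role: assistant
    content: ""
    tool_calls:
      - id: call_1                     # illustrative call id
        type: "function"
        function:
          name: get_weather            # hypothetical tool
          arguments: '{"city": "Paris"}'
  - role: tool
    tool_call_id: call_1               # references the call above
    name: get_weather
    content: '{"temperature_c": 18}'   # illustrative tool result
```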
### ContentBlock Schema

```yaml
# Text content
- type: text
  value: string

# File reference
- type: file
  value: string       # Relative or absolute path

# Image content
- type: image
  value: string       # Path or base64

# JSON content
- type: json
  value: object
```
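The `image` and `json` block types do not appear in the input examples above; a hypothetical input using them might look like this (the path and JSON values are illustrative):

```yaml
input:
  - role: user
    content:
      - type: text
        value: "Does this screenshot match the config below?"
      - type: image
        value: ./fixtures/screenshot.png   # illustrative path
      - type: json
        value:
          theme: dark
          sidebar: collapsed
```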
### expected_output

Reference output for comparison.

Shorthand:

```yaml
expected_output: "Hello Alice! Nice to meet you."
```

Message array:

```yaml
expected_output:
  - role: assistant
    content: "Hello Alice! Nice to meet you."
```

Structured output:

```yaml
expected_output:
  name: "Alice"
  age: 30
  verified: true
```
### rubrics

Inline evaluation criteria. Can be simple strings or structured objects.

Simple strings:

```yaml
rubrics:
  - Mentions the user's name
  - Contains a greeting
  - Maintains friendly tone
```

Structured rubrics:

```yaml
rubrics:
  - id: name-mention
    outcome: Response includes "Alice"
    weight: 2.0
    required: true

  - id: greeting
    outcome: Contains a greeting phrase
    weight: 1.0
    score_ranges:
      0: No greeting present
      5: Generic greeting
      10: Personalized greeting
```
### Rubric Schema

```yaml
id: string           # Unique identifier
outcome: string      # What this rubric checks
weight: number       # Scoring weight (default: 1.0)
required: boolean    # If true, fail verdict if missed
score_ranges:        # Analytic scoring (0-10 scale)
  0: string          # Description for score 0
  5: string          # Description for score 5
  10: string         # Description for score 10
```
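Assuming these fields can be combined on a single rubric, a fully specified rubric might look like the following sketch (the id, outcome, and range descriptions are illustrative):

```yaml
rubrics:
  - id: cites-source                 # illustrative id
    outcome: Response cites the provided document
    weight: 2.0
    required: true                   # missing this fails the verdict
    score_ranges:
      0: No citation
      5: Vague reference to the document
      10: Exact quote with location
```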
### execution

Per-test execution override.

```yaml
tests:
  - id: slow-task
    criteria: Completes analysis
    input: "Analyze this large dataset..."
    execution:
      timeout_seconds: 600      # Override default timeout
      target: powerful_model    # Use different target
    assert:
      - name: custom_check
        type: code_judge
        script: ["python", "./judges/analysis.py"]
```
### conversation_id

Groups related multi-turn tests.

```yaml
tests:
  - id: turn-1
    conversation_id: support-flow
    input:
      - role: user
        content: "I need help with my account"

  - id: turn-2
    conversation_id: support-flow
    input:
      - role: user
        content: "My email is alice@example.com"
      - role: assistant
        content: "I found your account. How can I help?"
      - role: user
        content: "Reset my password please"
```
### description

Human-readable description for documentation.

```yaml
- id: edge-case-null
  description: |
    Tests handling of null input values.
    This edge case should be handled gracefully.
  criteria: Returns appropriate error message
```

### note

Test-specific context provided to evaluators.
```yaml
- id: regional-date
  criteria: Parses date correctly
  note: |
    The date format is DD/MM/YYYY (European format).
    Agent should not assume US format.
  input:
    - role: user
      content: "What day is 05/03/2024?"
```
## Complete Examples
### Basic Test

```yaml
- id: simple-greeting
  criteria: Agent responds with a greeting
  input: "Hello!"
  rubrics:
    - Contains a greeting word
```
### Complex Test

```yaml
- id: code-review-security
  description: Tests security vulnerability detection
  criteria: |
    Agent identifies SQL injection vulnerability and suggests
    parameterized queries as the fix.

  input:
    - role: system
      content:
        - type: file
          value: ./prompts/security-reviewer.md
    - role: user
      content:
        - type: text
          value: "Review this code for security issues:"
        - type: file
          value: ./fixtures/vulnerable.py

  expected_output:
    - role: assistant
      content: |
        Security Issue: SQL Injection vulnerability detected.
        Recommendation: Use parameterized queries.

  rubrics:
    - id: identifies-vuln
      outcome: Identifies SQL injection
      weight: 3.0
      required: true

    - id: suggests-fix
      outcome: Suggests parameterized queries
      weight: 2.0

    - id: explains-risk
      outcome: Explains potential impact
      weight: 1.0

  assert:
    - name: vuln_check
      type: code_judge
      script: ["python", "./judges/security_check.py"]
```
## Next Steps

- Evaluators - Assessment configuration
- Verdicts - Score calculation
- Tool Trajectory - Agentic evaluation