# Test Schema
A test is a single test case within an evaluation suite. Each test defines an input, success criteria, and, optionally, how the result is evaluated.
## Schema

```yaml
# Required fields
id: string                    # Unique identifier within file
criteria: string              # Natural language success criteria

# Input (required)
input: string | Message[]     # Input to the agent

# Expected output (optional)
expected_output: string | object | Message[]

# Evaluation (optional)
rubrics: (string | Rubric)[]  # Inline evaluation criteria
assert: Assertion[]           # Deterministic and LLM assertions
execution: ExecutionConfig    # Per-test execution override

# Metadata (optional)
description: string           # Human-readable description
conversation_id: string       # Groups related multi-turn tests
note: string                  # Test-specific context
metadata: object              # Custom key-value pairs
```
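Taken together, the required fields mean the smallest valid test needs only an `id`, `criteria`, and `input`. A minimal sketch (identifiers and values are illustrative):

```yaml
tests:
  - id: minimal-example                          # illustrative id
    criteria: Agent returns a non-empty response
    input: "Ping"
```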
## Field Reference

### id (required)

Unique identifier within the EVAL.yaml file.

- Type: `string`
- Constraints: Must be unique within the file

```yaml
tests:
  - id: greeting-response
  - id: error-handling
  - id: edge-case-empty-input
```
### criteria (required)

Natural language description of what constitutes success.

- Type: `string`
- Purpose: Used by LLM judges and for documentation

```yaml
criteria: |
  Agent correctly identifies the off-by-one bug in the loop
  condition and provides an accurate fix with clear explanation.
```

### input (required)

The input to send to the agent.
Shorthand form (input):

```yaml
input: "Hello, my name is Alice!"
```

Multi-turn:

```yaml
input:
  - role: system
    content: "You are a helpful assistant."
  - role: user
    content: "Hello, my name is Alice!"
```

With file references:

```yaml
input:
  - role: user
    content:
      - type: text
        value: "Review this code:"
      - type: file
        value: ./fixtures/buggy.js
```
### Message Schema

```yaml
role: string                    # "system" | "user" | "assistant" | "tool"
content: string | ContentBlock[]

# For tool messages
tool_call_id: string            # References a tool call
name: string                    # Tool name (for tool messages)

# For assistant messages with tool calls
tool_calls:
  - id: string
    type: "function"
    function:
      name: string
      arguments: string         # JSON string
```
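The tool-message fields can be combined into a multi-turn input like the following sketch; the `get_weather` tool, its arguments, and the call id are invented for illustration:

```yaml
input:
  - role: user
    content: "What is the weather in Paris?"
  - role: assistant
    content: ""
    tool_calls:
      - id: call_1                     # illustrative call id
        type: "function"
        function:
          name: get_weather            # hypothetical tool
          arguments: '{"city": "Paris"}'
  - role: tool
    tool_call_id: call_1               # references the call above
    name: get_weather
    content: '{"temperature_c": 18}'   # illustrative tool result
```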
### ContentBlock Schema

```yaml
# Text content
- type: text
  value: string

# File reference
- type: file
  value: string       # Relative or absolute path

# Image content
- type: image
  value: string       # Path or base64

# JSON content
- type: json
  value: object
```
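The `image` and `json` block types do not appear in the input examples above; a hypothetical input using them might look like this (the path and JSON values are illustrative):

```yaml
input:
  - role: user
    content:
      - type: text
        value: "Does this screenshot match the config below?"
      - type: image
        value: ./fixtures/screenshot.png   # illustrative path
      - type: json
        value:
          theme: dark
          sidebar: collapsed
```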
### expected_output

Reference output for comparison.

Shorthand:

```yaml
expected_output: "Hello Alice! Nice to meet you."
```

Message array:

```yaml
expected_output:
  - role: assistant
    content: "Hello Alice! Nice to meet you."
```

Structured output:

```yaml
expected_output:
  name: "Alice"
  age: 30
  verified: true
```
### rubrics

Inline evaluation criteria. Can be simple strings or structured objects.

Simple strings:

```yaml
rubrics:
  - Mentions the user's name
  - Contains a greeting
  - Maintains friendly tone
```

Structured rubrics:

```yaml
rubrics:
  - id: name-mention
    outcome: Response includes "Alice"
    weight: 2.0
    required: true

  - id: greeting
    outcome: Contains a greeting phrase
    weight: 1.0
    score_ranges:
      0: No greeting present
      5: Generic greeting
      10: Personalized greeting
```
### Rubric Schema

```yaml
id: string           # Unique identifier
outcome: string      # What this rubric checks
weight: number       # Scoring weight (default: 1.0)
required: boolean    # If true, fail verdict if missed
score_ranges:        # Analytic scoring (0-10 scale)
  0: string          # Description for score 0
  5: string          # Description for score 5
  10: string         # Description for score 10
```
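Assuming these fields can be combined on a single rubric, a fully specified rubric might look like the following sketch (the id, outcome, and range descriptions are illustrative):

```yaml
rubrics:
  - id: cites-source                 # illustrative id
    outcome: Response cites the provided document
    weight: 2.0
    required: true                   # missing this fails the verdict
    score_ranges:
      0: No citation
      5: Vague reference to the document
      10: Exact quote with location
```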
### execution

Per-test execution override.

```yaml
tests:
  - id: slow-task
    criteria: Completes analysis
    input: "Analyze this large dataset..."
    execution:
      timeout_seconds: 600      # Override default timeout
      target: powerful_model    # Use different target
    assert:
      - name: custom_check
        type: code_judge
        script: ["python", "./judges/analysis.py"]
```
### conversation_id

Groups related multi-turn tests.

```yaml
tests:
  - id: turn-1
    conversation_id: support-flow
    input:
      - role: user
        content: "I need help with my account"

  - id: turn-2
    conversation_id: support-flow
    input:
      - role: user
        content: "My email is alice@example.com"
      - role: assistant
        content: "I found your account. How can I help?"
      - role: user
        content: "Reset my password please"
```
### description

Human-readable description for documentation.

```yaml
- id: edge-case-null
  description: |
    Tests handling of null input values.
    This edge case should be handled gracefully.
  criteria: Returns appropriate error message
```

### note

Test-specific context provided to evaluators.
```yaml
- id: regional-date
  criteria: Parses date correctly
  note: |
    The date format is DD/MM/YYYY (European format).
    Agent should not assume US format.
  input:
    - role: user
      content: "What day is 05/03/2024?"
```
## Complete Examples
### Basic Test

```yaml
- id: simple-greeting
  criteria: Agent responds with a greeting
  input: "Hello!"
  rubrics:
    - Contains a greeting word
```
### Complex Test

```yaml
- id: code-review-security
  description: Tests security vulnerability detection
  criteria: |
    Agent identifies SQL injection vulnerability and suggests
    parameterized queries as the fix.

  input:
    - role: system
      content:
        - type: file
          value: ./prompts/security-reviewer.md
    - role: user
      content:
        - type: text
          value: "Review this code for security issues:"
        - type: file
          value: ./fixtures/vulnerable.py

  expected_output:
    - role: assistant
      content: |
        Security Issue: SQL Injection vulnerability detected.
        Recommendation: Use parameterized queries.

  rubrics:
    - id: identifies-vuln
      outcome: Identifies SQL injection
      weight: 3.0
      required: true

    - id: suggests-fix
      outcome: Suggests parameterized queries
      weight: 2.0

    - id: explains-risk
      outcome: Explains potential impact
      weight: 1.0

  assert:
    - name: vuln_check
      type: code_judge
      script: ["python", "./judges/security_check.py"]
```
## Next Steps

- Evaluators - Assessment configuration
- Verdicts - Score calculation
- Tool Trajectory - Agentic evaluation