# Evalcase Schema
An evalcase is a single test case within an evaluation suite. Each evalcase defines an input, an expected outcome, and, optionally, how to evaluate the result.
## Schema

```yaml
# Required fields
id: string                 # Unique identifier within file
expected_outcome: string   # Natural language success criteria

# Input (at least one required)
input: string | Message[]  # Input to the agent
input_messages: Message[]  # Canonical form

# Expected output (optional)
expected_output: string | object | Message[]
expected_messages: Message[]  # Canonical form

# Evaluation (optional)
rubrics: (string | Rubric)[]  # Inline evaluation criteria
execution: ExecutionConfig    # Per-case execution override

# Metadata (optional)
description: string        # Human-readable description
conversation_id: string    # Groups related multi-turn cases
note: string               # Test-specific context
metadata: object           # Custom key-value pairs
```
## Field Reference

### id (required)
Unique identifier within the `EVAL.yaml` file.
- Type: `string`
- Constraints: Must be unique within the file

```yaml
evalcases:
  - id: greeting-response
  - id: error-handling
  - id: edge-case-empty-input
```
### expected_outcome (required)

Natural language description of what constitutes success.
- Type: `string`
- Purpose: Used by LLM judges and for documentation

```yaml
expected_outcome: |
  Agent correctly identifies the off-by-one bug in the loop
  condition and provides an accurate fix with clear explanation.
```
### input / input_messages

The input to send to the agent.
**Shorthand form** (`input`):

```yaml
input: "Hello, my name is Alice!"

# Expands to:
input_messages:
  - role: user
    content: "Hello, my name is Alice!"
```

**Full form** (`input_messages`):
```yaml
input_messages:
  - role: system
    content: "You are a helpful assistant."
  - role: user
    content: "Hello, my name is Alice!"
```

**With file references:**
```yaml
input_messages:
  - role: user
    content:
      - type: text
        value: "Review this code:"
      - type: file
        value: ./fixtures/buggy.js
```
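The expansion from shorthand `input` to canonical `input_messages` could be sketched as follows. This is a minimal illustration, not the runner's actual implementation; the helper name `normalize_input` is an assumption.

```python
def normalize_input(case: dict) -> list[dict]:
    """Expand shorthand `input` into canonical `input_messages`.

    Hypothetical helper; the real runner's normalization may differ.
    """
    if "input_messages" in case:
        return case["input_messages"]
    value = case["input"]
    if isinstance(value, str):
        # A bare string becomes a single user message.
        return [{"role": "user", "content": value}]
    # A list is already in Message[] form.
    return list(value)

case = {"id": "greeting", "input": "Hello, my name is Alice!"}
print(normalize_input(case))
# → [{'role': 'user', 'content': 'Hello, my name is Alice!'}]
```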
### Message Schema

```yaml
role: string                # "system" | "user" | "assistant" | "tool"
content: string | ContentBlock[]

# For tool messages
tool_call_id: string        # References a tool call
name: string                # Tool name (for tool messages)

# For assistant messages with tool calls
tool_calls:
  - id: string
    type: "function"
    function:
      name: string
      arguments: string     # JSON string
```
### ContentBlock Schema

```yaml
# Text content
- type: text
  value: string

# File reference
- type: file
  value: string   # Relative or absolute path

# Image content
- type: image
  value: string   # Path or base64

# JSON content
- type: json
  value: object
```
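To make the block types concrete, here is a rough sketch of how `text` and `file` blocks might be flattened into prompt text. The helper `render_content` is hypothetical, and only two of the four block types are handled.

```python
from pathlib import Path

def render_content(content) -> str:
    """Flatten string-or-ContentBlock[] content into plain text.

    Hypothetical helper: only `text` and `file` blocks are covered,
    and the real runner may resolve blocks differently.
    """
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if block["type"] == "text":
            parts.append(block["value"])
        elif block["type"] == "file":
            # File blocks are read from disk, relative to the suite file.
            parts.append(Path(block["value"]).read_text())
        else:
            raise ValueError(f"unsupported block type: {block['type']}")
    return "\n".join(parts)
```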
### expected_output / expected_messages

Reference output for comparison.

**Shorthand:**

```yaml
expected_output: "Hello Alice! Nice to meet you."
```

**Full form:**

```yaml
expected_messages:
  - role: assistant
    content: "Hello Alice! Nice to meet you."
```

**Structured output:**

```yaml
expected_output:
  name: "Alice"
  age: 30
  verified: true
```
### rubrics

Inline evaluation criteria. Can be simple strings or structured objects.

**Simple strings:**

```yaml
rubrics:
  - Mentions the user's name
  - Contains a greeting
  - Maintains friendly tone
```

**Structured rubrics:**

```yaml
rubrics:
  - id: name-mention
    expected_outcome: Response includes "Alice"
    weight: 2.0
    required: true

  - id: greeting
    expected_outcome: Contains a greeting phrase
    weight: 1.0
    score_ranges:
      0: No greeting present
      5: Generic greeting
      10: Personalized greeting
```
### Rubric Schema

```yaml
id: string                # Unique identifier
expected_outcome: string  # What this rubric checks
weight: number            # Scoring weight (default: 1.0)
required: boolean         # If true, fail verdict if missed
score_ranges:             # Analytic scoring (0-10 scale)
  0: string               # Description for score 0
  5: string               # Description for score 5
  10: string              # Description for score 10
```
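As a rough illustration of how `weight` and `required` could interact, here is a weighted-average aggregation sketch. The aggregation rule, the result shape, and the 0.5 pass threshold are all assumptions for illustration; the actual verdict calculation is documented separately.

```python
def aggregate(results: list[dict]) -> dict:
    """Combine per-rubric results into an overall verdict.

    Each result: {"weight": float, "score": 0..1, "passed": bool,
    "required": bool}. Weighted-average rule and 0.5 threshold are
    assumptions, not the documented semantics.
    """
    # A missed required rubric fails the verdict outright.
    if any(r["required"] and not r["passed"] for r in results):
        return {"score": 0.0, "verdict": "fail"}
    total = sum(r["weight"] for r in results)
    score = sum(r["weight"] * r["score"] for r in results) / total
    return {"score": score, "verdict": "pass" if score >= 0.5 else "fail"}

results = [
    {"weight": 2.0, "score": 1.0, "passed": True, "required": True},
    {"weight": 1.0, "score": 0.5, "passed": True, "required": False},
]
print(aggregate(results))  # score ≈ 0.83, verdict 'pass'
```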
### execution

Per-evalcase execution override.
```yaml
evalcases:
  - id: slow-task
    expected_outcome: Completes analysis
    input: "Analyze this large dataset..."
    execution:
      timeout_seconds: 600    # Override default timeout
      target: powerful_model  # Use different target
      evaluators:
        - name: custom_check
          type: code_judge
          script: ["python", "./judges/analysis.py"]
```
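A `code_judge` script such as `./judges/analysis.py` might look like the sketch below. The payload shape (`output_messages`) and verdict keys (`passed`, `score`) are assumptions here, not a documented interface; a real judge would follow the runner's actual contract.

```python
def judge(payload: dict) -> dict:
    """Hypothetical code_judge body: inspect the agent's final text.

    Payload shape and verdict keys are assumptions for illustration.
    """
    messages = payload.get("output_messages", [])
    text = " ".join(m["content"] for m in messages
                    if isinstance(m.get("content"), str))
    passed = "analysis" in text.lower()
    return {"passed": passed, "score": 1.0 if passed else 0.0}

# A judge script would typically read the payload from stdin and print
# the verdict as JSON, e.g.:
#   print(json.dumps(judge(json.load(sys.stdin))))
print(judge({"output_messages": [
    {"role": "assistant", "content": "Analysis complete: trend is up."}
]}))  # → {'passed': True, 'score': 1.0}
```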
### conversation_id

Groups related multi-turn test cases.
```yaml
evalcases:
  - id: turn-1
    conversation_id: support-flow
    input:
      - role: user
        content: "I need help with my account"

  - id: turn-2
    conversation_id: support-flow
    input:
      - role: user
        content: "My email is alice@example.com"
      - role: assistant
        content: "I found your account. How can I help?"
      - role: user
        content: "Reset my password please"
```
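Grouping cases that share a `conversation_id` could be sketched as below. This assumes dict-shaped cases and that turns run in file order; both are illustrative assumptions, and `group_conversations` is a hypothetical helper.

```python
from collections import defaultdict

def group_conversations(evalcases: list[dict]) -> dict[str, list[dict]]:
    """Group cases by conversation_id, preserving file order.

    Cases without a conversation_id get a singleton group keyed by
    their own id. Hypothetical sketch of runner behavior.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for case in evalcases:
        key = case.get("conversation_id", case["id"])
        groups[key].append(case)
    return dict(groups)

cases = [
    {"id": "turn-1", "conversation_id": "support-flow"},
    {"id": "standalone"},
    {"id": "turn-2", "conversation_id": "support-flow"},
]
print(list(group_conversations(cases)))  # → ['support-flow', 'standalone']
```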
### description

Human-readable description for documentation.
```yaml
- id: edge-case-null
  description: |
    Tests handling of null input values.
    This edge case should be handled gracefully.
  expected_outcome: Returns appropriate error message
```

### note

Test-specific context provided to evaluators.
```yaml
- id: regional-date
  expected_outcome: Parses date correctly
  note: |
    The date format is DD/MM/YYYY (European format).
    Agent should not assume US format.
  input:
    - role: user
      content: "What day is 05/03/2024?"
```
## Complete Examples

### Basic Evalcase
```yaml
- id: simple-greeting
  expected_outcome: Agent responds with a greeting
  input: "Hello!"
  rubrics:
    - Contains a greeting word
```
### Complex Evalcase

```yaml
- id: code-review-security
  description: Tests security vulnerability detection
  expected_outcome: |
    Agent identifies SQL injection vulnerability and
    suggests parameterized queries as the fix.
  input_messages:
    - role: system
      content:
        - type: file
          value: ./prompts/security-reviewer.md
    - role: user
      content:
        - type: text
          value: "Review this code for security issues:"
        - type: file
          value: ./fixtures/vulnerable.py
  expected_messages:
    - role: assistant
      content: |
        Security Issue: SQL Injection vulnerability detected.
        Recommendation: Use parameterized queries.
  rubrics:
    - id: identifies-vuln
      expected_outcome: Identifies SQL injection
      weight: 3.0
      required: true
    - id: suggests-fix
      expected_outcome: Suggests parameterized queries
      weight: 2.0
    - id: explains-risk
      expected_outcome: Explains potential impact
      weight: 1.0
  execution:
    evaluators:
      - name: vuln_check
        type: code_judge
        script: ["python", "./judges/security_check.py"]
```
## Next Steps

- Evaluators - Assessment configuration
- Verdicts - Score calculation
- Tool Trajectory - Agentic evaluation