Quick Start

This guide walks you through creating your first agent evaluation using the AgentEvals specification. To follow along, you will need:

  • A text editor
  • An agent to evaluate (or use the examples)
  • The AgentV CLI (optional, for running evals)

Create a new file called EVAL.yaml:

name: hello-world-eval
version: "1.0"
description: A simple evaluation to verify agent responses

evalcases:
  - id: greeting-response
    expected_outcome: |
      Agent responds with a friendly greeting that includes the user's name
    input:
      - role: user
        content: "Hello, my name is Alice!"
    rubrics:
      - Includes a greeting (hello, hi, hey, etc.)
      - Mentions the user's name "Alice"
      - Maintains a friendly tone
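
If you want a quick check that the file parses before pointing an agent at it, a few lines of Python will do (this assumes PyYAML is installed; the loader is not part of the spec itself):

import yaml

# Load the evaluation spec and print a short summary
with open("EVAL.yaml") as f:
    spec = yaml.safe_load(f)

print(f"{spec['name']}: {len(spec['evalcases'])} eval case(s)")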

Enhance your evaluation with explicit evaluators:

name: hello-world-eval
version: "1.0"
description: A simple evaluation to verify agent responses

execution:
  evaluators:
    - name: greeting_quality
      type: llm_judge
      prompt: |
        Evaluate if the response is a proper greeting.
        Input: {{input}}
        Response: {{candidate_answer}}
        Score 1.0 if the response:
        - Contains a greeting
        - Mentions the user's name
        - Is friendly and appropriate
        Score 0.5 if partially correct.
        Score 0.0 if it fails basic requirements.

evalcases:
  - id: greeting-response
    expected_outcome: |
      Agent responds with a friendly greeting that includes the user's name
    input:
      - role: user
        content: "Hello, my name is Alice!"
    expected_output:
      - role: assistant
        content: "Hello Alice! It's nice to meet you. How can I help you today?"

For deterministic checks, add a code judge:

execution:
  evaluators:
    - name: name_check
      type: code_judge
      script: ["python", "./judges/check_name.py"]
    - name: greeting_quality
      type: llm_judge
      prompt: ./prompts/greeting.md
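
With the prompt now referenced by path, a natural choice for ./prompts/greeting.md is to carry over the inline prompt from the previous step, for example:

Evaluate if the response is a proper greeting.
Input: {{input}}
Response: {{candidate_answer}}
Score 1.0 if the response:
- Contains a greeting
- Mentions the user's name
- Is friendly and appropriate
Score 0.5 if partially correct.
Score 0.0 if it fails basic requirements.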

Create judges/check_name.py:

import json
import sys

# Read the evaluation payload from stdin
data = json.load(sys.stdin)
candidate = data["candidateAnswer"].lower()
input_text = data["question"].lower()

# Extract the name from the input (simplified; a real judge would parse input_text)
name = "alice"  # In practice, extract the name from the input

score = 1.0 if name in candidate else 0.0
hits = ["Name mentioned"] if score > 0 else []
misses = [] if score > 0 else ["Name not mentioned"]

# Write the judge result to stdout as JSON
print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses
}))
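
Before wiring the judge into an evaluation, you can smoke-test it by piping it a hand-written payload. A minimal sketch in Python (the sample values are made up; the field names match what the script reads):

import json
import subprocess

# Hypothetical payload with the fields check_name.py expects
payload = {
    "candidateAnswer": "Hello Alice! Nice to meet you.",
    "question": "Hello, my name is Alice!",
}

result = subprocess.run(
    ["python", "./judges/check_name.py"],
    input=json.dumps(payload),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # e.g. {"score": 1.0, "hits": ["Name mentioned"], "misses": []}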

Using AgentV CLI:

# Install AgentV
bun install -g agentv

# Run the evaluation
agentv eval ./EVAL.yaml --target default

Results are saved as JSONL:

{
  "evalId": "greeting-response",
  "score": 0.95,
  "verdict": "pass",
  "hits": ["Name mentioned", "Greeting present", "Friendly tone"],
  "misses": [],
  "evaluatorResults": [
    {"name": "name_check", "score": 1.0},
    {"name": "greeting_quality", "score": 0.9}
  ]
}
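
Since each line is a standalone JSON object, the results are easy to post-process. A minimal sketch, assuming the results were written to a file named results.jsonl (the actual output path depends on how you invoke agentv):

import json

# Summarize a JSONL results file: count passing cases and list misses
with open("results.jsonl") as f:
    results = [json.loads(line) for line in f if line.strip()]

passed = sum(1 for r in results if r["verdict"] == "pass")
print(f"{passed}/{len(results)} eval cases passed")
for r in results:
    if r["misses"]:
        print(r["evalId"], "missed:", ", ".join(r["misses"]))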

A typical evaluation setup:

my-evals/
├── EVAL.yaml            # Main evaluation file
├── prompts/
│   └── greeting.md      # LLM judge prompts
├── judges/
│   └── check_name.py    # Code judges
└── fixtures/
    └── sample.txt       # Test data