Quick Start
This guide walks you through creating your first agent evaluation using the AgentEvals specification.
Prerequisites
- A text editor
- An agent to evaluate (or use the examples)
- AgentV CLI (optional, for running evals)
Step 1: Create an EVAL.yaml file
Create a new file called EVAL.yaml:
```yaml
name: hello-world-eval
version: "1.0"
description: A simple evaluation to verify agent responses

evalcases:
  - id: greeting-response
    expected_outcome: |
      Agent responds with a friendly greeting that includes the user's name
    input:
      - role: user
        content: "Hello, my name is Alice!"
    rubrics:
      - Includes a greeting (hello, hi, hey, etc.)
      - Mentions the user's name "Alice"
      - Maintains a friendly tone
```

Step 2: Add evaluators (optional)
Enhance your evaluation with explicit evaluators:
```yaml
name: hello-world-eval
version: "1.0"
description: A simple evaluation to verify agent responses

execution:
  evaluators:
    - name: greeting_quality
      type: llm_judge
      prompt: |
        Evaluate if the response is a proper greeting.

        Input: {{input}}
        Response: {{candidate_answer}}

        Score 1.0 if the response:
        - Contains a greeting
        - Mentions the user's name
        - Is friendly and appropriate

        Score 0.5 if partially correct.
        Score 0.0 if it fails basic requirements.

evalcases:
  - id: greeting-response
    expected_outcome: |
      Agent responds with a friendly greeting that includes the user's name
    input:
      - role: user
        content: "Hello, my name is Alice!"
    expected_output:
      - role: assistant
        content: "Hello Alice! It's nice to meet you. How can I help you today?"
```

Step 3: Add a code judge
For deterministic checks, add a code judge:
```yaml
execution:
  evaluators:
    - name: name_check
      type: code_judge
      script: ["python", "./judges/check_name.py"]

    - name: greeting_quality
      type: llm_judge
      prompt: ./prompts/greeting.md
```
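The greeting_quality judge now points at a prompt file instead of an inline prompt. A minimal sketch of what prompts/greeting.md could contain, reusing the inline prompt from Step 2 (the exact file contents are up to you):

```markdown
Evaluate if the response is a proper greeting.

Input: {{input}}
Response: {{candidate_answer}}

Score 1.0 if the response:
- Contains a greeting
- Mentions the user's name
- Is friendly and appropriate

Score 0.5 if partially correct.
Score 0.0 if it fails basic requirements.
```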
Create judges/check_name.py:

```python
import json
import sys

# Read input from stdin
data = json.load(sys.stdin)
candidate = data["candidateAnswer"].lower()
input_text = data["question"].lower()

# Extract name from input (simple example)
name = "alice"  # In practice, extract from input_text

score = 1.0 if name in candidate else 0.0
hits = ["Name mentioned"] if score > 0 else []
misses = [] if score > 0 else ["Name not mentioned"]

# Output result
print(json.dumps({
    "score": score,
    "hits": hits,
    "misses": misses
}))
```
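As the script shows, a code judge reads a JSON payload from stdin and prints a JSON result to stdout. A quick way to smoke-test it locally is to pipe a hand-written payload through it; this sketch assumes only the two fields the script actually reads (question and candidateAnswer), and the real payload from the runner may include more:

```python
import json
import subprocess

# Hypothetical local smoke test; the runner may send additional fields.
payload = {
    "question": "Hello, my name is Alice!",
    "candidateAnswer": "Hi Alice! Great to meet you.",
}

result = subprocess.run(
    ["python", "judges/check_name.py"],
    input=json.dumps(payload),
    capture_output=True,
    text=True,
    check=True,
)

# Expected output: {"score": 1.0, "hits": ["Name mentioned"], "misses": []}
print(result.stdout)
```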
Step 4: Run the evaluation

Using AgentV CLI:
```bash
# Install AgentV
bun install -g agentv

# Run evaluation
agentv eval ./EVAL.yaml --target default
```

Step 5: Review results
Results are saved as JSONL:
{ "evalId": "greeting-response", "score": 0.95, "verdict": "pass", "hits": ["Name mentioned", "Greeting present", "Friendly tone"], "misses": [], "evaluatorResults": [ {"name": "name_check", "score": 1.0}, {"name": "greeting_quality", "score": 0.9} ]}Directory Structure
Directory Structure

A typical evaluation setup:
```
my-evals/
├── EVAL.yaml            # Main evaluation file
├── prompts/
│   └── greeting.md      # LLM judge prompts
├── judges/
│   └── check_name.py    # Code judges
└── fixtures/
    └── sample.txt       # Test data
```

Next Steps
- EVAL Format - Full format specification
- Evaluator Types - Learn about each evaluator
- Organization Patterns - Centralized vs skill-based