# Organization Patterns
AgentEvals supports multiple organizational patterns. Choose the one that fits your project structure.
## Pattern 1: Centralized
All evaluations in a single `evals/` directory. Best for:
- Dedicated evaluation repositories
- Cross-cutting evaluations
- CI/CD integration
```
project/
├── evals/
│   ├── code-review/
│   │   ├── EVAL.yaml
│   │   ├── prompts/
│   │   │   └── quality.md
│   │   └── judges/
│   │       └── syntax.py
│   ├── document-extraction/
│   │   ├── EVAL.yaml
│   │   └── fixtures/
│   │       └── sample.pdf
│   └── rag-accuracy/
│       ├── EVAL.yaml
│       └── dataset.jsonl
├── .agentv/
│   ├── targets.yaml
│   └── config.yaml
└── README.md
```

## Pattern 2: Skill-Based
Evaluations co-located with skills. Best for:
- AgentSkills integration
- Feature-oriented development
- Self-contained skill packages
```
project/
├── skills/
│   ├── code-review/
│   │   ├── SKILL.md            # Skill definition
│   │   ├── evals/
│   │   │   ├── EVAL.yaml
│   │   │   └── prompts/
│   │   │       └── quality.md
│   │   └── scripts/
│   │       └── review.py
│   ├── document-extraction/
│   │   ├── SKILL.md
│   │   └── evals/
│   │       └── EVAL.yaml
│   └── rag-search/
│       ├── SKILL.md
│       └── evals/
│           └── EVAL.yaml
└── .agentv/
    └── config.yaml
```

## Pattern 3: Domain-Grouped
Evaluations organized by domain or capability area. Best for:
- Large projects with many capabilities
- Team-based organization
- Multiple evaluation types per domain
```
project/
├── domains/
│   ├── coding/
│   │   ├── evals/
│   │   │   ├── review.yaml
│   │   │   ├── generation.yaml
│   │   │   └── debugging.yaml
│   │   └── shared/
│   │       └── prompts/
│   ├── retrieval/
│   │   ├── evals/
│   │   │   ├── rag-accuracy.yaml
│   │   │   └── search-quality.yaml
│   │   └── fixtures/
│   └── conversation/
│       └── evals/
│           ├── multi-turn.yaml
│           └── context-retention.yaml
└── .agentv/
    └── config.yaml
```

## Linking Skills and Evals
Use the `metadata.skill` field to link evaluations to skills:
Skill-based (co-located):
```yaml
name: code-review-eval
metadata:
  skill: code-review   # References parent skill
```

Centralized (reference):
```yaml
name: code-review-eval
metadata:
  skill: code-review                  # References skill elsewhere
  skill_path: ../skills/code-review
```

## Shared Resources
### Shared Prompts
Create a shared prompts directory:
```
project/
├── shared/
│   └── prompts/
│       ├── safety.md
│       ├── quality.md
│       └── format.md
└── evals/
    └── code-review/
        └── EVAL.yaml
```

Reference with absolute paths:
```yaml
execution:
  evaluators:
    - name: safety
      type: llm_judge
      prompt: /shared/prompts/safety.md
```

### Shared Judges
Create reusable code judges:
```
project/
├── shared/
│   └── judges/
│       ├── json_validator.py
│       ├── format_checker.py
│       └── requirements.txt
└── evals/
```

Reference in evaluators:
```yaml
evaluators:
  - name: format
    type: code_judge
    script: ["python", "/shared/judges/format_checker.py"]
```
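What the judge script itself looks like depends on the contract agentv defines for code judges (how candidate output is passed in and how verdicts are reported), which this page does not specify. As a rough sketch only, assuming the judge reads the candidate output from stdin and signals pass/fail via its exit code:

```python
#!/usr/bin/env python3
"""Sketch of a shared format judge; the stdin/exit-code contract is an assumption."""
import json
import sys


def main() -> int:
    candidate = sys.stdin.read()   # assumption: candidate output arrives on stdin
    try:
        json.loads(candidate)      # example check: output must be valid JSON
    except json.JSONDecodeError as err:
        print(f"fail: invalid JSON ({err})")
        return 1
    print("pass: output is valid JSON")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Keeping shared judges standard-library-only where possible also keeps the shared `requirements.txt` easy to maintain.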
## Configuration Files

### .agentv/config.yaml
Project-wide configuration:
```yaml
eval_patterns:
  - "evals/**/EVAL.yaml"
  - "skills/**/evals/EVAL.yaml"

defaults:
  timeout_seconds: 300
  target: default
```

### .agentv/targets.yaml
Provider configuration:
```yaml
targets:
  - name: default
    provider: anthropic
    model: claude-sonnet-4-20250514

  - name: powerful
    provider: anthropic
    model: claude-opus-4-20250514

  - name: fast
    provider: anthropic
    model: claude-3-5-haiku-20241022
```

## Discovery
AgentEvals tools discover evaluations by pattern:
```bash
# Find all evaluations
agentv list

# Run specific patterns
agentv eval "evals/code-*/**"
agentv eval "skills/*/evals/**"
```
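To preview which files a pattern would pick up before running anything, the expansion can be approximated with Python's `glob` module. This is only an illustration of how the globs resolve against the trees shown above, not how agentv itself implements discovery:

```python
from glob import glob

# Illustrative only: expand the discovery patterns used in .agentv/config.yaml.
patterns = ["evals/**/EVAL.yaml", "skills/**/evals/EVAL.yaml"]

for pattern in patterns:
    for path in sorted(glob(pattern, recursive=True)):
        print(path)
```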
## Naming Conventions

### Eval Files
| Pattern | Description |
|---|---|
| `EVAL.yaml` | Primary evaluation file |
| `dataset.yaml` | Alternative name (agentv style) |
| `*.eval.yaml` | Scoped evaluations |
### Directories
| Directory | Purpose |
|---|---|
| `prompts/` | LLM judge prompts |
| `judges/` | Code judge scripts |
| `fixtures/` | Test data and samples |
| `references/` | Reference documents |
| File | Purpose |
|---|---|
| `*.md` | Prompt templates |
| `*.py` | Python code judges |
| `*.ts` | TypeScript code judges |
| `*.jsonl` | Large datasets |
## Migration
### From Centralized to Skill-Based
Section titled “From Centralized to Skill-Based”- Create skill directories
- Move eval files to
skills/*/evals/ - Update relative paths in EVAL.yaml
- Add
metadata.skillreferences
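The file move itself can be scripted. The sketch below assumes the centralized layout shown earlier (one directory per evaluation under `evals/`) and a skill directory of the same name; it does not update relative paths or `metadata.skill` for you:

```python
from pathlib import Path
import shutil

# Sketch: relocate each evals/<name>/ directory to skills/<name>/evals/.
# Assumes evaluation and skill directories share the same name.
for eval_dir in sorted(Path("evals").iterdir()):
    if not (eval_dir / "EVAL.yaml").exists():
        continue
    target = Path("skills") / eval_dir.name / "evals"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(eval_dir), str(target))
    print(f"moved {eval_dir} -> {target}")
```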
### From Custom to Standard
- Rename evaluation files to `EVAL.yaml`
- Convert to YAML format if needed (see the sketch after this list)
- Map custom fields to the standard schema
- Update tooling configuration
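The JSON-to-YAML conversion step is mechanical; the field mapping is not, because it depends on your custom schema. A minimal sketch using PyYAML, with a hypothetical source file name and the mapping left as a placeholder comment:

```python
import json
from pathlib import Path

import yaml  # PyYAML


def convert(src: Path, dst: Path) -> None:
    data = json.loads(src.read_text())
    # Schema-specific step: rename your custom keys to the standard
    # fields (name, metadata, execution, ...) before writing.
    dst.write_text(yaml.safe_dump(data, sort_keys=False))


convert(Path("my-eval.json"), Path("EVAL.yaml"))  # "my-eval.json" is hypothetical
```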
## Best Practices
### 1. Keep Related Files Together
```
evals/code-review/
├── EVAL.yaml
├── prompts/
│   └── quality.md      # Used by this eval
└── judges/
    └── syntax.py       # Used by this eval
```

### 2. Use Descriptive Names
```yaml
# Good
name: code-review-security-vulnerabilities

# Avoid
name: cr-sec-1
```

### 3. Document Structure
Add a README in evaluation directories:
```
evals/
├── README.md           # Explains evaluation organization
├── code-review/
│   └── README.md       # Explains this evaluation suite
```

### 4. Version Control Friendly
- Use YAML over JSON for better diffs
- Keep files small and focused
- Split large datasets into JSONL (see the sketch below)
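JSONL keeps one case per line, so adding or editing a case produces a one-line diff instead of re-flowing a large YAML or JSON document. A sketch of splitting a single JSON array of cases into JSONL (the source file name is hypothetical):

```python
import json
from pathlib import Path

# Sketch: convert a JSON array of cases into line-oriented JSONL.
cases = json.loads(Path("dataset.json").read_text())  # hypothetical source file

with Path("dataset.jsonl").open("w") as out:
    for case in cases:
        out.write(json.dumps(case, ensure_ascii=False) + "\n")
```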
## Next Steps
- EVAL Format - File format reference
- Integration - Target configuration
- Patterns - Testing patterns