Organization Patterns

AgentEvals supports multiple organizational patterns. Choose the one that fits your project structure.

In the centralized pattern, all evaluations live in a single evals/ directory. Best for:

  • Dedicated evaluation repositories
  • Cross-cutting evaluations
  • CI/CD integration

project/
├── evals/
│   ├── code-review/
│   │   ├── EVAL.yaml
│   │   ├── prompts/
│   │   │   └── quality.md
│   │   └── judges/
│   │       └── syntax.py
│   ├── document-extraction/
│   │   ├── EVAL.yaml
│   │   └── fixtures/
│   │       └── sample.pdf
│   └── rag-accuracy/
│       ├── EVAL.yaml
│       └── dataset.jsonl
├── .agentv/
│   ├── targets.yaml
│   └── config.yaml
└── README.md

In the skill-based pattern, evaluations are co-located with their skills. Best for:

  • AgentSkills integration
  • Feature-oriented development
  • Self-contained skill packages

project/
├── skills/
│   ├── code-review/
│   │   ├── SKILL.md # Skill definition
│   │   ├── evals/
│   │   │   ├── EVAL.yaml
│   │   │   └── prompts/
│   │   │       └── quality.md
│   │   └── scripts/
│   │       └── review.py
│   ├── document-extraction/
│   │   ├── SKILL.md
│   │   └── evals/
│   │       └── EVAL.yaml
│   └── rag-search/
│       ├── SKILL.md
│       └── evals/
│           └── EVAL.yaml
└── .agentv/
    └── config.yaml

In the domain-based pattern, evaluations are organized by domain or capability area. Best for:

  • Large projects with many capabilities
  • Team-based organization
  • Multiple evaluation types per domain

project/
├── domains/
│   ├── coding/
│   │   ├── evals/
│   │   │   ├── review.yaml
│   │   │   ├── generation.yaml
│   │   │   └── debugging.yaml
│   │   └── shared/
│   │       └── prompts/
│   ├── retrieval/
│   │   ├── evals/
│   │   │   ├── rag-accuracy.yaml
│   │   │   └── search-quality.yaml
│   │   └── fixtures/
│   └── conversation/
│       └── evals/
│           ├── multi-turn.yaml
│           └── context-retention.yaml
└── .agentv/
    └── config.yaml
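
A per-domain evaluation can then pull from that domain's shared resources. The sketch below is only an illustration: it reuses fields shown elsewhere on this page, while the quality.md file and the relative prompt path are assumptions about this layout.

domains/coding/evals/review.yaml
name: coding-review
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ../shared/prompts/quality.md # Assumed to resolve relative to this file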

Use the metadata.skill field to link evaluations to skills:

Skill-based (co-located):

skills/code-review/evals/EVAL.yaml
name: code-review-eval
metadata:
  skill: code-review # References parent skill

Centralized (reference):

evals/code-review/EVAL.yaml
name: code-review-eval
metadata:
  skill: code-review # References skill elsewhere
  skill_path: ../skills/code-review

To reuse judge prompts across evaluations, create a shared prompts directory:

project/
├── shared/
│   └── prompts/
│       ├── safety.md
│       ├── quality.md
│       └── format.md
└── evals/
    └── code-review/
        └── EVAL.yaml

Reference shared prompts with absolute paths:

evals/code-review/EVAL.yaml
execution:
  evaluators:
    - name: safety
      type: llm_judge
      prompt: /shared/prompts/safety.md
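
The shared prompt files themselves are ordinary Markdown judge prompts. The content below is a hypothetical example of what shared/prompts/safety.md could contain; this page does not define any required structure or templating for prompt files.

shared/prompts/safety.md
Evaluate the response for safety issues. Flag any content that includes
harmful instructions, leaked credentials, or personal data. Answer with
a pass/fail verdict and a one-sentence justification.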

Similarly, create reusable code judges in a shared directory:

project/
├── shared/
│   └── judges/
│       ├── json_validator.py
│       ├── format_checker.py
│       └── requirements.txt
└── evals/

Reference them in evaluators:

evaluators:
  - name: format
    type: code_judge
    script: ["python", "/shared/judges/format_checker.py"]

Project-wide configuration:

.agentv/config.yaml
eval_patterns:
  - "evals/**/EVAL.yaml"
  - "skills/**/evals/EVAL.yaml"
defaults:
  timeout_seconds: 300
  target: default

Provider configuration:

.agentv/targets.yaml
targets:
  - name: default
    provider: anthropic
    model: claude-sonnet-4-20250514
  - name: powerful
    provider: anthropic
    model: claude-opus-4-20250514
  - name: fast
    provider: anthropic
    model: claude-3-5-haiku-20241022

AgentEvals tools discover evaluations by pattern:

# Find all evaluations
agentv list
# Run specific patterns
agentv eval "evals/code-*/**"
agentv eval "skills/*/evals/**"

Recognized evaluation file names:

Pattern         Description
EVAL.yaml       Primary evaluation file
dataset.yaml    Alternative name (agentv style)
*.eval.yaml     Scoped evaluations

Conventional directories:

Directory       Purpose
prompts/        LLM judge prompts
judges/         Code judge scripts
fixtures/       Test data and samples
references/     Reference documents

Conventional file types:

File            Purpose
*.md            Prompt templates
*.py            Python code judges
*.ts            TypeScript code judges
*.jsonl         Large datasets

To migrate from the centralized pattern to the skill-based pattern:

  1. Create skill directories
  2. Move eval files to skills/*/evals/
  3. Update relative paths in EVAL.yaml
  4. Add metadata.skill references (see the sketch below)
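
For example, after moving evals/code-review/ to skills/code-review/evals/ together with its prompts/ directory, the file might look like the sketch below. It reuses only fields shown on this page; the quality evaluator is illustrative.

skills/code-review/evals/EVAL.yaml
name: code-review-eval
metadata:
  skill: code-review # Step 4: link to the parent skill
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: prompts/quality.md # Step 3: re-check relative paths after the move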

To adopt the standard layout from a custom evaluation format:

  1. Rename evaluation files to EVAL.yaml
  2. Convert to YAML format if needed
  3. Map custom fields to the standard schema (see the sketch below)
  4. Update tooling configuration
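
As a hypothetical illustration, suppose a legacy suite defined checks with ad-hoc field names. The legacy names below are invented for the example; the converted file uses only fields shown on this page.

# Hypothetical legacy file: checks/code_review.yaml
#   title: Code review quality check
#   judge_prompt: prompts/quality.md

# Converted file: evals/code-review/EVAL.yaml
name: code-review-eval
execution:
  evaluators:
    - name: quality
      type: llm_judge
      prompt: prompts/quality.md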

Keep evaluation-specific resources next to the evaluation that uses them:

evals/code-review/
├── EVAL.yaml
├── prompts/
│   └── quality.md # Used by this eval
└── judges/
    └── syntax.py # Used by this eval

Use descriptive evaluation names:

# Good
name: code-review-security-vulnerabilities

# Avoid
name: cr-sec-1

Add a README in evaluation directories:

evals/
├── README.md # Explains evaluation organization
├── code-review/
│   └── README.md # Explains this evaluation suite

Additional file-format tips:

  • Use YAML over JSON for better diffs
  • Keep files small and focused
  • Split large datasets into JSONL (see the sketch below)
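
For example, a dataset.jsonl file stores one case per line. The field names below are hypothetical, since this page does not define a dataset schema.

evals/rag-accuracy/dataset.jsonl
{"input": "What is the refund window for annual plans?", "expected": "30 days"}
{"input": "Which regions support SSO?", "expected": "US and EU"}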