Testing Pyramid

The Testing Pyramid provides a framework for organizing agent evaluations at different levels of abstraction and cost.

              /|\
             / | \
            /  |  \
           /  Sim-   \     ← End-to-end scenarios
          / ulations  \       (expensive, comprehensive)
         /─────────────\
        /               \
       /      Evals      \    ← AgentEvals (this spec)
      /   (this layer)    \      (balanced cost/coverage)
     /─────────────────────\
    /                       \
   /       Unit Tests        \  ← Deterministic checks
  ───────────────────────────     (fast, focused)

Layer         Speed    Cost     Coverage   Flakiness
Unit Tests    Fast     Low      Narrow     Low
Evals         Medium   Medium   Balanced   Medium
Simulations   Slow     High     Broad      Higher

Unit Tests

Fast, deterministic tests for specific components:

  • Tool implementations
  • Parsing functions
  • Configuration validation
  • Format checkers
tests/test_parser.py
def test_parse_json_response():
    response = '{"answer": 42}'
    result = parse_response(response)
    assert result["answer"] == 42


def test_tool_validation():
    tool = {"name": "search", "args": {"query": "test"}}
    assert validate_tool_call(tool)

Use unit tests for:

  • Testing deterministic logic
  • Validating data transformations
  • Checking error handling
  • Fast feedback during development
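
The helpers exercised in tests/test_parser.py above are ordinary deterministic functions from the agent's own codebase. A minimal sketch of what they might look like (illustrative assumptions, not part of any spec):

import json


def parse_response(response: str) -> dict:
    """Parse raw model output as JSON; raises ValueError on malformed input."""
    return json.loads(response)


def validate_tool_call(tool: dict) -> bool:
    """A tool call is valid if it names a tool and carries a dict of arguments."""
    return bool(tool.get("name")) and isinstance(tool.get("args"), dict)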

Evals

Measure agent capabilities using the AgentEvals specification:

  • Task completion
  • Response quality
  • Tool usage patterns
  • Performance bounds
evals/code-review/EVAL.yaml
name: code-review
version: "1.0"
execution:
evaluators:
  - name: quality
    type: llm_judge
    prompt: ./prompts/review_quality.md
  - name: accuracy
    type: code_judge
    script: ["python", "./judges/bug_check.py"]
  - name: efficiency
    type: execution_metrics
    max_tool_calls: 10
evalcases:
  - id: off-by-one
    expected_outcome: Identifies loop condition bug
    input:
      - role: user
        content: "Review this code..."

Use evals for:

  • Measuring semantic quality
  • Validating agent behaviors
  • Comparing agent versions
  • Regression testing
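
The accuracy evaluator in the EVAL.yaml above delegates to ./judges/bug_check.py. A code judge is just a script that inspects the agent's output and reports a verdict; the sketch below assumes the harness pipes the agent's review text to stdin and reads a JSON score from stdout — check the spec for the exact code_judge contract before relying on this shape.

#!/usr/bin/env python3
# judges/bug_check.py — illustrative code judge (I/O contract assumed, see above).
import json
import sys


def main() -> None:
    review = sys.stdin.read().lower()
    # Pass if the review calls out the seeded loop-condition / off-by-one bug.
    keywords = ("off-by-one", "off by one", "loop condition", "<=")
    passed = any(keyword in review for keyword in keywords)
    print(json.dumps({"passed": passed, "score": 1.0 if passed else 0.0}))


if __name__ == "__main__":
    main()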

Simulations

End-to-end scenario testing with realistic workflows:

  • Multi-turn conversations
  • Complex workflows
  • Integration points
  • User experience flows
simulations/support-flow.yaml
name: customer-support-flow
version: "1.0"
scenario:
  description: Complete customer support interaction
  turns:
    - user: "I need help with my account"
      expect:
        - agent asks for account info
    - user: "My email is alice@example.com"
      expect:
        - agent acknowledges finding account
        - agent asks how to help
    - user: "Reset my password"
      expect:
        - agent initiates password reset
        - agent confirms email sent
evaluators:
  - name: flow_completion
    type: llm_judge
    prompt: |
      Did the agent complete the support flow correctly?
      - Account lookup ✓
      - Password reset ✓
      - Confirmation ✓

Use simulations for:

  • Validating complete workflows
  • Testing conversation coherence
  • Integration testing
  • Acceptance criteria
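
Conceptually, a simulation harness replays the scripted user turns against the agent and hands the finished transcript (plus each turn's expect lines) to the evaluators. A rough sketch of that loop, assuming you supply a run_agent(history) callable — both the function and the transcript shape are illustrative, not defined by the spec:

from typing import Callable


def run_simulation(turns: list[dict], run_agent: Callable[[list[dict]], str]) -> list[dict]:
    """Replay scripted user turns and collect a transcript for the evaluators."""
    history: list[dict] = []
    for turn in turns:
        history.append({"role": "user", "content": turn["user"]})
        reply = run_agent(history)  # the agent produces one response per user turn
        history.append({
            "role": "assistant",
            "content": reply,
            "expect": turn.get("expect", []),  # expectations travel with the turn
        })
    return history  # handed to evaluators such as the flow_completion LLM judge
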
Recommended distribution of effort across the layers:

Simulations:  10% │░░░░░░░░░░│
Evals:        30% │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
Unit Tests:   60% │░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│

A typical project layout:

my-agent/
├── tests/                # Unit tests
│   ├── test_parser.py
│   ├── test_tools.py
│   └── test_validators.py
├── evals/                # AgentEvals
│   ├── code-review/
│   │   └── EVAL.yaml
│   ├── rag-quality/
│   │   └── EVAL.yaml
│   └── tool-usage/
│       └── EVAL.yaml
└── simulations/          # End-to-end
    ├── support-flow.yaml
    └── onboarding-flow.yaml

When to run each layer:

1. Development
   └─→ Run unit tests (fast feedback)

2. Pre-commit
   └─→ Run unit tests
   └─→ Run critical evals

3. CI Pipeline
   └─→ Run all unit tests
   └─→ Run all evals
   └─→ Run smoke simulations

4. Pre-release
   └─→ Full simulation suite
   └─→ Performance benchmarks
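
One way to wire this progression together is a small runner that maps each stage to the layers it executes and stops at the first failure. The sketch below follows the project layout shown earlier; the agentevals and simulate commands are placeholders for whatever runners your harness actually provides:

#!/usr/bin/env python3
# run_checks.py — hypothetical stage runner; substitute your real eval/simulation CLIs.
import subprocess
import sys

STAGES = {
    "dev":        [["pytest", "tests/", "-q"]],
    "pre-commit": [["pytest", "tests/", "-q"],
                   ["agentevals", "run", "evals/code-review"]],    # critical evals only
    "ci":         [["pytest", "tests/"],
                   ["agentevals", "run", "evals/"],
                   ["simulate", "simulations/support-flow.yaml"]], # smoke simulation
    "release":    [["simulate", "simulations/"]],                  # full suite
}


def main() -> int:
    stage = sys.argv[1] if len(sys.argv) > 1 else "dev"
    for command in STAGES[stage]:
        returncode = subprocess.run(command).returncode
        if returncode != 0:  # fail fast: cheap layers gate the expensive ones
            return returncode
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
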
Cost characteristics by layer:

Unit Tests
  • No LLM calls
  • Run on every commit
  • Fast CI feedback

Evals
  • Selective LLM calls
  • Run on PR/merge
  • Use caching (see the caching sketch below)

Simulations
  • Full LLM conversations
  • Run before release
  • Expensive but comprehensive
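
For the eval layer, "use caching" typically means memoizing judge verdicts keyed on the exact judge prompt and agent output, so unchanged cases cost nothing on re-runs. A minimal sketch of that idea; call_llm_judge stands in for whatever judge client you actually use:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")


def cached_judge(prompt: str, output: str, call_llm_judge) -> dict:
    """Return a stored verdict when this (prompt, output) pair was judged before."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(f"{prompt}\x00{output}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    verdict = call_llm_judge(prompt, output)  # expensive LLM call only on a cache miss
    cache_file.write_text(json.dumps(verdict))
    return verdict
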
These patterns carry over into eval configuration: a deterministic format check as a code_judge evaluator, a tool_trajectory evaluator for behavioral checks, and multi-turn coverage through evalcases that share a conversation_id.

evaluators:
  - name: format_check
    type: code_judge
    script: ["python", "./judges/format.py"]

evaluators:
  - name: tool_usage
    type: tool_trajectory
    mode: in_order
    expected:
      - tool: search
      - tool: analyze

evalcases:
  - id: multi-step
    conversation_id: research-flow
  - id: multi-step-2
    conversation_id: research-flow

Build a strong unit test foundation before adding evals.

Checking…           Use
Data format         Unit test
Response quality    Eval
User journey        Simulation

Run cheaper tests first:

Unit tests (1s) → Evals (30s) → Simulations (5m)

Use fixtures across layers:

fixtures/
├── sample_code.py # Used by unit tests AND evals
├── expected_review.md
└── test_data.json
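
In practice the sharing is just both layers reading the same file: a unit test can load fixtures/sample_code.py directly, while the code-review eval points its evalcase input at the same path. A hypothetical unit-test sketch:

from pathlib import Path


def test_sample_code_fixture_is_valid_python():
    source = Path("fixtures/sample_code.py").read_text()
    compile(source, "sample_code.py", "exec")  # deterministic check: the shared fixture parses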

Ensure critical paths are tested at each layer:

Feature: Code Review
├── Unit: Parser handles all syntax
├── Eval: Reviews identify common bugs
└── Simulation: Full review workflow works