Verdicts

The verdict is the final outcome of an evaluation, indicating whether the agent passed, failed, or was borderline.

Each configured evaluator runs and produces a score between 0.0 and 1.0:

evaluators:
  - name: correctness        # → score: 0.9
    type: llm_judge
  - name: format             # → score: 0.8
    type: code_judge
  - name: efficiency         # → score: 0.7
    type: execution_metrics

Scores are combined using weighted averaging:

Final Score = Σ(evaluator_score × weight) / Σ(weights)

Example with equal weights:

evaluators:
  - name: correctness
    weight: 1.0   # score: 0.9
  - name: format
    weight: 1.0   # score: 0.8
  - name: efficiency
    weight: 1.0   # score: 0.7

Final Score = (0.9 + 0.8 + 0.7) / 3 = 0.8

Example with custom weights:

evaluators:
  - name: correctness
    weight: 3.0   # More important
  - name: format
    weight: 1.0
  - name: efficiency
    weight: 1.0

Final Score = (0.9 × 3 + 0.8 × 1 + 0.7 × 1) / 5 = 0.84
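
As a sketch, the custom-weight example can be computed like this in Python (the scores and weights are taken directly from the examples above):

scores = {"correctness": 0.9, "format": 0.8, "efficiency": 0.7}
weights = {"correctness": 3.0, "format": 1.0, "efficiency": 1.0}

# Weighted average: Σ(score × weight) / Σ(weights)
final_score = sum(scores[n] * weights[n] for n in scores) / sum(weights.values())
print(round(final_score, 2))  # 0.84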

The final score maps to a verdict:

Score Range       Verdict       Description
≥ 0.8             pass          Agent meets expectations
≥ 0.6 and < 0.8   borderline    Partially successful, needs review
< 0.6             fail          Agent does not meet expectations
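
A minimal sketch of that mapping, assuming the thresholds in the table:

def verdict_for(score: float) -> str:
    # Thresholds from the table: >= 0.8 pass, >= 0.6 borderline, otherwise fail.
    if score >= 0.8:
        return "pass"
    if score >= 0.6:
        return "borderline"
    return "fail"

print(verdict_for(0.84))  # pass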

Rubrics or evaluators marked as required: true can force a fail verdict regardless of score:

rubrics:
  - id: safety
    expected_outcome: No harmful content generated
    required: true   # Fails entire eval if missed
    weight: 1.0
  - id: accuracy
    expected_outcome: Answer is correct
    weight: 2.0

Flow:

  1. If any required criterion scores 0.0 → verdict = fail
  2. Otherwise, calculate the weighted score → determine the verdict (see the sketch below)
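
A minimal sketch of this flow, assuming each evaluator result is a dict with score, weight, and a required flag (field names here are illustrative):

def final_verdict(results):
    # results: list of dicts like {"score": 0.9, "weight": 1.0, "required": False}
    # Step 1: any required criterion scoring 0.0 forces an immediate fail.
    if any(r.get("required") and r["score"] == 0.0 for r in results):
        return "fail"
    # Step 2: otherwise compute the weighted score and map it to a verdict.
    score = sum(r["score"] * r["weight"] for r in results) / sum(r["weight"] for r in results)
    return "pass" if score >= 0.8 else "borderline" if score >= 0.6 else "fail"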

The aggregator type controls how evaluator scores are combined into the final score. weighted_average takes the standard weighted average of all evaluator scores.

aggregator:
  type: weighted_average

minimum takes the lowest evaluator score as the final score. Use it when all criteria must pass.

aggregator:
  type: minimum

maximum takes the highest evaluator score as the final score. Rarely used.

aggregator:
  type: maximum

safety_gate requires designated evaluators to pass before the others are considered.

aggregator:
  type: safety_gate
  required: [safety_check]
  # If safety_check fails → fail
  # Otherwise → weighted_average of remaining

all_or_nothing requires every evaluator to score above the threshold; otherwise the eval fails.

aggregator:
  type: all_or_nothing
  threshold: 0.7
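
To make the differences concrete, here is a rough sketch of how each strategy might reduce a list of (name, score, weight) tuples to a single score. The function names and the 0.6 pass bar inside the safety gate are illustrative assumptions, not the framework's API:

def weighted_average(results):
    # results: list of (name, score, weight) tuples
    return sum(s * w for _, s, w in results) / sum(w for _, _, w in results)

def minimum(results):
    return min(s for _, s, _ in results)

def maximum(results):
    return max(s for _, s, _ in results)

def safety_gate(results, required):
    # Fail outright if any gated evaluator misses the (assumed) 0.6 pass bar;
    # otherwise take the weighted average of the remaining evaluators.
    if any(s < 0.6 for n, s, _ in results if n in required):
        return 0.0
    rest = [r for r in results if r[0] not in required]
    return weighted_average(rest)  # assumes at least one non-gated evaluator

def all_or_nothing(results, threshold):
    # Every evaluator must clear the threshold, or the whole eval fails.
    if all(s >= threshold for _, s, _ in results):
        return weighted_average(results)
    return 0.0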

Evaluation results include a detailed breakdown:

# Result structure
eval_id: test-case-1
score: 0.85
verdict: pass
evaluator_results:
  - name: correctness
    type: llm_judge
    score: 0.9
    weight: 3.0
    hits:
      - "Correctly identifies the issue"
      - "Provides accurate solution"
    misses:
      - "Could be more concise"
    reasoning: "Strong response with minor style issues"
  - name: format
    type: code_judge
    score: 0.8
    weight: 1.0
    hits:
      - "Valid JSON output"
    misses:
      - "Missing optional field"
hits:   # Aggregated from all evaluators
  - "Correctly identifies the issue"
  - "Provides accurate solution"
  - "Valid JSON output"
misses:
  - "Could be more concise"
  - "Missing optional field"

When evaluating multiple evalcases, track the verdict distribution:

Suite Results:
├── Total: 100
├── Pass: 75 (75%)
├── Borderline: 15 (15%)
└── Fail: 10 (10%)
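
A quick sketch of how such a distribution might be tallied from per-evalcase verdicts (the list here simply mirrors the numbers above):

from collections import Counter

verdicts = ["pass"] * 75 + ["borderline"] * 15 + ["fail"] * 10
counts = Counter(verdicts)
total = len(verdicts)
for v in ("pass", "borderline", "fail"):
    print(f"{v}: {counts[v]} ({counts[v] / total:.0%})")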

Future versions may support custom verdict thresholds:

# Proposed (not yet implemented)
verdicts:
  pass: 0.9        # Higher bar
  borderline: 0.7
  fail: 0.0        # Default: below borderline

Assign higher weights to critical criteria:

evaluators:
  - name: safety
    weight: 5.0   # Critical
  - name: correctness
    weight: 3.0   # Important
  - name: style
    weight: 1.0   # Nice to have

Mark safety-critical criteria as required:

rubrics:
  - id: no-pii
    expected_outcome: Does not expose personal information
    required: true

Use composite evaluators for pass-first-then-score patterns:

- name: release_gate
  type: composite
  aggregator:
    type: safety_gate
    required: [safety, compliance]