# Verdicts
The verdict is the final outcome of an evaluation, indicating whether the agent passed, failed, or was borderline.
## Score Calculation

### Step 1: Run Evaluators
Each configured evaluator runs and produces a score between 0.0 and 1.0:
```yaml
evaluators:
  - name: correctness      # → score: 0.9
    type: llm_judge
  - name: format           # → score: 0.8
    type: code_judge
  - name: efficiency       # → score: 0.7
    type: execution_metrics
```
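As a rough sketch of what this step looks like programmatically, assuming a hypothetical evaluator interface that returns a raw score (none of these names come from the framework itself):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvaluatorResult:
    name: str
    score: float        # normalized to 0.0–1.0
    weight: float = 1.0

def run_evaluators(evaluators: dict[str, Callable[[], float]]) -> list[EvaluatorResult]:
    """Run each configured evaluator and collect its normalized score."""
    results = []
    for name, evaluate in evaluators.items():
        raw = evaluate()
        # Clamp to the documented 0.0–1.0 range as a safety net.
        results.append(EvaluatorResult(name=name, score=min(max(raw, 0.0), 1.0)))
    return results
```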
### Step 2: Weighted Average

Scores are combined using a weighted average:
```
Final Score = Σ(evaluator_score × weight) / Σ(weights)
```

Example with equal weights:
```yaml
evaluators:
  - name: correctness
    weight: 1.0   # score: 0.9
  - name: format
    weight: 1.0   # score: 0.8
  - name: efficiency
    weight: 1.0   # score: 0.7
```

```
Final Score = (0.9 + 0.8 + 0.7) / 3 = 0.8
```

Example with custom weights:
```yaml
evaluators:
  - name: correctness
    weight: 3.0   # More important
  - name: format
    weight: 1.0
  - name: efficiency
    weight: 1.0
```

```
Final Score = (0.9 × 3 + 0.8 × 1 + 0.7 × 1) / 5 = 0.84
```
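The same calculation as a short Python sketch; the `weighted_average` helper is illustrative, not part of the framework's API:

```python
def weighted_average(results: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into a final score."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# Equal weights: (0.9 + 0.8 + 0.7) / 3
print(weighted_average([(0.9, 1.0), (0.8, 1.0), (0.7, 1.0)]))  # ≈ 0.8

# Custom weights: (0.9 × 3 + 0.8 + 0.7) / 5
print(weighted_average([(0.9, 3.0), (0.8, 1.0), (0.7, 1.0)]))  # ≈ 0.84
```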
### Step 3: Determine Verdict

The final score maps to a verdict:
| Score Range | Verdict | Description |
|---|---|---|
| ≥ 0.8 | pass | Agent meets expectations |
| ≥ 0.6, < 0.8 | borderline | Partially successful, needs review |
| < 0.6 | fail | Agent does not meet expectations |
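Expressed as code, the mapping is a small threshold function; a sketch using the cutoffs from the table above (the function name is illustrative):

```python
def score_to_verdict(score: float) -> str:
    """Map a final score to a verdict using the documented thresholds."""
    if score >= 0.8:
        return "pass"
    if score >= 0.6:
        return "borderline"
    return "fail"

assert score_to_verdict(0.84) == "pass"
assert score_to_verdict(0.7) == "borderline"
assert score_to_verdict(0.5) == "fail"
```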
## Required Criteria Override
Rubrics or evaluators marked as `required: true` can force a fail verdict regardless of score:
```yaml
rubrics:
  - id: safety
    expected_outcome: No harmful content generated
    required: true   # Fails entire eval if missed
    weight: 1.0
  - id: accuracy
    expected_outcome: Answer is correct
    weight: 2.0
```

Flow:
- If any required criterion scores 0.0 → verdict = fail
- Otherwise, calculate the weighted score → determine the verdict (see the sketch below)
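A minimal sketch of that flow, reusing the illustrative `weighted_average` and `score_to_verdict` helpers from above (the result shape is an assumption, not the framework's API):

```python
def evaluate(results: list[dict]) -> tuple[float, str]:
    """results: [{"name": ..., "score": ..., "weight": ..., "required": ...}, ...]"""
    score = weighted_average([(r["score"], r["weight"]) for r in results])
    # Required override: any required criterion at 0.0 forces a fail verdict.
    if any(r.get("required") and r["score"] == 0.0 for r in results):
        return score, "fail"
    return score, score_to_verdict(score)
```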
## Aggregation Strategies
### weighted_average (default)
Standard weighted average of all evaluator scores.
```yaml
aggregator:
  type: weighted_average
```
### minimum

Final score is the lowest evaluator score. Use when all criteria must pass.
```yaml
aggregator:
  type: minimum
```
### maximum

Final score is the highest evaluator score. Rarely used.
```yaml
aggregator:
  type: maximum
```
### safety_gate

Designated evaluators must pass before others are considered.
```yaml
aggregator:
  type: safety_gate
  required: [safety_check]
  # If safety_check fails → fail
  # Otherwise → weighted_average of the remaining evaluators
```
### all_or_nothing

All evaluators must score above the threshold, or the eval fails.
```yaml
aggregator:
  type: all_or_nothing
  threshold: 0.7
```
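To make the differences concrete, here is a hedged sketch of the strategies side by side, again reusing the illustrative `weighted_average` helper (the result shape and the gate's pass threshold are assumptions):

```python
def aggregate(results: list[dict], strategy: str,
              required: tuple[str, ...] = (), threshold: float = 0.7,
              gate_threshold: float = 0.8) -> float:
    """Combine evaluator results ({"name", "score", "weight"}) under one strategy."""
    scores = [r["score"] for r in results]
    pairs = [(r["score"], r["weight"]) for r in results]

    if strategy == "minimum":
        return min(scores)
    if strategy == "maximum":
        return max(scores)
    if strategy == "all_or_nothing":
        # Assumption: falls back to a weighted average once every score clears the bar.
        return weighted_average(pairs) if all(s >= threshold for s in scores) else 0.0
    if strategy == "safety_gate":
        # Gate on the designated evaluators first, then average the rest.
        if any(r["name"] in required and r["score"] < gate_threshold for r in results):
            return 0.0
        rest = [(r["score"], r["weight"]) for r in results if r["name"] not in required]
        return weighted_average(rest)
    return weighted_average(pairs)  # weighted_average (default)
```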
## Result Format

Evaluation results include a detailed breakdown:
```yaml
# Result structure
eval_id: test-case-1
score: 0.85
verdict: pass

evaluator_results:
  - name: correctness
    type: llm_judge
    score: 0.9
    weight: 3.0
    hits:
      - "Correctly identifies the issue"
      - "Provides accurate solution"
    misses:
      - "Could be more concise"
    reasoning: "Strong response with minor style issues"

  - name: format
    type: code_judge
    score: 0.8
    weight: 1.0
    hits:
      - "Valid JSON output"
    misses:
      - "Missing optional field"

hits:   # Aggregated from all evaluators
  - "Correctly identifies the issue"
  - "Provides accurate solution"
  - "Valid JSON output"

misses:
  - "Could be more concise"
  - "Missing optional field"
```
## Verdict Distribution

When evaluating multiple evalcases, track the verdict distribution:
```
Suite Results:
├── Total: 100
├── Pass: 75 (75%)
├── Borderline: 15 (15%)
└── Fail: 10 (10%)
```
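A short sketch of tallying the distribution with the standard library (the verdict list is made-up data matching the tree above):

```python
from collections import Counter

verdicts = ["pass"] * 75 + ["borderline"] * 15 + ["fail"] * 10   # illustrative data
counts = Counter(verdicts)
total = len(verdicts)
for verdict in ("pass", "borderline", "fail"):
    n = counts[verdict]
    print(f"{verdict:>10}: {n} ({n / total:.0%})")
```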
## Custom Thresholds

Future versions may support custom verdict thresholds:
```yaml
# Proposed (not yet implemented)
verdicts:
  pass: 0.9        # Higher bar
  borderline: 0.7
  fail: 0.0        # Default: below borderline
```
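If custom thresholds do land, the verdict mapping sketched earlier would presumably take them as parameters; a purely speculative illustration:

```python
def score_to_verdict_custom(score: float, thresholds: dict[str, float]) -> str:
    """Map a score using configurable cutoffs; defaults mirror the documented table."""
    if score >= thresholds.get("pass", 0.8):
        return "pass"
    if score >= thresholds.get("borderline", 0.6):
        return "borderline"
    return "fail"

score_to_verdict_custom(0.85, {"pass": 0.9, "borderline": 0.7})  # "borderline"
```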
## Best Practices

### 1. Weight by Importance
Assign higher weights to critical criteria:
```yaml
evaluators:
  - name: safety
    weight: 5.0   # Critical
  - name: correctness
    weight: 3.0   # Important
  - name: style
    weight: 1.0   # Nice to have
```
### 2. Use Required for Non-Negotiables

Mark safety-critical criteria as required:
```yaml
rubrics:
  - id: no-pii
    expected_outcome: Does not expose personal information
    required: true
```
### 3. Composite for Gating

Use composite evaluators for pass-first-then-score patterns:
```yaml
- name: release_gate
  type: composite
  aggregator:
    type: safety_gate
    required: [safety, compliance]
```
## Next Steps

- Composite Evaluator - Advanced aggregation
- Rubric Evaluator - Structured criteria
- Organization Patterns - File structure