Glossary

An AI system that can take actions, use tools, and interact with users or environments. Agents can be LLM-based chatbots, autonomous systems, or tool-using assistants.

The open specification for evaluating AI agent capabilities. Defines a declarative YAML-based format for evaluation suites.

The canonical implementation of the AgentEvals standard. Provides CLI tools for running evaluations.

The process of assessing agent performance against defined criteria. Also called “eval” for short.

The primary file format for defining evaluation suites in the AgentEvals specification.

A single test case within an evaluation suite. Contains input, expected outcome, and evaluation criteria.
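
As an illustration of how a single evalcase might be written inside an EVAL.yaml suite, the sketch below uses hypothetical key names (id, input, expected_output, evaluators); the normative field names are defined by the specification's schema, not by this example.

```yaml
# Illustrative sketch only; key names such as id, input, expected_output,
# and evaluators are assumptions, not the normative EVAL.yaml schema.
evalcases:
  - id: refund-request-basic
    input:
      - role: user
        content: "I want a refund for order #1234."
    expected_output: "Agent initiates the refund workflow."
    evaluators:
      - type: llm_judge
        prompt: "Did the assistant begin processing the refund politely?"
```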

A component that assesses agent output. Types include code_judge, llm_judge, rubric, composite, tool_trajectory, field_accuracy, and execution_metrics.

A structured evaluation criterion with optional weights and scoring ranges. Can be a simple string or a detailed object.
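
The sketch below shows the two rubric forms mentioned above, a bare string and a detailed object carrying a weight and a scoring range; the key names (criterion, weight, min_score, max_score) are illustrative assumptions.

```yaml
# Hypothetical rubric shapes; key names are illustrative.
rubrics:
  - "Response cites the order number"             # simple string form
  - criterion: "Tone is professional and concise" # detailed object form
    weight: 2.0
    min_score: 0.0
    max_score: 1.0
```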

The final outcome of an evaluation: pass (≥0.8), borderline (≥0.6 and <0.8), or fail (<0.6).

A numeric value between 0.0 and 1.0 representing evaluation performance.

Criteria that were successfully met during evaluation.

Criteria that were not met during evaluation.
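
A minimal sketch of how a score maps onto a verdict using the thresholds listed above, together with example hits and misses; the function name and variables are illustrative, not part of the specification.

```python
def verdict_for(score: float) -> str:
    """Map a 0.0-1.0 score to a verdict using the documented thresholds."""
    if score >= 0.8:
        return "pass"
    if score >= 0.6:
        return "borderline"
    return "fail"

# Example: three of four criteria met, one missed, with a 0.75 score.
hits = ["cited order number", "offered refund", "professional tone"]
misses = ["did not confirm shipping address"]
print(verdict_for(0.75))  # "borderline"
```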

An evaluator that executes custom scripts (Python, TypeScript, shell) and returns structured results.
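
The script below sketches one plausible shape for a code judge in Python: it reads the agent output, applies a check, and emits a structured result as JSON. The stdin/stdout JSON contract and the field names are assumptions; the specification defines the actual interface.

```python
#!/usr/bin/env python3
# Hypothetical code_judge script; the stdin/stdout JSON contract shown
# here is an assumption, not the normative interface.
import json
import sys

payload = json.load(sys.stdin)       # assumed shape: {"output": "...", ...}
text = payload.get("output", "")

hits, misses = [], []
(hits if "refund" in text.lower() else misses).append("mentions refund")

score = len(hits) / (len(hits) + len(misses))
json.dump({"score": score, "hits": hits, "misses": misses}, sys.stdout)
```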

An evaluator that uses a language model to assess output based on a prompt template.
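
A sketch of how an llm_judge might be declared with a prompt template; the type name comes from the evaluator list above, while the prompt key and the {{output}} placeholder syntax are assumptions.

```yaml
# Illustrative llm_judge declaration; the prompt key and the
# {{output}} template syntax are assumptions.
evaluators:
  - type: llm_judge
    prompt: |
      Rate from 0.0 to 1.0 how well the following answer resolves the
      user's billing question. Respond with a single number.

      Answer: {{output}}
```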

An evaluator that combines multiple child evaluators with an aggregation strategy.

An evaluator that validates the sequence and pattern of tool calls made by an agent.

An evaluator that checks structured data fields against expected values.

An evaluator that validates performance characteristics (latency, cost, tokens).
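
The combined sketch below illustrates the three evaluator types defined above; only the type names are taken from the glossary, and every other key (expected_calls, fields, max_latency_ms, and so on) is a hypothetical placeholder.

```yaml
# Illustrative configuration; only the type names are drawn from the
# glossary, the remaining keys are assumptions.
evaluators:
  - type: tool_trajectory
    expected_calls:            # expected order of tool invocations
      - lookup_order
      - issue_refund
  - type: field_accuracy
    fields:
      order_id: "1234"         # expected value for a structured field
      status: "refunded"
  - type: execution_metrics
    max_latency_ms: 5000
    max_cost_usd: 0.05
    max_tokens: 4000
```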

A strategy for combining scores from multiple evaluators. Types include weighted_average, minimum, maximum, safety_gate, and all_or_nothing.

Combines scores using weights: Σ(score × weight) / Σ(weights).

Requires specified evaluators to pass before considering others.

All evaluators must score above a threshold, or the entire evaluation fails.
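
A worked example of the weighted_average formula inside a hypothetical composite evaluator: with child scores 1.0, 0.6, and 0.9 and weights 2, 1, and 1, the aggregate is (1.0×2 + 0.6×1 + 0.9×1) / (2+1+1) = 3.5 / 4 = 0.875. The composite, children, and aggregation keys are illustrative, not normative.

```yaml
# Hypothetical composite evaluator; key names are assumptions.
evaluators:
  - type: composite
    aggregation: weighted_average
    children:
      - type: code_judge
        script: judges/policy_check.py   # hypothetical path
        weight: 2
      - type: llm_judge
        prompt: "Is the tone appropriate?"
        weight: 1
      - type: field_accuracy
        fields: { status: "refunded" }
        weight: 1
# weighted_average with child scores 1.0, 0.6, 0.9:
#   (1.0*2 + 0.6*1 + 0.9*1) / (2 + 1 + 1) = 0.875
# A safety_gate strategy would instead require a gating child
# (for example the policy check) to pass before the others count.
```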

A unit of conversation containing role and content. Roles include system, user, assistant, and tool.

A structured content element within a message. Types include text, file, image, and json.

A function invocation by an assistant, containing name and arguments.

The conversation history sent to the agent for evaluation.

The agent’s response, potentially including tool calls and results.
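
The sketch below puts messages, content blocks, a tool call, and the input/output pair together; the roles and tool call structure follow the definitions above, while the exact key names (input, output, tool_calls, arguments) are assumptions.

```yaml
# Illustrative message structures; key names are assumptions.
input:
  - role: system
    content: "You are a billing assistant."
  - role: user
    content:
      - type: text
        text: "Refund order #1234, please."
output:
  - role: assistant
    tool_calls:
      - name: issue_refund
        arguments: { order_id: "1234" }
  - role: tool
    content: '{"status": "refunded"}'
```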

A named configuration for connecting to an agent or model provider.

The system hosting the agent (e.g., Anthropic, Azure, OpenAI).

A separate target used for LLM-based evaluation, often a more capable model.
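
A sketch of named target configurations, one for the agent under test and a separate, typically more capable model used for judging; the providers are vendors named above, but the keys and placeholder model names are illustrative.

```yaml
# Hypothetical target configuration; key names and model ids are placeholders.
targets:
  agent_under_test:
    provider: anthropic
    model: example-agent-model
  judge:
    provider: openai
    model: example-judge-model
```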

The output of running an evaluation, including score, verdict, hits, misses, and trace data.

Metadata about execution including event count, tools used, cost, and duration.

JSON Lines format used for storing evaluation results, with one JSON object per line.
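
One plausible JSONL result line combining the score, verdict, hits, misses, and trace data described above; field names beyond those explicitly listed in the glossary are assumptions.

```json
{"evalcase": "refund-request-basic", "score": 0.875, "verdict": "pass", "hits": ["refund issued"], "misses": [], "trace": {"events": 12, "tools_used": ["issue_refund"], "cost_usd": 0.01, "duration_ms": 2300}}
```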

Storing all evaluations in a single evals/ directory.

Storing evaluations alongside skills in skills/*/evals/.
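
On disk, the two layouts might look like this; apart from the evals/ and skills/*/evals/ paths given above, the file names are illustrative.

```text
# Centralized: all evaluations in one directory
evals/
  billing.eval.yaml
  regression.jsonl

# Colocated: evaluations live next to the skill they test
skills/
  refunds/
    SKILL.md
    evals/
      refunds.eval.yaml
```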

A collection of evalcases, either in YAML or JSONL format.

An organization strategy with Unit Tests at the base, Evals in the middle, and Simulations at the top.

A pattern ensuring critical safety checks pass before quality evaluation.

Evaluation of conversations spanning multiple exchanges.

An identifier grouping related evalcases in a multi-turn conversation.
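
A sketch of two evalcases grouped into a single multi-turn conversation; conversation_id is a hypothetical name for the grouping identifier described above.

```yaml
# Hypothetical multi-turn grouping; conversation_id is an assumed key name.
evalcases:
  - id: refund-turn-1
    conversation_id: refund-thread
    input:
      - role: user
        content: "I never received order #1234."
  - id: refund-turn-2
    conversation_id: refund-thread
    input:
      - role: user
        content: "Yes, please refund it to my original card."
```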

Principle that evals should test what agents actually do in production.

Principle of preferring atomic evaluators that compose into complex ones.

Principle of building sophisticated evaluations from simple primitives.

Principle of supporting tests for novel, unanticipated agent behaviors.

Principle of tracking evaluation and prompt versions for continuous improvement.

A vocabulary for annotating and validating JSON documents. Used to define the EVAL.yaml structure.
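
As a tiny illustration of how JSON Schema constrains a document, the fragment below requires a numeric score between 0.0 and 1.0; it is a generic example, not the actual schema for EVAL.yaml.

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["score"],
  "properties": {
    "score": { "type": "number", "minimum": 0.0, "maximum": 1.0 }
  }
}
```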

A human-readable data serialization format. Primary format for evaluation files.

Versioning scheme using MAJOR.MINOR.PATCH numbers.

A score boundary used for determining pass/fail status.

A multiplier indicating relative importance of an evaluator or rubric.

An open standard for defining reusable AI agent skills using SKILL.md files.

A protocol for extending AI assistants with tools and resources.

A format where each line is a valid JSON object, used for large datasets and results.