The Evaluation module provides a composable framework for assessing agent and team output quality. Run evaluations programmatically against multiple dimensions — accuracy (LLM-judged), performance (latency and memory), reliability (tool usage verification), and custom criteria (flexible LLM judgment).

Quick Start

from definable.agent import Agent
from definable.agent.eval import AccuracyEval, EvalCase

agent = Agent(model="openai/gpt-4o-mini", instructions="You are a math tutor.")

eval = AccuracyEval(threshold=7.0)
case = EvalCase(input="What is 2+2?", expected="4")
result = await eval.arun(agent, case)

print(f"Score: {result.score}/10")
print(f"Pass: {result.success}")

Evaluation Types

AccuracyEval

Uses an LLM judge to score output correctness on a 1-10 scale.
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(
    judge_model="openai/gpt-4o-mini",  # Model for judging
    threshold=7.0,                      # Minimum score to pass
)

result = await eval.arun(agent, EvalCase(
    input="What is the capital of France?",
    expected="Paris",
))
# result.score: float (1-10)
# result.success: bool (score >= threshold)
# result.reason: str (judge's explanation)
judge_model (str, default: "openai/gpt-4o-mini")
  Model used to judge output quality. Accepts string shorthand.
threshold (float, default: 7.0)
  Minimum score (1-10) required for a passing result.

PerformanceEval

Profiles execution time and memory usage across multiple runs.
from definable.agent.eval import PerformanceEval, EvalCase

eval = PerformanceEval(
    duration_threshold_ms=5000,  # Fail if p95 latency exceeds 5s
    memory_threshold_mb=100,     # Fail if peak memory exceeds 100MB
    runs=3,                      # Number of profiling runs
    warmup_runs=1,               # Excluded from results
)

result = await eval.arun(agent, EvalCase(input="Summarize this document"))
print(f"p95 latency: {result.duration_ms:.0f}ms")
print(f"Peak memory: {result.peak_memory_mb:.1f}MB")
duration_threshold_ms (float)
  Maximum allowed p95 execution time in milliseconds. None disables the check.
memory_threshold_mb (float)
  Maximum allowed peak memory delta in megabytes. None disables the check.
runs (int, default: 3)
  Number of profiling runs. Duration uses the p95 percentile; memory uses the peak across all runs.
warmup_runs (int, default: 0)
  Number of warmup runs excluded from metrics (useful for cache priming).
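The aggregation semantics can be sketched in plain Python. This is illustrative only: the exact percentile interpolation definable uses internally is an implementation detail, and nearest-rank is assumed here.

```python
import math

def p95(durations_ms: list[float]) -> float:
    """Nearest-rank 95th percentile over the recorded run durations (assumption)."""
    ordered = sorted(durations_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]

def peak(memory_mb: list[float]) -> float:
    """Peak memory delta is the maximum observed across all runs."""
    return max(memory_mb)

# Three profiling runs; warmup runs would already be excluded.
durations = [120.0, 135.0, 410.0]
print(p95(durations))            # with 3 runs, nearest-rank p95 is the slowest run
print(peak([12.5, 13.1, 12.9]))
```

With small run counts, p95 effectively reports the worst run, which is why a single slow outlier can trip duration_threshold_ms.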

ReliabilityEval

Verifies that expected tools are called during agent execution.
from definable.agent.eval import ReliabilityEval, EvalCase

eval = ReliabilityEval(
    expected_tools=["search_web", "summarize"],
    strict=False,  # Extra tools are OK
)

result = await eval.arun(agent, EvalCase(input="Research AI trends"))
print(f"Missing tools: {result.missing_tools}")
print(f"Extra tools: {result.extra_tools}")
expected_tools (List[str])
  Tool names that must be called during execution.
strict (bool, default: false)
  When true, unexpected tool calls cause failure. When false, only missing tools fail.

Per-case overrides are supported via EvalCase(metadata={"expected_tools": ["tool_a"]}).
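The pass/fail rule amounts to a set comparison between expected and observed tool calls. A self-contained sketch of the documented semantics (the function name is illustrative, not part of the API):

```python
def check_tools(expected: list[str], actual: list[str], strict: bool):
    """Return (success, missing_tools, extra_tools) per the documented rule."""
    missing = [t for t in expected if t not in actual]
    extra = [t for t in actual if t not in expected]
    # Missing tools always fail; extra tools only fail in strict mode.
    success = not missing and (not strict or not extra)
    return success, missing, extra

print(check_tools(["search_web", "summarize"],
                  ["search_web", "summarize", "log_step"],
                  strict=False))
```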

AgentAsJudgeEval

Evaluates output against custom criteria using an LLM judge. Supports numeric (1-10 score) and binary (pass/fail) modes.
from definable.agent.eval import AgentAsJudgeEval, EvalCase

eval = AgentAsJudgeEval(
    criteria="Output must be concise, factual, and under 100 words",
    mode="numeric",
    threshold=8.0,
)
result = await eval.arun(agent, EvalCase(input="Explain gravity"))
criteria (str)
  Evaluation criteria for the judge. Can be overridden per-case via case.metadata["criteria"].
mode (str, default: "numeric")
  "numeric" for 1-10 scoring with threshold, or "binary" for pass/fail.
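The two modes differ only in how the judge's verdict maps to success. A rough sketch of that mapping, assuming numeric mode compares the score against the threshold and binary mode takes the judge's verdict directly:

```python
def judge_success(mode: str, verdict, threshold: float = 8.0) -> bool:
    """Map a judge verdict to pass/fail for the two documented modes (sketch)."""
    if mode == "numeric":
        # Judge returns a 1-10 score; compare against the threshold.
        return float(verdict) >= threshold
    if mode == "binary":
        # Judge returns a direct pass/fail verdict; threshold is unused.
        return bool(verdict)
    raise ValueError(f"unknown mode: {mode}")

print(judge_success("numeric", 8.5))   # passes an 8.0 threshold
print(judge_success("binary", False))  # judge said fail
```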

Batch Evaluation

Run multiple test cases and get aggregated results:
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(threshold=7.0)
cases = [
    EvalCase(input="What is 2+2?", expected="4", name="basic_math"),
    EvalCase(input="Capital of Japan?", expected="Tokyo", name="geography"),
    EvalCase(input="Who wrote Hamlet?", expected="Shakespeare", name="literature"),
]

suite = await eval.arun_batch(agent, cases)
print(f"Pass rate: {suite.pass_rate:.0%}")  # e.g., "100%"
print(f"Passed: {suite.passed}/{suite.total}")
The EvalSuite result provides:

| Property | Type | Description |
| --- | --- | --- |
| total | int | Total number of cases |
| passed | int | Cases where success=True |
| failed | int | Cases where success=False |
| pass_rate | float | passed / total (0.0-1.0) |
| results | List[EvalResult] | Individual results |
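The aggregate fields follow directly from the per-case results; a minimal sketch of the arithmetic over plain dicts standing in for EvalResult objects:

```python
# Stand-in per-case outcomes (the real objects are EvalResult instances).
results = [
    {"name": "basic_math", "success": True},
    {"name": "geography", "success": True},
    {"name": "literature", "success": False},
]

total = len(results)
passed = sum(1 for r in results if r["success"])
failed = total - passed
pass_rate = passed / total if total else 0.0

print(f"Pass rate: {pass_rate:.0%}")   # 67%
print(f"Passed: {passed}/{total}")     # 2/3
```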

Team Evaluation

All eval types support team evaluation:
from definable.agent.eval import AccuracyEval, EvalCase
from definable.agent.team import Team

team = Team(leader=leader_agent, members=[researcher, writer])

eval = AccuracyEval(threshold=7.0)
result = await eval.arun_team(team, EvalCase(
    input="Write a research summary on quantum computing",
    expected="A comprehensive summary covering...",
))

Result Types

Each eval type returns a specialized result:
| Eval Type | Result Type | Key Fields |
| --- | --- | --- |
| AccuracyEval | AccuracyResult | score, threshold, expected, actual |
| PerformanceEval | PerformanceResult | duration_ms, peak_memory_mb, durations |
| ReliabilityEval | ReliabilityResult | expected_tools, actual_tools, missing_tools |
| AgentAsJudgeEval | JudgeResult | criteria, mode, threshold |
All results share common fields: eval_name, success, score, reason, metadata. All results support .to_dict() for JSON serialization.
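Since every result exposes the same common fields plus .to_dict(), persisting a run to JSON is straightforward. A self-contained sketch using a stand-in dataclass with the documented common fields (the real classes live in definable.agent.eval):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class Result:
    """Stand-in mirroring EvalResult's documented common fields."""
    eval_name: str
    success: bool
    score: Optional[float] = None
    reason: str = ""
    metadata: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return asdict(self)

result = Result(eval_name="accuracy", success=True, score=9.0, reason="Exact match")
payload = json.dumps(result.to_dict(), indent=2)
print(payload)
```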

Custom Evaluators

Extend BaseEval to create custom evaluation logic:
from definable.agent.eval import BaseEval, EvalCase, EvalResult

class LengthEval(BaseEval):
    name = "length"

    def __init__(self, max_length: int = 500):
        super().__init__()
        self.max_length = max_length

    async def evaluate(self, agent, case: EvalCase) -> EvalResult:
        output = await agent.arun(case.input)
        content = output.content or ""
        length = len(content)
        success = length <= self.max_length

        return EvalResult(
            eval_name=self.name,
            success=success,
            score=10.0 if success else max(0, 10 - (length - self.max_length) / 100),
            reason=f"Output length: {length} chars (max: {self.max_length})",
        )
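The scoring formula in LengthEval degrades linearly once the limit is exceeded and floors at zero. Its behavior at a few sample lengths, extracted as a standalone function:

```python
def length_score(length: int, max_length: int = 500) -> float:
    # Mirrors the score expression in LengthEval above.
    if length <= max_length:
        return 10.0
    return max(0, 10 - (length - max_length) / 100)

print(length_score(400))   # 10.0 (within limit)
print(length_score(600))   # 9.0  (100 chars over costs one point)
print(length_score(1600))  # 0    (floored at zero)
```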

Imports

# All eval classes
from definable.agent.eval import (
    BaseEval, EvalCase, EvalSuite,
    AccuracyEval, PerformanceEval, ReliabilityEval, AgentAsJudgeEval,
    EvalResult, AccuracyResult, PerformanceResult, ReliabilityResult, JudgeResult,
)

# Also available from top-level agent package
from definable.agent import AccuracyEval, EvalCase, EvalSuite