The Evaluation module provides a composable framework for assessing agent and team output quality. Run evaluations programmatically across four dimensions: accuracy (LLM-judged), performance (latency and memory), reliability (tool-usage verification), and custom criteria (flexible LLM judgment).
## Quick Start
```python
from definable.agent import Agent
from definable.agent.eval import AccuracyEval, EvalCase

agent = Agent(model="openai/gpt-4o-mini", instructions="You are a math tutor.")
eval = AccuracyEval(threshold=7.0)
case = EvalCase(input="What is 2+2?", expected="4")

result = await eval.arun(agent, case)
print(f"Score: {result.score}/10")
print(f"Pass: {result.success}")
```
## Evaluation Types
### AccuracyEval
Uses an LLM judge to score output correctness on a 1-10 scale.
```python
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(
    judge_model="openai/gpt-4o-mini",  # Model for judging
    threshold=7.0,                     # Minimum score to pass
)

result = await eval.arun(agent, EvalCase(
    input="What is the capital of France?",
    expected="Paris",
))
# result.score: float (1-10)
# result.success: bool (score >= threshold)
# result.reason: str (judge's explanation)
```
- `judge_model` (`str`, default `"openai/gpt-4o-mini"`): Model used to judge output quality. Accepts string shorthand.
- `threshold` (`float`): Minimum score (1-10) required for a passing result.
### PerformanceEval
Profiles execution time and memory usage across multiple runs.
```python
from definable.agent.eval import PerformanceEval, EvalCase

eval = PerformanceEval(
    duration_threshold_ms=5000,  # Fail if p95 latency exceeds 5s
    memory_threshold_mb=100,     # Fail if peak memory exceeds 100MB
    runs=3,                      # Number of profiling runs
    warmup_runs=1,               # Excluded from results
)

result = await eval.arun(agent, EvalCase(input="Summarize this document"))
print(f"p95 latency: {result.duration_ms:.0f}ms")
print(f"Peak memory: {result.peak_memory_mb:.1f}MB")
```
- `duration_threshold_ms`: Maximum allowed p95 execution time in milliseconds. `None` disables the check.
- `memory_threshold_mb`: Maximum allowed peak memory delta in megabytes. `None` disables the check.
- `runs` (`int`): Number of profiling runs. Duration uses the p95 percentile; memory uses the peak across all runs.
- `warmup_runs` (`int`): Number of warmup runs excluded from metrics (useful for cache priming).
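The aggregation rules above can be sketched in plain Python. This is an illustration of nearest-rank p95 and peak selection, not the library's internal implementation:

```python
def p95(samples):
    """Nearest-rank 95th percentile of a list of per-run durations."""
    ordered = sorted(samples)
    rank = max(1, round(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

def aggregate(durations_ms, memory_deltas_mb):
    # Duration is reported as the p95 across runs; memory as the peak delta.
    return {
        "duration_ms": p95(durations_ms),
        "peak_memory_mb": max(memory_deltas_mb),
    }
```

With a small number of runs (the default here is 3), nearest-rank p95 simply picks the slowest run, which keeps the metric conservative.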
### ReliabilityEval
Verifies that expected tools are called during agent execution.
```python
from definable.agent.eval import ReliabilityEval, EvalCase

eval = ReliabilityEval(
    expected_tools=["search_web", "summarize"],
    strict=False,  # Extra tools are OK
)

result = await eval.arun(agent, EvalCase(input="Research AI trends"))
print(f"Missing tools: {result.missing_tools}")
print(f"Extra tools: {result.extra_tools}")
```
- `expected_tools` (`List[str]`): Tool names that must be called during execution.
- `strict` (`bool`): When `True`, unexpected tool calls cause failure. When `False`, only missing tools fail.

Per-case overrides are supported via `EvalCase(metadata={"expected_tools": ["tool_a"]})`.
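The strict/non-strict semantics can be expressed as simple list arithmetic. This is a sketch of the behavior described above, not the library's code:

```python
def check_tools(expected, actual, strict=False):
    missing = [t for t in expected if t not in actual]
    extra = [t for t in actual if t not in expected]
    # Missing tools always fail; extra tools only fail in strict mode.
    success = not missing and (not strict or not extra)
    return {"missing_tools": missing, "extra_tools": extra, "success": success}
```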
### AgentAsJudgeEval
Evaluates output against custom criteria using an LLM judge. Supports numeric (1-10 score) and binary (pass/fail) modes.
```python
from definable.agent.eval import AgentAsJudgeEval, EvalCase

eval = AgentAsJudgeEval(
    criteria="Output must be concise, factual, and under 100 words",
    mode="numeric",
    threshold=8.0,
)

result = await eval.arun(agent, EvalCase(input="Explain gravity"))
```
- `criteria` (`str`): Evaluation criteria for the judge. Can be overridden per case via `case.metadata["criteria"]`.
- `mode`: `"numeric"` for 1-10 scoring against `threshold`, or `"binary"` for pass/fail.
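The pass/fail decision for the two modes reduces to the following sketch. The binary verdict's representation (`"pass"`) is an assumption for illustration; only the numeric-vs-threshold comparison is stated above:

```python
def judge_passes(mode, score=None, verdict=None, threshold=8.0):
    if mode == "numeric":
        # numeric: a 1-10 judge score compared against the threshold
        return score >= threshold
    # binary: the judge returns a pass/fail verdict directly
    # (this string representation is a hypothetical stand-in)
    return verdict == "pass"
```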
## Batch Evaluation
Run multiple test cases and get aggregated results:
```python
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(threshold=7.0)
cases = [
    EvalCase(input="What is 2+2?", expected="4", name="basic_math"),
    EvalCase(input="Capital of Japan?", expected="Tokyo", name="geography"),
    EvalCase(input="Who wrote Hamlet?", expected="Shakespeare", name="literature"),
]

suite = await eval.arun_batch(agent, cases)
print(f"Pass rate: {suite.pass_rate:.0%}")  # e.g., "100%"
print(f"Passed: {suite.passed}/{suite.total}")
```
The EvalSuite result provides:
| Property | Type | Description |
|---|---|---|
| `total` | `int` | Total number of cases |
| `passed` | `int` | Cases where `success=True` |
| `failed` | `int` | Cases where `success=False` |
| `pass_rate` | `float` | `passed / total` (0.0-1.0) |
| `results` | `List[EvalResult]` | Individual results |
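The aggregate properties follow directly from the per-case `success` flags, as this sketch shows (illustrative, not the `EvalSuite` implementation):

```python
def summarize(successes):
    """Aggregate a list of per-case success booleans into suite-level counts."""
    total = len(successes)
    passed = sum(1 for ok in successes if ok)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }
```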
## Team Evaluation
All eval types support team evaluation:
```python
from definable.agent.eval import AccuracyEval, EvalCase
from definable.agent.team import Team

team = Team(leader=leader_agent, members=[researcher, writer])
eval = AccuracyEval(threshold=7.0)

result = await eval.arun_team(team, EvalCase(
    input="Write a research summary on quantum computing",
    expected="A comprehensive summary covering...",
))
```
## Result Types
Each eval type returns a specialized result:
| Eval Type | Result Type | Key Fields |
|---|---|---|
| `AccuracyEval` | `AccuracyResult` | `score`, `threshold`, `expected`, `actual` |
| `PerformanceEval` | `PerformanceResult` | `duration_ms`, `peak_memory_mb`, `durations` |
| `ReliabilityEval` | `ReliabilityResult` | `expected_tools`, `actual_tools`, `missing_tools` |
| `AgentAsJudgeEval` | `JudgeResult` | `criteria`, `mode`, `threshold` |
All results share the common fields `eval_name`, `success`, `score`, `reason`, and `metadata`, and support `.to_dict()` for JSON serialization.
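As an illustration of how the shared fields might serialize, the sketch below uses a hypothetical dataclass stand-in (field names from the list above; the actual result classes are defined by the library):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ResultSketch:
    # Stand-in mirroring the shared fields listed above (not the library class).
    eval_name: str
    success: bool
    score: float
    reason: str = ""
    metadata: dict = field(default_factory=dict)

    def to_dict(self):
        return asdict(self)

payload = json.dumps(ResultSketch("accuracy", True, 9.0, "Matches expected").to_dict())
```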
## Custom Evaluators
Extend BaseEval to create custom evaluation logic:
```python
from definable.agent.eval import BaseEval, EvalCase, EvalResult

class LengthEval(BaseEval):
    name = "length"

    def __init__(self, max_length: int = 500):
        super().__init__()
        self.max_length = max_length

    async def evaluate(self, agent, case: EvalCase) -> EvalResult:
        output = await agent.arun(case.input)
        content = output.content or ""
        length = len(content)
        success = length <= self.max_length
        return EvalResult(
            eval_name=self.name,
            success=success,
            score=10.0 if success else max(0, 10 - (length - self.max_length) / 100),
            reason=f"Output length: {length} chars (max: {self.max_length})",
        )
```
## Imports
```python
# All eval classes
from definable.agent.eval import (
    BaseEval, EvalCase, EvalSuite,
    AccuracyEval, PerformanceEval, ReliabilityEval, AgentAsJudgeEval,
    EvalResult, AccuracyResult, PerformanceResult, ReliabilityResult, JudgeResult,
)

# Also available from the top-level agent package
from definable.agent import AccuracyEval, EvalCase, EvalSuite
```