The Evaluation module provides a composable framework for assessing agent and team output quality. Run evaluations programmatically across four dimensions: accuracy (LLM-judged), performance (latency and memory), reliability (tool-usage verification), and custom criteria (flexible LLM judgment).
## Quick Start
```python
from definable.agent import Agent
from definable.agent.eval import AccuracyEval, EvalCase

agent = Agent(model="openai/gpt-4o-mini", instructions="You are a math tutor.")
eval = AccuracyEval(threshold=7.0)
case = EvalCase(input="What is 2+2?", expected="4")

result = await eval.arun(agent, case)
print(f"Score: {result.score}/10")
print(f"Pass: {result.success}")
```
## Evaluation Types
### AccuracyEval
Uses an LLM judge to score output correctness on a 1-10 scale.
```python
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(
    judge_model="openai/gpt-4o-mini",  # Model for judging
    threshold=7.0,                     # Minimum score to pass
)

result = await eval.arun(agent, EvalCase(
    input="What is the capital of France?",
    expected="Paris",
))
# result.score: float (1-10)
# result.success: bool (score >= threshold)
# result.reason: str (judge's explanation)
```
- `judge_model` (`str`, default `"openai/gpt-4o-mini"`): Model used to judge output quality. Accepts string shorthand.
- `threshold` (`float`): Minimum score (1-10) required for a passing result.
### PerformanceEval
Profiles execution time and memory usage across multiple runs.
```python
from definable.agent.eval import PerformanceEval, EvalCase

eval = PerformanceEval(
    duration_threshold_ms=5000,  # Fail if p95 latency exceeds 5s
    memory_threshold_mb=100,     # Fail if peak memory exceeds 100MB
    runs=3,                      # Number of profiling runs
    warmup_runs=1,               # Excluded from results
)

result = await eval.arun(agent, EvalCase(input="Summarize this document"))
print(f"p95 latency: {result.duration_ms:.0f}ms")
print(f"Peak memory: {result.peak_memory_mb:.1f}MB")
```
- `duration_threshold_ms`: Maximum allowed p95 execution time in milliseconds. `None` disables the check.
- `memory_threshold_mb`: Maximum allowed peak memory delta in megabytes. `None` disables the check.
- `runs` (`int`): Number of profiling runs. Duration uses the p95 percentile; memory uses the peak across all runs.
- `warmup_runs` (`int`): Number of warmup runs excluded from metrics (useful for cache priming).
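The aggregation rules above can be sketched in plain Python. This is an illustration of nearest-rank p95 and peak selection, not the library's internal implementation:

```python
def p95(samples):
    """Nearest-rank 95th percentile of a list of per-run durations."""
    ordered = sorted(samples)
    rank = max(1, round(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

def aggregate(durations_ms, memory_deltas_mb):
    # Duration is reported as the p95 across runs; memory as the peak delta.
    return {
        "duration_ms": p95(durations_ms),
        "peak_memory_mb": max(memory_deltas_mb),
    }
```

With a small number of runs (the default here is 3), nearest-rank p95 simply picks the slowest run, which keeps the metric conservative.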
### ReliabilityEval
Verifies that expected tools are called during agent execution.
```python
from definable.agent.eval import ReliabilityEval, EvalCase

eval = ReliabilityEval(
    expected_tools=["search_web", "summarize"],
    strict=False,  # Extra tools are OK
)

result = await eval.arun(agent, EvalCase(input="Research AI trends"))
print(f"Missing tools: {result.missing_tools}")
print(f"Extra tools: {result.extra_tools}")
```
- `expected_tools` (`List[str]`): Tool names that must be called during execution.
- `strict` (`bool`): When `True`, unexpected tool calls cause failure. When `False`, only missing tools fail.

Per-case overrides are supported via `EvalCase(metadata={"expected_tools": ["tool_a"]})`.
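The strict/non-strict semantics can be expressed as simple list arithmetic. This is a sketch of the behavior described above, not the library's code:

```python
def check_tools(expected, actual, strict=False):
    missing = [t for t in expected if t not in actual]
    extra = [t for t in actual if t not in expected]
    # Missing tools always fail; extra tools only fail in strict mode.
    success = not missing and (not strict or not extra)
    return {"missing_tools": missing, "extra_tools": extra, "success": success}
```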
### AgentAsJudgeEval
Evaluates output against custom criteria using an LLM judge. Supports numeric (1-10 score) and binary (pass/fail) modes.
```python
from definable.agent.eval import AgentAsJudgeEval, EvalCase

eval = AgentAsJudgeEval(
    criteria="Output must be concise, factual, and under 100 words",
    mode="numeric",
    threshold=8.0,
)

result = await eval.arun(agent, EvalCase(input="Explain gravity"))
```
- `criteria` (`str`): Evaluation criteria for the judge. Can be overridden per case via `case.metadata["criteria"]`.
- `mode`: `"numeric"` for 1-10 scoring against `threshold`, or `"binary"` for pass/fail.
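The pass/fail decision for the two modes reduces to the following sketch. The binary verdict's representation (`"pass"`) is an assumption for illustration; only the numeric-vs-threshold comparison is stated above:

```python
def judge_passes(mode, score=None, verdict=None, threshold=8.0):
    if mode == "numeric":
        # numeric: a 1-10 judge score compared against the threshold
        return score >= threshold
    # binary: the judge returns a pass/fail verdict directly
    # (this string representation is a hypothetical stand-in)
    return verdict == "pass"
```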
## Batch Evaluation
Run multiple test cases and get aggregated results:
```python
from definable.agent.eval import AccuracyEval, EvalCase

eval = AccuracyEval(threshold=7.0)
cases = [
    EvalCase(input="What is 2+2?", expected="4", name="basic_math"),
    EvalCase(input="Capital of Japan?", expected="Tokyo", name="geography"),
    EvalCase(input="Who wrote Hamlet?", expected="Shakespeare", name="literature"),
]

suite = await eval.arun_batch(agent, cases)
print(f"Pass rate: {suite.pass_rate:.0%}")  # e.g., "100%"
print(f"Passed: {suite.passed}/{suite.total}")
```
The EvalSuite result provides:
| Property | Type | Description |
|---|---|---|
| `total` | `int` | Total number of cases |
| `passed` | `int` | Cases where `success=True` |
| `failed` | `int` | Cases where `success=False` |
| `pass_rate` | `float` | `passed / total` (0.0-1.0) |
| `results` | `List[EvalResult]` | Individual results |
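The aggregate properties follow directly from the per-case `success` flags, as this sketch shows (illustrative, not the `EvalSuite` implementation):

```python
def summarize(successes):
    """Aggregate a list of per-case success booleans into suite-level counts."""
    total = len(successes)
    passed = sum(1 for ok in successes if ok)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }
```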
## Team Evaluation
All eval types support team evaluation:
```python
from definable.agent.eval import AccuracyEval, EvalCase
from definable.agent.team import Team

team = Team(leader=leader_agent, members=[researcher, writer])
eval = AccuracyEval(threshold=7.0)

result = await eval.arun_team(team, EvalCase(
    input="Write a research summary on quantum computing",
    expected="A comprehensive summary covering...",
))
```
## Result Types
Each eval type returns a specialized result:
| Eval Type | Result Type | Key Fields |
|---|---|---|
| `AccuracyEval` | `AccuracyResult` | `score`, `threshold`, `expected`, `actual` |
| `PerformanceEval` | `PerformanceResult` | `duration_ms`, `peak_memory_mb`, `durations` |
| `ReliabilityEval` | `ReliabilityResult` | `expected_tools`, `actual_tools`, `missing_tools` |
| `AgentAsJudgeEval` | `JudgeResult` | `criteria`, `mode`, `threshold` |
All results share the common fields `eval_name`, `success`, `score`, `reason`, and `metadata`, and support `.to_dict()` for JSON serialization.
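As an illustration of how the shared fields might serialize, the sketch below uses a hypothetical dataclass stand-in (field names from the list above; the actual result classes are defined by the library):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ResultSketch:
    # Stand-in mirroring the shared fields listed above (not the library class).
    eval_name: str
    success: bool
    score: float
    reason: str = ""
    metadata: dict = field(default_factory=dict)

    def to_dict(self):
        return asdict(self)

payload = json.dumps(ResultSketch("accuracy", True, 9.0, "Matches expected").to_dict())
```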
## Custom Evaluators
Extend BaseEval to create custom evaluation logic:
```python
from definable.agent.eval import BaseEval, EvalCase, EvalResult

class LengthEval(BaseEval):
    name = "length"

    def __init__(self, max_length: int = 500):
        super().__init__()
        self.max_length = max_length

    async def evaluate(self, agent, case: EvalCase) -> EvalResult:
        output = await agent.arun(case.input)
        content = output.content or ""
        length = len(content)
        success = length <= self.max_length
        return EvalResult(
            eval_name=self.name,
            success=success,
            score=10.0 if success else max(0, 10 - (length - self.max_length) / 100),
            reason=f"Output length: {length} chars (max: {self.max_length})",
        )
```
## Imports
```python
# All eval classes
from definable.agent.eval import (
    BaseEval, EvalCase, EvalSuite,
    AccuracyEval, PerformanceEval, ReliabilityEval, AgentAsJudgeEval,
    EvalResult, AccuracyResult, PerformanceResult, ReliabilityResult, JudgeResult,
)

# Also available from the top-level agent package
from definable.agent import AccuracyEval, EvalCase, EvalSuite
```