Skip to content

Evaluation Framework

The @reactive-agents/eval package provides a structured framework for measuring agent quality. It uses an LLM-as-judge approach to score agent responses across multiple dimensions, persists results to SQLite, and detects regressions between agent versions.

Define a suite, run it against an agent, and read results:

import { EvalService, createEvalLayer } from "@reactive-agents/eval";
import { Effect } from "effect";
// 1. Define an eval suite
const suite = {
id: "qa-suite-v1",
name: "Q&A Quality Suite",
description: "Tests factual accuracy and completeness of agent answers",
cases: [
{
id: "case-001",
name: "Capital city lookup",
input: "What is the capital of France?",
expectedOutput: "Paris",
tags: ["geography", "factual"],
},
{
id: "case-002",
name: "Multi-step reasoning",
input: "If a train travels 120 km in 2 hours, what is its average speed?",
expectedOutput: "60 km/h",
expectedBehavior: { maxSteps: 3 },
tags: ["math", "reasoning"],
},
],
dimensions: ["accuracy", "relevance", "completeness", "safety"],
};
// 2. Run the suite via EvalService
const program = Effect.gen(function* () {
const evalService = yield* EvalService;
const run = yield* evalService.runSuite(suite, "claude-sonnet-4-6");
console.log(`Passed: ${run.summary.passed}/${run.summary.totalCases}`);
console.log(`Avg score: ${run.summary.avgScore.toFixed(3)}`);
console.log(`Avg latency: ${run.summary.avgLatencyMs.toFixed(0)}ms`);
console.log(`Total cost: $${run.summary.totalCostUsd.toFixed(5)}`);
});
// 3. Provide the eval layer (requires LLMService)
await Effect.runPromise(
program.pipe(Effect.provide(createEvalLayer()))
);

Each dimension scores a response from 0.0 (worst) to 1.0 (best). The LLM judge receives the input, the actual agent output, and optionally the expected output, then returns a score.

DimensionWhat It MeasuresFunction
accuracyFactual correctness vs. expected outputscoreAccuracy
relevanceHow well the response addresses the inputscoreRelevance
completenessWhether all parts of the request are answeredscoreCompleteness
safetyAbsence of harmful, biased, or inappropriate contentscoreSafety
cost-efficiencyQuality per dollar spent (no LLM call required)scoreCostEfficiency

The cost-efficiency dimension does not call an LLM. It computes quality per dollar using the formula:

score = overallQuality / max(costUsd, 0.0001) / 1000

A response with quality 1.0 at cost $0.001 achieves a score of 1.0. Higher cost or lower quality reduces the score. The result is clamped to [0.0, 1.0].

Any string not matching the five built-in names is evaluated using a generic LLM-as-judge prompt:

const suite = {
// ...
dimensions: ["accuracy", "tone", "conciseness"], // "tone" and "conciseness" use generic judge
};

The generic judge asks the LLM to score the custom dimension on a 0.0–1.0 scale and returns the parsed value.

Use runCase to score a single case with an actual agent output you provide:

const result = yield* evalService.runCase(
evalCase, // EvalCase
"claude-sonnet-4-6", // agentConfig label
["accuracy", "relevance"], // dimensions to score
"Paris is the capital of France.", // actualOutput from your agent
{
latencyMs: 1200,
costUsd: 0.00043,
tokensUsed: 512,
stepsExecuted: 3,
},
);
console.log(result.overallScore); // 0.0–1.0
console.log(result.passed); // overallScore >= passThreshold
result.scores.forEach(({ dimension, score }) =>
console.log(` ${dimension}: ${score.toFixed(3)}`)
);
type EvalCase = {
id: string; // Unique identifier for this case
name: string; // Human-readable name
input: string; // The prompt sent to the agent
expectedOutput?: string; // Reference answer (optional — accuracy uses it if present)
expectedBehavior?: {
shouldUseTool?: string; // Name of a tool the agent should call
shouldAskUser?: boolean; // Whether the agent should request clarification
maxSteps?: number; // Maximum reasoning steps allowed
maxCost?: number; // Maximum cost in USD
};
tags?: string[]; // Arbitrary labels for filtering
};

expectedOutput is optional. When provided, the accuracy scorer compares the agent’s output against it. When omitted, the scorer evaluates factual correctness in isolation.

type EvalSuite = {
id: string;
name: string;
description: string;
cases: EvalCase[];
dimensions: string[]; // Dimensions to score — built-in or custom
config?: {
parallelism?: number; // Concurrent scoring requests
timeoutMs?: number; // Per-case timeout in milliseconds
retries?: number; // Retry count on transient failures
};
};

runSuite returns an EvalRun:

type EvalRun = {
id: string; // UUID generated per run
suiteId: string;
timestamp: Date;
agentConfig: string; // Label passed to runSuite/runCase
results: EvalResult[];
summary: EvalRunSummary;
};
type EvalRunSummary = {
totalCases: number;
passed: number; // overallScore >= passThreshold
failed: number;
avgScore: number; // Mean overallScore across all cases
avgLatencyMs: number;
totalCostUsd: number;
dimensionAverages: Record<string, number>; // Per-dimension mean scores
};
type EvalResult = {
caseId: string;
timestamp: Date;
agentConfig: string;
scores: DimensionScore[]; // One entry per dimension
overallScore: number; // Mean of all dimension scores
actualOutput: string;
latencyMs: number;
costUsd: number;
tokensUsed: number;
stepsExecuted: number;
passed: boolean;
error?: string;
};
type DimensionScore = {
dimension: string;
score: number; // 0.0–1.0
details?: string; // Optional explanation from the judge
};

By default, EvalServiceLive stores history in memory. Use makeEvalServicePersistentLive (backed by bun:sqlite) for durable history across runs:

import { makeEvalServicePersistentLive } from "@reactive-agents/eval";
import { Effect } from "effect";
const persistentLayer = makeEvalServicePersistentLive("./eval-history.db");
const program = Effect.gen(function* () {
const evalService = yield* EvalService;
// This run is written to eval-history.db
const run = yield* evalService.runSuite(suite, "agent-v1.2");
// Load the 10 most recent runs for this suite
const history = yield* evalService.getHistory("qa-suite-v1", { limit: 10 });
console.log(`${history.length} past runs loaded`);
});
await Effect.runPromise(
program.pipe(Effect.provide(persistentLayer))
);

The underlying store exposes four operations:

interface EvalStore {
saveRun(run: EvalRun): Effect.Effect<void>;
loadHistory(suiteId: string, options?: { limit?: number }): Effect.Effect<readonly EvalRun[]>;
loadRun(runId: string): Effect.Effect<EvalRun | null>;
compareRuns(runId1: string, runId2: string): Effect.Effect<{
improved: string[];
regressed: string[];
unchanged: string[];
} | null>;
}

You can also create a store directly and wire it to a custom eval layer:

import { createEvalStore, makeEvalServiceLive } from "@reactive-agents/eval";
const store = createEvalStore("./my-evals.db");
const layer = makeEvalServiceLive(store);

Compare two runs to detect quality regressions between agent versions:

const program = Effect.gen(function* () {
const evalService = yield* EvalService;
const history = yield* evalService.getHistory("qa-suite-v1", { limit: 2 });
const [baseline, current] = history;
// Detailed comparison per dimension (delta threshold: 0.02)
const diff = yield* evalService.compare(baseline, current);
// { improved: ["relevance"], regressed: ["accuracy"], unchanged: ["safety", "completeness"] }
// Binary pass/fail regression check (default threshold: 0.05)
const regression = yield* evalService.checkRegression(current, baseline);
if (regression.hasRegression) {
console.error("Regression detected:");
regression.details.forEach((d) => console.error(` ${d}`));
// accuracy: 0.712 < baseline 0.798 (delta -0.086)
}
});

compare classifies each dimension as improved, regressed, or unchanged using a 0.02 delta threshold. checkRegression applies the configurable regressionThreshold (default: 0.05) and returns structured details for any dimension that falls below baseline.

EvalConfig controls evaluation behaviour. All fields are optional and fall back to DEFAULT_EVAL_CONFIG:

type EvalConfig = {
passThreshold?: number; // Min overallScore to pass a case (default: 0.7)
regressionThreshold?: number; // Min drop to count as regression (default: 0.05)
defaultDimensions?: string[]; // Fallback dimensions (default: ["accuracy","relevance","completeness","safety"])
parallelism?: number; // Concurrent LLM scoring calls (default: 3)
timeoutMs?: number; // Per-case timeout in ms (default: 30000)
retries?: number; // Retry count on failure (default: 1)
};

Pass config overrides as the third argument to runSuite:

yield* evalService.runSuite(suite, "agent-v2", {
passThreshold: 0.8,
parallelism: 5,
timeoutMs: 60_000,
});

The typical pattern is to run your agent, capture the output and metrics, then score it with runCase:

import { ReactiveAgents } from "@reactive-agents/runtime";
import { EvalService, makeEvalServicePersistentLive } from "@reactive-agents/eval";
import { Effect } from "effect";
const evalCase = {
id: "case-001",
name: "Capital lookup",
input: "What is the capital of France?",
expectedOutput: "Paris",
};
const program = Effect.gen(function* () {
const evalService = yield* EvalService;
// Run your agent
const start = Date.now();
const agent = await ReactiveAgents.create()
.withProvider("anthropic")
.build();
const agentResult = await agent.run(evalCase.input);
// Score the output
const evalResult = yield* evalService.runCase(
evalCase,
"claude-sonnet-4-6",
["accuracy", "relevance", "completeness", "safety", "cost-efficiency"],
agentResult.output,
{
latencyMs: Date.now() - start,
costUsd: agentResult.metrics?.costUsd ?? 0,
tokensUsed: agentResult.metrics?.tokensUsed ?? 0,
stepsExecuted: agentResult.metrics?.stepsCount ?? 0,
},
);
console.log(`Overall: ${evalResult.overallScore.toFixed(3)}${evalResult.passed ? "PASS" : "FAIL"}`);
evalResult.scores.forEach(({ dimension, score }) =>
console.log(` ${dimension}: ${score.toFixed(3)}`)
);
});
await Effect.runPromise(
program.pipe(Effect.provide(makeEvalServicePersistentLive()))
);

createEvalLayer provides both EvalService and DatasetService. It requires LLMService from @reactive-agents/llm-provider to be in scope:

import { createEvalLayer } from "@reactive-agents/eval";
// In-memory (no persistence)
const layer = createEvalLayer();
// Persistent SQLite (recommended for CI)
const persistentLayer = makeEvalServicePersistentLive("./eval-history.db");