The @reactive-agents/benchmarks package evaluates end-to-end agent performance across 20 tasks spanning 5 complexity tiers. Tasks are aligned with leading agentic benchmark standards used by the research community, and run against a real LLM to measure actual correctness, latency, token usage, and cost — not just framework overhead.
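For orientation, the metrics above can be pictured as a per-task record plus a run-level summary. This is an illustrative sketch only — the field and function names here are invented, not the package's actual API:

```typescript
// Hypothetical shape of one benchmark result (illustrative, not the real API).
interface TaskResult {
  taskId: string;
  tier: 1 | 2 | 3 | 4 | 5; // complexity tier
  passed: boolean;         // did the output satisfy the task's expected pattern?
  latencyMs: number;       // wall-clock time for the LLM call
  inputTokens: number;
  outputTokens: number;
  costUsd: number;         // derived from token counts and model pricing
}

// Aggregate pass rate, mean latency, and total cost across a run.
function summarize(results: TaskResult[]) {
  const passRate = results.filter(r => r.passed).length / results.length;
  const meanLatencyMs =
    results.reduce((sum, r) => sum + r.latencyMs, 0) / results.length;
  const totalCostUsd = results.reduce((sum, r) => sum + r.costUsd, 0);
  return { passRate, meanLatencyMs, totalCostUsd };
}
```

Measuring all four dimensions per task is what lets correctness be weighed against latency and cost, rather than reporting pass rate alone.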
HumanEval (OpenAI) — 164 handcrafted code generation tasks evaluated by functional correctness. Our tasks include function implementation, algorithm design, and test generation.
SWE-bench (Princeton) — Resolving real GitHub issues. We use SWE-bench patterns for bug identification, security vulnerability analysis, and multi-file code review.
BIG-Bench Hard (Google) — 23 challenging tasks where chain-of-thought prompting is required. We include: algorithmic optimization, logic/fallacy analysis, multi-step word problems, and Big-O complexity reasoning.
GAIA (Meta) — Multi-step tasks requiring tool use and reasoning. Our Level 3-equivalent task tests production incident response, which requires synthesizing knowledge across multiple domains.
AgentBench (THUDM) — 8-environment agent evaluation. We use AgentBench patterns for system design, database decomposition, and migration planning tasks.
MMLU-Pro — Professional knowledge across 14 domains. Tasks cover CS theory (CRDTs, design patterns), software engineering, and architecture decision-making.
A task passes if the LLM’s output contains the expected pattern (case-insensitive regex). Patterns are crafted to require substantive, correct answers — they cannot be satisfied by generic responses:
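Mechanically, the check is a single case-insensitive regex test against the model's output. A minimal sketch, where the pattern shown is invented for illustration rather than taken from an actual task:

```typescript
// Hypothetical expected pattern for an illustrative complexity-analysis task.
// Real tasks define their own patterns.
const expected: RegExp = /time complexity.*O\(n log n\)/i;

// A task passes if the LLM output matches the task's expected pattern.
function scoreOutput(llmOutput: string, pattern: RegExp): boolean {
  return pattern.test(llmOutput);
}

scoreOutput("Merge sort's time complexity is O(n log n).", expected); // true
scoreOutput("Great question! Let me help you with that.", expected);  // false
```

A generic reply fails because the pattern demands the specific, correct technical claim, not merely a fluent response.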