
Benchmarks

The @reactive-agents/benchmarks package evaluates end-to-end agent performance across 20 tasks spanning 5 complexity tiers. Tasks are aligned with leading agentic benchmark standards used by the research community, and run against a real LLM to measure actual correctness, latency, token usage, and cost — not just framework overhead.

Last generated: March 8, 2026 at 01:27 PM · Model: test/default
Warning: Framework-overhead-only run (test provider -- no real LLM calls)

Comparison Matrix

Tier test/default
Trivial 0/4 (0%)
Simple 1/4 (25%)
Moderate 0/4 (0%)
Complex 0/4 (0%)
Expert 0/4 (0%)
Total 1/20 (5%)

Summary

1/20 Tasks Passed
5% Pass Rate
85.38ms Avg Latency
1.7s Total Duration
6,559 Total Tokens
$0.0000 Total Cost (USD)
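These summary figures are straightforward aggregates of the per-task results (20 tasks averaging 85.38 ms comes to roughly the 1.7 s total shown). A minimal sketch of that aggregation; the `TaskResult` shape and `summarize` helper are illustrative assumptions, not the package's actual report schema:

```typescript
// Hypothetical per-task result shape; the real report schema may differ.
interface TaskResult {
  passed: boolean;
  latencyMs: number;
  tokens: number;
}

// Aggregate per-task results into the summary figures shown above.
function summarize(results: TaskResult[]) {
  const passed = results.filter((r) => r.passed).length;
  const totalMs = results.reduce((sum, r) => sum + r.latencyMs, 0);
  return {
    passed,
    passRatePct: (100 * passed) / results.length,
    avgLatencyMs: totalMs / results.length,
    totalTokens: results.reduce((sum, r) => sum + r.tokens, 0),
  };
}
```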

Individual Tasks

Trivial -- MMLU-CS · MATH baseline · AgentEval -- baseline capability checks (0/4 passed)
Task Strategy Status Latency Tokens Cost
t1-js-typeof single-shot Fail 128.32ms -- --
t2-binary-pow single-shot Fail 89.48ms -- --
t3-asimov-laws single-shot Fail 93.60ms -- --
t4-json-csv single-shot Fail 82.10ms -- --
Simple -- HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps (1/4 passed)
Task Strategy Status Latency Tokens Cost
s1-fibonacci single-shot Fail 83.95ms -- --
s2-palindrome-bug single-shot Fail 84.84ms -- --
s3-bigO single-shot Pass 69.47ms -- --
s4-design-pattern single-shot Fail 88.78ms -- --
Moderate -- HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct (0/4 passed)
Task Strategy Status Latency Tokens Cost
m1-merge-intervals react Fail 92.77ms 808 --
m2-word-problem react Fail 85.80ms 722 --
m3-sql-injection react Fail 94.17ms 785 --
m4-remove-duplicates react Fail 90.87ms 803 --
Complex -- AgentBench · SWE-bench Security · TestEval -- plan-execute analysis (0/4 passed)
Task Strategy Status Latency Tokens Cost
c1-distributed-queue plan-execute Error 64.89ms -- --
c2-auth-vulnerabilities plan-execute Error 58.66ms -- --
c3-test-suite plan-execute Error 57.21ms -- --
c4-db-decomposition plan-execute Error 60.13ms -- --
Expert -- BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought (0/4 passed)
Task Strategy Status Latency Tokens Cost
e1-lis-optimization tree-of-thought Fail 89.22ms 874 --
e2-incident-response tree-of-thought Fail 87.43ms 869 --
e3-logic-fallacy tree-of-thought Fail 76.76ms 828 --
e4-crdt-design tree-of-thought Fail 129.25ms 870 --
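Each task in the tables above pairs an ID and a strategy (single-shot, ReAct, plan-execute, or tree-of-thought) with an expected answer pattern. A plausible task definition shape, sketched for illustration only; the field names are assumptions, not the package's actual schema:

```typescript
// Illustrative task definition; field names are assumptions, not the real schema.
type Strategy = "single-shot" | "react" | "plan-execute" | "tree-of-thought";

interface BenchmarkTask {
  id: string; // e.g. "m3-sql-injection"
  tier: "trivial" | "simple" | "moderate" | "complex" | "expert";
  strategy: Strategy;
  prompt: string;
  expected: RegExp; // case-insensitive pass pattern
}

const example: BenchmarkTask = {
  id: "m3-sql-injection",
  tier: "moderate",
  strategy: "react",
  prompt: "Fix the SQL injection vulnerability in this query builder.",
  expected: /parameteriz|prepared|placeholder/i,
};
```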

Framework Overhead

Measured with the test provider to isolate pure Effect-TS layer composition cost -- independent of LLM latency.

Measurement Avg Duration Samples
Runtime Creation 0.01ms 10
Runtime Creation Full 0.04ms 10
Complexity Classification <0.01ms 100
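The overhead figures above come from timing repeated runs and averaging. A generic harness along these lines reproduces the methodology; the workload here is a stand-in (the real benchmark times Effect-TS runtime creation and layer composition, not `JSON.parse`):

```typescript
// Time a synchronous function over `samples` runs and report the mean in ms.
function measureAvgMs(fn: () => void, samples: number): number {
  let total = 0;
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    fn();
    total += performance.now() - start;
  }
  return total / samples;
}

// Stand-in workload; the real benchmark measures runtime/layer construction.
const avg = measureAvgMs(() => {
  JSON.parse('{"a":1}');
}, 100);
console.log(`avg: ${avg.toFixed(3)}ms over 100 samples`);
```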

Each task tier maps to a recognized benchmark standard:

Tier | Strategy | Aligned With
Trivial | Single-shot | MMLU-CS · MATH baseline · AgentEval
Simple | Single-shot | HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE
Moderate | ReAct (reactive) | HumanEval Medium · BIG-Bench Hard · SWE-bench lite
Complex | Plan-Execute-Reflect | AgentBench · SWE-bench Security · TestEval
Expert | Tree-of-Thought | BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS
  • HumanEval (OpenAI) — 164 handcrafted code generation tasks evaluated by functional correctness. Our tasks include function implementation, algorithm design, and test generation.
  • SWE-bench (Princeton) — Resolving real GitHub issues. We use SWE-bench patterns for bug identification, security vulnerability analysis, and multi-file code review.
  • BIG-Bench Hard (Google) — 23 challenging tasks where chain-of-thought is required. We include: algorithmic optimization, logic/fallacy analysis, multi-step word problems, and Big-O complexity reasoning.
  • GAIA (Meta) — Multi-step tasks requiring tool use and reasoning. Our Level 3 equivalent task tests production incident response requiring multi-domain knowledge synthesis.
  • AgentBench (THUDM) — 8-environment agent evaluation. We use AgentBench patterns for system design, database decomposition, and migration planning tasks.
  • MMLU-Pro — Professional knowledge across 14 domains. Tasks cover CS theory (CRDTs, design patterns), software engineering, and architecture decision-making.

A task passes if the LLM’s output contains the expected pattern (case-insensitive regex). Patterns are crafted to require substantive, correct answers — they cannot be satisfied by generic responses:

SQL injection fix expected: "parameteriz|prepared|placeholder|$1|?"
CRDT design expected: "CRDT|vector.?clock|logical.?time|merge|commutative|converge"
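The pass check reduces to a single case-insensitive regex test against the model's output. A minimal sketch (the `taskPasses` name is illustrative, not the package's API; the pattern below is a subset of the SQL-injection pattern shown above):

```typescript
// A task passes when the output matches its expected pattern, ignoring case.
function taskPasses(output: string, expectedPattern: string): boolean {
  return new RegExp(expectedPattern, "i").test(output);
}

// A substantive fix matches one of the alternatives:
taskPasses("Use a prepared statement with placeholders", "parameteriz|prepared|placeholder"); // → true
// A generic answer does not:
taskPasses("I would improve the security of the query", "parameteriz|prepared|placeholder"); // → false
```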
# Run with Anthropic (recommended for real-world results)
cd packages/benchmarks
bun run src/run.ts --provider anthropic --output report.json
# Run with a specific model
bun run src/run.ts --provider anthropic --model claude-opus-4-5 --output report.json
# Run only trivial + simple tiers (quick sanity check)
bun run src/run.ts --provider anthropic --tier trivial,simple
# OpenAI
bun run src/run.ts --provider openai --model gpt-4o --output report.json
# Gemini
bun run src/run.ts --provider gemini --model gemini-2.0-flash --output report.json
Flag | Description | Default
--provider | LLM provider (anthropic, openai, gemini, ollama, litellm) | test
--model | Model name (uses provider default if omitted) | Provider default
--tier | Comma-separated tier filter | All tiers
--output | Path to save JSON report | (none)
Provider | Default Model | Rationale
anthropic | claude-haiku-4-5 | Fast, cost-efficient, strong reasoning
openai | gpt-4o-mini | Cost-efficient with strong benchmark performance
gemini | gemini-2.0-flash | Fast inference, competitive pricing
ollama | llama3.2 | Local inference, no API cost

To regenerate the benchmark data shown on this page using the Anthropic provider:

cd packages/benchmarks
bun run src/run.ts --provider anthropic --output ../../apps/docs/src/data/benchmark-report.json

The page renders dynamically from the JSON report at build time — no manual table updates needed.
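Build-time rendering of this kind typically imports the JSON report and derives each table from it. A simplified sketch with an inline report object; the actual `benchmark-report.json` schema may differ:

```typescript
// Simplified report shape; the actual benchmark-report.json schema may differ.
interface Report {
  results: { tier: string; passed: boolean }[];
}

// Group results into per-tier "passed/total" cells for the comparison matrix.
function tierRows(report: Report): Record<string, string> {
  const counts: Record<string, { passed: number; total: number }> = {};
  for (const r of report.results) {
    const row = (counts[r.tier] ??= { passed: 0, total: 0 });
    row.total++;
    if (r.passed) row.passed++;
  }
  return Object.fromEntries(
    Object.entries(counts).map(([tier, c]) => [tier, `${c.passed}/${c.total}`]),
  );
}

const sample: Report = {
  results: [
    { tier: "simple", passed: false },
    { tier: "simple", passed: true },
  ],
};
// tierRows(sample).simple === "1/2"
```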