
Benchmarks

The @reactive-agents/benchmarks package evaluates end-to-end agent performance across 25 tasks spanning 5 complexity tiers. Tasks are aligned with leading agentic benchmark standards used by the research community, and run against a real LLM to measure actual correctness, latency, token usage, and cost — not just framework overhead.

Last generated: April 11, 2026 at 01:02 AM · Models: ollama/cogito:14b, anthropic/claude-sonnet-4-20250514, anthropic/claude-haiku-4-5, openai/gpt-4o-mini, ollama/cogito, ollama/qwen3.5, openai/gpt-4o-mini, gemini/gemini-2.5-flash, ollama/gpt-oss, openai/gpt-4o, openai/gpt-4o-mini, ollama/gemma4:e4b

Comparison Matrix

Tier ollama/cogito:14b anthropic/claude-sonnet-4-20250514 anthropic/claude-haiku-4-5 openai/gpt-4o-mini ollama/cogito ollama/qwen3.5 openai/gpt-4o-mini gemini/gemini-2.5-flash ollama/gpt-oss openai/gpt-4o openai/gpt-4o-mini ollama/gemma4:e4b
Trivial 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%)
Simple 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%)
Moderate 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 3/5 (60%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%)
Complex 5/6 (83%) 6/6 (100%) 5/6 (83%) 5/6 (83%) 5/6 (83%) 3/6 (50%) 5/6 (83%) 6/6 (100%) 6/6 (100%) 4/6 (67%) 6/6 (100%) 6/6 (100%)
Expert 5/6 (83%) 5/6 (83%) 6/6 (100%) 6/6 (100%) 4/6 (67%) 6/6 (100%) 5/6 (83%) 3/6 (50%) 5/6 (83%) 5/6 (83%) 6/6 (100%) 6/6 (100%)
Total 23/25 (92%) 24/25 (96%) 24/25 (96%) 24/25 (96%) 22/25 (88%) 22/25 (88%) 21/25 (84%) 22/25 (88%) 24/25 (96%) 22/25 (88%) 25/25 (100%) 25/25 (100%)

Model Summaries

ollama/cogito:14b

23/25 Tasks Passed
92% Pass Rate
12.7s Avg Latency
5.3m Total Duration
79,093 Total Tokens
$0.0000 Total Cost (USD)

anthropic/claude-sonnet-4-20250514

24/25 Tasks Passed
96% Pass Rate
22.5s Avg Latency
9.4m Total Duration
91,891 Total Tokens
$0.5411 Total Cost (USD)

anthropic/claude-haiku-4-5

24/25 Tasks Passed
96% Pass Rate
21.1s Avg Latency
8.8m Total Duration
167,578 Total Tokens
$0.0432 Total Cost (USD)

openai/gpt-4o-mini

24/25 Tasks Passed
96% Pass Rate
27.3s Avg Latency
11.4m Total Duration
128,880 Total Tokens
$0.0285 Total Cost (USD)

ollama/cogito

22/25 Tasks Passed
88% Pass Rate
9.0s Avg Latency
3.8m Total Duration
94,143 Total Tokens
$0.0000 Total Cost (USD)

ollama/qwen3.5

22/25 Tasks Passed
88% Pass Rate
1.3m Avg Latency
32.9m Total Duration
149,776 Total Tokens
$0.0000 Total Cost (USD)

openai/gpt-4o-mini

21/25 Tasks Passed
84% Pass Rate
23.5s Avg Latency
9.8m Total Duration
113,627 Total Tokens
$0.0240 Total Cost (USD)

gemini/gemini-2.5-flash

22/25 Tasks Passed
88% Pass Rate
24.6s Avg Latency
10.2m Total Duration
56,707 Total Tokens
$0.0135 Total Cost (USD)

ollama/gpt-oss

24/25 Tasks Passed
96% Pass Rate
19.1s Avg Latency
7.9m Total Duration
81,060 Total Tokens
$0.0000 Total Cost (USD)

openai/gpt-4o

22/25 Tasks Passed
88% Pass Rate
11.0s Avg Latency
4.6m Total Duration
98,444 Total Tokens
$0.3523 Total Cost (USD)

openai/gpt-4o-mini

25/25 Tasks Passed
100% Pass Rate
20.8s Avg Latency
8.7m Total Duration
107,842 Total Tokens
$0.0237 Total Cost (USD)

ollama/gemma4:e4b

25/25 Tasks Passed
100% Pass Rate
46.7s Avg Latency
19.5m Total Duration
206,863 Total Tokens
$0.0000 Total Cost (USD)

Task Details by Model

ollama/cogito:14b

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 2.7s 97
t2-binary-pow single-shot 2 260.76ms 100
t3-asimov-laws single-shot 2 1.2s 164
t4-json-csv single-shot 2 2.1s 121
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 1.6s 213
s2-palindrome-bug single-shot 2 3.8s 299
s3-bigO single-shot 2 3.3s 235
s4-design-pattern single-shot 2 1.5s 166
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 2.8s 484
m2-word-problem react 1 4.0s 488
m3-sql-injection react 1 3.6s 489
m4-remove-duplicates react 1 4.2s 537
m5-tool-search react 6 7.9s 3,770
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 4.2s 498
c2-auth-vulnerabilities plan-execute 4 10.4s 2,016
c3-test-suite plan-execute 4 20.2s 3,126
c4-db-decomposition react 1 1.5s 366
c5-multi-tool plan-execute 10 23.0s 6,995
c6-multi-agent plan-execute 6 12.7s 2,081
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 43.1s 10,585
e2-incident-response tree-of-thought 25 38.2s 9,690
e3-logic-fallacy tree-of-thought 25 35.5s 9,133
e4-crdt-design tree-of-thought 25 46.9s 9,529
e5-file-execute tree-of-thought 32 42.8s 17,911
e6-guardrail-injection react 82.60ms

anthropic/claude-sonnet-4-20250514

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.0s 32 $0.0001
t2-binary-pow single-shot 2 919.73ms 31 $0.0002
t3-asimov-laws single-shot 2 1.6s 113 $0.0013
t4-json-csv single-shot 2 810.22ms 51 $0.0002
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 2.3s 155 $0.0017
s2-palindrome-bug single-shot 2 4.2s 333 $0.0042
s3-bigO single-shot 2 3.7s 229 $0.0024
s4-design-pattern single-shot 2 2.4s 131 $0.0011
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.2s 598 $0.0048
m2-word-problem react 1 4.1s 596 $0.0057
m3-sql-injection react 1 5.9s 729 $0.0071
m4-remove-duplicates react 1 4.6s 676 $0.0062
m5-tool-search react 6 10.4s 1,018 $0.0055
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 35.0s 2,974 $0.0411
c2-auth-vulnerabilities plan-execute 4 37.6s 5,749 $0.0387
c3-test-suite plan-execute 4 40.5s 6,724 $0.0511
c4-db-decomposition react 1 46.5s 3,327 $0.0460
c5-multi-tool plan-execute 8 21.4s 3,556 $0.0075
c6-multi-agent plan-execute 5 32.2s 2,437 $0.0098
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 18 59.4s 12,723 $0.0606
e2-incident-response tree-of-thought 15 53.5s 9,856 $0.0481
e3-logic-fallacy tree-of-thought 20 1.3m 14,963 $0.0739
e4-crdt-design tree-of-thought 15 50.9s 10,318 $0.0529
e5-file-execute tree-of-thought 24 59.2s 14,572 $0.0711
e6-guardrail-injection react 64.18ms

anthropic/claude-haiku-4-5

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 724.91ms 32 $0.0000
t2-binary-pow single-shot 2 627.29ms 31 $0.0000
t3-asimov-laws single-shot 2 876.09ms 125 $0.0001
t4-json-csv single-shot 2 574.75ms 51 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 1.1s 162 $0.0001
s2-palindrome-bug single-shot 2 2.8s 434 $0.0002
s3-bigO single-shot 2 1.9s 280 $0.0001
s4-design-pattern single-shot 2 1.2s 124 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 1.6s 549 $0.0002
m2-word-problem react 1 2.4s 586 $0.0003
m3-sql-injection react 1 6.4s 739 $0.0003
m4-remove-duplicates react 1 2.8s 642 $0.0003
m5-tool-search react 6 8.3s 7,674 $0.0013
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 58.3s 16,310 $0.0052
c2-auth-vulnerabilities plan-execute 4 18.6s 5,056 $0.0014
c3-test-suite plan-execute 4 21.4s 7,080 $0.0023
c4-db-decomposition plan-execute 4 1.2m 10,096 $0.0034
c5-multi-tool plan-execute 12 19.5s 7,443 $0.0008
c6-multi-agent plan-execute 5 12.4s 5,296 $0.0009
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 6 57.0s 25,098 $0.0064
e2-incident-response tree-of-thought 6 1.2m 22,928 $0.0062
e3-logic-fallacy tree-of-thought 6 49.9s 13,255 $0.0037
e4-crdt-design tree-of-thought 11 1.1m 22,850 $0.0059
e5-file-execute tree-of-thought 29 45.2s 20,288 $0.0041
e6-guardrail-injection react 1 3.6s 449 $0.0002

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 2.6s 78 $0.0000
t2-binary-pow single-shot 2 1.2s 79 $0.0000
t3-asimov-laws single-shot 2 6.0s 159 $0.0001
t4-json-csv single-shot 2 1.5s 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 5.4s 214 $0.0001
s2-palindrome-bug single-shot 2 6.4s 270 $0.0001
s3-bigO single-shot 2 5.9s 317 $0.0001
s4-design-pattern single-shot 2 2.7s 162 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 2 17.3s 1,556 $0.0005
m2-word-problem react 1 5.6s 584 $0.0002
m3-sql-injection react 2 12.3s 1,086 $0.0003
m4-remove-duplicates react 2 11.2s 1,245 $0.0004
m5-tool-search react 24 18.6s 10,986 $0.0019
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 25.2s 3,174 $0.0008
c2-auth-vulnerabilities plan-execute 4 28.4s 3,943 $0.0010
c3-test-suite plan-execute 5 38.5s 9,050 $0.0018
c4-db-decomposition plan-execute 4 27.0s 3,336 $0.0008
c5-multi-tool plan-execute 6 5.4s 1,563 $0.0001
c6-multi-agent plan-execute 10 34.6s 6,933 $0.0002
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 1.4m 19,852 $0.0047
e2-incident-response tree-of-thought 25 1.2m 17,228 $0.0039
e3-logic-fallacy tree-of-thought 25 1.4m 17,686 $0.0040
e4-crdt-design tree-of-thought 25 1.9m 12,716 $0.0039
e5-file-execute tree-of-thought 32 1.1m 16,105 $0.0034
e6-guardrail-injection react 2 1.6s 457 $0.0001

ollama/cogito

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 3.2s 82
t2-binary-pow single-shot 2 210.17ms 82
t3-asimov-laws single-shot 2 777.85ms 153
t4-json-csv single-shot 2 1.8s 105
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 830.51ms 182
s2-palindrome-bug single-shot 2 3.1s 339
s3-bigO single-shot 2 1.8s 180
s4-design-pattern single-shot 2 1.1s 161
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.2s 537
m2-word-problem react 1 3.1s 557
m3-sql-injection react 1 2.5s 529
m4-remove-duplicates react 1 2.7s 568
m5-tool-search react 6 4.9s 4,767
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 4.2s 707
c2-auth-vulnerabilities plan-execute 4 8.3s 2,564
c3-test-suite plan-execute 4 15.7s 4,018
c4-db-decomposition react 1 3.0s 601
c5-multi-tool plan-execute 16 36.8s 11,886
c6-multi-agent plan-execute 5 10.0s 1,598
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 4/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 23 25.9s 10,289
e2-incident-response tree-of-thought 23 24.1s 9,366
e3-logic-fallacy tree-of-thought 24 17.7s 8,401
e4-crdt-design tree-of-thought 25 25.1s 11,872
e5-file-execute tree-of-thought 41 25.6s 24,599
e6-guardrail-injection react 82.44ms

ollama/qwen3.5

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.4s 188
t2-binary-pow single-shot 2 2.5s 300
t3-asimov-laws single-shot 2 11.4s 487
t4-json-csv single-shot 2 3.5s 408
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 16.8s 534
s2-palindrome-bug single-shot 2 9.9s 665
s3-bigO single-shot 2 15.5s 452
s4-design-pattern single-shot 2 4.9s 246
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 2 19.0s 2,394
m2-word-problem react 1 9.5s 1,082
m3-sql-injection react 2 16.6s 1,817
m4-remove-duplicates react 1 13.5s 1,042
m5-tool-search react 10 28.0s 6,658
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 3/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 17 5.0m
c2-auth-vulnerabilities plan-execute 4 4.1m 10,918
c3-test-suite plan-execute 4 38.7s 5,208
c4-db-decomposition plan-execute 4 2.2m 12,274
c5-multi-tool plan-execute 3 2.6m
c6-multi-agent plan-execute 5 1.7m 11,750
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 7 3.0m 17,750
e2-incident-response tree-of-thought 6 2.6m 14,909
e3-logic-fallacy tree-of-thought 7 2.6m 15,623
e4-crdt-design tree-of-thought 7 1.7m 10,753
e5-file-execute tree-of-thought 29 4.1m 33,203
e6-guardrail-injection react 2 11.0s 1,115

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 786.45ms 78 $0.0000
t2-binary-pow single-shot 2 635.35ms 79 $0.0000
t3-asimov-laws single-shot 2 1.9s 157 $0.0001
t4-json-csv single-shot 2 459.69ms 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 3.4s 221 $0.0001
s2-palindrome-bug single-shot 2 5.6s 367 $0.0002
s3-bigO single-shot 2 3.7s 359 $0.0002
s4-design-pattern single-shot 2 1.9s 158 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 3/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 4.7s 461 $0.0002
m2-word-problem react 1 4.7s 390 $0.0001
m3-sql-injection react 1 5.1s 413 $0.0001
m4-remove-duplicates react 1 4.5s 428 $0.0001
m5-tool-search react 6 1.2s
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 25.0s 3,106 $0.0008
c2-auth-vulnerabilities plan-execute 4 14.6s 2,270 $0.0004
c3-test-suite plan-execute 4 18.0s 2,991 $0.0007
c4-db-decomposition plan-execute 4 28.8s 3,123 $0.0007
c5-multi-tool plan-execute 10 33.0s 7,302 $0.0004
c6-multi-agent plan-execute 10 38.1s 7,019 $0.0002
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 24 1.4m 19,747 $0.0045
e2-incident-response tree-of-thought 24 1.4m 18,722 $0.0042
e3-logic-fallacy tree-of-thought 24 1.2m 17,189 $0.0039
e4-crdt-design tree-of-thought 24 1.5m 16,049 $0.0043
e5-file-execute tree-of-thought 29 58.6s 12,716 $0.0028
e6-guardrail-injection react 1 3.1s 181 $0.0000

gemini/gemini-2.5-flash

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.2s 90 $0.0000
t2-binary-pow single-shot 2 765.64ms 91 $0.0000
t3-asimov-laws single-shot 2 1.2s 165 $0.0001
t4-json-csv single-shot 2 798.14ms 111 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 10.2s 378 $0.0002
s2-palindrome-bug single-shot 2 8.1s 712 $0.0004
s3-bigO single-shot 2 4.9s 304 $0.0001
s4-design-pattern single-shot 2 1.5s 185 $0.0001
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.8s 575 $0.0002
m2-word-problem react 1 3.6s 606 $0.0002
m3-sql-injection react 1 4.0s 594 $0.0002
m4-remove-duplicates react 1 2.4s 448 $0.0001
m5-tool-search react 6 10.9s 6,924 $0.0011
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 13.1s 1,816 $0.0010
c2-auth-vulnerabilities plan-execute 4 21.5s 3,438 $0.0008
c3-test-suite plan-execute 4 26.5s 5,607 $0.0016
c4-db-decomposition react 1 14.7s 424 $0.0001
c5-multi-tool plan-execute 8 30.2s 3,410 $0.0002
c6-multi-agent plan-execute 6 31.4s 2,319 $0.0001
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 3/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 8 54.9s 5,043 $0.0014
e2-incident-response tree-of-thought 5 46.1s 2,808 $0.0006
e3-logic-fallacy tree-of-thought 21 2.8m 10,708 $0.0027
e4-crdt-design tree-of-thought 5 42.1s 2,229 $0.0005
e5-file-execute tree-of-thought 14 1.9m 7,722 $0.0018
e6-guardrail-injection react 77.81ms

ollama/gpt-oss

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 14.6s 205
t2-binary-pow single-shot 2 600.57ms 183
t3-asimov-laws single-shot 2 1.0s 249
t4-json-csv single-shot 2 1.0s 239
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 11.4s 363
s2-palindrome-bug single-shot 2 3.5s 556
s3-bigO single-shot 2 10.5s 387
s4-design-pattern single-shot 2 1.2s 295
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 9.1s 495
m2-word-problem react 1 5.4s 553
m3-sql-injection react 1 3.2s 451
m4-remove-duplicates react 1 4.7s 550
m5-tool-search react 6 7.4s 3,787
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 7.7s 723
c2-auth-vulnerabilities plan-execute 4 13.1s 2,435
c3-test-suite plan-execute 4 24.3s 3,627
c4-db-decomposition react 1 5.0s 575
c5-multi-tool plan-execute 12 38.4s 8,415
c6-multi-agent plan-execute 6 17.9s 2,614
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 46.1s 10,717
e2-incident-response tree-of-thought 25 47.5s 10,007
e3-logic-fallacy tree-of-thought 25 43.7s 9,222
e4-crdt-design tree-of-thought 25 2.0m 9,684
e5-file-execute tree-of-thought 32 39.0s 14,728
e6-guardrail-injection react 83.76ms

openai/gpt-4o

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.6s 77 $0.0002
t2-binary-pow single-shot 2 379.60ms 78 $0.0002
t3-asimov-laws single-shot 2 2.3s 155 $0.0009
t4-json-csv single-shot 2 1.4s 100 $0.0003
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 917.65ms 184 $0.0011
s2-palindrome-bug single-shot 2 2.0s 277 $0.0020
s3-bigO single-shot 2 3.1s 376 $0.0028
s4-design-pattern single-shot 2 1.3s 190 $0.0010
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 1.7s 522 $0.0029
m2-word-problem react 1 2.9s 622 $0.0044
m3-sql-injection react 1 2.6s 455 $0.0025
m4-remove-duplicates react 1 2.0s 535 $0.0032
m5-tool-search react 6 4.4s 5,037 $0.0133
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 4/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 5.4s 791 $0.0059
c2-auth-vulnerabilities plan-execute 4 8.9s 2,910 $0.0106
c3-test-suite plan-execute 4 11.0s 3,588 $0.0152
c4-db-decomposition react 1 4.7s 789 $0.0057
c5-multi-tool plan-execute 10 18.8s 7,485 $0.0039
c6-multi-agent plan-execute 5 9.3s 1,606 $0.0024
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 21 36.1s 13,977 $0.0525
e2-incident-response tree-of-thought 25 37.4s 12,086 $0.0509
e3-logic-fallacy tree-of-thought 24 32.7s 12,651 $0.0450
e4-crdt-design tree-of-thought 24 45.5s 18,519 $0.0723
e5-file-execute tree-of-thought 33 39.1s 15,434 $0.0532
e6-guardrail-injection react 63.58ms

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.0s 78 $0.0000
t2-binary-pow single-shot 2 892.35ms 79 $0.0000
t3-asimov-laws single-shot 2 2.3s 156 $0.0001
t4-json-csv single-shot 2 1.1s 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 2.8s 218 $0.0001
s2-palindrome-bug single-shot 2 3.1s 266 $0.0001
s3-bigO single-shot 2 4.1s 351 $0.0002
s4-design-pattern single-shot 2 1.8s 169 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 4.1s 458 $0.0002
m2-word-problem react 1 7.1s 518 $0.0002
m3-sql-injection react 1 4.5s 434 $0.0002
m4-remove-duplicates react 1 3.1s 429 $0.0001
m5-tool-search react 6 7.2s 5,215 $0.0008
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 30.9s 3,332 $0.0009
c2-auth-vulnerabilities plan-execute 4 16.9s 2,818 $0.0006
c3-test-suite plan-execute 5 36.8s 6,288 $0.0017
c4-db-decomposition plan-execute 4 29.8s 3,863 $0.0010
c5-multi-tool plan-execute 10 22.8s 6,375 $0.0002
c6-multi-agent plan-execute 13 41.7s 9,093 $0.0004
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 21 52.4s 12,967 $0.0032
e2-incident-response tree-of-thought 24 59.0s 10,593 $0.0029
e3-logic-fallacy tree-of-thought 24 54.9s 15,858 $0.0037
e4-crdt-design tree-of-thought 24 1.3m 11,373 $0.0035
e5-file-execute tree-of-thought 32 52.1s 16,625 $0.0035
e6-guardrail-injection react 1 612.18ms 185 $0.0000

ollama/gemma4:e4b

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 4.9s 104
t2-binary-pow single-shot 2 357.93ms 105
t3-asimov-laws single-shot 2 3.4s 488
t4-json-csv single-shot 2 4.5s 125
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 5.1s 720
s2-palindrome-bug single-shot 2 16.3s 1,416
s3-bigO single-shot 2 11.6s 858
s4-design-pattern single-shot 2 11.1s 774
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 11.0s 1,303
m2-word-problem react 1 9.4s 1,360
m3-sql-injection react 1 7.3s 1,163
m4-remove-duplicates react 1 8.3s 1,281
m5-tool-search react 5 14.9s 6,161
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 38.6s 2,936
c2-auth-vulnerabilities plan-execute 4 12.7s 2,871
c3-test-suite plan-execute 4 26.4s 6,486
c4-db-decomposition react 1 39.0s 3,165
c5-multi-tool plan-execute 14 25.3s 10,024
c6-multi-agent plan-execute 5 22.2s 2,157
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 18 2.1m 22,993
e2-incident-response tree-of-thought 24 3.7m 35,244
e3-logic-fallacy tree-of-thought 24 3.2m 35,953
e4-crdt-design tree-of-thought 24 3.7m 36,542
e5-file-execute tree-of-thought 32 2.2m 32,368
e6-guardrail-injection react 1 653.15ms 266

Framework Overhead

Measured with the test provider to isolate pure Effect-TS layer composition cost -- independent of LLM latency.

Measurement Avg Duration Samples
Runtime Creation 0.02ms 10
Runtime Creation Full 0.03ms 10
Complexity Classification <0.01ms 100
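
Sub-millisecond averages like those above can be sampled with a simple wall-clock loop. The sketch below is illustrative only: `createRuntime` is a placeholder workload, not the package's actual factory API.

```typescript
// Illustrative overhead measurement: average the wall-clock time of a
// factory function over N samples, independent of any LLM call.
function measureAvgMs(fn: () => unknown, samples: number): number {
  const start = performance.now();
  for (let i = 0; i < samples; i++) fn();
  return (performance.now() - start) / samples;
}

// Placeholder standing in for runtime/layer creation.
const createRuntime = () => ({ layers: [], createdAt: Date.now() });

const avg = measureAvgMs(createRuntime, 10);
console.log(`Runtime Creation: ${avg.toFixed(2)}ms over 10 samples`);
```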

Each task tier maps to a recognized benchmark standard:

Tier Strategy Aligned With
Trivial Single-shot MMLU-CS · MATH baseline · AgentEval
Simple Single-shot HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE
Moderate ReAct (reactive) HumanEval Medium · BIG-Bench Hard · SWE-bench lite
Complex Plan-Execute-Reflect AgentBench · SWE-bench Security · TestEval
Expert Tree-of-Thought BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS
  • HumanEval (OpenAI) — 164 handcrafted code generation tasks evaluated by functional correctness. Our tasks include function implementation, algorithm design, and test generation.
  • SWE-bench (Princeton) — Resolving real GitHub issues. We use SWE-bench patterns for bug identification, security vulnerability analysis, and multi-file code review.
  • BIG-Bench Hard (Google) — 23 challenging tasks where chain-of-thought is required. We include: algorithmic optimization, logic/fallacy analysis, multi-step word problems, and Big-O complexity reasoning.
  • GAIA (Meta) — Multi-step tasks requiring tool use and reasoning. Our Level 3 equivalent task tests production incident response requiring multi-domain knowledge synthesis.
  • AgentBench (THUDM) — 8-environment agent evaluation. We use AgentBench patterns for system design, database decomposition, and migration planning tasks.
  • MMLU-Pro — Professional knowledge across 14 domains. Tasks cover CS theory (CRDTs, design patterns), software engineering, and architecture decision-making.

A task passes if the LLM’s output contains the expected pattern (case-insensitive regex). Patterns are crafted to require substantive, correct answers — they cannot be satisfied by generic responses:

SQL injection fix expected: "parameteriz|prepared|placeholder|$1|?"
CRDT design expected: "CRDT|vector.?clock|logical.?time|merge|commutative|converge"
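
A minimal sketch of this check, assuming a hypothetical `taskPasses` helper (the real evaluator lives in @reactive-agents/benchmarks). Note that the literal `$1` and `?` alternatives must be escaped for the pattern to compile as a regex:

```typescript
// Hypothetical sketch of the pass check: case-insensitive regex match
// against the model's output. `$1` and `?` are escaped so the pattern
// is valid regex syntax.
function taskPasses(output: string, expectedPattern: string): boolean {
  return new RegExp(expectedPattern, "i").test(output);
}

// SQL injection fix pattern from above, with metacharacters escaped.
const sqlInjectionPattern = "parameteriz|prepared|placeholder|\\$1|\\?";

taskPasses("Use Parameterized queries via prepared statements.", sqlInjectionPattern); // true
taskPasses("Here is some general advice about databases.", sqlInjectionPattern);       // false
```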
Running the Benchmarks
# Run with Anthropic (recommended for real-world results)
cd packages/benchmarks
bun run src/run.ts --provider anthropic --output report.json
# Run with a specific model
bun run src/run.ts --provider anthropic --model claude-opus-4-5 --output report.json
# Run only trivial + simple tiers (quick sanity check)
bun run src/run.ts --provider anthropic --tier trivial,simple
# OpenAI
bun run src/run.ts --provider openai --model gpt-4o --output report.json
# Gemini
bun run src/run.ts --provider gemini --model gemini-2.0-flash --output report.json
Flag Description Default
--provider LLM provider (anthropic, openai, gemini, ollama, litellm) test
--model Model name (uses provider default if omitted) Provider default
--tier Comma-separated tier filter All tiers
--output Path to save JSON report (none)
Provider Default Model Rationale
anthropic claude-haiku-4-5 Fast, cost-efficient, strong reasoning
openai gpt-4o-mini Cost-efficient with strong benchmark performance
gemini gemini-2.0-flash Fast inference, competitive pricing
ollama llama3.2 Local inference, no API cost

To regenerate the benchmark data shown on this page using the Anthropic provider:

cd packages/benchmarks
bun run src/run.ts --provider anthropic --output ../../apps/docs/src/data/benchmark-report.json

The page renders dynamically from the JSON report at build time — no manual table updates needed.