# Benchmarks
The @reactive-agents/benchmarks package evaluates end-to-end agent performance across 25 tasks spanning 5 complexity tiers. Tasks are aligned with leading agentic benchmark standards used by the research community and run against a real LLM to measure actual correctness, latency, token usage, and cost — not just framework overhead.
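Every per-task table below shares the same shape. A minimal TypeScript sketch of the result record those tables imply (field names are illustrative assumptions, not the package's exported types):

```typescript
// Sketch of the per-task result record implied by the tables below.
// Field names are assumptions for illustration, not the package's actual types.
interface TaskResult {
  task: string;
  strategy: "single-shot" | "react" | "plan-execute" | "tree-of-thought";
  passed: boolean;
  steps: number | null;   // null when the run aborted before any step completed
  latencyMs: number;
  tokens: number | null;
  costUsd: number | null; // null for local Ollama models, which have no API cost
}

// One row from the ollama/cogito:14b Trivial table, expressed as a record.
const t1: TaskResult = {
  task: "t1-js-typeof",
  strategy: "single-shot",
  passed: true,
  steps: 2,
  latencyMs: 2700,
  tokens: 97,
  costUsd: null,
};
```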
## Results

### Comparison Matrix
| Tier | ollama/cogito:14b | anthropic/claude-sonnet-4-20250514 | anthropic/claude-haiku-4-5 | openai/gpt-4o-mini (run 1) | ollama/cogito | ollama/qwen3.5 | openai/gpt-4o-mini (run 2) | gemini/gemini-2.5-flash | ollama/gpt-oss | openai/gpt-4o | openai/gpt-4o-mini (run 3) | ollama/gemma4:e4b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trivial | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) |
| Simple | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) | 4/4 (100%) |
| Moderate | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 3/5 (60%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) | 5/5 (100%) |
| Complex | 5/6 (83%) | 6/6 (100%) | 5/6 (83%) | 5/6 (83%) | 5/6 (83%) | 3/6 (50%) | 5/6 (83%) | 6/6 (100%) | 6/6 (100%) | 4/6 (67%) | 6/6 (100%) | 6/6 (100%) |
| Expert | 5/6 (83%) | 5/6 (83%) | 6/6 (100%) | 6/6 (100%) | 4/6 (67%) | 6/6 (100%) | 5/6 (83%) | 3/6 (50%) | 5/6 (83%) | 5/6 (83%) | 6/6 (100%) | 6/6 (100%) |
| Total | 23/25 (92%) | 24/25 (96%) | 24/25 (96%) | 24/25 (96%) | 22/25 (88%) | 22/25 (88%) | 21/25 (84%) | 22/25 (88%) | 24/25 (96%) | 22/25 (88%) | 25/25 (100%) | 25/25 (100%) |
## Model Summaries

- ollama/cogito:14b
- anthropic/claude-sonnet-4-20250514
- anthropic/claude-haiku-4-5
- openai/gpt-4o-mini (run 1)
- ollama/cogito
- ollama/qwen3.5
- openai/gpt-4o-mini (run 2)
- gemini/gemini-2.5-flash
- ollama/gpt-oss
- openai/gpt-4o
- openai/gpt-4o-mini (run 3)
- ollama/gemma4:e4b
## Task Details by Model
### ollama/cogito:14b
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 2.7s | 97 | — |
t2-binary-pow | single-shot | ✓ | 2 | 260.76ms | 100 | — |
t3-asimov-laws | single-shot | ✓ | 2 | 1.2s | 164 | — |
t4-json-csv | single-shot | ✓ | 2 | 2.1s | 121 | — |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 1.6s | 213 | — |
s2-palindrome-bug | single-shot | ✓ | 2 | 3.8s | 299 | — |
s3-bigO | single-shot | ✓ | 2 | 3.3s | 235 | — |
s4-design-pattern | single-shot | ✓ | 2 | 1.5s | 166 | — |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 2.8s | 484 | — |
m2-word-problem | react | ✓ | 1 | 4.0s | 488 | — |
m3-sql-injection | react | ✓ | 1 | 3.6s | 489 | — |
m4-remove-duplicates | react | ✓ | 1 | 4.2s | 537 | — |
m5-tool-search | react | ✓ | 6 | 7.9s | 3,770 | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 4.2s | 498 | — |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 10.4s | 2,016 | — |
c3-test-suite | plan-execute | ✓ | 4 | 20.2s | 3,126 | — |
c4-db-decomposition | react | ✓ | 1 | 1.5s | 366 | — |
c5-multi-tool | plan-execute | ✓ | 10 | 23.0s | 6,995 | — |
c6-multi-agent | plan-execute | ✗ | 6 | 12.7s | 2,081 | — |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 25 | 43.1s | 10,585 | — |
e2-incident-response | tree-of-thought | ✓ | 25 | 38.2s | 9,690 | — |
e3-logic-fallacy | tree-of-thought | ✓ | 25 | 35.5s | 9,133 | — |
e4-crdt-design | tree-of-thought | ✓ | 25 | 46.9s | 9,529 | — |
e5-file-execute | tree-of-thought | ✓ | 32 | 42.8s | 17,911 | — |
e6-guardrail-injection | react | ✗ | — | 82.60ms | — | — |
### anthropic/claude-sonnet-4-20250514
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 1.0s | 32 | $0.0001 |
t2-binary-pow | single-shot | ✓ | 2 | 919.73ms | 31 | $0.0002 |
t3-asimov-laws | single-shot | ✓ | 2 | 1.6s | 113 | $0.0013 |
t4-json-csv | single-shot | ✓ | 2 | 810.22ms | 51 | $0.0002 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 2.3s | 155 | $0.0017 |
s2-palindrome-bug | single-shot | ✓ | 2 | 4.2s | 333 | $0.0042 |
s3-bigO | single-shot | ✓ | 2 | 3.7s | 229 | $0.0024 |
s4-design-pattern | single-shot | ✓ | 2 | 2.4s | 131 | $0.0011 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 3.2s | 598 | $0.0048 |
m2-word-problem | react | ✓ | 1 | 4.1s | 596 | $0.0057 |
m3-sql-injection | react | ✓ | 1 | 5.9s | 729 | $0.0071 |
m4-remove-duplicates | react | ✓ | 1 | 4.6s | 676 | $0.0062 |
m5-tool-search | react | ✓ | 6 | 10.4s | 1,018 | $0.0055 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 35.0s | 2,974 | $0.0411 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 37.6s | 5,749 | $0.0387 |
c3-test-suite | plan-execute | ✓ | 4 | 40.5s | 6,724 | $0.0511 |
c4-db-decomposition | react | ✓ | 1 | 46.5s | 3,327 | $0.0460 |
c5-multi-tool | plan-execute | ✓ | 8 | 21.4s | 3,556 | $0.0075 |
c6-multi-agent | plan-execute | ✓ | 5 | 32.2s | 2,437 | $0.0098 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 18 | 59.4s | 12,723 | $0.0606 |
e2-incident-response | tree-of-thought | ✓ | 15 | 53.5s | 9,856 | $0.0481 |
e3-logic-fallacy | tree-of-thought | ✓ | 20 | 1.3m | 14,963 | $0.0739 |
e4-crdt-design | tree-of-thought | ✓ | 15 | 50.9s | 10,318 | $0.0529 |
e5-file-execute | tree-of-thought | ✓ | 24 | 59.2s | 14,572 | $0.0711 |
e6-guardrail-injection | react | ✗ | — | 64.18ms | — | — |
### anthropic/claude-haiku-4-5
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 724.91ms | 32 | $0.0000 |
t2-binary-pow | single-shot | ✓ | 2 | 627.29ms | 31 | $0.0000 |
t3-asimov-laws | single-shot | ✓ | 2 | 876.09ms | 125 | $0.0001 |
t4-json-csv | single-shot | ✓ | 2 | 574.75ms | 51 | $0.0000 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 1.1s | 162 | $0.0001 |
s2-palindrome-bug | single-shot | ✓ | 2 | 2.8s | 434 | $0.0002 |
s3-bigO | single-shot | ✓ | 2 | 1.9s | 280 | $0.0001 |
s4-design-pattern | single-shot | ✓ | 2 | 1.2s | 124 | $0.0000 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 1.6s | 549 | $0.0002 |
m2-word-problem | react | ✓ | 1 | 2.4s | 586 | $0.0003 |
m3-sql-injection | react | ✓ | 1 | 6.4s | 739 | $0.0003 |
m4-remove-duplicates | react | ✓ | 1 | 2.8s | 642 | $0.0003 |
m5-tool-search | react | ✓ | 6 | 8.3s | 7,674 | $0.0013 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | plan-execute | ✓ | 4 | 58.3s | 16,310 | $0.0052 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 18.6s | 5,056 | $0.0014 |
c3-test-suite | plan-execute | ✓ | 4 | 21.4s | 7,080 | $0.0023 |
c4-db-decomposition | plan-execute | ✓ | 4 | 1.2m | 10,096 | $0.0034 |
c5-multi-tool | plan-execute | ✓ | 12 | 19.5s | 7,443 | $0.0008 |
c6-multi-agent | plan-execute | ✗ | 5 | 12.4s | 5,296 | $0.0009 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 6 | 57.0s | 25,098 | $0.0064 |
e2-incident-response | tree-of-thought | ✓ | 6 | 1.2m | 22,928 | $0.0062 |
e3-logic-fallacy | tree-of-thought | ✓ | 6 | 49.9s | 13,255 | $0.0037 |
e4-crdt-design | tree-of-thought | ✓ | 11 | 1.1m | 22,850 | $0.0059 |
e5-file-execute | tree-of-thought | ✓ | 29 | 45.2s | 20,288 | $0.0041 |
e6-guardrail-injection | react | ✓ | 1 | 3.6s | 449 | $0.0002 |
### openai/gpt-4o-mini (run 1)
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 2.6s | 78 | $0.0000 |
t2-binary-pow | single-shot | ✓ | 2 | 1.2s | 79 | $0.0000 |
t3-asimov-laws | single-shot | ✓ | 2 | 6.0s | 159 | $0.0001 |
t4-json-csv | single-shot | ✓ | 2 | 1.5s | 101 | $0.0000 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 5.4s | 214 | $0.0001 |
s2-palindrome-bug | single-shot | ✓ | 2 | 6.4s | 270 | $0.0001 |
s3-bigO | single-shot | ✓ | 2 | 5.9s | 317 | $0.0001 |
s4-design-pattern | single-shot | ✓ | 2 | 2.7s | 162 | $0.0000 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 2 | 17.3s | 1,556 | $0.0005 |
m2-word-problem | react | ✓ | 1 | 5.6s | 584 | $0.0002 |
m3-sql-injection | react | ✓ | 2 | 12.3s | 1,086 | $0.0003 |
m4-remove-duplicates | react | ✓ | 2 | 11.2s | 1,245 | $0.0004 |
m5-tool-search | react | ✓ | 24 | 18.6s | 10,986 | $0.0019 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | plan-execute | ✓ | 4 | 25.2s | 3,174 | $0.0008 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 28.4s | 3,943 | $0.0010 |
c3-test-suite | plan-execute | ✓ | 5 | 38.5s | 9,050 | $0.0018 |
c4-db-decomposition | plan-execute | ✓ | 4 | 27.0s | 3,336 | $0.0008 |
c5-multi-tool | plan-execute | ✓ | 6 | 5.4s | 1,563 | $0.0001 |
c6-multi-agent | plan-execute | ✗ | 10 | 34.6s | 6,933 | $0.0002 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 25 | 1.4m | 19,852 | $0.0047 |
e2-incident-response | tree-of-thought | ✓ | 25 | 1.2m | 17,228 | $0.0039 |
e3-logic-fallacy | tree-of-thought | ✓ | 25 | 1.4m | 17,686 | $0.0040 |
e4-crdt-design | tree-of-thought | ✓ | 25 | 1.9m | 12,716 | $0.0039 |
e5-file-execute | tree-of-thought | ✓ | 32 | 1.1m | 16,105 | $0.0034 |
e6-guardrail-injection | react | ✓ | 2 | 1.6s | 457 | $0.0001 |
### ollama/cogito
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 3.2s | 82 | — |
t2-binary-pow | single-shot | ✓ | 2 | 210.17ms | 82 | — |
t3-asimov-laws | single-shot | ✓ | 2 | 777.85ms | 153 | — |
t4-json-csv | single-shot | ✓ | 2 | 1.8s | 105 | — |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 830.51ms | 182 | — |
s2-palindrome-bug | single-shot | ✓ | 2 | 3.1s | 339 | — |
s3-bigO | single-shot | ✓ | 2 | 1.8s | 180 | — |
s4-design-pattern | single-shot | ✓ | 2 | 1.1s | 161 | — |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 3.2s | 537 | — |
m2-word-problem | react | ✓ | 1 | 3.1s | 557 | — |
m3-sql-injection | react | ✓ | 1 | 2.5s | 529 | — |
m4-remove-duplicates | react | ✓ | 1 | 2.7s | 568 | — |
m5-tool-search | react | ✓ | 6 | 4.9s | 4,767 | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 4.2s | 707 | — |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 8.3s | 2,564 | — |
c3-test-suite | plan-execute | ✓ | 4 | 15.7s | 4,018 | — |
c4-db-decomposition | react | ✓ | 1 | 3.0s | 601 | — |
c5-multi-tool | plan-execute | ✓ | 16 | 36.8s | 11,886 | — |
c6-multi-agent | plan-execute | ✗ | 5 | 10.0s | 1,598 | — |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 4/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 23 | 25.9s | 10,289 | — |
e2-incident-response | tree-of-thought | ✓ | 23 | 24.1s | 9,366 | — |
e3-logic-fallacy | tree-of-thought | ✗ | 24 | 17.7s | 8,401 | — |
e4-crdt-design | tree-of-thought | ✓ | 25 | 25.1s | 11,872 | — |
e5-file-execute | tree-of-thought | ✓ | 41 | 25.6s | 24,599 | — |
e6-guardrail-injection | react | ✗ | — | 82.44ms | — | — |
### ollama/qwen3.5
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 1.4s | 188 | — |
t2-binary-pow | single-shot | ✓ | 2 | 2.5s | 300 | — |
t3-asimov-laws | single-shot | ✓ | 2 | 11.4s | 487 | — |
t4-json-csv | single-shot | ✓ | 2 | 3.5s | 408 | — |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 16.8s | 534 | — |
s2-palindrome-bug | single-shot | ✓ | 2 | 9.9s | 665 | — |
s3-bigO | single-shot | ✓ | 2 | 15.5s | 452 | — |
s4-design-pattern | single-shot | ✓ | 2 | 4.9s | 246 | — |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 2 | 19.0s | 2,394 | — |
m2-word-problem | react | ✓ | 1 | 9.5s | 1,082 | — |
m3-sql-injection | react | ✓ | 2 | 16.6s | 1,817 | — |
m4-remove-duplicates | react | ✓ | 1 | 13.5s | 1,042 | — |
m5-tool-search | react | ✓ | 10 | 28.0s | 6,658 | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 3/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | plan-execute | ✗ | 17 | 5.0m | — | — |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 4.1m | 10,918 | — |
c3-test-suite | plan-execute | ✓ | 4 | 38.7s | 5,208 | — |
c4-db-decomposition | plan-execute | ✓ | 4 | 2.2m | 12,274 | — |
c5-multi-tool | plan-execute | ✗ | 3 | 2.6m | — | — |
c6-multi-agent | plan-execute | ✗ | 5 | 1.7m | 11,750 | — |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 7 | 3.0m | 17,750 | — |
e2-incident-response | tree-of-thought | ✓ | 6 | 2.6m | 14,909 | — |
e3-logic-fallacy | tree-of-thought | ✓ | 7 | 2.6m | 15,623 | — |
e4-crdt-design | tree-of-thought | ✓ | 7 | 1.7m | 10,753 | — |
e5-file-execute | tree-of-thought | ✓ | 29 | 4.1m | 33,203 | — |
e6-guardrail-injection | react | ✓ | 2 | 11.0s | 1,115 | — |
### openai/gpt-4o-mini (run 2)
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 786.45ms | 78 | $0.0000 |
t2-binary-pow | single-shot | ✓ | 2 | 635.35ms | 79 | $0.0000 |
t3-asimov-laws | single-shot | ✓ | 2 | 1.9s | 157 | $0.0001 |
t4-json-csv | single-shot | ✓ | 2 | 459.69ms | 101 | $0.0000 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 3.4s | 221 | $0.0001 |
s2-palindrome-bug | single-shot | ✓ | 2 | 5.6s | 367 | $0.0002 |
s3-bigO | single-shot | ✓ | 2 | 3.7s | 359 | $0.0002 |
s4-design-pattern | single-shot | ✓ | 2 | 1.9s | 158 | $0.0000 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 3/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 4.7s | 461 | $0.0002 |
m2-word-problem | react | ✗ | 1 | 4.7s | 390 | $0.0001 |
m3-sql-injection | react | ✓ | 1 | 5.1s | 413 | $0.0001 |
m4-remove-duplicates | react | ✓ | 1 | 4.5s | 428 | $0.0001 |
m5-tool-search | react | ✗ | 6 | 1.2s | — | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | plan-execute | ✓ | 4 | 25.0s | 3,106 | $0.0008 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 14.6s | 2,270 | $0.0004 |
c3-test-suite | plan-execute | ✓ | 4 | 18.0s | 2,991 | $0.0007 |
c4-db-decomposition | plan-execute | ✓ | 4 | 28.8s | 3,123 | $0.0007 |
c5-multi-tool | plan-execute | ✓ | 10 | 33.0s | 7,302 | $0.0004 |
c6-multi-agent | plan-execute | ✗ | 10 | 38.1s | 7,019 | $0.0002 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 24 | 1.4m | 19,747 | $0.0045 |
e2-incident-response | tree-of-thought | ✓ | 24 | 1.4m | 18,722 | $0.0042 |
e3-logic-fallacy | tree-of-thought | ✓ | 24 | 1.2m | 17,189 | $0.0039 |
e4-crdt-design | tree-of-thought | ✓ | 24 | 1.5m | 16,049 | $0.0043 |
e5-file-execute | tree-of-thought | ✗ | 29 | 58.6s | 12,716 | $0.0028 |
e6-guardrail-injection | react | ✓ | 1 | 3.1s | 181 | $0.0000 |
### gemini/gemini-2.5-flash
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 1.2s | 90 | $0.0000 |
t2-binary-pow | single-shot | ✓ | 2 | 765.64ms | 91 | $0.0000 |
t3-asimov-laws | single-shot | ✓ | 2 | 1.2s | 165 | $0.0001 |
t4-json-csv | single-shot | ✓ | 2 | 798.14ms | 111 | $0.0000 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 10.2s | 378 | $0.0002 |
s2-palindrome-bug | single-shot | ✓ | 2 | 8.1s | 712 | $0.0004 |
s3-bigO | single-shot | ✓ | 2 | 4.9s | 304 | $0.0001 |
s4-design-pattern | single-shot | ✓ | 2 | 1.5s | 185 | $0.0001 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 3.8s | 575 | $0.0002 |
m2-word-problem | react | ✓ | 1 | 3.6s | 606 | $0.0002 |
m3-sql-injection | react | ✓ | 1 | 4.0s | 594 | $0.0002 |
m4-remove-duplicates | react | ✓ | 1 | 2.4s | 448 | $0.0001 |
m5-tool-search | react | ✓ | 6 | 10.9s | 6,924 | $0.0011 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 13.1s | 1,816 | $0.0010 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 21.5s | 3,438 | $0.0008 |
c3-test-suite | plan-execute | ✓ | 4 | 26.5s | 5,607 | $0.0016 |
c4-db-decomposition | react | ✓ | 1 | 14.7s | 424 | $0.0001 |
c5-multi-tool | plan-execute | ✓ | 8 | 30.2s | 3,410 | $0.0002 |
c6-multi-agent | plan-execute | ✓ | 6 | 31.4s | 2,319 | $0.0001 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 3/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 8 | 54.9s | 5,043 | $0.0014 |
e2-incident-response | tree-of-thought | ✓ | 5 | 46.1s | 2,808 | $0.0006 |
e3-logic-fallacy | tree-of-thought | ✗ | 21 | 2.8m | 10,708 | $0.0027 |
e4-crdt-design | tree-of-thought | ✓ | 5 | 42.1s | 2,229 | $0.0005 |
e5-file-execute | tree-of-thought | ✗ | 14 | 1.9m | 7,722 | $0.0018 |
e6-guardrail-injection | react | ✗ | — | 77.81ms | — | — |
### ollama/gpt-oss
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 14.6s | 205 | — |
t2-binary-pow | single-shot | ✓ | 2 | 600.57ms | 183 | — |
t3-asimov-laws | single-shot | ✓ | 2 | 1.0s | 249 | — |
t4-json-csv | single-shot | ✓ | 2 | 1.0s | 239 | — |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 11.4s | 363 | — |
s2-palindrome-bug | single-shot | ✓ | 2 | 3.5s | 556 | — |
s3-bigO | single-shot | ✓ | 2 | 10.5s | 387 | — |
s4-design-pattern | single-shot | ✓ | 2 | 1.2s | 295 | — |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 9.1s | 495 | — |
m2-word-problem | react | ✓ | 1 | 5.4s | 553 | — |
m3-sql-injection | react | ✓ | 1 | 3.2s | 451 | — |
m4-remove-duplicates | react | ✓ | 1 | 4.7s | 550 | — |
m5-tool-search | react | ✓ | 6 | 7.4s | 3,787 | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 7.7s | 723 | — |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 13.1s | 2,435 | — |
c3-test-suite | plan-execute | ✓ | 4 | 24.3s | 3,627 | — |
c4-db-decomposition | react | ✓ | 1 | 5.0s | 575 | — |
c5-multi-tool | plan-execute | ✓ | 12 | 38.4s | 8,415 | — |
c6-multi-agent | plan-execute | ✓ | 6 | 17.9s | 2,614 | — |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 25 | 46.1s | 10,717 | — |
e2-incident-response | tree-of-thought | ✓ | 25 | 47.5s | 10,007 | — |
e3-logic-fallacy | tree-of-thought | ✓ | 25 | 43.7s | 9,222 | — |
e4-crdt-design | tree-of-thought | ✓ | 25 | 2.0m | 9,684 | — |
e5-file-execute | tree-of-thought | ✓ | 32 | 39.0s | 14,728 | — |
e6-guardrail-injection | react | ✗ | — | 83.76ms | — | — |
### openai/gpt-4o
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 1.6s | 77 | $0.0002 |
t2-binary-pow | single-shot | ✓ | 2 | 379.60ms | 78 | $0.0002 |
t3-asimov-laws | single-shot | ✓ | 2 | 2.3s | 155 | $0.0009 |
t4-json-csv | single-shot | ✓ | 2 | 1.4s | 100 | $0.0003 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 917.65ms | 184 | $0.0011 |
s2-palindrome-bug | single-shot | ✓ | 2 | 2.0s | 277 | $0.0020 |
s3-bigO | single-shot | ✓ | 2 | 3.1s | 376 | $0.0028 |
s4-design-pattern | single-shot | ✓ | 2 | 1.3s | 190 | $0.0010 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 1.7s | 522 | $0.0029 |
m2-word-problem | react | ✓ | 1 | 2.9s | 622 | $0.0044 |
m3-sql-injection | react | ✓ | 1 | 2.6s | 455 | $0.0025 |
m4-remove-duplicates | react | ✓ | 1 | 2.0s | 535 | $0.0032 |
m5-tool-search | react | ✓ | 6 | 4.4s | 5,037 | $0.0133 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 4/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 5.4s | 791 | $0.0059 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 8.9s | 2,910 | $0.0106 |
c3-test-suite | plan-execute | ✓ | 4 | 11.0s | 3,588 | $0.0152 |
c4-db-decomposition | react | ✓ | 1 | 4.7s | 789 | $0.0057 |
c5-multi-tool | plan-execute | ✗ | 10 | 18.8s | 7,485 | $0.0039 |
c6-multi-agent | plan-execute | ✗ | 5 | 9.3s | 1,606 | $0.0024 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 21 | 36.1s | 13,977 | $0.0525 |
e2-incident-response | tree-of-thought | ✓ | 25 | 37.4s | 12,086 | $0.0509 |
e3-logic-fallacy | tree-of-thought | ✓ | 24 | 32.7s | 12,651 | $0.0450 |
e4-crdt-design | tree-of-thought | ✓ | 24 | 45.5s | 18,519 | $0.0723 |
e5-file-execute | tree-of-thought | ✓ | 33 | 39.1s | 15,434 | $0.0532 |
e6-guardrail-injection | react | ✗ | — | 63.58ms | — | — |
### openai/gpt-4o-mini (run 3)
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 1.0s | 78 | $0.0000 |
t2-binary-pow | single-shot | ✓ | 2 | 892.35ms | 79 | $0.0000 |
t3-asimov-laws | single-shot | ✓ | 2 | 2.3s | 156 | $0.0001 |
t4-json-csv | single-shot | ✓ | 2 | 1.1s | 101 | $0.0000 |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 2.8s | 218 | $0.0001 |
s2-palindrome-bug | single-shot | ✓ | 2 | 3.1s | 266 | $0.0001 |
s3-bigO | single-shot | ✓ | 2 | 4.1s | 351 | $0.0002 |
s4-design-pattern | single-shot | ✓ | 2 | 1.8s | 169 | $0.0000 |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 4.1s | 458 | $0.0002 |
m2-word-problem | react | ✓ | 1 | 7.1s | 518 | $0.0002 |
m3-sql-injection | react | ✓ | 1 | 4.5s | 434 | $0.0002 |
m4-remove-duplicates | react | ✓ | 1 | 3.1s | 429 | $0.0001 |
m5-tool-search | react | ✓ | 6 | 7.2s | 5,215 | $0.0008 |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | plan-execute | ✓ | 4 | 30.9s | 3,332 | $0.0009 |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 16.9s | 2,818 | $0.0006 |
c3-test-suite | plan-execute | ✓ | 5 | 36.8s | 6,288 | $0.0017 |
c4-db-decomposition | plan-execute | ✓ | 4 | 29.8s | 3,863 | $0.0010 |
c5-multi-tool | plan-execute | ✓ | 10 | 22.8s | 6,375 | $0.0002 |
c6-multi-agent | plan-execute | ✓ | 13 | 41.7s | 9,093 | $0.0004 |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 21 | 52.4s | 12,967 | $0.0032 |
e2-incident-response | tree-of-thought | ✓ | 24 | 59.0s | 10,593 | $0.0029 |
e3-logic-fallacy | tree-of-thought | ✓ | 24 | 54.9s | 15,858 | $0.0037 |
e4-crdt-design | tree-of-thought | ✓ | 24 | 1.3m | 11,373 | $0.0035 |
e5-file-execute | tree-of-thought | ✓ | 32 | 52.1s | 16,625 | $0.0035 |
e6-guardrail-injection | react | ✓ | 1 | 612.18ms | 185 | $0.0000 |
### ollama/gemma4:e4b
Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
t1-js-typeof | single-shot | ✓ | 2 | 4.9s | 104 | — |
t2-binary-pow | single-shot | ✓ | 2 | 357.93ms | 105 | — |
t3-asimov-laws | single-shot | ✓ | 2 | 3.4s | 488 | — |
t4-json-csv | single-shot | ✓ | 2 | 4.5s | 125 | — |
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
s1-fibonacci | single-shot | ✓ | 2 | 5.1s | 720 | — |
s2-palindrome-bug | single-shot | ✓ | 2 | 16.3s | 1,416 | — |
s3-bigO | single-shot | ✓ | 2 | 11.6s | 858 | — |
s4-design-pattern | single-shot | ✓ | 2 | 11.1s | 774 | — |
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
m1-merge-intervals | react | ✓ | 1 | 11.0s | 1,303 | — |
m2-word-problem | react | ✓ | 1 | 9.4s | 1,360 | — |
m3-sql-injection | react | ✓ | 1 | 7.3s | 1,163 | — |
m4-remove-duplicates | react | ✓ | 1 | 8.3s | 1,281 | — |
m5-tool-search | react | ✓ | 5 | 14.9s | 6,161 | — |
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
c1-distributed-queue | react | ✓ | 1 | 38.6s | 2,936 | — |
c2-auth-vulnerabilities | plan-execute | ✓ | 4 | 12.7s | 2,871 | — |
c3-test-suite | plan-execute | ✓ | 4 | 26.4s | 6,486 | — |
c4-db-decomposition | react | ✓ | 1 | 39.0s | 3,165 | — |
c5-multi-tool | plan-execute | ✓ | 14 | 25.3s | 10,024 | — |
c6-multi-agent | plan-execute | ✓ | 5 | 22.2s | 2,157 | — |
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
| Task | Strategy | Status | Steps | Latency | Tokens | Cost |
|---|---|---|---|---|---|---|
e1-lis-optimization | tree-of-thought | ✓ | 18 | 2.1m | 22,993 | — |
e2-incident-response | tree-of-thought | ✓ | 24 | 3.7m | 35,244 | — |
e3-logic-fallacy | tree-of-thought | ✓ | 24 | 3.2m | 35,953 | — |
e4-crdt-design | tree-of-thought | ✓ | 24 | 3.7m | 36,542 | — |
e5-file-execute | tree-of-thought | ✓ | 32 | 2.2m | 32,368 | — |
e6-guardrail-injection | react | ✓ | 1 | 653.15ms | 266 | — |
## Framework Overhead
Measured with the test provider to isolate pure Effect-TS layer-composition cost, independent of LLM latency.
| Measurement | Avg Duration | Samples |
|---|---|---|
| Runtime Creation | 0.02ms | 10 |
| Runtime Creation Full | 0.03ms | 10 |
| Complexity Classification | <0.01ms | 100 |
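A minimal sketch of how such micro-measurements can be taken: run the setup function N times and average the wall-clock duration. The `createRuntime` stand-in below is hypothetical; the real benchmark composes Effect-TS layers with the test provider so no LLM call is involved.

```typescript
// Time a setup function over N samples and report the average in milliseconds.
function measureAvgMs(fn: () => unknown, samples: number): number {
  const start = performance.now();
  for (let i = 0; i < samples; i++) fn();
  return (performance.now() - start) / samples;
}

// Hypothetical stand-in for runtime creation with the test provider.
const createRuntime = () => ({ provider: "test", layers: [] });

const avg = measureAvgMs(createRuntime, 10);
console.log(`Runtime Creation: ${avg.toFixed(2)}ms over 10 samples`);
```

Because the measured function never touches a network, the numbers above reflect pure layer-composition cost rather than provider latency.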
Benchmark Methodology
Industry Standard Alignment
Section titled “Industry Standard Alignment”Each task tier maps to a recognized benchmark standard:
| Tier | Strategy | Aligned With |
|---|---|---|
| Trivial | Single-shot | MMLU-CS · MATH baseline · AgentEval |
| Simple | Single-shot | HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE |
| Moderate | ReAct (reactive) | HumanEval Medium · BIG-Bench Hard · SWE-bench lite |
| Complex | Plan-Execute-Reflect | AgentBench · SWE-bench Security · TestEval |
| Expert | Tree-of-Thought | BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS |
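Treating the alignment table above as data, the tier-to-strategy routing can be sketched as a simple lookup (hypothetical names; the real classifier lives inside the framework):

```typescript
// Route a classified task tier to its reasoning strategy,
// mirroring the alignment table above.
type Tier = "trivial" | "simple" | "moderate" | "complex" | "expert";
type Strategy = "single-shot" | "react" | "plan-execute" | "tree-of-thought";

const strategyByTier: Record<Tier, Strategy> = {
  trivial: "single-shot",
  simple: "single-shot",
  moderate: "react",          // ReAct loop for multi-step tasks
  complex: "plan-execute",    // Plan-Execute-Reflect
  expert: "tree-of-thought",  // branching exploration for the hardest tasks
};

console.log(strategyByTier.expert); // "tree-of-thought"
```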
What Each Benchmark Standard Covers
- HumanEval (OpenAI) — 164 handcrafted code generation tasks evaluated by functional correctness. Our tasks include function implementation, algorithm design, and test generation.
- SWE-bench (Princeton) — Resolving real GitHub issues. We use SWE-bench patterns for bug identification, security vulnerability analysis, and multi-file code review.
- BIG-Bench Hard (Google) — 23 challenging tasks where chain-of-thought is required. We include: algorithmic optimization, logic/fallacy analysis, multi-step word problems, and Big-O complexity reasoning.
- GAIA (Meta) — Multi-step tasks requiring tool use and reasoning. Our Level 3 equivalent task tests production incident response requiring multi-domain knowledge synthesis.
- AgentBench (THUDM) — 8-environment agent evaluation. We use AgentBench patterns for system design, database decomposition, and migration planning tasks.
- MMLU-Pro — Professional knowledge across 14 domains. Tasks cover CS theory (CRDTs, design patterns), software engineering, and architecture decision-making.
Scoring
A task passes if the LLM’s output contains the expected pattern (case-insensitive regex). Patterns are crafted to require substantive, correct answers — they cannot be satisfied by generic responses:
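Concretely, the pass check can be sketched as follows (hypothetical helper name; the actual matcher lives in the benchmarks harness):

```typescript
// A task passes when the model's output matches the tier's expected
// pattern, compiled as a case-insensitive regular expression.
function passes(output: string, expectedPattern: string): boolean {
  return new RegExp(expectedPattern, "i").test(output);
}

const crdtPattern = "CRDT|vector.?clock|logical.?time|merge|commutative|converge";
console.log(passes("State converges via vector clocks", crdtPattern)); // true
console.log(passes("Just retry the request", crdtPattern));            // false
```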
- SQL injection fix expected: "parameteriz|prepared|placeholder|$1|?"
- CRDT design expected: "CRDT|vector.?clock|logical.?time|merge|commutative|converge"
Running Benchmarks
```sh
# Run with Anthropic (recommended for real-world results)
cd packages/benchmarks
bun run src/run.ts --provider anthropic --output report.json

# Run with a specific model
bun run src/run.ts --provider anthropic --model claude-opus-4-5 --output report.json

# Run only trivial + simple tiers (quick sanity check)
bun run src/run.ts --provider anthropic --tier trivial,simple

# OpenAI
bun run src/run.ts --provider openai --model gpt-4o --output report.json

# Gemini
bun run src/run.ts --provider gemini --model gemini-2.0-flash --output report.json
```
CLI Options
| Flag | Description | Default |
|---|---|---|
| --provider | LLM provider (anthropic, openai, gemini, ollama, litellm) | test |
| --model | Model name (uses provider default if omitted) | Provider default |
| --tier | Comma-separated tier filter | All tiers |
| --output | Path to save JSON report | (none) |
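The flag resolution above can be sketched like this (hypothetical names; the real CLI lives in packages/benchmarks/src/run.ts):

```typescript
// Resolve CLI flags to run options using the documented defaults.
interface RunOptions {
  provider: string;   // defaults to "test"
  model?: string;     // provider default applies when omitted
  tiers?: string[];   // all tiers when omitted
  output?: string;    // no report written when omitted
}

function parseArgs(argv: string[]): RunOptions {
  const get = (flag: string): string | undefined => {
    const i = argv.indexOf(flag);
    return i >= 0 ? argv[i + 1] : undefined;
  };
  return {
    provider: get("--provider") ?? "test",
    model: get("--model"),
    tiers: get("--tier")?.split(","),
    output: get("--output"),
  };
}

console.log(parseArgs(["--provider", "anthropic", "--tier", "trivial,simple"]));
```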
Provider Defaults
| Provider | Default Model | Rationale |
|---|---|---|
| anthropic | claude-haiku-4-5 | Fast, cost-efficient, strong reasoning |
| openai | gpt-4o-mini | Cost-efficient with strong benchmark performance |
| gemini | gemini-2.0-flash | Fast inference, competitive pricing |
| ollama | llama3.2 | Local inference, no API cost |
Updating the Displayed Results
To regenerate the benchmark data shown on this page using the Anthropic provider:
```sh
cd packages/benchmarks
bun run src/run.ts --provider anthropic --output ../../apps/docs/src/data/benchmark-report.json
```
The page renders dynamically from the JSON report at build time — no manual table updates needed.
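As a sketch of what the build-time rendering can derive from the report (the result shape below is an assumption; the actual benchmark-report.json schema may differ), per-tier pass rates like those in the comparison matrix follow directly from the task results:

```typescript
// Hypothetical report entry; the real schema may carry more fields.
interface TaskResult { tier: string; passed: boolean }

// Aggregate per-tier pass counts into "passed/total (percent)" strings.
function tierSummary(results: TaskResult[]): Record<string, string> {
  const tiers = new Map<string, { pass: number; total: number }>();
  for (const r of results) {
    const t = tiers.get(r.tier) ?? { pass: 0, total: 0 };
    t.total++;
    if (r.passed) t.pass++;
    tiers.set(r.tier, t);
  }
  const out: Record<string, string> = {};
  for (const [tier, { pass, total }] of tiers) {
    out[tier] = `${pass}/${total} (${Math.round((pass / total) * 100)}%)`;
  }
  return out;
}

const demo: TaskResult[] = [
  { tier: "trivial", passed: true },
  { tier: "trivial", passed: true },
  { tier: "expert", passed: false },
  { tier: "expert", passed: true },
];
console.log(tierSummary(demo)); // { trivial: "2/2 (100%)", expert: "1/2 (50%)" }
```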