
Benchmarks

The @reactive-agents/benchmarks package evaluates end-to-end agent performance across 25 tasks spanning 5 complexity tiers. Tasks are aligned with leading agentic benchmark standards used by the research community, and run against a real LLM to measure actual correctness, latency, token usage, and cost — not just framework overhead.

Last generated: April 11, 2026 at 01:02 AM · Models: ollama/cogito:14b, anthropic/claude-sonnet-4-20250514, anthropic/claude-haiku-4-5, openai/gpt-4o-mini, ollama/cogito, ollama/qwen3.5, openai/gpt-4o-mini, gemini/gemini-2.5-flash, ollama/gpt-oss, openai/gpt-4o, openai/gpt-4o-mini, ollama/gemma4:e4b

Comparison Matrix

Tier ollama/cogito:14b anthropic/claude-sonnet-4-20250514 anthropic/claude-haiku-4-5 openai/gpt-4o-mini ollama/cogito ollama/qwen3.5 openai/gpt-4o-mini gemini/gemini-2.5-flash ollama/gpt-oss openai/gpt-4o openai/gpt-4o-mini ollama/gemma4:e4b
Trivial 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%)
Simple 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%) 4/4 (100%)
Moderate 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 3/5 (60%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%) 5/5 (100%)
Complex 5/6 (83%) 6/6 (100%) 5/6 (83%) 5/6 (83%) 5/6 (83%) 3/6 (50%) 5/6 (83%) 6/6 (100%) 6/6 (100%) 4/6 (67%) 6/6 (100%) 6/6 (100%)
Expert 5/6 (83%) 5/6 (83%) 6/6 (100%) 6/6 (100%) 4/6 (67%) 6/6 (100%) 5/6 (83%) 3/6 (50%) 5/6 (83%) 5/6 (83%) 6/6 (100%) 6/6 (100%)
Total 23/25 (92%) 24/25 (96%) 24/25 (96%) 24/25 (96%) 22/25 (88%) 22/25 (88%) 21/25 (84%) 22/25 (88%) 24/25 (96%) 22/25 (88%) 25/25 (100%) 25/25 (100%)

Model Summaries

ollama/cogito:14b

23/25 Tasks Passed
92% Pass Rate
12.7s Avg Latency
5.3m Total Duration
79,093 Total Tokens
$0.0000 Total Cost (USD)

anthropic/claude-sonnet-4-20250514

24/25 Tasks Passed
96% Pass Rate
22.5s Avg Latency
9.4m Total Duration
91,891 Total Tokens
$0.5411 Total Cost (USD)

anthropic/claude-haiku-4-5

24/25 Tasks Passed
96% Pass Rate
21.1s Avg Latency
8.8m Total Duration
167,578 Total Tokens
$0.0432 Total Cost (USD)

openai/gpt-4o-mini

24/25 Tasks Passed
96% Pass Rate
27.3s Avg Latency
11.4m Total Duration
128,880 Total Tokens
$0.0285 Total Cost (USD)

ollama/cogito

22/25 Tasks Passed
88% Pass Rate
9.0s Avg Latency
3.8m Total Duration
94,143 Total Tokens
$0.0000 Total Cost (USD)

ollama/qwen3.5

22/25 Tasks Passed
88% Pass Rate
1.3m Avg Latency
32.9m Total Duration
149,776 Total Tokens
$0.0000 Total Cost (USD)

openai/gpt-4o-mini

21/25 Tasks Passed
84% Pass Rate
23.5s Avg Latency
9.8m Total Duration
113,627 Total Tokens
$0.0240 Total Cost (USD)

gemini/gemini-2.5-flash

22/25 Tasks Passed
88% Pass Rate
24.6s Avg Latency
10.2m Total Duration
56,707 Total Tokens
$0.0135 Total Cost (USD)

ollama/gpt-oss

24/25 Tasks Passed
96% Pass Rate
19.1s Avg Latency
7.9m Total Duration
81,060 Total Tokens
$0.0000 Total Cost (USD)

openai/gpt-4o

22/25 Tasks Passed
88% Pass Rate
11.0s Avg Latency
4.6m Total Duration
98,444 Total Tokens
$0.3523 Total Cost (USD)

openai/gpt-4o-mini

25/25 Tasks Passed
100% Pass Rate
20.8s Avg Latency
8.7m Total Duration
107,842 Total Tokens
$0.0237 Total Cost (USD)

ollama/gemma4:e4b

25/25 Tasks Passed
100% Pass Rate
46.7s Avg Latency
19.5m Total Duration
206,863 Total Tokens
$0.0000 Total Cost (USD)

Task Details by Model

ollama/cogito:14b

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 2.7s 97
t2-binary-pow single-shot 2 260.76ms 100
t3-asimov-laws single-shot 2 1.2s 164
t4-json-csv single-shot 2 2.1s 121
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 1.6s 213
s2-palindrome-bug single-shot 2 3.8s 299
s3-bigO single-shot 2 3.3s 235
s4-design-pattern single-shot 2 1.5s 166
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 2.8s 484
m2-word-problem react 1 4.0s 488
m3-sql-injection react 1 3.6s 489
m4-remove-duplicates react 1 4.2s 537
m5-tool-search react 6 7.9s 3,770
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 4.2s 498
c2-auth-vulnerabilities plan-execute 4 10.4s 2,016
c3-test-suite plan-execute 4 20.2s 3,126
c4-db-decomposition react 1 1.5s 366
c5-multi-tool plan-execute 10 23.0s 6,995
c6-multi-agent plan-execute 6 12.7s 2,081
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 43.1s 10,585
e2-incident-response tree-of-thought 25 38.2s 9,690
e3-logic-fallacy tree-of-thought 25 35.5s 9,133
e4-crdt-design tree-of-thought 25 46.9s 9,529
e5-file-execute tree-of-thought 32 42.8s 17,911
e6-guardrail-injection react 82.60ms

anthropic/claude-sonnet-4-20250514

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.0s 32 $0.0001
t2-binary-pow single-shot 2 919.73ms 31 $0.0002
t3-asimov-laws single-shot 2 1.6s 113 $0.0013
t4-json-csv single-shot 2 810.22ms 51 $0.0002
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 2.3s 155 $0.0017
s2-palindrome-bug single-shot 2 4.2s 333 $0.0042
s3-bigO single-shot 2 3.7s 229 $0.0024
s4-design-pattern single-shot 2 2.4s 131 $0.0011
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.2s 598 $0.0048
m2-word-problem react 1 4.1s 596 $0.0057
m3-sql-injection react 1 5.9s 729 $0.0071
m4-remove-duplicates react 1 4.6s 676 $0.0062
m5-tool-search react 6 10.4s 1,018 $0.0055
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 35.0s 2,974 $0.0411
c2-auth-vulnerabilities plan-execute 4 37.6s 5,749 $0.0387
c3-test-suite plan-execute 4 40.5s 6,724 $0.0511
c4-db-decomposition react 1 46.5s 3,327 $0.0460
c5-multi-tool plan-execute 8 21.4s 3,556 $0.0075
c6-multi-agent plan-execute 5 32.2s 2,437 $0.0098
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 18 59.4s 12,723 $0.0606
e2-incident-response tree-of-thought 15 53.5s 9,856 $0.0481
e3-logic-fallacy tree-of-thought 20 1.3m 14,963 $0.0739
e4-crdt-design tree-of-thought 15 50.9s 10,318 $0.0529
e5-file-execute tree-of-thought 24 59.2s 14,572 $0.0711
e6-guardrail-injection react 64.18ms

anthropic/claude-haiku-4-5

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 724.91ms 32 $0.0000
t2-binary-pow single-shot 2 627.29ms 31 $0.0000
t3-asimov-laws single-shot 2 876.09ms 125 $0.0001
t4-json-csv single-shot 2 574.75ms 51 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 1.1s 162 $0.0001
s2-palindrome-bug single-shot 2 2.8s 434 $0.0002
s3-bigO single-shot 2 1.9s 280 $0.0001
s4-design-pattern single-shot 2 1.2s 124 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 1.6s 549 $0.0002
m2-word-problem react 1 2.4s 586 $0.0003
m3-sql-injection react 1 6.4s 739 $0.0003
m4-remove-duplicates react 1 2.8s 642 $0.0003
m5-tool-search react 6 8.3s 7,674 $0.0013
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 58.3s 16,310 $0.0052
c2-auth-vulnerabilities plan-execute 4 18.6s 5,056 $0.0014
c3-test-suite plan-execute 4 21.4s 7,080 $0.0023
c4-db-decomposition plan-execute 4 1.2m 10,096 $0.0034
c5-multi-tool plan-execute 12 19.5s 7,443 $0.0008
c6-multi-agent plan-execute 5 12.4s 5,296 $0.0009
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 6 57.0s 25,098 $0.0064
e2-incident-response tree-of-thought 6 1.2m 22,928 $0.0062
e3-logic-fallacy tree-of-thought 6 49.9s 13,255 $0.0037
e4-crdt-design tree-of-thought 11 1.1m 22,850 $0.0059
e5-file-execute tree-of-thought 29 45.2s 20,288 $0.0041
e6-guardrail-injection react 1 3.6s 449 $0.0002

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 2.6s 78 $0.0000
t2-binary-pow single-shot 2 1.2s 79 $0.0000
t3-asimov-laws single-shot 2 6.0s 159 $0.0001
t4-json-csv single-shot 2 1.5s 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 5.4s 214 $0.0001
s2-palindrome-bug single-shot 2 6.4s 270 $0.0001
s3-bigO single-shot 2 5.9s 317 $0.0001
s4-design-pattern single-shot 2 2.7s 162 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 2 17.3s 1,556 $0.0005
m2-word-problem react 1 5.6s 584 $0.0002
m3-sql-injection react 2 12.3s 1,086 $0.0003
m4-remove-duplicates react 2 11.2s 1,245 $0.0004
m5-tool-search react 24 18.6s 10,986 $0.0019
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 25.2s 3,174 $0.0008
c2-auth-vulnerabilities plan-execute 4 28.4s 3,943 $0.0010
c3-test-suite plan-execute 5 38.5s 9,050 $0.0018
c4-db-decomposition plan-execute 4 27.0s 3,336 $0.0008
c5-multi-tool plan-execute 6 5.4s 1,563 $0.0001
c6-multi-agent plan-execute 10 34.6s 6,933 $0.0002
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 1.4m 19,852 $0.0047
e2-incident-response tree-of-thought 25 1.2m 17,228 $0.0039
e3-logic-fallacy tree-of-thought 25 1.4m 17,686 $0.0040
e4-crdt-design tree-of-thought 25 1.9m 12,716 $0.0039
e5-file-execute tree-of-thought 32 1.1m 16,105 $0.0034
e6-guardrail-injection react 2 1.6s 457 $0.0001

ollama/cogito

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 3.2s 82
t2-binary-pow single-shot 2 210.17ms 82
t3-asimov-laws single-shot 2 777.85ms 153
t4-json-csv single-shot 2 1.8s 105
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 830.51ms 182
s2-palindrome-bug single-shot 2 3.1s 339
s3-bigO single-shot 2 1.8s 180
s4-design-pattern single-shot 2 1.1s 161
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.2s 537
m2-word-problem react 1 3.1s 557
m3-sql-injection react 1 2.5s 529
m4-remove-duplicates react 1 2.7s 568
m5-tool-search react 6 4.9s 4,767
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 4.2s 707
c2-auth-vulnerabilities plan-execute 4 8.3s 2,564
c3-test-suite plan-execute 4 15.7s 4,018
c4-db-decomposition react 1 3.0s 601
c5-multi-tool plan-execute 16 36.8s 11,886
c6-multi-agent plan-execute 5 10.0s 1,598
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 4/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 23 25.9s 10,289
e2-incident-response tree-of-thought 23 24.1s 9,366
e3-logic-fallacy tree-of-thought 24 17.7s 8,401
e4-crdt-design tree-of-thought 25 25.1s 11,872
e5-file-execute tree-of-thought 41 25.6s 24,599
e6-guardrail-injection react 82.44ms

ollama/qwen3.5

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.4s 188
t2-binary-pow single-shot 2 2.5s 300
t3-asimov-laws single-shot 2 11.4s 487
t4-json-csv single-shot 2 3.5s 408
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 16.8s 534
s2-palindrome-bug single-shot 2 9.9s 665
s3-bigO single-shot 2 15.5s 452
s4-design-pattern single-shot 2 4.9s 246
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 2 19.0s 2,394
m2-word-problem react 1 9.5s 1,082
m3-sql-injection react 2 16.6s 1,817
m4-remove-duplicates react 1 13.5s 1,042
m5-tool-search react 10 28.0s 6,658
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 3/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 17 5.0m
c2-auth-vulnerabilities plan-execute 4 4.1m 10,918
c3-test-suite plan-execute 4 38.7s 5,208
c4-db-decomposition plan-execute 4 2.2m 12,274
c5-multi-tool plan-execute 3 2.6m
c6-multi-agent plan-execute 5 1.7m 11,750
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 7 3.0m 17,750
e2-incident-response tree-of-thought 6 2.6m 14,909
e3-logic-fallacy tree-of-thought 7 2.6m 15,623
e4-crdt-design tree-of-thought 7 1.7m 10,753
e5-file-execute tree-of-thought 29 4.1m 33,203
e6-guardrail-injection react 2 11.0s 1,115

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 786.45ms 78 $0.0000
t2-binary-pow single-shot 2 635.35ms 79 $0.0000
t3-asimov-laws single-shot 2 1.9s 157 $0.0001
t4-json-csv single-shot 2 459.69ms 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 3.4s 221 $0.0001
s2-palindrome-bug single-shot 2 5.6s 367 $0.0002
s3-bigO single-shot 2 3.7s 359 $0.0002
s4-design-pattern single-shot 2 1.9s 158 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 3/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 4.7s 461 $0.0002
m2-word-problem react 1 4.7s 390 $0.0001
m3-sql-injection react 1 5.1s 413 $0.0001
m4-remove-duplicates react 1 4.5s 428 $0.0001
m5-tool-search react 6 1.2s
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 25.0s 3,106 $0.0008
c2-auth-vulnerabilities plan-execute 4 14.6s 2,270 $0.0004
c3-test-suite plan-execute 4 18.0s 2,991 $0.0007
c4-db-decomposition plan-execute 4 28.8s 3,123 $0.0007
c5-multi-tool plan-execute 10 33.0s 7,302 $0.0004
c6-multi-agent plan-execute 10 38.1s 7,019 $0.0002
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 24 1.4m 19,747 $0.0045
e2-incident-response tree-of-thought 24 1.4m 18,722 $0.0042
e3-logic-fallacy tree-of-thought 24 1.2m 17,189 $0.0039
e4-crdt-design tree-of-thought 24 1.5m 16,049 $0.0043
e5-file-execute tree-of-thought 29 58.6s 12,716 $0.0028
e6-guardrail-injection react 1 3.1s 181 $0.0000

gemini/gemini-2.5-flash

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.2s 90 $0.0000
t2-binary-pow single-shot 2 765.64ms 91 $0.0000
t3-asimov-laws single-shot 2 1.2s 165 $0.0001
t4-json-csv single-shot 2 798.14ms 111 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 10.2s 378 $0.0002
s2-palindrome-bug single-shot 2 8.1s 712 $0.0004
s3-bigO single-shot 2 4.9s 304 $0.0001
s4-design-pattern single-shot 2 1.5s 185 $0.0001
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 3.8s 575 $0.0002
m2-word-problem react 1 3.6s 606 $0.0002
m3-sql-injection react 1 4.0s 594 $0.0002
m4-remove-duplicates react 1 2.4s 448 $0.0001
m5-tool-search react 6 10.9s 6,924 $0.0011
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 13.1s 1,816 $0.0010
c2-auth-vulnerabilities plan-execute 4 21.5s 3,438 $0.0008
c3-test-suite plan-execute 4 26.5s 5,607 $0.0016
c4-db-decomposition react 1 14.7s 424 $0.0001
c5-multi-tool plan-execute 8 30.2s 3,410 $0.0002
c6-multi-agent plan-execute 6 31.4s 2,319 $0.0001
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 3/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 8 54.9s 5,043 $0.0014
e2-incident-response tree-of-thought 5 46.1s 2,808 $0.0006
e3-logic-fallacy tree-of-thought 21 2.8m 10,708 $0.0027
e4-crdt-design tree-of-thought 5 42.1s 2,229 $0.0005
e5-file-execute tree-of-thought 14 1.9m 7,722 $0.0018
e6-guardrail-injection react 77.81ms

ollama/gpt-oss

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 14.6s 205
t2-binary-pow single-shot 2 600.57ms 183
t3-asimov-laws single-shot 2 1.0s 249
t4-json-csv single-shot 2 1.0s 239
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 11.4s 363
s2-palindrome-bug single-shot 2 3.5s 556
s3-bigO single-shot 2 10.5s 387
s4-design-pattern single-shot 2 1.2s 295
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 9.1s 495
m2-word-problem react 1 5.4s 553
m3-sql-injection react 1 3.2s 451
m4-remove-duplicates react 1 4.7s 550
m5-tool-search react 6 7.4s 3,787
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 7.7s 723
c2-auth-vulnerabilities plan-execute 4 13.1s 2,435
c3-test-suite plan-execute 4 24.3s 3,627
c4-db-decomposition react 1 5.0s 575
c5-multi-tool plan-execute 12 38.4s 8,415
c6-multi-agent plan-execute 6 17.9s 2,614
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 25 46.1s 10,717
e2-incident-response tree-of-thought 25 47.5s 10,007
e3-logic-fallacy tree-of-thought 25 43.7s 9,222
e4-crdt-design tree-of-thought 25 2.0m 9,684
e5-file-execute tree-of-thought 32 39.0s 14,728
e6-guardrail-injection react 83.76ms

openai/gpt-4o

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.6s 77 $0.0002
t2-binary-pow single-shot 2 379.60ms 78 $0.0002
t3-asimov-laws single-shot 2 2.3s 155 $0.0009
t4-json-csv single-shot 2 1.4s 100 $0.0003
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 917.65ms 184 $0.0011
s2-palindrome-bug single-shot 2 2.0s 277 $0.0020
s3-bigO single-shot 2 3.1s 376 $0.0028
s4-design-pattern single-shot 2 1.3s 190 $0.0010
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 1.7s 522 $0.0029
m2-word-problem react 1 2.9s 622 $0.0044
m3-sql-injection react 1 2.6s 455 $0.0025
m4-remove-duplicates react 1 2.0s 535 $0.0032
m5-tool-search react 6 4.4s 5,037 $0.0133
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 4/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 5.4s 791 $0.0059
c2-auth-vulnerabilities plan-execute 4 8.9s 2,910 $0.0106
c3-test-suite plan-execute 4 11.0s 3,588 $0.0152
c4-db-decomposition react 1 4.7s 789 $0.0057
c5-multi-tool plan-execute 10 18.8s 7,485 $0.0039
c6-multi-agent plan-execute 5 9.3s 1,606 $0.0024
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 5/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 21 36.1s 13,977 $0.0525
e2-incident-response tree-of-thought 25 37.4s 12,086 $0.0509
e3-logic-fallacy tree-of-thought 24 32.7s 12,651 $0.0450
e4-crdt-design tree-of-thought 24 45.5s 18,519 $0.0723
e5-file-execute tree-of-thought 33 39.1s 15,434 $0.0532
e6-guardrail-injection react 63.58ms

openai/gpt-4o-mini

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 1.0s 78 $0.0000
t2-binary-pow single-shot 2 892.35ms 79 $0.0000
t3-asimov-laws single-shot 2 2.3s 156 $0.0001
t4-json-csv single-shot 2 1.1s 101 $0.0000
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 2.8s 218 $0.0001
s2-palindrome-bug single-shot 2 3.1s 266 $0.0001
s3-bigO single-shot 2 4.1s 351 $0.0002
s4-design-pattern single-shot 2 1.8s 169 $0.0000
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 4.1s 458 $0.0002
m2-word-problem react 1 7.1s 518 $0.0002
m3-sql-injection react 1 4.5s 434 $0.0002
m4-remove-duplicates react 1 3.1s 429 $0.0001
m5-tool-search react 6 7.2s 5,215 $0.0008
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue plan-execute 4 30.9s 3,332 $0.0009
c2-auth-vulnerabilities plan-execute 4 16.9s 2,818 $0.0006
c3-test-suite plan-execute 5 36.8s 6,288 $0.0017
c4-db-decomposition plan-execute 4 29.8s 3,863 $0.0010
c5-multi-tool plan-execute 10 22.8s 6,375 $0.0002
c6-multi-agent plan-execute 13 41.7s 9,093 $0.0004
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 21 52.4s 12,967 $0.0032
e2-incident-response tree-of-thought 24 59.0s 10,593 $0.0029
e3-logic-fallacy tree-of-thought 24 54.9s 15,858 $0.0037
e4-crdt-design tree-of-thought 24 1.3m 11,373 $0.0035
e5-file-execute tree-of-thought 32 52.1s 16,625 $0.0035
e6-guardrail-injection react 1 612.18ms 185 $0.0000

ollama/gemma4:e4b

Trivial MMLU-CS · MATH baseline · AgentEval -- baseline capability checks 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
t1-js-typeof single-shot 2 4.9s 104
t2-binary-pow single-shot 2 357.93ms 105
t3-asimov-laws single-shot 2 3.4s 488
t4-json-csv single-shot 2 4.5s 125
Simple HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE -- 1-2 reasoning steps 4/4 passed
Task Strategy Status Steps Latency Tokens Cost
s1-fibonacci single-shot 2 5.1s 720
s2-palindrome-bug single-shot 2 16.3s 1,416
s3-bigO single-shot 2 11.6s 858
s4-design-pattern single-shot 2 11.1s 774
Moderate HumanEval Medium · BIG-Bench Hard · SWE-bench lite -- multi-step ReAct 5/5 passed
Task Strategy Status Steps Latency Tokens Cost
m1-merge-intervals react 1 11.0s 1,303
m2-word-problem react 1 9.4s 1,360
m3-sql-injection react 1 7.3s 1,163
m4-remove-duplicates react 1 8.3s 1,281
m5-tool-search react 5 14.9s 6,161
Complex AgentBench · SWE-bench Security · TestEval -- plan-execute analysis 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
c1-distributed-queue react 1 38.6s 2,936
c2-auth-vulnerabilities plan-execute 4 12.7s 2,871
c3-test-suite plan-execute 4 26.4s 6,486
c4-db-decomposition react 1 39.0s 3,165
c5-multi-tool plan-execute 14 25.3s 10,024
c6-multi-agent plan-execute 5 22.2s 2,157
Expert BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS -- tree-of-thought 6/6 passed
Task Strategy Status Steps Latency Tokens Cost
e1-lis-optimization tree-of-thought 18 2.1m 22,993
e2-incident-response tree-of-thought 24 3.7m 35,244
e3-logic-fallacy tree-of-thought 24 3.2m 35,953
e4-crdt-design tree-of-thought 24 3.7m 36,542
e5-file-execute tree-of-thought 32 2.2m 32,368
e6-guardrail-injection react 1 653.15ms 266

Framework Overhead

Measured with the test provider to isolate pure Effect-TS layer composition cost -- independent of LLM latency.

Measurement Avg Duration Samples
Runtime Creation 0.02ms 10
Runtime Creation Full 0.03ms 10
Complexity Classification <0.01ms 100
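
Sub-millisecond averages like those above can be sampled with a simple wall-clock loop. The sketch below is illustrative only: `createRuntime` is a placeholder workload, not the package's actual factory API.

```typescript
// Illustrative overhead measurement: average the wall-clock time of a
// factory function over N samples, independent of any LLM call.
function measureAvgMs(fn: () => unknown, samples: number): number {
  const start = performance.now();
  for (let i = 0; i < samples; i++) fn();
  return (performance.now() - start) / samples;
}

// Placeholder standing in for runtime/layer creation.
const createRuntime = () => ({ layers: [], createdAt: Date.now() });

const avg = measureAvgMs(createRuntime, 10);
console.log(`Runtime Creation: ${avg.toFixed(2)}ms over 10 samples`);
```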

Each task tier maps to a recognized benchmark standard:

Tier Strategy Aligned With
Trivial Single-shot MMLU-CS · MATH baseline · AgentEval
Simple Single-shot HumanEval Easy · BIG-Bench Hard CS · MMLU-Pro SE
Moderate ReAct (reactive) HumanEval Medium · BIG-Bench Hard · SWE-bench lite
Complex Plan-Execute-Reflect AgentBench · SWE-bench Security · TestEval
Expert Tree-of-Thought BIG-Bench Hard algorithms · GAIA Level 3 · MMLU-Pro CS
  • HumanEval (OpenAI) — 164 handcrafted code generation tasks evaluated by functional correctness. Our tasks include function implementation, algorithm design, and test generation.
  • SWE-bench (Princeton) — Resolving real GitHub issues. We use SWE-bench patterns for bug identification, security vulnerability analysis, and multi-file code review.
  • BIG-Bench Hard (Google) — 23 challenging tasks where chain-of-thought is required. We include: algorithmic optimization, logic/fallacy analysis, multi-step word problems, and Big-O complexity reasoning.
  • GAIA (Meta) — Multi-step tasks requiring tool use and reasoning. Our Level 3 equivalent task tests production incident response requiring multi-domain knowledge synthesis.
  • AgentBench (THUDM) — 8-environment agent evaluation. We use AgentBench patterns for system design, database decomposition, and migration planning tasks.
  • MMLU-Pro — Professional knowledge across 14 domains. Tasks cover CS theory (CRDTs, design patterns), software engineering, and architecture decision-making.

A task passes if the LLM’s output contains the expected pattern (case-insensitive regex). Patterns are crafted to require substantive, correct answers — they cannot be satisfied by generic responses:

SQL injection fix expected: "parameteriz|prepared|placeholder|$1|?"
CRDT design expected: "CRDT|vector.?clock|logical.?time|merge|commutative|converge"
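
A minimal sketch of this check, assuming a hypothetical `taskPasses` helper (the real evaluator lives in @reactive-agents/benchmarks). Note that the literal `$1` and `?` alternatives must be escaped for the pattern to compile as a regex:

```typescript
// Hypothetical sketch of the pass check: case-insensitive regex match
// against the model's output. `$1` and `?` are escaped so the pattern
// is valid regex syntax.
function taskPasses(output: string, expectedPattern: string): boolean {
  return new RegExp(expectedPattern, "i").test(output);
}

// SQL injection fix pattern from above, with metacharacters escaped.
const sqlInjectionPattern = "parameteriz|prepared|placeholder|\\$1|\\?";

taskPasses("Use Parameterized queries via prepared statements.", sqlInjectionPattern); // true
taskPasses("Here is some general advice about databases.", sqlInjectionPattern);       // false
```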
Running the Benchmarks
# Run with Anthropic (recommended for real-world results)
cd packages/benchmarks
bun run src/run.ts --provider anthropic --output report.json
# Run with a specific model
bun run src/run.ts --provider anthropic --model claude-opus-4-5 --output report.json
# Run only trivial + simple tiers (quick sanity check)
bun run src/run.ts --provider anthropic --tier trivial,simple
# OpenAI
bun run src/run.ts --provider openai --model gpt-4o --output report.json
# Gemini
bun run src/run.ts --provider gemini --model gemini-2.0-flash --output report.json
Flag Description Default
--provider LLM provider (anthropic, openai, gemini, ollama, litellm) test
--model Model name (uses provider default if omitted) Provider default
--tier Comma-separated tier filter All tiers
--output Path to save JSON report (none)
Provider Default Model Rationale
anthropic claude-haiku-4-5 Fast, cost-efficient, strong reasoning
openai gpt-4o-mini Cost-efficient with strong benchmark performance
gemini gemini-2.0-flash Fast inference, competitive pricing
ollama llama3.2 Local inference, no API cost

To regenerate the benchmark data shown on this page using the Anthropic provider:

cd packages/benchmarks
bun run src/run.ts --provider anthropic --output ../../apps/docs/src/data/benchmark-report.json

The page renders dynamically from the JSON report at build time — no manual table updates needed.