
Local Models Guide

Reactive Agents is designed to work with local models via Ollama. The model-adaptive context system automatically tunes prompts, compaction, and truncation for smaller models — but choosing the right model for your task matters.

```sh
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a recommended model
ollama pull qwen3:14b
```

```ts
import { ReactiveAgents } from "reactive-agents";

const agent = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("qwen3:14b")
  .withReasoning()
  .withTools()
  .withContextProfile({ tier: "local" })
  .build();
```
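
Once built, the agent runs like any other; the prompt below is just a smoke test:

```ts
// Run a single task and print the result.
const result = await agent.run("Summarize the tradeoffs of running models locally");
console.log(result.output);
```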

Recommended models by task:

| Task | Recommended Model | Tier | Why |
| --- | --- | --- | --- |
| Simple Q&A (no tools) | qwen3:4b | local | Fast, low memory, good for chat |
| Tool-calling tasks | qwen3:14b | local | Best native FC accuracy at this size |
| Research with web search | qwen3:14b or llama3.1:8b | local | Reliable native function calling |
| Code generation | qwen2.5-coder:14b | local | Specialized for code tasks |
| Complex reasoning | cogito:14b | local | Extended thinking mode support |
| Multi-step planning | qwen3:14b with Plan-Execute | local | Structured plan generation |

How these models compare:

| Model | Params | Context | Native FC | Instruction Following | Speed | Memory |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3:4b | 4B | 32K | Fair | Fair | Fast | ~3GB |
| llama3.1:8b | 8B | 128K | Good | Good | Medium | ~5GB |
| qwen3:8b | 8B | 32K | Good | Good | Medium | ~5GB |
| phi-4:14b | 14B | 16K | Good | Fair | Medium | ~9GB |
| qwen3:14b | 14B | 32K | Best | Best | Slower | ~9GB |
| cogito:14b | 14B | 32K | Good | Good | Slower | ~9GB |
| llama3.1:70b | 70B | 128K | Excellent | Excellent | Slow | ~40GB |

Legend:

  • Native FC: How reliably the model generates valid native function call (tool_use) responses
  • Instruction Following: How well the model follows system prompt instructions and multi-step tasks
  • Speed: Tokens per second on typical hardware (relative)
  • Memory: Approximate VRAM/RAM required

Always set the context profile to match your model:

```ts
// Small models (<=8B params)
.withContextProfile({ tier: "local" })
// → Lean prompts, aggressive compaction after 6 steps, 800-char truncation

// Medium models (8B-30B params)
.withContextProfile({ tier: "mid" })
// → Balanced prompts, moderate compaction

// Large cloud models
.withContextProfile({ tier: "large" })
// → Full context, standard compaction

// Frontier models (Claude Opus, GPT-4, Gemini Pro)
.withContextProfile({ tier: "frontier" })
// → Maximum context, minimal compaction
```

Important: If you skip .withContextProfile(), the framework uses "large" tier defaults — which wastes tokens and confuses smaller models with verbose prompts.
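
If you switch models often, a small helper can keep the tier in sync with the model size. This is a sketch, not part of the reactive-agents API; the cutoffs mirror the tier comments above, and "frontier" is reserved for cloud frontier models rather than anything you would pick by parameter count:

```ts
import { ReactiveAgents } from "reactive-agents";

// Hypothetical helper: map a model's parameter count (in billions)
// to the context-profile tiers described above.
type Tier = "local" | "mid" | "large" | "frontier";

function tierForParams(billions: number): Tier {
  if (billions <= 8) return "local"; // small models (<=8B)
  if (billions <= 30) return "mid"; // medium models (8B-30B)
  return "large"; // big local or cloud models
}

const agent = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("llama3.1:8b")
  .withContextProfile({ tier: tierForParams(8) }) // → "local"
  .build();
```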

Not all reasoning strategies work well on small models:

| Strategy | <=8B | 14B | 70B | Notes |
| --- | --- | --- | --- | --- |
| ReAct | Good | Best | Best | Most reliable for local models |
| Reflexion | Poor | Fair | Good | Self-critique requires model quality |
| Plan-Execute | Poor | Fair | Good | Structured plan generation is fragile on small models |
| Tree-of-Thought | Poor | Poor | Fair | BFS scoring unreliable below 70B |
| Adaptive | Fair | Good | Best | Falls back to ReAct on small models (good) |

Recommendation: Use "reactive" (ReAct) as the default strategy for all local models. Only use "adaptive" if you’re running 14B+ and want automatic strategy selection.
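
For example (assuming the quoted strategy names map directly onto defaultStrategy values, as in the complete example at the end of this guide):

```ts
// 8B model: pin ReAct explicitly.
const small = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("llama3.1:8b")
  .withReasoning({ defaultStrategy: "reactive" })
  .withContextProfile({ tier: "local" })
  .build();

// 14B model: opt in to automatic strategy selection.
const larger = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("qwen3:14b")
  .withReasoning({ defaultStrategy: "adaptive" })
  .withContextProfile({ tier: "local" })
  .build();
```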

For multi-step local runs, ICS classifies task phase and injects a short synthesized thread instead of dumping raw history. Enable and tune it via .withReasoning({ synthesis: …, strategies: { … } }) — see Intelligent Context Synthesis.
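
A rough sketch of that wiring; the synthesis and strategies keys come from the paragraph above, but their inner shapes are placeholders here, so check the Intelligent Context Synthesis page for the real fields:

```ts
const agent = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("qwen3:14b")
  .withReasoning({
    defaultStrategy: "reactive",
    synthesis: { /* ICS tuning; see the ICS guide for the option shape */ },
    strategies: { /* per-phase overrides; see the ICS guide */ },
  })
  .withContextProfile({ tier: "local" })
  .build();
```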

Troubleshooting

1. Hallucinated tool calls

Symptom: Agent calls tools that don’t exist or uses wrong parameter names. Fix: Use .withContextProfile({ tier: "local" }) and keep the tool count low (3-5 tools max). Use .withTools({ include: [...] }) to limit visible tools, as in the sketch below.
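
A minimal sketch, reusing the tool names from the complete example at the end of this guide:

```ts
// Expose only the tools the task actually needs (3-5 max for local models).
const agent = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("qwen3:14b")
  .withTools({ include: ["web-search", "file-read", "file-write"] })
  .withContextProfile({ tier: "local" })
  .build();
```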

2. Agent loops on the same action

Symptom: Agent repeats the same action or thought. Fix: The circuit breaker will catch this, but you can reduce iterations with .withMaxIterations(5), as shown below. Consider simpler prompts.
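
For example, capping the loop at the five iterations suggested above:

```ts
// Fail fast instead of letting a small model spin.
const agent = await ReactiveAgents.create()
  .withProvider("ollama")
  .withModel("qwen3:8b")
  .withMaxIterations(5)
  .withContextProfile({ tier: "local" })
  .build();
```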

3. Native function calling not supported or unreliable

Symptom: Agent fails to invoke tools or returns malformed tool call responses. Fix: Switch to a model with better native FC support (qwen3:14b > llama3.1:8b for this). The framework uses native function calling (tool_use blocks) for all providers — the model must support the Ollama tool calling API. The local context profile uses simplified tool schemas to reduce parsing burden on smaller models.

4. Out of memory

Symptom: Ollama crashes or becomes unresponsive. Fix: Use a smaller model or pull a quantized variant: ollama pull qwen3:14b-q4_K_M. The q4 quantization uses ~60% less memory with minimal quality loss.
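
For example:

```sh
# Pull the 4-bit quantized build instead of the full-precision one.
ollama pull qwen3:14b-q4_K_M
```

Then reference the quantized tag in the builder with .withModel("qwen3:14b-q4_K_M"); the model name must match the tag you pulled.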

5. Sub-agents hallucinate or loop

Symptom: Spawned sub-agents hallucinate or loop. Fix: Known limitation — small models struggle with sub-agent tasks. Disable dynamic sub-agents (omit .withDynamicSubAgents()) for local models. Use static sub-agents with explicit task descriptions instead.

Rough cost comparison, local vs. cloud:

| Setup | Monthly Cost | Latency | Quality |
| --- | --- | --- | --- |
| Ollama + qwen3:14b (local) | $0 (electricity only) | 1-5s/response | Good for most tasks |
| Anthropic claude-haiku | ~$5-15/month | 0.5-2s | Better quality |
| Anthropic claude-sonnet | ~$15-50/month | 1-3s | Best quality |
| Ollama + llama3.1:70b (beefy local) | $0 | 3-10s | Near cloud quality |

A complete example, putting it all together:

```ts
import { ReactiveAgents } from "reactive-agents";

const agent = await ReactiveAgents.create()
  .withName("local-researcher")
  .withProvider("ollama")
  .withModel("qwen3:14b")
  .withReasoning({ defaultStrategy: "reactive" })
  .withTools({ include: ["web-search", "file-read", "file-write"] })
  .withContextProfile({ tier: "local" })
  .withMaxIterations(8)
  .withMemory()
  .withObservability({ verbosity: "normal" })
  .build();

const result = await agent.run("Research TypeScript testing frameworks and write a summary");
console.log(result.output);
console.log(result.metadata); // { duration, cost: 0, tokensUsed, stepsCount }
```