
Local Model Performance

Reactive Agents automatically adapts its behavior based on the model tier. Local models (Ollama, LiteLLM) have different entropy distributions, latency profiles, and capability envelopes compared to frontier models (OpenAI, Anthropic, Google). The framework accounts for these differences at every level.

The tier is inferred from the provider configuration:

| Provider  | Tier     | Detection     |
|-----------|----------|---------------|
| Ollama    | local    | Provider name |
| LiteLLM   | local    | Provider name |
| OpenAI    | frontier | Provider name |
| Anthropic | frontier | Provider name |
| Google    | frontier | Provider name |
| Groq      | frontier | Provider name |
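
In sketch form, the inference is a lookup on the provider name. The function and set names below are illustrative, not the framework's API:

```ts
// Illustrative sketch of provider-name tier inference; the provider
// list mirrors the table above, the names are hypothetical.
type Tier = "local" | "frontier";

const LOCAL_PROVIDERS = new Set(["ollama", "litellm"]);

function inferTier(providerName: string): Tier {
  return LOCAL_PROVIDERS.has(providerName.toLowerCase()) ? "local" : "frontier";
}
```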

The tier affects entropy scoring weights, controller thresholds, and meta-tool behavior.

Local models exhibit higher baseline entropy and wider score distributions. The conformal calibration system accounts for this:

  • Uncalibrated defaults use conservative thresholds (convergence: 0.4, high-entropy: 0.8) suitable for both tiers.
  • Calibrated thresholds adapt automatically after 20+ scored iterations. Local models typically produce higher thresholds (convergence: ~0.5, high-entropy: ~0.85) reflecting their noisier output.

Calibration accumulates automatically during normal agent use. Each entropy score is recorded, and the thresholds are recomputed via conformal quantiles:

  • High-entropy threshold: 90th percentile of historical scores
  • Convergence threshold: 70th percentile (looser bound)
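
A minimal sketch of that recomputation, assuming the score history is kept as a plain array (the real framework's storage and option names may differ):

```ts
// Minimal sketch of conformal threshold recomputation; assumes scores
// are a plain array, which simplifies the real storage.
function quantile(sorted: number[], q: number): number {
  const idx = Math.ceil(q * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

function recomputeThresholds(scores: number[]) {
  if (scores.length < 20) {
    // Not yet calibrated: fall back to the conservative defaults.
    return { convergence: 0.4, highEntropy: 0.8 };
  }
  const sorted = [...scores].sort((a, b) => a - b);
  return {
    convergence: quantile(sorted, 0.7), // 70th percentile (looser bound)
    highEntropy: quantile(sorted, 0.9), // 90th percentile
  };
}
```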

To persist calibration across runs, provide a database path:

```ts
.withReactiveIntelligence({
  calibrationDbPath: "./data/calibration.sqlite",
})
```

When a model’s behavior shifts (e.g., after updating model weights), the system detects calibration drift:

```ts
eventBus.subscribe("CalibrationDrift", (event) => {
  // event.modelId, event.expectedMean, event.observedMean, event.deviationSigma
  console.warn(
    `Calibration drift on ${event.modelId}: consider resetting calibration data`,
  );
});
```
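
Conceptually, drift is a deviation of the observed mean score from the calibrated baseline, measured in standard deviations. The helper and the 3-sigma cutoff below are assumptions for illustration, not documented framework values:

```ts
// Illustrative only: flag drift when the observed mean of recent scores
// deviates from the calibrated mean by more than ~3 standard deviations.
function hasDrifted(expectedMean: number, expectedStd: number, observedMean: number): boolean {
  const deviationSigma = Math.abs(observedMean - expectedMean) / expectedStd;
  return deviationSigma > 3; // assumed cutoff
}
```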

The reactive controller adapts its strategy based on the model tier:

| Aspect                         | Local                           | Frontier |
|--------------------------------|---------------------------------|----------|
| Min iterations before stopping | Higher (models need more steps) | Lower    |
| Convergence threshold          | Higher (noisier output)         | Lower    |
| Confidence required            | Medium                          | High     |
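
Conceptually, the tier swaps in a different set of controller defaults. In the sketch below, the convergence numbers echo the calibration section above, while the iteration counts and confidence labels are illustrative assumptions:

```ts
// Hypothetical tier-dependent controller defaults mirroring the table;
// the specific numbers are illustrative, not documented framework values.
function controllerDefaults(tier: "local" | "frontier") {
  return tier === "local"
    ? { minIterations: 4, convergenceThreshold: 0.5, requiredConfidence: "medium" }
    : { minIterations: 2, convergenceThreshold: 0.4, requiredConfidence: "high" };
}
```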

Local models typically have smaller context windows (4K–32K vs 128K–200K). The context pressure sensor triggers compression earlier:

| Aspect                    | Local                  | Frontier               |
|---------------------------|------------------------|------------------------|
| Compression trigger       | ~60% utilization       | ~80% utilization       |
| Auto-checkpoint threshold | 0.75 soft / 0.80 hard  | 0.80 soft / 0.85 hard  |
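
As an illustration, the trigger reduces to a tier-dependent utilization ratio. The function shape is hypothetical; the ratios mirror the table above:

```ts
// Illustrative: local models compress at ~60% utilization, frontier at ~80%.
function shouldCompress(
  tokensUsed: number,
  contextWindow: number,
  tier: "local" | "frontier",
): boolean {
  const trigger = tier === "local" ? 0.6 : 0.8;
  return tokensUsed / contextWindow >= trigger;
}
```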

When the entropy trajectory is flat (no improvement), the controller may recommend switching strategies. Local models get more patience before a switch is triggered.
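
A rough sketch of that patience logic; the window sizes and improvement epsilon below are assumptions, not the framework's actual values:

```ts
// Illustrative flat-trajectory check with tier-based patience.
function isTrajectoryFlat(entropyHistory: number[], tier: "local" | "frontier"): boolean {
  const patience = tier === "local" ? 5 : 3; // local models get more iterations
  if (entropyHistory.length < patience) return false;
  const window = entropyHistory.slice(-patience);
  const improvement = window[0] - window[window.length - 1];
  return improvement < 0.01; // entropy has not meaningfully decreased
}
```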

```ts
.withReactiveIntelligence({
  controller: {
    earlyStop: true,           // Critical for local models; saves 30-50% of iterations
    contextCompression: true,  // Prevents context overflow on small-window models
  },
})
```

Local models work best with:

  • reactive (default) — single-pass tool calling with entropy monitoring
  • plan-execute — explicit planning for complex multi-step tasks

More sophisticated strategies (e.g., tree-of-thought) may underperform on local models due to increased token overhead.

| Model              | Context  | Logprob Support | Notes                                               |
|--------------------|----------|-----------------|-----------------------------------------------------|
| Ollama (Llama 3.x) | 8K–128K  | Yes             | Good all-around; enable token entropy               |
| Ollama (Mistral)   | 32K      | Yes             | Strong at structured output; lower entropy variance |
| Ollama (Cogito)    | 8K–32K   | Yes             | Reasoning-focused; benefits from early-stop         |
| Ollama (Gemma)     | 8K       | Partial         | Smaller context needs aggressive compression        |

The harness automatically detects whether a model supports native function calling. When unavailable, it falls back to text-based JSON tool call parsing. This is transparent to the agent but affects latency:

  • Native FC (supported models): Direct tool calls via provider API — lower latency, more reliable
  • Text FC fallback: Tool calls parsed from LLM text output — higher latency, may need retry
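
For illustration, a minimal text-fallback parser might look like the sketch below. The `{ tool, arguments }` shape and the function name are assumptions, not the framework's actual wire format:

```ts
// Illustrative fallback parser: extract a JSON tool call from raw
// model text output.
interface ParsedToolCall {
  tool: string;
  arguments: Record<string, unknown>;
}

function parseTextToolCall(output: string): ParsedToolCall | null {
  // Grab the outermost JSON object in the model's text output.
  const match = output.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    if (typeof parsed.tool === "string" && parsed.arguments && typeof parsed.arguments === "object") {
      return parsed as ParsedToolCall;
    }
  } catch {
    // Malformed JSON; the caller can retry with a corrective prompt.
  }
  return null;
}
```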