
Small Models Are Eating the Agent Stack

Phi-3, Mistral 7B, and Llama 3 are replacing GPT-4 in production agent pipelines. Not for everything, but for more tasks than you'd expect.

ai-agents · infrastructure · scaling

Microsoft's Phi-3 runs classification tasks at 95% of GPT-4's accuracy. At 1% of the cost. At 10x the speed. For a classification step in an agent pipeline, the choice is obvious.

The industry's obsession with frontier models is blinding teams to the gains available from small, specialized models in production.

The right model per step

A typical agent interaction has 5-10 steps. Not all steps require the same capability.

Intent classification: does this need GPT-4? No. A fine-tuned Phi-3 or even a traditional ML classifier handles this in 20ms for pennies.

Entity extraction: structured extraction from text. Llama 3 8B with a fine-tune outperforms GPT-4 on domain-specific extraction because it's trained on your data.

Tool selection: given a query and 8 tools, pick the right one. Mistral 7B handles this with 97% accuracy when the tool descriptions are clear.

Response generation: this is where the frontier model earns its keep. Complex synthesis, nuanced language, creative problem-solving. Use GPT-4 or Claude here.

Output validation: check if the response meets quality criteria. A small model or even a rule engine suffices.

Four of five steps can use a small model. Only generation needs the expensive one. Total cost drops 70-80% with negligible quality impact.
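The split above can be sketched as a simple routing table. The model names and step labels here are illustrative assumptions, not a prescription; swap in whatever your stack serves.

```python
# Minimal sketch of per-step model routing in an agent pipeline.
# Model names are placeholders for whatever your inference layer exposes.

SMALL_MODEL = "mistral-7b"    # cheap, fast: routing, extraction, validation
FRONTIER_MODEL = "gpt-4"      # expensive, capable: final synthesis

# Map each pipeline step to the cheapest model that handles it well.
STEP_ROUTES = {
    "intent_classification": SMALL_MODEL,
    "entity_extraction": SMALL_MODEL,
    "tool_selection": SMALL_MODEL,
    "response_generation": FRONTIER_MODEL,
    "output_validation": SMALL_MODEL,
}

def model_for(step: str) -> str:
    """Return the model assigned to a pipeline step."""
    return STEP_ROUTES[step]
```

The point of keeping this as declarative data rather than branching logic: when a small model starts underperforming on a step, you change one table entry, not the pipeline.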

The speed advantage

Cost savings are nice. Speed gains are transformative.

A GPT-4 call takes 800-2000ms for typical agent tasks. A Mistral 7B call on dedicated inference takes 50-200ms. When you replace 4 steps with small models, you're not just saving money; you're cutting 3-6 seconds of latency from every interaction.

Users don't notice the difference in quality between a Phi-3 classifier and a GPT-4 classifier. They absolutely notice the difference between a 2-second response and an 8-second response.
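A back-of-the-envelope check, using the midpoints of the latency ranges cited above:

```python
# Rough latency comparison for a 5-step pipeline, using the midpoints
# of the ranges above (GPT-4: 800-2000ms, Mistral 7B: 50-200ms).
FRONTIER_MS = (800 + 2000) / 2   # 1400 ms per call
SMALL_MS = (50 + 200) / 2        # 125 ms per call

all_frontier = 5 * FRONTIER_MS              # every step on GPT-4
mixed = 4 * SMALL_MS + 1 * FRONTIER_MS      # four small steps + one frontier

saved_seconds = (all_frontier - mixed) / 1000
print(f"all-frontier: {all_frontier/1000:.1f}s, "
      f"mixed: {mixed/1000:.1f}s, saved: {saved_seconds:.1f}s")
# → all-frontier: 7.0s, mixed: 1.9s, saved: 5.1s
```

Real pipelines vary, but the shape of the result holds: sequential frontier calls dominate end-to-end latency.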

The fine-tuning unlock

Small models become powerful when fine-tuned on your domain data. A generic Llama 3 8B is decent at most tasks. A Llama 3 8B fine-tuned on 10,000 examples of your specific extraction task is exceptional.

The fine-tuning cost is negligible: a few dollars on Together AI or Fireworks. Inference cost is negligible too, at fractions of a cent per call. The accuracy improvement is substantial, often exceeding GPT-4 on the specific task.
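As a sketch of what those 10,000 examples look like on the wire: most providers accept chat-style JSONL, one example per line. The schema below follows the common chat format, but the exact field names are an assumption; check your provider's docs.

```python
import json

# One training example for a domain-specific extraction fine-tune.
# The order ID / SKU task and all values here are invented for illustration.
example = {
    "messages": [
        {"role": "system", "content": "Extract the order ID and SKU as JSON."},
        {"role": "user", "content": "Order 88412, item VX-220 arrived damaged."},
        {"role": "assistant", "content": '{"order_id": "88412", "sku": "VX-220"}'},
    ]
}

# A training file is just one such object per line (JSONL).
line = json.dumps(example)
```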

The catch: you need labeled data. Creating labeled data requires either human annotation or using a frontier model to generate synthetic examples. The irony of using GPT-4 to create training data for a model that replaces GPT-4 is not lost on us.

When small models fail

Don't use small models for tasks that require broad world knowledge, complex multi-step reasoning, or creative generation. They're not dumber. They're narrower. Their strength is depth in a specific context. Their weakness is breadth across contexts.

Small models at the edges of your pipeline, frontier models at the center. The edges handle routing, classification, extraction, and validation. The center handles thinking.

If you want to cut agent costs without cutting quality, we architect multi-model pipelines that scale.