Context engineering for SMBs: how to lower AI costs without losing quality
Practical techniques to reduce inference cost in AI projects: efficient code search, model routing, desktop automation, and reusable context patterns. Data from Semble, agent-desktop, and hybrid approaches.
As AI projects move from pilot to daily operations, the bottleneck shifts from model quality to cost per session. In recent weeks, several tools have documented notable reductions: Semble reports up to 98% fewer tokens in code search versus grep+read, agent-desktop reports 78-96% fewer tokens in desktop automation, and hybrid approaches like DeepClaude promise up to 17x cost reduction on low-risk tasks.
For an SMB, the practical question is not which model is most powerful. It is which combination keeps acceptable quality at a sustainable cost.
Short answer
Inference cost drops through context engineering: give the model only what is needed, in the most compact format possible, and delegate to a cheaper model what does not require the expensive one. Quality holds if routing and reduction decisions are backed by evals and observability.
Where money goes
| Cost source | Example |
|---|---|
| Full file reads | Agent doing grep + opening 20 whole files to find a function |
| Full app tree dumps | Desktop automation serializing the whole UI on every step |
| Long conversations without summaries | Sessions dragging irrelevant history turn after turn |
| Premium models for mechanical tasks | Mass refactor run with the most expensive model available |
| Retries without policy | Loops re-calling the model on every trivial failure |
Each one has a different optimization lever.
Techniques with measurable impact
| Technique | Reported saving | When it applies |
|---|---|---|
| Embedding + BM25 retrieval (Semble-style) | Up to 98% in code search | Code search in mid-to-large repos |
| Accessibility tree snapshot (agent-desktop-style) | 78-96% in desktop automation | Slack, Notion, VS Code, and similar control |
| Incremental context summary | 30-60% in long sessions | Any assistant with turn memory |
| Difficulty-based model routing | Up to 17x depending on mix | Heterogeneous tasks with quality margin |
| Response caching | Variable | Repeated prompts or stable templates |
| Versioned Skills and AGENTS.md | Indirect but high | Teams with reusable context |
These numbers are indicative. Real savings depend on stack, problem nature, and team discipline.
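To make the first row concrete, here is a minimal hybrid-retrieval sketch in Python, in the spirit of Semble's Model2Vec + BM25 + RRF stack: BM25 gives a lexical ranking, embeddings give a semantic one, and Reciprocal Rank Fusion merges both. The `rank_bm25` library is real; `embed()` is a toy stand-in to keep the sketch runnable and should be replaced with an actual embedding model.

```python
# Minimal hybrid code-search sketch: BM25 + embedding ranks fused with
# Reciprocal Rank Fusion (RRF).
from rank_bm25 import BM25Okapi
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in: bag-of-words hashing into a small fixed vector.
    # Replace with a real embedding model (e.g. Model2Vec) in production.
    v = np.zeros(256)
    for tok in text.split():
        v[hash(tok) % 256] += 1.0
    return v

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several ranked lists of chunk ids into one ranking."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def search(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Lexical ranking with BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    bm25_rank = list(np.argsort(bm25.get_scores(query.split()))[::-1])
    # Semantic ranking by cosine similarity of embeddings.
    q = embed(query)
    vecs = np.array([embed(c) for c in chunks])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    emb_rank = list(np.argsort(sims)[::-1])
    # Return only the top-k fused fragments, not whole files.
    return [chunks[i] for i in rrf([bm25_rank, emb_rank])[:top_k]]
```

The agent then receives only the top-k fragments instead of 20 whole files, which is where the token reduction comes from.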
Routing map
```mermaid
flowchart LR
    A[Incoming task] --> B{High risk or quality?}
    B -- Yes --> C[Premium model]
    B -- No --> D{Templated or repetitive?}
    D -- Yes --> E[Cheap model + cache]
    D -- No --> F[Mid-tier model]
    C --> G[Human validation]
    E --> H[Automated validation]
    F --> H
```
The goal is to send each task to the cheapest model that meets the quality bar. This is not technical dogma; it is a cost decision.
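A minimal sketch of that routing policy, assuming two boolean flags per task; the model names and validation labels are illustrative, not a prescribed setup:

```python
# Sketch of the routing map above. Model names and flags are
# illustrative assumptions; tune them against your own evals.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    high_risk: bool      # e.g. customer-facing or legal content
    repetitive: bool     # e.g. templated weekly refactor

def route(task: Task) -> tuple[str, str]:
    """Return (model, validation) for the cheapest acceptable path."""
    if task.high_risk:
        return "premium-model", "human-review"
    if task.repetitive:
        return "cheap-model+cache", "automated-eval"
    return "mid-tier-model", "automated-eval"
```

The point is that the policy is explicit and auditable: anyone can read why a task went to the premium model.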
Reducing context without losing quality
| Pattern | What it does | When to avoid or risk |
|---|---|---|
| Top-k retrieval | Returns only the most relevant fragments | When the answer needs global vision |
| Rolling summary | Compresses old turns | When key references from the start are lost |
| Strict schemas | Bounds output to what is needed | Exploratory tasks |
| Versioned Skills | Reuses repo conventions | Without discipline, they become stale |
| Explicit memory | Stores project decisions | Without review, fills with noise |
Common principle: every token sent to the model must justify its existence. Anything that does not contribute to the answer goes.
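As an example of the rolling-summary row, a minimal sketch: keep the last N turns verbatim and collapse everything older into a single summary turn. `summarize()` is a placeholder for a call to a cheap model, and the 10-turn window is an illustrative assumption.

```python
# Rolling-summary sketch: keep recent turns verbatim, compress the rest.
def summarize(turns: list[str]) -> str:
    # Placeholder: in practice, call a cheap model with a
    # "summarize these turns, keep decisions and references" prompt.
    return "Summary of earlier turns: " + " | ".join(t[:40] for t in turns)

def compress_history(history: list[str], keep_last: int = 10) -> list[str]:
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    # One compact summary turn replaces all older turns.
    return [summarize(old)] + recent
```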
Typical SMB cases
| Context | Before | After |
|---|---|---|
| Internal RAG assistant | Each query does global grep | Embedding + BM25 indexing |
| Slack and Notion automation | Screenshot and OCR | Accessibility tree snapshot |
| Weekly mechanical refactor | Premium model with long prompts | Mid-tier model with template and diff |
| Report generation | Long chain without cache | Fixed template + caching per type |
| Customer support | Conversation without compression | Rolling summary every 10 turns |
The sum of small optimizations usually pays more than replacing the main model.
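The report-generation row is a good example of how little code caching needs. A minimal sketch, where the daily TTL is an illustrative invalidation policy (the hard rules below ask you to make yours explicit):

```python
# Response-cache sketch for templated reports: identical template +
# inputs hit the cache instead of the model.
import hashlib, time

CACHE: dict[str, tuple[float, str]] = {}
TTL_SECONDS = 24 * 3600  # illustrative invalidation policy: expire daily

def cached_generate(report_type: str, inputs: str, call_model) -> str:
    key = hashlib.sha256(f"{report_type}:{inputs}".encode()).hexdigest()
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                 # cache hit: zero inference cost
    answer = call_model(report_type, inputs)
    CACHE[key] = (time.time(), answer)
    return answer
```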
Metrics that matter
| Metric | What it indicates | How to measure |
|---|---|---|
| Cost per closed task | Real efficiency | Total cost / completed and accepted tasks |
| Cost per useful turn | Spots wasteful sessions | Cost / messages with delivered value |
| Acceptance rate | Perceived quality | % of responses used without rework |
| Average latency | User experience | Time from request to final answer |
| Fallback rate | Routing stability | % of tasks escalated to a more expensive model |
Without these metrics, cost optimization becomes a feeling. With them, it becomes a decision.
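A minimal sketch of how some of these could be computed from a usage log; the log schema (`cost_usd`, `accepted`, `escalated` per entry) is an assumption:

```python
# Cost and quality metrics from a simple per-task usage log.
def cost_metrics(log: list[dict]) -> dict[str, float]:
    total_cost = sum(e["cost_usd"] for e in log)
    accepted = sum(1 for e in log if e["accepted"])    # used without rework
    escalated = sum(1 for e in log if e["escalated"])  # fallback to pricier model
    return {
        "cost_per_closed_task": total_cost / max(accepted, 1),
        "acceptance_rate": accepted / max(len(log), 1),
        "fallback_rate": escalated / max(len(log), 1),
    }
```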
Common mistakes
- Switching models before reducing context.
- Aggressive caching without clear invalidation.
- Automatic routing without evals to catch regressions.
- Confusing “fewer tokens” with “less quality” without measuring it.
- Reducing context for tasks that need global vision.
- Optimizing the main model and ignoring noise in long sessions.
Hard rules to keep quality
- Every saving technique passes an eval before rollout (see the sketch after this list).
- Routing decisions are auditable.
- Context compression is tested against real edge cases.
- The cache has an explicit invalidation policy.
- Critical tasks may jump to the premium model when risk justifies it.
- Observability covers cost, latency, and quality equally.
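A minimal eval-gate sketch for the first rule, assuming exact-match test cases and a 2-point tolerance; both are illustrative choices:

```python
# Eval gate: a saving technique ships only if it keeps quality within
# a tolerance of the baseline acceptance rate.
def passes_eval(run_case, cases: list[dict], baseline: float,
                tolerance: float = 0.02) -> bool:
    passed = sum(1 for c in cases if run_case(c["input"]) == c["expected"])
    score = passed / max(len(cases), 1)
    # Block rollout if quality drops more than `tolerance` vs baseline.
    return score >= baseline - tolerance
```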
Progress indicators
| Indicator | Good | Bad |
|---|---|---|
| Measured token reduction | Before/after data per flow | “It feels cheaper” |
| Routing | Explicit, reviewed rules | Inherited config without review |
| Evals | Suite running on every change | Sporadic manual checks |
| Latency | Within agreed business SLA | Variable, not measured |
| Cost per task | Sustained downward trend | Only the monthly bill is checked |
Final criterion
Inference cost is an engineering problem, not a model problem. When an SMB measures well and reduces context with discipline, the bill drops without sacrificing quality. The difference between “expensive” AI and sustainable AI lies in who decides what reaches each model and why.
Working sources
- Public documentation of Semble on Model2Vec, BM25, and RRF indexing.
- Public documentation of agent-desktop on accessibility tree snapshots.
- Best practices for retrieval and prompt engineering in production LLM projects.

Technical decisions must be adapted to each company’s stack, criticality, and volume.