CFO Brief

How to Reduce LLM Token Costs at Enterprise Scale

Most teams reach for observability and caching. The durable answer is precision: don't measure the waste, eliminate the cause.

By Gilad Salinger · CEO & Co-Founder, Naboo · 8 min read

The thesis in one paragraph

Most enterprise LLM bills are growing faster than usage because each agent question retrieves dozens of speculative document chunks and fills the context window with mostly irrelevant context, and the model then burns tokens reasoning over the noise. Observability tools (Helicone, Langfuse, LiteLLM) make the bill visible. Caching tools handle the repetitive slice. Smaller models trade accuracy for cost. None of these changes the underlying token volume on real R&D queries. The durable answer is a Reasoning Layer: return one structured answer per query, and the agent stops the speculative loop that consumes most of the tokens. Precision is upstream of cost.

The founder confession

"Guys, we're losing too much money on OpenAI - $8,000 today alone. I need visibility on the money spent ASAP."
Gilad Salinger, CEO & Co-Founder, Naboo - in our own internal Slack, June 2026.

We built Naboo's cost-discipline tooling in the next sprint. What runs on the site today is what shipped. The underlying point: cost-cap pain is universal, even at the companies building the cost-cap tooling.

The four approaches, compared

LLM observability

Helicone, Langfuse, LiteLLM, Portkey

What it does
Logs every LLM call, attributes spend per team / project / prompt, lets you set budget alerts and rate limits.
When to use
When you need to see where the money goes, attribute spend to teams, and enforce a per-seat or per-project cap.
Where it doesn't help
Measures the waste. Does not change the underlying token volume - it just shows you the bill in higher resolution.
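
To make the attribution-and-cap idea concrete, here is a minimal sketch of per-team spend tracking with a budget check. It is not how Helicone, Langfuse, LiteLLM, or Portkey implement it; the `LLMCall` record, the `PRICE_PER_1K` table, and the budget figures are all hypothetical values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; real prices vary by vendor and model.
PRICE_PER_1K = {"frontier": {"input": 0.01, "output": 0.03}}


@dataclass
class LLMCall:
    team: str
    model: str
    input_tokens: int
    output_tokens: int


def call_cost(call: LLMCall) -> float:
    """Cost of one logged call under the assumed price table."""
    price = PRICE_PER_1K[call.model]
    return (call.input_tokens / 1000) * price["input"] + (
        call.output_tokens / 1000
    ) * price["output"]


def spend_by_team(calls):
    """Attribute total spend to each team."""
    totals = defaultdict(float)
    for call in calls:
        totals[call.team] += call_cost(call)
    return dict(totals)


def teams_over_budget(totals, budgets):
    """Return the teams whose spend exceeds their cap (no cap = never flagged)."""
    return sorted(
        team for team, spent in totals.items()
        if spent > budgets.get(team, float("inf"))
    )


# Illustrative log entries, not real data.
calls = [
    LLMCall("payments", "frontier", 40_000, 2_000),
    LLMCall("payments", "frontier", 60_000, 1_000),
    LLMCall("search", "frontier", 5_000, 500),
]
totals = spend_by_team(calls)
over = teams_over_budget(totals, {"payments": 1.00, "search": 1.00})
print(totals, over)
```

Note what the sketch does and does not do: it tells finance which team crossed the cap, but the 100,000 input tokens the payments team sent are still sent.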

Prompt and response caching

Helicone caching, vendor caches, ad-hoc Redis layers

What it does
Caches identical or semantically similar prompt/response pairs so repeated queries don't hit the model again.
When to use
When a meaningful fraction of your traffic is repetitive (FAQ-style support, common code lookups).
Where it doesn't help
Enterprise R&D queries are rarely repetitive - each agent question is shaped by the live state of the codebase, tickets, and ownership. Cache hit rates are usually low.
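
The hit-rate problem can be illustrated with a small exact-match cache. This is a sketch under stated assumptions, not any vendor's caching layer: `PromptCache`, `fake_model`, and the `state_version` argument (standing in for "live state of the codebase", for example a commit hash) are hypothetical, and semantic-similarity matching is left out entirely.

```python
import hashlib


class PromptCache:
    """Exact-match cache keyed on the normalized prompt plus a state version."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt, state_version):
        # Normalize case and whitespace so trivially different prompts still match.
        normalized = " ".join(prompt.lower().split())
        raw = f"{state_version}|{normalized}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_call(self, prompt, state_version, call_model):
        key = self._key(prompt, state_version)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = call_model(prompt)
        self._store[key] = answer
        return answer

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_model(prompt):
    """Stand-in for a real LLM call."""
    return f"answer to: {prompt}"


# FAQ-style traffic: same question, nothing underneath it changes.
faq = PromptCache()
for _ in range(4):
    faq.get_or_call("How do I reset my password?", "static", fake_model)

# R&D-style traffic: same words, but the codebase moved between each ask.
rnd = PromptCache()
for commit in ["a1", "b2", "c3", "d4"]:
    rnd.get_or_call("What's blocking checkout v2?", commit, fake_model)

print(faq.hit_rate, rnd.hit_rate)
```

In this toy run the FAQ cache serves three of four requests from memory, while the R&D cache misses every time because the correct answer depends on state that changed between requests.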

Smaller models or distillation

GPT-5 Nano, Claude Haiku 4.5, fine-tuned 7B open-source models

What it does
Run cheaper / smaller models for queries that don't need a frontier model.
When to use
When you can confidently route by question type and accept slightly lower quality on routed queries.
Where it doesn't help
Smaller models hallucinate more on enterprise-specific queries because they have less internal knowledge to fall back on. Cost drops; accuracy can drop faster.

Reasoning Layer (precision)

Naboo

What it does
Returns one structured answer per query instead of dozens of speculative retrievals over fragmented documents. The agent gets the right context the first time and stops grinding tokens looking for it.
When to use
When the question is 'how do I let my engineers use AI freely without procurement throttling them.' Token reduction is a side effect of correctness, not the goal.
Where it doesn't help
Requires a Forward Deployed Agent engagement (two to four weeks) to encode the customer's hidden language into the graph. Not a SaaS dashboard.

The Reasoning Layer angle

A Reasoning Layer like Naboo cuts token use at the source by returning a structured answer instead of a pile of candidate documents. The agent asks "what's blocking checkout v2" and gets back the four open decisions, who owns each, and the supporting evidence - already joined, ranked, and RBAC-enforced. There is no retrieve-rerank-summarize loop because there is nothing to summarize.
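
For readers who want to picture what "already joined and ranked, RBAC-enforced" could look like as data, here is a hypothetical sketch. It is not Naboo's schema or API: the `Decision` and `StructuredAnswer` types, the `answer_for` helper, the role names, and the sample records are all invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Decision:
    summary: str                   # the open decision
    owner: str                     # who owns it
    evidence: List[str]            # pointers to supporting evidence
    rank: int                      # 1 = most relevant
    allowed_roles: FrozenSet[str]  # roles permitted to see this decision


@dataclass
class StructuredAnswer:
    question: str
    decisions: List[Decision]


def answer_for(question, candidates, role):
    """Filter by role first, then return the visible decisions in rank order."""
    visible = [d for d in candidates if role in d.allowed_roles]
    return StructuredAnswer(question, sorted(visible, key=lambda d: d.rank))


# Placeholder records; owners and ticket IDs are made up.
candidates = [
    Decision("Pick a tax provider", "owner-b", ["TICKET-2"], 2,
             frozenset({"engineer", "finance"})),
    Decision("Approve the new payment flow", "owner-a", ["TICKET-1"], 1,
             frozenset({"engineer"})),
    Decision("Sign the vendor contract", "owner-c", ["DOC-7"], 3,
             frozenset({"finance"})),
]

answer = answer_for("what's blocking checkout v2", candidates, role="engineer")
for d in answer.decisions:
    print(d.rank, d.owner, d.summary)
```

The point of the shape is that the agent receives a short, ordered, permission-filtered list rather than raw chunks it would have to rerank and summarize itself.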

In a Global-e (NASDAQ: GLBE) head-to-head benchmark, agents grounded in Naboo's Decision Graph won 97 of 100 queries against MCP-enabled GPT-4.1 on real user questions, with fewer tokens per interaction and faster responses. The Forward Deployed Agent ships the graph end-to-end in two to four weeks.

The honest framing: token volume drops because the agent stops fanning out across speculative retrievals - the magnitude depends on how fragmented your existing retrieval pipeline is. We don't publish a single multiplier number because the variance across customers is too high to be defensible. The pattern is consistent; the numbers are workload-specific.

FAQ

Why are enterprise LLM bills exploding faster than usage?

Because per-query token consumption is rising on top of query volume. Each agent question retrieves dozens of speculative document chunks and fills the context window with mostly irrelevant context, and the model then burns tokens reasoning over noise. The pattern compounds: agents that miss the right context generate longer responses, and longer responses chain into more follow-up queries. The bill grows faster than linearly with adoption.
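
The compounding described above can be made concrete with a toy model. Every number here is an assumption chosen for illustration (queries per user, base tokens, the per-user growth in tokens per query, and the price), and the linear growth of tokens per query is itself a simplification; how steep the real curve is depends on the workload.

```python
def monthly_bill(
    users,
    queries_per_user=100,      # assumed monthly queries per user
    base_tokens=2_000,         # assumed tokens per query at zero adoption
    extra_tokens_per_user=20,  # assumed growth in tokens per query as adoption rises
    price_per_m=10.0,          # assumed USD per million tokens
):
    """Bill = query volume x tokens per query x price, with both factors growing."""
    queries = users * queries_per_user
    tokens_per_query = base_tokens + extra_tokens_per_user * users
    return queries * tokens_per_query * price_per_m / 1_000_000


bill_100 = monthly_bill(100)
bill_200 = monthly_bill(200)
print(bill_100, bill_200, bill_200 / bill_100)
```

Under these assumptions, doubling users from 100 to 200 triples the bill from $400 to $1,200, because query volume and tokens per query both rose.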

Doesn't more caching solve this?

Caching is great for repetitive question patterns - customer support FAQs, common code lookups, well-known how-to questions. It does almost nothing for enterprise R&D queries, which are shaped by the live state of the codebase, the ticket backlog, the deploy state, and the on-call rotation. Two engineers asking the 'same' question get different answers because the company has changed. Cache hit rates in real R&D environments tend to be under 10%.

What's the precision argument vs the metering argument?

Metering tools (Helicone, Langfuse, LiteLLM) make the bill visible and attributable. They are necessary for finance to manage spend, but they do not change the underlying token volume. A Reasoning Layer cuts token use at the source: by returning one structured answer to each agent query, the agent stops the speculative retrieve-rerank-summarize loop that consumes most of the tokens. Metering is downstream of the cost; precision is upstream.

How much can we actually save?

Depends on the workload. In Global-e's deployment, Naboo agents grounded in a Decision Graph used noticeably fewer tokens per interaction and responded faster than the MCP-enabled GPT-4.1 baseline. The honest framing: token volume drops because the agent stops fanning out across speculative retrievals - the magnitude depends on how fragmented your existing retrieval pipeline is. We don't publish a single multiplier number because the variance across customers is too high to be defensible.

Can we run Naboo alongside our existing observability stack?

Yes. Naboo handles context delivery, and your observability stack (Helicone, Langfuse, or anything OpenTelemetry-based) keeps logging the LLM calls Naboo makes on behalf of agents. The two are complementary: Naboo reduces the volume, and observability tracks what's left. Customers typically keep their existing observability and just see a flatter curve in the dashboard.

What about smaller models or distillation?

Smaller models work well for routing-friendly workloads - if 60% of your queries are classification, run those through a 7B model and reserve frontier models for the hard ones. But smaller models hallucinate more on enterprise-specific questions because they have less internal knowledge to fall back on. Pairing a smaller model with a Reasoning Layer is usually stronger than either alone: the smaller model gets the right context handed to it and matches frontier-model quality on the queries that count.
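
The 60% routing example above can be turned into a rough cost estimate. This is a back-of-the-envelope sketch, not a pricing model: the query count, tokens per query, and both per-million-token prices are hypothetical, and it deliberately ignores the accuracy side of the trade-off.

```python
def blended_cost(queries, tokens_per_query, small_share,
                 small_price_per_m, frontier_price_per_m):
    """USD cost when `small_share` of token volume goes to the smaller model."""
    total_tokens = queries * tokens_per_query
    small_tokens = total_tokens * small_share
    frontier_tokens = total_tokens - small_tokens
    return (
        small_tokens * small_price_per_m
        + frontier_tokens * frontier_price_per_m
    ) / 1_000_000


# Assumed workload: 1,000 queries of 10,000 tokens each.
# Assumed prices: $0.50 (small) and $10.00 (frontier) per million tokens.
all_frontier = blended_cost(1_000, 10_000, 0.0, 0.5, 10.0)
routed = blended_cost(1_000, 10_000, 0.6, 0.5, 10.0)
savings = 1 - routed / all_frontier
print(all_frontier, routed, round(savings, 2))
```

With these made-up prices the routed bill is about $43 against $100, a saving of roughly 57%. Total token volume is unchanged: routing lowers the price paid per token, not the number of tokens consumed.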

Unblock your AI budget

Naboo deploys on-prem or in your VPC, returns one structured answer per query, and lets engineers use AI freely without procurement throttling them. In production in 2-4 weeks via the Forward Deployed Agent.