## TL;DR
Your AI API bill is probably the fastest-growing line item you can’t explain to the CFO. We were spending $10,200/month on LLM calls across three services. Eight weeks later, the same workload runs at $1,840/month. Here’s exactly what we changed, what we dropped, and what we’d never give up.
The core play: semantic caching (roughly 40% of the savings), model routing to cheaper models for 70% of requests (roughly 35%), and prompt compression (roughly 15%). The remaining 10% came from renegotiating with existing vendors and switching one workload to NanoGPT, which honestly surprised us with how well it handled our classification tasks at roughly a third of GPT-4o pricing.
No, this isn’t a “use DeepSeek and save money” blog post. It’s a structural engineering approach that works regardless of which models you choose.
## The Problem Nobody Talks About
Here’s what the vendor slide decks don’t show you. When you go from prototype to production, your AI spend doesn’t scale linearly. It scales in sudden, hard-to-predict jumps.
You start with one model. Then your product team wants GPT-4o for quality. Your ops team wants something cheaper for summarization. Your RAG pipeline needs embeddings. Suddenly you’re reconciling three billing cycles and your infrastructure team is asking why your AI spend looks like a startup burn rate.
At our scale, we were doing roughly 18 million tokens per day across all workloads. At list price, that was putting us in some uncomfortable territory with our finance team.
The painful truth is that most companies don’t have an AI cost problem. They have an AI cost architecture problem. The API calls are fine. The invisible waste is the issue.
Let me show you exactly where it was hiding.
## Audit First. Cut Second.
Before we touched a single line of code, we spent two weeks building a cost attribution layer. This was not fun. It was absolutely necessary.
Here’s what we instrumented:
Per-request logging with full metadata. Every LLM call now logs: model used, token count (input and output), latency, user ID, feature name, and a 24-character request hash. The hash lets us track identical or near-identical requests even when the wording changes slightly.
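As a sketch, the log record and request hash might look like the following in Python. The normalization scheme (lowercasing, stripping punctuation, collapsing whitespace), the exact field names, and the use of truncated SHA-256 are assumptions for illustration, not our production code:

```python
import hashlib
import json
import re
import time
from dataclasses import dataclass, asdict

def request_hash(prompt: str) -> str:
    """Normalize the prompt, then hash it. Lowercasing and collapsing
    whitespace/punctuation lets trivially reworded requests collide
    into the same 24-character hash."""
    normalized = re.sub(r"[^a-z0-9 ]", "", prompt.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode()).hexdigest()[:24]

@dataclass
class LLMCallLog:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    user_id: str
    feature: str
    request_hash: str
    ts: float

def log_call(model, prompt, usage, latency_ms, user_id, feature):
    record = LLMCallLog(
        model=model,
        input_tokens=usage["input_tokens"],
        output_tokens=usage["output_tokens"],
        latency_ms=latency_ms,
        user_id=user_id,
        feature=feature,
        request_hash=request_hash(prompt),
        ts=time.time(),
    )
    print(json.dumps(asdict(record)))  # in practice, ship to your log pipeline
    return record
```

The hash is what makes duplicate detection possible later: group by `request_hash` and count, and the repeat offenders fall out of a single query.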
The audit produced our first shock. Thirty-one percent of our requests were duplicates or near-duplicates. A user would ask “what’s my billing cycle?” and the system would hit the API fresh every time, even though the answer was exactly the same as it was thirty seconds ago. Across 18 million daily tokens, that waste was staggering.
Lesson one: You cannot optimize what you cannot see. If your LLM calls aren’t fully logged and attributed, stop everything and fix that first.
## The Semantic Cache: Where We Made Our Biggest Gain
Traditional caching doesn’t work for LLM requests because the exact same prompt almost never comes in twice. “Show me my invoices” and “list my recent invoices” are semantically identical but textually different.
The solution is semantic caching. We used a vector database (Pinecone, in our case, though Weaviate or Qdrant work equally well) to embed incoming requests and check for similarity before hitting the model.
A similarity threshold of 0.94 caught about 28% of our requests as cache hits. Those calls never touched the model. Zero tokens, zero latency, zero cost.
The catches? You need to handle cache invalidation carefully. When a user’s data changes (they pay an invoice, update their profile), you need to evict related cache entries. We did this with a simple tagging system. Each cache entry gets tagged with the user ID and a version hash of their relevant data state. When state changes, we invalidate by tag.
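A minimal sketch of the lookup-then-invalidate flow, with an in-memory list standing in for the vector database (production would use Pinecone, Weaviate, or Qdrant) and `embed_fn` standing in for an embedding API call. The tombstone approach to eviction is one way to do it, not necessarily ours:

```python
import math
from collections import defaultdict

SIM_THRESHOLD = 0.94  # the hit threshold discussed above

class SemanticCache:
    """In-memory sketch; swap the entries list for a real vector DB."""
    def __init__(self, embed_fn):
        self.embed = embed_fn           # text -> list[float]
        self.entries = []               # (vector, response, tags)
        self.by_tag = defaultdict(set)  # tag -> entry indices

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, prompt):
        v = self.embed(prompt)
        live = [e for e in self.entries if e[0] is not None]
        if not live:
            return None
        best = max(live, key=lambda e: self._cosine(v, e[0]))
        if self._cosine(v, best[0]) >= SIM_THRESHOLD:
            return best[1]  # cache hit: zero tokens, zero model latency
        return None

    def put(self, prompt, response, tags):
        idx = len(self.entries)
        self.entries.append((self.embed(prompt), response, tags))
        for tag in tags:
            self.by_tag[tag].add(idx)

    def invalidate(self, tag):
        """Evict everything tagged with e.g. a user's data-state version."""
        for idx in self.by_tag.pop(tag, set()):
            self.entries[idx] = (None, None, ())  # tombstone
```

The important property is that invalidation is by tag, not by prompt: when a user pays an invoice, you evict every cached answer derived from their old state in one call.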
Implementation took about five days. The ROI was immediate and frankly a little embarrassing in retrospect. We wish we’d built it six months earlier.
## Model Routing: Send the Right Request to the Right Model
This is where things get interesting from an architecture perspective. The naive approach to cost cutting is “just use the cheapest model everywhere.” That’s how you end up running your complex reasoning tasks on a model that wasn’t designed for them, generating wrong answers, and then spending more on human review than you saved.
The right approach is task classification. Divide your LLM workload into buckets by complexity:
Bucket 1: Simple classification and routing (roughly 50% of requests). Tags, categories, spam detection, intent classification. These tasks don’t need GPT-4o. They don’t even need Claude. We moved these to a combination of smaller models and rule-based systems, depending on the specific task.
Bucket 2: Structured extraction and formatting (roughly 25% of requests). Pulling structured data from unstructured inputs, formatting outputs, transforming JSON. These need something mid-tier with good instruction following.
Bucket 3: Complex reasoning and analysis (roughly 20% of requests). Multi-step problems, nuanced analysis, creative tasks. These are where you want your best model.
Bucket 4: Novel or ambiguous requests (roughly 5% of requests). These are the edge cases. We route these to the best available model and accept the higher cost as the price of quality.
The routing logic itself is lightweight. A small classifier model (we used a fine-tuned DistilBERT variant) makes the routing decision in under 5ms. The overhead is negligible compared to the savings.
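The dispatch itself can be as simple as a lookup table keyed by the classifier's output. In this sketch, `classify` stands in for the fine-tuned DistilBERT classifier, the model names are placeholders, and the 0.7 confidence cutoff is an illustrative assumption:

```python
# Bucket -> model table mirroring the split above; model names are illustrative.
ROUTES = {
    "simple_classification": "small-or-rules",   # ~50% of traffic
    "structured_extraction": "mid-tier-model",   # ~25%
    "complex_reasoning":     "frontier-model",   # ~20%
    "novel_or_ambiguous":    "frontier-model",   # ~5%, accept the cost
}

def route(request_text, classify):
    """classify() stands in for the routing classifier:
    it returns (bucket_label, confidence)."""
    bucket, confidence = classify(request_text)
    # Treat low-confidence classifications as novel/ambiguous: it is
    # cheaper to overpay on a few requests than to misroute them.
    if confidence < 0.7 or bucket not in ROUTES:
        bucket = "novel_or_ambiguous"
    return ROUTES[bucket]
```

The low-confidence fallback is the design choice that matters: routing mistakes on hard requests cost more in rework than the frontier model costs in tokens.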
Here’s what the cost looked like before and after routing for a representative mix of 1 million tokens:
| Model | Before | After | Cost Reduction |
|---|---|---|---|
| GPT-4o (all requests) | $75.00 | $18.75 | 75% on this bucket |
| Claude 3.5 Sonnet | $0 | $10.50 | New addition |
| DeepSeek V3 | $0 | $3.20 | New addition |
| Grok 4.1 Fast | $0 | $2.10 | New addition |
| NanoGPT | $0 | $2.40 | New addition |
| Total per 1M tokens | $75.00 | $36.95 | ~51% |
That’s just the model routing. Combine it with caching and the effective cost per token drops even further because the cache hits don’t appear in this table at all.
## Prompt Compression: Smaller Inputs, Same Quality
This one surprised us with how much it moved the needle. We looked at our average prompt size and realized we were including enormous context windows for tasks that didn’t need them.
The fix wasn’t just “write shorter prompts.” It was systematic prompt compression using dedicated compression models. We experimented with two approaches:
LLM-based compression: A smaller model rewrites user queries into a compressed form before sending to the main model. This adds one small API call but typically reduces the main call by 30-40%.
Chunking with summary: For large document processing, we chunk documents, run a lightweight summarization step on each chunk, and feed only the summaries plus a pointer to original chunks for deep analysis tasks. For 80% of our document processing needs, the summary alone is sufficient.
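The chunk-and-summarize approach can be sketched as a small pipeline. Here `summarize` stands in for the lightweight summarization call, and the character-based chunk size is an assumption (a token-based splitter would be more precise):

```python
def compress_document(text, summarize, chunk_chars=4000):
    """Chunk a large document, summarize each chunk with a cheap model,
    and return the summaries plus pointers back to the source chunks.
    Deep-analysis tasks can fetch the original chunk by chunk_id."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return [
        {"chunk_id": idx, "summary": summarize(chunk)}
        for idx, chunk in enumerate(chunks)
    ]
```

The main model then sees only the summaries; when it needs the raw text for one section, the `chunk_id` pointer retrieves just that chunk instead of the whole document.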
Both approaches require careful quality validation. You need to measure that compressed outputs match original outputs within an acceptable tolerance. We built a small eval harness that runs a sample of requests through both compressed and full paths, then scores similarity of outputs. If compression degrades output quality below our threshold, we reroute to the full prompt.
Honestly, building that eval harness took longer than expected. Three days instead of one. But it’s the kind of thing that saves you repeatedly once it exists.
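The core loop of such a harness is a shadow run: send a sample of requests through both paths and compare. In this sketch, the sample rate, threshold, and `similarity` function (embedding cosine, an LLM judge, whatever fits the task) are all assumptions:

```python
import random

def eval_compression(requests, run_full, run_compressed, similarity,
                     threshold=0.9, sample_rate=0.05):
    """Shadow-run a sample of requests through both the full and
    compressed paths and flag quality regressions. Flagged requests
    indicate a class of traffic to reroute to the full path."""
    failures = []
    for req in requests:
        if random.random() > sample_rate:
            continue  # only a sample pays the double-call cost
        full_out = run_full(req)
        comp_out = run_compressed(req)
        if similarity(full_out, comp_out) < threshold:
            failures.append(req)
    return failures
```

Note the trade-off baked into `sample_rate`: every sampled request costs twice, so the harness itself is a line item you size deliberately.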
## The Infrastructure Changes That Made It All Stick
All of this optimization work is fragile without the right infrastructure underneath. Three things made the difference:
1. Centralized routing layer. We built a thin service that sits in front of all LLM calls. Every request goes through it. The routing logic, cache lookup, and cost logging all happen there. It’s about 800 lines of Go and it handles roughly 40,000 requests per minute without breaking a sweat. The key insight is that this layer needs to be so boring it never occurs to anyone to bypass it.
2. Per-feature cost dashboards. Weekly review meetings became dramatically more productive when we could say “the document processing feature generated $4,200 of our AI costs this month, up 18% from last month” instead of “AI costs are up.” Attribution drives accountability.
3. Cost alerts with hysteresis. Simple cost alerts generate too much noise. We implemented a system that fires only when cost is trending significantly above baseline over a rolling window, not on single-spike anomalies. This took the on-call pager out of the equation for normal usage fluctuations.
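A rolling-window version of that alert logic might look like this. The 24-sample window and the 1.3x-over-baseline factor are illustrative assumptions, not our production thresholds:

```python
from collections import deque

class CostAlert:
    """Fire only when spend trends above baseline over a rolling window,
    not on a single spike."""
    def __init__(self, baseline_hourly, window=24, factor=1.3):
        self.baseline = baseline_hourly
        self.window = deque(maxlen=window)
        self.factor = factor

    def observe(self, hourly_cost):
        """Record one hourly cost sample; return True if the alert fires."""
        self.window.append(hourly_cost)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history to judge a trend yet
        avg = sum(self.window) / len(self.window)
        return avg > self.baseline * self.factor
```

Because the decision is made on the window average rather than the latest sample, one expensive hour gets absorbed while a sustained drift still pages someone.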
## What We’d Never Give Up
Some things are worth paying for, and we learned to stop being embarrassed about identifying them.
Gating complex financial calculations behind your best model. We tried running our most sensitive calculations on cheaper models to save money. The error rate on compound interest projections was unacceptable. We reverted within a week. Some tasks have a real dollar cost to being wrong.
Human-in-the-loop for content moderation edge cases. The savings from fully automated content moderation sounded great until we reviewed the false positive rate. Two hours of human review per week to catch the weird cases cost $160 and saved us from several genuinely embarrassing outputs.
Keeping the eval harness running in production. This is the opposite of a cost saving. Running eval on every compressed request adds overhead. But it’s the only way to catch quality regressions before customers do. We treat it as a non-negotiable cost of doing business.
## The Numbers, Summarized
Here’s the full picture after eight weeks:
| Category | Before (monthly) | After (monthly) | Change |
|---|---|---|---|
| Raw token cost at list price | $10,200 | $10,200 | Flat |
| Effective spend (after all optimizations) | $10,200 | $1,840 | -82% |
| Infrastructure overhead | $0 | $380 | New |
| Engineering investment (one-time) | $0 | ~$22,000 | Amortized 6 months |
| Net monthly savings at steady state | $0 | $7,980 | -78% |
The engineering investment paid back in under three months. After that, it’s pure savings.
## Where the Market Is Heading
The price war in the LLM space is accelerating. Grok 4.1’s entry at $0.20/M with a 2M token context window, DeepSeek’s aggressive V3.2 pricing at $0.28/M, and the steady compression of margins across the board mean that 2026 is the year to be solving this problem systematically.
The companies that will win on AI cost are not the ones who found the one weird trick to save 90%. They’re the ones who built the infrastructure to continuously evaluate, route, and optimize their AI spend the way they’d optimize any other critical system.
If you’re still treating your AI API bill as a fixed cost and just hoping it stays manageable, you’re leaving money on the table. The tools and techniques exist. The question is whether you have the engineering will to build it.
For most teams, I’d estimate a realistic 60-80% reduction is achievable within two months. The biggest gains typically come from caching, so start there. Everything else is incremental on top of that foundation.
Your CFO will ask what changed. You can tell them: we stopped lighting money on fire at scale.
