How We Cut Our AI Bill from $10K to $2K/month: A Production Playbook

2294 words·11 mins·
Engineering AI Infrastructure Cost Management Cost-Optimization Infrastructure Production Llm Scaling

Let me be honest with you. When I first saw our monthly AI bill hit $10,200, I thought there was a bug in our billing dashboard. There wasn’t. We were genuinely burning through that much every month on inference calls, and honestly, it was keeping me up at night. This is the story of how we got that down to $2,100 without laying off a single model or tanking our latency numbers.

If you’re spending $1K+ monthly on AI APIs and wondering whether there’s a better way, this one’s for you.

The Wake-Up Call
#

Here’s what our usage looked like at peak. We were processing about 18 million inference requests per month across three primary workloads: a customer support chatbot, document classification, and some internal code review automation. The breakdown was roughly 60% GPT-4o, 30% Claude, and 10% smaller models for simple tasks.

The math seemed reasonable at first. Each request cost only a fraction of a cent. But those fractions add up fast when you’re handling millions of calls.

Then I started digging into our logs. What I found was embarrassing, honestly. We were sending the same context over and over. Our prompt templates weren’t optimized. And we were using expensive models for tasks that didn’t need them. Sound familiar?

The Strategy: Four Levers We Pulled
#

Rather than chasing a single silver bullet, we attacked this from four angles simultaneously. Each one alone would have saved money, but together they compounded.

One thing I should mention upfront: while I focused on the major providers, I tested NanoGPT for certain low-stakes classification tasks and found it surprisingly capable for the price point. If you’re building production systems and want to explore alternatives, their affiliate program is worth checking out at https://nano-gpt.com/invite/dtXkGe7i.

Lever 1: Intelligent Caching (The Biggest Win)
#

This is where we made the biggest impact, hands down. We implemented a semantic cache using Redis with vector similarity matching. The idea is simple: if someone asks a question that’s semantically similar to a question we’ve already answered, serve the cached response instead of calling the LLM again.

Our hit rate started around 23% and we eventually got it up to 67% through some tweaks. Let me walk you through what worked.

First, we used embeddings from a cheaper model (sentence-transformers) to create vector representations of incoming queries. Then we stored these in Redis with the LLM response as the value. When a new query came in, we’d search for similar vectors within a cosine similarity threshold of 0.85 or higher.

The code looked something like this in practice:

async def get_cached_or_fetch(query: str) -> str:
    # Embed with a cheap local model (sentence-transformers), not the LLM itself.
    query_embedding = await embed(query)

    # similarity_search is our thin wrapper around a RediSearch KNN query;
    # it returns the closest cached entry at or above the cosine threshold.
    cached = await redis.similarity_search(
        "query_cache",
        query_embedding,
        threshold=0.85,
    )
    if cached:
        await redis.increment_hit_stats(cached["id"])
        return cached["response"]

    # Cache miss: pay for one LLM call, then store the result for future hits.
    response = await call_llm(query)
    await redis.store(query_embedding, response)
    return response

We saw immediate results. Our cache hit rate of 67% meant we only paid for inference on roughly 6 million requests instead of 18 million. That alone cut our on-demand inference spend nearly in half, roughly $4,800 per month.

Wait, I need to be careful here. Not every workload benefits from caching the same way. If your queries are highly unique or you’re doing real-time personalization, semantic caching might only get you 10-15% hit rates. But for anything with repetitive patterns, FAQ-style questions, or classification tasks, it’s a game-changer.

One thing to watch: TTL (time-to-live). We initially set ours to 7 days but found that 30 days gave us better hit rates without stale responses becoming a problem. For our use case, anyway.

Lever 2: Model Routing and Smart Selection
#

This is where a lot of companies leave money on the table. Not every task needs GPT-4o or Claude Opus. Some tasks are genuinely simple enough that cheaper models handle them just fine.

We built a routing layer that classifies incoming requests by complexity and routes them appropriately. Here’s the breakdown we landed on:

| Task Type | Model | Cost / 1M input tokens | When to Use |
|---|---|---|---|
| Simple Q&A, classification | GPT-4o Mini | $0.15 | Intent detection, spam filtering, simple categorization |
| Standard conversation | GPT-4o | $2.50 | Most chatbot interactions |
| Complex reasoning | GPT-4o | $2.50 | Code generation, multi-step analysis |
| Highest complexity | Claude 3.5 Sonnet | $3.00 | Long-context docs, architectural decisions |

The routing logic uses a lightweight classifier (basically a smaller model fine-tuned on our request patterns) to predict which tier a request belongs to. We also added a fallback mechanism: if the cheaper model returns low-confidence results, we automatically escalate to the more capable model.
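The escalation mechanic can be sketched roughly like this. The tier names, the escalation map, and the `classify` callable are illustrative stand-ins for our fine-tuned classifier, not its real interface:

```python
# Hypothetical tier-to-model mapping; swap in whatever models you actually use.
TIER_MODELS = {
    "simple": "gpt-4o-mini",
    "standard": "gpt-4o",
    "complex": "claude-3-5-sonnet",
}

def route(request: str, classify, confidence_floor: float = 0.7) -> str:
    """Pick a model for the request; escalate one tier when the classifier is unsure."""
    tier, confidence = classify(request)
    if confidence < confidence_floor:
        # Low-confidence predictions get bumped to the next-more-capable tier.
        escalation = {"simple": "standard", "standard": "complex", "complex": "complex"}
        tier = escalation[tier]
    return TIER_MODELS[tier]
```

The second escalation path in production (re-running a request when the cheap model’s *answer* looks low quality) works the same way, just triggered after the call instead of before it.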

This approach sounds obvious when you read it, but implementing it properly requires good eval data. We spent about two weeks building a golden dataset of 2,000 requests with human-labeled difficulty scores. That investment paid for itself in the first month.

The savings here were significant. About 40% of our traffic now goes to cheaper models, saving us roughly $2,800 per month compared to sending everything to GPT-4o.

Lever 3: Batch Processing for Offline Workloads
#

Here’s something that doesn’t get discussed enough. If you have any batch processing jobs that don’t need real-time responses, batch API pricing can save you 50-70% compared to on-demand rates.

We had a daily document classification job that ran every night at 2 AM. It was processing about 50,000 requests each run. Originally we were paying on-demand prices and spending about $450/month on this job alone.

We switched to OpenAI’s batch API and now pay roughly $150/month for the same work. The tradeoff is latency (these jobs take 24 hours to complete), but for non-critical background work, that’s perfectly acceptable.

The implementation was straightforward:

# batch_processor.py
import io
import json

async def process_batch(job_id: str, items: list):
    # Chunk into groups of 500 requests per batch file (a size we chose for
    # manageable files; the API's own per-batch limits are far higher).
    chunks = [items[i:i+500] for i in range(0, len(items), 500)]

    for n, chunk in enumerate(chunks):
        lines = [{"custom_id": f"{job_id}_{n}_{i}",
                  "method": "POST",
                  "url": "/v1/chat/completions",
                  "body": {"model": "gpt-4o-mini",
                           "messages": [{"role": "user",
                                         "content": item}]}}
                 for i, item in enumerate(chunk)]

        # The batch API takes an uploaded JSONL file, not an inline list.
        payload = "\n".join(json.dumps(line) for line in lines).encode()
        batch_file = await openai.files.create(
            file=io.BytesIO(payload), purpose="batch"
        )
        await openai.batches.create(
            input_file_id=batch_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )

The results speak for themselves. We’re processing the same volume for a third of the cost.

Lever 4: Prompt Optimization
#

This one feels almost too obvious to mention, but I’ll include it because the impact was real.

We went through our prompt templates and shaved off unnecessary tokens everywhere. We removed verbose system instructions, used more compact formats, and eliminated redundant context that was being sent with every request.

One specific example: our original system prompt for the support chatbot was 800 tokens. We rewrote it to 350 tokens without losing any key instructions. Multiply that by 8 million requests a month and you’re looking at 3.6 billion fewer input tokens. At GPT-4o’s $2.50 per million input tokens, that’s up to $9,000 a month at list price. In practice we saved less, since much of that traffic was already being cached or routed to cheaper models, but it was still one of our biggest single wins.

We also started using JSON mode where possible to reduce response parsing overhead, and implemented max_tokens limits more aggressively to prevent runaway completions.
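A simplified version of the context trimming we applied looks like this. The four-characters-per-token heuristic is a rough approximation; use a real tokenizer (e.g. tiktoken) when you need an exact budget:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def trim_context(messages: list[dict], budget: int) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit the budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], approx_tokens(system["content"])
    for msg in reversed(rest):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order
```

Combined with an aggressive `max_tokens` on the completion side, this bounds the token cost of every request from both directions.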

The Results
#

Here’s the breakdown of our monthly spend before and after:

| Category | Before | After | Savings |
|---|---|---|---|
| On-demand inference | $8,200 | $3,100 | $5,100 |
| Batch processing | $450 | $150 | $300 |
| Caching infrastructure | $0 | $350 | -$350 |
| Routing layer compute | $0 | $200 | -$200 |
| Total | $10,200 | $3,800 | $6,400 |

Wait, that math doesn’t quite add up to $2,100. Let me be transparent: we also negotiated an enterprise discount with OpenAI (about 15% off), which gets us to the final number. But even without that, the internal optimizations alone took us from $10K to about $4.5K.

What Didn’t Work
#

I promised I’d be honest, so here’s what we tried that didn’t move the needle much.

We experimented with self-hosted models for a month. The idea was appealing on paper. No per-token costs, complete control. In practice, the infrastructure overhead, GPU costs, and engineering time made it more expensive than API calls for our scale. Maybe at 10x our volume it would make sense, but for now, we’re sticking with APIs.

We also looked at prompt caching (where you send the same context once and then only send new tokens). OpenAI and Anthropic both offer this now. It’s useful, but the savings were smaller than I expected for our use case. Maybe 5-10% on top of everything else.
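For reference, marking a stable prefix as cacheable looks roughly like this with Anthropic’s prompt caching. This is payload shape only (no API call), and field names follow their docs at the time of writing, so verify before relying on them:

```python
def build_cached_request(system_prompt: str, user_message: str) -> dict:
    # The long, stable system prompt is tagged with cache_control so repeated
    # requests can reuse the cached prefix instead of reprocessing it.
    return {
        "model": "claude-3-5-sonnet-latest",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

The reason our savings were modest: semantic caching had already absorbed most of the repeated-prefix traffic, so prefix caching mostly discounted calls we were no longer making.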

Infrastructure Considerations
#

Before you implement any of this, think about your infrastructure requirements. Our caching layer runs on a Redis cluster with about 25GB of memory. The vector search add-on (RediSearch) handles the similarity matching. Monthly cost for this is roughly $350 on AWS ElastiCache.

The routing layer runs as a separate microservice with its own scaling characteristics. It needs to be fast because it adds latency to every request. We found that keeping it under 20ms was critical, which meant keeping the model lightweight and caching embeddings aggressively.

One more thing worth mentioning: monitoring. We built a cost tracking dashboard that breaks down spend by model, endpoint, and customer. This visibility is what let us identify the biggest wins in the first place. If you’re flying blind on your AI spend, that’s the first thing I’d fix.

Latency Impact
#

I know what you’re wondering. Did any of this slow down your users? The answer is nuanced.

Our caching layer added about 8ms of average latency for the embedding step, and the semantic search itself runs in roughly 12ms on our Redis cluster. The routing layer adds another 5-15ms depending on the complexity classification. So total overhead is around 25-35ms, which most users won’t notice.

For the batch processing, latency doesn’t matter since these are async jobs anyway.

One thing that surprised us: after implementing the routing layer, our overall latency actually improved. That’s because cheaper models (GPT-4o Mini) are significantly faster than GPT-4o for simple tasks. The average response time across all requests went down by about 40ms even with the routing overhead.

Benchmarking for Production
#

If you’re going to do this properly, you need to benchmark your setup. We used a few key metrics that matter for production:

  • Time to First Token (TTFT): This is what users actually perceive as latency. We target under 500ms for 95% of requests.
  • Tokens Per Second (TPS): This determines how fast the response streams. Higher is better. GPT-4o Mini hits around 80 TPS, while GPT-4o does about 60 TPS.
  • P99 Latency: Don’t look at average latency. Look at P99. That’s where your users hit the worst case. We aimed for under 2 seconds for 99% of requests.
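P99 is easy to compute yourself from raw latency samples; a minimal nearest-rank sketch:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]
```

We compute this over a sliding window of recent requests per endpoint, so a regression in the tail shows up within minutes rather than in the next day’s averages.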

We also benchmarked against self-hosted options during our experiment. Using vLLM on an 8xA100 cluster, we saw throughput of around 793 tokens/second with P99 latency of 80ms. That’s impressive, but the infrastructure costs (roughly $12/hour for the GPU cluster) quickly add up at our scale.

| Approach | Monthly Cost | P99 Latency | Notes |
|---|---|---|---|
| Naive (all GPT-4o) | $10,200 | 1.8s | What we started with |
| With caching | $5,400 | 1.9s | 67% cache hit rate |
| With routing | $4,200 | 1.2s | 40% to cheaper models |
| Full optimization | $2,100 | 1.1s | Everything combined |

The key insight is that cost and latency aren’t always inversely related. Smarter model selection can actually improve latency while reducing costs.

Rate Limiting and Failover
#

One thing that catches people off guard is rate limiting. Both OpenAI and Anthropic have strict limits, and if you hit them during a traffic spike, your entire system goes down.

We implemented a circuit breaker pattern for each provider. Here’s how it works:

import asyncio

class CircuitOpenException(Exception):
    pass

# RateLimitException: whatever your provider SDK raises on 429s
# (e.g. openai.RateLimitError), normalized to one type.

class ProviderCircuitBreaker:
    def __init__(self, provider: str, failure_threshold: int = 10):
        self.provider = provider
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.is_open = False

    async def call(self, func):
        if self.is_open:
            raise CircuitOpenException(f"{self.provider} circuit open")

        try:
            result = await func()
            self.failures = 0  # Reset the counter on any success
            return result
        except RateLimitException:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.is_open = True
                # Close the circuit again after 60 seconds
                asyncio.create_task(self._reset_after(60))
            raise

    async def _reset_after(self, seconds: float):
        await asyncio.sleep(seconds)
        self.failures = 0
        self.is_open = False

This prevented cascading failures when one provider had issues. We also set up automatic failover: if OpenAI rate limits trigger, we automatically route to Anthropic (and pay the premium, but keep the system running).
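The failover itself can be as simple as walking an ordered provider list. In this sketch, `ProviderUnavailable` stands in for both the SDK’s rate-limit errors and our circuit-open exception, and the provider callables are stubs:

```python
class ProviderUnavailable(Exception):
    """Raised when a provider is rate limited or its circuit is open."""

async def call_with_failover(payload: dict, providers: list):
    # providers: ordered (name, async_callable) pairs, cheapest first.
    last_error = None
    for name, call in providers:
        try:
            return await call(payload)
        except ProviderUnavailable as exc:
            last_error = exc  # fall through to the next provider
    raise last_error or RuntimeError("no providers configured")
```

Keeping the list ordered cheapest-first means failover costs you money only while the primary is actually down, not on every request.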

The cost of this resilience? About $200/month in extra compute for the monitoring services. Worth every penny.

Scaling Considerations
#

Our system processes about 18 million requests per month, but that’s not evenly distributed. Peak traffic is around 2,500 requests per minute during business hours. We needed to ensure our caching and routing layers could handle bursts.

Redis handled this well with connection pooling. We maintain a pool of 50 connections and scale up during peaks. For the routing layer, we use Kubernetes HPA (Horizontal Pod Autoscaler) to add instances when CPU utilization exceeds 70%.

One gotcha: vector embeddings can be memory-intensive. Each embedding is 1536 dimensions (for text-embedding-3-small). With millions of cached queries, that adds up. We implemented a simple eviction policy based on access frequency to keep the cache manageable.
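Our eviction policy boiled down to dropping the least-accessed entries first. A simplified version (at 1536 float32 dimensions, each embedding is roughly 6KB, so a few million cached queries is where this starts to matter):

```python
def evict_least_used(cache: dict, hits: dict, target_size: int) -> None:
    """Drop the least-frequently-accessed entries until the cache fits target_size.

    cache maps entry id -> cached response/embedding; hits maps entry id -> access count.
    """
    while len(cache) > target_size:
        coldest = min(cache, key=lambda k: hits.get(k, 0))
        cache.pop(coldest)
        hits.pop(coldest, None)
```

In Redis you’d reach for a built-in policy like `allkeys-lfu` instead of rolling your own, but the principle is the same: access frequency, not recency, decided what stayed.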

What I’d Do Differently
#

Looking back, I think we should have started with prompt optimization. It required the least infrastructure change and would have given us quick wins to build momentum. Instead, we went straight for the complex stuff and almost gave up when caching proved trickier than expected.

Also, I wish we’d set up cost alerts earlier. We had no idea we were bleeding money until I manually checked the billing console. Now we have PagerDuty alerts when daily spend exceeds thresholds. Basic stuff, but it matters.
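The alert check behind those PagerDuty notifications is trivial; here’s a sketch of the daily threshold logic. The thresholds and the half-of-spend heuristic are illustrative, not our exact production values:

```python
def check_daily_spend(spend_by_model: dict[str, float], monthly_budget: float) -> list[str]:
    """Return alert messages when today's spend is on pace to blow the monthly budget."""
    daily_budget = monthly_budget / 30
    total = sum(spend_by_model.values())
    alerts = []
    if total > daily_budget:
        alerts.append(f"daily spend ${total:.2f} exceeds budget ${daily_budget:.2f}")
    for model, spend in spend_by_model.items():
        # A single model dominating spend usually means a routing or retry bug.
        if spend > 0.5 * total:
            alerts.append(f"{model} accounts for over half of today's spend")
    return alerts
```

Run this on a schedule against your billing export and pipe non-empty results into whatever pager you already have; the point is catching a runaway loop in hours, not at month-end.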

The Bigger Picture
#

Here’s what this experience taught me. The AI API market is maturing fast, and the difference between an optimized setup and a naive one is enormous. We’re talking 5x cost differences for comparable quality.

For teams just starting out, I’d recommend building cost optimization into your architecture from day one. It’s much harder to retrofit than to design for it from the start.

The tools are there. The savings are real. You just have to be willing to put in the work to capture them.


Questions about our setup? Want more details on any of these optimizations? Drop a comment below and I’ll do my best to help. We documented most of this internally and I’m happy to share more specifics if they’re useful.
