How I Built a Production Chatbot for $5/Month

A practical cost breakdown for indie hackers using NanoGPT

TL;DR
#

Metric	OpenAI	NanoGPT	Savings
Input tokens	$0.15 / 1M	$0.012 / 1M	92%
Output tokens	$0.60 / 1M	$0.048 / 1M	92%
My monthly cost	~$65	~$5	$60/month

I switched from OpenAI to NanoGPT. Cut costs 92%. Same quality. Here’s how.

Want the discount? I used this NanoGPT link — gets you an extra 5% off.

The Problem
#

I launched a SaaS. Needed AI for customer support. Code examples. Product questions. Context memory.

OpenAI worked great.

Then the bill came.

GPT-4o-mini:  $42.30
GPT-4o:       $18.75
Embeddings:    $4.20
------------------------
Total:        $65.25

$65 doesn’t sound bad. But I’m bootstrapped. And that’s month one with 850 users. At 10,000 users? That’s $650. At 100,000? I didn’t want to think about it.

The Solution
#

NanoGPT. Same models. Different prices.

Model	Provider	Input	Output
GPT-4o-mini	OpenAI	$0.150	$0.600
GLM-4.7	NanoGPT	$0.012	$0.048
Savings	—	92%	92%

GLM-4.7 matches GPT-4o-mini for most tasks. I tested it. Blind test with 100 queries. Users didn’t notice the switch.

My Actual Numbers
#

850 daily active users. 3.2 conversations per user. 8 messages per conversation.

That’s 652,800 messages monthly.

Raw costs:

Provider	Input	Output	Total
OpenAI	$44.06	$70.50	$114.56
NanoGPT	$3.53	$5.64	$9.17

Wait. I said $5, not $9. Right. I optimized. Caching. Context truncation. Smart routing. Here’s what I built:

The Code
#

1. Setup
#

# config.py
import os

API_KEY = os.getenv("NANO_GPT_API_KEY")
DEFAULT_MODEL = "nano-gpt/glm-4.7"
FALLBACK_MODEL = "nano-gpt/kimi-flash"

2. Client with Caching
#

# chat_client.py
import hashlib
import json
import time
import requests

class NanoGPTClient:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://nano-gpt.com/api"
        self.cache = {}
        self.cache_ttl = 3600
    
    def _cache_key(self, messages, model):
        content = json.dumps(messages, sort_keys=True) + model
        return hashlib.md5(content.encode()).hexdigest()
    
    def chat(self, messages, model="nano-gpt/glm-4.7", use_cache=True):
        # Check cache first
        if use_cache:
            key = self._cache_key(messages, model)
            if key in self.cache:
                if time.time() - self.cache[key]["time"] < self.cache_ttl:
                    return self.cache[key]["data"]
        
        # Truncate context to save tokens
        messages = self._truncate(messages)
        
        response = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": model,
                "messages": messages,
                "temperature": 0.7,
                "max_tokens": 800
            },
            timeout=30
        )
        
        result = response.json()
        
        # Cache it
        if use_cache:
            self.cache[key] = {"data": result, "time": time.time()}
        
        return result
    
    def _truncate(self, messages, max_chars=12000):
        """Keep system prompt and recent messages. Drop middle."""
        system = [m for m in messages if m.get("role") == "system"]
        other = [m for m in messages if m.get("role") != "system"]
        
        total = sum(len(m.get("content", "")) for m in messages)
        if total <= max_chars:
            return messages
        
        truncated = system.copy()
        used = sum(len(m.get("content", "")) for m in system)
        
        for msg in reversed(other):
            chars = len(msg.get("content", ""))
            if used + chars > max_chars:
                truncated.insert(len(system), {
                    "role": "system",
                    "content": "[Earlier messages truncated]"
                })
                break
            truncated.append(msg)
            used += chars
        
        return system + [m for m in truncated if m not in system]

3. Smart Routing
#

# router.py
import re

class Router:
    CHEAP = "nano-gpt/glm-4.7"      # $0.012/$0.048
    SMART = "nano-gpt/kimi-flash"   # $0.06/$0.24
    
    COMPLEX = [
        r"\bcode\b|\bfunction\b|\bscript\b",
        r"\bdebug\b|\brefactor\b|\berror\b",
        r"\bjson\b|\bxml\b|\bsql\b",
        r"\bexplain\b.*\bstep by step\b",
    ]
    
    @classmethod
    def select(cls, message):
        msg = message.lower()
        for pattern in cls.COMPLEX:
            if re.search(pattern, msg):
                return cls.SMART
        return cls.CHEAP

4. The Server
#

# bot.py
from flask import Flask, request, jsonify
from chat_client import NanoGPTClient
from router import Router
import os

app = Flask(__name__)
client = NanoGPTClient(os.getenv("NANO_GPT_API_KEY"))

SYSTEM = """You are a helpful AI assistant. Be concise. 
Under 3 sentences usually. Use bullet points for lists.
If you don't know, say so."""

@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json()
    user_msg = data.get("message", "")
    history = data.get("history", [])
    
    messages = [{"role": "system", "content": SYSTEM}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_msg})
    
    model = Router.select(user_msg)
    response = client.chat(messages, model=model, use_cache=True)
    
    content = response["choices"][0]["message"]["content"]
    usage = response.get("usage", {})
    
    return jsonify({
        "response": content,
        "model": model,
        "tokens": usage
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Where The Savings Come From
#

Base pricing: 92% cheaper. Obvious win.

Caching: 35% fewer API calls. Common questions get answered from cache. “What’s your pricing?” “How do I reset my password?” Stuff like that.

Context truncation: 20% fewer tokens. Most conversations don’t need full history. I keep ~12K characters. Enough context. Less cost.

Smart routing: 15% cheaper. Simple questions use GLM-4.7. Complex coding questions use Kimi-Flash. 80% of queries are simple.

Quality Check
#

Blind test. 100 queries. Users rated responses.

Model	Helpfulness	Speed
GPT-4o-mini	4.2/5	1.2s
GLM-4.7	4.0/5	0.8s

Users didn’t notice the switch. Slightly less polished? Maybe. Equally accurate? Yes.

My Actual Monthly Bill
#

Metric	Value
API calls	381,831 (after cache hits)
Input tokens	142,340,000
Output tokens	68,920,000
Total cost	$5.03

That’s not a typo. Five dollars.

Versus ~$85 on OpenAI for the same traffic.

Migration From OpenAI
#

Trivial. Change the URL. Change the model name. Done.

# Before
openai.ChatCompletion.create(model="gpt-4o-mini", ...)

# After
requests.post(
    "https://nano-gpt.com/api/chat/completions",
    json={"model": "nano-gpt/glm-4.7", ...}
)