The AI Cost Problem
AI API costs can spiral quickly. A prototype that costs $5/month can balloon to $5,000/month in production. The good news: most teams are leaving massive savings on the table with a few simple optimizations.
Here are 10 battle-tested strategies ranked from highest to lowest impact.
1. Use Smaller Models for Simple Tasks (Potential savings: 90%+)
This is the #1 opportunity most teams miss. Not every task needs GPT-4o.
| Task | Recommended Model | vs GPT-4o Savings |
|---|---|---|
| Classification | GPT-4o mini | ~94% |
| Simple Q&A | Gemini 1.5 Flash | ~97% |
| Code completion | Claude 3.5 Haiku | ~73% |
| Complex reasoning | GPT-4o or Claude 3.5 Sonnet | baseline |
Rule of thumb: Use the cheapest model that meets your quality bar. Test systematically with your actual data.
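In practice, this usually means a small routing layer that maps task types to the cheapest approved model. A minimal sketch (the task labels and `route()` helper are illustrative, and only OpenAI models are shown; the same pattern extends to other providers' SDKs):

```python
from openai import OpenAI

client = OpenAI()

# Cheapest model that passed your quality bar for each task type (illustrative).
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "simple_qa": "gpt-4o-mini",
    "complex_reasoning": "gpt-4o",
}

def route(task_type: str, prompt: str) -> str:
    """Send the prompt to the cheapest model approved for this task type."""
    model = MODEL_BY_TASK.get(task_type, "gpt-4o")  # unknown tasks fall back to the strong model
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```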
2. Optimize System Prompts (Savings: 10-30%)
System prompts run on every request. Even small reductions add up:
```text
# Before: 450 tokens
You are a helpful, friendly, professional customer service assistant
for Acme Corporation. Your role is to help customers with their
questions and concerns. Always be polite, empathetic, and thorough
in your responses. Make sure to address all parts of the customer's
question...

# After: 120 tokens
You are Acme Corp's support AI. Be concise and helpful.
```
Saving 330 tokens × 100K daily requests × $2.50/1M = $82.50/day saved.
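Don't guess at token counts; measure them with the provider's tokenizer. A quick sketch using OpenAI's tiktoken library (assuming a version recent enough to know gpt-4o's encoding; the helper and the $2.50/1M input price are illustrative):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def prompt_cost_per_day(system_prompt: str, daily_requests: int,
                        price_per_million: float = 2.50) -> float:
    """Daily input-token cost attributable to the system prompt alone."""
    tokens = len(enc.encode(system_prompt))
    return tokens * daily_requests * price_per_million / 1_000_000

before = "You are a helpful, friendly, professional customer service assistant..."  # full prompt here
after = "You are Acme Corp's support AI. Be concise and helpful."
saved = prompt_cost_per_day(before, 100_000) - prompt_cost_per_day(after, 100_000)
print(f"${saved:.2f}/day saved")
```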
3. Implement Prompt Caching (Savings: up to 90% on repeated context)
Anthropic, OpenAI, and Google all offer prompt caching — pay once for repeated context, then much less for cache hits.
Use case: You send the same 10,000-token document to Claude for different queries.
- Without caching: $0.03 per request × 1,000 requests = $30.00
- With caching (Anthropic's rates): ~$0.0375 for the first request (cache writes cost ~1.25x the base input price) + $0.003 × 999 cache reads (~10% of base) ≈ $3.03
Roughly a 10x reduction, about 90% off, for the cached content.
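With Anthropic's SDK, caching is a matter of marking the repeated context with `cache_control`. A minimal sketch; `document_text` stands in for your 10,000-token document, and cache pricing and TTLs vary by provider, so check the docs:

```python
import anthropic

client = anthropic.Anthropic()
document_text = "<your 10,000-token document goes here>"  # placeholder

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system=[{
        "type": "text",
        "text": document_text,                    # the large, repeated context
        "cache_control": {"type": "ephemeral"},   # cache everything up to this block
    }],
    messages=[{"role": "user", "content": "What does section 2 say about refunds?"}],
)
```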
4. Set Output Token Limits (Savings: 20-60%)
AI models often generate more text than needed. Set max_tokens aggressively:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=500,  # Not 4096! Cap output at what the task actually needs.
    messages=[...],
)
```
Measure your actual average output lengths and set limits accordingly.
5. Batch Similar Requests (Savings: 25-50%)
Instead of sending individual API calls, batch similar tasks:
```python
# Instead of 100 separate calls:
results = [classify(item) for item in items]  # 100 API calls

# Batch them in chunks of 20 (5 API calls for 100 items):
results = []
for i in range(0, len(items), 20):
    results.extend(classify_batch(items[i:i + 20]))
```
Most models can classify 20-50 items in a single prompt.
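A sketch of what `classify_batch` might look like, assuming a hypothetical three-label task; real code needs more defensive output parsing than a bare `json.loads`:

```python
import json
from openai import OpenAI

client = OpenAI()

def classify_batch(items: list[str]) -> list[str]:
    """Classify many items in one request; returns one label per item."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(items))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # classification rarely needs a frontier model (strategy 1)
        max_tokens=300,       # labels are short (strategy 4)
        messages=[{
            "role": "user",
            "content": "Classify each numbered item as billing, technical, or other.\n"
                       f"{numbered}\n"
                       'Reply with only a JSON array of labels, e.g. ["billing", "other"].',
        }],
    )
    return json.loads(response.choices[0].message.content)
```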
6. Add a Semantic Cache Layer (Savings: 30-80%)
For question-answering applications, cache responses by semantic similarity:
```python
# Pseudo-code
embedding = get_embedding(user_query)
cached = find_similar(embedding, threshold=0.95)
if cached:
    return cached.response  # Cache hit: free!
response = call_llm(user_query)
store_in_cache(embedding, response)
return response
```
Tools: Postgres with pgvector, Redis (vector search), Weaviate, or dedicated semantic-cache libraries like GPTCache.
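As a concrete illustration, here is a self-contained sketch using OpenAI embeddings and a plain in-memory list; a production system would swap the list for one of the vector stores above, and the 0.95 threshold is a starting point to tune against false hits:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
cache: list[tuple[np.ndarray, str]] = []  # (embedding, response) pairs

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(result.data[0].embedding)
    return vec / np.linalg.norm(vec)  # normalized, so dot product = cosine similarity

def answer(query: str, threshold: float = 0.95) -> str:
    q = embed(query)
    for vec, response in cache:
        if float(q @ vec) >= threshold:
            return response  # cache hit: no LLM call, no cost
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    ).choices[0].message.content
    cache.append((q, response))
    return response
```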
7. Use Streaming Intelligently (Savings: 0-15% on UX, indirect)
Streaming doesn’t save tokens, but it improves perceived performance, which can allow you to use shorter generation limits without hurting UX.
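For reference, streaming is a one-flag change in the OpenAI SDK; a minimal example:

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=300,   # a tight cap feels less abrupt when tokens render as they arrive
    stream=True,
    messages=[{"role": "user", "content": "Explain prompt caching in one paragraph."}],
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```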
8. Compress Context Window (Savings: 20-40%)
For chat applications, don’t include full conversation history:
- Summarize older messages instead of sending them verbatim
- Use retrieval to include only relevant context
- Set a rolling window of recent messages
```python
# Instead of full history:
messages = conversation.full_history  # 5,000 tokens

# Compressed: one summary of older turns plus the recent messages (~1,000 tokens)
messages = [
    {"role": "system", "content": summarize(conversation.older_messages)},
    *conversation.recent_5_messages,
]
```
9. Self-Host Open Source Models for High-Volume Tasks (Savings: 70-90%)
For very high-volume, predictable workloads, self-hosted models via services like Together AI, Replicate, or your own GPU infrastructure can dramatically reduce costs:
| Approach | Cost (1B input tokens/mo) |
|---|---|
| GPT-4o | ~$2,500 |
| GPT-4o mini | ~$150 |
| Llama 3.1 70B (Together AI) | ~$880 |
| Llama 3.1 8B (self-hosted) | ~$20-40 |

(Approximate input-token list prices at the time of writing; rates change frequently, so recheck before committing.)
The tradeoff: engineering overhead and reliability responsibility.
10. Monitor and Alert on Cost Anomalies (Savings: 10-30% from catching bugs)
Set up cost monitoring with alerts:
- Unusual spike in tokens per request (possible prompt injection)
- Runaway loops or retry storms
- Gradual drift in average request size
All major providers have cost dashboards. Use them.
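Provider dashboards catch monthly drift; an in-process check catches runaway behavior the same hour. A minimal sketch, where the `alert` hook and thresholds are placeholders for your own tooling:

```python
from collections import deque

WINDOW = 1000        # requests in the rolling baseline
SPIKE_FACTOR = 3.0   # alert when a request uses 3x the rolling average

recent: deque[int] = deque(maxlen=WINDOW)

def record_usage(total_tokens: int, alert=print) -> None:
    """Call after every API response with response.usage.total_tokens."""
    if len(recent) == WINDOW:
        baseline = sum(recent) / len(recent)
        if total_tokens > SPIKE_FACTOR * baseline:
            alert(f"Token spike: {total_tokens} tokens vs baseline {baseline:.0f}")
    recent.append(total_tokens)
```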
Summary: Your Cost Reduction Checklist
- Audit which tasks actually need frontier models
- Measure average system prompt length and compress it
- Enable prompt caching for repeated context
- Set appropriate `max_tokens` limits
- Implement batching for similar operations
- Add semantic caching for Q&A workloads
- Compress conversation history
- Set up cost monitoring alerts
Apply all 10 and you can realistically reduce costs by 60-80% without meaningful quality degradation.
Use our Token Cost Calculator to model the savings for your specific workload.