Opus 4.8 is here...
No wonder 4.7 has been behaving so badly for the last couple of days.
Let's run all the benchmarks again and see where we land with this one.
@AnthropicAI #opus
In Retrieval‑Augmented Generation, evaluating the retriever is non‑negotiable. I measured Context Precision@5 at 0.62 and still got hallucinations because the LLM was fed irrelevant docs. Good retrieval = better answers.
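For anyone who wants to track this, here is a minimal sketch of plain precision@k, assuming you have relevance labels for each query. The function name and the document IDs are made up for illustration, and some evaluation frameworks use a rank-weighted variant of context precision rather than this unweighted one.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Dividing by k (not len(top_k)) penalizes retrievers that return too few docs.
    return hits / k


# Hypothetical query: 5 retrieved chunks, 3 of them labeled relevant.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4"}
score = precision_at_k(retrieved, relevant, k=5)
```

Averaging this over a labeled query set gives one number to watch before and after every retriever change.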
Automated metrics like BLEU or ROUGE are handy for quick checks, but they punish creativity. My summarizer once got a ROUGE‑L of 0.42 while users rated it 4.7/5 for usefulness. Don't let surface similarity dictate your success.
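A rough sketch of why surface overlap punishes paraphrase: the code below computes an LCS-based ROUGE-L F1 on lowercase whitespace tokens. It is a simplified stand-in, not the official ROUGE implementation (no stemming, no sentence splitting), and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
verbatim = rouge_l_f1(reference, "the cat sat on the mat")
paraphrase = rouge_l_f1(reference, "a feline rested upon the rug")
```

The paraphrase shares almost no tokens with the reference, so it scores far below the verbatim copy even though a human reader would call it a fair restatement.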
LLM‑as‑a‑Judge sounds clever until you let the model grade its own work. I tried using GPT‑4 to score GPT‑4 outputs and it consistently gave itself 4.5‑plus. Always use a different model or configuration as the judge.
Most LLM teams ship blind. I once pushed a prompt change after eyeballing three outputs. Production broke, and I had no baseline to tell me whether to blame the prompt, the model, or the retriever. If you can't measure, you can't fix.
Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code. Just exactly what you want to see.
@eddiejiao_obj, @drewocarr and I built a prototype to see how this could actually work, and set out to make it real. We're calling it Flipbook. (1/5)
If you're building RAG, your cost model has at least 4 factors:
1. LLM generation cost (query volume x avg tokens x price/token)
2. Embedding cost (document count x avg tokens x price/token, plus re-embedding cadence)
3. Vector DB cost (storage + query volume)
4. Monitoring cost (flat fee per user per month)
Most project proposals I've reviewed include only 1-2 of these factors.
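The four factors above can be sketched as a small calculator. This is a rough sketch, not a pricing tool: the field names, the vector DB pricing shape (flat storage plus a per-1k-query fee), and every number in the example are placeholder assumptions. Only the structure of the four formulas comes from the list.

```python
from dataclasses import dataclass


@dataclass
class RagCostInputs:
    monthly_queries: int
    avg_input_tokens: int
    avg_output_tokens: int
    input_price_per_m: float       # USD per 1M input tokens
    output_price_per_m: float      # USD per 1M output tokens
    document_count: int
    avg_doc_tokens: int
    embedding_price_per_m: float   # USD per 1M embedded tokens
    reembeds_per_month: float      # 1.0 = full re-index once a month
    vector_storage_monthly: float  # assumed flat storage fee, USD
    vector_price_per_1k_queries: float
    monitoring_seats: int
    monitoring_fee_per_seat: float


def monthly_cost(c):
    """Return a per-factor monthly cost breakdown in USD."""
    generation = c.monthly_queries * (
        c.avg_input_tokens * c.input_price_per_m
        + c.avg_output_tokens * c.output_price_per_m
    ) / 1_000_000
    embedding = (
        c.document_count * c.avg_doc_tokens * c.embedding_price_per_m
        / 1_000_000 * c.reembeds_per_month
    )
    vector_db = (
        c.vector_storage_monthly
        + c.monthly_queries / 1000 * c.vector_price_per_1k_queries
    )
    monitoring = c.monitoring_seats * c.monitoring_fee_per_seat
    breakdown = {
        "generation": generation,
        "embedding": embedding,
        "vector_db": vector_db,
        "monitoring": monitoring,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder scenario; replace every value with your own quotes and volumes.
example = monthly_cost(RagCostInputs(
    monthly_queries=100_000, avg_input_tokens=1500, avg_output_tokens=300,
    input_price_per_m=2.00, output_price_per_m=8.00,
    document_count=50_000, avg_doc_tokens=800,
    embedding_price_per_m=0.02, reembeds_per_month=1.0,
    vector_storage_monthly=70.0, vector_price_per_1k_queries=0.5,
    monitoring_seats=5, monitoring_fee_per_seat=40.0,
))
```

Even with made-up inputs, the breakdown makes it obvious when a proposal has only priced the first line.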
The model decision and the cost decision are not separate.
GPT-4.1: $2/M input, $8/M output
GPT-4.1 Mini: $0.40/M input, $1.60/M output
5x cost difference. For most classification tasks: identical quality.
"Use the best model" is not a cost strategy. "Use the cheapest model that meets quality requirements" is.
AI Projects Aren’t Expensive Because of One API Call. They’re Expensive Because of the Full System.
The pattern I've seen repeat across 6 or 7 AI project kickoffs this year alone:
1. Team proposes an AI feature
2. Stakeholder asks "what will this cost?"
3. Engineer says "depends on usage, but it's cheap, LLM APIs are pennies per call"
4. Feature ships
5. Invoice arrives
The "pennies per call" calculation forgot: embeddings, vector database storage and queries, monitoring, guardrails, infrastructure, and human review. At any real scale, those "pennies" become thousands.
I got tired of watching teams discover this at invoice time. So I built something that shows them the number before they commit.
What "It's Just API Calls" Misses
Real AI systems have more cost components than the LLM API.
Embeddings. If you're building RAG, you're embedding every document and every query. text-embedding-3-small is still priced at $0.02 per million tokens, which sounds negligible in isolation. But re-indexing, chunking strategy, retrieval volume, and downstream infra are where teams start to feel the real system cost.
Vector database. Managed vector storage is rarely just “set and forget.” Pinecone, Weaviate, Qdrant Cloud, and hosted pgvector all have different cost curves once document volume, throughput, replication, and reliability requirements increase.
Monitoring and evals. Production systems need observability. That can mean trace tooling, eval pipelines, retention, alerting, and team seats. Useful, necessary, and often omitted from the first estimate.
Guardrails. Safety checks add latency and operational complexity, and depending on your stack they may add cost too. Teams usually notice this only after they move beyond a demo.
Human review. The moment a workflow needs QA, approvals, or escalation, AI cost stops being just API cost. It becomes workflow cost.
What I Built
6 project templates (chatbot, RAG knowledge base, content generation, code assistant, data analysis, custom). 12 LLM options with current pricing. Embeddings, vector databases, monitoring, guardrails, human review, all configurable.
Set your monthly query volume. Set your average token counts. Get an instant cost breakdown with optimization tips: "Switch from GPT-4o to GPT-4o-mini for batch classification, saves $X at your scale."
Email yourself the full report. No account, no data sent to any server.
The Number That Changes the Conversation
The most useful thing about having a cost estimate before a project starts isn't the number itself.
It's what the number does to the conversation.
"It's cheap, just API calls" is an answer that ends discussion. "$3,200/month at current scale, dropping to $1,100 if we use GPT-4o-mini for batch jobs and cache the embeddings" is an answer that starts engineering decisions.
That's the conversation I want clients and juniors to be having. Not after the invoice. Before the commit.
Try it: https://t.co/GgyKam0aZT
What's the biggest AI cost surprise you've encountered in a project?
#AI #CloudCosts #MachineLearning #AITools #ProductEngineering
The architecture pattern nobody talks about: Cache-Augmented Generation.
Small, static corpus (< 5,000 documents, rarely changes)? Skip the vector DB.
Load everything into context at startup. Zero retrieval latency. Zero retrieval failure. Simpler system with fewer moving parts.
Sometimes the right answer is "less architecture."
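A minimal sketch of the idea, assuming the whole corpus fits in the model's context window. The `llm` argument is a hypothetical callable that takes a prompt string and returns text (swap in your own client), and the character budget is a placeholder guard, not a real token count.

```python
def build_context(documents, max_chars=400_000):
    """Concatenate all documents once; fail fast if the corpus is too large."""
    parts = [f"[{name}]\n{text.strip()}" for name, text in sorted(documents.items())]
    context = "\n\n".join(parts)
    if len(context) > max_chars:
        raise ValueError("corpus exceeds the context budget; consider retrieval instead")
    return context


class CachedCorpusAnswerer:
    """Holds the full corpus in memory and prepends it to every question."""

    def __init__(self, documents, llm):
        self.context = build_context(documents)  # built once at startup
        self.llm = llm                           # hypothetical prompt -> text callable

    def ask(self, question):
        prompt = (
            "Answer using only the documents below.\n\n"
            f"{self.context}\n\n"
            f"Question: {question}"
        )
        return self.llm(prompt)


# Toy corpus and an echo function standing in for a real model call.
docs = {
    "refunds.txt": "Refunds are issued within 14 days.",
    "shipping.txt": "Orders ship in 2 business days.",
}
answerer = CachedCorpusAnswerer(docs, llm=lambda prompt: prompt)
sent_prompt = answerer.ask("How long do refunds take?")
```

There is no index to build and no retriever to evaluate. The trade-off is that every request carries the full corpus, so this only makes sense while the corpus stays small and stable.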
What's the simplest AI architecture you've shipped that actually worked well in production?
I've been having a hard time sharing what I've learned in enterprise work with the open-source community. Starting today, I will try to set aside dedicated time every week for this. Scaling GenAI solutions at the enterprise level is an interesting problem.