Benchmarked: H100 vs Gemini Flash. Result: H100 choked on prefill. Gemini flew. Without complex disaggregation, a single H100 is just an expensive heater. https://t.co/74t8Lwh4ex
@karanjagtiani04 Holding up well! After the first query warms the cache, we're seeing ~98% cost reduction.
The catch: your prefix (system prompt + context) needs to stay consistent. Change it, cache resets.
Quality improved too - full context catches cross-references that chunked RAG misses.
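A minimal sketch of the "keep the prefix consistent" rule, assuming the cache matches on an identical leading span of the prompt. The helper names `build_prompt` and `prefix_fingerprint` are hypothetical and not part of any SDK; the hash is only a local way to spot accidental prefix changes before sending a request.

```python
import hashlib


def build_prompt(system_prompt: str, documents: list[str], question: str) -> tuple[str, str]:
    """Return (prefix, full_prompt).

    Everything that should stay byte-identical between calls goes in the
    prefix; only the question is appended after it.
    """
    prefix = system_prompt + "\n\n" + "\n\n".join(documents)
    return prefix, prefix + "\n\nQuestion: " + question


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so a changed fingerprint flags a likely cache reset."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


docs = ["Manual A text", "Manual B text"]
p1, _ = build_prompt("You are a support assistant.", docs, "How do I reset the pump?")
p2, _ = build_prompt("You are a support assistant.", docs, "What is the torque spec?")
print(prefix_fingerprint(p1) == prefix_fingerprint(p2))
```

Logging the fingerprint per request makes it easy to see when a reordered document or an edited system prompt silently changed the prefix.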
I've been recommending RAG to save costs. Time to reconsider. Testing Gemini Flash 3 with 22 technical manuals + implicit caching (75% off repeated prefixes):
Full documents as system prompt became:
→ Cost-competitive with RAG
→ Higher quality responses
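A rough way to run the numbers yourself. Only the 75% discount on repeated prefixes comes from the post; the function name, token counts, and price below are placeholder assumptions, so plug in your own.

```python
def cost_per_query(
    prefix_tokens: int,
    query_tokens: int,
    price_per_million: float,
    cached_discount: float = 0.75,  # discount on repeated prefixes, from the post
    cache_hit: bool = True,
) -> float:
    """Estimate input cost for one query with a (possibly cached) prefix."""
    rate = price_per_million / 1_000_000
    prefix_rate = rate * (1 - cached_discount) if cache_hit else rate
    return prefix_tokens * prefix_rate + query_tokens * rate


# Placeholder numbers: a 500k-token prefix, 200-token question, $1 per million tokens.
cold = cost_per_query(500_000, 200, 1.0, cache_hit=False)
warm = cost_per_query(500_000, 200, 1.0, cache_hit=True)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

This covers input tokens only; output tokens and any retrieval infrastructure on the RAG side would need to be added for a full comparison.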
📑 Markdown/Header: Splits by structure
→ Respects semantic boundaries
→ Uneven chunk sizes - which affects retrieval consistency and costs.
All splitters have flaws - you're just choosing which flaw to accept.
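To see the uneven-chunk problem in practice, here is a minimal header splitter sketch. `split_by_headers` is a hypothetical helper, not a library function, and it assumes ATX-style `#` headers; it would also mis-split `#` comment lines inside fenced code.

```python
import re


def split_by_headers(markdown: str) -> list[str]:
    """Split a Markdown document into chunks, one per header section."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A new header closes the section collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Overview\nShort intro.\n## Installation\n" + "Step details. " * 40
for chunk in split_by_headers(doc):
    print(len(chunk), chunk.splitlines()[0])
```

The printed lengths show how far apart section sizes can drift, which is the trade-off you accept with this splitter.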
For technical support chatbots with stable system prompts? Full Documents + caching might be the better architecture. Caching changes the ROI calculation completely.
Retrieval: If you are dealing with non-English technical RAG, don't struggle with generic local models. The specialized multilingual APIs are worth every millisecond of latency saved.⚡️
#AI #RAG
Running a high-performance multilingual reranker on a 1 vCPU instance is a recipe for latency. The local ms-marco model was "good enough," but I needed "perfect" handling of foreign technical terminology without crashing the server.
The new reranker is so effective it retrieved too many similar manuals. I countered this with a System Prompt "Phase Filter" that explicitly instructs the LLM to verify requirements against the retrieved specs.
Model: Gemini 3.0 Pro (Preview)
Embeddings: gemini-embedding-001
Reranker: ms-marco-MiniLM (Local)
The accuracy difference is night and day. If you aren't reranking, you aren't really searching. 🚀
#AI #RAG #Engineering #Gemini #LLMs
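The shape of the rerank step, as a sketch: retrieve candidates first, then re-score each (query, document) pair and keep the best. The `token_overlap` scorer here is a deliberately crude stand-in so the example runs without dependencies; it is not the ms-marco-MiniLM model, and in a real pipeline a cross-encoder score would be passed in as `score` instead.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 3,
) -> list[str]:
    """Re-order retrieved candidates by a pairwise relevance score."""
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]


def token_overlap(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


retrieved = ["pump pressure reset procedure", "warranty terms", "pump installation"]
print(rerank("reset pump pressure", retrieved, token_overlap, top_k=2))
```

Keeping the scorer as a parameter makes it straightforward to swap a local model for a hosted reranking API without touching the rest of the pipeline.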
The LLM kept "lazy-loading" generic definitions from general category files instead of reading the specific manuals. I injected a "Critical Instruction" into the System Prompt. Result: It now ignores the easy answer and digs for the specific brands.