No think tokens generated → no billing, no KV cache inflation. Caveat: there's a known KV cache reuse bug with multi-turn conversations (fixed in vLLM 0.9.0; check your provider's backend version).
If you're using Qwen3 and stripping `<think>` blocks after generation, you're paying for tokens you never use. They also slow down your decoder. Here's the mechanism.
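A rough way to size the "KV cache inflation": during decoding, every generated token, including think tokens you later strip, keeps a key and a value vector per layer and KV head in the cache. Each new token then attends over that whole cache, so more cached tokens mean more memory traffic per decode step. The sketch below uses the standard estimate for multi-head / grouped-query attention. The model config is illustrative only and is not taken from any specific Qwen3 checkpoint.

```python
def kv_cache_bytes(tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache footprint for `tokens` cached tokens.

    Factor 2 = one key and one value vector per layer per KV head.
    bytes_per_elem=2 assumes an fp16/bf16 cache; quantized caches are smaller.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * tokens


# Illustrative config (an assumption, not a real checkpoint's numbers):
# 36 layers, 8 KV heads, head_dim 128, bf16 cache.
think_tokens = 1_000
extra = kv_cache_bytes(think_tokens, num_layers=36, num_kv_heads=8, head_dim=128)
print(f"{think_tokens} discarded think tokens hold ~{extra / 2**20:.0f} MiB of KV cache")
```

The cost scales linearly with the number of think tokens, so a long reasoning trace that you throw away still occupies cache for the whole request.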
How to verify: call the API with include_reasoning=True in extra_body. OpenRouter returns a separate reasoning field. Count its tokens and compare against completion_tokens: that reasoning share of the bill is what you're paying for and then discarding.
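A minimal sketch of that comparison. It assumes the reasoning text arrives in `choices[0].message.reasoning` and the billed output count in `usage.completion_tokens`; both are assumptions about the response shape. The request itself would carry `extra_body={"include_reasoning": True}` as described above. This sketch only processes the parsed response body, so it runs offline. The whitespace token count is a crude placeholder. For real numbers, use the model's own tokenizer.

```python
from typing import Callable, Optional


def reasoning_overhead(response: dict,
                       count_tokens: Optional[Callable[[str], int]] = None) -> dict:
    """Estimate how much of the billed completion went to discarded reasoning.

    Assumes an OpenAI-style chat completion body where the reasoning text is in
    choices[0]["message"]["reasoning"] and billed output is usage["completion_tokens"].
    """
    # Placeholder tokenizer: whitespace split. Swap in the real tokenizer.
    count = count_tokens or (lambda text: len(text.split()))

    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning") or ""
    completion_tokens = response["usage"]["completion_tokens"]
    reasoning_tokens = count(reasoning)

    return {
        "completion_tokens": completion_tokens,
        "reasoning_tokens": reasoning_tokens,
        "answer_tokens": max(completion_tokens - reasoning_tokens, 0),
        "discarded_share": reasoning_tokens / completion_tokens if completion_tokens else 0.0,
    }


# Toy response body shaped as assumed above (values are made up).
example = {
    "choices": [{"message": {"content": "42",
                             "reasoning": "the user wants the answer so compute it"}}],
    "usage": {"completion_tokens": 10},
}
print(reasoning_overhead(example))
```

Keeping the token counter pluggable lets you start with the crude estimate. You can then swap in the exact tokenizer without touching the comparison logic.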
Check out my latest article: Two Weeks, One Benchmark, Six People: What We Actually Learned Building a Production Data Agent https://t.co/pz73Aw14iz via @LinkedIn
The hardest part wasn't technical. It was discipline.
Build the harness before optimising the agent. Write the KB before writing the code. Log every failure honestly.
The harness is the product. The agent is what the harness improves.
Two weeks ago we started building a data agent from scratch.
Today we're submitting to UC Berkeley's DataAgentBench.
Here's what we actually built and what we learned the hard way.
When a query fails, the agent doesn't just surface the error.
It diagnoses the cause (wrong database? wrong join key? missing domain knowledge?), then rewrites the query and retries.
Every failure gets logged. Every log entry makes the next run better.
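One way to picture that diagnose, log, rewrite, retry loop is the minimal sketch below, using an in-memory SQLite database. It is not the team's implementation. The keyword-based `diagnose` and the `naive_rewrite` fix are hypothetical stand-ins for what would more likely be LLM calls backed by schema and KB context.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def diagnose(error_message: str) -> str:
    """Hypothetical heuristic mapping an error to the causes named in the post."""
    msg = error_message.lower()
    if "no such table" in msg or "unknown database" in msg:
        return "wrong_database"
    if "datatype mismatch" in msg or "join" in msg:
        return "wrong_join_key"
    return "missing_domain_knowledge"


def run_with_self_correction(
    query: str,
    conn: sqlite3.Connection,
    rewrite: Callable[[str, str], str],
    failure_log: List[dict],
    max_attempts: int = 3,
) -> Tuple[bool, Optional[list]]:
    """Execute; on failure diagnose, log, rewrite, and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return True, conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            cause = diagnose(str(exc))
            # Log every failure so later runs (and humans) can learn from it.
            failure_log.append({"attempt": attempt, "query": query,
                                "error": str(exc), "diagnosis": cause})
            query = rewrite(query, cause)
    return False, None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales_orders VALUES (1, 9.5)")


def naive_rewrite(query: str, cause: str) -> str:
    # Hypothetical fix: point at the table that actually holds the orders.
    if cause == "wrong_database":
        return query.replace("FROM orders", "FROM sales_orders")
    return query


log: List[dict] = []
ok, rows = run_with_self_correction("SELECT id, amount FROM orders", conn, naive_rewrite, log)
print(ok, rows, log[0]["diagnosis"])
```

The first attempt fails with SQLite's "no such table" error. That failure is logged as a `wrong_database` diagnosis, and the rewritten query succeeds on the second attempt.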
We're documenting everything as we build: the failures as much as the wins.
Open question we're still working through: can join key mismatch detection be automated? Or does it always need a human to inspect the data first?
Full write-up: https://t.co/r5rfLE3Gz4
Most AI demos show a clean question going in and a clean answer coming out.
What they don't show: the same customer stored as integer 10023 in one database and 'CUST-10023' in another.
We hit this on Day 1 building a data agent for real enterprise data.
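For the integer-vs-`'CUST-10023'` case, one possible heuristic flags key columns that barely overlap as-is but line up once both sides are normalized. This is a sketch, not a settled answer to the open question about automating join-key mismatch detection. The digits-only `normalize_key` is a hypothetical rule, and it would wrongly merge keys like `'A-1'` and `'B-1'`. That is why a human check on the data may still be needed.

```python
import re
from typing import Iterable, List, Optional


def normalize_key(value) -> Optional[str]:
    """Hypothetical normalizer: keep digits only, e.g. 'CUST-10023' -> '10023'."""
    digits = re.sub(r"\D", "", str(value))
    return str(int(digits)) if digits else None


def overlap(left: Iterable, right: Iterable) -> float:
    """Share of the smaller key set that also appears in the other set."""
    a, b = set(left), set(right)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def detect_key_format_mismatch(left_keys: List, right_keys: List,
                               threshold: float = 0.8) -> dict:
    """Flag columns that barely join as-is but match once normalized."""
    raw = overlap([str(k) for k in left_keys], [str(k) for k in right_keys])
    normalized = overlap(
        [n for n in map(normalize_key, left_keys) if n is not None],
        [n for n in map(normalize_key, right_keys) if n is not None],
    )
    return {"raw_overlap": raw, "normalized_overlap": normalized,
            "likely_format_mismatch": raw < threshold <= normalized}


orders_db = [10023, 10024, 10025]
crm_db = ["CUST-10023", "CUST-10024", "CUST-10025"]
print(detect_key_format_mismatch(orders_db, crm_db))
```

The threshold and the normalization rule are the judgment calls. Both would need tuning against real data before anyone trusts the flag.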
This week taught me something:
Query generation is the easy part.
The hard part is everything the agent needs to know BEFORE it writes the query.
That's context engineering. That's what actually closes the gap between demo and production.
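A minimal sketch of that "know it before you write the query" step. It pulls relevant knowledge-base notes into the prompt before query generation. The `KBEntry` shape, the keyword matching (a stand-in for whatever retrieval you actually use), and the revenue definition are all hypothetical. The customer note restates the join-key example from the Day 1 post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KBEntry:
    topic: str            # table, column, or business term
    keywords: List[str]   # words in a question that should pull this entry in
    note: str             # what the agent must know before it writes SQL


def build_context(question: str, kb: List[KBEntry], max_entries: int = 5) -> str:
    """Select KB notes relevant to the question and format them for the prompt."""
    q = question.lower()
    hits = [e for e in kb if any(k.lower() in q for k in e.keywords)]
    if not hits:
        return "Known context: none found."
    lines = [f"- [{e.topic}] {e.note}" for e in hits[:max_entries]]
    return "Known context:\n" + "\n".join(lines)


kb = [
    KBEntry("customers", ["customer", "client"],
            "customer_id is an integer (10023) in the orders DB but text "
            "('CUST-10023') in the CRM DB; normalize before joining."),
    KBEntry("revenue", ["revenue", "sales"],
            "Hypothetical definition: revenue = sum(amount) excluding refunds."),
]
print(build_context("Total revenue per customer last quarter?", kb))
```

The query generator then receives this block alongside the question, so the join-key quirk is known before any SQL is written.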
Week 4 at @10acad: built a “Brownfield Cartographer” to analyze unfamiliar codebases. Tried adding a UI, API, and Docker, which made debugging much harder and took a lot of time. Also hit LLM API quota limits, so some features didn’t fully run.
#10Academy #My10AcademyTRP1Week4
Week 3 at TRP-1 @10acad: worked on a document intelligence pipeline to process messy PDFs and reports. After indexing, we could query the documents and retrieve exact answers with their source. Real-world data is messy. Learned a lot this week. #10Academy #My10AcademyTRP1Week3