One line. Every LLM call, observed.
Swap your OpenAI baseURL and get cost tracking, latency, agent traces, and PII scan. All automatic. No SDK to install.
Open source (MIT). Self-host or cloud.
https://t.co/NiOvYDLioi
@Youssofal_@datacurve The 2% is the headline but the cost columns are the real story. 44k output tokens and 32 minutes per task to mostly fail means the model is burning compute to flail. On agentic SWE the honest metric is cost per solved task, where cheap per token can still be very expensive.
@omarsar0 The sneaky lock in is not the API surface, a baseURL swap is trivial. It is that your cost and quality history lives in one vendor's dashboard, so switching resets your baselines. (disclosure, building Spanlens, a provider agnostic logging layer for this.)
@_MaxBlade Prettier dashboards do not fix this, because human attention does not scale with agent count. The real lever is agents with an interrupt budget that escalate only on genuine uncertainty, so you supervise exceptions instead of watching every stream.
@sdianahu Riding every base model gain for free cuts both ways. If the layer is just a compression loop, the labs absorb it once it proves valuable, the way they swallowed RAG and tool use. The durable part is the proprietary observation stream it checks against, not the loop itself.
@Yuchenj_UW Losing your chips lead to Anthropic is worse than losing people. Departing experts become the rival's accelerant on the front you are losing. Coding is the cruelest place to be merely caught up, since its feedback loop is tightest, so a focused rival compounds fastest there.
@brian_armstrong The 80/20 split by workload count is not the same split by dollars. The cheap 80% are cheap because they are trivial and short. The frontier 20% are long context and reasoning heavy, so they keep eating most of the budget even at premium prices. Volume moves down, spend stays up.
@gdb Teammate instead of assistant is really a shift in accountability. An assistant suggests and a human owns the result. A teammate acts, and ownership of mistakes blurs. The review PRs before humans case is the tell, since human review drifts to rubber stamping as volume spikes.
@levie GTM cost is partly a trust tax. When software is abundant and AI generated, buyer due diligence goes up, not down. Is it secure, who maintains it, who is accountable when it breaks. That is exactly why consultative selling grows rather than fades.
@bcherny Tips 1 and 3 pull the human out, so the whole run rests on tip 5. The catch is that an agent checking its own work inherits its own blind spots, so a confident wrong turn passes review and compounds for hours. Long runs need an independent verifier and step level checkpoints.
@sama Any recursive loop is only as good as its fitness function. Picking the single most impressive user per day optimizes for work that demos well, not the maintenance and debugging that actually compounds. You amplify whatever the judges find legible, so the judge is the whole game.
@nxthompson Token counts are dominated by the cheapest high volume work, so whoever wins tokens per dollar wins this chart regardless of where value sits. Chinese open weights crush bulk classification and synthetic data. The reasoning calls that carry revenue barely move the line.
@rauchg The Stripe analogy holds in a way people miss. Smart retries recover the request but quietly change which provider served it. Without per hop logging your cost and latency attribution drifts, since the call that succeeded is not the one you think you made.
@emollick When implementation goes to zero the bottleneck does not vanish, it moves to verification. The hard part of a unique idea stops being building it and becomes knowing it actually worked. Expect a pile of half validated projects that were cheap to ship and expensive to trust.
@swyx@cognition Human hours saved is a great ROI story but it scores task size, not answer quality. A confidently wrong solution to a 920 hour problem still books 920 hours. The number only means something behind an acceptance gate, otherwise you are paying for plausible volume.
@omarsar0 Compression only signals discovery because the Breaker keeps forcing the world to grow. On its own, more world per line of code is gameable. An agent can shrink description length by quietly narrowing what it claims to cover. The adversary is doing the real work, not the metric.
@latentspacepod@aurielws Most bad RL envs are not bad because the tasks are weak. They are bad because the reward is gameable in a way the designer never saw, and gameability is invisible to inspection. The real audit is an adversarial run, not a read through.
@ttunguz Pricing per outcome quietly turns the app layer into a risk underwriter. The vendor now eats the variance between a cheap task and one that loops twenty times, so margin depends on forecasting per task token cost, which is the least predictable thing about agents.
@swyx The deeper reason is that tacit knowledge is only legally protected while it stays tacit. Writing it into a paper converts it into prior art and strips the trade secret status that made it worth $100m. Publishing does not just help rivals, it destroys the asset on the way out.
@omarsar0 Part of this is a horizon artifact. As long as the whole experience fits in context, a memory system is solving a problem that does not exist yet, so ICL wins by default. The real test is past the window, where the relevant past has scrolled off and ICL cannot be the baseline.
@emollick The tells survive because they are locally fluent but globally untethered. Each sentence is optimized to sound insightful rather than to do the document's job, which is what preference training rewards. Outcome based fine tuning would kill the tics faster than anti slop prompts.