Tool calling is at the core of agentic systems, but it is so brittle with OSS models that retries are baked into every pipeline.
Our new product dotlambda guarantees that tool calls execute 100% of the time. 65% token savings, up to 20% accuracy 📈
👉 https://t.co/rmfLDickQZ
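For context, the "retries baked into every pipeline" pattern usually looks something like the sketch below. It is a minimal illustration, not dotlambda's method or any particular framework's API: `parse_tool_call`, `call_with_retries`, and the `{"name": ..., "arguments": ...}` shape are all assumed names chosen for the example.

```python
import json


def parse_tool_call(raw: str) -> dict:
    """Parse raw model output as a tool call; raise ValueError if malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    # Assumed schema: a JSON object with "name" and "arguments" keys.
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        raise ValueError("missing 'name' or 'arguments'")
    return call


def call_with_retries(generate, max_attempts: int = 3) -> dict:
    """Call generate() until it yields a well-formed tool call or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate()  # hypothetical stand-in for one model sampling call
        try:
            return parse_tool_call(raw)
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed, then retry
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {last_error}")
```

Every failed attempt costs a full extra generation, which is why retries get expensive in tokens.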
The obvious reasons intelligence-per-watt is going up so fast: more efficient architectures, more efficient hardware, and higher quality data.
The less obvious reason: finding the right balance on what should be stored in the model's weights and what can be computed through tool use, reasoning, and potentially other types of in-context learning.
A simple example: in the earlier LLM days, it was quite likely that for simple arithmetic (e.g. adding two numbers), the model had to basically memorize tuples of (inputs, op, outputs). You can imagine this took up a lot of room in the weights.
With reasoning the model can compute this in its chain-of-thought. With tool calling the model can compute this with a tool call. In both cases it saves a lot of space in the weights.
I'm sure there is a floor on the smallest LLM that can have say GPT 5.x quality. But that floor could be 5B, it could be 100B. And I don't think anyone really knows because of the above effects.
In other words we can probably go much further with a 5B-15B model with exceptional tool calling and reasoning.
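The arithmetic example above can be sketched in a few lines. This is only an illustration under assumed names: the `TOOLS` registry, `run_tool_call`, and the call format are hypothetical, not any specific model's or vendor's tool-calling API.

```python
import operator

# Hypothetical tool registry; names and call schema are illustrative only.
TOOLS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}


def run_tool_call(call: dict):
    """Execute a call shaped like {"name": ..., "arguments": {"a": ..., "b": ...}}."""
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    args = call["arguments"]
    return func(args["a"], args["b"])


# The model does not need (inputs, op, output) tuples in its weights;
# it only has to emit a small structured request like this one.
call = {"name": "add", "arguments": {"a": 123456789, "b": 987654321}}
print(run_tool_call(call))  # prints 1111111110
```

The computation lives in the tool, so the weights only have to encode when to call it and how to fill in the arguments.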
Worst part about new agentic coding editors? Unfamiliar keyboard shortcuts. Anthropic's seem particularly strange: meta-P to switch model in Claude Code 🤔 What's your worst culprit?
Bullshit benchmark - how good LLMs are at detecting nonsense questions & pushing back:
- Latest @AnthropicAI models are doing well, including Haiku
- @Alibaba_Qwen Qwen 3.5 and @Kimi_Moonshot Kimi K2.5 are pretty decent too
- @OpenAI and @GoogleDeepMind are middle of the pack - not great for mainstream models
- Lots of other slightly older and smaller models engage with 70%+ of bullshit questions
@steipete @VibeTunnel GitHub Desktop “Plus” now seamlessly integrates worktrees, works out of the box. Found my cursor/claude code worktrees without a problem. See https://t.co/jzL7nyJuJF
Wait, this sounds incredibly useful! Can we just have a model with 0 entropy, 0 hallucinations, that just acts like a retrieval database over its training dataset? Also sounds like a great way to solve the traceability problem. Why don't the AI labs just make something like that?
@vikhyatk Here, they go further than simply producing "novel" candidates: "the model’s in silico prediction was confirmed multiple times in vitro [lab]". This seems good. Still, many in-vitro experiments fail later in the process. But they're still published and "advance science".
Next time I get asked which AI I choose, I will align myself with “the tastemakers” and give the reason in partial French: "consumers—particularly the tastemakers—are drawn to its certain _je ne sais quoi_ in conversation and thoughtful design” https://t.co/pXIawNGfd3
@GaelVaroquaux @scikit_learn For example, `LogisticRegressionCV()` is my favorite (and most useful) single line of ML code, anywhere – although big shout-out to `train_test_split()` ;)
One form of future shock is being paralyzed by new tech shown every week. Learn to put that in the background, & focus on shipping real products using tech of today, with an eye towards future ways to transform what you’re doing. Ship today, transform what you’re doing tomorrow.