I wrote a new agentic text-to-SQL benchmark and tested every local model I could against it: https://t.co/SDQ9fTwmyG
Thanks to DuckDB WASM you can try your own models from the browser.
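A common way to grade text-to-SQL answers is execution match: run the model's query and a reference query against the same database, then compare the result sets. The sketch below assumes that grading rule (the benchmark's actual grader isn't described here) and uses Python's built-in `sqlite3` as a stand-in for DuckDB. The `execution_match` helper, the `orders` table, and the queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> bool:
    """Return True if candidate_sql yields the same rows as gold_sql, ignoring row order."""
    try:
        candidate_rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to run counts as a wrong answer.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets: row order is ignored, duplicate rows still count.
    return Counter(candidate_rows) == Counter(gold_rows)


# Hypothetical toy database standing in for the benchmark's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 25.0), (3, "EU", 5.0)],
)

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
good = "SELECT region, SUM(amount) AS total FROM orders GROUP BY region ORDER BY total DESC"
bad = "SELECT region, COUNT(*) FROM orders GROUP BY region"

answers = [(good, gold), (bad, gold), ("SELEC nonsense", gold)]
score = sum(execution_match(conn, cand, ref) for cand, ref in answers)
print(f"{score}/{len(answers)}")
```

Comparing result sets rather than SQL text lets a model phrase a correct query however it likes (extra aliases, a different `ORDER BY`) and still get credit.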
@thomas_thoresen @jeremyphoward There is.
I don't know the link, but the agent interface is great.
Point your agent at it and it will work it out.
https://t.co/or4dOpIH2e
@igorcosta Great talk, Igor. Was wondering if you've looked at techniques like https://t.co/C79Xm1huLp and how they compare to your HRM work?
Super interested to see you move HRMs beyond the ARC benchmarks.
New agentic SQL benchmark results.
MiniMax 3: 23/25, $0.04, 369 sec
StepFun 3.7: 21/25, $0.06, 254 sec
MiniMax lost a lot of time stuck on Q6, but otherwise a great-looking model.
https://t.co/KwZDN1MFxO
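The raw numbers above are easier to compare once normalized per question. A minimal sketch, assuming the cost and time figures cover the whole 25-question run; the `Result` class is a hypothetical helper, not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    solved: int
    total: int
    cost_usd: float
    seconds: float

    @property
    def accuracy(self) -> float:
        return self.solved / self.total

    @property
    def cost_per_solved(self) -> float:
        # Total run cost spread over correctly answered questions only.
        return self.cost_usd / self.solved

    @property
    def seconds_per_question(self) -> float:
        # Wall-clock time averaged over every question attempted.
        return self.seconds / self.total


results = [
    Result("MiniMax 3", 23, 25, 0.04, 369),
    Result("StepFun 3.7", 21, 25, 0.06, 254),
]

for r in results:
    print(
        f"{r.model}: {r.accuracy:.0%} correct, "
        f"${r.cost_per_solved:.4f} per solved question, "
        f"{r.seconds_per_question:.1f} s per question"
    )
```

On these figures MiniMax comes out at about $0.0017 per solved question versus about $0.0029 for StepFun, but averages about 14.8 s per question against 10.2 s; an average like this hides where the time went, such as the stall on Q6.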
I've been using SCAD a lot lately via AI and find Codex is much better than Opus.
My tasks are much easier than this, but I don't know CAD at all so have to rely on the model's interpretation of my own terms.
Surprised people don't know the reason for this.
Claude Teams only goes up to 150 seats before you have to switch to API billing.
It also maxes out at 6.5x the Pro plan's usage (there is no 20x Claude Teams plan).
Can anyone explain to me why companies don’t just give employees $100 / month Claude Code or Codex plans instead of paying per token? There has to be an explanation, because this keeps happening and doesn’t make sense otherwise
@ade_oshineye Have you tried putting raw Mermaid code into an image model? 😀
Not there for complex ones yet, but for simple to medium ones, nano-banana and GPT-Image do great...
I keep seeing headlines about layoffs driven by AI automation… and honestly, I don’t fully understand it yet.
At @airwallex, we’ve dedicated almost all of our engineering resources to building customer-facing AI products and infrastructure. We barely even have enough engineers working on internal automation yet.
The demand for AI talent inside our company has gone up, not down.
Maybe I’m missing something, but right now it feels like AI is creating more product opportunities, more engineering demand, and more ambitious problems to solve, not fewer.
What are others seeing?
@championswimmer That's a reasonable attempt at balancing cache and compaction. I think it's better to think of it as a less aggressive form of compaction, though. It still drops large amounts of the cache.
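Why gentler compaction can still drop most of the cache: a minimal sketch, assuming the cache is matched on an exact prompt prefix, so reuse stops at the first token that differs. The approach in the thread isn't shown here; one "token" per message and the names `reusable_prefix`, `cached`, and `strategies` are hypothetical simplifications.

```python
def reusable_prefix(cached: list[str], new: list[str]) -> int:
    """Count leading tokens shared by the cached prompt and the new prompt."""
    shared = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break  # a prefix cache stops matching at the first difference
        shared += 1
    return shared


# Hypothetical context, one "token" per message to keep it readable.
cached = ["system", "tools", "m1", "m2", "m3", "m4", "m5", "m6"]

strategies = {
    # Aggressive compaction: summarize the whole conversation.
    "full compaction": ["system", "tools", "summary(m1-m6)"],
    # Gentler compaction: summarize only the oldest turns, keep recent ones verbatim.
    "partial compaction": ["system", "tools", "summary(m1-m3)", "m4", "m5", "m6"],
}

for name, new in strategies.items():
    kept = reusable_prefix(cached, new)
    dropped = len(cached) - kept
    print(f"{name}: {kept} cached tokens reused, {dropped} invalidated")
```

Both strategies invalidate the same six cached entries: because the summary lands before m4 to m6, those kept messages sit after the first difference and must be reprocessed. The partial approach preserves more context, not more cache.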