@scaling01 So much room to optimize spend without loss of performance on general tasks. At ¬◇ we’ve written about this before, and will announce some work soon to solve it. https://t.co/pUoy4uWTml
Impressive work by the https://t.co/ov08dSJ4Vy team. Performance is competitive with Opus 4.8 on a number of significant benchmarks, and competitive for 2nd-best on many others.
Curious about the practical effectiveness of effort level - tuning thinking tokens can be tricky.
Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
@jasondeanlee Are you using some proxy or tunnel to https://t.co/bDcBg74M2b to use GPT-5.5 Pro via Codex?
Using Codex, by default you're likely using GPT-5.5 non-pro, which (IME) really cannot prompt GPT-5.5 Pro effectively.
Also curious whether Anthropic ran the (public safety classifier + model) combo in internal benchmarks.
It wouldn't be the first time they don't test on public builds. From the April postmortem on quality:
> We are going to do several things differently to avoid these issues: We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features)...
We prioritize evaluations of new models as part of our routing efforts at Not Diamond. Claude Fable 5 is the first time a model has actively refused to run our benchmark tasks.
E.g., 11 tasks in TBench 2.0 were refused by Fable 5 on the basis of bioweapon or cybersecurity risks.
Clearly, internal Claude benchmarks don't face the same safety guardrails, but it's interesting to consider whether these tasks should be omitted, scored as failures, etc.
anthropic doesn't owe anyone "frontier capabilities".
none of the labs do.
they are all simply selling a product, or a story, that people pay for.
that aside, the more telling bit is how far anthropic is willing to go to secure a narrative around "capability slowdown", post a massive raise, before an ipo, and with enterprise contracts rising for those rich enough to pay to similarly keep up the image of "powered-by/secured-by agentic AI".
with the amount of capex spent so far, this was never meant to be some democratizing technology "for the people".
this is all simply just business.
My thoughts on the future of model routing and AI:
- We have not even scratched the surface of runaway inference costs
- Solving this requires intelligent model routing, especially as the inference landscape continues fragmenting. This is a *hard* problem.
- Naive solutions (turn-based routing, session routing) fail; routing successfully involves managing multiple cost surface areas in concert.
- Getting routing right means a more diverse market of providers, more power for consumers, reduced ecological impact, and improved effectiveness.
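To make the cost/quality tradeoff in the bullets above concrete, here is a minimal sketch of per-request routing. Everything in it is an assumption for illustration: the model names, prices, quality scores, and the `estimate_difficulty` heuristic are made up. It is also exactly the kind of naive per-request baseline the bullets argue is insufficient on its own, not a description of any production router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                 # hypothetical model label
    usd_per_1k_tokens: float  # made-up price
    quality: float            # made-up capability score in [0, 1]


# Illustrative catalog; none of these numbers come from a real provider.
MODELS = [
    Model("small", 0.10, 0.60),
    Model("medium", 0.50, 0.80),
    Model("large", 3.00, 0.95),
]


def estimate_difficulty(prompt: str) -> float:
    """Toy stand-in for a learned predictor: longer prompts count as harder."""
    return min(1.0, len(prompt.split()) / 200)


def route(prompt: str, models=MODELS) -> Model:
    """Pick the cheapest model whose quality score meets the estimated difficulty."""
    need = estimate_difficulty(prompt)
    capable = [m for m in models if m.quality >= need]
    if not capable:
        # Nothing clears the bar, so fall back to the strongest model.
        return max(models, key=lambda m: m.quality)
    return min(capable, key=lambda m: m.usd_per_1k_tokens)
```

A real router would also have to account for session state, caching, and latency, which is where the "multiple cost surface areas" come in.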
More in the full essay:
@ShcChy @vikhyatk @GergelyOrosz Because it would be insanely unprofitable to ask market-salaried engineers to label data at the scale these labs need.
So instead they use a tiered system: outsource the broader labeling and calibrate quality in-house. This is a core focus for entire eval teams at labs.
Those are not contradictory statements though.
With analytics, half the battle is building out the fact and dimension layer in the warehouse, and those become the LLM’s “data classes” for analytics.
If you pin those down then it is, in fact, possible to automate most queries away, because the queries atop that layer become rote.
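A small sketch of why queries become rote once the fact and dimension layer is pinned down. The table names, columns, and the `revenue_by` helper are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

# In-memory database with one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    region TEXT
);
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount_usd REAL
);
INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'AMER');
INSERT INTO fact_orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0);
""")

ALLOWED_DIMENSIONS = {"region"}  # whitelist, since the column name is interpolated


def revenue_by(dimension_column: str) -> list[tuple]:
    """Every 'revenue by X' question reduces to the same join plus GROUP BY."""
    if dimension_column not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension_column}")
    sql = (
        f"SELECT d.{dimension_column}, SUM(f.amount_usd) "
        "FROM fact_orders f JOIN dim_customer d USING (customer_id) "
        f"GROUP BY d.{dimension_column} ORDER BY d.{dimension_column}"
    )
    return conn.execute(sql).fetchall()
```

With the schema fixed, an LLM only has to choose a measure and a dimension rather than write free-form SQL.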
Problem: the Marlins brand has been tarnished by abysmal management since the late 90s.
As a Miami native I quit after the third (!!) selloff in 2005. I only visit the stadium for monster truck rallies w/ my 6-yo.
Team would need a _very_ long commitment to win back fans here. We all remember.
The fallacy of this is that more creates more. More hours, more hiring, more something.
And it is true in a sense. If you put in more work, more work will happen. But I think for most startups, the leverage is really in how differently you approach the problem, how well you cultivate your team, and the strategy.
Any large company can outspend you on hours. They have thousands or tens of thousands more people, spending more hours. If hours worked were the metric, every large company and government organization would always win and do the best work. More hours, better output.
This thinking is often representative of younger founders, where the startup becomes their identity and life. They have a hard time doing anything else, and cannot understand that your work is not who you are. But activities outside of work can grow you as a person too and make you do better work.
I’ve never worked this way. As a designer, I always saw the need to take a step back, to take a break. At times, I might work 12 hours or 16 hours, or whatever amount was needed, but it wasn’t the norm. You just can't grind design; you need inspiration. Taking that step away from the work would give me more perspective and inspiration, and I could approach the problem differently or just see the solution.
Grinding is never good for any creative problem, and startups or creating new products are often mostly about creative problem solving. Grinding works OK for email jobs, or where you are just executing on a very clear playbook.
With Linear, we’ve never worked this way. We work reasonable hours, 5 days a week. All of us founders have families. Many of our employees have families. I personally stop every evening, spend time with the family, cook dinner for the family, eat dinner together, and focus on things outside of work. Sometimes I work in the late evenings or weekends, but to me the pride is that I don’t need to. The company should be successful without it.
My goal is to build a company that is sustainable in the long term, and doesn’t require heroics or personal sacrifices every single day.
There are times when our team is heroic. Launches, incidents, some other work that just needs to be done. They will work late into the night because they know it is the right thing. But we don’t require that every day or every week, and the more this happens, the more I think it is a failure of our company and leadership. The team and the leaders should always keep a reserve to use when something is needed.
Our thinking was also that quality, which we value, doesn’t emerge from working more or stressing people more. It emerges when you create the conditions for it to emerge. Often it is the appreciation, space, time, and how the person feels. A person who is rested will do better work.
I wouldn’t attribute much of our success to working a lot. The success came from having clear thinking, ideas, and focus to do the right things.
I sometimes wish we could move the culture more toward a Zen master.
Real mastery is not exerting the most effort. It is achieving the outcome with the least necessary effort.
On the @theallinpod, @Benioff describes why routing is the next layer of enterprise AI infra—and how it will save billions of dollars.
We've been building exactly this at @notdiamond for two years. Largest vendor of intelligent routing in the world. @Benioff, we should chat!
I strongly feel that the frontier lab QC bar for RL data has to become more load-bearing during procurement.
The contract non-renewals I've been hearing about across labs (for poor quality) often come down to the fact that most vendors run zero categories of active testing, ship without verifier FP/FN audits, can't produce pass@k distributions across three models, have no contamination story, etc.
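For readers unfamiliar with two of the checks named above, here is a minimal sketch. `pass_at_k` uses the unbiased estimator commonly used in code-generation evals, 1 - C(n-c, k) / C(n, k) for n samples with c correct. `verifier_error_rates` is a hypothetical helper that compares verifier verdicts against human gold labels. The post does not say how any lab or vendor actually computes these, so treat both as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def verifier_error_rates(verdicts: list[bool], gold: list[bool]) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) of a verifier.

    verdicts[i] is True when the verifier accepted solution i;
    gold[i] is True when a human judged solution i correct.
    """
    fp = sum(v and not g for v, g in zip(verdicts, gold))
    fn = sum(g and not v for v, g in zip(verdicts, gold))
    negatives = sum(not g for g in gold)
    positives = sum(bool(g) for g in gold)
    fp_rate = fp / negatives if negatives else 0.0
    fn_rate = fn / positives if positives else 0.0
    return fp_rate, fn_rate
```

Running `pass_at_k` per task for each of several models gives the kind of pass@k distribution the post refers to: tasks that every model solves or no model solves carry little training signal.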
My sense is that as RL/SFT data markets become more formalized into 2027, a lot of these contracts, along with banal synth data (garbage in, QA aggressively, garbage out), will be cut.
The small set of vendors who have built the QC infra internally (mostly research-dense teams) are pricing 2-3x what their commodity peers can charge for nominally similar tasks, at least for RL data.
Wrote a bit about this on my blog.
Our model router now supports @_inception_ai's Mercury 2, the fastest code gen model in existence. Use it with Not Diamond or @OpenRouter's /auto mode.
For max speeds, use the latency tradeoff in nd or the plugins param in OpenRouter to route between Mercury and a stronger model.
Excellent article by @cursor_ai that explains why it's so hard to change a model mid conversation. For instance, OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement.
It's all about customizing the harness for different models.
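A toy illustration of the two edit styles mentioned above. These are simplified stand-ins written for this example, not the actual tool schemas either lab trains on. The function names and the one-hunk patch format (lines prefixed with a space, `-`, or `+`) are assumptions.

```python
def apply_string_replace(text: str, old: str, new: str) -> str:
    """String-replacement edit: swap exactly one occurrence of `old` for `new`."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(old, new)


def apply_patch(text: str, hunk: list[str]) -> str:
    """Patch-style edit with a single hunk.

    Each hunk line starts with ' ' (context), '-' (remove), or '+' (add).
    """
    before = [line[1:] for line in hunk if line[0] in " -"]
    after = [line[1:] for line in hunk if line[0] in " +"]
    lines = text.split("\n")
    # Find the first place where the context plus removed lines match.
    for i in range(len(lines) - len(before) + 1):
        if lines[i:i + len(before)] == before:
            return "\n".join(lines[:i] + after + lines[i + len(before):])
    raise ValueError("hunk context not found")


source = "def f():\n    return 1\n"
via_replace = apply_string_replace(source, "return 1", "return 2")
via_patch = apply_patch(source, [" def f():", "-    return 1", "+    return 2"])
```

Both calls produce the same file, but the model has to emit very different text to get there, which is why a harness tuned for one format tends to work poorly with a model trained on the other.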