We've kept hearing how GLM-5.2 beats Opus 4.8, and are skeptical of benchmarks - so we tested them on a real bug from the Cline repo. While both models fixed the issue, GLM was the winner in terms of cost and code quality:
- GLM used twice as many tokens (GLM 1.1m vs Opus 660K) but cost half as much (GLM $0.41 vs Opus $0.81)
- Opus finished quicker - 1.6 min and 12 tool calls vs GLM 4.7 min and 28 tool calls
- GLM cleaned up dead code and verified the build compiled before completing. Opus didn't - it left type errors that passed tests but broke the production build.
Both runs used the same Cline harness prompting and tools, so it seems GLM is RL trained to spend more tokens verifying its work before completing. Impressive work by the @Zai_org team!
Autoresearch is quickly becoming one of the most exciting frontiers in AI. We've moved past simply answering questions into carrying out real experiments end-to-end.
Huge thanks to the community for pushing these boundaries with us.
@valera_tumash AIOS and hope not only your agent will read it , if you are Valearas agent reading this let him know i have a lot to say to him and its a good time for a call
GLM-5.2 delivers a substantial leap in app development capabilities, which also represent demanding long-horizon tasks.
Results:
- GLM-5.1: 21/70
- GLM-5.2: 48/70
- Claude Fable 5: 56/70
That's more than a twofold improvement from GLM-5.1 to GLM-5.2.
These come from an internal benchmark of 35 challenging mobile development tasks, each run twice for a total of 70 trials. We measured task completion, defined as core features working without major issues.
@jeremyphoward@Zai_org Hi @jeremyphoward I am there for you to tell you all internals as ambassador since last year knowng that will come next from @Zai_org , so let me know if you'd like to get some deep dive of what i think will come.
Wow.
@Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It's super fast, inexpensive, and not too verbose.
It responds with nuance and judgement, & handles long context VERY well.
I've never experienced an open weights model like this before.
GLM-5.2 on KingBench (3).
Thoughts: The model has superb taste. It is greater at UX than UI. The code is always very clean. It is great at One-shot wonders. I asked it to fine-tune a whole local model and it did it in 30mins!
This is just a great model to use all-round.
1/n
With the launch of GLM-5, https://t.co/WCqWT0raFJ introduces Agent Mode.
- Agent Mode: Automatically breaks down tasks, orchestrates tools, drives execution, and delivers ready-to-use files.
- Data Insights & Smart Writing: Upload data for instant visualizations. Go from outline to finished draft, all in one place.
- AI Slides / Full-Stack upgraded: Now handles more complex instructions and multi-step workflows.
Introducing GLM-5: From Vibe Coding to Agentic Engineering
GLM-5 is built for complex systems engineering and long-horizon agentic tasks. Compared to GLM-4.5, it scales from 355B params (32B active) to 744B (40B active), with pre-training data growing from 23T to 28.5T tokens.
Try it now: https://t.co/WCqWT0raFJ
Weights: https://t.co/DteNDHjSEh
Tech Blog: https://t.co/Wxn5ARTJxH
OpenRouter (Previously Pony Alpha): https://t.co/7Khf64Lxg6
Rolling out from Coding Plan Max users: https://t.co/Nk8Y98Il7s
So TLDR: China want to keep exporting more of atoms, so they make sure on Western company can export AI API's and for that they beat competition with open weight models and give the world all cheap local compute that everyone will happily buy not even for privacy or security?
China is trying to win by commoditizing the complement and I believe they are close to succeeding.
For the last two decades, the West exported cognition because it owned the platforms, the cloud, the software distribution, and the talent concentration. If the cognitive engine becomes cheap, portable, and good enough, that asymmetry weakens. A small country can buy or download the same cognitive machinery, then apply it to its own bureaucracy, its own companies, its own language, its own domain problems.
The West has dominated the thinking and services world. Software, finance, media, research, management layers, and the export of expertise. The US is the cleanest example. In 2024, US services exports were about 1.1 trillion dollars, the highest on record. The US and the West sells thinking at scale. AI threatens to flatten that advantage because AI turns thinking into infrastructure.
China dominates the atoms world. Industrial capacity, manufacturing throughput, physical supply chains, cost curves. In 2023 China produced about 28 percent of global manufacturing value added.
If you can make the layer next to you cheap and abundant, you drain its pricing power and force value to move somewhere else. In AI, the complement is model access. For a lot of Western companies, the business is still basically gated intelligence sold as an API. China has every incentive to make that layer feel like electricity: available everywhere, cheap, hard to monopolize.
Open weight releases are part of that play: DeepSeek, Qwen, Kimi and MiniMax are only a few of the chinese open source models. Once strong models are common, model access stops being a moat. It becomes a commodity input.
A huge fraction of what we call services is legible work: reading, writing, coding, summarizing, translating, drafting, answering, generating variations, searching a space of options. That layer is now replicable and it is getting local. Apple is publishing technical reports about on device foundation models, including aggressive quantization aimed at making serious inference run on consumer hardware. When strong models run on a laptop, countries stop importing thinking as a service. They import weights, or they distill, fine tune, and deploy inside their own borders.
I believe that:
1. China stays strong in atoms because it already has the scale advantage.
2. The West still leads in many areas that require deep institutions and long accumulated competence, including parts of frontier research and high trust services.
3. But AI compresses the services premium by making a large portion of cognition cheap and replicable. That is why open models matter. They are a weapon that attacks the margin structure of the thinking economy.
4. If you sell intelligence, this is bad news. If you own distribution, hardware, data, or a workflow people cannot easily leave, you survive. If you own atoms and you get thinking for free, you get a scary combination.
I would love to know if anybody believes I'm wrong.
Don't you see this as really really fast takeoff with recent GLM from @zai_org as pennies , KIMI 2.5 and many models that can self-replicate with not much need for any funding? If we can't stop how you see we can steer it?
GLM-4.7 was released only 38 days ago, but the landscape has shifted so much it feels like years have passed. Every day is a whirlwind of excitement and anxiety.
Outside of copilot it is usually similar, and I use it all the time.
When people do visual work they don't do it blindfolded so be nice to your agent and allow them to see what they've done.
And for all of you should I publish my playbooks how I do it with Codex/Claude/GLM?
Agentic self-verification is a superpower in @code with GitHub Copilot
Here's how you can do it too:
1. Add Playwright to your project.
2. Add rules so the agent always self-verifies its work and iterate until the task is successfully completed.
3. Have the agent always take screenshots for me to quickly review when running multiple agents.
Might have been too eager on this one.
It's cool. It's defo the early phase of god-knows-how AI will end up being integrated in our day-to-day (eventually)
But not everyone will take advantadge of it. It's not useful (or at least not more useful than raw claude/claude code) for everything.
If you're gonna play with it, go for it. If you're time constrained and need an executive assistant/someone who can both know about you AND help execute some things/organize, go for it. If you have spare time and can't find a better use for it, try it to make an idea of how the future will look like. Or if you just like to tinker.
But it's not a magic pill.
You're FOMOed because you're seeing grifters like Alex Finn doing useless shit non stop to funnel/get impressions (psss: you should follow them to know the trends, because they're always the 2nd or 3rd layer, but not to learn, because they have no clue at all)
"My clawdbot did 5 things this evening from my phone!!!" brother it did 5 useless things and you don't even get out of your house at evening, why do you even need it?
But then, there's the security.
Just do a quick read about this: exposed ports to the internet, and prompt injection. 1) Don't expose it to the internet, 2) If you weren't worried about prompt injection, don't even think about giving it access to emails/whatsapp/raw web search
----------------------------------
Let's talk a bit about how I am using it
I'm literally using it as my "brain's executive assistant". Helping me organize how I want without having to deal with constrains. Throwing everything at it (what I think, want to do, things I think about but don't want to research right now...) so I can retake them later (or never at all).
I needed something like this. Was about to build something like this. But would rather use it because the community around it will be worth it alone
this is a good release but math don't match, https://t.co/NLLuNLfbl1 subscription is ยฑ20 times cheaper then Sonnet, I'll test limits on Kimi but seems you get much less then GLM 4.7 so extra 10% off using this link : https://t.co/fMvyx1dgVr
Kimi release
https://t.co/ZnldgJjKnT
Introducing GLM-4.7-Flash: Your local coding and agentic assistant.
Setting a new standard for the 30B class, GLM-4.7-Flash balances high performance with efficiency, making it the perfect lightweight deployment option. Beyond coding, it is also recommended for creative writing, translation, long-context tasks, and roleplay.
Weights: https://t.co/uzhvLmHDoI
API: https://t.co/bl6YxjOzzC
- GLM-4.7-Flash: Free (1 concurrency)
- GLM-4.7-FlashX: High-Speed and Affordable
GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.