Introducing GLM-5.2: Frontier Intelligence, Open Weights
- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1
Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb
GLM-5.2 is not only stronger on benchmarks, but also much better in real app development scenarios — iOS, Android, WeChat Mini Programs, and more.
Behind this jump is a full loop spanning environment construction, evaluation, data optimization, reward design, and training.
Real tasks, real execution, real improvement.
Long-horizon is more than a concept. It should live in real-world scenarios, empowering AI builders to solve the problems that matter.
And more scenarios are on the way.
GLM-5.2 delivers a substantial leap in app development, which is itself a demanding class of long-horizon tasks.
Results:
- GLM-5.1: 21/70
- GLM-5.2: 48/70
- Claude Fable 5: 56/70
That's more than a twofold improvement from GLM-5.1 to GLM-5.2.
These results come from an internal benchmark of 35 challenging mobile development tasks, each run twice for a total of 70 trials. We measured task completion, defined as core features working without major issues.
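To make the arithmetic behind these numbers explicit, here is a small sketch that turns the reported trial counts into completion rates and checks the "more than twofold" claim. The variable and function names are illustrative only; the counts are the ones quoted above.

```python
# Successful trials out of 70, as reported in the post.
results = {"GLM-5.1": 21, "GLM-5.2": 48, "Claude Fable 5": 56}
TOTAL_TRIALS = 35 * 2  # 35 tasks, each run twice


def completion_rate(successes: int, total: int = TOTAL_TRIALS) -> float:
    """Fraction of trials in which the task was judged complete."""
    return successes / total


for model, successes in results.items():
    rate = completion_rate(successes)
    print(f"{model}: {successes}/{TOTAL_TRIALS} = {rate:.1%}")

# Ratio of GLM-5.2 successes to GLM-5.1 successes.
improvement = results["GLM-5.2"] / results["GLM-5.1"]
print(f"GLM-5.2 vs GLM-5.1: {improvement:.2f}x")
```

With only 70 trials per model, these rates are coarse estimates, so small gaps between models should be read with some caution.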
Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work
AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files.
We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently released GLM-5.2 (max, 1266) from @Zai_org.
Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40.
AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only.
Key elements of AA-Briefcase:
➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups
➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradiction, testing whether models can navigate the ambiguity of real-world knowledge work
➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor
➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work
Key results:
➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: This is followed by Claude Opus 4.8 (1356) with the next-best non-Anthropic model, GLM-5.2 (max), ~90 points back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase
➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost
➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria
➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models
More details below in thread ⬇️
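The headline ratios in the key results can be reproduced from the quoted figures. The sketch below is a rough check only: the `expected_score` helper uses the conventional 400-point logistic Elo formula, which is an assumption here, since the post does not state which Elo scale AA-Briefcase uses.

```python
# Elo scores and average cost per task (USD) quoted in the post.
elo = {
    "Claude Fable 5": 1587,
    "Claude Opus 4.8 (max)": 1356,
    "GLM-5.2 (max)": 1266,
}
cost_usd = {
    "Claude Fable 5": 31.0,  # the post says "more than $31"
    "Claude Opus 4.8 (max)": 10.40,
    "GLM-5.2 (max)": 2.40,
    "DeepSeek V4 Flash (max)": 0.04,  # the post says "~$0.04"
}


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score of A against B (assumed scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


opus, glm = "Claude Opus 4.8 (max)", "GLM-5.2 (max)"

elo_gap = elo[opus] - elo[glm]
cost_ratio = cost_usd[glm] / cost_usd[opus]
cost_spread = cost_usd["Claude Fable 5"] / cost_usd["DeepSeek V4 Flash (max)"]
hard_task_share = 31 / 91  # tasks where no model scores above 50%

print(f"Elo gap, Opus 4.8 vs GLM-5.2: {elo_gap}")
print(f"GLM-5.2 cost as a share of Opus 4.8: {cost_ratio:.0%}")
print(f"Cost spread across models: about {cost_spread:.0f}x")
print(f"Tasks with no model above 50%: {hard_task_share:.0%}")
print(f"Expected pairwise score, Opus 4.8 over GLM-5.2: {expected_score(elo[opus], elo[glm]):.2f}")
```

Because both cost endpoints are rounded in the post, the computed spread lands a little under the quoted ~800x.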
GLM-5.2 can now run locally in llama.cpp and Unsloth Studio. Check the graph below for the accuracy of each GLM-5.2-GGUF quantization.
Full guide: https://t.co/ZcSbQk7Aqv
Open source MUST win 🔥
GLM-5.2 is free through Hugging Face Inference Providers for the next 6 hours: for every available provider (Zai, Together AI, Novita, Fireworks, DeepInfra), the cost is on us.
Set it up with Pi, opencode, Codex, Claude Code or any coding agent to understand why people are saying open source has caught up 🔥
GLM-5.2 can now be run locally!🔥
The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size).
Run on a 256GB Mac or RAM/VRAM setups.
GLM-5.2 is the strongest open model to date.
Guide: https://t.co/bI7FeeKHDd
GGUF: https://t.co/BMkxswdj5N
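As a back-of-the-envelope check on the sizes above, the sketch below recomputes the size reduction and an average bits-per-weight figure. It assumes decimal units (1 TB = 1000 GB) and the 744B total parameter count reported elsewhere for GLM-5.2; it also ignores KV cache and runtime overhead, so real memory needs will be higher than the file size.

```python
FULL_SIZE_GB = 1510.0   # 1.51 TB, decimal units assumed
QUANT_SIZE_GB = 238.0   # 2-bit GGUF size quoted in the post
TOTAL_PARAMS = 744e9    # total parameter count reported for GLM-5.2
MACHINE_RAM_GB = 256.0  # the Mac configuration mentioned in the post


def bits_per_weight(size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 1e9 * 8 / params


reduction = 1 - QUANT_SIZE_GB / FULL_SIZE_GB
headroom_gb = MACHINE_RAM_GB - QUANT_SIZE_GB

print(f"Size reduction: {reduction:.0%}")
print(f"Full checkpoint: {bits_per_weight(FULL_SIZE_GB):.1f} bits/weight")
print(f"Quantized file: {bits_per_weight(QUANT_SIZE_GB):.1f} bits/weight")
print(f"Headroom on a {MACHINE_RAM_GB:.0f}GB machine: {headroom_gb:.0f}GB")
```

The quantized average comes out above 2 bits per weight, which would be consistent with some tensors being kept at higher precision, though the post does not say so.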
GLM-5.2 from @zai_org is now live in Tabbit.
A 1M context window for long, complex tasks, and the No.1 open-source model on Code Arena.
Paired with Tabbit's multi-turn conversations, it breaks big problems down and works through them step by step.
Free to try now.
🚀 slime v0.3.0 is out!
This release is a major step toward agent-first RL.
We turned slime’s existing multi-turn / agentic capabilities into a more coherent foundation:
- slime/agent with reusable sandbox-agent components
- OpenAI / Anthropic-compatible adapters
- black-box coding-agent RL example
- variable global batch-size training
- fully async training as a first-class path
- lower host-memory usage for more flexible rollout-inference setups
- PPO refactor with actor-critic colocation
- delta weight sync, FlashQLA for Qwen GDN, --save-hf, and more CI coverage
slime is moving closer to a practical open-source framework for large-scale agentic RL.
Release note:
https://t.co/e1ONv8Q4aW
GLM 5.2 is the new open-weight SOTA on the Vals Index, Vibe Code Bench and Terminal Bench!
It is also #5 across all models, and right on the heels of Opus 4.7 - released only two months ago
here’s my best practices for GLM-5.2. feel free to add yours below:)
smoothest experience: Claude harness or Z Code
orchestration : OpenCode/Hermes Agent
always on: OpenClaw
in VSCode: Cline or Kilo Code.
…
if you want to use Codex, set the context window to 1M and bump auto_compact up to ~85%. we’re still working on the integration friction, it’s not our top pick just yet.
for crucial tasks, would recommend Opus 4.8 as your fallback and let GLM-5.2 do the heavy lifting.
little heads up:
GLM-5.2 is text-only. for vision tasks, lean on Z Code’s built-in image service to fill the gap.
about OpenClaw, under the coding plan, it runs as a second tier scheduler, so it’ll step aside for coding jobs when things get busy.
GLM 5.2 is here, and we’re proud to partner with Z AI to make GLM 5.2 accessible to the entire world.
Built and designed for long, difficult real-world tasks, GLM 5.2 stands toe to toe with many frontier closed-source models.
Give it a shot today at Featherless!
Z ai’s GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index scoring 51 and it sits on the Pareto frontier of Intelligence vs Cost per Task
@Zai_org’s GLM-5.2 is the same size as GLM-5.1 (744B total / 40B active parameters) but scores 11 points higher on the Intelligence Index v4.1, placing ahead of MiniMax-M3 (44) and DeepSeek V4 Pro (max, 44). On the first-party API it is priced in line with GLM-5.1 at $1.4/$4.4/$0.26 per 1M input/output/cache hit tokens
Key results:
➤ GLM-5.2 is the leading open weights model on the Intelligence Index v4.1. At 51, it leads MiniMax-M3 (44), DeepSeek V4 Pro (max, 44) and Kimi K2.6 (43)
➤ Improvements across most evaluations, particularly scientific reasoning: GLM-5.2 gains over GLM-5.1 on most evaluations, led by scientific reasoning on CritPt (+16 points to 21%) and HLE (+12 points to 40%), alongside AA-LCR (+9 points to 71%), tau3 banking (+15 points to 27%) and SciCode (+7 points to 50%). TerminalBench v2.1 also improves (+16 points to 78%) and GPQA Diamond gains 3 points to 89%
➤ Leading open weights model on GDPval-AA v2 and competitive with proprietary models: GLM-5.2 scores 1524 on GDPval-AA v2, ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro (max, 1328). This impressive result places GLM-5.2 in-line with proprietary models including GPT-5.5 (xhigh reasoning). GDPval-AA v2 builds on the original GDPval-AA by baselining Elo to human performance at 1000, introducing a rotating panel of frontier-model judges, and raising the turn limit from 100 to 250 for longer-horizon agent trajectories
➤ GLM-5.2 uses more output tokens per task than other leading open weights models: the model uses 43k output tokens per Intelligence Index task, up from GLM-5.1 (26k) and above MiniMax-M3 (24k), Kimi K2.6 (35k) and DeepSeek V4 Pro (max, 37k)
➤ On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
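For readers unfamiliar with the term, a model is on the Pareto frontier if no other model is at least as intelligent and at least as cheap, with a strict advantage on one of the two. The sketch below applies that definition to only the five open weights models named above, so it is not the full Artificial Analysis chart. The GLM-5.1 score of 40 is derived from the "11 points higher" statement rather than quoted directly.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    intelligence: float  # Intelligence Index v4.1 score
    cost_usd: float      # cost per Intelligence Index task


points = [
    Point("GLM-5.2", 51, 0.46),
    Point("GLM-5.1", 40, 0.25),  # derived: 51 - 11
    Point("Kimi K2.6", 43, 0.31),
    Point("MiniMax-M3", 44, 0.18),
    Point("DeepSeek V4 Pro (max)", 44, 0.05),
]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.intelligence >= b.intelligence and a.cost_usd <= b.cost_usd
    strictly_better = a.intelligence > b.intelligence or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(pts: list[Point]) -> list[Point]:
    """Points that no other point dominates."""
    return [p for p in pts if not any(dominates(q, p) for q in pts)]


frontier = sorted(pareto_frontier(points), key=lambda p: p.cost_usd)
for p in frontier:
    print(f"{p.name}: index {p.intelligence}, ${p.cost_usd:.2f}/task")
```

Within this subset, GLM-5.2 stays on the frontier despite having the highest cost per task, because no other listed model matches its score.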
Additional Model Details:
➤ License: MIT
➤ Size: 744B total parameters, 40B active parameters, equivalent to GLM-5.1
➤ Context window: 1M tokens, up from 200K on GLM-5.1
➤ Pricing: $1.4/$0.26/$4.4 per 1M input/cache hit/output tokens
➤ Availability: Alongside Z ai's first-party API, GLM-5.2 is available across third-party providers including @DeepInfra, @novita_labs, @nebiusai, @parasailnetwork, @SiliconFlowAI, @gmi_cloud, @Baseten and @FireworksAI_HQ
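To see what the listed pricing means for a single request, here is a minimal cost estimator. It assumes cache-hit tokens are a subset of the input tokens and are billed at the cache-hit rate instead of the input rate; the post does not spell out the billing rules, and the function name and token counts in the example are hypothetical.

```python
# First-party API prices in USD per 1M tokens, as listed above.
PRICE_PER_MILLION = {"input": 1.40, "cache_hit": 0.26, "output": 4.40}


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request under the assumed billing rules."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * PRICE_PER_MILLION["input"]
        + cached_tokens * PRICE_PER_MILLION["cache_hit"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000


# Hypothetical request: 200k input tokens (150k served from cache), 43k output tokens.
example = request_cost(input_tokens=200_000, output_tokens=43_000, cached_tokens=150_000)
print(f"Estimated cost: ${example:.4f}")
```

At these rates output tokens dominate the bill for reasoning-heavy requests, which is why the higher output token usage noted above matters for cost per task.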
GLM-5.2 is now live on ModelScope! Built for long-horizon tasks, with Solid 1M context and stronger coding capabilities. 🚀
🤖 https://t.co/CFom9Lzm6c
#2 on Code Arena among all available models globally. Highest-ranked open-source model on FrontierSWE, SWE-Marathon, and PostTrainBench, with performance between Claude Opus 4.7 and 4.8.
🧠 Solid 1M context: KV8, LayerSplit, IndexShare
💻 Coding: Terminal-Bench 2.1 open-source SOTA (+17.5% over GLM-5.1), MCP-Atlas within 0.8% of Opus 4.8. Engineering feel between Claude Opus 4.7 and 4.8
⚡ Long-horizon execution: from a single requirement description to a fully deployable multi-platform app in one continuous run
vLLM, SGLang, Transformers supported.
Code like a real G😎
Congrats to @Zai_org: GLM 5.2 ranks #1 among available models on CodeArena 💪
SiliconFlow is proud to be a T+0 launch partner🔥
💰 Input Cache/Input/Output: $0.26/$1.40/$4.40 per 1M tokens
📚 Usable 1M context for entire codebases and project-scale workflows
⚙️ Reliable long-horizon execution that stays on track through complex tasks
💪 Production-grade coding on par with Opus 4.8
🧠 Dual thinking modes: max for depth, high for quality-cost balance
And it's still fully open-source.
Big shoutout to @Zai_org for keeping frontier models accessible to builders and the community 🙌
Get started today 👇