Z.ai

about 4 hours ago

@didier_lopes Thanks for supporting @slime_framework ❤️

0

2

0

60

Zai_org retweeted

Fireworks AI

@FireworksAI_HQ

about 21 hours ago

"...at least as good as Opus 4.8 and GPT 5.5."

13

158

9

20

22K

Zai_org retweeted

Cunxiang Wang

@CunxiangWang

1 day ago

GLM-5.2 is not only stronger on benchmarks, but also much better in real app development scenarios — iOS, Android, WeChat Mini Programs, and more. Behind this jump is a full loop from environment construction, evaluation, data optimization, reward design, to training. Real tasks, real execution, real improvement.

36

259

20

47

34K

1 day ago

Long-horizon is more than a concept. It should live in real-world scenarios, empowering AI builders to solve the problems that matter. And more scenarios are on the way.

1 day ago

GLM-5.2 delivers a substantial leap in app development capabilities, which also represent demanding long-horizon tasks.

Results:
- GLM-5.1: 21/70
- GLM-5.2: 48/70
- Claude Fable 5: 56/70

That's more than a twofold improvement from GLM-5.1 to GLM-5.2.

These come from an internal benchmark of 35 challenging mobile development tasks, each run twice for a total of 70 trials. We measured task completion, defined as core features working without major issues.

65

1K

75

198

168K

35

801

50

75

50K
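The completion counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the post, while the dictionary and function names are illustrative and not part of any Z.ai tooling.

```python
from fractions import Fraction

# 35 mobile development tasks, each run twice (figures from the post).
TOTAL_TRIALS = 35 * 2

# Trials counted as completed, per model, as quoted in the post.
completed = {"GLM-5.1": 21, "GLM-5.2": 48, "Claude Fable 5": 56}


def completion_rate(model: str) -> float:
    """Share of the 70 trials that the given model completed."""
    return completed[model] / TOTAL_TRIALS


# Exact ratio of GLM-5.2 completions to GLM-5.1 completions.
improvement = Fraction(completed["GLM-5.2"], completed["GLM-5.1"])

for model in completed:
    print(f"{model}: {completion_rate(model):.1%}")
print(f"GLM-5.1 -> GLM-5.2: {float(improvement):.2f}x")
```

The ratio works out to 48/21, a little under 2.3x, which matches the "more than a twofold improvement" wording.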

Zai_org retweeted

Artificial Analysis

@ArtificialAnlys

1 day ago

Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work

AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files.

We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently-released GLM 5.2 (max, 1266) from @Zai_org.

Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40.

AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only.

Key elements of AA-Briefcase:

➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups

➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradiction, testing whether models can navigate the ambiguity of real-world knowledge work

➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor

➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work

Key results:

➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: This is followed by Claude Opus 4.8 (1356) with the next-best non-Anthropic model, GLM-5.2 (max), ~90 points back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase

➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost

➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria

➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models

More details below in thread ⬇️

40

783

64

259

181K
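The headline comparisons in the AA-Briefcase announcement (the ~90 Elo gap, "less than 25% of the cost", and the ~800x cost spread) follow from the quoted figures. A minimal sketch, assuming the per-task costs are exactly the rounded values in the post (the post itself says "more than $31" and "~$0.04", so the spread is approximate); the variable names are illustrative.

```python
# Elo scores quoted in the announcement.
elo = {
    "Claude Fable 5": 1587,
    "Claude Opus 4.8 (max)": 1356,
    "GLM-5.2 (max)": 1266,
}

# Average cost per task in USD, using the rounded figures from the post.
cost_usd = {
    "Claude Fable 5": 31.00,
    "Claude Opus 4.8 (max)": 10.40,
    "GPT-5.5 (xhigh)": 3.68,
    "GLM-5.2 (max)": 2.40,
    "DeepSeek V4 Flash (max)": 0.04,
}

# Elo gap between Claude Opus 4.8 (max) and GLM-5.2 (max).
elo_gap = elo["Claude Opus 4.8 (max)"] - elo["GLM-5.2 (max)"]

# GLM-5.2 (max) cost as a share of Claude Opus 4.8 (max) cost.
cost_share = cost_usd["GLM-5.2 (max)"] / cost_usd["Claude Opus 4.8 (max)"]

# Spread between the most and least expensive models listed.
cost_spread = max(cost_usd.values()) / min(cost_usd.values())

print(f"Elo gap: {elo_gap}")
print(f"GLM-5.2 cost share of Opus 4.8: {cost_share:.0%}")
print(f"Cost spread: ~{cost_spread:.0f}x")
```

With the rounded inputs the spread comes out near 775x, consistent with the "~800x" in the post given that both endpoint costs are approximate.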

Zai_org retweeted

2 days ago

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio. Check the graph below for the accuracy of each GLM-5.2-GGUF quantization. Full guide: https://t.co/ZcSbQk7Aqv

[Photo: graph of GLM-5.2-GGUF quantization accuracy] https://t.co/dfzv7cL0gm

7

153

8

29

20K

2 days ago

GLM-5.2 is free when used with Hugging Face Inference Providers for the next 5 hours: https://t.co/YsYXgQpqTw

Victor M

@victormustar

2 days ago

Open source MUST win 🔥 GLM-5.2 is free when used with Hugging Face Inference Providers and for every available provider for the next 6 hours (Zai, Together AI, Novita, Fireworks, DeepInfra) the cost is on us. Set it up with Pi, opencode, Codex, Claude Code or any coding agent to understand why people are saying open source has caught up 🔥

61

612

57

242

237K

63

1K

68

238

226K

Zai_org retweeted

Unsloth AI

@UnslothAI

2 days ago

GLM-5.2 can now be run locally!🔥 The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size). Run on a 256GB Mac or RAM/VRAM setups. GLM-5.2 is the strongest open model to date. Guide: https://t.co/bI7FeeKHDd GGUF: https://t.co/BMkxswdj5N

[Photo] https://t.co/qIPuU63W9D

251

7K

797

4K

1M
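The size figures in the Unsloth post can be sanity-checked directly. A rough sketch, assuming decimal units (1 TB = 1000 GB) and counting only the model file itself; runtime memory for the KV cache and the operating system is not included, so the headroom figure is an upper bound.

```python
# Sizes quoted in the post; decimal units are an assumption.
original_gb = 1.51 * 1000   # 1.51 TB full-size model
quantized_gb = 238          # 2-bit GGUF quantization
machine_gb = 256            # the 256GB Mac mentioned in the post

# Fractional size reduction from quantization.
reduction = 1 - quantized_gb / original_gb

# Memory left over after loading the weights (before any runtime overhead).
headroom_gb = machine_gb - quantized_gb

print(f"Size reduction: {reduction:.1%}")
print(f"Headroom on a {machine_gb}GB machine: {headroom_gb}GB")
```

The reduction comes out at roughly 84%, in line with the "-84% size" quoted in the post.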

Zai_org retweeted

Tabbit

@TabbitBrowser

3 days ago

GLM-5.2 from @zai_org is now live in Tabbit. A 1M context window for long, complex tasks, and the No.1 open-source model on Code Arena. Paired with Tabbit's multi-turn conversations, it breaks big problems down and works through them step by step. Free to try now.

[Photo] https://t.co/G5IAXirj1o

7

157

8

52

15K

2 days ago

@elonmusk @teortaxesTex ✍️✍️✍️

9

404

9

18

59K

Zai_org retweeted

slime

@slime_framework

19 days ago

🚀 slime v0.3.0 is out!

This release is a major step toward agent-first RL. We turned slime’s existing multi-turn / agentic capabilities into a more coherent foundation:
- slime/agent with reusable sandbox-agent components
- OpenAI / Anthropic-compatible adapters
- black-box coding-agent RL example
- variable global batch-size training
- fully async training as a first-class path
- lower host-memory usage for more flexible rollout-inference setups
- PPO refactor with actor-critic colocation
- delta weight sync, FlashQLA for Qwen GDN, --save-hf, and more CI coverage

slime is moving closer to a practical open-source framework for large-scale agentic RL.

Release note: https://t.co/e1ONv8Q4aW

2

121

17

67

19K

Zai_org retweeted

Vals AI

@ValsAI

3 days ago

GLM 5.2 is the new open-weight SOTA on the Vals Index, Vibe Code Bench and Terminal Bench! It is also #5 across all models, and right on the heels of Opus 4.7 - released only two months ago

[Photo] https://t.co/hyt3mqMDEE

14

421

32

61

77K

Zai_org retweeted

Lou

@louszbd

3 days ago

here’s my best practices for GLM-5.2. feel free to add yours below :)

smoothest experience: Claude harness or Z Code
orchestration: OpenCode/Hermes Agent
always on: OpenClaw
in VSCode: Cline or Kilo Code
…

if you want to use Codex, set the context window to 1M and bump auto_compact up to ~85%. we’re still working on the integration friction, it’s not our top pick just yet.

for crucial tasks, would recommend Opus 4.8 as your fallback and let GLM-5.2 do the heavy lifting.

little heads up: GLM-5.2 is text-only. for vision tasks, lean on Z Code’s built-in image service to fill the gap.

about OpenClaw, under the coding plan, it runs as a second tier scheduler, so it’ll step aside for coding jobs when things get busy.

73

1K

52

564

76K

Zai_org retweeted

OpenCode

@opencode

3 days ago

GLM-5.2 now available in Go

text · 1M context · same pricing as 5.1

172

6K

280

467

467K

Zai_org retweeted

Featherless AI

@FeatherlessAI

4 days ago

GLM 5.2 is here, and we’re proud to partner with Z AI to make GLM 5.2 accessible to the entire world. Built and designed for long, difficult real-world tasks, GLM 5.2 stands toe to toe with many frontier closed-source models. Give it a shot today at Featherless!

[Photo] https://t.co/sqgakKvQFM

2

62

11

13K

Zai_org retweeted