Z.ai

@Zai_org

The AI Lab behind GLM models, dedicated to inspiring the development of AGI to benefit humanity.

[email protected]

Joined November 2023

258 Following

80.7K Followers

1.1K Posts

Pinned Tweet

Z.ai @Zai_org

15 days ago

https://t.co/jaOMnP7Yud

829

123

605

166K

Zai_org retweeted

Harvey @harvey

about 24 hours ago

We partnered with @FireworksAI_HQ to train open-source models for legal. Here's what we found:

1) Hybrid legal agents can beat frontier models on quality and cost by routing selectively to a frontier advisor.

We tested a hybrid setup where GLM 5.1 served as the primary worker, routing tasks to Opus 4.7 as an advisor when needed.

GLM invoked Opus sparingly, just 0.83 times per task on average.

The hybrid setup beat Opus on both quality and cost: 18% all-pass vs 14%, at $368 vs $954 across the same 100 tasks.

2) Post-training can push open models to frontier-level legal performance.

On a 100-task slice of our Legal Agent Benchmark (LAB), SFT moved Kimi 2.6's all-pass rate from 11% to 15%, beating Opus' 14%.

But the cost gap was even more striking: $84 vs $954 across the same 100 tasks, or ~11x cheaper.

We're excited to continue working with @FireworksAI_HQ on the next generation of open-source legal agents.
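The cost and quality figures quoted in this post can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the post; the dictionary layout, the labels, and the `compare` helper are illustrative assumptions, not anything Harvey published.

```python
# Figures as quoted in the post: all-pass rate (%) and total cost (USD)
# over the same 100-task slice. Labels and structure are illustrative.
TASKS = 100
results = {
    "Opus 4.7": {"all_pass_pct": 14, "cost_usd": 954},
    "GLM 5.1 + Opus advisor": {"all_pass_pct": 18, "cost_usd": 368},
    "Kimi 2.6 + SFT": {"all_pass_pct": 15, "cost_usd": 84},
}


def compare(name, baseline="Opus 4.7"):
    """Compare one setup against the baseline on cost and all-pass rate."""
    base, other = results[baseline], results[name]
    return {
        # How many times cheaper than the baseline this setup is.
        "cost_ratio": base["cost_usd"] / other["cost_usd"],
        # Average spend per task for this setup.
        "cost_per_task_usd": other["cost_usd"] / TASKS,
        # Difference in all-pass rate, in percentage points.
        "all_pass_delta_pts": other["all_pass_pct"] - base["all_pass_pct"],
    }


for name in ("GLM 5.1 + Opus advisor", "Kimi 2.6 + SFT"):
    print(name, compare(name))
```

Under these figures the SFT'd Kimi setup comes out at roughly 11.4x cheaper than Opus, consistent with the "~11x" in the post, and the hybrid setup at roughly 2.6x cheaper.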

323

227

83K

Z.ai @Zai_org

17 days ago

@OrcaRouter Now it’s live!👏

Zai_org retweeted

OrcaRouter 🐳

@OrcaRouter

17 days ago

GLM-5.1 from @Zai_org is now live on OrcaRouter

• #1 open-source model on SWE-Bench Pro
• Beats closed source models on real-world repo repair benchmarks
• MIT licensed
• 200K context
• Built for long-horizon agentic coding

We’ve also seen strong results using GLM-5.1 inside OrcaRouter’s adaptive routing strategy as a fallback coding model.

Open-source coding models are getting scary good. https://t.co/a0WruoaJSc

255

29K

Zai_org retweeted

Erica

@ericavaneee

19 days ago

We built TERMS-Bench, a three-tier benchmark for LLM agents in real-world economic negotiation. No LLM-as-judge, no outcome rubrics: the environment itself is the verifier.

🏆Among frontier models, @AnthropicAI Claude Opus 4.6 #1, @Zai_org GLM 5.1 #2.
✨Surprisingly strong: @GoogleDeepMind @googlegemma Gemma 4 31B, the best open-weight model, holds up as negotiations get harder.
🔗 https://t.co/XajAyaZRct

235

106

45K

Zai_org retweeted

Design Arena

@Designarena

20 days ago

BREAKING: The results are in for Slides Arena... @AnthropicAI and @Zai_org models continue to lead the way in soft-verifiable domains

1st: Opus 4.7 by @AnthropicAI
2nd: Opus 4.7 (Thinking) by @AnthropicAI
3rd: GLM 5.1 by @Zai_org

Huge congrats to @AnthropicAI and @Zai_org for establishing the SOTA for Agentic Slides

251

51K

Zai_org retweeted

Zixuan Li

@ZixuanLi_

22 days ago

See you in Singapore. BTW, I'm starting to look more and more like our logo.

136

20K

Zai_org retweeted

Z.ai for Startups

@ZaiforStartups

23 days ago

GLM models are now live on @tensorix_ai

We’re partnering to bring cost-efficient frontier AI models to developers, startups, and enterprises across Europe and beyond, and to back the Sovereign AI ecosystem with serious inference muscle.

Four GLM models are now available

• GLM-5.1 → SOTA open-source performance advancing long-horizon AI agents to new levels
• GLM-5 → New generation language base model
• GLM-5-Turbo → agent-ready, built for coding and agentic use cases
• GLM-5v-Turbo → multimodal reasoning across code, images, documents, and diagrams

Go build something cool. Build with GLM on Tensorix: https://t.co/1izv1MJa2D

268

23K

Zai_org retweeted

Zhihu Frontier

@ZhihuFrontier

24 days ago

🧵 Slime: The Most Elegant & Comfortable RL Training Framework Ever

A deep dive into why Slime redefines LLM RL training with clean architecture & production-grade engineering ✨ Insights from Zhihu contributor Xavier

📌 What Is Slime In One Sentence?

Slime is a streamlined RL training framework built on SGLang (Inference) + Megatron (Training) + Ray (Orchestration). It’s not just a simple stack: it stitches top-tier open-source projects together with perfectly polished interfaces. Core design philosophy: Fully decouple training & inference, connected via streamlined data flow.

Compared to veRL / OpenRLHF:
✅ Native SGLang backend → high concurrency, continuous batching, prefix caching (no messy vLLM wrapper)
✅ Native Megatron backend → full TP/PP/EP/CP parallelism, seamless MoE training
✅ Lightweight Ray scheduling → Placement Group + Remote Actor (no bloated Ray Train)

🏗️ Global Architecture: 3 Modules, One Pipeline

🖥️ Ray Cluster Core Workflow: Data Buffer (Prompt Manager → Buffer & Filter) ↔️ Rollout (SGLang → Sampling + RM Scoring + Filtering) ↔️ Training (Megatron → Actor/Critic + PPO/GRPO)

🔁 Simplified Core Training Loop

1. Allocate GPU resources via Placement Group
2. Launch SGLang rollout engine
3. Initialize Megatron Actor/Critic models
4. Sync initial weights to SGLang
5. Repeat 3-beat cycle: Generate (SGLang) → Train (Megatron) → Sync Weights

🎯 Elegance = ultra-simple top-level logic, all complexity encapsulated inside modules

🎛️ 4 Core Design Flexibilities

⚙️ Resource Scheduling: Colocate (shared GPU) / Disaggregate (separate GPU pools)
🔄 Training Mode: Synchronous / Asynchronous training
🧪 Sampling Logic: Standard sampling / Over-sampling / Multi-turn tool calling
🤖 Model Type: Dense / MoE, full tensor/pipeline/context parallel support

🔧 Plug & Play Customization (All Extensible)

Slime lets you customize every component via CLI params, with no need to fork the repo 🛠️

Key Customization Points
✅ Custom Reward Model: Write an async func to define your own reward logic (easiest entry)
✅ Custom Generate Func: Control multi-turn dialogue, tool calling & external API integration
✅ Custom Rollout Func: Fully take over sampling concurrency & filtering logic
✅ Custom DataSource: Fetch prompts from API / local files / dynamic data streams
✅ Dynamic Filter: Discard low-value sample groups (e.g., zero-variance GRPO samples)
✅ Custom Loss Function: Rewrite PPO/GRPO loss calculation freely

All custom code loads dynamically via --custom-xxx-path config 📝

🚀 Ray GPU Scheduling Magic

Two deployment modes for all cluster scales:
🔹 Colocate Mode: Train & inference share GPUs → high utilization, ideal for small 8-card servers
🔹 Disaggregate Mode: Independent GPU pools → train-infer overlap, perfect for multi-node clusters

Slime stabilizes Ray Placement Group GPU mapping via IP/GPU ID sorting to guarantee reproducibility 🔒

⚡ SGLang Rollout Engine Internals

3-layer abstraction: RolloutManager → RolloutServer → ServerGroup → SGLangEngine

Standout design highlights:
🔸 Over-sampling + Dynamic Filter: Pre-sample extra data, filter invalid groups on the fly
🔸 Async Concurrent Sampling: Process completed groups immediately with FIRST_COMPLETED
🔸 Abort Mechanism: Stop redundant sampling once target data size is met, save compute
🔸 Singleton GenerateState: One-time tokenizer & connection initialization

🧠 Megatron Training Backend

Native support for mainstream RL algorithms:
✅ GRPO: No Critic needed, group-wise reward normalization (most popular)
✅ PPO: Classic Actor-Critic with GAE advantage estimation
✅ REINFORCE++: Token-level baseline optimization

Seamless support for Dense & large MoE models with full parallelism 📊

🔄 Weight Sync: The Hard Engineering Solved

Two high-performance sync paths:
🔹 Colocate: IPC + Gloo → intra-node low-latency weight transfer
🔹 Disaggregate: NCCL Broadcast → cross-node distributed sync

MoE OOM prevention: Chunked Bucket Weight Update → sync parameters in small batches, release memory instantly 🧩

💡 Core Takeaways

✨ Slime’s elegance lies in integrating mature top-tier stacks with clean decoupled design
✨ Minimal top-level logic, maximal internal engineering depth
✨ Fully pluggable customization for all RL scenarios (Math / Code / Agent / MoE)
✨ Optimized for both small single-node & large multi-node clusters

🔗 Full article: https://t.co/ig7KCAKCZL

#LLM #RLTraining #SGLang #AIInfrastructure #MoE #MachineLearning
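Two ideas from the thread, GRPO's group-wise reward normalization and the dynamic filter that discards zero-variance groups, are easy to illustrate in isolation. The following is a minimal standalone sketch in plain Python, not Slime's code: the function names, the `eps` and `tol` parameters, and the toy rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group (GRPO-style).

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation, so no learned critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def drop_zero_variance_groups(groups, tol=1e-8):
    """Discard groups whose rewards are all (nearly) identical.

    Such groups normalize to all-zero advantages and carry no learning signal.
    """
    return [g for g in groups if pstdev(g) > tol]


# Toy rewards: one group per prompt, four samples per group (made-up values).
groups = [
    [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative
    [1.0, 1.0, 1.0, 1.0],  # all correct: zero variance
    [0.0, 0.0, 0.0, 0.0],  # all wrong: zero variance
]

kept = drop_zero_variance_groups(groups)
advantages = [group_advantages(g) for g in kept]
print(len(kept), advantages)
```

Only the mixed group survives the filter, which is why over-sampling pairs naturally with it: extra groups are generated up front so that enough informative ones remain after filtering.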

15K

Z.ai @Zai_org

28 days ago

Coding plan users interested in early experimentation can fill out this form: https://t.co/DfUeGOae2h

Z.ai @Zai_org

28 days ago

GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents

This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks. https://t.co/5mCu2VHZlI

898

133

289

68K

Z.ai @Zai_org

28 days ago

Technical highlights:

CogViT Vision Encoder - Built with dual-teacher distillation: SigLIP2 for semantics, DINOv3 for texture. A two-stage recipe, masked modeling, then contrastive pretraining, with QK-Norm for attention stability at scale.

Multimodal Multi-Token Prediction (MMTP) - Three ways to pass image tokens into the MTP head were compared. The chosen approach uses a shared <image> token, removing the need to propagate visual embeddings across pipeline stages and improving training stability.

Broad Training Across Perception, Reasoning, and Agent Capability - Vision and language are fused from pre-training onward, with emphasis on multimodal code. Joint RL across 30+ task categories yields consistent gains with weaker cross-domain interference than SFT.

Multimodal RL at Scale - Infrastructure rebuilt along four axes: unified task and reward abstraction, full-pipeline asynchrony, fine-grained memory management for vision modules, and topology-aware partitioning for variable-length visual inputs.
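For readers unfamiliar with QK-Norm, the general idea is to normalize query and key vectors before computing attention logits, which keeps the logits bounded regardless of activation scale. The sketch below is a generic pure-Python illustration using RMS normalization without a learnable gain; which normalization CogViT actually uses is not stated in the post, and the function names and toy vectors are assumptions.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for one query/key pair.

    With qk_norm=True, both vectors are normalized first, so the logit's
    magnitude is at most sqrt(d) no matter how large the activations are.
    """
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Toy vectors with deliberately large activations (made-up values).
q = [300.0, -200.0, 100.0, 400.0]
k = [250.0, -150.0, 50.0, 500.0]

raw = attention_logit(q, k, qk_norm=False)
normed = attention_logit(q, k)
print(raw, normed)
```

Without normalization the logit grows with the product of the activation magnitudes, which can saturate the softmax; with it, the logit stays within a fixed range.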

Z.ai @Zai_org

about 1 month ago

As models, contexts, and workloads grow, hidden assumptions in inference infrastructure can surface as output anomalies. Reliability requires more than throughput, latency, and availability. It also requires preserving the correctness of model state behind every generation.

Z.ai @Zai_org

about 1 month ago

Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain. https://t.co/81QCQw941P In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement. We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.

871

291

85K

Z.ai @Zai_org

about 1 month ago

After fixing correctness issues, we turned to the next bottleneck: Prefill throughput and GPU memory pressure in long-context Coding Agent serving. To address this, we introduced LayerSplit, a layer-wise KV Cache storage scheme. Instead of duplicating all layers on every GPU, each GPU stores only a subset of layers. With communication overlapped by computation, LayerSplit improved throughput by up to 132%.
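The memory effect of storing only a subset of layers per GPU can be sketched with simple bookkeeping. This is a hypothetical illustration, not the LayerSplit implementation: the post does not say how layers are assigned to GPUs or how communication is overlapped, so the round-robin policy, the function names, and every number below are assumptions.

```python
def assign_layers(num_layers, num_gpus):
    """Round-robin layer-to-GPU mapping (assumed policy, for illustration)."""
    owners = {gpu: [] for gpu in range(num_gpus)}
    for layer in range(num_layers):
        owners[layer % num_gpus].append(layer)
    return owners


def kv_bytes(num_layers_held, tokens, bytes_per_token_per_layer):
    """KV Cache bytes held by one GPU for the layers it stores."""
    return num_layers_held * tokens * bytes_per_token_per_layer


# Made-up sizes, chosen only to make the ratio easy to see.
NUM_LAYERS, NUM_GPUS = 64, 8
TOKENS, BYTES_PER_TOKEN_PER_LAYER = 200_000, 1024

owners = assign_layers(NUM_LAYERS, NUM_GPUS)

# Baseline: every GPU keeps the KV Cache for all layers.
duplicated = kv_bytes(NUM_LAYERS, TOKENS, BYTES_PER_TOKEN_PER_LAYER)
# Layer-wise split: each GPU keeps only the layers assigned to it.
split = kv_bytes(len(owners[0]), TOKENS, BYTES_PER_TOKEN_PER_LAYER)

print(f"per-GPU KV Cache: {duplicated / 2**30:.1f} GiB -> {split / 2**30:.1f} GiB")
```

In this toy setup the per-GPU KV Cache footprint shrinks by the number of GPUs. The cost is that a GPU must fetch KV data for layers it does not hold, which is why the post stresses overlapping that communication with computation.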
