Meet Gemma 4 12B!
A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license.
Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇
🚀 open sourced metalBLAS, hand-tuned Metal matmul kernels for Apple Silicon, callable from PyTorch on mps.
Matches/beats MPS Graph (torch) matmuls on bf16/fp16, 2-3x faster on fp32 (TF32-relaxed) across the bench suite on M5 Pro.
Next step is to upstream this to PyTorch!
https://t.co/EMGdZaagXP
multi-turn RL and the "tito" problem keep coming up. we've been working on it for a while, and the takeaway is that it's much easier than people are making it out to be.
it takes 1 implementation rule, and 1 chat-template property that all models already comply with.
**that's all you need to do it right**
https://t.co/383yZHnz05
This is what we have been working on for the last 6 months or so at Snowflake AI Research:
Zero Redundancy Rollouts (ZoRRo):
https://t.co/OqiEPscuRL
If you do RL and want it to be much faster, make sure to have a look.
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at https://t.co/GCdiMzk1Dl via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: https://t.co/drlDrxkYtp
🤗 Open Weights: https://t.co/T13Y8i7SDM
1/n
We're opening a Hugging Face office in Tokyo!
Our goal: help open-source AI develop in Japan and grow the local community. Let's meet!
ハギングフェイスの東京オフィスがオープンしました!
私たちの目標は、日本におけるオープンソースAIの発展を支援し、ローカルコミュニティを育てることです。ぜひお会いしましょう!
In the Age of Agents, an Engineer's Most Valuable Skill Is Saying "No"
I gave a talk at Snowflake recently, sharing what I've learned about agent coding over the past two years of building SGLang's inference engine, Omni multimodal serving, and AI agent workflows. The response far exceeded my expectations — it was the first time so many people asked for the slides afterward. Probably because I deliberately avoided the hardcore technical deep-dives, and instead spent the time on one thing: explaining just how many ways AI Agents can go terrifyingly wrong when maintaining real-world projects. 😂
Slides are fragments. I wanted to reorganize these thoughts into something coherent — threading together ideas scattered across different projects into a single narrative. Starting from my own engineering practice, I want to articulate what "engineering judgment" actually means in the era of agent coding.
I. Standing at the Intersection of Infra and Agent Worlds
Some background first.
I'm a core developer of SGLang, one of the most widely deployed open-source inference engines in the world — 25K+ GitHub stars, running on over 400K GPUs. I currently lead two areas: SGLang RL Rollout (high-performance rollout infrastructure for RLHF) and SGLang Omni (multimodal and TTS model serving).
At the same time, I'm a heavy user of Claude Code, and I make no attempt to hide it. SGLang Omni's latest benchmark infrastructure — thousands of lines of production-grade code — was essentially executed line by line by Claude Code from our system design specs. We have a team of about ten, responsible for defining architecture, setting thresholds, planning file paths, and designing test matrices. AI delivers in dozens of hours. Believe it or not, I rarely write implementation-level code myself anymore.
This isn't a prediction about the future. This is my daily reality.
But precisely because I stand at the intersection of inference engine developer and heavy AI coding user, my understanding of agent coding is probably different from most people's intuition. Most people see "AI can write code now, amazing!" What I see are three seriously overlooked hazards — is what AI writes actually correct? What should the system architecture look like? And is the token cost behind all of this actually worth it?
This article follows these three questions. Starting with the first: how do you know if what AI wrote is actually correct?
II. Effort Without Measurement Is Self-Deception
Near the end of my undergraduate years, I was doing research on intent alignment. During a conversation with a mentor I deeply respect, he systematically laid out his vision for alignment, and one core step stuck with me — building real and effective benchmarking for alignment. His point was roughly: if we can't even measure whether alignment has been achieved, then all alignment work is building castles in the air.
Years later, having done agent research, inference, and RL infra — having stepped on countless landmines — that simple truth only weighs more. And I've found, regrettably, that modern benchmarks haven't kept up. They've fallen far behind the pace of the field.
The agent space is especially bad. Every few days there's a new demo — it can control browsers, rewrite compilers, supposedly put all CUDA engineers out of work. But press further: how do you measure if it's actually good? The answer is usually a few cherry-picked cases or a carefully edited video. On Xinzhiyuan (a prominent Chinese AI outlet), human engineers have been "replaced by AI" a thousand times over. Yet the top Cutlass engineers are still sitting in their offices, drawing high salaries, writing the kernels that actually run in production.
So in my own projects, benchmarking has been the highest priority from day one. Bar none.
I felt this most acutely building how-to-sglang — a multi-agent system for helping users understand SGLang code and answering community questions. The temptations were enormous at the start: add RAG, connect more data sources, build multi-turn conversation, try fancy agent debating. The feature list could stretch to the ceiling. But the first thing I did was build an LLM-as-a-Judge evaluation framework. Before adding any feature, answer one basic question: does your change actually make the agent more accurate?
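A minimal sketch of that evaluation-first loop, assuming a judge that returns True when an answer is acceptable. The names `evaluate`, `keep_change`, `toy_judge`, and the toy agents are hypothetical and exist only for illustration; a real judge would be an LLM call, not the substring check used here, and the `min_gain` cutoff is an arbitrary placeholder.

```python
from typing import Callable, Sequence

# Hypothetical types: an agent maps a question to an answer; a judge decides
# whether an answer is acceptable given the question and a reference.
Agent = Callable[[str], str]
Judge = Callable[[str, str, str], bool]


def evaluate(agent: Agent, judge: Judge, cases: Sequence[tuple[str, str]]) -> float:
    """Return the fraction of (question, reference) cases the judge accepts."""
    passed = sum(judge(q, ref, agent(q)) for q, ref in cases)
    return passed / len(cases)


def keep_change(baseline: float, candidate: float, min_gain: float = 0.02) -> bool:
    """Accept a change only if it beats the baseline by a measurable margin."""
    return candidate - baseline >= min_gain


def toy_judge(question: str, reference: str, answer: str) -> bool:
    # Stand-in for an LLM judge: accept if the reference appears in the answer.
    return reference.lower() in answer.lower()


CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]


def baseline_agent(question: str) -> str:
    return "The answer is Paris."


def candidate_agent(question: str) -> str:
    # A "promising" change that does not actually alter behaviour.
    return "After careful reasoning, the answer is Paris."


base_score = evaluate(baseline_agent, toy_judge, CASES)
cand_score = evaluate(candidate_agent, toy_judge, CASES)
print(base_score, cand_score, keep_change(base_score, cand_score))
```

The point of the structure is that every proposed feature goes through `evaluate` before it is merged, so a change with no measurable gain is rejected by default.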
The result: most seemingly promising optimizations showed zero improvement in testing. Without that benchmark, every decision was blind guessing — we thought we were improving, but we weren't.
Building SGLang Omni's benchmark was the same story. Before I took over: an optimization PR gets merged, TPS numbers look good, everyone's happy. A while later accuracy drops, nobody can tell which commit caused the regression, and painful bisecting begins. My first act: stop all development, build accuracy and performance CI first, then talk about optimization. Final results — S2 Pro WER 1.18% (excluding bad cases), Qwen3 Omni 1.91% without voice clone, 1.88% with voice clone. Acceptance criteria ±0.1%, all passing.
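A hedged sketch of what such an accuracy gate could look like in CI. The baseline numbers are the WER values quoted above. The dictionary keys, the function name `check_accuracy`, and the reading of the ±0.1% criterion as absolute percentage points are assumptions.

```python
# Baseline WER values in percent, taken from the results quoted above.
# The key names are hypothetical labels.
BASELINE_WER = {
    "s2_pro": 1.18,
    "qwen3_omni": 1.91,
    "qwen3_omni_voice_clone": 1.88,
}
TOLERANCE = 0.1  # assumed to mean absolute percentage points


def check_accuracy(measured: dict[str, float]) -> list[str]:
    """Return one message per model whose WER drifted outside the tolerance."""
    failures = []
    for name, expected in BASELINE_WER.items():
        got = measured[name]
        if abs(got - expected) > TOLERANCE:
            failures.append(f"{name}: WER {got:.2f}% vs baseline {expected:.2f}%")
    return failures


ok_run = {"s2_pro": 1.21, "qwen3_omni": 1.90, "qwen3_omni_voice_clone": 1.85}
bad_run = {"s2_pro": 1.21, "qwen3_omni": 2.30, "qwen3_omni_voice_clone": 1.85}
print(check_accuracy(ok_run))
print(check_accuracy(bad_run))
```

Running a gate like this on every PR is what turns "accuracy dropped at some point" into "this commit failed the check".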
At least inference system evaluation is objective — if the number is higher, it's higher. No room for debate. Unlike agent evaluation, which is riddled with subjective judgment and fuzzy definitions. That certainty is precious.
Effort without measurement isn't effort. It's self-deception.
Benchmarking solves the "how do you know it's correct" problem. But there's an even more upstream question: who writes the benchmark framework itself? In my case, AI wrote it — but that's only half the answer.
III. The Prompt Itself Is the System Design
When I say Omni's benchmark refactor — thousands of lines — was mostly written by AI, that's not bragging. It's fact. Writing pytest fixtures, constructing subprocess calls, parsing JSON results, generating CI workflows — AI did it fast and well.
But there's a detail that's easy to miss: that prompt itself was my system design.
The most critical decision in the entire refactor was task × model orthogonal separation. The old version was a 722-line monolithic script, benchmark_tts_speed.py, with all model and task logic coupled together. After refactoring: tasks/, metrics/, dataset/, benchmarker/, eval/ — five modules. Why this decomposition? Because I knew a series of new models would be joining. Without model-agnostic abstraction, every new model means rewriting the evaluation framework. But you can't over-abstract either — Omni models differ far more than LLMs do. S2 Pro uses a Dual-AR codec architecture; Qwen3 Omni uses a 9-stage multi-process pipeline. Evaluation logic can't be fully unified. The task × model orthogonal separation is the balance point between reuse and flexibility.
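One way to picture the task × model separation is sketched below. This is not the actual SGLang Omni code: `ModelAdapter`, `Task`, `run`, and `EchoModel` are hypothetical names, and the toy metric only shows how a task stays unaware of which model it runs against.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ModelAdapter(Protocol):
    """Model-specific side: how one model turns an input into an output."""
    name: str

    def synthesize(self, text: str) -> str: ...


@dataclass
class Task:
    """Model-agnostic side: inputs plus a metric over (inputs, outputs)."""
    name: str
    inputs: list[str]
    metric: Callable[[list[str], list[str]], float]


def run(task: Task, model: ModelAdapter) -> float:
    """Evaluate one task on one model; neither side knows about the other."""
    outputs = [model.synthesize(x) for x in task.inputs]
    return task.metric(task.inputs, outputs)


@dataclass
class EchoModel:
    """Toy adapter standing in for a real model backend."""
    name: str

    def synthesize(self, text: str) -> str:
        return text.upper()


def exact_match(inputs: list[str], outputs: list[str]) -> float:
    return sum(i.upper() == o for i, o in zip(inputs, outputs)) / len(inputs)


tasks = [Task("toy_task", ["hello", "world"], exact_match)]
models = [EchoModel("model_a"), EchoModel("model_b")]

# The full grid is a cross product: adding a model adds a column, not a rewrite.
results = {(t.name, m.name): run(t, m) for t in tasks for m in models}
print(results)
```

In this shape, model-specific quirks live inside the adapter, which is where the "can't over-abstract" caveat applies: the adapter interface has to stay thin enough that very different pipelines can still implement it.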
Ask AI directly to "refactor these 722 lines" and it'll give you a decomposition. But getting the granularity exactly right depends on our judgment about the project's future — what models are coming, what dimensions will change, what's worth abstracting and what isn't. This context is fuzzy, dynamic, full of probabilistic judgment. You can't fully distill it into a prompt.
AI gives you a decomposition. System design gives you the right decomposition.
Code is flesh. Architecture is skeleton. In an era where AI can write ten thousand lines a day, right architecture means ten thousand lines of asset; wrong architecture means ten thousand lines of debt. And AI simultaneously amplifies the cost of wrong directions — it can turn one piece of tech debt into an entire debt empire at a speed you can't imagine.
Saying "system design matters" is empty talk. Let's look at some concrete cases where AI went wrong.
IV. Where AI Actually Fails
Where exactly did Claude fail during the Omni benchmark refactor? A few representative examples.
First category: blind spots in engineering conventions. Claude used gdown to download datasets from Google Drive — fine for a side project, but a ticking time bomb in SGLang's CI. Google Drive rate-limits, 403s, confirm tokens — our main repo has been burned too many times by unstable external download sources. The correct approach: host datasets on HuggingFace, use snapshot_download. Similar issues: dataset fixtures hardcoded to /tmp/ (path conflicts in concurrent jobs), server teardown with only SIGTERM and no SIGKILL fallback, JSON key access without schema validation. Each of these is individually "common sense," but what counts as common sense depends on which environment you work in. AI's common sense comes from the statistical distribution of internet corpora, not from the specific failure history of a particular team.
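Two of those conventions, the SIGKILL fallback and the per-job scratch directory, can be sketched with the standard library alone. The helper name `stop_server` and the sleeping child process are stand-ins for illustration, not the project's fixtures.

```python
import subprocess
import sys
import tempfile


def stop_server(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask the process to exit (SIGTERM on POSIX), then force-kill if it hangs."""
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be ignored by the child
        return proc.wait()


# A unique scratch directory per run instead of a hardcoded /tmp/ path,
# so concurrent CI jobs do not overwrite each other's files.
with tempfile.TemporaryDirectory(prefix="bench_") as workdir:
    # Stand-in for a real server launch: a child process that just sleeps.
    server = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        cwd=workdir,
    )
    exit_code = stop_server(server, grace_seconds=2.0)
    print("server exited with code", exit_code)
```

The fallback matters because a server stuck in a GPU call or a blocked thread may never act on SIGTERM, and a leaked process can hold the port or the GPU for the next job.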
Second category: CI threshold design. Claude set the TPS threshold at 55 tok/s, with observed values of 85-87 — over 35% margin. This threshold catches catastrophic regression (88→28), but performance silently sliding from 87 to 60 wouldn't trigger any alarm. I looked at four measurements repeatedly — 85.8, 85.9, 86.9, 87.1 — standard deviation roughly 0.6. Final threshold: 80, all metrics standardized to 13-15% margin. The core of this decision isn't arithmetic — it's having a feel for this specific system's run-to-run variance, knowing what margin is "tight enough to catch chronic degradation but loose enough to avoid flakiness." Anyone who's done CI knows: threshold design is a systems engineering problem, not a math problem.
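The arithmetic behind that comparison can be sketched as follows. The four measurements are the ones quoted above. The margin definition used here (headroom relative to the observed mean) is an assumption, since the text does not spell out how its percentages are computed.

```python
import statistics

tps_runs = [85.8, 85.9, 86.9, 87.1]  # the four measurements quoted above
mean_tps = statistics.mean(tps_runs)
stdev_tps = statistics.stdev(tps_runs)  # sample standard deviation


def margin(threshold: float) -> float:
    """Relative headroom between the observed mean and a CI threshold (assumed definition)."""
    return (mean_tps - threshold) / mean_tps


def triggers_alarm(threshold: float, observed_tps: float) -> bool:
    """A CI check fails when throughput drops below the threshold."""
    return observed_tps < threshold


print(f"mean={mean_tps:.2f} tok/s, stdev={stdev_tps:.2f} tok/s")
for threshold in (55, 80):
    print(
        f"threshold={threshold}:",
        "catches drop to 28:", triggers_alarm(threshold, 28),
        "| catches slide to 60:", triggers_alarm(threshold, 60),
    )
```

With run-to-run noise well under 1 tok/s, a threshold several standard deviations below the mean still avoids flaky failures, while a threshold at 55 lets a slide to 60 pass unnoticed.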
These aren't edge cases. They're systematic. AI writes fast, but between "writing fast" and "writing correctly" lies an entire engineering environment's worth of distance.
Everything above concerns AI coding's limitations in the "writing correct code" dimension. Next, I want to zoom out — not just whether the code is correct, but whether the tokens consumed behind it are actually worth the cost.
V. The Token Efficiency Crisis: Using a Fire Hose to Water Flowers
As an inference engine developer, my daily work is thinking about how to maximize prefix cache hit rates, optimize KV cache memory layouts, and minimize the cost of each inference request. So when I connected Claude Code to a local inference engine and observed its actual request patterns — how to put this — it felt like a water conservation engineer who carefully designed a reclamation system, watching someone water flowers with a fire hose.
Cache hit rate was devastating. Not "decent but room for improvement" — "the prefix cache mechanism we carefully designed at the inference engine level was almost completely destroyed." A single user query triggers multiple low-value tool calls, each carrying over 100K tokens of context window. The Resume feature breaks KV cache hits entirely — an almost absurd bug. The entire session's context construction was never seriously designed for cache reuse from the start.
I like the RAM bloat analogy. In 1969, 64KB of memory sent Apollo to the moon. In 2026, opening a web page costs 500MB, easy. Each generation of hardware engineers pushes memory capacity higher; each generation of software engineers gleefully fills it up. We've gotten used to this cycle.
But LLM inference is different. RAM bloat costs you a slightly slower computer and a couple hundred bucks for an upgrade. Token bloat costs real money (GPU cluster electricity, user subscriptions), and the bill grows with every step of agent adoption. GPU compute supply elasticity is far lower than DRAM supply elasticity. When compute is constrained, token efficiency isn't "nice to have." It's the core competitiveness that determines who survives.
I have a bold hypothesis: for those sessions consuming 700K tokens, there must be ways to accomplish the exact same task with 10% of the tokens. Not by sacrificing quality — through smarter context compression, better prefix reuse strategies, more precise tool call scheduling. Anyone who has optimized inference engines, seeing current agent framework request patterns, would reach a similar conclusion.
"Reducing wasteful token spending" isn't a defensive optimization. It's an offensive capability. Whoever first achieves an order-of-magnitude reduction in token consumption at the same quality level can serve ten times the users on the same compute budget.
But is the root cause of token waste merely sloppy agent framework design? The more I think about it, the more I believe the deeper issue is architectural.
VI. Agent and Inference Engine: The Missing Co-Design
The current architecture works like this: agent frameworks treat inference engines as stateless API calls, carrying full context with every request. Inference engines do their best at prefix matching, caching what they can. Fully decoupled. Zero coordination. Simple, general-purpose, but brutally inefficient for long sessions.
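To make the prefix-matching point concrete, here is a simplified sketch that measures how many prompt tokens a request shares with the one before it. It is a toy model: real engines match against many cached sequences rather than only the previous request, and the token ids, function names, and "volatile header" scenario are made up for illustration.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_hit_ratio(requests: list[list[int]]) -> float:
    """Fraction of prompt tokens reusable from the immediately preceding request."""
    reused = total = 0
    prev: list[int] = []
    for tokens in requests:
        reused += shared_prefix_len(prev, tokens)
        total += len(tokens)
        prev = tokens
    return reused / total


system = list(range(100))  # stand-in token ids for a long system prompt
turn1, turn2 = [1001, 1002], [2001, 2002]

# Append-only history: each request strictly extends the previous one.
append_only = [system + turn1, system + turn1 + turn2]

# Same content, but a volatile token (say, a timestamp) sits at the very front.
volatile_head = [[9001] + system + turn1, [9002] + system + turn1 + turn2]

print(f"append-only:   {cache_hit_ratio(append_only):.2f}")
print(f"volatile head: {cache_hit_ratio(volatile_head):.2f}")
```

A single changed token near the start of the context invalidates everything after it, which is why context construction order matters so much for long sessions.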
My vision: if agent frameworks could sense the inference engine's cache state and proactively construct cache-friendly requests; if inference engines could understand the agent's session semantics and make smarter cache eviction decisions — once this information channel between the two opens, the potential for token efficiency gains is enormous.
This requires three parties to sit down together: model builders, inference engine builders, and agent framework builders. Right now, we're nowhere close.
Maybe the market ultimately decides "compute gets cheap enough, waste doesn't matter," just like the RAM story. But I don't believe the token economy will follow the same path. Not in the near term.
The age of agents doesn't belong to those who burn the most compute. It belongs to those who use it most intelligently.
Having covered the token problem from an inference engine perspective, I want to turn the lens back to agents themselves. In the preceding sections I've been criticizing agents — code isn't correct, tokens are wasted, no coordination with inference engines. But let's flip the question: what's the actual moat for agent builders?
VII. The Agent Moat Paradox
I've found a fascinating paradox in the agent space.
Individual techniques are trivially simple to implement. Agent Debating — the so-called "core moat" of many multi-agent systems — doesn't even come close in implementation difficulty to MLA (DeepSeek's significant breakthrough starting with V2). The barrier to entry is nearly zero.
But the verification system is impossibly complex. The first step of any empirical research is building the right benchmark. Inference benchmarks are mature — TTFT, TBT, Throughput. These objective metrics were being used by database engineers decades ago, just under different names. But agent evaluation is riddled with subjective judgment and fuzzy definitions. OpenClaw's benchmark is nothing like a vibe coding benchmark. The complexity of verification far exceeds the complexity of implementation.
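For contrast, the inference-side metrics are simple enough to compute from token arrival timestamps. The sketch below uses common definitions (time to first token, mean time between tokens, tokens per second over the whole request); exact definitions vary slightly between tools, and the function name and sample timestamps are illustrative.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean TBT, and throughput from token arrival times in seconds."""
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0
    total_time = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "tbt_s": tbt,
        "throughput_tok_s": len(token_times) / total_time,
    }


# Five tokens: the first arrives after 0.5 s, then one every 0.1 s.
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(metrics)
```

Nothing comparable exists for "was this agent answer good", which is the asymmetry the paragraph above describes.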
Then there's the explosion of the strategy combination space. SGLang has over a hundred server args. Finding the optimal combination for specific hardware and workload is enormously complex. Same for agents: individual strategies are simple, but finding the optimal combination under real-world constraints — that's the real core capability. A top engineer who deeply understands the system derives their value not from implementing any single strategy, but from having a sense for the optimal direction within a complex strategy space.
There's a question I still haven't resolved. Inference and training system strategy optimization typically has clear trade-offs — enabling partial rollout makes it hard to avoid off-policy effects. But do agent strategies have trade-offs against each other? Does turning everything on always produce the best agent? In my own optimization of how-to-sglang, I found most strategies are highly invasive — including human-in-the-loop, including circular debating. This makes me suspect the combination problem is far more complex than we imagine.
Behind the moat paradox hides another question: if individual agent techniques are this simple to implement, and AI can write code at terrifying speed — what happens when AI starts writing code for itself, expanding its own capabilities?
VIII. Code Bloat: The Terrifying Speed of AI Self-Evolution
Look at OpenClaw's codebase and you'll find something eerie.
Early last month: roughly 400K lines. One month later: approaching 1 million. 500+ commits per day. AI agents fully controlling and deeply participating in their own development, with no one able to truly review what's happening. Someone even built a repo called nanobot, claiming to replicate the core functionality in 4,000 lines — 99% smaller.
From the perspective of a large-scale software maintainer, this is terrifying. Rapid growth with zero comprehensibility, entropy increasing at horrifying efficiency.
I later exchanged messages with OpenClaw's maintainer Peter Steinberger on GitHub. His maintenance quality and enthusiasm impressed me, and OpenClaw hasn't fallen into fully unsupervised AI self-maintenance. But the question remains: to what extent can we maintain a clean agent system that handles most functionality while avoiding malignant code bloat and keeping our ability to actually debug it?
AI excels at local optimization — writing functions, fixing bugs, adding features. No problem. But "keeping a system simple" isn't a local problem. It requires a kind of global restraint — being able to say "this, we don't add," and meaning it genuinely, not because some rule says so.
That restraint may be the last thing humans contribute to software engineering.
Of course, maybe I'm overthinking it. Maybe next-generation models really will have "taste," like many of the top engineers I know — maybe they'll understand that the best code is often the code that was never written.
Speaking of "taste" and "restraint," the various new concepts recently trending in our circles are a perfect counter-example.
IX. Old Wine in New Bottles — and Real Engineering Lessons
I recently read a lengthy essay on harness engineering, tens of thousands of words. My first reaction wasn't "what an impressive concept" but "do these people have any ideas beyond coining new terms for old concepts?"
Prompt engineering → Context engineering → Harness engineering → next month probably scaffold engineering or orchestration engineering. It's all the same thing: designing the environment in which your model operates — what information it receives, what tools it uses, how errors are intercepted, how cross-session memory is managed. This has existed since the day ChatGPT launched. It doesn't become a new discipline just because someone gives it a new name.
Complaints aside, the lessons I learned from how-to-sglang are real, and they overlap heavily with the research those articles cite.
Less information, more precision. Our first approach was one giant agent stuffed with all of SGLang's docs, code, and cookbooks, answering everything. Of course it didn't work — the context window isn't RAM. The more you stuff in, the more attention dilutes, the worse the answers get. We ended up with a multi-tier sub-domain expert architecture: one expert agent per subdomain, an Expert Debating Manager to receive questions, decompose sub-problems, and consult the Expert Routing Table to activate the right agents. This improvement delivered more gains than upgrading to a stronger model.
The repo is the single source of truth. All expert agent knowledge comes from markdown files within the repo. No external docs, no verbal agreements. We initially felt the urge to write one massive sglang-maintain.md covering everything — quickly found it didn't work. OpenAI's Codex team hit the same wall: they tried one giant AGENTS.md to rule them all, and it predictably rotted fast. Expired documentation doesn't just go unread — it actively misleads agents.
Structured routing, not guessing. The Expert Routing Table explicitly maps question types to agents. A question about GLM-5 INT4 simultaneously activates the Cookbook Domain Expert and Quantization Domain Expert. Not guessing by the Manager — guided by an index.
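A minimal sketch of index-driven routing under stated assumptions. The two expert names come from the example above. The keyword index, the `route` function, and keyword matching as the lookup mechanism are hypothetical, since the text only says the table maps question types to agents.

```python
# Domain -> expert agent. The two expert names are the ones mentioned above.
ROUTING_TABLE = {
    "cookbook": "Cookbook Domain Expert",
    "quantization": "Quantization Domain Expert",
}

# Hypothetical index from question keywords to domains.
KEYWORD_INDEX = {
    "glm-5": ["cookbook"],
    "int4": ["quantization"],
    "fp8": ["quantization"],
}


def route(question: str) -> list[str]:
    """Return the experts to activate, in index order, without duplicates."""
    text = question.lower()
    domains: list[str] = []
    for keyword, mapped in KEYWORD_INDEX.items():
        if keyword in text:
            for domain in mapped:
                if domain not in domains:
                    domains.append(domain)
    return [ROUTING_TABLE[d] for d in domains]


print(route("How do I serve GLM-5 INT4?"))
print(route("What is the weather today?"))
```

Because the mapping is explicit data, a wrong route is fixed by editing the table, not by re-prompting the manager and hoping.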
None of these lessons are new. Separation of concerns, single responsibility, docs-as-code, shifting constraints left: these are traditional software engineering principles. It's just that now we're designing working environments for LLMs, so some people feel the need for a new name. The principles don't need one.
The first nine sections have mainly covered the "software" side. To close, I want to discuss two harder topics that I keep running into — one about hardware, one about abstraction.
X. GPU-Only Debugging, and the Cost of Premature Abstraction
First: the debugging cost of ML infrastructure. This domain has a brutal reality — you simply cannot debug on CPU. The bugs that actually matter — CUDA Graph capture failures, multi-stream race conditions, FP16/BF16 numerical divergence, KV cache memory fragmentation at production batch sizes — only manifest on GPUs, at scale, with real kernels running. AI can help you write a CUDA wrapper, but it can't reproduce the graph capture failure that only appears on H100 with 3 concurrent requests at a specific memory layout. ML infra debugging requires hardware intuition — understanding how GPUs actually behave, not just how the code reads. This is the domain AI coding struggles most to reach.
Second: the premature abstraction trap. This problem has gotten worse in the agent era. Previously, over-abstraction at least took time to write — three wrapper layers around a function called once, a config system managing three parameters, architecture diagrams drawn before problem boundaries are understood. Now with AI, these things arrive in minutes. But the cognitive debt they leave behind hasn't decreased at all. Premature abstraction isn't just useless — it's actively harmful, increasing the cognitive load for every person who comes after. And cognitive load is the most hidden, most lethal kind of engineering cost.
It's not that abstraction is wrong. The timing is wrong. AI makes us write code ten times faster, but also makes us accumulate cognitive debt ten times faster.
GPU debugging tests hardware intuition. Premature abstraction tests restraint. At their core, they test the same thing.
Closing: Engineering Sense Is Sorting
Looking back at this entire article, I've really been saying one thing.
An engineer's most valuable ability isn't building complex things. It's looking at a pile of things that all seem worth doing, and identifying which ones actually matter. Writing code is addition. Engineering sense is sorting. You need to be able to face a cool optimization idea and say "not now — get the benchmark solid first." Face an elegant abstraction and say "delete it, we don't need this yet." When everyone is stacking features, say "stop — let's first confirm what we're actually optimizing."
This judgment doesn't come from books. It's the muscle memory left behind after crawling out of one specific pit after another. From a mentor's lesson about benchmarking, to choosing to build evaluation first when building agents, to building benchmark infrastructure for Omni, to observing Claude Code's token waste, to thinking about the nature of agent moats — the same insight, evolved from "that makes sense" to instinct.
In an era where AI can write ten thousand lines of code a day, execution is depreciating fast. But system design has never been more important — because AI simultaneously amplifies the cost of going in the wrong direction.
The age of agents doesn't belong to those who burn the most compute, or write code the fastest, or coin the most new terms. It belongs to those who know what not to build.
We've been studying what it takes to get NVFP4 & MXFP8 to deliver good speedups on modern flow models for image & video gen. on B200 🕵️‍♂️
Today, I'm excited to share those findings!
Bringing some cool recipes through Diffusers and TorchAO with `torch.compile` 🔥
Hop in ⬇️
NVFP4 allows models to be quantized to 4 bits without too much performance degradation, but can we push 4-bit performance even further?
Today, we're releasing a new class of low-precision block-scaled data types that natively adapt to your input data: for 4-bit quantization, IF4 (Int/Float 4) allows each scaled group of 16 values to be saved as FP4 or INT4 depending on which option offers less error. Selections are recorded using the scale factor’s sign bit, which is unused in NVFP4, allowing IF4 to offer better performance with no memory overhead!
Our data types provide better downstream accuracy in LLMs, they can be implemented efficiently in next-generation hardware accelerators, and they reveal some interesting insights about low-bit quantization! 🧵
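A rough sketch of the per-group selection idea in plain Python, under several assumptions that are not from the announcement: the scale is kept as an ordinary float (real NVFP4 stores scales in a low-precision format), INT4 is treated as the symmetric range -7..7, error is measured as squared error, and a negative scale is used to mean "INT4". All function names are hypothetical.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 (E2M1)
INT4_MAX = 7  # assumption: symmetric INT4 range -7..7


def _nearest(value: float, grid: list[float]) -> float:
    return min(grid, key=lambda g: abs(g - value))


def quantize_fp4(group: list[float]) -> tuple[float, list[float]]:
    """Scale the group so its largest magnitude maps to 6, then snap to the FP4 grid."""
    scale = max(abs(x) for x in group) / FP4_GRID[-1] or 1.0
    deq = [(1.0 if x >= 0 else -1.0) * _nearest(abs(x) / scale, FP4_GRID) * scale for x in group]
    return scale, deq


def quantize_int4(group: list[float]) -> tuple[float, list[float]]:
    """Scale the group so its largest magnitude maps to 7, then round to integers."""
    scale = max(abs(x) for x in group) / INT4_MAX or 1.0
    deq = [max(-INT4_MAX, min(INT4_MAX, round(x / scale))) * scale for x in group]
    return scale, deq


def quantize_if4(group: list[float]) -> tuple[float, list[float]]:
    """Pick FP4 or INT4 for one group of 16 values, whichever has lower squared error."""
    assert len(group) == 16
    candidates = {"fp4": quantize_fp4(group), "int4": quantize_int4(group)}

    def error(deq: list[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(group, deq))

    fmt = min(candidates, key=lambda name: error(candidates[name][1]))
    scale, deq = candidates[fmt]
    # Assumption: a negative scale marks the group as INT4; the format choice
    # rides on the scale's sign, so no extra bits are stored.
    return (-scale if fmt == "int4" else scale), deq


evenly_spaced = [float(i % 8) for i in range(16)]              # favours INT4
fp4_shaped = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 0.0] * 2      # favours FP4
print(quantize_if4(evenly_spaced)[0], quantize_if4(fp4_shaped)[0])
```

Evenly spaced values fit the uniform INT4 grid, while values clustered near zero with a few large outliers fit the non-uniform FP4 grid, which is the intuition behind choosing per group.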
When you run a @PyTorch model on a GPU, the actual work is executed through kernels. These are low-level, hardware-specific functions designed for GPUs (or other accelerators).
If you profile a model, you'll see a sequence of kernel launches. Between these launches, the GPU can sit idle, waiting for the next operation. A key optimization goal is therefore to minimize gaps between kernel execution and keep the GPU fully utilized.
One common approach is `torch.compile`, which fuses multiple operations into fewer kernels, reducing overhead and improving utilization.
Another approach is to write custom kernels tailored to specific workloads (e.g., optimized attention or fused ops). However, this comes with significant challenges:
> requires deep expertise in kernel writing
> installation hell
> integration with the model is non-trivial
To address this, @huggingface introduces the `kernels` library.
With this, one can:
> build custom kernels (with the help of a template)
> upload them to the Hub (like models or datasets)
> integrate them into models with ease
Let's take a look at how the transformers team uses the `kernels` library to integrate custom kernels into existing models. (more in the thread)
Introducing Unsloth Studio ✨
A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: https://t.co/2kXqhhvLsb
Blog and Guide: https://t.co/ENuTWal5AA
Available now on Hugging Face, NVIDIA, Docker and Colab.
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate and TRL
For extensive details please see this writeup:
https://t.co/2xDWUk8p3V
Thanks a lot to @krasul for helping make it happen, and to the others on the HF team who helped with the integration.
Introducing Modular Diffusers 🔥
The `DiffusionPipeline` abstraction in Diffusers has established a standard in the community. But it has also limited flexibility.
Modular Diffusers breaks those shackles & enables the next gen. of creative user workflows 🧨
Details ⬇️
Today https://t.co/jFknDoasSy joins Hugging Face
Together we will continue to build ggml, make llama.cpp more accessible and empower the open-source community. Our joint mission is to make local AI easy and efficient to use by everyone on their own hardware.