What if attention wasn't about matching tokens, but about operating in function space?
Glad to share our #ICML2026 paper:
📄 Functional Attention: From Pairwise Affinities to Functional Correspondences
w/ @Jiefang_Xiao @GaoMaolin @stevenygd Daniel Cremers
📄 https://t.co/rhn9NtwrBm
Gemma 4 Diffusion landed in vLLM last week. Day 0.
First diffusion LLM natively supported in vLLM. Instead of one token at a time, it predicts 256 tokens at once and iteratively denoises them in parallel.
Result: 1,000+ tokens per second at batch size 1 on a single H100.
Built on Model Runner V2. @googlegemma
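A toy sketch of what "predict a whole block, then iteratively denoise in parallel" can look like. This is not the Gemma or vLLM implementation: the confidence-based commit rule, the step count, and the names `parallel_denoise` and `make_toy_denoiser` are all assumptions made for illustration. The only part taken from the post is that every position is scored in one call per step.

```python
MASK = None  # placeholder for a position that has not been committed yet


def parallel_denoise(length, steps, denoiser):
    """Fill a block of `length` tokens in at most `steps` denoiser calls.

    Assumed commit rule (hypothetical): at each step, keep the most
    confident proposals among the still-masked positions.
    """
    block = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    calls = 0
    for _ in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        # One call proposes (token, confidence) for every position at once.
        proposals = denoiser(block)
        calls += 1
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    return block, calls


def make_toy_denoiser(target):
    """Hypothetical stand-in for the model: always proposes the target
    token, with a confidence that decays with position."""
    def denoiser(block):
        return [(target[i], 1.0 / (1 + i)) for i in range(len(block))]
    return denoiser


if __name__ == "__main__":
    target = list("parallel")
    block, calls = parallel_denoise(len(target), 4, make_toy_denoiser(target))
    print("".join(block), "in", calls, "model calls")
```

With these toy settings, 8 tokens are filled in 4 model calls rather than 8 sequential ones; the same loop with a 256-token block and 16 steps would make 16 calls.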
Packing thousands of straws together basically creates a low-tech pixel screen.
Each straw acts as an independent light pathway, perfectly mimicking how data channels work.
BREAKING: Mistral reveals compute cluster for the upcoming Le Chaton Obése at 900T parameters. 50 Billion Blackwell equivalent, directly powered by the sun
BREAKING: Le Chaton Fat has fully saturated our benchmark.
We are at a loss for words.
In response, we are retiring Design Arena.
Congratulations to the @MistralAI team, and thanks for putting us on vacation.
Rumour mill going crazy over this new Mistral model
- Napoleon class model with >10T params
- smokes Mythos on VoltaireBench
- for safety reasons, only outputs French-language code
"MiniMax Sparse Attention"
This paper from MiniMax adds a tiny Index Branch to GQA that picks the top-k KV blocks per group, then runs exact softmax only on those blocks, making sparsity GPU-native with exp-free TopK and KV-outer sparse kernels.
On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.
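A rough sketch of the select-then-attend idea described above, in plain Python. It is not MiniMax's method or kernel: it handles a single query with no GQA grouping, and the block score used here (query dotted with the mean key of each block) is an assumed stand-in for the paper's Index Branch. The function name `block_topk_attention` is hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_topk_attention(q, keys, values, block_size, top_k):
    """Single-query attention restricted to the top_k highest-scoring KV blocks."""
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Assumed index branch: score a block by q . mean(keys in that block).
    def block_score(idx):
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(len(q))]
        return dot(q, mean_key)

    scores = [block_score(b) for b in blocks]
    # Selection only compares raw scores, so no exp is computed at this stage.
    chosen = sorted(range(len(blocks)), key=lambda j: scores[j], reverse=True)[:top_k]
    selected = [i for j in sorted(chosen) for i in blocks[j]]

    # Exact (numerically stable) softmax over the selected positions only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) / z
           for d in range(len(values[0]))]
    return out, selected


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0],
            [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    values = [[float(i)] for i in range(8)]
    print(block_topk_attention(q, keys, values, block_size=2, top_k=1))
```

Setting `top_k` to the total number of blocks recovers dense attention, which makes it easy to compare sparse and dense outputs on the same inputs.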
Congrats to @GoogleDeepMind on the launch of DiffusionGemma.
The model generates 256 tokens in parallel per step, delivering 150+ TPS on DGX Spark, and 1,000+ TPS on a single H100.
We're supporting it from day one with:
• BF16 and NVFP4 checkpoints on @huggingface 🤗
• Free GPU-accelerated endpoints on https://t.co/6T0R9P7EXS
• @vllm_project support with FP8 precision
Get started with DiffusionGemma on NVIDIA: https://t.co/vurk7GCQUs
Meet DiffusionGemma ⚡ Our latest experimental open model (Apache 2.0) that generates text up to 4x faster.
Instead of predicting and typing just one word at a time like most language models, it drafts and refines entire blocks of text simultaneously.
Here’s how it works 🧵 ↓
🎉 Meet vLLM-Omni v0.22.0, a major upgrade for omnimodal world models and production-grade multimodal serving.
🌍 Day-0 @NVIDIAAI Cosmos 3 world models: text, image, audio, video, and action, in and out.
🤖 Robot serving: DreamZero + OpenPI realtime API.
🎙️ Production TTS: Qwen3-TTS, Qwen3-Omni, VoxCPM2 and more.
🎨 Faster image/video/diffusion: Wan 2.2, HunyuanVideo 1.5, LTX-2.3.
⚡ Broader quantization (FP8/INT8, MXFP4/MXFP8, W4A16, ModelOpt) and hardware coverage.
339 commits, 124 contributors, 52 of them new. Thank you all. 🙌
🔗 https://t.co/76ttSM6FHs