First real-world consumer hardware result with DFlash on Gemma 4 👀
4.31× speedup (161.85 tok/s) on Gemma-4-26B-A4B 4-bit AWQ using a single RTX 4000 Ada 20GB.
Full thread with exact command + config:
https://t.co/ecBh19MBcF
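For scale, the two figures in the post imply a baseline throughput. This is a back-of-the-envelope sketch. It assumes the 4.31× is measured against ordinary decoding on the same card with the same quantized model, which the post does not state. The function name is made up for illustration.

```python
# Figures quoted in the post.
SPEEDUP = 4.31            # reported speedup with DFlash
DFLASH_TOK_S = 161.85     # reported throughput with DFlash

def implied_baseline(accelerated_tok_s: float, speedup: float) -> float:
    """Baseline throughput implied by an accelerated rate and a speedup factor.

    Assumption: both numbers describe the same hardware, model, and prompt mix.
    """
    return accelerated_tok_s / speedup

baseline = implied_baseline(DFLASH_TOK_S, SPEEDUP)
print(f"implied baseline: {baseline:.2f} tok/s")  # implied baseline: 37.55 tok/s
```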
DFlash for Gemma 4: Up to 6x Faster. ⚡⚡
Great to see MTP land natively in Gemma 4 today. If you want to push it further, try DFlash — open source, same quality, more speed!!
https://t.co/wKcRoibuOB
Prompting gets simpler. Existing prompts or skills developed for prior models are often too prescriptive for Fable.
We recommend reviewing and potentially updating or removing older instructions or skills if you find default performance to be better.
Claude Fable 5 is available everywhere today. Claude Mythos 5 is restricted to Glasswing partners until we expand our trusted access program.
https://t.co/iQymY0jiGq
Before the week ends, let's acknowledge one of the most INSANE weeks ever for open AI, with 25+ notable open-weight drops across every modality:
🧠 LLMs
→ NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models.
→ Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week.
→ StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0.
→ Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week.
→ JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0.
🎨 Image gen (the surprise of the week)
→ Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights.
🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped)
→ Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA.
→ RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0.
→ Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos.
→ NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B.
👁️ Vision & VLMs
→ PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0.
→ Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0.
🎬 Video, 3D & World Models
→ NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI.
→ JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3.
→ ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).
Ran a 550B-param LLM on TWO @NVIDIA DGX Sparks at 1-bit.
NVIDIA Nemotron-3-Ultra-550B-A55B in @UnslothAI's UD-IQ1_M (189GB), split across both boxes via llama.cpp RPC over the QSFP cable.
~5.4 tok/s decode, ~157 tok/s prefill. And it's coherent!!🦥
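A rough sanity check on those numbers, as a sketch only. It assumes "189GB" means 189 × 10^9 bytes, that the 550B parameter count is exact, and that the weights are split evenly across the two boxes. The post confirms none of these, and the helper names are hypothetical.

```python
PARAMS = 550e9        # parameter count from the model name (550B)
FILE_BYTES = 189e9    # quantized file size, assuming GB = 10**9 bytes
NODES = 2             # two DGX Sparks

def bits_per_weight(file_bytes: float, params: float) -> float:
    """Average bits stored per parameter across the whole file."""
    return file_bytes * 8 / params

def per_node_gb(file_bytes: float, nodes: int) -> float:
    """Weight bytes each node holds, assuming an even split (an assumption)."""
    return file_bytes / nodes / 1e9

print(f"{bits_per_weight(FILE_BYTES, PARAMS):.2f} bits/weight on average")  # 2.75
print(f"{per_node_gb(FILE_BYTES, NODES):.1f} GB of weights per box")        # 94.5
```

Under these assumptions the file averages about 2.75 bits per weight. That squares with a "1-bit" label if the quantization mixes precisions, keeping some tensors at higher precision than the nominal rate.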
We’re introducing a new GitHub Certified: Agentic AI Developer (GH-600).
As AI agents become part of modern development workflows, this role-based certification focuses on how developers and teams operate, supervise, and integrate agents across the SDLC.
If you’re already working with tools like GitHub Copilot or exploring agent-driven workflows, we’d love your input.
Learn more and get involved. https://t.co/ruiYtlsYnj
@openclaw @NousResearch @LangChain As always, Nemotron 3 Ultra is fully open.
This includes model weights, synthetic data, and post-training recipes. Available now on @huggingface → https://t.co/MDdnY047fw
🚨do you understand what just happened with NVIDIA RTX Spark..
Jensen Huang walked on stage and pulled an entire gaming PC out of his pocket.
NVIDIA merged the CPU, RTX GPU, AI hardware and up to 128GB of memory into one Windows-on-ARM superchip and called it the end of the PC as you know it.
> 20-core Grace CPU plus a Blackwell GPU with 6,144 CUDA cores - RTX 5070-tier graphics in a 14mm body.
> NVIDIA claims 100+ FPS at 1440p in 007 First Light and Forza Horizon 6, on battery.
> It runs a 120-billion-parameter AI model locally, no cloud needed.
> ASUS, Dell, HP, Lenovo, MSI and Microsoft already have 30+ laptops lined up for this fall.
The whole internet has two questions: is it real, and how much. Nobody's asking the third - what happens to every other chipmaker if it is.
Yes.
It’s not that we’ve discovered some magic bullet, but rather that JAX, or at least the open source version of it, is mostly optimized for small to medium-sized training runs on Google TPUs, whereas we need to do massive training runs on Nvidia GPUs.
Pipeline parallelism is essential and crushes fully-sharded data parallelism at scale.
And C will compile to the most efficient binary short of assembly. Maybe we will do a little assembly too.
Scrapped 500+ issues and PRs to ship a massive @luceboxai repo redesign and fixes. Very proud of the team.
https://t.co/FHgAVFd5ab
The fastest inference server isn't going to come from a datacenter, it's going to run on the GPU already in your house.
We just launched the ability to build native Android apps directly in Google AI Studio for free!
Since launch last week, people have created more than 250,000 Android apps. Likely >99% of these folks had never built an Android app before. Everyone can now build, no coding required!
@Teknium Confirmed here too. The sharp edge I hit: auth wasn’t the whole story. The raw Codex stream was valid, but response.completed sometimes had output:null and the SDK finalizer crashed. Hermes can recover by preserving streamed output_item.done items!
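A minimal sketch of the recovery idea described above, not Hermes's actual code. It treats stream events as plain dicts. The field names (`type`, `output_index`, `item`, `response`, `output`) are assumptions about the event shape, and `recover_output` is a hypothetical helper.

```python
from typing import Any, Iterable

def recover_output(events: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the final output items, falling back to streamed items.

    Items from `response.output_item.done` events are kept as they arrive.
    If `response.completed` carries output=None, the kept items are used.
    """
    streamed: dict[int, dict[str, Any]] = {}
    final_output = None

    for event in events:
        kind = event.get("type")
        if kind == "response.output_item.done":
            # Fall back to arrival order if no index is present (assumption).
            index = event.get("output_index", len(streamed))
            streamed[index] = event["item"]
        elif kind == "response.completed":
            final_output = (event.get("response") or {}).get("output")

    if final_output is not None:
        return list(final_output)
    # output was null or missing: rebuild from what was streamed, in order.
    return [streamed[i] for i in sorted(streamed)]


events = [
    {"type": "response.output_item.done", "output_index": 0,
     "item": {"type": "message", "text": "hello"}},
    {"type": "response.completed", "response": {"output": None}},
]
print(recover_output(events))  # [{'type': 'message', 'text': 'hello'}]
```

The sketch prefers the final payload when it is present. The streamed items only fill in when the final `output` is null or missing.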