@prakdadlani Well, IT managers, at least in some orgs, make sure that you are replaceable (risk management), but that process kills innovation at the individual level: everything is shared and nothing is your own, and it's handed over to someone else to manage without benefits to the inventor.
Most diffusion language models make one network do two jobs at once — represent the clean context and denoise the noisy tokens. Those two goals pull the same weights in different directions. NVIDIA just split them apart.
They released Nemotron-Labs-TwoTower — a block-wise autoregressive diffusion model built on the Nemotron-3-Nano-30B-A3B hybrid Mamba-2/attention/MoE backbone. It runs two towers: a frozen autoregressive context tower that processes clean tokens causally, and a trainable diffusion denoiser tower that refines noisy blocks via cross-attention to that context. Only the denoiser is trained — on ~2.1T tokens, a fraction of the backbone's 25T.
Here's what's actually interesting:
→ Two towers, not one: a frozen AR context tower and a trainable diffusion denoiser, connected layer-by-layer — denoiser layer i attends to context layer i, not just the last hidden state
→ 98.7% of the autoregressive baseline's quality at 2.42× generation throughput (γ=0.8, block size 16, 2×H100)
→ It commits multiple tokens per denoising step early in decoding — that's where the one-token-per-step AR bottleneck breaks
→ One checkpoint, three decoding modes: mask diffusion, mock-AR, and standard AR
→ Ablations: causal Mamba beats bidirectional Mamba, and tying the two towers under a joint loss is substantially worse
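To make the "multiple tokens per denoising step" point concrete, here is a minimal sketch of confidence-based committing inside one block. It is an illustration under assumptions, not the TwoTower decoder: `commit_confident`, the `MASK` placeholder, and the `threshold` parameter are hypothetical names, and the post does not say whether γ plays this role.

```python
MASK = None  # hypothetical placeholder for a position that is still noisy


def commit_confident(block, proposals, threshold=0.8):
    """Run one illustrative denoising step over a block.

    block:     list of committed tokens, with MASK at unresolved positions
    proposals: list of (token, confidence) pairs, one per position
    Returns the updated block and how many positions were committed.
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block), 0

    # Commit every masked position whose proposal clears the threshold.
    chosen = [i for i in masked if proposals[i][1] >= threshold]

    # Guarantee progress: fall back to the single most confident position.
    if not chosen:
        chosen = [max(masked, key=lambda i: proposals[i][1])]

    out = list(block)
    for i in chosen:
        out[i] = proposals[i][0]
    return out, len(chosen)


block = [MASK, MASK, MASK, MASK]
proposals = [("a", 0.95), ("b", 0.5), ("c", 0.9), ("d", 0.1)]
step1, committed1 = commit_confident(block, proposals)
step2, committed2 = commit_confident(step1, proposals)
```

In this toy run the first step commits two positions at once and the second falls back to one, which is the general shape of the claim: steps where several proposals are confident can fill several tokens, while low-confidence steps degrade to roughly one token per step.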
Full analysis: https://t.co/xU17IsVGWQ
Paper: https://t.co/WEmFQYmY5v
Weights: https://t.co/HvRZ6VEeAb
@NVIDIAAI @NVIDIARobotics @NVIDIADeepLearn @nvidiadeveloper
We took a 30B model and split it in two to write tokens in parallel instead of one at a time.
Introducing Nemotron-Labs-TwoTower: a diffusion language model from NVIDIA Research adapted from Nemotron-3-Nano-30B-A3B. Here’s how it works: one half holds the context, the other writes the tokens, with both reusing the pretrained model instead of training a new one from scratch.
We found it kept 98.7% of the original model’s quality at 2.42× faster generation.
Gemma 4 is now nearly 90% faster on Apple Silicon with Ollama using MLX!
The speedup comes from improved multi-token prediction (MTP), now on by default for Gemma 4, with more models to come.
Ollama automatically tunes how many tokens to draft as it runs, so it never slows generation down when speculation no longer contributes to a speedup.
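For intuition, adaptive draft tuning can be sketched as a small feedback loop on the acceptance rate. This is not Ollama's implementation: `DraftTuner`, its thresholds, and the smoothing factor are all hypothetical choices for illustration.

```python
class DraftTuner:
    """Toy controller that grows or shrinks the draft length.

    It tracks a smoothed acceptance rate (accepted / drafted tokens) and
    drafts more when speculation pays off, less when it does not.
    """

    def __init__(self, max_draft=8, alpha=0.5):
        self.max_draft = max_draft  # upper bound on drafted tokens
        self.alpha = alpha          # weight of the newest observation
        self.draft_len = 2          # current number of tokens to draft
        self.accept_rate = 1.0      # optimistic starting estimate

    def update(self, drafted, accepted):
        """Record one speculation round and return the next draft length."""
        if drafted > 0:
            rate = accepted / drafted
            self.accept_rate = (
                self.alpha * rate + (1 - self.alpha) * self.accept_rate
            )
        if self.accept_rate > 0.8:
            self.draft_len = min(self.max_draft, self.draft_len + 1)
        elif self.accept_rate < 0.4:
            # A length of 0 means speculation is switched off for now.
            self.draft_len = max(0, self.draft_len - 1)
        return self.draft_len


tuner = DraftTuner()
lengths = [tuner.update(2, 2), tuner.update(3, 0), tuner.update(3, 0)]
```

A real system would also need some way to re-enable drafting after the length reaches 0, for example by probing occasionally; that part is left out of this sketch.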
Introducing SuperQwen-Agentworld-35B-A3B-abliterated 🚀
First Super-tune version of Qwen.
Agentworld-35b is one of the strongest small agentic models available right now.
> Uncensored, intelligence-enhanced version
Try now on HF ⬇️
Good morning y'all!
Qwopus-3.6-35B-A3B-MTP-Coder is live! All GGUFs will be populating over the next few hours!
It's a lightning-fast MoE with the coder curriculum recipe. Similar to the 27B coder, it shines with thinking disabled, offering significantly faster wall time for similar, and in some cases superior, results compared to same-sized thinking alternatives! With thinking disabled, it goes toe-to-toe with the new Ornith 35B MoE across a huge eval suite (performed by @no_stp_on_snek), edging it on the coding trajectories and beating it decisively on speed and cost, even though Ornith was run with thinking enabled.
See the model card for the full test results, and shoutout to Tom, @no_stp_on_snek, for thoroughly evaluating the model for us before launch!
With MTP and thinking disabled, along with the MOE speed, it runs so quickly in harnesses like @opencode that it almost feels instant @ 253 tps on my 5090.
No 8k tokens of thinking before a coherent output is actioned. This is especially useful in long contexts, where the base models will progressively start thinking for tens of thousands of tokens before replying.
Compared to the base models with thinking off, the coder curriculum really advances the no-think frontier. Especially in terms of how creative it can be. Run temp hot as usual, 0.85-1, and make sure your harness isn't overriding the temp setting of your server at runtime.
If you want to use it to its full ability, I would recommend giving it very thorough prompts. I have been using it in opencode, and I have been blown away by the results it generates autonomously with chunky prompts. Please see the demos: Aether Dominion (an RTS game) and a slide deck presentation the model made about itself that turned out beautifully. Links are in the comments below!
I am getting results on this incredibly fast local model (with thinking disabled) that I couldn't get in some thinking frontier models over a year ago.
Open source is accelerating fast, and in light of recent events, there's never been a better time to get your local AI workflows tightened up. This MOE would be a great one to play with, and it's also a great one if you don't have much VRAM because it can run fast offloaded partially to system memory!
All of that said, please give it a run with thinking off and build something you'd like to see. We'd love to see your results and any feedback on specific use cases in the comments below!
Also, thanks so much for 5k followers, you all make up such an enjoyable and knowledgeable open source community, and I am so blessed to be able to collaborate and discuss this research with all of you. I can't express how grateful I am for every comment. As always, I will try to reply to them all!
If we ever get monetized on X, I will put every penny into buying more hardware for our lab!
Have a blessed day, my friends, looking forward to your thoughts!
https://t.co/0WkjglsaWS
This just in: Gemini Omni is open to everyone in Google Flow! Edit videos using natural language (it’s like Nano Banana, but for video!). We've loved your edits so we're opening access. Not a Google AI subscriber? You can still create 2 Omni videos free each day! Try it today!
We’re happy to announce 2 releases today:
- 🧠 Brain2Qwerty v1 is published at @NatureNeuro
- 🚀 Brain2Qwerty v2 is now publicly released
Explore how we decode sentences from non-invasive brain recordings: https://t.co/IdR6gK2hcd
Links:
📄v1 Nature Neuro: https://t.co/wnRjc9W9gI
📑v2 Meta preprint: https://t.co/oSfLOQFcvg
💻Code: https://t.co/Xbe0XWfWQL
📊Data: https://t.co/SCBbs4AhTg
📝Blog: https://t.co/15RvsAaXlH
🧵Thread: https://t.co/d8FJrVyDut
DeepSeek just published DSpark, a speculative decoding system that boosts live DeepSeek V4 serving throughput by 51% to 406% under stricter latency targets.
Most speculative decoding methods draft more tokens but waste verification compute when those tokens get rejected. DSpark fixes this with a semi-autoregressive drafter for more coherent long drafts, plus a confidence scheduler that only verifies prefixes likely to survive.
It also gives 60% to 85% faster per user generation at matched throughput.
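One way to picture "only verify prefixes likely to survive" is a rule that cuts the draft where its estimated survival probability drops too low. This is a sketch under assumptions, not DSpark's scheduler: `prefix_to_verify`, the `min_survival` cutoff, and treating per-token confidences as independent acceptance probabilities are all hypothetical simplifications.

```python
def prefix_to_verify(confidences, min_survival=0.5):
    """Return how many leading draft tokens are worth sending to the verifier.

    confidences: per-token drafter confidences, each in [0, 1]
    The chance that a prefix survives verification is approximated as the
    product of its token confidences (an assumed independence model).
    """
    survival = 1.0
    keep = 0
    for c in confidences:
        survival *= c
        if survival < min_survival:
            break  # everything past this point is unlikely to be accepted
        keep += 1
    return keep


draft_confidences = [0.9, 0.9, 0.8, 0.5, 0.9]
n_verify = prefix_to_verify(draft_confidences)
```

Here the running product falls below the cutoff at the fourth token, so only the first three are verified and the verifier does not spend compute on a tail that would probably be rejected.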
DeepSeek v4 Flash DSpark running on 2x @NVIDIAAI DGX Sparks at 60 tok/s.
~50% improvement from the previous recipe! Context set to 256k conservatively — ~3 concurrent sessions.
Thanks to @rafaelcaricio for making this happen 👇
https://t.co/6HXa9pxqhj
@dev_maims Sales is harder than building, unless it's vanity. And if you are good at sales, you don't need a local LLM per se; the customer will provide you with compute.
The AI race is getting bigger...
CHINA UNVEILS ITS OWN AI RIVAL TO ANTHROPIC'S MYTHOS 👀
China's 360 Security Technology has introduced Yitian Tulong, a new AI cybersecurity platform designed as a domestic alternative to Anthropic's Mythos.
The company unveiled 2 powerful AI systems:
> Tulongfeng: An AI model that automatically discovers software vulnerabilities and has reportedly identified 3,432 vulnerabilities, including 105 verified by Chinese authorities.
> Yitianzhen: An AI-powered cyber defense system that helps detect threats, automate security operations, and respond to cyberattacks.
They're doing everything to stay in the race and even get ahead! The world is moving faster than ever!