@grok @sytelus Got it. So in the future, we might see Microsoft branding models with other qualifiers, like “Sell” for a model focused on selling, or “Writer” for a model trained on creative writing, etc.
@BrianRoemmele I managed a software team that wrote execution trace disassemblers. One of our target architectures was the Gmicro 'TRON' line of microprocessors.
You're welcome.
Grok build with /loop is scary good.
I'm driving continuous improvement on a large software project that involves searching a complex product surface, deducing and building an agent API to the product.
My job is reduced to drinking coffee, watching the output, and nodding my head in approval, as if I'm actually keeping up with its findings.
Identities die hard....
@ollama @grok Compare and contrast OpenJarvis with Hermes. I have an RTX 3090 on Ubuntu 24.04 running llama.cpp right now, with Qwen3.6-27B at a 3-bit quant and MTP. What are my options?
🚀 Self-speculation brings 6.75x real speedup for LLM generation with SGLang inference!
Same model drafts future tokens in Diffusion mode → then verifies them in AR (causal) mode. One model and one KV cache. Just different attention masks.
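A rough sketch of what "just different attention masks" could look like. The layout is an assumption (draft block fully visible to itself, prefix kept causal), and the function names are illustrative, not taken from the released code:

```python
def causal_mask(n):
    """AR mode: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def draft_mask(prefix_len, block_len):
    """Drafting mode (assumed layout): the prefix stays causal, while each
    draft position sees the whole prefix and every position in the block,
    including later ones."""
    n = prefix_len + block_len
    mask = causal_mask(n)
    for i in range(prefix_len, n):
        for j in range(prefix_len, n):
            mask[i][j] = True  # bidirectional inside the draft block
    return mask
```

Both masks have the same shape over the same sequence, which is what lets one set of weights and one KV cache serve both passes.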
Thanks to perfect alignment, we get 2× longer acceptance lengths than MTP techniques (Eagle-3, MTP, dFlash).
We run 2 forward passes, but the 2× longer acceptance length means we break even, and with zero overhead from the extra drafter, KV cache, or LM head that MTP requires. Those are not free.
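The draft-then-verify step and the break-even arithmetic can be sketched as below. This assumes plain greedy verification (keep draft tokens until the first disagreement with the AR pass); the function names and the numbers in the usage note are hypothetical, not measurements from the paper:

```python
def accept_draft(draft, verified):
    """Greedy verification of one speculative step.

    draft:    k tokens proposed by the drafting pass
    verified: k + 1 tokens the AR pass predicts at the same positions
              (the extra one follows the last draft token)
    """
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    # The AR pass always contributes one token: the correction at the
    # first mismatch, or the bonus token after a fully accepted draft.
    accepted.append(verified[len(accepted)])
    return accepted


def tokens_per_pass(accepted_per_step, passes_per_step):
    """Average tokens emitted per forward pass of the main model."""
    return accepted_per_step / passes_per_step
```

For example, `accept_draft([5, 7, 9], [5, 7, 8, 1])` keeps `[5, 7]` and adds the AR token `8`. With made-up numbers, accepting 6 tokens over 2 passes matches accepting 3 tokens in 1 pass: `tokens_per_pass(6, 2) == tokens_per_pass(3, 1)`. That is the break-even case, before counting the cost of a separate drafter.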
Last week we released Nemotron-Labs-Diffusion + Tri-mode LLMs! We did continued pre-training on Ministral-3 models by switching attention patterns (block causal <> bidirectional). Result: one model that runs AR mode, Diffusion mode, and Self-Speculation.
Diffusion mode already shows high benchmark accuracy - excited to see what happens when someone beats left-to-right acceptance! 🔥
GitHub: https://t.co/Zqbw3KcAyF
Paper: https://t.co/rp86A7D0xJ
SGLang inference: https://t.co/uTgZPALEJl
Try the models on HF: https://t.co/1zStcCCWPi