Share your inner interests. Like a child.
(Only don’t lose it when you grow up)
And if you already lost it, dig deep and find it again.
“The most personal is the most creative.” Martin Scorsese
What if you could take three completely different model families… and distill them into one tiny model? 🤯
📜 Paper: https://t.co/K2iKD4xFvp
MOPD (Multi-Teacher On-Policy Distillation) has become a standard procedure in post-training. We already distill multiple specialized variants of the same model into a single set of weights.
But what if we could go further - and distill models from entirely different families? Turns out, it is possible.
Today we’re releasing a paper on cross-tokenizer distillation - our first steps in this exciting direction. 📄
We distilled Qwen3-4B, Phi-4-Mini, and Llama-3B into Llama-3.2-1B.
MMLU jumped from 32.05 → 46.32 when using multiple teachers. 📈
The team is now working on Nemo-RL integration so the community can try this method in their own settings. Plus, we are scaling experiments up. 🚀
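To make the multi-teacher idea concrete, here is a toy sketch of a distillation objective that pulls one student toward several teachers at once. It is not the paper's method: it assumes all models share one tiny vocabulary, which sidesteps the cross-tokenizer problem the paper is actually about. The function names, the reverse-KL choice, and the uniform teacher weights are all illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_loss(
    student_logits: Sequence[float],
    teacher_logits_list: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of reverse KL(student || teacher) over all teachers.

    ASSUMPTION: every teacher scores the same vocabulary as the student.
    Real cross-tokenizer distillation needs an alignment step not shown here.
    """
    student = softmax(student_logits)
    n = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n] * n  # hypothetical default: treat teachers equally
    return sum(
        w * kl(student, softmax(t))
        for w, t in zip(weights, teacher_logits_list)
    )

# One token position, three-word toy vocabulary, two teachers.
loss = multi_teacher_loss([1.0, 2.0, 3.0], [[3.0, 2.0, 1.0], [0.0, 0.0, 0.0]])
print(f"toy multi-teacher loss: {loss:.4f}")
```

The loss is zero only when the student already matches every teacher, so minimizing it trades off between teachers according to the weights.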
@MichaelHyatt @Teknium @Hermes_agentAI @gregisenberg @AlexFinn @NetworkChuck Install on your main machine for daily use. It gets better the more you use it.
Also install on a machine that you'll keep running 24/7, for your cronjobs, and for using it remotely via Telegram, Discord, WhatsApp. Hermes is probably the best harness out there.
Complete global data-flow tracking down to external method bodies, zero brute-force prompt stuffing, and a cloud bill of ~$2.
Smart scaffolding > massive compute budgets.
Lesson learned: Don't scan Chromium again. It's too good. 😂
~21h grinding a SAST run on Chromium's net/ alongside its full dependency graph.
283M total tokens (~163M DeepSeek + ~120M local LLM). The structural orchestration framework built for deep context caching hit a 97.8% cloud cache-hit rate (only ~3.5M cache misses).
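As a quick sanity check, the token figures above can be reproduced with a few lines of arithmetic. This sketch assumes the ~3.5M cache misses are counted against the ~163M cloud tokens (the post does not spell out the denominator), and all inputs are the rounded values quoted in the post.

```python
# Figures quoted in the post (all approximate, in tokens).
CLOUD_TOKENS = 163_000_000  # ~163M sent to DeepSeek
LOCAL_TOKENS = 120_000_000  # ~120M handled by the local LLM
CACHE_MISSES = 3_500_000    # ~3.5M cloud tokens that missed the cache

def cache_hit_rate(total: int, misses: int) -> float:
    """Fraction of tokens served from cache."""
    return (total - misses) / total

total_tokens = CLOUD_TOKENS + LOCAL_TOKENS
# ASSUMPTION: misses are measured against cloud tokens only.
hit_rate = cache_hit_rate(CLOUD_TOKENS, CACHE_MISSES)

print(f"total tokens: {total_tokens / 1e6:.0f}M")
print(f"cloud cache-hit rate: {hit_rate:.2%}")
```

With these rounded inputs the total comes to 283M and the hit rate lands just under 97.9%, in line with the ~97.8% reported.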
Called it a discount. Turns out it was the price.
#DeepSeek just made the 75% V4-Pro cut permanent.
The token economics I wrote about 5 days ago aren't a promo window anymore – they're the floor. Resharing: https://t.co/ozuUytMPSI
33 Million Tokens for $0.25
Just ran a full SAST scan against 1M+ lines of code for the price of a gumball.
The secret? Hybrid Architecture + Context Caching.
Master: DeepSeek V4-Flash (Orchestrator) Worker: llama.cpp (Local)
Full deep dive on LinkedIn: https://t.co/nQzJq0THP7
If your framework supports MTP, turn it on. It’s an uncompromised velocity multiplier for repository-wide code reviews and massive doc analysis.
Full writeup: https://t.co/1Tx2M4DnMN
Shoutout to @ggerganov & team for https://t.co/kvGqXzAlJD
Multi-Token Prediction (MTP) is a rare "free lunch" in LLM inference.
Just finished benchmarking Qwen 3.6 27B on a single RTX 5090 using llama.cpp. At extreme context scales, MTP roughly DOUBLES generation throughput with 0 quality loss.
The raw telemetry 👇 🧵
Is it lossy? No. At temp=0, MTP holds a strict veto layer. The core model brain validates every parallel guess before it hits your screen.
You get cosmetic word changes from FP near-ties, but logical, code, & math accuracy remain 100% intact.
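The "veto" behaviour can be illustrated with a toy draft-and-verify loop. This is a minimal sketch of greedy speculative verification in general, not llama.cpp's MTP implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the cheap draft head, and tokens are plain integers.

```python
from typing import List, Tuple

Token = int

def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the main model's greedy (temp=0) choice."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical draft head: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50

def greedy_decode(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt: List[Token], n: int, k: int = 3) -> Tuple[List[Token], int]:
    """Draft k tokens, then let the target accept or veto them in order."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n:
        # 1) Cheap guesses for the next k positions.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) One verification pass. A real engine scores all k positions in a
        #    single batched forward pass; here it is simulated sequentially.
        verify_passes += 1
        for tok in draft:
            expected = target_next(out)
            out.append(expected)  # only the target's own choice is ever emitted
            if tok != expected:
                break  # veto: discard the rest of the draft
    return out[len(prompt):][:n], verify_passes

baseline = greedy_decode([1, 2, 3], 20)
spec, passes = speculative_decode([1, 2, 3], 20)
print("identical output:", spec == baseline)
print("verification passes:", passes, "vs", len(baseline), "baseline steps")
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding exactly in this toy. The speedup comes from needing fewer target passes than tokens. On real hardware, batched verification can differ from sequential decoding by tiny floating-point amounts, which is where the near-tie word changes mentioned above come from.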