I found out the other day that any compression tool can be contorted into doing language modeling. Turns out gzip can generate text that somewhat *resembles* Shakespeare. Short write-up linked below.
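A rough sketch of the idea, not necessarily how the linked write-up does it: a continuation that compresses well together with the corpus is one the compressor finds "predictable", so compressed size can stand in for a negative log-likelihood. The function names here are made up for illustration, and `zlib` is used because it exposes the same DEFLATE algorithm that gzip uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (level 9)."""
    return len(zlib.compress(data, 9))


def next_char_scores(context: str, alphabet: str) -> dict[str, int]:
    """Score each candidate character by the compressed size of context + candidate.

    Smaller means the compressor found the candidate more predictable.
    """
    base = context.encode("utf-8")
    return {ch: compressed_size(base + ch.encode("utf-8")) for ch in alphabet}


def generate(corpus: str, prompt: str, alphabet: str, n_chars: int) -> str:
    """Greedy generation: repeatedly append the character that compresses best.

    Compressed sizes are whole bytes, so single-character candidates often tie;
    min() then returns the earliest tied character in `alphabet`. Scoring longer
    candidate continuations, or sampling instead of taking the minimum, are
    possible refinements that this sketch leaves out.
    """
    out = prompt
    for _ in range(n_chars):
        scores = next_char_scores(corpus + out, alphabet)
        out += min(scores, key=scores.get)
    return out
```

Calling `generate(shakespeare_text, "To be", alphabet, 200)` with a real corpus is slow (the corpus is recompressed for every candidate) and the output is crude, which fits the "somewhat resembles" caveat.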
@josesaezmerino @ElevenYellow Wow, this reminds me of my TreeHacks project! Did the same thing, but we built a physical camera that printed the image on printer paper.
@StefanoErmon Can I interview for the MLE position? Doing my master's thesis on optimizing dLLM inference with novel KVCache approximation methods, so I'm pretty familiar with the area. Currently interviewing at a bunch of places and will probably wrap up my job search in the next 2-3 weeks.
Diffusion LLMs are becoming very competitive architectures. But recently, there's also been a lot of progress in flow-based LLMs, which are conceptually similar. Both learn to transport samples from a noise distribution to a data distribution.
Image generation used to be dominated by diffusion models but the leading models have since shifted to flow matching, largely because flow produces straighter trajectories that are easier to traverse in fewer steps without degrading quality.
Categorical data (language) is certainly harder than continuous data (image latents) for flow. It'll be interesting to see whether language ends up following the same trajectory as images (pun intended).
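A toy illustration of the "straighter trajectories" point, under loud assumptions: it uses the linear interpolation path common in flow matching, and it plugs in the ideal velocity for a single noise/data pair rather than a learned model. A trained model's velocity field is only approximately straight, so real samplers still need more than one step.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target along that path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_integrate(x0, velocity_fn, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# With a perfectly straight (constant-velocity) path, one Euler step already
# lands on the data point; extra steps add nothing. Curved paths are where the
# step count starts to matter.
x0, x1 = [0.5, -1.0], [2.0, 3.0]
v = velocity_target(x0, x1)
one_step = euler_integrate(x0, lambda x, t: v, 1)
ten_steps = euler_integrate(x0, lambda x, t: v, 10)
```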
Mercury 2 is live 🚀🚀
The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs.
Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built.
We’re just getting started on what diffusion can do for language.
We just brought flow maps to language modeling for one-step sequence generation 💥
Discrete diffusion is not necessary -- continuous flows over one-hot encodings achieve SoTA performance and ≥8.3× faster generation 🔥
We believe this is a major step forward for discrete generative modeling and language modeling alike. 🚀
Full thread from first author @chandavidlee: https://t.co/7HIBNbQdFO
LLaDA 2.1 was released, a 100B parameter diffusion language model with self-correction capabilities. They are able to fix previous tokens by adopting a mixture of masking/state-absorption and uniform diffusion, similar to GIDD.
In a previous post, I mentioned that Google Gemini and Inception Labs’ Mercury might have done something similar. A few people in the comments suggested that they use masking + re-masking instead (so masking without the state-absorption property).
I wonder how these two approaches compare. They both allow for self-correction and (thus) allow for more progress per step in the diffusion process (by being more robust to taking larger steps through the diffusion process).
Masking + re-masking might have some benefits like a simplified training objective, stronger inductive bias (which is arguably a good thing), and easier use with KVCache approximation (due to fewer tokens changing per step). My only question is: how does the departure from state-absorption change things?
The simplified training objective from masking (which reduces to a weighted MLM training objective) comes from this state-absorption property. But does re-masking actually change this?
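For reference, this is the weighted-MLM form I have in mind, written as in the LLaDA-style formulation with a linear schedule (each token of a length-$L$ sequence is masked independently with probability $t$). The notation is my own shorthand, so treat it as a sketch rather than any specific paper's exact statement:

```latex
\mathcal{L}(\theta) =
  -\,\mathbb{E}_{t \sim \mathcal{U}(0,1),\; x_0,\; x_t}
  \left[
    \frac{1}{t} \sum_{i=1}^{L}
    \mathbf{1}\!\left[x_t^{i} = \texttt{[MASK]}\right]
    \log p_\theta\!\left(x_0^{i} \mid x_t\right)
  \right]
```

The indicator is where absorption shows up: only masked positions contribute, because an unmasked token in $x_t$ is assumed to already equal its clean value in $x_0$.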
The state-absorption property is just that each token undergoes one transition only ([MASK] -> predicted token, and never changes). Re-masking a token, of course, causes it to go through multiple transitions. But does it really? Re-masking seems like you are just “jumping” to a more likely trajectory to account for the accumulation of errors. So instead of causing multiple state transitions, it could be just viewed as jumping to a better trajectory where that “poor” transition wasn’t made.
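To make the "how many transitions?" question concrete, here is a toy sampler that counts state changes per position with and without re-masking. Everything in it is hypothetical: `toy_denoiser` is a stand-in that cheats by looking at the target, and nothing here is meant to describe how LLaDA 2.1, Mercury, or Gemini actually decode.

```python
import random

MASK = "[MASK]"


def toy_denoiser(seq, target, rng, error_rate):
    """Hypothetical stand-in for a trained model: one (token, confidence) per position.

    Masked positions get a proposal that is wrong with probability `error_rate`.
    Committed positions are re-scored: high confidence if correct, low if not
    (a toy assumption that the model can recognize its own earlier mistakes).
    """
    out = []
    for cur, true_tok in zip(seq, target):
        if cur == MASK:
            wrong = rng.random() < error_rate
            out.append(("<err>" if wrong else true_tok, rng.random()))
        else:
            out.append((cur, 0.9 if cur == true_tok else 0.1))
    return out


def sample(target, max_steps, per_step, remask_threshold=None, seed=0, error_rate=0.3):
    """Iterative unmasking; if `remask_threshold` is set, doubted tokens return to [MASK].

    Returns the final sequence and the number of state changes at each position.
    """
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    transitions = [0] * len(target)
    for _ in range(max_steps):
        scores = toy_denoiser(seq, target, rng, error_rate)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        doubted = []
        if remask_threshold is not None:
            doubted = [i for i, tok in enumerate(seq)
                       if tok != MASK and scores[i][1] < remask_threshold]
        if not masked and not doubted:
            break
        # Commit the most confident proposals among currently masked positions.
        for i in sorted(masked, key=lambda j: scores[j][1], reverse=True)[:per_step]:
            seq[i] = scores[i][0]
            transitions[i] += 1
        # Re-mask committed tokens the model now doubts.
        for i in doubted:
            seq[i] = MASK
            transitions[i] += 1
    return seq, transitions


target = "the quick brown fox jumps over lazy dogs".split()
absorbing_seq, absorbing_counts = sample(target, max_steps=200, per_step=2)
remask_seq, remask_counts = sample(target, max_steps=200, per_step=2, remask_threshold=0.5)
```

Without re-masking every position changes state exactly once, errors included. With re-masking a position can change three, five, or more times, which is the literal departure from absorption; the "jumping to a better trajectory" reading amounts to saying that only the final [MASK] -> token transition is the one that counts.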
Will be interesting to see more formalization of this and how it compares to GIDD at scale.
What if an LLM could EDIT its own tokens in real-time, not just generate them? 🤯
Introducing LLaDA2.1 — a diffusion model that breaks from autoregressive dominance. It drafts fast, then fixes its own mistakes on the fly with Token-to-Token editing.
The result? 892 tokens/sec on a 100B model. 🔥
⚡ 892 TPS on HumanEval+ (coding)
⚡ 801 TPS on BigCodeBench
🧠 Real-time self-correction via T2T editing
✅ @lmsysorg SGLang Day 0 support — production-ready now
A "non-consensus" architecture now challenging the mainstream. Open-sourced TODAY. 👇
#LLaDA #TokenEditing #OpenSource #LLM #dLLM
Was doing a deeper literature review and found one of my new favorite paper titles ever:
“BERT has a Mouth, and It Must Speak”
Was one of the earliest papers to do something akin to state-absorption diffusion language modeling.