Where does the data flywheel⚙️♻️ of LLM service providers come from?
🚨Our latest paper shows that it could come from your mouse🖱️ and eyes👀!
With Jeffrey Gomez, @mehulpatwari_, Aryan Sajith, @HamedZamani [1/N]🧵
Variable-length masked diffusion models (FlexMDM and friends) generate by inserting mask tokens into any gap and unmasking them. But the insertion/unmasking schedule is fixed and data-independent.
So the model has to learn to produce every sequence in every possible order. For structured data that's a huge waste of capacity.
How do you learn data-dependent insertion and unmasking orders without breaking tractable training? We propose LoFlexMDM, which does exactly that. 🧵👇
This is joint work with @brozonoyer, Tahira Naseem, Gaurav Pandey, @RamonAstudill12, and @andrewmccallum.
Happy to chat more about the paper at ICML 2026 in Seoul.
🗓️ Wed, Jul 8, 2026 • 5:00 PM – 6:45 PM KST | 📍 HALL A #2805
But why Kumaraswamy CDFs?
With a shared `a`, the hazard simplifies so both events share the same shape function and only b_ins, b_unmask set the rate. Under the time-change τ = -log(1 - t^a), the whole thing becomes an exponential race between the per-position rates b_ins and b_unmask, with the precedence constraint that insertion fires before unmasking.
That buys you closed-form per-position likelihoods and parallel inverse-CDF sampling of event times. No numerical integration.
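To make the inverse-CDF sampling concrete, here is a minimal NumPy sketch, not the paper's implementation. The function names are hypothetical, and the way precedence is enforced (the unmasking clock only starts once insertion has fired, in τ-time) is an assumption about how the race is set up.

```python
import numpy as np

def kumaraswamy_cdf(t, a, b):
    """Kumaraswamy CDF F(t) = 1 - (1 - t^a)^b on [0, 1]."""
    return 1.0 - (1.0 - t**a) ** b

def tau_to_t(tau, a):
    """Invert the time-change tau = -log(1 - t^a)."""
    return (1.0 - np.exp(-tau)) ** (1.0 / a)

def sample_event_times(b_ins, b_unmask, a, rng):
    """Sample (T_ins, T_unmask) for every position in parallel.

    ASSUMPTION: in tau-time, insertion is Exp(b_ins) and unmasking is an
    Exp(b_unmask) wait that starts after insertion fires.
    """
    b_ins = np.asarray(b_ins, dtype=float)
    b_unmask = np.asarray(b_unmask, dtype=float)
    # rng.exponential takes the scale (1 / rate), broadcast per position.
    tau_ins = rng.exponential(1.0 / b_ins)
    tau_unmask = tau_ins + rng.exponential(1.0 / b_unmask)
    # Map both event times back from tau-time to t in [0, 1].
    return tau_to_t(tau_ins, a), tau_to_t(tau_unmask, a)

rng = np.random.default_rng(0)
t_ins, t_unmask = sample_event_times([0.5, 2.0, 8.0], [1.0, 1.0, 4.0], a=2.0, rng=rng)
```

Everything is elementwise, so all positions are sampled in one shot with no numerical integration.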
📈 On BracketSAFE molecule strings, LoFlexMDM significantly improves generation quality over FlexMDM for both de novo and fragment-constrained generation. The cost is a small dip in diversity, which is expected since a sharper order means less randomness.
Furthermore, the learned order is interpretable: it commits structure first (ring closures, fragment separators), then fills chemistry (atoms, bonds, branches), and decides *where* fragments attach before *which* fragments attach.
But how do we keep the training tractable?
We parameterize each position's insertion and unmasking CDFs as Kumaraswamy CDFs, F(t) = 1 - (1 - t^a)^b. Fix the shape `a` to a shared constant, and let the auxiliary network predict the per-token rate parameters b_ins(x), b_unmask(x).
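For anyone who wants the algebra, here is a short derivation sketch using only the CDF above. The symbols f, h, g, and τ(t) are notation introduced here for the density, the hazard, the shared shape function, and the rescaled time; they are not necessarily the paper's notation.

```latex
\begin{aligned}
F(t) &= 1 - (1 - t^{a})^{b}, \qquad
f(t) = F'(t) = a\,b\,t^{a-1}\,(1 - t^{a})^{b-1}, \\
h(t) &= \frac{f(t)}{1 - F(t)}
      = b \cdot \underbrace{\frac{a\,t^{a-1}}{1 - t^{a}}}_{g(t)}, \\
\int_{0}^{t} h(s)\,ds &= -\,b \log\!\left(1 - t^{a}\right) = b\,\tau(t),
\qquad \tau(t) = -\log\!\left(1 - t^{a}\right).
\end{aligned}
```

So with `a` shared, g(t) is the same for every position and event, b only scales the rate, and the survival function in τ-time is exp(-bτ), i.e. an exponential with rate b.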
The trick: separate the order from the content and learn the order purely through per-position target *hazard rates* produced by an auxiliary network. The generator is trained to match the target rates, and therefore the order, without changing the terminal distribution.
⏱️ Think of generation as a per-position CTMC with two events: insertion (∅ → mask) at time T_ins, then unmasking (mask → token) at time T_unmask. The unmasking times define the generation order.
In FlexMDM these event rates are constant, so probability gets spread across tons of suboptimal orders. We wanted those rates to be learned and data-dependent without breaking tractable training.
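A toy sketch of how per-position rates shape the order, not the paper's code. It assumes every position is already inserted as a mask (so only the unmasking race matters) and that unmasking times follow a Kumaraswamy CDF F(t) = 1 - (1 - t^a)^b with a shared `a`; all function names are hypothetical.

```python
import numpy as np

def generation_order(t_unmask):
    """Positions sorted by unmasking time, earliest first."""
    return np.argsort(t_unmask, kind="stable")

def sample_unmask_times(b_unmask, a, rng, n_samples):
    """Inverse-CDF samples of T_unmask, shape (n_samples, n_positions)."""
    b = np.asarray(b_unmask, dtype=float)
    u = rng.random((n_samples, b.size))
    # Solve u = 1 - (1 - t^a)^b for t.
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def first_unmasked_frequency(b_unmask, a=2.0, n_samples=20000, seed=0):
    """Monte Carlo estimate of how often each position unmasks first."""
    rng = np.random.default_rng(seed)
    t = sample_unmask_times(b_unmask, a, rng, n_samples)
    first = np.argmin(t, axis=1)
    return np.bincount(first, minlength=t.shape[1]) / n_samples

print(generation_order(np.array([0.7, 0.2, 0.5])))
print(first_unmasked_frequency([1.0, 1.0, 1.0]))  # equal rates
print(first_unmasked_frequency([4.0, 1.0, 1.0]))  # one position favored
```

With equal rates every position is about equally likely to go first, so probability is spread evenly across orders. Raising one position's rate concentrates mass on orders where it goes early; for two positions with rates 4 and 1, the exponential race gives the first one a 4/(4+1) = 80% chance of unmasking first.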
Do you find autoregressive language models like @AnthropicAI's @claudeai Mythos too slow?
Diffusion models are catching up fast! But denoising alone is not enough to realize the promise of fast text generation. We (and the models😉) need to think ahead! Check out our preprint👇
What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce:
Learned Relay Representations for Forward-Thinking Discrete Diffusion Models
I'm so happy to share that I’ll be joining @UofT as an Assistant Professor of Statistical Sciences and Computer Science, with an appointment at the @VectorInst, in 2026!
I'm recruiting postdocs and PhD students: https://t.co/FWBh0BiDqP!
Please help me spread the word!
🧵(1/5)
ICLR has placed OpenReview in a difficult position, so I want to offer a few words about the OpenReview team working behind the scenes.
OpenReview has long been operated at UMass Amherst as a non-profit organization founded by Andrew McCallum. Each year, Andrew must raise more than $2 million to support a 20-person team that provides essential infrastructure for most major conferences.
I once asked Andrew what might have been a naïve question: whether he had considered developing a business model for OpenReview, given its prominence and the seemingly obvious opportunities. He pushed back, explaining that everything he has done for OpenReview is driven by a commitment to serve and strengthen the academic community. He is willing to devote significant personal effort to ensure the platform remains freely accessible to all.
We should not blame such a brilliant and dedicated team for an accidental issue. Otherwise, fewer people would be willing to shoulder this kind of responsibility in the future.
Deep respect to the OpenReview team! I’m grateful for their work and happy to support in any way!
Excited to present our NeurIPS paper "Learning Representations for Hierarchies with Minimal Support" at the morning poster session on December 12! Stop by poster #3500 in the East Building!
@su_lin_liu Great work! The formulation makes perfect sense. In the case of text generation, how would you compare the inference time cost of your approach, which has two forward passes per step, to the vanilla mask denoiser?