Discrete Diffusion Reading Group

2 days ago

📢 Missed the talk? Check out the recording on YouTube: https://t.co/G6DC8iBiwQ

4 days ago

📢 June 1 (Mon): ELF: Embedded Language Flows 🤔Unlike their image-domain counterparts, today’s leading diffusion language models (DLMs) primarily operate over discrete tokens. 💡The authors show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. They propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. 🔧This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). 📈Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs. This Monday, Keya Hu (@HuLillian39250) and Linlu Qiu (@linluqiu) will present their jointly led paper ELF.

diffusion_llms's tweet photo. 📢 June 1 (Mon): ELF: Embedded Language Flows

🤔Unlike their image-domain counterparts, today’s leading diffusion language models (DLMs) primarily operate over discrete tokens.

💡The authors show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. They propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network.

🔧This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG).

📈Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.

This Monday, Keya Hu (@HuLillian39250) and Linlu Qiu (@linluqiu) will present their jointly led paper ELF.

3

82

15

46

20K

0

6

0

1

1K

diffusion_llms retweeted

4 days ago

Join our reading group next Monday! Topic: ELF: Embedded Language Flows Presenters: Keya Hu (@HuLillian39250), Linlu Qiu (@linluqiu)

0

39

6

10

5K

4 days ago

@Lyy_iiis @jacobandreas Meeting link: https://t.co/1mo8FsGmGG

0

1

0

1

589

4 days ago

📢 June 1 (Mon): ELF: Embedded Language Flows 🤔Unlike their image-domain counterparts, today’s leading diffusion language models (DLMs) primarily operate over discrete tokens. 💡The authors show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. They propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. 🔧This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). 📈Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs. This Monday, Keya Hu (@HuLillian39250) and Linlu Qiu (@linluqiu) will present their jointly led paper ELF.

3

82

15

46

20K

4 days ago

Collaborators: Yiyang Lu (@Lyy_iiis), Hanhong Zhao (https://t.co/RVF0H3VcOw), Tianhong Li (https://t.co/R96TYajSe3), Yoon Kim (https://t.co/m5xnoVV3hs), Jacob Andreas (@jacobandreas), Kaiming He (https://t.co/t62JoAhU8s) Paper link: https://t.co/u3oX5TbxBs

1

2

1

906

9 days ago

📢Missed the talk? We just uploaded the recording! Make sure to check it out https://t.co/jX6qBYPiYn

12 days ago

📢 May 25 (Mon): Language Modeling with Spherical Geometry 📷 💡Join us to hear Justin (@jdeschena) and Jannis (@JChemseddine) present their recent work on (Hyper)spherical language modeling! ⚖️Discrete Diffusion and Continuous Flow Language Models (DLMs / FLMs) have emerged as interesting alternatives to autoregressive models. Yet they face fundamental tensions: discrete diffusion samples from a factorized distribution that is strictly less expressive than AR. FLMs avoid factorized sampling but typically add Gaussian noise on one-hot vectors or embeddings. It is far from clear that this kind of noise is well suited to text generation. 🤔Both papers ask the same question: what if the natural geometry for language flows isn't Euclidean space or the probability simplex, but the sphere? The hint has been there for a while: prior work like CDCD (@sedielem et al.) already operates on normalized vectors, and empirically, the cosine distance outperforms the Euclidean one for comparing word embeddings (think of word2vec, GloVe, or retrieval systems). 🧭By lifting tokens onto Sᵈ⁻¹, the authors develop tools for spherical language modeling via SLERP and vMF paths. The vMF path has the added benefit of a closed-form score, enabling principled predictor–corrector samplers on the sphere. 📈 Working with the sphere leads to concrete performance improvements: on code generation with TinyGSM, prior FLMs reach roughly 0% accuracy, while flows on the hypersphere reach 12–18% 🚀. This still lags behind the AR and discrete diffusion baselines, but it strongly suggests that spherical embedding geometry is a natural noise model for tokens. At matched NFE, a properly tuned PC sampler with vMF paths clearly improves the accuracy on Sudoku. And as a bonus, training with rotations avoids materializing one-hot vectors, making it cheaper than standard FLM training⚡️. 🔗 Language Modeling with Hyperspherical Flows: https://t.co/FERfcYZKfZ 🔗 Spherical Flows for Sampling Categorical Data: https://t.co/tKc5EthyIv 🤝 Joint work with Caglar Gulcehre (@caglarml), Gregor Kornhardt (@gregorkornhardt), and Gabriele Steidl (https://t.co/VKLrbhpFVX)

diffusion_llms's tweet photo. 📢 May 25 (Mon): Language Modeling with Spherical Geometry 📷

💡Join us to hear Justin (@jdeschena) and Jannis (@JChemseddine) present their recent work on (Hyper)spherical language modeling!

⚖️Discrete Diffusion and Continuous Flow Language Models (DLMs / FLMs) have emerged as interesting alternatives to autoregressive models. Yet they face fundamental tensions: discrete diffusion samples from a factorized distribution that is strictly less expressive than AR. FLMs avoid factorized sampling but typically add Gaussian noise on one-hot vectors or embeddings. It is far from clear that this kind of noise is well suited to text generation.

🤔Both papers ask the same question: what if the natural geometry for language flows isn't Euclidean space or the probability simplex, but the sphere? The hint has been there for a while: prior work like CDCD (@sedielem et al.) already operates on normalized vectors, and empirically, the cosine distance outperforms the Euclidean one for comparing word embeddings (think of word2vec, GloVe, or retrieval systems).

🧭By lifting tokens onto Sᵈ⁻¹, the authors develop tools for spherical language modeling via SLERP and vMF paths. The vMF path has the added benefit of a closed-form score, enabling principled predictor–corrector samplers on the sphere.

📈 Working with the sphere leads to concrete performance improvements: on code generation with TinyGSM, prior FLMs reach roughly 0% accuracy, while flows on the hypersphere reach 12–18% 🚀. This still lags behind the AR and discrete diffusion baselines, but it strongly suggests that spherical embedding geometry is a natural noise model for tokens. At matched NFE, a properly tuned PC sampler with vMF paths clearly improves the accuracy on Sudoku. And as a bonus, training with rotations avoids materializing one-hot vectors, making it cheaper than standard FLM training⚡️.

🔗 Language Modeling with Hyperspherical Flows: https://t.co/FERfcYZKfZ
🔗 Spherical Flows for Sampling Categorical Data: https://t.co/tKc5EthyIv

🤝 Joint work with Caglar Gulcehre (@caglarml), Gregor Kornhardt (@gregorkornhardt), and Gabriele Steidl (https://t.co/VKLrbhpFVX)

1

79

23

40

17K

0

12

0

7

2K

diffusion_llms retweeted

12 days ago

Join our reading group next Monday! This is especially interesting, as my recent work RePlaid also relies on spherical geometry (length-normalized embeddings)! Topic: Language Modeling with Spherical Geometry Presenters: Justin Deschenaux (@jdeschena), Jannis Chemseddine (@JChemseddine)

0

25

6

8

3K

diffusion_llms retweeted

Sander Dieleman

@sedielem

15 days ago

Did I mention continuous DLMs are back? I think I might have mentioned it before🤔 This one revisits Plaid (https://t.co/aRmOAbZ3pB), a continuous DLM trained with likelihood loss, and rigorously shows how it holds up against a recent discrete method. Pretty well, looks like!

5

112

20

52

11K

12 days ago

@jdeschena @JChemseddine https://t.co/1mo8FsFOR8

0

1

0

352

Justin Deschenaux @jdeschena

12 days ago

📢 May 25 (Mon): Language Modeling with Spherical Geometry 📷 💡Join us to hear Justin (@jdeschena) and Jannis (@JChemseddine) present their recent work on (Hyper)spherical language modeling! ⚖️Discrete Diffusion and Continuous Flow Language Models (DLMs / FLMs) have emerged as interesting alternatives to autoregressive models. Yet they face fundamental tensions: discrete diffusion samples from a factorized distribution that is strictly less expressive than AR. FLMs avoid factorized sampling but typically add Gaussian noise on one-hot vectors or embeddings. It is far from clear that this kind of noise is well suited to text generation. 🤔Both papers ask the same question: what if the natural geometry for language flows isn't Euclidean space or the probability simplex, but the sphere? The hint has been there for a while: prior work like CDCD (@sedielem et al.) already operates on normalized vectors, and empirically, the cosine distance outperforms the Euclidean one for comparing word embeddings (think of word2vec, GloVe, or retrieval systems). 🧭By lifting tokens onto Sᵈ⁻¹, the authors develop tools for spherical language modeling via SLERP and vMF paths. The vMF path has the added benefit of a closed-form score, enabling principled predictor–corrector samplers on the sphere. 📈 Working with the sphere leads to concrete performance improvements: on code generation with TinyGSM, prior FLMs reach roughly 0% accuracy, while flows on the hypersphere reach 12–18% 🚀. This still lags behind the AR and discrete diffusion baselines, but it strongly suggests that spherical embedding geometry is a natural noise model for tokens. At matched NFE, a properly tuned PC sampler with vMF paths clearly improves the accuracy on Sudoku. And as a bonus, training with rotations avoids materializing one-hot vectors, making it cheaper than standard FLM training⚡️. 🔗 Language Modeling with Hyperspherical Flows: https://t.co/FERfcYZKfZ 🔗 Spherical Flows for Sampling Categorical Data: https://t.co/tKc5EthyIv 🤝 Joint work with Caglar Gulcehre (@caglarml), Gregor Kornhardt (@gregorkornhardt), and Gabriele Steidl (https://t.co/VKLrbhpFVX)

1

79

23

40

17K

diffusion_llms retweeted

Subham Sahoo

@ssahoo_

15 days ago

In short: @sedielem and @__ishaan continuous methods for language modeling is competitive with discrete diffusion methods outside the compute-optimal regime. This matters because frontier LLMs often overtrain smaller models rather than train larger compute-optimal ones, since smaller models are cheaper and faster to serve.

3

52

9

34

8K

diffusion_llms retweeted

21 days ago

🔥 New paper: Language Modeling with Hyperspherical Flows Recent flow language models (FLMs) all use Gaussian noise. Makes sense for images, but not necessarily for text 🫠 We propose to add noise by rotating embeddings on 𝕊^{d−1} instead 🌐 w/ @caglarml (1/9)

$jdeschena's tweet photo. 🔥 New paper: Language Modeling with Hyperspherical Flows Recent flow language models (FLMs) all use Gaussian noise. Makes sense for images, but not necessarily for text 🫠 We propose to add noise by rotating embeddings on 𝕊^{d−1} instead 🌐 w/ @caglarml (1/9) https://t.co/OmQqImyNw9$

13

424

82

366

61K

diffusion_llms retweeted

16 days ago

📢Excited to share our new paper: Continuous Diffusion Scales Competitively with Discrete Diffusion for Language We introduce RePlaid 🌊, a continuous diffusion language model (DLM) with 🏅Discrete likelihood bound 🏅Scaling laws competitive with SOTA discrete DLMs How? Dive in👇[🧵1/12] Paper: https://t.co/Q00fmvtHmy Work done with my amazing collaborators: @WeiGuo01 @ShuibaiZ69721 @ssahoo_ @YongxinChen1 @ArashVahdat @MardaniMorteza @jwthickstun

zhihanyang_'s tweet photo. 📢Excited to share our new paper:

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

We introduce RePlaid 🌊, a continuous diffusion language model (DLM) with
🏅Discrete likelihood bound
🏅Scaling laws competitive with SOTA discrete DLMs

How? Dive in👇[🧵1/12]

Paper: https://t.co/Q00fmvtHmy

Work done with my amazing collaborators: @WeiGuo01 @ShuibaiZ69721 @ssahoo_ @YongxinChen1 @ArashVahdat @MardaniMorteza @jwthickstun

5

201

45

127

61K

16 days ago

📢 Missed the talk? Check out the recording on YouTube: https://t.co/em0tEmCZMu

19 days ago

📢 May 18 (Mon): IDLM: Inverse-distilled Diffusion Language Models 🤔Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. 💡To address this, the authors extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. However, this extension introduces both theoretical and practical challenges. 🔧To overcome these challenges, the authors first provide a theoretical result demonstrating that their inverse formulation admits a unique solution, thereby ensuring valid optimization. They then introduce gradient-stable relaxations to support effective training. 📊As a result, experiments on multiple DLMs show that their method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4×—64×, while preserving the teacher model’s entropy and generative perplexity. This Monday, David Li (https://t.co/8geY1UUQ72) and Nikita Gushchin (https://t.co/L4nc4Gz7ZC) will present their jointly led paper, which was recently accepted at ICML 2026. Collaborators of this work include: Dmitry Abulkhanov (@dabulkhanov_), Eric Moulines (https://t.co/tK9ctSzw0r), Ivan Oseledets (@oseledetsivan), Maxim Panov (@maxim_panov), Alexander Korotin (https://t.co/RTSk7HAc3v) Paper link: https://t.co/L8MniD5GBu

diffusion_llms's tweet photo. 📢 May 18 (Mon): IDLM: Inverse-distilled Diffusion Language Models

🤔Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use.

💡To address this, the authors extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. However, this extension introduces both theoretical and practical challenges.

🔧To overcome these challenges, the authors first provide a theoretical result demonstrating that their inverse formulation admits a unique solution, thereby ensuring valid optimization. They then introduce gradient-stable relaxations to support effective training.

📊As a result, experiments on multiple DLMs show that their method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4×—64×, while preserving the teacher model’s entropy and generative perplexity.

This Monday, David Li (https://t.co/8geY1UUQ72) and Nikita Gushchin (https://t.co/L4nc4Gz7ZC) will present their jointly led paper, which was recently accepted at ICML 2026.

Collaborators of this work include: Dmitry Abulkhanov (@dabulkhanov_), Eric Moulines (https://t.co/tK9ctSzw0r), Ivan Oseledets (@oseledetsivan), Maxim Panov (@maxim_panov), Alexander Korotin (https://t.co/RTSk7HAc3v)

Paper link: https://t.co/L8MniD5GBu

2

36

13

11

10K

0

1

0

1K

19 days ago

🎥 Missed the live session? Make sure to check the recording on YouTube: https://t.co/4NKqfwyABT

Justin Deschenaux @jdeschena

25 days ago

📢 May 11 (Mon): Unifying Masked Diffusion Models with Various Generation Orders and Beyond 🤔AR generates left-to-right; masked diffusion generates in any order; and block diffusion generates block-wise left-to-right, with random order within each block. Can we unify all these frameworks and further learn the generation order jointly with token prediction? 💡The authors propose OeMDM, a unified masked diffusion framework that can express various generation orders, and LoMDM, which jointly learns the generation order and the diffusion model. 🔍Everything comes down to the scheduler: by making the forward and reverse schedulers maximally flexible, it becomes possible to describe all generation orders, even learnable generation orders, within the masked diffusion framework. 📈LoMDM achieves SOTA among discrete diffusion models across all benchmarks, and even outperforms block diffusion models, which strongly benefit from left-to-right bias! This Monday, Chunsan Hong (@ChunsanHong) will present his paper, which received Spotlight at ICML 2026. Collaborators of this work include: Sanghyun Lee, Jong Chul Ye (https://t.co/ul6KGZ82jG) Paper link: https://t.co/ClDvJ4FHRG

diffusion_llms's tweet photo. 📢 May 11 (Mon): Unifying Masked Diffusion Models with Various Generation Orders and Beyond

🤔AR generates left-to-right; masked diffusion generates in any order; and block diffusion generates block-wise left-to-right, with random order within each block. Can we unify all these frameworks and further learn the generation order jointly with token prediction?

💡The authors propose OeMDM, a unified masked diffusion framework that can express various generation orders, and LoMDM, which jointly learns the generation order and the diffusion model.

🔍Everything comes down to the scheduler: by making the forward and reverse schedulers maximally flexible, it becomes possible to describe all generation orders, even learnable generation orders, within the masked diffusion framework.

📈LoMDM achieves SOTA among discrete diffusion models across all benchmarks, and even outperforms block diffusion models, which strongly benefit from left-to-right bias!

This Monday, Chunsan Hong (@ChunsanHong) will present his paper, which received Spotlight at ICML 2026.

Collaborators of this work include: Sanghyun Lee, Jong Chul Ye (https://t.co/ul6KGZ82jG)

Paper link: https://t.co/ClDvJ4FHRG

2

23

5

6

8K

0

1

0

928

diffusion_llms retweeted

19 days ago

Join us on Monday to hear from @LiDavid2002 and @iNikitaGushchin about their work on distillation 🚀

0

20

6

2K

19 days ago

Meeting link: https://t.co/1mo8FsFOR8

0

1

0

2

262

19 days ago

📢 May 18 (Mon): IDLM: Inverse-distilled Diffusion Language Models 🤔Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. 💡To address this, the authors extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. However, this extension introduces both theoretical and practical challenges. 🔧To overcome these challenges, the authors first provide a theoretical result demonstrating that their inverse formulation admits a unique solution, thereby ensuring valid optimization. They then introduce gradient-stable relaxations to support effective training. 📊As a result, experiments on multiple DLMs show that their method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4×—64×, while preserving the teacher model’s entropy and generative perplexity. This Monday, David Li (https://t.co/8geY1UUQ72) and Nikita Gushchin (https://t.co/L4nc4Gz7ZC) will present their jointly led paper, which was recently accepted at ICML 2026. Collaborators of this work include: Dmitry Abulkhanov (@dabulkhanov_), Eric Moulines (https://t.co/tK9ctSzw0r), Ivan Oseledets (@oseledetsivan), Maxim Panov (@maxim_panov), Alexander Korotin (https://t.co/RTSk7HAc3v) Paper link: https://t.co/L8MniD5GBu

2

36

13

11

10K

diffusion_llms retweeted