nathan (in sf) @nathanrs - Twitter Profile

Pinned Tweet

about 20 hours ago

I found out the other day that any compression tool can be contorted to do language modeling. Turns out gzip can generate text that somewhat *resembles* Shakespeare. Short write-up linked below.

52

2K

153

1K

144K
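
The idea in the pinned tweet can be sketched in a few lines: a compressor assigns shorter output to text it finds predictable, so the candidate continuation that yields the smallest compressed size acts as the "most likely" next character. The sketch below is an illustration under stated assumptions, not the method from the linked write-up. It uses Python's `zlib` (DEFLATE, the algorithm gzip uses) rather than the gzip container, the function names `next_char_scores` and `generate` are hypothetical, and decoding is greedy with ties broken by alphabet order.

```python
import zlib

def next_char_scores(context: bytes, alphabet: bytes) -> dict[int, int]:
    """Compressed size of context + candidate, for every candidate byte."""
    return {c: len(zlib.compress(context + bytes([c]), 9)) for c in alphabet}

def generate(corpus: bytes, prompt: bytes, n: int, alphabet: bytes) -> bytes:
    """Greedily append the byte that compresses best after corpus + output so far."""
    out = prompt
    for _ in range(n):
        scores = next_char_scores(corpus + out, alphabet)
        # Assumption: ties (common, since sizes are whole bytes) go to the
        # first candidate in alphabet order.
        best = min(scores, key=scores.get)
        out += bytes([best])
    return out

corpus = b"to be or not to be that is the question " * 20
alphabet = b"abcdefghijklmnopqrstuvwxyz "
print(generate(corpus, b"to be", 10, alphabet))
```

Because compressed sizes are measured in whole bytes, many candidates tie at each step, so a practical version would likely need a finer-grained score or sampling among the tied candidates. DEFLATE also only looks back 32 KiB, so a longer corpus would not be fully used as context.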

nathan (in sf)

@nathanrs

about 13 hours ago

@secemp9 GzipPT goes so much harder though

1

8

0

209

nathan (in sf)

@nathanrs

about 18 hours ago

@noah_vandal Whoops you’re right 😅 should be fixed now

0

12

0

5K

nathan (in sf)

@nathanrs

about 19 hours ago

@krishmatta Did everything but train a model for work 😭

0

6

0

2K

nathanrs retweeted

Zed

@zeddotdev

about 2 months ago

We've shipped more than a thousand versions of Zed, but all of them began with zero. Today, that changes. https://t.co/AJ0crNOFhU

293

8K

863

763

653K

nathan (in sf)

@nathanrs

3 months ago

@texheavy Nice job @fedpoasts and gang!

1

6

0

176

nathan (in sf)

@nathanrs

3 months ago

@oussamazekri_ @theo_uscidda @Korba_Anna @CNRS @JulesSamaran @LucaEyring @ssahoo @ChenyuW64562111 @olivierhenaff @Pierrot_Clavier @WeiGuo01 @Jaeyeon_Kim_0 @YuchenZhu_ZYC @AlanNawzadAmin @RosieZ0512 @dvruette @jdeschena @zhihanyang_ @SchiffYair @Guanghan__Wang @mariannearr @vincentpaulinef @sansa19739319 @aaron_lou @AndrewC_ML @thjashin @ArnaudDoucet1 Some previous papers (like the original D3PM) tried semantic noising via embedding distance, which performed worse than masking and uniform. What are the specific additions that substantially improved the performance? And what were these other formulations missing? https://t.co/P0bAtApgfN

1

2

0

180

nathan (in sf)

@nathanrs

3 months ago

@0xSero Let's say 16 GB. Want to see how viable it is to run edit prediction locally (for most people).

0

89

nathan (in sf)

@nathanrs

3 months ago

@sdand Great looking site!

0

1

0

136

nathan (in sf)

@nathanrs

4 months ago

@josesaezmerino @ElevenYellow Wow, this reminds me of my TreeHacks project! Did the same thing, but we built a physical camera that printed the image on printer paper. https://t.co/FXKUvF2MIB

1

7

0

634

nathan (in sf)

@nathanrs

4 months ago

@haha_whatsgood @arnie_hacker @karpathy I named it tiny diffusion just in case @karpathy wanted to make nano diffusion!

0

71

nathan (in sf)

@nathanrs

4 months ago

@StefanoErmon Can I interview for the MLE position? Doing my master's thesis on optimizing dLLM inference with novel KVCache approximation methods, so I'm pretty familiar with the area. Currently interviewing at a bunch of places and will probably wrap up my job search in the next 2-3 weeks.

1

3

0

341

nathan (in sf)

@nathanrs

4 months ago

Diffusion LLMs are becoming very competitive architectures. But recently, there's also been a lot of progress in flow-based LLMs, which are conceptually similar. Both learn to transport samples from a noise distribution to a data distribution. Image generation used to be dominated by diffusion models but the leading models have since shifted to flow matching, largely because flow produces straighter trajectories that are easier to traverse in fewer steps without degrading quality. Categorical data (language) is certainly harder than continuous data (image latents) for flow. It'll be interesting to see whether language ends up following the same trajectory as images (pun intended).

Stefano Ermon

@StefanoErmon

4 months ago

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.

317

4K

578

2K

1M

4

133

8

41

14K
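
The claim in the flow-versus-diffusion post above, that straighter trajectories can be traversed in fewer steps, can be illustrated with a toy one-dimensional example. This is a sketch under assumptions, not a trained model: the velocities are the exact ones for a single known noise/data pair, and the cosine/sine interpolant is only a stand-in for a curved, diffusion-style path. The names `euler`, `straight_velocity`, and `curved_velocity` are hypothetical.

```python
import math

def euler(velocity, x0: float, steps: int) -> float:
    """Integrate dx/dt = velocity(t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += dt * velocity(i * dt)
    return x

def straight_velocity(x0: float, x1: float):
    # Path x_t = (1 - t) * x0 + t * x1 has the constant velocity x1 - x0.
    return lambda t: x1 - x0

def curved_velocity(x0: float, x1: float):
    # Path x_t = cos(pi*t/2) * x0 + sin(pi*t/2) * x1, whose velocity changes with t.
    h = math.pi / 2
    return lambda t: h * (-math.sin(h * t) * x0 + math.cos(h * t) * x1)

noise, data = 1.0, 3.0
for steps in (1, 4, 16):
    err_straight = abs(euler(straight_velocity(noise, data), noise, steps) - data)
    err_curved = abs(euler(curved_velocity(noise, data), noise, steps) - data)
    print(steps, err_straight, err_curved)
```

On the straight path a single Euler step already lands on the data point, because the velocity never changes. On the curved path the error shrinks only as the number of steps grows, which is the intuition behind few-step sampling with flows.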

nathanrs retweeted

Nicholas Boffi

@nmboffi

4 months ago

We just brought flow maps to language modeling for one-step sequence generation 💥 Discrete diffusion is not necessary -- continuous flows over one-hot encodings achieve SoTA performance and ≥8.3× faster generation 🔥 We believe this is a major step forward for discrete generative modeling and language modeling alike. 🚀 Full thread from first author @chandavidlee: https://t.co/7HIBNbQdFO

4

251

45

169

44K

nathan (in sf)

@nathanrs

4 months ago

LLaDA 2.1 was released, a 100B-parameter diffusion language model with self-correction capabilities. They are able to fix previous tokens by adopting a mixture of masking/state-absorption and uniform diffusion, similar to GIDD.
In a previous post, I mentioned that Google Gemini and Inception Labs' Mercury might have done something similar. A few people in the comments suggested that they use masking + re-masking instead (so masking without the state-absorption property).
I wonder how these two approaches compare. They both allow for self-correction and (thus) allow for more progress per step in the diffusion process (by being more robust to taking larger steps through the diffusion process). Masking + re-masking might have some benefits like a simplified training objective, stronger inductive bias (which is arguably a good thing), and easier use with KVCache approximation (due to fewer tokens changing per step).
My only question is: how does the departure from state-absorption change things?
The simplified training objective from masking (which reduces to a weighted MLM training objective) comes from this state-absorption property. But does re-masking actually change this? The state-absorption property is just that each token undergoes one transition only ([MASK] -> predicted token, and never changes). Re-masking a token, of course, causes it to go through multiple transitions.
But does it really? Re-masking seems like you are just “jumping” to a more likely trajectory to account for the accumulation of errors. So instead of causing multiple state transitions, it could be just viewed as jumping to a better trajectory where that “poor” transition wasn’t made.
Will be interesting to see more formalization of this and how it compares to GIDD at scale.

Ant Open Source

@ant_oss

4 months ago

What if an LLM could EDIT its own tokens in real-time, not just generate them? 🤯 Introducing LLaDA2.1 — a diffusion model that breaks from autoregressive dominance. It drafts fast, then fixes its own mistakes on the fly with Token-to-Token editing. The result? 892 tokens/sec on a 100B model. 🔥 ⚡ 892 TPS on HumanEval+ (coding) ⚡ 801 TPS on BigCodeBench 🧠 Real-time self-correction via T2T editing ✅ @lmsysorg SGLang Day 0 support — production-ready now A "non-consensus" architecture now challenging the mainstream. Open-sourced TODAY. 👇 #LLaDA #TokenEditing #OpenSource #LLM #dLLM

48

374

83

232

371K

0

34

3

17

3K
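
The state-absorption property discussed in the LLaDA 2.1 post above can be made concrete with a toy sketch. It shows the forward (noising) view, the mirror image of the "[MASK] -> predicted token" decoding step: each token gets its own threshold and stays masked once the noise level passes it, so every position changes state at most once. The `remask_lowest` helper is a hypothetical re-masking rule (re-mask the least confident decoded tokens) added only to show how re-masking adds extra transitions. It is not claimed to be what LLaDA 2.1, Gemini, or Mercury actually do.

```python
import random

MASK = "[MASK]"

def absorbing_trajectory(tokens, times, rng):
    """Forward masking: token i becomes MASK once t reaches its own threshold u_i."""
    thresholds = [rng.random() for _ in tokens]
    return [[MASK if t >= u else tok for tok, u in zip(tokens, thresholds)]
            for t in times]

def count_transitions(trajectory):
    """Number of state changes each position goes through along the trajectory."""
    counts = [0] * len(trajectory[0])
    for prev, cur in zip(trajectory, trajectory[1:]):
        for i, (a, b) in enumerate(zip(prev, cur)):
            counts[i] += a != b
    return counts

def remask_lowest(tokens, confidences, k):
    """Hypothetical rule: re-mask the k least confident already-decoded positions."""
    decoded = [i for i, tok in enumerate(tokens) if tok != MASK]
    worst = set(sorted(decoded, key=lambda i: confidences[i])[:k])
    return [MASK if i in worst else tok for i, tok in enumerate(tokens)]

rng = random.Random(0)
times = [0.0, 0.25, 0.5, 0.75, 1.0]
trajectory = absorbing_trajectory("to be or not to be".split(), times, rng)
print(count_transitions(trajectory))  # no position changes state more than once
print(remask_lowest(["to", MASK, "or", "knot"], [0.9, 0.0, 0.8, 0.2], k=1))
```

In this picture, a re-masked position goes token -> [MASK] -> token again during decoding, which is exactly the extra transition the post asks about.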

nathan (in sf)

@nathanrs

4 months ago

@Wayframe Looks great @michaelgold3n

0

1

0

267

nathan (in sf)

@nathanrs

5 months ago

Was doing a deeper literature review and found one of my new favorite paper titles ever:
“BERT has a Mouth, and It Must Speak”
Was one of the earliest papers to do something akin to state-absorption diffusion language modeling. https://t.co/HFnYqrXuVW

0

93

6

61

5K
