Greg Durrett @GregD_NLP - Twitter Profile

6 days ago

Announcing our COLM 2026 workshop: Scientific Understanding of Foundation Models:

we invite submissions on training dynamics, scaling laws, data and optimization, post-training, reward modeling, evaluation science, reliability, reproducibility, and theoretical understanding of foundation models.

We especially welcome rigorous empirical studies, theory-grounded work, negative results, reproductions, and papers that bridge theory and practice for contributing to this goal.

📍 In person at COLM 2026, San Francisco
🗓️ Submission deadline: June 23, 2026, 11:59 PM AoE
🌐 https://t.co/icxLzLZafa
🧑‍🏫 Speakers: @SuryaGanguli, @JikaiJin2002, @zhiyuanli_, @waterluffy, @valentina__py, @lschmidt3, @MohammadShoeybi, @andrewgwils.

[Photo from _hanlin_zhang_'s tweet]

Greg Durrett

@gregd_nlp

6 days ago

Check out William's work with @CosmicAI_Inst to help VLMs close the loop for model fitting & scientific discovery! Including a new dataset of some challenging model fitting problems in research astronomy, which we'll be digging into more!

William Rudman @WilliamRudmanjr

6 days ago

Hypothesis -> experiments -> analysis -> conclusions. LLMs are great at writing code and conducting experiments. But there’s a weakness in their ability to propose statistical models and evaluate their fit. Enter VESTA: Visual Exploration with Statistical Tool Agents.

gregd_nlp retweeted

Qiuyang Mang

@MangQiuyang

7 days ago

(1/n) New blog from UC Berkeley, UW, and Princeton: Who scales better in long horizon: AI coding agents or top coders? We compared modern agents to top human contestants in an open-ended coding marathon. Agents sprinted early. Then they plateaued. Top humans kept improving. We study this as a new test-time scaling problem: do agents learn better intrinsic test-time strategies, or are they mostly getting more random tries?

gregd_nlp retweeted

Isil Dillig @IsilDillig

8 days ago

1/7 If you’re at #PLDI2026 in Boulder this week, come see what our group has been up to! We’re presenting work on making code generation more interactive and reliable, speeding up data pipelines, porting network data-plane programs,

Greg Durrett

@gregd_nlp

7 days ago

@andrewgwils I tell my students to include \usepackage[a-1b]{pdfx} in all their papers. This saves at least 8% of the time it takes to upload to NSF PAR 🙃

Greg Durrett

@gregd_nlp

8 days ago

Check out Ramya's work on analyzing why and how LLM-generated stories feel homogeneous: the setting you prompt with might be novel but the plot unfolds in a very conventional way. Thread for how we quantified this & compare to existing metrics: 👇

Ramya Namuduri @ramya_namuduri

8 days ago

Are LLM-generated stories novel? They can have unique characters and cliché plots, or the other way around. A holistic score doesn’t help distinguish the two 😔. Meet GENIE 🧞 – a fine-grained novelty metric that tells you where and why a response is original!

gregd_nlp retweeted

CosmicAI @CosmicAI_Inst

8 days ago

Research highlight! CosmicAI Researchers Wenxuan Ding (NYU), @gregd_nlp (NYU) and external collaborator Nicholas Tomlin (NYU, TTIC) investigated whether LLM agents like Claude Code & OpenAI Codex can navigate cost-benefit tradeoffs in their actions. https://t.co/5wDVz4iNWA

gregd_nlp retweeted

Asher Zheng

@Asher_Zheng00

11 days ago

Spotting the rule from past experience is one thing; acting on it correctly is another. To find out, we introduce HERO's JOURNEY to test for the LLMs’ inductive reasoning ability in multi-step setups. We put an LLM into a text world as a hero🦸‍♀️: it must infer the pattern from past quest trajectories, then apply it to a foe it's never seen. We found models show signs of rule induction, but scratch the surface: sometimes they're just copying from context. Yet in multi-step execution settings, where humans naturally thrive, the cracks really start to show. 🧵

Greg Durrett

@gregd_nlp

9 days ago

I never heard anyone talk about the "noise floor" in an ML context until last year. Similar spike.

Max Spero

@max_spero_

9 days ago

My anecdotal evidence that LLMs love the term smoke test seems confirmed by search trends — searches for the term spike in 2026

gregd_nlp retweeted

Dwarkesh Patel

@dwarkesh_sp

16 days ago

In medieval times, within the arms race of ever more demonic torture devices, some sadistic genius came up with the idea of the Little Ease. This was a prison cell built so small in every dimension that a grown man could not stand upright in it nor lie down at full length nor properly sit. The pain is relentless and without relief and inflicted by one's own body. Prisoners were known to go insane within a few days. A stay at the Little Ease was considered even more cruel than the rack, the thumbscrew, and the other ghoulish machinery of the Tower of London.

A breeding pig will spend her whole life in a version of that box. These are social, roaming creatures (more intelligent than dogs) who will never leave this corset of steel. They have been selectively bred to be bigger than their frames can support. Yet we put them in cells so confined that they cannot comfortably sit, and their attempts to do so (for example, by sneaking their limbs into adjacent stalls) reliably lead to fractures and sprains. They cannot sweat, yet have nothing to roll around in to cool themselves off. Except their own manure, which (contrary to the common misconception) they are so averse to (thanks to their strong sense of smell) that new sows will often suffer from constipation to avoid soiling the space from which they eat and sleep.

Here is how the writer Matthew Scully described what he saw at one of Smithfield’s “gestation barns”:

> “Sores, tumors, ulcers, pus pockets, lesions, cysts, bruises, torn ears, swollen legs everywhere. Roaring, groaning, tail biting, fighting, and other “Vices,” as they’re called in the industry. Frenzied chewing on bars and chains, stereotypical “vacuum” chewing on nothing at all, stereotypical rooting and nest building with imaginary straw.
And “social defeat,” lots of it, in every third or fourth stall some completely broken being you know is alive only because she blinks and stares up at you … creatures beyond the power of pity to help or indifference to make more miserable, dead to the world except as heaps of flesh into which the [insemination] rod may be stuck once more and more flesh reproduced.”

The Save Our Bacon Act is trying to unroll the few state protections we have against this barbaric cruelty - for example California’s Prop 12 - which banned the sale of pork from pigs kept in gestation crates. It’s incredibly important we don’t end up with this sort of federal preemption. SOB will not only kill the most important animal welfare related laws in the US of the past decade, but more importantly, it will also restrict ALL future legislative progress (aka how the animal welfare movement has gotten its biggest wins).

The Senate is currently deciding whether to add the SOB Act to the Farm Bill. With relatively little money now, we can discourage the most pivotal senators in the Ag committee from backing this amendment.

Defeating this bill is even more important given the amount of philanthropic funding I expect to come online in the next year or two. It will plausibly be over 10x more expensive to repeal SOB than to prevent it from passing in the first place. All that money that could be spent transforming our society's relationship to mass animal suffering will instead have to be spent just getting us back to where we are right now.

That's why money spent now fighting this bill (and I mean right NOW) is so effective. If you’re in a position to donate six figures, please DM me.

gregd_nlp retweeted

Kai Xu @itskaixu

21 days ago

Image editing models can put you on the Moon, but can they precisely move a circle right by 50 pixels? 📐 Introducing 🎨PaintBench: a foundational eval of visual editing operations with only one right answer. The highest-performing model (@NanoBanana 2) reaches only 17.1%.

gregd_nlp retweeted

CLS

@ChengleiSi

18 days ago

We are bringing back the LLMs for Scientific Discovery workshop to @COLM_conf (in SF this year!!), submit your papers by 23 June! And we are looking for reviewers! If interested, email [email protected] or DM me here on X! CFP: https://t.co/MeUkXG6HTm Co-organizers: @shannonzshen @StevenyzZhang @HananeNMoussa @AkariAsai @yatskar @hhsun1 @Diyi_Yang

gregd_nlp retweeted

Gautam Kamath @thegautamkamath

26 days ago

In the last 48h:
- Jr researcher asked me whether to use AI in making talks
- Saw two talks, with AI {slop, enhanced} slides

Collected my thoughts and wrote a post. Tl;dr: don't steal your own thinking, don't remove *you* from your talks. Also, give a &#@% about your talks.

gregd_nlp retweeted

Pavel Izmailov

@Pavel_Izmailov

26 days ago

Very excited to release DiscoverPhysics, a new benchmark and evaluation pipeline for experimentation and discovery in LLMs. 🌐 https://t.co/p3uPtQBJ7G 📰 https://t.co/vUb0cdo6yw

gregd_nlp retweeted

Tokenization Workshop (TokShop) @COLM2026 @tokshop2025

about 1 month ago

Announcing First Call for Papers: Second Tokenization Workshop 🔡 📣
▶️ Non-archival submissions of two types: Research papers (up to 9 pages)
▶️ Extended abstracts (up to 2 pages)

Submission deadline June 23, 2026 (AoE)
Acceptance notification on July 24, 2026 (AoE)

gregd_nlp retweeted

Nicholas Tomlin @NickATomlin

27 days ago

New paper! LLM memory keeps improving, but this makes them *worse* as user sims. If we want to build models that can, e.g., simulate realistic students to train chatbots to be better teachers, then these models need to be able to forget like humans do 📄: https://t.co/1GpOfwcsat

gregd_nlp retweeted

Conference on Language Modeling @COLM_conf

27 days ago

COLM 2026 will host 16(!) workshops: https://t.co/Lf90oZTfiT CFPs are all online, and deadlines are coming up, so check the CFP of your workshops of interest

gregd_nlp retweeted

Conference on Language Modeling @COLM_conf

about 1 month ago

The discussion period for COLM 2026 is underway! We're sharing a CDF of average review scores. Note that final decisions will reflect deliberation by ACs and PCs, so these are only meant to be a heuristic guideline to give you a sense of where your papers stand. Good luck!

gregd_nlp retweeted

Hadas Orgad @OrgadHadas

about 1 month ago

Submit your work! The 2nd Workshop on 𝐀𝐜𝐭𝐢𝐨𝐧𝐚𝐛𝐥𝐞 𝐈𝐧𝐭𝐞𝐫𝐩𝐫𝐞𝐭𝐚𝐛𝐢𝐥𝐢𝐭𝐲 will be held at COLM 2026 in San Francisco! Submission Deadline: June 21, 2026 @ActInterp
