Neil Band

@neilbband

PhD student @StanfordAILab @StanfordNLP @Stanford advised by Tatsunori Hashimoto and Tengyu Ma. Prev: @OATML_Oxford @CompSciOxford

Stanford, CA

Joined September 2020

804 Following

1.2K Followers

158 Posts

Pinned Tweet

Neil Band @neilbband

about 2 years ago

When LLMs are unsure, they either hallucinate or abstain. Ideally, they should clearly express truthful confidence levels. Our #ICML2024 work designs an alignment objective to achieve this notion of linguistic calibration in *long-form generations*. https://t.co/8Ijac1jGxc 🧵

303

276

73K

neilbband retweeted

Michael Y. Li

@michaelyli_

3 days ago

You're wasting FLOPs when scaling inference compute: by independently sampling parallel attempts, you burn compute rediscovering the same solutions. Introducing QuasiMoTTo: we scale parallel sampling with correlated samples instead! These samples have higher coverage, are marginally exact draws from the LLM, and can be generated in parallel. Result: same performance with 25-47% fewer samples in test-time scaling + 50% fewer training steps in RL! In our new paper, we explore the design space of correlated samplers. Work with co-authors @probablynotaz9 (co-lead), @gandhikanishk, @noahdgoodman, and Emily Fox!

164

108

24K

neilbband retweeted

Tristan Thrush @TristanThrush

12 days ago

We open-sourced the code for this project! You can use it to make synthetic LLM training data for any downstream target. The code also gives you a minimal example for computing data-weight metagradients through LLM training + evaluation. Check it out: https://t.co/nUpf0hzWyY

13K

neilbband retweeted

Tim G. J. Rudner

@timrudner

26 days ago

What if diffusion models could think ahead instead of being greedy at every step?🤔 We introduce: Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

neilbband retweeted

Michael Y. Li

@michaelyli_

2 months ago

Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!

912

132

750

169K

neilbband retweeted

Rosinality @rosinality

3 months ago

A synthetic data generation method such that, when a model is trained on the generated data, it maximizes a certain differentiable objective. E.g., it is possible to make data that engraves a QR code in the weights of an LM head. (Or, more conventional things like translating documents to improve target language loss.)

304

386

60K

neilbband retweeted

Tristan Thrush @TristanThrush

3 months ago

New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!

243

164

61K

neilbband retweeted

Karan Dalal

@karansdalal

6 months ago

Our new paper, “End-to-End Test-Time Training for Long Context,” is a step towards continual learning in language models. We introduce a new method that blurs the boundary between training and inference. At test-time, our model continues learning from given context using the same next-token prediction objective as training. With this end-to-end objective, our model can efficiently compress substantial context into its weights and still use it effectively, unlocking extremely long context windows for complex reasoning and applications in agents and robotics. Paper: https://t.co/tqPYECjFpn Code: https://t.co/tADD7wYDAL

208

958

186K

Neil Band @neilbband

7 months ago

Tim's an amazing researcher and mentor, go work with him!

Tim G. J. Rudner

@timrudner

7 months ago

I'm so happy to share that I’ll be joining @UofT as an Assistant Professor of Statistical Sciences and Computer Science, with an appointment at the @VectorInst, in 2026! I'm recruiting postdocs and PhD students: https://t.co/FWBh0BiDqP! Please help me spread the word! 🧵(1/5)

372

120

40K

335

neilbband retweeted

Jon Saad-Falcon

@JonSaadFalcon

8 months ago

Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands? The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW): intelligence delivered (capabilities) per unit of power consumed (efficiency). Today’s Local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× in 2 years—driven by better models (3.2×) and better accelerators (1.7×). As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition. (1/N)

464

142

179

230K

neilbband retweeted

Suhas Kotha @kothasuhas

10 months ago

Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute

447

268

154K

neilbband retweeted

Kaiyue Wen

@wen_kaiyue

10 months ago

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!

446

264

184K

neilbband retweeted

Niklas Muennighoff @Muennighoff

10 months ago

Can AI solve open problems in math, physics, coding, medical sciences & beyond? We collected unsolved questions (UQ) & tested frontier LLMs. Some solutions passed expert validation…

486

131

183

86K

neilbband retweeted

CLS

@ChengleiSi

about 1 year ago

Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.

633

182

227

153K

neilbband retweeted

Jeff Dean

@JeffDean

about 1 year ago

Very cool thread about the CS336 Language Models from Scratch course at Stanford taught by @percyliang et al. Makes me wish I was a student again!

964

594

112K

neilbband retweeted

Jon Saad-Falcon

@JonSaadFalcon

about 1 year ago

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)

224

167

82K

neilbband retweeted

Percy Liang

@percyliang

about 1 year ago

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team @tatsu_hashimoto @marcelroed @neilbband @rckpudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

569

679K

neilbband retweeted

Ryan Marten

@ryanmart3n

about 1 year ago

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals. We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data scales. Full details are in our ✨new paper✨ - below we share the highlights: BTW, it also works on non-Qwen models😉 (1/N)

924

191

724

201K

neilbband retweeted

Simon Guo

@simonguozirui

about 1 year ago

Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi

Covering four assignments 📚 that teach you how to 🧑‍🍳 cook an LLM from scratch:
- Build and Train a Tokenizer 🔤
- Write Triton kernels for Attention ⚡️
- Construct Scaling Laws 📉
- Implement GRPO 🐙

644

532

68K

neilbband retweeted

Zitong Yang

@ZitongYang0

about 1 year ago

Synthetic Continued Pretraining (https://t.co/0epeIbxaLD) has been accepted as an Oral Presentation at #ICLR2025! We tackle the challenge of data-efficient language model pretraining: how to teach an LM the knowledge of small, niche corpora, such as the latest arXiv preprints.

11K

neilbband retweeted

Tatsunori Hashimoto @tatsu_hashimoto

about 1 year ago

I think CS336 has one of the best LLM problem sets of any AI/LM class thanks to our incredible TAs (@nelsonfliu,@GabrielPoesia,@marcelroed,@neilbband,@rckpudi). We're making it so you can do it all at home, and it's one of the best ways to learn LLMs deeply.

717

746

82K
