Roberto Dailey

@TMoldwin Neat demo, in a similar vein to seeing the HMM state distribution in the residual stream of transformers trained on the HMM: https://t.co/qugpv78zKE

590

Roberto Dailey

@RobertoDailey1

9 days ago

@ravfogel @shashwat_s19 @tallinzen Really cool analysis. Great read.

103

RobertoDailey1 retweeted

Will Bui

@will_ea

about 1 month ago

27x faster Attention Residuals!!! 🚀

We implemented Block AttnRes as a pip-installable package.

!pip install flash-attn-res

No annoying kernel nonsense.
No compile/autograd plumbing.
Call it like a regular PyTorch op.

It just works.

Methodology:
🔹 fused triton kernels
🔹 batched attention over residual blocks
🔹 online-softmax merge
🔹 flash attention-style split-KV reduction

Thanks @LLMenjoyer and @cartesia for the support and guidance✌️
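For readers unfamiliar with the "online-softmax merge" item in the methodology list, here is a minimal pure-Python sketch of the general identity behind it. It is not taken from the flash-attn-res package; the function names (`partial_attention`, `merge`, `finalize`) and the toy data are hypothetical and chosen only for illustration. Each chunk of keys is summarized by its max score, its softmax denominator, and an unnormalized output, and two summaries can be combined without revisiting the raw scores.

```python
import math

def partial_attention(scores, values):
    """Summarize one chunk as (max score, softmax denominator, unnormalized output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, denom, out

def merge(a, b):
    """Combine two chunk summaries by rescaling both to a shared max score."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    denom = l_a * scale_a + l_b * scale_b
    out = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, denom, out

def finalize(state):
    """Normalize the accumulated output by the accumulated denominator."""
    _, denom, out = state
    return [x / denom for x in out]

def reference_attention(scores, values):
    """Plain softmax-weighted average over all keys at once, for comparison."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

# Toy data (assumed): one query's scores against six keys, with 2-d values.
scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.25, 0.75], [3.0, -2.0]]

# Process the keys in three chunks, then merge the summaries.
chunks = [partial_attention(scores[i:i + 2], values[i:i + 2]) for i in (0, 2, 4)]
state = chunks[0]
for chunk in chunks[1:]:
    state = merge(state, chunk)
merged = finalize(state)
```

Because the merge is associative, the chunks can be reduced in any order, which is what makes a split-KV style reduction possible in principle.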

776

569

75K

Roberto Dailey

@RobertoDailey1

about 1 month ago

@avt_im Ah, makes sense that you could use state space methods with GPs to extend them. I’ll have to read up on that, thanks!

154

Roberto Dailey

@RobertoDailey1

about 1 month ago

Really cool video breakdown of our work and others' on evolution strategies for LLMs: https://t.co/ewn3guP0I3

110

Roberto Dailey

@RobertoDailey1

about 2 months ago

@besttrousers You gotta do a write-up of this on your Substack at some point

164

Roberto Dailey

@RobertoDailey1

about 2 months ago

@EconChrisClarke Congrats!!

Roberto Dailey

@RobertoDailey1

3 months ago

Really cool work from my colleague @_GPaolo on open-ended multi-agent environments. He creates a resource-constrained grid world for AI agents where they can interact, search for resources, and leave persistent text artifacts for each other. Without direction, you can see the emergence of rules, division of labor, and even attempted governance! The code is up to try out yourself here: https://t.co/VybU2rqZqt

Giuseppe Paolo @_GPaolo

3 months ago

What happens when AI agents are left to live (and die) together in a shared world? We’ve been exploring this at the @cognizant AI Lab — and they started forming something that looks like a society.

678

543

76K

365

Roberto Dailey

@RobertoDailey1

3 months ago

Cognizant AI Lab @cognizantailab is out with new work in gradient-free fine-tuning with Evolution Strategies (ES)! We expand our initial paper with larger models (7B) and math reasoning to demonstrate that ES works out of the box and is competitive with RL across broad domains, without the engineering overhead of gradient-based RL methods. https://t.co/cvXQaT2ndX https://t.co/xFB4WIwaiu

Inspired by the success of ES, we have also pushed ES research in three new directions.

First, we put ES to use in a task standard gradient-based RL can’t reach: successfully fine-tuning LLMs directly in quantized space with Quantized Evolution Strategies (QES). https://t.co/G2ygwCBlza https://t.co/PMveTLN7FS

Next, we looked at developing a theoretical intuition as to why we can succeed in fine-tuning multi-billion parameter models with population sizes as low as 30 in “Blessing of Dimensionality in LLM Fine-tuning”. https://t.co/yVAcT3vNgT https://t.co/27CxXnGPJl

Lastly, we use ES to help teach models to know what they know, using ES to fine-tune models in a metacognitive task. https://t.co/SGkkm4I4l9 https://t.co/5Ikq7Fo20C

We’ve just released a blog post describing the overall effort here: https://t.co/wHfg15t7vQ
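As background on what "gradient-free fine-tuning" with a small population means, here is a toy sketch of a textbook evolution-strategies update in pure Python. It is not the algorithm from the papers linked above: the quadratic objective, the `es_step` helper, and all hyperparameters except the population size of 30 (which the thread mentions) are assumptions made for illustration.

```python
import random

def loss(theta, target):
    """Toy objective (assumed): squared distance to a fixed target vector."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def es_step(theta, target, rng, population=30, sigma=0.1, lr=0.1):
    """One ES update: perturb, score, then move along reward-weighted noise."""
    noises, rewards = [], []
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        candidate = [t + sigma * e for t, e in zip(theta, eps)]
        noises.append(eps)
        rewards.append(-loss(candidate, target))  # higher reward is better
    # Z-score the rewards so the step size does not depend on the reward scale.
    mean = sum(rewards) / population
    std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5 or 1.0
    scores = [(r - mean) / std for r in rewards]
    # No gradients of the objective are used, only forward evaluations.
    return [
        t + lr * sum(s * eps[d] for s, eps in zip(scores, noises)) / population
        for d, t in enumerate(theta)
    ]

rng = random.Random(0)
target = [1.0, -2.0, 0.5, 3.0, -1.0]
theta = [0.0] * len(target)
initial_loss = loss(theta, target)
for _ in range(300):
    theta = es_step(theta, target, rng)
final_loss = loss(theta, target)
```

The only operation applied to the objective is evaluation, which is why this family of methods can in principle be applied where backpropagation is awkward, such as the quantized setting described above.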

Roberto Dailey

@RobertoDailey1

4 months ago

Our lab was able to run 20-disk Towers of Hanoi (~1 million steps) on GPT-4.1 mini by simply observing per-step error rates and adding appropriate error checking. I think people should no longer be citing the Illusion of Thinking paper as a fundamental limitation of LLMs. https://t.co/CeFYGu9fXA
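The "~1 million steps" figure follows from the optimal move count for n disks, 2^n - 1, which is 1,048,575 for n = 20. The sketch below generates the optimal move sequence and validates every move as it is applied. It is only an illustration of per-step checking on a smaller instance; the helper names are hypothetical and it does not reproduce the lab's actual setup or its LLM pipeline.

```python
def hanoi_moves(n, src=0, dst=2, aux=1):
    """Yield (from_peg, to_peg) moves of the optimal solution for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, aux, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, dst, src)

def apply_checked(pegs, move):
    """Apply one move, raising if it breaks the rules of the puzzle."""
    src, dst = move
    if not pegs[src]:
        raise ValueError(f"peg {src} is empty")
    disk = pegs[src][-1]
    if pegs[dst] and pegs[dst][-1] < disk:
        raise ValueError(f"cannot place disk {disk} on smaller disk {pegs[dst][-1]}")
    pegs[dst].append(pegs[src].pop())

def run(n):
    """Solve an n-disk puzzle with every step validated; return (steps, pegs)."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    steps = 0
    for move in hanoi_moves(n):
        apply_checked(pegs, move)
        steps += 1
    return steps, pegs

steps, pegs = run(10)
moves_for_20_disks = 2 ** 20 - 1
```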

RobertoDailey1 retweeted

Xin Qiu

@realVsonicV

4 months ago

We recently released a new version of our Evolution Strategies (ES) fine-tuning paper, with more benchmarks, baselines and discussions, strengthening the foundation for using ES as a propagation-free post-training paradigm. (arXiv: https://t.co/qd3hX7zlGB, alphaXiv: https://t.co/m84t6SHpBW)

We also released three intriguing follow-up works on this new direction:
(1) Quantized Evolution Strategies (QES) extends ES to post-training of quantized LLMs. With frugal memory usage at the low-precision inference level, QES achieves a high-precision optimization trajectory in quantized parameter space. (arXiv: https://t.co/jdgDk8NJ7A, alphaXiv: https://t.co/DOoHL4TdYJ)
(2) The "Blessing of Dimensionality" paper tries to explain why ES only needs a population size of ~30 to fine-tune billions of parameters. It discovers that larger models may have lower intrinsic dimensionality, which makes parameter-space search in ES easier. (arXiv: https://t.co/P8AJzBGkBI, alphaXiv: https://t.co/UoOLTaNR08)
(3) "Evolution Strategy for Metacognitive Alignment (ESMA)" uses ES to fine-tune LLMs to know what they know. That is, it uses the alignment between "whether the LLM answers a question correctly" and "whether the LLM knows it can answer that question correctly" as the fine-tuning objective, strengthening the metacognitive alignment of LLMs. (arXiv: https://t.co/5X1akZ0FNv, alphaXiv: https://t.co/Xb9cuWg2tf)

Looking forward to adding more to this ES ecosystem!

Roberto Dailey

@RobertoDailey1

4 months ago

@Jobamey This is so damn cool
