@TMoldwin Neat demo, in a similar vein to seeing the HMM state distribution in the residual stream of transformers trained on the HMM: https://t.co/qugpv78zKE
27x faster Attention Residuals!!! 🚀
We implemented Block AttnRes as a pip-installable package.
!pip install flash-attn-res
No annoying kernel nonsense.
No compile/autograd plumbing.
Call it like a regular PyTorch op.
It just works.
Methodology:
🔹 fused triton kernels
🔹 batched attention over residual blocks
🔹 online-softmax merge
🔹 flash attention-style split-KV reduction
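For anyone unfamiliar with the last two bullets, here is a minimal pure-Python sketch of an online-softmax merge over split KV chunks. It is not the package's Triton code; the helper names (`partial_attention`, `merge_partials`, `finalize`) are hypothetical and the usual 1/sqrt(d) scaling is left out for brevity. Each chunk returns its max score, its softmax normalizer, and an unnormalized output, and the merge rescales both sides to a shared max so the result matches attention over the full KV set.

```python
import math

def partial_attention(q, keys, values):
    """Attention of one query over one KV chunk.

    Returns (m, l, acc): the chunk's max score, the sum of exp(score - m),
    and the unnormalized weighted sum of the values.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, acc

def merge_partials(a, b):
    """Combine two chunk results as if softmax had run over both chunks."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    # Rescale each side from its own max to the shared max.
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    l = l_a * scale_a + l_b * scale_b
    acc = [x * scale_a + y * scale_b for x, y in zip(acc_a, acc_b)]
    return m, l, acc

def finalize(partial):
    """Divide the accumulated output by the softmax normalizer."""
    _, l, acc = partial
    return [x / l for x in acc]

# Toy data (made up for illustration).
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.5, 0.5], [-0.3, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full = finalize(partial_attention(q, keys, values))
split = finalize(merge_partials(
    partial_attention(q, keys[:2], values[:2]),
    partial_attention(q, keys[2:], values[2:]),
))
```

Because the merge is associative, the chunks can be processed in parallel and reduced in any order, which is the idea behind a split-KV reduction.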
Thanks @LLMenjoyer and @cartesia for the support and guidance✌️
Really cool work from my colleague @_GPaolo on open-ended multi-agent environments. He creates a resource-constrained grid world for AI agents where they can interact, search for resources, and leave persistent text artifacts for each other. Without direction, you can see the emergence of rules, division of labor, and even attempted governance! The code is up to try out yourself here: https://t.co/VybU2rqZqt
What happens when AI agents are left to live (and die) together in a shared world?
We’ve been exploring this at the @cognizant AI Lab, and the agents started forming something that looks like a society.
Cognizant AI Lab @cognizantailab is out with new work on gradient-free fine-tuning with Evolution Strategies (ES)! We expand our initial paper with larger models (7B) and math reasoning to demonstrate that ES works out of the box and is competitive with RL across broad domains, without the engineering overhead of gradient-based RL methods. https://t.co/cvXQaT2ndX https://t.co/xFB4WIwaiu
Inspired by the success of ES, we have also pushed ES research in three new directions. First, we put ES to use in a task that standard gradient-based RL can’t reach: successfully fine-tuning LLMs directly in quantized space with Quantized Evolution Strategies (QES). https://t.co/G2ygwCBlza https://t.co/PMveTLN7FS
Next, we looked at developing a theoretical intuition as to why we can succeed in fine-tuning multi-billion parameter models with population sizes as low as 30 in “Blessing of Dimensionality in LLM Fine-tuning” https://t.co/yVAcT3vNgT https://t.co/27CxXnGPJl
Lastly, we use ES to help teach models to know what they know by fine-tuning them on a metacognitive task. https://t.co/SGkkm4I4l9 https://t.co/5Ikq7Fo20C
We’ve just released a blog post describing the overall effort here: https://t.co/wHfg15t7vQ
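For readers new to ES, here is a minimal sketch of one common ES variant on a toy objective, not the setup from the papers. The function names, the toy reward, and the hyperparameters (`sigma`, `lr`) are illustrative assumptions; only the population size of 30 echoes the thread. The point it shows is that the update uses forward evaluations of the reward only, with no backpropagation.

```python
import random

def evolution_strategy_step(theta, reward_fn, rng, population=30, sigma=0.1, lr=0.05):
    """One ES update: perturb, evaluate, then move along reward-weighted noise."""
    noises, rewards = [], []
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        candidate = [t + sigma * e for t, e in zip(theta, eps)]
        noises.append(eps)
        rewards.append(reward_fn(candidate))
    # Normalize rewards so the step size does not depend on the reward scale.
    mean = sum(rewards) / population
    std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    # Reward-weighted average of the noise vectors; no gradients are computed.
    update = [
        sum(a * eps[i] for a, eps in zip(advantages, noises)) / population
        for i in range(len(theta))
    ]
    return [t + lr * u for t, u in zip(theta, update)]

def toy_reward(params):
    """Hypothetical objective: negative squared distance to a fixed target."""
    target = [1.0, -2.0, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
initial_reward = toy_reward(theta)
for _ in range(200):
    theta = evolution_strategy_step(theta, toy_reward, rng)
final_reward = toy_reward(theta)
```

In LLM fine-tuning, `theta` would be the model weights and `reward_fn` a task score, so each candidate costs only inference.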
Our lab was able to run 20-disk Towers of Hanoi (~1 million steps) on GPT-4.1 mini by simply observing per-step error rates and adding appropriate error checking. I think people should no longer be citing the Illusion of Thinking paper as evidence of a fundamental limitation of LLMs. https://t.co/CeFYGu9fXA
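To make the numbers concrete, here is a small sketch of the arithmetic and of what a per-step legality check can look like. It is not the lab's pipeline: the helpers `hanoi_moves` and `apply_checked` and the per-step error rate `p` are assumptions chosen for illustration. A 20-disk puzzle takes 2^20 - 1 = 1,048,575 moves, so even a tiny unchecked per-step error rate compounds to near-certain failure.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Yield the optimal move sequence as (from_peg, to_peg) pairs for n disks."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)
    yield (src, dst)
    yield from hanoi_moves(n - 1, aux, src, dst)

def apply_checked(pegs, move):
    """Apply one move, raising if it breaks the rules (a per-step check)."""
    src, dst = move
    if not pegs[src]:
        raise ValueError(f"peg {src} is empty")
    disk = pegs[src][-1]
    if pegs[dst] and pegs[dst][-1] < disk:
        raise ValueError(f"disk {disk} cannot sit on smaller disk {pegs[dst][-1]}")
    pegs[dst].append(pegs[src].pop())

# Small instance so the demo runs quickly; disks are numbered by size.
n_disks = 10
pegs = [list(range(n_disks, 0, -1)), [], []]
steps = 0
for move in hanoi_moves(n_disks):
    apply_checked(pegs, move)
    steps += 1

# Step count for the 20-disk case mentioned above.
total_steps_20 = 2 ** 20 - 1

# Assumed, illustrative per-step error rate with independent errors
# and no checking or retrying.
p = 1e-4
unchecked_success = (1 - p) ** total_steps_20
```

With a check like `apply_checked` in the loop, an invalid step can be caught and retried instead of silently corrupting every later step.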
We recently released a new version of our Evolution Strategies (ES) fine-tuning paper, with more benchmarks, baselines and discussions, strengthening the foundation for using ES as a propagation-free post-training paradigm. (arXiv: https://t.co/qd3hX7zlGB, alphaXiv: https://t.co/m84t6SHpBW)
We also released three intriguing follow-up works on this new direction:
(1) Quantized Evolution Strategies (QES) extends ES to post-training of quantized LLMs. With memory usage kept at the level of low-precision inference, QES achieves a high-precision optimization trajectory in quantized parameter space. (arXiv: https://t.co/jdgDk8NJ7A, alphaXiv: https://t.co/DOoHL4TdYJ)
(2) The "Blessing of Dimensionality" paper tries to explain why ES only needs a population size of ~30 to fine-tune billions of parameters. It discovers that larger models may have lower intrinsic dimensionality, which makes parameter-space search in ES easier. (arXiv: https://t.co/P8AJzBGkBI, alphaXiv: https://t.co/UoOLTaNR08)
(3) "Evolution Strategy for Metacognitive Alignment (ESMA)" uses ES to fine-tune LLMs to know what they know. That is, it uses the alignment between "whether the LLM answers a question correctly" and "whether the LLM knows it can answer that question correctly" as the fine-tuning objective, strengthening the metacognitive alignment of LLMs. (arXiv: https://t.co/5X1akZ0FNv, alphaXiv: https://t.co/Xb9cuWg2tf)
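One simple way to read the objective in (3) is as an agreement rate between two yes/no signals per question. The sketch below is an assumption for illustration, not the paper's actual reward; the function name and the sample data are hypothetical.

```python
def metacognitive_alignment(answered_correctly, claimed_to_know):
    """Fraction of questions where the self-assessment matches the outcome."""
    if len(answered_correctly) != len(claimed_to_know):
        raise ValueError("both lists must cover the same questions")
    if not answered_correctly:
        return 0.0
    matches = sum(a == c for a, c in zip(answered_correctly, claimed_to_know))
    return matches / len(answered_correctly)

# Hypothetical outcomes for four questions.
correct = [True, False, True, False]
claims = [True, True, True, False]
score = metacognitive_alignment(correct, claims)
```

A score like this needs no gradients, so it can be used directly as the fitness that ES maximizes.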
Looking forward to adding more to this ES ecosystem!