Akos Kadar

@kadarakos

Machine learning researcher and developer.

Berlin, Germany

Joined October 2012

480 Following

472 Followers

6.1K Posts

kadarakos retweeted

Matthieu wyart @MatthieuWyart

9 days ago

LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. https://t.co/r2uuX0lBCu

MatthieuWyart's tweet photo. LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. https://t.co/r2uuX0lBCu https://t.co/51canl7smG

225

145K

kadarakos retweeted

Tal Linzen

@tallinzen

13 days ago

As Nick says, we’re excited about the potential for leveraging cognitive science to improve user simulators, then use them to evaluate and train models that collaborate better with real humans. I'm hiring (including but not limited to postdocs), let me know if interested in this direction!

13K

kadarakos retweeted

Tiberiu Mușat

@Tiberiu_Musat_

13 days ago

Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions? In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data). The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights. https://t.co/eMZIGQDf2f

Tiberiu_Musat_'s tweet photo. Why does deep learning generalize? What does weight decay really do? Can algorithmic information theory address these questions?

In my latest preprint, I give a proof that the minimum neural weight norm matches the minimum program length (aka Kolmogorov Complexity), up to a logarithmic factor. In other words, the neural network with the smallest possible weight norm (that fits the data) must encode the shortest program (that fits the data).

The result only holds for fixed-precision neural nets: infinite precision nets can store infinite information with finite (small) weights.

https://t.co/eMZIGQDf2f

153

144K

kadarakos retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

14 days ago

Language Models Need Sleep "Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache." "increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

iScienceLuvr's tweet photo. Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

915

147

717

66K

Who to follow

Miryam de Lhoneux/ @mdlhx.bsky.social

@mdlhx

#NLProc assistant prof @CW_KULeuven, PI @lagom_nlp. I like syntax more than most people. Also multilingual NLP, interpretability, mountains and beer. (She/her)

Nipun Sadvilkar

@NipunSadvilkar

Talk to me about - Python | Artificial Intelligence | NLP | Open Source

visp

@wrwagox

AI alignment research

Akos Kadar @kadarakos

21 days ago

@karpathy You finally gave in you're fomo , wishing you luck!

kadarakos retweeted

Jenny Zhang

@jennyzhangzt

25 days ago

general Intelligence requires rethinking exploration https://t.co/QA4uKB6oDl

630

585

59K

kadarakos retweeted

Peter Pao-Huang @peterpaohuang

28 days ago

🧵 9/ This work couldn’t have been possible without my amazingly supportive advisors @StefanoErmon and @Xiaojie_Qiu. Check out our paper and code for more details: 📄https://t.co/Jq0G2ySc58 💻https://t.co/B4vL9oQ3DC Looking forward to your thoughts!

kadarakos retweeted

Dmitry Krotov

@DimaKrotov

about 1 month ago

I am fascinated by the parallels between diffusion models and associative memories. In our new paper we demonstrate that discrete language diffusion models form basins of attraction around test samples — examples never seen during training. As training data grows, basins around training samples shrink, while basins around unseen test samples expand. This is strikingly similar to what happens in Dense Associative Memory models above the critical memory load.

198

173

22K

kadarakos retweeted

David Clark @d_g_clark

about 1 month ago

Now in PRE: "Transient dynamics of associative memory models." I argue that the "blackout catastrophe" (the famous α≈0.14 transition) is not catastrophic when viewed from an out-of-equilibrium, dynamical perspective. Journal: https://t.co/F4QCHCWYMF PDF: https://t.co/X0eQyKAhTo

14K

kadarakos retweeted

Sander Dieleman

@sedielem

about 1 month ago

My first blog post in over a year is a deep dive on flow maps🗺️, or how to learn the integral of a diffusion model to enable faster sampling and several other cool tricks. It's the longest one yet👀 Let me know what you think! https://t.co/O8bBGZ9qjC

746

172

577

89K

kadarakos retweeted

Mathieu Blondel @mblondel_ml

about 1 month ago

Our paper "Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction" was accepted for publication at #ICML2026 https://t.co/DUTSj4Sz7u

181

137

14K

kadarakos retweeted

Jason Weston

@jaseweston

about 1 month ago

💎Autodata: an agentic data scientist to create high quality data✨ We introduce a method for building agents that create high-quality training & evaluation data. Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*. We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data. Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods. Overall, we believe this direction has the potential to change how we build AI data! Read more in the blog post: https://t.co/vjPvnTYfJx

jaseweston's tweet photo. 💎Autodata: an agentic data scientist to create high quality data✨

We introduce a method for building agents that create high-quality training & evaluation data.

Key idea: agentic data creation provides a way to *convert increased inference compute into higher quality model training*.

We show how to train (meta-optimize) such a data scientist agent, so that it can create even stronger data.

Our initial study with a specific practical implementation, Agentic Self-Instruct, shows strong gains on scientific reasoning problems compared to classical synthetic dataset creation methods.

Overall, we believe this direction has the potential to change how we build AI data!

Read more in the blog post: https://t.co/vjPvnTYfJx

619

105

689

43K

kadarakos retweeted

Sham Kakade

@ShamKakade6

about 1 month ago

1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width.

ShamKakade6's tweet photo. 1/8 Introducing Recurrent Transformer (RT). At 300M params, RT improves validation CE over standard Transformers. The best RT model is only 6 layers, but wider at 2048 — beating deeper 12- and 24-layer Transformers by trading depth for width. https://t.co/seJrdtszKJ

552

431

253K

kadarakos retweeted

Sander Dieleman

@sedielem

about 1 month ago

Along with Categorical Flow Maps and Flow Map Language Models, we now have three separate papers heralding the triumphant return of continuous methods for language diffusion😶‍🌫️ Can you tell I'm excited?🫨 https://t.co/SKS4OFtSG8 https://t.co/kJ3cuFsggd https://t.co/aXXU4bUSMT

313

247

32K

kadarakos retweeted

Michael Albergo

@msalbergo

about 1 month ago

New paper! Presenting Discrete Flow Maps: paper: https://t.co/f1RmZry2by blog: https://t.co/Cnwgf4moY0 A laughable problem for me these days is that @nmboffi and I share a research brain, and we have had, time and again, a conversation that ends with “ha so I guess we’re writing the same paper.” Soon we will return to just doing it together :). Here we are doing it again with discrete flow maps and flow language models! A complete and thorough paper led by @PPotaptchik @json_yim @adhisarav @peholderrieth. We took a bit of time to post it to ensure we understood a few more things about the stability of the loss functions. Like @osclsd , @FEijkelboom, and @nmboffi , we think this could be a very helpful paradigm for thinking about fast inference and even better alignment! Here’s our version of the story, and I hope it makes clear how green field this research direction is — we provide a comprehensive picture of the KL losses you can write from the properties of the flow map, some nice geometric proofs about the mean denoiser and the simplex, and find that at this time, the ESD can actually be the most performant, with some caveats. Excited for everyone to work together and push this class of models to their limit!

msalbergo's tweet photo. New paper! Presenting Discrete Flow Maps:

paper: https://t.co/f1RmZry2by
blog: https://t.co/Cnwgf4moY0

A laughable problem for me these days is that @nmboffi and I share a research brain, and we have had, time and again, a conversation that ends with “ha so I guess we’re writing the same paper.” Soon we will return to just doing it together :). Here we are doing it again with discrete flow maps and flow language models! A complete and thorough paper led by @PPotaptchik @json_yim @adhisarav @peholderrieth. We took a bit of time to post it to ensure we understood a few more things about the stability of the loss functions.

Like @osclsd , @FEijkelboom, and @nmboffi , we think this could be a very helpful paradigm for thinking about fast inference and even better alignment!

Here’s our version of the story, and I hope it makes clear how green field this research direction is — we provide a comprehensive picture of the KL losses you can write from the properties of the flow map, some nice geometric proofs about the mean denoiser and the simplex, and find that at this time, the ESD can actually be the most performant, with some caveats. Excited for everyone to work together and push this class of models to their limit!

387

265

75K

kadarakos retweeted

Murray Kang @haoqik322

about 2 months ago

Looped transformer and Energy-based Models (EBMs) are closely related: •EBM inference: x ← x − η∇E(x) •Looped transformer: x ← x + block(x) (residual form) These are literally the same algorithm if block(x) = −η∇E(x) for some scalar E. This equivalence isn’t automatic — a generic transformer block is not the gradient of any scalar (its Jacobian isn’t symmetric, and gradients must have symmetric Hessians). Dmitry provided a solution to this. I like the perspective to understand/build the looped transformer as an EBM inference process by gradient descent. The looping isn’t just reusing weights for efficiency — it’s the optimization procedure of the EBM itself.

364

336

46K

Akos Kadar @kadarakos

about 2 months ago

@tallinzen Isn't it in reverse? Slop bowls make normal dishes into slop. You can just go to a restaurant instead.

kadarakos retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

about 2 months ago

ELT: Elastic Looped Transformers for Visual Generation "We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality."

iScienceLuvr's tweet photo. ELT: Elastic Looped Transformers for Visual Generation

"We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality."

243

165

18K

kadarakos retweeted

Teng Yan

@tengyanAI

about 2 months ago

basically: anthropic sneakily turned down how hard claude thinks before editing code, changed the default from "high" to "medium" effort, and hid the reasoning from session logs. all without telling users. an amd director had 7k sessions of telemetry to prove the degradation was real and measurable (not just vibes). anthropic admitted to the changes. there's a workaround (use "/effort max"). the uncomfortable part is most users had no data to notice it happened at all.

180

620

kadarakos retweeted

Hanchen Li

@lihanc02

2 months ago

An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified? 🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony! We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks. Well if you really wanted you could get 100% accuracy without solving a single task. The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️

lihanc02's tweet photo. An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified?

🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony!

We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks.

Well if you really wanted you could get 100% accuracy without solving a single task.

The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️

170

274

961K

Akos Kadar

@kadarakos

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users