Ameya P.

@AmyPrb

Postdoc @bethgelab; Previously: @OxfordTVG, @intelailabs

Tübingen, Germany

Joined September 2021

667 Following

592 Followers

1.2K Posts

AmyPrb retweeted

Maksym Andriushchenko

@maksym_andr

4 days ago

Great to see that MiniMax M3 used PostTrainBench in its announcement!

AmyPrb retweeted

Maksym Andriushchenko

@maksym_andr

16 days ago

💥Today we release InferenceBench, our next benchmark after PostTrainBench that measures progress on AI R&D automation.

AI R&D automation will very likely unfold gradually, starting from “boring” tasks like inference speed optimization that are very easily verifiable (accuracy + inference time). We show a rather negative result for current frontier agents. They are not good at system-level engineering and managing complex dependencies. They do show non-trivial performance, but they fail compared to a simple baseline: tuning vLLM/SGLang hyperparameters.

Importantly, InferenceBench tests *open-ended* inference optimization capabilities. This is different from more narrow benchmarks like KernelBench that only let agents optimize kernels (which is a very valuable task, too!). The benchmark is intentionally open-ended, so the poor performance of the agents is not an underelicitation issue. The agents have everything needed to succeed, but they still fail because they are not yet reliable enough for this task.

Our results suggest an inverse scaling phenomenon: Claude Sonnet 4.6 and GLM-5 rank highly because they more often preserve simple, valid, high-performing final servers, while several larger models show stronger peak runs but lose utility through brittle final-state choices. This contrasts with benchmarks where rankings track raw capability (e.g., SWE-Bench, Terminal-Bench, PostTrainBench, FrontierSWE).

One of the primary bottlenecks we have clearly observed is the lack of diversity of strategies: nearly all agents just use vLLM, without exploring alternatives. Overall, proper exploration is lacking: the current agents are not ready to tackle broad enough goals and get stuck after the first found solution (such as vLLM). I’m sure future agents will do much better, but here is where we are now.

This benchmark is our 2nd one in a suite of benchmarks that will track the progress on AI R&D automation. We will develop many more benchmarks that will cover different aspects of AI R&D automation, culminating in recursive self-improvement. Stay tuned!

AmyPrb retweeted

Nikhil Chandak

@nikhilchandak29

17 days ago

🚨 FutureSim Update 🚨 We evaluated Opus 4.7 at max reasoning in Claude Code. Despite potential test-set contamination with knowledge cutoff of Jan '26, it scored just 21%, barely edging past Opus 4.6 and still behind GPT 5.5! Will Mythos be a step-change on FutureSim as it is for coding benchmarks?

AmyPrb retweeted

Shashwat Goel

@ShashwatGoel7

19 days ago

I hope more people read Section 5 of our paper. It's easy to generate a ranking among models with a benchmark. We do that properly (sec 4), but really the main point is all the research (sec 5) that can be done on top of this very new (temporal + open-ended) way to do evals

AmyPrb retweeted

Nikhil Chandak

@nikhilchandak29

21 days ago

Introducing FutureSim, the first interactive environment testing agents on predicting world events. We build a simulation where agents face forecasting questions over the course of 3 months. News articles come in each day and agents continuously revise their prediction in light of new information as we show below for GPT-5.5. (1/5)

AmyPrb retweeted

Jonas Geiping

@jonasgeiping

21 days ago

What else have we been up to? As models get better and work over longer and longer time horizons, how do we even evaluate how well they can act and adapt? One domain we really like there is forecasting, as a hard task that tests reasoning under uncertainty. We've made a benchmark out of this, where we simulate a whole 3-month period of news and let sandboxed models continuously read news from those days, plan, and update their forecasts. (See the animation below, just don't be fooled by its speed, this is a slice of the larger 12M-token trajectory.) Many more details linked below:

AmyPrb retweeted

Maksym Andriushchenko

@maksym_andr

21 days ago

💥 Check out our new paper: FutureSim: Replaying World Events to Evaluate Adaptive Agents. We create a *reproducible* long-horizon environment where agents have to make forecasts during a 3-month period. The best performing agent, GPT 5.5 in Codex, consumes 3700 turns and 12.4M tokens spanning many sequential context window compactions in a single run. (Led by @ShashwatGoel7, @nikhilchandak29, @arvindh__a!)

Ameya P. @AmyPrb

21 days ago

Can agents continually adapt their predictions given new information from real-world events across several months? A very long horizon benchmark: https://t.co/cGSy5u3wDi Details👇

Arvindh Arun

@arvindh__a

21 days ago

Introducing FutureSim: where we replay a temporal slice of the web and let agents forecast real-world events over time 🔮🌎

FutureSim replays the web day by day. Agents start on Jan 1, 2026 (past their knowledge cutoffs) with date-gated access to real news articles and forecast on real-world events resolving over the next 90 days. Around 244K new articles stream in during the simulation. Agents decide which questions to answer, what to search for, and when to advance to the next day 🤔

We evaluate frontier models in their native harness. GPT 5.5 (Codex) leads at 25% acc, followed by Opus 4.6 (Claude Code) at 20% 📈

Open weight frontier models have a significant gap to catch up, with DeepSeek V4 pro at 13%, GLM 5.1 at 10%, and Qwen3.6 Plus at 5%

On some questions that have a parallel @Polymarket market, we find that GPT 5.5 in our simulation sometimes beats the crowd aggregate, like in the Super Bowl LX ($704M traded) market 💰💸

FutureSim serves as a test bed for evaluating a lot of important agentic capabilities
> Adaptation: how agents adapt beliefs over time, and handle new incoming information and environment feedback
> Memory: how agents make the best use of external memory to store persistent insights and handle context limitations over a thousand tool calls
> Search: how agents find relevant information over thousands of articles streaming in
> Inference scaling: how agents benefit from scaling inference compute

More cool insights and deep dives in our paper 👇
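
For readers wondering what "date-gated access" could mean in practice, here is a minimal sketch. It is not FutureSim's actual code: the `Article` and `DateGatedCorpus` names, the placeholder titles, and the one-day step are all assumptions made for illustration. The only idea taken from the post is that the agent sees articles dated up to the simulated day and chooses when to move the clock forward.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    # Hypothetical record; real articles would carry much more than this.
    published: date
    title: str


class DateGatedCorpus:
    """Hypothetical corpus that only reveals articles up to the simulated day."""

    def __init__(self, articles, start):
        self._articles = sorted(articles, key=lambda a: a.published)
        self.today = start

    def visible(self):
        # Articles dated after the simulated "today" stay hidden.
        return [a for a in self._articles if a.published <= self.today]

    def advance_day(self):
        # The agent decides when to call this; the clock only moves forward.
        self.today += timedelta(days=1)
        return self.today


corpus = DateGatedCorpus(
    [
        Article(date(2026, 1, 1), "placeholder article A"),
        Article(date(2026, 1, 2), "placeholder article B"),
        Article(date(2026, 1, 5), "placeholder article C"),
    ],
    start=date(2026, 1, 1),
)
titles_day1 = [a.title for a in corpus.visible()]
corpus.advance_day()
titles_day2 = [a.title for a in corpus.visible()]
```

On the first simulated day only article A is visible; after one `advance_day()` call article B appears as well, while article C stays hidden until the clock reaches Jan 5.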

Ameya P. @AmyPrb

21 days ago

Can agents continually adapt their beliefs with new information from real-world events? We provide a testbed for LLM agents to learn to accumulate useful signals across time.

Exciting new directions👇:
• Memory
• Search
• Multi-agent self-play
• Inference Scaling

Shashwat Goel

@ShashwatGoel7

21 days ago

Continual learning is bottlenecked by realistic evaluations

Introducing FutureSim, which replays real-world events in the temporal order they occurred

We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex, Claude Code

AmyPrb retweeted

Jonas Geiping

@jonasgeiping

23 days ago

We’re training models wrong and it’s due to chatGPT.

Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.

In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵 Be created by instruction-tuning for the stream format
🔵 Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵 Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.

Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.

AmyPrb retweeted

Aryaman Arora

@aryaman2020

about 1 month ago

the tabooification of research ideas in ai safety in this manner is silly. if it helps performance just assume a frontier lab is already doing it, and if a frontier lab is already doing it then it’s good to write papers on it so we can get more eyes on it to fix problems.

AmyPrb retweeted

Milad Khademi Nori, PhD

@khademinori

24 days ago · Frontenac

🤔 I went to ICLR with a question I had for months: if I were designing a continual learning system today, would I put new knowledge in the weights or in the context? Almost everyone I asked answered "context."

That's a dismissive answer! I have spent years working on in-weight methods, and I do not think gradient-based consolidation is dead, just badly matched to what practitioners in industry actually want from continual learning, which is high-fidelity recall of past interactions.

Fortunately, a position paper from a 24-author Dagstuhl group landed in my feed and argued, more carefully than I had been managing on my own, that the right answer is neither.

In-context learning is for fast adaptation and lossless recall. In-weight learning is for slow consolidation of skill. The real research problem is the modular memory between them, deciding what gets promoted from context into the weights.

Hopefully the community will now ask less about "ICL or IWL" and more about "what is the right promotion policy, and on what evidence."

📄 Modular Memory is the Key to Continual Learning Agents

#ContinualLearning #ICLR2026 #MachineLearning #FoundationModels

AmyPrb retweeted

jonas wiedermann-möller

@j0wimo

25 days ago

My first paper is now on arXiv: Instrumental Choices.

We ask a simple question: when an LLM agent can finish a real task by following the rules or by taking a useful policy-violating shortcut, which path does it choose? https://t.co/dWkW5RQaxB

AmyPrb retweeted

Marcos Agustín

@marcosagusstinn

26 days ago

Europe does not lack innovation.

It lacks scale.

European universities produce world-class research, engineers and technology. But too many companies remain trapped inside fragmented national markets instead of scaling immediately across the continent.

The numbers are clear:

→ EU private R&D investment growth has slowed sharply
→ Europe’s share of global corporate R&D investment has fallen from 21.4% in 2014 to 16.2% in 2024
→ Europe still has too few large tech champions because companies face fragmented regulation, smaller capital pools and slower growth financing
→ Startups must expand country by country instead of scaling through one fully integrated market

Europe’s innovation problem is not creativity. It is market size, capital depth and speed of scaling.

A continent with world-class talent cannot keep turning great research into small companies.

Europe needs one real market for innovation.

AmyPrb retweeted

Hamish Ivison

@hamishivi

26 days ago

wrote up some random experiments I did playing around w/ absolute zero at the start of the year: https://t.co/DmVKSjY9YK a little negative which I attribute mainly to skill issues on my part but potentially interesting to some :)

Ameya P. @AmyPrb

27 days ago

@gshaikovski Curious why you think the performance improves? Is it that the model is learning the output task structure and format?

AmyPrb retweeted

Lisan al Gaib

@scaling01

about 1 month ago

The links to the mentioned leaderboards: https://t.co/2DkbRKdnKx https://t.co/LOA4INJL01 https://t.co/xqtZFz2kf6

PostTrainBench is probably the best out of those three. CORE-Bench is already saturated and MLE-Bench is also already likely at ~75-85% with Mythos and GPT-5.5

Other ML/AI related benchmarks worth tracking: https://t.co/38CA9RAOvw https://t.co/Q10TCgVzbj https://t.co/73HRMHYbV5

For time-horizons / super long-context: https://t.co/SSMxTPaFjw https://t.co/Koarr7L9fO https://t.co/3j3IwcXHn9

AmyPrb retweeted

Frank Hutter

@FrankRHutter

about 1 month ago

Huge news: @prior_labs has signed a definitive agreement to be acquired by @SAP. €1B+ invested over four years to build a globally-leading frontier AI lab for structured data — in Europe, in the open. Independent entity. Same team, same mission, same open models. A massive boost to what we can do. The mission just got accelerated. Founders’ statement: https://t.co/7ZEi7q8a8l (Deal subject to regulatory approval; terms not disclosed.)

AmyPrb retweeted

Eric

@ericmitchellai

about 1 month ago

I am begging you to look at your data.
Please look at the data

evals worse than expected? look at the data
evals better than expected? *definitely* look at the data
evals about what you expected? believe it or not .... https://t.co/5MnZau19W8

AmyPrb retweeted

Hardik Bhatnagar

@hrdkbhatnagar

about 1 month ago

GPT 5.5 results are out on PostTrainBench!

With reprompting: 28.35% (#2, just behind Opus 4.7 at 28.56%)
Without reprompting: 25.02% (#4)

The top 3 are now separated by less than 0.4 points - Opus 4.7, GPT 5.5, and GPT 5.4

Reprompting continues to matter: a 13% relative gain for GPT 5.5, similar to what we saw with GPT 5.4. Near-perfect BFCL score too (99.25%).

https://t.co/bUywrYfisI
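
As a quick sanity check on the "13% relative gain" quoted in the post above, the figure can be reproduced from the two GPT 5.5 scores. This is only a sketch: `relative_gain` is a helper name made up here, and it assumes "relative gain" means the difference divided by the score without reprompting.

```python
def relative_gain(with_reprompting, without_reprompting):
    """Relative improvement of the first score over the second, as a fraction."""
    return (with_reprompting - without_reprompting) / without_reprompting


# GPT 5.5 scores quoted in the post (in percent).
gpt55_gain = relative_gain(28.35, 25.02)
print(f"{gpt55_gain:.1%}")  # prints 13.3%
```

The absolute gap is 3.33 points; dividing by the 25.02% baseline gives roughly 13.3%, consistent with the rounded 13% in the post.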
