🧵 Claude-Opus-4.8 costs you too many tokens - but is this issue general across agents? Do agents know how much they'll spend?
Introducing Budget-Aware Agents (BAGEN): We study budget awareness across 4 envs & 5 frontier agents, and find structured failures in most of them.
👇
@awsaf49 This makes sense, but it is not the budget awareness defined in our paper. We want an agent that knows the cost of each action, which is orthogonal to "spend less" :)
Companies are starting to question whether soaring AI spending is delivering meaningful returns.
An AI consultant tells us a client recently spent half a billion dollars in a month after failing to put usage limits on Claude licenses for employees. https://t.co/JHJ9Ojt9Hs
Most real-world tasks run under a budget. Human agents know when to stop, ask for more, or change plans. But what about AI agents? Check out our new study on the budget awareness of AI agents👇
Budget-Aware Agents (BAGEN) studies the failure modes in budget estimation:
1. Strong agents are not strong budget estimators.
2. Frontier models are often overoptimistic.
3. Budget awareness is actionable and trainable. SFT plus RL strengthens early-stopping and alert behavior, saving 28-64% of tokens on failed trajectories.
4. Upper and lower bound calibration remains hard.
https://t.co/RIDpR6g8oP
Been struggling with this. My OpenClaw is supposed to choose which model to use for each task to limit unnecessary spend, but I get the feeling, as this paper suggests, that it does a poor job.